Building Infrastructure for Data-Driven Research

Dr. Philipp Zumstein

Mannheim University Library

2017-03-15 Social Science Data Lab, Mannheim

Slides are Open Access, reuse them as

(this does not necessarily cover all the pictures; see individual attributions)

Overview

  • Data-driven Research
  • Building Infrastructure
  • OCR Workflow
  • OCR Software
  • Applications

Data-driven Research

online data

collected data

other data

Copyright User (2013-06): Text and Data Mining (Original Illustration by Davide Bonazzi) http://copyrightuser.org/topics/text-and-data-mining/

§§

?

time

methods

What do you do with images containing text or printed books/newspapers as input?

Digitization, OCR, Structuring

(infrastructure for research)


Building Infrastructure

Science Support from Library  

Infrastructure for Scanning

A1-Scanner for newspapers etc.

V-Scanner for rare, old, fragile books

Digitization, "Data-ization"

Our Digitization Infrastructure:

  • V-scanner
  • A1-scanner
  • A2-scanner
  • A3-scanner                            
  • conservation checks and fixes                              

Our Expertise:

  • scanning workflow
  • (manual) double-keying methods
  • automatic text recognition (OCR)
  • digitizing microfiche, microfilm
  • extracting information from CDs to a database
  • structuring information
  • metadata formats

Infrastructure Projects

Ancien Droit: digitizing 800 books from the 17th/18th century from the Desbillons collection, with a focus on the history of the "Ancien Droit" https://digi.bib.uni-mannheim.de/

Aktienführer I+II: digitizing the annually published "Aktienführer" books and extracting the data into a database https://digi.bib.uni-mannheim.de/aktienfuehrer/

Reichsanzeiger: German newspaper (government gazette) from 1819 to 1945 https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/

LOC-DB: open, distributed infrastructure for cataloguing of citations https://locdb.bib.uni-mannheim.de/

Infolis I+II: connect research data and publications, text mining scientific articles, integration into different retrieval systems http://infolis.github.io/

OCR Workflow

Workflow of the OCR Process

Baierer, Konstantin; Zumstein, Philipp (2016). Verbesserung der OCR in digitalen Sammlungen von Bibliotheken. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture, v. 4, n. 2, p. 72-83. https://doi.org/10.12685/027.7-4-2-155

Image Processing

deskew

dewarp

binarize

denoise

despeckle
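As an illustration of the "binarize" step, here is a minimal sketch in plain Python. It uses Otsu's method to pick the threshold, which is one common choice and an assumption here; the slides do not say which algorithm the OCR tools use. The function names and the list-of-rows image representation are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their grey levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance: large when the two groups are well separated.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold=None):
    """image: list of rows of grey values 0-255; returns rows of 0 (ink) / 1 (paper)."""
    flat = [p for row in image for p in row]
    if threshold is None:
        threshold = otsu_threshold(flat)
    return [[0 if p <= threshold else 1 for p in row] for row in image]
```

A real scan would first be loaded with an imaging library; the sketch only shows the thresholding idea on raw grey values.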

Layout Analysis

  • text vs. image classification
  • header, footer, headings
  • multi-columns, reading order
  • line recognition

Text Recognition

a) character-based recognition

"ē"  : 88%
"é"  : 85%
"e"  : 73%
"c"  : 71%
...

b) line-based recognition (LSTM)

"mit Weglassung solcher Verse"

Computational Linguistics Methods

  • dictionary
  • bigrams, trigrams, etc. for letters and words

non-words = possible errors

words from the dictionary = possible corrections
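The dictionary idea can be sketched in a few lines of Python: tokens that are not in the dictionary are flagged as possible errors, and similar dictionary words are offered as possible corrections. This is only an illustration using the standard library's `difflib`; the function name, the cutoff value and the tiny word list are assumptions, not what any OCR engine does internally.

```python
import difflib
import re


def suggest_corrections(text, dictionary, max_suggestions=3, cutoff=0.7):
    """Map each non-word in text to a list of similar dictionary words."""
    known = {w.lower() for w in dictionary}
    candidates = sorted(known)
    suggestions = {}
    # [^\W\d_]+ matches runs of letters, including umlauts and accents.
    for token in re.findall(r"[^\W\d_]+", text):
        word = token.lower()
        if word in known or token in suggestions:
            continue  # dictionary word, or already handled
        suggestions[token] = difflib.get_close_matches(
            word, candidates, n=max_suggestions, cutoff=cutoff
        )
    return suggestions


# Hypothetical OCR output with a long-s style confusion ("ff" instead of "ss").
words = ["mit", "weglassung", "solcher", "verse"]
print(suggest_corrections("mit Weglaffung solcher Verse", words))
```

Note that a real word absent from the dictionary is flagged too, which is why errors in dictionaries matter.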

Recognition Errors

  • OCR results have errors
  • errors can occur in each step
    • scanning errors
    • segmentation/layout errors
    • recognition errors
    • errors in dictionaries
    • untrained characters

Advice: Judge the errors with regard to your application (fuzzy search, topic modeling, extracting exact numbers)
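To judge errors you need some way to measure them. One widely used measure is the character error rate: the edit distance between the OCR output and a manually corrected transcription, divided by the length of that transcription. The slides do not prescribe a metric, so the following is only a sketch under that assumption, with hypothetical function names.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

An error rate that is acceptable for fuzzy search can still be too high when exact numbers have to be extracted.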

OCR Software

Commercial OCR Software

ABBYY FineReader

e.g. FineReader Engine 11 CLI for Linux (on one server/PC): 120,000 pages / year for 999 EUR

Open Source OCR Software

Tesseract

  • started in 1985 at HP Labs
  • open source since 2006
  • supported by Google

OCRopus

  • started in 2007
  • founded and maintained by Prof. Breuel (DFKI, Google, Nvidia)

etc.


ABBYY FineReader

abbyyocr11 -rl German \
	-if input.jpg \
	-f PDF -of output.pdf

  • Normally good results
  • Closed source, limited options to change behaviour
  • Strong emphasis on language-dependent dictionaries

Tesseract

tesseract input.jpg output \
	-l eng+deu \
	--oem 1 --psm 7 \
	hocr

OCRopus

  • neural network algorithm since 2013
  • training is key feature
  • different models for scripts (not languages)
  • no dictionary
  • modular scripts (Unix philosophy)

./ocropus-nlbin tests/ersch.png

./ocropus-gpageseg ersch/*.bin.png

./ocropus-rpred ersch/*/*.bin.png \
	-m models/fraktur.pyrnn.gz

./ocropus-hocr ersch/*.bin.png
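For batch processing, the Tesseract call shown above can be assembled from a script. The sketch below only rebuilds that same command line (`-l`, `--oem`, `--psm` and the `hocr` config) and hands it to `subprocess`; the function names and defaults are hypothetical, and it assumes `tesseract` is installed and on the PATH.

```python
import subprocess


def tesseract_command(image, output_base, languages=("eng", "deu"),
                      oem=1, psm=7, output_format="hocr"):
    """Build the argument list for the Tesseract call from the slide."""
    return [
        "tesseract", image, output_base,
        "-l", "+".join(languages),  # e.g. eng+deu
        "--oem", str(oem),          # OCR engine mode
        "--psm", str(psm),          # page segmentation mode
        output_format,              # config name, e.g. hocr
    ]


def run_tesseract(image, output_base, **options):
    """Run Tesseract; raises CalledProcessError if it exits with an error."""
    subprocess.run(tesseract_command(image, output_base, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.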

OCR File Formats

  • recognized text
  • position of the words, lines, characters (bounding boxes)
  • confidence values
  • text direction, recognized language, formats, ...

Other OCR formats: ALTO, PAGE XML, ABBYY XML, TEI, GCV

e.g. hOCR file:

<p class='ocr_par' lang='deu' title="bbox930">
...
  <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
	<span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span> 
	<span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span> 
	<span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span> 
	<span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span> 
	<span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span> 
	<span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span> 
	<span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span> 
	<span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span> 
	<span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span> 
  </span>
  ...
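Because hOCR is plain HTML, the words, bounding boxes and confidence values can be read back with the Python standard library. The sketch below assumes the flat structure shown above (`ocrx_word` spans that contain only text, with `bbox` and `x_wconf` in the `title` attribute); class and function names are hypothetical. Words whose `title` lacks a complete bounding box get `None` for it.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._title = None  # title attribute of the word currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._title = attrs.get("title") or ""
            self._text = []

    def handle_data(self, data):
        if self._title is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._title is not None:
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", self._title)
            conf = re.search(r"x_wconf (\d+)", self._title)
            self.words.append({
                "text": "".join(self._text).strip(),
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "confidence": int(conf.group(1)) if conf else None,
            })
            self._title = None


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

With the word list in hand, it is easy to, for example, list all words below a chosen confidence for manual review.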

Applications

Ngram Viewer (Google Books)

Number of Women on the Supervisory Boards of DAX-30 Companies, 1979-1999

  1. Go to the "Aktienführer Datenarchiv" and there to "Export"
  2. Increase number of results to 50, search for "DAX", click on select all visible (38 results)
  3. Adjust the year range
  4. Select the category "Supervisory Board"
  5. Export the CSV data
  6. Open in Excel, mark the female names
  7. Finally make a pivot table
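Steps 6 and 7 can also be done in a script instead of Excel. The sketch below counts marked rows per year, which is what the pivot table does. The column names (`year`, `company`, `name`, `female`) and the sample rows are made up for illustration; the real export of the Aktienführer Datenarchiv may use different headers, and the marking of female names is still a manual step.

```python
import csv
import io
from collections import Counter


def count_by_year(rows, year_field="year", flag_field="female"):
    """Count rows whose flag column is marked, grouped by year."""
    counts = Counter()
    for row in rows:
        if row[flag_field].strip().lower() in {"1", "x", "yes", "true"}:
            counts[row[year_field]] += 1
    return dict(sorted(counts.items()))


# Placeholder data, NOT taken from the real archive.
SAMPLE_CSV = """\
year,company,name,female
1979,Company X,Person A,
1979,Company X,Person B,x
1980,Company X,Person B,x
1980,Company Y,Person C,x
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
print(count_by_year(rows))
```

For the exported file, replace the `io.StringIO` object with the opened CSV file.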

Number and age of German voters in the 1989 European election

  1. Go to digizeitschriften.de and then to the Statistisches Jahrbuch für die Bundesrepublik Deutschland 1990
  2. Download the PDF of the chapter "Wahlen" starting from page 76
  3. Open the PDF in PDF-XChange Viewer, run OCR and save it (or use one of the alternatives you heard about before)
  4. Download Tabula http://tabula.technology/, install it and run it
  5. Open the PDF in Tabula, select the table and extract the data as CSV

(*) The quality here is not yet optimal, but it shows what the tools and data around OCR make possible.

Number of German Emigrants from 1870 until 1880

  1. Go to the Reichsanzeiger
  2. Search for "Auswanderer"
  3. Be lucky
  4. Go to the result

Discussion, Questions?

OCRopus run-test executes nlbin, gpageseg, rpred
