Evaluation von Volltextdaten mit Open-Source-Komponenten (Evaluation of Full-Text Data with Open-Source Components)
DOI:
https://doi.org/10.5282/o-bib/5888
Keywords:
Optical character recognition, Full text, Evaluation, Historical newspapers, Newspaper, Digitization
Abstract
Several fully-fledged open source systems are available today for full text recognition. Established open source tools from the fields of Data Science (DS), Information Retrieval (IR) and Natural Language Processing (NLP) can also be used to evaluate the recognition results. After a brief discussion of common evaluation procedures and metrics, this article illustrates the use of such tools with the DFG-funded project "Digitisation of historical German newspapers I" (Digitalisierung Historischer Deutscher Zeitungen I) at the University and State Library Saxony-Anhalt.
License
Copyright (c) 2022 Uwe Hartwig
This work is licensed under a Creative Commons Attribution 4.0 International License.