Projekt OCR-BW
Automatische Texterkennung von Handschriften
DOI:
https://doi.org/10.5282/o-bib/5885Keywords:
OCR, Text recognition, Manuscripts, Digital Humanities, Artificial intelligence (AI), DigitizationAbstract
After the digitization of historical documents, the next logical step is to enrich the digitized material with a searchable full text to further increase the accessibility of the texts and to enable new research questions. While many libraries already use various options for automatic text recognition of printed material, there is much higher reluctance to do so when it comes to manuscripts, since handwritten sources pose new challenges for automatic text recognition. With the help of machine learning, however, great progress has been made in the field of automatic handwritten text recognition in recent years, which libraries can not only use to make their own holdings more accessible, but also to establish themselves as a service partner for science. As part of the OCR-BW project (https://ocr-bw.bib.uni-mannheim.de/), since 2019 the transcription platforms Transkribus and, from 2021, eScriptorium have been systematically tested on selected corpora to generate automatic full texts for manuscripts. The results achieved during the project so far are very positive and show that automatic handwritten text recognition with a character error rate of less than 5 % is possible and can be expected. Full texts that have already been published have significantly increased the visibility and research interest in these materials. The project also aims to support science in the preparation and implementation of research projects. Examples ranging from medieval prayer books to large collections such as legal councils to expedition diaries of the 20th century will be used to show which results can be achieved with which resources.
References
Boenig, Matthias; Federbusch, Maria; Herrmann, Elisa; Neudecker, Clemens; Würzner, Kay-Michael: Ground Truth. Grundwahrheit oder Ad-Hoc-Lösung? Wo stehen die Digital Humanities?, in: DHd 2018. Kritik der digitalen Vernunft. Konferenzabstracts. Universität zu Köln, 26. Februar bis 2. März 2018, 2018, S. 219–223. Online: https://dhd2018.uni-koeln.de/wp-content/uploads/boa-DHd2018-web-ISBN.pdf, Stand: 08.09.2022.
Crusius, Martin: Diarium Martini Crusii, hrsg. von Wilhelm Goez, Ernst Conrad, Reinhold Stahlecker, Eugen Staiger unter Mitw. von Reinhold Rau und Hans Widmann, 4 Bde., Tübingen 1927–1961.
Deutsche Forschungsgemeinschaft: DFG-Praxisregeln „Digitalisierung“. DFG-Vordruck 12.151-12/16, 2016. Online: https://www.dfg.de/formulare/12_151/, Stand: 08.09.2022.
Heumann, Ina; Stoecker, Holger; Tamborini, Marco; Vennen, Mareike: Dinosaurierfragmente. Zur Geschichte der Tendaguru-Expedition und ihrer Objekte, 1906–2018, Göttingen 2018.
Hodel, Tobias; Schoch, David; Schneider, Christa; Purcell, Jake: General Models for Hand- written Text Recognition. Feasibility and State-of-the Art. German Kurrent as an Example, in: Journal of Open Humanities Data, 7 (13), 2021, S. 1–10. Online: https://doi.org/10.5334/johd.46, Stand: 08.09.2022.
Kiessling, Benjamin; Tissot, Robin; Stökl Ben Ezra, Daniel; Stokes, Peter: eScriptorium. An Open Source Platform for Historical Document Analysis, in: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 2019, S. 19–24. Online: https://doi.org/10.1109/ICDARW.2019.10032, Stand: 08.09.2022.
Maier, Gerhard: African dinosaurs unearthed. The Tendaguru expeditions. Bloomington, Ind. 2003 (Life of the Past).
Michael, Johannes; Weidemann, Max; Labahn, Roger: HTR Engine Based on NNs P3. Optimizing speed and performance - HTR+, READ-H2020 Project 674943, Deliverable D7.9, 2018. Online: https://readcoop.eu/wp-content/uploads/2018/12/Del_D7_9.pdf, Stand: 08.09.2022.
Muehlberger, Guenter; Seaward, Louise; Terras, Melissa; Ares Oliveira, Sofia; Bosch, Vicente; Bryan, Maximilian; Colutto, Sebastian; Déjean, Hervé; Diem, Markus; Fiel, Stefan; Gatos, Basilis; Greinoecker, Albert; Grüning, Tobias; Hackl, Guenter; Haukkovaara, Vili; Heyer, Gerhard; Hirvonen, Lauri; Hodel, Tobias; Jokinen, Matti; Kahle, Philip; Kallio, Mario; Kaplan, Frederic; Kleber, Florian; Labahn, Roger; Lang, Eva Maria; Laube, Sören; Leifert, Gundram; Louloudis, Georgios; McNicholl, Rory; Meunier, Jean-Luc; Michael, Johannes; Mühlbauer, Elena; Philipp, Nathanael; Pratikakis, Ioannis; Puigcerver Pérez, Joan; Putz, Hannelore; Retsinas, George; Romero, Verónica; Sablatnig, Robert; Sánchez, Joan Andreu; Schofield, Philip; Sfikas, Giorgos; Sieber, Christian; Stamatopoulos, Nikolaos; Strauß, Tobias; Terbul, Tamara; Toselli, Alejandro Héctor; Ulreich, Berthold; Villegas, Mauricio; Vidal, Enrique; Walcher, Johanna; Weidemann, Max; Wurster, Herbert; Zagoris, Konstantinos: Transforming scholarship in the archives through handwritten text recognition. Transkribus as a case study, in: Journal of Documentation, 75 (5), 2019, S. 954–976. Online: https://doi.org/10.1108/JD-07-2018-0114, Stand 29.07.2022.
Strauß, Tobias; Weidemann, Max; Labahn, Roger: Language Models. Improving transcriptions by external language resources, READ-H2020 Project 674943, Deliverable D7.12, 2018. Online: https://readcoop.eu/wp-content/uploads/2018/12/D7.12_LMs.pdf, Stand 08.09.2022.
Ströbel, Phillip; Clematide, Simon; Volk, Martin; Schwitter, Raphael; Hodel, Tobias; Schoch, David: Evaluation of HTR models without Ground Truth Material. Preprint 2022. Online: https://www.researchgate.net/publication/357927928_Evaluation_of_HTR_models_without_Ground_Truth_Material, Stand 08.09.2022
Downloads
Published
Issue
Section
License
Copyright (c) 2022 Dorothee Huff, Kristina Stöbener
This work is licensed under a Creative Commons Attribution 4.0 International License.