Historical document classification

Besides presenting our published papers at the ICDAR 2021 conference, we also took part in its competition on historical document classification. The competition consisted of three tasks: font/script classification, document localization, and dating. Our system took first place in all three tasks, winning the whole competition. A detailed description of our approach is available on arXiv, and the paper was also accepted to DAS 2022. While pre-processing the provided datasets, we also created splits into training and validation parts. These splits, together with a brief description, are publicly available.

Paper submitted to DAS 2022

At the end of May 2022, the 15th International Workshop on Document Analysis Systems (DAS) will take place in La Rochelle, France. We will participate in this international workshop with a paper describing the system we prepared for the ICDAR 2021 Competition on Historical Document Classification. In the paper, we also publish the dataset splits we created, which are necessary for a fair comparison with other systems.

Papers submitted to ICDAR 2021

In September 2021, the 16th International Conference on Document Analysis and Recognition (ICDAR) will take place in Lausanne, Switzerland. For the PERO project, we submitted three papers to this conference, and all of them were accepted. The first paper deals with text line detection using a neural network model called ParseNet. The second focuses on the ability to switch between different outputs of a neural network-based text recognizer using a Transcription-Style block. The last paper presents a strategy for effectively using large amounts of unannotated data from a target domain when training a text recognizer.

The EGO-DOK project

The Military Historical Institute in Prague (VHÚ) has launched the EGO-DOK project, the aim of which is to digitize historical documents. After documents are obtained and scanned from institutions or private individuals, the data are processed using tools developed within the PERO project. The results are then handed over to the owner of the document and are also published in the Digital Study Room of the Ministry of Defense of the Czech Republic, similarly to the already processed military diaries.

Military diary

One of the organizations that use the services of our PERO-OCR automatic handwritten transcription software is the Military Historical Institute in Prague (VHÚ). The result of this cooperation is a digitized military diary, which has already been imported into the Digital Study Room of the Ministry of Defense of the Czech Republic. "Můj Deňik", as this document is named, dates back to the First World War. The diary was processed using tools developed within the PERO project, and after being imported into the Digital Study Room, its content can be searched, and the content of individual pages can be downloaded.

Published models

In our GitHub repository pero-ocr, we have published two models for public use. The first model is designed to analyze the general layout of printed and handwritten pages. The second is designed for the recognition of European printed text, specialized for Czech newspapers. Both models and the configuration file are compatible with the "develop" branch of the pero-ocr GitHub repository.

Update of pero-ocr

During August, we updated our Python package pero-ocr to version 0.3. This package is based on our code available in the GitHub repository pero-ocr. In the latest version, we mainly added optimized decoding with a language model, fixed an error in ALTO XML output, and improved page layout analysis and line detection.

Best Poster Award at ICDAR 2019

At the end of September, the 15th International Conference on Document Analysis and Recognition (ICDAR) took place in Sydney, Australia. We attended this conference with a paper describing the creation of the B-MOD dataset, which we presented in a poster session. The poster received the Best Poster Award.

Help us determine the visual quality of documents

One of the areas we work on is determining the visual quality of a document. This is important, for example, for subsequently adjusting documents so that text recognition works as well as possible. Comparing two documents from this perspective algorithmically is very difficult, sometimes impossible. Therefore, we need to determine manually which document looks better. To simplify this comparison work, we created an annotation server. You will see two cut-outs from different documents, and the only task is to mark the better-looking one using the button above the image. We would be very happy if you could help us with the comparisons and thus improve our text recognition.

Help us improve handwritten text transcription

Systems for automatic handwritten text transcription need training examples from many writers. You have a chance to help us collect such examples and improve transcriptions of historic documents. You can download template pages from our web. We would appreciate it if you could print the pages, write the contained text in your own hand, and either send us the filled pages by mail or scan them and upload the resulting images using our web.

How to forge historic documents?

During our work on image quality enhancement, we found out that our method can easily change the text content of images. Have you ever wanted to change history? Now you can. The method is based on Generative Adversarial Networks (GANs) and takes a corrupted image and a text string as input. The text string is normally produced by our text recognition and language models, but it can be corrected manually. Normally, this approach would be used to fix barely readable parts of scanned documents. We hope the tool will be used responsibly when released.

Document textline extraction

This post briefly introduces the textline extraction pipeline used in the project. Before OCR can be applied to a document page, textlines need to be efficiently detected and extracted.

To fully characterize a textline, two pieces of information are required. First, baseline coordinates specify the location of the textline. Second, the heights of the font ascenders and descenders specify the area that needs to be cropped to contain all the information without overshooting into neighbouring textlines. Knowing these parameters, a full textline bounding box can be easily computed. This is especially suitable when working with the standard PAGE XML format.

To efficiently predict these values for an input document page, we employ a convolutional neural network with an encoder-decoder architecture. The network analyses the page image at three different resolutions to maintain robustness across different page layouts and font sizes. At each resolution step, the outputs from the previous steps are concatenated to the current input, as seen in the picture. On the output layer, three values are predicted for each pixel: the probability of the pixel containing a baseline, the estimated ascender height of the font at this pixel and, analogously, the estimated descender height.

These maps are then processed using simple thresholding and connected component analysis on the baseline probability channel. The font height values for each detected line are computed as the median of the height predictions over the corresponding baseline pixels.

In some cases, when the page was bent before being scanned or when a handwritten textline is crooked, simple cropping of the bounding box produces images that contain more than one textline and are therefore unsuitable for OCR methods. In such cases, textline unfolding is performed as an additional post-processing step, which elastically transforms the image along the normals of the detected baseline.

Thus, even sub-optimally scanned document pages with challenging layouts and a variety of fonts and headings can be fully automatically parsed into cropped textlines.
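The post-processing stage described above can be sketched in a few lines of code: threshold the baseline probability map, label connected components, and take the median of the height predictions over each component's pixels. The sketch below is our illustration, not the project's actual implementation; the names are hypothetical, and the pure-Python labelling stands in for an optimized routine such as scipy.ndimage.label.

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """4-connected component labelling via breadth-first search.
    A simple stand-in for an optimized routine such as scipy.ndimage.label."""
    labels = np.zeros(mask.shape, dtype=int)
    h, w = mask.shape
    n = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and labels[y, x] == 0:
                n += 1
                labels[y, x] = n
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = n
                            queue.append((ny, nx))
    return labels, n

def extract_textlines(baseline_prob, ascender_map, descender_map, threshold=0.5):
    """Turn the three per-pixel network outputs into detected textlines:
    threshold the baseline probability channel, label connected components,
    and compute each line's font heights as the median of the predictions
    over that component's baseline pixels."""
    labels, n = connected_components(baseline_prob > threshold)
    lines = []
    for comp in range(1, n + 1):
        rows, cols = np.nonzero(labels == comp)
        lines.append({
            "baseline_y": float(np.median(rows)),
            "x_range": (int(cols.min()), int(cols.max())),
            "ascender": float(np.median(ascender_map[rows, cols])),
            "descender": float(np.median(descender_map[rows, cols])),
        })
    return lines

# Two synthetic horizontal baselines with constant height predictions.
prob = np.zeros((20, 40))
prob[5, 2:30] = 0.9
prob[14, 5:35] = 0.8
asc = np.full((20, 40), 6.0)   # predicted ascender height, in pixels
desc = np.full((20, 40), 2.0)  # predicted descender height, in pixels
lines = extract_textlines(prob, asc, desc)
```

From these values, a line's bounding box follows directly: the crop spans x_range horizontally and baseline_y - ascender to baseline_y + descender vertically.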

Active learning for OCR at student conference Excel@FIT

Jan Kohút published a research paper on active learning for historic OCR and gave an oral presentation at the student conference Excel@FIT 2019. The goal of the paper was to tune neural networks combining convolutional and recurrent layers to provide high-quality automatic transcriptions of historic text lines. These networks were then used to explore how they can be adapted to new documents while minimizing the need for manual transcription. Jan Kohút prepared a large dataset of historical documents gathered and transcribed in the IMPACT project. He extracted lines using our text baseline detection tool and automatically aligned the existing transcripts with the detected baselines. The resulting dataset contains 1.2 million text lines with transcripts and spans nine European languages and ten fonts and alphabets. We achieved a 0.6% character error rate on this challenging dataset, and we optimized possible strategies for manual error correction and OCR model adaptation when processing documents with novel fonts and scripts.
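The character error rate quoted above is the total character-level edit distance between the system transcripts and the reference transcripts, divided by the total number of reference characters. A minimal self-contained sketch (the function names are ours for illustration, not from the paper):

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)
    via the classic dynamic-programming recurrence, keeping only one row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

def character_error_rate(references, hypotheses):
    """Total edit distance over total reference length, as a fraction."""
    edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return edits / sum(len(r) for r in references)

# One substituted character in an 18-character reference line.
cer = character_error_rate(["historic text line"], ["histonic text line"])
```

A CER of 0.006 (0.6%) thus corresponds to roughly six wrong characters per thousand reference characters.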