Best Poster Award at ICDAR 2019

At the end of September, the 15th International Conference on Document Analysis and Recognition (ICDAR) took place in Sydney, Australia. We attended the conference with a paper describing the creation of the B-MOD dataset, which we presented in a poster session. The poster received the Best Poster Award.

Help us determine the visual quality of documents

One of the areas we are dealing with is determining the visual quality of a document. This is important, for example, for subsequent adjustment of documents so that text recognition works as well as possible. Comparing two documents in this respect using algorithms is very difficult, sometimes impossible, so we need to find out manually which document looks better. To simplify this comparison work, we created an annotation server. You will see two cut-outs from different documents, and the only task is to mark the better-looking one using the button above the image. We would be very happy if you could help us with the comparison and thus improve text recognition.
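For illustration, the collected pairwise judgements could later be turned into per-document quality scores, for example with a simple Bradley-Terry model. The sketch below is only an assumption of how such aggregation might look; the function and the data are made up and are not part of our annotation server.

```python
import numpy as np

def bradley_terry_scores(wins, n_iters=100, eps=1e-9):
    """Estimate quality scores from a pairwise win-count matrix.

    wins[i, j] = number of times document i was preferred over document j.
    Returns one score per document; a higher score means the document was
    judged as better-looking more often. Uses the standard MM update for
    the Bradley-Terry model.
    """
    n = wins.shape[0]
    scores = np.ones(n)
    for _ in range(n_iters):
        total_wins = wins.sum(axis=1)                # W_i
        pair_counts = wins + wins.T                  # n_ij
        denom = (pair_counts / (scores[:, None] + scores[None, :] + eps)).sum(axis=1)
        scores = total_wins / (denom + eps)
        scores /= scores.sum()                       # normalize for identifiability
    return scores

# Example: 3 document crops, where crop 0 usually wins against crops 1 and 2.
wins = np.array([[0, 5, 4],
                 [1, 0, 3],
                 [2, 1, 0]], dtype=float)
print(bradley_terry_scores(wins))
```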

Help us improve handwritten text transcription

Systems for automatic handwritten text transcription need training examples from many writers. You have a chance to help us collect such examples and improve transcriptions of historic documents. You can download template pages from our website. We would appreciate it if you could print the pages, copy the contained text in your own handwriting, and either send us the filled-in pages by mail or scan them and upload the resulting images on our website.

How to forge historic documents?

During our work on image quality enhancement, we found that our method can easily change the text content in images. Have you ever wanted to change history? Now you can. The method is based on Generative Adversarial Networks (GANs) and takes a corrupted image and a text string as input. The text string is normally produced by our text recognition and language models, but it can also be corrected manually. Normally, this approach would be used to fix barely readable parts of scanned documents. We hope the tool will be used responsibly when released.
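To make the idea more concrete, here is a minimal PyTorch sketch of one possible text-conditioned generator: the corrupted line image is encoded with convolutions, the target text string with a recurrent layer, and the two codes are combined before decoding back to an image. The layer sizes and the conditioning scheme are our assumptions for illustration, not the actual architecture used in the project.

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Illustrative generator: repairs a corrupted textline image conditioned on a
    target transcription. Sizes and conditioning are assumptions, not the
    project's real architecture."""

    def __init__(self, vocab_size=100, text_dim=64, img_channels=1):
        super().__init__()
        # Encode the target text (e.g. OCR + language model output, or a manual fix).
        self.char_embedding = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True)
        # Encode the corrupted image.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(img_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decode back to an image after concatenating the text code at every position.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64 + text_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, img_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, corrupted_img, text_ids):
        feat = self.img_encoder(corrupted_img)                 # (B, 64, H/4, W/4)
        _, text_code = self.text_encoder(self.char_embedding(text_ids))
        text_code = text_code[-1]                              # (B, text_dim)
        text_map = text_code[:, :, None, None].expand(
            -1, -1, feat.shape[2], feat.shape[3])              # broadcast over image positions
        return self.decoder(torch.cat([feat, text_map], dim=1))

# Usage: a 32x256 grayscale line crop and a (padded) character-id sequence.
gen = TextConditionedGenerator()
fake = gen(torch.randn(2, 1, 32, 256), torch.randint(0, 100, (2, 20)))
print(fake.shape)  # torch.Size([2, 1, 32, 256])
```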

Document textline extraction

This post briefly introduces the textline extraction pipeline used in the project. Before OCR can be applied to a document page, textlines need to be efficiently detected and extracted.

To fully characterize a textline, two pieces of information are required. First, baseline coordinates specify the location of the textline. Second, the heights of the font ascenders and descenders further specify the area that needs to be cropped to contain all the information without overshooting into neighbouring textlines. Knowing these parameters, a full textline bounding box can be easily computed, which is especially convenient when working with the standard PAGE XML format.

To predict these values efficiently for an input document page, we employ a convolutional neural network with an encoder-decoder architecture. The network analyses the page image at three different resolutions to remain robust across different page layouts and font sizes. At each resolution step, the outputs from the previous steps are concatenated to the current input, as seen in the picture. On the output layer of the network, three values are predicted for each pixel: the probability of the pixel containing a baseline, the estimated ascender height of the font at this pixel and, analogously, the estimated descender height.

These maps are then processed using simple thresholding and connected component analysis on the baseline probability channel. The font height values for each detected line are computed as the median of the height predictions on the corresponding baseline pixels (a code sketch of this step follows below).

In some cases, when the page was bent before being scanned or when a handwritten textline is crooked, simply cropping the bounding box produces images that contain more than one textline and are therefore unsuitable for OCR methods. In such cases, textline unfolding is performed as an additional post-processing step, which elastically transforms the image using normals of the detected baseline. Thus, even sub-optimally scanned document pages with challenging layouts and a variety of fonts and headings can be fully automatically parsed into cropped textlines.
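The following is a minimal sketch of the post-processing step described above: thresholding of the baseline probability channel, connected component analysis, and median font heights per detected line. The threshold value and the simple axis-aligned bounding box are placeholder assumptions, not the exact values and geometry used in the pipeline.

```python
import numpy as np
from scipy.ndimage import label

def extract_textlines(baseline_prob, ascender_map, descender_map, threshold=0.5):
    """Turn the three per-pixel network outputs into textline regions.

    baseline_prob, ascender_map, descender_map: 2D arrays of the page size.
    Returns a list of dicts with baseline pixel coordinates, median
    ascender/descender heights and a rough bounding box; the threshold
    value is an assumption.
    """
    # 1) Threshold the baseline probability channel.
    baseline_mask = baseline_prob > threshold
    # 2) Connected component analysis: each component is one candidate baseline.
    labels, n_components = label(baseline_mask)
    lines = []
    for component in range(1, n_components + 1):
        ys, xs = np.nonzero(labels == component)
        # 3) Font heights as the median of predictions on the baseline pixels.
        asc = float(np.median(ascender_map[ys, xs]))
        desc = float(np.median(descender_map[ys, xs]))
        # 4) A simple axis-aligned bounding box from baseline extent and heights.
        bbox = (int(xs.min()), int(ys.min() - asc), int(xs.max()), int(ys.max() + desc))
        lines.append({"baseline": list(zip(xs.tolist(), ys.tolist())),
                      "ascender": asc, "descender": desc, "bbox": bbox})
    return lines
```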

Active learning for OCR at student conference Excel@FIT

Jan Kohút published a research paper on active learning for historic OCR and gave an oral presentation at the student conference Excel@FIT. The goal of the paper was to tune neural networks which combine convolutional and recurrent layers to provide high-quality automatic transcriptions of historic text lines. These networks were then used to explore how they can be adapted to new documents while minimizing the need for manual transcriptions. Jan Kohút prepared a large dataset of historical documents gathered and transcribed in the IMPACT project. He extracted lines using our text baseline detection tool and automatically aligned the existing text transcripts with the detected baselines. The resulting dataset contains 1.2 million text lines with transcripts and spans nine European languages and ten fonts and alphabets. We were able to achieve a 0.6% character error rate on this challenging dataset, and we optimized possible strategies for manual error correction and OCR model adaptation when processing documents with novel fonts and scripts.
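For readers unfamiliar with this model family, here is a minimal PyTorch sketch of a convolutional-recurrent line recognizer trained with CTC loss. The layer sizes, alphabet size and training data are illustrative assumptions and do not reflect the configuration evaluated in the paper.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Illustrative convolutional-recurrent line OCR model; sizes are placeholders."""

    def __init__(self, n_chars, img_height=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_height = img_height // 4
        self.rnn = nn.LSTM(64 * feat_height, 128, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(256, n_chars + 1)      # +1 for the CTC blank (index 0)

    def forward(self, line_img):                           # (B, 1, H, W)
        feat = self.conv(line_img)                         # (B, 64, H/4, W/4)
        b, c, h, w = feat.shape
        feat = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)   # one feature vector per column
        out, _ = self.rnn(feat)
        return self.classifier(out).log_softmax(-1)        # (B, W/4, n_chars + 1)

# Training step with CTC loss (targets and lengths are dummy values here).
model = CRNN(n_chars=80)
logits = model(torch.randn(4, 1, 32, 256)).permute(1, 0, 2)    # CTC expects (T, B, C)
targets = torch.randint(1, 81, (4, 10))                        # 0 is reserved for blank
loss = nn.CTCLoss()(logits, targets,
                    input_lengths=torch.full((4,), logits.shape[0], dtype=torch.long),
                    target_lengths=torch.full((4,), 10, dtype=torch.long))
loss.backward()
```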