Jan Kohút published a research paper on active learning for historic OCR and gave an oral presentation at student conference Excel@FIT 2019.
The goal of the paper was to tune neural networks which combine convolutional and recurrent layers to provide high quality automatic transcriptions for lines of historic texts. These networks ware than used to explore how they can be adapted to new documents while minimizing the need for manual transcriptions.
Jan Kohút prepared a large dataset of historical documents gathered and transcribed in project IMPACT. He extracted lines using our text baseline detection tool and he automatically aligned the existing text transcripts with the detected baselines. The resulting dataset contains 1.2 milion text lines with transcripts. It spans nine european languages and ten fonts and alphabets.
We were able to achieve 0.6% character error rate on this challenging dataset and we optimized the possible strategies for manual error correction and OCR model adaptation when processing documents with novel fonts and scripts.