The Brno Mobile OCR Dataset (B-MOD) is a collection of 2 113 templates (pages of scientific papers). The templates were captured with 23 different mobile devices under unrestricted conditions, so the resulting photographs vary in blur, illumination, and similar factors. In total, the dataset contains 19 725 photographs and more than 500k text lines with precise transcriptions. The template pages are divided into three subsets (training, validation, and testing).

This dataset may be used for non-commercial research purposes only. If you publish material based on this dataset, please include a reference to the following paper:

Kišš, M., Hradiš, M. and Kodym, O. Brno Mobile OCR Dataset. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). September 2019, p. 1352–1357. ISSN: 1520-5363.

You can download the dataset and evaluate your OCR system below. Our OCR system is available on GitHub. If you have any questions, please contact ikiss@fit.vutbr.cz or ihradis@fit.vutbr.cz.

Download

Samples

Evaluate

You can evaluate your OCR system using the form below. Enter your name or the name of your team to identify your results, and add a short description of your system or a link to one.

Please upload a single text file in which each line corresponds to one transcribed line of the test set, formatted the same way as the text files for the training and validation lines in the "Cropped lines with transcriptions" ZIP archive. Each line must follow the pattern:

filename transcription

e.g.

6149958838f466bbb508399a83bbeb5c.jpg_rec_l0004.jpg Theorems 1 and 2 show that, in checking for deadlock or
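The pattern above can be produced with a few lines of Python. This is a minimal sketch, not an official tool: the helper names (`format_submission_line`, `parse_submission_line`, `write_submission`) are hypothetical, and the choices of UTF-8 encoding, a single space as separator, and replacing line breaks inside a transcription with spaces are assumptions based only on the example line.

```python
from pathlib import Path


def format_submission_line(filename: str, transcription: str) -> str:
    """Build one 'filename transcription' line.

    Assumption: a transcription must stay on one line, so embedded
    line breaks are replaced with spaces.
    """
    text = transcription.replace("\r", " ").replace("\n", " ")
    return f"{filename} {text}"


def parse_submission_line(line: str) -> tuple[str, str]:
    """Split a line at the first space into (filename, transcription)."""
    filename, _, transcription = line.rstrip("\n").partition(" ")
    return filename, transcription


def write_submission(predictions: dict[str, str], path: str) -> None:
    """Write all predictions to a single text file, one line per image.

    Assumption: the file is UTF-8 encoded.
    """
    lines = [format_submission_line(name, text) for name, text in predictions.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

With such helpers, `write_submission({"6149958838f466bbb508399a83bbeb5c.jpg_rec_l0004.jpg": "Theorems 1 and 2 show that, in checking for deadlock or"}, "submission.txt")` would write the example line shown above. The output file name `submission.txt` is arbitrary.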


Leaderboard

Name Description Date Easy-CER Easy-WER Medium-CER Medium-WER Hard-CER Hard-WER Overall-CER Overall-WER (CER and WER in %)
Baseline LSTM CNN-LSTM-CTC 30.06.2019 0.33 1.93 5.65 22.39 32.28 72.63 3.15 10.71
Baseline Conv CNN-CTC 30.06.2019 0.50 2.79 7.82 28.50 39.76 80.69 4.19 13.39
Thales of Miletus Baseline CRNN (Random splitting) 10.09.2019 0.07 0.37 1.39 6.04 14.73 39.83 1.03 3.61
Michal Hradis Original CTC LSTM network from the paper decoded using beam search and dictionary. The dictionary is generated from the train/val dataset splits. Implementation from https://github.com/githubharald/CTCWordBeamSearch. 10.09.2019 1.70 6.99 5.46 16.19 33.37 60.42 4.05 11.81
SunBear CRNN 12.09.2019 0.24 1.41 4.25 17.94 27.72 68.21 2.50 8.90
Tesseract https://github.com/tesseract-ocr/tesseract config = ("-l eng --oem 1 --psm 7") 12.09.2019 12.32 24.21 45.00 71.06 79.17 100.87 24.47 40.91
Thales of Miletus - 1 Baseline CRNN (Random splitting) + WordBeamSearch 13.09.2019 0.06 0.37 1.38 6.00 14.64 39.50 1.03 3.58
Attention Conv-LSTM Fairly standard seq2seq Conv-LSTM model with attention. 16.09.2019 0.70 1.23 3.97 10.66 20.19 47.82 2.42 5.84
Sayan StaquResearch CRNN_CTC 07.10.2019 0.05 0.32 1.13 5.22 11.30 32.45 0.81 3.04
Sayan Mandal StaquResearch customCNN_LSTM_CTC. No augmentation or LM. 13.11.2019 0.04 0.26 0.90 4.20 10.34 30.27 0.70 2.61
HUST - MSOLab CustomEfficientNet_B2_CascadeAttn 13.01.2021 0.24 0.96 2.22 6.77 12.08 26.46 1.28 3.67
Custom Tesseract Version Trained from scratch with lines train set 18.02.2021 0.29 1.59 4.23 16.56 25.23 60.37 2.42 8.30
PERO-OCR production model CNN-LSTM-CTC model without LM 30.03.2021 0.03 0.20 0.85 3.99 9.80 29.57 0.66 2.48
Tesseract_28_05_2021 Brno_OCR_500000_0.002 28.05.2021 0.91 4.40 11.15 33.85 47.84 86.93 5.75 16.28
a a 07.01.2022 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Test Pova 07.01.2022 99.98 99.98 100.00 100.00 99.80 99.99 99.98 99.99
POVa test 07.01.2022 99.60 99.61 100.00 100.00 100.00 100.00 99.73 99.74
test 07.01.2022 98.06 98.09 100.00 100.00 100.00 100.00 98.70 98.72
POVa test 13k 07.01.2022 66.98 67.57 100.00 100.00 100.00 100.00 77.87 78.30
POVa Test 08.01.2022 0.70 2.49 100.00 100.00 100.00 100.00 33.47 34.75
POVa test hard 08.01.2022 100.00 100.00 100.00 100.00 47.48 86.21 97.91 99.45
POVa Test All 08.01.2022 99.98 99.98 9.13 26.69 47.48 86.21 71.53 78.12
POVa Test All 9.1.2022 09.01.2022 0.70 2.49 9.13 26.69 47.48 86.21 5.01 12.89
POVa mediumeasy 11.01.2022 0.52 2.02 4.15 12.93 28.38 62.21 2.68 7.61
POVa_easymediumhard 13.01.2022 0.42 1.63 3.71 12.44 22.96 53.66 2.27 6.86
POVa_dropout 13.01.2022 0.55 2.19 5.00 16.96 38.72 75.95 3.36 9.44
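
For readers who want to estimate these numbers locally on the validation set, the following sketch computes the character error rate (CER) and word error rate (WER) from the Levenshtein edit distance. It is an illustration under stated assumptions, not the evaluation server's code: the function names are hypothetical, errors are summed over all lines and divided by the total reference length, words are split on whitespace, and no text normalization is applied. The server may differ on any of these points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Error rate in percent; unit is 'char' for CER or 'word' for WER.

    Assumption: errors are pooled over all lines before dividing by the
    total reference length (not averaged per line).
    """
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        if unit == "word":
            ref, hyp = ref.split(), hyp.split()
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return 100.0 * errors / total
```

Because insertions count as errors but do not lengthen the reference, the rate can exceed 100 %, which is consistent with the WER of 100.87 reported for Tesseract on the hard subset. For example, `error_rate(["a b"], ["x y z"], unit="word")` counts three errors against two reference words.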