Projekt PERO

The Brno Mobile OCR Dataset (B-MOD) is a collection of 2 113 templates (pages of scientific papers). These templates were photographed with 23 different mobile devices under unrestricted conditions, so the resulting photographs vary in blur, illumination, and other capture conditions. In total, the dataset contains 19 725 photographs and more than 500k text lines with precise transcriptions. The template pages are divided into three subsets (training, validation and testing).

This dataset may be used for non-commercial research purposes only. If you publish material based on this dataset, we request that you cite the following paper:

Kišš, M., Hradiš, M. and Kodym, O. Brno Mobile OCR Dataset. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). September 2019, pp. 1352–1357. ISSN: 1520-5363.

You can download the dataset and evaluate your OCR system below. Our OCR system is available on GitHub. If you have any questions, please contact ikiss@fit.vutbr.cz or ihradis@fit.vutbr.cz.

Download

The dataset is available on Zenodo.

Evaluate

You can evaluate your OCR system using the form below. Enter your name or the name of your team to identify your results. Please also enter a short description of your system or a link to one.

Please upload a single text file in which each line corresponds to one transcribed line of the test set, formatted in the same way as the text files for the training and validation lines in the "Cropped lines with transcriptions" ZIP archive. Each line must follow the pattern:

filename transcription

For example:

6149958838f466bbb508399a83bbeb5c.jpg_rec_l0004.jpg Theorems 1 and 2 show that, in checking for deadlock or
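The pattern above can be produced with a few lines of Python. This is only a minimal sketch: the function names `format_submission_line`, `parse_submission_line` and `write_submission` are hypothetical, and the single-space separator and UTF-8 encoding are assumptions taken from the example line rather than from a formal specification.

```python
from typing import Iterable, Tuple


def format_submission_line(filename: str, transcription: str) -> str:
    """Build one line in the 'filename transcription' pattern.

    Assumption: the filename and the transcription are separated by a
    single space, as in the example line shown on this page.
    """
    if not filename or any(ch.isspace() for ch in filename):
        raise ValueError("filename must be non-empty and contain no whitespace")
    if "\n" in transcription or "\r" in transcription:
        raise ValueError("transcription must fit on a single line")
    return f"{filename} {transcription}"


def parse_submission_line(line: str) -> Tuple[str, str]:
    """Split a line at the first space into (filename, transcription)."""
    filename, _, transcription = line.rstrip("\r\n").partition(" ")
    return filename, transcription


def write_submission(path: str, results: Iterable[Tuple[str, str]]) -> None:
    """Write (filename, transcription) pairs, one per line.

    Assumption: the file is encoded as UTF-8.
    """
    with open(path, "w", encoding="utf-8") as handle:
        for filename, transcription in results:
            handle.write(format_submission_line(filename, transcription) + "\n")
```

Splitting at the first space only keeps the spaces inside the transcription intact, which matters because nearly every transcribed line contains several words.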

Leaderboard

Name	Description	Date	Easy CER	Easy WER	Medium CER	Medium WER	Hard CER	Hard WER	Overall CER	Overall WER
Baseline LSTM	CNN-LSTM-CTC	30.06.2019	0.33	1.93	5.65	22.39	32.28	72.63	3.15	10.71
Baseline Conv	CNN-CTC	30.06.2019	0.50	2.79	7.82	28.50	39.76	80.69	4.19	13.39
Thales of Miletus	Baseline CRNN (Random splitting)	10.09.2019	0.07	0.37	1.39	6.04	14.73	39.83	1.03	3.61
Michal Hradis	Original CTC LSTM network from the paper decoded using beam search and dictionary. The dictionary is generated from the train/val dataset splits. Implementation from https://github.com/githubharald/CTCWordBeamSearch.	10.09.2019	1.70	6.99	5.46	16.19	33.37	60.42	4.05	11.81
SunBear	CRNN	12.09.2019	0.24	1.41	4.25	17.94	27.72	68.21	2.50	8.90
Tesseract	https://github.com/tesseract-ocr/tesseract config = ("-l eng --oem 1 --psm 7")	12.09.2019	12.32	24.21	45.00	71.06	79.17	100.87	24.47	40.91
Thales of Miletus - 1	Baseline CRNN (Random splitting) + WordBeamSearch	13.09.2019	0.06	0.37	1.38	6.00	14.64	39.50	1.03	3.58
Attention Conv-LSTM	Fairly standard seq2seq Conv-LSTM model with attention.	16.09.2019	0.70	1.23	3.97	10.66	20.19	47.82	2.42	5.84
Sayan StaquResearch	CRNN_CTC	07.10.2019	0.05	0.32	1.13	5.22	11.30	32.45	0.81	3.04
Sayan Mandal StaquResearch	customCNN_LSTM_CTC. No augmentation or LM.	13.11.2019	0.04	0.26	0.90	4.20	10.34	30.27	0.70	2.61
HUST - MSOLab	CustomEfficientNet_B2_CascadeAttn	13.01.2021	0.24	0.96	2.22	6.77	12.08	26.46	1.28	3.67
Custom Tesseract Version	Trained from scratch with lines train set	18.02.2021	0.29	1.59	4.23	16.56	25.23	60.37	2.42	8.30
PERO-OCR production model	CNN-LSTM-CTC model without LM	30.03.2021	0.03	0.20	0.85	3.99	9.80	29.57	0.66	2.48
Tesseract_28_05_2021	Brno_OCR_500000_0.002	28.05.2021	0.91	4.40	11.15	33.85	47.84	86.93	5.75	16.28
a	a	07.01.2022	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Test Pova		07.01.2022	99.98	99.98	100.00	100.00	99.80	99.99	99.98	99.99
POVa test		07.01.2022	99.60	99.61	100.00	100.00	100.00	100.00	99.73	99.74
test		07.01.2022	98.06	98.09	100.00	100.00	100.00	100.00	98.70	98.72
POVa test 13k		07.01.2022	66.98	67.57	100.00	100.00	100.00	100.00	77.87	78.30
POVa Test		08.01.2022	0.70	2.49	100.00	100.00	100.00	100.00	33.47	34.75
POVa test hard		08.01.2022	100.00	100.00	100.00	100.00	47.48	86.21	97.91	99.45
POVa Test All		08.01.2022	99.98	99.98	9.13	26.69	47.48	86.21	71.53	78.12
POVa Test All 9.1.2022		09.01.2022	0.70	2.49	9.13	26.69	47.48	86.21	5.01	12.89
POVa mediumeasy		11.01.2022	0.52	2.02	4.15	12.93	28.38	62.21	2.68	7.61
POVa_easymediumhard		13.01.2022	0.42	1.63	3.71	12.44	22.96	53.66	2.27	6.86
POVa_dropout		13.01.2022	0.55	2.19	5.00	16.96	38.72	75.95	3.36	9.44
1e0b1bec-5838-4f19-9220-81013be218a2	1e0b1bec-5838-4f19-9220-81013be218a2	27.12.2024	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
S	S	14.01.2025	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
test2	test2	23.02.2025	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Name (required)	Short description (optional)	07.04.2025	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
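The leaderboard reports character error rate (CER) and word error rate (WER). For readers who want to estimate these metrics locally on the validation set, the sketch below uses the common definition: Levenshtein edit distance between reference and hypothesis, summed over all lines and divided by the total reference length, expressed as a percentage. This is an assumption, not the evaluation code behind the form above; the server may normalize text, tokenize words, or aggregate per line differently. The function names are hypothetical.

```python
from typing import Callable, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str],
               tokenize: Callable[[str], Sequence]) -> float:
    """Total edit distance divided by total reference length, in percent."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of lines")
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize(ref)
        errors += edit_distance(ref_tokens, tokenize(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference text is empty")
    return 100.0 * errors / total


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate: tokens are single characters."""
    return error_rate(refs, hyps, list)


def wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Word error rate: tokens are whitespace-separated words (assumed)."""
    return error_rate(refs, hyps, str.split)
```

Under this definition the error rate is not capped at 100: a hypothesis with many spurious insertions can accumulate more errors than the reference has tokens, which would be consistent with the 100.87 WER in the Tesseract row.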