This post briefly introduces the textline extraction pipeline used in the project.
Before OCR can be applied to a document page, the textlines need to be detected and extracted efficiently. Two pieces of information are required to fully characterize a textline. First, the baseline coordinates specify the location of the textline. Second, the heights of the font ascenders and descenders specify the area that needs to be cropped to contain the whole line without overshooting into neighbouring textlines. Given these parameters, a full textline bounding box can be easily computed. This representation is especially convenient when working with the standard PAGE XML format.
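As a rough illustration of how a bounding box follows from a baseline and the two heights, here is a minimal Python sketch. It is not the project's actual code: the function names `textline_polygon` and `to_page_points` are hypothetical, the baseline is assumed to be a left-to-right polyline in image coordinates (y grows downwards), and the heights are assumed to be in pixels.

```python
import math


def textline_polygon(baseline, ascender, descender):
    """Return a polygon (list of (x, y)) enclosing a baseline polyline.

    Assumes image coordinates: x grows right, y grows down, so the
    ascender side lies at smaller y for a left-to-right baseline.
    """
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two points")
    upper, lower = [], []
    for i, (x, y) in enumerate(baseline):
        # Estimate the local direction from the neighbouring points.
        x0, y0 = baseline[max(i - 1, 0)]
        x1, y1 = baseline[min(i + 1, len(baseline) - 1)]
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        if length == 0:
            raise ValueError("baseline contains duplicate points")
        # Unit normal pointing towards the ascenders.
        nx, ny = dy / length, -dx / length
        upper.append((x + nx * ascender, y + ny * ascender))
        lower.append((x - nx * descender, y - ny * descender))
    # Walk along the top edge, then back along the bottom edge.
    return upper + lower[::-1]


def to_page_points(polygon):
    """Format a polygon as space-separated "x,y" pairs with integer values."""
    return " ".join(f"{round(x)},{round(y)}" for x, y in polygon)


polygon = textline_polygon([(10, 100), (110, 100)], ascender=20, descender=5)
print(to_page_points(polygon))
```

Offsetting each point along the local normal, rather than simply adding and subtracting the heights in the y direction, keeps the box tight for slanted baselines as well.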
To efficiently predict these values for an input document page, we employ a convolutional neural network with an encoder-decoder architecture. The network analyses the page image at three different resolutions to remain robust across different page layouts and font sizes. At each resolution step, the outputs of the previous steps are concatenated to the current input, as shown in the picture.
At the output layer of the network, three values are predicted for each pixel: the probability that the pixel contains a baseline, the estimated ascender height of the font at this pixel and, analogously, the estimated descender height. These maps are then processed using simple thresholding and connected component analysis on the baseline probability channel. The font height values for each detected line are computed as the median of the height predictions on the corresponding baseline pixels.
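The post-processing described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the three output maps are taken to be equally sized 2-D lists, the threshold of 0.5 and the 8-connected neighbourhood are arbitrary choices made for the example, and the function name `extract_lines` is hypothetical.

```python
from collections import deque
from statistics import median


def extract_lines(prob, asc, desc, threshold=0.5):
    """Group baseline pixels into lines and estimate per-line font heights.

    prob, asc and desc are equally sized 2-D lists (rows of numbers).
    """
    rows, cols = len(prob), len(prob[0])
    seen = [[False] * cols for _ in range(rows)]
    lines = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or prob[r][c] < threshold:
                continue
            # Breadth-first search over the 8-connected neighbourhood
            # collects one connected component of baseline pixels.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for nr in range(max(pr - 1, 0), min(pr + 2, rows)):
                    for nc in range(max(pc - 1, 0), min(pc + 2, cols)):
                        if not seen[nr][nc] and prob[nr][nc] >= threshold:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # The median is robust to a few outlying height predictions.
            lines.append({
                "pixels": pixels,
                "ascender": median(asc[pr][pc] for pr, pc in pixels),
                "descender": median(desc[pr][pc] for pr, pc in pixels),
            })
    return lines
```

In practice an array library would do the same thing far faster; the explicit loops are used here only to keep the example self-contained.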
In some cases, such as when the page was bent before being scanned or when a handwritten textline is crooked, simply cropping the bounding box produces images that contain more than one textline and are therefore unsuitable for OCR. In such cases, textline unfolding is performed as an additional post-processing step. Unfolding elastically transforms the image along the normals of the detected baseline.
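One simple way to picture unfolding is to sample the image along the normal at each baseline point and stack the samples as columns of a straight strip. The sketch below does exactly that. It is a simplified stand-in, not the project's implementation: it uses nearest-neighbour sampling instead of proper interpolation, assumes integer pixel heights and baseline points roughly one pixel apart, and the function name `unfold_textline` is hypothetical.

```python
import math


def unfold_textline(image, baseline, ascender, descender, fill=0):
    """Resample a curved textline into a straight strip.

    image is a 2-D list of pixel values; baseline is a list of (x, y)
    points; ascender and descender are integer heights in pixels.
    Output column i is sampled along the normal at baseline point i.
    """
    rows, cols = len(image), len(image[0])
    height = ascender + descender + 1
    strip = [[fill] * len(baseline) for _ in range(height)]
    for i, (x, y) in enumerate(baseline):
        # Local direction from the neighbouring baseline points.
        x0, y0 = baseline[max(i - 1, 0)]
        x1, y1 = baseline[min(i + 1, len(baseline) - 1)]
        length = math.hypot(x1 - x0, y1 - y0) or 1.0
        # Unit normal pointing towards the ascenders (image y grows down).
        nx, ny = (y1 - y0) / length, -(x1 - x0) / length
        for j in range(height):
            # Offset runs from +ascender (top row) to -descender (bottom row).
            offset = ascender - j
            sx = round(x + nx * offset)
            sy = round(y + ny * offset)
            # Samples falling outside the image keep the fill value.
            if 0 <= sy < rows and 0 <= sx < cols:
                strip[j][i] = image[sy][sx]
    return strip
```

In the resulting strip the baseline always lies on row `ascender`, regardless of how the line curved in the original page.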
Thus, even sub-optimally scanned document pages with challenging layouts and a variety of fonts and headings can be parsed into cropped textlines fully automatically.