Optical Character Recognition (OCR) on historical printings is a challenging task, mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years, great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout analysis and segmentation, character recognition, and post-processing. The drawback of these tools is often their limited applicability by non-technical users such as humanist scholars, in particular when several tools have to be combined in a workflow. In this paper, we present an open-source OCR software called OCR4all, which combines state-of-the-art OCR components and continuous model training into a comprehensive workflow. While a variety of materials can already be processed fully automatically, books with more complex layouts require manual intervention by the users. This is mostly because the ground truth required for training stronger mixed models (for segmentation as well as text recognition) is not yet available in either the desired quantity or quality. To deal with this issue in the short run, OCR4all offers a comfortable GUI that allows error corrections not only in the final output but already in early stages, to minimize error propagation. In the long run, this constant manual correction produces large quantities of valuable, high-quality training material, which can be used to improve fully automatic approaches. Furthermore, extensive configuration capabilities are provided to set the degree of automation of the workflow and to adapt the carefully selected default parameters to specific printings, if necessary. During experiments, the fully automated application on 19th-century novels showed that OCR4all can considerably outperform the commercial state-of-the-art tool ABBYY Finereader on moderate layouts if suitably pretrained mixed OCR models are available.

School of Science and Technology, University of Management and Technology, Lahore, Pakistan; Department of Electrical Engineering, University of Engineering & Technology, Lahore, Pakistan; Department of Electrical Engineering, University of Engineering and Technology, Lahore, Pakistan

Keywords: word recognition, handwritten text, OCR system, back propagation multilayer perceptron neural network (BP-MLP-NN), Urdu script

Abstract: Urdu Nastaleeq is considered a standard composing font for Urdu as well as Arabic script. It is widely used in the Sub-continent and the Middle East region in the form of printed media, books and old historical scrolls. However, useful knowledge written in Urdu is not fully available in electronic form, due to the lack of techniques for digitizing old handwritten or printed scripts. We develop a font-size-independent optical character recognition (OCR) system to recognize Nastaleeq-based Urdu written script. In the first step, preprocessing is performed using binarization, median filtering and thinning of the scanned image. In the second step, line and ligature segmentation are performed. The classification or recognition of Urdu font is achieved using a Back Propagation Multilayer Perceptron Neural Network (BP-MLP-NN), where multiple features such as raw, central and scale-invariant moments, along with area, centroid and orientation, are used to train the network and recognize the character. The designed prototype successfully detected individual Urdu characters with 98% accuracy on a self-generated database and 96% accuracy on scanned textbook data.
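The moment-style shape features listed in the Urdu OCR abstract (raw, central and scale-invariant, plus area, centroid and orientation) can be illustrated with the standard image-moment formulas. The following is only a minimal sketch, not the authors' implementation: the function names `raw_moment` and `glyph_features`, the list-of-lists 0/1 image format, and the choice of which moments to return are all assumptions made for illustration.

```python
import math


def raw_moment(img, p, q):
    """Raw moment M_pq: sum of x^p * y^q over pixels, weighted by pixel value."""
    return sum((x ** p) * (y ** q) * v
               for y, row in enumerate(img)
               for x, v in enumerate(row))


def glyph_features(img):
    """Hypothetical helper: shape features of a binary (0/1) glyph image."""
    area = raw_moment(img, 0, 0)  # M_00 is the foreground pixel count
    if area == 0:
        raise ValueError("image contains no foreground pixels")
    cx = raw_moment(img, 1, 0) / area  # centroid x
    cy = raw_moment(img, 0, 1) / area  # centroid y

    def central(p, q):
        # Central moment mu_pq: taken about the centroid, so it does not
        # change when the glyph is translated.
        return sum(((x - cx) ** p) * ((y - cy) ** q) * v
                   for y, row in enumerate(img)
                   for x, v in enumerate(row))

    def normalized(p, q):
        # Scale-normalized moment eta_pq = mu_pq / mu_00^(1 + (p + q) / 2).
        return central(p, q) / area ** (1 + (p + q) / 2)

    mu20, mu02, mu11 = central(2, 0), central(0, 2), central(1, 1)
    # Angle of the principal axis relative to the x-axis, in radians.
    orientation = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return {
        "area": area,
        "centroid": (cx, cy),
        "orientation": orientation,
        "mu20": mu20,
        "mu02": mu02,
        "eta20": normalized(2, 0),
        "eta02": normalized(0, 2),
    }


if __name__ == "__main__":
    # A three-pixel horizontal stroke as a toy "glyph".
    stroke = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    print(glyph_features(stroke))
```

A feature vector like this would then be passed to a classifier such as the BP-MLP-NN named in the abstract. On a pixel grid the normalized moments are only approximately scale-invariant, so some variation between font sizes should be expected.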