New Advances in Arabic-Script Optical Character Recognition

By Elijah Cooke
Submitted to Session P4945 (A New Corpus for the Islamicate World and Methods for Its Exploration), 2017 Annual Meeting
While a number of collections of Persian and Arabic digital texts already exist online, each has certain limitations. The available digital Persian collection, for example, needs more prose chronicles and philosophical treatises. The collection of digital Arabic texts would represent the Arabic literary tradition more fully if it included more scientific texts and more texts written by representatives of smaller Arabic-speaking religious communities. The most efficient way to address these lacunae and develop the Persian and Arabic digital corpora is to build a robust Optical Character Recognition (OCR) solution for Arabic-script languages. Existing OCR solutions suffer from a variety of critical problems, foremost among which are that they are not open source and that their accuracy rates are notoriously low (often not even reaching 70% on medieval texts). In this presentation, I will present a new Arabic-script OCR solution developed by my team that has achieved accuracy rates over 97% on Arabic-script languages (e.g., Persian and Arabic) and Syriac. This new OCR solution uses a neural network approach that far outperforms earlier segmentation-based Arabic-script OCR models and can be trained to recognize new typefaces on as few as one thousand lines of training data. I will conclude with a presentation of our new OCR pipeline, which automates the entire OCR process, from file submission to post-correction, and thus allows novice users to produce their own digital texts from printed works.