A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges
Figure 1. Types of OCR systems in Arabic and their modes of processing.
Figure 2. Brief overview of OCR process.
Figure 3. Character segmentation stages in order to recognize characters with maximum accuracy.
Figure 4. The flow of the OCR process along with OCR phases and methods involved.
Figure 5. Opening and closing of an image.
Figure 6. A skewed document (on the left) is deskewed (on the right) to achieve better OCR results.
Figure 7. Processes and techniques in each phase of the OCR system.
Figure 8. Hybrid postprocessing technique based on Google’s spelling suggestion algorithm.
Abstract
1. Introduction
1.1. Types of OCR
1.2. Language vs. Script
1.3. Challenges
1.4. Applications
- Invoice Imaging: Many businesses use OCR on invoices to keep track of business records.
- Legal Industry: OCR is used to digitize documents and enter their data directly into databases.
- Banking: OCR is widely used in banking services. For example, to process cheque payments, cheques are scanned and the payment is transferred in seconds.
- Healthcare: Many forms, reports, and insurance applications are processed into databases and used for other purposes; OCR helps to transfer all kinds of patient data.
- Captcha: A captcha is used to secure systems. It shows a distorted image of a few letters, numbers, or both. Humans can easily read a captcha, but an average computer program cannot.
- Automatic Number Recognition: Surveillance systems use it to track vehicle records from number plates; OCR recognizes the characters and numbers on the plates.
- Handwriting Recognition: In this application of OCR, text is extracted from handwritten documents and photographs. For this purpose, the model learns and identifies fonts and languages for better results.
- Scanned Receipts: Extracting information from scanned receipts is challenging because of variations in receipt layout, noise, and distortion [9].
1.5. Brief OCR Process
1.6. Goals and Outlines
2. Datasets
2.1. Handwritten Text
2.2. Printed Arabic
2.3. Scanned Documents/Receipts
2.4. Quranic Text
3. OCR Process
3.1. Preprocessing
3.1.1. Binarization and Thinning
3.1.2. Denoising
3.1.3. Deskewing
3.1.4. Keystone Correction
3.1.5. Upscaling
3.2. Segmentation
3.2.1. Line Segmentation
3.2.2. Word Segmentation
3.2.3. Character Segmentation
3.3. Recognition
3.4. Postprocessing
- Spell checking: checks the spelling of the recognized text and corrects any errors by comparing it to a dictionary [99].
- Grammar checking: checks the grammar of the recognized text and corrects any errors by comparing it to a set of grammar rules.
- Lexicon-based correction: uses a lexicon (a list of words and their possible variations) to correct errors in the recognized text by comparing it to the lexicon and suggesting alternative words where there are errors.
- Machine-learning-based approaches: use machine learning algorithms, such as decision trees, random forests, and support vector machines, to correct errors in the recognized text.
- Deep-learning-based approaches: use deep learning algorithms, such as CNNs and RNNs, to correct errors in the recognized text.
- Text enhancement: includes techniques to improve the recognized text’s visibility, legibility, and readability, such as binarization, deskewing and smoothing of text.
- Text restoration: includes techniques to recover missing or degraded text, such as text in-painting, completion, and restoration.
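The lexicon-based correction listed above can be illustrated with a short sketch. It is a minimal example under stated assumptions, not the method of any cited work: the function names (`edit_distance`, `correct_word`, `correct_text`), the three-word lexicon, and the `max_distance` threshold are all hypothetical, and Levenshtein distance is only one of several possible similarity measures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word, lexicon, max_distance=2):
    """Return the closest lexicon entry, or the word itself if none is close."""
    if not lexicon or word in lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda entry: (edit_distance(word, entry), entry))
    return best if edit_distance(word, best) <= max_distance else word


def correct_text(text, lexicon, max_distance=2):
    """Apply word-level correction to whitespace-separated OCR output."""
    return " ".join(correct_word(w, lexicon, max_distance) for w in text.split())


if __name__ == "__main__":
    # Hypothetical lexicon: "book", "library", "pen".
    lexicon = ["كتاب", "مكتبة", "قلم"]
    # The final letter of "كتاب" was misread by a hypothetical OCR engine.
    print(correct_text("كتات", lexicon))
```

A real system would typically weight candidates by word frequency or context as well, since the nearest lexicon entry by edit distance is not always the intended word.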
3.5. Evaluation
3.6. Summary of Presented Techniques
4. Discussion and Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alhomed, L.S.; Jambi, K.M. A survey on the existing arabic optical character recognition and future trends. Int. J. Adv. Res. Comput. Commun. Eng. (IJARCCE) 2018, 7, 78–88.
- Beg, A.; Ahmed, F.; Campbell, P. Hybrid OCR techniques for cursive script languages-a review and applications. In Proceedings of the International Conference on Computational Intelligence, Communication Systems and Networks, Liverpool, UK, 28–30 July 2010; pp. 101–105.
- Djaghbellou, S.; Bouziane, A.; Attia, A.; Akhtar, Z. A Survey on Arabic Handwritten Script Recognition Systems. Int. J. Artif. Intell. Mach. Learn. (IJAIML) 2021, 11, 1–17.
- Islam, N.; Islam, Z.; Noor, N. A survey on optical character recognition system. arXiv 2017, arXiv:1710.05703.
- Rashid, D.; Kumar Gondhi, N. Scrutinization of Urdu Handwritten Text Recognition with Machine Learning Approach. In Proceedings of the International Conference on Emerging Technologies in Computer Engineering, Xiamen, China, 21–23 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 383–394.
- Idrees, S.; Hassani, H. Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR. Appl. Sci. 2021, 11, 9752.
- Bafjaish, S.S.; Azmi, M.S.; Al-Mhiqani, M.N.; Radzid, A.R.; Mahdin, H. Skew detection and correction of Mushaf Al-Quran script using hough transform. Int. J. Adv. Comput. Sci. Appl. 2018, 9.
- Singh, A.; Bacchuwar, K.; Bhasin, A. A survey of OCR applications. Int. J. Mach. Learn. Comput. 2012, 2, 314.
- Antonio, J.; Putra, A.R.; Abdurrohman, H.; Tsalasa, M.S. A Survey on Scanned Receipts OCR and Information Extraction. In Proceedings of the International Conference on Document Analysis and Recognit, Jerusalem, Israel, 29–30 November 2022.
- Al-Sheikh, I.S.; Mohd, M.; Warlina, L. A review of arabic text recognition dataset. Asia-Pac. J. Inf. Technol. Multimed. (APJITM) 2020, 9, 69–81.
- Ahmed, S.B.; Naz, S.; Swati, S.; Razzak, M.I. Handwritten Urdu character recognition using one-dimensional BLSTM classifier. Neural Comput. Appl. 2019, 31, 1143–1151.
- Zayene, O.; Masmoudi Touj, S.; Hennebert, J.; Ingold, R.; Essoukri Ben Amara, N. Open datasets and tools for arabic text detection and recognition in news video frames. J. Imaging 2018, 4, 32.
- Badry, M.; Hassan, H.; Bayomi, H.; Oakasha, H. QTID: Quran Text Image Dataset. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 385–391.
- Pechwitz, M.; Maddouri, S.S.; Märgner, V.; Ellouze, N.; Amiri, H. IFN/ENIT-Database of Handwritten Arabic Words; CIFED: Hammamet, Tunis, 2002; Volume 2, pp. 127–136.
- Al-Ma’adeed, S.; Elliman, D.; Higgins, C.A. A data base for Arabic handwritten text recognition research. In Proceedings of the International workshop on frontiers in handwriting recognition, Niagara-on-the-Lake, ON, Canada, 6–8 August 2002; pp. 485–489.
- Slimane, F.; Ingold, R.; Kanoun, S.; Alimi, A.M.; Hennebert, J. Database and Evaluation Protocols for Arabic Printed Text Recognition; DIUF-University of Fribourg: Fribourg, Switzerland, 2009; p. 1.
- Lawgali, A.; Angelova, M.; Bouridane, A. HACDB: Handwritten Arabic characters database for automatic character recognition. In Proceedings of the European Workshop on Visual Information Processing (EUVIP), Paris, France, 10–12 June 2013; pp. 255–259.
- Sabbour, N.; Shafait, F. A segmentation-free approach to Arabic and Urdu OCR. In Proceedings of the Document Recognition and Retrieval, San Jose, CA, USA, 16–20 January 2005; Volume 8658, pp. 215–226.
- Saddami, K.; Munadi, K.; Arnia, F. A database of printed Jawi character image. In Proceedings of the International Conference on Image Information Processing (ICIIP), Waknaghat, India, 21–24 December 2015; pp. 56–59.
- Mahmoud, S.A.; Ahmad, I.; Al-Khatib, W.G.; Alshayeb, M.; Parvez, M.T.; Märgner, V.; Fink, G.A. KHATT: An open Arabic offline handwritten text database. Pattern Recognit. 2014, 47, 1096–1112.
- Yousfi, S.; Berrani, S.A.; Garcia, C. ALIF: A dataset for Arabic embedded text recognition in TV broadcast. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1221–1225.
- Zayene, O.; Hennebert, J.; Touj, S.M.; Ingold, R.; Amara, N.E.B. A dataset for Arabic text detection, tracking and recognition in news videos-AcTiV. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 996–1000.
- Chabchoub, F.; Kessentini, Y.; Kanoun, S.; Eglin, V.; Lebourgeois, F. SmartATID: A mobile captured Arabic Text Images Dataset for multi-purpose recognition tasks. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR), Hyderabad, India, 4–7 December 2016; pp. 120–125.
- Sulaiman, A.; Omar, K.; Nasrudin, M.F. A database for degraded Arabic historical manuscripts. In Proceedings of the International Conference on Electrical Engineering and Informatics (ICEEI), Langkawi, Malaysia, 25–27 November 2017; pp. 1–6.
- Bataineh, B. A Printed PAW Image Database of Arabic Language for Document Analysis and Recognition. J. ICT Res. Appl. 2017, 11, 200–212.
- Al-Ohali, Y.; Cheriet, M.; Suen, C. Databases for recognition of handwritten Arabic cheques. Pattern Recognit. 2003, 36, 111–121.
- Awaidah, S.M.; Mahmoud, S.A. A multiple feature/resolution scheme to Arabic (Indian) numerals recognition using hidden Markov models. Signal Process. 2009, 89, 1176–1184.
- Asiri, A.M.; Khorsheed, M.S. Automatic Processing of Handwritten Arabic Forms using Neural Networks. In Proceedings of the IEC (Prague), Prague, Czech Republic, 26–28 August 2005; pp. 313–317.
- Luqman, H.; Mahmoud, S.A.; Awaida, S. KAFD Arabic font database. Pattern Recognit. 2014, 47, 2231–2240.
- Ramdan, J.; Omar, K.; Faidzul, M.; Mady, A. Arabic handwriting data base for text recognition. Procedia Technol. 2013, 11, 580–584.
- Amara, N.E.B.; Mazhoud, O.; Bouzrara, N.; Ellouze, N. ARABASE: A Relational Database for Arabic OCR Systems. Int. Arab J. Inf. Technol. 2005, 2, 259–266.
- Srihari, S.; Srinivasan, H.; Babu, P.; Bhole, C. Handwritten arabic word spotting using the cedarabic document analysis system. In Proceedings of the Symposium on Document Image Understanding Technology (SDIUT-05), College Park, MD, USA, 2–4 November 2005; pp. 123–132.
- Shafi, M.; Zia, K. Urdu character recognition: A systematic literature review. Int. J. Appl. Pattern Recognit. 2021, 6, 283–307.
- Khan, N.H.; Adnan, A. Urdu optical character recognition systems: Present contributions and future directions. IEEE Access 2018, 6, 46019–46046.
- Bhatti, A.; Arif, A.; Khalid, W.; Khan, B.; Ali, A.; Khalid, S.; Rehman, A.u. Recognition and Classification of Handwritten Urdu Numerals Using Deep Learning Techniques. Appl. Sci. 2023, 13, 1624.
- Khosrobeigi, Z.; Veisi, H.; Hoseinzade, E.; Shabanian, H. Persian Optical Character Recognition Using Deep Bidirectional Long Short-Term Memory. Appl. Sci. 2022, 12, 11760.
- Husnain, M.; Saad Missen, M.M.; Mumtaz, S.; Coustaty, M.; Luqman, M.; Ogier, J.M. Urdu handwritten text recognition: A survey. IET Image Process. 2020, 14, 2291–2300.
- Naz, S.; Hayat, K.; Razzak, M.I.; Anwar, M.W.; Madani, S.A.; Khan, S.U. The optical character recognition of Urdu-like cursive scripts. Pattern Recognit. 2014, 47, 1229–1248.
- Alghamdi, M.; Teahan, W. Printed Arabic script recognition: A survey. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 415–427.
- Osman, H.; Zaghw, K.; Hazem, M.; Elsehely, S. An Efficient Language-Independent Multi-Font OCR for Arabic Script. arXiv 2020, arXiv:2009.09115.
- Muhammad, M.; ElGhazaly, T. Handling OCR-degraded arabic text: A comprehensive survey. In Proceedings of the ISSR Conference, Turku, Finland, 27–30 June 2013.
- Dinges, L.; Al-Hamadi, A.; Elzobi, M.; El-Etriby, S. Synthesis of common Arabic handwritings to aid optical character recognition research. Sensors 2016, 16, 346.
- Bouressace, H. A Review of Arabic Document Analysis Methods. In Proceedings of the International Conference on Pattern Analysis and Intelligent Systems (PAIS), Oum El Bouaghi, Algeria, 12–13 October 2022; pp. 1–7.
- Qaroush, A.; Jaber, B.; Mohammad, K.; Washaha, M.; Maali, E.; Nayef, N. An efficient, font independent word and character segmentation algorithm for printed Arabic text. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 1330–1344.
- Al Ghamdi, M.A. A Novel Approach to Printed Arabic Optical Character Recognition. Arab. J. Sci. Eng. 2022, 47, 2219–2237.
- Majumdar, S.; Brick, A. Recognizing Handwriting Styles in a Historical Scanned Document Using Scikit-Fuzzy c-means Clustering. arXiv 2022, arXiv:2210.16780.
- Mostafa, A.; Mohamed, O.; Ashraf, A.; Elbehery, A.; Jamal, S.; Khoriba, G.; Ghoneim, A.S. OCFormer: A Transformer-Based Model For Arabic Handwritten Text Recognition. In Proceedings of the International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Cairo, Egypt, 26–27 May 2021; pp. 182–186.
- Badry, M.; Hassanin, M.; Chandio, A.; Moustafa, N. Quranic script optical text recognition using deep learning in IoT systems. CMC-Comput. Mater. Contin. 2021, 68, 1847–1858.
- Moudgil, A.; Singh, S.; Gautam, V. An Overview of Recent Trends in OCR Systems for Manuscripts. In Cyber Intelligence and Information Retrieval; Springer: Berlin, Germany, 2022; pp. 525–533.
- Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; Jawahar, C. Icdar2019 competition on scanned receipt ocr and information extraction. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1516–1520.
- Bashir, M.H.; Azmi, A.M.; Nawaz, H.; Zaghouani, W.; Diab, M.; Al-Fuqaha, A.; Qadir, J. Arabic natural language processing for Qur’anic research: A systematic review. Artif. Intell. Rev. 2022.
- Gupta, M.R.; Jacobson, N.P.; Garcia, E.K. OCR binarization and image pre-processing for searching historical documents. Pattern Recognit. 2007, 40, 389–397.
- Michalak, H.; Okarma, K. Robust combined binarization method of non-uniformly illuminated document images for alphanumerical character recognition. Sensors 2020, 20, 2914.
- Tellache, M.; Sid-Ahmed, M.; Abaza, B. Thinning algorithms for Arabic OCR. In Proceedings of the Pacific Rim Conference on Communications Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 248–251.
- Mohsenzadegan, K.; Tavakkoli, V.; Kyamakya, K. Deep Neural Network Concept for a Blind Enhancement of Document-Images in the Presence of Multiple Distortions. Appl. Sci. 2022, 12, 9601.
- Mahmud, J.U.; Raihan, M.F.; Rahman, C.M. A complete OCR system for continuous Bengali characters. In Proceedings of the Conference on Convergent Technologies for Asia-Pacific Region (TENCON), Bangalore, India, 15–17 October 2003; Volume 4, pp. 1372–1376.
- Mohsenzadegan, K.; Tavakkoli, V.; Kyamakya, K. A Smart Visual Sensing Concept Involving Deep Learning for a Robust Optical Character Recognition under Hard Real-World Conditions. Sensors 2022, 22, 6025.
- Nashwan, F.M.; Rashwan, M.A.; Al-Barhamtoshy, H.M.; Abdou, S.M.; Moussa, A.M. A holistic technique for an Arabic OCR system. J. Imaging 2017, 4, 6.
- Karthick, K.; Ravindrakumar, K.; Francis, R.; Ilankannan, S. Steps involved in text recognition and recent research in OCR; a study. Int. J. Recent Technol. Eng. 2019, 8, 2277–3878.
- Cao, Y.; Wang, S.; Li, H. Skew detection and correction in document images based on straight-line fitting. Pattern Recognit. Lett. 2003, 24, 1871–1879.
- Bao, W.; Yang, C.; Wen, S.; Zeng, M.; Guo, J.; Zhong, J.; Xu, X. A Novel Adaptive Deskewing Algorithm for Document Images. Sensors 2022, 22, 7944.
- Boiangiu, C.A.; Dinu, O.A.; Popescu, C.; Constantin, N.; Petrescu, C. Voting-based document image skew detection. Appl. Sci. 2020, 10, 2236.
- Ahmad, R.; Naz, S.; Razzak, I. Efficient skew detection and correction in scanned document images through clustering of probabilistic hough transforms. Pattern Recognit. Lett. 2021, 152, 93–99.
- Li, Y.; Zou, F.; Yang, S.; Liu, H.; Ding, Y.; Zhu, K. Research on Improving OCR Recognition Based on Bending Correction. In Proceedings of the International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 11–13 December 2020; Volume 9, pp. 833–837.
- Schulter, S.; Leistner, C.; Bischof, H. Fast and accurate image upscaling with super-resolution forests. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3791–3799.
- Pandey, R.K.; Vignesh, K.; Ramakrishnan, A. Binary document image super resolution for improved readability and OCR performance. arXiv 2018, arXiv:1812.02475.
- Abdo, H.A.; Abdu, A.; Manza, R.R.; Bawiskar, S. An approach to analysis of Arabic text documents into text lines, words, and characters. Indones. J. Electr. Eng. Comput. Sci. 2022, 26, 754–763.
- Naz, S.; Umar, A.I.; Shirazi, S.H.; Ahmed, S.B.; Razzak, M.I.; Siddiqi, I. Segmentation techniques for recognition of Arabic-like scripts: A comprehensive survey. Educ. Inf. Technol. 2016, 21, 1225–1241.
- Thorat, C.; Bhat, A.; Sawant, P.; Bartakke, I.; Shirsath, S. A Detailed Review on Text Extraction Using Optical Character Recognition. ICT Anal. Appl. 2022, 719–728.
- Qaroush, A.; Awad, A.; Hanani, A.; Mohammad, K.; Jaber, B.; Hasheesh, A. Learning-free, divide and conquer text-line extraction algorithm for printed Arabic text with diacritics. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 7699–7709.
- Brodic, D.; Milivojevic, D.R.; Milivojevic, Z.N. An approach to a comprehensive test framework for analysis and evaluation of text line segmentation algorithms. Sensors 2011, 11, 8782–8812.
- Brodić, D.; Milivojević, D.R.; Milivojević, Z. Basic test framework for the evaluation of text line segmentation and text parameter extraction. Sensors 2010, 10, 5263–5279.
- Reisswig, C.; Katti, A.R.; Spinaci, M.; Höhne, J. Chargrid-OCR: End-to-end trainable optical character recognition through semantic segmentation and object detection. In Proceedings of the Workshop on Document Intelligence at NeurIPS 2019, Vancouver, BC, Canada, 14 December 2019.
- Agarwal, M.; Hassan, F.; Pandey, G.; Ghosh, S. Handwriting recognition using deep learning. In Emerging Trends in Data Driven Computing and Communications: Proceedings of DDCIoT 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 67–81.
- Boualam, M.; Elfakir, Y.; Khaissidi, G.; Mrabti, M. Arabic handwriting word recognition based on convolutional recurrent neural network. In Proceedings of the 6th International Conference on Wireless Technologies, Embedded, and Intelligent Systems (WITS 2020), Fez, Morocco, 14–16 October 2020; Springer: Berlin/Heidelberg, Germany, 2022; pp. 877–885.
- Patil, S.; Varadarajan, V.; Mahadevkar, S.; Athawade, R.; Maheshwari, L.; Kumbhare, S.; Garg, Y.; Dharrao, D.; Kamat, P.; Kotecha, K. Enhancing Optical Character Recognition on Images with Mixed Text Using Semantic Segmentation. J. Sens. Actuator Netw. 2022, 11, 63.
- Tayyab, M.; Hussain, A.; Alshara, M.A.; Khan, S.; Alotaibi, R.M.; Baig, A.R. Recognition of Visual Arabic Scripting News Ticker from Broadcast Stream. IEEE Access 2022, 10, 59189–59204.
- Alginahi, Y.M. A survey on Arabic character segmentation. Int. J. Doc. Anal. Recognit. (IJDAR) 2013, 16, 105–126.
- Boraik, O.A.; Ravikumar, M.; Saif, M.A.N. Characters Segmentation from Arabic Handwritten Document Images: Hybrid Approach. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 395–403.
- AbdAllah, N.; Viriri, S. Off-Line Arabic Handwritten Words Segmentation using Morphological Operators. arXiv 2021, arXiv:2101.02797.
- Jabde, M.; Patil, C.; Mali, S.; Vibhute, A. Comparative Study of Machine Learning and Deep Learning Classifiers on Handwritten Numeral Recognition. In Proceedings of the International Symposium on Intelligent Informatics, Trivandrum, India, 31 August–2 September 2022.
- Verma, R.; Ali, J. A-survey of feature extraction and classification techniques in OCR systems. Int. J. Comput. Appl. Inf. Technol. 2012, 1, 1–3.
- Hamida, S.; El Gannour, O.; Cherradi, B.; Ouajji, H.; Raihani, A. Efficient feature descriptor selection for improved Arabic handwritten words recognition. Int. J. Electr. Comput. Eng. 2022, 12.
- Peng, X.; Cao, H.; Setlur, S.; Govindaraju, V.; Natarajan, P. Multilingual OCR research and applications: An overview. In Proceedings of the International Workshop on Multilingual OCR, Washington, DC, USA, 24 August 2013; pp. 1–8.
- Bergamaschi, S.; De Nardis, S.; Martoglia, R.; Ruozzi, F.; Sala, L.; Vanzini, M.; Vigliermo, R.A. Novel perspectives for the management of multilingual and multialphabetic heritages through automatic knowledge extraction: The digitalmaktaba approach. Sensors 2022, 22, 3995.
- Butt, H.; Raza, M.R.; Ramzan, M.J.; Ali, M.J.; Haris, M. Attention-based CNN-RNN Arabic text recognition from natural scene images. Forecasting 2021, 3, 520–540.
- Al-Barhamtoshy, H.M.; Jambi, K.M.; Rashwan, M.A.; Abdou, S.M. An Arabic Manuscript Regions Detection, Recognition and Its Applications for OCRing. Trans. Asian-Low-Resour. Lang. Inf. Process. 2023, 22, 1–28.
- Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text recognition in the wild: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35.
- Bouchakour, L.; Meziani, F.; Latrache, H.; Ghribi, K.; Yahiaoui, M. Printed Arabic Characters Recognition Using Combined Features and CNN classifier. In Proceedings of the International Conference on Recent Advances in Mathematics and Informatics (ICRAMI), Tebessa, Algeria, 21–22 September 2021; pp. 1–5.
- Ahlawat, S.; Choudhary, A.; Nayyar, A.; Singh, S.; Yoon, B. Improved handwritten digit recognition using convolutional neural networks (CNN). Sensors 2020, 20, 3344.
- Ashraf, N.; Arafat, S.Y.; Iqbal, M.J. An Analysis of Optical Character Recognition (OCR) Methods. Int. J. Comput. Linguist. Res. 2019, 10, 81.
- Al-Sadawi, B.; Hussain, A.; Ali, N.S. High-Performance Printed Arabic Optical Character Recognition System Using ANN Classifier. In Proceedings of the Palestinian International Conference on Information and Communication Technology, Gaza, Palestine, 28–29 September 2021; IEEE Computer Society: Colombia, DC, USA, 2021; pp. 1–6.
- Mittal, R.; Garg, A. Text extraction using OCR: A systematic review. In Proceedings of the International Conference on Inventive Research in Computing Applications, Coimbatore, India, 15–17 July 2020; pp. 357–362.
- Alrobah, N.; Albahli, S. Arabic handwritten recognition using deep learning: A survey. Arab. J. Sci. Eng. 2022, 47, 9943–9963.
- Alwaqfi, Y.M.; Mohamad, M.; Al-Taani, A.T. Generative Adversarial Network for an Improved Arabic Handwritten Characters Recognition. Int. J. Adv. Soft Comput. Its Appl. 2022, 14, 176–195.
- Hamad, K.; Mehmet, K. A detailed analysis of optical character recognition technology. Int. J. Appl. Math. Electron. Comput. 2016, 1, 244–249.
- Subramani, N.; Matton, A.; Greaves, M.; Lam, A. A survey of deep learning approaches for ocr and document understanding. arXiv 2020, arXiv:2011.13534.
- Nguyen, T.T.H.; Jatowt, A.; Coustaty, M.; Doucet, A. Survey of post-ocr processing approaches. ACM Comput. Surv. (CSUR) 2021, 54, 1–37.
- Neto, A.F.d.S.; Bezerra, B.L.D.; Toselli, A.H. Towards the natural language processing as spelling correction for offline handwritten text recognition systems. Appl. Sci. 2020, 10, 7711.
- Doush, I.A.; Alkhateeb, F.; Gharaibeh, A.H. A novel Arabic OCR post-processing using rule-based and word context techniques. Int. J. Doc. Anal. Recognit. (IJDAR) 2018, 21, 77–89.
- Bassil, Y.; Alwani, M. Ocr post-processing error correction algorithm using google online spelling suggestion. arXiv 2012, arXiv:1204.0191.
- Aliwy, A.H.; Al-Sadawi, B. Corpus-based technique for improving Arabic OCR system. Indones. J. Electr. Eng. Comput. Sci. 2021, 21, 233–241.
- Alghamdi, M.A.; Alkhazi, I.S.; Teahan, W.J. Arabic OCR evaluation tool. In Proceedings of the International conference on computer science and information technology (CSIT), Amman, Jordan, 13–14 July 2016; pp. 1–6.
- Kiessling, B.; Kurin, G.; Miller, M.T.; Smail, K.; Miller, M. Advances and Limitations in Open Source Arabic-Script OCR: A Case Study. Digit. Stud. Champ Numérique 2021, 11.
- Neudecker, C.; Baierer, K.; Gerber, M.; Clausner, C.; Antonacopoulos, A.; Pletschacher, S. A survey of OCR evaluation tools and metrics. In Proceedings of the International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland, 5–10 September 2021; pp. 13–18.
- Elzobi, M.; Al-Hamadi, A. Generative vs. Discriminative Recognition Models for Off-Line Arabic Handwriting. Sensors 2018, 18, 2786.
- Singh, S.; Garg, N.K.; Kumar, M. On the performance analysis of various features and classifiers for handwritten devanagari word recognition. Neural Comput. Appl. 2023, 35, 7509–7527.
- Vitman, O.; Kostiuk, Y.; Plachinda, P.; Zhila, A.; Sidorov, G.; Gelbukh, A. Evaluating the Impact of OCR Quality on Short Texts Classification Task. In Proceedings of the Mexican International Conference on Artificial Intelligence, Monterrey, Mexico, 24–29 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 163–177.
- Reul, C.; Christ, D.; Hartelt, A.; Balbach, N.; Wehner, M.; Springmann, U.; Wick, C.; Grundig, C.; Büttner, A.; Puppe, F. OCR4all—An open-source tool providing a (semi-) automatic OCR workflow for historical printings. Appl. Sci. 2019, 9, 4853.
Dataset | Type of Content | Availability | Size of Dataset |
---|---|---|---|
ACTIV2 [12] | Embedded words | Public | 10,415 text images |
QTID [13] | Synthetic words | Private | 309,720 words and 249,428 characters |
IFN/ENIT [14] | Handwritten words | Public | 115,000 words and 212,000 characters |
AHDB [15] | Handwritten words and digits | Private | 30,000 words |
APTI [16] | Printed words | Public | 113,284 words and 648,280 characters |
HACDB [17] | Handwritten characters | Public | 6600 characters and 50 writers |
UPTI [18] | Printed text lines | Public | 10,000 text lines |
Digital Jawi [19] | Jawi paleography images | Public | 168 words and 1524 characters |
KHATT [20] | Handwritten text lines | Public | 9327 lines, 165,890 words and 589,924 characters |
ALIF [21] | Embedded text lines | Upon request | 1804 words and 89,819 characters |
ACTIV [22] | Embedded text lines | Public | 4824 lines and 21,520 words |
SmartATID [23] | Printed and handwritten pages | Public | 9088 pages |
Degraded historical [24] | Handwritten documents | Public | 10 handwritten images and 10 printed images |
Printed PAW [25] | Printed subwords | Upon request | 415,280 unique words and 550,000 sub words |
Checks [26] | Handwritten subwords and digits | Private | 29,498 subwords and 15,148 digits |
Numeral [27] | Handwritten digits | Public | 21,120 digits and 44 writers |
Forms [28] | Handwritten characters | Private | 15,800 characters and 500 writers |
KAFD [29] | Printed pages and lines | Public | 28,767 pages and 644,006 lines |
AHDBIFTR [30] | Handwritten images | Public | 497 word images and 5 writers |
ARABASE [31] | Handwritten text | Public | 47,000 words and 500 free Arabic sentences |
CEDAR [32] | Handwritten pages | Private | 20,000 words, 10 writers, and 100 documents |
CENPARMI [26] | Handwritten subwords and digits | Public | 6000 digit images |
Description | Stats |
---|---|
Total text lines of dataset | 4,000,000 |
Total words | 15,000,000 |
Unique words | 200,000 |
Text lines per image | 70 |
Total used fonts (with sizes) | 11 fonts (sizes: 12, 14, and 18) |
Statistic | UPTI | CALAM | UNHD |
---|---|---|---|
Total writers | 250 | 725 | 500 |
Text lines | 60,000 | 3043 | 10,000 |
Words | 240,000 | 46,664 | 187,200 |
Characters | 970,650 | 101,181 | 312,000 |
Availability | Private | Private | Public |
Technique | Method | Brief Description |
---|---|---|
Preprocessing | Binarization | Transforms the input image into a binary format |
Preprocessing | Keystone Correction | Corrects perspective distortion at the edges of the image |
Preprocessing | Skew Correction | Corrects the angle of rotated text |
Preprocessing | Denoising | Filters out noisy pixels from the image |
Preprocessing | Dilation | Expands object boundaries, filling gaps left by erosion |
Preprocessing | Erosion | Shrinks object boundaries, removing unwanted parts of the image |
Preprocessing | Thinning | Reduces the thickness of objects by removing boundary pixels |
Preprocessing | Upscaling | Enhances the resolution of the image |
Segmentation | Line | Divides the image into lines for line-by-line processing |
Segmentation | Word | Divides each line into words using spacing methods |
Segmentation | Character | Divides the image of each word into individual characters |
Recognition | Template Matching | Matches an input image against predefined characters |
Recognition | Feature Extraction | Extracts features and classifies the image using a learning algorithm |
Recognition | Neural Networks | Uses interconnected neurons to predict text from an image |
Recognition | Deep Learning | Uses neural networks with many layers to learn patterns |
Recognition | Decision Trees | Builds a tree-like structure of decisions and consequences |
Recognition | SVM | Constructs a hyperplane separating images into different classes |
Recognition | Naive Bayes | Uses Bayes’ theorem to classify an input image |
Recognition | Random Forest | Builds multiple decision trees and combines their outputs |
Recognition | CNN | Uses deep learning with convolutional layers to classify an image |
Recognition | RNN | Neural network for processing sequences (characters in OCR) |
Recognition | kNN | Classifies an image by the majority class of its k nearest neighbors |
Recognition | HT | Detects lines, circles, and edges in an image for text extraction |
Recognition | HOG | Computes histograms of image gradients and extracts features |
Recognition | HMM | Models transition probabilities of text for accurate recognition |
Recognition | Profile Projection | Extracts character features by projecting onto a 1D axis |
Postprocessing | Spell-check | Error correction, text enhancement, and restoration |
Postprocessing | Contextual Analysis | Analyses the surrounding words based on specific context |
Postprocessing | Confidence Scoring | Assigns scores to words; a higher score means a more accurate result |
Postprocessing | Language Model | Uses a large corpus of text to guess the best word in context |
Evaluation | Character Error Rate | Percentage of characters incorrectly predicted |
Evaluation | Word Error Rate | Percentage of words incorrectly predicted |
Evaluation | Recognition Rate | Percentage of characters/words correctly recognized |
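
The evaluation metrics in the table above can be illustrated with a minimal sketch. It assumes the common convention of computing error rates from the Levenshtein (edit) distance between the ground-truth text and the OCR output, normalized by the length of the ground truth; the survey itself only describes the metrics as percentages of incorrectly predicted characters or words. The function names (`edit_distance`, `cer`, `wer`, `recognition_rate`) are hypothetical and chosen for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def recognition_rate(error_rate):
    """Assumed here to be the complement of the error rate, floored at zero."""
    return max(0.0, 1.0 - error_rate)
```

For example, if the ground truth is "كتاب" (four characters) and the OCR output is "كتب" (one character dropped), `cer` returns 0.25 and the corresponding recognition rate is 0.75. Because insertions count as errors, an edit-distance-based error rate can exceed 1, which is why the recognition rate is floored at zero in this sketch.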
OCR Techniques | Preprocessing | Segmentation | Recognition | Postprocessing | Evaluation | Accuracy |
---|---|---|---|---|---|---|
Ahmad et al. [63] | ✓ | ✓ | ✓ | 99.3% (Scanned) | ||
Bafjaish et al. [7] | ✓ | ✓ | ✓ | 90% (Scanned) | ||
Karthick et al. [59] | ✓ | ✓ | ✓ | ✓ | 87.4% (Handwritten), 90% (Scanned) | |
Abdo et al. [67] | ✓ | ✓ | ✓ | ✓ | 94.1% (Printed) | |
Qaroush et al. [70] | ✓ | ✓ | ✓ | 11% (Segmentation) | ||
Tayyab et al. [77] | ✓ | ✓ | ✓ | ✓ | 98.36% (Scanned) | |
Alginahi [78] | ✓ | ✓ | 93.65% (Handwritten), 86.14% (Scanned) | |||
Verma and Ali [82] | ✓ | ✓ | No Recognition | |||
Hamida et al. [83] | ✓ | ✓ | ✓ | 99.88% (Handwritten) | ||
Butt et al. [86] | ✓ | ✓ | ✓ | 87% (Scanned) | ||
Nguyen et al. [98] | ✓ | ✓ | No Recognition | |||
Doush et al. [100] | ✓ | ✓ | No Recognition | |||
Neudecker et al. [105] | ✓ | No Recognition | ||||
Vitman et al. [108] | ✓ | ✓ | ✓ | 83.5% (High-quality), 58.4% (Low-quality) |
Method | Pros | Cons |
---|---|---|
Template matching | Simple and easy to implement | Limited accuracy, sensitive to noise and variations in text |
Deep learning | High accuracy, can handle variations in text | Requires large amounts of training data, computationally expensive |
kNN | Takes little training time and makes predictions quickly on small datasets | Sensitive to noisy or irrelevant features |
RNN | Processes large sequential data and can learn long-term dependencies | Computationally expensive and sensitive to overfitting |
Hough Transform | Robust to noise and can detect lines and circles at any orientation | Computationally expensive when dealing with large images |
Histogram of Oriented Gradients | Extracts features such as edge orientation and texture, and is computed quickly | Ineffective at detecting finer details and sensitive to variations in lighting and contrast |
Hidden Markov Model | Models complex patterns and can be trained on large/sequential datasets | Computationally expensive to train and sensitive to model parameters |
Profile Projection | Extracts features from images, such as character width and spacing | Sensitive to variations in lighting and contrast |
Random Forest | Relatively easy to train and can handle noisy or missing data | Does not perform well on highly imbalanced or sparse datasets |
SVM | Suited to classification tasks and can handle high-dimensional data | Computationally expensive; non-linear kernels require hyperparameter tuning |
Hybrid approaches | Combines the strengths of multiple methods | More complex and difficult to implement |
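
As an illustration of the profile projection entry above, the following minimal sketch segments a binarized page into text lines using a horizontal projection profile. It assumes the image is already binarized and given as a list of rows in which 1 marks an ink pixel and 0 marks background; the function names and the `min_ink` threshold are hypothetical and not taken from any of the surveyed systems.

```python
def horizontal_profile(image):
    """Count ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    A row belongs to a text line when its ink count reaches min_ink;
    consecutive such rows are grouped into one line.
    """
    lines = []
    start = None
    for y, ink in enumerate(horizontal_profile(image)):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image)))
    return lines


# Tiny synthetic page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
```

Calling `segment_lines(page)` on this synthetic page yields `[(1, 3), (4, 5)]`. The same idea applied column-wise within a line gives a vertical profile for word gaps, although cursive scripts such as Arabic generally need more than blank-column detection to separate individual characters.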
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Faizullah, S.; Ayub, M.S.; Hussain, S.; Khan, M.A. A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges. Appl. Sci. 2023, 13, 4584. https://doi.org/10.3390/app13074584