Optical Character Recognition of Balochi Script

Authors

  • Muhammad Mazhar, Department of Computer Science and Technology, Faculty of Information Science and Technology, Ocean University of China
  • Qinbo, Faculty of Engineering and Technology (FET), University of Sindh, Jamshoro, Pakistan
  • Dil Nawaz Hakro, Faculty of Engineering and Technology (FET), University of Sindh, Jamshoro, Pakistan
  • Abdul Majid, Department of Computer Science and Technology, Faculty of Information Science and Technology, Ocean University of China

DOI:

https://doi.org/10.32628/CSEIT241046

Abstract

Optical Character Recognition (OCR) is considered one of the fastest methods of data entry. OCR converts a text image, represented as pixel information at x and y coordinates, into text data in a particular language. As a field of pattern recognition and document image understanding, OCR becomes more challenging when the image contains text in a different language. Differences in script pose different challenges for OCR and require entirely different approaches and algorithms: Latin scripts call for one approach, whereas the script adopted for Balochi calls for another. Various solutions have therefore been proposed for different languages. Segmentation is one of the important tasks in the OCR process, and good segmentation increases OCR accuracy. It first separates text lines from the text image, then divides the lines into words, and finally divides the words into the characters that are to be recognized. This study proposes a single segmentation algorithm for the scripts of several languages: it checks the script and then segments the text image for further processing in OCR. The proposed generalized algorithm checks the style, direction and other properties of the script and then adapts the segmentation process to segment the text lines, words and characters of the language. It segments more than ten languages in three scripts for use in their OCR systems. The segmented images can then be passed on to feature extraction and classification, which makes recognition easier for the selected languages. Experiments covered multiple scripts, languages and images, and the proposed algorithm successfully segmented 42,833 text line, word and character images. The algorithm achieves 97% accuracy in segmenting these images and can be extended to further languages and scripts.
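
The abstract does not say how the line, word and character boundaries are computed. As a rough illustration of the line-then-word pipeline it describes, the sketch below uses projection profiles, a common baseline technique that is assumed here and is not taken from the paper. The function names, the `min_gap` threshold and the `right_to_left` flag are hypothetical, and the image is assumed to be already binarized (1 for ink, 0 for background).

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of non-empty runs in a projection profile.

    `end` is exclusive. Runs separated by fewer than `min_gap` empty
    positions are merged into one run.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the run ends at the first gap
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])  # gap too small: join runs
        else:
            merged.append(run)
    return merged


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the horizontal projection profile."""
    row_profile = [sum(row) for row in image]
    return find_runs(row_profile)


def segment_words(image: Image, line: Tuple[int, int], min_gap: int = 2,
                  right_to_left: bool = True) -> List[Tuple[int, int]]:
    """Split one text line into words using the vertical projection profile.

    Column ranges are returned in reading order, so they are reversed for
    right-to-left scripts such as the Arabic-based script used for Balochi.
    """
    top, bottom = line
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    words = find_runs(col_profile, min_gap)
    return words[::-1] if right_to_left else words


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 1, 1, 0],
        [1, 1, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 0, 0],
    ]
    for line in segment_lines(page):
        print(line, segment_words(page, line))
```

The same `find_runs` helper could in principle be applied once more inside each word to propose character cuts, but cursive scripts usually need more than gap detection at that level, so that step is left out of this sketch.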

Published

14-07-2024

Section

Research Articles

How to Cite

[1]
Muhammad Mazhar, Qinbo, Dil Nawaz Hakro, and Abdul Majid, “Optical Character Recognition of Balochi Script”, Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol, vol. 10, no. 4, pp. 115–124, Jul. 2024, doi: 10.32628/CSEIT241046.