A Comprehensive Survey of Script Identification in Indian Multi-Script Documents

Authors

  • Jagin M. Patel, M. K. Institute of Computer Studies, Bharuch, Gujarat, India
  • Bharat C. Patel, Smt. Tanuben and Dr. Manubhai Trivedi College of Information Science, Surat, Gujarat, India

Keywords:

Indian scripts, Multi-lingual, Script Identification, OCR

Abstract

Multilingual character recognition plays a crucial role in applications such as document analysis, text-to-speech synthesis, machine translation, and optical character recognition (OCR). Indian scripts are diverse and complex: they vary in shape, size, and structure across languages and writing styles, which makes accurate character recognition difficult. This survey reviews the current state of research in multilingual character recognition for Indian scripts and summarizes the existing literature, methodologies, and techniques.

Published

2022-04-30

Section

Research Articles

How to Cite

[1]
Jagin M. Patel, Bharat C. Patel, "A Comprehensive Survey of Script Identification in Indian Multi-Script Documents," International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 8, Issue 2, pp. 535-546, March-April 2022.