Parallel Corpora: A Much-Needed Linguistic Resource for Low Computational Resource Languages

Preeti Dubey

doi:10.32628/CSEIT174406

Authors

Preeti Dubey, Assistant Professor, Department of Computer Science, J&K Higher Education Department, India

Keywords:

Text Corpus, Speech Corpus, Parallel Corpora, Natural Language Processing, Low Resource Languages

Abstract

Natural Language Processing (NLP) is an emerging research area of computer science. NLP has many applications, but over the last decade most of the effort in the field has been directed towards machine translation. A considerable body of work is available on machine translation between English and Hindi, and some work has also been undertaken on the translation of other Indian languages; as a result, there has been considerable research on developing text in machine-readable form. Efforts are currently under way to develop large parallel corpora for most Indian languages; such corpora are a much-needed linguistic resource for the development of Statistical Machine Translation systems. This paper introduces the concept of a parallel corpus and its need and application in natural language processing. It presents the various projects undertaken to develop parallel corpora, followed by the tools in which parallel corpora are applied. The need to develop this resource for languages with low computational resources is also discussed.

