Parallel Corpora : A Much-Needed Linguistic Resource for Low Computational Resource Languages

Author : Preeti Dubey

Natural Language Processing (NLP) is an active and growing research area of computer science. NLP has many applications, but over the last decade most of the effort in the field has gone into machine translation. Considerable work is available on machine translation for English and Hindi, and some work has also been undertaken on the translation of other Indian languages; as a result, there has been significant research on developing text in machine-readable form. Efforts are currently under way to develop large parallel corpora for most Indian languages, a much-needed linguistic resource for building Statistical Machine Translation systems. This paper introduces the concept of a parallel corpus and its need and application in natural language processing. It then presents the various projects undertaken to develop parallel corpora, followed by the tools in which parallel corpora are applied. The need to develop this resource for languages with low computational resources is also discussed.

Authors and Affiliations

Preeti Dubey
Assistant Professor, Department of Computer Science, J&K Higher Education Department, India

Keywords

Text Corpus, Speech Corpus, Parallel Corpora, Natural Language Processing, Low Resource Languages

Publication Details

Published in : Volume 2 | Issue 7 | September 2017
Date of Publication : 2017-09-30
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 41-44
Manuscript Number : CSEIT174406
Publisher : Technoscience Academy

ISSN : 2456-3307

Cite This Article :

Preeti Dubey, "Parallel Corpora : A Much-Needed Linguistic Resource for Low Computational Resource Languages", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN : 2456-3307, Volume 2, Issue 7, pp.41-44, September-2017.
