Language Independent and Multilingual Language Identification using Infinity Ngram Approach

Authors

  • Kidst Ergetie Andargie
  • Tsegay Mullu Kassa

DOI:

https://doi.org/10.32628/CSEIT195414

Keywords:

Language Identification, Multilingual, Infinity Character Ngram, Ngram Location, Language Independent

Abstract

Massive amounts of multilingual digital information are now generated, propagated, exchanged, stored and accessed through the web every day across the world. This accumulation of multilingual digital data becomes an obstacle to information acquisition. Language identification is the first of the many steps needed to tackle this difficulty. It is the process of labeling given text content with its corresponding language category.

Research on language identification has been carried out over the past decades. However, some issues remain unsolved: multilingual language identification, discriminating between documents written in very closely related languages, and labelling the language of very short texts such as words or phrases.

In this investigation, we propose an approach that is able to resolve these unsolved issues of language identification (i.e. multilingual and very short text language identification) without a language barrier. To attain this, we adopt an approach that uses all character ngram features of a given text unit (i.e. a word, phrase, etc.).

Moreover, the proposed approach is capable of identifying the language of a text at any text unit (i.e. word, phrase, sentence or document) in both monolingual and multilingual settings. This capability comes from adopting word level features, in which every word is classified with regard to its language category. The infinity ngram approach uses all character ngrams of a text unit together in order to label the language category of the given text at the word level.
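As an illustration of what "all character ngrams of a text unit" could look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes that the features of a word are all of its contiguous substrings, and that the location feature can be encoded as the start offset of each ngram; the function names and this encoding are hypothetical.

```python
def all_char_ngrams(word):
    """Return every contiguous character ngram of `word`, for n = 1 .. len(word)."""
    return [
        word[i:i + n]
        for n in range(1, len(word) + 1)       # ngram length
        for i in range(len(word) - n + 1)      # start offset
    ]


def all_char_ngrams_with_location(word):
    """Pair each ngram with its start offset (an assumed encoding of 'location')."""
    return [
        (word[i:i + n], i)
        for n in range(1, len(word) + 1)
        for i in range(len(word) - n + 1)
    ]


if __name__ == "__main__":
    print(all_char_ngrams("abc"))
    print(all_char_ngrams_with_location("abc"))
```

A word of k characters yields k(k+1)/2 ngrams under this scheme, so the feature set grows quadratically with word length but stays small for typical words.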

To observe the effectiveness of the proposed approach, four experimental techniques are conducted: pure infinity character ngram, infinity ngram with location feature, and infinity ngram with sentence and document level reformulation. The experimental results indicate that infinity ngram with location feature, together with sentence and document level reformulation, achieves a promising result: an average F-measure of 100% at the word, phrase, sentence and document levels in the monolingual setting. In the multilingual setting it also attains an average F-measure of 100% at both the sentence and document levels, while at the phrase level it achieves 84.33%, 88.95% and 90.19% for Amharic, Geeze and Tigrigna respectively. At the word level it achieves 83.16%, 80.96% and 85.85% for Amharic, Geeze and Tigrigna respectively.
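For readers unfamiliar with the metric, the sketch below shows how a per-language F-measure can be computed from word level labels. It uses the standard definition (the harmonic mean of precision and recall); the abstract does not state how the reported scores were computed or averaged, so the function name, the label strings and the toy data are assumptions for illustration only.

```python
def f_measure(gold, predicted, language):
    """Standard F-measure for one language, given parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == language and p == language)
    fp = sum(1 for g, p in pairs if g != language and p == language)
    fn = sum(1 for g, p in pairs if g == language and p != language)
    if tp == 0:
        # No correct predictions for this language: F-measure is 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels only; these are not data from the paper.
    gold = ["amharic", "amharic", "tigrigna", "geeze"]
    predicted = ["amharic", "tigrigna", "tigrigna", "geeze"]
    for lang in ("amharic", "geeze", "tigrigna"):
        print(lang, round(f_measure(gold, predicted, lang), 4))
```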

Published

2019-08-30

Section

Research Articles

How to Cite

Kidst Ergetie Andargie, Tsegay Mullu Kassa, "Language Independent and Multilingual Language Identification using Infinity Ngram Approach", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN: 2456-3307, Volume 5, Issue 4, pp. 217-228, July-August 2019. Available at doi: https://doi.org/10.32628/CSEIT195414