Contextual Sentence Similarity from News Articles

Authors

  • Nikhil Chaturvedi, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, Madhya Pradesh, India
  • Jigyasu Dubey, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, Madhya Pradesh, India

DOI:

https://doi.org/10.32628/CSEIT2390628

Keywords:

sentence similarity, BERT, deep learning, cosine similarity

Abstract

Measuring sentence similarity is an important task in natural language processing, and it is essential to gauge precisely how similar two sentences are. Existing methods face two problems. First, because sentence-level semantics are not explicitly modelled during training, labelled datasets are typically small and therefore insufficient for training supervised neural models. Second, unsupervised language-modelling (LM) based models suffer a train-test gap when computing semantic scores between sentences. As a result, performance on this task remains limited. In this paper, we propose a novel framework to address these two concerns. The proposed framework is built on the essential premise that a sentence's meaning is determined by its context, and that sentence similarity can therefore be measured by comparing the probabilities of generating the two sentences given the same context. In an unsupervised way, the proposed approach can create high-quality, large-scale datasets with semantic similarity scores between sentence pairs, bridging the train-test gap to a great extent. Extensive experiments show that the proposed framework outperforms existing baselines on a wide range of datasets.
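Once sentences are represented as vectors (for example, embeddings from a BERT-style encoder, as the keywords suggest), a standard way to score their similarity is the cosine of the angle between the vectors. The following is a generic illustrative sketch, not the paper's implementation; the toy embedding values are made up for demonstration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors.

    Returns a value in [-1, 1]; 1 means the vectors point the
    same way (maximally similar sentences under this measure).
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "sentence embeddings"; in practice these would come from a
# pretrained encoder such as BERT or Sentence-BERT.
emb_a = [0.2, 0.8, 0.5]
emb_b = [0.1, 0.9, 0.4]
score = cosine_similarity(emb_a, emb_b)
print(round(score, 4))
```

In practice the embeddings would be high-dimensional model outputs, but the scoring step is exactly this normalized dot product.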




Published

14-03-2024

Issue

Section

Research Articles

How to Cite

[1]
N. Chaturvedi and J. Dubey, “Contextual Sentence Similarity from News Articles”, Int. J. Sci. Res. Comput. Sci. Eng. Inf. Technol., vol. 10, no. 2, pp. 24–37, Mar. 2024, doi: 10.32628/CSEIT2390628.
