Developing Semantic Textual Similarity for Guragigna Language Using Deep Learning Approach

Getnet Degemu

Abstract

Semantic textual similarity (STS) is a core task in natural language processing (NLP), with significant applications in information retrieval, information extraction, text summarization, data mining, machine translation, and other tasks. This research presents a deep learning approach for capturing STS in the Guragigna language. The methodology involves collecting a Guragigna language corpus, preprocessing the text data, and representing the text with the Universal Sentence Encoder (USE) as well as word embedding techniques including Word2Vec and GloVe; mean squared error (MSE) is used to measure performance. In the experimentation phase, LSTM, bidirectional RNN, GRU, and stacked RNN models are trained and evaluated with each embedding technique. The results demonstrate the efficacy of the developed models in capturing semantic textual similarity in Guragigna. Across the embedding techniques, the bidirectional RNN model with USE embeddings achieves the lowest MSE of 0.0950 and the highest accuracy of 0.9244, while GloVe and Word2Vec embeddings show competitive performance with slightly higher MSE and lower accuracy. The Universal Sentence Encoder consistently emerges as the top-performing embedding across all RNN architectures, and the results confirm the effectiveness of LSTM, GRU, bidirectional RNN, and stacked RNN models in measuring semantic textual similarity in Guragigna.
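The evaluation pipeline described above (embed each sentence, score the pair, compare against gold similarity with MSE) can be sketched as follows. This is a minimal mean-pooled word-vector baseline, not the paper's trained RNN encoders: the word vectors and example tokens below are made up for illustration, standing in for trained Word2Vec/GloVe embeddings of Guragigna words.

```python
import numpy as np

# Illustrative 4-dimensional word vectors (values are invented; in the
# paper these would come from Word2Vec or GloVe trained on the corpus).
word_vectors = {
    "w1": np.array([0.9, 0.1, 0.0, 0.2]),
    "w2": np.array([0.1, 0.8, 0.3, 0.0]),
    "w3": np.array([0.4, 0.4, 0.5, 0.1]),
}

def sentence_embedding(tokens):
    """Mean-pool word vectors into one sentence vector (a common
    baseline; an RNN encoder would replace this step)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence pairs with gold similarity scores in [0, 1].
pairs = [
    (["w1", "w2"], ["w1", "w2"], 1.0),  # identical sentences
    (["w1"],       ["w2"],       0.2),  # dissimilar sentences
]

predictions = [
    cosine_similarity(sentence_embedding(a), sentence_embedding(b))
    for a, b, _ in pairs
]
gold = np.array([g for _, _, g in pairs])

# Mean squared error, the evaluation metric used in this research.
mse = float(np.mean((np.array(predictions) - gold) ** 2))
```

In the paper's setting, `sentence_embedding` is replaced by a trained LSTM, GRU, bidirectional RNN, or stacked RNN encoder (or by USE sentence vectors directly), and the model is optimized to minimize the same MSE objective.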

Authors

Getnet Degemu
21fuja.com@gmail.com (Primary Contact)
Degemu, G. (2024). Developing Semantic Textual Similarity for Guragigna Language Using Deep Learning Approach. International Journal of Advanced Science and Computer Applications, 5(1). https://doi.org/10.47679/ijasca.v4i2.106
