REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS

Authors

  • N. Kholodna Lviv Polytechnic National University, Lviv, Ukraine
  • V. Vysotska Lviv Polytechnic National University, Lviv, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2022-4-11

Keywords:

natural language processing, NLP, rewrite identification, detection of paraphrasing in text, supervised machine learning, deep learning, text classification, text analysis, word embeddings, WordNet, semantic similarity

Abstract

Context. Paraphrased textual content, or rewriting, is one of the hardest problems in detecting academic plagiarism. Most plagiarism detection systems are designed to detect shared words, matching sequences of linguistic units, and minor changes, but they cannot detect significant semantic and structural changes. As a result, most cases of plagiarism by paraphrasing go unnoticed.

Objective. The study aims to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods, using a recurrent Siamese neural network and a Transformer-type network, RoBERTa, to analyze the level of similarity between sentences of text content.

Method. The following semantic similarity metrics were chosen as features for this study: the Jaccard coefficient for shared N-grams, the cosine distance between vector representations of sentences, Word Mover’s Distance, distances based on the WordNet dictionary, and the predictions of two ML models: a recurrent Siamese neural network and a Transformer-type network, RoBERTa.
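Two of the listed features can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the default N-gram size below are assumptions made for illustration, not the authors' implementation. Word Mover's Distance and the WordNet distances are left out because they require external embeddings and lexical resources.

```python
import math
from typing import Sequence, Set, Tuple


def word_ngrams(sentence: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word-level n-grams of a lowercased, whitespace-split sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngrams(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient: shared n-grams divided by all distinct n-grams."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # neither sentence is long enough to form an n-gram
    return len(grams_a & grams_b) / len(union)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 1.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Two sentences with identical words but a different meaning.
s1 = "Flights from New York to Florida"
s2 = "Flights from Florida to New York"
unigram_overlap = jaccard_ngrams(s1, s2, n=1)
bigram_overlap = jaccard_ngrams(s1, s2, n=2)
```

For this pair the unigram overlap is complete while the bigram overlap is only partial, which suggests why N-gram overlap is combined with other features rather than used alone on word-scrambled data such as PAWS.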

Results. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic relatedness of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This property of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other.
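The stacking principle can be sketched as follows: handcrafted similarity features are concatenated with the probabilities predicted by the base models, and a meta-classifier is trained on the combined vector. The abstract does not name the meta-classifier, so the logistic regression below, the helper names, and the toy numbers are all assumptions chosen only to keep the sketch self-contained.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def stack_features(handcrafted: Sequence[float],
                   base_probs: Sequence[float]) -> List[float]:
    """Concatenate handcrafted features with base-model probabilities."""
    return list(handcrafted) + list(base_probs)


def train_meta_classifier(rows: Sequence[Sequence[float]],
                          labels: Sequence[int],
                          epochs: int = 1000,
                          lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit a logistic regression meta-classifier by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return 1 (paraphrase) if the predicted probability is at least 0.5."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical rows: [N-gram Jaccard, Siamese probability, RoBERTa probability].
rows = [
    stack_features([0.9], [0.8, 0.9]),
    stack_features([0.8], [0.9, 0.95]),
    stack_features([0.9], [0.2, 0.1]),   # high overlap, not a paraphrase
    stack_features([0.7], [0.3, 0.2]),
]
labels = [1, 1, 0, 0]
weights, bias = train_meta_classifier(rows, labels)
predictions = [predict(weights, bias, row) for row in rows]
```

In this toy setup the meta-classifier learns to rely on the base-model probabilities when N-gram overlap is high for both classes. A real system would train the base models and the meta-classifier on separate data splits to avoid leakage.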

Conclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1-score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.
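For reference, the four reported metrics can be computed from binary predictions as in the sketch below. The function name and the toy labels are illustrative assumptions. The abstract does not state whether the reported scores are for the paraphrase class or averaged over both classes, so this sketch covers only the single positive-class case.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels: 1 = paraphrase, 0 = not a paraphrase.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
```

A false positive here corresponds to a pair wrongly flagged as a paraphrase, which is the error type the abstract links to incorrect accusations of plagiarism.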

Author Biographies

N. Kholodna, Lviv Polytechnic National University, Lviv, Ukraine

Student of Information Systems and Networks Department

V. Vysotska, Lviv Polytechnic National University, Lviv, Ukraine

PhD, Associate Professor of Information Systems and Networks Department

Published

2022-12-13

How to Cite

Kholodna, N., & Vysotska, V. (2022). REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS. Radio Electronics, Computer Science, Control, (4), 126. https://doi.org/10.15588/1607-3274-2022-4-11

Section

Progressive information technologies