REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS
DOI: https://doi.org/10.15588/1607-3274-2022-4-11

Keywords: natural language processing, NLP, rewrite identification, detection of paraphrasing in text, supervised machine learning, deep learning, text classification, text analysis, word embeddings, WordNet, semantic similarity

Abstract
Context. Paraphrased or rewritten textual content is one of the difficult problems in detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism that use paraphrasing go undetected.
Objective. The objective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods, using a recurrent Siamese neural network and a Transformer-type network (RoBERTa) to analyze the level of similarity between sentences of text content.
Method. For this study, the following semantic similarity metrics were chosen as features: the Jaccard coefficient for shared N-grams, the cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, and the predictions of two ML models: a recurrent Siamese neural network and a Transformer-type network (RoBERTa).
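Two of the lexical features above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation; tokenization by whitespace and the sample sentences are assumptions:

```python
import math

def ngrams(tokens, n):
    # Set of word N-grams of a tokenized sentence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_ngrams(sent_a, sent_b, n=2):
    # Jaccard coefficient over shared N-grams: |A ∩ B| / |A ∪ B|
    a = ngrams(sent_a.lower().split(), n)
    b = ngrams(sent_b.lower().split(), n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine_similarity(u, v):
    # Cosine similarity between two sentence vectors
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(jaccard_ngrams("the cat sat on the mat",
                     "the cat lay on the mat"))  # ≈ 0.4286 (3 shared of 7 bigrams)
```

In the full system such scores would be computed for each candidate sentence pair and passed to the classifier as features alongside Word Mover's Distance, WordNet-based distances, and the neural models' predictions.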
Results. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other.
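The stacking principle can be sketched as follows: per-pair similarity features and base-model scores feed a simple meta-classifier. The feature layout, the training values, and the hand-rolled logistic-regression meta-model are all illustrative assumptions, not the system's actual configuration:

```python
import math

# Hypothetical feature layout per sentence pair:
# [jaccard_ngrams, cosine_sim, siamese_score, roberta_score]
X = [
    [0.80, 0.90, 0.95, 0.97],  # paraphrase pair
    [0.75, 0.85, 0.90, 0.92],  # paraphrase pair
    [0.05, 0.10, 0.10, 0.08],  # non-paraphrase pair
    [0.10, 0.15, 0.15, 0.12],  # non-paraphrase pair
]
y = [1, 1, 0, 0]  # 1 = paraphrase, 0 = not

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train a logistic-regression meta-model by plain stochastic gradient descent
w, b, lr = [0.0] * 4, 0.0, 0.5
for _ in range(2000):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict(x):
    # Meta-classifier decision over the stacked features
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5)

print(predict([0.78, 0.88, 0.93, 0.95]))  # classified as paraphrase (1)
```

The point of stacking is that the meta-model can learn how much to trust each base signal, e.g. down-weighting N-gram overlap for pairs where the neural models disagree with it.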
Conclusions. The created model shows excellent classification results on the PAWS test data: precision – 93%, recall – 92%, F1-score – 92%, accuracy – 92%. The results of the study showed that Transformer-type neural networks can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.
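For reference, the reported metrics follow from the standard confusion-matrix definitions; the counts below are invented for illustration (they merely produce values close to those reported) and are not the paper's actual confusion matrix:

```python
def classification_metrics(tp, fp, fn, tn):
    # Standard binary-classification metrics from confusion-matrix counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Invented counts for illustration only
p, r, f1, acc = classification_metrics(tp=92, fp=7, fn=8, tn=93)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.2f}")
```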
References
Salton G., Wong A., Yang C.-S. A vector space model for automatic indexing, Communications of the ACM, 1975, Vol. 18(11), pp. 613–620. DOI: 10.1145/361219.361220
Turney P. D., Pantel P. From Frequency to Meaning: Vector Space Models of Semantics, Journal of Artificial Intelligence Research, 2010, Vol. 37(1), pp. 141–188. DOI: 10.1613/jair.2934
Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space, ArXiv, 2013. DOI: 10.48550/arXiv.1301.3781
Do Online Plagiarism Checkers Identify Paraphrased Content? DotNek Software Development, 2021, https://www.dotnek.com/Blog/Marketing/do-onlineplagiarism-checkers-identify-paraph
Miller G., Beckwith R., Fellbaum C., Gross D., Miller K. Introduction to WordNet: An On-line Lexical Database, International Journal of Lexicography, 1990, Vol. 3(4), pp. 235–244. DOI: 10.1093/ijl/3.4.235
Corley C., Mihalcea R. Measuring the Semantic Similarity of Texts, ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, Michigan, Jun. 2005, proceedings, pp. 13–18.
Leacock C., Chodorow M. Combining Local Context and WordNet Similarity for Word Sense Identification, WordNet: An Electronic Lexical Database, 1998, Vol. 49(2), pp. 265–283. DOI: 10.7551/mitpress/7287.003.0018
Lesk M. E. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, SIGDOC '86: the 5th Annual International Conference on Systems documentation. Toronto, Ontario, Canada, June 1986, proceedings, pp. 24–26. DOI: 10.1145/318723.318728
Wu Z., Palmer M. Verbs Semantics and Lexical Selection, ACL '94: the 32nd annual meeting on Association for Computational Linguistics. Las Cruces, New Mexico, June 27–30, 1994, proceedings, pp. 133–138. DOI: 10.3115/981732.981751.
Resnik P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, ArXiv, 1995. DOI: 10.48550/arXiv.cmp-lg/9511007
Lin D. An Information-Theoretic Definition of Similarity, ICML, 1998, https://www.cse.iitb.ac.in/~cs626449/Papers/WordSimilarity/3.pdf
Jiang J. J., Conrath D. W. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, 10th Research on Computational Linguistics International Conference. Taipei, Taiwan, Aug. 1997, proceedings, pp. 19–33.
Mihalcea R. Corley C., Strapparava C. Corpus-based and Knowledge-based Measures of Text Semantic Similarity, AAAI'06: Proceedings of the 21st national conference on Artificial intelligence, Boston, Massachusetts, July 2006 : proceedings, Vol. 1, pp. 775–780.
Hassan S., Mihalcea R. Semantic Relatedness Using Salient Semantic Analysis, AAAI 2011: Twenty-Fifth AAAI Conference on Artificial Intelligence, San Francisco, California, USA, August 7–11, 2011, proceedings, https://web.eecs.umich.edu/~mihalcea/papers/hassan.aaai11.pdf
Fernando S., Stevenson M. A Semantic Similarity Approach to Paraphrase Detection, 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, May 2008, proceedings, pp. 45–52.
Milajevs D., Kartsaklis D., Sadrzadeh M., Purver M. Evaluating Neural Word Representations in Tensor-Based Compositional Settings, ArXiv, 2014. DOI: 10.48550/arXiv.1408.6179
Islam A., Inkpen D. Semantic similarity of short texts, Current Issues in Linguistic Theory: Recent Advances in Natural Language Processing, 2009, Vol. 309, pp. 227–236. DOI: 10.1075/cilt.309.18isl
Chong M., Specia L., Mitkov R. Using Natural Language Processing for Automatic Detection of Plagiarism, IPC2010, 4th International Plagiarism Conference, Newcastle-upon-Tyne, May 2010, proceedings, https://www.academia.edu/326444/Using_Natural_Language_Processing_for_Automatic_Detection_of_Plagiarism
Šarić F., Glavaš G., Karan M., Šnajder J., Dalbelo Bašić B. TakeLab: Systems for Measuring Semantic Text Similarity, *SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Volume 1, Proceedings of the main conference and the shared task, and Volume 2, Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). Montréal, Canada, May 2012, proceedings, pp. 441–448.
Agirre E., Cer D., Diab M., Gonzalez-Agirre A. SemEval2012 Task 6: A Pilot on Semantic Textual Similarity,*SEM 2012: The First Joint Conference on Lexical and Computational Semantics, Volume 1, Proceedings of the main conference and the shared task, and Volume 2, Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). Montréal, Canada, May 2012, proceedings, pp. 385–393.
Kong L., Lu Z., Qi H., Han Z. Detecting High Obfuscation Plagiarism: Exploring Multi-Features Fusion via Machine Learning, International Journal of u- and e- Service, Science and Technology, 2014, Vol. 7, pp. 385–396. DOI: 10.14257/ijunnesst.2014.7.4.35
Yin W., Schütze H. Convolutional Neural Network for Paraphrase Identification, North American Chapter of the Association for Computational Linguistics: Human Language Technologies Conference. Denver, Colorado, May 2015, proceedings, pp. 901–911. DOI: 10.3115/v1/N15-1091.
Qiu L., Kan M.-Y., Chua T.-S. Paraphrase Recognition via Dissimilarity Significance Classification, Empirical Methods in Natural Language Processing Conference. Sydney, Australia, Jul. 2006, proceedings, pp. 18–26.
Kozareva Z., Montoyo A. Paraphrase Identification on the Basis of Supervised Machine Learning Techniques, Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, 2006, Vol. 4139, pp. 524–533. DOI: 10.1007/11816508_52
Finch A., Sumita E. Using machine translation evaluation techniques to determine sentence-level semantic equivalence, IWP2005, 3rd International Workshop on Paraphrasing, May 2005, proceedings, pp. 17–24.
Madnani N., Tetreault J., Chodorow M. Re-examining Machine Translation Metrics for Paraphrase Identification, North American Chapter of the Association for Computational Linguistics: Human Language Technologies Conference. Montréal, Canada, June 2012, proceedings, pp. 182–190.
Agarwal B., Ramampiaro H., Langseth H., Ruocco M. A Deep Network Model for Paraphrase Detection in Short Text Messages, ArXiv, 2017. DOI: 10.48550/arXiv.1712.02820
Socher R., Huang E. H.-C., Pennington J., Ng A., Manning C. D. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection, NIPS'11: the 24th International Conference on Neural Information Processing Systems, Granada Spain, December 2011, proceedings, pp. 801–809.
Thyagarajan A. Siamese Recurrent Architectures for Learning Sentence Similarity, The Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, February 12–17, 2016, proceedings, Vol. 30(1), pp. 2786–2792. DOI: 10.1609/aaai.v30i1.10350
Neculoiu P., Versteegh M., Rotaru M. Learning Text Similarity with Siamese Recurrent Networks, Workshop on Representation Learning for NLP. Berlin, Germany, August 2016, proceedings, pp. 148–157. DOI: 10.18653/v1/W16-1617
Ranasinghe T., Orasan C., Mitkov R. Semantic Textual Similarity with Siamese Neural Networks, RANLP 2019, International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria, Sep. 2019, proceedings, pp. 1004–1011. DOI: 10.26615/978-954-452-056-4_116
Mahmoud A., Zrigui M. BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification, Arabian Journal for Science and Engineering, 2021, Vol. 46, pp. 4163–4174. DOI: 10.1007/s13369-020-05320-w
Reddy D., Kumar M., Kp S. LSTM Based Paraphrase Identification Using Combined Word Embedding Features, Computing and Signal Processing. Advances in Intelligent Systems and Computing, 2019, Vol. 898, pp. 385–394. DOI: 10.1007/978-981-13-3393-4_40
Li Z., Jiang X., Shang L., Li H. Paraphrase Generation with Deep Reinforcement Learning, ArXiv, 2017. DOI: 10.48550/arXiv.1711.00279
Gomaa W., Fahmy A. SimAll: A flexible tool for text similarity, ESOLEC' 2017, The Seventeenth Conference on Language Engineering, December 2017, proceedings. https://www.academia.edu/35381793/SimAll_A_flexible_tool_for_text_similarity
Ahmed M., Samee M. R., Mercer R. E. Improving TreeLSTM with Tree Attention, ArXiv, 2019. DOI: 10.48550/arXiv.1901.00066
Pontes E. L., Huet S., Linhares A. C., Torres-Moreno J.-M. Predicting the Semantic Textual Similarity with Siamese CNN and LSTM, Actes de la Conférence TALN. Rennes, France, May 2018, proceedings, pp. 311–320.
Wahle J. P., Ruas T., Meuschke N., Gipp B. Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection, ArXiv, 2021. DOI: 10.48550/arXiv.2103.12450
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., Polosukhin I. Attention Is All You Need, ArXiv, 2017. DOI: 10.48550/arXiv.1706.03762
Nighojkar A., Licato J. Improving Paraphrase Detection with the Adversarial Paraphrasing Task, ArXiv, 2021. DOI: 10.48550/arXiv.2106.07691
Arase Y., Tsujii J. Transfer fine-tuning of BERT with phrasal paraphrases, ArXiv, 2019. DOI: 10.48550/arXiv.1909.00931
Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding, ArXiv, 2018. DOI: 10.48550/arXiv.1810.04805
Kusner M. J., Sun Y., Kolkin N. I., Weinberger K. Q. From Word Embeddings to Document Distances, JMLR: W&CP, 2015, Vol. 37, pp. 957–966.
Zhang Y., Baldridge J., He L. PAWS: Paraphrase Adversaries from Word Scrambling, ArXiv, 2019. DOI: 10.48550/arXiv.1904.01130
Oliinyk V.-A., Vysotska V., Burov Y., Mykich K., Fernandes V. B. Propaganda Detection in Text Data Based on NLP and Machine Learning, Modern Machine Learning Technologies and Data Science (MoMLeT+DS 2020), Workshop, Lviv-Shatsk, 2–3 June 2020, CEUR workshop proceedings. Aachen, CEUR-WS.org, 2020, Vol. 2631, pp. 132–144.
Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., Stoyanov V. RoBERTa: A Robustly Optimized BERT Pretraining Approach, ArXiv, 2019. DOI: 10.48550/arXiv.1907.11692
License
Copyright (c) 2022 N. M. Kholodna, V. A. Vysotska
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with the Creative Commons license CC BY-SA.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.