TECHNOLOGY FOR GRAMMATICAL ERRORS CORRECTION IN UKRAINIAN TEXT CONTENT BASED ON MACHINE LEARNING METHODS

Authors

  • N. Kholodna, Lviv Polytechnic National University, Lviv, Ukraine
  • V. Vysotska, Lviv Polytechnic National University, Lviv, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2023-1-12

Keywords:

NLP, text pre-processing, error correction, grammatical error correction, machine learning, deep learning, text analysis, text classification, neural network

Abstract

Context. Most research on grammatical and stylistic error correction focuses on English-language text. Thanks to the availability of large data sets, the accuracy of English grammar correction has increased significantly, but there are few studies on other languages. Systems for English are under constant development and now actively use machine learning methods: classification (sequence tagging) and machine translation. Building a high-quality machine learning model for correcting grammatical and stylistic errors in morphologically complex languages requires a large amount of parallel or manually labelled data. Manual annotation demands considerable effort from professional linguists, which makes creating text corpora for morphologically rich languages, Ukrainian in particular, a time- and resource-consuming process.

Objective. The study aims to develop a technology for correcting errors in Ukrainian-language texts based on machine learning methods using a small set of annotated parallel data.

Method. Machine learning algorithms were selected for a system that corrects errors in Ukrainian-language texts using an optimal pipeline, which includes pre-processing and selecting text content and generating features from small annotated data corpora. Using a neural network with a new architecture, reviewing state-of-the-art methods, and comparing different pipeline stages make it possible to determine the combination of stages that yields a high-quality error correction model for Ukrainian-language texts.
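As an illustration of the pre-processing and selection stage, the following is a minimal sketch and not the authors' pipeline. It assumes the annotated corpus is available as (source, target) sentence pairs; the function names, the apostrophe unification step, and the `max_tokens` limit are hypothetical choices made for this example.

```python
import unicodedata


def normalize(sentence):
    """Apply NFC normalization, unify apostrophe variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    # Assumption: map typographic apostrophes to the ASCII apostrophe.
    text = text.replace("\u2019", "'").replace("\u02bc", "'")
    return " ".join(text.split())


def select_pairs(pairs, max_tokens=64):
    """Keep normalized (source, target) pairs that are non-empty and short enough."""
    selected = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # drop pairs with a missing side
        if max(len(source.split()), len(target.split())) > max_tokens:
            continue  # drop pairs too long for the (assumed) model input limit
        selected.append((source, target))
    return selected


if __name__ == "__main__":
    raw = [("  Це   тест ", "Це тест"), ("", "Порожньо")]
    print(select_pairs(raw))
```

Pairs whose source and target are identical are kept on purpose in this sketch, since error-free sentences can also be useful training examples.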

Results. A machine learning model for error correction in Ukrainian-language texts has been developed. A universal scheme for creating an error correction system for different languages is proposed. According to the results, the neural network can correct simple sentences written in Ukrainian. However, creating a full-fledged system will require spell-checking with dictionaries and rule-based checks, both simple ones and ones based on dependency parsing or other features. The pre-trained neural translation model mT5 performs best among the three models compared. To save computing resources, a pre-trained BERT-type neural network can also be used as both the encoder and the decoder. Such a network has half as many parameters as the other pre-trained machine translation models and shows satisfactory results in correcting grammatical and stylistic errors.
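The dictionary-based spell-checking step mentioned above could look roughly like the sketch below. It is an assumption-laden illustration, not part of the described system: `LEXICON` is a hypothetical toy word list standing in for a full Ukrainian dictionary, and `suggest` and `max_distance` are names invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest(word, lexicon, max_distance=1):
    """Return the word if known, else the closest entry within max_distance, else None."""
    if word in lexicon:
        return word
    if not lexicon:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda entry: (edit_distance(word, entry), entry))
    return best if edit_distance(word, best) <= max_distance else None


# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = {"мова", "мови", "текст", "тексту", "помилка", "помилки"}

if __name__ == "__main__":
    for token in ["текст", "текзт", "яблуко"]:
        print(token, "->", suggest(token, LEXICON))
```

A lookup like this only covers non-word spelling errors; agreement and other grammatical errors would still need the rule-based checks or the neural model.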

Conclusions. The created model shows excellent classification results on test data. The calculated machine translation quality metrics allow only a partial comparison of the models, since most words and phrases in the original and corrected sentences are the same. mT5 obtains the best values for both BLEU (0.908) and METEOR (0.956), which is consistent with the case study, where this network produced the most accurate corrections without changing the original meaning of the sentence. M2M100 has a higher BLEU score (0.847) than the “Ukrainian Roberta” Encoder-Decoder (0.697) and a higher METEOR score (0.925 versus 0.876). However, on subjective evaluation of the corrected examples, M2M100 performs much worse than the other two models.
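Why overlap-based metrics allow only a partial comparison can be seen in a small worked example. The sketch below is a simplified, unsmoothed sentence-level BLEU with a single reference, written for illustration only; it is not the evaluation code used in the study, and the Ukrainian sentences are invented examples.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "Я люблю читати цікаві книжки про історію України"
uncorrected = "Я люблю читати цікаві книжка про історію України"  # agreement error

if __name__ == "__main__":
    print(sentence_bleu(uncorrected, reference))
    print(sentence_bleu(reference, reference))
```

In this example the uncorrected sentence, copied through with its error intact, still scores about 0.5, because seven of its eight tokens match the reference. A model that changes nothing is therefore rewarded for overlap alone, which is why the scores need to be read alongside a manual inspection of the corrections.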

Author Biographies

N. Kholodna, Lviv Polytechnic National University, Lviv, Ukraine

PhD student of Information Systems and Networks Department

V. Vysotska, Lviv Polytechnic National University, Lviv, Ukraine

PhD, Associate Professor of Information Systems and Networks Department

Published

2023-02-27

How to Cite

Kholodna, N., & Vysotska, V. (2023). TECHNOLOGY FOR GRAMMATICAL ERRORS CORRECTION IN UKRAINIAN TEXT CONTENT BASED ON MACHINE LEARNING METHODS. Radio Electronics, Computer Science, Control, (1), 114. https://doi.org/10.15588/1607-3274-2023-1-12

Section

Progressive information technologies