TECHNOLOGY FOR GRAMMATICAL ERRORS CORRECTION IN UKRAINIAN TEXT CONTENT BASED ON MACHINE LEARNING METHODS
DOI:
https://doi.org/10.15588/1607-3274-2023-1-12
Keywords:
NLP, text pre-processing, error correction, grammatical error correction, machine learning, deep learning, text analysis, text classification, neural network
Abstract
Context. Most research on grammatical and stylistic error correction focuses on English-language textual content. Thanks to the availability of large data sets, a significant increase in the accuracy of English grammar correction has been achieved. Unfortunately, there are few studies on other languages. Systems for the English language are constantly developing and currently make active use of machine learning methods: classification (sequence tagging) and machine translation. A large amount of parallel or manually labelled data is required to build a high-quality machine learning model for correcting grammatical and stylistic errors in the texts of morphologically complex languages. Manual data annotation requires a lot of effort from professional linguists, which makes the creation of text corpora, especially for morphologically rich languages such as Ukrainian, a time- and resource-consuming process.
Objective. The objective of the study is to develop a technology for correcting errors in Ukrainian-language texts based on machine learning methods, using a small set of annotated parallel data.
Method. For this study, machine learning algorithms were selected for building a system that corrects errors in Ukrainian-language texts, using an optimal pipeline that includes pre-processing and selection of text content and feature generation on small annotated data corpora. The use of a neural network with a new architecture, a review of state-of-the-art methods, and a comparison of different pipeline stages make it possible to determine the combination of stages that yields a high-quality error correction model for Ukrainian-language texts.
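As a minimal sketch of the translation-style formulation described above, the following example shows how a multilingual seq2seq model can be queried to rewrite an erroneous Ukrainian sentence. The checkpoint name, generation settings, and example sentence are assumptions made for illustration, not the exact configuration of the study, and in practice the model is first fine-tuned on a small parallel corpus of (erroneous, corrected) Ukrainian sentence pairs.

```python
# Minimal sketch: error correction framed as monolingual "translation"
# (erroneous -> corrected) with a multilingual seq2seq model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-base"  # assumed base checkpoint before fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct(sentence: str) -> str:
    """Return the model's corrected version of a Ukrainian sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Toy input with an agreement error ("любити" instead of "люблю").
print(correct("Я дуже любити читати книжки."))
```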
Results. A machine learning model for error correction in Ukrainian-language texts has been developed. A universal scheme for creating an error correction system for different languages is proposed. According to the results, the neural network can correct simple sentences written in Ukrainian. However, creating a full-fledged system will additionally require spell checking with dictionaries and rule checking, using both simple rules and rules based on dependency parsing or other features. The pre-trained neural translation model mT5 shows the best performance among the three models. To save computing resources, it is also possible to use a pre-trained BERT-type neural network as both the encoder and the decoder. Such a neural network has half as many parameters as the other pre-trained machine translation models and shows satisfactory results in correcting grammatical and stylistic errors.
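A minimal sketch of the warm-starting idea mentioned above, assuming the publicly released Ukrainian RoBERTa checkpoint and the Hugging Face EncoderDecoderModel API; the checkpoint id and the special-token wiring are assumptions for illustration, not necessarily the exact setup used in the study.

```python
# Minimal sketch: a BERT-type checkpoint reused as both encoder and
# decoder of a seq2seq correction model (warm-starting).
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "youscan/ukr-roberta-base"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

# Generation needs explicit start / pad / end token ids.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# The resulting model is then fine-tuned on (erroneous, corrected)
# sentence pairs in the same way as any other sequence-to-sequence model.
```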
Conclusions. The created model shows excellent classification results on test data. The calculated machine translation quality metrics allow only a partial comparison of the models, since most of the words and phrases in the original and corrected sentences are the same. The best values of both BLEU (0.908) and METEOR (0.956) are obtained for mT5, which is consistent with the case study, in which the most accurate corrections that preserve the original meaning of the sentence are produced by this neural network. M2M100 has a higher BLEU score (0.847) than the “Ukrainian Roberta” Encoder-Decoder (0.697). However, when the corrections of example sentences are evaluated subjectively, M2M100 performs much worse than the other two models. For METEOR, M2M100 (0.925) also scores higher than the “Ukrainian Roberta” Encoder-Decoder (0.876).
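For reference, sentence-level BLEU and METEOR can be computed, for example, with NLTK. The sketch below is illustrative only; the whitespace tokenisation and smoothing settings shown are assumptions and not necessarily those behind the figures quoted above.

```python
# Illustrative sketch: sentence-level BLEU and METEOR with NLTK.
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # METEOR needs WordNet resources
nltk.download("omw-1.4", quiet=True)

reference = "Я дуже люблю читати книжки .".split()    # gold correction
hypothesis = "Я дуже люблю читати книжки .".split()   # model output

bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], hypothesis)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```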
References
Naghshnejad M., Joshi T., Nair V. N. Recent Trends in the Use of Deep Learning Models for Grammar Error Handling, ArXiv, 2020. DOI: 10.48550/arXiv.2009.02358
Leacock C., Chodorow M., Gamon M., Tetreault J. Automated Grammatical Error Detection for Language Learners, Second Edition, Synthesis Lectures on Human Language Technologies, 2014. Berlin, Springer, 154 p. DOI: 10.1007/978-3-031-02153-4
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser L., Polosukhin I. Attention Is All You Need, ArXiv, 2017. DOI: 10.48550/arXiv.1706.03762
Wolf T., Debut L., Sanh V., Chaumond J., Delangue C., Moi A., Cistac P., Rault T., Louf R., Funtowicz M., Davison J., Shleifer S., von Platen P., Ma C., Jernite Y., Plu J., Xu C., Scao T. L., Gugger S., Drame M., Lhoest Q., Rush A. Transformers: State-of-the-Art Natural Language Processing, EMNLP 2020: Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, ACL Anthology, Oct. 2020 : proceedings, pp. 38–45. DOI: 10.18653/v1/2020.emnlp-demos.6
Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Minneapolis, Minnesota, ACL Anthology, June 2019 : proceedings, pp. 4171–4186. DOI: 10.18653/v1/N19-1423
Rozovskaya A., Roth D. Grammar Error Correction in Morphologically Rich Languages: The Case of Russian, Transactions of the Association for Computational Linguistics, 2019, Vol. 7, pp. 1–17. DOI: 10.1162/tacl_a_00251
Bick E. DanProof: Pedagogical Spell and Grammar Checking for Danish, International Conference Recent Advances in Natural Language Processing. Hissar, Bulgaria, Sep. 2015, proceedings, pp. 55–62.
Gakis P., Panagiotakopoulos C. T., Sgarbas K. N., Tsalidis C., Verykios V. S. Design and construction of the Greek grammar checker, Digital Scholarship in the Humanities, 2017, Vol. 32, pp. 554–576. DOI: 10.1093/llc/fqw025
Deksne D. A New Phase in the Development of a Grammar Checker for Latvian. Human Language Technologies, The Baltic Perspective, IOS Press, 2016, pp. 147–152. DOI: 10.3233/978-1-61499-701-6-147
Sorokin A. Spelling Correction for Morphologically Rich Language: a Case Study of Russian, 6th Workshop on Balto-Slavic Natural Language Processing, Valencia, Spain, Apr. 2017, proceedings, pp. 45–53. DOI: 10.18653/v1/W17-1408.
Gill M. S., Lehal G. S. A Grammar Checking System for Punjabi, Coling 2008: Companion volume: Demonstrations, Manchester, UK, Aug. 2008 : proceedings, pp. 149–152.
Go M. P., Borra A. Developing an Unsupervised Grammar Checker for Filipino Using Hybrid N-grams as Grammar Rules, PACLIC: 30th Pacific Asia Conference on Language, Information and Computation, Oral Papers. Seoul, South Korea, Oct. 2016 : proceedings, pp. 105–113.
Shaalan K. F. Arabic GramCheck: a grammar checker for Arabic, Software: Practice and Experience, 2005, Vol. 35(7), pp. 643–665. DOI: 10.1002/spe.653
Wang Y., Wang Y., Liu J., Liu Z. A Comprehensive Survey of Grammar Error Correction, ArXiv, 2020. DOI: 10.48550/arXiv.2005.06600
Syvokon O., Nahorna O. UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language, ArXiv, 2021. DOI: 10.48550/arXiv.2103.16997
Lardinois F. Grammarly goes beyond grammar, TechCrunch, 2019. Access mode: https://techcrunch.com/2019/07/16/grammarly-goes-beyond-grammar/.
Lardinois F. Grammarly gets a tone detector to keep you out of email trouble, TechCrunch, 2019. Access mode: https://techcrunch.com/2019/09/24/grammarly-gets-a-tone-detector-to-keep-you-out-of-email-trouble/.
Grammarly Inc. About Us / Grammarly Inc. Access mode: https://www.grammarly.com/about.
Grammarly Inc. Does Grammarly support languages other than English? / Grammarly Inc. Access mode: https://support.grammarly.com/hc/en-us/articles/115000090971-Does-Grammarly-support-languages-other-than-English-.
LanguageTool. Languages, LanguageTool. Access mode: https://dev.languagetool.org/languages.
LanguageTool. Error Rules for LanguageTool / LanguageTool Community. Access mode: https://community.languagetool.org/rule/list?offset=0&max=10&lang=uk&filter=&categoryFilter=&_action_list=%D0%A4%D1%96%D0%BB%D1%8C%D1%82%D1%80.
LanguageTool. About / LanguageTool. Access mode: https://languagetool.org/about.
Jayanthi S., Pruthi D., Neubig G. NeuSpell: A Neural Spelling Correction Toolkit, ArXiv, 2020. DOI: 10.48550/arXiv.2010.11085
Hunspell, Github, 2021, Access mode: https://github.com/hunspell/hunspell.
Korobov M. Morphological Analyzer and Generator for Russian and Ukrainian Languages, ArXiv, 2015. DOI: 10.48550/arXiv.1503.07283
Tmienova N., Sus B. System of Intellectual Ukrainian Language Processing, ITS: the XIX International Conference on Information Technologies and Security, Kyiv, Ukraine, Nov. 28, 2019 : proceedings, pp. 199–209.
Pogorilyy S., Kramov A. A. Method of noun phrase detection in Ukrainian texts, ArXiv, 2020. DOI: 10.48550/arXiv.2010.11548
Hlybovets A., Tochytskyi V. Algorithm of tokenization and stemming for texts in the Ukrainian language, Scientific notes of NaUKMA. Computer Science, 2017, Vol. 198, pp. 4–8.
Hao S., Hao G. A Research on Online Grammar Checker System Based on Neural Network Model, Journal of Physics, 2020, Vol. 1651, pp. 1–8. DOI: 10.1088/1742-6596/1651/1/012135
Batiuk T. M., Vysotska V. Technology for Personalities Socialization by Common Interests Based on Machine Learning Methods And SEO-Technologies, Radio Electronics, Computer Science, Control, 2022, Vol. 2 (61), pp. 53–68. DOI: 10.15588/1607-3274-2022-2-6
Ng H. T., Wu S. M., Briscoe T., Hadiwinoto C., Susanto R. H., Bryant C. The CoNLL-2014 Shared Task on Grammatical Error Correction, Conference on Computational Natural Language Learning: Shared Task. Baltimore, Maryland, Jun. 2014, proceedings, pp. 1–14. DOI: 10.3115/v1/W14-1701
Chollampatt S., Ng H. T. A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction, ArXiv, 2018. DOI: 10.48550/arXiv.1801.08831
Omelianchuk K., Atrasevych V., Chernodub A. N., Skurzhanskyi O. GECToR – Grammatical Error Correction: Tag, Not Rewrite, ArXiv, 2020. DOI: 10.48550/arXiv.2005.12592
Rozovskaya A., Chang K.-W., Sammons M., Roth D. The University of Illinois System in the CoNLL-2013 Shared Task, Conference on Computational Natural Language Learning: Shared Task. Sofia, Bulgaria, Aug. 2013, proceedings, pp. 13–19.
Ng H. T., Wu S. M., Wu Y., Hadiwinoto C., Tetreault J. The CoNLL-2013 Shared Task on Grammatical Error Correction, Conference on Computational Natural Language Learning: Shared Task. Sofia, Bulgaria, Aug. 2013, proceedings, pp. 1–12.
Rothe S., Mallinson J., Malmi E., Krause S., Severyn A. A Simple Recipe for Multilingual Grammatical Error Correction, ArXiv, 2021. DOI: 10.48550/arXiv.2106.03830
Mita M., Yanaka H. Do Grammatical Error Correction Models Realize Grammatical Generalization?, ArXiv, 2021. DOI: 10.48550/arXiv.2106.03031
Sun X., Ge T., Ma S., Li J., Wei F., and Wang H. A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model, ArXiv, 2022. DOI: 10.48550/arXiv.2201.10707
Yasunaga M., Leskovec J., and Liang P. LM-Critic: Language Models for Unsupervised Grammatical Error Correction, ArXiv, 2021. DOI: 10.48550/arXiv.2109.06822
Choe Y. J., Ham J., Park K., Yoon Y. A Neural Grammatical Error Correction System Built on Better Pretraining and Sequential Transfer Learning, Workshop on Innovative Use of NLP for Building Educational Applications, Florence. Italy, Aug. 2019 : proceedings, pp. 213–227. DOI: 10.18653/v1/W19-4423
Wan Z., Wan X. A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction, ArXiv, 2021. DOI: 10.48550/arXiv.2111.03294
Xie Z., Avati A., Arivazhagan N., Jurafsky D., Ng A. Neural Language Correction with Character-Based Attention, ArXiv, 2016. DOI: 10.48550/arXiv.1603.09727
Parnow K., Li Z., Zhao H. Grammatical Error Correction as GAN-like Sequence Labeling, ArXiv, 2021. DOI: 10.48550/arXiv.2105.14209
Wang X., Zhong W. Research and Implementation of English Grammar Check and Error Correction Based on Deep Learning, Scientific Programming, 2022, Vol. 2022, Article ID 4082082. DOI: 10.1155/2022/4082082
Náplava J., Straka M. Grammatical Error Correction in Low-Resource Scenarios, W-NUT: 5th Workshop on Noisy User-generated Text. Hong Kong, China, Nov. 2019 : proceedings, pp. 346–356. DOI: 10.18653/v1/D19-5545.
Zhou W., Ge T., Mu C., Xu K., Wei F., Zhou M. Improving Grammatical Error Correction with Machine Translation Pairs, ArXiv, 2020. DOI: 10.48550/arXiv.1911.02825
Raheja V., Alikaniotis D. Adversarial Grammatical Error Correction, EMNLP: Findings of the Association for Computational Linguistics, Online, Nov. 2020 : proceedings, pp. 3075–3087. DOI: 10.18653/v1/2020.findings-emnlp.275
Radchenko V. Ukrainian Roberta. Access mode: https://github.com/youscan/language-models.
Xue L., Constant N., Roberts A., Kale M., Al-Rfou R., Siddhant A., Barua A., Raffel C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, Jun. 2021 : proceedings, pp. 483–498. DOI: 10.18653/v1/2021.naacl-main.41.
Fan A., Bhosale S., Schwenk H., Ma Z., El-Kishky A., Goyal S., Baines M., Celebi O., Wenzek G., Chaudhary V., Goyal N., Birch T., Liptchinsky V., Edunov S., Grave E., Auli M., Joulin A. Beyond English-Centric Multilingual Machine Translation, ArXiv, 2020. DOI: 10.48550/arXiv.2010.11125
Tang Y., Tran C., Li Xian, Chen P.-J., Goyal N., Chaudhary V., Gu J., Fan A. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning, ArXiv, 2020. DOI: 10.48550/arXiv.2008.00401
von Platen P. Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models. Access mode: https://huggingface.co/blog/warm-starting-encoder-decoder.
Rothe S., Narayan S., Severyn A. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks, ArXiv, 2019. DOI: 10.48550/arXiv.1907.12461
Napoles C., Sakaguchi K., Post M., Tetreault J. Ground truth for grammatical error correction metrics, 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Beijing, China, July 2015 : proceedings, pp. 588–593. DOI: 10.3115/v1/P15-2097
Papineni K., Roukos S., Ward T., and Zhu W.-J. Bleu: a Method for Automatic Evaluation of Machine Translation, 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, July 2002 : proceedings, pp. 311–318. DOI: 10.3115/1073083.1073135.
Banerjee S., Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Michigan, Jun. 2005 : proceedings, pp. 65–72.
von Platen P. Encoder-Decoder models don’t need costly pre-training to yield state-of-the-art results on seq2seq tasks. Access mode: https://twitter.com/patrickplaten/status/1325844244095971328.
License
Copyright (c) 2023 N. M. Kholodna, V. A. Vysotska
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with a Creative Commons CC BY-SA license.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.