TECHNOLOGY FOR AUTOMATED CONSTRUCTION OF DOMAIN DICTIONARIES WITH SPECIAL PROCESSING OF SHORT DOCUMENTS

O. B. Kungurtsev; I. I. Mileiko; N. O. Novikova

doi:10.15588/1607-3274-2023-4-14

Authors

O. B. Kungurtsev, Odessa Polytechnic National University, Odessa, Ukraine
I. I. Mileiko, Odessa Polytechnic National University, Odessa, Ukraine
N. O. Novikova, Odessa National Maritime University, Odessa, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2023-4-14

Keywords:

domain dictionary, information system, term, clustering, information technology, short document

Abstract

Context. The paper addresses the task of automating the construction of domain dictionaries during the implementation of software projects, based on the analysis of documents and taking into account their size and form of presentation.

Objective. The goal of the work is to improve the quality of the domain dictionary by applying a new technology that includes special processing of short documents.

Method. A model of a short document is proposed that represents it as three parts: a header, the content, and a closing part. The header and closing parts usually contain information unrelated to the subject area, so a method is proposed for extracting the content using a set of keywords. A short document (its content) is too small to determine the frequency characteristics of words and, therefore, to identify multi-word terms, whose share reaches 50% of all terms. To make term identification possible in short documents, a clustering method is proposed that is based on selecting nouns and calculating their frequency characteristics. The resulting clusters are treated as ordinary documents, since their size is sufficient for identifying multi-word terms. To extract terms, word sequences that contain nouns are selected from the text, and analysis of how often these sequences recur identifies the multi-word terms. Interpretations of terms are obtained with a previously developed method of automated search for interpretations in dictionaries.
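
As a rough illustration of two of the steps described in the Method paragraph, the following Python sketch extracts the content part of a short document with marker keywords and then counts recurring word sequences that contain a noun. It is not the authors' implementation: the keyword sets, the function names, the noun list passed in by the caller (a stand-in for a real part-of-speech tagger) and the thresholds are all assumptions made for the example, and the clustering step is omitted.

```python
import re
from collections import Counter

# Hypothetical marker keywords; the abstract does not list the real ones.
HEADER_KEYWORDS = {"subject"}
CLOSING_KEYWORDS = {"sincerely", "regards"}


def extract_content(lines, header_kw=HEADER_KEYWORDS, closing_kw=CLOSING_KEYWORDS):
    """Return the lines between the header part and the closing part."""
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        words = set(re.findall(r"[a-z]+", line.lower()))
        if words & header_kw:
            start = i + 1          # content starts after the last header marker
        elif words & closing_kw:
            end = i                # closing part starts here
            break
    return lines[start:end]


def candidate_terms(text, nouns, max_len=3):
    """Count word sequences of 2..max_len words that contain at least one noun."""
    counts = Counter()
    for sentence in re.split(r"[.!?\n]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                seq = words[i:i + n]
                if any(w in nouns for w in seq):
                    counts[" ".join(seq)] += 1
    return counts


def frequent_terms(counts, min_count=2):
    """Keep only the sequences repeated at least min_count times."""
    return {term: c for term, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    document = [
        "To: Head of department",
        "Subject: request",
        "Please update the purchase order form.",
        "The purchase order is attached.",
        "Sincerely,",
        "A. Clerk",
    ]
    content = extract_content(document)
    counts = candidate_terms("\n".join(content), nouns={"order", "form"})
    print(frequent_terms(counts))
```

In this toy input only sequences that recur, such as "purchase order", pass the frequency threshold. A single short document rarely provides such repetition, which is why the Method paragraph clusters short documents first and extracts terms from the clusters. A practical version would also filter out sequences that begin with function words.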

Results. Based on the proposed model and methods, software for building a domain dictionary was created, and a number of experiments were conducted to confirm the effectiveness of the developed solutions.

Conclusions. The experiments confirmed that the proposed software is operational and allow it to be recommended for practical use in creating domain dictionaries for various information systems. Further research may include building corporate search systems based on term dictionaries and document clustering.

Author Biographies

O. B. Kungurtsev, Odessa Polytechnic National University, Odessa, Ukraine

PhD, Professor, Professor of the Software Engineering Department

I. I. Mileiko, Odessa Polytechnic National University, Odessa, Ukraine

Student of the Software Engineering Department

N. O. Novikova, Odessa National Maritime University, Odessa, Ukraine

PhD, Associate Professor of the Department of Technical Cybernetics and Information Technologies named after professor R.V. Merct
