MULTITOPIC TEXT CLUSTERING AND CLUSTER LABELING USING CONTEXTUALIZED WORD EMBEDDINGS
Keywords: NLP, word embedding, text clustering, cluster labeling, BERT, keyword extraction, semantic similarity.
Context. In the current information era, there is a growing need to analyze large volumes of unlabeled textual data and to group texts by their semantic similarity. This calls for robust text analysis algorithms, namely clustering and the extraction of key data from texts. Despite recent progress in natural language processing, new neural methods lack interpretability when used for unsupervised tasks, whereas traditional distributional semantics and word-counting techniques tend to disregard contextual information.
Objective. The objective of the study is to develop interpretable text clustering and cluster labeling methods that respect semantic similarity and require no additional training on the user’s dataset.
Method. To approach the task of text clustering, we incorporate deep contextualized word embeddings and analyze their evolution through the layers of pretrained transformer models. Given word embeddings, we look for similar tokens across the whole corpus and form topics that are present in multiple sentences. We then merge topics so that sentences sharing many topics are assigned to one cluster. Since one sentence can contain several topics, it can belong to more than one cluster simultaneously. Similarly, to generate labels for an existing cluster, we use token embeddings to rank tokens by how descriptive they are of the cluster. To do so, we propose a novel metric – the token rank measure – and evaluate two other metrics.
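As a rough illustration of the topic-formation step described above (a minimal sketch, not the authors' implementation), the following code groups tokens from different sentences whose contextualized embeddings are close in cosine similarity. The embeddings here are hand-crafted toy vectors; in the paper they would come from an intermediate layer of a pretrained transformer such as BERT.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy contextualized embeddings: (sentence_id, token) -> vector.
# Hand-crafted 3-d vectors for illustration only.
embeddings = {
    (0, "bank"):  np.array([0.90, 0.10, 0.00]),  # financial sense
    (1, "loan"):  np.array([0.85, 0.20, 0.00]),
    (2, "river"): np.array([0.00, 0.10, 0.95]),
    (3, "shore"): np.array([0.05, 0.00, 0.90]),
}

def form_topics(embeddings, threshold=0.8):
    """Pair tokens from different sentences whose embeddings are similar;
    each resulting set of (sentence, token) pairs is one candidate topic."""
    items = list(embeddings.items())
    topics = []
    for i, ((s1, t1), v1) in enumerate(items):
        for (s2, t2), v2 in items[i + 1:]:
            if s1 != s2 and cosine(v1, v2) >= threshold:
                topics.append({(s1, t1), (s2, t2)})
    return topics

topics = form_topics(embeddings)
# "bank"/"loan" form one topic, "river"/"shore" another.
```

The threshold value and the pairwise grouping are assumptions made for this sketch; the actual method operates over a full corpus and merges such pairs further.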
Results. A new unsupervised text clustering approach was described and implemented. It is capable of assigning a text to several clusters based on its semantic similarity to the other texts in each group. A keyword extraction approach was developed and applied to both the text clustering and the cluster labeling tasks. The obtained clusters are annotated and can be interpreted through the terms that formed them.
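A minimal sketch of how overlapping clusters could arise from merging topics, as the method describes: topics whose sentence sets overlap strongly are merged, while a sentence carrying several distinct topics ends up in more than one cluster. The overlap threshold and the greedy merge loop here are illustrative assumptions, not the paper's exact procedure.

```python
# Topics represented as sets of sentence ids (hypothetical output
# of the topic-formation step).
topics = [
    {0, 1, 2},   # topic A occurs in sentences 0, 1, 2
    {1, 2},      # topic B
    {3, 4},      # topic C
    {0, 4},      # topic D
]

def merge_topics(topics, overlap=2):
    """Greedily merge topics whose sentence sets share
    at least `overlap` sentences; the rest stay separate."""
    clusters = [set(t) for t in topics]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i] & clusters[j]) >= overlap:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

clusters = merge_topics(topics)
# Topics A and B merge into {0, 1, 2}; C and D stay separate,
# so sentences 0 and 4 each belong to two clusters.
```

This reproduces the key property of the approach: cluster membership is not exclusive, and each cluster remains interpretable through the topics that formed it.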
Conclusions. Evaluation on different datasets demonstrated the applicability, relevance, and interpretability of the obtained results. The advantages of the proposed methods and possible improvements to them were described. Recommendations for using the methods were provided, as well as possible modifications.
Copyright (c) 2020 Z. V. Ostapiuk, T. O. Korotyeyeva
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.