DOI: https://doi.org/10.15588/1607-3274-2020-4-10

MULTITOPIC TEXT CLUSTERING AND CLUSTER LABELING USING CONTEXTUALIZED WORD EMBEDDINGS

Z. V. Ostapiuk, T. O. Korotyeyeva

Abstract


Context. In the current information era, the problem of analyzing large volumes of unlabeled textual data and its further grouping with respect to the semantic similarity between texts is emerging. This raises the need for robust text analysis algorithms, namely, clustering and extraction of key data from texts. Despite recent progress in the field of natural language processing, new neural methods lack interpretability when used for unsupervised tasks, whereas traditional distributed semantics and word counting techniques tend to disregard contextual information.

Objective. The objective of the study is to develop interpretable text clustering and cluster labeling methods based on semantic similarity that require no additional training on the user's dataset.

Method. To approach the task of text clustering, we incorporate deep contextualized word embeddings and analyze their evolution through the layers of pretrained transformer models. Given the word embeddings, we look for similar tokens across the whole corpus and form topics that are present in multiple sentences. We then merge topics so that sentences sharing many topics are assigned to one cluster. Since a single sentence can contain several topics, it can be present in more than one cluster simultaneously. Similarly, to generate labels for an existing cluster, we use token embeddings to order tokens by how descriptive they are of the cluster. To do so, we propose a novel metric, the token rank measure, and evaluate two other metrics.
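The general scheme described above can be sketched in a few lines. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: tokens from different sentences whose contextual embeddings (e.g., taken from a layer of a pretrained BERT model) have high cosine similarity form a candidate topic, and topics with overlapping sentence sets are merged into clusters. The similarity threshold, the helper names, and the toy vectors are all hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_topics(token_embs, threshold=0.9):
    """Group tokens from *different* sentences whose contextual embeddings
    are similar; each cross-sentence group is a candidate topic.

    token_embs: list of (sentence_id, token, vector) triples.
    Returns a list of topics, each a set of sentence ids.
    """
    topics = []
    for i, (sid_i, _tok_i, v_i) in enumerate(token_embs):
        members = {sid_i}
        for sid_j, _tok_j, v_j in token_embs[i + 1:]:
            if sid_j != sid_i and cosine(v_i, v_j) >= threshold:
                members.add(sid_j)
        if len(members) > 1:  # a topic must span several sentences
            topics.append(members)
    return topics

def merge_into_clusters(topics, min_shared=1):
    """Merge topics whose sentence sets overlap. A sentence may end up in
    several clusters, mirroring the multitopic property of the method."""
    clusters = []
    for topic in topics:
        for cluster in clusters:
            if len(cluster & topic) >= min_shared:
                cluster |= topic
                break
        else:
            clusters.append(set(topic))
    return clusters
```

With toy 4-dimensional "embeddings" where the token "bank" in sentences 0 and 1 points in nearly the same direction, `find_topics` yields a single topic spanning those two sentences, which `merge_into_clusters` keeps as one cluster. A real pipeline would obtain the vectors from a transformer's hidden states instead.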

Results. A new unsupervised text clustering approach was described and implemented. It is capable of assigning a text to several clusters based on its semantic similarity to the other texts in each group. A keyword extraction approach was developed and applied to both the text clustering and cluster labeling tasks. The obtained clusters are annotated and can be interpreted through the terms that formed them.

Conclusions. Evaluation on different datasets demonstrated the applicability, relevance, and interpretability of the obtained results. The advantages of the proposed methods were described, and recommendations for their use were provided, along with possible improvements and modifications.


Keywords


NLP, word embedding, text clustering, cluster labeling, BERT, keyword extraction, semantic similarity.


References


Gareiss R., There’s Nothing Artificial About It, Nemertes Research, Mokena, IL, Quarterly Rep. DN7575, 2019.

Zhang Y., Jin R. and Zhou Z. Understanding bag-of-words model: a statistical framework, International Journal of Machine Learning and Cybernetics, 2010, Vol. 1, No. 1–4, pp. 43–52. DOI: 10.1007/s13042-010-0001-0.

Zhao R. and Mao K. Fuzzy Bag-of-Words Model for Document Representation, IEEE Transactions on Fuzzy Systems, 2018, Vol. 26, No. 2, pp. 794–804. DOI: 10.1109/TFUZZ.2017.2690222.

Wartena C., Brussee R. and Slakhorst W. Keyword Extraction Using Word Co-occurrence, in Proc. of 21st International Conference on Database and Expert Systems Applications (DEXA), 2010. DOI: 10.1109/dexa.2010.32.

Mikolov T., Chen K., Corrado G. and Dean J. Efficient Estimation of Word Representations in Vector Space, arXiv: 1301.3781 [cs.CL], Sep. 2013.

Mikolov T., Sutskever I., Chen K., Corrado G. and Dean J. Distributed Representations of Words and Phrases and their Compositionality, Advances in Neural Information Processing Systems, 2013, Vol. 26.

vor der Brück T. and Pouly M. Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing, in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, USA, 2019, Vol. 1, pp. 1827–1836. DOI: 10.18653/v1/n19-1181.

Arora S., Liang Y. and Ma T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings, in Proc. of 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.

Le Q. and Mikolov T. Distributed Representations of Sentences and Documents, in Proc. of the 31st International Conference on Machine Learning (ICML), Beijing, China, 2014, pp. 1188–1196.

Blei D., Ng A. and Jordan M. Latent Dirichlet Allocation, The Journal of Machine Learning Research, 2003, Vol. 3, No. 1, pp. 993–1022.

Tong Z. and Zhang H. A Text Mining Research Based on LDA Topic Modelling, Computer Science & Information Technology (CS & IT), 2016. DOI: 10.5121/csit.2016.60616.

Alghamdi R. and Alfalqi K. A Survey of Topic Modeling in Text Mining, International Journal of Advanced Computer Science and Applications, 2015, Vol. 6, No. 1. DOI: 10.14569/ijacsa.2015.060121.

Vaswani A. et al. Attention is all you need, in Proc. of the 31st International Conference on Neural Information Processing Systems (NIPS), Long Beach, USA, 2017, pp. 6000–6010.

Devlin J., Chang M., Lee K. and Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, USA, 2019, pp. 4171–4186. DOI: 10.18653/v1/n19-1423.

Jawahar G., Sagot B. and Seddah D. What Does BERT Learn about the Structure of Language?, in Proc. of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 2019, pp. 3651–3657. DOI: 10.18653/v1/p19-1356.

Reimers N. and Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 3982–3992. DOI: 10.18653/v1/d19-1410.

Cer D., Diab M., Agirre E., Lopez-Gazpio I. and Specia L. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation, in Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, 2017, pp. 1–14. DOI: 10.18653/v1/s17-2001.

Wang B. and Kuo C. SBERT-WK: A Sentence Embedding Method by Dissecting BERT-Based Word Models, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, Vol. 28, pp. 2146–2157. DOI: 10.1109/taslp.2020.3008390.

Rose S., Engel D., Cramer N. and Cowley W. Automatic Keyword Extraction from Individual Documents, Text Mining, 2010, pp. 1–20, DOI: 10.1002/9780470689646.ch1.

Campos R., Mangaravite V., Pasquali A., Jorge A., Nunes C. and Jatowt A. YAKE! Keyword extraction from single documents using multiple local features, Information Sciences, 2020, Vol. 509, pp. 257–289, DOI: 10.1016/j.ins.2019.09.013.

Wolf T. et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing, arXiv: 1910.03771 [cs.CL], Jul. 2020.

Bowman S., Angeli G., Potts C. and Manning C. A large annotated corpus for learning natural language inference, in Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 2015, pp. 632–642, 2015. DOI: 10.18653/v1/d15-1075.

Lewis D., Yang Y., Rose T. and Li F. RCV1: A New Benchmark Collection for Text Categorization Research, Journal of Machine Learning Research, 2004, Vol. 5, No. 5, pp. 361–397.

Wu Y. et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, arXiv: 1609.08144 [cs.CL], Oct. 2016.

Ackermann M., Blömer J., Kuntze D. and Sohler C. Analysis of Agglomerative Clustering, Algorithmica, 2012, Vol. 69, No. 1, pp. 184–215. DOI: 10.1007/s00453-012-9717-4.

Virtanen P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python, Nature Methods, 2020, Vol. 17, No. 3, pp. 261–272. DOI: 10.1038/s41592-019-0686-2.

Blondel V., Guillaume J., Lambiotte R. and Lefebvre E. Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, 2008, Vol. 2008, No. 10, p. P10008. DOI: 10.1088/1742-5468/2008/10/P10008.

Řehůřek R. and Sojka P. Software Framework for Topic Modelling with Large Corpora, in Proc. of the 7th Conference on Language Resources and Evaluation (LREC), Valletta, Malta, 2010, pp. 45–50. DOI: 10.13140/2.1.2393.1847.

Qaiser S. and Ali R. Text Mining: Use of TF-IDF to Examine the Relevance of Words to Documents, International Journal of Computer Applications, 2018, Vol. 181, No. 1, pp. 25–29. DOI: 10.5120/ijca2018917395.

Loper E. and Bird S. NLTK: The Natural Language Toolkit, in Proc. of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, Philadelphia, USA, 2002, pp. 63–70. DOI: 10.3115/1118108.1118117.









Copyright (c) 2020 Z. V. Ostapiuk, T. O. Korotyeyeva

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
