DIGITAL REPRESENTATIONS OF TELEGRAM CHANNELS
DOI:
https://doi.org/10.26906/SUNZ.2024.1.158
Keywords:
natural language text processing, BERT, thematic classification of messages, representation learning
Abstract
This article studies digital representations of textual information resources, using Telegram channels as an example. The purpose of the work is to determine the optimal method of forming digital representations of Telegram channels for subsequent thematic classification. The article addresses the following tasks: defining approaches to forming the input vector; determining the stages of text data processing for the digital representation of a Telegram channel; creating a dataset of digital representations of Telegram channels; labeling the dataset for the classification problem; and determining the hyperparameters of optimal classification models. The following results were obtained: a dataset of digital representations of Telegram channels was created on the basis of the SBERT network using three approaches: aggregation of publication vectors, concatenation of keywords obtained with the TF-IDF method, and a combination of the first two approaches; the keyword concatenation approach and the combined approach were found to be the most effective for subsequent classification by topic; and the hyperparameters that are optimal in terms of accuracy were determined for the thematic classification models, namely logistic regression and deep neural networks. A promising direction for further research is evaluating the proposed digital representations in clustering and search tasks.
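The three approaches named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_encode` is a hypothetical stand-in for an SBERT encoder, mean pooling is assumed as the aggregation of publication vectors, the keywords are assumed to be joined into one string before encoding, and the combined representation is assumed to be the two vectors placed end to end. The abstract does not specify any of these details.

```python
import hashlib
import math
from collections import Counter

DIM = 8  # toy size; real sentence embeddings have hundreds of dimensions


def toy_encode(text):
    """Hypothetical stand-in for an SBERT encoder: maps text to DIM floats."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]


def tokenize(posts):
    """Lowercase whitespace tokenization (no stop-word removal or lemmatization)."""
    return " ".join(posts).lower().split()


def aggregate_posts(posts, encode=toy_encode):
    """Approach 1: encode every post, then average the vectors (mean pooling assumed)."""
    vectors = [encode(post) for post in posts]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_keywords(posts, all_channels, k=3):
    """Rank a channel's terms by TF-IDF, treating each channel as one document."""
    term_counts = Counter(tokenize(posts))
    channel_vocabs = [set(tokenize(other)) for other in all_channels]
    n = len(channel_vocabs)
    scores = {}
    for term, count in term_counts.items():
        df = sum(1 for vocab in channel_vocabs if term in vocab)
        scores[term] = count * (math.log((1 + n) / (1 + df)) + 1)  # smoothed IDF
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]


def keyword_vector(posts, all_channels, k=3, encode=toy_encode):
    """Approach 2: join the top TF-IDF keywords into one string and encode it."""
    return encode(" ".join(top_keywords(posts, all_channels, k)))


def combined_vector(posts, all_channels, k=3, encode=toy_encode):
    """Approach 3: append the keyword vector to the aggregated vector (assumed)."""
    return aggregate_posts(posts, encode) + keyword_vector(posts, all_channels, k, encode)


# Invented sample data, for illustration only.
channels = {
    "news": ["election results announced today", "parliament votes on budget"],
    "sport": ["football match results today", "football cup final tonight"],
}
corpus = list(channels.values())
for name, posts in channels.items():
    print(name, top_keywords(posts, corpus, k=2), len(combined_vector(posts, corpus)))
```

In practice the `encode` argument would be replaced by a real sentence encoder, and the resulting channel vectors would be passed to a classifier such as logistic regression. The toy encoder only keeps the sketch runnable without downloading a model.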