Text Classification of News Using Deep Learning and Natural Language Processing Models Based on Transformers for Brazilian Portuguese
Isabel Nadine de Santana, Raphael Souza de Oliveira, Erick Giovani Sperandio Nascimento
Proceedings of the 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022, Vol. III, pp. 134-139 (2022); https://doi.org/10.54808/WMSCI2022.03.134
The 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022
Virtual Conference, July 12 - 15, 2022
Proceedings of WMSCI 2022
ISSN: 2771-0947 (Print)
ISBN (Volume III): 978-1-950492-66-4 (Print)
Abstract
This work proposes using a fine-tuned Transformer-based Natural Language Processing (NLP) model, BERTimbau, to generate word embeddings from texts published in a Brazilian newspaper, with the goal of building a robust NLP model for classifying news in Portuguese, a task that is costly for humans to perform on large amounts of data. To assess this approach, the embeddings generated by the fine-tuned BERTimbau were compared with embeddings produced by the Word2Vec technique. The first step was to regroup the news from nineteen categories into ten, using the K-means and TF-IDF techniques, to reduce class imbalance in the corpus. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In both the BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news article as a single document embedding. The results were evaluated with the metrics accuracy, weighted accuracy, precision, recall, F1-Score, AUC ROC, and AUC PRC. The fine-tuned BERTimbau captured distinctions between the texts of the different categories, and the classification model based on it outperformed the other techniques explored.
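To make the label-based evaluation metrics concrete, the following is a minimal sketch in plain Python, not the authors' evaluation code. It assumes that "weighted accuracy" means balanced accuracy (the mean of per-class recall) and that precision, recall, and F1-Score are macro-averaged; the abstract does not state either choice. The function name `evaluate` and the category labels are hypothetical, and the AUC metrics are omitted because they require predicted scores rather than labels.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Compute label-based metrics for a multi-class classifier.

    Assumptions (not stated in the abstract): "weighted accuracy" is taken
    to be balanced accuracy, and precision/recall/F1 are macro-averaged.
    """
    labels = sorted(set(y_true) | set(y_pred))
    true_labels = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count per-class true positives, false positives and false negatives.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def safe_div(a, b):
        # Avoid ZeroDivisionError for classes that are never predicted.
        return a / b if b else 0.0

    precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in labels}
    recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in labels}
    f1 = {
        c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
        for c in labels
    }

    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        # Mean recall over the classes that actually occur in y_true.
        "balanced_accuracy": safe_div(
            sum(recall[c] for c in true_labels), len(true_labels)
        ),
        "macro_precision": safe_div(sum(precision.values()), len(labels)),
        "macro_recall": safe_div(sum(recall.values()), len(labels)),
        "macro_f1": safe_div(sum(f1.values()), len(labels)),
    }


# Hypothetical toy labels, for illustration only.
y_true = ["sports", "sports", "politics", "politics", "economy", "economy"]
y_pred = ["sports", "politics", "politics", "politics", "economy", "sports"]
report = evaluate(y_true, y_pred)
```

Unlike plain accuracy, balanced accuracy gives every category the same weight regardless of how many articles it contains, which is one reason such a metric is useful on an imbalanced corpus.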