IIIS - 2022 Conferences Proceedings

2022 Summer Conferences Proceedings

Text Classification of News Using Deep Learning and Natural Language Processing Models Based on Transformers for Brazilian Portuguese
Isabel Nadine de Santana, Raphael Souza de Oliveira, Erick Giovani Sperandio Nascimento
Proceedings of the 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022, Vol. III, pp. 134-139 (2022); https://doi.org/10.54808/WMSCI2022.03.134

The 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022
Virtual Conference, July 12 - 15, 2022
Proceedings of WMSCI 2022
ISSN: 2771-0947 (Print)
ISBN (Volume III): 978-1-950492-66-4 (Print)
Authors Information
Isabel Nadine de Santana
Manufacturing and Technology Integrated Campus, SENAI CIMATEC, Salvador, Bahia, Brazil
Raphael Souza de Oliveira
Manufacturing and Technology Integrated Campus, SENAI CIMATEC, Salvador, Bahia, Brazil
Erick Giovani Sperandio Nascimento
Manufacturing and Technology Integrated Campus, SENAI CIMATEC, Salvador, Bahia, Brazil

Cite this paper as:
Santana, I. N. d., Oliveira, R. S. d., Nascimento, E. G. S. (2022). Text Classification of News Using Deep Learning and Natural Language Processing Models Based on Transformers for Brazilian Portuguese. In N. Callaos, J. Horne, B. Sánchez, M. Savoie (Eds.), Proceedings of the 26th World Multi-Conference on Systemics, Cybernetics and Informatics: WMSCI 2022, Vol. III, pp. 134-139. International Institute of Informatics and Cybernetics. https://doi.org/10.54808/WMSCI2022.03.134

DOI: 10.54808/WMSCI2022.03.134
ISBN - Volume III: 978-1-950492-66-4 (Print)
ISSN: 2771-0947 (Print)
Copyright: © International Institute of Informatics and Systemics 2022
Publisher: International Institute of Informatics and Cybernetics
Abstract
This work proposes using a fine-tuned Transformer-based Natural Language Processing (NLP) model, BERTimbau, to generate word embeddings from texts published in a Brazilian newspaper, with the goal of building a robust model for classifying news in Portuguese, a task that is costly for humans to perform on large amounts of data. To assess this approach, the embeddings generated by the fine-tuned BERTimbau were compared with embeddings produced by the Word2Vec technique. The first step was to regroup the news from nineteen categories into ten, using the K-means and TF-IDF techniques, to reduce class imbalance in the corpus. In the Word2Vec step, both the CBOW and Skip-gram architectures were applied. In both the BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news item as a single document embedding. The results were evaluated with accuracy, weighted accuracy, precision, recall, F1-score, AUC ROC, and AUC PRC. The fine-tuned BERTimbau captured distinctions between the texts of the different categories, and the classification model based on it outperformed the other techniques explored.
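The abstract does not say how K-means and TF-IDF were combined to regroup the categories. As a minimal sketch of one plausible reading, the Python below represents each category by a TF-IDF vector of its pooled tokens and clusters those vectors with a small K-means. The toy categories, the smoothed IDF formula, the fixed initial centroids, and the function names (`tfidf_vectors`, `kmeans`) are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each named token list to an L2-normalized TF-IDF vector."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many token lists each term appears.
    df = Counter(tok for toks in docs.values() for tok in set(toks))
    # Smoothed IDF (an assumed variant; the paper does not specify one).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        vec = [tf[t] / len(toks) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors[name] = [v / norm for v in vec]
    return vectors


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centroids, iters=20):
    """Plain K-means with caller-supplied initial centroids.

    Fixed initialization keeps this toy example deterministic; practical
    implementations usually use random or k-means++ initialization.
    """
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)
        if updated == centroids:
            break
        centroids = updated
    return labels


# Hypothetical pooled tokens for four original categories.
categories = {
    "soccer": ["goal", "match", "team", "league"],
    "sports": ["team", "match", "player", "league"],
    "economy": ["market", "inflation", "bank", "rates"],
    "finance": ["bank", "market", "stocks", "rates"],
}

vectors = tfidf_vectors(categories)
names = list(categories)
points = [vectors[name] for name in names]
labels = kmeans(points, init_centroids=[points[0], points[2]])

# Group the original category names by cluster label.
merged = {}
for name, label in zip(names, labels):
    merged.setdefault(label, []).append(name)
print(merged)
```

In this toy setting, categories that share vocabulary fall into the same cluster, which is the kind of grouping that could reduce nineteen imbalanced categories to ten. With a real corpus, a library implementation of both steps would be the more natural choice.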
Full Text