public active Public 15 cnae-9 2015-01-21T22:19:32Z https://www.openml.org/data/download/1586233/phpmcGu2X 1 0 22997 Class cnae-9 ARFF

**Author**: Patrick Marques Ciarelli, Elias Oliveira
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/CNAE-9) - 2010
**Please cite**:

### Description

This data set contains 1080 documents of free-text business descriptions of Brazilian companies, categorized into a subset of 9 categories.

### Source

```
Patrick Marques Ciarelli, pciarelli '@' lcad.inf.ufes.br, Department of Electrical Engineering, Federal University of Espirito Santo
Elias Oliveira, elias '@' lcad.inf.ufes.br, Department of Information Science, Federal University of Espirito Santo
```

### Data Set Information

This data set contains 1080 documents of free-text business descriptions of Brazilian companies, categorized into a subset of 9 categories cataloged in a table called the National Classification of Economic Activities (Classificação Nacional de Atividades Econômicas - CNAE). The original texts were preprocessed to obtain the current data set: first, only letters were kept and prepositions were removed from the texts. Next, the words were transformed to their canonical form. Finally, each document was represented as a vector in which the weight of each word is its frequency in the document. The data set is highly sparse (99.22% of the matrix is filled with zeros).

### Attribute Information

The data set has 857 attributes: 1 attribute with the class of the instance and 856 with word frequencies:

```
1. category: range 1 - 9 (integer)
2 - 857. word frequency: (integer)
```

### Relevant Papers

Patrick Marques Ciarelli, Elias Oliveira, 'Agglomeration and Elimination of Terms for Dimensionality Reduction', Ninth International Conference on Intelligent Systems Design and Applications, pp.547-552, 2009

Patrick Marques Ciarelli, Elias Oliveira, Evandro O. T. Salles, 'An Evolving System Based on Probabilistic Neural Network', Brazilian Symposium on Artificial Neural Network, 2010
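
### Example

The vector representation described under Data Set Information (one row per document, one column per word, raw frequency as the weight) can be illustrated with a minimal sketch. The toy documents, the function names `term_frequency_matrix` and `sparsity`, and the sorted vocabulary order are assumptions made for illustration; they are not the authors' preprocessing code, and the real data set has 856 word columns rather than the handful shown here.

```python
from collections import Counter


def term_frequency_matrix(documents):
    """Turn tokenized documents into rows of raw word counts.

    `documents` is a list of token lists. The vocabulary is built from the
    tokens themselves and sorted so that the column order is deterministic.
    """
    vocabulary = sorted({token for doc in documents for token in doc})
    rows = []
    for doc in documents:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


def sparsity(rows):
    """Return the fraction of matrix cells that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return zeros / total if total else 0.0


# Hypothetical toy documents, assumed to be already reduced to letters only,
# stripped of prepositions, and converted to canonical word forms.
toy_documents = [
    ["comercio", "varejista", "roupa"],
    ["servico", "transporte", "carga", "transporte"],
    ["comercio", "atacadista", "alimento"],
]

vocabulary, matrix = term_frequency_matrix(toy_documents)
print(vocabulary)
print(matrix)
print(f"sparsity: {sparsity(matrix):.2%}")
```

Even this small example is mostly zeros, because each document uses only a few words of the shared vocabulary. For the full data set, 1080 documents by 856 word columns gives 924,480 cells, so the stated 99.22% sparsity corresponds to roughly 7,200 nonzero entries (approximate, since the percentage is rounded).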