Dexter (Sparse_ARFF, 2015-01-08T00:18:07Z)

DEXTER is a text classification problem in a bag-of-words representation. It is a two-class classification problem with sparse continuous input variables. The dataset is one of five datasets of the NIPS 2003 feature selection challenge.

Source:

a. Original owners
The original data set we used is a subset of the well-known Reuters text categorization benchmark. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. It is hosted by the UCI KDD repository. David D. Lewis hosts valuable resources about this data. We used the “corporate acquisition” text classification class pre-processed by Thorsten Joachims <thorsten '@'>. The data is one of the examples of the software package SVM-Light.

b. Donor of database
This version of the database was prepared for the NIPS 2003 variable and feature selection benchmark by Isabelle Guyon, 955 Creston Road, Berkeley, CA 94708, USA (isabelle '@').

Data Set Information:
The original data were formatted by Thorsten Joachims in the “bag-of-words” representation. There were 9947 features (of which 2562 are zero for all the examples) representing frequencies of occurrence of word stems in text. The task is to learn which Reuters articles are about 'corporate acquisitions'. We added a number of distractor features, called 'probes', that have no predictive power. The order of the features and of the patterns was randomized.

DEXTER         -- Positive ex. -- Negative ex. -- Total
Training set   --  150         --  150         --  300
Validation set --  150         --  150         --  300
Test set       -- 1000         -- 1000         -- 2000
All            -- 1300         -- 1300         -- 2600

Number of variables/features/attributes:
Real: 9947
Probes: 10053
Total: 20000
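The probe construction described above can be illustrated with a small sketch. It is only a toy under stated assumptions: the description does not say how the probe values were generated, so the function `add_probes_and_shuffle` (a hypothetical name) simply copies a randomly chosen real feature and shuffles its values across examples, which breaks the pairing with the labels. Rows are stored as `{feature_index: value}` dictionaries to mimic the sparse bag-of-words layout, and the constants are the counts quoted in the description.

```python
import random

# Counts quoted from the dataset description.
N_REAL, N_PROBES = 9947, 10053
SPLITS = {"train": (150, 150), "valid": (150, 150), "test": (1000, 1000)}


def add_probes_and_shuffle(rows, n_real, n_probes, seed=0):
    """Append probe features to sparse rows, then permute all feature ids.

    rows: list of {feature_index: value} dicts with indices in [0, n_real).
    Assumption (not from the source): each probe copies one randomly chosen
    real feature with its values shuffled across examples.
    Returns (new_rows, is_probe); is_probe[j] is True if new column j is a probe.
    """
    rng = random.Random(seed)
    n_total = n_real + n_probes
    perm = list(range(n_total))
    rng.shuffle(perm)  # perm[old_index] gives the new, randomized index

    # Move the real features to their randomized positions.
    new_rows = [{perm[j]: v for j, v in row.items()} for row in rows]

    # Build each probe column and store only its nonzero entries.
    for p in range(n_real, n_total):
        src = rng.randrange(n_real)
        column = [row.get(src, 0) for row in rows]
        rng.shuffle(column)
        for new_row, v in zip(new_rows, column):
            if v:
                new_row[perm[p]] = v

    is_probe = [False] * n_total
    for p in range(n_real, n_total):
        is_probe[perm[p]] = True
    return new_rows, is_probe


# Toy data: 3 documents, 3 real word-stem features, 2 probes.
rows = [{0: 2, 2: 1}, {1: 3}, {0: 1, 1: 1}]
new_rows, is_probe = add_probes_and_shuffle(rows, n_real=3, n_probes=2)

total_features = N_REAL + N_PROBES
total_examples = sum(pos + neg for pos, neg in SPLITS.values())
print(total_features, total_examples)
```

The `is_probe` mask plays the role of ground truth that a feature selection method would be scored against: a good selector should keep few columns flagged as probes.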