https://www.openml.org/data/download/53921/gina_agnostic.arff
20
gina_agnostic
gina_agnostic
67908
0
label
2014-01-06T22:56:01Z
public
Public
1
ARFF
**Author**: [Isabelle Guyon](isabelle@clopinet.com)
**Source**: [Agnostic Learning vs. Prior Knowledge Challenge](http://www.agnostic.inf.ethz.ch)
**Please cite**: None
Dataset from the Agnostic Learning vs. Prior Knowledge Challenge (http://www.agnostic.inf.ethz.ch), which consisted of 5 different datasets (SYLVA, GINA, NOVA, HIVA, ADA). The purpose of the challenge was to check whether the performance of domain-specific feature engineering (prior knowledge) can be matched by algorithms trained on data without any domain-specific knowledge (agnostic). For the latter, the data were anonymised and preprocessed in a way that makes them uninterpretable.
Modified by TunedIT (converted to ARFF format)
### Topic
The task of GINA is handwritten digit recognition. This is the agnostic version of a subset of the MNIST data set. We chose the problem of separating the odd numbers from even numbers. We use two-digit numbers. Only the unit digit is informative for that task, therefore at least ½ of the features are distracters. This is a two-class classification problem with sparse continuous input variables, in which each class is composed of several clusters. It is a problem with heterogeneous classes.
### Source
The data set was constructed from the MNIST data that is made available by Yann
LeCun of the NEC Research Institute at [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/). The digits have been size-normalized and centered in a fixed-size image of dimension 28x28. Examples are shown in the [documentation in chapter 3](http://clopinet.com/isabelle/Projects/agnostic/Dataset.pdf).
### Description
To construct the “agnostic” dataset, we performed the following steps:
- We removed the pixels that were white 99% of the time. This reduced the original
feature set of 784 pixels to 485.
- The original resolution (256 gray levels) was kept.
- Although the data are rather sparse (about 30% of the values are
non-zero), we saved the data as a dense matrix because we found that it
compresses better this way (to 19 MB).
- The feature names are the (i,j) matrix coordinates of the pixels (in a 28x28
matrix).
- We created two-digit numbers by dividing the datasets into two parts and pairing the
digits at random.
- The task is to separate odd from even numbers. Because the tens digit is not
informative, the features of that digit act as distracters.
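The construction steps above can be sketched in a few lines of Python. This is a minimal illustration on synthetic toy images, not the organisers' actual preprocessing code: the function names (`kept_pixels`, `make_two_digit_examples`), the background value of 0 standing in for "white", the exact reading of the 99% rule, the ordering of the tens and unit features, and the +1/-1 label encoding are all assumptions made for the example.

```python
import random

def kept_pixels(images, background=0, threshold=0.99):
    """Return indices of pixels that are background in fewer than
    `threshold` of the images (assumed reading of the 99% rule)."""
    n = len(images)
    keep = []
    for p in range(len(images[0])):
        frac_background = sum(1 for img in images if img[p] == background) / n
        if frac_background < threshold:
            keep.append(p)
    return keep

def make_two_digit_examples(images, digits, keep, rng):
    """Split the images into two halves, pair them at random, and label
    each pair by the parity of its unit digit."""
    order = list(range(len(images)))
    rng.shuffle(order)
    half = len(order) // 2
    tens, units = order[:half], order[half:2 * half]
    X, y = [], []
    for t, u in zip(tens, units):
        # Features of the tens digit (distracters) followed by the unit digit.
        X.append([images[t][p] for p in keep] + [images[u][p] for p in keep])
        # Hypothetical encoding: +1 for odd, -1 for even.
        y.append(1 if digits[u] % 2 == 1 else -1)
    return X, y

# Toy data: 200 "images" of 16 pixels; pixel 0 is always background.
rng = random.Random(0)
digits = [rng.randrange(10) for _ in range(200)]
images = [[0] + [rng.randrange(1, 256) for _ in range(15)] for _ in range(200)]

keep = kept_pixels(images)
X, y = make_two_digit_examples(images, digits, keep, rng)
print(len(keep), len(X), len(X[0]))  # 15 100 30
```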
To construct the “prior” dataset, we went back to the original data and fetched the
“informative” digit in its original representation. This data representation therefore
consists of a vector formed by concatenating the rows of a 28x28 pixel map.
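As a small illustration of the "prior" representation, the sketch below concatenates the rows of a 28x28 pixel map into a single vector of 784 values. Row-major order and the helper name `flatten_rows` are assumptions for the example; the source only states that the lines of the map are concatenated.

```python
def flatten_rows(pixel_map):
    """Concatenate the rows of a 2-D pixel map into one flat list."""
    return [value for row in pixel_map for value in row]

# Toy 28x28 map whose entry at (i, j) equals its expected flat index.
SIZE = 28
pixel_map = [[i * SIZE + j for j in range(SIZE)] for i in range(SIZE)]
vector = flatten_rows(pixel_map)

print(len(vector))           # 784
print(vector[3 * SIZE + 5])  # 89, the pixel at row 3, column 5
```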
Data type: non-sparse
Number of features: 970 (consistent with 485 retained pixels for each of the two digits)
Number of examples and check-sums:

| Split | Pos_ex | Neg_ex | Tot_ex | Check_sum |
|-------|--------|--------|--------|-----------|
| Train | 1550 | 1603 | 3153 | 164947945.00 |
| Valid | 155 | 160 | 315 | 16688946.00 |

This dataset contains samples from both training and validation datasets.
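The counts listed above can be sanity-checked with simple arithmetic. The sketch below assumes that this file is the full union of the training and validation splits, which the description suggests but does not state outright; the variable names are illustrative.

```python
# Counts copied from the table of examples above.
splits = {
    "train": {"pos": 1550, "neg": 1603, "total": 3153},
    "valid": {"pos": 155, "neg": 160, "total": 315},
}

# Each split's total should equal its positive plus negative examples.
for name, s in splits.items():
    assert s["pos"] + s["neg"] == s["total"], name

# Sum each column across both splits (assumes the file is their full union).
combined = {k: sum(s[k] for s in splits.values()) for k in ("pos", "neg", "total")}
pos_fraction = combined["pos"] / combined["total"]

print(combined)               # {'pos': 1705, 'neg': 1763, 'total': 3468}
print(f"{pos_fraction:.3f}")  # 0.492
```

Under that assumption the two classes are close to balanced, so plain accuracy is a reasonable first metric.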
2014-01-06T22:56:01Z
active