ada_agnostic

Status: active | Format: ARFF | Publicly available | Visibility: public | Uploaded 05-12-2017 by Jann Goschenhofer
Reported issue: "Missing column type from API for some reason" (1 downvote, reported by User 5824)

Author: [Isabelle Guyon](isabelle@clopinet.com)

Source: [Agnostic Learning vs. Prior Knowledge Challenge](http://www.agnostic.inf.ethz.ch)

Please cite: None

__Major change w.r.t. version 1: updated data type of binary variables to factor type.__

Dataset from the Agnostic Learning vs. Prior Knowledge Challenge (http://www.agnostic.inf.ethz.ch), which consisted of 5 different datasets (SYLVA, GINA, NOVA, HIVA, ADA). The purpose of the challenge was to check whether the performance of domain-specific feature engineering (prior knowledge) can be matched by algorithms trained on data without any domain-specific knowledge (agnostic). For the latter, the data was anonymised and preprocessed in a way that makes it uninterpretable.

This dataset contains the agnostic (smashed) version of a data set from the US census bureau for the time span June 2005 - September 2006. A similar data set on OpenML is called __adult__. The raw data from the census bureau is also known as the Adult database in the UCI machine-learning repository.

### Topic

The task of ADA is to discover high-revenue people from census data. This is a two-class classification problem. The raw census data contains continuous, binary and categorical variables. The "prior knowledge track" has access to the original features and their identity. The agnostic track has access to a preprocessed numeric representation that eliminates categorical variables.

### Source

Original owners: this data was extracted from the census bureau database found at http://www.census.gov/ftp/pub/DES/www/welcome.html

Donor: Ronny Kohavi and Barry Becker, Data Mining and Visualization, Silicon Graphics (e-mail: ronnyk@sgi.com for questions)

Dataset from: http://www.agnostic.inf.ethz.ch/datasets.php

### Preprocessing

In [this documentation](http://clopinet.com/isabelle/Projects/agnostic/Dataset.pdf) the organisers of the challenge describe the steps they performed to produce the __agnostic__ data. The 14 original attributes (features) include age, workclass, education, marital status, occupation, native country, etc. The original data contains continuous, binary and categorical features. This dataset is from the "agnostic learning track", i.e. it provides a preprocessed numeric representation that eliminates categorical variables, and the identity of the features is not revealed. Furthermore, the features are scrambled and cannot be linked to the features given in the dataset documentation.

### Additional Info

This dataset contains samples from both the training and validation datasets. Modified by TunedIT (converted to ARFF format).

Data type: non-sparse

Number of features: 48

Number of examples and check-sums:

| Split | Pos_ex | Neg_ex | Tot_ex | Check_sum |
|-------|--------|--------|--------|------------|
| Train | 1029 | 3118 | 4147 | 6798109.00 |
| Valid | 103 | 312 | 415 | 681151.00 |
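As a quick consistency check on the example counts given in the description, the following minimal sketch sums the training and validation splits and derives the share of positive examples. The counts are copied from the description; the names `splits` and `check_splits` are illustrative and not part of any OpenML API.

```python
# Example counts copied from the dataset description (Train and Valid splits).
splits = {
    "Train": {"pos": 1029, "neg": 3118, "total": 4147},
    "Valid": {"pos": 103, "neg": 312, "total": 415},
}


def check_splits(split_counts):
    """Check pos + neg == total for each split and return the combined counts."""
    combined = {"pos": 0, "neg": 0, "total": 0}
    for name, counts in split_counts.items():
        if counts["pos"] + counts["neg"] != counts["total"]:
            raise ValueError(f"Inconsistent counts in split {name!r}")
        for key in combined:
            combined[key] += counts[key]
    return combined


combined = check_splits(splits)
# Share of positive examples over both splits, in percent.
positive_share = 100 * combined["pos"] / combined["total"]
print(combined, f"{positive_share:.2f}% positive")
```

The combined total should agree with the number of instances reported for the merged dataset, since it contains samples from both splits.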

49 features

| Feature | Type | Unique values | Missing |
|---------|------|---------------|---------|
| attr0 | nominal | 2 | 0 |
| attr1 | numeric | 77 | 0 |
| attr2 | nominal | 2 | 0 |
| attr3 | nominal | 2 | 0 |
| attr4 | nominal | 2 | 0 |
| attr5 | nominal | 2 | 0 |
| attr6 | nominal | 2 | 0 |
| attr7 | nominal | 2 | 0 |
| attr8 | nominal | 2 | 0 |
| attr9 | nominal | 2 | 0 |
| attr10 | nominal | 2 | 0 |
| attr11 | nominal | 2 | 0 |
| attr12 | nominal | 2 | 0 |
| attr13 | nominal | 2 | 0 |
| attr14 | numeric | 363 | 0 |
| attr15 | nominal | 2 | 0 |
| attr16 | nominal | 2 | 0 |
| attr17 | numeric | 16 | 0 |
| attr18 | nominal | 2 | 0 |
| attr19 | numeric | 70 | 0 |
| attr20 | nominal | 2 | 0 |
| attr21 | nominal | 2 | 0 |
| attr22 | nominal | 2 | 0 |
| attr23 | numeric | 52 | 0 |
| attr24 | nominal | 2 | 0 |
| attr25 | nominal | 2 | 0 |
| attr26 | nominal | 2 | 0 |
| attr27 | nominal | 2 | 0 |
| attr28 | nominal | 2 | 0 |
| attr29 | numeric | 56 | 0 |
| attr30 | nominal | 2 | 0 |
| attr31 | nominal | 2 | 0 |
| attr32 | nominal | 2 | 0 |
| attr33 | nominal | 2 | 0 |
| attr34 | nominal | 2 | 0 |
| attr35 | nominal | 2 | 0 |
| attr36 | nominal | 2 | 0 |
| attr37 | nominal | 2 | 0 |
| attr38 | nominal | 2 | 0 |
| attr39 | nominal | 1 | 0 |
| attr40 | nominal | 2 | 0 |
| attr41 | nominal | 2 | 0 |
| attr42 | nominal | 2 | 0 |
| attr43 | nominal | 2 | 0 |
| attr44 | nominal | 2 | 0 |
| attr45 | nominal | 2 | 0 |
| attr46 | nominal | 2 | 0 |
| attr47 | nominal | 2 | 0 |
| label | nominal | 2 | 0 |
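The feature list can be summarised programmatically. The sketch below rebuilds it from the values shown on this page (the numeric features with their unique-value counts, and the one nominal feature with a single value), then counts features by type. The helper `build_feature_summary` and the constant names are hypothetical, introduced only for illustration.

```python
from collections import Counter

# Numeric features (index -> unique values), copied from the feature list.
NUMERIC_UNIQUES = {1: 77, 14: 363, 17: 16, 19: 70, 23: 52, 29: 56}
# Nominal features listed with only one unique value.
SINGLE_VALUED = {39}


def build_feature_summary():
    """Return (name, type, unique_values) for attr0..attr47 plus the label."""
    rows = []
    for i in range(48):
        if i in NUMERIC_UNIQUES:
            rows.append((f"attr{i}", "numeric", NUMERIC_UNIQUES[i]))
        else:
            rows.append((f"attr{i}", "nominal", 1 if i in SINGLE_VALUED else 2))
    rows.append(("label", "nominal", 2))
    return rows


features = build_feature_summary()
type_counts = Counter(kind for _, kind, _ in features)
# Nominal columns with exactly two values (this includes the label).
binary = [name for name, kind, n in features if kind == "nominal" and n == 2]
# Columns with a single value carry no information for a classifier.
constant = [name for name, _, n in features if n == 1]
print(type_counts, len(binary), constant)
```

A column with a single unique value, such as `attr39` here, cannot help separate the two classes and is a reasonable candidate to drop before modelling.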

62 properties

| Property | Value |
|----------|-------|
| Number of instances (rows) of the dataset. | 4562 |
| Number of attributes (columns) of the dataset. | 49 |
| Number of distinct values of the target attribute (if it is nominal). | |
| Number of missing values in the dataset. | 0 |
| Number of instances with at least one value missing. | 0 |
| Number of numeric attributes. | 6 |
| Number of nominal attributes. | 43 |
| Third quartile of means among attributes of the numeric type. | 479.67 |
| Maximum mutual information between the nominal attributes and the target attribute. | |
| The minimal number of distinct values among attributes of the nominal type. | 1 |
| Percentage of numeric attributes. | 12.24 |
| Third quartile of mutual information between the nominal attributes and the target attribute. | |
| The maximum number of distinct values among attributes of the nominal type. | 2 |
| Minimum skewness among attributes of the numeric type. | -0.29 |
| Percentage of nominal attributes. | 87.76 |
| Third quartile of skewness among attributes of the numeric type. | 5.95 |
| Maximum skewness among attributes of the numeric type. | 10.75 |
| Minimum standard deviation of attributes of the numeric type. | 72.4 |
| First quartile of entropy among attributes. | |
| Third quartile of standard deviation of attributes of the numeric type. | 149.73 |
| Maximum standard deviation of attributes of the numeric type. | 158.02 |
| Percentage of instances belonging to the least frequent class. | |
| First quartile of kurtosis among attributes of the numeric type. | 0.5 |
| Standard deviation of the number of distinct values among attributes of the nominal type. | 0.15 |
| Average entropy of the attributes. | |
| Number of instances belonging to the least frequent class. | |
| First quartile of means among attributes of the numeric type. | 19.56 |
| Mean kurtosis among attributes of the numeric type. | 24.64 |
| Number of binary attributes. | 42 |
| First quartile of mutual information between the nominal attributes and the target attribute. | |
| Mean of means among attributes of the numeric type. | 272.28 |
| First quartile of skewness among attributes of the numeric type. | 0.07 |
| Average class difference between consecutive instances. | |
| Average mutual information between the nominal attributes and the target attribute. | |
| First quartile of standard deviation of attributes of the numeric type. | 81.26 |
| Entropy of the target attribute values. | |
| An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation. | |
| Second quartile (Median) of entropy among attributes. | |
| Number of attributes divided by the number of instances. | 0.01 |
| Average number of distinct values among the attributes of the nominal type. | 1.98 |
| Second quartile (Median) of kurtosis among attributes of the numeric type. | 3.77 |
| Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation. | |
| Mean skewness among attributes of the numeric type. | 2.83 |
| Second quartile (Median) of means among attributes of the numeric type. | 268.65 |
| Percentage of instances belonging to the most frequent class. | |
| Mean standard deviation of attributes of the numeric type. | 113.04 |
| Second quartile (Median) of mutual information between the nominal attributes and the target attribute. | |
| Number of instances belonging to the most frequent class. | |
| Minimal entropy among attributes. | |
| Second quartile (Median) of skewness among attributes of the numeric type. | 0.99 |
| Maximum entropy among attributes. | |
| Minimum kurtosis among attributes of the numeric type. | -0.05 |
| Percentage of binary attributes. | 85.71 |
| Second quartile (Median) of standard deviation of attributes of the numeric type. | 108.32 |
| Third quartile of entropy among attributes. | |
| Maximum kurtosis among attributes of the numeric type. | 121.7 |
| Minimum of means among attributes of the numeric type. | 12.09 |
| Percentage of instances having missing values. | 0 |
| Third quartile of kurtosis among attributes of the numeric type. | 43.89 |
| Maximum of means among attributes of the numeric type. | 634.02 |
| Minimal mutual information between the nominal attributes and the target attribute. | |
| Percentage of missing values. | 0 |
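Several of the simpler properties follow directly from the counts reported on this page. The sketch below recomputes them as a sanity check. The dictionary keys are illustrative names, not OpenML quality identifiers, and the use of the sample standard deviation is an assumption about how the reported value was computed.

```python
from statistics import mean, stdev

# Counts taken from the property and feature lists on this page.
n_instances, n_attributes = 4562, 49
n_numeric, n_nominal, n_binary = 6, 43, 42
# Distinct-value counts of the 43 nominal attributes: 42 with two values, 1 with one.
nominal_distinct = [2] * 42 + [1]


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


derived = {
    "pct_numeric": pct(n_numeric, n_attributes),
    "pct_nominal": pct(n_nominal, n_attributes),
    "pct_binary": pct(n_binary, n_attributes),
    # Attributes per instance (the dimensionality ratio).
    "dimensionality": round(n_attributes / n_instances, 2),
    "mean_nominal_distinct": round(mean(nominal_distinct), 2),
    # Assumption: sample standard deviation (n - 1 in the denominator).
    "std_nominal_distinct": round(stdev(nominal_distinct), 2),
}
for key, value in derived.items():
    print(f"{key}: {value}")
```

The entropy and mutual-information properties cannot be recomputed this way, because they depend on the actual data values rather than on the counts.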

2 tasks

0 runs - estimation_procedure: 50 times Clustering