Census-Income

Status: active. Format: ARFF. Visibility: public. Uploaded 16-02-2016 by Hilda Fabiola Bernard.
Author: U.S. Census Bureau (http://www.census.gov/), United States Department of Commerce
Source: UCI
Please cite: Please refer to the Machine Learning Repository's citation policy.

Source:
Original Owner: U.S. Census Bureau (http://www.census.gov/), United States Department of Commerce
Donor: Terran Lane and Ronny Kohavi, Data Mining and Visualization, Silicon Graphics. terran '@' ecn.purdue.edu, ronnyk '@' sgi.com

Data Set Information:
This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contain 41 demographic and employment-related variables. Because of stratified sampling, the instance weight indicates the number of people in the population that each record represents. Any analysis that draws conclusions about the population must use this field. This attribute should *not* be used as an input to classifiers. The files hold one instance per line with comma-delimited fields. There are 199523 instances in the data file and 99762 in the test file. The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet's MIndUtil mineset-to-mlc.
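The role of the instance weight described above can be illustrated with a minimal sketch. The records, the field names (`age`, `income_over_50k`, `instance_weight`) and both helper functions are hypothetical and made up for illustration; they are not taken from the data file. The sketch shows the weight being used for a population estimate and left out of the classifier inputs.

```python
# Toy records with made-up values; the field names are assumptions for illustration.
records = [
    {"age": 30, "income_over_50k": False, "instance_weight": 1700.0},
    {"age": 52, "income_over_50k": True, "instance_weight": 900.0},
    {"age": 41, "income_over_50k": False, "instance_weight": 2400.0},
    {"age": 67, "income_over_50k": True, "instance_weight": 1000.0},
]


def weighted_share(rows, flag, weight="instance_weight"):
    """Estimate the population share for which rows[flag] is true.

    Each record counts as many times as its weight says it represents.
    """
    total = sum(r[weight] for r in rows)
    hits = sum(r[weight] for r in rows if r[flag])
    return hits / total


def classifier_features(row, weight="instance_weight", target="income_over_50k"):
    """Return classifier inputs: every field except the weight and the target."""
    return {k: v for k, v in row.items() if k not in (weight, target)}


# Share of records above the threshold, ignoring weights.
unweighted = sum(r["income_over_50k"] for r in records) / len(records)
# Share of the represented population above the threshold.
weighted = weighted_share(records, "income_over_50k")
# Inputs that a classifier would see: the weight is dropped.
features = [classifier_features(r) for r in records]

print(unweighted, weighted, features[0])
```

In this toy example the two shares differ because the records above the threshold carry smaller weights, which is the kind of gap the weight is meant to correct.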
Attribute Information:
More information detailing the meaning of the attributes can be found in the Census Bureau's documentation. To make use of the data descriptions at that site, the following mappings to the Census Bureau's internal database column names will be needed:

age: AAGE
class of worker: ACLSWKR
industry code: ADTIND
occupation code: ADTOCC
adjusted gross income: AGI
education: AHGA
wage per hour: AHRSPAY
enrolled in edu inst last wk: AHSCOL
marital status: AMARITL
major industry code: AMJIND
major occupation code: AMJOCC
race: ARACE
hispanic origin: AREORGN
sex: ASEX
member of a labor union: AUNMEM
reason for unemployment: AUNTYPE
full or part time employment stat: AWKSTAT
capital gains: CAPGAIN
capital losses: CAPLOSS
dividends from stocks: DIVVAL
federal income tax liability: FEDTAX
tax filer status: FILESTAT
region of previous residence: GRINREG
state of previous residence: GRINST
detailed household and family stat: HHDFMX
detailed household summary in household: HHDREL
instance weight: MARSUPWT
migration code-change in msa: MIGMTR1
migration code-change in reg: MIGMTR3
migration code-move within reg: MIGMTR4
live in this house 1 year ago: MIGSAME
migration prev res in sunbelt: MIGSUN
num persons worked for employer: NOEMP
family members under 18: PARENT
total person earnings: PEARNVAL
country of birth father: PEFNTVTY
country of birth mother: PEMNTVTY
country of birth self: PENATVTY
citizenship: PRCITSHP
total person income: PTOTVAL
own business or self employed: SEOTR
taxable income amount: TAXINC
fill inc questionnaire for veteran's admin: VETQVA
veterans benefits: VETYN
weeks worked in year: WKSWORK

Note that incomes have been binned at the $50K level to present a binary classification problem, much like the original UCI/ADULT database. The goal field of this data, however, was drawn from the "total person income" field rather than the "adjusted gross income" field and may, therefore, behave differently from the original ADULT goal field.
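The binning described in the note can be sketched as follows. The dictionary holds only a handful of the mappings listed above, and `income_label` is a hypothetical helper. Whether an income of exactly $50K falls in the upper or lower bin is not stated in the description, so the strict comparison used here is an assumption.

```python
# A few of the name-to-code mappings from the attribute list (not the full set).
CENSUS_CODES = {
    "age": "AAGE",
    "education": "AHGA",
    "adjusted gross income": "AGI",
    "instance weight": "MARSUPWT",
    "total person income": "PTOTVAL",
}

THRESHOLD = 50_000  # dollars, the $50K level named in the description


def income_label(total_person_income, threshold=THRESHOLD):
    """Return 1 if the income is above the threshold, else 0.

    Assumption: the boundary value itself goes to the lower bin.
    """
    return 1 if total_person_income > threshold else 0


# The goal field is derived from total person income (PTOTVAL), not AGI.
goal_source = CENSUS_CODES["total person income"]
labels = [income_label(x) for x in (12_000, 75_000, 49_999)]
print(goal_source, labels)
```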
Relevant Papers: N/A

Papers That Cite This Data Set:
Eibe Frank, Geoffrey Holmes, Richard Kirkby and Mark A. Hall. Racing Committees for Large Datasets. Discovery Science. 2002.
Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.
Stephen D. Bay. Multivariate Discretization for Set Mining. Knowl. Inf. Syst, 3. 2001.
Masahiro Terabe, Takashi Washio and Hiroshi Motoda. The Effect of Subsampling Rate on S3 Bagging Performance. Mitsubishi Research Institute.

Citation Request: Please refer to the Machine Learning Repository's citation policy.

42 features

Feature  Type     Unique values  Missing
V1       numeric  91             0
V2       nominal  9              0
V3       numeric  52             0
V4       numeric  47             0
V5       nominal  17             0
V6       numeric  1425           0
V7       nominal  3              0
V8       nominal  7              0
V9       nominal  24             0
V10      nominal  15             0
V11      nominal  5              0
V12      nominal  10             0
V13      nominal  2              0
V14      nominal  3              0
V15      nominal  6              0
V16      nominal  8              0
V17      numeric  133            0
V18      numeric  114            0
V19      numeric  1675           0
V20      nominal  6              0
V21      nominal  6              0
V22      nominal  51             0
V23      nominal  38             0
V24      nominal  8              0
V25      numeric  123232         0
V26      nominal  10             0
V27      nominal  9              0
V28      nominal  10             0
V29      nominal  3              0
V30      nominal  4              0
V31      numeric  7              0
V32      nominal  5              0
V33      nominal  43             0
V34      nominal  43             0
V35      nominal  43             0
V36      nominal  5              0
V37      numeric  3              0
V38      nominal  3              0
V39      numeric  3              0
V40      numeric  53             0
V41      numeric  2              0
V42      nominal  2              0

62 properties

Value     Property
299285    Number of instances (rows) of the dataset.
42        Number of attributes (columns) of the dataset.
(none)    Number of distinct values of the target attribute (if it is nominal).
0         Number of missing values in the dataset.
0         Number of instances with at least one value missing.
13        Number of numeric attributes.
29        Number of nominal attributes.
4.76      Percentage of binary attributes.
22.32     Second quartile (Median) of standard deviation of attributes of the numeric type.
(none)    Maximum entropy among attributes.
-2        Minimum kurtosis among attributes of the numeric type.
0         Percentage of instances having missing values.
(none)    Third quartile of entropy among attributes.
1057.03   Maximum kurtosis among attributes of the numeric type.
0.18      Minimum of means among attributes of the numeric type.
0         Percentage of missing values.
108.1     Third quartile of kurtosis among attributes of the numeric type.
1740.1    Maximum of means among attributes of the numeric type.
(none)    Minimal mutual information between the nominal attributes and the target attribute.
30.95     Percentage of numeric attributes.
145.18    Third quartile of means among attributes of the numeric type.
(none)    Maximum mutual information between the nominal attributes and the target attribute.
2         The minimal number of distinct values among attributes of the nominal type.
69.05     Percentage of nominal attributes.
(none)    Third quartile of mutual information between the nominal attributes and the target attribute.
51        The maximum number of distinct values among attributes of the nominal type.
-1.21     Minimum skewness among attributes of the numeric type.
(none)    First quartile of entropy among attributes.
8.28      Third quartile of skewness among attributes of the numeric type.
27.14     Maximum skewness among attributes of the numeric type.
0.5       Minimum standard deviation of attributes of the numeric type.
-1.29     First quartile of kurtosis among attributes of the numeric type.
633.74    Third quartile of standard deviation of attributes of the numeric type.
4670.77   Maximum standard deviation of attributes of the numeric type.
(none)    Percentage of instances belonging to the least frequent class.
6.64      First quartile of means among attributes of the numeric type.
14.76     Standard deviation of the number of distinct values among attributes of the nominal type.
(none)    Average entropy of the attributes.
(none)    Number of instances belonging to the least frequent class.
(none)    First quartile of mutual information between the nominal attributes and the target attribute.
128.77    Mean kurtosis among attributes of the numeric type.
2         Number of binary attributes.
0.29      First quartile of skewness among attributes of the numeric type.
203.24    Mean of means among attributes of the numeric type.
1.61      First quartile of standard deviation of attributes of the numeric type.
(none)    Average class difference between consecutive instances.
(none)    Average mutual information between the nominal attributes and the target attribute.
(none)    Second quartile (Median) of entropy among attributes.
(none)    Entropy of the target attribute values.
(none)    An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.
-0.53     Second quartile (Median) of kurtosis among attributes of the numeric type.
0         Number of attributes divided by the number of instances.
13.72     Average number of distinct values among the attributes of the nominal type.
34.54     Second quartile (Median) of means among attributes of the numeric type.
(none)    Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.
5.28      Mean skewness among attributes of the numeric type.
(none)    Second quartile (Median) of mutual information between the nominal attributes and the target attribute.
(none)    Percentage of instances belonging to the most frequent class.
633.03    Mean standard deviation of attributes of the numeric type.
0.83      Second quartile (Median) of skewness among attributes of the numeric type.
(none)    Number of instances belonging to the most frequent class.
(none)    Minimal entropy among attributes.
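Several of the simple properties listed here can be recomputed from the counts on this page, which is a useful sanity check. The sketch below uses only numbers that appear in the description and the property list; the variable names and the `pct` helper are illustrative. The instance count happens to equal the sum of the data-file and test-file sizes given in the description, although the page does not state that the two files were merged.

```python
# Counts copied from the description and the property list on this page.
n_data_file, n_test_file = 199523, 99762
n_numeric, n_nominal, n_binary = 13, 29, 2


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals as on the page."""
    return round(100 * part / whole, 2)


n_instances = n_data_file + n_test_file        # matches the listed 299285
n_attributes = n_numeric + n_nominal           # matches the listed 42
pct_numeric = pct(n_numeric, n_attributes)     # listed as 30.95
pct_nominal = pct(n_nominal, n_attributes)     # listed as 69.05
pct_binary = pct(n_binary, n_attributes)       # listed as 4.76
dimensionality = round(n_attributes / n_instances, 2)  # listed as 0
test_share = n_test_file / n_instances         # roughly 1/3, as described

print(n_instances, n_attributes, pct_numeric, pct_nominal, pct_binary)
```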

4 tasks

0 runs - estimation_procedure: 33% Holdout set - target_feature: V42
0 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: area_under_roc_curve - target_feature: V42
0 runs - estimation_procedure: 50 times Clustering
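The two supervised estimation procedures named in the tasks above, a 33% holdout and 10-fold cross-validation, can be sketched with plain index splitting. This is an illustration of the general techniques under stated assumptions, not the procedure the platform itself runs: the functions `holdout_split` and `k_fold_splits` are hypothetical, the shuffling is seeded for repeatability, and no stratification by the target V42 is attempted.

```python
import random


def holdout_split(n, test_fraction=0.33, seed=0):
    """Shuffle indices 0..n-1 and hold out `test_fraction` of them for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)


def k_fold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists so that each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


# Small demonstration on 100 row indices instead of the full dataset.
train_idx, test_idx = holdout_split(100)
cv_splits = list(k_fold_splits(100, k=10))
print(len(train_idx), len(test_idx), len(cv_splits))
```

With a binary target as imbalanced as an income threshold usually produces, a stratified variant that preserves the class ratio in each fold would normally be preferable; it is omitted here to keep the sketch short.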