pollen
Status: active. Format: ARFF. Visibility: public. Uploaded 29-09-2014 by Joaquin Vanschoren.
Author: unknown
Source: unknown - date unknown
Please cite: the references below.

This dataset is synthetic. It was generated by David Coleman at RCA Laboratories in Princeton, N.J. For convenience, we will refer to it as the POLLEN DATA. The first three variables are the lengths of geometric features observed on sampled pollen grains in the x, y, and z dimensions: a "ridge" along x, a "nub" in the y direction, and a "crack" along the z dimension. The fourth variable is pollen grain weight, and the fifth is density. There are 3848 observations, in random order (for people whose software packages cannot handle this much data, it is recommended that the data be sampled). The dataset is broken up into eight pieces, POLLEN1.DAT - POLLEN8.DAT, each with 481 observations. The variables are:

1. RIDGE
2. NUB
3. CRACK
4. WEIGHT
5. DENSITY
6. OBSERVATION NUMBER (for convenience)

The data analyst is advised that there is more than one "feature" to these data. Each feature can be observed through various graphical techniques, but analytic methods, as well, can help "crack" the dataset.

Additional info from the author: I no longer have the description handed out during the JSM, but I can tell you how I generated the data, in minitab.

1. Part A was generated: 5000 (I think) 5-variable, uncorrelated, i.i.d. Gaussian observations.
2. To get part B, I duplicated part A, then reversed the sign of the observations for 3 of the 5 variables.
3. Part B was appended to part A.
4. The order of the observations was randomized.
5. While waiting for my tardy car-pool companion, I took a piece of graph paper and figured out a dot-matrix representation of the word "EUREKA." I then added these observations to the "center" of the dataset.
6. The data were scaled, by variable (something like 1, 3, 5, 7, 11).
7. The data were rotated, then translated.
8. A few points in space within the data cloud were chosen as ellipsoid centers; then, for each center, all observations within a (scaled and rotated) radius were identified and eliminated, to form ellipsoidal voids.
9. The variables were given entirely fictitious names.

FYI, only the folks at Bell Labs, Murray Hill, found everything, including the voids. Hope this is helpful!

References:

Becker, R.A., Denby, L., McGill, R., and Wilks, A. (1986). Datacryptanalysis: A Case Study. Proceedings of the Section on Statistical Graphics, 92-97.

Slomka, M. (1986). The Analysis of a Synthetic Data Set. Proceedings of the Section on Statistical Graphics, 113-116.

Information about the dataset:
CLASSTYPE: numeric
CLASSINDEX: none specific
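Since the original minitab session is not available, the following is a rough NumPy sketch of steps 1-8 as described above. The flipped columns, scale factors, rotation, translation, void centers, and void radius are all placeholder assumptions, and the "EUREKA" dot-matrix insertion (step 5) is omitted.

import numpy as np

rng = np.random.default_rng(0)

# Step 1: part A - 5000 i.i.d. 5-variable Gaussian observations.
A = rng.standard_normal((5000, 5))

# Step 2: part B - a copy of A with the sign reversed on 3 of the 5 variables.
B = A.copy()
B[:, [0, 2, 4]] *= -1  # which 3 columns were flipped is an assumption

# Steps 3-4: append B to A, then randomize the row order.
X = np.vstack([A, B])
rng.shuffle(X)

# Step 5 (omitted): add dot-matrix "EUREKA" points near the center.

# Step 6: scale each variable (factors "something like 1, 3, 5, 7, 11").
X = X * np.array([1.0, 3.0, 5.0, 7.0, 11.0])

# Step 7: rotate, then translate (random rotation and offset as placeholders).
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal matrix
X = X @ Q + np.array([10.0, -5.0, 3.0, 0.0, 7.0])  # arbitrary translation

# Step 8: carve ellipsoidal voids around a few chosen centers by deleting
# every observation within a scaled radius of each center.
for center in X[rng.choice(len(X), size=3, replace=False)]:
    keep = np.linalg.norm((X - center) / X.std(axis=0), axis=1) > 0.5
    X = X[keep]

print(X.shape)  # fewer than 10000 rows remain once the voids are removed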

5 features

DENSITY (target): numeric, 3784 unique values, 0 missing
RIDGE: numeric, 3809 unique values, 0 missing
NUB: numeric, 3811 unique values, 0 missing
CRACK: numeric, 3816 unique values, 0 missing
WEIGHT: numeric, 3826 unique values, 0 missing
OBSERVATION_NUMBER (ignore): numeric, 3848 unique values, 0 missing
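A quick way to pull these features into a dataframe is scikit-learn's OpenML fetcher. A minimal sketch, assuming the name "pollen" (version 1) resolves to this OpenML entry; the fetcher may already exclude the column flagged (ignore), so the drop below is defensive.

from sklearn.datasets import fetch_openml

# Fetch the pollen dataset from OpenML by name; version=1 is an assumption.
pollen = fetch_openml(name="pollen", version=1, as_frame=True)

X = pollen.data.drop(columns=["OBSERVATION_NUMBER"], errors="ignore")
y = pollen.target  # DENSITY

print(X.shape)            # expected 3848 rows
print(X.columns.tolist())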

19 properties

Number of instances (rows) of the dataset: 3848
Number of attributes (columns) of the dataset: 5
Number of distinct values of the target attribute (if it is nominal): 0
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 5
Number of nominal attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Average class difference between consecutive instances: -2.56
Percentage of missing values: 0
Percentage of numeric attributes: 100
Number of attributes divided by the number of instances: 0
Percentage of nominal attributes: 0
Percentage of instances belonging to the most frequent class: (no value; numeric target)
Number of instances belonging to the most frequent class: (no value; numeric target)
Percentage of instances belonging to the least frequent class: (no value; numeric target)
Number of instances belonging to the least frequent class: (no value; numeric target)
Number of binary attributes: 0

14 tasks

0 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: mean_absolute_error - target_feature: DENSITY
0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - evaluation_measure: mean_absolute_error - target_feature: DENSITY
0 runs - estimation_procedure: 33% Holdout set - target_feature: DENSITY
0 runs - estimation_procedure: 50 times Clustering (10 identical task entries)
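The regression tasks above evaluate mean absolute error on DENSITY. A minimal sketch of the 10-fold cross-validation setup with scikit-learn; the model choice and fold shuffling are assumptions, not part of the task definition.

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

pollen = fetch_openml(name="pollen", version=1, as_frame=True)
X = pollen.data.drop(columns=["OBSERVATION_NUMBER"], errors="ignore")
y = pollen.target  # DENSITY

# 10-fold cross-validation scored by mean absolute error, mirroring the task.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=cv, scoring="neg_mean_absolute_error")
print("MAE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))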