Data

pollen

active
ARFF
Publicly available Visibility: public Uploaded 29-09-2014 by Joaquin Vanschoren


Author:
Source: Unknown - Date unknown
Please cite:
This dataset is synthetic. It was generated by David Coleman
at RCA Laboratories in Princeton, N.J. For convenience, we will
refer to it as the POLLEN DATA. The first three variables are the
lengths of geometric features observed on sampled pollen grains - in the
x, y, and z dimensions: a "ridge" along x, a "nub" in the y
direction, and a "crack" along the z dimension. The fourth
variable is pollen grain weight, and the fifth is density.
There are 3848 observations, in random order (for people whose
software packages cannot handle this much data, it is recommended
that the data be sampled). The dataset is broken up into eight
pieces, POLLEN1.DAT - POLLEN8.DAT, each with 481 observations.
We will call the variables:
1. RIDGE
2. NUB
3. CRACK
4. WEIGHT
5. DENSITY
6. OBSERVATION NUMBER (for convenience)
The data analyst is advised that there is more than one "feature" to
these data. Each feature can be observed through various graphical
techniques, but analytic methods, as well, can help "crack" the
dataset.
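The description recommends sampling when software cannot handle all 3848 observations, and notes the split into eight files of 481 observations each. A minimal sketch of both operations follows, using only the Python standard library. The placeholder `rows` list and the helper names `split_into_pieces` and `subsample` are illustrative assumptions, not part of the original distribution; real rows would come from the POLLEN files.

```python
import random

N_OBS = 3848                      # total observations stated in the description
N_PIECES = 8                      # POLLEN1.DAT .. POLLEN8.DAT
PIECE_SIZE = N_OBS // N_PIECES    # observations per piece


def split_into_pieces(rows, n_pieces=N_PIECES):
    """Split rows into n_pieces consecutive chunks of equal size."""
    if len(rows) % n_pieces != 0:
        raise ValueError("row count is not divisible by n_pieces")
    size = len(rows) // n_pieces
    return [rows[i * size:(i + 1) * size] for i in range(n_pieces)]


def subsample(rows, k, seed=0):
    """Draw k rows without replacement, reproducibly for a given seed."""
    return random.Random(seed).sample(rows, k)


# Placeholder rows standing in for
# (RIDGE, NUB, CRACK, WEIGHT, DENSITY, OBSERVATION NUMBER).
rows = [(0.0, 0.0, 0.0, 0.0, 0.0, i + 1) for i in range(N_OBS)]
pieces = split_into_pieces(rows)
sample = subsample(rows, 500)
```

Because the observations are already in random order, a consecutive chunk is itself a random subset, so any single piece can also serve as a sample.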
Additional Info:
I no longer have the description handed out during the JSM, but can
tell you how I generated the data, in Minitab.
1. Part A was generated: 5000 (I think) 5-variable, uncorrelated, i.i.d.
Gaussian observations.
2. To get part B, I duplicated part A, then reversed the sign on the
observations for 3 of the 5 variables.
3. Part B was appended to Part A.
4. The order of the observations was randomized.
5. While waiting for my tardy car-pool companion, I took a piece of
graph paper, and figured out a dot-matrix representation of the word,
"EUREKA." I then added these observations to the "center" of the
dataset.
6. The data were scaled, by variable (something like 1,3,5,7,11).
7. The data were rotated, then translated.
8. A few points in space within the datacloud were chosen as ellipsoid
centers, then for each center, all observations within a (scaled and
rotated) radius were identified, and eliminated - to form ellipsoidal
voids.
9. The variables were given entirely fictitious names.
FYI, only the folks at Bell Labs, Murray Hill, found everything,
including the voids.
Hope this is helpful!
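The generation steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the original Minitab procedure. The description does not give the choice of which three variables are sign-flipped, the rotation (here a single plane rotation), the translation vector, the dot-matrix pattern (here only a hypothetical letter "E"), or the number and radius of the voids, so those are all invented for the example.

```python
import math
import random

SCALES = (1.0, 3.0, 5.0, 7.0, 11.0)   # "something like 1,3,5,7,11"
FLIPPED = (0, 1, 2)                   # assumption: which 3 variables change sign
PLANE, THETA = (0, 3), math.pi / 6    # assumption: one rotation in one plane
SHIFT = (10.0, -5.0, 2.0, 0.0, 7.0)   # assumption: arbitrary translation
GLYPH = ["###", "#..", "##.", "#..", "###"]  # hypothetical dot matrix ("E" only)


def rotate(row, theta):
    """Rotate a 5-vector by theta within the coordinate plane PLANE."""
    i, j = PLANE
    c, s = math.cos(theta), math.sin(theta)
    out = list(row)
    out[i] = c * row[i] - s * row[j]
    out[j] = s * row[i] + c * row[j]
    return out


def generate(n=5000, n_voids=3, radius=0.5, seed=0):
    rng = random.Random(seed)
    # Steps 1-3: part A is i.i.d. Gaussian; part B is a sign-flipped copy.
    part_a = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n)]
    part_b = [[-v if k in FLIPPED else v for k, v in enumerate(r)] for r in part_a]
    data = part_a + part_b
    # Step 4: randomize the order.
    rng.shuffle(data)
    # Step 5: add dot-matrix points near the center (first two coordinates).
    for r, line in enumerate(GLYPH):
        for c, ch in enumerate(line):
            if ch == "#":
                data.append([0.05 * (c - 1), 0.05 * (2 - r), 0.0, 0.0, 0.0])
    # Step 6: scale each variable.
    data = [[v * s for v, s in zip(row, SCALES)] for row in data]
    # Step 7: rotate, then translate.
    data = [[v + t for v, t in zip(rotate(row, THETA), SHIFT)] for row in data]
    # Step 8: pick centers inside the cloud and remove points within a
    # scaled and rotated radius, which leaves ellipsoidal voids.
    centers = rng.sample(data, n_voids)

    def in_void(row):
        for center in centers:
            diff = rotate([a - b for a, b in zip(row, center)], -THETA)
            if sum((d / s) ** 2 for d, s in zip(diff, SCALES)) < radius ** 2:
                return True
        return False

    return [row for row in data if not in_void(row)]
```

Undoing the rotation and scaling before measuring distance means each void is a sphere in the original Gaussian coordinates and an ellipsoid in the final ones, which is one reading of "a (scaled and rotated) radius".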
References:
Becker, R.A., Denby, L., McGill, R., and Wilks, A. (1986).
Datacryptanalysis: A Case Study.
Proceedings of the Section on Statistical Graphics, 92-97.
Slomka, M. (1986). The Analysis of a Synthetic Data Set.
Proceedings of the Section on Statistical Graphics, 113-116.
Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: none specific

Feature | Type | Unique values | Missing values |
---|---|---|---|
DENSITY (target) | numeric | 3784 | 0 |
RIDGE | numeric | 3809 | 0 |
NUB | numeric | 3811 | 0 |
CRACK | numeric | 3816 | 0 |
WEIGHT | numeric | 3826 | 0 |
OBSERVATION_NUMBER (ignore) | numeric | 3848 | 0 |

Third quartile of mutual information between the nominal attributes and the target attribute.

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

Third quartile of skewness among attributes of the numeric type: 0.11

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

Third quartile of standard deviation of attributes of the numeric type: 8.96

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

First quartile of kurtosis among attributes of the numeric type: -0.23

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

Average mutual information between the nominal attributes and the target attribute.

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.

First quartile of mutual information between the nominal attributes and the target attribute.

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Standard deviation of the number of distinct values among attributes of the nominal type.

Average number of distinct values among the attributes of the nominal type.

First quartile of skewness among attributes of the numeric type: -0.09

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

First quartile of standard deviation of attributes of the numeric type: 4.17

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Second quartile (Median) of kurtosis among attributes of the numeric type: -0.16

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

Second quartile (Median) of means among attributes of the numeric type: 0

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

Minimal mutual information between the nominal attributes and the target attribute.

Second quartile (Median) of skewness among attributes of the numeric type: 0.07

Second quartile (Median) of standard deviation of attributes of the numeric type: 6.4

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Maximum mutual information between the nominal attributes and the target attribute.

The minimal number of distinct values among attributes of the nominal type.

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.

The maximum number of distinct values among attributes of the nominal type.

Third quartile of kurtosis among attributes of the numeric type: 0.07

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2
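Several of the qualities listed above are quartiles of a per-attribute statistic (skewness, kurtosis, standard deviation) taken across the numeric attributes. The sketch below shows one way such values could be computed with the Python standard library. It assumes population moments, excess kurtosis (normal distribution = 0), and the "inclusive" quartile method; the page does not state which conventions were actually used, and the toy columns are invented for illustration.

```python
import statistics


def skewness(xs):
    """Population skewness: mean of cubed standardized values."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


def quartiles(values):
    """First, second (median), and third quartiles of a list of values."""
    return statistics.quantiles(values, n=4, method="inclusive")


# Toy numeric attributes (hypothetical values, not the pollen data).
columns = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.0, 1.0, 2.0, 3.0, 10.0],
    "C": [-4.0, -1.0, 0.0, 1.0, 4.0],
}
skew_quartiles = quartiles([skewness(c) for c in columns.values()])
kurt_quartiles = quartiles([excess_kurtosis(c) for c in columns.values()])
std_quartiles = quartiles([statistics.pstdev(c) for c in columns.values()])
```

Each `*_quartiles` list holds three numbers in increasing order, matching the first quartile, median, and third quartile entries in the list of qualities.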