Source: Unknown - Date unknown
This dataset is synthetic. It was generated by David Coleman
at RCA Laboratories in Princeton, N.J. For convenience, we will
refer to it as the POLLEN DATA. The first three variables are the
lengths of geometric features observed on sampled pollen grains - in the
x, y, and z dimensions: a "ridge" along x, a "nub" in the y
direction, and a "crack" along the z dimension. The fourth
variable is pollen grain weight, and the fifth is density.
There are 3848 observations, in random order (for people whose
software packages cannot handle this much data, it is recommended
that the data be sampled). The dataset is broken up into eight
pieces, POLLEN1.DAT - POLLEN8.DAT, each with 481 observations.
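For readers who need to follow the sampling advice above, here is a minimal sketch of reading the eight pieces and drawing a random subsample. It assumes that each POLLEN*.DAT file holds one observation per line as whitespace-separated numbers; the file layout is not stated in this description, and the helper names load_pieces and subsample are hypothetical.

```python
import random
from pathlib import Path


def load_pieces(directory, n_pieces=8):
    """Read POLLEN1.DAT .. POLLEN8.DAT from `directory` into a list of rows.

    ASSUMPTION: one observation per line, whitespace-separated numeric fields.
    """
    rows = []
    for i in range(1, n_pieces + 1):
        path = Path(directory) / f"POLLEN{i}.DAT"
        with path.open() as fh:
            for line in fh:
                fields = line.split()
                if fields:  # skip blank lines
                    rows.append([float(x) for x in fields])
    return rows


def subsample(rows, k, seed=0):
    """Draw k rows without replacement; a fixed seed keeps the draw repeatable."""
    return random.Random(seed).sample(rows, k)
```

Since the observations are already in random order, reading only some of the eight pieces is another way to get a smaller sample.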
We will call the variables:
6. OBSERVATION NUMBER (for convenience)
The data analyst is advised that there is more than one "feature" to
these data. Each feature can be observed through various graphical
techniques, but analytic methods, as well, can help "crack" the
I no longer have the description handed out during the JSM, but can
tell you how I generated the data, in Minitab.
1. Part A was generated: 5000 (I think) 5-variable, uncorrelated, i.i.d.
2. To get part B, I duplicated part A, then reversed the sign on the
observations for 3 of the 5 variables.
3. Part B was appended to Part A.
4. The order of the observations was randomized.
5. While waiting for my tardy car-pool companion, I took a piece of
graph paper, and figured out a dot-matrix representation of the word,
"EUREKA." I then added these observations to the "center" of the
6. The data were scaled, by variable (something like 1,3,5,7,11).
7. The data were rotated, then translated.
8. A few points in space within the datacloud were chosen as ellipsoid
centers, then for each center, all observations within a (scaled and
rotated) radius were identified, and eliminated - to form ellipsoidal voids.
9. The variables were given entirely fictitious names.
FYI, only the folks at Bell Labs, Murray Hill, found everything,
including the voids.
Hope this is helpful!
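The generation steps listed above can be sketched roughly in Python with NumPy. This is an illustration under stated assumptions, not the original Minitab procedure: the standard normal distribution, the choice of which three variables get their sign reversed, the translation range, the void radius, and the use of existing data points as void centers are all guesses, and the function name make_pollen_like is hypothetical. Step 5 (the hidden word) is left out because the dot-matrix coordinates are not given.

```python
import numpy as np


def make_pollen_like(n_a=5000, n_voids=3, radius=0.5, seed=0):
    """Rough re-creation of the described recipe; all constants are guesses."""
    rng = np.random.default_rng(seed)

    # Step 1: part A, uncorrelated i.i.d. variables (standard normal ASSUMED).
    part_a = rng.standard_normal((n_a, 5))

    # Step 2: part B copies A with the sign reversed on 3 of the 5 variables
    # (which three is not stated; columns 0-2 are an arbitrary choice).
    part_b = part_a.copy()
    part_b[:, :3] *= -1.0

    # Steps 3-4: append B to A, then randomize the row order.
    data = rng.permutation(np.vstack([part_a, part_b]))

    # Step 5 (the hidden word) is omitted: no coordinates are given.

    # Step 6: scale by variable ("something like 1,3,5,7,11").
    scales = np.array([1.0, 3.0, 5.0, 7.0, 11.0])
    data = data * scales

    # Step 7: rotate, then translate. A random orthogonal matrix from a QR
    # factorization is used, with one column flipped if needed so det = +1.
    rot, _ = np.linalg.qr(rng.standard_normal((5, 5)))
    if np.linalg.det(rot) < 0:
        rot[:, 0] *= -1.0
    shift = rng.uniform(-10.0, 10.0, size=5)  # ASSUMED translation range
    data = data @ rot.T + shift

    # Step 8: pick a few centers (here: existing rows, an assumption) and drop
    # every row inside a scaled-and-rotated radius, leaving ellipsoidal holes.
    centers = data[rng.choice(len(data), size=n_voids, replace=False)]
    keep = np.ones(len(data), dtype=bool)
    for c in centers:
        # Undo the rotation and scaling so the ellipsoid becomes a sphere.
        unscaled = ((data - c) @ rot) / scales
        keep &= np.linalg.norm(unscaled, axis=1) >= radius

    return data[keep]
```

Calling make_pollen_like() returns an array with five columns and somewhat fewer than 10000 rows, since each chosen center and its neighbours are removed.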
Becker, R.A., Denby, L., McGill, R., and Wilks,
A. (1986). Datacryptanalysis: A Case Study.
Proceedings of the Section on Statistical Graphics, 92-97.
Slomka, M. (1986). The Analysis of a Synthetic Data Set.
Proceedings of the Section on Statistical Graphics, 113-116.
Information about the dataset
CLASSINDEX: none specific