Data

trains

active
ARFF
Publicly available Visibility: public Uploaded 06-04-2014 by Jan van Rijn

0 likes downloaded by 8 people , 14 total downloads 0 issues 0 downvotes


Author:
Source: Unknown -
Please cite:
1. Title: INDUCE Trains Data set
2. Sources:
- Donor: GMU, Center for AI, Software Librarian,
Eric E. Bloedorn (bloedorn@aic.gmu.edu)
- Original owners: Ryszard S. Michalski (michalski@aic.gmu.edu)
and Robert Stepp
- Date received: 1 June 1994
- Date updated: 24 June 1994 (Thanks to Larry Holder (UT Arlington)
for noticing a translation error)
3. Past usage:
- This set most closely resembles the data sets described in the following
two publications:
1. R.S. Michalski and J.B. Larson "Inductive Inference of VL
Decision Rules" In Proceedings of the Workshop in Pattern-Directed
Inference Systems, Hawaii, May 1977. Also published in SIGART
Newsletter, ACM No. 63, pp. 38-44, June 1977.
2. Stepp, R.E. and Michalski, R.S. "Conceptual Clustering: Inventing
Goal-Oriented Classifications of Structured Objects" In
R.S. Michalski, J.G. Carbonell, and T.M. Mitchell (Eds.) "Machine
Learning: An Artificial Intelligence Approach, Volume II". Los
Altos, Ca: Morgan Kaufmann.
Both of these papers describe a set of 10 trains, 5 east-bound and 5
west-bound. Both refer to the same 10 trains, as shown by the figures in
these publications. The differences are:
1) This dataset has 10 attributes, with no wheel color or load color
attributes.
2) Reference 2 (Stepp, Michalski): does not completely list the
attributes used, but does mention wheel color - an attribute not
present in this dataset.
3) Reference 1 (Michalski, Larson): 12 attributes mentioned, but only 6
are explicitly described. These 6 are included in the dataset below
and the Stepp and Michalski set.
Results:
[1] Michalski and Larson found the following decision rules:
(1) There exists car1, car2, lod1 and lod2 such that
[infront(car1, car2)][lcont(car1, lod1)][lcont(car2,lod2)]
[load-shape(lod1)=triangle][load-shape(lod2)=polygon]=>[dir=east]
(2) There exists a car1 such that
[ln(car1)=short][car-shape(car1)=closed-top]=>[dir=east]
(3) [ncar=3] v There exists car1 such that
[car-shape(car1)=jagged-top]=>[dir=west]
(4) There exists car1 such that
[#cars(ln=long)=2][cshape(car1)=open,trapezoid,u-shaped] v
[location(car1)=2][cshape(car1)=closed, rectangle]=>[dir=west]
(The first selector in rule 4 uses a meta descriptor generated by
the program that counts the number of long cars in a train)
[2] The goal of the cluster research is to develop a general method
for clustering structured objects that can generate conjunctive
descriptions that occur in human classifications or invent new
concepts that have similar appeal. CLUSTER/S was able to find the
following cognitively appealing clusters:
1) a) "There are two different car shapes in the train"
b) "There are three or more different car shapes in the train"
2) a) "Wheels on all cars have the same color"
b) "Wheels on all cars do not have the same color"
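To make the reading of the decision rules in [1] concrete, here is a minimal Python sketch of rules (2) and (3). The train representation (a list of per-car dictionaries with `ln` and `cshape` keys) and the function names are assumptions made for illustration; they are not part of the original dataset files. The sketch also assumes that car shapes have already been generalized to `closedtop` where the rule refers to a closed top, and that the list holds every car counted by ncar.

```python
def rule2_predicts_east(train):
    """Rule (2): some car is short and has a closed top."""
    return any(car["ln"] == "short" and car["cshape"] == "closedtop"
               for car in train)


def rule3_predicts_west(train):
    """Rule (3): the train has exactly 3 cars, or some car has a jagged top."""
    return len(train) == 3 or any(car["cshape"] == "jaggedtop" for car in train)


# Hypothetical four-car train, invented for this example only.
example_train = [
    {"ln": "long", "cshape": "engine"},
    {"ln": "short", "cshape": "closedtop"},
    {"ln": "long", "cshape": "openrect"},
    {"ln": "short", "cshape": "ushaped"},
]

print(rule2_predicts_east(example_train))  # True
print(rule3_predicts_west(example_train))  # False
```

Each "There exists" quantifier becomes an `any(...)` over the cars, and the `v` in rule (3) becomes a plain `or`.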
4. Relevant information:
- Additional "background" knowledge is supplied that provides a partial
ordering on some of the attribute values.
- We are providing this dataset both in its original form and in a form
similar to the more typical propositional datasets in our repository.
Since the trains dataset records relations between attributes, this
transformation was somewhat challenging. However, it may offer some
insight into this problem for people who are more familiar with the simple
one-instance-per-line dataset format.
- Hierarchy of values:
if (cshape is one of {openrect,opentrap,ushaped,dblopnrect})
then cshape is opentop
if (cshape is one of {hexagon,ellipse,closedrect,jaggedtop,slopetop,
engine})
then cshape is closedtop
- Prediction task: Determine concise decision rules distinguishing
trains traveling east from those traveling west.
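The hierarchy of values given above can be written as a small lookup. This is a sketch only; the function name `generalize_cshape` and the choice to raise an error on unknown shapes are assumptions, while the shape names are copied from the hierarchy as stated.

```python
OPEN_TOP = {"openrect", "opentrap", "ushaped", "dblopnrect"}
CLOSED_TOP = {"hexagon", "ellipse", "closedrect", "jaggedtop", "slopetop", "engine"}


def generalize_cshape(cshape):
    """Map a specific car shape to its parent value in the hierarchy."""
    if cshape in OPEN_TOP:
        return "opentop"
    if cshape in CLOSED_TOP:
        return "closedtop"
    raise ValueError(f"unknown car shape: {cshape!r}")


print(generalize_cshape("ushaped"))    # opentop
print(generalize_cshape("jaggedtop"))  # closedtop
```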
5. Number of instances: 10
6. Number of attributes:
- 10, not including the class attribute
1. ccont(train idx1, car idx2): car idx2 is contained in train idx1
2. ncar(train idx): # of cars in train idx (int)
3. infront(car idx1, car idx2): relative positions of cars in train
4. loc(car idx): absolute position of car in train (int)
5. nwhl(car idx): # of wheels of car idx (int)
6. ln(car idx): length of car idx (long, short)
7. cshape(car idx): shape of car (engine, dblopenrect,
closedrect, openrect, opentrap, ushaped,
hexagon, ellipse, jaggedtop, slopetop,
opentop, closedtop)
8. npl(car idx): number of loads in car idx
9. lcont(car idx, load idx): description of which cars hold which loads
10. lhshape(load idx): description of load shape (trianglod,
rectanglod, circlelod, hexagonlod)
Class: direction (east, west)
The following format was used for the "transformed" dataset representation
as found in trains.transformed.data (one instance per line):
Attributes: 33
1. Number_of_cars (integer in [3-5])
2. Number_of_different_loads (integer in [1-4])
3-22: 5 attributes for each of cars 2 through 5: (20 attributes total)
- num_wheels (integer in [2-3])
- length (short or long)
- shape (closedrect, dblopnrect, ellipse, engine, hexagon,
jaggedtop, openrect, opentrap, slopetop, ushaped)
- num_loads (integer in [0-3])
- load_shape (circlelod, hexagonlod, rectanglod, trianglod)
23-32: 10 Boolean attributes describing whether 2 types of loads are on
adjacent cars of the train
- Rectangle_next_to_rectangle (0 if false, 1 if true)
- Rectangle_next_to_triangle (0 if false, 1 if true)
- Rectangle_next_to_hexagon (0 if false, 1 if true)
- Rectangle_next_to_circle (0 if false, 1 if true)
- Triangle_next_to_triangle (0 if false, 1 if true)
- Triangle_next_to_hexagon (0 if false, 1 if true)
- Triangle_next_to_circle (0 if false, 1 if true)
- Hexagon_next_to_hexagon (0 if false, 1 if true)
- Hexagon_next_to_circle (0 if false, 1 if true)
- Circle_next_to_circle (0 if false, 1 if true)
33. Class attribute (east or west)
The number of cars varies between 3 and 5. Therefore, attributes referring
to properties of cars that do not exist (such as the 5 attributes for
the "5th" car when the train has fewer than 5 cars) are assigned a value
of "-".
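The 33-attribute layout described above can be generated and used to parse one instance, as sketched below. The attribute names follow the pattern used in the feature listing (`num_wheels_2`, `shape_4`, and so on). The field separator (commas or whitespace), the helper names, and the sample line are assumptions made for illustration; the sample line is invented and is not a row from trains.transformed.data.

```python
import re

CAR_FIELDS = ["num_wheels", "length", "shape", "num_loads", "load_shape"]
LOAD_SHAPES = ["Rectangle", "Triangle", "Hexagon", "Circle"]


def attribute_names():
    """Build the 33 column names in the order given in the description."""
    names = ["Number_of_cars", "Number_of_different_loads"]
    for car in range(2, 6):                      # cars 2 through 5
        names += [f"{field}_{car}" for field in CAR_FIELDS]
    for i, first in enumerate(LOAD_SHAPES):      # 10 unordered shape pairs
        for second in LOAD_SHAPES[i:]:
            names.append(f"{first}_next_to_{second.lower()}")
    names.append("class")
    return names


def parse_instance(line):
    """Split one line into a dict, mapping the '-' placeholder to None."""
    tokens = re.split(r"[,\s]+", line.strip())
    names = attribute_names()
    if len(tokens) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(tokens)}")
    return {name: (None if tok == "-" else tok) for name, tok in zip(names, tokens)}


# Hypothetical three-car train: cars 4 and 5 do not exist, so their fields are "-".
sample = ("3 1 2 short openrect 1 rectanglod 2 long closedrect 1 rectanglod "
          "- - - - - - - - - - 1 0 0 0 0 0 0 0 0 0 east")
row = parse_instance(sample)
print(len(row))        # 33
print(row["shape_4"])  # None
print(row["class"])    # east
```

Values are kept as strings because every attribute is declared nominal; converting the integer-valued fields is left to the caller.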
7. Distribution of classes:
- There are 5 east-bound trains and 5 west-bound trains
(i.e., 50% east, 50% west)
Information about the dataset
CLASSTYPE: nominal
CLASSINDEX: last

Feature | Type | Unique values | Missing values |
---|---|---|---|
class (target) | nominal | 2 | 0 |
Number_of_cars | nominal | 3 | 0 |
Number_of_different_loads | nominal | 4 | 0 |
num_wheels_2 | nominal | 2 | 0 |
length_2 | nominal | 2 | 0 |
shape_2 | nominal | 5 | 0 |
num_loads_2 | nominal | 2 | 0 |
load_shape_2 | nominal | 3 | 0 |
num_wheels_3 | nominal | 2 | 0 |
length_3 | nominal | 2 | 0 |
shape_3 | nominal | 8 | 0 |
num_loads_3 | nominal | 2 | 0 |
load_shape_3 | nominal | 3 | 0 |
num_wheels_4 | nominal | 2 | 3 |
length_4 | nominal | 2 | 3 |
shape_4 | nominal | 4 | 3 |
num_loads_4 | nominal | 3 | 3 |
load_shape_4 | nominal | 4 | 4 |
num_wheels_5 | nominal | 1 | 7 |
length_5 | nominal | 1 | 7 |
shape_5 | nominal | 2 | 7 |
num_loads_5 | nominal | 1 | 7 |
load_shape_5 | nominal | 2 | 7 |
Rectangle_next_to_rectangle | nominal | 2 | 0 |
Rectangle_next_to_triangle | nominal | 2 | 0 |
Rectangle_next_to_hexagon | nominal | 1 | 0 |
Rectangle_next_to_circle | nominal | 2 | 0 |
Triangle_next_to_triangle | nominal | 2 | 0 |
Triangle_next_to_hexagon | nominal | 2 | 0 |
Triangle_next_to_circle | nominal | 2 | 0 |
Hexagon_next_to_hexagon | nominal | 1 | 0 |
Hexagon_next_to_circle | nominal | 2 | 0 |
Circle_next_to_circle | nominal | 1 | 0 |

Description | Value |
---|---|
Minimal mutual information between the nominal attributes and the target attribute. | 0 |
Second quartile (Median) of mutual information between the nominal attributes and the target attribute. | 0.11 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump | 0 |
The minimal number of distinct values among attributes of the nominal type. | 1 |
Second quartile (Median) of skewness among attributes of the numeric type. | 0 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3 | -0.2 |
Maximum mutual information between the nominal attributes and the target attribute. | 1 |
Second quartile (Median) of standard deviation of attributes of the numeric type. | 0 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | 0.76 |
The maximum number of distinct values among attributes of the nominal type. | 8 |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | 0.2 |
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation. | 5.37 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.6 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1 | 0.6 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001 | 0.4 |
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.4 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | 0.76 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes | 0.72 |
Third quartile of mutual information between the nominal attributes and the target attribute. | 0.25 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.2 |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | 0.2 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001 | -0.2 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.6 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2 | 0.6 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001 | 0.4 |
Third quartile of standard deviation of attributes of the numeric type. | 0 |
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.4 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | 0.76 |
Average mutual information between the nominal attributes and the target attribute. | 0.19 |
An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation. | 3.65 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1 | 0.4 |
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.2 |
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | 0.2 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001 | -0.2 |
Average number of distinct values among the attributes of the nominal type. | 2.41 |
First quartile of mutual information between the nominal attributes and the target attribute. | 0.01 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.6 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3 | 0.6 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001 | 0.4 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1 | -0.2 |
Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.4 |
Standard deviation of the number of distinct values among attributes of the nominal type. | 1.41 |
First quartile of standard deviation of attributes of the numeric type. | 0 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2 | 0.4 |
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W | 0.2 |
Second quartile (Median) of kurtosis among attributes of the numeric type. | 0 |
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2 | -0.2 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump | 0.5 |
Second quartile (Median) of means among attributes of the numeric type. | 0 |
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3 | 0.4 |
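
Some of the reported qualities can be cross-checked against each other. The sketch below assumes that entropy is measured in bits and uses the standard definition of Cohen's kappa; the function names are illustrative. Because the classes are split 5/5, the chance agreement is 0.5 regardless of what a classifier predicts, so kappa reduces to 1 - 2 * error rate. That matches the reported pairs, for example error rate 0.2 with kappa 0.6 for RandomTree -depth 1. Likewise, a class entropy of 1 bit divided by the reported 5.37 gives about 0.19, in line with the reported average mutual information.

```python
import math


def class_entropy(class_counts):
    """Shannon entropy of the class distribution, in bits."""
    total = sum(class_counts)
    return -sum((n / total) * math.log2(n / total) for n in class_counts if n > 0)


def kappa_balanced_binary(error_rate):
    """Cohen's kappa for a two-class problem with equal class frequencies.

    With 50/50 true classes the chance agreement is 0.5 whatever the
    classifier predicts, so kappa equals 1 - 2 * error_rate.
    """
    observed = 1.0 - error_rate
    expected = 0.5
    return (observed - expected) / (1.0 - expected)


entropy = class_entropy([5, 5])        # 5 east-bound, 5 west-bound trains
implied_mean_mi = entropy / 5.37       # invert ClassEntropy / MeanMutualInformation

print(entropy)                               # 1.0
print(round(implied_mean_mi, 2))             # 0.19
print(round(kappa_balanced_binary(0.2), 2))  # 0.6
print(round(kappa_balanced_binary(0.4), 2))  # 0.2
```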