Ontology of Data Mining Experiments
Larisa Soldatova
Joaquin Vanschoren
Useful for describing the data type of parameters
Useful for describing which algorithms contain which preprocessing/postprocessing steps
Useful for describing which preprocessing application preceded another one
Relationship indicating that the second entity is a component of the first entity
IAO: is about: has description
Relationship indicating that the second entity is a parameter of the first entity
implementation
A realizable entity is something that can be physically implemented. As such, it is a superclass for all implementation-related aspects of DM experiments. It belongs to the BFO ontology.
Pruning technique of the C4.5 algorithm
A parameter often used in certain data mining algorithms. These are parameters that affect the inner working of the algorithm and have primitive values. For structured components used inside data mining algorithms, see DM_component. These are NOT parameters related to implementation-specific aspects of an algorithm (e.g. a debug flag).
OBI:0000283
http://purl.obofoundry.org/obo/iao.owl
abstract specification
An information content entity provides a specification of a certain object. As such, it is a superclass for all bits of information used to describe parts of a DM experiment. It belongs to the IAO ontology.
A planned process is the realization of a plan, executed by an agent to achieve a certain goal. It is a superclass for all application-specific aspects of a DM experiment.
OBI:0000011
http://purl.obolibrary.org/obo/obi.owl
application
An often-used sequence of DM operators
list DM tasks
first level: do you want to do query/experiment/...
Parameter indicating whether or not the algorithm should use binary splits when choosing to split on a nominal feature
A decision tree learner is an inductive algorithm that assumes that the data can be correctly modelled using a decision tree. Each node usually contains a test on one of the features of the given data, and branches out for different outcomes of that test. The learner incrementally expands the tree as more data points are given, refining the initial hypothesis.
Pruning is a post-processing technique used for decision tree learners. It works by deleting (pruning) branches of a decision tree that are not helpful because they cause overfitting: modelling noise instead of the target concept.
The percentage of the original dataset size to which the dataset must be downsampled. E.g. 90 means the resulting dataset will have 90% of the size of the original dataset.
Ensemble algorithms are inductive algorithms that utilise other inductive algorithms (base learners) to provide models (and predictions) for often overlapping parts of the given data and combine their predictions, e.g. by voting, to give the final predictions. Examples are bagging, boosting and stacking.
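To illustrate the voting step, a minimal sketch in Python (not part of the ontology; `majority_vote` and `bagging_predict` are hypothetical names, and the base learners are stand-in callables):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class predictions of several base learners by voting."""
    return Counter(predictions).most_common(1)[0][0]

def bagging_predict(base_learners, x):
    """Each base learner predicts a class for x; the ensemble returns the
    majority class. `base_learners` is any list of callables x -> label."""
    return majority_vote([learn(x) for learn in base_learners])

# Three toy 'learners' that disagree on an example:
learners = [lambda x: "yes", lambda x: "yes", lambda x: "no"]
print(bagging_predict(learners, x=None))  # prints: yes
```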
An entropy-based splitting criterion is a splitting criterion for decision trees based on the entropy of a class distribution. Entropy is a measure of the (im)purity (or randomness) of an arbitrary distribution. Features whose value distributions have low entropy are often good candidates for the next split.
Entropy is a splitting criterion for decision trees. It chooses the feature with the lowest entropy.
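The entropy measure itself is simple to compute; a small sketch (plain Python, entropy in bits; the function name is hypothetical):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class-label distribution:
    0 for a pure set, 1 for a 50/50 binary split."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "+", "+"]))  # pure set    -> 0.0
print(entropy(["+", "+", "-", "-"]))  # 50/50 split -> 1.0
```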
TODO Equal to a heuristic?
A hypothesis optimization learner is an inductive algorithm that assumes the data can be correctly modelled by a given mathematical function, and that adjusts that function (e.g. by adjusting weights) to fit the given data optimally. Examples are linear regression, backpropagation in artificial neural networks, Bayes' rule in Bayesian methods and the kernel trick in kernel methods.
A hypothesis refinement learner is an inductive algorithm that assumes the data can be correctly modelled by a given model, and that incrementally adjusts that model as it sees new data points (observations). Examples are decision tree learners, covering learners (rule learners), kMeans and logical induction.
E.g. winnow in case of irrelevant dimensions
An inductive algorithm is an algorithm that, given a number of observations, can predict the outcome of future observations and/or can plan future actions based on those observations.
Information gain ratio is a splitting criterion that normalizes the information gain by the 'split information': the entropy of the partition produced by the test on the attribute, based on the number and size of the resulting branches. It biases the decision tree against considering attributes with a large number of distinct values.
It solves a drawback of information gain: attributes with large numbers of distinct values (e.g. credit card numbers) usually have high information gain, but tend to overfit the data. An alternative solution is using binary splits for nominal attributes.
Information gain is a splitting criterion for decision trees. It calculates the expected reduction in entropy caused by knowing the value of a certain feature. Splitting the data on the feature with the highest information gain thus yields the largest drop in entropy.
In information theory, 'information gain' is also used as a synonym for the Kullback-Leibler divergence.
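As a sketch of how information gain can be computed (plain Python; `information_gain` is a hypothetical helper, and `entropy` is redefined here to keep the example self-contained):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class-label distribution."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Expected reduction in entropy from splitting on a feature.
    feature_values[i] holds the feature value of example i."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for l, fv in zip(labels, feature_values) if fv == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A feature that separates the two classes perfectly gains a full bit:
print(information_gain(["+", "+", "-", "-"], ["a", "a", "b", "b"]))  # -> 1.0
```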
A lazy learner is an inductive algorithm that does not try to model the given data. Instead, it keeps all observations in memory and bases its predictions on 'similar' prior observations.
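A one-nearest-neighbour classifier is perhaps the simplest lazy learner; a minimal sketch (1-D points for brevity; all names are hypothetical):

```python
def one_nn_predict(memory, x):
    """Lazy learning: no model is built. All observations are kept in
    `memory` as (point, label) pairs; a query is answered with the label
    of the closest stored point (absolute distance, 1-D for brevity)."""
    _, label = min(memory, key=lambda obs: abs(obs[0] - x))
    return label

memory = [(1.0, "low"), (2.0, "low"), (9.0, "high")]
print(one_nn_predict(memory, 8.5))  # prints: high
```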
A parameter of a decision tree learner. It defines the minimum number of instances that must end up in each leaf: the algorithm only splits a node into several branches if every resulting leaf contains at least this many instances.
The number of folds in which the training data will be split for reduced error pruning. One fold will be used for pruning, the others for training. The inverse of this number is the percentage of data set aside for pruning.
TODO The confidence threshold for deciding whether a node in a decision tree should be pruned. It is part of a chi squared test?
Randomly selects instances from the original dataset. It is possible that the same instance is selected twice.
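A sketch of such sampling with replacement (plain Python; the function name is hypothetical):

```python
import random

def random_sample(dataset, size):
    """Draw `size` instances uniformly at random, with replacement:
    the same instance may be selected more than once."""
    return [random.choice(dataset) for _ in range(size)]

data = list(range(10))
print(random_sample(data, 10))  # e.g. [3, 3, 7, 0, ...] - duplicates possible
```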
Used to counter overfitting: yields the smallest version of the most accurate subtree. However, since it claims part of the training data, the tree's performance will suffer if not enough data is available.
Reduced error pruning is a decision tree pruning technique in which a part of the training data is set aside (the validation set). The decision tree is grown on the remaining data, and afterwards, the performance of the tree is tested against versions of the tree in which branches are pruned. It greedily removes the branch whose removal most improves accuracy on the validation set.
A splitting criterion is a heuristic that defines which node in a decision tree should be split up next.
Randomly selects instances from the original dataset, but ensures that the class distribution is roughly the same as in the original dataset. It is possible that the same instance is selected twice.
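A sketch of stratified sampling with replacement (plain Python; all names are hypothetical). Each class contributes a share of the sample proportional to its share of the dataset:

```python
import random
from collections import defaultdict

def stratified_sample(instances, labels, size):
    """Sample with replacement while keeping the class distribution
    roughly the same as in the original dataset."""
    by_class = defaultdict(list)
    for inst, lab in zip(instances, labels):
        by_class[lab].append(inst)
    n = len(instances)
    sample = []
    for pool in by_class.values():
        k = round(size * len(pool) / n)          # proportional share
        sample.extend(random.choice(pool) for _ in range(k))
    random.shuffle(sample)
    return sample

data   = list(range(8))
labels = ["a"] * 6 + ["b"] * 2                   # 75% / 25% class split
print(stratified_sample(data, labels, 8))        # ~6 'a'-items, ~2 'b'-items
```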