Holdout, or random subsampling, is a technique for evaluating predictive models by partitioning the original sample into a training set, used to train the model, and a test set, used to evaluate it.
In a k% holdout, the original sample is randomly partitioned into a test set containing k% of the input sample and a training set containing the remaining (100-k)%. Sampling is done without replacement. The holdout is usually repeated n times, yielding n random partitions of the original sample. The n results are averaged (or otherwise combined) to produce a single estimate.
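As an illustration, repeated holdout can be sketched in a few lines of Python. This is a minimal sketch, not the procedure OpenML itself uses: the function names `repeated_holdout` and `average_score`, the `seed` parameter, and the rounding of the test set size are all assumptions made for the example.

```python
import random

def repeated_holdout(n_rows, test_percent, n_repeats, seed=0):
    """Return n_repeats (train_ids, test_ids) pairs for a test_percent% holdout.

    Hypothetical helper: the test set size is rounded to the nearest row.
    """
    rng = random.Random(seed)
    n_test = round(n_rows * test_percent / 100)
    splits = []
    for _ in range(n_repeats):
        ids = list(range(n_rows))
        rng.shuffle(ids)  # a random permutation samples without replacement
        test_ids = sorted(ids[:n_test])
        train_ids = sorted(ids[n_test:])
        splits.append((train_ids, test_ids))
    return splits

def average_score(splits, evaluate):
    """Combine the per-repeat results by taking their mean."""
    scores = [evaluate(train_ids, test_ids) for train_ids, test_ids in splits]
    return sum(scores) / len(scores)
```

Here `evaluate` stands for any user-supplied function that trains a model on the training rows and returns a score on the test rows. Each repeat draws a fresh partition, so a given row may land in the test set of several repeats or of none.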
For classification problems, one typically uses stratified sampling, so that the test set contains roughly the same proportions of class labels as the original sample.
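One simple way to stratify is to draw the holdout separately within each class, as in the sketch below. The function name `stratified_holdout` and the per-class rounding are assumptions made for illustration; other implementations may allocate rows differently.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_percent, seed=0):
    """Split row ids so each class contributes about test_percent% to the test set.

    Hypothetical helper: rows are identified by their position in `labels`.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row_id, label in enumerate(labels):
        by_class[label].append(row_id)

    train_ids, test_ids = [], []
    for ids in by_class.values():
        rng.shuffle(ids)  # sample without replacement within the class
        n_test = round(len(ids) * test_percent / 100)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return sorted(train_ids), sorted(test_ids)
```

Because the test size is rounded per class, the overall test set can differ from k% of the sample by a few rows, which is why the class proportions are only roughly preserved.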
OpenML generates train-test splits given the percentage size of the holdout and the number of repeats, so that different users can evaluate their models with the same splits. Stratification is applied by default for classification problems (unless otherwise specified). The splits are given as part of the task description as an ARFF file with the row id, fold number (0/1), repeat number and the class (TRAIN or TEST). The uploaded predictions should be labeled with the fold and repeat number of the test instance, so that the results can be properly evaluated and aggregated. OpenML stores both the per fold/repeat results and the aggregated scores.
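The sketch below shows how such a splits file might be read and grouped by repeat and fold. It is a minimal parser for simple, dense ARFF only, and the attribute names `type`, `rowid`, `repeat` and `fold` are assumptions; check them against the header of the actual splits file. The `EXAMPLE` data is made up for illustration.

```python
from collections import defaultdict

# Made-up example file; attribute names are assumed, not taken from a real task.
EXAMPLE = """\
@relation splits
@attribute type {TRAIN,TEST}
@attribute rowid numeric
@attribute repeat numeric
@attribute fold numeric
@data
TRAIN,0,0,0
TRAIN,1,0,0
TEST,2,0,0
TRAIN,2,1,0
TEST,0,1,0
TRAIN,1,1,0
"""

def parse_splits(arff_text):
    """Group row ids by (repeat, fold) and by TRAIN/TEST."""
    columns, in_data = [], False
    splits = defaultdict(lambda: {"TRAIN": [], "TEST": []})
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            columns.append(line.split()[1].lower())  # column order from header
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip().strip("'\"") for v in line.split(",")]
            row = dict(zip(columns, values))
            key = (int(row["repeat"]), int(row["fold"]))
            splits[key][row["type"]].append(int(row["rowid"]))
    return dict(splits)

splits = parse_splits(EXAMPLE)
```

A model would then be trained on the `TRAIN` rows of each `(repeat, fold)` key, and each prediction for a `TEST` row would be labeled with that same repeat and fold number before upload.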