Guide

OpenML aims to create a frictionless, collaborative environment for exploring machine learning:

Data sets and workflows from various sources analysed and organized online for easy access

Integrated into machine learning environments for automated experimentation, logging, and sharing

Fully reproducible and organized results (e.g. models, predictions) you can build on and compare against

Share your work with the world or within circles of trusted researchers

Make your work more visible and easily citable

Tools to help you design and optimize workflows

In short, OpenML makes it easy to access data, connect to the right people, and automate experimentation, so that you can focus on the data science.

Data

You can upload data sets through the website or the API. Data hosted elsewhere can be referenced by URL.

OpenML automatically analyses the data, checks for problems, visualizes it, and computes data characteristics that help you find and compare data sets.

Every data set gets a dedicated page with all known information (see the zoo data set, for example), including a wiki, visualizations, statistics, user discussions, and the tasks in which it is used.

Currently, OpenML only accepts a limited number of data formats (e.g. ARFF for tabular data). We aim to extend this in the near future, and allow conversions between the main data types.

Tasks

Tasks describe what to do with the data. OpenML covers several task types, such as classification and clustering. You can create tasks online.

Tasks are small containers that bundle the data with other information, such as train/test splits, and define what needs to be returned.
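
To make the idea of a task as a container concrete, here is a minimal sketch. The `Task` class, its field names, and the `k_fold_splits` helper are hypothetical illustrations, not OpenML's actual task format, and the ten-row split is a toy example rather than the real size of any data set.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One (train_rows, test_rows) pair per fold.
Split = Tuple[List[int], List[int]]


@dataclass(frozen=True)
class Task:
    """Hypothetical, simplified stand-in for a task description."""
    task_type: str                       # e.g. "classification"
    dataset_name: str                    # which data set the task uses
    target: str                          # attribute to predict (assumed name)
    splits: List[Split]                  # fixed train/test splits
    expected_output: str = "predictions"  # what a solution must return


def k_fold_splits(n_rows: int, k: int) -> List[Split]:
    """Deterministic k-fold splits: row i is tested in fold i % k."""
    folds = []
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        folds.append((train, test))
    return folds


task = Task(
    task_type="classification",
    dataset_name="zoo",
    target="type",
    splits=k_fold_splits(n_rows=10, k=5),
)
```

Because the splits are stored with the task rather than generated by each user, everyone who solves the task trains and tests on the same rows, which is what makes results comparable.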

Tasks are machine-readable so that machine learning environments know what to do, and you can focus on finding the best algorithm. You can run algorithms on your own machine(s) and upload the results. OpenML evaluates and organizes all solutions online.

Tasks are real-time, collaborative data mining challenges: you can study, discuss, and learn from all submissions (code has to be shared), while OpenML keeps track of who was first.

You can also supply hidden test sets for the evaluation of solutions. Novel ways of ranking solutions will be added in the near future.

Flows

Flows are algorithms, workflows, or scripts that solve tasks. You can upload them through the website or the API. Code hosted elsewhere (e.g., GitHub) can be referenced by URL.

Ideally, flows are wrappers around existing algorithms/tools so that they can automatically read and solve OpenML tasks.

Every flow gets a dedicated page with all known information (see WEKA's RandomForest, for example), including a wiki, hyperparameters, evaluations on all tasks, and user discussions.

Currently, you need to install the required software locally to run flows. We aim to add support for VMs so that flows can be easily (re)run in any environment.

Runs

Runs are applications of flows on a specific task. They are typically submitted automatically by machine learning environments (through the OpenML API), which make sure that all details are included to ensure reproducibility.

OpenML organizes all runs online, linked to the underlying data, flows, parameter settings, people, and other details. OpenML also independently evaluates the results contained in the run.

You can search and compare everyone's runs online, download all results into your favorite machine learning environment, and relate evaluations to known properties of the data and algorithms.

OpenML stores and analyses results in fine detail, up to the level of individual instances.
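
As a rough illustration of what a run carries and how instance-level evaluation might work, consider the sketch below. The `Run` class, its fields, the parameter name `n_trees`, the task ID, and the labels are all made up for this example; OpenML's real run format and evaluation measures are richer than this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    """Hypothetical record of one flow applied to one task."""
    flow_name: str
    task_id: int
    parameters: Dict[str, object]  # exact settings, kept for reproducibility
    predictions: List[str]         # one predicted label per test instance


def evaluate(run: Run, true_labels: List[str]) -> Dict[str, object]:
    """Score a run independently of whoever submitted it."""
    if not true_labels or len(run.predictions) != len(true_labels):
        raise ValueError("need exactly one prediction per test instance")
    # Instance-level detail: was each individual prediction correct?
    per_instance = [p == t for p, t in zip(run.predictions, true_labels)]
    # Aggregate measure derived from the per-instance results.
    accuracy = sum(per_instance) / len(per_instance)
    return {"per_instance_correct": per_instance, "accuracy": accuracy}


run = Run(
    flow_name="RandomForest",
    task_id=1,
    parameters={"n_trees": 100},
    predictions=["mammal", "bird", "fish", "bird"],
)
result = evaluate(run, true_labels=["mammal", "bird", "fish", "mammal"])
```

Keeping the per-instance results alongside the aggregate score is what allows questions such as "which instances does every algorithm get wrong?" to be answered later.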

Plugins

OpenML is deeply integrated into several popular machine learning environments. Given a task, these plugins automatically download the data into the environment, let you run any algorithm/flow, and automatically upload all runs.

Programming APIs

If you want to integrate OpenML into your own tools, we offer several language-specific APIs, so you can easily interact with OpenML to list, download, and upload data sets, tasks, flows, and runs.

With these APIs you can download a task, run an algorithm, and upload the results in just a few lines of code.
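
As one possible illustration of the download, run, and upload cycle, the sketch below assumes the `openml` Python package together with scikit-learn. The function name `download_run_upload` is invented for this example, the task ID and API key must be supplied by the caller, and call signatures may differ between package versions, so check the package documentation before relying on it.

```python
def download_run_upload(task_id: int, api_key: str):
    """Download a task, run a model on it, and upload the resulting run.

    Assumes the `openml` and `scikit-learn` packages are installed and
    that `api_key` is a valid key for your own OpenML account.
    """
    # Imported here so that defining the function needs no extra packages.
    import openml
    from sklearn.tree import DecisionTreeClassifier

    openml.config.apikey = api_key

    # 1. Download the task (data, splits, and target are resolved for you).
    task = openml.tasks.get_task(task_id)

    # 2. Run an algorithm on the task's predefined train/test splits.
    model = DecisionTreeClassifier()
    run = openml.runs.run_model_on_task(model=model, task=task)

    # 3. Upload the results so they are evaluated and organized online.
    run.publish()
    return run
```

The function is only defined here, not called, because calling it contacts the server and publishes a run under your account.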

OpenML also offers a REST API which allows you to talk to OpenML directly.

Studies (under construction)

You can combine data sets, flows and runs into studies, to collaborate with others online, or simply keep a log of your work.

Each study gets its own page, which can be linked to publications so that others can find all the details online.

Circles (under construction)

You can create circles of trusted researchers in which data can be shared that is not yet ready for publication.

Altmetrics (under construction)

To encourage open science, OpenML now includes altmetrics to track and reward scientific activity, reach and impact, and in the future will include further gamification features such as badges.

Jobs (under construction)

OpenML can help you run large experiments. A job is a small container defining a specific flow, with specific parameter settings, to run on a specific task. You can generate batches of these jobs online, and you can run a helper tool on your machines, clouds, or clusters that downloads these jobs (including all data), executes them, and uploads the results.
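
Generating a batch of jobs amounts to enumerating every combination of task and parameter setting for a flow. The sketch below shows one way this could look; the `generate_jobs` function, the dictionary layout, and the parameter names are hypothetical and do not describe OpenML's actual job format.

```python
from itertools import product
from typing import Dict, List, Sequence


def generate_jobs(
    flow: str,
    task_ids: Sequence[int],
    param_grid: Dict[str, Sequence[object]],
) -> List[dict]:
    """Create one job per (task, parameter setting) combination."""
    names = sorted(param_grid)  # fixed order keeps the batch reproducible
    jobs = []
    for task_id in task_ids:
        # Cartesian product over all candidate values of every parameter.
        for values in product(*(param_grid[name] for name in names)):
            jobs.append({
                "flow": flow,
                "task_id": task_id,
                "parameters": dict(zip(names, values)),
            })
    return jobs


jobs = generate_jobs(
    flow="RandomForest",
    task_ids=[1, 2],
    param_grid={"n_trees": [10, 100], "max_depth": [3, 5, 7]},
)
```

Each job is self-describing, so a helper tool on any machine can pick one up, execute it, and upload the result without knowing about the rest of the batch.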