Fully reproducible and organized results (e.g. models, predictions) you can build on and compare against
Share your work with the world or within circles of trusted researchers
Make your work more visible and easily citable
Tools to help you design and optimize workflows
In short, OpenML makes it easy to access data, connect to the right people, and automate experimentation, so that you can focus on the data science.
OpenML automatically analyses the data, checks for problems, visualizes it, and computes data characteristics useful to find and compare datasets.
Every data set gets a dedicated page with all known information (check out zoo), including a wiki, visualizations, statistics, user discussions, and the tasks in which it is used.
Currently, OpenML only accepts a limited number of data formats (e.g. ARFF for tabular data). We aim to extend this in the near future, and allow conversions between the main data types.
Tasks are small containers that bundle the data with additional information, such as train/test splits, and define what needs to be returned.
Tasks are machine-readable so that machine learning environments know what to do, and you can focus on finding the best algorithm. You can run algorithms on your own machine(s) and upload the results. OpenML evaluates and organizes all solutions online.
Tasks are real-time, collaborative data mining challenges (e.g. see this one): you can study, discuss and learn from all submissions (code has to be shared), while OpenML keeps track of who was first.
You can also supply hidden test sets for the evaluation of solutions. Novel ways of ranking solutions will be added in the near future.
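To make the idea concrete, a task can be pictured as a small, machine-readable bundle. The sketch below models this with a hypothetical Python dataclass; the field names and values are illustrative, not OpenML's actual task schema.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative sketch of what a task bundles (field names are hypothetical)."""
    task_id: int
    dataset_id: int                        # which data set to use
    target_feature: str                    # what to predict
    train_test_splits: list                # e.g. index pairs for cross-validation
    expected_output: str = "predictions"   # what a submitted run must return

# A toy supervised-classification task on a made-up data set:
task = Task(
    task_id=59,
    dataset_id=61,
    target_feature="class",
    train_test_splits=[([0, 1, 2], [3]), ([1, 2, 3], [0])],
)
print(task.expected_output)  # -> predictions
```

Because everything an algorithm needs is inside the container, a machine learning environment can solve the task without any manual setup.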
Ideally, flows are wrappers around existing algorithms/tools so that they can automatically read and solve OpenML tasks.
Every flow gets a dedicated page with all known information (check out WEKA's RandomForest), including a wiki, hyperparameters, evaluations on all tasks, and user discussions.
Currently, you will need to install things locally to run flows. We aim to add support for VMs so that flows can be easily (re)run in any environment.
Runs are applications of flows on a specific task. They are typically submitted automatically by machine learning environments (through the OpenML API), which include all details needed to ensure reproducibility.
OpenML organizes all runs online, linked to the underlying data, flows, parameter settings, people, and other details. OpenML also independently evaluates the results contained in the run.
You can search and compare everyone's runs online, download all results into your favorite machine learning environment, and relate evaluations to known properties of the data and algorithms.
OpenML stores and analyses results in fine detail, up to the level of individual instances.
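As a toy illustration of why instance-level storage matters: any aggregate metric can be recomputed, or a new one defined, from the individual predictions. All identifiers and values below are made up for illustration.

```python
# Hypothetical instance-level predictions from a single run.
predictions = [
    {"instance": 0, "truth": "mammal", "prediction": "mammal"},
    {"instance": 1, "truth": "bird",   "prediction": "bird"},
    {"instance": 2, "truth": "fish",   "prediction": "bird"},
    {"instance": 3, "truth": "mammal", "prediction": "mammal"},
]

# Recompute an aggregate evaluation from the stored instance-level results.
correct = sum(p["truth"] == p["prediction"] for p in predictions)
accuracy = correct / len(predictions)
print(f"accuracy = {accuracy:.2f}")  # -> accuracy = 0.75
```

Storing results at this granularity also makes it possible to check exactly which instances different flows get wrong.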
OpenML is deeply integrated in several popular machine learning environments. Given a task, these plugins will automatically download the data into the environments, allow you to run any algorithm/flow, and automatically upload all runs.
If you want to integrate OpenML into your own tools, we offer several language-specific APIs, so you can easily interact with OpenML to list, download and upload data sets, tasks, flows and runs.
With these APIs you can download a task, run an algorithm, and upload the results in just a few lines of code.
OpenML also offers a REST API which allows you to talk to OpenML directly.
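The "few lines of code" pattern can be sketched with stand-in functions. Everything below is a stub written for illustration, not the actual OpenML client API; the real calls depend on which language binding you use.

```python
# Stub sketch of the download -> run -> upload pattern.
# download_task, run_algorithm, and upload_run are stand-ins,
# not real OpenML client functions.

def download_task(task_id):
    """Stand-in for fetching a task (data plus splits) from the server."""
    return {"id": task_id, "train": [1, 2, 3], "test": [4]}

def run_algorithm(task):
    """Stand-in for training a flow and predicting on the task's test split."""
    return {"task_id": task["id"], "predictions": [0 for _ in task["test"]]}

def upload_run(run):
    """Stand-in for publishing the run back to the server."""
    return f"uploaded run for task {run['task_id']}"

task = download_task(59)
run = run_algorithm(task)
print(upload_run(run))  # -> uploaded run for task 59
```

The real bindings follow this same three-step shape, with the API handling authentication and serialization behind the scenes.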
You can combine data sets, flows and runs into studies, to collaborate with others online, or simply keep a log of your work.
Each project gets its own page, which can be linked to publications so that others can find all the details online.
You can create circles of trusted researchers in which data can be shared that is not yet ready for publication.
To encourage open science, OpenML now includes altmetrics to track and reward scientific activity, reach and impact, and in the future will include further gamification features such as badges.
OpenML can help you run large experiments. A job is a small container defining a specific flow, with specific parameter settings, to run on a specific task. You can generate batches of these jobs online, and you can run a helper tool on your machines/clouds/clusters that downloads these jobs (including all data), executes them, and uploads the results.
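The helper tool's behavior can be sketched as a simple worker loop. The queue and executor below are stand-ins for the real downloader and runner; the job fields are illustrative.

```python
from collections import deque

# Stand-in job queue: each job names a flow, parameter settings, and a task.
jobs = deque([
    {"flow": "RandomForest", "params": {"n_trees": 100}, "task_id": 59},
    {"flow": "RandomForest", "params": {"n_trees": 500}, "task_id": 59},
])

def execute(job):
    """Stand-in for downloading the data, running the flow, and collecting results."""
    return {"job": job, "result": "ok"}

uploaded = []
while jobs:
    job = jobs.popleft()      # download the next job (including all data)
    result = execute(job)     # execute it locally
    uploaded.append(result)   # upload the results

print(len(uploaded))  # -> 2
```

Because each job is self-contained, many workers can drain the same batch in parallel across machines, clouds, or clusters.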