mlr3: A modern object-oriented machine learning framework in R

Summary
The R (R Core Team, 2019) package mlr3 and its associated ecosystem of extension packages implement a powerful, object-oriented, and extensible framework for machine learning (ML) in R. It provides a unified interface to many learning algorithms available on CRAN, augmenting them with model-agnostic, general-purpose functionality that is needed in every ML project, for example train-test-evaluation, resampling, preprocessing, hyperparameter tuning, nested resampling, and visualization of results from ML experiments. The package is a complete reimplementation of the mlr (Bischl et al., 2016) package that leverages many years of experience and learned best practices to provide a state-of-the-art system that is powerful, flexible, extensible, and maintainable. We target both practitioners who want to quickly apply ML algorithms to their problems and researchers who want to implement, benchmark, and compare their new methods in a structured environment. mlr3 is suitable for short scripts that test an idea, for complex multi-stage experiments that use a broad range of ML functionality, as a foundation to implement new ML (meta-)algorithms (for example AutoML systems), and everything in between. Functional correctness is ensured through extensive unit and integration tests.
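As an illustration of the train-test-evaluation and resampling workflow mentioned above, the following is a minimal sketch rather than a definitive recipe. It assumes a recent mlr3 release in which the short-hand constructors tsk(), lrn(), msr(), and rsmp() are available; the 80/20 split, the seed, and the choice of the iris task and the rpart learner are arbitrary choices made for this example.

```r
library(mlr3)

# Task: the iris classification problem bundled with mlr3.
task = tsk("iris")

# Learner: a classification tree, one of the learners in the core package.
learner = lrn("classif.rpart")

# Illustrative 80/20 holdout split (assumption of this sketch).
set.seed(1)
train_ids = sample(task$row_ids, size = 0.8 * task$nrow)
test_ids = setdiff(task$row_ids, train_ids)

# Train on the training rows, predict on the held-out rows, and score.
learner$train(task, row_ids = train_ids)
prediction = learner$predict(task, row_ids = test_ids)
acc = prediction$score(msr("classif.acc"))

# The same learner evaluated with 3-fold cross-validation.
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling)
cv_error = rr$aggregate(msr("classif.ce"))
```

The task, learner, resampling, and measure are all separate objects, so any of them can be swapped without touching the rest of the script.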

Lessons Learned from 6 Years of Machine Learning in R
The predecessor package mlr was first released to CRAN in 2013, with the core design and architecture dating back much further. As with most software, more code was added over time to integrate more ML algorithms, more approaches for feature selection or hyperparameter tuning, more methods to analyze trained models, and many other things. With each addition, the code base became larger and more difficult to test and maintain, in particular as changes in the dozens of packages that we integrated with mlr would break our code and prevent releases. Installing the package with all dependencies and a complete build with all tests would take hours; we had arrived at a point where adding any new functionality became a major undertaking. Further, some of the architectural and design decisions made it essentially impossible to support new cross-cutting functionality, for example ML pipelines, or using new R packages for better performance.
mlr3 takes these lessons learned to heart and now follows these design principles:
• Be modular and light on dependencies. The core mlr3 package provides only the basic building blocks of ML: tasks, a few learners, resampling methods, and performance measures. Everything else can be installed and loaded separately through additional packages in the mlr3 ecosystem, for example support for other kinds of data, methods for tuning hyperparameters, or integrations for additional ML packages.
• Leverage modern R packages, especially data.table for fast and efficient computations on rectangular data.
• Embrace R6 for a clean object-oriented design, object state changes, and reference semantics.
• Defensive programming and type safety. All user input is checked with checkmate (Lang, 2017). Return types are documented and automatic type casting for "simplification" is avoided.
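To make the R6 and checkmate principles above concrete, here is a small sketch. The class MajorityLearner is hypothetical and is not part of mlr3; it only illustrates how reference semantics and input checks can be combined, using the R6 and checkmate packages directly.

```r
library(R6)
library(checkmate)

# Hypothetical toy class (not an mlr3 class): predicts the most frequent label.
MajorityLearner = R6Class("MajorityLearner",
  public = list(
    model = NULL,

    train = function(labels) {
      # Defensive programming: accept only a non-empty factor without NAs.
      assert_factor(labels, any.missing = FALSE, min.len = 1L)
      self$model = names(which.max(table(labels)))
      invisible(self)
    },

    predict = function(n) {
      # n must be a single non-negative integerish number.
      assert_count(n)
      if (is.null(self$model)) stop("Learner has not been trained yet")
      rep(self$model, n)
    }
  )
)

learner = MajorityLearner$new()

# Reference semantics: 'alias' points to the same object, so training
# through 'alias' changes the state seen through 'learner' as well.
alias = learner
alias$train(factor(c("a", "b", "b")))
pred = learner$predict(2)

# Invalid input (a numeric vector) is rejected by the checkmate assertion.
bad = tryCatch(learner$train(c(1, 2)), error = function(e) "rejected")
```

Because the object is modified in place, no copy of the trained state needs to be returned and reassigned, which is the behaviour the "object state changes" principle refers to.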
In addition, we simplified the API considerably by unifying container and result classes. Many result objects are now tabular by mixing data.table's list-column feature with R6 objects, which also allows for easy and efficient selection and "split-apply-combine" type operations.
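The idea of tabular results with list columns can be sketched with data.table and R6 alone. The Model class, the column names, and the error values below are invented for illustration and do not reproduce mlr3's actual result classes.

```r
library(data.table)
library(R6)

# Hypothetical stand-in for a fitted model object (not an mlr3 class).
Model = R6Class("Model",
  public = list(
    error = NULL,
    initialize = function(error) {
      self$error = error
    }
  )
)

# One row per (learner, fold); the 'model' column is a list column of R6 objects.
results = data.table(
  learner = c("tree", "tree", "knn", "knn"),
  fold = c(1L, 2L, 1L, 2L),
  model = list(Model$new(0.10), Model$new(0.20), Model$new(0.30), Model$new(0.50))
)

# Selection: plain data.table row filtering still works on the table.
tree_rows = results[learner == "tree"]

# Extract a scalar from every object in the list column.
results[, error := vapply(model, function(m) m$error, numeric(1))]

# Split-apply-combine: mean error per learner.
by_learner = results[, .(mean_error = mean(error)), by = learner]
```

Keeping the objects in a list column means that the full object remains accessible for each row, while filtering and aggregation use ordinary data.table syntax.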

Ecosystem
In addition to the main mlr3 package, mlr3learners provides integrations to a careful selection of the most important ML algorithms and packages in R. Complex ML workflows (using directed acyclic graphs) that can incorporate preprocessing, (stacking) ensembles, alternative-branch execution, and much more can be built with the mlr3pipelines package. Functionality for hyperparameter tuning and nested resampling of learners and complex pipelines is provided by the mlr3tuning package. mlr3filters integrates many feature filtering techniques and mlr3db allows direct use of databases as data sources for out-of-memory data. We are planning and working on many more packages, for example for Bayesian optimization, Hyperband, probabilistic regression, survival analysis, and spatial and temporal data. A complete list of existing and planned extension packages can be found on the mlr3 wiki.
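As a rough sketch of how a preprocessing pipeline of the kind described above might be assembled, the example below chains two preprocessing steps and a learner into a linear graph and evaluates the result like any other learner. It assumes a recent mlr3pipelines release providing po(), the %>>% operator, and GraphLearner; the particular steps (scaling, PCA, a classification tree) are arbitrary choices for this example.

```r
library(mlr3)
library(mlr3pipelines)

# Linear graph: scale features, project onto principal components, fit a tree.
graph = po("scale") %>>%
  po("pca") %>>%
  po("learner", lrn("classif.rpart"))

# Wrap the graph so that it behaves like an ordinary learner.
graph_learner = GraphLearner$new(graph)

# Resampling the whole pipeline refits the preprocessing steps on the
# training part of every fold, not on the full data set.
set.seed(1)
rr = resample(tsk("iris"), graph_learner, rsmp("cv", folds = 3))
cv_error = rr$aggregate(msr("classif.ce"))
```

Treating the whole graph as a learner is what allows resampling and tuning to be applied to complete pipelines rather than to the final model only.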
mlr3 and its ecosystem are documented in numerous manual pages and a comprehensive book (work in progress). All packages are licensed under the GNU Lesser General Public License (LGPL-3).