MLJ: A Julia package for composable Machine Learning

MLJ (Machine Learning in Julia) is an open source software package providing a common interface for interacting with machine learning models written in Julia and other languages. It provides tools and meta-algorithms for selecting, tuning, evaluating, composing and comparing those models, with a focus on flexible model composition. In this design overview we detail the chief novelties of the framework, together with the clear benefits of Julia over the dominant multi-language alternatives.

Figure 1: Part of the scientific type hierarchy.

[...] dispatch, Julia solves the ubiquitous "two language problem" [7]. With less technical programming knowledge, experts in a domain of application can get under the hood of machine learning software to broaden its applicability, and innovation can be accelerated through a dramatically reduced software development cycle.
As an example of the productivity boost provided by the single-language paradigm, we cite the DifferentialEquations.jl package [8], which, in a few short years of development by a small team of domain experts, became the best package in its class [9].
Another major advantage of a single-language solution is the ability to automatically differentiate (AD) functions from their code representations. The Flux.jl package [10], for example, already makes use of AD to allow unparalleled flexibility in neural network design.
As a new language, Julia is high-performance computing-ready, and its superlative metaprogramming features allow developers to create domain-specific syntax for user interaction.

Novelties
In line with current trends in "auto-ML", MLJ's design is largely predicated on the importance of model composability. Composite models share all the behaviour of regular models, constructed using a new flexible "learning networks" syntax. Unlike the toolboxes cited above, MLJ's composition syntax is flexible enough to define stacked models, with out-of-sample predictions for the base learners, as well as more routine linear pipelines, which can include target transformations that are learned. As in mlr, hyper-parameter tuning is implemented as a model wrapper.
In MLJ, probabilistic prediction is treated as a first-class feature, leveraging Julia's type system. In particular, unnecessary case-distinctions, and ambiguous conventions regarding the representation of probabilities, are avoided.
A user can connect models directly to tabular data in a variety of in-memory and out-of-memory formats, and usability is enhanced through the introduction of "scientific types", which allow the user to focus on the intended purpose of data ("continuous", "ordered factor", etc.) rather than particular machine type representations.
Finally, with the help of scientific types and the CategoricalArrays.jl package [11], users are guided to create safe representations of categorical data, in which the complete pool of possible classes is embedded in the data representation, and classifiers preserve this information when making predictions. This avoids a pain-point familiar in environments that simply recast categorical data using integers (e.g., scikit-learn): evaluating a classifier on the test target, only to find the test data includes classes not seen in the training data. Preservation of the original labels for these classes also facilitates exploratory data analysis and interpretability.

Scientific types
To help users focus less on data representation (e.g., Float32, CategoricalValue{Char,UInt8} or DataFrame) and more on intended interpretation (such as "continuous", "ordered factor" and "table"), MLJ articulates model data requirements, as well as data pre-processing tasks, using scientific types. A scientific type is an ordinary Julia type (generally without instances) reserved for indicating how some data should be interpreted.
Some of these types (provided by an external package for re-use elsewhere) are shown in Figure 1.
To the scientific types, MLJ adds a specific convention specifying a scientific type for every Julia object. The convention is expressed through a single method, scitype. A coerce method recasts machine types to have the desired scientific type (interpretation), and a schema method summarizes the machine and scientific types of tabular data, for example:

fixed_table = coerce(column_table, :age=>Continuous, :query=>Multiclass)
schema(fixed_table)

Since scientific types are also Julia types, they can be organized in a type hierarchy using Julia's advanced type system. It is straightforward to check the compatibility of data with a model's scientific requirements, and methods can be dispatched on scientific type just as they would be on ordinary types.
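The idea that scientific types are ordinary Julia types, usable for compatibility checks and dispatch, can be illustrated with a minimal, self-contained sketch. It does not use MLJ or its scientific types package: the hierarchy below is a simplified, non-parametric stand-in for Figure 1, and the names toy_scitype, toy_coerce, elscitype, accepts and kind are hypothetical helpers introduced only for illustration.

```julia
# Simplified stand-in for part of the scientific type hierarchy (illustrative only).
abstract type Found end
abstract type Known <: Found end
abstract type Infinite <: Known end
abstract type Finite <: Known end
abstract type Continuous <: Infinite end
abstract type Count <: Infinite end
abstract type Multiclass <: Finite end
abstract type OrderedFactor <: Finite end

# A toy convention assigning a scientific type to machine values (hypothetical).
toy_scitype(::AbstractFloat) = Continuous
toy_scitype(::Integer) = Count

# Scientific type of a column: the union of its element scientific types.
elscitype(v) = mapreduce(toy_scitype, (a, b) -> Union{a, b}, v; init = Union{})

# Recast integer data so that it is interpreted as continuous.
toy_coerce(v::AbstractVector{<:Integer}, ::Type{Continuous}) = float.(v)

# Compatibility of data with a requirement is an ordinary subtype check.
accepts(required, v) = elscitype(v) <: required

# Methods can dispatch on scientific types like on any other types.
kind(::Type{<:Infinite}) = "numeric"
kind(::Type{<:Finite}) = "categorical"

age = [23, 45, 34]
fixed_age = toy_coerce(age, Continuous)
```

In this sketch the integer column is rejected by a requirement of Continuous until it is coerced, mirroring the role coerce plays in the text.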

Connecting models directly to arbitrary data containers
MLJ models generally expect features (and multivariate target data) to be tabular (rather than a raw matrix, for example). While there are many options for storing tabular data in Julia, both in memory and on disk, these can be accessed using a common interface provided by the Tables.jl package [12]. In MLJ, any Tables.jl-compatible table has scientific type Table{K}, where the type parameter K is the union of the column scientific types; most models allow the scientific type for input features to be some subtype of Table. While internally many models convert tabular data to matrices, a lightweight table wrapper for matrices provided by Tables.jl means that type coercion is skipped by the compiler in the case of matrix input (as readily verified using Julia's code inspection macro @code_llvm).

Finding the right model
A model registry gives the user access to model metadata without the need to actually load code defining the model implementation. This metadata includes the model's data requirements (framed in terms of scientific types), the names and types of hyper-parameters, a brief document string, the url for the providing package, open source license and a load path to enable MLJ to locate the model interface code.
Such information allows users to match models to machine learning tasks, facilitating searches for an optimal model. For example, to find all supervised models making probabilistic predictions, compatible with input data X and target y, one defines a filter and lists the matching models with models(task):

task(model) = matching(model, X, y) && model.prediction_type == :probabilistic
models(task)
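To make the filtering idea concrete, here is a small sketch of a metadata registry queried with a predicate. It is not MLJ's registry: the records, the model names, and the helpers toy_matching and toy_models are hypothetical, and two bare abstract types stand in for scientific types.

```julia
# Stand-ins for two scientific types (illustrative only).
abstract type Continuous end
abstract type Multiclass end

# Hypothetical metadata records; no model code is needed to query them.
registry = [
    (name = "ToyTreeClassifier", input = Continuous, target = Multiclass,
     prediction_type = :probabilistic),
    (name = "ToySVMClassifier", input = Continuous, target = Multiclass,
     prediction_type = :deterministic),
    (name = "ToyRidgeRegressor", input = Continuous, target = Continuous,
     prediction_type = :deterministic),
]

# A model matches if the data's scientific types satisfy its requirements.
toy_matching(model, Xtype, ytype) = Xtype <: model.input && ytype <: model.target

# List the names of all registered models passing a user-supplied filter.
toy_models(filter_fn) = [m.name for m in registry if filter_fn(m)]

# Filter in the style of the text: compatible data and probabilistic predictions.
task(model) = toy_matching(model, Continuous, Multiclass) &&
              model.prediction_type == :probabilistic

selected = toy_models(task)
```

Because the filter is an ordinary function of the metadata, any combination of criteria can be expressed without loading model implementations.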

Flexible and compact work-flows for performance evaluation and tuning
Evaluating the performance of some model object (specifying the hyper-parameters of some supervised learning algorithm), using some specified resampling strategy and measured against some battery of performance measures, takes a single call to evaluate.
As in mlr, hyper-parameter optimization is realized as a model wrapper, which transforms a base model into a "self-tuning" version of that model. That is, tuning is abstractly specified before being executed. This allows tuning to be integrated into work-flows (learning networks) in multiple ways. A well-documented tuning interface [13] allows developers to easily extend available hyper-parameter tuning strategies.
We now give an example of syntax for wrapping a model called forest_model in a random search tuning strategy, using cross-validation, and optimizing the mean square loss. The model in this case is a composite model with an ordinary hyper-parameter called bagging_fraction and a nested hyper-parameter atom.n_subfeatures (where atom is another model). The first two lines of code define ranges for these parameters.
In this random search example, default priors are assigned to each hyper-parameter, but options exist to customize these. Both resampling and tuning have options for parallelization; Julia has first-class support for both distributed and multi-threaded parallelism.
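The notion of tuning as a model wrapper can be sketched without MLJ. In the toy code below, a wrapper struct stores a base model, a range for one hyper-parameter, and the number of random search iterations; nothing is tuned until the wrapper itself is fit. All names (ToyRidge, ToyTuned, toy_fit, toy_predict) are hypothetical, a uniform prior is assumed, and a single 70/30 holdout split replaces cross-validation for brevity.

```julia
using LinearAlgebra, Random, Statistics

# Base model: a struct holding one hyper-parameter and nothing else.
mutable struct ToyRidge
    lambda::Float64
end
toy_fit(m::ToyRidge, X, y) = (X' * X + m.lambda * I) \ (X' * y)
toy_predict(m::ToyRidge, coefs, X) = X * coefs

# Wrapper: a "self-tuning" model, specified before any tuning is executed.
struct ToyTuned
    model::ToyRidge
    lower::Float64   # range for lambda
    upper::Float64
    n::Int           # number of random search iterations
    seed::Int
end

function toy_fit(t::ToyTuned, X, y)
    rng = MersenneTwister(t.seed)
    nrows = size(X, 1)
    perm = randperm(rng, nrows)
    ntrain = round(Int, 0.7 * nrows)
    train, test = perm[1:ntrain], perm[ntrain+1:end]
    best_lambda, best_loss = t.lower, Inf
    for _ in 1:t.n
        candidate = deepcopy(t.model)
        candidate.lambda = t.lower + (t.upper - t.lower) * rand(rng)
        coefs = toy_fit(candidate, X[train, :], y[train])
        # Mean square loss on the holdout rows.
        loss = mean(abs2, toy_predict(candidate, coefs, X[test, :]) .- y[test])
        if loss < best_loss
            best_lambda, best_loss = candidate.lambda, loss
        end
    end
    best = ToyRidge(best_lambda)
    # Retrain the best model on all the data.
    return (best_model = best, coefs = toy_fit(best, X, y), best_loss = best_loss)
end

toy_predict(t::ToyTuned, fitresult, X) =
    toy_predict(fitresult.best_model, fitresult.coefs, X)

data_rng = MersenneTwister(0)
X = randn(data_rng, 60, 3)
y = X * [1.0, -2.0, 0.5] .+ 0.1 .* randn(data_rng, 60)
tuned = ToyTuned(ToyRidge(1.0), 0.0, 10.0, 20, 42)
fitresult = toy_fit(tuned, X, y)
```

Since the wrapper exposes the same fit and predict operations as the base model, it could itself be resampled or embedded in a larger composite, which is the point of the wrapper design.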

A unified approach to probabilistic predictions and their evaluation
MLJ puts probabilistic models and deterministic models on an equal footing. Unlike in most frameworks, a supervised model is either probabilistic, meaning its predict method returns a distribution object, or deterministic, meaning it returns objects of the same scientific type as the training observations. To use a probabilistic model to make deterministic predictions one can wrap the model in a pipeline with an appropriate post-processing function, or use the additional predict_mean, predict_median and predict_mode methods to deal with the common use-cases.
A "distribution" object returned by a probabilistic predictor is one that can be sampled (using Julia's rand method) and queried for properties. Where possible the object is in fact a Distribution object from the Distributions.jl package [14], for which an additional pdf method for evaluating the distribution's probability density or mass function will be implemented, in addition to mode, mean and median methods (allowing MLJ's fallbacks for predict_mean, etc., to work).
One important distribution not provided by Distributions.jl is a distribution for finite labeled data (called UnivariateFinite) which additionally tracks all possible classes of the categorical variable it is modelling, and not just those observed in training data.
By predicting distributions, instead of raw probabilities or parameters, MLJ avoids a common pain point, namely deciding and agreeing upon a convention about how these should be represented: Should a binary classifier predict one probability or two? Are we using the standard deviation or the variance here? What's the protocol for deciding the order of (unordered) classes? How should multi-target predictions be combined?
A case in point concerns performance measures (metrics) for probabilistic models, such as cross-entropy and Brier loss. All built-in probabilistic measures provided by MLJ are passed a distribution in their prediction slot.
For an overview on probabilistic supervised learning we refer to [15].
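The following sketch illustrates, in plain Julia, what it means for a prediction to be a distribution over a complete pool of classes and for a measure to receive that distribution directly. It is a toy stand-in, not MLJ's UnivariateFinite: the type ToyFinite and the functions toy_pdf, toy_mode, toy_rand, toy_cross_entropy and toy_brier_loss are hypothetical names, and the Brier loss is taken here as the sum of squared differences from the one-hot encoding of the true class.

```julia
using Random

# A finite distribution that tracks every possible class, observed or not.
struct ToyFinite{L}
    classes::Vector{L}       # complete pool of classes
    probs::Vector{Float64}   # one probability per class, summing to one
end

toy_pdf(d::ToyFinite, label) = d.probs[findfirst(==(label), d.classes)]
toy_mode(d::ToyFinite) = d.classes[argmax(d.probs)]

# Sample a class by inverting the cumulative probabilities.
function toy_rand(rng::AbstractRNG, d::ToyFinite)
    u, acc = rand(rng), 0.0
    for (c, p) in zip(d.classes, d.probs)
        acc += p
        u < acc && return c
    end
    return last(d.classes)
end

# Measures take the distribution itself in the prediction slot, so no
# convention about the order or number of probabilities is needed.
toy_cross_entropy(d::ToyFinite, truth) = -log(toy_pdf(d, truth))
toy_brier_loss(d::ToyFinite, truth) =
    sum((toy_pdf(d, c) - (c == truth))^2 for c in d.classes)

# "maybe" was never seen in training but remains in the pool.
d = ToyFinite(["no", "maybe", "yes"], [0.7, 0.0, 0.3])
point_prediction = toy_mode(d)
loss = toy_brier_loss(d, "yes")
```

Taking the mode of the predicted distribution, as in the last lines, is the kind of post-processing that turns a probabilistic prediction into a deterministic one.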

Model interfaces
MLJ provides a basic fit/update/predict interface to be implemented by new supervised models. For unsupervised models predict is replaced with transform and an optional inverse_transform method. These methods operate on models which are mutable structs storing hyper-parameters, and nothing else. This model interface is purely functional for maximum flexibility. Presently the general MLJ user is encouraged to interact through a machine interface sitting on top. See more on this below.

The model interface
In MLJ a model is just a struct storing the hyper-parameters associated with some learning algorithm suggested by the struct name (e.g., DecisionTreeClassifier), and that is all. In the low-level, functional-style model interface, learned parameters are not stored, only passed around. Learned parameters are stored in machines (which additionally point to the hyper-parameters stored in a model); see below. The separation of hyper-parameters and learned parameters is essential to flexible model composition.
For supervised models the fit method has this signature:

fit(model, verbosity, X, y)
where X is training input and y the training target. The method outputs a triple, typically denoted (fitresult, cache, report).
The fitresult stores the learned parameters, which must include everything needed by predict to make predictions, apart from model and new input data:

predict(model, fitresult, Xnew)

The purpose of cache is to pass on "state" not included in the fitresult to an update method that the model implementer may optionally overload:

update(model, verbosity, fitresult, cache, X, y)

This method is to be called instead of fit (and passed the fitresult and cache returned by the fit call) when retraining using identical data. (The data X, y are included for implementer convenience.) It provides an opportunity for the model implementer to avoid unnecessary repetition of code execution. The three main use-cases are:
• Iterative models. If the only change to a random forest model is an increase in the number of trees by ten, for example, then not all trees need to be retrained; only ten new trees need to be trained. If a "self-tuning" model has been fit (i.e., tuned) using 70 iterations of Tree Parzen optimization, then adding 20 more iterations should build on the existing surrogate objective function, not ignore the existing tuning history.
• Data preprocessing. Avoid overheads associated with data preprocessing, such as coercion of data into an algorithm-specific type.
• Smart training of composite models. When tuning a simple transformer-predictor pipeline model using a holdout set, for example, it is unnecessary to retrain the transformer if only the predictor hyper-parameters change. MLJ implements "smart" retraining of composite models like this by defining appropriate update methods.
In the future MLJ will add an update_data method to support models that can carry out on-line learning.
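The iterative use-case for update can be illustrated with a small, self-contained sketch that follows the shape of the fit, update and predict signatures described above. It does not implement MLJ's interface: the model ToyBaggedMean (an ensemble whose learners are means of bootstrap samples of the target) and the functions toy_fit, toy_update, toy_predict and train_one are hypothetical, chosen only so that "adding more learners" is cheap to demonstrate.

```julia
using Random, Statistics

# The model stores hyper-parameters only.
mutable struct ToyBaggedMean
    n::Int      # number of bootstrap learners
    seed::Int
end

# One "learner": the mean of a bootstrap sample of the target.
train_one(rng, y) = mean(y[rand(rng, 1:length(y), length(y))])

function toy_fit(model::ToyBaggedMean, verbosity, X, y)
    rng = MersenneTwister(model.seed)
    fitresult = Float64[train_one(rng, y) for _ in 1:model.n]
    cache = (n = model.n, seed = model.seed, rng = rng)
    report = (n_trained = model.n,)
    verbosity > 0 && @info "Trained $(model.n) learners from scratch."
    return fitresult, cache, report
end

function toy_update(model::ToyBaggedMean, verbosity, fitresult, cache, X, y)
    # Retrain from scratch unless the only change is a larger ensemble.
    if model.seed != cache.seed || model.n < cache.n
        return toy_fit(model, verbosity, X, y)
    end
    rng = copy(cache.rng)   # leave the old cache untouched
    n_new = model.n - cache.n
    extra = Float64[train_one(rng, y) for _ in 1:n_new]
    new_cache = (n = model.n, seed = model.seed, rng = rng)
    return vcat(fitresult, extra), new_cache, (n_trained = n_new,)
end

# Everything needed for prediction is in the fitresult.
toy_predict(model::ToyBaggedMean, fitresult, Xnew) =
    fill(mean(fitresult), size(Xnew, 1))

X = reshape(collect(1.0:10.0), 10, 1)
y = collect(1.0:10.0)
model = ToyBaggedMean(5, 1)
fitresult, cache, report = toy_fit(model, 0, X, y)

model.n = 15   # only change: ten more learners
fitresult2, cache2, report2 = toy_update(model, 0, fitresult, cache, X, y)
```

Here the second call trains only the ten additional learners and reuses the first five, which is the behaviour the iterative-model bullet describes.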

The machine interface
The general MLJ user trains models through its machine interface. This makes some work-flows more convenient, but more significantly, introduces a syntax closely aligned with that for model composition (see below).
A machine is a mutable struct that binds a model to data at construction:

mach = machine(model, X, y)

When the user calls fit!(mach, rows=...), the fitresult, cache and report variables generated by lower-level calls to fit or update are stored or updated in the machine struct, mach, with the training being optionally restricted to the specified rows of data. To retrain with new hyper-parameters, the user simply mutates model and repeats the fit! call.
Syntax for predicting using a machine is predict(mach, Xnew).
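The relationship between the functional model interface and the machine layer can be sketched in a few lines of plain Julia. This is a toy illustration, not MLJ's machine type: ToyShrunkMean, ToyMachine, toy_machine, toy_fit! and toy_predict are hypothetical names, and for brevity the machine always calls the low-level fit rather than choosing between fit and update.

```julia
using Statistics

# Model: hyper-parameters only.
mutable struct ToyShrunkMean
    shrinkage::Float64   # multiplies the training mean
end

# Low-level functional interface: learned parameters are returned, not stored.
toy_fit(model::ToyShrunkMean, verbosity, X, y) =
    (model.shrinkage * mean(y), nothing, (n_rows = length(y),))
toy_predict(model::ToyShrunkMean, fitresult, Xnew) =
    fill(fitresult, size(Xnew, 1))

# Machine: binds a model to data and stores the outcomes of training.
mutable struct ToyMachine{M}
    model::M
    X::Matrix{Float64}
    y::Vector{Float64}
    fitresult
    cache
    report
    ToyMachine(model::M, X, y) where {M} =
        new{M}(model, X, y, nothing, nothing, nothing)
end
toy_machine(model, X, y) = ToyMachine(model, X, y)

function toy_fit!(mach::ToyMachine; rows = 1:length(mach.y), verbosity = 0)
    mach.fitresult, mach.cache, mach.report =
        toy_fit(mach.model, verbosity, mach.X[rows, :], mach.y[rows])
    return mach
end

toy_predict(mach::ToyMachine, Xnew) =
    toy_predict(mach.model, mach.fitresult, Xnew)

X = reshape(collect(1.0:6.0), 6, 1)
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
model = ToyShrunkMean(1.0)
mach = toy_machine(model, X, y)

toy_fit!(mach, rows = 1:4)
p1 = toy_predict(mach, X[5:6, :])

model.shrinkage = 0.5        # mutate the model, then simply refit
toy_fit!(mach, rows = 1:4)
p2 = toy_predict(mach, X[5:6, :])
```

The machine holds a reference to the model rather than a copy, so mutating a hyper-parameter and repeating the fit call is enough to retrain.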

Flexible model composition
Several limitations surrounding model composition are increasingly evident to users of the dominant machine learning software platforms. The basic model composition interfaces provided by the toolboxes mentioned in the Introduction all share one or more of the following shortcomings, which do not exist in MLJ:
• Composite models do not inherit all the behavior of ordinary models.
• Composition is limited to linear (non-branching) pipelines.
• Supervised components in a linear pipeline can only occur at the end of the pipeline.
• Hyper-parameters in homogeneous model ensembles cannot be coupled.
• Model stacking, with out-of-sample predictions for base learners, cannot be implemented.
• Hyper-parameters and/or learned parameters of component models are not easily inspected or manipulated (in tuning algorithms, for example).
We now sketch MLJ's composition API, referring the reader to [16] for technical details, and to the MLJ documentation [17, 18] for examples that will clarify how the composition syntax works in practice.
Note that MLJ also provides "canned" model composition for common use cases, such as non-branching pipelines and homogeneous ensembles, which are not discussed further here.
Specifying a new composite model type takes two steps: prototyping and export.

Prototyping
In prototyping the user defines a so-called learning network, by effectively writing down the same code she would use if composing the models "by hand". She does this using the machine syntax, with which she will already be familiar, from the basic fit!/predict work-flow for single models. There is no need for the user to provide production training data in this process. A dummy data set suffices, for the purposes of testing the learning network as it is built.
The upper panel of Figure 2 illustrates a simple learning network in which a continuous target y is "normalized" using a learned Box Cox transformation, producing z, while PCA dimension reduction is applied to some features X, to obtain Xr. A Ridge regressor, trained using data from Xr and z, is then applied to Xr to make a target prediction ẑ. To obtain a final prediction ŷ, we apply the inverse of the Box Cox transform, learned previously, to ẑ.
The lower "training" panel of the figure shows the three machines which will store the parameters learned in training: the Box Cox exponent and shift (machine1), the PCA projection (machine2) and the ridge model coefficients and intercept (machine3). The diagram additionally indicates where machines should look for training data, and where to access model hyper-parameters (stored in box_cox, PCA and ridge_regressor).
The only syntactic difference between composing "by hand" and building a learning network is that the training data must be wrapped in "source nodes" (which can be empty if testing is not required). Each data "variable" in the manual workflow becomes instead a node of a directed acyclic graph encoding the composite model architecture. Nodes are callable, with a node call triggering lazy evaluation of the predict, transform and other operations in the network. Instead of calling fit! on every machine, a single call to fit! on a node triggers training of all machines needed to call that node, in appropriate order. As mentioned earlier, training such a node is "smart" in the sense that hyper-parameter changes to a model only trigger retraining of necessary machines. So, for example, there is no need to retrain the Box Cox transformer in the preceding example if only the ridge regressor hyper-parameters have changed.
The syntax for specifying the learning network shown in Figure 2 closely follows the manual workflow. The definition ends with two calls that exercise the network on the dummy data:

fit!(ŷ)  # to test training on the dummy data
ŷ()      # to test prediction on the dummy data

Note that the machine syntax is a mechanism allowing multiple nodes to point to the same learned parameters of a model, as in the learned target transformation/inverse transformation above. Machines also allow multiple nodes to share the same model (hyper-parameters), as in homogeneous ensembles. And different nodes can be accessed during training and "prediction" modes of operation, as in stacking.
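To convey how source nodes, machines and lazily evaluated nodes fit together, the sketch below builds a network with the same shape as the one in Figure 2, in plain Julia. It is a toy, not MLJ's implementation or API: all types and functions (ToySource, ToyMachine, ToyNode, toy_machine, toy_fit!, toy_transform, and so on) are hypothetical, a shifted log transform stands in for Box Cox, column centering stands in for PCA, and the "smart" retraining described above is not implemented (each machine is simply trained once per toy_fit! call).

```julia
using LinearAlgebra, Statistics

mutable struct ToySource          # wraps data; calling it returns the data
    data
end
(s::ToySource)() = s.data

mutable struct ToyMachine         # binds a model to nodes supplying training data
    model
    args::Tuple
    fitresult
end
toy_machine(model, args...) = ToyMachine(model, args, nothing)

struct ToyNode                    # an operation applied lazily to upstream nodes
    operation::Function
    machine::ToyMachine
    args::Tuple
end
(n::ToyNode)() =
    n.operation(n.machine.model, n.machine.fitresult, (a() for a in n.args)...)

# Toy component models (stand-ins for Box Cox, PCA and a ridge regressor).
struct ToyLog end
toy_fit(::ToyLog, y) = 1 - minimum(y)                      # learned shift
toy_transform(::ToyLog, shift, y) = log.(y .+ shift)
toy_inverse_transform(::ToyLog, shift, z) = exp.(z) .- shift

struct ToyCenter end
toy_fit(::ToyCenter, X) = mean(X, dims = 1)                # learned column means
toy_transform(::ToyCenter, mu, X) = X .- mu

mutable struct ToyRidge
    lambda::Float64
end
function toy_fit(m::ToyRidge, X, z)
    intercept = mean(z)
    coefs = (X' * X + m.lambda * I) \ (X' * (z .- intercept))
    return (coefs = coefs, intercept = intercept)
end
toy_predict(::ToyRidge, fr, X) = X * fr.coefs .+ fr.intercept

# Calling an operation on a machine and a node defines a new node.
toy_transform(m::ToyMachine, arg) = ToyNode(toy_transform, m, (arg,))
toy_inverse_transform(m::ToyMachine, arg) = ToyNode(toy_inverse_transform, m, (arg,))
toy_predict(m::ToyMachine, arg) = ToyNode(toy_predict, m, (arg,))

# Training a node trains every machine it depends on, upstream first.
toy_fit!(::ToySource, done = Set{UInt}()) = nothing
function toy_fit!(m::ToyMachine, done = Set{UInt}())
    objectid(m) in done && return m
    foreach(a -> toy_fit!(a, done), m.args)
    m.fitresult = toy_fit(m.model, (a() for a in m.args)...)
    push!(done, objectid(m))
    return m
end
function toy_fit!(n::ToyNode, done = Set{UInt}())
    foreach(a -> toy_fit!(a, done), n.args)
    toy_fit!(n.machine, done)
    return n
end

# The network: same shape as Figure 2, built on dummy data.
X = ToySource([1.0 2.0; 2.0 1.0; 3.0 5.0; 4.0 3.0])
y = ToySource([2.0, 3.0, 7.0, 6.0])
machine1 = toy_machine(ToyLog(), y)
z = toy_transform(machine1, y)
machine2 = toy_machine(ToyCenter(), X)
Xr = toy_transform(machine2, X)
machine3 = toy_machine(ToyRidge(0.1), Xr, z)
zhat = toy_predict(machine3, Xr)
yhat = toy_inverse_transform(machine1, zhat)

toy_fit!(yhat)   # trains machine2, machine1 and machine3 in a valid order
yhat()           # lazily evaluates the whole prediction path
```

Note how machine1 is shared by the nodes z and yhat, so the transformation learned for the target is the one inverted at prediction time.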

Export
In the second step of model composition, the learning network is "exported" as a new stand-alone composite model type, with the component models appearing in the learning network becoming default values for corresponding hyper-parameters (whose values are themselves models). This new type (which is unattached to any particular data) can be instantiated and used just like any other MLJ model (tuned, evaluated, etc). Under the hood, training such a model builds a learning network, so that training is "smart". Defining a new composite model type requires generating and evaluating code, but this is readily implemented using Julia's meta-programming tools, i.e., executed by the user with a simple macro call.

Future directions
Here is a selection of future work planned or in progress:
• Supporting more models. Proofs of concept already exist for interfacing pure-Julia deep learning and probabilistic programming models.
• Enhancing core functionality. Add more tuning strategies, in particular, Bayesian methods and AD-powered gradient descent.
• Broadening scope. Adding resampling strategies and tools for dealing with time series data, and for dealing with sparse data relevant in natural language processing.
• Scalability. Add DAG scheduling for learning network training.
A more comprehensive road map is linked from the MLJ repository [6].