Lumen: Software for the interactive visualization of probabilistic models together with data

Research in machine learning and applied statistics has led to the development of a plethora of different types of models. Lumen aims to make a particular yet broad class of models, namely probabilistic models, more easily accessible to humans. Lumen does so by providing an interactive web application for the visual exploration, comparison, and validation of probabilistic models together with the underlying data. As its main feature, Lumen lets a user rapidly and incrementally build flexible and potentially complex interactive visualizations of both a probabilistic model and the data that the model was trained on.

Many classic machine learning methods learn models that predict the value of some target variable(s) given the value of some input variable(s). Probabilistic models go beyond this point estimation: instead of a particular value, they predict a probability distribution over the target variable(s). This allows one, for instance, to estimate the prediction's uncertainty, a highly relevant quantity. As a demonstrative example, consider a model that predicts that an image of a suspicious skin area does not show a malignant tumor. Here it would be extremely valuable to additionally know whether the model is 99.99% sure or just 51% sure, that is, to know the uncertainty in the model's prediction.
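The distinction can be sketched in a few lines of Python. The function names and numbers below are purely illustrative and are not part of Lumen or modelbase:

```python
# Purely illustrative sketch, not part of Lumen or modelbase.
# A point predictor returns only a single value, while a probabilistic
# model returns a distribution over the target variable.
def point_predict(image):
    return "no tumor"  # a bare point estimate

def probabilistic_predict(image):
    # a probability distribution over the target variable instead
    return {"no tumor": 0.51, "tumor": 0.49}

dist = probabilistic_predict(image=None)
prediction = max(dist, key=dist.get)
confidence = dist[prediction]
# Both predictors say "no tumor", but only the probabilistic one reveals
# that the model is merely 51% sure -- crucial information in this setting.
```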
Lumen is built on top of the modelbase back-end, which provides a SQL-like interface for querying models and their data (Lucas, 2020).
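The value of a common query interface to both models and data can be sketched with a toy example (this code does not reflect modelbase's actual API): the same question can be answered by aggregating the data and by evaluating a model fitted to it, which is what makes side-by-side comparison possible.

```python
import math

# Toy illustration only -- not modelbase's actual query interface.
# The same quantity can be queried from the data (by aggregation) and
# from a model fitted to that data (by a density formula).
data = [1.9, 2.1, 2.0, 2.2, 1.8]  # made-up observations of one variable

def data_query_mean(rows):
    """Answer a query directly from the data."""
    return sum(rows) / len(rows)

# "Model": a Gaussian fitted to the data by maximum likelihood.
mu = data_query_mean(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

def model_query_density(x):
    """Answer the semantically corresponding query from the model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# The data answer and the model answer can now be visualized side by side.
```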

Statement of need
A major challenge for both the application and development of machine learning and modelling methods is their accessibility to a human analyst, that is, the number of hurdles that one must overcome to make practical use of and benefit from them. Lumen aims to improve the accessibility of probabilistic machine learning models with respect to multiple aspects, as follows:
Model Building: Building a statistical/machine learning model is often an iterative, analyst-driven process. This is particularly true for the field of probabilistic programming, a modelling approach where the analyst explicitly declares the likelihood of the observed data as a probability density function. The analyst typically starts with an exploration of the data. Based on insights gained from this exploration and on domain knowledge, the analyst creates an initial simple model involving only part of the data. Subsequently, this model is iteratively made more complex (Gabry et al., 2019; Gelman et al., 2013) until it meets the expert's goals. In particular, the model must be validated after each iteration. Lumen supports this model building process by (i) enabling visual-interactive data exploration, (ii) supporting model validation by means of a visual comparison of data queries to semantically equivalent model queries, and (iii) enabling a direct comparison of model iterates.
Debugging: Even for a machine learning expert it may be hard to know whether a model has been trained on the data as expected. Possible reasons for artifacts in a model include an inappropriate application of the machine learning method, implementation bugs in the method, and issues in the training data. Direct visual inspection of the probabilistic model provides an approach to model debugging that lets the analyst literally spot model artifacts that may degrade performance. Classical approaches to validation instead rely on aggregate measures such as information criteria or predictive accuracy scores.
Education: Through its intuitive visual representations of models, Lumen aims to promote understanding of the underlying modelling techniques. For instance, the effect of varying a parameter value of a modelling method on the resulting probabilistic model can be observed visually rather than remaining an abstract description in a textbook. Similarly, the differences between models or model types can be illustrated visually by plotting them side by side. Also, probabilistic concepts such as conditioning or marginalization, which are often difficult to grasp, can be tried out interactively with immediate feedback.
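As a minimal sketch of these two operations on a tiny, hand-made discrete joint distribution (all variable names and probabilities below are invented for illustration):

```python
# Illustrative sketch of marginalization and conditioning on a tiny
# discrete joint distribution P(weather, mood); all numbers are made up.
joint = {
    ("sunny", "happy"): 0.40, ("sunny", "sad"): 0.10,
    ("rainy", "happy"): 0.15, ("rainy", "sad"): 0.35,
}

def marginalize(p, keep):
    """Sum out all variables except the one at position `keep` (0 or 1)."""
    out = {}
    for key, prob in p.items():
        out[key[keep]] = out.get(key[keep], 0.0) + prob
    return out

def condition(p, var, value):
    """Fix the variable at position `var` to `value` and renormalize."""
    sub = {k[1 - var]: prob for k, prob in p.items() if k[var] == value}
    z = sum(sub.values())
    return {k: prob / z for k, prob in sub.items()}

print(marginalize(joint, 0))         # P(weather): {'sunny': 0.5, 'rainy': 0.5}
print(condition(joint, 0, "rainy"))  # P(mood | rainy): {'happy': 0.3, 'sad': 0.7}
```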

Software
Lumen's interface is inspired by the academic Polaris project and its commercial successor Tableau (Stolte et al., 2002). However, while Polaris/Tableau handles data only, Lumen provides a uniform visual language and uniform interactions for both data and probabilistic models. Figure 1 shows a screenshot of Lumen to illustrate the user interface. The Schema panel (left) contains the random variables of the probabilistic model that the user has currently selected. Users can drag and drop variables onto the visual channels of the Specification panel (middle left). This reconfigures the currently active visualization on the dashboard (middle to right), triggers execution of the corresponding data and model queries, and finally updates and re-renders the visualization. To foster comparison of multiple models (for instance, from different model classes or from the iterates of an incremental model building process), Lumen allows users to create as many visualizations of as many models as desired. All visualizations support basic interactions such as panning, zooming, and selection, and are resizable as well as freely movable on the dashboard. While Lumen handles all user-facing aspects (such as visualizations and interactions), most computational aspects (such as the execution of model or data queries triggered by a user interaction) are delegated to a dedicated back-end, implemented in the modelbase project (Lucas, 2020). This separation follows a classic client-server architecture, where Lumen is the web client and modelbase the web service. In the standard usage scenario both client and server are installed locally on the same machine, but they can, of course, also be separated onto different machines across a network.
Lumen is model-agnostic in the sense that it can be used with any class of probabilistic models, as long as that model class implements the common, abstract API of the modelbase back-end. The API essentially requires that a model class
• contains only quantitative and categorical random variables, i.e. Lumen has no native support for images, time series, or vector-valued random variables,
• supports marginalization of random variables, i.e. the operation to remove/integrate out any subset of the model's random variables,
• supports conditioning of random variables on values of their domains, i.e. the operation to fix random variables to particular values, and
• supports density queries, i.e. the operation to ask for the value of the model's probability density function at any point of its domain.
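The required interface can be sketched as an abstract base class. The method names and signatures below are illustrative assumptions, not modelbase's actual API; a toy model of independent Gaussian variables serves as a minimal implementing class:

```python
import math
from abc import ABC, abstractmethod

# Illustrative sketch of the kind of abstract interface described above;
# method names and signatures are assumptions, not modelbase's actual API.
class ProbabilisticModel(ABC):
    @abstractmethod
    def marginalize(self, keep):
        """Return a model over the variables in `keep`, integrating out the rest."""

    @abstractmethod
    def condition(self, assignments):
        """Return a model with the variables in `assignments` fixed to values."""

    @abstractmethod
    def density(self, point):
        """Evaluate the probability density at `point` (a dict variable -> value)."""

class IndependentGaussians(ProbabilisticModel):
    """Toy model class: each quantitative variable is an independent Gaussian."""

    def __init__(self, params):  # params: {"age": (mean, stddev), ...}
        self.params = params

    def marginalize(self, keep):
        # With independent variables, marginalization just drops the others.
        return IndependentGaussians(
            {v: p for v, p in self.params.items() if v in keep})

    def condition(self, assignments):
        # Independence means conditioning leaves the remaining variables unchanged.
        return IndependentGaussians(
            {v: p for v, p in self.params.items() if v not in assignments})

    def density(self, point):
        d = 1.0
        for var, x in point.items():
            mu, s = self.params[var]
            d *= math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        return d

model = IndependentGaussians({"age": (40.0, 10.0), "income": (3.0, 1.0)})
age_only = model.marginalize(["age"])   # marginal model over "age"
peak = age_only.density({"age": 40.0})  # density at the mean
```

A visualization front-end built against such an interface never needs to know whether the model behind it is, say, a Gaussian mixture or a probabilistic program, which is what makes the model-agnostic design possible.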