pytorch-widedeep: A flexible package for multimodal deep learning

In recent years, datasets have grown in size and diversity, often combining different data types. Multimodal machine learning projects involving tabular data, images and/or text are gaining popularity (e.g., Garg et al. (2022)). Traditional approaches generated features independently from each data type and combined them at a later stage, before passing them to a classification or regression algorithm.

Furthermore, the flexibility inherent to deep learning (DL) approaches allows techniques originally designed for text and/or images, such as transfer learning or self-supervised pre-training, to be applied to tabular data.
With that in mind, we introduce pytorch-widedeep, a flexible package for multimodal deep learning designed to facilitate the combination of tabular data with text and images.

Statement of need
Only a small number of packages are available that use DL for tabular data alone (e.g., pytorch-tabular (Joseph, 2021), pytorch-tabnet or autogluon-tabular (Erickson et al., 2020)), or that focus mainly on combining text and images (e.g., MMF (Singh et al., 2020)). To fill this gap, our goal is to provide a modular, flexible, and easy-to-use framework that allows the combination of a wide variety of models for all data types.
pytorch-widedeep is based on Google's Wide and Deep algorithm (Cheng et al., 2016), hence its name. The original algorithm has been heavily adapted for multimodal datasets and is intended to facilitate the combination of text and images with corresponding tabular data. As opposed to the Keras/TensorFlow implementations of Google's "Wide and Deep" and "Deep and Cross" (R. Wang et al., 2017) architectures, we use the wide/cross and deep model design as an initial building block of PyTorch deep learning models, providing the basis for a plethora of state-of-the-art models and architectures that can be seamlessly assembled with just a few lines of code. Additionally, the individual components do not necessarily have to be part of the final architecture. The main components of those architectures are shown in Figure 1. In that figure, the faded-green deeptabular box indicates that the output of the deeptabular component will be concatenated directly with the output of the deeptext or deepimage components, or with the fully connected (FC) heads if these are used. Finally, the arrows indicate the connections, which of course depend on the final architecture that the user chooses to build.
Following the notation of Cheng et al. (2016), the expression for the architecture without a deephead component can be formulated as:

$$\text{pred} = \sigma\left(\mathbf{W}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{W}_{deeptabular}^{T}a^{(l_f)}_{deeptabular} + \mathbf{W}_{deeptext}^{T}a^{(l_f)}_{deeptext} + \mathbf{W}_{deepimage}^{T}a^{(l_f)}_{deepimage} + b\right)$$

where $\sigma$ is the sigmoid function, $\mathbf{W}$ are the weight matrices applied to the wide model and to the final activations of the deep models, $a^{(l_f)}$ are these final activations, $\phi(\mathbf{x})$ are the cross-product transformations of the original features $\mathbf{x}$, and $b$ is the bias term.
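To make the arithmetic behind this expression concrete, the prediction without a deephead reduces to a sigmoid over a sum of dot products. The following is a pure-Python sketch with made-up weights and activations, purely illustrative and not the library's implementation:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def wide_deep_pred(w_wide, x_and_cross, deep_terms, b):
    """Illustrative only: pred = sigma(W_wide^T [x, phi(x)] + sum_c W_c^T a_c + b).

    deep_terms is a list of (weights, final_activations) pairs, one per
    deep component (deeptabular, deeptext, deepimage)."""
    z = sum(w * v for w, v in zip(w_wide, x_and_cross)) + b
    for w_c, a_c in deep_terms:
        z += sum(w * a for w, a in zip(w_c, a_c))
    return sigmoid(z)


pred = wide_deep_pred(
    w_wide=[0.5, -0.2],
    x_and_cross=[1.0, 1.0],                  # original features plus phi(x)
    deep_terms=[([0.1, 0.3], [0.8, -0.4])],  # one deep component's weights/activations
    b=0.05,
)
```

Each deep component contributes its own weighted final activations to the pre-sigmoid sum, which is why the components can be freely added or omitted.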
If there is a deephead component, the previous expression turns into:

$$\text{pred} = \sigma\left(\mathbf{W}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{W}_{deephead}^{T}a^{(l_f)}_{deephead} + b\right)$$

At this stage, it is worth mentioning that the library has been built with a special emphasis on flexibility: we want users to easily run as many different models as possible and/or use their own custom components if they prefer. With that in mind, each and every data type component in the figure above can be used independently and in isolation. For example, if the user wants to use a ResNet model to perform classification on an image-only dataset, that is perfectly possible with this library. In addition, following some minor adjustments described in the documentation, the user can use any custom model for each data type. Mainly, a custom model is a standard PyTorch model class that must have a property or attribute called output_dim; this way, the WideDeep collector class knows the size of the incoming activations and is able to construct the multimodal model. Examples of how to use custom components can be found in the repository and documentation.
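To make the output_dim contract concrete, here is a minimal, framework-free sketch. The class and function names are hypothetical stand-ins, not part of the library's API:

```python
# Hypothetical sketch of the contract described above: a custom component
# simply exposes an `output_dim` attribute with the size of its final
# activations, so a collector can size the layers that consume them.
class MyTextEncoder:
    """Toy stand-in for a custom PyTorch module (names are illustrative)."""

    def __init__(self, vocab_size, embed_dim):
        self.vocab_size = vocab_size
        self.output_dim = embed_dim  # required: size of the final activations


def head_input_size(components):
    # A collector (conceptually like the WideDeep class) can size its head
    # by summing the output_dim of every component it receives.
    return sum(c.output_dim for c in components)


size = head_input_size([MyTextEncoder(1000, 64), MyTextEncoder(500, 32)])
```

In the actual library the components are real PyTorch modules, but the sizing logic they must support is exactly this simple attribute lookup.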

The Model Hub
This section briefly introduces the model components currently available for each data type in the library. Note that the library is under constant development, and new models are regularly added to the "model hub".

The wide component
This is a linear model for tabular data where non-linearities are captured via cross-product transformations. It is the simplest of all components, and we consider it very useful as a benchmark when used on its own.
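A cross-product transformation for binary features can be sketched in plain Python. This is an illustrative toy, not the library's implementation:

```python
from itertools import combinations


def cross_products(x, pairs=None):
    """Toy cross-product transformation phi(x) for binary features:
    each new feature is 1 only when both original features are 1."""
    idx_pairs = pairs or list(combinations(range(len(x)), 2))
    return [x[i] * x[j] for i, j in idx_pairs]


# e.g. x encodes three binary indicators; only the (first, third) pair co-occurs
phi = cross_products([1, 0, 1])  # -> [0, 1, 0]
```

The linear wide model then operates on the original features concatenated with these interaction features, which is how it captures non-linearities despite being linear in its inputs.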

The deeptabular component
Currently, pytorch-widedeep offers the following models for the so-called deeptabular component: (i) TabMlp, (ii) TabResnet, (iii) TabNet, (iv) ContextAttentionMLP and SelfAttentionMLP, (v) TabTransformer, (vi) SAINT, (vii) FT-Transformer, (viii) TabFastFormer and (ix) TabPerceiver.
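As a toy illustration of a common design in deep tabular models such as TabMlp (a sketch with made-up names, not the library's code), the dense part of the network typically consumes the concatenation of categorical embeddings and continuous columns, so its input size is the sum of the embedding dimensions plus the number of continuous columns:

```python
# Illustrative sketch (hypothetical helper, not a library function): how the
# input size of an MLP over tabular data is commonly derived.
def mlp_input_dim(cat_embed_dims, continuous_cols):
    # one embedding vector per categorical column, one scalar per continuous column
    return sum(cat_embed_dims) + len(continuous_cols)


dim = mlp_input_dim([8, 16, 4], ["age", "income"])  # -> 30
```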

The deeptext component
Currently, pytorch-widedeep offers the following models for the deeptext component: (i) BasicRNN, (ii) AttentiveRNN and (iii) StackedAttentiveRNN. The library will be integrated with the Hugging Face transformers library (Wolf et al., 2019) in the near future. However, it is worth mentioning that although transformer-based models are not yet natively supported by our library, they can easily be used with pytorch-widedeep as custom models (please see the documentation for details).

Forms of model training
Training single-component or multimodal models in pytorch-widedeep is handled by the different training classes. Currently, pytorch-widedeep offers the following training options: (i) "standard" supervised training, (ii) supervised Bayesian training, and (iii) self-supervised pre-training.

Contribution
pytorch-widedeep is being developed and used by many active community members. Anyone can join the discussion on Slack.
• The Callbacks and Initializers structure and code are inspired by the torchsample library (TorchSample maintainers & contributors, 2017), which itself was partially inspired by Keras (Chollet & others, 2015).
• The TextProcessor class in this library uses the fastai (J. Howard & Gugger, 2020) Tokenizer and Vocab; the code at utils.fastai_transforms is a minor adaptation of their code so that it functions within this library. In our experience, their Tokenizer is the best in class.
• The ImageProcessor class in this library uses code from the fantastic Deep Learning for Computer Vision (DL4CV) book (Adrian, 2017).