audiomate: A Python package for working with audio datasets

Machine learning tasks in the audio domain frequently require large amounts of training data. In recent years, numerous datasets have been made available for various purposes, for example, (Snyder, Chen, & Povey, 2015) and (Ardila et al., 2019). Unfortunately, most of these datasets are stored in widely differing formats. As a consequence, machine learning practitioners have to convert datasets into other formats before they can use or combine them. Furthermore, common tasks like reading, partitioning, or shuffling datasets have to be reimplemented for each format and require intimate knowledge of the formats. We present Audiomate, a Python toolkit, to solve these problems.

Audiomate provides the following functionality:

• Reading and writing numerous dataset formats through a uniform programming interface, for example (Snyder et al., 2015), (Panayotov, Chen, Povey, & Khudanpur, 2015), and (Ardila et al., 2019)
• Accessing metadata, such as speaker information and labels
• Reading audio data (single files, batches of files)
• Retrieving information about the data (e.g., number of speakers, total duration)
• Splitting data into smaller subsets (e.g., creating training, validation, and test sets with a reasonable distribution of classes)
• Validating data against specific requirements (e.g., checking whether every sample was assigned a label)

Merging and Partitioning Datasets
Another area where Audiomate excels is merging datasets and partitioning them into training, validation, and test sets. Assume the task is to train a neural network to detect segments of music in audio streams. MUSAN (Snyder et al., 2015) and GTZAN ("GTZAN music/speech collection," n.d.) are two suitable datasets for this task because they provide a wide selection of music, speech, and noise samples. In the example below, we first download MUSAN and GTZAN to the local disk before creating Loader instances for each format that allow Audiomate to access both datasets through a unified interface.

Implementation
Audiomate was designed with extensibility in mind. Therefore, it is straightforward to add support for additional data formats. Support for another format can be added by implementing at least one of three abstract interfaces:
• Reader: A Reader defines the procedure to load data that is structured in a specific format and converts it into an Audiomate-specific data structure.
• Writer: A Writer defines the procedure to store data in a specific format. It does so by converting the data from the Audiomate-specific data structure into the target format.
• Downloader: A Downloader can be used to download a dataset. It automatically downloads all required files.
Rarely are all three interfaces implemented for a particular format. Usually, Reader and Downloader are implemented for datasets, while Writer is implemented for machine learning toolkits.
Audiomate supports more than a dozen datasets and half as many toolkits.

Related Work
A variety of frameworks and tools offer functionality similar to Audiomate.
Data loaders. Data loaders are libraries that focus on downloading and preprocessing datasets to make them easily accessible without requiring a specific tool or framework. In contrast to Audiomate, they can neither convert between formats nor split or merge datasets. Examples of libraries in this category are ("Mirdata," 2020), ("Speech corpus downloader," 2020), and ("Audio datasets," 2020). Furthermore, some of these libraries focus on a particular kind of data, such as music, and do not assist with speech datasets.
Tools for specific frameworks. Various machine learning tools and deep learning frameworks include the necessary infrastructure to make various datasets readily available to their users. One notable example is TensorFlow (Abadi et al., 2016), which includes data loaders for different kinds of data, including image, speech, and music datasets, such as (Ardila et al., 2019). Another is torchaudio ("TORCHAUDIO," 2020) for PyTorch, which not only offers data loaders but is also capable of converting between various formats. In contrast to Audiomate, these tools and libraries each support a specific machine learning or deep learning framework (TensorFlow or PyTorch, respectively), whereas Audiomate is framework-agnostic.
Then, we instruct Audiomate to merge both datasets. Afterwards, we use a Splitter to partition the merged dataset into a train and test set. By merely creating views, Audiomate avoids unnecessary disk I/O and is therefore ideally suited to working with large datasets in the range of tens or hundreds of gigabytes. Ultimately, we load the samples and labels by iterating over all utterances. Audio samples are numpy arrays, which allow for fast access and high processing speed and ensure interoperability with third-party programs that operate on numpy arrays, for example TensorFlow or PyTorch. Alternatively, it is possible to load the samples in batches, which is ideal for feeding them to a deep learning toolkit like PyTorch.