Transformer-Based Audio Generation Conditioned by 2D Latent Maps: A Demonstration

Authors: Christian Limberg, Zhe Zhang and Marc A. Kastner

Abstract

This paper presents a demonstration of an improved framework for audio sample generation using interactive 2D latent maps. Building upon the foundational work "Mapping the Audio Landscape for Innovative Music Sample Generation", we enhance the framework by introducing visualization techniques for exploring the 2D audio landscape through different audio features such as energy and bandwidth. Additionally, we train a t-SNE embedding over these features to create a more abstract visualization of the audio samples on the map. The demo also significantly improves usability and interactivity, allowing a more intuitive and efficient exploration of the generated audio samples. The demo app showcases these improvements in real time, providing users with a novel interface for generating high-quality audio samples.

Audio Sample Generator

Display Classifier Colors
We trained a classifier that distinguishes the classes the transformer model was trained on. We color the embedding space with the corresponding class colors using a 5-nearest-neighbor classifier, mixing the colors of the 5 nearest neighbors to create a smooth visual transition between classes.
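As an illustration, the background coloring can be produced as in the following minimal sketch of 5-nearest-neighbor color blending with scikit-learn; the names (`knn_color_map`, `latents_2d`, `class_colors`) are placeholders and not taken from the demo code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_color_map(latents_2d, labels, class_colors, grid_size=256, k=5):
    """Color each pixel of the 2D map by blending the class colors of its
    k nearest training samples, giving smooth transitions between classes.

    latents_2d:   (n_samples, 2) latent coordinates of the training data
    labels:       (n_samples,) integer class labels
    class_colors: (n_classes, 3) RGB color per class
    """
    nn = NearestNeighbors(n_neighbors=k).fit(latents_2d)

    # Regular pixel grid spanning the latent space.
    xs = np.linspace(latents_2d[:, 0].min(), latents_2d[:, 0].max(), grid_size)
    ys = np.linspace(latents_2d[:, 1].min(), latents_2d[:, 1].max(), grid_size)
    grid = np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)

    # For each grid point, average the class colors of its k nearest neighbors.
    _, idx = nn.kneighbors(grid)                  # (n_pixels, k)
    neighbor_colors = class_colors[labels[idx]]   # (n_pixels, k, 3)
    blended = neighbor_colors.mean(axis=1)        # (n_pixels, 3)
    return blended.reshape(grid_size, grid_size, 3)
```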

Sample Classes:

Show/Hide Training Data
Display Audio Features
We computed 5 different audio features and use a color map to visualize each one as a background image for orientation (a feature-extraction sketch is shown below).
Overlay audio features and classifier map
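The demo names energy and bandwidth explicitly; the remaining three features are not specified, so the centroid, rolloff, and zero-crossing rate in this sketch are assumptions chosen for illustration, computed here with librosa.

```python
import numpy as np
import librosa

def audio_features(path):
    """Compute scalar audio features per sample by averaging frame-wise values.
    Energy and bandwidth are named in the demo; the remaining three features
    below are illustrative placeholders."""
    y, sr = librosa.load(path, sr=None, mono=True)
    return np.array([
        librosa.feature.rms(y=y).mean(),                        # energy
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),  # bandwidth
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),   # placeholder
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),    # placeholder
        librosa.feature.zero_crossing_rate(y).mean(),           # placeholder
    ])

# Each normalized feature can then be passed through a color map
# (e.g. matplotlib's viridis) to render the background image for that feature.
```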
3D t-SNE
Based on the 5 audio features, we trained a 3D t-SNE embedding, which we map to RGB space to visualize a richer but more abstract map of the soundscape.
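A minimal sketch of this mapping, assuming an `(n_samples, 5)` feature matrix and largely default scikit-learn t-SNE settings (the actual hyperparameters are not given in the demo):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_rgb(features):
    """Embed the 5D feature vectors into 3D with t-SNE and rescale each axis
    to [0, 1] so the embedding can be read directly as RGB colors."""
    emb = TSNE(n_components=3, init="random", perplexity=30).fit_transform(features)
    emb = (emb - emb.min(axis=0)) / np.ptp(emb, axis=0)  # per-axis min-max scaling
    return emb  # (n_samples, 3) RGB triplets in [0, 1]
```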
Our interactive interface lets you generate musical samples by selecting points on a 2D plane. A fraction of the training data is displayed for orientation. The background is colored with the average colors of the nearest neighboring samples. Click on a location to play the corresponding sample (speakers or headphones required).
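For illustration, a click on the canvas might be translated into latent coordinates roughly as follows; `generate_fn` is a hypothetical stand-in for the Transformer-based generator and is not part of the demo code.

```python
def on_click(px, py, canvas_size, latent_min, latent_max, generate_fn):
    """Convert canvas pixel coordinates into a 2D latent point and generate
    the corresponding audio sample (hypothetical sketch)."""
    zx = latent_min[0] + (px / canvas_size[0]) * (latent_max[0] - latent_min[0])
    zy = latent_min[1] + (py / canvas_size[1]) * (latent_max[1] - latent_min[1])
    return generate_fn((zx, zy))  # audio to be played back in the browser
```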

Model Architecture

This schematic depicts the training procedure of our model. In the first stage, a VAE with a 2D latent bottleneck is trained. In the second stage, the Transformer model is trained using the VAE as a conditioning model.
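The two stages could be sketched as below. Only the overall idea (a VAE with a 2D latent bottleneck, followed by a Transformer conditioned on that latent) comes from the demo; all layer sizes, the token vocabulary, and the conditioning-by-prefix scheme are assumptions made for this PyTorch sketch.

```python
import torch
import torch.nn as nn

class LatentVAE(nn.Module):
    """Stage 1: compress an audio-sample representation into a 2D latent."""
    def __init__(self, in_dim=512, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar


class ConditionedTransformer(nn.Module):
    """Stage 2: autoregressive Transformer over audio tokens, conditioned on the
    VAE's 2D latent by prepending it as a learned prefix embedding."""
    def __init__(self, vocab_size=1024, d_model=256, latent_dim=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.cond_proj = nn.Linear(latent_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, z):
        cond = self.cond_proj(z).unsqueeze(1)                 # (B, 1, d_model)
        x = torch.cat([cond, self.token_emb(tokens)], dim=1)  # prepend condition
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)                       # causal self-attention
        return self.head(h[:, :-1])  # next-token logits aligned with `tokens`
```

At generation time, clicking a point on the map directly supplies the 2D latent, so no encoder pass is required.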

Cite Us

A reference will appear here once this demo paper has been accepted.