Mapping the Audio Landscape
for Innovative Music Sample Generation

Authors: Christian Limberg and Zhe Zhang

Abstract

This paper introduces the Generative Sample Map (GESAM), a novel two-stage unsupervised learning framework capable of generating high-quality and expressive audio samples for music production. Recent generative approaches based on language models rely on text prompts as conditions; however, the fine nuances of musical audio samples can hardly be described in the modality of text. To address this shortcoming, we propose to learn a highly descriptive 2D latent audio map with a Variational Autoencoder (VAE), which is then used to condition a Transformer model. We demonstrate the Transformer's ability to achieve high generation quality and compare its performance against two baseline models. By selecting points on the map, which compresses the manifold of the audio training set into 2D, users can interact with the model more naturally. We showcase this capability through an interactive demo interface, which is accessible on this website.

Audio Sample Generator

Our interactive interface lets you generate musical samples by selecting points on a 2D plane. A fraction of the training data is displayed for orientation, and the background is colored with the average colors of the nearest neighboring samples. Click on a location to play the corresponding sample (speakers or headphones required).
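The following is a minimal NumPy sketch of how the background coloring described above could be computed; it assumes each training sample is stored with its 2D latent coordinate and an RGB color, and the function names are illustrative rather than taken from the demo's actual code.

```python
import numpy as np

def background_color(point_xy, train_latents, train_colors, k=5):
    """Average the colors of the k nearest training samples around a 2D map point.

    point_xy:      (2,) array, a position on the 2D map.
    train_latents: (N, 2) array of 2D latent coordinates of training samples.
    train_colors:  (N, 3) array of RGB colors assigned to those samples.
    """
    dists = np.linalg.norm(train_latents - point_xy, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]                           # indices of the k closest samples
    return train_colors[nearest].mean(axis=0)                 # mean RGB color for this pixel
```

When a point is clicked, the same 2D coordinate can be passed directly as the conditioning input of the generator, since the map axes coincide with the VAE's 2D latent space.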

Model Architecture

This schematic depicts the training procedure of our model. In the first stage, a VAE with a 2D latent bottleneck is trained. In the second stage, the Transformer model is trained using the VAE as a conditioning model.
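Below is a minimal PyTorch sketch of this two-stage setup. The layer sizes, loss weights, and audio token vocabulary are assumptions for illustration, not the exact configuration used in the paper; it is meant only to show how a 2D VAE bottleneck can condition an autoregressive Transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVAE(nn.Module):
    """Stage 1: compress an audio feature vector into a 2D latent (the sample map)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Encoder outputs 4 values: a 2D mean and a 2D log-variance.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 4))
        self.decoder = nn.Sequential(nn.Linear(2, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.decoder(z)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = F.mse_loss(recon, x) + 1e-3 * kl                   # reconstruction + weighted KL
        return loss, mu                                           # mu is the 2D map coordinate

class ConditionalTransformer(nn.Module):
    """Stage 2: autoregressive Transformer over audio tokens, conditioned on the 2D latent."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4):
        super().__init__()
        self.cond_proj = nn.Linear(2, d_model)                    # project 2D latent to a prefix token
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, z):
        cond = self.cond_proj(z).unsqueeze(1)                     # (B, 1, d_model) conditioning prefix
        h = torch.cat([cond, self.token_emb(tokens)], dim=1)      # prepend condition to token sequence
        causal = torch.triu(torch.full((h.size(1), h.size(1)), float('-inf')), diagonal=1)
        h = self.backbone(h, mask=causal)
        return self.head(h[:, :-1])                               # logits for the next token at each step

# Training order (sketch):
#   1. Train AudioVAE on audio features, then freeze it.
#   2. Encode each training sample to its 2D mean and train ConditionalTransformer
#      with cross-entropy between predicted and ground-truth audio tokens.
```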

Cite Us

A reference will appear here once the article has been accepted.