AI can continue playing music after hearing a snippet

Industry Updates
Author: TD SYNNEX Newsflash Published: 12th October 2022

A new AI system can produce flowing music in the same style after being played just a few seconds of audio.

The system, known as AudioLM, can also reproduce the sounds of people speaking if fed a snippet of dialogue.

AudioLM takes a different approach from existing AI audio-generation systems, which are now reasonably commonplace.

For example, digital assistants like Alexa produce voices that speak using natural language processing, while OpenAI’s Jukebox has been able to create original music that corresponds to specific genres.

Existing techniques tend to require a lot of human preparation, with extensive transcription and annotation needed for the AI training.

AudioLM does not need the same sort of transcription or labelling.

Instead, the system is supplied with databases of sounds, and audio files are compressed into small snippets known as tokens.

These retain most of the information and can be used to train a machine learning model, which uses natural language processing to identify patterns in the sound.
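The idea of compressing audio into discrete tokens can be sketched in a few lines. This is a hypothetical illustration, not AudioLM's actual codec: it splits a waveform into short frames and maps each frame to the id of its nearest entry in a codebook (random here; AudioLM learns its codebooks with a neural audio codec).

```python
import numpy as np

# Hypothetical sketch of audio tokenization: split a waveform into short
# frames and replace each frame with the id of its nearest codebook vector.
# AudioLM learns its codebooks with a neural codec; here the codebook is
# random, purely to show the waveform -> token-id mapping.

rng = np.random.default_rng(0)

FRAME = 320          # samples per frame (~20 ms at 16 kHz)
CODEBOOK_SIZE = 64   # number of distinct token ids

def tokenize(waveform, codebook):
    # Trim to a whole number of frames, then reshape to (n_frames, FRAME)
    n = len(waveform) // FRAME * FRAME
    frames = waveform[:n].reshape(-1, FRAME)
    # Each frame becomes the id of its nearest codebook vector
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME))
audio = rng.normal(size=16000)           # one second of fake 16 kHz audio
tokens = tokenize(audio, codebook)

print(tokens.shape)                      # 50 frames -> 50 token ids
```

One second of audio collapses from 16,000 samples to 50 small integers, which is what makes language-model-style training on sound tractable.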

AI predicts what should come next after ‘hearing’ a few seconds of sound

Once analysed, a snippet of sound can be fed into AudioLM, and the system will predict what should come next.

It’s a similar process to that used by writing systems like GPT-3, which can generate text by predicting what words and sentences typically follow each other.
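The prediction step can be illustrated with the simplest possible stand-in for AudioLM's language model: a bigram table that, given the current token, continues with the token that most often followed it in the training sequence. The real system uses a Transformer, so this sketch is only an assumption-laden analogy.

```python
from collections import Counter, defaultdict

# Hypothetical sketch of "predict what comes next": a bigram model that
# continues a token sequence with whichever token most often followed the
# current one during training. AudioLM's real model is a Transformer; this
# just illustrates the autoregressive idea shared with systems like GPT-3.

def train_bigrams(sequence):
    follows = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        follows[current][nxt] += 1
    return follows

def continue_sequence(prompt, follows, steps):
    out = list(prompt)
    for _ in range(steps):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return out

training = [1, 2, 3, 1, 2, 3, 1, 2]      # toy "audio token" sequence
model = train_bigrams(training)
print(continue_sequence([1], model, 5))   # [1, 2, 3, 1, 2, 3]
```

Fed the single token 1, the model extends the sequence with the pattern it saw in training, the same predict-the-next-token loop that AudioLM runs over audio tokens instead of words.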

The audio created by AudioLM is described as sounding natural and more fluid than music generated using other existing AI techniques.

To create piano music, for example, the system needs to recreate the subtle vibrations that make up each note when a piano key is struck.

Rhythms and harmonies also have to be identified and maintained over the correct periods, with the system learning to create structures at multiple levels.

When the system is trained using human voices, it can also generate speech that mimics the accent and delivery of the original speaker.

At this point, however, while the cadence and rhythm of the speech may sound natural, the actual sentence construction may be filled with non-sequiturs that do not make sense. AudioLM is adept at learning elements such as the pauses and emphasis that go into speech but not the actual meaning, which would require extensive annotation in the training phase.

Today’s news was brought to you by TD SYNNEX – the UK’s number one solutions distributor.
