Code Tutorial
import torch
import IPython.display as ipd

# Quick sanity check that notebook audio playback works:
# five seconds of white noise at 44.1 kHz.
sr = 44100
duration = 5
audio_sample = torch.randn(1, sr * duration)
ipd.Audio(audio_sample.numpy(), rate=sr)
Stable Audio Open Tutorial
Stable Audio Open is fully available through HuggingFace. To run Stable Audio Open locally, you'll first need to generate an HF_TOKEN for yourself; instructions are at https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication (you'll need a HuggingFace account first). Once you've generated the token, export it as an environment variable with a bash command like
export HF_TOKEN="YOUR_HF_TOKEN"
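Once exported, you can sanity-check the token from Python (a minimal check; huggingface_hub is pulled in as a dependency of stable-audio-tools and reads HF_TOKEN from the environment):
# Optional: confirm the token is valid by asking the Hub who you are.
from huggingface_hub import whoami
print(whoami()["name"])  # raises if the token is missing or invalid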
The rest of the tutorial closely follows the demo design of the public Stable Audio Open resources:
First, we'll install some dependencies if you don't already have them. stable-audio-tools can be a bit finicky to install directly, so we suggest making a dedicated virtual environment (rather than a conda environment) to run this notebook; a sketch of such a setup follows the install cell.
# !pip install torch torchaudio torchvision stable-audio-tools einops
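For reference, setting up such an environment from scratch might look like this (a sketch; the environment name and Python invocation are illustrative):
# python3 -m venv sao-env
# source sao-env/bin/activate
# pip install torch torchaudio torchvision stable-audio-tools einops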
If running this locally, you can simply set the HF_TOKEN in your local environment (as done below). If you're using a Colab notebook, you instead need to add your HF_TOKEN as a "secret" in Colab, and the command below will have no effect in that case; see the Colab sketch after the next cell.
import os
import warnings
os.environ['HF_TOKEN'] = 'Your API key'  # paste your actual token here
warnings.filterwarnings('ignore', category=FutureWarning)
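On Colab, the equivalent is to read the token from the notebook's secrets (a sketch using Colab's userdata API; this assumes you added a secret named HF_TOKEN via the key icon in the sidebar):
# Colab only: pull the token from the notebook's secrets instead.
from google.colab import userdata
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')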
Next, we can load the model from HuggingFace. Note that there are some known dependency issues with stable-audio-tools on M1 Macs, so we recommend running this as a Colab notebook (or on a Linux system).
import torch
import torchaudio
# import librosa
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import IPython.display as ipd
from functools import partial
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
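Before sampling, it can be useful to check the model's native generation settings (just printing the two config values loaded above):
# Optional: see what the model expects natively.
print(f"sample_rate: {sample_rate} Hz")
print(f"sample_size: {sample_size} samples (~{sample_size / sample_rate:.1f} s per generation window)")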
First, we'll wrap the sampling code in a simpler helper, since there are a few parameters that need to be provided but aren't particularly useful to play around with.
# this just cleans things up a bit so the code below highlights the important knobs
easy_generate = partial(
    generate_diffusion_cond,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    device=device,
)
Next, we can define our conditioning, which for the default Stable Audio Open involves text, timing, and overall length.
# Set up text and timing conditioning
conditioning = [{
    "prompt": "clean guitar, sweep picking, 140 bpm, G minor",
    "seconds_start": 0,  # where in time this clip sits within the song
    "seconds_total": 30  # total sample length in seconds; the rest is padded with silence
}]
seed = 1000
n_steps = 50
cfg = 7.5
sampler = "dpmpp-3m-sde"
output = easy_generate(
    model,
    conditioning=conditioning,
    steps=n_steps,         # number of diffusion steps to run
    cfg_scale=cfg,         # classifier-free guidance scale
    sampler_type=sampler,  # sampling "algorithm", check out https://github.com/Stability-AI/stable-audio-tools/blob/main/stable_audio_tools/inference/sampling.py#L177 for more options
    seed=seed,
)
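If you'd like to hear how the sampler choice affects the result, one option is to sweep a few of the algorithms from the linked sampling module (a sketch; the sampler names below are believed to be among the supported options, but check the linked source for the current list, and note that each call re-runs the full diffusion loop, so this is slow without a GPU):
# Hypothetical comparison: same prompt and seed, different samplers.
outputs_by_sampler = {}
for s in ["dpmpp-3m-sde", "dpmpp-2m-sde", "k-heun"]:
    outputs_by_sampler[s] = easy_generate(
        model, conditioning=conditioning, steps=n_steps,
        cfg_scale=cfg, sampler_type=s, seed=seed,
    )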
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, and convert to int16, then trim the silent padding
output = output.to(torch.float32)
output = output.div(torch.max(torch.abs(output))).clamp(-1, 1)
output = output.mul(32767).to(torch.int16).cpu()
output = output[:, :round(conditioning[0]['seconds_total'] * sample_rate)]
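The tensor is now in the (channels, samples) int16 layout that torchaudio can write directly, so saving the clip is a single call (the filename is just an example):
# Save the trimmed clip as a 16-bit PCM WAV
torchaudio.save("output.wav", output, sample_rate)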
Now we can listen to the output! Note: if running in a Colab notebook, rendering audio will stop the autosave feature, so be sure to delete the block outputs if you want to turn autosave back on!
ipd.display(ipd.Audio(output, rate=sample_rate))