Code Tutorial
import torch
import IPython.display as ipd

# Quick sanity check that notebook audio playback works:
# five seconds of white noise at 44.1 kHz.
sr = 44100
duration = 5
audio_sample = torch.randn(1, sr * duration)
ipd.Audio(audio_sample.numpy(), rate=sr)
Stable Audio Open Tutorial
Stable Audio Open is fully available through HuggingFace. To run Stable Audio Open locally, you'll first need to generate an HF_TOKEN for yourself; instructions are at https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication (you'll need a HuggingFace account first). Once you've generated the token, export it as an environment variable with a bash command like
export HF_TOKEN="YOUR_HF_TOKEN"
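Once exported, you can sanity-check the token from Python (a minimal check; huggingface_hub is pulled in as a dependency of stable-audio-tools and reads HF_TOKEN from the environment):
# Optional: confirm the token is valid by asking the Hub who you are.
from huggingface_hub import whoami
print(whoami()["name"])  # raises if the token is missing or invalid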
The rest of the tutorial closely follows the demo design of the public Stable Audio Open resources:
First, we'll install some dependencies if you don't already have them. stable-audio-tools can be a bit finicky to install directly, so we suggest making a dedicated virtual environment (rather than a conda environment) to run this notebook; a sketch of such a setup follows the install cell.
# !pip install torch torchaudio torchvision stable-audio-tools einops
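For reference, setting up such an environment from scratch might look like this (a sketch; the environment name and Python invocation are illustrative):
# python3 -m venv sao-env
# source sao-env/bin/activate
# pip install torch torchaudio torchvision stable-audio-tools einops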
If running this locally, you can simply set the HF_TOKEN in your local environment (as done below). If you're using a Colab notebook, you instead need to add your HF_TOKEN as a "secret" in Colab, and the command below will have no effect in that case; see the Colab sketch after the next cell.
import os
import warnings
os.environ['HF_TOKEN'] = 'Your API key'  # paste your actual token here
warnings.filterwarnings('ignore', category=FutureWarning)
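On Colab, the equivalent is to read the token from the notebook's secrets (a sketch using Colab's userdata API; this assumes you added a secret named HF_TOKEN via the key icon in the sidebar):
# Colab only: pull the token from the notebook's secrets instead.
from google.colab import userdata
os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')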
Next, we can load the model from HuggingFace. Note that there are some known dependency issues with stable-audio-tools on M1 Macs, so we recommend running this as a Colab notebook (or on a Linux system).
import torch
import torchaudio
# import librosa
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond
import IPython.display as ipd
from functools import partial
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
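Before sampling, it can be useful to check the model's native generation settings (just printing the two config values loaded above):
# Optional: see what the model expects natively.
print(f"sample_rate: {sample_rate} Hz")
print(f"sample_size: {sample_size} samples (~{sample_size / sample_rate:.1f} s per generation window)")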
First, we'll wrap the sampling code in a simpler helper, since there are a few parameters that need to be provided but aren't particularly useful to play around with.
# this just cleans things up a bit so the code below highlights the important knobs
easy_generate = partial(
    generate_diffusion_cond,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    device=device,
)
Next, we can define our conditioning, which for the default Stable Audio Open involves text, timing, and overall length.
# Set up text and timing conditioning
conditioning = [{
    "prompt": "clean guitar, sweep picking, 140 bpm, G minor",
    "seconds_start": 0,  # where in time this clip sits within the song
    "seconds_total": 30  # total sample length in seconds; the rest is padded with silence
}]
seed = 1000
n_steps = 50
cfg = 7.5
sampler = "dpmpp-3m-sde"
output = easy_generate(
    model,
    conditioning=conditioning,
    steps=n_steps,         # number of diffusion steps to run
    cfg_scale=cfg,         # classifier-free guidance scale
    sampler_type=sampler,  # sampling "algorithm", check out https://github.com/Stability-AI/stable-audio-tools/blob/main/stable_audio_tools/inference/sampling.py#L177 for more options
    seed=seed,
)
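If you'd like to hear how the sampler choice affects the result, one option is to sweep a few of the algorithms from the linked sampling module (a sketch; the sampler names below are believed to be among the supported options, but check the linked source for the current list, and note that each call re-runs the full diffusion loop, so this is slow without a GPU):
# Hypothetical comparison: same prompt and seed, different samplers.
outputs_by_sampler = {}
for s in ["dpmpp-3m-sde", "dpmpp-2m-sde", "k-heun"]:
    outputs_by_sampler[s] = easy_generate(
        model, conditioning=conditioning, steps=n_steps,
        cfg_scale=cfg, sampler_type=s, seed=seed,
    )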
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, and convert to int16, then trim the silent padding
output = output.to(torch.float32)
output = output.div(torch.max(torch.abs(output))).clamp(-1, 1)
output = output.mul(32767).to(torch.int16).cpu()
output = output[:, :round(conditioning[0]['seconds_total'] * sample_rate)]
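The tensor is now in the (channels, samples) int16 layout that torchaudio can write directly, so saving the clip is a single call (the filename is just an example):
# Save the trimmed clip as a 16-bit PCM WAV
torchaudio.save("output.wav", output, sample_rate)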
Now we can listen to the output! Note: if running in a Colab notebook, rendering audio will stop the autosave feature, so be sure to delete the block outputs if you want to turn autosave back on!
ipd.display(ipd.Audio(output, rate=sample_rate))