WavFlow: Meta's Open Project for Audio Generation in Raw Waveform Space

A concise look at facebookresearch/WavFlow: its positioning, method, installation, inference entry point, training flow, and usage limits. WavFlow tries to bypass latent compression and generate synchronized high-fidelity audio from video and text directly in raw waveform space.

facebookresearch/WavFlow is a multimodal audio generation project released by Meta AI. The paper title is WavFlow: Audio Generation in Waveform Space.

Project: https://github.com/facebookresearch/WavFlow

It is not focused on speech synthesis or pure music generation. Its goal is to generate synchronized, high-fidelity audio from video and text conditions. More importantly, it does not follow the common latent compression route. It tries to perform end-to-end audio generation directly in raw waveform space.

At the time of writing, the GitHub page shows about 55 stars and 3 forks. The code is mainly Python, and the project has no published release. The README also makes an important point: because of organizational policy constraints, production-trained checkpoints cannot currently be released. The team is working on a foundation checkpoint trained on fully open-source data. Until then, users need to train their own models.

What WavFlow Tries to Solve

Many multimodal audio generation methods first compress audio into a latent space, generate there, and then reconstruct the waveform. This path is efficient, but it can introduce a problem: compression may lose details, affecting audio texture, synchronization, and high-frequency information.

WavFlow tries to bypass that step and generate audio directly in raw waveform space.

The README says it uses waveform patchifying and amplitude lifting so that flow matching can work stably on raw audio, with direct x-prediction. In plainer terms, it does not first compress sound into an intermediate representation. Instead, it cuts the audio waveform itself into patches suitable for model processing and applies amplitude transformation so the model can learn generation at the waveform level.
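As a rough illustration of the patchifying idea only, the sketch below cuts a mono waveform into fixed-length, non-overlapping patches and then reverses the operation. The function names, the patch size of 4, and the zero-padding of the tail are assumptions made for illustration. The README does not state WavFlow's actual patch size or the exact form of its amplitude lifting, so that step is left out here.

```python
from typing import List


def patchify(waveform: List[float], patch_size: int) -> List[List[float]]:
    """Split a 1-D waveform into non-overlapping patches (hypothetical helper)."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    # Zero-pad the tail so the length is a multiple of patch_size (an assumption).
    padded = list(waveform)
    remainder = len(padded) % patch_size
    if remainder:
        padded += [0.0] * (patch_size - remainder)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches: List[List[float]], length: int) -> List[float]:
    """Concatenate patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


samples = [0.0, 0.1, -0.2, 0.3, 0.05, -0.4]
patches = patchify(samples, patch_size=4)
print(len(patches))  # number of patches a model would treat as tokens
print(unpatchify(patches, len(samples)) == samples)  # round trip is lossless
```

The point of the round trip is that patchifying, unlike a learned encoder and decoder, discards no information: the original samples can be recovered exactly.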

That is the most interesting part. If end-to-end waveform generation can work reliably, it may reduce the information bottleneck introduced by encoders and decoders.

Supported Input Modes

Based on the README and training guide, WavFlow supports three input modes.

The first is VT2A, or video + text to audio. The model receives video and text descriptions, then generates audio synchronized with the visual scene and semantics, such as forests, frogs, drums, or skateboards.

The second is T2A, or text to audio. There is only a text description and no video input. Training uses CLIP text features, and during inference the CSV can set video_exist to 0.

The third is V2A, or video to audio. There is video but no text. During inference, text_exist can be set to 0, and the model uses a learned empty CLIP-text token.

This design is practical. Real datasets do not always contain complete video, text, and audio annotations for every sample. WavFlow uses fields such as video_exist and text_exist to explicitly represent missing modalities, so both training and inference can handle different combinations.
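The mapping from the two flags to the three modes can be sketched as follows. The function name is hypothetical, and rejecting the case where both flags are 0 is an assumption, since the README does not describe fully unconditional generation.

```python
def input_mode(video_exist: int, text_exist: int) -> str:
    """Map the two presence flags to the mode names used in this article."""
    if video_exist and text_exist:
        return "VT2A"  # video + text to audio
    if text_exist:
        return "T2A"   # text only
    if video_exist:
        return "V2A"   # video only
    # Assumption: a sample with neither modality is treated as invalid.
    raise ValueError("at least one of video or text must be present")


for flags in [(1, 1), (0, 1), (1, 0)]:
    print(flags, input_mode(*flags))
```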

Evaluation and Positioning

The README says WavFlow is evaluated on VGGSound for VT2A and AudioCaps for T2A, with performance comparable to existing latent-based methods.

This does not mean it has already beaten all current models. The point is that end-to-end raw waveform generation does not necessarily lose to traditional latent frameworks: on acoustic richness, fidelity, and synchronization, the README reports results in the same tier.

The project page also provides demos such as forest, frog, drum, and skateboard, with more than 24 samples and side-by-side benchmark comparisons. For audio generation models, demos matter a lot, because numeric metrics alone cannot fully describe sound texture, spatial feeling, and synchronization.

Installation

The official automatic setup is:

git clone https://github.com/facebookresearch/WavFlow.git
cd WavFlow
bash scripts/setup.sh
conda activate wavflow

scripts/setup.sh creates a conda environment named wavflow and installs the required dependencies.

For manual setup, follow the README:

conda create -n wavflow python=3.10 -y
conda activate wavflow
pip install -r requirements.txt
pip install -e . --no-deps
conda install -n wavflow -c conda-forge "ffmpeg<7" -y

The ffmpeg<7 dependency is mainly for torio video decoding. The README also notes that required external weights such as CLIP, Synchformer, and the empty-string CFG embedding are downloaded or computed automatically on first run and cached under ~/.cache/wavflow/.

Running Inference

Because the official project has not released production-trained checkpoints yet, the following inference entry point only applies after you already have a trained checkpoint.

bash scripts/launch/predict.sh [--gpu N] [--config PATH]

The default config file is:

wavflow/configs/infer.yaml

The input CSV is specified by data.csv_path, and it supports video, text, or both:

video_path,caption,video_exist,text_exist
/abs/path/sample1.mp4,a whistling rocket explodes,1,1
/abs/path/sample2.mp4,birds chirping in a forest,1,1
,a whistling rocket explodes,0,1
/abs/path/sample3.mp4,,1,0

Here, video_exist=0 means no video decoding is used, and the model uses learned empty CLIP/Sync tokens. text_exist=0 means the caption is ignored and the model uses a learned empty CLIP-text token. Captions with commas need to be quoted.
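One way to avoid quoting mistakes is to generate the CSV with Python's standard csv module, which quotes any field containing a comma automatically. This is a minimal sketch: the helper name and the first caption are made up for illustration, and the paths are placeholders in the same style as the example above.

```python
import csv
import io

FIELDS = ["video_path", "caption", "video_exist", "text_exist"]

# Illustrative rows: VT2A with a comma in the caption, T2A, and V2A.
rows = [
    {"video_path": "/abs/path/sample1.mp4",
     "caption": "a dog barks, then a door slams",
     "video_exist": 1, "text_exist": 1},
    {"video_path": "", "caption": "a whistling rocket explodes",
     "video_exist": 0, "text_exist": 1},
    {"video_path": "/abs/path/sample3.mp4", "caption": "",
     "video_exist": 1, "text_exist": 0},
]


def build_csv(rows):
    """Serialize rows to CSV text; fields containing commas are quoted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv(rows)
print(csv_text)
```

To use the result, write csv_text to a file and point data.csv_path at it.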

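Common launcher options include the --gpu and --config flags, plus WAVFLOW_ENV, which by its naming appears to be an environment variable rather than a flag: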

--gpu N
--config PATH
WAVFLOW_ENV

Important config fields include model.name, model.ckpt_path, model.use_ema, inference.duration_sec, target_sample_rate, inference.cfg, num_steps, noise_scale, noise_shift, prediction_type, seed, and the output directory.

An EMA Pitfall

The README specifically warns about model.use_ema.

A WavFlow checkpoint may contain model_ema1, which is updated with ema_decay = 0.9999. If training only runs for a few hundred or a few thousand steps, the EMA tensor may still contain many random initialization values and produce noise during inference.

So if you are doing a short run, overfitting a tiny sample set, or running a smoke test, consider sampling with:

model.use_ema: false

Alternatively, use an ema_epoch_*.pth saved after enough training. This detail is useful because otherwise it is easy to assume the model is broken, when in fact the EMA has not stabilized yet.
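A quick back-of-the-envelope calculation shows why short runs are affected. Assuming the standard update ema = decay * ema + (1 - decay) * weights, with the EMA starting from the randomly initialized weights and no warm-up or bias correction (an assumption, since the README's exact update rule is not quoted here), the share of the EMA still attributable to its initial value after N steps is decay ** N.

```python
def init_weight_remaining(decay: float, steps: int) -> float:
    """Fraction of the EMA still attributable to its initial value."""
    return decay ** steps


# Step counts chosen for illustration: a smoke test, a short run, a long run.
for steps in (500, 5_000, 50_000):
    share = init_weight_remaining(0.9999, steps)
    print(f"{steps:>6} steps: {share:.4f} of the EMA is still initialization")
```

Under these assumptions, roughly 95% of the EMA is still initialization after 500 steps and about 61% after 5,000, which matches the warning: the EMA only becomes trustworthy after tens of thousands of updates.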

Training Flow

The official TRAINING.md divides training into two steps.

The first step is feature extraction.

T2A extracts only CLIP text features. VT2A extracts CLIP frame features, Synchformer features, and CLIP text features. An example CSV looks like:

id,audio_path,video_path,caption
sample1,/abs/or/relative/wav/sample1.wav,/abs/or/relative/video/sample1.mp4,a whistling rocket explodes

Videos must be at least extraction.duration_sec long, which defaults to 8 seconds. Shorter clips are skipped. Feature extraction can be run with:

bash scripts/launch/extract_t2a.sh
bash scripts/launch/extract_vt2a.sh

For more GPUs or a custom config:

NPROC_PER_NODE=4 bash scripts/launch/extract_vt2a.sh
CONFIG_PATH=path/to/your_extract.yaml bash scripts/launch/extract_t2a.sh

The second step is training.

For single-node multi-GPU training:

bash scripts/launch/train_single_node.sh

Multi-node training requires NNODES, NODE_RANK, MASTER_ADDR, MASTER_PORT, and NPROC_PER_NODE. Training outputs include checkpoint_latest.pth, checkpoint_epoch_*.pth, ema_epoch_*.pth, generated audio samples, and training.log.

Training resumes automatically: if checkpoint_latest.pth exists in the experiment directory, training continues from it.
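The resume rule is simple enough to check by hand before launching a job. The sketch below only mirrors the behavior described above; the function name is hypothetical, and it is not WavFlow's actual training code.

```python
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(exp_dir: str) -> Optional[Path]:
    """Return checkpoint_latest.pth if it exists in exp_dir, else None."""
    candidate = Path(exp_dir) / "checkpoint_latest.pth"
    return candidate if candidate.is_file() else None


# "." stands in for an experiment directory here.
checkpoint = find_resume_checkpoint(".")
print("resume from" if checkpoint else "fresh start", checkpoint or "")
```

This also means that reusing an experiment directory will silently continue an old run, so start a new directory when you want training from scratch.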

Who Should Pay Attention

WavFlow is more relevant to researchers and engineering teams than to ordinary users who want a finished sound-effect tool.

It is worth following if you:

  • Research video-to-audio, text-to-audio, or multimodal audio generation.
  • Want to compare raw waveform generation with latent-based audio generation.
  • Need to train your own audio generation model and can prepare data plus GPU resources.
  • Work on applications that require strong synchronization between video and sound.
  • Want to explore whether flow matching is viable on raw audio waveforms.

If you just want a web tool where you type a prompt and immediately get a sound effect, WavFlow is not the easiest option today. It does not yet provide a public production checkpoint, and its deployment path is closer to research code.

Things to Watch

First, do not treat it as a downloadable, ready-to-use audio generation model. The official project currently does not release production-trained checkpoints. Before real inference, you need to train your own model or wait for a future open-data checkpoint.

Second, the license does not permit commercial use by default. The README says most of WavFlow is licensed under CC BY-NC 4.0, while some vendored components keep their original licenses, including MIT, Apache 2.0, CC BY-NC 4.0, and the Stability AI Community License. Read LICENSE and NOTICE.txt carefully before any commercial use.

Third, training data is critical. WavFlow’s promise depends on aligned audio, video, and text data. If data quality is poor, captions are inaccurate, or audio and video are out of sync, the model will struggle to learn stable sound generation.

Fourth, raw waveform generation may reduce the latent bottleneck, but it may also increase training and inference cost. Real projects still need to balance audio quality, speed, VRAM, sample rate, and output duration.

Summary

The value of WavFlow is that it asks a clear question: does multimodal audio generation have to compress audio into latent space first?

With waveform patchifying, amplitude lifting, and flow matching, it tries to generate synchronized high-fidelity audio directly in raw waveform space. The evaluation suggests that this route can at least stand in the same range as mature latent-based methods.

For now, though, it is more of a research and training framework than an out-of-the-box product model. No public production checkpoint, a mostly non-commercial license, and the need for aligned audio-video-text data all make it better suited to research, reproduction, and further training. If you care about the next generation of video-to-audio or text-to-audio models, WavFlow is worth a serious look.
