facebookresearch/WavFlow is a multimodal audio generation project released by Meta AI. The accompanying paper is titled "WavFlow: Audio Generation in Waveform Space."
Project: https://github.com/facebookresearch/WavFlow
It is not focused on speech synthesis or pure music generation. Its goal is to generate synchronized, high-fidelity audio from video and text conditions. More importantly, it does not follow the common latent compression route. It tries to perform end-to-end audio generation directly in raw waveform space.
At the time of writing, the GitHub page shows about 55 stars and 3 forks. The code is mainly Python, and the project has no published release. The README also makes an important point: because of organizational policy constraints, production-trained checkpoints cannot currently be released. The team is working on a foundation checkpoint trained on fully open-source data. Until then, users need to train their own models.
What WavFlow Tries to Solve
Many multimodal audio generation methods first compress audio into a latent space, generate there, and then reconstruct the waveform. This path is efficient, but compression can discard detail, which may degrade audio texture, synchronization, and high-frequency content.
WavFlow tries to bypass that step and generate audio directly in raw waveform space.
The README says it uses waveform patchifying and amplitude lifting so that flow matching can work stably on raw audio, with direct x-prediction. In plainer terms, it does not first compress sound into an intermediate representation. Instead, it cuts the audio waveform itself into patches suitable for model processing and applies amplitude transformation so the model can learn generation at the waveform level.
That is the most interesting part. If end-to-end waveform generation can work reliably, it may reduce the information bottleneck introduced by encoders and decoders.
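The two ideas above, patchifying a waveform and sampling with an x-predicting flow-matching model, can be sketched on a toy signal. This is a minimal illustration and not WavFlow's implementation: the function names, the patch size, the linear noise-to-data path, and the `oracle` stand-in for a trained network are all assumptions made for the example. Amplitude lifting is left out because the README summary quoted here does not define it.

```python
import random


def patchify(waveform, patch_size):
    """Cut a mono waveform into fixed-size patches, zero-padding the tail."""
    pad = (-len(waveform)) % patch_size
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop the padding."""
    return [s for patch in patches for s in patch][:length]


def sample_with_x_prediction(predict_x, noise, num_steps):
    """Euler integration from noise (t=0) to data (t=1).

    Assumes the linear path x_t = (1 - t) * noise + t * x, whose velocity is
    x - noise = (x_pred - x_t) / (1 - t) when the model predicts the clean x.
    """
    x_t = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # largest t is 1 - dt, so (1 - t) is never zero
        x_pred = predict_x(x_t, t)
        x_t = [xt + dt * (xp - xt) / (1.0 - t) for xt, xp in zip(x_t, x_pred)]
    return x_t


# Toy target: ten samples, patched into groups of four (two zeros of padding).
target = [0.1 * i for i in range(10)]
patches = patchify(target, patch_size=4)
flat_target = [s for patch in patches for s in patch]


def oracle(x_t, t):
    # Hypothetical stand-in for a trained network: returns the true clean signal.
    return flat_target


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in flat_target]
generated_flat = sample_with_x_prediction(oracle, noise, num_steps=8)
generated = unpatchify(patchify(generated_flat, 4), len(target))
print(len(patches), generated)
```

With a perfect predictor the Euler steps stay on the straight line between noise and data, so the sampler lands on the target. A real model only approximates the clean signal, which is why the number of steps and the noise schedule matter in practice.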
Supported Input Modes
Based on the README and training guide, WavFlow supports three input modes.
The first is VT2A, or video + text to audio. The model receives video and text descriptions, then generates audio synchronized with the visual scene and semantics, such as forests, frogs, drums, or skateboards.
The second is T2A, or text to audio. There is only a text description and no video input. Training uses CLIP text features, and during inference the CSV can set video_exist to 0.
The third is V2A, or video to audio. There is video but no text. During inference, text_exist can be set to 0, and the model uses a learned empty CLIP-text token.
This design is practical. Real datasets do not always contain complete video, text, and audio annotations for every sample. WavFlow uses fields such as video_exist and text_exist to explicitly represent missing modalities, so both training and inference can handle different combinations.
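The mapping from the two flags to the three input modes can be written down as a small helper. This is only a sketch of the logic described above: the function name is hypothetical, and treating "neither modality present" as an error is an assumption, not something the README states.

```python
def input_mode(video_exist, text_exist):
    """Map the video_exist / text_exist flags to the mode names used above."""
    has_video = bool(int(video_exist))
    has_text = bool(int(text_exist))
    if has_video and has_text:
        return "VT2A"  # video + text to audio
    if has_video:
        return "V2A"   # video only; the text slot uses the learned empty token
    if has_text:
        return "T2A"   # text only; no video decoding
    # Assumption: a row with no conditioning at all is treated as invalid here.
    raise ValueError("at least one of video_exist or text_exist must be 1")


for flags in [(1, 1), (1, 0), (0, 1)]:
    print(flags, input_mode(*flags))
```

Accepting the flags as either integers or strings is a convenience, since values read from a CSV file arrive as strings.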
Evaluation and Positioning
The README says WavFlow is evaluated on VGGSound for VT2A and AudioCaps for T2A, with performance comparable to existing latent-based methods.
This does not mean WavFlow has already beaten all current models. The claim is narrower: end-to-end raw waveform generation does not necessarily lose to traditional latent frameworks. At least on acoustic richness, fidelity, and synchronization, it can reach the same tier.
The project page also provides demos such as forest, frog, drum, and skateboard, with more than 24 samples and side-by-side benchmark comparisons. For audio generation models, demos matter a lot, because numeric metrics cannot fully describe sound texture, spatial feeling, and synchronization.
Installation
The official automatic setup runs scripts/setup.sh, which creates a conda environment named wavflow and installs the required dependencies.
For manual setup, follow the README:
|
|
The ffmpeg<7 dependency is mainly for torio video decoding. The README also notes that required external weights such as CLIP, Synchformer, and the empty-string CFG embedding are downloaded or computed automatically on first run and cached under ~/.cache/wavflow/.
Running Inference
Because the official project has not released production-trained checkpoints yet, the following inference entry point only applies after you already have a trained checkpoint.
|
|
The default config file is:
|
|
The input CSV is specified by data.csv_path, and it supports video, text, or both:
|
|
Here, video_exist=0 means no video decoding is used, and the model uses learned empty CLIP/Sync tokens. text_exist=0 means the caption is ignored and the model uses a learned empty CLIP-text token. Captions with commas need to be quoted.
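The quoting rule for captions is easy to get right by letting a CSV library write the file. The sketch below uses Python's standard csv module. The video_exist and text_exist columns come from the README; the video_path and caption column names, the column order, and the file paths are hypothetical and should be checked against the project's own example CSV.

```python
import csv
import io

# Hypothetical rows: column names other than video_exist / text_exist are assumed.
rows = [
    {"video_path": "clips/frog.mp4", "caption": "frogs croaking, light rain",
     "video_exist": 1, "text_exist": 1},
    {"video_path": "", "caption": "a skateboard rolling on concrete",
     "video_exist": 0, "text_exist": 1},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["video_path", "caption", "video_exist", "text_exist"]
)
writer.writeheader()
writer.writerows(rows)  # the default QUOTE_MINIMAL quotes fields containing commas
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the caption as a single field.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["caption"])
```

Writing the file by hand works too, as long as any caption containing a comma is wrapped in double quotes.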
Common launcher parameters include:
|
|
Important config fields include model.name, model.ckpt_path, model.use_ema, inference.duration_sec, target_sample_rate, inference.cfg, num_steps, noise_scale, noise_shift, prediction_type, seed, and the output directory.
An EMA Pitfall
The README specifically warns about model.use_ema.
A WavFlow checkpoint may contain model_ema1, which is updated with ema_decay = 0.9999. If training only runs for a few hundred or a few thousand steps, the EMA tensor may still contain many random initialization values and produce noise during inference.
So if you are doing a short run, overfitting a tiny sample set, or running a smoke test, consider sampling with model.use_ema set to false.
Alternatively, use an ema_epoch_*.pth saved after enough training. This detail is useful because otherwise it is easy to assume the model is broken, when in fact the EMA has not stabilized yet.
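The EMA warning can be checked with quick arithmetic. The sketch below assumes the standard update ema = decay * ema + (1 - decay) * weights with no warmup, and assumes the EMA copy starts from the randomly initialized weights. Under those assumptions, the share of the EMA that still comes from initialization after n steps is decay to the power n. The function name is made up for the example.

```python
def init_weight_remaining(decay, steps):
    """Fraction of the EMA tensor still contributed by its initial value."""
    return decay ** steps


ema_decay = 0.9999  # value quoted from the README
for steps in (500, 1_000, 10_000, 50_000):
    share = init_weight_remaining(ema_decay, steps)
    print(f"{steps:>6} steps: {share:.1%} of the EMA is still initialization")
```

After 1,000 steps roughly 90% of the EMA is still the initial random values, and after 10,000 steps roughly 37% remains. That matches the advice to sample from the raw weights during short runs.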
Training Flow
The official TRAINING.md divides training into two steps.
The first step is feature extraction.
T2A extracts only CLIP text features. VT2A extracts CLIP frame features, Synchformer features, and CLIP text features. An example CSV looks like:
|
|
Videos must be at least extraction.duration_sec long, which defaults to 8 seconds. Shorter clips are skipped. Feature extraction can be run with:
|
|
For more GPUs or a custom config:
|
|
The second step is training.
For single-node multi-GPU training:
|
|
Multi-node training requires NNODES, NODE_RANK, MASTER_ADDR, MASTER_PORT, and NPROC_PER_NODE. Training outputs include checkpoint_latest.pth, checkpoint_epoch_*.pth, ema_epoch_*.pth, generated audio samples, and training.log.
Training resumes automatically: if checkpoint_latest.pth exists in the experiment directory, training continues from it.
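The resume rule amounts to a file-existence check. The helper below is a hypothetical sketch of that check, not the project's code. Only the checkpoint_latest.pth file name comes from the training guide.

```python
import tempfile
from pathlib import Path


def find_resume_checkpoint(exp_dir):
    """Return the path to checkpoint_latest.pth if present, else None."""
    candidate = Path(exp_dir) / "checkpoint_latest.pth"
    return candidate if candidate.is_file() else None


# Demonstration in a throwaway directory standing in for an experiment folder.
with tempfile.TemporaryDirectory() as exp_dir:
    fresh = find_resume_checkpoint(exp_dir)    # nothing saved yet
    (Path(exp_dir) / "checkpoint_latest.pth").touch()
    resumed = find_resume_checkpoint(exp_dir)  # now a resume target exists
    print(fresh, resumed.name)
```

One practical consequence: to start a run from scratch, point training at a new experiment directory, or move the old checkpoint_latest.pth out of the way first.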
Who Should Pay Attention
WavFlow is more relevant to researchers and engineering teams than to ordinary users who want a finished sound-effect tool.
It is worth following if you:
- Research video-to-audio, text-to-audio, or multimodal audio generation.
- Want to compare raw waveform generation with latent-based audio generation.
- Need to train your own audio generation model and can prepare data plus GPU resources.
- Work on applications that require strong synchronization between video and sound.
- Want to explore whether flow matching is viable on raw audio waveforms.
If you just want a web tool where you type a prompt and immediately get a sound effect, WavFlow is not the easiest option today. It does not yet provide a public production checkpoint, and its deployment path is closer to research code.
Things to Watch
First, do not treat it as a downloadable, ready-to-use audio generation model. The official project currently does not release production-trained checkpoints. Before real inference, you need to train your own model or wait for a future open-data checkpoint.
Second, the license does not permit commercial use by default. The README says most of WavFlow is licensed under CC BY-NC 4.0, while some vendored components keep their original licenses, including MIT, Apache 2.0, CC BY-NC 4.0, and the Stability AI Community License. Read LICENSE and NOTICE.txt carefully before commercial use.
Third, training data is critical. WavFlow’s promise depends on aligned audio, video, and text data. If data quality is poor, captions are inaccurate, or audio and video are out of sync, the model will struggle to learn stable sound generation.
Fourth, raw waveform generation may reduce the latent bottleneck, but it may also increase training and inference cost. Real projects still need to balance audio quality, speed, VRAM, sample rate, and output duration.
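The cost point can be made concrete with a rough sequence-length calculation. Every number below is illustrative: the sample rates and patch size are assumptions rather than WavFlow settings, and the 8-second duration simply mirrors the extraction default mentioned earlier.

```python
import math


def sequence_length(sample_rate, duration_sec, patch_size):
    """Return (raw sample count, patch count) for one mono clip."""
    num_samples = int(sample_rate * duration_sec)
    return num_samples, math.ceil(num_samples / patch_size)


# Hypothetical settings, chosen only to show how the numbers scale.
for sample_rate in (16_000, 44_100):
    samples, patches = sequence_length(sample_rate, duration_sec=8, patch_size=64)
    print(f"{sample_rate} Hz: {samples} samples -> {patches} patches")
```

Sequence length grows linearly with both sample rate and duration, so a higher sample rate or a longer clip directly increases the number of patches the model has to process.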
Summary
The value of WavFlow is that it asks a clear question: does multimodal audio generation have to compress audio into latent space first?
With waveform patchifying, amplitude lifting, and flow matching, it tries to generate synchronized high-fidelity audio directly in raw waveform space. The reported evaluation suggests this route performs in the same range as established latent-based methods.
For now, though, it is more of a research and training framework than an out-of-the-box product model. No public production checkpoint, a mostly non-commercial license, and the need for aligned audio-video-text data all make it better suited to research, reproduction, and further training. If you care about the next generation of video-to-audio or text-to-audio models, WavFlow is worth a serious look.
References
- facebookresearch/WavFlow: https://github.com/facebookresearch/WavFlow
- WavFlow Project Page: https://facebookresearch.github.io/WavFlow/
- WavFlow arXiv: https://arxiv.org/abs/2605.18749
- WavFlow Training Guide: https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md