<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>WavFlow on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/wavflow/</link>
        <description>Recent content in WavFlow on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 25 May 2026 08:00:37 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/wavflow/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>WavFlow: Meta&#39;s Open Project for Audio Generation in Raw Waveform Space</title>
        <link>https://knightli.com/en/2026/05/25/wavflow-raw-waveform-audio-generation/</link>
        <pubDate>Mon, 25 May 2026 08:00:37 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/25/wavflow-raw-waveform-audio-generation/</guid>
        <description>&lt;p&gt;&lt;code&gt;facebookresearch/WavFlow&lt;/code&gt; is a multimodal audio generation project released by Meta AI. The paper title is &lt;code&gt;WavFlow: Audio Generation in Waveform Space&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Project: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It is not focused on speech synthesis or pure music generation. Its goal is to generate synchronized, high-fidelity audio from video and text conditions. More importantly, it does not follow the common latent compression route. It tries to perform end-to-end audio generation directly in raw waveform space.&lt;/p&gt;
&lt;p&gt;At the time of writing, the GitHub page shows about 55 stars and 3 forks. The code is mainly Python, and the project has no published release. The README also makes an important point: because of organizational policy constraints, production-trained checkpoints cannot currently be released. The team is working on a foundation checkpoint trained on fully open-source data. Until then, users need to train their own models.&lt;/p&gt;
&lt;h2 id=&#34;what-wavflow-tries-to-solve&#34;&gt;What WavFlow Tries to Solve
&lt;/h2&gt;&lt;p&gt;Many multimodal audio generation methods first compress audio into a latent space, generate there, and then reconstruct the waveform. This path is efficient, but it can introduce a problem: compression may lose details, affecting audio texture, synchronization, and high-frequency information.&lt;/p&gt;
&lt;p&gt;WavFlow tries to bypass that step and generate audio directly in raw waveform space.&lt;/p&gt;
&lt;p&gt;The README says it uses waveform patchifying and amplitude lifting so that flow matching can work stably on raw audio, with direct &lt;code&gt;x&lt;/code&gt;-prediction. In plainer terms, it does not first compress sound into an intermediate representation. Instead, it cuts the audio waveform itself into patches suitable for model processing and applies amplitude transformation so the model can learn generation at the waveform level.&lt;/p&gt;
&lt;p&gt;That is the most interesting part. If end-to-end waveform generation can work reliably, it may reduce the information bottleneck introduced by encoders and decoders.&lt;/p&gt;
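&lt;p&gt;To make waveform patchifying more concrete, here is a minimal sketch in plain Python. It only illustrates the general idea of cutting a 1-D waveform into fixed-size patches that a model can treat as tokens. The helper name &lt;code&gt;patchify&lt;/code&gt;, the patch size, and the zero padding are assumptions made for illustration, not the project code, and amplitude lifting is left out because its exact formula is not described here.&lt;/p&gt;

```python
def patchify(waveform, patch_size):
    """Split a 1-D sequence of samples into equal-length patches.

    Hypothetical helper for illustration only; the real WavFlow
    patching code may differ in layout, padding, and patch size.
    """
    patches = []
    for start in range(0, len(waveform), patch_size):
        patch = list(waveform[start:start + patch_size])
        # Zero-pad the final patch so all patches share one length.
        patch.extend([0.0] * (patch_size - len(patch)))
        patches.append(patch)
    return patches


# Ten samples with patch_size=4: the last patch gets two padding zeros.
samples = [0.1 * i for i in range(10)]
patches = patchify(samples, 4)
print(len(patches), [len(p) for p in patches])
```

&lt;p&gt;Each patch then plays the role of one sequence element, which keeps the sequence length manageable even though no learned encoder is involved.&lt;/p&gt;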
&lt;h2 id=&#34;supported-input-modes&#34;&gt;Supported Input Modes
&lt;/h2&gt;&lt;p&gt;Based on the README and training guide, WavFlow supports three input modes.&lt;/p&gt;
&lt;p&gt;The first is VT2A, or video + text to audio. The model receives video and text descriptions, then generates audio synchronized with the visual scene and semantics, such as forests, frogs, drums, or skateboards.&lt;/p&gt;
&lt;p&gt;The second is T2A, or text to audio. There is only a text description and no video input. Training uses CLIP text features, and during inference the CSV can set &lt;code&gt;video_exist&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The third is V2A, or video to audio. There is video but no text. During inference, &lt;code&gt;text_exist&lt;/code&gt; can be set to &lt;code&gt;0&lt;/code&gt;, and the model uses a learned empty CLIP-text token.&lt;/p&gt;
&lt;p&gt;This design is practical. Real datasets do not always contain complete video, text, and audio annotations for every sample. WavFlow uses fields such as &lt;code&gt;video_exist&lt;/code&gt; and &lt;code&gt;text_exist&lt;/code&gt; to explicitly represent missing modalities, so both training and inference can handle different combinations.&lt;/p&gt;
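&lt;p&gt;The relationship between the two flags and the three modes can be sketched as a small lookup. The function name &lt;code&gt;input_mode&lt;/code&gt; is hypothetical, and treating the case where both flags are 0 as an error is an assumption, since the modes described above always have at least one condition present.&lt;/p&gt;

```python
def input_mode(video_exist, text_exist):
    """Map the two presence flags to the mode names used in this article."""
    modes = {
        (1, 1): "VT2A",  # video + text to audio
        (0, 1): "T2A",   # text only
        (1, 0): "V2A",   # video only
    }
    # CSV values arrive as strings, so normalize to int first.
    key = (int(video_exist), int(text_exist))
    if key not in modes:
        # Assumption: a row with neither video nor text is rejected here.
        raise ValueError("expected flags of 0 or 1 with at least one set")
    return modes[key]


print(input_mode(1, 1), input_mode(0, 1), input_mode(1, 0))
```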
&lt;h2 id=&#34;evaluation-and-positioning&#34;&gt;Evaluation and Positioning
&lt;/h2&gt;&lt;p&gt;The README says WavFlow is evaluated on VGGSound for VT2A and AudioCaps for T2A, with performance comparable to existing latent-based methods.&lt;/p&gt;
&lt;p&gt;This does not mean WavFlow has already beaten all current models. The point is that end-to-end raw waveform generation does not necessarily lose to traditional latent frameworks: on acoustic richness, fidelity, and synchronization at least, it can reach the same tier.&lt;/p&gt;
&lt;p&gt;The project page also provides demos such as forest, frog, drum, and skateboard, with more than 24 samples and side-by-side benchmark comparisons. For audio generation models, demos matter a lot, because objective metrics alone cannot fully describe sound texture, spatial feeling, and synchronization.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;p&gt;The official automatic setup is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/facebookresearch/WavFlow.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; WavFlow
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/setup.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate wavflow
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;scripts/setup.sh&lt;/code&gt; creates a conda environment named &lt;code&gt;wavflow&lt;/code&gt; and installs the required dependencies.&lt;/p&gt;
&lt;p&gt;For manual setup, follow the README:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda create -n wavflow &lt;span class=&#34;nv&#34;&gt;python&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;3.10 -y
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate wavflow
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e . --no-deps
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda install -n wavflow -c conda-forge &lt;span class=&#34;s2&#34;&gt;&amp;#34;ffmpeg&amp;lt;7&amp;#34;&lt;/span&gt; -y
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;ffmpeg&amp;lt;7&lt;/code&gt; dependency is mainly for torio video decoding. The README also notes that required external weights such as CLIP, Synchformer, and the empty-string CFG embedding are downloaded or computed automatically on first run and cached under &lt;code&gt;~/.cache/wavflow/&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;running-inference&#34;&gt;Running Inference
&lt;/h2&gt;&lt;p&gt;Because the official project has not yet released production-trained checkpoints, the following inference entry point is only usable once you have a checkpoint you trained yourself.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/predict.sh &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;--gpu N&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;--config PATH&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The default config file is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wavflow/configs/infer.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The input CSV is specified by &lt;code&gt;data.csv_path&lt;/code&gt;, and it supports video, text, or both:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csv&#34; data-lang=&#34;csv&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;caption&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;video_exist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;text_exist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample1.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample2.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;birds chirping in a forest&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample3.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;video_exist=0&lt;/code&gt; means no video decoding is used, and the model uses learned empty CLIP/Sync tokens. &lt;code&gt;text_exist=0&lt;/code&gt; means the caption is ignored and the model uses a learned empty CLIP-text token. Captions with commas need to be quoted.&lt;/p&gt;
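&lt;p&gt;Hand-writing the quoting is easy to get wrong, so one option is to generate the CSV with the Python standard library, which quotes comma-containing captions automatically. This is a minimal sketch: the column names come from the example above, while the sample paths and captions are placeholders.&lt;/p&gt;

```python
import csv
import io

FIELDS = ["video_path", "caption", "video_exist", "text_exist"]

rows = [
    # VT2A: video and caption; this caption contains a comma.
    {"video_path": "/abs/path/sample1.mp4",
     "caption": "a rocket whistles, then explodes",
     "video_exist": 1, "text_exist": 1},
    # T2A: no video, so video_path is left empty.
    {"video_path": "", "caption": "birds chirping in a forest",
     "video_exist": 0, "text_exist": 1},
    # V2A: video only, so the caption is left empty.
    {"video_path": "/abs/path/sample3.mp4", "caption": "",
     "video_exist": 1, "text_exist": 0},
]

# Write to an in-memory buffer; the default dialect quotes only
# the fields that need it, such as captions containing commas.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the caption with its comma intact.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["caption"])
```

&lt;p&gt;In practice you would write &lt;code&gt;csv_text&lt;/code&gt; to a file and point &lt;code&gt;data.csv_path&lt;/code&gt; at it.&lt;/p&gt;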
&lt;p&gt;Common launcher flags and environment variables include:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--gpu N
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--config PATH
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;WAVFLOW_ENV
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Important config fields include &lt;code&gt;model.name&lt;/code&gt;, &lt;code&gt;model.ckpt_path&lt;/code&gt;, &lt;code&gt;model.use_ema&lt;/code&gt;, &lt;code&gt;inference.duration_sec&lt;/code&gt;, &lt;code&gt;target_sample_rate&lt;/code&gt;, &lt;code&gt;inference.cfg&lt;/code&gt;, &lt;code&gt;num_steps&lt;/code&gt;, &lt;code&gt;noise_scale&lt;/code&gt;, &lt;code&gt;noise_shift&lt;/code&gt;, &lt;code&gt;prediction_type&lt;/code&gt;, &lt;code&gt;seed&lt;/code&gt;, and the output directory.&lt;/p&gt;
&lt;h2 id=&#34;an-ema-pitfall&#34;&gt;An EMA Pitfall
&lt;/h2&gt;&lt;p&gt;The README specifically warns about &lt;code&gt;model.use_ema&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A WavFlow checkpoint may contain &lt;code&gt;model_ema1&lt;/code&gt;, which is updated with &lt;code&gt;ema_decay = 0.9999&lt;/code&gt;. A decay that close to 1 moves the EMA very slowly, so if training only runs for a few hundred or a few thousand steps, the EMA tensors may still consist largely of random initialization values and produce noise during inference.&lt;/p&gt;
&lt;p&gt;So if you are doing a short run, overfitting a tiny sample set, or running a smoke test, consider sampling with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;model.use_ema&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;false&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Alternatively, use an &lt;code&gt;ema_epoch_*.pth&lt;/code&gt; saved after enough training. This detail is useful because otherwise it is easy to assume the model is broken, when in fact the EMA has not stabilized yet.&lt;/p&gt;
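&lt;p&gt;A rough calculation shows why short runs are affected. The sketch below assumes the common update rule &lt;code&gt;ema = decay * ema + (1 - decay) * param&lt;/code&gt;, applied once per step with no warm-up schedule; that is an assumption about the trainer, not something confirmed here. Under that rule, the share of the EMA still coming from its initial value after a number of steps is the decay raised to that number.&lt;/p&gt;

```python
def ema_init_weight(decay, steps):
    """Fraction of the EMA still coming from its initial value.

    Assumes ema = decay * ema + (1 - decay) * param once per step,
    with no warm-up (an assumption about the training loop).
    """
    return decay ** steps


for steps in (100, 1000, 10000, 50000):
    share = ema_init_weight(0.9999, steps)
    print(f"{steps:6d} steps: {share:.1%} of the EMA is still the initial value")
```

&lt;p&gt;Under these assumptions, about 90% of the EMA is still the initial value after 1,000 steps and roughly 37% after 10,000, so the EMA weights only become trustworthy after tens of thousands of updates.&lt;/p&gt;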
&lt;h2 id=&#34;training-flow&#34;&gt;Training Flow
&lt;/h2&gt;&lt;p&gt;The official &lt;code&gt;TRAINING.md&lt;/code&gt; divides training into two steps.&lt;/p&gt;
&lt;p&gt;The first step is feature extraction.&lt;/p&gt;
&lt;p&gt;T2A extracts only CLIP text features. VT2A extracts CLIP frame features, Synchformer features, and CLIP text features. An example CSV looks like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csv&#34; data-lang=&#34;csv&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;audio_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;caption&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;sample1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/abs/or/relative/wav/sample1.wav&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/abs/or/relative/video/sample1.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Videos must be at least &lt;code&gt;extraction.duration_sec&lt;/code&gt; long, which defaults to 8 seconds. Shorter clips are skipped. Feature extraction can be run with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/extract_t2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/extract_vt2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For more GPUs or a custom config:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;NPROC_PER_NODE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt; bash scripts/launch/extract_vt2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CONFIG_PATH&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;path/to/your_extract.yaml bash scripts/launch/extract_t2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The second step is training.&lt;/p&gt;
&lt;p&gt;For single-node multi-GPU training:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/train_single_node.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Multi-node training requires &lt;code&gt;NNODES&lt;/code&gt;, &lt;code&gt;NODE_RANK&lt;/code&gt;, &lt;code&gt;MASTER_ADDR&lt;/code&gt;, &lt;code&gt;MASTER_PORT&lt;/code&gt;, and &lt;code&gt;NPROC_PER_NODE&lt;/code&gt;. Training outputs include &lt;code&gt;checkpoint_latest.pth&lt;/code&gt;, &lt;code&gt;checkpoint_epoch_*.pth&lt;/code&gt;, &lt;code&gt;ema_epoch_*.pth&lt;/code&gt;, generated audio samples, and &lt;code&gt;training.log&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Training resumes automatically: if &lt;code&gt;checkpoint_latest.pth&lt;/code&gt; exists in the experiment directory, training continues from it.&lt;/p&gt;
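&lt;p&gt;Because resuming is implicit, it can help to check an experiment directory before launching, so an old run is not continued by accident. This is a minimal sketch: only the file name &lt;code&gt;checkpoint_latest.pth&lt;/code&gt; comes from the training guide, while the helper name &lt;code&gt;find_resume_checkpoint&lt;/code&gt; and the directory used are hypothetical.&lt;/p&gt;

```python
import tempfile
from pathlib import Path


def find_resume_checkpoint(exp_dir):
    """Return the path to checkpoint_latest.pth if it exists, else None."""
    candidate = Path(exp_dir) / "checkpoint_latest.pth"
    return candidate if candidate.is_file() else None


# Demonstration on a throwaway directory standing in for an experiment dir.
with tempfile.TemporaryDirectory() as exp_dir:
    before = find_resume_checkpoint(exp_dir)
    (Path(exp_dir) / "checkpoint_latest.pth").write_bytes(b"")
    after = find_resume_checkpoint(exp_dir)
    print("fresh run" if before is None else "would resume")
    print("fresh run" if after is None else "would resume")
```

&lt;p&gt;If the check reports an existing checkpoint and you want a fresh run, use a new experiment directory or move the old file away first.&lt;/p&gt;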
&lt;h2 id=&#34;who-should-pay-attention&#34;&gt;Who Should Pay Attention
&lt;/h2&gt;&lt;p&gt;WavFlow is more relevant to researchers and engineering teams than to ordinary users who want a finished sound-effect tool.&lt;/p&gt;
&lt;p&gt;It is worth following if you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Research video-to-audio, text-to-audio, or multimodal audio generation.&lt;/li&gt;
&lt;li&gt;Want to compare raw waveform generation with latent-based audio generation.&lt;/li&gt;
&lt;li&gt;Need to train your own audio generation model and can prepare data plus GPU resources.&lt;/li&gt;
&lt;li&gt;Work on applications that require strong synchronization between video and sound.&lt;/li&gt;
&lt;li&gt;Want to explore whether flow matching is viable on raw audio waveforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you just want a web tool where you type a prompt and immediately get a sound effect, WavFlow is not the easiest option today. It does not yet provide a public production checkpoint, and its deployment path is closer to research code.&lt;/p&gt;
&lt;h2 id=&#34;things-to-watch&#34;&gt;Things to Watch
&lt;/h2&gt;&lt;p&gt;First, do not treat it as a downloadable, ready-to-use audio generation model. The official project currently does not release production-trained checkpoints. Before real inference, you need to train your own model or wait for a future open-data checkpoint.&lt;/p&gt;
&lt;p&gt;Second, the license is not permissive for commercial use. The README says most of WavFlow is licensed under CC BY-NC 4.0, while some vendored components keep their original licenses, including MIT, Apache 2.0, CC BY-NC 4.0, and the Stability AI Community License. Read &lt;code&gt;LICENSE&lt;/code&gt; and &lt;code&gt;NOTICE.txt&lt;/code&gt; carefully before commercial use.&lt;/p&gt;
&lt;p&gt;Third, training data is critical. WavFlow&amp;rsquo;s promise depends on aligned audio, video, and text data. If data quality is poor, captions are inaccurate, or audio and video are out of sync, the model will struggle to learn stable sound generation.&lt;/p&gt;
&lt;p&gt;Fourth, raw waveform generation may reduce the latent bottleneck, but it may also increase training and inference cost. Real projects still need to balance audio quality, speed, VRAM, sample rate, and output duration.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The value of WavFlow is that it asks a clear question: does multimodal audio generation have to compress audio into latent space first?&lt;/p&gt;
&lt;p&gt;With waveform patchifying, amplitude lifting, and flow matching, it tries to generate synchronized high-fidelity audio directly in raw waveform space. The evaluation suggests that this route can at least perform in the same range as mature latent-based methods.&lt;/p&gt;
&lt;p&gt;For now, though, it is more of a research and training framework than an out-of-the-box product model. No public production checkpoint, a mostly non-commercial license, and the need for aligned audio-video-text data all make it better suited to research, reproduction, and further training. If you care about the next generation of video-to-audio or text-to-audio models, WavFlow is worth a serious look.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;facebookresearch/WavFlow: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow Project Page: &lt;a class=&#34;link&#34; href=&#34;https://facebookresearch.github.io/WavFlow/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://facebookresearch.github.io/WavFlow/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow arXiv: &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2605.18749&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://arxiv.org/abs/2605.18749&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow Training Guide: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
