<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Multimodal on KnightLi Blog</title>
        <link>https://knightli.com/en/tags/multimodal/</link>
        <description>Recent content in Multimodal on KnightLi Blog</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <lastBuildDate>Mon, 25 May 2026 08:00:37 +0800</lastBuildDate><atom:link href="https://knightli.com/en/tags/multimodal/index.xml" rel="self" type="application/rss+xml" /><item>
        <title>WavFlow: Meta&#39;s Open Project for Audio Generation in Raw Waveform Space</title>
        <link>https://knightli.com/en/2026/05/25/wavflow-raw-waveform-audio-generation/</link>
        <pubDate>Mon, 25 May 2026 08:00:37 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/25/wavflow-raw-waveform-audio-generation/</guid>
        <description>&lt;p&gt;&lt;code&gt;facebookresearch/WavFlow&lt;/code&gt; is a multimodal audio generation project released by Meta AI. The paper title is &lt;code&gt;WavFlow: Audio Generation in Waveform Space&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Project: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It is not focused on speech synthesis or pure music generation. Its goal is to generate synchronized, high-fidelity audio from video and text conditions. More importantly, it does not follow the common latent compression route. It tries to perform end-to-end audio generation directly in raw waveform space.&lt;/p&gt;
&lt;p&gt;At the time of writing, the GitHub page shows about 55 stars and 3 forks. The code is mainly Python, and the project has no published release. The README also makes an important point: because of organizational policy constraints, production-trained checkpoints cannot currently be released. The team is working on a foundation checkpoint trained on fully open-source data. Until then, users need to train their own models.&lt;/p&gt;
&lt;h2 id=&#34;what-wavflow-tries-to-solve&#34;&gt;What WavFlow Tries to Solve
&lt;/h2&gt;&lt;p&gt;Many multimodal audio generation methods first compress audio into a latent space, generate there, and then reconstruct the waveform. This path is efficient, but it can introduce a problem: compression may lose details, affecting audio texture, synchronization, and high-frequency information.&lt;/p&gt;
&lt;p&gt;WavFlow tries to bypass that step and generate audio directly in raw waveform space.&lt;/p&gt;
&lt;p&gt;The README says it uses waveform patchifying and amplitude lifting so that flow matching can work stably on raw audio, with direct &lt;code&gt;x&lt;/code&gt;-prediction. In plainer terms, it does not first compress sound into an intermediate representation. Instead, it cuts the audio waveform itself into patches suitable for model processing and applies amplitude transformation so the model can learn generation at the waveform level.&lt;/p&gt;
&lt;p&gt;That is the most interesting part. If end-to-end waveform generation can work reliably, it may reduce the information bottleneck introduced by encoders and decoders.&lt;/p&gt;
&lt;h2 id=&#34;supported-input-modes&#34;&gt;Supported Input Modes
&lt;/h2&gt;&lt;p&gt;Based on the README and training guide, WavFlow supports three input modes.&lt;/p&gt;
&lt;p&gt;The first is VT2A, or video + text to audio. The model receives video and text descriptions, then generates audio synchronized with the visual scene and semantics, such as forests, frogs, drums, or skateboards.&lt;/p&gt;
&lt;p&gt;The second is T2A, or text to audio. There is only a text description and no video input. Training uses CLIP text features, and during inference the CSV can set &lt;code&gt;video_exist&lt;/code&gt; to &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The third is V2A, or video to audio. There is video but no text. During inference, &lt;code&gt;text_exist&lt;/code&gt; can be set to &lt;code&gt;0&lt;/code&gt;, and the model uses a learned empty CLIP-text token.&lt;/p&gt;
&lt;p&gt;This design is practical. Real datasets do not always contain complete video, text, and audio annotations for every sample. WavFlow uses fields such as &lt;code&gt;video_exist&lt;/code&gt; and &lt;code&gt;text_exist&lt;/code&gt; to explicitly represent missing modalities, so both training and inference can handle different combinations.&lt;/p&gt;
&lt;h2 id=&#34;evaluation-and-positioning&#34;&gt;Evaluation and Positioning
&lt;/h2&gt;&lt;p&gt;The README says WavFlow is evaluated on VGGSound for VT2A and AudioCaps for T2A, with performance comparable to existing latent-based methods.&lt;/p&gt;
&lt;p&gt;The meaning is not that it has already beaten all current models. It is that end-to-end raw waveform generation does not necessarily lose to traditional latent frameworks. At least on acoustic richness, fidelity, and synchronization, it can reach the same tier.&lt;/p&gt;
&lt;p&gt;The project page also provides demos such as forest, frog, drum, and skateboard, with more than 24 samples and side-by-side benchmark comparisons. For audio generation models, demos matter a lot, because text metrics cannot fully describe sound texture, spatial feeling, and synchronization.&lt;/p&gt;
&lt;h2 id=&#34;installation&#34;&gt;Installation
&lt;/h2&gt;&lt;p&gt;The official automatic setup is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;git clone https://github.com/facebookresearch/WavFlow.git
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nb&#34;&gt;cd&lt;/span&gt; WavFlow
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/setup.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate wavflow
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;&lt;code&gt;scripts/setup.sh&lt;/code&gt; creates a conda environment named &lt;code&gt;wavflow&lt;/code&gt; and installs the required dependencies.&lt;/p&gt;
&lt;p&gt;For manual setup, follow the README:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda create -n wavflow &lt;span class=&#34;nv&#34;&gt;python&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;3.10 -y
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda activate wavflow
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -r requirements.txt
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;pip install -e . --no-deps
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;conda install -n wavflow -c conda-forge &lt;span class=&#34;s2&#34;&gt;&amp;#34;ffmpeg&amp;lt;7&amp;#34;&lt;/span&gt; -y
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The &lt;code&gt;ffmpeg&amp;lt;7&lt;/code&gt; dependency is mainly for torio video decoding. The README also notes that required external weights such as CLIP, Synchformer, and the empty-string CFG embedding are downloaded or computed automatically on first run and cached under &lt;code&gt;~/.cache/wavflow/&lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;running-inference&#34;&gt;Running Inference
&lt;/h2&gt;&lt;p&gt;Because the official project has not released production-trained checkpoints yet, the following inference entry point only applies after you already have a trained checkpoint.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/predict.sh &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;--gpu N&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;[&lt;/span&gt;--config PATH&lt;span class=&#34;o&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The default config file is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;wavflow/configs/infer.yaml
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The input CSV is specified by &lt;code&gt;data.csv_path&lt;/code&gt;, and it supports video, text, or both:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csv&#34; data-lang=&#34;csv&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;caption&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;video_exist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;text_exist&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample1.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample2.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;birds chirping in a forest&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;/abs/path/sample3.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Here, &lt;code&gt;video_exist=0&lt;/code&gt; means no video decoding is used, and the model uses learned empty CLIP/Sync tokens. &lt;code&gt;text_exist=0&lt;/code&gt; means the caption is ignored and the model uses a learned empty CLIP-text token. Captions with commas need to be quoted.&lt;/p&gt;
&lt;p&gt;Common launcher parameters include:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--gpu N
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;--config PATH
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;WAVFLOW_ENV
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Important config fields include &lt;code&gt;model.name&lt;/code&gt;, &lt;code&gt;model.ckpt_path&lt;/code&gt;, &lt;code&gt;model.use_ema&lt;/code&gt;, &lt;code&gt;inference.duration_sec&lt;/code&gt;, &lt;code&gt;target_sample_rate&lt;/code&gt;, &lt;code&gt;inference.cfg&lt;/code&gt;, &lt;code&gt;num_steps&lt;/code&gt;, &lt;code&gt;noise_scale&lt;/code&gt;, &lt;code&gt;noise_shift&lt;/code&gt;, &lt;code&gt;prediction_type&lt;/code&gt;, &lt;code&gt;seed&lt;/code&gt;, and the output directory.&lt;/p&gt;
&lt;h2 id=&#34;an-ema-pitfall&#34;&gt;An EMA Pitfall
&lt;/h2&gt;&lt;p&gt;The README specifically warns about &lt;code&gt;model.use_ema&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A WavFlow checkpoint may contain &lt;code&gt;model_ema1&lt;/code&gt;, which is updated with &lt;code&gt;ema_decay = 0.9999&lt;/code&gt;. If training only runs for a few hundred or a few thousand steps, the EMA tensor may still contain many random initialization values and produce noise during inference.&lt;/p&gt;
&lt;p&gt;So if you are doing a short run, overfitting a tiny sample set, or running a smoke test, consider sampling with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-yaml&#34; data-lang=&#34;yaml&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nt&#34;&gt;model.use_ema&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;:&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;kc&#34;&gt;false&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Alternatively, use an &lt;code&gt;ema_epoch_*.pth&lt;/code&gt; saved after enough training. This detail is useful because otherwise it is easy to assume the model is broken, when in fact the EMA has not stabilized yet.&lt;/p&gt;
&lt;h2 id=&#34;training-flow&#34;&gt;Training Flow
&lt;/h2&gt;&lt;p&gt;The official &lt;code&gt;TRAINING.md&lt;/code&gt; divides training into two steps.&lt;/p&gt;
&lt;p&gt;The first step is feature extraction.&lt;/p&gt;
&lt;p&gt;T2A extracts only CLIP text features. VT2A extracts CLIP frame features, Synchformer features, and CLIP text features. An example CSV looks like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-csv&#34; data-lang=&#34;csv&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;id&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;audio_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;video_path&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;caption&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;s&#34;&gt;sample1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/abs/or/relative/wav/sample1.wav&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;/abs/or/relative/video/sample1.mp4&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;a whistling rocket explodes&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Videos must be at least &lt;code&gt;extraction.duration_sec&lt;/code&gt; long, which defaults to 8 seconds. Shorter clips are skipped. Feature extraction can be run with:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/extract_t2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/extract_vt2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;For more GPUs or a custom config:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;NPROC_PER_NODE&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;m&#34;&gt;4&lt;/span&gt; bash scripts/launch/extract_vt2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nv&#34;&gt;CONFIG_PATH&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;path/to/your_extract.yaml bash scripts/launch/extract_t2a.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The second step is training.&lt;/p&gt;
&lt;p&gt;For single-node multi-GPU training:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;bash scripts/launch/train_single_node.sh
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Multi-node training requires &lt;code&gt;NNODES&lt;/code&gt;, &lt;code&gt;NODE_RANK&lt;/code&gt;, &lt;code&gt;MASTER_ADDR&lt;/code&gt;, &lt;code&gt;MASTER_PORT&lt;/code&gt;, and &lt;code&gt;NPROC_PER_NODE&lt;/code&gt;. Training outputs include &lt;code&gt;checkpoint_latest.pth&lt;/code&gt;, &lt;code&gt;checkpoint_epoch_*.pth&lt;/code&gt;, &lt;code&gt;ema_epoch_*.pth&lt;/code&gt;, generated audio samples, and &lt;code&gt;training.log&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Training resumes automatically: if &lt;code&gt;checkpoint_latest.pth&lt;/code&gt; exists in the experiment directory, training continues from it.&lt;/p&gt;
&lt;h2 id=&#34;who-should-pay-attention&#34;&gt;Who Should Pay Attention
&lt;/h2&gt;&lt;p&gt;WavFlow is more relevant to researchers and engineering teams than to ordinary users who want a finished sound-effect tool.&lt;/p&gt;
&lt;p&gt;It is worth following if you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Research video-to-audio, text-to-audio, or multimodal audio generation.&lt;/li&gt;
&lt;li&gt;Want to compare raw waveform generation with latent-based audio generation.&lt;/li&gt;
&lt;li&gt;Need to train your own audio generation model and can prepare data plus GPU resources.&lt;/li&gt;
&lt;li&gt;Work on applications that require strong synchronization between video and sound.&lt;/li&gt;
&lt;li&gt;Want to explore whether flow matching is viable on raw audio waveforms.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you just want a web tool where you type a prompt and immediately get a sound effect, WavFlow is not the easiest option today. It does not yet provide a public production checkpoint, and its deployment path is closer to research code.&lt;/p&gt;
&lt;h2 id=&#34;things-to-watch&#34;&gt;Things to Watch
&lt;/h2&gt;&lt;p&gt;First, do not treat it as a downloadable, ready-to-use audio generation model. The official project currently does not release production-trained checkpoints. Before real inference, you need to train your own model or wait for a future open-data checkpoint.&lt;/p&gt;
&lt;p&gt;Second, the license is not a permissive commercial default. The README says most of WavFlow is licensed under CC-BY-NC 4.0, while some vendored components keep their original licenses, including MIT, Apache 2.0, CC BY-NC 4.0, and Stability AI Community License. Read &lt;code&gt;LICENSE&lt;/code&gt; and &lt;code&gt;NOTICE.txt&lt;/code&gt; carefully before commercial use.&lt;/p&gt;
&lt;p&gt;Third, training data is critical. WavFlow&amp;rsquo;s promise depends on aligned audio, video, and text data. If data quality is poor, captions are inaccurate, or audio and video are out of sync, the model will struggle to learn stable sound generation.&lt;/p&gt;
&lt;p&gt;Fourth, raw waveform generation may reduce the latent bottleneck, but it may also increase training and inference cost. Real projects still need to balance audio quality, speed, VRAM, sample rate, and output duration.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The value of WavFlow is that it asks a clear question: does multimodal audio generation have to compress audio into latent space first?&lt;/p&gt;
&lt;p&gt;With waveform patchifying, amplitude lifting, and flow matching, it tries to generate synchronized high-fidelity audio directly in raw waveform space. The evaluation suggests that this route can at least stand in the same range as mature latent-based methods.&lt;/p&gt;
&lt;p&gt;For now, though, it is more of a research and training framework than an out-of-the-box product model. No public production checkpoint, a mostly non-commercial license, and the need for aligned audio-video-text data all make it better suited to research, reproduction, and further training. If you care about the next generation of video-to-audio or text-to-audio models, WavFlow is worth a serious look.&lt;/p&gt;
&lt;h2 id=&#34;references&#34;&gt;References
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;facebookresearch/WavFlow: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow Project Page: &lt;a class=&#34;link&#34; href=&#34;https://facebookresearch.github.io/WavFlow/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://facebookresearch.github.io/WavFlow/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow arXiv: &lt;a class=&#34;link&#34; href=&#34;https://arxiv.org/abs/2605.18749&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://arxiv.org/abs/2605.18749&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;WavFlow Training Guide: &lt;a class=&#34;link&#34; href=&#34;https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Gemini 3.5 Flash positioning and strengths: why it fits high-frequency, multimodal, low-latency use cases</title>
        <link>https://knightli.com/en/2026/05/24/gemini-35-flash-positioning-advantages-low-latency-multimodal/</link>
        <pubDate>Sun, 24 May 2026 08:43:24 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/24/gemini-35-flash-positioning-advantages-low-latency-multimodal/</guid>
        <description>&lt;p&gt;The keywords for &lt;code&gt;Gemini 3.5 Flash&lt;/code&gt; are not &amp;ldquo;the strongest,&amp;rdquo; but &amp;ldquo;high-frequency, fast, cost-efficient, and easy to integrate.&amp;rdquo; It is more like the workhorse model in the Gemini family: it may not be the model you use for the hardest reasoning tasks, but it is well suited for real production workloads such as Q&amp;amp;A, summarization, customer support, content processing, multimodal understanding, lightweight coding assistance, and automated workflows.&lt;/p&gt;
&lt;p&gt;The key to understanding Flash is not to treat it as a replacement for a Pro-class flagship model. It is better understood as a model tier optimized for throughput and response speed. For developers and enterprises, the real cost of many AI applications is not only the strongest single response, but the latency, stability, price, and context-handling ability across thousands or millions of daily requests.&lt;/p&gt;
&lt;h2 id=&#34;product-positioning&#34;&gt;Product positioning
&lt;/h2&gt;&lt;p&gt;The Gemini family usually separates models into different tiers. Flagship models handle more complex reasoning, planning, and difficult tasks. Flash models emphasize speed, cost, and large-scale invocation.&lt;/p&gt;
&lt;p&gt;The positioning of &lt;code&gt;Gemini 3.5 Flash&lt;/code&gt; can be summarized as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More suitable than Pro for high-frequency calls.&lt;/li&gt;
&lt;li&gt;More capable than tiny lightweight models for complex input.&lt;/li&gt;
&lt;li&gt;Optimized for low latency and high throughput.&lt;/li&gt;
&lt;li&gt;Suitable for multimodal input and long-context processing.&lt;/li&gt;
&lt;li&gt;Better as the default model inside applications, not only as a model for rare difficult requests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This type of model is best for tasks that run many times every day. Its value is not just answer quality in one call, but whether it can reliably process large amounts of text, images, audio, video, or structured information at manageable cost.&lt;/p&gt;
&lt;h2 id=&#34;why-flash-matters&#34;&gt;Why Flash matters
&lt;/h2&gt;&lt;p&gt;When AI products move into production, a practical issue appears: the strongest model is useful, but not every request deserves the strongest model.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A user asks an ordinary customer-support question.&lt;/li&gt;
&lt;li&gt;A system summarizes a meeting transcript.&lt;/li&gt;
&lt;li&gt;A backend classifies a batch of tickets.&lt;/li&gt;
&lt;li&gt;An app explains an uploaded image.&lt;/li&gt;
&lt;li&gt;An automation extracts fields from an email.&lt;/li&gt;
&lt;li&gt;An agent reads a set of documents before deciding the next step.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tasks need models that are reliable, cheap, and fast, but they do not always require the full reasoning power of a flagship model. That is where Flash matters: it puts &amp;ldquo;strong enough&amp;rdquo; and &amp;ldquo;fast enough&amp;rdquo; in the same place.&lt;/p&gt;
&lt;p&gt;If an AI application serves many users, the default model cannot be chosen only by peak capability. Average request cost, response speed, concurrency, and failure rate matter just as much. Flash is an application-layer model for that reality.&lt;/p&gt;
&lt;h2 id=&#34;advantage-1-low-latency-and-high-throughput&#34;&gt;Advantage 1: low latency and high throughput
&lt;/h2&gt;&lt;p&gt;The most direct advantage of Flash is speed.&lt;/p&gt;
&lt;p&gt;For chat products, retrieval-augmented search, support bots, real-time writing assistance, and agent workflows, latency directly affects user experience. Users may not know model parameters or benchmark results, but they immediately feel whether the product keeps them waiting.&lt;/p&gt;
&lt;p&gt;Low latency brings several benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Conversations feel more real-time.&lt;/li&gt;
&lt;li&gt;Multi-step tool calls do not slow down as much.&lt;/li&gt;
&lt;li&gt;Agents can make intermediate decisions more often.&lt;/li&gt;
&lt;li&gt;Backend batch processing finishes faster.&lt;/li&gt;
&lt;li&gt;Product teams can place AI features into more small workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This matters especially for agent applications. A model does not answer only once; it repeatedly judges, calls tools, reads context, and generates the next action. Lower single-call latency improves the whole chain.&lt;/p&gt;
&lt;h2 id=&#34;advantage-2-better-cost-for-scale&#34;&gt;Advantage 2: better cost for scale
&lt;/h2&gt;&lt;p&gt;Another core value of Flash is cost.&lt;/p&gt;
&lt;p&gt;When enterprises and developers put AI applications into production, they usually care about three questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How much does each call cost?&lt;/li&gt;
&lt;li&gt;How many calls happen per day?&lt;/li&gt;
&lt;li&gt;Are cost and latency controllable at peak concurrency?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If a task runs hundreds of thousands of times per day, even a small per-call price gap becomes large over time. Flash-style models are designed so that most requests do not have to go directly to the most expensive and heaviest model.&lt;/p&gt;
&lt;p&gt;A common pattern is tiered routing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Ordinary requests go to Flash by default.&lt;/li&gt;
&lt;li&gt;Difficult problems, complex planning, and long-chain reasoning escalate to Pro.&lt;/li&gt;
&lt;li&gt;Simple classification or fixed-format extraction can go to even lighter models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This lets an AI system keep high-end capability while controlling everyday cost.&lt;/p&gt;
&lt;h2 id=&#34;advantage-3-multimodal-input-fits-real-applications&#34;&gt;Advantage 3: multimodal input fits real applications
&lt;/h2&gt;&lt;p&gt;The Gemini family has long emphasized multimodal capability. Flash is valuable because it is not only for text requests; it can also handle images, audio, video, documents, and related inputs.&lt;/p&gt;
&lt;p&gt;That matters in real products. Business data is often not pure text:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Users upload screenshots for support.&lt;/li&gt;
&lt;li&gt;Customer support needs to understand a photo of a problem.&lt;/li&gt;
&lt;li&gt;Education products process images of exercises.&lt;/li&gt;
&lt;li&gt;Content platforms analyze video clips.&lt;/li&gt;
&lt;li&gt;Office workflows read PDFs, spreadsheets, and presentations.&lt;/li&gt;
&lt;li&gt;E-commerce products analyze product images and user descriptions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If multimodal understanding depends only on expensive flagship models, many high-frequency scenarios are hard to scale. Flash brings multimodal understanding into a model tier better suited for large-scale invocation.&lt;/p&gt;
&lt;h2 id=&#34;advantage-4-long-context-makes-it-good-at-reading-material&#34;&gt;Advantage 4: long context makes it good at reading material
&lt;/h2&gt;&lt;p&gt;Long context is an important Gemini-family capability. For Flash, long context is not simply about stuffing everything into the prompt; it lets the model handle more information-organization tasks.&lt;/p&gt;
&lt;p&gt;Examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summarizing long documents.&lt;/li&gt;
&lt;li&gt;Reading product manuals.&lt;/li&gt;
&lt;li&gt;Analyzing meeting notes.&lt;/li&gt;
&lt;li&gt;Organizing multi-page PDFs.&lt;/li&gt;
&lt;li&gt;Comparing contracts or proposals.&lt;/li&gt;
&lt;li&gt;Providing agents with large task backgrounds.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Long context combined with lower cost is well suited for workflows that first read a lot of material and then produce actionable results. Flash does not need to solve extremely hard reasoning tasks every time. It can include more context in one pass, which is useful for office work, customer support, knowledge bases, and developer assistance.&lt;/p&gt;
&lt;h2 id=&#34;advantage-5-suitable-as-a-default-model&#34;&gt;Advantage 5: suitable as a default model
&lt;/h2&gt;&lt;p&gt;Many AI products need a &amp;ldquo;default model.&amp;rdquo; It does not have to be the most expensive or strongest, but it must satisfy several conditions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Stable quality on most questions.&lt;/li&gt;
&lt;li&gt;Fast response.&lt;/li&gt;
&lt;li&gt;Manageable cost.&lt;/li&gt;
&lt;li&gt;Ability to handle multimodal input.&lt;/li&gt;
&lt;li&gt;Sufficient long-context support.&lt;/li&gt;
&lt;li&gt;Easy API and product integration.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is where &lt;code&gt;Gemini 3.5 Flash&lt;/code&gt; has an advantage. It is suitable as the default entry point: handle most requests first, and route complex tasks to stronger models when needed.&lt;/p&gt;
&lt;p&gt;This pattern will become increasingly common. Future AI systems will not simply &amp;ldquo;choose one model&amp;rdquo;; they will use Flash as the workhorse, Pro as the escalation path, and smaller models for edge tasks.&lt;/p&gt;
&lt;h2 id=&#34;suitable-scenarios&#34;&gt;Suitable scenarios
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;Gemini 3.5 Flash&lt;/code&gt; is well suited for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Customer-support Q&amp;amp;A and answers after knowledge-base retrieval.&lt;/li&gt;
&lt;li&gt;Long-document summaries, report organization, and meeting notes.&lt;/li&gt;
&lt;li&gt;Multimodal understanding of images, screenshots, PDFs, and video clips.&lt;/li&gt;
&lt;li&gt;Real-time AI assistants inside apps.&lt;/li&gt;
&lt;li&gt;Content moderation, classification, and tag generation.&lt;/li&gt;
&lt;li&gt;Information extraction from emails, tickets, and forms.&lt;/li&gt;
&lt;li&gt;Intermediate decisions and context compression in agent workflows.&lt;/li&gt;
&lt;li&gt;Code explanation, lightweight fix suggestions, and documentation generation.&lt;/li&gt;
&lt;li&gt;Education products for exercise explanation and study assistance.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These scenarios share the same traits: high request volume, sensitivity to user wait time, complex input types, and no need for flagship-level deep reasoning every time.&lt;/p&gt;
&lt;h2 id=&#34;where-flash-should-not-be-the-only-model&#34;&gt;Where Flash should not be the only model
&lt;/h2&gt;&lt;p&gt;Flash is not universal. It is optimized for high-frequency and low-latency use, but that does not mean every problem should use only Flash.&lt;/p&gt;
&lt;p&gt;The following scenarios still fit stronger Pro-class models better, or at least require tiered routing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complex mathematics and rigorous proofs.&lt;/li&gt;
&lt;li&gt;Long-chain planning and multi-step strategic reasoning.&lt;/li&gt;
&lt;li&gt;High-risk legal, medical, or financial judgment.&lt;/li&gt;
&lt;li&gt;Deep refactoring plans for large codebases.&lt;/li&gt;
&lt;li&gt;Complex agent tasks requiring high reliability.&lt;/li&gt;
&lt;li&gt;Professional reports with extremely low tolerance for hallucination.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A safer strategy is to let Flash handle, judge, and organize first; when task complexity rises, escalate to a stronger model.&lt;/p&gt;
&lt;h2 id=&#34;relationship-with-pro-class-models&#34;&gt;Relationship with Pro-class models
&lt;/h2&gt;&lt;p&gt;Flash and Pro should not be understood as &amp;ldquo;which one replaces the other.&amp;rdquo; They have different jobs.&lt;/p&gt;
&lt;p&gt;Flash is the everyday workhorse:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Fast.&lt;/li&gt;
&lt;li&gt;Cost-friendly.&lt;/li&gt;
&lt;li&gt;Suitable for high concurrency.&lt;/li&gt;
&lt;li&gt;Good for multimodal and long-context applications.&lt;/li&gt;
&lt;li&gt;Suitable for default product flows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Pro is the hard-task model:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Better for complex reasoning.&lt;/li&gt;
&lt;li&gt;Better for difficult planning.&lt;/li&gt;
&lt;li&gt;Better for high-value requests.&lt;/li&gt;
&lt;li&gt;Better for small numbers of important deep-analysis tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Good AI products usually combine the two instead of choosing only one.&lt;/p&gt;
&lt;h2 id=&#34;how-developers-should-use-it&#34;&gt;How developers should use it
&lt;/h2&gt;&lt;p&gt;If you want to integrate Gemini 3.5 Flash into a product, consider these patterns:&lt;/p&gt;
&lt;p&gt;First, use it as the default model. Most ordinary requests go to Flash first, giving both speed and cost control.&lt;/p&gt;
&lt;p&gt;Second, design model routing. When Flash identifies a task as complex, high-risk, or requiring deep reasoning, escalate to Pro.&lt;/p&gt;
&lt;p&gt;Third, use it for context compression. Before an agent executes a task, Flash can summarize documents, extract key facts, and generate structured context.&lt;/p&gt;
&lt;p&gt;Fourth, make multimodal input part of the normal workflow. Images, screenshots, PDFs, audio, and video should not only be edge features; they can become default input types.&lt;/p&gt;
&lt;p&gt;Fifth, evaluate with your own data. Do not rely only on official benchmarks. Test with your support questions, documents, code, images, and business workflows to decide which tasks Flash handles well and which need escalation.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The core positioning of &lt;code&gt;Gemini 3.5 Flash&lt;/code&gt; is a multimodal workhorse model for high-frequency real applications. Its advantage is not replacing Pro-class flagship models, but placing speed, cost, long context, and multimodal ability into a tier better suited for large-scale invocation.&lt;/p&gt;
&lt;p&gt;For developers, the most important part of Flash is not a single benchmark, but a product architecture shift: the default model can be faster, cheaper, and better at reading complex inputs; harder tasks can still escalate to stronger models. This keeps user experience good while controlling cost.&lt;/p&gt;
&lt;p&gt;If Pro is the heavy tool for difficult problems, Flash is the main tool running on the production line every day. In real AI products, the latter is often what users experience most.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google official blog: &lt;a class=&#34;link&#34; href=&#34;https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Google DeepMind Gemini Flash: &lt;a class=&#34;link&#34; href=&#34;https://deepmind.google/en/models/gemini/flash/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://deepmind.google/en/models/gemini/flash/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;User-provided Zhihu discussion link: &lt;a class=&#34;link&#34; href=&#34;https://www.zhihu.com/question/2040529179641385344/answer/2040531897613285214&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.zhihu.com/question/2040529179641385344/answer/2040531897613285214&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters</title>
        <link>https://knightli.com/en/2026/05/22/rtx-3070-8gb-qwen36-35b-llama-cpp-local-deployment/</link>
        <pubDate>Fri, 22 May 2026 22:44:16 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/22/rtx-3070-8gb-qwen36-35b-llama-cpp-local-deployment/</guid>
        <description>&lt;p&gt;Whether an 8GB GPU can run a 35B-class model depends on more than the total parameter count. Model architecture, quantization format, and the way the inference framework schedules work all matter.&lt;/p&gt;
&lt;p&gt;The core idea in this setup is to use a GGUF quantized version of an MoE model such as Qwen3.6-35B-A3B, then use llama.cpp with CUDA acceleration, CPU Offload, MoE parameter scheduling, and KV Cache quantization to split memory pressure between the GPU and system RAM. With that approach, an older GPU such as the RTX 3070 8GB can still have a chance to run a 35B-class local multimodal model.&lt;/p&gt;
&lt;p&gt;One point needs to be clear first: this is not &amp;ldquo;fitting a full 35B model entirely into 8GB of VRAM.&amp;rdquo; A more accurate way to understand it is that the GPU handles the compute that benefits most from GPU acceleration, while some expert layers and cache pressure are carried by system memory. The real experience depends on RAM capacity, CPU performance, quantization format, context length, and parameter choices.&lt;/p&gt;
&lt;h2 id=&#34;test-environment&#34;&gt;Test environment
&lt;/h2&gt;&lt;p&gt;This kind of setup is sensitive to system memory. A reference configuration is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;CPU: Intel Core i7-12700 class&lt;/li&gt;
&lt;li&gt;GPU: NVIDIA RTX 3070 8GB&lt;/li&gt;
&lt;li&gt;RAM: 64GB&lt;/li&gt;
&lt;li&gt;OS: Windows 11&lt;/li&gt;
&lt;li&gt;Inference framework: llama.cpp CUDA build&lt;/li&gt;
&lt;li&gt;Model format: GGUF&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you only have 16GB or 32GB of RAM, it is not necessarily impossible to try, but a 35B MoE model is more likely to create memory pressure during loading and long-context inference. For stable use, 64GB of RAM is a safer target.&lt;/p&gt;
&lt;h2 id=&#34;why-8gb-vram-can-still-run-a-35b-model&#34;&gt;Why 8GB VRAM can still run a 35B model
&lt;/h2&gt;&lt;p&gt;The key to Qwen3.6-35B-A3B is its MoE architecture. Its total parameter scale is 35B, but not all parameters are activated during each inference step; only part of the expert parameters are active.&lt;/p&gt;
&lt;p&gt;That leads to two consequences:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The full model file is still large and requires enough disk space and system memory.&lt;/li&gt;
&lt;li&gt;The active compute per inference step is lower than a full 35B Dense model.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;llama.cpp&amp;rsquo;s CPU Offload and MoE-related parameters can further reduce the VRAM threshold. The GPU mainly handles attention and some high-value compute, while the CPU and system memory carry part of the expert-layer weights. The tradeoff is that speed, response latency, and stability depend more on the whole machine, not only the GPU model.&lt;/p&gt;
&lt;h2 id=&#34;preparing-llamacpp&#34;&gt;Preparing llama.cpp
&lt;/h2&gt;&lt;p&gt;Windows users can download a prebuilt CUDA version of llama.cpp directly. Pay attention to three points:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The GPU driver should be new enough, and the CUDA runtime should match the llama.cpp package you download.&lt;/li&gt;
&lt;li&gt;After downloading, place it in a path without Chinese characters or special characters so batch scripts are easier to run.&lt;/li&gt;
&lt;li&gt;Put model files under a unified &lt;code&gt;models&lt;/code&gt; directory to avoid very long paths in commands.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you use AMD, Intel graphics, or a CPU-only environment, you can also choose Vulkan, HIP, SYCL, or CPU builds, but the parameters and performance will be different. This article focuses on the CUDA route for NVIDIA GPUs.&lt;/p&gt;
&lt;h2 id=&#34;download-the-model-and-multimodal-projection-file&#34;&gt;Download the model and multimodal projection file
&lt;/h2&gt;&lt;p&gt;The model used here is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen3.6-35B-A3B-UD-Q4_K_M.gguf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;Q4_K_M&lt;/code&gt; quantization format is chosen mainly to balance accuracy, file size, and speed. On low-VRAM machines, it is not a good idea to start with a higher-precision version, because loading failures or frequent system paging become much more likely.&lt;/p&gt;
&lt;p&gt;If you want image understanding, you also need the multimodal projection file, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;mmproj-BF16.gguf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This file is important. Downloading only the main model usually gives you text inference only. Without &lt;code&gt;mmproj&lt;/code&gt;, the web UI may not expose a usable image upload feature, or uploaded images may not be processed correctly.&lt;/p&gt;
&lt;p&gt;Keep the directory structure simple:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama.cpp/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;├─ llama-server.exe
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;└─ models/
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   ├─ Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;   └─ mmproj-BF16.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;rtx-3070-8gb-startup-parameters&#34;&gt;RTX 3070 8GB startup parameters
&lt;/h2&gt;&lt;p&gt;Below is an example startup script for an RTX 3070 8GB. Change the path to your own llama.cpp directory.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;13
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;14
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;15
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;16
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;17
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;18
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;19
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;20
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;21
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;22
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bat&#34; data-lang=&#34;bat&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;@&lt;/span&gt;&lt;span class=&#34;k&#34;&gt;echo&lt;/span&gt; off
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;chcp 65001 &lt;span class=&#34;p&#34;&gt;&amp;gt;&lt;/span&gt;nul
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;cd&lt;/span&gt; /d D:\AI\llama.cpp
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;llama-server.exe &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -m &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --mmproj &lt;span class=&#34;s2&#34;&gt;&amp;#34;models\mmproj-BF16.gguf&amp;#34;&lt;/span&gt; &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -ngl 99 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --n-cpu-moe 999 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --flash-attn on &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --jinja &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -c 32768 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -t 12 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -b 512 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; -ub 128 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --cache-type-k q4_0 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --cache-type-v q4_0 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --mlock &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --host 127.0.0.1 &lt;span class=&#34;se&#34;&gt;^
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;se&#34;&gt; &lt;/span&gt; --port 8080
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;k&#34;&gt;pause&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;After startup, open this address in your browser:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;http://127.0.0.1:8080
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If the page opens and the model replies normally, the service has started successfully. The first model load can be slow. Avoid launching multiple instances repeatedly during loading, because that can fill system memory more easily.&lt;/p&gt;
&lt;h2 id=&#34;understanding-the-key-parameters&#34;&gt;Understanding the key parameters
&lt;/h2&gt;&lt;p&gt;&lt;code&gt;-ngl 99&lt;/code&gt; tries to place as many layers as possible on the GPU. How many layers actually fit depends on the model structure, quantization format, and VRAM usage.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--n-cpu-moe 999&lt;/code&gt; pushes more MoE expert layers to the CPU side, reducing VRAM pressure. It is one of the key parameters for running large MoE models on low-VRAM hardware.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--flash-attn on&lt;/code&gt; enables Flash Attention, which can reduce the cost of attention computation. Whether it is available depends on the current llama.cpp version and GPU support.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-c 32768&lt;/code&gt; sets the context length. Long context significantly increases KV Cache pressure. If startup fails or inference is very slow, try lowering it to &lt;code&gt;8192&lt;/code&gt; or &lt;code&gt;16384&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;--cache-type-k q4_0&lt;/code&gt; and &lt;code&gt;--cache-type-v q4_0&lt;/code&gt; quantize the KV Cache, saving memory and VRAM, though they may have a small impact on output quality and speed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;-b 512&lt;/code&gt; and &lt;code&gt;-ub 128&lt;/code&gt; control batching-related parameters. In a low-VRAM environment, do not start with overly aggressive batch settings.&lt;/p&gt;
&lt;h2 id=&#34;common-issues&#34;&gt;Common issues
&lt;/h2&gt;&lt;p&gt;If startup reports insufficient VRAM, first reduce the context length, for example changing &lt;code&gt;-c 32768&lt;/code&gt; to &lt;code&gt;-c 8192&lt;/code&gt;, then try lowering &lt;code&gt;-b&lt;/code&gt; and &lt;code&gt;-ub&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If the image upload button is unavailable, first check whether the &lt;code&gt;--mmproj&lt;/code&gt; path is correct and whether the &lt;code&gt;mmproj&lt;/code&gt; file matches the model.&lt;/p&gt;
&lt;p&gt;If the model responds slowly after loading, it usually does not mean the GPU is idle. Large amounts of weights or expert layers may be handled by the CPU and system memory. Use Task Manager to observe GPU, CPU, memory, and disk usage to identify the bottleneck.&lt;/p&gt;
&lt;p&gt;If the output format looks wrong, confirm that &lt;code&gt;--jinja&lt;/code&gt; is enabled and check whether the model requires the corresponding chat template.&lt;/p&gt;
&lt;p&gt;If the browser cannot open the service after startup, check the &lt;code&gt;--host&lt;/code&gt; and &lt;code&gt;--port&lt;/code&gt; settings, and make sure port 8080 is not occupied by another program.&lt;/p&gt;
&lt;h2 id=&#34;who-should-try-this&#34;&gt;Who should try this
&lt;/h2&gt;&lt;p&gt;This setup is suitable for users who already have 8GB VRAM devices such as RTX 3070, RTX 4060 Laptop, or RTX 3060 8GB, but want to experiment with larger MoE models.&lt;/p&gt;
&lt;p&gt;It is not suitable for people who need maximum speed. Running a 35B MoE model on low VRAM essentially trades CPU and system memory for a lower VRAM requirement. Being able to run it is one thing; whether it feels smooth enough is another.&lt;/p&gt;
&lt;p&gt;If your goal is high-frequency daily chatting, 7B, 8B, or 14B models may feel better. If your goal is to explore larger MoE models, multimodal capability, and the boundary of local deployment, an RTX 3070 8GB with 64GB of RAM is still worth trying.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;The reason an RTX 3070 8GB can run Qwen3.6-35B-A3B is not that the GPU suddenly has more VRAM. It is the combination of MoE architecture, GGUF quantization, llama.cpp CPU Offload, and KV Cache optimization that lowers the threshold.&lt;/p&gt;
&lt;p&gt;The most interesting part of this setup is that it lets older GPUs still participate in local large-model experiments. As long as you accept tradeoffs in speed and stability, an 8GB VRAM machine can still be a local AI model testing platform, not only an entry-level device for small models.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Original article: &lt;a class=&#34;link&#34; href=&#34;https://www.freedidi.com/24267.html&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://www.freedidi.com/24267.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>What Is Gemini Omni? A Complete Look at Google&#39;s AI Video Multi-Turn Editing Model</title>
        <link>https://knightli.com/en/2026/05/20/google-gemini-omni-video-editing/</link>
        <pubDate>Wed, 20 May 2026 23:11:58 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/20/google-gemini-omni-video-editing/</guid>
        <description>&lt;p&gt;Google DeepMind has published a page for &lt;code&gt;Gemini Omni&lt;/code&gt;. Its positioning is direct: create content from any input, with the current focus starting from video.&lt;/p&gt;
&lt;p&gt;If Nano Banana is more about image generation and editing, Gemini Omni feels more like a multimodal editing model for video. Users can modify a video step by step with natural language, with each later change building on the previous one, while trying to keep scenes, people, actions, and visual logic consistent.&lt;/p&gt;
&lt;p&gt;Project page: &lt;a class=&#34;link&#34; href=&#34;https://deepmind.google/models/gemini-omni/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://deepmind.google/models/gemini-omni/&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-core-problem-it-tries-to-solve&#34;&gt;The Core Problem It Tries to Solve
&lt;/h2&gt;&lt;p&gt;Traditional video editing often requires timelines, layers, masks, keyframes, color grading, audio tracks, and a lot of manual work. AI video generation tools can already create clips from prompts, but they often run into two problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A generated result is hard to refine precisely.&lt;/li&gt;
&lt;li&gt;During multi-turn edits, characters, scenes, styles, and actions can drift.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Gemini Omni is aimed at the second step: not just generating a video, but letting users keep asking for changes as if they were talking to an editor.&lt;/p&gt;
&lt;p&gt;The project page describes it as a way to edit any video through natural, step-by-step conversation. Each edit builds on the prior result, with the goal of maintaining a coherent and unified scene.&lt;/p&gt;
&lt;h2 id=&#34;main-capabilities&#34;&gt;Main Capabilities
&lt;/h2&gt;&lt;p&gt;Gemini Omni&amp;rsquo;s capabilities can be grouped into several areas.&lt;/p&gt;
&lt;p&gt;The first is natural-language video editing. Users can directly ask the model to change a video&amp;rsquo;s aesthetic style, motion, or effects. For example, it can make a mirror ripple like liquid, turn a person into line art, a felt toy, or a transparent holographic wireframe, or transform an entire environment into 3D voxel art.&lt;/p&gt;
&lt;p&gt;The second is action reconstruction. It can change what happens in a video, such as enlarging a hand-formed hole, making a toy produce the corresponding animal sound, or making building lights react to music.&lt;/p&gt;
&lt;p&gt;The third is editing real video based on reference images. Users can provide an image reference and ask the model to place a building, sun, aircraft, or other object into a real video scene.&lt;/p&gt;
&lt;p&gt;The fourth is maintaining consistency across multi-turn edits. The page shows a continuous editing flow: moving a violinist into a reference-image environment, removing the violin, and then changing the shot to an over-the-shoulder angle. This is closer to an actual creative process than a one-shot prompt.&lt;/p&gt;
&lt;p&gt;The fifth is multi-input reference. Gemini Omni can combine image, text, video, and audio inputs into one output, supporting tasks such as style transfer, motion transfer, character replacement, and sketch-to-video generation.&lt;/p&gt;
&lt;h2 id=&#34;why-it-emphasizes-world-knowledge&#34;&gt;Why It Emphasizes World Knowledge
&lt;/h2&gt;&lt;p&gt;Google repeatedly emphasizes that Gemini Omni is not only about making visuals look realistic. It also uses Gemini&amp;rsquo;s world knowledge, physical intuition, history, science, and narrative logic.&lt;/p&gt;
&lt;p&gt;That matters. If a video model only optimizes for visual quality, it can easily produce illogical motion, confused object relationships, or mismatches between text and image. Gemini Omni&amp;rsquo;s goal is for video to look right while also being more coherent in story, physics, and meaning.&lt;/p&gt;
&lt;p&gt;Examples on the page include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A marble rolling through a chain-reaction track.&lt;/li&gt;
&lt;li&gt;A claymation explanation of protein folding.&lt;/li&gt;
&lt;li&gt;A stop-motion style explanation of how the hippocampus works.&lt;/li&gt;
&lt;li&gt;Letters appearing in sync with objects in the scene.&lt;/li&gt;
&lt;li&gt;On-screen words appearing one by one to the rhythm.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These examples suggest that Gemini Omni is not just a short-video effects tool. It tries to combine knowledge expression, storytelling, and audiovisual generation.&lt;/p&gt;
&lt;h2 id=&#34;how-it-relates-to-veo-flow-and-nano-banana&#34;&gt;How It Relates to Veo, Flow, and Nano Banana
&lt;/h2&gt;&lt;p&gt;In Google&amp;rsquo;s current product lineup, Gemini Omni looks like a layer for multimodal creation and editing.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Veo&lt;/code&gt; is more focused on the video generation model itself, emphasizing cinematic video and audio generation. &lt;code&gt;Google Flow&lt;/code&gt; is an AI creative studio for creators, suitable for organizing shots, assets, and video projects. &lt;code&gt;Nano Banana&lt;/code&gt; is more focused on image creation and detailed editing. Gemini Omni emphasizes multimodal editing from any input to a consistent output, especially multi-turn natural-language control for video.&lt;/p&gt;
&lt;p&gt;A simple way to understand it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;To generate high-quality video, watch Veo.&lt;/li&gt;
&lt;li&gt;To organize video projects in a creative workflow, watch Google Flow.&lt;/li&gt;
&lt;li&gt;To edit images, watch Nano Banana.&lt;/li&gt;
&lt;li&gt;To modify video conversationally while referencing images, text, video, and audio, watch Gemini Omni.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;access-points&#34;&gt;Access Points
&lt;/h2&gt;&lt;p&gt;The page lists these access points:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gemini app.&lt;/li&gt;
&lt;li&gt;Google Flow.&lt;/li&gt;
&lt;li&gt;YouTube Shorts.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;However, it also notes that a Google AI subscription is required, and availability depends on subscription tier and region. In other words, not every user in every region can immediately access the full feature set.&lt;/p&gt;
&lt;p&gt;For creators, Google Flow may be the most important entry point because it is closer to a complete creative workspace. For general users, Gemini app and YouTube Shorts may be lower-friction ways to try it.&lt;/p&gt;
&lt;h2 id=&#34;safety-and-content-labels&#34;&gt;Safety and Content Labels
&lt;/h2&gt;&lt;p&gt;The Gemini Omni page specifically mentions safety work. Gemini Omni Flash was developed in collaboration with internal safety and responsibility teams, with automated evaluations, human evaluations, human red teaming, automated red teaming, and pre-launch ethics and safety reviews.&lt;/p&gt;
&lt;p&gt;For content transparency, the page says content created or edited with Omni in Gemini app, Google Flow, or YouTube will include imperceptible &lt;code&gt;SynthID&lt;/code&gt; digital watermarks and &lt;code&gt;C2PA Content Credentials&lt;/code&gt;. Users can verify content in Gemini app, with expansion to Chrome and Search planned later.&lt;/p&gt;
&lt;p&gt;This is especially important for video models. The more realistic video generation and editing becomes, the more important source labeling, abuse prevention, and verification tools become.&lt;/p&gt;
&lt;h2 id=&#34;who-it-is-for&#34;&gt;Who It Is For
&lt;/h2&gt;&lt;p&gt;Gemini Omni is suitable for several types of users:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Content creators who want to modify video quickly with natural language.&lt;/li&gt;
&lt;li&gt;Design teams that need to combine sketches, reference images, audio, and video assets into a finished clip.&lt;/li&gt;
&lt;li&gt;People making short videos, ad concepts, educational explainers, and product visual drafts.&lt;/li&gt;
&lt;li&gt;Creators building AI video workflows in Google Flow.&lt;/li&gt;
&lt;li&gt;Developers and researchers watching the boundaries of multimodal video editing.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But it is not ideal for every scenario. Serious commercial films, brand key visuals, film production, and product launch videos still require human review, copyright checks, fact-checking, and asset management. AI can clearly speed up concept generation and first-draft iteration, but it should not replace final review.&lt;/p&gt;
&lt;h2 id=&#34;how-to-read-gemini-omni&#34;&gt;How to Read Gemini Omni
&lt;/h2&gt;&lt;p&gt;The significance of Gemini Omni is that it moves AI video from &amp;ldquo;one-shot generation&amp;rdquo; toward &amp;ldquo;conversational editing.&amp;rdquo; That is closer to real creative workflows than simply improving visual quality.&lt;/p&gt;
&lt;p&gt;If it performs reliably in multi-turn editing, consistency, reference control, audio-video synchronization, and content labeling, the way people use AI video tools will change. Users will no longer only write one long prompt and hope for the best; they will revise scenes, actions, styles, and narratives step by step like directors, editors, and designers.&lt;/p&gt;
&lt;p&gt;What still needs to be observed is actual availability, pricing, regional limits, video length, resolution, copyright policy, and commercial-use rules. For ordinary creators, the most practical question is whether Gemini Omni can reliably handle multi-turn video editing inside Google Flow and Gemini app.&lt;/p&gt;
&lt;p&gt;References:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://deepmind.google/models/gemini-omni/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Google DeepMind: Gemini Omni&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        <item>
        <title>Let AI Operate Your Computer? UI-TARS-desktop Connects Desktop, Browser, and Tools</title>
        <link>https://knightli.com/en/2026/05/19/ui-tars-desktop-multimodal-ai-agent-stack/</link>
        <pubDate>Tue, 19 May 2026 10:56:50 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/05/19/ui-tars-desktop-multimodal-ai-agent-stack/</guid>
        <description>&lt;p&gt;&lt;code&gt;bytedance/UI-TARS-desktop&lt;/code&gt; is ByteDance&amp;rsquo;s open source multimodal AI agent project. It is not just a single desktop app, but an agent stack. The current README mainly contains two directions: &lt;code&gt;Agent TARS&lt;/code&gt; and &lt;code&gt;UI-TARS Desktop&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Project URL: &lt;a class=&#34;link&#34; href=&#34;https://github.com/bytedance/UI-TARS-desktop&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://github.com/bytedance/UI-TARS-desktop&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Official site: &lt;a class=&#34;link&#34; href=&#34;https://agent-tars.com&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;https://agent-tars.com&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;At the time of writing, the GitHub API showed about 34k stars, TypeScript as the main language, and an Apache-2.0 license. The README describes it as an &amp;ldquo;Open-Source Multimodal AI Agent Stack.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;difference-between-agent-tars-and-ui-tars-desktop&#34;&gt;Difference Between Agent TARS and UI-TARS Desktop
&lt;/h2&gt;&lt;p&gt;The README places the two projects in one comparison table:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Agent TARS&lt;/code&gt;: a general multimodal AI agent stack that connects GUI agents, vision, terminal, browser, and product workflows.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;UI-TARS Desktop&lt;/code&gt;: a desktop application based on UI-TARS models, providing native GUI agent capabilities for operating local or remote computers and browsers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Simply put, Agent TARS is more like a general agent runtime, while UI-TARS Desktop is the desktop GUI operation entry point.&lt;/p&gt;
&lt;h2 id=&#34;what-agent-tars-can-do&#34;&gt;What Agent TARS Can Do
&lt;/h2&gt;&lt;p&gt;Agent TARS mainly provides a CLI and Web UI. Its goal is to let multimodal models complete task flows closer to human operation through MCP and various tools.&lt;/p&gt;
&lt;p&gt;Core capabilities listed in the README include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One-command CLI startup, supporting headful Web UI and headless server.&lt;/li&gt;
&lt;li&gt;Hybrid browser agent control through GUI Agent, DOM, or mixed strategies.&lt;/li&gt;
&lt;li&gt;Event Stream for tracing and debugging data flows.&lt;/li&gt;
&lt;li&gt;MCP integration for mounting MCP Servers and real tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Quick start:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npx @agent-tars/cli@latest
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Global installation:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;npm install @agent-tars/cli@latest -g
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Run with a model provider:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;agent-tars --provider anthropic --model claude-3-7-sonnet-latest --apiKey your-api-key
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;h2 id=&#34;what-ui-tars-desktop-can-do&#34;&gt;What UI-TARS Desktop Can Do
&lt;/h2&gt;&lt;p&gt;UI-TARS Desktop is a desktop GUI Agent. Based on UI-TARS and Seed-1.5-VL / 1.6 model families, it focuses on letting the model understand the screen and execute mouse and keyboard operations.&lt;/p&gt;
&lt;p&gt;Capabilities listed in the README include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Natural language control.&lt;/li&gt;
&lt;li&gt;Screenshots and visual recognition.&lt;/li&gt;
&lt;li&gt;Precise mouse and keyboard control.&lt;/li&gt;
&lt;li&gt;Cross-platform support for Windows, macOS, and browsers.&lt;/li&gt;
&lt;li&gt;Real-time feedback and status display.&lt;/li&gt;
&lt;li&gt;Local processing with an emphasis on privacy and security.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example tasks include changing VS Code settings, checking GitHub issues, and operating remote computers or browsers.&lt;/p&gt;
&lt;h2 id=&#34;why-gui-agents-matter&#34;&gt;Why GUI Agents Matter
&lt;/h2&gt;&lt;p&gt;Traditional automation depends on APIs, DOM, or scripts. A GUI Agent starts from the interface: it sees buttons, input boxes, menus, and state, then operates through mouse and keyboard.&lt;/p&gt;
&lt;p&gt;This has two values. First, many applications do not have stable APIs, or APIs do not cover the full workflow. A GUI Agent can interact from the same surface a human uses.&lt;/p&gt;
&lt;p&gt;Second, multimodal models can handle screenshots, documents, web pages, and app interfaces, combining visual understanding with execution.&lt;/p&gt;
&lt;p&gt;The limitation is also clear. GUI operations are affected by resolution, language, layout changes, pop-ups, and network latency. Production workflows still need permission control, confirmation steps, and rollback plans.&lt;/p&gt;
&lt;h2 id=&#34;relationship-with-mcp&#34;&gt;Relationship With MCP
&lt;/h2&gt;&lt;p&gt;Agent TARS emphasizes MCP integration. MCP is useful because it gives agents a unified way to call browsers, files, command lines, databases, internal services, and other tools.&lt;/p&gt;
&lt;p&gt;For complex tasks, GUI clicking alone is not stable enough. A better pattern is often:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use APIs where APIs are available.&lt;/li&gt;
&lt;li&gt;Use vision when page state must be understood.&lt;/li&gt;
&lt;li&gt;Use browser control when real web interaction is needed.&lt;/li&gt;
&lt;li&gt;Use GUI Agent when local software must be operated.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Projects like UI-TARS-desktop are exploring how to place these capabilities in one agent stack.&lt;/p&gt;
&lt;h2 id=&#34;what-to-watch-out-for&#34;&gt;What To Watch Out For
&lt;/h2&gt;&lt;p&gt;First, desktop agents have execution risk. They can operate mouse, keyboard, and browser, so permissions must be limited to avoid accidental file changes, account operations, payment, or production system actions.&lt;/p&gt;
&lt;p&gt;Second, remote computer and remote browser control needs a clear security boundary. Do not expose unauthenticated control endpoints to the public internet.&lt;/p&gt;
&lt;p&gt;Third, multimodal models can misread interfaces. Critical operations should require human confirmation, especially delete, submit, pay, publish, trade, or other irreversible actions.&lt;/p&gt;
&lt;h2 id=&#34;who-it-is-for&#34;&gt;Who It Is For
&lt;/h2&gt;&lt;p&gt;UI-TARS-desktop is suitable for developers exploring GUI agents, teams building AI assistants for desktop workflows, and researchers comparing browser, DOM, MCP, and visual-control strategies. It is not a simple consumer assistant yet.&lt;/p&gt;
&lt;h2 id=&#34;summary&#34;&gt;Summary
&lt;/h2&gt;&lt;p&gt;UI-TARS-desktop is worth watching because it moves AI agents from &amp;ldquo;answering in chat&amp;rdquo; toward &amp;ldquo;seeing the screen and operating tools.&amp;rdquo; Its value is not only in desktop control, but in combining GUI, browser, terminal, and MCP capabilities in one stack.&lt;/p&gt;
</description>
        </item>
        <item>
        <title>OpenAI Introduces ChatGPT Images 2.0: Image Generation Starts Moving Toward Deliverable Output</title>
        <link>https://knightli.com/en/2026/04/22/openai-chatgpt-images-2-0-deliverable-image-generation/</link>
        <pubDate>Wed, 22 Apr 2026 14:21:45 +0800</pubDate>
        
        <guid>https://knightli.com/en/2026/04/22/openai-chatgpt-images-2-0-deliverable-image-generation/</guid>
        <description>&lt;p&gt;OpenAI published &lt;a class=&#34;link&#34; href=&#34;https://openai.com/index/introducing-chatgpt-images-2-0/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Introducing ChatGPT Images 2.0&lt;/a&gt; on April 21, 2026. Judging from the announcement page, the main point is not simply that the images look better. The bigger message is that image generation is moving toward something more controllable, more layout-aware, and more directly usable.&lt;/p&gt;
&lt;p&gt;If you look only at this launch page, it reads more like a dense capability showcase than a traditional technical announcement. There is very little about model architecture, training details, or benchmarks. Instead, OpenAI uses a large set of examples to answer a more practical question: can ChatGPT now handle more of the work that previously required repeated manual fixes for text, layout, and final polish?&lt;/p&gt;
&lt;h2 id=&#34;01-the-clearest-signals-in-this-release&#34;&gt;01 The clearest signals in this release
&lt;/h2&gt;&lt;p&gt;The most prominent phrases on the page already summarize the focus:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Greater precision and control&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Stronger across languages&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Stylistic sophistication and realism&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Taken together, those three ideas say a lot.&lt;/p&gt;
&lt;p&gt;First, the emphasis is shifting away from imagination alone and toward control. The page includes many examples such as posters, magazine spreads, promo pages, infographics, character sheets, comic pages, and print-ready bookmark designs. What these examples share is not just visual appeal. They require text handling, hierarchy, whitespace, composition, stylistic consistency, and format control at the same time. That suggests OpenAI is intentionally pushing the product from &amp;ldquo;generate an image&amp;rdquo; toward &amp;ldquo;generate a visual asset people can actually use.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Second, multilingual text rendering is being treated as a headline feature. The page includes multilingual posters, book covers, a Korean hospitality campaign, Japanese manga, and several typography-focused examples. That matters because one of the most persistent weak points in image models has been long text, complex layouts, and non-English scripts. OpenAI putting this front and center is itself a signal: text rendering and cross-language layout are now capabilities it believes are worth showcasing directly.&lt;/p&gt;
&lt;p&gt;Third, the stylistic range is very broad. The examples span photorealistic images, retro collage posters, Bauhaus-inspired graphics, fashion editorials, black-and-white documentary styles, children&amp;rsquo;s-book illustrations, manga, educational infographics, product grids, and character reference sheets. The message is not only that the model can imitate many visual styles. It is that the system is trying to adapt to a wider set of real visual tasks.&lt;/p&gt;
&lt;h2 id=&#34;02-why-this-looks-like-a-move-toward-deliverable-output&#34;&gt;02 Why this looks like a move toward deliverable output
&lt;/h2&gt;&lt;p&gt;From the announcement itself, ChatGPT Images 2.0 looks less like a stronger text-to-image model and more like an upgraded visual production tool.&lt;/p&gt;
&lt;p&gt;Earlier models could produce impressive pictures, but the experience often broke down when the task changed into things like these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;creating a poster with a full headline, subtitle, and supporting copy&lt;/li&gt;
&lt;li&gt;building a magazine or promo page with dense information&lt;/li&gt;
&lt;li&gt;generating a comic page with continuity across characters and panels&lt;/li&gt;
&lt;li&gt;producing marketing assets with fixed aspect ratios, clear layout constraints, and brand tone&lt;/li&gt;
&lt;li&gt;creating polished visual content that includes multilingual text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This release seems designed to answer those older limitations directly.&lt;/p&gt;
&lt;p&gt;The page includes educational infographics, design-trend posters, print-ready bookmark layouts, a cafe launch poster, tourism promo material, product-merch mockups, and a redesigned academic poster. These are not just images that look nice at a glance. They are much closer to semi-finished or even finished outputs from real creative workflows.&lt;/p&gt;
&lt;p&gt;In that sense, the most important change here may not be a simple increase in image quality. It may be that the model is starting to look more like a system for content production, brand materials, education, and lightweight design work.&lt;/p&gt;
&lt;h2 id=&#34;03-what-this-means-for-chatgpts-product-direction&#34;&gt;03 What this means for ChatGPT&amp;rsquo;s product direction
&lt;/h2&gt;&lt;p&gt;The structure of the announcement also hints at a broader product shift.&lt;/p&gt;
&lt;p&gt;OpenAI does not present ChatGPT Images 2.0 as a niche tool only for artists or visual creators. Instead, it repeatedly frames the feature through research, reasoning, source transformation, layout organization, knowledge communication, and marketing output. The page even includes examples built around math proofs, design trends, historical notes, and academic papers.&lt;/p&gt;
&lt;p&gt;That suggests image generation inside ChatGPT is no longer just about adding a picture to a chat or generating a single illustration. It is moving closer to being a general-purpose expression layer. The goal seems to be this: once a user has already researched, thought through, organized, and written something in ChatGPT, the system should also be able to handle the final visual output.&lt;/p&gt;
&lt;p&gt;If that direction continues, competition in image generation will rely less on pure aesthetics or realism alone and more on capabilities like these:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether the system can reliably handle complex text&lt;/li&gt;
&lt;li&gt;whether it can preserve consistency across pages or panels&lt;/li&gt;
&lt;li&gt;whether it can produce layouts closer to real working materials&lt;/li&gt;
&lt;li&gt;whether it can connect naturally to research, writing, marketing, and teaching workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;04-what-the-announcement-does-not-say&#34;&gt;04 What the announcement does not say
&lt;/h2&gt;&lt;p&gt;At the same time, the format of the page also makes its limits clear.&lt;/p&gt;
&lt;p&gt;As of the official page published on April 21, 2026, the announcement focuses much more on outputs than on methods. It does not go into detail about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;quantified improvements over the previous generation&lt;/li&gt;
&lt;li&gt;explicit metrics for text accuracy or multilingual rendering&lt;/li&gt;
&lt;li&gt;failure boundaries for complex layout tasks&lt;/li&gt;
&lt;li&gt;API details, pricing, access modes, or enterprise integration specifics&lt;/li&gt;
&lt;li&gt;concrete changes to safety policies or generation limits&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the page is best read as a product signal rather than a full technical specification.&lt;/p&gt;
&lt;h2 id=&#34;05-short-conclusion&#34;&gt;05 Short conclusion
&lt;/h2&gt;&lt;p&gt;If I had to summarize ChatGPT Images 2.0 in one sentence, the key upgrade is not that it &amp;ldquo;draws better,&amp;rdquo; but that it is becoming better at producing finished work.&lt;/p&gt;
&lt;p&gt;OpenAI clearly wants image generation to evolve from an inspiration tool into a production tool that is more executable, more layout-aware, more communicative, and more directly usable. Text control, multilingual output, layout structure, stylistic range, and long-form visual organization used to be places where image models often showed their weaknesses. In this release, those same areas are being presented as selling points.&lt;/p&gt;
&lt;p&gt;That does not mean image generation has solved every design problem. But this announcement does suggest a shift in what matters. The next competitive edge may not come from who can generate the most striking single image. It may come from who can most reliably generate visual content that is actually ready to use.&lt;/p&gt;
&lt;h2 id=&#34;related-links&#34;&gt;Related Links
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class=&#34;link&#34; href=&#34;https://openai.com/index/introducing-chatgpt-images-2-0/&#34;  target=&#34;_blank&#34; rel=&#34;noopener&#34;
    &gt;Introducing ChatGPT Images 2.0 - OpenAI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
        </item>
        
    </channel>
</rss>
