Multimodal on KnightLi Blog

WavFlow: Meta's Open Project for Audio Generation in Raw Waveform Space

Mon, 25 May 2026 08:00:37 +0800

facebookresearch/WavFlow is a multimodal audio generation project released by Meta AI. The paper title is WavFlow: Audio Generation in Waveform Space.

Project: https://github.com/facebookresearch/WavFlow

It is not focused on speech synthesis or pure music generation. Its goal is to generate synchronized, high-fidelity audio from video and text conditions. More importantly, it does not follow the common latent compression route. It tries to perform end-to-end audio generation directly in raw waveform space.

At the time of writing, the GitHub page shows about 55 stars and 3 forks. The code is mainly Python, and the project has no published release. The README also makes an important point: because of organizational policy constraints, production-trained checkpoints cannot currently be released. The team is working on a foundation checkpoint trained on fully open-source data. Until then, users need to train their own models.

What WavFlow Tries to Solve

Many multimodal audio generation methods first compress audio into a latent space, generate there, and then reconstruct the waveform. This path is efficient, but it can introduce a problem: compression may lose details, affecting audio texture, synchronization, and high-frequency information.

WavFlow tries to bypass that step and generate audio directly in raw waveform space.

The README says it uses waveform patchifying and amplitude lifting so that flow matching can work stably on raw audio, with direct x-prediction. In plainer terms, it does not first compress sound into an intermediate representation. Instead, it cuts the audio waveform itself into patches suitable for model processing and applies amplitude transformation so the model can learn generation at the waveform level.

That is the most interesting part. If end-to-end waveform generation can work reliably, it may reduce the information bottleneck introduced by encoders and decoders.

Supported Input Modes

Based on the README and training guide, WavFlow supports three input modes.

The first is VT2A, or video + text to audio. The model receives video and text descriptions, then generates audio synchronized with the visual scene and semantics, such as forests, frogs, drums, or skateboards.

The second is T2A, or text to audio. There is only a text description and no video input. Training uses CLIP text features, and during inference the CSV can set video_exist to 0.

The third is V2A, or video to audio. There is video but no text. During inference, text_exist can be set to 0, and the model uses a learned empty CLIP-text token.

This design is practical. Real datasets do not always contain complete video, text, and audio annotations for every sample. WavFlow uses fields such as video_exist and text_exist to explicitly represent missing modalities, so both training and inference can handle different combinations.

Evaluation and Positioning

The README says WavFlow is evaluated on VGGSound for VT2A and AudioCaps for T2A, with performance comparable to existing latent-based methods.

The meaning is not that it has already beaten all current models. It is that end-to-end raw waveform generation does not necessarily lose to traditional latent frameworks. At least on acoustic richness, fidelity, and synchronization, it can reach the same tier.

The project page also provides demos such as forest, frog, drum, and skateboard, with more than 24 samples and side-by-side benchmark comparisons. For audio generation models, demos matter a lot, because text metrics cannot fully describe sound texture, spatial feeling, and synchronization.

Installation

The official automatic setup is:

git clone https://github.com/facebookresearch/WavFlow.git
cd WavFlow
bash scripts/setup.sh
conda activate wavflow

scripts/setup.sh creates a conda environment named wavflow and installs the required dependencies.

For manual setup, follow the README:

conda create -n wavflow python=3.10 -y
conda activate wavflow
pip install -r requirements.txt
pip install -e . --no-deps
conda install -n wavflow -c conda-forge "ffmpeg<7" -y

The ffmpeg<7 dependency is mainly for torio video decoding. The README also notes that required external weights such as CLIP, Synchformer, and the empty-string CFG embedding are downloaded or computed automatically on first run and cached under ~/.cache/wavflow/.

Running Inference

Because the official project has not released production-trained checkpoints yet, the following inference entry point only applies after you already have a trained checkpoint.

`1`	`bash scripts/launch/predict.sh [--gpu N] [--config PATH]`

The default config file is:

`1`	`wavflow/configs/infer.yaml`

The input CSV is specified by data.csv_path, and it supports video, text, or both:

video_path,caption,video_exist,text_exist
/abs/path/sample1.mp4,a whistling rocket explodes,1,1
/abs/path/sample2.mp4,birds chirping in a forest,1,1
,a whistling rocket explodes,0,1
/abs/path/sample3.mp4,,1,0

Here, video_exist=0 means no video decoding is used, and the model uses learned empty CLIP/Sync tokens. text_exist=0 means the caption is ignored and the model uses a learned empty CLIP-text token. Captions with commas need to be quoted.

Common launcher parameters include:

1
2
3

--gpu N
--config PATH
WAVFLOW_ENV

Important config fields include model.name, model.ckpt_path, model.use_ema, inference.duration_sec, target_sample_rate, inference.cfg, num_steps, noise_scale, noise_shift, prediction_type, seed, and the output directory.

An EMA Pitfall

The README specifically warns about model.use_ema.

A WavFlow checkpoint may contain model_ema1, which is updated with ema_decay = 0.9999. If training only runs for a few hundred or a few thousand steps, the EMA tensor may still contain many random initialization values and produce noise during inference.

So if you are doing a short run, overfitting a tiny sample set, or running a smoke test, consider sampling with:

`1`	`model.use_ema: false`

Alternatively, use an ema_epoch_*.pth saved after enough training. This detail is useful because otherwise it is easy to assume the model is broken, when in fact the EMA has not stabilized yet.

Training Flow

The official TRAINING.md divides training into two steps.

The first step is feature extraction.

T2A extracts only CLIP text features. VT2A extracts CLIP frame features, Synchformer features, and CLIP text features. An example CSV looks like:

1
2

id,audio_path,video_path,caption
sample1,/abs/or/relative/wav/sample1.wav,/abs/or/relative/video/sample1.mp4,a whistling rocket explodes

Videos must be at least extraction.duration_sec long, which defaults to 8 seconds. Shorter clips are skipped. Feature extraction can be run with:

1
2

bash scripts/launch/extract_t2a.sh
bash scripts/launch/extract_vt2a.sh

For more GPUs or a custom config:

1
2

NPROC_PER_NODE=4 bash scripts/launch/extract_vt2a.sh
CONFIG_PATH=path/to/your_extract.yaml bash scripts/launch/extract_t2a.sh

The second step is training.

For single-node multi-GPU training:

`1`	`bash scripts/launch/train_single_node.sh`

Multi-node training requires NNODES, NODE_RANK, MASTER_ADDR, MASTER_PORT, and NPROC_PER_NODE. Training outputs include checkpoint_latest.pth, checkpoint_epoch_*.pth, ema_epoch_*.pth, generated audio samples, and training.log.

Training resumes automatically: if checkpoint_latest.pth exists in the experiment directory, training continues from it.

Who Should Pay Attention

WavFlow is more relevant to researchers and engineering teams than to ordinary users who want a finished sound-effect tool.

It is worth following if you:

Research video-to-audio, text-to-audio, or multimodal audio generation.
Want to compare raw waveform generation with latent-based audio generation.
Need to train your own audio generation model and can prepare data plus GPU resources.
Work on applications that require strong synchronization between video and sound.
Want to explore whether flow matching is viable on raw audio waveforms.

If you just want a web tool where you type a prompt and immediately get a sound effect, WavFlow is not the easiest option today. It does not yet provide a public production checkpoint, and its deployment path is closer to research code.

Things to Watch

First, do not treat it as a downloadable, ready-to-use audio generation model. The official project currently does not release production-trained checkpoints. Before real inference, you need to train your own model or wait for a future open-data checkpoint.

Second, the license is not a permissive commercial default. The README says most of WavFlow is licensed under CC-BY-NC 4.0, while some vendored components keep their original licenses, including MIT, Apache 2.0, CC BY-NC 4.0, and Stability AI Community License. Read LICENSE and NOTICE.txt carefully before commercial use.

Third, training data is critical. WavFlow’s promise depends on aligned audio, video, and text data. If data quality is poor, captions are inaccurate, or audio and video are out of sync, the model will struggle to learn stable sound generation.

Fourth, raw waveform generation may reduce the latent bottleneck, but it may also increase training and inference cost. Real projects still need to balance audio quality, speed, VRAM, sample rate, and output duration.

Summary

The value of WavFlow is that it asks a clear question: does multimodal audio generation have to compress audio into latent space first?

With waveform patchifying, amplitude lifting, and flow matching, it tries to generate synchronized high-fidelity audio directly in raw waveform space. The evaluation suggests that this route can at least stand in the same range as mature latent-based methods.

For now, though, it is more of a research and training framework than an out-of-the-box product model. No public production checkpoint, a mostly non-commercial license, and the need for aligned audio-video-text data all make it better suited to research, reproduction, and further training. If you care about the next generation of video-to-audio or text-to-audio models, WavFlow is worth a serious look.

References

facebookresearch/WavFlow: https://github.com/facebookresearch/WavFlow
WavFlow Project Page: https://facebookresearch.github.io/WavFlow/
WavFlow arXiv: https://arxiv.org/abs/2605.18749
WavFlow Training Guide: https://github.com/facebookresearch/WavFlow/blob/main/TRAINING.md

Gemini 3.5 Flash positioning and strengths: why it fits high-frequency, multimodal, low-latency use cases

Sun, 24 May 2026 08:43:24 +0800

The keywords for Gemini 3.5 Flash are not “the strongest,” but “high-frequency, fast, cost-efficient, and easy to integrate.” It is more like the workhorse model in the Gemini family: it may not be the model you use for the hardest reasoning tasks, but it is well suited for real production workloads such as Q&A, summarization, customer support, content processing, multimodal understanding, lightweight coding assistance, and automated workflows.

The key to understanding Flash is not to treat it as a replacement for a Pro-class flagship model. It is better understood as a model tier optimized for throughput and response speed. For developers and enterprises, the real cost of many AI applications is not only the strongest single response, but the latency, stability, price, and context-handling ability across thousands or millions of daily requests.

Product positioning

The Gemini family usually separates models into different tiers. Flagship models handle more complex reasoning, planning, and difficult tasks. Flash models emphasize speed, cost, and large-scale invocation.

The positioning of Gemini 3.5 Flash can be summarized as:

More suitable than Pro for high-frequency calls.
More capable than tiny lightweight models for complex input.
Optimized for low latency and high throughput.
Suitable for multimodal input and long-context processing.
Better as the default model inside applications, not only as a model for rare difficult requests.

This type of model is best for tasks that run many times every day. Its value is not just answer quality in one call, but whether it can reliably process large amounts of text, images, audio, video, or structured information at manageable cost.

Why Flash matters

When AI products move into production, a practical issue appears: the strongest model is useful, but not every request deserves the strongest model.

For example:

A user asks an ordinary customer-support question.
A system summarizes a meeting transcript.
A backend classifies a batch of tickets.
An app explains an uploaded image.
An automation extracts fields from an email.
An agent reads a set of documents before deciding the next step.

These tasks need models that are reliable, cheap, and fast, but they do not always require the full reasoning power of a flagship model. That is where Flash matters: it puts “strong enough” and “fast enough” in the same place.

If an AI application serves many users, the default model cannot be chosen only by peak capability. Average request cost, response speed, concurrency, and failure rate matter just as much. Flash is an application-layer model for that reality.

Advantage 1: low latency and high throughput

The most direct advantage of Flash is speed.

For chat products, retrieval-augmented search, support bots, real-time writing assistance, and agent workflows, latency directly affects user experience. Users may not know model parameters or benchmark results, but they immediately feel whether the product keeps them waiting.

Low latency brings several benefits:

Conversations feel more real-time.
Multi-step tool calls do not slow down as much.
Agents can make intermediate decisions more often.
Backend batch processing finishes faster.
Product teams can place AI features into more small workflows.

This matters especially for agent applications. A model does not answer only once; it repeatedly judges, calls tools, reads context, and generates the next action. Lower single-call latency improves the whole chain.

Advantage 2: better cost for scale

Another core value of Flash is cost.

When enterprises and developers put AI applications into production, they usually care about three questions:

How much does each call cost?
How many calls happen per day?
Are cost and latency controllable at peak concurrency?

If a task runs hundreds of thousands of times per day, even a small per-call price gap becomes large over time. Flash-style models are designed so that most requests do not have to go directly to the most expensive and heaviest model.

A common pattern is tiered routing:

Ordinary requests go to Flash by default.
Difficult problems, complex planning, and long-chain reasoning escalate to Pro.
Simple classification or fixed-format extraction can go to even lighter models.

This lets an AI system keep high-end capability while controlling everyday cost.

Advantage 3: multimodal input fits real applications

The Gemini family has long emphasized multimodal capability. Flash is valuable because it is not only for text requests; it can also handle images, audio, video, documents, and related inputs.

That matters in real products. Business data is often not pure text:

Users upload screenshots for support.
Customer support needs to understand a photo of a problem.
Education products process images of exercises.
Content platforms analyze video clips.
Office workflows read PDFs, spreadsheets, and presentations.
E-commerce products analyze product images and user descriptions.

If multimodal understanding depends only on expensive flagship models, many high-frequency scenarios are hard to scale. Flash brings multimodal understanding into a model tier better suited for large-scale invocation.

Advantage 4: long context makes it good at reading material

Long context is an important Gemini-family capability. For Flash, long context is not simply about stuffing everything into the prompt; it lets the model handle more information-organization tasks.

Examples include:

Summarizing long documents.
Reading product manuals.
Analyzing meeting notes.
Organizing multi-page PDFs.
Comparing contracts or proposals.
Providing agents with large task backgrounds.

Long context combined with lower cost is well suited for workflows that first read a lot of material and then produce actionable results. Flash does not need to solve extremely hard reasoning tasks every time. It can include more context in one pass, which is useful for office work, customer support, knowledge bases, and developer assistance.

Advantage 5: suitable as a default model

Many AI products need a “default model.” It does not have to be the most expensive or strongest, but it must satisfy several conditions:

Stable quality on most questions.
Fast response.
Manageable cost.
Ability to handle multimodal input.
Sufficient long-context support.
Easy API and product integration.

This is where Gemini 3.5 Flash has an advantage. It is suitable as the default entry point: handle most requests first, and route complex tasks to stronger models when needed.

This pattern will become increasingly common. Future AI systems will not simply “choose one model”; they will use Flash as the workhorse, Pro as the escalation path, and smaller models for edge tasks.

Suitable scenarios

Gemini 3.5 Flash is well suited for:

Customer-support Q&A and answers after knowledge-base retrieval.
Long-document summaries, report organization, and meeting notes.
Multimodal understanding of images, screenshots, PDFs, and video clips.
Real-time AI assistants inside apps.
Content moderation, classification, and tag generation.
Information extraction from emails, tickets, and forms.
Intermediate decisions and context compression in agent workflows.
Code explanation, lightweight fix suggestions, and documentation generation.
Education products for exercise explanation and study assistance.

These scenarios share the same traits: high request volume, sensitivity to user wait time, complex input types, and no need for flagship-level deep reasoning every time.

Where Flash should not be the only model

Flash is not universal. It is optimized for high-frequency and low-latency use, but that does not mean every problem should use only Flash.

The following scenarios still fit stronger Pro-class models better, or at least require tiered routing:

Complex mathematics and rigorous proofs.
Long-chain planning and multi-step strategic reasoning.
High-risk legal, medical, or financial judgment.
Deep refactoring plans for large codebases.
Complex agent tasks requiring high reliability.
Professional reports with extremely low tolerance for hallucination.

A safer strategy is to let Flash handle, judge, and organize first; when task complexity rises, escalate to a stronger model.

Relationship with Pro-class models

Flash and Pro should not be understood as “which one replaces the other.” They have different jobs.

Flash is the everyday workhorse:

Fast.
Cost-friendly.
Suitable for high concurrency.
Good for multimodal and long-context applications.
Suitable for default product flows.

Pro is the hard-task model:

Better for complex reasoning.
Better for difficult planning.
Better for high-value requests.
Better for small numbers of important deep-analysis tasks.

Good AI products usually combine the two instead of choosing only one.

How developers should use it

If you want to integrate Gemini 3.5 Flash into a product, consider these patterns:

First, use it as the default model. Most ordinary requests go to Flash first, giving both speed and cost control.

Second, design model routing. When Flash identifies a task as complex, high-risk, or requiring deep reasoning, escalate to Pro.

Third, use it for context compression. Before an agent executes a task, Flash can summarize documents, extract key facts, and generate structured context.

Fourth, make multimodal input part of the normal workflow. Images, screenshots, PDFs, audio, and video should not only be edge features; they can become default input types.

Fifth, evaluate with your own data. Do not rely only on official benchmarks. Test with your support questions, documents, code, images, and business workflows to decide which tasks Flash handles well and which need escalation.

Summary

The core positioning of Gemini 3.5 Flash is a multimodal workhorse model for high-frequency real applications. Its advantage is not replacing Pro-class flagship models, but placing speed, cost, long context, and multimodal ability into a tier better suited for large-scale invocation.

For developers, the most important part of Flash is not a single benchmark, but a product architecture shift: the default model can be faster, cheaper, and better at reading complex inputs; harder tasks can still escalate to stronger models. This keeps user experience good while controlling cost.

If Pro is the heavy tool for difficult problems, Flash is the main tool running on the production line every day. In real AI products, the latter is often what users experience most.

References:

Google official blog: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/
Google DeepMind Gemini Flash: https://deepmind.google/en/models/gemini/flash/
User-provided Zhihu discussion link: https://www.zhihu.com/question/2040529179641385344/answer/2040531897613285214

Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters

Fri, 22 May 2026 22:44:16 +0800

Whether an 8GB GPU can run a 35B-class model depends on more than the total parameter count. Model architecture, quantization format, and the way the inference framework schedules work all matter.

The core idea in this setup is to use a GGUF quantized version of an MoE model such as Qwen3.6-35B-A3B, then use llama.cpp with CUDA acceleration, CPU Offload, MoE parameter scheduling, and KV Cache quantization to split memory pressure between the GPU and system RAM. With that approach, an older GPU such as the RTX 3070 8GB can still have a chance to run a 35B-class local multimodal model.

One point needs to be clear first: this is not “fitting a full 35B model entirely into 8GB of VRAM.” A more accurate way to understand it is that the GPU handles the compute that benefits most from GPU acceleration, while some expert layers and cache pressure are carried by system memory. The real experience depends on RAM capacity, CPU performance, quantization format, context length, and parameter choices.

Test environment

This kind of setup is sensitive to system memory. A reference configuration is:

CPU: Intel Core i7-12700 class
GPU: NVIDIA RTX 3070 8GB
RAM: 64GB
OS: Windows 11
Inference framework: llama.cpp CUDA build
Model format: GGUF

If you only have 16GB or 32GB of RAM, it is not necessarily impossible to try, but a 35B MoE model is more likely to create memory pressure during loading and long-context inference. For stable use, 64GB of RAM is a safer target.

Why 8GB VRAM can still run a 35B model

The key to Qwen3.6-35B-A3B is its MoE architecture. Its total parameter scale is 35B, but not all parameters are activated during each inference step; only part of the expert parameters are active.

That leads to two consequences:

The full model file is still large and requires enough disk space and system memory.
The active compute per inference step is lower than a full 35B Dense model.

llama.cpp’s CPU Offload and MoE-related parameters can further reduce the VRAM threshold. The GPU mainly handles attention and some high-value compute, while the CPU and system memory carry part of the expert-layer weights. The tradeoff is that speed, response latency, and stability depend more on the whole machine, not only the GPU model.

Preparing llama.cpp

Windows users can download a prebuilt CUDA version of llama.cpp directly. Pay attention to three points:

The GPU driver should be new enough, and the CUDA runtime should match the llama.cpp package you download.
After downloading, place it in a path without Chinese characters or special characters so batch scripts are easier to run.
Put model files under a unified models directory to avoid very long paths in commands.

If you use AMD, Intel graphics, or a CPU-only environment, you can also choose Vulkan, HIP, SYCL, or CPU builds, but the parameters and performance will be different. This article focuses on the CUDA route for NVIDIA GPUs.

Download the model and multimodal projection file

The model used here is:

Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

The Q4_K_M quantization format is chosen mainly to balance accuracy, file size, and speed. On low-VRAM machines, it is not a good idea to start with a higher-precision version, because loading failures or frequent system paging become much more likely.

If you want image understanding, you also need the multimodal projection file, for example:

mmproj-BF16.gguf

This file is important. Downloading only the main model usually gives you text inference only. Without mmproj, the web UI may not expose a usable image upload feature, or uploaded images may not be processed correctly.

Keep the directory structure simple:

llama.cpp/
├─ llama-server.exe
└─ models/
   ├─ Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
   └─ mmproj-BF16.gguf

RTX 3070 8GB startup parameters

Below is an example startup script for an RTX 3070 8GB. Change the path to your own llama.cpp directory.

@echo off
chcp 65001 >nul
cd /d D:\AI\llama.cpp

llama-server.exe ^
  -m "models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" ^
  --mmproj "models\mmproj-BF16.gguf" ^
  -ngl 99 ^
  --n-cpu-moe 999 ^
  --flash-attn on ^
  --jinja ^
  -c 32768 ^
  -t 12 ^
  -b 512 ^
  -ub 128 ^
  --cache-type-k q4_0 ^
  --cache-type-v q4_0 ^
  --mlock ^
  --host 127.0.0.1 ^
  --port 8080

pause

After startup, open this address in your browser:

`1`	`http://127.0.0.1:8080`

If the page opens and the model replies normally, the service has started successfully. The first model load can be slow. Avoid launching multiple instances repeatedly during loading, because that can fill system memory more easily.

Understanding the key parameters

-ngl 99 tries to place as many layers as possible on the GPU. How many layers actually fit depends on the model structure, quantization format, and VRAM usage.

--n-cpu-moe 999 pushes more MoE expert layers to the CPU side, reducing VRAM pressure. It is one of the key parameters for running large MoE models on low-VRAM hardware.

--flash-attn on enables Flash Attention, which can reduce the cost of attention computation. Whether it is available depends on the current llama.cpp version and GPU support.

-c 32768 sets the context length. Long context significantly increases KV Cache pressure. If startup fails or inference is very slow, try lowering it to 8192 or 16384.

--cache-type-k q4_0 and --cache-type-v q4_0 quantize the KV Cache, saving memory and VRAM, though they may have a small impact on output quality and speed.

-b 512 and -ub 128 control batching-related parameters. In a low-VRAM environment, do not start with overly aggressive batch settings.

Common issues

If startup reports insufficient VRAM, first reduce the context length, for example changing -c 32768 to -c 8192, then try lowering -b and -ub.

If the image upload button is unavailable, first check whether the --mmproj path is correct and whether the mmproj file matches the model.

If the model responds slowly after loading, it usually does not mean the GPU is idle. Large amounts of weights or expert layers may be handled by the CPU and system memory. Use Task Manager to observe GPU, CPU, memory, and disk usage to identify the bottleneck.

If the output format looks wrong, confirm that --jinja is enabled and check whether the model requires the corresponding chat template.

If the browser cannot open the service after startup, check the --host and --port settings, and make sure port 8080 is not occupied by another program.

Who should try this

This setup is suitable for users who already have 8GB VRAM devices such as RTX 3070, RTX 4060 Laptop, or RTX 3060 8GB, but want to experiment with larger MoE models.

It is not suitable for people who need maximum speed. Running a 35B MoE model on low VRAM essentially trades CPU and system memory for a lower VRAM requirement. Being able to run it is one thing; whether it feels smooth enough is another.

If your goal is high-frequency daily chatting, 7B, 8B, or 14B models may feel better. If your goal is to explore larger MoE models, multimodal capability, and the boundary of local deployment, an RTX 3070 8GB with 64GB of RAM is still worth trying.

Summary

The reason an RTX 3070 8GB can run Qwen3.6-35B-A3B is not that the GPU suddenly has more VRAM. It is the combination of MoE architecture, GGUF quantization, llama.cpp CPU Offload, and KV Cache optimization that lowers the threshold.

The most interesting part of this setup is that it lets older GPUs still participate in local large-model experiments. As long as you accept tradeoffs in speed and stability, an 8GB VRAM machine can still be a local AI model testing platform, not only an entry-level device for small models.

References:

Original article: https://www.freedidi.com/24267.html

What Is Gemini Omni? A Complete Look at Google's AI Video Multi-Turn Editing Model

Wed, 20 May 2026 23:11:58 +0800

Google DeepMind has published a page for Gemini Omni. Its positioning is direct: create content from any input, with the current focus starting from video.

If Nano Banana is more about image generation and editing, Gemini Omni feels more like a multimodal editing model for video. Users can modify a video step by step with natural language, with each later change building on the previous one, while trying to keep scenes, people, actions, and visual logic consistent.

Project page: https://deepmind.google/models/gemini-omni/

The Core Problem It Tries to Solve

Traditional video editing often requires timelines, layers, masks, keyframes, color grading, audio tracks, and a lot of manual work. AI video generation tools can already create clips from prompts, but they often run into two problems:

A generated result is hard to refine precisely.
During multi-turn edits, characters, scenes, styles, and actions can drift.

Gemini Omni is aimed at the second step: not just generating a video, but letting users keep asking for changes as if they were talking to an editor.

The project page describes it as a way to edit any video through natural, step-by-step conversation. Each edit builds on the prior result, with the goal of maintaining a coherent and unified scene.

Main Capabilities

Gemini Omni’s capabilities can be grouped into several areas.

The first is natural-language video editing. Users can directly ask the model to change a video’s aesthetic style, motion, or effects. For example, it can make a mirror ripple like liquid, turn a person into line art, a felt toy, or a transparent holographic wireframe, or transform an entire environment into 3D voxel art.

The second is action reconstruction. It can change what happens in a video, such as enlarging a hand-formed hole, making a toy produce the corresponding animal sound, or making building lights react to music.

The third is editing real video based on reference images. Users can provide an image reference and ask the model to place a building, sun, aircraft, or other object into a real video scene.

The fourth is maintaining consistency across multi-turn edits. The page shows a continuous editing flow: moving a violinist into a reference-image environment, removing the violin, and then changing the shot to an over-the-shoulder angle. This is closer to an actual creative process than a one-shot prompt.

The fifth is multi-input reference. Gemini Omni can combine image, text, video, and audio inputs into one output, supporting tasks such as style transfer, motion transfer, character replacement, and sketch-to-video generation.

Why It Emphasizes World Knowledge

Google repeatedly emphasizes that Gemini Omni is not only about making visuals look realistic. It also uses Gemini’s world knowledge, physical intuition, history, science, and narrative logic.

That matters. If a video model only optimizes for visual quality, it can easily produce illogical motion, confused object relationships, or mismatches between text and image. Gemini Omni’s goal is for video to look right while also being more coherent in story, physics, and meaning.

Examples on the page include:

A marble rolling through a chain-reaction track.
A claymation explanation of protein folding.
A stop-motion style explanation of how the hippocampus works.
Letters appearing in sync with objects in the scene.
On-screen words appearing one by one to the rhythm.

These examples suggest that Gemini Omni is not just a short-video effects tool. It tries to combine knowledge expression, storytelling, and audiovisual generation.

How It Relates to Veo, Flow, and Nano Banana

In Google’s current product lineup, Gemini Omni looks like a layer for multimodal creation and editing.

Veo is more focused on the video generation model itself, emphasizing cinematic video and audio generation. Google Flow is an AI creative studio for creators, suitable for organizing shots, assets, and video projects. Nano Banana is more focused on image creation and detailed editing. Gemini Omni emphasizes multimodal editing from any input to a consistent output, especially multi-turn natural-language control for video.

A simple way to understand it:

To generate high-quality video, watch Veo.
To organize video projects in a creative workflow, watch Google Flow.
To edit images, watch Nano Banana.
To modify video conversationally while referencing images, text, video, and audio, watch Gemini Omni.

Access Points

The page lists these access points:

Gemini app.
Google Flow.
YouTube Shorts.

However, it also notes that a Google AI subscription is required, and availability depends on subscription tier and region. In other words, not every user in every region can immediately access the full feature set.

For creators, Google Flow may be the most important entry point because it is closer to a complete creative workspace. For general users, Gemini app and YouTube Shorts may be lower-friction ways to try it.

Safety and Content Labels

The Gemini Omni page specifically mentions safety work. Gemini Omni Flash was developed in collaboration with internal safety and responsibility teams, with automated evaluations, human evaluations, human red teaming, automated red teaming, and pre-launch ethics and safety reviews.

For content transparency, the page says content created or edited with Omni in Gemini app, Google Flow, or YouTube will include imperceptible SynthID digital watermarks and C2PA Content Credentials. Users can verify content in Gemini app, with expansion to Chrome and Search planned later.

This is especially important for video models. The more realistic video generation and editing becomes, the more important source labeling, abuse prevention, and verification tools become.

Who It Is For

Gemini Omni is suitable for several types of users:

Content creators who want to modify video quickly with natural language.
Design teams that need to combine sketches, reference images, audio, and video assets into a finished clip.
People making short videos, ad concepts, educational explainers, and product visual drafts.
Creators building AI video workflows in Google Flow.
Developers and researchers watching the boundaries of multimodal video editing.

But it is not ideal for every scenario. Serious commercial films, brand key visuals, film production, and product launch videos still require human review, copyright checks, fact-checking, and asset management. AI can clearly speed up concept generation and first-draft iteration, but it should not replace final review.

How to Read Gemini Omni

The significance of Gemini Omni is that it moves AI video from “one-shot generation” toward “conversational editing.” That is closer to real creative workflows than simply improving visual quality.

If it performs reliably in multi-turn editing, consistency, reference control, audio-video synchronization, and content labeling, the way people use AI video tools will change. Users will no longer only write one long prompt and hope for the best; they will revise scenes, actions, styles, and narratives step by step like directors, editors, and designers.

What still needs to be observed is actual availability, pricing, regional limits, video length, resolution, copyright policy, and commercial-use rules. For ordinary creators, the most practical question is whether Gemini Omni can reliably handle multi-turn video editing inside Google Flow and Gemini app.

References:

Google DeepMind: Gemini Omni

Let AI Operate Your Computer? UI-TARS-desktop Connects Desktop, Browser, and Tools

Tue, 19 May 2026 10:56:50 +0800

bytedance/UI-TARS-desktop is ByteDance’s open source multimodal AI agent project. It is not just a single desktop app, but an agent stack. The current README mainly contains two directions: Agent TARS and UI-TARS Desktop.

Project URL: https://github.com/bytedance/UI-TARS-desktop

Official site: https://agent-tars.com

At the time of writing, the GitHub API showed about 34k stars, TypeScript as the main language, and an Apache-2.0 license. The README describes it as an “Open-Source Multimodal AI Agent Stack.”

Difference Between Agent TARS and UI-TARS Desktop

The README places the two projects in one comparison table:

Agent TARS: a general multimodal AI agent stack that connects GUI agents, vision, terminal, browser, and product workflows.
UI-TARS Desktop: a desktop application based on UI-TARS models, providing native GUI agent capabilities for operating local or remote computers and browsers.

Simply put, Agent TARS is more like a general agent runtime, while UI-TARS Desktop is the desktop GUI operation entry point.

What Agent TARS Can Do

Agent TARS mainly provides a CLI and Web UI. Its goal is to let multimodal models complete task flows closer to human operation through MCP and various tools.

Core capabilities listed in the README include:

One-command CLI startup, supporting headful Web UI and headless server.
Hybrid browser agent control through GUI Agent, DOM, or mixed strategies.
Event Stream for tracing and debugging data flows.
MCP integration for mounting MCP Servers and real tools.

Quick start:

`1`	`npx @agent-tars/cli@latest`

Global installation:

`1`	`npm install @agent-tars/cli@latest -g`

Run with a model provider:

1
2

agent-tars --provider volcengine --model doubao-1-5-thinking-vision-pro-250428 --apiKey your-api-key
agent-tars --provider anthropic --model claude-3-7-sonnet-latest --apiKey your-api-key

What UI-TARS Desktop Can Do

UI-TARS Desktop is a desktop GUI Agent. Based on UI-TARS and Seed-1.5-VL / 1.6 model families, it focuses on letting the model understand the screen and execute mouse and keyboard operations.

Capabilities listed in the README include:

Natural language control.
Screenshots and visual recognition.
Precise mouse and keyboard control.
Cross-platform support for Windows, macOS, and browsers.
Real-time feedback and status display.
Local processing with an emphasis on privacy and security.

Example tasks include changing VS Code settings, checking GitHub issues, and operating remote computers or browsers.

Why GUI Agents Matter

Traditional automation depends on APIs, DOM, or scripts. A GUI Agent starts from the interface: it sees buttons, input boxes, menus, and state, then operates through mouse and keyboard.

This has two values. First, many applications do not have stable APIs, or APIs do not cover the full workflow. A GUI Agent can interact from the same surface a human uses.

Second, multimodal models can handle screenshots, documents, web pages, and app interfaces, combining visual understanding with execution.

The limitation is also clear. GUI operations are affected by resolution, language, layout changes, pop-ups, and network latency. Production workflows still need permission control, confirmation steps, and rollback plans.

Relationship With MCP

Agent TARS emphasizes MCP integration. MCP is useful because it gives agents a unified way to call browsers, files, command lines, databases, internal services, and other tools.

For complex tasks, GUI clicking alone is not stable enough. A better pattern is often:

Use APIs where APIs are available.
Use vision when page state must be understood.
Use browser control when real web interaction is needed.
Use GUI Agent when local software must be operated.

Projects like UI-TARS-desktop are exploring how to place these capabilities in one agent stack.

What To Watch Out For

First, desktop agents have execution risk. They can operate mouse, keyboard, and browser, so permissions must be limited to avoid accidental file changes, account operations, payment, or production system actions.

Second, remote computer and remote browser control needs a clear security boundary. Do not expose unauthenticated control endpoints to the public internet.

Third, multimodal models can misread interfaces. Critical operations should require human confirmation, especially delete, submit, pay, publish, trade, or other irreversible actions.

Who It Is For

UI-TARS-desktop is suitable for developers exploring GUI agents, teams building AI assistants for desktop workflows, and researchers comparing browser, DOM, MCP, and visual-control strategies. It is not a simple consumer assistant yet.

Summary

UI-TARS-desktop is worth watching because it moves AI agents from “answering in chat” toward “seeing the screen and operating tools.” Its value is not only in desktop control, but in combining GUI, browser, terminal, and MCP capabilities in one stack.

OpenAI Introduces ChatGPT Images 2.0: Image Generation Starts Moving Toward Deliverable Output

Wed, 22 Apr 2026 14:21:45 +0800

OpenAI published Introducing ChatGPT Images 2.0 on April 21, 2026. Judging from the announcement page, the main point is not simply that the images look better. The bigger message is that image generation is moving toward something more controllable, more layout-aware, and more directly usable.

If you look only at this launch page, it reads more like a dense capability showcase than a traditional technical announcement. There is very little about model architecture, training details, or benchmarks. Instead, OpenAI uses a large set of examples to answer a more practical question: can ChatGPT now handle more of the work that previously required repeated manual fixes for text, layout, and final polish?

01 The clearest signals in this release

The most prominent phrases on the page already summarize the focus:

Greater precision and control
Stronger across languages
Stylistic sophistication and realism

Taken together, those three ideas say a lot.

First, the emphasis is shifting away from imagination alone and toward control. The page includes many examples such as posters, magazine spreads, promo pages, infographics, character sheets, comic pages, and print-ready bookmark designs. What these examples share is not just visual appeal. They require text handling, hierarchy, whitespace, composition, stylistic consistency, and format control at the same time. That suggests OpenAI is intentionally pushing the product from “generate an image” toward “generate a visual asset people can actually use.”

Second, multilingual text rendering is being treated as a headline feature. The page includes multilingual posters, book covers, a Korean hospitality campaign, Japanese manga, and several typography-focused examples. That matters because one of the most persistent weak points in image models has been long text, complex layouts, and non-English scripts. OpenAI putting this front and center is itself a signal: text rendering and cross-language layout are now capabilities it believes are worth showcasing directly.

Third, the stylistic range is very broad. The examples span photorealistic images, retro collage posters, Bauhaus-inspired graphics, fashion editorials, black-and-white documentary styles, children’s-book illustrations, manga, educational infographics, product grids, and character reference sheets. The message is not only that the model can imitate many visual styles. It is that the system is trying to adapt to a wider set of real visual tasks.

02 Why this looks like a move toward deliverable output

From the announcement itself, ChatGPT Images 2.0 looks less like a stronger text-to-image model and more like an upgraded visual production tool.

Earlier models could produce impressive pictures, but the experience often broke down when the task changed into things like these:

creating a poster with a full headline, subtitle, and supporting copy
building a magazine or promo page with dense information
generating a comic page with continuity across characters and panels
producing marketing assets with fixed aspect ratios, clear layout constraints, and brand tone
creating polished visual content that includes multilingual text

This release seems designed to answer those older limitations directly.

The page includes educational infographics, design-trend posters, print-ready bookmark layouts, a cafe launch poster, tourism promo material, product-merch mockups, and a redesigned academic poster. These are not just images that look nice at a glance. They are much closer to semi-finished or even finished outputs from real creative workflows.

In that sense, the most important change here may not be a simple increase in image quality. It may be that the model is starting to look more like a system for content production, brand materials, education, and lightweight design work.

03 What this means for ChatGPT’s product direction

The structure of the announcement also hints at a broader product shift.

OpenAI does not present ChatGPT Images 2.0 as a niche tool only for artists or visual creators. Instead, it repeatedly frames the feature through research, reasoning, source transformation, layout organization, knowledge communication, and marketing output. The page even includes examples built around math proofs, design trends, historical notes, and academic papers.

That suggests image generation inside ChatGPT is no longer just about adding a picture to a chat or generating a single illustration. It is moving closer to being a general-purpose expression layer. The goal seems to be this: once a user has already researched, thought through, organized, and written something in ChatGPT, the system should also be able to handle the final visual output.

If that direction continues, competition in image generation will rely less on pure aesthetics or realism alone and more on capabilities like these:

whether the system can reliably handle complex text
whether it can preserve consistency across pages or panels
whether it can produce layouts closer to real working materials
whether it can connect naturally to research, writing, marketing, and teaching workflows

04 What the announcement does not say

At the same time, the format of the page also makes its limits clear.

As of the official page published on April 21, 2026, the announcement focuses much more on outputs than on methods. It does not go into detail about:

quantified improvements over the previous generation
explicit metrics for text accuracy or multilingual rendering
failure boundaries for complex layout tasks
API details, pricing, access modes, or enterprise integration specifics
concrete changes to safety policies or generation limits

So the page is best read as a product signal rather than a full technical specification.

05 Short conclusion

If I had to summarize ChatGPT Images 2.0 in one sentence, the key upgrade is not that it “draws better,” but that it is becoming better at producing finished work.

OpenAI clearly wants image generation to evolve from an inspiration tool into a production tool that is more executable, more layout-aware, more communicative, and more directly usable. Text control, multilingual output, layout structure, stylistic range, and long-form visual organization used to be places where image models often showed their weaknesses. In this release, those same areas are being presented as selling points.

That does not mean image generation has solved every design problem. But this announcement does suggest a shift in what matters. The next competitive edge may not come from who can generate the most striking single image. It may come from who can most reliably generate visual content that is actually ready to use.

Introducing ChatGPT Images 2.0 - OpenAI

Multimodal on KnightLi Blog

WavFlow: Meta's Open Project for Audio Generation in Raw Waveform Space

What WavFlow Tries to Solve

Supported Input Modes

Evaluation and Positioning

Installation

Running Inference

An EMA Pitfall

Training Flow

Who Should Pay Attention

Things to Watch

Summary

References

Gemini 3.5 Flash positioning and strengths: why it fits high-frequency, multimodal, low-latency use cases

Product positioning

Why Flash matters

Advantage 1: low latency and high throughput

Advantage 2: better cost for scale

Advantage 3: multimodal input fits real applications

Advantage 4: long context makes it good at reading material

Advantage 5: suitable as a default model

Suitable scenarios

Where Flash should not be the only model

Relationship with Pro-class models

How developers should use it

Summary

Running Qwen3.6-35B Locally on an RTX 3070 8GB: llama.cpp Deployment Notes and Tuning Parameters

Test environment

Why 8GB VRAM can still run a 35B model

Preparing llama.cpp

Download the model and multimodal projection file

RTX 3070 8GB startup parameters

Understanding the key parameters

Common issues

Who should try this

Summary

What Is Gemini Omni? A Complete Look at Google's AI Video Multi-Turn Editing Model

The Core Problem It Tries to Solve

Main Capabilities

Why It Emphasizes World Knowledge

How It Relates to Veo, Flow, and Nano Banana

Access Points

Safety and Content Labels

Who It Is For

How to Read Gemini Omni

Let AI Operate Your Computer? UI-TARS-desktop Connects Desktop, Browser, and Tools

Difference Between Agent TARS and UI-TARS Desktop

What Agent TARS Can Do

What UI-TARS Desktop Can Do

Why GUI Agents Matter

Relationship With MCP

What To Watch Out For

Who It Is For

Summary

OpenAI Introduces ChatGPT Images 2.0: Image Generation Starts Moving Toward Deliverable Output

01 The clearest signals in this release

02 Why this looks like a move toward deliverable output

03 What this means for ChatGPT’s product direction

04 What the announcement does not say

05 Short conclusion

Related Links