AI Video on KnightLi Blog

How to Use Gemini 3.5 Flash and Gemini Omni for Free: Entry Points for Users and Developers

Wed, 20 May 2026 23:13:35 +0800

After Google released Gemini 3.5 Flash and Gemini Omni, the practical question is not the benchmark score, but how ordinary users and developers can actually use them, which entry points are free, and which ones are only low-friction trials.

The short version:

For chat, writing, image understanding, and everyday Q&A: use Gemini app first.
To test Gemini 3.5 Flash parameters, prompts, and multimodal input: use Google AI Studio.
To call Gemini 3.5 Flash from code: create an API key in AI Studio.
To try it from the terminal for free: look at Gemini CLI.
To try Gemini Omni video editing: start with Gemini app and Google Flow.
For real production use: do not rely on free quotas; move to a paid API or Vertex AI.

Note: free quotas, regional availability, subscription tiers, and model menus change over time. This article was written on May 20, 2026. Before you depend on any of these entry points, check Google’s current pricing, limits, and availability pages.

Free Gemini 3.5 Flash Method 1: Gemini App

The simplest entry point is Gemini app:

https://gemini.google.com/

The basic flow is straightforward:

Open Gemini.
Sign in with a Google account.
Look for 3.5 Flash in the model selector.
Start chatting.

This entry point is best for ordinary users. You can use it for writing, summarization, image understanding, file analysis, everyday Q&A, and simple planning. According to public reports, Gemini 3.5 Flash has been made available to users globally and can be selected from Gemini’s model dropdown.

The limits are also clear: free users usually face daily message, regional, and feature limits. If you exceed the limit, you need to wait for the quota to refresh or upgrade your subscription.

Free Gemini 3.5 Flash Method 2: Google AI Studio

If you want more than chat, and need to tune prompts, inspect parameters, or test structured output, Google AI Studio is a better fit:

https://aistudio.google.com/

Basic flow:

Sign in to Google AI Studio.
Create a new prompt.
Select gemini-3.5-flash in the model dropdown.
Enter the prompt and run it.

AI Studio gives you more control. You can adjust temperature, system instructions, structured output, and multi-image input, and you can export a working prompt into code or an API call.

For developers, AI Studio is a free testing bench. Tune the prompt and input format here first, then move into API integration to avoid wasting quota.

Free Gemini 3.5 Flash Method 3: Free API Key

Developers care most about the API. AI Studio can create a Gemini API key for calling gemini-3.5-flash.

Basic flow:

Open Google AI Studio.
Find Get API key.
Select or create a project.
Create an API key.
Save the key to a local environment variable.

Python example:

import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-3.5-flash",
    contents="Explain in three sentences what Gemini 3.5 Flash is best suited for."
)

print(response.text)

Node.js example:

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3.5-flash",
  contents: "Explain in three sentences what Gemini 3.5 Flash is best suited for."
});

console.log(response.text);

curl example:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.5-flash:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"Hello Gemini 3.5 Flash"}]}]}'

Public information suggests that the AI Studio free tier usually gives Gemini Flash models a daily request allowance. The exact numbers can vary by time, region, and account status. Common claims include around 1,500 requests per day, per-minute request limits, and token limits. Do not bake those numbers into a production plan; check Google’s current pricing and limits pages before launch.
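
Because free-tier limits can change, client code is safer when it treats a rate-limit response as something to wait out rather than a fatal error. The following is a minimal sketch of retry with exponential backoff, not part of any Google SDK: RateLimitError is a hypothetical stand-in for whatever exception your client raises on an HTTP 429, and call_with_backoff, max_attempts, and base_delay are illustrative names.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError, wait and retry with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus random jitter so that
            # several clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

To use it, wrap the request in a function that translates the SDK's rate-limit error into RateLimitError, then pass that function to call_with_backoff. The sleep parameter exists only so the wait can be replaced in tests.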

Free Gemini 3.5 Flash Method 4: Gemini CLI

If you like the command line, look at Gemini CLI. It is useful for temporary scripts, repository summaries, file reading, and quick Q&A in the terminal.

Installation is usually:

npm install -g @google/gemini-cli

Then run:

gemini

The CLI is better suited to personal developer workflows than production integration. Production should still use API keys, service accounts, permission controls, and auditable calling patterns.

Free or Low-Friction Gemini Omni Access: Gemini App and Google Flow

Gemini Omni is a multimodal model for video creation and editing. Its core capability is not ordinary text Q&A, but multi-turn video editing with natural language while referencing image, text, video, and audio inputs.

The Google DeepMind page lists these entry points:

Gemini app.
Google Flow.
YouTube Shorts.

The page also notes that a Google AI subscription is required, and that features vary by subscription tier and region. So “free access” to Gemini Omni should be understood more carefully: some entry points may let free users see or try part of the experience, but full video editing may require a subscription, regional availability, or product rollout access.

If you only want to try it, use this order:

Open Gemini app first and check whether Gemini Omni or a related video editing entry is available.
Then open Google Flow: https://flow.google/
If you make short-form content, watch for Omni-related editing features in YouTube Shorts.

If the entry point is not visible, it usually does not mean you did something wrong. Your account, region, subscription tier, or rollout group may simply not qualify yet.

How Gemini Omni Is Best Used

Gemini Omni is more suitable for creators than for ordinary chat.

You can try these directions:

Upload or select a video and ask it to change the style.
Make a specific action in the video more exaggerated.
Use a reference image to replace an object or character in the scene.
Modify camera, action, environment, and style over multiple turns.
Combine sketches, reference images, audio, or video into a new output.

You can write prompts as if giving instructions to an editor:

Keep the original person and room structure unchanged. Change the effect after touching the mirror into liquid ripples. The motion should feel natural, and the lighting should not change abruptly.

For multi-turn editing, do not pack too many requirements into one request. A safer approach is:

Change the main action first.
Then change the style.
Then adjust the camera angle.
Finally tune sound, text, and rhythm.

This makes consistency easier to maintain and helps you identify which step caused a problem.

Common Pitfalls When Using Free Access

First, free quota is not production quota. A free API key is suitable for testing, personal tools, and prototypes, not for promising a stable service.

Second, do not send sensitive data to free or third-party entry points. This includes private code, customer data, contracts, keys, financial spreadsheets, and internal documents.

Third, check data-use settings. Free tiers may have different data-use policies, so review the settings in AI Studio or your Google account before use.

Fourth, video capabilities are usually more restricted than text capabilities. Gemini Omni-style video editing may be limited by subscription, region, queueing, duration, resolution, and content safety policies.

Fifth, be careful with third-party “unlimited free API” services. Many gateways rate-limit, forward requests, keep logs, or use opaque payment methods. Sensitive work should not go through these entry points.

Which Entry Point Should You Choose?

If you are an ordinary user:

Gemini 3.5 Flash: use Gemini app.
Gemini Omni: check Gemini app first, then Google Flow.

If you are a creator:

Use Google Flow to try Omni video workflows.
Use Gemini app for scripts, storyboards, prompts, and material descriptions.

If you are a developer:

Use AI Studio to debug prompts.
Use an API key to integrate gemini-3.5-flash.
Use Gemini CLI for personal terminal workflows.
For production, consider Vertex AI or the paid API.

If you are an enterprise:

Do not rely on free quotas.
Focus on permissions, logs, audits, data residency, compliance, and key management.
For video generation and editing, add watermarking, content review, and copyright processes.

Summary

Gemini 3.5 Flash has relatively clear free access paths: Gemini app, Google AI Studio, AI Studio API key, and Gemini CLI can all serve as low-friction entry points. It is suitable for chat, writing, coding, agent prototypes, and multimodal testing.

Gemini Omni focuses on video editing and multimodal creation. Its main entry points are Gemini app, Google Flow, and YouTube Shorts, but full capabilities are more likely to depend on subscription and region. It is best for creators to start with trials and concept validation, not to plan around it as a stable production service from day one.

The safest strategy is: test text and code tasks first with the Gemini 3.5 Flash free tier; validate video creation effects with Gemini Omni in Gemini app or Flow; when you need to launch something real, move to a formal setup with auditability, billing, and controlled permissions.

What Is Gemini Omni? A Complete Look at Google's AI Video Multi-Turn Editing Model

Wed, 20 May 2026 23:11:58 +0800

Google DeepMind has published a page for Gemini Omni. Its positioning is direct: create content from any input, with the current focus starting from video.

If Nano Banana is more about image generation and editing, Gemini Omni feels more like a multimodal editing model for video. Users can modify a video step by step with natural language, with each later change building on the previous one, while trying to keep scenes, people, actions, and visual logic consistent.

Project page: https://deepmind.google/models/gemini-omni/

The Core Problem It Tries to Solve

Traditional video editing often requires timelines, layers, masks, keyframes, color grading, audio tracks, and a lot of manual work. AI video generation tools can already create clips from prompts, but they often run into two problems:

A generated result is hard to refine precisely.
During multi-turn edits, characters, scenes, styles, and actions can drift.

Gemini Omni targets the stage after generation: not just producing a video, but letting users keep asking for changes as if they were talking to an editor.

The project page describes it as a way to edit any video through natural, step-by-step conversation. Each edit builds on the prior result, with the goal of maintaining a coherent and unified scene.

Main Capabilities

Gemini Omni’s capabilities can be grouped into several areas.

The first is natural-language video editing. Users can directly ask the model to change a video’s aesthetic style, motion, or effects. For example, it can make a mirror ripple like liquid, turn a person into line art, a felt toy, or a transparent holographic wireframe, or transform an entire environment into 3D voxel art.

The second is action reconstruction. It can change what happens in a video, such as enlarging a hand-formed hole, making a toy produce the corresponding animal sound, or making building lights react to music.

The third is editing real video based on reference images. Users can provide an image reference and ask the model to place a building, sun, aircraft, or other object into a real video scene.

The fourth is maintaining consistency across multi-turn edits. The page shows a continuous editing flow: moving a violinist into a reference-image environment, removing the violin, and then changing the shot to an over-the-shoulder angle. This is closer to an actual creative process than a one-shot prompt.

The fifth is multi-input reference. Gemini Omni can combine image, text, video, and audio inputs into one output, supporting tasks such as style transfer, motion transfer, character replacement, and sketch-to-video generation.

Why It Emphasizes World Knowledge

Google repeatedly emphasizes that Gemini Omni is not only about making visuals look realistic. It also draws on Gemini’s world knowledge, including physical intuition, history, science, and narrative logic.

That matters. If a video model only optimizes for visual quality, it can easily produce illogical motion, confused object relationships, or mismatches between text and image. Gemini Omni’s goal is for video to look right while also being more coherent in story, physics, and meaning.

Examples on the page include:

A marble rolling through a chain-reaction track.
A claymation explanation of protein folding.
A stop-motion style explanation of how the hippocampus works.
Letters appearing in sync with objects in the scene.
On-screen words appearing one by one to the rhythm.

These examples suggest that Gemini Omni is not just a short-video effects tool. It tries to combine knowledge expression, storytelling, and audiovisual generation.

How It Relates to Veo, Flow, and Nano Banana

In Google’s current product lineup, Gemini Omni looks like a layer for multimodal creation and editing.

Veo is more focused on the video generation model itself, emphasizing cinematic video and audio generation. Google Flow is an AI creative studio for creators, suitable for organizing shots, assets, and video projects. Nano Banana is more focused on image creation and detailed editing. Gemini Omni emphasizes multimodal editing from any input to a consistent output, especially multi-turn natural-language control for video.

A simple way to understand it:

To generate high-quality video, watch Veo.
To organize video projects in a creative workflow, watch Google Flow.
To edit images, watch Nano Banana.
To modify video conversationally while referencing images, text, video, and audio, watch Gemini Omni.

Access Points

The page lists these access points:

Gemini app.
Google Flow.
YouTube Shorts.

However, it also notes that a Google AI subscription is required, and availability depends on subscription tier and region. In other words, not every user in every region can immediately access the full feature set.

For creators, Google Flow may be the most important entry point because it is closer to a complete creative workspace. For general users, Gemini app and YouTube Shorts may be lower-friction ways to try it.

Safety and Content Labels

The Gemini Omni page specifically mentions safety work. Gemini Omni Flash was developed in collaboration with internal safety and responsibility teams, with automated evaluations, human evaluations, human red teaming, automated red teaming, and pre-launch ethics and safety reviews.

For content transparency, the page says content created or edited with Omni in Gemini app, Google Flow, or YouTube will include imperceptible SynthID digital watermarks and C2PA Content Credentials. Users can verify content in Gemini app, with expansion to Chrome and Search planned later.

This is especially important for video models. The more realistic video generation and editing becomes, the more important source labeling, abuse prevention, and verification tools become.

Who It Is For

Gemini Omni is suitable for several types of users:

Content creators who want to modify video quickly with natural language.
Design teams that need to combine sketches, reference images, audio, and video assets into a finished clip.
People making short videos, ad concepts, educational explainers, and product visual drafts.
Creators building AI video workflows in Google Flow.
Developers and researchers watching the boundaries of multimodal video editing.

But it is not ideal for every scenario. Serious commercial films, brand key visuals, film production, and product launch videos still require human review, copyright checks, fact-checking, and asset management. AI can clearly speed up concept generation and first-draft iteration, but it should not replace final review.

How to Read Gemini Omni

The significance of Gemini Omni is that it moves AI video from “one-shot generation” toward “conversational editing.” That is closer to real creative workflows than simply improving visual quality.

If it performs reliably in multi-turn editing, consistency, reference control, audio-video synchronization, and content labeling, the way people use AI video tools will change. Users will no longer only write one long prompt and hope for the best; they will revise scenes, actions, styles, and narratives step by step like directors, editors, and designers.

What still needs to be observed is actual availability, pricing, regional limits, video length, resolution, copyright policy, and commercial-use rules. For ordinary creators, the most practical question is whether Gemini Omni can reliably handle multi-turn video editing inside Google Flow and Gemini app.

References:

Google DeepMind: Gemini Omni

web-video-presentation: an Agent Skill for turning articles into screen-recordable web videos

Fri, 15 May 2026 09:02:15 +0800

web-video-presentation is an agent skill in ConardLi/garden-skills. It solves a concrete problem: turn an article or narration script into a web-based presentation that can be recorded as a video.

Project: https://github.com/ConardLi/garden-skills/tree/main/skills/web-video-presentation

It is not a normal slide template or a React component library. It is a production process for AI agents: rewrite content into narration, turn it into an outline, choose a theme, build a 16:9 click-driven Vite + React + TypeScript web surface, then record it.

It is not trying to make slides

The README makes an important distinction: the skill generates a “video production surface”, not a slide deck.

Each click advances a narration beat. Each step owns a 1920×1080 stage. The UI progress controls stay hidden unless hovered, making recordings clean.

It is useful for:

Turning blog posts into YouTube or Bilibili-style explainers
Building visuals for narration scripts
Product demos
Tutorial videos
Keynote-style visual talks
Dynamic presentations that do not feel like PowerPoint

The value is not replacing video editing software. It makes the browser a controllable, iterative video canvas.

Core principles

The skill has several clear principles.

First, a fixed 16:9 stage. Design happens in a stable 1920×1080 coordinate system, then scales to the viewport. This prevents layout drift during recording.

Second, a global step cursor. Clicks and keyboard input advance (chapter, step) and save progress locally. It behaves like a video timeline, but controlled through web state.

Third, one idea per step. Every beat should have its own visual moment, not just more bullets on the same page.

Fourth, narration drives structure. The script defines rhythm; the outline defines chapters and steps; visuals follow the story.

Fifth, motion first. Each scene should have a moving visual anchor. If it is only static text, it has not become video language yet.

Sixth, theme tokens. A theme is not just colors; it controls typography, colors, cards, background, separators, decoration, and tone through semantic tokens.
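
The first two principles can be illustrated with a short sketch. The skill itself is built with React and TypeScript; this Python version only models the logic, and the names stage_scale and StepCursor are hypothetical rather than taken from the project. It assumes every chapter has at least one step and leaves out the local progress saving.

```python
from dataclasses import dataclass

STAGE_W, STAGE_H = 1920, 1080  # fixed design coordinate system


def stage_scale(viewport_w: float, viewport_h: float) -> float:
    """Uniform scale that fits the whole 16:9 stage inside the viewport."""
    return min(viewport_w / STAGE_W, viewport_h / STAGE_H)


@dataclass
class StepCursor:
    """Global (chapter, step) position, advanced by a click or key press."""

    steps_per_chapter: list  # number of steps in each chapter, in order
    chapter: int = 0
    step: int = 0

    def advance(self) -> bool:
        """Move to the next beat; return False when already on the last one."""
        if self.step + 1 < self.steps_per_chapter[self.chapter]:
            self.step += 1
            return True
        if self.chapter + 1 < len(self.steps_per_chapter):
            self.chapter += 1
            self.step = 0
            return True
        return False
```

Taking the smaller of the two ratios keeps the whole stage visible at any window size, which is why a layout designed at 1920×1080 does not shift when the recording window is resized.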

Four-part workflow

The workflow has four stages.

First is content writing. If the user provides an article, the agent rewrites it into script.md, then creates outline.md. If the user already provides a narration script, it saves it as script.md and generates the outline.

Second is web development. The agent scaffolds a Vite / React / TypeScript project and implements scenes chapter by chapter. Chapter 1 must be completed by the main thread and approved by the user, because it becomes the style anchor.

Third is optional audio generation. The skill can extract narration definitions from each chapter’s narrations.ts and run a voice synthesis flow.

Fourth is recording and post-production. The web app is the recording stage; the user records the click-driven presentation.

The process has hard checkpoints: script, outline, theme, asset plan, and development mode must be aligned first; chapter 1 must be reviewed; audio generation must also be confirmed.

Why outline should not define animation

One interesting constraint is that outline.md plans rhythm and information density, but not concrete animations.

It may describe chapters, step count, screen content, information pools, asset plans, and estimated duration. It should not define CSS animation type, timing, clip-path, or filter implementation.

The reason is good: if outline locks animation, later implementation becomes mechanical. Video feeling should be designed per chapter based on content relationships.

narrations.ts as the source of truth

Each chapter has a narrations.ts. It stores the step count and corresponding narration text. The skill requires the maximum step used in the chapter .tsx to align with narrations.length.

This prevents drift across script.md, outline.md, chapter code, chapters.ts, and audio files. For video production, keeping narration, screen, audio, and step count aligned is essential.
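
The alignment rule can be expressed as a small check. This is a sketch under assumptions, not the skill's own validator: check_chapter_alignment is a hypothetical name, the narration list stands in for the contents of a chapter's narrations.ts, and step indices are assumed to be 0-based.

```python
def check_chapter_alignment(narrations, steps_used):
    """Return a list of problems; an empty list means the chapter is aligned.

    narrations: narration strings for one chapter, one per step.
    steps_used: 0-based step indices referenced by the chapter's scene code.
    """
    problems = []
    if not narrations:
        problems.append("chapter has no narration lines")
    # The highest step the scene reaches must match the narration count.
    step_count = max(steps_used, default=-1) + 1
    if step_count != len(narrations):
        problems.append(
            f"scene uses {step_count} steps but narrations has {len(narrations)}"
        )
    # A blank narration would leave a step with visuals but no audio.
    for index, text in enumerate(narrations):
        if not text.strip():
            problems.append(f"narration {index} is empty")
    return problems
```

Running a check like this before audio generation catches a chapter whose scene code gained or lost a step after the narration was written.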

Themes are more than skins

Built-in themes include paper-press, warm-keynote, midnight-press, blueprint, chalk-garden, terminal-green, bauhaus-bold, sunset-zine, newsroom, and monochrome-print.

These are not just color swaps. They define different visual languages: print, keynote, blueprint, terminal, newsroom, and so on.

During planning, the agent should recommend two or three themes based on the topic and tone. The user can also request a custom theme.

Three development modes

Chapter 1 is always built by the main thread and reviewed first. After that, there are three modes.

Mode A: chapter-by-chapter confirmation. Lowest risk and best quality control.

Mode B: sequential development. The main thread builds remaining chapters and reviews at the end.

Mode C: parallel development. After chapter 1 approval, subagents build later chapters in parallel. It is fastest, but visual differences may appear. Theme tokens provide consistency while each chapter can still have its own expression.

Who should use it

This skill is best for people who already have content: an article, script, product description, tutorial, or technical explanation.

If the user has no topic or material, the agent should ask for source content. This is not an ideation tool; it is a content-to-video production flow.

Summary

web-video-presentation is valuable because it turns content video production into a collaborative, reviewable, reusable workflow.

It connects article, narration, outline, theme, chapter implementation, audio, and recording, while hard checkpoints keep the agent from running ahead without user review.

Even if you do not use its scaffold, ideas like “one step, one idea”, “chapter 1 as style anchor”, “narrations.ts as source of truth”, and “outline does not hard-code animation” are worth borrowing.

miHoYo LPM 1.0 Explained: How an AI Video Model Could Reshape Game NPCs

Fri, 08 May 2026 22:27:10 +0800

LPM 1.0 is easy to mistake for another AI video generation model. Judging only by demos, it may not look as visually explosive as some text-to-video systems. But viewed through the paper’s goal, it is not mainly trying to generate a good-looking clip. It is trying to make a digital character feel present during interaction.

That is the biggest difference between LPM 1.0 and ordinary video models. A typical video model focuses on image quality, camera continuity, and prompt following. LPM 1.0 focuses on character performance: lip sync, rhythm, and expression while speaking; nods, gaze, pauses, and micro-expressions while listening; and stable identity across long interactions.

From generating video to generating performance

LPM stands for Large Performance Model. The name matters because it shifts the task boundary from “video” to “performance”.

In real conversation, whether someone feels natural is not only about what they say. Listening is part of communication: the timing of nods, the direction of gaze, and subtle emotional changes all affect whether we believe a character is alive.

Many digital human systems still attach text, speech, and lip motion to a character. The character can talk, but may not truly listen. It can output lines, but may not react continuously to the previous second of input. LPM 1.0 aims to turn passive playback into real-time interaction.

The three hard problems

The LPM 1.0 paper describes a trilemma in AI character performance: expressiveness, real-time inference, and long-horizon identity stability. A system may look detailed but be slow, respond quickly but feel rigid, or stay stable briefly but drift over time. Achieving all three is much harder.

To address this, LPM 1.0 uses richer character conditioning. Instead of giving the model only one reference image, it introduces multi-granularity identity references, including global appearance, multi-view body images, and facial expression examples. The goal is to reduce hallucination in details the model would otherwise have to guess, such as the side profile, teeth, expression texture, and body proportions.

The paper also separates speaking and listening behavior. Speaking audio mainly drives lip sync, speech rhythm, head motion, and body rhythm. Listening audio triggers gaze, nodding, posture changes, and micro-expressions. If both signals are mixed into one control stream, the model can easily learn the wrong behavior. LPM 1.0 models speaking and listening separately, then connects them in one online interaction system.

Base LPM and Online LPM

According to the public paper, LPM 1.0 is built on a 17B-parameter Diffusion Transformer. Base LPM learns high-quality, controllable, identity-consistent character performance video. Online LPM is a distilled streaming generator designed for low-latency, long-running interaction.

This split is important. Offline models can focus on quality, but interactive systems cannot make users wait. When a user starts speaking, the character should begin listening immediately. When the character starts speaking, lip sync, expression, and body motion must follow at once. Online LPM is valuable because it compresses complex video generation into something closer to real-time interaction.

So LPM 1.0 is not just a short-video asset tool for creators. It is closer to a visual engine for conversational agents, virtual streamers, and game NPCs: the language model understands and generates content, the speech model provides the voice, and LPM makes the on-screen character perform credibly.

What it means for games

In games, LPM 1.0 points less toward prettier cutscenes and more toward the next generation of interactive characters.

Traditional NPCs rely on prewritten scripts, fixed animations, and limited branches. Players can talk to them, but their responses are usually predesigned. In the AI era, the target goes further: different players may experience different story paths in the same world, and the same character may respond with actions, emotions, and dialogue that fit each player’s context.

That is what a truly personalized game experience needs underneath. Language models can generate lines, and behavior systems can choose goals, but if the character on screen still looks stiff, players will struggle to believe it understands them. LPM 1.0 tries to fill that visual and performance layer.

Not a finished magic product

LPM 1.0 should still be understood as a technical direction, not an immediately scalable commercial product. The paper and demos show a possibility: real-time, full-duplex, identity-stable character video generation is getting closer to usable. But before it can enter games broadly, there are still problems around cost, latency, edge deployment, content safety, character rights, multiplayer scenes, and engine integration.

A more realistic path may start with virtual streamers, AI companions, story interaction, character support agents, and educational coaching. As model cost falls and latency improves, the technology can move into more complex game systems.

Summary

The value of LPM 1.0 is not whether it can generate the most spectacular video clip. It is that it pushes AI video from “image generation” toward “character presence”.

If future games become more personalized, more dynamic, and more dependent on AI characters, language, speech, motion, expression, and identity consistency must be designed together. LPM 1.0 offers one possible path: digital characters that do not just talk, but listen, react, and remain recognizably themselves over long interactions.

Pixelle-Video: An Open-Source AI Engine for Generating Short Videos From One Topic

Thu, 07 May 2026 20:25:17 +0800

Pixelle-Video is an open-source fully automated short-video generation engine from AIDC-AI. Its goal is direct: the user enters a topic, and the system automatically writes the script, generates AI images or videos, creates voice narration, adds background music, and renders the final video.

This kind of tool is useful for batch short-video creation, knowledge explainers, talking-head content, novel recaps, history and culture videos, and self-media experiments. It is not a single text-to-video model. It is a production pipeline that connects several AI capabilities.

What It Automates

Pixelle-Video’s default flow can be summarized as:

enter a topic or fixed script;
use an LLM to generate narration;
plan scenes and generate images or video clips;
use TTS to create voice narration;
add background music;
apply a video template and render the final result.

The README describes the flow as “script generation → image planning → frame-by-frame processing → video composition.” The modular design is clear: each step can be replaced, tuned, or connected to a custom workflow.
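
The idea of replaceable steps can be sketched in a few lines. This is an illustrative toy, not Pixelle-Video's actual code or API: every function name, the step keys, and the dictionary passed between steps are assumptions, and each step returns placeholder strings instead of calling an LLM, image model, or renderer.

```python
def write_script(ctx):
    # Placeholder for the LLM call that writes narration from the topic.
    return {**ctx, "script": f"Narration about {ctx['topic']}."}


def plan_images(ctx):
    # One scene per sentence of narration.
    scenes = [s.strip() for s in ctx["script"].split(".") if s.strip()]
    return {**ctx, "scenes": scenes}


def process_frames(ctx):
    # Placeholder for per-scene image or clip generation.
    return {**ctx, "frames": [f"frame:{scene}" for scene in ctx["scenes"]]}


def compose_video(ctx):
    # Placeholder for template rendering and final composition.
    return {**ctx, "video": f"video({len(ctx['frames'])} frames)"}


DEFAULT_STEPS = {
    "script": write_script,
    "image_plan": plan_images,
    "frames": process_frames,
    "compose": compose_video,
}


def run_pipeline(topic, overrides=None):
    """Run the steps in order, swapping in any caller-supplied replacements."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_STEPS)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    steps = {**DEFAULT_STEPS, **overrides}
    ctx = {"topic": topic}
    for name in DEFAULT_STEPS:  # dicts keep insertion order
        ctx = steps[name](ctx)
    return ctx
```

Supplying a fixed script then becomes a one-step override, for example run_pipeline("tea", {"script": my_fixed_script}), while the remaining steps run unchanged.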

Key Features

The project covers a fairly complete set of capabilities:

AI script writing: automatically generate narration from a topic;
AI image generation: create illustrations for each line or scene;
AI video generation: connect to video generation models such as WAN 2.1;
TTS voice: support Edge-TTS, Index-TTS, and other options;
background music: use built-in BGM or custom music;
multiple aspect ratios: support vertical, horizontal, and other video sizes;
multiple models: connect to GPT, Qwen, DeepSeek, Ollama, and more;
ComfyUI workflows: use built-in workflows or replace image, TTS, and video generation steps.

Recent updates also mention motion transfer, digital-human talking videos, image-to-video pipelines, multilingual TTS voices, RunningHub support, and a Windows all-in-one package. The project is clearly moving beyond a simple script toward a fuller creation tool.

Installation and Launch

Windows users can first look at the official all-in-one package. It is designed to reduce setup friction: no manual Python, uv, or ffmpeg installation is required. After extracting the package, run start.bat, open the web interface, and configure the required APIs and image generation service.

For source installation, the README gives this basic flow:

git clone https://github.com/AIDC-AI/Pixelle-Video.git
cd Pixelle-Video
uv run streamlit run web/app.py

The source route is suitable for macOS and Linux users, and for anyone who wants to modify templates, workflows, or service configuration. The main prerequisites are uv and ffmpeg.

Configuration Priorities

On first use, the key is not to click “generate” immediately. The important part is connecting the external capabilities properly.

LLM configuration determines script quality. You can choose models such as Qwen, GPT, DeepSeek, or Ollama, then fill in the API Key, Base URL, and model name. If you want to minimize cost, local Ollama is one option. If you want more stable results, a cloud model is usually easier.

Image and video generation configuration determines visual quality. The project supports local ComfyUI and RunningHub. Users who understand ComfyUI can place their own workflows under workflows/ to replace the default image, video, or TTS pipeline.

Template configuration determines the final visual form. The project organizes video templates under templates/, with naming rules for static templates, image templates, and video templates. For creators, this is more practical than generating raw assets only, because the output is a video that can be previewed and downloaded directly.

Who It Is For

Pixelle-Video is especially suitable for three groups:

Short-video creators who want to turn ideas into draft videos quickly.
AIGC tool users who want to connect LLMs, ComfyUI, TTS, and video composition.
Developers and automation users who want to modify templates, workflows, or integrate their own materials and models.

If you only want to make one polished premium video, it may not replace manual editing. But if you want to generate many explainers, talking videos, or science and education videos with a consistent structure, its pipeline approach is valuable.

Things to Note

The output quality of this kind of tool is capped by the weakest link in the chain. A weak script model produces empty content; a weak image model gives scattered visuals; unnatural TTS makes the video feel rough; and a poor template weakens the final result.

So it is better to start with one fixed scenario, such as a “60-second vertical science explainer.” Fix the LLM, visual style, TTS voice, BGM, and template first, then expand to more topics.

The project supports a local free setup, but local setups often require a GPU, ComfyUI configuration, and model files. Users without a local inference environment can reduce setup difficulty by using a cloud LLM plus RunningHub, while keeping an eye on usage cost.

Short Take

Pixelle-Video is interesting not merely because it can “generate a video from one sentence.” Its real value is that it breaks short-video production into replaceable modules: script, visuals, voice, music, templates, and rendering. For ordinary users, it is a low-barrier AI video tool. For developers, it is closer to a hackable short-video automation framework.

If you are studying AI short-video pipelines, or want to connect ComfyUI, TTS, LLMs, and template rendering into a usable product, Pixelle-Video is worth trying and dissecting.