What Is OpenTalking? An Open-Source Framework for Getting AI Digital Human Conversations Running

Thu, 11 Jun 2026 08:22:48 +0800

OpenTalking is an open-source real-time digital human conversation orchestration framework from datascale-ai. It is not trying to solve only the narrow problem of “making a face move its mouth.” Instead, it connects the common pieces of a digital human conversation product: front-end interaction, session state, LLM responses, TTS and voice selection, STT, subtitle events, interruption control, WebRTC audio/video playback, and local or remote digital human synthesis backends.

So when you look at OpenTalking, it is better not to treat it as a startup script for one digital human model. It is closer to an engineering skeleton for a digital human production line: models can be swapped, speech services can be swapped, inference can run locally or remotely, and the front end brings characters, voices, model connection status, and real-time conversation into one place.

What It Is Good For

OpenTalking fits three kinds of needs.

The first is quickly validating a digital human conversation product. The project provides a mock mode, so you do not need to download model weights or deploy a video inference backend first. You can still run through the API, LLM, TTS, STT, WebRTC, and browser playback flow. The digital human image uses a static mock frame, but dialogue, subtitles, streaming TTS, and transport can already be tested.

The second is single-machine real-time rendering on consumer GPUs. The project can connect local backends such as quicktalk, wav2lip, and musetalk, which suits 3090 / 4090-class machines for real video rendering, lip sync, and custom avatar validation.

The third is high-quality or private deployment. When you care about visual quality, multi-GPU setups, remote GPU/NPU machines, or production isolation, you can connect flashtalk, flashhead, and other higher-quality models through OmniRT, separating the orchestration layer from the inference layer.

Why the WebUI Matters

OpenTalking provides a Web service interface for managing the digital human conversation flow. In the UI, you can choose or create digital characters, configure voices, LLM, TTS, STT, and the digital human driver model, check model connection status, and validate real-time conversation, subtitles, and audio/video playback on the same page.

This matters a lot in engineering. Many digital human demos only answer the question “can the model run?” But once you try to turn the demo into a product, you immediately run into other questions:

How should character assets be managed?
How do you switch voices and TTS providers?
How should LLM, STT, and TTS keys and base URLs be configured?
Is the model backend online?
How do you observe first-frame latency, interruption, subtitles, and audio-video sync?
How can regular users test in a browser instead of making engineers read logs?

OpenTalking puts these entry points together and reduces the friction between a model demo and a product prototype.

Quick Start Path

For a first try, start with Mock mode and get the full chain running.

export DIGITAL_HUMAN_HOME=/opt/digital_human
mkdir -p "$DIGITAL_HUMAN_HOME"

cd "$DIGITAL_HUMAN_HOME"
git clone https://github.com/datascale-ai/opentalking.git && cd opentalking

export UV_DEFAULT_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple
uv sync --extra dev --python 3.11
source .venv/bin/activate
cp .env.example .env

The environment requirements include Python 3.10+ (3.11 recommended), Node.js 18+, and FFmpeg. In .env, configure at least the LLM / TTS related settings. If you use edge TTS, no key is required.
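A quick way to check these requirements before running the start script is a small preflight helper. The sketch below is illustrative and not part of the OpenTalking repository: the function names `version_ge` and `preflight` are made up for this example, and it assumes the tools are available on your PATH as `python3`, `node`, and `ffmpeg`.

```shell
# Hypothetical preflight helper; not shipped with OpenTalking.

# Succeeds when dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    for (i = 1; i <= (n > m ? n : m); i++) {
      if (x[i] + 0 > y[i] + 0) exit 0
      if (x[i] + 0 < y[i] + 0) exit 1
    }
    exit 0
  }'
}

# Prints one line per missing or outdated tool; prints nothing when all checks pass.
preflight() {
  if command -v python3 >/dev/null 2>&1; then
    py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    version_ge "$py_ver" "3.10" || echo "python3 $py_ver is older than 3.10"
  else
    echo "python3 not found"
  fi
  if command -v node >/dev/null 2>&1; then
    node_ver="$(node --version)"   # node prints versions with a leading "v"
    version_ge "${node_ver#v}" "18" || echo "node ${node_ver#v} is older than 18"
  else
    echo "node not found"
  fi
  command -v ffmpeg >/dev/null 2>&1 || echo "ffmpeg not found"
}
```

Source the file and run `preflight`; an empty result means the three requirements listed above look satisfied on this machine.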

Start Mock mode:

cd "$DIGITAL_HUMAN_HOME/opentalking"
bash scripts/start_unified.sh --mock

The default front-end address is:

http://localhost:5173

To change ports, specify them explicitly:

bash scripts/start_unified.sh --mock --api-port 8210 --web-port 5280

The goal of this step is not visual quality. It is to confirm that the browser, API, LLM, TTS, STT, subtitle events, and WebRTC transport can all connect. After the chain works, decide whether to download model weights and deploy an inference backend.

Common Startup Options

The project recommends scripts/start_unified.sh as the unified entry point. The common options, grouped by purpose:

--mock: use the built-in Mock mode, without model weights or a video inference backend;
--backend <mock|local|omnirt|direct_ws>: choose the inference backend;
--model <name>: choose a model, such as quicktalk;
--omnirt <url>: connect to an OmniRT inference service;
--api-port <port>: set the OpenTalking backend port;
--web-port <port>: set the WebUI port;
--host <host>: set the WebUI listen address;
--env <file>: specify the env file path.

For example, the local QuickTalk route:

bash scripts/start_unified.sh --backend local --model quicktalk

The remote OmniRT route:

bash scripts/start_unified.sh \
  --backend omnirt \
  --model flashtalk \
  --api-port 8210 \
  --web-port 5280 \
  --omnirt http://<gpu-server>:9000
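
If you switch between routes often, a thin wrapper can catch a mistyped backend name before anything starts. The sketch below is a hypothetical helper, not something the project ships. It only uses the `--backend`, `--model`, and `--omnirt` options listed above, and it assumes (my assumption, not a documented rule) that the omnirt backend should always be given a URL.

```shell
# Hypothetical wrapper: prints a start command after basic validation.
# Arguments: <backend> [model] [omnirt-url]
build_start_cmd() {
  backend="$1"
  model="${2:-}"
  omnirt_url="${3:-}"
  # Accept only the backend values listed for --backend.
  case "$backend" in
    mock|local|omnirt|direct_ws) ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
  cmd="bash scripts/start_unified.sh --backend $backend"
  if [ -n "$model" ]; then
    cmd="$cmd --model $model"
  fi
  if [ "$backend" = "omnirt" ]; then
    # Assumption: treat a missing OmniRT URL as an error.
    if [ -z "$omnirt_url" ]; then
      echo "omnirt backend needs an --omnirt URL" >&2
      return 1
    fi
    cmd="$cmd --omnirt $omnirt_url"
  fi
  printf '%s\n' "$cmd"
}
```

The function prints the command rather than running it, so you can review the line first, for example `build_start_cmd local quicktalk`.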

How to Choose a Deployment Route

The OpenTalking README splits deployment routes fairly clearly. A more practical way to think about it is: first ask whether you need real video rendering, then ask whether inference should run on the same machine as the Web service.

If you only need to validate the chain, use mock. It does not need a GPU or model weights, and it is the right first-day path to get the system running.

If you have a consumer GPU and want real-time digital human rendering on a single machine, start with quicktalk. The project references 3090 / 4090-class machines, which are suitable for validating custom avatars and real-time video output.

If you only need lighter lip sync and custom avatar validation, look at wav2lip. It is less demanding to deploy and works well as a lightweight route.

If you need a fully local private audio chain, combine sensevoice, local_cosyvoice, and quicktalk, moving STT and TTS to local models as well. This route is heavier, but it fits scenarios where you do not want to depend on cloud speech services.

If you need higher visual quality, multiple GPUs, or production isolation, put inference on a remote machine and connect flashtalk or flashhead through OmniRT. In this mode, OpenTalking acts more like the orchestration layer, responsible for sessions, the front end, service configuration, and inference endpoint calls.
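
The two questions at the start of this section (do you need real video rendering, and should inference share a machine with the Web service) can be written down as a tiny decision helper. This is only a sketch of the reasoning in this article: the function name and the yes/no arguments are invented for illustration, and the output is one of the `--backend` values mentioned earlier.

```shell
# Hypothetical decision helper mirroring the two questions above.
# Arguments: <need_real_video: yes|no> <same_machine: yes|no>
suggest_backend() {
  need_video="$1"
  same_machine="${2:-yes}"
  if [ "$need_video" != "yes" ]; then
    echo "mock"       # chain validation only: no GPU, no weights
  elif [ "$same_machine" = "yes" ]; then
    echo "local"      # e.g. quicktalk or wav2lip on a consumer GPU
  else
    echo "omnirt"     # e.g. flashtalk or flashhead on a remote machine
  fi
}

# suggest_backend no       -> mock
# suggest_backend yes yes  -> local
# suggest_backend yes no   -> omnirt
```

Model choice within a backend (quicktalk versus wav2lip, for instance) still depends on the trade-offs described above.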

Model Support and Resource Expectations

The current model routes can be summarized like this:

mock: static frame placeholder, no GPU required;
quicktalk: template video + audio, local CUDA GPU, 3090 / 4090 recommended;
wav2lip: reference image or frames + audio, suitable for local or omnirt;
musetalk: full frames + audio, higher VRAM demand;
soulx-flashtalk-14b: portrait + audio, suitable for OmniRT deployment on multi-GPU / NPU machines;
soulx-flashhead-1.3b: portrait + audio, also aimed at higher-quality remote inference.

The README also gives a consumer GPU reference: quicktalk on an RTX 3090 with template video + audio outputs 720x900 / 25fps, uses about 3.8 GiB of VRAM, and generates at about 35 fps. Treat this as a rough deployment expectation. Actual experience still depends on first-frame building, cache reuse, resolution, audio models, and the machine environment.
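
To make the headroom in those reference numbers concrete, the arithmetic can be scripted. The sketch below uses only the 35 fps generation and 25 fps output figures quoted above; the function names are hypothetical, and it says nothing about first-frame latency or audio processing time.

```shell
# Ratio of generation speed to playback speed.
# A value above 1 means frames are produced faster than they are shown.
realtime_factor() {
  awk -v gen="$1" -v out="$2" 'BEGIN { printf "%.2f\n", gen / out }'
}

# Milliseconds per frame at a given frame rate.
frame_time_ms() {
  awk -v fps="$1" 'BEGIN { printf "%.1f\n", 1000 / fps }'
}

# realtime_factor 35 25  -> 1.40
# frame_time_ms 25       -> 40.0  (playback budget per frame)
# frame_time_ms 35       -> 28.6  (approximate generation time per frame)
```

On those figures, each frame takes roughly 28.6 ms to generate against a 40 ms playback budget, which leaves some margin on the reference machine.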

Configuration Notes

OpenTalking has many configuration items. In particular, LLM, STT, and TTS no longer share a single fallback key. Even if you use the same DashScope key, write it into the corresponding environment variables separately:

OPENTALKING_LLM_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
OPENTALKING_LLM_API_KEY=sk-your-key
OPENTALKING_LLM_MODEL=qwen-flash

OPENTALKING_STT_DEFAULT_PROVIDER=dashscope
OPENTALKING_STT_DASHSCOPE_MODEL=paraformer-realtime-v2
OPENTALKING_STT_DASHSCOPE_API_KEY=sk-your-key

OPENTALKING_TTS_DASHSCOPE_API_KEY=sk-your-key
OPENTALKING_TTS_DEFAULT_PROVIDER=edge
OPENTALKING_TTS_EDGE_VOICE=zh-CN-XiaoxiaoNeural

This configuration style looks a bit verbose, but the benefit is clear boundaries: LLM, speech recognition, speech synthesis, and voice cloning can each switch providers independently, without tying every capability to one service.
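
Because the three services do not share a fallback key, a missing variable is easy to overlook. The following sketch checks an env file for empty or absent entries. It is a hypothetical helper, not part of the project, and it assumes plain `KEY=value` lines without quoting or `export` prefixes.

```shell
# Hypothetical check: prints each listed variable that is absent or empty in the env file.
# Arguments: <env-file> <VAR_NAME>...
missing_env_keys() {
  env_file="$1"
  shift
  for key in "$@"; do
    # A key counts as set only if at least one character follows "=".
    grep -Eq "^${key}=.+" "$env_file" || echo "$key"
  done
}
```

For example, `missing_env_keys .env OPENTALKING_LLM_API_KEY OPENTALKING_STT_DASHSCOPE_API_KEY OPENTALKING_TTS_DASHSCOPE_API_KEY` prints nothing when all three keys are filled in.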

Engineering Structure

OpenTalking’s code structure reflects its positioning. The core orchestration layer lives in opentalking/, including protocol definitions, providers, model adapters, avatar, voice, media, pipeline, and runtime. apps/ contains the FastAPI service, unified startup mode, React front end, and CLI. configs/ stores YAML configuration. docker/ and docker-compose.yml handle containerized deployment. scripts/ provides unified startup and quickstart tools. docs/ adds model, deployment, and configuration documentation.

This structure shows that the project is not a single-model repository. It splits the digital human product chain into parts with clear boundaries: front end, backend, model inference, speech, assets, and runtime.

Who Should Pay Attention

OpenTalking is worth watching if you:

Want to build a real-time digital human conversation prototype;
Need to connect LLM, TTS, STT, WebRTC, and a digital human model into a full chain;
Want to validate the system with Mock first, then gradually replace it with real models;
Have a consumer GPU and want to run QuickTalk / Wav2Lip / MuseTalk locally;
Need private deployment or remote multi-GPU inference, separating inference from Web orchestration;
Want to use a WebUI to manage digital characters, voices, models, and conversation testing.

It is not ideal for users who only want “one-click generation of a digital human video.” OpenTalking is more of an engineering framework. To use it well, you need to understand model weights, audio services, inference backends, ports, environment variables, and browser real-time transport.

Conclusion

OpenTalking’s value is that it breaks real-time digital human conversation into an engineering chain that can be replaced and deployed step by step. You can start with mock and validate only the API, LLM, TTS, STT, and WebRTC. You can then switch to local quicktalk for real video rendering. For higher-quality or production scenarios, you can move inference to remote GPU / NPU machines through OmniRT.

If you are building digital human applications, live interaction, virtual anchors, companion products, or private enterprise digital human validation, OpenTalking is worth studying. Its barrier to entry is not low, but it handles the engineering layer that most easily falls apart between a demo and a deployable digital human system.

References: datascale-ai/opentalking GitHub repository, OpenTalking documentation site
