Qwen3-Omni-Flash-Realtime-2025-12-01

Qwen3-Omni-Flash-Realtime

Overview

This is the real-time version of the Qwen3-Omni-Flash multimodal large model, built on the Thinker-Talker Mixture-of-Experts (MoE) architecture. It efficiently understands text, image, audio, and video input and generates text and speech output, supporting text interaction in 119 languages and voice interaction in 20 languages. It offers 49 voices and generates human-like speech for accurate cross-language communication. The model has strong instruction-following and system prompt customization capabilities, so it adapts flexibly to different dialogue styles and role settings. It is widely used for text creation, voice assistants, multimedia analysis, and similar scenarios, where it provides natural, smooth multimodal interaction. This version is a snapshot from December 1, 2025.

Input

Text, Image, Audio, Video

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Context

65.53K

Max Input

49.15K

Max Output

16.38K
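
The three figures above are consistent with each other: 49.15K of input plus 16.38K of output adds up to the 65.53K context. A minimal sketch of a client-side pre-flight check follows. It assumes the rounded figures stand for 65,536, 49,152, and 16,384 tokens (the page shows only the rounded values), and the function name fits_budget is hypothetical, not part of any SDK.

```python
# Assumed exact limits; the page lists only the rounded values
# 65.53K (context), 49.15K (max input), and 16.38K (max output).
CONTEXT_WINDOW = 65_536
MAX_INPUT_TOKENS = 49_152
MAX_OUTPUT_TOKENS = 16_384


def fits_budget(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within all three listed limits."""
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens <= MAX_INPUT_TOKENS
        and requested_output_tokens <= MAX_OUTPUT_TOKENS
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )
```

Token counts have to come from whatever tokenizer or usage report the caller has available; this check only does the arithmetic.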

Rate Limits

RPM (Requests Per Minute): 60
TPM (Tokens Per Minute): 100K
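
To stay under both limits, a client can throttle itself before sending. The sketch below is one possible approach, a sliding 60-second window that tracks request count and token count together. It assumes "100K" means 100,000 tokens and that limits are evaluated over a rolling minute; the page does not say how the service measures the window or what counts as a request in a realtime session. The class name SlidingWindowLimiter is hypothetical.

```python
import time
from collections import deque

RPM_LIMIT = 60        # requests per minute, as listed
TPM_LIMIT = 100_000   # tokens per minute; assumed reading of "100K"


class SlidingWindowLimiter:
    """Client-side guard for RPM and TPM over a rolling window (assumed semantics)."""

    def __init__(self, rpm=RPM_LIMIT, tpm=TPM_LIMIT,
                 window_seconds=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window_seconds
        self.clock = clock              # injectable so the limiter can be tested
        self._events = deque()          # (timestamp, tokens) per admitted request
        self._tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both limits, else False."""
        now = self.clock()
        self._evict(now)
        if len(self._events) + 1 > self.rpm:
            return False
        if self._tokens_in_window + tokens > self.tpm:
            return False
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True
```

A caller would check limiter.try_acquire(estimated_tokens) before each request and wait or queue when it returns False. The server remains the authority, so rate-limit errors still need handling.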