Qwen3-Omni-Flash-Realtime
Overview
This is the real-time version of the Qwen3-Omni-Flash multimodal large model, built on the Thinker-Talker Mixture-of-Experts (MoE) architecture. It efficiently understands text, images, audio, and video and generates speech, enabling text interaction in 119 languages and voice interaction in 20 languages. It offers 49 voice timbres and generates human-like speech for accurate cross-language communication. The model provides strong instruction following and system prompt customization, so it can adapt to different dialogue styles and role settings. It suits text creation, voice assistants, multimedia analysis, and similar scenarios, providing natural, smooth multimodal interaction. This version is a snapshot from December 1, 2025.
Input
Output
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Context
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
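A client can stay under both limits by tracking its own recent requests. The following is a minimal sketch of that idea, not the service's own accounting: it assumes a rolling 60-second window (the page does not say whether the limits use a fixed or sliding window) and reads "100K" as 100,000 tokens. The class name `SlidingWindowLimiter` and its methods are hypothetical and exist only for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget.

    Assumption: both limits are counted over a rolling window of `window` seconds.
    """

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def wait_time(self, tokens):
        """Return seconds to wait until a request of `tokens` tokens fits both budgets."""
        if tokens > self.tpm:
            raise ValueError("request exceeds the per-minute token budget")
        now = self.clock()
        self._evict(now)
        count = len(self.events)
        used = sum(t for _, t in self.events)
        wait = 0.0
        # Walk forward through future expirations until both budgets have room.
        for ts, t in self.events:
            if count < self.rpm and used + tokens <= self.tpm:
                break
            wait = ts + self.window - now
            count -= 1
            used -= t
        return max(wait, 0.0)

    def try_acquire(self, tokens):
        """Record the request and return True if it fits now, else return False."""
        if self.wait_time(tokens) > 0:
            return False
        self.events.append((self.clock(), tokens))
        return True
```

A caller would estimate the token count of a request, call `wait_time(tokens)`, sleep for the returned duration, and then call `try_acquire(tokens)` before sending. Server-side throttling responses should still be handled, since the client's estimate of token usage may differ from what the service counts.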