Qwen3-Omni-Flash-Realtime

Overview

This is the real-time version of the Qwen3-Omni-Flash multimodal large model, built on the Thinker-Talker Mixture-of-Experts (MoE) architecture. It understands text, images, audio, and video, and generates both text and speech. Text interaction is supported in 119 languages and voice interaction in 20 languages, with 49 voice timbres available for human-like speech and accurate cross-language communication. The model offers strong instruction following and system prompt customization, so it can adapt to different dialogue styles and role settings. Typical uses include text creation, voice assistants, and multimedia analysis, where it provides a natural, smooth multimodal interaction experience. This version is a snapshot from December 1, 2025.

Input

Text, Image, Audio, Video

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Context

Context: 65.53K tokens
Max Input: 49.15K tokens
Max Output: 16.38K tokens
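
The three limits above fit together arithmetically: the maximum input plus the maximum output equals the context size. The sketch below shows one way a client might pre-check a request against them. It assumes the K figures are rounded from 65,536, 49,152, and 16,384 tokens (the page only gives the rounded values), and `fits_budget` is a hypothetical helper name, not part of any SDK.

```python
# Assumed exact values behind the rounded figures on this page.
CONTEXT_WINDOW = 65_536   # listed as 65.53K
MAX_INPUT = 49_152        # listed as 49.15K
MAX_OUTPUT = 16_384       # listed as 16.38K


def fits_budget(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within all three listed limits."""
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens <= MAX_INPUT
        and requested_output_tokens <= MAX_OUTPUT
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )
```

Token counts depend on the model's tokenizer, so any count computed on the client side should be treated as an estimate.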

Rate Limits

  • RPM (Requests Per Minute): 60
  • TPM (Tokens Per Minute): 100K
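
A client can stay under these limits by throttling itself before sending. The following is a minimal sketch of a sliding-window limiter using the 60 RPM and 100K TPM figures listed here. How the service actually measures the window, and which tokens it counts, is not stated on this page, so the 60-second sliding window and the class name `SlidingWindowLimiter` are assumptions for illustration only.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for request and token rates (illustrative only)."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window          # assumed window length in seconds
        self.clock = clock            # injectable for testing
        self._events = deque()        # (timestamp, tokens) per accepted request
        self._tokens = 0              # tokens used inside the current window

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            _, tokens = self._events.popleft()
            self._tokens -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits, else return False."""
        now = self.clock()
        self._evict(now)
        if len(self._events) >= self.rpm or self._tokens + tokens > self.tpm:
            return False
        self._events.append((now, tokens))
        self._tokens += tokens
        return True
```

When `try_acquire` returns False, the caller can wait briefly and retry. A server-side rejection should still be handled, since the client's view of usage may differ from the service's.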