Qwen3-Omni-Flash-Realtime
Real-time Omni-modality
Overview
Qwen3-Omni-Flash-Realtime is the real-time version of the Qwen3-Omni-Flash multimodal large model. Built on the Thinker–Talker Mixture-of-Experts (MoE) architecture, it understands text, images, audio, and video and generates both text and speech. It supports text interaction in 119 languages and speech interaction in 20 languages, and produces human-like speech for precise cross-lingual communication. The model has strong instruction-following and system prompt customization capabilities, so it adapts flexibly to different conversational styles and character settings. It is widely used for text creation, voice assistants, and multimedia analysis, where it provides natural, smooth multimodal interaction.
Input
Text, Image, Audio, Video
Output
Text, Audio
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input (Text): $0.52 per 1M tokens
- Input (Audio): $4.57 per 1M tokens
- Input (Vision): $0.94 per 1M tokens
- Output (Text, when input contains only text): $1.99 per 1M tokens
- Output (Text, when input contains images/audio/video): $3.67 per 1M tokens
- Output (Text & Audio; output text is not charged): $18.13 per 1M tokens
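The price list above can be turned into a per-request cost estimate. The following is a minimal sketch, not an official billing formula: the function name `estimate_cost` and its parameters are hypothetical, and it assumes that the Text & Audio rate applies to audio output tokens only, with any accompanying output text free, as the last pricing line suggests.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the pricing list.
PRICE_INPUT_TEXT = Decimal("0.52")
PRICE_INPUT_AUDIO = Decimal("4.57")
PRICE_INPUT_VISION = Decimal("0.94")
PRICE_OUTPUT_TEXT_TEXT_ONLY_INPUT = Decimal("1.99")
PRICE_OUTPUT_TEXT_MULTIMODAL_INPUT = Decimal("3.67")
PRICE_OUTPUT_AUDIO = Decimal("18.13")


def estimate_cost(input_text=0, input_audio=0, input_vision=0,
                  output_text=0, output_audio=0):
    """Estimate the USD cost of one request from its token counts (hypothetical helper)."""
    cost = (input_text * PRICE_INPUT_TEXT
            + input_audio * PRICE_INPUT_AUDIO
            + input_vision * PRICE_INPUT_VISION)

    if output_audio > 0:
        # Assumption: when audio is generated, only audio tokens are billed
        # and the accompanying output text is free.
        cost += output_audio * PRICE_OUTPUT_AUDIO
    elif input_audio > 0 or input_vision > 0:
        # Text-only output, but the input contained images/audio/video.
        cost += output_text * PRICE_OUTPUT_TEXT_MULTIMODAL_INPUT
    else:
        # Text-only output for text-only input.
        cost += output_text * PRICE_OUTPUT_TEXT_TEXT_ONLY_INPUT

    return cost / 1_000_000


# Example: 2,000 text input tokens and 500 text output tokens.
print(estimate_cost(input_text=2_000, output_text=500))
```

`Decimal` is used instead of `float` so that small per-token amounts add up without binary rounding drift.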
Context
- Context: 64K
- Max Input: 56K
- Max Output: 8K
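A request can be checked against these limits before it is sent. The sketch below is illustrative only: `check_budget` is a hypothetical helper, the limits are assumed to be token counts, and "K" is assumed to mean 1,000 (the page does not say whether it means 1,000 or 1,024; 1,000 is the more conservative reading).

```python
TOKENS_PER_K = 1000  # assumption: "K" = 1,000 tokens (conservative reading)

MAX_CONTEXT = 64 * TOKENS_PER_K
MAX_INPUT = 56 * TOKENS_PER_K
MAX_OUTPUT = 8 * TOKENS_PER_K


def check_budget(input_tokens, max_output_tokens):
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if input_tokens > MAX_INPUT:
        problems.append("input exceeds Max Input")
    if max_output_tokens > MAX_OUTPUT:
        problems.append("requested output exceeds Max Output")
    if input_tokens + max_output_tokens > MAX_CONTEXT:
        problems.append("input plus output exceeds Context")
    return problems


print(check_budget(50_000, 4_000))   # fits: no violations
print(check_budget(60_000, 4_000))   # input is over the 56K limit
```

Because Max Input and Max Output sum to exactly the Context size, a request that passes the first two checks also passes the third; the third check is kept so the helper stays correct if the limits change.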
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
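A client can stay under both limits by tracking its own usage over a sliding 60-second window. This is a minimal client-side sketch, not part of any official SDK: the class `SlidingWindowLimiter` and its methods are hypothetical, it reads 100K as 100,000 tokens, and it assumes the service counts requests and tokens over a rolling minute, which the page does not specify.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for per-minute request and token budgets (hypothetical)."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested without sleeping
        self.events = deque()   # (timestamp, tokens) of requests still inside the window

    def _prune(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets, else return False."""
        now = self.clock()
        self._prune(now)
        tokens_used = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm or tokens_used + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True


limiter = SlidingWindowLimiter()
if limiter.try_acquire(tokens=1_500):
    print("ok to send")
else:
    print("wait before sending")
```

When `try_acquire` returns `False`, the caller can sleep briefly and retry; the server remains the authority on actual limits, so rate-limit error responses still need handling.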
API Reference
Get API Key