Qwen3-Omni-Flash
Multimodal
Overview
Qwen3-Omni-Flash is a large multimodal model built on the Thinker–Talker Mixture-of-Experts (MoE) architecture. It understands text, images, audio, and video, and generates both text and speech. It supports text interaction in 119 languages and speech in 20 languages, and produces human-like speech for cross-lingual communication. The model has strong instruction-following and system-prompt customization capabilities, so it can adapt to different conversational styles and character settings. Typical uses include text creation, voice assistants, and multimedia analysis. This version is a snapshot from September 15, 2025.
Input
Text, Image, Audio, Video
Output
Text, Audio
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input: Text: $0.43 per 1M tokens
- Input: Audio: $3.81 per 1M tokens
- Input: Vision: $0.78 per 1M tokens
- Output: Text (when input contains only text): $1.66 per 1M tokens
- Output: Text (when input contains images/audio/video): $3.06 per 1M tokens
- Output: Text & Audio (output text is not charged): $15.11 per 1M tokens
- Input: Text (Thinking): $0.43 per 1M tokens
- Input: Audio (Thinking): $3.81 per 1M tokens
- Input: Vision (Thinking): $0.78 per 1M tokens
- Output: Text (in thinking mode, when input contains only text): $1.66 per 1M tokens
- Output: Text (in thinking mode, when input contains images/audio/video): $3.06 per 1M tokens
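To make the pricing list above concrete, here is a minimal cost-estimation sketch in Python. The rates are copied from the list; the function name `estimate_cost`, its parameters, and the dictionary keys are hypothetical and not part of any official SDK. It assumes you already know the token count for each modality, and it reads the "Text & Audio" row as: when audio is generated, only audio output tokens are billed and the accompanying text is free. The listed thinking-mode rates match the non-thinking ones, so the same arithmetic applies to those rows.

```python
# Rates in USD per 1M tokens, copied from the pricing list above.
PRICE_PER_MILLION = {
    "input_text": 0.43,
    "input_audio": 3.81,
    "input_vision": 0.78,
    "output_text_text_only": 1.66,   # input contains only text
    "output_text_multimodal": 3.06,  # input contains images/audio/video
    "output_audio": 15.11,           # text + audio output; text is not charged
}


def estimate_cost(input_text=0, input_audio=0, input_vision=0,
                  output_text=0, output_audio=0):
    """Estimate the USD cost of one request from per-modality token counts.

    Assumption: if any audio output tokens are present, the output text
    in the same response is free, per the "Text & Audio" pricing row.
    """
    total = (
        input_text * PRICE_PER_MILLION["input_text"]
        + input_audio * PRICE_PER_MILLION["input_audio"]
        + input_vision * PRICE_PER_MILLION["input_vision"]
    )
    if output_audio > 0:
        # Audio output is billed; the accompanying text is not.
        total += output_audio * PRICE_PER_MILLION["output_audio"]
    else:
        # The text output rate depends on whether the input was multimodal.
        multimodal_input = input_audio > 0 or input_vision > 0
        key = "output_text_multimodal" if multimodal_input else "output_text_text_only"
        total += output_text * PRICE_PER_MILLION[key]
    return total / 1_000_000


if __name__ == "__main__":
    # Text-only request: 2,000 input tokens and 500 output tokens.
    print(f"${estimate_cost(input_text=2_000, output_text=500):.6f}")
```

Actual billing may count or round tokens differently, so treat the result as an estimate only.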
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
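A client can avoid hitting these limits by throttling itself before sending requests. The following is a minimal sketch of a sliding-window limiter using only the Python standard library. The class name `SlidingWindowLimiter` and its methods are hypothetical, "100K" is assumed to mean 100,000 tokens, and the page does not say how the service measures its window, so a 60-second sliding window is an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window      # window length in seconds (assumed 60)
        self.clock = clock        # injectable so the limiter can be tested
        self.events = deque()     # (timestamp, tokens) per accepted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Return True and record the request if it fits both budgets."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    # Wait until the request fits, then send it (the send itself is omitted).
    while not limiter.try_acquire(tokens=1_500):
        time.sleep(0.5)
    print("request may be sent")
```

The token count passed to `try_acquire` is an estimate made before the call. A production client would still need to handle rate-limit errors returned by the server.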
API Reference
Get API Key