Qwen3.5-Omni-Flash
Multimodal
Overview
Qwen3.5-Omni is the latest generation of Qwen's large multimodal model, supporting understanding of and interaction with text, image, audio, and audio-visual input. As a comprehensive evolution of Qwen3-Omni, it handles more than 10 hours of audio and more than 400 seconds of 720P video (sampled at 1 FPS) for audio-visual understanding and dialogue. Language coverage is also broader: audio input in 60+ languages and speech output in 30+ languages. The model supports structured audio-visual understanding and is used for text creation, voice assistants, multimedia analysis, and similar scenarios where natural, fluent multimodal interaction matters. This version is a snapshot from March 15, 2026.
Input
Text, Image, Video, Audio
Output
Text, Audio
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input (audio): $3 per 1M tokens
- Output (text and audio; output text is not charged): $11.9 per 1M tokens
- Input (text/image/video): $0.4 per 1M tokens
- Output (text): $2.2 per 1M tokens
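The rates above can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the `Usage` class and `estimate_cost_usd` function are hypothetical names, and it assumes that in text-and-audio output mode only the audio tokens are billed (at $11.9 per 1M), while text-only output is billed at $2.2 per 1M. Check an actual invoice or the API's usage fields before relying on it.

```python
from dataclasses import dataclass

# USD per 1M tokens, copied from the pricing list above.
PRICE_INPUT_AUDIO = 3.0
PRICE_INPUT_TEXT_IMAGE_VIDEO = 0.4
PRICE_OUTPUT_TEXT = 2.2
PRICE_OUTPUT_AUDIO = 11.9  # text-and-audio mode; output text is not charged


@dataclass
class Usage:
    """Hypothetical token counts for one request."""
    input_audio_tokens: int = 0
    input_other_tokens: int = 0   # text, image, and video input
    output_text_tokens: int = 0
    output_audio_tokens: int = 0


def estimate_cost_usd(usage: Usage) -> float:
    """Estimate the cost of one request in USD."""
    cost = usage.input_audio_tokens * PRICE_INPUT_AUDIO
    cost += usage.input_other_tokens * PRICE_INPUT_TEXT_IMAGE_VIDEO
    if usage.output_audio_tokens > 0:
        # Assumption: audio was produced, so only audio tokens are billed.
        cost += usage.output_audio_tokens * PRICE_OUTPUT_AUDIO
    else:
        cost += usage.output_text_tokens * PRICE_OUTPUT_TEXT
    return cost / 1_000_000


if __name__ == "__main__":
    text_only = Usage(input_other_tokens=1_000_000, output_text_tokens=1_000_000)
    print(f"text-only request: ${estimate_cost_usd(text_only):.2f}")
```

Keeping the four rates as named constants makes it easy to update the estimate if the published prices change.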
Context
- Context: 262.14K
- Max Input: 196.60K
- Max Output: 65.53K
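A client can check a request against these limits before sending it. The following is a minimal sketch, not part of any official SDK: `check_budget` is a hypothetical helper, and the exact token values (262,144 / 196,608 / 65,536) are an assumption chosen to be consistent with the rounded figures listed above.

```python
# Assumed exact limits; the page only lists rounded values
# (262.14K context, 196.60K max input, 65.53K max output).
CONTEXT_WINDOW = 262_144
MAX_INPUT_TOKENS = 196_608
MAX_OUTPUT_TOKENS = 65_536


def check_budget(input_tokens: int, requested_output_tokens: int) -> list[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if input_tokens > MAX_INPUT_TOKENS:
        problems.append(
            f"input of {input_tokens} tokens exceeds max input {MAX_INPUT_TOKENS}"
        )
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        problems.append(
            f"requested output of {requested_output_tokens} tokens "
            f"exceeds max output {MAX_OUTPUT_TOKENS}"
        )
    if input_tokens + requested_output_tokens > CONTEXT_WINDOW:
        problems.append(
            f"input plus output of {input_tokens + requested_output_tokens} tokens "
            f"exceeds context window {CONTEXT_WINDOW}"
        )
    return problems


if __name__ == "__main__":
    for issue in check_budget(200_000, 4_096):
        print(issue)
```

Token counts for image, video, and audio input depend on the provider's tokenization, so the `input_tokens` argument has to come from the API's own usage accounting or a documented estimate.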
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
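To stay under both limits, a client can throttle itself before calling the API. The sketch below shows one possible approach, a sliding 60-second window that tracks request count and token count together. `SlidingWindowLimiter` is a hypothetical class, 100K is assumed to mean 100,000 tokens, and the server may measure its window differently, so treat this as a conservative client-side guard rather than a mirror of the server's behavior.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute limit."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock          # injectable so the limiter can be tested
        self.events = deque()       # (timestamp, tokens) for admitted requests
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if it fits under both limits."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    if limiter.try_acquire(tokens=2_000):
        print("ok to send request")
    else:
        print("wait before sending")
```

When `try_acquire` returns `False`, the caller can sleep briefly and retry; the token argument should be an estimate of input plus expected output for the request.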
Built-in Tools
search_strategy:agent
Completions API
API Reference
Get API Key