Qwen-Omni-Turbo
Multimodal
Overview
Qwen-Omni-Turbo is a new multimodal understanding and generation model. It accepts text, image, speech, and video input, including mixed inputs. It generates text and speech simultaneously as streaming output, comprehends multimodal content significantly faster, and offers four natural conversational voice styles. This version is a snapshot from March 26, 2025.
Input
Text, Image, Video, Audio
Output
Text, Audio
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input (text): $0.07 per 1M tokens
- Input (audio): $4.44 per 1M tokens
- Input (vision): $0.21 per 1M tokens
- Output (text, when input contains only text): $0.27 per 1M tokens
- Output (text, when input contains images/audio/video): $0.63 per 1M tokens
- Output (text and audio; output text is not charged): $8.89 per 1M tokens
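To make the per-token arithmetic concrete, here is a minimal sketch of a cost estimator built from the rates listed above. The dictionary keys and the `estimate_cost` function are hypothetical names chosen for illustration; they are not part of any official SDK. How the service actually attributes tokens to each category is an assumption here.

```python
from typing import Mapping

# USD per 1M tokens, copied from the pricing list.
# The key names are hypothetical labels, not API fields.
PRICE_PER_MILLION = {
    "input_text": 0.07,
    "input_audio": 4.44,
    "input_vision": 0.21,
    "output_text_text_only": 0.27,    # text output, text-only input
    "output_text_multimodal": 0.63,   # text output, input has images/audio/video
    "output_audio": 8.89,             # text+audio output; output text is not charged
}


def estimate_cost(usage: Mapping[str, int]) -> float:
    """Estimate the USD cost of a request from token counts per category."""
    total = 0.0
    for kind, tokens in usage.items():
        if kind not in PRICE_PER_MILLION:
            raise KeyError(f"unknown usage category: {kind}")
        if tokens < 0:
            raise ValueError("token counts must be non-negative")
        total += tokens / 1_000_000 * PRICE_PER_MILLION[kind]
    return total
```

For a request that produces text and audio, pass only the audio output tokens under `output_audio` and omit the text output tokens, since the list states that output text is not charged in that case.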
Context
- Context: 32.76K tokens
- Max Input: 30.72K tokens
- Max Output: 2.04K tokens
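A request can be checked against these limits before it is sent. The sketch below assumes the rounded figures correspond to 32,768, 30,720, and 2,048 tokens (note that 30,720 + 2,048 = 32,768); the exact values and the `fits_limits` helper are assumptions for illustration, and token counts must come from whatever tokenizer the service uses.

```python
# Assumed exact values behind the rounded figures 32.76K / 30.72K / 2.04K.
CONTEXT_WINDOW = 32_768
MAX_INPUT = 30_720
MAX_OUTPUT = 2_048


def fits_limits(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within the input, output, and context limits."""
    return (
        0 <= input_tokens <= MAX_INPUT
        and 0 < requested_output_tokens <= MAX_OUTPUT
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )
```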
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
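A client can stay under both limits by throttling itself before sending. The following is a minimal sliding-window sketch, assuming the limits apply over a rolling 60-second window; how the service actually measures the window is not stated here, and `SlidingWindowLimiter` is a hypothetical helper, not part of any SDK.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget."""

    def __init__(
        self,
        max_requests: int = 60,
        max_tokens: int = 100_000,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_s = window_s
        self._clock = clock
        self._events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self._tokens_in_window = 0

    def _evict(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if it fits both budgets, else False."""
        now = self._clock()
        self._evict(now)
        if len(self._events) >= self.max_requests:
            return False
        if self._tokens_in_window + tokens > self.max_tokens:
            return False
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True
```

When `try_acquire` returns False, the caller can sleep briefly and retry. The server remains the source of truth, so rate-limit error responses still need handling.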