Qwen-Omni-Turbo-2025-03-26

Qwen-Omni-Turbo

Multimodal

Overview

Qwen-Omni-Turbo is a multimodal understanding and generation large model. It accepts text, image, speech, and video input, either alone or mixed in a single request. It generates text and speech simultaneously as streams, comprehends multimodal content significantly faster, and offers four natural conversational voice styles. This version is a snapshot from March 26, 2025.

Input

Text, Image, Video, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

Input: Text
$0.07 per 1M tokens
Input: Audio
$4.44 per 1M tokens
Input: Vision
$0.21 per 1M tokens
Output: Text (when input contains only text)
$0.27 per 1M tokens
Output: Text (when input contains images/audio/video)
$0.63 per 1M tokens
Output: Text & Audio (output text is not charged)
$8.89 per 1M tokens
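
The price list above can be turned into a rough per-request estimate. The sketch below is illustrative only: the function name `estimate_cost` and its parameters are hypothetical, and it assumes that "Vision" tokens cover both image and video input and that, when audio is returned, only the audio output tokens are billed (as the "output text is not charged" note suggests). Actual billing may differ.

```python
# Prices in USD per 1M tokens, copied from the pricing list above.
PRICE_PER_MILLION = {
    "input_text": 0.07,
    "input_audio": 4.44,
    "input_vision": 0.21,
    "output_text_text_only_input": 0.27,
    "output_text_multimodal_input": 0.63,
    "output_audio": 8.89,
}


def estimate_cost(input_text=0, input_audio=0, input_vision=0,
                  output_text=0, output_audio=0):
    """Return an estimated cost in USD for one request (token counts in, USD out)."""
    cost = (
        input_text * PRICE_PER_MILLION["input_text"]
        + input_audio * PRICE_PER_MILLION["input_audio"]
        + input_vision * PRICE_PER_MILLION["input_vision"]
    )
    if output_audio > 0:
        # Assumption: with audio output, the accompanying text is free.
        cost += output_audio * PRICE_PER_MILLION["output_audio"]
    else:
        # Text-only output: the rate depends on whether the input was text-only.
        multimodal_input = input_audio > 0 or input_vision > 0
        key = ("output_text_multimodal_input" if multimodal_input
               else "output_text_text_only_input")
        cost += output_text * PRICE_PER_MILLION[key]
    return cost / 1_000_000
```

For example, `estimate_cost(input_text=1_000_000, output_text=1_000_000)` adds the $0.07 input rate to the $0.27 text-only output rate.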

Context

32.76K

Max Input

30.72K

Max Output

2.04K
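
A request can be checked against these limits before it is sent. The sketch below assumes the rounded figures above correspond to 32,768, 30,720, and 2,048 tokens (the last two sum to the first); those exact values, and the helper name `check_request`, are assumptions rather than figures stated on this page.

```python
# Assumed exact limits; the page shows only rounded values (32.76K, 30.72K, 2.04K).
CONTEXT_WINDOW = 32_768
MAX_INPUT_TOKENS = 30_720
MAX_OUTPUT_TOKENS = 2_048


def check_request(input_tokens: int,
                  requested_output_tokens: int = MAX_OUTPUT_TOKENS) -> list[str]:
    """Return a list of limit violations; an empty list means the request fits."""
    problems = []
    if input_tokens > MAX_INPUT_TOKENS:
        problems.append(f"input {input_tokens} exceeds max input {MAX_INPUT_TOKENS}")
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        problems.append(
            f"output {requested_output_tokens} exceeds max output {MAX_OUTPUT_TOKENS}")
    if input_tokens + requested_output_tokens > CONTEXT_WINDOW:
        problems.append(
            f"total {input_tokens + requested_output_tokens} exceeds context {CONTEXT_WINDOW}")
    return problems
```

Token counts must come from the tokenizer used by the service; counting words or characters will not match.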

Rate Limits

RPM (Requests Per Minute): 60
TPM (Tokens Per Minute): 100K
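
Client code can throttle itself to stay under these limits. The following is a minimal sliding-window sketch, not the service's own accounting: the class name `SlidingWindowLimiter` is hypothetical, and it assumes a rolling 60-second window and that the caller supplies a token estimate for each request. How the service actually measures the window and which tokens count toward TPM are not stated here.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget."""

    def __init__(self, max_requests=60, max_tokens=100_000,
                 window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self._clock = clock          # injectable so the limiter can be tested
        self._events = deque()       # (timestamp, tokens) for each admitted request

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            self._events.popleft()

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets, else False."""
        now = self._clock()
        self._evict(now)
        used_tokens = sum(t for _, t in self._events)
        if (len(self._events) >= self.max_requests
                or used_tokens + tokens > self.max_tokens):
            return False
        self._events.append((now, tokens))
        return True
```

A caller would typically sleep briefly and retry when `try_acquire` returns `False`, and still handle rate-limit errors from the server, since the client-side count is only an approximation.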