Qwen-Omni-Turbo

Multimodal

Overview

Qwen-Omni-Turbo is a new multimodal large model for understanding and generation. It accepts text, image, speech, and video input, including mixed inputs, and streams text and speech output simultaneously. It offers significantly faster multimodal content comprehension and four natural conversational voice styles. This version is a snapshot from March 26, 2025.

Input

Text, Image, Video, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Text
    $0.07 per 1M tokens
  • Input: Audio
    $4.44 per 1M tokens
  • Input: Vision
    $0.21 per 1M tokens
  • Output: Text (when input contains only text)
    $0.27 per 1M tokens
  • Output: Text (when input contains images/audio/video)
    $0.63 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $8.89 per 1M tokens
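
The price list above can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the function name `estimate_cost` and the dictionary keys are hypothetical, and it assumes that "Vision" covers both image and video input tokens, that the $8.89 rate applies to audio output tokens, and that the output tier is chosen per request based on what the input contains.

```python
# Prices in USD per 1M tokens, copied from the price list above.
# The key names are illustrative, not official billing identifiers.
PRICE_PER_MILLION = {
    "input_text": 0.07,
    "input_audio": 4.44,
    "input_vision": 0.21,  # assumption: covers image and video input
    "output_text_text_only_input": 0.27,
    "output_text_multimodal_input": 0.63,
    "output_audio": 8.89,  # assumption: billed on audio output tokens
}


def estimate_cost(text_in=0, audio_in=0, vision_in=0, text_out=0, audio_out=0):
    """Estimate the USD cost of one request from its token counts."""
    p = PRICE_PER_MILLION
    cost = (
        text_in * p["input_text"]
        + audio_in * p["input_audio"]
        + vision_in * p["input_vision"]
    )
    if audio_out > 0:
        # Text & Audio output: the output text is not charged.
        cost += audio_out * p["output_audio"]
    elif audio_in > 0 or vision_in > 0:
        # Text-only output for a request with image/audio/video input.
        cost += text_out * p["output_text_multimodal_input"]
    else:
        # Text-only output for a text-only request.
        cost += text_out * p["output_text_text_only_input"]
    return cost / 1_000_000


if __name__ == "__main__":
    # A text-only request with 1,000 input tokens and 500 output tokens.
    print(f"${estimate_cost(text_in=1_000, text_out=500):.6f}")
```

Audio tokens dominate the bill at these rates, so an estimate like this is most useful for deciding whether a request really needs speech output.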

Context

Context Window
32.76K
Max Input
30.72K
Max Output
2.04K
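
A client can check a request against these limits before sending it. The sketch below assumes the rounded figures above correspond to 32,768, 30,720, and 2,048 tokens; those exact values, the function name `plan_output_budget`, and the idea that the caller already has a token count for its input are all assumptions, not details stated on this page.

```python
# Assumed exact values behind the rounded figures (32.76K, 30.72K, 2.04K).
CONTEXT_WINDOW = 32_768
MAX_INPUT_TOKENS = 30_720
MAX_OUTPUT_TOKENS = 2_048


def plan_output_budget(input_tokens, requested_output=MAX_OUTPUT_TOKENS):
    """Return the number of output tokens that can safely be requested.

    Raises ValueError if the input alone exceeds the input limit.
    """
    if input_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    if input_tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"input of {input_tokens} tokens exceeds the "
            f"{MAX_INPUT_TOKENS}-token input limit"
        )
    # With the assumed values, input and output limits sum to the context
    # window, so this term never binds; it is kept in case real limits differ.
    remaining = CONTEXT_WINDOW - input_tokens
    return min(requested_output, MAX_OUTPUT_TOKENS, remaining)


if __name__ == "__main__":
    print(plan_output_budget(12_000, requested_output=512))
```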

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    100K
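
To stay under both limits, a client can throttle itself before calling the service. The following is a minimal sliding-window sketch, not the service's own accounting: the class name `SlidingWindowLimiter` is hypothetical, 100K is assumed to mean 100,000 tokens, and how the service actually measures its one-minute window is not stated here.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute cap."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm  # assumption: "100K" means 100,000 tokens
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record a request of `tokens` tokens if both caps allow it."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    if limiter.try_acquire(tokens=1_500):
        print("request may be sent")
    else:
        print("wait before sending")
```

The token figure passed to `try_acquire` is the caller's own estimate for the request; when it returns False, the caller can sleep briefly and retry.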