Qwen-Omni-Turbo-Realtime

Real-time Omni-modality

Overview

This is the real-time version of Qwen-Omni-Turbo, a multimodal large model for understanding and generation, designed for real-time audio interaction. It accepts audio mixed with text, images, and video as input, streams speech and text output simultaneously, and offers four natural conversational voice styles.

Input

Text, Image, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Text
    $0.27 per 1M tokens
  • Input: Audio
    $4.44 per 1M tokens
  • Input: Vision
    $0.84 per 1M tokens
  • Output: Text (when input contains only text)
    $1.07 per 1M tokens
  • Output: Text (when input contains images/audio/video)
    $2.52 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $8.89 per 1M tokens
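
To make the per-million rates concrete, here is a minimal sketch of a cost estimate in Python. The category keys and the `estimate_cost` helper are hypothetical names chosen for illustration, not part of any official SDK. It also assumes that the "Text & Audio" output rate applies to audio output tokens only, since the list states that output text is not charged in that mode.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the pricing list above.
# The dictionary keys are illustrative names, not official API fields.
PRICE_PER_MILLION = {
    "input_text": Decimal("0.27"),
    "input_audio": Decimal("4.44"),
    "input_vision": Decimal("0.84"),
    "output_text_text_only_input": Decimal("1.07"),
    "output_text_multimodal_input": Decimal("2.52"),
    # Assumption: billed on audio output tokens; output text is free here.
    "output_audio": Decimal("8.89"),
}


def estimate_cost(token_counts):
    """Return the estimated USD cost for a {category: token_count} mapping."""
    total = Decimal("0")
    for category, tokens in token_counts.items():
        rate = PRICE_PER_MILLION[category]  # KeyError on unknown categories
        total += rate * Decimal(tokens) / Decimal(1_000_000)
    return total


# Example: a short voice exchange with a small text prompt.
example = {"input_text": 2_000, "input_audio": 10_000, "output_audio": 1_500}
print(estimate_cost(example))
```

`Decimal` is used instead of `float` so that small per-request amounts add up without binary rounding drift.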

Context

  • Context Window
    32.76K tokens
  • Max Input
    30.72K tokens
  • Max Output
    2.04K tokens
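
A client can check these limits before sending a request. The sketch below is illustrative only: `fits_context` is a hypothetical helper, and the exact token values are an assumption that reads the rounded figures as 30,720 + 2,048 = 32,768.

```python
# Assumption: the rounded figures on this page correspond to these exact values.
MAX_INPUT_TOKENS = 30_720    # "Max Input 30.72K"
MAX_OUTPUT_TOKENS = 2_048    # "Max Output 2.04K", read as 2,048
CONTEXT_WINDOW = 32_768      # "Context 32.76K", read as 32,768


def fits_context(input_tokens, requested_output_tokens):
    """Return True if a request stays within the input, output, and total limits."""
    return (
        0 <= input_tokens <= MAX_INPUT_TOKENS
        and 0 <= requested_output_tokens <= MAX_OUTPUT_TOKENS
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )


print(fits_context(12_000, 1_024))
```

Token counts depend on the model's tokenizer, so `input_tokens` would have to come from a real count or a conservative estimate.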

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    10K
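
One way to stay under both limits on the client side is a sliding-window limiter. The following is a minimal sketch: the class name and its methods are hypothetical, "10K" is read as 10,000 tokens, and the page does not say how the service measures its window, so a rolling 60-second window is an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget."""

    def __init__(self, max_requests=60, max_tokens=10_000,
                 window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self.clock = clock              # injectable so the limiter can be tested
        self._events = deque()          # (timestamp, tokens) per admitted request
        self._tokens_in_window = 0

    def _evict_expired(self, now):
        # Drop requests that have aged out of the rolling window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets."""
        now = self.clock()
        self._evict_expired(now)
        if len(self._events) + 1 > self.max_requests:
            return False
        if self._tokens_in_window + tokens > self.max_tokens:
            return False
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True


limiter = SlidingWindowLimiter()
print(limiter.try_acquire(6_000))  # fits within the token budget
print(limiter.try_acquire(5_000))  # would exceed 10,000 tokens in the window
```

When `try_acquire` returns False, a caller would typically wait and retry rather than send the request and receive a rate-limit error.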

API Reference
