Qwen3.5-Omni-Flash-Realtime

Real-time Omni-modality

Overview

Qwen3.5-Omni is the latest generation of Qwen's multimodal large model. It supports understanding and interaction across text, image, audio, and audio-visual input. As the successor to Qwen3-Omni, it accepts audio input in 60+ languages and produces voice output in 30+ languages. It also supports controllable voice dialogue, WebSearch, complex FunctionCall invocation, and intelligent semantic interruption. It is used in scenarios such as text creation, voice assistants, and multimedia analysis, where it provides natural, smooth multimodal interaction.

Input

Text, Image, Video, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Audio
    $4.5 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $17.7 per 1M tokens
  • Input: Text/Image/Video
    $0.55 per 1M tokens
  • Output: Text
    $3.3 per 1M tokens
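
The per-token prices above can be turned into a rough cost estimate. The sketch below is illustrative only: the `Usage` class and `estimate_cost_usd` function are hypothetical names, not part of any SDK, and the billing rule is an assumed reading of the list (when a response includes audio, only the audio output tokens are billed at the Text & Audio rate; otherwise output text is billed at the text-only rate).

```python
from dataclasses import dataclass

# USD per 1M tokens, copied from the pricing list.
PRICE_INPUT_AUDIO = 4.5
PRICE_INPUT_TEXT_IMAGE_VIDEO = 0.55
PRICE_OUTPUT_TEXT_ONLY = 3.3
PRICE_OUTPUT_AUDIO = 17.7


@dataclass
class Usage:
    """Hypothetical container for the token counts of one request."""
    input_audio_tokens: int = 0
    input_other_tokens: int = 0   # text, image, and video input
    output_text_tokens: int = 0
    output_audio_tokens: int = 0


def estimate_cost_usd(usage: Usage) -> float:
    """Estimate the cost of one request in USD under the assumed billing rule."""
    total = usage.input_audio_tokens * PRICE_INPUT_AUDIO
    total += usage.input_other_tokens * PRICE_INPUT_TEXT_IMAGE_VIDEO
    if usage.output_audio_tokens > 0:
        # Assumption: in Text & Audio mode the accompanying text is free.
        total += usage.output_audio_tokens * PRICE_OUTPUT_AUDIO
    else:
        total += usage.output_text_tokens * PRICE_OUTPUT_TEXT_ONLY
    return total / 1_000_000
```

For example, `estimate_cost_usd(Usage(input_other_tokens=2_000, output_text_tokens=500))` prices a short text-only exchange. Actual billing may differ from this reading of the list.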

Context

  • Context
    262.14K
  • Max Input
    196.60K
  • Max Output
    65.53K
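
A client can check these limits before sending a request. The sketch below uses the figures as displayed (the exact limits may differ slightly, since the page shows rounded values) and assumes that input and output tokens share the context window. `max_allowed_output` is a hypothetical helper, not part of any API.

```python
# Limits taken from the displayed figures; exact values are an assumption.
CONTEXT_WINDOW = 262_140
MAX_INPUT = 196_600
MAX_OUTPUT = 65_530


def max_allowed_output(input_tokens: int) -> int:
    """Return how many output tokens can be requested for a given input size.

    Returns 0 when the input alone exceeds the input limit.
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    if input_tokens > MAX_INPUT:
        return 0
    # Assumption: input and output together must fit in the context window.
    return min(MAX_OUTPUT, CONTEXT_WINDOW - input_tokens)
```

With these numbers the output cap is the binding limit for any input that fits, because the maximum input plus the maximum output does not exceed the context window.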

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    100K
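
To stay under these limits, a client can throttle itself before sending requests. The following is a minimal sliding-window sketch. `SlidingWindowLimiter` is a hypothetical class, and how the service actually measures its windows is not stated on this page, so treat the 60-second sliding window as an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter for requests and tokens over a sliding window."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock           # injectable so the limiter can be tested
        self._events = deque()       # (timestamp, tokens) for each admitted request

    def _evict(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            self._events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True if it fits both limits."""
        now = self.clock()
        self._evict(now)
        used = sum(t for _, t in self._events)
        if len(self._events) >= self.rpm or used + tokens > self.tpm:
            return False
        self._events.append((now, tokens))
        return True
```

A caller would check `limiter.try_acquire(estimated_tokens)` before each request and wait or queue when it returns `False`. Token counts are only known exactly after a response, so the estimate passed in should be conservative.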

Built-in Tools

search_strategy: agent
Completions API

API Reference
