Qwen3.5-Omni-Flash

Multimodal

Overview

Qwen 3.5-Omni is the latest generation of Qwen's multimodal large model. It supports understanding of, and interaction over, text, image, audio, and audio-visual input. As a comprehensive evolution of Qwen 3-Omni, it can handle more than 10 hours of audio and more than 400 seconds of 720P (1 FPS) audio-visual content for understanding and dialogue. Language coverage is also expanded: audio input is supported in 60+ languages and speech output in 30+ languages. The model offers strong structured audio-visual understanding and is widely used for text creation, voice assistants, multimedia analysis, and similar scenarios, providing natural, fluent multimodal interaction. This version is a snapshot from March 15, 2026.

Input

Text, Image, Video, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Audio
    $3 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $11.9 per 1M tokens
  • Input: Text/Image/Video
    $0.4 per 1M tokens
  • Output: Text
    $2.2 per 1M tokens
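
As a rough illustration of how the rates above combine, the sketch below estimates the cost of a single request. The function name, the dictionary keys, and the linear per-token billing model are assumptions made for illustration and are not taken from the provider's billing documentation. In particular, it assumes that when audio is produced, only the audio output tokens are billed (at the Text & Audio rate) and the accompanying text is free, as the note in the pricing list suggests.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the pricing list above.
# The key names are hypothetical labels, not API fields.
PRICE_PER_MILLION = {
    "input_audio": Decimal("3"),
    "input_text_image_video": Decimal("0.4"),
    "output_text_only": Decimal("2.2"),
    "output_audio": Decimal("11.9"),
}

MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_audio=0, input_other=0, output_text=0, output_audio=0):
    """Estimate the USD cost of one request from its token counts.

    Assumption: billing is linear in tokens, and output text is free
    whenever audio output is present in the same response.
    """
    cost = input_audio * PRICE_PER_MILLION["input_audio"] / MILLION
    cost += input_other * PRICE_PER_MILLION["input_text_image_video"] / MILLION
    if output_audio > 0:
        # Text & Audio output: only the audio tokens are charged.
        cost += output_audio * PRICE_PER_MILLION["output_audio"] / MILLION
    else:
        # Text-only output.
        cost += output_text * PRICE_PER_MILLION["output_text_only"] / MILLION
    return cost


if __name__ == "__main__":
    # 1M text input tokens and 1M text output tokens.
    print(estimate_cost_usd(input_other=1_000_000, output_text=1_000_000))
```

Decimal is used instead of float so that fractional rates such as 0.4 and 11.9 do not pick up binary rounding error when summed.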

Context

  • Context window
    262.14K tokens
  • Max Input
    196.60K tokens
  • Max Output
    65.53K tokens
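
The limits above can be checked before a request is sent. The following is a minimal sketch, assuming that "K" means 1,000 tokens and that the displayed figures are safe lower bounds (the exact limits may be slightly larger). The constant and function names are hypothetical, and token counts are assumed to come from whatever tokenizer or usage report the caller already has.

```python
# Limits as displayed above, read as thousands of tokens.
# Assumption: the true limits are at least these values.
CONTEXT_WINDOW = 262_140
MAX_INPUT = 196_600
MAX_OUTPUT = 65_530


def fits_limits(input_tokens: int, max_output_tokens: int) -> bool:
    """Return True if a request stays within all three displayed limits."""
    if input_tokens < 0 or max_output_tokens <= 0:
        return False
    return (
        input_tokens <= MAX_INPUT
        and max_output_tokens <= MAX_OUTPUT
        and input_tokens + max_output_tokens <= CONTEXT_WINDOW
    )


if __name__ == "__main__":
    print(fits_limits(input_tokens=150_000, max_output_tokens=8_000))
```

Note that the displayed input and output maxima add up to roughly the context window, so a request that uses the full input budget can still ask for the full output budget.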

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    100K
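
A client can stay under these limits by tracking its own recent usage. The sketch below is one possible approach, a sliding 60-second window over both request count and token count. The class name and its parameters are hypothetical, and how the provider actually measures the window (fixed versus sliding, and which tokens count toward TPM) is not stated on this page, so treat the accounting as an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for a requests-per-minute and tokens-per-minute budget."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self.events = deque()   # (timestamp, tokens) for each admitted request

    def _prune(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm or used_tokens + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    print(limiter.try_acquire(2_000))
```

When `try_acquire` returns False, the caller would typically sleep briefly and retry, or queue the request, rather than sending it and receiving a rate-limit error.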

Built-in Tools

search_strategy:agent
Completions API
