Qwen3-Omni-Flash

Multimodal

Overview

Qwen3-Omni-Flash is a large multimodal model built on the Thinker–Talker Mixture-of-Experts (MoE) architecture. It understands text, images, audio, and video, and generates both text and speech. It handles text in 119 languages and speech in 20 languages, and produces human-like speech for cross-lingual communication. The model offers strong instruction following and system prompt customization, so conversational style and character settings can be adapted as needed. Typical uses include text creation, voice assistants, and multimedia analysis. This version is a snapshot from September 15, 2025.

Input

Text, Image, Audio, Video

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Text
    $0.43 per 1M tokens
  • Input: Audio
    $3.81 per 1M tokens
  • Input: Vision
    $0.78 per 1M tokens
  • Output: Text (when input contains only text)
    $1.66 per 1M tokens
  • Output: Text (when input contains images/audio/video)
    $3.06 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $15.11 per 1M tokens
  • Input: Text (Thinking)
    $0.43 per 1M tokens
  • Input: Audio (Thinking)
    $3.81 per 1M tokens
  • Input: Vision (Thinking)
    $0.78 per 1M tokens
  • Output: Text (in thinking mode, when input contains only text)
    $1.66 per 1M tokens
  • Output: Text (in thinking mode, when input contains images/audio/video)
    $3.06 per 1M tokens
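
As a rough illustration of how the per-token prices combine, the sketch below estimates the cost of a single request. The category names and the `estimate_cost` helper are hypothetical and exist only for this example. The prices are copied from the list above. The sketch assumes that each modality is billed separately at its listed rate and that, for text-and-audio output, only the audio tokens are charged; how tokens are counted for each modality is not stated here.

```python
# Prices in USD per 1M tokens, copied from the pricing list.
# The dictionary keys are hypothetical labels chosen for this example.
PRICE_PER_MILLION = {
    "input_text": 0.43,
    "input_audio": 3.81,
    "input_vision": 0.78,
    "output_text_text_only_input": 1.66,
    "output_text_multimodal_input": 3.06,
    "output_audio": 15.11,  # assumption: accompanying output text is free
}


def estimate_cost(token_counts):
    """Return the estimated cost in USD for a dict of {category: tokens}."""
    total = 0.0
    for category, tokens in token_counts.items():
        # Convert the token count to millions, then apply the listed rate.
        total += tokens / 1_000_000 * PRICE_PER_MILLION[category]
    return total


if __name__ == "__main__":
    # A text-only request: 10,000 input tokens and 2,000 output tokens.
    cost = estimate_cost({
        "input_text": 10_000,
        "output_text_text_only_input": 2_000,
    })
    print(f"Estimated cost: ${cost:.5f}")
```

For the text-only request in the example, the estimate is 10,000 / 1M × $0.43 + 2,000 / 1M × $1.66 = $0.0043 + $0.00332 = $0.00762.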

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    100K
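
One way to stay under both limits is to pace requests on the client side. The sketch below is a minimal sliding-window limiter, assuming the limits apply over a rolling 60-second window; how the service actually measures the window is not stated here. The `SlidingWindowLimiter` class and its methods are hypothetical names for this example and are not part of any SDK.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side pacing for a requests-per-minute and tokens-per-minute budget."""

    def __init__(self, rpm=60, tpm=100_000, window=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window = window      # assumption: rolling 60-second window
        self.clock = clock        # injectable so the limiter can be tested
        self.events = deque()     # (timestamp, tokens) of recent requests

    def _prune(self, now):
        # Drop requests that have left the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def wait_time(self, tokens):
        """Seconds to wait before a request of `tokens` fits both budgets."""
        if tokens > self.tpm:
            raise ValueError("a single request exceeds the per-minute token budget")
        now = self.clock()
        self._prune(now)
        requests = len(self.events)
        used = sum(t for _, t in self.events)
        wait = 0.0
        # Expire the oldest requests one by one until both budgets have room.
        for ts, t in self.events:
            if requests < self.rpm and used + tokens <= self.tpm:
                break
            wait = ts + self.window - now
            requests -= 1
            used -= t
        return max(wait, 0.0)

    def record(self, tokens):
        """Register a request that was just sent."""
        self.events.append((self.clock(), tokens))
```

A caller would sleep for `wait_time(n)` seconds, send the request, and then call `record(n)`, where `n` is its own estimate of the request's token usage.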

API Reference

Get API Key