Qwen3-Omni-30b-a3b-Captioner

Speech Recognition

Overview

Qwen3-Omni-30b-a3b-Captioner is a fine-grained audio analysis model that generates accurate, comprehensive descriptions of audio content in complex and changing scenarios. It automatically parses and describes a wide range of audio, including complex speech, ambient sounds, music, and film and television sound effects, and its output remains stable and reliable in multi-source and mixed environments.

Input

Audio

Output

Text

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

Input (audio): $3.81 per 1M tokens
Output (text, when input contains images/audio/video): $3.06 per 1M tokens
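
As a rough illustration of how the per-token prices above translate into a per-request cost, here is a minimal Python sketch. The function name `estimate_cost_usd` and the token counts in the usage note are hypothetical; the sketch also assumes that billing is strictly linear in tokens, which the listing does not state.

```python
from decimal import Decimal

# Prices taken from the listing above, in USD per 1M tokens.
AUDIO_INPUT_USD_PER_M = Decimal("3.81")
TEXT_OUTPUT_USD_PER_M = Decimal("3.06")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request, assuming linear per-token billing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * AUDIO_INPUT_USD_PER_M
        + Decimal(output_tokens) * TEXT_OUTPUT_USD_PER_M
    )
    return total / TOKENS_PER_MILLION
```

For example, a hypothetical request with 2,000 audio input tokens and 500 text output tokens would come to `estimate_cost_usd(2000, 500)`, which is 0.00915 USD under these assumptions.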

Rate Limits

RPM (Requests Per Minute): 60
TPM (Tokens Per Minute): 100K
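
The two limits interact: whichever is reached first caps throughput. The following Python sketch estimates the sustainable request rate for a given average request size. It assumes that "100K" means 100,000 tokens and that the TPM limit counts input and output tokens together; neither is confirmed by the listing, and `max_requests_per_minute` is a hypothetical helper name.

```python
# Limits from the listing above. Reading 100K as 100,000 is an assumption.
RPM_LIMIT = 60
TPM_LIMIT = 100_000


def max_requests_per_minute(tokens_per_request: int) -> int:
    """Return the request rate allowed by both limits for a given request size.

    tokens_per_request is the average total tokens (input plus output) per
    request; counting both against TPM is an assumption.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # The token budget allows this many whole requests per minute.
    allowed_by_tokens = TPM_LIMIT // tokens_per_request
    return min(RPM_LIMIT, allowed_by_tokens)
```

With requests averaging 5,000 tokens, the token limit binds first and allows 20 requests per minute; with requests averaging 1,000 tokens, the request limit binds and the rate is capped at 60.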