Qwen3-Omni-30b-a3b-Captioner

Speech Recognition

Overview

Qwen3-Omni-30b-a3b-Captioner is a fine-grained audio analysis model that generates accurate, comprehensive descriptions of audio content in complex and changing scenarios. It automatically parses and describes a wide range of audio, including complex speech, ambient sounds, music, and film and television sound effects, and its output remains stable and reliable in multi-source and mixed environments.

Input

Audio

Output

Text

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Audio
    $3.81 per 1M tokens
  • Output: Text (when input contains images/audio/video)
    $3.06 per 1M tokens
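
To make the per-token prices concrete, here is a minimal Python sketch that estimates the cost of a single request from the two rates listed above. The function name and the token counts in the example are hypothetical; this page does not say how audio duration maps to tokens, so the counts must come from the usage figures your provider reports.

```python
from decimal import Decimal

# Prices copied from the Pricing list above (USD per 1M tokens).
INPUT_AUDIO_USD_PER_M = Decimal("3.81")
OUTPUT_TEXT_USD_PER_M = Decimal("3.06")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(audio_input_tokens: int, text_output_tokens: int) -> Decimal:
    """Estimate the cost of one request (hypothetical helper, not a provider API).

    Token counts are assumed to be taken from reported usage; the mapping from
    audio length to tokens is not documented on this page.
    """
    input_cost = Decimal(audio_input_tokens) * INPUT_AUDIO_USD_PER_M / TOKENS_PER_M
    output_cost = Decimal(text_output_tokens) * OUTPUT_TEXT_USD_PER_M / TOKENS_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200,000 audio input tokens, 10,000 text output tokens.
    print(estimate_cost_usd(200_000, 10_000))
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many requests.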

Rate Limits

  • RPM (Requests Per Minute)
    60
  • TPM (Tokens Per Minute)
    100K
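
A client can stay under both limits by tracking its own usage before sending each request. The sketch below is one possible approach, not the provider's enforcement logic: it assumes a 60-second sliding window, reads "100K" as 100,000 tokens, and assumes the caller knows (or estimates) the token count of each request. The class and method names are hypothetical.

```python
import time
from collections import deque

# Limits copied from the Rate Limits list above; "100K" is read as 100,000.
RPM_LIMIT = 60
TPM_LIMIT = 100_000
WINDOW_SECONDS = 60.0  # assumption: limits are evaluated over a sliding minute


class SlidingWindowBudget:
    """Client-side tracker for request and token budgets (hypothetical helper)."""

    def __init__(self, rpm=RPM_LIMIT, tpm=TPM_LIMIT, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self.events = deque()  # (timestamp, tokens) for each admitted request
        self.tokens_in_window = 0

    def _evict(self, now):
        # Drop requests that are at least one full window old.
        while self.events and now - self.events[0][0] >= WINDOW_SECONDS:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Return True and record the request if it fits both budgets."""
        now = self.clock()
        self._evict(now)
        if len(self.events) >= self.rpm:
            return False
        if self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True
```

A caller would check `try_acquire(estimated_tokens)` before each request and sleep briefly when it returns `False`. The server may count tokens or windows differently, so it is still worth handling rate-limit error responses with a retry.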