Qwen3-Omni-30b-a3b-Captioner
Speech Recognition
Overview
Qwen3-Omni-30b-a3b-Captioner is a fine-grained audio analysis model that generates accurate, comprehensive descriptions of audio content in complex and changing scenarios. It automatically parses and describes a wide range of audio, including complex speech, ambient sounds, music, and film and television sound effects, and keeps its output stable and reliable in multi-source and mixed environments.
Input
Audio
Output
Text
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input: Audio: $3.81 per 1M tokens
- Output: Text (when input contains images/audio/video): $3.06 per 1M tokens
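The per-token prices listed here can be turned into a rough per-request estimate. The sketch below is only arithmetic on the two listed rates. The function name `estimate_cost_usd` and the token counts in the usage note are hypothetical, and it assumes billing is linear in tokens with no minimum charge or rounding, which this page does not confirm.

```python
# Prices copied from the pricing list (USD per 1M tokens).
INPUT_AUDIO_USD_PER_M = 3.81
OUTPUT_TEXT_USD_PER_M = 3.06


def estimate_cost_usd(input_audio_tokens: int, output_text_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Assumes cost scales linearly with token counts (an assumption,
    not something stated on this page).
    """
    if input_audio_tokens < 0 or output_text_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_audio_tokens * INPUT_AUDIO_USD_PER_M
             + output_text_tokens * OUTPUT_TEXT_USD_PER_M)
    return total / 1_000_000
```

For example, `estimate_cost_usd(10_000, 500)` prices a request with 10,000 audio input tokens and 500 text output tokens. How many tokens a given audio clip consumes is not stated here, so real token counts should be read from the API response.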
Rate Limits
- RPM (Requests Per Minute): 60
- TPM (Tokens Per Minute): 100K
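A client can track these two limits locally to avoid sending requests that would be rejected. The following is a minimal sketch, not the provider's actual enforcement logic: the class `RateBudget` is hypothetical, it assumes a sliding 60-second window, reads "100K" as 100,000 tokens, and leaves it to the caller to decide which tokens (input, output, or both) count toward TPM, since this page does not say.

```python
import time
from collections import deque


class RateBudget:
    """Client-side tracker for requests-per-minute and tokens-per-minute.

    Sliding-window accounting is an assumption; the service may count
    usage differently (for example, in fixed calendar minutes).
    """

    def __init__(self, rpm=60, tpm=100_000, window_s=60.0, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.window_s = window_s
        self._clock = clock
        self._events = deque()  # (timestamp, tokens) for each accepted request

    def _prune(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_s:
            self._events.popleft()

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both limits."""
        now = self._clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self._events)
        if len(self._events) >= self.rpm or used_tokens + tokens > self.tpm:
            return False
        self._events.append((now, tokens))
        return True
```

A caller would check `budget.try_acquire(estimated_tokens)` before each request and wait or retry later when it returns `False`. The server remains the source of truth, so rate-limit errors from the API still need to be handled.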