Qwen2.5-Open-Source
Multimodal
Overview
Built on Qwen2.5, this is a new large multimodal model for understanding and generation. It accepts text, image, voice, and video input, alone or mixed, and can generate text and voice simultaneously. It significantly speeds up multimodal content understanding and offers four natural-sounding conversational voices.
Input
Text, Image, Video, Audio
Output
Text, Audio
Features
Prefix Completion
Function Calling
Cache
Structured Outputs
Batches
Web Search
Pricing
- Input (Text): $0.1 per 1M tokens
- Input (Audio): $6.76 per 1M tokens
- Input (Vision): $0.28 per 1M tokens
- Output (Text, when input contains only text): $0.4 per 1M tokens
- Output (Text, when input contains images/audio/video): $0.84 per 1M tokens
- Output (Text and Audio; output text is not charged): $13.51 per 1M tokens
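As a rough illustration of how these rates combine, the sketch below estimates the cost of one request. The function name `estimate_cost` and its parameters are hypothetical, and it assumes that when audio is returned only the audio output tokens are billed at the Text and Audio rate; actual billing rules may differ.

```python
# Prices in USD per 1M tokens, copied from the pricing list above.
INPUT_PRICES = {"text": 0.10, "audio": 6.76, "vision": 0.28}
OUTPUT_TEXT_AFTER_TEXT_INPUT = 0.40
OUTPUT_TEXT_AFTER_MULTIMODAL_INPUT = 0.84
OUTPUT_AUDIO = 13.51

TOKENS_PER_UNIT = 1_000_000


def estimate_cost(text_in=0, audio_in=0, vision_in=0, text_out=0, audio_out=0):
    """Estimate the USD cost of one request from token counts (hypothetical helper)."""
    cost = (
        text_in * INPUT_PRICES["text"]
        + audio_in * INPUT_PRICES["audio"]
        + vision_in * INPUT_PRICES["vision"]
    ) / TOKENS_PER_UNIT

    if audio_out > 0:
        # Assumption: with audio output, only audio tokens are billed;
        # the accompanying output text is not charged.
        cost += audio_out * OUTPUT_AUDIO / TOKENS_PER_UNIT
    elif audio_in > 0 or vision_in > 0:
        # Text-only output after multimodal input uses the higher text rate.
        cost += text_out * OUTPUT_TEXT_AFTER_MULTIMODAL_INPUT / TOKENS_PER_UNIT
    else:
        # Text in, text out.
        cost += text_out * OUTPUT_TEXT_AFTER_TEXT_INPUT / TOKENS_PER_UNIT
    return cost


if __name__ == "__main__":
    # Example: 2,000 text tokens in, 500 text tokens out.
    print(f"{estimate_cost(text_in=2_000, text_out=500):.6f} USD")
```

The three output branches mirror the three output rows in the list; how vision and audio inputs are converted to token counts is not specified here and is left to the caller.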
Context
Context
32.76K
Max Input
30.72K
Max Output
2.04K
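The figures above are consistent with the context window being the sum of the maximum input and maximum output (30.72K + 2.04K = 32.76K). A minimal sketch of a pre-flight check follows; the exact token values (32,768, 30,720, and 2,048) are an assumption about what the rounded figures stand for, and `fits_limits` is a hypothetical helper.

```python
# Assumed exact limits behind the rounded figures on this page.
CONTEXT_WINDOW = 32_768  # shown as 32.76K
MAX_INPUT = 30_720       # shown as 30.72K
MAX_OUTPUT = 2_048       # shown as 2.04K


def fits_limits(input_tokens: int, requested_output_tokens: int) -> bool:
    """Return True if a request stays within the input, output, and context limits."""
    if input_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        input_tokens <= MAX_INPUT
        and requested_output_tokens <= MAX_OUTPUT
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )


if __name__ == "__main__":
    print(fits_limits(30_000, 2_000))
    print(fits_limits(31_000, 1_000))
```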
Rate Limits
API Reference