Qwen2.5-Open-Source

Multimodal

Overview

Built on Qwen2.5, this large multimodal model handles both understanding and generation. It accepts text, image, audio, and video input, alone or mixed, and can generate text and speech simultaneously. It understands multimodal content significantly faster and offers four natural-sounding conversational voices.

Input

Text, Image, Video, Audio

Output

Text, Audio

Features

Prefix Completion

Function Calling

Cache

Structured Outputs

Batches

Web Search

Pricing

  • Input: Text
    $0.1 per 1M tokens
  • Input: Audio
    $6.76 per 1M tokens
  • Input: Vision
    $0.28 per 1M tokens
  • Output: Text (when input contains only text)
    $0.4 per 1M tokens
  • Output: Text (when input contains images/audio/video)
    $0.84 per 1M tokens
  • Output: Text & Audio (output text is not charged)
    $13.51 per 1M tokens
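
The per-token rates above can be turned into a rough per-request estimate. The sketch below is illustrative only: the category keys and the `estimate_cost` helper are hypothetical names, not part of any official SDK, and it assumes you already know the token count in each billing category (how audio and vision tokens are counted is not stated here).

```python
from decimal import Decimal

# USD per 1M tokens, copied from the pricing list above.
# The dictionary keys are hypothetical labels chosen for this sketch.
PRICE_PER_MILLION = {
    "input_text": Decimal("0.1"),
    "input_audio": Decimal("6.76"),
    "input_vision": Decimal("0.28"),
    "output_text_text_only": Decimal("0.4"),    # input contained only text
    "output_text_multimodal": Decimal("0.84"),  # input contained images/audio/video
    "output_audio": Decimal("13.51"),           # text & audio output; text part not charged
}


def estimate_cost(token_counts):
    """Return the estimated USD cost for a dict of {category: token count}."""
    total = Decimal("0")
    for category, tokens in token_counts.items():
        rate = PRICE_PER_MILLION[category]  # KeyError on an unknown category
        total += rate * Decimal(tokens) / Decimal(1_000_000)
    return total


if __name__ == "__main__":
    # Example: an image question answered in text.
    request = {
        "input_text": 2_000,
        "input_vision": 1_000,
        "output_text_multimodal": 500,
    }
    print(f"Estimated cost: ${estimate_cost(request)}")
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift.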

Context

  • Context: 32.76K
  • Max Input: 30.72K
  • Max Output: 2.04K
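
Note that the maximum input and maximum output add up to the full context (30.72K + 2.04K = 32.76K). A minimal pre-flight check, assuming the figures are token counts and that "K" means 1,000 (the exact limits enforced by the service may differ slightly), could look like this. The constant and function names are hypothetical.

```python
# Limits as listed above, read as token counts with K = 1,000 (an assumption).
CONTEXT_WINDOW = 32_760
MAX_INPUT = 30_720
MAX_OUTPUT = 2_040


def fits_limits(input_tokens, requested_output_tokens):
    """Return True if a request stays within the listed input, output, and context limits."""
    return (
        0 <= input_tokens <= MAX_INPUT
        and 0 <= requested_output_tokens <= MAX_OUTPUT
        and input_tokens + requested_output_tokens <= CONTEXT_WINDOW
    )


if __name__ == "__main__":
    print(fits_limits(30_720, 2_040))  # both limits used in full
    print(fits_limits(30_721, 10))     # input over the limit
```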

Rate Limits
