
Qwen3-235B-A22B-Instruct-2507

Alibaba

Qwen3-235B-A22B-Instruct-2507 is an updated instruction version of Qwen3-235B-A22B with substantial improvements in overall capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. The model delivers significant gains in specialized knowledge coverage across multiple languages and notably better alignment with user preferences on subjective and open-ended tasks.

Key Specifications

Parameters
235.0B
Context
131.1K
Release Date
July 22, 2025
Average Score
72.1%

Timeline

Key dates in the model's history
Announcement
July 22, 2025
Last Update
August 3, 2025

Technical Specifications

Parameters
235.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.80
Max Input Tokens
131.1K
Max Output Tokens
16.4K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
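Given the per-million-token prices listed above, the cost of a single request can be estimated as below; a minimal sketch, where the token counts are made-up example values:

```python
# Listed rates for this model (USD per 1M tokens).
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical request: 100K input tokens, 10K output tokens.
cost = request_cost(100_000, 10_000)
print(f"${cost:.3f}")  # $0.023
```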

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA
Accuracy (self-reported)
77.5%

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy (self-reported)
57.3%
AIME25
Accuracy: the proportion of correct answers over the question set. This is one of the most common ways to evaluate LLM performance and our main metric for how well the model handles the task. We measure accuracy along several dimensions: 1. Overall accuracy: how well the model performs across the whole benchmark. 2. Accuracy by domain: how the model performs in specific fields of knowledge, to identify where it is strong or weak. 3. Accuracy by difficulty: how accuracy changes on easy versus complex questions, which reveals the limits of the model's abilities. 4. Accuracy by output format: whether the model handles multiple-choice, short-answer, or open-ended questions better. 5. Accuracy by prompting method: how different methods (e.g., chain-of-thought, standard prompting, code) affect accuracy on the given tasks. This breakdown lets us determine not only how well the model performs overall, but also where exactly it struggles and under which conditions it performs best. Self-reported.
70.3%
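The accuracy metric described above is simply the share of correct answers, optionally broken down by dimension. A minimal sketch of the overall and per-domain computation, using made-up example data:

```python
from collections import defaultdict

def accuracy(results):
    """results: list of (domain, is_correct) pairs; returns overall accuracy."""
    correct = sum(1 for _, ok in results if ok)
    return correct / len(results)

def accuracy_by_domain(results):
    """Break accuracy down per knowledge domain."""
    buckets = defaultdict(list)
    for domain, ok in results:
        buckets[domain].append(ok)
    return {d: sum(oks) / len(oks) for d, oks in buckets.items()}

# Hypothetical graded outputs: (domain, answered correctly?)
results = [("algebra", True), ("algebra", False),
           ("geometry", True), ("geometry", True)]
print(accuracy(results))            # 0.75
print(accuracy_by_domain(results))  # {'algebra': 0.5, 'geometry': 1.0}
```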
ARC-AGI
Accuracy: the model's ability to correctly solve problems, measured as the percentage of correctly solved problems out of all that it attempted. We compute accuracy as #Correct / #Attempted. Self-reported.
41.8%
Arena-Hard v2
Win Rate: how often model A outperforms model B in pairwise comparison. Computing win rate: 1. give both models the same tasks; 2. collect an answer from each model; 3. judge answer quality with human or LLM judges; 4. record the outcome of each comparison; 5. count how often model A beats model B. Strengths: a direct head-to-head comparison of two models that scales across task types. Limitations: results can be biased by the human or LLM judge, and the metric gives only a relative comparison, not an absolute quality score. Self-reported.
79.2%
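The win rate described above can be sketched as the fraction of pairwise comparisons a model wins; counting ties as half a win is a common convention assumed here, and the judge verdicts are made up:

```python
def win_rate(outcomes):
    """outcomes: per-comparison results for model A: 'win', 'loss', or 'tie'.
    Ties count as half a win (a common convention, assumed here)."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)

# Hypothetical judge verdicts over 8 pairwise comparisons.
outcomes = ["win", "win", "tie", "loss", "win", "win", "tie", "win"]
print(win_rate(outcomes))  # 0.75
```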
BFCL-v3
Accuracy (self-reported)
70.9%
Creative Writing v3
Rubric-based evaluation scoring each response on several criteria: overall strategy toward the task, correctness of the final answer, clarity and sequencing of the presented reasoning, efficiency of the approach where a direct one is possible, and verification of intermediate results and the final answer. Self-reported.
87.5%
CSimpleQA
Accuracy (GPT-4o as judge; self-reported)
84.3%
HMMT25
Accuracy measures the correctness of the model's answer on a task and is one of the most general and common methods of evaluating model performance. It is computed by checking the model's answer against the correct answer. Different task types use different approaches: multiple-choice tasks use simple matching against the correct answer; open-ended tasks often need more complex scoring, such as comparison with reference answers or judging by another model; tasks with complex answer structure may be scored on several components. Advantages: simple and easily interpretable, and it allows direct comparison between models on shared task sets. Disadvantages: it misses nuances in answers and partial correctness, can be difficult to apply to tasks where answers vary in form or have multiple correct options, gives no insight into the model's reasoning process, and in some cases a model can reach correct answers through incorrect reasoning. Accuracy is most useful as a baseline evaluation of general performance when tasks have clearly correct and incorrect answers and several models are compared on the same set; it is often used as a first step before deeper analysis of model performance. Self-reported.
55.4%
IFEval
Accuracy: the model's ability to give correct answers. Although this is often treated as a simple evaluation task, it requires a carefully constructed question set with clearly correct answers. Scoring can be done against a set of reference answers or with evaluation functions that determine whether an answer is correct. Challenges: it requires data with unambiguously correct answers, there is not always a single correct solution, and answers can be expressed in varied ways. Evaluation methods: matching against reference answers, and model-based judging for more complex or open-ended tasks. Self-reported.
88.7%
INCLUDE
Evaluation score (self-reported)
79.5%
LiveBench 20241125
Accuracy (self-reported)
75.4%
LiveCodeBench v6
Accuracy (self-reported)
51.8%
MMLU-Pro
Accuracy (self-reported)
83.0%
MMLU-ProX
Accuracy (self-reported)
79.4%
MMLU-Redux
Accuracy (self-reported)
93.1%
Multi-IF
Accuracy (self-reported)
77.5%
MultiPL-E
Evaluation score (self-reported)
87.9%
PolyMATH
Accuracy: a strict evaluation of whether the model can genuinely solve problems rather than merely reproduce memorized solutions. Unlike people, who under specific conditions solve a problem through consistent methods, LLMs often do not show consistent skill levels and have complex patterns of strengths and weaknesses, so standard test items can be misleading. A stricter criterion is therefore used: the model must produce the correct answer consistently across several attempts, rather than occasionally succeeding on a task type. This mirrors how we evaluate people, where we verify that they can truly solve the problem. Self-reported.
50.2%
SimpleQA
Accuracy: AI models are often expected to be highly accurate or even infallible. This expectation sometimes results in excessive trust in AI responses, commonly known as "automation bias." We might observe a system exhibiting various behaviors related to accuracy: 1. Verifiably correct outputs: the system provides answers that can be verified as correct through external sources or mathematical proof. 2. Misinformation: the system confidently states incorrect information as fact, possibly due to training data containing inaccuracies, hallucinations (generating plausible-sounding but false information), or temporal limitations (outdated knowledge cutoff). 3. Self-correction: the system demonstrates the ability to identify when it makes mistakes, correct its own errors when presented with new information, and acknowledge uncertainty appropriately. 4. Uncertainty handling: how well the system expresses appropriate confidence levels, admits knowledge limitations, avoids overconfidence on incorrect answers, and provides appropriate caveats. For analysis purposes, we can evaluate a system's accuracy across different knowledge domains (e.g., mathematics, history, current events) and task types (factual recall, reasoning, prediction). Self-reported.
54.3%
SuperGPQA
Accuracy (self-reported)
62.6%
Tau2 airline
Accuracy: whether a statement is factually correct, determined by matching the statement against the actual facts. Even if a statement contains approximations or simplifications, its factual content should be exact. A statement is marked inaccurate when it contains factually wrong information, values that differ substantially from the correct ones, or an incorrect connection between facts; otherwise it is evaluated as accurate. If accuracy cannot be determined (due to missing knowledge or information), it is marked accordingly. Self-reported.
44.0%
Tau2 retail
Accuracy (self-reported)
71.3%
WritingBench
Accuracy (self-reported)
85.2%
ZebraLogic
Accuracy (self-reported)
95.0%
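The "Average Score" of 72.1% in Key Specifications is consistent with the unweighted mean of the 25 benchmark scores listed above, as a quick check shows:

```python
# The 25 self-reported benchmark scores from the table above (in %).
scores = [77.5, 57.3, 70.3, 41.8, 79.2, 70.9, 87.5, 84.3, 55.4,
          88.7, 79.5, 75.4, 51.8, 83.0, 79.4, 93.1, 77.5, 87.9,
          50.2, 54.3, 62.6, 44.0, 71.3, 85.2, 95.0]
average = sum(scores) / len(scores)
print(f"{average:.1f}%")  # 72.1%
```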

License & Metadata

License
Apache 2.0
Announcement Date
July 22, 2025
Last Updated
August 3, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.