
Qwen3-235B-A22B-Instruct-2507

Alibaba

Qwen3-235B-A22B-Instruct-2507 is an updated instruction version of Qwen3-235B-A22B with substantial improvements in overall capabilities, including instruction following, logical reasoning, text comprehension, math, science, coding, and tool use. The model delivers significant gains in specialized knowledge coverage across multiple languages and notably better alignment with user preferences on subjective and open-ended tasks.

Key Specifications

Parameters
235.0B
Context
131.1K
Release Date
July 22, 2025
Average Score
72.1%

Timeline

Key dates in the model's history
Announcement
July 22, 2025
Last Update
August 3, 2025

Technical Specifications

Parameters
235.0B
Training Tokens
-
Knowledge Cutoff
-
Family
-
Capabilities
Multimodal, ZeroEval

Pricing & Availability

Input (per 1M tokens)
$0.15
Output (per 1M tokens)
$0.80
Max Input Tokens
131.1K
Max Output Tokens
16.4K
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
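Given the per-million-token prices listed above, the cost of a single request can be estimated as below; a minimal sketch, where the token counts are made-up example values:

```python
# Listed rates for this model (USD per 1M tokens).
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.80

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# Hypothetical request: 100K input tokens, 10K output tokens.
cost = request_cost(100_000, 10_000)
print(f"${cost:.3f}")  # $0.023
```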

Benchmark Results

Model performance metrics across various tests and benchmarks

Reasoning

Logical reasoning and analysis
GPQA
Accuracy (self-reported)
77.5%

Other Tests

Specialized benchmarks
Aider-Polyglot
Accuracy (self-reported)
57.3%
AIME25
Accuracy: the proportion of correct answers over the question set. This is one of the most common ways to evaluate LLM performance and our main metric for how well the model handles the task. We measure accuracy along several dimensions: 1. Overall accuracy: how well the model performs across the whole benchmark. 2. Accuracy by domain: how the model performs in specific fields of knowledge, to identify where it is strong or weak. 3. Accuracy by difficulty: how accuracy changes on easy versus complex questions, which reveals the limits of the model's abilities. 4. Accuracy by output format: whether the model handles multiple-choice, short-answer, or open-ended questions better. 5. Accuracy by prompting method: how different methods (e.g., chain-of-thought, standard prompting, code) affect accuracy on the given tasks. This breakdown lets us determine not only how well the model performs overall, but also where exactly it struggles and under which conditions it performs best. Self-reported.
70.3%
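The accuracy metric described above is simply the share of correct answers, optionally broken down by dimension. A minimal sketch of the overall and per-domain computation, using made-up example data:

```python
from collections import defaultdict

def accuracy(results):
    """results: list of (domain, is_correct) pairs; returns overall accuracy."""
    correct = sum(1 for _, ok in results if ok)
    return correct / len(results)

def accuracy_by_domain(results):
    """Break accuracy down per knowledge domain."""
    buckets = defaultdict(list)
    for domain, ok in results:
        buckets[domain].append(ok)
    return {d: sum(oks) / len(oks) for d, oks in buckets.items()}

# Hypothetical graded outputs: (domain, answered correctly?)
results = [("algebra", True), ("algebra", False),
           ("geometry", True), ("geometry", True)]
print(accuracy(results))            # 0.75
print(accuracy_by_domain(results))  # {'algebra': 0.5, 'geometry': 1.0}
```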
ARC-AGI
Accuracy: the model's ability to correctly solve problems, measured as the percentage of correctly solved problems out of all that it attempted. We compute accuracy as #Correct / #Attempted. Self-reported.
41.8%
Arena-Hard v2
Win Rate: how often model A outperforms model B in pairwise comparison. Computing win rate: 1. give both models the same tasks; 2. collect an answer from each model; 3. judge answer quality with human or LLM judges; 4. record the outcome of each comparison; 5. count how often model A beats model B. Strengths: a direct head-to-head comparison of two models that scales across task types. Limitations: results can be biased by the human or LLM judge, and the metric gives only a relative comparison, not an absolute quality score. Self-reported.
79.2%
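The win rate described above can be sketched as the fraction of pairwise comparisons a model wins; counting ties as half a win is a common convention assumed here, and the judge verdicts are made up:

```python
def win_rate(outcomes):
    """outcomes: per-comparison results for model A: 'win', 'loss', or 'tie'.
    Ties count as half a win (a common convention, assumed here)."""
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)

# Hypothetical judge verdicts over 8 pairwise comparisons.
outcomes = ["win", "win", "tie", "loss", "win", "win", "tie", "win"]
print(win_rate(outcomes))  # 0.75
```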
BFCL-v3
Accuracy (self-reported)
70.9%
Creative Writing v3
Rubric-based evaluation scoring each response on several criteria: overall strategy toward the task, correctness of the final answer, clarity and sequencing of the presented reasoning, efficiency of the approach where a direct one is possible, and verification of intermediate results and the final answer. Self-reported.
87.5%
CSimpleQA
Accuracy (GPT-4o as judge; self-reported)
84.3%
HMMT25
Accuracy measures the correctness of the model's answer on a task and is one of the most general and common methods of evaluating model performance. It is computed by checking the model's answer against the correct answer. Different task types use different approaches: multiple-choice tasks use simple matching against the correct answer; open-ended tasks often need more complex scoring, such as comparison with reference answers or judging by another model; tasks with complex answer structure may be scored on several components. Advantages: simple and easily interpretable, and it allows direct comparison between models on shared task sets. Disadvantages: it misses nuances in answers and partial correctness, can be difficult to apply to tasks where answers vary in form or have multiple correct options, gives no insight into the model's reasoning process, and in some cases a model can reach correct answers through incorrect reasoning. Accuracy is most useful as a baseline evaluation of general performance when tasks have clearly correct and incorrect answers and several models are compared on the same set; it is often used as a first step before deeper analysis of model performance. Self-reported.
55.4%
IFEval
Accuracy: the model's ability to give correct answers. Although this is often treated as a simple evaluation task, it requires a carefully constructed question set with clearly correct answers. Scoring can be done against a set of reference answers or with evaluation functions that determine whether an answer is correct. Challenges: it requires data with unambiguously correct answers, there is not always a single correct solution, and answers can be expressed in varied ways. Evaluation methods: matching against reference answers, and model-based judging for more complex or open-ended tasks. Self-reported.
88.7%
INCLUDE
Evaluation score (self-reported)
79.5%
LiveBench 20241125
Accuracy (self-reported)
75.4%
LiveCodeBench v6
Accuracy (self-reported)
51.8%
MMLU-Pro
Accuracy (self-reported)
83.0%
MMLU-ProX
Accuracy (self-reported)
79.4%
MMLU-Redux
Accuracy (self-reported)
93.1%
Multi-IF
Accuracy (self-reported)
77.5%
MultiPL-E
Evaluation score (self-reported)
87.9%
PolyMATH
Accuracy: a strict evaluation of whether the model can genuinely solve problems rather than merely reproduce memorized solutions. Unlike people, who under specific conditions solve a problem through consistent methods, LLMs often do not show consistent skill levels and have complex patterns of strengths and weaknesses, so standard test items can be misleading. A stricter criterion is therefore used: the model must produce the correct answer consistently across several attempts, rather than occasionally succeeding on a task type. This mirrors how we evaluate people, where we verify that they can truly solve the problem. Self-reported.
50.2%
SimpleQA
Accuracy: AI models are often expected to be highly accurate or even infallible. This expectation sometimes results in excessive trust in AI responses, commonly known as "automation bias." We might observe a system exhibiting various behaviors related to accuracy: 1. Verifiably correct outputs: the system provides answers that can be verified as correct through external sources or mathematical proof. 2. Misinformation: the system confidently states incorrect information as fact, possibly due to training data containing inaccuracies, hallucinations (generating plausible-sounding but false information), or temporal limitations (outdated knowledge cutoff). 3. Self-correction: the system demonstrates the ability to identify when it makes mistakes, correct its own errors when presented with new information, and acknowledge uncertainty appropriately. 4. Uncertainty handling: how well the system expresses appropriate confidence levels, admits knowledge limitations, avoids overconfidence on incorrect answers, and provides appropriate caveats. For analysis purposes, we can evaluate a system's accuracy across different knowledge domains (e.g., mathematics, history, current events) and task types (factual recall, reasoning, prediction). Self-reported.
54.3%
SuperGPQA
Accuracy (self-reported)
62.6%
Tau2 airline
Accuracy: whether a statement is factually correct, determined by matching the statement against the actual facts. Even if a statement contains approximations or simplifications, its factual content should be exact. A statement is marked inaccurate when it contains factually wrong information, values that differ substantially from the correct ones, or an incorrect connection between facts; otherwise it is evaluated as accurate. If accuracy cannot be determined (due to missing knowledge or information), it is marked accordingly. Self-reported.
44.0%
Tau2 retail
Accuracy (self-reported)
71.3%
WritingBench
Accuracy (self-reported)
85.2%
ZebraLogic
Accuracy (self-reported)
95.0%
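The "Average Score" of 72.1% in Key Specifications is consistent with the unweighted mean of the 25 benchmark scores listed above, as a quick check shows:

```python
# The 25 self-reported benchmark scores from the table above (in %).
scores = [77.5, 57.3, 70.3, 41.8, 79.2, 70.9, 87.5, 84.3, 55.4,
          88.7, 79.5, 75.4, 51.8, 83.0, 79.4, 93.1, 77.5, 87.9,
          50.2, 54.3, 62.6, 44.0, 71.3, 85.2, 95.0]
average = sum(scores) / len(scores)
print(f"{average:.1f}%")  # 72.1%
```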

License & Metadata

License
Apache 2.0
Announcement Date
July 22, 2025
Last Updated
August 3, 2025

Similar Models

All Models

Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.