Alibaba Unveils Qwen3-235B-A22B-Thinking-2507: A New Leader in Open-Source Reasoning Models
Introduction
Over the past year, the Qwen family of models has steadily advanced from competitive general-purpose LLMs to state-of-the-art reasoning engines. Today, the Qwen team at Alibaba announces Qwen3-235B-A22B-Thinking-2507, the latest milestone that pushes open-source AI into territory previously dominated by closed commercial systems such as OpenAI o3 and Gemini-2.5 Pro.
With 235 billion total parameters (only 22 billion of which are active at any moment), the model combines large capacity with efficient inference through a Mixture-of-Experts (MoE) design. More importantly, it is among the first open models to be systematically optimized for “thinking”: an internal chain-of-thought stage that dramatically boosts accuracy on complex reasoning tasks.
Architecture in Plain Language
Sparse MoE backbone
128 expert sub-networks sit inside every layer. For each token, the router activates only 8 experts, keeping memory and compute under control while retaining the expressive power of a much larger dense model.
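To make the routing idea concrete, here is a minimal, illustrative sketch of top-k expert selection in PyTorch. The layer sizes, gating details, and the class name `TopKMoE` are assumptions for illustration, not the actual Qwen3 implementation.

```python
# Illustrative top-k MoE layer: 128 experts, 8 active per token.
# Dimensions and gating details are assumptions, not Qwen3's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=128, k=8):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: [tokens, d_model]
        scores = self.router(x)                            # [tokens, n_experts]
        top_scores, top_idx = scores.topk(self.k, dim=-1)  # choose 8 of the 128 experts
        weights = F.softmax(top_scores, dim=-1)            # renormalize over the chosen 8
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # only the selected experts run
            for e in top_idx[:, slot].unique().tolist():
                mask = top_idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Example: route 4 token embeddings through the sparse layer.
layer = TopKMoE()
tokens = torch.randn(4, 1024)
print(layer(tokens).shape)  # torch.Size([4, 1024])
```

In a production model the experts are trained jointly with load-balancing objectives and executed with fused parallel kernels; the double loop here is purely for readability.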
Long context
A native context window of 262,144 tokens (≈ 400 pages of text) lets the model ingest entire research papers or code bases in one pass.
Guided reasoning
A special chat template wraps every user prompt with an implicit <think> block. The model is forced to reason step by step before emitting the final answer. This technique, popularized by recent closed systems, is now available out-of-the-box and fully open.
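As a rough illustration of how this looks in practice, the sketch below uses the Hugging Face transformers chat-template API. The Hub model ID and generation settings are assumptions based on the release name and may differ from the official usage instructions.

```python
# Minimal generation sketch with transformers; model ID and settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-235B-A22B-Thinking-2507"  # assumed Hugging Face Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Prove that the sum of two odd integers is even."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=4096)
completion = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(completion)  # step-by-step reasoning appears first, closed by </think>, then the answer
```

Note that loading a 235 B-parameter checkpoint requires substantial multi-GPU memory, even though only 22 B parameters are active per token.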
Benchmark Highlights
| Category | Test | Qwen3-Thinking-2507 | Best Prior Open Model | Best Closed Model | Delta vs. Best Closed |
|---|---|---|---|---|---|
| Mathematics | AIME 2025 | 92.3 % | 81.5 % (Qwen3) | 92.7 % (OpenAI o4-mini) | −0.4 pp |
| Mathematics | HMMT 2025 | 83.9 % | 62.5 % | 82.5 % (Gemini-2.5 Pro) | +1.4 pp |
| Code | LiveCodeBench v6 | 74.1 % | 55.7 % | 72.5 % (Gemini-2.5 Pro) | +1.6 pp |
| Science | SuperGPQA | 64.9 % | 60.7 % | 62.3 % (Gemini-2.5 Pro) | +2.6 pp |
| Knowledge | MMLU-Pro | 84.4 % | 82.8 % | 85.9 % (OpenAI o3) | −1.5 pp |
| Long Context | HLE (text-only) | 18.2 % | 11.8 % | 21.6 % (Gemini-2.5 Pro) | −3.4 pp |
Note: Differences of ±2 pp are within standard evaluation noise. The key takeaway is that Qwen3-Thinking-2507 is now on par with—or exceeds—proprietary giants in nearly every reasoning discipline.
What “Thinking Mode” Means for Developers
When you query the model, it first produces an internal scratchpad. This scratchpad contains:
- Step-by-step derivations
- Self-correction loops
- References to premises or code snippets
Because the model is trained to expose its thought process, downstream applications can:
- Audit reasoning paths for safety or compliance (see the parsing sketch after this list).
- Continue generation from any intermediate step.
- Fine-tune on the scratchpad to specialize for narrower domains.
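For example, an auditing pipeline only needs to split the reasoning block from the final answer. The helper below is a minimal sketch that assumes the scratchpad is terminated by a literal </think> tag, as produced by the chat template described above.

```python
# Minimal sketch: separate the scratchpad from the final answer.
# Assumes the reasoning block ends with a literal </think> tag.
def split_thinking(completion: str) -> tuple[str, str]:
    """Return (scratchpad, answer); if no tag is present, treat everything as the answer."""
    marker = "</think>"
    if marker in completion:
        scratchpad, answer = completion.split(marker, 1)
        return scratchpad.strip(), answer.strip()
    return "", completion.strip()

sample = "2k+1 plus 2m+1 equals 2(k+m+1), which is even.</think>The sum of two odd integers is even."
scratchpad, answer = split_thinking(sample)
print(scratchpad)  # the model's internal derivation
print(answer)      # the user-facing answer
```

Fine-tuning, compliance review, or logging can then treat the scratchpad and the answer as separate artifacts.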
From Qwen-7B to World-Class Reasoner: Alibaba’s Trajectory
- Qwen-7B (2023) – An early bilingual open LLM that rivaled LLaMA-2.
- Qwen1.5-MoE (2024) – Demonstrated that sparse models can match dense 7 B-parameter performance at roughly 1/3 the cost.
- Qwen2 (mid-2024) – Introduced native 128 K context and strong multilingual coverage.
- Qwen3-Thinking-2507 (July 2025) – Closes the gap with proprietary frontier labs while staying fully open-source.
Each release has added measurable capability gains rather than marketing claims. With this latest model, Alibaba is no longer “catching up”—it is setting the pace for transparent, capable, and efficient reasoning systems.
Takeaway
Qwen3-235B-A22B-Thinking-2507 delivers research-grade reasoning in a downloadable checkpoint. Whether you are fine-tuning for scientific discovery, building a coding copilot, or auditing AI safety, the model offers:
- Transparent thought chains
- Accuracy competitive with frontier closed systems such as OpenAI o3 and Gemini-2.5 Pro
- Long-document understanding that rivals commercial APIs
Alibaba’s Qwen team has turned open-source AI from “good enough” into best in class.