
Granite 3.3 8B Base

Multimodal
IBM

Granite 3.3 8B Base is a foundational language model from IBM's Granite family with 8 billion parameters. This is the pre-trained base model prior to instruction tuning, suitable for fine-tuning on domain-specific tasks. It demonstrates strong capabilities in language understanding, reasoning, and knowledge retrieval.
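Since the card notes the base model is intended as a starting point for fine-tuning, a minimal loading sketch may help. This assumes the Hugging Face `transformers` library and a repository id following IBM's `ibm-granite/...` naming convention; verify the exact id on the Hugging Face Hub before use.

```python
# Hedged sketch: loading the base model with Hugging Face transformers
# as a starting point for domain-specific fine-tuning.
# The repo id is an assumption based on IBM's naming convention.
MODEL_ID = "ibm-granite/granite-3.3-8b-base"  # assumed Hub repo id


def load_base_model(model_id: str = MODEL_ID):
    """Load tokenizer and model. Base (non-instruct) models take plain
    text prompts and have no chat template. transformers is imported
    lazily so this module can be inspected without the dependency."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_base_model()
    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

From here, fine-tuning would typically proceed with a standard causal-language-modeling objective on domain text.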

Key Specifications

Parameters
8.2B
Context
-
Release Date
April 16, 2025
Average Score
64.3%

Timeline

Key dates in the model's history
Announcement
April 16, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
8.2B
Training Tokens
-
Knowledge Cutoff
April 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
HellaSwag
Self-reported
80.1%
MMLU
Self-reported
63.9%
TruthfulQA
Self-reported
52.1%
Winogrande
Self-reported
74.4%

Programming

Programming skills tests
HumanEval
Self-reported (OLMES)
89.7%

Mathematics

Mathematical problems and computations
GSM8k
Self-reported
59.0%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Self-reported (OLMES)
69.1%
DROP
Self-reported
36.1%

Other Tests

Specialized benchmarks
AGIEval
Self-reported
49.3%
AIME 2024
Self-reported
81.2%
AlpacaEval 2.0
Self-reported
62.7%
ARC-C
Self-reported
50.8%
Arena Hard
Self-reported
57.6%
AttaQ
Self-reported (OLMES)
88.5%
HumanEval+
Self-reported (OLMES)
86.1%
IFEval
Self-reported (OLMES)
74.8%
MATH-500
Self-reported
69.0%
NQ
Self-reported
36.5%
PopQA
Self-reported
26.2%
TriviaQA
Self-reported
78.2%
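The "Average Score" of 64.3% reported above appears to be the simple arithmetic mean of the twenty benchmark scores listed on this page. A quick sketch to check that (benchmark names and values copied verbatim from the table above):

```python
# Verify that the reported "Average Score" (64.3%) is the plain mean of
# the twenty self-reported benchmark scores listed on this page.
scores = {
    "HellaSwag": 80.1, "MMLU": 63.9, "TruthfulQA": 52.1, "Winogrande": 74.4,
    "HumanEval": 89.7, "GSM8k": 59.0, "BIG-Bench Hard": 69.1, "DROP": 36.1,
    "AGIEval": 49.3, "AIME 2024": 81.2, "AlpacaEval 2.0": 62.7, "ARC-C": 50.8,
    "Arena Hard": 57.6, "AttaQ": 88.5, "HumanEval+": 86.1, "IFEval": 74.8,
    "MATH-500": 69.0, "NQ": 36.5, "PopQA": 26.2, "TriviaQA": 78.2,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # → 64.3%
```

This reproduces the headline figure exactly, which suggests the page's average is an unweighted mean rather than a weighted or normalized aggregate.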

License & Metadata

License
Apache 2.0
Announcement Date
April 16, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.