
Granite 3.3 8B Instruct

Multimodal
IBM

Granite 3.3 8B Instruct is an instruction-tuned language model from IBM's Granite family with 8 billion parameters. It is optimized for enterprise use cases including document summarization, code generation, and conversational AI, with strong instruction-following capabilities and safety alignment.

Key Specifications

Parameters
8.0B
Context
-
Release Date
April 16, 2025
Average Score
69.8%

Timeline

Key dates in the model's history
Announcement
April 16, 2025
Last Update
July 19, 2025

Technical Specifications

Parameters
8.0B
Training Tokens
-
Knowledge Cutoff
April 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval

Benchmark Results

Model performance metrics across various tests and benchmarks

General Knowledge

Tests on general knowledge and understanding
MMLU
Self-reported
65.5%
TruthfulQA
Self-reported
66.9%

Programming

Programming skills tests
HumanEval
Self-reported
89.7%

Mathematics

Mathematical problems and computations
GSM8k
Self-reported
80.9%

Reasoning

Logical reasoning and analysis
BIG-Bench Hard
Self-reported
69.1%
DROP
Self-reported
59.4%

Other Tests

Specialized benchmarks
AIME 2024
Self-reported
81.2%
AlpacaEval 2.0
Self-reported
62.7%
Arena Hard
Self-reported
57.6%
AttaQ
Self-reported
88.5%
HumanEval+
Self-reported
86.1%
IFEval
Self-reported
74.8%
MATH-500
Self-reported
69.0%
PopQA
Self-reported
26.2%
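The "Average Score" of 69.8% reported in Key Specifications is consistent with a plain arithmetic mean of the 14 self-reported benchmark scores listed above. A quick sanity check (benchmark names and values copied from the tables on this page; the unweighted-mean formula is an assumption about how the page computes its average):

```python
# Self-reported benchmark scores listed on this page (percent).
scores = {
    "MMLU": 65.5,
    "TruthfulQA": 66.9,
    "HumanEval": 89.7,
    "GSM8k": 80.9,
    "BIG-Bench Hard": 69.1,
    "DROP": 59.4,
    "AIME 2024": 81.2,
    "AlpacaEval 2.0": 62.7,
    "Arena Hard": 57.6,
    "AttaQ": 88.5,
    "HumanEval+": 86.1,
    "IFEval": 74.8,
    "MATH-500": 69.0,
    "PopQA": 26.2,
}

# Unweighted mean across all listed benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}%")  # → 69.8%
```

Note that the mean weights every benchmark equally, so the single low PopQA result (26.2%) pulls the average down noticeably.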

License & Metadata

License
Apache 2.0
Announcement Date
April 16, 2025
Last Updated
July 19, 2025

Similar Models


Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.