Granite 3.3 8B Instruct
Granite 3.3 8B Instruct is an instruction-tuned language model from IBM's Granite family with 8 billion parameters. It is optimized for enterprise use cases including document summarization, code generation, and conversational AI, with strong instruction-following capabilities and safety alignment.
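As a sketch of typical usage: the snippet below loads the model with Hugging Face `transformers` and its chat template. The repo id `ibm-granite/granite-3.3-8b-instruct` and the exact loading flow are assumptions based on how other Granite instruct models are published; verify them against the official model card.

```python
# Hypothetical usage sketch for an instruction-tuned Granite model.
# The repo id below is an assumption; verify it on Hugging Face.
MODEL_ID = "ibm-granite/granite-3.3-8b-instruct"

def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a prompt in the chat-message format expected by apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # transformers is imported lazily so the helper above works without it installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(build_messages("Summarize this quarterly report in three bullet points."))
```

The lazy import keeps the message-building helper usable in environments without the (large) `transformers` dependency; only calling `generate` requires the model weights.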
Key Specifications
Parameters
8.0B
Context
-
Release Date
April 16, 2025
Average Score
69.8%
Timeline
Key dates in the model's history
Announcement
April 16, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
8.0B
Training Tokens
-
Knowledge Cutoff
April 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Not specified • Self-reported
TruthfulQA
Not specified • Self-reported
Programming
Programming skills tests
HumanEval
Not specified (OLMES) • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Not specified (OLMES) • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
Not specified (OLMES) • Self-reported
DROP
Not specified (OLMES) • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Not specified • Self-reported
AlpacaEval 2.0
Not specified • Self-reported
Arena Hard
Not specified • Self-reported
AttaQ
Not specified (OLMES) • Self-reported
HumanEval+
Not specified (OLMES) • Self-reported
IFEval
Not specified (OLMES) • Self-reported
MATH-500
Not specified • Self-reported
PopQA
Not specified • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
April 16, 2025
Last Updated
July 19, 2025
Similar Models
Granite 3.3 8B Base
IBM
Multimodal · 8.2B
Best score: 0.9 (HumanEval)
Released: Apr 2025
IBM Granite 4.0 Tiny Preview
IBM
7.0B
Best score: 0.8 (HumanEval)
Released: May 2025
Phi-4-multimodal-instruct
Microsoft
Multimodal · 5.6B
Released: Feb 2025
Price: $0.05/1M tokens
Gemini 1.5 Flash 8B
Multimodal · 8.0B
Best score: 0.4 (GPQA)
Released: Mar 2024
Price: $0.07/1M tokens
Gemma 3n E2B
Multimodal · 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
MedGemma 4B IT
Multimodal · 4.3B
Released: May 2025
Gemma 3 4B
Multimodal · 4.0B
Best score: 0.7 (HumanEval)
Released: Mar 2025
Price: $0.02/1M tokens
Gemma 3n E4B Instructed LiteRT Preview
Multimodal · 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
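The similarity criteria above can be sketched as a simple scoring function. The weights, field names, and example values below are illustrative assumptions, not the catalog's actual recommendation formula.

```python
# Hedged sketch of a similarity heuristic over the four listed criteria:
# developer organization, multimodality, parameter size, benchmark performance.
from dataclasses import dataclass

@dataclass
class ModelInfo:
    developer: str
    multimodal: bool
    params_b: float      # parameter count, in billions
    best_score: float    # best benchmark score, in [0, 1]

def similarity(a: ModelInfo, b: ModelInfo) -> float:
    """Score in [0, 1]; higher means more similar across the four criteria."""
    dev = 1.0 if a.developer == b.developer else 0.0
    modal = 1.0 if a.multimodal == b.multimodal else 0.0
    # Relative closeness of parameter counts and of benchmark scores.
    size = 1.0 - abs(a.params_b - b.params_b) / max(a.params_b, b.params_b)
    score = 1.0 - abs(a.best_score - b.best_score)
    return (dev + modal + size + score) / 4.0

# Illustrative entries loosely based on the list above.
granite = ModelInfo("IBM", False, 8.0, 0.9)
base    = ModelInfo("IBM", True, 8.2, 0.9)
gemma   = ModelInfo("Google", True, 4.0, 0.7)

print(similarity(granite, base))
print(similarity(granite, gemma))
```

With equal weights, a same-developer model of nearly identical size ranks well above a smaller model from a different organization, which matches the ordering shown in the list above.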