Granite 3.3 8B Instruct
Granite 3.3 8B Instruct is an instruction-tuned language model from IBM's Granite family with 8 billion parameters. It is optimized for enterprise use cases including document summarization, code generation, and conversational AI, with strong instruction-following capabilities and safety alignment.
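As a sketch of typical usage: the snippet below loads the model with Hugging Face `transformers` and its chat template. The repo id `ibm-granite/granite-3.3-8b-instruct` and the exact loading flow are assumptions based on how other Granite instruct models are published; verify them against the official model card.

```python
# Hypothetical usage sketch for an instruction-tuned Granite model.
# The repo id below is an assumption; verify it on Hugging Face.
MODEL_ID = "ibm-granite/granite-3.3-8b-instruct"

def build_messages(user_prompt: str) -> list[dict]:
    """Wrap a prompt in the chat-message format expected by apply_chat_template."""
    return [{"role": "user", "content": user_prompt}]

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # transformers is imported lazily so the helper above works without it installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

print(build_messages("Summarize this quarterly report in three bullet points."))
```

The lazy import keeps the message-building helper usable in environments without the (large) `transformers` dependency; only calling `generate` requires the model weights.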
Key Specifications
Parameters
8.0B
Context
-
Release Date
April 16, 2025
Average Score
69.8%
Timeline
Key dates in the model's history
Announcement
April 16, 2025
Last Update
July 19, 2025
Technical Specifications
Parameters
8.0B
Training Tokens
-
Knowledge Cutoff
April 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
Not specified • Self-reported
TruthfulQA
Not specified • Self-reported
Programming
Programming skills tests
HumanEval
Not specified (OLMES) • Self-reported
Mathematics
Mathematical problems and computations
GSM8k
Not specified (OLMES) • Self-reported
Reasoning
Logical reasoning and analysis
BIG-Bench Hard
Not specified (OLMES) • Self-reported
DROP
Not specified (OLMES) • Self-reported
Other Tests
Specialized benchmarks
AIME 2024
Not specified • Self-reported
AlpacaEval 2.0
Not specified • Self-reported
Arena Hard
Not specified • Self-reported
AttaQ
Not specified (OLMES) • Self-reported
HumanEval+
Not specified (OLMES) • Self-reported
IFEval
Not specified (OLMES) • Self-reported
MATH-500
Not specified • Self-reported
PopQA
Not specified • Self-reported
License & Metadata
License
Apache 2.0
Announcement Date
April 16, 2025
Last Updated
July 19, 2025
Similar Models
Granite 3.3 8B Base
IBM
Multimodal · 8.2B
Best score: 0.9 (HumanEval)
Released: Apr 2025
IBM Granite 4.0 Tiny Preview
IBM
7.0B
Best score: 0.8 (HumanEval)
Released: May 2025
Phi-4-multimodal-instruct
Microsoft
Multimodal · 5.6B
Released: Feb 2025
Price: $0.05/1M tokens
Gemini 1.5 Flash 8B
Multimodal · 8.0B
Best score: 0.4 (GPQA)
Released: Mar 2024
Price: $0.07/1M tokens
Gemma 3n E2B
Multimodal · 8.0B
Best score: 0.5 (ARC)
Released: Jun 2025
MedGemma 4B IT
Multimodal · 4.3B
Released: May 2025
Gemma 3 4B
Multimodal · 4.0B
Best score: 0.7 (HumanEval)
Released: Mar 2025
Price: $0.02/1M tokens
Gemma 3n E4B Instructed LiteRT Preview
Multimodal · 1.9B
Best score: 0.8 (HumanEval)
Released: May 2025
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.
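The similarity criteria above can be sketched as a simple scoring function. The weights, field names, and example values below are illustrative assumptions, not the catalog's actual recommendation formula.

```python
# Hedged sketch of a similarity heuristic over the four listed criteria:
# developer organization, multimodality, parameter size, benchmark performance.
from dataclasses import dataclass

@dataclass
class ModelInfo:
    developer: str
    multimodal: bool
    params_b: float      # parameter count, in billions
    best_score: float    # best benchmark score, in [0, 1]

def similarity(a: ModelInfo, b: ModelInfo) -> float:
    """Score in [0, 1]; higher means more similar across the four criteria."""
    dev = 1.0 if a.developer == b.developer else 0.0
    modal = 1.0 if a.multimodal == b.multimodal else 0.0
    # Relative closeness of parameter counts and of benchmark scores.
    size = 1.0 - abs(a.params_b - b.params_b) / max(a.params_b, b.params_b)
    score = 1.0 - abs(a.best_score - b.best_score)
    return (dev + modal + size + score) / 4.0

# Illustrative entries loosely based on the list above.
granite = ModelInfo("IBM", False, 8.0, 0.9)
base    = ModelInfo("IBM", True, 8.2, 0.9)
gemma   = ModelInfo("Google", True, 4.0, 0.7)

print(similarity(granite, base))
print(similarity(granite, gemma))
```

With equal weights, a same-developer model of nearly identical size ranks well above a smaller model from a different organization, which matches the ordering shown in the list above.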