Key Specifications
Parameters
14.7B
Context
16.0K
Release Date
December 12, 2024
Average Score
66.0%
Timeline
Key dates in the model's history
Announcement
December 12, 2024
Last Update
July 19, 2025
Technical Specifications
Parameters
14.7B
Training Tokens
9.8T tokens
Knowledge Cutoff
June 1, 2024
Family
-
Capabilities
Multimodal, ZeroEval
Pricing & Availability
Input (per 1M tokens)
$0.07
Output (per 1M tokens)
$0.14
Max Input Tokens
16.0K
Max Output Tokens
16.0K
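At the listed rates, per-request cost is simple arithmetic. A minimal sketch (the helper function is illustrative, not part of any provider SDK):

```python
# Estimate the cost of one request from per-million-token prices.
INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.14  # USD per 1M output tokens

def request_cost(input_tokens, output_tokens):
    """Return the USD cost of a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A full-context call (16K tokens in, 16K out) costs well under a cent.
print(round(request_cost(16_000, 16_000), 6))  # 0.00336
```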
Supported Features
Function Calling, Structured Output, Code Execution, Web Search, Batch Inference, Fine-tuning
Benchmark Results
Model performance metrics across various tests and benchmarks
General Knowledge
Tests on general knowledge and understanding
MMLU
simple-evals: a lightweight approach to evaluating the language abilities of language models and comparing models. Its principles:
- Simple setup and execution
- Tests that probe several different abilities
- Zero-shot prompting, without few-shot "data" in the assignments
- No need for complex verification of answers; the majority of answers can be checked mechanically

For this it aims to provide:
- Evaluation of abilities and comparison across models
- A picture of what a model can and cannot do

Test categories:
- **Reasoning**: tests of basic thinking and reasoning ability
- **Mathematics**: tests of mathematical ability at various levels of complexity
- **Coding**: tests of the ability to write and run code
- **Knowledge**: tests of factual knowledge
• Self-reported
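Mechanical answer checking of the kind described above can be sketched for a multiple-choice item; the question, regex, and mock model below are illustrative, not taken from any published harness:

```python
import re

# Score free-form completions against multiple-choice answers by
# extracting a final "Answer: X" letter, then computing accuracy.

def extract_answer(completion):
    m = re.search(r"Answer:\s*([A-D])", completion)
    return m.group(1) if m else None

def accuracy(questions, model):
    correct = sum(
        extract_answer(model(q["prompt"])) == q["answer"] for q in questions
    )
    return correct / len(questions)

questions = [
    {"prompt": "Capital of France? A) Berlin B) Paris C) Rome D) Madrid",
     "answer": "B"},
]
mock_model = lambda prompt: "Paris is the capital. Answer: B"
print(accuracy(questions, mock_model))  # 1.0
```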
Programming
Programming skills tests
HumanEval
simple-evals: a library for simple and automated evaluation of LLMs.

Installation:
```
pip install simple-evals
```

Example: automatic evaluation
```python
from simple_evals import AutoEvaluator

evaluator = AutoEvaluator(model_name="gpt-4-turbo-preview")
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Using a custom model:
```python
from simple_evals import AutoEvaluator, LLMRunnable

class CustomModel(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer to: " + prompt

evaluator = AutoEvaluator(model=CustomModel())
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Semi-automatic (pairwise) evaluation:
```python
from simple_evals import SemiAutoEvaluator

evaluator = SemiAutoEvaluator(
    model_a_name="gpt-4-turbo-preview",
    model_b_name="gpt-3.5-turbo",
)
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Pairwise evaluation with custom models:
```python
from simple_evals import SemiAutoEvaluator, LLMRunnable

class CustomModelA(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer from model A: " + prompt

class CustomModelB(LLMRunnable):
    def run(self, prompt: str) -> str:
        # custom inference logic
        return "Answer from model B: " + prompt

evaluator = SemiAutoEvaluator(
    model_a=CustomModelA(),
    model_b=CustomModelB(),
)
questions = [
    "Who won the 2020 US presidential election?",
    "What is the capital of France?",
]
evaluator.evaluate(questions=questions)
```

Exporting results to CSV:
```python
evaluator.to_csv("results.csv")
```

Exporting results to JSON:
```python
evaluator.to_json("results.json")
```
• Self-reported
Mathematics
Mathematical problems and computations
MATH
simple-evals: a system for configuring LLM evaluations and judging model behavior from model output.

Key features:
- Configuration through YAML
- Various check types, including substring and pattern checks
- Metric computation over the results
- All interactions with the model are logged for analysis

simple-evals focuses on benchmarks that can be scored mechanically yet still say something meaningful about an LLM.

Installation:
```
pip install simple-evals
```

Usage: simple-evals is driven entirely by YAML. A config determines the evaluation, including the models and the evaluation method:
```yaml
name: Test Evaluation
description: Verification of basic capabilities
version: 0.1
models:
  - name: gpt-3.5-turbo
    type: openai
  - name: gpt-4
    type: openai
evaluator:
  type: simple
metrics:
  - accuracy
```

Then you define test suites. Each suite contains examples:
```yaml
name: Math Problems
description: Elementary mathematical tasks
examples:
  - name: addition
    input: What is 2+2?
    checks:
      - type: contains
        value: "4"
  - name: multiplication
    input: What is 7*8?
    checks:
      - type: contains
        value: "56"
```

Run the evaluation:
```
simple-evals run config.yaml
```

Capabilities: simple-evals supports:
- Output verification by means of `Check` rules
- Integration with various LLM providers (OpenAI, Anthropic, local models)
- Exporting raw data for deeper analysis

Simple verification is only the beginning: you can build complex evaluations with custom checks for each example in order to probe model behavior. • Self-reported
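The `contains` check used in the suite above amounts to a substring test; a minimal sketch of such a check runner (the function names and mock model are illustrative, not the library's actual API):

```python
# Minimal interpreter for a "contains"-style check over model output.
# The example dict mirrors the YAML suite above; everything is illustrative.

def run_check(check, output):
    if check["type"] == "contains":
        return check["value"] in output
    raise ValueError(f"unknown check type: {check['type']}")

def run_example(example, model):
    output = model(example["input"])
    return all(run_check(c, output) for c in example["checks"])

example = {
    "input": "What is 7*8?",
    "checks": [{"type": "contains", "value": "56"}],
}
mock_model = lambda prompt: "7*8 = 56"
print(run_example(example, mock_model))  # True
```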
MGSM
MGSM (Multilingual Grade School Math): grade-school math word problems translated into multiple languages, testing multilingual mathematical reasoning. • Self-reported
Reasoning
Logical reasoning and analysis
DROP
simple-evals: This tool evaluates the quality of several AI models on standard reasoning tests. We use it to track progress in LLM capabilities.

What our tests cover:
- Reasoning, logic, and head-scratcher-type tasks
- Reading and computation over passages
- Comprehension of language and science
- Mathematics, including word problems
- Tasks where picking out the relevant information matters, not just recalling it

Run the tests as follows:
1. Clone this repository
2. Run `python run_evals.py --help` to see the available options
3. For example: `python run_evals.py --model gpt-3.5-turbo --dataset mmlu_stem` runs an evaluation of GPT-3.5 on MMLU STEM

You can also plug in your own model by providing a wrapper that describes its API. • Self-reported
GPQA
# simple-evals

Tools and methods for evaluating LLMs, focused on ease of use, transparency, and scale.

## Design principles
1. **Simplicity:** Use standard formats (JSON, CSV) and avoid unnecessary complexity. In most cases a task has one prompt and one answer, scored as "correct/incorrect" or numerically.
2. **Transparency:** Log everything, including instructions, prompts, and answers, so that a run can be fully reproduced without guesswork.
3. **Scale:** Run over several datasets and several LLMs at once.

## Motivation
Existing toolkits offer complex machinery for evaluation, but they try to solve too many tasks at once and complicate simple scenarios. With simple-evals we can:
- Write and run tests quickly
- Debug evaluation errors by inspecting full prompts and data end to end
- Evaluate a set of models on shared data

## Installation and usage
Install via `pip install simple-evals` and use it like this:
```python
from simple_evals.eval import evaluate_tasks

results = evaluate_tasks(
    llms={
        "gpt-4-turbo": lambda x: call_openai("gpt-4-turbo", x),
        "claude-3-opus": lambda x: call_anthropic("claude-3-opus", x),
    },
    tasks={
        "gsm8k": lambda: get_gsm8k_tasks(20),
        "mmlu": lambda: get_mmlu_tasks(["physics", "chemistry"], 20),
    },
    system_message="You are an AI assistant; answer exactly.",
)
```

## API
### Evaluating tasks
- `evaluate_tasks(llms, tasks, system_message)`: runs several models over several task sets and returns results and metrics.

### QA tasks
- `binary_qa_task(question, answer)`: a task where the model must answer "correct" or "incorrect"
- `choice_qa_task(question, choices, answer)`: a multiple-choice task where the model must pick the correct answer

### Computing metrics
- `compute • Self-reported
Other Tests
Specialized benchmarks
Arena Hard
simple-evals: a library for simple, automated evaluation of LLM performance on various tasks and benchmarks. We use it for evaluating our own models and for regression testing. It currently includes the following benchmark modules:
* matheval: arithmetic, working with numbers, etc.
* mmlueval: an MMLU-style knowledge evaluation
* codeeval: programming tasks, etc.
* langeval: language understanding and generation
* reasoningeval: reasoning tasks
• Self-reported
HumanEval+
simple-evals: a set of tools for evaluating AI models (for example, GPT-4 or Claude) across various benchmarks.

What is simple-evals? simple-evals is a toolkit that computes evaluations for large language models. We provide a way to obtain answers from a set of models through various APIs, plus a set of tests for evaluating model capabilities.

Capabilities:
- Various evaluations, including GPQA and others
- Querying several models through the OpenAI API, Claude API, etc., as well as local models
- Utilities for analyzing and visualizing results
- Hooks for creating your own evaluations

```
pip install simple-evals
```

Usage: here is how you can use simple-evals to obtain answers from models on a set of questions:
```python
import simple_evals as se

# Run all GPQA questions against GPT-4
results = se.run_evals(
    eval_set="gpqa",
    models="gpt-4",
)

# Visualize the results
se.visualize(results)
```

Evaluation sets:
- GPQA: questions for evaluating graduate-level science knowledge
- Math: questions from MATH and GSM8K
- Coding tasks: HumanEval and MBPP
- Generation and evaluation of multi-step tasks: ARC, BBH, MMLU
- and others...

Supported model backends:
- OpenAI (GPT-4, GPT-3.5)
- Anthropic (Claude)
- Mistral and Mixtral
- Gemini
- Local models through vLLM, transformers, etc.

Extended use:
```python
# Run a set of tests on different models
results = se.run_evals(
    eval_sets=["gpqa", "gsm8k", "humaneval"],
    models=["gpt-4", "gpt-3.5-turbo", "claude-2"],
    max_samples=100,  # number of samples per eval
    cache=True,       # cache responses to save tokens
)

# Analyze the results
se.analyze
```
• Self-reported
IFEval
# simple-evals

simple-evals is a tool for evaluating and verifying LLMs that can be installed easily and adapted to various setups. It consists of a set of tools that evaluate a model across tasks.

## Key features
- Evaluate a model on diverse question-and-answer tasks with our built-in suites
- Ready-made evaluations for basic reasoning abilities, and/or tests for your own tasks
- Evaluation with the help of another model acting as the judge of the model under test
- Connect to models through a local API (vLLM) or a hosted API (OpenAI, Anthropic, etc.)

## Installation
```bash
pip install simple-evals
```

Install extras to use specific backends. For example:
```bash
pip install simple-evals[openai]       # for OpenAI models
pip install simple-evals[anthropic]    # for Anthropic models
pip install simple-evals[huggingface]  # for Hugging Face models
```

## Getting started
To get to work quickly, you can use the `simple-evals` CLI tool:
```bash
$ simple-evals run path/to/questions.jsonl --model gpt-4o
```

Or go through Python:
```python
from simple_evals import evaluator, backends, questions

# Configure the backend and load the tasks
backend = backends.OpenAIBackend(model="gpt-4o")
qs = questions.load_questions("path/to/questions.jsonl")

# Run the evaluation
results = evaluator.evaluate(qs, backend)

# Print summary information
evaluator.print_results_summary(results)
```

## Documentation
Full documentation at https://simple-evals.readthedocs.io/ • Self-reported
LiveBench
simple-evals: a framework for evaluating an LLM's instruction-following, using an LLM to judge the generations (optionally with examples). It scales from simple checks to the creation of custom tasks.

## How does it work?
1. A task defines a set of instructions and evaluation criteria
2. The candidate model generates an answer
3. A judge model (which can indeed be the same model) evaluates how well the answer matches the instructions and criteria
4. Results are reported in various formats

## Example uses
* Comparing models on specific tasks
* Iteratively improving prompts and methods
* Internal competitions

## Installation
```
pip install simple-evals
```

## Usage
```bash
# Run an evaluation
simple-evals evaluate --task-file samples/tasks/arithmetic.yaml

# Compare different prompts or models
simple-evals evaluate --task-file samples/tasks/arithmetic.yaml --comparison --runs 5
```
• Self-reported
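The judge loop described above can be sketched with deterministic stand-ins for both models; a real run would call an LLM API, and every name here is illustrative:

```python
# Judge-based evaluation: a task carries instructions and criteria,
# a candidate model answers, and a judge model scores the match.

def evaluate(task, candidate, judge):
    answer = candidate(task["instructions"])
    passed = judge(task["criteria"], answer)  # e.g. a pass/fail judgment
    return {"answer": answer, "passed": passed}

task = {
    "instructions": "Compute 12 * 12 and reply with the number only.",
    "criteria": "The reply must be exactly 144.",
}
candidate = lambda instructions: "144"
judge = lambda criteria, answer: answer.strip() == "144"
print(evaluate(task, candidate, judge))  # {'answer': '144', 'passed': True}
```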
MMLU-Pro
# Strengths, weaknesses, and limitations of attributes (SWOLA)

## Definition
The SWOLA method (Strengths, Weaknesses, and Limitations of Attributes) is an evaluation and analysis method for identifying the strong and weak sides of specific attributes, as well as their limitations. It is applied to form a structured understanding of a system.

## Methodology
SWOLA includes:
1. **Defining the attributes to evaluate**: the key properties or metrics
2. **Analyzing strengths**: the positive aspects of each attribute that confer advantages or performance
3. **Analyzing weaknesses**: shortcomings or areas requiring improvement for each attribute
4. **Evaluating limitations**: conditions or scenarios under which effectiveness or applicability can degrade
5. **Synthesizing the analysis**: combining the information to form a view of each attribute in context

## Applications
SWOLA is applied for:
- Evaluating machine learning models and their algorithms
- Comparative analysis of various approaches
- Identifying areas for research and improvement
- Decisions about adoption or deployment

## Advantages
- A structured frame for evaluation
- Allows one to obtain an understanding of capabilities and trade-offs

## Limitations
- Requires knowledge of the domain
- Can be superficial without a thorough look at specifics
- Effectiveness depends on completeness of understanding when interpreting results
• Self-reported
PhiBench
simple-evals: a library for working with and comparing large language models. It evaluates model performance based on checks coded against model output. You define tasks in a config (in YAML) with examples, and simple-evals automatically collects answers from each language model and scores them against your criteria. Results can be viewed inline or exported to CSV.

It supports various evaluation criteria:
- Picking the best answer from several models
- Scoring model answers against match criteria
- Checking whether a model's answer contains a given value
- Scoring based on computed metrics

Comparing models follows very simple instructions: define your prompt, collect answers from several models, set the comparison criteria, and review the results for decision-making. • Self-reported
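The best-of-several comparison described above can be sketched as follows; the models and the contains-style criterion are illustrative stand-ins, not the library's API:

```python
# Pairwise/best-of comparison: collect answers from several models for one
# prompt, score each against a criterion, and pick the highest-scoring model.

def compare(prompt, models, criterion):
    answers = {name: fn(prompt) for name, fn in models.items()}
    scores = {name: criterion(a) for name, a in answers.items()}
    best = max(scores, key=scores.get)
    return best, scores

models = {
    "model_a": lambda p: "The answer is 4.",
    "model_b": lambda p: "I am not sure.",
}
criterion = lambda answer: 1 if "4" in answer else 0  # contains-check
best, scores = compare("What is 2+2?", models, criterion)
print(best)  # model_a
```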
SimpleQA
simple-evals: evaluation methods for AI models; SimpleQA measures short-form factual accuracy. • Self-reported
License & Metadata
License
MIT
Announcement Date
December 12, 2024
Last Updated
July 19, 2025
Similar Models
Phi 4 Reasoning Plus
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi 4 Reasoning
Microsoft
14.0B
Best score: 0.9 (HumanEval)
Released: Apr 2025
Phi-3.5-MoE-instruct
Microsoft
60.0B
Best score: 0.9 (ARC)
Released: Aug 2024
Llama 3.1 70B Instruct
Meta
70.0B
Best score: 0.9 (ARC)
Released: Jul 2024
Price: $0.89/1M tokens
Hermes 3 70B
Nous Research
70.0B
Best score: 0.8 (MMLU)
Released: Aug 2024
Nemotron 3 Nano (30B A3B)
NVIDIA
32.0B
Best score: 0.8 (GPQA)
Released: Dec 2025
Price: $0.06/1M tokens
Llama 3.1 Nemotron 70B Instruct
NVIDIA
70.0B
Best score: 0.8 (MMLU)
Released: Oct 2024
Codestral-22B
Mistral AI
22.2B
Best score: 0.8 (HumanEval)
Released: May 2024
Price: $0.20/1M tokens
Recommendations are based on similarity of characteristics: developer organization, multimodality, parameter size, and benchmark performance. Choose a model to compare or go to the full catalog to browse all available AI models.