Artificial Analysis Benchmarking Methodology

Scope

Artificial Analysis performs intelligence, quality, performance and price benchmarking on AI models and AI inference API endpoints. This section of our website describes our benchmarking methodology, covering both our quality benchmarking and our performance benchmarking.

For our language model benchmarking, we consider endpoints to be serverless when customers pay only for their usage, rather than a fixed rate for access to a system. Typically this means that endpoints are priced on a per-token basis, often with different prices for input and output tokens.

Across all modalities, our performance benchmarking measures the end-to-end performance experienced by customers of AI inference services. This means that benchmark results are not intended to represent the maximum possible performance on any particular hardware platform; rather, they are intended to represent the real-world performance customers experience across providers.

We benchmark both proprietary and open weights models.

Methodology Details

Definitions

On this page, and across the Artificial Analysis website, we use the following terms:

  • Model: A large language model (LLM), including proprietary, open source and open weights models.
  • Model Creator: The organization that developed and trained the model. For example, OpenAI is the creator of GPT-4 and Meta is the creator of Llama 3.
  • Endpoint: A hosted instance of a model that can be accessed via an API. A single model may have multiple endpoints across different providers.
  • Provider: A company that hosts and provides access to one or more model endpoints via an API. Examples include OpenAI, AWS Bedrock, Together.ai and more. Companies are often both Model Creators and Providers.
  • Serverless: A cloud service provided on an as-used basis; for LLM inference APIs, this generally means pricing per token of input and output. Serverless cloud products do still run on servers!
  • Open Weights: A model whose weights have been released publicly by the model's creator. We refer to 'open weights' or just 'open' models rather than 'open-source' as many open LLMs have been released with licenses that do not meet the full definition of open-source software.
  • Token: Modern LLMs are built around tokens - numerical representations of words and characters. LLMs take tokens as input and generate tokens as output. Input text is translated into tokens by a tokenizer. Different LLMs use different tokenizers.
  • OpenAI Tokens: Tokens as generated by OpenAI's GPT-3.5 and GPT-4 tokenizer, generally measured for Artificial Analysis benchmarking with OpenAI's tiktoken Python package (cl100k_base tokenizer). We use OpenAI tokens as a standard unit of measurement across Artificial Analysis to allow fair comparisons between models. All 'tokens per second' metrics refer to OpenAI tokens (see the token-counting sketch after this list).
  • Native Tokens: Tokens as generated by an LLM's own tokenizer. We refer to 'native tokens' to distinguish from 'OpenAI tokens'. Prices generally refer to native tokens.
  • Price (Input/Output): The price charged by a provider per input token sent to the model and per output token received from the model. Prices shown are the current prices listed by providers.
  • Price (Blended): To enable easier comparison, we calculate a blended price assuming a 3:1 ratio of input to output tokens.
    \text{Blended Price} = \frac{3 \times \text{Input Price} + \text{Output Price}}{4}
  • Latency: Time to First Token: The time in seconds between sending a request to the API and receiving the first token of the response. For reasoning models that return reasoning tokens, this is the time to the first reasoning token.
    \text{Time to First Token} = \text{Time of First Token Arrival} - \text{Time Request Sent}
  • Latency: Time to First Answer Token: The time in seconds between sending a request to the API and receiving the first answer token of the response. For reasoning models, this is measured after any 'thinking' time.
    \text{Time to First Answer Token} = \text{Input Processing Time} + \frac{\text{Avg. Reasoning Tokens}}{\text{Reasoning Output Speed}}
  • Output Speed (output tokens per second): The average number of tokens received per second, measured after the first token is received (see the streaming measurement sketch after this list).
    \text{Output Speed} = \frac{\text{Total Tokens} - \text{First Chunk Tokens}}{\text{Time of Final Token Chunk Received} - \text{Time of First Token Chunk Received}}
  • Total Response Time for 100 Output Tokens: The number of seconds to generate 100 output tokens, calculated synthetically from Time to First Token and Output Speed so that endpoints can be compared on a consistent basis (see the derived-metrics sketch after this list).
    \text{Total Response Time} = \text{Time to First Token} + \frac{100}{\text{Output Speed}}
  • End-to-End Response Time: The total time to receive a complete response, including input processing time, model reasoning time, and answer generation time (assuming 500 answer output tokens, per the formula below).
    \text{End-to-End Response Time} = \text{Input Processing Time} + \frac{\text{Avg. Reasoning Tokens}}{\text{Reasoning Output Speed}} + \frac{500}{\text{Answer Output Speed}}
  • Average Reasoning Tokens: The average number of 'reasoning' tokens a reasoning model outputs before providing an answer, calculated across a diverse set of 60 prompts. Where the average number of reasoning tokens is not available or has not yet been calculated, we assume 2k reasoning tokens. These prompts are of varied lengths and cover a range of topics, including personal queries, commercial queries, coding, math, science and others. Prompts are a combination of prompts written by Artificial Analysis and prompts sourced from the following evaluations: MMLU Pro, AIME 2024, HumanEval, and LiveCodeBench. These prompts can be accessed here.
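
The token-counting sketch below illustrates how 'OpenAI tokens' can be counted with the tiktoken package's cl100k_base encoding, as referenced in the OpenAI Tokens definition. It is a minimal sketch rather than our benchmarking code, and the sample text is an arbitrary placeholder.

```python
# Minimal sketch: counting 'OpenAI tokens' with tiktoken's cl100k_base encoding.
# Requires: pip install tiktoken
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # the GPT-3.5 / GPT-4 tokenizer

text = "Artificial Analysis benchmarks AI models and API endpoints."  # placeholder text
tokens = encoding.encode(text)

print(f"{len(tokens)} OpenAI tokens")  # the standard unit used across our metrics
```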
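To make the latency and speed definitions concrete, the streaming measurement sketch below shows one way to time a streaming response and compute Time to First Token and Output Speed as defined above. It is not our benchmarking harness: it assumes an OpenAI-compatible chat completions endpoint accessed via the openai Python SDK, and the model name and prompt are placeholders.

```python
# Illustrative sketch (not our benchmarking harness): timing a streaming response to
# estimate Time to First Token (TTFT) and Output Speed per the definitions above.
# Assumes an OpenAI-compatible endpoint and the openai SDK (pip install openai tiktoken).
import time

import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
encoding = tiktoken.get_encoding("cl100k_base")  # OpenAI tokens as the unit of measurement

request_sent = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Write a short paragraph about benchmarking."}],
    stream=True,
)

first_chunk_time = None
last_chunk_time = None
first_chunk_tokens = 0
text = ""

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if not delta:
        continue
    now = time.perf_counter()
    if first_chunk_time is None:
        first_chunk_time = now
        first_chunk_tokens = len(encoding.encode(delta))
    last_chunk_time = now
    text += delta

total_tokens = len(encoding.encode(text))

# Time to First Token = Time of First Token Arrival - Time Request Sent
ttft = first_chunk_time - request_sent
# Output Speed excludes the first chunk, measured between first and final chunk arrival
output_speed = (total_tokens - first_chunk_tokens) / (last_chunk_time - first_chunk_time)

print(f"TTFT: {ttft:.2f} s | Output Speed: {output_speed:.1f} OpenAI tokens/s")
```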
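Similarly, the derived-metrics sketch below expresses Blended Price, Total Response Time for 100 Output Tokens, and End-to-End Response Time as simple functions of the quantities defined above. The numeric values in the example calls are illustrative placeholders, not measured results.

```python
# Sketch of the derived metrics defined above, using illustrative placeholder numbers.

def blended_price(input_price: float, output_price: float) -> float:
    """Blended price, assuming a 3:1 ratio of input to output tokens."""
    return (3 * input_price + output_price) / 4


def total_response_time(ttft: float, output_speed: float, output_tokens: int = 100) -> float:
    """Seconds to generate `output_tokens` output tokens, given TTFT and Output Speed."""
    return ttft + output_tokens / output_speed


def end_to_end_response_time(
    input_processing_time: float,
    avg_reasoning_tokens: float,
    reasoning_output_speed: float,
    answer_output_speed: float,
    answer_tokens: int = 500,
) -> float:
    """Input processing time + reasoning time + time to generate the answer tokens."""
    return (
        input_processing_time
        + avg_reasoning_tokens / reasoning_output_speed
        + answer_tokens / answer_output_speed
    )


# Example values (placeholders only)
print(blended_price(input_price=3.0, output_price=15.0))    # blended price per token unit
print(total_response_time(ttft=0.45, output_speed=80.0))    # seconds for 100 output tokens
print(end_to_end_response_time(0.45, 2000, 80.0, 80.0))     # seconds, assuming 2k reasoning tokens
```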