Why are my benchmark results so different from the MTEB leaderboard?

knpwrs · September 11, 2025, 5:43pm

I put together my own retrieval benchmark to test various embedding models: LetsChurch/bible-embeddings · Datasets at Hugging Face

The basic idea is that I want to retrieve Bible verses for a given query. I have a list of 532 queries, each with one or more expected verses. A model gets 3 points if it ranks an expected verse first, 2 points if second, 1 point if third, and no points otherwise (so the maximum possible score is 532 × 3 = 1596 points).
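For reference, the scoring rule can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the function names and the sample verses are hypothetical, and it assumes that only the best-ranked expected verse counts for each query (which is what a 3-point-per-query maximum implies).

```python
from typing import Iterable, List, Sequence, Tuple

# Points awarded for an expected verse at rank 1, 2, and 3 respectively.
POINTS_BY_RANK = (3, 2, 1)


def score_query(ranked: Sequence[str], expected: Iterable[str]) -> int:
    """Return the points for one query.

    Assumption: only the highest-ranked expected verse counts, so a
    single query can contribute at most 3 points.
    """
    expected_set = set(expected)
    for rank, verse in enumerate(ranked[: len(POINTS_BY_RANK)]):
        if verse in expected_set:
            return POINTS_BY_RANK[rank]
    return 0


def score_benchmark(
    results: List[Tuple[Sequence[str], Iterable[str]]]
) -> Tuple[int, int]:
    """Return (points earned, maximum possible points) over all queries."""
    total = sum(score_query(ranked, expected) for ranked, expected in results)
    return total, POINTS_BY_RANK[0] * len(results)


# Hypothetical results: (verses ranked by the model, expected verses).
runs = [
    (["John 3:16", "Romans 5:8", "1 John 4:9"], {"John 3:16"}),     # rank 1
    (["Psalm 23:4", "Psalm 23:1", "John 10:11"], {"Psalm 23:1"}),   # rank 2
    (["Genesis 1:2", "Genesis 1:3", "John 1:1"], {"Genesis 1:1"}),  # miss
]
total, max_total = score_benchmark(runs)
print(f"{total}/{max_total} points")  # prints "5/9 points"
```

With 532 queries the same function gives a maximum of 1596 points.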

OpenAI’s text-embedding-3-large and text-embedding-3-small top my benchmark, followed by Gemini text-embedding-004, followed by OpenAI’s last-gen ada embedding model, followed finally by GTE Large, and then other models.

The complete ranking, as of today:

text-embedding-3-large Accuracy: 89.3% (1348/1596 points across 532 queries)
text-embedding-3-small Accuracy: 81% (1228/1596 points across 532 queries)
text-embedding-004 Accuracy: 80.3% (1214/1596 points across 532 queries)
text-embedding-ada-002 Accuracy: 78.3% (1186/1596 points across 532 queries)
thenlper-gte-large Accuracy: 73.1% (1105/1596 points across 532 queries)
thenlper-gte-base Accuracy: 72.5% (1096/1596 points across 532 queries)
intfloat-e5-large-v2 Accuracy: 72.1% (1085/1596 points across 532 queries)
Qwen-Qwen3-Embedding-8B Accuracy: 71.5% (1077/1596 points across 532 queries)
nomic-ai-nomic-embed-text-v1.5 Accuracy: 70.0% (1058/1596 points across 532 queries)
intfloat-e5-base-v2 Accuracy: 69.4% (1049/1596 points across 532 queries)
thenlper-gte-small Accuracy: 69.2% (1047/1596 points across 532 queries)
Salesforce-SFR-Embedding-Mistral Accuracy: 69.0% (1039/1596 points across 532 queries)
ibm-granite-granite-embedding-125m-english Accuracy: 68.5% (1039/1596 points across 532 queries)
voyage-3 Accuracy: 67.0% (1015/1596 points across 532 queries)
ibm-granite-granite-embedding-30m-english Accuracy: 65.4% (996/1596 points across 532 queries)
ibm-granite-granite-embedding-278m-multilingual Accuracy: 64.7% (979/1596 points across 532 queries)
intfloat-e5-small-v2 Accuracy: 63.8% (966/1596 points across 532 queries)
ibm-granite-granite-embedding-107m-multilingual Accuracy: 61.3% (929/1596 points across 532 queries)
sentence-transformers-all-MiniLM-L6-v2 Accuracy: 59.6% (910/1596 points across 532 queries)
Qwen-Qwen3-Embedding-4B Accuracy: 58.9% (885/1596 points across 532 queries)
BAAI-bge-base-en Accuracy: 56.9% (861/1596 points across 532 queries)
Qwen-Qwen3-Embedding-0.6B Accuracy: 56.8% (855/1596 points across 532 queries)
BAAI-bge-large-en Accuracy: 47.4% (701/1596 points across 532 queries)
google-embeddinggemma-300m Accuracy: 46.4% (708/1596 points across 532 queries)
BAAI-bge-small-en Accuracy: 43.4% (652/1596 points across 532 queries)
answerdotai-ModernBERT-base Accuracy: 7.7% (116/1596 points across 532 queries)
answerdotai-ModernBERT-large Accuracy: 7.1% (110/1596 points across 532 queries)

Meanwhile, if we sort the current MTEB leaderboard for retrieval, Qwen 8B and 4B are in spots 1 and 2 (vs spots 8 and 20 for my benchmark), and my top two ranked models (OpenAI) appear in MTEB at spots 14 and 40, respectively.

So what’s going on? I’m completely open to the answer that my methodology is wrong, or that my example queries aren’t reflective of what the benchmark is measuring, or anything else, really.

John6666 · September 12, 2025, 12:21pm

Even the slightest change in the dataset used for benchmarking or the options passed to the model will likely produce different results. However, the reason the results vary so significantly here might be that the MTEB benchmark incorporates several design choices to ensure consistency.

knpwrs · September 12, 2025, 1:39pm

Wow, that’s a very comprehensive response. Thank you, @John6666! I definitely have some stuff to dig into now.

knpwrs · September 12, 2025, 4:50pm

I just reran the Qwen3 queries, changing nothing other than setting prompt="query", and the 0.6B model went from 56.8% to 73.9%, 4B went from 58.9% to 72.4%, and 8B went from 71.5% to 82.8% (outperforming the small model from OpenAI).
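For anyone wondering what a query prompt actually changes: instruction-tuned embedding models such as Qwen3-Embedding expect an instruction prefix on the query side, while documents are embedded as plain text. The sketch below only builds the input strings. The helper names are hypothetical, and the template and default task text reflect my understanding of the Qwen3-Embedding model card, so treat them as assumptions and check the model's own configuration for the exact wording.

```python
from typing import List, Tuple

# Assumed default task description (taken from my reading of the
# Qwen3-Embedding model card; verify against the model you use).
DEFAULT_TASK = (
    "Given a web search query, retrieve relevant passages that answer the query"
)


def format_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Prefix a query with the instruction template (assumed format)."""
    return f"Instruct: {task}\nQuery:{query}"


def prepare_inputs(
    queries: List[str], documents: List[str]
) -> Tuple[List[str], List[str]]:
    """Queries get the instruction prefix; documents are left untouched."""
    return [format_query(q) for q in queries], list(documents)


# Hypothetical example inputs.
query_texts, doc_texts = prepare_inputs(
    ["Where does the Bible talk about loving your enemies?"],
    ["But I say unto you, Love your enemies..."],
)
print(query_texts[0])
print(doc_texts[0])
```

If you encode with Sentence Transformers, the library can apply a prompt stored in the model's configuration for you, for example via the `prompt_name` argument of `encode`, so the manual prefixing above is only needed when you build the inputs yourself.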

This is without touching my ranking methodology (yet). This is phenomenal. Thank you again, @John6666!
