Why are my benchmark results so different from the MTEB leaderboard?

I put together my own retrieval benchmark to test various embedding models: LetsChurch/bible-embeddings (a dataset on Hugging Face).

The basic idea is I want to retrieve Bible verses for a given query. I have a list of 532 queries, each with one or more expected verses. If a model returns an expected verse first in rank, it gets 3 points; second in rank, 2 points; third in rank, 1 point; otherwise no points (so a model can score up to 532 × 3 = 1596 points).
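For anyone who wants to reproduce the scoring, here is a minimal sketch of the 3/2/1 point scheme described above. It assumes that only the best-ranked expected verse counts for each query (which is what the 1596-point maximum implies); the function names and the example verses are made up for illustration. How the Accuracy percentage in the ranking is derived is not stated in the post, so the sketch only totals points.

```python
from typing import Iterable, Sequence, Set, Tuple

# Points awarded for an expected verse at rank 1, 2, and 3.
POINTS_BY_RANK = (3, 2, 1)


def score_query(ranked: Sequence[str], expected: Set[str]) -> int:
    """Return the points for one query.

    Assumption: only the best-ranked expected verse within the
    top 3 results counts, so a query is worth at most 3 points.
    """
    for points, verse in zip(POINTS_BY_RANK, ranked):
        if verse in expected:
            return points
    return 0


def score_benchmark(results: Iterable[Tuple[Sequence[str], Set[str]]]) -> int:
    """Sum the per-query points over (ranked_results, expected_verses) pairs."""
    return sum(score_query(ranked, expected) for ranked, expected in results)


# Hypothetical example data, not taken from the actual benchmark.
example_results = [
    (["John 3:16", "John 3:17", "Romans 5:8"], {"John 3:16"}),         # hit at rank 1
    (["Psalm 23:2", "Psalm 23:1", "Psalm 23:4"], {"Psalm 23:1"}),      # hit at rank 2
    (["Genesis 1:2", "Genesis 1:3", "Genesis 1:4"], {"Genesis 1:1"}),  # miss
]
print(score_benchmark(example_results))  # 3 + 2 + 0 = 5 of a possible 9
```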

OpenAI’s text-embedding-3-large and text-embedding-3-small top my benchmark, followed by Gemini text-embedding-004, OpenAI’s last-gen ada embedding model, GTE Large, and then the other models.

The complete ranking, as of today:

  1. text-embedding-3-large Accuracy: 89.3% (1348/1596 points across 532 queries)
  2. text-embedding-3-small Accuracy: 81.0% (1228/1596 points across 532 queries)
  3. text-embedding-004 Accuracy: 80.3% (1214/1596 points across 532 queries)
  4. text-embedding-ada-002 Accuracy: 78.3% (1186/1596 points across 532 queries)
  5. thenlper-gte-large Accuracy: 73.1% (1105/1596 points across 532 queries)
  6. thenlper-gte-base Accuracy: 72.5% (1096/1596 points across 532 queries)
  7. intfloat-e5-large-v2 Accuracy: 72.1% (1085/1596 points across 532 queries)
  8. Qwen-Qwen3-Embedding-8B Accuracy: 71.5% (1077/1596 points across 532 queries)
  9. nomic-ai-nomic-embed-text-v1.5 Accuracy: 70.0% (1058/1596 points across 532 queries)
  10. intfloat-e5-base-v2 Accuracy: 69.4% (1049/1596 points across 532 queries)
  11. thenlper-gte-small Accuracy: 69.2% (1047/1596 points across 532 queries)
  12. Salesforce-SFR-Embedding-Mistral Accuracy: 69.0% (1039/1596 points across 532 queries)
  13. ibm-granite-granite-embedding-125m-english Accuracy: 68.5% (1039/1596 points across 532 queries)
  14. voyage-3 Accuracy: 67.0% (1015/1596 points across 532 queries)
  15. ibm-granite-granite-embedding-30m-english Accuracy: 65.4% (996/1596 points across 532 queries)
  16. ibm-granite-granite-embedding-278m-multilingual Accuracy: 64.7% (979/1596 points across 532 queries)
  17. intfloat-e5-small-v2 Accuracy: 63.8% (966/1596 points across 532 queries)
  18. ibm-granite-granite-embedding-107m-multilingual Accuracy: 61.3% (929/1596 points across 532 queries)
  19. sentence-transformers-all-MiniLM-L6-v2 Accuracy: 59.6% (910/1596 points across 532 queries)
  20. Qwen-Qwen3-Embedding-4B Accuracy: 58.9% (885/1596 points across 532 queries)
  21. BAAI-bge-base-en Accuracy: 56.9% (861/1596 points across 532 queries)
  22. Qwen-Qwen3-Embedding-0.6B Accuracy: 56.8% (855/1596 points across 532 queries)
  23. BAAI-bge-large-en Accuracy: 47.4% (701/1596 points across 532 queries)
  24. google-embeddinggemma-300m Accuracy: 46.4% (708/1596 points across 532 queries)
  25. BAAI-bge-small-en Accuracy: 43.4% (652/1596 points across 532 queries)
  26. answerdotai-ModernBERT-base Accuracy: 7.7% (116/1596 points across 532 queries)
  27. answerdotai-ModernBERT-large Accuracy: 7.1% (110/1596 points across 532 queries)

Meanwhile, if we sort the current MTEB leaderboard by retrieval, Qwen 8B and 4B are in spots 1 and 2 (vs. spots 8 and 20 in my benchmark), and my top two ranked models (OpenAI) appear in MTEB at spots 14 and 40, respectively.

So what’s going on? I’m completely open to the answer that my methodology is wrong. Or that my example queries aren’t reflective of what the benchmark measures. Or anything else, really.

Even the slightest change in the dataset used for benchmarking or the options passed to the model will likely produce different results. However, the reason the results vary so significantly here might be that the MTEB benchmark incorporates several design choices to ensure consistency.

Wow, that’s a very comprehensive response. Thank you, @John6666! I definitely have some stuff to dig into now.

I just reran the Qwen3 queries, changing nothing other than setting prompt="query", and the 0.6B model went from 56.8% to 73.9%, 4B went from 58.9% to 72.4%, and 8B went from 71.5% to 82.8% (outperforming the small model from OpenAI).

This is without touching my ranking methodology (yet). This is phenomenal. Thank you again, @John6666!
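For anyone hitting the same gap, here is a rough sketch of what applying a query prompt can look like. It assumes the sentence-transformers library, where the named query prompt is selected with prompt_name="query" on encode(); whether that matches the option in my own script is not spelled out above, so treat the parameter name as an assumption. rank_verses and load_qwen3 are hypothetical helper names, and the pure-Python dot product is only for illustration (a matrix multiply would be the usual choice for a full corpus).

```python
from typing import List, Sequence


def rank_verses(model, queries: Sequence[str], verses: Sequence[str],
                top_k: int = 3) -> List[List[str]]:
    """Return the top_k verses for each query, best match first.

    `model` is anything with a sentence-transformers style encode() method.
    Only the queries get the "query" prompt; the verses are encoded as-is.
    """
    query_vecs = model.encode(queries, prompt_name="query",
                              normalize_embeddings=True)
    verse_vecs = model.encode(verses, normalize_embeddings=True)

    rankings = []
    for q in query_vecs:
        # With normalized embeddings, the dot product is the cosine similarity.
        scores = [sum(float(a) * float(b) for a, b in zip(q, v))
                  for v in verse_vecs]
        best = sorted(range(len(verses)), key=lambda i: -scores[i])[:top_k]
        rankings.append([verses[i] for i in best])
    return rankings


def load_qwen3(model_name: str = "Qwen/Qwen3-Embedding-0.6B"):
    """Load a Qwen3 embedding model (assumes sentence-transformers is installed)."""
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)
```

Calling rank_verses(load_qwen3(), queries, verses) downloads the model on first use. Dropping prompt_name="query" from the first encode() call gives the no-prompt setup that my original Qwen3 numbers came from.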