I put together my own retrieval benchmark to test various embedding models: LetsChurch/bible-embeddings (a dataset on Hugging Face).
The basic idea is that I want to retrieve Bible verses for a given query. I have a list of 532 queries, each with one or more expected verses. If a model returns an expected verse at rank 1, it gets 3 points; at rank 2, 2 points; at rank 3, 1 point; otherwise it gets no points. The maximum possible score is therefore 532 × 3 = 1596 points.
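To make the scoring rule concrete, here is a rough Python sketch of it. This is not the actual benchmark code: the function names (`score_query`, `score_model`) and the toy queries and rankings are hypothetical, and it assumes that only the best-ranked expected verse counts for a query, which is consistent with the cap of 3 points per query.

```python
from typing import Dict, List, Set

# Points awarded for an expected verse at rank 1, 2, and 3.
POINTS_BY_RANK = (3, 2, 1)


def score_query(ranked_verses: List[str], expected: Set[str]) -> int:
    """Score one query: 3/2/1 points if an expected verse is at rank 1/2/3, else 0."""
    # zip() stops after the top three results, so lower ranks never score.
    for points, verse in zip(POINTS_BY_RANK, ranked_verses):
        if verse in expected:
            # Assumption: only the best-ranked expected verse counts.
            return points
    return 0


def score_model(
    results: Dict[str, List[str]], expected: Dict[str, Set[str]]
) -> int:
    """Sum the per-query points over every query in the benchmark."""
    return sum(
        score_query(results.get(query, []), verses)
        for query, verses in expected.items()
    )


# Hypothetical toy data, not taken from the real dataset.
expected = {
    "God's love for the world": {"John 3:16"},
    "the Lord is my shepherd": {"Psalm 23:1", "John 10:11"},
    "faith without works": {"James 2:26"},
}
results = {
    # Expected verse at rank 1: 3 points.
    "God's love for the world": ["John 3:16", "1 John 4:8", "Romans 5:8"],
    # Best expected verse at rank 2: 2 points.
    "the Lord is my shepherd": ["Isaiah 40:11", "John 10:11", "Psalm 23:1"],
    # No expected verse in the top three: 0 points.
    "faith without works": ["Romans 3:28", "Galatians 2:16", "Ephesians 2:8"],
}

total = score_model(results, expected)
max_points = 3 * len(expected)
print(f"{total}/{max_points} points")
```

With the real 532 queries, the same `3 * len(expected)` calculation gives the 1596-point maximum.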
OpenAI’s text-embedding-3-large and text-embedding-3-small top my benchmark, followed by Gemini text-embedding-004, then OpenAI’s last-gen ada embedding model, then GTE Large, and then the other models.
The complete ranking, as of today:
- text-embedding-3-large Accuracy: 89.3% (1348/1596 points across 532 queries)
- text-embedding-3-small Accuracy: 81% (1228/1596 points across 532 queries)
- text-embedding-004 Accuracy: 80.3% (1214/1596 points across 532 queries)
- text-embedding-ada-002 Accuracy: 78.3% (1186/1596 points across 532 queries)
- thenlper-gte-large Accuracy: 73.1% (1105/1596 points across 532 queries)
- thenlper-gte-base Accuracy: 72.5% (1096/1596 points across 532 queries)
- intfloat-e5-large-v2 Accuracy: 72.1% (1085/1596 points across 532 queries)
- Qwen-Qwen3-Embedding-8B Accuracy: 71.5% (1077/1596 points across 532 queries)
- nomic-ai-nomic-embed-text-v1.5 Accuracy: 70.0% (1058/1596 points across 532 queries)
- intfloat-e5-base-v2 Accuracy: 69.4% (1049/1596 points across 532 queries)
- thenlper-gte-small Accuracy: 69.2% (1047/1596 points across 532 queries)
- Salesforce-SFR-Embedding-Mistral Accuracy: 69.0% (1039/1596 points across 532 queries)
- ibm-granite-granite-embedding-125m-english Accuracy: 68.5% (1039/1596 points across 532 queries)
- voyage-3 Accuracy: 67.0% (1015/1596 points across 532 queries)
- ibm-granite-granite-embedding-30m-english Accuracy: 65.4% (996/1596 points across 532 queries)
- ibm-granite-granite-embedding-278m-multilingual Accuracy: 64.7% (979/1596 points across 532 queries)
- intfloat-e5-small-v2 Accuracy: 63.8% (966/1596 points across 532 queries)
- ibm-granite-granite-embedding-107m-multilingual Accuracy: 61.3% (929/1596 points across 532 queries)
- sentence-transformers-all-MiniLM-L6-v2 Accuracy: 59.6% (910/1596 points across 532 queries)
- Qwen-Qwen3-Embedding-4B Accuracy: 58.9% (885/1596 points across 532 queries)
- BAAI-bge-base-en Accuracy: 56.9% (861/1596 points across 532 queries)
- Qwen-Qwen3-Embedding-0.6B Accuracy: 56.8% (855/1596 points across 532 queries)
- BAAI-bge-large-en Accuracy: 47.4% (701/1596 points across 532 queries)
- google-embeddinggemma-300m Accuracy: 46.4% (708/1596 points across 532 queries)
- BAAI-bge-small-en Accuracy: 43.4% (652/1596 points across 532 queries)
- answerdotai-ModernBERT-base Accuracy: 7.7% (116/1596 points across 532 queries)
- answerdotai-ModernBERT-large Accuracy: 7.1% (110/1596 points across 532 queries)
Meanwhile, if we sort the current MTEB leaderboard by retrieval, Qwen 8B and 4B are in spots 1 and 2 (vs. spots 8 and 20 in my benchmark), and my top two ranked models (both OpenAI) appear in MTEB at spots 14 and 40, respectively.
So what’s going on? I’m completely open to the answer that my methodology is wrong. Or that my example queries aren’t reflective of what the benchmark is measuring. Or anything else, really.