MTEB(por): Random Baseline Encoder
⚠️ This is NOT a trained model. It is the chance-level floor reference for the MTEB(por, v2) Brazilian-Portuguese embedding benchmark.
It maps each input text to a deterministic, L2-normalized random vector (seeded by a hash of the text). It carries zero semantic signal: two sentences that differ in wording but share a meaning get unrelated vectors, so it scores at chance level on every task family (STS, retrieval, classification, pair classification, clustering, reranking, regression).
Why a random baseline?
- Interpretability: it anchors every number. Is `0.30` on a retrieval task good or near-random? Only the floor answers that.
- Task discrimination: if a real model scores near the floor on a task, that task does not discriminate. This is a concrete empirical sanity check.
- Convention: mirrors `mteb/baseline-random-encoder` from the upstream MTEB leaderboard.
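One way to make the interpretability point concrete is to express a score as the fraction of the floor-to-ceiling range it covers. The sketch below is only an illustration: `margin_over_floor` is a hypothetical helper, not part of MTEB, the ceiling of 1.0 is an assumption, and both model scores are made up. Only the floor values come from the tables in this card.

```python
def margin_over_floor(score: float, floor: float, ceiling: float = 1.0) -> float:
    """Fraction of the floor-to-ceiling range covered by a score (hypothetical helper)."""
    if ceiling <= floor:
        raise ValueError("ceiling must exceed floor")
    return (score - floor) / (ceiling - floor)


# Floors taken from this card; the two model scores are illustrative, not measured.
retrieval_margin = margin_over_floor(score=0.30, floor=0.0235)   # FaQuADIR, nDCG@10
clustering_margin = margin_over_floor(score=0.55, floor=0.5289)  # MedPTClustering, V-measure

print(f"retrieval margin:  {retrieval_margin:.3f}")
print(f"clustering margin: {clustering_margin:.3f}")
```

Under these assumptions, the raw clustering score is higher than the raw retrieval score, yet it sits much closer to its floor. Raw numbers hide exactly this kind of comparison.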
Design
- Each text `t` → `rng = numpy.random.default_rng(sha256("42|" + t))` → `v = rng.standard_normal(768)` → `v / ‖v‖`. In the code, the SHA-256 digest is reduced modulo 2^32 to form the integer seed.
- Deterministic per text (fully reproducible), dim 768, seed 42.
- No weights, no GPU, no training.
Reproduce
import hashlib

import numpy as np

DIM, SEED = 768, 42

def encode(texts: list[str]) -> np.ndarray:
    """Deterministic per-text L2-normalized random vectors (chance-level floor)."""
    out = np.empty((len(texts), DIM), dtype=np.float32)
    for i, t in enumerate(texts):
        # Seed = SHA-256 of "<SEED>|<text>", reduced to 32 bits.
        h = int(hashlib.sha256((str(SEED) + "|" + (t or "")).encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(h).standard_normal(DIM).astype(np.float32)
        # L2-normalize; the epsilon guards against division by zero.
        out[i] = v / (np.linalg.norm(v) + 1e-9)
    return out
The full evaluation script (run_random_baseline.py, using the same pinned-revision MTEB(por)
tasks as the benchmarked models) is included in this repo.
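As a quick property check of the encoder described above, the sketch below verifies determinism, unit norm, and the absence of semantic signal. `random_unit_vector` is a compact single-text restatement of `encode` so that the snippet runs on its own. The two Portuguese sentences are illustrative inputs, not drawn from any benchmark task.

```python
import hashlib

import numpy as np

DIM, SEED = 768, 42


def random_unit_vector(text: str) -> np.ndarray:
    """Single-text restatement of the card's encode(), using the same seeding scheme."""
    digest = hashlib.sha256(f"{SEED}|{text}".encode()).hexdigest()
    rng = np.random.default_rng(int(digest, 16) % (2**32))
    v = rng.standard_normal(DIM).astype(np.float32)
    return v / (np.linalg.norm(v) + 1e-9)


# Two paraphrases (illustrative inputs): similar meaning, different wording.
a = random_unit_vector("O gato dorme no sofá.")
b = random_unit_vector("O felino está dormindo no sofá.")

# Encoding the same text twice must give the identical vector.
same_again = random_unit_vector("O gato dorme no sofá.")

# Both vectors are unit length, so the dot product is the cosine similarity.
cosine = float(a @ b)

print("deterministic:", np.array_equal(a, same_again))
print("norm:", float(np.linalg.norm(a)))
print("cosine between paraphrases:", cosine)
```

For independent random unit vectors in 768 dimensions, the cosine has a standard deviation of roughly 1/sqrt(768) ≈ 0.036. The paraphrase pair should therefore land close to zero, not near 1 as it would with a trained encoder.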
Floor scores: MTEB(por, v2)
Retrieval (nDCG@10)
| Task | Floor |
|---|---|
| MedPTRetrieval | 0.0083 |
| FaQuADIR | 0.0235 |
| Quati | 0.0 |
| FaqBacenRetrieval | 0.0027 |
| JurisTCU | 0.0 |
| BRTaxQAR | 0.0129 |
Reranking (MAP)
| Task | Floor |
|---|---|
| QuatiReranking | 0.1804 |
| JurisTCUReranking | 0.1434 |
| PortuLexRRIP | 0.1415 |
STS (Spearman)
| Task | Floor |
|---|---|
| AssinSTS | 0.005 |
| Assin2STS | -0.0288 |
Pair classification (AP)
| Task | Floor |
|---|---|
| AssinRTE | 0.2328 |
| InferBR | 0.3556 |
Classification (acc/AP)
| Task | Floor |
|---|---|
| HateBR | 0.5016 |
| ToxSynPT | 0.495 |
| FactckBrClassification | 0.322 |
| OlidBrMultilabelClassification | 0.2035 |
| BrighterEmotionMultilabelClassification | 0.2027 |
Clustering (V-measure)
| Task | Floor |
|---|---|
| MedPTClustering | 0.5289 |
| WikipediaPTCategoriesClusteringP2P | 0.3248 |
| JurisTCUClusteringP2P | 0.1225 |
| SciELOClusteringP2P | 0.0859 |
| StackoverflowPtClustering | 0.3353 |
| CamaraProposicoesClustering | 0.4912 |
Regression (Spearman)
| Task | Floor |
|---|---|
| BrighterEmotionIntensityRegression | 0.0223 |
| EnemEssayRegression | -0.0783 |
| NarrativeEssaysBRRegression | -0.0526 |
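The floor is non-zero for clustering (the V-measure of a random partition is not 0) and for classification (chance ≈ 1/num-classes). The AP- and MAP-based floors (pair classification, reranking) are also non-zero, because a random ranking still places some positives near the top. Real models score well above the floor on every task.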
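The two non-zero-floor effects can be reproduced with a small simulation. This is a NumPy sketch of the standard V-measure definition (harmonic mean of homogeneity and completeness), not the benchmark's evaluator. It also makes a simplifying assumption: labels are drawn uniformly at random, which stands in for clustering or classifying random embeddings. The sample sizes and class counts are arbitrary.

```python
import numpy as np


def entropy(counts: np.ndarray) -> float:
    """Shannon entropy (in nats) of a table of non-negative counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


def v_measure(labels_true: np.ndarray, labels_pred: np.ndarray) -> float:
    """Harmonic mean of homogeneity and completeness (standard definition, beta = 1)."""
    _, c_idx = np.unique(np.ravel(labels_true), return_inverse=True)
    _, k_idx = np.unique(np.ravel(labels_pred), return_inverse=True)
    table = np.zeros((c_idx.max() + 1, k_idx.max() + 1))
    np.add.at(table, (c_idx, k_idx), 1)  # contingency table: classes x clusters
    h_c, h_k = entropy(table.sum(axis=1)), entropy(table.sum(axis=0))
    h_joint = entropy(table)
    # H(C|K) = H(C,K) - H(K) and H(K|C) = H(C,K) - H(C)
    homogeneity = 1.0 if h_c == 0 else 1.0 - (h_joint - h_k) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - (h_joint - h_c) / h_k
    total = homogeneity + completeness
    return 0.0 if total == 0 else 2 * homogeneity * completeness / total


rng = np.random.default_rng(0)


def random_partition_v(n_samples: int, n_groups: int) -> float:
    """V-measure between two independent, uniformly random labelings."""
    true = rng.integers(0, n_groups, n_samples)
    pred = rng.integers(0, n_groups, n_samples)
    return v_measure(true, pred)


v_few_samples = random_partition_v(n_samples=200, n_groups=50)
v_many_samples = random_partition_v(n_samples=20_000, n_groups=5)

# Chance accuracy with 4 balanced classes: uniform random guesses.
y_true = rng.integers(0, 4, 100_000)
y_guess = rng.integers(0, 4, 100_000)
chance_accuracy = float((y_true == y_guess).mean())

print("V-measure, 200 samples / 50 groups:", v_few_samples)
print("V-measure, 20000 samples / 5 groups:", v_many_samples)
print("chance accuracy, 4 classes:", chance_accuracy)
```

With few samples per group, the random V-measure is far from zero. It shrinks toward zero only when each group holds many samples, which is consistent with the clustering floors varying widely from task to task.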
Citation
Part of the MTEB(por) benchmark by the mteb-pt project. The floor is computed with the
identical pinned-SHA tasks used for every benchmarked model.