Data Refinery — Blue Hen RE

Provenance

Source: EVIDENCE.md · extractor heuristic · chunking sentence

OKF dataset card

---
type: Dataset
title: Evidence and science review ledgers
description: "Point-in-time collection: 2 docs, 27 chunks."
tags: [dataset, datalab]
timestamp: "2026-07-02T16:02:39Z"
datasetId: 20260702-110239-evidence-and-science-review-ledgers
---

# Provenance

Point-in-time collection run `20260702-110239-evidence-and-science-review-ledgers`: 2 documents,
27 chunks (sentence strategy, ~11246 tokens). Raw artifacts live at
`data/datalab/20260702-110239-evidence-and-science-review-ledgers/` (docs.jsonl, chunks.jsonl, manifest.json);
trace `20260702-110239-da534e13` is in `data/traces/`.
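
The counts stated on this card can be checked against the raw artifacts. The following is a minimal sketch, assuming `docs.jsonl` and `chunks.jsonl` hold one JSON record per line (one per document and one per chunk respectively). The helper names `count_jsonl_records` and `check_run` are hypothetical, and the schema of `manifest.json` is not described here, so the sketch does not read it.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, parsing each one to catch malformed rows."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError if the record is not valid JSON
                count += 1
    return count


def check_run(run_dir, expected_docs=2, expected_chunks=27):
    """Compare record counts in a run directory with the figures stated on the card.

    Assumption: one line per document in docs.jsonl and one line per chunk in
    chunks.jsonl. The defaults are the counts reported for this run.
    """
    run_dir = Path(run_dir)
    counts = {
        "docs": count_jsonl_records(run_dir / "docs.jsonl"),
        "chunks": count_jsonl_records(run_dir / "chunks.jsonl"),
    }
    counts["matches_card"] = (
        counts["docs"] == expected_docs and counts["chunks"] == expected_chunks
    )
    return counts
```

Calling `check_run("data/datalab/20260702-110239-evidence-and-science-review-ledgers")` returns the two counts plus a `matches_card` flag, which is `False` if either file has drifted from the figures above.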

# Sources

| Source | Status |
|--------|--------|
| `EVIDENCE.md` | ok |
| `SCIENCE_REVIEW.md` | ok |

# Consumption

Chunks feed the training worker's pair builder and the retrieval index.
Filter on `token_estimate` and `strategy` in `chunks.jsonl`. See the
[data pipeline](/platform/data-pipeline.md) concept for stage details.
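
As an illustration of the filtering step, here is a minimal sketch. It assumes each line of `chunks.jsonl` is a JSON object carrying a numeric `token_estimate` and a string `strategy`, and that the strategy value for this run is `"sentence"`. The function names and the token bounds are hypothetical and are not taken from the training worker or the retrieval index.

```python
import json
from pathlib import Path


def iter_chunks(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def select_chunks(chunks, strategy="sentence", min_tokens=0, max_tokens=None):
    """Keep chunks that match the strategy and fall inside the token bounds.

    Chunks missing `token_estimate` are skipped rather than guessed at.
    The bounds are illustrative defaults, not values from the pipeline.
    """
    selected = []
    for chunk in chunks:
        tokens = chunk.get("token_estimate")
        if chunk.get("strategy") != strategy or tokens is None:
            continue
        if tokens < min_tokens:
            continue
        if max_tokens is not None and tokens > max_tokens:
            continue
        selected.append(chunk)
    return selected
```

A typical call would be `select_chunks(iter_chunks(run_dir / "chunks.jsonl"), min_tokens=100, max_tokens=512)`, where `run_dir` points at the run's directory under `data/datalab/`. Skipping records without a token estimate keeps the filter conservative; a consumer that prefers to keep them can relax that check.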

Sample chunks (first 20, sanitized)

# Evidence ledger — ASN & enterprise RAG **Normative rule:** product and whitepaper claims advance only when a row here moves from **Hypothesis** → **Measured** (reproducible command + date) or **Rejected**. Narrative from source docs (`docs/sources/`) does not count as evidence. Related: [`WHITEPAPER.md`](./WHITEPAPER.md) §8 · [`SCIENCE_REVIEW.md`](./SCIENCE_REVIEW.md) · [`specs/0008-eval-harness

a5df963acfeb9cfe

| Claim | Status | Measurement | Command |
|---|---|---|---|
| Effective rank → 1.0 on rank-1 matrix | **Measured** | \|erank − 1\| < 1e-3 | `pytest packages/asn-engine/tests/test_spectral.py::test_effective_rank_rank_one` |
| Isotropic Gaussian erank ≈ full dimension | **Measured** | erank ≥ 0.9·min(n,d) for 256×64 | `…::test_effective_rank_isotropic` |
| Quintic NS stable band ≈ [0.68, 1.27] | *

a5df963acfeb9cfe

| Gate | Threshold | Status | Notes |
|---|---|---|---|
| `rankAboveBaseline` | erank > 8.0 on eval slice | **Hypothesis** | Prior ~62 deploy reports **retracted** (train_loop bug); re-measure per workspace |
| `ndcgNonRegression` | pairwise nDCG@10 ≥ 0.35 | **Hypothesis** | k=2 proxy in harness today; expand to k=10 panel |
| `mrlWithinTolerance` | Matryoshka truncate tolerance | **Not measured**

a5df963acfeb9cfe

**Command:** `uv run python scripts/collect_evidence.py --ablation --site hub --vicreg --epochs 10`

**Snapshot:** `data/evidence/latest.json` (ablation block)

| arm | eval erank | nDCG@10 | surgeries |
|---|---|---|---|
| InfoNCE baseline | 7.398 | 0.9539 | 0 |
| ASN + three-tier surgery | 7.379 | **0.9654** | 8 |
| InfoNCE + VICReg (no surgery) | 7.358 | **0.9654** | 0 |

**Findings:**

1. **Gate

a5df963acfeb9cfe

**Gate 1: 0/4** (ASN never beats baseline on rank).

**VICReg fleet verdict:** helps only on hub (same +0.0115 as surgery, zero interventions); neutral on 2 sites; **hurts** research-rag at saturated nDCG.

**Default recipe:** plain InfoNCE; enable VICReg per-tenant when ablation shows gain; keep `asn.enabled: false` fleet-wide until rank gate passes.

### Run B — encoder trigger + peak–drop + hetero

a5df963acfeb9cfe

**Setup:** 5-topic synthetic corpus, 60 train pairs, MiniLM-L6-v2, **30 epochs**, seed 0, eval erank over a 30-sentence pool (headroom), nDCG@10 leave-one-out by topic.

**Command:** `EPOCHS=30 packages/asn-engine/.venv/Scripts/python.exe scripts/engine_proof.py`

**Root cause of the §3 failure — found and fixed.** The collapse trigger compared `rankFloor` against the effective rank of a *single bat

a5df963acfeb9cfe

The headline "ASN beats baseline" remains **Hypothesis**, pending a genuine collapse-regime experiment (weaker/random init or large-scale training) vs real baselines (BGE-M3 / e5 / Qwen3-Embed). Gate `scripts/engine_proof.py` exits 0 on the *no-harm* claim only and says so explicitly.

### 3.2 Collapse-regime experiment — ASN surgery REJECTED (2026-06-27)

**Goal:** test the benefit claim in a

a5df963acfeb9cfe

| arm | served effRank | kNN acc | surgeries |
|---|---|---|---|
| alignment baseline | **3.3 – 3.5** (collapsed from raw 21) | 1.000 | 0 |
| + ASN (three-tier surgery + NS) | **1.01 – 1.02** (near-total collapse) | 0.79 – 0.88 | 79 |

**Interpretation — a mechanism/pathology mismatch.** Three-tier surgery *shrinks the weak (middle) singular band* to combat **anisotropy** (a few over-dominant direc

a5df963acfeb9cfe
