Data Refinery — Blue Hen RE

Provenance

Source: docs/wiki/GOALS.md · extractor heuristic · chunking sentence

OKF dataset card

---
type: Dataset
title: Wiki — goals and build docs
description: "Point-in-time collection: 3 docs, 8 chunks."
tags: [dataset, datalab]
timestamp: "2026-07-02T16:02:39Z"
datasetId: 20260702-110239-wiki---goals-and-build-docs
---

# Provenance

Point-in-time collection run `20260702-110239-wiki---goals-and-build-docs`: 3 documents,
8 chunks (sentence strategy, ~2357 tokens). Raw artifacts live at
`data/datalab/20260702-110239-wiki---goals-and-build-docs/` (`docs.jsonl`, `chunks.jsonl`,
`manifest.json`); the trace is `20260702-110239-84ab5b83` in `data/traces/`.
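
As a quick sanity check, the totals stated on this card can be compared against the raw artifacts. The sketch below assumes each `.jsonl` file holds one record per non-empty line; the helper names `count_records` and `matches_card` are hypothetical and not part of `datalab`.

```python
from typing import Iterable


def count_records(lines: Iterable[str]) -> int:
    """Count non-empty lines, assuming one JSON record per line."""
    return sum(1 for line in lines if line.strip())


def matches_card(
    doc_lines: Iterable[str],
    chunk_lines: Iterable[str],
    expected_docs: int = 3,
    expected_chunks: int = 8,
) -> bool:
    """Compare record counts with the totals stated on this card."""
    return (
        count_records(doc_lines) == expected_docs
        and count_records(chunk_lines) == expected_chunks
    )


# Stand-in lines for illustration; real input would be the open file handles.
demo_docs = ["{}\n"] * 3
demo_chunks = ["{}\n"] * 8 + ["\n"]  # trailing blank line is ignored
print(matches_card(demo_docs, demo_chunks))
```

Against the real run, pass the open handles for `docs.jsonl` and `chunks.jsonl` in place of the stand-in lists.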

# Sources

| Source | Status |
|--------|--------|
| `docs/wiki/GOALS.md` | ok |
| `docs/wiki/BUILD.md` | ok |
| `docs/wiki/IMPROVEMENT_LOOP.md` | ok |

# Consumption

Chunks feed the training worker's pair builder and the retrieval index.
Filter on `token_estimate` and `strategy` in `chunks.jsonl`. See the
[data pipeline](/platform/data-pipeline.md) concept for stage details.
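
A minimal sketch of that filter, assuming `token_estimate` and `strategy` are top-level keys on each record in `chunks.jsonl`. The function name `filter_chunks`, the 512-token budget, and the `text` key in the demo records are illustrative assumptions, not values taken from the pipeline.

```python
import json
from typing import Iterable, Iterator


def filter_chunks(
    lines: Iterable[str],
    strategy: str = "sentence",
    max_tokens: int = 512,
) -> Iterator[dict]:
    """Yield chunk records that match a strategy and fit a token budget."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        record = json.loads(line)
        # Assumed schema: both fields are top-level keys on the record.
        if record.get("strategy") != strategy:
            continue
        if record.get("token_estimate", 0) > max_tokens:
            continue
        yield record


# Inline demo records; the `text` key is hypothetical.
sample = [
    '{"strategy": "sentence", "token_estimate": 310, "text": "a"}',
    '{"strategy": "sentence", "token_estimate": 900, "text": "b"}',
    '{"strategy": "fixed", "token_estimate": 120, "text": "c"}',
]
kept = list(filter_chunks(sample, strategy="sentence", max_tokens=512))
print(len(kept))
```

Because the function takes any iterable of lines, an open file handle for `chunks.jsonl` can be passed directly, which streams the file instead of loading it whole.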

# Sample chunks (first 8, sanitized)

# Goal alignment baselines Evaluate new actions against these deliverables. Product claims must match `EVIDENCE.md`. ## Primary mission Multi-tenant **synthetic orgs** — each runs **collect → train → eval → deploy** to beat zero-shot embedders (BGE, e5) on *its* corpus, with an **edge tier** (Matryoshka t=8 + int8). ## Phase A deliverables (active) | Deliverable | Acceptance signal | |---|---| | F

f8d8995b6241a7e2

Non-trivial features need a spec in `specs/` ## Current blockers (check live) ```powershell pnpm work:blockers ``` Disk (BLK-DISK) blocks Docker, real-text evals, and large corpus harvests.

f8d8995b6241a7e2

# B.U.I.L.D. framework **Philosophy:** Action supersedes over-engineering. The system sharpens through consistent repetition and execution. If a process or tool does not actively add value, remove it. ## Agent runtimes (shared) All coding agents use the same wiki + queue. Runtime-specific lanes live in `config/agents.json`. | Runtime | Claim as | Entrypoint | |---|---|---| | **Cursor** | `--agent

c84cc0cda34ec721

Base — architecture & tooling | Layer | Location | Notes | |---|---|---| | Raw ingest | `docs/raw/` | Dumps, API responses, unstructured reference | | Wiki | `docs/wiki/` | Indexed docs, ADRs, finalized guidelines | | Knowledge bundle | `knowledge/` | OKF v0.1 — platform concepts, dataset cards, living SME reviews | | Data collection | `packages/datalab` | `python -m datalab collect` → `data/datal

c84cc0cda34ec721

```powershell uv run python scripts/build_sync.py inflow-sessions --agent claude --limit 10 uv run python scripts/build_sync.py inflow-sessions --agent opencode --limit 10 ``` ## 4. Loop — triaged improvement See [IMPROVEMENT_LOOP.md](./IMPROVEMENT_LOOP.md). Route all structural changes through buckets 1–3; do not attempt full automation for system modifications. | Runtime | Bucket policy | |---|-

c84cc0cda34ec721

# Improvement loop — triage buckets Route **all** system improvements, code generations, and structural changes through these buckets. Do not attempt full automation for modifications. ## Bucket 1 — Auto-approve (low risk) Agent may proceed without human sign-off. - Typo fixes - Standard Python linting / formatting - Basic SQL syntax corrections - Obvious documentation updates in `docs/wiki/` or c

3e32081ddae11c32

- Architectural shifts (new services, ADR-level decisions) - Model initialization strategies - MoE routing changes (Phase B / finance-lab) - ML recipe changes without `EVIDENCE.md` row - Anything where output quality must be judged by a human **Claude lane:** `scripts/autoresearch_train.py` architecture edits are always bucket-3. ## Classifier ```powershell uv run python scripts/build_sync.py clas

3e32081ddae11c32

**Execution (bucket-1):** ```powershell .\scripts\opencode-loop.ps1 ` -Goal "Implement SITE-003 dumbmodel museum page per spec 0007" ` -WorkDir C:\Users\jcdav\bluehenre ` -Agent opencode ` -FixUntilGreen ` -TestCmd "pnpm --filter @synthaembed/dumbmodel build" ``` **Research delegate (bucket-3, one hypothesis):** ```powershell .\scripts\opencode-loop.ps1 ` -Goal "Apply AR-306 depth-2 GE

3e32081ddae11c32
