Pipeline test (test_pipeline)

test_pipeline v1.0.0 · 2026-05-26 13:28 UTC

Design

Factor: task_arg_reasoning_effort
Levels: high, low
Evals: 2
Dimensions: 2
Samples per eval: 3
Data quality: 12/12 scores usable

Pipeline test deliverable, generated from the committed 3-row mock fixture (data/scenarios.example.jsonl) and judged by openai/gpt-4.1-nano. It contains no real plan drafts and no real rater names, so it is safe to deploy publicly and to use when iterating on the report template, hosting, and auth.

Mean score per dimension

Bars: mean per-eval score on each dimension, grouped by task_arg_reasoning_effort. Whiskers: 95% CI computed as mean ± 1.96 × SEM, where SEM is the eval's stderr() metric and n = samples per eval.
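The whisker arithmetic can be sketched as follows. This is a minimal illustration, not the code in reports/make_report.py: the function name mean_with_ci and the three scores are hypothetical, and it assumes the SEM is the sample standard deviation divided by √n, which may differ from how the eval's stderr() metric is computed.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation 95% CI.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical scores for one eval on one dimension (n = 3 samples).
scores = [3.0, 4.0, 5.0]
mean, lower, upper = mean_with_ci(scores)
print(f"mean={mean:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

With only 3 samples per eval the 1.96 multiplier is a rough normal approximation; a t-based multiplier would give wider whiskers.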

Eval logs

You're viewing this report over file://. The embedded Inspect log viewer is a SPA that loads .eval files via fetch() — browsers block that from file:// origins, so the iframe is empty and clicking individual logs triggers a download.

Serve over HTTP:

cd "test_pipeline" && python3 -m http.server 8765
# then open http://localhost:8765/
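If a script is more convenient than the shell one-liner, roughly the same thing can be done from Python's standard library. This is a sketch only: make_report_server is a hypothetical helper, not part of the report tooling, and unlike the command above it binds to 127.0.0.1 only.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_report_server(directory, port=8765):
    """Build an HTTP server that serves static files from `directory`.

    Hypothetical helper; binds to localhost only.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   server = make_report_server("test_pipeline")
#   server.serve_forever()
# then open http://localhost:8765/
```

Passing port=0 lets the operating system pick a free port, which can then be read from server.server_address.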

Standalone viewer: viewer/.

Generated by reports/make_report.py. Plot: Inspect Viz. Log viewer: inspect view bundle.