Pipeline test (test_pipeline)

test_pipeline v1.0.0 · 2026-05-26 13:28 UTC

Design

Factor: task_arg_reasoning_effort
Levels: high, low
Evals: 2
Dimensions: 2
Samples per eval: 3
Data quality: 12/12 scores usable

Pipeline test deliverable, generated from the committed 3-row mock fixture (data/scenarios.example.jsonl) and judged by openai/gpt-4.1-nano. It contains no real plan drafts and no real rater names, so it is safe to deploy publicly and to use when iterating on the report template, hosting, and auth.

Mean score per dimension

Bars: mean per-eval score on each dimension, grouped by task_arg_reasoning_effort. Whiskers: 95% CI from the eval's stderr() metric (SEM × 1.96, n = samples per eval).
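The whisker calculation described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the function name mean_with_ci and the example scores are hypothetical, and it assumes the SEM is the sample standard deviation (n - 1 in the denominator) divided by sqrt(n). With only 3 samples per eval, the 1.96 normal multiplier gives a narrower interval than a t-based one would.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation CI.

    Assumption: SEM = sample standard deviation / sqrt(n), and the
    whisker half-width is z * SEM (z = 1.96 for a 95% interval).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the SEM")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = z * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical scores for one eval on one dimension (n = 3 samples).
mean, lower, upper = mean_with_ci([1.0, 2.0, 3.0])
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Each bar in the chart corresponds to one such mean, and each whisker spans the lower and upper bounds for that eval and dimension.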

Eval logs

You're viewing this report over file://. The embedded Inspect log viewer is a single-page app (SPA) that loads .eval files via fetch(). Browsers block fetch() from file:// origins, so the iframe is empty and clicking an individual log triggers a download instead of opening it.

Serve over HTTP:

cd "test_pipeline" && python3 -m http.server 8765
# then open http://localhost:8765/

Standalone viewer: viewer/.