Pipeline test (test_pipeline)
Design
- Factor: task_arg_reasoning_effort
- Levels: high, low
- Evals: 2
- Dimensions: 2
- Samples per eval: 3
- Data quality: 12/12 scores usable
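The score count in the design summary can be checked with a little arithmetic. The sketch below is illustrative only. It assumes one eval per factor level and one score per sample per dimension, which is consistent with the 12 in "12/12 scores usable" but is not stated in the report itself.

```python
# Design parameters copied from the summary above.
levels = ["high", "low"]        # levels of task_arg_reasoning_effort
n_evals = len(levels)           # assumption: one eval per factor level
n_dimensions = 2
samples_per_eval = 3

# Assumption: every sample receives one score per dimension.
expected_scores = n_evals * n_dimensions * samples_per_eval

usable_scores = 12              # numerator of "12/12 scores usable"
usable_fraction = usable_scores / expected_scores

print(f"{usable_scores}/{expected_scores} usable ({usable_fraction:.0%})")
```

Under those assumptions the expected total is 2 × 2 × 3 = 12, so every expected score is present and usable.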
Pipeline test deliverable, generated from the committed 3-row mock fixture (data/scenarios.example.jsonl) and judged by openai/gpt-4.1-nano. It contains no real plan drafts and no real rater names, so it is safe to deploy publicly and to use when iterating on the report template, hosting, and auth.
Mean score per dimension
Eval logs
You're viewing this report over file://.
The embedded Inspect log viewer is a SPA that loads .eval files via fetch(). Browsers block that from file:// origins, so the iframe is empty and clicking individual logs triggers a download.
Serve over HTTP:
cd "test_pipeline" && python3 -m http.server 8765 # then open http://localhost:8765/
Standalone viewer: viewer/.
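If a shell one-liner is inconvenient, the same static serving can be done from Python using only the standard library. This is a minimal sketch, not part of the report tooling: the function name make_report_server is hypothetical, and it binds to 127.0.0.1 only, whereas python3 -m http.server listens on all interfaces by default.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_report_server(directory, port=8765):
    """Build a static file server rooted at `directory` (hypothetical helper).

    Serving over http:// gives the page a real origin, so the embedded
    viewer's fetch() calls for .eval files are no longer blocked.
    """
    # `directory=` tells the handler which folder to serve,
    # instead of the current working directory.
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to localhost only; pass port=0 to let the OS pick a free port.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   make_report_server("test_pipeline").serve_forever()
# then open http://localhost:8765/
```

Any static HTTP server works equally well here. The only requirement is that the report and its .eval logs are loaded from an http:// origin rather than file://.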