# Evals

Agent evals for Next.js. Each eval is a small Next.js app + a prompt + assertions. We run the prompt through a coding agent in a sandbox and check what it wrote.

The point: find places where agents get Next.js wrong because their training data is stale, then fix it by shipping better docs in the `next` package itself.

## How it works

The runner is [`@vercel/agent-eval`](https://github.com/vercel-labs/agent-eval). It spins up a sandbox (Vercel or local Docker), copies the fixture in, runs the coding agent against `PROMPT.md`, then executes `EVAL.ts` as a vitest file against whatever the agent wrote. The `PROMPT.md` / `EVAL.ts` / fixture-dir convention you'll see below is that package's convention; see its README for the full spec.

`run-evals.js` is a thin wrapper around it: pack the local `next` build into a tarball, generate two experiment configs (`baseline` and `agents-md`) that differ only in whether they drop an `AGENTS.md` pointing at the bundled docs, then invoke `agent-eval run-all`. Everything from "spawn sandbox" onward is `@vercel/agent-eval`'s job.
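
To make the "differ only in one thing" idea concrete, here is a minimal sketch of generating the two variants from shared settings. The `ExperimentConfig` shape, the field names, and the tarball filename are all hypothetical; the real config format is whatever `@vercel/agent-eval` defines.

```ts
// Hypothetical config shape for illustration only.
// The real format is defined by @vercel/agent-eval, not by this sketch.
interface ExperimentConfig {
  name: 'baseline' | 'agents-md'
  nextTarball: string // path to the packed local `next` build
  writeAgentsMd: boolean // the only field that differs between variants
}

function buildExperimentConfigs(nextTarball: string): ExperimentConfig[] {
  // Everything shared lives in one object so the variants cannot drift apart.
  const shared = { nextTarball }
  return [
    { ...shared, name: 'baseline', writeAgentsMd: false },
    { ...shared, name: 'agents-md', writeAgentsMd: true },
  ]
}

// The tarball filename below is a placeholder.
const configs = buildExperimentConfigs('evals/.tarballs/next.tgz')
console.log(configs.map((c) => `${c.name}: AGENTS.md=${c.writeAgentsMd}`).join('\n'))
```

Building both variants from one shared object is what keeps the comparison fair: any difference in outcome can only come from the `AGENTS.md` flag.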

## One-time setup

Vercel employees: request access to the `vercel-labs` team in Lumos, then:

```bash
# Vercel CLI, if you don't have it: npm i -g vercel
vc link      # at repo root, pick vercel-labs team
vc env pull  # writes .env.local to repo root
```

External contributors can run the same evals in local Docker with their own API key. See [Running without Vercel sandbox access](#running-without-vercel-sandbox-access).

## Writing an eval

Copy an existing fixture. Take the next free number; gaps are fine.

```bash
cp -r evals/evals/agent-034-async-cookies evals/evals/agent-042-your-thing
```

Then edit three files:

**`PROMPT.md`**: what you'd type into the agent. Write it like a real user would: describe the symptom or goal, not the API. "Navigating from `/a` to `/b` is slow, fix it" is a good prompt. "Use `unstable_instant`" is not, because you're testing whether the agent understands the feature well enough to reach for it, not whether it can pattern-match a name you handed it.

**`EVAL.ts`**: vitest assertions against files the agent wrote. Regex the source, don't run it.

```ts
import { expect, test } from 'vitest'
import { readFileSync } from 'fs'
import { join } from 'path'

// Read the file the agent was expected to edit, relative to the working directory.
const page = readFileSync(join(process.cwd(), 'app/page.tsx'), 'utf-8')

// Assert on the source text only; the agent's code is not executed.
test('exports unstable_instant', () => {
  expect(page).toMatch(/export const unstable_instant\b/)
})
```
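
The `\b` in that pattern is doing real work: without it, a longer identifier with the same prefix would also match. A small sketch of a reusable check, assuming you want the same assertion for several export names. `exportsConst` is a hypothetical helper, not something `vitest` or `@vercel/agent-eval` provides.

```ts
// Hypothetical helper, not part of vitest or @vercel/agent-eval.
// Returns true when `source` contains `export const <name>` as a whole identifier.
function exportsConst(source: string, name: string): boolean {
  // Escape regex metacharacters so `name` is matched literally.
  const escaped = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  // \s+ tolerates extra whitespace; \b rejects longer identifiers with the same prefix.
  return new RegExp(`export\\s+const\\s+${escaped}\\b`).test(source)
}

// The assigned values are placeholders; only the export name matters here.
console.log(exportsConst('export const unstable_instant = true', 'unstable_instant')) // prints true
console.log(exportsConst('export const unstable_instantly = true', 'unstable_instant')) // prints false
```

One limitation of any source-level regex: it also matches commented-out code, so keep patterns specific to what you want to see in the file.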

**`app/`** (or `pages/`): the starting state. Give the agent something to edit, not a blank slate.

`package.json` needs a `build` script. Leave `next.config.ts` and `tsconfig.json` as copied unless your feature requires specific config.

## Running

```bash
pnpm eval agent-042-your-thing
```

This runs two variants in parallel and prints pass/fail for each:

```
✗ baseline/agent-042-your-thing (81s)
✓ agents-md/agent-042-your-thing (200s)
```

`agents-md` drops an `AGENTS.md` into the sandbox telling the agent to check `node_modules/next/dist/docs/` first. `baseline` doesn't. That's the whole difference: same prompt, same model, one extra file. If `agents-md` passes and `baseline` doesn't, the bundled docs are doing their job.
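
Reading the pair of results can be sketched as a small decision table. The verdict labels are hypothetical names chosen for this sketch. The first two cases come from this README; the last two are one plausible reading, not something the runner reports.

```ts
// Hypothetical verdict labels; the runner itself only prints pass/fail per variant.
type Verdict = 'docs-helped' | 'needs-docs' | 'already-known' | 'inconclusive'

function interpret(baselinePassed: boolean, agentsMdPassed: boolean): Verdict {
  // Only the variant with AGENTS.md passed: the bundled docs made the difference.
  if (!baselinePassed && agentsMdPassed) return 'docs-helped'
  // Both failed: the expected starting point for an undocumented new feature.
  if (!baselinePassed && !agentsMdPassed) return 'needs-docs'
  // Both passed (assumption): the agent likely knows the feature already.
  if (baselinePassed && agentsMdPassed) return 'already-known'
  // Baseline passed but agents-md failed (assumption): probably noise, rerun.
  return 'inconclusive'
}

console.log(interpret(false, true)) // prints "docs-helped"
```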

A run takes ~2–5 min. To validate a fixture without executing:

```bash
pnpm eval agent-042-your-thing --dry
```

Full transcripts land in `evals/results/<variant>/<timestamp>/<eval>/run-1/`. Grep `transcript-raw.jsonl` to see exactly what the agent did.
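
If plain grep gets unwieldy, the same filtering is a few lines of code. This is a minimal sketch that assumes only that the file is JSONL (one JSON value per line). The transcript schema is not documented here, so the sample entries are made up.

```ts
// Parse JSONL text and keep the entries whose raw line mentions `needle`.
function filterJsonl(text: string, needle: string): unknown[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '' && line.includes(needle))
    .map((line) => JSON.parse(line) as unknown)
}

// Made-up sample lines; real transcript entries will look different.
const sample = [
  '{"step":1,"detail":"read node_modules/next/dist/docs/index.mdx"}',
  '{"step":2,"detail":"edit app/page.tsx"}',
].join('\n')

// Did the agent touch the bundled docs at all?
console.log(filterJsonl(sample, 'dist/docs').length) // prints 1
```

In practice you would pass in the contents of `transcript-raw.jsonl`, for example read with Node's `readFileSync(path, 'utf-8')`.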

## When to rebuild

`pnpm eval` packs `packages/next/dist/` into a tarball and ships that to the sandbox. It does not build. If you changed `packages/next/src/**` or `docs/**`, run `pnpm --filter=next build` first or the sandbox will see stale code. If you only changed fixture files, no rebuild is needed.
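
The rule above reduces to a path check. A minimal sketch, assuming repo-relative paths with forward slashes. `needsRebuild` is a hypothetical helper, not part of `run-evals.js`.

```ts
// Hypothetical helper: decide whether `pnpm --filter=next build` is needed
// before running an eval, given the repo-relative paths that changed.
function needsRebuild(changedPaths: string[]): boolean {
  return changedPaths.some(
    (p) => p.startsWith('packages/next/src/') || p.startsWith('docs/')
  )
}

// Fixture-only change: no rebuild.
console.log(needsRebuild(['evals/evals/agent-042-your-thing/PROMPT.md'])) // prints false
// Docs change: rebuild so the new doc lands in dist/docs/.
console.log(needsRebuild(['docs/some-new-doc.mdx'])) // prints true
```

The list of changed paths could come from something like `git diff --name-only`.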

## Workflow

1. **Write the fixture.** `PROMPT.md` describes a user-facing problem. `EVAL.ts` asserts the API you expect the agent to reach for.

2. **Build Next.js.** `pnpm build`. The eval runner packs whatever is already in `dist/`; it won't build for you.

3. **Run it.** `pnpm eval <name>`. If the feature isn't in the agent's training data and isn't documented in `dist/docs/`, both variants fail. That's the expected starting point for a new feature.

4. **Write the doc.** Add an `.mdx` under `docs/`. Use `version: draft` in the frontmatter to keep it off nextjs.org while still bundling it into the package.

5. **Build again.** New doc needs to land in `dist/docs/` before the next pack sees it.

6. **Run it again.** `baseline` should still fail; `agents-md` should find the new doc and pass. Baseline staying red while agents-md flips green tells you the doc did it, not run-to-run noise.

7. **Commit the eval and the doc together.** The full suite gets pulled by the external benchmark runner and published to nextjs.org/evals. Keeping the fixture alongside the doc it validates means that score tracks over time as both the docs and the models change.
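
For step 4, a quick way to confirm a new doc is marked as a draft before committing is to check its frontmatter. This is a rough sketch: `isDraftDoc` is a hypothetical helper, it only understands a plain `version: draft` line, and how the docs pipeline actually reads frontmatter is not shown in this README.

```ts
// Hypothetical helper: true when the leading `---` frontmatter block
// contains a `version: draft` line.
function isDraftDoc(mdx: string): boolean {
  // Capture everything between the opening and closing `---` fences.
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(mdx)
  if (!match) return false
  return match[1]
    .split(/\r?\n/)
    .some((line) => /^version:\s*draft\s*$/.test(line))
}

// The title is a placeholder.
const doc = '---\ntitle: Placeholder title\nversion: draft\n---\n\nBody text.'
console.log(isDraftDoc(doc)) // prints true
```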

## Layout

```
evals/
├── evals/agent-*/       # fixtures
├── lib/setup.ts         # uploads tarball, writes AGENTS.md (shared by all evals)
├── experiments/         # generated per-run, gitignored
├── .tarballs/           # packed next, gitignored
└── results/             # transcripts + outputs, gitignored
```

Sandbox tokens live in `.env.local` at the repo root (from `vc env pull`).

## Running without Vercel sandbox access

If you don't have Vercel credentials, `@vercel/agent-eval` falls back to local Docker. See [its direct API keys docs](https://github.com/vercel-labs/agent-eval#direct-api-keys-no-vercel-account-required) for the full list of supported env vars. Have Docker running and provide your own model key in `.env.local` at the repo root:

```bash
ANTHROPIC_API_KEY=sk-ant-...
```

Then run `pnpm eval <name>` as normal. Docker pulls `node:24-slim` on first run. Tarball packing, both variants, and the results layout are identical to the remote path; `run-evals.js` doesn't know or care which sandbox backend got picked.
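
A preflight check can catch a missing key before a multi-minute run starts. The sketch below is hypothetical and deliberately minimal: it checks only the one variable shown above, and its tiny parser ignores quoting and `export` prefixes. The full set of supported variables is defined by `@vercel/agent-eval`.

```ts
// Minimal dotenv-style parser: KEY=value lines, skipping blanks and # comments.
function parseEnv(text: string): Record<string, string> {
  const env: Record<string, string> = {}
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim()
    if (line === '' || line.startsWith('#')) continue
    const eq = line.indexOf('=')
    if (eq <= 0) continue // no key before `=`, or no `=` at all
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
  }
  return env
}

// Hypothetical check: is a non-empty ANTHROPIC_API_KEY present?
function hasModelKey(envText: string): boolean {
  const key = parseEnv(envText)['ANTHROPIC_API_KEY']
  return typeof key === 'string' && key.length > 0
}

console.log(hasModelKey('ANTHROPIC_API_KEY=sk-ant-placeholder')) // prints true
console.log(hasModelKey('# ANTHROPIC_API_KEY=commented-out')) // prints false
```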