rayan/next.js

Arian Tron 61f56f997c

first commit

2026-03-10 19:37:31 +03:30

README.md

Evals

Agent evals for Next.js. Each eval is a small Next.js app + a prompt + assertions. We run the prompt through a coding agent in a sandbox and check what it wrote.

The point: find places where agents get Next.js wrong because their training data is stale, then fix it by shipping better docs in the next package itself.

How it works

The runner is @vercel/agent-eval. It spins up a sandbox (Vercel or local Docker), copies the fixture in, runs the coding agent against PROMPT.md, then executes EVAL.ts as a vitest file against whatever the agent wrote. The PROMPT.md / EVAL.ts / fixture-dir convention you'll see below is that package's convention — see its README for the full spec.

run-evals.js is a thin wrapper around it: pack the local next build into a tarball, generate two experiment configs (baseline and agents-md) that differ only in whether they drop an AGENTS.md pointing at the bundled docs, then invoke agent-eval run-all. Everything from "spawn sandbox" onward is @vercel/agent-eval's job.
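The two generated experiment configs can be pictured roughly as below. This is a sketch only: the `ExperimentConfig` shape, its field names, `makeExperiments`, and the tarball path are all hypothetical and are not the schema that @vercel/agent-eval or run-evals.js actually uses.

```typescript
// Hypothetical config shape, for illustration only; the real schema
// belongs to @vercel/agent-eval and may look quite different.
interface ExperimentConfig {
  name: 'baseline' | 'agents-md'
  nextTarball: string // path to the packed local `next` build
  writeAgentsMd: boolean // whether setup drops an AGENTS.md into the sandbox
}

// Build both variants from the same inputs so they differ in one field only.
function makeExperiments(nextTarball: string): ExperimentConfig[] {
  const shared = { nextTarball }
  return [
    { ...shared, name: 'baseline', writeAgentsMd: false },
    { ...shared, name: 'agents-md', writeAgentsMd: true },
  ]
}

const experiments = makeExperiments('.tarballs/next.tgz') // placeholder path
console.log(experiments.map((e) => `${e.name}: AGENTS.md=${e.writeAgentsMd}`))
```

Deriving both variants from one shared object is what keeps the comparison fair: anything other than the AGENTS.md flag is identical by construction.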

One-time setup

Vercel employees: request access to the vercel-labs team in Lumos, then:

# Vercel CLI, if you don't have it: npm i -g vercel
vc link       # at repo root, pick vercel-labs team
vc env pull   # writes .env.local to repo root

External contributors can run the same evals in local Docker with their own API key — see Running without Vercel sandbox access.

Writing an eval

Copy an existing fixture. Take the next free number — gaps are fine.

cp -r evals/evals/agent-034-async-cookies evals/evals/agent-042-your-thing
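If you would rather compute the next number than eyeball it, something like the following could work. It is a sketch under assumptions: `nextFixtureNumber` is a hypothetical helper, it assumes every fixture follows the `agent-NNN-slug` naming shown above, and `agent-041-example` is a made-up directory name.

```typescript
// Given existing fixture directory names, suggest the next number.
// Picks highest + 1, which leaves any existing gaps alone.
function nextFixtureNumber(dirNames: string[]): string {
  let highest = 0
  for (const name of dirNames) {
    const match = /^agent-(\d+)-/.exec(name)
    if (match) highest = Math.max(highest, Number(match[1]))
  }
  // Zero-pad to three digits to match names like agent-034-async-cookies.
  return String(highest + 1).padStart(3, '0')
}

// prints 042
console.log(nextFixtureNumber(['agent-034-async-cookies', 'agent-041-example']))
```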

Then edit three files:

PROMPT.md — what you'd type into the agent. Write it like a real user would: describe the symptom or goal, not the API. "Navigating from /a to /b is slow, fix it" is a good prompt. "Use unstable_instant" is not — you're testing whether the agent understands the feature well enough to reach for it, not whether it can pattern-match a name you handed it.

EVAL.ts — vitest assertions against files the agent wrote. Regex the source, don't run it.

import { expect, test } from 'vitest'
import { readFileSync } from 'fs'
import { join } from 'path'

// Read app/page.tsx from the working directory as plain text.
const page = readFileSync(join(process.cwd(), 'app/page.tsx'), 'utf-8')

// Assert on the source itself; the app is never executed.
test('exports unstable_instant', () => {
  expect(page).toMatch(/export const unstable_instant\b/)
})

app/ (or pages/) — the starting state. Give the agent something to edit, not a blank slate.

package.json needs a build script. next.config.ts and tsconfig.json can stay as copied unless your feature requires specific config.
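The PROMPT.md guidance above (describe the symptom, don't name the API) can be checked mechanically. A minimal sketch, assuming you keep a list of the identifiers your EVAL.ts asserts on; `leakedNames` is a hypothetical helper and not part of the eval runner.

```typescript
// Flag prompts that hand the agent an identifier EVAL.ts asserts on.
// Comparison is case-insensitive substring matching, nothing smarter.
function leakedNames(prompt: string, assertedNames: string[]): string[] {
  const lower = prompt.toLowerCase()
  return assertedNames.filter((name) => lower.includes(name.toLowerCase()))
}

const good = 'Navigating from /a to /b is slow, fix it'
const bad = 'Use unstable_instant'

console.log(leakedNames(good, ['unstable_instant'])) // empty: nothing leaked
console.log(leakedNames(bad, ['unstable_instant'])) // contains 'unstable_instant'
```

A non-empty result suggests the prompt is testing name matching rather than understanding of the feature.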

Running

pnpm eval agent-042-your-thing

This runs two variants in parallel and prints pass/fail for each:

✗ baseline/agent-042-your-thing   (81s)
✓ agents-md/agent-042-your-thing  (200s)

agents-md drops an AGENTS.md into the sandbox telling the agent to check node_modules/next/dist/docs/ first. baseline doesn't. That's the whole difference — same prompt, same model, one extra file. If agents-md passes and baseline doesn't, the bundled docs are doing their job.
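If you want to post-process that output, the lines can be parsed along these lines. This is a sketch: the line format is inferred from the sample above and is not a documented contract, and `parseResultLine` and `VariantResult` are hypothetical names.

```typescript
interface VariantResult {
  passed: boolean
  variant: string
  evalName: string
  seconds: number
}

// Parse one line of the form "✓ agents-md/agent-042-your-thing  (200s)".
// Returns null for anything that does not match that shape.
function parseResultLine(line: string): VariantResult | null {
  const match = /^([✓✗])\s+([^/\s]+)\/(\S+)\s+\((\d+)s\)$/.exec(line.trim())
  if (!match) return null
  return {
    passed: match[1] === '✓',
    variant: match[2],
    evalName: match[3],
    seconds: Number(match[4]),
  }
}

console.log(parseResultLine('✗ baseline/agent-042-your-thing   (81s)'))
console.log(parseResultLine('✓ agents-md/agent-042-your-thing  (200s)'))
```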

A run takes ~2–5 min. To validate a fixture without executing:

pnpm eval agent-042-your-thing --dry

Full transcripts land in evals/results/<variant>/<timestamp>/<eval>/run-1/. Grep transcript-raw.jsonl to see exactly what the agent did.
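When plain grep gets unwieldy, a small filter over the JSONL can help. The sketch below assumes nothing about the transcript schema: it matches on the raw line and parses only the hits. `grepJsonl` is a hypothetical helper and the sample entries are made up.

```typescript
// Return the parsed entries of a JSONL transcript whose raw line mentions
// `needle`. Lines that fail to parse are skipped.
function grepJsonl(jsonl: string, needle: string): unknown[] {
  const hits: unknown[] = []
  for (const line of jsonl.split('\n')) {
    if (!line.includes(needle)) continue
    try {
      hits.push(JSON.parse(line))
    } catch {
      // ignore blank or truncated lines
    }
  }
  return hits
}

// Made-up sample lines; real transcript entries will look different.
const sample = [
  '{"step":1,"text":"reading node_modules/next/dist/docs/"}',
  '{"step":2,"text":"editing app/page.tsx"}',
].join('\n')

console.log(grepJsonl(sample, 'dist/docs'))
```

Searching for `dist/docs` is one quick way to see whether the agent opened the bundled docs at all.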

When to rebuild

pnpm eval packs packages/next/dist/ into a tarball and ships that to the sandbox. It does not build. If you changed packages/next/src/** or docs/**, run pnpm --filter=next build first or the sandbox will see stale code. If you only changed fixture files, no rebuild is needed.
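The rule above reduces to a path check. A minimal sketch, assuming paths are given relative to the repo root; `needsRebuild` is a hypothetical helper and `docs/example.mdx` is a made-up path.

```typescript
// True when any changed path falls under packages/next/src/ or docs/,
// the two locations that require a rebuild before running evals.
function needsRebuild(changedPaths: string[]): boolean {
  return changedPaths.some(
    (p) => p.startsWith('packages/next/src/') || p.startsWith('docs/')
  )
}

console.log(needsRebuild(['evals/evals/agent-042-your-thing/PROMPT.md'])) // false
console.log(needsRebuild(['docs/example.mdx'])) // true
```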

Workflow

1. Write the fixture. PROMPT.md describes a user-facing problem. EVAL.ts asserts the API you expect the agent to reach for.
2. Build Next.js. pnpm build. The eval runner packs whatever is already in dist/; it won't build for you.
3. Run it. pnpm eval <name>. If the feature isn't in the agent's training data and isn't documented in dist/docs/, both variants fail. That's the expected starting point for a new feature.
4. Write the doc. Add an .mdx under docs/. Use version: draft in the frontmatter to keep it off nextjs.org while still bundling it into the package.
5. Build again. New doc needs to land in dist/docs/ before the next pack sees it.
6. Run it again. baseline should still fail; agents-md should find the new doc and pass. Baseline staying red while agents-md flips green tells you the doc did it, not run-to-run noise.
7. Commit the eval and the doc together. The full suite gets pulled by the external benchmark runner and published to nextjs.org/evals. Keeping the fixture alongside the doc it validates means that score tracks over time as both the docs and the models change.
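The pass/fail combinations the workflow talks about can be summarized as a small decision function. This is one reading of the text, not runner behavior: the verdict labels are made up, and the two cases the README does not spell out (both passing, or only baseline passing) are interpretations.

```typescript
type Verdict = 'docs-helped' | 'undocumented' | 'already-known' | 'inconclusive'

function classify(baselinePassed: boolean, agentsMdPassed: boolean): Verdict {
  // Baseline red, agents-md green: the bundled doc made the difference.
  if (!baselinePassed && agentsMdPassed) return 'docs-helped'
  // Both red: expected starting point for a new, undocumented feature.
  if (!baselinePassed && !agentsMdPassed) return 'undocumented'
  // Both green (interpretation): the agent did not need the docs.
  if (baselinePassed && agentsMdPassed) return 'already-known'
  // Only baseline green (interpretation): likely run-to-run noise.
  return 'inconclusive'
}

console.log(classify(false, true))
```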

Layout

evals/
├── evals/agent-*/   # fixtures
├── lib/setup.ts     # uploads tarball, writes AGENTS.md (shared by all evals)
├── experiments/     # generated per-run, gitignored
├── .tarballs/       # packed next, gitignored
└── results/         # transcripts + outputs, gitignored

Sandbox tokens live in .env.local at the repo root (from vc env pull).
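For scripts that need to locate a transcript inside results/, the path can be assembled as sketched below. Assumptions: `transcriptPath` is a hypothetical helper, transcript-raw.jsonl is taken to sit directly inside the run-1 directory, and the timestamp is passed through as-is because its format is not specified here.

```typescript
// Build evals/results/<variant>/<timestamp>/<eval>/run-1/transcript-raw.jsonl
// relative to the repo root.
function transcriptPath(variant: string, timestamp: string, evalName: string): string {
  return [
    'evals',
    'results',
    variant,
    timestamp, // directory name created by the runner, format unknown
    evalName,
    'run-1',
    'transcript-raw.jsonl',
  ].join('/')
}

console.log(transcriptPath('agents-md', '<timestamp>', 'agent-042-your-thing'))
```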

Running without Vercel sandbox access

If you don't have Vercel credentials, @vercel/agent-eval falls back to local Docker — see its direct API keys docs for the full list of supported env vars. Have Docker running and provide your own model key in .env.local at the repo root:

ANTHROPIC_API_KEY=sk-ant-...

Then run pnpm eval <name> as normal. Docker pulls node:24-slim on first run. Tarball packing, both variants, and the results layout are identical to the remote path — run-evals.js doesn't know or care which sandbox backend got picked.
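Before a local Docker run it can be handy to confirm that .env.local actually carries a model key. The sketch below is a hypothetical preflight check, not something run-evals.js or @vercel/agent-eval does; the parser handles only simple KEY=value lines and checks only the ANTHROPIC_API_KEY variable shown above.

```typescript
// Minimal KEY=value parser; ignores blank lines and # comments.
// Not a full dotenv implementation (no quoting or multiline support).
function parseEnv(text: string): Record<string, string> {
  const env: Record<string, string> = {}
  for (const raw of text.split('\n')) {
    const line = raw.trim()
    if (line === '' || line.startsWith('#')) continue
    const eq = line.indexOf('=')
    if (eq <= 0) continue // no key before '=' or no '=' at all
    env[line.slice(0, eq).trim()] = line.slice(eq + 1).trim()
  }
  return env
}

// True when the env text defines a non-empty ANTHROPIC_API_KEY.
function hasModelKey(envText: string): boolean {
  const key = parseEnv(envText)['ANTHROPIC_API_KEY']
  return key !== undefined && key.length > 0
}

console.log(hasModelKey('ANTHROPIC_API_KEY=sk-ant-...')) // true
console.log(hasModelKey('# no keys yet')) // false
```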
