First Time Setup
This guide takes a fresh checkout of The Test Cabinet to the point where you can launch a run. It covers the toolchain, the container runtime, the harness image, the headless browser, and credentials: the five things a run needs that the repository alone does not provide.
The project is in early development, so setup assumes some familiarity with Rust, Node, and containers. Building holds the authoritative build details; this guide is the task-oriented version that sits on top of it.
The tcab command
Runs are driven by the tcab CLI (binary tcab, crate test-cabinet-cli).
There are two ways to invoke it, and the rest of these guides use the first:
- A released binary: tcab run …. Released binaries are published on GitHub (Linux static-musl, Windows, macOS).
- From a source checkout: cargo run -p test-cabinet-cli -- run …. Everything after -- is passed to tcab. This is the form to use while working in the repository.
Wherever a guide shows tcab <args>, the source-checkout equivalent is
cargo run -p test-cabinet-cli -- <args>.
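If you switch between a released binary and a checkout, a small shell helper can choose the invocation for you. The sketch below is not part of the repository; the function name tcab_prefix is hypothetical, and it assumes only the two forms described above.

```shell
# Hypothetical helper: print the command prefix to use for tcab.
# Prefers a released binary on PATH, otherwise the source-checkout form.
tcab_prefix() {
  if command -v tcab >/dev/null 2>&1; then
    printf '%s\n' 'tcab'
  else
    printf '%s\n' 'cargo run -p test-cabinet-cli --'
  fi
}
```

With the function defined, `$(tcab_prefix) harnesses` runs the same subcommand in either setup; the unquoted substitution is deliberate so the prefix splits into separate words.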
1. Toolchain
The repository is both a Cargo (Rust) and an npm (TypeScript) workspace. Build both once:
    cargo build --workspace   # Rust: core, CLI, desktop shell
    npm install               # TypeScript: installs every workspace
The pinned Rust toolchain is declared in rust-toolchain.toml. Format and lint
with cargo fmt --all and cargo clippy --workspace.
If you are on a distribution without the generic FHS dynamic loader (notably
NixOS), build the fully static tcab instead with cargo build-portable (an
alias that targets x86_64-unknown-linux-musl); see
Portable build for the musl
prerequisites.
2. A container runtime
Every run executes inside an isolated container so a model cannot reach the host
filesystem or other runs’ outputs (see Execution).
You need Podman (preferred) or Docker on PATH. The runtime is
auto-detected; override it with TCAB_CONTAINER_RUNTIME=<binary>.
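To check before a run which runtime would be picked up, the documented preference can be approximated in a few lines of shell. This is a sketch under the assumptions stated in its comments, not tcab's own detection code; the function name pick_runtime is hypothetical.

```shell
# Sketch of a preflight check. It mirrors the documented preference
# (override, then Podman, then Docker) but is NOT tcab's implementation.
pick_runtime() {
  # An explicit override wins.
  if [ -n "${TCAB_CONTAINER_RUNTIME:-}" ]; then
    printf '%s\n' "$TCAB_CONTAINER_RUNTIME"
    return 0
  fi
  # Otherwise take the first runtime found on PATH, Podman first.
  for candidate in podman docker; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no container runtime found on PATH" >&2
  return 1
}
```

A non-zero exit status means neither binary was found, which is worth catching before any later setup step.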
Runs always execute Linux containers, so platform expectations differ:
- Linux: rootless Podman runs containers directly on the host. tcab adds --userns=keep-id so the mounted repository stays writable by the run user.
- macOS: Podman runs containers inside its managed Linux VM (podman machine init && podman machine start). The VM shares your home directory but not the OS temp directory, which is why staged inputs default to ~/.tcab (below). On Apple Silicon the machine is arm64, so harness images build and run arm64 by default.
- Windows: Podman runs on its WSL2 backend, so WSL must be installed (wsl --install) before podman machine init.
Where a run stages its mountable inputs — the seeded repository, collected
artifacts, and capture scratch — is resolved as --work-dir, then
TCAB_WORK_DIR, then ~/.tcab. It must be a path the runtime can mount; on
macOS and Windows that rules out the OS temp directory, which is why the default
is home-based.
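The lookup order can be written out as a short shell function. This is an illustration of the documented precedence only: the function name resolve_work_dir is hypothetical, and treating an empty value as unset is an assumption rather than confirmed CLI behaviour.

```shell
# Sketch of the documented order: --work-dir, then TCAB_WORK_DIR, then ~/.tcab.
# $1 stands for the value passed to --work-dir (empty when the flag is absent).
# Assumption: an empty value is treated the same as an unset one.
resolve_work_dir() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  elif [ -n "${TCAB_WORK_DIR:-}" ]; then
    printf '%s\n' "$TCAB_WORK_DIR"
  else
    printf '%s\n' "$HOME/.tcab"
  fi
}
```

For example, `TCAB_WORK_DIR=/data/tcab resolve_work_dir` prints the environment value, while passing an argument prints that argument regardless of the environment.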
3. The harness image
A run drives an agent harness inside the
container, so the harness’s run-container image must be built once. From the
containers/ directory (see its README.md):
    cd containers && DOCKER=podman ./build.sh claude   # builds the base + claude image
Build the image for whichever harness you intend to run. The supported harness
slugs are claude, codex, cline, antigravity, goose, kilo,
opencode, and pi. Confirm availability without starting a run:
    tcab harnesses   # human-readable table; add --json for machine output
4. A headless browser
The validator and the reference renderer shell
out to a Playwright browser driver. Install the Chromium revision the driver
expects through the pinning workspace — a bare npx playwright fetches a
different version:
    npm exec -w @test-cabinet/browser-driver -- playwright install chromium
The driver (packages/browser-driver/driver.mjs) is located relative to the
working directory; override with TCAB_BROWSER_DRIVER. A run will not start
unless every one of the selected variant’s reference mockups renders, since those
screenshots are both the seeded visual targets and the validation baselines — a
render failure aborts the run before a harness session is spent. (The seed,
validate, and catalog commands degrade per-view instead of aborting.)
5. Credentials
The harness needs an API key for its model provider. The CLI keeps its several kinds of credential separate (see CLI Authentication); for a basic run you only need the harness key.
Each harness reads a specific variable — ANTHROPIC_API_KEY for claude,
OPENAI_API_KEY for codex, OPENROUTER_API_KEY for the OpenRouter-backed
harnesses. The CLI loads a .env from the working directory (or any parent) on
startup; copy .env.example to .env and fill in the keys. Variables already
exported in the shell take precedence over the file. The key is passed into the
run container as a secret and is never written into the seeded repository.
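The precedence rule (shell exports win over the file) can be illustrated with a small loader. This is only a sketch of the semantics described above: the function name load_env_file is hypothetical, it accepts plain KEY=VALUE lines, and the CLI's real .env parser may treat quoting and comments differently.

```shell
# Sketch only: illustrates "exported variables take precedence over .env".
# Assumption: plain KEY=VALUE lines, no quoting or multi-line values.
load_env_file() {
  file=$1
  [ -f "$file" ] || return 0
  while IFS='=' read -r key value || [ -n "$key" ]; do
    case $key in
      ''|'#'*) continue ;;                  # blank lines and comments
      [0-9]*|*[!A-Za-z0-9_]*) continue ;;   # not a valid variable name
    esac
    # Export only when the shell has not already set the variable.
    if eval "[ -z \"\${$key+x}\" ]"; then
      export "$key=$value"
    fi
  done < "$file"
}
```

Running `load_env_file .env` after `export ANTHROPIC_API_KEY=...` leaves the exported value untouched and fills in only the variables that were missing.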
6. Make a first run
Run from the repository root so the test-cases/ catalog and the browser driver
resolve (override the catalog location with TCAB_TEST_CASES_DIR):
    tcab run \
      --test-case pong --version v1.0.0 --variant base \
      --harness claude --model anthropic/claude-opus-4 \
      --out-dir runs
This renders the references, seeds a fresh repository with the selected variant’s
specs and screenshots, renders the prompt and hands it to the harness in a
container while printing the live event stream, then
builds and load-checks the result,
runs the declared checks, and writes runs/<id>/run-record.json alongside a copy
of the implementation. --variant is required; --max-runtime <seconds>
overrides the case’s default cap for this invocation.
Next steps
- Run a Test Case: the quickstart, once setup is done.
- Reviewing Test Run Results: assess the run you just produced.
- Authoring a Test Case: write your own case.