Terminology
Catalog
The term “catalog” refers to The Test Cabinet’s full set of test cases.
Harness
In the context of The Test Cabinet, “harness” can refer to either of two things:
- The Test Cabinet itself
- Agentic harnesses used to drive model(s)
The Test Cabinet is responsible for running other harnesses. It does not call LLM APIs directly or implement an agentic loop. That responsibility lies entirely with the agentic harnesses that The Test Cabinet uses to run the tests.
Models are the large language models that determine the actions an agentic harness takes.
Publishing
“Publishing” refers to releasing an implementation to GitHub and uploading its run record to The Test Cabinet’s backend, from which a public snapshot is exported for the website. Test runs exist only locally until published.
Reporters
Reporters are the components of The Test Cabinet that report run results. Only GUI reporters allow users to interact with test case implementations.
Review
All test cases are manually reviewed after the implementation is complete. This allows the reviewer to assess how well a model matched the spec, check for bugs, and provide other non-automated feedback about the run result. Reviews are somewhat subjective, since games don’t map cleanly to a rigid grading scale.
Run Records
A run record is produced each time a test case runs to completion. It records all information from the run, such as its run time, version information, and token/cost data.
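As a rough illustration only, a run record could be modeled as a small data structure like the one below. Every field name here (`test_case`, `harness`, `model`, `run_time_seconds`, `harness_version`, `input_tokens`, `output_tokens`, `cost_usd`) and all example values are assumptions made for this sketch; they are not taken from The Test Cabinet’s actual record format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    """Hypothetical shape of a run record; all field names are assumptions."""

    test_case: str            # which test case was run
    harness: str              # agentic harness that drove the model
    model: str                # model that the harness used
    run_time_seconds: float   # wall-clock run time
    harness_version: str      # version information for the harness
    input_tokens: int         # token data
    output_tokens: int
    cost_usd: float           # cost data

    def total_tokens(self) -> int:
        # Sum of input and output tokens for the run.
        return self.input_tokens + self.output_tokens

    def to_json(self) -> str:
        # Serialize the record so it can be stored locally or uploaded later.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values, invented purely for this example.
record = RunRecord(
    test_case="example-game",
    harness="example-harness",
    model="example-model",
    run_time_seconds=812.5,
    harness_version="1.2.3",
    input_tokens=120_000,
    output_tokens=30_000,
    cost_usd=4.75,
)
```

Keeping the record as plain serializable data would make it straightforward to hold locally until it is published.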
Runners
The term “runner” refers to any component of The Test Cabinet that is capable of running test cases.
Test Case
Test cases provide the scenarios used for testing. Each test case represents some project that must be implemented from scratch.
Validation
The Test Cabinet makes use of a small amount of automated validation. These validations cover basic checks like “Does this implementation even build?” or “How well does the implemented UI match the reference image?”.
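A “does it build?” check can be as simple as running a build command and inspecting its exit code. The sketch below is a minimal illustration under that assumption; the function name `build_succeeds` and its parameters are hypothetical and do not describe how The Test Cabinet actually performs validation.

```python
import subprocess
import sys
from typing import Optional, Sequence


def build_succeeds(
    command: Sequence[str],
    cwd: Optional[str] = None,
    timeout: float = 60.0,
) -> bool:
    """Return True if the (hypothetical) build command exits with status 0."""
    try:
        result = subprocess.run(
            list(command),
            cwd=cwd,
            capture_output=True,  # keep build output out of the console
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing executable or a hung build both count as a failed build.
        return False
    return result.returncode == 0


# Stand-in "build" commands that use the current Python interpreter,
# so the example runs anywhere without a real build tool installed.
passing = build_succeeds([sys.executable, "-c", "pass"])
failing = build_succeeds([sys.executable, "-c", "raise SystemExit(1)"])
```

In practice the command would be whatever builds the implementation under test; the interpreter calls above are only placeholders.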
Variant
Test cases may define multiple variants, which identify modifications to make to the specifications provided as input for the test. These variants may change game mechanics, add or remove content, and noticeably affect the difficulty of a test case.
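One way to picture a variant is as a named set of changes applied on top of a base specification. The sketch below assumes the specification can be treated as a simple key/value mapping, which is a simplification; the `Variant` class, the `apply_variant` function, and the example spec keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant: a named set of modifications to a base spec."""

    name: str
    overrides: Dict[str, Any] = field(default_factory=dict)  # keys to change or add
    removals: Tuple[str, ...] = ()                            # keys to remove


def apply_variant(base_spec: Dict[str, Any], variant: Variant) -> Dict[str, Any]:
    # Work on a copy so the base specification stays untouched.
    spec = dict(base_spec)
    for key in variant.removals:
        spec.pop(key, None)
    spec.update(variant.overrides)
    return spec


# Invented example: a variant that changes a mechanic and removes content.
base_spec = {"lives": 3, "levels": 5, "power_ups": True}
hard = Variant(name="hard", overrides={"lives": 1}, removals=("power_ups",))
hard_spec = apply_variant(base_spec, hard)
```

Here the `hard` variant lowers the number of lives and drops power-ups, which mirrors how a variant can change mechanics, remove content, and shift difficulty.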