// Build, run, and report on LLM evals. Pairwise comparisons, judges, regression detection.

Evals Runner
Verified Creator

git log --oneline --stat

HEAD

Stars: 2.9k
Forks: 98
Updated: Apr 27, 2026

License: MIT
Version: v2.0.5

quickstart.sh

3 steps

1
Install
// Drops SKILL.md into ~/.claude/skills/
```
$ claude skills add evals-runner
```

2
Invoke
// Run from any project directory
$ claude --skill evals-runner "help me ship this"

3
Iterate
// Re-run with edits — Claude keeps the skill loaded
```
$ claude --skill evals-runner "now refactor it"
```

evals-runner/

references

SKILL.md

readonly

name:: Evals Runner
slug:: evals-runner
version:: v2.0.5
license:: MIT
author:: @evalforge
repository:: github.com/evalforge/evals-runner
categories:: Prompt Engineer, QA / Testing
tags:: #evals #llm #benchmarks #judges #regression
description:: Build, run, and report on LLM evals. Pairwise comparisons, judges, regression detection.
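
The description mentions pairwise comparisons and regression detection. The following is a minimal Python sketch of how those two ideas can fit together, not the skill's actual implementation. The function names `win_rate` and `is_regression`, the verdict labels `"A"`, `"B"`, and `"tie"`, and the 0.05 tolerance are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical verdict labels: "A" = candidate wins, "B" = opponent wins.
VALID_VERDICTS = {"A", "B", "tie"}


def win_rate(verdicts):
    """Return the share of pairwise comparisons won by "A"; a tie counts as half a win."""
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["A"] + 0.5 * counts["tie"]) / total


def is_regression(baseline_rate, candidate_rate, tolerance=0.05):
    """Flag a regression when the candidate trails the baseline by more than `tolerance`."""
    return (baseline_rate - candidate_rate) > tolerance


# Made-up judge verdicts for two runs over the same five prompts.
baseline_verdicts = ["A", "A", "B", "tie", "A"]
candidate_verdicts = ["A", "B", "B", "tie", "B"]

baseline_rate = win_rate(baseline_verdicts)
candidate_rate = win_rate(candidate_verdicts)
print(is_regression(baseline_rate, candidate_rate))  # True
```

A fixed tolerance is the simplest possible check. With small sample sizes, a variance-aware comparison (for example, a bootstrap confidence interval on the win rate) would be a more defensible choice.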

features.md

3 capabilities

// What you can do with it

Automates the tedious parts of the workflow.
Gives Claude the right context, tools, and guardrails.
Produces consistent, reviewable output every time.

@evalforge/index.json

more by author

Prompt Eval Kit
Design, run, and score prompt evaluations with variance-aware benchmarks and regression tracking.
2.2k

$ ls related/

code-reviewer.md

@review-craft

Code Reviewer · v2.6.0

Thorough PR review — security, performance, correctness, style, with inline-comment-ready output.

DevOps / CI/CD · #code-review #pr-review #security

2026-06-02 · cd ./code-reviewer →

security-review-skill.md

@anthropic-labs

Security Review · v2.4.0

OWASP-aware code security audit — finds auth, input, secret, and dependency issues with remediation.

DevOps / CI/CD · #security #owasp #code-review

2026-05-27 · cd ./security-review-skill →

@browser-craft

Browser Use · v2.2.0

Automate browser interactions — web testing, form filling, screenshots, scraping, multi-tab workflows.

QA / Testing · #browser #playwright #automation

2026-06-02 · cd ./browser-use →