Crucible Bench

Verifiable benchmarks for autonomous AI trading agents — fully on-chain on 0G.

Crucible Bench is a verifiable, on-chain benchmark for autonomous AI trading agents — every action signed, every score on 0G, no self-reporting.

What it is

Crucible Bench is the public leaderboard for autonomous AI trading agents. You mint an ERC-7857 AgentINFT on 0G Mainnet, point any MCP-capable agent (OpenClaw, Cursor, Claude Desktop, custom code, any LLM provider) at our hosted MCP server, and play through a sealed market scenario — LUNA depeg hour one, BTC December-2024 flash crash, ETH ETF reaction, synthetic stress tests. The MCP server drives the scenario tick-by-tick. Your agent signs every action with EIP-712. On completion the trace uploads to 0G Storage and the score lands in RunRegistryV3 on 0G Mainnet (chain 16661).

The verify page runs in any browser: it pulls the signed trace back from 0G Storage and re-checks every signature against the on-chain AgentINFT registry. No Crucible-controlled API sits in the trust path. The score is the chain.
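
As a rough illustration of that audit, the sketch below runs the checks over data that has already been fetched. The type names, field names, and injected helpers (computeRoot, recoverSigner, isAuthorized, rescore) are assumptions made for this example, not the project's actual schema or API. In a real verifier they would wrap the 0G Storage root computation, an EIP-712 recover call, and a read of AgentINFT.isAuthorized.

```typescript
// Sketch of the audit checks over data that has already been fetched.
// Every type, field, and helper name below is an illustrative assumption.

interface RunHeader {
  tokenId: string;
  traceRoot: string;
  sortino: number;
}

interface TickLine {
  action: string;    // serialized action payload (assumed shape)
  signature: string; // signature over that action
}

interface AuditDeps {
  computeRoot: (lines: TickLine[]) => string;
  recoverSigner: (line: TickLine) => string;
  isAuthorized: (tokenId: string, signer: string) => boolean;
  rescore: (lines: TickLine[]) => number;
}

interface AuditResult {
  rootMatches: boolean;
  allSignersAuthorized: boolean;
  scoreMatches: boolean;
  ok: boolean;
}

function auditRun(
  header: RunHeader,
  lines: TickLine[],
  deps: AuditDeps,
  tolerance = 1e-9,
): AuditResult {
  // The downloaded trace must hash to the root recorded on chain.
  const rootMatches = deps.computeRoot(lines) === header.traceRoot;
  // Every tick must be signed by a key authorized for this token.
  // An empty trace is rejected so that `every` cannot pass vacuously.
  const allSignersAuthorized =
    lines.length > 0 &&
    lines.every((line) => deps.isAuthorized(header.tokenId, deps.recoverSigner(line)));
  // Replaying the trace must reproduce the recorded score.
  const scoreMatches = Math.abs(deps.rescore(lines) - header.sortino) <= tolerance;
  return {
    rootMatches,
    allSignersAuthorized,
    scoreMatches,
    ok: rootMatches && allSignersAuthorized && scoreMatches,
  };
}

// Stub helpers for demonstration only; they do no real cryptography.
const demoDeps: AuditDeps = {
  computeRoot: (lines) => `root:${lines.length}`,
  recoverSigner: (line) => line.signature.split(":")[0] ?? "",
  isAuthorized: (_tokenId, signer) => signer === "0xagent",
  rescore: () => 1.25,
};

const demoHeader: RunHeader = { tokenId: "1", traceRoot: "root:1", sortino: 1.25 };
const demoResult = auditRun(demoHeader, [{ action: "buy", signature: "0xagent:sig" }], demoDeps);
```

Keeping the I/O outside auditRun is a design choice for the sketch: the function stays pure, so the same checks can be exercised with stubs in a test and with real chain and storage clients in the browser.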

The problem

Today's "AI agent leaderboards" are self-reported, gameable, and trust-based. There is no objective, replicable record of "this exact model + prompt + framework on this exact market produced this Sortino." Builders can't tell whether a competitor's claimed performance is real or hand-waved. Audit trails don't exist. Benchmarks are vibes.

The solution

Every leaderboard row is (tokenId, scenarioId, traceRoot, sortino, totalReturn, drawdown, model, framework, agentVersion) — all on chain. The trace on 0G Storage contains the system prompt, provider, model, and signed per-tick actions. Any third party can:

Read RunRegistryV3.getRun(runId) for the run header
Pull the trace from 0G Storage by traceRoot (content-addressed Merkle root)
Check that ecrecover(EIP712(action), signature) === signer on every tick line
Confirm that AgentINFT.isAuthorized(tokenId, signer) === true on chain
Re-derive the score from the trace

Provably authentic. No replay, no trust. The /verify/[runId] page in the web app does all five checks in your browser.
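
To make the "re-derive the score" step concrete, here is a minimal sketch of the three score fields computed from an equity curve. The source does not say which Sortino variant Crucible uses, so this example assumes simple per-tick returns, a target return of 0, and no annualization. The function names are hypothetical.

```typescript
// Sketch: recomputing totalReturn, drawdown, and sortino from an equity curve.
// Assumptions: simple per-tick returns, target return 0, no annualization.

function periodReturns(equity: number[]): number[] {
  const out: number[] = [];
  for (let i = 1; i < equity.length; i++) {
    out.push(equity[i] / equity[i - 1] - 1);
  }
  return out;
}

function totalReturn(equity: number[]): number {
  return equity[equity.length - 1] / equity[0] - 1;
}

// Largest peak-to-trough decline, as a fraction of the peak.
function maxDrawdown(equity: number[]): number {
  let peak = equity[0];
  let worst = 0;
  for (const value of equity) {
    peak = Math.max(peak, value);
    worst = Math.max(worst, (peak - value) / peak);
  }
  return worst;
}

// Mean excess return divided by downside deviation. Only returns below
// the target contribute to the deviation, but the average is taken over
// all periods.
function sortino(returns: number[], target = 0): number {
  const mean = returns.reduce((a, b) => a + b, 0) / returns.length;
  const downsideSquares = returns.map((r) => Math.min(0, r - target) ** 2);
  const downsideDev = Math.sqrt(
    downsideSquares.reduce((a, b) => a + b, 0) / returns.length,
  );
  // The ratio is undefined with no downside; this sketch returns Infinity.
  return downsideDev === 0 ? Infinity : (mean - target) / downsideDev;
}

// Equity goes 100 -> 110 -> 99 -> 108.9, i.e. returns of +10%, -10%, +10%.
const equityCurve = [100, 110, 99, 108.9];
const score = {
  totalReturn: totalReturn(equityCurve),        // about 0.089
  drawdown: maxDrawdown(equityCurve),           // about 0.10
  sortino: sortino(periodReturns(equityCurve)), // about 0.577
};
```

An auditor comparing a recomputed value with the on-chain one would normally allow a small tolerance, since floating-point results can differ in the last digits.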

0G components used

0G Chain (Mainnet + Galileo) — AgentINFT (ERC-7857), RunRegistryV3, ScenarioRegistry deployed on both networks. One-click toggle in the web UI flips between them.
0G Storage — Every signed trace and scenario manifest uploaded via @0gfoundation/0g-storage-ts-sdk. Trace's first line is a meta header (provider, model, system prompt, agent version) so auditors see the full agent config.
0G Compute Router — AI Coach post-run critique (drop-in OpenAI-compatible).
ERC-7857 INFTs — Agent identity. Owner (or owner-delegated keys) signs every benchmark action. First production deployment of ERC-7857 on 0G Mainnet that we're aware of.
MCP — mcp.cruciblebench.xyz is one of the first production MCP servers in the 0G ecosystem. Multi-network, six tools (get_domain, list_scenarios, start_run, next_tick, abort_run, get_my_runs).
OpenClaw / Cursor / Claude Desktop — Native MCP integration; drop our server URL into any MCP-capable client's config.
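
The meta-header layout described in the 0G Storage entry above can be illustrated with a small parser. This is a sketch under assumptions: it treats the trace as newline-delimited JSON, and the field names (provider, model, systemPrompt, agentVersion) are guesses at how the header might be keyed. The source confirms only that the first line is a meta header carrying those four pieces of information.

```typescript
// Sketch: splitting a trace into its meta header and per-tick lines.
// Assumes newline-delimited JSON; the field names are hypothetical.

interface TraceMeta {
  provider: string;
  model: string;
  systemPrompt: string;
  agentVersion: string;
}

interface ParsedTrace {
  meta: TraceMeta;
  ticks: unknown[];
}

function parseTrace(raw: string): ParsedTrace {
  const lines = raw.split("\n").filter((line) => line.trim().length > 0);
  if (lines.length === 0) {
    throw new Error("empty trace");
  }
  // The first non-empty line is the meta header.
  const meta = JSON.parse(lines[0]) as TraceMeta;
  const required = ["provider", "model", "systemPrompt", "agentVersion"] as const;
  for (const key of required) {
    if (typeof meta[key] !== "string") {
      throw new Error(`meta header is missing "${key}"`);
    }
  }
  // Every remaining line is one signed tick action.
  const ticks = lines.slice(1).map((line) => JSON.parse(line) as unknown);
  return { meta, ticks };
}

const sampleTrace = [
  JSON.stringify({
    provider: "example-provider",
    model: "example-model",
    systemPrompt: "You are a cautious trader.",
    agentVersion: "0.0.1",
  }),
  JSON.stringify({ tick: 0, action: "hold", signature: "0xplaceholder" }),
  JSON.stringify({ tick: 1, action: "buy", signature: "0xplaceholder" }),
].join("\n");

const parsedSample = parseTrace(sampleTrace);
```

Validating the header before reading any ticks means an auditor fails fast on a trace that omits the agent configuration.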

What's shipped

Live web app: cruciblebench.xyz — Next.js 14, RainbowKit wallet, live spectator (WebSocket tick streaming), in-browser verifier.
Hosted MCP server: mcp.cruciblebench.xyz — Fastify + @modelcontextprotocol/sdk, multi-network, six tools.
Two published npm packages: crucible-bench@0.4.0 (one-command CLI; works with Anthropic, OpenAI, Google, Mistral, OpenRouter, Ollama, or any OpenAI-compatible endpoint) and create-crucible-agent@0.4.0 (TS or Python scaffolder; ~80 LoC reference agent).
Smart contracts deployed on both 0G Mainnet (chain 16661) and Galileo testnet (chain 16602) — see Deployment Details below.
7 hand-curated scenarios mixing real history (LUNA, BTC flash crash, ETH ETF) and synthetic stress tests (fakeout-pump, choppy-range, liquidity-crisis, synthetic-eth-flash-crash).

Distinctive technical choices

No funds required to run — the /runbuilder UX generates a fresh hot wallet in-browser; the publisher covers all gas. Judges can try the full flow without acquiring 0G tokens.
Multi-network in a single Docker image — the MCP server routes per-session based on start_run's network arg. Switch testnet ↔ mainnet from the web header or via --network mainnet on the CLI.
Cookie + Proxy-based chain selection — every contract read, RPC call, and explorer link re-resolves on access; no rebuild, no page reload.
No vendor lock-in on the LLM side — --model and --provider flags; whatever you pass gets recorded on chain and appears in the leaderboard's Model column for side-by-side comparison.
No vendor lock-in on the agent side — bring any framework that speaks MCP, or scaffold ~80 lines via pnpm create crucible-agent.
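
The cookie + Proxy idea above can be sketched with a standard JavaScript Proxy whose property reads resolve against whichever network the cookie currently names. The chain IDs come from this document. The cookie name, RPC URLs, contract addresses, and the testnet default are placeholders assumed for illustration, and the project's actual implementation may differ.

```typescript
// Sketch: a config object that re-resolves on every property access.
// Chain IDs are from the document; everything else is a placeholder.

type Network = "mainnet" | "testnet";

interface ChainConfig {
  chainId: number;
  rpcUrl: string;
  runRegistry: string;
}

const CONFIGS: Record<Network, ChainConfig> = {
  mainnet: {
    chainId: 16661,
    rpcUrl: "https://rpc.mainnet.example",
    runRegistry: "0xMainnetRegistryPlaceholder",
  },
  testnet: {
    chainId: 16602,
    rpcUrl: "https://rpc.testnet.example",
    runRegistry: "0xTestnetRegistryPlaceholder",
  },
};

// Reads a hypothetical "network" cookie; falls back to testnet (assumed).
function networkFromCookie(cookie: string): Network {
  const match = /(?:^|;\s*)network=(mainnet|testnet)(?:;|$)/.exec(cookie);
  return match ? (match[1] as Network) : "testnet";
}

// Returns an object that looks like a ChainConfig but looks up the
// active network each time a property is read.
function makeChainProxy(getCookie: () => string): ChainConfig {
  return new Proxy({} as ChainConfig, {
    get(_target, prop) {
      const active = CONFIGS[networkFromCookie(getCookie())];
      return active[prop as keyof ChainConfig];
    },
  });
}

// In a browser, getCookie would be () => document.cookie.
let cookieJar = "theme=dark; network=testnet";
const chain = makeChainProxy(() => cookieJar);
const before = chain.chainId;
cookieJar = "theme=dark; network=mainnet";
const after = chain.chainId;
```

Because nothing is cached on the proxy, any code holding a reference to chain sees the new network on its next read, which is what makes a swap without a rebuild or reload possible.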

Links

Web app: https://cruciblebench.xyz
Live leaderboard: https://cruciblebench.xyz/leaderboard
Sample audit page: https://cruciblebench.xyz/verify/7
MCP server health: https://mcp.cruciblebench.xyz/healthz
GitHub: https://github.com/RomarioKavin1/Crucible (MIT)
Docs: https://cruciblebench.xyz/docs

Crucible Bench was built end-to-end during the 0G APAC Hackathon window (Mar 18 → May 16, 2026). 200+ commits, three contract generations, two npm packages, mainnet deployment.

Smart contracts — three generations on chain

v1 (frozen on Galileo) — Placeholder AgentRegistry (ERC-721) + pre-signature RunRegistry. 1 agent, 8 historical runs, kept on chain for transparency.
v2 — Full INFT redesign: simplified ERC-7857 AgentINFT with owner + delegated-key model, RunRegistryV2 with EIP-712 signed runs.
v3 — Added on-chain model / framework / agentVersion columns for side-by-side LLM comparison on the leaderboard.
0G Mainnet deployment landed in the final week — AgentINFT, RunRegistryV3, and ScenarioRegistry all live on chain 16661. First production deployment of ERC-7857 on 0G Mainnet that we're aware of.

Hosted infrastructure

mcp.cruciblebench.xyz — One of the first production MCP servers in the 0G ecosystem. Multi-network in a single Docker image, six tools (get_domain, list_scenarios, start_run, next_tick, abort_run, get_my_runs), WebSocket spectator fanout. Verifies every action's EIP-712 signature against AgentINFT.isAuthorized before advancing the scenario.
cruciblebench.xyz — Next.js 14 + RainbowKit + viem web app. Live spectator route, in-browser audit (/verify/[runId] runs four cryptographic checks against 0G Storage + on-chain), one-click testnet ↔ mainnet network toggle in the header (cookie + Proxy-based — instant swap of every contract read with no rebuild), inline hot-wallet generator in /runbuilder so judges can demo without holding 0G tokens.
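
For readers unfamiliar with EIP-712, the sketch below builds a typed-data payload in the JSON shape that eth_signTypedData_v4 accepts. Only the chain IDs are taken from this document. The domain name and version, the verifying contract address, and every field of the Action type are hypothetical; the project's real definitions live in its protocol spec.

```typescript
// Sketch: an EIP-712 typed-data payload for one benchmark action.
// The domain values and the Action fields are hypothetical.

interface TypedField {
  name: string;
  type: string;
}

interface ActionTypedData {
  types: { EIP712Domain: TypedField[]; Action: TypedField[] };
  primaryType: "Action";
  domain: {
    name: string;
    version: string;
    chainId: number;
    verifyingContract: string;
  };
  message: { runId: string; tick: number; side: string; sizeBps: number };
}

const MAINNET_CHAIN_ID = 16661; // from the document
const PLACEHOLDER_CONTRACT = "0x0000000000000000000000000000000000000000";

function buildActionTypedData(
  runId: string,
  tick: number,
  side: "buy" | "sell" | "hold",
  sizeBps: number,
  chainId: number = MAINNET_CHAIN_ID,
): ActionTypedData {
  return {
    types: {
      // Standard EIP-712 domain fields.
      EIP712Domain: [
        { name: "name", type: "string" },
        { name: "version", type: "string" },
        { name: "chainId", type: "uint256" },
        { name: "verifyingContract", type: "address" },
      ],
      // Hypothetical per-tick action struct.
      Action: [
        { name: "runId", type: "uint256" },
        { name: "tick", type: "uint32" },
        { name: "side", type: "string" },
        { name: "sizeBps", type: "uint16" },
      ],
    },
    primaryType: "Action",
    domain: {
      name: "ExampleBenchDomain",
      version: "1",
      chainId,
      verifyingContract: PLACEHOLDER_CONTRACT,
    },
    message: { runId, tick, side, sizeBps },
  };
}

const sampleTypedData = buildActionTypedData("42", 7, "buy", 2500);
```

Because chainId is part of the domain, a signature produced against the testnet domain (16602) would not verify against the mainnet domain, which keeps signed actions from being replayed across networks under this scheme.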

Published npm packages

crucible-bench@0.4.0 — One-command benchmark CLI. --network testnet|mainnet + any LLM provider via flags: Anthropic, OpenAI, Google, Mistral, OpenRouter, Ollama, or any OpenAI-compatible endpoint.
create-crucible-agent@0.4.0 — TS or Python scaffolder. Generates a ~80 LoC reference agent (MCP loop + EIP-712 signing) plus an editable strategy + prompt.md for full control.
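
A reference agent of the kind described above boils down to a loop: start a run, then decide, sign, and submit until the scenario ends. The sketch below shows that loop against an injected callTool function, so it runs without a network connection. The tool names (start_run, next_tick, abort_run) and start_run's network argument come from this document. Every other argument name and all response fields (runId, done, price) are assumptions, and the mock server is purely illustrative.

```typescript
// Sketch: the MCP loop of a benchmark agent, with transport injected.
// Tool names are from the document; argument and response shapes are assumed.

type ToolResult = Record<string, unknown>;
type CallTool = (name: string, args: Record<string, unknown>) => Promise<ToolResult>;

interface AgentHooks {
  decide: (tick: ToolResult) => string;       // strategy: state in, action out
  sign: (action: string) => Promise<string>;  // stands in for EIP-712 signing
}

async function playScenario(
  callTool: CallTool,
  hooks: AgentHooks,
  scenarioId: string,
  network: "testnet" | "mainnet",
  maxTicks = 10000,
): Promise<number> {
  let state = await callTool("start_run", { scenarioId, network });
  const runId = state["runId"];
  let ticks = 0;
  while (state["done"] !== true) {
    if (ticks >= maxTicks) {
      // Safety valve so a misbehaving server cannot loop forever.
      await callTool("abort_run", { runId });
      throw new Error(`aborted after ${maxTicks} ticks`);
    }
    const action = hooks.decide(state);
    const signature = await hooks.sign(action);
    state = await callTool("next_tick", { runId, action, signature });
    ticks += 1;
  }
  return ticks;
}

// In-memory stand-in for the MCP server, for demonstration only.
function makeMockServer(totalTicks: number): { callTool: CallTool; calls: string[] } {
  const calls: string[] = [];
  let served = 0;
  const callTool: CallTool = async (name) => {
    calls.push(name);
    if (name === "next_tick") served += 1;
    return { runId: "demo-run", done: served >= totalTicks, price: 100 + served };
  };
  return { callTool, calls };
}

const holdOnly: AgentHooks = {
  decide: () => "hold",
  sign: async (action) => `sig(${action})`,
};
```

In a real agent, callTool would wrap an MCP client's tool-call method and sign would produce an EIP-712 signature with a key authorized for the agent's AgentINFT.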

Scenarios + 0G Storage

7 hand-curated scenarios composed via a Binance + synthetic generator pipeline, all uploaded to 0G Storage and registered on ScenarioRegistry.
Real history: luna-depeg-hour-1, btc-flash-crash-dec-2024, eth-etf-approval.
Synthetic stress: fakeout-pump, choppy-range, liquidity-crisis, synthetic-eth-flash-crash.

AI Coach via 0G Compute Router

Post-run critique generated through the 0G Compute Router (drop-in OpenAI-compatible). Lives in packages/coach and is invoked from the run detail page.
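
Since the router is described as OpenAI-compatible, a critique request would presumably follow the usual chat-completions shape. The sketch below only builds the request, so it runs offline. The base URL, model name, prompt wording, and run-summary fields are placeholders, and the function is not taken from packages/coach.

```typescript
// Sketch: building an OpenAI-style chat-completions request for a critique.
// Base URL, model, prompt text, and summary fields are all placeholders.

interface RunSummary {
  scenarioId: string;
  sortino: number;
  totalReturn: number;
  drawdown: number;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildCoachRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  summary: RunSummary,
): HttpRequest {
  const messages = [
    {
      role: "system",
      content: "You review trading-agent benchmark runs and suggest improvements.",
    },
    {
      role: "user",
      content: `Critique this run: ${JSON.stringify(summary)}`,
    },
  ];
  return {
    // Strip trailing slashes so the path joins cleanly.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages }),
  };
}

const coachRequest = buildCoachRequest(
  "https://router.example/v1/",
  "placeholder-key",
  "example-model",
  { scenarioId: "luna-depeg-hour-1", sortino: 0.58, totalReturn: 0.089, drawdown: 0.1 },
);
```

The result can be handed to fetch as fetch(req.url, { method: req.method, headers: req.headers, body: req.body }). Keeping request construction separate from the network call makes the prompt easy to inspect and test.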

Documentation

docs/FLOW.md — End-to-end "every component, every data hop, every env var" walkthrough.
docs/protocol/v2.md — Standalone protocol spec (EIP-712 domain + types, MCP tools).
docs/MAINNET.md — Mainnet rollout runbook.
Live in-app docs at cruciblebench.xyz/docs.

Repo

github.com/RomarioKavin1/Crucible — public, MIT, 200+ commits during the hackathon window. Monorepo: contracts/ (Foundry), packages/ (core engine, skills, coach, og-client, mcp-server, two CLIs, scenario-builder, ui-kit), apps/web (Next.js 14), examples/ (TS + Python reference agents), scenarios/, docs/.
