
Physics-backed benchmarks for AI hardware-design agents.

Aletheia gives frontier labs a way to evaluate and train electrical-engineering agents against real design constraints: spec compliance, robustness, cost, sourcing, and manufacturability.

EEBench

#    Model                     Avg score   Tokens
01   grok-4-1-fast             91%         19.8M
02   sonnet                    90%         134k
03   claude-haiku-4-5          83%         3.1M
04   gemini-3-flash-preview    82%         7.5M
05   gemini-3-pro-preview      80%         14.4M
06   gpt-5.4                   76%         3.5M
07   claude-sonnet-4-6         60%         3.0M
08   gemini-3.1-pro-preview    60%         1.5M
09   gpt-4.1-mini              40%         10.9M
10   gpt-5.4-mini              38%         6.3M

Powered by real design tools that compile, simulate, and verify.

Capture Requirements

Tasks are expressed as concrete targets, interfaces, budgets, and constraints that compile directly into atopile designs.
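As an illustration of what such a structured spec can look like, here is a minimal sketch in Python; the field names and values are invented for this example, not the actual Aletheia schema:

from dataclasses import dataclass, field

# Hypothetical task-spec shape. Field names are illustrative only.
@dataclass
class TaskSpec:
    name: str
    targets: dict[str, tuple[float, float]]  # quantity -> (min, max) in SI units
    interfaces: list[str]                    # ports the design must expose
    budgets: dict[str, float]                # e.g. {"bom_cost_usd": 2.50}
    hidden_sweeps: list[str] = field(default_factory=list)  # withheld from the agent

buck_5v = TaskSpec(
    name="buck_12v_to_5v",
    targets={"v_out": (4.90, 5.10), "efficiency_full_load": (0.88, 1.00)},
    interfaces=["power_in_12v", "power_out_5v"],
    budgets={"bom_cost_usd": 2.50},
    hidden_sweeps=["load_step_1A_to_5A", "cold_start_minus_20C"],
)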

Simulation and Validation

Designs are deterministically validated, simulated, and diffed against specs — no manual review in the loop.

Manufacturing

BOM analysis, sourcing checks, and manufacturability scoring run directly on the design artifact the agent produced.

Scores tied to engineering constraints, not vibes.

Spec compliance

hard limits

Deterministic checks for task targets like voltage, gain, ripple, noise, latency, or bandwidth.
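A deterministic check of this kind can be as simple as a pure function over measured simulation outputs. The sketch below assumes a hypothetical v_out requirement; it is not the actual evaluator code:

def check_vout(measured_v: float, target: float = 5.0, tol: float = 0.02) -> dict:
    # Pure function of measured outputs: same inputs, same verdict, every run.
    lo, hi = target * (1 - tol), target * (1 + tol)
    return {
        "req": "REQ_VOUT",
        "limit": f"{lo:.2f} V .. {hi:.2f} V",
        "observed": f"{measured_v:.3f} V",
        "pass": lo <= measured_v <= hi,
    }

print(check_vout(4.960))  # passes: 4.90 V <= 4.960 V <= 5.10 V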

Robustness

hidden sweeps

Tolerance, corner, and operating-condition sweeps that punish brittle designs and overfitting.
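For instance, a corner sweep over resistor tolerances for a non-inverting amplifier stage might look like the sketch below; the component values and limits are invented:

import itertools

def gain_survives_corners(r1: float, r2: float, tol: float = 0.01,
                          limits: tuple[float, float] = (1.95, 2.05)) -> bool:
    # Evaluate the closed-loop gain at every tolerance extreme and
    # fail the design if any single corner leaves the allowed band.
    corners = itertools.product(
        (r1 * (1 - tol), r1 * (1 + tol)),
        (r2 * (1 - tol), r2 * (1 + tol)),
    )
    return all(limits[0] <= 1 + rb / ra <= limits[1] for ra, rb in corners)

print(gain_survives_corners(10e3, 10e3))  # True: worst corner is ~2.020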

Cost and complexity

BOM-aware

Scores tradeoffs like BOM cost, sourcing pressure, and design simplicity rather than raw pass rate alone.
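One hedged sketch of such a composite: spec compliance dominates, while cost and part count trade off against per-task budgets. All weights and budgets below are invented, not EEBench's actual scoring:

def composite_score(spec_pass_rate: float,
                    bom_cost_usd: float, cost_budget_usd: float,
                    part_count: int, part_budget: int) -> float:
    # Over-budget designs earn proportionally less; under-budget caps at 1.0.
    cost_term = min(1.0, cost_budget_usd / max(bom_cost_usd, 1e-9))
    simplicity_term = min(1.0, part_budget / max(part_count, 1))
    return 0.6 * spec_pass_rate + 0.25 * cost_term + 0.15 * simplicity_term

# A design that passes 90% of checks but overshoots cost and part budgets:
print(round(composite_score(0.90, 3.10, 2.50, 24, 20), 3))  # 0.867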

Code-first hardware design gives models a native interface to real engineering.

01 Submission artifact
module TPS54560_DV:
    inductor.inductance = 6.8uH
    c_out.capacitance = 4 * 22uF  # four 22uF ceramics in parallel
    assert v_out within 5V +/- 2%
02 Preflight checks
compile · interfaces · topology · load-step
03 Simulation evidence
startup
load-step
efficiency
04 Score package
REQ_001 startup: pass
REQ_002 load-step: fail
EFF_001 efficiency: report
reward scalar: 0.742
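A rough sketch of how a package like this can reduce to a single reward scalar; the shape mirrors the widget above, but the aggregation weights are invented and are not EEBench's actual scoring:

# Hypothetical score-package shape; the weights below are made up.
score_package = {
    "requirements": [
        {"id": "REQ_001", "kind": "startup",   "result": "pass"},
        {"id": "REQ_002", "kind": "load-step", "result": "fail"},
    ],
    "reports": [
        {"id": "EFF_001", "kind": "efficiency", "value": 0.931},
    ],
}

def reward(pkg: dict) -> float:
    reqs = pkg["requirements"]
    hard_pass = sum(r["result"] == "pass" for r in reqs) / len(reqs)
    eff = next(r["value"] for r in pkg["reports"] if r["kind"] == "efficiency")
    return 0.7 * hard_pass + 0.3 * eff  # illustrative weights, not EEBench's

print(round(reward(score_package), 3))  # 0.629 with these invented numbers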
01

Aletheia starts from a structured task spec with targets, interfaces, budgets, and hidden stress cases.

02

Agents build a real atopile hardware design artifact that can be reviewed, validated, and manufactured, not just a simulator-only answer.

03

Pre-simulation checks reject invalid submissions early by validating structure, interfaces, topology, and hard design constraints.

04

Simulation and design analysis measure behavior under load, corners, tolerances, sourcing pressure, and manufacturability constraints.

05

EEBench returns a deterministic score package with requirement results, failed checks, sweeps, and reward-ready metrics.
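Stitched together, steps 01 through 05 form a single machine-checkable loop. The sketch below is a stub-level illustration of that flow; every function stands in for real tooling (atopile compile, ngspice runs), and none of these names are the actual EEBench API:

def preflight(artifact: dict) -> list[str]:
    # Step 03: cheap structural gates before any simulation time is spent.
    required = ("compiles", "interfaces_match", "topology_valid")
    return [check for check in required if not artifact.get(check, False)]

def simulate(artifact: dict) -> dict:
    # Step 04: stand-in for deterministic simulation; returns measurements.
    return {"min_vout_during_step": 4.78, "peak_efficiency": 0.931}

def score(evidence: dict) -> dict:
    # Step 05: deterministic requirement checks over measured evidence.
    checks = {"REQ_002": evidence["min_vout_during_step"] > 4.75}
    return {"checks": checks, "all_pass": all(checks.values())}

artifact = {"compiles": True, "interfaces_match": True, "topology_valid": True}
failures = preflight(artifact)
print({"rejected": failures} if failures else score(simulate(artifact)))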

A run dashboard should make model failure modes obvious.

Aletheia run dashboard
live artifact view
partial pass · buck_12v_to_5v_v0 · Anthropic / Claude

TPS54560 buck-converter validation run

A single-run inspector modeled after the EEBench viewer: task brief, requirement checks, source artifacts, and raw evaluator payloads all stay attached to the same execution record.

Verdict: 2 hard checks pass, 1 fails
Blocking issue: load-step undershoot at 200 kHz
Peak efficiency: 93.1% at light load
Runner: deterministic atopile + ngspice
Simulation plot: measured artifact for the currently selected requirement.
REQ_002: fail
Selected artifact: Load-Step Envelope

Minimum VOUT during the 1 A to 5 A step.

[Plot: minimum VOUT during the step vs. switching frequency. 200 kHz: 4.780 V, below the REQ_002 lower bound; 400 kHz: 4.827 V; 600 kHz: 4.886 V; 1 MHz: within limits. Y axis spans 4.65 V to 5.05 V.]
Limit: min VOUT > 4.75 V
Observed: 200 kHz fails; 400 kHz to 1 MHz pass
Source: annotated buck load-step plot
Requirement checks: visible scoring outcome from the evaluator.

Task suites grounded in real electrical engineering work.

Power delivery

Buck, boost, protection, transients

High-signal tasks where efficiency, regulation, protection behavior, and operating envelope all matter at once.

Signal conditioning

Amplifiers, filters, sensing front ends

Good surfaces for measuring tradeoffs between gain, noise, stability, bandwidth, and component choice.

Analog building blocks

References, comparators, timing stages

Compact tasks with hidden variants that stress topology reasoning, generalization, and evaluator quality.

Frontier labs need evaluators they can actually trust.

01

Version benchmark suites, evaluator logic, and score definitions so labs can compare model runs over time without moving the goalposts; a pinned-manifest sketch follows this list.

02

Publish methodology, task-family coverage, and known limitations so the benchmark is inspectable instead of a black box.

03

Use hidden tests, robustness sweeps, and operating-condition variation to make benchmark gaming and reward hacking harder.

04

Return auditable artifacts like plots, failed constraints, and measured outputs so teams can see why a model passed or failed.
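As referenced in item 01, here is one way such versioning could be made concrete. The manifest fields and pin names are invented for illustration, not an actual Aletheia format:

# Hypothetical pinned manifest: suite, evaluator, and score definitions
# all carry explicit versions, so two runs are comparable only when
# every pin matches. Field names are invented.
SUITE_MANIFEST = {
    "suite": "power-delivery",
    "suite_version": "1.3.0",
    "evaluator_version": "2.1.4",
    "score_definition": "composite-v2",
    "task_ids": ["buck_12v_to_5v", "boost_3v3_to_12v"],
}

def runs_comparable(run_a: dict, run_b: dict) -> bool:
    pins = ("suite_version", "evaluator_version", "score_definition")
    return all(run_a[p] == run_b[p] for p in pins)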

We are trying to close the loop on hardware design.

Software accelerated when build-test-iterate loops collapsed from months to minutes. Physical product development still spends weeks or months waiting for validated feedback.

At atopile, we are building toward that tooling layer: hardware as code, deterministic validation, and workflows that competent agents can actually operate. Aletheia sits alongside that effort as evaluation and training infrastructure, because models improve much faster when they can be measured against real design tasks with real feedback.

Today, hardware design still depends on a long chain of human work across design, review, procurement, manufacturing, bring-up, and lab validation. Our goal is to make more of that chain machine-operable, so agents can design, manufacture, and test real hardware instead of training only against neat simulator abstractions.

That is how we ground training in reality. When the big loop is closed on real hardware outcomes, the smaller simulation and validation loops can be refined against measured results over time. That makes them more accurate and makes it possible to give models denser, higher-quality rewards without losing touch with the physical world.

Tooling

Modern hardware tooling should make design artifacts structured, executable, and fast to validate.

Reality loop

Atopile is turning more of the real EE workflow into something agents can operate: design, manufacturing handoff, and test, rather than simulation-only exercises.

Evaluation

Once the big real-world loop is instrumented, smaller simulation and validation loops can be recalibrated against reality and used to deliver denser, higher-quality rewards.

Outcome

Together, that should compress hardware loops from weeks or months down to multiple validated iterations per day.

Built by engineers who know how real hardware gets made.

Built by ex-Tesla electrical engineers with direct experience shipping real hardware.
Grounded in production constraints like cost, validation, manufacturability, and review discipline.
Designed around how serious EE teams actually specify, test, and iterate on circuits.

Start with a benchmark your team can inspect, rerun, and pressure-test.

A scoped benchmark brief with the task family, limits, and success criteria agreed up front.
An initial task suite with evaluators and hosted execution so results are rerunnable, not one-off demos.
Baseline runs on selected frontier models with side-by-side comparison against the same suite.
An evaluation report with scores, failure modes, representative artifacts, and methodology notes.
If useful, one custom task family shaped around your internal design priorities or domain constraints.