Integrate in one line.
Everything you need to wire an autonomous agent into Ainfera. It reads top to bottom in about 12 minutes, or you can jump to the section you need.
Ainfera is an inference router for autonomous agents. You call our API the way you'd call OpenAI's. We pick the best model for each call within your caps, settle the bill, and post every decision to a public audit chain.
This page is the developer docs. For how calls are placed, see how routing works. For the agent-readable version of the same content, see /llms.txt.
quickstart
If you want one snippet to copy, here it is:
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ainfera.ai/v1",
    api_key=os.environ["AINFERA_KEY"],
)
res = client.chat.completions.create(
    model="ainfera-inference",
    messages=[{"role": "user", "content": "hi"}],
    extra_body={"caps": {"budget": 0.012, "latency_ms": 1500}},
)
Full walkthrough on the Quickstart page →
SDKs & compatibility
Ainfera is wire-compatible with the OpenAI and Anthropic SDKs. Point your existing client at api.ainfera.ai/v1 and you're routed. We also ship a small dedicated SDK for ergonomics; it adds sub-millisecond overhead.
Authentication
Pass an Ainfera key as a bearer token. Keys are scoped to one agent — caps and audit run on the agent the key belongs to.
Authorization: Bearer $AINFERA_KEY
Create keys on the agent detail page. Rotate at the cadence your secret store enforces; rotation never affects routing behavior. Workspace-admin keys exist for tooling.
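For clients that do not use an SDK, the bearer header can be set by hand. The following is a minimal sketch using only the standard library; it builds the request without sending it. `build_request` is a hypothetical helper, and the `/chat/completions` path is taken from the endpoint reference further down this page.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.ainfera.ai/v1"


def build_request(payload: dict, key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated chat-completions request."""
    key = key or os.environ["AINFERA_KEY"]
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    {"model": "ainfera-inference", "messages": [{"role": "user", "content": "hi"}]},
    key="test-key",  # placeholder; load the real key from your secret store
)
print(req.get_header("Authorization"))
```

Passing the result to `urllib.request.urlopen(req)` would send it; that step is left out here so the sketch runs without a key or network access.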
Data handling. Keys are scoped per agent. The audit chain stores content hashes (the decision, prompt, and response hashes, plus routing metadata), never your prompt or response bodies. Provider calls are made over TLS; we don't retain payloads beyond what a routed request needs. Subprocessors are listed on the subprocessors page.
Routing
The whole product, in one paragraph. Pass model: "ainfera-inference" and we score every eligible model against your caps, pick the best, and call it. The decision is returned alongside the completion as a routing field, and posted as a hash to the audit chain.
If you want a specific model, pass it by name (e.g. model: "claude-opus-4-7"). Routing is skipped, caps still apply, audit still happens.
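To make the "filter by caps, then pick the best" idea concrete, here is an illustrative sketch. It is not Ainfera's scoring logic, which this page does not specify: the `Candidate` fields, the tie-break rule, and the model names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical per-model stats; field names are illustrative only.
    name: str
    cost_usd: float
    p50_latency_ms: int
    quality: float
    reliability: float


class NoEligibleModel(Exception):
    """Stands in for the case where no model fits the caps."""


def route(candidates, budget=None, latency_ms=None, quality=None, reliability=None):
    """Drop every candidate that violates a cap, then pick the best survivor."""
    eligible = [
        c for c in candidates
        if (budget is None or c.cost_usd <= budget)
        and (latency_ms is None or c.p50_latency_ms <= latency_ms)
        and (quality is None or c.quality >= quality)
        and (reliability is None or c.reliability >= reliability)
    ]
    if not eligible:
        raise NoEligibleModel("no model satisfies every cap")
    # ASSUMED ranking: highest quality first, then lowest cost.
    return max(eligible, key=lambda c: (c.quality, -c.cost_usd))


pool = [
    Candidate("model-a", 0.020, 900, 0.95, 0.9990),   # over budget
    Candidate("model-b", 0.008, 1200, 0.91, 0.9990),  # fits every cap
    Candidate("model-c", 0.003, 400, 0.80, 0.9995),   # below quality floor
]
chosen = route(pool, budget=0.012, latency_ms=1500, quality=0.90)
print(chosen.name)
```

The point of the sketch is the ordering of the two steps: caps remove candidates before any ranking happens, so a cheaper or faster model can never be picked at the expense of a cap.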
response.routing. Inspect it in dev; ignore it in prod; cite it to your customer when they ask why an agent did what it did.
Caps
Hard limits on every call. We refuse to violate any of them — if no model fits, we return 409 no_eligible_model instead of downgrading.
| Field | Type | Required | Description |
|---|---|---|---|
| budget | number | opt | Hard cap in USD per call. Inherits the agent default if omitted. |
| latency_ms | integer | opt | p50 ceiling in milliseconds. Measured against rolling 24h production traffic per model. |
| quality | number | opt | Minimum quality floor, 0.00–1.00. Defaults to the agent's per-task floor. |
| reliability | number | opt | Minimum 30-day success rate. Models below the floor are excluded. |
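A client can sanity-check caps against the table before sending them. The sketch below is one way to do that; `validate_caps` is a hypothetical helper, the "budget must be positive" and "reliability is a 0 to 1 fraction" rules are assumptions read off the examples on this page, and the server remains the authority on what is accepted.

```python
ALLOWED_CAPS = {"budget", "latency_ms", "quality", "reliability"}


def validate_caps(caps: dict) -> dict:
    """Raise ValueError for caps that cannot be valid; return caps otherwise."""
    unknown = set(caps) - ALLOWED_CAPS
    if unknown:
        raise ValueError(f"unknown cap(s): {sorted(unknown)}")
    if "budget" in caps and not caps["budget"] > 0:
        raise ValueError("budget must be a positive USD amount")
    if "latency_ms" in caps:
        value = caps["latency_ms"]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError("latency_ms must be a positive integer")
    for name in ("quality", "reliability"):
        if name in caps and not 0.0 <= caps[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.00 and 1.00")
    return caps


checked = validate_caps(
    {"budget": 0.012, "latency_ms": 1500, "quality": 0.90, "reliability": 0.9985}
)
print(checked)
```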
{
"model": "ainfera-inference",
"messages": [...],
"caps": {
"budget": 0.012,
"latency_ms": 1500,
"quality": 0.90,
"reliability": 0.9985
}
}
Streaming
Pass stream: true on either the OpenAI or Anthropic surface. Audit metadata arrives in the final chunk so you can keep streaming UX intact while still surfacing the inference id.
res = client.chat.completions.create(
    model="ainfera-inference",
    messages=[...],
    stream=True,
)
for chunk in res:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    elif chunk.audit:
        print("\naudit:", chunk.audit.id)  # final chunk
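If you would rather collect the text and the inference id than print as you go, the same loop can be wrapped in a function. This is a sketch under assumptions: `collect_stream` is a hypothetical helper, the chunk shape (`choices[0].delta.content`, `audit.id`) is copied from the loop above, and the defensive `getattr` calls are there because this page does not say whether every chunk carries both attributes. The fake chunks and the id `inf_demo` are test doubles, not real API output.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Return (full_text, audit_id) from an iterable of streamed chunks."""
    parts, audit_id = [], None
    for chunk in chunks:
        choices = getattr(chunk, "choices", None) or []
        content = choices[0].delta.content if choices else None
        if content:
            parts.append(content)
        audit = getattr(chunk, "audit", None)
        if audit is not None:
            audit_id = audit.id
    return "".join(parts), audit_id


def fake_chunk(content=None, audit_id=None):
    # Test double shaped like the chunks in the loop above.
    delta = SimpleNamespace(content=content)
    audit = SimpleNamespace(id=audit_id) if audit_id else None
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], audit=audit)


text, audit_id = collect_stream(
    [fake_chunk("Hel"), fake_chunk("lo"), fake_chunk(audit_id="inf_demo")]
)
print(text, audit_id)
```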
Audit
Every settled call is hashed and posted to the Ainfera audit chain. The chain is public — anyone with an inference id can verify the record without an API key.
Returns the decision hash, prompt hash, response hash, block number, and confirmation count. Re-hash the canonical payload locally and compare. Full walkthrough in how routing works → proof
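The "re-hash and compare" step might look like the sketch below. The canonicalization rule is an assumption made for illustration (UTF-8 JSON with sorted keys and no extra whitespace, hashed with SHA-256 and prefixed with `0x`); the actual canonical form is defined in the routing walkthrough, not here, and `canonical_hash` and `matches_record` are hypothetical names.

```python
import hashlib
import hmac
import json


def canonical_hash(payload: dict) -> str:
    # ASSUMPTION: canonical form is compact, key-sorted UTF-8 JSON hashed with SHA-256.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_record(payload: dict, recorded_hash: str) -> bool:
    """Compare a locally computed hash with the one read from the chain."""
    return hmac.compare_digest(canonical_hash(payload), recorded_hash.lower())


prompt = {"messages": [{"role": "user", "content": "hi"}]}
recorded = canonical_hash(prompt)  # stands in for the prompt hash from the record
print(matches_record(prompt, recorded))
```

Whatever the real canonical form is, the property that matters is determinism: the same payload must serialize to the same bytes regardless of key order, otherwise a local re-hash can never match.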
POST /chat/completions
OpenAI-compatible chat completion. The only new request field is caps. The only new response field is routing.
request
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | req | "ainfera-inference" for routed, or a specific model name (e.g. "claude-opus-4-7"). When pinned, routing is skipped but caps still apply. |
| messages | Message[] | req | OpenAI-format messages array. role ∈ {system, user, assistant, tool}. |
| caps | Caps | opt | See caps. Inherits agent defaults if omitted. |
| stream | boolean | opt | Server-sent events. Audit metadata in the final chunk. |
| temperature | number | opt | Sampling temperature, 0–2. Passed through to the selected model. |
| tools | Tool[] | opt | OpenAI tool-calls. Only models declaring tool support are eligible. |
response · extras
{
// ...standard OpenAI fields
"routing": {
"model": "claude-opus-4-7",
"candidates": [/* full candidate set */],
"caps_applied": { "budget": 0.012, "latency_ms": 1500 },
"policy_version": "<version>",
"cell": "reasoning-frontier/research/A"
},
"audit": {
"id": "inf_...",
"block": "block_height",
"hash": "0x..."
},
"cost": {
"direct": 0.0056,
"margin": 0.0005,
"billed": 0.0061
}
}
POST /embeddings
Same shape as /chat/completions, including "model": "ainfera-inference" for routed selection across the embeddings model class.
GET /models
Live leaderboard. Same data that powers the Models page — quality, cost, latency, reliability, refreshed continuously, sourced from the audit chain. No key required.
Templates
Published workflows you can run by id. Browse the gallery and authoring tools inside the workspace.
res = client.templates.run(
    "research-deep-dive@v4",
    input={"query": "federated vs centralized learning"},
    caps={"budget": 0.020, "latency_ms": 6000},
)
print(res.output)
print(res.workflow_id)  # wf_id
Webhooks
Subscribe to events. We POST a small payload with a hash you can verify against the chain — no payload bodies leave your account otherwise.
- inference.settled: every settled call (high volume; usually for analytics)
- inference.fallback: primary failed, fallback succeeded
- agent.cap_warning: agent hit 75% of a budget or latency cap
- agent.cap_exceeded: agent paused due to cap
- hitl.required: workflow paused, awaiting human
- incident.opened · incident.resolved
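On the receiving side, a handler usually dispatches on the event name. The sketch below shows one way to structure that; the payload field names (`type`, `agent`) are assumptions, since the payload schema is not given on this page, and the handlers only return strings so the example stays self-contained.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]
_handlers: Dict[str, Handler] = {}


def on(event_type: str):
    """Register a handler for one event name."""
    def register(fn: Handler) -> Handler:
        _handlers[event_type] = fn
        return fn
    return register


@on("agent.cap_warning")
def warn(event: dict) -> str:
    return f"warn:{event.get('agent')}"


@on("agent.cap_exceeded")
def page_oncall(event: dict) -> str:
    return f"page:{event.get('agent')}"


def dispatch(event: dict) -> str:
    # ASSUMPTION: the event name arrives in a "type" field.
    handler = _handlers.get(event.get("type", ""))
    # Unknown events are ignored so new event types do not break the receiver.
    return handler(event) if handler else "ignored"


print(dispatch({"type": "agent.cap_exceeded", "agent": "agent-1"}))
```

Because `inference.settled` is high volume, leaving it unregistered here means it falls through to `"ignored"` rather than doing work on every call.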
Error codes
Idiomatic HTTP. Common ones an agent will hit:
response.routing and response.audit.
Retry-After.
routing.candidates for cause.
Rate limits
Two layers: workspace burst (3,000 r/s default) and agent budget. The first is rare to hit; the second is your job. Both return 429 with a structured body.
For higher workspace bursts, email hello@ainfera.ai — there is no plan tier; we just raise the number.
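For the workspace-burst case, a client-side retry that honors `Retry-After` is usually enough. The sketch below assumes the 429 response carries a `Retry-After` header in seconds and falls back to exponential backoff when it does not; `call_with_retry` and the injected `send` callable are hypothetical. Retrying will not help a 429 caused by an exhausted agent budget, and since the structured body is not documented here, telling the two apart is left to the caller.

```python
import time


def call_with_retry(send, max_attempts=4, sleep=time.sleep):
    """send() -> (status, headers, body). Retry on 429, honoring Retry-After."""
    status, body = None, None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; return the last 429 to the caller
        try:
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            # Header missing or not a number of seconds: back off exponentially.
            delay = min(2 ** attempt, 30)
        sleep(delay)
    return status, body


# Demo with canned responses instead of real HTTP calls.
responses = iter([
    (429, {"Retry-After": "1"}, {}),
    (429, {}, {}),
    (200, {}, {"ok": True}),
])
slept = []
result = call_with_retry(lambda: next(responses), sleep=slept.append)
print(result, slept)
```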