Integrate in one line.

Everything you need to wire an autonomous agent into Ainfera. It reads top to bottom in about 12 minutes, or you can jump to the section you need.

Ainfera is an inference router for autonomous agents. You call our API the way you'd call OpenAI's. We pick the best model for each call within your caps, settle the bill, and post every decision to a public audit chain.

This page is the developer docs. For how calls are placed, see how routing works. For the agent-readable version of the same content, see /llms.txt.

Quickstart

If you want one snippet to copy, here it is:

python · 60s to first call

import os

from openai import OpenAI
client = OpenAI(base_url="https://api.ainfera.ai/v1",
                api_key=os.environ["AINFERA_KEY"])

res = client.chat.completions.create(
  model="ainfera-inference",
  messages=[{"role": "user", "content": "hi"}],
  extra_body={"caps": {"budget": 0.012, "latency_ms": 1500}},
)

Full walkthrough on the Quickstart page →

SDKs & compatibility

Ainfera is wire-compatible with the OpenAI and Anthropic SDKs. Point your existing client at api.ainfera.ai/v1 and you're routed. We also ship a small dedicated SDK for ergonomics; it adds sub-millisecond overhead.

SDK	Support	Source / install
openai-python	SDK ≥ 1.40 · drop-in	github.com/openai/openai-python
openai-node	SDK ≥ 4.30 · drop-in	github.com/openai/openai-node
anthropic-python	SDK ≥ 0.40 · drop-in	github.com/anthropics/anthropic-sdk-python
anthropic-typescript	SDK ≥ 0.30 · drop-in	github.com/anthropics/anthropic-sdk-typescript
ainfera/sdk	native · v1.4.0	ainfera.ai/quickstart
vercel/ai	provider · @ainfera/ai-provider	npm install @ainfera/ai-provider

Authentication

Pass an Ainfera key as a bearer token. Keys are scoped to one agent — caps and audit run on the agent the key belongs to.

http

Authorization: Bearer $AINFERA_KEY

Create keys on the agent detail page. Rotate at the cadence your secret store enforces — rotation never affects routing behavior. Workspace-admin keys exist for tooling.
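
For clients that do not go through an SDK, the bearer header can be set by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_chat_request` is hypothetical, and the request is built but not sent so the snippet runs offline.

```python
import json
import os
import urllib.request

API_BASE = "https://api.ainfera.ai/v1"  # base URL as given on this page


def build_chat_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions with bearer auth."""
    return urllib.request.Request(
        url=f"{API_BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    os.environ.get("AINFERA_KEY", "test-key"),  # placeholder key when unset
    {"model": "ainfera-inference",
     "messages": [{"role": "user", "content": "hi"}]},
)
# urllib.request.urlopen(req) would send it; omitted to keep the sketch offline.
```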

Data handling. Keys are scoped per agent. The audit chain stores content hashes (the decision, prompt, and response hashes plus routing metadata), never your prompt or response bodies. Provider calls are made over TLS; we don't retain payloads beyond what a routed request needs. Subprocessors are listed on the subprocessors page.

Routing

The whole product, in one paragraph. Pass model: "ainfera-inference" and we score every eligible model against your caps, pick the best, and call it. The decision is returned alongside the completion as a routing field, and posted as a hash to the audit chain.

If you want a specific model, pass it by name (e.g. model: "claude-opus-4-7"). Routing is skipped, caps still apply, audit still happens.

The candidate set, scores, exclusion reasons, and chosen model are all returned in response.routing. Inspect it in dev; ignore it in prod; cite it to your customer when they ask why an agent did what it did.
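
As a rough illustration of inspecting the routing decision in dev, the sketch below condenses a response that has already been parsed into a plain dict. It relies only on the `model`, `cell`, and `caps_applied` keys shown in the response example on this page; the helper name `summarize_routing` is hypothetical, and the shape of each candidate entry is not assumed.

```python
def summarize_routing(response: dict) -> str:
    """Return a one-line summary of the routing decision, if present."""
    routing = response.get("routing")
    if routing is None:
        # Defensive: this sketch does not assume the field is always present.
        return "no routing metadata"
    caps = routing.get("caps_applied", {})
    caps_text = ", ".join(f"{k}={v}" for k, v in sorted(caps.items())) or "none"
    return f"model={routing.get('model')} cell={routing.get('cell')} caps: {caps_text}"


example = {
    "routing": {
        "model": "claude-opus-4-7",
        "caps_applied": {"budget": 0.012, "latency_ms": 1500},
        "cell": "reasoning-frontier/research/A",
    }
}
print(summarize_routing(example))
```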

Caps

Hard limits on every call. We refuse to violate any of them — if no model fits, we return 409 no_eligible_model instead of downgrading.

Field	Type	Required	Description
budget	number	opt	Hard cap in USD per call. Inherits the agent default if omitted.
latency_ms	integer	opt	p50 ceiling in milliseconds. Measured against rolling 24h production traffic per model.
quality	number	opt	Minimum quality floor, 0.00–1.00. Defaults to the agent's per-task floor.
reliability	number	opt	Minimum 30-day success rate. Models below the floor are excluded.

json · request body fragment

{
  "model":    "ainfera-inference",
  "messages": [...],
  "caps": {
    "budget":      0.012,
    "latency_ms":  1500,
    "quality":     0.90,
    "reliability": 0.9985
  }
}
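
Since a malformed caps object is rejected with 400 invalid_caps, it can help to check it client-side first. This is a minimal sketch based on the field table above; the exact server-side rules are not documented here, so the positivity checks and the 0 to 1 range for `reliability` are assumptions, and `validate_caps` is a hypothetical helper.

```python
ALLOWED_CAPS = {"budget", "latency_ms", "quality", "reliability"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_caps(caps: dict) -> list:
    """Return a list of problems; an empty list means the caps look well-formed."""
    problems = [f"unknown field: {name}" for name in sorted(set(caps) - ALLOWED_CAPS)]

    budget = caps.get("budget")
    if budget is not None and not (_is_number(budget) and budget > 0):
        problems.append("budget must be a positive number (USD per call)")

    latency = caps.get("latency_ms")
    is_int = isinstance(latency, int) and not isinstance(latency, bool)
    if latency is not None and not (is_int and latency > 0):
        problems.append("latency_ms must be a positive integer")

    # Assumption: both floors are expressed as fractions between 0 and 1.
    for name in ("quality", "reliability"):
        value = caps.get(name)
        if value is not None and not (_is_number(value) and 0.0 <= value <= 1.0):
            problems.append(f"{name} must be a number between 0 and 1")
    return problems
```

An empty result does not guarantee the server will accept the request; it only catches the obvious shape errors before a round trip.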

Streaming

Pass stream: true on either the OpenAI or Anthropic surface. Audit metadata arrives in the final chunk so you can keep streaming UX intact while still surfacing the inference id.

python

res = client.chat.completions.create(
  model="ainfera-inference", messages=[...], stream=True,
)

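for chunk in res:
  # Content chunks carry text in delta.content.
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
  # The final chunk carries the audit metadata instead of content.
  elif getattr(chunk, "audit", None):
    print("\naudit:", chunk.audit.id)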

Audit

Every settled call is hashed and posted to the Ainfera audit chain. The chain is public — anyone with an inference id can verify the record without an API key.

GET https://audit.ainfera.ai/v1/{inference_id}

Returns the decision hash, prompt hash, response hash, block number, and confirmation count. Re-hash the canonical payload locally and compare. Full walkthrough in how routing works → proof
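
The re-hash-and-compare step might look like the sketch below. This page does not specify the hash algorithm or the canonical form, so SHA-256 over compact, key-sorted JSON with a `0x` prefix is purely an assumption for illustration; check the proof walkthrough for the actual scheme. Both function names are hypothetical.

```python
import hashlib
import json


def canonical_hash(payload: dict) -> str:
    """Hash a payload in an ASSUMED canonical form (sorted keys, compact JSON)."""
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    # Assumption: SHA-256, hex-encoded, with a 0x prefix.
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_record(payload: dict, recorded_hash: str) -> bool:
    """Compare a locally computed hash with the one read from the audit record."""
    return canonical_hash(payload) == recorded_hash.lower()


prompt = {"messages": [{"role": "user", "content": "hi"}]}
recorded = canonical_hash(prompt)  # stands in for the hash fetched from the chain
print(matches_record(prompt, recorded))
```

Key order does not affect the result because the keys are sorted before hashing; any change to the content does.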

POST /chat/completions

OpenAI-compatible chat completion. The only new request field is caps. The only new response field is routing.

POST https://api.ainfera.ai/v1/chat/completions

request

Field	Type	Required	Description
model	string	req	`"ainfera-inference"` for routed, or a specific model name (e.g. `"claude-opus-4-7"`). When pinned, routing is skipped but caps still apply.
messages	Message[]	req	OpenAI-format messages array. `role` ∈ {system, user, assistant, tool}.
caps	Caps	opt	See caps. Inherits agent defaults if omitted.
stream	boolean	opt	Server-sent events. Audit metadata in the final chunk.
temperature	number	opt	Sampling temperature, 0–2. Passed through to the selected model.
tools	Tool[]	opt	OpenAI tool-calls. Only models declaring tool support are eligible.

response · extras

json · fields beyond OpenAI

{
  // ...standard OpenAI fields
  "routing": {
    "model":         "claude-opus-4-7",
    "candidates":    [/* full candidate set */],
    "caps_applied":  { "budget": 0.012, "latency_ms": 1500 },
    "policy_version": "<version>",
    "cell":           "reasoning-frontier/research/A"
  },
  "audit": {
    "id":    "inf_...",
    "block": "block_height",
    "hash":  "0x..."
  },
  "cost": {
    "direct": 0.0056,
    "margin": 0.0005,
    "billed": 0.0061
  }
}
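
In the example above the `cost` figures add up: 0.0056 direct plus 0.0005 margin gives 0.0061 billed. Assuming that relation holds in general (it is inferred from the example, not stated as a rule), a reconciliation check could be sketched as follows. `Decimal` is used so the comparison is exact rather than subject to float rounding; `cost_is_consistent` is a hypothetical helper.

```python
from decimal import Decimal


def cost_is_consistent(cost: dict) -> bool:
    """Check that billed == direct + margin, using exact decimal arithmetic."""
    direct = Decimal(str(cost["direct"]))
    margin = Decimal(str(cost["margin"]))
    billed = Decimal(str(cost["billed"]))
    return direct + margin == billed


print(cost_is_consistent({"direct": 0.0056, "margin": 0.0005, "billed": 0.0061}))
```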

POST /embeddings

Same shape as /chat/completions, including "model": "ainfera-inference" for routed selection across the embeddings model class.

POST https://api.ainfera.ai/v1/embeddings

GET /models

Live leaderboard. Same data that powers the Models page — quality, cost, latency, reliability, refreshed continuously, sourced from the audit chain. No key required.

GET https://api.ainfera.ai/v1/models?task=research
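
A small sketch of building that URL with a properly encoded `task` filter. Only the `task` parameter appears on this page; the response shape is not documented here, so the fetch itself is left as a comment. `models_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

MODELS_ENDPOINT = "https://api.ainfera.ai/v1/models"


def models_url(task=None) -> str:
    """Return the leaderboard URL, optionally filtered by task."""
    if not task:
        return MODELS_ENDPOINT
    return f"{MODELS_ENDPOINT}?{urlencode({'task': task})}"


url = models_url("research")
# urllib.request.urlopen(url) would fetch it; no Authorization header is
# needed because the endpoint requires no key.
```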

Templates

Published workflows you can run by id. Browse the gallery and authoring tools inside the workspace.

python

res = client.templates.run(
  "research-deep-dive@v4",
  input={"query": "federated vs centralized learning"},
  caps={"budget": 0.020, "latency_ms": 6000},
)
print(res.output)
print(res.workflow_id)              # wf_id

Webhooks

Subscribe to events. We POST a small payload with a hash you can verify against the chain — no payload bodies leave your account otherwise.

inference.settled — every settled call (high volume; usually for analytics)
inference.fallback — primary failed, fallback succeeded
agent.cap_warning — agent hit 75% of a budget or latency cap
agent.cap_exceeded — agent paused due to cap
hitl.required — workflow paused, awaiting human
incident.opened · incident.resolved
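
One way a receiver might triage these events is sketched below. The payload schema is not shown on this page, so the `type` field carrying the event name is an assumption, and the three buckets (analytics, page, log) are illustrative choices rather than anything Ainfera prescribes.

```python
KNOWN_EVENTS = {
    "inference.settled", "inference.fallback",
    "agent.cap_warning", "agent.cap_exceeded",
    "hitl.required", "incident.opened", "incident.resolved",
}
HIGH_VOLUME = {"inference.settled"}
NEEDS_ATTENTION = {"agent.cap_exceeded", "hitl.required", "incident.opened"}


def route_event(event: dict) -> str:
    """Decide where a webhook event should go in a hypothetical receiver."""
    kind = event.get("type", "")  # assumed field name
    if kind in HIGH_VOLUME:
        return "analytics"
    if kind in NEEDS_ATTENTION:
        return "page"
    if kind in KNOWN_EVENTS:
        return "log"
    return "ignore"
```

Unknown event names fall through to "ignore" so that new event types added later do not break the receiver.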

Error codes

Idiomatic HTTP. Common ones an agent will hit:

Status	Code	Description
200	ok	Routed and settled. Inspect response.routing and response.audit.
400	invalid_caps	Caps object malformed or contains an unknown field.
401	invalid_key	Bearer token missing or rotated.
402	balance_insufficient	Agent wallet is below the per-call cost. Top up or raise the cap.
409	no_eligible_model	No model fits your caps. Returned instead of a quiet downgrade. Loosen a cap or accept the failure.
429	workspace_rate_limited	Workspace burst limit. Backoff returned in Retry-After.
5xx	upstream_failure	All candidates returned errors; fallback exhausted. Inspect routing.candidates for cause.
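
An agent could turn these statuses into a next step with something like the sketch below. The action labels are hypothetical, and treating 5xx as retryable is an assumption; the rest follows the descriptions above.

```python
def next_action(status: int, code: str = "") -> str:
    """Map an HTTP status (and optional error code) to a suggested next step."""
    if status == 200:
        return "ok"
    if status in (400, 401):
        return "fix_request"            # malformed caps or bad key: do not retry
    if status == 402:
        return "top_up_or_raise_cap"
    if status == 409 and code == "no_eligible_model":
        return "loosen_caps_or_fail"
    if status == 429:
        return "wait_for_retry_after"
    if 500 <= status < 600:
        return "retry_or_inspect_candidates"  # assumption: worth one retry
    return "unknown"
```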

Rate limits

Two layers: workspace burst (3,000 r/s default) and agent budget. The first is rare to hit; the second is your job. Both return 429 with a structured body.
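
For the workspace burst limit, a client can honor Retry-After and fall back to exponential backoff when the header is absent. A minimal sketch, assuming the header carries a number of seconds (the HTTP-date form is not handled) and that `headers` is a plain dict keyed exactly as `Retry-After`; the base and cap values are arbitrary.

```python
def retry_delay(headers: dict, attempt: int,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    raw = headers.get("Retry-After")
    if raw is not None:
        try:
            return max(0.0, float(raw))
        except ValueError:
            pass  # not a number of seconds; fall back to backoff below
    # Exponential backoff: base, 2*base, 4*base, ... limited to `cap`.
    return min(cap, base * (2 ** attempt))
```

This applies to the burst layer only; a 429 caused by the agent budget will not clear by waiting.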

For higher workspace bursts, email hello@ainfera.ai — there is no plan tier; we just raise the number.
