Ainfera

Integrate in one line.

Everything you need to wire an autonomous agent into Ainfera. Read it top to bottom in about 12 minutes, or jump to the section you need.

Ainfera is an inference router for autonomous agents. You call our API the way you'd call OpenAI's. We pick the best model for each call within your caps, settle the bill, and post every decision to a public audit chain.

This page is the developer docs. For how calls are placed, see how routing works. For the agent-readable version of the same content, see /llms.txt.

Quickstart

If you want one snippet to copy, here it is:

python · 60s to first call
import os

from openai import OpenAI

client = OpenAI(base_url="https://api.ainfera.ai/v1",
                api_key=os.environ["AINFERA_KEY"])

res = client.chat.completions.create(
  model="ainfera-inference",
  messages=[{"role": "user", "content": "hi"}],
  extra_body={"caps": {"budget": 0.012, "latency_ms": 1500}},
)

Full walkthrough on the Quickstart page →

SDKs & compatibility

Ainfera is wire-compatible with the OpenAI and Anthropic SDKs. Point your existing client at api.ainfera.ai/v1 and you're routed. We also ship a small dedicated SDK for ergonomics; it adds sub-millisecond overhead.

Authentication

Pass an Ainfera key as a bearer token. Keys are scoped to one agent — caps and audit run on the agent the key belongs to.

http
Authorization: Bearer $AINFERA_KEY

Create keys on the agent detail page. Rotate at the cadence your secret store enforces — rotation never affects routing behavior. Workspace-admin keys exist for tooling.
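As a minimal sketch of the bearer header outside any SDK, the following builds (but does not send) a raw request with Python's standard library. The `build_request` helper and the `"sk-placeholder"` fallback key are illustrative assumptions, not part of the Ainfera API.

```python
import json
import os
import urllib.request

API_URL = "https://api.ainfera.ai/v1/chat/completions"


def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build a POST request that carries the Ainfera key as a bearer token."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key so the sketch runs without a real secret.
key = os.environ.get("AINFERA_KEY", "sk-placeholder")
req = build_request(key, {
    "model": "ainfera-inference",
    "messages": [{"role": "user", "content": "hi"}],
})
print(req.get_method(), req.full_url)
# Send it with urllib.request.urlopen(req) once a real key is set.
```

Reading the key from the environment keeps it out of source control, which also makes rotation a matter of updating the secret store only.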

Data handling. Keys are scoped per agent. The audit chain stores only content hashes (decision, prompt, and response) plus routing metadata, never your prompt or response bodies. Provider calls are made over TLS; we don't retain payloads beyond what a routed request needs. Subprocessors are listed on the subprocessors page.

Routing

The whole product, in one paragraph. Pass model: "ainfera-inference" and we score every eligible model against your caps, pick the best, and call it. The decision is returned alongside the completion as a routing field, and posted as a hash to the audit chain.

If you want a specific model, pass it by name (e.g. model: "claude-opus-4-7"). Routing is skipped, caps still apply, audit still happens.

The candidate set, scores, exclusion reasons, and chosen model are all returned in response.routing. Inspect it in dev; ignore it in prod; cite it to your customer when they ask why an agent did what it did.
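For inspecting a decision in dev, a small helper along these lines may be enough. It is a sketch that works on the routing field parsed into a plain dict. The per-candidate key names (`model`, `score`, `excluded_reason`) are assumptions, since this page says scores and exclusion reasons are returned but does not spell out their keys; the sample data is hand-written, not real API output.

```python
def explain_routing(routing: dict) -> str:
    """Summarise a routing decision as a short human-readable string."""
    lines = [f"chosen: {routing['model']}"]
    for cand in routing.get("candidates", []):
        # ASSUMPTION: 'model', 'score' and 'excluded_reason' are guessed key names.
        reason = cand.get("excluded_reason")
        status = f"excluded ({reason})" if reason else f"score {cand.get('score')}"
        lines.append(f"  {cand.get('model')}: {status}")
    return "\n".join(lines)


# Hand-written sample in the assumed shape.
sample = {
    "model": "claude-opus-4-7",
    "candidates": [
        {"model": "claude-opus-4-7", "score": 0.93},
        {"model": "model-b", "excluded_reason": "latency_ms"},
    ],
}
print(explain_routing(sample))
```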

Caps

Hard limits on every call. We refuse to violate any of them — if no model fits, we return 409 no_eligible_model instead of downgrading.

Field        Type     Req  Description
budget       number   opt  Hard cap in USD per call. Inherits the agent default if omitted.
latency_ms   integer  opt  p50 ceiling in milliseconds. Measured against rolling 24h production traffic per model.
quality      number   opt  Minimum quality floor, 0.00–1.00. Defaults to the agent's per-task floor.
reliability  number   opt  Minimum 30-day success rate. Models below the floor are excluded.

json · request body fragment
{
  "model":    "ainfera-inference",
  "messages": [...],
  "caps": {
    "budget":      0.012,
    "latency_ms":  1500,
    "quality":     0.90,
    "reliability": 0.9985
  }
}
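Because a malformed caps object is rejected, it can be worth checking it client-side before sending. The sketch below validates only what the table above documents: the four field names, their types, and the 0.00–1.00 range for quality. The `validate_caps` helper is hypothetical, and the server may enforce further rules.

```python
# Documented cap fields and the Python types accepted for each.
ALLOWED_CAPS = {
    "budget": (int, float),
    "latency_ms": int,
    "quality": (int, float),
    "reliability": (int, float),
}


def validate_caps(caps: dict) -> list:
    """Return a list of problems; an empty list means the caps look well-formed."""
    problems = []
    for name, value in caps.items():
        expected = ALLOWED_CAPS.get(name)
        if expected is None:
            problems.append(f"unknown field: {name}")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{name}: wrong type {type(value).__name__}")
    quality = caps.get("quality")
    if type(quality) in (int, float) and not 0.0 <= quality <= 1.0:
        problems.append("quality: must be within 0.00-1.00")
    return problems


print(validate_caps({"budget": 0.012, "latency_ms": 1500,
                     "quality": 0.90, "reliability": 0.9985}))
print(validate_caps({"budget": 0.012, "speed": 1}))
```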

Streaming

Pass stream: true on either the OpenAI or Anthropic surface. Audit metadata arrives in the final chunk so you can keep streaming UX intact while still surfacing the inference id.

python
res = client.chat.completions.create(
  model="ainfera-inference", messages=[...], stream=True,
)

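for chunk in res:
  # Content chunks carry a delta; guard against chunks with no choices.
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
  # Audit metadata is only present on the final chunk.
  elif getattr(chunk, "audit", None):
    print("\naudit:", chunk.audit.id)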

Audit

Every settled call is hashed and posted to the Ainfera audit chain. The chain is public — anyone with an inference id can verify the record without an API key.

GET https://audit.ainfera.ai/v1/{inference_id}

Returns the decision hash, prompt hash, response hash, block number, and confirmation count. Re-hash the canonical payload locally and compare. Full walkthrough in how routing works → proof
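The re-hash-and-compare step could look roughly like the sketch below. Several things here are assumptions, because this page does not specify them: the hash algorithm (SHA-256 is used as a stand-in), the canonical serialisation (sorted-key compact JSON), and the `prompt_hash` key name. The proof walkthrough is the place to confirm the real canonical form. The demo builds a fake record locally instead of calling the network.

```python
import hashlib
import json

AUDIT_BASE = "https://audit.ainfera.ai/v1"


def audit_url(inference_id: str) -> str:
    """URL of the public audit record for one inference id."""
    return f"{AUDIT_BASE}/{inference_id}"


def canonical_bytes(payload) -> bytes:
    # ASSUMPTION: canonical form is compact JSON with sorted keys, UTF-8 encoded.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def matches(recorded_hash: str, payload) -> bool:
    """Compare a recorded hash against a locally recomputed one."""
    # ASSUMPTION: SHA-256, hex-encoded, with an optional 0x prefix.
    local = hashlib.sha256(canonical_bytes(payload)).hexdigest()
    return recorded_hash.lower().removeprefix("0x") == local


# Fake record built locally for illustration; not real chain data.
messages = [{"role": "user", "content": "hi"}]
fake_record = {
    "prompt_hash": "0x" + hashlib.sha256(canonical_bytes(messages)).hexdigest()
}

print(audit_url("inf_example"))
print(matches(fake_record["prompt_hash"], messages))
```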

POST /chat/completions

OpenAI-compatible chat completion. The only new request field is caps. The only new response field is routing.

POST https://api.ainfera.ai/v1/chat/completions

request

Field        Type       Req  Description
model        string     req  "ainfera-inference" for routed, or a specific model name (e.g. "claude-opus-4-7"). When pinned, routing is skipped but caps still apply.
messages     Message[]  req  OpenAI-format messages array. role is one of system, user, assistant, tool.
caps         Caps       opt  See caps. Inherits agent defaults if omitted.
stream       boolean    opt  Server-sent events. Audit metadata in the final chunk.
temperature  number     opt  Sampling temperature, 0–2. Passed through to the selected model.
tools        Tool[]     opt  OpenAI tool-calls. Only models declaring tool support are eligible.

response · extras

json · fields beyond OpenAI
{
  // ...standard OpenAI fields
  "routing": {
    "model":         "claude-opus-4-7",
    "candidates":    [/* full candidate set */],
    "caps_applied":  { "budget": 0.012, "latency_ms": 1500 },
    "policy_version": "<version>",
    "cell":           "reasoning-frontier/research/A"
  },
  "audit": {
    "id":    "inf_...",
    "block": "block_height",
    "hash":  "0x..."
  },
  "cost": {
    "direct": 0.0056,
    "margin": 0.0005,
    "billed": 0.0061
  }
}
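In the sample above, billed equals direct plus margin (0.0056 + 0.0005 = 0.0061) and stays under the applied budget of 0.012. A reconciliation check might be sketched as follows. Treating billed = direct + margin as a rule is an assumption drawn from that one sample, and both helper names are hypothetical. `Decimal` is used so that small float rounding does not produce false mismatches.

```python
from decimal import Decimal


def cost_is_consistent(cost: dict) -> bool:
    """True when billed equals direct plus margin, compared exactly."""
    # ASSUMPTION: billed is always the sum of direct and margin.
    direct, margin, billed = (
        Decimal(str(cost[key])) for key in ("direct", "margin", "billed")
    )
    return direct + margin == billed


def within_budget(cost: dict, routing: dict) -> bool:
    """True when the billed amount does not exceed the applied budget cap."""
    budget = Decimal(str(routing["caps_applied"]["budget"]))
    return Decimal(str(cost["billed"])) <= budget


# Values copied from the sample response above.
sample_cost = {"direct": 0.0056, "margin": 0.0005, "billed": 0.0061}
sample_routing = {"caps_applied": {"budget": 0.012, "latency_ms": 1500}}

print(cost_is_consistent(sample_cost))
print(within_budget(sample_cost, sample_routing))
```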

POST /embeddings

Same shape as /chat/completions, including "model": "ainfera-inference" for routed selection across the embeddings model class.

POST https://api.ainfera.ai/v1/embeddings

GET /models

Live leaderboard. Same data that powers the Models page — quality, cost, latency, reliability, refreshed continuously, sourced from the audit chain. No key required.

GET https://api.ainfera.ai/v1/models?task=research
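Since no key is required, the leaderboard can be read with a plain GET. The sketch below builds the URL and defines a fetch helper without calling it. The helper names are hypothetical, and because the response shape is not described on this page, the parsed JSON is returned as-is.

```python
import json
import urllib.parse
import urllib.request

MODELS_URL = "https://api.ainfera.ai/v1/models"


def models_url(task=None) -> str:
    """Leaderboard URL, optionally filtered by task."""
    if task is None:
        return MODELS_URL
    return f"{MODELS_URL}?{urllib.parse.urlencode({'task': task})}"


def fetch_models(task=None, timeout=10.0):
    """GET the leaderboard; no Authorization header is sent."""
    with urllib.request.urlopen(models_url(task), timeout=timeout) as resp:
        return json.load(resp)


print(models_url("research"))
# Call fetch_models("research") to perform the actual request.
```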

Templates

Published workflows you can run by id. Browse the gallery and authoring tools inside the workspace.

python
res = client.templates.run(
  "research-deep-dive@v4",
  input={"query": "federated vs centralized learning"},
  caps={"budget": 0.020, "latency_ms": 6000},
)
print(res.output)
print(res.workflow_id)              # wf_id

Webhooks

Subscribe to events. We POST a small payload with a hash you can verify against the chain — no payload bodies leave your account otherwise.

  • inference.settled — every settled call (high volume; usually for analytics)
  • inference.fallback — primary failed, fallback succeeded
  • agent.cap_warning — agent hit 75% of a budget or latency cap
  • agent.cap_exceeded — agent paused due to cap
  • hitl.required — workflow paused, awaiting human
  • incident.opened · incident.resolved
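On the receiving side, one way to handle the events above is a small dispatcher keyed by event name. This is a sketch only: the payload key `event` is an assumption (the payload schema is not given on this page), and the handler return values are placeholders for whatever your system does.

```python
from typing import Callable, Dict

HANDLERS: Dict[str, Callable[[dict], str]] = {}


def on(event_name: str):
    """Decorator that registers a handler for one event name."""
    def register(fn):
        HANDLERS[event_name] = fn
        return fn
    return register


@on("agent.cap_warning")
def warn_owner(payload: dict) -> str:
    return "notify owner"


@on("hitl.required")
def page_reviewer(payload: dict) -> str:
    return "page reviewer"


def dispatch(payload: dict) -> str:
    """Route a webhook payload to its handler; unknown events are ignored."""
    # ASSUMPTION: the event name arrives under the 'event' key.
    handler = HANDLERS.get(payload.get("event", ""))
    return handler(payload) if handler else "ignored"


print(dispatch({"event": "hitl.required"}))
print(dispatch({"event": "inference.settled"}))
```

Ignoring unregistered events keeps the high-volume inference.settled stream from raising errors when only a few event types matter to you.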

Error codes

Idiomatic HTTP. Common ones an agent will hit:

Status  Code                    Meaning
200     ok                      Routed and settled. Inspect response.routing and response.audit.
400     invalid_caps            Caps object malformed or contains an unknown field.
401     invalid_key             Bearer token missing or rotated.
402     balance_insufficient    Agent wallet is below the per-call cost. Top up or raise the cap.
409     no_eligible_model       No model fits your caps. Returned instead of a quiet downgrade. Loosen a cap or accept the failure.
429     workspace_rate_limited  Workspace burst limit. Backoff returned in Retry-After.
5xx     upstream_failure        All candidates returned errors; fallback exhausted. Inspect routing.candidates for cause.
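An agent can turn these statuses into a next step with a small lookup. The sketch below is one possible policy, not a prescribed one: the action names are illustrative, and retrying 5xx responses after a fixed wait is an assumption about what the caller wants.

```python
from typing import Mapping, Optional, Tuple

# Non-retryable statuses from the table above, mapped to illustrative actions.
ACTIONS = {
    200: "done",
    400: "fix_request",
    401: "fix_key",
    402: "top_up",
    409: "loosen_caps",
}


def next_action(status: int,
                headers: Optional[Mapping[str, str]] = None,
                default_wait: float = 1.0) -> Tuple[str, float]:
    """Return (action, seconds_to_wait) for an HTTP status."""
    if status == 429:
        lowered = {k.lower(): v for k, v in (headers or {}).items()}
        try:
            wait = float(lowered.get("retry-after", default_wait))
        except ValueError:
            # Retry-After may also be an HTTP date; fall back to the default.
            wait = default_wait
        return ("retry", wait)
    if 500 <= status < 600:
        # ASSUMPTION: the caller is willing to retry upstream failures.
        return ("retry", default_wait)
    return (ACTIONS.get(status, "fail"), 0.0)


print(next_action(429, {"Retry-After": "2"}))
print(next_action(409))
```

Note that 409 is deliberately not retried: the same caps will produce the same refusal until one of them is loosened.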

Rate limits

Two layers: workspace burst (3,000 requests per second by default) and agent budget. The first is rare to hit; the second is your job. Both return 429 with a structured body.

For higher workspace bursts, email hello@ainfera.ai — there is no plan tier; we just raise the number.
