Ainfera
How routing works

Pick the model by the result, not the reputation.

Every agent call is hard to place — capability, cost and latency trade off differently each time. Ainfera scores the candidates against the task and routes to the one most likely to finish it. Here's what goes into that, and the proof it leaves behind.

01 · Signals we weigh

Four inputs decide whether a call finishes.

These are the signals every candidate is scored on. How we weigh them is the part that compounds with traffic — so the weights stay ours — but the inputs are no secret.

Task type

What the call is

A drafting call and a tool-use call don't want the same model. We read the shape of the request first.

Cost

What it costs

Live per-token price for each candidate, against the ceiling you set.

Latency

How fast it answers

Measured on rolling production traffic, not vendor-published numbers.
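
A minimal sketch of what a rolling measurement could look like, assuming a fixed-size window of the most recent call latencies per candidate, summarised by the median. The class name, the window size and the choice of median are illustrative assumptions, not a description of Ainfera's internals.

```python
from collections import deque
from statistics import median
from typing import Optional


class RollingLatency:
    """Keeps the last `window` observed latencies (hypothetical helper)."""

    def __init__(self, window: int = 100) -> None:
        # deque(maxlen=...) discards the oldest sample once the window is full.
        self.samples = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def current(self) -> Optional[float]:
        # No traffic yet means no measurement; the caller decides what to do.
        return median(self.samples) if self.samples else None


tracker = RollingLatency(window=3)
for observed in (0.9, 1.1, 4.0, 1.0):
    tracker.record(observed)

# The window now holds 1.1, 4.0 and 1.0; the 0.9 sample has aged out.
print(tracker.current())  # 1.1
```

A median keeps one slow outlier (the 4.0 above) from dominating the figure; a percentile such as p95 would be the natural swap if tail latency matters more.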

Availability

Whether it's healthy now

A provider that's erroring or rate-limiting this minute drops out, and comes back when it recovers.
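
One way to picture the drop-out and return, as a sketch only: a provider is skipped for a cooldown period after its last failure. The cooldown rule, the 60-second value and the `ProviderHealth` name are assumptions for illustration; the text above says only that a provider comes back when it recovers, which could equally be driven by health probes.

```python
import time
from typing import Dict, Optional


class ProviderHealth:
    """Tracks the most recent failure per provider (hypothetical helper)."""

    def __init__(self, cooldown_seconds: float = 60.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self.last_failure: Dict[str, float] = {}

    def record_failure(self, provider: str, now: Optional[float] = None) -> None:
        # `now` can be injected for testing; otherwise use a monotonic clock.
        self.last_failure[provider] = time.monotonic() if now is None else now

    def is_eligible(self, provider: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        failed_at = self.last_failure.get(provider)
        # Eligible if it never failed, or the cooldown has fully elapsed.
        return failed_at is None or now - failed_at >= self.cooldown_seconds


health = ProviderHealth(cooldown_seconds=60.0)
health.record_failure("provider-a", now=1000.0)

print(health.is_eligible("provider-a", now=1030.0))  # False: still cooling down
print(health.is_eligible("provider-a", now=1061.0))  # True: cooldown elapsed
print(health.is_eligible("provider-b", now=1030.0))  # True: never failed
```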

  1. GPT-OSS 120B (Novita) · 352 tok/s
  2. GPT-OSS 20B (Novita) · 257 tok/s
  3. Qwen3.7 Max (Novita) · 184 tok/s
  4. Minimax M2.1 · 175 tok/s
  5. Qwen3-Next-80B-A3B-Instruct · 164 tok/s
  6. Qwen3.5-35B-A3B · 161 tok/s
  7. MiniMax M2.5 · 158 tok/s
  8. qwen/qwen3-vl-8b-instruct · 145 tok/s
  9. Qwen3.5-122B-A10B · 144 tok/s
  10. Qwen3.6-35B-A3B · 140 tok/s
  11. Qwen3-VL-30B-A3B-Instruct · 121 tok/s
  12. Qwen3 Coder 30b A3B Instruct · 110 tok/s
Reference output speed (Artificial Analysis). We score live per-call latency on top of this — measured on production traffic, not published numbers.

Intelligence: Artificial Analysis · artificialanalysis.ai
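
To make the four signals concrete, here is a minimal scoring sketch. The weights, the linear formula, the `Candidate` fields and the sample numbers are all hypothetical (the real weights are not published); it shows only how task fit, cost, latency and availability could combine into one pick.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    task_fit: float    # 0..1, how well the model suits this task type
    price: float       # per-token price, in the same unit as the ceiling
    latency_s: float   # recently measured latency, in seconds
    healthy: bool      # False while the provider is erroring or rate-limiting


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"task_fit": 1.0, "cost": 0.3, "latency": 0.2}


def score(c: Candidate, price_ceiling: float, latency_target_s: float) -> float:
    # Cost and latency are normalised against the caller's caps so the
    # three terms sit on comparable scales.
    return (
        WEIGHTS["task_fit"] * c.task_fit
        - WEIGHTS["cost"] * (c.price / price_ceiling)
        - WEIGHTS["latency"] * (c.latency_s / latency_target_s)
    )


def route(candidates: List[Candidate], price_ceiling: float,
          latency_target_s: float) -> Optional[Candidate]:
    # Unhealthy providers and over-ceiling prices never reach scoring.
    eligible = [c for c in candidates if c.healthy and c.price <= price_ceiling]
    if not eligible:
        return None
    return max(eligible, key=lambda c: score(c, price_ceiling, latency_target_s))


candidates = [
    Candidate("model-a", task_fit=0.90, price=4.0, latency_s=2.0, healthy=True),
    Candidate("model-b", task_fit=0.80, price=1.0, latency_s=1.0, healthy=True),
    Candidate("model-c", task_fit=0.95, price=0.5, latency_s=0.5, healthy=False),
]
best = route(candidates, price_ceiling=5.0, latency_target_s=2.0)
print(best.name)  # model-b
```

In this made-up example model-c would score highest but is unhealthy, and model-b beats model-a because its lower price and latency outweigh a slightly weaker task fit.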

02 · Outcome

We route to the model most likely to finish the task.

Not the biggest name, not a model you pinned six months ago and forgot. The pick is made per call and changes as price, speed and health change — so the cheapest model that still clears the bar is the one that runs.

Figure: Intelligence Index (vertical axis, ticks 20 to 60) against output speed in tokens / sec (horizontal axis, ticks 0 to 320), one point per model. Each entry below reads model · provider · index · speed.

Preferred core
  Qwen3.7 Max (Novita) · Alibaba (Qwen) · index 57 · 184 tok/s
  MiniMax-M3 · MiniMax · index 55 · 47 tok/s
  DeepSeek V4 Pro (Together) · DeepSeek · index 52 · 57 tok/s
  GLM 5.1 (Novita) · Z.ai (GLM) · index 51 · 71 tok/s
  Qwen3.5 397B A17B (DeepInfra) · Alibaba (Qwen) · index 45 · 52 tok/s
  Qwen3.5 397B A17B (Together) · Alibaba (Qwen) · index 45 · 52 tok/s
  GPT-OSS 120B (Novita) · OpenAI · index 33 · 352 tok/s
  GPT-OSS 20B (Novita) · OpenAI · index 24 · 257 tok/s

Coverage
  GLM-5 · Z.ai (GLM) · index 50 · 75 tok/s
  MiniMax M2.7 (Novita) · MiniMax · index 50 · 47 tok/s
  Qwen3.6 Plus · Alibaba (Qwen) · index 50 · 53 tok/s
  Deepseek V4 Flash · DeepSeek · index 46 · 97 tok/s
  Qwen3.6-27B · Alibaba (Qwen) · index 46 · 57 tok/s
  Qwen3.6-35B-A3B · Alibaba (Qwen) · index 44 · 140 tok/s
  GLM-4.7 · Z.ai (GLM) · index 42 · 104 tok/s
  MiniMax M2.5 · MiniMax · index 42 · 158 tok/s
  Qwen3.5-122B-A10B · Alibaba (Qwen) · index 42 · 144 tok/s
  Qwen3.5-27B · Alibaba (Qwen) · index 42 · 84 tok/s
  Minimax M2.1 · MiniMax · index 39 · 175 tok/s
  Qwen3.5-35B-A3B · Alibaba (Qwen) · index 37 · 161 tok/s
  Qwen3.5-9B · Alibaba (Qwen) · index 32 · 65 tok/s
  Qwen3-Max · Alibaba (Qwen) · index 31 · 54 tok/s
  GLM-4.6 · Z.ai (GLM) · index 30 · 51 tok/s
  GLM-4.7-Flash · Z.ai (GLM) · index 30 · 79 tok/s
  Qwen3 Coder Next · Alibaba (Qwen) · index 28 · 82 tok/s
  GLM-4.5 · Z.ai (GLM) · index 26 · 50 tok/s
  Qwen3 235B A22B Instruct 2507 · Alibaba (Qwen) · index 25 · 63 tok/s
  Qwen3 Coder 480B A35B Instruct · Alibaba (Qwen) · index 25 · 66 tok/s
  zai-org/glm-4.5-air · Z.ai (GLM) · index 23 · 75 tok/s
  Qwen3-VL-235B-A22B-Instruct · Alibaba (Qwen) · index 21 · 53 tok/s
  Qwen QwQ-32B · Alibaba (Qwen) · index 20 · 31 tok/s
  Qwen3 Coder 30b A3B Instruct · Alibaba (Qwen) · index 20 · 110 tok/s
  Qwen3-Next-80B-A3B-Instruct · Alibaba (Qwen) · index 20 · 164 tok/s
  GLM 4.6V · Z.ai (GLM) · index 17 · 65 tok/s
  Qwen3-VL-32B-Instruct · Alibaba (Qwen) · index 17 · 74 tok/s
  DeepSeek R1 Distill LLama 70B · DeepSeek · index 16 · 43 tok/s
  Qwen3-VL-30B-A3B-Instruct · Alibaba (Qwen) · index 16 · 121 tok/s
  qwen/qwen3-vl-8b-instruct · Alibaba (Qwen) · index 14 · 145 tok/s
  GLM 4.5V · Z.ai (GLM) · index 13 · 43 tok/s
  Qwen3 Omni 30B A3B Instruct · Alibaba (Qwen) · index 11 · 108 tok/s

Faster isn't smarter: we pick the point that finishes the task inside your caps.
Speed is the Artificial Analysis reference; live per-call latency is scored on top of it.

Intelligence + speed: Artificial Analysis · artificialanalysis.ai
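
A small worked example of "faster isn't smarter", using six points taken from the figure data in this section. The selection rule (highest Intelligence Index among models that meet a minimum reference speed) and the function name are illustrative assumptions; real routing also weighs cost, live latency and health.

```python
# (model, Intelligence Index, reference output speed in tok/s)
POINTS = [
    ("Qwen3.7 Max (Novita)", 57, 184),
    ("MiniMax-M3", 55, 47),
    ("DeepSeek V4 Pro (Together)", 52, 57),
    ("Qwen3.6-35B-A3B", 44, 140),
    ("GPT-OSS 120B (Novita)", 33, 352),
    ("GPT-OSS 20B (Novita)", 24, 257),
]


def smartest_at_speed(points, min_tok_s):
    """Highest-index model whose reference speed meets the floor, or None."""
    fast_enough = [p for p in points if p[2] >= min_tok_s]
    return max(fast_enough, key=lambda p: p[1], default=None)


print(smartest_at_speed(POINTS, 100))  # ('Qwen3.7 Max (Novita)', 57, 184)
print(smartest_at_speed(POINTS, 200))  # ('GPT-OSS 120B (Novita)', 33, 352)
print(smartest_at_speed(POINTS, 400))  # None
```

Raising the speed floor from 100 to 200 tok/s moves the pick from index 57 to index 33, which is the trade-off the figure plots.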

03 · Your controls

You set the box. We pick the model inside it.

Routing is yours to bound. Three controls, settable per agent or per task type.

Caps

Set the box

Per-call cost ceilings and latency targets, per agent or per task type. If nothing fits, we tell you — we never quietly downgrade.
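
A sketch of how caps with an explicit "nothing fits" outcome could behave. The `Caps` fields, the precedence rule (a task-type cap overrides the agent-wide cap), the exception name and the sample figures are hypothetical; the only behaviour taken from the text is that an unsatisfiable cap is reported rather than worked around.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Caps:
    max_cost_per_call: float  # hypothetical field, in currency units
    max_latency_s: float      # hypothetical field, in seconds


class NoModelFitsCaps(Exception):
    """Raised instead of silently picking something outside the caps."""


def resolve_caps(agent_caps: Dict[str, Caps], task_caps: Dict[str, Caps],
                 agent: str, task_type: str) -> Caps:
    # Assumed precedence: a cap set for the task type wins over the agent's.
    return task_caps.get(task_type, agent_caps[agent])


def pick_within_caps(estimates: List[Tuple[str, float, float]], caps: Caps) -> str:
    # estimates: (model, estimated cost per call, estimated latency in seconds)
    fitting = [e for e in estimates
               if e[1] <= caps.max_cost_per_call and e[2] <= caps.max_latency_s]
    if not fitting:
        raise NoModelFitsCaps(f"no candidate within {caps}")
    return min(fitting, key=lambda e: e[1])[0]  # cheapest candidate that fits


agent_caps = {"support-bot": Caps(max_cost_per_call=0.02, max_latency_s=5.0)}
task_caps = {"tool_use": Caps(max_cost_per_call=0.05, max_latency_s=3.0)}
estimates = [("model-a", 0.03, 2.0), ("model-b", 0.01, 6.0)]

caps = resolve_caps(agent_caps, task_caps, "support-bot", "tool_use")
print(pick_within_caps(estimates, caps))  # model-a

try:
    tight = resolve_caps(agent_caps, task_caps, "support-bot", "drafting")
    pick_within_caps(estimates, tight)
except NoModelFitsCaps as err:
    print("rejected:", err)
```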

Pins

Force a model

Pin a specific model or provider when you need it, and keep routing everywhere else.

Fallbacks

Stay up

On a 429, 5xx, timeout or refusal we retry the next eligible candidate inside your caps — logged and audited like any other call.
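
The fallback behaviour can be sketched as a loop over candidates that are already known to fit the caps. The `Response` type, the `refused` flag, the `send` callable and the audit record shape are hypothetical stand-ins, not Ainfera's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Response:
    status: int
    text: str = ""
    refused: bool = False  # hypothetical flag for a model refusal


def should_fall_back(resp: Response) -> bool:
    return resp.status == 429 or 500 <= resp.status <= 599 or resp.refused


def call_with_fallbacks(candidates: List[str],
                        send: Callable[[str], Response],
                        audit: List[dict]) -> Optional[Response]:
    # `candidates` is assumed to be pre-filtered to the caller's caps
    # and ordered by preference.
    for model in candidates:
        try:
            resp = send(model)
        except TimeoutError:
            audit.append({"model": model, "outcome": "timeout"})
            continue
        if should_fall_back(resp):
            outcome = "refusal" if resp.refused else f"http {resp.status}"
            audit.append({"model": model, "outcome": outcome})
            continue
        audit.append({"model": model, "outcome": "ok"})
        return resp
    return None  # every eligible candidate failed


def fake_send(model: str) -> Response:
    # Stand-in transport: the first model rate-limits, the second times out.
    if model == "model-a":
        return Response(status=429)
    if model == "model-b":
        raise TimeoutError
    return Response(status=200, text="done")


log: List[dict] = []
result = call_with_fallbacks(["model-a", "model-b", "model-c"], fake_send, log)
print(result.text)                       # done
print([e["outcome"] for e in log])       # ['http 429', 'timeout', 'ok']
```

Every attempt, failed or successful, lands in the audit list, mirroring the statement that fallbacks are logged like any other call.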

04 · Proof

Every decision is signed, on a public chain.

No black box and no dashboard claim. Every routed call is hashed, Ed25519-signed, and appended to a public, append-only chain. Verify any entry with a single unauthenticated request: no account, no key.

Trace · live audit

time      hash · agent            event · model     seq
04:56:52  0xf940…776c · tulkas    material fetched  3,836
04:56:52  0x3173…da86 · varda     material fetched  4,176
04:56:52  0x3bbf…3d4f · vaire     material fetched  81
04:56:52  0x235f…be48 · ulmo      material fetched  79
04:56:52  0xd298…45c1 · yavanna   material fetched  499
04:56:52  0xa3e2…b25f · namo      material fetched  610

verify — no key required
# the public chain is keyless
curl https://api.ainfera.ai/v1/audit/public

# → each entry: the routed model, provider,
#   sequence, block height and the Ed25519
#   signature. Re-hash it yourself to verify.
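
A minimal sketch of what re-hashing an entry could involve. Everything about the entry layout here is an assumption: the field names (`payload`, `hash`, `seq`), SHA-256 as the digest, and sorted-key JSON as the canonical form are placeholders, since the page does not specify them. The sample chain is built locally rather than fetched.

```python
import hashlib
import json
from typing import Dict, List


def entry_digest(payload: Dict) -> str:
    # Assumption: SHA-256 over JSON with sorted keys and no extra whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_entries(entries: List[Dict]) -> bool:
    previous_seq = None
    for entry in entries:
        if entry_digest(entry["payload"]) != entry["hash"]:
            return False  # payload was altered after it was hashed
        seq = entry["payload"]["seq"]
        if previous_seq is not None and seq != previous_seq + 1:
            return False  # an entry is missing or out of order
        previous_seq = seq
    return True


def make_entry(seq: int, model: str, provider: str) -> Dict:
    # Builds a locally hashed sample entry for the demonstration below.
    payload = {"seq": seq, "model": model, "provider": provider}
    return {"payload": payload, "hash": entry_digest(payload)}


chain = [make_entry(1, "model-a", "provider-x"),
         make_entry(2, "model-b", "provider-y")]
print(verify_entries(chain))  # True

chain[1]["payload"]["model"] = "model-z"  # tamper with one field
print(verify_entries(chain))  # False
```

Checking the Ed25519 signature itself needs the signer's public key and a signature library, neither of which is covered on this page, so that step is left out of the sketch.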

Stop picking models. Start finishing tasks.

One endpoint, every provider, every decision on chain.

routing · active | block #12,671 | models · 246 | audit · on-chain

ainfera · the inference of ai agents