Benchmarks

Provider performance, rate limits, local GPU guidance, and token-cap recommendations for production swarms.

Source file: BENCHMARKS.md

Benchmarks & Performance

Performance data for Network-AI deployments. Your swarm is only as fast as the backend it calls — this page helps you choose the right setup.

BlackboardValidator Throughput

Layer 1 validation (rule-based, zero LLM calls) measured on Node.js 20, Apple M2, single-thread:

| Input size | Ops/sec | Latency |
|---|---|---|
| Small entry (~100 chars) | ~1,000,000 | < 1 µs |
| Medium entry (~1 KB) | ~500,000 | ~2 µs |
| Large entry (~10 KB) | ~159,000 | ~6 µs |

Layer 2 (QualityGateAgent) adds LLM latency and is async — intended for high-value writes, not every write.
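The numbers above come from a synchronous micro-benchmark. A minimal sketch of how such a measurement works, using a stand-in rule check (`validateEntry` and its rules are hypothetical here, not BlackboardValidator's real API):

```typescript
// Stand-in for a Layer 1 rule check: synchronous, zero LLM calls.
function validateEntry(entry: string): boolean {
  return entry.length > 0 && entry.length <= 10_240;
}

// Tight-loop throughput harness: returns operations per second.
function benchmark(payload: string, iterations = 100_000): number {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) validateEntry(payload);
  const elapsedNs = Number(process.hrtime.bigint() - start);
  return iterations / (elapsedNs / 1e9);
}

console.log(`~${Math.round(benchmark('x'.repeat(100)))} ops/sec`);
```

Because the check is pure CPU work, throughput scales linearly with input size, which matches the table above.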


Cloud Provider Performance

Not all cloud APIs perform the same. Model size, inference infrastructure, and tier all affect how fast each agent gets a response — and that directly multiplies across every agent in your swarm.

| Provider / Model | Avg response (5-agent swarm) | RPM limit (free/tier-1) | Notes |
|---|---|---|---|
| OpenAI gpt-5.2 | 6–10s per call | 3–6 RPM | Flagship model, high latency, strict RPM |
| OpenAI gpt-4o-mini | 2–4s per call | 500 RPM | Fast, cheap, good for reviewer agents |
| OpenAI gpt-4o | 4–7s per call | 60–500 RPM | Balanced quality/speed |
| Anthropic Claude 3.5 Haiku | 2–3s per call | 50 RPM | Fastest Claude, great for parallel agents |
| Anthropic Claude 3.7 Sonnet | 4–8s per call | 50 RPM | Stronger reasoning, higher latency |
| Google Gemini 2.0 Flash | 1–3s per call | 15 RPM (free) | Very fast inference, low RPM on free tier |
| Groq (Llama 3.3 70B) | 0.5–2s per call | 30 RPM | Fastest cloud inference available |
| Together AI / Fireworks | 1–3s per call | Varies by plan | Good for parallel workloads |

Key insight: A 5-agent swarm using gpt-4o-mini at 500 RPM can fire all 5 agents truly in parallel and finish in ~4s total. The same swarm on gpt-5.2 at 6 RPM must go sequential and takes 60s. The model tier matters more than the orchestration framework.
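The arithmetic behind that insight can be sketched as a back-of-envelope model (an assumption for illustration, not Network-AI's actual scheduler): with RPM headroom all agents fire in parallel; otherwise calls are paced one per 60/RPM seconds.

```typescript
// Estimated wall-clock seconds for one swarm pass.
function swarmSeconds(agents: number, rpm: number, avgCallSec: number): number {
  if (rpm >= agents * 2) return avgCallSec;        // headroom: fully parallel
  return (agents - 1) * (60 / rpm) + avgCallSec;   // paced, near-sequential
}

console.log(swarmSeconds(5, 500, 4)); // → 4   (gpt-4o-mini tier, parallel)
console.log(swarmSeconds(5, 6, 8));   // → 48  (flagship tier, paced)
```

Even this crude model reproduces the order-of-magnitude gap: the RPM budget, not the orchestration layer, dominates total latency.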

Choosing a Model for Swarm Agents

  • Speed over depth (many agents, real-time) → gpt-4o-mini, claude-3.5-haiku, gemini-2.0-flash, groq/llama-3.3-70b
  • Depth over speed (few agents, high-stakes) → gpt-4o, claude-3.7-sonnet
  • Free / no-cost testing → Groq free tier, Gemini free tier, or Ollama locally
  • Production with budget → multiple keys across providers, route agents to different models
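The last bullet (routing agents to different models) can be as simple as a role-to-model lookup. A minimal sketch, with hypothetical role names and the model IDs from the table above:

```typescript
// Hypothetical routing table: match each agent role to a latency/quality tier.
const MODEL_FOR_ROLE: Record<string, string> = {
  reviewer:    'gpt-4o-mini',       // many parallel agents, speed over depth
  coordinator: 'claude-3-7-sonnet', // few calls, depth over speed
  classifier:  'gemini-2.0-flash',  // real-time, cheap
};

function modelFor(role: string): string {
  return MODEL_FOR_ROLE[role] ?? 'gpt-4o-mini'; // fast default for unknown roles
}
```

Spreading roles across providers also spreads the RPM budget, which is what makes the "multiple keys" pattern below work.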

Rate Limit Patterns

When you run a 5-agent swarm sharing one API key and hit the RPM ceiling, the API silently returns empty responses — not a 429 error, just blank content. Network-AI's swarm demos handle this automatically with sequential dispatch and adaptive header-based pacing (reads x-ratelimit-reset-requests to wait exactly as long as needed).
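The header-based pacing described above can be sketched as follows (an assumption about the demo's internals: it parses OpenAI-style `x-ratelimit-*` headers, where reset values look like `820ms` or `2s`):

```typescript
// Convert a reset header value like "820ms", "2s", or "1m" to milliseconds.
function parseResetMs(reset: string): number {
  const m = reset.match(/^([\d.]+)(ms|s|m)$/);
  if (!m) return 0;
  const n = parseFloat(m[1]);
  return m[2] === 'ms' ? n : m[2] === 's' ? n * 1000 : n * 60_000;
}

// Wait only when the request budget is exhausted, and only as long as needed.
async function paceIfNeeded(headers: Headers): Promise<void> {
  const remaining = Number(headers.get('x-ratelimit-remaining-requests') ?? '1');
  if (remaining > 0) return;
  const waitMs = parseResetMs(headers.get('x-ratelimit-reset-requests') ?? '0s');
  await new Promise((resolve) => setTimeout(resolve, waitMs));
}
```

Header-based pacing beats fixed sleeps because the wait matches the provider's actual window rather than a worst-case guess.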

| You have | What to expect |
|---|---|
| One cloud API key | Sequential dispatch, 40–70s per 5-agent swarm (handled automatically) |
| Multiple cloud keys | Near-parallel, 10–15s (one key per adapter instance) |
| Local GPU (Ollama, vLLM) | True parallel, 5–20s depending on hardware |
| Home GPU + cloud mix | Local agents never block; cloud agents rate-paced independently |

Multiple Keys = True Parallel

```typescript
import OpenAI from 'openai';
import { CustomAdapter, AdapterRegistry } from 'network-ai';

const registry = new AdapterRegistry();

for (const reviewer of REVIEWERS) {
  const adapter = new CustomAdapter();
  const client  = new OpenAI({ apiKey: process.env[`OPENAI_KEY_${reviewer.id.toUpperCase()}`] });

  adapter.registerHandler(reviewer.id, async (payload) => {
    const resp = await client.chat.completions.create({ /* ... */ });
    return { findings: extractContent(resp) };
  });

  registry.register(reviewer.id, adapter);
}

// All 5 dispatch in parallel via Promise.all — ~8–12s instead of ~60s
```

Local GPU = Zero Rate Limits

```typescript
import OpenAI from 'openai';

const localClient = new OpenAI({
  apiKey : 'not-needed',
  baseURL: 'http://localhost:11434/v1',   // Ollama, vLLM, llama.cpp
});

adapter.registerHandler('reviewer', async (payload) => {
  const resp = await localClient.chat.completions.create({
    model   : 'llama3.2',
    messages: [/* ... */],
  });
  return { findings: extractContent(resp) };
});
```

Cloud GPU Instances (Self-Hosted)

Running your own model on AWS / GCP / Azure sits between managed APIs and local hardware:

| Setup | Speed vs managed API | RPM limit |
|---|---|---|
| A100 (80GB) + vLLM, Llama 3.3 70B | Faster (0.5–2s/call) | None |
| H100 + vLLM, Mixtral 8x7B | Faster (0.3–1s/call) | None |
| T4 / V100 + Ollama, Llama 3.2 8B | Comparable | None |

Cost: $1–5/hr for GPU VMs. For high-volume production swarms or teams that want no external API dependency, it is the fastest architecture available. The connection is identical to local Ollama — just point baseURL at your VM's IP.


max_completion_tokens — The Silent Truncation Trap

One of the most common failure modes in agentic output tasks. When a model hits the max_completion_tokens ceiling it stops mid-output and returns whatever it has — no error, no warning. The API call succeeds with finish_reason: "length" instead of "stop".

This is especially dangerous for code-rewrite agents where the output is a full file.

```text
# Real numbers (gpt-5-mini, order-service.ts rewrite):
  Blockers section:  ~120 tokens
  Fixed code:        ~2,800 tokens  (213 lines with // FIX: comments)
  Total needed:      ~3,000 tokens  ← hits the cap exactly → empty output
  Fix: raise max_completion_tokens to 16,000 → full rewrite delivered in one shot
```
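Because truncation is silent, the only reliable defense is checking `finish_reason` on every response. A minimal guard, assuming an OpenAI-style chat completion shape (the type below is narrowed for illustration):

```typescript
// Narrowed response shape matching the OpenAI chat completions API.
type ChatChoice = { finish_reason: string; message: { content: string | null } };

function extractOrThrow(resp: { choices: ChatChoice[] }): string {
  const choice = resp.choices[0];
  if (choice.finish_reason === 'length') {
    // Partial output: fail loudly instead of passing truncated code downstream.
    throw new Error('Output truncated — raise max_completion_tokens');
  }
  return choice.message.content ?? '';
}
```

Wire this into every code-output agent; a thrown error is far cheaper to debug than a silently half-written file.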

Rule of Thumb by Task

| Task | Recommended cap |
|---|---|
| Short classification / sentiment | 200–500 |
| Code review findings (one reviewer) | 400–800 |
| Blocker summary (coordinator) | 500–1,000 |
| Full file rewrite (≤300 lines) | 12,000–16,000 |
| Full file rewrite (≤1,000 lines) | 32,000–64,000 |
| Document / design revision | 16,000–32,000 |

All GPT-5 variants support 128,000 max output tokens — the ceiling is never the model, it is always the cap you set.
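The rewrite rows of the table can be encoded as a small helper (the function and its thresholds are a sketch of the rule of thumb above, not a Network-AI API):

```typescript
// Pick a max_completion_tokens cap for a full-file rewrite by file length.
function capForRewrite(fileLines: number): number {
  if (fileLines <= 300) return 16_000;
  if (fileLines <= 1_000) return 64_000;
  return 128_000; // GPT-5 family output ceiling
}
```

Erring high costs nothing unless the model actually emits the tokens; erring low silently truncates.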

Lessons from Building the Code-Review Swarm

| Issue | Root cause | Fix |
|---|---|---|
| Fixed code output was empty | max_completion_tokens: 3000 too low | Raise to 16000+ for any code-output agent |
| finish_reason: "length" silently discards output | Model hits cap, returns partial response, no error | Always check choices[0].finish_reason and alert on "length" |
| Flagship model slow + expensive for reviewers | High latency + $14/1M output tokens | Use gpt-5-mini ($2/1M, same RPM) for reviewer/fixer agents |
| Coordinator + fixer as two calls | Second call hits rate-limit window, +60s | Merge into one structured two-section call |
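The last fix (merging coordinator and fixer into one structured call) needs a way to split the single response back into its two sections. One sketch, with a hypothetical sentinel string:

```typescript
// Sentinel separating the two sections of the merged response (hypothetical).
const SENTINEL = '===FIXED_CODE===';

function splitSections(output: string): { blockers: string; fixedCode: string } {
  const i = output.indexOf(SENTINEL);
  if (i === -1) return { blockers: output.trim(), fixedCode: '' };
  return {
    blockers:  output.slice(0, i).trim(),
    fixedCode: output.slice(i + SENTINEL.length).trim(),
  };
}
```

One call means one rate-limit slot and one latency hit; the prompt simply instructs the model to emit the blocker summary, then the sentinel, then the full fixed file.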