Benchmarks & Performance
Performance data for Network-AI deployments. Your swarm is only as fast as the backend it calls — this page helps you choose the right setup.
BlackboardValidator Throughput
Layer 1 validation (rule-based, zero LLM calls) measured on Node.js 20, Apple M2, single-thread:
| Input size | Ops/sec | Latency |
|---|---|---|
| Small entry (~100 chars) | ~1,000,000 | < 1 µs |
| Medium entry (~1 KB) | ~500,000 | ~2 µs |
| Large entry (~10 KB) | ~159,000 | ~6 µs |
Layer 2 (QualityGateAgent) adds LLM latency and is async — intended for high-value writes, not every write.
Cloud Provider Performance
Not all cloud APIs perform the same. Model size, inference infrastructure, and tier all affect how fast each agent gets a response — and that directly multiplies across every agent in your swarm.
| Provider / Model | Avg response (5-agent swarm) | RPM limit (free/tier-1) | Notes |
|---|---|---|---|
| OpenAI gpt-5.2 | 6–10s per call | 3–6 RPM | Flagship model, high latency, strict RPM |
| OpenAI gpt-4o-mini | 2–4s per call | 500 RPM | Fast, cheap, good for reviewer agents |
| OpenAI gpt-4o | 4–7s per call | 60–500 RPM | Balanced quality/speed |
| Anthropic Claude 3.5 Haiku | 2–3s per call | 50 RPM | Fastest Claude, great for parallel agents |
| Anthropic Claude 3.7 Sonnet | 4–8s per call | 50 RPM | Stronger reasoning, higher latency |
| Google Gemini 2.0 Flash | 1–3s per call | 15 RPM (free) | Very fast inference, low RPM on free tier |
| Groq (Llama 3.3 70B) | 0.5–2s per call | 30 RPM | Fastest cloud inference available |
| Together AI / Fireworks | 1–3s per call | Varies by plan | Good for parallel workloads |
Key insight: A 5-agent swarm using gpt-4o-mini at 500 RPM can fire all 5 agents truly in parallel and finish in ~4s total. The same swarm on gpt-5.2 at 6 RPM must go sequential and takes 60s. The model tier matters more than the orchestration framework.
Choosing a Model for Swarm Agents
- Speed over depth (many agents, real-time) →
gpt-4o-mini,claude-3.5-haiku,gemini-2.0-flash,groq/llama-3.3-70b - Depth over speed (few agents, high-stakes) →
gpt-4o,claude-3.7-sonnet - Free / no-cost testing → Groq free tier, Gemini free tier, or Ollama locally
- Production with budget → multiple keys across providers, route agents to different models
Rate Limit Patterns
When you run a 5-agent swarm sharing one API key and hit the RPM ceiling, the API silently returns empty responses — not a 429 error, just blank content. Network-AI's swarm demos handle this automatically with sequential dispatch and adaptive header-based pacing (reads x-ratelimit-reset-requests to wait exactly as long as needed).
| You have | What to expect |
|---|---|
| One cloud API key | Sequential dispatch, 40–70s per 5-agent swarm — handled automatically |
| Multiple cloud keys | Near-parallel, 10–15s — one key per adapter instance |
| Local GPU (Ollama, vLLM) | True parallel, 5–20s depending on hardware |
| Home GPU + cloud mix | Local agents never block — cloud agents rate-paced independently |
Multiple Keys = True Parallel
import { CustomAdapter, AdapterRegistry } from 'network-ai';
const registry = new AdapterRegistry();
for (const reviewer of REVIEWERS) {
const adapter = new CustomAdapter();
const client = new OpenAI({ apiKey: process.env[`OPENAI_KEY_${reviewer.id.toUpperCase()}`] });
adapter.registerHandler(reviewer.id, async (payload) => {
const resp = await client.chat.completions.create({ /* ... */ });
return { findings: extractContent(resp) };
});
registry.register(reviewer.id, adapter);
}
// All 5 dispatch in parallel via Promise.all — ~8–12s instead of ~60sLocal GPU = Zero Rate Limits
const localClient = new OpenAI({
apiKey : 'not-needed',
baseURL: 'http://localhost:11434/v1', // Ollama, vLLM, llama.cpp
});
adapter.registerHandler('reviewer', async (payload) => {
const resp = await localClient.chat.completions.create({
model : 'llama3.2',
messages: [/* ... */],
});
return { findings: extractContent(resp) };
});Cloud GPU Instances (Self-Hosted)
Running your own model on AWS / GCP / Azure sits between managed APIs and local hardware:
| Setup | Speed vs managed API | RPM |
|---|---|---|
| A100 (80GB) + vLLM, Llama 3.3 70B | Faster — 0.5–2s/call | None |
| H100 + vLLM, Mixtral 8x7B | Faster — 0.3–1s/call | None |
| T4 / V100 + Ollama, Llama 3.2 8B | Comparable | None |
Cost: $1–5/hr for GPU VMs. For high-volume production swarms or teams that want no external API dependency, it is the fastest architecture available. The connection is identical to local Ollama — just point baseURL at your VM's IP.
max_completion_tokens — The Silent Truncation Trap
One of the most common failure modes in agentic output tasks. When a model hits the max_completion_tokens ceiling it stops mid-output and returns whatever it has — no error, no warning. The API call succeeds with finish_reason: "length" instead of "stop".
This is especially dangerous for code-rewrite agents where the output is a full file.
# Real numbers (gpt-5-mini, order-service.ts rewrite):
Blockers section: ~120 tokens
Fixed code: ~2,800 tokens (213 lines with // FIX: comments)
Total needed: ~3,000 tokens ← hits the cap exactly → empty output
Fix: set to 16,000 → full rewrite delivered in one shotRule of Thumb by Task
| Task | Recommended cap |
|---|---|
| Short classification / sentiment | 200–500 |
| Code review findings (one reviewer) | 400–800 |
| Blocker summary (coordinator) | 500–1,000 |
| Full file rewrite (≤300 lines) | 12,000–16,000 |
| Full file rewrite (≤1,000 lines) | 32,000–64,000 |
| Document / design revision | 16,000–32,000 |
All GPT-5 variants support 128,000 max output tokens — the ceiling is never the model, it is always the cap you set.
Lessons from Building the Code-Review Swarm
| Issue | Root cause | Fix |
|---|---|---|
| Fixed code output was empty | max_completion_tokens: 3000 too low | Raise to 16000+ for any code-output agent |
finish_reason: "length" silently discards | Model hits cap, partial response, no error | Always check choices[0].finish_reason and alert on "length" |
| Flagship model slow + expensive for reviewers | High latency + $14/1M output tokens | Use gpt-5-mini ($2/1M, same RPM) for reviewer/fixer agents |
| Coordinator + fixer as two calls | Second call hits rate limit window, +60s | Merge into one structured two-section call |