Cost-Effective Multi-Agent AI: Separating Reasoning from Execution
Building AI agents that actually work in production is expensive. Every API call to GPT-4 or Claude costs money, and when you're running agentic loops with multiple tool calls, those costs add up fast. But what if you could use expensive models only where they truly matter, and run everything else locally for free?
That's exactly what I built: a multi-agent system that uses powerful cloud LLMs for planning and tiny local models for execution.
The Problem with Current Approaches
Most AI agent frameworks treat all tasks equally. Need to plan a complex workflow? Call GPT-4. Need to fetch weather data? Call GPT-4. Need to format a response? Call GPT-4 again.
This is wasteful. Planning and decomposition require genuine intelligence. But once you have a clear, small task like "call this API and extract the temperature," a 2B-parameter model running on your laptop can handle it just fine.
There's also a privacy angle. Your reasoning agent only needs to understand the problem structure. It doesn't need to see customer PII, payment data, or internal API responses. By splitting reasoning from execution, sensitive data never leaves your infrastructure.
The Architecture
flowchart TB
subgraph Cloud["☁️ Cloud / Powerful Model"]
RA["🧠 Reasoning Agent<br/>(external LLM)<br/>───────────<br/>• Decomposes problems<br/>• No tool access<br/>• No sensitive data"]
end
User["👤 User Query"] --> RA
RA -->|"Task 1"| EA1
RA -->|"Task 2"| EA2
RA -->|"Task 3"| EA3
subgraph Local["🏠 Local Infrastructure"]
EA1["⚡ Execution Agent 1<br/>(local LLM)"]
EA2["⚡ Execution Agent 2<br/>(local LLM)"]
EA3["⚡ Execution Agent 3<br/>(local LLM)"]
EA1 --> Tools1["🔧 Weather API"]
EA2 --> Tools2["🔧 City Database"]
EA3 --> Tools3["🔧 Internal APIs"]
end
EA1 -->|"Result 1"| Synth
EA2 -->|"Result 2"| Synth
EA3 -->|"Result 3"| Synth
Synth["📋 Synthesize Results"] --> Response["✅ Final Response"]
style Cloud fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
style Local fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style RA fill:#fff3e0,stroke:#f57c00,stroke-width:2px
style EA1 fill:#f3e5f5,stroke:#7b1fa2
style EA2 fill:#f3e5f5,stroke:#7b1fa2
style EA3 fill:#f3e5f5,stroke:#7b1fa2
The system has two types of agents:
🧠 Reasoning Agent - Uses a capable model (like cloud-based GPT-4o or local gemma3:12b) to decompose complex problems into independent, atomic tasks. This agent never touches sensitive data or makes tool calls. It just plans.
⚡ Execution Agents - Use lightweight local models (like granite4:tiny-h with tool-call support) to execute individual tasks. These agents have access to tools and APIs, handle sensitive data, and run entirely on your hardware. Here's how the two agent types look with the Strands Agents SDK:
from strands import Agent
from strands.models.openai import OpenAIModel
from strands.models.ollama import OllamaModel
# Reasoning agent - smarter model, no tools, no sensitive data
reasoning_model = OpenAIModel(
    model_id="gpt-4o",
    params={"temperature": 0.5},
)
reasoning_agent = Agent(
model=reasoning_model,
system_prompt="""You are a planning agent. Break down complex
problems into small, independent tasks. Output a JSON list of
tasks. Do NOT execute anything yourself."""
)
# Execution agent - tiny model, full tool access
execution_model = OllamaModel(
host="http://localhost:11434",
model_id="granite4:tiny-h",
temperature=0.3
)
execution_agent = Agent(
    model=execution_model,
    # @tool-decorated functions (defined elsewhere) wrapping the actual APIs
    tools=[get_city_info, get_weather, search_books],
)
How It Works
Take a query like "Find cities similar to Austin, TX based on climate and population."
The reasoning agent breaks this into atomic tasks:
- Get Austin's weather data
- Get Austin's population
- Search for cities with similar climate
- Search for cities with similar population
- Find the intersection
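Serialized as JSON, that plan might look like this (the exact schema is whatever your decomposition prompt enforces; this one is illustrative, but it matches the task["instruction"] access in the orchestration code below):

[
  {"id": 1, "instruction": "Get current weather data for Austin, TX"},
  {"id": 2, "instruction": "Get the population of Austin, TX"},
  {"id": 3, "instruction": "Search for US cities with a similar climate"},
  {"id": 4, "instruction": "Search for US cities with a similar population"},
  {"id": 5, "instruction": "Return the cities that appear in both lists"}
]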
Each task is then dispatched to an execution agent. These can run in parallel since the tasks are independent.
import json

def orchestrate(user_query: str):
    # Step 1: Plan with the smart model
    plan = reasoning_agent(f"Break this into tasks: {user_query}")
    # The agent result stringifies to the model's final text output
    tasks = json.loads(str(plan))

    # Step 2: Execute with cheap local models
    results = []
    for task in tasks:
        result = execution_agent(task["instruction"])
        results.append(result)

    # Step 3: Synthesize (can use reasoning or execution agent)
    return execution_agent(f"Combine these results: {results}")
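Because the plan guarantees task independence, Step 2 can also be dispatched concurrently. A minimal sketch using a thread pool, under two assumptions: your tools are thread-safe, and each task gets a fresh agent (Strands agents accumulate conversation history, so sharing one across threads would mix contexts):

from concurrent.futures import ThreadPoolExecutor

def run_task(task: dict) -> str:
    # Fresh agent per task so concurrent tasks don't share history
    agent = Agent(
        model=execution_model,
        tools=[get_city_info, get_weather, search_books],
    )
    return str(agent(task["instruction"]))

def execute_parallel(tasks: list[dict]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_task, tasks))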
Real Examples
Example 1: Similar Cities Finder (output)
I used gemma3:12b for reasoning and granite4:tiny-h for execution. The reasoning agent decomposed "find similar cities to Austin, TX" into weather lookups, population queries, and comparison tasks. Execution agents made the actual API calls.
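For that run the planner was local as well, which is just a model swap. A minimal sketch of the configuration (the host and temperature values are illustrative assumptions):

# All-local variant: a mid-size Ollama model plans, a tiny one executes
reasoning_model = OllamaModel(
    host="http://localhost:11434",  # default Ollama endpoint
    model_id="gemma3:12b",
    temperature=0.5,
)
reasoning_agent = Agent(model=reasoning_model)  # same planning prompt as before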
Example 2: Book Recommendations with Geopolitical Context (output)
For another complex task combining book data with current events, I used granite4:tiny-h for both reasoning and execution. The key insight: even a tiny model can plan effectively when the problem domain is well-defined and the output format is constrained.
The Trade-offs
Pros:
- Significant cost reduction. Cloud API calls drop by 80-90% in typical workflows
- Privacy by design. Sensitive data stays on your infrastructure
- Parallelization. Independent tasks can run concurrently
- Fault isolation. One failed execution doesn't break the whole workflow
- Model flexibility. Swap reasoning or execution models without changing architecture
Cons:
- Latency overhead. The planning step adds one round-trip before execution starts
- Task decomposition quality matters. Bad plans lead to bad results (garbage in, garbage out)
- Local model limitations. Some tasks genuinely need stronger models
Where This Applies
This pattern works well for:
- Enterprise workflows with strict data residency requirements
- High-volume applications where API costs are a real concern
- Hybrid cloud setups where you want cloud intelligence with local execution
- Regulated industries (healthcare, finance) where data exposure is a compliance issue
- Edge deployments where bandwidth to cloud APIs is limited
It's less suitable for tasks that require tight reasoning-execution coupling, or when the problem can't be cleanly decomposed.
Technical Notes
I built this using the Strands Agents SDK, which handles the agentic loop, tool calling, and model abstraction cleanly. Strands supports Ollama, LM Studio, llama.cpp, and cloud providers through a unified interface, making it easy to mix and match models.
The key to making this work is good task decomposition prompts. The reasoning agent needs clear instructions on the output format and on what counts as an "atomic" task. I found that structured output (JSON or XML-like tagging) works better than free-form text for making the task breakdown explicit.
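For reference, a minimal sketch of such a decomposition prompt (illustrative, and it will need tuning for your domain):

DECOMPOSITION_PROMPT = """You are a planning agent. Break the user's
problem into the smallest set of independent, atomic tasks.

Rules:
- Each task must be executable on its own, with no dependency on
  another task's output.
- Each task must name the data it needs and the expected result.
- Output ONLY a JSON array: [{"id": <int>, "instruction": "<task>"}]
- Do NOT execute anything yourself."""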
Conclusion
Not every AI task needs a frontier model. By separating reasoning from execution, you can build agents that are cheaper, faster, more private, and often just as effective. The cloud LLM handles the hard part—understanding what to do. Local models handle the easy part—actually doing it.
The infrastructure is ready. Ollama makes running local models trivial. Strands makes building agents straightforward. The missing piece was the architecture to combine them effectively.
Think big, execute small.