Cost-Effective Multi-Agent AI: Separating Reasoning from Execution

Lasantha Kularatne

Building AI agents that actually work in production is expensive. Every API call to GPT-4 or Claude costs money, and when you're running agentic loops with multiple tool calls, those costs add up fast. But what if you could use expensive models only where they truly matter, and run everything else locally for free?

That's exactly what I built: a multi-agent system that uses powerful cloud LLMs for planning and tiny local models for execution.

View Code on GitHub

The Problem with Current Approaches

Most AI agent frameworks treat all tasks equally. Need to plan a complex workflow? Call GPT-4. Need to fetch weather data? Call GPT-4. Need to format a response? Call GPT-4 again.

This is wasteful. Planning and decomposition require genuine intelligence. But once you have a clear, small task like "call this API and extract the temperature," a 2B-parameter model running on your laptop can handle it just fine.

There's also a privacy angle. Your reasoning agent only needs to understand the problem structure. It doesn't need to see customer PII, payment data, or internal API responses. By splitting reasoning from execution, sensitive data never leaves your infrastructure.

The Architecture

flowchart TB
    subgraph Cloud["☁️ Cloud / Powerful Model"]
        RA["🧠 Reasoning Agent<br/>(external LLM)<br/>───────────<br/>• Decomposes problems<br/>• No tool access<br/>• No sensitive data"]
    end
    
    User["👤 User Query"] --> RA
    
    RA -->|"Task 1"| EA1
    RA -->|"Task 2"| EA2
    RA -->|"Task 3"| EA3
    
    subgraph Local["🏠 Local Infrastructure"]
        EA1["⚡ Execution Agent 1<br/>(local LLM)"]
        EA2["⚡ Execution Agent 2<br/>(local LLM)"]
        EA3["⚡ Execution Agent 3<br/>(local LLM)"]
        
        EA1 --> Tools1["🔧 Weather API"]
        EA2 --> Tools2["🔧 City Database"]
        EA3 --> Tools3["🔧 Internal APIs"]
    end
    
    EA1 -->|"Result 1"| Synth
    EA2 -->|"Result 2"| Synth
    EA3 -->|"Result 3"| Synth
    
    Synth["📋 Synthesize Results"] --> Response["✅ Final Response"]
    
    style Cloud fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Local fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style RA fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style EA1 fill:#f3e5f5,stroke:#7b1fa2
    style EA2 fill:#f3e5f5,stroke:#7b1fa2
    style EA3 fill:#f3e5f5,stroke:#7b1fa2

The system has two types of agents:

🧠 Reasoning Agent - Uses a capable model (like cloud-based GPT-4o or local gemma3:12b) to decompose complex problems into independent, atomic tasks. This agent never touches sensitive data or makes tool calls. It just plans.

⚡ Execution Agents - Use lightweight local models (like local granite4:tiny-h with tool-call support) to execute individual tasks. These agents have access to tools and APIs, handle sensitive data, and run entirely on your hardware.

from strands import Agent
from strands.models.openai import OpenAIModel
from strands.models.ollama import OllamaModel

# Reasoning agent - smarter model, no tools, no sensitive data
reasoning_model = OpenAIModel(
    model_id="gpt-4o",
    params={"temperature": 0.5},
)
reasoning_agent = Agent(
    model=reasoning_model,
    system_prompt="""You are a planning agent. Break down complex
    problems into small, independent tasks. Output a JSON list of
    tasks. Do NOT execute anything yourself.""",
)

# Execution agent - tiny model, full tool access
execution_model = OllamaModel(
    host="http://localhost:11434",
    model_id="granite4:tiny-h",
    temperature=0.3,
)
execution_agent = Agent(
    model=execution_model,
    # get_city_info, get_weather, search_books are @tool-decorated
    # functions (see the sketch below)
    tools=[get_city_info, get_weather, search_books],
)
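
The tools handed to the execution agent are ordinary Python functions registered with Strands' @tool decorator. Here's a minimal sketch of what one could look like; the endpoint URL is a placeholder, and the real implementations live in the repo:

import requests
from strands import tool

@tool
def get_weather(city: str) -> dict:
    """Get current weather data for a city."""
    # Placeholder endpoint; swap in your actual weather API.
    resp = requests.get(f"https://api.example.com/weather/{city}", timeout=10)
    resp.raise_for_status()
    return resp.json()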

How It Works

Take a query like "Find cities similar to Austin, TX based on climate and population."

The reasoning agent breaks this into atomic tasks:

  1. Get Austin's weather data
  2. Get Austin's population
  3. Search for cities with similar climate
  4. Search for cities with similar population
  5. Find the intersection
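
Concretely, the plan might come back as a JSON array along these lines (illustrative; the id field and exact wording are my guesses, but instruction matches what the orchestrator below reads):

[
  {"id": 1, "instruction": "Get current weather data for Austin, TX"},
  {"id": 2, "instruction": "Get the population of Austin, TX"},
  {"id": 3, "instruction": "Search for US cities with a similar climate"},
  {"id": 4, "instruction": "Search for US cities with a similar population"},
  {"id": 5, "instruction": "Find the intersection of the two city lists"}
]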

Each task is then dispatched to an execution agent. Since the tasks are independent, they can also run in parallel (see the sketch after the orchestrator below).

import json

def orchestrate(user_query: str):
    # Step 1: Plan with the smart model
    plan = reasoning_agent(f"Break this into tasks: {user_query}")
    tasks = json.loads(str(plan))  # the agent result stringifies to its final text

    # Step 2: Execute with cheap local models
    results = []
    for task in tasks:
        result = execution_agent(task["instruction"])
        results.append(str(result))

    # Step 3: Synthesize (can use reasoning or execution agent)
    return execution_agent(f"Combine these results: {results}")
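
Since the plan marks tasks as independent, step 2 can run concurrently. A minimal sketch using a thread pool, with one fresh agent per task (Agent instances accumulate conversation history, so sharing a single instance across threads would mix contexts):

from concurrent.futures import ThreadPoolExecutor

def run_task(task):
    # A fresh agent per task keeps conversation state isolated
    agent = Agent(
        model=execution_model,
        tools=[get_city_info, get_weather, search_books],
    )
    return str(agent(task["instruction"]))

def execute_parallel(tasks):
    # Dispatch the independent tasks concurrently to local models
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_task, tasks))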

Real Examples

Example 1: Similar Cities Finder (output)

I used gemma3:12b for reasoning and granite4:tiny-h for execution. The reasoning agent decomposed "find similar cities to Austin, TX" into weather lookups, population queries, and comparison tasks. Execution agents made the actual API calls.

Example 2: Book Recommendations with Geopolitical Context (output)

For another complex task combining book data with current events, I used granite4:tiny-h for both reasoning and execution. The key insight: even a tiny model can plan effectively when the problem domain is well-defined and the output format is constrained.

The Trade-offs

Pros:

  • Significant cost reduction. Cloud API calls drop by 80-90% in typical workflows
  • Privacy by design. Sensitive data stays on your infrastructure
  • Parallelization. Independent tasks can run concurrently
  • Fault isolation. One failed execution doesn't break the whole workflow (see the sketch after this list)
  • Model flexibility. Swap reasoning or execution models without changing architecture
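
Fault isolation falls out of the dispatch loop almost for free. A sketch, reusing the hypothetical id/instruction task schema from earlier: wrap each execution so a failure is recorded rather than fatal, and the synthesizer can still work with partial results:

def run_task_safely(task):
    # One task failing is recorded, not fatal; the other tasks still run
    try:
        result = execution_agent(task["instruction"])
        return {"id": task["id"], "result": str(result)}
    except Exception as exc:
        return {"id": task["id"], "error": str(exc)}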

Cons:

  • Latency overhead. The planning step adds one round-trip before execution starts
  • Task decomposition quality matters. Bad plans lead to bad results (garbage in, garbage out)
  • Local model limitations. Some tasks genuinely need stronger models

Where This Applies

This pattern works well for:

  • Enterprise workflows with strict data residency requirements
  • High-volume applications where API costs are a real concern
  • Hybrid cloud setups where you want cloud intelligence with local execution
  • Regulated industries (healthcare, finance) where data exposure is a compliance issue
  • Edge deployments where bandwidth to cloud APIs is limited

It's less suitable for tasks that require tight reasoning-execution coupling, or when the problem can't be cleanly decomposed.

Technical Notes

I built this using the Strands Agents SDK, which handles the agentic loop, tool calling, and model abstraction cleanly. Strands supports Ollama, LM Studio, llama.cpp, and cloud providers through a unified interface, making it easy to mix and match models.

The key to making this work is good task decomposition prompts. The reasoning agent needs clear instructions on the output format and on what counts as an "atomic" task. I found that structured output (JSON or XML-like tagging) for the task breakdown works better than free-form text.
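
As a rough illustration, a decomposition prompt in this style might look like the following (my paraphrase, not the repo's exact prompt):

PLANNER_PROMPT = """You are a planning agent. Decompose the user's problem
into the smallest possible set of independent tasks.

Rules:
- Each task must be completable with a single tool call.
- Tasks must NOT depend on each other's results.
- Output ONLY a JSON array: [{"id": 1, "instruction": "..."}, ...]
- Do NOT execute anything yourself."""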

Conclusion

Not every AI task needs a frontier model. By separating reasoning from execution, you can build agents that are cheaper, faster, more private, and often just as effective. The cloud LLM handles the hard part—understanding what to do. Local models handle the easy part—actually doing it.

The infrastructure is ready. Ollama makes running local models trivial. Strands makes building agents straightforward. The missing piece was the architecture to combine them effectively.


Think big, execute small.