
MCPs vs custom workflows: Choosing how much freedom your agents get

MCPs give agents full autonomy. Custom workflows give deterministic control. In production, the difference is predictable costs vs. $0.50 surprises per query.

Alejandro Valencia · 15 min

MCP (Model Context Protocol) is the hottest thing in AI tooling right now. Anthropic released it, and everyone rushed to connect their agents to Slack, Google Drive, and databases. "Give your AI hands!"

I use both MCPs and custom workflows in production. They solve different problems. Choosing wrong will cost you months of debugging autonomous agents that break your business logic.

The core distinction

MCP approach:

User input → Agent decides what tools to use → Agent executes → Output

Custom workflow approach:

User input → Orchestrator decides what happens → Agent executes within boundaries → Output

Same input, same output. Completely different control model.

When MCP wins

MCPs are perfect for third-party routine tasks:

  • Send a Slack message
  • Create a Google Calendar event
  • Query a database
  • Fetch a webpage
  • Upload to S3

These are stateless, reversible, low-risk operations. If the agent decides to send a Slack message when it shouldn't have, you delete it. No business damage.

// MCP style: Agent has full autonomy
const tools = [
  slack_send_message,
  google_calendar_create,
  notion_create_page
];

const response = await agent.run({
  prompt: userMessage,
  tools: tools  // Agent decides which to use, when, how
});

The agent reasons about which tool fits. This works because:

  1. Tools are interchangeable — wrong choice = minor inconvenience
  2. Operations are idempotent — can retry safely
  3. Context is self-contained — tool doesn't need business state
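
To make that concrete, here's a minimal sketch of how forgiving these operations are. The retryIdempotent helper is a hypothetical wrapper, not part of any MCP SDK; the point is that blind retries are safe when the call is idempotent.

// Minimal sketch: low-risk, idempotent operations don't need an orchestration layer.
// retryIdempotent is a hypothetical helper, not part of any MCP SDK.
async function retryIdempotent(toolCall, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await toolCall();   // idempotent: calling it again is safe
    } catch (error) {
      lastError = error;         // a failed or wrong call is a minor inconvenience
    }
  }
  throw lastError;
}

// Usage: wrap whichever MCP tool the agent picked
await retryIdempotent(() => slack_send_message({ channel: '#team', text: 'Daily summary ready' }));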

When custom workflows win

My production system handles AI coaching sessions. A user message might need:

  1. Security validation
  2. Context retrieval from 30-day history
  3. Compression decision based on message significance
  4. Agent type routing (coaching vs. selling vs. support)
  5. Response generation with specific intensity level
  6. State update for next interaction
  7. Credit deduction based on user tier

This is business logic. Getting step 4 wrong doesn't just send a bad message — it breaks the user's coaching journey.

Here's what a real execution trace looks like:

{
  "processes_executed": [
    {"name": "001 Main",              "nesting_level": 0},
    {"name": "  003 Extract message", "nesting_level": 1},
    {"name": "  014 Process security", "nesting_level": 1},
    {"name": "  005 Process command", "nesting_level": 1},
    {"name": "  301 Selling process", "nesting_level": 1},
    {"name": "  202 Agent - Coaching", "nesting_level": 1},
    {"name": "    101 AI Execute LLM", "nesting_level": 2},
    {"name": "    021 Send message",  "nesting_level": 2},
    {"name": "      022 Record exchange", "nesting_level": 3},
    {"name": "  901 Cleanup User Lock", "nesting_level": 1}
  ]
}

10 steps, 3 nesting levels. The orchestrator decided this path, not the agent. The LLM execution (101 AI Execute LLM) happens inside a bounded context—the Coaching Agent (202)—after security, routing, and business logic have already been handled deterministically. The agent only controlled what happened inside that one node: the coaching response content.
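
In code, that orchestration looks roughly like the following. This is a simplified sketch: the function names mirror the trace above, but they stand in for the real n8n sub-workflows rather than reproducing them.

// Simplified sketch of the trace above. Each helper stands in for a sub-workflow.
async function mainWorkflow(update) {                      // 001 Main
  const message = extractMessage(update);                  // 003 Extract message
  await processSecurity(message);                          // 014 Process security: throws and aborts on failure
  await processCommand(message);                           // 005 Process command
  const context = await sellingProcess(message);           // 301 Selling process (deterministic business rules)

  try {
    // 202 Agent - Coaching: the only place the LLM has any say
    const reply = await executeLLM(buildCoachingPrompt(context));   // 101 AI Execute LLM
    await sendMessage(message.userId, reply);                       // 021 Send message
    await recordExchange(message.userId, message, reply);           // 022 Record exchange
  } finally {
    await cleanupUserLock(message.userId);                 // 901 Cleanup User Lock, runs either way
  }
}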

The autonomy spectrum

Full MCP                                          Full Orchestration
    │                                                      │
    ▼                                                      ▼
┌────────┐    ┌────────┐    ┌────────┐    ┌────────┐    ┌────────┐
│ Agent  │    │ Agent  │    │ Agent  │    │ Agent  │    │ Agent  │
│ picks  │    │ picks  │    │ picks  │    │ picks  │    │executes│
│ tools  │    │ tools  │    │ from   │    │ within │    │  only  │
│  and   │    │  with  │    │approved│    │  flow  │    │        │
│ params │    │ guards │    │ subset │    │ branch │    │        │
└────────┘    └────────┘    └────────┘    └────────┘    └────────┘
   ▲                                                        ▲
   │                                                        │
Chatbots                                           Business-critical
Assistants                                         Multi-step flows

Most production systems need something in the middle-right. MCP hype pushes everyone to the left.

Real example: why I don't use MCP for routing

Imagine giving an MCP agent these tools:

const tools = [
  route_to_coaching_agent,
  route_to_sales_agent,
  route_to_support_agent,
  send_direct_response
];

The agent would need to understand:

  • User's current state in their journey
  • Business rules about when to upsell vs. support
  • Context from 30 days of interactions
  • Compression priorities for the response
  • Credit implications of each route

You could put all this in the prompt. Now your prompt is 15,000 tokens and the agent still makes wrong routing decisions 5% of the time. At scale, 5% wrong routes = broken user experiences = churn.

My approach:

// Step 1: Agent identifies intent (bounded by specific prompt)
const intent = await routerAgent.execute({
  message: userMessage,
  context: minimalContext,
  // Prompt constrains output to specific intent categories
  // No tool access, no side effects, just classification
});

// Step 2: Workflow branches deterministically based on intent
// This is hardcoded logic, not agent decision
switch (intent.category) {
  case 'coaching':
    return coachingWorkflow.execute(preparedContext);
  case 'sales':
    return salesWorkflow.execute(preparedContext);
  case 'support':
    return supportWorkflow.execute(preparedContext);
  default:
    // Unknown category: fail loudly instead of guessing a route
    throw new Error(`Unrecognized intent category: ${intent.category}`);
}

The router agent has narrow autonomy: interpret the message into predefined categories. The workflow has zero autonomy: given category X, always execute workflow X.

This is layered bounded autonomy — agents operate freely within their layer, but can't escape to other layers.

Context and state: where MCP falls apart

MCPs are stateless by design. Each tool call is independent.

Business logic is stateful by nature:

{
  "context_management": {
    "user_input": {
      "compression_score": 0.1,
      "compression_priority": "preserve_full",
      "significant_moment_detected": true
    },
    "agent_output": {
      "compression_score": 0.15,
      "context_utilization": "high"
    }
  }
}

This compression decision depends on:

  • What the user said (semantic significance)
  • What they said last week (conversation trajectory)
  • What the agent is about to say (output importance)
  • How close we are to context window limits

An MCP tool can't make this decision. It doesn't have the state. You'd need to pass the entire conversation history to every tool call, then trust the agent to reason correctly about compression.

Or you build a workflow that manages state explicitly and gives the agent only what it needs.
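
Here's a sketch of what that looks like, assuming a couple of hypothetical helpers (loadHistory, scoreSignificance) and an illustrative threshold: the workflow owns the state and the compression decision, and the agent only ever sees the prepared slice.

// Sketch: state and compression live in the workflow, not in a tool call.
// loadHistory, scoreSignificance, and the 0.2 threshold are illustrative.
async function prepareAgentContext(userId, userInput) {
  const history = await loadHistory(userId, { days: 30 });   // stateful, owned by the orchestrator
  const score = scoreSignificance(userInput, history);       // semantic significance + trajectory

  const compression = {
    compression_score: score,
    compression_priority: score < 0.2 ? 'preserve_full' : 'summarize',
    significant_moment_detected: score < 0.2
  };
  await persistCompressionDecision(userId, compression);     // the next interaction depends on this

  // The agent receives only what it needs for this one response.
  return {
    recentTurns: compressHistory(history, compression),
    userInput
  };
}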

The hybrid pattern

In practice, I use both:

┌─────────────────────────────────────────────────────────┐
│                    ORCHESTRATOR                          │
│                 (Custom Workflows)                       │
│                                                          │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐              │
│   │ Security│──▶│ Routing │──▶│ Context │              │
│   │  Check  │   │ Decision│   │  Prep   │              │
│   └─────────┘   └─────────┘   └─────────┘              │
│                                     │                    │
│                                     ▼                    │
│                              ┌─────────────┐            │
│                              │    AGENT    │            │
│                              │  (Bounded)  │            │
│                              │             │            │
│                              │  ┌───────┐  │            │
│                              │  │  MCP  │  │            │
│                              │  │ Tools │  │            │
│                              │  └───────┘  │            │
│                              └─────────────┘            │
│                                     │                    │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐              │
│   │  State  │◀──│ Record  │◀──│Response │              │
│   │ Update  │   │Exchange │   │  Send   │              │
│   └─────────┘   └─────────┘   └─────────┘              │
└─────────────────────────────────────────────────────────┘

The orchestrator handles:

  • Flow control
  • State management
  • Business rule enforcement
  • Context preparation
  • Audit logging

The agent handles:

  • Natural language understanding
  • Response generation
  • MCP tool usage for external services (when orchestrator allows)

MCPs live inside bounded agents, not as the control layer.

The principle is simple:

Business logic requires deterministic control — the orchestrator guarantees security runs, routing is correct, state persists. No probability, no "95% of the time works".

Atomic task execution is where agents get freedom — within a bounded step, the agent can use MCPs to send messages, query APIs, generate content. If it chooses wrong, the blast radius is contained to that step.
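
Here's a sketch of one bounded step. The maxTurns and maxTokens options are assumptions about a generic agent runner, not a specific SDK; what matters is that the orchestrator picks the step, hands over a short tool allowlist, and contains failures.

// Sketch of a bounded step: MCP tools live inside it, never above it.
// maxTurns/maxTokens are assumed options on a generic agent runner.
async function notifyUserStep(preparedContext) {
  const allowedTools = [slack_send_message];        // the orchestrator chose this subset

  try {
    return await agent.run({
      prompt: buildNotificationPrompt(preparedContext),
      tools: allowedTools,
      maxTurns: 2,                                  // the agent can't loop indefinitely
      maxTokens: 2000                               // hard budget for this step
    });
  } catch (error) {
    // Blast radius stays inside the step; the orchestrator decides what happens next.
    return { status: 'notification_failed', error: error.message };
  }
}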

Here's how all the pieces fit together:

  • A Multi-agent system is a team—multiple specialists coordinated to achieve complex goals (using n8n, LangGraph, or similar to interconnect)
  • An Agent is a specialized worker within that team
  • The LLM is the raw brain—cognitive capacity without direction
  • Optimization techniques (prompting, fine-tuning) are the specialization—what turns a generic brain into an expert coach, router, or analyst
  • CAG (Cache-Augmented Generation) is long-term memory—knowledge preloaded, always available
  • Internal workflows are each worker's procedures—how they process, validate, and retry (using n8n, LangGraph, or similar frameworks)
  • RAG (Retrieval-Augmented Generation) is reference books—knowledge the agent can consult when needed
  • MCPs are interfaces to shared external tools—Slack, APIs, databases that any agent can use

Why deterministic flows matter

The argument for deterministic orchestration isn't about distrust of LLMs. It's about guarantees.

In my system, if security validation doesn't run, a malicious user could inject prompts. If state doesn't persist, the next interaction loses 30 days of context. If routing fails, a user in crisis gets a sales pitch.

These aren't "oops, retry" scenarios. They're business-critical invariants.

Probabilistic (MCP-driven):

"Security check runs ~98% of the time"

Deterministic (Workflow-driven):

"Security check runs. Period. Or the entire flow errors."

For external tasks (send email, create calendar event), 98% is fine. For business invariants, only 100% is acceptable.
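
The difference fits in one function. This is a sketch with a hypothetical securityCheck validator, but it shows why the guarantee holds: the invariant is a line of code in the flow, not an instruction the agent may or may not follow.

// Sketch: a business invariant as code. securityCheck is a hypothetical validator.
async function guardedFlow(message) {
  const verdict = await securityCheck(message);      // runs on every execution, unconditionally
  if (!verdict.passed) {
    throw new Error(`Blocked: ${verdict.reason}`);   // the entire flow errors, no silent fallback
  }
  return runBoundedAgents(message);                  // only reachable once the invariant holds
}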

This is true regardless of what tool you use to build the workflow.

Workflow tools: n8n vs LangGraph

I use n8n. But LangGraph is equally valid for deterministic orchestration. Different trade-offs:

n8n

Strengths:

  • Visual editor — non-technical team members can understand and modify flows
  • 400+ pre-built integrations — Telegram, Stripe, databases, APIs ready to use
  • Rapid prototyping — build complex flows in hours, not days
  • Self-hosted — full control, no vendor lock-in, predictable costs
  • CI/CD is simplified — activate workflow = deploy (CI and CD are collapsed into one step)
  • n8n 2.0 — native workflow versioning now available
  • Backups via API — a workflow can backup all other workflows using the self-hosted instance API
  • Human-in-the-loop made easy — Telegram/Slack nodes with "wait for response" pause execution until human replies, menu options included, zero custom code
  • Visual debugging with modularity — when workflows are modularized (sub-workflows), you can inspect each module's execution independently, see exactly which node failed, what data went in/out, without reading stack traces

Limitations:

  • Testing is semi-manual — pin node outputs, review execution logs, but no automated test suites on deploy
  • Staging requires workarounds — duplicate workflows with tags, not native environments
  • Code review is visual diff — not line-by-line like Git

Best for: Teams with mixed technical/non-technical members, rapid MVPs, integration-heavy workflows where deployment speed matters more than traditional CI/CD ceremony.

LangGraph

Strengths:

  • Native Python — full CI/CD, pytest, type checking, code review
  • Checkpointing built-in — pause/resume long-running flows programmatically
  • Human-in-the-loop patterns — typed, code-defined approval steps (more control, more setup)
  • State management — explicit, typed, version-controlled
  • Testing — unit test each node, mock LLM calls

Limitations:

  • Code-only — non-technical team members can't participate
  • Steeper learning curve — StateGraph concepts take time
  • Fewer pre-built integrations — you build or use LangChain's
  • Debugging — stack traces, not visual flow inspection

Best for: Engineering-heavy teams, complex state machines, systems requiring CI/CD rigor.

The honest assessment

My n8n implementation works in production. Deployment is instant — activate and it's live. n8n 2.0 versioning helps with rollbacks. Testing is semi-manual: I pin node outputs, review execution logs, run scenarios. Not pytest, but not blind either. When something breaks at 3am, I debug visually — which is actually faster for workflow issues than reading stack traces.

Why I chose n8n and still choose it: I can build a working proof-of-concept in hours, validate it with real users, then incrementally add complexity until it meets production requirements. The same tool scales from "quick experiment" to "business-critical system" without rewriting. That prototyping speed compounds over time.

If I were starting today with a pure engineering team demanding automated test suites and PR-based deploys, I'd seriously consider LangGraph.

If I were building with a team that includes non-engineers who need to modify flows, or needed 50 integrations working in week one, or valued rapid iteration from prototype to production, n8n remains the better choice.

The architectural pattern — deterministic orchestration with bounded agent autonomy — is what matters. The tool is implementation detail.

The hidden cost: token economics

There's an aspect of MCP-first architectures that nobody talks about: uncontrolled token consumption.

When you give an agent full autonomy with 20 tools:

Agent reasoning: "Let me think about which tool to use..."
  → 500 tokens reasoning
Agent calls tool 1, gets result
  → 200 tokens processing result
Agent reasoning: "Now I need to decide what's next..."
  → 400 tokens reasoning
Agent calls tool 2, gets result
  → 300 tokens processing
Agent reasoning: "Maybe I should also check..."
  → 600 tokens reasoning
... continues until agent decides it's done

Each reasoning step burns tokens. Each tool call adds context. The agent decides when to stop. You get the bill at the end of the month.

I've seen MCP-heavy implementations where a single user query costs $0.50+ in tokens because the agent "explored" multiple tools before settling on an answer.

Deterministic workflows: predictable costs

In my workflow architecture:

Step 1: Router agent (bounded prompt, ~2K tokens max)
Step 2: Context preparation (code, 0 tokens)
Step 3: Coaching agent (bounded prompt, ~8K tokens max)
Step 4: Response validation (code, 0 tokens)
Step 5: Retry if invalid (max 2 retries, controlled)

I know the maximum token cost before execution starts. I can set hard limits per step. If step 3 fails validation, I retry with a modified prompt—not with "agent, try again however you want."
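
As a sketch (callLLM, validateResponse, and the limits object are illustrative), the cost ceiling looks like this: every LLM step carries a hard token cap and a fixed retry budget, so the worst case is computable before the first call.

// Sketch: per-step caps make the worst-case cost known up front.
// callLLM and validateResponse are hypothetical helpers.
const STEP_LIMITS = {
  router:   { maxTokens: 2000, retries: 0 },
  coaching: { maxTokens: 8000, retries: 2 }
};

async function runCoachingStep(preparedContext) {
  const { maxTokens, retries } = STEP_LIMITS.coaching;

  for (let attempt = 0; attempt <= retries; attempt++) {
    const draft = await callLLM({
      prompt: buildCoachingPrompt(preparedContext, attempt),   // modified prompt on retry
      maxTokens                                                // hard cap per call
    });
    if (validateResponse(draft)) return draft;                 // code validates, not the agent
  }
  throw new Error('Coaching step failed validation after bounded retries');
}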

Chain-of-thought control

MCPs let the agent control its own reasoning chain. Workflows let you control it:

Aspect                 | MCP                       | Deterministic Workflow
Reasoning steps        | Agent decides             | You define
Token budget per step  | Uncontrolled              | Capped
Output validation      | Hope agent self-corrects  | Code validates, structured retry
Retry logic            | Agent decides if/how      | You define max retries, fallback
CoT structure          | Emergent                  | Designed

You can implement semi-structured Chain-of-Thought: "Step 1 produces X, validated by schema Y, passed to Step 2 with context Z." The reasoning is guided, not free-form.
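
Here's a sketch of that guided structure, using zod as one possible schema validator (any validation library, or plain checks, would do): the router's output has to match a schema before the next step ever sees it.

// Sketch: guided chain-of-thought via schema validation between steps.
// zod is one option; the category enum mirrors the routing example earlier.
import { z } from 'zod';

const IntentSchema = z.object({
  category: z.enum(['coaching', 'sales', 'support'])
});

async function routeStep(userMessage, minimalContext) {
  const raw = await routerAgent.execute({ message: userMessage, context: minimalContext });
  const parsed = IntentSchema.safeParse(raw);

  if (!parsed.success) {
    // Structured retry or fallback happens here: designed, not emergent.
    throw new Error(`Router output failed schema: ${parsed.error.message}`);
  }
  return parsed.data;   // the next step receives a known shape, nothing else
}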

Real numbers

From my production system:

Average cost per coaching interaction: $0.06
Maximum cost (worst case with retries): $0.15
Predictable monthly cost at 10K interactions: ~$600-800

If I gave the coaching agent full MCP autonomy with tools for "search knowledge base", "check user history", "evaluate response quality", "retry if needed"—the same interaction could cost $0.30-0.80 depending on how much the agent "explores."

Bounded autonomy isn't just about correctness. It's about economics.

To be fair: a well-designed MCP agent can have token limits, output validation, and controlled retries too. The problem isn't MCP technically—it's that the ecosystem, tutorials, and default patterns don't push you toward constraints. The path of least resistance is unbounded autonomy. Workflows force you to think about limits upfront.

Decision framework

Use MCP-first when:

  • Tasks are routine and reversible
  • External service integration (Slack, Calendar, APIs)
  • Agent mistakes have low business impact
  • Stateless operations
  • You want rapid prototyping

Use Custom Workflows when:

  • Business logic determines flow
  • State must persist across interactions
  • Wrong decisions have real consequences
  • Multi-step processes with dependencies
  • You need audit trails and observability
  • Context management is non-trivial

Use Hybrid when:

  • Core flow is business-critical (workflow)
  • But agents need external tool access (MCP within bounded scope)

The industry mistake

The current narrative: "MCPs give AI agents hands. More tools = more capable agents."

YouTube tutorials say: "Just add MCPs and your agent can do everything." And it works—in demos, with one user, on the creator's machine.

Then you deploy to production with 1,000 concurrent users.

Token costs explode. Latency spikes. Errors cascade because Agent #847 called tools in an unexpected order. You can't debug because each execution path is emergent. You can't predict costs because agent reasoning is unbounded.

Aspect           | Tutorial Scale      | Production Scale
Users            | 1 (you)             | 1,000+ concurrent
Cost tolerance   | $5/day? Who cares   | $5/day × 1000 = bankruptcy
Error tolerance  | Retry manually      | Errors = churn
Debugging        | Watch it run        | Need logs, traces, replay
Predictability   | "Usually works"     | Must always work

The tutorial approach is private-scale architecture presented as production-ready.

Anthropic, OpenAI, Google—they're building general-purpose agent infrastructure. Their incentive is adoption, not your unit economics. Your job is to constrain it for your specific business.

A coaching agent with 50 MCP tools and full autonomy will eventually:

  • Route users to sales when they need support
  • Over-compress critical context
  • Skip required compliance checks
  • Burn $2 in tokens on a query that should cost $0.06
  • Make decisions that optimize for "task completion" over user wellbeing

Not because the LLM is bad. Because you gave it freedom without boundaries, following tutorials designed for demos, not production.

The strongest counterarguments

Let me address what MCP advocates would say:

"LLMs will get smarter—eventually they'll make optimal decisions"

True, but this confuses cognitive capacity with business rule alignment. A smarter LLM will understand your context better, but it still won't know that your internal policy says "never offer discounts to trial users during the first 3 days." That's not intelligence—it's proprietary knowledge that changes every quarter.

Also, "smarter" doesn't eliminate variance. You go from 95% to 98% correct decisions, but for business invariants you need 100%. The gap between 98% and 100% doesn't close with more parameters.

And there's the economics: GPT-5 will be smarter but also more expensive. The token consumption problem doesn't disappear—it gets worse.

"You can create specialized agents with just 1-3 MCP tools each"

This is the strongest counterargument because it describes exactly the hybrid pattern I'm advocating—just with MCP as the communication layer.

And here MCP has a real advantage: portability. If tomorrow you migrate from Claude to GPT-5 to Gemini, your MCP interfaces keep working.

The honest answer: if your orchestrator calls specialized agents (each with 1-3 MCP tools), you're doing exactly what this article proposes—just using MCP as a protocol instead of direct calls. That's valid and arguably better for long-term flexibility.

What this article isn't criticizing

This isn't a criticism of MCP as a standard. MCP's value is real: portable interfaces that work across LLM providers, standardized tool definitions, easier integration switching.

If you're using orchestrated workflows that call specialized agents, each with 1-3 MCP tools, you're doing exactly what this article recommends—just using MCP as the communication protocol instead of direct calls. That's not just valid; it might be the best of both worlds.
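
A sketch of that shape, with illustrative names (createAgent is a hypothetical factory; the tools are the MCP examples from earlier): the workflow still owns the sequence, and each specialized agent owns only how it uses its one or two tools.

// Sketch: MCP as protocol, not architecture. Specialized agents, tiny toolsets,
// deterministic sequencing. createAgent is a hypothetical factory.
const calendarAgent  = createAgent({ tools: [google_calendar_create] });                   // 1 tool
const messagingAgent = createAgent({ tools: [slack_send_message, notion_create_page] });   // 2 tools

async function followUpWorkflow(preparedContext) {
  // The orchestrator decides the order; no agent can skip or reorder steps.
  const event = await calendarAgent.run({ prompt: buildSchedulingPrompt(preparedContext) });

  // Each agent decides only how to use its own small MCP toolset.
  return messagingAgent.run({ prompt: buildConfirmationPrompt(preparedContext, event) });
}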

What I'm criticizing is MCP as the control architecture—the pattern of "one agent, 20 tools, let it figure out." That's where you get unbounded token consumption, unpredictable execution paths, and the 5% error rate that kills business-critical flows.

The distinction matters:

Pattern                                             | MCP Role     | Control                                         | Verdict
One agent, 20 tools, full autonomy                  | Architecture | Agent decides everything                        | ❌ Dangerous at scale
Orchestrator → specialized agents → 1-3 tools each  | Protocol     | Workflow decides flow, agent decides execution  | ✅ Best of both worlds
Orchestrator → direct integrations, no MCP          | None         | Workflow decides everything                     | ✅ Works but less portable

The trade-off I'm making

In the interest of honesty, there's a cost to deterministic workflows I haven't mentioned: maintenance burden.

Every business logic change requires modifying the orchestrator. New routing rule? Edit the workflow. New user tier with different behavior? Add branches. Policy change from legal? Update nodes.

With well-designed autonomous agents (properly bounded), some changes get absorbed without new code. "Be more conservative with discounts" might just need a prompt tweak, not a workflow restructure.

Approach                  | Business logic change
Deterministic workflow    | Modify nodes/code, test, deploy
Bounded autonomous agent  | Sometimes just prompt update

I accept this trade-off because:

  1. My business logic changes less often than it executes. Modifying a workflow once to handle 10,000 correct executions is worth it.
  2. When logic changes, I want explicit control. A prompt tweak that "absorbs" a policy change might also absorb it incorrectly in edge cases I didn't consider.
  3. Debugging explicit logic is easier than debugging emergent behavior. When something breaks after a prompt change, figuring out what changed is harder than reviewing a node diff.

But if your business logic changes daily and you have strong evaluation frameworks to catch prompt regressions, the calculus might be different. This is a trade-off, not a universal truth.

When to migrate: n8n → LangGraph

Since I recommended both tools, the natural question is: when should you switch?

Stay with n8n if:

  • Your team includes non-engineers who modify workflows
  • You're still iterating on business logic weekly
  • Visual debugging saves you more time than automated tests would
  • Your integrations are mostly pre-built (Telegram, Slack, databases)

Consider migrating to LangGraph when:

  • Your workflows are stable and rarely change structure
  • You need PR-based reviews for compliance/audit reasons
  • Your team is 100% engineers comfortable with Python
  • You're hitting n8n's limits on complex state management
  • You want to open-source your agent logic

Hybrid approach:

You can use both. n8n for integration-heavy orchestration (webhooks, APIs, notifications), LangGraph for complex agent internals (state machines, sophisticated retry logic). They're not mutually exclusive.

Conclusion

MCPs are a tool, not an architecture. They're excellent for what they're designed for: connecting LLMs to external services.

They're terrible for what the hype implies: replacing careful orchestration of business-critical workflows.

Give your agents hands. But keep them on a leash.