
Building custom LLMOps: When Langfuse wasn't enough for production AI agents

500K+ interactions, 10 agents, 18 months in production. Why PostgreSQL + structured logs beat Langfuse, LangSmith, and Phoenix.

Alejandro Valencia · 12 min read

I've been operating a 10-agent AI system in production for 18 months. 500K+ interactions, $67K operational savings, 24/7 multilingual operation. After evaluating Langfuse, LangSmith, and Arize Phoenix, I built my own observability stack.

This isn't about NIH syndrome. It's about what production multi-agent systems actually need.

The problem: long-running agent sessions

Standard LLMOps tools are designed for request-response patterns. User sends message → LLM responds → trace complete.

My system handles coaching sessions that span days or weeks. A single user journey involves:

  • 10+ agent handoffs
  • Context compression decisions at each step
  • Nested workflow execution (up to 5 levels deep)
  • Business logic interleaved with AI decisions

Langfuse traces looked like spaghetti. No correlation between sessions. No visibility into why an agent made a decision.

What commercial tools miss

1. Agent decision tracking

Langfuse tracks what the LLM said. It doesn't track what your agent decided.

My agents make structured decisions every execution:

{
  "agent_decisions": {
    "flow": {
      "challenge_status": "in_progress",
      "significant_moment_detected": true
    },
    "safety": {
      "meta_questions_behavior": "normal"
    },
    "coaching_quality": {
      "intensity_level": "balanced",
      "message_length": 2325,
      "response_structure": {
        "sections": ["ANÁLISIS", "ESTRATEGIA", "CALIBRACIÓN"],
        "section_count": 7
      }
    }
  }
}

This is the "mind" of the agent. Langfuse has no concept of this.
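To ground that: a decision payload like the one above becomes a first-class row in logs_agent_execution, keyed by execution_id. Here's a minimal sketch using node-postgres. The exact column set loosely mirrors the queries shown later, and logAgentDecision is an illustrative helper, not the production n8n node.

// Minimal sketch: persist an agent's structured decisions as a first-class row,
// keyed by execution_id. Column names loosely mirror the queries shown later.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from PG* environment variables

async function logAgentDecision(executionId, agentType, decisions, durationMs) {
  await pool.query(
    `INSERT INTO logs_agent_execution
       (execution_id, agent_type, particular_data, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [
      executionId,
      agentType,
      // JSONB payload: the decision object plus timing metadata
      JSON.stringify({ agent_decisions: decisions, event_duration_ms: durationMs }),
    ]
  );
}

In n8n this runs as a Postgres node at the end of each agent workflow. The point is that the decision object is a queryable record, not a string buried inside a trace.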

2. Context compression observability

I use exponential token compression to manage long conversations. Each message gets a compression score based on significance.

{
  "context_management": {
    "user_input": {
      "compression_score": 0.1,
      "compression_priority": "preserve_full",
      "significant_moment_detected": true
    },
    "agent_output": {
      "compression_score": 0.15,
      "context_utilization": "high"
    }
  }
}

When significant_moment_detected: true, compression is minimal. Routine exchanges get aggressive compression. This is how you maintain coherent month-long coaching relationships without exploding token costs.

No commercial tool tracks this. Because no commercial tool assumes you're doing it.
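To make the mechanism concrete, here's a rough sketch of the score-to-policy mapping. The thresholds, ratios, and the compressionPolicy name are illustrative assumptions; the production logic weighs more signals than these two.

// Illustrative score-to-policy mapping. Thresholds and ratios are invented for
// the example; the real logic weighs more signals than these two.
function compressionPolicy({ compressionScore, significantMomentDetected }) {
  if (significantMomentDetected || compressionScore <= 0.2) {
    // Low score or a significant moment: keep the exchange verbatim
    return { compression_priority: 'preserve_full', target_ratio: 1.0 };
  }
  if (compressionScore <= 0.6) {
    return { compression_priority: 'summarize', target_ratio: 0.4 };
  }
  // Routine exchanges: keep only a one-line gist
  return { compression_priority: 'aggressive', target_ratio: 0.1 };
}

// A significant moment is preserved regardless of its score
compressionPolicy({ compressionScore: 0.55, significantMomentDetected: true });
// => { compression_priority: 'preserve_full', target_ratio: 1.0 }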

3. Multi-level workflow correlation

A single user message in my system can trigger:

{
  "processes_executed": [
    {"name": "001 Main",               "nesting_level": 0},
    {"name": "  003 Extract message",  "nesting_level": 1},
    {"name": "  014 Process security", "nesting_level": 1},
    {"name": "  005 Process command",  "nesting_level": 1},
    {"name": "  301 Selling process",  "nesting_level": 1},
    {"name": "  202 Agent - Coaching", "nesting_level": 1},
    {"name": "    101 AI Execute LLM", "nesting_level": 2},
    {"name": "    021 Send message",   "nesting_level": 2},
    {"name": "      022 Record exchange", "nesting_level": 3},
    {"name": "  901 Cleanup User Lock", "nesting_level": 1}
  ],
  "created_at": "2025-11-28T21:41:34.752Z",
  "finalized_at": "2025-11-28T21:42:02.129Z"
}

10 workflows, 3 nesting levels, ~27 seconds total (including LLM processing). Every step correlated by execution_id. Every user journey traceable across days.

Langfuse can show you a trace. It can't show you a journey.
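Because every process entry also records a millisecond timestamp (see the log insertion pattern further down), a rough "where did the time go" breakdown is a small transform. A sketch; stepTimings is an illustrative helper, and time-until-next-step is only an approximation for nested workflows.

// Sketch: rough per-step timing from the ordered processes_executed array,
// assuming each entry carries timestamp_ms (see the insertion pattern below).
function stepTimings(executionLog) {
  const steps = executionLog.processes_executed;
  return steps.map((step, i) => ({
    name: step.name.trim(),
    nesting_level: step.nesting_level,
    // Milliseconds until the next step starts; the last step runs to finalization
    elapsed_ms:
      (i + 1 < steps.length
        ? steps[i + 1].timestamp_ms
        : Date.parse(executionLog.finalized_at)) - step.timestamp_ms,
  }));
}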

The architecture

Four log tables, one correlation key:

┌─────────────────────────────────────────────────────────────┐
│                      execution_id                            │
│                           │                                  │
│    ┌──────────────────────┼──────────────────────┐          │
│    │                      │                      │          │
│    ▼                      ▼                      ▼          │
│ logs_execution     logs_user_journey     logs_agent_execution│
│ (workflow trace)   (event sourcing)      (AI decisions)     │
│    │                                            │           │
│    │                                            ▼           │
│    │                                   logs_llm_execution   │
│    │                                   (tokens, costs)      │
│    └────────────────────────────────────────────┘          │
└─────────────────────────────────────────────────────────────┘

logs_execution

Workflow orchestration. Which n8n workflows fired, in what order, with what nesting.

logs_user_journey

Event sourcing for user interactions. Every message received, every message sent, every state transition.

{
  "event_type": "user_message_sent_requested",
  "particular_data": {
    "role": "system",
    "text_length": 934,
    "credits_used": 0,
    "is_bot_scope": true,
    "is_coaching_scope": false,
    "is_selling_scope": false
  }
}

Business context baked into every event. I can query "show me all selling-scope messages for user X" in one SQL statement.
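That query, roughly, assuming the scope flags live in particular_data as in the event above. Sketched with node-postgres; sellingScopeMessages is an illustrative helper.

// Sketch: "all selling-scope messages for user X", with the scope flag read
// out of the particular_data JSONB column as shown in the event above.
async function sellingScopeMessages(pool, userId) {
  const { rows } = await pool.query(
    `SELECT event_type, particular_data, created_at
       FROM logs_user_journey
      WHERE user_id = $1
        AND (particular_data->>'is_selling_scope')::boolean = true
      ORDER BY created_at`,
    [userId]
  );
  return rows;
}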

logs_agent_execution

Agent-level decisions. Context preparation, compression choices, coaching quality metrics.

logs_llm_execution

Raw LLM calls. Tokens, costs, latency, prompt hashes.

{
  "event_type": "llm_valid_execution",
  "model_target": "claude-sonnet-4-20250514",
  "prompt_hash": "# ADVANCED MALE SOCIAL COACHING SYSTEM...43713chars",
  "output_tokens": 1568,
  "processing_time_ms": 25937,
  "output_cost": "0.02352",
  "total_cost": "0.06123"
}
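The payoff of a single correlation key is that one call can reassemble everything recorded for an execution. A sketch assuming a node-postgres pool and that all four tables carry execution_id and created_at columns; fullExecution is an illustrative helper.

// Sketch: reassemble everything logged for one execution_id across the four
// tables. Assumes each table carries execution_id and created_at columns.
const { Pool } = require('pg');
const pool = new Pool();

const LOG_TABLES = [
  'logs_execution',
  'logs_user_journey',
  'logs_agent_execution',
  'logs_llm_execution',
];

async function fullExecution(executionId) {
  const results = await Promise.all(
    LOG_TABLES.map((table) =>
      // Table names come from a fixed allowlist, so interpolation is safe here
      pool.query(
        `SELECT * FROM ${table} WHERE execution_id = $1 ORDER BY created_at`,
        [executionId]
      )
    )
  );
  return Object.fromEntries(LOG_TABLES.map((t, i) => [t, results[i].rows]));
}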

Implementation: n8n + PostgreSQL

The entire observability layer runs on the same stack as the application:

  • n8n: Workflow orchestration with built-in execution tracking
  • PostgreSQL: Structured logs with full SQL queryability
  • Redis: Execution correlation across async workflows

No additional infrastructure. No SaaS bills. No data leaving my servers.

Log insertion pattern

Every workflow starts with context capture:

// Pseudocode - actual implementation in n8n nodes
const executionLog = {
  execution_id: generateCorrelationId(),
  user_id: context.user.id,
  telegram_id: context.user.telegram_id,
  processes_executed: [],
  user_tier: context.user.tier,
  user_language: context.user.language,
  user_input_type: detectInputType(message),
  created_at: new Date().toISOString()
};

// Pass through entire workflow, append each process
executionLog.processes_executed.push({
  name: currentWorkflow.name,
  timestamp: new Date().toISOString(),
  timestamp_ms: Date.now(),
  nesting_level: context.nestingLevel
});
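For async sub-workflows that can't receive this context object directly, the Redis piece mentioned above carries the execution_id across the boundary. A rough sketch using ioredis; the key format and TTL are my assumptions.

// Sketch: carry execution_id across async sub-workflow boundaries via Redis.
// Key format and TTL are illustrative; the real system runs inside n8n code nodes.
const Redis = require('ioredis');
const redis = new Redis(); // defaults to localhost:6379

// Parent workflow: park the correlation id before triggering the sub-workflow
async function publishCorrelation(userId, executionId) {
  await redis.set(`execution:${userId}`, executionId, 'EX', 3600); // 1h TTL
}

// Sub-workflow: recover the id so its log rows join the same execution
async function resolveCorrelation(userId) {
  return redis.get(`execution:${userId}`);
}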

Querying patterns

Find slow agent executions:

SELECT
  execution_id,
  agent_type,
  particular_data->>'event_duration_ms' as duration_ms,
  created_at
FROM logs_agent_execution
WHERE (particular_data->>'event_duration_ms')::int > 20000
ORDER BY created_at DESC;

Cost analysis by agent type:

SELECT
  agent_type,
  COUNT(*) as executions,
  SUM((total_cost)::numeric) as total_cost,
  AVG((processing_time_ms)::numeric) as avg_latency
FROM logs_llm_execution
WHERE event_type = 'llm_valid_execution'
GROUP BY agent_type;

User journey reconstruction:

SELECT
  event_type,
  particular_data,
  created_at
FROM logs_user_journey
WHERE user_id = '57'
ORDER BY created_at;

Results

18 months in production:

Metric                          Value
Total interactions              500K+
Agent types                     10
Avg response time               3-5 seconds
Operational savings             $67K
Downtime                        0
External observability cost     $0

The custom observability stack has caught:

  • Context compression bugs: Significant moments being over-compressed (a detection sketch follows this list)
  • Agent decision drift: Coaching intensity creeping too high/low
  • Cost anomalies: Specific user patterns triggering expensive execution paths
  • Latency regressions: New prompt versions causing slower responses
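The first of those reduces to a single query once the context fields shown earlier are stored with the agent execution. A sketch; whether those fields live under particular_data in logs_agent_execution, and the overCompressedMoments helper, are my assumptions.

// Sketch: find executions where a significant moment was still compressed hard.
// Assumes the context_management fields shown earlier are stored inside
// particular_data on logs_agent_execution; adjust the paths to the real layout.
async function overCompressedMoments(pool, since) {
  const { rows } = await pool.query(
    `SELECT execution_id, created_at,
            particular_data->'context_management'->'user_input' AS user_input_ctx
       FROM logs_agent_execution
      WHERE created_at > $1
        AND (particular_data #>> '{context_management,user_input,significant_moment_detected}')::boolean = true
        AND particular_data #>> '{context_management,user_input,compression_priority}' <> 'preserve_full'
      ORDER BY created_at DESC`,
    [since]
  );
  return rows;
}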

When to build custom

Build your own observability if:

  1. Your agents make structured decisions that you need to track
  2. Sessions span days/weeks, not minutes
  3. Business context matters for debugging (tiers, scopes, journeys)
  4. You're already running PostgreSQL and don't want another SaaS
  5. You need SQL queryability, not just dashboards

Use Langfuse/LangSmith if:

  1. Request-response pattern fits your use case
  2. You want quick setup over customization
  3. Your team expects familiar UI
  4. You're not doing context compression or multi-agent orchestration

Conclusion

The LLMOps market assumes you're building chatbots. If you're building something more complex—long-running agent sessions, custom context management, multi-agent orchestration—the tools will fight you.

PostgreSQL + structured logs + your domain knowledge = observability that actually works.

Total implementation time: ~2 weeks. Total ongoing cost: $0. Total vendor lock-in: 0.