
Building custom LLMOps: When Langfuse wasn't enough for production AI agents

500K+ interactions, 10 agents, 18 months in production. Why PostgreSQL + structured logs beat Langfuse, LangSmith, and Phoenix.

Alejandro Valencia · 12 min read

I've been operating a 10-agent AI system in production for 18 months. 500K+ interactions, $67K operational savings, 24/7 multilingual operation. After evaluating Langfuse, LangSmith, and Arize Phoenix, I built my own observability stack.

This isn't about NIH syndrome. It's about what production multi-agent systems actually need.

The problem: long-running agent sessions

Standard LLMOps tools are designed for request-response patterns. User sends message → LLM responds → trace complete.

My system handles coaching sessions that span days or weeks. A single user journey involves:

  • 10+ agent handoffs
  • Context compression decisions at each step
  • Nested workflow execution (up to 5 levels deep)
  • Business logic interleaved with AI decisions

Langfuse traces looked like spaghetti. No correlation between sessions. No visibility into why an agent made a decision.

What commercial tools miss

1. Agent decision tracking

Langfuse tracks what the LLM said. It doesn't track what your agent decided.

My agents make structured decisions every execution:

{
  "agent_decisions": {
    "flow": {
      "challenge_status": "in_progress",
      "significant_moment_detected": true
    },
    "safety": {
      "meta_questions_behavior": "normal"
    },
    "coaching_quality": {
      "intensity_level": "balanced",
      "message_length": 2325,
      "response_structure": {
        "sections": ["ANÁLISIS", "ESTRATEGIA", "CALIBRACIÓN"],
        "section_count": 7
      }
    }
  }
}

This is the "mind" of the agent. Langfuse has no concept of this.
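To ground that: a decision payload like the one above becomes a first-class row in logs_agent_execution, keyed by execution_id. Here's a minimal sketch using node-postgres. The exact column set loosely mirrors the queries shown later, and logAgentDecision is an illustrative helper, not the production n8n node.

// Minimal sketch: persist an agent's structured decisions as a first-class row,
// keyed by execution_id. Column names loosely mirror the queries shown later.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from PG* environment variables

async function logAgentDecision(executionId, agentType, decisions, durationMs) {
  await pool.query(
    `INSERT INTO logs_agent_execution
       (execution_id, agent_type, particular_data, created_at)
     VALUES ($1, $2, $3, NOW())`,
    [
      executionId,
      agentType,
      // JSONB payload: the decision object plus timing metadata
      JSON.stringify({ agent_decisions: decisions, event_duration_ms: durationMs }),
    ]
  );
}

In n8n this runs as a Postgres node at the end of each agent workflow. The point is that the decision object is a queryable record, not a string buried inside a trace.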

2. Context compression observability

I use exponential token compression to manage long conversations. Each message gets a compression score based on significance.

{
  "context_management": {
    "user_input": {
      "compression_score": 0.1,
      "compression_priority": "preserve_full",
      "significant_moment_detected": true
    },
    "agent_output": {
      "compression_score": 0.15,
      "context_utilization": "high"
    }
  }
}

When significant_moment_detected: true, compression is minimal. Routine exchanges get aggressive compression. This is how you maintain coherent month-long coaching relationships without exploding token costs.

No commercial tool tracks this. Because no commercial tool assumes you're doing it.
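To make the mechanism concrete, here's a rough sketch of the score-to-policy mapping. The thresholds, ratios, and the compressionPolicy name are illustrative assumptions; the production logic weighs more signals than these two.

// Illustrative score-to-policy mapping. Thresholds and ratios are invented for
// the example; the real logic weighs more signals than these two.
function compressionPolicy({ compressionScore, significantMomentDetected }) {
  if (significantMomentDetected || compressionScore <= 0.2) {
    // Low score or a significant moment: keep the exchange verbatim
    return { compression_priority: 'preserve_full', target_ratio: 1.0 };
  }
  if (compressionScore <= 0.6) {
    return { compression_priority: 'summarize', target_ratio: 0.4 };
  }
  // Routine exchanges: keep only a one-line gist
  return { compression_priority: 'aggressive', target_ratio: 0.1 };
}

// A significant moment is preserved regardless of its score
compressionPolicy({ compressionScore: 0.55, significantMomentDetected: true });
// => { compression_priority: 'preserve_full', target_ratio: 1.0 }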

3. Multi-level workflow correlation

A single user message in my system can trigger:

{
  "processes_executed": [
    {"name": "001 Main",               "nesting_level": 0},
    {"name": "  003 Extract message",  "nesting_level": 1},
    {"name": "  014 Process security", "nesting_level": 1},
    {"name": "  005 Process command",  "nesting_level": 1},
    {"name": "  301 Selling process",  "nesting_level": 1},
    {"name": "  202 Agent - Coaching", "nesting_level": 1},
    {"name": "    101 AI Execute LLM", "nesting_level": 2},
    {"name": "    021 Send message",   "nesting_level": 2},
    {"name": "      022 Record exchange", "nesting_level": 3},
    {"name": "  901 Cleanup User Lock", "nesting_level": 1}
  ],
  "created_at": "2025-11-28T21:41:34.752Z",
  "finalized_at": "2025-11-28T21:42:02.129Z"
}

10 workflows, 3 nesting levels, ~27 seconds total (including LLM processing). Every step correlated by execution_id. Every user journey traceable across days.

Langfuse can show you a trace. It can't show you a journey.
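Because every process entry also records a millisecond timestamp (see the log insertion pattern further down), a rough "where did the time go" breakdown is a small transform. A sketch; stepTimings is an illustrative helper, and time-until-next-step is only an approximation for nested workflows.

// Sketch: rough per-step timing from the ordered processes_executed array,
// assuming each entry carries timestamp_ms (see the insertion pattern below).
function stepTimings(executionLog) {
  const steps = executionLog.processes_executed;
  return steps.map((step, i) => ({
    name: step.name.trim(),
    nesting_level: step.nesting_level,
    // Milliseconds until the next step starts; the last step runs to finalization
    elapsed_ms:
      (i + 1 < steps.length
        ? steps[i + 1].timestamp_ms
        : Date.parse(executionLog.finalized_at)) - step.timestamp_ms,
  }));
}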

The architecture

Four log tables, one correlation key:

┌─────────────────────────────────────────────────────────────┐
│                      execution_id                            │
│                           │                                  │
│    ┌──────────────────────┼──────────────────────┐          │
│    │                      │                      │          │
│    ▼                      ▼                      ▼          │
│ logs_execution     logs_user_journey     logs_agent_execution│
│ (workflow trace)   (event sourcing)      (AI decisions)     │
│    │                                            │           │
│    │                                            ▼           │
│    │                                   logs_llm_execution   │
│    │                                   (tokens, costs)      │
│    └────────────────────────────────────────────┘          │
└─────────────────────────────────────────────────────────────┘

logs_execution

Workflow orchestration. Which n8n workflows fired, in what order, with what nesting.

logs_user_journey

Event sourcing for user interactions. Every message received, every message sent, every state transition.

{
  "event_type": "user_message_sent_requested",
  "particular_data": {
    "role": "system",
    "text_length": 934,
    "credits_used": 0,
    "is_bot_scope": true,
    "is_coaching_scope": false,
    "is_selling_scope": false
  }
}

Business context baked into every event. I can query "show me all selling-scope messages for user X" in one SQL statement.
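That query, roughly, assuming the scope flags live in particular_data as in the event above. Sketched with node-postgres; sellingScopeMessages is an illustrative helper.

// Sketch: "all selling-scope messages for user X", with the scope flag read
// out of the particular_data JSONB column as shown in the event above.
async function sellingScopeMessages(pool, userId) {
  const { rows } = await pool.query(
    `SELECT event_type, particular_data, created_at
       FROM logs_user_journey
      WHERE user_id = $1
        AND (particular_data->>'is_selling_scope')::boolean = true
      ORDER BY created_at`,
    [userId]
  );
  return rows;
}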

logs_agent_execution

Agent-level decisions. Context preparation, compression choices, coaching quality metrics.

logs_llm_execution

Raw LLM calls. Tokens, costs, latency, prompt hashes.

{
  "event_type": "llm_valid_execution",
  "model_target": "claude-sonnet-4-20250514",
  "prompt_hash": "# ADVANCED MALE SOCIAL COACHING SYSTEM...43713chars",
  "output_tokens": 1568,
  "processing_time_ms": 25937,
  "output_cost": "0.02352",
  "total_cost": "0.06123"
}
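The payoff of a single correlation key is that one call can reassemble everything recorded for an execution. A sketch assuming a node-postgres pool and that all four tables carry execution_id and created_at columns; fullExecution is an illustrative helper.

// Sketch: reassemble everything logged for one execution_id across the four
// tables. Assumes each table carries execution_id and created_at columns.
const { Pool } = require('pg');
const pool = new Pool();

const LOG_TABLES = [
  'logs_execution',
  'logs_user_journey',
  'logs_agent_execution',
  'logs_llm_execution',
];

async function fullExecution(executionId) {
  const results = await Promise.all(
    LOG_TABLES.map((table) =>
      // Table names come from a fixed allowlist, so interpolation is safe here
      pool.query(
        `SELECT * FROM ${table} WHERE execution_id = $1 ORDER BY created_at`,
        [executionId]
      )
    )
  );
  return Object.fromEntries(LOG_TABLES.map((t, i) => [t, results[i].rows]));
}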

Implementation: n8n + PostgreSQL

The entire observability layer runs on the same stack as the application:

  • n8n: Workflow orchestration with built-in execution tracking
  • PostgreSQL: Structured logs with full SQL queryability
  • Redis: Execution correlation across async workflows

No additional infrastructure. No SaaS bills. No data leaving my servers.

Log insertion pattern

Every workflow starts with context capture:

// Pseudocode - actual implementation in n8n nodes
const executionLog = {
  execution_id: generateCorrelationId(),
  user_id: context.user.id,
  telegram_id: context.user.telegram_id,
  processes_executed: [],
  user_tier: context.user.tier,
  user_language: context.user.language,
  user_input_type: detectInputType(message),
  created_at: new Date().toISOString()
};

// Pass through entire workflow, append each process
executionLog.processes_executed.push({
  name: currentWorkflow.name,
  timestamp: new Date().toISOString(),
  timestamp_ms: Date.now(),
  nesting_level: context.nestingLevel
});
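For async sub-workflows that can't receive this context object directly, the Redis piece mentioned above carries the execution_id across the boundary. A rough sketch using ioredis; the key format and TTL are my assumptions.

// Sketch: carry execution_id across async sub-workflow boundaries via Redis.
// Key format and TTL are illustrative; the real system runs inside n8n code nodes.
const Redis = require('ioredis');
const redis = new Redis(); // defaults to localhost:6379

// Parent workflow: park the correlation id before triggering the sub-workflow
async function publishCorrelation(userId, executionId) {
  await redis.set(`execution:${userId}`, executionId, 'EX', 3600); // 1h TTL
}

// Sub-workflow: recover the id so its log rows join the same execution
async function resolveCorrelation(userId) {
  return redis.get(`execution:${userId}`);
}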

Querying patterns

Find slow agent executions:

SELECT
  execution_id,
  agent_type,
  particular_data->>'event_duration_ms' as duration_ms,
  created_at
FROM logs_agent_execution
WHERE (particular_data->>'event_duration_ms')::int > 20000
ORDER BY created_at DESC;

Cost analysis by agent type:

SELECT
  agent_type,
  COUNT(*) as executions,
  SUM((total_cost)::numeric) as total_cost,
  AVG((processing_time_ms)::numeric) as avg_latency
FROM logs_llm_execution
WHERE event_type = 'llm_valid_execution'
GROUP BY agent_type;

User journey reconstruction:

SELECT
  event_type,
  particular_data,
  created_at
FROM logs_user_journey
WHERE user_id = '57'
ORDER BY created_at;

Results

18 months in production:

Metric                          Value
Total interactions              500K+
Agent types                     10
Avg response time               3-5 seconds
Operational savings             $67K
Downtime                        0
External observability cost     $0

The custom observability stack has caught:

  • Context compression bugs: Significant moments being over-compressed (a detection sketch follows this list)
  • Agent decision drift: Coaching intensity creeping too high/low
  • Cost anomalies: Specific user patterns triggering expensive execution paths
  • Latency regressions: New prompt versions causing slower responses
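The first of those reduces to a single query once the context fields shown earlier are stored with the agent execution. A sketch; whether those fields live under particular_data in logs_agent_execution, and the overCompressedMoments helper, are my assumptions.

// Sketch: find executions where a significant moment was still compressed hard.
// Assumes the context_management fields shown earlier are stored inside
// particular_data on logs_agent_execution; adjust the paths to the real layout.
async function overCompressedMoments(pool, since) {
  const { rows } = await pool.query(
    `SELECT execution_id, created_at,
            particular_data->'context_management'->'user_input' AS user_input_ctx
       FROM logs_agent_execution
      WHERE created_at > $1
        AND (particular_data #>> '{context_management,user_input,significant_moment_detected}')::boolean = true
        AND particular_data #>> '{context_management,user_input,compression_priority}' <> 'preserve_full'
      ORDER BY created_at DESC`,
    [since]
  );
  return rows;
}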

When to build custom

Build your own observability if:

  1. Your agents make structured decisions that you need to track
  2. Sessions span days/weeks, not minutes
  3. Business context matters for debugging (tiers, scopes, journeys)
  4. You're already running PostgreSQL and don't want another SaaS
  5. You need SQL queryability, not just dashboards

Use Langfuse/LangSmith if:

  1. Request-response pattern fits your use case
  2. You want quick setup over customization
  3. Your team expects familiar UI
  4. You're not doing context compression or multi-agent orchestration

Conclusion

The LLMOps market assumes you're building chatbots. If you're building something more complex—long-running agent sessions, custom context management, multi-agent orchestration—the tools will fight you.

PostgreSQL + structured logs + your domain knowledge = observability that actually works.

Total implementation time: ~2 weeks. Total ongoing cost: $0. Total vendor lock-in: 0.