Building custom LLMOps: When Langfuse wasn't enough for production AI agents
500K+ interactions, 10 agents, 18 months in production. Why PostgreSQL + structured logs beat Langfuse, LangSmith, and Phoenix.
I've been operating a 10-agent AI system in production for 18 months. 500K+ interactions, $67K operational savings, 24/7 multilingual operation. After evaluating Langfuse, LangSmith, and Arize Phoenix, I built my own observability stack.
This isn't about NIH syndrome. It's about what production multi-agent systems actually need.
The problem: long-running agent sessions
Standard LLMOps tools are designed for request-response patterns. User sends message → LLM responds → trace complete.
My system handles coaching sessions that span days or weeks. A single user journey involves:
- 10+ agent handoffs
- Context compression decisions at each step
- Nested workflow execution (up to 5 levels deep)
- Business logic interleaved with AI decisions
Langfuse traces looked like spaghetti. No correlation between sessions. No visibility into why an agent made a decision.
What commercial tools miss
1. Agent decision tracking
Langfuse tracks what the LLM said. It doesn't track what your agent decided.
My agents make structured decisions every execution:
{
  "agent_decisions": {
    "flow": {
      "challenge_status": "in_progress",
      "significant_moment_detected": true
    },
    "safety": {
      "meta_questions_behavior": "normal"
    },
    "coaching_quality": {
      "intensity_level": "balanced",
      "message_length": 2325,
      "response_structure": {
        "sections": ["ANÁLISIS", "ESTRATEGIA", "CALIBRACIÓN"],
        "section_count": 7
      }
    }
  }
}
This is the "mind" of the agent. Langfuse has no concept of this.
2. Context compression observability
I use exponential token compression to manage long conversations. Each message gets a compression score based on significance.
{
  "context_management": {
    "user_input": {
      "compression_score": 0.1,
      "compression_priority": "preserve_full",
      "significant_moment_detected": true
    },
    "agent_output": {
      "compression_score": 0.15,
      "context_utilization": "high"
    }
  }
}
When significant_moment_detected: true, compression is minimal. Routine exchanges get aggressive compression. This is how you maintain coherent month-long coaching relationships without exploding token costs.
No commercial tool tracks this. Because no commercial tool assumes you're doing it.
3. Multi-level workflow correlation
A single user message in my system can trigger:
{
  "processes_executed": [
    {"name": "001 Main", "nesting_level": 0},
    {"name": " 003 Extract message", "nesting_level": 1},
    {"name": " 014 Process security", "nesting_level": 1},
    {"name": " 005 Process command", "nesting_level": 1},
    {"name": " 301 Selling process", "nesting_level": 1},
    {"name": " 202 Agent - Coaching", "nesting_level": 1},
    {"name": " 101 AI Execute LLM", "nesting_level": 2},
    {"name": " 021 Send message", "nesting_level": 2},
    {"name": " 022 Record exchange", "nesting_level": 3},
    {"name": " 901 Cleanup User Lock", "nesting_level": 1}
  ],
  "created_at": "2025-11-28T21:41:34.752Z",
  "finalized_at": "2025-11-28T21:42:02.129Z"
}
10 workflows, 3 nesting levels, ~27 seconds total (including LLM processing). Every step correlated by execution_id. Every user journey traceable across days.
Langfuse can show you a trace. It can't show you a journey.
The architecture
Four log tables, one correlation key:
                      execution_id
                           │
        ┌──────────────────┼──────────────────────┐
        │                  │                      │
        ▼                  ▼                      ▼
 logs_execution    logs_user_journey    logs_agent_execution
 (workflow trace)  (event sourcing)        (AI decisions)
        │                                         │
        │                                         ▼
        │                                logs_llm_execution
        │                                  (tokens, costs)
        └─────────────────────────────────────────┘
logs_execution
Workflow orchestration. Which n8n workflows fired, in what order, with what nesting.
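Assuming processes_executed is stored as a jsonb array, a single execution unrolls into its ordered steps with plain SQL (the id literal below is a placeholder):
SELECT
  step.value->>'name' AS workflow,
  (step.value->>'nesting_level')::int AS nesting_level
FROM logs_execution,
     jsonb_array_elements(processes_executed) WITH ORDINALITY AS step(value, idx)
WHERE execution_id = '3f2b...'  -- placeholder correlation id
ORDER BY step.idx;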
logs_user_journey
Event sourcing for user interactions. Every message received, every message sent, every state transition.
{
  "event_type": "user_message_sent_requested",
  "particular_data": {
    "role": "system",
    "text_length": 934,
    "credits_used": 0,
    "is_bot_scope": true,
    "is_coaching_scope": false,
    "is_selling_scope": false
  }
}
Business context baked into every event. I can query "show me all selling-scope messages for user X" in one SQL statement.
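That query looks roughly like this:
SELECT created_at, event_type, particular_data
FROM logs_user_journey
WHERE user_id = '57'
  AND (particular_data->>'is_selling_scope')::boolean = true
ORDER BY created_at;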
logs_agent_execution
Agent-level decisions. Context preparation, compression choices, coaching quality metrics.
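Decision drift becomes a GROUP BY rather than a hunch. A sketch, assuming the agent_decisions payload shown earlier is stored in the particular_data jsonb column:
SELECT
  date_trunc('week', created_at) AS week,
  particular_data->'agent_decisions'->'coaching_quality'->>'intensity_level' AS intensity,
  COUNT(*) AS executions
FROM logs_agent_execution
GROUP BY 1, 2
ORDER BY 1, 2;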
logs_llm_execution
Raw LLM calls. Tokens, costs, latency, prompt hashes.
{
  "event_type": "llm_valid_execution",
  "model_target": "claude-sonnet-4-20250514",
  "prompt_hash": "# ADVANCED MALE SOCIAL COACHING SYSTEM...43713chars",
  "output_tokens": 1568,
  "processing_time_ms": 25937,
  "output_cost": "0.02352",
  "total_cost": "0.06123"
}
Implementation: n8n + PostgreSQL
The entire observability layer runs on the same stack as the application:
- n8n: Workflow orchestration with built-in execution tracking
- PostgreSQL: Structured logs with full SQL queryability
- Redis: Execution correlation across async workflows
No additional infrastructure. No SaaS bills. No data leaving my servers.
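The tables all follow the same pattern: a shared execution_id, a handful of indexed business columns, and a jsonb payload for everything else. A minimal sketch of that shape using logs_agent_execution (not the exact schema; execution_id could equally be a uuid):
CREATE TABLE logs_agent_execution (
  id              bigserial PRIMARY KEY,
  execution_id    text NOT NULL,       -- correlation key shared by all four log tables
  user_id         text NOT NULL,
  agent_type      text,
  event_type      text,
  particular_data jsonb,               -- decisions, compression scores, durations
  created_at      timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX ON logs_agent_execution (execution_id);
CREATE INDEX ON logs_agent_execution (user_id, created_at);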
Log insertion pattern
Every workflow starts with context capture:
// Pseudocode - actual implementation in n8n nodes
const executionLog = {
  execution_id: generateCorrelationId(),
  user_id: context.user.id,
  telegram_id: context.user.telegram_id,
  processes_executed: [],
  user_tier: context.user.tier,
  user_language: context.user.language,
  user_input_type: detectInputType(message),
  created_at: new Date().toISOString()
};

// Pass through entire workflow, append each process
executionLog.processes_executed.push({
  name: currentWorkflow.name,
  timestamp: new Date().toISOString(),
  timestamp_ms: Date.now(),
  nesting_level: context.nestingLevel
});
Querying patterns
Find slow agent executions:
SELECT
  execution_id,
  agent_type,
  particular_data->>'event_duration_ms' as duration_ms,
  created_at
FROM logs_agent_execution
WHERE (particular_data->>'event_duration_ms')::int > 20000
ORDER BY created_at DESC;
Cost analysis by agent type:
SELECT
  agent_type,
  COUNT(*) as executions,
  SUM((total_cost)::numeric) as total_cost,
  AVG((processing_time_ms)::numeric) as avg_latency
FROM logs_llm_execution
WHERE event_type = 'llm_valid_execution'
GROUP BY agent_type;
User journey reconstruction:
SELECT
  event_type,
  particular_data,
  created_at
FROM logs_user_journey
WHERE user_id = '57'
ORDER BY created_at;
Results
18 months in production:
| Metric | Value |
|---|---|
| Total interactions | 500K+ |
| Agent types | 10 |
| Avg response time | 3-5 seconds |
| Operational savings | $67K |
| Downtime | 0 |
| External observability cost | $0 |
The custom observability stack has caught:
- Context compression bugs: Significant moments being over-compressed (query sketch after this list)
- Agent decision drift: Coaching intensity creeping too high/low
- Cost anomalies: Specific user patterns triggering expensive execution paths
- Latency regressions: New prompt versions causing slower responses
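The compression one is the kind of bug a single query surfaces. A sketch, assuming the compression payload from earlier sits in particular_data on logs_agent_execution:
SELECT
  execution_id,
  created_at,
  (particular_data->'context_management'->'user_input'->>'compression_score')::numeric AS score
FROM logs_agent_execution
WHERE (particular_data->'context_management'->'user_input'->>'significant_moment_detected')::boolean = true
  AND particular_data->'context_management'->'user_input'->>'compression_priority' <> 'preserve_full'
ORDER BY created_at DESC;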
When to build custom
Build your own observability if:
- Your agents make structured decisions that you need to track
- Sessions span days/weeks, not minutes
- Business context matters for debugging (tiers, scopes, journeys)
- You're already running PostgreSQL and don't want another SaaS
- You need SQL queryability, not just dashboards
Use Langfuse/LangSmith if:
- Request-response pattern fits your use case
- You want quick setup over customization
- Your team expects familiar UI
- You're not doing context compression or multi-agent orchestration
Conclusion
The LLMOps market assumes you're building chatbots. If you're building something more complex—long-running agent sessions, custom context management, multi-agent orchestration—the tools will fight you.
PostgreSQL + structured logs + your domain knowledge = observability that actually works.
Total implementation time: ~2 weeks. Total ongoing cost: $0. Total vendor lock-in: 0.