TL;DR
- Sidecars eat 20-30% of cluster resources. eBPF does the same job at under 1%.
- VictoriaMetrics uses 10x less RAM than InfluxDB and 7x less than Prometheus with Thanos.
- Foundation models like MOMENT and TimesFM can spot anomalies zero-shot, with no training on your own data.
- But honestly? Statistical baselines still beat ML most of the time. Layer them.
- Cardinality creeps up quietly. By the time you notice, your storage costs are already out of control.
The Hidden Cost of Sidecars
Istio and Linkerd changed how we build microservice architectures. No argument there. What nobody warned us about was the resource bill of running a proxy next to every single pod.
What Is a Sidecar, Anyway?
In Kubernetes, a sidecar is an extra container that runs alongside your main application inside the same pod. Every request your app sends or receives passes through it first. Service meshes like Istio use Envoy sidecars to handle:
- Traffic management: Routing requests between services
- Security: Encrypting communication, verifying identities
- Observability: Collecting metrics about requests, latency, errors
Great idea—until you pay the bill.
The Math Gets Ugly Fast
Each Envoy sidecar idles at 100–200 MB of RAM. Under load, it burns 0.1–0.5 CPU cores and adds 1–5 ms of latency per hop.
- 500 services × 3 replicas = 1,500 pods → 1,500 sidecars
- 150–300 GB RAM doing proxy work, not business logic
- 150–750 CPU cores spent on plumbing
- A 5-hop request can pay ~25 ms just in sidecar tax
Teams running meshes at scale routinely report 20–30% of total cluster resources going to sidecars.
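The back-of-envelope math is worth writing down. A minimal sketch in Python, using the per-sidecar ranges above and the hypothetical 500-service cluster from this section (not measurements from any specific environment):

```python
# Back-of-envelope sidecar overhead for the hypothetical cluster above.
pods = 500 * 3                            # 500 services x 3 replicas -> 1,500 sidecars

ram_gb = (pods * 0.1, pods * 0.2)         # 100-200 MB idle RAM per sidecar
cpu_cores = (pods * 0.1, pods * 0.5)      # 0.1-0.5 cores per sidecar under load
latency_ms_5_hops = 5 * 5                 # up to ~5 ms added per hop

print(f"Sidecars:        {pods}")
print(f"RAM on proxies:  {ram_gb[0]:.0f}-{ram_gb[1]:.0f} GB")
print(f"CPU on proxies:  {cpu_cores[0]:.0f}-{cpu_cores[1]:.0f} cores")
print(f"5-hop latency:   ~{latency_ms_5_hops} ms of sidecar tax")
```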
The Problems You Don’t See on the Dashboard
- Fate sharing: if the sidecar dies or hangs, your app is down even if your code is fine.
- Upgrades: Envoy bumps require restarting every pod—thousands of rolling restarts for a patch.
- Visibility blind spots: sidecars only see network traffic. System calls, file I/O, in-process behavior? Invisible.
- Cold starts: new pods wait for sidecars to initialize, fetch config, and connect—adding 5–15 seconds to scale-up.
eBPF: Moving Observability Into the Kernel
Sidecars bounce packets between kernel and user space. eBPF pushes tiny programs into the kernel so data is captured where events happen—no extra context switches.
What’s Actually Happening

- Kprobes: attach to kernel functions.
- Uprobes: attach to user-space functions.
- Tracepoints: predefined kernel event hooks.
- XDP/TC/socket filters: tap packets before or as the kernel processes them.
Programs write to eBPF maps (kernel key-value stores) and stream data up efficiently—no app changes.
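To make the kprobe-plus-map pattern concrete, here is a minimal sketch using the bcc Python bindings (assumptions: bcc is installed and the script runs as root on a recent kernel). It counts execve() calls per process entirely in the kernel and reads the map from user space, with no application changes:

```python
# Minimal bcc example: count execve() calls per PID via a kprobe and an eBPF map.
import time
from bcc import BPF

prog = r"""
BPF_HASH(counts, u32, u64);               // eBPF map: pid -> call count

int trace_execve(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);                // updated in kernel space, no context switch
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="trace_execve")

print("Tracing execve() for 10s...")
time.sleep(10)

# Read the aggregated counts from user space.
for pid, count in b["counts"].items():
    print(f"pid {pid.value}: {count.value} execve calls")
```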
The Performance Gap Is Massive
| Metric | Sidecars (per pod) | eBPF (per node) |
|---|---|---|
| CPU overhead | 5–15% | <1% |
| Memory | 100–200 MB | 50–100 MB |
| Latency hit | 1–5 ms per hop | <0.1 ms |
| Deployment model | One per pod | One per node |
| Visibility | L7 only | L3/L4/L7 + syscalls + file I/O |
Seeing Things Sidecars Can’t
- System calls: read(), write(), connect(), accept().
- File I/O: which files are read/written and how long it takes.
- Network: XDP packets, TCP states, retransmits, drops, protocols.
- CPU scheduling: when processes run and get preempted.
- Memory patterns: allocations, leaks, pressure.
The Tradeoffs (Being Honest)
- Needs modern kernels (realistically 5.x).
- TLS payload visibility requires uprobes into crypto libs—finicky and version-sensitive.
- Linux-only; Windows nodes need another approach.
- Writing eBPF safely is hard—stick to battle-tested tools (Cilium, Pixie, Falco, Tetragon).
- Kernel-level bugs can crash nodes—test in non-prod first.
Where Do You Store All This Data?
Telemetry volume and cardinality explode fast. You need low-latency operational queries and affordable long-term analytics.
The Options That Actually Work at Scale
VictoriaMetrics: Efficiency That Keeps Surprising Me
- 10x less RAM than InfluxDB; 7x less than Prometheus+Thanos/Cortex.
- High-cardinality queries up to 20x faster; compresses to 0.4–0.8 bytes/point.
- Prometheus-compatible; simple single binary or split (vmselect/vminsert/vmstorage).
Use it when you want Prometheus compatibility without the overhead, or cardinality is already biting.
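Because it speaks the Prometheus querying API, existing tooling and scripts keep working unchanged. A minimal sketch against a single-node instance on the default port 8428 (the hostname and metric name are placeholders):

```python
# Query VictoriaMetrics through its Prometheus-compatible HTTP API.
import requests

VM_URL = "http://victoriametrics:8428"    # placeholder host; 8428 is the single-node default

resp = requests.get(
    f"{VM_URL}/api/v1/query",
    params={"query": "sum(rate(http_requests_total[5m])) by (service)"},
    timeout=5,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    _, value = series["value"]            # [timestamp, value]
    print(series["metric"].get("service", "unknown"), value)
```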
ClickHouse: When You Need Real Analytics Power
- OLAP-first: JOINs, subqueries, window functions, mixed data (metrics + events + logs).
- 10–20x compression is normal; full SQL interface your team already knows.
Use it when PromQL can’t express the questions you’re asking or you need to correlate metrics with business data.
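As a sketch of the kind of question PromQL can't express, here is plain SQL through the clickhouse-driver package. The otel_metrics table and its columns are assumptions for illustration, not a standard schema:

```python
# Correlate p99 latency with deployment events over 30 days (hypothetical schema).
from clickhouse_driver import Client

client = Client(host="clickhouse", database="observability")

rows = client.execute(
    """
    SELECT
        toStartOfHour(timestamp)          AS hour,
        service,
        quantile(0.99)(latency_ms)        AS p99_ms,
        countIf(event_type = 'deploy')    AS deploys
    FROM otel_metrics
    WHERE timestamp >= now() - INTERVAL 30 DAY
    GROUP BY hour, service
    ORDER BY hour
    """
)

for hour, service, p99_ms, deploys in rows:
    print(hour, service, round(p99_ms, 1), deploys)
```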
Using Both: The Tiered Approach

- Hot/warm: VictoriaMetrics for dashboards, alerts, recent debugging.
- Cold/analytic: ClickHouse for deep history and complex correlations.
Foundation Models for Time Series: Separating Hype from Reality
Large pre-trained time-series models promise zero-shot anomaly detection/forecasting on your metrics.
MOMENT (CMU, ICML 2024)
- Transformer (40M–385M params) trained on the “Time-Series Pile.”
- Zero-shot anomaly detection via masked reconstruction.
- Reality: KNN baselines often beat it on common infrastructure metrics.
TimesFM (Google, ICML 2024)
- Decoder-only transformer (200M params) trained on 100B+ time points.
- Zero-shot forecasting; open on HuggingFace (google/timesfm-2.5-200m-pytorch).
- Needs GPU for speed; semantics of metrics still matter; must guard against nonsense outputs.
What I’ve Actually Learned Using These
- Great for complex seasonality, brand-new metrics, heterogeneous data, and when labeling is impossible.
- Old-school stats still win for stable metrics, low latency, and resource-constrained environments.
- Use foundation models as an enhancement layer—keep statistical baselines as the workhorse.
A Three-Layer Detection Architecture That Actually Works
Layer 1: Static Thresholds (The Safety Net)
```yaml
rules:
  - metric: container_cpu_usage_percent
    operator: gt
    threshold: 95
    duration: 5m
    severity: critical
```
Fast, reliable, and noisy by design—your last line of defense.
Layer 2: Statistical Baselines (The Workhorse)
```python
# Z-score of the latest point against a 7-day rolling baseline.
# `metrics` is a timestamp-indexed pandas Series; alert() is your alerting hook.
baseline_mean = metrics.rolling(window='7d').mean()
baseline_stddev = metrics.rolling(window='7d').std()
current_value = metrics.iloc[-1]
z_score = (current_value - baseline_mean.iloc[-1]) / baseline_stddev.iloc[-1]

if abs(z_score) > 3:
    alert("Anomaly detected", confidence=0.997)
```
Adapts per service, handles weekday/weekend cycles, explains alerts in one sentence.
Layer 3: ML/Foundation Models (The Enhancement)
```python
# Escalate to ML only where the statistical baseline is uncertain.
# timesfm / moment here are thin wrappers around the models, not their literal APIs.
if statistical_baseline.uncertain:
    tsfm_prediction = timesfm.predict(historical_window)
    reconstruction_error = moment.detect(current_window)

    if reconstruction_error > threshold:
        alert("Complex anomaly detected",
              confidence=reconstruction_error,
              explanation=tsfm_prediction)
```
ML augments but never gates. If ML is slow or down, layers 1 and 2 keep firing.
Cardinality: The Problem That Bankrupts You Quietly
Every unique combination of metric name and label values is its own time series. One innocent label can explode into millions of series, wrecking performance and cost.
How One Innocent Metric Becomes Millions
```
# Looks harmless
http_requests_total{
  method="GET",            # 5 values
  path="/api/v1/users",    # 1,000 if IDs are embedded
  status="200",            # 50 values
  pod="web-abc123",        # 500 pods
  customer_id="12345"      # 100,000 customers — the killer
}
```
Potential: 12.5 trillion series. You won’t hit that, but millions are enough to melt dashboards and storage.
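The multiplication is worth making explicit; a tiny sketch of the worst case for the label set above:

```python
# Worst-case series count = product of per-label cardinalities.
label_cardinalities = {
    "method": 5,
    "path": 1_000,
    "status": 50,
    "pod": 500,
    "customer_id": 100_000,
}

worst_case = 1
for n in label_cardinalities.values():
    worst_case *= n

print(f"{worst_case:,} potential series")   # 12,500,000,000,000
```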
A Governance Framework That Actually Works
- Set limits by criticality (e.g., Critical: 100k; Standard: 25k; Batch: 10k; Dev: 5k).
- Enforce at multiple layers:
  - Agent/collector: strip high-cardinality labels, aggregate/sample (a sketch follows this list).
  - Storage: hard per-tenant limits, expire inactive series, rate-limit ingestion.
  - Policy (Kubernetes): CRD for allowed labels; admission webhook blocks bad configs.
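One way to apply the collector-level rule is to drop known high-cardinality labels before anything is exported. A minimal sketch of the idea; the denylist and helper are illustrative, not part of any particular agent:

```python
# Drop denylisted high-cardinality labels before metrics leave the collector.
HIGH_CARDINALITY_LABELS = {"customer_id", "request_id", "session_id"}   # example denylist

def strip_high_cardinality(labels: dict[str, str]) -> dict[str, str]:
    """Return a copy of the label set without denylisted labels."""
    return {k: v for k, v in labels.items() if k not in HIGH_CARDINALITY_LABELS}

labels = {"method": "GET", "status": "200", "pod": "web-abc123", "customer_id": "12345"}
print(strip_high_cardinality(labels))
# {'method': 'GET', 'status': '200', 'pod': 'web-abc123'}
```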
Finding Problems Before Your Bill Does
Prometheus/VictoriaMetrics:
```promql
topk(10, count by (__name__)({__name__=~".+"}))
```
Alert early:
```yaml
- alert: HighCardinalityMetric
  expr: count by (__name__)({__name__=~".+"}) > 10000
  for: 5m
  annotations:
    summary: "Metric {{ $labels.__name__ }} has over 10k series"
    description: "Investigate immediately. Check for high-cardinality labels."
```
OpenTelemetry: The Standard That Actually Won
Why OpenTelemetry Matters
- Vendor-neutral; collect once, send anywhere.
- Complete: metrics, logs, traces with shared concepts.
- Massive momentum and universal vendor support.
The Collector: Your Telemetry Hub
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  prometheus:
    config:
      scrape_configs:
        - job_name: "kubernetes-pods"

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  servicedna/enrich:    # custom enrichment processor, not a stock Collector component
    profile_endpoint: "http://dna-brain:8080"
    enrich_attributes:
      - servicedna.risk_score
      - servicedna.anomaly.detected
      - servicedna.baseline.deviation

exporters:
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
  clickhouse:
    endpoint: "tcp://clickhouse:9000"
    database: observability

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, servicedna/enrich]
      exporters: [prometheusremotewrite, clickhouse]
```
OTel Collector becomes the hub: receive, enrich, and route to hot and cold storage without duplicating instrumentation.
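On the application side, the OTLP receiver configured above is what the standard SDKs talk to. A minimal sketch with the OpenTelemetry Python SDK (packages opentelemetry-sdk and opentelemetry-exporter-otlp; the service and counter names are placeholders):

```python
# Send a counter to the Collector's OTLP gRPC receiver on port 4317.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

exporter = OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=10_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")                 # placeholder service name
request_counter = meter.create_counter(
    "http_requests_total", description="Total HTTP requests handled"
)

# Attributes become labels downstream; keep them low-cardinality.
request_counter.add(1, {"method": "GET", "status": "200"})
```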
What To Take Away From This
- eBPF slashes overhead and sees what sidecars can’t.
- Storage choice compounds—VictoriaMetrics for efficiency, ClickHouse for deep analytics, both together for the win.
- Foundation models are useful, not magical—layer them on top of statistical baselines.
- Cardinality governance is non-negotiable—set limits and enforce them before the bill bites.
- OpenTelemetry has won—build on it and stay portable.