My Grafana Dashboard Looked Beautiful and Had Exactly One Operation in It

59 of 59 traces were the same 250ms heartbeat. The cron polling its own queue.

I had just finished standing up Tacavar’s observability stack: Tempo for distributed tracing, Grafana for visualization, OpenTelemetry piping telemetry from every service. Three dashboards rendered cleanly. Latency heatmaps looked tight. Bar charts were populated. Traces were flowing. I was ready to call it done.

Then I opened the query inspector.

The Three Dashboards That Looked Alive

The Bailian Team Overview dashboard showed agent activity, token burn, and cost curves. The Paperclip Swarm Observability dashboard displayed trace volume, span latency percentiles, and error rates. The Tacavar Ops board consolidated node health, queue depth, and circuit breaker state. Every panel had data. Every chart had color. Nothing looked broken.

This is the trap. A dashboard that renders is not a dashboard that monitors. The visual layer in Grafana is forgiving to a fault. It will happily draw empty timeseries with the same styling it uses for real data. A null result and a zero result look nearly identical when the y-axis auto-scales and the legend omits the count.

I had three dashboards that passed the eyeball test. None of them passed the data test.

Probing the Traces: The Heartbeat Illusion

I ran the underlying TraceQL query directly against Tempo. The last hour returned 59 traces. Every single one was paperclip_handle_heartbeat: a 60-second cron that polled its own work queue, found nothing, and exited. 250ms each. Zero variance.

No agent runs. No LLM calls. No tool invocations. No task completions. The entire observability surface was one cron job talking to itself in a loop.

The OpenTelemetry instrumentation was technically correct. Spans were being generated, batched, and exported. Tempo was ingesting and indexing. Grafana was querying and rendering. The pipeline was alive. The system it was supposed to observe was not.

This is the difference between telemetry plumbing and operational signal. I had built the former and mistaken it for the latter.

Why Missing Data Looks Identical to Healthy Data

The Bailian Team Overview dashboard queried Prometheus metrics named agent_calls_total, agent_tokens_total, and agent_cost_usd_total. These metrics did not exist. The service had never emitted them. The PromQL queries returned empty result sets.

Grafana’s default behavior for an empty timeseries is to render a styled, bordered panel with a flat line at the bottom. It looks like low activity. It looks like a quiet afternoon. It does not look like a broken metric path or a missing instrumentation label.

This is a dashboard anti-pattern baked into the tool’s design philosophy: graceful degradation. For a customer-facing product, graceful degradation is a feature. For observability, it is a liability. A broken dashboard makes people investigate. A beautifully empty one makes people think their system is working. The worst failure mode is the one that hides in plain sight.

At Tacavar, we had been running blind for days while the dashboards told us we could see.

The Grafana Anti-Pattern: Graceful Degradation

The problem is not Grafana-specific, but Grafana makes it easy. The query editor is separate from the panel renderer. The renderer applies thresholds, color scales, and legends that give visual weight to nothing. A panel with zero rows and a panel with a thousand rows can share the same border, the same title formatting, the same position on the grid.

The deeper anti-pattern is trusting the rendered output without inspecting the query result. Most operators, myself included, spend time tuning the visual layer: colors, units, alert thresholds, annotation layers. We spend almost no time validating that the query underneath returns non-trivial, non-synthetic data.

The watchdog incident from two weeks earlier should have been a warning. v1 and v2 of the Bailian watchdog had eight tg_alert() calls with no throttling. A crash loop generated over 100 Telegram alerts in an hour. Every alert was technically correct. None of them were actionable. The monitoring was accurate and useless. v3 fixed this with a three-tier escalation: silent self-heal first, AI escalation second, human alert last. The right default is not to notify. The right default is to fix, then to escalate, then to wake someone up only if a human can actually help.

Observability built for the happy path is hostile when something actually breaks. The same principle applies to dashboards.

How We Rebuilt the Stack With Query Validation

We changed the Tacavar observability setup in three ways.

First, every dashboard query now has a validation step. Before a panel is considered “done,” we run the raw query against the data store and count the rows. If the result set is empty, the panel is marked red in our internal tracker until the instrumentation is fixed. No exceptions. A dashboard with empty queries is a broken dashboard, even if Grafana draws it cleanly.

Second, we added synthetic span injection. Every hour, a known test transaction runs through the system and generates a trace with a fixed trace ID prefix. We query Tempo for that prefix. If it is missing, the pipeline is broken, regardless of how many heartbeats are flowing. This separates pipeline health from system health.

Third, we changed the Grafana panel defaults. Empty result sets now render with a explicit “NO DATA” annotation instead of a flat line. This required custom panel overrides, but it removes the ambiguity. A panel with no data must look different from a panel with low data.

The OpenTelemetry collector configuration was also audited. We found that tail-based sampling was dropping all spans above 500ms because the sampling policy had been copied from a different service with a stricter latency budget. Real agent runs, which routinely took 2-10 seconds, were being discarded before they reached Tempo. The heartbeat survived because it was fast. The actual work vanished.

One Rule for Every Observability Setup

Run the query. Count the rows. Do not trust the render.

This is the single rule that would have caught every issue in our stack. The heartbeat illusion, the missing Prometheus metrics, the dropped long spans. All of them would have been obvious if the validation step came before the visual polish.

At Tacavar, we now treat dashboard completion as a two-phase process: validation, then presentation. The presentation layer is what stakeholders see. The validation layer is what operators depend on. Reversing the order, or skipping the first, is how you end up with 59 traces that all say the same thing.

Tacavar builds monitoring stacks that fail loudly, not beautifully. If your dashboards need a second look, we can audit them at tacavar.com.