Pattern 1: Error rate spike detection
The most common support query. Customer reports errors — you need to quantify the problem, identify when it started, and determine the severity.
What you're looking for: the customer's error rate over time, compared to their baseline. The query should answer: "Is this actually happening? When did it start? How bad is it?"
Typical approach: Query the API logs table filtered by customer ID, group by time interval (5 min or 1 hour), calculate error rate as percentage of total requests. Compare the current window against the previous 24–48 hours to establish baseline.
Key fields: customer_id, timestamp, http_status_code (or error_type), endpoint, request_id.
What this tells you: Whether the customer's complaint is valid, when the problem started (which narrows the root cause), and whether it's getting worse or stabilizing.
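A minimal sketch of this pattern in ClickHouse SQL. The table name (`api_logs`) and columns follow the key fields above, but they're illustrative; adjust to your schema:

```sql
-- Error rate per 5-minute bucket for one customer over the last 48 hours.
-- The current window can then be eyeballed (or compared programmatically)
-- against the earlier buckets that serve as the baseline.
SELECT
    toStartOfFiveMinutes(timestamp) AS bucket,
    count() AS total_requests,
    countIf(http_status_code >= 500) AS errors,
    round(100.0 * errors / total_requests, 2) AS error_rate_pct
FROM api_logs
WHERE customer_id = {customer_id:String}
  AND timestamp >= now() - INTERVAL 48 HOUR
GROUP BY bucket
ORDER BY bucket;
```

The `{customer_id:String}` placeholder uses ClickHouse's query-parameter syntax, which keeps customer IDs out of string-concatenated SQL.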
Pattern 2: Latency percentile tracking
Customer reports "things are slow." You need to measure actual latency, identify the affected endpoints, and determine if this is a regression.
What you're looking for: p50, p95, and p99 latency for the customer's requests, broken down by endpoint. Compare current values against their historical baseline.
Key insight: Customers often report "everything is slow" when only one endpoint is affected. The query should break down latency by endpoint to isolate the problem. If p95 jumped but p50 is stable, you're looking at a tail-latency issue (likely a specific code path or external dependency).
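Sketched as a ClickHouse query, assuming a hypothetical `latency_ms` column on the same illustrative `api_logs` table:

```sql
-- p50/p95/p99 latency per endpoint for one customer, last 6 hours.
-- Sorting by p95 surfaces the tail-latency outliers first.
SELECT
    endpoint,
    quantile(0.5)(latency_ms)  AS p50_ms,
    quantile(0.95)(latency_ms) AS p95_ms,
    quantile(0.99)(latency_ms) AS p99_ms,
    count() AS requests
FROM api_logs
WHERE customer_id = {customer_id:String}
  AND timestamp >= now() - INTERVAL 6 HOUR
GROUP BY endpoint
ORDER BY p95_ms DESC;
```

Run the same query over a baseline window (say, the same 6 hours a week earlier) to tell a regression apart from normal behavior.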
Pattern 3: Customer activity timeline
For debugging configuration or authentication issues, you need a timeline of the customer's recent activity — what they called, when, and what happened.
What you're looking for: the last N requests from this customer, ordered by time, with status codes and relevant metadata. This is the "replay the tape" query.
Key insight: Often reveals the root cause immediately. A customer reporting "auth stopped working" may have rotated their API key 2 hours ago and the old key is cached somewhere. The timeline shows the exact moment behavior changed.
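The "replay the tape" query is the simplest of the five. A sketch, using the same illustrative schema:

```sql
-- Last 100 requests from a customer, newest first, with the metadata
-- needed to spot the exact moment behavior changed.
SELECT
    timestamp,
    endpoint,
    http_status_code,
    request_id
FROM api_logs
WHERE customer_id = {customer_id:String}
ORDER BY timestamp DESC
LIMIT 100;
```

For authentication issues specifically, adding a key-identifier or auth-method column (if your schema has one) makes a key rotation jump out immediately.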
Pattern 4: Comparative analysis across customers
Is this a single-customer issue or a platform-wide problem? This query determines blast radius.
What you're looking for: the same error/latency metric across multiple customers during the same time window. If only one customer is affected, it's likely a customer-specific issue (config, billing, usage). If multiple customers show the same pattern, it's likely a platform issue (deploy regression, infrastructure).
This query is critical for triage. A customer-specific issue means you check Stripe and their configuration. A platform-wide issue means you check GitHub deploys and StatusPage.
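A sketch of the blast-radius query, with the incident window passed as parameters (names again illustrative):

```sql
-- Same error metric across all customers in the incident window.
-- One affected customer => customer-specific; many => platform-wide.
SELECT
    customer_id,
    count() AS total_requests,
    countIf(http_status_code >= 500) AS errors,
    round(100.0 * errors / total_requests, 2) AS error_rate_pct
FROM api_logs
WHERE timestamp BETWEEN {window_start:DateTime} AND {window_end:DateTime}
GROUP BY customer_id
HAVING errors > 0
ORDER BY error_rate_pct DESC
LIMIT 50;
```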
Pattern 5: Usage and billing reconciliation
Customer disputes their invoice. You need to count their actual usage for the billing period and compare it against what they were charged.
What you're looking for: total request count (or whatever your billing unit is) for the customer during the billing period, broken down by billable vs. non-billable requests, by endpoint, and by day to identify any spikes.
Key insight: The most common billing disputes come from unexpected usage spikes — a customer's CI pipeline running in a loop, a misconfigured retry policy, or a legitimate traffic increase they didn't anticipate. The daily breakdown reveals these patterns instantly.
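A sketch of the daily breakdown, assuming a hypothetical `is_billable` flag; many schemas derive billability from endpoint or status code instead:

```sql
-- Daily billable vs. non-billable request counts for a billing period.
-- A CI loop or runaway retry policy shows up as an obvious daily spike.
SELECT
    toDate(timestamp) AS day,
    countIf(is_billable) AS billable_requests,
    countIf(NOT is_billable) AS non_billable_requests,
    count() AS total_requests
FROM api_logs
WHERE customer_id = {customer_id:String}
  AND timestamp >= {period_start:DateTime}
  AND timestamp <  {period_end:DateTime}
GROUP BY day
ORDER BY day;
```

Summing `billable_requests` over the period gives the number to compare against the invoice.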
The manual bottleneck: not the query, but the context
If your support team runs these queries manually, the bottleneck isn't ClickHouse's execution time — it's everything around it:
- Opening the ClickHouse console and connecting to the right cluster
- Remembering (or looking up) the table names, column names, and customer ID format
- Adjusting the time window based on when the customer reported the issue
- Interpreting results in the context of the customer's plan tier and normal usage
- Then switching to Linear, Stripe, and GitHub to continue the investigation
When these 5 query patterns are automated — running in parallel, with results correlated against Linear bugs, Stripe billing, and GitHub deploys — the entire investigation completes in under 2 minutes instead of 20–45 minutes.
"Our support engineers were writing the same ClickHouse queries 10 times a day. Same tables, same columns, different customer ID. Automating that alone saved hours per week."