Each finding below comes from a real engagement. The numbers are real; the names and identifying details are not. Each follows the same structure we use in every audit deliverable: finding, detection, remediation, impact, effort.
A bizevents bucket sat at multi-TB volume with 400-day retention. Query telemetry for the previous quarter showed zero DQL queries reading data older than 35 days. Storage was being paid for data that nobody read.
fetch dt.system.buckets | sort estimated_uncompressed_bytes desc, cross-checked against the query audit log to compare configured retention with actual access patterns.
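As a runnable sketch, the bucket-sizing side of that check looks roughly like this (retention_days is an assumed column name; verify field names against your tenant's dt.system.buckets schema):
fetch dt.system.buckets
// largest buckets first; compare retention_days against the ages the query audit log shows actually being read
| fields name, retention_days, estimated_uncompressed_bytes
| sort estimated_uncompressed_bytes desc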
Drop retention from 400 days to 90 days on that single bucket. No code change, no schema change, no impact on dashboards.
Approximately 75% cost recovery on that line item. Six-figure annualized savings on a single bucket.
30 minutes config change. 1 internal review.
Three internal services were emitting tens of millions of DEBUG lines per day into the production logs bucket. Root cause: an IaC default left enabled after a 2023 incident and never cleaned up.
fetch logs | filter loglevel == "DEBUG" | summarize cnt = count(), by:{k8s.namespace.name, log.source} | sort cnt desc
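Before shipping filters, a quick share-of-total check sizes the win. A sketch assuming the same loglevel attribute (countIf is standard DQL; confirm billed volume against ingest metrics rather than line counts):
fetch logs
// share of log lines at DEBUG across the tenant; add a namespace filter to scope per workload
| summarize debug_lines = countIf(loglevel == "DEBUG"), total_lines = count()
| fieldsAdd debug_share_pct = 100.0 * debug_lines / total_lines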
Two OpenPipeline ingest filters dropping DEBUG for the offending workloads, plus an IaC PR to flip the Spring/Logback config back to production defaults.
Approximately 40% reduction on default_logs ingest. Dashboards untouched.
90 minutes scoping. 1 day rollout.
PRE and INT tracing buckets were configured with 35-day retention (matching production), holding 4.5 TB and 639 GB respectively. Empty non-prod buckets with 7-15 day retention already existed but were unused.
fetch dt.system.buckets | filter contains(name, "_pre_") OR contains(name, "_int_"), checked against the retention column and the existing low-retention non-prod buckets.
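Completed into a runnable query, that check is roughly (retention_days and estimated_uncompressed_bytes are assumed column names, as in the bucket query above):
fetch dt.system.buckets
| filter contains(name, "_pre_") or contains(name, "_int_")
| fields name, retention_days, estimated_uncompressed_bytes
| sort estimated_uncompressed_bytes desc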
Routing rules to send non-prod traces into the existing low-retention buckets. No data-loss concern, because non-prod debug data does not need more than 7 days of retention.
Approximately 80% recovery on those two bucket lines.
4 hours config. 24h burn-in observation window.
Of approximately 2,000 hosts under management, over a third sat in DISCOVERY mode (underutilized, no full-stack telemetry of value) but still consumed host-hour DPS budget every month.
Smartscape host inventory query grouped by monitoringMode, cross-referenced with active service detection over the previous 30 days.
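A sketch of the inventory half in DQL, assuming the host entity exposes monitoringMode under that name in Grail (verify the attribute on your tenant; the 30-day service cross-reference was a separate query):
fetch dt.entity.host
// pull the monitoring mode attribute onto each host record, then count hosts per mode
| fieldsAdd monitoringMode
| summarize hosts = count(), by:{monitoringMode}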
Decommission DISCOVERY hosts that did not report any monitored service in 30 days. Downgrade Full-Stack to Infra-Only on hosts where code-level traces were not consumed by any owner.
Approximately 35% reduction on the host-hour bill. No SLO impact.
2 days inventory. Change ticket per host group.
A single application's persistence layer was emitting (N×3)+3 Redis operations per request via Spring Data abstractions. Every save touched 3 sets without TTL, producing a memory leak plus an observability cost cascade.
Distributed trace inspection on the hot endpoint, plus INFO MEMORY on the Redis cluster, plus a custom DQL query on Redis operation spans to confirm the amplification pattern.
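A sketch of that span query, assuming the OpenTelemetry db.* semantic-convention attributes are present on the Redis spans (attribute names are an assumption; adjust to what your instrumentation emits):
fetch spans
| filter db.system == "redis"
// operations per Redis command; compare totals against request counts on the hot endpoint to confirm the (N×3)+3 pattern
| summarize redis_ops = count(), by:{db.operation}
| sort redis_ops desc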
Migrate to raw SET EX with explicit TTL. Drop the auto-repository annotation that was producing the N×3 amplification.
-67% Redis ops sustained at 1,700 QPS (45.9k down to 15.3k ops/s). DPS bizevents volume cascaded down proportionally.
1 sprint dev work (scoping + PR + canary + rollout).
The OTel collector exporter was running with max_batch_size=1, concurrency=1. Every span left the gateway as a separate HTTPS request carrying 500-800 bytes of header overhead, effectively wasting 40-50% of egress bandwidth from the gateway VPC.
Compare bytes-on-wire (egress metrics) against payload size (collector self-telemetry) to surface the per-request overhead ratio.
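The arithmetic behind that ratio, assuming an average span payload of roughly 1 KB (the payload size is an assumption; the 500-800 bytes of header overhead is from the finding): at one span per request, 800 / (1,000 + 800) ≈ 44% of bytes on the wire is overhead; at 200 spans per request, the same 800 bytes amortize to 800 / (200 × 1,000 + 800) ≈ 0.4%, consistent with the post-change measurement below.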
max_batch_size: 200, concurrency_limit: 5. No infrastructure change required.
Header overhead dropped to under 1%. Egress bill on the gateway VPC dropped proportionally.
1 hour config. 24h validation window.
Three high-traffic dashboards (status overview, business KPIs, infrastructure health) were querying raw spans and logs on every load. Every executive opening one of them triggered a billed Grail scan.
Query audit log analysis plus dashboard tile inspection plus a cost-by-query DQL to identify the highest-cost recurring queries.
Convert tiles to pre-aggregated metrics via value_metric_extraction_processor and counter_metric_extraction_processor in OpenPipeline. Dashboards reference metrics, not raw data.
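Illustrative before/after for one tile; the checkout.count metric key is hypothetical and would be produced by the extraction processors above:
// before: every dashboard load runs a full Grail scan over raw bizevents
fetch bizevents | filter event.type == "checkout" | summarize count()
// after: the tile reads the pre-extracted metric instead
timeseries checkouts = sum(checkout.count)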
0 GB of Grail query volume consumed per dashboard refresh, down from a full scan on every load. Dashboard UX identical.
1-2 days per dashboard (extraction config + validation).
Calls between modern (W3C-compliant) services and legacy clients showed a different trace.id on each hop. The legacy bus was not propagating traceparent. End-to-end visibility was broken across the most critical call chain.
fetch spans | filter app == "X" | summarize count(), by:{trace.id} | join [fetch spans | filter app == "Y"], on:{trace.id} returned near-zero matches, confirming the propagation gap.
Activate the OpenTracing propagator on the legacy bus, or add an outbound interceptor that copies B3/W3C headers from MDC.
End-to-end traces functional. MTTR on incidents involving the legacy boundary dropped from days to hours.
2 days dev. 1 sprint rollout.
The environment segment resolved to three distinct values for production: pro, prod, and production. Different teams had adopted different conventions over three years. SLO filters in dashboards picked one value and missed 60% of the traffic.
fetch dt.system.tags | filter key=="environment" | summarize count(), by:{value} surfaced the duplication in seconds.
Pick a canonical value (production), open an IaC PR to retag all services, and add a tenant-wide segment alias to absorb the legacy values without breaking historical dashboards.
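During the retagging window, an interim dashboard filter can absorb all three spellings; a sketch, assuming the value is queryable on spans as an environment attribute (adjust to the actual tag key):
fetch spans
// temporary filter until retagging completes and the alias handles legacy values
| filter in(environment, {"pro", "prod", "production"})
| summarize requests = count()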
SLO filters now reflect 100% of traffic. SLI/SLO reporting unblocked across the org.
1 week governance work + change tickets per service owner.
The same critical incidents triggered alerts in both Splunk (legacy alerting profiles) and Dynatrace (new policy). During planned switchovers, both sides fired false positives, leaving two SRE teams chasing ghost incidents in parallel.
Cross-reference of alert audit logs (Splunk vs. DT) over the same time windows; correlation queries on event source identified the duplicated alert pairs.
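A sketch of the Dynatrace half of that cross-reference, counting Davis problem events per day over a window (the event.kind value is an assumption; match it to your tenant's event schema):
fetch events, from:now()-7d
| filter event.kind == "DAVIS_PROBLEM"
| summarize problems = count(), by:{bin(timestamp, 1d)}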
Splunk decommission roadmap plus DT alert correlation rules plus on-call shift consolidation.
Alert volume halved. Six-figure cost saved (Splunk Cloud licensing avoided + DPS dedupe).
2 months program management + tactical alert migration.
Each engagement is a two-week, fixed-scope DPS Optimization Audit. Findings are packaged as a PDF report, a DQL library, and a remediation backlog. 3x savings guarantee, or you pay only the deposit.
See audit packages
Get the DQL Starter Pack