Each finding below comes from a real engagement. The numbers are real; the names and identifying details are not. Each follows the same structure we use in every audit deliverable: finding, detection, remediation, impact, effort.
A bizevents bucket sat at multi-TB volume with 400-day retention. Query telemetry for the previous quarter showed zero DQL queries reading data older than 35 days. Storage was being paid for data that nobody read.
fetch dt.system.buckets | sort estimated_uncompressed_bytes desc, cross-checked against the query audit log to compare configured retention with actual access patterns.
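As a runnable sketch, the bucket-sizing side of that check looks roughly like this (retention_days is an assumed column name; verify field names against your tenant's dt.system.buckets schema):
fetch dt.system.buckets
// largest buckets first; compare retention_days against the ages the query audit log shows actually being read
| fields name, retention_days, estimated_uncompressed_bytes
| sort estimated_uncompressed_bytes desc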
Drop retention from 400 days to 90 days on that single bucket. No code change, no schema change, no impact on dashboards.
Approximately 75% cost recovery on that line item. Six-figure annualized savings on a single bucket.
30 minutes config change. 1 internal review.
Three internal services were emitting tens of millions of DEBUG lines per day into the production logs bucket. Root cause: an IaC default left enabled after a 2023 incident and never cleaned up.
fetch logs | filter loglevel == "DEBUG" | summarize cnt = count(), by:{k8s.namespace.name, log.source} | sort cnt desc
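Before shipping filters, a quick share-of-total check sizes the win. A sketch assuming the same loglevel attribute (countIf is standard DQL; confirm billed volume against ingest metrics rather than line counts):
fetch logs
// share of log lines at DEBUG across the tenant; add a namespace filter to scope per workload
| summarize debug_lines = countIf(loglevel == "DEBUG"), total_lines = count()
| fieldsAdd debug_share_pct = 100.0 * debug_lines / total_lines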
Two OpenPipeline ingest filters dropping DEBUG for the offending workloads, plus an IaC PR to flip the Spring/Logback config back to production defaults.
Approximately 40% reduction on default_logs ingest. Dashboards untouched.
90 minutes scoping. 1 day rollout.
PRE and INT tracing buckets were configured with 35-day retention (matching production), holding 4.5 TB and 639 GB respectively. Empty non-prod buckets with 7-15 day retention already existed but were unused.
fetch dt.system.buckets | filter contains(name, "_pre_") OR contains(name, "_int_"), checked against the retention column and the existing low-retention non-prod buckets.
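Completed into a runnable query, that check is roughly (retention_days and estimated_uncompressed_bytes are assumed column names, as in the bucket query above):
fetch dt.system.buckets
| filter contains(name, "_pre_") or contains(name, "_int_")
| fields name, retention_days, estimated_uncompressed_bytes
| sort estimated_uncompressed_bytes desc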
Routing rules to send non-prod traces into the existing low-retention buckets. No data-loss concern, because non-prod debug data does not need more than 7 days of retention.
Approximately 80% recovery on those two bucket lines.
4 hours config. 24h burn-in observation window.
Of approximately 2,000 hosts under management, over a third sat in DISCOVERY mode (underutilized, no full-stack telemetry of value) but still consumed host-hour DPS budget every month.
Smartscape host inventory query grouped by monitoringMode, cross-referenced with active service detection over the previous 30 days.
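A sketch of the inventory half in DQL, assuming the host entity exposes monitoringMode under that name in Grail (verify the attribute on your tenant; the 30-day service cross-reference was a separate query):
fetch dt.entity.host
// pull the monitoring mode attribute onto each host record, then count hosts per mode
| fieldsAdd monitoringMode
| summarize hosts = count(), by:{monitoringMode}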
Decommission DISCOVERY hosts that did not report any monitored service in 30 days. Downgrade Full-Stack to Infra-Only on hosts where code-level traces were not consumed by any owner.
Approximately 35% reduction on the host-hour bill. No SLO impact.
2 days inventory. Change ticket per host group.
A single application's persistence layer was emitting (N×3)+3 Redis operations per request via Spring Data abstractions. Every save touched 3 sets without TTL, producing a memory leak plus an observability cost cascade.
Distributed trace inspection on the hot endpoint, plus INFO MEMORY on the Redis cluster, plus a custom DQL query on Redis operation spans to confirm the amplification pattern.
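A sketch of that span query, assuming the OpenTelemetry db.* semantic-convention attributes are present on the Redis spans (attribute names are an assumption; adjust to what your instrumentation emits):
fetch spans
| filter db.system == "redis"
// operations per Redis command; compare totals against request counts on the hot endpoint to confirm the (N×3)+3 pattern
| summarize redis_ops = count(), by:{db.operation}
| sort redis_ops desc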
Migrate to raw SET EX with explicit TTL. Drop the auto-repository annotation that was producing the N×3 amplification.
-67% Redis ops sustained at 1,700 QPS (45.9k down to 15.3k ops/s). DPS bizevents volume cascaded down proportionally.
1 sprint dev work (scoping + PR + canary + rollout).
The OTel collector exporter was running with max_batch_size=1, concurrency=1. Every span left the gateway as a separate HTTPS request carrying 500-800 bytes of header overhead, effectively wasting 40-50% of egress bandwidth from the gateway VPC.
Compare bytes-on-wire (egress metrics) against payload size (collector self-telemetry) to surface the per-request overhead ratio.
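The arithmetic behind that ratio, assuming an average span payload of roughly 1 KB (the payload size is an assumption; the 500-800 bytes of header overhead is from the finding): at one span per request, 800 / (1,000 + 800) ≈ 44% of bytes on the wire is overhead; at 200 spans per request, the same 800 bytes amortize to 800 / (200 × 1,000 + 800) ≈ 0.4%, consistent with the post-change measurement below.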
max_batch_size: 200, concurrency_limit: 5. No infrastructure change required.
Header overhead dropped to under 1%. Egress bill on the gateway VPC dropped proportionally.
1 hour config. 24h validation window.
Three high-traffic dashboards (status overview, business KPIs, infrastructure health) were querying raw spans and logs on every load. Every executive opening one of them triggered a billed Grail scan.
Query audit log analysis plus dashboard tile inspection plus a cost-by-query DQL to identify the highest-cost recurring queries.
Convert tiles to pre-aggregated metrics via value_metric_extraction_processor and counter_metric_extraction_processor in OpenPipeline. Dashboards reference metrics, not raw data.
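Illustrative before/after for one tile; the checkout.count metric key is hypothetical and would be produced by the extraction processors above:
// before: every dashboard load runs a full Grail scan over raw bizevents
fetch bizevents | filter event.type == "checkout" | summarize count()
// after: the tile reads the pre-extracted metric instead
timeseries checkouts = sum(checkout.count)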
0 GB of Grail query volume consumed per dashboard refresh, down from a full scan on every load. Dashboard UX identical.
1-2 days per dashboard (extraction config + validation).
Calls between modern (W3C-compliant) services and legacy clients showed a different trace.id on each hop. The legacy bus was not propagating traceparent. End-to-end visibility was broken across the most critical call chain.
fetch spans | filter app == "X" | summarize count(), by:{trace.id} | join [fetch spans | filter app == "Y"], on:{trace.id} returned near-zero matches, confirming the propagation gap.
Activate the OpenTracing propagator on the legacy bus, or add an outbound interceptor that copies B3/W3C headers from MDC.
End-to-end traces functional. MTTR on incidents involving the legacy boundary dropped from days to hours.
2 days dev. 1 sprint rollout.
The environment segment resolved to three distinct values for production: pro, prod, and production. Different teams had adopted different conventions over three years. SLO filters in dashboards picked one value and missed 60% of the traffic.
fetch dt.system.tags | filter key=="environment" | summarize count(), by:{value} surfaced the duplication in seconds.
Pick a canonical value (production), open an IaC PR to retag all services, and add a tenant-wide segment alias to absorb the legacy values without breaking historical dashboards.
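During the retagging window, an interim dashboard filter can absorb all three spellings; a sketch, assuming the value is queryable on spans as an environment attribute (adjust to the actual tag key):
fetch spans
// temporary filter until retagging completes and the alias handles legacy values
| filter in(environment, {"pro", "prod", "production"})
| summarize requests = count()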
SLO filters now reflect 100% of traffic. SLI/SLO reporting unblocked across the org.
1 week governance work + change tickets per service owner.
The same critical incidents triggered alerts in both Splunk (legacy alerting profiles) and Dynatrace (new policy). During planned switchovers, both sides fired false positives, leaving two SRE teams chasing ghost incidents in parallel.
Cross-reference of alert audit logs (Splunk vs. DT) over the same time windows; correlation queries on event source identified the duplicated alert pairs.
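A sketch of the Dynatrace half of that cross-reference, counting Davis problem events per day over a window (the event.kind value is an assumption; match it to your tenant's event schema):
fetch events, from:now()-7d
| filter event.kind == "DAVIS_PROBLEM"
| summarize problems = count(), by:{bin(timestamp, 1d)}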
Splunk decommission roadmap plus DT alert correlation rules plus on-call shift consolidation.
Alert volume halved. Six-figure cost saved (Splunk Cloud licensing avoided + DPS dedupe).
2 months program management + tactical alert migration.
Each engagement is a two-week, fixed-scope DPS Optimization Audit. Findings are packaged as a PDF report, a DQL library, and a remediation backlog. 3x savings guarantee, or you pay only the deposit.
See audit packages
Get the DQL Starter Pack