Continuous Optimization Playbook (Senior / Staff)
A pragmatic lifecycle for sustaining performance excellence.
1. Hypothesis Intake
| Source | Example | Action |
|---|---|---|
| SLO Alert | p95 latency ↑ 20% | Trace sample + hotspot tag filter |
| Bench Diff | time/op +6% | Reproduce locally with -count=15 |
| Profile Drift | New hotspot >4% flat | Code archeology + changelog review |
| Cost Spike | CPU hours +12% | Check allocator churn & GC cycles |
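When an SLO alert points at a specific endpoint, profile labels make the "hotspot tag filter" step concrete: tag request-scoped work so CPU samples can be filtered per endpoint in pprof or Pyroscope. A minimal sketch using the standard runtime/pprof label API (the /search endpoint, label key, and runQuery helper are illustrative, not part of this playbook's codebase):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

// handleSearch is a placeholder handler; the label scheme is an assumption,
// but pprof.Do and pprof.Labels are the standard library API.
func handleSearch(w http.ResponseWriter, r *http.Request) {
	// Every CPU sample taken while fn runs carries endpoint=search,
	// so pprof/Pyroscope views can be narrowed to this handler.
	pprof.Do(r.Context(), pprof.Labels("endpoint", "search"), func(ctx context.Context) {
		runQuery(ctx) // hot path under investigation
	})
}

func runQuery(ctx context.Context) { /* work being profiled */ }

func main() {
	http.HandleFunc("/search", handleSearch)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Once deployed, go tool pprof -tagfocus endpoint=search narrows the flat view to the tagged samples when working the SLO alert.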
2. Triage Matrix
Impact vs Effort grid:
- High Impact / Low Effort → immediate PR
- High Impact / High Effort → design review
- Low Impact / Low Effort → batch fixes
- Low Impact / High Effort → defer/backlog
3. Signal-to-Tool Map
| Signal | Primary Tool | Escalate To |
|---|---|---|
| Latency | Tempo traces | Trace + profile correlation |
| CPU | pprof top/diff | Block/Mutex profile |
| Memory | alloc_space diff | Escape analysis (-gcflags -m) |
| Alloc churn | alloc_objects | Object pooling experiment |
| Goroutine leak | goroutine profile | Blocking ops instrumentation |
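Block and mutex profiles are disabled by default, so the escalation path for CPU and goroutine-leak signals usually starts by switching them on. A hedged sketch of the standard enablement; the sampling rates shown are illustrative and should be tuned against acceptable overhead:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Report roughly 1 of every 5 mutex contention events.
	runtime.SetMutexProfileFraction(5)
	// Sample about one blocking event per 1ms spent blocked (rate is in ns).
	runtime.SetBlockProfileRate(1_000_000)

	// Exposes /debug/pprof/mutex and /debug/pprof/block alongside cpu/heap.
	log.Fatal(http.ListenAndServe(":6060", nil))
}
```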
4. Root Cause Patterns
| Pattern | Signature | Strategy |
|---|---|---|
| Algorithmic O(n^2) | Hotspot grows super-linearly with input | Redesign data structure |
| Excess allocations | High alloc_space without flat CPU | Reuse buffers / sync.Pool |
| Lock contention | Mutex profile spikes | Shard lock / reduce critical section |
| I/O bound | Low CPU, high wall latency | Parallelize / pipeline |
| GC pressure | Frequent short GC cycles | Reduce transient allocations |
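For the excess-allocation and GC-pressure patterns, the usual first experiment is reusing transient buffers via sync.Pool instead of allocating per call. A minimal sketch (the encode hot path and the 4 KiB initial capacity are assumptions for illustration):

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers so the hot path stops allocating a
// fresh buffer per call, which otherwise shows up as alloc_space / GC churn.
var bufPool = sync.Pool{
	New: func() any { return bytes.NewBuffer(make([]byte, 0, 4096)) },
}

// encode is an illustrative hot path: borrow a buffer, use it, return it
// so the next caller reuses the backing array.
func encode(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufPool.Put(buf)
	}()

	buf.WriteString("prefix:")
	buf.Write(payload)

	// Copy out: the pooled buffer must not escape to the caller.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}
```

The final copy still allocates; in a real pipeline you would write the pooled buffer straight to the destination writer so only the buffer itself is reused. Verify the win with an alloc_objects diff before and after the experiment.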
5. Optimization Workflow
- Profile baseline (commit A)
- Implement change (branch)
- Run ci_profiles locally + AI analyzer
- Confirm improvements > regressions
- Add micro-bench if gap previously unmeasured (see the sketch after this list)
- Merge behind feature flag if risky
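Where a change touches a path with no existing benchmark, a small testing.B benchmark closes the measurement gap and feeds benchstat cleanly when run with -count=15. A sketch, to live in a _test.go file; encodePayload is a stand-in for whatever path was previously unmeasured:

```go
package hotpath

import (
	"bytes"
	"testing"
)

// encodePayload is a placeholder for the previously unmeasured hot path.
func encodePayload(payload []byte) []byte {
	var buf bytes.Buffer
	buf.WriteString("prefix:")
	buf.Write(payload)
	return buf.Bytes()
}

// BenchmarkEncodePayload guards the gap so benchstat can compare the
// baseline commit against the optimization branch.
func BenchmarkEncodePayload(b *testing.B) {
	payload := make([]byte, 512)
	b.ReportAllocs() // surface allocs/op alongside time/op
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = encodePayload(payload)
	}
}
```

Run go test -bench=EncodePayload -count=15 -benchmem on both commits and compare the outputs with benchstat before merging.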
6. Definition of Done
- Baseline vs optimized profiles archived
- Benchstat delta ≤ +2% for unrelated benchmarks
- No new hotspots >5% without justification
- Gates pass (or waiver documented)
- Follow-up monitoring dashboard updated
7. Weekly Rituals
| Ritual | Outcome |
|---|---|
| Hotspot Review | Rotate top 5 persistent CPU offenders |
| Allocation Audit | Focus on top alloc_space regressions |
| Benchmark Trend Scan | Detect slow drifts early |
| Gate Failure Postmortems | Improve thresholds / detection logic |
8. Backlog Taxonomy
| Category | Examples | KPI |
|---|---|---|
| Preventive | Pre-warm caches, pooling | Reduced p95 latency |
| Corrective | Remove quadratic join | Flat% decrease |
| Hygiene | Update benchmarks | Coverage % of critical paths |
| Strategic | Async pipeline redesign | Throughput gain |
9. Metrics & KPIs
| KPI | Target |
|---|---|
| Net Performance Win Rate | >70% of PRs with positive rating |
| Mean Time to Detect Regression | <1 day |
| Hotspot Churn Rate | <15% weekly |
| Benchmark Coverage (critical funcs) | >80% |
10. Escalation Criteria
Escalate to arch review if:
- Any single regression >15% persists 3 PRs
- Overall rating <4 twice in a sprint
- Hotspot churn >25% (instability signal)
11. Tooling Roadmap (Ranked by Leverage)
| Rank | Idea | Leverage |
|---|---|---|
| 1 | Span-linked profile URLs | Pyroscope + Tempo |
| 2 | Selective dynamic profiling | Agent API |
| 3 | Historical baseline server | Object store + diff API |
| 4 | SARIF export | GitHub code scanning UI |
| 5 | Performance budget dashboard | Grafana JSON datasource |
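Span-linked profile URLs generally hinge on stamping the active trace and span IDs onto CPU samples as pprof labels, so a Tempo trace view can deep-link into the matching Pyroscope flame graph. A hedged sketch assuming OpenTelemetry tracing is already wired in; the label keys and helper name are illustrative, so check what your Pyroscope/Tempo integration actually expects:

```go
package profiling

import (
	"context"
	"runtime/pprof"

	"go.opentelemetry.io/otel/trace"
)

// profileWithSpanIDs tags all CPU samples taken inside fn with the current
// trace and span IDs, so profiles can be correlated back to Tempo spans.
func profileWithSpanIDs(ctx context.Context, fn func(context.Context)) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	labels := pprof.Labels(
		"trace_id", sc.TraceID().String(), // hypothetical label keys; match
		"span_id", sc.SpanID().String(),   // your agent's convention
	)
	pprof.Do(ctx, labels, fn)
}
```

A handler would wrap its hot section as profileWithSpanIDs(ctx, func(ctx context.Context) { handle(ctx) }), after which the dashboard can construct a profile URL filtered to the span's label values.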
Treat performance as a product: observable state, feedback loops, user-centric outcomes, and continuous iteration. This playbook institutionalizes that mindset.