# Continuous Profiling in CI (Pull Requests)
This guide shows how each Pull Request automatically generates profiling data, flamegraphs, and benchmark comparisons so reviewers can confidently assess performance impact.
## Objectives
- Capture CPU & memory profiles for a representative workload in PRs.
- Generate SVG flamegraphs for quick hotspot visualization.
- Provide top CPU and allocation hotspots directly in the PR summary.
- Compare micro-benchmarks (baseline `main` vs PR) using `benchstat`.
- Keep execution fast (< ~2 min) to avoid slowing developer feedback loops.
## Workflow Overview
Two GitHub Actions jobs:
- `profiling`: builds profiling binaries, runs them with a reduced dataset, extracts CPU/heap profiles, generates flamegraphs, and uploads artifacts.
- `benchdiff`: checks out `main` for baseline benchmarks, then the current PR branch, runs key benchmarks multiple times, and generates statistical diffs with `benchstat`.
Artifacts exposed to reviewers:
- Flamegraphs: `ci-profiles-<sha>/flamegraphs/*.svg`
- CPU top stacks: `generator_cpu_top.txt`, `generator_optimized_cpu_top.txt`
- Allocation top stacks: `generator_mem_top.txt`, `generator_optimized_mem_top.txt`
- Benchmark diff summaries in `benchdiff-<sha>/`
A PR comment is posted on the first run (optional), and rich summaries appear in the Checks tab.
## Key Makefile Target
The CI invokes a purpose-built target:
```make
ci_profiles: clean
	mkdir -p .docs/artifacts/ci/flamegraphs
	go build -o $(BUILD_DIR)/bin/generator-profiling github.com/dmgo1014/interviewing-golang/cmd/generator-profiling
	go build -o $(BUILD_DIR)/bin/generator-optimized-profiling github.com/dmgo1014/interviewing-golang/cmd/generator-optimized-profiling
	$(BUILD_DIR)/bin/generator-profiling 50000 /dev/null || true
	$(BUILD_DIR)/bin/generator-optimized-profiling 50000 /dev/null || true
	go tool pprof -svg generator_cpu.prof > .docs/artifacts/ci/flamegraphs/generator_cpu.svg 2>/dev/null || true
	go tool pprof -svg generator_optimized_cpu.prof > .docs/artifacts/ci/flamegraphs/generator_optimized_cpu.svg 2>/dev/null || true
	go tool pprof -top -nodecount=15 generator_cpu.prof > .docs/artifacts/ci/generator_cpu_top.txt 2>/dev/null || true
	go tool pprof -top -nodecount=15 generator_optimized_cpu.prof > .docs/artifacts/ci/generator_optimized_cpu_top.txt 2>/dev/null || true
	go tool pprof -top -alloc_space -nodecount=15 generator_mem.prof > .docs/artifacts/ci/generator_mem_top.txt 2>/dev/null || true
	go tool pprof -top -alloc_space -nodecount=15 generator_optimized_mem.prof > .docs/artifacts/ci/generator_optimized_mem_top.txt 2>/dev/null || true
```
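For context, each profiling binary is expected to write `generator_cpu.prof` and `generator_mem.prof` into the working directory before exiting, since the target's `pprof` steps read those files. A minimal sketch of such a wrapper, assuming the real `cmd/generator-profiling` takes an event count and output path (the `runWorkload` helper is hypothetical):

```go
// Sketch of a profiling wrapper binary. Argument handling mirrors
// the Makefile invocation: <eventCount> <outputPath>.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"strconv"
)

func main() {
	count, _ := strconv.Atoi(os.Args[1]) // e.g. 50000
	out := os.Args[2]                    // e.g. /dev/null

	cpu, err := os.Create("generator_cpu.prof")
	if err != nil {
		panic(err)
	}
	defer cpu.Close()

	pprof.StartCPUProfile(cpu) // sample CPU for the whole workload
	runWorkload(count, out)
	pprof.StopCPUProfile()

	// GC first so the heap snapshot is stable; -alloc_space analysis
	// still sees cumulative allocations, not just live objects.
	runtime.GC()
	mem, err := os.Create("generator_mem.prof")
	if err != nil {
		panic(err)
	}
	defer mem.Close()
	pprof.WriteHeapProfile(mem)
}

// runWorkload is a hypothetical stand-in for the real event generator.
func runWorkload(count int, out string) {
	fmt.Fprintf(os.Stderr, "generated %d events to %s\n", count, out)
}
```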
## GitHub Workflow (Excerpt)
```yaml
jobs:
  profiling:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go mod download
      - run: make ci_profiles
      - uses: actions/upload-artifact@v4
        with:
          name: ci-profiles-${{ github.sha }}
          path: .docs/artifacts/ci
      - name: Summarize hotspots
        run: |
          echo '## CPU Hotspots' >> $GITHUB_STEP_SUMMARY
          sed -n '1,25p' .docs/artifacts/ci/generator_cpu_top.txt >> $GITHUB_STEP_SUMMARY || true
```
The benchmark comparison job fetches `main` as a baseline and runs `benchstat`:

```bash
go install golang.org/x/perf/cmd/benchstat@latest
benchstat /tmp/base.txt /tmp/pr.txt > bench_original.diff
benchstat /tmp/base_opt.txt /tmp/pr_opt.txt > bench_optimized.diff
```
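Those `.txt` inputs come from ordinary Go benchmarks run on each branch, typically with multiple iterations so `benchstat` has enough samples (e.g. `go test -bench=Generator -benchmem -count=5`). A minimal sketch of one such benchmark; the `generator` import path and `Generate` API are assumptions, not the repository's actual names:

```go
package generator_test

import (
	"testing"

	"github.com/dmgo1014/interviewing-golang/internal/generator" // assumed path
)

func BenchmarkGeneratorOriginal(b *testing.B) {
	b.ReportAllocs() // emit allocs/op so benchstat can diff allocations too
	for i := 0; i < b.N; i++ {
		if _, err := generator.Generate(1000); err != nil {
			b.Fatal(err)
		}
	}
}
```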
## Reading the Flamegraphs

Look for:

- Wide stacks high in the graph: hot cumulative paths.
- Repeated string allocation or JSON marshaling frames: tuning targets.
- Excessive time in `encoding/json` vs business logic: a candidate for alternative encoders.
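When `encoding/json` dominates, a first step short of swapping encoders is to reuse one encoder and buffer rather than calling `json.Marshal` per event. A hedged sketch (the `Event` type is illustrative, not from the repository):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Event is a hypothetical payload type for illustration.
type Event struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

func main() {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // one encoder reused across events

	for i := 0; i < 3; i++ {
		buf.Reset() // reuse the buffer instead of allocating per event
		if err := enc.Encode(Event{ID: i, Name: "demo"}); err != nil {
			panic(err)
		}
		fmt.Print(buf.String())
	}
}
```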
## Reading the Hotspot Tables (`pprof -top`)

Columns:

- `flat`: Time spent directly in the function.
- `flat%`: Percent of total samples attributed to `flat`.
- `sum%`: Running cumulative percentage of `flat%` up to this row.
- `cum`: Cumulative time including callees.
Prioritize functions with high `flat%`, and those with high `cum` if they're on critical paths.
## Reading the Allocation Tables (`-alloc_space`)

- Focus on large `flat` allocators to reduce GC pressure.
- Watch for transient allocations in tight loops, as illustrated below.
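As an illustration of the tight-loop pattern, hoisting buffers out of the loop removes per-iteration allocations. A minimal sketch (the function names are illustrative, not from the repository):

```go
package main

import (
	"fmt"
	"strconv"
)

// buildIDsNaive allocates a new string per iteration via +=,
// which shows up as large flat alloc_space inside the hot loop.
func buildIDsNaive(n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += strconv.Itoa(i) + "," // reallocates and copies each time
	}
	return s
}

// buildIDsPooled reuses one preallocated byte slice, so the loop
// itself performs (amortized) no per-iteration heap allocations.
func buildIDsPooled(n int) string {
	buf := make([]byte, 0, n*8) // single up-front allocation
	for i := 0; i < n; i++ {
		buf = strconv.AppendInt(buf, int64(i), 10)
		buf = append(buf, ',')
	}
	return string(buf)
}

func main() {
	fmt.Println(buildIDsPooled(5) == buildIDsNaive(5)) // true
}
```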
## Benchmark Diff Interpretation
A `benchstat` diff example:

```text
name                  old time/op  new time/op  delta
GeneratorOriginal-8   2.45ms ± 3%  2.10ms ± 2%  -14.3%  (p=0.002 n=5+5)
GeneratorOptimized-8  1.10ms ± 2%  1.05ms ± 2%   -4.5%  (p=0.041 n=5+5)
```
Key signals:

- `delta`: Relative change (negative = faster / fewer allocs).
- `p=`: Statistical significance (<= 0.05 is generally meaningful).
- Always consider the variance (`± %`); high variance may call for more iterations (e.g. `go test -bench=. -count=10`).
## Extending Further
| Goal | Approach |
|---|---|
| Track regressions over time | Persist artifacts to S3 + compare last N builds |
| Alert on > X% slowdown | Add a parsing step + GitHub Status API failure |
| Deeper diff (flamegraph diff) | Integrate speedscope JSON export + PR link |
| Continuous profiling (runtime) | Run Pyroscope in CI ephemeral container + capture 30s sample |
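For the "Alert on > X% slowdown" row, a small parser over the `benchstat` output can fail the build. This is a sketch assuming the classic `benchstat` text format shown above; the threshold and input file name are illustrative:

```go
// regression_gate.go: fail CI when any benchmark slows down by more
// than a threshold, based on benchstat's text output.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const maxSlowdownPct = 5.0 // illustrative threshold ("X%")

func main() {
	f, err := os.Open("bench_original.diff") // produced by the benchdiff job
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	failed := false
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		for _, fld := range fields {
			// Delta columns look like "-14.3%" or "+4.5%"; the "± 3%"
			// variance fields lack a sign prefix and are skipped.
			if !strings.HasSuffix(fld, "%") ||
				!(strings.HasPrefix(fld, "+") || strings.HasPrefix(fld, "-")) {
				continue
			}
			delta, err := strconv.ParseFloat(strings.TrimSuffix(fld, "%"), 64)
			if err != nil {
				continue
			}
			if delta > maxSlowdownPct {
				fmt.Printf("regression: %s (%+.1f%% > %.1f%%)\n", fields[0], delta, maxSlowdownPct)
				failed = true
			}
		}
	}
	if failed {
		os.Exit(1) // non-zero exit fails the CI step
	}
}
```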
## Troubleshooting
| Issue | Fix |
|---|---|
| Missing SVG flamegraphs | Ensure `go tool pprof` is available (Go installed) and Graphviz (`dot`) is on `PATH` for SVG rendering |
| Empty hotspot files | Workload too small; raise the event count or drop the `\|\| true` suppression to surface errors |
| Bench diffs empty | Baseline branch fetch failed; ensure `main` exists upstream |
## Review Checklist for PRs
- [ ] Any increase in `flat%` for JSON or string generation functions?
- [ ] Did the optimized build show the expected lower allocations?
- [ ] Benchmark deltas negative (improvement) or neutral?
- [ ] No large variance suggesting instability?
- [ ] Flamegraph width dominated by intended hot loops only?
This pipeline turns performance review into a first-class, automated signal; use it to catch regressions early.