
A very happy Golang memory profiling story at CloudQuery

Mariano Gappa


Introduction #

At CloudQuery, we provide simple and efficient ELT for a broad range of customers. Recently, I was tasked with investigating an OutOfMemory (OOM) issue in one of our popular sync configurations: syncing from an AWS source to an S3 destination.
Our syncs involve significant parallelism, and since we hadn’t encountered major memory issues before, we feared it was time to refine our concurrency model. But first, we needed a memory profile to understand the problem better.

How to run a memory profile? #

We’re a Go shop, so our sync tool and most of our software is written in Go.
While I hadn't done much profiling in Go before, I knew from reading Russ Cox's blog post that Go uses a variant of Google's pprof.
The full instructions are here, but the simplest way to run a memory profile using pprof is to add this import to main.go:
import _ "net/http/pprof"
Then make sure an HTTP server is running to serve the pprof endpoints (skip this step if your application already starts one):
go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()
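Putting those two pieces together, a minimal main.go looks roughly like this (a sketch; only the pprof import and the localhost:6060 listener come from the steps above, the rest is illustrative):
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the pprof endpoints in the background while the program does its work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... your application logic here ...
}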
Then, you can take a snapshot of your memory usage by calling this endpoint:
go tool pprof http://localhost:6060/debug/pprof/heap
This returns a pprof file you can chart. For a quick chart:
go tool pprof -png http://localhost:6060/debug/pprof/heap > out.png
However, this chart is only useful if you know the precise moment to take it, as it's a snapshot in time.
To chart memory over time, I turned to GC debug logs.

GC Debug Logs #

Garbage collection information is crucial for us because streaming data from sources to destinations is bound to cause memory churn. You can enable GC tracing via the GODEBUG environment variable when running your Go binary:
$ GODEBUG=gctrace=1 cloudquery sync ...
This outputs a line to stderr every time your binary runs a GC:
gc 1966 @134.752s 2%: 0.76+5.7+0.036 ms clock, 6.1+0.52/10/2.4+0.29 ms cpu, 68->70->35 MB, 71 MB goal, 0 MB stacks, 0 MB globals, 8 P
This provides:
  • When the GC ran, in seconds since the binary started.
  • CPU % spent in GC.
  • Heap size before GC, heap size after GC, and live heap after GC.
The full docs for GODEBUG are here. Dave Cheney has an article on charting GC output using his tool gcvis. I created a similar tool for more customization:
  • Pipe the stderr output to a file.
  • Write a regex-based parse script to extract metrics to a CSV file (a sketch of this step follows below).
  • Plot it with matplotlib.
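As an illustration of the parsing step, here's a rough sketch (not the exact script I used) that reads gctrace lines from stdin, extracts the timestamp and the three heap figures with a regex, and prints CSV:
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Matches lines like:
	// gc 1966 @134.752s 2%: ... 68->70->35 MB, 71 MB goal, ...
	re := regexp.MustCompile(`@([\d.]+)s .* (\d+)->(\d+)->(\d+) MB`)

	fmt.Println("seconds,heap_before_mb,heap_after_mb,live_heap_mb")
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		m := re.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a gctrace line
		}
		fmt.Printf("%s,%s,%s,%s\n", m[1], m[2], m[3], m[4])
	}
}
With something like GODEBUG=gctrace=1 cloudquery sync ... 2> gc.log, the resulting CSV can then be plotted with matplotlib (the file names here are just placeholders).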

Profiling CloudQuery syncs with gctrace #

This was the initial chart I saw. The weird spikes initially made me think we had a rogue allocation bug, but closer inspection revealed three distinct processes reporting GC information. After updating the profiling setup to chart each process separately, I found that only the S3 destination process was consuming a lot of memory.
Key observations:
  • An initial spike of 2.5GB, leading to an OOM at peak, with a long tail.
  • The total heap is twice as high as the live heap, meaning 50% of the total heap is garbage.

GOMEMLIMIT to the rescue? #

Since 50% of memory was garbage, a straightforward mitigation was to trigger GC more often using GOMEMLIMIT. The GOMEMLIMIT variable sets a soft memory limit for the runtime; the GC runs more aggressively as the heap approaches that limit (read more here).
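The limit can be supplied from the environment (e.g. GOMEMLIMIT=2GiB cloudquery sync ...) or set programmatically at startup; here's a minimal sketch of the latter, with a purely illustrative value:
package main

import "runtime/debug"

func main() {
	// Equivalent to running the binary with GOMEMLIMIT=2GiB. The GC runs more
	// aggressively as the live heap approaches this soft limit.
	debug.SetMemoryLimit(2 << 30) // 2 GiB, illustrative value only

	// ... rest of the program ...
}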
While this was a valiant attempt, it had drawbacks:
  • Setting an absolute GOMEMLIMIT value for each customer setup is challenging.
  • Legitimate high memory use will still cause OOM.
  • Frequent GC increases CPU usage, slowing down sync time.
Key points from the run with GOMEMLIMIT set:
  • Total heap usage decreased by over 50%.
  • GC ran more frequently, consuming more CPU.
  • Sync time increased significantly.
This was a no-go. The best way forward was to look at the code for improvements.

Revisiting pprof #

A pprof heap snapshot at peak memory usage revealed a large amount of memory used by the S3 SDK's memory pooling mechanism.
This raised questions about:
  • A bug in the SDK.
  • Our misuse or misconfiguration of the SDK.
  • Legitimate but inefficient use of the SDK.

Three engineers look at three processes 👀 #

Our Framework team, consisting of Erez, Aleksandr, and myself, examined the code of the three components of our sync for bugs or optimizations.

Source backpressure #

I found that we were not limiting the number of tables being synced concurrently. Adding a default limit of 125 concurrently syncing tables eliminated the memory peak and the long tail without increasing the total sync time (a generic sketch of the pattern follows below).
Key points:
  • 42% reduction in memory usage.
  • Total sync time remained the same.
In the end, we didn't merge this PR, as it turned out we didn't need it.
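For illustration, capping table-level concurrency boils down to a counting semaphore around the per-table sync. This is a generic sketch of the pattern, not CloudQuery's actual scheduler, and the table names are just examples:
package main

import (
	"fmt"
	"sync"
)

// syncTables runs syncTable for every table, but allows at most `limit`
// of them to be in flight at the same time.
func syncTables(tables []string, limit int, syncTable func(string)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for _, table := range tables {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` syncs are already running
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }()
			syncTable(t)
		}(table)
	}
	wg.Wait()
}

func main() {
	tables := []string{"aws_ec2_instances", "aws_s3_buckets", "aws_iam_users"}
	syncTables(tables, 125, func(t string) { fmt.Println("syncing", t) })
}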

Apache Arrow's RecordBatch #

The three processes stream Apache Arrow records to each other over gRPC connections.
Aleksandr noted that we were sending one record at a time instead of using RecordBatches to stream multiple records at once; switching to batches significantly improved memory usage.
Aleksandr describes the implementation in his blog post.
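The gist of the change, stripped of the Arrow and gRPC specifics (see Aleksandr's post for the real implementation), is to accumulate records and send them in batches rather than making one call per record. A generic sketch of that pattern:
package main

import "fmt"

// flushEvery reads items from in and calls send with up to size items at a
// time, instead of invoking send once per item. A real implementation would
// also flush on a timer so small batches don't wait indefinitely.
func flushEvery[T any](in <-chan T, size int, send func([]T)) {
	batch := make([]T, 0, size)
	for item := range in {
		batch = append(batch, item)
		if len(batch) == size {
			send(batch)
			batch = make([]T, 0, size)
		}
	}
	if len(batch) > 0 {
		send(batch) // flush whatever is left when the channel closes
	}
}

func main() {
	in := make(chan int)
	go func() {
		for i := 0; i < 10; i++ {
			in <- i
		}
		close(in)
	}()
	flushEvery(in, 4, func(batch []int) { fmt.Println("sending batch of", len(batch)) })
}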

S3 SDK's memory over-allocation #

Erez found that the S3 SDK allocated 5 MB per uploaded chunk when the supplied io.Reader didn't support seeking, and only freed that memory once the upload succeeded. He fixed this by providing a reader that supports seeking (an io.ReadSeeker).
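For context, here's roughly what that looks like at the call site with the AWS SDK for Go v2's upload manager: passing a bytes.Reader (an io.ReadSeeker) lets the uploader seek back into the data instead of buffering each part in its own pool. This is a hedged sketch, not the actual CloudQuery code, and the bucket, key, and payload are made up:
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	uploader := manager.NewUploader(s3.NewFromConfig(cfg))

	data := []byte("file contents...") // illustrative payload

	// bytes.NewReader implements io.ReadSeeker (and io.ReaderAt), so the
	// uploader doesn't have to copy each chunk into a pooled buffer.
	_, err = uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: aws.String("example-bucket"),
		Key:    aws.String("example-key"),
		Body:   bytes.NewReader(data),
	})
	if err != nil {
		log.Fatal(err)
	}
}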
Key points:
  • 84% memory usage reduction!
  • Total sync time remained the same.
We merged the fix and reported it to AWS for consideration in a rewrite.

Conclusion #

After a week's work, we managed to fix the OOM, cut the S3 destination's peak memory usage by 84%, and keep total sync time unchanged.
Resource utilization issues will resurface as we deal with new sources, destinations, scaling challenges, and new features, but we’re better prepared now.
ELT at scale is complex, so if you're rolling out your own solution and facing difficulties, give CloudQuery a try.