Tutorials

Scaling out ELT pipelines with CloudQuery and CSV

•

3 min read

Performance is a key factor in data pipelines both from usability and cost perspective. This tutorial will show you how to scale out your ELT pipelines with CloudQuery and CSV utilizing CloudQuery's stateless and performant architecture.

Background #

Let's take a look at your typical ELT pipeline:

[Source] -> [Database]

This is the basic architecture where you have one instance running your ELT workload scaling vertically, and one database instance that needs to also scale vertically to handle the stream/load. This is a common architecture and it works well for small to medium workloads. However, as the workload grows you will need to scale out your ELT workload. This is where things get complicated and expensive.

[Source] --\
[Source]    -> [Database]
[Source] --/

Scaling out sources now creates a bottleneck where you need to scale out your database and/or have things like PgBouncer (in-case of PostgreSQL) or use scale-out databases like CockroachDB. If you use data lakes or data warehouses (such as BigQuery) directly, you can also hit streaming limits because they tend to work better with batch loads. But again you will need some kind of intermediate server that batches all requests from all sources and then loads those to the data warehouse.

Scaling out with CloudQuery and CSV #

CloudQuery workers are stateless, so you can run as many as you want and slice and dice what data they are fetching as you need without them connecting to any database and/or backend. With the CloudQuery File Destination Integration — which writes CSV, JSON, or Parquet — each worker can write results to a local file, from where you can upload it to a cloud storage of your choice. This way you can scale out your ELT workload to infinity and beyond and only pay for the storage you use.

[Source] --\
[Source]    -> [Cloud Storage] -> [Data lake/Data warehouse]
[Source] --/

Loading CSV Files to data lakes or data warehouses is fast and cheap without the need to take care of scaling out a database or an intermediate proxy vertically.

Summary #

Scaling out ELT workloads with CSV and Cloud Storage will work well especially if you extract data from high number of sources and accounts, but it has the downside that data in the destination won't be as live when compared to streaming directly to a database. So as always, all depends on your use-case and constraints. Check out the File integration documentation to see how to configure the File destination integration to write CSV, JSON or Parquet.

Want to see CloudQuery in action? Schedule a demo with our team or check out the platform documentation to learn more.

Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.

CloudOps