Smarter, faster syncs with CloudQuery's automatic sharding

Michal Brutvan

Introduction #

When you run a sync with CloudQuery for a service with a large amount of data, such as AWS, it can often take hours to complete. One way to make the sync finish faster is to split the configuration and run syncs in parallel based on tables, accounts, or any other way that makes sense for the particular service.
Setting this up is tedious, and the setup requires continuous maintenance. You need to figure out the best way to split the configs for your specific environment, build and run containers with those configs, and, whenever something changes, start again almost from scratch.
Previously, we made it easier to understand the details of each sync to help you split configs manually. Our latest improvement to our SDK and CloudQuery CLI, sharding, enables you to split them automatically. Let's dive deeper into how the automatic sharding works in CloudQuery.

What is Automatic Sharding? #

Sharding splits a workload into independent parts (shards) that can be processed in parallel, which increases throughput and shortens the total run time.
CloudQuery takes advantage of its extensible architecture to do this by splitting the collection of tables provided by the plugin into groups. When you run the cloudquery sync command with a shard number and the total number of shards, the sharded sync determines which tables it is responsible for and syncs only those tables. It assumes that you will run the sync command for the remaining shards as well.
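To make the idea concrete, here is a minimal Python sketch of how a table list could be partitioned so that each shard works out its own share independently. This is an illustration only: the round-robin assignment, the function names tables_for_shard and parse_shard, and the sample table names are assumptions made for this example, not the algorithm the CloudQuery SDK actually uses.

```python
from typing import List, Sequence, Tuple


def parse_shard(value: str) -> Tuple[int, int]:
    """Parse a 'number/total' string such as '1/2' into (number, total)."""
    number, total = value.split("/")
    return int(number), int(total)


def tables_for_shard(tables: Sequence[str], shard: int, total: int) -> List[str]:
    """Return the tables handled by 1-based shard `shard` out of `total`.

    Hypothetical round-robin scheme, used here only for illustration.
    """
    if total < 1 or not 1 <= shard <= total:
        raise ValueError(f"invalid shard {shard}/{total}")
    # Sort first so that every shard sees the tables in the same order,
    # no matter how the list was produced.
    ordered = sorted(tables)
    # Table at index i belongs to shard (i % total) + 1.
    return [name for i, name in enumerate(ordered) if i % total == shard - 1]


# Illustrative table names, not a real plugin listing.
example_tables = ["aws_s3_buckets", "aws_ec2_instances", "aws_iam_roles"]
number, total = parse_shard("1/2")
my_tables = tables_for_shard(example_tables, number, total)
```

The property that matters is that the assignment is deterministic: every shard computes the same split without coordinating with the others, so the shards together cover each table exactly once.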
Here is a practical example. To split a long-running sync into two shards that can run in parallel, run the following commands independently:
cloudquery sync config.yml --shard 1/2
cloudquery sync config.yml --shard 2/2
It really is that simple! CloudQuery will determine which tables to sync and when both syncs finish, all the data will be in your destination.
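If you want to start all shards from a single machine rather than from separate jobs, a small launcher script is one option. The following Python sketch builds the same commands shown above and runs them concurrently. The helper names build_shard_commands and run_shards are hypothetical, and the sketch assumes the cloudquery binary is on your PATH and that config.yml is in the current directory.

```python
import subprocess
from typing import List


def build_shard_commands(config_path: str, total: int) -> List[List[str]]:
    """Build one 'cloudquery sync' command per shard, numbered 1..total."""
    if total < 1:
        raise ValueError("total must be at least 1")
    return [
        ["cloudquery", "sync", config_path, "--shard", f"{n}/{total}"]
        for n in range(1, total + 1)
    ]


def run_shards(config_path: str, total: int) -> List[int]:
    """Start every shard at once and return their exit codes in shard order."""
    # Popen returns immediately, so all shards run in parallel.
    processes = [subprocess.Popen(cmd) for cmd in build_shard_commands(config_path, total)]
    # wait() blocks until each process finishes and returns its exit code.
    return [process.wait() for process in processes]
```

Calling run_shards("config.yml", 2) would start both shards and return once both have finished; a non-zero entry in the returned list means that shard failed and should be rerun.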

Real-World Use Cases #

This new feature makes it easy to run syncs on popular continuous integration (CI) or orchestration platforms, such as GitHub Actions. With the GitHub Actions matrix configuration, you can split your sync into multiple jobs and run them in parallel:
name: CloudQuery Parallel
on:
  schedule:
    - cron: '0 3 * * *' # Run daily at 03:00 (3am)
jobs:
  cloudquery:
    permissions:
      id-token: write
      contents: read
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1/4, 2/4, 3/4, 4/4] # Split the sync into 4 parts
    steps:
      - uses: actions/checkout@v3 # Checkout the code so we have access to the config file
      - name: Configure AWS credentials # Setup AWS credentials (example)
        uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: <role-arn> # based on the role you created in the prerequisites
          aws-region: <region> # based on the region you created the role in
      - uses: cloudquery/setup-cloudquery@v3
        name: Setup CloudQuery
        with:
          version: "v6.8.1"
      - name: Sync with CloudQuery
        run: cloudquery sync config.yml --log-console --shard ${{ matrix.shard }}
        env:
          CLOUDQUERY_API_KEY: ${{ secrets.CLOUDQUERY_API_KEY }} # See https://docs.cloudquery.io/docs/deployment/generate-api-key
          CQ_DSN: ${{ secrets.CQ_DSN }} # Connection string to a PostgreSQL database
To learn more about this feature, take a look at our documentation. We have also prepared a full guide on how to deploy CloudQuery with GitHub Actions to help make running these syncs even more straightforward.

Future Plans #

We are planning to make further improvements. Right now, table splitting is done at the SDK/CLI level. In the future, we plan to make the plugins smarter so that they can advise the CLI on how to split the syncs, taking into account known rate limits or data throughput of the individual services, or telemetry recorded from a previous sync.
If you have tried this feature and have ideas for improvements, we’d love to hear from you! Reach out to the team on our Community portal.
Written by Michal Brutvan

Michal is CloudQuery's senior product manager and has responsibility for new features and CloudQuery's product roadmap. He has had a wealth of product ownership roles and prior to that, worked as a software engineer.
