
Building Local and Serverless ELT (ETL) Pipelines with CloudQuery and MotherDuck

Yevgeny Pats

CloudQuery is an open-source, high-performance ELT framework powered by Apache Arrow, and DuckDB is an open-source in-process OLAP database. In this blog post, we will walk through building an ELT pipeline with CloudQuery and DuckDB locally, and then show how to run the same configuration and pipeline with the CloudQuery Platform and MotherDuck (the cloud version of DuckDB).
One cool thing about CloudQuery and DuckDB is that both can be run locally in only a few steps. This makes it easy to develop and debug before running in the cloud, shortening feedback loops and avoiding surging cloud costs for compute, storage, and egress, especially during development.

Setting Up Local Pipeline #

  1. CloudQuery is a single binary CLI that you can download and install following these instructions.
  2. DuckDB is also a single binary and can be downloaded and installed following these instructions.
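For example, on macOS both tools can be installed with Homebrew. This is just a sketch; see the linked instructions for other platforms:
# install the CloudQuery CLI and the DuckDB CLI via Homebrew
brew install cloudquery/tap/cloudquery
brew install duckdb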

Syncing Data from HackerNews API to DuckDB Locally #

Now, let’s sync this year’s Hacker News stories and comments to DuckDB. To do that, we will use the CloudQuery Hacker News source and the DuckDB destination. The configuration for CloudQuery will look like the following:
# config.yml
kind: source
spec:
  name: "hackernews"
  path: "cloudquery/hackernews"
  registry: "cloudquery"
  version: "v3.0.21"
  tables: ["*"]
  destinations:
    - "duckdb"
  spec:
    start_time: "2024-01-01T00:00:00Z"
---

kind: destination
spec:
  name: duckdb
  path: cloudquery/duckdb
  registry: cloudquery
  version: "v5.1.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: ./example.db
Now that we have CloudQuery installed and this configuration file saved as config.yml, we can run the following:
cloudquery login
cloudquery sync ./config.yml
This will start syncing all the items from the start of 2024 to a local DuckDB. It can take a while, so feel free to interrupt it in the middle as you will have data to play with anyway.
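To explore the data, open the database file with the DuckDB CLI; the path matches the connection_string in the destination spec:
duckdb ./example.db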
Now, inside the DuckDB shell, run the following query (the total size of the database is about 512 MB at the time of writing):
SELECT
    title, score
FROM
    hackernews_items
WHERE
    parent = 0 AND
    (title ILIKE '%open source%' OR
     title ILIKE '%open-source%')
ORDER BY score DESC LIMIT 20;
This will return the top-scoring stories that have “open source” or “open-source” in their title (however you choose to spell it). We can definitely see that HN loves open source!
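If you want to check which columns are available before writing more queries, DuckDB’s DESCRIBE statement shows the schema of the synced table:
DESCRIBE hackernews_items;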
Now let’s do something a bit more interesting and run the query below:
WITH lists_split_into_rows AS (
    SELECT
        score,
        unnest(string_split(lower(title), ' ')) AS word
    FROM
        hackernews_items
)
SELECT
    word,
    sum(score) AS score_weight
FROM
    lists_split_into_rows
WHERE
    word NOT IN ('to', 'the', 'of', 'in', 'for', 'and', 'with', 'on', 'a', 'from', 'are', '', 'is', 'ask', 'how', 'what', 'why', 'by', 'an', 'as', 'your', 'at', 'you', 'it', 'over', 'using', 'into')
GROUP BY
    word
ORDER BY score_weight DESC LIMIT 100;
This will return a list of words, each with a score_weight: the sum of the scores of every story whose title contains that word. The more traction stories with a given word gain, the more weight that word accumulates.
This query could use more fine-tuning, but even now we can see that if you want to get noticed by the HN community, using words such as "AI", "LLM", "737", "open-source", or "free" will increase your odds of making the front page!
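If you plan to reuse a query like this downstream, one option is to persist it as a view so consumers don't have to repeat the SQL. Here is a minimal sketch; the view name top_title_words is our own, and the stop-word filter is omitted for brevity:
-- persist the word-weight query as a reusable view (hypothetical name)
CREATE OR REPLACE VIEW top_title_words AS
WITH lists_split_into_rows AS (
    SELECT
        score,
        unnest(string_split(lower(title), ' ')) AS word
    FROM
        hackernews_items
)
SELECT
    word,
    sum(score) AS score_weight
FROM
    lists_split_into_rows
GROUP BY
    word;

-- downstream consumers can then simply run:
SELECT * FROM top_title_words ORDER BY score_weight DESC LIMIT 10;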
When the queries are ready and you want to move them to your production database, where other users and downstream systems can take advantage of the data, you can move to the next section and use the same queries and ELT configuration!

Setting Up a Managed / Serverless Pipeline #

Now let’s see how we can set up the same thing as a cloud pipeline to run periodically.
  1. Sign up for the CloudQuery Platform

Syncing Data from HackerNews API to DuckDB #

Once logged into the CloudQuery Platform, click New Sync in the top right.
The Destination Plugin
This will load the Configure Destination screen. From the Select destination plugin dropdown, choose DuckDB. You can leave the Select destination plugin version and Destination Name fields alone as the defaults will be fine.
The Destination Spec field for MotherDuck should look like the following:
kind: destination
spec:
  name: duckdb
  path: cloudquery/duckdb
  registry: cloudquery
  version: "v5.1.0"
  write_mode: "overwrite-delete-stale"
  spec:
    connection_string: "md:hackernews?motherduck_token=${MOTHERDUCK_TOKEN}"
You will need to replace the md:hackernews portion of the connection_string with the name you want to use for your MotherDuck database (as described in authenticating to MotherDuck).
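For example, if your MotherDuck database were named my_analytics (a hypothetical name), the line would read:
connection_string: "md:my_analytics?motherduck_token=${MOTHERDUCK_TOKEN}"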
In this case, you'll notice that we're using CloudQuery's secret injection to store the MOTHERDUCK_TOKEN securely and inject it at runtime. When you selected the DuckDB destination plugin, a secret for this key was automatically added to the secret storage, so all you need to do is paste your MotherDuck service token into the Value field.
With that done, click Next.
The Source Plugin
This will progress to the Configure Source screen. From the Select source plugin dropdown, select HackerNews. In this case, all the default values are acceptable, so click Continue.
Schedule and Resources
This will progress to the New Sync screen. Here, you can change the name of the sync, alter the schedule, and set the resource limits. Again, in this case the default values are acceptable, so click Save and run.

Accessing the data #

Now you should be able to see the running sync, and over in the MotherDuck UI, the data should be arriving.
The other cool part is that you can also connect to MotherDuck from the local shell. By running .open md:blog, you can query both local and remote datasets from the same shell: both the local hackernews_items and the remote hackernews.main.hackernews_items will be accessible!
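Alternatively, here is a minimal sketch of attaching MotherDuck from a local DuckDB session, assuming your MOTHERDUCK_TOKEN environment variable is set and the MotherDuck extension is available:
-- inside a local shell opened with: duckdb ./example.db
ATTACH 'md:';  -- attaches your MotherDuck databases alongside the local one
SELECT count(*) FROM hackernews_items;                  -- local table
SELECT count(*) FROM hackernews.main.hackernews_items;  -- remote table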

Closing thoughts #

For a long time, data engineering suffered from local development environments that couldn't be deployed to the cloud with the same configuration and query language. DuckDB, MotherDuck, and CloudQuery bring that capability to you!
Ready to get started with CloudQuery? You can try out CloudQuery locally with our quick start guide or explore the CloudQuery Platform (currently in beta) for a more scalable solution.
Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.

Written by Yevgeny Pats

Yevgeny Pats is the Co-Founder & CEO at CloudQuery. Prior to establishing CloudQuery, he successfully founded and exited other startups. He has a background in software engineering and cybersecurity.
