Engineering
Exploring API Data with DuckDB
DuckDB has been making a splash in the data world, and for good reason. It's a fast, in-process SQL OLAP database management system that can be used to explore data in a variety of ways.
Looking at DuckDB's GitHub star history, it's clear that DuckDB has been gaining popularity rapidly:
I'm excited to share that CloudQuery has now added a new DuckDB destination plugin, so you can sync data from any CloudQuery source and explore it using DuckDB. But what is DuckDB? In this post I will introduce DuckDB, show how to sync data to a local DuckDB database, and then explore the data using the DuckDB CLI.
What is DuckDB? #
DuckDB describes itself as an in-process SQL OLAP database management system. That's a bit of a mouthful. DuckDB is perhaps most easily understood as being in the same category as SQLite, but focused on OLAP (online analytical processing) instead of OLTP (online transaction processing). As such, DuckDB is a columnar database, which means that it stores data in columns instead of rows. This makes it fast for aggregations and other analytical queries. Being embedded with a focus on OLAP helps it fill a niche that is not well served by other databases, as illustrated by the diagram below from the 2019 SIGMOD paper by Raasveldt & Mühleisen: DuckDB: an Embeddable Analytical Database:
The fact that it runs in-process means that it can be used in a wide variety of applications: from Jupyter notebooks to web browsers to command-line tools.
Exploring API Data with DuckDB #
By using the CloudQuery DuckDB destination plugin, we can explore data from a whole range of APIs from the comfort of a local DuckDB database. But since DuckDB is an "in-process" database, and CloudQuery usually writes to databases or data lakes, it might not be entirely clear what we mean by a CloudQuery destination plugin for DuckDB. Similar to SQLite, DuckDB databases can also be stored as files, so we can use the CloudQuery
duckdb
destination plugin to sync data to a local DuckDB database file. Let's try it out!First, we need a source dataset to sync from. If you're familiar with CloudQuery, you probably already know about the extensive AWS, GCP and Azure plugins that can sync infrastructure data. DuckDB is a great fit for quick ad-hoc exploration of this data as well, but just for fun, today let's sync data from the Homebrew Analytics API and explore it a bit using DuckDB.
First we create a CloudQuery config file called
homebrew-duckdb.yml
, in which we'll put both the Homebrew source and the DuckDB destination configuration:kind: source
spec:
name: 'homebrew'
path: 'cloudquery/homebrew'
registry: 'cloudquery'
version: 'v4.7.0'
destinations: ['duckdb']
tables:
- homebrew_analytics_installs_30d
- homebrew_analytics_installs_90d
- homebrew_analytics_installs_365d
---
kind: destination
spec:
name: 'duckdb'
path: 'cloudquery/duckdb'
registry: 'cloudquery'
version: 'v5.9.19'
spec:
connection_string: './homebrew.duckdb'
Now we can run
cloudquery sync
to sync data from the Homebrew source to a local DuckDB database file (you will need to have CloudQuery installed for this step):cloudquery sync homebrew-duckdb.yml
This should output something like the following:
cloudquery sync homebrew-duckdb.yml
Loading spec(s) from homebrew-duckdb.yml
Starting migration with 3 tables for: homebrew (v4.7.0) -> [duckdb (v5.9.19)]
Migration completed successfully.
Starting sync for: homebrew (v4.7.0) -> [duckdb (v5.9.19)]
Sync completed successfully. Resources: 30000, Errors: 0, Panics: 0, Time: 10s
Now we have a local DuckDB database file called
homebrew.duckdb
that contains the data from the Homebrew Analytics API. We can use the DuckDB CLI to explore the data. Check out the DuckDB Installation page for more information on how to install the CLI. Let's open the database:duckdb homebrew.duckdb
This outputs the following and then waits for input:
v0.7.1 b00b93f0b1
Enter ".help" for usage hints.
D
DuckDB is great for data exploration, so let's explore some data! Let's start by looking at the
homebrew_analytics_installs_30d
table:SELECT
number, formula, count, percent
FROM homebrew_analytics_installs_30d
ORDER BY number ASC
LIMIT 10;
Now we can see the top 10 most installed Homebrew packages (on MacOS) in the last 30 days:
┌────────┬─────────────────┬────────┬─────────┐
│ number │ formula │ count │ percent │
│ int64 │ varchar │ int64 │ float │
├────────┼─────────────────┼────────┼─────────┤
│ 1 │ [email protected] │ 313469 │ 2.37 │
│ 2 │ libnghttp2 │ 258853 │ 1.96 │
│ 3 │ zstd │ 255983 │ 1.94 │
│ 4 │ [email protected] │ 238906 │ 1.81 │
│ 5 │ ca-certificates │ 219814 │ 1.67 │
│ 6 │ freetype │ 207035 │ 1.57 │
│ 7 │ xz │ 204572 │ 1.55 │
│ 8 │ jpeg-turbo │ 203761 │ 1.54 │
│ 9 │ libxcb │ 202785 │ 1.54 │
│ 10 │ harfbuzz │ 176521 │ 1.34 │
├────────┴─────────────────┴────────┴─────────┤
│ 10 rows 4 columns │
└─────────────────────────────────────────────┘
Great! This proves it works, but it's a bit boring, in my opinion. What if we wanted to see the fastest-growing Homebrew packages instead?
We can combine the
homebrew_analytics_installs_30d
and homebrew_analytics_installs_90d
tables to find out which packages have gone up in popularity the most. We shouldn't use the count
column for this, because Homebrew recently added and publicized a new way to control the analytics that gets sent. This means that the count
column, which is an absolute count of installs, is being skewed by more users turning off analytics in the last month. Instead, we will use the number
column, which is a relative ranking of popularity. If we do this, we should also make sure we only look at formulae that had a relatively large number of installs to begin with, otherwise a small gain in installs can lead to a large jump in ranking for long-tail formulae. Let's try it out:SELECT
a.formula,
'https://formulae.brew.sh/formula/' || a.formula as url,
a.number AS number_30d,
b.number AS number_90d,
(a.number - b.number) AS change_90d
FROM homebrew_analytics_installs_30d a
JOIN homebrew_analytics_installs_90d b ON a.formula = b.formula
WHERE
-- only consider formulae that had at least 500 installs in the last 90 days
b.count > 500
-- only consider formulae that are not custom taps
AND NOT contains(a.formula, '/')
-- ignore specific version installs
AND NOT contains(a.formula, '@')
ORDER BY change_90d ASC
LIMIT 15;
And the results are:
formula | url | number_30d | number_90d | change_90d |
---|---|---|---|---|
spandsp | https://formulae.brew.sh/formula/spandsp | 1256 | 2347 | -1091 |
aws-console | https://formulae.brew.sh/formula/aws-console | 1222 | 2156 | -934 |
megatools | https://formulae.brew.sh/formula/megatools | 1376 | 2282 | -906 |
librsync | https://formulae.brew.sh/formula/librsync | 892 | 1797 | -905 |
traefik | https://formulae.brew.sh/formula/traefik | 1379 | 2284 | -905 |
noti | https://formulae.brew.sh/formula/noti | 1350 | 2197 | -847 |
solr | https://formulae.brew.sh/formula/solr | 1297 | 2070 | -773 |
checkbashisms | https://formulae.brew.sh/formula/checkbashisms | 1485 | 2233 | -748 |
jd | https://formulae.brew.sh/formula/jd | 1115 | 1792 | -677 |
terraform-ls | https://formulae.brew.sh/formula/terraform-ls | 1028 | 1695 | -667 |
trash-cli | https://formulae.brew.sh/formula/trash-cli | 1096 | 1756 | -660 |
zola | https://formulae.brew.sh/formula/zola | 1208 | 1866 | -658 |
grails | https://formulae.brew.sh/formula/grails | 1515 | 2167 | -652 |
hubble | https://formulae.brew.sh/formula/hubble | 1711 | 2335 | -624 |
libxmp | https://formulae.brew.sh/formula/libxmp | 550 | 1156 | -606 |
To get the output in Markdown format (like I did for this post), you can run
.mode markdown
before running the query.There are some interesting tools in that list! I'll be honest, knowing DuckDB's rise in popularity, I was expecting to see it in this list as well. Let's run a quick query to see how it did:
SELECT
a.formula,
'https://formulae.brew.sh/formula/' || a.formula as url,
a.number AS number_30d,
b.number AS number_90d,
(a.number - b.number) AS change_90d
FROM homebrew_analytics_installs_30d a
JOIN homebrew_analytics_installs_90d b ON a.formula = b.formula
WHERE
a.formula = 'duckdb';
formula | url | number_30d | number_90d | change_90d |
---|---|---|---|---|
duckdb | https://formulae.brew.sh/formula/duckdb | 1709 | 1937 | -228 |
Perhaps not in the top 15, but moving up 228 places in ranking is still impressive! It also backs up the meteoric rise in popularity that we observed on the GitHub Stars graph earlier.
If you'd like to explore the most up-to-date data yourself, you can follow the steps above and try out some of your own queries. You can also check out the DuckDB documentation for more information on how to use DuckDB. A word of caution on the Homebrew Analytics API dataset: I suspect that these Homebrew rankings can be easily gamed or accidentally skewed by tools being installed as part of automated processes (like CI). It will also be influenced by the target demographic's likelihood to have disabled analytics. That said, it's still an interesting way to discover up-and-coming new tools; just don't take the numbers here as the final word on the matter.
Conclusion #
In this post, I introduced DuckDB and the new CloudQuery DuckDB destination that allows you to load data from many APIs into a DuckDB-compatible format. I also showed you one specific use case: using DuckDB to query the Homebrew Analytics API and discover new tools that are gaining popularity. From here, we could take the analysis in a million different directions: perhaps by combining it with data from other APIs or datasets? Or graphing the popularity trends? Classifying Homebrew tools with ChatGPT or GPT4 and creating separate rankings by class? I will leave these as exercises for the reader.
Another interesting next step would be to allow CloudQuery to be run as an embedded process in the same way as DuckDB, so that both could be run from the same application, like Jupyter Notebooks. This would certainly make for a powerful combination, and it's something I'd like to explore more. Right now it would be a two-step process: sync to a local DuckDB file first, then load it from the notebook.
I'm looking forward to seeing how CloudQuery and DuckDB get used and what the community comes up with! If you have any questions or feedback, feel free to reach out to us on GitHub or our Community.
Ready to get started with CloudQuery? You can try out CloudQuery locally with our quick start guide or explore the CloudQuery Platform (currently in beta) for a more scalable solution.
Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.
Written by Herman Schaaf
Herman is the Director of Engineering at CloudQuery and an Apache Arrow contributor. A polyglot with a preference for Go and Python, he has spoken at QCon London and Data Council New York.