Tutorials
How to Sync Simple Analytics Data to BigQuery using GitHub Actions
Introduction #
In this how-to guide, we will show you how you can load page views and events from Simple Analytics into BigQuery on regular schedule. We will use GitHub Actions to automate the process and make use of the CloudQuery Simple Analytics source integration and the BigQuery destination integration. This is a simple, reliable (and potentially free!) way to load Simple Analytics data into BigQuery on a regular basis.
Prerequisites #
Before we start, to follow this guide you will need the following:
- a Simple Analytics account to load data from,
- a BigQuery dataset to sync to,
- a GitHub repository to store the workflow and config files, and
- the
gcloud
CLI installed and configured. You may also choose to use the Google Cloud Console UI (or Terraform) instead, but we won't cover the steps for that here.
Now, let's get started!
Step 1: Add Workflow Configuration to the GitHub repository #
Step 1.1: Add the workflow file #
Inside your git repository, create a new file called
.github/workflows/sync_simple_analytics_to_bigquery.yml
and add the following content:name: Sync Simple Analytics to BigQuery
on:
workflow_dispatch:
schedule:
- cron: "0 2 * * *" # daily at 02:00 UTC
jobs:
simple-analytics-to-bigquery:
timeout-minutes: 60
runs-on: ubuntu-latest
permissions:
id-token: 'write' # This required for OIDC
contents: 'read' # This is required for actions/checkout@v3
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 2
- name: Authenticate to Google Cloud
uses: 'google-github-actions/auth@v0'
with:
workload_identity_provider: 'projects/${{ secrets.SA_BIGQUERY_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/cloudquery-pool/providers/cloudquery-provider'
service_account: 'cloudquery-service-account@${{ secrets.SA_BIGQUERY_PROJECT_ID }}.iam.gserviceaccount.com'
- uses: cloudquery/setup-cloudquery@v3
name: Setup CloudQuery
with:
version: "v2.3.1"
- name: CloudQuery Sync
run: cloudquery sync --log-console simple-analytics.yml bigquery.yml
env:
SA_USER_ID: ${{ secrets.SA_USER_ID }}
SA_API_KEY: ${{ secrets.SA_API_KEY }}
SA_BIGQUERY_PROJECT_ID: ${{ secrets.SA_BIGQUERY_PROJECT_ID }}
SA_BIGQUERY_DATASET_ID: ${{ secrets.SA_BIGQUERY_DATASET_ID }}
This configuration allows us to run the workflow manually or on a schedule. The example above is configured to run daily at 2 a.m. UTC. The workflow will run on a self-hosted runner, which is a virtual machine hosted by GitHub. The workflow will run the following steps:
- Check out the repository
- Authenticate to Google Cloud using the
google-github-actions/auth
Action - Setup CloudQuery using the
cloudquery/setup-cloudquery
Action - Run the
cloudquery sync
command to load data from Simple Analytics into BigQuery
We will now configure the secrets and the configuration files.
Step 1.2: Add the secrets #
In the GitHub repository, go to
Settings > Secrets
and add the following repository secrets:SA_USER_ID
: the user ID of your Simple Analytics account. You can find this in the Simple Analytics Account Settings.SA_API_KEY
: the API key of your Simple Analytics account. You can find this in the Simple Analytics Account Settings.SA_BIGQUERY_PROJECT_ID
: the ID of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.SA_BIGQUERY_DATASET_ID
: the ID of the BigQuery dataset where you want to load the data. You can find this in the BigQuery UI.SA_BIGQUERY_PROJECT_NUMBER
: the number of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.
Step 1.3: Add the Simple Analytics source integration configuration file #
In the git repository, create a new file called
simple-analytics.yml
and add the following content:kind: source
spec:
name: "simple-analytics"
path: "simpleanalytics/simpleanalytics"
version: "v1.0.0"
tables:
["*"]
destinations:
- "bigquery"
spec:
user_id: ${SA_USER_ID}
api_key: ${SA_API_KEY}
websites:
- hostname: <your-website-hostname>
You may find the latest version on the releases page. Replace
<your-website-hostname>
with the host name of your website as defined on Simple Analytics, e.g. mywebsite.com
. You can also add more than one website, specify additional metadata that should be fetched or a period of days to fetch data for. See the Simple Analytics source plugin documentation for all the details.Step 1.4: Add the BigQuery destination plugin configuration file #
Now, let's set up the BigQuery destination. In the git repository, create a new file called
bigquery.yml
and add the following content:kind: destination
spec:
name: bigquery
path: cloudquery/bigquery
registry: cloudquery
version: "v4.1.19"
write_mode: "append"
spec:
project_id: ${SA_BIGQUERY_PROJECT_ID}
dataset_id: ${SA_BIGQUERY_DATASET_ID}
You can find the latest version in the BigQuery destination plugin documentation. There are also a few more configuration options you may want to consider, like partitioning based on sync time.
Step 1.5: Push to the repository #
With all this in place, make sure you to push the files to the GitHub repository. You can now run the workflow manually by going to
Actions > Sync Simple Analytics to BigQuery > Run workflow
and clicking the Run workflow
button. It will run, but probably end with an error, because we still need to set up the Workload Identity to authenticate with BigQuery. Let's do that next.Step 2: Set up Workload Identity #
In this section, we will set up Workload Identity to allow the GitHub Actions runner to authenticate with BigQuery. This is required because the GitHub Actions runner is a self-hosted runner, which is not part of the Google Cloud project. We will use the Google GitHub Auth and the steps in this guide are based on the instructions there, but fine-tuned for our use case.
Step 2.1: Set Environment Variables #
First, let's set a few environment variables that we'll use throughout the guide. Set the following variables, replacing
<your-project-id>
and <your-repo>
with your own values:export PROJECT_ID=<your-project-id> # The ID of your Google Cloud project, e.g. "my-project"
export REPO=<your-repo> # e.g. cloudquery/simple-analytics-to-bigquery
Step 2.2: Create a Service Account #
We will need a service account that can write to BigQuery. We will use the
gcloud
CLI to create the service account and grant it the necessary permissions. Run the following commands:gcloud iam service-accounts create "cloudquery-service-account" \
--project "${PROJECT_ID}"
If you haven't used the IAM Credentials API in the project before, it will need to be enabled:
gcloud services enable iamcredentials.googleapis.com \
--project "${PROJECT_ID}"
Now we will grant the Google Cloud Service Account permissions to access Google Cloud resources. In this case, BigQuery:
We will assign the service account the
BigQuery Data Editor
role:gcloud projects add-iam-policy-binding $PROJECT_ID --member="serviceAccount:cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" --role=roles/bigquery.dataEditor
Step 2.3: Create a Workload Identity Pool and Provider #
We will now set up Workload Identity, which the GitHub action will rely on to authenticate itself without the need for keys. You can read more about keyless authentication with GCP and GitHub Actions here.
First we need to create a pool:
gcloud iam workload-identity-pools create "cloudquery-pool" \
--project="${PROJECT_ID}" \
--location="global" \
--display-name="Cloudquery pool"
Now let's describe the pool and set an environment variable that we'll use in subsequent steps:
export WORKLOAD_IDENTITY_POOL_ID=$(gcloud iam workload-identity-pools describe "cloudquery-pool" \
--project="${PROJECT_ID}" \
--location="global" \
--format="value(name)")
We can use
echo $WORKLOAD_IDENTITY_POOL_ID
to verify that the variable is set correctly:echo $WORKLOAD_IDENTITY_POOL_ID
# projects/<your-project-id>/locations/global/workloadIdentityPools/cloudquery-pool
Now let's create an OIDC provider:
gcloud iam workload-identity-pools providers create-oidc "cloudquery-provider" \
--project="${PROJECT_ID}" \
--location="global" \
--workload-identity-pool="cloudquery-pool" \
--display-name="Cloudquery provider" \
--attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
--issuer-uri="https://token.actions.githubusercontent.com"
Finally, we need to grant the service account the
Workload Identity User
role:gcloud iam service-accounts add-iam-policy-binding "cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
--project="${PROJECT_ID}" \
--role="roles/iam.workloadIdentityUser" \
--member="principalSet://iam.googleapis.com/${WORKLOAD_IDENTITY_POOL_ID}/attribute.repository/${REPO}"
Step 3: Query the data #
Now that we have set up the Workload Identity, we can run the workflow again. Go to
Actions > Sync Simple Analytics to BigQuery > Run workflow
and click the Run workflow
button. You should see the workflow run successfully. You can also check the logs to see that the data was successfully written to BigQuery, or get messages about any errors that occurred. After a successful workflow run you should be able to query the data in BigQuery!Let's head over to the BigQuery console and run the following query to test it out:
SELECT * FROM `cq-analytics-v1.simple_analytics.simple_analytics_page_views` LIMIT 10
Nice, we have raw Simple Analytics data in BigQuery! We can now use this data to build dashboards, reports, and more. The GitHub Action will run daily, so the data will be kept up to date automatically.
Step 4: Post-processing the data (Optional) #
Now that we have extracted and loaded our data into BigQuery, we may also want to do the "T" part of ELT: transformation as a post-processing step. For example, because the BigQuery destination only supports
append
mode, we should make sure that our query results don't contain duplicates. One simple way to do this is to run a SQL to remove duplicates after the sync completes. We can do this by adding a few additional steps to the workflow that run deduplication queries as SQL.Let's first write the queries. First we'll create
deduplicate_page_views.sql
:CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_page_views` AS
SELECT
*
FROM
`simple_analytics`.`simple_analytics_page_views`
WHERE
_cq_sync_time = (
SELECT
max(_cq_sync_time)
FROM
`simple_analytics`.`simple_analytics_page_views`
);
And similarly, we'll create
deduplicate_events.sql
:CREATE
OR REPLACE TABLE `simple_analytics`.`simple_analytics_events` AS
SELECT
*
FROM
`simple_analytics`.`simple_analytics_events`
WHERE
_cq_sync_time = (
SELECT
max(_cq_sync_time)
FROM
`simple_analytics`.`simple_analytics_events`
);
Now we can add these queries to the workflow. We'll add a few steps to the workflow to install the
bq
tool and run the queries:- name: Set up Cloud SDK
uses: 'google-github-actions/setup-gcloud@v1'
- name: Deduplicate page views
run: bq query --use_legacy_sql=False --flagfile deduplicate_page_views.sql
- name: Deduplicate events
run: bq query --use_legacy_sql=False --flagfile deduplicate_events.sql
Conclusion #
That's it! Now we have a fully automated pipeline that extracts data from Simple Analytics, loads it into BigQuery, and deduplicates it on a regular basis. We can now use this data to build dashboards, reports, and more. To see a live repository employing this workflow, check out the CloudQuery recipes repository.
Ready to get started with CloudQuery? You can try out CloudQuery locally with our quick start guide or explore the CloudQuery Platform (currently in beta) for a more scalable solution.
Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.
Written by Herman Schaaf
Herman is the Director of Engineering at CloudQuery and an Apache Arrow contributor. A polyglot with a preference for Go and Python, he has spoken at QCon London and Data Council New York.