
How to Sync Simple Analytics Data to BigQuery using GitHub Actions

Herman Schaaf

Introduction #

In this how-to guide, we will show you how to load page views and events from Simple Analytics into BigQuery on a regular schedule. We will use GitHub Actions to automate the process, together with the CloudQuery Simple Analytics source integration and the BigQuery destination integration. This is a simple, reliable (and potentially free!) way to keep Simple Analytics data flowing into BigQuery.

Prerequisites #

To follow this guide, you will need the following:
  • a Simple Analytics account to load data from,
  • a BigQuery dataset to sync to,
  • a GitHub repository to store the workflow and config files, and
  • the gcloud CLI installed and configured. You may also choose to use the Google Cloud Console UI (or Terraform) instead, but we won't cover the steps for that here.
Now, let's get started!

Step 1: Add Workflow Configuration to the GitHub repository #

Step 1.1: Add the workflow file #

Inside your git repository, create a new file called .github/workflows/sync_simple_analytics_to_bigquery.yml and add the following content:
name: Sync Simple Analytics to BigQuery

on:
  workflow_dispatch:
  schedule:
    - cron: "0 2 * * *"  # daily at 02:00 UTC

jobs:
  simple-analytics-to-bigquery:
    timeout-minutes: 60
    runs-on: ubuntu-latest
    permissions:
      id-token: 'write' # This is required for OIDC
      contents: 'read' # This is required for actions/checkout@v3
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 2
      - name: Authenticate to Google Cloud
        uses: 'google-github-actions/auth@v0'
        with:
          workload_identity_provider: 'projects/${{ secrets.SA_BIGQUERY_PROJECT_NUMBER }}/locations/global/workloadIdentityPools/cloudquery-pool/providers/cloudquery-provider'
          service_account: 'cloudquery-service-account@${{ secrets.SA_BIGQUERY_PROJECT_ID }}.iam.gserviceaccount.com'
      - uses: cloudquery/setup-cloudquery@v3
        name: Setup CloudQuery
        with:
          version: "v2.3.1"
      - name: CloudQuery Sync
        run: cloudquery sync --log-console simple-analytics.yml bigquery.yml
        env:
          SA_USER_ID: ${{ secrets.SA_USER_ID }}
          SA_API_KEY: ${{ secrets.SA_API_KEY }}
          SA_BIGQUERY_PROJECT_ID: ${{ secrets.SA_BIGQUERY_PROJECT_ID }}
          SA_BIGQUERY_DATASET_ID: ${{ secrets.SA_BIGQUERY_DATASET_ID }}
This configuration allows us to run the workflow manually or on a schedule; the example above runs daily at 2 a.m. UTC. The workflow runs on ubuntu-latest, a GitHub-hosted virtual machine, and performs the following steps:
  1. Check out the repository
  2. Authenticate to Google Cloud using the google-github-actions/auth Action
  3. Setup CloudQuery using the cloudquery/setup-cloudquery Action
  4. Run the cloudquery sync command to load data from Simple Analytics into BigQuery
We will now configure the secrets and the configuration files.

Step 1.2: Add the secrets #

In the GitHub repository, go to Settings > Secrets and add the following repository secrets (alternatively, you can set them from the command line, as shown after this list):
  • SA_USER_ID: the user ID of your Simple Analytics account. You can find this in the Simple Analytics Account Settings.
  • SA_API_KEY: the API key of your Simple Analytics account. You can find this in the Simple Analytics Account Settings.
  • SA_BIGQUERY_PROJECT_ID: the ID of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.
  • SA_BIGQUERY_DATASET_ID: the ID of the BigQuery dataset where you want to load the data. You can find this in the BigQuery UI.
  • SA_BIGQUERY_PROJECT_NUMBER: the number of the Google Cloud project where your BigQuery dataset is located. You can find this in the Google Cloud Console home page when your project is selected.
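If you prefer setting the secrets from the command line, the GitHub CLI can do this too. Here is a sketch, assuming gh is installed and authenticated against your repository; replace the placeholder values with your own:
gh secret set SA_USER_ID --body "<your-simple-analytics-user-id>"
gh secret set SA_API_KEY --body "<your-simple-analytics-api-key>"
gh secret set SA_BIGQUERY_PROJECT_ID --body "<your-gcp-project-id>"
gh secret set SA_BIGQUERY_DATASET_ID --body "<your-bigquery-dataset-id>"
gh secret set SA_BIGQUERY_PROJECT_NUMBER --body "<your-gcp-project-number>"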

Step 1.3: Add the Simple Analytics source integration configuration file #

In the git repository, create a new file called simple-analytics.yml and add the following content:
kind: source
spec:
  name: "simple-analytics"
  path: "simpleanalytics/simpleanalytics"
  version: "v1.0.0"
  tables:
    ["*"]
  destinations:
    - "bigquery"
  spec:
    user_id: ${SA_USER_ID}
    api_key: ${SA_API_KEY}
    websites:
      - hostname: <your-website-hostname>
You can find the latest version on the releases page. Replace <your-website-hostname> with the hostname of your website as defined in Simple Analytics, e.g. mywebsite.com. You can also add more than one website, specify additional metadata fields to fetch, or set the period of days to fetch data for. See the Simple Analytics source plugin documentation for all the details.
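For example, syncing two websites could look like the fragment below, a sketch that reuses only the hostname field shown above (blog.mywebsite.com is just a placeholder); see the plugin documentation for the other per-website options:
    websites:
      - hostname: mywebsite.com
      - hostname: blog.mywebsite.com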

Step 1.4: Add the BigQuery destination plugin configuration file #

Now, let's set up the BigQuery destination. In the git repository, create a new file called bigquery.yml and add the following content:
kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v4.1.19"
  write_mode: "append"
  spec:
    project_id: ${SA_BIGQUERY_PROJECT_ID}
    dataset_id: ${SA_BIGQUERY_DATASET_ID}
You can find the latest version in the BigQuery destination plugin documentation. There are also a few more configuration options you may want to consider, like partitioning based on sync time.
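Before committing, you may want to run the sync once from your own machine to verify both config files. A rough sketch, assuming the CloudQuery CLI is installed locally and your user has Application Default Credentials with access to the dataset:
export SA_USER_ID=<your-user-id>
export SA_API_KEY=<your-api-key>
export SA_BIGQUERY_PROJECT_ID=<your-project-id>
export SA_BIGQUERY_DATASET_ID=<your-dataset-id>
gcloud auth application-default login  # local credentials for the BigQuery destination
cloudquery sync simple-analytics.yml bigquery.yml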

Step 1.5: Push to the repository #

With all this in place, make sure to push the files to the GitHub repository. You can now run the workflow manually by going to Actions > Sync Simple Analytics to BigQuery > Run workflow and clicking the Run workflow button. It will run, but probably end with an error, because we still need to set up Workload Identity to authenticate with BigQuery. Let's do that next.

Step 2: Set up Workload Identity #

In this section, we will set up Workload Identity to allow the GitHub Actions runner to authenticate with BigQuery. This is required because the GitHub-hosted runner is not part of our Google Cloud project and has no credentials of its own. We will use the google-github-actions/auth Action; the steps in this guide are based on its setup instructions, fine-tuned for our use case.

Step 2.1: Set Environment Variables #

First, let's set a few environment variables that we'll use throughout the guide. Set the following variables, replacing <your-project-id> and <your-repo> with your own values:
export PROJECT_ID=<your-project-id> # The ID of your Google Cloud project, e.g. "my-project"
export REPO=<your-repo> # e.g. cloudquery/simple-analytics-to-bigquery

Step 2.2: Create a Service Account #

We will need a service account that can write to BigQuery. We will use the gcloud CLI to create the service account and grant it the necessary permissions. Run the following commands:
gcloud iam service-accounts create "cloudquery-service-account" \
  --project "${PROJECT_ID}"
If you haven't used the IAM Credentials API in the project before, it will need to be enabled:
gcloud services enable iamcredentials.googleapis.com \
  --project "${PROJECT_ID}"
Now we will grant the service account permission to access the Google Cloud resources it needs, in this case BigQuery, by assigning it the BigQuery Data Editor role:
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"
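To double-check the binding, you can list the roles granted to this service account on the project; a quick check, using the same names as above:
gcloud projects get-iam-policy "${PROJECT_ID}" \
  --flatten="bindings[].members" \
  --filter="bindings.members:cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
  --format="table(bindings.role)"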

Step 2.3: Create a Workload Identity Pool and Provider #

We will now set up Workload Identity, which the GitHub action will rely on to authenticate itself without the need for keys. You can read more about keyless authentication with GCP and GitHub Actions here.
First we need to create a pool:
gcloud iam workload-identity-pools create "cloudquery-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --display-name="Cloudquery pool"
Now let's describe the pool and set an environment variable that we'll use in subsequent steps:
export WORKLOAD_IDENTITY_POOL_ID=$(gcloud iam workload-identity-pools describe "cloudquery-pool" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --format="value(name)")
We can use echo $WORKLOAD_IDENTITY_POOL_ID to verify that the variable is set correctly:
echo $WORKLOAD_IDENTITY_POOL_ID
# projects/<your-project-number>/locations/global/workloadIdentityPools/cloudquery-pool
Now let's create an OIDC provider:
gcloud iam workload-identity-pools providers create-oidc "cloudquery-provider" \
  --project="${PROJECT_ID}" \
  --location="global" \
  --workload-identity-pool="cloudquery-pool" \
  --display-name="Cloudquery provider" \
  --attribute-mapping="google.subject=assertion.sub,attribute.actor=assertion.actor,attribute.repository=assertion.repository" \
  --issuer-uri="https://token.actions.githubusercontent.com"
Finally, we need to allow identities from the pool that match our repository to impersonate the service account, by granting the Workload Identity User role on the service account:
gcloud iam service-accounts add-iam-policy-binding "cloudquery-service-account@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="${PROJECT_ID}" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principalSet://iam.googleapis.com/${WORKLOAD_IDENTITY_POOL_ID}/attribute.repository/${REPO}"

Step 3: Query the data #

Now that Workload Identity is set up, we can run the workflow again. Go to Actions > Sync Simple Analytics to BigQuery > Run workflow and click the Run workflow button. This time the workflow should run successfully. You can check the logs to confirm that the data was written to BigQuery, or to see any errors that occurred. After a successful run, you should be able to query the data in BigQuery!
Let's head over to the BigQuery console and run the following query to test it out (replace cq-analytics-v1 with your project ID and simple_analytics with your dataset ID):
SELECT * FROM `cq-analytics-v1.simple_analytics.simple_analytics_page_views` LIMIT 10
Nice, we have raw Simple Analytics data in BigQuery! We can now use this data to build dashboards, reports, and more. The GitHub Action will run daily, so the data will be kept up to date automatically.
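As an additional sanity check, you can count rows per sync using the _cq_sync_time column that CloudQuery adds to every table (again, substitute your own project and dataset IDs):
SELECT _cq_sync_time, COUNT(*) AS row_count
FROM `<your-project-id>.<your-dataset-id>.simple_analytics_page_views`
GROUP BY _cq_sync_time
ORDER BY _cq_sync_time DESC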

Step 4: Post-processing the data (Optional) #

Now that we have extracted and loaded our data into BigQuery, we may also want to do the "T" part of ELT: transformation as a post-processing step. For example, because the BigQuery destination only supports append mode, we need to make sure the synced tables don't accumulate duplicate rows. One simple way to do this is to run SQL queries that remove duplicates after the sync completes, by adding a few extra steps to the workflow.
Let's write the queries first. Create deduplicate_page_views.sql:
CREATE OR REPLACE TABLE `simple_analytics`.`simple_analytics_page_views` AS
SELECT
    *
FROM
    `simple_analytics`.`simple_analytics_page_views`
WHERE
    _cq_sync_time = (
        SELECT MAX(_cq_sync_time)
        FROM `simple_analytics`.`simple_analytics_page_views`
    );
And similarly, we'll create deduplicate_events.sql:
CREATE OR REPLACE TABLE `simple_analytics`.`simple_analytics_events` AS
SELECT
    *
FROM
    `simple_analytics`.`simple_analytics_events`
WHERE
    _cq_sync_time = (
        SELECT MAX(_cq_sync_time)
        FROM `simple_analytics`.`simple_analytics_events`
    );
Now we can wire these queries into the workflow by adding a few steps that install the Cloud SDK (which provides the bq tool) and run the queries:
      - name: Set up Cloud SDK
        uses: 'google-github-actions/setup-gcloud@v1'
      - name: Deduplicate page views
        run: bq query --use_legacy_sql=False --flagfile deduplicate_page_views.sql
      - name: Deduplicate events
        run: bq query --use_legacy_sql=False --flagfile deduplicate_events.sql
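Both queries rewrite the table so that only the rows from the most recent sync (the maximum _cq_sync_time) are kept. Note that running these queries as BigQuery jobs may require the service account to hold an additional role such as BigQuery Job User, on top of BigQuery Data Editor.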

Conclusion #

That's it! Now we have a fully automated pipeline that extracts data from Simple Analytics, loads it into BigQuery, and deduplicates it on a regular basis. We can now use this data to build dashboards, reports, and more. To see a live repository employing this workflow, check out the CloudQuery recipes repository.
Ready to get started with CloudQuery? You can try out CloudQuery locally with our quick start guide or explore the CloudQuery Platform (currently in beta) for a more scalable solution.
Want help getting started? Join the CloudQuery community to connect with other users and experts, or message our team directly here if you have any questions.
Written by Herman Schaaf

Herman is the Director of Engineering at CloudQuery and an Apache Arrow contributor. A polyglot with a preference for Go and Python, he has spoken at QCon London and Data Council New York.
