Embracing Efficiency: Transitioning to a Single Deterministic Primary Key for Faster and Easier Updates and Fixes!
We have heard from many users that it can be difficult to keep up with the latest versions of Source Plugins. This is especially true for users who build business-critical applications on top of CloudQuery data and cannot afford to drop tables to run migrations. Today we are happy to announce that, starting with version `v24.0.0` of the AWS Plugin, we are moving away from Compound Primary Keys and using the `_cq_id` field as the only Primary Key. Prior to this release, any change to a Primary Key necessitated a major version bump. Going forward, plugin developers who use this new capability will no longer need a schema change to alter the Primary Key, so they can release Primary Key fixes faster and with less user impact.
Primary Keys are an important capability for CloudQuery Tables: they let users be confident that data is not being duplicated, even as CloudQuery scales to handle hundreds of thousands of concurrent API calls. Unfortunately for plugin developers, API documentation is rarely explicit about which fields determine the uniqueness of a resource, so developers are forced to make assumptions about the Primary Keys. Wrong assumptions can lead to duplicated data or, even worse, erroneously dropped rows. Because data integrity is one of the most critical aspects of any ETL solution, developers will prioritize releasing a major version to correct a wrong Primary Key.
This functionality in the AWS plugin is based on a capability added to the open source Go Plugin SDK and is available to all developers writing plugins in Go. If you are writing plugins with one of the other SDKs (Python, TypeScript, or Java) and are interested in this capability, please reach out to us so we can prioritize adding it to those SDKs.
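To illustrate why a single deterministic key decouples Primary Key fixes from schema changes, here is a minimal sketch in Go. It is not the Go Plugin SDK's implementation: the function name `deterministicID`, the use of SHA-256, and the length-prefixed encoding are all assumptions made for illustration. The point is only that the ID is derived from whichever fields identify a resource, so changing that field list changes the computed values but not the table's Primary Key column.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deterministicID is a hypothetical helper (not an SDK function) that hashes
// the values identifying a resource into one stable ID.
// Each value is length-prefixed so ("ab", "c") and ("a", "bc") hash differently.
func deterministicID(keyValues ...string) string {
	h := sha256.New()
	for _, v := range keyValues {
		fmt.Fprintf(h, "%d:%s;", len(v), v)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Placeholder identifying fields: account, region, and ARN.
	a := deterministicID("123456789012", "us-east-1", "arn:aws:s3:::example-bucket")
	b := deterministicID("123456789012", "us-east-1", "arn:aws:s3:::example-bucket")
	// The same inputs produce the same ID on every sync.
	fmt.Println(a == b)
}
```

If a plugin developer later learns that another field is needed for uniqueness, only the arguments passed to such a function change. The destination table still has one Primary Key column.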
Impacts On You
- Easier Updates: With a single deterministic primary key that doesn't change from version to version, you can be confident that Primary Key fixes in Source Plugin updates won't require a schema change on your end.
- CloudQuery Spec Configs: If you are using the AWS Plugin and have already set the `deterministic_cq_id: true` and `pk_mode: cq-id-only` options, you will see no change in behavior. In this case you can remove those options from your spec and the plugin will continue to work as expected.
- Adoption: If you are using the latest version (`v7.3.1`) of the Postgres Destination plugin, it will handle all of the migrations for you. If you are using an older version of the Postgres Destination plugin, or any other destination that supports write modes other than `append`, you will need to manually update your schema to remove the compound primary key and add the `_cq_id` field as the primary key. Alternatively, you can set `migrate_mode: forced` and CloudQuery will migrate the table by dropping the existing table and recreating it with the improved schema.
- Performance: We expect that for most destinations that support Primary Keys this change will have a negligible or small positive impact on sync times. Depending on your queries, you might see an increase in latency if they previously relied on the index backing the compound primary key. In these cases you can manually add indexes to your data to improve performance.
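For readers doing the manual migration on a Postgres-compatible destination, the sketch below generates the kind of statements involved: drop the compound primary key, make `_cq_id` the primary key, and re-create an index on the old key columns for queries that relied on it. The helper names `quoteIdent` and `migrationSQL` are hypothetical, and the table, constraint, and column names are placeholders. Look up the actual constraint name in your database before running anything. Note that `ADD PRIMARY KEY` fails if `_cq_id` contains nulls or duplicates, so it is worth running the statements inside a transaction.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent quotes a Postgres identifier, doubling embedded double quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// migrationSQL returns statements that replace a compound primary key with
// _cq_id and keep the old key columns indexed for existing queries.
func migrationSQL(table, pkConstraint string, oldKeyCols []string) []string {
	cols := make([]string, len(oldKeyCols))
	for i, c := range oldKeyCols {
		cols[i] = quoteIdent(c)
	}
	stmts := []string{
		fmt.Sprintf("ALTER TABLE %s DROP CONSTRAINT %s;", quoteIdent(table), quoteIdent(pkConstraint)),
		fmt.Sprintf("ALTER TABLE %s ADD PRIMARY KEY (%s);", quoteIdent(table), quoteIdent("_cq_id")),
	}
	if len(cols) > 0 {
		stmts = append(stmts, fmt.Sprintf("CREATE INDEX ON %s (%s);", quoteIdent(table), strings.Join(cols, ", ")))
	}
	return stmts
}

func main() {
	// Table, constraint, and column names are placeholders, not real CloudQuery names.
	for _, s := range migrationSQL("example_table", "example_table_pk", []string{"account_id", "region", "arn"}) {
		fmt.Println(s)
	}
}
```

Generating the statements first, rather than executing them directly, makes it easy to review them per table before applying them.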
tl;dr: Primary Keys that are misconfigured can lead to duplicate rows, or worse, missing rows. Plugin developers therefore prioritize fixes to Primary Keys, but until now this could only be done via a breaking change. After this change, plugin developers can safely fix Primary Keys in a minor release.
If you have any questions or concerns please reach out to us on our Community.
Written by Ben Bernays
Ben is a Senior Software Engineer at CloudQuery with experience in Go, AWS, C++ and data analytics among many other things.