Engineering
The Lost Fourth Pillar of Observability - Config Data Monitoring
A lot has been written about logs, metrics, and traces, and they are indeed key components of observability and of application and system monitoring. One thing that is often overlooked, however, is config data and its observability. In this blog, we'll explore what config data is, how it differs from logs, metrics, and traces, what architecture is needed to store this type of data, and in which scenarios it provides value.
Logs, Metrics, and Traces - A Quick Recap #
For those who are new to the three pillars of observability, let’s do a quick recap:
Logs: Detailed records of events that occur within your system. They provide information about specific occurrences, including timestamps, error messages, and other relevant details. Logs help with debugging and forensic analysis.
Metrics: Numerical measurements collected at regular intervals. They help monitor system health, performance, and behavior over time. Examples include CPU usage, request rates, error rates, and response times.
Traces: Records that follow a request as it moves through different services in a distributed system. Traces provide visibility into request flows, helping to identify bottlenecks and understand dependencies.
The backend for these technologies is usually some sort of time-series database, and the data is typically what we call low-cardinality data (higher cardinality is possible, but it quickly becomes expensive and is generally advised against).
Another key aspect is that to collect any of this telemetry, you usually need to instrument the system - i.e., you need access to the application or the infrastructure so you can deploy an agent or add a Prometheus exporter.
Config Data: The Fourth Pillar #
Infrastructure isn't limited to AWS EC2 instances but also includes IAM users, configurations from security tools, SaaS applications, and more. This configuration data differs from traditional observability data in several important ways:
- Not instrumentable: These systems cannot be instrumented directly, but they expose their configuration through APIs.
- High cardinality and relational: The data typically has high cardinality and is highly relational. It also doesn't change as frequently as metrics like disk I/O on a server, as it's more focused on configuration states.
- Lower frequency, higher detail: The tradeoff we want to make here is less frequent collection but with higher cardinality and detail.
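To make these properties concrete, here is a minimal sketch of what one captured configuration item might look like. The `ConfigItem` class and the `aws_iam` source name are hypothetical, purely for illustration: the point is that a content hash over the full (high-cardinality) payload lets you detect whether a slowly-changing configuration actually changed between snapshots.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class ConfigItem:
    """One configuration item captured during a snapshot run."""
    source: str        # e.g. "aws_iam" (hypothetical source name)
    resource_id: str   # unique ID within the source
    config: dict       # full configuration payload from the API

    def fingerprint(self) -> str:
        """Stable hash of the config; equal hashes mean no change."""
        canonical = json.dumps(self.config, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

# Two snapshots of the same resource: only a changed payload
# produces a new fingerprint, so unchanged items can be skipped.
before = ConfigItem("aws_iam", "user/alice", {"mfa_enabled": False})
after = ConfigItem("aws_iam", "user/alice", {"mfa_enabled": True})
print(before.fingerprint() == after.fingerprint())  # False
```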
Why Config Data Matters #
Configuration data monitoring fills critical gaps in your observability strategy:
- Security posture monitoring: Track IAM permissions, security group rules, encryption settings, and other configuration items that impact your security posture.
- Compliance tracking: Monitor configurations against internal policies or external compliance requirements (SOC2, HIPAA, PCI-DSS, etc.).
- Cost optimization: Identify misconfigurations leading to unnecessary costs, like oversized instances or unused resources.
- Change management: Detect and track configuration changes across your environment, providing visibility into who changed what and when.
- Drift detection: Identify when resources deviate from their expected or desired configurations.
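Drift detection in particular reduces to a diff between a desired state (for example, from your IaC definitions) and the actual state reported by the provider's API. A minimal sketch, with made-up field names:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the keys whose actual value deviates from the desired config."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            drift[key] = {"desired": want, "actual": have}
    return drift

# Desired state (e.g. from IaC) vs. what the API reports right now.
desired = {"encryption": "aes256", "public_access": False}
actual = {"encryption": "aes256", "public_access": True}
print(detect_drift(desired, actual))
# {'public_access': {'desired': False, 'actual': True}}
```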
Architecture for Config Data Monitoring #
Let's examine some key architectural decisions we made at CQ for handling config data:
Data Ingestion #
First, the data extraction challenge is a different beast. We can't instrument these systems, so we need to create extractors (or ETL scripts), and the primary challenge is maintaining these connectors. Any system that wants to address this need must maintain high-quality connectors to various data sources.
The collection frequency can usually be daily, but sometimes it might need to be configurable for higher cadence depending on the criticality of the configuration.
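An extractor interface along these lines might look as follows. The `Extractor` base class, `SecurityGroupExtractor`, and the stub row are all hypothetical; a real connector would call the provider's API in `fetch()`. The point is that cadence is a per-extractor setting, daily by default but tighter for security-critical sources.

```python
import abc
import datetime as dt

class Extractor(abc.ABC):
    """Pulls configuration from one external API (a hypothetical interface)."""

    # How often this extractor should run; daily by default,
    # overridable for security-critical sources.
    cadence = dt.timedelta(days=1)

    @abc.abstractmethod
    def fetch(self) -> list[dict]:
        """Return normalized config rows from the upstream API."""

    def is_due(self, last_run: dt.datetime, now: dt.datetime) -> bool:
        return now - last_run >= self.cadence

class SecurityGroupExtractor(Extractor):
    cadence = dt.timedelta(hours=4)  # critical config: higher cadence

    def fetch(self) -> list[dict]:
        # A real connector would call the cloud provider's API here;
        # this stub row is for illustration only.
        return [{"group_id": "sg-123", "open_ports": [22, 443]}]

ex = SecurityGroupExtractor()
now = dt.datetime(2024, 1, 1, 12, 0)
print(ex.is_due(now - dt.timedelta(hours=5), now))  # True
```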
Storage #
Due to the highly relational nature of the data obtained from these APIs, we use a SQL database that supports complex joins. NoSQL and time-series databases aren't optimized for this use case.
The tradeoff on frequency and cardinality here would be something like daily partitioning. Some extractors might run more frequently, but the snapshotting would still typically run daily; otherwise, data volume would explode.
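One way to model this tradeoff is a snapshot table keyed by date, so each daily run appends a partition's worth of rows. The schema below is an illustrative sketch using SQLite (a production system would more likely use Postgres with native table partitioning); the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE config_snapshots (
    snapshot_date TEXT NOT NULL,   -- daily partition key (YYYY-MM-DD)
    source        TEXT NOT NULL,   -- which extractor produced the row
    resource_id   TEXT NOT NULL,
    config        TEXT NOT NULL,   -- JSON payload from the API
    PRIMARY KEY (snapshot_date, source, resource_id)
);
-- index so "latest state of a resource" queries stay fast
CREATE INDEX idx_snap_resource ON config_snapshots (resource_id, snapshot_date);
""")

conn.execute(
    "INSERT INTO config_snapshots VALUES (?, ?, ?, ?)",
    ("2024-01-01", "aws_iam", "user/alice", '{"mfa_enabled": false}'),
)
conn.execute(
    "INSERT INTO config_snapshots VALUES (?, ?, ?, ?)",
    ("2024-01-02", "aws_iam", "user/alice", '{"mfa_enabled": true}'),
)

# Latest known state of a resource: its row at the max snapshot date.
row = conn.execute("""
    SELECT config FROM config_snapshots
    WHERE resource_id = 'user/alice'
    ORDER BY snapshot_date DESC LIMIT 1
""").fetchone()
print(row[0])  # {"mfa_enabled": true}
```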
Insights Generation #
This is somewhat similar to how observability platforms solve the "blank page syndrome" of "I have the data, now what do I monitor?" We provide numerous insights out of the box, but we recognize that every organization has slightly different needs, and there is no one-size-fits-all rule in cloud governance. Therefore, customers can access the raw queries and modify them as well as add new custom data sources.
Relationships and Materialized Views #
One significant advantage of storing config data in a relational database is the ability to model and query the relationships between different configuration items. For example:
- Which IAM roles have access to which S3 buckets?
- Which security groups are associated with which instances?
- How do your Kubernetes RBAC settings relate to your cloud IAM permissions?
Materialized views can be used to pre-compute common relationship queries, improving performance for frequently requested insights.
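The first question above can be sketched as a plain join. The tables and sample rows below are invented for illustration; note that SQLite has no `CREATE MATERIALIZED VIEW` (Postgres does), so here the "materialized view" is emulated with a table that a scheduled job would rebuild after each snapshot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables for IAM roles and S3 bucket grants.
CREATE TABLE iam_roles (role_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE bucket_grants (role_id TEXT, bucket TEXT, access TEXT);

INSERT INTO iam_roles VALUES ('r1', 'app-server'), ('r2', 'analyst');
INSERT INTO bucket_grants VALUES
    ('r1', 'prod-data', 'read'),
    ('r1', 'logs', 'write'),
    ('r2', 'prod-data', 'read');

-- SQLite lacks native materialized views, so we emulate one with a
-- table that a scheduled job would drop and recreate on refresh.
CREATE TABLE mv_role_bucket_access AS
SELECT r.name AS role_name, g.bucket, g.access
FROM iam_roles r JOIN bucket_grants g USING (role_id);
""")

rows = conn.execute(
    "SELECT bucket, access FROM mv_role_bucket_access "
    "WHERE role_name = 'app-server' ORDER BY bucket"
).fetchall()
print(rows)  # [('logs', 'write'), ('prod-data', 'read')]
```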
Integration with Traditional Observability #
While config data serves as a fourth pillar, its true power emerges when integrated with traditional observability data:
- Root cause analysis: When an incident occurs, correlating metrics, logs, and traces with configuration changes can quickly identify the root cause.
- Context enrichment: Enhance your metrics and logs with configuration context (e.g., "This spike in errors occurred after a configuration change to the load balancer").
- Proactive monitoring: Detect configuration changes that might lead to future performance issues or outages before they affect your metrics.
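The root-cause correlation above amounts to asking: which configuration changes landed shortly before the incident? A minimal sketch, with a hypothetical change-record shape and a configurable lookback window:

```python
import datetime as dt

def changes_near_incident(changes, incident_time, window=dt.timedelta(hours=2)):
    """Config changes that landed within `window` before an incident:
    the first candidates to examine during root cause analysis."""
    return [
        c for c in changes
        if dt.timedelta(0) <= incident_time - c["at"] <= window
    ]

# Hypothetical change log: one load balancer change 45 minutes before
# the incident, one security group change hours earlier.
changes = [
    {"resource": "lb/main", "field": "idle_timeout", "at": dt.datetime(2024, 1, 1, 9, 30)},
    {"resource": "sg-123", "field": "open_ports", "at": dt.datetime(2024, 1, 1, 4, 0)},
]
incident = dt.datetime(2024, 1, 1, 10, 15)
print(changes_near_incident(changes, incident))
# only the load balancer change falls inside the 2-hour window
```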
Challenges and Considerations #
Implementing config data monitoring comes with its own set of challenges:
- API rate limiting: Many services impose rate limits on their APIs, which can constrain how frequently you can collect configuration data.
- Authentication and authorization: Managing credentials and permissions for numerous systems requires careful security considerations.
- Data volume management: Even with less frequent collection, the high cardinality of configuration data can lead to significant storage requirements.
- Schema evolution: APIs change over time, requiring adaptation of your data extraction and storage mechanisms.
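The rate-limiting challenge is commonly handled with exponential backoff plus jitter in the extractor layer. A sketch under the assumption that a connector signals throttling with a `RateLimitError` (a made-up exception; real SDKs raise their own throttling errors):

```python
import random
import time

class RateLimitError(Exception):
    """Raised by a connector when the upstream API throttles us (hypothetical)."""

def fetch_with_backoff(call, max_retries=5, base_delay=1.0):
    """Invoke `call`, retrying with exponential backoff + jitter on throttling."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            # 1s, 2s, 4s, ... plus jitter, so parallel extractors
            # don't all retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("rate limit not cleared after retries")

# Demo: a fake API that throttles the first two calls, then succeeds.
attempts = {"n": 0}
def fake_api():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return {"users": ["alice"]}

print(fetch_with_backoff(fake_api, base_delay=0.01))  # {'users': ['alice']}
```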
Wrap up #
While logs, metrics, and traces remain crucial components of observability, configuration data represents a fourth pillar that provides unique insights into your systems. By implementing comprehensive config data monitoring, organizations can enhance their security posture, ensure compliance, optimize costs, and gain deeper understanding of their infrastructure.
As systems grow more complex and distributed, the value of configuration data monitoring will only increase. Organizations that recognize this fourth pillar and incorporate it into their observability strategy will be better positioned to understand, troubleshoot, and optimize their infrastructure in an increasingly complex world.