Monitoring Cloud Resolver using Prometheus - BlueCat Cloud Resolver - 1.5.0

BlueCat Cloud Resolver Administration Guide


Cloud Resolver exposes real-time metrics over HTTP in Prometheus format. You can pull these metrics using a variety of dashboard and alerting systems. You can use Prometheus directly, but the Prometheus format is also consumable by other modern monitoring and observability toolchains.

api_counts

API counts are integer counters that indicate the number of successful and failed API calls. The counts are divided into the following three api_types:
  • csp—Cloud Service Provider control plane API calls
  • edge—BlueCat DNS Edge API calls
  • discover—the overall discover process, including both Cloud Service Provider and DNS Edge API calls
For each api_type, there are the following three different response metrics:
  • success—the number of API calls or discovery processes that completed successfully
  • total_error—the number of API calls or discovery processes that resulted in any error
  • live_error—a value greater than 0 if Cloud Resolver is currently receiving errors from API calls or the overall discovery process. The value resets to 0 after a successful API call or discovery.

Sporadic API errors from either the Cloud Service Provider or DNS Edge are expected, due to timeouts or other issues that quickly resolve themselves. BlueCat recommends configuring an alarm on the live_error value for the discover api_type, especially if the value is greater than 5.

Depending on the size of the zone and the number of resources within the Cloud Service Provider, Cloud Resolver might encounter issues with the discovery process due to API rate limiting configurations of the Cloud Service Provider. If you encounter issues with API rate limiting configurations on the Cloud Service Provider, you can take one of the following actions:
  • Increase the API rate limiting configuration on the Cloud Service Provider
  • Implement multiple Cloud Resolver instances to capture information from the Cloud Service Provider
  • Increase the CRS_POLLING_INTERVAL time so that Cloud Resolver polls the Cloud Service Provider less frequently. For more information, refer to Creating the Cloud Resolver configuration file.
The following displays an example of the Prometheus metrics for api_counts:
# HELP api_counts API Error, Success, and Live Error counts
# TYPE api_counts counter
api_counts{api_type="csp",response="live_error"} 0
api_counts{api_type="csp",response="success"} 5028352
api_counts{api_type="csp",response="total_error"} 536
api_counts{api_type="discover",response="live_error"} 0
api_counts{api_type="discover",response="success"} 40720
api_counts{api_type="discover",response="total_error"} 317
api_counts{api_type="edge",response="live_error"} 0
api_counts{api_type="edge",response="success"} 10
api_counts{api_type="edge",response="total_error"} 0
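
The alarm recommendation above can be sketched in Python. This is a minimal illustration, not part of Cloud Resolver: the function names are hypothetical, the threshold of 5 comes from the recommendation above, and the regular expression assumes the label order shown in the example output. How you fetch the metrics text is left to your environment.

```python
import re

# Matches sample lines such as: api_counts{api_type="csp",response="live_error"} 0
# Assumption: labels appear in the order shown in the example output.
SAMPLE_RE = re.compile(
    r'^api_counts\{api_type="(?P<api_type>[^"]+)",'
    r'response="(?P<response>[^"]+)"\}\s+(?P<value>\d+)$'
)


def parse_api_counts(text):
    """Return {(api_type, response): value} from Prometheus exposition text."""
    counts = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:  # comment lines (# HELP, # TYPE) do not match and are skipped
            counts[(match["api_type"], match["response"])] = int(match["value"])
    return counts


def discovery_alarm(counts, threshold=5):
    """True when the discover live_error value exceeds the threshold."""
    return counts.get(("discover", "live_error"), 0) > threshold


SAMPLE = """\
# HELP api_counts API Error, Success, and Live Error counts
# TYPE api_counts counter
api_counts{api_type="discover",response="live_error"} 0
api_counts{api_type="discover",response="success"} 40720
api_counts{api_type="discover",response="total_error"} 317
"""

counts = parse_api_counts(SAMPLE)
print(discovery_alarm(counts))
```

Because live_error resets to 0 after a successful call, a check like this reflects the current state rather than the cumulative error history, which total_error tracks.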

cache_statistics

Cloud Resolver has a short-lived cache to reduce the load of actual DNS calls to the Cloud Service Provider resolver. The cache is also used if increased back pressure causes DNS requests to queue, even if the cached entry is stale. The size, TTL, and back pressure queue are configurable. The default values are sensible, and there is typically no need to adjust them. For more information on configuring the size, TTL, and back pressure queue size, refer to Creating the Cloud Resolver configuration file.

The cache_statistics metric is an integer counter for each of the following cache actions:
  • add—the DNS Response Message was added to the cache
  • hit—a DNS Response Message was found for a DNS Query Message
  • miss—no DNS Response Message was found for a DNS Query Message

There is no need to configure an alarm on any specific value. Since the cache TTL is 15 seconds by default, it is expected that the hit rate will be lower than a fully functional DNS cache.

The following displays an example of the Prometheus metrics for cache_statistics:
# HELP cache_statistics Cache Statistics
# TYPE cache_statistics counter
cache_statistics{action="add"} 77365
cache_statistics{action="hit"} 1087671
cache_statistics{action="miss"} 10
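
Although no alarm is needed, the hit and miss counters can be turned into a hit ratio for dashboards. Because the counters are cumulative, a ratio for a recent window needs the difference between two scrapes. The following Python sketch illustrates this; the function name and the dictionary shape are hypothetical, not part of Cloud Resolver.

```python
def hit_ratio(previous, current):
    """Cache hit ratio between two scrapes of cache_statistics.

    Both arguments are dicts with "hit" and "miss" counter values.
    Returns None when there were no lookups in the window or when a
    counter went backwards (for example, after a restart).
    """
    hits = current["hit"] - previous["hit"]
    misses = current["miss"] - previous["miss"]
    lookups = hits + misses
    if hits < 0 or misses < 0 or lookups == 0:
        return None
    return hits / lookups


# Illustrative values only, not taken from a real deployment.
earlier = {"hit": 100, "miss": 50}
later = {"hit": 130, "miss": 60}
print(hit_ratio(earlier, later))  # 0.75
```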

queries_in_flight

The queries_in_flight metric is an integer gauge that measures the number of DNS queries currently being processed by Cloud Resolver. The query channels are divided into the following protocols:
  • in_tcp—inbound DNS Messages via TCP
  • in_udp—inbound DNS Messages via UDP
  • out_dns—outbound DNS Messages for which Cloud Resolver queries the OS-configured resolver on the host where it is deployed
  • out_remote_dns—outbound DNS Messages that Cloud Resolver forwards directly to a configured remote DNS server, such as a fallback resolver or cloud-configured resolver
  • out_fr—(AWS-only) outbound DNS Messages for which Cloud Resolver communicates with a BlueCat AWS Function Resolver

Because this is a gauge metric, the counts are 0 when Cloud Resolver is idle. During normal query processing, the values are typically 100 times lower than the expected QPS, as DNS transactions are rapid. By default, Cloud Resolver switches to back_pressure mode and only answers from the cache if the in_tcp or in_udp value is greater than 500. BlueCat recommends configuring an alarm if either value reaches 50% of the back pressure limit.

The following displays an example of the Prometheus metrics for queries_in_flight:
# HELP queries_in_flight Queries in Flight
# TYPE queries_in_flight gauge
queries_in_flight{protocol="in_tcp"} 0
queries_in_flight{protocol="in_udp"} 0
queries_in_flight{protocol="out_dns"} 0
queries_in_flight{protocol="out_fr"} 0
queries_in_flight{protocol="out_remote_dns"} 0
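
The 50% recommendation above can be expressed as a small check. This Python sketch is illustrative: the function name is hypothetical, the limit of 500 is the default stated above, and it assumes the gauge values have already been read into a dictionary keyed by protocol.

```python
BACK_PRESSURE_LIMIT = 500  # default back pressure limit described above


def inbound_alarms(gauges, limit=BACK_PRESSURE_LIMIT, fraction=0.5):
    """Return the inbound protocols whose gauge is at or above limit * fraction."""
    threshold = limit * fraction
    return [
        protocol
        for protocol in ("in_tcp", "in_udp")
        if gauges.get(protocol, 0) >= threshold
    ]


# Illustrative values only.
print(inbound_alarms({"in_tcp": 0, "in_udp": 260}))  # ['in_udp']
```

If you change the back pressure limit in the Cloud Resolver configuration, pass the new value as limit so that the alarm threshold follows it.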

query_counts

The query_counts metric is an integer counter that increments for each query received since Cloud Resolver started. Query counts are divided into the following protocols:
  • tcp—queries received via TCP
  • udp—queries received via UDP

This is an informational metric.

The following displays an example of the Prometheus metrics for query_counts:
# HELP query_counts Query Counts
# TYPE query_counts counter
query_counts{protocol="tcp"} 0
query_counts{protocol="udp"} 1165036

response_code_counts

The response_code_counts metric is an integer counter that details the response codes received for queries.
Note: A single query might be processed multiple times if the zone name for the resource records is duplicated with different owners, such as accounts or subscriptions. As a result, the counts might not match the overall query counts.
The response_code_counts are divided into the following resolver_type values:
  • local—the query was answered locally either because it was found in the short term cache or the DNS Response was generated from API discovery data
  • dns—the query was answered using the OS resolver on the host where Cloud Resolver is deployed
  • remote_dns—the query was answered by a DNS resolver identified by the configuration zone
  • fr—(AWS-only) the query was answered via the BlueCat AWS Function Resolver
The metrics for each resolver_type are divided into the following response_code values:
  • noerror - RCODE 0: Query completed successfully.
  • formerr - RCODE 1: DNS Message was corrupt.
  • servfail - RCODE 2: Server failed to complete the request.

    If this is for resolver_type local, Cloud Resolver generates the servfail. This occurs if there are unexpected issues when parsing the message or if Cloud Resolver is in back pressure mode and the query requested is not in the cache, regardless of TTL. For all other resolver_type metrics, this indicates that either the query never returned or the upstream server specifically returned a servfail.

  • nxdomain - RCODE 3: Domain name does not exist.

    If this is for the resolver_type local, it is possible that the zone for the query is not present in the discovered zone map or that the zone was discovered but the FQDN was not found for any record type.

  • notimp - RCODE 4: Function is not implemented.

    Unless an upstream resolver returns this unexpectedly, the cause of the response code is a DNS OpCode other than query, such as a DDNS Update, which is not yet implemented for the Cloud Resolver.

  • refused - RCODE 5: Server refused to answer the query.

    If this is for the resolver_type local, Cloud Resolver is refusing the query because the record type is not supported, such as AXFR or IXFR. For all other resolver_type metrics, the upstream server is refusing the query, which might indicate permission issues with the upstream servers.

  • other - catch all for other response codes
The following displays an example of the Prometheus metrics for response_code_counts:
# HELP response_code_counts Response Code Counts
# TYPE response_code_counts counter
response_code_counts{resolver_type="dns",response_code="formerr"} 0
response_code_counts{resolver_type="dns",response_code="noerror"} 2
response_code_counts{resolver_type="dns",response_code="notimp"} 0
response_code_counts{resolver_type="dns",response_code="nxdomain"} 3
response_code_counts{resolver_type="dns",response_code="other"} 0
response_code_counts{resolver_type="dns",response_code="refused"} 0
response_code_counts{resolver_type="dns",response_code="servfail"} 0
response_code_counts{resolver_type="fr",response_code="formerr"} 0
response_code_counts{resolver_type="fr",response_code="noerror"} 0
response_code_counts{resolver_type="fr",response_code="notimp"} 0
response_code_counts{resolver_type="fr",response_code="nxdomain"} 0
response_code_counts{resolver_type="fr",response_code="other"} 0
response_code_counts{resolver_type="fr",response_code="refused"} 0
response_code_counts{resolver_type="fr",response_code="servfail"} 0
response_code_counts{resolver_type="local",response_code="formerr"} 0
response_code_counts{resolver_type="local",response_code="noerror"} 0
response_code_counts{resolver_type="local",response_code="notimp"} 0
response_code_counts{resolver_type="local",response_code="nxdomain"} 1165028
response_code_counts{resolver_type="local",response_code="other"} 0
response_code_counts{resolver_type="local",response_code="refused"} 0
response_code_counts{resolver_type="local",response_code="servfail"} 0
response_code_counts{resolver_type="remote_dns",response_code="formerr"} 0
response_code_counts{resolver_type="remote_dns",response_code="noerror"} 1
response_code_counts{resolver_type="remote_dns",response_code="notimp"} 0
response_code_counts{resolver_type="remote_dns",response_code="nxdomain"} 2
response_code_counts{resolver_type="remote_dns",response_code="other"} 0
response_code_counts{resolver_type="remote_dns",response_code="refused"} 0
response_code_counts{resolver_type="remote_dns",response_code="servfail"} 0
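
To compare resolver types, the samples can be grouped by resolver_type and summarized, for example as the share of servfail responses. The Python sketch below is one way to do this under stated assumptions: the function names are hypothetical, and the parser handles only the simple label syntax shown in the example output.

```python
import re
from collections import defaultdict

# name{label="value",...} value
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def codes_by_resolver(text):
    """Return {resolver_type: {response_code: count}} from exposition text."""
    totals = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match["name"] != "response_code_counts":
            continue  # skips comments and other metrics
        labels = dict(LABEL_RE.findall(match["labels"]))
        count = int(float(match["value"]))
        totals[labels["resolver_type"]][labels["response_code"]] = count
    return dict(totals)


def servfail_share(codes):
    """Fraction of responses that were servfail, or None if there were none."""
    total = sum(codes.values())
    return codes.get("servfail", 0) / total if total else None


SAMPLE = """\
# TYPE response_code_counts counter
response_code_counts{resolver_type="dns",response_code="noerror"} 2
response_code_counts{resolver_type="dns",response_code="nxdomain"} 3
response_code_counts{resolver_type="dns",response_code="servfail"} 0
"""

by_resolver = codes_by_resolver(SAMPLE)
print(servfail_share(by_resolver["dns"]))  # 0.0
```

As noted above, a single query might be counted more than once, so these shares are best read as trends per resolver type rather than exact per-query rates.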

snapshot_statistics

The snapshot_statistics metric is an integer counter for the Cloud Resolver snapshot service. Snapshots are serialized binary files of the discovered Cloud Resolver data. They are created during a Cloud Resolver shutdown, and the latest snapshot is read when Cloud Resolver starts. This allows Cloud Resolver to start processing queries immediately, without waiting for a successfully completed discovery. Snapshots can also be created on a running Cloud Resolver instance using an HTTP request to the snapshot endpoint. For more information, refer to Creating a snapshot.

The snapshot_statistics metric is divided into the following action values:
  • failed—failed to create or read a snapshot
  • read—successfully read an existing snapshot
  • write—successfully created a new snapshot

Typically, there should never be errors with snapshots. Failing to write a snapshot might be caused by storage issues, such as a full disk, or permission issues on the directory where snapshots are written. You can create an alarm on the failed action for values greater than 0.

The following displays an example of the Prometheus metrics for snapshot_statistics:
# HELP snapshot_statistics Snapshot Statistics
# TYPE snapshot_statistics counter
snapshot_statistics{action="failed"} 0
snapshot_statistics{action="read"} 0
snapshot_statistics{action="write"} 3

uptime

The uptime metric is a single gauge that reports the number of seconds that have passed since Cloud Resolver started. The value resets to 0 if Cloud Resolver restarts.

The following displays an example of the Prometheus metrics for uptime:
# HELP uptime Uptime in Seconds
# TYPE uptime gauge
uptime 2584742
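
Because uptime resets to 0 on restart, a drop between two consecutive scrapes indicates that Cloud Resolver restarted in between. The following Python sketch shows that comparison and a readable rendering of the value; the function name is hypothetical and the approach is an illustration, not a BlueCat-provided check.

```python
from datetime import timedelta


def restarted(previous_uptime, current_uptime):
    """True when uptime went backwards between two scrapes."""
    return current_uptime < previous_uptime


# The example value above, rendered as days and hh:mm:ss.
print(timedelta(seconds=2584742))  # 29 days, 21:59:02

# Illustrative second scrape with a much smaller value.
print(restarted(2584742, 120))  # True
```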