Cloud Resolver exposes real-time metrics through an HTTP channel in Prometheus format. You can pull these metrics using a variety of dashboard and alerting systems. You can use Prometheus directly, but the Prometheus format is also consumable by all modern monitoring and observability toolchains.
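As an illustration of consuming this output without a full Prometheus server, the following is a minimal sketch of a parser for the text exposition format. The function names `parse_metrics` and `get` are hypothetical helpers and are not part of Cloud Resolver; the sketch handles only simple sample lines like those shown on this page.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return {(name, frozenset(label_items)): float} from exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples[(name, frozenset(labels.items()))] = float(value)
    return samples


def get(samples, name, **labels):
    """Look up one sample by metric name and exact label set."""
    return samples[(name, frozenset(labels.items()))]
```

The text passed to `parse_metrics` would come from an HTTP GET against the metrics endpoint. The host, port, and path depend on the deployment and are not assumed here.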
api_counts
- csp—Cloud Service Provider control plane API calls
- edge—BlueCat DNS Edge API calls
- discover—the overall discover process, including both Cloud Service Provider and DNS Edge API calls
- success—the API call or Discovery Process completed successfully
- total_error—the number of API calls or Discovery Processes that resulted in any error
- live_error—returns a value greater than 0 if Cloud Resolver is currently receiving errors from API calls or the overall discovery process. The value resets to 0 after a successful API call or discovery.
Sporadic API errors from either the Cloud Service Provider or DNS Edge are expected, due to timeouts or other issues that quickly resolve themselves. BlueCat recommends configuring an alarm on the live_error value for discovery, especially if the value is greater than 5.
- Increase the API rate limiting configuration on the Cloud Service Provider
- Implement multiple Cloud Resolver instances to capture information from the Cloud Service Provider
- Increase the CRS_POLLING_INTERVAL time so that Cloud Resolver polls the Cloud Service Provider less frequently. For more information, refer to Creating the Cloud Resolver configuration file.
# HELP api_counts API Error, Success, and Live Error counts
# TYPE api_counts counter
api_counts{api_type="csp",response="live_error"} 0
api_counts{api_type="csp",response="success"} 5028352
api_counts{api_type="csp",response="total_error"} 536
api_counts{api_type="discover",response="live_error"} 0
api_counts{api_type="discover",response="success"} 40720
api_counts{api_type="discover",response="total_error"} 317
api_counts{api_type="edge",response="live_error"} 0
api_counts{api_type="edge",response="success"} 10
api_counts{api_type="edge",response="total_error"} 0
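The live_error recommendation can be sketched as a simple threshold check. This is a minimal illustration, assuming the api_counts samples have already been collected into a dictionary keyed by (api_type, response); the function name `live_error_alarms` is hypothetical.

```python
LIVE_ERROR_THRESHOLD = 5  # threshold taken from the recommendation above


def live_error_alarms(api_counts, threshold=LIVE_ERROR_THRESHOLD):
    """Return the api_type values whose live_error count exceeds the threshold.

    api_counts maps (api_type, response) -> current metric value.
    """
    return sorted(
        api_type
        for (api_type, response), value in api_counts.items()
        if response == "live_error" and value > threshold
    )
```

With the sample values shown for api_counts, the result is an empty list, because every live_error value is 0.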
cache_statistics
Cloud Resolver has a short-lived cache to reduce the load of actual DNS calls on the Cloud Service Provider resolver. The cache is also used, even if it is stale, when increased back pressure causes DNS requests to queue. The size, TTL, and back pressure queue are configurable. The default values are sensible, and there is typically no need to adjust them. For more information on configuring the size, TTL, and back pressure queue size, refer to Creating the Cloud Resolver configuration file.
- add—the DNS Response Message was added to the cache
- hit—a DNS Response Message was found for a DNS Query Message
- miss—no DNS Response Message was found for a DNS Query Message
There is no need to configure an alarm on any specific value. Since the cache TTL is 15 seconds by default, it is expected that the hit rate will be lower than a fully functional DNS cache.
# HELP cache_statistics Cache Statistics
# TYPE cache_statistics counter
cache_statistics{action="add"} 77365
cache_statistics{action="hit"} 1087671
cache_statistics{action="miss"} 10
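Although no alarm is needed, the hit rate can still be tracked for information. Because these are cumulative counters, a rate over a time window is computed from the difference between two scrapes. The following is a minimal sketch; the function name `hit_ratio_between` and the dictionary layout are assumptions made for illustration.

```python
def hit_ratio_between(previous, current):
    """Cache hit ratio between two scrapes of cache_statistics.

    previous and current are dicts with 'hit' and 'miss' counter values.
    Returns None when there were no lookups in the window or when a
    counter went backwards (for example, after a restart).
    """
    hits = current["hit"] - previous["hit"]
    misses = current["miss"] - previous["miss"]
    if hits < 0 or misses < 0:
        return None
    lookups = hits + misses
    return hits / lookups if lookups > 0 else None
```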
queries_in_flight
- in_tcp—inbound DNS Messages via TCP
- in_udp—inbound DNS Messages via UDP
- out_dns—outbound DNS Messages for which Cloud Resolver queries the OS-configured resolver on the host where it is deployed.
- out_remote_dns—outbound DNS Messages that Cloud Resolver forwards directly to a configured remote DNS server, such as a fallback resolver or cloud-configured resolver.
- out_fr—(AWS-only) outbound DNS Messages for which Cloud Resolver communicates with a BlueCat AWS Function Resolver
Because this is a gauge metric, the counts are 0 when Cloud Resolver is idle. During normal query processing, the values are typically 100 times lower than the expected QPS, as DNS transactions are rapid. By default, Cloud Resolver switches to back_pressure mode and only answers from the cache if the in_tcp or in_udp value is greater than 500. BlueCat recommends configuring an alarm if the number reaches 50% of the back pressure limit.
# HELP queries_in_flight Queries in Flight
# TYPE queries_in_flight gauge
queries_in_flight{protocol="in_tcp"} 0
queries_in_flight{protocol="in_udp"} 0
queries_in_flight{protocol="out_dns"} 0
queries_in_flight{protocol="out_fr"} 0
queries_in_flight{protocol="out_remote_dns"} 0
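The 50% recommendation can be sketched as follows. This is a minimal illustration that assumes the default limit of 500 and assumes that in_tcp and in_udp are each compared to the limit individually; the function name `inbound_alarm` is hypothetical, and the limit should match the configured back pressure value.

```python
BACK_PRESSURE_LIMIT = 500  # default described above; configurable
ALARM_FRACTION = 0.5       # alarm at 50% of the limit


def inbound_alarm(in_tcp, in_udp, limit=BACK_PRESSURE_LIMIT,
                  fraction=ALARM_FRACTION):
    """Return True when either inbound gauge reaches the alarm level."""
    return max(in_tcp, in_udp) >= limit * fraction
```

With the defaults, the alarm level is 250 queries in flight on either inbound protocol.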
query_counts
- tcp—queries received via TCP
- udp—queries received via UDP
This is an informational metric.
# HELP query_counts Query Counts
# TYPE query_counts counter
query_counts{protocol="tcp"} 0
query_counts{protocol="udp"} 1165036
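Since query_counts is a cumulative counter, a queries-per-second figure can be derived from two scrapes. The following is a minimal sketch; the function name `queries_per_second` is hypothetical, and treating a decreasing value as a counter reset after a restart is an assumption based on standard Prometheus counter behavior.

```python
def queries_per_second(prev_total, cur_total, seconds):
    """Average QPS between two scrapes taken `seconds` apart."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = cur_total - prev_total
    if delta < 0:
        # Assumed counter reset: count only what accumulated since the reset.
        delta = cur_total
    return delta / seconds
```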
response_code_counts
- local—the query was answered locally either because it was found in the short term cache or the DNS Response was generated from API discovery data
- dns—the query was answered using the OS resolver on the host where Cloud Resolver is deployed
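- remote_dns—the query was answered by a configured remote DNS server, such as a fallback resolver or cloud-configured resolver
- fr—(AWS-only) the query was answered via the BlueCat AWS Function Resolver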
- noerror - RCODE 0: Query completed successfully.
- formerr - RCODE 1: DNS Message was corrupt.
- servfail - RCODE 2: Server failed to complete the request.
If this is for resolver_type local, Cloud Resolver generates the servfail. This occurs if there are unexpected issues when parsing the message or if Cloud Resolver is in back pressure mode and the query requested is not in the cache, regardless of TTL. For all other resolver_type metrics, this indicates that either the query never returned or the upstream server specifically returned a servfail.
- nxdomain - RCODE 3: Domain name does not exist.
If this is for the resolver_type local, it is possible that the zone for the query is not present in the discovered zone map or that the zone was discovered but the FQDN was not found for any record type.
- notimp - RCODE 4: Function is not implemented.
Unless an upstream resolver returns this unexpectedly, the cause of the response code is a DNS OpCode other than query, such as a DDNS Update, which is not yet implemented for the Cloud Resolver.
- refused - RCODE 5: Server refused to answer the query.
If this is for the resolver_type local, Cloud Resolver is refusing the query because the record type is not supported, such as AXFR or IXFR. For all other resolver_type metrics, the upstream server is refusing the query, which might indicate permission issues with the upstream servers.
- other - catch all for other response codes
# HELP response_code_counts Response Code Counts
# TYPE response_code_counts counter
response_code_counts{resolver_type="dns",response_code="formerr"} 0
response_code_counts{resolver_type="dns",response_code="noerror"} 2
response_code_counts{resolver_type="dns",response_code="notimp"} 0
response_code_counts{resolver_type="dns",response_code="nxdomain"} 3
response_code_counts{resolver_type="dns",response_code="other"} 0
response_code_counts{resolver_type="dns",response_code="refused"} 0
response_code_counts{resolver_type="dns",response_code="servfail"} 0
response_code_counts{resolver_type="fr",response_code="formerr"} 0
response_code_counts{resolver_type="fr",response_code="noerror"} 0
response_code_counts{resolver_type="fr",response_code="notimp"} 0
response_code_counts{resolver_type="fr",response_code="nxdomain"} 0
response_code_counts{resolver_type="fr",response_code="other"} 0
response_code_counts{resolver_type="fr",response_code="refused"} 0
response_code_counts{resolver_type="fr",response_code="servfail"} 0
response_code_counts{resolver_type="local",response_code="formerr"} 0
response_code_counts{resolver_type="local",response_code="noerror"} 0
response_code_counts{resolver_type="local",response_code="notimp"} 0
response_code_counts{resolver_type="local",response_code="nxdomain"} 1165028
response_code_counts{resolver_type="local",response_code="other"} 0
response_code_counts{resolver_type="local",response_code="refused"} 0
response_code_counts{resolver_type="local",response_code="servfail"} 0
response_code_counts{resolver_type="remote_dns",response_code="formerr"} 0
response_code_counts{resolver_type="remote_dns",response_code="noerror"} 1
response_code_counts{resolver_type="remote_dns",response_code="notimp"} 0
response_code_counts{resolver_type="remote_dns",response_code="nxdomain"} 2
response_code_counts{resolver_type="remote_dns",response_code="other"} 0
response_code_counts{resolver_type="remote_dns",response_code="refused"} 0
response_code_counts{resolver_type="remote_dns",response_code="servfail"} 0
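To judge whether a response code is unusually common for one resolver type, its share of that resolver type's total can be computed. The following is a minimal sketch, assuming the samples are held in a dictionary keyed by (resolver_type, response_code); the function name `rcode_share` is hypothetical.

```python
def rcode_share(counts, resolver_type, response_code):
    """Fraction of one resolver_type's responses that carry response_code.

    counts maps (resolver_type, response_code) -> counter value.
    Returns 0.0 when the resolver type has not answered any queries.
    """
    total = sum(
        value for (rtype, _), value in counts.items() if rtype == resolver_type
    )
    if total == 0:
        return 0.0
    return counts.get((resolver_type, response_code), 0) / total
```

For example, in the sample output the dns resolver type has 2 noerror and 3 nxdomain responses, so the nxdomain share is 3 out of 5.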
snapshot_statistics
The snapshot_statistics is an integer counter metric for the Cloud Resolver snapshot service. Snapshots are serialized binary files of the discovered Cloud Resolver data. They are created during a Cloud Resolver shutdown and the latest snapshot is read when Cloud Resolver starts. This allows Cloud Resolver to immediately start processing queries without the need to wait for a successfully completed discovery. Snapshots can also be created on a running Cloud Resolver instance using an HTTP request to the snapshot endpoint. For more information, refer to Creating a snapshot.
- failed—failed to create or read a snapshot
- read—successfully read an existing snapshot
- write—successfully created a new snapshot
Typically, there should never be errors with snapshots. Failing to write a snapshot might be caused by storage issues, such as a full disk, or permission issues on the directory where snapshots are written. Alarms can be created on the failed metric for values greater than 0.
# HELP snapshot_statistics Snapshot Statistics
# TYPE snapshot_statistics counter
snapshot_statistics{action="failed"} 0
snapshot_statistics{action="read"} 0
snapshot_statistics{action="write"} 3
uptime
The uptime is a simple single metric gauge of the number of seconds that have passed since Cloud Resolver started. The value resets to 0 if Cloud Resolver restarts.
# HELP uptime Uptime in Seconds
# TYPE uptime gauge
uptime 2584742
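Because the value resets to 0 on restart, a drop between two scrapes indicates that Cloud Resolver restarted. The following is a minimal sketch of that check, together with a helper that renders the value in a readable form; both function names are hypothetical.

```python
from datetime import timedelta


def restarted(prev_uptime, cur_uptime):
    """True when uptime went backwards between two scrapes."""
    return cur_uptime < prev_uptime


def format_uptime(seconds):
    """Render an uptime value in seconds as days and HH:MM:SS."""
    return str(timedelta(seconds=int(seconds)))
```

For the sample value, `format_uptime(2584742)` returns `29 days, 21:59:02`.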