Cloud Resolver exposes real-time metrics through an HTTP channel in Prometheus format. You can pull these metrics using different dashboard and alerting systems. You can use Prometheus directly, but this is not required: the Prometheus format is consumable by all modern monitoring and observability toolchains.
By default, the Prometheus endpoint is the following URL:
http://<cloud_resolver_id_or_hostname>:9090
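As an illustration of consuming the endpoint without a full Prometheus deployment, the following minimal Python sketch fetches the metrics text and parses it into a dictionary. The function names parse_metrics and fetch_metrics are hypothetical, and the label parser is deliberately simple (it assumes label values contain no escaped quotes or braces, which holds for the sample output on this page).

```python
import re
import urllib.request

# Matches sample lines such as: query_counts{protocol="udp"} 1165036
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into {(name, frozenset(labels)): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = frozenset(LABEL_RE.findall(match.group("labels") or ""))
        samples[(match.group("name"), labels)] = float(match.group("value"))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw metrics text; the caller supplies the endpoint URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

To use it, pass the endpoint URL shown above (with the placeholder replaced by your Cloud Resolver ID or hostname) to fetch_metrics, then pass the returned text to parse_metrics.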
api_counts
The api_counts metric is divided into the following api_type values:
- csp: Cloud Service Provider control plane API calls
- edge: BlueCat DNS Edge API calls
- discover: the overall discovery process, including both Cloud Service Provider and DNS Edge API calls
For each api_type, there are the following three response metrics:
- success: the number of API calls or discovery processes that completed successfully
- total_error: the number of API calls or discovery processes that resulted in any error
- live_error: returns a value greater than 0 if Cloud Resolver is currently receiving errors from API calls or the overall discovery process. The value resets to 0 after a successful API call or discovery.
Sporadic API errors from either the Cloud Service Provider or DNS Edge are expected, due to timeouts or other issues that quickly resolve themselves. BlueCat recommends configuring an alarm on the live_error metric for discovery, especially if the value is greater than 5.
If errors persist, consider the following actions:
- Increase the API rate limiting configuration on the Cloud Service Provider
- Implement multiple Cloud Resolver instances to capture information from the Cloud Service Provider
- Increase the CRS_POLLING_INTERVAL time so that Cloud Resolver polls the Cloud Service Provider less frequently. For more information, refer to Creating the Cloud Resolver configuration file.
The following is an example of api_counts:
# HELP api_counts API Error, Success, and Live Error counts
# TYPE api_counts counter
api_counts{api_type="csp",response="live_error"} 0
api_counts{api_type="csp",response="success"} 5028352
api_counts{api_type="csp",response="total_error"} 536
api_counts{api_type="discover",response="live_error"} 0
api_counts{api_type="discover",response="success"} 40720
api_counts{api_type="discover",response="total_error"} 317
api_counts{api_type="edge",response="live_error"} 0
api_counts{api_type="edge",response="success"} 10
api_counts{api_type="edge",response="total_error"} 0
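The recommended live_error alarm for discovery can be sketched as a simple threshold check over the metrics text. This is a minimal sketch, not BlueCat's implementation: the function names are hypothetical, and the regular expression assumes the labels appear in the same order as in the example output above.

```python
import re

LIVE_ERROR_THRESHOLD = 5  # value suggested in the text above for discovery

# Assumes api_type precedes response, as in the example output.
LIVE_ERROR_RE = re.compile(
    r'^api_counts\{api_type="(\w+)",response="live_error"\}\s+(\d+)',
    re.MULTILINE,
)


def live_errors(metrics_text):
    """Return {api_type: live_error value} read from api_counts lines."""
    return {
        api_type: int(value)
        for api_type, value in LIVE_ERROR_RE.findall(metrics_text)
    }


def discovery_needs_attention(metrics_text, threshold=LIVE_ERROR_THRESHOLD):
    """True when the discover live_error value exceeds the threshold."""
    return live_errors(metrics_text).get("discover", 0) > threshold
```

In a real deployment the same condition would more likely be expressed as an alert rule in your monitoring system; the sketch only shows the logic.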
cache_statistics
Cloud Resolver has a short-lived cache to reduce the load of DNS calls to the Cloud Service Provider resolver. The cache is also used if increased back pressure causes DNS requests to queue, even if the cached entry is stale. The size, TTL, and back pressure queue are configurable. The default values are sensible, and there is typically no need to adjust them. For more information on configuring the size, TTL, and back pressure queue size, refer to Creating the Cloud Resolver configuration file.
The cache_statistics integer counters for each cache action are as follows:
- add: the DNS Response Message was added to the cache
- hit: a DNS Response Message was found for a DNS Query Message
- miss: no DNS Response Message was found for a DNS Query Message
There is no need to configure an alarm on any specific value. Since the cache TTL is 15 seconds by default, the hit rate is expected to be lower than that of a fully functional DNS cache.
The following is an example of cache_statistics:
# HELP cache_statistics Cache Statistics
# TYPE cache_statistics counter
cache_statistics{action="add"} 77365
cache_statistics{action="hit"} 1087671
cache_statistics{action="miss"} 10
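Although no alarm is needed, the hit and miss counters can be combined into a hit ratio for dashboards. The following is a minimal sketch with hypothetical function names. It assumes that the counters restart from 0 when Cloud Resolver restarts, so a counter that goes backwards is treated as "no data" for that interval.

```python
def cache_hit_ratio(hits, misses):
    """Fraction of cache lookups answered from the cache; None if no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None
    return hits / lookups


def interval_hit_ratio(previous, current):
    """Hit ratio over the interval between two scrapes of the counters.

    Each argument is a dict such as {"hit": 10, "miss": 2}. Returns None if a
    counter went backwards (assumed restart) or nothing was looked up.
    """
    hit_delta = current["hit"] - previous["hit"]
    miss_delta = current["miss"] - previous["miss"]
    if hit_delta < 0 or miss_delta < 0:
        return None
    return cache_hit_ratio(hit_delta, miss_delta)
```

The interval form is usually more informative than the lifetime ratio, because it reflects recent behavior rather than everything since startup.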
queries_in_flight
queries_in_flight is an integer gauge metric that measures the DNS queries currently being processed by Cloud Resolver. The query channels are divided into the following protocols:
- in_tcp: inbound DNS Messages via TCP
- in_udp: inbound DNS Messages via UDP
- out_dns: outbound DNS Messages for which Cloud Resolver queries the OS-configured resolver on the host where it is deployed
- out_remote_dns: outbound DNS Messages that Cloud Resolver forwards directly to a configured remote DNS server, such as a fallback resolver or cloud-configured resolver
- out_fr: (AWS-only) outbound DNS Messages for which Cloud Resolver communicates with a BlueCat AWS Function Resolver
Because this is a gauge metric, the counts are 0 when Cloud Resolver is idle. During normal query processing, the values are typically 100 times lower than the expected QPS, as DNS transactions are rapid. By default, Cloud Resolver switches to back_pressure mode and only answers from the cache if the in_tcp or in_udp value is greater than 500. BlueCat recommends configuring an alarm if the number reaches 50% of the back pressure limit.
The following is an example of queries_in_flight:
# HELP queries_in_flight Queries in Flight
# TYPE queries_in_flight gauge
queries_in_flight{protocol="in_tcp"} 0
queries_in_flight{protocol="in_udp"} 0
queries_in_flight{protocol="out_dns"} 0
queries_in_flight{protocol="out_fr"} 0
queries_in_flight{protocol="out_remote_dns"} 0
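The 50% alarm recommendation can be sketched as follows. The function name and constants are hypothetical. The limit of 500 is the default described above, so pass your own value if you have changed the back pressure configuration.

```python
BACK_PRESSURE_LIMIT = 500  # default limit described above; configurable
ALARM_FRACTION = 0.5       # alarm at 50% of the limit


def inbound_alarm(in_flight, limit=BACK_PRESSURE_LIMIT, fraction=ALARM_FRACTION):
    """Return the inbound protocols whose in-flight count reached fraction * limit.

    in_flight maps protocol names (for example "in_tcp") to gauge values.
    Only the inbound protocols are checked; outbound gauges are ignored.
    """
    threshold = limit * fraction
    return sorted(
        protocol
        for protocol in ("in_tcp", "in_udp")
        if in_flight.get(protocol, 0) >= threshold
    )
```

With the defaults, the alarm triggers once either inbound gauge reaches 250.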
query_counts
query_counts is an integer counter metric that increments for each query received since Cloud Resolver started. Query counts are divided into the following protocols:
- tcp: queries received via TCP
- udp: queries received via UDP
This is an informational metric.
The following is an example of query_counts:
# HELP query_counts Query Counts
# TYPE query_counts counter
query_counts{protocol="tcp"} 0
query_counts{protocol="udp"} 1165036
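Because query_counts is a counter that starts again when Cloud Resolver starts, an average query rate can be derived from two scrapes. The following minimal sketch uses a hypothetical function name and assumes that a lower value than the previous scrape means the counter restarted from 0.

```python
def queries_per_second(previous_total, current_total, elapsed_seconds):
    """Average query rate between two scrapes of query_counts.

    If the counter is lower than before, assume Cloud Resolver restarted and
    the counter began again from 0, so only the current total is counted.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if current_total < previous_total:
        delta = current_total
    else:
        delta = current_total - previous_total
    return delta / elapsed_seconds
```

For example, two scrapes 60 seconds apart that read 1000 and then 1600 queries give an average of 10 queries per second.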
response_code_counts
response_code_counts is an integer counter metric that details the response codes received for queries. response_code_counts are divided into the following resolver_type values:
- local: the query was answered locally, either because it was found in the short-term cache or because the DNS Response was generated from API discovery data
- dns: the query was answered using the OS resolver on the host where Cloud Resolver is deployed
- remote_dns: the query was answered by a DNS resolver identified by the configuration zone
- fr: (AWS-only) the query was answered via the BlueCat AWS Function Resolver
- auth: the query was answered by Cloud Resolver that is configured as an Authority for discovered and enumerated zones that are not reachable through upstream DNS resolution
The resolver_type metrics are divided into the following response_code values:
- noerror (RCODE 0): Query completed successfully.
- formerr (RCODE 1): DNS Message was corrupt.
- servfail (RCODE 2): Server failed to complete the request. If this is for resolver_type local, Cloud Resolver generates the servfail. This occurs if there are unexpected issues when parsing the message or if Cloud Resolver is in back pressure mode and the query requested is not in the cache, regardless of TTL. For all other resolver_type metrics, this indicates that either the query never returned or the upstream server specifically returned a servfail.
- nxdomain (RCODE 3): Domain name does not exist. If this is for resolver_type local, it is possible that the zone for the query is not present in the discovered zone map, or that the zone was discovered but the FQDN was not found for any record type.
- notimp (RCODE 4): Function is not implemented. Unless an upstream resolver returns this unexpectedly, the cause of the response code is a DNS OpCode other than query, such as a DDNS Update, which is not yet implemented for Cloud Resolver.
- refused (RCODE 5): Server refused to answer the query. If this is for resolver_type local, Cloud Resolver is refusing the query because the record type is not supported, such as AXFR or IXFR. For all other resolver_type metrics, the upstream server is refusing the query, which might indicate permission issues with the upstream servers.
- other: catch-all for other response codes.
The following is an example of response_code_counts:
# HELP response_code_counts Response Code Counts
# TYPE response_code_counts counter
response_code_counts{resolver_type="dns",response_code="formerr"} 0
response_code_counts{resolver_type="dns",response_code="noerror"} 2
response_code_counts{resolver_type="dns",response_code="notimp"} 0
response_code_counts{resolver_type="dns",response_code="nxdomain"} 3
response_code_counts{resolver_type="dns",response_code="other"} 0
response_code_counts{resolver_type="dns",response_code="refused"} 0
response_code_counts{resolver_type="dns",response_code="servfail"} 0
response_code_counts{resolver_type="fr",response_code="formerr"} 0
response_code_counts{resolver_type="fr",response_code="noerror"} 0
response_code_counts{resolver_type="fr",response_code="notimp"} 0
response_code_counts{resolver_type="fr",response_code="nxdomain"} 0
response_code_counts{resolver_type="fr",response_code="other"} 0
response_code_counts{resolver_type="fr",response_code="refused"} 0
response_code_counts{resolver_type="fr",response_code="servfail"} 0
response_code_counts{resolver_type="local",response_code="formerr"} 0
response_code_counts{resolver_type="local",response_code="noerror"} 0
response_code_counts{resolver_type="local",response_code="notimp"} 0
response_code_counts{resolver_type="local",response_code="nxdomain"} 1165028
response_code_counts{resolver_type="local",response_code="other"} 0
response_code_counts{resolver_type="local",response_code="refused"} 0
response_code_counts{resolver_type="local",response_code="servfail"} 0
response_code_counts{resolver_type="remote_dns",response_code="formerr"} 0
response_code_counts{resolver_type="remote_dns",response_code="noerror"} 1
response_code_counts{resolver_type="remote_dns",response_code="notimp"} 0
response_code_counts{resolver_type="remote_dns",response_code="nxdomain"} 2
response_code_counts{resolver_type="remote_dns",response_code="other"} 0
response_code_counts{resolver_type="remote_dns",response_code="refused"} 0
response_code_counts{resolver_type="remote_dns",response_code="servfail"} 0
response_code_counts{resolver_type="auth",response_code="formerr"} 0
response_code_counts{resolver_type="auth",response_code="noerror"} 0
response_code_counts{resolver_type="auth",response_code="notimp"} 0
response_code_counts{resolver_type="auth",response_code="nxdomain"} 0
response_code_counts{resolver_type="auth",response_code="other"} 0
response_code_counts{resolver_type="auth",response_code="refused"} 0
response_code_counts{resolver_type="auth",response_code="servfail"} 0
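To make the raw counters easier to read, the share of a given response code per resolver_type can be computed from the metrics text. The following is a minimal sketch with a hypothetical function name. The regular expression assumes the label order shown in the example output, and resolver types with no responses are omitted to avoid dividing by zero.

```python
import re
from collections import defaultdict

# Assumes resolver_type precedes response_code, as in the example output.
LINE_RE = re.compile(
    r'^response_code_counts\{resolver_type="(\w+)",response_code="(\w+)"\}\s+(\d+)',
    re.MULTILINE,
)


def response_code_share(metrics_text, response_code):
    """Per resolver_type, the share of responses that carried response_code."""
    totals = defaultdict(int)
    matching = defaultdict(int)
    for resolver_type, code, value in LINE_RE.findall(metrics_text):
        totals[resolver_type] += int(value)
        if code == response_code:
            matching[resolver_type] += int(value)
    return {
        resolver_type: matching[resolver_type] / total
        for resolver_type, total in totals.items()
        if total > 0  # skip resolver types that have answered nothing
    }
```

For instance, calling response_code_share(text, "servfail") highlights which resolver path is producing server failures.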
snapshot_statistics
snapshot_statistics is an integer counter metric for the Cloud Resolver snapshot service. Snapshots are serialized binary files of the discovered Cloud Resolver data. They are created during a Cloud Resolver shutdown, and the latest snapshot is read when Cloud Resolver starts. This allows Cloud Resolver to immediately start processing queries without waiting for a successfully completed discovery. Snapshots can also be created on a running Cloud Resolver instance using an HTTP request to the snapshot endpoint. For more information, refer to Creating a snapshot.
snapshot_statistics comprises the following action metrics:
- failed: failed to create or read a snapshot
- read: successfully read an existing snapshot
- write: successfully created a new snapshot
Typically, there should never be errors with snapshots. Failing to write a snapshot might be caused by storage issues, such as a full disk, or permission issues on the directory where snapshots are written. Alarms can be created on the failed metric for values greater than 0.
The following is an example of snapshot_statistics:
# HELP snapshot_statistics Snapshot Statistics
# TYPE snapshot_statistics counter
snapshot_statistics{action="failed"} 0
snapshot_statistics{action="read"} 0
snapshot_statistics{action="write"} 3
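The alarm on the failed counter can be sketched as below. Because a counter that has exceeded 0 stays above 0 until the next restart, this hypothetical helper also reports how many failures were added since the previous scrape. That second value is an illustrative addition, not something the source prescribes.

```python
def snapshot_alarm(previous_failed, current_failed):
    """Describe snapshot failures seen at the latest scrape.

    Returns (any_failures, new_failures): whether the failed counter is above
    0 at all, and how many failures were added since the previous scrape.
    A counter that went backwards (assumed restart) yields 0 new failures.
    """
    new_failures = max(current_failed - previous_failed, 0)
    return current_failed > 0, new_failures
```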
uptime
uptime is a single gauge metric that reports the number of seconds that have passed since Cloud Resolver started. The value resets to 0 if Cloud Resolver restarts.
The following is an example of uptime:
# HELP uptime Uptime in Seconds
# TYPE uptime gauge
uptime 2584742
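Since uptime resets to 0 on restart, comparing two scrapes is a simple way to detect a restart, and the raw seconds can be rendered in a more readable form. The following is a minimal sketch with hypothetical function names.

```python
from datetime import timedelta


def format_uptime(seconds):
    """Render the uptime gauge as days, hours, minutes, and seconds."""
    delta = timedelta(seconds=int(seconds))
    hours, remainder = divmod(delta.seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{delta.days}d {hours:02d}h {minutes:02d}m {secs:02d}s"


def restarted(previous_uptime, current_uptime):
    """True if uptime went down between scrapes, implying a restart."""
    return current_uptime < previous_uptime


print(format_uptime(2584742))  # value from the example above: 29d 21h 59m 02s
```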