Version: 24.07.01

Metrics

Kafka is a distributed event streaming platform capable of handling a large number of events per day, enabling real-time data processing and integration across various applications and systems. Monitoring its metrics is crucial for ensuring performance, stability, and reliability. The following is a list of key Kafka metrics in PDS. Understanding these metrics will help administrators optimize performance, troubleshoot issues, and ensure the Kafka cluster runs smoothly.

Broker metrics

Description | MBean name | Normal value
Message in rate | kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=([-.\w]+) | Incoming message rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Byte in rate from clients | kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=([-.\w]+) | Byte in (from the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Byte in rate from other brokers | kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec | Byte in (from the other brokers) rate across all topics.
Controller Request rate from Broker | kafka.controller:type=ControllerChannelManager,name=RequestRateAndQueueTimeMs,brokerId=([0-9]+) | The rate (requests per second) at which the ControllerChannelManager takes requests from the queue of the given broker, and the time a request stays in this queue before it is taken from the queue.
Controller Event queue size | kafka.controller:type=ControllerEventManager,name=EventQueueSize | Size of the ControllerEventManager's queue.
Controller Event queue time | kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs | Time it takes for any event (except the Idle event) to wait in the ControllerEventManager's queue before being processed.
Request rate | kafka.network:type=RequestMetrics,name=RequestsPerSec,request={Produce|FetchConsumer|FetchFollower},version=([0-9]+) |
Error rate | kafka.network:type=RequestMetrics,name=ErrorsPerSec,request=([-.\w]+),error=([-.\w]+) | Number of errors in responses counted per-request-type, per-error-code. If a response contains multiple errors, all are counted. error=NONE indicates successful responses.
Produce request rate | kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,topic=([-.\w]+) | Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Fetch request rate | kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,topic=([-.\w]+) | Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Failed produce request rate | kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,topic=([-.\w]+) | Failed Produce request rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Failed fetch request rate | kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,topic=([-.\w]+) | Failed Fetch request (from clients or followers) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Request size in bytes | kafka.network:type=RequestMetrics,name=RequestBytes,request=([-.\w]+) | Size of requests for each request type.
Temporary memory size in bytes | kafka.network:type=RequestMetrics,name=TemporaryMemoryBytes,request={Produce|Fetch} | Temporary memory used for message format conversions and decompression.
Message conversion time | kafka.network:type=RequestMetrics,name=MessageConversionsTimeMs,request={Produce|Fetch} | Time in milliseconds spent on message format conversions.
Message conversion rate | kafka.server:type=BrokerTopicMetrics,name={Produce|Fetch}MessageConversionsPerSec,topic=([-.\w]+) | Message format conversion rate, for Produce or Fetch requests, per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Request Queue Size | kafka.network:type=RequestChannel,name=RequestQueueSize | Size of the request queue.
Byte out rate to clients | kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=([-.\w]+) | Byte out (to the clients) rate per topic. Omitting 'topic=(...)' will yield the all-topic rate.
Byte out rate to other brokers | kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesOutPerSec | Byte out (to the other brokers) rate across all topics.
Rejected byte rate | kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,topic=([-.\w]+) | Rejected byte rate per topic, due to the record batch size being greater than max.message.bytes configuration. Omitting 'topic=(...)' will yield the all-topic rate.
Message validation failure rate due to no key specified for compacted topic | kafka.server:type=BrokerTopicMetrics,name=NoKeyCompactedTopicRecordsPerSec | 0
Message validation failure rate due to invalid magic number | kafka.server:type=BrokerTopicMetrics,name=InvalidMagicNumberRecordsPerSec | 0
Message validation failure rate due to incorrect crc checksum | kafka.server:type=BrokerTopicMetrics,name=InvalidMessageCrcRecordsPerSec | 0
Message validation failure rate due to non-continuous offset or sequence number in batch | kafka.server:type=BrokerTopicMetrics,name=InvalidOffsetOrSequenceRecordsPerSec | 0
Log flush rate and time | kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs |
# of offline log directories | kafka.log:type=LogManager,name=OfflineLogDirectoryCount | 0
Leader election rate | kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs | non-zero when there are broker failures
Unclean leader election rate | kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec | 0
Is controller active on broker | kafka.controller:type=KafkaController,name=ActiveControllerCount | only one broker in the cluster should have 1
Pending topic deletes | kafka.controller:type=KafkaController,name=TopicsToDeleteCount |
Pending replica deletes | kafka.controller:type=KafkaController,name=ReplicasToDeleteCount |
Ineligible pending topic deletes | kafka.controller:type=KafkaController,name=TopicsIneligibleToDeleteCount |
Ineligible pending replica deletes | kafka.controller:type=KafkaController,name=ReplicasIneligibleToDeleteCount |
# of under replicated partitions (|ISR| < |all replicas|) | kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions | 0
# of under minIsr partitions (|ISR| < min.insync.replicas) | kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount | 0
# of at minIsr partitions (|ISR| = min.insync.replicas) | kafka.server:type=ReplicaManager,name=AtMinIsrPartitionCount | 0
Producer Id counts | kafka.server:type=ReplicaManager,name=ProducerIdCount | Count of all producer ids created by transactional and idempotent producers in each replica on the broker
Partition counts | kafka.server:type=ReplicaManager,name=PartitionCount | mostly even across brokers
Offline Replica counts | kafka.server:type=ReplicaManager,name=OfflineReplicaCount | 0
Leader replica counts | kafka.server:type=ReplicaManager,name=LeaderCount | mostly even across brokers
ISR shrink rate | kafka.server:type=ReplicaManager,name=IsrShrinksPerSec | If a broker goes down, ISR for some of the partitions will shrink. When that broker is up again, ISR will be expanded once the replicas are fully caught up. Other than that, the expected value for both ISR shrink rate and expansion rate is 0.
ISR expansion rate | kafka.server:type=ReplicaManager,name=IsrExpandsPerSec | See above
Failed ISR update rate | kafka.server:type=ReplicaManager,name=FailedIsrUpdatesPerSec | 0
Max lag in messages between follower and leader replicas | kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica | lag should be proportional to the maximum batch size of a produce request.
Lag in messages per follower replica | kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+) | lag should be proportional to the maximum batch size of a produce request.
Requests waiting in the producer purgatory | kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce | non-zero if acks=-1 is used
Requests waiting in the fetch purgatory | kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch | size depends on fetch.max.wait.ms in the consumer
Request total time | kafka.network:type=RequestMetrics,name=TotalTimeMs,request={Produce|FetchConsumer|FetchFollower} | broken into queue, local, remote and response send time
Time the request waits in the request queue | kafka.network:type=RequestMetrics,name=RequestQueueTimeMs,request={Produce|FetchConsumer|FetchFollower} |
Time the request is processed at the leader | kafka.network:type=RequestMetrics,name=LocalTimeMs,request={Produce|FetchConsumer|FetchFollower} |
Time the request waits for the follower | kafka.network:type=RequestMetrics,name=RemoteTimeMs,request={Produce|FetchConsumer|FetchFollower} | non-zero for produce requests when acks=-1
Time the request waits in the response queue | kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs,request={Produce|FetchConsumer|FetchFollower} |
Time to send the response | kafka.network:type=RequestMetrics,name=ResponseSendTimeMs,request={Produce|FetchConsumer|FetchFollower} |
Number of messages the consumer lags behind the producer by. Published by the consumer, not broker. | kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max |
The average fraction of time the network processors are idle | kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent | between 0 and 1, ideally > 0.3
The number of connections disconnected on a processor due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication | kafka.server:type=socket-server-metrics,listener=[SASL_PLAINTEXT|SASL_SSL],networkProcessor=<#>,name=expired-connections-killed-count | ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this (listener, processor) combination
The total number of connections disconnected, across all processors, due to a client not re-authenticating and then using the connection beyond its expiration time for anything other than re-authentication | kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount | ideally 0 when re-authentication is enabled, implying there are no longer any older, pre-2.2.0 clients connecting to this broker
The average fraction of time the request handler threads are idle | kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent | between 0 and 1, ideally > 0.3
Bandwidth quota metrics per (user, client-id), user or client-id | kafka.server:type={Produce|Fetch},user=([-.\w]+),client-id=([-.\w]+) | Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. byte-rate indicates the data produce/consume rate of the client in bytes/sec. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Request quota metrics per (user, client-id), user or client-id | kafka.server:type=Request,user=([-.\w]+),client-id=([-.\w]+) | Two attributes. throttle-time indicates the amount of time in ms the client was throttled. Ideally = 0. request-time indicates the percentage of time spent in broker network and I/O threads to process requests from client group. For (user, client-id) quotas, both user and client-id are specified. If per-client-id quota is applied to the client, user is not specified. If per-user quota is applied, client-id is not specified.
Requests exempt from throttling | kafka.server:type=Request | exempt-throttle-time indicates the percentage of time spent in broker network and I/O threads to process requests that are exempt from throttling.
ZooKeeper client request latency | kafka.server:type=ZooKeeperClientMetrics,name=ZooKeeperRequestLatencyMs | Latency in milliseconds for ZooKeeper requests from broker.
ZooKeeper connection status | kafka.server:type=SessionExpireListener,name=SessionState | Connection status of broker's ZooKeeper session which may be one of Disconnected|SyncConnected|AuthFailed|ConnectedReadOnly|SaslAuthenticated|Expired.
Max time to load group metadata | kafka.server:type=group-coordinator-metrics,name=partition-load-time-max | maximum time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Avg time to load group metadata | kafka.server:type=group-coordinator-metrics,name=partition-load-time-avg | average time, in milliseconds, it took to load offsets and group metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Max time to load transaction metadata | kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-max | maximum time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Avg time to load transaction metadata | kafka.server:type=transaction-coordinator-metrics,name=partition-load-time-avg | average time, in milliseconds, it took to load transaction metadata from the consumer offset partitions loaded in the last 30 seconds (including time spent waiting for the loading task to be scheduled)
Rate of transactional verification errors | kafka.server:type=AddPartitionsToTxnManager,name=VerificationFailureRate | Rate of verifications that returned in failure either from the AddPartitionsToTxn API response or through errors in the AddPartitionsToTxnManager. In steady state 0, but transient errors are expected during rolls and reassignments of the transactional state partition.
Time to verify a transactional request | kafka.server:type=AddPartitionsToTxnManager,name=VerificationTimeMs | The amount of time queueing while a possible previous request is in-flight plus the round trip to the transaction coordinator to verify (or not verify)
Consumer Group Offset Count | kafka.server:type=GroupMetadataManager,name=NumOffsets | Total number of committed offsets for Consumer Groups
Consumer Group Count | kafka.server:type=GroupMetadataManager,name=NumGroups | Total number of Consumer Groups
Consumer Group Count, per State | kafka.server:type=GroupMetadataManager,name=NumGroups[PreparingRebalance,CompletingRebalance,Empty,Stable,Dead] | The number of Consumer Groups in each state: PreparingRebalance, CompletingRebalance, Empty, Stable, Dead
Number of reassigning partitions | kafka.server:type=ReplicaManager,name=ReassigningPartitions | The number of reassigning leader partitions on a broker.
Outgoing byte rate of reassignment traffic | kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec | 0; non-zero when a partition reassignment is in progress.
Incoming byte rate of reassignment traffic | kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec | 0; non-zero when a partition reassignment is in progress.
Size of a partition on disk (in bytes) | kafka.log:type=Log,name=Size,topic=([-.\w]+),partition=([0-9]+) | The size of a partition on disk, measured in bytes.
Number of log segments in a partition | kafka.log:type=Log,name=NumLogSegments,topic=([-.\w]+),partition=([0-9]+) | The number of log segments in a partition.
First offset in a partition | kafka.log:type=Log,name=LogStartOffset,topic=([-.\w]+),partition=([0-9]+) | The first offset in a partition.
Last offset in a partition | kafka.log:type=Log,name=LogEndOffset,topic=([-.\w]+),partition=([0-9]+) | The last offset in a partition.
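
Several rows in the broker table state an expected value (0, or an idle fraction above 0.3). As a minimal sketch of how those expectations could be turned into an automated check, the Python below evaluates one sample per broker. It assumes the values have already been collected into plain dictionaries keyed by the `name=` part of each MBean; that layout, the function names, and the selection of metrics are illustrative assumptions, not part of PDS or Kafka.

```python
from typing import Dict, List

# Metrics whose "Normal value" in the broker table is 0.
EXPECT_ZERO = [
    "UnderReplicatedPartitions",
    "UnderMinIsrPartitionCount",
    "OfflineReplicaCount",
    "OfflineLogDirectoryCount",
    "UncleanLeaderElectionsPerSec",
    "FailedIsrUpdatesPerSec",
]

# Idle fractions that the table describes as "ideally > 0.3".
MIN_IDLE = {
    "NetworkProcessorAvgIdlePercent": 0.3,
    "RequestHandlerAvgIdlePercent": 0.3,
}


def check_broker(sample: Dict[str, float]) -> List[str]:
    """Return human-readable warnings for one broker's metric sample."""
    warnings = []
    for name in EXPECT_ZERO:
        value = sample.get(name)
        if value is not None and value != 0:
            warnings.append(f"{name} is {value}, expected 0")
    for name, floor in MIN_IDLE.items():
        value = sample.get(name)
        if value is not None and value <= floor:
            warnings.append(f"{name} is {value}, ideally > {floor}")
    return warnings


def check_single_controller(samples: Dict[str, Dict[str, float]]) -> bool:
    """True when exactly one broker reports ActiveControllerCount == 1."""
    total = sum(s.get("ActiveControllerCount", 0) for s in samples.values())
    return total == 1
```

Metrics missing from a sample are skipped rather than treated as failures, so the same check can run against brokers that expose only part of the list.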

KRaft monitoring

KRaft quorum monitoring metrics

Metric | MBean name | Description
Current State | kafka.server:type=raft-metrics,name=current-state | The current state of this member; possible values are leader, candidate, voted, follower, unattached, observer.
Current Leader | kafka.server:type=raft-metrics,name=current-leader | The current quorum leader's id; -1 indicates unknown.
Current Voted | kafka.server:type=raft-metrics,name=current-vote | The current voted leader's id; -1 indicates not voted for anyone.
Current Epoch | kafka.server:type=raft-metrics,name=current-epoch | The current quorum epoch.
High Watermark | kafka.server:type=raft-metrics,name=high-watermark | The high watermark maintained on this member; -1 if it is unknown.
Log End Offset | kafka.server:type=raft-metrics,name=log-end-offset | The current raft log end offset.
Number of Unknown Voter Connections | kafka.server:type=raft-metrics,name=number-unknown-voter-connections | Number of unknown voters whose connection information is not cached. The value of this metric is always 0.
Average Commit Latency | kafka.server:type=raft-metrics,name=commit-latency-avg | The average time in milliseconds to commit an entry in the raft log.
Maximum Commit Latency | kafka.server:type=raft-metrics,name=commit-latency-max | The maximum time in milliseconds to commit an entry in the raft log.
Average Election Latency | kafka.server:type=raft-metrics,name=election-latency-avg | The average time in milliseconds spent on electing a new leader.
Maximum Election Latency | kafka.server:type=raft-metrics,name=election-latency-max | The maximum time in milliseconds spent on electing a new leader.
Fetch Records Rate | kafka.server:type=raft-metrics,name=fetch-records-rate | The average number of records fetched from the leader of the raft quorum.
Append Records Rate | kafka.server:type=raft-metrics,name=append-records-rate | The average number of records appended per sec by the leader of the raft quorum.
Average Poll Idle Ratio | kafka.server:type=raft-metrics,name=poll-idle-ratio-avg | The average fraction of time the client's poll() is idle as opposed to waiting for the user code to process records.
Current Metadata Version | kafka.server:type=MetadataLoader,name=CurrentMetadataVersion | Outputs the feature level of the current effective metadata version.
Metadata Snapshot Load Count | kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount | The total number of times we have loaded a KRaft snapshot since the process was started.
Latest Metadata Snapshot Size | kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedBytes | The total size in bytes of the latest snapshot that the node has generated. If none have been generated yet, this is the size of the latest snapshot that was loaded. If no snapshots have been generated or loaded, this is 0.
Latest Metadata Snapshot Age | kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedAgeMs | The interval in milliseconds since the latest snapshot that the node has generated. If none have been generated yet, this is approximately the time delta since the process was started.
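
The current-state, current-leader, and current-epoch metrics can be compared across nodes to spot a quorum that has not settled. The following is a rough sketch of such a comparison, assuming the three values have been gathered from each node into a dictionary keyed by node id; the data layout and the function name are hypothetical.

```python
from typing import Dict, List


def quorum_problems(nodes: Dict[int, Dict[str, object]]) -> List[str]:
    """List inconsistencies found in per-node raft-metrics samples.

    `nodes` maps a node id to that node's `current-state`,
    `current-leader` and `current-epoch` values (hypothetical layout).
    """
    problems = []

    # Exactly one member should report the "leader" state.
    leaders = [nid for nid, m in nodes.items() if m["current-state"] == "leader"]
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 leader, found {len(leaders)}")

    for nid, m in sorted(nodes.items()):
        if m["current-leader"] == -1:
            # -1 means the leader is unknown to this member.
            problems.append(f"node {nid} does not know the quorum leader")
        elif len(leaders) == 1 and m["current-leader"] != leaders[0]:
            problems.append(f"node {nid} disagrees on the quorum leader")

    # All members should eventually converge on the same epoch.
    epochs = {m["current-epoch"] for m in nodes.values()}
    if len(epochs) > 1:
        problems.append(f"nodes report different epochs: {sorted(epochs)}")
    return problems
```

Differences are expected for a short time during an election, so a check like this is probably most useful when it only alerts after the same problem shows up in several consecutive samples.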

KRaft controller monitoring metrics

Metric | MBean name | Description
Active Controller Count | kafka.controller:type=KafkaController,name=ActiveControllerCount | The number of Active Controllers on this node. Valid values are '0' or '1'.
Event Queue Time Ms | kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs | A Histogram of the time in milliseconds that requests spent waiting in the Controller Event Queue.
Event Queue Processing Time Ms | kafka.controller:type=ControllerEventManager,name=EventQueueProcessingTimeMs | A Histogram of the time in milliseconds that requests spent being processed in the Controller Event Queue.
Fenced Broker Count | kafka.controller:type=KafkaController,name=FencedBrokerCount | The number of fenced brokers as observed by this Controller.
Active Broker Count | kafka.controller:type=KafkaController,name=ActiveBrokerCount | The number of active brokers as observed by this Controller.
Global Topic Count | kafka.controller:type=KafkaController,name=GlobalTopicCount | The number of global topics as observed by this Controller.
Global Partition Count | kafka.controller:type=KafkaController,name=GlobalPartitionCount | The number of global partitions as observed by this Controller.
Offline Partition Count | kafka.controller:type=KafkaController,name=OfflinePartitionCount | The number of offline topic partitions (non-internal) as observed by this Controller.
Preferred Replica Imbalance Count | kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount | The count of topic partitions for which the leader is not the preferred leader.
Metadata Error Count | kafka.controller:type=KafkaController,name=MetadataErrorCount | The number of times this controller node has encountered an error during metadata log processing.
Last Applied Record Offset | kafka.controller:type=KafkaController,name=LastAppliedRecordOffset | The offset of the last record from the cluster metadata partition that was applied by the Controller.
Last Committed Record Offset | kafka.controller:type=KafkaController,name=LastCommittedRecordOffset | The offset of the last record committed to this Controller.
Last Applied Record Timestamp | kafka.controller:type=KafkaController,name=LastAppliedRecordTimestamp | The timestamp of the last record from the cluster metadata partition that was applied by the Controller.
Last Applied Record Lag Ms | kafka.controller:type=KafkaController,name=LastAppliedRecordLagMs | The difference between now and the timestamp of the last record from the cluster metadata partition that was applied by the controller. For active Controllers the value of this lag is always zero.
ZooKeeper Write Behind Lag | kafka.controller:type=KafkaController,name=ZkWriteBehindLag | The amount of lag in records that ZooKeeper is behind relative to the highest committed record in the metadata log. This metric will only be reported by the active KRaft controller.
ZooKeeper Metadata Snapshot Write Time | kafka.controller:type=KafkaController,name=ZkWriteSnapshotTimeMs | The number of milliseconds the KRaft controller took reconciling a snapshot into ZooKeeper.
ZooKeeper Metadata Delta Write Time | kafka.controller:type=KafkaController,name=ZkWriteDeltaTimeMs | The number of milliseconds the KRaft controller took writing a delta into ZK.
Timed-out Broker Heartbeat Count | kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount | The number of broker heartbeats that timed out on this controller since the process was started. Note that only active controllers handle heartbeats, so only they will see increases in this metric.
Number Of Operations Started In Event Queue | kafka.controller:type=KafkaController,name=EventQueueOperationsStartedCount | The total number of controller event queue operations that were started. This includes deferred operations.
Number of Operations Timed Out In Event Queue | kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount | The total number of controller event queue operations that timed out before they could be performed.
Number Of New Controller Elections | kafka.controller:type=KafkaController,name=NewActiveControllersCount | Counts the number of times this node has seen a new controller elected. A transition to the "no leader" state is not counted here. If the same controller as before becomes active, that still counts.
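
Some of the controller counters are easier to interpret as ratios. The sketch below derives three of them from a single controller sample: the fraction of event queue operations that timed out, the fraction of known brokers that are fenced, and the average number of partitions per topic. The dictionary layout, the function name, and the choice of ratios are assumptions made for illustration; none of them is defined by Kafka or PDS.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def controller_ratios(m: Dict[str, float]) -> Dict[str, float]:
    """Derive a few ratios from one controller's metric sample.

    Keys of `m` are the `name=` parts of the KafkaController MBeans.
    """
    brokers = m["FencedBrokerCount"] + m["ActiveBrokerCount"]
    return {
        # Share of started event queue operations that timed out.
        "timed_out_fraction": _ratio(
            m["EventQueueOperationsTimedOutCount"],
            m["EventQueueOperationsStartedCount"],
        ),
        # Share of brokers seen by this controller that are fenced.
        "fenced_fraction": _ratio(m["FencedBrokerCount"], brokers),
        # Average number of partitions per topic across the cluster.
        "partitions_per_topic": _ratio(
            m["GlobalPartitionCount"], m["GlobalTopicCount"]
        ),
    }
```

Because the event queue counters are cumulative since process start, the timed-out fraction here is a lifetime figure; comparing two samples taken some time apart would give a more current picture.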

KRaft broker monitoring metrics

Metric | MBean name | Description
Last Applied Record Offset | kafka.server:type=broker-metadata-metrics,name=last-applied-record-offset | The offset of the last record from the cluster metadata partition that was applied by the broker.
Last Applied Record Timestamp | kafka.server:type=broker-metadata-metrics,name=last-applied-record-timestamp | The timestamp of the last record from the cluster metadata partition that was applied by the broker.
Last Applied Record Lag Ms | kafka.server:type=broker-metadata-metrics,name=last-applied-record-lag-ms | The difference between now and the timestamp of the last record from the cluster metadata partition that was applied by the broker.
Metadata Load Error Count | kafka.server:type=broker-metadata-metrics,name=metadata-load-error-count | The number of errors encountered by the BrokerMetadataListener while loading the metadata log and generating a new MetadataDelta based on it.
Metadata Apply Error Count | kafka.server:type=broker-metadata-metrics,name=metadata-apply-error-count | The number of errors encountered by the BrokerMetadataPublisher while applying a new MetadataImage based on the latest MetadataDelta.
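
The lag metric above is described as the difference between now and the last applied record timestamp. A small sketch of that arithmetic, and of using it to list brokers that have fallen behind, is shown below. It assumes the timestamp is expressed in epoch milliseconds, and the 5000 ms threshold is an arbitrary illustrative value rather than a recommended setting.

```python
import time
from typing import Dict, List, Optional


def applied_record_lag_ms(last_applied_timestamp_ms: int,
                          now_ms: Optional[int] = None) -> int:
    """Lag = now - timestamp of the last applied metadata record."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero in case of small clock differences.
    return max(0, now_ms - last_applied_timestamp_ms)


def lagging_brokers(timestamps_ms: Dict[str, int], now_ms: int,
                    max_lag_ms: int = 5000) -> List[str]:
    """Return the brokers whose metadata lag exceeds max_lag_ms."""
    return sorted(
        broker for broker, ts in timestamps_ms.items()
        if applied_record_lag_ms(ts, now_ms) > max_lag_ms
    )
```

For example, `lagging_brokers({"b0": 9000, "b1": 2000}, now_ms=10000)` reports only `b1`, whose last applied record is 8000 ms old.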

Common monitoring metrics for producers, consumers, connect, or streams

Name | Description
connection-close-rate | Connections closed per second in the window.
connection-close-total | Total connections closed in the window.
connection-creation-rate | New connections established per second in the window.
connection-creation-total | Total new connections established in the window.
network-io-rate | The average number of network operations (reads or writes) on all connections per second.
network-io-total | The total number of network operations (reads or writes) on all connections.
outgoing-byte-rate | The average number of outgoing bytes sent per second to all servers.
outgoing-byte-total | The total number of outgoing bytes sent to all servers.
request-rate | The average number of requests sent per second.
request-total | The total number of requests sent.
request-size-avg | The average size of all requests in the window.
request-size-max | The maximum size of any request sent in the window.
incoming-byte-rate | Bytes/second read off all sockets.
incoming-byte-total | Total bytes read off all sockets.
response-rate | Responses received per second.
response-total | Total responses received.
select-rate | Number of times the I/O layer checked for new I/O to perform per second.
select-total | Total number of times the I/O layer checked for new I/O to perform.
io-wait-time-ns-avg | The average length of time the I/O thread spent waiting for a socket ready for reads or writes in nanoseconds.
io-wait-time-ns-total | The total time the I/O thread spent waiting in nanoseconds.
io-wait-ratio | The fraction of time the I/O thread spent waiting.
io-time-ns-avg | The average length of time for I/O per select call in nanoseconds.
io-time-ns-total | The total time the I/O thread spent doing I/O in nanoseconds.
io-ratio | The fraction of time the I/O thread spent doing I/O.
connection-count | The current number of active connections.
successful-authentication-rate | Connections per second that were successfully authenticated using SASL or SSL.
successful-authentication-total | Total connections that were successfully authenticated using SASL or SSL.
failed-authentication-rate | Connections per second that failed authentication.
failed-authentication-total | Total connections that failed authentication.
successful-reauthentication-rate | Connections per second that were successfully re-authenticated using SASL.
successful-reauthentication-total | Total connections that were successfully re-authenticated using SASL.
reauthentication-latency-max | The maximum latency in ms observed due to re-authentication.
reauthentication-latency-avg | The average latency in ms observed due to re-authentication.
failed-reauthentication-rate | Connections per second that failed re-authentication.
failed-reauthentication-total | Total connections that failed re-authentication.
successful-authentication-no-reauth-total | Total connections that were successfully authenticated by older, pre-2.2.0 SASL clients that do not support re-authentication. May only be non-zero.
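
Many of these metrics come in pairs: a cumulative `-total` and a per-second `-rate`. When only the totals are scraped, a per-second rate over a custom interval can be approximated from two consecutive samples. The helper below is a minimal sketch of that calculation; the function name and the counter-reset handling are assumptions, and the result will generally differ somewhat from the client's own windowed `-rate` value.

```python
def rate_from_totals(prev_total: float, curr_total: float,
                     interval_s: float) -> float:
    """Approximate a per-second rate from two samples of a `-total` metric.

    interval_s is the time between the two samples in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_total - prev_total
    if delta < 0:
        # The counter went backwards, most likely because the client
        # restarted; count only what accumulated since the reset.
        delta = curr_total
    return delta / interval_s
```

For instance, two `outgoing-byte-total` samples of 1000 and 4000 taken 30 seconds apart give `rate_from_totals(1000, 4000, 30)`, which is 100 bytes per second.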

Common per-broker metrics for producers, consumers, connect, or streams

Name | Description
outgoing-byte-rate | The average number of outgoing bytes sent per second for a node.
outgoing-byte-total | The total number of outgoing bytes sent for a node.
request-rate | The average number of requests sent per second for a node.
request-total | The total number of requests sent for a node.
request-size-avg | The average size of all requests in the window for a node.
request-size-max | The maximum size of any request sent in the window for a node.
incoming-byte-rate | The average number of bytes received per second for a node.
incoming-byte-total | The total number of bytes received for a node.
request-latency-avg | The average request latency in ms for a node.
request-latency-max | The maximum request latency in ms for a node.
response-rate | Responses received per second for a node.
response-total | Total responses received for a node.

Producer monitoring

Name | Description
waiting-threads | The number of user threads blocked waiting for buffer memory to enqueue their records.
buffer-total-bytes | The maximum amount of buffer memory the client can use (whether or not it is currently used).
buffer-available-bytes | The total amount of buffer memory that is not being used (either unallocated or in the free list).
bufferpool-wait-time | The fraction of time an appender waits for space allocation.
bufferpool-wait-time-total | Deprecated. The total time an appender waits for space allocation in nanoseconds. Replacement is bufferpool-wait-time-ns-total.
bufferpool-wait-time-ns-total | The total time an appender waits for space allocation in nanoseconds.
flush-time-ns-total | The total time the Producer spent in Producer.flush in nanoseconds.
txn-init-time-ns-total | The total time the Producer spent initializing transactions in nanoseconds (for EOS).
txn-begin-time-ns-total | The total time the Producer spent in beginTransaction in nanoseconds (for EOS).
txn-send-offsets-time-ns-total | The total time the Producer spent sending offsets to transactions in nanoseconds (for EOS).
txn-commit-time-ns-total | The total time the Producer spent committing transactions in nanoseconds (for EOS).
txn-abort-time-ns-total | The total time the Producer spent aborting transactions in nanoseconds (for EOS).
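
The two buffer metrics can be combined into a single utilization figure, which together with waiting-threads gives a quick indication of whether a producer is running out of buffer memory. The sketch below shows one way to compute it; the function names, the sample layout, and the 0.9 high-water mark are illustrative assumptions.

```python
from typing import Dict


def buffer_utilization(total_bytes: float, available_bytes: float) -> float:
    """Fraction of producer buffer memory currently in use.

    total_bytes is buffer-total-bytes, available_bytes is
    buffer-available-bytes.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - available_bytes) / total_bytes


def producer_is_backpressured(m: Dict[str, float],
                              high_water: float = 0.9) -> bool:
    """True when threads are blocked or the buffer is nearly full."""
    utilization = buffer_utilization(
        m["buffer-total-bytes"], m["buffer-available-bytes"]
    )
    return m["waiting-threads"] > 0 or utilization >= high_water
```

With a 32 MiB buffer (33554432 bytes) of which 8 MiB (8388608 bytes) is still available, `buffer_utilization` returns 0.75.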

Producer sender metrics

Name | Description
batch-size-avg | The average number of bytes sent per partition per-request.
batch-size-max | The max number of bytes sent per partition per-request.
batch-split-rate | The average number of batch splits per second.
batch-split-total | The total number of batch splits.
compression-rate-avg | The average compression rate of record batches, defined as the average ratio of the compressed batch size over the uncompressed size.
metadata-age | The age in seconds of the current producer metadata being used.
produce-throttle-time-avg | The average time in ms a request was throttled by a broker.
produce-throttle-time-max | The maximum time in ms a request was throttled by a broker.
record-error-rate | The average per-second number of record sends that resulted in errors.
record-error-total | The total number of record sends that resulted in errors.
record-queue-time-avg | The average time in ms record batches spent in the send buffer.
record-queue-time-max | The maximum time in ms record batches spent in the send buffer.
record-retry-rate | The average per-second number of retried record sends.
record-retry-total | The total number of retried record sends.
record-send-rate | The average number of records sent per second.
record-send-total | The total number of records sent.
record-size-avg | The average record size.
record-size-max | The maximum record size.
records-per-request-avg | The average number of records per request.
request-latency-avg | The average request latency in ms.
request-latency-max | The maximum request latency in ms.
requests-in-flight | The current number of in-flight requests awaiting a response.
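
Because compression-rate-avg is defined as compressed size over uncompressed size, a lower value means better compression. The sketch below converts it into a space saving and also computes a rough retries-per-send ratio from the two totals. Both helper names are hypothetical, and the retry ratio is only an approximation since it assumes the two totals are sampled at the same moment.

```python
def space_saving(compression_rate_avg: float) -> float:
    """Fraction of bytes saved by compression.

    compression_rate_avg is compressed size / uncompressed size,
    so 0.4 means batches shrink to 40% of their original size.
    """
    return 1.0 - compression_rate_avg


def retries_per_send(record_retry_total: float,
                     record_send_total: float) -> float:
    """Rough number of retried record sends per record sent."""
    if record_send_total == 0:
        return 0.0
    return record_retry_total / record_send_total
```

A compression-rate-avg of 0.4 therefore corresponds to a space saving of about 0.6, that is, roughly 60% fewer bytes on the wire.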

kafka.producer:type=producer-topic-metrics,client-id="{client-id}",topic="{topic}"

Name | Description
byte-rate | The average number of bytes sent per second for a topic.
byte-total | The total number of bytes sent for a topic.
compression-rate | The average compression rate of record batches for a topic, defined as the average ratio of the compressed batch size over the uncompressed size.
record-error-rate | The average per-second number of record sends that resulted in errors for a topic.
record-error-total | The total number of record sends that resulted in errors for a topic.
record-retry-rate | The average per-second number of retried record sends for a topic.
record-retry-total | The total number of retried record sends for a topic.
record-send-rate | The average number of records sent per second for a topic.
record-send-total | The total number of records sent for a topic.
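
The per-topic producer metrics are addressed by an MBean name that embeds the client id and topic. As an illustration, the helpers below fill in that name pattern and split an MBean name back into its domain and key properties. Both functions are hypothetical conveniences, and the parser is deliberately naive: it does not handle quoted values that themselves contain commas or equals signs.

```python
from typing import Dict, Tuple


def producer_topic_mbean(client_id: str, topic: str) -> str:
    """Fill in the producer-topic-metrics MBean name pattern."""
    return (
        "kafka.producer:type=producer-topic-metrics,"
        f'client-id="{client_id}",topic="{topic}"'
    )


def parse_mbean(name: str) -> Tuple[str, Dict[str, str]]:
    """Split an MBean name into (domain, key properties)."""
    domain, _, props = name.partition(":")
    result = {}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        # Drop the surrounding quotes used for client-id and topic.
        result[key] = value.strip('"')
    return domain, result
```

Calling `parse_mbean(producer_topic_mbean("app-1", "orders"))` yields the domain `kafka.producer` and the properties `type`, `client-id`, and `topic`.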

Consumer monitoring

| Name | Description |
|---|---|
| time-between-poll-avg | The average delay between invocations of poll(). |
| time-between-poll-max | The max delay between invocations of poll(). |
| last-poll-seconds-ago | The number of seconds since the last poll() invocation. |
| poll-idle-ratio-avg | The average fraction of time the consumer's poll() is idle as opposed to waiting for the user code to process records. |
| committed-time-ns-total | The total time, in nanoseconds, that the consumer spent in committed() calls. |
| commit-sync-time-ns-total | The total time, in nanoseconds, that the consumer spent committing offsets (for AOS). |
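
One common use of `last-poll-seconds-ago` is to spot a consumer that is close to exceeding the `max.poll.interval.ms` consumer setting. The following is a minimal Python sketch of such a check. The function name `poll_status` and the `warn_fraction` threshold are illustrative assumptions; the default of 300000 ms mirrors the Kafka consumer default for `max.poll.interval.ms` and should be replaced with the value your consumers actually use.

```python
def poll_status(last_poll_seconds_ago, max_poll_interval_ms=300_000, warn_fraction=0.8):
    """Classify a consumer by how long ago it last called poll().

    warn_fraction is a hypothetical alerting threshold, not a Kafka setting.
    """
    elapsed_ms = last_poll_seconds_ago * 1000
    if elapsed_ms >= max_poll_interval_ms:
        return "exceeded"
    if elapsed_ms >= warn_fraction * max_poll_interval_ms:
        return "warning"
    return "ok"


# Made-up readings of last-poll-seconds-ago.
for seconds in (10, 250, 300):
    print(seconds, poll_status(seconds))
```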

Consumer group metrics

| Name | Description |
|---|---|
| commit-latency-avg | The average time taken for a commit request |
| commit-latency-max | The max time taken for a commit request |
| commit-rate | The number of commit calls per second |
| commit-total | The total number of commit calls |
| assigned-partitions | The number of partitions currently assigned to this consumer |
| heartbeat-response-time-max | The max time taken to receive a response to a heartbeat request |
| heartbeat-rate | The average number of heartbeats per second |
| heartbeat-total | The total number of heartbeats |
| join-time-avg | The average time taken for a group rejoin |
| join-time-max | The max time taken for a group rejoin |
| join-rate | The number of group joins per second |
| join-total | The total number of group joins |
| sync-time-avg | The average time taken for a group sync |
| sync-time-max | The max time taken for a group sync |
| sync-rate | The number of group syncs per second |
| sync-total | The total number of group syncs |
| rebalance-latency-avg | The average time taken for a group rebalance |
| rebalance-latency-max | The max time taken for a group rebalance |
| rebalance-latency-total | The total time taken for group rebalances so far |
| rebalance-total | The total number of group rebalances participated in |
| rebalance-rate-per-hour | The number of group rebalances participated in per hour |
| failed-rebalance-total | The total number of failed group rebalances |
| failed-rebalance-rate-per-hour | The number of failed group rebalance events per hour |
| last-rebalance-seconds-ago | The number of seconds since the last rebalance event |
| last-heartbeat-seconds-ago | The number of seconds since the last heartbeat was sent to the group coordinator |
| partitions-revoked-latency-avg | The average time taken by the on-partitions-revoked rebalance listener callback |
| partitions-revoked-latency-max | The max time taken by the on-partitions-revoked rebalance listener callback |
| partitions-assigned-latency-avg | The average time taken by the on-partitions-assigned rebalance listener callback |
| partitions-assigned-latency-max | The max time taken by the on-partitions-assigned rebalance listener callback |
| partitions-lost-latency-avg | The average time taken by the on-partitions-lost rebalance listener callback |
| partitions-lost-latency-max | The max time taken by the on-partitions-lost rebalance listener callback |
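
The heartbeat and rebalance metrics can be combined into a simple group membership check. The following is a minimal Python sketch, assuming a snapshot dictionary keyed by metric name. The function name `group_membership_alerts` and the `max_rebalances_per_hour` threshold are illustrative assumptions, and `session_timeout_ms` should be set to the `session.timeout.ms` value configured for your consumers.

```python
def group_membership_alerts(metrics, session_timeout_ms, max_rebalances_per_hour=4.0):
    """Return a list of alert strings for one consumer's group metrics.

    `metrics` is a hypothetical dict keyed by consumer group metric names;
    the rebalance threshold is an arbitrary example value.
    """
    alerts = []
    # No heartbeat for a full session timeout means the member is likely being evicted.
    if metrics.get("last-heartbeat-seconds-ago", 0) * 1000 >= session_timeout_ms:
        alerts.append("heartbeat overdue")
    # Frequent rebalances pause consumption and usually point to unstable members.
    if metrics.get("rebalance-rate-per-hour", 0.0) > max_rebalances_per_hour:
        alerts.append("frequent rebalances")
    if metrics.get("failed-rebalance-rate-per-hour", 0.0) > 0:
        alerts.append("failed rebalances")
    return alerts


# Example snapshot with made-up values.
snapshot = {
    "last-heartbeat-seconds-ago": 2,
    "rebalance-rate-per-hour": 6.0,
    "failed-rebalance-rate-per-hour": 0.0,
}
print(group_membership_alerts(snapshot, session_timeout_ms=45_000))
```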

Consumer fetch metrics

| Name | Description |
|---|---|
| bytes-consumed-rate | The average number of bytes consumed per second |
| bytes-consumed-total | The total number of bytes consumed |
| fetch-latency-avg | The average time taken for a fetch request. |
| fetch-latency-max | The max time taken for any fetch request. |
| fetch-rate | The number of fetch requests per second. |
| fetch-size-avg | The average number of bytes fetched per request |
| fetch-size-max | The maximum number of bytes fetched per request |
| fetch-throttle-time-avg | The average throttle time in ms |
| fetch-throttle-time-max | The maximum throttle time in ms |
| fetch-total | The total number of fetch requests. |
| records-consumed-rate | The average number of records consumed per second |
| records-consumed-total | The total number of records consumed |
| records-lag-max | The maximum lag in terms of number of records for any partition in this window. NOTE: This is based on current offset and not committed offset |
| records-lead-min | The minimum lead in terms of number of records for any partition in this window |
| records-per-request-avg | The average number of records in each request |
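
A few of the fetch metrics are most useful when read together. The following is a minimal Python sketch, assuming a snapshot dictionary keyed by metric name. The function name `fetch_summary` and the `max_lag` threshold are illustrative assumptions; choose a lag threshold that matches your own workload.

```python
def fetch_summary(metrics, max_lag=1000):
    """Summarize one consumer-fetch-manager-metrics snapshot.

    `metrics` is a hypothetical dict keyed by consumer fetch metric names;
    max_lag is an arbitrary example threshold in records.
    """
    records_rate = metrics.get("records-consumed-rate", 0.0)
    bytes_rate = metrics.get("bytes-consumed-rate", 0.0)
    # Average record size follows from bytes per second over records per second.
    avg_record_bytes = bytes_rate / records_rate if records_rate > 0 else 0.0
    return {
        "avg_record_bytes": avg_record_bytes,
        # Any non-zero throttle time means fetches were delayed by throttling.
        "throttled": metrics.get("fetch-throttle-time-avg", 0.0) > 0,
        "lagging": metrics.get("records-lag-max", 0.0) > max_lag,
    }


# Example snapshot with made-up values.
snapshot = {
    "records-consumed-rate": 500.0,
    "bytes-consumed-rate": 256000.0,
    "fetch-throttle-time-avg": 0.0,
    "records-lag-max": 1200.0,
}
print(fetch_summary(snapshot))
```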

kafka.consumer:type=consumer-fetch-manager-metrics,client-id="{client-id}",topic="{topic}"

| Name | Description |
|---|---|
| bytes-consumed-rate | The average number of bytes consumed per second for a topic |
| bytes-consumed-total | The total number of bytes consumed for a topic |
| fetch-size-avg | The average number of bytes fetched per request for a topic |
| fetch-size-max | The maximum number of bytes fetched per request for a topic |
| records-consumed-rate | The average number of records consumed per second for a topic |
| records-consumed-total | The total number of records consumed for a topic |
| records-per-request-avg | The average number of records in each request for a topic |

kafka.consumer:type=consumer-fetch-manager-metrics,partition="{partition}",topic="{topic}",client-id="{client-id}"

| Name | Description |
|---|---|
| preferred-read-replica | The current read replica for the partition, or -1 if reading from leader |
| records-lag | The latest lag of the partition |
| records-lag-avg | The average lag of the partition |
| records-lag-max | The max lag of the partition |
| records-lead | The latest lead of the partition |
| records-lead-avg | The average lead of the partition |
| records-lead-min | The min lead of the partition |
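
Because the partition-level MBean name carries the topic, partition, and client ID, it can be parsed to rank partitions by lag. The following is a minimal Python sketch, assuming the MBean names and their `records-lag` values have already been collected into a dictionary. The function names `parse_mbean_name` and `top_lagging` are illustrative assumptions, and the simple parser does not handle quoted values that contain commas.

```python
def parse_mbean_name(name):
    """Split an MBean name into its domain and a dict of key properties.

    Simplified: assumes no commas or equals signs inside quoted values.
    """
    domain, _, props = name.partition(":")
    keys = {}
    for pair in props.split(","):
        key, _, value = pair.partition("=")
        keys[key] = value.strip('"')
    return domain, keys


def top_lagging(snapshots, limit=3):
    """Return (topic, partition, lag) tuples sorted by descending records-lag."""
    rows = []
    for name, metrics in snapshots.items():
        _, keys = parse_mbean_name(name)
        rows.append((keys["topic"], int(keys["partition"]), metrics["records-lag"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]


# Example snapshots with made-up topic, client ID, and lag values.
prefix = "kafka.consumer:type=consumer-fetch-manager-metrics"
snapshots = {
    prefix + ',partition="0",topic="orders",client-id="c1"': {"records-lag": 12},
    prefix + ',partition="1",topic="orders",client-id="c1"': {"records-lag": 340},
    prefix + ',partition="2",topic="orders",client-id="c1"': {"records-lag": 75},
}
print(top_lagging(snapshots, limit=2))
```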