Monitoring View

Last modified: Wed, July 16, 2025, 5:16 PM

Grafana is a visualization tool that reads metric data and uses queries to build dashboards in various forms that are easy for users to read. When collected metrics exceed a threshold or change state, it can also send notifications to multiple destinations (Slack, Opsgenie, email, etc.) through configured alerts.

For more details about Grafana, or to download it, see the Grafana homepage.

This guide explains how to use Grafana dashboards and describes each item.

To use the service, click Monitoring in the side menu of the ZCP Console.

Monitoring Screen - Grafana
- Grafana Dashboard
- Dashboard Types (13)

Monitoring Screen - Grafana

Grafana Dashboard

Select the Home menu at the top
Check the expanded menus
Recently selected Dashboard (Recent) and basic configuration Dashboard (4)
Select Basic Configuration Dashboard
Check the selected Dashboard

Dashboard Types (13)

Addon Dashboards

ElasticSearch: Displays information about ElasticSearch (JVM, CPU, Memory, Documents, Indices, etc.)

Group Name	Panel Name	Description
KPI	Cluster health	Current status of elasticsearch cluster (N/A / Green / Yellow / Red)
	Tripped for breakers	Number of times the circuit breakers have tripped
	CPU usage Avg.	Average CPU usage
	JVM memory used Avg.	Average JVM memory usage
	Nodes	Number of nodes in the cluster.
	Data nodes	Number of data nodes in the cluster.
	Pending tasks	Cluster level changes which have not yet been executed.
	Openfile descriptors per cluster	The total number of open files in elasticsearch
Shards	Active primary shards	The number of primary shards in your cluster. This is an aggregate total across all indices.
	Active shards	Aggregate total of all shards across all indices, which includes replica shards.
	Initializing shards	Count of shards that are being freshly created.
	Relocating shards	The number of shards that are currently moving from one node to another node.
	Delayed shards	Shards delayed to reduce reallocation overhead.
	Unassigned shards	The number of shards that exist in the cluster state, but cannot be found in the cluster itself.
JVM Garbage Collection	GC count	Number of items processed by Garbage Collection
JVM Garbage Collection	GC time	Time taken for Garbage Collection to process
CPU and Memory	Load average	Load average used in elasticsearch
	CPU usage	CPU usage in elasticsearch
	JVM memory usage	JVM memory used by elasticsearch
	JVM memory committed	JVM memory committed for use by elasticsearch
Disk and Network	Disk usage	Disk usage of elasticsearch
Disk and Network	Network usage	Network usage of elasticsearch
Documents	Documents count on node	Number of documents stored on the data node
	Documents indexed rate	Rate at which documents are indexed
	Documents deleted rate	Rate at which documents are deleted
	Documents merged rate	Rate at which documents are merged
	Documents merged bytes	Size (bytes) of merged documents
Times	Query time	Query execution time
	Indexing time	Indexing execution time
	Merging time	Merging execution time
	Throttle time for index store	Throttle time for storing indices
Indices: Count of documents and Total size	Count of documents with only primary shards	Number of documents in primary shards
	Total size of stored index data in bytes with only primary shards on all nodes	Total size of index data stored in primary shards
	Total size of stored index data in bytes with all shards on all nodes	Total size of index data stored in all shards
Indices: Index writer	Index writer with only primary shards on all nodes in bytes	Size (bytes) being written to the index by primary shards
Indices: Index writer	Index writer with all shards on all nodes in bytes	Size (bytes) being written to the index by all shards

ZCP Services Status : Health check of the 'zcp-system' namespace (CPU usage, status values)

Panel Name	Description
Duration	Probe duration in seconds
Status : alertmanager	alertmanager health (UP / DOWN)
alertmanager Status Code	alertmanager status code
Status : grafana	grafana health (UP / DOWN)
grafana Status Code	grafana status code
Status : prometheus	prometheus health (UP / DOWN)
prometheus Status Code	prometheus status code
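
The status panels above pair an UP / DOWN label with a status code and a probe duration for each service. The following is a minimal sketch of that mapping, assuming each service is checked by a probe that reports a success flag, an HTTP status code, and a duration. The `ProbeSample` structure and all sample values are hypothetical; the guide does not state how the dashboard obtains its data.

```python
from dataclasses import dataclass


@dataclass
class ProbeSample:
    """One probe result for a service (hypothetical structure)."""
    service: str
    success: float           # 1.0 when the probe succeeded, 0.0 otherwise
    http_status_code: int    # e.g. 200 or 503
    duration_seconds: float  # how long the probe took


def health_label(sample: ProbeSample) -> str:
    """Map the numeric probe result to the UP / DOWN label of a status panel."""
    return "UP" if sample.success == 1.0 else "DOWN"


def summarize(samples: list[ProbeSample]) -> dict[str, tuple[str, int, float]]:
    """Return {service: (health, status code, duration)} for each probed service."""
    return {
        s.service: (health_label(s), s.http_status_code, s.duration_seconds)
        for s in samples
    }


# Illustrative values only, not real measurements.
samples = [
    ProbeSample("alertmanager", 1.0, 200, 0.012),
    ProbeSample("grafana", 1.0, 200, 0.034),
    ProbeSample("prometheus", 0.0, 503, 0.250),
]

for service, (health, code, duration) in summarize(samples).items():
    print(f"{service}: {health} (status code {code}, {duration:.3f}s)")
```

Keeping the status code next to the UP / DOWN label helps distinguish a service that is unreachable from one that responds with an error.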

Cluster Dashboards

Etcd Cluster : Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)

Panel Name	Description
Etcd has a leader?	Checks whether etcd has a leader (YES / NO)
The number of leader changes seen	Number of etcd leader changes
The total number of failed proposals seen	Total number of failed proposals
RPC Rate	Number of gRPC calls started or handled over 5 minutes
Etcd DB Size	Total size of the etcd MVCC DB in bytes (etcd debugging mvcc db total size in bytes)
Etcd Disk Sync Duration	WAL fsync duration of the etcd disk over 5 minutes (99th percentile of the histogram)
Etcd Memory	Memory usage of the 'etcd' job
Etcd Client Traffic In	Total traffic received by the etcd network client gRPC over 5 minutes
Etcd Client Traffic Out	Total traffic sent by the etcd network client gRPC over 5 minutes
Etcd Peer Traffic In	Total traffic received by etcd network peers over 5 minutes
Etcd Peer Traffic Out	Total traffic sent by etcd network peers over 5 minutes
Etcd Proposals rate(Fail,Pending,commit,apply)	Total number of proposals committed by the etcd server over 5 minutes
Etcd Disk operations(AVG)	Total number of backend commits by the etcd disk over 2 minutes
Network	Total traffic received by the etcd network client gRPC over 2 minutes
Snapshot duration	Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable.
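
The Etcd Disk Sync Duration panel refers to a 99th-percentile histogram value. The sketch below shows one way such a percentile can be estimated from cumulative histogram buckets by linear interpolation, similar in spirit to the Prometheus `histogram_quantile` function. It is a simplified illustration rather than the dashboard's actual query; the function name and the bucket values are assumptions.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by upper
    bound, with the last bound being +infinity.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical fsync-duration buckets (seconds, cumulative count).
fsync_buckets = [
    (0.001, 0.0),
    (0.002, 50.0),
    (0.004, 90.0),
    (0.008, 100.0),
    (math.inf, 100.0),
]

p99 = histogram_quantile(0.99, fsync_buckets)
print(f"p99 fsync duration: {p99 * 1000:.2f} ms")
```

Because the result is interpolated within a bucket, its precision depends on how finely the bucket boundaries are spaced around the value of interest.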

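Kubernetes: Cluster Overview : Resource information for the whole cluster, node averages, and cluster averages (Node/Pod/Container counts, CPU/Memory/Network usage, etc.)

Group Name	Panel Name	Description
Resource Dashboard	Alertmanager Alerts Firing	Total number of alerts
	Node Not Ready	Number of nodes in the 'Not Ready' state
	Node Unschedulable	Number of nodes in the 'Unschedulable' state
	Node Memory Pressure	Number of nodes in the 'Memory Pressure' state
	Node Disk Pressure	Number of nodes in the 'Disk Pressure' state
	Running Pod Total	Number of pods currently in the 'Running' state
	Running Pod Total by Node	Number of pods currently in the 'Running' state on each node
	Running Container Total	Number of containers currently in the 'Running' state
	Running Container Total by Node	Number of containers currently in the 'Running' state on each node
Node Resource Usage	Number of Node	Total number of nodes currently in the cluster
	Total CPU	Total CPU of the nodes currently in the cluster
	Used Memory	Memory used by the nodes currently in the cluster
	Total Memory	Total memory of the nodes currently in the cluster
	DIsk Usage	Disk used by the nodes currently in the cluster
	DIsk Total	Total disk of the nodes currently in the cluster
	Avg CPU Usage	Average CPU usage of the nodes currently in the cluster
	Avg Memory Usage	Average memory usage of the nodes currently in the cluster
	Avg Disk Usage	Average disk usage of the nodes currently in the cluster
	Network Usage (Node NIC)	Network usage of the nodes currently in the cluster
Cluster Resource Usage	Cluster CPU Usage(Used/Total)	Percentage (%) of the total node CPU in the cluster that is in use; the total CPU (cores) and the amount used are also shown below it
	Cluster Memory Usage(Used/Total)	Percentage (%) of the total node memory in the cluster that is in use; the total memory (GiB) and the amount used are also shown below it
	Cluster DIsk Usage(Used/Total)	Percentage (%) of the total node disk in the cluster that is in use; the total disk (GiB) and the amount used are also shown below it
	Pod Count by namespace	Number of pods registered in kubernetes, per namespace
	Container Count by namespace	Number of containers registered in kubernetes, per namespace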
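
The table above lists both cluster-wide usage (Used/Total) and per-node average usage. These two figures can differ when nodes have different capacities, as the following sketch illustrates. The `NodeResource` structure and the sample numbers are hypothetical, and the dashboard's actual queries may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class NodeResource:
    """Used and total amount of one resource on one node (hypothetical)."""
    name: str
    used: float
    total: float


def cluster_usage_percent(nodes: list[NodeResource]) -> float:
    """Used/Total over the whole cluster: sum first, then divide."""
    total = sum(n.total for n in nodes)
    if total == 0:
        return 0.0
    return 100.0 * sum(n.used for n in nodes) / total


def average_node_usage_percent(nodes: list[NodeResource]) -> float:
    """Average of the per-node usage percentages."""
    per_node = [100.0 * n.used / n.total for n in nodes if n.total > 0]
    return sum(per_node) / len(per_node) if per_node else 0.0


# Two nodes with different CPU capacity (cores); illustrative values only.
nodes = [
    NodeResource("node-a", used=2.0, total=4.0),   # 50% used
    NodeResource("node-b", used=6.0, total=16.0),  # 37.5% used
]

print(f"cluster usage: {cluster_usage_percent(nodes):.2f}%")
print(f"average node usage: {average_node_usage_percent(nodes):.2f}%")
```

The cluster-wide figure weights large nodes more heavily, while the per-node average treats every node equally.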

Kubernetes: Performance Overview : API Server Requests/Latency, Pod/Container Running Trends, Creating Rate, etc.

Panel Name	Description
APIServer Request Rate	Sum of APIServer requests over 2-minute intervals
APIServer Latency	Average APIServer request latency
Kubelet POD Start Latency	Latency in microseconds for a single pod to go from pending to running. Broken down by podname.
Running Pod Trands	Number of pods in the 'running' state on the kubelet
Create Rate of Pods	Rate of pods newly created by the kubelet over 2 minutes
Running Containers Trands	Number of containers in the 'running' state on the kubelet
Create Rate of Containers	Rate of containers newly created by the kubelet over 2 minutes
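
Several panels above report a rate over a 2-minute window. As a rough illustration, a per-second rate can be derived from the samples of an ever-increasing counter as shown below. This is a simplified sketch with a hypothetical function name and made-up samples; real query engines such as Prometheus apply additional steps (for example extrapolation to the window edges) that are omitted here.

```python
def counter_rate(samples: list[tuple[float, float]], window_seconds: float) -> float:
    """Average per-second increase of a counter over the most recent window.

    samples: (timestamp_seconds, counter_value) pairs sorted by timestamp.
    """
    if len(samples) < 2:
        return 0.0
    end = samples[-1][0]
    in_window = [s for s in samples if s[0] >= end - window_seconds]
    if len(in_window) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop in value means the counter was reset (e.g. process restart),
        # so the new value itself is counted as the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Request counter scraped every 30 seconds; illustrative values only.
request_samples = [(0.0, 100.0), (30.0, 130.0), (60.0, 160.0), (90.0, 190.0), (120.0, 220.0)]
print(f"requests per second: {counter_rate(request_samples, 120.0):.2f}")

# Same window, but the counter restarted between the last two samples.
reset_samples = [(0.0, 100.0), (60.0, 160.0), (120.0, 20.0)]
print(f"requests per second after reset: {counter_rate(reset_samples, 120.0):.2f}")
```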

Kubernetes: Resource Requests : Displays Node CPU/Memory usage and Pod count information

Panel Name	Description
Cluster CPU(Allocated/Request)	This represents the total [CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) in the cluster. For comparison, the total [allocatable CPU cores](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
Cluster Memory(Allocated/Request)	This represents the total [memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) in the cluster. For comparison, the total [allocatable memory](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
Cluster Pod(Allocated/Request)	This represents the total number of [pods requested](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run) in the cluster. For comparison, the total [allocatable pod count](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
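
To make the comparison between requested and allocatable resources concrete, the sketch below sums Kubernetes-style quantities and reports the requested share. The helper names are hypothetical, and only a subset of the Kubernetes quantity notation is handled (the `m` milli suffix for CPU and the `Ki`/`Mi`/`Gi`/`Ti` binary suffixes for memory); it is an illustration, not the dashboard's query.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores: '500m' is 0.5 cores, '2' is 2 cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)


_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity to bytes; plain integers are taken as bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def request_percent(requested: float, allocatable: float) -> float:
    """Share of the allocatable amount that has been requested, in percent."""
    return 100.0 * requested / allocatable if allocatable else 0.0


# Hypothetical container requests and node allocatable values.
cpu_requested = sum(parse_cpu(q) for q in ["500m", "250m", "1"])
cpu_allocatable = sum(parse_cpu(q) for q in ["4", "4"])
mem_requested = sum(parse_memory(q) for q in ["512Mi", "1Gi"])
mem_allocatable = sum(parse_memory(q) for q in ["8Gi"])

print(f"CPU requested: {request_percent(cpu_requested, cpu_allocatable):.2f}%")
print(f"Memory requested: {request_percent(mem_requested, mem_allocatable):.2f}%")
```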

Container Dashboards

Kubernetes: DaemonSet Overview : DaemonSet information (Replicas, CPU/Memory/Network usage, etc.)

Panel Name	Description
Desired Replicas	Expected number of daemonset replicas
Available Replicas	Number of currently running daemonset replicas
Metadata Generation	Number of daemonsets created from metadata
DaemonSet Create Time	Creation time of the oldest daemonset from now
Total CPU	Total CPU usage (Core) of containers created by daemonset
Total Memory	Total memory usage (MiB) of containers created by daemonset
Total Network	Total network usage (MBps) of containers created by daemonset
CPU Usage	CPU usage of containers created by daemonset
Memory Usage	Memory usage of containers created by daemonset
Replicas Status	Status of daemonset replicas (Ready / Available / Unavailable / Misscheduled)

Kubernetes: Deployment Overview : Deployment information (Replicas, CPU/Memory/Network usage, etc.)

Panel Name	Description
Desired Replicas	Expected number of deployment replicas
Available Replicas	Number of currently running deployment replicas
Observed Generation	Number of deployments created based on Observed
Metadata Generation	Number of deployments created based on Metadata
Deployment Create Time	Creation time of the oldest deployment from now
AVG CPU	Average CPU usage (Core) of Containers created by Deployment
AVG Memory	Average Memory usage (MiB) of Containers created by Deployment
AVG Network	Average Network usage (kBps) of Containers created by Deployment
CPU Usage	CPU usage of Containers created by Deployment
Memory Usage	Memory usage of Containers created by Deployment
Replicas Status	Status of Deployment replicas (Ready / Available / Unavailable / Misscheduled)
Spec	Deployment replica specification (Replicas / Paused)

Kubernetes: POD Overview : Pod information (Pod status, restart count, and CPU/Memory/Network usage of the pod)

Panel Name	Description
POD Count	Number of Pods in the selected Namespace
Pod Status	Status of Pods in the selected Namespace (Failed / Pending / Running / Succeeded / Unknown)
Pod Restart Count	Number of restarts for Pods in the selected Namespace
POD/Container CPU Usage	CPU usage and trend for Containers in Pods of the selected Namespace
POD/Container Memory Usage	Memory usage and trend for Containers in Pods of the selected Namespace
POD/Container Network Usage	Network usage and trend for Containers in Pods of the selected Namespace

Kubernetes: StatefulSets Overview : StatefulSets information (Replicas, CPU/Memory/Network usage, etc.)

Panel Name	Description
Desired Replicas	Expected number of statefulset replicas
Available Replicas	Number of statefulset replicas currently in use
Observed Generation	Number of statefulsets created by observed generation
Metadata Generation	Number of statefulsets created by metadata generation
Statefulset Create Time	Creation time of the oldest statefulset from now
Total CPU	Total CPU used by containers created from statefulsets (Core)
Total Memory	Total Memory used by containers created from statefulsets (MiB)
Total Network	Total Network usage by containers created from statefulsets (MBps)
CPU Usage	CPU usage of containers created from statefulsets
Memory Usage	Memory usage of containers created from statefulsets
Replicas Status	Status of replicas in the statefulset (Current / Available)

System Dashboards

System Disk Space : Disk usage trend for each Node

Panel Name	Description
Root Disk Capacity Check	Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, the database volume, or a volume used for temporary space can cause downtime. Some storage may also have reduced performance when only a small amount of space is available.

System Usage Overview : Usage information for each Node (Idle CPU, Disk I/O, Network received/transmitted, Memory/Disk usage, etc.)

Panel Name	Description
Idle by CPU Core	5-minute average idle time of CPUs within the selected Node
System Load(1,5,15)	Average load of the selected Node (1 min / 5 min / 15 min)
Memory Usage	Memory usage by type on the selected Node (memory used / memory buffers / memory cached / memory free)
Memory Usage	Total memory usage ratio (%) on the selected Node
Disk I/O	Disk usage by type (read / written) on the selected Node
Disk Usage	Total Disk usage ratio (%) on the selected Node
Received Bytes by Network Interface	Amount of bytes received over the network during 5 minutes on the selected Node
Transmitted Bytes by Network Interface	Amount of bytes transmitted over the network during 5 minutes on the selected Node
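
The Memory Usage panels above break memory down into used, buffers, cached, and free, and also show a total usage ratio. One common convention derives the "used" portion by subtracting the other three from the total, as sketched below. This convention and the function name are assumptions; the dashboard may compute its values differently.

```python
GIB = 2**30


def memory_breakdown(total: int, free: int, buffers: int, cached: int) -> dict[str, float]:
    """Split node memory (bytes) into used / buffers / cached / free.

    Assumes used = total - free - buffers - cached.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    used = total - free - buffers - cached
    return {
        "used": used,
        "buffers": buffers,
        "cached": cached,
        "free": free,
        "used_percent": 100.0 * used / total,
    }


# Hypothetical 16 GiB node; values are illustrative only.
breakdown = memory_breakdown(total=16 * GIB, free=4 * GIB, buffers=1 * GIB, cached=3 * GIB)
print(f"used: {breakdown['used'] / GIB:.1f} GiB ({breakdown['used_percent']:.1f}%)")
```

Under this convention, buffers and cache are not counted as used because the kernel can typically reclaim them when applications need memory.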

System: Overview : Summary information for each Node (Load Average, Swap, CPU/Memory/Network usage, etc.)

Panel Name	Description
System Uptime	Uptime duration of the system during the selected interval of the selected Node
Virtual CPU	Current Virtual CPU allocation of the selected Node
RAM	Current Memory allocation of the selected Node
Memory Available	Current Memory usage ratio (%) of the selected Node
Load Average	Average Load (min, max, avg shown separately) during the selected interval of the selected Node
Memory	Memory usage (GiB) by type (Total / Used / Available) during the selected interval of the selected Node - min, max, avg shown separately
CPU Usage	CPU usage ratio (%) of idle / user / system / steal / iowait / softirq / nice during the selected interval of the selected Node - min, max, avg shown separately
Memory Distribution	Memory Distribution usage (GiB) by type (Cached / Used / Free / Buffers) during the selected interval of the selected Node - min, max, avg shown separately
Network Traffic(KBps)	Network Traffic usage (kBps) by type (Inbound / Outbound for each item) during the selected interval of the selected Node - min, max, avg shown separately
Network Utilization	Network Utilization usage (MiB) by type (Sent / Received) during the selected interval of the selected Node - min, max, avg shown separately
Swap	Swap usage (B) by type (Used / Free) during the selected interval of the selected Node - min, max, avg shown separately
Swap Activity	Swap Activity usage (Bps) by type (Swap In / Swap Out) during the selected interval of the selected Node - min, max, avg shown separately

Dashboard Authoring Guide

http://docs.grafana.org/reference/templating/