Grafana is a visualization tool that reads metric data and uses queries to build dashboards in various forms, so that the information is easy to read. When collected metrics exceed a threshold or change state, configured alerts can send notifications to several destinations (Slack, Opsgenie, email, etc.).
For more details about Grafana, or to download it, refer to the Grafana website.
This guide explains how to use Grafana dashboards and describes each item.
To use the service, click Monitoring in the ZCP Console side menu.
Monitoring screen - Grafana
Grafana Dashboard
- Select the Home menu at the top
- Check the expanded menus
Recently selected Dashboard (Recent) and basic configuration Dashboard (4)
- Select Basic Configuration Dashboard
- Check the selected Dashboard
Dashboard Types (13)
Addon Dashboards
ElasticSearch: Displays information about ElasticSearch (JVM, CPU, Memory, Documents, Indices, etc.)
Group | Panel Name | Description |
---|---|---|
KPI | Cluster health | Current status of the Elasticsearch cluster (N/A / Green / Yellow / Red) |
KPI | Tripped for breakers | Number of times circuit breakers have tripped |
KPI | CPU usage Avg. | Average CPU usage |
KPI | JVM memory used Avg. | Average JVM memory usage |
KPI | Nodes | Number of nodes in the cluster. |
KPI | Data nodes | Number of data nodes in the cluster. |
KPI | Pending tasks | Cluster level changes which have not yet been executed. |
KPI | Openfile descriptors per cluster | The total number of open files in Elasticsearch |
Shards | Active primary shards | The number of primary shards in your cluster. This is an aggregate total across all indices. |
Shards | Active shards | Aggregate total of all shards across all indices, which includes replica shards. |
Shards | Initializing shards | Count of shards that are being freshly created. |
Shards | Relocating shards | The number of shards that are currently moving from one node to another node. |
Shards | Delayed shards | Shards delayed to reduce reallocation overhead. |
Shards | Unassigned shards | The number of shards that exist in the cluster state, but cannot be found in the cluster itself. |
JVM Garbage Collection | GC count | Number of garbage collections performed |
JVM Garbage Collection | GC time | Time spent on garbage collection |
CPU and Memory | Load average | Load average on the Elasticsearch nodes |
CPU and Memory | CPU usage | CPU usage of Elasticsearch |
CPU and Memory | JVM memory usage | JVM memory used by Elasticsearch |
CPU and Memory | JVM memory committed | JVM memory committed for Elasticsearch |
Disk and Network | Disk usage | Disk usage of Elasticsearch |
Disk and Network | Network usage | Network usage of Elasticsearch |
Documents | Documents count on node | Number of documents stored on data nodes |
Documents | Documents indexed rate | Rate at which documents are indexed |
Documents | Documents deleted rate | Rate at which documents are deleted |
Documents | Documents merged rate | Rate at which documents are merged |
Documents | Documents merged bytes | Size of merged documents (bytes) |
Times | Query time | Query execution time |
Times | Indexing time | Indexing execution time |
Times | Merging time | Merging execution time |
Times | Throttle time for index store | Throttle time for storing indices |
Indices: Count of documents and Total size | Count of documents with only primary shards | Number of documents in primary shards |
Indices: Count of documents and Total size | Total size of stored index data in bytes with only primary shards on all nodes | Total size of index data stored in primary shards |
Indices: Count of documents and Total size | Total size of stored index data in bytes with all shards on all nodes | Total size of index data stored in all shards |
Indices: Index writer | Index writer with only primary shards on all nodes in bytes | Amount of data being written to the index by primary shards |
Indices: Index writer | Index writer with all shards on all nodes in bytes | Amount of data being written to the index by all shards |
ZCP Services Status : Health check of the 'zcp-system' namespace (CPU usage, status values)
Panel Name | Description |
---|---|
Duration | Probe duration in seconds |
Status : alertmanager | alertmanager health (UP / DOWN) |
alertmanager Status Code | alertmanager status code |
Status : grafana | grafana health (UP / DOWN) |
grafana Status Code | grafana status code |
Status : prometheus | prometheus health (UP / DOWN) |
prometheus Status Code | prometheus status code |
Cluster Dashboards
Etcd Cluster : Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)
Panel Name | Description |
---|---|
Etcd has a leader? | Whether etcd has a leader (YES / NO) |
The number of leader changes seen | Number of etcd leader changes |
The total number of failed proposals seen | Total number of failed proposals |
RPC Rate | Rate of gRPC calls started or handled over 5 minutes |
Etcd DB Size | Total size of the etcd MVCC database in bytes |
Etcd Disk Sync Duration | WAL fsync duration of the etcd disk over 5 minutes (99th percentile of the histogram) |
Etcd Memory | Memory usage of the 'etcd' job |
Etcd Client Traffic In | Total client gRPC traffic received by etcd over 5 minutes |
Etcd Client Traffic Out | Total client gRPC traffic sent by etcd over 5 minutes |
Etcd Peer Traffic In | Total peer traffic received by etcd over 5 minutes |
Etcd Peer Traffic Out | Total peer traffic sent by etcd over 5 minutes |
Etcd Proposals rate(Fail,Pending,commit,apply) | Total number of proposals committed by the etcd server over 5 minutes |
Etcd Disk operations(AVG) | Total number of backend commits by the etcd disk over 2 minutes |
Network | Total client gRPC traffic received by etcd over 2 minutes |
Snapshot duration | Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable. |
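Several panels above report how much a counter grew over a fixed window (5 minutes or 2 minutes). The following is a minimal sketch of that idea, not the query the dashboard actually runs. It assumes the input is a list of (timestamp, value) samples from an ever-increasing counter; the function names `window_increase` and `per_second_rate` are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def window_increase(samples: List[Sample], now: float, window_s: float = 300.0) -> float:
    """Return how much a counter grew within the last `window_s` seconds."""
    in_window = sorted((t, v) for t, v in samples if now - window_s <= t <= now)
    if len(in_window) < 2:
        return 0.0  # not enough data points to measure growth
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset (for example, a process restart);
        # in that case count the new value as growth since the reset.
        increase += cur - prev if cur >= prev else cur
    return increase


def per_second_rate(samples: List[Sample], now: float, window_s: float = 300.0) -> float:
    """Average per-second growth of the counter over the window."""
    return window_increase(samples, now, window_s) / window_s
```

For a 2-minute panel the same functions would be called with `window_s=120.0`. A real time-series backend may also extrapolate to the window edges, which this sketch does not attempt.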
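Kubernetes: Cluster Overview : Resource information as totals, node averages, and cluster averages (number of Nodes/Pods/Containers, CPU/Memory/Network usage, etc.)
Group | Panel Name | Description |
---|---|---|
Resource Dashboard | Alertmanager Alerts Firing | Total number of alerts |
Resource Dashboard | Node Not Ready | Number of nodes in the 'Not Ready' state |
Resource Dashboard | Node Unschedulable | Number of nodes in the 'Unschedulable' state |
Resource Dashboard | Node Memory Pressure | Number of nodes in the 'Memory Pressure' state |
Resource Dashboard | Node Disk Pressure | Number of nodes in the 'Disk Pressure' state |
Resource Dashboard | Running Pod Total | Number of Pods currently in the 'Running' state |
Resource Dashboard | Running Pod Total by Node | Number of Pods currently in the 'Running' state on each node |
Resource Dashboard | Running Container Total | Number of Containers currently in the 'Running' state |
Resource Dashboard | Running Container Total by Node | Number of Containers currently in the 'Running' state on each node |
Node Resource Usage | Number of Node | Total number of nodes currently in the cluster |
Node Resource Usage | Total CPU | Total CPU of the nodes currently in the cluster |
Node Resource Usage | Used Memory | Memory used by the nodes currently in the cluster |
Node Resource Usage | Total Memory | Total memory of the nodes currently in the cluster |
Node Resource Usage | DIsk Usage | Disk used by the nodes currently in the cluster |
Node Resource Usage | DIsk Total | Total disk of the nodes currently in the cluster |
Node Resource Usage | Avg CPU Usage | Average CPU usage of the nodes currently in the cluster |
Node Resource Usage | Avg Memory Usage | Average memory usage of the nodes currently in the cluster |
Node Resource Usage | Avg Disk Usage | Average disk usage of the nodes currently in the cluster |
Node Resource Usage | Network Usage (Node NIC) | Network usage of the nodes currently in the cluster |
Cluster Resource Usage | Cluster CPU Usage(Used/Total) | Percentage (%) of total CPU in use across the nodes in the cluster - the total CPU (cores) and the amount used are also shown underneath |
Cluster Resource Usage | Cluster Memory Usage(Used/Total) | Percentage (%) of total memory in use across the nodes in the cluster - the total memory (GiB) and the amount used are also shown underneath |
Cluster Resource Usage | Cluster DIsk Usage(Used/Total) | Percentage (%) of total disk in use across the nodes in the cluster - the total disk (GiB) and the amount used are also shown underneath |
Cluster Resource Usage | Pod Count by namespace | Number of Pods registered in Kubernetes, by namespace |
Cluster Resource Usage | Container Count by namespace | Number of Containers registered in Kubernetes, by namespace |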
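The Used/Total panels above show a percentage together with the underlying used and total amounts. As a rough illustration of that arithmetic, here is a minimal sketch; the helper names `usage_percent` and `bytes_to_gib` are hypothetical and are not part of Grafana or ZCP.

```python
def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage rounded to one decimal place."""
    if total <= 0:
        return 0.0  # avoid division by zero when no capacity is reported
    return round(used / total * 100.0, 1)


def bytes_to_gib(num_bytes: float) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024 ** 3
```

For example, a cluster with 24 cores in total and 6 cores in use would be reported as `usage_percent(6, 24)`, that is 25.0 percent.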
Kubernetes: Performance Overview : API Server requests/latency, Pod/Container running trends, creation rate, etc.
Panel Name | Description |
---|---|
APIServer Request Rate | Sum of API server requests over 2-minute windows |
APIServer Latency | Average API server request latency |
Kubelet POD Start Latency | Latency in microseconds for a single pod to go from pending to running. Broken down by podname. |
Running Pod Trands | Number of Pods in the 'running' state on the kubelet |
Create Rate of Pods | Rate of Pods newly created by the kubelet over 2 minutes |
Running Containers Trands | Number of Containers in the 'running' state on the kubelet |
Create Rate of Containers | Rate of Containers newly created by the kubelet over 2 minutes |
Kubernetes: Resource Requests : Displays information on Node CPU/Memory usage and Pod count
Container Dashboards
Kubernetes: DaemonSet Overview : Information about DaemonSets (Replicas, CPU/Memory/Network usage, etc.)
Panel Name | Description |
---|---|
Desired Replicas | Expected number of daemonset replicas |
Available Replicas | Number of currently running daemonset replicas |
Metadata Generation | Generation number recorded in the daemonset metadata (sequence number of the desired state) |
DaemonSet Create Time | Creation time of the oldest daemonset from now |
Total CPU | Total CPU usage (Core) of containers created by daemonset |
Total Memory | Total memory usage (MiB) of containers created by daemonset |
Total Network | Total network usage (MBps) of containers created by daemonset |
CPU Usage | CPU usage of containers created by daemonset |
Memory Usage | Memory usage of containers created by daemonset |
Replicas Status | Status of daemonset replicas (Ready / Available / Unavailable / Misscheduled) |
Kubernetes: Deployment Overview : Information about Deployments (Replicas, CPU/Memory/Network usage, etc.)
Panel Name | Description |
---|---|
Desired Replicas | Expected number of deployment replicas |
Available Replicas | Number of currently running deployment replicas |
Observed Generation | Generation number most recently observed by the deployment controller |
Metadata Generation | Generation number recorded in the deployment metadata (sequence number of the desired state) |
Deployment Create Time | Creation time of the oldest deployment from now |
AVG CPU | Average CPU usage (Core) of Containers created by Deployment |
AVG Memory | Average Memory usage (MiB) of Containers created by Deployment |
AVG Network | Average Network usage (kBps) of Containers created by Deployment |
CPU Usage | CPU usage of Containers created by Deployment |
Memory Usage | Memory usage of Containers created by Deployment |
Replicas Status | Status of Deployment replicas (Ready / Available / Unavailable / Misscheduled) |
Spec | Deployment replica specification (Replicas / Paused) |
Kubernetes: POD Overview : Information about Pods (Pod status, restart count, and the CPU/Memory/Network usage of Pods)
Panel Name | Description |
---|---|
POD Count | Number of Pods in the selected Namespace |
Pod Status | Status of Pods in the selected Namespace (Failed / Pending / Running / Succeeded / Unknown) |
Pod Restart Count | Number of restarts for Pods in the selected Namespace |
POD/Container CPU Usage | CPU usage and trend for Containers in Pods of the selected Namespace |
POD/Container Memory Usage | Memory usage and trend for Containers in Pods of the selected Namespace |
POD/Container Network Usage | Network usage and trend for Containers in Pods of the selected Namespace |
Kubernetes: StatefulSets Overview : Information about StatefulSets (Replicas, CPU/Memory/Network usage, etc.)
Panel Name | Description |
---|---|
Desired Replicas | Expected number of statefulset replicas |
Available Replicas | Number of statefulset replicas currently in use |
Observed Generation | Generation number most recently observed by the statefulset controller |
Metadata Generation | Generation number recorded in the statefulset metadata (sequence number of the desired state) |
Statefulset Create Time | Creation time of the oldest statefulset from now |
Total CPU | Total CPU used by containers created from statefulsets (Core) |
Total Memory | Total Memory used by containers created from statefulsets (MiB) |
Total Network | Total Network usage by containers created from statefulsets (MBps) |
CPU Usage | CPU usage of containers created from statefulsets |
Memory Usage | Memory usage of containers created from statefulsets |
Replicas Status | Status of replicas in the statefulset (Current / Available) |
System Dashboards
System Disk Space : Disk usage trend on each Node
Panel Name | Description |
---|---|
Root Disk capacity check | Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, database volume, or a volume used for temporary space can cause downtime. Some storage may also have reduced performance when only a small amount of space is available. |
System Usage Overview : Usage information for each Node (idle CPU, disk I/O, network received/transmitted, Memory/Disk usage, etc.)
Panel Name | Description |
---|---|
Idle by CPU Core | 5-minute average idle time of CPUs within the selected Node |
System Load(1,5,15) | Average load of the selected Node (1 min / 5 min / 15 min) |
Memory Usage | Memory usage by type on the selected Node (memory used / memory buffers / memory cached / memory free) |
Memory Usage | Total memory usage ratio (%) on the selected Node |
Disk I/O | Disk usage by type (read / written) on the selected Node |
Disk Usage | Total Disk usage ratio (%) on the selected Node |
Received Bytes by Network Interface | Amount of bytes received over the network during 5 minutes on the selected Node |
Transmitted Bytes by Network Interface | Amount of bytes transmitted over the network during 5 minutes on the selected Node |
System: Overview : Summary information for each Node (Load Average, Swap, CPU/Memory/Network usage, etc.)
Panel Name | Description |
---|---|
System Uptime | Uptime duration of the system during the selected interval of the selected Node |
Virtual CPU | Current Virtual CPU allocation of the selected Node |
RAM | Current Memory allocation of the selected Node |
Memory Available | Current Memory usage ratio (%) of the selected Node |
Load Average | Average Load (min, max, avg shown separately) during the selected interval of the selected Node |
Memory | Memory usage (GiB) by type (Total / Used / Available) during the selected interval of the selected Node - min, max, avg shown separately |
CPU Usage | CPU usage ratio (%) of idle / user / system / steal / iowait / softirq / nice during the selected interval of the selected Node - min, max, avg shown separately |
Memory Distribution | Memory Distribution usage (GiB) by type (Cached / Used / Free / Buffers) during the selected interval of the selected Node - min, max, avg shown separately |
Network Traffic(KBps) | Network Traffic usage (kBps) by type (Inbound / Outbound for each item) during the selected interval of the selected Node - min, max, avg shown separately |
Network Utilization | Network Utilization usage (MiB) by type (Sent / Received) during the selected interval of the selected Node - min, max, avg shown separately |
Swap | Swap usage (B) by type (Used / Free) during the selected interval of the selected Node - min, max, avg shown separately |
Swap Activity | Swap Activity usage (Bps) by type (Swap In / Swap Out) during the selected interval of the selected Node - min, max, avg shown separately |
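Many panels in the table above show min, max, and avg values for the selected interval. The sketch below illustrates how such a summary could be derived from the values that fall inside that interval. The function name `summarize` is hypothetical, and the sketch assumes the values are already available as a plain list of numbers.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return the min, max, and average of the values in the selected interval."""
    if not values:
        raise ValueError("no samples in the selected interval")
    return {"min": min(values), "max": max(values), "avg": mean(values)}
```

Calling `summarize` on the load-average samples of one Node, for instance, would give the three figures displayed next to that series in the legend.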
Dashboard Authoring Guide
http://docs.grafana.org/reference/templating/