
Viewing Monitoring

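Grafana is a visualization tool that reads metric data and, using queries, lets users build Dashboards in various easy-to-read forms and check the information they show. When collected metrics exceed a set threshold or change state, configured alarms can also deliver the relevant details to various destinations (Slack, Opsgenie, e-mail, etc.).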

For more details about Grafana, or to download it, see the Grafana homepage.

This guide explains how to use Grafana Dashboards and describes each item.

To use the service, click Monitoring in the ZCP Console side menu.

Monitoring screen - Grafana

Grafana Dashboard

  1. Select the Home menu at the top
  2. Check the expanded menus

    Recently selected Dashboard (Recent) and basic configuration Dashboard (4)

  3. Select Basic Configuration Dashboard
  4. Check the selected Dashboard

Dashboard Types (13)

Addon Dashboards

  • ElasticSearch: Displays information about ElasticSearch (JVM, CPU, Memory, Documents, Indices, etc.)

| Group | Panel | Description |
| --- | --- | --- |
| KPI | Cluster health | Current status of the Elasticsearch cluster (N/A / Green / Yellow / Red) |
| | Tripped for breakers | Number of times circuit breakers have tripped |
| | CPU usage Avg. | Average CPU usage |
| | JVM memory used Avg. | Average JVM memory usage |
| | Nodes | Number of nodes in the cluster. |
| | Data nodes | Number of data nodes in the cluster. |
| | Pending tasks | Cluster level changes which have not yet been executed. |
| | Open file descriptors per cluster | The total number of open files in Elasticsearch |
| Shards | Active primary shards | The number of primary shards in your cluster. This is an aggregate total across all indices. |
| | Active shards | Aggregate total of all shards across all indices, which includes replica shards. |
| | Initializing shards | Count of shards that are being freshly created. |
| | Relocating shards | The number of shards that are currently moving from one node to another node. |
| | Delayed shards | Shards delayed to reduce reallocation overhead. |
| | Unassigned shards | The number of shards that exist in the cluster state, but cannot be found in the cluster itself. |
| JVM Garbage Collection | GC count | Number of garbage collection runs |
| | GC time | Time spent in garbage collection |
| CPU and Memory | Load average | Load average in Elasticsearch |
| | CPU usage | CPU usage in Elasticsearch |
| | JVM memory usage | JVM memory used by Elasticsearch |
| | JVM memory committed | JVM memory committed by Elasticsearch |
| Disk and Network | Disk usage | Disk usage of Elasticsearch |
| | Network usage | Network usage of Elasticsearch |
| Documents | Documents count on node | Number of documents stored on the data node |
| | Documents indexed rate | Rate at which documents are indexed |
| | Documents deleted rate | Rate at which documents are deleted |
| | Documents merged rate | Rate at which documents are merged |
| | Documents merged bytes | Size of merged documents (bytes) |
| Times | Query time | Query execution time |
| | Indexing time | Indexing execution time |
| | Merging time | Merging execution time |
| | Throttle time for index store | Throttle time for storing the index |
| Indices: Count of documents and Total size | Count of documents with only primary shards | Number of documents on primary shards |
| | Total size of stored index data in bytes with only primary shards on all nodes | Total size of index data stored on primary shards |
| | Total size of stored index data in bytes with all shards on all nodes | Total size of index data stored on all shards |
| Indices: Index writer | Index writer with only primary shards on all nodes in bytes | Amount of data being written to the index by primary shards |
| | Index writer with all shards on all nodes in bytes | Amount of data being written to the index by all shards |
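Several of the KPI and Shards panels correspond to fields that Elasticsearch reports in its `_cluster/health` response. The sketch below is illustrative only: the sample values are made up, and the panel-to-field mapping is an assumption about how such a dashboard could be wired, not taken from the ZCP dashboard definition.

```python
# Hypothetical sample shaped like an Elasticsearch `_cluster/health` response.
SAMPLE_HEALTH = {
    "status": "yellow",
    "number_of_nodes": 3,
    "number_of_data_nodes": 2,
    "active_primary_shards": 10,
    "active_shards": 15,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 5,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
}

# Assumed mapping from panel names in this guide to response fields.
PANEL_FIELDS = {
    "Nodes": "number_of_nodes",
    "Data nodes": "number_of_data_nodes",
    "Pending tasks": "number_of_pending_tasks",
    "Active primary shards": "active_primary_shards",
    "Active shards": "active_shards",
    "Initializing shards": "initializing_shards",
    "Relocating shards": "relocating_shards",
    "Delayed shards": "delayed_unassigned_shards",
    "Unassigned shards": "unassigned_shards",
}


def summarize_health(health: dict) -> dict:
    """Build a panel-name -> value summary from a cluster health response."""
    summary = {panel: health.get(field) for panel, field in PANEL_FIELDS.items()}
    status = health.get("status")
    # Anything other than green/yellow/red is shown as N/A, as in the panel.
    known = {"green", "yellow", "red"}
    summary["Cluster health"] = status.capitalize() if status in known else "N/A"
    return summary


if __name__ == "__main__":
    for panel, value in summarize_health(SAMPLE_HEALTH).items():
        print(f"{panel}: {value}")
```

A Yellow status together with a non-zero Unassigned shards value, as in this sample, usually means some replica shards have not been allocated.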


  • ZCP Services Status : Health check of the 'zcp-system' namespace (CPU usage, status values)

| Panel | Description |
| --- | --- |
| Duration | Probe duration in seconds |
| Status : alertmanager | alertmanager health (UP / DOWN) |
| alertmanager Status Code | alertmanager status code |
| Status : grafana | grafana health (UP / DOWN) |
| grafana Status Code | grafana status code |
| Status : prometheus | prometheus health (UP / DOWN) |
| prometheus Status Code | prometheus status code |


Cluster Dashboards

  • Etcd Cluster : Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)

| Panel | Description |
| --- | --- |
| Etcd has a leader? | Checks whether etcd has a leader (YES / NO) |
| The number of leader changes seen | Number of etcd leader changes |
| The total number of failed proposals seen | Total number of failed proposals |
| RPC Rate | Number of gRPC calls started or handled over 5 minutes |
| Etcd DB Size | Total size of the etcd MVCC database, in bytes |
| Etcd Disk Sync Duration | WAL fsync activity of the etcd disk over 5 minutes (histogram, 99th percentile) |
| Etcd Memory | Memory usage of the 'etcd' job |
| Etcd Client Traffic In | Total traffic received by the etcd network client gRPC over 5 minutes |
| Etcd Client Traffic Out | Total traffic sent by the etcd network client gRPC over 5 minutes |
| Etcd Peer Traffic In | Total traffic received by the etcd network peer over 5 minutes |
| Etcd Peer Traffic Out | Total traffic sent by the etcd network peer over 5 minutes |
| Etcd Proposals rate(Fail,Pending,commit,apply) | Total number of proposals committed by the etcd server over 5 minutes |
| Etcd Disk operations(AVG) | Total number of backend commits by the etcd disk over 2 minutes |
| Network | Total traffic received by the etcd network client gRPC over 2 minutes |
| Snapshot duration | Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable. |
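Many of the etcd panels describe a counter observed "over 5 minutes". As a rough illustration of what such a windowed rate means, the sketch below computes the average per-second increase of a counter over a trailing window. It is a simplified stand-in, not the dashboard's actual query: the function name and sample data are hypothetical, samples are assumed to be sorted by time, and unlike Prometheus's `rate()` it does not handle counter resets or extrapolation.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value); assumed sorted by timestamp.
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample], window_seconds: float, now: float) -> float:
    """Average per-second increase of a counter over the trailing window."""
    in_window = [s for s in samples if now - window_seconds <= s[0] <= now]
    if len(in_window) < 2:
        return 0.0  # not enough points to compute a rate
    t_first, v_first = in_window[0]
    t_last, v_last = in_window[-1]
    if t_last == t_first:
        return 0.0
    return (v_last - v_first) / (t_last - t_first)


if __name__ == "__main__":
    # Hypothetical counter samples, e.g. total gRPC calls handled.
    samples = [(0.0, 0.0), (60.0, 60.0), (120.0, 180.0), (300.0, 600.0)]
    print(simple_rate(samples, window_seconds=300, now=300.0))
```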


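  • Kubernetes: Cluster Overview : Overall, per-Node average, and Cluster average resource information (number of Nodes/Pods/Containers, CPU/Memory/Network Usage, etc.)

| Group | Panel | Description |
| --- | --- | --- |
| Resource Dashboard | Alertmanager Alerts Firing | Total number of Alerts |
| | Node Not Ready | Number of Nodes in the 'Not Ready' state |
| | Node Unschedulable | Number of Nodes in the 'Unschedulable' state |
| | Node Memory Pressure | Number of Nodes in the 'Memory Pressure' state |
| | Node Disk Pressure | Number of Nodes in the 'Disk Pressure' state |
| | Running Pod Total | Number of Pods currently in the 'Running' state |
| | Running Pod Total by Node | Number of Pods currently in the 'Running' state on each node |
| | Running Container Total | Number of Containers currently in the 'Running' state |
| | Running Container Total by Node | Number of Containers currently in the 'Running' state on each node |
| Node Resource Usage | Number of Node | Total number of nodes currently in the cluster |
| | Total CPU | Total CPU of the nodes currently in the cluster |
| | Used Memory | Memory used by the nodes currently in the cluster |
| | Total Memory | Total Memory of the nodes currently in the cluster |
| | Disk Usage | Disk used by the nodes currently in the cluster |
| | Disk Total | Total Disk of the nodes currently in the cluster |
| | Avg CPU Usage | Average CPU usage of the nodes currently in the cluster |
| | Avg Memory Usage | Average Memory usage of the nodes currently in the cluster |
| | Avg Disk Usage | Average Disk usage of the nodes currently in the cluster |
| | Network Usage (Node NIC) | Network usage of the nodes currently in the cluster |
| Cluster Resource Usage | Cluster CPU Usage(Used/Total) | CPU used as a share of the total CPU of the nodes currently in the cluster (%). The total CPU (Core) and the amount used are also shown below it. |
| | Cluster Memory Usage(Used/Total) | Memory used as a share of the total Memory of the nodes currently in the cluster (%). The total Memory (GiB) and the amount used are also shown below it. |
| | Cluster Disk Usage(Used/Total) | Disk used as a share of the total Disk of the nodes currently in the cluster (%). The total Disk (GiB) and the amount used are also shown below it. |
| | Pod Count by namespace | Number of Pods registered in Kubernetes, per Namespace |
| | Container Count by namespace | Number of Containers registered in Kubernetes, per Namespace |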
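The Used/Total panels show a percentage with the absolute amounts underneath. A minimal sketch of that arithmetic follows; the function names and sample numbers are hypothetical, and rounding to one decimal place is an assumption rather than the dashboard's documented behavior.

```python
def usage_percent(used: float, total: float) -> float:
    """Share of `total` that is in use, as a percentage rounded to 1 decimal."""
    if total <= 0:
        return 0.0  # avoid dividing by zero when no capacity is reported
    return round(used / total * 100, 1)


def bytes_to_gib(num_bytes: float) -> float:
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


if __name__ == "__main__":
    # Hypothetical cluster: 6 GiB of memory used out of 8 GiB in total.
    used_bytes = 6 * 1024**3
    total_bytes = 8 * 1024**3
    pct = usage_percent(used_bytes, total_bytes)
    print(f"{pct}% ({bytes_to_gib(used_bytes)} GiB / {bytes_to_gib(total_bytes)} GiB)")
```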


  • Kubernetes: Performance Overview : API Server Requests/Latency, Pod/Container Running Trends, Creating Rate, etc.

| Panel | Description |
| --- | --- |
| APIServer Request Rate | Total number of APIServer requests, in 2-minute intervals |
| APIServer Latency | Average APIServer request latency |
| Kubelet POD Start Latency | Latency in microseconds for a single pod to go from pending to running. Broken down by podname. |
| Running Pod Trends | Number of pods in the 'running' state on the kubelet |
| Create Rate of Pods | Rate of Pods newly created on the kubelet over 2 minutes |
| Running Containers Trends | Number of Containers in the 'running' state on the kubelet |
| Create Rate of Containers | Rate of Containers newly created on the kubelet over 2 minutes |


  • Kubernetes: Resource Requests : Displays information about Node CPU/Memory usage and Pod count

| Panel | Description |
| --- | --- |
| Cluster CPU(Allocated/Request) | This represents the total [CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) in the cluster. For comparison the total [allocatable CPU cores](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
| Cluster Memory(Allocated/Request) | This represents the total [memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) in the cluster. For comparison the total [allocatable memory](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
| Cluster Pod(Allocated/Request) | This represents the total number of [Pods requested](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run) in the cluster. For comparison the total [allocatable Pod count](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
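Kubernetes expresses CPU requests either in whole or fractional cores (`"2"`, `"0.5"`) or in millicores (`"500m"`, which equals 0.5 core). The sketch below shows how requests could be summed and compared with allocatable capacity, as the CPU panel does conceptually. The helper names and sample values are hypothetical, and only these two quantity formats are handled.

```python
from typing import Iterable


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores -> cores
    return float(quantity)


def request_ratio(requests: Iterable[str], allocatable: Iterable[str]) -> float:
    """Fraction of allocatable CPU that is covered by resource requests."""
    total_requested = sum(parse_cpu(q) for q in requests)
    total_allocatable = sum(parse_cpu(q) for q in allocatable)
    if total_allocatable == 0:
        return 0.0
    return total_requested / total_allocatable


if __name__ == "__main__":
    # Hypothetical container requests against one node with 4 allocatable cores.
    container_requests = ["500m", "1", "250m"]
    node_allocatable = ["4"]
    print(request_ratio(container_requests, node_allocatable))
```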


Container Dashboards

  • Kubernetes: DaemonSet Overview : Information about DaemonSets (Replicas, CPU/Memory/Network Usage, etc.)

| Panel Name | Description |
| --- | --- |
| Desired Replicas | Expected number of daemonset replicas |
| Available Replicas | Number of currently running daemonset replicas |
| Metadata Generation | Number of daemonsets created from metadata |
| DaemonSet Create Time | Creation time of the oldest daemonset from now |
| Total CPU | Total CPU usage (Core) of containers created by daemonset |
| Total Memory | Total memory usage (MiB) of containers created by daemonset |
| Total Network | Total network usage (MBps) of containers created by daemonset |
| CPU Usage | CPU usage of containers created by daemonset |
| Memory Usage | Memory usage of containers created by daemonset |
| Replicas Status | Status of daemonset replicas (Ready / Available / Unavailable / Misscheduled) |


  • Kubernetes: Deployment Overview : Information about Deployments (Replicas, CPU/Memory/Network Usage, etc.)

| Panel Name | Description |
| --- | --- |
| Desired Replicas | Expected number of deployment replicas |
| Available Replicas | Number of currently running deployment replicas |
| Observed Generation | Number of deployments created based on Observed |
| Metadata Generation | Number of deployments created based on Metadata |
| Deployment Create Time | Creation time of the oldest deployment from now |
| AVG CPU | Average CPU usage (Core) of Containers created by Deployment |
| AVG Memory | Average Memory usage (MiB) of Containers created by Deployment |
| AVG Network | Average Network usage (kBps) of Containers created by Deployment |
| CPU Usage | CPU usage of Containers created by Deployment |
| Memory Usage | Memory usage of Containers created by Deployment |
| Replicas Status | Status of Deployment replicas (Ready / Available / Unavailable / Misscheduled) |
| Spec | Deployment replica specification (Replicas / Paused) |


  • Kubernetes: POD Overview : Information about Pods (Pod status, restart count, and the CPU/Memory/Network usage of each Pod)

| Panel Name | Description |
| --- | --- |
| POD Count | Number of Pods in the selected Namespace |
| Pod Status | Status of Pods in the selected Namespace (Failed / Pending / Running / Succeeded / Unknown) |
| Pod Restart Count | Number of restarts for Pods in the selected Namespace |
| POD/Container CPU Usage | CPU usage and trend for Containers in Pods of the selected Namespace |
| POD/Container Memory Usage | Memory usage and trend for Containers in Pods of the selected Namespace |
| POD/Container Network Usage | Network usage and trend for Containers in Pods of the selected Namespace |


  • Kubernetes: StatefulSets Overview : Information about StatefulSets (Replicas, CPU/Memory/Network Usage, etc.)

| Panel Name | Description |
| --- | --- |
| Desired Replicas | Expected number of statefulset replicas |
| Available Replicas | Number of statefulset replicas currently in use |
| Observed Generation | Number of statefulsets created by observed generation |
| Metadata Generation | Number of statefulsets created by metadata generation |
| Statefulset Create Time | Creation time of the oldest statefulset from now |
| Total CPU | Total CPU used by containers created from statefulsets (Core) |
| Total Memory | Total Memory used by containers created from statefulsets (MiB) |
| Total Network | Total Network usage by containers created from statefulsets (MBps) |
| CPU Usage | CPU usage of containers created from statefulsets |
| Memory Usage | Memory usage of containers created from statefulsets |
| Replicas Status | Status of replicas in the statefulset (Current / Available) |


System Dashboards

  • System Disk Space : Disk usage trend on each Node

| Panel | Description |
| --- | --- |
| Root Disk capacity check | Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, a database volume, or a volume used for temporary space can cause downtime. Some storage may also have reduced performance when only a small amount of space is available. |


  • System Usage Overview : Usage information for each Node (Idle CPU, Disk I/O, Network received/transmitted, Memory/Disk Usage, etc.)

| Panel Name | Description |
| --- | --- |
| Idle by CPU Core | 5-minute average idle time of CPUs within the selected Node |
| System Load(1,5,15) | Average load of the selected Node (1 min / 5 min / 15 min) |
| Memory Usage | Memory usage by type on the selected Node (memory used / memory buffers / memory cached / memory free) |
| Memory Usage | Total memory usage ratio (%) on the selected Node |
| Disk I/O | Disk usage by type (read / written) on the selected Node |
| Disk Usage | Total Disk usage ratio (%) on the selected Node |
| Received Bytes by Network Interface | Amount of bytes received over the network during 5 minutes on the selected Node |
| Transmitted Bytes by Network Interface | Amount of bytes transmitted over the network during 5 minutes on the selected Node |


  • System: Overview : Summary information for each Node (Load Average, Swap, CPU/Memory/Network Usage, etc.)

| Panel Name | Description |
| --- | --- |
| System Uptime | Uptime duration of the system during the selected interval of the selected Node |
| Virtual CPU | Current Virtual CPU allocation of the selected Node |
| RAM | Current Memory allocation of the selected Node |
| Memory Available | Current Memory usage ratio (%) of the selected Node |
| Load Average | Average Load (min, max, avg shown separately) during the selected interval of the selected Node |
| Memory | Memory usage (GiB) by type (Total / Used / Available) during the selected interval of the selected Node (min, max, avg shown separately) |
| CPU Usage | CPU usage ratio (%) of idle / user / system / steal / iowait / softirq / nice during the selected interval of the selected Node (min, max, avg shown separately) |
| Memory Distribution | Memory Distribution usage (GiB) by type (Cached / Used / Free / Buffers) during the selected interval of the selected Node (min, max, avg shown separately) |
| Network Traffic(KBps) | Network Traffic usage (kBps) by type (Inbound / Outbound for each item) during the selected interval of the selected Node (min, max, avg shown separately) |
| Network Utilization | Network Utilization usage (MiB) by type (Sent / Received) during the selected interval of the selected Node (min, max, avg shown separately) |
| Swap | Swap usage (B) by type (Used / Free) during the selected interval of the selected Node (min, max, avg shown separately) |
| Swap Activity | Swap Activity usage (Bps) by type (Swap In / Swap Out) during the selected interval of the selected Node (min, max, avg shown separately) |
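Most panels in this dashboard report min, max, and avg over the selected interval. As a small illustration of what those three figures summarize, the sketch below reduces a list of samples to the same trio. The function name and sample values are hypothetical, and Grafana computes these in the panel legend itself rather than through code like this.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Reduce the samples of one series to its min, max, and average."""
    if not values:
        raise ValueError("at least one sample is required")
    return {"min": min(values), "max": max(values), "avg": mean(values)}


if __name__ == "__main__":
    # Hypothetical CPU usage samples (%) collected over the selected interval.
    cpu_usage = [12.5, 40.0, 22.5]
    print(summarize(cpu_usage))
```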


Dashboard Authoring Guide

http://docs.grafana.org/reference/templating/
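The templating guide linked above covers dashboard variables. As a rough sketch of where a variable lives in a dashboard's JSON model, the example below builds a minimal dashboard dictionary with one query variable and one panel query that references it. The dashboard title, variable name, metric, and query are illustrative assumptions (the query assumes a Prometheus data source with kube-state-metrics), not part of any ZCP dashboard.

```python
import json

# Minimal, hypothetical dashboard model with a single template variable.
dashboard = {
    "title": "Example: Pods by namespace",  # illustrative title
    "templating": {
        "list": [
            {
                "name": "namespace",  # referenced in queries as $namespace
                "type": "query",
                # Assumes a Prometheus data source exposing kube_pod_info.
                "query": "label_values(kube_pod_info, namespace)",
            }
        ]
    },
    "panels": [
        {
            "title": "Pod count",
            "type": "graph",
            "targets": [
                # The variable is substituted before the query is sent.
                {"expr": 'count(kube_pod_info{namespace="$namespace"})'}
            ],
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(dashboard, indent=2))
```

Selecting a different value in the variable drop-down re-runs every panel query that references `$namespace`, which is how the Namespace-scoped dashboards in this guide can share one layout.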


L1 is the author of this solution article.
