
Alerting on JVM Problems with Prometheus + Alertmanager



Original article

In the previous article I described how to monitor the JVM with Prometheus + Grafana. This article shows how to use Prometheus + Alertmanager to send alerts for certain JVM conditions.

The scripts mentioned in this article can be downloaded here.

Overview

Tools used:

Docker: this article uses Docker heavily to start the various applications.

Prometheus: scrapes and stores metrics and provides querying; this article focuses on its alerting feature.

Grafana: data visualization (not the focus of this article; it is only used so the reader can see the abnormal metrics at a glance).

Alertmanager: sends alert notifications to the relevant people.

JMX exporter: exposes JVM-related metrics from JMX.

Tomcat: used to simulate a Java application.

The overall steps are:

Use the JMX exporter to run a small HTTP server inside each Java process.

Configure Prometheus to scrape the metrics exposed by that HTTP server.

Configure Prometheus alert triggering rules:

heap usage above 50% / 80% / 90% of the maximum heap size

instance down for more than 30 seconds / 1 minute / 5 minutes

old GC time above 30% / 50% / 80% of the last 5 minutes

Configure Grafana to connect to Prometheus and set up a dashboard.

Configure Alertmanager's notification rules.

The alerting flow is roughly:

Prometheus evaluates the alert triggering rules; when a rule fires, it sends the alert to Alertmanager.

When Alertmanager receives an alert, it decides whether to send a notification and, if so, to whom.

Step 1: Start a few Java applications

1) Create a directory named prom-jvm-demo.

2) Download the JMX exporter into this directory.

3) Create a file named simple-config.yml with the following content:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
 - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
   name: os_$1
   type: GAUGE
   attrNameSnakeCase: true

4) Run the following commands to start three Tomcat instances. Remember to replace the host path in the -v option with the correct path (the -Xms/-Xmx values are deliberately set very small here so that the alert conditions will be triggered):

docker run -d \
  --name tomcat-1 \
  -v :/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v :/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v :/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine

5) Visit http://localhost:8080|8081|8082 to check that the Tomcat instances started successfully.

6) Visit the corresponding http://localhost:6060|6061|6062 to see the metrics exposed by the JMX exporter.

Note: the simple-config.yml used here only exposes JVM-related information; for more elaborate configurations, see the JMX exporter documentation.
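
As a quick command-line check, the following sketch (assuming the ports mapped in the docker commands above) confirms that the exporter is serving the metric families the alert rules later in this article rely on:

# Sanity-check sketch: these metric families should appear once Tomcat is up.
# Assumes the 6060/6061/6062 port mappings from the commands above.
curl -s http://localhost:6060/metrics | grep -E '^(jvm_memory_bytes_used|jvm_memory_bytes_max|jvm_gc_collection_seconds_sum|os_)' | head -n 20

If nothing matches, the javaagent probably was not loaded; docker logs tomcat-1 usually shows why.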

Step 2: Start Prometheus

1) In the prom-jvm-demo directory created earlier, create a file named prom-jmx.yml with the following content:

scrape_configs:
  - job_name: "java"
    static_configs:
    - targets:
      - ":6060"
      - ":6061"
      - ":6062"

# Alertmanager address
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - ":9093"

# Alert rule files to load
rule_files:
  - "/prometheus-config/prom-alert-rules.yml"

2) Create a file named prom-alert-rules.yml; this file contains the alert triggering rules:

# severity levels, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:

    # down for more than 30 seconds
    - alert: instance-down
      expr: up == 0
      for: 30s
      labels:
        severity: yellow
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

    # down for more than 1 minute
    - alert: instance-down
      expr: up == 0
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    # down for more than 5 minutes
    - alert: instance-down
      expr: up == 0
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # heap usage above 50%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
      for: 1m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. Current usage: {{ $value }}%"

    # heap usage above 80%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. Current usage: {{ $value }}%"
    
    # heap usage above 90%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
      for: 1m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. Current usage: {{ $value }}%"

    # Old GC took more than 30% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
      for: 5m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }}s"

    # Old GC took more than 50% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
      for: 5m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }}s"

    # Old GC took more than 80% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }}s"
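
Before starting Prometheus it can be worth validating these files. A minimal sketch, assuming you run it from the prom-jvm-demo directory and use the promtool binary bundled in the prom/prometheus image:

# Sketch: validate the alert rules with promtool.
# Assumes the current directory is prom-jvm-demo.
docker run --rm \
  -v "$(pwd)":/prometheus-config \
  --entrypoint /bin/promtool \
  prom/prometheus check rules /prometheus-config/prom-alert-rules.yml

# The scrape/alerting config can be checked the same way (it also resolves the rule_files path):
docker run --rm \
  -v "$(pwd)":/prometheus-config \
  --entrypoint /bin/promtool \
  prom/prometheus check config /prometheus-config/prom-jmx.yml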

3) Start Prometheus:

docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v :/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml

4) Visit http://localhost:9090/alerts; you should see the alerting rules configured above:

If you do not see all three instances, wait a moment and try again.
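
The same information is also available from the Prometheus HTTP API, which is handy for scripting. A sketch (standard v1 endpoints on the port mapped above; the alerts endpoint may be missing on very old 2.x versions):

# Sketch: list scrape targets and currently active alerts via the HTTP API.
curl -s http://localhost:9090/api/v1/targets
curl -s http://localhost:9090/api/v1/alerts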

Step 3: Configure Grafana

See the earlier article on monitoring the JVM with Prometheus + Grafana.

Step 4: Start Alertmanager

1) Create a file named alertmanager-config.yml:

global:
  smtp_smarthost: ""
  smtp_from: ""
  smtp_auth_username: ""
  smtp_auth_password: ""

# The directory from which notification templates are read.
templates: 
- "/alertmanager-config/*.tmpl"

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["alertname", "instance"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least "group_wait" to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait "group_interval" to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait "repeat_interval" to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    severity: "red"
  target_match_re:
    severity: ^(blue|yellow|orange)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]
- source_match:
    severity: "orange"
  target_match_re:
    severity: ^(blue|yellow)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]
- source_match:
    severity: "yellow"
  target_match_re:
    severity: ^(blue)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ["alertname", "instance"]

receivers:
- name: "user-a"
  email_configs:
  - to: ""

Fill in the smtp_* settings and the email address of the user-a receiver at the bottom.

Note: because mail providers in China generally do not support TLS, while Alertmanager at the time did not support SSL, you had to use Gmail or another TLS-capable mailbox to send the alert emails (see this issue). That problem has since been fixed; below is an example configuration for Alibaba Cloud enterprise mail:

smtp_smarthost: "smtp.qiye.aliyun.com:465"
smtp_hello: "company.com"
smtp_from: "username@company.com"
smtp_auth_username: "username@company.com"
smtp_auth_password: password
smtp_require_tls: false

2) Create a file named alert-template.tmpl; this is the email body template:

{{ define "email.default.html" }}

Summary

{{ .CommonAnnotations.summary }}

Description

{{ .CommonAnnotations.description }}

{{ end}}
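
Before starting Alertmanager, the configuration and the templates it references can be checked with amtool. A sketch, assuming you run it from the prom-jvm-demo directory and use the amtool binary bundled in the prom/alertmanager image:

# Sketch: validate alertmanager-config.yml (and the templates it loads) with amtool.
# Assumes the current directory contains alertmanager-config.yml and alert-template.tmpl.
docker run --rm \
  -v "$(pwd)":/alertmanager-config \
  --entrypoint /bin/amtool \
  prom/alertmanager:master check-config /alertmanager-config/alertmanager-config.yml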

3) Start Alertmanager with the following command:

docker run -d \
  --name=alertmanager \
  -v :/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager:master --config.file=/alertmanager-config/alertmanager-config.yml

4) Visit http://localhost:9093 to see whether the alerts sent by Prometheus have arrived (if not, wait a moment):
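
If you would rather not wait for a real alert, you can push a synthetic one straight to Alertmanager to exercise the notification path. This is only a sketch: the alert below is made up, and whether the path is /api/v1/alerts or /api/v2/alerts depends on your Alertmanager version:

# Sketch: fire a hypothetical test alert by hand (API path depends on the Alertmanager version).
curl -s -XPOST -H 'Content-Type: application/json' http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": { "alertname": "manual-test", "instance": "manual", "severity": "yellow" },
    "annotations": { "summary": "manual test alert", "description": "sent by hand to exercise the notification path" }
  }
]'

After group_wait (30s in the configuration above) a notification for the user-a receiver should go out.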

Step 5: Wait for the email

Wait a while (up to 5 minutes) and check whether the email arrives. If it does not, verify the configuration, or run docker logs alertmanager to inspect the Alertmanager logs; the cause is usually a misconfigured mailbox.
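
A simple end-to-end test of the instance-down rule, using the containers started earlier: stop one Tomcat, wait for the next scrape plus the 30-second for duration, and the yellow alert should fire and eventually arrive by email.

# Sketch: trigger the instance-down alert, then restore the instance.
docker stop tomcat-1
# ... wait for the alert and the email, then:
docker start tomcat-1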
