Configuring DingTalk Alerts with Prometheus Alertmanager

1. Deploy and configure the DingTalk alert plugin

To set up the DingTalk robot itself, see https://blog.csdn.net/knight_zhou/article/details/105583741

1.1 Deploy the prometheus-webhook-dingtalk alert plugin

  • Binary deployment

[admin@prometheus prometheus]$ wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz

  • Docker deployment

[admin@prometheus prometheus]$ docker pull timonwong/prometheus-webhook-dingtalk

# Start the container
[admin@prometheus prometheus]$ docker run -d -p 8060:8060 --name webhook timonwong/prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token={replace with your own DingTalk token}"

1.2 Configure prometheus-webhook-dingtalk

[admin@prometheus prometheus-webhook-dingtalk]$ cp config.example.yml config.yml
[admin@prometheus prometheus-webhook-dingtalk]$ cat config.yml
## Request timeout
# timeout: 5s

templates:
  - contrib/templates/*.tmpl

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=******
    mention:
      all: true
    message:
      #title: '{{ template "legacy.title" . }}'
      #text: '{{ template "dingding_alert.html" . }}'
      text: '{{ template "_ding.link.content" . }}'
    secret: SEC000000000000000000000
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
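The secret field above corresponds to the "sign" security option of a DingTalk robot. As far as I understand DingTalk's signing scheme, the sender signs "timestamp\nsecret" with HMAC-SHA256 (keyed with the secret), Base64-encodes the digest, URL-encodes it, and appends timestamp and sign to the webhook URL. The plugin takes care of this once secret is set; the sketch below only illustrates the idea. The function name sign_dingtalk_url and the sample values are my own, not part of the plugin.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse
from typing import Optional


def sign_dingtalk_url(webhook_url: str, secret: str,
                      timestamp_ms: Optional[int] = None) -> str:
    """Return webhook_url with timestamp and sign query parameters appended.

    Hypothetical helper: it mirrors DingTalk's signing scheme as I understand
    it and is not taken from prometheus-webhook-dingtalk.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # milliseconds since the epoch
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret.
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign,
                      hashlib.sha256).digest()
    # Base64-encode the raw digest, then URL-encode it for the query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{webhook_url}&timestamp={timestamp_ms}&sign={sign}"
```

Calling sign_dingtalk_url("https://oapi.dingtalk.com/robot/send?access_token=...", "SEC...") returns the URL a signed request would be sent to. The timestamp is part of the signature, so the URL has to be rebuilt for every request.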

1.3 Configure the alert template

[admin@prometheus prometheus-webhook-dingtalk]$ cat contrib/templates/dingding.tmpl

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}

{{ define "___text_alert_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}
**Severity:** {{ .Labels.severity | upper }}
**Triggered at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**Details:** {{ range .Annotations.SortedPairs }} {{ .Value | markdown | html }}
{{ end }}

**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "___text_alertresovle_list" }}{{ range . }}
---
**Alert:** {{ .Labels.alertname | upper }}
**Severity:** {{ .Labels.severity | upper }}
**Triggered at:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**Resolved at:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
**Details:** {{ range .Annotations.SortedPairs }} {{ .Value | markdown | html }}
{{ end }}

**Labels:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{/* Default */}}
{{ define "_default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "_default.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

![Alert icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/ERROR.jpg)
**======== Alert firing ========**
{{ template "___text_alert_list" .Alerts.Firing }}
{{- end }}

{{ if gt (len .Alerts.Resolved) 0 -}}
![Resolved icon](https://duojia-lemei.oss-cn-beijing.aliyuncs.com/OK.jpg)
**======== Alert resolved ========**
{{ template "___text_alertresovle_list" .Alerts.Resolved }}


{{- end }}
{{- end }}

{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }} [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}

{{/* Following names for compatibility */}}
{{ define "_ding.link.title" }}{{ template "_default.title" . }}{{ end }}
{{ define "_ding.link.content" }}{{ template "_default.content" . }}{{ end }}

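Problem encountered: at first, every message sent to DingTalk was rendered as a single line, which was very hard to read.
The fix is to leave four trailing spaces at the end of each line of the alert message (in Markdown, trailing spaces before a newline force a line break), for example: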

**告警主题:** {{ .Labels.alertname | upper }}    
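Trailing spaces are easy to lose when a template is copied around or passes through an editor that trims whitespace. Below is a minimal sketch of a helper that puts them back on the bold field lines of a template. The name add_hard_breaks and the rule "only lines starting with **" are my own assumptions for this template; adjust them for a different layout.

```python
def add_hard_breaks(template_text: str, spaces: int = 4) -> str:
    """Append trailing spaces to every line that starts with '**'.

    Hypothetical helper: lines such as '{{ end }}' and blank lines are left
    untouched, so only the visible field lines get a Markdown hard break.
    """
    fixed_lines = []
    for line in template_text.split("\n"):
        stripped = line.rstrip()
        if stripped.startswith("**"):
            # Normalise to exactly `spaces` trailing spaces.
            fixed_lines.append(stripped + " " * spaces)
        else:
            fixed_lines.append(line)
    return "\n".join(fixed_lines)
```

Running the template file through this function before starting the service gives each field its own line in the rendered DingTalk message.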

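1.4 Start the service

[admin@prometheus prometheus-webhook-dingtalk]$ ./prometheus-webhook-dingtalk --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token={replace with your own DingTalk token}"

Alternatively, start the service with supervisor; this variant loads the config.yml from section 1.2 through --config.file: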

[program:prometheus-webhook-dingtalk]
command=/home/admin/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=config.yml --web.listen-address=:8060
directory=/home/admin/prometheus-webhook-dingtalk
redirect_stderr=true
autorestart=true
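Once the plugin is running, it can be tested on its own, before Prometheus and Alertmanager are wired in, by posting a hand-built payload in the Alertmanager webhook format to the webhook1 target. The sketch below assumes the plugin listens on 127.0.0.1:8060; the alert name, labels, and helper names are made up for the test.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_webhook_payload(alertname: str, severity: str, message: str) -> dict:
    """Build a minimal payload shaped like an Alertmanager webhook notification."""
    labels = {"alertname": alertname, "severity": severity}
    annotations = {"value": message}
    return {
        "version": "4",
        "groupKey": '{}:{alertname="%s"}' % alertname,
        "status": "firing",
        "receiver": "dingding_ops",
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "http://127.0.0.1:9093",  # assumed Alertmanager address
        "alerts": [{
            "status": "firing",
            "labels": labels,
            "annotations": annotations,
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "http://127.0.0.1:9090/graph",  # assumed Prometheus address
        }],
    }


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(url: str, payload: dict, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as resp:
        return resp.status
```

For example, send("http://127.0.0.1:8060/dingtalk/webhook1/send", build_webhook_payload("TestAlert", "warning", "manual test")) should make a message appear in the DingTalk group if the token and template are correct.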

2. Configure Prometheus alerting rules

groups:
- name: HostStatsAlert
  rules:
  - alert: "Host connectivity alert"
    expr: probe_success{type="icmp"} == 0
    for: 3m
    labels:
      severity: fatal,warning
    annotations:
      value: "Instance network unreachable {{ $value }}"
  - alert: "CPU alert"
    expr: round(sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (cluster,job,env,export,instance) * 100,5) > 85
    for: 20m
    labels:
      severity: warning
      level: "3"
    annotations:
      value: "Instance CPU usage {{ $value }}%"
  - alert: "Memory alert"
    expr: round(((node_memory_MemTotal_bytes{job!="dev3"} - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{endpoint!="https"}) ) * 100 ,0.5) > 90
    for: 10m
    labels:
      severity: warning
    annotations:
      value: "Instance memory usage {{ $value }}%"
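The second argument of round() in the CPU and memory expressions is easy to misread: PromQL rounds to the nearest multiple of that value, with ties rounded up. The sketch below replays that arithmetic in Python so the thresholds can be checked by hand. The function names and sample byte counts are illustrative only.

```python
import math


def promql_round(value: float, to_nearest: float = 1.0) -> float:
    """Round to the nearest multiple of to_nearest, ties rounding up (like PromQL round())."""
    return math.floor(value / to_nearest + 0.5) * to_nearest


def memory_used_percent(total: int, free: int, buffers: int, cached: int) -> float:
    """Same formula as the memory rule: (total - free - buffers - cached) / total * 100."""
    return promql_round((total - free - buffers - cached) / total * 100, 0.5)


# A 16 GB host with 1 GB free, 0.2 GB buffers and 0.3 GB cached:
usage = memory_used_percent(16_000_000_000, 1_000_000_000, 200_000_000, 300_000_000)
print(usage, usage > 90)

# CPU rule: values are rounded to a multiple of 5 before being compared with 85.
print(promql_round(87.3, 5) > 85)  # 87.3 rounds down to 85, so the rule does not match
print(promql_round(87.5, 5) > 85)  # 87.5 rounds up to 90, so the rule matches
```

One consequence worth noting: because the CPU value is rounded to a multiple of 5 first, "> 85" in practice only matches once the raw usage reaches 87.5%.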

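Reload the Prometheus configuration (the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle, as in the supervisor configuration below)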

[admin@prometheus prometheus]$ curl -X POST http://127.0.0.1:9090/-/reload
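The same reload can be triggered from a deployment script. A minimal sketch with the standard library, assuming Prometheus listens on 127.0.0.1:9090 as in the curl call above; the function names are my own.

```python
import urllib.request


def build_reload_request(base_url: str = "http://127.0.0.1:9090") -> urllib.request.Request:
    """Build the POST request for the Prometheus /-/reload endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/-/reload",
        data=b"",          # empty body; the endpoint only needs the POST itself
        method="POST",
    )


def reload_prometheus(base_url: str = "http://127.0.0.1:9090", timeout: float = 5.0) -> int:
    """Trigger a reload and return the HTTP status code (urlopen raises on HTTP errors)."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

If the rule file contains a syntax error, the reload request fails with an HTTP error, so a script can stop there instead of silently keeping the old rules.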

supervisor startup configuration

[program:prometheus]
command=/home/admin/prometheus/prometheus --config.file=/home/admin/prometheus/prometheus.yml --storage.tsdb.path=/home/admin/prometheus/data --storage.tsdb.retention=30d --log.level=info --web.enable-lifecycle --web.enable-admin-api
directory=/home/admin/prometheus
redirect_stderr=true
autorestart=true

3. Configure Alertmanager

global:
  resolve_timeout: 5m
  # Email
  smtp_smarthost: 'smtp.exmail.qq.com:465'
  smtp_from: '***'
  smtp_auth_username: '****'
  smtp_auth_password: '****' # mailbox authorization code or login password
  smtp_require_tls: false

# Template definitions
templates:
- '/home/admin/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'dingding_ops'
  routes:
  # - receiver: 'dingding_group'
  #   group_wait: 10s
  #   group_interval: 10s
  #   repeat_interval: 3m
  #   match_re:
  #     severity: 'error'

receivers:
- name: 'dingding_ops'
  webhook_configs:
  - url: 'http://192.168.10.73:8060/dingtalk/webhook1/send'
    send_resolved: true # send a notification when the alert is resolved
    #message: '{{ template "dingding_alert.html" . }}'

# Alert inhibition
inhibit_rules:
- source_match:
    alertname: 'critical'
  target_match:
    alertname: 'warning'
  equal: ['job']
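How an inhibit rule decides what to mute is not obvious from the YAML alone. The sketch below is a simplified model of the logic, not Alertmanager's actual code: a target alert is muted when some other active alert matches source_match and both carry the same value for every label in equal. Note that the rule as written matches on alertname, so it only takes effect for alerts literally named critical and warning; matching on the severity label is the more common setup.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]


def matches(labels: Labels, matchers: Labels) -> bool:
    """True if every matcher label is present with exactly the given value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target: Labels, active_alerts: Iterable[Labels],
                 source_match: Labels, target_match: Labels,
                 equal: List[str]) -> bool:
    """Simplified model of a single inhibit rule."""
    if not matches(target, target_match):
        return False
    for source in active_alerts:
        if source is target or not matches(source, source_match):
            continue
        # A missing label is treated like an empty value on both sides.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


rule = dict(source_match={"alertname": "critical"},
            target_match={"alertname": "warning"},
            equal=["job"])
firing = [{"alertname": "critical", "job": "node"}]
print(is_inhibited({"alertname": "warning", "job": "node"}, firing, **rule))
print(is_inhibited({"alertname": "warning", "job": "mysql"}, firing, **rule))
```

The first call reports the warning as muted because a critical alert is firing for the same job; the second is not muted because the job label differs.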

Restart the Alertmanager service

supervisor startup configuration

[program:alertmanager]
command=/home/admin/alertmanager/alertmanager --config.file=/home/admin/alertmanager/alertmanager.yml --storage.path=/home/admin/alertmanager/data/
directory=/home/admin/alertmanager
redirect_stderr=true
autorestart=true
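To check the whole chain (Alertmanager, the webhook plugin, DingTalk) without waiting for a real rule to fire, a test alert can be pushed straight into Alertmanager through its v2 HTTP API. This is a minimal sketch, assuming Alertmanager listens on its default port 9093 on the local host; the alert name, the job label, and the helper names are made up for the test.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Dict, List


def build_test_alerts(alertname: str, severity: str, message: str) -> List[Dict]:
    """Build the JSON body for POST /api/v2/alerts: a list of alert objects."""
    return [{
        "labels": {"alertname": alertname, "severity": severity, "job": "manual-test"},
        "annotations": {"value": message},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def build_alert_request(alerts: List[Dict],
                        base_url: str = "http://127.0.0.1:9093") -> urllib.request.Request:
    """Wrap the alerts in a JSON POST request to the Alertmanager v2 API."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push_test_alert(base_url: str = "http://127.0.0.1:9093", timeout: float = 5.0) -> int:
    """Send one test alert and return the HTTP status code."""
    alerts = build_test_alerts("TestAlert", "warning", "manual test from script")
    with urllib.request.urlopen(build_alert_request(alerts, base_url), timeout=timeout) as resp:
        return resp.status
```

With the route settings above (group_wait: 10s), the DingTalk message should arrive roughly ten seconds after push_test_alert() returns.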

The resulting alert looks like this:

[Image 2]
