Alertmanager Alert Templates

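Monitoring and alerting are topics no operations engineer can avoid. A solid monitoring and alerting system frees the operations team from constant support work so that they can spend time on more meaningful problems.

This post does not cover how to set up monitoring. It only records some of the alerting features I have implemented in my own operations work, for reference.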

1. Alert templates

1.1 Email template

{{ define "mail_business.start.html" }}
<table border="1">
<tr bgcolor="#ED3E3C">
<td align="center" valign="middle">Alert</td>
<td align="center" valign="middle">Cluster</td>
<td align="center" valign="middle">Product line</td>
<td align="center" valign="middle">Interface</td>
<td align="center" valign="middle">Summary</td>
<td align="center" valign="middle">Start time</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr bgcolor="#FFFF6F">
<td align="left" valign="middle">{{ index $alert.Labels "alert_name" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "cluster" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "business_line" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "interface" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "desc"}}</td>
<td align="left" valign="middle">{{ .StartsAt.Format "2006-01-02 15:04:05" }}</td>
</tr>
{{ end }}
</table>
<br />
<br />
For more details, see the <a href="http://prometheus.feiersmart.com/alerts" target="_blank">prometheus</a> page.
<br />
{{ end }}


{{ define "mail_business.restore.html" }}
<table border="1">
<tr bgcolor="#ED3E3C">
<td align="center" valign="middle">Alert</td>
<td align="center" valign="middle">Cluster</td>
<td align="center" valign="middle">Product line</td>
<td align="center" valign="middle">Interface</td>
<td align="center" valign="middle">Summary</td>
<td align="center" valign="middle">Recovery time</td>
</tr>
{{ range $i, $alert := .Alerts }}
<tr bgcolor="#FFFF6F">
<td align="left" valign="middle">{{ index $alert.Labels "alert_name" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "cluster" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "business_line" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "interface" }}</td>
<td align="left" valign="middle">{{ index $alert.Labels "desc"}}</td>
<td align="left" valign="middle">{{ .EndsAt.Format "2006-01-02 15:04:05" }}</td>
</tr>
{{ end }}
</table>
<br />
<br />
For more details, see the <a href="http://prometheus.feiersmart.com/alerts" target="_blank">prometheus</a> page.
<br />
{{ end }}

{{ define "mail_business.html" }}
{{ if eq .Status "firing"}}{{ template "mail_business.start.html" . }}
{{ end }}
{{ if eq .Status "resolved" }}{{ template "mail_business.restore.html" . }}
{{ end }}
{{ end }}

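Firing and resolved notifications use separate templates here, so each can carry different content as needed. In my case the only difference is start time versus recovery time.

Email templates support HTML, so colors and styles can be customized.

Figure 1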
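
The layout string "2006-01-02 15:04:05" used in the templates above is Go's reference-time notation, not a literal date; in strftime terms it corresponds to "%Y-%m-%d %H:%M:%S". The Python sketch below is not part of the setup described here. It only mimics what the template does with a hand-written sample of the data Alertmanager passes in, and the label values and the render_rows helper are invented for illustration.

```python
from datetime import datetime, timezone
from html import escape

# Equivalent of Go's "2006-01-02 15:04:05" layout in strftime notation.
GO_LAYOUT_AS_STRFTIME = "%Y-%m-%d %H:%M:%S"

# Hand-written sample shaped like Alertmanager's template data
# (.Status, .Alerts[].Labels, .StartsAt, .EndsAt); all values are invented.
sample = {
    "status": "firing",
    "alerts": [
        {
            "labels": {
                "alert_name": "order_failure_rate",
                "cluster": "prod",
                "business_line": "mall",
                "interface": "/api/order",
                "desc": "failure rate above 5%",
            },
            "startsAt": datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
            "endsAt": datetime(2020, 1, 2, 3, 30, 0, tzinfo=timezone.utc),
        }
    ],
}

COLUMNS = ["alert_name", "cluster", "business_line", "interface", "desc"]


def render_rows(data):
    """Return one HTML table row per alert, like the range loop in the template."""
    # Firing notifications show the start time, resolved ones the end time.
    time_key = "startsAt" if data["status"] == "firing" else "endsAt"
    rows = []
    for alert in data["alerts"]:
        # A missing label renders as an empty cell, as index does on a Go map.
        cells = [escape(alert["labels"].get(name, "")) for name in COLUMNS]
        cells.append(alert[time_key].strftime(GO_LAYOUT_AS_STRFTIME))
        rows.append("<tr>" + "".join(f"<td>{c}</td>" for c in cells) + "</tr>")
    return rows


if __name__ == "__main__":
    print("\n".join(render_rows(sample)))
```

Changing the status in the sample to "resolved" switches the last column to the end time, which is the only difference between the two email templates.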

1.2 WeChat template

{{ define "wechat_business.start.html" }}
{{ range $i, $alert := .Alerts }}
============Start============
[Status]: {{ .Status }}
[Alert]: {{ index $alert.Labels "alert_name" }}
[Cluster]: {{ index $alert.Labels "cluster" }}
[Interface]: {{ index $alert.Labels "interface" }}
[Summary]: {{ index $alert.Annotations "value" }}
[Start time]: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
============End=============
{{ end }}
{{ end }}

{{ define "wechat_business.restore.html" }}
{{ range $i, $alert := .Alerts }}
============Start============
[Status]: {{ .Status }}
[Alert]: {{ index $alert.Labels "alert_name" }}
[Cluster]: {{ index $alert.Labels "cluster" }}
[Interface]: {{ index $alert.Labels "interface" }}
[Summary]: {{ index $alert.Annotations "value" }}
[Recovery time]: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
============End=============
{{ end }}
{{ end }}

{{ define "wechat_business.html" }}
{{ if eq .Status "firing"}}{{ template "wechat_business.start.html" . }}
{{ end }}
{{ if eq .Status "resolved" }}{{ template "wechat_business.restore.html" . }}
{{ end }}
{{ end }}

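As far as I can tell, WeChat templates can only be rendered as plain text, so they lack the colors and styling available in email (not that my email alerts are particularly fancy either).

I do not yet have a WeChat alert rendered from this exact template. The screenshot below comes from a different template, but the result looks the same.

Figure 2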

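1.3 SMS template

Applying for an SMS template

SMS alerts go through the Alibaba Cloud SMS service. First apply for an SMS signature in the service, then use that signature to apply for an SMS template. The approved template is shown below.

Figure 3

Configuring the SMS alert service

I use the PrometheusAlert service to send SMS alerts. GitHub: https://github.com/feiyu563/PrometheusAlert

Configuring the Alibaba Cloud SMS API in PrometheusAlert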

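vim PrometheusAlert/conf/app.conf
#---------------------↓ Alibaba Cloud SMS API -----------------------
# Enable the Alibaba Cloud SMS alert channel (several channels can be enabled at once): 0 = off, 1 = on
open-alydx=1
# AccessKey ID of the main Alibaba Cloud account used for SMS
ALY_DX_AccessKeyId=***********
# Secret key for the Alibaba Cloud SMS API
ALY_DX_AccessSecret=*************
# Alibaba Cloud SMS signature name
ALY_DX_SignName=******
# Alibaba Cloud SMS template ID
ALY_DX_Template=*********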

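If you have the means, you can also write your own SMS gateway.

SMS alert template

Calling this an SMS alert template is a bit of a stretch, because strictly speaking SMS alerts have no template in my environment. The alerting rule simply carries a couple of extra fields compared with the other rules, as shown below: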

prometheus/alerts/host_alert.rules

- alert: "宕机告警"  # instance-down alert
  expr: up{cluster!="kubernetes-dev",env!="es"} == 0
  for: 5m
  labels:
    severity: fatal,warning
    level: 3
  annotations:
    value: "Instance has been down for more than 5 minutes {{ $value }}"
    description: " Cluster: {{ $labels.cluster }}\n Server: {{ $labels.job }}\n Summary: instance has been down for more than 5 minutes {{ $value }}"
- alert: "服务器重启告警"  # server-reboot alert
  expr: time() - node_boot_time_seconds{} < 600
  for: 35s
  labels:
    severity: warning
  annotations:
    value: "Server {{ $labels.instance }} rebooted within the last 10 minutes"

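Compared with the last rule, the first one has an extra level entry under labels and an extra description entry under annotations. level defines how severe the fault is, and description holds the text that should be shown in the SMS.

Figure 4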
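
To make the role of level and description more concrete, here is a minimal Python sketch of how a webhook receiver could pick them out of the JSON that Alertmanager posts. It is not PrometheusAlert's actual code: the build_sms_texts helper, the MIN_SMS_LEVEL threshold, and the sample payload are assumptions made for illustration.

```python
import json

# Hypothetical threshold: only alerts with level >= 3 are sent by SMS.
MIN_SMS_LEVEL = 3


def build_sms_texts(payload, min_level=MIN_SMS_LEVEL):
    """Return one SMS body per firing alert that carries a high enough level."""
    texts = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Label values arrive as strings; rules without a level are skipped.
        try:
            level = int(labels.get("level", ""))
        except ValueError:
            continue
        if level < min_level:
            continue
        # Prefer the SMS-specific description, fall back to the short value.
        body = annotations.get("description") or annotations.get("value", "")
        texts.append(f"[{labels.get('alertname', 'unknown')}] {body.strip()}")
    return texts


# Trimmed-down example of the JSON Alertmanager posts to a webhook receiver.
example = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "InstanceDown", "level": "3", "cluster": "prod"},
     "annotations": {"description": " cluster: prod\\n server: node"}},
    {"status": "firing",
     "labels": {"alertname": "Rebooted"},
     "annotations": {"value": "rebooted within the last 10 minutes"}}
  ]
}
""")

if __name__ == "__main__":
    for text in build_sms_texts(example):
        print(text)
```

In this sketch the second alert is dropped because its rule defines no level label, which mirrors the difference between the two rules above.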

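1.4 Voice-call alerts

Voice-call alerts can also be built on the outbound-call features of the major cloud platforms. I did not use a paid service. Instead I put together a rough calling setup from one computer, one mobile phone, and one SIM card. All it can do is wake you up in the middle of the night; it cannot read out the alert content yet.

I did not use Alibaba Cloud's outbound-call service because the application process is very cumbersome. Another option is one of the ready-made outbound-call offerings in the Alibaba Cloud Marketplace, which are billed per call and quite cheap; I may switch to one later.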
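
On the Alertmanager side, a home-made caller like this only needs an HTTP endpoint that accepts the webhook POST. The Python sketch below shows one possible shape for such an endpoint; it is an illustration, not the service described here. The place_call stub stands in for whatever actually dials the phone, and the path and port are taken from the receiver URL in the alertmanager.yml shown later.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def should_call(payload):
    """Call only when at least one alert in the notification is still firing."""
    return any(a.get("status") == "firing" for a in payload.get("alerts", []))


def place_call():
    """Placeholder for whatever actually dials the phone (hypothetical)."""
    print("dialing the on-call phone ...")


class OutboundPhoneHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/outbound_phone":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if should_call(payload):
            place_call()
        # A 2xx response tells Alertmanager the notification was delivered.
        self.send_response(200)
        self.end_headers()


def serve(host="0.0.0.0", port=8803):
    """Start the listener; blocks until the process is interrupted."""
    HTTPServer((host, port), OutboundPhoneHandler).serve_forever()
```

Calling serve() starts the listener. Resolved notifications are acknowledged but do not trigger a call, since nobody needs to be woken up for a recovery.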

2. Using the alert templates

Prometheus first evaluates the PromQL alerting rules. When a rule fires, Prometheus sends the alert to Alertmanager.

vim prometheus/prometheus.yml

# my global config
global:
  scrape_interval: 15s # Scrape targets every 15 seconds. The default is every 1 minute.
  scrape_timeout: 15s # Per-scrape timeout. The default is 10s.
  evaluation_interval: 20s # Evaluate rules every 20 seconds. The default is every 1 minute.

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.5.248:9093']
    #- alertmanager: ['192.168.5.248:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/work/admin/prometheus/alerts/*.rules"
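
Once Prometheus points at Alertmanager, the notification path can be checked without waiting for a real rule to fire, by posting a synthetic alert to Alertmanager's v2 API. The Python sketch below assumes the Alertmanager address from the configuration above; the alert name PipelineTest and the two helper functions are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://192.168.5.248:9093"  # address from prometheus.yml


def build_test_alert(name="PipelineTest", severity="warning", minutes=5):
    """Build the JSON body for POST /api/v2/alerts: a list with one alert."""
    now = datetime.now(timezone.utc).replace(microsecond=0)
    return [
        {
            "labels": {"alertname": name, "severity": severity},
            "annotations": {"value": "synthetic alert for testing the pipeline"},
            "startsAt": now.isoformat(),
            # Alertmanager treats the alert as resolved once endsAt has passed.
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def send_alerts(alerts, base_url=ALERTMANAGER_URL):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, send_alerts(build_test_alert(severity="business")) should return a 2xx status if Alertmanager accepted the alert, and the labels decide which route and template the test notification goes through.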

Once Alertmanager receives the alert, it performs grouping, inhibition, and routing.
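
Routing is the step that most often surprises. Child routes are checked from top to bottom, the first one whose matchers all match receives the alert (unless an earlier route sets 'continue: true'), and match_re patterns are anchored, so they must match the whole label value. The Python sketch below models only that logic with a few routes from the alertmanager.yml that follows; the alertname pattern of the first route is shortened, and pick_receiver is a hypothetical helper, not Alertmanager code.

```python
import re

# A few child routes copied from the alertmanager.yml in this article,
# in the same order; the alertname regex of the first route is shortened.
ROUTES = [
    ("email_wechat", {"alertname": "宕机告警|主机连通性告警|HTTP404告警"}),
    ("email_http", {"alertname": "HTTP告警"}),
    ("email_http_url", {"alertname": "HTTP404告警"}),
    ("k8s_email", {"severity": "warn"}),
    ("business", {"severity": "business"}),
]
DEFAULT_RECEIVER = "email"


def pick_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first child route whose matchers all match."""
    for receiver, match_re in routes:
        # match_re patterns are anchored, so the whole label value must match.
        if all(
            re.fullmatch(pattern, labels.get(name, ""))
            for name, pattern in match_re.items()
        ):
            return receiver
    # No child route matched: the alert stays on the top-level route.
    return default


if __name__ == "__main__":
    print(pick_receiver({"alertname": "HTTP404告警"}))
    print(pick_receiver({"alertname": "x", "severity": "warning"}))
```

Two consequences are visible in this model. An alert named HTTP404告警 lands on email_wechat because that name also appears in the first route's pattern, so the later email_http_url route is not reached. An alert with severity warning does not match the pattern warn and falls back to the default receiver.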

vim alertmanager-0.18.0/alertmanager.yml

# Global settings
global:
  resolve_timeout: 5m
  # Email
  smtp_smarthost: 'smtp.mxhichina.com:465'
  smtp_from: 'alert@alert.com'
  smtp_auth_username: 'alert@alert.com'
  smtp_auth_password: '******' # mailbox authorization code or login password
  smtp_require_tls: false
  # WeChat
  #wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid='
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'

# Template files
templates:
  - '/work/admin/alertmanager-0.18.0/template/*.tmpl'

# Routing: which receiver gets which alerts
route:
  group_by: ['alertname'] # labels used to group alerts
  group_wait: 50s # wait before sending the first notification for a new group
  group_interval: 5m # wait before notifying about new alerts in an existing group
  repeat_interval: 30m # wait before repeating a notification that is still firing
  receiver: 'email' # default receiver
  routes:
    - receiver: 'email_wechat'
      group_wait: 50s
      group_interval: 5m
      repeat_interval: 30m
      match_re:
        alertname: '宕机告警|主机连通性告警|后处理vip告警|负载告警|服务器网络速率告警|服务器TCP连接数告警|logstash内存告警|Inode告警|磁盘IO告警|ES集群可用性|Redis内存告警|Haproxy错误告警|DNS解析告警|ES JVM告警|MySQL QPS告警|内存告警|HTTP404告警|Nginx后端响应延时告警|Haproxy后端响应延时告警|Haproxy后端请求数告警|Nginx实时请求数告警|Nginx实时请求时间告警|MySQL 连接数告警|Kafka集群告警|ES GC告警|ES GC告警'
        #severity: 'fatal'
    - receiver: 'email_http'
      group_wait: 45s
      match_re:
        alertname: 'HTTP告警'
    - receiver: 'email_http_url'
      group_wait: 45s
      match_re:
        alertname: 'HTTP404告警'
    - receiver: 'email_tcp'
      match_re:
        alertname: 'TCP告警'
    - receiver: 'k8s_email'
      match_re:
        severity: 'warn'
    - receiver: 'k8s_email_wechat'
      match_re:
        severity: 'critical'
    - receiver: 'grafana_wechat'
      group_wait: 50s
      group_interval: 10s
      repeat_interval: 30m
      match_re:
        #metric: 'Count'
        alertname: '酷狗错误码监控 alert|科勒设备token失效监控 alert|科勒设备token失效监控 alert|定时刷新科勒设备token监控 alert'
    - receiver: 'business'
      group_wait: 50s
      group_interval: 5m
      repeat_interval: 30m
      match_re:
        severity: 'business'
    - receiver: 'business_zdmall_order'
      group_wait: 50s
      group_interval: 5m
      repeat_interval: 30m
      match_re:
        severity: 'business_zdmall_order'
    - receiver: 'business_zdmall_order'
      group_wait: 50s
      group_interval: 5m
      repeat_interval: 1h
      match_re:
        severity: 'business_zdmall_order_nighttime'
    - receiver: 'outbound_phone'
      group_wait: 30s
      group_interval: 10m
      repeat_interval: 30m
      match_re:
        severity: 'outbound_phone'

receivers:
  - name: 'email'
    webhook_configs:
      - url: 'http://192.168.5.248:8080/prometheus/alert'
    email_configs:
      - to: 'feier_ops@feiersmart.com,hanfei@feiersmart.com'
        html: '{{ template "mail.html" . }}'
        send_resolved: true
  - name: 'k8s_email'
    email_configs:
      - to: 'feier_ops@feiersmart.com,hanfei@feiersmart.com'
        html: '{{ template "k8s_mail.html" . }}'
        send_resolved: true

  - name: 'k8s_email_wechat'
    email_configs:
      - to: 'feier_ops@feiersmart.com,hanfei@feiersmart.com,caoyue@feiersmart.com'
        html: '{{ template "k8s_mail.html" . }}'
        send_resolved: true
        # html: '{{ template "k8s_mail_recover.html" . }}'
        #headers: { Subject: " {{ 第二路由匹配测试}}" }
    wechat_configs: # WeChat Work alert settings
      - send_resolved: true
        to_party: '10|9|11' # IDs of the receiving departments
        agent_id: '1000027' # (WeChat Work --> custom app --> AgentId)
        corp_id: '**********' # company info (My Company --> CorpId, at the bottom of the page)
        api_secret: '**********' # (WeChat Work --> custom app --> Secret)
        message: '{{ template "k8s_wechat.html" . }}' # message template to use

  - name: 'business'
    email_configs:
      - to: 'feier_ops@feiersmart.com,hanfei@feiersmart.com,caoyue@feiersmart.com'
        html: '{{ template "mail_business.html" . }}'
        send_resolved: true

  - name: 'business_zdmall_order'
    email_configs:
      - to: 'feier_ops@feiersmart.com,hanfei@feiersmart.com,caoyue@feiersmart.com'
        html: '{{ template "mail_business_order.html" . }}'
        send_resolved: true
    webhook_configs:
      - url: 'http://192.168.5.248:8080/prometheus/router?phone=18650032533,15209884564,15902877852'
    wechat_configs: # WeChat Work alert settings
      - send_resolved: true
        to_party: '10|9|11' # IDs of the receiving departments
        agent_id: '1000027' # (WeChat Work --> custom app --> AgentId)
        corp_id: '**********' # company info (My Company --> CorpId, at the bottom of the page)
        api_secret: '**********' # (WeChat Work --> custom app --> Secret)
        message: '{{ template "wechat_business_order.html" . }}' # message template to use

  # Home-made voice-call endpoint
  - name: 'outbound_phone'
    webhook_configs:
      - url: 'http://10.8.0.29:8803/outbound_phone'


# Alert inhibition
inhibit_rules:
  - source_match:
      alertname: '主机连通性告警'
      #severity: 'fatal'
    target_match:
      alertname: '宕机告警'
      #severity: 'warning'
    equal: ['job']
  - source_match:
      alertname: '主机连通性告警'
    target_match:
      alertname: 'TCP告警'
    equal: ['job']
  - source_match:
      alertname: '主机连通性告警'
    target_match:
      alertname: 'HTTP告警'
    equal: ['job']
  - source_match:
      alertname: '主机连通性告警'
    target_match:
      alertname: 'API请求延迟'
    equal: ['job']
  - source_match:
      alertname: '宕机告警'
    target_match:
      alertname: 'TCP告警'
    equal: ['job']
  - source_match:
      type: 'alerts'
    target_match:
      type: 'call_time,request_time'
    equal: ['interface']

  - source_match:
      alertname: 'KubeDeploymentReplicasMismatch'
    target_match:
      alertname: 'KubePodCrashLooping'
    equal: ['container','deployment']
  - source_match:
      alertname: 'KubeDeploymentReplicasMismatch'
    target_match:
      alertname: 'KubeContainerWaiting'
    equal: ['container']
  - source_match:
      alertname: 'KubePodCrashLooping'
    target_match:
      alertname: 'KubeContainerWaiting'
    equal: ['container']

  - source_match:
      alertname: 'KubeDeploymentReplicasMismatch'
    target_match:
      alertname: 'KubePodPending'
    # Note: equal expects label names; 'critical' and 'warn' look like severity values.
    equal: ['critical','warn']
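
The inhibition rules above all follow one pattern: a target alert is muted while some firing alert matches source_match, the target matches target_match, and both carry the same values for every label listed under equal. The Python sketch below models that check with two of the rules; matches and is_inhibited are hypothetical helpers written for illustration, not Alertmanager's implementation.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the given value."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rules):
    """Check whether any firing source alert mutes the target alert."""
    for rule in rules:
        if not matches(target, rule["target_match"]):
            continue
        for source in firing_alerts:
            if source is target or not matches(source, rule["source_match"]):
                continue
            # A missing label counts as an empty value on both sides.
            if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
                return True
    return False


# Two rules taken from the inhibit_rules section above.
RULES = [
    {
        "source_match": {"alertname": "主机连通性告警"},
        "target_match": {"alertname": "宕机告警"},
        "equal": ["job"],
    },
    {
        "source_match": {"alertname": "宕机告警"},
        "target_match": {"alertname": "TCP告警"},
        "equal": ["job"],
    },
]

if __name__ == "__main__":
    host_down = {"alertname": "主机连通性告警", "job": "node-a"}
    instance_down = {"alertname": "宕机告警", "job": "node-a"}
    print(is_inhibited(instance_down, [host_down, instance_down], RULES))
```

With these rules, a host-connectivity alert for a job suppresses the instance-down alert for the same job, while an instance-down alert for a different job still goes out.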