Monitoring Service Processes with Prometheus process-exporter

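The commonly used node_exporter does not cover every monitoring item, so here we use process-exporter to monitor processes.

process-exporter is mainly used for process monitoring, for example the number of processes a service runs and how much CPU, memory, and other resources they consume.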

1. Using process-exporter

1.1 Download process-exporter

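process-exporter GitHub repository: https://github.com/ncabatoff/process-exporter
process-exporter download page

process-exporter can be started either with command-line arguments or with a configuration file.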

1.2 Configure process-exporter

vim /home/admin/process-exporter/process_name.yaml

process_names:
  # - name: "{{.Comm}}"
  #   cmdline:
  #   - '.+'

  - name: "{{.Matches}}"
    cmdline:
    - 'nginx'

  - name: "{{.Matches}}"
    cmdline:
    - '/opt/atlassian/confluence/bin/tomcat-juli.jar'

  - name: "{{.Matches}}"
    cmdline:
    - 'vsftpd'

  - name: "{{.Matches}}"
    cmdline:
    - 'redis-server'

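cmdline: a pattern that uniquely identifies the selected process; you can find it with ps -ef. If the process does not exist, no data is collected for it.

For example: > ps -ef | grep redis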

redis 4287 4127 0 Oct31 ? 00:58:12 redis-server *:6379

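The templates available for name, with the groupname each one produces for the process above:

Template       groupname                                 Description
{{.Comm}}      groupname="redis-server"                  name of the executable or sh file
{{.ExeBase}}   groupname="redis-server *:6379"           -
{{.ExeFull}}   groupname="/usr/bin/redis-server *:6379"  full process information as shown by ps
{{.Username}}  groupname="redis"                         groups processes by the user that owns them
{{.Matches}}   groupname="map[:redis]"                   shows the text matched by the configured keyword "redis"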
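
The way a cmdline pattern turns into a {{.Matches}} group name can be illustrated with a small Python emulation. This is only a sketch of the behaviour described above, not the exporter's own code: it assumes that every pattern must match the space-joined command line, that the whole match is stored under the empty key, and that the result is printed the way Go prints a map. The function names match_and_name and go_map_string are hypothetical, and Python's re module is not identical to Go's regexp syntax.

```python
import re


def go_map_string(mapping):
    """Render a dict the way Go's fmt package prints a map: map[k1:v1 k2:v2]."""
    return "map[" + " ".join(f"{k}:{v}" for k, v in sorted(mapping.items())) + "]"


def match_and_name(cmdline, patterns):
    """Return a {{.Matches}}-style group name, or None if any pattern fails.

    Assumption: the whole match goes under the empty key "" and named
    capture groups go under their own names.
    """
    captures = {}
    for pattern in patterns:
        found = re.search(pattern, cmdline)
        if found is None:
            return None  # process is not selected, so no data is collected
        captures[""] = found.group(0)
        captures.update({k: v for k, v in found.groupdict().items() if v is not None})
    return go_map_string(captures)


cmdline = "redis-server *:6379"
print(match_and_name(cmdline, ["redis"]))         # map[:redis]
print(match_and_name(cmdline, ["redis-server"]))  # map[:redis-server]
print(match_and_name(cmdline, ["nginx"]))         # None
```

With the configuration from section 1.2, the pattern 'redis-server' would therefore give a group name of map[:redis-server] under these assumptions, which is why the exit rule later filters on group names starting with "map".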

1.3 Write the startup script

vim /usr/lib/systemd/system/process_exporter.service

[Unit]
Description=Prometheus exporter for process metrics
Documentation=https://github.com/ncabatoff/process-exporter
After=network.target

[Service]
Type=simple
User=prometheus
WorkingDirectory=/home/admin/process-exporter
ExecStart=/home/admin/process-exporter/process-exporter -config.path=/home/admin/process-exporter/process_name.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target

1.4 Start process-exporter

systemctl daemon-reload
systemctl start process_exporter
systemctl enable process_exporter

Verify the monitoring data

curl http://localhost:9256/metrics
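
To check quickly that each configured group is being picked up, the metrics text can be reduced to a process count per group. The Python sketch below does this for namedprocess_namegroup_num_procs. The SAMPLE text and its values are made up for illustration and are not real exporter output, and the helper name procs_per_group is hypothetical.

```python
import re

# Hypothetical sample of what /metrics might return; the values are made up.
SAMPLE = '''\
# HELP namedprocess_namegroup_num_procs number of processes in this group
# TYPE namedprocess_namegroup_num_procs gauge
namedprocess_namegroup_num_procs{groupname="map[:nginx]"} 5
namedprocess_namegroup_num_procs{groupname="map[:redis-server]"} 1
namedprocess_namegroup_num_threads{groupname="map[:nginx]"} 5
'''

# Matches only the num_procs metric and pulls out the groupname label and value.
LINE = re.compile(
    r'^namedprocess_namegroup_num_procs\{[^}]*groupname="([^"]*)"[^}]*\}\s+(\S+)$'
)


def procs_per_group(metrics_text):
    """Map each groupname to its namedprocess_namegroup_num_procs value."""
    counts = {}
    for line in metrics_text.splitlines():
        found = LINE.match(line.strip())
        if found:
            counts[found.group(1)] = float(found.group(2))
    return counts


print(procs_per_group(SAMPLE))
# {'map[:nginx]': 5.0, 'map[:redis-server]': 1.0}
```

Against a running exporter, the same function could be fed the text fetched from http://localhost:9256/metrics, for example with urllib.request.urlopen. A configured group that is missing from the result, or that reports 0, is worth checking against ps -ef.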

2. Prometheus configuration

Add or modify the scrape configuration

  - job_name: 'doog_dev_prometheus'
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/metrics'

    static_configs:
    - targets: ['192.168.10.73:9090','192.168.10.73:9100']
      labels: {cluster: 'dev',type: 'basic',env: 'dev',job: 'prometheus',export: 'prometheus'}
    - targets: ['192.168.10.73:9256']
      labels: {cluster: 'dev',type: 'process',env: 'dev',job: 'prometheus',export: 'process_exporter'}

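Reload the Prometheus configuration (the /-/reload endpoint is only available when Prometheus is started with --web.enable-lifecycle)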

curl -X POST http://127.0.0.1:9090/-/reload

3. Grafana dashboard

The dashboard for process-exporter is: https://grafana.com/grafana/dashboards/249

It looks like this:

[Figure 1]

4. Common alerting rules

Number of processes

alert: Process alert
expr: sum(namedprocess_namegroup_states) by (cluster,job,instance) > 500
for: 20s
labels:
  severity: warning
annotations:
  value: The server currently has {{ $value }} processes, which exceeds the alert threshold
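
To make the aggregation in this rule concrete, the Python sketch below reproduces sum(...) by (cluster,job,instance) on a few hand-written series. The label values come from the scrape configuration in section 2, but the sample counts are made up, and the helpers sum_by and over_threshold are hypothetical names used only for illustration. Note that the total covers only processes that belong to a configured group, not every process on the server.

```python
from collections import defaultdict

# Labels attached by the scrape configuration (taken from section 2).
INSTANCE = {"cluster": "dev", "job": "prometheus", "instance": "192.168.10.73:9256"}

# Hypothetical namedprocess_namegroup_states series: (labels, value).
SERIES = [
    ({**INSTANCE, "groupname": "map[:nginx]", "state": "Sleeping"}, 5),
    ({**INSTANCE, "groupname": "map[:nginx]", "state": "Running"}, 1),
    ({**INSTANCE, "groupname": "map[:redis-server]", "state": "Sleeping"}, 1),
]


def sum_by(series, keys):
    """Sum values of all series that share the same values for `keys`."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(k, "") for k in keys)] += value
    return dict(totals)


def over_threshold(series, keys, threshold):
    """Keep only the aggregated groups whose total exceeds the threshold."""
    return {g: t for g, t in sum_by(series, keys).items() if t > threshold}


keys = ("cluster", "job", "instance")
print(sum_by(SERIES, keys))
# {('dev', 'prometheus', '192.168.10.73:9256'): 7.0}
print(over_threshold(SERIES, keys, 500))
# {}
```

Because groupname and state are not in the by clause, all groups and states on one instance collapse into a single total, and that total is what is compared with 500.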

Number of zombie processes

alert: Process alert
expr: sum by(cluster, job, instance, groupname) (namedprocess_namegroup_states{state="Zombie"}) > 0
for: 1m
labels:
  severity: warning
annotations:
  value: There are currently {{ $value }} zombie processes

Process restart

alert: Process restart alert
expr: ceil(time() - max by(cluster, job, instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
for: 25s
labels:
  label: alert_once
  severity: warning
annotations:
  value: Process {{ $labels.groupname }} restarted {{ $value }} seconds ago
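
The arithmetic behind the restart expression is simple enough to work through by hand. The Python sketch below mirrors ceil(time() - max(...)) < 60 for one group. The timestamps are made up and the function names are hypothetical.

```python
import math


def seconds_since_start(now, start_times):
    """Mirror ceil(time() - max(oldest_start_time)) for the series of one group."""
    return math.ceil(now - max(start_times))


def restarted_recently(now, start_times, threshold=60):
    """True while the group's oldest process is younger than `threshold` seconds."""
    return seconds_since_start(now, start_times) < threshold


now = 1_700_000_100.0  # made-up evaluation time (Unix seconds)

print(seconds_since_start(now, [1_700_000_070.5]))  # 30
print(restarted_recently(now, [1_700_000_070.5]))   # True
print(restarted_recently(now, [1_699_000_000.0]))   # False
```

Since the rule also sets for: 25s, the condition has to keep holding for 25 seconds before the alert fires, so the usable window after a restart is shorter than the full 60 seconds.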

Process exit

alert: Process exit alert
expr: up{export="process_exporter"} == 0 or max by(cluster, job, instance, groupname) (delta(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^map.*"}[10d])) < 0
for: 55s
labels:
  severity: warning
annotations:
  value: Process {{ $labels.export }} has exited