Deploying Prometheus

I. Prepare the environment

Hostname       IP               Role
client         10.100.40.185    agent
(not given)    10.100.40.171    agent
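II. Deploy Prometheus

1. Download

    # File name as given; take the full release URL from the download page below.
    wget prometheus-2.47.2.linux-amd64.tar.gz

Official download page: https://prometheus.io/download/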
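Before unpacking, it can be worth comparing the tarball with the SHA256 checksum shown next to it on the download page. The following is only a sketch: `verify_sha256` is a hypothetical helper name, not part of Prometheus, and the expected hash has to be copied from the download page by hand.

```shell
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HASH
# Hypothetical helper: returns 0 if FILE's SHA256 digest equals EXPECTED_HASH.
verify_sha256() {
    local file="$1" expected="$2" actual
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}
```

Usage would look like `verify_sha256 prometheus-2.47.2.linux-amd64.tar.gz <hash from the download page>`.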
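2. Create the directories

    mkdir -pv /data/prometheus/{data,soft,logs}

3. Unpack the tarball

    tar xzf prometheus-2.47.2.linux-amd64.tar.gz -C /data/prometheus/soft

Create a symlink so the path stays the same across versions:

    cd /data/prometheus/soft
    ln -sv prometheus-2.47.2.linux-amd64 prometheus

III. Start

Start Prometheus from the /data/prometheus/soft/prometheus directory:

    cd /data/prometheus/soft/prometheus
    nohup ./prometheus &

IV. Check

    netstat -lnpt    # confirm that port 9090 is listening

Then open http://IP:9090 in a browser.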
V. Configure the systemd unit

1. Write the unit file

    vim /etc/systemd/system/prometheus-server.service

    [Unit]
    Description=xinjizhiwa Prometheus Server
    Documentation=https://prometheus.io/docs/introduction/overview/
    After=network.target

    [Service]
    Restart=on-failure
    ExecStart=/data/prometheus/soft/prometheus/prometheus \
        --config.file=/data/prometheus/soft/prometheus/prometheus.yml \
        --web.enable-lifecycle
    ExecReload=/bin/kill -HUP $MAINPID
    LimitNOFILE=65535

    [Install]
    WantedBy=multi-user.target

When the file is typed in an editor, $MAINPID is written without a backslash. The \$MAINPID form is only needed when the unit is written through an unquoted shell heredoc.

2. Reload systemd and start the service

Stop the instance started with nohup first, otherwise port 9090 is already taken.

    systemctl daemon-reload
    systemctl enable --now prometheus-server.service

3. Reload Prometheus

    systemctl reload prometheus-server.service
    # if that has no effect, use the HTTP endpoint instead (IP of this host)
    curl -X POST http://IP:9090/-/reload
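A reload with a broken prometheus.yml is rejected, so it can help to validate the file before triggering one. The sketch below uses `promtool check config`; promtool ships in the same tarball as the prometheus binary and is assumed here to be on PATH. `safe_reload` and the default base URL are illustrative names and values, not something the setup above defines.

```shell
#!/usr/bin/env bash
# safe_reload CONFIG_FILE [BASE_URL]
# Validates CONFIG_FILE with promtool and only then calls the HTTP reload
# endpoint. BASE_URL defaults to http://localhost:9090 (an assumption).
safe_reload() {
    local config="$1" base_url="${2:-http://localhost:9090}"
    if ! promtool check config "$config"; then
        echo "config invalid, reload skipped" >&2
        return 1
    fi
    # Requires Prometheus to run with --web.enable-lifecycle.
    curl -fsS -X POST "$base_url/-/reload"
}
```

For the layout used here the call would be `safe_reload /data/prometheus/soft/prometheus/prometheus.yml`.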
Deploying node-exporter on the monitored nodes

I. Deploy on the node

1. Download

    # File name as given; take the full release URL from https://prometheus.io/download/
    wget node_exporter-1.8.1.linux-amd64.tar.gz

2. Create the directories

    mkdir -pv /data/node-export/{data,soft,logs}

3. Unpack

    tar xzf node_exporter-1.8.1.linux-amd64.tar.gz -C /data/node-export/soft

4. Create a symlink

    ln -sv /data/node-export/soft/node_exporter-1.8.1.linux-amd64 /data/node-export/soft/node-exporter

5. Write the unit file

    vim /etc/systemd/system/node-exporter.service

    [Unit]
    Description=xinjizhiwa node-exporter
    Documentation=https://prometheus.io/docs/introduction/overview/
    After=network.target

    [Service]
    Restart=on-failure
    ExecStart=/data/node-export/soft/node-exporter/node_exporter
    ExecReload=/bin/kill -HUP $MAINPID
    LimitNOFILE=65535

    [Install]
    WantedBy=multi-user.target

6. Reload systemd and start the service

    systemctl daemon-reload
    systemctl enable --now node-exporter.service

7. Check

    netstat -lnpt    # confirm that port 9100 is listening

Then open http://IP:9100 in a browser.
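Besides checking the port, the exporter's /metrics output can be inspected directly. The sketch below counts the sample lines of one metric in the Prometheus text format; `count_series` is a hypothetical helper name, and it reads the metrics from stdin so it can be fed from curl or from a saved file.

```shell
#!/usr/bin/env bash
# count_series METRIC_NAME
# Reads Prometheus text exposition format on stdin and prints how many
# sample lines belong to METRIC_NAME. Lines starting with '#' (HELP/TYPE)
# are skipped.
count_series() {
    local metric="$1"
    # A sample line is the metric name followed by '{' (labels) or a space.
    grep -v '^#' | grep -c -E "^${metric}(\{| )" || true
}
```

A typical call would be `curl -s http://IP:9100/metrics | count_series node_cpu_seconds_total`; a result of 0 suggests the collector is not exposing that metric.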
II. Configure Prometheus to scrape node-exporter

1. On the Prometheus host, add the node to the configuration

    vim /data/prometheus/soft/prometheus/prometheus.yml

    global:
      # How often to scrape targets (15-30s is recommended in production)
      scrape_interval: 3s
      # How often to evaluate rules
      evaluation_interval: 15s

    # Alertmanager settings, covered later
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093

    # Rule files, covered later
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # Scrape targets
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      # A second job; the name is free to choose
      - job_name: "node-exporter"
        static_configs:
          # IP and port of the monitored node
          - targets: ["10.0.0.41:9100"]

2. Reload Prometheus

    curl -X POST http://IP:9090/-/reload

Refresh the Prometheus web page and the node appears.

3. PromQL query

    up    # shows whether each monitored target is alive: 1 means up, 0 means down
Deploying Grafana

1. Download

    wget grafana-enterprise-11.1.0-1.x86_64.rpm

Official site: https://grafana.com/grafana/dashboards/

2. Move the package into the Prometheus soft directory to keep things in one place

    mv /root/grafana-enterprise-11.1.0-1.x86_64.rpm /data/prometheus/soft

3. Install Grafana

    cd /data/prometheus/soft
    yum -y localinstall grafana-enterprise-11.1.0-1.x86_64.rpm

or

    yum install -y fontconfig
    rpm -ivh grafana-enterprise-11.1.0-1.x86_64.rpm

4. Start

    systemctl enable --now grafana-server.service

5. Check

    netstat -lnpt    # confirm that port 3000 is listening

Then open Grafana in a browser on port 3000. Default account and password: admin/admin
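6. Configure the data source

    [Home] - [Administration] - [Data sources] - [Add data source] - [Prometheus]

7. Create a folder for dashboards

    [Home] - [Dashboards] - [New] - [New folder]

8. Create a dashboard in the new folder

Enter the folder and create a dashboard:

    [Create dashboard]

Choose the data source:

    [Add visualization]

9. Test

First test the PromQL expression that computes CPU usage. Once it returns the expected result, copy it and paste it into the Grafana panel:

    (1-sum(node_cpu_seconds_total{mode="idle"})/sum(node_cpu_seconds_total))*100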
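The expression above divides counters that have been accumulating since boot, so it yields the average usage since boot, summed over every scraped host, rather than the current load. A commonly used alternative takes the change over a time window, for example `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100`. The arithmetic behind that can be sketched in shell; `cpu_usage_pct` is a hypothetical helper and the counter readings are made-up inputs.

```shell
#!/usr/bin/env bash
# cpu_usage_pct IDLE_T0 TOTAL_T0 IDLE_T1 TOTAL_T1
# Given two readings of the cumulative idle and total CPU-second counters,
# prints the usage percentage over the interval between the readings.
cpu_usage_pct() {
    awk -v i0="$1" -v t0="$2" -v i1="$3" -v t1="$4" 'BEGIN {
        d_total = t1 - t0
        # No elapsed CPU time means the ratio is undefined.
        if (d_total <= 0) { print "n/a"; exit }
        printf "%.2f\n", (1 - (i1 - i0) / d_total) * 100
    }'
}
```

With readings of 900/1000 and then 930/1100, only 30 of the 100 new CPU-seconds were idle, so `cpu_usage_pct 900 1000 930 1100` prints 70.00, even though the since-boot ratio would still report a mostly idle machine.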
10. Download a community dashboard

    https://grafana.com/grafana/dashboards

On the page of the chosen dashboard, use "Copy ID to clipboard" or "Download JSON".

Upload the dashboard JSON file to Grafana:

    [Home] - [Dashboards] - [New] - [Import]
Grafana variables

Building drop-down selectors with Grafana variables is covered in this tutorial:
https://blog.csdn.net/2302_79199605/article/details/136438841?spm=1001.2014.3001.5501
Configuring service discovery for Prometheus

I. File-based service discovery

1. Edit the Prometheus configuration file

    vim /data/prometheus/soft/prometheus/prometheus.yml

    # General settings
    global:
      # How often to scrape targets (15-30s is recommended in production)
      scrape_interval: 3s
      # How often to evaluate rules
      evaluation_interval: 15s

    # Alertmanager settings, covered later
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              # - alertmanager:9093

    # Rule files, covered later
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"

    # Scrape targets
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      # A second job; the name is free to choose
      - job_name: "node-exporter01"
        # File-based service discovery
        file_sd_configs:
          # Paths of the discovery files
          - files:
              #- /data/prometheus/soft/prometheus/file-sd.json
              - /data/prometheus/soft/prometheus/file-sd.yaml    # discovery file path of your choice

2. Reload Prometheus

    curl -X POST http://IP:9090/-/reload

3. Write the discovery file

    vim /data/prometheus/soft/prometheus/file-sd.yaml

    - targets:
        - '10.0.0.41:9100'
      labels:
        xinjizhiwa: prometheus-learn
        office: www.xinjizhiwa.com
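When the target list grows, the discovery file can be generated instead of edited by hand. The sketch below prints one target group for a list of hosts; `gen_file_sd` is a hypothetical helper, the second IP in the usage line is made up, and labels are left out for brevity.

```shell
#!/usr/bin/env bash
# gen_file_sd PORT HOST...
# Prints a single file_sd target group in YAML, one entry per HOST,
# all on the same PORT.
gen_file_sd() {
    local port="$1" host
    shift
    echo "- targets:"
    for host in "$@"; do
        echo "    - '${host}:${port}'"
    done
}
```

For example, `gen_file_sd 9100 10.0.0.41 10.0.0.42 > /data/prometheus/soft/prometheus/file-sd.yaml`. Prometheus watches file_sd files for changes, so an updated file should be picked up without another reload.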
4. Refresh the browser and check that the target appears
Configuring Prometheus data storage

I. Storing collected data locally

1. Edit the systemd unit

    vim /etc/systemd/system/prometheus-server.service

    [Unit]
    Description=Prometheus Server
    Documentation=https://prometheus.io/docs/introduction/overview/
    After=network.target

    [Service]
    Restart=on-failure
    ExecStart=/data/prometheus/soft/prometheus/prometheus \
        --config.file=/data/prometheus/soft/prometheus/prometheus.yml \
        --web.enable-lifecycle \
        --storage.tsdb.path=/data/prometheus/data/prometheus \
        --storage.tsdb.retention.time=60d \
        --web.listen-address=0.0.0.0:9090 \
        --web.max-connections=4096 \
        --storage.tsdb.retention.size=512MB \
        --query.timeout=10s \
        --query.max-concurrency=20 \
        --log.level=info \
        --log.format=json \
        --web.read-timeout=5m
    ExecReload=/bin/kill -HUP $MAINPID
    LimitNOFILE=65535

    [Install]
    WantedBy=multi-user.target

Parameters:

    --config.file=/data/prometheus/soft/prometheus/prometheus.yml
        Path of the Prometheus configuration file.
    --web.enable-lifecycle
        Enables reload (and shutdown) through the HTTP API.
    --storage.tsdb.path=/data/prometheus/data/prometheus
        Where Prometheus stores its data. If omitted, it defaults to data/ under the working directory.
    --storage.tsdb.retention.time=60d
        How long data is kept.
    --web.listen-address=0.0.0.0:9090
        Address and port Prometheus listens on.
    --web.max-connections=4096
        Maximum number of simultaneous connections.
    --storage.tsdb.retention.size=512MB
        Maximum total size of stored blocks. Once exceeded, the oldest data is removed first. When both a time and a size limit are set, whichever is hit first applies.
    --query.timeout=10s
        Maximum time a query may run.
    --query.max-concurrency=20
        Maximum number of queries executed concurrently.
    --log.level=info
        Log level.
    --log.format=json
        Log format (logfmt or json).
    --web.read-timeout=5m
        Maximum time before a request read times out and idle connections are closed.

2. Reload systemd and restart the service

    systemctl daemon-reload
    systemctl restart prometheus-server
II. Remote storage for Prometheus data

Not covered here. See this tutorial:
https://blog.csdn.net/2302_79199605/article/details/136467629?spm=1001.2014.3001.5501
Alert notifications with Alertmanager

1. Define alerting rules

Edit the Prometheus configuration file prometheus.yml and set the following, replacing the placeholder rule_files and alerting sections:

    rule_files:
      - /usr/local/prometheus/rules/*.rules

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - localhost:9093

In the directory /usr/local/prometheus/rules/ create the rule file hoststats-alert.rules with this content:

    groups:
    - name: hostStatsAlert
      rules:
      - alert: hostCpuUsageAlert
        expr: sum by (instance) (avg without (cpu) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))) > 0.5
        for: 1m
        labels:
          # Severity
          severity: warning
        annotations:
          title: High CPU usage
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 50% (current value: {{ $value }})"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
        for: 1m
        labels:
          severity: warning
        annotations:
          title: High memory usage
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
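Rule files are easy to break with an indentation slip, so a syntax check before restarting Prometheus can save a round trip. The sketch below runs `promtool check rules` over every *.rules file in a directory; promtool is assumed to be on PATH, and `check_rules` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_rules DIR
# Runs `promtool check rules` on every *.rules file in DIR. Returns
# non-zero if any file fails the check or if DIR has no rule files.
check_rules() {
    local dir="$1" file failed=0 found=0
    for file in "$dir"/*.rules; do
        # An unmatched glob stays literal; skip it.
        [ -e "$file" ] || continue
        found=1
        promtool check rules "$file" || failed=1
    done
    [ "$found" -eq 1 ] && [ "$failed" -eq 0 ]
}
```

For the directory used here: `check_rules /usr/local/prometheus/rules && echo "rules look valid"`.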
After restarting Prometheus, open http://IP:9090/rules to see the rule files that are currently loaded.
2. Install and configure prometheus-webhook-dingtalk

    wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
    tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz -C /usr/local
    mv /usr/local/prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/prometheus-webhook-dingtalk
    cp /usr/local/prometheus-webhook-dingtalk/config.example.yml /usr/local/prometheus-webhook-dingtalk/config.yml
    vim /usr/local/prometheus-webhook-dingtalk/config.yml

Change the configuration file to the following:

    # timeout: 5s

    # no_builtin_template: true

    templates:
      - contrib/templates/mytemplate.tmpl    # points to the template created below

    # default_message:
    #   title: '{{ template "legacy.title" . }}'
    #   text: '{{ template "legacy.content" . }}'

    targets:
      webhook1:
        # Webhook URL taken from the DingTalk robot settings
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        # Signing secret of the robot; set it if signing is enabled on the robot
        # secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      # webhook2:
      #   url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      # webhook_legacy:
      #   url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      #   message:
      #     title: '{{ template "legacy.title" . }}'
      #     text: '{{ template "legacy.content" . }}'
      # webhook_mention_all:
      #   url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      #   mention:
      #     all: true
      # webhook_mention_users:
      #   url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      #   mention:
      #     mobiles: ['156xxxx8827', '189xxxx8325']

Add the template below. It expects the Prometheus rules to provide the annotations title and description and the label severity.

    vim /usr/local/prometheus-webhook-dingtalk/contrib/templates/mytemplate.tmpl

    {{ define "__subject" }}
    [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
    {{ end }}

    {{ define "__alert_list" }}{{ range . }}
    ---
    {{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

    **Alert name**: {{ index .Annotations "title" }}

    **Severity**: {{ .Labels.severity }}

    **Host**: {{ .Labels.instance }}

    **Details**: {{ index .Annotations "description" }}

    **Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
    {{ end }}{{ end }}

    {{ define "__resolved_list" }}{{ range . }}
    ---
    {{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}

    **Alert name**: {{ index .Annotations "title" }}

    **Severity**: {{ .Labels.severity }}

    **Host**: {{ .Labels.instance }}

    **Details**: {{ index .Annotations "description" }}

    **Started at**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

    **Resolved at**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
    {{ end }}{{ end }}

    {{ define "default.title" }}
    {{ template "__subject" . }}
    {{ end }}

    {{ define "default.content" }}
    {{ if gt (len .Alerts.Firing) 0 }}
    **==== {{ .Alerts.Firing | len }} alert(s) firing ====**
    {{ template "__alert_list" .Alerts.Firing }}
    ---
    {{ end }}

    {{ if gt (len .Alerts.Resolved) 0 }}
    **==== {{ .Alerts.Resolved | len }} alert(s) resolved ====**
    {{ template "__resolved_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}

    {{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
    {{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
    {{ template "default.title" . }}
    {{ template "default.content" . }}

Start the webhook service:

    cd /usr/local/prometheus-webhook-dingtalk/
    ./prometheus-webhook-dingtalk --config.file=config.yml >dingtalk.log 2>&1 &
3. Install and configure Alertmanager

    wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
    tar -zxvf alertmanager-0.25.0.linux-amd64.tar.gz
    mv alertmanager-0.25.0.linux-amd64 /usr/local/alertmanager

Edit the Alertmanager configuration file as follows:

    vim /usr/local/alertmanager/alertmanager.yml

    global:
      # Time after which an alert is declared resolved if it has not been updated
      resolve_timeout: 5m
    route:
      # Labels used to group alerts
      group_by: ['alertname']
      # How long to wait after the first alert of a group, so that alerts of the same group go out together
      group_wait: 10s
      # How long to wait before notifying about new alerts added to an existing group
      group_interval: 1m
      # How long to wait before re-sending a notification for an alert that is still firing
      repeat_interval: 1m
      # Default receiver
      receiver: 'web.hook'
      routes:
      - receiver: 'dingding.webhook1'
        match_re:
          alertname: ".*"
    receivers:
    - name: 'web.hook'
      webhook_configs:
      - url: 'http://127.0.0.1:5001/'
    - name: 'dingding.webhook1'
      webhook_configs:
      # webhook1 must match the name used under targets in the DingTalk webhook configuration
      - url: 'http://127.0.0.1:8060/dingtalk/webhook1/send'
        send_resolved: true
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']

Start Alertmanager:

    cd /usr/local/alertmanager/
    ./alertmanager --config.file=alertmanager.yml >alertmanager.log 2>&1 &
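The route from Alertmanager to DingTalk can be exercised on its own by posting a synthetic alert to Alertmanager's v2 API, without waiting for a Prometheus rule to fire. This is a sketch under assumptions: Alertmanager is taken to listen on localhost:9093, and the alert name, instance and helper names (`alert_payload`, `send_test_alert`) are made up for illustration.

```shell
#!/usr/bin/env bash
# alert_payload NAME INSTANCE SEVERITY
# Prints a one-element JSON array in the shape the Alertmanager v2 alerts
# endpoint accepts. Arguments must not contain double quotes.
alert_payload() {
    printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"%s"},"annotations":{"title":"%s","description":"manual test alert"}}]\n' \
        "$1" "$2" "$3" "$1"
}

# send_test_alert [BASE_URL]
# Posts a synthetic alert; BASE_URL defaults to http://localhost:9093.
send_test_alert() {
    local base_url="${1:-http://localhost:9093}"
    alert_payload "ManualTest" "localhost:9100" "warning" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
             --data-binary @- "$base_url/api/v2/alerts"
}
```

Running `send_test_alert` should make the alert show up in the Alertmanager UI, and after group_wait the DingTalk robot should receive a message if the webhook chain is wired correctly.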
At this point the alerting pipeline can be verified by driving up CPU usage by hand. Run the following command on the monitored host:

    cat /dev/zero > /dev/null

Once the alert state changes to firing, the DingTalk group robot posts the alert message.

Source: https://blog.csdn.net/weixin_44223946/article/details/131411062
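The load command above runs until it is interrupted. A bounded variant can be sketched with coreutils `timeout`; `burn_cpu` is a hypothetical helper name. Because the CPU rule averages over all cores, a single busy core may stay under the 50% threshold on a multi-core host, so several instances may be needed in parallel.

```shell
#!/usr/bin/env bash
# burn_cpu SECONDS
# Keeps one core busy for SECONDS and then stops. `timeout` exits with
# status 124 when it has to end the command, which is the expected case.
burn_cpu() {
    local seconds="$1" rc=0
    timeout "$seconds" sh -c 'cat /dev/zero > /dev/null' || rc=$?
    [ "$rc" -eq 124 ]
}
```

For example, `burn_cpu 180` loads one core for three minutes, long enough to cover the rule's 5m irate window and its `for: 1m` delay only if the load stays above the threshold; adjust the duration and the number of parallel runs to the host.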
1. Download the tool

    wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
    tar xf alertmanager-0.26.0.linux-amd64.tar.gz -C /prometheus/softwares/
    ln -svf /prometheus/softwares/alertmanager-0.26.0.linux-amd64/

Tutorial:
https://blog.csdn.net/2302_79199605/article/details/136494677?spm=1001.2014.3001.5501