Step-by-step tutorial: deploying kube-prometheus 0.10 on a K8s 1.23 cluster (with mirror registries for mainland China and network-policy pitfalls)
2026/6/15 22:17:11

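Building a K8s monitoring stack from scratch: a hands-on kube-prometheus guide with in-depth tuning

Once a containerized application grows past a certain scale, traditional log-based troubleshooting and performance analysis stop being adequate. During a major e-commerce sales promotion last year, an operations team could not see a Pod memory leak in real time, and the cluster suffered a cascading failure. That is the fundamental reason to deploy a dedicated monitoring system in a Kubernetes cluster: container orchestration without observability is like driving an F1 car blindfolded.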

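1. Environment preparation: a standardized K8s foundation

1.1 Matching cluster versions precisely

Pairing Kubernetes 1.23 with kube-prometheus 0.10 is no accident. The combination has been validated by the community over a long period, and its API compatibility matrix is as follows: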

Component            Minimum K8s version   Recommended version
Prometheus Operator  1.19                  1.21+
kube-state-metrics   1.18                  1.20+
Grafana              no requirement        8.5+

Key verification steps:

kubectl version --short | grep Server
git clone https://github.com/prometheus-operator/kube-prometheus.git -b release-0.10
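The version check can also be scripted so that later steps stop early on a mismatched cluster. The following is a minimal sketch; is_expected_minor is a hypothetical helper (not a kubectl feature), and it assumes the server line has the form "Server Version: vX.Y.Z".

```shell
#!/bin/sh
# Sketch: succeed only when a "Server Version" line reports the expected minor release.
# is_expected_minor is a hypothetical helper name, not part of kubectl.
is_expected_minor() (
  line=$1                                   # e.g. "Server Version: v1.23.6"
  want=$2                                   # e.g. "1.23"
  version=${line##*: v}                     # strip everything up to ": v"
  minor=$(printf '%s\n' "$version" | cut -d. -f1,2)   # keep "major.minor"
  [ "$minor" = "$want" ]
)

# Intended use against a live cluster (not executed here):
#   is_expected_minor "$(kubectl version --short | grep Server)" 1.23 \
#     || echo "server is not on the 1.23 line"
```

Comparing only the major.minor pair keeps the check tolerant of patch releases and vendor suffixes.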

1.2 Infrastructure pre-configuration

Create a standardized workspace under /opt/k8s:

mkdir -p /opt/k8s/{manifests,images,backup}
chmod 755 /opt/k8s

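2. Image acceleration: getting past registry access problems in mainland China

2.1 Image replacement with fallbacks

To work around unreliable access to quay.io and k8s.gcr.io, use a multi-level fallback strategy: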

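  1. Preferred: the USTC mirror

    sed -i 's#quay\.io#quay.mirrors.ustc.edu.cn#g' manifests/*.yaml
  2. Fallback: the Alibaba Cloud mirror

    sed -i 's#k8s\.gcr\.io#registry.aliyuncs.com/google_containers#g' manifests/kubeStateMetrics-deployment.yaml
  3. Emergency option: export an already-pulled image to a local archive (docker save only works on images present locally, and the tag should match the one your manifests reference)

    docker save -o /opt/k8s/images/prometheus-v2.30.3.tar quay.mirrors.ustc.edu.cn/prometheus/prometheus:v2.30.3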
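The registry rewrite in the steps above can be previewed on a single image reference before touching any manifest. This is a sketch only: mirror_image is a hypothetical helper, the mirror hosts are the ones named in the list, and whether a mirror actually serves a given repository path is not verified here.

```shell
#!/bin/sh
# Sketch: map one image reference to its mirror, matching only on the registry prefix.
# mirror_image is a hypothetical helper name.
mirror_image() (
  image=$1
  case "$image" in
    quay.io/*)
      printf '%s\n' "quay.mirrors.ustc.edu.cn/${image#quay.io/}" ;;
    k8s.gcr.io/*)
      printf '%s\n' "registry.aliyuncs.com/google_containers/${image#k8s.gcr.io/}" ;;
    *)
      printf '%s\n' "$image" ;;   # unknown registry: leave unchanged
  esac
)
```

Matching on the leading registry component avoids rewriting an unrelated string that merely contains "quay.io" somewhere in the middle.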

2.2 Pre-pulling images

Pull in parallel to speed up preparation:

grep -rh "image:" manifests/ | awk -F 'image: ' '{print $2}' | sort -u | xargs -P 4 -I {} docker pull {}
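Before pulling anything, it helps to look at the list of images the pipeline would hand to docker pull. The sketch below only prints that list; list_images is a hypothetical helper, and it assumes images appear in plain "image:" fields (references passed as container arguments are not picked up).

```shell
#!/bin/sh
# Sketch: print each distinct image referenced by an "image:" field under a directory.
# list_images is a hypothetical helper name.
list_images() (
  dir=$1
  grep -rh 'image:' "$dir" \
    | sed -e 's/^.*image:[[:space:]]*//' \
    | tr -d "\"'" \
    | sort -u
)

# Intended use (not executed here):
#   list_images manifests/ | xargs -P 4 -I {} docker pull {}
```

Deduplicating with sort -u keeps parallel workers from pulling the same image more than once.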

3. Deployment in practice: avoiding common pitfalls

3.1 Exposing services

Edit manifests/grafana-service.yaml to expose the service:

apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: http
    port: 3000
    targetPort: http
    nodePort: 30300  # must lie in the NodePort range (default 30000-32767)
  selector:
    app.kubernetes.io/name: grafana
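A nodePort is only accepted when it falls inside the API server's configured range, which is 30000-32767 unless --service-node-port-range has been changed. A small sketch of that check follows; nodeport_in_default_range is a hypothetical helper and it hard-codes the default bounds.

```shell
#!/bin/sh
# Sketch: check a nodePort against the default NodePort range (30000-32767).
# nodeport_in_default_range is a hypothetical helper; clusters started with a custom
# --service-node-port-range need different bounds.
nodeport_in_default_range() (
  port=$1
  [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
)

# Example:
#   nodeport_in_default_range 33000 || echo "33000 would be rejected on a default cluster"
```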

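3.2 Core deployment flow

Deploy in stages so that the CRDs are established before the resources that depend on them are applied: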

kubectl apply --server-side -f manifests/setup
kubectl wait --for=condition=Established crd --all --timeout=300s
kubectl apply -f manifests/
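If the second apply runs before the API server serves the new CRDs, kubectl reports "no matches for kind" errors. Besides kubectl wait, another way to tolerate that is a small retry wrapper. The sketch below is generic shell; retry is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
#!/bin/sh
# Sketch: run a command until it succeeds, at most a fixed number of times.
# retry is a hypothetical helper: retry <max_attempts> <delay_seconds> <command...>
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  until "$@"; do
    if [ "$retry_n" -ge "$retry_max" ]; then
      return 1                      # give up after the last allowed attempt
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
  return 0
}

# Intended use (not executed here):
#   retry 10 5 kubectl apply -f manifests/
```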

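4. Network policy and security tuning

4.1 Access-control allowlist

Rather than deleting the NetworkPolicy outright, a better option is to allow specific sources: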

# manifests/prometheus-networkPolicy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-allow-specific
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  ingress:
  - from:
    - ipBlock:
        cidr: 192.168.1.0/24
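To reason about which clients an ipBlock such as 192.168.1.0/24 admits, a quick membership check can be done in shell arithmetic. This is a sketch for IPv4 only; ip_to_int and in_cidr are hypothetical helpers, and it assumes the client address reaches the Pod unchanged (source NAT on the way in would change what the policy sees).

```shell
#!/bin/sh
# Sketch: test whether an IPv4 address falls inside a CIDR block.
# ip_to_int and in_cidr are hypothetical helper names.
ip_to_int() (
  IFS=.
  set -- $1                         # split the dotted quad into four octets
  echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
)

in_cidr() (
  ip=$1
  net=${2%/*}                       # network address part, e.g. 192.168.1.0
  bits=${2#*/}                      # prefix length, e.g. 24
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
)

# Example:
#   in_cidr 192.168.1.42 192.168.1.0/24 && echo "allowed by the ipBlock"
```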

4.2 Persistent storage

Add a volume claim template to the Prometheus resource:

# manifests/prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
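How large the storage request should be depends on retention and ingestion rate. The Prometheus documentation gives a rule of thumb of retention time in seconds times ingested samples per second times bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; estimate_tsdb_gib is a hypothetical helper and the inputs in the example are illustrative, not measurements.

```shell
#!/bin/sh
# Sketch: rough TSDB disk estimate in GiB, rounded up.
# estimate_tsdb_gib is a hypothetical helper:
#   estimate_tsdb_gib <retention_days> <samples_per_second> <bytes_per_sample>
estimate_tsdb_gib() (
  retention_days=$1
  samples_per_second=$2
  bytes_per_sample=$3
  bytes=$(( retention_days * 86400 * samples_per_second * bytes_per_sample ))
  echo "$(( (bytes + 1073741823) / 1073741824 ))"   # ceiling division by 1 GiB
)

# Example with illustrative inputs: 15 days, 50000 samples/s, 2 bytes per sample
#   estimate_tsdb_gib 15 50000 2
```

The result is a lower bound for planning; leaving headroom above it is a reasonable precaution.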

5. Advanced monitoring scenarios

5.1 Scraping custom metrics

Configure a PodMonitor to scrape a custom application:

# manifests/custom-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment
  podMetricsEndpoints:
  - port: metrics
    interval: 15s

5.2 Managing alerting rules

An example of a business-level alerting rule:

# manifests/custom-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: business-alerts
spec:
  groups:
  - name: payment.rules
    rules:
    - alert: HighPaymentErrorRate
      expr: rate(payment_errors_total[5m]) > 0.1
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "High error rate on {{ $labels.instance }}"
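The threshold in this rule is an absolute rate: rate(...[5m]) > 0.1 means more than 0.1 errors per second, or roughly 30 errors within the 5-minute window, not a fraction of requests. The sketch below reproduces that arithmetic from two counter readings; per_second_rate is a hypothetical helper and is a simplification, since the real rate() function also handles counter resets and extrapolates to the window edges.

```shell
#!/bin/sh
# Sketch: simplified per-second increase between two counter readings.
# per_second_rate is a hypothetical helper:
#   per_second_rate <first_value> <last_value> <window_seconds>
per_second_rate() (
  awk -v a="$1" -v b="$2" -v w="$3" 'BEGIN { printf "%.3f\n", (b - a) / w }'
)

# Example: a counter going from 100 to 145 over 300 seconds
#   per_second_rate 100 145 300
```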

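After importing the Kubernetes cluster monitoring dashboard (ID 315) into Grafana, we noticed that one Node's CPU usage stayed above 90%. The PromQL query sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])), which returns the number of busy CPU cores per instance, eventually led us to a batch Pod that had no resource limits set. Before the monitoring system went live, a problem like this typically took hours to find; now it takes about 30 seconds.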
