Monitoring stack (endpoints, secrets, alert rules)

Facts about the monitoring stack: URLs, services, secrets, alert rules, and scrape targets. For context and design decisions, see the explanation.

Source of truth: k8s/monitoring/*.yaml.

Endpoints

| Component | External URL | Cluster-internal service | Auth |
| --- | --- | --- | --- |
| Grafana | https://grafana.tapster.nl | grafana.monitoring.svc.cluster.local:3000 | Own login (admin + Secret grafana-credentials) |
| Prometheus | https://prometheus.tapster.nl | prometheus.monitoring.svc.cluster.local:9090 | Basic auth via Secret prometheus-basic-auth |
| Alertmanager | https://alertmanager.tapster.nl | alertmanager.monitoring.svc.cluster.local:9093 | Basic auth via Secret prometheus-basic-auth |

external-dns creates the external DNS records, pointing at 89.41.171.120 (and its IPv6 counterpart). The TLS certificate is issued by cert-manager (monitoring-tls, issuer letsencrypt-production).
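To illustrate the Auth column, the sketch below builds an authenticated request against the external Prometheus URL using only the Python standard library. The username `monitoring-user`, the password, and the query `up` are placeholders, not values from the repository; the real credentials are the ones stored in the Secret prometheus-basic-auth.

```python
import base64
import urllib.parse
import urllib.request

PROMETHEUS_URL = "https://prometheus.tapster.nl"


def build_query_request(query: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for the Prometheus instant-query API with Basic auth."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


# Hypothetical credentials, for illustration only.
request = build_query_request("up", "monitoring-user", "example-password")
# Sending it is left to the caller, e.g. urllib.request.urlopen(request, timeout=10)
```

The request is only constructed here, so the snippet can be run without network access.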

Pods, PVCs, and images

| Workload | Strategy | PVC | Image |
| --- | --- | --- | --- |
| Prometheus | - | prometheus-storage | prom/prometheus (pinned tag) |
| Alertmanager | Recreate | alertmanager-storage (1Gi) | prom/alertmanager (pinned tag) |
| Grafana | Recreate | grafana-storage (10Gi) | grafana/grafana:latest |

All workloads use nodeSelector: tapster.nl/node-name: jive. The PVCs are ReadWriteOnce, so the rollout strategy is Recreate (RollingUpdate would deadlock on PVC attachment).

Secrets

| Name | Namespace | Type | Keys | Source |
| --- | --- | --- | --- | --- |
| alertmanager-smtp | monitoring | Opaque | SMTP_FROM, SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS | Created manually, Flowmailer credentials |
| alertmanager-webhook | monitoring | Opaque | HEALTHCHECK_WATCHDOG_URL | Created manually, healthchecks.io / BetterStack URL |
| prometheus-basic-auth | monitoring | Opaque | users (htpasswd format) | k8s/monitoring/ingress.yaml |
| grafana-credentials | monitoring | Opaque | admin-password | k8s/monitoring/grafana-deployment.yaml |
| monitoring-tls | monitoring | TLS | cert + key for the *.tapster.nl monitoring hosts | cert-manager (monitoring-tls Certificate) |

To rotate the basic-auth credentials, see the basic-auth rotation how-to.

Scrape jobs

Defined in prometheus-config.yaml, ConfigMap prometheus-config.

| Job | Source | Labels added |
| --- | --- | --- |
| tapster-api-production | Pods in namespace tapster with annotation prometheus.io/scrape=true | environment=production |
| tapster-api-staging | Pods in namespace tapster-development with annotation prometheus.io/scrape=true | environment=staging |
| kube-state-metrics | kube-state-metrics.kube-system.svc.cluster.local:8080 (static) | - |
| kubernetes-nodes-cadvisor | All nodes, via /api/v1/nodes/<node>/proxy/metrics/cadvisor | Node labels via labelmap |

scrape_interval and evaluation_interval are both set to 15 seconds.
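The selection rule for the two API jobs can be modelled in a few lines. This is a simplified sketch of what the table describes, not the relabeling configuration itself (which lives in prometheus-config.yaml and is not reproduced here); the function name `scrape_target_labels` is made up for the example.

```python
from typing import Optional

# Namespace -> (job name, value of the added `environment` label), per the scrape-job table.
API_JOBS = {
    "tapster": ("tapster-api-production", "production"),
    "tapster-development": ("tapster-api-staging", "staging"),
}


def scrape_target_labels(namespace: str, annotations: dict) -> Optional[dict]:
    """Return the labels a pod would get, or None if no API job selects it."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # pod did not opt in via the annotation
    job = API_JOBS.get(namespace)
    if job is None:
        return None  # namespace is not covered by an API job
    name, environment = job
    return {"job": name, "environment": environment}
```

A pod in namespace tapster without the annotation is ignored, which is the usual reason a new workload does not show up as a target.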

Alert rules

Loaded via rule_files in prometheus-config.yaml. Three ConfigMaps in k8s/monitoring/:

mongoose-pool-rules (MongoDB connection pool)

| Alert | Expression (short) | for | Severity |
| --- | --- | --- | --- |
| MongoosePoolHighUsage | sum(mongoose_pool_in_use) / 1500 > 0.70 | 5m | warning |
| MongoosePoolCritical | sum(mongoose_pool_in_use) / 1500 > 0.85 | 2m | critical |
| MongoosePoolRapidGrowth | deriv(sum(mongoose_pool_in_use)[5m:30s]) > 50 | 3m | critical |
| MongoosePoolPerPodSaturated | mongoose_pool_in_use / mongoose_pool_max_size > 0.90 | 5m | warning |

The denominator 1500 is the connection limit of the Atlas M10 tier. Update it on a tier upgrade, and keep it in sync with the ADR block in apps/api/src/common/config/mongoose/mongooseConnectionService.ts.
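Because the thresholds are fractions of the tier limit, it can help to see them as absolute connection counts. The sketch below does that arithmetic with the values from the rules; the helper name `connections_at` is made up for the example.

```python
ATLAS_TIER_CONNECTION_LIMIT = 1500  # Atlas M10 tier limit, the denominator in the rules

# Fractions taken from the alert expressions in the mongoose-pool-rules table.
THRESHOLDS = {
    "MongoosePoolHighUsage": 0.70,
    "MongoosePoolCritical": 0.85,
}


def connections_at(fraction: float, limit: int = ATLAS_TIER_CONNECTION_LIMIT) -> int:
    """Cluster-wide connection count at which the given fraction of the limit is reached."""
    return round(fraction * limit)


for alert, fraction in THRESHOLDS.items():
    print(f"{alert}: fires above {connections_at(fraction)} connections in use")
```

With the current limit the warning threshold is 1050 connections and the critical threshold 1275. Passing a different `limit` shows how the absolute numbers move after a tier upgrade.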

Recording rules in the same group: mongoose:pool_in_use:cluster_total, mongoose:pool_in_use:fraction_of_tier.

Runbook for all pool alerts: issue #1852.

kubernetes-platform-rules (platform + API availability)

| Alert | Expression (short) | for | Severity | Service |
| --- | --- | --- | --- | --- |
| PodCrashLooping | increase(kube_pod_container_status_restarts_total[15m]) > 3 (excl. kube-system) | 10m | warning | kubernetes |
| PodOOMKilled | increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) > 0 | 0m | critical | kubernetes |
| NodeDiskPressure | kube_node_status_condition{condition="DiskPressure",status="true"} == 1 | 2m | critical | kubernetes |
| ApiDown | max by (job) (up{job=~"tapster-api-.*"}) == 0 | 2m | critical | tapster-api |
| HighHttpErrorRate | sum(rate(...{status=~"5.."}[5m])) / sum(rate(...[5m])) > 0.05 | 10m | warning | tapster-api |
| CertificateExpiringSoon | certmanager_certificate_expiration_timestamp_seconds - time() < 14d | 15m | warning | cert-manager |
| AlertmanagerNotificationsFailed | rate(alertmanager_notifications_failed_total[5m]) > 0 | 5m | critical | alertmanager |

CertificateExpiringSoon is a no-op until cert-manager is added as a scrape target; the rule costs nothing and goes live as soon as the metric appears.

watchdog-rules (Dead Man’s Switch)

| Alert | Expression | for | Severity |
| --- | --- | --- | --- |
| Watchdog | vector(1) | - | none |

Fires continuously for as long as Prometheus evaluates its rules. The routing in alertmanager-config.yaml sends it to watchdog-webhook with repeat_interval: 1m.
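The check on the receiving side of a dead man's switch is the inverse of a normal alert: silence is the failure signal. The sketch below shows that logic in isolation. The 1-minute interval comes from the routing config; the 5-minute grace period and the function name are assumptions, since the real value is configured at the external service (healthchecks.io or BetterStack), not in this repository.

```python
from datetime import datetime, timedelta, timezone

REPEAT_INTERVAL = timedelta(minutes=1)  # repeat_interval from the routing config
GRACE_PERIOD = timedelta(minutes=5)     # assumption: grace time set at the receiving service


def pipeline_is_dead(last_ping: datetime, now: datetime) -> bool:
    """True when the heartbeat is overdue by more than the grace period."""
    return now - last_ping > REPEAT_INTERVAL + GRACE_PERIOD


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(pipeline_is_dead(now - timedelta(minutes=2), now))   # recent ping
print(pipeline_is_dead(now - timedelta(minutes=10), now))  # ping long overdue
```

A missing ping means that Prometheus, Alertmanager, or the path between them has stopped working, which no in-cluster alert could report.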

Alertmanager routing

Source: alertmanager-config.yaml, ConfigMap alertmanager-config.

Tree (matched in this order):

  1. alertname = "Watchdog" → receiver watchdog-webhook (repeat 1m)
  2. severity = "critical" → receiver roy-email (repeat 1h)
  3. Default → receiver roy-email (repeat 4h)
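The matching order above can be read as a first-match-wins function. The sketch below models it that way; it assumes none of the routes sets continue: true, which the source does not state, and the function name `select_route` is made up for the example.

```python
def select_route(labels: dict) -> tuple:
    """Return (receiver, repeat_interval) for an alert's labels; the first match wins."""
    if labels.get("alertname") == "Watchdog":
        return ("watchdog-webhook", "1m")
    if labels.get("severity") == "critical":
        return ("roy-email", "1h")
    return ("roy-email", "4h")  # default route


print(select_route({"alertname": "Watchdog", "severity": "none"}))
print(select_route({"alertname": "ApiDown", "severity": "critical"}))
print(select_route({"alertname": "PodCrashLooping", "severity": "warning"}))
```

Order matters: the Watchdog check comes first so the heartbeat never ends up in the e-mail receiver, whatever its severity label is.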

Receivers:

  • roy-email: SMTP via ${SMTP_HOST}:${SMTP_PORT} (Flowmailer), to: roy@tapster.nl, send_resolved: true. Subject format: [FIRING:N] alertname or [RESOLVED] alertname.
  • watchdog-webhook: POST to ${HEALTHCHECK_WATCHDOG_URL}, send_resolved: false.

Inhibit rules (two paths, see the explanation):

  1. App-specific: severity=critical + app_name=~".+" suppresses severity=warning + app_name=~".+" when both alertname and app_name are equal.
  2. Cluster-wide: severity=critical + app_name="" suppresses severity=warning + app_name="" (correlated by label-set equality).
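The two paths can be sketched as a small matcher. It follows Alertmanager's conventions that regex matchers are anchored and that a missing label counts as an empty value. The `equal` list of the cluster-wide rule is an assumption (the text only says "label-set equality"), and `ExampleAlert` is a made-up alert name.

```python
import re

# Each rule: matchers for the inhibiting (source) alert, matchers for the inhibited
# (target) alert, and the labels that must have the same value on both.
INHIBIT_RULES = [
    {   # app-specific path
        "source": {"severity": "critical", "app_name": ".+"},
        "target": {"severity": "warning", "app_name": ".+"},
        "equal": ["alertname", "app_name"],
    },
    {   # cluster-wide path; this `equal` list is an assumption
        "source": {"severity": "critical", "app_name": ""},
        "target": {"severity": "warning", "app_name": ""},
        "equal": ["alertname"],
    },
]


def matches(matchers: dict, labels: dict) -> bool:
    """Anchored regex match on every matcher; a missing label is treated as ''."""
    return all(re.fullmatch(pattern, labels.get(name, "")) for name, pattern in matchers.items())


def is_inhibited(target: dict, firing: list) -> bool:
    """True if any firing alert suppresses `target` under one of the rules."""
    for rule in INHIBIT_RULES:
        if not matches(rule["target"], target):
            continue
        for source in firing:
            same = all(source.get(l, "") == target.get(l, "") for l in rule["equal"])
            if same and matches(rule["source"], source):
                return True
    return False


warning_api = {"alertname": "ExampleAlert", "severity": "warning", "app_name": "api"}
critical_api = {"alertname": "ExampleAlert", "severity": "critical", "app_name": "api"}
print(is_inhibited(warning_api, [critical_api]))
```

The split into two rules keeps an app-level critical from silencing cluster-level warnings and the other way around, because the app_name matchers of the two paths are mutually exclusive.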

Grafana

Provisioned datasources (grafana-datasources ConfigMap):

| Name | UID | URL | isDefault |
| --- | --- | --- | --- |
| prometheus | prometheus | http://prometheus:9090 | true |
| alertmanager | alertmanager | http://alertmanager:9093 | false |

The Alertmanager datasource uses implementation: prometheus (vanilla Alertmanager API). handleGrafanaManagedAlerts is set to false, so that Grafana does not manage its own alerts but only reads the Alertmanager state.

Dashboard ConfigMaps (mounted under /var/lib/grafana/dashboards/<folder>):

  • tapster-dashboards-root (incl. tapster-main-dashboard)
  • tapster-dashboards-api
  • tapster-dashboards-jobs
  • tapster-dashboards-infrastructure
  • tapster-dashboards-business
  • grafana-dashboards-configmap (alertmanager-overview, etc.)

envsubst whitelist

In alertmanager-deployment.yaml, init container render-config:

envsubst '$SMTP_FROM $SMTP_HOST $SMTP_PORT $SMTP_USER $SMTP_PASS $HEALTHCHECK_WATCHDOG_URL' \
  < /etc/alertmanager-template/alertmanager.yml \
  > /etc/alertmanager/alertmanager.yml

When you add a new placeholder to alertmanager-config.yaml, also add the variable to this whitelist. See the explanation.
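What happens when a placeholder is missing from the whitelist is easy to overlook, so here is a small emulation of the whitelist behaviour. It is a sketch in Python, not the envsubst binary the init container runs; `NEW_VAR` and the template lines are made up for the example.

```python
import re

WHITELIST = {
    "SMTP_FROM", "SMTP_HOST", "SMTP_PORT", "SMTP_USER", "SMTP_PASS",
    "HEALTHCHECK_WATCHDOG_URL",
}

# Matches ${NAME} or $NAME.
_VARIABLE = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def render(template: str, env: dict, whitelist: set = WHITELIST) -> str:
    """Substitute $VAR and ${VAR} only for whitelisted names; leave everything else as-is."""
    def replace(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        if name not in whitelist:
            return match.group(0)  # not whitelisted: the text stays untouched
        return env.get(name, "")   # whitelisted but unset: empty string, as envsubst does
    return _VARIABLE.sub(replace, template)


env = {"SMTP_HOST": "smtp.example.com", "SMTP_PORT": "587", "NEW_VAR": "value"}
template = "smarthost: ${SMTP_HOST}:${SMTP_PORT}\nurl: ${NEW_VAR}\n"
print(render(template, env))
```

In the output the `${NEW_VAR}` placeholder survives literally even though the variable is set, which is the symptom to look for in the rendered /etc/alertmanager/alertmanager.yml when a variable was not added to the whitelist.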

Ingress and middleware chain

Source: ingress.yaml. Each component has an IngressRoute on websecure (TLS-only) plus a redirect route on web (HTTP → HTTPS).

Middleware chain per route:

  • Grafana: security-headers
  • Prometheus: security-headers, prometheus-auth
  • Alertmanager: security-headers, prometheus-auth

prometheus-auth is a Traefik Middleware of type basicAuth that reads the Secret prometheus-basic-auth (key users, htpasswd format).