Monitoring stack (endpoints, secrets, alert rules)
Facts about the monitoring stack: URLs, services, secrets, alert rules, and scrape targets. For context and design decisions, see the explanation.
Source of truth: k8s/monitoring/*.yaml.
Endpoints
| Component | External URL | Cluster-internal service | Auth |
|---|---|---|---|
| Grafana | https://grafana.tapster.nl | grafana.monitoring.svc.cluster.local:3000 | Grafana's built-in login (admin + Secret grafana-credentials) |
| Prometheus | https://prometheus.tapster.nl | prometheus.monitoring.svc.cluster.local:9090 | Basic auth via Secret prometheus-basic-auth |
| Alertmanager | https://alertmanager.tapster.nl | alertmanager.monitoring.svc.cluster.local:9093 | Basic auth via Secret prometheus-basic-auth |
External DNS records are created by external-dns and point to 89.41.171.120 (and its IPv6 counterpart). The TLS certificate is issued by cert-manager (monitoring-tls, issuer letsencrypt-production).
Pods, PVCs, and images
| Workload | Strategy | PVC | Image |
|---|---|---|---|
| Prometheus | - | prometheus-storage | prom/prometheus (pinned tag) |
| Alertmanager | Recreate | alertmanager-storage (1Gi) | prom/alertmanager (pinned tag) |
| Grafana | Recreate | grafana-storage (10Gi) | grafana/grafana:latest |
All workloads are pinned with nodeSelector: tapster.nl/node-name: jive. The PVCs are ReadWriteOnce, which is why the rollout strategy is Recreate (RollingUpdate would deadlock on PVC attachment).
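To illustrate how the node pin, the ReadWriteOnce claim, and the Recreate strategy fit together, a Grafana Deployment might look roughly like the sketch below. It is not a copy of the manifest in k8s/monitoring/: the app: grafana label, the volume name, and the omission of probes, resources, and environment variables are assumptions made for brevity.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate            # stop the old pod first so it releases the RWO volume
  selector:
    matchLabels:
      app: grafana            # assumed label, not taken from the real manifest
  template:
    metadata:
      labels:
        app: grafana
    spec:
      nodeSelector:
        tapster.nl/node-name: jive
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: storage   # assumed volume name
              mountPath: /var/lib/grafana
      volumes:
        - name: storage
          persistentVolumeClaim:
            claimName: grafana-storage
```

With RollingUpdate, the new pod would wait for a volume that the old pod still holds, while the old pod would wait for the new pod to become ready before terminating.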
Secrets
| Name | Namespace | Type | Keys | Source |
|---|---|---|---|---|
| alertmanager-smtp | monitoring | Opaque | SMTP_FROM, SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS | Created manually, Flowmailer credentials |
| alertmanager-webhook | monitoring | Opaque | HEALTHCHECK_WATCHDOG_URL | Created manually, healthchecks.io / BetterStack URL |
| prometheus-basic-auth | monitoring | Opaque | users (htpasswd format) | k8s/monitoring/ingress.yaml |
| grafana-credentials | monitoring | Opaque | admin-password | k8s/monitoring/grafana-deployment.yaml |
| monitoring-tls | monitoring | TLS | cert + key for the *.tapster.nl monitoring hosts | cert-manager (monitoring-tls Certificate) |
To rotate the basic-auth credentials, see the basic-auth rotation how-to.
Scrape jobs
Defined in prometheus-config.yaml, ConfigMap prometheus-config.
| Job | Source | Labels added |
|---|---|---|
| tapster-api-production | Pods in namespace tapster with annotation prometheus.io/scrape=true | environment=production |
| tapster-api-staging | Pods in namespace tapster-development with annotation prometheus.io/scrape=true | environment=staging |
| kube-state-metrics | kube-state-metrics.kube-system.svc.cluster.local:8080 (static) | - |
| kubernetes-nodes-cadvisor | All nodes, via /api/v1/nodes/<node>/proxy/metrics/cadvisor | Node labels via labelmap |
scrape_interval and evaluation_interval are both set to 15 seconds.
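The jobs in the table could be expressed along the following lines. This is a hedged sketch built from the common Prometheus Kubernetes discovery patterns, not an excerpt of prometheus-config.yaml: the exact relabel steps, the in-cluster API address, and the service-account file paths are assumptions, and the staging job is left out because it only differs in namespace and label value.

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: tapster-api-production
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [tapster]
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape=true.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Attach a fixed environment label to every target of this job.
      - target_label: environment
        replacement: production

  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]

  - job_name: kubernetes-nodes-cadvisor
    scheme: https
    kubernetes_sd_configs:
      - role: node
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      # Scrape through the API server proxy instead of the kubelet directly.
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
```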
Alert rules
Loaded via rule_files in prometheus-config.yaml. Three ConfigMaps in k8s/monitoring/:
mongoose-pool-rules (MongoDB connection pool)
| Alert | Expression (abbreviated) | for | Severity |
|---|---|---|---|
| MongoosePoolHighUsage | sum(mongoose_pool_in_use) / 1500 > 0.70 | 5m | warning |
| MongoosePoolCritical | sum(mongoose_pool_in_use) / 1500 > 0.85 | 2m | critical |
| MongoosePoolRapidGrowth | deriv(sum(mongoose_pool_in_use)[5m:30s]) > 50 | 3m | critical |
| MongoosePoolPerPodSaturated | mongoose_pool_in_use / mongoose_pool_max_size > 0.90 | 5m | warning |
The denominator of 1500 is the Atlas M10 tier limit. Update it on a tier upgrade and keep it in sync with the ADR block in apps/api/src/common/config/mongoose/mongooseConnectionService.ts.
Recording rules in the same group: mongoose:pool_in_use:cluster_total, mongoose:pool_in_use:fraction_of_tier.
Runbook for all pool alerts: issue #1852.
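For orientation, a rule group combining the recording rules with one of the alerts could look like the sketch below. The alert expression, for, and severity come from the table; the group name, the annotation text, and both recording-rule expressions are assumptions inferred from the rule names, so check the ConfigMap before relying on them.

```yaml
groups:
  - name: mongoose-pool        # hypothetical group name
    rules:
      # Assumed definition: total connections in use across all pods.
      - record: mongoose:pool_in_use:cluster_total
        expr: sum(mongoose_pool_in_use)

      # Assumed definition: share of the Atlas tier limit currently in use.
      # Rules in a group are evaluated in order, so the rule above is available here.
      - record: mongoose:pool_in_use:fraction_of_tier
        expr: mongoose:pool_in_use:cluster_total / 1500

      - alert: MongoosePoolHighUsage
        expr: sum(mongoose_pool_in_use) / 1500 > 0.70
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: MongoDB connection pool above 70% of the tier limit   # illustrative text
```

At the 0.70 threshold this alert corresponds to 1050 connections in use, and MongoosePoolCritical at 0.85 corresponds to 1275.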
kubernetes-platform-rules (platform + API availability)
| Alert | Expression (abbreviated) | for | Severity | Service |
|---|---|---|---|---|
| PodCrashLooping | increase(kube_pod_container_status_restarts_total[15m]) > 3 (excl. kube-system) | 10m | warning | kubernetes |
| PodOOMKilled | increase(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) > 0 | 0m | critical | kubernetes |
| NodeDiskPressure | kube_node_status_condition{condition="DiskPressure",status="true"} == 1 | 2m | critical | kubernetes |
| ApiDown | max by (job) (up{job=~"tapster-api-.*"}) == 0 | 2m | critical | tapster-api |
| HighHttpErrorRate | sum(rate(...{status=~"5.."}[5m])) / sum(rate(...[5m])) > 0.05 | 10m | warning | tapster-api |
| CertificateExpiringSoon | certmanager_certificate_expiration_timestamp_seconds - time() < 14d | 15m | warning | cert-manager |
| AlertmanagerNotificationsFailed | rate(alertmanager_notifications_failed_total[5m]) > 0 | 5m | critical | alertmanager |
CertificateExpiringSoon is a no-op until cert-manager is added as a scrape target. The rule costs nothing and goes live as soon as the metric appears.
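The table abbreviates the certificate threshold as 14d. Because both the metric and time() are in seconds, one way to spell the rule out is shown below. This is a sketch: the group name is hypothetical and the threshold arithmetic is simply 14 days written in seconds (14 * 24 * 3600 = 1209600).

```yaml
groups:
  - name: kubernetes-platform   # hypothetical group name
    rules:
      - alert: CertificateExpiringSoon
        # Seconds until expiry, compared against 14 days in seconds.
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
        for: 15m
        labels:
          severity: warning
          service: cert-manager
```

As long as no series named certmanager_certificate_expiration_timestamp_seconds exists, the expression returns an empty result and the alert never fires.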
watchdog-rules (Dead Man’s Switch)
| Alert | Expression | for | Severity |
|---|---|---|---|
| Watchdog | vector(1) | - | none |
Fires continuously as long as Prometheus is evaluating rules. Routing in alertmanager-config.yaml sends it to watchdog-webhook with repeat_interval: 1m.
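A dead man's switch rule of this kind is typically only a few lines. The sketch below assumes a hypothetical group name and illustrative annotation text; the expression and severity are taken from the table.

```yaml
groups:
  - name: watchdog              # hypothetical group name
    rules:
      - alert: Watchdog
        expr: vector(1)         # always returns one sample, so the alert is always firing
        labels:
          severity: none
        annotations:
          summary: Heartbeat alert; its absence means the alerting pipeline is broken   # illustrative text
```

The external check is what detects a failure: if the webhook stops receiving pings, either Prometheus, Alertmanager, or the path between them is down.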
Alertmanager routing
Source: alertmanager-config.yaml, ConfigMap alertmanager-config.
Tree (matched in this order):
- alertname = "Watchdog" → receiver watchdog-webhook (interval 1m)
- severity = "critical" → receiver roy-email (repeat 1h)
- Default → receiver roy-email (repeat 4h)
Receivers:
- roy-email: SMTP via ${SMTP_HOST}:${SMTP_PORT} (Flowmailer), to: roy@tapster.nl, send_resolved: true. Subject format: [FIRING:N] alertname or [RESOLVED] alertname.
- watchdog-webhook: POST to ${HEALTHCHECK_WATCHDOG_URL}, send_resolved: false.
Inhibit rules (two paths, see the explanation):
- App-specific: severity=critical + app_name=~".+" suppresses severity=warning + app_name=~".+" when alertname and app_name are equal.
- Cluster-wide: severity=critical + app_name="" suppresses severity=warning + app_name="" (correlation via label-set equality).
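Put together, the routing tree, receivers, and inhibit rules above could be written as the following template (before envsubst rendering). Treat it as a sketch under assumptions rather than the contents of alertmanager-config.yaml: the placement of the SMTP settings under global is assumed, the subject template and grouping options are omitted, and the equal list of the cluster-wide inhibit rule is left commented out because this reference does not name the labels it compares.

```yaml
global:
  smtp_from: ${SMTP_FROM}
  smtp_smarthost: ${SMTP_HOST}:${SMTP_PORT}
  smtp_auth_username: ${SMTP_USER}
  smtp_auth_password: ${SMTP_PASS}

route:
  receiver: roy-email           # default route
  repeat_interval: 4h
  routes:
    # Child routes are tried top to bottom; the first match wins.
    - matchers:
        - alertname = "Watchdog"
      receiver: watchdog-webhook
      repeat_interval: 1m
    - matchers:
        - severity = "critical"
      receiver: roy-email
      repeat_interval: 1h

receivers:
  - name: roy-email
    email_configs:
      - to: roy@tapster.nl
        send_resolved: true
  - name: watchdog-webhook
    webhook_configs:
      - url: ${HEALTHCHECK_WATCHDOG_URL}
        send_resolved: false

inhibit_rules:
  # App-specific: a critical alert mutes the warning with the same alertname and app_name.
  - source_matchers:
      - severity = "critical"
      - app_name =~ ".+"
    target_matchers:
      - severity = "warning"
      - app_name =~ ".+"
    equal: [alertname, app_name]
  # Cluster-wide: only alerts without an app_name label.
  - source_matchers:
      - severity = "critical"
      - app_name = ""
    target_matchers:
      - severity = "warning"
      - app_name = ""
    # equal: [...]   # labels not listed in this reference; see alertmanager-config.yaml
```

Because the Watchdog route comes first, the heartbeat never reaches the email receiver, even though its severity would otherwise fall through to the default route.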
Grafana
Provisioned datasources (grafana-datasources ConfigMap):
| Name | UID | URL | isDefault |
|---|---|---|---|
| prometheus | prometheus | http://prometheus:9090 | true |
| alertmanager | alertmanager | http://alertmanager:9093 | false |
The Alertmanager datasource uses implementation: prometheus (vanilla Alertmanager API). handleGrafanaManagedAlerts is set to false so that Grafana does not manage its own alerts through it and only reads the Alertmanager state.
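In Grafana's datasource provisioning format, the two entries would look roughly as follows. The names, UIDs, URLs, and jsonData flags come from the table and paragraph above; access: proxy is an assumption, and the real ConfigMap may carry additional fields.

```yaml
apiVersion: 1
datasources:
  - name: prometheus
    uid: prometheus
    type: prometheus
    access: proxy               # assumed: Grafana's backend queries the datasource
    url: http://prometheus:9090
    isDefault: true

  - name: alertmanager
    uid: alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093
    isDefault: false
    jsonData:
      implementation: prometheus          # vanilla Alertmanager API
      handleGrafanaManagedAlerts: false   # read Alertmanager state only
```

Fixed UIDs matter here because provisioned dashboards reference datasources by UID rather than by name.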
Dashboard ConfigMaps (mounted under /var/lib/grafana/dashboards/<folder>):
- tapster-dashboards-root (incl. tapster-main-dashboard)
- tapster-dashboards-api
- tapster-dashboards-jobs
- tapster-dashboards-infrastructure
- tapster-dashboards-business
- grafana-dashboards-configmap (alertmanager-overview, etc.)
envsubst whitelist
In alertmanager-deployment.yaml, init container render-config:
envsubst '$SMTP_FROM $SMTP_HOST $SMTP_PORT $SMTP_USER $SMTP_PASS $HEALTHCHECK_WATCHDOG_URL' \
< /etc/alertmanager-template/alertmanager.yml \
> /etc/alertmanager/alertmanager.yml
When you add a new placeholder to alertmanager-config.yaml, also add the variable to this whitelist. See the explanation.
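For context, the command above could be wired into the pod spec along these lines. Only the container name, the command, the two paths, and the Secret names come from this page; the image, the volume names, and the use of envFrom are hypothetical and serve only to show where the variables would come from.

```yaml
# Fragment of a pod spec, not a complete manifest.
initContainers:
  - name: render-config
    image: example.invalid/envsubst:pinned   # hypothetical image that ships envsubst
    envFrom:
      - secretRef:
          name: alertmanager-smtp       # SMTP_FROM, SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS
      - secretRef:
          name: alertmanager-webhook    # HEALTHCHECK_WATCHDOG_URL
    command: ["/bin/sh", "-c"]
    args:
      - |
        envsubst '$SMTP_FROM $SMTP_HOST $SMTP_PORT $SMTP_USER $SMTP_PASS $HEALTHCHECK_WATCHDOG_URL' \
          < /etc/alertmanager-template/alertmanager.yml \
          > /etc/alertmanager/alertmanager.yml
    volumeMounts:
      - name: config-template     # hypothetical: the alertmanager-config ConfigMap
        mountPath: /etc/alertmanager-template
      - name: config-rendered     # hypothetical: an emptyDir shared with the main container
        mountPath: /etc/alertmanager
```

When envsubst is given an explicit variable list, it substitutes only those names and leaves every other $-reference in the template untouched. A placeholder that is missing from the list therefore ends up in the rendered file as literal text.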
Ingress and middleware chain
Source: ingress.yaml. Each component has an IngressRoute on websecure (TLS-only) plus a redirect route on web (HTTP → HTTPS).
Middleware chain per route:
- Grafana: security-headers
- Prometheus: security-headers, prometheus-auth
- Alertmanager: security-headers, prometheus-auth
prometheus-auth is a Traefik Middleware of type basicAuth that reads the Secret prometheus-basic-auth (key users, htpasswd format).
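As a sketch of how the pieces connect, the Middleware and the Prometheus route on websecure might be declared as below. The apiVersion assumes a recent Traefik release (older installations use traefik.containo.us/v1alpha1), the IngressRoute name is hypothetical, and the HTTP redirect route and the security-headers definition are omitted.

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: prometheus-auth
  namespace: monitoring
spec:
  basicAuth:
    secret: prometheus-basic-auth   # Traefik reads the htpasswd entries from the "users" key
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: prometheus                  # hypothetical name
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`prometheus.tapster.nl`)
      kind: Rule
      middlewares:                  # applied in the order listed
        - name: security-headers
        - name: prometheus-auth
      services:
        - name: prometheus
          port: 9090
  tls:
    secretName: monitoring-tls
```

The Grafana route would carry only security-headers, since Grafana handles its own login.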