Vijfpas operations

This document defines the operating model: patching, backup and restore, DR, monitoring processes, and runbook standards.

Detailed per-component and per-product HA modes are documented in Vijfpas HA Profiles (ha.md). The confidentiality handling baseline is documented in Vijfpas Confidentiality Model (confidentiality-model.md). The backup and restore design is documented in Vijfpas Backup and Restore Architecture (backup-restore.md).

1. Operating model

1.1 Cadences

  • Patch cycle (planned): quarterly patch window for all platform and product components
  • CVE monitoring: daily feed ingestion and CVSS scoring for newly published vulnerabilities
  • Urgent patch trigger: vulnerabilities with CVSS >= 9.0 must be mitigated or patched in an emergency window
  • Security hotfix SLA (urgent, CVSS >= 9.0, ADR-0010):
      • Internet-facing/DMZ services: mitigate or patch in <24 hours
      • Internal-only services: mitigate or patch in <=72 hours
  • Backup verification: daily job checks + monthly restore test
  • Capacity review: monthly
  • Access review (privileged): monthly

1.2 Access model

Reference: Vijfpas Security and SoD and Vijfpas Confidentiality Model.

  • No shared admin accounts.
  • Break-glass accounts per tier with audited use.
  • Mandatory peer review for privilege-tier 0/1 changes.
  • Environment-scoped credentials for CI/CD deploy paths (dev, acc, prd).

1.3 Confidentiality-driven operations

  • CONF-2 and above require an explicit data owner, regular access review, and labeled storage/backup paths.
  • CONF-3 and above use masked or synthetic data by default in dev; use of production-like data in acc requires explicit approval and compensating controls.
  • CONF-4 values must remain in approved secret systems only; suspected disclosure triggers immediate rotation and incident handling.

1.4 Multi-tenant operational baseline

Minimum operations controls for future Product Engineering Platform tenant enablement:

  • Maintain a tenant registry (tenant_id, owner contacts, active environments, isolation profile).
  • Enforce tenant-scoped credentials, namespaces, quotas, and policy objects at onboarding.
  • Record tenant lifecycle events (create/update/suspend/delete) with auditable change references.
  • Keep tenant offboarding procedures that remove runtime access and rotate tenant-scoped secrets.
  • Keep showback-ready usage labels (tenant_id) on logs/metrics/traces, even before billing is enabled.
  • Operate a cross-team tenant enablement squad for onboarding policy and tenant support coordination.
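
The registry and lifecycle bullets above could be modelled roughly as follows. This is a minimal sketch, not the platform's actual schema: the class name `TenantRecord`, the method `record_event`, the free-form `isolation_profile` string, and the restriction of tenant environments to dev/acc/prd are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: tenant environments are limited to the CI/CD environments named in 1.2.
ENVIRONMENTS = {"dev", "acc", "prd"}
# Lifecycle events listed in the baseline above.
LIFECYCLE_EVENTS = {"create", "update", "suspend", "delete"}


@dataclass
class TenantRecord:
    """One tenant registry entry (hypothetical shape)."""
    tenant_id: str
    owner_contacts: List[str]
    active_environments: List[str]
    isolation_profile: str  # allowed values are not defined in this document
    lifecycle_events: List[dict] = field(default_factory=list)

    def __post_init__(self) -> None:
        unknown = set(self.active_environments) - ENVIRONMENTS
        if unknown:
            raise ValueError(f"unknown environments: {sorted(unknown)}")
        if not self.owner_contacts:
            raise ValueError("at least one owner contact is required")

    def record_event(self, event: str, change_ref: str) -> None:
        """Append a lifecycle event together with its auditable change reference."""
        if event not in LIFECYCLE_EVENTS:
            raise ValueError(f"unknown lifecycle event: {event}")
        if not change_ref:
            raise ValueError("a change reference is required")
        self.lifecycle_events.append({"event": event, "change_ref": change_ref})
```

Rejecting an event without a change reference at write time keeps the "auditable change references" requirement from depending on later review.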

1.5 Patch decision flow (Mermaid)

flowchart TD
  CVE[Daily CVE ingestion] --> SCORE[CVSS scoring]
  SCORE --> HIGH{CVSS >= 9.0?}
  HIGH -- yes --> EMERG[Emergency patch window]
  EMERG --> EXP{Internet-facing/DMZ?}
  EXP -- yes --> SLA24[Mitigate or patch in <24h]
  EXP -- no --> SLA72[Mitigate or patch in <=72h]
  HIGH -- no --> QPATCH[Quarterly patch backlog]
  QPATCH --> WINDOW[Quarterly maintenance window]
  SLA24 --> VERIFY[Post-patch verification]
  SLA72 --> VERIFY
  WINDOW --> VERIFY
  VERIFY --> CLOSE[Change closure + evidence]
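
The decision flow above can be expressed as a small function. This is a sketch under stated assumptions: the function name `patch_deadline` is hypothetical, and the document does not say when the SLA clock starts, so the `clock_start` parameter is left to the caller.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

URGENT_CVSS = 9.0  # urgent patch trigger from section 1.1


def patch_deadline(cvss: float, internet_facing: bool,
                   clock_start: datetime) -> Optional[datetime]:
    """Return the mitigate-or-patch bound, or None for the quarterly backlog.

    For internet-facing/DMZ services the work must finish before the returned
    time (<24h); for internal-only services it must finish by it (<=72h).
    """
    if cvss < URGENT_CVSS:
        return None  # handled in the quarterly maintenance window
    hours = 24 if internet_facing else 72
    return clock_start + timedelta(hours=hours)
```

For example, `patch_deadline(9.8, True, datetime.now(timezone.utc))` yields a bound 24 hours ahead, while a CVSS 7.5 finding returns `None` and goes to the quarterly patch backlog.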

2. Backup and restore

2.1 What must be backed up

  • Selected data-bearing Proxmox VMs and backup repo VMs (via PBS); rebuild-only, IaC-managed VMs may be excluded by policy
  • Kubernetes manifests via GitOps + control-plane state where applicable
  • Infrastructure databases via native backup methods (plus WAL/binlog where relevant)
  • Application databases via native backup methods (plus WAL/binlog where relevant)
  • GitLab/Gitaly/Nexus data and configuration
  • Keycloak realm/config plus its backing PostgreSQL domain in each environment where Keycloak exists
  • JupyterHub control-plane metadata DB and JupyterLab workspace volumes
  • Secrets vault state and recovery material (Shamir 3-of-5 custody policy, ADR-0009)

2.2 Retention baseline

  • PBS snapshots: daily 35 days, weekly 12 weeks, monthly 12 months
  • Database logical backups: daily 14 days minimum
  • Database point-in-time logs (WAL/binlog): minimum 7 days
  • GitLab/Nexus backups: daily 30 days
  • Operational logs: hot 30 days, cold/archive 180 days
  • Retention, restore scope, and export handling for CONF-3 and CONF-4 data must preserve confidentiality class restrictions.
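
To illustrate how the PBS baseline (daily 35 days, weekly 12 weeks, monthly 12 months) selects snapshots, here is a minimal sketch. It is not how PBS itself prunes, and it rests on assumptions: "weekly" is read as the newest snapshot in each ISO week, "monthly" as the newest in each calendar month, 12 months is approximated as 365 days, and the function name `snapshots_to_keep` is hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, Set

DAILY_DAYS = 35        # daily snapshots: 35 days
WEEKLY_DAYS = 12 * 7   # weekly snapshots: 12 weeks
MONTHLY_DAYS = 365     # monthly snapshots: 12 months (approximation)


def snapshots_to_keep(snapshot_dates: Iterable[date], today: date) -> Set[date]:
    """Return the snapshot dates that the retention baseline would keep."""
    dates = sorted(set(snapshot_dates))
    newest_in_week = {}
    newest_in_month = {}
    for d in dates:  # ascending order, so later dates overwrite earlier ones
        newest_in_week[d.isocalendar()[:2]] = d
        newest_in_month[(d.year, d.month)] = d
    weekly = set(newest_in_week.values())
    monthly = set(newest_in_month.values())

    keep = set()
    for d in dates:
        age = (today - d).days
        if age <= DAILY_DAYS:
            keep.add(d)
        elif d in weekly and age <= WEEKLY_DAYS:
            keep.add(d)
        elif d in monthly and age <= MONTHLY_DAYS:
            keep.add(d)
    return keep
```

Anything not in the returned set is a prune candidate; CONF-3 and CONF-4 handling restrictions still apply to whatever is retained or exported.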

2.3 Restore testing

  • Monthly restore drill for at least one privilege-tier 1/2 service.
  • Quarterly "lost node" simulation for Proxmox, Ceph, or Kubernetes.
  • Annual full DR exercise across at least one critical end-to-end product path.
  • Annual OpenBao recovery drill must validate the 3-of-5 custody process end-to-end.

2.4 Baseline recovery targets

  • Privilege-tier 0: target RPO <= 4h, RTO <= 24h
  • Privilege-tier 1: target RPO <= 4h, RTO <= 24h
  • Privilege-tier 2: target RPO <= 8h, RTO <= 24h
  • Privilege-tier 3: target RPO <= 24h, RTO <= 72h unless a product-specific SLO requires tighter targets
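
A restore drill result can be checked against these targets mechanically. The sketch below copies the baseline values; the names `RECOVERY_TARGETS` and `meets_targets` are hypothetical, and tighter product-specific SLOs are not modelled.

```python
from datetime import timedelta

# (RPO, RTO) targets per privilege tier, copied from the baseline above.
RECOVERY_TARGETS = {
    0: (timedelta(hours=4), timedelta(hours=24)),
    1: (timedelta(hours=4), timedelta(hours=24)),
    2: (timedelta(hours=8), timedelta(hours=24)),
    3: (timedelta(hours=24), timedelta(hours=72)),
}


def meets_targets(tier: int, achieved_rpo: timedelta, achieved_rto: timedelta) -> bool:
    """True when the measured data loss and recovery time are within the tier targets."""
    rpo_target, rto_target = RECOVERY_TARGETS[tier]
    return achieved_rpo <= rpo_target and achieved_rto <= rto_target
```

A tier 2 drill that lost 9 hours of data would fail the check even if recovery itself took only a few hours.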

2.5 Backup/restore cycle (Mermaid)

flowchart LR
  JOBS[Scheduled backups] --> CHECK[Daily backup checks]
  CHECK --> DRILL[Monthly restore drill]
  DRILL --> SIM[Quarterly lost-node simulation]
  SIM --> FULL[Annual full DR exercise]
  FULL --> IMPROVE[Runbook and policy improvements]
  IMPROVE --> JOBS

2.6 Tenant backup/restore scope baseline

  • Backups must support tenant-scoped restore where the data model allows it.
  • Restore procedures must declare scope explicitly: single tenant, full environment, or full platform domain.
  • Restore execution is provider-operated; tenants can request restore and track status/evidence in tenant scope.
  • Tenant-scoped restore tests should be included in quarterly restore validation.
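
One way to make the "declare scope explicitly" rule hard to skip is to encode the three scopes as an enumeration. This is an illustrative sketch only; `RestoreScope`, `RestoreRequest`, and the `requested_by` field are hypothetical names, not part of any existing tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RestoreScope(Enum):
    SINGLE_TENANT = "single tenant"
    FULL_ENVIRONMENT = "full environment"
    FULL_PLATFORM_DOMAIN = "full platform domain"


@dataclass(frozen=True)
class RestoreRequest:
    """A restore request with an explicit scope (hypothetical shape)."""
    scope: RestoreScope
    requested_by: str
    tenant_id: Optional[str] = None

    def __post_init__(self) -> None:
        # A single-tenant restore is meaningless without the tenant it targets.
        if self.scope is RestoreScope.SINGLE_TENANT and not self.tenant_id:
            raise ValueError("single-tenant restore requires a tenant_id")
```

Because the scope is a required field, a request cannot be created without stating whether it covers one tenant, an environment, or a platform domain.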

3. Incident response

3.1 Severity model

  • S1: user-visible outage or security incident affecting production-critical services
  • S2: major degradation, partial outage, or high-risk security issue
  • S3: localized issue with workaround available

3.2 Response baseline

  • Triage starts immediately for S1/S2.
  • Preserve evidence (logs, snapshots, timelines) for security-significant events.
  • Use an incident channel with timestamped decisions/actions.
  • Complete post-incident review with corrective actions and owner/date.
  • Suspected CONF-4 disclosure is treated as S1 by default; CONF-3 disclosure is minimum S2 unless clearly bounded and low impact.
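
The disclosure rule in the last bullet can be read as a severity floor applied at triage. A minimal sketch, assuming confidentiality classes are passed as integers (3 for CONF-3, 4 for CONF-4) and using the hypothetical name `disclosure_severity_floor`; where the document defines no default, the function returns `None` and leaves the decision to normal triage.

```python
from typing import Optional


def disclosure_severity_floor(conf_level: int,
                              bounded_low_impact: bool = False) -> Optional[str]:
    """Return the minimum severity for a suspected disclosure, or None if undefined."""
    if conf_level >= 4:
        return "S1"  # suspected CONF-4 disclosure is S1 by default
    if conf_level == 3:
        # Minimum S2 unless the disclosure is clearly bounded and low impact.
        return None if bounded_low_impact else "S2"
    return None  # no default is stated for lower classes
```

The result is a floor, not a final classification: a CONF-3 disclosure can still be raised to S1 during triage.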

3.3 Incident lifecycle (Mermaid)

stateDiagram-v2
  [*] --> Detected
  Detected --> Triage
  Triage --> Mitigating
  Mitigating --> Recovering
  Recovering --> Monitoring
  Monitoring --> PIR
  PIR --> Closed
  Closed --> [*]
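
Since the lifecycle above is strictly linear, incident tooling could enforce it with a simple ordered list. This is a sketch; the names `LIFECYCLE`, `next_state`, and `is_valid_transition` are hypothetical.

```python
# States in the order shown in the diagram above.
LIFECYCLE = ["Detected", "Triage", "Mitigating", "Recovering",
             "Monitoring", "PIR", "Closed"]


def next_state(current: str) -> str:
    """Return the state that follows `current`; raise ValueError if there is none."""
    idx = LIFECYCLE.index(current)  # raises ValueError for unknown states
    if idx == len(LIFECYCLE) - 1:
        raise ValueError("Closed is terminal")
    return LIFECYCLE[idx + 1]


def is_valid_transition(src: str, dst: str) -> bool:
    """True only for the single forward step the diagram allows."""
    try:
        return next_state(src) == dst
    except ValueError:
        return False
```

With this check an incident cannot move from Mitigating straight to Closed without passing through Recovering, Monitoring, and the post-incident review (PIR).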

3.4 Tenant incident scoping

  • Incidents must declare tenant impact scope (single-tenant, multi-tenant, or platform-wide).
  • Communications and timeline evidence must include affected tenant_id values.
  • Cross-tenant data exposure events are treated as security incidents by default.

4. Runbooks

Create a runbook per critical service/product under vijfpas/docs/runbooks/.

4.1 Runbook ID convention

  • Platform components: RB-SVC-<component>
  • Products: RB-PROD-<product>
  • External integrations: RB-EXT-<integration>
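
A lint check for this convention might look like the sketch below. The exact character set of the name part is an assumption inferred from the IDs used in this document (uppercase letters, digits, and hyphens, for example RB-SVC-KAFKA-INFRA); the function name `parse_runbook_id` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Prefix to kind, following the convention above.
KINDS = {"SVC": "component", "PROD": "product", "EXT": "integration"}

# Assumption: names are uppercase alphanumeric segments joined by hyphens.
RUNBOOK_ID = re.compile(r"RB-(SVC|PROD|EXT)-([A-Z0-9]+(?:-[A-Z0-9]+)*)")


def parse_runbook_id(runbook_id: str) -> Optional[Tuple[str, str]]:
    """Return (kind, name) for a well-formed runbook ID, otherwise None."""
    match = RUNBOOK_ID.fullmatch(runbook_id)
    if match is None:
        return None
    return KINDS[match.group(1)], match.group(2)
```

For instance, `parse_runbook_id("RB-SVC-KAFKA-INFRA")` gives `("component", "KAFKA-INFRA")`, while an ID with an unknown prefix such as `RB-OPS-FOO` is rejected.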

4.2 Component runbook coverage

| Component | Runbook ID | Path | Status |
| --- | --- | --- | --- |
| Proxmox | RB-SVC-PROXMOX | runbooks/proxmox.md | exists (template content still TBD) |
| Ceph | RB-SVC-CEPH | runbooks/ceph.md | exists (template content still TBD) |
| PBS | RB-SVC-PBS | runbooks/pbs.md | exists (implemented baseline) |
| UniFi edge/network | RB-SVC-UNIFI | runbooks/unifi-edge.md | exists (implemented baseline) |
| Chrony | RB-SVC-CHRONY | runbooks/chrony.md | exists (implemented baseline) |
| Kubernetes runtime planes + Traefik | RB-SVC-KUBERNETES | runbooks/kubernetes.md | exists (template content still TBD) |
| Rancher | RB-SVC-RANCHER | runbooks/rancher.md | exists (live pfm management-cluster baseline with active pfm and prd downstream imports) |
| GitLab | RB-SVC-GITLAB | runbooks/gitlab.md | exists (template content still TBD) |
| Gitaly | RB-SVC-GITALY | runbooks/gitaly.md | exists (template content still TBD) |
| Nexus | RB-SVC-NEXUS | runbooks/nexus.md | exists (template content still TBD) |
| Taiga | RB-SVC-TAIGA | runbooks/taiga.md | exists (documented rollout baseline; not live yet) |
| Keycloak | RB-SVC-KEYCLOAK | runbooks/keycloak.md | exists (template content still TBD) |
| Vaultwarden | RB-SVC-VAULTWARDEN | runbooks/vaultwarden.md | exists (template content still TBD) |
| OpenBao | RB-SVC-OPENBAO | runbooks/openbao.md | exists |
| Certificate management | RB-SVC-CERTMGMT | runbooks/certificate-management.md | exists (template content still TBD) |
| Vulnerability/malware controls | RB-SVC-VULN-MALWARE | runbooks/vulnerability-malware.md | exists (template content still TBD) |
| PostgreSQL | RB-SVC-POSTGRESQL | runbooks/postgresql.md | exists (template content still TBD) |
| MariaDB | RB-SVC-MARIADB | runbooks/mariadb.md | exists (template content still TBD) |
| Neo4j | RB-SVC-NEO4J | runbooks/neo4j.md | exists (template content still TBD) |
| Redis | RB-SVC-REDIS | runbooks/redis.md | exists (template content still TBD) |
| OpenSearch (app/search) | RB-SVC-OPENSEARCH-APP | runbooks/opensearch-app.md | exists (template content still TBD) |
| CephRGW | RB-SVC-CEPHRGW | runbooks/cephrgw.md | current production baseline and manual keyring/bootstrap procedure documented |
| RabbitMQ | RB-SVC-RABBITMQ | runbooks/rabbitmq.md | exists (template content still TBD) |
| Kafka (infra) | RB-SVC-KAFKA-INFRA | runbooks/kafka-infra.md | exists (template content still TBD) |
| Kafka (app) | RB-SVC-KAFKA-APP | runbooks/kafka-app.md | exists (template content still TBD) |
| Debezium/Kafka Connect | RB-SVC-DEBEZIUM-CONNECT | runbooks/debezium-kafka-connect.md | exists (template content still TBD) |
| Kafka UI | RB-SVC-KAFKA-UI | runbooks/kafka-ui.md | exists (template content still TBD) |
| Airflow | RB-SVC-AIRFLOW | runbooks/airflow.md | exists (template content still TBD) |
| Spark | RB-SVC-SPARK | runbooks/spark.md | exists (template content still TBD) |
| JupyterHub/JupyterLab | RB-SVC-JUPYTER | runbooks/jupyterhub-jupyterlab.md | exists (template content still TBD) |
| Iceberg | RB-SVC-ICEBERG | runbooks/iceberg.md | exists (template content still TBD) |
| dbt | RB-SVC-DBT | runbooks/dbt.md | exists (template content still TBD) |
| OpenMetadata | RB-SVC-OPENMETADATA | runbooks/openmetadata.md | exists (template content still TBD) |
| Hive Metastore | RB-SVC-HIVE-METASTORE | runbooks/hive-metastore.md | exists (template content still TBD) |
| Trino | RB-SVC-TRINO | runbooks/trino.md | exists (template content still TBD) |
| Superset | RB-SVC-SUPERSET | runbooks/superset.md | exists (template content still TBD) |
| Prometheus/Alertmanager | RB-SVC-PROM-AM | runbooks/prometheus-alertmanager.md | exists (template content still TBD) |
| Grafana | RB-SVC-GRAFANA | runbooks/grafana.md | exists (template content still TBD) |
| OpenSearch logging | RB-SVC-OPENSEARCH-LOG | runbooks/opensearch-logging.md | exists (template content still TBD) |

4.3 Product and external runbook coverage

| Item | Runbook ID | Path | Status |
| --- | --- | --- | --- |
| ref | RB-PROD-REF | runbooks/ref.md | exists (template content still TBD) |
| nibbler | RB-PROD-NIBBLER | runbooks/nibbler.md | exists (template content still TBD) |
| genea | RB-PROD-GENEA | runbooks/genea.md | exists (template content still TBD) |
| shop | RB-PROD-SHOP | runbooks/shop.md | exists (template content still TBD) |
| notimon | RB-PROD-NOTIMON | runbooks/notimon.md | exists (template content still TBD) |
| sec48 | RB-PROD-SEC48 | runbooks/sec48.md | exists (template content still TBD) |
| homeassistant | RB-PROD-HOMEASSISTANT | runbooks/homeassistant.md | exists (template content still TBD) |
| mermaid live | RB-PROD-MERMAIDLIVE | runbooks/mermaid-live.md | exists (template content still TBD) |
| draw.io | RB-PROD-DRAWIO | runbooks/drawio.md | exists (template content still TBD) |
| nextcloud | RB-PROD-NEXTCLOUD | runbooks/nextcloud.md | exists; target architecture and rollout baseline documented, incident procedure still incomplete |
| Payment provider integration | RB-EXT-PAYMENT | runbooks/ext-payment-provider.md | exists (template content still TBD) |
| Mail provider integration | RB-EXT-MAIL | runbooks/ext-mail-provider.md | exists (template content still TBD) |
| Push provider integration | RB-EXT-PUSH | runbooks/ext-push-provider.md | exists (template content still TBD) |
| Docs platform integration | RB-EXT-DOCS | runbooks/ext-docs-platform.md | exists; live public docs portal baseline documented |

4.4 Runbook template fields

  • Symptoms
  • Immediate actions
  • Verification
  • Root cause notes
  • Long-term fix
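
Whether a runbook still lacks template fields can be checked automatically. The sketch below assumes runbooks are Markdown files in which each field appears as a heading (for example `## Symptoms`); that layout and the function name `missing_fields` are assumptions, not something this document specifies.

```python
import re
from typing import List

# Template fields from the list above.
REQUIRED_FIELDS = ["Symptoms", "Immediate actions", "Verification",
                   "Root cause notes", "Long-term fix"]

# Assumption: fields are written as Markdown headings of any level.
HEADING = re.compile(r"^#+[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_fields(runbook_text: str) -> List[str]:
    """Return the required template fields that have no matching heading."""
    headings = {m.group(1).lower() for m in HEADING.finditer(runbook_text)}
    return [f for f in REQUIRED_FIELDS if f.lower() not in headings]
```

Running this over the files under the runbooks directory would give a quick count of stubs that still need content, although a present heading does not prove the section beneath it is complete.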

5. Operations backlog

  1. Confirm emergency patch workflow ownership per team.
  2. Operationalize ADR-0010 evidence and exception workflow in incident/change tooling.
  3. Replace runbook stubs marked "template content still TBD" with executable procedures and validation checks.
  4. Validate quarterly patch cadence against maintenance windows.
  5. Add executable tenant onboarding/offboarding and tenant-scoped restore procedures.