Vijfpas operations
This document defines the operating model: patching, backup and restore, DR, monitoring processes, and runbook standards.
Detailed per-component and per-product HA modes are documented in Vijfpas HA Profiles (ha.md).
Confidentiality handling baseline is documented in Vijfpas Confidentiality Model (confidentiality-model.md).
Backup and restore design is documented in Vijfpas Backup and Restore Architecture (backup-restore.md).
1. Operating model
1.1 Cadences
- Patch cycle (planned): quarterly patch window for all platform and product components
- CVE monitoring: daily feed ingestion and CVSS scoring for newly published vulnerabilities
- Urgent patch trigger: vulnerabilities with CVSS >= 9.0 must be mitigated or patched in an emergency window
- Security hotfix SLA (urgent, CVSS >= 9.0, ADR-0010):
  - Internet-facing/dmz services: mitigate or patch in <24 hours
  - Internal-only services: mitigate or patch in <=72 hours
- Backup verification: daily job checks + monthly restore test
- Capacity review: monthly
- Access review (privileged): monthly
1.2 Access model
Reference: Vijfpas Security and SoD and Vijfpas Confidentiality Model.
- No shared admin accounts.
- Break-glass accounts per tier with audited use.
- Mandatory peer review for privilege-tier 0/1 changes.
- Environment-scoped credentials for CI/CD deploy paths (`dev`, `acc`, `prd`).
1.3 Confidentiality-driven operations
- `CONF-2` and above require an explicit data owner, regular access review, and labeled storage/backup paths.
- `CONF-3` and above use masked or synthetic data by default in `dev`; `acc` use of production-like data requires explicit approval and compensating controls.
- `CONF-4` values must remain in approved secret systems only; suspected disclosure triggers immediate rotation and incident handling.
1.4 Multi-tenant operational baseline
Minimum operations controls for future Product Engineering Platform tenant enablement:
- Maintain a tenant registry (`tenant_id`, owner contacts, active environments, isolation profile).
- Enforce tenant-scoped credentials, namespaces, quotas, and policy objects at onboarding.
- Record tenant lifecycle events (create/update/suspend/delete) with auditable change references.
- Keep tenant offboarding procedures that remove runtime access and rotate tenant-scoped secrets.
- Keep showback-ready usage labels (`tenant_id`) on logs/metrics/traces, even before billing is enabled.
- Operate a cross-team tenant enablement squad for onboarding policy and tenant support coordination.
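As an illustration of the registry and labeling controls above, a tenant registry entry could be modeled as follows. This is a minimal sketch, not the platform's actual schema: the class name `TenantRecord`, the helper `usage_labels`, and the example isolation profile value are hypothetical; only the four registry fields and the `dev`/`acc`/`prd` environment names come from this document.

```python
from dataclasses import dataclass

# Environment names taken from section 1.2.
KNOWN_ENVIRONMENTS = {"dev", "acc", "prd"}


@dataclass
class TenantRecord:
    """One tenant registry entry (hypothetical shape)."""
    tenant_id: str
    owner_contacts: list[str]
    active_environments: set[str]
    isolation_profile: str  # allowed values are not defined in this document

    def __post_init__(self) -> None:
        # Reject entries that cannot be routed to an owner or environment.
        if not self.tenant_id:
            raise ValueError("tenant_id must not be empty")
        if not self.owner_contacts:
            raise ValueError("at least one owner contact is required")
        unknown = set(self.active_environments) - KNOWN_ENVIRONMENTS
        if unknown:
            raise ValueError(f"unknown environments: {sorted(unknown)}")


def usage_labels(record: TenantRecord) -> dict[str, str]:
    """Labels to attach to logs, metrics, and traces for showback."""
    return {"tenant_id": record.tenant_id}
```

Validating at construction time keeps incomplete entries out of the registry, so onboarding tooling can assume every record has an owner and a known environment set.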
1.5 Patch decision flow (Mermaid)
flowchart TD
CVE[Daily CVE ingestion] --> SCORE[CVSS scoring]
SCORE --> HIGH{CVSS >= 9.0?}
HIGH -- yes --> EMERG[Emergency patch window]
EMERG --> EXP{Internet-facing/dmz?}
EXP -- yes --> SLA24[Mitigate or patch in <24h]
EXP -- no --> SLA72[Mitigate or patch in <=72h]
HIGH -- no --> QPATCH[Quarterly patch backlog]
QPATCH --> WINDOW[Quarterly maintenance window]
SLA24 --> VERIFY[Post-patch verification]
SLA72 --> VERIFY
WINDOW --> VERIFY
VERIFY --> CLOSE[Change closure + evidence]
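The same decision flow can be expressed as a small function, for example in a CVE triage script. This is a sketch under stated assumptions: `PatchDecision` and `decide_patch_track` are hypothetical names, and only the 9.0 threshold and the 24h/72h deadlines come from this document.

```python
from dataclasses import dataclass
from typing import Optional

URGENT_CVSS_THRESHOLD = 9.0  # urgent patch trigger from section 1.1


@dataclass(frozen=True)
class PatchDecision:
    track: str                 # "emergency" or "quarterly"
    sla_hours: Optional[int]   # None on the quarterly track
    sla_inclusive: bool        # True for "<=" deadlines, False for "<"


def decide_patch_track(cvss: float, internet_facing: bool) -> PatchDecision:
    """Map a CVSS score and exposure to the patch track in the flowchart."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss}")
    if cvss >= URGENT_CVSS_THRESHOLD:
        if internet_facing:
            # Internet-facing/dmz: mitigate or patch in <24h
            return PatchDecision("emergency", 24, sla_inclusive=False)
        # Internal-only: mitigate or patch in <=72h
        return PatchDecision("emergency", 72, sla_inclusive=True)
    # Everything else goes to the quarterly patch backlog.
    return PatchDecision("quarterly", None, sla_inclusive=True)
```

For example, `decide_patch_track(9.8, internet_facing=True)` selects the emergency track with a 24-hour deadline, while a score of 7.5 lands in the quarterly backlog regardless of exposure.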
2. Backup and restore
2.1 What must be backed up
- Selected data-bearing Proxmox VMs and backup repo VMs (via PBS); rebuild-only IaC-managed VMs may be excluded by policy
- Kubernetes manifests via GitOps + control-plane state where applicable
- Infrastructure databases via native backup methods (plus WAL/binlog where relevant)
- Application databases via native backup methods (plus WAL/binlog where relevant)
- GitLab/Gitaly/Nexus data and configuration
- `keycloak` realm/config plus its backing PostgreSQL domain in each environment where Keycloak exists
- JupyterHub control-plane metadata DB and JupyterLab workspace volumes
- Secrets vault state and recovery material (Shamir `3-of-5` custody policy, ADR-0009)
2.2 Retention baseline
- PBS snapshots: daily 35 days, weekly 12 weeks, monthly 12 months
- Database logical backups: daily 14 days minimum
- Database point-in-time logs (WAL/binlog): minimum 7 days
- GitLab/Nexus backups: daily 30 days
- Operational logs: hot 30 days, cold/archive 180 days
- Retention, restore scope, and export handling for `CONF-3` and `CONF-4` data must preserve confidentiality class restrictions.
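The retention figures above can be captured as data so that pruning jobs and audits share one source. The sketch below is illustrative only: the dictionary keys and the `may_prune` helper are hypothetical, and reading the PBS figures as keep counts for the PBS prune options `keep-daily`, `keep-weekly`, and `keep-monthly` is an assumption about how the baseline is implemented.

```python
from datetime import date, timedelta

# Minimum retention in days, copied from the baseline above.
# The key names are hypothetical.
MIN_RETENTION_DAYS = {
    "db_logical": 14,    # database logical backups
    "db_pitr_log": 7,    # WAL/binlog point-in-time logs
    "gitlab_nexus": 30,  # GitLab/Nexus backups
}

# Assumed mapping of the PBS snapshot baseline onto PBS prune options.
PBS_KEEP = {"keep-daily": 35, "keep-weekly": 12, "keep-monthly": 12}


def may_prune(kind: str, created: date, today: date) -> bool:
    """Return True only once a backup is older than its minimum retention."""
    age = today - created
    return age > timedelta(days=MIN_RETENTION_DAYS[kind])
```

A backup that is exactly at its retention limit is still kept; it becomes eligible for pruning the day after.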
2.3 Restore testing
- Monthly restore drill for at least one privilege-tier 1/2 service.
- Quarterly "lost node" simulation for Proxmox, Ceph, or Kubernetes.
- Annual full DR exercise across at least one critical end-to-end product path.
- Annual OpenBao recovery drill must validate the `3-of-5` custody process end-to-end.
2.4 Baseline recovery targets
- Privilege-tier 0: target RPO <= 4h, RTO <= 24h
- Privilege-tier 1: target RPO <= 4h, RTO <= 24h
- Privilege-tier 2: target RPO <= 8h, RTO <= 24h
- Privilege-tier 3: target RPO <= 24h, RTO <= 72h unless a product-specific SLO requires tighter targets
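To make restore drill results comparable against these targets, the table can be encoded and checked mechanically. This is a minimal sketch; `RecoveryTarget`, `BASELINE_TARGETS`, and `drill_meets_target` are hypothetical names, and product-specific SLOs that tighten tier 3 are not modeled.

```python
from datetime import timedelta
from typing import NamedTuple


class RecoveryTarget(NamedTuple):
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated time to restore service


# Baseline targets per privilege tier, copied from the list above.
BASELINE_TARGETS = {
    0: RecoveryTarget(timedelta(hours=4), timedelta(hours=24)),
    1: RecoveryTarget(timedelta(hours=4), timedelta(hours=24)),
    2: RecoveryTarget(timedelta(hours=8), timedelta(hours=24)),
    3: RecoveryTarget(timedelta(hours=24), timedelta(hours=72)),
}


def drill_meets_target(tier: int, data_loss: timedelta, downtime: timedelta) -> bool:
    """Compare measured data loss and downtime from a drill with the tier target."""
    target = BASELINE_TARGETS[tier]
    return data_loss <= target.rpo and downtime <= target.rto
```

For instance, a tier 2 drill that lost 6 hours of data and took 20 hours to restore passes, while the same result for a tier 1 service fails on RPO.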
2.5 Backup/restore cycle (Mermaid)
flowchart LR
JOBS[Scheduled backups] --> CHECK[Daily backup checks]
CHECK --> DRILL[Monthly restore drill]
DRILL --> SIM[Quarterly lost-node simulation]
SIM --> FULL[Annual full DR exercise]
FULL --> IMPROVE[Runbook and policy improvements]
IMPROVE --> JOBS
2.6 Tenant backup/restore scope baseline
- Backups must support tenant-scoped restore where the data model allows it.
- Restore procedures must declare scope explicitly: single tenant, full environment, or full platform domain.
- Restore execution is provider-operated; tenants can request restore and track status/evidence in tenant scope.
- Tenant-scoped restore tests should be included in quarterly restore validation.
3. Incident response
3.1 Severity model
- S1: user-visible outage or security incident affecting production-critical services
- S2: major degradation, partial outage, or high-risk security issue
- S3: localized issue with workaround available
3.2 Response baseline
- Triage starts immediately for S1/S2.
- Preserve evidence (logs, snapshots, timelines) for security-significant events.
- Use an incident channel with timestamped decisions/actions.
- Complete post-incident review with corrective actions and owner/date.
- Suspected `CONF-4` disclosure is treated as S1 by default; `CONF-3` disclosure is minimum S2 unless clearly bounded and low impact.
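The disclosure defaults above can be encoded as a severity floor for triage tooling. This is a sketch under assumptions: `disclosure_severity_floor` is a hypothetical helper, and it returns `None` wherever this document does not define a default, leaving the decision to triage.

```python
from typing import Optional


def disclosure_severity_floor(conf_class: int,
                              bounded_low_impact: bool = False) -> Optional[str]:
    """Return the default minimum severity for a suspected disclosure.

    conf_class is the numeric part of the label, e.g. 4 for CONF-4.
    None means no default floor is defined here; triage decides.
    """
    if conf_class >= 4:
        return "S1"  # suspected CONF-4 disclosure is S1 by default
    if conf_class == 3:
        # Minimum S2 unless clearly bounded and low impact.
        return None if bounded_low_impact else "S2"
    return None
```

The function only sets a floor; responders can always raise the severity when the incident warrants it.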
3.3 Incident lifecycle (Mermaid)
stateDiagram-v2
[*] --> Detected
Detected --> Triage
Triage --> Mitigating
Mitigating --> Recovering
Recovering --> Monitoring
Monitoring --> PIR
PIR --> Closed
Closed --> [*]
3.4 Tenant incident scoping
- Incidents must declare tenant impact scope (`single-tenant`, `multi-tenant`, or `platform-wide`).
- Communications and timeline evidence must include affected `tenant_id` values.
- Cross-tenant data exposure events are treated as security incidents by default.
4. Runbooks
Create a runbook per critical service/product under vijfpas/docs/runbooks/.
4.1 Runbook ID convention
- Platform components: `RB-SVC-<component>`
- Products: `RB-PROD-<product>`
- External integrations: `RB-EXT-<integration>`
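A helper that builds IDs from this convention might look like the sketch below. The function name `runbook_id` and the normalization rule (uppercase, runs of non-alphanumeric characters replaced by a hyphen) are assumptions; the convention itself does not define normalization, and some assigned IDs are abbreviated by hand, so the assigned ID in the coverage tables remains authoritative.

```python
import re

# Prefix per runbook kind, from the convention above.
PREFIXES = {
    "component": "RB-SVC",
    "product": "RB-PROD",
    "integration": "RB-EXT",
}


def runbook_id(kind: str, name: str) -> str:
    """Build a runbook ID such as RB-SVC-KAFKA-INFRA from a kind and a name."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown runbook kind: {kind!r}")
    # Assumed normalization: collapse anything non-alphanumeric into "-".
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-").upper()
    if not slug:
        raise ValueError("name must contain at least one alphanumeric character")
    return f"{PREFIXES[kind]}-{slug}"
```

Under this rule, `runbook_id("component", "Kafka (infra)")` yields `RB-SVC-KAFKA-INFRA` and `runbook_id("product", "shop")` yields `RB-PROD-SHOP`.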
4.2 Component runbook coverage
| Component | Runbook ID | Path | Status |
|---|---|---|---|
| Proxmox | RB-SVC-PROXMOX | `runbooks/proxmox.md` | exists (template content still TBD) |
| Ceph | RB-SVC-CEPH | `runbooks/ceph.md` | exists (template content still TBD) |
| PBS | RB-SVC-PBS | `runbooks/pbs.md` | exists (implemented baseline) |
| UniFi edge/network | RB-SVC-UNIFI | `runbooks/unifi-edge.md` | exists (implemented baseline) |
| Chrony | RB-SVC-CHRONY | `runbooks/chrony.md` | exists (implemented baseline) |
| Kubernetes runtime planes + Traefik | RB-SVC-KUBERNETES | `runbooks/kubernetes.md` | exists (template content still TBD) |
| Rancher | RB-SVC-RANCHER | `runbooks/rancher.md` | exists (live pfm management-cluster baseline with active pfm and prd downstream imports) |
| GitLab | RB-SVC-GITLAB | `runbooks/gitlab.md` | exists (template content still TBD) |
| Gitaly | RB-SVC-GITALY | `runbooks/gitaly.md` | exists (template content still TBD) |
| Nexus | RB-SVC-NEXUS | `runbooks/nexus.md` | exists (template content still TBD) |
| Taiga | RB-SVC-TAIGA | `runbooks/taiga.md` | exists (documented rollout baseline; not live yet) |
| Keycloak | RB-SVC-KEYCLOAK | `runbooks/keycloak.md` | exists (template content still TBD) |
| Vaultwarden | RB-SVC-VAULTWARDEN | `runbooks/vaultwarden.md` | exists (template content still TBD) |
| OpenBao | RB-SVC-OPENBAO | `runbooks/openbao.md` | exists |
| Certificate management | RB-SVC-CERTMGMT | `runbooks/certificate-management.md` | exists (template content still TBD) |
| Vulnerability/malware controls | RB-SVC-VULN-MALWARE | `runbooks/vulnerability-malware.md` | exists (template content still TBD) |
| PostgreSQL | RB-SVC-POSTGRESQL | `runbooks/postgresql.md` | exists (template content still TBD) |
| MariaDB | RB-SVC-MARIADB | `runbooks/mariadb.md` | exists (template content still TBD) |
| Neo4j | RB-SVC-NEO4J | `runbooks/neo4j.md` | exists (template content still TBD) |
| Redis | RB-SVC-REDIS | `runbooks/redis.md` | exists (template content still TBD) |
| OpenSearch (app/search) | RB-SVC-OPENSEARCH-APP | `runbooks/opensearch-app.md` | exists (template content still TBD) |
| CephRGW | RB-SVC-CEPHRGW | `runbooks/cephrgw.md` | current production baseline and manual keyring/bootstrap procedure documented |
| RabbitMQ | RB-SVC-RABBITMQ | `runbooks/rabbitmq.md` | exists (template content still TBD) |
| Kafka (infra) | RB-SVC-KAFKA-INFRA | `runbooks/kafka-infra.md` | exists (template content still TBD) |
| Kafka (app) | RB-SVC-KAFKA-APP | `runbooks/kafka-app.md` | exists (template content still TBD) |
| Debezium/Kafka Connect | RB-SVC-DEBEZIUM-CONNECT | `runbooks/debezium-kafka-connect.md` | exists (template content still TBD) |
| Kafka UI | RB-SVC-KAFKA-UI | `runbooks/kafka-ui.md` | exists (template content still TBD) |
| Airflow | RB-SVC-AIRFLOW | `runbooks/airflow.md` | exists (template content still TBD) |
| Spark | RB-SVC-SPARK | `runbooks/spark.md` | exists (template content still TBD) |
| JupyterHub/JupyterLab | RB-SVC-JUPYTER | `runbooks/jupyterhub-jupyterlab.md` | exists (template content still TBD) |
| Iceberg | RB-SVC-ICEBERG | `runbooks/iceberg.md` | exists (template content still TBD) |
| dbt | RB-SVC-DBT | `runbooks/dbt.md` | exists (template content still TBD) |
| OpenMetadata | RB-SVC-OPENMETADATA | `runbooks/openmetadata.md` | exists (template content still TBD) |
| Hive Metastore | RB-SVC-HIVE-METASTORE | `runbooks/hive-metastore.md` | exists (template content still TBD) |
| Trino | RB-SVC-TRINO | `runbooks/trino.md` | exists (template content still TBD) |
| Superset | RB-SVC-SUPERSET | `runbooks/superset.md` | exists (template content still TBD) |
| Prometheus/Alertmanager | RB-SVC-PROM-AM | `runbooks/prometheus-alertmanager.md` | exists (template content still TBD) |
| Grafana | RB-SVC-GRAFANA | `runbooks/grafana.md` | exists (template content still TBD) |
| OpenSearch logging | RB-SVC-OPENSEARCH-LOG | `runbooks/opensearch-logging.md` | exists (template content still TBD) |
4.3 Product and external runbook coverage
| Item | Runbook ID | Path | Status |
|---|---|---|---|
| ref | RB-PROD-REF | `runbooks/ref.md` | exists (template content still TBD) |
| nibbler | RB-PROD-NIBBLER | `runbooks/nibbler.md` | exists (template content still TBD) |
| genea | RB-PROD-GENEA | `runbooks/genea.md` | exists (template content still TBD) |
| shop | RB-PROD-SHOP | `runbooks/shop.md` | exists (template content still TBD) |
| notimon | RB-PROD-NOTIMON | `runbooks/notimon.md` | exists (template content still TBD) |
| sec48 | RB-PROD-SEC48 | `runbooks/sec48.md` | exists (template content still TBD) |
| homeassistant | RB-PROD-HOMEASSISTANT | `runbooks/homeassistant.md` | exists (template content still TBD) |
| mermaid live | RB-PROD-MERMAIDLIVE | `runbooks/mermaid-live.md` | exists (template content still TBD) |
| draw.io | RB-PROD-DRAWIO | `runbooks/drawio.md` | exists (template content still TBD) |
| nextcloud | RB-PROD-NEXTCLOUD | `runbooks/nextcloud.md` | exists; target architecture and rollout baseline documented, incident procedure still incomplete |
| Payment provider integration | RB-EXT-PAYMENT | `runbooks/ext-payment-provider.md` | exists (template content still TBD) |
| Mail provider integration | RB-EXT-MAIL | `runbooks/ext-mail-provider.md` | exists (template content still TBD) |
| Push provider integration | RB-EXT-PUSH | `runbooks/ext-push-provider.md` | exists (template content still TBD) |
| Docs platform integration | RB-EXT-DOCS | `runbooks/ext-docs-platform.md` | exists; live public docs portal baseline documented |
4.4 Runbook template fields
- Symptoms
- Immediate actions
- Verification
- Root cause notes
- Long-term fix
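New runbook stubs can be generated from these fields so that every runbook starts with the same sections. The sketch below is illustrative: `render_runbook_stub` and the plain-text layout it emits are hypothetical, and the `TBD` placeholder simply marks sections that still need an executable procedure.

```python
# Template fields, in the order listed above.
TEMPLATE_FIELDS = (
    "Symptoms",
    "Immediate actions",
    "Verification",
    "Root cause notes",
    "Long-term fix",
)


def render_runbook_stub(runbook_id: str, title: str) -> str:
    """Render an empty runbook with one placeholder section per field."""
    lines = [f"{title} ({runbook_id})", ""]
    for field in TEMPLATE_FIELDS:
        # Each section starts as a placeholder until a procedure is written.
        lines += [field, "TBD", ""]
    return "\n".join(lines)


def unfinished_sections(text: str) -> list[str]:
    """List template fields whose section body is still the TBD placeholder."""
    rows = text.split("\n")
    return [
        row for row, body in zip(rows, rows[1:])
        if row in TEMPLATE_FIELDS and body == "TBD"
    ]
```

Pairing the generator with `unfinished_sections` gives a simple way to report which runbooks are still stubs.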
5. Operations backlog
- Confirm emergency patch workflow ownership per team.
- Operationalize ADR-0010 evidence and exception workflow in incident/change tooling.
- Replace `template content still TBD` runbook stubs with executable procedures and validation checks.
- Validate quarterly patch cadence against maintenance windows.
- Add executable tenant onboarding/offboarding and tenant-scoped restore procedures.