
docs: align SDD / REQUIREMENTS / SPEC v0.0.3 with closed open issues

ARCHITECTURE.md -> 0.0.3
REQUIREMENTS.md -> 0.0.2
SPEC.md         -> 0.0.2

Closed OI-01 through OI-06 with concrete decisions:
- OI-01: Encryption at rest delegated to hardware-host (no OS-level LUKS,
  no column-level). Compliance intent preserved at infrastructure layer.
- OI-02: Argon2id starting parameters m=64MiB, t=3, p=1; 250-500 ms
  login-latency budget on Intel Xeon 4c/16GB; calibration recorded in
  system_config at deploy time.
- OI-03: JWT signing = EdDSA/Ed25519; 90-day key rotation with 24-hour
  overlap; web holds signing key, worker holds verifying key only.
- OI-04: CIDR scan concurrency = 128, per-host timeout = 1.5 s; /22 across
  sites completes under 10 s; progress UI + cancel required.
- OI-05: PDF stack = printpdf + plotters (in-process, no sidecar);
  charts required; no branding; no digital signatures.
- OI-06: /status/health = minimal unauthenticated liveness;
  /api/v1/status/fleet = authenticated fleet aggregates.

Added architecture decisions:
- AD-15: Web UI TLS certificate strategy (self-signed from internal CA
  by default; operator may supply external cert)
- AD-16: Azure SSO + SMTP runtime configuration via Settings GUI with
  test-connection actions
- AD-17: PDF generation via printpdf + plotters
- AD-18: IP whitelist enforcement at every listener

Added FR-07 (System Configuration) in REQUIREMENTS.md covering Azure
SSO GUI, SMTP GUI, polling-interval tuning, Web UI TLS strategy,
and IP whitelist management.

SDD review pass also added (from v0.0.2):
- IEEE 1016-aligned structure (Introduction, Stakeholders, Design
  Rationale, Risks, Open Issues, Glossary, References, Revision History)
- Portable ASCII diagrams; split into Context/Logical/Deployment/Process
  views
- Explicit WebSocket ticket authentication flow
- Rollback data flow (6.5)
- API error envelope + X-Request-Id correlation
- Configuration, migration, and backup/DR sections
- Worker heartbeat and dead-process detection
- Sizing math for 2,500-host scalability claim
- Split /status/health (Manager) from /api/v1/health (Agent) namespaces

See ARCHITECTURE.md section 18 for the full change log.
2026-04-23 15:18:10 +00:00
parent f6540133c2
commit 3eb7fd9f95
4 changed files with 765 additions and 308 deletions

4
.gitignore vendored

@@ -12,3 +12,7 @@ venv/**
# IDE
.vscode/
.idea/
# Backup files
*.bak
*.bak.*

View File

@@ -1,332 +1,691 @@
# Linux_Patch_Manager - Architecture Document
## Project Overview
**Title:** Linux_Patch_Manager
**Version:** 0.0.1
**Status:** Draft
## Architecture Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Backend language/framework | Rust with Axum | Security-aligned with linux_patch_api, memory-safe, high async performance |
| Frontend framework | React + TypeScript SPA | Rich ecosystem for enterprise dashboards, strong typing |
| Database | PostgreSQL with SQLx | Enterprise-grade, type-safe Rust queries, handles concurrent access |
| Async runtime | Tokio | Standard Rust async runtime, integrates with Axum |
| Deployment model | Single bare metal/VM | Simplicity, supports up to 2,500 managed hosts |
| Frontend serving | Axum serves static files | Simplest deployment, single process |
| Background processing | Separate worker process | Clean separation of concerns, communicates via PostgreSQL |
| Session management | JWT + refresh tokens | Short-lived access tokens (15 min), revocable refresh tokens (1 hr) |
| Encryption at rest | LUKS full-disk (infrastructure) | HIPAA/PCI-DSS compliant, handled at infrastructure level |
| Certificate management | Internal CA on Patch Manager host | Issues/renews mTLS certs, manual distribution to clients |
## System Architecture

# Linux_Patch_Manager — Software Design Document (SDD)
## Document Control
| Field | Value |
|-------|-------|
| Title | Linux_Patch_Manager — Software Design Document |
| Version | 0.0.3 |
| Status | Draft |
| Standard | Aligned with IEEE 1016-2009 |
| Owner | Echo (for Kelly / Moon Dragon) |
| Last Updated | 2026-04-23 |
| Related Docs | `SPEC.md`, `REQUIREMENTS.md`, `README.md` |
### Revision History
| Version | Date | Author | Summary |
|---------|------|--------|---------|
| 0.0.1 | 2026-04-23 | Initial | First draft of architecture document |
| 0.0.2 | 2026-04-23 | Echo | SDD review pass: IEEE 1016 alignment, ASCII diagram fixes, added stakeholders, rationale, error handling, rollback flow, config/secrets, migrations, backup/DR, observability, glossary, and open issues sections |
| 0.0.3 | 2026-04-23 | Echo | Closed OI-01 through OI-06 with concrete decisions; encryption at rest moved to hardware-host (no OS-level LUKS); committed Argon2id parameters, EdDSA JWT signing, CIDR scan tuning, PDF stack (`printpdf`+`plotters`), health-endpoint split; added AD-15 (web UI TLS cert strategy) and AD-16 (Azure SSO / SMTP config GUI); added IP whitelist enforcement |
---
## 1. Introduction
### 1.1 Purpose
This Software Design Document (SDD) describes the architecture and detailed design of the **Linux_Patch_Manager**, an enterprise-class, secure, web-based management interface used to control patching and updates on a fleet of Linux servers and workstations. It translates the requirements in `REQUIREMENTS.md` and the product scope in `SPEC.md` into a concrete technical design that implementers can build from and reviewers can evaluate against.
### 1.2 Scope
The design covers the management plane only: the web server, background worker, PostgreSQL database, internal Certificate Authority (CA), and the React SPA. Managed hosts run the upstream **Linux Patch API** agent, which is a separate project (`linux_patch_api`) and is treated here as an external dependency.
### 1.3 Intended Audience
- Software engineers implementing the system
- Security and compliance reviewers (HIPAA / PCI-DSS)
- Operators / administrators deploying and maintaining the system
- Future maintainers performing changes or audits
### 1.4 Document Conventions
- **MUST / SHOULD / MAY** follow RFC 2119 semantics.
- Code, paths, and identifiers appear in `monospace`.
- ASCII box diagrams use pure ASCII (`+ - | >`) for portability; Unicode box-drawing is avoided to prevent alignment drift across editors.
- "Manager API" refers to this project's own REST API; "Agent API" refers to the upstream Linux Patch API running on managed hosts.
### 1.5 References
- IEEE Std 1016-2009, *IEEE Standard for Information Technology — Systems Design — Software Design Descriptions*
- RFC 2119, *Key words for use in RFCs to Indicate Requirement Levels*
- RFC 8446, *TLS 1.3*
- HIPAA Security Rule, 45 CFR §164.312
- PCI-DSS v4.0
- Upstream: [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api)
- Internal: `SPEC.md`, `REQUIREMENTS.md` (same repository)
### 1.6 Glossary
| Term | Definition |
|------|------------|
| Agent | The Linux Patch API service running on each managed host |
| Manager | This project — the Linux_Patch_Manager web application |
| mTLS | Mutual TLS; both client and server present X.509 certificates |
| RBAC | Role-Based Access Control |
| SPA | Single-Page Application |
| CA | Certificate Authority |
| JWT | JSON Web Token |
| TOTP | Time-based One-Time Password |
| WebAuthn | W3C Web Authentication standard (FIDO2) |
| SSO | Single Sign-On |
| FQDN | Fully Qualified Domain Name |
| CIDR | Classless Inter-Domain Routing (network range notation) |
---
## 2. Stakeholders and Design Concerns
| Stakeholder | Primary Concerns |
|-------------|------------------|
| Administrator | Full fleet control, user management, CA management, SSO config, auditability |
| Operator | Group-scoped patch deployment, scheduling, job monitoring, reporting |
| Security / Compliance Officer | MFA, audit log integrity, encryption at rest and in transit, HIPAA / PCI-DSS mapping |
| Server Administrator (managed host owner) | Minimal agent footprint, predictable maintenance windows, manual cert control |
| System Implementer | Clear component boundaries, testable data flows, deterministic error handling |
| System Operator (of the Manager host) | systemd-friendly deployment, structured logs, health endpoint, backup/restore |
---
## 3. Architecture Decisions
| # | Decision | Choice | Rationale |
|---|----------|--------|-----------|
| AD-01 | Backend language / framework | Rust with Axum | Memory-safe, high async throughput, aligned with `linux_patch_api` stack |
| AD-02 | Frontend framework | React + TypeScript SPA (Vite) | Rich ecosystem for enterprise dashboards, strong typing, fast dev loop |
| AD-03 | Database | PostgreSQL with SQLx | Enterprise-grade, type-safe compile-time checked queries, strong concurrency |
| AD-04 | Async runtime | Tokio | De facto Rust async runtime; required by Axum |
| AD-05 | Deployment model | Single bare-metal / VM host | Simplicity; sized to support up to 2,500 agents |
| AD-06 | Frontend serving | Axum serves static assets | Single process, one TLS endpoint, simplest deployment |
| AD-07 | Background processing | Separate worker process | Isolation of long-running work from request path; independent restart |
| AD-08 | Web ↔ Worker coordination | PostgreSQL job queue + `LISTEN/NOTIFY` | Avoids extra broker (Redis / RabbitMQ); sub-second wake for immediate-apply |
| AD-09 | Session management | Short-lived JWT access + DB-backed refresh | 15-minute access token; 1-hour inactivity-based refresh; revocable |
| AD-10 | Encryption at rest | Hardware-host full-disk encryption | Provided by the underlying infrastructure; application does not manage disk encryption; satisfies HIPAA / PCI-DSS storage protection |
| AD-11 | Certificate management | Internal CA on Manager host | Issues and renews mTLS certs; distribution to agents is manual by design |
| AD-12 | API versioning | URL path versioning (`/api/v1/…`) | Consistent with upstream Agent API convention; clear breaking-change boundary |
| AD-13 | TLS | TLS 1.3 only, both Agent and Web UI | Eliminates legacy cipher risk; required for compliance posture |
| AD-14 | Observability transport | Structured JSON logs via `tracing` | Machine-readable; no hard dependency on external stack |
| AD-15 | Web UI TLS certificate | Self-signed from internal CA by default; operator may supply external cert | Zero-touch default for internal deployments; easy upgrade path to infrastructure wildcard certs |
| AD-16 | Azure SSO and SMTP | Runtime-configured via Settings GUI with test actions | Operators can change tenants / mail relays without redeploy; test-connection closes configuration loop |
| AD-17 | PDF generation | `printpdf` + `plotters` (in-process) | Charts required; avoids sidecar (e.g., wkhtmltopdf) and its operational surface; all rendering stays in the Rust process |
| AD-18 | IP whitelist enforcement | Enforced at every listener and on agent-call origination | Mandatory security control; reduces attack surface beyond TLS and mTLS |
---
## 4. System Architecture
### 4.1 Context Diagram
```
┌───────────────────────────────────────────────────────────────┐
│                   Linux Patch Manager Host                    │
│                        (Ubuntu 24.04)                         │
│                                                               │
│  ┌─────────────────────┐    ┌──────────────────────────────┐  │
│  │   Axum Web Server   │    │      Background Worker       │  │
│  │                     │    │                              │  │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │  │
│  │  │ REST API      │  │    │  │ Health Poller          │  │  │
│  │  │ (CRUD, auth)  │  │    │  │ (5 min intervals)      │  │  │
│  │  └───────────────┘  │    │  └────────────────────────┘  │  │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │  │
│  │  │ WebSocket     │  │    │  │ Patch Data Poller      │  │  │
│  │  │ Relay         │  │    │  │ (30 min intervals)     │  │  │
│  │  └───────────────┘  │    │  └────────────────────────┘  │  │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │  │
│  │  │ Static Files  │  │    │  │ Job Scheduler          │  │  │
│  │  │ (React SPA)   │  │    │  │ (maintenance windows)  │  │  │
│  │  └───────────────┘  │    │  └────────────────────────┘  │  │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │  │
│  │  │ mTLS Client   │  │    │  │ Retry Engine           │  │  │
│  │  │ (agent comm)  │◄─┼────┼─►│ (exp. backoff)         │  │  │
│  │  └───────────────┘  │    │  └────────────────────────┘  │  │
│  └──────────┬──────────┘    │  ┌────────────────────────┐  │  │
│             │               │  │ Email Notifier         │  │  │
│             │               │  │ (optional/disabled)    │  │  │
│             │               │  └────────────────────────┘  │  │
│             │               └──────────────┬───────────────┘  │
│             │                              │                  │
│             │         ┌────────────────────┘                  │
│             │         │                                       │
│  ┌──────────▼─────────▼────────────────────────────────────┐  │
│  │                       PostgreSQL                        │  │
│  │ (hosts, groups, users, jobs, schedules, audit, etc.)    │  │
│  └─────────────────────────────────────────────────────────┘  │
│                                                               │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │                Internal CA (mTLS certs)                 │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
                  mTLS / REST API (port 12443)
                         ┌──────┼──────┐
                         ▼      ▼      ▼
                    ┌──────┐┌──────┐┌──────┐
                    │ Host ││ Host ││ Host │  ← Linux Patch API agents
                    │  A   ││  B   ││  C   │    (up to 2,500)
                    └──────┘└──────┘└──────┘
```
```
                   +------------------------+
  Browser (HTTPS)  |    Admin / Operator    |
  ---------------->|      Workstation       |
                   +-----------+------------+
                               |
                               | HTTPS (TLS 1.3) / WSS
                               v
                   +------------------------+
                   |  Linux Patch Manager   |
                   |     (this project)     |
                   +-----------+------------+
                               |
                 mTLS / REST + WSS (port 12443)
                               |
            +------------------+------------------+
            v                  v                  v
        +--------+         +--------+         +--------+
        | Host A |         | Host B |   ...   | Host N |
        | Agent  |         | Agent  |         | Agent  |
        +--------+         +--------+         +--------+
        (Linux Patch API agents, up to 2,500)

        Optional: Azure AD (OAuth2 / OIDC SSO)
```
## Component Design
### 1. Axum Web Server
### 4.2 Logical View — Host-Internal Components
```
+-----------------------------------------------------------------+
|             Linux Patch Manager Host (Ubuntu 24.04)             |
|                                                                 |
|  +-----------------------+    +-----------------------------+   |
|  |    Axum Web Server    |    |      Background Worker      |   |
|  |    (systemd unit)     |    |       (systemd unit)        |   |
|  |                       |    |                             |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  | REST API        |  |    |  | Health Poller         |  |   |
|  |  | (CRUD, auth)    |  |    |  | (5 min intervals)     |  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  | WebSocket       |  |    |  | Patch Data Poller     |  |   |
|  |  | Relay           |  |    |  | (30 min intervals)    |  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  | Static Files    |  |    |  | Job Scheduler         |  |   |
|  |  | (React SPA)     |  |    |  | (maintenance windows) |  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |  | mTLS Client     |  |    |  | Job Executor +        |  |   |
|  |  | (agent comm)    |  |    |  | Retry Engine          |  |   |
|  |  +-----------------+  |    |  +-----------------------+  |   |
|  |                       |    |  +-----------------------+  |   |
|  |                       |    |  | Email Notifier        |  |   |
|  |                       |    |  | (optional/disabled)   |  |   |
|  |                       |    |  +-----------------------+  |   |
|  |                       |    |  +-----------------------+  |   |
|  |                       |    |  | Data Pruner           |  |   |
|  |                       |    |  +-----------------------+  |   |
|  +----------+------------+    +--------------+--------------+   |
|             |                                |                  |
|             |    +---------------------------+                  |
|             v    v                                              |
|    +------------------------------------------------------+     |
|    |                      PostgreSQL                      |     |
|    | (hosts, groups, users, jobs, schedules, audit, ...)  |     |
|    |         Coordination: LISTEN/NOTIFY channels         |     |
|    +------------------------------------------------------+     |
|                                                                 |
|    +------------------------------------------------------+     |
|    |               Internal CA (mTLS certs)               |     |
|    +------------------------------------------------------+     |
|                                                                 |
| Host-level: hardware-host full-disk encryption (infrastructure) |
+-----------------------------------------------------------------+
```
### 4.3 Deployment View
All components co-reside on a single Ubuntu 24.04 host. Two `systemd` units run the application:
- `patch-manager-web.service` — Axum web server; listens on TCP `443` (HTTPS) for browsers.
- `patch-manager-worker.service` — Background worker; no inbound listener.
Both connect to a local `postgresql.service`. Outbound agent calls go to TCP `12443` on each managed host. See §10 for deployment details.
### 4.4 Process View
- **Web process** handles HTTP requests, serves the SPA, validates JWTs, authorizes via RBAC, and performs on-demand mTLS calls to agents (e.g., manual refresh, immediate patch triggers that are short-lived).
- **Worker process** runs scheduled polls, scans CIDR ranges on-demand, executes queued jobs at maintenance-window boundaries, and prunes expired data.
- **PostgreSQL** is the single source of truth. The web and worker processes communicate indirectly through rows in `patch_jobs`, `patch_job_hosts`, and related tables, using `LISTEN / NOTIFY` channels (`job_enqueued`, `job_cancelled`) to wake the worker without polling latency.
---
## 5. Component Design
### 5.1 Axum Web Server
**Responsibility:** Handle all HTTP/HTTPS requests from browsers and serve the React SPA.
- **REST API:** CRUD operations for hosts, groups, users, schedules, certificates, reports - **Manager REST API** at `/api/v1/…` — CRUD for hosts, groups, users, schedules, certificates, reports.
- **WebSocket Relay:** Proxy real-time job status from agent WebSocket streams to browser clients - **WebSocket Relay** at `/api/v1/ws/jobs` — Authenticated WSS endpoint; Manager opens an upstream mTLS WSS to the relevant agent(s) and multiplexes events to the browser.
- **Static File Server:** Serve compiled React SPA (HTML, JS, CSS, assets) - **Static File Server** — Serves compiled React SPA (HTML, JS, CSS, assets) from a single directory.
- **Authentication:** JWT access token validation, refresh token management, MFA enforcement - **Authentication** — JWT access-token validation, refresh-token issuance/rotation, MFA enforcement, Azure OIDC flow.
- **Authorization:** RBAC middleware enforcing admin/operator/group-scoped access - **Authorization** — RBAC middleware enforcing `admin`, `operator`, and group-scoped access (see §7.2).
- **mTLS Client:** HTTP client with client certificates for communicating with Linux Patch API agents - **mTLS Client** — Rustls-based HTTP client holding the Manager's client certificate for on-demand calls to agents.
**API Versioning:** URL path versioning (`/api/v1/`) to match the upstream Linux Patch API convention. **API versioning:** The Manager's own API uses URL path versioning (`/api/v1/`). This is independent of the Agent API version, even though the convention matches.
### 2. Background Worker **Browser → WebSocket authentication:** The client obtains a short-lived WS ticket from `POST /api/v1/ws/ticket` (JWT-authenticated), then opens `wss://…/api/v1/ws/jobs?ticket=…`. The ticket is single-use and expires in 60 seconds.
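The ticket handshake described above (single-use, 60-second expiry) can be sketched as a small in-memory store. This is only an illustration under stated assumptions: `TicketStore`, `issue`, and `redeem` are hypothetical names, the ticket string is assumed to be an unguessable random value generated elsewhere, and a real deployment might persist tickets in PostgreSQL instead.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Ticket lifetime stated in the design: 60 seconds.
const TICKET_TTL: Duration = Duration::from_secs(60);

/// Hypothetical in-memory ticket store (not part of the SDD).
#[derive(Default)]
pub struct TicketStore {
    /// Maps each outstanding ticket to the instant it was issued.
    issued: HashMap<String, Instant>,
}

impl TicketStore {
    /// Record a ticket handed out by `POST /api/v1/ws/ticket`.
    /// Generating the random ticket value is out of scope for this sketch.
    pub fn issue(&mut self, ticket: String, now: Instant) {
        self.issued.insert(ticket, now);
    }

    /// Redeem a ticket during the WebSocket upgrade.
    /// The entry is removed before the expiry check, so a ticket can be
    /// presented at most once whether or not it is still valid.
    pub fn redeem(&mut self, ticket: &str, now: Instant) -> bool {
        match self.issued.remove(ticket) {
            Some(issued_at) => now.duration_since(issued_at) <= TICKET_TTL,
            None => false,
        }
    }
}
```

Passing `now` in explicitly keeps the expiry logic testable without sleeping; production code would typically call `Instant::now()` at the call site.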
### 5.2 Background Worker
**Responsibility:** All scheduled and asynchronous background processing.
- **Health Poller:** Periodic health checks to all registered agents (5-minute intervals) - **Health Poller** — Periodic health checks to all registered agents (5-minute interval; configurable).
- **Patch Data Poller:** Periodic patch availability queries to all agents (30-minute intervals) - **Patch Data Poller** — Periodic patch-availability queries to all agents (30-minute interval; configurable).
- **Job Scheduler:** Execute queued patch operations when maintenance windows open - **Job Scheduler** — Opens maintenance windows and dispatches queued jobs.
- **Retry Engine:** Handle agent communication failures with exponential backoff (3 retries, max 30 min) - **Job Executor** — Invokes agent endpoints for patch apply / install / remove / reboot; tracks async job IDs returned by the agent.
- **Job Executor:** Trigger patch operations on agents, track async job status - **Retry Engine** — Exponential backoff for transient agent communication failures: up to **3 retries**, max **30 minutes** between retries (see §8).
- **Email Notifier:** Optional email notifications (disabled by default) - **Email Notifier** — Optional; disabled by default.
- **Data Pruner:** Clean up operational data older than 30 days, audit logs older than 6 months - **Data Pruner** — Daily job that deletes operational data older than 30 days and audit-log rows older than 6 months.
**Communication:** Worker reads job queue from PostgreSQL, updates results back to PostgreSQL. Web server reads results from PostgreSQL for API responses. **Concurrency bounds:** The worker uses a bounded Tokio `Semaphore` (default **64 concurrent agent calls**, configurable) to avoid saturating the host's network or file-descriptor limits when polling thousands of agents.
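The concurrency bound can be illustrated with a dependency-free sketch. The design names Tokio's `Semaphore`; the version below substitutes a hand-rolled counting semaphore on OS threads purely so the example compiles with the standard library alone. `Semaphore`, `run_bounded`, and the simulated agent call are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (stand-in for `tokio::sync::Semaphore`).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Return a permit and wake one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Run `jobs` simulated agent calls with at most `limit` in flight and
/// return the highest concurrency that was actually observed.
pub fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let current = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(current, Ordering::SeqCst);
                thread::yield_now(); // placeholder for the mTLS call to an agent
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With the documented default, the limit would be 64; the observed peak never exceeds the limit regardless of how many jobs are submitted.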
### 3. PostgreSQL Database **Coordination:**
- Scheduled pollers run on Tokio intervals.
- Immediate-apply and on-demand actions are enqueued by the web process with `INSERT … RETURNING id` followed by `NOTIFY job_enqueued, '<id>'`. The worker holds a `LISTEN job_enqueued` connection and wakes immediately.
**Responsibility:** Persistent storage for all application data. ### 5.3 PostgreSQL Database
**Key Tables:** **Responsibility:** Persistent storage and coordination primitive for the system.
- `hosts` — registered hosts, metadata, health status, last seen
- `groups` — static groups for access control
- `host_groups` — many-to-many host ↔ group membership
- `users` — local accounts with hashed passwords, MFA secrets
- `user_groups` — many-to-many user ↔ group membership
- `refresh_tokens` — server-side refresh tokens for session management
- `maintenance_windows` — per-device recurring and one-time schedules
- `patch_jobs` — queued, running, completed, failed patch operations
- `patch_job_hosts` — per-host status within a batch job
- `host_patch_data` — cached patch availability data from agents
- `host_health_data` — cached health check results
- `certificates` — issued mTLS client certificates
- `audit_log` — tamper-evident audit trail
- `azure_sso_config` — Azure AD SSO configuration
**Data Retention:** **Key tables (logical; exact DDL lives in `migrations/`):**
- Operational data (health, patches, jobs): 30 days
- Audit logs: 6 months
### 4. React + TypeScript SPA | Table | Purpose |
|-------|---------|
| `hosts` | Registered hosts, metadata, health status, last-seen timestamp |
| `groups` | Static groups for access control |
| `host_groups` | Many-to-many host ↔ group membership |
| `users` | Local accounts with Argon2 hashes, MFA secrets |
| `user_groups` | Many-to-many user ↔ group membership |
| `refresh_tokens` | Server-side refresh tokens; revocable |
| `maintenance_windows` | Per-device recurring and one-time schedules |
| `patch_jobs` | Queued, running, completed, failed patch operations |
| `patch_job_hosts` | Per-host status within a batch job |
| `host_patch_data` | Cached patch availability snapshots |
| `host_health_data` | Cached health check results |
| `certificates` | Issued mTLS client certificates (metadata, not private keys) |
| `audit_log` | Tamper-evident audit trail (hash-chained) |
| `azure_sso_config` | Azure AD SSO configuration |
| `system_config` | Key/value runtime configuration (polling intervals, etc.) |
**Data retention:**
- Operational tables (`host_patch_data`, `host_health_data`, `patch_jobs`, `patch_job_hosts`): 30 days.
- `audit_log`: 6 months.
**Migrations:** Managed via `sqlx-cli` (`sqlx migrate add / run`). Migrations are embedded into the binaries via `sqlx::migrate!` and applied automatically at startup of the web process (single-writer election via advisory lock).
### 5.4 React + TypeScript SPA
**Responsibility:** User-facing web interface.
**Pages:**
1. Dashboard — fleet overview, compliance %, health summary, upcoming windows, root CA download
2. Hosts — filterable host list by group, status, OS
3. Host Detail — system info, packages, patches, jobs, maintenance window config, host cert download
4. Patch Deployment — select hosts, review patches, deploy (queue or immediate)
5. Jobs — real-time job monitoring with WebSocket updates
6. Maintenance Windows — per-device recurring/one-time schedule management
7. Groups — manage static groups, assign hosts and operators
8. Reports — generate/export compliance, patch history, vulnerability, audit (CSV/PDF)
9. Users — local account management, MFA setup, group assignments
10. Certificates — view/manage internal CA, issue/renew client certs
11. Settings — system config, Azure SSO, polling intervals
### 5. Internal CA 1. **Dashboard** — Fleet overview: compliance %, health summary, upcoming windows, root CA download.
2. **Hosts** — Filterable host list by group, status, OS.
3. **Host Detail** — System info, packages, patches, jobs, maintenance-window config, host cert download.
4. **Patch Deployment** — Select hosts, review patches, deploy (queue or immediate).
5. **Jobs** — Real-time job monitoring via WebSocket.
6. **Maintenance Windows** — Per-device recurring / one-time schedule management.
7. **Groups** — Manage static groups; assign hosts and operators.
8. **Reports** — Generate / export compliance, patch history, vulnerability, audit (CSV / PDF).
9. **Users** — Local account management, MFA setup, group assignments.
10. **Certificates** — View / manage internal CA; issue / renew client certs.
11. **Settings** — System config: Azure SSO setup (with "Test Connection"), SMTP setup (with "Send Test Email"), polling intervals, Web UI TLS certificate strategy (internal CA vs. operator-supplied), IP whitelist management.
**Responsibility:** mTLS certificate management for agent communication. ### 5.5 Internal CA
- Runs on the same Patch Manager host **Responsibility:** mTLS certificate lifecycle for agent communication.
- Issues client certificates for mTLS communication with agents
- Manages certificate renewal
- Root CA certificate downloadable from dashboard for manual distribution
- Host-specific mTLS certificates downloadable from host detail page
- No automated distribution to clients — server administrators handle this manually
## Data Flow - Runs in-process within the web server (library-level, `rcgen` + `rustls`).
- Issues client certificates for mTLS communication with agents.
- Supports renewal; revocation is performed by issuing a new cert and marking the old one revoked in `certificates`.
- Root CA certificate downloadable from Dashboard for manual distribution.
- Host-specific mTLS certificates downloadable from each Host Detail page.
- **No automated distribution to managed clients** — server administrators install them manually.
- CA private key is stored on the Manager host at `/etc/patch-manager/ca/ca.key` with `0600` permissions, owned by the service user. Disk-level protection is provided by hardware-host full-disk encryption.
---
## 6. Data Flow
### 6.1 Host Registration
### Host Registration Flow
``` ```
1. Admin enters FQDN/IP → Axum validates & resolves FQDN 1. Admin enters FQDN / IP -> Web validates and resolves FQDN to IP.
2. Axum stores host in PostgreSQL 2. Web inserts row in `hosts` (status = pending).
3. Worker picks up new host → initial health check via mTLS 3. Web NOTIFYs `host_registered` -> Worker performs initial mTLS health check.
4. Health result stored in PostgreSQL → visible in dashboard 4. Worker updates `hosts.health_status` and `host_health_data` -> visible in Dashboard.
``` ```
### Auto-Discovery Flow ### 6.2 Auto-Discovery (CIDR scan)
``` ```
1. Admin triggers CIDR scan → Axum sends request to Worker 1. Admin triggers CIDR scan -> Web inserts a discovery job and NOTIFYs `discovery_enqueued`.
2. Worker scans subnet for agents on port 12443 2. Worker scans the subnet for agents listening on port 12443 (bounded concurrency, TLS probe).
3. Discovered agents reported back → Admin selects which to register 3. Discovered agents written to a transient `discovery_results` table.
4. Selected hosts stored in PostgreSQL 4. Admin reviews and selects which to register; each selection follows the 6.1 flow.
``` ```
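As a rough illustration of step 2, the worker first has to expand the CIDR range into individual probe targets. The sketch below assumes IPv4 only, includes the network and broadcast addresses, and rejects prefixes shorter than /16 as an arbitrary safety guard; `cidr_addresses` is a hypothetical helper, not something the SDD defines.

```rust
use std::net::Ipv4Addr;

/// Expand "a.b.c.d/len" into every IPv4 address in the block.
/// Returns `None` for malformed input or a prefix outside /16../32
/// (the lower bound is an assumed guard against huge scans).
pub fn cidr_addresses(cidr: &str) -> Option<Vec<Ipv4Addr>> {
    let (addr, len) = cidr.split_once('/')?;
    let addr: Ipv4Addr = addr.parse().ok()?;
    let len: u32 = len.parse().ok()?;
    if !(16..=32).contains(&len) {
        return None;
    }
    let host_bits = 32 - len;
    // Netmask with `len` leading one-bits; host_bits <= 16, so the shift is safe.
    let mask = u32::MAX << host_bits;
    // Normalise to the network address in case the input had host bits set.
    let base = u32::from(addr) & mask;
    let count = 1u32 << host_bits;
    Some((0..count).map(|i| Ipv4Addr::from(base + i)).collect())
}
```

A /22 expands to 1,024 addresses; each would then be probed on port 12443 under the scan's concurrency limit.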
### Patch Deployment Flow (Queued) ### 6.3 Patch Deployment — Queued
``` ```
1. Operator selects hosts + patches → chooses "Queue for next window" 1. Operator selects hosts + patches -> "Queue for next window".
2. Axum creates patch job in PostgreSQL (status: queued) 2. Web creates `patch_jobs` row (status = queued) and `patch_job_hosts` rows.
3. When maintenance window opens → Worker triggers patch operations on agents 3. Job Scheduler detects the next applicable maintenance window per host.
4. Worker monitors async job status via agent API 4. At window open, Worker calls the Agent API to start patch operations.
5. Results stored in PostgreSQL → WebSocket relay pushes updates to browser 5. Worker polls agent job status (and/or consumes WebSocket events) and updates rows.
6. Failed jobs auto-retried once if still within window 6. WebSocket Relay pushes updates to subscribed browsers in real time.
7. Failed hosts are auto-retried once if still within the window (see §8).
``` ```
### Patch Deployment Flow (Immediate) ### 6.4 Patch Deployment — Immediate
``` ```
1. Operator selects hosts + patches → chooses "Apply Now" 1. Operator selects hosts + patches -> "Apply Now".
2. Axum creates patch job in PostgreSQL (status: pending) 2. Web creates `patch_jobs` row (status = pending) and NOTIFYs `job_enqueued`.
3. Worker immediately triggers patch operations on agents 3. Worker wakes immediately and triggers the agent calls.
4. Same monitoring and retry logic as queued flow 4. Same monitoring and retry logic as the queued flow.
``` ```
### Health/Patch Polling Flow ### 6.5 Rollback
``` ```
1. Worker polls each agent on schedule (5 min health, 30 min patches) 1. Operator opens a completed or failed job and clicks "Rollback".
2. Results cached in PostgreSQL 2. Web creates a `patch_jobs` row with kind = rollback, parent_job_id = <original>.
3. Unhealthy agents marked with visual alerts in dashboard 3. Worker calls POST /api/v1/jobs/{id}/rollback on each affected agent.
4. On-demand refresh: operator clicks refresh → Worker queries agent immediately 4. Results are tracked like any other job; audit log records the rollback actor.
``` ```
## Technology Stack ### 6.6 Health / Patch Polling
| Layer | Technology | Version/Notes | ```
|-------|-----------|---------------| 1. Worker polls each agent on schedule (5 min health, 30 min patches).
2. Results cached in `host_health_data` and `host_patch_data`.
3. Unhealthy agents are flagged with visual alerts in the Dashboard.
4. On-demand refresh: operator clicks refresh -> Web NOTIFYs `refresh_requested`; Worker queries immediately.
```
---
## 7. Security Architecture
### 7.1 Authentication
- **Local accounts:** Argon2id-hashed passwords; TOTP or WebAuthn for MFA (enforced).
- **Azure SSO:** OAuth2 / OIDC Authorization Code flow with PKCE; Azure's built-in MFA satisfies the MFA requirement.
- **Access tokens:** JWT, signed with **EdDSA / Ed25519**; 15-minute TTL. Signing keys rotated every 90 days with a 24-hour overlap window. The web process holds the signing key; the worker process holds only the verifying (public) key.
- **Refresh tokens:** Opaque, 256-bit, stored hashed in `refresh_tokens`; **1-hour sliding inactivity timeout** (rotated on use; revocable).
- **Revocation:** Admins can force-revoke a user's refresh tokens. Access tokens already issued remain valid until they expire (at most 15 minutes), after which all of that user's sessions end.
### 7.2 Authorization (RBAC)
- **Admin** — Full access to all resources and settings.
- **Operator** — Can add / remove hosts and manage schedules / patches only for devices in their assigned groups.
- **Group scoping** — Enforced by middleware at every API endpoint that touches host-scoped data.
- **Ungrouped hosts** — Accessible by any operator or admin (explicit product decision).
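The scoping rules above reduce to a small predicate. The sketch below is one possible shape for the check the middleware performs; the `Role`, `User`, and `Host` types and the use of numeric group IDs are assumptions made for illustration.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Admin,
    Operator,
}

/// Hypothetical view of a user: role plus assigned group IDs.
pub struct User {
    pub role: Role,
    pub groups: HashSet<u32>,
}

/// Hypothetical view of a host: the groups it belongs to (may be empty).
pub struct Host {
    pub groups: HashSet<u32>,
}

/// Admins see everything; operators see ungrouped hosts and any host
/// that shares at least one group with them.
pub fn can_access(user: &User, host: &Host) -> bool {
    match user.role {
        Role::Admin => true,
        Role::Operator => {
            host.groups.is_empty() || !user.groups.is_disjoint(&host.groups)
        }
    }
}
```

Keeping the rule in one pure function makes it straightforward to unit-test independently of the HTTP layer.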
### 7.3 Agent Communication
- **mTLS** — Client certificate authentication for every agent call and WebSocket.
- **TLS 1.3 only** — Older TLS versions are refused at the Rustls configuration layer.
- **Internal CA** — Manager issues and renews client certificates.
- **Manual distribution** — Server administrators install certs on managed clients; the Manager holds no credentials for managed hosts and cannot push files to them.
### 7.4 Data Protection
- **Encryption at rest** — Provided by the underlying hardware host (infrastructure-level full-disk encryption). The application does not configure or manage disk encryption; this is delegated to the infrastructure layer and satisfies HIPAA / PCI-DSS storage protection requirements.
- **Encryption in transit** — TLS 1.3 for all agent and browser connections.
- **Audit log integrity** — Hash-chained rows (`audit_log.prev_hash`, `audit_log.row_hash`); integrity verified by a periodic check job and on-demand from the UI.
- **Password storage** — Argon2id with per-user salt. Starting parameters: `m_cost = 65536 KiB (64 MiB)`, `t_cost = 3`, `p_cost = 1`; calibrated to land in the 250-500 ms login-latency budget on the target hardware (Intel Xeon, 4 cores, 16 GB RAM). Final calibration result recorded in `system_config`.
- **Secrets on disk** — Configuration secrets (JWT signing key, CA private key, DB password) are stored in `/etc/patch-manager/secrets/` with `0600` permissions, owned by the service user; not committed to the repository.
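The audit-chain check can be illustrated with a short verifier. This is a sketch under stated assumptions: the text names only the `prev_hash` and `row_hash` columns, so the rule that `row_hash` covers `prev_hash` plus a serialized payload is assumed here, and `toy_hash` is a non-cryptographic stand-in (std's `DefaultHasher`) used only so the example runs without external crates. A real deployment would substitute a cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative row shape; `payload` stands in for the serialized event.
pub struct AuditRow {
    pub payload: String,
    pub prev_hash: String,
    pub row_hash: String,
}

/// Returns the index of the first row that breaks the chain, or `None`
/// if every row links to its predecessor and matches its own hash.
pub fn first_broken_row<F>(rows: &[AuditRow], genesis: &str, hash: F) -> Option<usize>
where
    F: Fn(&str, &str) -> String,
{
    let mut expected_prev = genesis.to_string();
    for (i, row) in rows.iter().enumerate() {
        let recomputed = hash(&row.prev_hash, &row.payload);
        if row.prev_hash != expected_prev || row.row_hash != recomputed {
            return Some(i);
        }
        expected_prev = row.row_hash.clone();
    }
    None
}

/// NOT cryptographic: a placeholder hash so the sketch is self-contained.
pub fn toy_hash(prev: &str, payload: &str) -> String {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Appends a row, linking it to the current tail of the chain.
pub fn append(rows: &mut Vec<AuditRow>, genesis: &str, payload: &str) {
    let prev_hash = rows
        .last()
        .map(|r| r.row_hash.clone())
        .unwrap_or_else(|| genesis.to_string());
    let row_hash = toy_hash(&prev_hash, payload);
    rows.push(AuditRow { payload: payload.to_string(), prev_hash, row_hash });
}
```

Editing any stored payload changes its recomputed hash, so the verifier reports the index of the altered row. This detects tampering after the fact; it does not prevent it.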
### 7.5 Compliance Mapping
- **HIPAA §164.312:** Audit controls (§7.4), access controls (§7.2 + MFA), integrity controls (hash-chained audit), transmission security (TLS 1.3 / mTLS), automatic logoff (1-hour inactivity).
- **PCI-DSS:** Requirement 6 (vulnerability management — core function), Requirement 7 (need-to-know via group scoping), Requirement 8 (MFA, unique IDs), Requirement 10 (audit with 6-month retention), Requirements 3 & 4 (encryption at rest and in transit).
---
## 8. Error Handling and Reliability
### 8.1 Agent Communication Failures
- Mark host as **unhealthy** in the Dashboard.
- Retry with **exponential backoff**: up to **3 retries**, capped at **30 minutes** between attempts (example schedule: 1 min, 5 min, 30 min).
- Continue processing other hosts without blocking.
- After exhausting retries, the host is flagged and reported in the next compliance report.
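The retry policy above can be expressed as a lookup from attempt number to delay. The sketch below encodes the example schedule from the text (1 min, 5 min, 30 min) rather than a computed exponential curve, since the text does not fix a multiplier; `retry_delay` and the constant names are hypothetical.

```rust
use std::time::Duration;

/// Limits from the text: at most 3 retries, never more than 30 minutes apart.
pub const MAX_RETRIES: u32 = 3;
pub const MAX_DELAY: Duration = Duration::from_secs(30 * 60);

/// Example schedule from the text, in seconds: 1 min, 5 min, 30 min.
const EXAMPLE_SCHEDULE_SECS: [u64; 3] = [60, 300, 1800];

/// Delay to wait before retry number `attempt` (1-based).
/// Returns `None` for attempt 0 or once the retry budget is exhausted,
/// at which point the host is flagged for the next compliance report.
pub fn retry_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_RETRIES {
        return None;
    }
    let secs = EXAMPLE_SCHEDULE_SECS[(attempt - 1) as usize];
    // The cap is redundant for this table but guards against edits to it.
    Some(Duration::from_secs(secs).min(MAX_DELAY))
}
```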
### 8.2 Patch Job Failures
- Auto-retry a failed patch job **once** if still within the maintenance window.
- If the retry fails, or the window has closed, surface the failure prominently in the Jobs view and in any configured email notifications.
### 8.3 Batch Operations with Partial Failures
- Auto-retry failed hosts **once**.
- If retry fails, report the failed hosts in the job detail view and let the operator decide next steps.
- Successful hosts complete normally regardless of failures elsewhere in the batch.
### 8.4 API Error Response Format
All Manager API errors use a consistent JSON envelope:
```json
{
"error": {
"code": "host_not_found",
"message": "No host with id 42 in any group you can access.",
"request_id": "01JF8Q...",
"details": {}
}
}
```
HTTP status codes follow standard REST semantics (`400`, `401`, `403`, `404`, `409`, `422`, `429`, `500`, `503`). Every response carries an `X-Request-Id` header to correlate logs and user reports.
### 8.5 Input Validation
- All request bodies are validated with strongly-typed Rust structs (`serde` + `validator`); validation errors return `422` with field-level details.
- FQDNs, IPs, and CIDR ranges are parsed with the standard library / `ipnet` and rejected early.
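Early rejection of malformed CIDR input might look like the following. This is a std-only sketch of the idea, not the `ipnet`-based code the design calls for; `parse_cidr_v4` is a hypothetical helper, and its refusal of ranges with host bits set is a choice made for this example rather than a stated requirement.

```rust
use std::net::Ipv4Addr;

/// Parses an IPv4 CIDR string such as "10.20.0.0/22".
/// Rejects missing prefixes, prefixes above 32, and (by choice in this
/// sketch) addresses that have bits set outside the network mask.
pub fn parse_cidr_v4(input: &str) -> Result<(Ipv4Addr, u8), String> {
    let (addr, prefix) = input
        .split_once('/')
        .ok_or_else(|| format!("missing '/' in {input}"))?;
    let addr: Ipv4Addr = addr.parse().map_err(|e| format!("bad address: {e}"))?;
    let prefix: u8 = prefix.parse().map_err(|e| format!("bad prefix: {e}"))?;
    if prefix > 32 {
        return Err(format!("prefix out of range: {prefix}"));
    }
    // A shift by 32 would overflow, so /0 is handled separately.
    let mask: u32 = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
    if u32::from(addr) & !mask != 0 {
        return Err(format!("host bits set in {input}"));
    }
    Ok((addr, prefix))
}
```

A handler would map the `Err` case to a `422` response with the message placed in the field-level details.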
---
## 9. Technology Stack
| Layer | Technology | Notes |
|-------|-----------|-------|
| Backend | Rust + Axum | Tokio async runtime, Tower middleware | | Backend | Rust + Axum | Tokio async runtime, Tower middleware |
| Database | PostgreSQL | SQLx for type-safe queries, migrations via sqlx-cli | | Database | PostgreSQL 16+ | SQLx for type-safe queries; migrations via `sqlx-cli` |
| Frontend | React + TypeScript | Vite build tooling | | Frontend | React 18+ + TypeScript | Vite build tooling |
| UI Components | MUI (Material UI) | Enterprise dashboard components, dark mode, theming | | UI Components | MUI (Material UI) | Enterprise dashboard components, dark mode, theming |
| WebSocket | Axum native WebSocket | Agent Manager Browser relay | | WebSocket | Axum native WebSocket | Agent -> Manager -> Browser relay |
| Auth (Local) | Argon2 password hashing + TOTP/WebAuthn | MFA enforcement | | Auth (Local) | Argon2id + TOTP / WebAuthn | MFA enforcement |
| Auth (SSO) | OAuth2/OIDC via Azure AD | Optional, with Azure MFA | | Auth (SSO) | OAuth2 / OIDC (Azure AD) | Optional; Azure MFA |
| Session | JWT (access) + PostgreSQL (refresh) | 15 min access, 1 hr refresh | | Session | JWT (access) + DB-backed refresh | 15-min access, 1-hr inactivity refresh |
| mTLS Client | Rustls + client certs | TLS 1.3 only | | mTLS Client | Rustls + client certs | TLS 1.3 only |
| Internal CA | Rustls/RCGen | Certificate issuance and renewal | | Internal CA | Rustls / `rcgen` | Certificate issuance and renewal |
| Email | Lettre (Rust email crate) | Optional, disabled by default | | Email | Lettre | Optional; disabled by default |
| PDF Export | Rust PDF generation crate | Compliance and audit reports | | PDF Export | `printpdf` + `plotters` | In-process pure-Rust PDF + charts; no sidecar |
| CSV Export | Rust CSV crate | Data export for all report types | | CSV Export | `csv` crate | Data export for all report types |
| Service Management | systemd | Ubuntu 24.04 | | Service Management | systemd | Ubuntu 24.04 |
| Static Files | Axum built-in static file serving | React SPA served directly | | Static Files | Axum built-in static serving | React SPA served directly |
| Logging / Tracing | `tracing` + `tracing-subscriber` (JSON) | Structured logs |
## Security Architecture ---
### Authentication ## 10. Deployment Architecture
- **Local accounts:** Argon2-hashed passwords + TOTP or WebAuthn for MFA
- **Azure SSO:** OAuth2/OIDC flow with Azure AD, using Azure's built-in MFA
- **Session tokens:** Short-lived JWT (15 min) for API access, server-side refresh tokens (1 hr inactivity timeout)
- **Refresh token revocation:** Stored in PostgreSQL, can be immediately revoked for forced logout
### Authorization (RBAC)
- **Admin:** Full access to all resources and settings
- **Operator:** Can add/remove clients, manage schedules and patches only for devices in their group memberships
- **Group scoping:** Operators can only interact with hosts in their assigned groups
- **Ungrouped hosts:** Accessible by any operator or admin
### Agent Communication
- **mTLS:** Client certificate authentication for all agent communication
- **TLS 1.3 only:** No older TLS versions
- **Internal CA:** Patch Manager manages CA, issues and renews client certificates
- **Manual distribution:** Server administrators manually install certs on managed clients
### Data Protection
- **Encryption at rest:** LUKS full-disk encryption (infrastructure-managed)
- **Encryption in transit:** TLS 1.3 for all connections (agent and web UI)
- **Audit log integrity:** Tamper-evident logging (hash chaining)
- **Password storage:** Argon2 with salt
### Compliance
- **HIPAA:** Audit controls, access controls, integrity controls, transmission security, automatic logoff
- **PCI-DSS:** Vulnerability management (core function), access restrictions, user identification, audit tracking, data protection
## Deployment Architecture
``` ```
┌─────────────────────────────────────────┐ +---------------------------------------------+
Patch Manager Host (Ubuntu 24.04) | Patch Manager Host (Ubuntu 24.04, bare |
| metal or VM) |
┌─────────────────────────────────────┐ │ | |
│ │ systemd: patch-manager-web │ | +---------------------------------------+ |
│ (Axum web server + static files) | | systemd: patch-manager-web.service | |
└─────────────────────────────────────┘ │ | | (Axum web server + static SPA) | |
| | Listens: 443/tcp (HTTPS, TLS 1.3) | |
┌─────────────────────────────────────┐ | +---------------------------------------+ |
systemd: patch-manager-worker | |
│ │ (Background polling + jobs) │ | +---------------------------------------+ |
└─────────────────────────────────────┘ │ | | systemd: patch-manager-worker.service | |
| | (Background polling + jobs) | |
┌─────────────────────────────────────┐ │ | | No inbound listener | |
│ │ PostgreSQL │ | +---------------------------------------+ |
(Database) │ │ | |
└─────────────────────────────────────┘ | +---------------------------------------+ |
| | systemd: postgresql.service | |
┌─────────────────────────────────────┐ │ | | (Local, Unix socket or 127.0.0.1) | |
│ Internal CA │ │ | +---------------------------------------+ |
(Certificate management) | |
└─────────────────────────────────────┘ | +---------------------------------------+ |
| | /etc/patch-manager/ | |
┌─────────────────────────────────────┐ │ | | config.toml, secrets/*, ca/* | |
│ LUKS (Full-disk encryption) │ │ | +---------------------------------------+ |
(Infrastructure-managed) | |
└─────────────────────────────────────┘ │ | Hardware-host full-disk encryption (infra) |
└─────────────────────────────────────────┘ +---------------------------------------------+
``` ```
- Two systemd services: `patch-manager-web` and `patch-manager-worker` - Two systemd services: `patch-manager-web` and `patch-manager-worker`; independent restart and logging.
- PostgreSQL runs on the same host - PostgreSQL runs on the same host; connections via Unix domain socket.
- Internal CA runs on the same host - Internal CA material lives in `/etc/patch-manager/ca/` with `0600` permissions.
- LUKS full-disk encryption managed by infrastructure - No Docker / LXC in production — bare-metal / VM deployment. Containerized **development** environments are acceptable and do not affect production design.
- No Docker/LXC — bare metal/VM deployment - Internal network only — no public internet exposure. Ingress limited to the Manager's HTTPS port; egress to agents on `12443` and, optionally, Azure AD / SMTP.
- Internal network only — no public internet exposure
## Scalability ### 10.1 Configuration
- **Single-instance design:** Supports 500 typical hosts, up to 2,500 - Primary config file: `/etc/patch-manager/config.toml` (non-secret tunables: bind address, DB URL, polling intervals, concurrency caps, log level, feature flags).
- **Manual horizontal scaling:** Divide clients between multiple Patch Manager hosts if needed - Secrets: separate files in `/etc/patch-manager/secrets/` referenced by path from the config — never inlined.
- **Connection pooling:** Axum handles thousands of concurrent connections with Tokio - Environment variables may override any config key (`PATCH_MANAGER__SECTION__KEY`) for operator convenience; env-based overrides are logged at startup.
- **Background worker:** Independent scaling of polling/jobs from web serving - Runtime-tunable values (polling intervals, Azure SSO settings) are stored in `system_config` and editable from the Settings page; static values (bind address, DB URL) require a service restart.
- **Database:** PostgreSQL handles the workload easily on a single host
- **No automatic clustering or load balancing required**
## Integration Points ### 10.2 Database Migrations
**Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api) - Managed with `sqlx migrate`; migration files live under `migrations/` and are embedded into the web binary via `sqlx::migrate!`.
- Applied on web-process startup; a PostgreSQL advisory lock ensures only one instance runs migrations at a time.
- Worker process waits for the expected schema version before accepting work (`SELECT version FROM _sqlx_migrations ORDER BY installed_on DESC LIMIT 1`).
### 10.3 Backup and Disaster Recovery
- **Database:** Nightly `pg_dump` to `/var/backups/patch-manager/`, with an external copy to an encrypted off-host location (operator-configured).
- **CA material:** Included in the nightly backup; treated as highest-sensitivity.
- **Configuration:** `/etc/patch-manager/` included in the backup, excluding secret files unless the backup destination is encrypted.
- **Restore procedure:** Documented in `docs/runbooks/restore.md` (to be created during implementation).
- **RPO target:** 24 hours. **RTO target:** 4 hours on comparable hardware.
---
## 11. Scalability
- **Single-instance design:** Supports ~500 typical hosts comfortably, tested target up to 2,500.
- **Sizing basis:** 2,500 hosts × one health poll / 5 min = ~8.3 req/s average; 2,500 × one patch poll / 30 min = ~1.4 req/s; bursts during maintenance windows bounded by the worker semaphore (default 64 concurrent calls). These rates are trivial for Axum + Tokio on the target hardware (Intel Xeon, 4 cores, 16 GB RAM).
- **Manual horizontal scaling:** Divide the fleet between multiple Manager hosts if the fleet grows beyond 2,500. There is no automatic sharding.
- **Connection pooling:** SQLx `PgPool` (default 20 connections, tunable) shared across request handlers.
- **Background worker:** Independent process — its polling load does not compete with user request latency.
- **No automatic clustering or load balancing.** Multi-instance deployments are explicitly out of scope.
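The sizing figures above follow from dividing the fleet size by each polling interval. A small sketch makes the arithmetic reproducible for other fleet sizes; `avg_requests_per_sec` is a hypothetical helper name.

```rust
/// Average outbound request rate when every host is polled once per interval.
pub fn avg_requests_per_sec(hosts: u32, interval_secs: u32) -> f64 {
    f64::from(hosts) / f64::from(interval_secs)
}
```

For 2,500 hosts this gives about 8.3 req/s for the 5-minute health poll and about 1.4 req/s for the 30-minute patch poll, roughly 9.7 req/s combined on average.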
---
## 12. Integration Points
**Upstream dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api)
| Integration | Protocol | Direction | Purpose | | Integration | Protocol | Direction | Purpose |
|-------------|----------|-----------|----------| |-------------|----------|-----------|---------|
| Agent REST API | HTTPS/mTLS (TLS 1.3) | Manager Agent | Queries, patch operations | | Agent REST API | HTTPS / mTLS (TLS 1.3) on port 12443 | Manager -> Agent | Queries and patch operations |
| Agent WebSocket | WSS/mTLS | Agent Manager | Real-time job status streaming | | Agent WebSocket | WSS / mTLS on port 12443 | Agent -> Manager | Real-time job status streaming |
| Azure AD | HTTPS/OAuth2 | Manager Azure | SSO authentication (optional) | | Azure AD | HTTPS / OAuth2 / OIDC | Manager -> Azure | SSO authentication (optional) |
| SMTP | SMTPS | Manager -> SMTP relay | Optional email notifications |
**API Endpoints Used:** ### 12.1 Agent API Endpoints Consumed
- `GET /api/v1/health` — Agent health checks
- `GET /api/v1/system/info` — Host system information - `GET /api/v1/health` — Agent health check
- `GET /api/v1/packages` — List installed packages - `GET /api/v1/system/info` — Host system information
- `GET /api/v1/patches` — List available patches - `GET /api/v1/packages` — List installed packages
- `GET /api/v1/patches` — List available patches
- `POST /api/v1/patches/apply` — Apply patches - `POST /api/v1/patches/apply` — Apply patches
- `PUT /api/v1/packages/{name}` — Update specific package - `PUT /api/v1/packages/{name}` — Update a specific package
- `DELETE /api/v1/packages/{name}` — Remove package - `DELETE /api/v1/packages/{name}` — Remove a package
- `POST /api/v1/packages` — Install packages - `POST /api/v1/packages` — Install packages
- `GET /api/v1/jobs` — List jobs - `GET /api/v1/jobs` — List jobs
- `GET /api/v1/jobs/{id}` — Get job status - `GET /api/v1/jobs/{id}` — Get job status
- `POST /api/v1/jobs/{id}/rollback` — Rollback a job - `POST /api/v1/jobs/{id}/rollback` — Rollback a job
- `POST /api/v1/system/reboot` — Reboot host - `POST /api/v1/system/reboot` — Reboot host
- `WebSocket /api/v1/ws/jobs` — Real-time job status - `WS /api/v1/ws/jobs` — Real-time job status
## Monitoring and Observability ### 12.2 Manager's Own API Surface (selected)
- **Application logging:** Structured JSON logging (tracing crate) - `POST /api/v1/auth/login`, `POST /api/v1/auth/refresh`, `POST /api/v1/auth/logout`
- **Log levels:** Configurable at runtime (DEBUG, INFO, WARN, ERROR) - `POST /api/v1/auth/mfa/totp/setup`, `POST /api/v1/auth/mfa/webauthn/register`
- **Health endpoint:** `GET /api/v1/health` on the Patch Manager's own API for infrastructure monitoring - `GET /api/v1/hosts`, `POST /api/v1/hosts`, `GET /api/v1/hosts/{id}`, `DELETE /api/v1/hosts/{id}`
- **Dashboard alerts:** Visual indicators for unhealthy/unreachable agents (red/yellow status) - `POST /api/v1/discovery/cidr`
- **Audit logging:** All significant events logged to PostgreSQL with tamper-evident hash chaining - `GET /api/v1/groups`, `POST /api/v1/groups`, …
- **No external monitoring integration required** (dashboard-only alerts) - `GET /api/v1/jobs`, `POST /api/v1/jobs` (queue / immediate), `POST /api/v1/jobs/{id}/rollback`
- `GET /api/v1/reports/compliance`, `GET /api/v1/reports/patch-history`, `GET /api/v1/reports/audit` (with `?format=csv|pdf`)
- `GET /api/v1/ca/root.crt`, `GET /api/v1/hosts/{id}/client.crt`
- `POST /api/v1/ws/ticket`, `WS /api/v1/ws/jobs?ticket=...`
- `GET /status/health` — **Manager's own** unauthenticated liveness endpoint (distinct namespace from the agent's `/api/v1/health`)
---
## 13. Monitoring and Observability
- **Structured logging:** JSON lines via the `tracing` crate; one field schema for both services.
- **Log levels:** Configurable at runtime (`DEBUG`, `INFO`, `WARN`, `ERROR`) per module.
- **Request correlation:** Every HTTP request is tagged with `request_id` (ULID), propagated into logs and error responses.
- **Liveness / readiness:** `GET /status/health` on the Manager (unauthenticated, Manager's own namespace — do not confuse with the agent's `/api/v1/health`). Returns `200` when the process can reach the database and worker heartbeat is fresh.
- **Worker heartbeat:** Worker writes a row to `worker_heartbeat` every 30 seconds; the web process surfaces stale heartbeats as a banner alert.
- **Dashboard alerts:** Visual indicators for unhealthy / unreachable agents (red / yellow status).
- **Audit logging:** All significant events logged to PostgreSQL with tamper-evident hash chaining.
- **Optional metrics (future):** `tracing` lends itself to an OpenTelemetry exporter; Prometheus scrape endpoint at `/metrics` is a candidate future addition (see §17). Not required for v0.0.x.
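The stale-heartbeat check behind the banner alert could be as simple as comparing the age of the latest `worker_heartbeat` row against a tolerance. The 30-second interval comes from the text; the tolerance of three missed beats is an assumption made for this sketch, as are the names.

```rust
use std::time::Duration;

/// Interval at which the worker writes its heartbeat row (from the text).
pub const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(30);

/// ASSUMPTION: how many intervals may pass before the heartbeat counts
/// as stale. The document does not specify a threshold.
pub const MISSED_BEATS_ALLOWED: u32 = 3;

/// Returns true when the most recent heartbeat is older than the tolerance.
pub fn heartbeat_is_stale(age: Duration) -> bool {
    age > HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED
}
```

The same predicate could feed both the banner alert and the worker-freshness part of `GET /status/health`.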
---
## 14. Design Rationale
- **Why Rust + Axum, not Node / Go / Python?** A patch manager is a high-trust, long-running administrative control plane. Memory safety and strong typing are high-value there; Rust's async story via Tokio is mature; Axum keeps the HTTP layer thin and composable. Aligning with the upstream Agent API's stack also reduces cognitive load for maintainers.
- **Why a single process per role (web + worker), not monolith or microservices?** A monolith couples polling jitter into request latency; microservices require a broker and more operational surface area than a fleet of ≤2,500 agents justifies. Two processes + PostgreSQL coordination is the smallest design that satisfies the non-functional requirements.
- **Why PostgreSQL as the queue?** At our scale (tens of req/s), PostgreSQL's `LISTEN/NOTIFY` plus `SELECT ... FOR UPDATE SKIP LOCKED` is more than sufficient and avoids introducing Redis or a dedicated broker as a second stateful dependency.
- **Why no automatic cert distribution?** Pushing certificates onto managed hosts would require elevated credentials on those hosts, materially expanding the Manager's blast radius. Manual distribution is a deliberate least-privilege choice.
- **Why hardware-host encryption and not column-level?** The hardware host provides full-disk encryption transparently at a layer below the OS, covering every byte — PostgreSQL data, WAL, backups, temporary files, logs, and swap — with zero application complexity. Column-level encryption would duplicate protection for some data, leave other data unprotected, and add key-management burden without improving the compliance posture on a single-host deployment.
- **Why URL path versioning (`/api/v1/…`)?** It is explicit, easy to operate behind a proxy, matches the Agent API, and makes breaking-change boundaries unambiguous.
- **Why JWT + refresh, not session cookies only?** Short-lived JWTs keep the authorization path stateless and cheap; refresh tokens give admins a server-side revocation hook. Inactivity timeout comes from the refresh token, not the JWT.
---
## 15. Risks and Trade-offs
| # | Risk / Trade-off | Mitigation |
|---|------------------|------------|
| R-01 | Single-host deployment = single point of failure | Documented backup/restore (§10.3); operator may run a warm standby restored from nightly backups |
| R-02 | PostgreSQL as queue has lower throughput ceiling than a dedicated broker | Bounded-scope design (≤2,500 agents); revisit if scale expands |
| R-03 | Manual cert distribution creates human error risk | Clear UX: per-host download, audit log records who downloaded which cert and when |
| R-04 | Hash-chained audit is tamper-evident but not tamper-proof | Document that integrity checks detect — not prevent — tampering; recommend off-host log shipping for high-assurance environments |
| R-05 | Hardware-host encryption does not protect running-process memory | Out of scope; treated as an OS / hypervisor / hardware concern |
| R-06 | WebSocket ticket pattern adds a round-trip | Acceptable; keeps WS auth simple and avoids query-string JWT exposure in access logs |
| R-07 | Configuration via TOML + env overrides can be surprising | Startup log dumps the effective config (redacting secrets) |
| R-08 | Agent API changes could break the Manager | Pin to `/api/v1/`; integration tests run against a known Agent version |
---
## 16. Open Issues
| # | Issue | Owner | Target |
|---|-------|-------|--------|
| OI-01 | **CLOSED** — Encryption at rest delegated to hardware-host (infrastructure-level). `REQUIREMENTS.md` v0.0.2 and `SPEC.md` v0.0.2 updated to match. No OS-level LUKS; no column-level encryption. | — | Closed 2026-04-23 |
| OI-02 | **CLOSED** — Argon2id starting parameters: `m_cost = 65536 KiB (64 MiB)`, `t_cost = 3`, `p_cost = 1`; targets ~400 ms on Intel Xeon 4-core / 16 GB RAM. Final calibration performed at deploy time and recorded in `system_config`. | — | Closed 2026-04-23 |
| OI-03 | **CLOSED** — JWT signing algorithm: **EdDSA / Ed25519**. Keys rotated every 90 days with a 24-hour overlap window; signing key lives with web process, verifying key published to worker. | — | Closed 2026-04-23 |
| OI-04 | **CLOSED** — CIDR scan defaults: concurrency = **128**, per-host TCP+TLS probe timeout = **1.5 s**. Sized to complete a `/22` (~1,024 hosts) across sites in under 10 s. Progress UI and cancel action are required (NFR-05). | — | Closed 2026-04-23 |
| OI-05 | **CLOSED** — PDF generation: **`printpdf`** for document layout, **`plotters`** for charts. Both are in-process pure-Rust crates; no sidecar required. Company branding and digital signatures are not required. | — | Closed 2026-04-23 |
| OI-06 | **CLOSED** — `/status/health` is Manager-only minimal liveness (web up, DB reachable, worker heartbeat fresh), unauthenticated. Fleet aggregates exposed on authenticated **`/api/v1/status/fleet`** to avoid leaking fleet size to unauthenticated probes. | — | Closed 2026-04-23 |
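For the OI-04 defaults, a rough upper bound on scan time can be worked out by treating the scan as successive waves of concurrent probes. The sketch below is a simplified model with hypothetical helper names; it ignores connection setup overhead and assumes every probe in a wave takes the same time.

```rust
/// Number of addresses in an IPv4 block with the given prefix length
/// (valid for prefixes 0 through 32).
pub fn address_count(prefix: u8) -> u64 {
    1u64 << (32 - u32::from(prefix))
}

/// Scan duration if every probe runs for the full timeout, modelled as
/// ceil(addresses / concurrency) waves of `timeout_secs` each.
pub fn worst_case_scan_secs(addresses: u64, concurrency: u64, timeout_secs: f64) -> f64 {
    let waves = (addresses + concurrency - 1) / concurrency;
    waves as f64 * timeout_secs
}
```

Under this model a `/22` is 1,024 addresses, or 8 waves at concurrency 128. If every probe ran to the 1.5 s timeout the scan would take about 12 s, so the sub-10 s figure appears to rely on a share of probes answering or being refused before the timeout. That is one more reason the progress UI and cancel action matter.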
---
## 17. Future Considerations (non-binding)
- Prometheus `/metrics` endpoint and OpenTelemetry traces.
- Optional webhook / Slack notifier (currently out of scope).
- Multi-instance active/passive failover using PostgreSQL streaming replication.
- CRL or OCSP responder for the internal CA (currently: revocation by re-issuance + `certificates.revoked_at`).
- Automated cert distribution via an opt-in agent endpoint (requires Agent API change; pure opt-in with operator approval).
- Per-group maintenance-window templates to reduce per-host configuration effort.
---
## 18. Change Log (this review pass)
| # | Change | Reason |
|---|--------|--------|
| C-01 | Renamed title to "Software Design Document (SDD)" and added Document Control + Revision History | Aligns with IEEE 1016; establishes versioning discipline |
| C-02 | Added §1 Introduction (Purpose, Scope, Audience, Conventions, References, Glossary) | Standard SDD front matter was missing |
| C-03 | Added §2 Stakeholders and Design Concerns | IEEE 1016 viewpoint prerequisite; clarifies who the design serves |
| C-04 | Replaced Unicode box-drawing in diagrams with pure ASCII and fixed misaligned borders in the original logical view | Original diagram (lines 26-73 of v0.0.1) had truncated right borders and an ambiguous bidirectional arrow between the web-server mTLS client and the worker's retry engine, which did not match the described data flow |
| C-05 | Split the single architecture diagram into Context View (§4.1), Logical View (§4.2), Deployment View (§4.3), and Process View (§4.4) | Matches IEEE 1016 viewpoint model; each diagram now has a single responsibility |
| C-06 | Numbered architecture decisions (AD-01 … AD-14) and added AD-08 (PG `LISTEN/NOTIFY` coordination), AD-12 (API versioning), AD-13 (TLS 1.3), AD-14 (observability) | Original table had implicit/overlapping decisions; numbering enables cross-reference; added decisions were previously only implied |
| C-07 | Clarified Web ↔ Worker coordination uses `LISTEN/NOTIFY` + `SELECT ... FOR UPDATE SKIP LOCKED` | Original said the worker "reads job queue from PostgreSQL" without specifying how it wakes for immediate-apply jobs; this would have left implementation undefined |
| C-08 | Added concurrency bound (default 64 concurrent agent calls via Tokio `Semaphore`) | Polling 2,500 agents without bounds would exhaust FDs and network resources; bound was a known implicit requirement |
| C-09 | Clarified API-versioning statement: Manager's own API uses `/api/v1/`; this is independent of the Agent API version even though the convention matches | Original text conflated the two, creating ambiguity about what "v1" refers to |
| C-10 | Added explicit WebSocket authentication flow (single-use ticket from `POST /api/v1/ws/ticket`) | Original listed "WebSocket Relay" but did not specify browser-side authentication, leaving a security gap in the design |
| C-11 | Added §6.5 Rollback data flow | REQUIREMENTS FR-03 calls for rollback support, but the original SDD had no rollback flow |
| C-12 | Expanded §7 Security: Argon2id (not just "Argon2"), rotating JWT signing key, refresh-token rotation on use, secret storage paths/permissions, audit-chain verification | Tightens vague or missing details; aligns with HIPAA/PCI-DSS control expectations |
| C-13 | v0.0.2 committed to LUKS-only for encryption at rest and flagged `REQUIREMENTS.md` inconsistency as OI-01. v0.0.3 supersedes this: encryption at rest is now delegated to the hardware host (see C-24). | The v0.0.2 commitment was based on a prior LUKS mandate; updated operator guidance from Kelly replaces OS-level LUKS with hardware-host encryption |
| C-24 | (v0.0.3) Replaced OS-level LUKS with hardware-host full-disk encryption throughout AD-10, §4.2, §4.3, §5.5, §7.4, §10, §14, §15 | Kelly directed that encryption at rest is handled by the hardware host; preserves compliance intent while reducing operational burden on the guest OS |
| C-25 | (v0.0.3) Closed OI-01 through OI-06 with concrete decisions in §16 | Implementer needs unambiguous values; closing OIs finalizes SDD for v0.1.0 planning |
| C-26 | (v0.0.3) Added AD-15 (Web UI TLS cert strategy), AD-16 (Azure SSO / SMTP runtime config GUI), AD-17 (PDF stack), AD-18 (IP whitelist enforcement) | Captures new binding decisions; AD-18 reflects the standing IP-whitelist security mandate that was previously implicit |
| C-27 | (v0.0.3) `REQUIREMENTS.md` bumped to 0.0.2: added FR-07 (System Configuration), NFR updates for Argon2id / EdDSA / CIDR timing, IP whitelist, TLS 1.3 on web UI | Brings REQUIREMENTS into line with SDD; adds previously-implicit configuration-GUI requirements |
| C-28 | (v0.0.3) `SPEC.md` bumped to 0.0.2: portable ASCII diagram, expanded Settings page scope, TLS 1.3 explicit, IP whitelist, hardware-host encryption note | Three-document alignment across REQUIREMENTS / SPEC / ARCHITECTURE |
| C-29 | (v0.0.3) Added `system_config` as a runtime-tunable table reference throughout | Runtime configuration via Settings GUI requires a persistent store for tunable values |
| C-30 | (v0.0.3) Added progress / cancel requirement for long-running scans aligned with NFR-05 | 10-second `/22` scan target plus operator UX demands explicit progress feedback |
| C-14 | Added §8.4 API Error Response Format and `X-Request-Id` correlation | Error schema was undefined, making client-side handling and log correlation unreliable |
| C-15 | Added §10.1 Configuration, §10.2 Database Migrations, §10.3 Backup / DR | Production deployment concerns entirely absent from v0.0.1; each is required by enterprise operations and by compliance audit |
| C-16 | Clarified "No Docker/LXC" applies to production; development may use containers | Original blanket statement conflicted with the actual development environment and would confuse contributors |
| C-17 | Added sizing basis (req/s math) to §11 Scalability | Original claim of "supports 2,500 hosts" had no justification; now traceable |
| C-18 | Separated Manager's liveness endpoint (`/status/health`) from the Agent's `/api/v1/health` in §12 and §13 | Original used `/api/v1/health` for both, creating an endpoint-namespace collision and ambiguity |
| C-19 | Added §12.2 Manager's Own API Surface | Original documented only the Agent endpoints consumed; the Manager's own API was undocumented |
| C-20 | Added §13 worker heartbeat mechanism and request correlation | Needed to detect a dead worker process; otherwise the system could silently stop processing jobs |
| C-21 | Added §14 Design Rationale, §15 Risks and Trade-offs, §16 Open Issues, §17 Future Considerations | IEEE 1016 §7 (Design Rationale) was missing; risks and open issues give reviewers a clear audit surface |
| C-22 | Replaced the Email Notifier arrow that pointed back into the web server's mTLS client on the original diagram with a correct component placement in §4.2 | Original diagram implied email flowed through the mTLS client, which is not the design |
| C-23 | Added C-X change IDs throughout this log | Enables traceability in future reviews |

@ -1,8 +1,28 @@
# Linux_Patch_Manager - Requirements Document # Linux_Patch_Manager — Requirements Document
## Document Control
| Field | Value |
|-------|-------|
| Title | Linux_Patch_Manager — Requirements Document |
| Version | 0.0.2 |
| Status | Draft |
| Last Updated | 2026-04-23 |
| Related Docs | `SPEC.md`, `ARCHITECTURE.md`, `README.md` |
### Revision History
| Version | Date | Summary |
|---------|------|---------|
| 0.0.1 | 2026-04-21 | Initial draft |
| 0.0.2 | 2026-04-23 | Aligned with SDD v0.0.3: hardware-host encryption at rest (no OS-level LUKS), Argon2id, EdDSA JWTs, Azure SSO configuration GUI, web-UI TLS cert strategy, SMTP runtime configurability |
---
## Project Overview ## Project Overview
**Title:** Linux_Patch_Manager **Title:** Linux_Patch_Manager
**Version:** 0.0.1 **Description:** Enterprise-class, secure, web-based management interface for controlling patching and updates on Linux servers and workstations
**Version:** 0.0.2
**Status:** Draft **Status:** Draft
## Functional Requirements ## Functional Requirements
@ -44,7 +64,8 @@
- Compliance report: percentage of hosts fully patched, by group or fleet-wide - Compliance report: percentage of hosts fully patched, by group or fleet-wide
- Patch history: log of all patch operations per host or per group - Patch history: log of all patch operations per host or per group
- Vulnerability exposure: hosts with known CVEs pending patches - Vulnerability exposure: hosts with known CVEs pending patches
- Audit trail: who did what when (user actions, patch operations) - Audit trail: who did what, when (user actions, patch operations)
- Charts and graphs required in PDF exports (compliance trends, patch-status distributions)
- Export formats: CSV and PDF - Export formats: CSV and PDF
### FR-06: User Management ### FR-06: User Management
@ -56,18 +77,30 @@
- Azure SSO integration (optional, with Azure's built-in MFA) - Azure SSO integration (optional, with Azure's built-in MFA)
- Group membership management for users and hosts - Group membership management for users and hosts
### FR-07: System Configuration
- Azure SSO configuration GUI in the Settings page (tenant ID, client ID, client secret, redirect URI, scopes)
- "Test connection" action in the Azure SSO config GUI that performs a round-trip against Azure AD and reports success/failure without enabling SSO
- SMTP configuration GUI (host, port, auth mode, username/password, TLS mode, from-address); disabled by default
- "Send test email" action in the SMTP config GUI
- Polling-interval tuning (health and patch pollers)
- Web UI TLS certificate strategy selection: self-signed from the internal CA (default) or operator-supplied certificate/key (e.g., existing infrastructure wildcard)
## Non-Functional Requirements ## Non-Functional Requirements
### NFR-01: Security ### NFR-01: Security
- Combination authentication: local accounts + Azure SSO - Combination authentication: local accounts + Azure SSO
- MFA required for all users (TOTP or WebAuthn; Azure MFA for SSO users) - MFA required for all users (TOTP or WebAuthn; Azure MFA for SSO users)
- Session management: short-lived JWT access tokens (15 min) + server-side refresh tokens (1-hour inactivity timeout, revocable) - Password hashing: **Argon2id**
- mTLS for all agent communication (certificate-based, TLS 1.3 only) - Session management: short-lived JWT access tokens (15 min, signed with **EdDSA / Ed25519**) + server-side opaque refresh tokens (1-hour inactivity timeout, rotated on use, revocable)
- HTTPS enforced for web UI - JWT signing key rotation every 90 days with a 24-hour overlap window for in-flight tokens
- mTLS for all agent communication (certificate-based, **TLS 1.3 only**)
- HTTPS enforced for web UI (TLS 1.3 only)
- Internal CA managed by Patch Manager for mTLS certificate issuance and renewal - Internal CA managed by Patch Manager for mTLS certificate issuance and renewal
- Certificate distribution to managed clients is manual (server administrators responsible) - Certificate distribution to managed clients is manual (server administrators responsible)
- RBAC with group-scoped access control - RBAC with group-scoped access control
- IP whitelist enforcement on all connection points
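The 90-day rotation with a 24-hour overlap can be read as a rule for which key IDs a verifier accepts at a given moment. The following is a minimal sketch of that rule only, not the project's implementation: `KeyRecord`, `rotation_due`, and `accepted_kids` are hypothetical names, timestamps are plain Unix seconds, and no actual Ed25519 signing or verification is performed.

```rust
use std::time::Duration;

/// Rotation period and overlap window taken from the requirement above.
const ROTATION_PERIOD: Duration = Duration::from_secs(90 * 24 * 60 * 60);
const OVERLAP_WINDOW: Duration = Duration::from_secs(24 * 60 * 60);

/// Hypothetical record for one signing key. `activated_at` is the Unix time
/// (seconds) at which the key became the current signing key.
#[derive(Debug, Clone, PartialEq)]
pub struct KeyRecord {
    pub kid: String,
    pub activated_at: u64,
}

/// True once the current signing key has been active for 90 days or more.
pub fn rotation_due(current: &KeyRecord, now: u64) -> bool {
    now.saturating_sub(current.activated_at) >= ROTATION_PERIOD.as_secs()
}

/// Key IDs a verifier should accept at `now`.
/// `keys` must be sorted by `activated_at`, oldest first.
pub fn accepted_kids(keys: &[KeyRecord], now: u64) -> Vec<&str> {
    let mut accepted = Vec::new();
    for (i, key) in keys.iter().enumerate() {
        if key.activated_at > now {
            continue; // key is staged but not active yet
        }
        match keys.get(i + 1) {
            // A successor is already active: accept the old key only
            // while the overlap window is still open.
            Some(next) if next.activated_at <= now => {
                if now - next.activated_at < OVERLAP_WINDOW.as_secs() {
                    accepted.push(key.kid.as_str());
                }
            }
            // No active successor: this is the current signing key.
            _ => accepted.push(key.kid.as_str()),
        }
    }
    accepted
}
```

Under this reading, a token signed just before a rotation stays verifiable for up to 24 hours afterwards, which comfortably covers the 15-minute access-token lifetime.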
### NFR-02: Performance ### NFR-02: Performance
@ -75,6 +108,8 @@
- Dashboard load time under 5 seconds for full fleet view - Dashboard load time under 5 seconds for full fleet view
- Background polling must not degrade UI responsiveness - Background polling must not degrade UI responsiveness
- Concurrent batch operations (e.g., patch 500 hosts simultaneously) must not overwhelm the system - Concurrent batch operations (e.g., patch 500 hosts simultaneously) must not overwhelm the system
- Login latency budget: 250-500 ms on target hardware (Intel Xeon, 4 cores, 16 GB RAM); Argon2id parameters calibrated to land in this window
- CIDR auto-discovery of a `/22` network (~1,024 hosts) across sites completes within 10 seconds wall-clock
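As a sanity check on the 10-second target, the worst-case scan time can be bounded with simple arithmetic. The sketch below assumes the scan parameters stated in the commit description for OI-04 (concurrency 128, per-host timeout 1.5 s); the function names are hypothetical and no network I/O is performed.

```rust
use std::time::Duration;

/// Number of IPv4 addresses in a network with the given prefix length.
pub fn address_count(prefix_len: u8) -> u64 {
    assert!(prefix_len <= 32, "IPv4 prefix length must be 0..=32");
    1u64 << (32 - u32::from(prefix_len))
}

/// Upper bound on scan time if every probe runs to the full per-host
/// timeout and at most `concurrency` probes are in flight at once.
pub fn worst_case_scan(prefix_len: u8, concurrency: u64, per_host_timeout: Duration) -> Duration {
    assert!(concurrency > 0, "concurrency must be positive");
    // Each "wave" of `concurrency` probes takes at most one timeout.
    let waves = address_count(prefix_len).div_ceil(concurrency);
    per_host_timeout * waves as u32
}
```

For a `/22`, this gives 1,024 addresses, 8 waves of 128 probes, and a bound of 8 × 1.5 s = 12 s when every probe times out. The 10-second target therefore appears to rely on most probes returning before the timeout (live hosts answering quickly, or refused connections failing fast); a scan of mostly silent address space may need a shorter timeout or higher concurrency to stay inside the budget.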
### NFR-03: Scalability ### NFR-03: Scalability
@ -95,6 +130,7 @@
- Responsive design for desktop/laptop screens - Responsive design for desktop/laptop screens
- Dark mode support - Dark mode support
- Certificate download links integrated into dashboard (root CA) and host detail (host-specific mTLS) - Certificate download links integrated into dashboard (root CA) and host detail (host-specific mTLS)
- Long-running scans (CIDR discovery, full-fleet operations) must display progress and offer a cancel action
## Interface Requirements ## Interface Requirements
@ -104,6 +140,8 @@
- Real-time job status via WebSocket relay (agent WebSocket → Patch Manager → browser) - Real-time job status via WebSocket relay (agent WebSocket → Patch Manager → browser)
- RESTful API backend for all UI operations - RESTful API backend for all UI operations
- Certificate download endpoints for root CA and host-specific mTLS certs - Certificate download endpoints for root CA and host-specific mTLS certs
- Unauthenticated liveness endpoint at `/status/health` (minimal: process up, DB reachable, worker heartbeat fresh)
- Authenticated fleet-aggregate endpoint at `/api/v1/status/fleet` (counts of healthy / degraded / unreachable agents)
### IR-02: Linux Patch API Integration ### IR-02: Linux Patch API Integration
@ -112,12 +150,12 @@
- Base path: `/api/v1/`, Port: 12443, TLS 1.3 only - Base path: `/api/v1/`, Port: 12443, TLS 1.3 only
- Sync operations: GET endpoints (packages, patches, system info, health) - Sync operations: GET endpoints (packages, patches, system info, health)
- Async operations: POST/PUT/DELETE endpoints (install, update, remove, patch apply, reboot) - Async operations: POST/PUT/DELETE endpoints (install, update, remove, patch apply, reboot)
- Job status tracking via GET `/api/v1/jobs/{id}` and WebSocket `/api/v1/ws/jobs` - Job status tracking via `GET /api/v1/jobs/{id}` and WebSocket `/api/v1/ws/jobs`
- Rollback via POST `/api/v1/jobs/{id}/rollback` - Rollback via `POST /api/v1/jobs/{id}/rollback`
## Data Requirements ## Data Requirements
- **Database:** PostgreSQL - **Database:** PostgreSQL 16+
- **Operational data retention:** 30 days (host patch history, job history, health history) - **Operational data retention:** 30 days (host patch history, job history, health history)
- **Audit log retention:** 6 months - **Audit log retention:** 6 months
- **Data storage:** All data on Patch Manager host - **Data storage:** All data on Patch Manager host
@ -126,27 +164,43 @@
### HIPAA (Health Insurance Portability and Accountability Act) ### HIPAA (Health Insurance Portability and Accountability Act)
- **Audit Controls (§164.312(b)):** Comprehensive audit logging of all system activity (covered by audit logging requirements) - **Audit Controls (§164.312(b)):** Comprehensive audit logging of all system activity (hash-chained rows for integrity)
- **Access Controls (§164.312(a)(1)):** RBAC with group-scoped access, unique user identification, MFA enforcement - **Access Controls (§164.312(a)(1)):** RBAC with group-scoped access, unique user identification, MFA enforcement
- **Integrity Controls (§164.312(c)(1)):** Audit log integrity protection (tamper-evident logging) - **Integrity Controls (§164.312(c)(1)):** Audit log integrity protection via hash chaining
- **Transmission Security (§164.312(e)(1)):** mTLS for all agent communication, HTTPS for web UI, TLS 1.3 minimum - **Transmission Security (§164.312(e)(1)):** mTLS for all agent communication, HTTPS for web UI, TLS 1.3 minimum
- **Encryption at Rest:** PostgreSQL data encryption (full-disk or column-level for sensitive fields) - **Encryption at Rest:** Provided by the underlying hardware host (infrastructure-level full-disk encryption). The application does not manage disk encryption.
- **Automatic Logoff (§164.312(a)(2)(iii)):** 1-hour inactivity session timeout - **Automatic Logoff (§164.312(a)(2)(iii)):** 1-hour inactivity session timeout
### PCI-DSS (Payment Card Industry Data Security Standard) ### PCI-DSS (Payment Card Industry Data Security Standard)
- **Requirement 6:** Vulnerability management — patch management is core PCI-DSS requirement; system must track and enforce timely patching - **Requirement 3:** Protect stored data — encryption at rest provided by the hardware host
- **Requirement 4:** Encrypt transmission — mTLS (TLS 1.3) for agent communication, HTTPS (TLS 1.3) for web UI
- **Requirement 6:** Vulnerability management — patch management is the core function; system tracks and enforces timely patching
- **Requirement 7:** Restrict access to need-to-know — RBAC with group-scoped operator access - **Requirement 7:** Restrict access to need-to-know — RBAC with group-scoped operator access
- **Requirement 8:** Identify and authenticate users — MFA required, unique IDs, session timeouts - **Requirement 8:** Identify and authenticate users — MFA required, unique IDs, session timeouts
- **Requirement 10:** Track and monitor all access — comprehensive audit logging with 6-month retention - **Requirement 10:** Track and monitor all access — comprehensive audit logging with 6-month retention
- **Requirement 3:** Protect stored data — encryption at rest for PostgreSQL
- **Requirement 4:** Encrypt transmission — mTLS (TLS 1.3) for agent communication, HTTPS for web UI ## Audit Logging
**Captured Events:**
- All user login/logout events (success and failure)
- All patch operations (who triggered, which hosts, what patches, queue vs. immediate)
- All host registration/removal events
- All group membership changes (hosts and users)
- All certificate operations (issue, renew, download, revoke)
- All maintenance window changes
- All configuration changes (including Azure SSO and SMTP configuration)
**Integrity:** Tamper-evident via hash-chained rows (`prev_hash`, `row_hash`). Periodic and on-demand integrity verification.
**Retention:** 6 months
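To illustrate how `prev_hash` and `row_hash` make the log tamper-evident, here is a minimal in-memory sketch. It is an assumption-laden illustration, not the schema or the hashing scheme of the project: the `AuditRow` fields beyond the two hash columns are hypothetical, and the standard library's `DefaultHasher` stands in for a cryptographic hash such as SHA-256, which a real deployment would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical audit row; only `prev_hash` and `row_hash` come from the spec.
#[derive(Debug, Clone)]
pub struct AuditRow {
    pub event: String,
    pub prev_hash: u64,
    pub row_hash: u64,
}

/// Stand-in hash over (prev_hash, event). NOT cryptographically secure.
fn compute_row_hash(prev_hash: u64, event: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev_hash.hash(&mut hasher);
    event.hash(&mut hasher);
    hasher.finish()
}

/// Appends a row whose `prev_hash` links to the previous row (0 for the first).
pub fn append(chain: &mut Vec<AuditRow>, event: &str) {
    let prev_hash = chain.last().map_or(0, |row| row.row_hash);
    let row_hash = compute_row_hash(prev_hash, event);
    chain.push(AuditRow { event: event.to_string(), prev_hash, row_hash });
}

/// Returns the index of the first row that fails verification, or `None`
/// if the whole chain is intact.
pub fn first_broken_row(chain: &[AuditRow]) -> Option<usize> {
    let mut expected_prev = 0u64;
    for (i, row) in chain.iter().enumerate() {
        let link_ok = row.prev_hash == expected_prev;
        let hash_ok = row.row_hash == compute_row_hash(row.prev_hash, &row.event);
        if !(link_ok && hash_ok) {
            return Some(i);
        }
        expected_prev = row.row_hash;
    }
    None
}
```

Editing or deleting any row changes its hash or breaks the link to its successor, so a periodic or on-demand pass like `first_broken_row` reports where the chain stops being trustworthy.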
## Constraints ## Constraints
- Single bare metal/VM host running Ubuntu 24.04 - Single bare metal/VM host running Ubuntu 24.04
- Systemd service management - Systemd service management
- Internal network only (no public internet exposure) - Internal network only (no public internet exposure)
- Rust/Axum backend, React/TypeScript frontend, PostgreSQL database - Rust/Axum backend, React/TypeScript frontend, PostgreSQL 16+ database
- No direct permissions on managed clients - No direct permissions on managed clients
- Certificate distribution to clients is manual - Certificate distribution to clients is manual
- Encryption at rest is provided by the hardware host; the application does not configure or manage disk encryption

SPEC.md

@ -1,9 +1,28 @@
# Linux_Patch_Manager - Specification Document # Linux_Patch_Manager Specification Document
## Document Control
| Field | Value |
|-------|-------|
| Title | Linux_Patch_Manager — Specification Document |
| Version | 0.0.2 |
| Status | Draft |
| Last Updated | 2026-04-23 |
| Related Docs | `REQUIREMENTS.md`, `ARCHITECTURE.md`, `README.md` |
### Revision History
| Version | Date | Summary |
|---------|------|---------|
| 0.0.1 | 2026-04-21 | Initial draft |
| 0.0.2 | 2026-04-23 | Aligned with SDD v0.0.3: portable ASCII diagram, hardware-host encryption at rest, Argon2id / EdDSA / TLS 1.3 called out, Settings page scope expanded (Azure SSO, SMTP, web-UI TLS), IP whitelist enforcement |
---
## Project Overview ## Project Overview
**Title:** Linux_Patch_Manager **Title:** Linux_Patch_Manager
**Description:** Enterprise class secure web based management interface for controlling patching and updates on Linux servers and workstations **Description:** Enterprise-class, secure, web-based management interface for controlling patching and updates on Linux servers and workstations
**Version:** 0.0.1 **Version:** 0.0.2
**Status:** Draft **Status:** Draft
## Scope ## Scope
@ -13,13 +32,15 @@
- Multi-distribution support (Debian/Ubuntu, RHEL/CentOS/Fedora, Alpine, Arch) - Multi-distribution support (Debian/Ubuntu, RHEL/CentOS/Fedora, Alpine, Arch)
- Batch patch operations across multiple hosts - Batch patch operations across multiple hosts
- Maintenance window scheduling (per-device, daily/weekly/monthly recurring + one-time) with immediate-apply override - Maintenance window scheduling (per-device, daily/weekly/monthly recurring + one-time) with immediate-apply override
- Compliance reporting and patch status dashboards (compliance, patch history, vulnerability exposure, audit trail — exportable as CSV and PDF) - Compliance reporting and patch status dashboards (compliance, patch history, vulnerability exposure, audit trail — exportable as CSV and PDF, with charts/graphs in PDF output)
- User management with RBAC - User management with RBAC
- Secure mTLS communication with Linux Patch API agents - Secure mTLS communication with Linux Patch API agents (TLS 1.3 only)
- Real-time job status via WebSocket relay - Real-time job status via WebSocket relay
- Host registration (manual FQDN/IP + on-demand CIDR auto-discover) - Host registration (manual FQDN/IP + on-demand CIDR auto-discover)
- Static group-based device organization with group-scoped operator access - Static group-based device organization with group-scoped operator access
- Email notifications (optional, disabled by default) - Email notifications (optional, disabled by default, runtime-configurable SMTP)
- Azure SSO configuration GUI with "test connection" action (runtime-configurable)
- Web UI TLS certificate strategy selection (self-signed from internal CA or operator-supplied)
**Out of Scope:** **Out of Scope:**
- Configuration management (Ansible/Puppet/Chef territory) - Configuration management (Ansible/Puppet/Chef territory)
@ -38,7 +59,7 @@
**Key Goals:** **Key Goals:**
- Fleet-wide visibility into patch status and compliance - Fleet-wide visibility into patch status and compliance
- Zero-friction patch deployment via maintenance windows - Zero-friction patch deployment via maintenance windows
- Secure-by-design architecture (Rust core, mTLS, MFA) - Secure-by-design architecture (Rust core, mTLS, MFA, Argon2id, EdDSA JWTs)
- Single-instance simplicity supporting up to 2,500 managed hosts - Single-instance simplicity supporting up to 2,500 managed hosts
## Constraints ## Constraints
@ -46,22 +67,28 @@
**Deployment:** **Deployment:**
- Single bare metal/VM host running Ubuntu 24.04 - Single bare metal/VM host running Ubuntu 24.04
- Systemd service management - Systemd service management
- Internal network access only (same network as managed agents) - Internal network access only (same network as managed agents, no public internet exposure)
- Encryption at rest provided by the hardware host (infrastructure-level); the application does not manage disk encryption
**Technical:** **Technical:**
- Backend: Rust with Axum framework, Tokio async runtime - Backend: Rust with Axum framework, Tokio async runtime
- Frontend: React + TypeScript SPA - Frontend: React + TypeScript SPA (Vite build)
- Database: PostgreSQL with SQLx for type-safe queries - Database: PostgreSQL 16+ with SQLx for type-safe queries; migrations via `sqlx-cli`
- Real-time: Axum native WebSocket support for agent-to-browser relay - Real-time: Axum native WebSocket support for agent-to-browser relay
- Single-instance design (manual horizontal scaling by dividing clients between multiple Patch Manager hosts if needed) - Single-instance design (manual horizontal scaling by dividing clients between multiple Patch Manager hosts if needed)
- Fleet capacity: ~500 typical, up to 2,500 hosts - Fleet capacity: ~500 typical, up to 2,500 hosts
- PDF generation: `printpdf` + `plotters` for charts (in-process, no sidecar)
**Security:** **Security:**
- Combination authentication: local accounts + Azure SSO - Combination authentication: local accounts + Azure SSO
- MFA required for all users (TOTP or WebAuthn) - MFA required for all users (TOTP or WebAuthn)
- Azure SSO users may use Azure's built-in MFA - Azure SSO users may use Azure's built-in MFA
- mTLS for all agent communication - Password hashing: Argon2id
- HTTPS for web UI - JWT access tokens signed with EdDSA / Ed25519 (15-minute TTL), 90-day key rotation with 24-hour overlap
- Refresh tokens: opaque, server-side stored, 1-hour inactivity timeout, rotated on use, revocable
- mTLS for all agent communication (TLS 1.3 only)
- HTTPS for web UI (TLS 1.3 only)
- **IP whitelist enforcement on all connection points**
- Role-based access control: - Role-based access control:
- **Admin**: Full access to manage all aspects of Linux Patch Manager - **Admin**: Full access to manage all aspects of Linux Patch Manager
- **Operator**: Can add/remove clients, manage schedules and patches only for devices in their group memberships - **Operator**: Can add/remove clients, manage schedules and patches only for devices in their group memberships
@ -73,25 +100,26 @@
Management plane web application communicating with Linux Patch API agents on each managed host. Management plane web application communicating with Linux Patch API agents on each managed host.
```
+-----------------------------+
|   Linux Patch Manager       |  <- Web UI (this project)
|   (Management Plane)        |     Rust/Axum + React/TS
|   PostgreSQL + WebSocket    |
+--------------+--------------+
               |
               |  mTLS / REST + WSS (TLS 1.3, port 12443)
       +-------+-------+
       v       v       v
    +------+ +------+ +------+
    | Host | | Host | | Host |  <- Linux Patch API agents
    |  A   | |  B   | |  C   |     (up to 2,500)
    +------+ +------+ +------+
```
## API Integration ## API Integration
**Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api) **Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api)
- All managed device access uses the Linux Patch API - All managed device access uses the Linux Patch API
- mTLS certificate-based authentication to agents - mTLS certificate-based authentication to agents (TLS 1.3 only)
- Hybrid sync/async operation model (sync for queries, async jobs for patch operations) - Hybrid sync/async operation model (sync for queries, async jobs for patch operations)
- WebSocket streaming for real-time job status from agents - WebSocket streaming for real-time job status from agents
- Base path: `/api/v1/`, Port: 12443, TLS 1.3 only - Base path: `/api/v1/`, Port: 12443, TLS 1.3 only
@ -102,6 +130,7 @@ Management plane web application communicating with Linux Patch API agents on ea
- Patch Manager issues and renews client certificates for mTLS communication - Patch Manager issues and renews client certificates for mTLS communication
- Certificate distribution to managed target clients is manual (server administrators responsible) - Certificate distribution to managed target clients is manual (server administrators responsible)
- Patch Manager has no direct permissions on managed clients - Patch Manager has no direct permissions on managed clients
- Web UI TLS certificate: self-signed from the internal CA by default; operator may supply an external certificate (e.g., infrastructure wildcard) via configuration
## User Interface ## User Interface
@ -114,10 +143,15 @@ Management plane web application communicating with Linux Patch API agents on ea
5. **Jobs** — Real-time job monitoring with WebSocket status updates 5. **Jobs** — Real-time job monitoring with WebSocket status updates
6. **Maintenance Windows** — Create/edit recurring and one-time windows per device 6. **Maintenance Windows** — Create/edit recurring and one-time windows per device
7. **Groups** — Manage static groups, assign hosts and operators 7. **Groups** — Manage static groups, assign hosts and operators
8. **Reports** — Generate and export compliance, patch history, vulnerability, audit reports (CSV and PDF) 8. **Reports** — Generate and export compliance, patch history, vulnerability, audit reports (CSV and PDF with charts)
9. **Users** — Manage local accounts, MFA setup, group assignments 9. **Users** — Manage local accounts, MFA setup, group assignments
10. **Certificates** — View/manage internal CA, issue/renew client certs 10. **Certificates** — View/manage internal CA, issue/renew client certs
11. **Settings** — System configuration, Azure SSO setup, polling intervals 11. **Settings** — System configuration including:
- Azure SSO setup (tenant ID, client ID/secret, redirect URI, scopes) with "Test Connection" action
- SMTP configuration (host, port, auth, TLS mode, from-address) with "Send Test Email" action
- Polling intervals (health, patch data)
- Web UI TLS certificate strategy (internal CA vs. operator-supplied)
- IP whitelist management
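Entries managed under IP whitelist management would typically be CIDR ranges checked against each peer address. The sketch below shows one such check using only the standard library; `Ipv4Cidr` and `is_allowed` are hypothetical names, it handles IPv4 only, and it assumes an empty list denies everything, none of which is specified by this document.

```rust
use std::net::Ipv4Addr;

/// Hypothetical parsed allowlist entry, e.g. "10.20.0.0/22".
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Ipv4Cidr {
    network: u32,
    mask: u32,
}

impl Ipv4Cidr {
    /// Parses "a.b.c.d/len"; returns `None` on malformed input.
    pub fn parse(s: &str) -> Option<Self> {
        let (addr, len) = s.split_once('/')?;
        let addr: Ipv4Addr = addr.parse().ok()?;
        let len: u32 = len.parse().ok()?;
        if len > 32 {
            return None;
        }
        // Shifting a u32 by 32 bits would overflow, so /0 is handled explicitly.
        let mask = if len == 0 { 0 } else { u32::MAX << (32 - len) };
        Some(Self { network: u32::from(addr) & mask, mask })
    }

    /// True if `ip` falls inside this range.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        (u32::from(ip) & self.mask) == self.network
    }
}

/// True if `peer` matches any entry. An empty list denies all peers
/// (an assumption; the spec does not define the empty-list behaviour).
pub fn is_allowed(allowlist: &[Ipv4Cidr], peer: Ipv4Addr) -> bool {
    allowlist.iter().any(|cidr| cidr.contains(peer))
}
```

Since the whitelist is enforced on all connection points, a shared check like this would be called from each listener before any request handling begins.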
## Error Handling ## Error Handling
@ -141,23 +175,29 @@ Management plane web application communicating with Linux Patch API agents on ea
- Linux Patch API agent is installed and running on each managed host - Linux Patch API agent is installed and running on each managed host
- Server administrators manually distribute mTLS and root certificates to managed clients - Server administrators manually distribute mTLS and root certificates to managed clients
- PostgreSQL is available on the Patch Manager host - PostgreSQL is available on the Patch Manager host
- Hardware host provides full-disk encryption (no OS-level disk encryption managed by the application)
## Dependencies ## Dependencies
- Linux Patch API (upstream agent on each managed host) - Linux Patch API (upstream agent on each managed host)
- PostgreSQL - PostgreSQL 16+
- Internal CA for mTLS certificates - Internal CA for mTLS certificates
- Azure AD (optional, for SSO) - Azure AD (optional, for SSO)
- SMTP relay (optional, runtime-configurable, for email notifications)
## Audit Logging ## Audit Logging
**Captured Events:** **Captured Events:**
- All user login/logout events (success and failure) - All user login/logout events (success and failure)
- All patch operations (who triggered, which hosts, what patches, queue vs immediate) - All patch operations (who triggered, which hosts, what patches, queue vs. immediate)
- All host registration/removal events - All host registration/removal events
- All group membership changes (hosts and users) - All group membership changes (hosts and users)
- All certificate operations (issue, renew, download) - All certificate operations (issue, renew, download, revoke)
- All maintenance window changes - All maintenance window changes
- All configuration changes - All configuration changes (including Azure SSO, SMTP, IP whitelist, TLS cert strategy)
**Integrity:** Hash-chained rows (tamper-evident). Periodic and on-demand verification.
**Retention:** 6 months **Retention:** 6 months