Linux Patch Manager — Implementation Plan
Project Structure
linux_patch_manager/
├── Cargo.toml # Workspace root
├── crates/
│ ├── pm-web/ # Axum web server binary crate
│ ├── pm-worker/ # Background worker binary crate
│ ├── pm-core/ # Shared library: config, DB pool, models, errors, types
│ ├── pm-agent-client/ # mTLS HTTP client for agent communication
│ ├── pm-auth/ # Auth: JWT (EdDSA), Argon2id, TOTP, WebAuthn, RBAC, Azure SSO
│ ├── pm-ca/ # Internal CA: rcgen + rustls certificate management
│ └── pm-reports/ # PDF (printpdf + plotters) and CSV generation
├── migrations/ # SQLx database migrations
│ ├── 001_initial_schema.sql
│ ├── 002_auth_system.sql
│ ├── 003_host_management.sql
│ ├── 004_jobs_and_scheduling.sql
│ ├── 005_audit_logging.sql
│ └── 006_system_config.sql
├── frontend/ # React + TypeScript SPA
│ ├── src/
│ │ ├── api/ # API client (axios/fetch)
│ │ ├── components/ # Shared MUI components
│ │ ├── pages/ # 11 page components
│ │ ├── hooks/ # Custom React hooks
│ │ ├── store/ # State management (zustand or context)
│ │ ├── theme/ # MUI theme (light + dark)
│ │ ├── types/ # TypeScript interfaces
│ │ └── utils/ # Utilities
│ ├── package.json
│ ├── vite.config.ts
│ ├── tsconfig.json
│ └── index.html
├── config/
│ └── config.example.toml # Example configuration
├── systemd/
│ ├── patch-manager-web.service
│ └── patch-manager-worker.service
├── docs/
│ └── runbooks/
│ └── restore.md # Backup/restore runbook
├── scripts/
│ ├── setup.sh # Initial host setup script
│ └── build-frontend.sh # Frontend build script
├── SPEC.md
├── REQUIREMENTS.md
├── ARCHITECTURE.md
├── README.md
└── .gitignore
Milestones
Each milestone produces a testable vertical slice — backend + frontend + database working together.
M1: Project Scaffolding + Database Schema + Core Infrastructure
Goal: Runnable workspace with DB, config, logging, error handling.
- Initialize Rust workspace with 7 crates (pm-web, pm-worker, pm-core, pm-agent-client, pm-auth, pm-ca, pm-reports)
- Initialize React + TypeScript + Vite + MUI frontend project
- Create config.example.toml with all configuration keys
- Implement pm-core::config — TOML config loading + env overrides (PATCH_MANAGER__SECTION__KEY)
- Implement pm-core::db — SQLx PgPool initialization, connection from config
- Implement pm-core::error — Unified error type with API error envelope (error.code, error.message, error.request_id, error.details)
- Implement pm-core::request_id — ULID generation + X-Request-Id header middleware
- Implement pm-core::logging — tracing + tracing-subscriber JSON formatter, configurable log levels
- Create initial database migrations (001_initial_schema.sql): hosts, groups, host_groups, users, user_groups, refresh_tokens, maintenance_windows, patch_jobs, patch_job_hosts, host_patch_data, host_health_data, certificates, audit_log, azure_sso_config, system_config, worker_heartbeat, discovery_results
- Implement pm-web binary: Axum app skeleton, static file serving placeholder, /status/health endpoint
- Implement pm-worker binary: Tokio runtime skeleton, DB connection, worker heartbeat writer (30s interval)
- Implement sqlx::migrate! embedded migrations in pm-web, advisory lock for single-writer
- Worker waits for expected schema version before accepting work
- Create systemd/patch-manager-web.service and systemd/patch-manager-worker.service unit files
- Create scripts/setup.sh for initial host setup
- Create scripts/build-frontend.sh
- Verify: both services start, /status/health returns 200, worker heartbeat updates
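The PATCH_MANAGER__SECTION__KEY override convention can be sketched with the standard library alone. The flat (section, key) map, the function names, and the DATABASE/URL example are illustrative assumptions, not the pm-core::config design.

```rust
use std::collections::BTreeMap;

/// Prefix taken from the PATCH_MANAGER__SECTION__KEY convention.
const PREFIX: &str = "PATCH_MANAGER__";

/// Splits an override variable name into a lowercase (section, key) pair.
/// Returns None for variables that do not follow the convention.
pub fn parse_override(name: &str) -> Option<(String, String)> {
    let rest = name.strip_prefix(PREFIX)?;
    // Split on the first double underscore only, so a key may itself contain "__".
    let (section, key) = rest.split_once("__")?;
    if section.is_empty() || key.is_empty() {
        return None;
    }
    Some((section.to_ascii_lowercase(), key.to_ascii_lowercase()))
}

/// Applies matching environment-style pairs on top of values loaded from TOML.
/// The flat map is a hypothetical stand-in for the real config struct.
pub fn apply_overrides<I>(config: &mut BTreeMap<(String, String), String>, vars: I)
where
    I: IntoIterator<Item = (String, String)>,
{
    for (name, value) in vars {
        if let Some(path) = parse_override(&name) {
            config.insert(path, value);
        }
    }
}
```

In a binary, passing std::env::vars() as the second argument would route the process environment through the same path.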
M2: Authentication & Authorization + Frontend Shell
Goal: Users can log in with MFA, JWT auth works, RBAC middleware enforces roles.
- Implement pm-auth::password — Argon2id hashing with calibrated parameters (m_cost=65536, t_cost=3, p_cost=1)
- Implement pm-auth::jwt — EdDSA/Ed25519 JWT issuance and validation, 15-min TTL, 90-day key rotation with 24-hour overlap
- Implement pm-auth::refresh — Opaque 256-bit refresh tokens, hashed storage in refresh_tokens, 1-hour sliding inactivity timeout, rotation on use
- Implement pm-auth::mfa_totp — TOTP setup, verify, QR code generation
- Implement pm-auth::mfa_webauthn — WebAuthn registration and authentication
- Implement pm-auth::rbac — Admin/Operator role middleware, group-scoped access enforcement
- Implement pm-auth::session — Login flow (password → MFA → access+refresh tokens), logout (revoke refresh), force-revoke
- Implement pm-web auth routes: POST /api/v1/auth/login, POST /api/v1/auth/refresh, POST /api/v1/auth/logout, MFA setup endpoints
- Implement IP whitelist middleware on all connection points
- Frontend: App shell with React Router, MUI theme (light + dark), auth context, login page, MFA setup page
- Frontend: API client with JWT interceptors (auto-refresh), 401 redirect to login
- Create seed migration: default admin account
- Verify: login with MFA, JWT validation, refresh token rotation, RBAC blocks unauthorized access, IP whitelist blocks unknown IPs
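The refresh-token rules (1-hour sliding inactivity timeout, rotation on use) reduce to a small decision function. The sketch below assumes an in-memory record and a caller-supplied hash for the new token; the struct, enum, and function names are hypothetical, and token generation and hashing are out of scope.

```rust
/// Inactivity limit from the plan: one hour, measured from the last use.
pub const INACTIVITY_TIMEOUT_SECS: u64 = 60 * 60;

/// Server-side record for one refresh token (only a hash of the token is kept).
#[derive(Debug, Clone, PartialEq)]
pub struct RefreshRecord {
    pub token_hash: String,
    pub last_used_at: u64, // Unix seconds
    pub revoked: bool,
}

#[derive(Debug, PartialEq)]
pub enum RefreshError {
    Revoked,
    Expired,
}

/// Validates `current` at time `now` and returns the record that replaces it.
/// The old record is marked revoked so that a replayed token is rejected.
pub fn rotate(
    current: &mut RefreshRecord,
    new_token_hash: String,
    now: u64,
) -> Result<RefreshRecord, RefreshError> {
    if current.revoked {
        return Err(RefreshError::Revoked);
    }
    if now.saturating_sub(current.last_used_at) >= INACTIVITY_TIMEOUT_SECS {
        return Err(RefreshError::Expired);
    }
    current.revoked = true;
    Ok(RefreshRecord {
        token_hash: new_token_hash,
        last_used_at: now, // the inactivity window slides forward on every use
        revoked: false,
    })
}
```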
M3: Host Management + Groups + Frontend Pages
Goal: Full host CRUD, group management, auto-discovery.
- Implement host CRUD routes: GET/POST /api/v1/hosts, GET/DELETE /api/v1/hosts/{id}
- Implement FQDN resolution on host add (resolve to IP at registration time)
- Implement group CRUD routes: GET/POST /api/v1/groups, GET/DELETE /api/v1/hosts/{id}/groups
- Implement host ↔ group and user ↔ group membership management
- Implement RBAC scoping: operators can only see/manage hosts in their groups
- Implement auto-discovery: POST /api/v1/discovery/cidr → worker scans CIDR, bounded concurrency (128), TCP+TLS probe (1.5s timeout), progress tracking, cancel action
- Implement discovery results table and review flow
- Implement host removal with audit logging
- Frontend: Hosts page (filterable list by group, status, OS)
- Frontend: Host Detail page (system info, packages, patches, jobs, maintenance window config)
- Frontend: Groups page (manage groups, assign hosts and operators)
- Frontend: Users page (local account management, MFA setup, group assignments)
- Verify: add/remove hosts, group assignments, RBAC enforcement, CIDR scan with progress
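Before the worker can probe a CIDR block it needs the list of addresses in it. A minimal IPv4-only sketch using std::net follows; the function name and the /16 lower bound on prefix length are assumptions added here to keep the result small, not requirements from the plan.

```rust
use std::net::Ipv4Addr;

/// Parses "a.b.c.d/len" and returns every address in the block, network
/// address first. Returns None on malformed input or when the prefix is
/// outside /16..=/32 (a hypothetical safety limit on scan size).
pub fn expand_cidr(cidr: &str) -> Option<Vec<Ipv4Addr>> {
    let (addr, len) = cidr.split_once('/')?;
    let addr: Ipv4Addr = addr.parse().ok()?;
    let len: u32 = len.parse().ok()?;
    if !(16..=32).contains(&len) {
        return None;
    }
    let mask = u32::MAX << (32 - len);
    // Masking normalizes input such as 10.0.1.7/22 to its network address.
    let base = u32::from(addr) & mask;
    let count = 1u32 << (32 - len);
    Some((0..count).map(|i| Ipv4Addr::from(base + i)).collect())
}
```

A /22 expands to 1024 addresses, including the network and broadcast addresses, which a real scanner may want to skip.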
M4: Agent Communication Layer + Dashboard
Goal: mTLS client works, health/patch polling operational, dashboard shows fleet status.
- Implement pm-agent-client — Rustls-based mTLS HTTP client with client certificate, TLS 1.3 only
- Implement agent API calls: GET /api/v1/health, GET /api/v1/system/info, GET /api/v1/packages, GET /api/v1/patches
- Implement worker health poller: 5-minute intervals, bounded concurrency (64 semaphore), update host_health_data
- Implement worker patch data poller: 30-minute intervals, bounded concurrency, update host_patch_data
- Implement on-demand refresh: POST /api/v1/hosts/{id}/refresh → NOTIFY refresh_requested → worker queries immediately
- Implement host health status tracking: healthy/degraded/unreachable with timestamps
- Implement dashboard API: GET /api/v1/status/fleet (authenticated, fleet aggregates)
- Frontend: Dashboard page — compliance %, health summary, pending patches, upcoming windows, root CA download icon
- Frontend: Real-time health status indicators (green/yellow/red) on host lists
- Verify: polling works, dashboard shows live fleet data, on-demand refresh works, visual alerts for unhealthy agents
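Bounded concurrency for the pollers can be illustrated without an async runtime. The sketch below swaps the planned semaphore for a fixed pool of standard-library threads pulling from a shared queue, which gives the same guarantee that no more than `limit` probes are in flight. The real worker would presumably use an async semaphore on Tokio instead, and `poll_bounded` and its signature are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

/// Runs `probe` once per host with at most `limit` probes in flight.
/// Results come back in completion order, paired with the host name.
pub fn poll_bounded<F, R>(hosts: Vec<String>, limit: usize, probe: F) -> Vec<(String, R)>
where
    F: Fn(&str) -> R + Sync,
    R: Send,
{
    let queue = Mutex::new(VecDeque::from(hosts));
    let results = Mutex::new(Vec::new());
    thread::scope(|scope| {
        // `limit` workers share one queue, so concurrency never exceeds `limit`.
        for _ in 0..limit.max(1) {
            scope.spawn(|| loop {
                // The lock is released at the end of this statement,
                // so it is not held while the probe runs.
                let next = queue.lock().unwrap().pop_front();
                let Some(host) = next else { break };
                let outcome = probe(&host);
                results.lock().unwrap().push((host, outcome));
            });
        }
    });
    results.into_inner().unwrap()
}
```

With `limit` set to 64 this mirrors the health poller's bound; the CIDR scanner in M3 would use 128.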
M5: Patch Deployment & Job Management + Frontend Pages
Goal: Full patch lifecycle — queue, immediate, retry, rollback, job monitoring.
- Implement job creation: POST /api/v1/jobs (queue for window or apply now)
- Implement patch_jobs and patch_job_hosts row creation
- Implement NOTIFY job_enqueued for immediate-apply wake
- Implement worker job executor: call agent POST /api/v1/patches/apply, track async job IDs
- Implement worker retry engine: exponential backoff (1min, 5min, 30min), 3 retries max
- Implement patch job auto-retry within maintenance window (1 retry)
- Implement batch partial failure handling: auto-retry once, then report
- Implement rollback: POST /api/v1/jobs/{id}/rollback → worker calls agent rollback endpoint
- Implement job status tracking: poll agent GET /api/v1/jobs/{id} for running jobs
- Implement job listing/detail API: GET /api/v1/jobs, GET /api/v1/jobs/{id}
- Frontend: Patch Deployment page (select hosts → review patches → queue or apply now)
- Frontend: Jobs page (job list, per-host status, rollback action)
- Verify: queued job waits for window, immediate job runs now, retry logic works, rollback works, batch partial failures reported
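The retry engine's schedule (1 min, 5 min, 30 min, at most 3 retries) is small enough to express as a lookup table. This is a minimal sketch; the function name and the "retries so far" counter are assumptions about how the worker might track attempts.

```rust
use std::time::Duration;

/// Backoff schedule from the plan: 1 min, 5 min, 30 min.
const BACKOFF: [Duration; 3] = [
    Duration::from_secs(60),
    Duration::from_secs(5 * 60),
    Duration::from_secs(30 * 60),
];

/// Returns how long to wait before the next retry, given how many retries
/// have already been made. None means the retry budget is exhausted and the
/// host job should be reported as failed.
pub fn next_retry_delay(retries_so_far: usize) -> Option<Duration> {
    BACKOFF.get(retries_so_far).copied()
}
```

A table is used rather than a formula because the listed delays grow by a factor of 5 and then 6, so no single multiplier reproduces them exactly.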
M6: Maintenance Windows & Scheduling + Frontend Page
Goal: Per-device recurring and one-time maintenance windows, auto-execution at window open.
- Implement maintenance window CRUD: GET/POST/PUT/DELETE /api/v1/hosts/{id}/maintenance-windows
- Implement recurring schedule logic: daily, weekly, monthly (cron-like evaluation)
- Implement one-time window support
- Implement worker job scheduler: detect window openings, dispatch queued jobs
- Implement window-open event triggering job execution
- Frontend: Maintenance Windows page (per-device schedule management)
- Frontend: Maintenance window config on Host Detail page
- Verify: create recurring/one-time windows, queued jobs execute at window open, window expiration stops execution
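Deciding whether a recurring window is open at a given instant can be done with modular arithmetic on Unix time. The sketch below covers daily and weekly schedules only and assumes all times are UTC; monthly schedules and time zones need calendar logic that is left out. The `Schedule` type and `is_open` are hypothetical names.

```rust
const DAY: u64 = 86_400;
const WEEK: u64 = 7 * DAY;

/// Recurrence of a maintenance window. All times are UTC.
#[derive(Debug, Clone, Copy)]
pub enum Schedule {
    /// Opens every day at `start` seconds after midnight.
    Daily { start: u64 },
    /// Opens on `weekday` (0 = Monday .. 6 = Sunday) at `start` seconds after midnight.
    Weekly { weekday: u64, start: u64 },
}

/// Returns true if a window of `duration` seconds is open at `now` (Unix seconds).
pub fn is_open(schedule: Schedule, duration: u64, now: u64) -> bool {
    // Seconds since the most recent scheduled opening, taken modulo the
    // period so that windows running past midnight are handled.
    let elapsed = match schedule {
        Schedule::Daily { start } => (now % DAY + DAY - start % DAY) % DAY,
        Schedule::Weekly { weekday, start } => {
            // 1970-01-01 was a Thursday, three days after Monday.
            let since_monday = (now + 3 * DAY) % WEEK;
            let opening = (weekday % 7) * DAY + start % DAY;
            (since_monday + WEEK - opening) % WEEK
        }
    };
    elapsed < duration
}
```

For example, a daily window starting at 23:00 with a two-hour duration is still reported open at 00:30 the next day.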
M7: WebSocket Relay (Real-Time Job Status)
Goal: Browser receives live job updates via WebSocket.
- Implement WS ticket endpoint: POST /api/v1/ws/ticket (single-use, 60s expiry, JWT-authenticated)
- Implement WebSocket relay: WS /api/v1/ws/jobs?ticket=... → authenticated browser connection
- Implement agent WebSocket consumption: worker subscribes to agent WS /api/v1/ws/jobs for running jobs
- Implement event multiplexing: agent WS events → PostgreSQL update → browser WS push
- Frontend: WebSocket client hook with auto-reconnect and ticket refresh
- Frontend: Live job progress updates on Jobs page
- Verify: open job in browser, see real-time progress updates, WS ticket expires correctly
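The two ticket properties, single use and 60-second expiry, can be sketched as an in-memory store. The `TicketStore` type is hypothetical; the ticket string is taken as input because generating an unguessable random value needs a crate outside the standard library.

```rust
use std::collections::HashMap;

/// Ticket lifetime from the plan: 60 seconds.
pub const TICKET_TTL_SECS: u64 = 60;

/// In-memory store of issued WebSocket tickets.
#[derive(Default)]
pub struct TicketStore {
    /// ticket -> (user id, issued-at in Unix seconds)
    issued: HashMap<String, (String, u64)>,
}

impl TicketStore {
    /// Records a ticket for `user_id`. The caller supplies the random ticket value.
    pub fn issue(&mut self, ticket: String, user_id: String, now: u64) {
        self.issued.insert(ticket, (user_id, now));
    }

    /// Redeems a ticket. The entry is removed whether or not it is still
    /// valid (single use); the user id is returned only before expiry.
    pub fn redeem(&mut self, ticket: &str, now: u64) -> Option<String> {
        let (user_id, issued_at) = self.issued.remove(ticket)?;
        (now.saturating_sub(issued_at) < TICKET_TTL_SECS).then_some(user_id)
    }
}
```

Removing the entry before checking the expiry means an expired ticket cannot be retried, and a second redeem of a valid ticket always fails.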
M8: Internal CA + Certificate Management + Frontend Page
Goal: CA issues/renews certs, download links work.
- Implement pm-ca — CA initialization (root key + cert generation), stored at /etc/patch-manager/ca/ with 0600 permissions
- Implement client certificate issuance for mTLS (per-host certs)
- Implement certificate renewal flow
- Implement certificate revocation (mark revoked in certificates table, re-issue replacement)
- Implement download endpoints: GET /api/v1/ca/root.crt, GET /api/v1/hosts/{id}/client.crt
- Implement Web UI TLS certificate: self-signed from internal CA (default) or operator-supplied cert/key
- Frontend: Certificates page (view/manage CA, issue/renew certs, view expiry)
- Frontend: Root CA download icon on Dashboard
- Frontend: Host-specific cert download icon on Host Detail page
- Verify: CA generates certs, downloads work, TLS cert strategy switchable
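The 0600 requirement for CA material is easiest to meet by setting the mode when the file is created rather than changing it afterwards. A minimal Unix-only sketch follows; `write_private_file` is a hypothetical helper, and key generation itself is not shown.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Writes secret material to a new file that only the owner can read or write.
/// Fails if the file already exists, so an existing key is never overwritten.
pub fn write_private_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .mode(0o600) // applied at creation, so the file is never briefly world-readable
        .open(path)?;
    file.write_all(contents)?;
    // Flush to disk before reporting success.
    file.sync_all()
}
```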
M9: Reporting (CSV + PDF with Charts) + Frontend Page
Goal: All 4 report types exportable as CSV and PDF.
- Implement pm-reports::csv — CSV generation for all report types
- Implement pm-reports::pdf — PDF generation with printpdf + plotters charts
- Implement compliance report: % hosts fully patched by group/fleet, trend charts
- Implement patch history report: operations per host/group
- Implement vulnerability exposure report: hosts with pending CVEs
- Implement audit trail report: who did what when
- Implement report API: GET /api/v1/reports/compliance, patch-history, vulnerability, audit with ?format=csv|pdf
- Frontend: Reports page (select type, filters, generate, download)
- Verify: all 4 reports generate as CSV and PDF, PDFs include charts
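Whatever crate ends up backing the CSV generator, the output has to quote fields correctly, since host and group names may contain commas or quotes. The sketch below hand-rolls the RFC 4180 quoting rules to show what is expected; `csv_field` and `csv_row` are hypothetical names, not the pm-reports API.

```rust
/// Quotes a field when it contains a comma, a double quote, or a line break,
/// doubling any embedded double quotes (RFC 4180).
pub fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Joins escaped fields into one CRLF-terminated record.
pub fn csv_row(fields: &[&str]) -> String {
    let escaped: Vec<String> = fields.iter().map(|f| csv_field(f)).collect();
    format!("{}\r\n", escaped.join(","))
}
```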
M10: Settings Page (Azure SSO, SMTP, TLS, IP Whitelist) + Frontend Page
Goal: All runtime configuration manageable from the UI.
- Implement system_config table CRUD API
- Implement Azure SSO configuration: tenant ID, client ID/secret, redirect URI, scopes
- Implement "Test Connection" action for Azure SSO (round-trip against Azure AD, report success/failure without enabling)
- Implement SMTP configuration: host, port, auth mode, username/password, TLS mode, from-address
- Implement "Send Test Email" action for SMTP
- Implement polling interval tuning (health, patch) in Settings
- Implement Web UI TLS certificate strategy selection (internal CA vs. operator-supplied)
- Implement IP whitelist management in Settings
- Implement Azure SSO OAuth2/OIDC Authorization Code flow with PKCE
- Frontend: Settings page with all configuration sections and test actions
- Verify: Azure SSO test connection works, test email sends, TLS strategy switches, IP whitelist updates take effect
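The IP whitelist check that the Settings page feeds is a membership test against a list of networks. A minimal IPv4-only sketch with default-deny behaviour follows; the `Network` type and `is_allowed` are hypothetical, and whether an empty list should deny everything is an assumption to confirm.

```rust
use std::net::Ipv4Addr;

/// One whitelist entry: a network address and prefix length, e.g. 10.0.0.0/8.
#[derive(Debug, Clone, Copy)]
pub struct Network {
    pub addr: Ipv4Addr,
    pub prefix: u32,
}

impl Network {
    /// True if `ip` falls inside this network.
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        // checked_shl returns None for a shift of 32 (the /0 case),
        // which maps to an all-zero mask that matches every address.
        let mask = u32::MAX.checked_shl(32 - self.prefix.min(32)).unwrap_or(0);
        (u32::from(ip) & mask) == (u32::from(self.addr) & mask)
    }
}

/// Default-deny check: a client is allowed only if some entry contains its address.
pub fn is_allowed(whitelist: &[Network], ip: Ipv4Addr) -> bool {
    whitelist.iter().any(|net| net.contains(ip))
}
```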
M11: Email Notifications + Audit Logging Hardening
Goal: Optional email works, audit logs are tamper-evident.
- Implement email notifier in worker (Lettre crate, optional/disabled by default)
- Implement email templates: patch failure, job completion, maintenance window reminders
- Implement audit log hash chaining: prev_hash + row_hash on every insert
- Implement periodic audit integrity verification job
- Implement on-demand audit integrity verification from UI
- Implement audit log for all configuration changes (Azure SSO, SMTP, IP whitelist, TLS cert strategy)
- Implement audit log for certificate operations (issue, renew, download, revoke)
- Frontend: Email notification settings integration in Settings page
- Frontend: Audit integrity verification action in Reports/Users area
- Verify: email sends on failure, audit chain is intact, tampering detected by verification
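Hash chaining and its verification can be shown end to end in a few lines. To stay dependency-free the sketch uses FNV-1a, which is not cryptographic; it stands in for whatever cryptographic hash (for example SHA-256) the real audit log uses. The row layout and function names are assumptions apart from prev_hash and row_hash.

```rust
/// FNV-1a (64-bit). NOT cryptographic: a stand-in so the sketch needs no crates.
fn fnv1a(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf2_9ce4_8422_2325, |hash, &b| {
        (hash ^ u64::from(b)).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

#[derive(Debug, Clone)]
pub struct AuditRow {
    pub payload: String,
    pub prev_hash: u64,
    pub row_hash: u64,
}

/// Hash of a row: covers the previous row's hash followed by the payload.
fn hash_row(prev_hash: u64, payload: &str) -> u64 {
    let mut input = prev_hash.to_be_bytes().to_vec();
    input.extend_from_slice(payload.as_bytes());
    fnv1a(&input)
}

/// Appends a row linked to the current tail of the log (0 for the first row).
pub fn append(log: &mut Vec<AuditRow>, payload: &str) {
    let prev_hash = log.last().map_or(0, |row| row.row_hash);
    let row_hash = hash_row(prev_hash, payload);
    log.push(AuditRow { payload: payload.to_string(), prev_hash, row_hash });
}

/// Recomputes the chain and returns the index of the first row that fails
/// verification, or None if the whole log is intact.
pub fn first_broken_row(log: &[AuditRow]) -> Option<usize> {
    let mut expected_prev = 0;
    for (i, row) in log.iter().enumerate() {
        let linked = row.prev_hash == expected_prev;
        let intact = row.row_hash == hash_row(row.prev_hash, &row.payload);
        if !(linked && intact) {
            return Some(i);
        }
        expected_prev = row.row_hash;
    }
    None
}
```

Editing a payload breaks that row's own hash, and deleting a row breaks the link of the row after it, so both kinds of tampering surface in the verification pass.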
M12: Deployment Packaging, Backup/DR, Integration Testing
Goal: Production-ready deployment with documented runbooks.
- Create docs/runbooks/restore.md — backup/restore procedure
- Implement nightly pg_dump script to /var/backups/patch-manager/
- Implement CA material backup inclusion
- Implement /etc/patch-manager/ config backup (excluding secrets unless encrypted destination)
- Create scripts/setup.sh — full host setup (install deps, create service user, set permissions, initialize DB)
- Finalize systemd unit files with proper dependencies, restart policies, logging
- End-to-end integration tests: full patch lifecycle across multiple agents
- Performance test: verify 500-host polling, dashboard load < 5s, CIDR scan < 10s for /22
- Security review: TLS 1.3 enforcement, IP whitelist, RBAC, audit chain integrity
- Compliance mapping verification: HIPAA and PCI-DSS controls documented and testable
- Verify: backup/restore works, RPO 24h / RTO 4h achievable, all NFRs met
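The CIDR scan target in the performance test can be sanity-checked against the M3 scan parameters (concurrency 128, 1.5s probe timeout). The helper below is a hypothetical back-of-the-envelope estimate that assumes the worst case, where every probe runs to its timeout and each address is probed once.

```rust
/// Worst-case scan duration in seconds when every probe runs to its timeout.
/// Assumes `prefix_len` is at most 32 and `concurrency` is at least 1.
pub fn worst_case_scan_secs(prefix_len: u32, concurrency: u64, timeout_secs: f64) -> f64 {
    let addresses = 1u64 << (32 - prefix_len);
    // Probes run in waves of at most `concurrency`; each wave lasts one timeout.
    let waves = (addresses + concurrency - 1) / concurrency;
    waves as f64 * timeout_secs
}
```

For a /22 this gives 1024 addresses, 8 waves, and 12 seconds, which is above the 10s target. Under these assumptions the target is reachable only when a good share of probes finish before the timeout, so the performance test should state what proportion of the range is expected to respond.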
Dependency Graph
M1 (scaffolding)
├──> M2 (auth)
│ ├──> M3 (hosts/groups)
│ │ ├──> M4 (agent comm + dashboard)
│ │ │ ├──> M5 (patch deployment + jobs)
│ │ │ │ ├──> M6 (maintenance windows)
│ │ │ │ │ └──> M7 (websocket relay)
│ │ │ │ └──> M7 (websocket relay)
│ │ │ └──> M8 (CA + certs)
│ │ └──> M8 (CA + certs)
│ └──> M10 (settings)
├──> M8 (CA + certs) [needed by M4 for mTLS]
└──> M9 (reports)
M10 (settings) ──> M11 (email + audit hardening)
M11 ──> M12 (deployment + testing)
Critical path: M1 → M2 → M3 → M4 → M5 → M6 → M7 → M11 → M12
Note: M8 (CA) should be started early (after M1) since M4 (agent communication) requires mTLS client certs.
Estimated Effort
| Milestone | Backend | Frontend | DB | Total |
|---|---|---|---|---|
| M1 | 3 days | 1 day | 1 day | 5 days |
| M2 | 4 days | 2 days | 0.5 day | 6.5 days |
| M3 | 3 days | 3 days | 0.5 day | 6.5 days |
| M4 | 3 days | 2 days | 0.5 day | 5.5 days |
| M5 | 4 days | 2 days | 0.5 day | 6.5 days |
| M6 | 2 days | 1.5 days | 0.5 day | 4 days |
| M7 | 2 days | 1.5 days | 0 | 3.5 days |
| M8 | 2 days | 1.5 days | 0 | 3.5 days |
| M9 | 3 days | 1.5 days | 0 | 4.5 days |
| M10 | 3 days | 2 days | 0.5 day | 5.5 days |
| M11 | 2 days | 1 day | 0.5 day | 3.5 days |
| M12 | 2 days | 0.5 day | 0.5 day | 3 days |
| Total | 33 days | 19.5 days | 5 days | ~57.5 days |
With a single developer: ~12 weeks. With parallel backend/frontend: ~7-8 weeks.
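The week figures can be reproduced from the totals row under two assumptions that the table does not state: a 5-day working week, and, for the parallel case, backend and DB work on one track with frontend work running alongside it. The function names are illustrative.

```rust
/// Assumed length of a working week, in days.
const DAYS_PER_WEEK: f64 = 5.0;

/// One developer does all three columns in sequence.
pub fn single_developer_weeks(backend: f64, frontend: f64, db: f64) -> f64 {
    (backend + frontend + db) / DAYS_PER_WEEK
}

/// Backend + DB on one track, frontend on another; the longer track sets the schedule.
pub fn parallel_weeks(backend: f64, frontend: f64, db: f64) -> f64 {
    (backend + db).max(frontend) / DAYS_PER_WEEK
}
```

With the totals of 33, 19.5, and 5 days this gives 11.5 weeks for a single developer and 7.6 weeks in parallel, which matches the rounded figures of ~12 and ~7-8 weeks.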
Review Notes
- Kelly to review and approve this plan before implementation begins
- Confirm milestone ordering and priorities
- Confirm whether M8 (CA) should be pulled forward to support M4
- Confirm whether any milestones can be deferred to a later release