linux_patch_manager/tasks/todo.md
Echo 6f9c6dc881 M5: Patch Deployment & Job Management
Backend:
- migrations/003_jobs_scheduling.sql: retry_next_at/last_error columns,
  pg_notify trigger for immediate job dispatch, retry index
- pm-agent-client: ApplyPatchesRequest/Response, AgentJobStatus,
  RollbackResponse types; apply_patches/job_status/rollback_job
  client methods + generic POST helper
- pm-core/models: JobStatus, JobKind, PatchJob, PatchJobHost,
  CreateJobRequest, PatchJobSummary
- pm-web/routes/jobs.rs: POST/GET /api/v1/jobs, GET /jobs/:id,
  POST /jobs/:id/cancel, POST /jobs/:id/rollback
- pm-worker/job_executor.rs: NOTIFY listener, periodic scanner,
  execute_host_job, poll_running_jobs, handle_host_failure (3-retry
  exponential backoff 1m/5m/30m), sync_job_status, retry_pending_jobs
- pm-worker/main.rs: spawn job_executor

Frontend:
- types/index.ts: PatchInfo, PatchJobHost, PatchJob, PatchJobSummary,
  CreateJobRequest interfaces
- api/client.ts: jobsApi (list/get/create/cancel/rollback),
  patchesApi (getHostPatches)
- pages/PatchDeploymentPage.tsx: 3-step MUI Stepper
  (host select → configure → result)
- pages/JobsPage.tsx: job list table, expandable per-host detail,
  cancel/rollback actions with confirm dialog, load-more pagination
- App.tsx: /jobs and /deployment routes wired to real pages

cargo check: 0 errors | vite build: 0 errors
2026-04-23 17:08:43 +00:00

Linux Patch Manager — Implementation Plan

Project Structure

linux_patch_manager/
├── Cargo.toml                    # Workspace root
├── crates/
│   ├── pm-web/                   # Axum web server binary crate
│   ├── pm-worker/                # Background worker binary crate
│   ├── pm-core/                  # Shared library: config, DB pool, models, errors, types
│   ├── pm-agent-client/          # mTLS HTTP client for agent communication
│   ├── pm-auth/                  # Auth: JWT (EdDSA), Argon2id, TOTP, WebAuthn, RBAC, Azure SSO
│   ├── pm-ca/                    # Internal CA: rcgen + rustls certificate management
│   └── pm-reports/               # PDF (printpdf + plotters) and CSV generation
├── migrations/                   # SQLx database migrations
│   ├── 001_initial_schema.sql
│   ├── 002_auth_system.sql
│   ├── 003_host_management.sql
│   ├── 004_jobs_and_scheduling.sql
│   ├── 005_audit_logging.sql
│   └── 006_system_config.sql
├── frontend/                      # React + TypeScript SPA
│   ├── src/
│   │   ├── api/                  # API client (axios/fetch)
│   │   ├── components/           # Shared MUI components
│   │   ├── pages/                # 11 page components
│   │   ├── hooks/                # Custom React hooks
│   │   ├── store/                # State management (zustand or context)
│   │   ├── theme/                # MUI theme (light + dark)
│   │   ├── types/                # TypeScript interfaces
│   │   └── utils/                # Utilities
│   ├── package.json
│   ├── vite.config.ts
│   ├── tsconfig.json
│   └── index.html
├── config/
│   └── config.example.toml       # Example configuration
├── systemd/
│   ├── patch-manager-web.service
│   └── patch-manager-worker.service
├── docs/
│   └── runbooks/
│       └── restore.md            # Backup/restore runbook
├── scripts/
│   ├── setup.sh                  # Initial host setup script
│   └── build-frontend.sh         # Frontend build script
├── SPEC.md
├── REQUIREMENTS.md
├── ARCHITECTURE.md
├── README.md
└── .gitignore

Milestones

Each milestone produces a testable vertical slice — backend + frontend + database working together.

M1: Project Scaffolding + Database Schema + Core Infrastructure

Goal: Runnable workspace with DB, config, logging, error handling.

  • Initialize Rust workspace with 7 crates (pm-web, pm-worker, pm-core, pm-agent-client, pm-auth, pm-ca, pm-reports)
  • Initialize React + TypeScript + Vite + MUI frontend project
  • Create config.example.toml with all configuration keys
  • Implement pm-core::config — TOML config loading + env overrides (PATCH_MANAGER__SECTION__KEY)
  • Implement pm-core::db — SQLx PgPool initialization, connection from config
  • Implement pm-core::error — Unified error type with API error envelope (error.code, error.message, error.request_id, error.details)
  • Implement pm-core::request_id — ULID generation + X-Request-Id header middleware
  • Implement pm-core::logging with tracing + tracing-subscriber JSON formatter, configurable log levels
  • Create initial database migrations (001_initial_schema.sql): hosts, groups, host_groups, users, user_groups, refresh_tokens, maintenance_windows, patch_jobs, patch_job_hosts, host_patch_data, host_health_data, certificates, audit_log, azure_sso_config, system_config, worker_heartbeat, discovery_results
  • Implement pm-web binary: Axum app skeleton, static file serving placeholder, /status/health endpoint
  • Implement pm-worker binary: Tokio runtime skeleton, DB connection, worker heartbeat writer (30s interval)
  • Implement sqlx::migrate! embedded migrations in pm-web, advisory lock for single-writer
  • Worker waits for expected schema version before accepting work
  • Create systemd/patch-manager-web.service and systemd/patch-manager-worker.service unit files
  • Create scripts/setup.sh for initial host setup
  • Create scripts/build-frontend.sh
  • Verify: both services start, /status/health returns 200, worker heartbeat updates
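
The env override convention (PATCH_MANAGER__SECTION__KEY) can be illustrated with a minimal sketch. It assumes that the first `__` after the prefix separates section from key and that both map to lowercase TOML names; the function names are hypothetical and not part of pm-core.

```rust
/// Splits an override variable name such as `PATCH_MANAGER__DATABASE__URL`
/// into a lowercase (section, key) pair. Returns None for unrelated variables.
/// The lowercase mapping is an assumption of this sketch.
fn parse_override(var_name: &str) -> Option<(String, String)> {
    let rest = var_name.strip_prefix("PATCH_MANAGER__")?;
    let (section, key) = rest.split_once("__")?;
    if section.is_empty() || key.is_empty() {
        return None;
    }
    Some((section.to_lowercase(), key.to_lowercase()))
}

/// Collects (section, key, value) overrides from (name, value) pairs,
/// for example the pairs yielded by `std::env::vars()`.
fn collect_overrides<I>(vars: I) -> Vec<(String, String, String)>
where
    I: IntoIterator<Item = (String, String)>,
{
    vars.into_iter()
        .filter_map(|(name, value)| {
            parse_override(&name).map(|(section, key)| (section, key, value))
        })
        .collect()
}
```

Calling `collect_overrides(std::env::vars())` at startup would yield the values to layer on top of the TOML file; how they are merged into the config struct is left open here.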

M2: Authentication & Authorization + Frontend Shell

Goal: Users can log in with MFA, JWT auth works, RBAC middleware enforces roles.

  • Implement pm-auth::password — Argon2id hashing with calibrated parameters (m_cost=65536, t_cost=3, p_cost=1)
  • Implement pm-auth::jwt — EdDSA/Ed25519 JWT issuance and validation, 15-min TTL, 90-day key rotation with 24-hour overlap
  • Implement pm-auth::refresh — Opaque 256-bit refresh tokens, hashed storage in refresh_tokens, 1-hour sliding inactivity timeout, rotation on use
  • Implement pm-auth::mfa_totp — TOTP setup, verify, QR code generation
  • Implement pm-auth::mfa_webauthn — WebAuthn registration and authentication
  • Implement pm-auth::rbac — Admin/Operator role middleware, group-scoped access enforcement
  • Implement pm-auth::session — Login flow (password → MFA → access+refresh tokens), logout (revoke refresh), force-revoke
  • Implement pm-web auth routes: POST /api/v1/auth/login, POST /api/v1/auth/refresh, POST /api/v1/auth/logout, MFA setup endpoints
  • Implement IP whitelist middleware on all connection points
  • Frontend: App shell with React Router, MUI theme (light + dark), auth context, login page, MFA setup page
  • Frontend: API client with JWT interceptors (auto-refresh), 401 redirect to login
  • Create seed migration: default admin account
  • Verify: login with MFA, JWT validation, refresh token rotation, RBAC blocks unauthorized access, IP whitelist blocks unknown IPs
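
One possible reading of "90-day key rotation with 24-hour overlap" is sketched below: a key signs for 90 days, then remains valid for verification only for a further 24 hours. This interpretation, the `KeyState` enum, and the function name are assumptions for illustration, not the pm-auth design.

```rust
const ROTATION_PERIOD_SECS: u64 = 90 * 24 * 60 * 60; // 90-day rotation
const OVERLAP_SECS: u64 = 24 * 60 * 60; // 24-hour overlap

#[derive(Debug, PartialEq)]
enum KeyState {
    /// Key signs new tokens and verifies existing ones.
    Active,
    /// Key no longer signs, but still verifies tokens issued before rotation.
    VerifyOnly,
    /// Key is rejected for both signing and verification.
    Retired,
}

/// Classifies a signing key by its age in seconds.
fn key_state(age_secs: u64) -> KeyState {
    if age_secs < ROTATION_PERIOD_SECS {
        KeyState::Active
    } else if age_secs < ROTATION_PERIOD_SECS + OVERLAP_SECS {
        KeyState::VerifyOnly
    } else {
        KeyState::Retired
    }
}
```

With a 15-minute access token TTL, every token signed by the old key expires well inside the overlap, so verification should not fail across a rotation under this reading.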

M3: Host Management + Groups + Frontend Pages

Goal: Full host CRUD, group management, auto-discovery.

  • Implement host CRUD routes: GET/POST /api/v1/hosts, GET/DELETE /api/v1/hosts/{id}
  • Implement FQDN resolution on host add (resolve to IP at registration time)
  • Implement group CRUD routes: GET/POST /api/v1/groups, GET/DELETE /api/v1/hosts/{id}/groups
  • Implement host ↔ group and user ↔ group membership management
  • Implement RBAC scoping: operators can only see/manage hosts in their groups
  • Implement auto-discovery: POST /api/v1/discovery/cidr → worker scans CIDR, bounded concurrency (128), TCP+TLS probe (1.5s timeout), progress tracking, cancel action
  • Implement discovery results table and review flow
  • Implement host removal with audit logging
  • Frontend: Hosts page (filterable list by group, status, OS)
  • Frontend: Host Detail page (system info, packages, patches, jobs, maintenance window config)
  • Frontend: Groups page (manage groups, assign hosts and operators)
  • Frontend: Users page (local account management, MFA setup, group assignments)
  • Verify: add/remove hosts, group assignments, RBAC enforcement, CIDR scan with progress
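
The discovery parameters (bounded concurrency of 128, 1.5s probe timeout) allow a rough sizing estimate. The sketch below computes how long a scan takes when no address answers and every probe runs to its timeout; the function names are hypothetical, and the model ignores scheduling overhead.

```rust
/// Number of addresses in an IPv4 block with the given prefix length.
fn cidr_size(prefix_len: u32) -> u64 {
    assert!(prefix_len <= 32, "IPv4 prefix length must be 0..=32");
    1u64 << (32 - prefix_len)
}

/// Scan duration in milliseconds when every probe runs to its full timeout
/// and `concurrency` probes are in flight at a time.
fn all_timeout_scan_ms(prefix_len: u32, concurrency: u64, timeout_ms: u64) -> u64 {
    assert!(concurrency > 0, "concurrency must be at least 1");
    let waves = (cidr_size(prefix_len) + concurrency - 1) / concurrency; // ceiling division
    waves * timeout_ms
}
```

For a /22 with these settings, `all_timeout_scan_ms(22, 128, 1500)` gives 1024 addresses in 8 waves, or 12000 ms. Probes that connect or are refused early shorten this, which is worth keeping in mind when comparing against the scan-time target in M12.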

M4: Agent Communication Layer + Dashboard

Goal: mTLS client works, health/patch polling operational, dashboard shows fleet status.

  • Implement pm-agent-client — Rustls-based mTLS HTTP client with client certificate, TLS 1.3 only
  • Implement agent API calls: GET /api/v1/health, GET /api/v1/system/info, GET /api/v1/packages, GET /api/v1/patches
  • Implement worker health poller: 5-minute intervals, bounded concurrency (64 semaphore), update host_health_data
  • Implement worker patch data poller: 30-minute intervals, bounded concurrency, update host_patch_data
  • Implement on-demand refresh: POST /api/v1/hosts/{id}/refresh → NOTIFY refresh_requested → worker queries immediately
  • Implement host health status tracking: healthy/degraded/unreachable with timestamps
  • Implement dashboard API: GET /api/v1/status/fleet (authenticated, fleet aggregates)
  • Frontend: Dashboard page — compliance %, health summary, pending patches, upcoming windows, root CA download icon
  • Frontend: Real-time health status indicators (green/yellow/red) on host lists
  • Verify: polling works, dashboard shows live fleet data, on-demand refresh works, visual alerts for unhealthy agents
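
The bounded-concurrency polling idea can be sketched with the standard library alone. The plan calls for a 64-permit semaphore, presumably on Tokio; this sketch swaps in a fixed pool of OS threads draining a shared queue, which bounds in-flight polls in the same way. `poll_all` and its signature are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs `poll` for every host using at most `limit` worker threads.
/// Returns the peak number of polls that were in flight at once.
fn poll_all<F>(hosts: Vec<String>, limit: usize, poll: F) -> usize
where
    F: Fn(&str) + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(hosts)));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let poll = Arc::new(poll);

    let workers: Vec<_> = (0..limit)
        .map(|_| {
            let (queue, in_flight, peak, poll) =
                (queue.clone(), in_flight.clone(), peak.clone(), poll.clone());
            thread::spawn(move || loop {
                // Take the next host; the lock is released before polling starts.
                let next = queue.lock().unwrap().pop_front();
                let Some(host) = next else { break };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                poll(&host);
                in_flight.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

The returned peak can never exceed `limit`, because each worker thread handles one host at a time. An async version would acquire a semaphore permit per host instead of dedicating threads.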

M5: Patch Deployment & Job Management + Frontend Pages

Goal: Full patch lifecycle — queue, immediate, retry, rollback, job monitoring.

  • Implement job creation: POST /api/v1/jobs (queue for window or apply now)
  • Implement patch_jobs and patch_job_hosts row creation
  • Implement NOTIFY job_enqueued for immediate-apply wake
  • Implement worker job executor: call agent POST /api/v1/patches/apply, track async job IDs
  • Implement worker retry engine: exponential backoff (1min, 5min, 30min), 3 retries max
  • Implement patch job auto-retry within maintenance window (1 retry)
  • Implement batch partial failure handling: auto-retry once, then report
  • Implement rollback: POST /api/v1/jobs/{id}/rollback → worker calls agent rollback endpoint
  • Implement job status tracking: poll agent GET /api/v1/jobs/{id} for running jobs
  • Implement job listing/detail API: GET /api/v1/jobs, GET /api/v1/jobs/{id}
  • Frontend: Patch Deployment page (select hosts → review patches → queue or apply now)
  • Frontend: Jobs page (job list, per-host status, rollback action)
  • Verify: queued job waits for window, immediate job runs now, retry logic works, rollback works, batch partial failures reported
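
The retry schedule (1min, 5min, 30min, 3 retries max) reduces to a small lookup. The sketch below is one way to express it; `retry_delay`, `next_step_after_failure`, and `HostOutcome` are hypothetical names and not the pm-worker API.

```rust
use std::time::Duration;

/// Backoff delays for retries 1, 2, and 3.
const BACKOFF: [Duration; 3] = [
    Duration::from_secs(60),      // 1 min
    Duration::from_secs(5 * 60),  // 5 min
    Duration::from_secs(30 * 60), // 30 min
];

/// Delay before retry number `retry` (1-based), or None once retries are exhausted.
fn retry_delay(retry: u32) -> Option<Duration> {
    let index = retry.checked_sub(1)? as usize;
    BACKOFF.get(index).copied()
}

#[derive(Debug, PartialEq)]
enum HostOutcome {
    /// Retry at this Unix timestamp (seconds).
    RetryAt(u64),
    /// All retries used; report the host as failed.
    Failed,
}

/// Decides what happens after the `failure_count`-th failure of a host job.
fn next_step_after_failure(failure_count: u32, now_secs: u64) -> HostOutcome {
    match retry_delay(failure_count) {
        Some(delay) => HostOutcome::RetryAt(now_secs + delay.as_secs()),
        None => HostOutcome::Failed,
    }
}
```

Under this sketch a host job that fails a fourth time (the initial attempt plus three retries) is marked failed rather than rescheduled.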

M6: Maintenance Windows & Scheduling + Frontend Page

Goal: Per-device recurring and one-time maintenance windows, auto-execution at window open.

  • Implement maintenance window CRUD: GET/POST/PUT/DELETE /api/v1/hosts/{id}/maintenance-windows
  • Implement recurring schedule logic: daily, weekly, monthly (cron-like evaluation)
  • Implement one-time window support
  • Implement worker job scheduler: detect window openings, dispatch queued jobs
  • Implement window-open event triggering job execution
  • Frontend: Maintenance Windows page (per-device schedule management)
  • Frontend: Maintenance window config on Host Detail page
  • Verify: create recurring/one-time windows, queued jobs execute at window open, window expiration stops execution
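
For the weekly case of the recurring schedule, the "is the window open now" check can be sketched as modular arithmetic over minutes of the week. The `WeeklyWindow` struct and its fields are assumptions; daily and monthly schedules and time zones are not covered.

```rust
const MINUTES_PER_DAY: u32 = 24 * 60;
const MINUTES_PER_WEEK: u32 = 7 * MINUTES_PER_DAY;

/// A weekly recurring window. `weekday` is 0 = Monday .. 6 = Sunday.
struct WeeklyWindow {
    weekday: u32,
    start_minute: u32,     // minutes after midnight, 0..1440
    duration_minutes: u32, // window length; may run past midnight
}

impl WeeklyWindow {
    /// True if the window is open at the given weekday and minute of day.
    fn is_open(&self, weekday: u32, minute_of_day: u32) -> bool {
        let start = self.weekday * MINUTES_PER_DAY + self.start_minute;
        let now = weekday * MINUTES_PER_DAY + minute_of_day;
        // Minutes elapsed since the most recent window start, wrapping at week end.
        let elapsed = (now + MINUTES_PER_WEEK - start) % MINUTES_PER_WEEK;
        elapsed < self.duration_minutes
    }
}
```

Because the end is exclusive, a scheduler using this check stops dispatching at the minute the window closes, which matches the "window expiration stops execution" item above.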

M7: WebSocket Relay (Real-Time Job Status)

Goal: Browser receives live job updates via WebSocket.

  • Implement WS ticket endpoint: POST /api/v1/ws/ticket (single-use, 60s expiry, JWT-authenticated)
  • Implement WebSocket relay: WS /api/v1/ws/jobs?ticket=... → authenticated browser connection
  • Implement agent WebSocket consumption: worker subscribes to agent WS /api/v1/ws/jobs for running jobs
  • Implement event multiplexing: agent WS events → PostgreSQL update → browser WS push
  • Frontend: WebSocket client hook with auto-reconnect and ticket refresh
  • Frontend: Live job progress updates on Jobs page
  • Verify: open job in browser, see real-time progress updates, WS ticket expires correctly
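
The single-use, 60s-expiry ticket semantics can be sketched with an in-memory map. `TicketStore` is a hypothetical type; the sketch assumes the caller supplies an unguessable ticket string from a cryptographic random source and passes the current time in seconds.

```rust
use std::collections::HashMap;

const TICKET_TTL_SECS: u64 = 60;

/// In-memory store of outstanding WebSocket tickets.
struct TicketStore {
    // ticket -> (user id, issued-at timestamp in seconds)
    tickets: HashMap<String, (String, u64)>,
}

impl TicketStore {
    fn new() -> Self {
        TicketStore { tickets: HashMap::new() }
    }

    /// Records a ticket for `user`, issued at `now` (seconds).
    fn issue(&mut self, ticket: String, user: String, now: u64) {
        self.tickets.insert(ticket, (user, now));
    }

    /// Redeems a ticket and returns the user it was issued to.
    /// The entry is removed whether or not it is still valid,
    /// so a ticket can never be used twice.
    fn redeem(&mut self, ticket: &str, now: u64) -> Option<String> {
        let (user, issued_at) = self.tickets.remove(ticket)?;
        if now.saturating_sub(issued_at) <= TICKET_TTL_SECS {
            Some(user)
        } else {
            None
        }
    }
}
```

If pm-web runs as more than one process, the same remove-then-check step would need to live in shared storage such as PostgreSQL rather than in process memory.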

M8: Internal CA + Certificate Management + Frontend Page

Goal: CA issues/renews certs, download links work.

  • Implement pm-ca — CA initialization (root key + cert generation), stored at /etc/patch-manager/ca/ with 0600 permissions
  • Implement client certificate issuance for mTLS (per-host certs)
  • Implement certificate renewal flow
  • Implement certificate revocation (mark revoked in certificates table, re-issue replacement)
  • Implement download endpoints: GET /api/v1/ca/root.crt, GET /api/v1/hosts/{id}/client.crt
  • Implement Web UI TLS certificate: self-signed from internal CA (default) or operator-supplied cert/key
  • Frontend: Certificates page (view/manage CA, issue/renew certs, view expiry)
  • Frontend: Root CA download icon on Dashboard
  • Frontend: Host-specific cert download icon on Host Detail page
  • Verify: CA generates certs, downloads work, TLS cert strategy switchable
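
Storing CA key material with 0600 permissions can be done at file-creation time on Unix. The sketch below uses only the standard library; `write_private_file` is a hypothetical helper, and refusing to overwrite an existing file is a design choice of this sketch, not a stated requirement.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Writes secret key material to `path`, creating the file with mode 0600.
/// Fails if the file already exists, so an existing key is never overwritten.
fn write_private_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .mode(0o600) // owner read/write only, applied when the file is created
        .open(path)?;
    file.write_all(contents)?;
    // Flush data and metadata to disk before reporting success.
    file.sync_all()
}
```

Setting the mode in the `open` call avoids a window in which the key exists with wider permissions, which a create-then-chmod sequence would have.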

M9: Reporting (CSV + PDF with Charts) + Frontend Page

Goal: All 4 report types exportable as CSV and PDF.

  • Implement pm-reports::csv — CSV generation for all report types
  • Implement pm-reports::pdf — PDF generation with printpdf + plotters charts
  • Implement compliance report: % hosts fully patched by group/fleet, trend charts
  • Implement patch history report: operations per host/group
  • Implement vulnerability exposure report: hosts with pending CVEs
  • Implement audit trail report: who did what when
  • Implement report API: GET /api/v1/reports/compliance, patch-history, vulnerability, audit with ?format=csv|pdf
  • Frontend: Reports page (select type, filters, generate, download)
  • Verify: all 4 reports generate as CSV and PDF, PDFs include charts
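
For the CSV side of the reports, the main correctness concern is field quoting. The sketch below shows RFC 4180 style escaping with the standard library; `csv_field` and `csv_row` are hypothetical helpers, and a real pm-reports::csv may well use a CSV crate instead.

```rust
/// Quotes a field when it contains a comma, double quote, or line break,
/// doubling any embedded double quotes (RFC 4180 style).
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Joins escaped fields into one CSV record (without a trailing line break).
fn csv_row(fields: &[&str]) -> String {
    let cells: Vec<String> = fields.iter().map(|f| csv_field(f)).collect();
    cells.join(",")
}
```

Group names and audit messages are free text, so they are the fields most likely to need quoting in the compliance and audit trail reports.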

M10: Settings Page (Azure SSO, SMTP, TLS, IP Whitelist) + Frontend Page

Goal: All runtime configuration manageable from the UI.

  • Implement system_config table CRUD API
  • Implement Azure SSO configuration: tenant ID, client ID/secret, redirect URI, scopes
  • Implement "Test Connection" action for Azure SSO (round-trip against Azure AD, report success/failure without enabling)
  • Implement SMTP configuration: host, port, auth mode, username/password, TLS mode, from-address
  • Implement "Send Test Email" action for SMTP
  • Implement polling interval tuning (health, patch) in Settings
  • Implement Web UI TLS certificate strategy selection (internal CA vs. operator-supplied)
  • Implement IP whitelist management in Settings
  • Implement Azure SSO OAuth2/OIDC Authorization Code flow with PKCE
  • Frontend: Settings page with all configuration sections and test actions
  • Verify: Azure SSO test connection works, test email sends, TLS strategy switches, IP whitelist updates take effect
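
The IP whitelist check behind the Settings page can be sketched as CIDR containment. The sketch assumes IPv4 entries stored as (network, prefix length) pairs and treats an empty whitelist as "deny all"; both are assumptions, as are the function names.

```rust
use std::net::Ipv4Addr;

/// True if `ip` falls inside the network `net`/`prefix_len`.
fn in_cidr(ip: Ipv4Addr, net: Ipv4Addr, prefix_len: u32) -> bool {
    assert!(prefix_len <= 32, "IPv4 prefix length must be 0..=32");
    // A /0 mask is all zeros; shifting a u32 by 32 bits would overflow,
    // so that case is handled explicitly.
    let mask = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    (u32::from(ip) & mask) == (u32::from(net) & mask)
}

/// True if `ip` matches at least one whitelist entry.
fn is_allowed(ip: Ipv4Addr, whitelist: &[(Ipv4Addr, u32)]) -> bool {
    whitelist.iter().any(|&(net, prefix_len)| in_cidr(ip, net, prefix_len))
}
```

For whitelist updates to take effect without a restart, the middleware would need to re-read the entries from system_config or be notified of changes; that part is not shown.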

M11: Email Notifications + Audit Logging Hardening

Goal: Optional email works, audit logs are tamper-evident.

  • Implement email notifier in worker (Lettre crate, optional/disabled by default)
  • Implement email templates: patch failure, job completion, maintenance window reminders
  • Implement audit log hash chaining: prev_hash + row_hash on every insert
  • Implement periodic audit integrity verification job
  • Implement on-demand audit integrity verification from UI
  • Implement audit log for all configuration changes (Azure SSO, SMTP, IP whitelist, TLS cert strategy)
  • Implement audit log for certificate operations (issue, renew, download, revoke)
  • Frontend: Email notification settings integration in Settings page
  • Frontend: Audit integrity verification action in Reports/Users area
  • Verify: email sends on failure, audit chain is intact, tampering detected by verification
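
The prev_hash + row_hash chaining and its verification can be sketched as follows. To stay within the standard library the sketch uses `DefaultHasher`, which is not a cryptographic hash and offers no tamper resistance; a real chain would use something like SHA-256. `AuditRow`, `append`, and `first_broken_row` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct AuditRow {
    entry: String,
    prev_hash: u64,
    row_hash: u64,
}

/// Stand-in hash over (prev_hash, entry). NOT cryptographic; illustration only.
fn hash_row(prev_hash: u64, entry: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev_hash.hash(&mut hasher);
    entry.hash(&mut hasher);
    hasher.finish()
}

/// Appends an entry, linking it to the hash of the previous row (0 for the first row).
fn append(log: &mut Vec<AuditRow>, entry: &str) {
    let prev_hash = log.last().map_or(0, |row| row.row_hash);
    let row_hash = hash_row(prev_hash, entry);
    log.push(AuditRow { entry: entry.to_string(), prev_hash, row_hash });
}

/// Returns the index of the first row that breaks the chain, if any.
fn first_broken_row(log: &[AuditRow]) -> Option<usize> {
    let mut expected_prev = 0;
    for (i, row) in log.iter().enumerate() {
        let link_ok = row.prev_hash == expected_prev;
        let hash_ok = row.row_hash == hash_row(row.prev_hash, &row.entry);
        if !link_ok || !hash_ok {
            return Some(i);
        }
        expected_prev = row.row_hash;
    }
    None
}
```

Editing a stored entry changes its recomputed hash, so verification reports that row. Rewriting a row together with its hash breaks the link to the next row instead, so the change is still detected one position later.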

M12: Deployment Packaging, Backup/DR, Integration Testing

Goal: Production-ready deployment with documented runbooks.

  • Create docs/runbooks/restore.md — backup/restore procedure
  • Implement nightly pg_dump script to /var/backups/patch-manager/
  • Implement CA material backup inclusion
  • Implement /etc/patch-manager/ config backup (excluding secrets unless encrypted destination)
  • Create scripts/setup.sh — full host setup (install deps, create service user, set permissions, initialize DB)
  • Finalize systemd unit files with proper dependencies, restart policies, logging
  • End-to-end integration tests: full patch lifecycle across multiple agents
  • Performance test: verify 500-host polling, dashboard load < 5s, CIDR scan < 10s for /22
  • Security review: TLS 1.3 enforcement, IP whitelist, RBAC, audit chain integrity
  • Compliance mapping verification: HIPAA and PCI-DSS controls documented and testable
  • Verify: backup/restore works, RPO 24h / RTO 4h achievable, all NFRs met

Dependency Graph

M1 (scaffolding)
 ├──> M2 (auth)
 │      ├──> M3 (hosts/groups)
 │      │      ├──> M4 (agent comm + dashboard)
 │      │      │      ├──> M5 (patch deployment + jobs)
 │      │      │      │      ├──> M6 (maintenance windows)
 │      │      │      │      │      └──> M7 (websocket relay)
 │      │      │      │      └──> M7 (websocket relay)
 │      │      │      └──> M8 (CA + certs)
 │      │      └──> M8 (CA + certs)
 │      └──> M10 (settings)
 ├──> M8 (CA + certs) [needed by M4 for mTLS]
 └──> M9 (reports)

M10 (settings) ──> M11 (email + audit hardening)
M11 ──> M12 (deployment + testing)

Critical path: M1 → M2 → M3 → M4 → M5 → M6 → M7 → M11 → M12

Note: M8 (CA) should be started early (after M1) since M4 (agent communication) requires mTLS client certs.


Estimated Effort

Milestone   Backend    Frontend    DB         Total
M1          3 days     1 day       1 day      5 days
M2          4 days     2 days      0.5 day    6.5 days
M3          3 days     3 days      0.5 day    6.5 days
M4          3 days     2 days      0.5 day    5.5 days
M5          4 days     2 days      0.5 day    6.5 days
M6          2 days     1.5 days    0.5 day    4 days
M7          2 days     1.5 days    0 days     3.5 days
M8          2 days     1.5 days    0 days     3.5 days
M9          3 days     1.5 days    0 days     4.5 days
M10         3 days     2 days      0.5 day    5.5 days
M11         2 days     1 day       0.5 day    3.5 days
M12         2 days     0.5 day     0.5 day    3 days
Total       33 days    19.5 days   5 days     ~57.5 days

With a single developer: ~12 weeks. With parallel backend/frontend: ~7-8 weeks.
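
The week figures follow from the column totals under a few assumptions that the plan does not state: a 5-day working week, DB work done by the backend developer, and no coordination overhead between the two tracks. A short sketch of the arithmetic, with hypothetical function names:

```rust
/// Assumption: five working days per week.
const WORK_DAYS_PER_WEEK: f64 = 5.0;

/// One developer does all backend, frontend, and DB work in sequence.
fn single_developer_weeks(backend: f64, frontend: f64, db: f64) -> f64 {
    (backend + frontend + db) / WORK_DAYS_PER_WEEK
}

/// Two developers work in parallel; the longer track sets the duration.
/// Assumption: DB work is done by the backend developer.
fn parallel_weeks(backend: f64, frontend: f64, db: f64) -> f64 {
    (backend + db).max(frontend) / WORK_DAYS_PER_WEEK
}
```

With the totals from the table (33, 19.5, and 5 days) this gives 11.5 weeks for a single developer and 7.6 weeks for the parallel case, in line with the rounded estimates above.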


Review Notes

  • Kelly to review and approve this plan before implementation begins
  • Confirm milestone ordering and priorities
  • Confirm whether M8 (CA) should be pulled forward to support M4
  • Confirm whether any milestones can be deferred to a later release