diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..859b358 --- /dev/null +++ b/.gitignore @@ -0,0 +1,14 @@ +# Agent Zero project data +.a0proj/ + +# Python environments & cache +venv/** +**/__pycache__/** + +# Node.js dependencies +**/node_modules/** +**/.npm/** + +# IDE +.vscode/ +.idea/ diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index 3fbbecf..826430e 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -7,42 +7,326 @@ ## Architecture Decisions - +| Decision | Choice | Rationale | +|----------|--------|-----------| +| Backend language/framework | Rust with Axum | Security-aligned with linux_patch_api, memory-safe, high async performance | +| Frontend framework | React + TypeScript SPA | Rich ecosystem for enterprise dashboards, strong typing | +| Database | PostgreSQL with SQLx | Enterprise-grade, type-safe Rust queries, handles concurrent access | +| Async runtime | Tokio | Standard Rust async runtime, integrates with Axum | +| Deployment model | Single bare metal/VM | Simplicity, supports up to 2,500 managed hosts | +| Frontend serving | Axum serves static files | Simplest deployment, single process | +| Background processing | Separate worker process | Clean separation of concerns, communicates via PostgreSQL | +| Session management | JWT + refresh tokens | Short-lived access tokens (15 min), revocable refresh tokens (1 hr) | +| Encryption at rest | LUKS full-disk (infrastructure) | HIPAA/PCI-DSS compliant, handled at infrastructure level | +| Certificate management | Internal CA on Patch Manager host | Issues/renews mTLS certs, manual distribution to clients | ## System Architecture - +``` +┌──────────────────────────────────────────────────────────────┐ +│ Linux Patch Manager Host │ +│ (Ubuntu 24.04) │ +│ │ +│ ┌─────────────────────┐ ┌──────────────────────────────┐ │ +│ │ Axum Web Server │ │ Background Worker │ │ +│ │ │ │ │ │ +│ │ ┌───────────────┐ │ │ ┌────────────────────────┐ │ │ +│ │ │ REST API │ │ │ │ Health Poller │ │ │ +│ │ 
│ (CRUD, auth) │ │ │ │ (5 min intervals) │ │ │ +│ │ └───────────────┘ │ │ └────────────────────────┘ │ │ +│ │ ┌───────────────┐ │ │ ┌────────────────────────┐ │ │ +│ │ │ WebSocket │ │ │ │ Patch Data Poller │ │ │ +│ │ │ Relay │ │ │ │ (30 min intervals) │ │ │ +│ │ └───────────────┘ │ │ └────────────────────────┘ │ │ +│ │ ┌───────────────┐ │ │ ┌────────────────────────┐ │ │ +│ │ │ Static Files │ │ │ │ Job Scheduler │ │ │ +│ │ │ (React SPA) │ │ │ │ (maintenance windows) │ │ │ +│ │ └───────────────┘ │ │ └────────────────────────┘ │ │ +│ │ ┌───────────────┐ │ │ ┌────────────────────────┐ │ │ +│ │ │ mTLS Client │ │ │ │ Retry Engine │ │ │ +│ │ │ (agent comm) │◄─┼────┼─►│ (exp. backoff) │ │ │ +│ │ └───────────────┘ │ │ └────────────────────────┘ │ │ +│ └─────────┬─────────┘ │ ┌────────────────────────┐ │ │ +│ │ │ │ Email Notifier │ │ │ +│ │ │ │ (optional/disabled) │ │ │ +│ │ │ └────────────────────────┘ │ │ +│ │ └──────────────┬───────────────┘ │ +│ │ │ │ +│ │ ┌───────────────────┘ │ +│ │ │ │ +│ ┌─────────▼─────────▼──────────────────────────────────┐ │ +│ │ PostgreSQL │ │ +│ │ (hosts, groups, users, jobs, schedules, audit, etc.) │ │ +│ └───────────────────────────────────────────────────────┘ │ +│ │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ Internal CA (mTLS certs) │ │ +│ └───────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────┘ + │ + mTLS / REST API (port 12443) + ┌──────┼──────┐ + ▼ ▼ ▼ + ┌──────┐┌──────┐┌──────┐ + │ Host ││ Host ││ Host │ ← Linux Patch API agents + │ A ││ B ││ C │ (up to 2,500) + └──────┘└──────┘└──────┘ +``` ## Component Design - +### 1. Axum Web Server + +**Responsibility:** Handle all HTTP/HTTPS requests from browsers and serve the React SPA. 
+ +- **REST API:** CRUD operations for hosts, groups, users, schedules, certificates, reports +- **WebSocket Relay:** Proxy real-time job status from agent WebSocket streams to browser clients +- **Static File Server:** Serve compiled React SPA (HTML, JS, CSS, assets) +- **Authentication:** JWT access token validation, refresh token management, MFA enforcement +- **Authorization:** RBAC middleware enforcing admin/operator/group-scoped access +- **mTLS Client:** HTTP client with client certificates for communicating with Linux Patch API agents + +**API Versioning:** URL path versioning (`/api/v1/`) to match the upstream Linux Patch API convention. + +### 2. Background Worker + +**Responsibility:** All scheduled and asynchronous background processing. + +- **Health Poller:** Periodic health checks to all registered agents (5-minute intervals) +- **Patch Data Poller:** Periodic patch availability queries to all agents (30-minute intervals) +- **Job Scheduler:** Execute queued patch operations when maintenance windows open +- **Retry Engine:** Handle agent communication failures with exponential backoff (3 retries, max 30 min between retries) +- **Job Executor:** Trigger patch operations on agents, track async job status +- **Email Notifier:** Optional email notifications (disabled by default) +- **Data Pruner:** Delete operational data older than 30 days and audit logs older than 6 months + +**Communication:** The worker reads the job queue from PostgreSQL and writes results back to it. The web server reads those results from PostgreSQL for API responses. + +### 3. PostgreSQL Database + +**Responsibility:** Persistent storage for all application data. 
+ +**Key Tables:** +- `hosts` — registered hosts, metadata, health status, last seen +- `groups` — static groups for access control +- `host_groups` — many-to-many host ↔ group membership +- `users` — local accounts with hashed passwords, MFA secrets +- `user_groups` — many-to-many user ↔ group membership +- `refresh_tokens` — server-side refresh tokens for session management +- `maintenance_windows` — per-device recurring and one-time schedules +- `patch_jobs` — queued, running, completed, failed patch operations +- `patch_job_hosts` — per-host status within a batch job +- `host_patch_data` — cached patch availability data from agents +- `host_health_data` — cached health check results +- `certificates` — issued mTLS client certificates +- `audit_log` — tamper-evident audit trail +- `azure_sso_config` — Azure AD SSO configuration + +**Data Retention:** +- Operational data (health, patches, jobs): 30 days +- Audit logs: 6 months + +### 4. React + TypeScript SPA + +**Responsibility:** User-facing web interface. + +**Pages:** +1. Dashboard — fleet overview, compliance %, health summary, upcoming windows, root CA download +2. Hosts — filterable host list by group, status, OS +3. Host Detail — system info, packages, patches, jobs, maintenance window config, host cert download +4. Patch Deployment — select hosts, review patches, deploy (queue or immediate) +5. Jobs — real-time job monitoring with WebSocket updates +6. Maintenance Windows — per-device recurring/one-time schedule management +7. Groups — manage static groups, assign hosts and operators +8. Reports — generate/export compliance, patch history, vulnerability, audit (CSV/PDF) +9. Users — local account management, MFA setup, group assignments +10. Certificates — view/manage internal CA, issue/renew client certs +11. Settings — system config, Azure SSO, polling intervals + +### 5. Internal CA + +**Responsibility:** mTLS certificate management for agent communication. 
+ +- Runs on the same Patch Manager host +- Issues client certificates for mTLS communication with agents +- Manages certificate renewal +- Root CA certificate downloadable from dashboard for manual distribution +- Host-specific mTLS certificates downloadable from host detail page +- No automated distribution to clients — server administrators handle this manually ## Data Flow - +### Host Registration Flow +``` +1. Admin enters FQDN/IP → Axum validates & resolves FQDN +2. Axum stores host in PostgreSQL +3. Worker picks up new host → initial health check via mTLS +4. Health result stored in PostgreSQL → visible in dashboard +``` + +### Auto-Discovery Flow +``` +1. Admin triggers CIDR scan → Axum sends request to Worker +2. Worker scans subnet for agents on port 12443 +3. Discovered agents reported back → Admin selects which to register +4. Selected hosts stored in PostgreSQL +``` + +### Patch Deployment Flow (Queued) +``` +1. Operator selects hosts + patches → chooses "Queue for next window" +2. Axum creates patch job in PostgreSQL (status: queued) +3. When maintenance window opens → Worker triggers patch operations on agents +4. Worker monitors async job status via agent API +5. Results stored in PostgreSQL → WebSocket relay pushes updates to browser +6. Failed jobs auto-retried once if still within window +``` + +### Patch Deployment Flow (Immediate) +``` +1. Operator selects hosts + patches → chooses "Apply Now" +2. Axum creates patch job in PostgreSQL (status: pending) +3. Worker immediately triggers patch operations on agents +4. Same monitoring and retry logic as queued flow +``` + +### Health/Patch Polling Flow +``` +1. Worker polls each agent on schedule (5 min health, 30 min patches) +2. Results cached in PostgreSQL +3. Unhealthy agents marked with visual alerts in dashboard +4. 
On-demand refresh: operator clicks refresh → Worker queries agent immediately +``` ## Technology Stack - +| Layer | Technology | Version/Notes | +|-------|-----------|---------------| +| Backend | Rust + Axum | Tokio async runtime, Tower middleware | +| Database | PostgreSQL | SQLx for type-safe queries, migrations via sqlx-cli | +| Frontend | React + TypeScript | Vite build tooling | +| UI Components | MUI (Material UI) | Enterprise dashboard components, dark mode, theming | +| WebSocket | Axum native WebSocket | Agent → Manager → Browser relay | +| Auth (Local) | Argon2 password hashing + TOTP/WebAuthn | MFA enforcement | +| Auth (SSO) | OAuth2/OIDC via Azure AD | Optional, with Azure MFA | +| Session | JWT (access) + PostgreSQL (refresh) | 15 min access, 1 hr refresh | +| mTLS Client | Rustls + client certs | TLS 1.3 only | +| Internal CA | Rustls/RCGen | Certificate issuance and renewal | +| Email | Lettre (Rust email crate) | Optional, disabled by default | +| PDF Export | Rust PDF generation crate | Compliance and audit reports | +| CSV Export | Rust CSV crate | Data export for all report types | +| Service Management | systemd | Ubuntu 24.04 | +| Static Files | Axum built-in static file serving | React SPA served directly | ## Security Architecture - +### Authentication +- **Local accounts:** Argon2-hashed passwords + TOTP or WebAuthn for MFA +- **Azure SSO:** OAuth2/OIDC flow with Azure AD, using Azure's built-in MFA +- **Session tokens:** Short-lived JWT (15 min) for API access, server-side refresh tokens (1 hr inactivity timeout) +- **Refresh token revocation:** Stored in PostgreSQL, can be immediately revoked for forced logout + +### Authorization (RBAC) +- **Admin:** Full access to all resources and settings +- **Operator:** Can add/remove clients, manage schedules and patches only for devices in their group memberships +- **Group scoping:** Operators can only interact with hosts in their assigned groups +- **Ungrouped hosts:** Accessible by any 
operator or admin + +### Agent Communication +- **mTLS:** Client certificate authentication for all agent communication +- **TLS 1.3 only:** No older TLS versions +- **Internal CA:** Patch Manager manages CA, issues and renews client certificates +- **Manual distribution:** Server administrators manually install certs on managed clients + +### Data Protection +- **Encryption at rest:** LUKS full-disk encryption (infrastructure-managed) +- **Encryption in transit:** TLS 1.3 for all connections (agent and web UI) +- **Audit log integrity:** Tamper-evident logging (hash chaining) +- **Password storage:** Argon2 with salt + +### Compliance +- **HIPAA:** Audit controls, access controls, integrity controls, transmission security, automatic logoff +- **PCI-DSS:** Vulnerability management (core function), access restrictions, user identification, audit tracking, data protection ## Deployment Architecture - +``` +┌─────────────────────────────────────────┐ +│ Patch Manager Host (Ubuntu 24.04) │ +│ │ +│ ┌─────────────────────────────────────┐ │ +│ │ systemd: patch-manager-web │ │ +│ │ (Axum web server + static files) │ │ +│ └─────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────┐ │ +│ │ systemd: patch-manager-worker │ │ +│ │ (Background polling + jobs) │ │ +│ └─────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────┐ │ +│ │ PostgreSQL │ │ +│ │ (Database) │ │ +│ └─────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────┐ │ +│ │ Internal CA │ │ +│ │ (Certificate management) │ │ +│ └─────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────┐ │ +│ │ LUKS (Full-disk encryption) │ │ +│ │ (Infrastructure-managed) │ │ +│ └─────────────────────────────────────┘ │ +└─────────────────────────────────────────┘ +``` + +- Two systemd services: `patch-manager-web` and `patch-manager-worker` +- PostgreSQL runs on the same host +- Internal CA runs on the same host +- 
LUKS full-disk encryption managed by infrastructure +- No Docker/LXC — bare metal/VM deployment +- Internal network only — no public internet exposure ## Scalability - +- **Single-instance design:** Targets a typical fleet of about 500 hosts and supports up to 2,500 +- **Manual horizontal scaling:** Divide clients between multiple Patch Manager hosts if needed +- **Concurrency:** Axum on Tokio is expected to handle thousands of concurrent connections +- **Background worker:** Polling and job processing run in a separate process, so their load is isolated from web serving +- **Database:** A single PostgreSQL instance on the same host is expected to be sufficient at this fleet size +- **No automatic clustering or load balancing required** ## Integration Points - - **Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api) +| Integration | Protocol | Direction | Purpose | +|-------------|----------|-----------|----------| +| Agent REST API | HTTPS/mTLS (TLS 1.3) | Manager → Agent | Queries, patch operations | +| Agent WebSocket | WSS/mTLS | Agent → Manager | Real-time job status streaming | +| Azure AD | HTTPS/OAuth2 | Manager → Azure | SSO authentication (optional) | + +**API Endpoints Used:** +- `GET /api/v1/health` — Agent health checks +- `GET /api/v1/system/info` — Host system information +- `GET /api/v1/packages` — List installed packages +- `GET /api/v1/patches` — List available patches +- `POST /api/v1/patches/apply` — Apply patches +- `PUT /api/v1/packages/{name}` — Update specific package +- `DELETE /api/v1/packages/{name}` — Remove package +- `POST /api/v1/packages` — Install packages +- `GET /api/v1/jobs` — List jobs +- `GET /api/v1/jobs/{id}` — Get job status +- `POST /api/v1/jobs/{id}/rollback` — Roll back a job +- `POST /api/v1/system/reboot` — Reboot host +- `WebSocket /api/v1/ws/jobs` — Real-time job status + ## Monitoring and Observability - +- **Application logging:** Structured JSON logging (tracing crate) +- **Log levels:** Configurable at runtime (DEBUG, INFO, WARN, ERROR) +- **Health endpoint:** `GET /api/v1/health` on the 
Patch Manager's own API for infrastructure monitoring +- **Dashboard alerts:** Visual indicators for unhealthy/unreachable agents (red/yellow status) +- **Audit logging:** All significant events logged to PostgreSQL with tamper-evident hash chaining +- **No external monitoring integration required** (dashboard-only alerts) diff --git a/REQUIREMENTS.md b/REQUIREMENTS.md index 9b9c2f1..4607032 100644 --- a/REQUIREMENTS.md +++ b/REQUIREMENTS.md @@ -7,63 +7,146 @@ ## Functional Requirements - - ### FR-01: Host Management +- Manual host registration by FQDN or IP address (FQDN resolved to IP at add time) +- On-demand auto-discovery targeting a CIDR subnet range (scans for Linux Patch API agents on port 12443) +- Host metadata tracked: hostname, IP, OS, kernel, agent version, last seen, health status +- Static group-based organization with many-to-many relationships (hosts can belong to multiple groups) +- Ungrouped hosts can be managed by any operator or admin +- Host removal with audit logging ### FR-02: Patch Monitoring +- Scheduled background polling: 5-minute intervals for health checks, 30-minute intervals for patch data +- On-demand refresh triggered by operator/admin from the UI +- Visual dashboard alerts for unhealthy or unreachable agents (red/yellow status indicators) +- CVE severity, patch priority, and reboot requirement display per host ### FR-03: Patch Deployment +- Patches queue for the next available maintenance window by default +- Immediate-apply override option for urgent patches +- No approval gate required — operator/admin triggers deployment directly +- Auto-retry failed patch jobs once if still within the maintenance window, then surface failure prominently +- Batch operations across multiple hosts with partial failure handling (auto-retry once, then report failures) +- Rollback support via upstream Linux Patch API rollback endpoint ### FR-04: Scheduling +- Maintenance windows are per-device (not per-group) +- Recurring schedules: daily, weekly, 
or monthly +- One-time maintenance windows +- Patch operations execute automatically when a maintenance window opens ### FR-05: Reporting +- Compliance report: percentage of hosts fully patched, by group or fleet-wide +- Patch history: log of all patch operations per host or per group +- Vulnerability exposure: hosts with known CVEs pending patches +- Audit trail: who did what when (user actions, patch operations) +- Export formats: CSV and PDF ### FR-06: User Management +- **Admin role**: Full access to manage all aspects of Linux Patch Manager +- **Operator role**: Can add/remove clients, manage schedules and patches only for devices in their group memberships +- Operators can belong to multiple groups +- Local accounts with MFA required (TOTP or WebAuthn) +- Azure SSO integration (optional, with Azure's built-in MFA) +- Group membership management for users and hosts ## Non-Functional Requirements - - ### NFR-01: Security +- Combination authentication: local accounts + Azure SSO +- MFA required for all users (TOTP or WebAuthn; Azure MFA for SSO users) +- Session management: short-lived JWT access tokens (15 min) + server-side refresh tokens (1-hour inactivity timeout, revocable) +- mTLS for all agent communication (certificate-based, TLS 1.3 only) +- HTTPS enforced for web UI +- Internal CA managed by Patch Manager for mTLS certificate issuance and renewal +- Certificate distribution to managed clients is manual (server administrators responsible) +- RBAC with group-scoped access control ### NFR-02: Performance +- Support 500 typical managed hosts, up to 2,500 +- Dashboard load time under 5 seconds for full fleet view +- Background polling must not degrade UI responsiveness +- Concurrent batch operations (e.g., patch 500 hosts simultaneously) must not overwhelm the system ### NFR-03: Scalability +- Single-instance design on bare metal/VM (Ubuntu 24.04) +- Manual horizontal scaling by dividing clients between multiple Patch Manager hosts if needed +- No automatic 
clustering or load balancing required ### NFR-04: Reliability +- Agent communication failures: retry with exponential backoff (3 retries, max 30 minutes between retries) +- Patch job failures: auto-retry once within maintenance window, then surface to operators +- Batch partial failures: auto-retry once, then report remaining failures to operator +- Continue processing healthy hosts regardless of individual host failures ### NFR-05: Usability +- 11-page web UI (React + TypeScript SPA) +- Responsive design for desktop/laptop screens +- Dark mode support +- Certificate download links integrated into dashboard (root CA) and host detail (host-specific mTLS) ## Interface Requirements - - ### IR-01: Web Interface +- React + TypeScript SPA served by Axum backend +- Real-time job status via WebSocket relay (agent WebSocket → Patch Manager → browser) +- RESTful API backend for all UI operations +- Certificate download endpoints for root CA and host-specific mTLS certs ### IR-02: Linux Patch API Integration +- All managed device communication via Linux Patch API (upstream agent) +- mTLS client certificate authentication to each agent +- Base path: `/api/v1/`, Port: 12443, TLS 1.3 only +- Sync operations: GET endpoints (packages, patches, system info, health) +- Async operations: POST/PUT/DELETE endpoints (install, update, remove, patch apply, reboot) +- Job status tracking via GET `/api/v1/jobs/{id}` and WebSocket `/api/v1/ws/jobs` +- Rollback via POST `/api/v1/jobs/{id}/rollback` ## Data Requirements - +- **Database:** PostgreSQL +- **Operational data retention:** 30 days (host patch history, job history, health history) +- **Audit log retention:** 6 months +- **Data storage:** All data on Patch Manager host ## Compliance Requirements - +### HIPAA (Health Insurance Portability and Accountability Act) + +- **Audit Controls (§164.312(b)):** Comprehensive audit logging of all system activity (covered by audit logging requirements) +- **Access Controls (§164.312(a)(1)):** RBAC 
with group-scoped access, unique user identification, MFA enforcement +- **Integrity Controls (§164.312(c)(1)):** Audit log integrity protection (tamper-evident logging) +- **Transmission Security (§164.312(e)(1)):** mTLS for all agent communication, HTTPS for web UI, TLS 1.3 only +- **Encryption at Rest:** PostgreSQL data encryption (full-disk or column-level for sensitive fields) +- **Automatic Logoff (§164.312(a)(2)(iii)):** 1-hour inactivity session timeout + +### PCI-DSS (Payment Card Industry Data Security Standard) + +- **Requirement 6:** Vulnerability management — patch management is a core PCI-DSS requirement; the system must track and enforce timely patching +- **Requirement 7:** Restrict access to need-to-know — RBAC with group-scoped operator access +- **Requirement 8:** Identify and authenticate users — MFA required, unique IDs, session timeouts +- **Requirement 10:** Track and monitor all access — comprehensive audit logging with 6-month retention +- **Requirement 3:** Protect stored data — encryption at rest for PostgreSQL +- **Requirement 4:** Encrypt transmission — mTLS (TLS 1.3) for agent communication, HTTPS for web UI ## Constraints - +- Single bare metal/VM host running Ubuntu 24.04 +- Systemd service management +- Internal network only (no public internet exposure) +- Rust/Axum backend, React/TypeScript frontend, PostgreSQL database +- No direct permissions on managed clients +- Certificate distribution to clients is manual diff --git a/SPEC.md b/SPEC.md index d97ac34..ed16075 100644 --- a/SPEC.md +++ b/SPEC.md @@ -8,63 +8,156 @@ ## Scope - - **In Scope:** - +- Centralized dashboard for fleet-wide patch status monitoring (5 min health polling, 30 min patch polling, on-demand refresh) with visual alerts for unhealthy/unreachable agents +- Multi-distribution support (Debian/Ubuntu, RHEL/CentOS/Fedora, Alpine, Arch) +- Batch patch operations across multiple hosts +- Maintenance window scheduling (per-device, daily/weekly/monthly recurring + 
one-time) with immediate-apply override +- Compliance reporting and patch status dashboards (compliance, patch history, vulnerability exposure, audit trail — exportable as CSV and PDF) +- User management with RBAC +- Secure mTLS communication with Linux Patch API agents +- Real-time job status via WebSocket relay +- Host registration (manual FQDN/IP + on-demand CIDR auto-discover) +- Static group-based device organization with group-scoped operator access +- Email notifications (optional, disabled by default) **Out of Scope:** - +- Configuration management (Ansible/Puppet/Chef territory) +- OS provisioning, imaging, or bootstrapping +- Vulnerability scanning (manager consumes CVE data from agents, does not scan) +- Mobile UI / native apps +- Automated certificate distribution to agents +- Agent installation/management (separate concern) +- Webhook/Slack/other external notification integrations +- Multi-instance clustering / automatic horizontal scaling ## Objectives - - -**Primary Objective:** - +**Primary Objective:** Provide a centralized web interface to monitor and control patch operations across a fleet of Linux hosts via the Linux Patch API. 
**Key Goals:** - +- Fleet-wide visibility into patch status and compliance +- Zero-friction patch deployment via maintenance windows +- Secure-by-design architecture (Rust core, mTLS, MFA) +- Single-instance simplicity supporting up to 2,500 managed hosts ## Constraints - - **Deployment:** - +- Single bare metal/VM host running Ubuntu 24.04 +- Systemd service management +- Internal network access only (same network as managed agents) **Technical:** - +- Backend: Rust with Axum framework, Tokio async runtime +- Frontend: React + TypeScript SPA +- Database: PostgreSQL with SQLx for type-safe queries +- Real-time: Axum native WebSocket support for agent-to-browser relay +- Single-instance design (manual horizontal scaling by dividing clients between multiple Patch Manager hosts if needed) +- Fleet capacity: ~500 typical, up to 2,500 hosts **Security:** - +- Combination authentication: local accounts + Azure SSO +- MFA required for all users (TOTP or WebAuthn) +- Azure SSO users may use Azure's built-in MFA +- mTLS for all agent communication +- HTTPS for web UI +- Role-based access control: + - **Admin**: Full access to manage all aspects of Linux Patch Manager + - **Operator**: Can add/remove clients, manage schedules and patches only for devices in their group memberships + - Groups are static; devices and operators can belong to multiple groups + - Ungrouped devices can be managed by any operator or admin ## Architecture Overview - +Management plane web application communicating with Linux Patch API agents on each managed host. 
+ +``` +┌─────────────────────────────┐ +│ Linux Patch Manager │ ← Web UI (this project) +│ (Management Plane) │ Rust/Axum + React/TS +│ PostgreSQL + WebSocket │ +└──────────────┬──────────────┘ + │ mTLS / REST API + ┌──────┼──────┐ + ▼ ▼ ▼ + ┌──────┐┌──────┐┌──────┐ + │ Host ││ Host ││ Host │ ← Linux Patch API agents + │ A ││ B ││ C │ (up to 2,500) + └──────┘└──────┘└──────┘ +``` ## API Integration - - **Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api) +- All managed device access uses the Linux Patch API +- mTLS certificate-based authentication to agents +- Hybrid sync/async operation model (sync for queries, async jobs for patch operations) +- WebSocket streaming for real-time job status from agents +- Base path: `/api/v1/`, Port: 12443, TLS 1.3 only + +## Certificate Management + +- Internal CA managed by Patch Manager, installed on the same host +- Patch Manager issues and renews client certificates for mTLS communication +- Certificate distribution to managed target clients is manual (server administrators responsible) +- Patch Manager has no direct permissions on managed clients ## User Interface - +### Pages/Views + +1. **Dashboard** — Fleet overview: patch compliance %, host health summary, pending patches, upcoming maintenance windows. Includes root CA certificate download icon. +2. **Hosts** — List of all managed hosts with filtering by group, health status, OS, patch status +3. **Host Detail** — Single host view: system info, installed packages, available patches, job history, maintenance window config. Includes host-specific mTLS certificate download icon. +4. **Patch Deployment** — Select hosts → review available patches → deploy (queue for window or apply now) +5. **Jobs** — Real-time job monitoring with WebSocket status updates +6. **Maintenance Windows** — Create/edit recurring and one-time windows per device +7. **Groups** — Manage static groups, assign hosts and operators +8. 
**Reports** — Generate and export compliance, patch history, vulnerability, audit reports (CSV and PDF) +9. **Users** — Manage local accounts, MFA setup, group assignments +10. **Certificates** — View/manage internal CA, issue/renew client certs +11. **Settings** — System configuration, Azure SSO setup, polling intervals ## Error Handling - +**Agent Communication Failures:** +- Mark host as unhealthy in dashboard +- Retry with exponential backoff (3 retries, max 30 minutes between retries) +- Continue processing other hosts without blocking + +**Patch Job Failures:** +- Auto-retry failed patch jobs once if still within the maintenance window +- If retry fails or window has closed, surface failure prominently to operators + +**Batch Operations with Partial Failures:** +- Auto-retry failed hosts once +- If retry fails, report which hosts failed and let operator decide next steps +- Successful hosts proceed normally regardless of failures ## Assumptions - +- Patch Manager host has network connectivity to all managed agents +- Linux Patch API agent is installed and running on each managed host +- Server administrators manually distribute mTLS and root certificates to managed clients +- PostgreSQL is available on the Patch Manager host ## Dependencies - +- Linux Patch API (upstream agent on each managed host) +- PostgreSQL +- Internal CA for mTLS certificates +- Azure AD (optional, for SSO) ## Audit Logging - +**Captured Events:** +- All user login/logout events (success and failure) +- All patch operations (who triggered, which hosts, what patches, queue vs immediate) +- All host registration/removal events +- All group membership changes (hosts and users) +- All certificate operations (issue, renew, download) +- All maintenance window changes +- All configuration changes + +**Retention:** 6 months
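
Reviewer sketch (not part of the diff): the ARCHITECTURE.md portion of this change describes the audit log as tamper-evident via hash chaining but does not specify a scheme. One way it could work for the captured events listed above is sketched below. Python is used only to keep the example within the standard library; the backend itself is Rust. The field names, the all-zero genesis value, and the JSON canonicalization are assumptions made for illustration, not details taken from the design.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical starting value used as prev_hash for the very first entry.
GENESIS_HASH = "0" * 64


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str   # ISO 8601 string supplied by the caller
    actor: str       # user who performed the action
    action: str      # e.g. "login", "patch_apply", "host_remove"
    details: dict    # event-specific data (hosts, patches, ...)
    prev_hash: str   # entry_hash of the previous entry in the chain
    entry_hash: str  # SHA-256 over this entry's fields plus prev_hash


def _digest(timestamp, actor, action, details, prev_hash):
    """SHA-256 over a canonical JSON encoding of the entry and prev_hash."""
    payload = json.dumps(
        {"timestamp": timestamp, "actor": actor, "action": action,
         "details": details, "prev_hash": prev_hash},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log, timestamp, actor, action, details):
    """Append a new entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1].entry_hash if log else GENESIS_HASH
    entry = AuditEntry(timestamp, actor, action, details, prev_hash,
                       _digest(timestamp, actor, action, details, prev_hash))
    log.append(entry)
    return entry


def first_invalid_index(log):
    """Return the index of the first entry that fails verification, or None."""
    prev_hash = GENESIS_HASH
    for i, e in enumerate(log):
        expected = _digest(e.timestamp, e.actor, e.action, e.details, prev_hash)
        if e.prev_hash != prev_hash or e.entry_hash != expected:
            return i
        prev_hash = e.entry_hash
    return None
```

Editing or deleting a stored entry changes a hash that later entries depend on, so `first_invalid_index` reports the first position where the chain breaks. Someone able to rewrite the whole table could recompute every hash, so a real deployment would presumably also need to anchor or sign the latest hash somewhere outside the database.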