
Complete SDD specification documents

- SPEC.md: Full project specification including scope, objectives, constraints,
  architecture overview, API integration, certificate management, UI structure,
  error handling, audit logging, and out-of-scope items

- REQUIREMENTS.md: Functional requirements (host mgmt, patch monitoring,
  deployment, scheduling, reporting, user mgmt), non-functional requirements
  (security, performance, scalability, reliability, usability), interface
  requirements, data requirements, HIPAA/PCI-DSS compliance

- ARCHITECTURE.md: Architecture decisions, system architecture diagram,
  component design (Axum web server, background worker, PostgreSQL, React SPA,
  internal CA), data flows, technology stack, security architecture,
  deployment architecture, integration points, monitoring
2026-04-23 14:40:33 +00:00
parent 602583b624
commit f6540133c2
4 changed files with 516 additions and 42 deletions


@ -7,42 +7,326 @@
## Architecture Decisions
<!-- Document key architectural decisions and rationale -->
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Backend language/framework | Rust with Axum | Security-aligned with linux_patch_api, memory-safe, high async performance |
| Frontend framework | React + TypeScript SPA | Rich ecosystem for enterprise dashboards, strong typing |
| Database | PostgreSQL with SQLx | Enterprise-grade, type-safe Rust queries, handles concurrent access |
| Async runtime | Tokio | Standard Rust async runtime, integrates with Axum |
| Deployment model | Single bare metal/VM | Simplicity, supports up to 2,500 managed hosts |
| Frontend serving | Axum serves static files | Simplest deployment, single process |
| Background processing | Separate worker process | Clean separation of concerns, communicates via PostgreSQL |
| Session management | JWT + refresh tokens | Short-lived access tokens (15 min), revocable refresh tokens (1 hr) |
| Encryption at rest | LUKS full-disk (infrastructure) | HIPAA/PCI-DSS compliant, handled at infrastructure level |
| Certificate management | Internal CA on Patch Manager host | Issues/renews mTLS certs, manual distribution to clients |
## System Architecture
<!-- High-level system architecture diagram and description -->
```
┌──────────────────────────────────────────────────────────────┐
│                   Linux Patch Manager Host                   │
│                        (Ubuntu 24.04)                        │
│                                                              │
│  ┌─────────────────────┐    ┌──────────────────────────────┐ │
│  │   Axum Web Server   │    │      Background Worker       │ │
│  │                     │    │                              │ │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │ │
│  │  │ REST API      │  │    │  │ Health Poller          │  │ │
│  │  │ (CRUD, auth)  │  │    │  │ (5 min intervals)      │  │ │
│  │  └───────────────┘  │    │  └────────────────────────┘  │ │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │ │
│  │  │ WebSocket     │  │    │  │ Patch Data Poller      │  │ │
│  │  │ Relay         │  │    │  │ (30 min intervals)     │  │ │
│  │  └───────────────┘  │    │  └────────────────────────┘  │ │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │ │
│  │  │ Static Files  │  │    │  │ Job Scheduler          │  │ │
│  │  │ (React SPA)   │  │    │  │ (maintenance windows)  │  │ │
│  │  └───────────────┘  │    │  └────────────────────────┘  │ │
│  │  ┌───────────────┐  │    │  ┌────────────────────────┐  │ │
│  │  │ mTLS Client   │  │    │  │ Retry Engine           │  │ │
│  │  │ (agent comm)  │◄─┼────┼─►│ (exp. backoff)         │  │ │
│  │  └───────────────┘  │    │  └────────────────────────┘  │ │
│  └──────────┬──────────┘    │  ┌────────────────────────┐  │ │
│             │               │  │ Email Notifier         │  │ │
│             │               │  │ (optional/disabled)    │  │ │
│             │               │  └────────────────────────┘  │ │
│             │               └──────────────┬───────────────┘ │
│             │                              │                 │
│             │          ┌───────────────────┘                 │
│             │          │                                     │
│  ┌──────────▼──────────▼───────────────────────────────────┐ │
│  │                       PostgreSQL                        │ │
│  │  (hosts, groups, users, jobs, schedules, audit, etc.)   │ │
│  └─────────────────────────────────────────────────────────┘ │
│                                                              │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │                Internal CA (mTLS certs)                 │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
                               │ mTLS / REST API (port 12443)
                        ┌──────┼──────┐
                        ▼      ▼      ▼
                    ┌──────┐┌──────┐┌──────┐
                    │ Host ││ Host ││ Host │  ← Linux Patch API agents
                    │  A   ││  B   ││  C   │    (up to 2,500)
                    └──────┘└──────┘└──────┘
```
## Component Design
<!-- Detailed component design and interactions -->
### 1. Axum Web Server
**Responsibility:** Handle all HTTP/HTTPS requests from browsers and serve the React SPA.
- **REST API:** CRUD operations for hosts, groups, users, schedules, certificates, reports
- **WebSocket Relay:** Proxy real-time job status from agent WebSocket streams to browser clients
- **Static File Server:** Serve compiled React SPA (HTML, JS, CSS, assets)
- **Authentication:** JWT access token validation, refresh token management, MFA enforcement
- **Authorization:** RBAC middleware enforcing admin/operator/group-scoped access
- **mTLS Client:** HTTP client with client certificates for communicating with Linux Patch API agents
**API Versioning:** URL path versioning (`/api/v1/`) to match the upstream Linux Patch API convention.
### 2. Background Worker
**Responsibility:** All scheduled and asynchronous background processing.
- **Health Poller:** Periodic health checks to all registered agents (5-minute intervals)
- **Patch Data Poller:** Periodic patch availability queries to all agents (30-minute intervals)
- **Job Scheduler:** Execute queued patch operations when maintenance windows open
- **Retry Engine:** Handle agent communication failures with exponential backoff (3 retries, max 30 min)
- **Job Executor:** Trigger patch operations on agents, track async job status
- **Email Notifier:** Optional email notifications (disabled by default)
- **Data Pruner:** Clean up operational data older than 30 days, audit logs older than 6 months
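
The Retry Engine's backoff policy can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the 60-second base delay is a hypothetical value (the document does not state one), and reading "max 30 min" as a cap on each individual delay is an assumption.

```rust
use std::time::Duration;

/// Retry limit taken from the component description (3 retries).
const MAX_RETRIES: u32 = 3;
/// Assumption: "max 30 min" is treated as a cap on any single delay.
const MAX_DELAY: Duration = Duration::from_secs(30 * 60);
/// Hypothetical base delay; the document does not specify one.
const BASE_DELAY: Duration = Duration::from_secs(60);

/// Returns the delay to wait before retry number `attempt` (0-based),
/// or `None` once the retry budget is exhausted.
pub fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    // Double the base delay on every attempt: 1 min, 2 min, 4 min.
    let factor = 2u32.checked_pow(attempt)?;
    let delay = BASE_DELAY.checked_mul(factor)?;
    Some(delay.min(MAX_DELAY))
}
```

A caller would sleep for the returned duration between attempts and mark the agent as unreachable once `None` is returned.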
**Communication:** The worker reads the job queue from PostgreSQL and writes results back to it. The web server then reads those results from PostgreSQL to build API responses.
### 3. PostgreSQL Database
**Responsibility:** Persistent storage for all application data.
**Key Tables:**
- `hosts` — registered hosts, metadata, health status, last seen
- `groups` — static groups for access control
- `host_groups` — many-to-many host ↔ group membership
- `users` — local accounts with hashed passwords, MFA secrets
- `user_groups` — many-to-many user ↔ group membership
- `refresh_tokens` — server-side refresh tokens for session management
- `maintenance_windows` — per-device recurring and one-time schedules
- `patch_jobs` — queued, running, completed, failed patch operations
- `patch_job_hosts` — per-host status within a batch job
- `host_patch_data` — cached patch availability data from agents
- `host_health_data` — cached health check results
- `certificates` — issued mTLS client certificates
- `audit_log` — tamper-evident audit trail
- `azure_sso_config` — Azure AD SSO configuration
**Data Retention:**
- Operational data (health, patches, jobs): 30 days
- Audit logs: 6 months
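
As an illustration of how the Data Pruner might apply these retention periods, the sketch below decides whether a record is old enough to delete. The names `DataClass` and `is_expired` are hypothetical, and "6 months" is approximated as 183 days, which is an assumption rather than something the document specifies.

```rust
use std::time::{Duration, SystemTime};

const DAY: u64 = 24 * 60 * 60;
/// Operational data (health, patches, jobs): 30 days.
pub const OPERATIONAL_RETENTION: Duration = Duration::from_secs(30 * DAY);
/// Audit logs: 6 months, approximated here as 183 days (assumption).
pub const AUDIT_RETENTION: Duration = Duration::from_secs(183 * DAY);

/// Hypothetical classification of stored records.
pub enum DataClass {
    Operational,
    Audit,
}

/// Returns true when a record created at `created_at` is past its retention period.
pub fn is_expired(class: DataClass, created_at: SystemTime, now: SystemTime) -> bool {
    let retention = match class {
        DataClass::Operational => OPERATIONAL_RETENTION,
        DataClass::Audit => AUDIT_RETENTION,
    };
    match now.duration_since(created_at) {
        Ok(age) => age > retention,
        // A timestamp in the future (clock skew) is never pruned.
        Err(_) => false,
    }
}
```

In practice the same cutoff would more likely be computed once and passed to a `DELETE ... WHERE created_at < $1` query than evaluated row by row.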
### 4. React + TypeScript SPA
**Responsibility:** User-facing web interface.
**Pages:**
1. Dashboard — fleet overview, compliance %, health summary, upcoming windows, root CA download
2. Hosts — filterable host list by group, status, OS
3. Host Detail — system info, packages, patches, jobs, maintenance window config, host cert download
4. Patch Deployment — select hosts, review patches, deploy (queue or immediate)
5. Jobs — real-time job monitoring with WebSocket updates
6. Maintenance Windows — per-device recurring/one-time schedule management
7. Groups — manage static groups, assign hosts and operators
8. Reports — generate/export compliance, patch history, vulnerability, audit (CSV/PDF)
9. Users — local account management, MFA setup, group assignments
10. Certificates — view/manage internal CA, issue/renew client certs
11. Settings — system config, Azure SSO, polling intervals
### 5. Internal CA
**Responsibility:** mTLS certificate management for agent communication.
- Runs on the same Patch Manager host
- Issues client certificates for mTLS communication with agents
- Manages certificate renewal
- Root CA certificate downloadable from dashboard for manual distribution
- Host-specific mTLS certificates downloadable from host detail page
- No automated distribution to clients — server administrators handle this manually
## Data Flow
<!-- Data flow between components -->
### Host Registration Flow
```
1. Admin enters FQDN/IP → Axum validates & resolves FQDN
2. Axum stores host in PostgreSQL
3. Worker picks up new host → initial health check via mTLS
4. Health result stored in PostgreSQL → visible in dashboard
```
### Auto-Discovery Flow
```
1. Admin triggers CIDR scan → Axum queues a scan request for the Worker (via PostgreSQL)
2. Worker scans subnet for agents on port 12443
3. Discovered agents reported back → Admin selects which to register
4. Selected hosts stored in PostgreSQL
```
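
Step 2 of the auto-discovery flow needs the list of addresses to probe. A minimal sketch of that expansion, using only the standard library, is shown below. The function name `scan_targets` and the refusal of ranges larger than /16 are assumptions made for illustration; only IPv4 is handled, and the actual probing over mTLS is out of scope here.

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// Agent port from the system architecture diagram.
pub const AGENT_PORT: u16 = 12443;

/// Expands an IPv4 CIDR such as "10.0.0.0/30" into agent socket addresses.
/// Returns `None` for malformed input or ranges larger than /16
/// (the /16 limit is a hypothetical safety guard).
pub fn scan_targets(cidr: &str) -> Option<Vec<SocketAddrV4>> {
    let (addr, prefix) = cidr.split_once('/')?;
    let base: u32 = addr.parse::<Ipv4Addr>().ok()?.into();
    let prefix: u32 = prefix.parse().ok()?;
    if !(16..=32).contains(&prefix) {
        return None;
    }
    let mask = u32::MAX << (32 - prefix);
    let network = base & mask;
    let broadcast = network | !mask;
    // /31 and /32 have no separate network/broadcast addresses to skip.
    let (first, last) = if prefix >= 31 {
        (network, broadcast)
    } else {
        (network + 1, broadcast - 1)
    };
    Some(
        (first..=last)
            .map(|ip| SocketAddrV4::new(Ipv4Addr::from(ip), AGENT_PORT))
            .collect(),
    )
}
```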
### Patch Deployment Flow (Queued)
```
1. Operator selects hosts + patches → chooses "Queue for next window"
2. Axum creates patch job in PostgreSQL (status: queued)
3. When maintenance window opens → Worker triggers patch operations on agents
4. Worker monitors async job status via agent API
5. Results stored in PostgreSQL → WebSocket relay pushes updates to browser
6. Failed jobs auto-retried once if still within window
```
### Patch Deployment Flow (Immediate)
```
1. Operator selects hosts + patches → chooses "Apply Now"
2. Axum creates patch job in PostgreSQL (status: pending)
3. Worker immediately triggers patch operations on agents
4. Same monitoring and retry logic as queued flow
```
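
The two deployment flows imply a small job state machine. The sketch below is one possible reading of it: the status names come from the flows and the `patch_jobs` table description, while the exact set of allowed transitions (including `Failed → Running` for the single automatic retry) is an assumption.

```rust
/// Job statuses named in the deployment flows and the `patch_jobs` table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Queued,  // waiting for the next maintenance window
    Pending, // "Apply Now", waiting for the worker to pick it up
    Running,
    Completed,
    Failed,
}

/// Assumed transition table; not confirmed by the source.
pub fn can_transition(from: JobStatus, to: JobStatus) -> bool {
    use JobStatus::*;
    matches!(
        (from, to),
        (Queued, Running)
            | (Pending, Running)
            | (Running, Completed)
            | (Running, Failed)
            | (Failed, Running) // the automatic retry
    )
}

/// "Failed jobs auto-retried once if still within window."
pub fn should_auto_retry(status: JobStatus, retries_so_far: u32, window_open: bool) -> bool {
    status == JobStatus::Failed && retries_so_far == 0 && window_open
}
```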
### Health/Patch Polling Flow
```
1. Worker polls each agent on schedule (5 min health, 30 min patches)
2. Results cached in PostgreSQL
3. Unhealthy agents marked with visual alerts in dashboard
4. On-demand refresh: operator clicks refresh → Worker queries agent immediately
```
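
A minimal sketch of how the worker could decide which polls are due for a given agent is shown below. `PollState`, `DuePolls`, and `due_polls` are hypothetical names; the intervals are the ones stated in the flow. An agent that has never been polled is treated as due immediately, which also covers the initial health check after registration.

```rust
use std::time::{Duration, Instant};

pub const HEALTH_INTERVAL: Duration = Duration::from_secs(5 * 60);
pub const PATCH_INTERVAL: Duration = Duration::from_secs(30 * 60);

/// Hypothetical per-agent bookkeeping of the last successful polls.
#[derive(Default)]
pub struct PollState {
    pub last_health: Option<Instant>,
    pub last_patch: Option<Instant>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct DuePolls {
    pub health: bool,
    pub patches: bool,
}

fn is_due(last: Option<Instant>, interval: Duration, now: Instant) -> bool {
    match last {
        None => true, // never polled yet
        Some(t) => now.saturating_duration_since(t) >= interval,
    }
}

/// Reports which polls should run for this agent at time `now`.
pub fn due_polls(state: &PollState, now: Instant) -> DuePolls {
    DuePolls {
        health: is_due(state.last_health, HEALTH_INTERVAL, now),
        patches: is_due(state.last_patch, PATCH_INTERVAL, now),
    }
}
```

An on-demand refresh could be modeled by clearing the relevant `last_*` field so the next scheduler pass picks the agent up immediately.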
## Technology Stack
<!-- Technology choices and rationale -->
| Layer | Technology | Version/Notes |
|-------|-----------|---------------|
| Backend | Rust + Axum | Tokio async runtime, Tower middleware |
| Database | PostgreSQL | SQLx for type-safe queries, migrations via sqlx-cli |
| Frontend | React + TypeScript | Vite build tooling |
| UI Components | MUI (Material UI) | Enterprise dashboard components, dark mode, theming |
| WebSocket | Axum native WebSocket | Agent → Manager → Browser relay |
| Auth (Local) | Argon2 password hashing + TOTP/WebAuthn | MFA enforcement |
| Auth (SSO) | OAuth2/OIDC via Azure AD | Optional, with Azure MFA |
| Session | JWT (access) + PostgreSQL (refresh) | 15 min access, 1 hr refresh |
| mTLS Client | Rustls + client certs | TLS 1.3 only |
| Internal CA | Rustls/rcgen | Certificate issuance and renewal |
| Email | Lettre (Rust email crate) | Optional, disabled by default |
| PDF Export | Rust PDF generation crate | Compliance and audit reports |
| CSV Export | Rust CSV crate | Data export for all report types |
| Service Management | systemd | Ubuntu 24.04 |
| Static Files | Axum built-in static file serving | React SPA served directly |
## Security Architecture
<!-- Security design including authentication, authorization, encryption -->
### Authentication
- **Local accounts:** Argon2-hashed passwords + TOTP or WebAuthn for MFA
- **Azure SSO:** OAuth2/OIDC flow with Azure AD, using Azure's built-in MFA
- **Session tokens:** Short-lived JWT (15 min) for API access, server-side refresh tokens (1 hr inactivity timeout)
- **Refresh token revocation:** Stored in PostgreSQL, can be immediately revoked for forced logout
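
The session rules above can be sketched as two small checks. This is an illustrative model only: the `RefreshToken` struct and its fields are hypothetical stand-ins for a row in the `refresh_tokens` table, and JWT signing and validation are deliberately left out.

```rust
use std::time::{Duration, SystemTime};

/// Access token lifetime (15 min).
pub const ACCESS_TTL: Duration = Duration::from_secs(15 * 60);
/// Refresh token inactivity timeout (1 hr).
pub const REFRESH_IDLE_TIMEOUT: Duration = Duration::from_secs(60 * 60);

/// Hypothetical view of a server-side refresh token record.
pub struct RefreshToken {
    pub last_used: SystemTime,
    pub revoked: bool,
}

/// Expiry timestamp to place in a newly issued access token.
pub fn access_token_expiry(issued_at: SystemTime) -> SystemTime {
    issued_at + ACCESS_TTL
}

/// A refresh token is usable if it has not been revoked and has been
/// used within the inactivity timeout.
pub fn refresh_token_usable(token: &RefreshToken, now: SystemTime) -> bool {
    if token.revoked {
        return false;
    }
    match now.duration_since(token.last_used) {
        Ok(idle) => idle <= REFRESH_IDLE_TIMEOUT,
        // `last_used` in the future indicates bad data; reject it.
        Err(_) => false,
    }
}
```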
### Authorization (RBAC)
- **Admin:** Full access to all resources and settings
- **Operator:** Can add/remove clients, manage schedules and patches only for devices in their group memberships
- **Group scoping:** Operators can only interact with hosts in their assigned groups
- **Ungrouped hosts:** Accessible by any operator or admin
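
The group-scoping rules above reduce to a short predicate. The sketch below assumes hypothetical `User` and `Host` types with numeric group IDs; in the real system this check would sit in the RBAC middleware and draw memberships from `user_groups` and `host_groups`.

```rust
use std::collections::HashSet;

pub enum Role {
    Admin,
    Operator,
}

/// Hypothetical in-memory view of a user and their group memberships.
pub struct User {
    pub role: Role,
    pub groups: HashSet<u32>,
}

/// Hypothetical in-memory view of a host and the groups it belongs to.
pub struct Host {
    pub groups: HashSet<u32>,
}

/// Admins see everything; operators see ungrouped hosts and hosts that
/// share at least one group with them.
pub fn can_access_host(user: &User, host: &Host) -> bool {
    match user.role {
        Role::Admin => true,
        Role::Operator => host.groups.is_empty() || !user.groups.is_disjoint(&host.groups),
    }
}
```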
### Agent Communication
- **mTLS:** Client certificate authentication for all agent communication
- **TLS 1.3 only:** No older TLS versions
- **Internal CA:** Patch Manager manages CA, issues and renews client certificates
- **Manual distribution:** Server administrators manually install certs on managed clients
### Data Protection
- **Encryption at rest:** LUKS full-disk encryption (infrastructure-managed)
- **Encryption in transit:** TLS 1.3 for all connections (agent and web UI)
- **Audit log integrity:** Tamper-evident logging (hash chaining)
- **Password storage:** Argon2 with salt
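
To illustrate the hash-chaining idea behind the tamper-evident audit log, the sketch below links each entry to the hash of the previous one. It is intentionally simplified: `std`'s `DefaultHasher` is used only to keep the example dependency-free and is **not** cryptographically secure, so a real implementation would need a cryptographic hash such as SHA-256. The `AuditEntry` layout is likewise an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical audit record; the real `audit_log` schema is not shown in the source.
#[derive(Debug, Clone)]
pub struct AuditEntry {
    pub message: String,
    pub prev_hash: u64,
    pub hash: u64,
}

/// Hash of (previous hash, message). DefaultHasher is a non-cryptographic
/// stand-in used for illustration only.
fn entry_hash(prev_hash: u64, message: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev_hash.hash(&mut h);
    message.hash(&mut h);
    h.finish()
}

/// Appends an entry whose hash covers the previous entry's hash.
pub fn append(log: &mut Vec<AuditEntry>, message: &str) {
    let prev_hash = log.last().map_or(0, |e| e.hash);
    let hash = entry_hash(prev_hash, message);
    log.push(AuditEntry { message: message.to_string(), prev_hash, hash });
}

/// Recomputes the chain; any edited, removed, or reordered entry breaks it.
pub fn verify(log: &[AuditEntry]) -> bool {
    let mut expected_prev = 0u64;
    for e in log {
        if e.prev_hash != expected_prev || e.hash != entry_hash(e.prev_hash, &e.message) {
            return false;
        }
        expected_prev = e.hash;
    }
    true
}
```

Because every hash depends on its predecessor, changing one historical entry invalidates all later entries unless the whole tail of the chain is recomputed.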
### Compliance
- **HIPAA:** Audit controls, access controls, integrity controls, transmission security, automatic logoff
- **PCI-DSS:** Vulnerability management (core function), access restrictions, user identification, audit tracking, data protection
## Deployment Architecture
<!-- How the system is deployed and configured -->
```
┌─────────────────────────────────────────┐
│ Patch Manager Host (Ubuntu 24.04)       │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ systemd: patch-manager-web          │ │
│ │ (Axum web server + static files)    │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ systemd: patch-manager-worker       │ │
│ │ (Background polling + jobs)         │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ PostgreSQL                          │ │
│ │ (Database)                          │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ Internal CA                         │ │
│ │ (Certificate management)            │ │
│ └─────────────────────────────────────┘ │
│                                         │
│ ┌─────────────────────────────────────┐ │
│ │ LUKS (Full-disk encryption)         │ │
│ │ (Infrastructure-managed)            │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
- Two systemd services: `patch-manager-web` and `patch-manager-worker`
- PostgreSQL runs on the same host
- Internal CA runs on the same host
- LUKS full-disk encryption managed by infrastructure
- No Docker/LXC — bare metal/VM deployment
- Internal network only — no public internet exposure
## Scalability
<!-- How the system scales horizontally and vertically -->
- **Single-instance design:** Sized for a typical fleet of 500 hosts, with a ceiling of 2,500
- **Manual horizontal scaling:** Divide clients between multiple Patch Manager hosts if needed
- **Concurrency:** Axum on Tokio can handle thousands of concurrent connections
- **Background worker:** Polling and job processing run in a separate process and scale independently of web serving
- **Database:** A single PostgreSQL instance on the same host is expected to handle this workload
- **No automatic clustering or load balancing required**
## Integration Points
<!-- External system integrations, especially Linux Patch API -->
**Upstream Dependency:** [Linux Patch API](https://gitea.moon-dragon.us/echo/linux_patch_api)
| Integration | Protocol | Direction | Purpose |
|-------------|----------|-----------|----------|
| Agent REST API | HTTPS/mTLS (TLS 1.3) | Manager → Agent | Queries, patch operations |
| Agent WebSocket | WSS/mTLS | Agent → Manager | Real-time job status streaming |
| Azure AD | HTTPS/OAuth2 | Manager → Azure | SSO authentication (optional) |
**API Endpoints Used:**
- `GET /api/v1/health` — Agent health checks
- `GET /api/v1/system/info` — Host system information
- `GET /api/v1/packages` — List installed packages
- `GET /api/v1/patches` — List available patches
- `POST /api/v1/patches/apply` — Apply patches
- `PUT /api/v1/packages/{name}` — Update specific package
- `DELETE /api/v1/packages/{name}` — Remove package
- `POST /api/v1/packages` — Install packages
- `GET /api/v1/jobs` — List jobs
- `GET /api/v1/jobs/{id}` — Get job status
- `POST /api/v1/jobs/{id}/rollback` — Rollback a job
- `POST /api/v1/system/reboot` — Reboot host
- `WebSocket /api/v1/ws/jobs` — Real-time job status
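
One way the manager could keep these agent routes in a single place is a small enum that maps each call to its HTTP method and path. The sketch covers only a subset of the endpoints, and the names `AgentCall`, `method_and_path`, and `agent_url` are hypothetical. It builds the request target only; sending it over mTLS is not shown.

```rust
/// Agent port from the system architecture diagram.
pub const AGENT_PORT: u16 = 12443;

/// A subset of the agent endpoints listed above.
pub enum AgentCall<'a> {
    Health,
    SystemInfo,
    ListPatches,
    ApplyPatches,
    JobStatus(&'a str),
    RollbackJob(&'a str),
}

/// Maps a call to its HTTP method and URL path.
pub fn method_and_path(call: &AgentCall<'_>) -> (&'static str, String) {
    match call {
        AgentCall::Health => ("GET", "/api/v1/health".to_string()),
        AgentCall::SystemInfo => ("GET", "/api/v1/system/info".to_string()),
        AgentCall::ListPatches => ("GET", "/api/v1/patches".to_string()),
        AgentCall::ApplyPatches => ("POST", "/api/v1/patches/apply".to_string()),
        AgentCall::JobStatus(id) => ("GET", format!("/api/v1/jobs/{}", id)),
        AgentCall::RollbackJob(id) => ("POST", format!("/api/v1/jobs/{}/rollback", id)),
    }
}

/// Builds the full HTTPS URL for a call against a given agent host.
pub fn agent_url(host: &str, call: &AgentCall<'_>) -> String {
    let (_, path) = method_and_path(call);
    format!("https://{}:{}{}", host, AGENT_PORT, path)
}
```

For example, `agent_url("host-a.example.internal", &AgentCall::Health)` yields `https://host-a.example.internal:12443/api/v1/health`; the hostname here is a placeholder.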
## Monitoring and Observability
<!-- Logging, metrics, tracing strategy -->
- **Application logging:** Structured JSON logging (tracing crate)
- **Log levels:** Configurable at runtime (DEBUG, INFO, WARN, ERROR)
- **Health endpoint:** `GET /api/v1/health` on the Patch Manager's own API for infrastructure monitoring
- **Dashboard alerts:** Visual indicators for unhealthy/unreachable agents (red/yellow status)
- **Audit logging:** All significant events logged to PostgreSQL with tamper-evident hash chaining
- **No external monitoring integration required** (dashboard-only alerts)