# Linux_Patch_API - Architecture Document

## System Overview

The Linux_Patch_API is a secure, single-host API service that enables remote package and patch management on Linux systems. Each instance runs as a system service on the managed host (systemd on most distributions, OpenRC on Alpine), providing a REST API over mTLS with strict IP whitelist enforcement.

- **Architecture Type:** Agent Per Host (Option B)
- **Deployment:** One instance per managed Linux host
- **Network:** Internal network only (no internet exposure)

---

## Component Architecture

### Core Components

1. **API Layer (Actix-web/Axum)**
   - HTTP/HTTPS endpoint handling
   - mTLS termination
   - IP whitelist enforcement
   - Request routing
   - WebSocket support for real-time job status

2. **Authentication Layer**
   - Certificate validation (mTLS)
   - Client identity extraction from certificate
   - No session management (stateless, cert-based auth only)

3. **Authorization Layer**
   - IP whitelist checking (deny by default)
   - No permission validation (whitelisted IP + valid cert = full access)

4. **Job Manager**
   - Async job queue for long-running operations
   - Job status tracking in file-based storage (cleared on service restart)
   - WebSocket broadcast for real-time status updates
   - 30-minute timeout enforcement
   - Job cleanup and expiration

5. **Package Manager Backend (Pluggable)**
   - apt/dpkg adapter (Debian/Ubuntu - primary)
   - dnf/yum adapter (RHEL/CentOS/Fedora)
   - apk adapter (Alpine)
   - pacman adapter (Arch)
   - Distribution detection and adapter selection

6. **Audit Logger**
   - System logging integration (primary)
     - systemd journal on systemd-based systems
     - syslog/local files on OpenRC-based systems
   - Local file fallback (`/var/log/linux_patch_api/`)
   - 30-day retention with daily rotation and gzip compression

7. **Configuration Manager**
   - YAML config file watcher (`/etc/linux_patch_api/config.yaml`)
   - Auto-reload on file change
   - Config validation before reload (prevents service downtime)
   - Runtime settings access for all components
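
The document does not say how distribution detection works. The following is a minimal sketch, assuming the service reads the `ID` and `ID_LIKE` fields of `/etc/os-release` and maps them to one of the four adapter families listed under the Package Manager Backend. The `Backend` enum and the function names are hypothetical, chosen only for illustration.

```rust
/// Package manager families named in the component list (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Apt,
    Dnf,
    Apk,
    Pacman,
}

/// Extract the value of `key` from os-release style `KEY=value` text,
/// stripping optional surrounding quotes.
fn os_release_field<'a>(content: &'a str, key: &str) -> Option<&'a str> {
    content.lines().find_map(|line| {
        let (k, v) = line.split_once('=')?;
        if k.trim() == key {
            Some(v.trim().trim_matches('"').trim_matches('\''))
        } else {
            None
        }
    })
}

/// Pick a backend from the `ID` field, falling back to each entry of `ID_LIKE`.
/// Returns `None` for distributions without a matching adapter.
pub fn detect_backend(os_release: &str) -> Option<Backend> {
    let id = os_release_field(os_release, "ID").unwrap_or("");
    let id_like = os_release_field(os_release, "ID_LIKE").unwrap_or("");
    // Check the exact ID first, then each space-separated ID_LIKE entry.
    std::iter::once(id)
        .chain(id_like.split_whitespace())
        .find_map(|name| match name {
            "debian" | "ubuntu" => Some(Backend::Apt),
            "rhel" | "centos" | "fedora" => Some(Backend::Dnf),
            "alpine" => Some(Backend::Apk),
            "arch" => Some(Backend::Pacman),
            _ => None,
        })
}
```

A caller would pass the contents of `/etc/os-release` (for example via `std::fs::read_to_string`) and refuse to start when `detect_backend` returns `None`, which keeps the choice of adapter explicit rather than guessed.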

### External Integrations

- **Package Managers:** apt, dnf, yum, apk, pacman (via system commands)
- **Init System:** Service management and logging
  - systemd (Debian, Ubuntu, RHEL, CentOS, Fedora)
  - OpenRC (Alpine Linux)
- **Internal CA:** Certificate validation against self-hosted CA

---

## Technology Stack

### Backend
- **Language:** Rust
- **Framework:** Actix-web or Axum
- **Database:** None (file-based job storage)
- **mTLS:** Rust TLS library (rustls or native-tls)

### Infrastructure
- **Service Manager:** Distribution-dependent
  - systemd (most distributions)
  - OpenRC (Alpine Linux)
- **Configuration:** YAML

### Deployment
- **Package Format:** Native Linux packages (deb, rpm, apk, pkg.tar.zst)
- **Distribution:** Via target system package manager (apt, dnf, apk, pacman)
- **Installation:** Package installs binary, init script/service, and default config structure
  - systemd unit file for systemd distributions
  - OpenRC init script for Alpine
- **Updates:** Handled through system package manager

---

## Security Architecture

### Authentication
- mTLS certificate-based authentication (required)
- Internal self-hosted CA
- Unique client certificates (1-year validity)
- Silent drop for non-mTLS connections

### Authorization
- IP whitelist enforcement (block all by default)
- No granular permissions (binary access: allowed or denied)
- Whitelisted IP + valid cert = full API access
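
The binary access rule above can be sketched as follows. This is an illustration only: it assumes the whitelist holds exact IP addresses (the document does not say whether CIDR ranges are supported), and the `Whitelist` type and `authorize` function are hypothetical names.

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// Deny-by-default allow-list of client addresses (hypothetical type).
pub struct Whitelist {
    allowed: HashSet<IpAddr>,
}

impl Whitelist {
    /// Parse textual addresses; a single unparsable entry rejects the whole list,
    /// so a typo cannot silently widen or narrow access.
    pub fn parse(entries: &[&str]) -> Result<Self, std::net::AddrParseError> {
        let allowed = entries
            .iter()
            .map(|e| e.trim().parse::<IpAddr>())
            .collect::<Result<HashSet<_>, _>>()?;
        Ok(Self { allowed })
    }

    /// An address is allowed only if it was listed explicitly.
    pub fn is_allowed(&self, peer: IpAddr) -> bool {
        self.allowed.contains(&peer)
    }
}

/// Full access requires both a valid client certificate and a whitelisted peer.
pub fn authorize(cert_valid: bool, peer: IpAddr, whitelist: &Whitelist) -> bool {
    cert_valid && whitelist.is_allowed(peer)
}
```

An empty whitelist therefore denies every client, which matches the "block all by default" rule.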

### Process Security (Init System Hardening)
- **User:** root (required for package management)

**systemd Hardening Options:**
- `NoNewPrivileges=true` (prevent privilege escalation)
- `ProtectSystem=strict` (read-only filesystem except explicitly allowed paths)
- `ProtectHome=true` (no access to /home, /root, /run/user)
- `PrivateTmp=true` (isolated /tmp)
- `SystemCallFilter=` restricted to an allow-list of the syscalls the service requires

**OpenRC Hardening Options:**
- Run as dedicated service user
- File permission restrictions
- chroot isolation (optional)
- Equivalent security via rc.conf and init script options

### Data Security
- All communications encrypted via TLS
- Certificates stored securely with restricted permissions
- Audit logging of all operations

### Certificate Storage (Option A: Separate Files)

```
/etc/linux_patch_api/certs/
├── ca.pem       (644) - CA certificate
├── server.pem   (644) - Server certificate
└── server.key   (600) - Server private key (restricted)
```

**Rationale:**
- Tighter permissions on private key only (600)
- Easier certificate rotation (replace cert without touching key)
- Standard practice for TLS deployments
- No extraction overhead

---

## File System Layout

```
/etc/linux_patch_api/
├── config.yaml          # Main configuration
├── whitelist.yaml       # IP whitelist
└── certs/
    ├── ca.pem          # CA certificate
    ├── server.pem      # Server certificate
    └── server.key      # Server private key

/var/lib/linux_patch_api/
├── jobs/               # Job storage (cleared on restart)
└── state/              # Runtime state

/var/log/linux_patch_api/
└── audit.log           # Local audit log fallback

/usr/bin/linux-patch-api  # Binary location

# Init scripts (distribution-dependent)
/etc/systemd/system/linux-patch-api.service  # systemd
/etc/init.d/linux-patch-api                  # OpenRC (Alpine)
```

---

## Data Flow

### Synchronous Request Flow (Quick Operations):

```
Client → [mTLS Handshake] → [IP Whitelist Check] → [API Layer]
         ↓
    [Auth: Cert Valid?] → No → Silent Drop
         ↓ Yes
    [Authz: IP Allowed?] → No → Silent Drop
         ↓ Yes
    [Route to Handler] → [Execute Package Op] → [Log to Audit]
         ↓
    [Return JSON Response] → Client
```

### Asynchronous Request Flow (Long Operations):

```
Client → [mTLS + IP Check] → [API Layer] → [Create Job] → [Return Job ID]
                                           ↓
                                    [Job Manager Queue]
                                           ↓
                                    [Package Manager Backend]
                                           ↓
                                    [Update Job Status] → [WebSocket Broadcast]
                                           ↓
                                    [Job Complete/Timeout]
                                           ↓
                                    [Log to Audit]
```
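
The status updates and timeout step in the flow above can be modelled as a small state machine. This is a sketch under stated assumptions: the `JobStatus` variants, the `Job` struct, and the rule that finished jobs ignore later updates are all hypothetical; only the 30-minute limit comes from this document.

```rust
use std::time::{Duration, Instant};

/// 30-minute limit stated for the Job Manager.
pub const JOB_TIMEOUT: Duration = Duration::from_secs(30 * 60);

/// Hypothetical job states; the real service may use different names.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum JobStatus {
    Queued,
    Running { progress: u8 },
    Succeeded,
    Failed(String),
    TimedOut,
}

pub struct Job {
    pub id: u64,
    pub status: JobStatus,
    started: Instant,
}

impl Job {
    pub fn new(id: u64, now: Instant) -> Self {
        Self { id, status: JobStatus::Queued, started: now }
    }

    fn is_terminal(&self) -> bool {
        matches!(
            self.status,
            JobStatus::Succeeded | JobStatus::Failed(_) | JobStatus::TimedOut
        )
    }

    /// Apply a status update unless the job already finished.
    /// Returns true when the status changed, i.e. when a broadcast is due.
    pub fn update(&mut self, next: JobStatus) -> bool {
        if self.is_terminal() || self.status == next {
            return false;
        }
        self.status = next;
        true
    }

    /// Mark an unfinished job as timed out once `JOB_TIMEOUT` has elapsed.
    pub fn enforce_timeout(&mut self, now: Instant) -> bool {
        if !self.is_terminal() && now.duration_since(self.started) >= JOB_TIMEOUT {
            self.status = JobStatus::TimedOut;
            return true;
        }
        false
    }
}
```

Passing `now` in explicitly keeps the timeout logic testable without waiting 30 minutes. The boolean return value gives the caller a single place to decide whether a WebSocket broadcast is needed.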

### Job Status Endpoint Flow:

```
Client → [mTLS + IP Check] → [API Layer] → [GET /jobs/{id}]
                                           ↓
                                    [Query Job Storage]
                                           ↓
                                    [Return Job Status JSON]
```

### Configuration Reload Flow:

```
[Config File Changed] → [File Watcher Detects]
         ↓
    [Validate New Config] → Invalid → [Log Error, Keep Old Config]
         ↓ Valid
    [Swap Config in Memory] → [Notify Components] → [Log Reload Event]
```
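
The validate-then-swap step can be sketched with the standard library alone. The `Config` fields, the validation rules, and the `ConfigHandle` type below are assumptions for illustration; the document does not list the actual settings or how components are notified.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical subset of settings; the real config.yaml schema is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub port: u16,
    pub job_timeout_minutes: u64,
}

impl Config {
    /// Reject obviously unusable values before they can replace a working config.
    pub fn validate(&self) -> Result<(), String> {
        if self.port == 0 {
            return Err("port must be non-zero".into());
        }
        if self.job_timeout_minutes == 0 {
            return Err("job_timeout_minutes must be non-zero".into());
        }
        Ok(())
    }
}

/// Shared handle that components read from and the file watcher writes to.
pub struct ConfigHandle {
    current: RwLock<Arc<Config>>,
}

impl ConfigHandle {
    pub fn new(initial: Config) -> Result<Self, String> {
        initial.validate()?;
        Ok(Self { current: RwLock::new(Arc::new(initial)) })
    }

    /// Cheap snapshot; readers keep a consistent view even if a reload happens later.
    /// `unwrap` panics only if a writer panicked while holding the lock.
    pub fn get(&self) -> Arc<Config> {
        self.current.read().unwrap().clone()
    }

    /// Validate first; on failure the old config stays in place untouched.
    pub fn reload(&self, candidate: Config) -> Result<(), String> {
        candidate.validate()?;
        *self.current.write().unwrap() = Arc::new(candidate);
        Ok(())
    }
}
```

Because the swap replaces one `Arc` with another, a request that already took a snapshot finishes with the settings it started with, while new requests see the reloaded values.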

### Certificate Renewal Flow:

```
[Cert File Updated] → [File Watcher Detects]
         ↓
    [Validate Certificate Chain] → Invalid → [Log Error, Keep Old Certs]
         ↓ Valid
    [Reload TLS Context] → [New Connections Use New Certs] → [Log Reload Event]
```

### Rollback Execution Flow (Exclusive):

```
[Rollback Triggered] → [Set Exclusive Mode] → [Reject New Requests]
         ↓
    [Execute Rollback Operations] → [Log Each Step]
         ↓
    [Rollback Complete] → [Clear Exclusive Mode] → [Accept New Requests]
```
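
One way to model the exclusive mode is an atomic flag paired with a guard that clears it on drop. This is a sketch, not the service's actual mechanism: `ExclusiveGate` and `RollbackGuard` are hypothetical names, and the sketch does not address requests that were already in flight when the rollback started.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Tracks whether a rollback currently holds exclusive access (hypothetical type).
pub struct ExclusiveGate {
    rollback_active: AtomicBool,
}

/// Clears exclusive mode when dropped, even if the rollback returns early or panics.
pub struct RollbackGuard<'a> {
    gate: &'a ExclusiveGate,
}

impl ExclusiveGate {
    pub const fn new() -> Self {
        Self { rollback_active: AtomicBool::new(false) }
    }

    /// Enter exclusive mode; returns `None` if a rollback is already running.
    pub fn begin_rollback(&self) -> Option<RollbackGuard<'_>> {
        self.rollback_active
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| RollbackGuard { gate: self })
    }

    /// Request handlers check this before doing any work.
    pub fn accepts_requests(&self) -> bool {
        !self.rollback_active.load(Ordering::Acquire)
    }
}

impl Drop for RollbackGuard<'_> {
    fn drop(&mut self) {
        self.gate.rollback_active.store(false, Ordering::Release);
    }
}
```

Tying "Clear Exclusive Mode" to the guard's `Drop` means a failed rollback cannot leave the service permanently rejecting requests.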

### Key Behaviors:

- Failed jobs are cleared on service restart (no persistence)
- Rollback execution is exclusive - no new requests accepted until complete
- Certificate renewal follows same validation pattern as config reload
- Status endpoint available (GET /jobs/{id}) in addition to WebSocket for job monitoring

---

## API Design Principles

- Pure REST (resources as nouns, HTTP verbs for actions)
- JSON request/response with standard envelope
- Hybrid execution model (sync for quick ops, async for long ops)
- WebSocket for real-time job status streaming
- GET /jobs/{id} endpoint for job status polling

---

## Network Configuration

- **Bind Address:** 0.0.0.0 (all interfaces)
- **Port:** 12443 (HTTPS/mTLS)
- **Protocol:** TLS 1.3 only
- **Firewall:** Host-level firewall should restrict inbound to whitelisted IPs only

---

## Health Checks

### Endpoint: GET /health

**Purpose:** General service status check with package cache status

**Response (200 OK - Healthy):**
```json
{
  "success": true,
  "request_id": "uuid",
  "timestamp": "2026-05-27T14:00:00Z",
  "data": {
    "status": "healthy",
    "uptime_seconds": 12345,
    "version": "1.1.17",
    "last_cache_update": "2026-05-27T13:30:00+00:00",
    "cache_status": "fresh"
  },
  "error": null
}
```

**Response (200 OK - Degraded):**
```json
{
  "success": true,
  "request_id": "uuid",
  "timestamp": "2026-05-27T14:00:00Z",
  "data": {
    "status": "degraded",
    "uptime_seconds": 12345,
    "version": "1.1.17",
    "last_cache_update": "2026-05-27T09:00:00+00:00",
    "cache_status": "failed"
  },
  "error": null
}
```

**Health Check Criteria:**
- Service is listening on port 12443
- mTLS is configured and valid
- Config file is loaded and valid
- Package manager backend is accessible
- Package cache is fresh (refreshed within 4 hours)

**Cache Refresh on Health Check:**
- If cache is stale (>4 hours since last update), health check triggers a cache refresh
- If refresh succeeds: status="healthy", cache_status="fresh"
- If refresh fails: status="degraded", cache_status="failed"
- If cache is fresh: status="healthy", cache_status="fresh"

**Cache Status Values:**
- `fresh` - Cache was updated within the last 4 hours
- `stale` - Cache is older than 4 hours (triggers refresh)
- `unknown` - No cache update has occurred yet
- `failed` - Last cache refresh attempt failed
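
The status values and the refresh-on-health-check rules above can be sketched as two small functions. Several details are assumptions: timestamps are plain Unix seconds, a failed last attempt takes precedence over age, and any non-fresh state (not only `stale`) triggers a refresh attempt. The function names and the `CacheStatus` enum are hypothetical.

```rust
/// Stale threshold used by the service: 4 hours.
pub const STALE_THRESHOLD_SECS: u64 = 14_400;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheStatus {
    Fresh,
    Stale,
    Unknown,
    Failed,
}

/// Classify the cache from the last update time (Unix seconds) and success flag.
/// Assumption: a failed last attempt is reported as `Failed` regardless of age.
pub fn cache_status(last_update: Option<u64>, last_update_success: bool, now: u64) -> CacheStatus {
    match last_update {
        None => CacheStatus::Unknown,
        Some(_) if !last_update_success => CacheStatus::Failed,
        Some(t) if now.saturating_sub(t) > STALE_THRESHOLD_SECS => CacheStatus::Stale,
        Some(_) => CacheStatus::Fresh,
    }
}

/// Health-check step: a fresh cache is reported as-is; otherwise `refresh` runs once
/// and its outcome decides the reported service status and cache status.
pub fn health_cache_check<F>(status: CacheStatus, refresh: F) -> (&'static str, CacheStatus)
where
    F: FnOnce() -> bool,
{
    match status {
        CacheStatus::Fresh => ("healthy", CacheStatus::Fresh),
        _ => {
            if refresh() {
                ("healthy", CacheStatus::Fresh)
            } else {
                ("degraded", CacheStatus::Failed)
            }
        }
    }
}
```

Under these rules `stale` is a transient state: a health check either turns it into `fresh` or into `failed`, which is consistent with the two example responses.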

**NOT Required:**
- Metrics collection
- Alerting integration
- Prometheus/Grafana endpoints

---

## Package Cache Management

### Module: `src/packages/cache.rs`

The package cache module manages the local package index state, ensuring that package metadata is current before performing operations.

**Key Components:**
- `PackageCacheState` - Thread-safe in-memory cache state with Mutex protection
- `PackageCacheStatus` - Snapshot of cache state for reporting
- `CacheStateFile` - Persistent state format for serialization
- `is_fetch_error()` - Detects 404/fetch errors for automatic retry
- `apply_with_cache_retry()` - Generic retry wrapper for cache-related failures
- `run_command_with_timeout()` - Executes cache refresh commands with timeout

**State Persistence:**
- Cache state persists to `/var/lib/linux_patch_api/state/cache.json`
- State is loaded on service startup and saved after every update
- Persists `last_cache_update` timestamp and `last_update_success` flag
- Parent directory is auto-created if missing

**Stale Detection:**
- Cache is considered stale after 4 hours (`STALE_THRESHOLD_SECS = 14400`)
- Health check automatically refreshes stale cache
- Patch apply operations always refresh cache before proceeding (mandatory)

**Refresh-Before-Apply Flow:**
1. `POST /patches/apply` creates a job and spawns a background task
2. The background task refreshes the package cache (mandatory, not configurable)
3. If the refresh fails: the job fails immediately with an error message
4. If the refresh succeeds: job progress advances to 10% and the patches are applied
5. If apply fails with 404/fetch error: refresh cache and retry once
6. If retry also fails: job fails with error
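
Steps 2 through 6 can be sketched as a generic wrapper. The function names match those listed under Key Components, but the signatures, the `String` error type, and the substrings that `is_fetch_error` looks for are assumptions made for this illustration.

```rust
/// Heuristic: does this package-manager error look like a stale-index fetch failure?
/// The matched substrings are illustrative assumptions, not the service's actual list.
pub fn is_fetch_error(message: &str) -> bool {
    let m = message.to_ascii_lowercase();
    m.contains("404") || m.contains("failed to fetch")
}

/// Refresh the cache, run `apply`, and on a fetch error refresh and retry exactly once.
pub fn apply_with_cache_retry<T, A, R>(mut apply: A, mut refresh: R) -> Result<T, String>
where
    A: FnMut() -> Result<T, String>,
    R: FnMut() -> Result<(), String>,
{
    // Steps 2-3: the refresh is mandatory and a failure ends the job immediately.
    refresh().map_err(|e| format!("cache refresh failed: {e}"))?;

    // Steps 4-6: apply; retry once only when the error looks like a fetch failure.
    match apply() {
        Err(e) if is_fetch_error(&e) => {
            refresh().map_err(|e| format!("cache refresh failed: {e}"))?;
            apply()
        }
        other => other,
    }
}
```

Taking `apply` and `refresh` as closures keeps the retry policy independent of the package manager adapter, so the same wrapper could serve apt, dnf, apk, and pacman backends.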

**Cache Refresh Timeout:** 120 seconds (`CACHE_REFRESH_TIMEOUT_SECS`)
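
A timeout of this kind can be enforced with the standard library by polling the child process and killing it at the deadline. The sketch below is a blocking, std-only illustration; the signature and return type are assumptions, and the real `run_command_with_timeout()` may well be asynchronous.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Timeout applied to cache refresh commands.
pub const CACHE_REFRESH_TIMEOUT_SECS: u64 = 120;

/// Run `cmd` and kill it if it exceeds `timeout`.
/// Returns `Ok(Some(status))` if the command exited in time and `Ok(None)` if it
/// was killed after the deadline. Spawn failures surface as `Err`.
pub fn run_command_with_timeout(
    cmd: &mut Command,
    timeout: Duration,
) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait returns immediately: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore kill errors: the child may have exited between the two checks.
            let _ = child.kill();
            child.wait()?; // reap the process so it does not linger as a zombie
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(50));
    }
}
```

A cache refresh call would pass the adapter's refresh command together with `Duration::from_secs(CACHE_REFRESH_TIMEOUT_SECS)` and treat `Ok(None)` as a failed refresh.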

---

*Following kiro spec-driven development standards*