linux_patch_manager/tasks/todo.md
Echo 93828e1976
feat: health check configuration and worker engine (Phase 3+4)
- Added health_check_poller.rs: periodic service/HTTP health checks
- Added pre-patch health gate in job_executor.rs
- Added waiting_health_check job status (migration 008)
- Added health_check_status to HostSummary and hosts API
- Added health check types and API functions to frontend
- Added health check UI section to HostDetailPage
- Added health check status indicators to HostsPage and PatchDeploymentPage
- Added serde default for health_check_poll_interval_secs
- Fixed missing AgentClient import in health_check_poller.rs
- Fixed missing ws_relay import in main.rs
- Fixed missing closing paren in retry_pending_jobs SQL
- Added ReadWritePaths for /etc/patch-manager/keys in systemd services
2026-05-05 14:10:37 +00:00

Health Check Configuration for Hosts — Feature Plan

Overview

Each host can have up to 5 health checks. During a maintenance window, every enabled check must be healthy before patch execution starts. Health checks are also polled continuously to feed dashboard status.

Design Decisions (confirmed with Kelly)

  • Health check config lives in manager DB, manager controls everything
  • Agent is just an action endpoint (no health check logic on agent)
  • Admins and operators can manage health checks (operators only for hosts in a matching group)
  • Per-host 1:1 (not shareable templates)
  • Checks are polled continuously AND must all be healthy before patch execution
  • Service health: query agent GET /api/v1/system/services/{name}, check healthy field
  • HTTP health: manager makes direct HTTP request, substring match in response body
  • Retry failed checks at 5-minute intervals until maintenance window closes
  • 10-second check timeout; no response = failed
  • Order doesn't matter
  • Basic auth password: encrypted in DB with per-install app key
  • Health check poll interval: 5 minutes
  • Result retention: 4 days (time-based)

linux_patch_api Agent Endpoints (Reference)

GET /api/v1/system/services/{name}

Returns service status for health check Type 1 (service). Added in commit 8b6d9ed.

Response: ApiResponse<ServiceStatusData>

{
  "success": true,
  "data": {
    "name": "nginx",
    "display_name": "A high performance web server",
    "active_state": "active",
    "sub_state": "running",
    "load_state": "loaded",
    "enabled_state": "enabled",
    "main_pid": 1234,
    "healthy": true
  }
}

Health determination: The agent healthy field is authoritative. Manager uses this boolean directly.

Error responses: 400 (invalid name), 404 (not found → unhealthy), 500 (error → unhealthy)
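
The mapping from agent response to check result can be sketched as below. This is a minimal illustration, not the code in health_check_poller.rs: the names `ServiceCheckOutcome` and `interpret_service_response` are hypothetical, and treating 400 as unhealthy is an assumption (the plan only labels it "invalid name").

```rust
/// Outcome of a service check as the manager might record it (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ServiceCheckOutcome {
    pub healthy: bool,
    pub detail: String,
}

/// Maps the agent's HTTP status and the parsed `healthy` field to an outcome.
/// `status` is `None` when no response arrived within the check timeout.
/// `healthy_field` is `None` when the body was missing or could not be parsed.
pub fn interpret_service_response(
    status: Option<u16>,
    healthy_field: Option<bool>,
) -> ServiceCheckOutcome {
    let (healthy, detail) = match (status, healthy_field) {
        // The agent's boolean is authoritative on a successful response.
        (Some(200), Some(h)) => (h, format!("agent reported healthy={}", h)),
        (Some(200), None) => (false, "200 response without a healthy field".to_string()),
        // Assumption: an invalid name is recorded as unhealthy rather than skipped.
        (Some(400), _) => (false, "invalid service name (400)".to_string()),
        (Some(404), _) => (false, "service not found (404)".to_string()),
        (Some(500), _) => (false, "agent error (500)".to_string()),
        (Some(code), _) => (false, format!("unexpected status {}", code)),
        (None, _) => (false, "no response within timeout".to_string()),
    };
    ServiceCheckOutcome { healthy, detail }
}
```

Keeping this step pure (no network access) makes the status-to-health rules easy to unit test separately from the mTLS client.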

GET /api/v1/health

Basic agent health. Returns {"status": "healthy"}. Used for connectivity checks, not host health checks.


Phase 1: Database Schema

New table: host_health_checks

CREATE TABLE host_health_checks (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    host_id         UUID NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
    name            VARCHAR(100) NOT NULL,
    check_type      VARCHAR(20) NOT NULL CHECK (check_type IN ('service', 'http')),
    enabled         BOOLEAN NOT NULL DEFAULT true,
    -- Service check fields
    service_name    VARCHAR(200),
    -- HTTP check fields
    url             TEXT,
    expected_body   VARCHAR(500),
    ignore_cert_errors BOOLEAN NOT NULL DEFAULT true,
    basic_auth_user VARCHAR(100),
    basic_auth_pass_encrypted BYTEA,  -- AES-256-GCM encrypted with per-install key
    basic_auth_pass_nonce BYTEA,      -- nonce for AES-GCM
    -- Metadata
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT valid_service_check CHECK (
        (check_type = 'service' AND service_name IS NOT NULL AND url IS NULL)
        OR
        (check_type = 'http' AND url IS NOT NULL AND expected_body IS NOT NULL AND service_name IS NULL)
    )
);

CREATE INDEX idx_health_checks_host ON host_health_checks (host_id);

New table: host_health_check_results

CREATE TABLE host_health_check_results (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    check_id    UUID NOT NULL REFERENCES host_health_checks(id) ON DELETE CASCADE,
    healthy     BOOLEAN NOT NULL,
    detail      TEXT,
    latency_ms  INTEGER,
    checked_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_health_results_check ON host_health_check_results (check_id, checked_at DESC);

Encryption key storage

  • Per-install app key stored at /etc/patch-manager/keys/health-check.key

  • 256-bit random key generated on first startup if not present

  • File permissions: 0600, owned by patch-manager user

  • Used for AES-256-GCM encryption of basic_auth_pass

  • Create migration 007_health_checks.sql

  • Add models to pm-core/src/models.rs

  • Add encryption utility to pm-core

  • Verify migration runs on dev LXC
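
The key handling described above could look roughly like the following sketch. It uses only the standard library and assumes a Linux host where `/dev/urandom` is available; the function name `load_or_create_key` is hypothetical, and ownership by the patch-manager user is assumed to follow from the service running as that user. The AES-256-GCM encryption itself is out of scope here.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Read, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// 256-bit key, as specified in the plan.
pub const KEY_LEN: usize = 32;

/// Loads the key at `path`, or generates and stores a new one if the file is absent.
pub fn load_or_create_key(path: &Path) -> io::Result<[u8; KEY_LEN]> {
    match fs::read(path) {
        Ok(bytes) => {
            if bytes.len() != KEY_LEN {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "key file must hold exactly 32 bytes",
                ));
            }
            let mut key = [0u8; KEY_LEN];
            key.copy_from_slice(&bytes);
            Ok(key)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            // Assumption: the kernel CSPRNG is the source of randomness.
            let mut key = [0u8; KEY_LEN];
            fs::File::open("/dev/urandom")?.read_exact(&mut key)?;
            // create_new fails if another process created the file first;
            // mode(0o600) restricts access to the owning user.
            let mut file = OpenOptions::new()
                .write(true)
                .create_new(true)
                .mode(0o600)
                .open(path)?;
            file.write_all(&key)?;
            file.sync_all()?;
            Ok(key)
        }
        Err(e) => Err(e),
    }
}
```

At startup this would be called once with /etc/patch-manager/keys/health-check.key; the systemd unit needs write access to that directory for the first-run case.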


Phase 2: Backend API Routes

Endpoints

  • GET /api/v1/hosts/{id}/health-checks — list health checks for host (RBAC scoped)
  • POST /api/v1/hosts/{id}/health-checks — create health check (max 5 per host)
  • PUT /api/v1/hosts/{id}/health-checks/{check_id} — update health check
  • DELETE /api/v1/hosts/{id}/health-checks/{check_id} — delete health check
  • POST /api/v1/hosts/{id}/health-checks/{check_id}/test — run check immediately, return result
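
The RBAC rule these endpoints share (admins manage everything, operators only hosts in a matching group) can be expressed as one predicate. This is a sketch under assumptions: the `Role` enum, the `Viewer` variant, and group names as strings are hypothetical, since the plan does not describe how roles and groups are modelled.

```rust
/// Hypothetical role model; the real one lives elsewhere in the manager.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Role {
    Admin,
    Operator,
    Viewer,
}

/// Returns true if a user may create, update, delete, or test health checks on a host.
pub fn can_manage_health_checks(role: Role, user_groups: &[&str], host_groups: &[&str]) -> bool {
    match role {
        // Admins are not group-scoped.
        Role::Admin => true,
        // Operators need at least one group in common with the host.
        Role::Operator => user_groups.iter().any(|g| host_groups.contains(g)),
        // Assumption: any other role is read-only at most.
        Role::Viewer => false,
    }
}
```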

Request/Response types

struct CreateHealthCheckRequest {
    name: String,
    check_type: String,  // "service" or "http"
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
    ignore_cert_errors: Option<bool>,
    basic_auth_user: Option<String>,
    basic_auth_pass: Option<String>,  // plaintext in request, encrypted before storage
}

struct HealthCheck {
    id: Uuid,
    host_id: Uuid,
    name: String,
    check_type: String,
    enabled: bool,
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
    ignore_cert_errors: bool,
    basic_auth_user: Option<String>,
    // basic_auth_pass NOT returned in responses
    last_result: Option<HealthCheckResult>,
    created_at: DateTime<Utc>,
    updated_at: DateTime<Utc>,
}

struct HealthCheckResult {
    healthy: bool,
    detail: Option<String>,
    latency_ms: Option<i32>,
    checked_at: DateTime<Utc>,
}

  • Add routes to pm-web/src/routes/ (new health_checks.rs)
  • Add CRUD operations
  • Add RBAC enforcement (admin all, operator matching group)
  • Add max-5-per-host validation
  • Add /test endpoint that runs check immediately
  • Add audit logging for create/update/delete
  • Add encryption/decryption for basic_auth_pass
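
Request validation for the create endpoint might mirror the table's CHECK constraint so that bad input is rejected with a clear message before it reaches the database. The sketch below uses a trimmed copy of `CreateHealthCheckRequest` (auth and certificate fields omitted); `validate_create` and `MAX_CHECKS_PER_HOST` are hypothetical names, and rejecting an empty name is an assumption the schema does not enforce.

```rust
pub const MAX_CHECKS_PER_HOST: usize = 5;

/// Trimmed copy of the request type, limited to the fields validated here.
#[derive(Debug, Default, Clone)]
pub struct CreateHealthCheckRequest {
    pub name: String,
    pub check_type: String,
    pub service_name: Option<String>,
    pub url: Option<String>,
    pub expected_body: Option<String>,
}

/// `existing_count` is the number of checks the host already has.
pub fn validate_create(req: &CreateHealthCheckRequest, existing_count: usize) -> Result<(), String> {
    if existing_count >= MAX_CHECKS_PER_HOST {
        return Err(format!("host already has {} health checks", MAX_CHECKS_PER_HOST));
    }
    // VARCHAR(100) in the schema; the non-empty rule is an assumption.
    if req.name.is_empty() || req.name.chars().count() > 100 {
        return Err("name must be 1-100 characters".to_string());
    }
    match req.check_type.as_str() {
        "service" => {
            if req.service_name.is_none() {
                return Err("service checks require service_name".to_string());
            }
            if req.url.is_some() {
                return Err("service checks must not set url".to_string());
            }
            Ok(())
        }
        "http" => {
            if req.url.is_none() || req.expected_body.is_none() {
                return Err("http checks require url and expected_body".to_string());
            }
            if req.service_name.is_some() {
                return Err("http checks must not set service_name".to_string());
            }
            Ok(())
        }
        other => Err(format!("unknown check_type '{}'", other)),
    }
}
```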

Phase 3: Worker Health Check Engine

Continuous Polling

  • New task in pm-worker: health_check_poller
  • Polls all enabled health checks every 5 minutes
  • For service checks: call agent GET /api/v1/system/services/{name} via mTLS, check healthy field
  • For HTTP checks: make direct HTTP(S) request from manager, check status code + substring match
  • Store results in host_health_check_results
  • Prune results older than 4 days
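
The 4-day retention rule is illustrated below on an in-memory list. In the worker this would more likely be a single DELETE against host_health_check_results with a checked_at cutoff; the `StoredResult` type and `prune_expired` function are hypothetical and exist only to show the cutoff arithmetic.

```rust
use std::time::{Duration, SystemTime};

/// Result retention from the plan: 4 days.
pub const RETENTION: Duration = Duration::from_secs(4 * 24 * 60 * 60);

/// Simplified stand-in for a row in host_health_check_results.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredResult {
    pub check_id: u32,
    pub healthy: bool,
    pub checked_at: SystemTime,
}

/// Drops results older than RETENTION and returns how many were removed.
pub fn prune_expired(results: &mut Vec<StoredResult>, now: SystemTime) -> usize {
    let before = results.len();
    results.retain(|r| match now.duration_since(r.checked_at) {
        Ok(age) => age <= RETENTION,
        // A timestamp ahead of `now` (clock skew) is kept rather than discarded.
        Err(_) => true,
    });
    before - results.len()
}
```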

Pre-Patch Execution Gate

  • When a patch job is about to execute on a host:
    1. Check if host has any enabled health checks
    2. If yes, verify all are currently healthy (from latest poll result)
    3. If any are unhealthy:
      • Wait and retry at 5-minute intervals
      • Continue until maintenance window closes
      • If window closes with failed checks, mark job host as failed with detail
    4. If all healthy, proceed with patch execution
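
The gate steps above reduce to a small decision function that the executor could call on each attempt. This is a sketch, not the logic in job_executor.rs: `GateDecision` and `evaluate_gate` are hypothetical names, and treating a check with no result yet as not healthy is an assumption.

```rust
use std::time::{Duration, SystemTime};

/// Retry interval from the plan: 5 minutes.
pub const RETRY_INTERVAL: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, PartialEq)]
pub enum GateDecision {
    /// No enabled checks, or all of them are healthy.
    Proceed,
    /// At least one check is failing; re-evaluate at this time.
    RetryAt(SystemTime),
    /// The maintenance window closed with failing checks.
    Fail(String),
}

/// `latest` holds (check name, latest poll result) for each enabled check.
/// `None` means the check has no result yet (assumed to block the gate).
pub fn evaluate_gate(
    latest: &[(&str, Option<bool>)],
    now: SystemTime,
    window_end: SystemTime,
) -> GateDecision {
    let failing: Vec<&str> = latest
        .iter()
        .filter(|(_, healthy)| *healthy != Some(true))
        .map(|(name, _)| *name)
        .collect();
    if failing.is_empty() {
        return GateDecision::Proceed;
    }
    if now >= window_end {
        return GateDecision::Fail(format!(
            "maintenance window closed with failing health checks: {}",
            failing.join(", ")
        ));
    }
    GateDecision::RetryAt(now + RETRY_INTERVAL)
}
```

A caller would park the job in the waiting_health_check status on `RetryAt` and record the `Fail` message as the job-host failure detail.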

HTTP Check Implementation

  • Use reqwest with:

    • 10-second timeout
    • Accept invalid certs when ignore_cert_errors is true
    • Optional basic auth header (decrypt from DB)
    • Check response body contains expected_body substring
  • Return healthy=true if match, false otherwise
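
Separating the network call from the pass/fail decision keeps the substring rule testable without a live server. The sketch below covers only the decision step; the reqwest request itself (timeout, certificate handling, basic auth) is not shown. `HttpOutcome` and `evaluate_http_check` are hypothetical names. The status code is recorded in the detail string but does not affect the result here, because the plan does not say which status codes should count as failures.

```rust
/// What the HTTP client observed (hypothetical type).
pub enum HttpOutcome {
    /// No response within the 10-second timeout, or a connection error.
    NoResponse,
    /// A response was received.
    Response { status: u16, body: String },
}

/// Returns (healthy, detail) for an HTTP check.
pub fn evaluate_http_check(outcome: &HttpOutcome, expected_body: &str) -> (bool, String) {
    match outcome {
        // Per the design decisions: no response = failed.
        HttpOutcome::NoResponse => (false, "no response within 10 seconds".to_string()),
        HttpOutcome::Response { status, body } => {
            if body.contains(expected_body) {
                (true, format!("status {}, expected body found", status))
            } else {
                (false, format!("status {}, expected body not found", status))
            }
        }
    }
}
```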

  • Add health_check_poller module to pm-worker

  • Implement service check via AgentClient

  • Implement HTTP check via reqwest

  • Add pre-patch execution gate to job_executor

  • Add retry loop with 5-minute intervals

  • Add maintenance window expiry check

  • Add health check config to WorkerConfig (poll interval)

  • Add result pruning (4-day retention)


Phase 4: Frontend UI

Host Detail Page

  • Add "Health Checks" section below host info
  • List current health checks with status indicators
  • Add/Edit/Delete health check dialogs
  • "Test" button to run check immediately and show result
  • Visual indicator: green check = healthy, red X = unhealthy, gray = unknown

Hosts Page

  • Add health check summary column or indicator
  • Show aggregate status: all healthy / some unhealthy / no checks configured
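
The aggregate shown per host could be computed on the backend from each enabled check's latest result, roughly as follows. `AggregateStatus` and `aggregate_status` are hypothetical names, and the `Unknown` variant (no failures, but at least one check without a result yet) is an assumption added to match the gray "unknown" indicator on the host detail page.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AggregateStatus {
    /// The host has no enabled checks.
    NoChecks,
    /// Every check's latest result is healthy.
    AllHealthy,
    /// At least one check's latest result is unhealthy.
    SomeUnhealthy,
    /// Assumption: nothing is failing, but some checks have no result yet.
    Unknown,
}

/// `latest` holds the latest result per enabled check; `None` means no result yet.
pub fn aggregate_status(latest: &[Option<bool>]) -> AggregateStatus {
    if latest.is_empty() {
        AggregateStatus::NoChecks
    } else if latest.iter().any(|r| *r == Some(false)) {
        AggregateStatus::SomeUnhealthy
    } else if latest.iter().any(|r| r.is_none()) {
        AggregateStatus::Unknown
    } else {
        AggregateStatus::AllHealthy
    }
}
```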

Deploy Page

  • Show health check status in host selection table
  • Warn if any selected hosts have unhealthy checks

Job Detail

  • Show health check gate status when job is waiting for healthy checks

  • Display which checks are passing/failing

  • Add HealthCheck types to frontend/src/types/index.ts

  • Add health check API calls to frontend/src/api/client.ts

  • Add Health Checks section to HostDetailPage.tsx

  • Add health check status to HostsPage.tsx

  • Add health check indicators to PatchDeploymentPage.tsx

  • Add health check gate status to JobsPage.tsx detail view


Phase 5: Integration & Testing

  • Build and deploy to dev LXC
  • Test service health check against dev LXC agent
  • Test HTTP health check against internal services
  • Test pre-patch gate: deploy with failing check, verify retry behavior
  • Test maintenance window expiry with failing checks
  • Test RBAC: operator can only manage checks in their group
  • Test max 5 checks per host enforcement
  • Test basic auth encryption/decryption
  • Push to Gitea

Resolved Items

  • Basic auth password storage → Encrypted in DB with per-install app key (AES-256-GCM)
  • Health check poll interval → 5 minutes
  • Result retention → 4 days (time-based)