feat: health check configuration and worker engine (Phase 3+4)
Some checks failed
CI Pipeline / Rust Format Check (push) Failing after 4s
CI Pipeline / Clippy Lints (push) Successful in 46s
CI Pipeline / Rust Unit Tests (push) Successful in 1m1s
CI Pipeline / Security Audit (push) Successful in 4s
CI Pipeline / Frontend Lint & Type Check (push) Failing after 10s
CI Pipeline / Build .deb & Release (push) Has been skipped

- Added health_check_poller.rs: periodic service/HTTP health checks
- Added pre-patch health gate in job_executor.rs
- Added waiting_health_check job status (migration 008)
- Added health_check_status to HostSummary and hosts API
- Added health check types and API functions to frontend
- Added health check UI section to HostDetailPage
- Added health check status indicators to HostsPage and PatchDeploymentPage
- Added serde default for health_check_poll_interval_secs
- Fixed missing AgentClient import in health_check_poller.rs
- Fixed missing ws_relay import in main.rs
- Fixed missing closing paren in retry_pending_jobs SQL
- Added ReadWritePaths for /etc/patch-manager/keys in systemd services
2026-05-05 14:10:37 +00:00
parent a306806b04
commit 93828e1976
28 changed files with 2726 additions and 50 deletions

105
Cargo.lock generated

@ -8,6 +8,41 @@ version = "2.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
[[package]]
name = "aead"
version = "0.5.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d122413f284cf2d62fb1b7db97e02edb8cda96d769b16e443a4f6195e35662b0"
dependencies = [
"crypto-common",
"generic-array",
]
[[package]]
name = "aes"
version = "0.8.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b169f7a6d4742236a0a00c541b845991d0ac43e546831af1249753ab4c3aa3a0"
dependencies = [
"cfg-if",
"cipher",
"cpufeatures",
]
[[package]]
name = "aes-gcm"
version = "0.10.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "831010a0f742e1209b3bcea8fab6a8e149051ba6099432c8cb2cc117dec3ead1"
dependencies = [
"aead",
"aes",
"cipher",
"ctr",
"ghash",
"subtle",
]
[[package]]
name = "aho-corasick"
version = "1.1.4"
@ -400,6 +435,16 @@ dependencies = [
"windows-link",
]
[[package]]
name = "cipher"
version = "0.4.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "773f3b9af64447d2ce9850330c473515014aa235e6a783b02db81ff39e4a3dad"
dependencies = [
"crypto-common",
"inout",
]
[[package]]
name = "cmake"
version = "0.1.58"
@ -572,6 +617,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a"
dependencies = [
"generic-array",
"rand_core 0.6.4",
"typenum",
]
@ -596,6 +642,15 @@ dependencies = [
"memchr",
]
[[package]]
name = "ctr"
version = "0.9.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0369ee1ad671834580515889b80f2ea915f23b8be8d0daa4bbaf2ac5c7590835"
dependencies = [
"cipher",
]
[[package]]
name = "dashmap"
version = "6.1.0"
@ -1020,6 +1075,16 @@ dependencies = [
"wasip3",
]
[[package]]
name = "ghash"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f0d8a4362ccb29cb0b265253fb0a2728f592895ee6854fd9bc13f2ffda266ff1"
dependencies = [
"opaque-debug",
"polyval",
]
[[package]]
name = "gif"
version = "0.12.0"
@ -1458,6 +1523,15 @@ dependencies = [
"serde_core",
]
[[package]]
name = "inout"
version = "0.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "879f10e63c20629ecabbb64a8010319738c66a5cd0c29b02d63d272b03751d01"
dependencies = [
"generic-array",
]
[[package]]
name = "ipnet"
version = "2.12.0"
@ -1878,6 +1952,12 @@ version = "1.21.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
[[package]]
name = "opaque-debug"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c08d65885ee38876c4f86fa503fb49d7b507c2b62552df7c70b2fce627e06381"
[[package]]
name = "openssl"
version = "0.10.78"
@ -2195,11 +2275,13 @@ dependencies = [
name = "pm-core"
version = "0.1.0"
dependencies = [
"aes-gcm",
"anyhow",
"axum",
"chrono",
"config",
"hex",
"rand 0.8.6",
"serde",
"serde_json",
"sha2",
@ -2280,6 +2362,7 @@ dependencies = [
"lettre",
"pm-agent-client",
"pm-core",
"reqwest",
"rustls",
"rustls-pemfile",
"serde",
@ -2320,6 +2403,18 @@ dependencies = [
"miniz_oxide",
]
[[package]]
name = "polyval"
version = "0.6.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9d1fe60d06143b2430aa532c94cfe9e29783047f06c0d7fd359a9a51b729fa25"
dependencies = [
"cfg-if",
"cpufeatures",
"opaque-debug",
"universal-hash",
]
[[package]]
name = "pom"
version = "3.4.0"
@ -3881,6 +3976,16 @@ version = "0.2.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
[[package]]
name = "universal-hash"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fc1de2c688dc15305988b563c3854064043356019f97a4b46276fe734c4f07ea"
dependencies = [
"crypto-common",
"subtle",
]
[[package]]
name = "untrusted"
version = "0.9.0"


@ -77,5 +77,6 @@ totp-rs = { version = "5", features = ["gen_secret", "otpauth"] }
base64 = { version = "0.22" }
hex = { version = "0.4" }
sha2 = { version = "0.10" }
aes-gcm = { version = "0.10" }
ipnet = { version = "2" }
url = { version = "2" }


@ -42,6 +42,10 @@ health_poll_interval_secs = 300
# Agent patch data poll interval (seconds). Default: 1800 = 30 minutes
patch_poll_interval_secs = 1800
# Health check poll interval (seconds). Default: 300 = 5 minutes
# Controls how often configured service/HTTP health checks are evaluated.
health_check_poll_interval_secs = 300
# Maximum concurrent mTLS agent calls (Tokio Semaphore)
max_concurrent_agent_calls = 64


@ -30,7 +30,7 @@ use crate::{
error::AgentClientError,
types::{
AgentEnvelope, AgentJobStatus, ApplyPatchesRequest, ApplyPatchesResponse, HealthData,
- PackagesData, PatchesData, RollbackResponse, SystemInfoData,
+ PackagesData, PatchesData, RollbackResponse, ServiceStatusData, SystemInfoData,
},
};
@ -221,10 +221,17 @@ impl AgentClient {
.await
}
/// `GET /api/v1/system/services/{name}` — check status of a specific service on the agent.
#[instrument(skip(self), fields(base_url = %self.base_url, service_name = %service_name))]
pub async fn service_status(&self, service_name: &str) -> Result<ServiceStatusData, AgentClientError> {
self.get(&format!("system/services/{}", service_name), &[]).await
}
// --------------------------------------------------------
// Private POST helper
// --------------------------------------------------------
/// Execute a POST request against `{base_url}/{path}`, serialize `body` as
/// JSON, deserialize the [`AgentEnvelope`], and extract the `data` field —
/// or propagate an [`AgentClientError::ApiError`].
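Review note: `service_status` interpolates `service_name` into the request path unescaped. If a service name could ever contain characters such as `/`, `?`, or `#`, the resulting path would change meaning. A minimal sketch of one way to guard against that, assuming the agent percent-decodes path segments; `encode_path_segment` and `service_status_path` are hypothetical helpers, not part of this commit:

```rust
/// Percent-encode `segment` so it can be used as a single URL path segment.
/// Hypothetical helper (not in the commit). Only RFC 3986 "unreserved"
/// characters are passed through; every other byte becomes %XX.
pub fn encode_path_segment(segment: &str) -> String {
    let mut out = String::with_capacity(segment.len());
    for byte in segment.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Build the relative path for the service status endpoint.
/// Hypothetical helper mirroring the `format!` call in `service_status`.
pub fn service_status_path(service_name: &str) -> String {
    format!("system/services/{}", encode_path_segment(service_name))
}
```

With this sketch, `service_status_path("a/b?c")` yields `system/services/a%2Fb%3Fc`, while plain names such as `nginx.service` are left unchanged. It does not treat `.` or `..` specially, so those would still need separate validation.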


@ -39,5 +39,5 @@ pub use error::AgentClientError;
/// Response envelope and all data types.
pub use types::{
AgentEnvelope, AgentErrorBody, HealthData, Package, PackagesData, Patch, PatchesData,
- SystemInfoData,
+ RollbackResponse, ServiceStatusData, SystemInfoData,
};


@ -193,6 +193,23 @@ pub struct AgentJobStatus {
pub completed_at: Option<DateTime<Utc>>,
}
// ============================================================
// GET /api/v1/system/services/{name}
// ============================================================
/// Payload returned by `GET /api/v1/system/services/{name}`.
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct ServiceStatusData {
/// Service name.
pub name: String,
/// Service status string (e.g. `"running"`, `"stopped"`, `"failed"`).
pub status: String,
/// Whether the service is considered healthy.
pub healthy: bool,
/// Seconds elapsed since the service started (`null` if not running).
pub uptime_secs: Option<u64>,
}
// ============================================================
// POST /api/v1/jobs/{id}/rollback
// ============================================================


@ -22,3 +22,5 @@ config = { workspace = true }
axum = { workspace = true }
sha2 = { workspace = true }
hex = { workspace = true }
aes-gcm = { workspace = true }
rand = { workspace = true }


@ -47,6 +47,9 @@ pub enum AuditAction {
PatchJobCompleted,
PatchJobFailed,
MaintenanceWindowReminder,
HealthCheckCreated,
HealthCheckUpdated,
HealthCheckDeleted,
}
impl AuditAction {
@ -80,6 +83,9 @@ impl AuditAction {
Self::PatchJobCompleted => "patch_job_completed",
Self::PatchJobFailed => "patch_job_failed",
Self::MaintenanceWindowReminder => "maintenance_window_reminder",
Self::HealthCheckCreated => "health_check_created",
Self::HealthCheckUpdated => "health_check_updated",
Self::HealthCheckDeleted => "health_check_deleted",
}
}
}


@ -39,6 +39,9 @@ pub struct WorkerConfig {
pub health_poll_interval_secs: u64,
/// Patch data poll interval in seconds (default: 1800 = 30 min)
pub patch_poll_interval_secs: u64,
/// Health check poll interval in seconds (default: 300 = 5 min)
#[serde(default = "default_health_check_poll_interval")]
pub health_check_poll_interval_secs: u64,
/// Maximum concurrent agent calls
pub max_concurrent_agent_calls: usize,
/// Worker heartbeat interval in seconds
@ -98,6 +101,8 @@ impl AppConfig {
}
}
fn default_health_check_poll_interval() -> u64 { 300 }
impl Default for AppConfig {
fn default() -> Self {
Self {
@ -115,6 +120,7 @@ impl Default for AppConfig {
worker: WorkerConfig {
health_poll_interval_secs: 300,
patch_poll_interval_secs: 1800,
health_check_poll_interval_secs: 300,
max_concurrent_agent_calls: 64,
heartbeat_interval_secs: 30,
ws_relay_poll_interval_secs: 10,


@ -0,0 +1,80 @@
//! AES-256-GCM encryption for sensitive health check credentials.
//!
//! Uses a per-install key stored at `/etc/patch-manager/keys/health-check.key`.
use aes_gcm::{
aead::{Aead, KeyInit, OsRng},
Aes256Gcm, Nonce,
};
use rand::RngCore;
use std::fs;
use std::path::Path;
pub const KEY_PATH: &str = "/etc/patch-manager/keys/health-check.key";
/// Load or create the per-install encryption key.
/// If the key file doesn't exist, generates a new 256-bit key and saves it.
pub fn load_or_create_key(path: &Path) -> Result<[u8; 32], CryptoError> {
if path.exists() {
let key_bytes = fs::read(path).map_err(CryptoError::Io)?;
if key_bytes.len() != 32 {
return Err(CryptoError::InvalidKeyLength(key_bytes.len()));
}
let mut key = [0u8; 32];
key.copy_from_slice(&key_bytes);
Ok(key)
} else {
let mut key = [0u8; 32];
OsRng.fill_bytes(&mut key);
if let Some(parent) = path.parent() {
fs::create_dir_all(parent).map_err(CryptoError::Io)?;
}
fs::write(path, &key).map_err(CryptoError::Io)?;
// Set permissions to 0600 (owner read/write only)
#[cfg(unix)]
{
use std::os::unix::fs::PermissionsExt;
fs::set_permissions(path, fs::Permissions::from_mode(0o600))
.map_err(CryptoError::Io)?;
}
Ok(key)
}
}
/// Encrypt plaintext with AES-256-GCM. Returns (ciphertext, nonce).
pub fn encrypt(plaintext: &str, key: &[u8; 32]) -> Result<(Vec<u8>, Vec<u8>), CryptoError> {
let cipher = Aes256Gcm::new_from_slice(key).map_err(|e| CryptoError::KeyInit(e.to_string()))?;
let mut nonce_bytes = [0u8; 12];
OsRng.fill_bytes(&mut nonce_bytes);
let nonce = Nonce::from_slice(&nonce_bytes);
let ciphertext = cipher
.encrypt(nonce, plaintext.as_bytes())
.map_err(|_| CryptoError::EncryptionFailed)?;
Ok((ciphertext, nonce_bytes.to_vec()))
}
/// Decrypt AES-256-GCM ciphertext with the given nonce.
pub fn decrypt(ciphertext: &[u8], nonce: &[u8], key: &[u8; 32]) -> Result<String, CryptoError> {
let cipher = Aes256Gcm::new_from_slice(key).map_err(|e| CryptoError::KeyInit(e.to_string()))?;
let nonce = Nonce::from_slice(nonce);
let plaintext = cipher
.decrypt(nonce, ciphertext)
.map_err(|_| CryptoError::DecryptionFailed)?;
String::from_utf8(plaintext).map_err(CryptoError::Utf8)
}
#[derive(Debug, thiserror::Error)]
pub enum CryptoError {
#[error("IO error: {0}")]
Io(#[from] std::io::Error),
#[error("Invalid key length: expected 32 bytes, got {0}")]
InvalidKeyLength(usize),
#[error("Key init error: {0}")]
KeyInit(String),
#[error("Encryption failed")]
EncryptionFailed,
#[error("Decryption failed")]
DecryptionFailed,
#[error("UTF-8 error: {0}")]
Utf8(#[from] std::string::FromUtf8Error),
}
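Review note: `load_or_create_key` writes the key with `fs::write` and only then tightens the mode to 0600, so the file briefly exists with default permissions. A possible alternative, sketched below for Unix targets only, creates the file with the restrictive mode from the start and refuses to overwrite an existing key. `write_key_file` is a hypothetical helper, not part of this commit, and key generation is left to the caller.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Create `path` with mode 0600 and write the 32-byte key into it.
/// Hypothetical helper (not in the commit). Fails with `AlreadyExists`
/// if the file is already present, so an existing key is never clobbered.
#[cfg(unix)]
pub fn write_key_file(path: &Path, key: &[u8; 32]) -> io::Result<()> {
    use std::os::unix::fs::OpenOptionsExt;

    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // atomic "create only if missing"
        .mode(0o600) // requested at creation time; the umask can only restrict further
        .open(path)?;
    file.write_all(key)?;
    // Flush file contents and metadata to disk before reporting success.
    file.sync_all()
}
```

Using `create_new` also covers the case where two processes (for example the API server and the worker) start at the same time: only one of them creates the key, and the other gets `AlreadyExists` and can fall back to reading the file.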


@ -1,5 +1,6 @@
pub mod audit;
pub mod config;
pub mod crypto;
pub mod db;
pub mod error;
pub mod logging;
@ -8,11 +9,14 @@ pub mod request_id;
// Re-export commonly used types
pub use config::AppConfig;
pub use crypto::{CryptoError, KEY_PATH, decrypt, encrypt, load_or_create_key};
pub use error::{AppError, ErrorResponse};
pub use models::{
- AuthProvider, CreateGroupRequest, CreateHostRequest, CreateUserRequest, DiscoveryCidrRequest,
- DiscoveryResult, Group, Host, HostHealthStatus, HostSummary, RegisterDiscoveredRequest,
- UpdateGroupRequest, UpdateUserRequest, User, UserRole as DbUserRole,
+ AuthProvider, CreateGroupRequest, CreateHealthCheckRequest, CreateHostRequest,
+ CreateUserRequest, DiscoveryCidrRequest, DiscoveryResult, Group, HealthCheck,
+ HealthCheckResult, HealthCheckWithResult, Host, HostHealthStatus, HostSummary,
+ RegisterDiscoveredRequest, UpdateGroupRequest, UpdateHealthCheckRequest, UpdateUserRequest,
+ User, UserRole as DbUserRole,
};
// Re-export audit integrity types


@ -113,9 +113,79 @@ pub struct HostSummary {
pub health_status: HostHealthStatus,
pub agent_version: Option<String>,
pub patches_missing: i32,
pub health_check_status: Option<String>,
pub registered_at: DateTime<Utc>,
}
// ============================================================
// Health Checks
// ============================================================
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct HealthCheck {
pub id: Uuid,
pub host_id: Uuid,
pub name: String,
pub check_type: String, // "service" or "http"
pub enabled: bool,
// Service check fields
pub service_name: Option<String>,
// HTTP check fields
pub url: Option<String>,
pub expected_body: Option<String>,
pub ignore_cert_errors: bool,
pub basic_auth_user: Option<String>,
// basic_auth_pass_encrypted and nonce NOT exposed in API responses
pub created_at: DateTime<Utc>,
pub updated_at: DateTime<Utc>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HealthCheckWithResult {
#[serde(flatten)]
pub check: HealthCheck,
pub last_result: Option<HealthCheckResult>,
}
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
pub struct HealthCheckResult {
pub id: Uuid,
pub check_id: Uuid,
pub healthy: bool,
pub detail: Option<String>,
pub latency_ms: Option<i32>,
pub checked_at: DateTime<Utc>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CreateHealthCheckRequest {
pub name: String,
pub check_type: String, // "service" or "http"
pub service_name: Option<String>,
pub url: Option<String>,
pub expected_body: Option<String>,
#[serde(default = "default_true")]
pub ignore_cert_errors: bool,
pub basic_auth_user: Option<String>,
pub basic_auth_pass: Option<String>, // plaintext in request, encrypted before storage
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct UpdateHealthCheckRequest {
pub name: Option<String>,
pub enabled: Option<bool>,
pub service_name: Option<String>,
pub url: Option<String>,
pub expected_body: Option<String>,
pub ignore_cert_errors: Option<bool>,
pub basic_auth_user: Option<String>,
pub basic_auth_pass: Option<String>, // if provided, re-encrypt
}
fn default_true() -> bool {
true
}
// ============================================================
// Group
// ============================================================
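Review note: `check_type` is carried as a free-form `String` ("service" or "http") in `HealthCheck` and `CreateHealthCheckRequest`, so an unknown value is only detected when the poller runs the check. One option is to parse it into an enum at the API boundary. The sketch below is illustrative only; `CheckType`, `UnknownCheckType`, and `validate_target` are hypothetical names that do not appear in this commit.

```rust
use std::fmt;
use std::str::FromStr;

/// The two check kinds named in the model comments. Hypothetical type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckType {
    Service,
    Http,
}

/// Error returned for any string other than "service" or "http".
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownCheckType(pub String);

impl FromStr for CheckType {
    type Err = UnknownCheckType;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "service" => Ok(Self::Service),
            "http" => Ok(Self::Http),
            other => Err(UnknownCheckType(other.to_string())),
        }
    }
}

impl fmt::Display for CheckType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::Service => "service",
            Self::Http => "http",
        })
    }
}

/// Reject requests whose type-specific target field is missing or blank:
/// service checks need `service_name`, HTTP checks need `url`.
pub fn validate_target(
    check_type: CheckType,
    service_name: Option<&str>,
    url: Option<&str>,
) -> Result<(), String> {
    match check_type {
        CheckType::Service if service_name.map_or(true, |s| s.trim().is_empty()) => {
            Err("service check requires service_name".to_string())
        }
        CheckType::Http if url.map_or(true, |u| u.trim().is_empty()) => {
            Err("http check requires url".to_string())
        }
        _ => Ok(()),
    }
}
```

A create handler could call `req.check_type.parse::<CheckType>()` followed by `validate_target(...)` and return a 400-style error before anything is stored, instead of recording "Unknown check type" results on every poll cycle.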


@ -189,6 +189,7 @@ pub fn build_router(state: AppState) -> Router {
.merge(routes::ws::ticket_router())
// Reports
.nest("/reports", routes::reports::router())
.nest("/hosts/{host_id}/health-checks", routes::health_checks::router())
// Settings (admin-only)
.nest("/settings", routes::settings::router())
// Apply auth middleware to all the above

File diff suppressed because it is too large


@ -112,6 +112,7 @@ async fn list_hosts(
SELECT h.id, h.fqdn, host(h.ip_address)::text AS ip_address, h.display_name,
h.os_family, h.os_name, h.health_status, h.agent_version,
COALESCE(hpd.patch_count, 0) AS patches_missing,
" + hc_subquery + ",
h.registered_at
FROM hosts h
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id
@ -130,6 +131,7 @@ async fn list_hosts(
h.display_name, h.os_family, h.os_name,
h.health_status, h.agent_version,
COALESCE(hpd.patch_count, 0) AS patches_missing,
" + hc_subquery + ",
h.registered_at
FROM hosts h
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id


@ -11,5 +11,6 @@ pub mod settings;
pub mod status;
pub mod users;
pub mod ws;
pub mod health_checks;
pub mod reports;


@ -28,3 +28,4 @@ tokio-rustls = { version = "0.26" }
rustls-pemfile = { version = "2" }
tokio-tungstenite = { version = "0.26", features = ["rustls-tls-webpki-roots"] }
lettre = { version = "0.11", default-features = false, features = ["tokio1-rustls-tls", "smtp-transport", "builder"] }
reqwest = { workspace = true }


@ -0,0 +1,471 @@
//! Periodic health check poller for configured service and HTTP checks.
//!
//! Polls every `health_check_poll_interval_secs`, querying each enabled health
//! check definition and storing results in `host_health_check_results`.
//! Results older than 4 days are pruned on each cycle.
use std::path::Path;
use std::sync::Arc;
use std::time::Instant;
use pm_core::{config::AppConfig, crypto};
use sqlx::{FromRow, PgPool};
use tokio::{sync::Semaphore, time};
use uuid::Uuid;
use crate::agent_loader::load_agent_certs;
use pm_agent_client::{AgentClient, AgentClientError};
// ─────────────────────────────────────────────────────────────────────────────
// DB row types
// ─────────────────────────────────────────────────────────────────────────────
/// Row fetched for each enabled health check, joined with host connection info.
#[derive(Debug, FromRow)]
struct HealthCheckRow {
id: Uuid,
host_id: Uuid,
name: String,
check_type: String,
service_name: Option<String>,
url: Option<String>,
expected_body: Option<String>,
ignore_cert_errors: Option<bool>,
basic_auth_user: Option<String>,
basic_auth_pass_encrypted: Option<Vec<u8>>,
basic_auth_pass_nonce: Option<Vec<u8>>,
ip_address: String,
agent_port: i32,
}
// ─────────────────────────────────────────────────────────────────────────────
// Public entry point
// ─────────────────────────────────────────────────────────────────────────────
/// Run the health check poller loop indefinitely.
///
/// On each tick all enabled health checks are queried concurrently (up to
/// `max_concurrent_agent_calls` in-flight at once). Results are persisted
/// to `host_health_check_results` and stale rows are pruned.
pub async fn run_health_check_poller(pool: PgPool, config: Arc<AppConfig>) {
let interval_secs = config.worker.health_check_poll_interval_secs;
let mut ticker = time::interval(std::time::Duration::from_secs(interval_secs));
tracing::info!(interval_secs, "Health check poller started");
loop {
ticker.tick().await;
// Load certs on each cycle so cert rotation is picked up automatically.
let certs = match load_agent_certs(&config.security) {
Ok(c) => c,
Err(e) => {
tracing::error!(
error = %e,
"Health check poller: failed to load agent certs — skipping cycle"
);
continue;
},
};
let client_cert = Arc::new(certs.client_cert);
let client_key = Arc::new(certs.client_key);
let ca_cert = Arc::new(certs.ca_cert);
// Load the crypto key for decrypting HTTP check passwords.
let crypto_key = match crypto::load_or_create_key(Path::new(crypto::KEY_PATH)) {
Ok(k) => Arc::new(k),
Err(e) => {
tracing::error!(
error = %e,
"Health check poller: failed to load crypto key — skipping cycle"
);
continue;
},
};
// Fetch all enabled health checks with host connection info.
let checks: Vec<HealthCheckRow> = match sqlx::query_as(
r#"
SELECT
hc.id,
hc.host_id,
hc.name,
hc.check_type,
hc.service_name,
hc.url,
hc.expected_body,
hc.ignore_cert_errors,
hc.basic_auth_user,
hc.basic_auth_pass_encrypted,
hc.basic_auth_pass_nonce,
host(h.ip_address)::text AS ip_address,
h.agent_port
FROM host_health_checks hc
JOIN hosts h ON h.id = hc.host_id
WHERE hc.enabled = TRUE
ORDER BY hc.id
"#,
)
.fetch_all(&pool)
.await
{
Ok(rows) => rows,
Err(e) => {
tracing::error!(error = %e, "Health check poller: failed to fetch health checks");
continue;
},
};
if checks.is_empty() {
tracing::debug!("Health check poller: no enabled health checks, skipping cycle");
prune_old_results(&pool).await;
continue;
}
let total = checks.len();
let semaphore = Arc::new(Semaphore::new(config.worker.max_concurrent_agent_calls));
let mut handles = Vec::with_capacity(total);
for check in checks {
let pool = pool.clone();
let sem = semaphore.clone();
let cert = client_cert.clone();
let key = client_key.clone();
let ca = ca_cert.clone();
let ckey = crypto_key.clone();
let handle = tokio::spawn(async move {
let _permit = sem.acquire().await.expect("semaphore closed");
run_check(pool, check, &cert, &key, &ca, &ckey).await
});
handles.push(handle);
}
// Collect results and tally counts.
let mut healthy_count = 0usize;
let mut unhealthy_count = 0usize;
let mut error_count = 0usize;
for handle in handles {
match handle.await {
Ok(true) => healthy_count += 1,
Ok(false) => unhealthy_count += 1,
Err(e) => {
tracing::error!(error = %e, "Health check poller task panicked");
error_count += 1;
},
}
}
tracing::info!(
total,
healthy_count,
unhealthy_count,
error_count,
"Health check poll cycle complete"
);
// Prune results older than 4 days.
prune_old_results(&pool).await;
}
}
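Review note: the poller passes `health_check_poll_interval_secs` straight to `tokio::time::interval`, which panics when given a zero period. A config value of `0` would therefore bring the task down at startup. A small sketch of clamping the value first; `MIN_POLL_INTERVAL_SECS` and `poll_period` are hypothetical names, and the floor of one second is an arbitrary assumption:

```rust
use std::time::Duration;

/// Lowest interval the poller will accept. Hypothetical constant;
/// the value is an assumption, not something defined by the commit.
pub const MIN_POLL_INTERVAL_SECS: u64 = 1;

/// Convert the configured interval into a non-zero `Duration`,
/// raising anything below the floor up to `MIN_POLL_INTERVAL_SECS`.
pub fn poll_period(configured_secs: u64) -> Duration {
    Duration::from_secs(configured_secs.max(MIN_POLL_INTERVAL_SECS))
}
```

In the poller this would replace the direct `Duration::from_secs(interval_secs)` call, for example `time::interval(poll_period(interval_secs))`. Rejecting a zero value during config validation would be an equally reasonable choice.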
// ─────────────────────────────────────────────────────────────────────────────
// Check dispatch
// ─────────────────────────────────────────────────────────────────────────────
/// Run a single health check and persist the result. Returns `true` if healthy.
async fn run_check(
pool: PgPool,
check: HealthCheckRow,
client_cert: &[u8],
client_key: &[u8],
ca_cert: &[u8],
crypto_key: &[u8; 32],
) -> bool {
let start = Instant::now();
let (healthy, detail) = match check.check_type.as_str() {
"service" => run_service_check(&check, client_cert, client_key, ca_cert).await,
"http" => run_http_check(&check, crypto_key).await,
other => {
tracing::warn!(
check_id = %check.id,
check_type = other,
"Unknown health check type — treating as unhealthy"
);
(false, format!("Unknown check type: {other}"))
},
};
let latency_ms = start.elapsed().as_millis() as i32;
// Persist the result.
if let Err(e) = sqlx::query(
r#"
INSERT INTO host_health_check_results (check_id, healthy, detail, latency_ms)
VALUES ($1, $2, $3, $4)
"#,
)
.bind(check.id)
.bind(healthy)
.bind(&detail)
.bind(latency_ms)
.execute(&pool)
.await
{
tracing::error!(
check_id = %check.id,
error = %e,
"Health check poller: failed to insert result"
);
}
healthy
}
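Review note: `start.elapsed().as_millis() as i32` converts a `u128` with an `as` cast, which silently truncates out-of-range values. With the 10 second HTTP timeout this is unlikely to matter in practice, but a saturating conversion makes the intent explicit. `latency_ms_saturating` is a hypothetical helper, not part of this commit:

```rust
use std::time::Duration;

/// Convert an elapsed duration to whole milliseconds as `i32`,
/// saturating at `i32::MAX` instead of wrapping on overflow.
/// Hypothetical helper (not in the commit).
pub fn latency_ms_saturating(elapsed: Duration) -> i32 {
    i32::try_from(elapsed.as_millis()).unwrap_or(i32::MAX)
}
```

The call site would become `let latency_ms = latency_ms_saturating(start.elapsed());`, leaving the `INSERT` binding unchanged.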
// ─────────────────────────────────────────────────────────────────────────────
// Service check (via mTLS AgentClient)
// ─────────────────────────────────────────────────────────────────────────────
/// Execute a service check by calling the agent's `/api/v1/system/services/{name}` endpoint.
async fn run_service_check(
check: &HealthCheckRow,
client_cert: &[u8],
client_key: &[u8],
ca_cert: &[u8],
) -> (bool, String) {
let service_name = match &check.service_name {
Some(name) => name.clone(),
None => {
return (false, "Service check missing service_name".to_string());
},
};
let client = match AgentClient::new(
&check.ip_address,
check.agent_port as u16,
client_cert,
client_key,
ca_cert,
) {
Ok(c) => c,
Err(e) => {
return (false, format!("Failed to build AgentClient: {e}"));
},
};
match client.service_status(&service_name).await {
Ok(data) => {
let detail = if data.healthy {
format!(
"Service '{}' is {} (uptime: {}s)",
data.name,
data.status,
data.uptime_secs.map_or("N/A".to_string(), |s| s.to_string())
)
} else {
format!(
"Service '{}' status: {} (unhealthy)",
data.name, data.status
)
};
(data.healthy, detail)
},
Err(AgentClientError::Timeout) => {
(false, format!("Agent timed out querying service '{service_name}'"))
},
Err(AgentClientError::Connect(_)) => {
(false, format!("Agent connection refused for service '{service_name}'"))
},
Err(AgentClientError::ApiError { code, message }) => {
// 404, 400, 500 etc. from the agent means the service is unhealthy.
(false, format!("Agent error [{code}]: {message}"))
},
Err(e) => {
(false, format!("Agent error querying service '{service_name}': {e}"))
},
}
}
// ─────────────────────────────────────────────────────────────────────────────
// HTTP check (via reqwest, no mTLS)
// ─────────────────────────────────────────────────────────────────────────────
/// Execute an HTTP check by making a GET request to the configured URL.
/// Supports optional basic auth (decrypted from DB) and substring body matching.
async fn run_http_check(
check: &HealthCheckRow,
crypto_key: &[u8; 32],
) -> (bool, String) {
let url = match &check.url {
Some(u) => u.clone(),
None => {
return (false, "HTTP check missing URL".to_string());
},
};
// Build a reqwest client for this check.
// Use danger_accept_invalid_certs if ignore_cert_errors is set (default true).
let ignore_cert_errors = check.ignore_cert_errors.unwrap_or(true);
let client_builder = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(10))
.redirect(reqwest::redirect::Policy::limited(5));
let client = if ignore_cert_errors {
client_builder
.danger_accept_invalid_certs(true)
.build()
.unwrap_or_else(|_| reqwest::Client::new())
} else {
client_builder.build().unwrap_or_else(|_| reqwest::Client::new())
};
// Build the request.
let mut request = client.get(&url);
// Add basic auth if configured.
if let Some(user) = &check.basic_auth_user {
// Decrypt the password if present.
let password = match (&check.basic_auth_pass_encrypted, &check.basic_auth_pass_nonce) {
(Some(enc), Some(nonce)) => {
match crypto::decrypt(enc, nonce, crypto_key) {
Ok(p) => p,
Err(e) => {
return (
false,
format!("Failed to decrypt basic auth password: {e}"),
);
},
}
},
_ => {
// No encrypted password stored — treat as missing credentials.
return (false, "HTTP check has basic_auth_user but no encrypted password".to_string());
},
};
request = request.basic_auth(user.as_str(), Some(password.as_str()));
}
// Execute the request.
let response = match request.send().await {
Ok(r) => r,
Err(e) => {
if e.is_timeout() {
return (false, format!("HTTP check timed out: {url}"));
} else if e.is_connect() {
return (false, format!("HTTP check connection failed: {url}"));
} else {
return (false, format!("HTTP check request error: {e}"));
}
},
};
let status = response.status();
// Check HTTP status code.
if !status.is_success() {
return (
false,
format!("HTTP check returned status {} for {url}", status.as_u16()),
);
}
// Read the response body for substring matching.
let body = match response.text().await {
Ok(b) => b,
Err(e) => {
return (false, format!("HTTP check failed to read response body: {e}"));
},
};
// Check expected_body substring match.
if let Some(expected) = &check.expected_body {
if !body.contains(expected) {
return (
false,
format!(
"HTTP check body mismatch for {url}: expected substring not found"
),
);
}
}
(true, format!("HTTP check OK for {url} (status {})", status.as_u16()))
}
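
The pass/fail rules applied to the HTTP response above can be isolated into a pure function, which makes them easy to unit-test without a network. The following is only a sketch under that assumption: `evaluate_http_response` and `CheckOutcome` are hypothetical names that do not appear in this commit, and the status code and body are passed in as plain values instead of a `reqwest` response.

```rust
/// Outcome of a single check: (healthy, human-readable detail).
/// Hypothetical alias, introduced only for this sketch.
type CheckOutcome = (bool, String);

/// Hypothetical helper: decide the outcome from an already-received response.
/// Mirrors the order used above: status code first, then the body substring.
fn evaluate_http_response(
    url: &str,
    status: u16,
    body: &str,
    expected_body: Option<&str>,
) -> CheckOutcome {
    // Any status outside 200..=299 is unhealthy.
    if !(200..300).contains(&status) {
        return (false, format!("HTTP check returned status {status} for {url}"));
    }
    // When a substring is configured, it must occur somewhere in the body.
    if let Some(expected) = expected_body {
        if !body.contains(expected) {
            return (
                false,
                format!("HTTP check body mismatch for {url}: expected substring not found"),
            );
        }
    }
    (true, format!("HTTP check OK for {url} (status {status})"))
}
```

Calling `evaluate_http_response("https://example.com/health", 200, "status: UP", Some("UP"))` yields a healthy outcome, while a 503 status or a body without the expected substring yields an unhealthy one.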
// ─────────────────────────────────────────────────────────────────────────────
// Prune old results
// ─────────────────────────────────────────────────────────────────────────────
/// Delete health check results older than 4 days.
async fn prune_old_results(pool: &PgPool) {
match sqlx::query(
"DELETE FROM host_health_check_results WHERE checked_at < NOW() - INTERVAL '4 days'",
)
.execute(pool)
.await
{
Ok(result) => {
if result.rows_affected() > 0 {
tracing::info!(
rows_deleted = result.rows_affected(),
"Health check poller: pruned old results"
);
}
},
Err(e) => {
tracing::error!(error = %e, "Health check poller: failed to prune old results");
},
}
}
// ─────────────────────────────────────────────────────────────────────────────
// Health check gate for job executor
// ─────────────────────────────────────────────────────────────────────────────
/// Check whether all enabled health checks for a host are healthy.
///
/// Returns `Ok(true)` if all checks pass (or no checks are configured),
/// `Ok(false)` if any check is unhealthy or has no result yet.
pub async fn check_host_health_checks(pool: &PgPool, host_id: Uuid) -> anyhow::Result<bool> {
// Check if there are any enabled health checks for this host.
let check_count: (i64,) = sqlx::query_as(
"SELECT COUNT(*) FROM host_health_checks WHERE host_id = $1 AND enabled = TRUE",
)
.bind(host_id)
.fetch_one(pool)
.await?;
if check_count.0 == 0 {
// No health checks configured for this host — treat as healthy.
return Ok(true);
}
// Count enabled checks that have no result yet or whose latest result is unhealthy.
let unhealthy_count: (i64,) = sqlx::query_as(
r#"
SELECT COUNT(*)
FROM host_health_checks hc
LEFT JOIN LATERAL (
SELECT healthy
FROM host_health_check_results r
WHERE r.check_id = hc.id
ORDER BY r.checked_at DESC
LIMIT 1
) latest ON true
WHERE hc.host_id = $1
AND hc.enabled = TRUE
AND (latest.healthy IS NULL OR latest.healthy = FALSE)
"#,
)
.bind(host_id)
.fetch_one(pool)
.await?;
Ok(unhealthy_count.0 == 0)
}
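
The gate semantics of `check_host_health_checks` can be restated without a database. The sketch below is an in-memory model, not the implementation: `CheckState` and `gate_passes` are hypothetical names, and `latest_healthy` stands in for the row returned by the `LEFT JOIN LATERAL` subquery (`None` when no result has been recorded).

```rust
/// Hypothetical in-memory view of one configured health check.
struct CheckState {
    enabled: bool,
    /// `None` means the poller has not recorded a result for this check yet.
    latest_healthy: Option<bool>,
}

/// Returns true when every enabled check has a healthy latest result.
/// With no enabled checks, `all` over an empty iterator is true, which
/// matches the "no checks configured means healthy" rule.
fn gate_passes(checks: &[CheckState]) -> bool {
    checks
        .iter()
        .filter(|c| c.enabled)
        .all(|c| c.latest_healthy == Some(true))
}
```

Under this model a disabled check never blocks a job, and an enabled check with no result yet blocks it just like an unhealthy one.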


@ -24,6 +24,7 @@ use uuid::Uuid;
use crate::agent_loader::load_agent_certs;
use crate::email;
use crate::health_check_poller::check_host_health_checks;
// ─────────────────────────────────────────────────────────────────────────────
// Internal DB row types
@ -78,6 +79,7 @@ struct StatusCounts {
succeeded_count: i64,
failed_count: i64,
cancelled_count: i64,
waiting_health_check_count: i64,
total_count: i64,
}
@ -369,6 +371,89 @@ async fn execute_host_job(
},
};
// ── 1b. Health check gate ──────────────────────────────────────────────
// All enabled health checks for this host must be healthy before we proceed.
match check_host_health_checks(&pool, host_id).await {
Ok(true) => {
tracing::debug!(%host_id, "execute_host_job: health checks passed");
},
Ok(false) => {
tracing::info!(%host_id, %pjh_id, "execute_host_job: health checks not passed, setting waiting_health_check");
// Check if the maintenance window is still open for this host.
let window_open: bool = sqlx::query_scalar(
r#"
SELECT EXISTS(
SELECT 1 FROM maintenance_windows mw
WHERE mw.host_id = $1
AND mw.enabled = TRUE
AND (
(mw.recurrence = 'once'
AND mw.start_at <= NOW()
AND NOW() < mw.start_at + (mw.duration_minutes * INTERVAL '1 minute'))
OR
(mw.recurrence = 'daily'
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
+ (mw.duration_minutes * INTERVAL '1 minute')))
OR
(mw.recurrence = 'weekly'
AND EXTRACT(DOW FROM NOW() AT TIME ZONE 'UTC') = mw.recurrence_day
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
+ (mw.duration_minutes * INTERVAL '1 minute')))
OR
(mw.recurrence = 'monthly'
AND EXTRACT(DAY FROM NOW() AT TIME ZONE 'UTC') = mw.recurrence_day
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
+ (mw.duration_minutes * INTERVAL '1 minute')))
)
)
"#,
)
.bind(host_id)
.fetch_optional(&pool)
.await
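// SELECT EXISTS always yields one row, so the result is false when no window
// matches; the `true` fallbacks only apply if the query itself fails.
.unwrap_or(Some(true))
.unwrap_or(true);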
if !window_open {
tracing::warn!(%host_id, %pjh_id, "execute_host_job: health checks not passed and maintenance window closed");
handle_host_failure(
pool,
pjh_id,
"Health checks did not pass before maintenance window closed".to_string(),
)
.await;
return;
}
// Set status to waiting_health_check and retry in 5 minutes.
let retry_at = Utc::now() + ChronoDuration::minutes(5);
if let Err(e) = sqlx::query(
r#"
UPDATE patch_job_hosts
SET status = 'waiting_health_check',
retry_next_at = $2,
last_error = 'Waiting for health checks to pass'
WHERE id = $1
"#,
)
.bind(pjh_id)
.bind(retry_at)
.execute(&pool)
.await
{
tracing::error!(%pjh_id, error = %e, "execute_host_job: failed to set waiting_health_check status");
}
return;
},
Err(e) => {
tracing::warn!(%host_id, error = %e, "execute_host_job: health check query failed, proceeding anyway");
// If we can't query health checks, proceed with the job rather than blocking.
},
}
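
The gate above has four outcomes that are easy to lose in the nesting. As a rough sketch, they can be written as one decision function. `GateAction` and `decide_gate` are hypothetical names used only for illustration; the 5-minute retry and the failure message are taken from the code above.

```rust
/// What the executor does next for one host (hypothetical enum for this sketch).
#[derive(Debug, PartialEq)]
enum GateAction {
    /// Continue with the patch job.
    Proceed,
    /// Park the host as `waiting_health_check` and retry later.
    RetryIn { minutes: i64 },
    /// Mark the host job as failed with the given reason.
    Fail(String),
}

/// `health` models the result of the health check query;
/// `window_open` models the maintenance window lookup.
fn decide_gate(health: Result<bool, String>, window_open: bool) -> GateAction {
    match (health, window_open) {
        // All enabled checks are healthy.
        (Ok(true), _) => GateAction::Proceed,
        // The query failed: proceed instead of blocking the job.
        (Err(_), _) => GateAction::Proceed,
        // Unhealthy and the window has closed: give up.
        (Ok(false), false) => GateAction::Fail(
            "Health checks did not pass before maintenance window closed".to_string(),
        ),
        // Unhealthy but the window is still open: retry in 5 minutes.
        (Ok(false), true) => GateAction::RetryIn { minutes: 5 },
    }
}
```

Writing the decision this way also makes the fail-open choice explicit: a query error lets the job proceed.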
// ── 2. Fetch the job's patch_selection ──────────────────────────────────
let patch_sel: JobPatchSelection =
match sqlx::query_as("SELECT patch_selection FROM patch_jobs WHERE id = $1")
@ -764,6 +849,7 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
COUNT(*) FILTER (WHERE status = 'succeeded') AS succeeded_count,
COUNT(*) FILTER (WHERE status = 'failed') AS failed_count,
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_count,
COUNT(*) FILTER (WHERE status = 'waiting_health_check') AS waiting_health_check_count,
COUNT(*) AS total_count
FROM patch_job_hosts
WHERE job_id = $1
@ -784,7 +870,7 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
let new_status: &str;
let set_completed: bool;
-if counts.running_count > 0 || counts.pending_count > 0 || counts.queued_count > 0 {
+if counts.running_count > 0 || counts.pending_count > 0 || counts.queued_count > 0 || counts.waiting_health_check_count > 0 {
// Still work in flight — keep parent running.
new_status = "running";
set_completed = false;
@ -912,17 +998,18 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
/// Find pending host entries whose back-off window has elapsed, reset them to
/// `queued`, and dispatch them immediately.
///
/// Also retries `waiting_health_check` entries whose retry window has elapsed.
pub async fn retry_pending_jobs(pool: PgPool, config: Arc<AppConfig>) {
let rows: Vec<PatchJobHostPending> = match sqlx::query_as(
r#"
SELECT pjh.id, pjh.host_id, pjh.job_id
FROM patch_job_hosts pjh
JOIN patch_jobs j ON j.id = pjh.job_id
-WHERE pjh.status = 'pending'
+WHERE pjh.status IN ('pending', 'waiting_health_check')
AND pjh.retry_next_at <= NOW()
AND j.status != 'cancelled'
-"#,
-)
+"#,)
.fetch_all(&pool)
.await
{


@ -6,6 +6,7 @@
mod agent_loader;
mod audit_verifier;
mod email;
mod health_check_poller;
mod health_poller;
mod job_executor;
mod maintenance_scheduler;
@ -19,6 +20,7 @@ use std::{sync::Arc, time::Duration};
use tokio::time;
use audit_verifier::run_audit_verifier;
use health_check_poller::run_health_check_poller;
use health_poller::run_health_poller;
use job_executor::run_job_executor;
use maintenance_scheduler::run_maintenance_scheduler;
@ -29,7 +31,7 @@ use ws_relay::run_ws_relay;
/// Minimum number of applied migrations the worker requires before
/// accepting work. Prevents the worker from running against a schema
/// that hasn't been migrated yet.
-const REQUIRED_MIGRATION_COUNT: i64 = 5;
+const REQUIRED_MIGRATION_COUNT: i64 = 8;
/// How long to wait between schema-version checks before giving up.
const SCHEMA_CHECK_TIMEOUT: Duration = Duration::from_secs(120);
@ -89,6 +91,9 @@ async fn main() -> anyhow::Result<()> {
// M11: audit integrity verification (runs every 24 hours)
let audit_verifier_handle = tokio::spawn(run_audit_verifier(pool.clone(), config.clone()));
// Health check poller — runs configured service/HTTP health checks
let health_check_handle = tokio::spawn(run_health_check_poller(pool.clone(), config.clone()));
tracing::info!("Worker tasks started");
// Wait for all tasks (they run indefinitely)


@ -7,6 +7,9 @@ import type {
UpdateMaintenanceWindowRequest,
Certificate,
IssuedCert,
HealthCheckWithResult,
CreateHealthCheckRequest,
UpdateHealthCheckRequest,
} from '../types'
const BASE_URL = '/api/v1'
@ -259,3 +262,25 @@ export const settingsApi = {
updateIpWhitelist: (entries: string[]) => apiClient.put<{ entries: string[] }>('/settings/ip-whitelist', { entries }),
auditIntegrity: () => apiClient.post<AuditIntegrityResult>('/settings/audit-integrity'),
}
// ── Health Checks API ─────────────────────────────────────────────────────────
export const healthChecksApi = {
list: (hostId: string) =>
apiClient.get<HealthCheckWithResult[]>(`/hosts/${hostId}/health-checks`),
get: (hostId: string, checkId: string) =>
apiClient.get<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}`),
create: (hostId: string, body: CreateHealthCheckRequest) =>
apiClient.post<HealthCheckWithResult>(`/hosts/${hostId}/health-checks`, body),
update: (hostId: string, checkId: string, body: UpdateHealthCheckRequest) =>
apiClient.put<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}`, body),
delete: (hostId: string, checkId: string) =>
apiClient.delete(`/hosts/${hostId}/health-checks/${checkId}`),
test: (hostId: string, checkId: string) =>
apiClient.post<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}/test`),
}


@ -34,13 +34,25 @@ import {
import {
Add as AddIcon,
ArrowBack,
Cancel as CancelIcon,
CheckCircle as CheckCircleIcon,
Delete as DeleteIcon,
Edit as EditIcon,
MonitorHeart as MonitorHeartIcon,
PlayArrow as PlayArrowIcon,
Remove as RemoveIcon,
Schedule as ScheduleIcon,
VpnKey as VpnKeyIcon,
} from '@mui/icons-material'
-import { apiClient, maintenanceWindowsApi, certsApi } from '../api/client'
-import type { MaintenanceWindow, WindowRecurrence } from '../types'
+import { apiClient, maintenanceWindowsApi, healthChecksApi, certsApi } from '../api/client'
+import type {
+MaintenanceWindow,
+WindowRecurrence,
+HealthCheckType,
+HealthCheckWithResult,
+CreateHealthCheckRequest,
+UpdateHealthCheckRequest,
+} from '../types'
// ── Helpers ───────────────────────────────────────────────────────────────────
@ -74,7 +86,7 @@ function scheduleDescription(w: MaintenanceWindow): string {
}
}
-// ── Form value type ───────────────────────────────────────────────────────────
+// ── Window form value type ────────────────────────────────────────────────────
interface FormValues {
label: string
@ -185,6 +197,114 @@ function WindowFormDialog({ open, title, initial, onClose, onSubmit }: WindowFor
)
}
// ── Health Check form value type ─────────────────────────────────────────────
interface HealthCheckFormValues {
name: string
check_type: HealthCheckType
service_name: string
url: string
expected_body: string
ignore_cert_errors: boolean
basic_auth_user: string
basic_auth_pass: string
enabled: boolean
}
function defaultHealthCheckForm(): HealthCheckFormValues {
return {
name: '',
check_type: 'service',
service_name: '',
url: '',
expected_body: '',
ignore_cert_errors: false,
basic_auth_user: '',
basic_auth_pass: '',
enabled: true,
}
}
// ── Health Check form dialog ──────────────────────────────────────────────────
interface HealthCheckFormDialogProps {
open: boolean
title: string
initial: HealthCheckFormValues
onClose: () => void
onSubmit: (values: HealthCheckFormValues) => Promise<void>
}
function HealthCheckFormDialog({ open, title, initial, onClose, onSubmit }: HealthCheckFormDialogProps) {
const [form, setForm] = useState<HealthCheckFormValues>(initial)
const [saving, setSaving] = useState(false)
const [err, setErr] = useState<string | null>(null)
useEffect(() => { setForm(initial); setErr(null) }, [open, initial])
const set = (field: keyof HealthCheckFormValues, value: HealthCheckFormValues[keyof HealthCheckFormValues]) =>
setForm(prev => ({ ...prev, [field]: value }))
const handleSubmit = async () => {
if (!form.name.trim()) { setErr('Name is required'); return }
if (form.check_type === 'service' && !form.service_name.trim()) { setErr('Service name is required'); return }
if (form.check_type === 'http' && !form.url.trim()) { setErr('URL is required'); return }
setSaving(true); setErr(null)
try { await onSubmit(form) }
catch (e: unknown) {
const msg = (e as { response?: { data?: { error?: { message?: string } } } })
?.response?.data?.error?.message ?? 'Failed to save'
setErr(msg)
} finally { setSaving(false) }
}
return (
<Dialog open={open} onClose={onClose} maxWidth="sm" fullWidth>
<DialogTitle>{title}</DialogTitle>
<DialogContent sx={{ display: 'flex', flexDirection: 'column', gap: 2, pt: 2 }}>
{err && <Alert severity="error">{err}</Alert>}
<TextField label="Name" value={form.name} onChange={e => set('name', e.target.value)} required fullWidth />
<FormControl fullWidth>
<InputLabel>Check Type</InputLabel>
<Select label="Check Type" value={form.check_type} onChange={e => set('check_type', e.target.value as HealthCheckType)}>
<MenuItem value="service">Service</MenuItem>
<MenuItem value="http">HTTP</MenuItem>
</Select>
</FormControl>
{form.check_type === 'service' && (
<TextField label="Service Name" value={form.service_name} onChange={e => set('service_name', e.target.value)} required fullWidth
helperText="Systemd service unit name to check" />
)}
{form.check_type === 'http' && (
<>
<TextField label="URL" value={form.url} onChange={e => set('url', e.target.value)} required fullWidth
helperText="Full URL to check (e.g. https://example.com/health)" />
<TextField label="Expected Body (optional)" value={form.expected_body} onChange={e => set('expected_body', e.target.value)} fullWidth
helperText="Substring expected in response body" />
<FormControlLabel
control={<Switch checked={form.ignore_cert_errors} onChange={e => set('ignore_cert_errors', e.target.checked)} />}
label="Ignore Certificate Errors"
/>
<TextField label="Basic Auth User (optional)" value={form.basic_auth_user} onChange={e => set('basic_auth_user', e.target.value)} fullWidth />
<TextField label="Basic Auth Password (optional)" type="password" value={form.basic_auth_pass} onChange={e => set('basic_auth_pass', e.target.value)} fullWidth
helperText="Leave blank to keep existing password" />
</>
)}
<FormControlLabel
control={<Switch checked={form.enabled} onChange={e => set('enabled', e.target.checked)} />}
label="Enabled"
/>
</DialogContent>
<DialogActions>
<Button onClick={onClose} disabled={saving}>Cancel</Button>
<Button variant="contained" onClick={handleSubmit} disabled={saving}>
{saving ? <CircularProgress size={20} /> : 'Save'}
</Button>
</DialogActions>
</Dialog>
)
}
// ── Main page ──────────────────────────────────────────────────────────────────
export default function HostDetailPage() {
@ -201,19 +321,37 @@ export default function HostDetailPage() {
open: false, message: '', severity: 'success',
})
-// Create dialog
+// Create window dialog
const [createOpen, setCreateOpen] = useState(false)
const [createForm, setCreateForm] = useState<FormValues>(defaultForm())
-// Edit dialog
+// Edit window dialog
const [editOpen, setEditOpen] = useState(false)
const [editWindow, setEditWindow] = useState<MaintenanceWindow | null>(null)
const [editForm, setEditForm] = useState<FormValues>(defaultForm())
-// Delete dialog
+// Delete window dialog
const [deleteOpen, setDeleteOpen] = useState(false)
const [deleteTarget, setDeleteTarget] = useState<MaintenanceWindow | null>(null)
// Health checks state
const [healthChecks, setHealthChecks] = useState<HealthCheckWithResult[]>([])
const [hcLoading, setHcLoading] = useState(false)
const [testingId, setTestingId] = useState<string | null>(null)
// Create health check dialog
const [hcCreateOpen, setHcCreateOpen] = useState(false)
const [hcCreateForm, setHcCreateForm] = useState<HealthCheckFormValues>(defaultHealthCheckForm())
// Edit health check dialog
const [hcEditOpen, setHcEditOpen] = useState(false)
const [hcEditTarget, setHcEditTarget] = useState<HealthCheckWithResult | null>(null)
const [hcEditForm, setHcEditForm] = useState<HealthCheckFormValues>(defaultHealthCheckForm())
// Delete health check dialog
const [hcDeleteOpen, setHcDeleteOpen] = useState(false)
const [hcDeleteTarget, setHcDeleteTarget] = useState<HealthCheckWithResult | null>(null)
// ── Fetch host ────────────────────────────────────────────────────────────
useEffect(() => {
apiClient.get(`/hosts/${id}`)
@ -235,6 +373,19 @@ export default function HostDetailPage() {
useEffect(() => { fetchWindows() }, [fetchWindows])
// ── Fetch health checks ───────────────────────────────────────────────────
const fetchHealthChecks = useCallback(async () => {
if (!id) return
setHcLoading(true)
try {
const res = await healthChecksApi.list(id)
setHealthChecks(Array.isArray(res.data) ? res.data : [])
} catch { /* ignore */ }
finally { setHcLoading(false) }
}, [id])
useEffect(() => { fetchHealthChecks() }, [fetchHealthChecks])
const showSnack = (message: string, severity: 'success' | 'error') =>
setSnackbar({ open: true, message, severity })
@ -312,6 +463,105 @@ export default function HostDetailPage() {
}
}
// ── Create health check ──────────────────────────────────────────────────
const handleHcCreateSubmit = async (values: HealthCheckFormValues) => {
if (!id) return
const body: CreateHealthCheckRequest = {
name: values.name,
check_type: values.check_type,
}
if (values.check_type === 'service') {
body.service_name = values.service_name || undefined
} else {
body.url = values.url || undefined
body.expected_body = values.expected_body || undefined
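// Send the flag explicitly, as the edit handler does; an omitted value may
// fall back to the server-side default instead of the switch position.
body.ignore_cert_errors = values.ignore_cert_errors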
body.basic_auth_user = values.basic_auth_user || undefined
body.basic_auth_pass = values.basic_auth_pass || undefined
}
await healthChecksApi.create(id, body)
setHcCreateOpen(false)
showSnack('Health check created', 'success')
await fetchHealthChecks()
}
// ── Edit health check ────────────────────────────────────────────────────
const handleHcEditClick = (check: HealthCheckWithResult) => {
setHcEditTarget(check)
setHcEditForm({
name: check.name,
check_type: check.check_type,
service_name: check.service_name ?? '',
url: check.url ?? '',
expected_body: check.expected_body ?? '',
ignore_cert_errors: check.ignore_cert_errors,
basic_auth_user: check.basic_auth_user ?? '',
basic_auth_pass: '',
enabled: check.enabled,
})
setHcEditOpen(true)
}
const handleHcEditSubmit = async (values: HealthCheckFormValues) => {
if (!id || !hcEditTarget) return
const body: UpdateHealthCheckRequest = {
name: values.name,
enabled: values.enabled,
}
if (values.check_type === 'service') {
body.service_name = values.service_name || undefined
} else {
body.url = values.url || undefined
body.expected_body = values.expected_body || undefined
body.ignore_cert_errors = values.ignore_cert_errors
body.basic_auth_user = values.basic_auth_user || undefined
body.basic_auth_pass = values.basic_auth_pass || undefined
}
await healthChecksApi.update(id, hcEditTarget.id, body)
setHcEditOpen(false)
showSnack('Health check updated', 'success')
await fetchHealthChecks()
}
// ── Delete health check ──────────────────────────────────────────────────
const handleHcDeleteConfirm = async () => {
if (!id || !hcDeleteTarget) return
try {
await healthChecksApi.delete(id, hcDeleteTarget.id)
setHcDeleteOpen(false)
showSnack('Health check deleted', 'success')
await fetchHealthChecks()
} catch {
showSnack('Failed to delete health check', 'error')
}
}
// ── Toggle health check enabled ──────────────────────────────────────────
const handleToggleEnabled = async (check: HealthCheckWithResult) => {
if (!id) return
try {
await healthChecksApi.update(id, check.id, { enabled: !check.enabled })
await fetchHealthChecks()
} catch {
showSnack('Failed to toggle health check', 'error')
}
}
// ── Test health check ────────────────────────────────────────────────────
const handleTestCheck = async (check: HealthCheckWithResult) => {
if (!id) return
setTestingId(check.id)
try {
await healthChecksApi.test(id, check.id)
await fetchHealthChecks()
showSnack('Health check test completed', 'success')
} catch {
showSnack('Health check test failed', 'error')
} finally {
setTestingId(null)
}
}
// ── Render ────────────────────────────────────────────────────────────────
if (loading) return <Box display="flex" justifyContent="center" mt={8}><CircularProgress /></Box>
if (error) return <Container sx={{ mt: 4 }}><Alert severity="error">{error}</Alert></Container>
@ -350,7 +600,7 @@ export default function HostDetailPage() {
</Paper>
{/* ── Maintenance Windows ──────────────────────────────────────────── */}
-<Paper sx={{ p: 3 }}>
+<Paper sx={{ p: 3, mb: 3 }}>
<Box sx={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', mb: 2 }}>
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
<ScheduleIcon color="primary" />
@ -427,6 +677,127 @@ export default function HostDetailPage() {
)}
</Paper>
{/* ── Health Checks ────────────────────────────────────────────────── */}
<Paper sx={{ p: 3, mb: 3 }}>
<Box sx={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', mb: 2 }}>
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
<MonitorHeartIcon color="primary" />
<Typography variant="h6" fontWeight={600}>Health Checks</Typography>
</Box>
<Button
startIcon={<AddIcon />}
variant="outlined"
size="small"
disabled={healthChecks.length >= 5}
onClick={() => { setHcCreateForm(defaultHealthCheckForm()); setHcCreateOpen(true) }}
>
Add Health Check
</Button>
</Box>
<Divider sx={{ mb: 2 }} />
<Typography variant="body2" color="text.secondary" sx={{ mb: 2 }}>
Monitor host health with service and HTTP checks. Maximum 5 checks per host.
</Typography>
{hcLoading ? (
<Box display="flex" justifyContent="center" py={3}><CircularProgress size={28} /></Box>
) : healthChecks.length === 0 ? (
<Alert severity="info">
No health checks configured. Add a check to monitor this host&apos;s health.
</Alert>
) : (
<Table size="small">
<TableHead>
<TableRow>
<TableCell>Name</TableCell>
<TableCell>Type</TableCell>
<TableCell>Status</TableCell>
<TableCell>Enabled</TableCell>
<TableCell>Detail</TableCell>
<TableCell>Latency</TableCell>
<TableCell>Last Checked</TableCell>
<TableCell align="right">Actions</TableCell>
</TableRow>
</TableHead>
<TableBody>
{healthChecks.map(check => (
<TableRow key={check.id} hover>
<TableCell>{check.name}</TableCell>
<TableCell>
<Chip label={check.check_type} size="small" variant="outlined" />
</TableCell>
<TableCell>
{check.last_result ? (
check.last_result.healthy ? (
<Tooltip title="Healthy">
<CheckCircleIcon color="success" fontSize="small" />
</Tooltip>
) : (
<Tooltip title="Unhealthy">
<CancelIcon color="error" fontSize="small" />
</Tooltip>
)
) : (
<Tooltip title="No result yet">
<RemoveIcon color="disabled" fontSize="small" />
</Tooltip>
)}
</TableCell>
<TableCell>
<Switch
size="small"
checked={check.enabled}
onChange={() => handleToggleEnabled(check)}
/>
</TableCell>
<TableCell>
<Typography variant="body2" sx={{ maxWidth: 200, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
{check.last_result?.detail ?? '—'}
</Typography>
</TableCell>
<TableCell>
{check.last_result?.latency_ms != null ? `${check.last_result.latency_ms} ms` : '—'}
</TableCell>
<TableCell>
{check.last_result?.checked_at
? new Date(check.last_result.checked_at).toLocaleString()
: '—'}
</TableCell>
<TableCell align="right">
<Tooltip title="Test now">
<IconButton
size="small"
color="primary"
disabled={testingId === check.id}
onClick={() => handleTestCheck(check)}
>
{testingId === check.id
? <CircularProgress size={16} />
: <PlayArrowIcon fontSize="small" />}
</IconButton>
</Tooltip>
<Tooltip title="Edit">
<IconButton size="small" onClick={() => handleHcEditClick(check)}>
<EditIcon fontSize="small" />
</IconButton>
</Tooltip>
<Tooltip title="Delete">
<IconButton
size="small" color="error"
onClick={() => { setHcDeleteTarget(check); setHcDeleteOpen(true) }}
>
<DeleteIcon fontSize="small" />
</IconButton>
</Tooltip>
</TableCell>
</TableRow>
))}
</TableBody>
</Table>
)}
</Paper>
{/* ── Dialogs ─────────────────────────────────────────────────────── */}
<WindowFormDialog
open={createOpen}
@ -455,6 +826,34 @@ export default function HostDetailPage() {
</DialogActions>
</Dialog>
{/* Health Check Dialogs */}
<HealthCheckFormDialog
open={hcCreateOpen}
title="Add Health Check"
initial={hcCreateForm}
onClose={() => setHcCreateOpen(false)}
onSubmit={handleHcCreateSubmit}
/>
<HealthCheckFormDialog
open={hcEditOpen}
title="Edit Health Check"
initial={hcEditForm}
onClose={() => setHcEditOpen(false)}
onSubmit={handleHcEditSubmit}
/>
<Dialog open={hcDeleteOpen} onClose={() => setHcDeleteOpen(false)} maxWidth="xs" fullWidth>
<DialogTitle>Delete Health Check</DialogTitle>
<DialogContent>
<Typography>
Delete <strong>{hcDeleteTarget?.name}</strong>? This cannot be undone.
</Typography>
</DialogContent>
<DialogActions>
<Button onClick={() => setHcDeleteOpen(false)}>Cancel</Button>
<Button color="error" variant="contained" onClick={handleHcDeleteConfirm}>Delete</Button>
</DialogActions>
</Dialog>
{/* Snackbar */}
<Snackbar
open={snackbar.open}


@ -5,6 +5,7 @@ import {
TableRow, TextField, Toolbar, Tooltip, Typography,
} from '@mui/material'
import { Add as AddIcon, Refresh as RefreshIcon, Delete as DeleteIcon } from '@mui/icons-material'
import { CheckCircle as CheckCircleIcon, Cancel as CancelIcon, Remove as RemoveIcon } from '@mui/icons-material'
import { useNavigate } from 'react-router-dom'
import { apiClient, hostsApi } from '../api/client'
import type { Host, HostHealthStatus } from '../types'
@ -67,6 +68,7 @@ export default function HostsPage() {
<TableCell>IP Address</TableCell>
<TableCell>OS</TableCell>
<TableCell>Health</TableCell>
<TableCell>Checks</TableCell>
<TableCell>Agent</TableCell>
<TableCell>Actions</TableCell>
</TableRow>
@ -82,6 +84,15 @@ export default function HostsPage() {
<TableCell>
<Chip size="small" label={h.health_status} color={statusColor(h.health_status)} />
</TableCell>
<TableCell>
{h.health_check_status === 'all_healthy' ? (
<Tooltip title="All checks healthy"><CheckCircleIcon color="success" fontSize="small" /></Tooltip>
) : h.health_check_status === 'some_unhealthy' ? (
<Tooltip title="Some checks unhealthy"><CancelIcon color="error" fontSize="small" /></Tooltip>
) : (
<Tooltip title="No checks configured"><RemoveIcon color="disabled" fontSize="small" /></Tooltip>
)}
</TableCell>
<TableCell>{h.agent_version ?? '—'}</TableCell>
<TableCell onClick={e => e.stopPropagation()}>
<Tooltip title="Request refresh">


@ -22,8 +22,10 @@ import {
TextField,
Toolbar,
Typography,
Tooltip,
} from '@mui/material'
import { Search as SearchIcon } from '@mui/icons-material'
import { CheckCircle as CheckCircleIcon, Cancel as CancelIcon, Remove as RemoveIcon } from '@mui/icons-material'
import { useNavigate } from 'react-router-dom'
import { hostsApi, jobsApi } from '../api/client'
import type { Host, HostHealthStatus } from '../types'
@ -256,6 +258,7 @@ export default function PatchDeploymentPage() {
<TableCell>FQDN</TableCell>
<TableCell>IP Address</TableCell>
<TableCell>Health</TableCell>
<TableCell>Checks</TableCell>
<TableCell>Patches</TableCell>
<TableCell>OS</TableCell>
</TableRow>
@ -263,7 +266,7 @@ export default function PatchDeploymentPage() {
<TableBody>
{filteredHosts.length === 0 ? (
<TableRow>
-<TableCell colSpan={7} align="center">
<TableCell colSpan={8} align="center">
<Typography variant="body2" color="text.secondary" py={2}>
No hosts found
</Typography>
@ -291,6 +294,15 @@ export default function PatchDeploymentPage() {
<TableCell>
<HealthChip status={host.health_status} />
</TableCell>
<TableCell>
{host.health_check_status === 'all_healthy' ? (
<Tooltip title="All checks healthy"><CheckCircleIcon color="success" fontSize="small" /></Tooltip>
) : host.health_check_status === 'some_unhealthy' ? (
<Tooltip title="Some checks unhealthy"><CancelIcon color="error" fontSize="small" /></Tooltip>
) : (
<Tooltip title="No checks configured"><RemoveIcon color="disabled" fontSize="small" /></Tooltip>
)}
</TableCell>
<TableCell>
<Chip
label={host.patches_missing}


@ -26,6 +26,7 @@ export interface Host {
agent_version?: string
patches_missing: number
registered_at: string
health_check_status?: 'all_healthy' | 'some_unhealthy' | 'none'
}
export interface Group {
@ -253,3 +254,57 @@ export interface AuditIntegrityResult {
}
export type ReportFormat = 'csv' | 'pdf'
// ── Health Checks ────────────────────────────────────────────────────────────
export type HealthCheckType = 'service' | 'http'
export interface HealthCheck {
id: string
host_id: string
name: string
check_type: HealthCheckType
enabled: boolean
service_name?: string
url?: string
expected_body?: string
ignore_cert_errors: boolean
basic_auth_user?: string
created_at: string
updated_at: string
}
export interface HealthCheckResult {
id: string
check_id: string
healthy: boolean
detail?: string
latency_ms?: number
checked_at: string
}
export interface HealthCheckWithResult extends HealthCheck {
last_result?: HealthCheckResult
}
export interface CreateHealthCheckRequest {
name: string
check_type: HealthCheckType
service_name?: string
url?: string
expected_body?: string
ignore_cert_errors?: boolean
basic_auth_user?: string
basic_auth_pass?: string
}
export interface UpdateHealthCheckRequest {
name?: string
enabled?: boolean
service_name?: string
url?: string
expected_body?: string
ignore_cert_errors?: boolean
basic_auth_user?: string
basic_auth_pass?: string
}


@ -0,0 +1,42 @@
-- Migration 007: Health check configuration and results
-- Health checks configured per host (1-5 per host)
CREATE TABLE host_health_checks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
host_id UUID NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
name VARCHAR(100) NOT NULL,
check_type VARCHAR(20) NOT NULL CHECK (check_type IN ('service', 'http')),
enabled BOOLEAN NOT NULL DEFAULT true,
-- Service check fields (Type 1)
service_name VARCHAR(200),
-- HTTP check fields (Type 2)
url TEXT,
expected_body VARCHAR(500),
ignore_cert_errors BOOLEAN NOT NULL DEFAULT true,
basic_auth_user VARCHAR(100),
basic_auth_pass_encrypted BYTEA, -- AES-256-GCM encrypted
basic_auth_pass_nonce BYTEA, -- nonce for AES-GCM
-- Metadata
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
-- Constraint: service checks must have service_name, http checks must have url + expected_body
CONSTRAINT valid_service_check CHECK (
(check_type = 'service' AND service_name IS NOT NULL AND url IS NULL)
OR
(check_type = 'http' AND url IS NOT NULL AND expected_body IS NOT NULL AND service_name IS NULL)
)
);
CREATE INDEX idx_health_checks_host ON host_health_checks (host_id);
-- Health check poll results (4-day retention, pruned by worker)
CREATE TABLE host_health_check_results (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
check_id UUID NOT NULL REFERENCES host_health_checks(id) ON DELETE CASCADE,
healthy BOOLEAN NOT NULL,
detail TEXT,
latency_ms INTEGER,
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_health_results_check ON host_health_check_results (check_id, checked_at DESC);


@ -0,0 +1,4 @@
-- Migration 008: Health check worker support
-- Adds 'waiting_health_check' to the job_status enum for pre-patch health gates.
ALTER TYPE job_status ADD VALUE IF NOT EXISTS 'waiting_health_check';


@ -1,37 +1,253 @@
-# WebSocket + Polling Fallback Implementation Plan
-## Problem
-The linux-patch-api agent's `/api/v1/ws/jobs` endpoint was a stub that returned HTTP 101
-with a JSON body but didn't compute the required `Sec-WebSocket-Accept` header. This
-caused the pm-worker WS relay to fail with "Key mismatch in Sec-WebSocket-Accept header".
-Additionally, the pm-worker WS relay's rustls ClientConfig didn't set ALPN to http/1.1,
-causing HTTP/2 negotiation which also breaks WebSocket upgrades.
-## Root Causes
-1. **Agent WS handler was a stub** — didn't implement RFC 6455 WebSocket handshake
-2. **WS relay missing ALPN** — rustls ClientConfig didn't set `alpn_protocols` to `http/1.1`
-3. **No fallback** — WS relay had no fallback if WebSocket connection failed
-## Completed
-- [x] ALPN fix in pm-worker ws_relay.rs (forces HTTP/1.1 for WebSocket)
-- [x] Error chain logging in pm-worker ws_relay.rs (for future debugging)
-- [x] Job-level WS event_type fix (frontend + backend)
-- [x] Implement proper WebSocket in linux-patch-api using actix-web-actors
-- [x] Add WsJobActor with broadcast channel for real-time status updates
-- [x] Add HTTP polling fallback in pm-worker WS relay
-- [x] Deploy both binaries to dev LXC
-- [x] Push both projects to Gitea
-- [x] Fix config file (ws_relay_poll_interval_secs in [worker] section)
-## Deployment Notes
-- linux-patch-api binary deployed to /usr/bin/linux-patch-api on dev LXC (VMID 131)
-- pm-worker binary deployed to /usr/local/bin/pm-worker on dev LXC (VMID 131)
-- Config file: /etc/patch-manager/config.toml (added ws_relay_poll_interval_secs = 10)
-- Both services running: patch-manager-web, patch-manager-worker, linux-patch-api
-## Verified Working
-- WebSocket connections to linux-patch-manager-dev (agent with proper WS handler)
-- HTTP polling fallback to gitea-runner-u2404 (agent with stub WS)
-- Job completion status updates via pg_notify
-- Frontend real-time updates via WebSocket events
# Health Check Configuration for Hosts — Feature Plan
## Overview
Each host can have 1-5 health checks. During maintenance windows, all checks must be healthy before patch execution. Health checks are also continuously polled for dashboard status.
## Design Decisions (confirmed with Kelly)
- Health check config lives in manager DB, manager controls everything
- Agent is just an action endpoint (no health check logic on agent)
- Admins and operators can manage (operators need matching group)
- Per-host 1:1 (not shareable templates)
- Both continuous polling AND must be healthy before patch execution
- Service health: query agent `GET /api/v1/system/services/{name}`, check `healthy` field
- HTTP health: manager makes direct HTTP request, substring match in response body
- Retry failed checks at 5-minute intervals until maintenance window closes
- 10-second check timeout; no response = failed
- Order doesn't matter
- Basic auth password: encrypted in DB with per-install app key
- Health check poll interval: 5 minutes
- Result retention: 4 days (time-based)
## linux_patch_api Agent Endpoints (Reference)
### GET /api/v1/system/services/{name}
Returns service status for health check Type 1 (service). Added in commit 8b6d9ed.
**Response:** `ApiResponse<ServiceStatusData>`
```json
{
  "success": true,
  "data": {
    "name": "nginx",
    "display_name": "A high performance web server",
    "active_state": "active",
    "sub_state": "running",
    "load_state": "loaded",
    "enabled_state": "enabled",
    "main_pid": 1234,
    "healthy": true
  }
}
```
**Health determination:** The agent `healthy` field is authoritative. Manager uses this boolean directly.
**Error responses:** 400 (invalid name), 404 (not found → unhealthy), 500 (error → unhealthy)
### GET /api/v1/health
Basic agent health. Returns `{"status": "healthy"}`. Used for connectivity checks, not host health checks.
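The health-determination and error-response rules for the service endpoint can be sketched as a single mapping from agent reply to result. This is only an illustration: `AgentReply` and `evaluate_service_check` are hypothetical names, not part of the real AgentClient, and treating a 400 as unhealthy is an assumption the plan does not state.

```rust
/// Outcome of calling the agent's service-status endpoint.
/// Hypothetical type for this sketch, not the real AgentClient API.
#[derive(Debug, Clone, PartialEq)]
pub enum AgentReply {
    /// HTTP 200 carrying the `healthy` boolean from the response body.
    Ok { healthy: bool },
    /// Any non-200 status (400, 404, 500, ...).
    HttpError(u16),
    /// No response within the 10-second check timeout.
    Timeout,
}

/// Returns (healthy, detail) for a service check.
/// The agent's `healthy` field is used as-is; every error path is unhealthy.
pub fn evaluate_service_check(reply: &AgentReply) -> (bool, String) {
    match reply {
        AgentReply::Ok { healthy: true } => (true, "service healthy".to_string()),
        AgentReply::Ok { healthy: false } => (false, "agent reported unhealthy".to_string()),
        AgentReply::HttpError(404) => (false, "service not found".to_string()),
        // Assumption: 400 and any other error status also count as unhealthy.
        AgentReply::HttpError(code) => (false, format!("agent returned HTTP {code}")),
        AgentReply::Timeout => (false, "no response within timeout".to_string()),
    }
}
```

Keeping the mapping free of I/O makes it easy to unit-test separately from the mTLS client.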
---
## Phase 1: Database Schema
### New table: `host_health_checks`
```sql
CREATE TABLE host_health_checks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
host_id UUID NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
name VARCHAR(100) NOT NULL,
check_type VARCHAR(20) NOT NULL CHECK (check_type IN ('service', 'http')),
enabled BOOLEAN NOT NULL DEFAULT true,
-- Service check fields
service_name VARCHAR(200),
-- HTTP check fields
url TEXT,
expected_body VARCHAR(500),
ignore_cert_errors BOOLEAN NOT NULL DEFAULT true,
basic_auth_user VARCHAR(100),
basic_auth_pass_encrypted BYTEA, -- AES-256-GCM encrypted with per-install key
basic_auth_pass_nonce BYTEA, -- nonce for AES-GCM
-- Metadata
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT valid_service_check CHECK (
(check_type = 'service' AND service_name IS NOT NULL AND url IS NULL)
OR
(check_type = 'http' AND url IS NOT NULL AND expected_body IS NOT NULL AND service_name IS NULL)
)
);
CREATE INDEX idx_health_checks_host ON host_health_checks (host_id);
```
### New table: `host_health_check_results`
```sql
CREATE TABLE host_health_check_results (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
check_id UUID NOT NULL REFERENCES host_health_checks(id) ON DELETE CASCADE,
healthy BOOLEAN NOT NULL,
detail TEXT,
latency_ms INTEGER,
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_health_results_check ON host_health_check_results (check_id, checked_at DESC);
```
### Encryption key storage
- Per-install app key stored at `/etc/patch-manager/keys/health-check.key`
- 256-bit random key generated on first startup if not present
- File permissions: 0600, owned by patch-manager user
- Used for AES-256-GCM encryption of basic_auth_pass
- [ ] Create migration 007_health_checks.sql
- [ ] Add models to pm-core/src/models.rs
- [ ] Add encryption utility to pm-core
- [ ] Verify migration runs on dev LXC
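The key-storage rules in this phase (generate a 256-bit key on first startup, file mode 0600) might look roughly like the sketch below. It is Unix-only and uses only the standard library, reading random bytes from `/dev/urandom` as a stand-in for whatever RNG crate the project actually uses. `load_or_create_key` is a hypothetical name, and file ownership by the patch-manager user is left to the service configuration.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Loads the 32-byte key at `path`, creating it with mode 0600 if it is missing.
pub fn load_or_create_key(path: &Path) -> io::Result<[u8; 32]> {
    let mut key = [0u8; 32];
    match File::open(path) {
        Ok(mut existing) => {
            // Key file already present: read exactly 32 bytes.
            existing.read_exact(&mut key)?;
            Ok(key)
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            // Assumption: /dev/urandom as the randomness source for this sketch.
            File::open("/dev/urandom")?.read_exact(&mut key)?;
            // create_new fails if another process created the file in the meantime.
            let mut created = OpenOptions::new()
                .write(true)
                .create_new(true)
                .mode(0o600)
                .open(path)?;
            created.write_all(&key)?;
            Ok(key)
        }
        Err(e) => Err(e),
    }
}
```

Using `create_new` rather than `create` avoids silently overwriting a key that already protects stored passwords.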
---
## Phase 2: Backend API Routes
### Endpoints
- `GET /api/v1/hosts/{id}/health-checks` — list health checks for host (RBAC scoped)
- `POST /api/v1/hosts/{id}/health-checks` — create health check (max 5 per host)
- `PUT /api/v1/hosts/{id}/health-checks/{check_id}` — update health check
- `DELETE /api/v1/hosts/{id}/health-checks/{check_id}` — delete health check
- `POST /api/v1/hosts/{id}/health-checks/{check_id}/test` — run check immediately, return result
### Request/Response types
```rust
// Assumes the chrono and uuid crates are built with their `serde` features.
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// Request body for POST /api/v1/hosts/{id}/health-checks.
#[derive(Debug, Deserialize)]
struct CreateHealthCheckRequest {
    name: String,
    check_type: String, // "service" or "http"
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
    ignore_cert_errors: Option<bool>,
    basic_auth_user: Option<String>,
    basic_auth_pass: Option<String>, // plaintext in request, encrypted before storage
}
// Health check as returned by the API.
#[derive(Debug, Serialize)]
struct HealthCheck {
    id: Uuid,
    host_id: Uuid,
    name: String,
    check_type: String,
    enabled: bool,
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
    ignore_cert_errors: bool,
    basic_auth_user: Option<String>,
    // basic_auth_pass NOT returned in responses
    last_result: Option<HealthCheckResult>,
    created_at: DateTime<Utc>,
    updated_at: DateTime<Utc>,
}
// Most recent poll result for a check.
#[derive(Debug, Serialize)]
struct HealthCheckResult {
    healthy: bool,
    detail: Option<String>,
    latency_ms: Option<i32>,
    checked_at: DateTime<Utc>,
}
```
- [ ] Add routes to pm-web/src/routes/ (new health_checks.rs)
- [ ] Add CRUD operations
- [ ] Add RBAC enforcement (admin all, operator matching group)
- [ ] Add max-5-per-host validation
- [ ] Add /test endpoint that runs check immediately
- [ ] Add audit logging for create/update/delete
- [ ] Add encryption/decryption for basic_auth_pass
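The max-5-per-host rule and the per-type field requirements could be enforced in the route handler before the insert reaches the database. The sketch below mirrors the `valid_service_check` constraint from Phase 1. `NewCheck` and `validate_new_check` are hypothetical names, and the error strings are placeholders.

```rust
/// Upper bound on checks per host, from the plan.
pub const MAX_CHECKS_PER_HOST: usize = 5;

/// Trimmed-down request shape for this sketch (hypothetical; auth fields omitted).
#[derive(Debug, Clone, Default)]
pub struct NewCheck {
    pub name: String,
    pub check_type: String,
    pub service_name: Option<String>,
    pub url: Option<String>,
    pub expected_body: Option<String>,
}

/// Validates a create request against the host's current check count.
pub fn validate_new_check(req: &NewCheck, existing_count: usize) -> Result<(), String> {
    if existing_count >= MAX_CHECKS_PER_HOST {
        return Err(format!("host already has {} health checks", existing_count));
    }
    // VARCHAR(100) counts characters, so count chars rather than bytes.
    let name_len = req.name.chars().count();
    if name_len == 0 || name_len > 100 {
        return Err("name must be 1-100 characters".to_string());
    }
    match req.check_type.as_str() {
        "service" => {
            if req.service_name.is_none() {
                return Err("service check requires service_name".to_string());
            }
            if req.url.is_some() {
                return Err("service check must not set url".to_string());
            }
        }
        "http" => {
            if req.url.is_none() || req.expected_body.is_none() {
                return Err("http check requires url and expected_body".to_string());
            }
            if req.service_name.is_some() {
                return Err("http check must not set service_name".to_string());
            }
        }
        other => return Err(format!("unknown check_type: {other}")),
    }
    Ok(())
}
```

Validating up front lets the API return a readable 4xx message instead of surfacing a raw constraint violation.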
---
## Phase 3: Worker Health Check Engine
### Continuous Polling
- New task in pm-worker: `health_check_poller`
- Polls all enabled health checks every 5 minutes
- For service checks: call agent `GET /api/v1/system/services/{name}` via mTLS, check `healthy` field
- For HTTP checks: make direct HTTP(S) request from manager, check status code + substring match
- Store results in `host_health_check_results`
- Prune results older than 4 days
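As a rough illustration of the 4-day retention rule, the sketch below prunes an in-memory list of results. In the worker the same cutoff would presumably be applied with a `DELETE` on `host_health_check_results`. `StoredResult` and `prune_old_results` are hypothetical names.

```rust
use std::time::{Duration, SystemTime};

/// Results older than this are pruned (4 days, from the plan).
pub const RETENTION: Duration = Duration::from_secs(4 * 24 * 60 * 60);

/// Minimal stand-in for a row of host_health_check_results.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredResult {
    pub healthy: bool,
    pub checked_at: SystemTime,
}

/// Removes results older than RETENTION relative to `now`; returns how many were dropped.
pub fn prune_old_results(results: &mut Vec<StoredResult>, now: SystemTime) -> usize {
    let before = results.len();
    results.retain(|r| match now.duration_since(r.checked_at) {
        Ok(age) => age <= RETENTION,
        // checked_at lies in the future (clock skew): keep the row.
        Err(_) => true,
    });
    before - results.len()
}
```

Passing `now` in as a parameter keeps the cutoff deterministic in tests.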
### Pre-Patch Execution Gate
- When a patch job is about to execute on a host:
1. Check if host has any enabled health checks
2. If yes, verify all are currently healthy (from latest poll result)
3. If any are unhealthy:
- Wait and retry at 5-minute intervals
- Continue until maintenance window closes
- If window closes with failed checks, mark job host as failed with detail
4. If all healthy, proceed with patch execution
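The gate steps above can be sketched as a small loop with the health lookup, the maintenance-window test, and the wait injected as closures. This is an assumption about structure, not the code in job_executor.rs, and `run_health_gate` and `GateOutcome` are hypothetical names. A host with no enabled checks is expected to make `all_healthy` return true immediately.

```rust
use std::time::Duration;

/// Delay between retries while checks are failing (5 minutes, from the plan).
pub const RETRY_INTERVAL: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    /// All enabled checks are healthy: patching may start.
    Proceed,
    /// The maintenance window closed while checks were still failing.
    FailedWindowClosed,
}

/// Runs the pre-patch gate until the checks pass or the window closes.
pub fn run_health_gate(
    mut all_healthy: impl FnMut() -> bool,
    mut window_open: impl FnMut() -> bool,
    mut wait: impl FnMut(Duration),
) -> GateOutcome {
    loop {
        if all_healthy() {
            return GateOutcome::Proceed;
        }
        if !window_open() {
            return GateOutcome::FailedWindowClosed;
        }
        // In the worker this would be an async sleep; here the caller decides.
        wait(RETRY_INTERVAL);
    }
}
```

Injecting `wait` means tests never sleep for five minutes, and the caller can set the `waiting_health_check` job status inside it.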
### HTTP Check Implementation
- Use reqwest with:
- 10-second timeout
- Accept invalid certs only when `ignore_cert_errors` is true
- Optional basic auth header (decrypt from DB)
- Check response body contains expected_body substring
- Return healthy=true if match, false otherwise
- [ ] Add health_check_poller module to pm-worker
- [ ] Implement service check via AgentClient
- [ ] Implement HTTP check via reqwest
- [ ] Add pre-patch execution gate to job_executor
- [ ] Add retry loop with 5-minute intervals
- [ ] Add maintenance window expiry check
- [ ] Add health check config to WorkerConfig (poll interval)
- [ ] Add result pruning (4-day retention)
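The HTTP check's pass/fail decision can be separated from the request itself. The sketch below covers only the evaluation: `HttpOutcome`, `CheckResult`, and `evaluate_http_check` are hypothetical names, and the actual request (timeout, certificate handling, basic auth) would be made by the HTTP client and mapped onto `HttpOutcome`. The plan does not specify an expected status code, so this sketch decides on the body substring alone and only records the status in the detail text.

```rust
use std::time::Duration;

/// What happened when the manager issued the HTTP request.
/// Hypothetical type for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum HttpOutcome {
    Response { status: u16, body: String, elapsed: Duration },
    /// No response within the 10-second timeout.
    TimedOut,
    /// DNS, TLS, or connection failure.
    ConnectError(String),
}

/// Fields matching the result columns: healthy, detail, latency_ms.
#[derive(Debug, Clone, PartialEq)]
pub struct CheckResult {
    pub healthy: bool,
    pub detail: Option<String>,
    pub latency_ms: Option<i32>,
}

/// Healthy only when the response body contains `expected_body`.
pub fn evaluate_http_check(outcome: &HttpOutcome, expected_body: &str) -> CheckResult {
    match outcome {
        HttpOutcome::Response { status, body, elapsed } => {
            let healthy = body.contains(expected_body);
            // latency_ms is an INTEGER column, so clamp to i32.
            let latency = i32::try_from(elapsed.as_millis()).unwrap_or(i32::MAX);
            let detail = if healthy {
                format!("HTTP {status}: expected text found")
            } else {
                format!("HTTP {status}: expected text not found")
            };
            CheckResult { healthy, detail: Some(detail), latency_ms: Some(latency) }
        }
        HttpOutcome::TimedOut => CheckResult {
            healthy: false,
            detail: Some("no response within 10 seconds".to_string()),
            latency_ms: None,
        },
        HttpOutcome::ConnectError(msg) => CheckResult {
            healthy: false,
            detail: Some(format!("request failed: {msg}")),
            latency_ms: None,
        },
    }
}
```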
---
## Phase 4: Frontend UI
### Host Detail Page
- Add "Health Checks" section below host info
- List current health checks with status indicators
- Add/Edit/Delete health check dialogs
- "Test" button to run check immediately and show result
- Visual indicator: green check = healthy, red X = unhealthy, gray = unknown
### Hosts Page
- Add health check summary column or indicator
- Show aggregate status: all healthy / some unhealthy / no checks configured
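The aggregate status shown per host could be derived from the latest result of each enabled check, as in the sketch below. The string values match the `health_check_status` union used by the frontend. `aggregate_status` is a hypothetical name, and counting a check that has never been polled as unhealthy is an assumption.

```rust
/// Aggregate health-check status for one host.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HealthCheckStatus {
    AllHealthy,
    SomeUnhealthy,
    NoChecks,
}

impl HealthCheckStatus {
    /// Wire value as used by the frontend's `health_check_status` field.
    pub fn as_str(self) -> &'static str {
        match self {
            HealthCheckStatus::AllHealthy => "all_healthy",
            HealthCheckStatus::SomeUnhealthy => "some_unhealthy",
            HealthCheckStatus::NoChecks => "none",
        }
    }
}

/// `latest` holds the most recent result per enabled check;
/// `None` means the check has not been polled yet.
pub fn aggregate_status(latest: &[Option<bool>]) -> HealthCheckStatus {
    if latest.is_empty() {
        return HealthCheckStatus::NoChecks;
    }
    // Assumption: a never-polled check is treated as unhealthy.
    if latest.iter().all(|r| *r == Some(true)) {
        HealthCheckStatus::AllHealthy
    } else {
        HealthCheckStatus::SomeUnhealthy
    }
}
```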
### Deploy Page
- Show health check status in host selection table
- Warn if any selected hosts have unhealthy checks
### Job Detail
- Show health check gate status when job is waiting for healthy checks
- Display which checks are passing/failing
- [ ] Add HealthCheck types to frontend/src/types/index.ts
- [ ] Add health check API calls to frontend/src/api/client.ts
- [ ] Add Health Checks section to HostDetailPage.tsx
- [ ] Add health check status to HostsPage.tsx
- [ ] Add health check indicators to PatchDeploymentPage.tsx
- [ ] Add health check gate status to JobsPage.tsx detail view
---
## Phase 5: Integration & Testing
- [ ] Build and deploy to dev LXC
- [ ] Test service health check against dev LXC agent
- [ ] Test HTTP health check against internal services
- [ ] Test pre-patch gate: deploy with failing check, verify retry behavior
- [ ] Test maintenance window expiry with failing checks
- [ ] Test RBAC: operator can only manage checks in their group
- [ ] Test max 5 checks per host enforcement
- [ ] Test basic auth encryption/decryption
- [ ] Push to Gitea
---
## Resolved Items
- ~~Basic auth password storage~~ → Encrypted in DB with per-install app key (AES-256-GCM)
- ~~Health check poll interval~~ → 5 minutes
- ~~Result retention~~ → 4 days (time-based)