feat: health check configuration and worker engine (Phase 3+4)
Some checks failed
CI Pipeline / Rust Format Check (push) Failing after 4s
CI Pipeline / Clippy Lints (push) Successful in 46s
CI Pipeline / Rust Unit Tests (push) Successful in 1m1s
CI Pipeline / Security Audit (push) Successful in 4s
CI Pipeline / Frontend Lint & Type Check (push) Failing after 10s
CI Pipeline / Build .deb & Release (push) Has been skipped
- Added health_check_poller.rs: periodic service/HTTP health checks
- Added pre-patch health gate in job_executor.rs
- Added waiting_health_check job status (migration 008)
- Added health_check_status to HostSummary and hosts API
- Added health check types and API functions to frontend
- Added health check UI section to HostDetailPage
- Added health check status indicators to HostsPage and PatchDeploymentPage
- Added serde default for health_check_poll_interval_secs
- Fixed missing AgentClient import in health_check_poller.rs
- Fixed missing ws_relay import in main.rs
- Fixed missing closing paren in retry_pending_jobs SQL
- Added ReadWritePaths for /etc/patch-manager/keys in systemd services
105
Cargo.lock
generated
@ -8,6 +8,41 @@ version = "2.0.1"
|
|||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
|
checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa"
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "aead"
|
||||||
|
version = "0.5.2"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "d122413f284cf2d62fb1b7db97e02edb8cda96d769b16e443a4f6195e35662b0"
|
||||||
|
dependencies = [
|
||||||
|
"crypto-common",
|
||||||
|
"generic-array",
|
||||||
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "aes"
|
||||||
|
version = "0.8.4"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "b169f7a6d4742236a0a00c541b845991d0ac43e546831af1249753ab4c3aa3a0"
|
||||||
|
dependencies = [
|
||||||
|
"cfg-if",
|
||||||
|
"cipher",
|
||||||
|
"cpufeatures",
|
||||||
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "aes-gcm"
|
||||||
|
version = "0.10.3"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "831010a0f742e1209b3bcea8fab6a8e149051ba6099432c8cb2cc117dec3ead1"
|
||||||
|
dependencies = [
|
||||||
|
"aead",
|
||||||
|
"aes",
|
||||||
|
"cipher",
|
||||||
|
"ctr",
|
||||||
|
"ghash",
|
||||||
|
"subtle",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "aho-corasick"
|
name = "aho-corasick"
|
||||||
version = "1.1.4"
|
version = "1.1.4"
|
||||||
@ -400,6 +435,16 @@ dependencies = [
|
|||||||
"windows-link",
|
"windows-link",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "cipher"
|
||||||
|
version = "0.4.4"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "773f3b9af64447d2ce9850330c473515014aa235e6a783b02db81ff39e4a3dad"
|
||||||
|
dependencies = [
|
||||||
|
"crypto-common",
|
||||||
|
"inout",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "cmake"
|
name = "cmake"
|
||||||
version = "0.1.58"
|
version = "0.1.58"
|
||||||
@ -572,6 +617,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
|
|||||||
checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a"
|
checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
"generic-array",
|
"generic-array",
|
||||||
|
"rand_core 0.6.4",
|
||||||
"typenum",
|
"typenum",
|
||||||
]
|
]
|
||||||
|
|
||||||
@ -596,6 +642,15 @@ dependencies = [
|
|||||||
"memchr",
|
"memchr",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "ctr"
|
||||||
|
version = "0.9.2"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "0369ee1ad671834580515889b80f2ea915f23b8be8d0daa4bbaf2ac5c7590835"
|
||||||
|
dependencies = [
|
||||||
|
"cipher",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "dashmap"
|
name = "dashmap"
|
||||||
version = "6.1.0"
|
version = "6.1.0"
|
||||||
@ -1020,6 +1075,16 @@ dependencies = [
|
|||||||
"wasip3",
|
"wasip3",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "ghash"
|
||||||
|
version = "0.5.1"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "f0d8a4362ccb29cb0b265253fb0a2728f592895ee6854fd9bc13f2ffda266ff1"
|
||||||
|
dependencies = [
|
||||||
|
"opaque-debug",
|
||||||
|
"polyval",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "gif"
|
name = "gif"
|
||||||
version = "0.12.0"
|
version = "0.12.0"
|
||||||
@ -1458,6 +1523,15 @@ dependencies = [
|
|||||||
"serde_core",
|
"serde_core",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "inout"
|
||||||
|
version = "0.1.4"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "879f10e63c20629ecabbb64a8010319738c66a5cd0c29b02d63d272b03751d01"
|
||||||
|
dependencies = [
|
||||||
|
"generic-array",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "ipnet"
|
name = "ipnet"
|
||||||
version = "2.12.0"
|
version = "2.12.0"
|
||||||
@ -1878,6 +1952,12 @@ version = "1.21.4"
|
|||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
|
checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50"
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "opaque-debug"
|
||||||
|
version = "0.3.1"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "c08d65885ee38876c4f86fa503fb49d7b507c2b62552df7c70b2fce627e06381"
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "openssl"
|
name = "openssl"
|
||||||
version = "0.10.78"
|
version = "0.10.78"
|
||||||
@ -2195,11 +2275,13 @@ dependencies = [
|
|||||||
name = "pm-core"
|
name = "pm-core"
|
||||||
version = "0.1.0"
|
version = "0.1.0"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
|
"aes-gcm",
|
||||||
"anyhow",
|
"anyhow",
|
||||||
"axum",
|
"axum",
|
||||||
"chrono",
|
"chrono",
|
||||||
"config",
|
"config",
|
||||||
"hex",
|
"hex",
|
||||||
|
"rand 0.8.6",
|
||||||
"serde",
|
"serde",
|
||||||
"serde_json",
|
"serde_json",
|
||||||
"sha2",
|
"sha2",
|
||||||
@ -2280,6 +2362,7 @@ dependencies = [
|
|||||||
"lettre",
|
"lettre",
|
||||||
"pm-agent-client",
|
"pm-agent-client",
|
||||||
"pm-core",
|
"pm-core",
|
||||||
|
"reqwest",
|
||||||
"rustls",
|
"rustls",
|
||||||
"rustls-pemfile",
|
"rustls-pemfile",
|
||||||
"serde",
|
"serde",
|
||||||
@ -2320,6 +2403,18 @@ dependencies = [
|
|||||||
"miniz_oxide",
|
"miniz_oxide",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "polyval"
|
||||||
|
version = "0.6.2"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "9d1fe60d06143b2430aa532c94cfe9e29783047f06c0d7fd359a9a51b729fa25"
|
||||||
|
dependencies = [
|
||||||
|
"cfg-if",
|
||||||
|
"cpufeatures",
|
||||||
|
"opaque-debug",
|
||||||
|
"universal-hash",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "pom"
|
name = "pom"
|
||||||
version = "3.4.0"
|
version = "3.4.0"
|
||||||
@ -3881,6 +3976,16 @@ version = "0.2.6"
|
|||||||
source = "registry+https://github.com/rust-lang/crates.io-index"
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
|
checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853"
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "universal-hash"
|
||||||
|
version = "0.5.1"
|
||||||
|
source = "registry+https://github.com/rust-lang/crates.io-index"
|
||||||
|
checksum = "fc1de2c688dc15305988b563c3854064043356019f97a4b46276fe734c4f07ea"
|
||||||
|
dependencies = [
|
||||||
|
"crypto-common",
|
||||||
|
"subtle",
|
||||||
|
]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "untrusted"
|
name = "untrusted"
|
||||||
version = "0.9.0"
|
version = "0.9.0"
|
||||||
|
|||||||
@ -77,5 +77,6 @@ totp-rs = { version = "5", features = ["gen_secret", "otpauth"] }
|
|||||||
base64 = { version = "0.22" }
|
base64 = { version = "0.22" }
|
||||||
hex = { version = "0.4" }
|
hex = { version = "0.4" }
|
||||||
sha2 = { version = "0.10" }
|
sha2 = { version = "0.10" }
|
||||||
|
aes-gcm = { version = "0.10" }
|
||||||
ipnet = { version = "2" }
|
ipnet = { version = "2" }
|
||||||
url = { version = "2" }
|
url = { version = "2" }
|
||||||
|
|||||||
@ -42,6 +42,10 @@ health_poll_interval_secs = 300
|
|||||||
# Agent patch data poll interval (seconds). Default: 1800 = 30 minutes
|
# Agent patch data poll interval (seconds). Default: 1800 = 30 minutes
|
||||||
patch_poll_interval_secs = 1800
|
patch_poll_interval_secs = 1800
|
||||||
|
|
||||||
|
# Health check poll interval (seconds). Default: 300 = 5 minutes
|
||||||
|
# Controls how often configured service/HTTP health checks are evaluated.
|
||||||
|
health_check_poll_interval_secs = 300
|
||||||
|
|
||||||
# Maximum concurrent mTLS agent calls (Tokio Semaphore)
|
# Maximum concurrent mTLS agent calls (Tokio Semaphore)
|
||||||
max_concurrent_agent_calls = 64
|
max_concurrent_agent_calls = 64
|
||||||
|
|
||||||
|
|||||||
@ -30,7 +30,7 @@ use crate::{
|
|||||||
error::AgentClientError,
|
error::AgentClientError,
|
||||||
types::{
|
types::{
|
||||||
AgentEnvelope, AgentJobStatus, ApplyPatchesRequest, ApplyPatchesResponse, HealthData,
|
AgentEnvelope, AgentJobStatus, ApplyPatchesRequest, ApplyPatchesResponse, HealthData,
|
||||||
PackagesData, PatchesData, RollbackResponse, SystemInfoData,
|
PackagesData, PatchesData, RollbackResponse, ServiceStatusData, SystemInfoData,
|
||||||
},
|
},
|
||||||
};
|
};
|
||||||
|
|
||||||
@ -221,10 +221,17 @@ impl AgentClient {
|
|||||||
.await
|
.await
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/// `GET /api/v1/system/services/{name}` — check status of a specific service on the agent.
|
||||||
|
#[instrument(skip(self), fields(base_url = %self.base_url, service_name = %service_name))]
|
||||||
|
pub async fn service_status(&self, service_name: &str) -> Result<ServiceStatusData, AgentClientError> {
|
||||||
|
self.get(&format!("system/services/{}", service_name), &[]).await
|
||||||
|
}
|
||||||
|
|
||||||
// --------------------------------------------------------
|
// --------------------------------------------------------
|
||||||
// Private POST helper
|
// Private POST helper
|
||||||
// --------------------------------------------------------
|
// --------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
/// Execute a POST request against `{base_url}/{path}`, serialize `body` as
|
/// Execute a POST request against `{base_url}/{path}`, serialize `body` as
|
||||||
/// JSON, deserialize the [`AgentEnvelope`], and extract the `data` field —
|
/// JSON, deserialize the [`AgentEnvelope`], and extract the `data` field —
|
||||||
/// or propagate an [`AgentClientError::ApiError`].
|
/// or propagate an [`AgentClientError::ApiError`].
|
||||||
|
|||||||
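Review note: the `service_status` method above interpolates `service_name` directly into the request path. The following is a minimal sketch, not part of this commit, of how a caller might reject names that could change the path structure before the request is built. `is_safe_service_name` and `service_status_path` are hypothetical helpers, and the allowed character set is an assumption loosely modeled on systemd unit names.

```rust
/// Hypothetical helper: accepts only non-empty names of at most 256 bytes
/// made of characters that cannot introduce extra path segments or a query.
fn is_safe_service_name(name: &str) -> bool {
    !name.is_empty()
        && name.len() <= 256
        // "." and ".." are valid characters-wise but would act as path traversal.
        && name != "."
        && name != ".."
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '@' | ':'))
}

/// Hypothetical helper: builds the relative path passed to the GET helper,
/// or returns `None` when the name is rejected.
fn service_status_path(name: &str) -> Option<String> {
    if is_safe_service_name(name) {
        Some(format!("system/services/{}", name))
    } else {
        None
    }
}
```

A caller could map the `None` case to a validation error before any network call is made.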
@ -39,5 +39,5 @@ pub use error::AgentClientError;
|
|||||||
/// Response envelope and all data types.
|
/// Response envelope and all data types.
|
||||||
pub use types::{
|
pub use types::{
|
||||||
AgentEnvelope, AgentErrorBody, HealthData, Package, PackagesData, Patch, PatchesData,
|
AgentEnvelope, AgentErrorBody, HealthData, Package, PackagesData, Patch, PatchesData,
|
||||||
SystemInfoData,
|
RollbackResponse, ServiceStatusData, SystemInfoData,
|
||||||
};
|
};
|
||||||
|
|||||||
@ -193,6 +193,23 @@ pub struct AgentJobStatus {
|
|||||||
pub completed_at: Option<DateTime<Utc>>,
|
pub completed_at: Option<DateTime<Utc>>,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ============================================================
|
||||||
|
// GET /api/v1/system/services/{name}
|
||||||
|
// ============================================================
|
||||||
|
|
||||||
|
/// Payload returned by `GET /api/v1/system/services/{name}`.
|
||||||
|
#[derive(Debug, Clone, Deserialize, Serialize)]
|
||||||
|
pub struct ServiceStatusData {
|
||||||
|
/// Service name.
|
||||||
|
pub name: String,
|
||||||
|
/// Service status string (e.g. `"running"`, `"stopped"`, `"failed"`).
|
||||||
|
pub status: String,
|
||||||
|
/// Whether the service is considered healthy.
|
||||||
|
pub healthy: bool,
|
||||||
|
/// Seconds elapsed since the service started (`null` if not running).
|
||||||
|
pub uptime_secs: Option<u64>,
|
||||||
|
}
|
||||||
|
|
||||||
// ============================================================
|
// ============================================================
|
||||||
// POST /api/v1/jobs/{id}/rollback
|
// POST /api/v1/jobs/{id}/rollback
|
||||||
// ============================================================
|
// ============================================================
|
||||||
|
|||||||
@ -22,3 +22,5 @@ config = { workspace = true }
|
|||||||
axum = { workspace = true }
|
axum = { workspace = true }
|
||||||
sha2 = { workspace = true }
|
sha2 = { workspace = true }
|
||||||
hex = { workspace = true }
|
hex = { workspace = true }
|
||||||
|
aes-gcm = { workspace = true }
|
||||||
|
rand = { workspace = true }
|
||||||
|
|||||||
@ -47,6 +47,9 @@ pub enum AuditAction {
|
|||||||
PatchJobCompleted,
|
PatchJobCompleted,
|
||||||
PatchJobFailed,
|
PatchJobFailed,
|
||||||
MaintenanceWindowReminder,
|
MaintenanceWindowReminder,
|
||||||
|
HealthCheckCreated,
|
||||||
|
HealthCheckUpdated,
|
||||||
|
HealthCheckDeleted,
|
||||||
}
|
}
|
||||||
|
|
||||||
impl AuditAction {
|
impl AuditAction {
|
||||||
@ -80,6 +83,9 @@ impl AuditAction {
|
|||||||
Self::PatchJobCompleted => "patch_job_completed",
|
Self::PatchJobCompleted => "patch_job_completed",
|
||||||
Self::PatchJobFailed => "patch_job_failed",
|
Self::PatchJobFailed => "patch_job_failed",
|
||||||
Self::MaintenanceWindowReminder => "maintenance_window_reminder",
|
Self::MaintenanceWindowReminder => "maintenance_window_reminder",
|
||||||
|
Self::HealthCheckCreated => "health_check_created",
|
||||||
|
Self::HealthCheckUpdated => "health_check_updated",
|
||||||
|
Self::HealthCheckDeleted => "health_check_deleted",
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
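Review note: the three new `AuditAction` variants each need a matching arm in `as_str`, and the strings must stay unique. The sketch below shows one way to check that mapping in isolation. It uses a reduced, hypothetical enum (`HealthCheckAuditAction`) containing only the variants added here, and `parse` is an assumed inverse that the commit does not define.

```rust
/// Hypothetical, reduced stand-in for the three variants added to `AuditAction`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HealthCheckAuditAction {
    Created,
    Updated,
    Deleted,
}

impl HealthCheckAuditAction {
    /// Every variant, so tests can iterate over the full set.
    const ALL: [Self; 3] = [Self::Created, Self::Updated, Self::Deleted];

    /// Same string values as the arms added in the diff above.
    fn as_str(self) -> &'static str {
        match self {
            Self::Created => "health_check_created",
            Self::Updated => "health_check_updated",
            Self::Deleted => "health_check_deleted",
        }
    }

    /// Assumed inverse of `as_str`; returns `None` for unknown strings.
    fn parse(s: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|a| a.as_str() == s)
    }
}
```

Round-tripping every variant through `as_str` and `parse` fails if two variants ever share a string.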
@ -39,6 +39,9 @@ pub struct WorkerConfig {
|
|||||||
pub health_poll_interval_secs: u64,
|
pub health_poll_interval_secs: u64,
|
||||||
/// Patch data poll interval in seconds (default: 1800 = 30 min)
|
/// Patch data poll interval in seconds (default: 1800 = 30 min)
|
||||||
pub patch_poll_interval_secs: u64,
|
pub patch_poll_interval_secs: u64,
|
||||||
|
/// Health check poll interval in seconds (default: 300 = 5 min)
|
||||||
|
#[serde(default = "default_health_check_poll_interval")]
|
||||||
|
pub health_check_poll_interval_secs: u64,
|
||||||
/// Maximum concurrent agent calls
|
/// Maximum concurrent agent calls
|
||||||
pub max_concurrent_agent_calls: usize,
|
pub max_concurrent_agent_calls: usize,
|
||||||
/// Worker heartbeat interval in seconds
|
/// Worker heartbeat interval in seconds
|
||||||
@ -98,6 +101,8 @@ impl AppConfig {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
fn default_health_check_poll_interval() -> u64 { 300 }
|
||||||
|
|
||||||
impl Default for AppConfig {
|
impl Default for AppConfig {
|
||||||
fn default() -> Self {
|
fn default() -> Self {
|
||||||
Self {
|
Self {
|
||||||
@ -115,6 +120,7 @@ impl Default for AppConfig {
|
|||||||
worker: WorkerConfig {
|
worker: WorkerConfig {
|
||||||
health_poll_interval_secs: 300,
|
health_poll_interval_secs: 300,
|
||||||
patch_poll_interval_secs: 1800,
|
patch_poll_interval_secs: 1800,
|
||||||
|
health_check_poll_interval_secs: 300,
|
||||||
max_concurrent_agent_calls: 64,
|
max_concurrent_agent_calls: 64,
|
||||||
heartbeat_interval_secs: 30,
|
heartbeat_interval_secs: 30,
|
||||||
ws_relay_poll_interval_secs: 10,
|
ws_relay_poll_interval_secs: 10,
|
||||||
|
|||||||
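Review note: the serde default above covers a missing `health_check_poll_interval_secs` key, but an explicit `0` would still be accepted, and `tokio::time::interval` panics when given a zero period. A minimal sketch of a guard follows. `effective_poll_interval` is a hypothetical helper, and the `Option<u64>` parameter is only a stand-in for "key absent from the config file".

```rust
use std::time::Duration;

/// Mirrors the default used in the diff (300 seconds = 5 minutes).
const DEFAULT_HEALTH_CHECK_POLL_INTERVAL_SECS: u64 = 300;

/// Hypothetical helper: falls back to the default when the value is absent
/// and clamps zero to one second so the ticker period is never zero.
fn effective_poll_interval(configured: Option<u64>) -> Duration {
    let secs = configured
        .unwrap_or(DEFAULT_HEALTH_CHECK_POLL_INTERVAL_SECS)
        .max(1);
    Duration::from_secs(secs)
}
```

Whether a zero value should be clamped or rejected at config-load time is a design choice; clamping keeps the worker running, while rejecting surfaces the misconfiguration earlier.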
80
crates/pm-core/src/crypto.rs
Normal file
@ -0,0 +1,80 @@
//! AES-256-GCM encryption for sensitive health check credentials.
//!
//! Uses a per-install key stored at `/etc/patch-manager/keys/health-check.key`.

use aes_gcm::{
    aead::{Aead, KeyInit, OsRng},
    Aes256Gcm, Nonce,
};
use rand::RngCore;
use std::fs;
use std::path::Path;

pub const KEY_PATH: &str = "/etc/patch-manager/keys/health-check.key";

/// Load or create the per-install encryption key.
/// If the key file doesn't exist, generates a new 256-bit key and saves it.
pub fn load_or_create_key(path: &Path) -> Result<[u8; 32], CryptoError> {
    if path.exists() {
        let key_bytes = fs::read(path).map_err(CryptoError::Io)?;
        if key_bytes.len() != 32 {
            return Err(CryptoError::InvalidKeyLength(key_bytes.len()));
        }
        let mut key = [0u8; 32];
        key.copy_from_slice(&key_bytes);
        Ok(key)
    } else {
        let mut key = [0u8; 32];
        OsRng.fill_bytes(&mut key);
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent).map_err(CryptoError::Io)?;
        }
        fs::write(path, &key).map_err(CryptoError::Io)?;
        // Set permissions to 0600 (owner read/write only)
        #[cfg(unix)]
        {
            use std::os::unix::fs::PermissionsExt;
            fs::set_permissions(path, fs::Permissions::from_mode(0o600))
                .map_err(CryptoError::Io)?;
        }
        Ok(key)
    }
}

/// Encrypt plaintext with AES-256-GCM. Returns (ciphertext, nonce).
pub fn encrypt(plaintext: &str, key: &[u8; 32]) -> Result<(Vec<u8>, Vec<u8>), CryptoError> {
    let cipher = Aes256Gcm::new_from_slice(key).map_err(|e| CryptoError::KeyInit(e.to_string()))?;
    let mut nonce_bytes = [0u8; 12];
    OsRng.fill_bytes(&mut nonce_bytes);
    let nonce = Nonce::from_slice(&nonce_bytes);
    let ciphertext = cipher
        .encrypt(nonce, plaintext.as_bytes())
        .map_err(|_| CryptoError::EncryptionFailed)?;
    Ok((ciphertext, nonce_bytes.to_vec()))
}

/// Decrypt AES-256-GCM ciphertext with the given nonce.
pub fn decrypt(ciphertext: &[u8], nonce: &[u8], key: &[u8; 32]) -> Result<String, CryptoError> {
    let cipher = Aes256Gcm::new_from_slice(key).map_err(|e| CryptoError::KeyInit(e.to_string()))?;
    let nonce = Nonce::from_slice(nonce);
    let plaintext = cipher
        .decrypt(nonce, ciphertext)
        .map_err(|_| CryptoError::DecryptionFailed)?;
    String::from_utf8(plaintext).map_err(CryptoError::Utf8)
}

#[derive(Debug, thiserror::Error)]
pub enum CryptoError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid key length: expected 32 bytes, got {0}")]
    InvalidKeyLength(usize),
    #[error("Key init error: {0}")]
    KeyInit(String),
    #[error("Encryption failed")]
    EncryptionFailed,
    #[error("Decryption failed")]
    DecryptionFailed,
    #[error("UTF-8 error: {0}")]
    Utf8(#[from] std::string::FromUtf8Error),
}
|
||||||
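Review note: in `load_or_create_key` the key is written with `fs::write` and the mode is tightened afterwards, so for a short window the file exists with whatever permissions the process umask allows. The sketch below, which is not part of this commit, shows one possible alternative on Unix: setting the mode at creation time through `OpenOptionsExt::mode`. `write_key_file` is a hypothetical helper name.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::os::unix::fs::OpenOptionsExt;
use std::path::Path;

/// Hypothetical helper (Unix only): creates the key file with mode 0600 from
/// the start, so it is never readable by other users, and refuses to
/// overwrite an existing key.
fn write_key_file(path: &Path, key: &[u8; 32]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true) // error with AlreadyExists instead of clobbering a key
        .mode(0o600) // applied at creation; the umask can only remove bits
        .open(path)?;
    file.write_all(key)?;
    // Flush file contents and metadata to disk before reporting success.
    file.sync_all()
}
```

Using `create_new` also narrows the race between the `path.exists()` check and the write: if two processes start at once, the second one gets an error rather than silently replacing the first key.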
@ -1,5 +1,6 @@
|
|||||||
pub mod audit;
|
pub mod audit;
|
||||||
pub mod config;
|
pub mod config;
|
||||||
|
pub mod crypto;
|
||||||
pub mod db;
|
pub mod db;
|
||||||
pub mod error;
|
pub mod error;
|
||||||
pub mod logging;
|
pub mod logging;
|
||||||
@ -8,11 +9,14 @@ pub mod request_id;
|
|||||||
|
|
||||||
// Re-export commonly used types
|
// Re-export commonly used types
|
||||||
pub use config::AppConfig;
|
pub use config::AppConfig;
|
||||||
|
pub use crypto::{CryptoError, KEY_PATH, decrypt, encrypt, load_or_create_key};
|
||||||
pub use error::{AppError, ErrorResponse};
|
pub use error::{AppError, ErrorResponse};
|
||||||
pub use models::{
|
pub use models::{
|
||||||
AuthProvider, CreateGroupRequest, CreateHostRequest, CreateUserRequest, DiscoveryCidrRequest,
|
AuthProvider, CreateGroupRequest, CreateHealthCheckRequest, CreateHostRequest,
|
||||||
DiscoveryResult, Group, Host, HostHealthStatus, HostSummary, RegisterDiscoveredRequest,
|
CreateUserRequest, DiscoveryCidrRequest, DiscoveryResult, Group, HealthCheck,
|
||||||
UpdateGroupRequest, UpdateUserRequest, User, UserRole as DbUserRole,
|
HealthCheckResult, HealthCheckWithResult, Host, HostHealthStatus, HostSummary,
|
||||||
|
RegisterDiscoveredRequest, UpdateGroupRequest, UpdateHealthCheckRequest, UpdateUserRequest,
|
||||||
|
User, UserRole as DbUserRole,
|
||||||
};
|
};
|
||||||
|
|
||||||
// Re-export audit integrity types
|
// Re-export audit integrity types
|
||||||
|
|||||||
@ -113,9 +113,79 @@ pub struct HostSummary {
|
|||||||
pub health_status: HostHealthStatus,
|
pub health_status: HostHealthStatus,
|
||||||
pub agent_version: Option<String>,
|
pub agent_version: Option<String>,
|
||||||
pub patches_missing: i32,
|
pub patches_missing: i32,
|
||||||
|
pub health_check_status: Option<String>,
|
||||||
pub registered_at: DateTime<Utc>,
|
pub registered_at: DateTime<Utc>,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ============================================================
|
||||||
|
// Health Checks
|
||||||
|
// ============================================================
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
|
||||||
|
pub struct HealthCheck {
|
||||||
|
pub id: Uuid,
|
||||||
|
pub host_id: Uuid,
|
||||||
|
pub name: String,
|
||||||
|
pub check_type: String, // "service" or "http"
|
||||||
|
pub enabled: bool,
|
||||||
|
// Service check fields
|
||||||
|
pub service_name: Option<String>,
|
||||||
|
// HTTP check fields
|
||||||
|
pub url: Option<String>,
|
||||||
|
pub expected_body: Option<String>,
|
||||||
|
pub ignore_cert_errors: bool,
|
||||||
|
pub basic_auth_user: Option<String>,
|
||||||
|
// basic_auth_pass_encrypted and nonce NOT exposed in API responses
|
||||||
|
pub created_at: DateTime<Utc>,
|
||||||
|
pub updated_at: DateTime<Utc>,
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
pub struct HealthCheckWithResult {
|
||||||
|
#[serde(flatten)]
|
||||||
|
pub check: HealthCheck,
|
||||||
|
pub last_result: Option<HealthCheckResult>,
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize, FromRow)]
|
||||||
|
pub struct HealthCheckResult {
|
||||||
|
pub id: Uuid,
|
||||||
|
pub check_id: Uuid,
|
||||||
|
pub healthy: bool,
|
||||||
|
pub detail: Option<String>,
|
||||||
|
pub latency_ms: Option<i32>,
|
||||||
|
pub checked_at: DateTime<Utc>,
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
pub struct CreateHealthCheckRequest {
|
||||||
|
pub name: String,
|
||||||
|
pub check_type: String, // "service" or "http"
|
||||||
|
pub service_name: Option<String>,
|
||||||
|
pub url: Option<String>,
|
||||||
|
pub expected_body: Option<String>,
|
||||||
|
#[serde(default = "default_true")]
|
||||||
|
pub ignore_cert_errors: bool,
|
||||||
|
pub basic_auth_user: Option<String>,
|
||||||
|
pub basic_auth_pass: Option<String>, // plaintext in request, encrypted before storage
|
||||||
|
}
|
||||||
|
|
||||||
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||||
|
pub struct UpdateHealthCheckRequest {
|
||||||
|
pub name: Option<String>,
|
||||||
|
pub enabled: Option<bool>,
|
||||||
|
pub service_name: Option<String>,
|
||||||
|
pub url: Option<String>,
|
||||||
|
pub expected_body: Option<String>,
|
||||||
|
pub ignore_cert_errors: Option<bool>,
|
||||||
|
pub basic_auth_user: Option<String>,
|
||||||
|
pub basic_auth_pass: Option<String>, // if provided, re-encrypt
|
||||||
|
}
|
||||||
|
|
||||||
|
fn default_true() -> bool {
|
||||||
|
true
|
||||||
|
}
|
||||||
|
|
||||||
// ============================================================
|
// ============================================================
|
||||||
// Group
|
// Group
|
||||||
// ============================================================
|
// ============================================================
|
||||||
|
|||||||
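Review note: `check_type` is carried as a `String` with the allowed values only documented in a comment (`"service"` or `"http"`). The sketch below shows how the same constraint could be expressed as an enum with a small validation step. `CheckType` and `validate_target` are hypothetical names, and the rule that each type requires its own target field is an assumption inferred from the field comments, not something this diff confirms.

```rust
use std::str::FromStr;

/// Hypothetical typed form of the `check_type` string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckType {
    Service,
    Http,
}

impl FromStr for CheckType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "service" => Ok(Self::Service),
            "http" => Ok(Self::Http),
            other => Err(format!("unknown check_type: {}", other)),
        }
    }
}

/// Assumed rule: service checks need `service_name`, HTTP checks need `url`.
fn validate_target(
    check_type: CheckType,
    service_name: Option<&str>,
    url: Option<&str>,
) -> Result<(), String> {
    // Treat a missing or whitespace-only value as absent.
    let missing = |v: Option<&str>| v.map_or(true, |s| s.trim().is_empty());
    match check_type {
        CheckType::Service if missing(service_name) => {
            Err("service checks require service_name".to_string())
        }
        CheckType::Http if missing(url) => Err("http checks require url".to_string()),
        _ => Ok(()),
    }
}
```

Parsing at the API boundary would let the rest of the code match on the enum instead of comparing strings.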
@ -189,6 +189,7 @@ pub fn build_router(state: AppState) -> Router {
|
|||||||
.merge(routes::ws::ticket_router())
|
.merge(routes::ws::ticket_router())
|
||||||
// Reports
|
// Reports
|
||||||
.nest("/reports", routes::reports::router())
|
.nest("/reports", routes::reports::router())
|
||||||
|
.nest("/hosts/{host_id}/health-checks", routes::health_checks::router())
|
||||||
// Settings (admin-only)
|
// Settings (admin-only)
|
||||||
.nest("/settings", routes::settings::router())
|
.nest("/settings", routes::settings::router())
|
||||||
// Apply auth middleware to all the above
|
// Apply auth middleware to all the above
|
||||||
|
|||||||
1042
crates/pm-web/src/routes/health_checks.rs
Normal file
File diff suppressed because it is too large
@ -112,6 +112,7 @@ async fn list_hosts(
|
|||||||
SELECT h.id, h.fqdn, host(h.ip_address)::text AS ip_address, h.display_name,
|
SELECT h.id, h.fqdn, host(h.ip_address)::text AS ip_address, h.display_name,
|
||||||
h.os_family, h.os_name, h.health_status, h.agent_version,
|
h.os_family, h.os_name, h.health_status, h.agent_version,
|
||||||
COALESCE(hpd.patch_count, 0) AS patches_missing,
|
COALESCE(hpd.patch_count, 0) AS patches_missing,
|
||||||
|
" + hc_subquery + ",
|
||||||
h.registered_at
|
h.registered_at
|
||||||
FROM hosts h
|
FROM hosts h
|
||||||
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id
|
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id
|
||||||
@ -130,6 +131,7 @@ async fn list_hosts(
|
|||||||
h.display_name, h.os_family, h.os_name,
|
h.display_name, h.os_family, h.os_name,
|
||||||
h.health_status, h.agent_version,
|
h.health_status, h.agent_version,
|
||||||
COALESCE(hpd.patch_count, 0) AS patches_missing,
|
COALESCE(hpd.patch_count, 0) AS patches_missing,
|
||||||
|
" + hc_subquery + ",
|
||||||
h.registered_at
|
h.registered_at
|
||||||
FROM hosts h
|
FROM hosts h
|
||||||
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id
|
LEFT JOIN host_patch_data hpd ON hpd.host_id = h.id
|
||||||
|
|||||||
@ -11,5 +11,6 @@ pub mod settings;
|
|||||||
pub mod status;
|
pub mod status;
|
||||||
pub mod users;
|
pub mod users;
|
||||||
pub mod ws;
|
pub mod ws;
|
||||||
|
pub mod health_checks;
|
||||||
|
|
||||||
pub mod reports;
|
pub mod reports;
|
||||||
|
|||||||
@ -28,3 +28,4 @@ tokio-rustls = { version = "0.26" }
|
|||||||
rustls-pemfile = { version = "2" }
|
rustls-pemfile = { version = "2" }
|
||||||
tokio-tungstenite = { version = "0.26", features = ["rustls-tls-webpki-roots"] }
|
tokio-tungstenite = { version = "0.26", features = ["rustls-tls-webpki-roots"] }
|
||||||
lettre = { version = "0.11", default-features = false, features = ["tokio1-rustls-tls", "smtp-transport", "builder"] }
|
lettre = { version = "0.11", default-features = false, features = ["tokio1-rustls-tls", "smtp-transport", "builder"] }
|
||||||
|
reqwest = { workspace = true }
|
||||||
|
|||||||
471
crates/pm-worker/src/health_check_poller.rs
Normal file
@ -0,0 +1,471 @@
//! Periodic health check poller for configured service and HTTP checks.
//!
//! Polls every `health_check_poll_interval_secs`, querying each enabled health
//! check definition and storing results in `host_health_check_results`.
//! Results older than 4 days are pruned on each cycle.

use std::path::Path;
use std::sync::Arc;
use std::time::Instant;

use pm_core::{config::AppConfig, crypto};
use sqlx::{FromRow, PgPool};
use tokio::{sync::Semaphore, time};
use uuid::Uuid;

use crate::agent_loader::load_agent_certs;
use pm_agent_client::{AgentClient, AgentClientError};

// ─────────────────────────────────────────────────────────────────────────────
// DB row types
// ─────────────────────────────────────────────────────────────────────────────

/// Row fetched for each enabled health check, joined with host connection info.
#[derive(Debug, FromRow)]
struct HealthCheckRow {
    id: Uuid,
    host_id: Uuid,
    name: String,
    check_type: String,
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
    ignore_cert_errors: Option<bool>,
    basic_auth_user: Option<String>,
    basic_auth_pass_encrypted: Option<Vec<u8>>,
    basic_auth_pass_nonce: Option<Vec<u8>>,
    ip_address: String,
    agent_port: i32,
}

// ─────────────────────────────────────────────────────────────────────────────
// Public entry point
// ─────────────────────────────────────────────────────────────────────────────

/// Run the health check poller loop indefinitely.
///
/// On each tick all enabled health checks are queried concurrently (up to
/// `max_concurrent_agent_calls` in-flight at once). Results are persisted
/// to `host_health_check_results` and stale rows are pruned.
pub async fn run_health_check_poller(pool: PgPool, config: Arc<AppConfig>) {
    let interval_secs = config.worker.health_check_poll_interval_secs;
    let mut ticker = time::interval(std::time::Duration::from_secs(interval_secs));

    tracing::info!(interval_secs, "Health check poller started");

    loop {
        ticker.tick().await;

        // Load certs on each cycle so cert rotation is picked up automatically.
        let certs = match load_agent_certs(&config.security) {
            Ok(c) => c,
            Err(e) => {
                tracing::error!(
                    error = %e,
                    "Health check poller: failed to load agent certs — skipping cycle"
                );
                continue;
            },
        };

        let client_cert = Arc::new(certs.client_cert);
        let client_key = Arc::new(certs.client_key);
        let ca_cert = Arc::new(certs.ca_cert);

        // Load the crypto key for decrypting HTTP check passwords.
        let crypto_key = match crypto::load_or_create_key(Path::new(crypto::KEY_PATH)) {
            Ok(k) => Arc::new(k),
            Err(e) => {
                tracing::error!(
                    error = %e,
                    "Health check poller: failed to load crypto key — skipping cycle"
                );
                continue;
            },
        };

        // Fetch all enabled health checks with host connection info.
        let checks: Vec<HealthCheckRow> = match sqlx::query_as(
            r#"
            SELECT
                hc.id,
                hc.host_id,
                hc.name,
                hc.check_type,
                hc.service_name,
                hc.url,
                hc.expected_body,
                hc.ignore_cert_errors,
                hc.basic_auth_user,
                hc.basic_auth_pass_encrypted,
                hc.basic_auth_pass_nonce,
                host(h.ip_address)::text AS ip_address,
                h.agent_port
            FROM host_health_checks hc
            JOIN hosts h ON h.id = hc.host_id
            WHERE hc.enabled = TRUE
            ORDER BY hc.id
            "#,
        )
        .fetch_all(&pool)
        .await
        {
            Ok(rows) => rows,
            Err(e) => {
                tracing::error!(error = %e, "Health check poller: failed to fetch health checks");
                continue;
            },
        };

        if checks.is_empty() {
            tracing::debug!("Health check poller: no enabled health checks, skipping cycle");
            prune_old_results(&pool).await;
            continue;
        }

        let total = checks.len();
        let semaphore = Arc::new(Semaphore::new(config.worker.max_concurrent_agent_calls));

        let mut handles = Vec::with_capacity(total);

        for check in checks {
            let pool = pool.clone();
            let sem = semaphore.clone();
            let cert = client_cert.clone();
            let key = client_key.clone();
            let ca = ca_cert.clone();
            let ckey = crypto_key.clone();

            let handle = tokio::spawn(async move {
                let _permit = sem.acquire().await.expect("semaphore closed");
                run_check(pool, check, &cert, &key, &ca, &ckey).await
            });

            handles.push(handle);
        }

        // Collect results and tally counts.
|
||||||
|
let mut healthy_count = 0usize;
|
||||||
|
let mut unhealthy_count = 0usize;
|
||||||
|
let mut error_count = 0usize;
|
||||||
|
|
||||||
|
for handle in handles {
|
||||||
|
match handle.await {
|
||||||
|
Ok(true) => healthy_count += 1,
|
||||||
|
Ok(false) => unhealthy_count += 1,
|
||||||
|
Err(e) => {
|
||||||
|
tracing::error!(error = %e, "Health check poller task panicked");
|
||||||
|
error_count += 1;
|
||||||
|
},
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
tracing::info!(
|
||||||
|
total,
|
||||||
|
healthy_count,
|
||||||
|
unhealthy_count,
|
||||||
|
error_count,
|
||||||
|
"Health check poll cycle complete"
|
||||||
|
);
|
||||||
|
|
||||||
|
// Prune results older than 4 days.
|
||||||
|
prune_old_results(&pool).await;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Check dispatch
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Run a single health check and persist the result. Returns `true` if healthy.
|
||||||
|
async fn run_check(
|
||||||
|
pool: PgPool,
|
||||||
|
check: HealthCheckRow,
|
||||||
|
client_cert: &[u8],
|
||||||
|
client_key: &[u8],
|
||||||
|
ca_cert: &[u8],
|
||||||
|
crypto_key: &[u8; 32],
|
||||||
|
) -> bool {
|
||||||
|
let start = Instant::now();
|
||||||
|
|
||||||
|
let (healthy, detail) = match check.check_type.as_str() {
|
||||||
|
"service" => run_service_check(&check, client_cert, client_key, ca_cert).await,
|
||||||
|
"http" => run_http_check(&check, crypto_key).await,
|
||||||
|
other => {
|
||||||
|
tracing::warn!(
|
||||||
|
check_id = %check.id,
|
||||||
|
check_type = other,
|
||||||
|
"Unknown health check type — treating as unhealthy"
|
||||||
|
);
|
||||||
|
(false, format!("Unknown check type: {other}"))
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
let latency_ms = start.elapsed().as_millis() as i32;
|
||||||
|
|
||||||
|
// Persist the result.
|
||||||
|
if let Err(e) = sqlx::query(
|
||||||
|
r#"
|
||||||
|
INSERT INTO host_health_check_results (check_id, healthy, detail, latency_ms)
|
||||||
|
VALUES ($1, $2, $3, $4)
|
||||||
|
"#,
|
||||||
|
)
|
||||||
|
.bind(check.id)
|
||||||
|
.bind(healthy)
|
||||||
|
.bind(&detail)
|
||||||
|
.bind(latency_ms)
|
||||||
|
.execute(&pool)
|
||||||
|
.await
|
||||||
|
{
|
||||||
|
tracing::error!(
|
||||||
|
check_id = %check.id,
|
||||||
|
error = %e,
|
||||||
|
"Health check poller: failed to insert result"
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
healthy
|
||||||
|
}
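Review note on the latency cast in `run_check`: `as_millis()` returns a `u128`, and a bare `as i32` cast truncates rather than clamps. With a 10 second HTTP timeout this is unlikely to matter in practice, but a clamping conversion is cheap. The sketch below is a hypothetical helper (`latency_ms_saturating` is not part of the commit) that uses only the standard library.

```rust
use std::time::Duration;

/// Convert an elapsed duration to whole milliseconds, clamping at `i32::MAX`
/// instead of truncating the way a bare `as i32` cast would.
/// Hypothetical helper, not part of the commit.
pub fn latency_ms_saturating(elapsed: Duration) -> i32 {
    // `as_millis` yields a u128; `try_from` fails for values above i32::MAX.
    i32::try_from(elapsed.as_millis()).unwrap_or(i32::MAX)
}
```

In `run_check` this would be called as `latency_ms_saturating(start.elapsed())`, assuming the column stays a 32-bit integer.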
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Service check (via mTLS AgentClient)
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Execute a service check by calling the agent's `/api/v1/system/services/{name}` endpoint.
|
||||||
|
async fn run_service_check(
|
||||||
|
check: &HealthCheckRow,
|
||||||
|
client_cert: &[u8],
|
||||||
|
client_key: &[u8],
|
||||||
|
ca_cert: &[u8],
|
||||||
|
) -> (bool, String) {
|
||||||
|
let service_name = match &check.service_name {
|
||||||
|
Some(name) => name.clone(),
|
||||||
|
None => {
|
||||||
|
return (false, "Service check missing service_name".to_string());
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
let client = match AgentClient::new(
|
||||||
|
&check.ip_address,
|
||||||
|
check.agent_port as u16,
|
||||||
|
client_cert,
|
||||||
|
client_key,
|
||||||
|
ca_cert,
|
||||||
|
) {
|
||||||
|
Ok(c) => c,
|
||||||
|
Err(e) => {
|
||||||
|
return (false, format!("Failed to build AgentClient: {e}"));
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
match client.service_status(&service_name).await {
|
||||||
|
Ok(data) => {
|
||||||
|
let detail = if data.healthy {
|
||||||
|
format!(
|
||||||
|
"Service '{}' is {} (uptime: {}s)",
|
||||||
|
data.name,
|
||||||
|
data.status,
|
||||||
|
data.uptime_secs.map_or("N/A".to_string(), |s| s.to_string())
|
||||||
|
)
|
||||||
|
} else {
|
||||||
|
format!(
|
||||||
|
"Service '{}' status: {} (unhealthy)",
|
||||||
|
data.name, data.status
|
||||||
|
)
|
||||||
|
};
|
||||||
|
(data.healthy, detail)
|
||||||
|
},
|
||||||
|
Err(AgentClientError::Timeout) => {
|
||||||
|
(false, format!("Agent timed out querying service '{service_name}'"))
|
||||||
|
},
|
||||||
|
Err(AgentClientError::Connect(_)) => {
|
||||||
|
(false, format!("Agent connection refused for service '{service_name}'"))
|
||||||
|
},
|
||||||
|
Err(AgentClientError::ApiError { code, message }) => {
|
||||||
|
// 404, 400, 500 etc. from the agent means the service is unhealthy.
|
||||||
|
(false, format!("Agent error [{code}]: {message}"))
|
||||||
|
},
|
||||||
|
Err(e) => {
|
||||||
|
(false, format!("Agent error querying service '{service_name}': {e}"))
|
||||||
|
},
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// HTTP check (via reqwest, no mTLS)
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Execute an HTTP check by making a GET request to the configured URL.
|
||||||
|
/// Supports optional basic auth (decrypted from DB) and substring body matching.
|
||||||
|
async fn run_http_check(
|
||||||
|
check: &HealthCheckRow,
|
||||||
|
crypto_key: &[u8; 32],
|
||||||
|
) -> (bool, String) {
|
||||||
|
let url = match &check.url {
|
||||||
|
Some(u) => u.clone(),
|
||||||
|
None => {
|
||||||
|
return (false, "HTTP check missing URL".to_string());
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
// Build a reqwest client for this check.
|
||||||
|
// Use danger_accept_invalid_certs if ignore_cert_errors is set (default true).
|
||||||
|
let ignore_cert_errors = check.ignore_cert_errors.unwrap_or(true);
|
||||||
|
|
||||||
|
let client_builder = reqwest::Client::builder()
|
||||||
|
.timeout(std::time::Duration::from_secs(10))
|
||||||
|
.redirect(reqwest::redirect::Policy::limited(5));
|
||||||
|
|
||||||
|
let client = if ignore_cert_errors {
|
||||||
|
client_builder
|
||||||
|
.danger_accept_invalid_certs(true)
|
||||||
|
.build()
|
||||||
|
.unwrap_or_else(|_| reqwest::Client::new())
|
||||||
|
} else {
|
||||||
|
client_builder.build().unwrap_or_else(|_| reqwest::Client::new())
|
||||||
|
};
|
||||||
|
|
||||||
|
// Build the request.
|
||||||
|
let mut request = client.get(&url);
|
||||||
|
|
||||||
|
// Add basic auth if configured.
|
||||||
|
if let Some(user) = &check.basic_auth_user {
|
||||||
|
// Decrypt the password if present.
|
||||||
|
let password = match (&check.basic_auth_pass_encrypted, &check.basic_auth_pass_nonce) {
|
||||||
|
(Some(enc), Some(nonce)) => {
|
||||||
|
match crypto::decrypt(enc, nonce, crypto_key) {
|
||||||
|
Ok(p) => p,
|
||||||
|
Err(e) => {
|
||||||
|
return (
|
||||||
|
false,
|
||||||
|
format!("Failed to decrypt basic auth password: {e}"),
|
||||||
|
);
|
||||||
|
},
|
||||||
|
}
|
||||||
|
},
|
||||||
|
_ => {
|
||||||
|
// No encrypted password stored — treat as missing credentials.
|
||||||
|
return (false, "HTTP check has basic_auth_user but no encrypted password".to_string());
|
||||||
|
},
|
||||||
|
};
|
||||||
|
request = request.basic_auth(user.as_str(), Some(password.as_str()));
|
||||||
|
}
|
||||||
|
|
||||||
|
// Execute the request.
|
||||||
|
let response = match request.send().await {
|
||||||
|
Ok(r) => r,
|
||||||
|
Err(e) => {
|
||||||
|
if e.is_timeout() {
|
||||||
|
return (false, format!("HTTP check timed out: {url}"));
|
||||||
|
} else if e.is_connect() {
|
||||||
|
return (false, format!("HTTP check connection failed: {url}"));
|
||||||
|
} else {
|
||||||
|
return (false, format!("HTTP check request error: {e}"));
|
||||||
|
}
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
let status = response.status();
|
||||||
|
|
||||||
|
// Check HTTP status code.
|
||||||
|
if !status.is_success() {
|
||||||
|
return (
|
||||||
|
false,
|
||||||
|
format!("HTTP check returned status {} for {url}", status.as_u16()),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Read the response body for substring matching.
|
||||||
|
let body = match response.text().await {
|
||||||
|
Ok(b) => b,
|
||||||
|
Err(e) => {
|
||||||
|
return (false, format!("HTTP check failed to read response body: {e}"));
|
||||||
|
},
|
||||||
|
};
|
||||||
|
|
||||||
|
// Check expected_body substring match.
|
||||||
|
if let Some(expected) = &check.expected_body {
|
||||||
|
if !body.contains(expected) {
|
||||||
|
return (
|
||||||
|
false,
|
||||||
|
format!(
|
||||||
|
"HTTP check body mismatch for {url}: expected substring not found"
|
||||||
|
),
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
(true, format!("HTTP check OK for {url} (status {})", status.as_u16()))
|
||||||
|
}
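Review note on `run_http_check`: the pass/fail decision (2xx status, then optional substring match) is independent of the network call, so it could be pulled into a pure function and unit-tested without a server. A minimal sketch, assuming a hypothetical helper named `evaluate_http_response` that takes the status code and body already read from the response:

```rust
/// Decide health from an already-received response.
/// Hypothetical helper; the commit performs these steps inline in `run_http_check`.
pub fn evaluate_http_response(
    url: &str,
    status: u16,
    body: &str,
    expected_body: Option<&str>,
) -> (bool, String) {
    // Only 2xx counts as success, the same range `StatusCode::is_success` accepts.
    if !(200..300).contains(&status) {
        return (false, format!("HTTP check returned status {status} for {url}"));
    }
    // With no expectation configured, the status code alone decides.
    if let Some(expected) = expected_body {
        if !body.contains(expected) {
            return (
                false,
                format!("HTTP check body mismatch for {url}: expected substring not found"),
            );
        }
    }
    (true, format!("HTTP check OK for {url} (status {status})"))
}
```

One thing this makes visible: an empty `expected_body` string always matches, because every string contains the empty substring. If the form submits `''` rather than omitting the field, the body check is effectively disabled.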
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Prune old results
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Delete health check results older than 4 days.
async fn prune_old_results(pool: &PgPool) {
    match sqlx::query(
        "DELETE FROM host_health_check_results WHERE checked_at < NOW() - INTERVAL '4 days'",
    )
    .execute(pool)
    .await
    {
        Ok(result) => {
            if result.rows_affected() > 0 {
                tracing::info!(
                    rows_deleted = result.rows_affected(),
                    "Health check poller: pruned old results"
                );
            }
        },
        Err(e) => {
            tracing::error!(error = %e, "Health check poller: failed to prune old results");
        },
    }
}
|
||||||
|
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
// Health check gate for job executor
|
||||||
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
/// Check whether all enabled health checks for a host are healthy.
///
/// Returns `Ok(true)` if all checks pass (or no checks are configured),
/// `Ok(false)` if any check is unhealthy or has no result yet.
pub async fn check_host_health_checks(pool: &PgPool, host_id: Uuid) -> anyhow::Result<bool> {
    // Check if there are any enabled health checks for this host.
    let check_count: (i64,) = sqlx::query_as(
        "SELECT COUNT(*) FROM host_health_checks WHERE host_id = $1 AND enabled = TRUE",
    )
    .bind(host_id)
    .fetch_one(pool)
    .await?;

    if check_count.0 == 0 {
        // No health checks configured for this host; treat as healthy.
        return Ok(true);
    }

    // Count enabled checks that have no result yet or whose latest result is unhealthy.
    let unhealthy_count: (i64,) = sqlx::query_as(
        r#"
        SELECT COUNT(*)
        FROM host_health_checks hc
        LEFT JOIN LATERAL (
            SELECT healthy
            FROM host_health_check_results r
            WHERE r.check_id = hc.id
            ORDER BY r.checked_at DESC
            LIMIT 1
        ) latest ON true
        WHERE hc.host_id = $1
          AND hc.enabled = TRUE
          AND (latest.healthy IS NULL OR latest.healthy = FALSE)
        "#,
    )
    .bind(host_id)
    .fetch_one(pool)
    .await?;

    Ok(unhealthy_count.0 == 0)
}
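Review note on the gate semantics: the two queries in `check_host_health_checks` reduce to a simple rule over the latest result of each enabled check. The sketch below restates that rule as a pure function so the edge cases are easy to see; `host_passes_gate` and its slice input are illustrative assumptions, not code from the commit.

```rust
/// `latest_results` holds one entry per enabled check: `Some(healthy)` for the
/// most recent stored result, `None` when the check has never produced one.
/// Hypothetical restatement of the SQL gate, not part of the commit.
pub fn host_passes_gate(latest_results: &[Option<bool>]) -> bool {
    // An empty slice means no enabled checks, which the gate treats as healthy.
    // Any `None` (no result yet) or `Some(false)` blocks the job.
    latest_results.iter().all(|r| *r == Some(true))
}
```

Note that a newly created check blocks patching until the poller has recorded its first result, which can take up to one `health_check_poll_interval_secs` interval.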
|
||||||
@ -24,6 +24,7 @@ use uuid::Uuid;
|
|||||||
|
|
||||||
use crate::agent_loader::load_agent_certs;
|
use crate::agent_loader::load_agent_certs;
|
||||||
use crate::email;
|
use crate::email;
|
||||||
|
use crate::health_check_poller::check_host_health_checks;
|
||||||
|
|
||||||
// ─────────────────────────────────────────────────────────────────────────────
|
// ─────────────────────────────────────────────────────────────────────────────
|
||||||
// Internal DB row types
|
// Internal DB row types
|
||||||
@ -78,6 +79,7 @@ struct StatusCounts {
|
|||||||
succeeded_count: i64,
|
succeeded_count: i64,
|
||||||
failed_count: i64,
|
failed_count: i64,
|
||||||
cancelled_count: i64,
|
cancelled_count: i64,
|
||||||
|
waiting_health_check_count: i64,
|
||||||
total_count: i64,
|
total_count: i64,
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -369,6 +371,89 @@ async fn execute_host_job(
|
|||||||
},
|
},
|
||||||
};
|
};
|
||||||
|
|
||||||
|
// ── 1b. Health check gate ──────────────────────────────────────────────
|
||||||
|
// All enabled health checks for this host must be healthy before we proceed.
|
||||||
|
match check_host_health_checks(&pool, host_id).await {
|
||||||
|
Ok(true) => {
|
||||||
|
tracing::debug!(%host_id, "execute_host_job: health checks passed");
|
||||||
|
},
|
||||||
|
Ok(false) => {
|
||||||
|
tracing::info!(%host_id, %pjh_id, "execute_host_job: health checks not passed, setting waiting_health_check");
|
||||||
|
// Check if the maintenance window is still open for this host.
|
||||||
|
let window_open: bool = sqlx::query_scalar(
|
||||||
|
r#"
|
||||||
|
SELECT EXISTS(
|
||||||
|
SELECT 1 FROM maintenance_windows mw
|
||||||
|
WHERE mw.host_id = $1
|
||||||
|
AND mw.enabled = TRUE
|
||||||
|
AND (
|
||||||
|
(mw.recurrence = 'once'
|
||||||
|
AND mw.start_at <= NOW()
|
||||||
|
AND NOW() < mw.start_at + (mw.duration_minutes * INTERVAL '1 minute'))
|
||||||
|
OR
|
||||||
|
(mw.recurrence = 'daily'
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
+ (mw.duration_minutes * INTERVAL '1 minute')))
|
||||||
|
OR
|
||||||
|
(mw.recurrence = 'weekly'
|
||||||
|
AND EXTRACT(DOW FROM NOW() AT TIME ZONE 'UTC') = mw.recurrence_day
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
+ (mw.duration_minutes * INTERVAL '1 minute')))
|
||||||
|
OR
|
||||||
|
(mw.recurrence = 'monthly'
|
||||||
|
AND EXTRACT(DAY FROM NOW() AT TIME ZONE 'UTC') = mw.recurrence_day
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time >= (mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
AND (NOW() AT TIME ZONE 'UTC')::time < ((mw.start_at AT TIME ZONE 'UTC')::time
|
||||||
|
+ (mw.duration_minutes * INTERVAL '1 minute')))
|
||||||
|
)
|
||||||
|
)
|
||||||
|
"#,
|
||||||
|
)
|
||||||
|
.bind(host_id)
|
||||||
|
.fetch_optional(&pool)
|
||||||
|
.await
|
||||||
|
.unwrap_or(Some(true))
|
||||||
|
.unwrap_or(true); // Falls back to true only if the query fails; EXISTS returns false (not NULL) when no window matches
|
||||||
|
|
||||||
|
if !window_open {
|
||||||
|
tracing::warn!(%host_id, %pjh_id, "execute_host_job: health checks not passed and maintenance window closed");
|
||||||
|
handle_host_failure(
|
||||||
|
pool,
|
||||||
|
pjh_id,
|
||||||
|
"Health checks did not pass before maintenance window closed".to_string(),
|
||||||
|
)
|
||||||
|
.await;
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Set status to waiting_health_check and retry in 5 minutes.
|
||||||
|
let retry_at = Utc::now() + ChronoDuration::minutes(5);
|
||||||
|
if let Err(e) = sqlx::query(
|
||||||
|
r#"
|
||||||
|
UPDATE patch_job_hosts
|
||||||
|
SET status = 'waiting_health_check',
|
||||||
|
retry_next_at = $2,
|
||||||
|
last_error = 'Waiting for health checks to pass'
|
||||||
|
WHERE id = $1
|
||||||
|
"#,
|
||||||
|
)
|
||||||
|
.bind(pjh_id)
|
||||||
|
.bind(retry_at)
|
||||||
|
.execute(&pool)
|
||||||
|
.await
|
||||||
|
{
|
||||||
|
tracing::error!(%pjh_id, error = %e, "execute_host_job: failed to set waiting_health_check status");
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
},
|
||||||
|
Err(e) => {
|
||||||
|
tracing::warn!(%host_id, error = %e, "execute_host_job: health check query failed, proceeding anyway");
|
||||||
|
// If we can't query health checks, proceed with the job rather than blocking.
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
// ── 2. Fetch the job's patch_selection ──────────────────────────────────
|
// ── 2. Fetch the job's patch_selection ──────────────────────────────────
|
||||||
let patch_sel: JobPatchSelection =
|
let patch_sel: JobPatchSelection =
|
||||||
match sqlx::query_as("SELECT patch_selection FROM patch_jobs WHERE id = $1")
|
match sqlx::query_as("SELECT patch_selection FROM patch_jobs WHERE id = $1")
|
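Review note on the maintenance window query in the health check gate hunk above: the daily, weekly and monthly branches compare `time` values against `start + duration`. As far as I understand PostgreSQL, `time + interval` wraps past midnight, so a window such as 23:00 for 120 minutes would yield an end of 01:00 and the `>= start AND < end` pair could never both hold. The sketch below shows one wrap-safe way to express the daily case; the function name and the minutes-since-midnight representation are assumptions for illustration, not the project's schema.

```rust
const MINUTES_PER_DAY: u32 = 24 * 60;

/// Is `now` inside a daily window that starts at `start` and lasts `duration`
/// minutes? `now` and `start` are minutes since 00:00 UTC; the window may run
/// past midnight. Hypothetical helper, not part of the commit.
pub fn daily_window_open(now: u32, start: u32, duration: u32) -> bool {
    if duration >= MINUTES_PER_DAY {
        return true; // a window of 24 hours or more is always open
    }
    // Minutes elapsed since the window last started, wrapping at midnight.
    let since_start =
        (now % MINUTES_PER_DAY + MINUTES_PER_DAY - start % MINUTES_PER_DAY) % MINUTES_PER_DAY;
    since_start < duration
}
```

The same "elapsed since start, modulo one day" comparison could be written in SQL; the weekly and monthly branches would additionally need to accept the day after `recurrence_day` for the part of the window that falls after midnight.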
||||||
@ -764,6 +849,7 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
|
|||||||
COUNT(*) FILTER (WHERE status = 'succeeded') AS succeeded_count,
|
COUNT(*) FILTER (WHERE status = 'succeeded') AS succeeded_count,
|
||||||
COUNT(*) FILTER (WHERE status = 'failed') AS failed_count,
|
COUNT(*) FILTER (WHERE status = 'failed') AS failed_count,
|
||||||
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_count,
|
COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_count,
|
||||||
|
COUNT(*) FILTER (WHERE status = 'waiting_health_check') AS waiting_health_check_count,
|
||||||
COUNT(*) AS total_count
|
COUNT(*) AS total_count
|
||||||
FROM patch_job_hosts
|
FROM patch_job_hosts
|
||||||
WHERE job_id = $1
|
WHERE job_id = $1
|
||||||
@ -784,7 +870,7 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
|
|||||||
let new_status: &str;
|
let new_status: &str;
|
||||||
let set_completed: bool;
|
let set_completed: bool;
|
||||||
|
|
||||||
if counts.running_count > 0 || counts.pending_count > 0 || counts.queued_count > 0 {
|
if counts.running_count > 0 || counts.pending_count > 0 || counts.queued_count > 0 || counts.waiting_health_check_count > 0 {
|
||||||
// Still work in flight — keep parent running.
|
// Still work in flight — keep parent running.
|
||||||
new_status = "running";
|
new_status = "running";
|
||||||
set_completed = false;
|
set_completed = false;
|
||||||
@ -912,17 +998,18 @@ async fn sync_job_status(pool: &PgPool, job_id: Uuid) {
|
|||||||
|
|
||||||
/// Find pending host entries whose back-off window has elapsed, reset them to
|
/// Find pending host entries whose back-off window has elapsed, reset them to
|
||||||
/// `queued`, and dispatch them immediately.
|
/// `queued`, and dispatch them immediately.
|
||||||
|
///
|
||||||
|
/// Also retries `waiting_health_check` entries whose retry window has elapsed.
|
||||||
pub async fn retry_pending_jobs(pool: PgPool, config: Arc<AppConfig>) {
|
pub async fn retry_pending_jobs(pool: PgPool, config: Arc<AppConfig>) {
|
||||||
let rows: Vec<PatchJobHostPending> = match sqlx::query_as(
|
let rows: Vec<PatchJobHostPending> = match sqlx::query_as(
|
||||||
r#"
|
r#"
|
||||||
SELECT pjh.id, pjh.host_id, pjh.job_id
|
SELECT pjh.id, pjh.host_id, pjh.job_id
|
||||||
FROM patch_job_hosts pjh
|
FROM patch_job_hosts pjh
|
||||||
JOIN patch_jobs j ON j.id = pjh.job_id
|
JOIN patch_jobs j ON j.id = pjh.job_id
|
||||||
WHERE pjh.status = 'pending'
|
WHERE pjh.status IN ('pending', 'waiting_health_check')
|
||||||
AND pjh.retry_next_at <= NOW()
|
AND pjh.retry_next_at <= NOW()
|
||||||
AND j.status != 'cancelled'
|
AND j.status != 'cancelled'
|
||||||
"#,
|
"#,)
|
||||||
)
|
|
||||||
.fetch_all(&pool)
|
.fetch_all(&pool)
|
||||||
.await
|
.await
|
||||||
{
|
{
|
||||||
|
|||||||
@ -6,6 +6,7 @@
|
|||||||
mod agent_loader;
|
mod agent_loader;
|
||||||
mod audit_verifier;
|
mod audit_verifier;
|
||||||
mod email;
|
mod email;
|
||||||
|
mod health_check_poller;
|
||||||
mod health_poller;
|
mod health_poller;
|
||||||
mod job_executor;
|
mod job_executor;
|
||||||
mod maintenance_scheduler;
|
mod maintenance_scheduler;
|
||||||
@ -19,6 +20,7 @@ use std::{sync::Arc, time::Duration};
|
|||||||
use tokio::time;
|
use tokio::time;
|
||||||
|
|
||||||
use audit_verifier::run_audit_verifier;
|
use audit_verifier::run_audit_verifier;
|
||||||
|
use health_check_poller::run_health_check_poller;
|
||||||
use health_poller::run_health_poller;
|
use health_poller::run_health_poller;
|
||||||
use job_executor::run_job_executor;
|
use job_executor::run_job_executor;
|
||||||
use maintenance_scheduler::run_maintenance_scheduler;
|
use maintenance_scheduler::run_maintenance_scheduler;
|
||||||
@ -29,7 +31,7 @@ use ws_relay::run_ws_relay;
|
|||||||
/// Minimum number of applied migrations the worker requires before
|
/// Minimum number of applied migrations the worker requires before
|
||||||
/// accepting work. Prevents the worker from running against a schema
|
/// accepting work. Prevents the worker from running against a schema
|
||||||
/// that hasn't been migrated yet.
|
/// that hasn't been migrated yet.
|
||||||
const REQUIRED_MIGRATION_COUNT: i64 = 5;
|
const REQUIRED_MIGRATION_COUNT: i64 = 8;
|
||||||
|
|
||||||
/// How long to wait between schema-version checks before giving up.
|
/// How long to wait between schema-version checks before giving up.
|
||||||
const SCHEMA_CHECK_TIMEOUT: Duration = Duration::from_secs(120);
|
const SCHEMA_CHECK_TIMEOUT: Duration = Duration::from_secs(120);
|
||||||
@ -89,6 +91,9 @@ async fn main() -> anyhow::Result<()> {
|
|||||||
// M11: audit integrity verification (runs every 24 hours)
|
// M11: audit integrity verification (runs every 24 hours)
|
||||||
let audit_verifier_handle = tokio::spawn(run_audit_verifier(pool.clone(), config.clone()));
|
let audit_verifier_handle = tokio::spawn(run_audit_verifier(pool.clone(), config.clone()));
|
||||||
|
|
||||||
|
// Health check poller — runs configured service/HTTP health checks
|
||||||
|
let health_check_handle = tokio::spawn(run_health_check_poller(pool.clone(), config.clone()));
|
||||||
|
|
||||||
tracing::info!("Worker tasks started");
|
tracing::info!("Worker tasks started");
|
||||||
|
|
||||||
// Wait for all tasks (they run indefinitely)
|
// Wait for all tasks (they run indefinitely)
|
||||||
|
|||||||
@ -7,6 +7,9 @@ import type {
|
|||||||
UpdateMaintenanceWindowRequest,
|
UpdateMaintenanceWindowRequest,
|
||||||
Certificate,
|
Certificate,
|
||||||
IssuedCert,
|
IssuedCert,
|
||||||
|
HealthCheckWithResult,
|
||||||
|
CreateHealthCheckRequest,
|
||||||
|
UpdateHealthCheckRequest,
|
||||||
} from '../types'
|
} from '../types'
|
||||||
|
|
||||||
const BASE_URL = '/api/v1'
|
const BASE_URL = '/api/v1'
|
||||||
@ -259,3 +262,25 @@ export const settingsApi = {
|
|||||||
updateIpWhitelist: (entries: string[]) => apiClient.put<{ entries: string[] }>('/settings/ip-whitelist', { entries }),
|
updateIpWhitelist: (entries: string[]) => apiClient.put<{ entries: string[] }>('/settings/ip-whitelist', { entries }),
|
||||||
auditIntegrity: () => apiClient.post<AuditIntegrityResult>('/settings/audit-integrity'),
|
auditIntegrity: () => apiClient.post<AuditIntegrityResult>('/settings/audit-integrity'),
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Health Checks API ─────────────────────────────────────────────────────────

export const healthChecksApi = {
  list: (hostId: string) =>
    apiClient.get<HealthCheckWithResult[]>(`/hosts/${hostId}/health-checks`),

  get: (hostId: string, checkId: string) =>
    apiClient.get<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}`),

  create: (hostId: string, body: CreateHealthCheckRequest) =>
    apiClient.post<HealthCheckWithResult>(`/hosts/${hostId}/health-checks`, body),

  update: (hostId: string, checkId: string, body: UpdateHealthCheckRequest) =>
    apiClient.put<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}`, body),

  delete: (hostId: string, checkId: string) =>
    apiClient.delete(`/hosts/${hostId}/health-checks/${checkId}`),

  test: (hostId: string, checkId: string) =>
    apiClient.post<HealthCheckWithResult>(`/hosts/${hostId}/health-checks/${checkId}/test`),
}
|
||||||
|
|||||||
@ -34,13 +34,25 @@ import {
|
|||||||
import {
|
import {
|
||||||
Add as AddIcon,
|
Add as AddIcon,
|
||||||
ArrowBack,
|
ArrowBack,
|
||||||
|
Cancel as CancelIcon,
|
||||||
|
CheckCircle as CheckCircleIcon,
|
||||||
Delete as DeleteIcon,
|
Delete as DeleteIcon,
|
||||||
Edit as EditIcon,
|
Edit as EditIcon,
|
||||||
|
MonitorHeart as MonitorHeartIcon,
|
||||||
|
PlayArrow as PlayArrowIcon,
|
||||||
|
Remove as RemoveIcon,
|
||||||
Schedule as ScheduleIcon,
|
Schedule as ScheduleIcon,
|
||||||
VpnKey as VpnKeyIcon,
|
VpnKey as VpnKeyIcon,
|
||||||
} from '@mui/icons-material'
|
} from '@mui/icons-material'
|
||||||
import { apiClient, maintenanceWindowsApi, certsApi } from '../api/client'
|
import { apiClient, maintenanceWindowsApi, healthChecksApi, certsApi } from '../api/client'
|
||||||
import type { MaintenanceWindow, WindowRecurrence } from '../types'
|
import type {
|
||||||
|
MaintenanceWindow,
|
||||||
|
WindowRecurrence,
|
||||||
|
HealthCheckType,
|
||||||
|
HealthCheckWithResult,
|
||||||
|
CreateHealthCheckRequest,
|
||||||
|
UpdateHealthCheckRequest,
|
||||||
|
} from '../types'
|
||||||
|
|
||||||
// ── Helpers ───────────────────────────────────────────────────────────────────
|
// ── Helpers ───────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
@ -74,7 +86,7 @@ function scheduleDescription(w: MaintenanceWindow): string {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Form value type ───────────────────────────────────────────────────────────
|
// ── Window form value type ────────────────────────────────────────────────────
|
||||||
|
|
||||||
interface FormValues {
|
interface FormValues {
|
||||||
label: string
|
label: string
|
||||||
@ -185,6 +197,114 @@ function WindowFormDialog({ open, title, initial, onClose, onSubmit }: WindowFor
|
|||||||
)
|
)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Health Check form value type ─────────────────────────────────────────────
|
||||||
|
|
||||||
|
interface HealthCheckFormValues {
|
||||||
|
name: string
|
||||||
|
check_type: HealthCheckType
|
||||||
|
service_name: string
|
||||||
|
url: string
|
||||||
|
expected_body: string
|
||||||
|
ignore_cert_errors: boolean
|
||||||
|
basic_auth_user: string
|
||||||
|
basic_auth_pass: string
|
||||||
|
enabled: boolean
|
||||||
|
}
|
||||||
|
|
||||||
|
function defaultHealthCheckForm(): HealthCheckFormValues {
|
||||||
|
return {
|
||||||
|
name: '',
|
||||||
|
check_type: 'service',
|
||||||
|
service_name: '',
|
||||||
|
url: '',
|
||||||
|
expected_body: '',
|
||||||
|
ignore_cert_errors: false,
|
||||||
|
basic_auth_user: '',
|
||||||
|
basic_auth_pass: '',
|
||||||
|
enabled: true,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Health Check form dialog ──────────────────────────────────────────────────
|
||||||
|
|
||||||
|
interface HealthCheckFormDialogProps {
|
||||||
|
open: boolean
|
||||||
|
title: string
|
||||||
|
initial: HealthCheckFormValues
|
||||||
|
onClose: () => void
|
||||||
|
onSubmit: (values: HealthCheckFormValues) => Promise<void>
|
||||||
|
}
|
||||||
|
|
||||||
|
function HealthCheckFormDialog({ open, title, initial, onClose, onSubmit }: HealthCheckFormDialogProps) {
|
||||||
|
const [form, setForm] = useState<HealthCheckFormValues>(initial)
|
||||||
|
const [saving, setSaving] = useState(false)
|
||||||
|
const [err, setErr] = useState<string | null>(null)
|
||||||
|
|
||||||
|
useEffect(() => { setForm(initial); setErr(null) }, [open, initial])
|
||||||
|
|
||||||
|
const set = (field: keyof HealthCheckFormValues, value: HealthCheckFormValues[keyof HealthCheckFormValues]) =>
|
||||||
|
setForm(prev => ({ ...prev, [field]: value }))
|
||||||
|
|
||||||
|
const handleSubmit = async () => {
|
||||||
|
if (!form.name.trim()) { setErr('Name is required'); return }
|
||||||
|
if (form.check_type === 'service' && !form.service_name.trim()) { setErr('Service name is required'); return }
|
||||||
|
if (form.check_type === 'http' && !form.url.trim()) { setErr('URL is required'); return }
|
||||||
|
setSaving(true); setErr(null)
|
||||||
|
try { await onSubmit(form) }
|
||||||
|
catch (e: unknown) {
|
||||||
|
const msg = (e as { response?: { data?: { error?: { message?: string } } } })
|
||||||
|
?.response?.data?.error?.message ?? 'Failed to save'
|
||||||
|
setErr(msg)
|
||||||
|
} finally { setSaving(false) }
|
||||||
|
}
|
||||||
|
|
||||||
|
return (
|
||||||
|
<Dialog open={open} onClose={onClose} maxWidth="sm" fullWidth>
|
||||||
|
<DialogTitle>{title}</DialogTitle>
|
||||||
|
<DialogContent sx={{ display: 'flex', flexDirection: 'column', gap: 2, pt: 2 }}>
|
||||||
|
{err && <Alert severity="error">{err}</Alert>}
|
||||||
|
<TextField label="Name" value={form.name} onChange={e => set('name', e.target.value)} required fullWidth />
|
||||||
|
<FormControl fullWidth>
|
||||||
|
<InputLabel>Check Type</InputLabel>
|
||||||
|
<Select label="Check Type" value={form.check_type} onChange={e => set('check_type', e.target.value as HealthCheckType)}>
|
||||||
|
<MenuItem value="service">Service</MenuItem>
|
||||||
|
<MenuItem value="http">HTTP</MenuItem>
|
||||||
|
</Select>
|
||||||
|
</FormControl>
|
||||||
|
{form.check_type === 'service' && (
|
||||||
|
<TextField label="Service Name" value={form.service_name} onChange={e => set('service_name', e.target.value)} required fullWidth
|
||||||
|
helperText="Systemd service unit name to check" />
|
||||||
|
)}
|
||||||
|
{form.check_type === 'http' && (
|
||||||
|
<>
|
||||||
|
<TextField label="URL" value={form.url} onChange={e => set('url', e.target.value)} required fullWidth
|
||||||
|
helperText="Full URL to check (e.g. https://example.com/health)" />
|
||||||
|
<TextField label="Expected Body (optional)" value={form.expected_body} onChange={e => set('expected_body', e.target.value)} fullWidth
|
||||||
|
helperText="Substring expected in response body" />
|
||||||
|
<FormControlLabel
|
||||||
|
control={<Switch checked={form.ignore_cert_errors} onChange={e => set('ignore_cert_errors', e.target.checked)} />}
|
||||||
|
label="Ignore Certificate Errors"
|
||||||
|
/>
|
||||||
|
<TextField label="Basic Auth User (optional)" value={form.basic_auth_user} onChange={e => set('basic_auth_user', e.target.value)} fullWidth />
|
||||||
|
<TextField label="Basic Auth Password (optional)" type="password" value={form.basic_auth_pass} onChange={e => set('basic_auth_pass', e.target.value)} fullWidth
|
||||||
|
helperText="Leave blank to keep existing password" />
|
||||||
|
</>
|
||||||
|
)}
|
||||||
|
<FormControlLabel
|
||||||
|
control={<Switch checked={form.enabled} onChange={e => set('enabled', e.target.checked)} />}
|
||||||
|
label="Enabled"
|
||||||
|
/>
|
||||||
|
</DialogContent>
|
||||||
|
<DialogActions>
|
||||||
|
<Button onClick={onClose} disabled={saving}>Cancel</Button>
|
||||||
|
<Button variant="contained" onClick={handleSubmit} disabled={saving}>
|
||||||
|
{saving ? <CircularProgress size={20} /> : 'Save'}
|
||||||
|
</Button>
|
||||||
|
</DialogActions>
|
||||||
|
</Dialog>
|
||||||
|
)
|
||||||
|
}
|
||||||
|
|
||||||
// ── Main page ──────────────────────────────────────────────────────────────────
|
// ── Main page ──────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
export default function HostDetailPage() {
|
export default function HostDetailPage() {
|
||||||
@ -201,19 +321,37 @@ export default function HostDetailPage() {
|
|||||||
open: false, message: '', severity: 'success',
|
open: false, message: '', severity: 'success',
|
||||||
})
|
})
|
||||||
|
|
||||||
// Create dialog
|
// Create window dialog
|
||||||
const [createOpen, setCreateOpen] = useState(false)
|
const [createOpen, setCreateOpen] = useState(false)
|
||||||
const [createForm, setCreateForm] = useState<FormValues>(defaultForm())
|
const [createForm, setCreateForm] = useState<FormValues>(defaultForm())
|
||||||
|
|
||||||
// Edit dialog
|
// Edit window dialog
|
||||||
const [editOpen, setEditOpen] = useState(false)
|
const [editOpen, setEditOpen] = useState(false)
|
||||||
const [editWindow, setEditWindow] = useState<MaintenanceWindow | null>(null)
|
const [editWindow, setEditWindow] = useState<MaintenanceWindow | null>(null)
|
||||||
const [editForm, setEditForm] = useState<FormValues>(defaultForm())
|
const [editForm, setEditForm] = useState<FormValues>(defaultForm())
|
||||||
|
|
||||||
// Delete dialog
|
// Delete window dialog
|
||||||
const [deleteOpen, setDeleteOpen] = useState(false)
|
const [deleteOpen, setDeleteOpen] = useState(false)
|
||||||
const [deleteTarget, setDeleteTarget] = useState<MaintenanceWindow | null>(null)
|
const [deleteTarget, setDeleteTarget] = useState<MaintenanceWindow | null>(null)
|
||||||
|
|
||||||
|
// Health checks state
|
||||||
|
const [healthChecks, setHealthChecks] = useState<HealthCheckWithResult[]>([])
|
||||||
|
const [hcLoading, setHcLoading] = useState(false)
|
||||||
|
const [testingId, setTestingId] = useState<string | null>(null)
|
||||||
|
|
||||||
|
// Create health check dialog
|
||||||
|
const [hcCreateOpen, setHcCreateOpen] = useState(false)
|
||||||
|
const [hcCreateForm, setHcCreateForm] = useState<HealthCheckFormValues>(defaultHealthCheckForm())
|
||||||
|
|
||||||
|
// Edit health check dialog
|
||||||
|
const [hcEditOpen, setHcEditOpen] = useState(false)
|
||||||
|
const [hcEditTarget, setHcEditTarget] = useState<HealthCheckWithResult | null>(null)
|
||||||
|
const [hcEditForm, setHcEditForm] = useState<HealthCheckFormValues>(defaultHealthCheckForm())
|
||||||
|
|
||||||
|
// Delete health check dialog
|
||||||
|
const [hcDeleteOpen, setHcDeleteOpen] = useState(false)
|
||||||
|
const [hcDeleteTarget, setHcDeleteTarget] = useState<HealthCheckWithResult | null>(null)
|
||||||
|
|
||||||
// ── Fetch host ────────────────────────────────────────────────────────────
|
// ── Fetch host ────────────────────────────────────────────────────────────
|
||||||
useEffect(() => {
|
useEffect(() => {
|
||||||
apiClient.get(`/hosts/${id}`)
|
apiClient.get(`/hosts/${id}`)
|
||||||
@ -235,6 +373,19 @@ export default function HostDetailPage() {
|
|||||||
|
|
||||||
useEffect(() => { fetchWindows() }, [fetchWindows])
|
useEffect(() => { fetchWindows() }, [fetchWindows])
|
||||||
|
|
||||||
|
// ── Fetch health checks ───────────────────────────────────────────────────
|
||||||
|
const fetchHealthChecks = useCallback(async () => {
|
||||||
|
if (!id) return
|
||||||
|
setHcLoading(true)
|
||||||
|
try {
|
||||||
|
const res = await healthChecksApi.list(id)
|
||||||
|
setHealthChecks(Array.isArray(res.data) ? res.data : [])
|
||||||
|
} catch { /* ignore */ }
|
||||||
|
finally { setHcLoading(false) }
|
||||||
|
}, [id])
|
||||||
|
|
||||||
|
useEffect(() => { fetchHealthChecks() }, [fetchHealthChecks])
|
||||||
|
|
||||||
const showSnack = (message: string, severity: 'success' | 'error') =>
|
const showSnack = (message: string, severity: 'success' | 'error') =>
|
||||||
setSnackbar({ open: true, message, severity })
|
setSnackbar({ open: true, message, severity })
|
||||||
|
|
||||||
@ -312,6 +463,105 @@ export default function HostDetailPage() {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Create health check ──────────────────────────────────────────────────
|
||||||
|
const handleHcCreateSubmit = async (values: HealthCheckFormValues) => {
|
||||||
|
if (!id) return
|
||||||
|
const body: CreateHealthCheckRequest = {
|
||||||
|
name: values.name,
|
||||||
|
check_type: values.check_type,
|
||||||
|
}
|
||||||
|
if (values.check_type === 'service') {
|
||||||
|
body.service_name = values.service_name || undefined
|
||||||
|
} else {
|
||||||
|
body.url = values.url || undefined
|
||||||
|
body.expected_body = values.expected_body || undefined
|
||||||
|
body.ignore_cert_errors = values.ignore_cert_errors
|
||||||
|
body.basic_auth_user = values.basic_auth_user || undefined
|
||||||
|
body.basic_auth_pass = values.basic_auth_pass || undefined
|
||||||
|
}
|
||||||
|
await healthChecksApi.create(id, body)
|
||||||
|
setHcCreateOpen(false)
|
||||||
|
showSnack('Health check created', 'success')
|
||||||
|
await fetchHealthChecks()
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Edit health check ────────────────────────────────────────────────────
|
||||||
|
const handleHcEditClick = (check: HealthCheckWithResult) => {
|
||||||
|
setHcEditTarget(check)
|
||||||
|
setHcEditForm({
|
||||||
|
name: check.name,
|
||||||
|
check_type: check.check_type,
|
||||||
|
service_name: check.service_name ?? '',
|
||||||
|
url: check.url ?? '',
|
||||||
|
expected_body: check.expected_body ?? '',
|
||||||
|
ignore_cert_errors: check.ignore_cert_errors,
|
||||||
|
basic_auth_user: check.basic_auth_user ?? '',
|
||||||
|
basic_auth_pass: '',
|
||||||
|
enabled: check.enabled,
|
||||||
|
})
|
||||||
|
setHcEditOpen(true)
|
||||||
|
}
|
||||||
|
|
||||||
|
const handleHcEditSubmit = async (values: HealthCheckFormValues) => {
|
||||||
|
if (!id || !hcEditTarget) return
|
||||||
|
const body: UpdateHealthCheckRequest = {
|
||||||
|
name: values.name,
|
||||||
|
enabled: values.enabled,
|
||||||
|
}
|
||||||
|
if (values.check_type === 'service') {
|
||||||
|
body.service_name = values.service_name || undefined
|
||||||
|
} else {
|
||||||
|
body.url = values.url || undefined
|
||||||
|
body.expected_body = values.expected_body || undefined
|
||||||
|
body.ignore_cert_errors = values.ignore_cert_errors
|
||||||
|
body.basic_auth_user = values.basic_auth_user || undefined
|
||||||
|
body.basic_auth_pass = values.basic_auth_pass || undefined
|
||||||
|
}
|
||||||
|
await healthChecksApi.update(id, hcEditTarget.id, body)
|
||||||
|
setHcEditOpen(false)
|
||||||
|
showSnack('Health check updated', 'success')
|
||||||
|
await fetchHealthChecks()
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Delete health check ──────────────────────────────────────────────────
|
||||||
|
const handleHcDeleteConfirm = async () => {
|
||||||
|
if (!id || !hcDeleteTarget) return
|
||||||
|
try {
|
||||||
|
await healthChecksApi.delete(id, hcDeleteTarget.id)
|
||||||
|
setHcDeleteOpen(false)
|
||||||
|
showSnack('Health check deleted', 'success')
|
||||||
|
await fetchHealthChecks()
|
||||||
|
} catch {
|
||||||
|
showSnack('Failed to delete health check', 'error')
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Toggle health check enabled ──────────────────────────────────────────
|
||||||
|
const handleToggleEnabled = async (check: HealthCheckWithResult) => {
|
||||||
|
if (!id) return
|
||||||
|
try {
|
||||||
|
await healthChecksApi.update(id, check.id, { enabled: !check.enabled })
|
||||||
|
await fetchHealthChecks()
|
||||||
|
} catch {
|
||||||
|
showSnack('Failed to toggle health check', 'error')
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ── Test health check ────────────────────────────────────────────────────
|
||||||
|
const handleTestCheck = async (check: HealthCheckWithResult) => {
|
||||||
|
if (!id) return
|
||||||
|
setTestingId(check.id)
|
||||||
|
try {
|
||||||
|
await healthChecksApi.test(id, check.id)
|
||||||
|
await fetchHealthChecks()
|
||||||
|
showSnack('Health check test completed', 'success')
|
||||||
|
} catch {
|
||||||
|
showSnack('Health check test failed', 'error')
|
||||||
|
} finally {
|
||||||
|
setTestingId(null)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// ── Render ────────────────────────────────────────────────────────────────
|
// ── Render ────────────────────────────────────────────────────────────────
|
||||||
if (loading) return <Box display="flex" justifyContent="center" mt={8}><CircularProgress /></Box>
|
if (loading) return <Box display="flex" justifyContent="center" mt={8}><CircularProgress /></Box>
|
||||||
if (error) return <Container sx={{ mt: 4 }}><Alert severity="error">{error}</Alert></Container>
|
if (error) return <Container sx={{ mt: 4 }}><Alert severity="error">{error}</Alert></Container>
|
||||||
@ -350,7 +600,7 @@ export default function HostDetailPage() {
|
|||||||
</Paper>
|
</Paper>
|
||||||
|
|
||||||
{/* ── Maintenance Windows ──────────────────────────────────────────── */}
|
{/* ── Maintenance Windows ──────────────────────────────────────────── */}
|
||||||
<Paper sx={{ p: 3 }}>
|
<Paper sx={{ p: 3, mb: 3 }}>
|
||||||
<Box sx={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', mb: 2 }}>
|
<Box sx={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', mb: 2 }}>
|
||||||
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
|
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
|
||||||
<ScheduleIcon color="primary" />
|
<ScheduleIcon color="primary" />
|
||||||
@ -427,6 +677,127 @@ export default function HostDetailPage() {
|
|||||||
)}
|
)}
|
||||||
</Paper>
|
</Paper>
|
||||||
|
|
||||||
|
{/* ── Health Checks ────────────────────────────────────────────────── */}
|
||||||
|
<Paper sx={{ p: 3, mb: 3 }}>
|
||||||
|
<Box sx={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', mb: 2 }}>
|
||||||
|
<Box sx={{ display: 'flex', alignItems: 'center', gap: 1 }}>
|
||||||
|
<MonitorHeartIcon color="primary" />
|
||||||
|
<Typography variant="h6" fontWeight={600}>Health Checks</Typography>
|
||||||
|
</Box>
|
||||||
|
<Button
|
||||||
|
startIcon={<AddIcon />}
|
||||||
|
variant="outlined"
|
||||||
|
size="small"
|
||||||
|
disabled={healthChecks.length >= 5}
|
||||||
|
onClick={() => { setHcCreateForm(defaultHealthCheckForm()); setHcCreateOpen(true) }}
|
||||||
|
>
|
||||||
|
Add Health Check
|
||||||
|
</Button>
|
||||||
|
</Box>
|
||||||
|
<Divider sx={{ mb: 2 }} />
|
||||||
|
|
||||||
|
<Typography variant="body2" color="text.secondary" sx={{ mb: 2 }}>
|
||||||
|
Monitor host health with service and HTTP checks. Maximum 5 checks per host.
|
||||||
|
</Typography>
|
||||||
|
|
||||||
|
{hcLoading ? (
|
||||||
|
<Box display="flex" justifyContent="center" py={3}><CircularProgress size={28} /></Box>
|
||||||
|
) : healthChecks.length === 0 ? (
|
||||||
|
<Alert severity="info">
|
||||||
|
No health checks configured. Add a check to monitor this host's health.
|
||||||
|
</Alert>
|
||||||
|
) : (
|
||||||
|
<Table size="small">
|
||||||
|
<TableHead>
|
||||||
|
<TableRow>
|
||||||
|
<TableCell>Name</TableCell>
|
||||||
|
<TableCell>Type</TableCell>
|
||||||
|
<TableCell>Status</TableCell>
|
||||||
|
<TableCell>Enabled</TableCell>
|
||||||
|
<TableCell>Detail</TableCell>
|
||||||
|
<TableCell>Latency</TableCell>
|
||||||
|
<TableCell>Last Checked</TableCell>
|
||||||
|
<TableCell align="right">Actions</TableCell>
|
||||||
|
</TableRow>
|
||||||
|
</TableHead>
|
||||||
|
<TableBody>
|
||||||
|
{healthChecks.map(check => (
|
||||||
|
<TableRow key={check.id} hover>
|
||||||
|
<TableCell>{check.name}</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
<Chip label={check.check_type} size="small" variant="outlined" />
|
||||||
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
{check.last_result ? (
|
||||||
|
check.last_result.healthy ? (
|
||||||
|
<Tooltip title="Healthy">
|
||||||
|
<CheckCircleIcon color="success" fontSize="small" />
|
||||||
|
</Tooltip>
|
||||||
|
) : (
|
||||||
|
<Tooltip title="Unhealthy">
|
||||||
|
<CancelIcon color="error" fontSize="small" />
|
||||||
|
</Tooltip>
|
||||||
|
)
|
||||||
|
) : (
|
||||||
|
<Tooltip title="No result yet">
|
||||||
|
<RemoveIcon color="disabled" fontSize="small" />
|
||||||
|
</Tooltip>
|
||||||
|
)}
|
||||||
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
<Switch
|
||||||
|
size="small"
|
||||||
|
checked={check.enabled}
|
||||||
|
onChange={() => handleToggleEnabled(check)}
|
||||||
|
/>
|
||||||
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
<Typography variant="body2" sx={{ maxWidth: 200, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
|
||||||
|
{check.last_result?.detail ?? '—'}
|
||||||
|
</Typography>
|
||||||
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
{check.last_result?.latency_ms != null ? `${check.last_result.latency_ms} ms` : '—'}
|
||||||
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
{check.last_result?.checked_at
|
||||||
|
? new Date(check.last_result.checked_at).toLocaleString()
|
||||||
|
: '—'}
|
||||||
|
</TableCell>
|
||||||
|
<TableCell align="right">
|
||||||
|
<Tooltip title="Test now">
|
||||||
|
<IconButton
|
||||||
|
size="small"
|
||||||
|
color="primary"
|
||||||
|
disabled={testingId === check.id}
|
||||||
|
onClick={() => handleTestCheck(check)}
|
||||||
|
>
|
||||||
|
{testingId === check.id
|
||||||
|
? <CircularProgress size={16} />
|
||||||
|
: <PlayArrowIcon fontSize="small" />}
|
||||||
|
</IconButton>
|
||||||
|
</Tooltip>
|
||||||
|
<Tooltip title="Edit">
|
||||||
|
<IconButton size="small" onClick={() => handleHcEditClick(check)}>
|
||||||
|
<EditIcon fontSize="small" />
|
||||||
|
</IconButton>
|
||||||
|
</Tooltip>
|
||||||
|
<Tooltip title="Delete">
|
||||||
|
<IconButton
|
||||||
|
size="small" color="error"
|
||||||
|
onClick={() => { setHcDeleteTarget(check); setHcDeleteOpen(true) }}
|
||||||
|
>
|
||||||
|
<DeleteIcon fontSize="small" />
|
||||||
|
</IconButton>
|
||||||
|
</Tooltip>
|
||||||
|
</TableCell>
|
||||||
|
</TableRow>
|
||||||
|
))}
|
||||||
|
</TableBody>
|
||||||
|
</Table>
|
||||||
|
)}
|
||||||
|
</Paper>
|
||||||
|
|
||||||
{/* ── Dialogs ─────────────────────────────────────────────────────── */}
|
{/* ── Dialogs ─────────────────────────────────────────────────────── */}
|
||||||
<WindowFormDialog
|
<WindowFormDialog
|
||||||
open={createOpen}
|
open={createOpen}
|
||||||
@ -455,6 +826,34 @@ export default function HostDetailPage() {
|
|||||||
</DialogActions>
|
</DialogActions>
|
||||||
</Dialog>
|
</Dialog>
|
||||||
|
|
||||||
|
{/* Health Check Dialogs */}
|
||||||
|
<HealthCheckFormDialog
|
||||||
|
open={hcCreateOpen}
|
||||||
|
title="Add Health Check"
|
||||||
|
initial={hcCreateForm}
|
||||||
|
onClose={() => setHcCreateOpen(false)}
|
||||||
|
onSubmit={handleHcCreateSubmit}
|
||||||
|
/>
|
||||||
|
<HealthCheckFormDialog
|
||||||
|
open={hcEditOpen}
|
||||||
|
title="Edit Health Check"
|
||||||
|
initial={hcEditForm}
|
||||||
|
onClose={() => setHcEditOpen(false)}
|
||||||
|
onSubmit={handleHcEditSubmit}
|
||||||
|
/>
|
||||||
|
<Dialog open={hcDeleteOpen} onClose={() => setHcDeleteOpen(false)} maxWidth="xs" fullWidth>
|
||||||
|
<DialogTitle>Delete Health Check</DialogTitle>
|
||||||
|
<DialogContent>
|
||||||
|
<Typography>
|
||||||
|
Delete <strong>{hcDeleteTarget?.name}</strong>? This cannot be undone.
|
||||||
|
</Typography>
|
||||||
|
</DialogContent>
|
||||||
|
<DialogActions>
|
||||||
|
<Button onClick={() => setHcDeleteOpen(false)}>Cancel</Button>
|
||||||
|
<Button color="error" variant="contained" onClick={handleHcDeleteConfirm}>Delete</Button>
|
||||||
|
</DialogActions>
|
||||||
|
</Dialog>
|
||||||
|
|
||||||
{/* Snackbar */}
|
{/* Snackbar */}
|
||||||
<Snackbar
|
<Snackbar
|
||||||
open={snackbar.open}
|
open={snackbar.open}
|
||||||
|
|||||||
@ -5,6 +5,7 @@ import {
|
|||||||
TableRow, TextField, Toolbar, Tooltip, Typography,
|
TableRow, TextField, Toolbar, Tooltip, Typography,
|
||||||
} from '@mui/material'
|
} from '@mui/material'
|
||||||
import { Add as AddIcon, Refresh as RefreshIcon, Delete as DeleteIcon } from '@mui/icons-material'
|
import { Add as AddIcon, Refresh as RefreshIcon, Delete as DeleteIcon } from '@mui/icons-material'
|
||||||
|
import { CheckCircle as CheckCircleIcon, Cancel as CancelIcon, Remove as RemoveIcon } from '@mui/icons-material'
|
||||||
import { useNavigate } from 'react-router-dom'
|
import { useNavigate } from 'react-router-dom'
|
||||||
import { apiClient, hostsApi } from '../api/client'
|
import { apiClient, hostsApi } from '../api/client'
|
||||||
import type { Host, HostHealthStatus } from '../types'
|
import type { Host, HostHealthStatus } from '../types'
|
||||||
@ -67,6 +68,7 @@ export default function HostsPage() {
|
|||||||
<TableCell>IP Address</TableCell>
|
<TableCell>IP Address</TableCell>
|
||||||
<TableCell>OS</TableCell>
|
<TableCell>OS</TableCell>
|
||||||
<TableCell>Health</TableCell>
|
<TableCell>Health</TableCell>
|
||||||
|
<TableCell>Checks</TableCell>
|
||||||
<TableCell>Agent</TableCell>
|
<TableCell>Agent</TableCell>
|
||||||
<TableCell>Actions</TableCell>
|
<TableCell>Actions</TableCell>
|
||||||
</TableRow>
|
</TableRow>
|
||||||
@ -82,6 +84,15 @@ export default function HostsPage() {
|
|||||||
<TableCell>
|
<TableCell>
|
||||||
<Chip size="small" label={h.health_status} color={statusColor(h.health_status)} />
|
<Chip size="small" label={h.health_status} color={statusColor(h.health_status)} />
|
||||||
</TableCell>
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
{h.health_check_status === 'all_healthy' ? (
|
||||||
|
<Tooltip title="All checks healthy"><CheckCircleIcon color="success" fontSize="small" /></Tooltip>
|
||||||
|
) : h.health_check_status === 'some_unhealthy' ? (
|
||||||
|
<Tooltip title="Some checks unhealthy"><CancelIcon color="error" fontSize="small" /></Tooltip>
|
||||||
|
) : (
|
||||||
|
<Tooltip title="No checks configured"><RemoveIcon color="disabled" fontSize="small" /></Tooltip>
|
||||||
|
)}
|
||||||
|
</TableCell>
|
||||||
<TableCell>{h.agent_version ?? '—'}</TableCell>
|
<TableCell>{h.agent_version ?? '—'}</TableCell>
|
||||||
<TableCell onClick={e => e.stopPropagation()}>
|
<TableCell onClick={e => e.stopPropagation()}>
|
||||||
<Tooltip title="Request refresh">
|
<Tooltip title="Request refresh">
|
||||||
|
|||||||
@ -22,8 +22,10 @@ import {
|
|||||||
TextField,
|
TextField,
|
||||||
Toolbar,
|
Toolbar,
|
||||||
Typography,
|
Typography,
|
||||||
|
Tooltip,
|
||||||
} from '@mui/material'
|
} from '@mui/material'
|
||||||
import { Search as SearchIcon } from '@mui/icons-material'
|
import { Search as SearchIcon } from '@mui/icons-material'
|
||||||
|
import { CheckCircle as CheckCircleIcon, Cancel as CancelIcon, Remove as RemoveIcon } from '@mui/icons-material'
|
||||||
import { useNavigate } from 'react-router-dom'
|
import { useNavigate } from 'react-router-dom'
|
||||||
import { hostsApi, jobsApi } from '../api/client'
|
import { hostsApi, jobsApi } from '../api/client'
|
||||||
import type { Host, HostHealthStatus } from '../types'
|
import type { Host, HostHealthStatus } from '../types'
|
||||||
@ -256,6 +258,7 @@ export default function PatchDeploymentPage() {
|
|||||||
<TableCell>FQDN</TableCell>
|
<TableCell>FQDN</TableCell>
|
||||||
<TableCell>IP Address</TableCell>
|
<TableCell>IP Address</TableCell>
|
||||||
<TableCell>Health</TableCell>
|
<TableCell>Health</TableCell>
|
||||||
|
<TableCell>Checks</TableCell>
|
||||||
<TableCell>Patches</TableCell>
|
<TableCell>Patches</TableCell>
|
||||||
<TableCell>OS</TableCell>
|
<TableCell>OS</TableCell>
|
||||||
</TableRow>
|
</TableRow>
|
||||||
@ -263,7 +266,7 @@ export default function PatchDeploymentPage() {
|
|||||||
<TableBody>
|
<TableBody>
|
||||||
{filteredHosts.length === 0 ? (
|
{filteredHosts.length === 0 ? (
|
||||||
<TableRow>
|
<TableRow>
|
||||||
<TableCell colSpan={7} align="center">
|
<TableCell colSpan={8} align="center">
|
||||||
<Typography variant="body2" color="text.secondary" py={2}>
|
<Typography variant="body2" color="text.secondary" py={2}>
|
||||||
No hosts found
|
No hosts found
|
||||||
</Typography>
|
</Typography>
|
||||||
@ -291,6 +294,15 @@ export default function PatchDeploymentPage() {
|
|||||||
<TableCell>
|
<TableCell>
|
||||||
<HealthChip status={host.health_status} />
|
<HealthChip status={host.health_status} />
|
||||||
</TableCell>
|
</TableCell>
|
||||||
|
<TableCell>
|
||||||
|
{host.health_check_status === 'all_healthy' ? (
|
||||||
|
<Tooltip title="All checks healthy"><CheckCircleIcon color="success" fontSize="small" /></Tooltip>
|
||||||
|
) : host.health_check_status === 'some_unhealthy' ? (
|
||||||
|
<Tooltip title="Some checks unhealthy"><CancelIcon color="error" fontSize="small" /></Tooltip>
|
||||||
|
) : (
|
||||||
|
<Tooltip title="No checks configured"><RemoveIcon color="disabled" fontSize="small" /></Tooltip>
|
||||||
|
)}
|
||||||
|
</TableCell>
|
||||||
<TableCell>
|
<TableCell>
|
||||||
<Chip
|
<Chip
|
||||||
label={host.patches_missing}
|
label={host.patches_missing}
|
||||||
|
|||||||
@ -26,6 +26,7 @@ export interface Host {
|
|||||||
agent_version?: string
|
agent_version?: string
|
||||||
patches_missing: number
|
patches_missing: number
|
||||||
registered_at: string
|
registered_at: string
|
||||||
|
health_check_status?: 'all_healthy' | 'some_unhealthy' | 'none'
|
||||||
}
|
}
|
||||||
|
|
||||||
export interface Group {
|
export interface Group {
|
||||||
@ -253,3 +254,57 @@ export interface AuditIntegrityResult {
|
|||||||
}
|
}
|
||||||
|
|
||||||
export type ReportFormat = 'csv' | 'pdf'
|
export type ReportFormat = 'csv' | 'pdf'
|
||||||
|
|
||||||
|
// ── Health Checks ────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
export type HealthCheckType = 'service' | 'http'
|
||||||
|
|
||||||
|
export interface HealthCheck {
|
||||||
|
id: string
|
||||||
|
host_id: string
|
||||||
|
name: string
|
||||||
|
check_type: HealthCheckType
|
||||||
|
enabled: boolean
|
||||||
|
service_name?: string
|
||||||
|
url?: string
|
||||||
|
expected_body?: string
|
||||||
|
ignore_cert_errors: boolean
|
||||||
|
basic_auth_user?: string
|
||||||
|
created_at: string
|
||||||
|
updated_at: string
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface HealthCheckResult {
|
||||||
|
id: string
|
||||||
|
check_id: string
|
||||||
|
healthy: boolean
|
||||||
|
detail?: string
|
||||||
|
latency_ms?: number
|
||||||
|
checked_at: string
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface HealthCheckWithResult extends HealthCheck {
|
||||||
|
last_result?: HealthCheckResult
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface CreateHealthCheckRequest {
|
||||||
|
name: string
|
||||||
|
check_type: HealthCheckType
|
||||||
|
service_name?: string
|
||||||
|
url?: string
|
||||||
|
expected_body?: string
|
||||||
|
ignore_cert_errors?: boolean
|
||||||
|
basic_auth_user?: string
|
||||||
|
basic_auth_pass?: string
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface UpdateHealthCheckRequest {
|
||||||
|
name?: string
|
||||||
|
enabled?: boolean
|
||||||
|
service_name?: string
|
||||||
|
url?: string
|
||||||
|
expected_body?: string
|
||||||
|
ignore_cert_errors?: boolean
|
||||||
|
basic_auth_user?: string
|
||||||
|
basic_auth_pass?: string
|
||||||
|
}
|
||||||
|
|||||||
42
migrations/007_health_checks.sql
Normal file
@ -0,0 +1,42 @@
|
|||||||
|
-- Migration 007: Health check configuration and results
|
||||||
|
|
||||||
|
-- Health checks configured per host (1-5 per host)
|
||||||
|
CREATE TABLE host_health_checks (
|
||||||
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||||
|
host_id UUID NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
|
||||||
|
name VARCHAR(100) NOT NULL,
|
||||||
|
check_type VARCHAR(20) NOT NULL CHECK (check_type IN ('service', 'http')),
|
||||||
|
enabled BOOLEAN NOT NULL DEFAULT true,
|
||||||
|
-- Service check fields (Type 1)
|
||||||
|
service_name VARCHAR(200),
|
||||||
|
-- HTTP check fields (Type 2)
|
||||||
|
url TEXT,
|
||||||
|
expected_body VARCHAR(500),
|
||||||
|
ignore_cert_errors BOOLEAN DEFAULT true,
|
||||||
|
basic_auth_user VARCHAR(100),
|
||||||
|
basic_auth_pass_encrypted BYTEA, -- AES-256-GCM encrypted
|
||||||
|
basic_auth_pass_nonce BYTEA, -- nonce for AES-GCM
|
||||||
|
-- Metadata
|
||||||
|
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||||
|
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||||
|
-- Constraint: service checks must have service_name, http checks must have url + expected_body
|
||||||
|
CONSTRAINT valid_service_check CHECK (
|
||||||
|
(check_type = 'service' AND service_name IS NOT NULL AND url IS NULL)
|
||||||
|
OR
|
||||||
|
(check_type = 'http' AND url IS NOT NULL AND expected_body IS NOT NULL AND service_name IS NULL)
|
||||||
|
)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE INDEX idx_health_checks_host ON host_health_checks (host_id);
|
||||||
|
|
||||||
|
-- Health check poll results (4-day retention, pruned by worker)
|
||||||
|
CREATE TABLE host_health_check_results (
|
||||||
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||||
|
check_id UUID NOT NULL REFERENCES host_health_checks(id) ON DELETE CASCADE,
|
||||||
|
healthy BOOLEAN NOT NULL,
|
||||||
|
detail TEXT,
|
||||||
|
latency_ms INTEGER,
|
||||||
|
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE INDEX idx_health_results_check ON host_health_check_results (check_id, checked_at DESC);
|
||||||
4
migrations/008_health_check_worker.sql
Normal file
@ -0,0 +1,4 @@
|
|||||||
|
-- Migration 008: Health check worker support
|
||||||
|
-- Adds 'waiting_health_check' to the job_status enum for pre-patch health gates.
|
||||||
|
|
||||||
|
ALTER TYPE job_status ADD VALUE IF NOT EXISTS 'waiting_health_check';
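The new `waiting_health_check` status supports the pre-patch gate: per the plan in tasks/todo.md, failed checks are retried at 5-minute intervals until the maintenance window closes. A minimal TypeScript sketch of that decision follows. The real gate lives in the Rust `job_executor.rs`; `decideGate`, `GateDecision`, and the `give_up` outcome are hypothetical names, and both the behaviour once the window closes and the handling of a host with no checks are assumptions.

```typescript
// Hypothetical sketch of the pre-patch health gate; not the project's Rust code.
const RETRY_INTERVAL_MS = 5 * 60 * 1000 // 5-minute retry interval from the plan

type GateDecision =
  | { action: 'proceed' }               // all checks healthy, patching may start
  | { action: 'wait'; retryAt: Date }   // job would sit in 'waiting_health_check'
  | { action: 'give_up' }               // assumption: no retry fits before the window closes

// results holds one boolean per enabled check (true = healthy).
// Assumption: a host with no checks passes the gate, since every() is true on [].
function decideGate(results: boolean[], now: Date, windowEnd: Date): GateDecision {
  if (results.every(Boolean)) return { action: 'proceed' }
  const retryAt = new Date(now.getTime() + RETRY_INTERVAL_MS)
  return retryAt.getTime() < windowEnd.getTime()
    ? { action: 'wait', retryAt }
    : { action: 'give_up' }
}
```

Keeping the decision pure (no I/O) makes the retry and window boundaries easy to unit-test separately from the polling code.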
|
||||||
278
tasks/todo.md
@ -1,37 +1,253 @@
|
|||||||
# WebSocket + Polling Fallback Implementation Plan
|
# Health Check Configuration for Hosts — Feature Plan
|
||||||
|
|
||||||
## Problem
|
## Overview
|
||||||
The linux-patch-api agent's `/api/v1/ws/jobs` endpoint was a stub that returned HTTP 101
|
Each host can have 1-5 health checks. During maintenance windows, all checks must be healthy before patch execution. Health checks are also continuously polled for dashboard status.
|
||||||
with a JSON body but didn't compute the required `Sec-WebSocket-Accept` header. This
|
|
||||||
caused the pm-worker WS relay to fail with "Key mismatch in Sec-WebSocket-Accept header".
|
|
||||||
|
|
||||||
Additionally, the pm-worker WS relay's rustls ClientConfig didn't set ALPN to http/1.1,
|
## Design Decisions (confirmed with Kelly)
|
||||||
causing HTTP/2 negotiation which also breaks WebSocket upgrades.
|
- Health check config lives in manager DB, manager controls everything
|
||||||
|
- Agent is just an action endpoint (no health check logic on agent)
|
||||||
|
- Admins and operators can manage (operators need matching group)
|
||||||
|
- Each check belongs to exactly one host (no shareable templates)
|
||||||
|
- Checks are polled continuously AND must all be healthy before patch execution
|
||||||
|
- Service health: query agent `GET /api/v1/system/services/{name}`, check `healthy` field
|
||||||
|
- HTTP health: manager makes direct HTTP request, substring match in response body
|
||||||
|
- Retry failed checks at 5-minute intervals until maintenance window closes
|
||||||
|
- 10-second check timeout; no response = failed
|
||||||
|
- Check execution order doesn't matter
|
||||||
|
- Basic auth password: encrypted in DB with per-install app key
|
||||||
|
- Health check poll interval: 5 minutes
|
||||||
|
- Result retention: 4 days (time-based)
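
The HTTP-check decisions above (direct request from the manager, substring match on the response body, 10-second timeout, no response counts as failed) could be sketched roughly as below. This is an illustration in TypeScript only; the actual poller is written in Rust (`health_check_poller.rs`). `evaluateHttpBody`, `runHttpCheck`, `BodyFetcher`, and `CHECK_TIMEOUT_MS` are hypothetical names, and the injected fetcher is assumed to reject on timeout or network error.

```typescript
// Hypothetical sketch; not the project's actual Rust implementation.
const CHECK_TIMEOUT_MS = 10_000 // 10-second check timeout from the decisions above

interface CheckOutcome {
  healthy: boolean
  detail: string
}

// Pure decision step: a missing response (null) is a failure,
// otherwise the expected text must appear somewhere in the body.
function evaluateHttpBody(body: string | null, expectedBody: string): CheckOutcome {
  if (body === null) return { healthy: false, detail: 'no response' }
  return body.includes(expectedBody)
    ? { healthy: true, detail: 'expected body found' }
    : { healthy: false, detail: 'expected body not found' }
}

// Assumed transport: resolves with the response body, rejects on timeout or error.
type BodyFetcher = (url: string, timeoutMs: number) => Promise<string>

// I/O step: any rejection from the fetcher is treated as "no response".
async function runHttpCheck(
  url: string,
  expectedBody: string,
  fetchBody: BodyFetcher,
): Promise<CheckOutcome> {
  try {
    return evaluateHttpBody(await fetchBody(url, CHECK_TIMEOUT_MS), expectedBody)
  } catch {
    return evaluateHttpBody(null, expectedBody)
  }
}
```

Injecting the fetcher keeps the sketch independent of any particular HTTP client and lets the substring rule be tested without network access.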
|
||||||
|
|
||||||
## Root Causes
|
## linux_patch_api Agent Endpoints (Reference)
|
||||||
1. **Agent WS handler was a stub** — didn't implement RFC 6455 WebSocket handshake
|
|
||||||
2. **WS relay missing ALPN** — rustls ClientConfig didn't set `alpn_protocols` to `http/1.1`
|
|
||||||
3. **No fallback** — WS relay had no fallback if WebSocket connection failed
|
|
||||||
|
|
||||||
## Completed
|
### GET /api/v1/system/services/{name}
|
||||||
- [x] ALPN fix in pm-worker ws_relay.rs (forces HTTP/1.1 for WebSocket)
|
Returns service status for health check Type 1 (service). Added in commit 8b6d9ed.
|
||||||
- [x] Error chain logging in pm-worker ws_relay.rs (for future debugging)
|
|
||||||
- [x] Job-level WS event_type fix (frontend + backend)
|
|
||||||
- [x] Implement proper WebSocket in linux-patch-api using actix-web-actors
|
|
||||||
- [x] Add WsJobActor with broadcast channel for real-time status updates
|
|
||||||
- [x] Add HTTP polling fallback in pm-worker WS relay
|
|
||||||
- [x] Deploy both binaries to dev LXC
|
|
||||||
- [x] Push both projects to Gitea
|
|
||||||
- [x] Fix config file (ws_relay_poll_interval_secs in [worker] section)
|
|
||||||
|
|
||||||
## Deployment Notes
|
**Response:** `ApiResponse<ServiceStatusData>`
|
||||||
- linux-patch-api binary deployed to /usr/bin/linux-patch-api on dev LXC (VMID 131)
|
```json
|
||||||
- pm-worker binary deployed to /usr/local/bin/pm-worker on dev LXC (VMID 131)
|
{
|
||||||
- Config file: /etc/patch-manager/config.toml (added ws_relay_poll_interval_secs = 10)
|
"success": true,
|
||||||
- Both services running: patch-manager-web, patch-manager-worker, linux-patch-api
|
"data": {
|
||||||
|
"name": "nginx",
|
||||||
|
"display_name": "A high performance web server",
|
||||||
|
"active_state": "active",
|
||||||
|
"sub_state": "running",
|
||||||
|
"load_state": "loaded",
|
||||||
|
"enabled_state": "enabled",
|
||||||
|
"main_pid": 1234,
|
||||||
|
"healthy": true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
## Verified Working
|
**Health determination:** The agent `healthy` field is authoritative. Manager uses this boolean directly.
|
||||||
- WebSocket connections to linux-patch-manager-dev (agent with proper WS handler)
|
|
||||||
- HTTP polling fallback to gitea-runner-u2404 (agent with stub WS)
|
**Error responses:** 400 (invalid name), 404 (not found → unhealthy), 500 (error → unhealthy)
|
||||||
- Job completion status updates via pg_notify
|
|
||||||
- Frontend real-time updates via WebSocket events
|
### GET /api/v1/health
|
||||||
|
Basic agent health. Returns `{"status": "healthy"}`. Used for connectivity checks, not host health checks.
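
For the service endpoint described above, the manager-side interpretation (agent `healthy` field is authoritative; 404 and 500 count as unhealthy) might look like the following TypeScript sketch. The real logic is in the Rust worker; `interpretServiceResponse` is a hypothetical name, only the fields used here are typed, and treating 400 and a missing response as unhealthy is an assumption not stated in the reference.

```typescript
// Hypothetical sketch; field names follow the JSON example in the reference above.
interface ServiceStatusData {
  name: string
  healthy: boolean
}

interface ApiResponse<T> {
  success: boolean
  data?: T // assumption: may be absent on error responses
}

interface ServiceCheckOutcome {
  healthy: boolean
  detail: string
}

// httpStatus is null when the agent did not answer within the timeout.
function interpretServiceResponse(
  httpStatus: number | null,
  payload: ApiResponse<ServiceStatusData> | null,
): ServiceCheckOutcome {
  if (httpStatus === null) return { healthy: false, detail: 'no response from agent' }
  if (httpStatus === 404) return { healthy: false, detail: 'service not found' }
  if (httpStatus !== 200 || !payload?.data) {
    // Covers 500 and, by assumption, 400 and any other unexpected status.
    return { healthy: false, detail: `agent returned HTTP ${httpStatus}` }
  }
  // The agent's boolean is used directly, with no re-derivation from active_state.
  return {
    healthy: payload.data.healthy,
    detail: payload.data.healthy ? 'service healthy' : 'service unhealthy',
  }
}
```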
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1: Database Schema
|
||||||
|
|
||||||
|
### New table: `host_health_checks`
|
||||||
|
```sql
|
||||||
|
CREATE TABLE host_health_checks (
|
||||||
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||||
|
host_id UUID NOT NULL REFERENCES hosts(id) ON DELETE CASCADE,
|
||||||
|
name VARCHAR(100) NOT NULL,
|
||||||
|
check_type VARCHAR(20) NOT NULL CHECK (check_type IN ('service', 'http')),
|
||||||
|
enabled BOOLEAN NOT NULL DEFAULT true,
|
||||||
|
-- Service check fields
|
||||||
|
service_name VARCHAR(200),
|
||||||
|
-- HTTP check fields
|
||||||
|
url TEXT,
|
||||||
|
expected_body VARCHAR(500),
|
||||||
|
ignore_cert_errors BOOLEAN DEFAULT true,
|
||||||
|
basic_auth_user VARCHAR(100),
|
||||||
|
basic_auth_pass_encrypted BYTEA, -- AES-256-GCM encrypted with per-install key
|
||||||
|
basic_auth_pass_nonce BYTEA, -- nonce for AES-GCM
|
||||||
|
-- Metadata
|
||||||
|
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||||
|
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||||
|
CONSTRAINT valid_service_check CHECK (
|
||||||
|
(check_type = 'service' AND service_name IS NOT NULL AND url IS NULL)
|
||||||
|
OR
|
||||||
|
(check_type = 'http' AND url IS NOT NULL AND expected_body IS NOT NULL AND service_name IS NULL)
|
||||||
|
)
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE INDEX idx_health_checks_host ON host_health_checks (host_id);
|
||||||
|
```
|
||||||
|
|
||||||
|
### New table: `host_health_check_results`
|
||||||
|
```sql
|
||||||
|
CREATE TABLE host_health_check_results (
|
||||||
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
||||||
|
check_id UUID NOT NULL REFERENCES host_health_checks(id) ON DELETE CASCADE,
|
||||||
|
healthy BOOLEAN NOT NULL,
|
||||||
|
detail TEXT,
|
||||||
|
latency_ms INTEGER,
|
||||||
|
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE INDEX idx_health_results_check ON host_health_check_results (check_id, checked_at DESC);
|
||||||
|
```
|
||||||
|
|
||||||
|
### Encryption key storage
|
||||||
|
- Per-install app key stored at `/etc/patch-manager/keys/health-check.key`
|
||||||
|
- 256-bit random key generated on first startup if not present
|
||||||
|
- File permissions: 0600, owned by patch-manager user
|
||||||
|
- Used for AES-256-GCM encryption of basic_auth_pass
|
||||||
|
|
||||||
|
- [ ] Create migration 007_health_checks.sql
|
||||||
|
- [ ] Add models to pm-core/src/models.rs
|
||||||
|
- [ ] Add encryption utility to pm-core
|
||||||
|
- [ ] Verify migration runs on dev LXC
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 2: Backend API Routes
|
||||||
|
|
||||||
|
### Endpoints
|
||||||
|
- `GET /api/v1/hosts/{id}/health-checks` — list health checks for host (RBAC scoped)
|
||||||
|
- `POST /api/v1/hosts/{id}/health-checks` — create health check (max 5 per host)
|
||||||
|
- `PUT /api/v1/hosts/{id}/health-checks/{check_id}` — update health check
|
||||||
|
- `DELETE /api/v1/hosts/{id}/health-checks/{check_id}` — delete health check
|
||||||
|
- `POST /api/v1/hosts/{id}/health-checks/{check_id}/test` — run check immediately, return result
|
||||||
|
|
||||||
|
### Request/Response types
|
||||||
|
```rust
|
||||||
|
struct CreateHealthCheckRequest {
|
||||||
|
name: String,
|
||||||
|
check_type: String, // "service" or "http"
|
||||||
|
service_name: Option<String>,
|
||||||
|
url: Option<String>,
|
||||||
|
expected_body: Option<String>,
|
||||||
|
ignore_cert_errors: Option<bool>,
|
||||||
|
basic_auth_user: Option<String>,
|
||||||
|
basic_auth_pass: Option<String>, // plaintext in request, encrypted before storage
|
||||||
|
}
|
||||||
|
|
||||||
|
struct HealthCheck {
|
||||||
|
id: Uuid,
|
||||||
|
host_id: Uuid,
|
||||||
|
name: String,
|
||||||
|
check_type: String,
|
||||||
|
enabled: bool,
|
||||||
|
service_name: Option<String>,
|
||||||
|
url: Option<String>,
|
||||||
|
expected_body: Option<String>,
|
||||||
|
ignore_cert_errors: bool,
|
||||||
|
basic_auth_user: Option<String>,
|
||||||
|
// basic_auth_pass NOT returned in responses
|
||||||
|
last_result: Option<HealthCheckResult>,
|
||||||
|
created_at: DateTime<Utc>,
|
||||||
|
updated_at: DateTime<Utc>,
|
||||||
|
}
|
||||||
|
|
||||||
|
struct HealthCheckResult {
|
||||||
|
healthy: bool,
|
||||||
|
detail: Option<String>,
|
||||||
|
latency_ms: Option<i32>,
|
||||||
|
checked_at: DateTime<Utc>,
|
||||||
|
}
|
||||||
|
```
|
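The `valid_service_check` constraint in Phase 1 rejects rows where the fields do not match the check type, so the API layer may want to reject such requests before they reach the database. The following is a minimal sketch of that validation, not the project's actual handler code. The struct is a trimmed copy of `CreateHealthCheckRequest` with only the fields the constraint inspects, and the function name `validate_request` and the error strings are assumptions made for illustration.

```rust
/// Trimmed copy of the request type, keeping only the fields that the
/// `valid_service_check` constraint looks at.
struct CreateHealthCheckRequest {
    check_type: String,
    service_name: Option<String>,
    url: Option<String>,
    expected_body: Option<String>,
}

/// Hypothetical helper: mirrors the SQL CHECK constraint so that a bad
/// request can be answered with a readable error instead of a DB failure.
fn validate_request(req: &CreateHealthCheckRequest) -> Result<(), String> {
    match req.check_type.as_str() {
        "service" => {
            if req.service_name.is_none() {
                return Err("service checks require service_name".to_string());
            }
            if req.url.is_some() {
                return Err("service checks must not set url".to_string());
            }
            Ok(())
        }
        "http" => {
            if req.url.is_none() || req.expected_body.is_none() {
                return Err("http checks require url and expected_body".to_string());
            }
            if req.service_name.is_some() {
                return Err("http checks must not set service_name".to_string());
            }
            Ok(())
        }
        other => Err(format!("unknown check_type: {}", other)),
    }
}
```

A handler could call `validate_request` before encrypting `basic_auth_pass` and inserting the row, returning a 400 when it yields `Err`.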
||||||
|
|
||||||
|
- [ ] Add routes to pm-web/src/routes/ (new health_checks.rs)
|
||||||
|
- [ ] Add CRUD operations
|
||||||
|
- [ ] Add RBAC enforcement (admin all, operator matching group)
|
||||||
|
- [ ] Add max-5-per-host validation
|
||||||
|
- [ ] Add /test endpoint that runs check immediately
|
||||||
|
- [ ] Add audit logging for create/update/delete
|
||||||
|
- [ ] Add encryption/decryption for basic_auth_pass
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3: Worker Health Check Engine
|
||||||
|
|
||||||
|
### Continuous Polling
|
||||||
|
- New task in pm-worker: `health_check_poller`
|
||||||
|
- Polls all enabled health checks every 5 minutes
|
||||||
|
- For service checks: call agent `GET /api/v1/system/services/{name}` via mTLS, check `healthy` field
|
||||||
|
- For HTTP checks: make direct HTTP(S) request from manager, check status code + substring match
|
||||||
|
- Store results in `host_health_check_results`
|
||||||
|
- Prune results older than 4 days
|
||||||
|
|
||||||
|
### Pre-Patch Execution Gate
|
||||||
|
- When a patch job is about to execute on a host:
|
||||||
|
1. Check if host has any enabled health checks
|
||||||
|
2. If yes, verify all are currently healthy (from latest poll result)
|
||||||
|
3. If any are unhealthy:
|
||||||
|
- Wait and retry at 5-minute intervals
|
||||||
|
- Continue until maintenance window closes
|
||||||
|
- If window closes with failed checks, mark job host as failed with detail
|
||||||
|
4. If all healthy, proceed with patch execution
|
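The gate steps above reduce to a three-way decision per host. The sketch below shows one way to express it as a pure function, assuming the caller has already loaded the latest poll result for each enabled check and knows whether the maintenance window is still open. `GateDecision`, `evaluate_gate`, and the detail message are hypothetical names, not taken from `job_executor.rs`.

```rust
/// Hypothetical outcome of the pre-patch gate for one host.
#[derive(Debug, PartialEq)]
enum GateDecision {
    /// No enabled checks, or all of them are healthy: run the patch job.
    Proceed,
    /// At least one check is unhealthy and the window is still open:
    /// the caller should sleep for the retry interval and evaluate again.
    Wait,
    /// The window closed while checks were still failing.
    Fail(String),
}

/// `latest` holds (check name, healthy) for each enabled check on the host.
fn evaluate_gate(latest: &[(&str, bool)], window_open: bool) -> GateDecision {
    // Collect the names of failing checks so the failure detail is useful.
    let failing: Vec<&str> = latest
        .iter()
        .filter(|(_, healthy)| !*healthy)
        .map(|(name, _)| *name)
        .collect();

    if failing.is_empty() {
        return GateDecision::Proceed;
    }
    if window_open {
        GateDecision::Wait
    } else {
        GateDecision::Fail(format!(
            "health checks still failing at window close: {}",
            failing.join(", ")
        ))
    }
}
```

Keeping the decision free of I/O means the 5-minute retry loop only has to re-query results and call this function again, and the decision itself can be unit tested without a database.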
||||||
|
|
||||||
|
### HTTP Check Implementation
|
||||||
|
- Use reqwest with:
|
||||||
|
- 10-second timeout
|
||||||
|
- Accept invalid certs when `ignore_cert_errors` is true
|
||||||
|
- Optional basic auth header (decrypt from DB)
|
||||||
|
- Check response body contains expected_body substring
|
||||||
|
- Return healthy=true if match, false otherwise
|
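The pass/fail rule for an HTTP check can be separated from the request itself. The sketch below covers only the evaluation step, given a status code and body that some HTTP client has already fetched. Treating any 2xx status as acceptable is an assumption: the plan says the status code is checked but does not say which codes pass. `Evaluation` and `evaluate_http_response` are illustrative names.

```rust
/// Hypothetical result of evaluating one HTTP response.
struct Evaluation {
    healthy: bool,
    /// Human-readable text suitable for the `detail` column.
    detail: String,
}

/// Decides health from an already-received response.
/// Assumption: only 2xx statuses are accepted.
fn evaluate_http_response(status: u16, body: &str, expected_body: &str) -> Evaluation {
    if !(200..300).contains(&status) {
        return Evaluation {
            healthy: false,
            detail: format!("unexpected status {}", status),
        };
    }
    // Plain substring match, as described for `expected_body`.
    if !body.contains(expected_body) {
        return Evaluation {
            healthy: false,
            detail: "expected_body not found in response".to_string(),
        };
    }
    Evaluation {
        healthy: true,
        detail: format!("status {}, body matched", status),
    }
}
```

Timeouts and connection errors never reach this function. A caller would presumably record those as unhealthy directly, with the error text as the detail.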
||||||
|
|
||||||
|
- [ ] Add health_check_poller module to pm-worker
|
||||||
|
- [ ] Implement service check via AgentClient
|
||||||
|
- [ ] Implement HTTP check via reqwest
|
||||||
|
- [ ] Add pre-patch execution gate to job_executor
|
||||||
|
- [ ] Add retry loop with 5-minute intervals
|
||||||
|
- [ ] Add maintenance window expiry check
|
||||||
|
- [ ] Add health check config to WorkerConfig (poll interval)
|
||||||
|
- [ ] Add result pruning (4-day retention)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 4: Frontend UI
|
||||||
|
|
||||||
|
### Host Detail Page
|
||||||
|
- Add "Health Checks" section below host info
|
||||||
|
- List current health checks with status indicators
|
||||||
|
- Add/Edit/Delete health check dialogs
|
||||||
|
- "Test" button to run check immediately and show result
|
||||||
|
- Visual indicator: green check = healthy, red X = unhealthy, gray = unknown
|
||||||
|
|
||||||
|
### Hosts Page
|
||||||
|
- Add health check summary column or indicator
|
||||||
|
- Show aggregate status: all healthy / some unhealthy / no checks configured
|
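The three aggregate states listed above could be derived on the backend from the latest result of each enabled check. This is a rough sketch with hypothetical names (`AggregateHealth`, `aggregate`). It does not show how `health_check_status` is actually represented in `HostSummary`, and it assumes checks that have never been polled are filtered out by the caller.

```rust
/// Hypothetical aggregate status for one host.
#[derive(Debug, PartialEq)]
enum AggregateHealth {
    NoChecks,
    AllHealthy,
    SomeUnhealthy,
}

/// `latest` holds the most recent `healthy` value of each enabled check.
fn aggregate(latest: &[bool]) -> AggregateHealth {
    if latest.is_empty() {
        AggregateHealth::NoChecks
    } else if latest.iter().all(|healthy| *healthy) {
        AggregateHealth::AllHealthy
    } else {
        AggregateHealth::SomeUnhealthy
    }
}
```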
||||||
|
|
||||||
|
### Deploy Page
|
||||||
|
- Show health check status in host selection table
|
||||||
|
- Warn if any selected hosts have unhealthy checks
|
||||||
|
|
||||||
|
### Job Detail
|
||||||
|
- Show health check gate status when job is waiting for healthy checks
|
||||||
|
- Display which checks are passing/failing
|
||||||
|
|
||||||
|
- [ ] Add HealthCheck types to frontend/src/types/index.ts
|
||||||
|
- [ ] Add health check API calls to frontend/src/api/client.ts
|
||||||
|
- [ ] Add Health Checks section to HostDetailPage.tsx
|
||||||
|
- [ ] Add health check status to HostsPage.tsx
|
||||||
|
- [ ] Add health check indicators to PatchDeploymentPage.tsx
|
||||||
|
- [ ] Add health check gate status to JobsPage.tsx detail view
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 5: Integration & Testing
|
||||||
|
- [ ] Build and deploy to dev LXC
|
||||||
|
- [ ] Test service health check against dev LXC agent
|
||||||
|
- [ ] Test HTTP health check against internal services
|
||||||
|
- [ ] Test pre-patch gate: deploy with failing check, verify retry behavior
|
||||||
|
- [ ] Test maintenance window expiry with failing checks
|
||||||
|
- [ ] Test RBAC: operator can only manage checks in their group
|
||||||
|
- [ ] Test max 5 checks per host enforcement
|
||||||
|
- [ ] Test basic auth encryption/decryption
|
||||||
|
- [ ] Push to Gitea
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Resolved Items
|
||||||
|
- ~~Basic auth password storage~~ → Encrypted in DB with per-install app key (AES-256-GCM)
|
||||||
|
- ~~Health check poll interval~~ → 5 minutes
|
||||||
|
- ~~Result retention~~ → 4 days (time-based)
|
||||||
|
|||||||