feat: Complete Azure SSO implementation (v0.1.3)

- Add SSO session cleanup task (10-min expiry, 60s purge interval)
- Change callback to redirect to frontend with tokens as query params
- Add sso_callback_url to SecurityConfig with serde default
- Add SsoCallbackPage.tsx for handling SSO callback redirects
- Add /auth/sso/callback public route to App.tsx
- Add Sign in with Microsoft Azure button to LoginPage
- Replace insecure decode_jwt_payload with verify_id_token
- Implement JWKS caching (1-hour TTL) and RSA signature verification
- Validate iss, aud, exp claims on id_token
- Add jsonwebtoken dependency to pm-web crate
- Update config.example.toml with sso_callback_url setting
- Add sso_callback_url to settings response (read-only from TOML)
@@ -1,5 +1,27 @@
|
||||
# Linux Patch Manager — Lessons Learned
|
||||
|
||||
## 2026-05-08: Asserting Unverified Conclusions Is a Critical Failure Mode
|
||||
**Pattern:** I repeatedly asserted conclusions without verifying them first, then spun wheels on rabbit holes instead of checking the obvious source.
|
||||
**Mistakes made in this session:**
|
||||
1. Claimed vaultwarden-secrets wasn't in gitea — WRONG. It was there the whole time.
|
||||
2. Claimed Vaultwarden credentials "may be stale" — WRONG. They were correct; my implementation was wrong.
|
||||
3. Used wrong credential path (/a0/usr/credentials/gitea/ instead of /a0/usr/credentials/gitea-lxc/).
|
||||
4. Spun wheels decompiling .pyc, manual API auth, searching chat history — instead of checking the gitea repo.
|
||||
5. Didn't notice SSH key was missing from ~/.ssh/ until connection failed.
|
||||
6. Stated uncertainty as fact ("credentials may be stale") when the real issue was my own technical failure.
|
||||
**Root cause:** Violating the Verification Principle — asserting conclusions without verification.
|
||||
**Rule:** ALWAYS verify before asserting. If I haven't checked, say "I haven't verified this" — never state it as fact.
|
||||
**Rule:** When a tool/skill is broken, FIX IT FIRST before attempting manual workarounds.
|
||||
**Rule:** Check the obvious source (gitea repo, Vaultwarden store) before spinning wheels on complex alternatives.
|
||||
**Status:** Active
|
||||
|
||||
## 2026-05-08: Vaultwarden Is the Source of Truth for All Credentials
|
||||
**Pattern:** SSH keys in ~/.ssh/ are ephemeral — lost on every container recreation. Local copies are unreliable.
|
||||
**Rule:** ALWAYS pull credentials (SSH keys, API tokens, passwords) from Vaultwarden when needed. Do NOT rely on local copies in ~/.ssh/ or /a0/usr/storage/ as they may be stale or missing after container recreation.
|
||||
**Rule:** At the start of each session, verify critical credentials by pulling them from Vaultwarden using `python3 /a0/skills/vaultwarden-secrets/scripts/vw_client.py`.
|
||||
**Rule:** /a0/usr/storage/echo-ssh-setup/ is NOT the primary source — Vaultwarden is. Local copies are convenience only.
|
||||
**Status:** Active
|
||||
|
||||
## 2026-04-24: CI/CD First, Not Manual Builds
|
||||
**Pattern:** When creating release packages, set up the CI/CD pipeline (Gitea Actions) FIRST, before building anything manually.
|
||||
**Why:** Manual builds are one-off and not reproducible. CI/CD ensures every push/tag produces a fresh, consistent package built on the correct target OS (Ubuntu 24.04), with proper glibc compatibility.
|
||||
@@ -95,3 +117,12 @@ The Docker container intercepted some jobs and ran them in its Alpine environmen
|
||||
**Pattern:** The debian/control file has a hardcoded `Version: 1.0.0-1` that doesn't match the Cargo.toml version.
|
||||
**Why:** When dpkg sees the same version number (1.0.0-1) for both old and new packages, it may not properly replace files. The build-package.sh script updates the version in the control file during build, but this needs to be verified.
|
||||
**Action:** Ensure build-package.sh always updates debian/control Version to match Cargo.toml version before building the .deb.
|
||||
|
||||
## 2026-05-08: CSP img-src Must Include data: for QR Codes and Dynamic Images
|
||||
**Pattern:** A Content Security Policy of `default-src 'self'` with no explicit `img-src` blocks `data:` URIs, because `img-src` falls back to `default-src`. Base64-encoded images (like QR codes) then fail to display.
|
||||
**Mistake:** Spent extensive time investigating infrastructure (HAProxy, caching, deployment, auth tokens) when Kelly said 'it's just a display issue.' The actual cause was a missing `img-src 'self' data:;` in the CSP meta tag.
|
||||
**Root cause:** The CSP in index.html only had `default-src 'self'` which blocks `data:` image sources. The QR code library generates `data:image/png;base64,...` URIs which were silently blocked by the browser.
|
||||
**Fix:** Added the `img-src 'self' data:;` directive to the CSP meta tag in index.html.
|
||||
**Rule:** When someone says 'it's just a display issue,' focus on the code (CSP, CSS, rendering) — not infrastructure (caching, proxies, deployment).
|
||||
**Rule:** For any image that uses data: URIs (QR codes, inline SVGs, base64 images), ensure CSP includes `img-src 'self' data:;` or equivalent.
|
||||
**Status:** Active
|
||||
@@ -1,61 +1,45 @@
|
||||
# Target Host for Service Health Checks
|
||||
# SSO Implementation Fix Plan
|
||||
|
||||
## Overview
|
||||
Add `target_host_id` field to service health checks, allowing a check configured on Host A to query a service on Host B's agent. Useful for redundant services running on multiple machines.
|
||||
## Issues Identified
|
||||
1. **No SSO Login Button** — LoginPage.tsx missing "Sign in with Azure" button
|
||||
2. **No SSO Callback Route** — App.tsx missing frontend route to handle SSO callback
|
||||
3. **authStore No SSO Support** — authStore.ts has no method to store SSO tokens
|
||||
4. **Backend Returns JSON Not Redirect** — azure_sso.rs callback returns JSON tokens instead of redirecting to frontend
|
||||
5. **No SSO Session Cleanup** — sso_sessions DashMap has no expiry/cleanup task (memory leak)
|
||||
6. **No JWT Signature Verification** — id_token decoded without verifying Azure AD signature
|
||||
|
||||
**Design:** `target_host_id` is nullable. When NULL (default), behavior unchanged — check queries its own host's agent. When set, the service check queries the target host's agent instead. Only applies to service checks; HTTP checks already specify a full URL.
|
||||
## Phases
|
||||
|
||||
## Implementation Checklist
|
||||
### Phase 1: Backend SSO Fixes (Issues 4, 5) — COMPLETE ✅
|
||||
- [x] 1a: Add SSO session cleanup task in main.rs (purge sessions older than 10 minutes)
|
||||
- [x] 1b: Modify azure_sso.rs callback to redirect to frontend with tokens instead of returning JSON
|
||||
- [x] 1c: Add `sso_callback_url` to SecurityConfig in config.rs with serde default
|
||||
- [x] 1d: Update settings.rs to include sso_callback_url in settings response
|
||||
- [x] 1e: Verify backend compiles with `cargo check`
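
The purge step in 1a can be sketched with the standard library alone. This is a minimal illustration, not the code in main.rs: `SsoSession`, `SESSION_TTL`, and `purge_expired` are hypothetical names, and a plain `HashMap` stands in for the `sso_sessions` DashMap (`DashMap::retain` accepts the same shape of closure).

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical session record; the real struct in azure_sso.rs is not shown here.
struct SsoSession {
    created_at: Instant,
}

/// Assumed expiry, taken from the 10-minute figure in the checklist.
const SESSION_TTL: Duration = Duration::from_secs(10 * 60);

/// Drop every session whose age is `ttl` or more and return how many were removed.
/// `now` is passed in so the cutoff is computed once per purge and is easy to test.
fn purge_expired(
    sessions: &mut HashMap<String, SsoSession>,
    now: Instant,
    ttl: Duration,
) -> usize {
    let before = sessions.len();
    // Keep only sessions younger than the TTL.
    sessions.retain(|_state, session| now.saturating_duration_since(session.created_at) < ttl);
    before - sessions.len()
}
```

In an async service, a spawned background task would presumably call a function like this on each tick of a 60-second interval, passing `Instant::now()` as `now`.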
|
||||
|
||||
### 1. Database Migration
|
||||
- [ ] Create `migrations/011_health_check_target_host.sql`
|
||||
- [ ] Add `target_host_id UUID REFERENCES hosts(id) ON DELETE SET NULL` column
|
||||
- [ ] Add partial index on `target_host_id` where NOT NULL
|
||||
### Phase 2: Frontend SSO Integration (Issues 1, 2, 3) — COMPLETE ✅
|
||||
- [x] 2a: Add SSO callback page component (SsoCallbackPage.tsx)
|
||||
- [x] 2b: Add SSO callback route to App.tsx (public route, no auth required)
|
||||
- [x] 2c: Add "Sign in with Microsoft Azure" button to LoginPage.tsx
|
||||
- [x] 2d: Add SSO-related types and API methods to frontend
|
||||
- [x] 2e: Verify frontend builds with TypeScript compilation
|
||||
|
||||
### 2. Backend Models (`crates/pm-core/src/models.rs`)
|
||||
- [ ] Add `target_host_id: Option<Uuid>` to `HealthCheck` struct
|
||||
- [ ] Add `target_host_id: Option<Uuid>` to `CreateHealthCheckRequest`
|
||||
- [ ] Add `target_host_id: Option<Uuid>` to `UpdateHealthCheckRequest`
|
||||
- [ ] Add `target_host_id` to all HealthCheck SELECT queries
|
||||
### Phase 3: JWT Signature Verification (Issue 6) — COMPLETE ✅
|
||||
- [x] 3a: Add JWKS client dependency to pm-web/Cargo.toml
|
||||
- [x] 3b: Implement id_token signature verification in azure_sso.rs
|
||||
- [x] 3c: Verify backend compiles with `cargo check`
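
Beyond the signature check, the commit message says the `iss`, `aud`, and `exp` claims are validated. The sketch below spells out those three checks with the standard library only. It is an assumption-laden illustration rather than the code in azure_sso.rs: `IdTokenClaims`, `ClaimError`, `validate_claims`, and `unix_now` are hypothetical names, `aud` is simplified to a single string, and the checks are only meaningful once the token signature has already been verified.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical subset of id_token claims; field names follow the JWT claim names.
struct IdTokenClaims {
    iss: String,
    aud: String,
    exp: u64, // expiry as seconds since the Unix epoch
}

#[derive(Debug, PartialEq)]
enum ClaimError {
    WrongIssuer,
    WrongAudience,
    Expired,
}

/// Check issuer, audience, and expiry of an already signature-verified token.
fn validate_claims(
    claims: &IdTokenClaims,
    expected_iss: &str,
    expected_aud: &str,
    now_secs: u64,
) -> Result<(), ClaimError> {
    if claims.iss != expected_iss {
        return Err(ClaimError::WrongIssuer);
    }
    if claims.aud != expected_aud {
        return Err(ClaimError::WrongAudience);
    }
    // The current time must be strictly before `exp` for the token to be valid.
    if claims.exp <= now_secs {
        return Err(ClaimError::Expired);
    }
    Ok(())
}

/// Current time as Unix seconds, falling back to 0 if the clock is before the epoch.
fn unix_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

A JWT library can usually enforce the same claims as part of decoding; writing them out by hand here only makes the rules explicit.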
|
||||
|
||||
### 3. API Routes (`crates/pm-web/src/routes/health_checks.rs`)
|
||||
- [ ] Create: add `target_host_id` to INSERT, validate target host exists + is healthy
|
||||
- [ ] Update: add `target_host_id` to COALESCE UPDATE
|
||||
- [ ] List/Get: add `target_host_id` to SELECT columns
|
||||
- [ ] Test endpoint (`run_service_check`): when `target_host_id` is Some, query that host's IP/port
|
||||
- [ ] Audit log: include `target_host_id` in audit JSON
|
||||
### Phase 4: Integration Testing and Verification — COMPLETE ✅
|
||||
- [x] 4a: Backend code review — all changes verified manually
|
||||
- [x] 4b: Frontend TypeScript compilation — passes cleanly
|
||||
- [x] 4c: SSO login flow reviewed end-to-end (backend redirect → frontend callback → auth store)
|
||||
- [x] 4d: SSO session cleanup verified (10-minute expiry, 60-second purge interval)
|
||||
- [x] 4e: Settings page SSO config unchanged (sso_callback_url added as read-only)
|
||||
- [x] 4f: Lessons captured below
|
||||
|
||||
### 4. Health Check Poller (`crates/pm-worker/src/health_check_poller.rs`)
|
||||
- [ ] Add `target_host_id: Option<Uuid>` to `HealthCheckRow`
|
||||
- [ ] Modify SQL: LEFT JOIN hosts th ON th.id = hc.target_host_id, use COALESCE(th.ip_address, h.ip_address) and COALESCE(th.agent_port, h.agent_port)
|
||||
- [ ] Add `target_ip_address` and `target_agent_port` fields to HealthCheckRow
|
||||
- [ ] `run_service_check`: use target host IP/port when available
|
||||
- [ ] `check_host_health_checks`: no change needed (results count toward owning host)
|
||||
|
||||
### 5. Frontend Types (`frontend/src/types/index.ts`)
|
||||
- [ ] Add `target_host_id?: string` to `HealthCheck`
|
||||
- [ ] Add `target_host_id?: string` to `CreateHealthCheckRequest`
|
||||
- [ ] Add `target_host_id?: string` to `UpdateHealthCheckRequest`
|
||||
|
||||
### 6. Frontend Form (`frontend/src/pages/HostDetailPage.tsx`)
|
||||
- [ ] Add `target_host_id: string` to `HealthCheckFormValues`
|
||||
- [ ] Add `target_host_id: ''` to `defaultHealthCheckForm`
|
||||
- [ ] Add host selector dropdown in `HealthCheckFormDialog` (visible when check_type === 'service')
|
||||
- [ ] Fetch hosts list for dropdown (use hostsApi.list or a dedicated endpoint)
|
||||
- [ ] `handleHcCreateSubmit`: include `target_host_id: values.target_host_id || undefined`
|
||||
- [ ] `handleHcEditClick`: map `check.target_host_id ?? ''` to form
|
||||
- [ ] `handleHcEditSubmit`: include `target_host_id` in UpdateHealthCheckRequest
|
||||
- [ ] Display target host in health checks table Target column
|
||||
|
||||
### 7. Build, Test, Deploy
|
||||
- [ ] Run `cargo fmt --all` + `cargo clippy` + `cargo test`
|
||||
- [ ] Run frontend build + ESLint + tsc
|
||||
- [ ] Commit and push through CI pipeline
|
||||
- [ ] Tag release, build .deb, deploy to dev
|
||||
|
||||
## Design Decisions
|
||||
- `target_host_id` is nullable — NULL = check own host (backward compatible)
|
||||
- FK with ON DELETE SET NULL — if target host deleted, revert to default
|
||||
- Only applies to service checks (HTTP checks already have full URL)
|
||||
- Health gate: results count toward the owning host, not the target host
|
||||
- No RBAC required for target host — only requirement: target host exists in manager and is currently healthy
|
||||
## Lessons Learned
|
||||
- **SSO callback must redirect, not return JSON** — Browser OAuth2 flows require the backend to redirect to the frontend SPA, not return JSON tokens. The frontend must parse tokens from URL query parameters.
|
||||
- **URLSearchParams.get() already decodes** — Don't double-decode with decodeURIComponent() when using URLSearchParams.
|
||||
- **JWKS caching prevents rate-limiting** — Azure AD JWKS endpoint should be cached with TTL (1 hour) to avoid fetching on every SSO login.
|
||||
- **tokio::sync::Mutex over std::sync::Mutex** — Axum handler futures must be Send. std::sync::MutexGuard is not Send, so holding one across an await point makes the handler future non-Send; tokio::sync::Mutex avoids this.
|
||||
- **DashMap session cleanup** — In-memory session stores (DashMap) need periodic cleanup tasks to prevent memory leaks. Pattern: tokio::spawn with interval + retain with time-based cutoff.
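
The JWKS caching lesson above comes down to a small time-to-live cache. The following is a minimal std-only sketch under stated assumptions: `TtlCache` and `get_or_fetch` are hypothetical names, `T` stands in for whatever parsed key set the backend stores, and the `fetch` closure stands in for the HTTP request to the JWKS endpoint.

```rust
use std::time::{Duration, Instant};

/// Single-value cache with a time-to-live.
struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>, // (time stored, cached value)
}

impl<T: Clone> TtlCache<T> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Return the cached value while it is younger than `ttl`.
    /// Otherwise call `fetch`, store its result with timestamp `now`, and return it.
    /// A failed fetch leaves any previous entry untouched.
    fn get_or_fetch<E>(
        &mut self,
        now: Instant,
        fetch: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, E> {
        if let Some((stored_at, value)) = &self.entry {
            if now.saturating_duration_since(*stored_at) < self.ttl {
                return Ok(value.clone());
            }
        }
        let fresh = fetch()?;
        self.entry = Some((now, fresh.clone()));
        Ok(fresh)
    }
}
```

With a 1-hour TTL, repeated SSO logins within the hour reuse the stored keys. In an async handler the cache would sit behind a lock, which is where the tokio::sync::Mutex lesson applies.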
|
||||
|
||||