Private

Public Access

Files

git-echo 1322598581 feat: add auto-enrollment, cert validation, and crash loop fixes

- Auto-enrollment on startup when certs are missing/invalid and enrollment.manager_url configured
- Certificate validation (existence, parse, expiry, key match, CA trust)
- --enroll exits after completion (no port conflict with systemd service)
- --renew-certs flag for manual cert renewal
- SO_REUSEADDR on TcpListener::bind (prevents Address already in use)
- Polling token persistence for enrollment resume after restart
- Exit code strategy (0=clean, 1=error, 2=enrollment in progress)
- HTTP 409 (host already exists) handling during enrollment
- Move 'Listening on' log after actual bind
- Increase RestartSec to 10s and add StartLimitBurst=5
- Postinst checks for certs and enrollment URL, prints guidance
- EnrollmentConfig.manager_url changed to Option<String>
- cert_renewal_threshold_days and polling_token config fields
- Updated SPEC.md and DEPLOYMENT_GUIDE.md with new workflow
- RCA document for crash loop root cause analysis
- Version bumped to 1.2.0

2026-05-29 10:44:42 -05:00

10 KiB

Raw Permalink Blame History

Root Cause Analysis: linux-patch-api Crash Loop

Date: 2026-05-28 Affected Hosts: sonarr, apt-cacher-ng, radarr-lxc, lidarr-lxc, deluge-lxc (all .moon-dragon.us) Symptom: Agent crash loop with "Address already in use" on port 12443, causing flapping between healthy/unreachable on manager

Executive Summary

The crash loop has three distinct root causes, not two as initially documented:

Primary cause: Package installation enables and starts the service before certificates exist, causing immediate crash on mTLS initialization.
Secondary cause: TcpListener::bind does NOT set SO_REUSEADDR, preventing rebinding when a port is in TIME_WAIT state.
Tertiary cause (discovered during RCA): The --enroll process binds to port 12443 after enrollment completes, blocking the systemd service from starting.

The result: hosts stuck in an infinite crash loop that cannot self-recover without manual intervention.

Evidence Preserved

radarr-lxc (still crash-looping, evidence intact)

Metric	Value
NRestarts	4,762+ (since May 20)
First crash	May 20 20:23:55 — `TLS CA certificate not found: /etc/linux_patch_api/certs/ca.pem`
Current error	`Failed to bind to 0.0.0.0:12443: Address already in use (os error 98)`
PID holding port 12443	1218 (`linux-patch-api --enroll`) started at 15:59:32
PID 1218 parent	1217 (`sudo linux-patch-api --enroll`)
PID 1218 state	S (sleeping), holding socket fd=16 on 0.0.0.0:12443
Certs exist	Yes (valid May 28 2026 → May 28 2027)
systemd MainPID	0 (not tracking the enrollment process)

lidarr-lxc (still crash-looping, evidence intact)

Metric	Value
NRestarts	4,822+
PID holding port 12443	1207 (`linux-patch-api --enroll`) started at 15:42
Same pattern as radarr-lxc

deluge-lxc (still crash-looping, evidence intact)

Metric	Value
NRestarts	4,494+
PID holding port 12443	51035 (`linux-patch-api --enroll`) started at 15:11
Same pattern as radarr-lxc

sonarr (evidence destroyed by fix)

Fixed before full investigation. NRestarts was 117,647+ over 8 days. Pattern inferred from partial logs.

apt-cacher-ng (evidence destroyed by fix)

Fixed before full investigation. Pattern inferred from partial logs.

Root Cause Analysis

Cause 1: Package Postinst Starts Service Before Certs Exist

The .deb/.pkg.tar.zst package postinst script:

Installs the binary
Deploys example config files
Enables the systemd service (systemctl enable)
Does NOT start the service (comment: "admin should configure first")
Does not check if TLS certificates exist
Does not run enrollment

Note: The postinst correctly does NOT start the service. The service was started by a separate deployment step (likely during the v1.1.16→v1.1.17 upgrade or by a previous version's postinst that DID start the service).

The config file (/etc/linux_patch_api/config.yaml) references certs that don't exist yet:

tls:
  enabled: true
  ca_cert: "/etc/linux_patch_api/certs/ca.pem"
  server_cert: "/etc/linux_patch_api/certs/server.pem"
  server_key: "/etc/linux_patch_api/certs/server.key"

The agent validates cert paths at config load time and exits with error if they don't exist. Since the service is enabled and Restart=on-failure is set, systemd triggers restart immediately.

Evidence: All three preserved hosts show the same first crash on May 20 with TLS CA certificate not found.

Cause 2: Enrollment Process Port Conflict (NEW FINDING)

This is the dominant cause on the currently crash-looping hosts.

When linux-patch-api --enroll <manager_url> is run:

It registers with the manager and receives a polling token
It polls the manager for approval
After approval, it provisions certs
It then falls through to normal server startup (main.rs lines 88-100)
The enrollment process binds to port 12443 and starts serving requests

Meanwhile, the systemd service is also enabled and trying to restart:

systemd sees the service failed, waits RestartSec=5s
Tries to start a NEW linux-patch-api process
New process tries TcpListener::bind("0.0.0.0:12443") → "Address already in use"
Process exits immediately, loop repeats every 5 seconds

Key insight: systemd's MainPID=0 — it has LOST TRACK of the enrollment process because it was started outside systemd (via sudo from an SSH session). The enrollment process is an orphan holding the port.

Evidence from radarr-lxc:

PID 1218: linux-patch-api --enroll https://linux-patch-manager-dev.moon-dragon.us
  State: S (sleeping)
  FD 16: socket:[900840468] → LISTEN on 0.0.0.0:12443
  Parent: PID 1217 (sudo) → PID 1216 (bash -c from SSH session)
systemd MainPID: 0 (not tracking this process)

Source code confirmation (main.rs lines 88-100):

if let Some(ref manager_url) = args.enroll {
    info!("Enrollment mode activated - running enrollment flow before server startup");
    match enroll::run_enrollment(manager_url, &config).await {
        Ok(()) => {
            info!("Enrollment complete - proceeding to server startup");  // ← Falls through to bind!
        }
        Err(e) => {
            error!(error = %e, "Enrollment failed - shutting down");
            return Err(anyhow::anyhow!("Enrollment failed: {}", e));
        }
    }
}
// ... continues to TcpListener::bind at line 226

Cause 3: No SO_REUSEADDR on TcpListener::bind

Once the enrollment process eventually exits (or is killed), the port enters TIME_WAIT state for ~60 seconds. Without SO_REUSEADDR, the next systemd restart attempt within that window also fails with "Address already in use".

Source code (main.rs line 226):

let tcp_listener = TcpListener::bind(&bind_address)
    .map_err(|e| anyhow::anyhow!("Failed to bind to {}: {}", bind_address, e))?;

std::net::TcpListener::bind does NOT set SO_REUSEADDR. The socket2 crate is not a dependency.

Cause 4: Misleading Log Messages

The log sequence is confusing because of premature logging in main.rs:

INFO linux_patch_api: Listening on 0.0.0.0:12443     ← Line 197 (logged BEFORE actual bind)
INFO linux_patch_api: Initializing mTLS authentication  ← Line 206
INFO linux_patch_api: mTLS middleware initialized      ← Line 223
Error: Failed to bind to 0.0.0.0:12443               ← Line 227 (actual bind attempt)

The "Listening" message at line 197 is emitted before the TcpListener::bind at line 226. This makes it look like the agent successfully bound and then tried to bind again.

Recommended Fixes

Fix 1: Enrollment Should NOT Fall Through to Server Startup

Priority: CRITICAL — This is the fix that prevents the enrollment port conflict.

In src/main.rs, after enrollment completes, the process should EXIT instead of falling through to server startup:

if let Some(ref manager_url) = args.enroll {
    info!("Enrollment mode activated");
    match enroll::run_enrollment(manager_url, &config).await {
        Ok(()) => {
            info!("Enrollment complete - start the service with: systemctl start linux-patch-api");
            return Ok(());  // ← EXIT after enrollment, don't bind port
        }
        Err(e) => {
            error!(error = %e, "Enrollment failed");
            return Err(anyhow::anyhow!("Enrollment failed: {}", e));
        }
    }
}

Fix 2: Add SO_REUSEADDR to TcpListener::bind

In src/main.rs line 226, replace:

let tcp_listener = TcpListener::bind(&bind_address)
    .map_err(|e| anyhow::anyhow!("Failed to bind to {}: {}", bind_address, e))?;

With:

use socket2::{Socket, Domain, Type, Protocol};
let socket = Socket::new(Domain::IPV4, Type::STREAM, Some(Protocol::TCP))?;
socket.set_reuse_address(true)?;
socket.bind(&bind_address.parse()?)?;
socket.listen(128)?;
let tcp_listener: TcpListener = socket.into();

Add socket2 to Cargo.toml dependencies.

Fix 3: Postinst Should Check for Certs Before Enabling Service

The package postinst script should:

Check if TLS certs exist at the configured paths
If certs exist → enable and start the service
If certs don't exist → enable but DON'T start; print enrollment instructions

Fix 4: Increase RestartSec and Add StartLimitBurst

In configs/linux-patch-api.service:

Restart=on-failure
RestartSec=10s
StartLimitIntervalSec=300
StartLimitBurst=5

This prevents the crash loop from spinning at 12 attempts/minute. After 5 failures in 300s, systemd stops retrying.

Fix 5: Fix Misleading Log Message

Move the "Listening on" log (line 197) to AFTER the successful bind (after line 229):

// After TcpListener::bind succeeds:
info!("TCP listener bound to {}", bind_address);
info!("Listening on {}", bind_address);  // Move here

Immediate Actions Needed

Three hosts are still crash-looping with enrollment processes holding port 12443:

radarr-lxc — PID 1218 holding port, NRestarts=4,762
lidarr-lxc — PID 1207 holding port, NRestarts=4,822
deluge-lxc — PID 51035 holding port, NRestarts=4,494

To fix each host:

Kill the enrollment process: sudo kill <pid>
Wait for port release
Start the service: sudo systemctl start linux-patch-api

Awaiting Kelly's approval before fixing.

Lessons Learned

Do NOT destroy evidence before completing RCA. I fixed apt-cacher-ng before fully investigating the crash loop, destroying diagnostic evidence. Kelly had to point this out.
Investigate first, fix second. When doing RCA, preserve the crash-looping hosts and gather all evidence before applying fixes.
The enrollment process port conflict was the dominant cause on the currently-affected hosts, not TIME_WAIT. I initially misdiagnosed this because I destroyed the evidence too early.

Confidence

Confidence: 95% (diagnosis)

Evidence: Direct log analysis from 3 preserved hosts showing identical pattern
Evidence: Source code review of main.rs showing enrollment fall-through to server startup
Evidence: Process state showing enrollment PID holding port 12443 while systemd has MainPID=0
Evidence: ps aux and /proc data confirming enrollment process is alive and bound
Uncertainties: None significant — the evidence chain is complete and consistent
Test Status: Partially tested — apt-cacher-ng and sonarr were fixed before full investigation; radarr/lidarr/deluge still crash-looping with preserved evidence

10 KiB Raw Permalink Blame History