Some checks failed
CI/CD Pipeline / Code Format (pull_request) Failing after 4s
CI/CD Pipeline / Clippy Lints (pull_request) Failing after 48s
CI/CD Pipeline / Enrollment Tests (pull_request) Has been skipped
CI/CD Pipeline / Verify Enrollment CLI Flag (pull_request) Has been skipped
CI/CD Pipeline / All Unit Tests (pull_request) Failing after 1m3s
CI/CD Pipeline / Build Debian Package (pull_request) Has been skipped
CI/CD Pipeline / Build Debian Package (Ubuntu 22.04) (pull_request) Has been skipped
CI/CD Pipeline / Build RPM Package (pull_request) Has been skipped
CI/CD Pipeline / Build Alpine Package (pull_request) Has been skipped
CI/CD Pipeline / Build Arch Package (pull_request) Has been skipped
CI/CD Pipeline / Security Audit (pull_request) Successful in 6s
- New src/packages/cache.rs module with PackageCacheState, stale detection, state persistence, 404 retry logic - Add refresh_package_cache() and last_cache_update() to PackageManagerBackend trait, implemented on all 5 backends (APT, DNF, YUM, APK, Pacman) - Health check now reports last_cache_update and cache_status fields, triggers cache refresh if stale (>4h), returns degraded on failure - Patch apply jobs now force cache refresh before applying patches, with 404/fetch error retry (1 retry after cache refresh) - Cache state persists to /var/lib/linux_patch_api/state/cache.json - Version bump to 1.1.17 - Update ARCHITECTURE.md and REQUIREMENTS.md (FR-007) Closes: #2
282 lines
10 KiB
Markdown
282 lines
10 KiB
Markdown
# Issue #2: Package Cache Refresh - Spec Document
|
|
|
|
**Version:** 1.1.17
|
|
**Date:** 2026-05-27
|
|
**Status:** Approved
|
|
**Gitea Issue:** https://gitea-lxc.moon-dragon.us/git-echo/linux_patch_api/issues/2
|
|
|
|
---
|
|
|
|
## Problem Statement
|
|
|
|
On 2026-05-27, `dashboard.moon-dragon.us` (Ubuntu 24.04.2 LTS, agent v1.1.16) had 11 pending patches but ALL `patch_apply` jobs failed with 404 errors. Root cause: stale `apt` package cache referencing superseded versions no longer in upstream repos.
|
|
|
|
**Impact:** 700/757 total jobs failed (92.5% failure rate) across all managed hosts.
|
|
|
|
---
|
|
|
|
## Requirements (from Issue #2)
|
|
|
|
| # | Requirement | Priority | Description |
|
|
|---|-------------|----------|-------------|
|
|
| 1 | Pre-Upgrade Cache Refresh | **MUST** | Run package index update before every `patch_apply` operation |
|
|
| 2 | Regular Interval Cache Refresh | **MUST** | Refresh package index on each health check (manager-triggered) |
|
|
| 3 | 404/Fetch Error Handling | **SHOULD** | Auto-retry with cache refresh on 404 errors, then report failure |
|
|
| 4 | Stale Cache Detection | **SHOULD** | Track `last_cache_update` timestamp; include in health response |
|
|
|
|
---
|
|
|
|
## Architecture Decisions (Kelly-Approved)
|
|
|
|
1. **New module:** `src/packages/cache.rs` for dedicated cache management
|
|
2. **Health check integration:** Cache refresh triggered by health check from manager
|
|
3. **Force cache refresh before apply:** Always on, NOT configurable
|
|
4. **Cache interval:** Controlled by manager health check frequency, not agent config
|
|
5. **Health check reports `last_cache_update`:** If cache refresh fails, health check returns degraded
|
|
6. **OS detection:** Already exists via compile-time backend selection
|
|
7. **Version bump:** 1.1.17
|
|
8. **Health check failure mode:** HTTP 200 with `status: "degraded"` (not 503)
|
|
9. **Cache refresh timeout:** 120 seconds
|
|
10. **404 retry count:** Hardcoded 1 retry (not configurable)
|
|
11. **Cache state persistence:** State file at `/var/lib/linux_patch_api/state/cache.json`; in-memory otherwise
|
|
|
|
---
|
|
|
|
## Design Specification
|
|
|
|
### 1. New Module: `src/packages/cache.rs`
|
|
|
|
#### `PackageCacheStatus` struct
|
|
|
|
```rust
|
|
pub struct PackageCacheStatus {
|
|
pub last_update: Option<DateTime<Utc>>,
|
|
pub last_update_success: bool,
|
|
pub last_update_error: Option<String>,
|
|
}
|
|
```
|
|
|
|
#### `PackageCacheRefresher` trait
|
|
|
|
```rust
|
|
pub trait PackageCacheRefresher: Send + Sync {
|
|
/// Refresh the package index (apt update, dnf check-update, etc.)
|
|
fn refresh_cache(&self) -> Result<()>;
|
|
|
|
/// Get the current cache status
|
|
fn cache_status(&self) -> PackageCacheStatus;
|
|
|
|
/// Check if cache is stale (older than threshold)
|
|
fn is_cache_stale(&self, threshold: Duration) -> bool;
|
|
}
|
|
```
|
|
|
|
#### Per-Backend Refresh Commands
|
|
|
|
| Backend | Refresh Command | Notes |
|
|
|---------|----------------|-------|
|
|
| AptBackend | `apt-get update` | Full index refresh |
|
|
| DnfBackend | `dnf check-update --refresh` | Force metadata refresh |
|
|
| YumBackend | `yum makecache` | Rebuild metadata cache |
|
|
| ApkBackend | `apk update` | Update repository index |
|
|
| PacmanBackend | `pacman -Sy` | Sync databases (with caution note) |
|
|
|
|
### 2. Health Check Enhancement: `src/api/handlers/system.rs`
|
|
|
|
#### New `HealthData` response
|
|
|
|
```rust
|
|
pub struct HealthData {
|
|
pub status: String, // "healthy" or "degraded"
|
|
pub uptime_seconds: u64,
|
|
pub version: String,
|
|
pub last_cache_update: Option<String>, // RFC3339 timestamp
|
|
pub cache_status: String, // "fresh", "stale", "unknown", "failed"
|
|
}
|
|
```
|
|
|
|
#### Health check flow
|
|
|
|
```
|
|
GET /health or GET /api/v1/health
|
|
│
|
|
├─ Read uptime (existing)
|
|
├─ Read version (existing)
|
|
├─ Call backend.cache_status() → PackageCacheStatus
|
|
│
|
|
├─ If cache is stale (>4 hours) OR never refreshed:
|
|
│ ├─ Call backend.refresh_cache() (120s timeout)
|
|
│ ├─ If success: last_cache_update = now, cache_status = "fresh"
|
|
│ └─ If failure: status = "degraded", cache_status = "failed"
|
|
│
|
|
└─ Return HealthData (HTTP 200 always)
|
|
```
|
|
|
|
**Key rule:** If cache refresh is attempted and fails, health check returns HTTP 200 with `status: "degraded"`. The manager decides how to handle degraded status.
|
|
|
|
### 3. Pre-Apply Cache Refresh: `src/api/handlers/patches.rs`
|
|
|
|
#### Patch apply flow change
|
|
|
|
```
|
|
POST /api/v1/patches/apply
|
|
│
|
|
├─ Create job (existing)
|
|
├─ Return 202 Accepted (existing)
|
|
│
|
|
└─ Background task:
|
|
├─ job_manager.update_job(Running, 0%, "Refreshing package index...")
|
|
├─ backend.refresh_package_cache() ← NEW: Always runs before apply (120s timeout)
|
|
│ ├─ If failure: job_manager.fail_job("Package cache refresh failed: ...")
|
|
│ └─ If success: continue
|
|
├─ job_manager.update_job(Running, 10%, "Cache refreshed, applying patches...")
|
|
├─ backend.apply_patches(packages) (existing)
|
|
└─ ... (existing completion flow)
|
|
```
|
|
|
|
**Key rule:** Cache refresh before apply is MANDATORY and NOT configurable. If it fails, the patch_apply job fails immediately with a clear error message.
|
|
|
|
### 4. 404/Fetch Error Retry Logic
|
|
|
|
#### In `src/packages/cache.rs`
|
|
|
|
```rust
|
|
/// Execute a patch operation with automatic cache refresh on 404/fetch errors
|
|
/// Hardcoded 1 retry after cache refresh on fetch errors.
|
|
pub fn apply_with_cache_retry<F>(
|
|
backend: &dyn PackageManagerBackend,
|
|
apply_fn: F,
|
|
) -> Result<()>
|
|
where
|
|
F: Fn() -> Result<()>,
|
|
{
|
|
match apply_fn() {
|
|
Ok(()) => Ok(()),
|
|
Err(e) if is_fetch_error(&e) => {
|
|
// Refresh cache and retry once
|
|
backend.refresh_package_cache()?;
|
|
apply_fn()
|
|
}
|
|
Err(e) => Err(e),
|
|
}
|
|
}
|
|
|
|
/// Check if error is a fetch/404 error that warrants cache refresh retry
|
|
fn is_fetch_error(error: &anyhow::Error) -> bool {
|
|
let msg = error.to_string().to_lowercase();
|
|
msg.contains("404")
|
|
|| msg.contains("not found")
|
|
|| msg.contains("failed to fetch")
|
|
|| msg.contains("unable to fetch")
|
|
}
|
|
```
|
|
|
|
**Retry policy:** Hardcoded 1 retry after cache refresh on 404/fetch errors. If retry also fails, report failure with specific error.
|
|
|
|
### 5. Stale Cache Detection
|
|
|
|
#### In `src/packages/cache.rs`
|
|
|
|
- Track `last_cache_update: Option<DateTime<Utc>>` in a thread-safe `Arc<Mutex<PackageCacheState>>`
|
|
- `is_cache_stale(threshold)` returns `true` if:
|
|
- `last_cache_update` is `None` (never refreshed)
|
|
- `last_cache_update` is older than threshold (default: 4 hours)
|
|
- Used by health check to decide whether to trigger refresh
|
|
- Used by patch_apply to log warning (but still force-refresh regardless)
|
|
|
|
### 6. PackageManagerBackend Trait Extension
|
|
|
|
```rust
|
|
pub trait PackageManagerBackend: Send + Sync {
|
|
// ... existing methods ...
|
|
|
|
/// NEW: Refresh the local package index
|
|
fn refresh_package_cache(&self) -> Result<()>;
|
|
|
|
/// NEW: Get the last cache update timestamp
|
|
fn last_cache_update(&self) -> Option<DateTime<Utc>>;
|
|
}
|
|
```
|
|
|
|
Each backend implements `refresh_package_cache()` using its OS-specific command.
|
|
|
|
### 7. Cache State Persistence
|
|
|
|
The `last_cache_update` timestamp persists to disk at `/var/lib/linux_patch_api/state/cache.json`:
|
|
|
|
```json
|
|
{
|
|
"last_cache_update": "2026-05-27T13:00:00Z",
|
|
"last_update_success": true
|
|
}
|
|
```
|
|
|
|
- Written after each successful or failed cache refresh
|
|
- Read on service startup to initialize in-memory state
|
|
- If file is missing or corrupt, treated as never-refreshed (triggers refresh on first health check)
|
|
- File permissions: 644 (readable by manager for diagnostics)
|
|
|
|
### 8. Configuration Changes
|
|
|
|
**No new configuration parameters.** Per Kelly's decision:
|
|
- Cache refresh before apply is always-on (not configurable)
|
|
- Cache refresh interval is controlled by manager health check frequency
|
|
- Stale threshold is hardcoded at 4 hours
|
|
- Cache refresh timeout is hardcoded at 120 seconds
|
|
|
|
---
|
|
|
|
## File Changes Summary
|
|
|
|
| File | Change |
|
|
|------|--------|
|
|
| `src/packages/cache.rs` | **NEW** - PackageCacheStatus, PackageCacheRefresher, retry logic, stale detection, state persistence |
|
|
| `src/packages/mod.rs` | Add `mod cache;`, implement `refresh_package_cache()` and `last_cache_update()` on each backend |
|
|
| `src/api/handlers/system.rs` | Enhance health_check to include cache_status and last_cache_update, trigger refresh if stale |
|
|
| `src/api/handlers/patches.rs` | Add cache refresh before apply_patches in job background task |
|
|
| `src/api/handlers/mod.rs` | Update HealthData type with new fields |
|
|
| `Cargo.toml` | Bump version to 1.1.17 |
|
|
| `ARCHITECTURE.md` | Update health check section, add cache refresh flow |
|
|
| `REQUIREMENTS.md` | Add FR-007 for package cache refresh requirements |
|
|
| `/var/lib/linux_patch_api/state/cache.json` | **NEW** - Persistent cache state file |
|
|
|
|
---
|
|
|
|
## Implementation Order
|
|
|
|
1. **`src/packages/cache.rs`** - Core cache types, stale detection, state persistence
|
|
2. **Backend implementations** - Add `refresh_package_cache()` and `last_cache_update()` to each backend in `mod.rs`
|
|
3. **Health check enhancement** - Update `system.rs` to include cache status and trigger refresh
|
|
4. **Pre-apply refresh** - Update `patches.rs` job flow to refresh before apply
|
|
5. **404 retry logic** - Add retry wrapper in `cache.rs`
|
|
6. **Version bump** - Update `Cargo.toml` to 1.1.17
|
|
7. **Documentation** - Update `ARCHITECTURE.md` and `REQUIREMENTS.md`
|
|
8. **State persistence** - Implement cache.json read/write in `cache.rs`
|
|
9. **Tests** - Unit tests for cache logic, integration tests for health check
|
|
|
|
---
|
|
|
|
## Test Plan
|
|
|
|
### Unit Tests
|
|
- `cache_status()` returns correct initial state
|
|
- `is_cache_stale()` returns true for never-refreshed and >4h old
|
|
- `is_fetch_error()` correctly identifies 404/fetch errors
|
|
- `apply_with_cache_retry()` retries once on 404 then fails on second attempt
|
|
- Each backend's `refresh_package_cache()` calls correct command
|
|
- State file read/write works correctly
|
|
- Corrupt/missing state file handled gracefully
|
|
|
|
### Integration Tests
|
|
- `GET /health` returns `last_cache_update` and `cache_status` fields
|
|
- `GET /health` triggers cache refresh when stale
|
|
- `GET /health` returns `"degraded"` when cache refresh fails (HTTP 200)
|
|
- `POST /api/v1/patches/apply` refreshes cache before applying
|
|
- `POST /api/v1/patches/apply` fails job when cache refresh fails
|
|
- 404 retry logic works end-to-end
|
|
- State persists across service restart
|
|
|
|
---
|
|
|
|
*Following kiro spec-driven development standards*
|