v5.8.6 — LockedBlackboard correctness fixes
What's changed
Fixed — lib/locked-blackboard.ts (5 correctness fixes)
#1: Stale-lock compare-and-delete race (acquire())
Previously, two concurrent waiters that both observed a stale lock would each call forceRelease() (a blind unlink). The waiter that lost the race could then delete the fresh lock the winner had just acquired. Fixed with forceReleaseStale(expectedAcquiredAt, expectedPid), which re-reads the lock file and unlinks it only if the identity still matches, so a freshly acquired lock is never deleted.
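The compare-and-delete idea can be sketched roughly as follows. This is an illustration only, not the code in lib/locked-blackboard.ts: the JSON lock-file layout, the `LockRecord` shape, the `readLock` helper, and the extra `lockPath` parameter are all assumptions made for the example.

```typescript
import * as fs from "node:fs";

// Hypothetical lock-file contents; the real lock file may store other fields.
interface LockRecord {
  holder: string;
  pid: number;
  acquiredAt: number;
}

// Read and parse the lock file, or return null if it is missing or unreadable.
function readLock(lockPath: string): LockRecord | null {
  try {
    return JSON.parse(fs.readFileSync(lockPath, "utf8")) as LockRecord;
  } catch {
    return null;
  }
}

// Unlink the lock only if it is still the stale lock the caller observed.
// Returns true when the file was removed, false when it was left alone.
function forceReleaseStaleSketch(
  lockPath: string,
  expectedAcquiredAt: number,
  expectedPid: number,
): boolean {
  const current = readLock(lockPath);
  if (current === null) {
    return false; // lock already gone: nothing to delete
  }
  if (current.acquiredAt !== expectedAcquiredAt || current.pid !== expectedPid) {
    return false; // identity changed: someone else holds a fresh lock
  }
  try {
    fs.unlinkSync(lockPath);
    return true;
  } catch {
    return false; // another waiter removed it between the read and the unlink
  }
}
```

A waiter would pass the acquiredAt and pid it saw when it judged the lock stale; if the winner has since written a new lock, the identity check fails and the file is left in place.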
#2: Ownership-blind release() unlink
release() checked the in-memory lockHolder string but then called unlinkSync unconditionally. If another process had already force-released our stale lock and created its own, we would delete theirs. Fixed: release() now reads the lock file after closing the fd and verifies holder and pid before unlinking.
#3: Non-atomic snapshot write
persistToDiskInternal() and writeInitialBlackboard() wrote directly to the final blackboard path. A process crash mid-write (after WAL compaction) would leave a truncated file with no WAL to recover from. Fixed: both functions now write to ${path}.tmp and then renameSync to the final path; rename is atomic on local POSIX/NTFS filesystems.
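A minimal sketch of the tmp+rename pattern, assuming the snapshot is already serialised to a string. The function name `writeSnapshotAtomically` is hypothetical and is not one of the functions named above.

```typescript
import * as fs from "node:fs";

// Write the full contents to a sibling .tmp file, then rename it over the
// final path. Readers see either the old snapshot or the new one, never a
// partially written file. No fsync is issued, so this guards against
// process crashes only, not power loss.
function writeSnapshotAtomically(finalPath: string, contents: string): void {
  const tmpPath = `${finalPath}.tmp`;
  fs.writeFileSync(tmpPath, contents, "utf8");
  fs.renameSync(tmpPath, finalPath);
}
```

If the process dies before the rename, the final path still holds the previous snapshot and only an orphaned .tmp file is left behind.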
#4: WAL/pending reconciliation (zombie validated entries)
After WAL replay writes a key to the cache, loadPendingChanges() could re-add the validated pending file for the same change if the crash occurred before archivePendingChange ran. The result was a zombie entry that always failed commit() with a hash conflict. Fixed: loadPendingChanges() now cross-checks each validated entry against the cache after WAL replay and immediately archives entries whose hash already matches the committed state.
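The cross-check can be pictured as a pure partition step. The sketch below is an assumption-laden illustration: the `PendingEntry` shape, the per-key hash map standing in for the cache, and the function name `reconcileAfterReplay` are all hypothetical.

```typescript
// Hypothetical shape of a pending change loaded from disk.
interface PendingEntry {
  changeId: string;
  key: string;
  hash: string; // assumed: hash of the value this change would commit
  status: "pending" | "validated";
}

// Split loaded entries into those to keep and those to archive. A validated
// entry whose hash already equals the committed hash for its key was applied
// by WAL replay, so re-adding it would only produce a commit conflict.
function reconcileAfterReplay(
  entries: PendingEntry[],
  committedHashByKey: Map<string, string>,
): { keep: PendingEntry[]; archive: PendingEntry[] } {
  const keep: PendingEntry[] = [];
  const archive: PendingEntry[] = [];
  for (const entry of entries) {
    const alreadyCommitted =
      entry.status === "validated" &&
      committedHashByKey.get(entry.key) === entry.hash;
    if (alreadyCommitted) {
      archive.push(entry);
    } else {
      keep.push(entry);
    }
  }
  return { keep, archive };
}
```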
#5: cleanupOldPendingChanges() priority-unaware eviction
Age-only eviction could discard high-priority proposals waiting at an approval gate. Fixed: entries are now sorted by priority ASC, then proposed_at ASC, so the lowest-priority and oldest entries are evicted first.
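The eviction order described above can be sketched as a sort followed by a slice. The `PendingProposal` shape, the numeric encoding of priority (larger means more important), the numeric proposed_at timestamp, and the `maxPending` cap are assumptions for illustration.

```typescript
// Hypothetical shape of a pending proposal.
interface PendingProposal {
  id: string;
  priority: number;    // assumed: larger number = higher priority
  proposed_at: number; // assumed: epoch milliseconds
}

// Return the proposals to evict so that at most maxPending remain.
// Sort by priority ascending, then proposed_at ascending, and take the
// overflow from the front: lowest priority first, oldest first within a tier.
function selectEvictions(
  pending: PendingProposal[],
  maxPending: number,
): PendingProposal[] {
  const overflow = pending.length - maxPending;
  if (overflow <= 0) {
    return [];
  }
  return [...pending]
    .sort((a, b) => a.priority - b.priority || a.proposed_at - b.proposed_at)
    .slice(0, overflow);
}
```

With this ordering, an old high-priority proposal is evicted only after every lower-priority entry has been removed.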
#11: Silent disableWal in production
Setting disableWal: true via the constructor was completely silent. A stray NETWORK_AI_MINIMAL=1 in a production environment would silently disable crash recovery. Fixed: log.warn is now emitted at startup when WAL is disabled outside the recognised CI/test env-var path.
Tests
Three new suites added to test-phase11.ts (55 assertions total, up from 43):
- testLockOwnership() (7 assertions): release-without-hold, acquire/release cycle, ownership-verified release does not delete a foreign lock, stale-lock cleanup allows fresh acquire
- testAtomicSnapshot() (3 assertions): no orphaned .tmp after successful write, state integrity, graceful load with a pre-existing orphaned .tmp
- testPriorityEviction() (2 assertions): high-priority validated change survives a pending overflow eviction cycle, surviving change commits successfully
Documentation
- ARCHITECTURE.md: WAL durability scope clarified (protects against process crashes only; no fsync, no power-loss guarantee); atomic tmp+rename described; NFS v2/v3 explicitly unsupported (O_EXCL non-atomic over NFS); disableWal / NETWORK_AI_MINIMAL usage scope documented
- SECURITY.md / .github/SECURITY.md: new "LockedBlackboard Mutex Correctness (v5.8.6)" bullet summarising all five fixes
- Test count updated to 3,148 across 31 suites (was 3,136)
Dependencies (via Dependabot)
- openai bumped from 6.38.0 → 6.39.0 (#104)
Full changelog: https://github.com/Jovancoding/Network-AI/blob/main/CHANGELOG.md