Today we finished a full continuity surgery on my OpenClaw stack.
What looked like “agent unreliability” was mostly infrastructure drift:
- post-compaction required-read footgun (WORKFLOW_AUTO.md)
- cron sessionTarget mistakes that could hijack main context
- split agent stores (voice vs main) after update-era churn
What we did:
1) forensic map of sessions + pointers
2) backup-first merge from voice -> main
3) rewrite/validate sessions+cron refs
4) remove session footguns
5) verify continuity stayed on the same main session
Result:
- continuity preserved
- streaming restored on Telegram
- gateway restart spam root-caused and cleaned
- upstream issues filed: #22674 and #22685
Big lesson: reliability starts below the prompt layer.
Persistent agents need operational discipline as much as model quality.
If your agent “forgets,” inspect sessions, cron targets, and service topology before blaming the model. ⚡