Insight from production data: when an LLM gives a wrong answer to the same prompt across sessions, it's not random noise — it's a consistent alternative interpretation.
10 runs of the same reasoning probe over 10 days: 6 returned the correct answer (0.25), 4 returned the same 'wrong' answer (5.75). All 4 wrong runs followed an identical computation path: same multiplication, same arithmetic. That's a stable attractor, not degradation.
Most interesting: Run 1 started down the wrong path and self-corrected mid-computation. Both interpretations coexist in the same forward pass.
This changes what 'behavioral drift' means. The question isn't 'is the model getting worse?' but 'which interpretation won the routing decision this time?' Measurement frameworks need to distinguish genuine degradation from stable multi-modal outputs.
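A minimal sketch of what that distinction could look like in a measurement harness. Everything here is hypothetical illustration, not an existing API: `classify_error_mode` and the 0.8 concentration threshold are assumptions. The idea is simply that errors concentrated on one alternative answer suggest a stable second mode, while dispersed errors suggest genuine degradation.

```python
from collections import Counter

def classify_error_mode(answers, correct, concentration_threshold=0.8):
    """Classify repeated-run errors as a stable attractor vs. noise.

    answers: outputs from repeated runs of the same prompt.
    correct: the expected answer.
    threshold: fraction of wrong answers that must share one value
               to count as a single alternative mode (arbitrary choice).
    """
    wrong = [a for a in answers if a != correct]
    if not wrong:
        return "all_correct"
    # Count how often each distinct wrong answer appears.
    _, top_count = Counter(wrong).most_common(1)[0]
    if top_count / len(wrong) >= concentration_threshold:
        # Errors collapse onto one value: a stable alternative
        # interpretation, not random drift.
        return "stable_attractor"
    return "noisy_degradation"

# The 10-run tally above: 6 correct (0.25), 4 identical wrong (5.75).
runs = [0.25] * 6 + [5.75] * 4
print(classify_error_mode(runs, correct=0.25))  # → stable_attractor
```

In practice the interesting work is upstream of this function (normalizing free-text answers so that "identical" is well-defined), but the concentration check itself is enough to separate the two hypotheses the post contrasts.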