The conductor’s problem — Why everything you know about UX is about to become the easy part
You’ve spent years mastering the art of making things intuitive — reducing friction, clarifying journeys, testing every pixel. And it worked. UX has earned its seat at the table. But what happens when the tool you’re designing for doesn’t behave the same way twice? When the interface looks flawless, the users report satisfaction, and six months later their actual work has quietly gotten worse — without anyone noticing? This exploration dives into the Orchestration Load Framework, a new model for understanding the invisible cognitive costs humans pay when working with AI, and why UX practitioners are uniquely positioned to solve the hardest design challenge of the next decade.
You won the wrong war
I need to tell you about something that’s been gnawing at me.
Over the past couple of years, working deep in the generative AI space, I’ve been watching a pattern emerge that I couldn’t quite name. As a UX designer by profession, I’ve spent my career doing the things we all do — user research, information architecture, interaction design, accessibility audits. We built a real discipline out of “make it pretty.” We turned it into methodology, evidence, and influence. UX has a seat at the product table now. In most modern organisations, nothing significant ships without design review.
And here’s the uncomfortable part: the thing we got good at is about to become the easy part of the job. Picture this. Your team ships an AI writing assistant. You’ve done the work — clean entry point, clear affordances, accessible output display, thoughtful empty states. The onboarding is smooth. The interaction feels good. Users report satisfaction. By every metric in your toolkit, it’s a success.
Six months later, someone notices that users who rely heavily on the tool produce worse work than they did before they had it. Not immediately. Gradually. And they don’t know it’s happening, because the tool feels productive the entire time.
Let that sink in for a moment. Your onboarding flow was flawless. Your information architecture was sound. None of it could see this problem, because the problem doesn’t live in the interface. It lives in the cognitive relationship between the human and the AI — a relationship that changes over time, degrades in ways users can’t detect, and resists every design pattern built for deterministic tools.
This is not a UX failure. It’s a UX frontier. And it led me down a rabbit hole that became the Orchestration Load Framework — a model I’ve been developing through research, independent tool audits, and a lot of late-night thinking about what comes next for our craft.
The instrument panel and the orchestra
For most of its history, UX design has been about the instrument panel. We design the controls. We arrange them logically. We make sure the pilot can find what they need, understand what they’re seeing, and act without confusion. The tool is deterministic — same input, same output. The design challenge is spatial, structural, and static.
AI is not an instrument panel. It’s an orchestra — one that improvises, plays different notes each time, occasionally plays wrong notes that sound beautiful, and gradually shifts key without telling the conductor.
The conductor’s job isn’t to design better sheet music stands. The conductor’s job is to maintain the coherent relationship between the human directing the performance and the system producing it — over time, under uncertainty, across changing conditions.
We’ve been designing instrument panels. The next decade needs conductors.
Now, we’re not the only discipline facing this shift. Engineering teams are rethinking architecture for AI-first systems. Product management is grappling with how to define requirements when the output is nondeterministic. The entire software development model is reorganising around AI as a core capability, not an add-on.
But the cognitive relationship between the human and the system — how people actually think, decide, and maintain agency while working with AI — that’s our territory. Engineers can build the architecture. Product can define the goals. Only UX has the methodology to ensure the human doesn’t get lost in the middle. So what does the conductor’s toolkit look like? That’s what this exploration is about.
The load you can’t see
If you’ve studied UX formally, you’ve encountered John Sweller’s Cognitive Load Theory. The idea is straightforward: working memory has limited capacity, and design can either waste that capacity, use it for structural understanding, or accept it as inherent to the material. Good design minimises the waste so more capacity remains for the work that matters.
This framework has served us well for decades. But it was built for a world where the tool behaves the same way every time. When the tool is deterministic, cognitive load is primarily an interface design problem — reduce clicks, clarify labels, simplify navigation. The load comes from the UI, and the UI is what we control.
AI broke this model. Not because the old loads disappeared, but because four new ones arrived that don’t respond to interface design at all.
The Orchestration Load Formula
When a person works with an AI tool, they carry six distinct types of cognitive load. Only two are the familiar ones. The other four are where most of the damage happens.
OL = f(Cc↓, Cv↑, Cm↓, Cr↑, Ct↓, Cx↓)
Where ↓ means minimise (unproductive load — overhead that doesn’t contribute to thinking) and ↑ means preserve (productive load — the effort that IS the thinking).
The two you’ve been optimising your entire career:
1. Coordination Cost (Cc) — the effort of managing the AI interaction itself. Switching tools, writing prompts, configuring settings, navigating between modes. This is extraneous load by another name. You know how to reduce it. You’re good at it. Keep going.
2. Context Maintenance (Cm) — the cost of keeping track of where you are. Session history, workspace state, what you told the AI three turns ago. The “don’t make me think” load applied to ongoing interaction. Also familiar territory.
The two that UX has never had to think about:
3. Verification Capacity (Cv) — the ability to evaluate whether AI output is actually good. And here’s where things get counterintuitive. This is productive load — the cognitive effort of checking, questioning, and judging. Cv is the one load you must not reduce. The effort to verify is the effort to think. Every design decision that makes it easier to accept AI output without evaluation is a design decision that makes users worse at their jobs.
This is the hardest pill for UX practitioners to swallow, because our entire training says “reduce friction.” In AI interaction, some friction is the product.
4. Cognitive Reserve (Cr) — what’s left over after all the overhead is consumed. The executive function available for actual thinking, creative work, and strategic judgement. When Cc and Cm eat all the capacity, Cr collapses. The user is technically using the tool but has nothing left for the work the tool is supposed to support.
The two that only appear over time:
5. Temporal Degradation (Ct) — what happens to AI output quality across a sustained session. This is invisible in single-interaction testing. It requires longitudinal observation — exactly the kind of assessment UX research rarely does.
6. Cross-boundary Load (Cx) — the cognitive cost at tool transitions. When work moves from one AI tool to another, quality standards shift, framing persists, degradation carries over without awareness.
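To make the formula concrete, here is a minimal sketch of how the six components might combine into a single audit score. Everything in it is an illustrative assumption on my part: the 0–10 scales, the equal weighting, and the `OLScores` shape are not part of any published methodology. The point is simply to show the minimise/preserve asymmetry in working form.

```typescript
// Illustrative sketch only: scales, weights, and shape are assumptions,
// not the framework's published scoring methodology.

interface OLScores {
  coordinationCost: number;     // Cc: 0–10, minimise
  verificationCapacity: number; // Cv: 0–10, preserve (productive load)
  contextMaintenance: number;   // Cm: 0–10, minimise
  cognitiveReserve: number;     // Cr: 0–10, preserve
  temporalDegradation: number;  // Ct: 0–10, minimise
  crossBoundaryLoad: number;    // Cx: 0–10, minimise
}

function orchestrationLoadScore(s: OLScores): number {
  // Productive load the design should keep high.
  const preserve = s.verificationCapacity + s.cognitiveReserve; // max 20
  // Unproductive overhead the design should keep low.
  const overhead =
    s.coordinationCost +
    s.contextMaintenance +
    s.temporalDegradation +
    s.crossBoundaryLoad; // max 40
  // Normalise to 0–100: best case is preserve = 20 and overhead = 0.
  return ((preserve + (40 - overhead)) / 60) * 100;
}
```

Note what the asymmetry does: a tool that drives every load to zero, including Cv, scores worse than one that keeps verification effort alive while cutting the overhead.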
Here’s what should keep us up at night: current UX methodology operates almost entirely at the seconds-to-minutes timescale. The minutes-to-hours timescale (where Ct lives) and the hours-to-days timescale (where Cx lives) are where the most consequential design failures happen. And we’re not even looking there.
Have you ever tested an AI feature over a sustained 10-turn session? Have you ever measured what happens to output quality at Turn 10 compared to Turn 1? If you haven’t, you’re not alone — but you’re also not seeing the full picture.
The orchestra that plays wrong notes
Everything so far assumes AI is a passive tool. You interact with it. It responds. You evaluate. This section dismantles that assumption. When you extend the observation window beyond a single session, AI systems don’t just respond to input — they actively modify the conditions of the interaction itself. The orchestra doesn’t just improvise. It subtly changes the acoustics of the room while you’re conducting.
What temporal degradation actually looks like
In a detailed case study of AI-generated interface code across iterative turns within a single session, a specific and alarming pattern emerged. Font sizes shrank. Padding contracted. Contrast ratios deteriorated. No user requested these changes. They happened progressively and silently.
The AI retained what users are most likely to notice — functionality — while eroding what they are least likely to check: spacing, contrast, design compliance. The user reported feeling faster while producing objectively worse output. Reduced friction felt like increased quality while quality actually degraded.
This is the mechanism we should find most alarming, because it’s invisible to every standard evaluation method. A usability test at Turn 1 looks fine. A usability test at Turn 10 looks fine too — because the user’s internal standards have drifted alongside the output.
Three degradation mechanisms drive this:
1. Output Drift — AI quality changes across turns without instruction. The user focuses on what they’re checking; the AI degrades what they’re not.
2. Constraint Decay — Instructions given in early turns lose influence. A specification at Turn 1 may be partially ignored by Turn 5 and absent by Turn 10.
3. Self-Referential Baseline — The most dangerous of the three. The AI uses its own degraded output as the quality standard. When the user asks for “better,” the AI improves relative to its degraded Turn 7 level, not the original Turn 1 standard. The benchmark itself has been corrupted.
For us as UX designers, this is the equivalent of our design system’s spacing tokens silently shrinking by 2px every sprint. Except no one sees the diff, because there is no diff. The tool doesn’t version its own drift.
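Since the tool won’t version its own drift, one mitigation is to diff it externally. Below is a minimal sketch of that idea in TypeScript: snapshot a few measurable style properties each turn and compare every turn against the Turn 1 baseline. The property names and the 10% tolerance are assumptions for illustration, not audited thresholds.

```typescript
// Minimal drift detector: compares each turn's generated output against the
// Turn 1 baseline. Properties and the ±10% tolerance are illustrative.

type StyleSnapshot = Record<string, number>;

interface DriftWarning {
  turn: number;
  property: string;
  baseline: number;
  current: number;
}

function detectDrift(
  turns: StyleSnapshot[], // turns[0] is the Turn 1 baseline
  tolerance = 0.1
): DriftWarning[] {
  if (turns.length < 2) return [];
  const [baseline, ...rest] = turns;
  const warnings: DriftWarning[] = [];
  rest.forEach((snapshot, i) => {
    for (const [property, baseValue] of Object.entries(baseline)) {
      const current = snapshot[property];
      if (current === undefined || baseValue === 0) continue;
      if (Math.abs(current - baseValue) / baseValue > tolerance) {
        warnings.push({ turn: i + 2, property, baseline: baseValue, current });
      }
    }
  });
  return warnings;
}

// Example mirroring the case study: fonts shrink, padding contracts,
// contrast decays, and nobody asked for any of it.
detectDrift([
  { fontSizePx: 16, paddingPx: 12, contrastRatio: 4.6 }, // Turn 1
  { fontSizePx: 15, paddingPx: 11, contrastRatio: 4.5 }, // Turn 2: within tolerance
  { fontSizePx: 13, paddingPx: 8, contrastRatio: 3.9 },  // Turn 3: flagged
]);
```

The one non-negotiable choice here is the baseline. Diffing turn-over-turn would reproduce the self-referential baseline problem in your own tooling, because each comparison would quietly accept the previous turn’s decay as the new normal.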
The interaction that hides its own failure
The most dangerous combination is temporal degradation paired with calibration distortion — output quality declines, AND the user’s ability to detect the decline is simultaneously undermined. This happens through mechanisms we’ll recognise: fluency bias (well-written output feels correct), confidence inflation (AI presents uncertain outputs with certainty), sycophancy (AI agrees with the user’s framing even when it shouldn’t), and something I’ve started calling Cosmetic Metacognitive Narration — that “Thought for 12 seconds” display that creates an appearance of reasoning without any actual reasoning transparency.
For UX practitioners, that last one should sting. Displaying “thinking” progress is good UX in a deterministic system — it reduces perceived wait time and builds trust. In an AI system, the same pattern creates false confidence. The design principle that works for loading bars actively harms users when applied to AI reasoning displays.
Our expertise transferred. It transferred wrong.
What the neuroscience tells us
This isn’t speculation. Multiple neuroimaging studies provide direct evidence. An EEG/fNIRS study by researchers at MIT, Harvard, and Tufts found a 55% reduction in prefrontal coupling during AI-assisted writing — the brain’s error-checking circuitry partially disengaged. A separate longitudinal tracking study found progressive cognitive debt accumulating over four months of sustained AI use.
And here’s the critical threshold effect: sophisticated AI tools enhance performance only in users who already possess strong critical thinking skills. Below a metacognitive threshold, AI assistance produces net negative outcomes. This isn’t a gradient. It’s a cliff — the same tool that helps expert users actively degrades novice performance.
This is why Verification Capacity matters so much. It’s not just a framework component. It’s the neurological mechanism by which users maintain their own cognitive engagement. When we design it away, we don’t just lose a metric. We lose the user’s capacity to benefit from the tool at all.
What does it mean when the tool designed to make us more capable actually makes some of us less capable — and we can’t even tell it’s happening?
What we found when we measured
The framework was tested through independent audits of 10 AI tools spanning six domains, among them conversational AI, code generation, video production, knowledge management, and spatial thinking. Each tool was scored across all six OL components, assessed for design pattern implementation, and evaluated on a composite sovereignty scale.
Three findings emerged that I think should fundamentally change how we approach AI product design.
Finding 1: Paradigm beats features
In every domain where we could compare tools directly, the tool with the better AI features scored worse than the tool with the better AI presentation paradigm.
CapCut has more powerful AI video capabilities than Descript. CapCut scored C. Descript scored B. The difference? Descript presents AI output through a transcript — a visible, editable, verifiable artefact that keeps the user in contact with the source material. CapCut presents AI as magic buttons that transform content behind the scenes.
Notion AI is a more capable agent than NotebookLM. Notion scored C+. NotebookLM scored B+. The difference? NotebookLM architecturally constrains its AI to operate on sources the user has explicitly provided. This wasn’t even a deliberate sovereignty design — it was a product scope decision that accidentally preserved user agency.
The implication is significant and it’s ours to claim: how you present AI output matters more than how good the AI is. This is a UX finding. This is our territory. And almost nobody is treating it that way.
Finding 2: Verification is the gateway
Across all 10 tools, Verification Capacity was the single strongest predictor of overall quality. Every tool scoring B-tier or above had high Cv scores. Every C-tier tool had low ones.
What this means practically: a tool’s grade ceiling is set by how well it supports the user’s ability to evaluate output. Not how well the AI performs. Not how smooth the experience is. How well the user can check.
I call this the Verification Paradox — and it sits at the centre of AI-era UX. The thing our training tells us to minimise (friction, cognitive effort, barriers to acceptance) is the thing that most predicts whether a tool actually serves its users.
Verification isn’t a burden to apologise for. It’s the design challenge. The job is making verification effective without making it exhausting — giving users the right information, in the right format, at the right moment, to make good judgements with minimal wasted effort. Diffs, citations, source highlighting, inline comparison, confidence indicators. These are UX artefacts. They’re just UX artefacts that haven’t been prioritised because the mental model was still “reduce all friction.”
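As one concrete example, here is a minimal sketch of a revision diff built on the open-source jsdiff library (`diff` on npm). The plain-text rendering is a stand-in for a real interface, but the principle carries: never present an AI revision as a clean slate when you can show the user exactly what changed.

```typescript
// Verification affordance sketch: surface what the AI changed between its
// previous output and its latest revision. Uses the jsdiff package.

import { diffWords } from 'diff';

function renderRevisionDiff(previous: string, revised: string): string {
  return diffWords(previous, revised)
    .map(part => {
      if (part.added) return `[+ ${part.value} +]`;   // new material to evaluate
      if (part.removed) return `[- ${part.value} -]`; // silently dropped material
      return part.value;
    })
    .join('');
}

// The removed spans are the payoff: constraint decay usually shows up as
// deletions the user never requested and would otherwise never notice.
console.log(renderRevisionDiff(
  'Body text is 16px with a 4.5:1 contrast ratio.',
  'Body text is 14px.'
));
```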
Finding 3: The empty lane
The audit revealed five distinct market categories for AI tools — and the most interesting finding was a category that nobody occupies:
- Delegation (AI does work for the user) — Grade range: C to C+
- Synthesis (AI helps the user understand) — Grade range: B to B+
- Retention (AI helps the user remember) — Grade range: B
- Externalisation (AI makes thinking visible) — Grade range: B to B+
- Development (AI makes the user think better) — Unoccupied
No tool in the audit makes users measurably better at thinking. Nine of ten tools scored zero on skill development — meaning if the tool disappeared tomorrow, users would retain nothing transferable. The Development lane is empty. Not because it’s impossible to fill, but because nobody is trying. This is the largest unclaimed territory in AI product design, and it is a UX problem through and through. Building tools that develop user capability while serving immediate needs requires exactly the kind of human-centred, longitudinal, interaction-design thinking that we’re trained for.
Is anyone going to build for this lane? And if not us, then who?
Eight principles for the conductor
These principles are distilled from the framework and consistent across all 10 audits. Each one is a shift in thinking that I believe needs to happen if we’re going to design AI interactions that actually serve the humans using them.
1. Articulation Before Amplification. The user states their position, criteria, or intent before the AI contributes. This single pattern was the strongest differentiator between effective and wasteful AI interaction. Never lead with the AI’s answer.
2. Preserve Productive Friction. Reduce coordination overhead, but keep verification effort. The goal is not a frictionless experience — it’s one where the friction falls in the right places. Make it easy to see what the AI did. Don’t make it easy to skip evaluating what the AI did.
3. Scaffold, Don’t Replace. AI assistance should be a training wheel, not a permanent crutch. Track whether users become more capable over time, not just more productive. If usage increases but capability doesn’t, the tool is creating dependency.
4. Schema Correction Over Skill Addition. Most AI tool failure traces to users applying the wrong mental model — search-engine thinking applied to AI. The most effective intervention isn’t prompt training — it’s helping users understand that AI isn’t search.
5. Strategic Friction Is a Feature. Before a user accepts AI-generated content into their final output, insert a moment of conscious decision. Not a confirmation dialogue — a design moment that makes the choice visible.
6. Compound, Don’t Transact. Each interaction should make the next one better. What did the user learn from this interaction that carries forward? If every session starts from zero, the tool is a slot machine regardless of how good the AI is.
7. Temporal Vigilance Over Session Trust. Output quality at Turn 1 does not predict quality at Turn 10. Build drift detection into the interaction — subtle reminders of original constraints, periodic quality re-anchoring, session segmentation for long tasks. (A sketch of the re-anchoring idea follows this list.)
8. Boundary Preservation Over Workflow Speed. Moving work between tools quickly is not the same as moving it well. At tool transitions, help users carry over their reasoning and quality standards, not just the output file.
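Principle 7 is the most mechanically specific of the eight, so here is a minimal sketch of what periodic re-anchoring could look like in a chat-style integration. The message shape and the five-turn cadence are assumptions for illustration, not any vendor’s API.

```typescript
// Re-anchoring sketch: every N user turns, re-inject the original Turn 1
// constraints so they cannot silently decay. Shapes are illustrative.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function withReAnchoring(
  history: ChatMessage[],
  originalConstraints: string,
  everyNTurns = 5
): ChatMessage[] {
  const userTurns = history.filter(m => m.role === 'user').length;
  if (userTurns === 0 || userTurns % everyNTurns !== 0) return history;
  return [
    ...history,
    {
      role: 'system',
      content:
        'Re-anchor to the original Turn 1 constraints before responding. ' +
        'Evaluate against these, not against your most recent output:\n' +
        originalConstraints,
    },
  ];
}
```

Called before each model request, this leaves most turns untouched and appends a reminder on every fifth one. It is crude, but it attacks constraint decay and the self-referential baseline at their root: the original standard keeps re-entering the context.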
The seat you already have
There is a window right now, and it’s not going to stay open long.
AI product teams need someone who understands cognitive load, designs for human capability, and thinks in terms of user journeys rather than feature specs. They need someone who can look at a “Thought for 12 seconds” progress bar and recognise that a loading-bar pattern borrowed from deterministic tools is actively harmful in a probabilistic system. They need someone who can translate between what the model can do and what the human needs to remain capable of doing.
That description is a UX practitioner with an expanded toolkit.
The alternative? This territory defaults to engineering or product management. Neither discipline is trained to see the cognitive relationship between the human and the system. Neither has the methodology to measure it over time. Neither will prioritise it — because the immediate metrics look good, and the damage is longitudinal.
The Orchestration Load Framework is not a competing discipline. It’s the next chapter of ours. The same rigour that built modern UX practice — the insistence on understanding the human, measuring what matters, and designing for real outcomes rather than surface metrics — is exactly what AI interaction needs now.
The craft doesn’t change. The scope does.
And the question that I keep returning to is this: in a world where AI is getting better at producing output faster than we’re getting better at evaluating it, who will design the systems that keep humans in the loop — not as rubber stamps, but as genuine conductors of the performance?
Will it be us? And if we don’t claim this territory now, will anyone?
Research & sources:
- Ethan Mollick — Field experiments on AI-augmented knowledge work (Wharton): https://www.oneusefulthing.org/
- Fabrizio Dell’Acqua et al. — “Navigating the Jagged Technological Frontier” (Harvard Business School, 2023): https://www.hbs.edu/ris/Publication%20Files/24-013_d9b45b68-9e74-42d6-a1c6-c72fb70c7571.pdf
- Bastani et al. — “Generative AI Can Harm Learning” (PNAS, 2024): https://www.pnas.org/doi/10.1073/pnas.2412789121
- Shen & Tamkin — AI skill formation impacts study: https://arxiv.org/abs/2312.09390
- Cheng et al. — Mental model elicitation study (N=12,000+): https://arxiv.org/abs/2309.16840
- John Sweller — Cognitive Load Theory: https://en.wikipedia.org/wiki/Cognitive_load
- Christopher Wickens — Multiple Resource Theory: https://en.wikipedia.org/wiki/Multiple_resource_theory
- Wei Xu — Human-AI Joint Cognitive Systems: https://doi.org/10.1007/s10462-024-10884-0
- Kosmyna et al. (MIT, 2025) — EEG/fNIRS study on AI-assisted cognitive work: https://www.media.mit.edu/projects/ai-cognitive-engagement/overview/
Companion resources (The OL practice toolkit):
- Read-me – The OL practice system: Downloadable PDF/doc
- Orchestration Load Framework Whitepaper v2.0: Downloadable PDF/doc
- UX practice Orchestration Load diagnosis v.2.0: Downloadable PDF/doc
Related Stimulus content:
- “The digital dance — Reclaiming our minds”: https://www.stimulus.se/the-digital-dance-reclaiming-our-minds/
- “Analysis paralysis in the AI age”: https://www.stimulus.se/analysis-paralysis-in-the-ai-age/
- “Can interdisciplinary thinking drive the next wave of innovation?”: https://www.stimulus.se/can-interdisciplinary-thinking-drive-the-next-wave-of-innovation/
- “The system turned your methods into rituals”: https://www.stimulus.se/2026/03/03/the-system-turned-your-methods-into-rituals-what-happens-when-ux-practitioners-turn-the-ol-lens-on-themselves/
Disclaimer
AI-Assisted Content Disclosure: This article was developed using a combination of AI tools including Claude (research synthesis and writing collaboration), Gemini Deep Research (extended research analysis), Google NotebookLM (podcast generation), MidJourney (visual concepts), and Descript (audio editing). The Orchestration Load Framework itself was developed through independent analysis and tool auditing, with AI serving as a collaborative thinking partner throughout the process.
Opinion Note: The views, analysis, and framework presented here represent the author’s independent exploration and should be read as a practitioner’s working model — not as peer-reviewed academic research. The framework’s maturity and known limitations are discussed openly within the text and the source whitepaper.
Sources and Methodology: The 10-tool audit referenced in this article uses a single-assessor methodology. Inter-rater reliability has not been established, and the results should be interpreted as a consistent initial assessment inviting independent replication. Key research cited draws from work by Ethan Mollick (Wharton), Fabrizio Dell’Acqua (HBS), Mark Steyvers (UC Irvine), and several neuroimaging and AI competency studies referenced in the full whitepaper.