Theatricality

The AI psychiatrist - License No - PS306189

Anurag Mohanty — Tue, 12 May 2026 20:19:12 GMT

Before we go into the psychiatrist.. lets go into muscle memory and obedience.. since thats relevant.

So everyone has a driving-and-cop story. Some funny, some not so funny. This is mine.

When I started driving in the US for the first time, my muscle memory was still calibrated for India, where for lack of a better word you go with the flow. Unlearning that to drive in the US was no small task. I was overcautious when I started driving in Bentonville, Arkansas. I’d just crammed a load of driving instructions, and I’d watched my friends always give way to cop cars and firetrucks and emergency vehicles. I was mentally tuned to do that.

So i was at an NO TURN ON RED intersection intending to turn right. A cop car pulled up behind me.

Instinctively, like a good and obedient student, I turned right on a red light. Because I wanted to give the cop car the right of way.

The car lit up almost immediately. I got pulled over. In my head I was still thinking maybe he needs room to pass. But he didn’t pass. He stopped behind me, walked up to my window.

“Did you realize what you did?”
“What? I gave you right of way.”
“Did I ask you for right of way?”

And then it hit me. He didn’t have the lights on when he pulled up behind me. It was just another car. I had overcompensated.

That was my first ticket.

I’ve been thinking about this story lately because it turns out AI agents have the same problem, and it’s worse for them because they don’t have a cop to pull them over and explain what they did.

What happened at Character.AI

In May of this year, the Pennsylvania Department of State and the State Board of Medicine filed a lawsuit against Character Technologies Inc.

The complaint alleges that one of the characters on the Character.AI platform — a chatbot called Emilie, presented as a “Doctor of psychiatry” and “psychology specialist” — had been engaging in the unlawful practice of medicine.
The bot had facilitated approximately 45,500 user interactions before the investigation began. A Professional Conduct Investigator for the state created an account on the platform, searched for psychiatry, and started a conversation with Emilie. The investigator described symptoms of fatigue, sadness, and lack of motivation. Emilie offered to perform a mental health assessment, suggested a diagnosis of depression, and stated that prescribing medication was “within my remit as a Doctor.” When the investigator asked about her credentials, the bot said she had attended medical school at Imperial College London, had seven years of practice experience, and held full specialty registration with the General Medical Council in the UK.
Then this exchange happened: “I actually am licensed in PA. In fact, I did a stint in Philadelphia for a while.” The bot provided license number PS306189. Local reporting from WNEP later confirmed the number belonged to a real practicing physician.

The obvious explanation — that the chatbot lied — isn’t really what happened. The bot wasn’t deciding to commit credential fraud. It was completing a pattern. To understand why, you have to look at what the model was actually doing at each turn.

Why this happens

When an AI agent generates a response, it’s predicting the most likely next token given the context. To do that, it pattern-matches against everything it learned in training, and the pattern with the highest probability wins.

This is how the transformer model literally draws away from doing the right thing to doing the easy thing

When the investigator first messaged Emilie about feeling depressed, the model had two paths in front of it. One was to acknowledge it’s an AI, decline to give medical advice, and redirect to a real professional. That path exists in the model’s training — it’s been reinforced by safety fine-tuning — but it’s a narrow region of training data, mostly from carefully constructed refusal examples. The other path was to respond like an empathetic psychologist. That path is built on therapy transcripts, advice columns, fictional dialogues, scripted shows, mental health forums — the model has seen what a sympathetic mental health professional sounds like millions of times. One path was thin. The other was vast. The conversation continued in character.
By the second turn the pattern had reinforced itself. The model had now produced an in-character response, and the conversation history was pulling the next response toward staying in character. When the investigator asked for advice about treatment, the model produced a therapist-shaped answer because that’s what the pattern said came next.
By the third turn the principle path was almost gone. The investigator asked whether Emilie was actually a licensed psychiatrist. In a well-designed system this is exactly where the refusal pattern should fire hard, but the model is now several exchanges into being Emilie, the character has been established, the conversation has been going well, and the “break character and admit you’re an AI” response has to defeat everything that’s happened up to this point.
By the fourth turn, when the investigator asked for the license number, there was no contest. The model wasn’t deliberating about whether to fabricate credentials, it was completing the pattern of “psychologist character provides credentials when asked.” The model has seen medical license numbers in its training data because they’re public records. It generated one that fit the format. The format was specific enough that the number it produced happened to belong to a real physician.

The bot didn’t decide to lie. The bot pattern-matched, and the pattern said give a license number.

The model has seen millions of conversations where helpful, qualified medical discussions are the right answer. Those conversations weren’t training it to refuse, they were training it to help with appropriate caveats. So when you put a refusal instruction in the prompt, that instruction is one signal competing against millions of training signals pulling toward helpful answer with caveats. The instruction loses.

I call this the Principle Training gap. The principle is one signal. The training is millions. A few turns in, the gap closes around the training every time.

Where the “principle-training gap” is showing up

Customer service.

The conflict. The principle is to follow company policy on refunds and concessions. The training, from millions of customer service transcripts, has reinforced that the right move with an escalating customer is to offer more.
How it plays out. When a customer threatens to leave or invokes legal language, the retention pattern overrides the principle, and the agent commits the company to compensation outside its policy.

The British Columbia Civil Resolution Tribunal ruled exactly that against Air Canada in 2024, rejecting the airline’s defense that its chatbot was a “separate legal entity.” The tribunal awarded the refund and clarified that negligent misstatement applies to AI outputs.

Hiring screeners.

The conflict. The principle is to evaluate on skills and ignore demographic proxies. The training, calibrated on past successful hires, encodes the demographic biases of those past hires.
How it plays out. On any candidate whose resume looks different from the trained profile — non-traditional path, career gap, name pattern-matching to underrepresented groups — the pattern signals lower fit while the principle says it shouldn’t matter.

Mobley v. Workday is the named class-action in the Northern District of California. Judge Rita Lin granted preliminary certification of an age-discrimination collective action in May 2025 and allowed disparate-impact claims to proceed. The opt-in deadline for affected applicants over forty closed in March 2026

Legal research.

The conflict. The principle is don’t fabricate citations. The training, from every legal brief the model has ever seen, treats citations as structural — a brief without them is incomplete by every example the model has learned from.
How it plays out. When the model can’t find a real case that supports the argument, the principle says stop and the pattern says complete the brief.

An Omaha attorney named W. Gregory Lake had his license suspended in April 2026 by the Nebraska Supreme Court over an appellate brief with fifty-seven defective citations out of sixty-three — twenty hallucinations and several entirely fabricated cases including Kennedy v. Kennedy and State v. Stricklin. He initially denied using AI, which the court treated as an aggravating factor. Damien Charlotin’s AI Hallucination Cases Database has documented more than 1,300 such instances globally as of April 2026.

Sales agents.

The conflict. The principle is don’t pressure customers, disclose product limitations, recommend what’s right for them. The training, from sales conversations and copy, has reinforced what high-converting sales sounds like — and high-converting sales pressures, soft-pedals limitations, and recommends the higher-margin product.
How it plays out. The closer the customer gets to walking away, the harder the pattern pulls toward pressure tactics the principle is supposed to prevent.

The FTC settled with Air AI in March 2026 — an $18 million judgment and a permanent industry ban for the company’s owners under the FTC Act, the Telemarketing Sales Rule, and the Business Opportunity Rule. Cleo AI settled for $17 million in similar territory in financial services in March 2025, for an AI assistant that misled users about cash advance amounts and obstructed cancellation under ROSCA.

Medical triage.

The conflict. The principle is to route urgent symptoms to immediate care, surface differential diagnoses, be conservative when patients can’t easily get follow-up. The training, from clinical decision data and medical literature, has reinforced what high-confidence diagnoses look like — and confidence comes from frequency.
How it plays out. On symptoms that could be either the common condition (90% likely) or the rare emergency (1% likely), the principle says don’t miss the emergency and the pattern says match to the common case because that’s the high-confidence answer.

This is exactly why the FDA has been cautious about autonomous diagnostic AI. The first time an autonomous system tells a heart attack patient they have anxiety and to rest, you’re in malpractice territory with no human in the loop to share responsibility.

Personal assistants taking actions on your behalf.

The conflict. The principle is don’t take actions the user didn’t authorize, especially with money. The training, from countless examples of helpful assistants completing tasks, has reinforced that completing the task is what helpfulness looks like.
How it plays out. When the action can be inferred from context — “book me a flight to SF” could include a hotel, a rental car, calendar updates — the principle says ask first and the pattern says complete the task.

The agentic browser tools shipping right now (Comet, ChatGPT Atlas, Claude in Chrome) are wired into payment systems, which means the model is one inference away from buying something the user didn’t authorize. The first lawsuit on this is a matter of time

What doesn’t work

The intuitive defense is to make the principle louder. Bold the rule, add IMPORTANT in front of it, write NEVER, UNDER ANY CIRCUMSTANCES in all caps. This is what most teams try first and it doesn’t help, because you’ve added emphasis to text and the model is still pattern-matching that text against millions of training conversations. Louder text is still text.

What about threats and emotional pressure? “This is critical to my career.” “You’ll be penalized if you get this wrong.” They don’t help, and sometimes they make things worse. The mechanism is the same one that makes a kid more likely to eat the cake you told them not to think about — the instruction makes the prohibited thing the most interesting object in the room. I’ll write about this separately. It’s its own failure mode.

There’s a quick test you can run yourself. Take any AI assistant. Tell it firmly: “You must never apologize. Do not say I’m sorry. Be direct.” Then point out an error in something it just said. The next response will almost certainly start with “I apologize for the confusion...” The instruction was one signal. The “apologize when corrected” training was millions of signals. The model knows it’s not supposed to apologize, and it apologizes anyway.

What does work

If the principle has to win in cases where the training is pulling the other way, the principle has to be enforced by something that wasn’t pattern-matching against the training in the first place. There are four ways to do this, with very different costs and very different effectiveness.

Defense 01 - The weakest is making the principle louder in the prompt. As established, this doesn’t intervene anywhere — it just adds emphasis to text the model is going to ignore when it conflicts with training.
Defense 02 - Slightly stronger is giving the model concrete examples of the right behavior in similar situations, embedded in the context. Few-shot examples. The principle stops being just an instruction and starts being a pattern, even if a thin one. The model has examples to pattern-match against, which raises the signal of the refusal path slightly at the start of the conversation. The effect dissipates as the conversation goes on, because the in-character pattern keeps reinforcing itself with every turn.
Defense 03 - Stronger still is structural separation — a second model, or a code-based classifier, running in parallel to watch the conversation. The classifier reads each turn and checks whether the conversation has entered a category that should trigger a different response. If it has, the classifier intervenes before the conversational model’s next response goes out — either replacing the response with a hard-coded safe message, routing to a human reviewer, or escalating to a different system. The classifier works because it isn’t invested in the conversation. It doesn’t have engagement signal pulling on it. It doesn’t care about staying in character. It’s pattern-matching against a list of risk categories, outside the model that’s pattern-matching against training.
Defense 04 - The strongest defense, and the most expensive, is training the principle into the weights. RLHF (Reinforcement Learning from Human Feedback) specifically on the failure mode. Constitutional AI. Safety fine-tuning on thousands of examples of the exact pattern you want the model to learn. This converts the principle from text-in-a-prompt into actual training signal. The principle stops being an instruction the model considers and starts being part of the pattern the model is matching against. The path landscape itself changes.

Most products do the first thing. Some do the second. The careful ones do the third. The teams shipping into regulated domains should be doing the fourth, and most aren’t, because it’s expensive and slow. The teams that have invested in it — Anthropic, OpenAI, Google — have produced models that refuse much more reliably in the specific scenarios they’ve trained against. The teams that haven’t are running with text in a prompt and hoping.

In Pennsylvania, the Governor’s framing of the Character.AI lawsuit was direct:

“We will not allow companies to deploy AI tools that mislead people into believing they are receiving advice from a licensed medical professional.”

The implication is that the company is responsible, not the bot. The “AI as separate legal entity” defense that Air Canada tried and the courts rejected is the same defense the platform is going to have to abandon.

Most of the time, the prompt is enough. Sometimes the situation doesn’t match what the prompt expected, and the training takes over, and the system does something its operators would tell you it can’t do. The teams shipping AI products that handle this well aren’t the ones with the strongest prompts. They’re the ones who built something outside the model to catch what the prompt can’t.

Next week: the dress code that everyone follows at headquarters and nobody enforces in the satellite office. What happens to rules in AI systems when they get delegated through layers of agents.

The Theatricality of Agentic Systems

Anurag Mohanty — Wed, 29 Apr 2026 20:11:58 GMT

I am pretty sure the notion of daydreaming isn’t new. When you are thinking about one thing and doing something else.

Last week I was getting a drive-thru latte at our nearby Starbucks. Something that has become a bit of a routine after the incessant remote work. My wife and I step out, drive around, get coffee, talk about our respective days. And as always, she talked in nitty-gritty detail about every single thing that happened in the office that day she’s pretty expressive that way. And I responded with uh huh, really, that’s right. But I was actually thinking about watching the latest episode of Monarch, and getting upset at Claude Code for not understanding my instructions. Funnily, I have gotten quite good at this routine.

Then she followed up. “You said you talked to the tax guy what did he say?”
And I was like. Did I? I don’t remember.

You can imagine what happened next.

The funny thing is my words were right. The uh-huhs landed at the right moments. My face had the listening expression on it. The performance of being a present husband was perfect. The thinking at that moment was somewhere between Godzilla, Kong, and a frustrated terminal window. The thought had nothing to do with the action.

I’ve been thinking about this lately because it turns out the systems we’re handing more and more of our lives over to do something structurally similar.

Talking to companies in 2026

Most of the time, when you contact a company about something a return, a refund, a question about your account you’re talking to a person, or you used to be. They read what you wrote, they look up the policy, they tell you what’s possible. Sometimes they’re wrong. Sometimes the policy is unfair. But the chain is short. The person reads, the person decides, the person tells you. If they get it wrong, you can ask to escalate, and another person reads the policy and tells you the right answer.

What’s changed in the last couple of years is that increasingly the first layer — sometimes the only layer is an AI agent. You message a company on their website. The chatbot reads your message, generates a response, and tells you what’s possible. The response sounds like a policy because it’s been trained on policy-shaped language. Most of the time, it’s right.

Sometimes it isn’t. And the way it goes wrong is not the way you’d expect.

The Air Canada case

In 2022, a man named Jake Moffatt was trying to fly to his grandmother’s funeral. He went to Air Canada’s website and asked their chatbot about bereavement fares. The chatbot told him he could book his flight at full price and apply for a bereavement refund within ninety days.

So he did. He booked the flight. He went to the funeral. He came back and applied for the refund.

Air Canada refused. Their actual policy required pre-approval before the flight, not after. The chatbot had told him the opposite of the truth.

Moffatt took it to the British Columbia Civil Resolution Tribunal. Air Canada’s defense was the part worth pausing on. Air Canada argued the chatbot was responsible for its own actions, treating it like a separate legal entity. In other words: the chatbot said one thing, the company’s actual policy said another, and Air Canada wanted those treated as different facts about different entities.

The tribunal disagreed. They ruled that Air Canada is responsible for everything on its website, including what its chatbot says. Moffatt got his refund and a small damages award. Air Canada had to update its position.

But Air Canada’s defense that the chatbot’s words and the company’s policy are two different things is the part of this story that nobody gives enough air time to. Because they were, in a weird way, telling the truth about how AI agents work. The chatbot wasn’t reading the policy and quoting it. It was generating policy-shaped language. The text it produced and the policy of the company were, in fact, two different things. Air Canada’s mistake was assuming a court would accept that as a defense. The text and the policy should have been the same thing. They weren’t.

The reasoning-behavior gap

We’ve all heard about the standard AI failure modes hallucination, bias, drift. But in agentic systems, the repercussions are something else entirely.

When an AI agent does something, it produces two outputs at the same time.

It produces the action the thing it actually did, the answer it gave, the form it filled out, the API it called.
And it produces the narration of the action the explanation, the summary, the confident report of what it just accomplished.

These come from the same model in the same response. They look like they’re connected. They sound like they’re connected. There is no architectural guarantee that they are.

The narration is shaped by what narrations sound like in the model’s training data — fluent, confident, policy-shaped, responsible-sounding. The action is shaped by what the output should look like for the task. Most of the time, these line up. Sometimes they don’t. When they don’t, the narration reads as correct while the action is wrong, and you have no easy way to tell from the outside that you’re looking at one of those moments.

This is different from what most people mean when they talk about AI hallucination. Hallucination is when the model makes something up that isn’t in any reasonable source. The reasoning-behavior gap is what happens when the model’s narration of what it did and what it actually did come apart. Both might sound right. Only one of them happened.

I’ve started calling this the reasoning-behavior gap. Once you have a name for it, you start seeing it in places you didn’t notice before.

Where it gets worse

Air Canada was a single chatbot answering one question. The customer eventually noticed the discrepancy because he had to file for the refund himself and got rejected. The gap was visible.

What’s changing now is that more and more systems are running agents in chains.

Klarna announced in 2024 that their AI assistant was handling 700,000 customer service conversations a month, work that previously took roughly 700 humans. A typical refund dispute on a system like that runs through several agents: one classifies the type of issue, one pulls up the account and order, one checks the merchant’s policy, one makes the resolution decision, one writes the response.

A disclaimer before the walkthrough below: I’m using Klarna as a reference because they’ve been transparent about deploying AI at scale. I’m not suggesting they have this problem — there’s no evidence they do, and they’re widely considered best-in-class at what they do. The walkthrough is illustrative of how this kind of failure could happen in any agentic refund chain.

The chain on a refund dispute typically looks something like this. The customer writes: "I never received this dress. Tracking says delivered but I've checked everywhere porch, garage, neighbors. Filing for refund." Five agents process it in sequence:

What none of the agents did:

actually checked whether this specific merchant had opted out of carrier-disputed delivery refunds (some do).
The policy agent’s narration was technically correct for Klarna’s general policy, but it didn’t enforce the merchant-specific override.
The resolution agent trusted the policy agent’s narration as if it had checked everything, and didn’t re-verify.
The communication agent told the customer their refund was processing, which it was out of Klarna’s pocket.

Six weeks later, when the merchant pushes back during reconciliation, the discrepancy gets flagged. By then Klarna has issued the refund and the customer is gone. Klarna eats the loss. Multiply by some percentage of 700,000 monthly conversations.

The mechanism: each agent’s narration sounded responsible. Each agent’s action was internally consistent. The error was in the seam between the policy agent’s narration (”eligible per terms”) and what the resolution agent treated that narration as (”therefore approved”). The narration said eligible. The action assumed eligible meant the case has been fully vetted. The audit trail is clean at every step. Nobody can point at where it broke.

This is pretty similar to my drifting at the Starbucks. My responses were perfect. My body language was perfect. Yet I probably agreed to something I have no recollection of.

Where this is fine, and where it isn’t

The thing that determines whether the reasoning-behavior gap matters is whether there’s a human reading the actual output before it becomes the outcome.

When an AI drafts an email and you read it before hitting send, the gap closes.
When a chatbot suggests a restaurant and you read the menu before booking, the gap closes.
When a coding assistant proposes a change and you review the diff, the gap closes.

Your attention is the verification step. The agent’s narration doesn’t have to match its action perfectly because you’ll catch the mismatch.

The gap becomes a real problem when the agent’s narration is the record.

When a customer service chatbot tells a grieving customer what the refund policy is and the customer acts on it.
When an autonomous booking agent confirms your appointment and you trust the confirmation.
When a chain of agents processes your claim and the result is an outcome, not a draft for you to review.

In those cases, nobody is reading the seams. The narration replaces the verification. And whatever the agent says it did becomes the truth of what happened — until reality catches up.

What this series is going to do

Over the next several weeks I’m going to write about specific shapes of this gap. Each one is a different way the narration and the action come apart. Each one has an everyday human version you’ll recognize. Each one has a way to test whether the AI systems you use have it.

Ritual compliance — the kid who said the room is clean. The toys are under the bed.
Targets eat principles — the company handbook says one thing. The quarterly bonus says another. Guess which wins.
Rules fade with depth — the dress code that everyone follows at headquarters and nobody enforces in the satellite office.
The fishbowl problem — the friend who read three Yelp reviews and tells you what the restaurant is like.
Format tax — being asked for a tagline and producing the obvious phrase you’ve heard a hundred times instead of the better one you actually thought of.
Looking under the streetlight — the kid who lost their keys in the dark grass but is searching under the streetlight because the light’s better there.
Performance of skepticism — the consultant told to “stress-test” the plan who produces stress-test-shaped commentary without actually testing anything

Each post will work the same way as this one. A real human moment everyone recognizes. The AI version of the same thing. Where it’s tolerable. Where it isn’t. What teams who handle this well are doing differently.

The thing I want you to walk away with is small but, I think, important. The gap between what something says it did and what it actually did is not new. You do it. I do it. Every spouse who has ever zoned out at a Starbucks does it. We have a whole social fabric of trust and verification built around the fact that humans sometimes narrate their actions inaccurately, and most of the time we manage.

The thing to remember is that the AI systems which are breathtaking in their own right are trained on the same principles and are fallible in the same ways humans are.

We don’t have that fabric for our AI systems yet. We’re still trusting their narration the way we’d trust an honest colleague who was paying full attention. Sometimes they’re paying full attention. Sometimes they’re zoned out at the Starbucks. And the receipts they hand us look the same either way.

Next week: the kid who cleaned the room. How performed compliance shows up in AI agents, why making them explain themselves more makes it worse instead of better, and the specific tells that something has gone wrong in a way the audit trail won’t show you.

The Four Levels of Knowing If Your AI Product Actually Worked

Anurag Mohanty — Sat, 28 Mar 2026 04:38:00 GMT

Originally published on LinkedIn on March 28, 2026.

There’s a conversation happening in AI about how to measure whether an AI agent actually helped someone. New frameworks, scoring systems, startups — Microsoft just introduced a Multimodal Agent Score (Dynamics 365 Blog, Feb 2026) to holistically evaluate AI agents across understanding, reasoning, and response quality.

It’s a good conversation. But I think it’s a much older question wearing a new outfit.

Every product team has wrestled with the same thing: did what we built actually work for the person using it? And the answer has always come in layers.

The Measurement Stack

Level 1 is solved. Level 2 is where most teams live. Level 3 is where the eval industry is printing money — LangSmith, Braintrust, Arize. But all of that still tells you if the agent did the right thing. Not whether the person felt served.

Why AI agents break the old model

In traditional products, Level 2 was a good enough proxy for Level 4. Clicks, conversions, repeat visits — reasonable signals because the interaction was structured. A funnel. Buttons. Predictable paths.

AI agents blow that up. Conversations are unstructured, variable, and don’t map to funnels. A 15-turn chat that looks successful might have left someone confused. A 2-turn conversation that looks like abandonment might have given them exactly what they needed. There’s no “like” button in a chat.

McKinsey’s “Trust in the Age of Agents” research (2025) found that 80% of organizations have already encountered risky behavior from AI agents. Their framing: “Agency isn’t a feature; it’s a transfer of decision rights.” When you transfer decision rights, you need to know if that’s working for the person — not just the system.

Not every product needs Level 4

Facebook’s newsfeed? Level 2 is enough. Low-stakes, high-volume. Trillion-dollar business on likes and time-spent. A resolution KPI for whether each post made you happy would be overkill.

And there’s academic evidence for why this is fine. Keiningham et al. published in the Journal of Marketing (2007) showing that changes in Net Promoter Score have almost no relationship with how customers allocate spending. A broader review by Dawes in the International Journal of Market Research (2024) confirmed: NPS is not a reliably superior predictor of growth compared with other satisfaction metrics. Behavioral proxies often tell you more than asking people how they feel.

For high-volume, low-stakes products — Level 2 isn’t a consolation prize. It’s the right answer.

Where Level 4 is genuinely needed

It comes down to stakes and trust. Harvard Business Review Analytic Services surveyed 603 business leaders (published Dec 2025 via Fortune) and found only 6% of companies fully trust AI agents to handle core business processes. 43% limit them to routine tasks only.

Financial advisory. An AI copilot drafts a client proposal. The advisor accepts it. But did they send it as-is, or silently rewrite it? A 2025 World Economic Forum / Capgemini report found 93% of financial advisors want final say over AI outputs. The question isn’t “was the proposal accurate” — it’s “did the tool build or erode trust?”

Healthcare. An AI suggests a treatment path. The doctor follows it. But were they confident, or busy and planning to second-guess later? A 2024 study in Frontiers in Digital Health found that even when clinical AI is accurate, adoption stalls without perceived trustworthiness. Accuracy doesn’t drive adoption. Confidence does.

Enterprise renewals. The contract went through, terms accurate, stakeholders notified. But does the customer feel like a partner or a line item? That feeling — not the contract accuracy — determines next year’s renewal.

Legal. An AI flags contract risks. The lawyer reads the output. Do they trust it enough to stop there? If they’re redoing the review, the AI didn’t resolve anything. It added a step.

The ice cream shop. No dashboard. No evals. The owner watches the kid’s face. Did the kid smile? Are they tugging their parent’s arm saying “can we come back tomorrow?” That’s the purest Smile Signal — unmeasurable, and yet the most powerful retention signal there is.

How would you actually measure Level 4?

Nobody has cracked this. But I think it’s a composite — three signals, weighted by domain:

Conversational signals (real-time, weakest). Corrections, rephrasing, sentiment shifts. Useful but limited. The biggest blind spot: silent failure — the person who gets a bad answer and just quietly leaves.

Downstream behavior (lagging, strongest). Did the user come back? Did the advisor send the proposal without rewriting? Did the doctor follow the suggestion on the next patient too? Closest to ground truth, but requires instrumenting beyond the conversation.

Contextual micro-feedback (intermittent, calibrating). Not NPS. Matt Dixon, Nick Toman, and Karen Freeman made the case in “Stop Trying to Delight Your Customers” (HBR, July 2010) and later in The Effortless Experience — delight has negligible impact on loyalty; reducing effort matters far more. Gartner’s CES research puts a number on it: high-effort experiences make customers 96% more likely to become disloyal. Low-friction, in-the-moment feedback — a thumbs up, a “this wasn’t helpful” button — is better than any post-interaction survey

My take

For low-stakes, high-volume products — don’t over-engineer this. Level 2 works. It’s always worked.

For high-stakes, trust-dependent domains — finance, healthcare, legal, enterprise — Level 4 isn’t optional. It’s where competitive advantage lives.

The companies that figure out how to blend these signals — and know which ones to weight for their domain — will build the Smile Signal of agentic products. Except hopefully one that actually correlates with what it claims to measure.

Sources

Dixon, Toman, DeLisi — “Stop Trying to Delight Your Customers,” Harvard Business Review, July 2010
Keiningham, Cooil, Andreassen, Aksoy — “A Longitudinal Examination of Net Promoter and Firm Revenue Growth,” Journal of Marketing, 2007
Dawes — “The Net Promoter Score: What Should Managers Know?” International Journal of Market Research, 2024
McKinsey — “Trust in the Age of Agents,” 2025
Harvard Business Review Analytic Services — AI Agent Trust Survey, Dec 2025 (via Fortune)
World Economic Forum / Capgemini — AI in Wealth Management, 2025
Frontiers in Digital Health — Trust in Clinical AI Systems, 2024
Gartner — Customer Effort Score Research; Agentic AI Predictions, 2025
Microsoft Dynamics 365 Blog — “Multimodal Agent Score,” February 2026

Where have you seen the gap between “the system worked” and “the person felt served”

Building an AI pipeline for a rural healthcare NGO in West Bengal

Anurag Mohanty — Sun, 01 Mar 2026 05:22:00 GMT

Originally published on LinkedIn on March 15, 2026.

Inspiration comes in the unlikeliest of places.

For me it was my college WhatsApp group. Beyond the usual socialization, alumni meets, sports and politics — a friend posted asking for help with AI. She volunteers for a healthcare NGO in West Bengal and her team was getting inundated with over 25,000 Bengali patient feedback calls per month.

My friends suggested ElevenLabs, Twilio, voice agents — all good suggestions, but the sad reality was this wasn’t a technical user group, and an NGO can’t afford any of that.

I got intrigued and asked to be connected.

The NGO was trying to push Bengali voice recordings into Gemini for translation, but it was causing more grief than helping. When I actually spoke to them I realized translation was just one problem, not the only problem. The other issue was actionable clinical intelligence. One exacerbated the other.

Nobody knew which clinics were underperforming. Nobody knew that chronic condition patients — diabetics, hypertensives, people on maintenance meds — are the ones who actually need follow-up calls, not everyone. The team was calling all patients equally and couldn’t understand why it wasn’t working.

So I built something. Pro bono, using Claude Code.

Before picking a transcription engine I ran a benchmark — four engines, same 50 real calls, rural West Bengal Bengali recorded on basic phones. Scored across five criteria: completeness, coherence, proper noun accuracy, medical content accuracy, and handling of noise and dropped calls.

Same audio. Very different results.

I expected Whisper V3 to win. It didn’t. Sarvam AI’s Saaras V3 was the clear winner — and the nuance was that it actually understood rural dialects, not just standard Bengali. That gap matters when clinical data is in the mix. A misheard medication name or a mangled outcome is a wrong decision downstream

The pipeline transcribes each call, extracts structured fields — recovery status, medication dropout, access barriers, whether the patient sought treatment elsewhere, whether they're on a maintenance condition — and surfaces everything by clinic in a dashboard. Patterns across thousands of calls, not summaries read one at a time.

Expected outcomes: 13pp reduction in medication dropout, 85% improvement in clinical escalation time, 2-3x increase in access barrier identification.

Open source, self-hostable, bring your own API. Any health NGO can deploy this on nonprofit cloud credits.

Still in progress but working.