You are paying a rather remarkable sum of money to let a very confident guesser make your decisions. Here is why that should concern you.

The sales pitch is hard to resist: “Our AI-powered platform will transform your operations, cut costs by 40%, and unlock insights you never knew existed.” It sounds wonderful. It sounds like the future. And for most businesses, it sits somewhere between misleading and actively dangerous.

I am not here to tell you that AI is useless. Clearly it is not. But the current wave of enthusiasm around large language models has businesses making serious operational decisions without a reasonable understanding of what these systems actually are.

You are building on quicksand

Large language models are probabilistic. They do not know things and they do not reason through problems — they produce statistically likely sequences of text based on patterns absorbed during training. This is not a technical footnote; this is the one and only thing that they do.

When you ask an LLM to analyse your sales data and recommend a pricing strategy, it is not performing analysis. It is producing output that pattern-matches to what “a pricing strategy recommendation” tends to look like in its training data.
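To make that concrete, here is a deliberately toy sketch of the generation loop. The vocabulary and the probabilities are invented for illustration; a real model draws from tens of thousands of tokens, but the mechanism is the same weighted sampling.

```python
import random

# Toy illustration, not a real model: generation is a weighted draw of the
# next token, conditioned on the text so far. The vocabulary and the
# probabilities below are invented for this example.
next_token_probs = {
    "increase": 0.46,  # the statistically safest continuation wins most often
    "hold": 0.31,
    "decrease": 0.18,
    "audit": 0.05,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token, weighted by probability. No reasoning is involved."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "Recommended pricing action:"
print(prompt, sample_next_token(next_token_probs))
```

Run it a few times and the “recommendation” changes. Nothing in that loop knows what your product costs or what a margin is.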

One might fairly object that humans do something similar — we too are taught, and we too fall back on cultural defaults. True enough. But a human analyst has personal preferences, opinions shaped by experience, knowledge absorbed through dozens of invisible learning channels: not only documents read, but conversations had, expressions noticed while presenting last quarter’s numbers, the uneasy feeling from a client meeting that went sideways. These bits of feedback get incorporated unconsciously.

An LLM has none of that. Its output converges on the generic — the statistical average of everything it was trained on. For low-stakes tasks such as drafting an email or brainstorming, this is perfectly fine. But the moment you route business decisions through such a system, you have introduced a source of randomness that you cannot fully control or predict — and the output you receive will be, almost by definition, the most average possible version of whatever you asked for.

Yesterday’s data, tomorrow’s decisions

An LLM’s knowledge is frozen in time — its training data comes from the past, while the business decisions you are asking it to inform are about the future. In ordinary times, this is manageable. The past and the future are related, and a system that has absorbed yesterday’s patterns can say something useful about tomorrow.

The trouble is that this relationship weakens precisely when you need it most — when the business landscape becomes unpredictable.

Consider a manufacturing business whose next aluminium shipment is due in six months. For the past decade, that supply chain has been boringly stable; the factors influencing price were a more or less fixed set of variables. A model trained on that decade has absorbed those patterns thoroughly, and in calm weather its forecast would be perfectly serviceable. But the probability weight that ought to sit on “geopolitical disruption” was, until quite recently, a marginal footnote — and for many models, not a footnote at all.

This is the compounding problem of probabilistic systems trained on historical data: they are most confident precisely when conditions resemble the past, and they are least able to warn you when conditions do not. The world does not send advance notice when the relationship between training and reality is about to break — and it is precisely in those moments that the decisions you are making matter most.

The black box you cannot debug

Traditional software aims to be deterministic. When something goes wrong, you can trace the logic, find the flaw, and fix it. The problem is knowable.

LLMs do not work this way. The same prompt with the same data can give you a correct answer today and a subtly wrong one tomorrow. When it is wrong, you typically cannot explain why. The “reasoning” is distributed across billions of parameters that no human can inspect. You debug by tweaking prompts, adding examples, and hoping it behaves differently next time. “Hope” is not an engineering methodology.
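The nearest thing you have to a test harness is statistical. A minimal sketch, with `call_model` as a hypothetical stand-in for whatever API you actually use:

```python
import random

# A sketch of the only "debugging" loop available: re-run the identical
# prompt and measure how often the answer changes. `call_model` is a
# hypothetical stand-in for a real LLM API.
def call_model(prompt: str) -> str:
    # Simulated non-determinism; a real model sampling at temperature > 0
    # looks much the same from your side of the API.
    return random.choice(["Answer A", "Answer A", "Answer A", "Answer B"])

def agreement_rate(prompt: str, runs: int = 50) -> float:
    """Fraction of runs that return the most common answer."""
    outputs = [call_model(prompt) for _ in range(runs)]
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / runs

print(f"agreement: {agreement_rate('Summarise our Q3 contract risks'):.0%}")
```

An agreement rate is useful telemetry. Notice, though, what it is not: a guarantee, an explanation, or a fix.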

For industries that care about auditability and compliance — finance, healthcare, legal, energy, government — this is not merely inconvenient. It is a liability. “The computer said so” is not an answer a regulator will accept.

Hallucinations are not a bug — they are the product

A common misconception is that hallucinations — cases where the model confidently generates fabricated information — are temporary, a flaw that the next version will resolve. They are not. Hallucination is inherent to how these models produce text. Sometimes the most probable-sounding output is simply wrong.

The consequences are already documented. In February 2024, the British Columbia Civil Resolution Tribunal ruled against Air Canada after its chatbot incorrectly told a passenger he could apply retroactively for a bereavement discount. When the passenger tried, Air Canada refused — and then argued in court that its chatbot was “a separate legal entity responsible for its own actions.” The tribunal disagreed.1 In 2023, attorneys in Mata v. Avianca submitted briefs containing at least six entirely fictitious case citations generated by ChatGPT, complete with fabricated quotes and judicial opinions. They were sanctioned and fined $5,000.2

And here is where it becomes genuinely concerning. Because the output looks right, it tends to receive less scrutiny. A human reviewer naturally spends less effort picking apart text that reads like competent professional work. When a hallucination slips into a business decision, the damage is rarely immediately visible. A subtly wrong legal interpretation, a marginally off financial projection — these things do not announce themselves. The errors compound until somebody notices, and by then you have been operating on bad information for weeks.

Your data is not staying where you think it is

Unless you are on an enterprise-tier plan, the text you send to an LLM may be used to train future versions of the model. Every prompt, every pasted document, every customer name or contract detail is, in principle, heading into a training dataset you have no control over.

Cyberhaven’s 2026 AI Adoption & Risk Report found that 39.7% of all enterprise AI interactions involve sensitive data — including prompts, copy-paste actions, and file uploads.3 Cisco’s 2024 Data Privacy Benchmark Study found that 48% of respondents admitted entering non-public company information into GenAI tools.4 A GDPR fine that is a rounding error for a multinational can end a 20-person company. Even if you are careful, you are one careless paste away from a problem you cannot undo.

The talent trap

Expertise atrophies the moment you hand the work to an AI — and the consequences run deeper than your current team becoming rusty.

In August 2025, The Lancet Gastroenterology & Hepatology published a study examining 1,443 colonoscopies performed by experienced endoscopists. After a period of routine AI assistance, adenoma detection rates in procedures performed without the AI dropped from 28.4% to 22.4% — a 20% relative decline.5 And deskilling is only half of the trap; dependency is the other. In May 2019, a cyberattack took down Wolters Kluwer’s CCH tax software for nearly a week. CNBC reported the attack “left many in the accounting world unable to work.” A sales representative emailed clients: “Many of you are awaiting guidance on what you should be doing with your staff today and unfortunately I do not have a good answer for this.”6

Now combine this with the opacity problem. With traditional software, the expertise needed to oversee the system is the same expertise needed to do the work manually. With LLMs, overseeing the AI requires a different kind of specialist: people who understand both your domain and the model’s behaviour. These people are rare. Lightcast’s 2025 analysis of 1.3 billion job postings found that positions requiring AI skills offer 28% higher salaries — nearly $18,000 more per year.7 The larger companies are already outbidding everyone else for them.

This is the trap. You deskill your existing team. You cannot hire the specialists needed to properly oversee the AI. And if you try to reverse course, rebuilding lost expertise takes years.

But we can fix it

At this point, you may reasonably push back. The pro-AI camp has real counter-arguments, and the strongest are worth acknowledging.

They will tell you that foundation models have commoditised, and the real value is what you build around them: retrieval-augmented generation, fine-tuning on proprietary data, workflow redesign that properly embeds the model in your operations. Fair point.

They will tell you that API costs per task have collapsed to pennies, and for high-volume standardised work — customer service, drafting, summarisation — the unit economics genuinely work. Also fair.

They will tell you that well-architected RAG systems can substantially reduce hallucination rates by grounding outputs in verifiable source documents. The evidence supports this: a 2025 study in JMIR Cancer found RAG reduced hallucinations from roughly 40% to near zero for cancer-related queries.8 Partly true — and worth being careful about what “grounded” actually means.

RAG does not fix hallucinations. It tethers the model to your existing corpus, so its output becomes statistically more likely to resemble your own documents. That is useful — but the tether is made of probability, not logic. You have narrowed the distribution of possible wrong answers; you have not eliminated wrong answers. The grounding, if anything, makes the failure mode harder to spot: a wrong answer now comes back dressed in the language of your own documents.
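To see exactly where the probability survives, here is a minimal sketch of the RAG pattern, under stated assumptions: `embed` and `generate` are hypothetical stand-ins for an embedding model and an LLM call. Only the retrieval step is deterministic.

```python
import numpy as np

# Minimal sketch of the RAG pattern. `embed` and `generate` are hypothetical
# stand-ins for an embedding model and an LLM API.
def embed(text: str) -> np.ndarray:
    # Toy embedding: a pseudo-random vector derived from the text.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def generate(prompt: str) -> str:
    return f"[model completion conditioned on {len(prompt)} chars]"  # stand-in

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query: str, documents: list[str], k: int = 2) -> str:
    query_vec = embed(query)
    # Deterministic part: rank your documents by similarity to the query.
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)),
                    reverse=True)
    context = "\n---\n".join(ranked[:k])
    # Probabilistic part: the retrieved context shifts the distribution of
    # likely outputs toward your corpus. It does not bind the model to it.
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")

print(rag_answer("What is our bereavement fare policy?",
                 ["Fare rules...", "Refund policy...", "Baggage policy..."]))
```

The final line is still a weighted draw over tokens; the retrieval step only changes which tokens are likely.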

But notice what each of these counter-arguments is actually saying. None of them disputes that a naked LLM is a bad tool for most business decisions. What they are saying is this: if you build the right architecture around it, govern it properly, invest in the data layer, maintain audit trails, train your people to oversee it, and continuously measure its error rates — then it works.

Which is another way of saying that the burden of making AI safe and useful falls entirely on the organisation adopting it. The vendor sells you transformation; you are the one who has to build the thing that actually delivers it.

There is an irony here. The same vendors who market their products as “PhD-level intelligence” quietly agree that you need extensive customisation and dedicated oversight to make the thing behave. These two stories cannot both be true. If the system were genuinely operating at PhD level, it would not need your engineers to hand-feed it context, guard its outputs, and keep its knowledge base in sync with the real world.

The $600 billion question

All of which would be perfectly reasonable if decisions about AI were being made in a calm, rational environment. They are not.

In June 2024, Sequoia Capital’s David Cahn calculated that the gap between AI infrastructure investment and actual AI revenue had grown from $125 billion to $600 billion in nine months.9 Goldman Sachs published a report titled “Gen AI: Too Much Spend, Too Little Benefit?” noting that roughly $1 trillion in projected AI capex “has little to show for it so far.”10 MIT economist Daron Acemoglu, who won the 2024 Nobel Prize, estimates AI will produce only a modest 0.5% productivity increase and roughly 0.9% GDP growth over the next decade.11

At a Yale CEO Summit in June 2025, 40% of the 150+ top executives present said AI hype had led to overinvestment and a correction was imminent.12 Goldman Sachs CEO David Solomon expects “a lot of capital deployed that doesn’t deliver returns.” Even Sam Altman has warned that “people will overinvest and lose money.”

Business owners read breathless headlines about productivity gains and fear being left behind. Boards ask “what is our AI strategy?” in a tone that implies the only wrong answer is “we do not have one.” FOMO is a terrible basis for a technology investment, and the louder the hype becomes, the more sceptical you should be about whether the decision you are making is yours.

So what should you do?

None of the above means you should ignore AI. It means that every downside I have described might be worth accepting — but only if you know why you are doing it.

Losing some expertise in a process that does not generate meaningful ROI? Probably fine. Accepting a measured hallucination rate in a low-stakes content pipeline? Reasonable, provided you know the stakes. Using an LLM for first-pass triage on something a human reviews anyway? Sensible. The problem is not using LLMs. The problem is using them without understanding the tradeoffs — or worse, without even knowing there are tradeoffs to understand.

So start with the problem, not the technology. You might find that your needs are better addressed by an employee training programme or another spreadsheet. Not as exciting as “AI-powered transformation” — but predictable, auditable, and actually solving the problem in front of you.

If an LLM genuinely fits after honest evaluation, treat it like any other business risk. Build monitoring and human oversight into the pipeline. Know what it costs you when the model gets it wrong, and decide consciously whether that cost is acceptable.
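As a sketch rather than a prescription, that oversight can be as unglamorous as an explicit gate in front of every model output. The thresholds and the review queue below are illustrative assumptions.

```python
from dataclasses import dataclass

# A sketch of "treat it like any other business risk": every model output
# passes an explicit gate before anyone acts on it. The thresholds and the
# review queue are illustrative assumptions, not a prescription.
@dataclass
class ModelOutput:
    text: str
    stakes_gbp: float  # what a wrong answer would cost you
    agreement: float   # output stability across repeated runs, 0..1

review_queue: list[ModelOutput] = []

def gate(out: ModelOutput, max_auto_stakes: float = 500.0,
         min_agreement: float = 0.9) -> str:
    # Cheap to be wrong and stable output: act automatically, but log it.
    if out.stakes_gbp <= max_auto_stakes and out.agreement >= min_agreement:
        return "auto-approve"
    # Everything else waits for a human with domain expertise.
    review_queue.append(out)
    return "human-review"

print(gate(ModelOutput("Issue £30 goodwill voucher", 30.0, 0.96)))
print(gate(ModelOutput("Reprice aluminium contract", 250_000.0, 0.97)))
```

The point of writing it down this way is that the numbers force the conversation: someone has to decide, explicitly, what a wrong answer costs and how much instability is tolerable.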

And be deeply sceptical of anyone who tells you that AI is the solution before they have even begun to understand your problem. The most expensive technology investment a business can make is the one that solves the wrong problem with great confidence.


References

  1. Moffatt v. Air Canada, 2024 BCCRT 149. The tribunal ruled that companies are liable for AI chatbot misrepresentations. Covered by the American Bar Association, February 2024. 

  2. Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y. 2023). Judge P. Kevin Castel sanctioned attorneys Steven Schwartz and Peter LoDuca for submitting fabricated case citations generated by ChatGPT. 

  3. Cyberhaven Labs, “2026 AI Adoption & Risk Report”, February 11, 2026. The study tracked real-time data lineage across millions of enterprise employees. 

  4. Cisco, “2024 Data Privacy Benchmark Study”, January 25, 2024. Survey of 2,600 security professionals across 12 countries. 

  5. Budzyń et al., “Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy”, The Lancet Gastroenterology & Hepatology, August 2025. Study of 1,443 colonoscopies across four Polish centres. 

  6. CNBC, “Wolters Kluwer, one of the biggest accounting software companies, hit by malware attack”, May 8, 2019. CCH products serve 100% of the top 100 U.S. accounting firms. 

  7. Lightcast, “Beyond the Buzz: Developing the AI Skills Employers Actually Need”, July 23, 2025. Analysis of 1.3 billion job postings from 2024. 

  8. “Reducing Hallucinations and Trade-Offs in Responses in Generative AI Chatbots for Cancer Information”, JMIR Cancer, 2025. Study tested 62 cancer-related questions across six chatbot configurations. 

  9. David Cahn, “AI’s $600B Question”, Sequoia Capital, June 2024. 

  10. Goldman Sachs, “Gen AI: Too Much Spend, Too Little Benefit?” June 2024. 

  11. Daron Acemoglu, “The Simple Macroeconomics of AI”, NBER Working Paper 32487, 2024. 

  12. Yale School of Management, “This Is How the AI Bubble Bursts”, CEO Summit report, October 2025. 

Get in touch

Initial conversations are exploratory and obligation-free.
