CTCO : Coder to Corner Office

The Good, the Bad, and the Frontier: A Pragmatic Read on AI Adoption

Published at 10:00 AM · 19 min read · By Joseph Tomkinson
Reality Checks · Written by Human

[Image: Two hands on a boardroom table, one holding a Copilot-style chat window, the other a coding agent terminal]
Two AI conversations, one room, and usually only one of them gets heard properly.

It’s a quiet Saturday morning, dog tucked up beside me, mug of coffee, and a lazy scroll through LinkedIn. Two posts, three apart on the feed, completely different stories about AI. The first celebrates a Microsoft 365 Copilot rollout with a neat graphic promising a uniform productivity lift across the business. The second is a CTO quietly asking whether it is time to halve the engineering hiring plan because a coding agent just shipped a week’s worth of tickets overnight. Under both, the comments split cleanly into ‘this changes everything’ and ‘this changes nothing’, with almost nothing sensible in between. I closed the app and went and made another coffee.

It is the same word at the top of both posts, and yet they are plainly not discussing the same technology, the same user, or the same outcome.

The pattern I see, both in the industry reporting (which I spend a lot of my week reading) and in the comment sections leaders are increasingly making decisions from, is that we are trying to form one coherent view of ‘AI’ when there are at least two conversations happening at once. And inside each of them there is a frontier-vs-standard model gap that, I personally think, decides whether the business case you signed off on survives contact with reality. Neither of those nuances survives the journey onto a LinkedIn carousel (which is a sentence I never expected to type, but here we are), and that is part of how we ended up here.


Two AIs in the same conversation

The first kind of AI is the generalist productivity layer. Microsoft 365 Copilot, ChatGPT for Work, Gemini in Workspace, Claude in the desktop app. The user is anyone with an email address. The expected outcome is ‘I spent less time on that task.’ Success looks like a faster first draft, a summarised thread, a tidier spreadsheet, a decent meeting recap. Failure looks like a confident-sounding hallucination in a document no one re-read, or an honest 30% time saving on a task that wasn’t on the critical path in the first place.

The second kind is the specialist or agentic layer. GitHub Copilot agents, Claude Code, Cursor, Cline, the internal agentic workflows teams are starting to wire into their delivery pipelines. The user is a trained practitioner, usually an engineer, sometimes a data or security specialist. The expected outcome is measurable throughput in a specific discipline: tickets closed, code shipped, pull requests reviewed, vulnerabilities triaged. Failure looks like code that compiles, passes the tests you happen to have, and carries a subtle regression into production, or a cost curve that only makes sense at the premium tier you didn’t budget for.

These two conversations need different sponsors, different KPIs, different risk owners, and different vendor relationships. Most of the governance frameworks I see treat them as one programme. That is the first thing that has to change.

| Dimension | Generalist productivity AI | Specialist / agentic AI |
|---|---|---|
| Primary user | Any knowledge worker | Trained practitioner (engineer, analyst, etc.) |
| Expected outcome | ‘I spent less time on that task’ | Measurable throughput in a specific discipline |
| Sponsor | COO, Head of Productivity, HR-adjacent | CTO, Head of Engineering, CISO |
| Dominant risk | Quality drift, data leakage in prose | Delivery risk, regression, cost stratification |
| How you measure it | Task-level time savings, satisfaction | Cycle time, defect rate, unit economics per task |

A small table, because the argument is worth pinning, but the point is a prose one: the same word is doing two very different jobs, and the people held accountable for the outcome should not be the same person.

The frontier model gap nobody is pricing in

Inside each of those conversations there is a second gap that gets even less attention. The product name on the screen is not the thing doing the work. The model underneath it is.

The gap between a frontier model and a standard-tier model, today, is large enough to change the outcome of a whole rollout. A GPT-5-class or Claude Opus-class model will genuinely do a coherent piece of strategic analysis, argue with itself, refactor a non-trivial codebase with judgement, and hold a long thread without losing the plot. A cheaper, quantised, latency-optimised sibling of the same product will smile and confidently give you something that reads similarly and behaves nothing like it. Both get sold as ‘Copilot’. Both get sold as ‘our AI’. The business case your CFO heard assumed the frontier one. The licence your procurement team bought assumed the standard one. Nobody wrote that down.

I covered the downstream version of this in the ACES post a couple of weeks back. When top-quartile engineers are reportedly burning five-figure sums in tokens per month to hit the productivity numbers being quoted in public, it is worth asking which tier your team is actually using, and whether the story you are telling the board was built on that tier or a cheaper one. Same question for your M365 Copilot rollout. When the vendor shows you a research demo, they are not running it on the Haiku-grade model. They are running it on the one your seat licence does not get by default.

This is not a complaint about pricing. It is the thing you have to reason about before you can say anything honest about productivity, and almost nobody is doing it in plain English.

What both camps are getting wrong

There are two loud narratives about AI right now, and both of them are half-right, which is how you can tell neither of them is useful.

The first narrative is the doom camp. It looks at a coding-agent demo, watches a well-scoped ticket get converted into a working pull request in two minutes, and extrapolates that to ‘all knowledge work collapses in eighteen months’. The labour market data so far does not back that up. Christos Makridis and Andrew Johnston’s 2026 working paper, using US administrative data from 2017 to 2024, found that industries one standard deviation higher in AI exposure saw roughly 10 percent higher productivity, 3.9 percent more jobs and 4.8 percent higher wages than comparable industries in the same state. Their separate Gallup-based work tracked the share of frequent AI users rising from 12 percent in mid-2024 to 26 percent by late 2025, with each percentage-point rise associated with measurably higher real output and employment. AI is acting, so far, more like a productivity tool that pulls labour up than a substitute that pushes labour out. The mistake the doom camp makes is treating a highly structured, well-tested, already-automated discipline like software engineering as representative of everything that knowledge workers do. Software delivery is unusually well-suited to agents: type systems, tests, linters, code review, version control, staging environments. Most other professions have none of that scaffolding. The agent is not going to find it on the way in.

The second narrative is the hype camp. It looks at a productivity deck, sees ‘30% time saved across the function’, and extrapolates that to a uniform lift across every role, every team, and every sector. The mistake is that productivity gains are task-shaped, not role-shaped. The Harvard Business School and BCG study run by Dell’Acqua and colleagues in 2023 put real numbers on this: consultants using GPT-4 performed 12 to 40 percent better on tasks that sat inside the model’s capability frontier, and meaningfully worse on tasks that sat just outside it. They called it the ‘jagged frontier’, and it has held up well across the studies that have followed. Makridis and Johnston’s data sharpens the same point from the labour market side. Where AI mainly augments people (marketing, writing, financial analysis), employment rose by about 3.6 percent per standard deviation increase in exposure. Where AI can execute tasks more autonomously (basic data processing, generating boilerplate code, standardised customer interactions), employment did not change much but wage growth slowed. Same technology. Different task shape. Different outcome for the worker. A role that is 40 percent the first kind of task and 60 percent the second does not get 30 percent faster. It gets a patchy lift on one half of the week, and a new failure mode on the other half that you did not have before.
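
To make that arithmetic concrete, here is a minimal sketch in Python. The task mix, speed-ups and rework figures are invented for illustration, not taken from any of the studies above; the only point is that a role-level number is a weighted blend of task-level outcomes, some of which are negative.

```python
# Illustrative only: why task-level gains do not add up to a uniform role-level uplift.
# The task mix, speed-ups and rework figures below are invented for the sake of the arithmetic.

tasks = {
    # share of the working week, speed-up on that slice, extra rework that slice now generates
    "inside the frontier (drafts, boilerplate, summaries)": {"share": 0.40, "speedup": 0.30, "rework": 0.00},
    "outside the frontier (judgement, novel analysis, edge cases)": {"share": 0.60, "speedup": 0.00, "rework": 0.05},
}

# Net time saved across the week: each slice contributes its share of the week,
# scaled by the speed-up on that slice, minus any new rework it introduces.
net_saving = sum(t["share"] * (t["speedup"] - t["rework"]) for t in tasks.values())

print(f"Role-level uplift: {net_saving:.0%}")  # 9%, not the 30% quoted for the favourable task
```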

Both camps are reading one side of the picture. Neither is reading the sectors.

Reading numbers that age in dog years

A reasonable objection at this point is that most of the studies I am quoting are already old in AI-product time. GitHub’s Copilot benchmark was 2022 and ran on a Codex-class assistant. Brynjolfsson, Li and Raymond’s customer-support trial used a GPT-3.5-era system. Dell’Acqua’s BCG work was on GPT-4. METR’s developer trial used early-2025 frontier models. The Makridis labour-market data runs to 2024. The newest of those is twelve to eighteen months behind the model class your team is actually using today, and by the time a peer-reviewed paper lands on your desk, the model class it studied is often two product generations behind. That is a real problem if you read the numbers as point estimates.

It matters considerably less if you read them as shapes. Magnitudes drift, sometimes a lot. The shapes have been remarkably consistent across every model generation we have studied. The jagged frontier of task-by-task capability is still jagged. The ‘novices benefit most’ effect from the customer-support trial keeps reproducing across new domains. The augmentation-versus-autonomous split Makridis identified in 2024 admin data is the same split practitioners are seeing in agentic workflows in 2026. If anything, more capable models have widened the jagged frontier rather than smoothed it: they are better in the middle and still poor at the edges, and the edges are where most of the unsuccessful rollouts quietly live.

The pragmatic read is that older studies tell you the direction of travel and the failure modes that survive a model upgrade. Anyone telling you the exact productivity figure their team will get from a model that did not exist three months ago is selling something.

The sector picture is more nuanced than the headlines

In knowledge work and professional services, the lift is real but task-shaped. Microsoft’s 2024 Work Trend Index reported that 75 percent of global knowledge workers were already using generative AI at work, and that 78 percent of those users were bringing their own tools in regardless of whether their employer had an official programme. That should settle any remaining argument about whether adoption is happening. Brynjolfsson, Li and Raymond’s 2023 field study of over 5,000 customer support agents found an average 14 percent productivity gain from a generative AI assistant, with novices gaining over 30 percent and the most experienced agents seeing essentially no measurable benefit. That shape, big gains concentrated where the prior skill floor was lowest, keeps showing up. The cap is not the model, it is the human review step that has to pick up the work before it leaves the organisation. Remove that step to ‘capture the productivity’ and the gains evaporate in a quarter via reputational drag.

In software engineering, the gains are the most measurable and the most fragile, and the evidence is messier than either camp usually admits. GitHub’s own 2022 study, run on a Codex-era assistant that bears very little resemblance to the agentic Copilot of today, claimed 55 percent faster task completion on a controlled exercise. METR’s 2025 randomised controlled trial of experienced open-source developers working on their own real-world issues, run on early-2025 frontier models, found something very different: participants were 19 percent slower with AI tools than without, while believing themselves to be around 20 percent faster. Both results are real, three model generations apart, and both point at the same underlying shape. Where the harness, the review culture, and the frontier model are all present at the same time, throughput genuinely moves. Remove any one of them and you get regression risk, cost blow-out, or an engineering team quietly doing the same work they did before with a more expensive keyboard. The 2024 DORA State of DevOps report was consistent with this: AI adoption correlated with higher throughput and with higher instability at the same time. Both things are true.

In regulated sectors (financial services, healthcare, legal), the bottleneck is not capability, it is assurance. The NCSC put it plainly last month in their guidance on AI-assisted delivery, which I covered in The NCSC Said the Quiet Part Out Loud. You will not stop adoption, so the work is building the guardrails, provenance, and model assurance that make adoption defensible. Governance is the rate-limiter here, not the model.

In frontline and operational sectors (logistics, manufacturing, construction, civil), current generative AI barely touches the work that actually moves the P&L. The forklift is not going to be dispatched by a chatbot. The pour schedule is not going to be optimised by a general-purpose LLM. The real gains live in the back office, procurement, safety reporting, documentation, and the edges where knowledge work leaks into the operational spine. Anyone selling a uniform uplift story into a logistics business is selling you the knowledge-work figure and hoping you do not ask which roles it applies to.

In the public sector and nonprofit space, the structural adoption gap I wrote about in the ACES piece lands hardest. Eurostat’s 2025 data had 55 percent of large EU enterprises using at least one AI technology against just 17 percent of small businesses; the ONS put overall UK business adoption at 23 percent in late 2025, and TechSoup’s 2025 State of AI in Nonprofits report found that nonprofits with revenues above one million US dollars were adopting AI at nearly twice the rate of smaller organisations. Budget decides which model tier you get, and the model tier decides which outcomes you see. Two charities running the ‘same’ tool can be running two very different products once you look under the bonnet. That is a policy question, not a procurement one, and it is not being treated as one yet.

And in creative industries, the picture is the most genuinely contested. The question there is not productivity, it is identity, authorship, and economic structure. I am deliberately not going to pretend to have the answer to that here. I will say that anyone reasoning about ‘AI adoption in creative work’ through the lens of engineering productivity metrics is going to miss the thing the sector is actually arguing about.

The common thread: impact is task-shaped and tier-shaped, and both of those cut across roles and sectors in ways the vendor decks do not.

Strategic analysis is not the same as code generation

There is a subtler version of the frontier gap that affects leaders personally, and it is worth naming.

If you are using a consumer-tier chat product to help you think through a strategy paper, a board pack, or a vendor decision, you are experiencing a different model from the one the big strategic analysis demos are run on. I do a reasonable amount of strategic analysis work with frontier models (more than I probably should admit to on a blog about my day job): scenario planning, first-pass market reads, structured critique of my own arguments. The quality difference between a frontier model and a mid-tier one on that kind of task is not subtle. The mid-tier one gives you a confident, plausible, surface-level answer. The frontier one argues back, finds the weak link in your reasoning, and occasionally tells you the question is wrong.

If the only AI you have personally used for strategic work is the standard tier, you are almost certainly underestimating what the technology is capable of. And if the only AI you have personally used is the frontier tier, you are almost certainly overestimating what your org is actually rolling out to the average user. Leaders need both views, ideally in the same week, to make a sensible call.

Six things worth doing (in my view)

  1. Separate your generalist and specialist AI strategies. Different sponsors, different KPIs, different risk owners. If your M365 Copilot programme and your coding-agent programme sit in the same RAG report under one line item, you are already missing the detail that matters.
  2. Audit which model tier each rollout is actually using. Not which product, which tier. Compare it against the tier the business case implicitly assumed. If the gap is big, either upgrade the tier or rewrite the business case. Do not leave it quietly mismatched and discover the delta at renewal.
  3. Pick one task per function and measure honestly. Not ‘AI adoption rate’, which is a vanity metric (I know, I know, it is on every dashboard, that is rather the point). Pick a task, measure cycle time and quality before and after, and let the number be whatever the number is; there is a minimal sketch of what that could look like just after this list. Some tasks will surprise you upwards, others will not move at all. Both results are useful.
  4. Assume task-shaped and sector-shaped impact, not role-shaped headcount maths. Do not let anyone present a slide that says ‘role X is 30% more productive therefore we need 30% fewer of them’. The arithmetic is wrong at the first step.
  5. Be visibly clear and credible about your AI strategy. Makridis’s 2026 Gallup-panel work found that frequent AI use was substantially more common where workers believed their organisation had communicated a clear AI strategy and where employees said they trusted leadership, and in those environments AI use correlated with higher engagement rather than burnout. The corollary is uncomfortable: silence and ambiguity are not neutral, they actively suppress the experimentation that makes the productivity numbers real in the first place.
  6. Keep one foot in the frontier. Personally. As a leader. Pay for the top tier of something, use it on real work weekly, and form a direct view. Delegating your own understanding of the trajectory to a vendor deck is how you end up with a strategy that was already out of date at the point it was written.
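
On point 3, here is a minimal sketch of what an honest before/after measurement could look like, assuming you can export cycle times for one task type from your own tooling. The sample values are placeholders, not a prescription.

```python
# Minimal sketch: before/after comparison of cycle time for one task type.
# The sample values are placeholders; swap in your own exported data.
from statistics import median

before_hours = [14.0, 9.5, 11.0, 16.0, 12.5, 10.0, 13.0]  # cycle times before the rollout
after_hours = [8.0, 12.0, 7.5, 15.0, 9.0, 6.5, 11.5]       # cycle times after the rollout

for label, samples in (("before", before_hours), ("after", after_hours)):
    print(f"{label}: n={len(samples)}, median={median(samples):.1f}h")

change = (median(before_hours) - median(after_hours)) / median(before_hours)
print(f"median cycle-time reduction: {change:.0%}")

# Pair this with a quality measure over the same window (defect rate, rework, re-opened tickets);
# a faster cycle time on its own can simply mean the review step got skipped.
```

The point is not the statistics, which are deliberately crude here; it is that the number comes from your own work, on the tier you actually licensed, rather than from a vendor benchmark.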

Closing thoughts

The reason both narratives about AI feel so confident, and so wrong, is that each is reading one half of a two-conversation, multi-tier, sector-shaped picture and calling the whole thing from that one half.

The pragmatic read is less dramatic. Generalist productivity AI is real, uneven, task-shaped, and bounded by human review. Specialist and agentic AI is real, measurable, fragile, and bounded by tier and harness. Both are worth doing. Neither is worth doing the way your loudest vendor or your loudest critic is telling you to. Lead from the detail, not the headline.

Anyway, that is my Saturday morning thought offloaded onto the page (the dog has long since lost interest). Your mileage will absolutely vary, and I would rather it did: the people closest to a problem usually see it sharpest. Thanks for reading.

References

A mix of primary studies and well-cited reports behind the numbers in this post. I have read these and recommend them if you want the full picture rather than the carousel version:

