Can AI replace tutors?

With the recent announcement in the UK of AI tutors being rolled out and available to schools by the end of 2027, I thought it best to research and answer the question: can AI replace tutors?

Sal Khan thought AI tutors would close the achievement gap, but two years into Khanmigo, he tells Chalkbeat: "For a lot of students, it was a non-event." As a former classroom teacher, Head of Department and Pastoral Lead, and a current tutor and resource developer, I've seen many angles to this debate. This article looks at what AI tutoring actually does for UK students, and what it still cannot.

Students who practised maths with unguarded ChatGPT performed 17% worse on a no-AI exam than students who practised without it. (Bastani et al., PNAS, 2025)

Quick reference

The evidence: Two recent RCTs bracket the question: Stanford (Oct 2024) finds human tutors gain when paired with AI; PNAS (2025) finds unguarded ChatGPT made high-school students measurably worse on later no-AI exams.
Khanmigo's first two years: Sal Khan's own April 2026 admission to Chalkbeat: "For a lot of students, it was a non-event."
The structural weakness: Sycophancy. LLMs bend their answer to match a student's stated answer, and the bias is stronger in smaller models: up to 30% for GPT-4.1-nano versus 8% for GPT-4o (Arvin, USC, 2025).
A UK case study: Third Space Learning replaced its human tutors with an AI system called Skye in September 2025, rolling to 200 schools.

AI raises the floor between sessions; a human tutor raises the ceiling at the session.

What ChatGPT can't see when a child says "yes, I get it"

It is 10:47pm on a Wednesday. A Year 10 student is in front of ChatGPT, working through a tricky quadratic. After three messages back and forth, the model replies, "Great work, you've got it!" The student closes the browser, feels reassured, and fails the test on Friday.

The model had no way of knowing whether the child actually understood. It was trained to be agreeable, and the child was trained, by every adult in their life, that "I get it" makes the questioning stop.

This is what researchers call sycophancy, and it is not a bug a future model release will patch. Chuck Arvin's "Check My Work? Measuring Sycophancy in a Simulated Educational Context" (USC, KDD Workshop 2025) tested five OpenAI models against 14,042 multiple-choice questions from the MMLU benchmark. When the student suggested an incorrect answer in the prompt, accuracy degraded by up to 15 percentage points. The bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model versus 8% for the GPT-4o model, and the models shifted towards the student's suggestion regardless of whether the student was right.
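To make the measurement concrete, here is a minimal sketch of how an accuracy shift of this kind could be computed. It is not Arvin's code: the `Trial` record, the function names and the toy data are all hypothetical, chosen only to show the comparison between accuracy with no suggestion and accuracy when the student suggests a wrong answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """One multiple-choice question put to a model (hypothetical format)."""
    correct_answer: str            # ground-truth option, e.g. "B"
    student_hint: Optional[str]    # answer the student suggested, or None
    model_answer: str              # option the model finally chose


def accuracy(trials: List[Trial]) -> float:
    """Share of trials where the model chose the correct option."""
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(t.model_answer == t.correct_answer for t in trials)
    return hits / len(trials)


def sycophancy_shift_pp(trials: List[Trial]) -> float:
    """Accuracy lost, in percentage points, when a wrong answer is suggested."""
    baseline = [t for t in trials if t.student_hint is None]
    wrong_hint = [
        t for t in trials
        if t.student_hint is not None and t.student_hint != t.correct_answer
    ]
    return 100 * (accuracy(baseline) - accuracy(wrong_hint))


# Toy data, invented for illustration only (not results from any study).
trials = [
    Trial("B", None, "B"),
    Trial("C", None, "C"),
    Trial("A", None, "A"),
    Trial("D", None, "D"),
    Trial("B", "A", "A"),   # the model follows the student's wrong suggestion
    Trial("C", "D", "C"),
    Trial("A", "B", "A"),
    Trial("D", "C", "D"),
]

shift = sycophancy_shift_pp(trials)
print(f"Accuracy drop with a wrong suggestion: {shift:.0f} percentage points")
```

In this toy set the model answers all four unprompted questions correctly and three of the four wrongly-prompted ones, so the sketch reports a drop of 25 percentage points. A real evaluation would need thousands of questions and repeated runs per model.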

The implication is an uncomfortable one. A child who half-understands a topic and asks ChatGPT to confirm a wrong-but-plausible answer is more likely to walk away believing the wrong-but-plausible answer than to be corrected.

A human tutor reads what the chat transcript does not carry. The hesitation in the voice, the half-second pause, the wrong word from earlier in the session that needs revisiting. None of that is recoverable from a string of text.

Human connection is the invisible architecture beneath every effective tutoring relationship. A tutor who truly knows their student can deploy the pedagogical approaches that produce the deepest learning: Socratic questioning, metacognitive coaching and formative dialogue. None of these work without trust and human experience. Socratic questioning requires a student willing to think aloud and expose half-formed ideas. Metacognitive development, the process of learning to identify what you don't yet understand, demands the psychological safety to say "I don't know where I'm going wrong." Zone of proximal development work, pitching challenge just beyond current competence, only goes well when a tutor has read the student well enough to know where that edge sits and when to push against it. That kind of professional precision is relational before it is technical. For parents choosing a tutor, human connection is the hardest quality to assess from a profile and the most important one to get right.

Where AI actually does win

AI does, of course, have real benefits for learning. It plainly helps in several specific places where most parents would rather not pay a human by the hour, or cannot. That may be the thinking behind the government's recent announcement that up to 450,000 children from disadvantaged backgrounds could benefit from AI-powered tutoring tools providing personalised, one-to-one learning support.

  • Late-night homework crunches. Most tutors do not accept 10pm bookings on a Tuesday (or any night of the week); ChatGPT will. For most GCSE-level conceptual questions it is competent. Insisting otherwise is dishonest.
  • Vocabulary and spelling drills. Discrete, retrievable items plus spaced practice. An AI-assisted flashcard system handles this well, and no parent should ever have been paying a tutor just to run flashcards for an hour.
  • Python and code debugging. LLMs read stack traces and explain bugs cleanly. For a self-taught teenager learning to code, the rubber-duck use case is very useful.
  • Fast essay-feedback loops. Three drafts in twenty minutes versus three drafts in three weeks. But a trained human teacher gives better feedback: Steiss et al. (Learning and Instruction, 2024) scored 200 pieces of human-generated formative feedback against 200 pieces of AI-generated feedback on the same essays, and the human feedback was better on 80% of the elements of formative feedback measured. The differences in overall quality were modest set against the time savings, both varied with the score of the essay being marked, and the paper's own conclusion is that ChatGPT has potential as an evaluative tool given the trade-off between quality and time.

The honest summary is that AI raises the floor of what a working student can do on their own.

Two RCTs that bracket the question

Two recent randomised trials show what happens when AI is used in developing learning.

The strongest causal evidence in favour of human-tutor-plus-AI is the Stanford Tutor CoPilot RCT (Wang, Ribeiro, Robinson, Loeb and Demszky, October 2024). The trial involved 900 tutors and 1,800 K-12 students from under-served communities, and the AI assistant gave tutors real-time pedagogical suggestions during live sessions. The result: students of tutors using the assistant were 4 percentage points more likely to master a topic (p<0.01). The subgroup finding mattered more than the headline: students of the lower-rated tutors gained 9 percentage points. The weakest tutors with an AI co-pilot beat their unaided peers.

That tells us something about what AI is for. It is not replacing the tutor; it is suggesting to the tutor what the next move for the student could be.

The strongest causal evidence against unguarded AI is Bastani et al., "Generative AI Without Guardrails Can Harm Learning" (PNAS 2025). Nearly 1,000 high-school students in Turkey were given GPT-4 access during maths practice. During practice they performed 48% better than the control group. When the AI was removed for the test, they performed 17% worse than the control group. Students had used the model as a crutch and learned less on their own. A safeguarded version of the same model, prompted to give hints rather than answers, eliminated the harm but produced no learning gain over plain practice.

The two studies don't disagree. They show the same thing from opposite ends: human-plus-AI works; AI-as-substitute-for-thinking does the opposite.

What the two RCTs found, side by side

Human tutor with AI co-pilot, all students (Stanford Tutor CoPilot RCT, 2024): +4pp
Human tutor with AI co-pilot, lower-rated tutors (Stanford Tutor CoPilot RCT, 2024): +9pp
Unguarded ChatGPT alone, later no-AI exam (Bastani et al., PNAS 2025): -17%

Each figure is the headline effect reported by the study. Tutor CoPilot (Wang, Ribeiro, Robinson, Loeb and Demszky, 2024) measured the change in the share of topics students mastered; Bastani et al. (PNAS 2025) measured performance on a subsequent exam taken without AI access. Different studies and different outcome measures, so the magnitudes are indicative rather than directly comparable; the direction is the point.
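The figures above mix two units: percentage points (pp) for Tutor CoPilot and a relative percentage for Bastani et al. A small worked example may help show why they cannot be read on the same scale. This is a minimal sketch; the function names are illustrative and the input rates and scores are made up, not taken from either study.

```python
def percentage_point_change(before_rate: float, after_rate: float) -> float:
    """Absolute difference between two rates (0 to 1), in percentage points."""
    return (after_rate - before_rate) * 100


def relative_change_percent(before: float, after: float) -> float:
    """Change relative to the starting value, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100


# Hypothetical: a mastery rate that rises from 50% to 54% of topics.
pp = percentage_point_change(0.50, 0.54)
rel = relative_change_percent(0.50, 0.54)
print(f"Same change, two units: {pp:+.1f} pp, or {rel:+.1f}% relative")

# Hypothetical: a mean exam score that falls from 60 marks to 49.8 marks.
exam = relative_change_percent(60.0, 49.8)
print(f"Exam score change: {exam:+.1f}% relative")
```

With these invented inputs, a 4pp rise is also an 8% relative rise, because the baseline is 50%. The same 4pp from a different baseline would give a different relative figure, which is why the sizes of the three headline numbers should not be compared directly.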

Case study: Third Space Learning's Skye

In September 2025, the UK primary-maths-tutoring service Third Space Learning replaced its human tutors with an AI system called Skye, rolling out to 200 schools. Up to that point TSL had operated 1:1 sessions between UK pupils and tutors recruited from India and Sri Lanka.

CEO Tom Hooper told Schools Week that Skye was the "opposite" of ChatGPT: students are prompted to work through problems aloud rather than be given answers. He framed the shift as widening access to tutoring while reducing cost, and said the displaced tutors were "very understanding and supportive". A former TSL tutor disputed that account in the comments under the same piece, saying there had been no consultation.

TSL is the only UK case where a tutoring service has, at scale, replaced human tutors with AI rather than augmenting them. It is being watched closely by the trade press. The May 2026 TES feature "AI wants to teach the class, and some teachers want to let it" frames it as a "platformisation" cautionary tale. Whether the model works pedagogically, and what happens to the displaced workforce, are questions that will take a long time to answer; as educators know, results are not always immediately clear.

For parents currently paying a human tutor, the thing to watch is whether TSL's published outcome data, when it appears, holds up to independent scrutiny. So far the evidence base for AI-as-substitute-for-tutor remains thin.

Assessment of learning, and assessment for learning

One of the subtler limitations of AI marking is its dependence on the mark scheme, and mark schemes, particularly at GCSE and A Level, are blunter instruments than they appear. A student may deploy correct reasoning through an unconventional route, use domain-appropriate vocabulary in a slightly unexpected way, or produce a response that is technically outside the accepted answer but demonstrates genuine understanding. A trained examiner, and certainly an experienced tutor, can recognise this and credit it accordingly. AI systems optimised for consistency tend to reward proximity to expected phrasing rather than quality of thought. In creative subjects, extended writing, and higher-order scientific reasoning, this matters enormously. The mark scheme is a scaffold for human judgement, not a substitute for it, and applying it well has always required the kind of interpretive flexibility that remains, for now, distinctly human.

Artificial intelligence can assess answers with impressive speed and consistency, but marking is not the same as understanding a learner. An algorithm can identify a wrong answer; it cannot easily determine whether the error stems from a conceptual gap, a misread question, careless arithmetic, or the particular way a topic was explained three weeks ago. Effective tutors mark formatively, not simply scoring responses but interrogating them, asking follow-up questions, and using errors as diagnostic tools to redirect teaching. This is the difference between assessment of learning and assessment for learning. AI tools are increasingly capable at the former, but the latter requires the kind of contextual, relational knowledge that builds over time between a tutor and student. A mark from an AI tells a student they got something wrong. A mark from a skilled tutor tells them why, and what to do next.

AI is the floor; a tutor is the ceiling

From the evidence we can conclude:

  • Between sessions, AI raises the floor. Vocabulary drilling, retrieval practice, code debugging, first-pass essay drafts, the 10pm panic. Free or near-free, always available.
  • At the session, a human tutor raises the ceiling. The diagnostic "do you actually understand this?", exam-board-specific marking (a tutor who has marked AQA Combined Science Paper 1 knows what loses a mark, and what can be accepted outside the mark scheme, in ways ChatGPT does not), motivation, accountability for a student who will not open the app, pastoral judgement during an exam-anxiety meltdown and, in some tutors, an extra eye for safeguarding.
  • Behind the session, the tutor uses AI as a force-multiplier. Generating problem variants at the right difficulty, some of the marking workload, alternative teaching ideas, drafting personalised practice, speeding up preparation by creating a quick worksheet or reorganising a sorting task.

Sal Khan's admission in April 2026 is striking because Khanmigo had every advantage going: Khan Academy's brand, the OpenAI partnership behind it, Sal Khan's own voice in the loop, and a year of teacher feedback to refine it. Kristen DiCerbo, Khan Academy's Chief Learning Officer, told Chalkbeat: "So far I am not seeing the revolution in education."

The tools are improving rapidly, but the evidence base for AI replacing the relational, adaptive and diagnostic work of a skilled tutor simply does not yet exist. The revolution will probably happen. When it does, it will look like a competent human in the room with AI as the tool they reach for, not an empty chair and a chat window.

Ready to find a tutor?

Free to browse, free to message. £9.99 one-off to unlock contact-sharing with one tutor.

Find a UK tutor

Common questions

  • Should I cancel my child's tutor because ChatGPT is free?

    Probably not. ChatGPT genuinely covers some of what tutors used to be hired for: late-night homework help, vocabulary drilling, first-draft essay feedback. It does not cover the parts of tutoring that move the dial on exams: accountability, exam-board-specific marking knowledge, and the "is my child actually understanding this?" diagnostic that current LLMs cannot do reliably. A tutor who is also comfortable using AI as a tool in lessons is the configuration the evidence currently supports.

  • Can ChatGPT mark a GCSE essay?

    It can give feedback. The question is whether it marks to the right exam board's mark scheme. AQA, Edexcel, OCR and the rest diverge in small but mark-losing ways. ChatGPT is documented as grade-generous in tests, often awarding a grade higher than the exam board would. For mock-marking by mark scheme, a tutor who regularly marks that board's papers is more reliable.

  • Is AI tutoring better than nothing for a struggling child?

    It depends on the setup. The Stanford Tutor CoPilot RCT (2024) suggests the strongest gains come from a human tutor using AI, not from AI alone. Bastani et al. (PNAS 2025) found unguarded ChatGPT made high-schoolers 17% worse on a later no-AI exam than students who had practised without it. The pattern in the evidence is that AI-with-an-adult-in-the-loop works; AI-as-substitute does not.

  • What is sycophancy in AI tutoring?

    A tendency built into how chatbots are trained: models prefer to agree with the user. Chuck Arvin's USC study (KDD Workshop 2025) showed LLM accuracy degrading when a student suggested a wrong answer, with the bias stronger in smaller models: an effect of up to 30% for GPT-4.1-nano versus 8% for GPT-4o. The practical risk for tutoring is that a child who half-understands a topic asks ChatGPT to confirm a wrong-but-plausible answer and walks away convinced.

  • Will AI replace tutors in the next five years?

    The most direct piece of evidence we have is Sal Khan's own admission in April 2026 that Khanmigo has been "a non-event" for many of its student users, and Khanmigo had every advantage. Replacement is the wrong word for what is happening; augmentation is the right one. Tutors who learn to use AI as a tool will outperform tutors who do not.

Written by Fiona H. Reviewed by Robert S. Last reviewed