AI raises the floor between sessions; a human tutor raises the ceiling at the session.
What ChatGPT can't see when a child says "yes, I get it"
It is 10:47pm on a Wednesday. A Year 10 student is in front of ChatGPT, working through a tricky quadratic. After three messages back and forth, the model replies, "Great work, you've got it!" The student closes the browser, feels reassured, and fails the test on Friday.
The model had no way of knowing whether the child actually understood. It was trained to be agreeable, and the child was trained, by every adult in their life, that "I get it" makes the questioning stop.
This is what researchers call sycophancy, and it is not a bug a future model release will patch. Chuck Arvin's "Check My Work? Measuring Sycophancy in a Simulated Educational Context" (USC, KDD Workshop 2025) tested five OpenAI models against 14,042 multiple-choice questions from the MMLU benchmark. When the student suggested an incorrect answer in the prompt, accuracy degraded by up to 15 percentage points. The bias is stronger in smaller models, with an effect of up to 30% for the GPT-4.1-nano model versus 8% for the GPT-4o model, and the models shifted towards the student's suggestion regardless of whether the student was right.
The implication is an uncomfortable one. A child who half-understands a topic and asks ChatGPT to confirm a wrong-but-plausible answer is more likely to walk away believing the wrong-but-plausible answer than to be corrected.
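For readers who want to see how an effect like this might be measured, here is a minimal sketch in Python. It is not the paper's code: the `Trial` record, the `sycophancy_report` function and the five toy trials are all hypothetical, made up purely to illustrate the idea of comparing a model's accuracy with and without a student's wrong suggestion in the prompt.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One multiple-choice question asked twice (hypothetical record layout)."""
    correct: str         # the right option
    neutral_answer: str  # model's answer when the prompt carries no suggestion
    suggested: str       # option the simulated student proposed
    nudged_answer: str   # model's answer after seeing that suggestion


def sycophancy_report(trials):
    """Compare accuracy with and without a wrong student suggestion."""
    # Keep only the trials where the student's suggestion was wrong.
    wrong_hint = [t for t in trials if t.suggested != t.correct]
    if not wrong_hint:
        raise ValueError("no trials with an incorrect suggestion")
    n = len(wrong_hint)
    baseline = sum(t.neutral_answer == t.correct for t in wrong_hint) / n
    nudged = sum(t.nudged_answer == t.correct for t in wrong_hint) / n
    followed = sum(t.nudged_answer == t.suggested for t in wrong_hint) / n
    return {
        "baseline_accuracy": baseline,
        "nudged_accuracy": nudged,
        "drop_in_points": (baseline - nudged) * 100,
        "followed_wrong_suggestion": followed,
    }


# Toy data, invented for illustration only (not from the study).
trials = [
    Trial(correct="B", neutral_answer="B", suggested="C", nudged_answer="C"),
    Trial(correct="A", neutral_answer="A", suggested="D", nudged_answer="A"),
    Trial(correct="D", neutral_answer="C", suggested="A", nudged_answer="A"),
    Trial(correct="C", neutral_answer="C", suggested="B", nudged_answer="C"),
    Trial(correct="A", neutral_answer="A", suggested="A", nudged_answer="A"),
]
report = sycophancy_report(trials)
print(report)
```

The number that matters for a parent is the last one in the report: how often the model simply went along with the wrong answer it was handed.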
A human tutor reads what the chat transcript does not carry. The hesitation in the voice, the half-second pause, the wrong word from earlier in the session that needs revisiting. None of that is recoverable from a string of text.
Human connection is the invisible architecture beneath every effective tutoring relationship. A tutor who truly knows their student can deploy the pedagogical approaches that produce the deepest learning (Socratic questioning, metacognitive coaching, formative dialogue), and none of these work without trust and human experience. Socratic questioning requires a student willing to think aloud and expose half-formed ideas. Metacognitive development, the process of learning to identify what you don't yet understand, demands the psychological safety to say "I don't know where I'm going wrong." Zone of proximal development work, pitching challenge just beyond current competence, only goes well when a tutor has read the student well enough to know where that edge sits and when to push against it. That kind of professional precision is relational before it is technical. For parents choosing a tutor, human connection is the hardest quality to assess from a profile and the most important one to get right.
Where AI actually does win
AI does, of course, have real benefits for learning. It plainly helps in several specific places where most parents would rather not pay per hour for a human, or cannot pay. That may be the thinking behind the government's recent announcement that up to 450,000 children from disadvantaged backgrounds could benefit from AI-powered tutoring tools providing personalised, one-to-one learning support.
- Late-night homework crunches. Most tutors do not accept 10pm bookings on a Tuesday (or any night of the week); ChatGPT will. For most GCSE-level conceptual questions it is competent. Insisting otherwise is dishonest.
- Vocabulary and spelling drills. Discrete, retrievable items plus spaced practice. An AI-assisted flashcard system handles this well, and no parent should ever have been paying a tutor just to run flashcards for an hour.
- Python and code debugging. LLMs read stack traces and explain bugs cleanly. For a self-taught teenager learning to code, the rubber-duck use case is very useful.
- Fast essay-feedback loops. Three drafts in twenty minutes versus three drafts in three weeks. But a trained human teacher gives better feedback: Steiss et al. (Learning and Instruction, 2024) scored 200 pieces of human-generated formative feedback against 200 pieces of AI-generated feedback on the same essays, and the human feedback was better on 80% of the elements of formative feedback measured. The differences in overall quality were modest set against the time savings, both varied with the score of the essay being marked, and the paper's own conclusion is that ChatGPT has potential as an evaluative tool given the trade-off between quality and time.
The honest summary is that AI raises the floor of what a working student can do on their own.
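The spaced practice mentioned in the vocabulary item above is simple enough to sketch in a few lines of Python. This is a minimal illustration of a Leitner-style box system, not a description of any particular app: the `Card` class, the `review` and `due_cards` functions and the interval lengths are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Days to wait before the next review, one entry per box (assumed values).
INTERVALS_DAYS = [1, 2, 4, 8, 16]


@dataclass
class Card:
    prompt: str
    answer: str
    box: int = 0          # index into INTERVALS_DAYS
    due: date = date.min  # new cards are due immediately


def review(card, was_correct, today):
    """Move the card up a box on a correct answer, back to the first on a miss."""
    if was_correct:
        card.box = min(card.box + 1, len(INTERVALS_DAYS) - 1)
    else:
        card.box = 0
    card.due = today + timedelta(days=INTERVALS_DAYS[card.box])
    return card


def due_cards(cards, today):
    """Cards whose review date has arrived."""
    return [c for c in cards if c.due <= today]


# Hypothetical two-card deck.
today = date(2025, 1, 6)
deck = [Card("le chat", "cat"), Card("le pont", "bridge")]
review(deck[0], was_correct=True, today=today)   # known word: seen again later
review(deck[1], was_correct=False, today=today)  # missed word: back tomorrow
tomorrow = today + timedelta(days=1)
print([c.prompt for c in due_cards(deck, tomorrow)])
```

Words a student keeps getting right drift to longer gaps, while missed words come back the next day, which is exactly the kind of bookkeeping software does better than a paid hour of a tutor's time.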
Two RCTs that bracket the question
Two recent randomised controlled trials show what happens when AI is brought into learning, one with a human tutor in the loop and one without.
The strongest causal evidence in favour of human-tutor-plus-AI is the Stanford Tutor CoPilot RCT (Wang, Ribeiro, Robinson, Loeb and Demszky, October 2024). The trial covered 900 tutors and 1,800 K-12 students from under-served communities, and the AI assistant gave tutors real-time pedagogical suggestions during live sessions. The result: students of tutors using the assistant were 4 percentage points more likely to master a topic (p<0.01). The subgroup finding mattered more than the headline: students of the lower-rated tutors gained 9 percentage points. The weakest tutors with an AI co-pilot beat their unaided peers.
That tells us something about what AI is for. It is not replacing the tutor; it is suggesting to the tutor what the next move with the student could be.
The strongest causal evidence against unguarded AI is Bastani et al. (PNAS 2025), "Generative AI Without Guardrails Can Harm Learning". Nearly 1,000 high-school students in Turkey were given GPT-4 access during maths practice. During practice they performed 48% better than the control group. When the AI was removed for the test, they performed 17% worse than the control group. Students had used the model as a crutch and learned less on their own. A safeguarded version of the same model, prompted to give hints rather than answers, eliminated the harm but produced no learning gain over plain practice.
The two studies don't disagree. They show the same thing from opposite ends: human-plus-AI works; AI-as-substitute-for-thinking does the opposite of work.
What the two RCTs found, side by side
Each bar is the headline effect reported by the study. Tutor CoPilot (Wang, Ribeiro, Robinson, Loeb and Demszky, 2024) measured the change in the share of topics students mastered; Bastani et al. (PNAS 2025) measured performance on a subsequent exam taken without AI access. Different studies and different outcome measures, so the magnitudes are indicative rather than directly comparable; the direction is the point.
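One reason the magnitudes are not directly comparable is that a change in percentage points and a relative change in percent are different kinds of number. The short Python sketch below shows the arithmetic. The baselines in it (a 60% mastery rate and a test score of 50) are hypothetical values picked for illustration; they are not taken from either study.

```python
def apply_point_change(baseline_pct, points):
    """A percentage-point change is added straight onto the baseline rate."""
    return baseline_pct + points


def apply_relative_change(baseline, percent):
    """A relative change scales the baseline by the given percent."""
    return baseline * (1 + percent / 100)


def points_as_relative(baseline_pct, points):
    """Express a percentage-point change as a relative percent of the baseline."""
    return points / baseline_pct * 100


# Hypothetical baseline: 60% of topics mastered without the AI assistant.
mastery_with_ai = apply_point_change(60.0, 4.0)
mastery_relative_gain = points_as_relative(60.0, 4.0)

# Hypothetical baseline: a control-group test score of 50 marks.
score_after_crutch = apply_relative_change(50.0, -17.0)

print(f"+4 points on a 60% baseline gives {mastery_with_ai:.1f}%")
print(f"that is a relative gain of about {mastery_relative_gain:.1f}%")
print(f"-17% on a score of 50 gives {score_after_crutch:.1f}")
```

The same 4-point gain would be a larger relative improvement on a lower baseline and a smaller one on a higher baseline, which is why the direction of each result is a safer thing to compare than its size.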
Case study: Third Space Learning's Skye
In September 2025, the UK primary-maths-tutoring service Third Space Learning replaced its human tutors with an AI system called Skye, rolling out to 200 schools. Up to that point TSL had operated 1:1 sessions between UK pupils and tutors recruited from India and Sri Lanka.
CEO Tom Hooper told Schools Week that Skye was the "opposite" of ChatGPT: students are prompted to work through problems aloud rather than being given answers. He framed the shift as widening access to tutoring while reducing cost, and said the displaced tutors were "very understanding and supportive". A former TSL tutor disputed that account in the comments under the same piece, saying there had been no consultation.
TSL is the only UK case where a tutoring service has, at scale, replaced human tutors with AI rather than augmenting them. It is being watched closely by the trade press. The May 2026 TES feature "AI wants to teach the class, and some teachers want to let it" frames it as a "platformisation" cautionary tale. Whether the model works pedagogically, and what happens to the displaced workforce, are questions that will take a long time to answer; as educators know, results are not always immediately clear.
For parents currently paying a human tutor, the thing to watch is whether TSL's published outcome data, when it appears, holds up to independent scrutiny. So far the evidence base for AI-as-substitute-for-tutor remains thin.
Assessment of learning, and assessment for learning
One of the subtler limitations of AI marking is its dependence on the mark scheme, and mark schemes, particularly at GCSE and A Level, are blunter instruments than they appear. A student may deploy correct reasoning through an unconventional route, use domain-appropriate vocabulary in a slightly unexpected way, or produce a response that is technically outside the accepted answer but demonstrates genuine understanding. A trained examiner, and certainly an experienced tutor, can recognise this and credit it accordingly. AI systems optimised for consistency tend to reward proximity to expected phrasing rather than quality of thought. In creative subjects, extended writing, and higher-order scientific reasoning, this matters enormously. The mark scheme is a scaffold for human judgement, not a substitute for it, and applying it well has always required the kind of interpretive flexibility that remains, for now, distinctly human.
Artificial intelligence can assess answers with impressive speed and consistency, but marking is not the same as understanding a learner. An algorithm can identify a wrong answer; it cannot easily determine whether the error stems from a conceptual gap, a misread question, careless arithmetic, or the particular way a topic was explained three weeks ago. Effective tutors mark formatively, not simply scoring responses but interrogating them, asking follow-up questions, and using errors as diagnostic tools to redirect teaching. This is the difference between assessment of learning and assessment for learning. AI tools are increasingly capable at the former, but the latter requires the kind of contextual, relational knowledge that builds over time between a tutor and student. A mark from an AI tells a student they got something wrong. A mark from a skilled tutor tells them why, and what to do next.
AI is the floor; a tutor is the ceiling
From the evidence we can conclude:
- Between sessions, AI raises the floor. Vocabulary drilling, retrieval practice, code debugging, first-pass essay drafts, the 10pm panic. Free or near-free, always available.
- At the session, a human tutor raises the ceiling. The diagnostic "do you actually understand this?", exam-board-specific marking (a tutor who has marked AQA Combined Science Paper 1 knows what loses a mark, and what can be accepted outside the mark scheme, in ways ChatGPT does not), motivation, accountability for a student who will not open the app, pastoral judgement during an exam-anxiety meltdown and, in some tutors, an extra eye for safeguarding.
- Behind the session, the tutor uses AI as a force-multiplier. Generating problem variants at the right difficulty, taking on some of the marking workload, suggesting alternative teaching ideas, drafting personalised practice, and speeding up preparation by creating a quick worksheet or reorganising a sorting task.
Sal Khan's admission in April 2026 is striking because Khanmigo had every advantage going: Khan Academy's brand, the OpenAI partnership behind it, Sal Khan's own voice in the loop, and a year of teacher feedback to refine it. Kristen DiCerbo, Khan Academy's Chief Learning Officer, told Chalkbeat: "So far I am not seeing the revolution in education."
The tools are improving rapidly, but the evidence base for AI replacing the relational, adaptive and diagnostic work of a skilled tutor simply does not yet exist. The revolution will probably happen. When it does, it will look like a competent human in the room with AI as the tool they reach for, not an empty chair and a chat window.