AI Is Revolutionizing Everything. Why Do Evaluations Still Look the Same?
I have worked as an external evaluator on United Nations (UN) evaluations for several years. The recent funding cuts have brought a long-standing question into the open: are evaluations worth the effort? And in the age of artificial intelligence (AI), is the traditional model, in which we collect all the data, write the report, and move on, still justifiable?
Aid budgets have been slashed, and every part of the humanitarian system is being asked to do more with less. Evaluations are no exception. Used strategically, AI offers a way for evaluations to carry their share of the burden, and to become more useful in the process.
The debate within the evaluation community on how to use AI for evaluations has so far focused, rightly, on the risks. Most external evaluators already use AI for desk reviews, qualitative coding, and drafting (sometimes for more than they admit), and the early naiveté has given way to a sharper awareness of what can go wrong when these tools are used carelessly. The United Nations Evaluation Group's (UNEG) 2025 reference document is a good example: it sets out principles of transparency and accountability, fairness and inclusivity, data protection, and validity and reliability, and flags risks around bias, hallucinations, and the limits of machine judgment. This work is important and necessary.
But guardrails are only half the conversation. The other half, which has received far less attention, is what AI should change about how evaluations are commissioned and run in the first place. On this question, too many aid agencies are still paying the costs of the old model while capturing few of the strategic benefits the new tools could offer. No two evaluations are exactly alike, but many terms of reference from UN agencies look broadly the same as they did before AI arrived: global evaluations still call for desk research by external consultants, 100+ key informant interviews (KIIs), an online survey, sometimes community consultations, 100–120 consultant days, and a budget of 100,000–150,000 euros per evaluation.
So, what would a more proactive approach look like? It would mean commissioning evaluations differently: redesigning terms of reference (TORs) to shift budgets from data collection to follow-through, bringing national researchers in from inception, and investing in secure AI capacity that international and national evaluators can use.
This commentary looks at where the biggest changes are possible: the early phases of an evaluation, where AI can do most of the heavy lifting.
Desk Review: Automate and Bring It In-House
Evaluation offices already collect and organise much of the documentation that external teams work from. With the tools available today, they could synthesise it themselves at little extra cost. The benefit is not only financial. Offices that do this work up front enter the evaluation with a much firmer grasp of the evidence, and ask sharper questions throughout. External evaluators, in turn, start from a completed review and can play a different role: critical challengers who poke holes, develop alternative hypotheses, and identify what the in-house synthesis missed.
Data protection concerns about off-the-shelf synthesis tools such as Claude and NotebookLM are real, but they are only a partial constraint. Most evaluation documentation is non-sensitive, and a great deal of context can be drawn from publicly available sources. Where privacy is genuinely a barrier, or where evaluation offices are too stretched to do the desk review themselves, they can pre-clear specific AI solutions that meet their GDPR and confidentiality requirements and make these available to external teams. Where they have the capacity to go further, investing in self-hosted solutions would pay for itself many times over.
Online Surveys and KIIs: Replace Static Formats with AI-Moderated Interviews
A more radical shift is possible in data collection itself. My own turning point on this came when I was recently interviewed by an AI avatar as part of testing tools on the market. The avatar was professional, followed the guide, and probed for context and examples. When I gave deliberately wrong answers or tried to steer the conversation off topic, it stayed patient, and eventually wrapped things up politely once it became clear my responses would not be useful.
Providers of these tools claim richer data and better response rates than traditional surveys, and my own experience bears that out. Respondents seem to trust the anonymity and offer more context than they would in a standardised format. How much of this is genuine added value, and how much is a novelty effect set against online survey fatigue, is still an open question. But the mechanics are already compelling: analysis is produced in minutes and comes back with coded responses, relevant quotes, and standardised data ready for synthesis. Before long, this approach will displace typical online surveys and many key informant interviews.
A fair objection is that fewer human interviews mean losing the immersion that makes qualitative research valuable. In practice, though, a handful of well-chosen conversations usually generate the bulk of analytical insight. The case for AI-moderated interviews is precisely that they free up human time for those conversations. Evaluators can run targeted, strategic interviews where judgment and rapport matter most, while the AI covers breadth: making sure no constituency is missed and providing a baseline against which emerging hypotheses can be tested.
One important exception is consultations with affected populations, especially in low-tech environments. Here, trust, consent, and power dynamics require human researchers, ideally national ones involved from the start. AI still has a role to play, but a supporting one: giving these researchers access to the full evidence base, if needed in their own language, from the outset of the evaluation.
So far so good for the early phases. But this is also where AI’s strengths run out: as I will argue in the second part of this piece, analysis, reporting, and follow-up are where evaluators add the most value, and where the commissioning model needs to change most.
This commentary was originally published by ALNAP on May 7, 2026. This is part one of a two-part series; part two is coming soon.