
What Happens When You Invite AI Into an Outcome Evaluation?

By Ena Taguiam and Darnesha Tabor


Last month, we wrapped up WomenStrong International’s outcome evaluation, which explored two main questions: (1) how are organizations building strength and resilience with WomenStrong’s support, and (2) in what ways have partners changed over the past few years? Given the subtlety of what we were classifying as outcomes – changes within organizations that may not yet have manifested in external impact – we created an evaluation design that would yield rich qualitative data. While hoping for strong evidence of outcomes, we expected to see both emerging patterns and contradictions to those patterns – outlier bits that are often ignored in analysis but can say a lot.


Early on, and with WomenStrong’s endorsement, we decided to use AI as a tool that could, as AI promoters promise, do the grunt work: sifting through transcripts to surface patterns and helping us make sense of what it found.


Teaching the Machine to Think (Kind Of)


We used Claude, a generative AI chatbot built on a family of large language models (LLMs) developed by Anthropic. As in any other qualitative research endeavor, we started with our codebook – the initial set of themes and topics we were interested in exploring in the data – and fed it to Claude. We said, This is what ‘organizational strength’ looks like to us. This is what resilience sounds like in practice, and so on.
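
We worked in the chat interface, but this step can also be scripted. Here is a minimal sketch using Anthropic’s Python SDK, with invented codebook entries and an illustrative transcript excerpt standing in for our actual materials:

    # Minimal sketch: sending a codebook and a transcript excerpt to Claude
    # via Anthropic's Python SDK. The codebook entries and excerpt below are
    # invented illustrations, not our actual evaluation data.
    import anthropic

    CODEBOOK = """Codes and what they mean to us:
    - organizational_strength: stable staffing, funding, or governance
    - resilience: concrete examples of adapting to setbacks"""

    transcript_excerpt = (
        "When the grant fell through, we retrained staff and kept the clinic open."
    )

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=f"You are coding qualitative interview data.\n{CODEBOOK}\nTag each passage with matching codes and quote the exact words behind each tag.",
        messages=[{"role": "user", "content": transcript_excerpt}],
    )
    print(response.content[0].text)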


Operating within its computing parameters, Claude did “listen.” But there were immediate stumbling blocks, the first being getting Claude to understand nuances in speech. Its first few attempts were clumsy at best. It cherry-picked sentences without considering the full statement. Sometimes it misunderstood connections, for example, mistaking vague optimism for strategic insight: one partner would talk about their overall vision for the organization, and Claude would tag this as strong evidence of strategic planning. While that is a somewhat reasonable connection, the intended meaning didn’t cover strategy; Claude misunderstood.


So we refined our approach and wrote more instructive prompts. We sharpened the query parameters and kept asking Claude to explain itself. Prompting became an ongoing “dialogue,” and the real training happened in the prompts we kept iterating. Over time, Claude started delivering results that felt clearer and more reportable. But there were more complications in store for us.
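
To give a flavor of that iteration (these are illustrative reconstructions, not our exact prompts): an early instruction might read, Tag this transcript with our codes. A later one read more like, For each passage, name the matching code, quote the exact sentence that triggered it, and explain in one line why it fits that code’s definition rather than a neighboring one. The second version forced Claude to commit to evidence, which made its mistakes far easier to spot.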


The Helpfulness and Limitations of Training


Claude will give you answers, and sometimes those answers will look impressively confident. But ask it why it applied a theme to a quote, and the cracks show.


The short of it is that the AI can’t trace its logic. It can’t show which part of a sentence triggered the tag (the identified code or subcode). It can’t tell you whether it considered context or just keyword frequency. For qualitative researchers, who absolutely must weigh nuance, context, and sentiment to extract meaning from data, that’s a problem.


We tried to build workarounds, such as follow-up prompts like, Justify this label. Sometimes we’d get an understandable explanation; other times the logic was circular or vague. So we decided that, to be confident our research was of high quality, we would validate Claude’s analysis manually: we sampled, checked, and revised so we could be assured the analysis was strong and thorough, even with AI doing the first pass.
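
For readers curious what that sampling can look like in practice, here is a simplified sketch. It assumes Claude’s first-pass coding has been exported as a list of quote–code pairs, which is our framing for illustration rather than a built-in output format:

    # Sketch of the manual validation pass: sample AI-tagged quotes for
    # human review and track a simple agreement rate. The tagged examples
    # below are invented.
    import random

    tagged = [
        {"quote": "We finally hired a full-time finance director.",
         "code": "organizational_strength"},
        {"quote": "When the grant fell through, we pivoted to local donors.",
         "code": "resilience"},
        # ... the rest of the exported first-pass coding
    ]

    # Draw a random sample for human review.
    sample = random.sample(tagged, k=min(20, len(tagged)))

    agreed = 0
    for item in sample:
        print(f"\nQuote: {item['quote']}\nClaude's code: {item['code']}")
        if input("Agree with this code? [y/n] ").strip().lower() == "y":
            agreed += 1

    print(f"\nAgreement on sample: {agreed}/{len(sample)}")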


The experience was not only an exercise in training Claude; it was also a lesson in the helpfulness and limitations of using AI for rigorous qualitative analysis. Claude can show you patterns; it can summarize text, tag themes, and highlight keywords. Perhaps most importantly, Claude is fast. These are genuinely useful capabilities, especially when you’re working toward a tight deadline.


But Claude struggles with making meaning. When a partner uses humor to mask a hard truth, Claude can’t detect that. It doesn’t feel the weight behind a casually mentioned policy win that took ten years of organizing. Humans pick up on nuance and subtext – abilities qualitative researchers rely on heavily – in a way machines can’t. This work cannot be outsourced to Claude or any AI; the technology just isn’t advanced enough to be trusted with your interpretations. Besides, there are ethical and philosophical reasons why humans must remain fully in control of research analysis, particularly when it is qualitative and participatory or liberatory – methodological lenses that stress shared meaning-making between researcher and research participants.


So What’s the Future of AI in Research?


Whether AI models will become more sophisticated over time, and better able to interpret meaning and nuance in human speech, remains to be seen. In the meantime, research practices can evolve alongside AI. Prompts should be treated as part of the methodology – recorded, audited, and made available. Putting safeguards in place to guarantee data anonymity should be a priority. In our case, we made sure all data was anonymized: identifying words were omitted and every name was replaced with a code.
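
As an illustration of that name-coding step (the names and codes below are invented), a few lines of Python are enough to replace each known name with a stable code before any text reaches an AI tool:

    # Sketch of name coding: replace each known name with a stable code
    # before sharing text with an AI tool. The mapping here is invented.
    import re

    NAME_CODES = {"Amina": "P01", "Rosa": "P02"}  # real name -> code

    def anonymize(text: str) -> str:
        """Swap every known name for its code, matching whole words only."""
        for name, code in NAME_CODES.items():
            text = re.sub(rf"\b{re.escape(name)}\b", code, text)
        return text

    print(anonymize("Amina said the clinic Rosa runs doubled its reach."))
    # -> P01 said the clinic P02 runs doubled its reach.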


There should also be more intentionality in which AI tools researchers choose and how they use them. We chose Claude because it has developed a reputation as a safe and secure choice for qualitative research and is currently in use by institutions we know and trust.


And, as we continue to use tools like Claude, we’re also becoming more conscious of what they cost – ethically and environmentally. Every prompt draws energy and every analysis carries a carbon footprint. So we keep asking ourselves questions like, When is AI truly necessary? What does responsible use look like in our sector?


If you invite AI into your research, bring your sharpest questions and your most human instincts. Use it, but don’t trust it too much, and be prepared to check its work at every stage. Assume that AI can’t extract the true gems of meaning from your work – you already know who must.


Ena Taguiam is a Research and Communications Assistant at Ignited Word, and Darnesha Tabor is an Evaluator and Writer, also at Ignited Word.