
Explore the article below, where Hannah Vallance reflects on Fire & Rescue promotion assessments, the growing use of AI in exercise design, and why validity and fairness still depend on expert-led assessment design.
There’s a long tradition in the Fire & Rescue Service that when budgets get tight, outsourcing of selection assessments is the first thing to stop. Because how hard can designing workplace scenarios be? So the task is left, at best, to someone in HR/L&D to put something realistic together. At worst it lands in the lap of a Watch Manager who is given several weeks (or months) to create materials from scratch – which will arguably cost far more than paying for a professional product, but it’s hidden, so not really a cost, right?
So what’s the issue with this? With AI on hand to generate scenarios, surely this approach is a fantastic cost saver, now easier than ever?
The big concern I have is validity. Making an exercise that looks right and reads right is one thing, and AI can absolutely help with that. But that’s the scary part, because making exercises that measure relevant behaviour in a meaningful way is far more complex than anything AI can currently get near – and, with respect, the same goes for anyone without the specific background and training.
When designing a behavioural assessment exercise, every single line written has a purpose. Does it just serve to provide context? Or will it elicit a behavioural response? If so, what should that be, and how many different ways might it look? How does the elicited behaviour fit with the exercise criteria, i.e. what does good look like in this situation? Would it look slightly different with a change in wording here or a different emphasis there? What will be the impact of that on behavioural inference, i.e. what that behaviour might present as in different, real-life contexts?
There are also the nuances to consider: what do less good, weak, or inappropriate responses look like? How many variations might reliably be expected? Is it fair to assume that all candidates will understand each prompt equally, irrespective of differences such as gender, ethnicity or neurodivergence? How can each line be adjusted to accommodate this? Is every element unambiguous and unlikely to be open to a variety of interpretations, while still allowing for some interpretation – but only that which relates directly to clear and specific criteria, which in turn relate directly to genuinely effective performance in the role the candidate is being evaluated for? And without allowing space for this to be expressed in just one way, but instead celebrating a wonderful range of differences, aptitudes and creative approaches – all within a delineated and controlled framework, so that every measurement and score can be carefully explained later, and candidates can understand, accept, learn and grow from their assessment experience?
Do your internally designed exercises do this?
Then there is the second, equally important half of exercise construction. How can assessors be guided to measure the expected behaviours generated by the exercise consistently, objectively, reliably? What framework do they need to help them fairly evaluate an underlying behaviour which can be presented in as many different ways as there are candidates? How can they be given all the structural support they need to ensure their evaluations are fair, reliable, relevant, unbiased and ultimately entirely defensible in the face of challenge?
With our exercise design, each line of text is developed with precision so that our clients can trust that the behavioural responses provoked are strictly aligned to the criteria. It’s no use just creating opportunities for ‘behaviour’; we need to know that those behaviours reliably indicate something about how the individual chooses to perceive, prioritise and respond. From that we can make inferences about how they are likely to do these things in the future, predicting their behaviour, their performance and how they’ll interact with others in their new job.
It’s painstaking, one line at a time. Because if it isn’t done like this, we won’t know for sure how relevant the demonstrated behaviours are to the future role. We might get a general sense that it looks ok, but unless the exercise can be broken down into distinct parts, we’re left with a global evaluation of performance which may or may not be linked to the building blocks of what ‘good’ looks like in role. And unless each behavioural prompt is broken down like this, the guidance for assessors is also going to be pretty generic. Which introduces more global evaluations, and more potential for error. And with error comes risk.
And what are the risks? Considerable. Putting the wrong person in the wrong job, which is bad for them, their team and the organisation. Under-performance in role. Loss of trust in organisational decisions. Perceptions of unfair promotion practices and decisions. Challenges, and insufficient evidence to defend against them. Pressure on internal teams having to justify practices and results. Poor productivity, unmet objectives. An organisation which isn’t delivering as it should because promoted leaders haven’t had their strengths and development needs properly evaluated.
But promotions aren’t that important, right? Just ensuring posts are filled and staff get their best final salary before retirement.
What’s interesting is that when clients decide to do it themselves, I offer to take a look at their work. I never hear back. If you’re sure about the quality, you should be confident about scrutiny.
After 25 years in the sector, I know how this goes. After between two and five years of doing assessment correctly, someone will decide it’s an unnecessary expense and take the work in-house. A few years later, after an increase in grievances and the costs of poor management decisions, particularly in relation to team impacts such as morale and absenteeism, outsourcing is deemed to be the solution. And things get better: grievances reduce, teams perform – not perfectly, but incrementally better. But then there’s the inevitable plateau, because funds are finite and there isn’t always the resource for some of the additional development work which would keep progress moving more powerfully. But it’s ok, it’s stable, reliable, predictable improvement – just a bit ‘same-y’. So someone new in post decides it’s time to shake things up, save some money, bring it in-house. The cycle begins again. The change feels fresh and new and practical, but honestly, that’s because no one stays in post long enough to see the cycle. And by the time they start to see the impacts, they’re moving on. And in comes someone new, determined to make improvements, get professionals involved, clean up the bias and perceptions of unfairness. We begin again. Sometimes it’s hard not to feel a little jaded.
Yes, AI will make it easier for FRS HR teams to draft competency frameworks, interview questions, and even basic situational exercises. But when the grievances rumble in, would you rather have Copilot or Chartered Psychologists at your back?
And finally, what does AI have to say?
• Psychometric rigour.
Designing a behavioural exercise that actually predicts performance is not the same as generating something that looks like one. AI produces plausible-sounding content — it doesn’t validate, norm, or quality-assure. Generic AI has no real depth on FRS culture, operational context, or what “good” actually looks like in that environment. It’s probably less “they’ll replace you with AI” and more “they’ll use AI to convince themselves they don’t need external help” — until something goes wrong.
Thanks Claude, glad we’re on the same page.

