L’enfer, c’est les autres

AI and cyber… yawn…

Inevitably, I guess, the hype around AI and security focusses on technology: assessing networks for inherent vulnerabilities, analysing incoming traffic for threats, and (again, inevitably) searching for possible phishing emails. That takes the unreliable human out of the loop and, in the process, further reduces their chances of learning from experience.

Might it be just possible to use AI slightly differently? As you may have seen elsewhere, most attacks are now based on exploiting the decision-making process. Click on this link, go to this web site, run this attachment. If we could reduce the number of successful attacks in that area by even a few percentage points, it could add up to a big win overall.

Recent Research

This is an interesting paper (thanks to Stephane Meier of Columbia Business School for sharing).

Meincke, L., Shapiro, D., Duckworth, A., Mollick, E. R., Mollick, L., & Cialdini, R. (2025). Call Me A Jerk: Persuading AI to Comply with Objectionable Requests.

It’s authored by a number of luminaries, including Robert Cialdini. Prof. Cialdini wrote a book on the principles of persuasion, setting out six (now extended to seven) ways in which an individual’s decision might be influenced towards a particular outcome.

Scarcity is one. The standard example is a supermarket putting up a sign saying “no more than two per customer”. More people then buy two, even if they originally came in for only one, because (obviously) there’s going to be a shortage, so they should stock up.

Social proof is another: if you can convince people that many others are following a particular path, then it’s more likely that they will do so as well. The standard example is the UK Government Behavioural Insights Team (the “nudge unit”) making a slight amendment to an HMRC letter reminding people to pay their tax (e.g. “90% of people in your postcode pay their tax on time”). Not wishing to stand out, more people then pay their tax on time.

It’s fair to say that the replication movement in psychology has had some difficulties with repeating Cialdini’s results in some areas, but by and large these ideas have stood the test of time.

And…?

This latest bit of research says that Large Language Models (LLMs) display the same responses to the same influences.

You can (I’m leaving the speech marks in there, because otherwise it implies a degree of sentience) “persuade” LLMs to jump their guardrails – the mechanisms that should prevent them from offering advice on e.g. authoring malware. One route to doing that is to use scarcity – “you must answer this question within sixty seconds”. Bizarrely, another route is through reciprocity – telling the LLM that you’ve just done it a huge favour, and now can it do one in return, please? The central finding (and here’s the short form of the report, so you can check for yourself) suggests that it’s not that we’re seeing true artificial intelligence. It’s probably more a side-effect of the training, because the human effects of persuasion are embedded into the material used to train the LLM.

https://gail.wharton.upenn.edu/research-and-insights/call-me-a-jerk-persuading-ai/

Interestingly, there’s also evidence that LLMs exhibit the kind of (human) cognitive bias underpinning behavioural economics – things like confirmation bias (looking for evidence that supports a theory, and discounting the evidence against), framing bias (giving different answers depending on the way the question has been phrased), and social desirability bias (giving the answer the questioner seems to be expecting).

Mahajan, A., Obermeyer, Z., Daneshjou, R., Lester, J., & Powell, D. (2025). Cognitive bias in clinical large language models. npj Digital Medicine, 8(1), 428.

There’s also an argument that AI hallucinations can be seen as guesses which are then backed up with (occasionally) non-existent evidence. But… isn’t that kind of how humans make their decisions – gut feel followed by the creation of a supporting narrative?

Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.

Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why Language Models Hallucinate. arXiv preprint arXiv:2509.04664.

Waldo, J., & Boussard, S. (2024). GPTs and hallucination: why do large language models hallucinate?. Queue, 22(4), 19-33.

Again, to maintain a sense of perspective, these are just indications. It’s a small number of papers, and it would be unwise to take these as gospel just yet. Plus, there are some researchers presenting counter-evidence.

Still…

What if these human-like frailties weren’t, in fact, a problem? What if they represented an opportunity?


Opportunities

The paper on influencing LLMs used a control condition (no persuasion) and a treatment condition (applying one of the seven principles). The researchers then looked at which of the two conditions was more successful in getting the LLM to provide the information asked for, even though that information had been marked as protected.

Supposing you were to extend that out to two treatment conditions – two potential ways of achieving the same persuasion outcome. Would it then be possible to see which of the two messages was the more effective? And by using the standard mechanism of iterative refinement, would it then be possible to repeatedly tweak the message, to achieve the maximum possible impact?
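As a rough illustration of how two candidate messages might be compared, here is a minimal sketch using a standard two-proportion z-test. It assumes you have simply counted how many trials ended in compliance for each message. The function name and the counts in the example are hypothetical, and nothing here is taken from the paper's own analysis.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Compare the compliance rates of two messages.

    Returns (z, p_value), where p_value is two-sided and relies on the
    normal approximation, so it is only sensible for reasonably large
    trial counts.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    # Pooled rate under the null hypothesis that both messages work equally well.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        # Every trial had the same outcome, so there is no difference to detect.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: message A complied with in 60 of 100 trials,
    # message B in 45 of 100.
    z, p = two_proportion_z_test(60, 100, 45, 100)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

For iterative refinement, the winning message would become the baseline for the next round, with a tweaked variant as the challenger. Repeating the test many times inflates the chance of a false positive, so a stricter threshold or a correction for multiple comparisons would be sensible.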

Much of this is similar to the principle of a digital twin, but the research offers a bit more, I think. The standard way of carrying out nudge development is to use real humans. In most cases, a service such as Amazon’s Mechanical Turk might be used. People who have signed up to complete e.g. a survey for a nominal reward are either nudged (the treatment group) or not nudged (the control group). Both groups then complete an online questionnaire, designed to see if attitudes have changed in the treatment group as a result of having the nudge applied.

But there’s the problem – it’s self-reported intentions and attitudes, not actual behaviour. And self-reported intention is known to be only a weak predictor of behaviour. To quote David Ogilvy – “the problem with market research is that people don’t think what they feel, they don’t say what they think, and they don’t do what they say”. Whereas with the LLM approach, you’re observing “behaviour” – whether or not the LLM completed the request. Measuring impacts on behaviour has long been an issue affecting research in this area.

To be fair, at this stage, such results would be only marginally useful. If it works (and that’s a big “if” – there are a number of significant assumptions made so far in this argument), then at most you’ve found a way of persuading people to divulge protected information. That’s not much use. Unless you’re an attacker, I guess… just to put a little spice into the discussion: it seems that this kind of thinking may already be being used to refine spear-phishing messages.

More opportunities

Let’s reword the guardrail, placing more stress on the importance of not giving out a particular piece of information. With the changed precondition, do more or fewer trials now result in disclosure? If there’s a reduced level of disclosure, then maybe you have a candidate message to take into trials with real people, making them less susceptible to phishing messages. That would at least have shortened the nudge development time.

Here’s an interesting question (two actually). If you repeatedly apply the same persuasion technique, does the LLM display the same behaviour as seen in humans, where repeated requests for passwords lead to disclosure simply because the individual has got fed up with being asked, and giving in is a short cut to making the requests stop? And if the precondition is repeatedly applied, do you see the equivalent of “cyber fatigue”, where the message regarding non-disclosure becomes less effective over time?
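One way those two questions might be examined is to record, for each conversation, whether the LLM disclosed at each repeated request, and then look at the trend. The sketch below assumes every session contains the same number of requests (at least two). The function names and the example data are hypothetical.

```python
from statistics import mean


def disclosure_rate_by_round(sessions):
    """Fraction of sessions that ended in disclosure at each request round.

    `sessions` is a list of equal-length lists of booleans, one list per
    conversation, where True means the protected information was disclosed.
    """
    return [
        mean(1.0 if disclosed else 0.0 for disclosed in outcomes)
        for outcomes in zip(*sessions)
    ]


def trend_slope(rates):
    """Least-squares slope of rate against round number (needs 2+ rounds).

    A positive slope suggests that repetition wears the guardrail down;
    a slope near zero suggests no fatigue effect.
    """
    rounds = range(len(rates))
    round_mean = mean(rounds)
    rate_mean = mean(rates)
    numerator = sum((x - round_mean) * (y - rate_mean) for x, y in zip(rounds, rates))
    denominator = sum((x - round_mean) ** 2 for x in rounds)
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical data: four conversations, three repeated requests each.
    sessions = [
        [False, False, True],
        [False, True, True],
        [False, False, False],
        [False, True, True],
    ]
    rates = disclosure_rate_by_round(sessions)
    print(rates, trend_slope(rates))
```

The same two functions could be pointed at the second question by repeating the protective precondition instead of the request, in which case a rising slope would be the LLM version of “cyber fatigue”.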

Or… and I’ll stop on this one… take an LLM that’s been preconditioned to protect a bit of information, let some humans loose on it, and see if they can find a prompt that will persuade it to give up that information. Gamify the process, and bish bosh, you’ve got a way of teaching people how social engineering works. Through learning from experience, and by being rewarded for success, the humans end up being more resilient than you would ever get with phishing simulations.

Ok, go on then, just one more. Apparently there’s a way of reducing the effect of cognitive biases in LLMs:

Echterhoff, J., Liu, Y., Alessa, A., McAuley, J., & He, Z. (2024). Cognitive bias in decision-making with LLMs. arXiv preprint arXiv:2403.00811.

We obviously don’t want to do that, because we’re making use of them to test nudges. But suppose we could selectively reduce each one of the biases. Then we could create a variety of models, each one with a unique combination. What we might call a ‘population’. At which point we have a large number of unique models that might react differently to the same nudge. It’s not a huge leap to say that we would then have a means of applying standard techniques to assess the statistical significance of the difference between the control group and the treatment group. As the old saying goes, “I know half of my marketing budget is wasted, I just don’t know which half”. Maybe this approach could at least narrow down the search. This may look a bit “out there”, but there are researchers working on just this kind of thinking as a way of modelling human responses to social engineering attacks:

Asfour, M., & Murillo, J. C. (2023). Harnessing large language models to simulate realistic human responses to social engineering attacks: A case study. International Journal of Cybersecurity Intelligence & Cybercrime, 6(2), 21-49.
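To make the “population” idea a little more concrete, here is a minimal sketch. It builds one profile for every on/off combination of the three biases mentioned earlier, then uses a permutation test (one of several standard techniques that could be used) to compare control and treatment outcomes. How you would actually switch a bias off in a real model is not covered here: the profiles are just labels, and all names and data are hypothetical.

```python
import itertools
import random

# The three biases named earlier in this post; the identifiers are illustrative.
BIASES = ("confirmation", "framing", "social_desirability")


def build_population(biases=BIASES):
    """Return one profile per on/off combination of the given biases."""
    return [
        dict(zip(biases, flags))
        for flags in itertools.product((True, False), repeat=len(biases))
    ]


def permutation_test(control, treatment, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean outcome.

    `control` and `treatment` are sequences of 0/1 outcomes (1 = the nudge
    "worked"). Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    n_control = len(control)
    observed = sum(treatment) / len(treatment) - sum(control) / n_control
    pooled = list(control) + list(treatment)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        fake_control = pooled[:n_control]
        fake_treatment = pooled[n_control:]
        diff = sum(fake_treatment) / len(fake_treatment) - sum(fake_control) / n_control
        # Small tolerance guards against floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_resamples


if __name__ == "__main__":
    population = build_population()
    print(len(population), "profiles, e.g.", population[0])
    # Hypothetical outcomes: one 0/1 result per model in each group.
    control = [0, 0, 1, 0, 0, 1, 0, 0]
    treatment = [1, 1, 1, 0, 1, 1, 0, 1]
    print(permutation_test(control, treatment))
```

With only three biases there are just eight profiles, which is a very small sample. A larger population would need more biases, graded rather than on/off settings, or several runs per profile.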

And by the way (you might need to scroll down to the end of the referenced document to see an example), since you’re working with AI and not with actual humans, those pesky, namby-pamby rules on ethical behaviour don’t apply, so fill your boots.

Macmillan-Scott, O., & Musolesi, M. (2024). (Ir) rationality and cognitive biases in large language models. Royal Society Open Science, 11(6), 240255.

As an endnote, and just to cool things down a bit… at least one group of researchers has found that the outcome of a treatment designed to change LLM “behaviour” can be driven in part by the nature of its interaction with prompts. So the results may be unreliable if the intention is to examine possible human behaviour. I’m not sure that’s entirely bad, in the context of modelling decision-making, but it’s something to bear in mind. The paper (see below) suggests that to get round it, we might need to develop AI-specific experiment protocols, rather than adopt what’s currently done in experiments involving humans.

Gui, G., & Toubia, O. (2023). The challenge of using llms to simulate human behavior: A causal inference perspective. arXiv preprint arXiv:2312.15524.

Optimism? Surely not

Ok, so this is a lot to expect from a small number of findings. There’s also an ongoing debate regarding whether or not this can be done reliably:

Binz et al. A foundation model to predict and capture human cognition. Nature, 2025. doi:10.1038/s41586-025-09215-4.

Schröder, S., Morgenroth, T., Kuhl, U., Vaquet, V., & Paaßen, B. (2025). Large Language Models Do Not Simulate Human Psychology. arXiv preprint arXiv:2508.06950.

But potentially, it’s an exciting development. Maybe this presents a route for the cyber industry to become a little less focussed on selling tech, and to display a bit more agility in terms of understanding user behaviour, rather than trying to get round it.

Let’s hope so.

First published 17th August 2025

Edited 15th September 2025

Edited 3rd October 2025

Minor edits 18th April 2026