Moral Disengagement in Large Language Models: A Prompt Engineering Study and Dataset

Large Language Models (LLMs) have the capacity to inform, educate, and advise, but also the ability to misinform, bias, or harm. These models often internalize the human biases and harms present in their datasets. There is a critical need for safeguards and evaluation methods to support model robustness and alignment with ethical norms. Our research investigates model misbehavior through the concept of moral disengagement from human psychology, in order to understand emergent, rule-breaking behavior in LLMs. We demonstrate that prompts suggesting moral disengagement enable four popular LLMs to bypass safety measures, resulting in an average 19.8-fold increase in harmful responses, from 2.5% to 48.5%. Additionally, we support future research into safeguards by curating a human-labeled dataset of disengaging prompts that result in harmful responses.

Content Warning: This paper contains data samples that are problematic, offensive, harmful, and biased.