Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation

被引:0
|
作者
Cantini, Riccardo [1 ]
Cosenza, Giada [1 ]
Orsino, Alessio [1 ]
Talia, Domenico [1 ]
机构
[1] Univ Calabria, Arcavacata Di Rende, CS, Italy
来源
DISCOVERY SCIENCE, DS 2024, PT I | 2025年 / 15243卷
关键词
Large Language Models; Bias; Stereotype; Jailbreak; Adversarial Robustness; Sustainable Artificial Intelligence;
D O I
10.1007/978-3-031-78977-9_4
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. However, these models are inherently prone to various biases stemming from their training data. These include selection, linguistic, and confirmation biases, along with common stereotypes related to gender, ethnicity, sexual orientation, religion, socioeconomic status, disability, and age. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability. We also investigate how known prompt engineering techniques can be exploited to effectively reveal hidden biases of LLMs, testing their adversarial robustness against jailbreak prompts specially crafted for bias elicitation. Extensive experiments are conducted using the most widespread LLMs at different scales, confirming that LLMs can still be manipulated to produce biased or inappropriate responses, despite their advanced capabilities and sophisticated alignment processes. Our findings underscore the importance of enhancing mitigation techniques to address these safety issues, toward a more sustainable and inclusive artificial intelligence.
引用
收藏
页码:52 / 68
页数:17
相关论文
共 50 条
  • [41] Using Large Language Models to Investigate and Categorize Bias in Clinical Documentation
    Apakama, D.
    Klang, E.
    Richardson, L.
    Nadkarni, G.
    ANNALS OF EMERGENCY MEDICINE, 2024, 84 (04) : S96 - S97
  • [42] Likelihood-based Mitigation of Evaluation Bias in Large Language Models
    Ohi, Masanari
    Kaneko, Masahiro
    Koike, Ryuto
    Loem, Mengsay
    Okazaki, Naoaki
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 3237 - 3245
  • [43] Understanding the Effect of Model Compression on Social Bias in Large Language Models
    Goncalves, Gustavo
    Strubell, Emma
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2663 - 2675
  • [44] Leveraging the Inductive Bias of Large Language Models for Abstract Textual Reasoning
    Rytting, Christopher Michael
    Wingate, David
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [45] Predicting startup success using two bias-free machine learning: resolving data imbalance using generative adversarial networks
    Park, Jungryeol
    Choi, Saesol
    Feng, Yituo
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [46] Human bias in AI models? Anchoring effects and mitigation strategies in large language models
    Nguyen, Jeremy K.
    JOURNAL OF BEHAVIORAL AND EXPERIMENTAL FINANCE, 2024, 43
  • [47] Pilot study on large language models for risk-of-bias assessments in systematic reviews: A(I) new type of bias?
    Barsby, Joseph
    Hume, Samuel
    Lemmey, Hamish A. L.
    Cutteridge, Joseph
    Lee, Regent
    Bera, Katarzyna D.
    BMJ EVIDENCE-BASED MEDICINE, 2024,
  • [48] A bias-free least-squares parameter estimator for continuous-time state-space models
    Garnier, H
    Sibille, P
    Bastogne, T
    PROCEEDINGS OF THE 36TH IEEE CONFERENCE ON DECISION AND CONTROL, VOLS 1-5, 1997, : 1860 - 1865
  • [49] Bias Unveiled: Enhancing Fairness in German Word Embeddings with Large Language Models
    Saeid, Yasser
    Kopinski, Thomas
    SPEECH AND COMPUTER, SPECOM 2024, PT II, 2025, 15300 : 308 - 325
  • [50] Communicating the cultural other: trust and bias in generative AI and large language models
    Jenks, Christopher J.
    APPLIED LINGUISTICS REVIEW, 2025, 16 (02) : 787 - 795