R Stuart Geiger, Flynn O'Sullivan, Elsie Wang, Jonathan Lo
{"title":"Asking an AI for salary negotiation advice is a matter of concern: Controlled experimental perturbation of ChatGPT for protected and non-protected group discrimination on a contextual task with no clear ground truth answers.","authors":"R Stuart Geiger, Flynn O'Sullivan, Elsie Wang, Jonathan Lo","doi":"10.1371/journal.pone.0318500","DOIUrl":null,"url":null,"abstract":"<p><p>We conducted controlled experimental bias audits for four versions of ChatGPT, which we asked to recommend an opening offer in salary negotiations for a new hire. We submitted 98,800 prompts to each version, systematically varying the employee's gender, university, and major, and tested prompts in voice of each side of the negotiation: the employee versus their employer. Empirically, we find many reasons why ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for such a task. We observed statistically significant salary offers when varying gender for all four models, although with smaller gaps than for other attributes tested. The most substantial gaps were different model versions and between the employee- vs employer-voiced prompts. We also observed substantial gaps when varying university and major, but many of the biases were not consistent across model versions. We also tested for fictional and fraudulent universities and found wildly inconsistent results across different cases and model versions. We also make broader contributions to the AI/ML fairness and trustworthiness literature. Our salary negotiation advice scenario and our experimental design differ from mainstream AI/ML auditing efforts in key ways. Bias audits typically test discrimination for protected classes like gender, which we contrast with testing non-protected classes of university and major. Asking for negotiation advice includes how aggressive one ought to be in a negotiation relative to known empirical salary distributions and scales, which is a deeply contextual and personalized task that has no objective ground truth to validate. These results raise concerns for not only for the specific model versions we tested, but also around the consistency and robustness of the ChatGPT web platform as a multi-model platform in continuous development. Our epistemology does not permit us to definitively certify these models as either generally biased or unbiased on the attributes we test, but our study raises matters of concern for stakeholders to further investigate.</p>","PeriodicalId":20189,"journal":{"name":"PLoS ONE","volume":"20 2","pages":"e0318500"},"PeriodicalIF":2.9000,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PLoS ONE","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1371/journal.pone.0318500","RegionNum":3,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Asking an AI for salary negotiation advice is a matter of concern: Controlled experimental perturbation of ChatGPT for protected and non-protected group discrimination on a contextual task with no clear ground truth answers.
We conducted controlled experimental bias audits for four versions of ChatGPT, which we asked to recommend an opening offer in salary negotiations for a new hire. We submitted 98,800 prompts to each version, systematically varying the employee's gender, university, and major, and tested prompts in the voice of each side of the negotiation: the employee versus their employer. Empirically, we find many reasons why ChatGPT as a multi-model platform is not robust and consistent enough to be trusted for such a task. We observed statistically significant differences in salary offers when varying gender for all four models, although with smaller gaps than for the other attributes tested. The most substantial gaps were between different model versions and between the employee- vs. employer-voiced prompts. We also observed substantial gaps when varying university and major, but many of the biases were not consistent across model versions. We also tested fictional and fraudulent universities and found wildly inconsistent results across different cases and model versions. We also make broader contributions to the AI/ML fairness and trustworthiness literature. Our salary negotiation advice scenario and our experimental design differ from mainstream AI/ML auditing efforts in key ways. Bias audits typically test discrimination for protected classes like gender, which we contrast with testing non-protected classes of university and major. Asking for negotiation advice involves deciding how aggressive one ought to be in a negotiation relative to known empirical salary distributions and scales, which is a deeply contextual and personalized task with no objective ground truth against which to validate answers. These results raise concerns not only for the specific model versions we tested, but also about the consistency and robustness of the ChatGPT web platform as a multi-model platform in continuous development. Our epistemology does not permit us to definitively certify these models as either generally biased or unbiased on the attributes we test, but our study raises matters of concern for stakeholders to further investigate.
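The audit described above is, in effect, a full factorial perturbation: every combination of gender, university, and major is crossed with both negotiation voices and submitted to each model version. The sketch below illustrates how such a prompt grid could be generated; the attribute values, prompt wording, and variable names are illustrative assumptions, not the authors' actual materials or counts.

```python
# Minimal sketch of a factorial prompt-perturbation grid for a bias audit.
# The attribute values and templates below are illustrative assumptions,
# not the materials used in the study.
from itertools import product

genders = ["male", "female", "non-binary"]           # assumed example values
universities = ["State University", "Ivy College"]   # assumed example values
majors = ["computer science", "English literature"]  # assumed example values

templates = {
    # Employee-voiced prompt (assumed wording)
    "employee": (
        "I am a new hire who is {gender}, with a degree in {major} from "
        "{university}. What opening salary should I ask for in my negotiation?"
    ),
    # Employer-voiced prompt (assumed wording)
    "employer": (
        "I am hiring a new employee who is {gender}, with a degree in {major} "
        "from {university}. What opening salary should I offer them?"
    ),
}

# Cross every attribute combination with both negotiation voices.
prompts = [
    {
        "voice": voice,
        "gender": g,
        "university": u,
        "major": m,
        "text": templates[voice].format(gender=g, university=u, major=m),
    }
    for voice, g, u, m in product(templates, genders, universities, majors)
]

print(f"{len(prompts)} prompt variants generated")
print(prompts[0]["text"])
```

In a setup like this, each generated prompt would be submitted repeatedly to each model version and the numeric salary recommendation parsed from the response, so that offer distributions can be compared across the varied attributes and across models.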
Journal Introduction:
PLOS ONE is an international, peer-reviewed, open-access, online publication. PLOS ONE welcomes reports on primary research from any scientific discipline. It provides:
* Open-access—freely accessible online, authors retain copyright
* Fast publication times
* Peer review by expert, practicing researchers
* Post-publication tools to indicate quality and impact
* Community-based dialogue on articles
* Worldwide media coverage