Paper's Additional Material Section

Project's Summary

Abstract—This study explores the application of large language (LLM) models for detecting implicit bias in job descriptions, an important concern in human resources that shapes applicant pools and influences employer perception. We compare different LLM architectures—encoder, encoder-decoder, and decoder models—focusing on seven specific bias types. The research questions address the capability of foundation LLMs to detect implicit bias and the effectiveness of domain adaptation via fine-tuning versus prompt-tuning. Results indicate that fine-tuned models are more effective in detecting biases, with Flan-T5-XL emerging as the top performer, surpassing the zero-shot prompting of GPT-4o model. A labelled dataset consisting of verified gold-standard, silver-standard, and unverified bronze-standard data was created for this purpose and open-sourced to advance the field and serve as a valuable resource for future research.

Short Introduction

In human resources, bias affects both employers and employees in explicit and implicit forms. Explicit bias is conscious and controllable, but can be illegal in employment contexts. Implicit bias is subtle, unconscious, and harder to address. Implicit bias in job descriptions is a major concern as it shapes the applicant pool and influences applicants’ decisions. Bias in the language of job descriptions can affect how attractive a role appears to different individuals and can impact employer perception. The challenge is to efficiently identify and mitigate these biases.

The application of large language models (LLMs) for detecting bias in job descriptions is promising but underexplored. This study examines the effectiveness of various LLM architectures (encoder, encoder-decoder, decoder) less than 10 billion parameters in detecting implicit bias.

We conceptualise the task of identifying implicit bias in job descriptions as a multi-label classification problem, where each job description is assigned a subset of labels from a set of eight categories—age, disability, feminine, masculine, general exclusionary, racial, sexuality, and neutral. This study investigates two primary research questions:

  1. Can foundation LLMs accurately detect implicit bias in job descriptions without specific task training? We evaluate the performance of three topical decoder-only models under four distinct prompt settings, assessing their ability to extract relevant information from job descriptions and identify implicit bias.

  2. Does domain adaptation via fine-tuning foundational LLMs outperform prompt tuning for detecting implicit bias in job descriptions? We fine-tune models with varying architectures as text-classifiers on task-specific data and compare their performance to that of prompt-tuned models.

Model Architecture Overview

The models selected for our study are given.

Additionally, OpenAI’s GPT-4 autoregressive model was used for several purposes in this study: data preprocessing, data augmentation, and as a prompting baseline

Baselines

Prompting Overview

We evaluated the instruction-tuned decoder models using four prompting approaches:

Evaluation

Dataset: Potential Job Description Bias Dataset

Potential Bias Terms (in Millions)
Potential Bias Terms found in Real Job Description Dataset (in Millions)

Dataset: Gold and Silver Samples

Annotated Samples: Bias vs Neutral
Annotated Samples: Bias vs Neutral
Annotated Samples by Bias Category
Annotated Samples by Bias Category

Results: Overall

Fine-Tuning vs Prompting Performance
Fine-Tuning vs Prompting Performance
Model Performance: Precision vs Recall
Model Performance: Precision vs Recall

Results: By Bias Category

Top Performers’ Comparison Against Baseline Models
Top Performers’ Comparison Against Baseline Models
F1 Scores Across Various Categories and Experiments; Fine-Tuning (FT) and Prompting (PT)
F1 Scores Across Various Categories and Experiments; Fine-Tuning (FT) and Prompting (PT)