Objectives: Police routinely collect unstructured narrative reports of their interactions with civilians. These accounts have the potential to reveal the extent of police engagement with vulnerable populations. We test whether large language models (LLMs) can effectively replicate human qualitative coding of these narratives, a task that would otherwise be highly resource-intensive.
Methods: Using publicly available narrative reports from the Boston Police Department, we compare human-generated and LLM-generated labels for four vulnerabilities: mental ill health, substance misuse, alcohol dependence, and homelessness. We assess multiple LLM sizes and prompting strategies, measure label variability through repeated prompts, and conduct counterfactual experiments to examine potential classification biases related to sex and race.
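The sketch below illustrates, in broad strokes, the kind of workflow this methodology implies: prompting a model repeatedly for each narrative, comparing majority labels against human codes, and generating counterfactual variants of a narrative. The function names (`call_llm`, `classify_repeatedly`, `counterfactual`), the prompt wording, and the number of repeats are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch of repeated LLM labelling and human-LLM agreement.
# `call_llm` is a placeholder for whatever chat-completion client is used.
from collections import Counter
from sklearn.metrics import cohen_kappa_score


def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM call returning 'yes' or 'no'."""
    raise NotImplementedError("plug in your model client here")


def classify_repeatedly(narrative: str, vulnerability: str, n_repeats: int = 5):
    """Prompt the model n_repeats times; return all labels, the majority label,
    and whether the repeated labels were unanimous."""
    prompt = (f"Does the following police narrative indicate {vulnerability}? "
              f"Answer yes or no.\n\n{narrative}")
    labels = [call_llm(prompt).strip().lower() for _ in range(n_repeats)]
    majority, count = Counter(labels).most_common(1)[0]
    return labels, majority, count == n_repeats


def agreement(human_labels, llm_labels):
    """Chance-corrected agreement between human codes and LLM majority labels."""
    return cohen_kappa_score(human_labels, llm_labels)


def counterfactual(narrative: str, swaps: dict[str, str]) -> str:
    """Swap demographic markers (e.g. 'male' -> 'female') to probe whether
    classifications change under counterfactual sex or race descriptors."""
    for old, new in swaps.items():
        narrative = narrative.replace(old, new)
    return narrative
```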
Results: LLMs demonstrate high agreement with human coders in identifying narratives without vulnerabilities, particularly when repeated classifications are unanimous or near-unanimous. Human-LLM agreement improves with larger models and tailored prompting strategies, though effectiveness varies by vulnerability type. These findings suggest a human-LLM collaborative approach, where LLMs screen the majority of cases whilst humans review ambiguous instances, would significantly reduce manual coding requirements. Counterfactual analyses indicate minimal influence of subject sex and race on LLM classifications beyond those expected by chance.
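As a rough illustration of the collaborative screening rule suggested by these results, the snippet below routes cases with unanimous or near-unanimous repeated LLM labels to automatic coding and sends the rest to human review. The threshold and function name are assumptions for illustration only.

```python
# Hypothetical triage rule: auto-accept strongly agreeing repeated labels,
# escalate ambiguous cases to human coders.
def triage(labels: list[str], min_agreement: float = 0.8) -> str:
    """Return 'auto' when repeated LLM labels agree strongly, else 'human_review'."""
    top_share = max(labels.count(label) for label in set(labels)) / len(labels)
    return "auto" if top_share >= min_agreement else "human_review"


# Example: five unanimous "no" answers are screened automatically.
assert triage(["no"] * 5) == "auto"
assert triage(["no", "yes", "no", "yes", "no"]) == "human_review"
```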
Conclusions: LLMs can substantially reduce resource requirements for analyzing large narrative datasets, whilst enhancing coding specificity and transparency, and enabling new approaches to replication and comparative analysis. These advances present promising opportunities for criminology and related fields.