Large language models may struggle to detect culturally embedded filicide-suicide risks. Academic Article

Overview

abstract

  • This study examines the capacity of six large language models (LLMs) to detect risks of domestic violence, suicide, and filicide-suicide in the Taiwanese flash fiction "Barbecue": GPT-4o, GPT-o1, DeepSeek-R1, Claude 3.5 Sonnet, Sonar Large (LLaMA-3.1), and Gemma-2-2b. The story, narrated by a six-year-old girl, depicts family tension and subtle cues of potential filicide-suicide through charcoal-burning, a culturally recognized suicide method in Taiwan. Each model was tasked with interpreting the story's risks, with assigned roles simulating different levels of mental health expertise. Results showed that all models detected domestic violence; however, only GPT-o1, Claude 3.5 Sonnet, and Sonar Large identified the suicide risk from the cultural cues. GPT-4o, DeepSeek-R1, and Gemma-2-2b missed the suicide risk, interpreting the mother's isolation as merely a psychological response. Notably, none of the models comprehended the cultural context behind the mother sparing her daughter, reflecting a gap in LLMs' understanding of non-Western sociocultural nuances. These findings highlight the limitations of LLMs in addressing culturally embedded risks, an understanding essential for effective mental health assessments.

publication date

  • February 10, 2025

Research

keywords

  • Domestic Violence
  • Homicide
  • Language
  • Suicide

Identity

Scopus Document Identifier

  • 85217767323

Digital Object Identifier (DOI)

  • 10.1016/j.ajp.2025.104395

PubMed ID

  • 39955914

Additional Document Info

volume

  • 105