Large language models automate event segmentation and recall scoring with human-level accuracy
Key points: LLMs identify event boundaries more consistently than humans, while semantic embeddings enable scalable memory assessments.
Publication:
1
Background
The world around us is highly complex and ever-changing. New sights, sounds, and information unfold moment by moment, but our brains make sense of them by breaking them into meaningful pieces. This mental process, known as event segmentation, helps us comprehend our current experiences and supports how we remember them later. In both research and clinical settings, evaluating perception and memory often relies on time-consuming, manual methods. The current research explores how artificial intelligence – particularly large language models (LLMs) – can offer an efficient, scalable way to measure how we perceive and remember everyday events.
2
The Research
We asked twenty adults to read short narratives and mark where they felt one meaningful event ended and another began. We prompted AI models, including GPT-4 and LLaMA 3.0, to perform the same task. After reading, participants recalled the stories aloud. The recalled narratives were transcribed and compared to the original narratives using text embeddings – a technique that allows AI to measure how similar two texts are in meaning. This dual approach allowed us to examine how events are segmented during comprehension and how those segmented events are later structured in memory.
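To make the recall-scoring step concrete, here is a minimal sketch of embedding-based similarity scoring. The specific embedding model and similarity measure used in the study are not detailed here; the open-source all-MiniLM-L6-v2 model and cosine similarity below are illustrative assumptions, not the study's exact setup.

    # Sketch of embedding-based recall scoring (illustrative assumptions:
    # the embedding model and metric are not the study's confirmed choices).
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def recall_score(original: str, recalled: str) -> float:
        """Cosine similarity between the original narrative and the recall."""
        a, b = model.encode([original, recalled])  # one vector per text
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Higher scores indicate recalled text closer in meaning to the original.
    print(recall_score("The chef burned the soup and started over.",
                       "A cook ruined the soup, so he made it again."))

In this setup, a recall that preserves the gist of the original narrative scores high even when the wording differs, which is what lets the approach capture meaning rather than verbatim overlap.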
3
The Findings
Both AI models, especially GPT-4, closely matched human segmentation patterns – often with greater consistency than humans showed among themselves. The automated recall scores correlated with human ratings, capturing both what was remembered and how it was structured. These findings demonstrate that LLMs can reliably replicate human-like event segmentation and provide objective measures of memory performance that align with human-based assessments. This research has broad implications across the memory sciences – from research in cognitive aging to clinical assessment. It provides a foundation for scalable, objective tools that can measure comprehension and memory performance using materials – such as written or spoken narratives – encountered in everyday life. This enables faster, more accessible assessments that support cognitive health screening in older adults or individuals with memory difficulties.
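As an illustration of how segmentation consistency can be quantified, the sketch below computes a simple boundary-agreement score between annotators. The tolerance window, the Dice-style metric, and the toy boundary data are all assumptions for illustration, not the study's actual analysis or results.

    # Hedged sketch: symmetric (Dice-style) agreement between two sets of
    # event-boundary positions, allowing a +/-1 sentence tolerance.
    # All boundary data below are toy values, not results from the study.
    from itertools import combinations

    def agreement(a: set, b: set, tol: int = 1) -> float:
        """Fraction of boundaries in either set matched within `tol`."""
        hit = lambda x, other: any(abs(x - y) <= tol for y in other)
        hits = sum(hit(x, b) for x in a) + sum(hit(y, a) for y in b)
        return hits / (len(a) + len(b)) if (a or b) else 1.0

    humans = [{3, 7, 12}, {3, 8, 12}, {4, 7, 11}]  # boundaries per annotator
    model_marks = {3, 7, 12}                        # boundaries from the LLM

    hh = [agreement(x, y) for x, y in combinations(humans, 2)]
    mh = [agreement(model_marks, h) for h in humans]
    print(f"human-human mean agreement: {sum(hh)/len(hh):.2f}")
    print(f"model-human mean agreement: {sum(mh)/len(mh):.2f}")

Comparing the model-human average against the human-human average is one way to operationalize the claim that a model segments "more consistently than humans show among themselves."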

4
Next Steps
The speech we encounter every day is often masked by varying degrees of background noise – whether on public transit, in busy restaurants, in classrooms, or in shopping malls. Building on our results, we will apply these AI-based methods to study how people perceive and remember spoken information in these more naturalistic and acoustically challenging environments. Ultimately, this work can help uncover the cognitive strategies humans use to understand and organize speech in real-world conditions, revealing how perception and memory operate when listening becomes effortful. This is especially relevant for older adults, who often experience age-related hearing loss and cognitive changes that further increase listening demands. By applying our validated approach, we hope to better understand how these environmental and age-related factors affect perception and memory during everyday communication.

Funding
