Evaluating LLM-based Personal Information Extraction and Countermeasures

Authors: 

Yupei Liu, The Pennsylvania State University; Yuqi Jia, Duke University; Jinyuan Jia, The Pennsylvania State University; Neil Zhenqiang Gong, Duke University

Abstract: 

Automatically extracting personal information—such as name, phone number, and email address—from publicly available profiles at a large scale is a stepstone to many other security attacks including spear phishing. Traditional methods—such as regular expression, keyword search, and entity detection—achieve limited success at such personal information extraction. In this work, we perform a systematic measurement study to benchmark large language model (LLM) based personal information extraction and countermeasures. Towards this goal, we present a framework for LLM-based extraction attacks; collect four datasets including a synthetic dataset generated by GPT-4 and three real-world datasets with manually labeled eight categories of personal information; introduce a novel mitigation strategy based on prompt injection; and systematically benchmark LLM-based attacks and countermeasures using ten LLMs and five datasets. Our key findings include: LLM can be misused by attackers to accurately extract various personal information from personal profiles; LLM outperforms traditional methods; and prompt injection can defend against strong LLM-based attacks, reducing the attack to less effective traditional ones.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.