AudioMarkNet: Audio Watermarking for Deepfake Speech Detection

Authors: 

Wei Zong, Yang-Wai Chow, Willy Susilo, and Joonsang Baek, University of Wollongong; Seyit Camtepe, CSIRO Data61

Abstract: 

Deep generative models have improved to the point where generated fake images and audio are indistinguishable from genuine media, leaving humans unable to tell real content from deepfakes. While this capability benefits the creative sector, its exploitation to deceive the general public poses a real-world threat to society. To prevent generative models from being exploited by adversaries, researchers have devoted much effort to developing methods for differentiating between real and generated data. To date, most existing techniques are designed to reactively detect artifacts introduced by generative models. In this work, we propose a watermarking technique, called AudioMarkNet, that embeds watermarks in original speech. The goal is to prevent the speech from being used for speaker adaptation (i.e., fine-tuning a text-to-speech (TTS) model), which is commonly used to generate high-fidelity fake speech. Our method is orthogonal to existing reactive detection methods. Experimental results demonstrate that our method successfully detects fake speech generated by open-source and commercial TTS models. Moreover, our watermarking technique is robust against common non-adaptive attacks, and we also demonstrate its effectiveness against adaptive attacks. Examples of watermarked speech produced by our method can be found on a website. Our code and artifacts are also available online.
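The abstract describes a proactive workflow: imperceptibly watermark speech before it is released, then check whether audio later attributed to that speaker still carries the watermark. The sketch below illustrates only that general embed-then-detect idea; it is not the paper's AudioMarkNet implementation. It uses a toy additive spread-spectrum watermark with a correlation detector, and all function names, parameters, and the embedding scheme itself are illustrative assumptions standing in for the paper's learned embedder and detector.

import numpy as np

def make_key(length: int, seed: int = 0) -> np.ndarray:
    """Pseudo-random +/-1 key acting as the watermark pattern (assumed helper)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=length)

def embed_watermark(speech: np.ndarray, key: np.ndarray, strength: float = 0.01) -> np.ndarray:
    """Add a low-amplitude, key-derived signal; 'strength' trades robustness for audibility."""
    return speech + strength * key[: len(speech)]

def detect_watermark(audio: np.ndarray, key: np.ndarray, threshold: float = 5.0) -> bool:
    """Normalized correlation against the key; scores above the threshold are flagged."""
    n = min(len(audio), len(key))
    a, k = audio[:n], key[:n]
    score = (a @ k) / (np.std(a) * np.sqrt(n) + 1e-12)
    return score > threshold

if __name__ == "__main__":
    key = make_key(16000)                                           # 1 second at 16 kHz
    clean = np.random.default_rng(1).standard_normal(16000) * 0.1   # stand-in for real speech
    protected = embed_watermark(clean, key)                          # released instead of the original
    print("clean flagged:    ", detect_watermark(clean, key))        # expected: False
    print("protected flagged:", detect_watermark(protected, key))    # expected: True

In the paper's setting, detection would be run on speech synthesized by a TTS model fine-tuned on the protected recordings; this toy detector only demonstrates the correlation test on the audio itself.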
