Nuremberg / December 4, 2023 - December 7, 2023
15th IEEE International Workshop on Information Forensics and Security
The 15th IEEE International Workshop on Information Forensics and Security (WIFS) will take place in Nuremberg from December 4 to 7, 2023. Fraunhofer IDMT will present its current research in media forensics, with contributions on synthetic speech detection and audio phylogeny.
Artem Yaroshchuk, Christoforos Papastergiopoulos, Luca Cuccovillo, Patrick Aichroth, Konstantinos Votis, Dimitrios Tzovaras
This paper introduces a multilingual, multispeaker dataset composed of synthetic and natural speech, designed to foster research and benchmarking in synthetic speech detection. The dataset encompasses 18,993 audio utterances synthesized from text, alongside their corresponding natural equivalents, representing approximately 17 hours of synthetic audio data. The dataset features synthetic speech generated by 156 voices spanning three languages, namely English, German, and Spanish, with a balanced gender representation. It targets state-of-the-art synthesis methods and has been released with a license allowing seamless extension and redistribution by the research community.
The paper will be presented on December 5 at 16.00.
Milica Gerhardt, Luca Cuccovillo, Patrick Aichroth
In this study, we propose a novel approach to audio phylogeny, i.e., the detection of relationships and transformations within a set of near-duplicate audio items, by leveraging a deep neural network for efficiency and extensibility. Unlike existing methods, our approach detects transformations between nodes in one step, and the transformation set can be expanded by retraining the neural network without excessive computational costs. We evaluated our method against the state of the art using a self-created and publicly released dataset, observing superior performance in reconstructing phylogenetic trees and higher transformation detection accuracy. Moreover, the ability to detect a wide range of transformations and to extend the transformation set makes the approach suitable for various applications.
The paper will be presented on December 5 at 16.00.
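To illustrate what reconstructing a phylogeny tree from near-duplicates involves, the sketch below builds a tree from a matrix of pairwise directed dissimilarities with a generic greedy heuristic. It is not the neural approach proposed in the paper. The function name `reconstruct_tree`, the `cost` matrix, and its example values are hypothetical and serve only as an illustration.

```python
def reconstruct_tree(cost):
    """Greedy tree reconstruction from a directed dissimilarity matrix.

    cost[i][j] is a hypothetical score for "item j was derived from item i"
    (lower means more plausible). Returns a parent list; the root has None.
    """
    n = len(cost)
    parent = [None] * n
    component = list(range(n))  # union-find representative per node

    def find(x):
        # Follow representatives up to the component root, compressing paths.
        while component[x] != x:
            component[x] = component[component[x]]
            x = component[x]
        return x

    # Visit all directed edges i -> j from cheapest to most expensive.
    edges = sorted(
        (cost[i][j], i, j) for i in range(n) for j in range(n) if i != j
    )
    for _, i, j in edges:
        # Accept the edge only if j has no parent yet and no cycle is created.
        if parent[j] is None and find(i) != find(j):
            parent[j] = i
            component[find(j)] = find(i)
    return parent


# Assumed toy example: item 0 is the original, 1 and 3 derive from 0,
# and 2 derives from 1.
example_cost = [
    [0, 1, 4, 2],
    [5, 0, 1, 6],
    [9, 6, 0, 9],
    [6, 7, 8, 0],
]
example_parents = reconstruct_tree(example_cost)
print(example_parents)
```

In this toy setting, the node left without a parent is taken as the original item, and each remaining entry names the item it was most plausibly derived from. The sketch recovers only the tree structure; it says nothing about which transformation links two items.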
Luca Cuccovillo, Milica Gerhardt, Patrick Aichroth
In this paper, we address the challenge of synthetic speech detection, which has become increasingly important due to the latest advancements in text-to-speech and voice conversion technologies. We propose a novel multi-task neural network architecture, designed to be interpretable and specifically tailored for audio signals. The architecture includes a feature bottleneck, used to autoencode the input spectrogram, predict the fundamental frequency (f0) trajectory, and classify the speech as synthetic or natural. Hence, the synthesis detection can be considered a byproduct of attending to the energy distribution among vocal formants, providing a clear understanding of which characteristics of the input signal influence the final outcome. Our evaluation on the ASVspoof 2019 LA partition indicates better performance than the current state of the art, with an AUC score of 0.900.
The paper will be presented on December 7 at 9.00.
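For readers unfamiliar with the AUC metric reported in the abstract above, the following minimal sketch computes it as the probability that a randomly chosen synthetic utterance receives a higher detector score than a randomly chosen natural one. The label convention (1 for synthetic, 0 for natural), the function name `roc_auc`, and the example scores are assumptions made for illustration; they are not taken from the paper or from the ASVspoof evaluation tooling.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive sample outscores a negative one;
    ties contribute one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Assumed toy data: 1 = synthetic, 0 = natural; higher score = "more synthetic".
example_labels = [1, 1, 1, 0, 0]
example_scores = [0.9, 0.8, 0.3, 0.4, 0.1]
example_auc = roc_auc(example_labels, example_scores)
print(round(example_auc, 3))
```

An AUC of 1.0 would mean every synthetic utterance is ranked above every natural one, while 0.5 corresponds to chance-level ranking. The pairwise loop is quadratic in the number of samples, so a rank-based implementation is preferable on real evaluation sets.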