Recently, the use of massive open-source text data has significantly advanced the performance of text-based large language models (LLMs).
However, the use of large-scale in-the-wild speech data in the speech technology community remains limited.
One reason for this limitation is that a considerable amount of publicly available speech data suffers from background noise, overlapping speech, missing segmentation information, missing speaker labels, and incomplete transcriptions, all of which greatly limit its usefulness. At the same time, human annotation of speech data is both time-consuming and costly.
To address this issue, we introduce AutoPrep, an automatic preprocessing framework for in-the-wild speech data, designed to automatically enhance speech quality, generate speaker labels, and produce transcriptions.
The proposed AutoPrep framework comprises six components: speech enhancement, speech segmentation, speaker clustering, target speech extraction, quality filtering, and automatic speech recognition (ASR).
Experiments conducted on the open-source WenetSpeech corpus and our self-collected AutoPrepWild corpus demonstrate that the proposed AutoPrep framework can produce preprocessed data with DNSMOS and PDNSMOS scores comparable to those of several open-source TTS datasets.
The corresponding TTS system achieves an in-domain speaker similarity of up to 0.68.
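For context, speaker similarity in TTS evaluation is commonly measured as the cosine similarity between speaker embeddings of the synthesized and reference utterances,
\[ \mathrm{sim}(x, \hat{x}) = \frac{\mathbf{e}_x^\top \mathbf{e}_{\hat{x}}}{\lVert \mathbf{e}_x \rVert \, \lVert \mathbf{e}_{\hat{x}} \rVert}, \]
where $\mathbf{e}_x$ and $\mathbf{e}_{\hat{x}}$ denote embeddings from a pretrained speaker-verification model. We note this standard definition only for reference; the exact similarity measure used is not restated in this section.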
Fig. 1: Diagram of the proposed full-band AutoPrep framework.
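For intuition, the six stages in Fig. 1 can be viewed as a sequential pipeline: each long recording is enhanced, segmented into utterances, clustered by speaker, cleaned of overlapping speech, quality-filtered, and finally transcribed. The Python skeleton below is a minimal sketch of this flow; all names and function bodies are illustrative placeholders (not the authors' implementation), and each stub would wrap a trained model in practice.

```python
"""Minimal, illustrative sketch of a six-stage AutoPrep-style pipeline.

Every stage body is a placeholder. In a real system each stub would wrap a
trained model: a denoiser, a segmenter/VAD, a speaker-embedding clusterer,
a target-speech extractor, a quality predictor (e.g. DNSMOS-like), and an
ASR model. Names and signatures are assumptions for illustration only.
"""
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    audio: List[float]                  # waveform samples of one utterance
    sample_rate: int                    # e.g. 16000, 24000, or 44100
    speaker_id: Optional[int] = None    # assigned by speaker clustering
    quality: Optional[float] = None     # assigned by quality filtering
    text: Optional[str] = None          # assigned by ASR


def enhance(audio: List[float], sr: int) -> List[float]:
    """1) Speech enhancement: suppress background noise (placeholder)."""
    return audio


def segment(audio: List[float], sr: int) -> List[Segment]:
    """2) Speech segmentation (placeholder: naive fixed 10-second chunks)."""
    chunk = 10 * sr
    return [Segment(audio[i:i + chunk], sr) for i in range(0, len(audio), chunk)]


def cluster_speakers(segments: List[Segment]) -> List[Segment]:
    """3) Speaker clustering: group segments by speaker (placeholder: one dummy speaker)."""
    for seg in segments:
        seg.speaker_id = 0
    return segments


def extract_target_speech(seg: Segment) -> Segment:
    """4) Target speech extraction: remove overlapping speakers (placeholder)."""
    return seg


def quality_score(seg: Segment) -> float:
    """5) Quality filtering: predict perceptual quality (placeholder: constant score)."""
    return 4.0


def transcribe(seg: Segment) -> str:
    """6) ASR: produce a transcription (placeholder)."""
    return ""


def autoprep(audio: List[float], sr: int, min_quality: float = 3.0) -> List[Segment]:
    """Chain the six stages and keep only segments above the quality threshold."""
    segments = cluster_speakers(segment(enhance(audio, sr), sr))
    kept: List[Segment] = []
    for seg in segments:
        seg = extract_target_speech(seg)
        seg.quality = quality_score(seg)
        if seg.quality < min_quality:
            continue  # discard low-quality segments
        seg.text = transcribe(seg)
        kept.append(seg)
    return kept
```

The sketch only conveys the ordering and data flow of the stages; thresholds, model choices, and the exact filtering criteria are design decisions of the actual framework.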
Dataset
AutoPrepWild: The AutoPrepWild corpus is a collection of in-the-wild speech data that we gathered from publicly available podcasts, video recordings, and audiobooks, without segmentation, speaker labels, or text transcriptions. The original dataset consists of 680 unprocessed long audio recordings with a total duration of approximately 498 hours. Unlike WenetSpeech, which is sampled at 16kHz, the AutoPrepWild recordings are sampled at either 24kHz or 44.1kHz.
WenetSpeech: WenetSpeech is a widely used open-source ASR corpus comprising over 10,000 hours of Mandarin 16kHz speech data from diverse sources such as YouTube and podcasts. Being derived from real-world data, WenetSpeech covers an extensive variety of acoustic conditions and includes a substantial number of speakers, making it highly suitable for the application scenarios of AutoPrep.