Within-sample variability-invariant loss for robust speaker recognition under noisy environments
Authors: Danwei Cai, Ming Li
Note: Accepted at ICASSP 2020
Link: https://arxiv.org/abs/2002.00924

Despite the significant improvements in speaker recognition enabled by deep neural networks, unsatisfactory performance persists under noisy environments. In this paper, we train the speaker embedding network to learn the “clean” embedding of the noisy utterance. Specifically, the network is trained with the original speaker identification loss with an auxiliary within-sample variability-invariant loss. This auxiliary variability-invariant loss is used to learn the same embedding among the clean utterance and its noisy copies and prevents the network from encoding the undesired noises or variabilities into the speaker representation. Furthermore, we investigate the data preparation strategy for generating clean and noisy utterance pairs on-the-fly. The strategy generates different noisy copies for the same clean utterance at each training step, helping the speaker embedding network generalize better under noisy environments. Experiments on VoxCeleb1 indicate that the proposed training framework improves the performance of the speaker verification system in both clean and noisy conditions.

Index Terms— neural network, speaker recognition, speaker embedding, robustness, noisy conditions

  1. INTRODUCTION

Automatic speaker verification (ASV) refers to automatically making the decision to accept or reject a claimed speaker by analyzing the given speech from that speaker. In the past few years, the performance of ASV systems has been improved significantly with the successful application of deep neural networks (DNN) to speaker embedding modeling [1, 2]. However, unsatisfactory performance persists under noisy environments, which are commonly encountered in smartphones or smart speakers with ASV applications. The additive noises on a clean speech contaminate the low-energy regions of the spectrogram and blur the acoustic details [3]. These noises result in the loss of speech intelligibility and quality, imposing great challenges on speaker recognition systems.
To compensate for these adverse impacts, various approaches have been proposed at different stages of the ASV systems. At the signal level, DNN based speech or feature enhancement [4, 5, 6, 7] has been investigated for ASV under complex environments. At the feature level, feature normalization techniques [8] and noise-robust features such as power-normalized cepstral coefficients (PNCC) [9] have also been applied to ASV systems. At the model level, robust back-end modeling methods such as multi-condition training of probabilistic linear discriminant analysis (PLDA) models [10] and mixture of PLDA [11] were employed in the i-vector [12] framework. Also, score normalization [13] could be used to improve the robustness of the ASV system under noisy scenarios.

More recently, researchers are working on training deep speaker networks to cope with the distortions caused by noise. Within this framework, there are two main methods. The first one regards the noisy data as a different domain from the clean data and applies adversarial training to deal with domain mismatch and get a noise-invariant speaker embedding [14, 15]. The second method employs a DNN speech enhancement network for ASV tasks. Shon et al. [16] train the speech enhancement network with feedback from the speaker network to find the time-frequency bins that are beneficial to ASV tasks with noisy speech. Zhao et al. [17] use the intermediate result of the speech enhancement network as an auxiliary input for the speaker embedding network and jointly optimize these two networks.
In this work, our network learns enhancement directly at the embedding level for speaker recognition under noisy environments. We train the deep speaker embedding network by incorporating the original speaker identification loss with an auxiliary within-sample loss. The speaker identification loss learns the speaker representation using the speaker label, while the within-sample loss aims to learn the embedding of a noisy utterance as similar as possible to its clean version. In this way, the deep speaker embedding network is trained to prevent encoding the additive noises into the speaker representation and learns the “clean” embedding for the noisy speech utterance. The loss that helps the speaker network to learn variability-invariant embedding is called within-sample variability-invariant loss.
Furthermore, to fully explore the modeling ability of the within-sample variability-invariant loss, we dynamically generate the clean and noisy utterance pairs when preparing data for the training process. Different noisy copies for the same clean utterance are generated at different training steps, helping the speaker embedding network generalize better under noisy environments.

  2. REVISIT: DEEP SPEAKER EMBEDDING

In this section, we describe the deep speaker embedding framework, which consists of a frame-level local pattern extractor, an utterance-level encoding layer, and several fully-connected layers for speaker embedding extraction and speaker classification.
Given a variable-length input feature sequence, the local pattern extractor, which is typically a convolutional neural network (CNN) [2] or a time-delayed neural network (TDNN) [1], learns the frame-level representations. An encoding layer is then applied on top of it to get the utterance-level representation. The most common encoding method is the average pooling layer, which aggregates the statistics (i.e., mean, or mean and standard deviation) [1, 2]. Self-attentive pooling layer [18], learnable dictionary encoding layer [19], and dictionary-based NetVLAD layer [20, 21] are other commonly used encoding layers. Once the utterance-level representation is extracted, a fully connected layer and a speaker classifier are employed to further abstract the speaker representation and classify the training speakers. After training, the deep speaker embedding is extracted after the penultimate layer of the network for the given variable-length utterance.

In this work, the local pattern extractor is a residual convolutional neural network (ResNet) [22], and the encoding layer is a global statistics pooling (GSP) layer. For the frame-level representation $F \in \mathbb{R}^{C \times H \times W}$, the output of GSP is an utterance-level representation $V = [\mu_1, \mu_2, \cdots, \mu_C, \sigma_1, \sigma_2, \cdots, \sigma_C]$, where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of the $c$-th feature map:

$$\mu_c = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{c,h,w}, \qquad \sigma_c = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \left( F_{c,h,w} - \mu_c \right)^2},$$

and $C$, $H$, $W$ denote the number of channels, height and width of the feature map, respectively.
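
As a rough illustration, the GSP layer can be written in a few lines of PyTorch. This is a minimal sketch rather than the authors' exact implementation; the batch size, channel count, and feature-map shape in the example are placeholders.

```python
import torch
import torch.nn as nn

class GlobalStatisticsPooling(nn.Module):
    """Concatenate the per-channel mean and standard deviation computed
    over the spatial (H, W) dimensions of the frame-level feature map."""

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, C, H, W) from the ResNet front-end
        mu = feature_map.mean(dim=(2, 3))                    # (batch, C)
        sigma = feature_map.std(dim=(2, 3), unbiased=False)  # (batch, C)
        return torch.cat([mu, sigma], dim=1)                 # (batch, 2C)

# Placeholder shapes: a front-end with 128 output channels yields a
# 256-dimensional utterance-level representation V.
pooling = GlobalStatisticsPooling()
V = pooling(torch.randn(8, 128, 4, 50))
print(V.shape)  # torch.Size([8, 256])
```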

  3. METHODS

In this section, we describe the proposed framework with within-sample variability-invariant loss and online noisy data generation.

3.1. Within-sample variability-invariant loss

A clean speech and its noisy copies contain the same acoustic contents for recognizing speakers. Ideally, the speaker embeddings of the noisy utterance should be the same as its clean version. But in reality, the deep speaker embedding network usually encodes the noises as parts of the speaker representation for the noisy speech.


The within-sample variability-invariant loss works together with the original speaker identification loss to train the speaker embedding network. The speaker identification loss is typically a cross-entropy loss. In our implementation, the parameters of the network are updated twice at each training step: the first update from the speaker identification loss is followed by the second update from the within-sample variability-invariant loss. Figure 1 shows the flowchart of our proposed framework.
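
A minimal PyTorch sketch of this two-update training step is given below. It assumes the auxiliary within-sample loss is a mean-squared error between the embeddings of a noisy utterance and its clean counterpart (consistent with the within-sample MSE loss reported in Section 4.3); the helper names `network` and `classifier`, and the choice to detach the clean embedding as a fixed target, are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(network, classifier, optimizer,
               clean_feats, noisy_feats, speaker_labels):
    """One training step with two parameter updates: speaker identification
    loss first, then the within-sample variability-invariant loss."""
    # Update 1: speaker identification (cross-entropy) loss on clean and noisy inputs.
    optimizer.zero_grad()
    emb = torch.cat([network(clean_feats), network(noisy_feats)], dim=0)
    labels = torch.cat([speaker_labels, speaker_labels], dim=0)
    id_loss = F.cross_entropy(classifier(emb), labels)
    id_loss.backward()
    optimizer.step()

    # Update 2: within-sample variability-invariant loss, pulling the embedding
    # of each noisy copy toward the embedding of its clean version.
    optimizer.zero_grad()
    clean_emb = network(clean_feats).detach()  # assumption: clean embedding as a fixed target
    noisy_emb = network(noisy_feats)
    invariance_loss = F.mse_loss(noisy_emb, clean_emb)
    invariance_loss.backward()
    optimizer.step()

    return id_loss.item(), invariance_loss.item()
```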

3.2. Online data augmentation

In this work, we implement an online data augmentation strategy. Different parameters of noise types, noise clips and signal-to-noise ratio (SNR) are randomly selected to generate the clean-noisy utterance pair when training. Different permutations of these random parameters generate different noisy segments for the same utterance at different training steps, so the network never “sees” the same noisy segment from the same clean speech.
During training, the SNR is a continuous random variable uniformly distributed between 0 and 20 dB, and there are four types of noise: music, ambient noise, television, and babble. The television noise is generated with one music file and one speech file. The babble noise is constructed by mixing three to six speech files into one, which results in overlapping voices simultaneously with the foreground speech.
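
The on-the-fly generation of a noisy copy can be sketched as below. It is an illustrative sketch, not the authors' code: `noise_pool` is a hypothetical dictionary mapping a noise category ("music", "ambient", "speech") to a list of MUSAN waveforms already loaded as NumPy arrays and long enough to cover the utterance.

```python
import random
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at the requested signal-to-noise ratio (mono signals)."""
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def make_noisy_copy(speech: np.ndarray, noise_pool: dict) -> np.ndarray:
    """Draw a random noise type, noise clip(s), and SNR in [0, 20] dB, so every
    training step sees a different noisy copy of the same clean utterance."""
    noise_type = random.choice(("music", "ambient", "television", "babble"))
    snr_db = random.uniform(0.0, 20.0)
    if noise_type == "babble":
        # Babble: three to six speech files mixed into one background.
        clips = random.sample(noise_pool["speech"], k=random.randint(3, 6))
        noise = sum(clip[: len(speech)] for clip in clips)
    elif noise_type == "television":
        # Television: one music file plus one speech file.
        noise = (random.choice(noise_pool["music"])[: len(speech)]
                 + random.choice(noise_pool["speech"])[: len(speech)])
    else:
        noise = random.choice(noise_pool[noise_type])[: len(speech)]
    return add_noise_at_snr(speech, noise, snr_db)
```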

  4. EXPERIMENTS

The experiments are conducted on the VoxCeleb1 dataset [23]. The training data contain 148,642 utterances from 1,211 speakers. In the test data, 4,874 utterances from 40 speakers form 37,720 test trials. Although the VoxCeleb dataset, collected from online videos, is not strictly in clean condition, we assume the original data as a clean dataset and generate noisy data from the original data.

The MUSAN dataset [24] is used as the noise source. We split MUSAN into two non-overlapping subsets for generating the training and testing noisy data, respectively.

4.2. Experimental setup

Speech signals are first converted to 64-dimensional log Mel-filterbank energies and then fed into the speaker embedding network. The detailed network architecture is in table 2. The front-end local pattern extractor is based on the well-known ResNet-34 architecture [22]. ReLU activation and batch normalization are applied to each convolutional layer.
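
For reference, 64-dimensional log Mel-filterbank energies can be computed with torchaudio as sketched below; the 25 ms window and 10 ms hop at 16 kHz are assumed defaults, not values stated in the paper.

```python
import torch
import torchaudio

# 64-dimensional Mel-filterbank front-end (window/hop sizes are assumed defaults).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=64)

def log_mel_fbank(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (1, num_samples) at 16 kHz -> (1, 64, num_frames)
    return torch.log(mel(waveform) + 1e-6)
```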

For the speaker identification loss, a standard softmax-based cross-entropy loss or angular softmax loss (A-softmax) [25] is used. When training with softmax loss, dropout is added to the penultimate fully-connected layer to prevent overfitting.
Three training data settings are investigated: (1) the original VoxCeleb1 dataset (clean); (2) the original training dataset and offline generated noisy data, i.e., the noisy data are generated in advance (offline AUG); (3) the original training data with online data augmentation (online AUG).
At the testing stage, cosine similarity is used for scoring. We use equal error rate (EER) and detection cost function (DCF) as the performance metrics. The reported DCF is the average of the two minimum DCFs at P_target = 0.01 and 0.001.
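
A minimal sketch of the cosine-similarity scoring and the EER computation (using scikit-learn's ROC curve) is shown below; the DCF averaging at the two P_target values is omitted here.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Cosine similarity between an enrollment embedding and a test embedding."""
    return float(np.dot(enroll_emb, test_emb)
                 / (np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-12))

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the point where the false-acceptance rate equals the false-rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```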

4.3. Experimental results

Eight deep speaker embedding networks are trained based on three training conditions and different loss functions. Table 1 shows the DCF and EER of three noise types (babble, ambient noise and music) at five SNR settings (0, 5, 10, 15, 20dB). Also, all of the 15 noisy testing trials are combined to form the “all noises” trial.

Several observations from the results are discussed in the following. 1) The experimental results confirm that the data augmentation strategy can greatly improve the performance of the deep speaker embedding system under noisy conditions. 2) Compared with the offline data augmentation strategy, the performance improvement achieved by online data augmentation is more obvious in the low SNR conditions. 3) Training the deep speaker embedding system with the within-sample variability-invariant loss can improve the system performance in both the clean and all-noises conditions. 4) Compared with the network trained with offline data augmentation, the proposed framework using within-sample variability-invariant loss with online data augmentation achieves 13.0% and 6.5% reduction in terms of EER and DCF, respectively. 5) When the speaker embedding network is trained discriminatively using the A-softmax loss with angular margin, the proposed within-class loss can still improve the system performance by setting constraints on the distance between the clean utterance and its noisy copies.
The detection error tradeoff (DET) curves in figure 2 provide comparisons among four selected systems, two of which are trained with our proposed framework. The DET curve uses testing trials from all the noisy conditions.

We also visualized the speaker embeddings by using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [26]. The two-dimensional results of the speaker embeddings are shown in figure 4. Four speakers, each with six clean utterances, are selected from the training dataset for visualization. Also, each clean utterance has three 5 dB noisy copies with music, babble and ambient noises. Compared with the clean training condition, data augmentation helps the clean and noisy embeddings from the same utterance cluster together. Further, after training the deep speaker embedding network with the within-sample variability-invariant loss, the clean and noisy embeddings of the same utterance are closer to each other.
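
The visualization can be reproduced along these lines with scikit-learn's t-SNE; the perplexity value below is a placeholder, not the setting used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings: np.ndarray, speaker_ids: np.ndarray) -> None:
    """Project speaker embeddings (num_utterances, dim) to 2-D and color by speaker."""
    points = TSNE(n_components=2, perplexity=10, init="pca").fit_transform(embeddings)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        plt.scatter(points[mask, 0], points[mask, 1], label=f"speaker {spk}")
    plt.legend()
    plt.show()
```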

The loss values of each training epoch are shown in figure 3 for the network trained with the speaker softmax and within-sample MSE losses. The reference MSE loss between embeddings from the clean and noisy data of the converged network trained with only the softmax loss is also given. We can observe that the MSE loss is maintained at a low level during training, which helps the network to extract noisy embeddings similar to their clean versions.

5. CONCLUSION

This paper has proposed the within-sample variability-invariant loss for deep speaker embedding networks under noisy conditions. By setting constraints on the embeddings extracted from the clean utterance and its noisy copies, the proposed loss works with the original speaker identification loss to learn robust embeddings for noisy speech. We also employ the data preparation strategy of generating the clean and noisy utterance pairs on-the-fly to help the speaker embedding network generalize better under noisy environments. The proposed framework is flexible and can be extended to other similar applications when multiple views of the same training speech sample are available.

  6. ACKNOWLEDGEMENT

This research is funded in part by the National Natural Science Foundation of China (61773413) and Duke Kunshan University.

  7. REFERENCES
    [1] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “x-vectors: Robust DNN Embeddings for Speaker Recognition,” in ICASSP, 2018, pp. 5329–5333.
    [2] W. Cai, J. Chen, and M. Li, “Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System,” in Speaker Odyssey, 2018, pp. 74–81.
    [3] M. Wolfel and J. McDonough, Distant Speech Recognition, John Wiley & Sons, Incorporated, 2009.
    [4] X. Zhao, Y. Wang, and D. Wang, “Robust Speaker Identification in Noisy and Reverberant Conditions,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 836–845, 2014.
    [5] M. Kolboek, Z. Tan, and J. Jensen, “Speech Enhancement Using Long Short-Term Memory based Recurrent Neural Networks for Noise Robust Speaker Verification,” in SLT, 2016, pp. 305–311.
    [6] Z. Oo, Y. Kawakami, L. Wang, S. Nakagawa, X. Xiao, and M. Iwahashi, “DNN-Based Amplitude and Phase Feature Enhancement for Noise Robust Speaker Identification,” in Interspeech, 2016, pp. 2204–2208.
    [7] O. Plchot, L. Burget, H. Aronowitz, and P. Matejka, “Audio enhancing with DNN autoencoder for speaker recognition,” in ICASSP, 2016, pp. 5090–5094.
    [8] J. Pelecanos and S. Sridharan, “Feature Warping for Robust Speaker Verification,” in Speaker Odyssey, 2001, pp. 213–218.
    [9] C. Kim and R. M. Stern, “Power-Normalized Cepstral Coefficients (PNCC) for Robust Speech Recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 7, pp. 1315–1329, 2016.
    [10] D. Garcia-Romero, X. Zhou, and C. Y. Espy-Wilson, “Multi-Condition Training of Gaussian PLDA Models in i-vector Space for Noise and Reverberation Robust Speaker Recognition,” in ICASSP, 2012, pp. 4257–4260.
    [11] M. Mak, X. Pang, and J. Chien, “Mixture of PLDA for Noise Robust i-Vector Speaker Verification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 1, pp. 130–142, 2016.
    [12] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
    [13] I. Peer, B. Rafaely, and Y. Zigel, “Reverberation Matching for Speaker Recognition,” in ICASSP, 2008, pp. 4829–4832.
    [14] J. Zhou, T. Jiang, L. Li, Q. Hong, Z. Wang, and B. Xia, “Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding,” in ICASSP, 2019, pp. 6196–6200.
    [15] Z. Meng, Y. Zhao, J. Li, and Y. Gong, “Adversarial Speaker Verification,” in ICASSP, 2019, pp. 6216–6220.
    [16] S. Shon, H. Tang, and J. Glass, “VoiceID Loss: Speech Enhancement for Speaker Verification,” in Interspeech, 2019, pp. 2888–2892.
    [17] F. Zhao, H. Li, and X. Zhang, “A Robust Text-independent Speaker Verification Method Based on Speech Separation and Deep Speaker,” in ICASSP, 2019, pp. 6101–6105.
    [18] G. Bhattacharya, J. Alam, and P. Kenny, “Deep Speaker Embeddings for Short-Duration Speaker Verification,” in Interspeech, 2017, pp. 1517–1521.
    [19] W. Cai, Z. Cai, X. Zhang, X. Wang, and M. Li, “A Novel Learnable Dictionary Encoding Layer for End-to-End Language Identification,” in ICASSP, 2018, pp. 5189–5193.
    [20] J. Chen, W. Cai, D. Cai, Z. Cai, H. Zhong, and M. Li, “End-to-end Language Identification using NetFV and NetVLAD,” in ISCSLP, 2018.
    [21] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level Aggregation For Speaker Recognition In The Wild,” in ICASSP, 2019, pp. 5791–5795.
    [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016, pp. 770–778.
    [23] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: A Large-Scale Speaker Identification Dataset,” in Interspeech, 2017, pp. 2616–2620.
    [24] D. Snyder, G. Chen, and D. Povey, “MUSAN: A Music, Speech, and Noise Corpus,” arXiv:1510.08484 [cs], 2015.
    [25] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “SphereFace: Deep Hypersphere Embedding for Face Recognition,” in CVPR, 2017, pp. 212–220.
    [26] L. van der Maaten and G. Hinton, “Visualizing Data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
