Total thesis length: 67,935 characters
Abstract
Speech signal separation is an important branch of speech signal processing; its goal is to recover a relatively clean target speech signal from background noise, reverberation, and interfering sounds. After years of development, traditional microphone-array speech separation algorithms have matured. Their core lies in estimating the power spectral density of the noise signal in order to design a filter, but estimating the power spectrum of non-stationary noise remains difficult. Deep learning, which is grounded in statistical characteristics and offers strong nonlinear fitting ability, provides a new direction for speech separation. On this basis, this thesis designs a deep-learning-based microphone-array speech separation algorithm and studies its performance mainly on the speech enhancement problem and the multi-speaker separation problem.
The thesis first addresses the speech enhancement problem, designing and optimizing the algorithm framework in terms of spatial directional features, neural network structure, and feature combinations. The best-performing model obtained is a dual-input, single-output deep neural network (DNN) that combines spectral features with spatial directional features.
Next, the model's own performance is analyzed under the speech enhancement problem: its generalization to signal-to-noise ratio (SNR) and to the angle of directional noise; its protection of target speech that deviates from the localized direction; and its rotation invariance when separating target speech from non-training directions. Experiments show that the algorithm suppresses both diffuse and directional noise effectively over SNRs from -10 dB to 30 dB, and that it generalizes over directional-noise angles. Tests show that the algorithm protects, and substantially enhances, target speech within ±30° of sound-source localization error. Regarding rotation, experiments show that a simple rotation operation during feature extraction is sufficient to let the model separate target speech from non-training directions.
Still under the speech enhancement problem, the thesis also compares the proposed algorithm with traditional array signal processing algorithms, namely delay-and-sum (DAS) beamforming and minimum variance distortionless response (MVDR) beamforming. Experiments show that, under diffuse noise, the target speech separated by the proposed algorithm has better speech quality and intelligibility.
Finally, based on the model's rotation invariance, the thesis applies the algorithm to the multi-speaker separation problem. Experiments verify that the algorithm also separates speech well in this setting: each speaker's speech can be obtained by applying the model repeatedly, once per target direction.
The thesis innovatively uses DAS beam scanning to extract spatial features and confirms that the neural network model built on these features together with spectral features is rotation-invariant, so it can extract target speech from non-training directions. This property greatly reduces the computational and time cost of training, broadens the model's range of application, and has considerable practical value.
Keywords: speech separation, deep learning, microphone array, rotation invariance
ABSTRACT
Speech separation is one of the most important branches of speech signal processing. It aims to separate target speech from background noise, reverberation, and interference. After years of development, traditional microphone-array speech separation algorithms have been well studied; their core is to estimate the power spectral density of the noise signal in order to design a filter. However, estimating the power spectrum of non-stationary noise remains difficult. Deep learning, which is based on statistical characteristics and has strong nonlinear fitting ability, provides a new direction for speech separation. Thus, a deep-learning-based microphone-array speech separation algorithm is designed in this thesis, and its performance in speech enhancement and multi-speaker separation is mainly studied.
The thesis first addresses the speech enhancement problem, designing and optimizing the algorithm in terms of spatial features, neural network structures, and feature combinations. The best-performing model is a dual-input, single-output deep neural network (DNN) that combines spectral and spatial features.
Then, the performance of the model is analyzed on the speech enhancement problem. Experiments are designed to analyze the model's generalization to SNR and to directional-noise angle, as well as its ability to protect target speech in off-target directions and to separate target speech in non-training directions. The results show that the algorithm reduces both diffuse and directional noise effectively over SNRs from -10 dB to 30 dB, and that it generalizes over directional-noise angles. They also show that the proposed algorithm can enhance target speech within ±30° of sound-source localization deviation. As for separating target speech in non-training directions, experiments show that a simple rotation operation during feature extraction is effective.
To compare against traditional array signal processing, the proposed algorithm is evaluated against delay-and-sum (DAS) beamforming and minimum variance distortionless response (MVDR) beamforming. The experimental results show that, under diffuse noise, the speech separated by the proposed algorithm has better quality and intelligibility.
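For readers unfamiliar with the DAS baseline mentioned above, the following is a minimal frequency-domain sketch of delay-and-sum beamforming under a far-field plane-wave assumption. The function name, array geometry, and parameters are illustrative, not taken from the thesis.

```python
import numpy as np

def das_beamform(frames, fs, mic_positions, look_dir, c=343.0):
    """Frequency-domain delay-and-sum beamforming (illustrative sketch).

    frames        : (n_mics, n_samples) time-domain snapshot
    fs            : sampling rate in Hz
    mic_positions : (n_mics, 3) microphone coordinates in metres
    look_dir      : unit vector toward the target source (far-field
                    plane-wave model; sign convention is illustrative)
    """
    n_mics, n_samples = frames.shape
    spectra = np.fft.rfft(frames, axis=1)             # (n_mics, n_bins)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)    # (n_bins,)
    # Per-microphone propagation delays for a plane wave from look_dir
    delays = mic_positions @ look_dir / c             # (n_mics,)
    # Phase-align every channel toward the look direction, then average
    steering = np.exp(2j * np.pi * np.outer(delays, freqs))
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```

With all delays phase-compensated, signals from the look direction add coherently while noise from other directions adds incoherently, which is the basis of the DAS beam scanning the thesis uses for spatial feature extraction.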
At the end of the thesis, based on the rotation invariance observed above, the algorithm is applied to the multi-speaker separation problem. Experiments verify that the algorithm also improves speech quality and intelligibility in this setting. The speech of each speaker can be obtained by setting the target direction to each speaker in turn.
In this thesis, delay-and-sum beam scanning is used innovatively to extract the spatial feature. The main contribution of the study is proving that the model built on the proposed spatial feature together with spectral features is rotation-invariant, so it can be applied to separate target speech in non-training directions. This rotation invariance greatly reduces the computational and time cost of training the model and makes the model more widely applicable.
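The "simple rotation operation during feature extraction" can be pictured as follows for a uniform circular array: a target at an arbitrary angle is mapped onto the trained look direction by circularly re-ordering the channels before features are computed. This is a sketch under the assumption of a uniform circular array; the thesis excerpt does not specify the array geometry, and the helper names are hypothetical.

```python
import numpy as np

def steps_for_angle(theta_deg, n_mics):
    """Number of microphone positions separating a target at theta_deg
    from the trained look direction (assumed at 0°), for a uniform
    circular array of n_mics elements. Illustrative assumption only."""
    return int(round(theta_deg / (360.0 / n_mics))) % n_mics

def rotate_channels(frames, n_steps):
    """Circularly re-order channels so the target direction coincides
    with the trained direction before feature extraction.

    frames : (n_mics, n_samples) multichannel snapshot
    """
    return np.roll(frames, -n_steps, axis=0)
```

Because the array is rotationally symmetric, the re-ordered channels look to the model exactly like a target in the trained direction, so one trained model can serve every steering angle without retraining.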
KEY WORDS: speech separation, deep learning, microphone array, rotation invariance
Contents
Abstract
ABSTRACT
Contents
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Characteristics of Microphone Array Speech Separation Research
1.2.1 Characteristics of Speech Separation Research
1.2.2 Characteristics of Array Speech Signal Processing
1.3 History and Current State of Microphone Array Speech Separation Research
1.3.1 Traditional Microphone Array Speech Separation Techniques
1.3.2 Deep-Learning-Based Microphone Array Speech Separation Techniques
1.4 Research Content and Structure of This Thesis
1.4.1 Research Content
1.4.2 Structure of the Thesis
Chapter 2 Theoretical Background of Deep-Learning-Based Microphone Array Speech Separation
2.1 Fundamentals of Speech Signals
2.1.1 Characteristics of Speech Signals
2.1.2 Characteristics of Noise Signals
2.1.3 Acoustic Environment
2.1.4 Human Auditory Perception
2.2 Theory of Traditional Microphone Array Signal Processing
2.2.1 Signal Model and Problem Formulation
2.2.2 DAS Beamforming
2.2.3 Multichannel Wiener Filtering (MWF)
2.2.4 MVDR Beamforming
2.3 Deep Learning Theory
2.3.1 Deep Network Architectures
2.3.2 Activation Functions
2.3.3 Loss Functions
2.3.4 Optimization Algorithms
2.3.5 Model Training Techniques
2.4 Summary
Chapter 3 Framework Design of the Deep-Learning-Based Microphone Array Speech Separation Algorithm
3.1 System Framework
3.2 Feature Extraction