Speech enhancement method for laser microphones based on ResUnet and TFGAN networks

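  • Abstract: A laser microphone is a technology that uses the optical Doppler effect to acquire far-field speech information. Its speech quality is affected by several factors, including the characteristics of the detection system itself, the optical detection path, and the target object. To obtain higher-quality speech from targets in a distant sound field, single-frequency acoustic excitation experiments were carried out to measure the sound-induced vibration frequency responses of four typical targets (A4 paper sheet, A4 paper box, corrugated box, and plastic bottle), which were found to be non-uniform across frequency. On this basis, a laser speech enhancement method based on the ResUnet and TFGAN networks is proposed: the ResUnet network predicts a denoised Mel spectrogram, and the TFGAN network recovers the time-domain waveform of the laser speech from the predicted Mel spectrogram. Remote speech acquisition experiments were then carried out on the four targets with a laboratory-built laser microphone, and the collected laser microphone speech was processed with the proposed method and compared with the nonlinear function harmonic reconstruction method and the DNN+harmonic reconstruction method. Finally, the objective speech quality measure PESQ (perceptual evaluation of speech quality) and the time-domain segmental signal-to-noise ratio (SNRseg) were used to evaluate the processed laser speech quantitatively. The experimental results show that, for laser speech collected on the four targets, neither the nonlinear function harmonic reconstruction method nor the DNN+harmonic reconstruction method noticeably improved speech quality, and the corresponding PESQ and SNRseg scores did not increase noticeably. After processing with the proposed ResUnet+TFGAN method, by contrast, the laser speech achieved higher PESQ and SNRseg scores and its quality improved markedly. The proposed method therefore provides better laser speech enhancement in laser microphone applications. The results also show that, even on targets with poorly consistent frequency responses, the method can still reconstruct the spectrum well and recover high-quality speech.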

     

    Abstract:
      Objective  A laser microphone is a device that uses the optical Doppler effect to acquire acoustic vibration information (speech). Compared with conventional microphones, laser microphones offer long range, high precision and non-contact operation. They can collect distant sound field information directionally while avoiding interference from the sound field close to the device. However, when a laser microphone is used to collect speech from a remote sound field, the quality of the obtained speech is affected by many factors and degrades severely. At present, research on speech enhancement algorithms for laser microphone speech is still at a preliminary stage. Traditional single-channel speech enhancement methods require the signal and noise to satisfy stationarity or correlation conditions, and their performance drops significantly under complex conditions such as low signal-to-noise ratio and non-stationary noise. Methods based on deep neural networks can learn the complex mapping between noisy speech and clean speech, and they outperform the traditional methods. They generalize poorly, however, to laser speech from complex targets in environments that were not preset, because different targets have different frequency response characteristics. Therefore, to improve the quality of far-field speech captured by laser microphones, a laser microphone speech enhancement method based on the ResUnet and TFGAN networks is proposed in this paper.
      Methods  Using a laboratory-built laser microphone, remote speech acquisition experiments were carried out on four different types of target objects (Fig.6). The recorded speech was processed with the method proposed in this paper, and the results were compared with those of the nonlinear function harmonic reconstruction method and the DNN+harmonic reconstruction method (Fig.9). Finally, perceptual evaluation of speech quality (PESQ) and the time-domain segmental signal-to-noise ratio (SNRseg) were used to evaluate the processed laser speech quantitatively (Fig.11).
      Results and Discussions   Compared with the two methods above, the method proposed in this paper better suppresses broadband noise and impulse noise and reconstructs high-frequency information more accurately after stepwise enhancement of the collected laser speech. After processing with this method, the PESQ scores of the laser speech from the A4 paper, A4 paper box, corrugated box and PET plastic bottle are 2.126, 1.818, 1.804 and 1.951, increases of 0.129, 0.113, 0.117 and 0.22, respectively. The corresponding SNRseg scores are −5.31 dB, −3.36 dB, −5.07 dB and −3.40 dB, increases of 1 dB, 6.25 dB, 1.41 dB and 0.17 dB, respectively. The experimental results show that the proposed ResUnet+TFGAN method can effectively improve the laser speech quality for these targets.
      Conclusions  In this study, a laser microphone speech enhancement method based on the ResUnet and TFGAN networks is proposed. Speech samples were collected on various targets with a laboratory-built laser microphone, and the proposed method was validated experimentally. The results show that the method can enhance laser microphone speech from a variety of objects. Compared with the nonlinear function harmonic reconstruction method and the DNN+harmonic reconstruction method, its advantage is that the ResUnet and TFGAN networks respectively perform clean Mel spectrum prediction and time-domain waveform recovery of the laser speech. This avoids the high-frequency noise that the harmonic reconstruction method introduces when reconstructing the speech signal, while recovering clearer high-frequency information in the laser speech. The PESQ and SNRseg results demonstrate that the proposed method improves the speech quality of the laser microphone. The method extends the application range of laser microphones to a certain extent, and we will further verify and improve it on objects with more complex materials and shapes.
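
The segmental SNR (SNRseg) used as an evaluation metric in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the version below follows a common textbook convention: the per-frame SNR in dB between a clean reference and the processed signal is averaged over frames. The function name `snr_seg`, the frame length, the hop size and the clamping range of −10 dB to 35 dB are all assumptions made for illustration, not values taken from the paper.

```python
import math


def snr_seg(clean, enhanced, frame_len=256, hop=128, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB between a clean reference and a processed signal.

    frame_len, hop, lo and hi are illustrative assumptions, not paper values.
    """
    n = min(len(clean), len(enhanced))
    frame_snrs = []
    for start in range(0, n - frame_len + 1, hop):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in s)
        error_energy = sum((x - y) ** 2 for x, y in zip(s, e))
        if signal_energy == 0.0:
            continue  # skip frames where the reference is silent
        if error_energy == 0.0:
            snr = hi  # perfect reconstruction: use the upper clamp
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        # Clamp each frame so a few extreme frames do not dominate the mean.
        frame_snrs.append(min(max(snr, lo), hi))
    if not frame_snrs:
        raise ValueError("no usable frames in the input signals")
    return sum(frame_snrs) / len(frame_snrs)


# Synthetic example: a sine wave and a copy attenuated by 10 percent.
clean = [math.sin(2 * math.pi * 5 * t / 1024) for t in range(1024)]
attenuated = [0.9 * x for x in clean]
print(round(snr_seg(clean, attenuated), 3))
```

In this synthetic example the error in every frame is one tenth of the reference amplitude, so each frame contributes about 20 dB. Scores computed on real recordings depend on the framing and clamping choices, so they are only comparable when those settings match.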

     
