  • A voice activity detection toolkit including DNN, bDNN, LSTM and ACAM-based VAD. A dataset recorded by the authors is also provided.
  • Speech Signal Processing | Endpoint Detection in Python


    Because a project required it, I needed to do speech endpoint detection in Python. In an earlier post, "Endpoint detection using short-time energy and spectral centroid features", I implemented a speech endpoint detection algorithm in MATLAB. Below I re-implement that algorithm in Python and wrap it up as a VAD routine. The run result is shown at the end of the post.

    Software Environment

    Python 3.8, scipy, pyaudio, matplotlib

    Program

    Converting the MATLAB program to Python is fairly straightforward. VAD.py is as follows:

    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    
    import numpy as np
    import sys
    from collections import deque
    import matplotlib.pyplot as plt
    import scipy.signal
    import pyaudio
    import struct as st
    
    def ShortTimeEnergy(signal, windowLength, step):
        """
        Compute the short-time energy.
        Parameters
        ----------
        signal : original signal.
        windowLength : frame length.
        step : frame shift.
        
        Returns
        -------
        E : energy of each frame.
        """
        signal = signal / np.max(signal)  # normalize
        curPos = 0
        L = len(signal)
        numOfFrames = np.asarray(np.floor((L - windowLength) / step) + 1, dtype=int)
        E = np.zeros((numOfFrames, 1))
        for i in range(numOfFrames):
            window = signal[int(curPos):int(curPos + windowLength)]
            E[i] = (1 / windowLength) * np.sum(np.abs(window ** 2))
            curPos = curPos + step
        return E
    
    def SpectralCentroid(signal, windowLength, step, fs):
        """
        Compute the spectral centroid.
        Parameters
        ----------
        signal : original signal.
        windowLength : frame length.
        step : frame shift.
        fs : sampling rate.
    
        Returns
        -------
        C : spectral centroid of each frame.
        """
        signal = signal / np.max(signal)  # normalize
        curPos = 0
        L = len(signal)
        numOfFrames = np.asarray(np.floor((L - windowLength) / step) + 1, dtype=int)
        H = np.hamming(windowLength)
        m = ((fs / (2 * windowLength)) * np.arange(1, windowLength, 1)).T
        C = np.zeros((numOfFrames, 1))
        for i in range(numOfFrames):
            window = H * (signal[int(curPos): int(curPos + windowLength)])
            FFT = np.abs(np.fft.fft(window, 2 * int(windowLength)))
            FFT = FFT[1: windowLength]
            FFT = FFT / np.max(FFT)
            C[i] = np.sum(m * FFT) / np.sum(FFT)
            if np.sum(window ** 2) < 0.010:
                C[i] = 0.0
            curPos = curPos + step
        C = C / (fs / 2)
        return C
    
    def findMaxima(f, step):
        """
        Find local maxima.
        Parameters
        ----------
        f : input sequence.
        step : search window length.
    
        Returns
        -------
        Maxima : list of [index, value] pairs of the maxima.
        countMaxima : number of maxima.
        """
        ## STEP 1: find the candidate maxima
        countMaxima = 0
        Maxima = []
        for i in range(len(f) - step - 1):  # for each element of the sequence:
            if i >= step:
                if (np.mean(f[i - step : i]) < f[i]) and (np.mean(f[i + 1 : i + step + 1]) < f[i]): 
                    # IF the current element is larger than its neighbors (2*step window)
                    # --> keep maximum:
                    countMaxima = countMaxima + 1
                    Maxima.append([i, f[i]])
            else:
                if (np.mean(f[0 : i + 1]) <= f[i]) and (np.mean(f[i + 1 : i + step + 1]) < f[i]):
                    # IF the current element is larger than its neighbors (2*step window)
                    # --> keep maximum:
                    countMaxima = countMaxima + 1
                    Maxima.append([i, f[i]])
    
        ## STEP 2: post-process the maxima
        MaximaNew = []
        countNewMaxima = 0
        i = 0
        while i < countMaxima:
            # get current maximum:
            
            curMaxima = Maxima[i][0]
            curMavVal = Maxima[i][1]
    
            tempMax = [Maxima[i][0]]
            tempVals = [Maxima[i][1]]
            i = i + 1
    
            # search for "neighbouring maxima":
            while (i < countMaxima) and (Maxima[i][0] - tempMax[len(tempMax) - 1] < step / 2):
                
                tempMax.append(Maxima[i][0])
                tempVals.append(Maxima[i][1])
                i = i + 1
                
            MM = np.max(tempVals)
            MI = np.argmax(tempVals) 
            if MM > 0.02 * np.mean(f): # if the current maximum is "large" enough:
                # keep the maximum of all maxima in the region:
                MaximaNew.append([tempMax[MI], f[tempMax[MI]]])
                countNewMaxima = countNewMaxima + 1   # add maxima
        Maxima = MaximaNew
        countMaxima = countNewMaxima
        
        return Maxima, countMaxima
    
    def VAD(signal, fs):
        win = 0.05
        step = 0.05
        Eor = ShortTimeEnergy(signal, int(win * fs), int(step * fs))
        Cor = SpectralCentroid(signal, int(win * fs), int(step * fs), fs)
        E = scipy.signal.medfilt(Eor[:, 0], 5)
        E = scipy.signal.medfilt(E, 5)
        C = scipy.signal.medfilt(Cor[:, 0], 5)
        C = scipy.signal.medfilt(C, 5)
        
        E_mean = np.mean(E)
        Z_mean = np.mean(C)
        Weight = 100 # parameter for threshold estimation
        # estimate the short-time energy threshold
        Hist = np.histogram(E, bins=10) # compute the histogram
        HistE = Hist[0]
        X_E = Hist[1]
        MaximaE, countMaximaE = findMaxima(HistE, 3) # find local maxima of the histogram
        if len(MaximaE) >= 2: # if at least two local maxima were found
            T_E = (Weight*X_E[MaximaE[0][0]] + X_E[MaximaE[1][0]]) / (Weight + 1)
        else:
            T_E = E_mean / 2
        
        # estimate the spectral centroid threshold
        Hist = np.histogram(C, bins=10)
        HistC = Hist[0]
        X_C = Hist[1]
        MaximaC, countMaximaC = findMaxima(HistC, 3)
        if len(MaximaC) >= 2:
            T_C = (Weight*X_C[MaximaC[0][0]] + X_C[MaximaC[1][0]]) / (Weight + 1)
        else:
            T_C = Z_mean / 2
        
        # threshold decision
        Flags1 = (E >= T_E)
        Flags2 = (C >= T_C)
        flags = np.array(Flags1 & Flags2, dtype=int)
        
        ## extract speech segments
        count = 1
        segments = []
        while count < len(flags): # while there are unprocessed frames
            # initialization
            curX = []
            countTemp = 1
            while ((flags[count - 1] == 1) and (count < len(flags))):
                if countTemp == 1: # if this is the first frame of the segment
                    Limit1 = np.round((count - 1) * step * fs) + 1 # set the start boundary of the segment
                    if Limit1 < 1:
                        Limit1 = 1
                count = count + 1 		# increment the frame counter
                countTemp = countTemp + 1	# increment the counter of the current segment
                
            if countTemp > 1: # if a speech segment was found in this pass
                Limit2 = np.round((count - 1) * step * fs) # set the end boundary of the segment
                if Limit2 > len(signal):
                    Limit2 = len(signal)
                # append the start and end positions of the segment to segments
                segments.append([int(Limit1), int(Limit2)])
            count = count + 1
            
        # merge overlapping segments
        i = 0
        while i < len(segments) - 1: # check each pair of adjacent segments
            if segments[i][1] >= segments[i + 1][0]:
                segments[i][1] = segments[i + 1][1]
                del segments[i + 1]
            else:
                i += 1
    
        return segments
    
    if __name__ == "__main__":
        CHUNK = 1600
        FORMAT = pyaudio.paInt16
        CHANNELS = 1 # number of channels
        RATE = 16000 # sampling rate
        RECORD_SECONDS = 3 # duration in seconds
        p = pyaudio.PyAudio()
        stream = p.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        frames_per_buffer=CHUNK)
        frames = [] # audio buffer
        while True:
            data = stream.read(CHUNK)
            frames.append(data)
            if(len(frames) > RECORD_SECONDS * RATE / CHUNK):
                del frames[0]
            datas = b''
            for i in range(len(frames)):
                datas = datas + frames[i]
            if len(datas) == RECORD_SECONDS * RATE * 2:
                fmt = "<" + str(RECORD_SECONDS * RATE) + "h"
                signal = np.array(st.unpack(fmt, bytes(datas))) # convert the byte stream to an int16 array
                segments = VAD(signal, RATE) # endpoint detection
                # visualization
                index = 0
                for seg in segments:
                    if index < seg[0]:
                        x = np.linspace(index, seg[0], seg[0] - index, endpoint=True, dtype=int)
                        y = signal[index:seg[0]]
                        plt.plot(x, y, 'g', alpha=1)
                    x = np.linspace(seg[0], seg[1], seg[1] - seg[0], endpoint=True, dtype=int)
                    y = signal[seg[0]:seg[1]]
                    plt.plot(x, y, 'r', alpha=1)
                    index = seg[1]            
                x = np.linspace(index, len(signal), len(signal) - index, endpoint=True, dtype=int)
                y = signal[index:len(signal)]
                plt.plot(x, y, 'g', alpha=1)
                plt.ylim((-32768, 32767))
                plt.show()
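
    To try the same VAD() function offline, here is a minimal sketch (my own addition, not part of the original post) that runs it on a wav file; the file name test.wav, the use of scipy.io.wavfile and the 16-bit mono assumption are all illustrative:

    # minimal offline usage sketch (assumes a 16-bit mono wav file named test.wav)
    from scipy.io import wavfile

    fs, sig = wavfile.read('test.wav')
    if sig.ndim > 1:
        sig = sig[:, 0]                          # keep a single channel
    segments = VAD(sig.astype(np.float64), fs)   # [[start_sample, end_sample], ...]
    for start, end in segments:
        print('speech from %.2f s to %.2f s' % (start / fs, end / fs))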
        
    

    Results

    Below is the endpoint detection result for the utterance "语音信号处理":
    [Figure: waveform with the detected speech segments highlighted]

  • This article explains how to implement voice endpoint detection with the Python webrtcvad library; it is shared here for reference.
  • Python speech basics -- 4.1 Voice endpoint detection


    The code of 《语音信号处理试验教程》 (Experiment Tutorial of Speech Signal Processing, Liang Ruiyu et al.) is mainly implemented in MATLAB. Since Python is popular now, most of that project has been rewritten in Python, largely by hand. The CSDN posts below serve as its documentation:

    Python speech basics -- 2.1 Recording, playing and reading speech
    Python speech basics -- 2.2 Speech editing
    Python speech basics -- 2.3 Sound intensity and loudness
    Python speech basics -- 2.4 Speech signal generation
    Python speech basics -- 3.1 Framing and windowing
    Python speech basics -- 3.2 Short-time time-domain analysis
    Python speech basics -- 3.3 Short-time frequency-domain analysis
    Python speech basics -- 3.4 Cepstral analysis and MFCC coefficients
    Python speech basics -- 3.5 Linear prediction analysis
    Python speech basics -- 4.1 Voice endpoint detection
    Python speech basics -- 4.2 Pitch period detection
    Python speech basics -- 4.3 Formant estimation
    Python speech basics -- 5.1 Adaptive filtering
    Python speech basics -- 5.2 Spectral subtraction
    Python speech basics -- 5.4 Wavelet decomposition
    Python speech basics -- 6.1 PCM coding
    Python speech basics -- 6.2 LPC coding
    Python speech basics -- 6.3 ADPCM coding
    Python speech basics -- 7.1 Frame merging
    Python speech basics -- 7.2 LPC speech synthesis
    Python speech basics -- 10.1 Isolated-word recognition with dynamic time warping (DTW)
    Python speech basics -- 10.2 Isolated-word recognition with hidden Markov models
    Python speech basics -- 11.1 Speaker emotion recognition with vector quantization (VQ)
    Python speech basics -- 11.2 GMM-based speaker recognition
    Python speech basics -- 12.1 KNN-based emotion recognition
    Python speech basics -- 12.2 Neural-network-based emotion recognition
    Python speech basics -- 12.3 Speech emotion recognition with support vector machines (SVM)
    Python speech basics -- 12.4 Speech emotion recognition with LDA and PCA

    The code can be downloaded from GitHub: busyyang/python_sound_open

    Double-Threshold Method

    Endpoint detection essentially segments the speech signal into unvoiced, voiced and silent parts, based mainly on short-time energy and short-time zero-crossing rate.
    The short-time energy is:
    E_n=\sum_{m=1}^{N} x_n^2(m)

    The short-time zero-crossing rate is:
    Z_n=\frac{1}{2}\sum_{m=1}^{N} |\mathrm{sgn}[x_n(m)]-\mathrm{sgn}[x_n(m-1)]|

    In the double-threshold method, short-time energy separates voiced speech well; unvoiced sounds have low energy and are easily misclassified as silence, while the zero-crossing rate distinguishes silence from unvoiced speech. The method sets two thresholds, one high and one low: when the high threshold is exceeded and the signal then stays above the low threshold for some time, the start of speech is declared.
    Algorithm:

    • Compute the short-time energy and short-time zero-crossing rate of the signal;
    • Choose a high threshold T_2 that most of the speech energy envelope lies above, giving a first rough decision: the speech start and end points lie outside the interval between the intersections N_3, N_4 of this threshold with the short-time energy envelope.
    • Based on the background noise, choose a lower threshold T_1 and search leftwards from the rough start point and rightwards from the rough end point for the intersections N_2, N_5 of the energy envelope with this threshold; the segment from N_2 to N_5 is the speech segment determined by the double energy thresholds.
    • Using the short-time zero-crossing rate, search leftwards from N_2 and rightwards from N_5 for the points N_1 and N_6 where the zero-crossing rate falls below a threshold T_3; these are the final start and end points of the speech segment.

    The thresholds are chosen empirically.
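
    The project code below relies on STEn and STZcr helpers from its companion modules, which are not reproduced here. As a reference, here is a minimal sketch of the two features the double-threshold method uses, assuming a 1-D numpy signal; the function names are my own and are not the project's API:

    import numpy as np

    def short_time_energy(x, wlen, inc):
        """Frame-wise short-time energy E_n = sum over the frame of x_n(m)^2."""
        nframes = (len(x) - wlen) // inc + 1
        return np.array([np.sum(x[i * inc:i * inc + wlen] ** 2) for i in range(nframes)])

    def short_time_zcr(x, wlen, inc):
        """Frame-wise zero-crossing count Z_n = 0.5 * sum |sgn(x(m)) - sgn(x(m-1))|."""
        nframes = (len(x) - wlen) // inc + 1
        zcr = np.zeros(nframes)
        for i in range(nframes):
            frame = x[i * inc:i * inc + wlen]
            zcr[i] = 0.5 * np.sum(np.abs(np.sign(frame[1:]) - np.sign(frame[:-1])))
        return zcr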

    from chapter3_分析实验.C3_1_y_1 import enframe
    from chapter3_分析实验.timefeature import *
    
    
    def findSegment(express):
        """
        Split the detected frames into voice segments.
        :param express: indices (or a 0/1 flag array) of voiced frames
        :return: dict of segments with start, end and duration
        """
        if express[0] == 0:
            voiceIndex = np.where(express)
        else:
            voiceIndex = express
        d_voice = np.where(np.diff(voiceIndex) > 1)[0]
        voiceseg = {}
        if len(d_voice) > 0:
            for i in range(len(d_voice) + 1):
                seg = {}
                if i == 0:
                    st = voiceIndex[0]
                    en = voiceIndex[d_voice[i]]
                elif i == len(d_voice):
                    st = voiceIndex[d_voice[i - 1]+1]
                    en = voiceIndex[-1]
                else:
                    st = voiceIndex[d_voice[i - 1]+1]
                    en = voiceIndex[d_voice[i]]
                seg['start'] = st
                seg['end'] = en
                seg['duration'] = en - st + 1
                voiceseg[i] = seg
        return voiceseg
    
    
    def vad_TwoThr(x, wlen, inc, NIS):
        """
        Detect speech segments with the double-threshold method.
        :param x: speech signal
        :param wlen: frame length
        :param inc: frame shift
        :param NIS: number of leading noise (silence) frames
        :return:
        """
        maxsilence = 15
        minlen = 5
        status = 0
        y = enframe(x, wlen, inc)
        fn = y.shape[0]
        amp = STEn(x, wlen, inc)
        zcr = STZcr(x, wlen, inc, delta=0.01)
        ampth = np.mean(amp[:NIS])
        zcrth = np.mean(zcr[:NIS])
        amp2 = 2 * ampth
        amp1 = 4 * ampth
        zcr2 = 2 * zcrth
        xn = 0
        count = np.zeros(fn)
        silence = np.zeros(fn)
        x1 = np.zeros(fn)
        x2 = np.zeros(fn)
        for n in range(fn):
            if status == 0 or status == 1:
                if amp[n] > amp1:
                    x1[xn] = max(1, n - count[xn] - 1)
                    status = 2
                    silence[xn] = 0
                    count[xn] += 1
                elif amp[n] > amp2 or zcr[n] > zcr2:
                    status = 1
                    count[xn] += 1
                else:
                    status = 0
                    count[xn] = 0
                    x1[xn] = 0
                    x2[xn] = 0
    
            elif status == 2:
                if amp[n] > amp2 and zcr[n] > zcr2:
                    count[xn] += 1
                else:
                    silence[xn] += 1
                    if silence[xn] < maxsilence:
                        count[xn] += 1
                    elif count[xn] < minlen:
                        status = 0
                        silence[xn] = 0
                        count[xn] = 0
                    else:
                        status = 3
                        x2[xn] = x1[xn] + count[xn]
            elif status == 3:
                status = 0
                xn += 1
                count[xn] = 0
                silence[xn] = 0
                x1[xn] = 0
                x2[xn] = 0
        el = len(x1[:xn])
        if x1[el - 1] == 0:
            el -= 1
        if x2[el - 1] == 0:
            print('Error: ending point not found!\n')
            x2[el] = fn
        SF = np.zeros(fn)
        NF = np.ones(fn)
        for i in range(el):
            SF[int(x1[i]):int(x2[i])] = 1
            NF[int(x1[i]):int(x2[i])] = 0
        voiceseg = findSegment(np.where(SF == 1)[0])
        vsl = len(voiceseg.keys())
        return voiceseg, vsl, SF, NF, amp, zcr
    
    
    from chapter2_基础.soundBase import *
    from chapter4_特征提取.vad_TwoThr import *
    
    data, fs = soundBase('C4_1_y.wav').audioread()
    data /= np.max(data)
    N = len(data)
    wlen = 200
    inc = 80
    IS = 0.1
    overlap = wlen - inc
    NIS = int((IS * fs - wlen) // inc + 1)
    fn = (N - wlen) // inc + 1
    
    frameTime = FrameTimeC(fn, wlen, inc, fs)
    time = [i / fs for i in range(N)]
    
    voiceseg, vsl, SF, NF, amp, zcr = vad_TwoThr(data, wlen, inc, NIS)
    
    plt.subplot(3, 1, 1)
    plt.plot(time, data)
    
    plt.subplot(3, 1, 2)
    plt.plot(frameTime, amp)
    
    plt.subplot(3, 1, 3)
    plt.plot(frameTime, zcr)
    
    for i in range(vsl):
        plt.subplot(3, 1, 1)
        plt.plot(frameTime[voiceseg[i]['start']], 1, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 1, 'or')
    
        plt.subplot(3, 1, 2)
        plt.plot(frameTime[voiceseg[i]['start']], 1, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 1, 'or')
    
        plt.subplot(3, 1, 3)
        plt.plot(frameTime[voiceseg[i]['start']], 1, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 1, 'or')
    
    plt.savefig('images/TwoThr.png')
    plt.close()
    
    


    Correlation Method

    Short-time autocorrelation:
    R_n(k)=\sum_{m=1}^{N-k} x_n(m)x_n(m+k), \quad 0 \leqslant k \leqslant K

    where K is the maximum lag. For voiced speech the autocorrelation function can be used to find the pitch period of the waveform. To remove the influence of the absolute energy, the autocorrelation function is normalized:
    R_n(k)=R_n(k)/R_n(0), \quad 0 \leqslant k \leqslant K

    The autocorrelation of a noise-only signal is small, whereas a noisy speech signal gives a comparatively large value, which can be used to locate the start of speech. Two thresholds T_1 and T_2 are set: when the maximum of the correlation function exceeds T_2 the frame is judged to be speech, and the points where that maximum crosses T_1 are taken as the endpoints of the speech segment.
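
    The vad_corr function below calls an STAc helper (short-time autocorrelation) from the project's chapter-3 module, which is not shown in this post. As a rough reference, a per-frame normalized autocorrelation maximum could be computed as in the sketch below; the helper name is my own assumption, not the project's API:

    import numpy as np

    def frame_autocorr_max(frames):
        """Maximum of R_n(k)/R_n(0) over k >= 1 for each frame (frames: (num_frames, wlen))."""
        out = np.zeros(frames.shape[0])
        for i, x in enumerate(frames):
            r = np.correlate(x, x, mode='full')[len(x) - 1:]  # R_n(k) for k = 0..N-1
            if r[0] > 0:
                out[i] = np.max(r[1:] / r[0])
        return out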

    from chapter3_分析实验.C3_1_y_1 import enframe
    from chapter3_分析实验.timefeature import *
    
    
    def vad_forw(dst1, T1, T2):
        fn = len(dst1)
        maxsilence = 8
        minlen = 5
        status = 0
        count = np.zeros(fn)
        silence = np.zeros(fn)
        xn = 0
        x1 = np.zeros(fn)
        x2 = np.zeros(fn)
        for n in range(1, fn):
            if status == 0 or status == 1:
                if dst1[n] > T2:
                    x1[xn] = max(1, n - count[xn] - 1)
                    status = 2
                    silence[xn] = 0
                    count[xn] += 1
                elif dst1[n] > T1:
                    status = 1
                    count[xn] += 1
                else:
                    status = 0
                    count[xn] = 0
                    x1[xn] = 0
                    x2[xn] = 0
            if status == 2:
                if dst1[n] > T1:
                    count[xn] += 1
                else:
                    silence[xn] += 1
                    if silence[xn] < maxsilence:
                        count[xn] += 1
                    elif count[xn] < minlen:
                        status = 0
                        silence[xn] = 0
                        count[xn] = 0
                    else:
                        status = 3
                        x2[xn] = x1[xn] + count[xn]
            if status == 3:
                status = 0
                xn += 1
                count[xn] = 0
                silence[xn] = 0
                x1[xn] = 0
                x2[xn] = 0
        el = len(x1[:xn])
        if x1[el - 1] == 0:
            el -= 1
        if x2[el - 1] == 0:
            print('Error: ending point not found!\n')
            x2[el] = fn
        SF = np.zeros(fn)
        NF = np.ones(fn)
        for i in range(el):
            SF[int(x1[i]):int(x2[i])] = 1
            NF[int(x1[i]):int(x2[i])] = 0
        voiceseg = findSegment(np.where(SF == 1)[0])
        vsl = len(voiceseg.keys())
        return voiceseg, vsl, SF, NF
    
    
    def findSegment(express):
        """
        Split the detected frames into voice segments.
        :param express: indices (or a 0/1 flag array) of voiced frames
        :return: dict of segments with start, end and duration
        """
        if express[0] == 0:
            voiceIndex = np.where(express)
        else:
            voiceIndex = express
        d_voice = np.where(np.diff(voiceIndex) > 1)[0]
        voiceseg = {}
        if len(d_voice) > 0:
            for i in range(len(d_voice) + 1):
                seg = {}
                if i == 0:
                    st = voiceIndex[0]
                    en = voiceIndex[d_voice[i]]
                elif i == len(d_voice):
                    st = voiceIndex[d_voice[i - 1] + 1]
                    en = voiceIndex[-1]
                else:
                    st = voiceIndex[d_voice[i - 1] + 1]
                    en = voiceIndex[d_voice[i]]
                seg['start'] = st
                seg['end'] = en
                seg['duration'] = en - st + 1
                voiceseg[i] = seg
        return voiceseg
    
    
    def vad_corr(y, wnd, inc, NIS, th1, th2):
        x = enframe(y, wnd, inc)
        Ru = STAc(x.T)[0]
        Rum = Ru / np.max(Ru)
        thredth = np.max(Rum[:NIS])
        T1 = th1 * thredth
        T2 = th2 * thredth
        voiceseg, vsl, SF, NF = vad_forw(Rum, T1, T2)
        return voiceseg, vsl, SF, NF, Rum
    
    
    from chapter2_基础.soundBase import *
    from chapter4_特征提取.end_detection import *
    
    data, fs = soundBase('C4_1_y.wav').audioread()
    data -= np.mean(data)
    data /= np.max(data)
    IS = 0.25
    wlen = 200
    inc = 80
    N = len(data)
    time = [i / fs for i in range(N)]
    wnd = np.hamming(wlen)
    NIS = int((IS * fs - wlen) // inc + 1)
    thr1 = 1.1
    thr2 = 1.3
    voiceseg, vsl, SF, NF, Rum = vad_corr(data, wnd, inc, NIS, thr1, thr2)
    fn = len(SF)
    frameTime = FrameTimeC(fn, wlen, inc, fs)
    
    plt.subplot(2, 1, 1)
    plt.plot(time, data)
    plt.subplot(2, 1, 2)
    plt.plot(frameTime, Rum)
    
    for i in range(vsl):
        plt.subplot(2, 1, 1)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['signal', 'start', 'end'])
    
        plt.subplot(2, 1, 2)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['xcorr', 'start', 'end'])
    
    plt.savefig('images/corr.png')
    plt.close()
    
    


    Spectral Entropy Method

    Entropy measures how ordered a signal is, and the entropy of speech differs considerably from that of noise. Spectral-entropy endpoint detection locates speech endpoints by measuring how flat the spectrum is; for the same speech signal, the shape of the spectral entropy curve stays roughly unchanged as the SNR decreases.

    Let the speech signal be x(i) and let the n-th frame after windowing and framing be x_n(m), with FFT X_n(k), where k indexes the spectral lines. The short-time energy of the frame in the frequency domain is:
    E_n=\sum_{k=0}^{N/2} X_n(k)X_n^*(k)

    where N is the FFT length and only the positive frequencies are taken. The energy spectrum of the k-th spectral line is Y_n(k)=X_n(k)X_n^*(k), and the normalized spectral probability density of each frequency component is:
    p_n(k)=\frac{Y_n(k)}{\sum_{l=0}^{N/2} Y_n(l)}=\frac{Y_n(k)}{E_n}

    The short-time spectral entropy of the frame is defined as:
    H_n=-\sum_{k=0}^{N/2} p_n(k)\lg p_n(k)

    The spectral-entropy endpoint detection algorithm:

    • Frame and window the speech signal and choose the FFT length.
    • Compute the spectral energy of each frame.
    • Compute the probability density of each spectral line in each frame.
    • Compute the spectral entropy of each frame.
    • Set the decision threshold.
    • Detect the endpoints from the spectral entropy of the frames.

    The spectral entropy of each frame is computed as:
    H(i)=\sum_{n=0}^{N/2-1} P(n,i)\lg[1/P(n,i)]

    where H(i) is the spectral entropy of frame i. H(i) reflects the variation of the spectral energy distribution rather than its absolute magnitude, which gives the method a degree of robustness in different noise environments.
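
    Before the project code, here is a minimal sketch of the per-frame spectral entropy defined above, assuming the signal has already been split into a (num_frames, wlen) array; the helper name is my own, and the code is a simplification of the band-energy version used in vad_specEN below:

    import numpy as np

    def frame_spectral_entropy(frames):
        """Spectral entropy H_n of each frame, using the positive-frequency energy spectrum."""
        X = np.abs(np.fft.rfft(frames, axis=1))             # |X_n(k)| for positive frequencies
        Y = X ** 2                                           # energy spectrum Y_n(k)
        p = Y / (np.sum(Y, axis=1, keepdims=True) + 1e-10)   # normalized probability p_n(k)
        return -np.sum(p * np.log10(p + 1e-10), axis=1)      # H_n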

    from chapter3_分析实验.C3_1_y_1 import enframe
    from chapter3_分析实验.timefeature import *
    
    
    def vad_revr(dst1, T1, T2):
        """
        Endpoint detection by reverse comparison (used when the feature decreases during speech).
        :param dst1:
        :param T1:
        :param T2:
        :return:
        """
        fn = len(dst1)
        maxsilence = 8
        minlen = 5
        status = 0
        count = np.zeros(fn)
        silence = np.zeros(fn)
        xn = 0
        x1 = np.zeros(fn)
        x2 = np.zeros(fn)
        for n in range(1, fn):
            if status == 0 or status == 1:
                if dst1[n] < T2:
                    x1[xn] = max(1, n - count[xn] - 1)
                    status = 2
                    silence[xn] = 0
                    count[xn] += 1
                elif dst1[n] < T1:
                    status = 1
                    count[xn] += 1
                else:
                    status = 0
                    count[xn] = 0
                    x1[xn] = 0
                    x2[xn] = 0
            if status == 2:
                if dst1[n] < T1:
                    count[xn] += 1
                else:
                    silence[xn] += 1
                    if silence[xn] < maxsilence:
                        count[xn] += 1
                    elif count[xn] < minlen:
                        status = 0
                        silence[xn] = 0
                        count[xn] = 0
                    else:
                        status = 3
                        x2[xn] = x1[xn] + count[xn]
            if status == 3:
                status = 0
                xn += 1
                count[xn] = 0
                silence[xn] = 0
                x1[xn] = 0
                x2[xn] = 0
        el = len(x1[:xn])
        if x1[el - 1] == 0:
            el -= 1
        if x2[el - 1] == 0:
            print('Error: ending point not found!\n')
            x2[el] = fn
        SF = np.zeros(fn)
        NF = np.ones(fn)
        for i in range(el):
            SF[int(x1[i]):int(x2[i])] = 1
            NF[int(x1[i]):int(x2[i])] = 0
        voiceseg = findSegment(np.where(SF == 1)[0])
        vsl = len(voiceseg.keys())
        return voiceseg, vsl, SF, NF
    
    
    def findSegment(express):
        """
        Split the detected frames into voice segments.
        :param express: indices (or a 0/1 flag array) of voiced frames
        :return: dict of segments with start, end and duration
        """
        if express[0] == 0:
            voiceIndex = np.where(express)
        else:
            voiceIndex = express
        d_voice = np.where(np.diff(voiceIndex) > 1)[0]
        voiceseg = {}
        if len(d_voice) > 0:
            for i in range(len(d_voice) + 1):
                seg = {}
                if i == 0:
                    st = voiceIndex[0]
                    en = voiceIndex[d_voice[i]]
                elif i == len(d_voice):
                    st = voiceIndex[d_voice[i - 1] + 1]
                    en = voiceIndex[-1]
                else:
                    st = voiceIndex[d_voice[i - 1] + 1]
                    en = voiceIndex[d_voice[i]]
                seg['start'] = st
                seg['end'] = en
                seg['duration'] = en - st + 1
                voiceseg[i] = seg
        return voiceseg
    
    
    def vad_specEN(data, wnd, inc, NIS, thr1, thr2, fs):
        import matplotlib.pyplot as plt
        from scipy.signal import medfilt
        x = enframe(data, wnd, inc)
        X = np.abs(np.fft.fft(x, axis=1))
        if len(wnd) == 1:
            wlen = wnd
        else:
            wlen = len(wnd)
        df = fs / wlen
        fx1 = int(250 // df + 1)   # bin index of 250 Hz
        fx2 = int(3500 // df + 1)  # bin index of 3500 Hz
        km = wlen // 8
        K = 0.5
        E = np.zeros((X.shape[0], wlen // 2))
        E[:, fx1 + 1:fx2 - 1] = X[:, fx1 + 1:fx2 - 1]
        E = np.multiply(E, E)
        Esum = np.sum(E, axis=1, keepdims=True)
        P1 = np.divide(E, Esum)
        E = np.where(P1 >= 0.9, 0, E)
        Eb0 = E[:, 0::4]
        Eb1 = E[:, 1::4]
        Eb2 = E[:, 2::4]
        Eb3 = E[:, 3::4]
        Eb = Eb0 + Eb1 + Eb2 + Eb3
        prob = np.divide(Eb + K, np.sum(Eb + K, axis=1, keepdims=True))
        Hb = -np.sum(np.multiply(prob, np.log10(prob + 1e-10)), axis=1)
        for i in range(10):
            Hb = medfilt(Hb, 5)
        Me = np.mean(Hb)
        eth = np.mean(Hb[:NIS])
        Det = eth - Me
        T1 = thr1 * Det + Me
        T2 = thr2 * Det + Me
        voiceseg, vsl, SF, NF = vad_revr(Hb, T1, T2)
        return voiceseg, vsl, SF, NF, Hb
    
    
    from chapter2_基础.soundBase import *
    from chapter4_特征提取.end_detection import *
    
    data, fs = soundBase('C4_1_y.wav').audioread()
    data -= np.mean(data)
    data /= np.max(data)
    IS = 0.25
    wlen = 200
    inc = 80
    N = len(data)
    time = [i / fs for i in range(N)]
    wnd = np.hamming(wlen)
    overlap = wlen - inc
    NIS = int((IS * fs - wlen) // inc + 1)
    thr1 = 0.99
    thr2 = 0.96
    voiceseg, vsl, SF, NF, Enm = vad_specEN(data, wnd, inc, NIS, thr1, thr2, fs)
    
    fn = len(SF)
    frameTime = FrameTimeC(fn, wlen, inc, fs)
    
    plt.subplot(2, 1, 1)
    plt.plot(time, data)
    plt.subplot(2, 1, 2)
    plt.plot(frameTime, Enm)
    
    for i in range(vsl):
        plt.subplot(2, 1, 1)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['signal', 'start', 'end'])
    
        plt.subplot(2, 1, 2)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['熵谱', 'start', 'end'])
    
    plt.savefig('images/En.png')
    plt.close()
    
    


    Ratio Methods

    Under noise, the short-time energy and zero-crossing rate of the signal may change enough to affect endpoint detection. Dividing the energy by the zero-crossing rate emphasizes the speech region and makes the endpoints easier to detect. The short-time energy is first redefined as:
    LE_n=\lg(1+E_n/a)

    where a is a constant whose value can be chosen to separate noise from unvoiced speech. For the zero-crossing rate the signal is first clipped:
    \hat x_n(m)=\begin{cases} x_n(m), & |x_n(m)|>\sigma \\ 0, & |x_n(m)|\leqslant\sigma \end{cases}

    The energy-to-zero-crossing ratio is then:
    EZR_n=LE_n/(ZCR_n+b)

    where b is a small constant that prevents division by zero.

    The energy-entropy ratio is:
    EEF_n=\sqrt{1+|LE_n/H_n|}
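
    As a reference before the project code, here is a minimal sketch of the two ratios defined above, computed directly on a (num_frames, wlen) frame array; the helper names and the constants a and b are my own illustrative choices:

    import numpy as np

    def energy_zcr_ratio(frames, a=2.0, b=1.0):
        """EZR_n = LE_n / (ZCR_n + b), with LE_n = lg(1 + E_n / a)."""
        LEn = np.log10(1 + np.sum(frames ** 2, axis=1) / a)
        zcr = 0.5 * np.sum(np.abs(np.diff(np.sign(frames), axis=1)), axis=1)
        return LEn / (zcr + b)

    def energy_entropy_ratio(frames, a=2.0):
        """EEF_n = sqrt(1 + |LE_n / H_n|), with H_n the spectral entropy of the frame."""
        X = np.abs(np.fft.rfft(frames, axis=1))
        p = X / (np.sum(X, axis=1, keepdims=True) + 1e-10)
        Hn = -np.sum(p * np.log10(p + 1e-10), axis=1)
        LEn = np.log10(1 + np.sum(X ** 2, axis=1) / a)
        return np.sqrt(1 + np.abs(LEn / Hn))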

    def vad_pro(data, wnd, inc, NIS, thr1, thr2, mode):
        from scipy.signal import medfilt
        x = enframe(data, wnd, inc)
        if len(wnd) == 1:
            wlen = wnd
        else:
            wlen = len(wnd)
        if mode == 1:
            a = 2
            b = 1
            LEn = np.log10(1 + np.sum(np.multiply(x, x) / a, axis=1))
            EZRn = LEn / (STZcr(data, wlen, inc) + b)
            for i in range(10):
                EZRn = medfilt(EZRn, 5)
            dth = np.mean(EZRn[:NIS])
            T1 = thr1 * dth
            T2 = thr2 * dth
            Epara = EZRn
        elif mode == 2:
            a = 2
            X = np.abs(np.fft.fft(x, axis=1))
            X = X[:, :wlen // 2]
            Esum = np.log10(1 + np.sum(np.multiply(X, X) / a, axis=1))
            prob = X / np.sum(X, axis=1, keepdims=True)
            Hn = -np.sum(np.multiply(prob, np.log10(prob + 1e-10)), axis=1)
            Ef = np.sqrt(1 + np.abs(Esum / Hn))
            for i in range(10):
                Ef = medfilt(Ef, 5)
            Me = np.max(Ef)
            eth = np.mean(Ef[:NIS])
            Det = Me - eth
            T1 = thr1 * Det + eth
            T2 = thr2 * Det + eth
            Epara = Ef
        voiceseg, vsl, SF, NF = vad_forw(Epara, T1, T2)
        return voiceseg, vsl, SF, NF, Epara
    
    from chapter2_基础.soundBase import *
    from chapter4_特征提取.end_detection import *
    
    data, fs = soundBase('C4_1_y.wav').audioread()
    data -= np.mean(data)
    data /= np.max(data)
    IS = 0.25
    wlen = 200
    inc = 80
    N = len(data)
    time = [i / fs for i in range(N)]
    wnd = np.hamming(wlen)
    overlap = wlen - inc
    NIS = int((IS * fs - wlen) // inc + 1)
    
    mode = 2
    if mode == 1:
        thr1 = 3
        thr2 = 4
        tlabel = '能零比'
    elif mode == 2:
        thr1 = 0.05
        thr2 = 0.1
        tlabel = '能熵比'
    voiceseg, vsl, SF, NF, Epara = vad_pro(data, wnd, inc, NIS, thr1, thr2, mode)
    
    fn = len(SF)
    frameTime = FrameTimeC(fn, wlen, inc, fs)
    
    plt.subplot(2, 1, 1)
    plt.plot(time, data)
    plt.subplot(2, 1, 2)
    plt.plot(frameTime, Epara)
    
    for i in range(vsl):
        plt.subplot(2, 1, 1)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['signal', 'start', 'end'])
    
        plt.subplot(2, 1, 2)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend([tlabel, 'start', 'end'])
    
    plt.savefig('images/{}.png'.format(tlabel))
    plt.close()
    
    


    Log-Spectral Distance

    The FFT of a speech frame is:
    X_i(k)=\sum_{m=0}^{N-1} x_i(m)\exp\left(j\frac{2\pi mk}{N}\right),\quad k=0,1,...,N-1

    Taking the logarithm of X_i(k):
    \hat X_i(k)=20\lg|X_i(k)|

    The log-spectral distance between two signals x_1(n) and x_2(n) is defined as:
    d_{spec}(i)=\frac{1}{N_2}\sum_{k=0}^{N_2-1}[\hat X_i^1(k)-\hat X_i^2(k)]^2

    where N_2 means that only the positive frequencies are used, i.e. N_2=N/2+1. An average noise spectrum can be computed in advance, and the distance between the segment under test and this average noise then indicates whether the segment is noise or speech.
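
    A direct implementation of the distance formula above would look like the sketch below (my own illustration; note that the vad_LogSpec function that follows uses a clipped mean of the dB difference rather than the squared distance):

    import numpy as np

    def log_spectral_distance(frame_spec, noise_spec, eps=1e-10):
        """Mean squared difference of the log spectra over the positive frequencies."""
        d = 20 * np.log10(frame_spec + eps) - 20 * np.log10(noise_spec + eps)
        return np.mean(d ** 2)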

    def vad_LogSpec(signal, noise, NoiseCounter=0, NoiseMargin=3, Hangover=8):
        """
        Voice endpoint detection based on the log-spectral distance.
        :param signal: magnitude spectrum of the current frame
        :param noise: estimated noise magnitude spectrum
        :param NoiseCounter: number of consecutive noise frames seen so far
        :param NoiseMargin: decision threshold (dB)
        :param Hangover: number of consecutive noise frames after which speech is declared ended
        :return:
        """
        SpectralDist = 20 * (np.log10(signal) - np.log10(noise))
        SpectralDist = np.where(SpectralDist < 0, 0, SpectralDist)
        Dist = np.mean(SpectralDist)
        if Dist < NoiseMargin:
            NoiseFlag = 1
            NoiseCounter += 1
        else:
            NoiseFlag = 0
            NoiseCounter = 0
        if NoiseCounter > Hangover:
            SpeechFlag = 0
        else:
            SpeechFlag = 1
        return NoiseFlag, SpeechFlag, NoiseCounter, Dist
    
    from chapter2_基础.soundBase import *
    from chapter4_特征提取.end_detection import *
    
    
    def awgn(x, snr):
        snr = 10 ** (snr / 10.0)
        xpower = np.sum(x ** 2) / len(x)
        npower = xpower / snr
        return np.random.randn(len(x)) * np.sqrt(npower) + x
    
    
    data, fs = soundBase('C4_1_y.wav').audioread()
    data -= np.mean(data)
    data /= np.max(data)
    IS = 0.25
    wlen = 200
    inc = 80
    SNR = 10
    N = len(data)
    time = [i / fs for i in range(N)]
    wnd = np.hamming(wlen)
    overlap = wlen - inc
    NIS = int((IS * fs - wlen) // inc + 1)
    signal = awgn(data, SNR)
    
    y = enframe(signal, wnd, inc)
    frameTime = FrameTimeC(y.shape[0], wlen, inc, fs)
    
    Y = np.abs(np.fft.fft(y, axis=1))
    Y = Y[:, :wlen // 2]
    N = np.mean(Y[:NIS, :], axis=0)
    NoiseCounter = 0
    SF = np.zeros(y.shape[0])
    NF = np.zeros(y.shape[0])
    D = np.zeros(y.shape[0])
    # leading (noise-only) frames: NF=1, SF=0
    SF[:NIS] = 0
    NF[:NIS] = 1
    for i in range(NIS, y.shape[0]):
        NoiseFlag, SpeechFlag, NoiseCounter, Dist = vad_LogSpec(Y[i, :], N, NoiseCounter, 2.5, 8)
        SF[i] = SpeechFlag
        NF[i] = NoiseFlag
        D[i] = Dist
    sindex = np.where(SF == 1)
    voiceseg = findSegment(np.where(SF == 1)[0])
    vosl = len(voiceseg)
    
    plt.subplot(3, 1, 1)
    plt.plot(time, data)
    plt.subplot(3, 1, 2)
    plt.plot(time, signal)
    plt.subplot(3, 1, 3)
    plt.plot(frameTime, D)
    
    for i in range(vosl):
        plt.subplot(3, 1, 1)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['signal', 'start', 'end'])
    
        plt.subplot(3, 1, 2)
        plt.plot(frameTime[voiceseg[i]['start']], 0, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 0, 'or')
        plt.legend(['noised', 'start', 'end'])
    
        plt.subplot(3, 1, 3)
        plt.plot(frameTime[voiceseg[i]['start']], 1, '.k')
        plt.plot(frameTime[voiceseg[i]['end']], 1, 'or')
        plt.legend(['对数频率距离', 'start', 'end'])
    
    plt.savefig('images/对数频率距离.png')
    plt.close()
    
    

  • Voice endpoint detection with the Python webrtcvad library


    The source code for this article is at https://github.com/wangshub/python-vad

    Introduction

    Voice endpoint detection was first used in telephone transmission and switching systems to allocate time on communication channels and improve line utilization. Endpoint detection is a front-end stage of a speech processing system and matters greatly in speech detection.
    Detecting the points where a human voice starts and ends, however, is still a technically difficult problem; vendors can usually make a judgement but cannot guarantee its accuracy.
    Chatbots backed by cloud-based semantic services are now everywhere, the best known being Amazon's Alexa/Echo smart speaker.

    In China, smart speakers with voice chat (such as the Rokid robot recently advertised on Zhihu) and all kinds of intelligent robots have sprung up. Domestic providers mainly target Chinese-language services. Unlike images, speech has no objective metric such as resolution, so judging the quality of different vendors' recognition and synthesis is largely subjective; in my view, though, Chinese-language services in China are in some respects starting to surpass English-language services abroad.

    A chatbot system usually consists of three parts:
    * Speech-to-text (ASR/STT)
    * Natural language understanding (NLU/NLP)
    * Text-to-speech (TTS)

    Speech-to-Text (ASR/STT)

    Before speech is sent to a cloud API it is captured locally on the front end, which mainly involves:
    * Microphone noise reduction
    * Sound source localization
    * Echo cancellation
    * Wake word
    * Voice endpoint detection
    * Audio compression

    Endpoint Detection in Python

    In practice it is hard to determine the start of speech from energy or similar features alone, so most commercial voice products use a wake word to mark the start of speech; with an added audio loop, barge-in is also possible. This kind of interaction can feel a bit clumsy, since you have to shout the wake word before every utterance, and it gets tiring after a while. The snowboy wake-word library is open-sourced on GitHub, and you can train your own wake-word model on the snowboy website.
    * Kitt-AI : Snowboy
    * Sensory : Sensory

    Since using a wake word all the time is tiring, I looked around a bit; Python has rich libraries that you can simply import and use. This approach is easily disturbed by strong noise and is best suited to playing with at home.
    * pyaudio: pip install pyaudio — reads the raw PCM audio stream from the device;
    * webrtcvad: pip install webrtcvad — decides whether a chunk of audio contains speech;
    when VAD reports speech activity continuously for a duration T1, the start of speech is declared;
    when VAD reports no speech activity continuously for a duration T2, the end of speech is declared. A minimal sketch of this counting logic follows below.
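
    A minimal sketch of that counting logic (my own simplification of the full program below): keep a sliding window of per-frame VAD flags and declare a start when most of the window is voiced. The frame length, window size and the 0.8 ratio are illustrative values:

    import collections
    import webrtcvad

    RATE = 16000
    vad = webrtcvad.Vad(1)                   # aggressiveness 0-3
    window = collections.deque(maxlen=10)    # last 10 frames of 30 ms = 300 ms

    def speech_started(frame_bytes):
        """frame_bytes: one 30 ms chunk of 16-bit mono PCM at 16 kHz."""
        window.append(1 if vad.is_speech(frame_bytes, RATE) else 0)
        return sum(window) > 0.8 * window.maxlen  # mostly voiced -> start of speech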

    The complete program can be downloaded from my repository: https://github.com/wangshub/python-vad
    The program is simple; a quick read should make it clear.

    '''
    Requirements:
    + pyaudio - `pip install pyaudio`
    + py-webrtcvad - `pip install webrtcvad`
    '''
    import webrtcvad
    import collections
    import sys
    import signal
    import pyaudio
    
    from array import array
    from struct import pack
    import wave
    import time
    
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    CHUNK_DURATION_MS = 30       # supports 10, 20 and 30 (ms)
    PADDING_DURATION_MS = 1500   # 1.5 s of padding
    CHUNK_SIZE = int(RATE * CHUNK_DURATION_MS / 1000)  # chunk to read
    CHUNK_BYTES = CHUNK_SIZE * 2  # 16bit = 2 bytes, PCM
    NUM_PADDING_CHUNKS = int(PADDING_DURATION_MS / CHUNK_DURATION_MS)
    # NUM_WINDOW_CHUNKS = int(240 / CHUNK_DURATION_MS)
    NUM_WINDOW_CHUNKS = int(400 / CHUNK_DURATION_MS)  # 400 ms / 30 ms
    NUM_WINDOW_CHUNKS_END = NUM_WINDOW_CHUNKS * 2
    
    START_OFFSET = int(NUM_WINDOW_CHUNKS * CHUNK_DURATION_MS * 0.5 * RATE)
    
    vad = webrtcvad.Vad(1)
    
    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT,
                     channels=CHANNELS,
                     rate=RATE,
                     input=True,
                     start=False,
                     # input_device_index=2,
                     frames_per_buffer=CHUNK_SIZE)
    
    
    got_a_sentence = False
    leave = False
    
    
    def handle_int(sig, chunk):
        global leave, got_a_sentence
        leave = True
        got_a_sentence = True
    
    
    def record_to_file(path, data, sample_width):
        "Records from the microphone and outputs the resulting data to 'path'"
        # sample_width, data = record()
        data = pack('<' + ('h' * len(data)), *data)
        wf = wave.open(path, 'wb')
        wf.setnchannels(1)
        wf.setsampwidth(sample_width)
        wf.setframerate(RATE)
        wf.writeframes(data)
        wf.close()
    
    
    def normalize(snd_data):
        "Average the volume out"
        MAXIMUM = 32767  # 16384
        times = float(MAXIMUM) / max(abs(i) for i in snd_data)
        r = array('h')
        for i in snd_data:
            r.append(int(i * times))
        return r
    
    signal.signal(signal.SIGINT, handle_int)
    
    while not leave:
        ring_buffer = collections.deque(maxlen=NUM_PADDING_CHUNKS)
        triggered = False
        voiced_frames = []
        ring_buffer_flags = [0] * NUM_WINDOW_CHUNKS
        ring_buffer_index = 0
    
        ring_buffer_flags_end = [0] * NUM_WINDOW_CHUNKS_END
        ring_buffer_index_end = 0
        buffer_in = ''
        # WangS
        raw_data = array('h')
        index = 0
        start_point = 0
        StartTime = time.time()
        print("* recording: ")
        stream.start_stream()
    
        while not got_a_sentence and not leave:
            chunk = stream.read(CHUNK_SIZE)
            # add WangS
            raw_data.extend(array('h', chunk))
            index += CHUNK_SIZE
            TimeUse = time.time() - StartTime
    
            active = vad.is_speech(chunk, RATE)
    
            sys.stdout.write('1' if active else '_')
            ring_buffer_flags[ring_buffer_index] = 1 if active else 0
            ring_buffer_index += 1
            ring_buffer_index %= NUM_WINDOW_CHUNKS
    
            ring_buffer_flags_end[ring_buffer_index_end] = 1 if active else 0
            ring_buffer_index_end += 1
            ring_buffer_index_end %= NUM_WINDOW_CHUNKS_END
    
            # start point detection
            if not triggered:
                ring_buffer.append(chunk)
                num_voiced = sum(ring_buffer_flags)
                if num_voiced > 0.8 * NUM_WINDOW_CHUNKS:
                    sys.stdout.write(' Open ')
                    triggered = True
                    start_point = index - CHUNK_SIZE * 20  # start point
                    # voiced_frames.extend(ring_buffer)
                    ring_buffer.clear()
            # end point detection
            else:
                # voiced_frames.append(chunk)
                ring_buffer.append(chunk)
                num_unvoiced = NUM_WINDOW_CHUNKS_END - sum(ring_buffer_flags_end)
                if num_unvoiced > 0.90 * NUM_WINDOW_CHUNKS_END or TimeUse > 10:
                    sys.stdout.write(' Close ')
                    triggered = False
                    got_a_sentence = True
    
            sys.stdout.flush()
    
        sys.stdout.write('\n')
        # data = b''.join(voiced_frames)
    
        stream.stop_stream()
        print("* done recording")
        got_a_sentence = False
    
        # write to file
        raw_data.reverse()
        for index in range(start_point):
            raw_data.pop()
        raw_data.reverse()
        raw_data = normalize(raw_data)
        record_to_file("recording.wav", raw_data, 2)
        leave = True
    
    stream.close()

    Run the program with: sudo python vad.py

  • 【语音识别】Voice endpoint detection implemented in Python


    Accurately detecting the endpoints where the human voice starts and ends in the received signal is a prerequisite for speech recognition. This post introduces a basic endpoint detection method based on the short-time zero-crossing rate and short-time energy, and its Python implementation. The figure below shows a speech signal with the voiced part marked by a red box:
    [Figure: speech waveform with the voiced region boxed in red]

    1. Framing the Speech Signal

    A speech signal is a time series with long-term randomness and short-term stationarity. Long-term randomness means that the signal is a random process over time; short-term stationarity means that its characteristics are essentially unchanged over a short interval, because the articulatory muscles have inertia and cannot switch instantaneously from one state to another. Speech is usually fairly stationary over 10-30 ms, so the first step in most speech processing is to split the signal into frames of roughly 10-30 ms.
    Framing is usually done with a sliding window such as a rectangular or Hamming window. The window length determines how much of the original signal each frame contains; if the window slides by its full length the frames do not overlap, and if it slides by less than its length the frames overlap. This post frames the signal with a rectangular window:

    Rectangular window:
    h(n)=\begin{cases} 1, & 0\le n\le N-1 \\ 0, & \text{otherwise} \end{cases}
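
    A compact framing sketch (my own, equivalent in spirit to the enframe function used below) that cuts a 1-D signal into overlapping frames with a rectangular window:

    import numpy as np

    def frame_signal(x, wlen, inc):
        """Return a (num_frames, wlen) array of frames, hopping inc samples each time."""
        nframes = (len(x) - wlen) // inc + 1
        idx = np.arange(wlen)[None, :] + inc * np.arange(nframes)[:, None]
        return x[idx]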

    2. Endpoint Detection Method

    Endpoint detection means finding the points where the human voice starts and ends. The short-time characteristics of voiced frames differ from those of non-voice frames, which makes it possible to locate these points effectively. This post combines short-time energy and the short-time zero-crossing rate for endpoint detection.

    2.1 Short-Time Energy

    The short-time average energy of the n-th frame is defined as:
    E_n=\sum_{m=n-N+1}^{n}[x(m)\,w(n-m)]^2
    Frames that contain speech have a larger short-time average energy than non-speech frames.

    2.2 Short-Time Zero-Crossing Rate

    A zero crossing occurs when the signal passes through zero, i.e. when two adjacent samples have opposite signs; the zero-crossing count is the number of such sign changes.
    The average short-time zero-crossing count of the n-th frame is:
    Z_n=\sum_{m=n-N+1}^{n}|\mathrm{sgn}[x(m)]-\mathrm{sgn}[x(m-1)]|\,w(n-m)

    w(n)=\begin{cases} 1/(2N), & 0\le n\le N-1 \\ 0, & \text{otherwise} \end{cases}

    3. Python Implementation

    import wave
    import numpy as np
    import matplotlib.pyplot as plt
    
    def read(data_path):
        '''Read the speech signal.
        '''
        wavepath = data_path
        f = wave.open(wavepath, 'rb')
        params = f.getparams()
        nchannels, sampwidth, framerate, nframes = params[:4]  # channels, sample width, sampling rate, number of samples
        str_data = f.readframes(nframes)  # read the audio as a byte string
        f.close()
        wavedata = np.frombuffer(str_data, dtype=np.short)  # convert the byte string to an int16 array
        wavedata = wavedata * 1.0 / (max(abs(wavedata)))  # normalize the amplitude
        return wavedata, nframes, framerate
    
    def plot(data,time):
        plt.plot(time,data)
        plt.grid('on')
        plt.show()
    
    def enframe(data, win, inc):
        '''Split the speech data into frames.
        input:  data (1-D array): speech signal
                win (int or array): window, or the window length if a scalar
                inc (int): hop size of the window
        output: f (2-D array): data of each sliding window, one frame per row
        '''
        nx = len(data)  # length of the speech signal
        try:
            nwin = len(win)
        except Exception as err:
            nwin = 1
        if nwin == 1:
            wlen = win
        else:
            wlen = nwin
        nf = int(np.fix((nx - wlen) / inc) + 1)  # number of frames
        f = np.zeros((nf, wlen))  # initialize the 2-D array
        indf = [inc * j for j in range(nf)]
        indf = (np.mat(indf)).T
        inds = np.mat(range(wlen))
        indf_tile = np.tile(indf,wlen)
        inds_tile = np.tile(inds,(nf,1))
        mix_tile = indf_tile + inds_tile
        f = np.zeros((nf,wlen))
        for i in range(nf):
            for j in range(wlen):
                f[i,j] = data[mix_tile[i,j]]
        return f
    
    def point_check(wavedata, win, inc):
        '''Voice endpoint detection.
        input:  wavedata (1-D array): original speech signal
        output: StartPoint (int): start point of the speech
                EndPoint (int): end point of the speech
        '''
        # 1. short-time zero-crossing rate
        FrameTemp1 = enframe(wavedata[0:-1], win, inc)
        FrameTemp2 = enframe(wavedata[1:], win, inc)
        signs = np.sign(np.multiply(FrameTemp1, FrameTemp2))  # adjacent samples with opposite signs mark a zero crossing
        signs = list(map(lambda x: [[i, 0][i > 0] for i in x], signs))
        signs = list(map(lambda x: [[i, 1][i < 0] for i in x], signs))
        diffs = np.sign(abs(FrameTemp1 - FrameTemp2) - 0.01)
        diffs = list(map(lambda x: [[i, 0][i < 0] for i in x], diffs))
        zcr = list((np.multiply(signs, diffs)).sum(axis=1))
        # 2. short-time energy
        amp = list((abs(enframe(wavedata, win, inc))).sum(axis=1))
        # set the thresholds
        ZcrLow = max([round(np.mean(zcr) * 0.1), 3])  # low zero-crossing threshold
        ZcrHigh = max([round(max(zcr) * 0.1), 5])  # high zero-crossing threshold
        AmpLow = min([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])  # low energy threshold
        AmpHigh = max([min(amp) * 10, np.mean(amp) * 0.2, max(amp) * 0.1])  # high energy threshold
        # endpoint detection
        MaxSilence = 8  # longest allowed gap inside speech (frames)
        MinAudio = 16  # shortest allowed speech duration (frames)
        Status = 0  # state 0: silence, 1: transition, 2: speech, 3: end
        HoldTime = 0  # speech duration (frames)
        SilenceTime = 0  # gap duration (frames)
        print('start endpoint detection')
        StartPoint = 0
        for n in range(len(zcr)):
            if Status ==0 or Status == 1:
                if amp[n] > AmpHigh or zcr[n] > ZcrHigh:
                    StartPoint = n - HoldTime
                    Status = 2
                    HoldTime = HoldTime + 1
                    SilenceTime = 0
                elif amp[n] > AmpLow or zcr[n] > ZcrLow:
                    Status = 1
                    HoldTime = HoldTime + 1
                else:
                    Status = 0
                    HoldTime = 0
            elif Status == 2:
                if amp[n] > AmpLow or zcr[n] > ZcrLow:
                    HoldTime = HoldTime + 1
                else:
                    SilenceTime = SilenceTime + 1
                    if SilenceTime < MaxSilence:
                        HoldTime = HoldTime + 1
                    elif (HoldTime - SilenceTime) < MinAudio:
                        Status = 0
                        HoldTime = 0
                        SilenceTime = 0
                    else:
                        Status = 3
            elif Status == 3:
                break
            if Status == 3:
                break
        HoldTime = HoldTime - SilenceTime
        EndPoint = StartPoint + HoldTime
        return StartPoint,EndPoint,FrameTemp1
       
        
    if __name__ == '__main__':
        data_path = 'audio_data.wav'
        win = 240
        inc = 80
        wavedata,nframes,framerate = read(data_path)
        time_list = np.array(range(0,nframes)) * (1.0 / framerate)
        plot(wavedata,time_list)
        StartPoint,EndPoint,FrameTemp = point_check(wavedata,win,inc)
        # checkdata, Framecheck = check_signal(StartPoint, EndPoint, FrameTemp, win, inc)  # check_signal is not defined in this excerpt
        print('StartPoint: %d, EndPoint: %d (frames)' % (StartPoint, EndPoint))
    

    Endpoint detection result:

    [Figure: detected start and end points of the speech]

  • A voice endpoint detection package containing a Python project and 17 related technical documents, including several papers; probably the most complete collection of material on this site.
  • Voice activity detection (VAD) in Python. 1. Environment: the vad.py file from https://github.com/marsbroshok/VAD-python. 2. Code: from vad import VoiceActivityDetector import wave if __name__ =...
  • An energy-based VAD endpoint detection function. AudioVAD takes two parameters: a short* data buffer and a long data length, and returns an int (1 or 0) indicating whether the segment is voice or silence. Test PCM data is attached.
  • MATLAB code: double-threshold endpoint detection based on energy and zero-crossing rate
  • Double-threshold voice endpoint detection (Python implementation)

    This post introduces voice endpoint detection with the double-threshold method, which mainly uses the short-time energy and short-time zero-crossing rate of speech. Endpoint detection means accurately determining the start and end of speech within a signal that contains it, separating the speech segments from the non-speech segments...
  • 1. A sample model trained in Python with Keras. 2. Voice endpoint detection in C# using the model trained in Python.
  • webrtcvad in Python -- voice endpoint detection

    Notes on the py-webrtcvad voice endpoint detection algorithm: the WebRTC VAD models speech and noise with a GMM (Gaussian Mixture Model) and decides between them by their likelihoods. The advantage of this algorithm is that it is unsupervised and needs no strict training. The GMM noise and speech models are as follows: p...
  • VAD (Voice Activity Detector) in Python, performing endpoint detection on streaming input. Depends on pyaudio. Tested on: Distributor ID: Ubuntu Description: Ubuntu 12.04.5 LTS Release: 12.04 Codename: precise Linux ubuntu 3.13.0...
  • Programs I wrote for speech framing, windowing, denoising and endpoint detection; they can be debugged successfully.
  • An algorithm for endpoint detection based on cepstral features, used to recognize speech signals.
  • Endpoint detection of noisy speech based on cepstral features, extracting and labelling the useful part of the speech by exploiting the advantages of cepstral features.
  • Understanding the double-threshold method for endpoint detection

    The double-threshold method mainly uses short-time energy and the short-time zero-crossing rate. Short-time energy separates voiced sounds (high energy) from unvoiced sounds (low energy); the zero-crossing rate separates unvoiced consonants (high ZCR) from silence (low ZCR). The two ends of the speech...
  • Endpoint detection of speech signals

    There are many ways to detect the endpoints of a speech signal. A simple approach is to compute the volume of the sound and take the portions where the volume exceeds a threshold as speech; the points where those portions cross the threshold are the endpoints, and the remaining frames are treated as non-speech. Computing the volume...
  • End-to-end online keyword spotting / activity detection based on TensorFlow
  • Audio Split: voice endpoint detection and speech segmentation based on the double-threshold method. The code is on my GitHub at voice_activity_detection; if you find it even slightly useful, please give it a star. Double-threshold segmentation based on short-time energy and zero-crossing rate...
  • Endpoint protection (EDR): More and more organizations provide access to APIs in order to enable a wider audience to use their information. This is why securing API access has become a critical concern. With the ...
  • Using VAD to remove silent segments from wav files: in this post I describe, with some admittedly clumsy code, how I batch-process the silence in wav files with Python and write the results to a new folder. Advantage: it removes redundant...
  • hand-book-of-speech-enhancement-and-recognition: https://shichaog1.gitbooks.io/hand-book-of-speech-enhancement-and-recognition/content/chapter7.html 5. The WebRTC VAD algorithm (Python package): ...
  • A new algorithm is proposed to address the poor low-SNR performance and lack of environmental adaptivity of endpoint detection based on short-time energy and zero-crossing rate. It estimates the ambient noise with the minimum short-time energy and optimizes the parameter extraction, improving the noise robustness and adaptivity of the parameters...
  • Analysis of a C# voice endpoint detection (VAD) implementation

    Early methods were mostly based on acoustic features. In the time domain, Rabiner et al. proposed an endpoint detection method based on short-time energy and zero-crossing rate in 1975, the first systematic and complete voice endpoint detection algorithm. The method uses three thresholds; the first two are derived from the short-time energy...
