2019-06-06 19:02:47 Luqiang_Shi · 1,817 views

Accurately detecting the endpoints where voiced speech begins and ends in a received audio signal is a prerequisite for speech recognition. This post introduces a basic endpoint detection method based on the short-time zero-crossing rate and short-time energy, together with a Python implementation. The figure below shows a speech signal; the red box marks the voiced segment:
[Figure: speech waveform with the voiced segment boxed in red]

1. Framing the speech signal

A speech signal is a time series with long-term randomness and short-term stationarity. Long-term randomness means the signal evolves over time as a random process; short-term stationarity means its characteristics stay essentially constant over a short interval, because the articulatory muscles have inertia and cannot jump instantaneously from one state to another. Speech is relatively stationary over roughly 10-30 ms, so the first step of almost any speech-processing pipeline is to split the signal into frames, with a frame length typically of 10-30 ms.
Framing is usually done with a sliding window, such as a rectangular or Hamming window. The window length determines how much of the original signal each frame contains. If the window slides by exactly its own length at each step, the frames do not overlap; if it slides by less than its length, consecutive frames overlap. This post frames the signal with a rectangular window:

Rectangular window:
$$h(n)=\begin{cases}1, & 0 \le n \le N-1\\ 0, & \text{otherwise}\end{cases}$$
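As a quick numeric check of the framing arithmetic, the sketch below computes the frame count for one second of audio with the 240-sample window and 80-sample hop used in section 3 (the 16 kHz sampling rate is an assumed example value):

import numpy as np
nx, wlen, inc = 16000, 240, 80           # one second at 16 kHz, 15 ms window, 5 ms hop
nf = int(np.fix((nx - wlen) / inc) + 1)  # same frame-count formula as enframe() below
print(nf)                                # -> 198 frames, each overlapping the next by 160 samples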

2. Endpoint detection method

Endpoint detection means locating the points where voiced speech starts and ends. The short-time characteristics of speech differ from those of non-speech, and this difference can be exploited to find the endpoints reliably. This post combines short-time energy with the short-time zero-crossing rate.

2.1 Short-time energy

The short-time average energy of frame n is defined as:
$$E_n=\sum_{m=n-N+1}^{n}\bigl[x(m)\,w(n-m)\bigr]^2$$
Frames containing voiced speech have a larger short-time average energy than frames that do not.
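A minimal numpy rendering of this definition with a rectangular window (the frame matrix here is a random stand-in; in practice it would come from a framing routine such as enframe() in section 3):

import numpy as np
frames = np.random.randn(198, 240)  # stand-in frame matrix, one frame per row
En = np.sum(frames ** 2, axis=1)    # short-time energy of every frame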

2.2 Short-time zero-crossing rate

A zero crossing occurs whenever the signal passes through zero, i.e., two adjacent samples differ in sign; the zero-crossing count is the number of such sign changes.
The short-time average zero-crossing rate of frame n is:
$$Z_n=\sum_{m=n-N+1}^{n}\bigl|\operatorname{sgn}[x(m)]-\operatorname{sgn}[x(m-1)]\bigr|\,w(n-m)$$

$$w(n)=\begin{cases}1/(2N), & 0 \le n \le N-1\\ 0, & \text{otherwise}\end{cases}$$
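The same definition in numpy form (again with a stand-in frame matrix; the 1/(2N) factor is the weighting window above):

import numpy as np
N = 240
frames = np.random.randn(198, N)    # stand-in frame matrix
sgn = np.sign(frames)
Zn = np.abs(sgn[:, 1:] - sgn[:, :-1]).sum(axis=1) / (2.0 * N)  # w(n) = 1/(2N)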

3. Python implementation

import wave
import numpy as np
import matplotlib.pyplot as plt

def read(data_path):
    '''Read a speech signal from a WAV file.
    '''
    f = wave.open(data_path,'rb')
    params = f.getparams()
    nchannels,sampwidth,framerate,nframes = params[:4] # channels, sample width, sample rate, sample count
    str_data = f.readframes(nframes) # raw audio as a byte string
    f.close()
    wavedata = np.frombuffer(str_data,dtype = np.short) # bytes -> 16-bit integer samples (np.fromstring is deprecated)
    wavedata = wavedata * 1.0 / (max(abs(wavedata))) # normalize the amplitude to [-1, 1]
    return wavedata,nframes,framerate

def plot(data,time):
    plt.plot(time,data)
    plt.grid(True)
    plt.show()

def enframe(data,win,inc):
    '''Split a speech signal into (possibly overlapping) frames.
    input:data(1-D array): speech signal
          win(int or 1-D array): window length, or the window itself
          inc(int): hop size, i.e. how far the window moves each step
    output:f(2-D array): one row per frame
    '''
    nx = len(data) # length of the speech signal
    try:
        nwin = len(win)
    except TypeError: # win is a plain integer window length
        nwin = 1
    if nwin == 1:
        wlen = win
    else:
        wlen = nwin
    nf = int(np.fix((nx - wlen) / inc) + 1) # number of frames
    indf = inc * np.arange(nf).reshape(-1, 1) # start index of each frame
    inds = np.arange(wlen)                    # sample offsets within a frame
    f = data[indf + inds]                     # gather samples, one frame per row
    return f

def point_check(wavedata,win,inc):
    '''Endpoint detection on a speech signal.
    input:wavedata(一维array): raw speech signal
    output:StartPoint(int): first speech frame
           EndPoint(int): last speech frame
    '''
    #1. Short-time zero-crossing rate
    FrameTemp1 = enframe(wavedata[0:-1],win,inc)
    FrameTemp2 = enframe(wavedata[1:],win,inc)
    signs = (FrameTemp1 * FrameTemp2 < 0).astype(int) # adjacent samples with opposite signs mark a zero crossing
    diffs = (np.abs(FrameTemp1 - FrameTemp2) > 0.01).astype(int) # ignore near-zero jitter
    zcr = list((signs * diffs).sum(axis = 1))
    #2. Short-time energy (sum of absolute amplitudes as a cheap proxy)
    amp = list((abs(enframe(wavedata,win,inc))).sum(axis = 1))
    # Set the thresholds
    ZcrLow = max([round(np.mean(zcr)*0.1),3]) # low zero-crossing threshold
    ZcrHigh = max([round(max(zcr)*0.1),5]) # high zero-crossing threshold
    AmpLow = min([min(amp)*10,np.mean(amp)*0.2,max(amp)*0.1]) # low energy threshold
    AmpHigh = max([min(amp)*10,np.mean(amp)*0.2,max(amp)*0.1]) # high energy threshold
    # Endpoint detection
    MaxSilence = 8 # longest tolerated in-speech silence gap, in frames
    MinAudio = 16 # shortest valid speech run, in frames
    Status = 0 # state 0: silence, 1: possible speech, 2: speech, 3: finished
    HoldTime = 0 # speech duration, in frames
    SilenceTime = 0 # silence-gap duration, in frames
    print('Starting endpoint detection')
    StartPoint = 0
    for n in range(len(zcr)):
        if Status == 0 or Status == 1:
            if amp[n] > AmpHigh or zcr[n] > ZcrHigh:
                StartPoint = n - HoldTime
                Status = 2
                HoldTime = HoldTime + 1
                SilenceTime = 0
            elif amp[n] > AmpLow or zcr[n] > ZcrLow:
                Status = 1
                HoldTime = HoldTime + 1
            else:
                Status = 0
                HoldTime = 0
        elif Status == 2:
            if amp[n] > AmpLow or zcr[n] > ZcrLow:
                HoldTime = HoldTime + 1
            else:
                SilenceTime = SilenceTime + 1
                if SilenceTime < MaxSilence: # gap short enough: still inside speech
                    HoldTime = HoldTime + 1
                elif (HoldTime - SilenceTime) < MinAudio: # run too short: a noise burst
                    Status = 0
                    HoldTime = 0
                    SilenceTime = 0
                else: # genuine end of speech
                    Status = 3
        if Status == 3:
            break
    HoldTime = HoldTime - SilenceTime
    EndPoint = StartPoint + HoldTime
    return StartPoint,EndPoint,FrameTemp1
   
    
if __name__ == '__main__':
    data_path = 'audio_data.wav'
    win = 240 # window length in samples (15 ms if the file is sampled at 16 kHz)
    inc = 80  # hop size in samples (5 ms at 16 kHz)
    wavedata,nframes,framerate = read(data_path)
    time_list = np.array(range(0,nframes)) * (1.0 / framerate)
    plot(wavedata,time_list)
    StartPoint,EndPoint,FrameTemp = point_check(wavedata,win,inc)
    print('Speech detected from frame %d to frame %d' % (StartPoint,EndPoint))

Endpoint detection result:

[Figure: endpoint detection result]

2015-03-17 14:22:47 c602273091 · 14,135 views

In the previous post we set up the portaudio platform, so we can already capture and play back audio. Next comes something more substantial: adaptive endpoint detection. What is that? When capturing audio, we want to record just the stretch from where speech starts to where it ends and process only that. Otherwise we would process every buffer, silent ones included, wasting storage and CPU on all the downstream work. It saves effort later at the cost of more effort up front; there is no free lunch. Here is roughly how I went about it.


1. Basics


Choosing the sampling rate: the upper limit of human hearing is usually put at around 16-20 kHz, and by the sampling theorem the sampling rate must be at least twice the highest frequency present or aliasing occurs. Telephone calls therefore typically sample at 8 kHz, which carries roughly 4 kHz of usable bandwidth; that is already enough for a natural-sounding, highly intelligible call. A higher sampling rate does not automatically mean better sound; it is a trade-off. This time we sample at 16 kHz, which captures the essentials, since the ear is more sensitive to low frequencies anyway; today's high-fidelity audio uses a 44.1 kHz sampling rate. After quantization (uniform or non-uniform) the samples can be stored. For how to convert captured samples to a non-uniform encoding such as the mu-law, see:

http://www.speech.cs.cmu.edu/comp.speech/Section2/Q2.7.html
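As a taste of what the linked FAQ covers, here is a small Python sketch of mu-law companding (mu = 255, the 8-bit North American standard; an illustration, not the FAQ's own code):

import numpy as np

def mulaw_compress(x, mu=255):
    '''x in [-1, 1] -> companded value in [-1, 1].'''
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mulaw_expand(y, mu=255):
    '''Inverse of mulaw_compress.'''
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu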

Problems during capture: recordings pick up noise, whose influence we must reduce, and there can also be echo.

Ways to capture audio: process existing (already recorded) audio, or record live. Tools include the Windows recorder, Adobe Audition, and arecord on Linux.

How the audio is stored: shown in a figure in the original post (not reproduced here). Generally the data is easiest to process further once it is in PCM form.

Parameters to consider when setting up capture: sampling rate, quantization scheme, channels, and storage.

Two capture modes: blocking (you set a time and return whether or not data has arrived) and callback (a function is invoked only when valid data is available). Portaudio has code for both. As you may have guessed, the callback mode is the one that suits our purpose.

Speech-processing modes: Push and Pull, which correspond roughly to the blocking and callback modes above.

Endpoint detection: the intended effect is shown in a figure in the original post (not reproduced). Roughly speaking, people start talking abruptly, and we must also decide when they stop.


2. Algorithm


The concrete steps were given in a flowchart in the original post (not reproduced):


  • Decision: compute the energy at each instant and set a threshold k; points above it are labeled 1 (speech), the rest 0. The energy formula was given in a figure that is lost; in the implementation below it is the log-energy E = 10 · log(Σ x²).
  • Smoothing: silence gaps shorter than 100 ms are still counted as part of the speech, and only runs longer than 250 ms are accepted as speech; also pad about 250 ms before and after the extracted segment. This presumes a fairly quiet environment; under heavy interference the rules must be adapted to how strong it is.
  • Algorithm one: a fairly simple algorithm (flowchart not reproduced; a Python sketch follows this list).

  • Algorithm two: a more elaborate algorithm (flowchart not reproduced).
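Since the flowcharts are lost, here is a minimal Python sketch of algorithm one under the rules above (assumptions: 16 kHz mono samples in x, 25 ms frames, an empirical energy threshold k; 100 ms = 4 frames and 250 ms = 10 frames). It is an illustration, not the slide's exact procedure:

import numpy as np

def simple_endpointing(x, frame_len=400, k=100.0):
    nf = len(x) // frame_len
    frames = x[:nf * frame_len].reshape(nf, frame_len).astype(float)
    speech = ((frames ** 2).sum(axis=1) > k).astype(int)  # 1 = speech, 0 = silence
    # fill silence gaps shorter than 100 ms: they stay part of the speech
    edges = np.flatnonzero(np.diff(speech)) + 1
    starts, ends = np.r_[0, edges], np.r_[edges, nf]
    for b, e in zip(starts, ends):
        if speech[b] == 0 and e - b < 4 and b > 0 and e < nf:
            speech[b:e] = 1
    # drop speech runs shorter than 250 ms: they are noise bursts
    edges = np.flatnonzero(np.diff(speech)) + 1
    starts, ends = np.r_[0, edges], np.r_[edges, nf]
    for b, e in zip(starts, ends):
        if speech[b] == 1 and e - b < 10:
            speech[b:e] = 0
    return speech

x = np.r_[np.zeros(8000), np.random.randn(8000), np.zeros(8000)]  # silence-speech-silence
print(simple_endpointing(x))  # 20 zero frames, 20 one frames, 20 zero frames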


3. Code

Capture the audio signal and convert it:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>         //function log
#include <conio.h>        //kbhit()
#include "portaudio.h"
#include "readwave.h"     //WriteWave()

/* #define SAMPLE_RATE  (17932) // Test failure to open with this value. */
//SAMPLE_RATE,  FRAMES_PER_BUFFER, NUM_SECONDS, NUM_CHANNELS are modified by Yao Canwu
#define SAMPLE_RATE (16000)  
#define FRAMES_PER_BUFFER (400)
#define NUM_SECONDS     (60)
#define NUM_CHANNELS    (1)
/* #define DITHER_FLAG     (paDitherOff) */
#define DITHER_FLAG     (0) /**/
/** Set to 1 if you want to capture the recording to a file. */
#define WRITE_TO_FILE   (0)

/* Select sample format. */
#if 0
#define PA_SAMPLE_TYPE  paFloat32
typedef float SAMPLE;
#define SAMPLE_SILENCE  (0.0f)
#define PRINTF_S_FORMAT "%.8f"
#elif 1
#define PA_SAMPLE_TYPE  paInt16
typedef short SAMPLE;
#define SAMPLE_SILENCE  (0)
#define PRINTF_S_FORMAT "%d"
#elif 0
#define PA_SAMPLE_TYPE  paInt8
typedef char SAMPLE;
#define SAMPLE_SILENCE  (0)
#define PRINTF_S_FORMAT "%d"
#else
#define PA_SAMPLE_TYPE  paUInt8
typedef unsigned char SAMPLE;
#define SAMPLE_SILENCE  (128)
#define PRINTF_S_FORMAT "%d"
#endif

typedef struct
{
	int          frameIndex;  /* Index into sample array. */
	int          maxFrameIndex;
	SAMPLE      *recordedSamples;
}
paTestData;

//calculate the frame energy on a log scale ("decibels" here uses the natural
//log; strict dB would be 10*log10)
//added by Yao Canwu
float energyPerSampleInDecibe(const SAMPLE *ptr)
{
	float energy = 0.0f;
	SAMPLE temp;
	for (unsigned long i = 0; i<FRAMES_PER_BUFFER; i++)
	{
		temp = *(ptr + i);
		energy += temp * temp;
	}
	energy = 10 * log(energy);
	return energy;
}
//An Adaptive Endpointing Algorithm
//added by Yao Canwu
const float forgetfactor = 1;   //weight of the running level estimate
const float adjustment = 0.05;  //how quickly the background tracks rising energy
//key value for classifyFrame(), needs adjusting to the environment.
const float threshold = 10;
float background = 0;  //running estimate of the background-noise energy
float level = 0;       //smoothed estimate of the current signal level
int count = 0;         //consecutive non-speech buffers observed

bool classifyFrame(const SAMPLE *ptr)
{
	float current = energyPerSampleInDecibe(ptr);
	bool isSpeech = false;
	//exponentially smooth the signal level
	level = ((level * forgetfactor) + current) / (forgetfactor + 1);
	//the background follows drops immediately but rises only slowly
	if (current < background)  background = current;
	else background += (current - background) * adjustment;
	if (level < background)	level = background;
	//speech = the level stands clearly above the background
	if (level - background > threshold)	isSpeech = true;
	return isSpeech;
}
/* This routine will be called by the PortAudio engine when audio is needed.
** It may be called at interrupt level on some machines so don't do anything
** that could mess up the system like calling malloc() or free().
*/
static int recordCallback(const void *inputBuffer, void *outputBuffer,
	unsigned long framesPerBuffer,
	const PaStreamCallbackTimeInfo* timeInfo,
	PaStreamCallbackFlags statusFlags,
	void *userData)
{
	paTestData *data = (paTestData*)userData;
	const SAMPLE *rptr = (const SAMPLE*)inputBuffer;
	SAMPLE *wptr = &data->recordedSamples[data->frameIndex * NUM_CHANNELS];
	long framesToCalc;
	long i;
	int finished;
	unsigned long framesLeft = data->maxFrameIndex - data->frameIndex;

	(void)outputBuffer; /* Prevent unused variable warnings. */
	(void)timeInfo;
	(void)statusFlags;
	(void)userData;

	if (framesLeft < framesPerBuffer)
	{
		framesToCalc = framesLeft;
		finished = paComplete;
	}
	else
	{
		framesToCalc = framesPerBuffer;
		finished = paContinue;
	}

	if (inputBuffer == NULL)
	{
		for (i = 0; i<framesToCalc; i++)
		{
			*wptr++ = SAMPLE_SILENCE;  /* left */
			if (NUM_CHANNELS == 2) *wptr++ = SAMPLE_SILENCE;  /* right */
		}
	}
	else
	{
		for (i = 0; i<framesToCalc; i++)
		{
			*wptr++ = *rptr++;  /* left */
			if (NUM_CHANNELS == 2) *wptr++ = *rptr++;  /* right */
		}
	}
	data->frameIndex += framesToCalc;
	/* calculate the initial background and initial level from the first
	** callback; they seed classifyFrame(). frameIndex was just incremented,
	** so it equals framesToCalc exactly once, on the first buffer. The
	** 10-buffer background sum still runs over samples that are zero at
	** this point, so the estimate is padded with silence.
	** Added by Yao Canwu
	*/
	if (data->frameIndex == framesToCalc)
	{
		level = energyPerSampleInDecibe(&data->recordedSamples[0]);
		background = 0.0f;
		SAMPLE temp;
		for (i = 0; i < 10 * framesPerBuffer; i++)
		{
			temp = data->recordedSamples[i];
			background += temp * temp;
		}
		background = log(background);
	}
	//Silence for 2 seconds (80 buffers of 25 ms at 16 kHz) means the end of audio capture
	if (classifyFrame(rptr)) count = 0;
	else count++;
	//printf("count = %d\n", count);

	if (count >= 80) data->maxFrameIndex = data->frameIndex;

	return finished;
}

/* This routine will be called by the PortAudio engine when audio is needed.
** It may be called at interrupt level on some machines so don't do anything
** that could mess up the system like calling malloc() or free().
*/
static int playCallback(const void *inputBuffer, void *outputBuffer,
	unsigned long framesPerBuffer,
	const PaStreamCallbackTimeInfo* timeInfo,
	PaStreamCallbackFlags statusFlags,
	void *userData)
{
	paTestData *data = (paTestData*)userData;
	SAMPLE *rptr = &data->recordedSamples[data->frameIndex * NUM_CHANNELS];
	SAMPLE *wptr = (SAMPLE*)outputBuffer;
	unsigned int i;
	int finished;
	unsigned int framesLeft = data->maxFrameIndex - data->frameIndex;

	(void)inputBuffer; /* Prevent unused variable warnings. */
	(void)timeInfo;
	(void)statusFlags;
	(void)userData;

	if (framesLeft < framesPerBuffer)
	{
		/* final buffer... */
		for (i = 0; i<framesLeft; i++)
		{
			*wptr++ = *rptr++;  /* left */
			if (NUM_CHANNELS == 2) *wptr++ = *rptr++;  /* right */
		}
		for (; i<framesPerBuffer; i++)
		{
			*wptr++ = 0;  /* left */
			if (NUM_CHANNELS == 2) *wptr++ = 0;  /* right */
		}
		data->frameIndex += framesLeft;
		finished = paComplete;
	}
	else
	{
		for (i = 0; i<framesPerBuffer; i++)
		{
			*wptr++ = *rptr++;  /* left */
			if (NUM_CHANNELS == 2) *wptr++ = *rptr++;  /* right */
		}
		data->frameIndex += framesPerBuffer;
		finished = paContinue;
	}
	return finished;
}

/*******************************************************************/
int main(void)
{
	PaStreamParameters  inputParameters,
		outputParameters;
	PaStream*           stream;
	PaError             err = paNoError;
	paTestData          data;
	int                 i;
	int                 totalFrames;
	int                 numSamples;
	int                 numBytes;
	SAMPLE              max, val;
	double              average;

	printf("patest_record.c\n"); fflush(stdout);

	data.maxFrameIndex = totalFrames = NUM_SECONDS * SAMPLE_RATE; /* Record for a few seconds. */
	data.frameIndex = 0;
	numSamples = totalFrames * NUM_CHANNELS;
	numBytes = numSamples * sizeof(SAMPLE);
	data.recordedSamples = (SAMPLE *)malloc(numBytes); /* From now on, recordedSamples is initialised. */
	if (data.recordedSamples == NULL)
	{
		printf("Could not allocate record array.\n");
		goto done;
	}
	for (i = 0; i<numSamples; i++) data.recordedSamples[i] = 0;

	err = Pa_Initialize();
	if (err != paNoError) goto done;

	inputParameters.device = Pa_GetDefaultInputDevice(); /* default input device */
	if (inputParameters.device == paNoDevice) {
		fprintf(stderr, "Error: No default input device.\n");
		goto done;
	}
	inputParameters.channelCount = 1;                    /* mono input */
	inputParameters.sampleFormat = PA_SAMPLE_TYPE;
	inputParameters.suggestedLatency = Pa_GetDeviceInfo(inputParameters.device)->defaultLowInputLatency;
	inputParameters.hostApiSpecificStreamInfo = NULL;

	//set a keyboard hit to start recording. Added by Yao Canwu
	printf("Press any key to start recording\n");
	while (!kbhit()){}

	/* Record some audio. -------------------------------------------- */
	err = Pa_OpenStream(
		&stream,
		&inputParameters,
		NULL,                  /* &outputParameters, */
		SAMPLE_RATE,
		FRAMES_PER_BUFFER,
		paClipOff,      /* we won't output out of range samples so don't bother clipping them */
		recordCallback,
		&data);
	if (err != paNoError) goto done;

	err = Pa_StartStream(stream);
	if (err != paNoError) goto done;
	printf("\n=== Now start recording!!\n"); fflush(stdout);
	/* Pa_IsStreamActive: Determine whether the stream is active. A stream
	is active after a successful call to Pa_StartStream(), until it becomes
	inactive either as a result of a call to Pa_StopStream() or Pa_AbortStream(),
	or as a result of a return value other than paContinue from the stream callback.
	In the latter case, the stream is considered inactive after the last buffer has finished playing. */
	while ((err = Pa_IsStreamActive(stream)) == 1)
	{
		Pa_Sleep(1000);
		printf("index = %d\n", data.frameIndex); fflush(stdout);
	}
	if (err < 0) goto done;

	err = Pa_CloseStream(stream);
	if (err != paNoError) goto done;

	//Write wave to file in wav formate. Added by Yao Canwu
	printf("Waiting to save into file...\n");
	const char *path = "audio.wav";
	WriteWave(path, data.recordedSamples, data.maxFrameIndex, SAMPLE_RATE);
	printf("Save successfully!\n");

	/* Write recorded data to a file. */
#if WRITE_TO_FILE
	{
		FILE  *fid;
		fid = fopen("recorded.raw", "wb");
		if (fid == NULL)
		{
			printf("Could not open file.");
		}
		else
		{
			fwrite(data.recordedSamples, NUM_CHANNELS * sizeof(SAMPLE), totalFrames, fid);
			fclose(fid);
			printf("Wrote data to 'recorded.raw'\n");
		}
	}
#endif
	/* Playback recorded data.  -------------------------------------------- */
	data.frameIndex = 0;

	outputParameters.device = Pa_GetDefaultOutputDevice(); /* default output device */
	if (outputParameters.device == paNoDevice) {
		fprintf(stderr, "Error: No default output device.\n");
		goto done;
	}
	outputParameters.channelCount = 1;                     /* mono output */
	outputParameters.sampleFormat = PA_SAMPLE_TYPE;
	outputParameters.suggestedLatency = Pa_GetDeviceInfo(outputParameters.device)->defaultLowOutputLatency;
	outputParameters.hostApiSpecificStreamInfo = NULL;

	printf("\n=== Now playing back. ===\n"); fflush(stdout);
	err = Pa_OpenStream(
		&stream,
		NULL, /* no input */
		&outputParameters,
		SAMPLE_RATE,
		FRAMES_PER_BUFFER,
		paClipOff,      /* we won't output out of range samples so don't bother clipping them */
		playCallback,
		&data);
	if (err != paNoError) goto done;

	if (stream)
	{
		err = Pa_StartStream(stream);
		if (err != paNoError) goto done;

		printf("Waiting for playback to finish.\n"); fflush(stdout);

		while ((err = Pa_IsStreamActive(stream)) == 1) Pa_Sleep(100);
		if (err < 0) goto done;

		err = Pa_CloseStream(stream);
		if (err != paNoError) goto done;

		printf("Done.\n"); fflush(stdout);
	}
done:
	Pa_Terminate();
	if (data.recordedSamples)       /* Sure it is NULL or valid. */
		free(data.recordedSamples);
	if (err != paNoError)
	{
		fprintf(stderr, "An error occured while using the portaudio stream\n");
		fprintf(stderr, "Error number: %d\n", err);
		fprintf(stderr, "Error message: %s\n", Pa_GetErrorText(err));
		err = 1;          /* Always return 0 or 1, but no other return codes. */
	}
	system("pause");
	return err;
}

readwave:

#include <stdlib.h>
#include <math.h>
#include <memory.h>
#include <assert.h>
#include <string.h>
#include "readwave.h"


bool WaveRewind(FILE *wav_file, WavFileHead *wavFileHead)
{
	char riff[8],wavefmt[8];
	short i;
	rewind(wav_file);
	fread(wavFileHead,sizeof(struct WavFileHead),1,wav_file);

	for ( i=0;i<8;i++ )
	{
		riff[i]=wavFileHead->RIFF[i];
		wavefmt[i]=wavFileHead->WAVEfmt_[i];
	}
	riff[4]='\0';
	wavefmt[7]='\0';
	if ( strcmp(riff,"RIFF")==0 && strcmp(wavefmt,"WAVEfmt")==0 )
		return	true;  // It is WAV file.
	else
	{
		rewind(wav_file);
		return(false);
	}
}


short *ReadWave(const char *wavFile, int *numSamples, int *sampleRate ) 
{                                                               
	FILE	*wavFp;
	WavFileHead		wavHead;
	short	*waveData;
	long	numRead;

	wavFp = fopen(wavFile, "rb");
	if (!wavFp)	
	{
		printf("\nERROR:can't open %s!\n", wavFile);
		exit(0);
	}

	if (WaveRewind(wavFp, &wavHead) == false)
	{
		printf("\nERROR:%s is not a Windows wave file!\n", wavFile);
		exit(0);
	}

	waveData = new short [wavHead.RawDataFileLength/sizeof(short)];
	numRead = fread(waveData, sizeof(short), wavHead.RawDataFileLength / 2, wavFp);
	assert(numRead * sizeof(short) == (unsigned long)wavHead.RawDataFileLength);
	fclose(wavFp);

	*numSamples = wavHead.RawDataFileLength/sizeof(short);
	*sampleRate = wavHead.SampleRate;
	return	waveData;
}

void FillWaveHeader(void *buffer, int raw_wave_len, int sampleRate)
{
	WavFileHead  wavHead;

	strcpy(wavHead.RIFF, "RIFF");
	strcpy(wavHead.WAVEfmt_, "WAVEfmt ");
	wavHead.FileLength = raw_wave_len + 36;
	wavHead.noUse = 16;
	wavHead.FormatCategory = 1;
	wavHead.NChannels = 1;
	wavHead.SampleRate = sampleRate;
	wavHead.SampleBytes = sampleRate*2;
	wavHead.BytesPerSample = 2;
	wavHead.NBitsPersample = 16;
	strcpy(wavHead.data, "data");
	wavHead.RawDataFileLength = raw_wave_len;

	memcpy(buffer, &wavHead, sizeof(WavFileHead));
}

void WriteWave(const char *wavFile, short *waveData, int numSamples, int sampleRate)
{
	FILE	*wavFp;
	WavFileHead		wavHead;
	long	numWrite;

	wavFp = fopen(wavFile, "wb");
	if (!wavFp)	
	{
		printf("\nERROR:can't open %s!\n", wavFile);
		exit(0);
	}

	FillWaveHeader(&wavHead, numSamples*sizeof(short), sampleRate);
	fwrite(&wavHead, sizeof(WavFileHead), 1, wavFp);
	numWrite = fwrite(waveData, sizeof(short), numSamples, wavFp);
	assert(numWrite == numSamples);
	fclose(wavFp);
}

void GetWavHeader(const char *wavFile, short *Bits, int *Rate,
				  short *Format, int *Length, short *Channels) 
{                                                               
	FILE	*wavFp;
	WavFileHead		wavHead;
	char    *waveData;
	long	numRead,File_length;

	wavFp = fopen(wavFile, "rb");
	if (!wavFp)	
	{
		printf("\nERROR:can't open %s!\n", wavFile);
		exit(0);
	}
    fseek(wavFp,0,SEEK_END);
	File_length=ftell(wavFp);

	if (WaveRewind(wavFp, &wavHead) == false)
	{
		printf("\nERROR:%s is not a Windows wave file!\n", wavFile);
		exit(0);
	}

	waveData = new char[(File_length-sizeof(struct WavFileHead))/sizeof(char)];
	numRead = fread(waveData, sizeof(char), File_length-sizeof(struct WavFileHead), wavFp);
	fclose(wavFp);

	*Bits = wavHead.NBitsPersample;
	*Format = wavHead.FormatCategory;
	*Rate = wavHead.SampleRate;
	*Length = (int)numRead;
	*Channels = wavHead.NChannels;

	delete []	waveData;
}


short *ReadWavFile(const char *wavFile, int *numSamples, int *sampleRate )
{                                                               
	FILE	*wavFp;
	WavFileHead		wavHead;
	short	*waveData;
	long	numRead,File_length;

	wavFp = fopen(wavFile, "rb");
	if (!wavFp)	
	{
		printf("\nERROR:can't open %s!\n", wavFile);
		exit(0);
	}
    fseek(wavFp,0,SEEK_END);
	File_length=ftell(wavFp);


	if (WaveRewind(wavFp, &wavHead) == false)
	{
		printf("\nERROR:%s is not a Windows wave file!\n", wavFile);
		exit(0);
	}

	waveData = new short [(File_length-sizeof(struct WavFileHead))/sizeof(short)];
	numRead = fread(waveData, sizeof(short), (File_length-sizeof(struct WavFileHead))/sizeof(short), wavFp);
	fclose(wavFp);

	*numSamples = (int)numRead;
	*sampleRate = wavHead.SampleRate;
	return	waveData;
}

void ReadWav(const char *wavFile, short *waveData, int *numSamples, int *sampleRate)
{                                                               
	FILE	*wavFp;
	WavFileHead		wavHead;
	long	numRead;

	wavFp = fopen(wavFile, "rb");
	if (!wavFp)	
	{
		printf("\nERROR:can't open %s!\n", wavFile);
		exit(0);
	}

	if (WaveRewind(wavFp, &wavHead) == false)
	{
		printf("\nERROR:%s is not a Windows PCM file!\n", wavFile);
		exit(0);
	}

	numRead = fread(waveData, sizeof(short), wavHead.RawDataFileLength/2, wavFp);
	assert(numRead*sizeof(short) == (unsigned long)wavHead.RawDataFileLength);
	fclose(wavFp);

	*numSamples = wavHead.RawDataFileLength/sizeof(short);
	*sampleRate = wavHead.SampleRate;
}



Special note: the screenshots above are from the lecture slides of Prof. Li Ming's course at CMU.



2019-07-27 15:38:50 alice_tl · 1,541 views

The concept of endpoint detection

Endpoint detection, also called voice activity detection (VAD), aims to distinguish speech regions from non-speech regions. Put plainly, it locates the exact start and end of the spoken content in a noisy recording, strips out the silent and noisy parts, and keeps only the genuinely useful speech.

Using a recognizer in noisy conditions, or changes in the speaker's emotional or mental state that distort pronunciation, speaking rate, or pitch, all produce the Lombard/Loud effect. Studies show that even in quiet conditions, more than half of a speech recognition system's errors originate in the endpoint detector.

 

Types of endpoint detection

VAD algorithms fall roughly into three classes: threshold-based VAD, VAD as a classifier, and model-based VAD.

Threshold-based VAD: extract time-domain features (short-time energy, short-time zero-crossing rate, ...) or frequency-domain features (MFCCs, spectral entropy, ...), then set sensible thresholds to separate speech from non-speech. This is the traditional approach.

VAD as a classifier: treat speech detection as a two-class speech/non-speech problem and train a classifier with machine-learning methods. A sketch of this idea follows below.

Model-based VAD: use a full acoustic model (the modeling units can be coarse) and, on top of decoding, use global information to discriminate speech segments from non-speech segments.

VAD sits at the very front of the pipeline and must run locally in real time. With very limited compute, VAD is usually some thresholding algorithm; an engineering-optimized classifier may also be used, while model-based VAD is currently hard to deploy on-device.
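A hedged sketch of the classifier idea (not from this post): per-frame features such as log-energy and zero-crossing rate feeding a logistic-regression speech/non-speech classifier. The features, data, and model choice here are purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                 # stand-in frame features (e.g. log-energy, ZCR)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in speech/non-speech labels

clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:10]))                     # per-frame speech (1) / non-speech (0) decisions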

 

Done well, endpoint detection not only shortens the time series to be processed but also removes the noise in the silent segments.

 

How endpoint detection works

To make the principle clearer, I recorded a clip and cut the waveform into several parts.

At the start there is a brief preparation period with no sound (waveforms not reproduced).

The first "你好" ("hello").

The second "你好".

The third "你好", spoken with a deliberately disguised voice.

The waveforms show the following:

  1. In the leading and trailing silence the amplitude is small, while the amplitude of the useful "你好" speech is noticeably larger.
  2. A signal's amplitude reflects its energy: the silent portions visibly carry little energy, the speech portions much more.
  3. Even where nobody is speaking, the energy is not zero and still fluctuates.
  4. Without deliberate disguise or interference, the two utterances of "你好" have essentially the same amplitude, i.e. the same signal shape.
  5. Because the third utterance was disguised, its amplitude differs from the first two, and the deliberate disguise also makes its duration clearly different.

 

From this we can pin down some of the concepts involved in endpoint detection:

Noise: the background sound, from the environment or from the device itself. In practice, a long stretch of total silence feels unnatural to users, so the receiving end often sends packets during silence to synthesize a pleasant background, the so-called comfort noise.

Silence: a run of consecutive frames whose energy stays at a low level. Ideally silence would have zero energy, but in practice it never does, because background sound contributes a baseline energy.

Endpoint: the transition point between silence and valid speech.

In practice, e.g. in telephony, no voice packets are sent while the user is not talking, which further lowers the voice bit rate. When the energy of the speech signal falls below a threshold, the state is deemed silent and no packets are sent; speech packets are generated and transmitted only when voice activity is detected. This technique can save more than 50% of the bandwidth.

Likewise, when testing we must consider discontinuous speech, e.g. stuttering, hesitation, and halting delivery, and check recognition accuracy there, so that the endpoint detection stage does not behave abnormally or unreasonably.

2019-05-21 09:13:59 u012361418 · 1,260 views

Speech recognition series 7: voice activity detection (VAD)

1. Introduction

Voice activity detection (VAD) is an old topic: separating the speech and non-speech parts of a signal. There are three common approaches: (1) frame the signal and use simple per-frame cues such as energy and zero-crossing rate to decide whether a frame is speech; (2) detect whether a frame has a pitch period; (3) train a DNN to classify frames. The DNN approach tends to be the most accurate, so this post uses a DNN classifier to separate speech segments from non-speech segments.

 

2. Method and principle

Training data: speech-recognition data with phone-level alignments, so every frame has a corresponding phone. Frames aligned to SIL are taken as non-speech, all other frames as speech; this gives us both the data and the labels for training the DNN. We then train the network with KALDI's NNET3. To keep the model small we use a TDNN whose layer sizes shrink layer by layer; the script is below.
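For concreteness, the label-building step amounts to something like the following (hedged Python sketch; ali stands for one phone label per frame taken from the alignments):

ali = ['SIL', 'SIL', 'n', 'i', 'h', 'ao', 'SIL']         # per-frame phone labels
targets = [0 if phone == 'SIL' else 1 for phone in ali]  # 0 = non-speech, 1 = speech
print(targets)                                           # [0, 0, 1, 1, 1, 1, 0]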

 

  echo "$0: creating neural net configs";
  num_targets=`feat-to-dim scp:$targets_scp - 2>/dev/null` || exit 1
  feat_dim=`feat-to-dim scp:$data_dir/feats.scp - 2>/dev/null` || exit 1

  mkdir -p $dir/configs
  cat <<EOF > $dir/configs/network.xconfig
  input dim=$feat_dim name=input

  relu-renorm-layer name=tdnn1 dim=200 input=Append(-2,-1,0,1,2)
  relu-renorm-layer name=tdnn2 dim=100 input=Append(-2,0,2)
  relu-renorm-layer name=tdnn3 dim=50 input=Append(-3,0,3)
  relu-renorm-layer name=tdnn4 dim=25 input=Append(-3,0,3)
  output-layer name=output dim=$num_targets max-change=1.5
EOF
  steps/nnet3/xconfig_to_configs.py --xconfig-file $dir/configs/network.xconfig --config-dir $dir/configs/

  steps/nnet3/train_raw_dnn.py --stage=$train_stage \
    --cmd="$decode_cmd" \
    --feat.cmvn-opts "--norm-means=false --norm-vars=false" \
    --trainer.num-epochs 2 \
    --trainer.optimization.num-jobs-initial 2 \
    --trainer.optimization.num-jobs-final 4 \
    --trainer.optimization.initial-effective-lrate 0.001 \
    --trainer.optimization.final-effective-lrate 0.0001 \
    --trainer.optimization.minibatch-size 512 \
    --egs.dir "$common_egs_dir" --egs.opts "$egs_opts" \
    --cleanup.remove-egs $remove_egs \
    --cleanup.preserve-model-interval 10 \
    --nj 20 \
    --use-dense-targets=true \
    --feat-dir=${data_dir} \
    --targets-scp=$targets_scp \
    --dir=$dir || exit 1;

After training we have a binary DNN classifier; what remains is policy. We distinguish eight states: 1 unknown; 2 speech start; 3 speech; 4 speech end; 5 non-speech start; 6 non-speech; 7 non-speech end; 8 end.

The policy: start in the unknown state. If non-speech frames outnumber speech frames in the window, enter the non-speech-start state; if 10 consecutive frames are then classified as speech, emit non-speech-end and speech-start, and vice versa. (A sketch of this hangover logic follows below.)

One consequence is that the transitions become too tight, so we pad a stretch of non-speech frames before the speech-start state and after the speech-end state, which guards against clipping speech by misjudgment.
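A minimal Python sketch of this hangover logic (assumptions: frame_is_speech holds the per-frame DNN decisions as 0/1, and 10 consecutive agreeing frames confirm a transition; the padding step is omitted):

def vad_segments(frame_is_speech, confirm=10):
    state, run, start, segments = 'nonspeech', 0, None, []
    for i, s in enumerate(frame_is_speech):
        if state == 'nonspeech':
            run = run + 1 if s else 0
            if run >= confirm:                      # speech onset confirmed
                start, state, run = i - confirm + 1, 'speech', 0
        else:
            run = run + 1 if not s else 0
            if run >= confirm:                      # speech offset confirmed
                segments.append((start, i - confirm + 1))
                state, run = 'nonspeech', 0
    if state == 'speech':
        segments.append((start, len(frame_is_speech)))
    return segments

print(vad_segments([0]*5 + [1]*30 + [0]*20))  # -> [(5, 35)]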

 

3. Conclusion

That is our DNN-based VAD; judging by our experiments, it works quite well.

 

 

 
