2019-04-29 20:59:58 king_audio_video

                             THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT

                                                         Mirco Ravanelli¹, Titouan Parcollet², Yoshua Bengio¹*

                                                                 ¹ Mila, Université de Montréal, CIFAR Fellow

                                                                                    ² LIA, Université d'Avignon

Contents

                             THE PYTORCH-KALDI SPEECH RECOGNITION TOOLKIT

Abstract

1. INTRODUCTION

2. THE PYTORCH-KALDI PROJECT

2.1 Configuration file

2.2 Features

2.3 Labels

2.4 Chunk and mini-batch composition

2.5 DNN acoustic modeling

2.6 Decoding and scoring

3. EXPERIMENTAL SETUP

3.1 Corpora and tasks

3.2 DNN setup

4. BASELINES

5. CONCLUSIONS

6. ACKNOWLEDGMENTS


Parts of this content are original. The author's knowledge is limited and occasional slips are unavoidable; corrections are welcome. This content was compiled and created by the 灵声讯 audio-speech algorithm lab. For reprinting or reuse, please contact "灵声讯" via the audio/recognition/synthesis algorithm QQ group (696554058).


 

Abstract

    The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently generated tremendous interest within the machine learning community thanks to its simplicity and flexibility.

    The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these pieces of software, but it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters.

    Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

Keywords: speech recognition, deep learning, Kaldi, PyTorch.


1. INTRODUCTION

    Over the last years we have witnessed a progressive improvement and maturation of Automatic Speech Recognition (ASR) technologies [1, 2], which have reached unprecedented performance levels and are now used by millions of users worldwide.

    A key role in this technological breakthrough is being played by deep learning [3], which helped overcome previous speech recognizers based on Gaussian Mixture Models (GMMs). Beyond deep learning, other factors have contributed to the progress of the field. Several speech-related projects such as AMI [4] (a dataset) and DIRHA [5], as well as speech recognition challenges such as CHiME [6], Babel, and Aspire, have remarkably fostered the development of ASR. The public release of large datasets such as Librispeech [7] has also played an important role in establishing common evaluation frameworks and tasks.

    Among other factors, the development of open-source software such as HTK [8], Julius [9], CMU-Sphinx, RWTH-ASR [10], LIA-ASR [11] and, more recently, the Kaldi toolkit [12] has further helped popularize ASR, making both research and development of novel ASR applications significantly easier.

    Kaldi is currently the most popular ASR toolkit. It relies on Finite State Transducers (FSTs) [13] and provides a set of C++ libraries for efficiently implementing state-of-the-art speech recognition systems. Moreover, the toolkit includes a full set of recipes covering all the most popular speech corpora. In parallel to the development of this ASR-specific software, general-purpose deep learning frameworks such as Theano [14], TensorFlow [15], and CNTK [16] have gained popularity in the machine learning community. These toolkits offer great flexibility in neural network design and can be used for a variety of deep learning applications.

    PyTorch [17] is an emerging Python package that implements efficient GPU-based tensor computations and facilitates the design of neural architectures thanks to proper routines for automatic gradient computation. An interesting property of PyTorch lies in its modern and flexible design, which naturally supports dynamic neural networks: the computational graph is built dynamically at run time rather than being statically compiled.

    The PyTorch-Kaldi project focuses on bridging the gap between Kaldi and PyTorch (the code is available at www.github.com/mravanelli/pytorch-kaldi/). Our toolkit implements the acoustic models in PyTorch, while feature extraction, label/alignment computation, and decoding are performed with Kaldi, making it suitable for developing state-of-the-art DNN-HMM speech recognizers. PyTorch-Kaldi natively supports several DNN, CNN, and RNN models. Combinations of deep learning models, acoustic features, and labels are also supported, enabling the use of complex neural architectures. Users can, for instance, cascade CNNs, LSTMs, and DNNs, or run in parallel several models sharing some hidden layers. Users can also explore different acoustic features, context durations, neuron activations (e.g., ReLU, leaky ReLU), normalizations (e.g., batch [18] and layer normalization [19]), cost functions, regularization strategies (e.g., L2, dropout [20]), optimization algorithms (e.g., Adam [21], RMSprop), and many other parameters of an ASR system through simple edits of a configuration file.

    The toolkit is designed to make the integration of user-defined acoustic models as simple as possible. In practice, users can embed their deep learning model in PyTorch-Kaldi and run ASR experiments even without being fully familiar with the complex speech recognition pipeline. The toolkit can perform computations on a local machine or on an HPC cluster, and supports multi-GPU training, recovery strategies, and automatic data chunking.

Experiments conducted on several datasets and tasks show that PyTorch-Kaldi makes it easy to develop competitive, state-of-the-art speech recognition systems.

2. THE PYTORCH-KALDI PROJECT

Some other speech recognition toolkits have recently been developed using the Python language. PyKaldi [22], for instance, is an easy-to-use Python wrapper around the C++ code of the Kaldi and OpenFst libraries. Unlike our toolkit, however, the current version of PyKaldi does not provide several pre-implemented, ready-to-use neural models. Another Python project is ESPnet [23]. ESPnet is an end-to-end speech processing toolkit, mainly focused on end-to-end speech recognition and end-to-end text-to-speech. The main difference from our project is that the current version of PyTorch-Kaldi implements hybrid DNN-HMM speech recognizers.

                                                                                         Figure 1: Architecture block diagram of PyTorch-Kaldi

      The block diagram of the architecture adopted by PyTorch-Kaldi is shown in Figure 1. The main script run_exp.py is written in Python and manages all the phases involved in an ASR system, including feature and label extraction, training, validation, decoding, and scoring. The toolkit is described in detail in the following subsections.

2.1 Configuration file

The main script takes as input a configuration file in INI format, which is composed of several sections. The [Exp] section specifies high-level information such as the folder used for the experiment, the number of training epochs, and the random seed. It also lets the user specify whether the experiment has to run on a CPU, a GPU, or multiple GPUs. The configuration file continues with the [dataset*] sections, which specify information about features and labels, including the paths where they are stored, the characteristics of the context window [24], and the number of chunks into which the speech dataset must be split. The neural models are described in the [architecture*] sections, while the [model] section defines how these neural networks are combined. The latter section uses a simple meta-language that is automatically interpreted by the run_exp.py script. Finally, the decoding parameters are defined in the [decoding] section.
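To make this structure concrete, here is a hedged illustration: the section and option names below are guesses in the spirit of the toolkit's configuration files rather than an excerpt from the official repository, and the snippet simply shows how such an INI file can be parsed with Python's configparser.

import configparser

# A hypothetical, trimmed-down PyTorch-Kaldi style configuration.
# Real configuration files in the repository contain many more options.
cfg_text = """
[exp]
out_folder = exp/TIMIT_MLP_mfcc
n_epochs_tr = 24
seed = 1234
use_cuda = True

[dataset1]
data_name = TIMIT_tr
fea_lst = data/train/feats.scp
lab_folder = exp/tri3_ali
cw_left = 5
cw_right = 5
n_chunks = 5

[architecture1]
arch_name = MLP_layers1
arch_class = MLP
dnn_lay = 1024,1024,1024,1024,N_out

[model]
model = out_dnn1 = compute(MLP_layers1, mfcc)

[decoding]
decoding_script = decode_dnn.sh
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# run_exp.py would first read the high-level [exp] options and then iterate
# over the [dataset*] and [architecture*] sections.
print(parser.sections())
print(dict(parser["exp"]))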

2.2 Features

     Feature extraction is performed with the C++ executables natively provided by Kaldi (e.g., compute-mfcc-feats, compute-fbank-feats, compute-plp-feats), which efficiently extract the most popular speech recognition features. The computed coefficients are stored in binary archives (extension .ark) and are later imported into the Python environment using the kaldi-io utilities inherited from the kaldi-io-for-python project (www.github.com/vesis84/kaldi-io-for-python). The features are then processed by the load-chunk function, which performs context window composition, shuffling, and mean and variance normalization. As outlined above, PyTorch-Kaldi can manage multiple feature streams. For instance, users can define models that exploit combinations of MFCC, FBANK, PLP, and fMLLR [25] coefficients.
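As a rough sketch of this import step (assuming the kaldi_io package from the kaldi-io-for-python project is installed; the archive path is hypothetical, and the helper below is only a simplified stand-in for what load-chunk does):

import numpy as np
import kaldi_io  # pip install kaldi_io (from the kaldi-io-for-python project)

def context_window(feats, left=5, right=5):
    # Concatenate every frame with `left` past and `right` future frames.
    # feats: (n_frames, n_feats); returns (n_frames, n_feats * (left + right + 1)).
    padded = np.concatenate(
        [np.repeat(feats[:1], left, axis=0), feats, np.repeat(feats[-1:], right, axis=0)]
    )
    shifted = [padded[i:i + len(feats)] for i in range(left + right + 1)]
    return np.concatenate(shifted, axis=1)

# Hypothetical archive produced by compute-mfcc-feats; read_mat_ark yields (utt_id, matrix) pairs.
for utt_id, mat in kaldi_io.read_mat_ark("mfcc/raw_mfcc_train.1.ark"):
    mat = (mat - mat.mean(axis=0)) / (mat.std(axis=0) + 1e-8)  # per-utterance mean/variance norm
    cw_feats = context_window(mat, left=5, right=5)
    print(utt_id, mat.shape, cw_feats.shape)
    break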

2.3 Labels

    The main labels used for training the acoustic model derive from a forced-alignment procedure between the speech features and the sequence of context-dependent phone states computed by Kaldi with a phonetic decision tree. To enable multi-task learning, PyTorch-Kaldi supports multiple labels. For instance, it is possible to jointly load both context-dependent and context-independent targets and use the latter for monophone regularization [26, 27]. It is also possible to employ models based on an ecosystem of neural networks performing different tasks, as done in the context of joint training between speech enhancement and speech recognition [28, 29] or in the context of the recently proposed cooperative networks of deep neural networks.
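A minimal sketch of this kind of multi-task setup in plain PyTorch (the layer sizes, state counts, and the 0.5 auxiliary-task weight below are illustrative choices, not the toolkit's defaults):

import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared encoder with two heads: context-dependent states and monophones."""

    def __init__(self, feat_dim, n_cd_states, n_monophones, hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.cd_head = nn.Linear(hidden, n_cd_states)      # main task
        self.mono_head = nn.Linear(hidden, n_monophones)   # auxiliary regularizer

    def forward(self, x):
        h = self.encoder(x)
        return self.cd_head(h), self.mono_head(h)

model = MultiTaskAcousticModel(feat_dim=440, n_cd_states=1936, n_monophones=48)
criterion = nn.CrossEntropyLoss()
x = torch.randn(32, 440)                 # a mini-batch of context-window features
cd_lab = torch.randint(0, 1936, (32,))   # targets from the Kaldi forced alignments
mono_lab = torch.randint(0, 48, (32,))   # context-independent (monophone) targets

cd_out, mono_out = model(x)
loss = criterion(cd_out, cd_lab) + 0.5 * criterion(mono_out, mono_lab)
loss.backward()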

2.4 Chunk and mini-batch composition

     PyTorch-Kaldi automatically splits the full dataset into a number of chunks, each composed of labels and features randomly sampled from the full corpus. Each chunk is then stored in GPU or CPU memory and processed by the neural training algorithm run_nn.py. The toolkit dynamically composes different chunks at each epoch, and a set of mini-batches is then derived from them. A mini-batch consists of a small number of training examples used for gradient computation and parameter optimization.

     How mini-batches are composed depends strongly on the type of neural network. For feed-forward models, mini-batches are made of randomly shuffled features and labels sampled from the chunk. For recurrent networks, mini-batches must instead be composed of full sentences. Different sentences, however, are likely to have different durations, so zero-padding is required to form mini-batches of the same size. PyTorch-Kaldi sorts the speech sequences in ascending order of length (i.e., shorter sentences are processed first). This approach minimizes the need for zero-padding and turns out to help avoid possible biases in the batch-normalization statistics. Moreover, it has proven effective in slightly boosting performance and improving the numerical stability of gradients.
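A minimal sketch of this sort-and-pad strategy for recurrent models, using torch.nn.utils.rnn.pad_sequence (batch size and feature dimension are arbitrary toy values):

import torch
from torch.nn.utils.rnn import pad_sequence

def make_rnn_minibatches(utterances, batch_size=8):
    # utterances: list of (n_frames, feat_dim) tensors of varying length.
    # Sorting by duration keeps utterances of similar length together,
    # so each mini-batch needs as little zero-padding as possible.
    utterances = sorted(utterances, key=lambda u: u.shape[0])  # shortest first
    for i in range(0, len(utterances), batch_size):
        batch = utterances[i:i + batch_size]
        lengths = torch.tensor([u.shape[0] for u in batch])
        padded = pad_sequence(batch)  # (max_len_in_batch, batch, feat_dim), zero-padded
        yield padded, lengths

# Toy example: 20 random "sentences" with durations between 50 and 500 frames.
utts = [torch.randn(torch.randint(50, 500, (1,)).item(), 40) for _ in range(20)]
for padded, lengths in make_rnn_minibatches(utts):
    print(padded.shape, lengths.tolist())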

2.5 DNN acoustic modeling

    Each mini-batch is processed by a neural network implemented in PyTorch, which takes the features as input and produces as output a set of posterior probabilities over the context-dependent phone states. The code is designed to make it easy to plug in custom models. As shown by the pseudo-code of Fig. 2, a new model can be defined simply by adding a new class to neural_nets.py. The class must consist of an initialization method, which specifies the parameters of the model, and a forward method that defines the computations to perform.

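Fig. 2 itself is not reproduced in this post. As a purely illustrative sketch (the class name, option keys, and layer sizes below are hypothetical and do not follow the toolkit's exact interface), such a class in neural_nets.py might look roughly like this:

import torch
import torch.nn as nn

class MyCustomModel(nn.Module):
    """User-defined acoustic model: features in, log-posteriors over CD states out."""

    def __init__(self, options):
        super().__init__()
        feat_dim = options["feat_dim"]        # e.g. 440 for 40 FBANKs with an 11-frame window
        n_out = options["n_cd_states"]        # number of context-dependent phone states
        hidden = options.get("hidden_dim", 1024)
        self.layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden, n_out),
        )

    def forward(self, x):
        # The training loop expects log-probabilities, later normalized by priors for decoding.
        return torch.log_softmax(self.layers(x), dim=1)

model = MyCustomModel({"feat_dim": 440, "n_cd_states": 1936})
posteriors = model(torch.randn(32, 440))
print(posteriors.shape)  # torch.Size([32, 1936])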

                                                     

    As an alternative, many pre-defined state-of-the-art neural models are natively implemented in the toolkit. The current version supports standard MLP, CNN, RNN, LSTM, and GRU models. Moreover, it supports some advanced recurrent architectures, such as the recently proposed Light GRU [31] and twin-regularized RNNs [32]. The SincNet model [33, 34] is also implemented, to perform speech recognition directly from the raw speech waveform. The hyperparameters of the models (e.g., learning rate, number of neurons, number of layers, dropout factor) can be tuned with a utility that implements a random search algorithm [35].

2.6 Decoding and scoring

    The acoustic posterior probabilities produced by the neural network are normalized by their prior probabilities before being fed to the HMM-based Kaldi decoder. The decoder merges the acoustic scores with the language probabilities derived from an n-gram language model and tries to retrieve the sequence of words uttered in the speech signal using a beam-search algorithm. The final Word Error Rate (WER) score is computed with the NIST SCTK scoring toolkit.
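The prior normalization boils down to a division by the state priors in the log domain. A small sketch, where the priors are estimated from label counts in the training alignments (a common choice, shown purely for illustration):

import numpy as np

def normalize_by_priors(log_posteriors, label_counts, eps=1e-10):
    # log_posteriors: (n_frames, n_states) log p(state | x) from the network.
    # label_counts: how often each state occurs in the training alignments.
    log_priors = np.log(label_counts / label_counts.sum() + eps)
    # Subtracting log-priors turns posteriors into scaled likelihoods for the HMM decoder.
    return log_posteriors - log_priors

counts = np.array([120000.0, 45000.0, 3000.0, 800000.0])   # toy state counts
log_post = np.log(np.full((1, 4), 0.25))                    # toy uniform posteriors
print(normalize_by_priors(log_post, counts))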

3. EXPERIMENTAL SETUP

    The corpora and the DNN setup adopted in the experiments are described in the following subsections.

3.1 Corpora and tasks

    The first set of experiments was performed with the TIMIT corpus, considering the standard phone recognition task (in line with the Kaldi s5 recipe [12]).

    To validate our model in a more challenging scenario, experiments were also conducted in the distant-talking conditions of the DIRHA-English dataset [36, 37]. Training was based on the original WSJ-5k corpus (consisting of 7,138 sentences uttered by 83 speakers), contaminated with a set of impulse responses measured in a domestic environment [37]. The test phase was carried out with the real part of the dataset, consisting of 409 WSJ sentences uttered by six native American speakers in the above-mentioned environment.

    Additional experiments were conducted with the CHiME 4 dataset [6], which is based on speech data recorded in four noisy environments (on a bus, in a café, in a pedestrian area, and at a street junction). The training set is composed of 43,690 noisy WSJ sentences recorded by five microphones (arranged on a tablet) and uttered by a total of 87 speakers. The test set ET-real considered in this work is based on 1,320 real sentences uttered by four speakers, while the subset DT-real was used for parameter tuning. The CHiME experiments were based on the single-channel setting [6].

     Finally, experiments were performed with the LibriSpeech [7] dataset. We used the training subset composed of 100 hours and the dev-clean set for parameter search. Test results are reported on the test-clean portion, using the fglarge decoding graph inherited from the Kaldi s5 recipe.

                                               

3.2 DNN setup

    The experiments considered different acoustic features, namely 39 MFCCs (13 static + Δ + ΔΔ), 40 log-mel filter-bank features (FBANKs), and 40 fMLLR features [25] (extracted as described in the Kaldi s5 recipes), computed using windows of 25 ms (frame length) with an overlap (frame shift) of 10 ms. The feed-forward models were initialized according to the Glorot scheme [38], while recurrent weights were initialized with orthogonal matrices [39]. Recurrent dropout was used as a regularization technique [40]. Batch normalization was adopted for feed-forward connections only, as proposed in [41]. Optimization was done with the RMSprop algorithm running for 24 epochs. The performance on the development set was monitored after each epoch, and the learning rate was halved when the relative performance improvement fell below 0.1%. The main hyperparameters of the models (learning rate, number of hidden layers, hidden neurons per layer, dropout factor, and the twin-regularization term λ) were tuned on the development datasets.
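The learning-rate schedule described above ("halve when the relative improvement falls below 0.1%") can be written in a few lines; the numbers below are toy values, not results from the paper:

def maybe_halve_lr(lr, prev_err, curr_err, threshold=0.001):
    # Halve the learning rate when the relative dev-set improvement drops below 0.1%.
    rel_improvement = (prev_err - curr_err) / prev_err
    return lr / 2 if rel_improvement < threshold else lr

lr = 0.0008
dev_err = [18.4, 17.1, 16.9, 16.89]   # toy per-epoch dev error rates (%)
for prev, curr in zip(dev_err, dev_err[1:]):
    lr = maybe_halve_lr(lr, prev, curr)
    print(f"dev error {curr:.2f}%  ->  learning rate {lr:.6f}")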

4. BASELINES

    In this section we discuss the baselines obtained on the TIMIT, DIRHA, CHiME, and LibriSpeech datasets. As a showcase of the main functionalities of the PyTorch-Kaldi toolkit, we first report the experimental validation conducted on TIMIT.

Table 1 shows the performance obtained with several feed-forward and recurrent models using different features. To ensure a more accurate comparison between the architectures, five experiments with different initialization seeds were run for each model and feature. The table therefore reports the average Phone Error Rate (PER). The results show that, as expected, the fMLLR features outperform the MFCC and FBANK coefficients thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially when using the LSTM, GRU, and Li-GRU architectures, which effectively address the vanishing-gradient problem through multiplicative gates. The best result (PER = 14.2%) is obtained with the Li-GRU model [31], which is based on a single gate and therefore saves 33% of the computation over a standard GRU.

 

                                           

    Table 2 details the impact of some popular techniques available in PyTorch-Kaldi for improving ASR performance. The first row (Baseline) reports the performance achieved with a basic recurrent model in which powerful techniques such as dropout and batch normalization are not adopted. The second row highlights the gain obtained when progressively increasing the sequence length during training. In this case, training started with speech sentences truncated at 100 steps (i.e., roughly one second of speech), and the maximum sequence duration was doubled at every epoch. This simple strategy generally improves system performance, since it encourages the model to first focus on short-term dependencies and to learn longer-term ones only at a later stage. The third row shows the improvement obtained when adding recurrent dropout. Similarly to [40, 41], the same dropout mask was applied over all time steps to avoid gradient-vanishing problems. The fourth row shows the benefit derived from batch normalization [18]. Finally, the last row reports the performance achieved when also applying monophone regularization [27]. In this case, a multi-task learning strategy with two softmax classifiers is employed: the first estimates the context-dependent states, while the second predicts the monophone targets. As observed in [27], our results confirm that this technique can successfully be used as an effective regularizer.
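The progressive sequence-length increase in the second row is a simple curriculum; a sketch with the starting length of 100 frames doubled at every epoch (the truncation itself would be applied before mini-batch composition):

def max_seq_len(epoch, start=100):
    # Truncate sentences to 100 frames at epoch 0 and double the limit every epoch.
    return start * (2 ** epoch)

for epoch in range(5):
    limit = max_seq_len(epoch)
    print(f"epoch {epoch}: truncate training sentences to {limit} frames")
    # truncated_chunk = [utt[:limit] for utt in chunk]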

    The experiments discussed so far are based on single neural models. In Table 3, we compare our best Li-GRU system with a more complex architecture based on a combination of feed-forward and recurrent models fed by a concatenation of features. To the best of our knowledge, the PER = 13.8% achieved by the latter system is the best published performance on the TIMIT test set.

    The results discussed so far are based on features computed with Kaldi. In PyTorch-Kaldi, however, users can also employ their own features. Table 4 shows the results obtained with convolutional models fed either by standard FBANK coefficients or directly by the raw waveform. The standard CNN based on raw samples performs similarly to the one fed by FBANK features. A performance gain is observed with SincNet [33], whose effectiveness for speech recognition is highlighted here for the first time.

                                                  

     We now extend the experimental validation to the other datasets. Table 5 shows the performance achieved on the DIRHA, CHiME, and LibriSpeech (100h) datasets. The table consistently shows better performance with the Li-GRU model, confirming our earlier findings on TIMIT. The results on DIRHA and CHiME demonstrate the effectiveness of the proposed toolkit in noisy conditions. For comparison, the best Kaldi baseline for the single-channel track proposed in egs/chime4/s5 achieves WER = 18.1%, and an end-to-end system trained with ESPnet reaches WER = 44.99%, confirming how challenging these acoustic conditions are for end-to-end speech recognition. DIRHA is another very challenging task, characterized by considerable noise and reverberation. The WER = 23.9% obtained on this dataset is the best performance published so far on the single-microphone task. Finally, the performance obtained on LibriSpeech outperforms the corresponding p-norm Kaldi baseline (WER = 6.5%) on the considered 100-hour subset.

5. CONCLUSIONS

    This paper described the PyTorch-Kaldi project, a new initiative that aims to bridge the gap between Kaldi and PyTorch. The core goal of the toolkit is to make the development of ASR systems simpler and more flexible, allowing users to easily plug in their customized acoustic models. PyTorch-Kaldi also supports combinations of neural architectures, features, and labels, allowing users to build complex ASR pipelines. The experiments confirm that PyTorch-Kaldi achieves state-of-the-art results on several popular speech recognition tasks and datasets.

The current version of PyTorch-Kaldi has been publicly released with detailed documentation. The project is still in its initial phase, and we invite all potential contributors to take part in it. We hope to build a community of developers large enough to progressively maintain, improve, and extend the functionality of the current toolkit. In the future, we plan to increase the number of pre-implemented models and to support neural language model training/rescoring, sequence discriminative training, online speech recognition, and end-to-end training.

6. ACKNOWLEDGMENTS

We would like to thank Maurizio Omologo, Enzo Telk, and Antonio Mazzaldi for their helpful comments. This research was enabled in part by support provided by Calcul Québec and Compute Canada.




2019-09-15 22:47:26 u012798683

Kaldi is a speech recognition toolkit whose core is written in C++, intended for speech recognition researchers.

It is also one of the most commonly used toolkits in the speech recognition field.

It ships with many feature extraction modules and acoustic model recipes, which can be used directly or retrained, for example GMM-HMM models.

It also supports GPU training and is very powerful. Many newcomers run into all kinds of problems when they first use Kaldi.

There is plenty of material online, but some of it is outdated and differs from the current installation and build procedure, which leads to various errors.

So here I share the problems I ran into while installing and building Kaldi, together with the installation steps.

During installation, please use a physical Ubuntu machine if possible; installation inside an Ubuntu virtual machine may fail.

Official documentation: http://kaldi-asr.org/

How to install? Let's get straight to the point:

1. First, following my other blog post, switch the Ubuntu package sources to the domestic Aliyun mirror.

Link: https://blog.csdn.net/u012798683/article/details/100765882

2. After switching the sources, install git:

sudo apt-get install git

3. Clone the Kaldi source code from GitHub:

git clone https://github.com/kaldi-asr/kaldi.git

4. Install the dependencies and third-party libraries used by Kaldi:

sudo apt-get install git
sudo apt-get install bc
sudo apt-get install g++
sudo apt-get install zlib1g-dev make automake autoconf bzip2 libtool subversion
sudo apt-get install libatlas3-base

5. After installing the dependencies above, enter the Kaldi source directory and run the bundled script to check whether all required dependencies are present.

cd kaldi      # the directory is named kaldi-master if you downloaded the zip instead of cloning
cd tools

Run the dependency check script:

./extras/check_dependencies.sh

It will report that the MKL package is missing and tell you to run the install_mkl.sh script in the tools directory to install MKL.

Run the MKL install script:

./extras/install_mkl.sh

After the installation finishes, run the check script again:

./extras/check_dependencies.sh

It will report another missing dependency, sox, and again show you how to install it; just run the suggested command (on Ubuntu, sudo apt-get install sox).

After installing it, run the check script once more. Repeat until no errors are reported and the script says everything is OK; the dependencies are then complete.

In the tools directory, run:

make -j 4      # -j 4 runs four parallel jobs to speed up the build

Or simply run make. Then wait patiently.

Once make finishes in the tools directory, all the external dependencies and third-party libraries have been installed.

Next, enter the src directory to configure and build Kaldi itself:

cd ..
cd src

Inside src, build with the following commands:

./configure --shared
make depend
make

After issuing the commands above, just wait for make to finish.

This make step takes quite a long time; be patient.

When make completes, the build ends by printing:

Done

which indicates that make has finished.

Now we can run a simple example to verify that Kaldi was installed successfully.

Go to the egs/yesno/s5 directory under the Kaldi root and run the following command:

./run.sh

If it completes without errors, the installation was successful. The yesno recipe finishes by printing a word error rate line (typically %WER 0.00 for this toy task).

Kaldi is now fully installed.

2014-04-01 20:47:24 u013538664

I recently read the Kaldi paper; here is an overview.

Abstract:

We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

Notes:

1. Kaldi is a free, open-source toolkit for speech recognition research.

2. A finite-state transducer (FST) is a finite-state automaton with two tapes.

3. OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).

4. In an SGMM, the means and mixture weights vary in a subspace of the total parameter space; this is called a Subspace Gaussian Mixture Model.

I INTRODUCTION

Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
The goal of Kaldi is to have modern and flexible code that is easy to understand, modify and extend. Kaldi is available on SourceForge (see http://kaldi.sf.net/). The tools compile on the commonly used Unix-like systems and on Microsoft Windows.

Notes:

1. Kaldi is available on SourceForge (see http://kaldi.sf.net/).

2. The tools compile on commonly used Unix-like systems and on Microsoft Windows.

Researchers on automatic speech recognition (ASR) have several potential choices of open-source toolkits for building a recognition system. Notable among these are: HTK [1], Julius[2] (both written in C), Sphinx-4[3] (written in Java), and the RWTH ASR toolkit [4] (written in C++). Yet, our specific requirements—a finite-state transducer (FST) based frame-work, extensive linear algebra support, and a non-restrictive license—led to the development of Kaldi. Important features of Kaldi include:

Integration with Finite State Transducers: We compile against the OpenFst toolkit [5] (using it as a library).

Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines.

Extensible design:We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol. Thus the decoder could work from any suitable source of scores.

Open license:The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.

Complete recipes:We make available complete recipes for building speech recognition systems, that work from
widely available databases such as those provided by the Linguistic Data Consortium (LDC).

Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.

The main intended use for Kaldi is acoustic modeling research; thus, we view the closest competitors as being HTK
and the RWTH ASR toolkit (RASR). The chief advantage versus HTK is modern, flexible, cleanly structured code and better WFST and math support; also, our license terms are more open than either HTK or RASR.

Notes:

1. Kaldi's main intended use is acoustic modeling research.

2. Advantages: modern, flexible, cleanly structured code; better WFST and math support; and more open license terms than HTK or RASR.

The paper is organized as follows: we start by describing the structure of the code and design choices (section II). This is followed by describing the individual components of a speech recognition system that the toolkit supports: feature extraction (section III), acoustic modeling (section IV), phonetic decision trees (section V), language modeling (section VI), and decoders (section VIII). Finally, we provide some benchmarking results in section IX.

II OVERVIEW OF THE TOOLKIT

We give a schematic overview of the Kaldi toolkit in figure 1. The toolkit depends on two external libraries that are
also freely available: one is OpenFst [5] for the finite-state framework, and the other is numerical algebra libraries. We use the standard “Basic Linear Algebra Subroutines” (BLAS)and “Linear Algebra PACKage” (LAPACK) routines for the latter.

Notes:

1. External libraries: OpenFst and the numerical algebra libraries (BLAS and LAPACK).


The library modules can be grouped into two distinct halves, each depending on only one of the external libraries
(c.f. Figure 1). A single module, the DecodableInterface (section VIII), bridges these two halves.

Notes:

1. A single module, the DecodableInterface, bridges these two halves.

Access to the library functionalities is provided through command-line tools written in C++, which are then called
from a scripting language for building and running a speech recognizer. Each tool has very specific functionality with a small set of command line arguments: for example, there are separate executables for accumulating statistics, summing accumulators, and updating a GMM-based acoustic model using maximum likelihood estimation. Moreover, all the tools can read from and write to pipes which makes it easy to chain together different tools.

To avoid “code rot”, We have tried to structure the toolkit in such a way that implementing a new feature will generally involve adding new code and command-line tools rather than modifying existing ones

Notes:

1. Library functionality is accessed through command-line tools written in C++.

III FEATURE EXTRACTION

Our feature extraction and waveform-reading code aims to create standard MFCC and PLP features, setting reasonable defaults but leaving available the options that people are most likely to want to tweak (for example, the number of mel bins, minimum and maximum frequency cutoffs, etc.). We support most commonly used feature extraction approaches: e.g. VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.

Notes:

1. Features: MFCC and PLP.

2. Feature-processing approaches supported: VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.

IV ACOUSTIC MODELING

Our aim is for Kaldi to support conventional models (i.e.diagonal GMMs) and Subspace Gaussian Mixture Models
(SGMMs), but to also be easily extensible to new kinds of model.

Notes:

1. Diagonal GMMs.

2. Subspace Gaussian Mixture Models (SGMMs).

A.Gaussian mixture models

We support GMMs with diagonal and full covariance structures. Rather than representing individual Gaussian densities separately, we directly implement a GMM class that is parametrized by the natural parameters, i.e. means times inverse covariances and inverse covariances. The GMM classes also store the constant term in likelihood computation, which consist of all the terms that do not depend on the data vector. Such an implementation is suitable for efficient log-likelihood computation with simple dot-products.

B.GMM-based acoustic model

The “acoustic model” class AmDiagGmm represents a collection of DiagGmm objects, indexed by “pdf-ids” that correspond to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e.GMMs). There are separate classes that represent the HMM structure, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs, which provide a mapping between the HMM states and the pdf index of the acoustic model class. Speaker adaptation and other linear transforms like maximum likelihood linear transform (MLLT) [6] or semi-tied covariance (STC) [7] are implemented by separate classes.

C.HMM Topology

It is possible in Kaldi to separately specify the HMM topology for each context-independent phone. The topology
format allows nonemitting states, and allows the user to pre-specify tying of the p.d.f.’s in different HMM states.

D.Speaker adaptation

We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space
adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR,
multiple transforms can be estimated using a regression tree [10]. When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pipeline. The toolkit also supports speaker normalization using a linear approximation to VTLN, similar to [11], or conventional feature-level VTLN, or a more generic approach for gender normalization which we call the “exponential transform” [12]. Both fMLLR and VTLN can be used for speaker adaptive training (SAT) of the acoustic models.

Notes:

1. Model-space adaptation using maximum likelihood linear regression (MLLR) and feature-space adaptation using feature-space MLLR (fMLLR).

E. Subspace Gaussian Mixture Models

For subspace Gaussian mixture models (SGMMs), the toolkit provides an implementation of the approach described
in [13]. There is a single class AmSgmm that represents a whole collection of pdf’s; unlike the GMM case there is no class that represents a single pdf of the SGMM. Similar to the GMM case, however, separate classes handle model estimation and speaker adaptation using fMLLR.












2014-04-02 13:19:50 u013538664

V. PHONETIC DECISION TREES

Our goals in building the phonetic decision tree code were to make it efficient for arbitrary context sizes (i.e. we avoided enumerating contexts), and also to make it general enough to support a wide range of approaches. The conventional approach is, in each HMM-state of each monophone, to have a decision tree that asks questions about, say, the left and right phones. In our framework, the decision-tree roots can be shared among the phones and among the states of the phones, and questions can be asked about any phone in the context window, and about the HMM state. Phonetic questions can be supplied based on linguistic knowledge, but in our recipes the questions are generated automatically based on a tree-clustering of the phones. Questions about things like phonetic stress (if marked in the dictionary) and word start/end information are supported via an extended phone set; in this case we share the decision-tree roots among the different versions of the same phone.

Notes:

1. The goals of the phonetic decision-tree code are to handle arbitrary context sizes efficiently and to be general enough to support a wide range of approaches.

2. Conventional approach: each HMM state of each monophone has its own decision tree.

3. Approach in this paper: decision-tree roots can be shared among phones and among the HMM states of those phones.

VI. LANGUAGE MODELING

Since Kaldi uses an FST-based framework, it is possible, in principle, to use any language model that can be represented as an FST. We provide tools for converting LMs in the standard ARPA format to FSTs. In our recipes, we have used the IRSTLM toolkit for purposes like LM pruning. For building LMs from raw text, users may use the IRSTLM toolkit, for which we provide installation help, or a more fully-featured toolkit such as SRILM.

Notes:

1. Kaldi uses an FST-based framework, so in principle any language model that can be represented as an FST can be used.

2. The IRSTLM toolkit is used for tasks such as LM pruning.

VII. CREATING DECODING GRAPHS

All our training and decoding algorithms use Weighted Finite State Transducers (WFSTs). In the conventional recipe [14], the input symbols on the decoding graph correspond to context-dependent states (in our toolkit, these symbols are numeric and we call them pdf-ids). However, because we allow different phones to share the same pdf-ids, we would have a number of problems with this approach, including not being able to determinize the FSTs, and not having sufficient information from the Viterbi path through an FST to work out the phone sequence or to train the transition probabilities. In order to fix these problems, we put on the input of the FSTs a slightly more fine-grained integer identifier that we call a "transition-id", that encodes the pdf-id, the phone it is a member of, and the arc (transition) within the topology specification for that phone. There is a one-to-one mapping between the "transition-ids" and the transition-probability parameters in the model: we decided to make transitions as fine-grained as we could without increasing the size of the decoding graph.

Notes:

1. Problems with the conventional approach: the FSTs could not be determinized, and the Viterbi path through the FST would not carry enough information to work out the phone sequence or to train the transition probabilities.

2. Fix: the input symbols of the FSTs are slightly more fine-grained integer identifiers called "transition-ids", each encoding the pdf-id, the phone it belongs to, and the arc (transition) within that phone's topology specification.

Our decoding-graph construction process is based on the recipe described in [14]; however, there are a number of
differences. One important one relates to the way we handle “weight-pushing”, which is the operation that is supposed to ensure that the FST is stochastic. “Stochastic” means that the weights in the FST sum to one in the appropriate sense, for each state (like a properly normalized HMM). Weight pushing may fail or may lead to bad pruning behavior if the FST representing the grammar or language model (G) is not stochastic, e.g. for backoff language models. Our approach is to avoid weight-pushing altogether, but to ensure that each stage of graph creation “preserves stochasticity” in an appropriate sense. Informally, what this means is that the “non-sum-to-one-ness” (the failure to sum to one) will never get worse than what was originally present in G.

VIII. DECODERS

We have several decoders, from simple to highly optimized; more will be added to handle things like on-the-fly language model rescoring and lattice generation. By “decoder” we mean a C++ class that implements the core decoding algorithm. The decoders do not require a particular type of acoustic model: they need an object satisfying a very simple interface with a function that provides some kind of acoustic model score for a particular (input-symbol and frame).

class DecodableInterface {
 public:
  virtual float LogLikelihood(int frame, int index) = 0;
  virtual bool IsLastFrame(int frame) = 0;
  virtual int NumIndices() = 0;
  virtual ~DecodableInterface() {}
};
Command-line decoding programs are all quite simple, do just one pass of decoding, and are all specialized for one
decoder and one acoustic-model type. Multi-pass decoding is implemented at the script level.

IX. EXPERIMENTS
We report experimental results on the Resource Management (RM) corpus and on Wall Street Journal. The results reported here correspond to version 1.0 of Kaldi; the scripts that correspond to these experiments may be found in egs/rm/s1 and egs/wsj/s1.

A. Comparison with previously published results

Table I shows the results of a context-dependent triphone system with mixture-of-Gaussian densities; the HTK baseline numbers are taken from [15] and the systems use essentially the same algorithms. The features are MFCCs with per-speaker cepstral mean subtraction. The language model is the word-pair bigram language model supplied with the RM corpus. The WERs are essentially the same. Decoding time was about 0.13×RT, measured on an Intel Xeon CPU at 2.27GHz. The system identifier for the Kaldi results is tri3c.

Table II shows similar results for the Wall Street Journal system, this time without cepstral mean subtraction. The WSJ corpus comes with bigram and trigram language models, and we compare with published numbers using the bigram language model. The baseline results are reported in [16], which we refer to as "Bell Labs" (for the authors' affiliation), and a HTK system described in [17]. The HTK system was gender-dependent (a gender-independent baseline was not reported), so the HTK results are slightly better. Our decoding time was about 0.5×RT.

B. Other experiments

Here we report some more results on both the WSJ test sets (Nov'92 and Nov'93) using systems trained on just the SI-84 part of the training data, that demonstrate different features that are supported by Kaldi. We also report results on the RM task, averaged over 6 test sets: the 4 mentioned in Table I together with Mar'87 and Oct'87. The best result for a conventional GMM system is achieved by a SAT system that splices 9 frames (4 on each side of the current frame) and uses LDA to project down to 40 dimensions, together with MLLT. We achieve better performance on average, with an SGMM system trained on the same features, with speaker vectors and fMLLR adaptation. The last line, with the best results, includes the "exponential transform" [12] in the features.


X. CONCLUSIONS
We described the design of Kaldi, a free and open-source speech recognition toolkit. The toolkit currently supports modeling of context-dependent phones of arbitrary context lengths, and all commonly used techniques that can be estimated using maximum likelihood. It also supports the recently proposed SGMMs. Development of Kaldi is continuing and we are working on using large language models in the FST framework, lattice generation and discriminative training.

REFERENCES

[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4). Cambridge University Engineering Department, 2009.

[2] A. Lee, T. Kawahara, and K. Shikano, "Julius – an open source real-time large vocabulary recognition engine," in EUROSPEECH, 2001, pp. 1691–1694.

[3] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, "Sphinx-4: A flexible open source framework for speech recognition," Sun Microsystems Inc., Technical Report SML1 TR2004-0811, 2004.

[4] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, "The RWTH Aachen University Open Source Speech Recognition System," in INTERSPEECH, 2009, pp. 2111–2114.

[5] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, "OpenFst: a general and efficient weighted finite-state transducer library," in Proc. CIAA, 2007.

[6] R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. IEEE ICASSP, vol. 2, 1998, pp. 661–664.

[7] M. J. F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. Speech and Audio Proc., vol. 7, no. 3, pp. 272–281, May 1999.

[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, no. 2, pp. 171–185, 1995.

[9] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998.

[10] ——, "The generation and use of regression class trees for MLLR adaptation," Cambridge University Engineering Department, Technical Report CUED/F-INFENG/TR.263, August 1996.

[11] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, "Using VTLN for broadcast news transcription," in Proc. ICSLP, 2004, pp. 1953–1956.

[12] D. Povey, G. Zweig, and A. Acero, "The Exponential Transform as a generic substitute for VTLN," in IEEE ASRU, 2011.

[13] D. Povey, L. Burget et al., "The subspace Gaussian mixture model—A structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, pp. 404–439, April 2011.

[14] M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech and Language, vol. 20, no. 1, pp. 69–88, 2002.

[15] D. Povey and P. C. Woodland, "Frame discrimination training for HMMs for large vocabulary speech recognition," in Proc. IEEE ICASSP, vol. 1, 1999, pp. 333–336.

[16] W. Reichl and W. Chou, "Robust decision tree state tying for continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 555–566, September 2000.

[17] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, "Large vocabulary continuous speech recognition using HTK," in Proc. IEEE ICASSP, vol. 2, 1994, pp. II/125–II/128.


















2014-01-26 04:26:52 u013538664

1. Introduction

The Kaldi speech recognition toolkit gathers the rather scattered commands and functions of HTK into an organized whole, driven by perl scripts. It also adds deep neural network (DNN) classifiers. It was created by people who previously worked on HTK development, so it can be regarded as an upgraded and strengthened version of HTK.

The official Kaldi website: http://kaldi.sourceforge.net/index.html


2. Installation and compilation

Step 1: download the Kaldi toolkit.
Kaldi has two versions, kaldi-1 and kaldi-trunk; the former is the stable release and the latter is the development version. I installed the development version.
Start the installation as follows:

sudo apt-get install subversion
svn update
svn co https://kaldi.svn.sourceforge.net/svnroot/kaldi/trunk kaldi-trunk
cd kaldi-trunk
cd tools
cat INSTALL
make  -j 4

Notes:

1. If the machine has more than one CPU core, say four, you can run make -j 4 to save time.

2. The make command installs eight packages; three of them (sph2pipe, openfst, ATLAS) are required.

Step 2: configure

cd ../src
./configure

Note: this step usually fails at this point because openfst or ATLAS has not been installed yet.

Installing openfst:

1. Install g++:

sudo apt-get install g++

2. Extract and patch the sources:

tar -xovzf openfst-1.3.2.tar.gz
for dir in openfst-1.3.2/{src/,}include/fst; do
    ( [ -d $dir ] && cd $dir && patch -p0 -N <../../../../openfst.patch ) 
done 
rm openfst 2>/dev/null # Remove any existing link
ln -s openfst-1.3.2 openfst
cd openfst-1.3.2

Choose the correct configure command for your platform:

For Linux or Darwin:

./configure --prefix=`pwd` --enable-static --disable-shared

For a 64-bit system:

./configure --host=x86_64-linux --prefix=`pwd` --enable-static --disable-shared

For a virtual machine:

./configure --prefix=`pwd` CXX=g++-4.exe CC=gcc-4.exe --enable-static --disable-shared

3. Install:
sudo make install
Installing ATLAS:

Note: before installing ATLAS, make sure CPU throttling is turned off. Most operating systems enable CPU throttling in their power-management settings to protect the CPU. On most machines it can be disabled in the BIOS (usually under the power management or CPU frequency options). It can also be disabled from the operating system; on Fedora, running /usr/bin/cpufreq-selector -g performance turns it off. On this machine (Ubuntu 12.04) the CPU frequency scaling governor is located at /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. I suggest using the tool described at https://wiki.archlinux.org/index.php/CPU_Frequency_Scaling_(简体中文); the change is temporary and reverts to the default after a reboot, so it does not affect the base configuration.

The concrete steps:
sudo apt-get install cpufrequtils 
sudo cpufreq-set -c 1 -g performance
sudo cpufreq-set -c 2 -g performance
sudo cpufreq-set -c 3 -g performance
sudo cpufreq-set -c 4 -g performance
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

Check that the governor in each of these files has changed from ondemand to performance.

Besides disabling CPU throttling, you also need to install gfortran, otherwise the build will fail:

sudo apt-get install gfortran

Finally, in the tools directory run:

./install_atlas.sh


This completes the ATLAS installation.

Step 3: configure and build:

cd ../src
./configure
make depend
make -j 4


After a while a message will appear indicating that the build succeeded.



