2019-04-19 20:51:44 weixin_44461682

Since I started working on speech recognition, I have been learning from the literature what it is all about: its basic principles, background, recognition pipeline, and so on.
When it comes to actually using Kaldi for recognition I am still very much a beginner, so please treat the following notes on the documentation as a reference only.

[The following is an excerpt of the Kaldi documentation outline and contents]

3 Using Kaldi
3.1 Overview
When running the example recipes in Kaldi, there are three scripts you need to pay attention to: cmd.sh, path.sh, and run.sh. They are described in turn below.

  1. cmd.sh reads:

# "queue.pl" uses qsub. The options to it are
# options to qsub. If you have GridEngine installed,
# change this to a queue you have access to.
# Otherwise, use "run.pl", which will run jobs locally
# (make sure your --num-jobs options are no more than
# the number of cpus on your machine.

# a) JHU cluster options
#export train_cmd="queue.pl -l arch=*64"
#export decode_cmd="queue.pl -l arch=*64,mem_free=2G,ram_free=2G"
#export mkgraph_cmd="queue.pl -l arch=*64,ram_free=4G,mem_free=4G"
#export cuda_cmd=run.pl

# b) BUT cluster options
#export train_cmd="queue.pl -q all.q@@blade -l ram_free=1200M,mem_free=1200M"
#export decode_cmd="queue.pl -q all.q@@blade -l ram_free=1700M,mem_free=1700M"
#export decodebig_cmd="queue.pl -q all.q@@blade -l ram_free=4G,mem_free=4G"
#export cuda_cmd="queue.pl -q long.q@@pco203 -l gpu=1"
#export cuda_cmd="queue.pl -q long.q@pcspeech-gpu"
#export mkgraph_cmd="queue.pl -q all.q@@servers -l ram_free=4G,mem_free=4G"

# c) run it locally...
export train_cmd=run.pl
export decode_cmd=run.pl
export cuda_cmd=run.pl
export mkgraph_cmd=run.pl

You can clearly see the three cases labeled a, b, and c. Cases a and b are for running on a cluster; case c is the one we need, since we are running on a virtual machine. You need to edit this script accordingly.

  2. path.sh reads:

export KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/utils/:$KALDI_ROOT/src/bin:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/tools/irstlm/bin/:$KALDI_ROOT/src/fstbin/:$KALDI_ROOT/src/gmmbin/:$KALDI_ROOT/src/featbin/:$KALDI_ROOT/src/lm/:$KALDI_ROOT/src/sgmmbin/:$KALDI_ROOT/src/sgmm2bin/:$KALDI_ROOT/src/fgmmbin/:$KALDI_ROOT/src/latbin/:$KALDI_ROOT/src/nnetbin:$KALDI_ROOT/src/nnet-cpubin/:$KALDI_ROOT/src/kwsbin:$PWD:$PATH
export LC_ALL=C
export IRSTLM=$KALDI_ROOT/tools/irstlm
Here you normally only need to change the export KALDI_ROOT=`pwd`/../../.. line so that it points to the directory where you installed Kaldi; sometimes no change is needed at all, so adjust it according to your own setup. A quick way to verify the result is sketched below.

  3. run.sh

In run.sh you need to specify where your data lives; you only have to change the corpus path. For example:

#timit=/export/corpora5/LDC/LDC93S1/timit/TIMIT # @JHU
timit=/mnt/matylda2/data/TIMIT/timit # @BUT

Change this to the path where your copy of TIMIT is located (a small check is sketched below).
The other corpora work the same way.
In addition, corpora such as voxforge, vystadial_cz, and vystadial_en are freely downloadable,
so if you do not have a licensed corpus you can use these for your experiments.
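Before launching the recipe, it is worth confirming that the variable really points at your corpus. A minimal sketch, assuming the usual TIMIT layout with TRAIN/ and TEST/ subdirectories (the path below is only a placeholder):

timit=/path/to/your/TIMIT   # placeholder; use your own location
ls "$timit"                 # a standard TIMIT copy should show TRAIN, TEST, DOC, ...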

Here are my own notes after reading the above:

Before running the Kaldi examples,

pay attention to the three scripts cmd.sh, path.sh, and run.sh.

  1. cmd.sh (cmd = command)
    The main change here is replacing queue.pl with run.pl.
    # We modify cmd.sh as follows:
    export train_cmd=run.pl      # change the original queue.pl to run.pl
    export decode_cmd="run.pl"   # change the original queue.pl to run.pl; I also dropped --mem 4G, since my virtual machine does not have much memory
    export mkgraph_cmd="run.pl"  # change the original queue.pl to run.pl; likewise dropped --mem 8G for the same reason
    export cuda_cmd="run.pl"     # change the original queue.pl to run.pl; dropped the original --gpu 1, since we are not going to use a GPU
  2. path.sh
    I asked a senior student about this one; he said it can be left unchanged, so I did not modify it.
  3. run.sh
    Here I wanted to start by running the Tsinghua thchs30 recipe
    (the corpus is already installed; how to install it is covered later).
    The main changes are setting n (the number of parallel jobs) to "8" or "4", and setting thchs=... (the path where the Tsinghua corpus is stored).
    So first open run.sh:
[czy@localhost ~]$ cd
[czy@localhost ~]$ cd kaldi/egs/thchs30/s5
[czy@localhost s5]$ ls
cmd.sh conf local path.sh RESULT run.sh steps thchs30-openslr utils
[czy@localhost s5]$ vim run.sh

Next, let's look at the first few lines of run.sh:

#!/bin/bash
. ./cmd.sh ## You'll want to change cmd.sh to something that will work on your system.
                 ## This relates to the queue.

. ./path.sh

Here we can see that when run.sh executes, it first sources cmd.sh and path.sh. cmd.sh is the one we just modified; path.sh is discussed in a moment.

H=`pwd`  # exp home
n=4      # parallel jobs; changed from the original n=8 to n=4

Here H=`pwd` simply records the current directory so it can be referenced later; you can ignore it for now. We change n=8 to n=4 because the machine we run on has four CPU cores, and the number of parallel jobs should not exceed the core count (a quick way to check is sketched below).
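A quick way to check the core count before choosing n (a sketch; both commands are standard on Linux):

nproc                              # number of available CPU cores
grep -c ^processor /proc/cpuinfo   # equivalent on most Linux systems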

#corpus and trans directory
thchs=/home/czy/kaldi/egs/thchs30/s5/thchs30-openslr    # changed from the original /nfs/public/materials/data/thchs30-openslr to /home/czy/kaldi/egs/thchs30/s5/thchs30-openslr

This is the directory holding the thchs30 data to be trained on;
in my case it is /home/czy/kaldi/egs/thchs30/s5/thchs30-openslr:

[czy@localhost thchs30-openslr]$ pwd
/home/czy/kaldi/egs/thchs30/s5/thchs30-openslr
[czy@localhost thchs30-openslr]$ ls
data_thchs30 data_thchs30.tgz resource resource.tgz
[czy@localhost thchs30-openslr]$

(Here I downloaded only two of the archives.)
The thchs30 Mandarin speech corpus is hosted at http://www.openslr.org/18/
On that page there are download links for three archives: data_thchs30.tgz, resource.tgz, and test-noise.tgz.
On the server:

[czy@localhost ~]$ cd
[czy@localhost ~]$ cd kaldi/egs/thchs30/s5
[czy@localhost s5]$ mkdir thchs30-openslr
[czy@localhost s5]$ cd thchs30-openslr
[czy@localhost thchs30-openslr]$ 

You can download each of them with wget:
wget http://www.openslr.org/resources/18/data_thchs30.tgz
wget http://www.openslr.org/resources/18/test-noise.tgz
wget http://www.openslr.org/resources/18/resource.tgz

[czy@localhost thchs30-openslr]$ wget http://www.openslr.org/resources/18/data_thchs30.tgz 
[czy@localhost thchs30-openslr]$ wget http://www.openslr.org/resources/18/test-noise.tgz
[czy@localhost thchs30-openslr]$ wget http://www.openslr.org/resources/18/resource.tgz

(Of course you can also download them to your own computer and upload them to the server with Xftp 6. Note: the rz command can also upload local files to the server, but only for files smaller than 4 GB.)

After the downloads finish, unpack the archives:
Extract into the current directory: tar zxvf <file>.tgz
Extract into a specific directory: tar zxvf <file>.tgz -C /path/to/target/dir
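For this recipe the unpacking step looks roughly like the sketch below, assuming the archives were downloaded into thchs30-openslr as above:

cd ~/kaldi/egs/thchs30/s5/thchs30-openslr
tar zxvf data_thchs30.tgz
tar zxvf resource.tgz
tar zxvf test-noise.tgz    # optional; only if you downloaded the noise test set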

2019-09-15 22:47:26 u012798683

Kaldi is a speech recognition toolkit written in C++, intended for use by speech recognition researchers.

It is also one of the most widely used tools in the speech recognition field.

It ships with many feature extraction modules and acoustic model code, which can be used directly or retrained, e.g. GMM-HMM models.

It also supports training on GPUs and is very powerful. Many newcomers run into problems when using Kaldi:

there is a lot of material online, but some of it is old and differs from the current installation and build procedure, which leads to all sorts of errors.

So here I share the problems I ran into while installing and building Kaldi, together with the installation steps.

During installation, try to use Ubuntu on a physical machine; Ubuntu in a virtual machine may fail to install some components.

Official documentation: http://kaldi-asr.org/

How to install — let's get straight to the point:

1. First, following my other blog post, switch the Ubuntu package sources to the Alibaba mirror (useful inside mainland China).

Link: https://blog.csdn.net/u012798683/article/details/100765882

2. After switching the sources, install git:

sudo apt-get install git

3. Download the Kaldi source code from GitHub:

git clone https://github.com/kaldi-asr/kaldi.git

4. Install Kaldi's dependencies and the third-party libraries it uses:

sudo apt-get install git
sudo apt-get install bc
sudo apt-get install g++
sudo apt-get install zlib1g-dev make automake autoconf bzip2 libtool subversion
sudo apt-get install libatlas3-base

5. With the dependency packages above installed, enter the Kaldi source directory and run the bundled script to check whether all required dependencies are present.

cd kaldi-master
cd tools 

Run the dependency check script:

./extras/check_dependencies.sh

It will report that the MKL dependency is missing and tell you to run the install_mkl.sh script under the tools directory to install MKL.

Run the install script:

./extras/install_mkl.sh

After installation, run the check script again:

./extras/check_dependencies.sh

It will then report another missing dependency, sox, and again tell you how to install it; just run the suggested install command.

After that, run the check script once more. Repeat until no errors are reported and the script says everything is OK; the dependencies are then fully installed.

In the tools directory, run:

make -j 4    # -j 4 runs 4 jobs in parallel to speed up the build

Or simply run make. Then wait patiently.

Once make finishes in the tools directory, all of the external dependencies and third-party libraries are in place.

Next, go into the src directory to configure and build Kaldi itself:

cd ..
cd src

Inside src, build it with the following commands:

./configure --shared
make depend
make

After issuing the commands above, wait for make to finish.

This make step takes quite a long time; just be patient.

When make finishes it prints

Done

which indicates the build completed successfully.

Now we can run a simple example to verify that Kaldi was installed successfully.

Go into the kaldi-master/egs/yesno/s5 directory

and run:

./run.sh

If it completes without errors, the installation succeeded.

After the yesno example finishes, it prints the word error rate of the tiny test set, which confirms that everything works (a sketch of what to look for is given below).
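A sketch of the final check; the decode directory name below is taken from the default yesno recipe and may differ in other Kaldi versions, and utils/best_wer.sh is a helper script shipped with the recipes:

cd kaldi-master/egs/yesno/s5
# run.sh ends by printing a %WER line for the tiny test set; for yesno it is normally at or near 0%.
# It can also be re-read afterwards:
grep WER exp/mono0a/decode_test_yesno/wer_* | utils/best_wer.sh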

With that, the Kaldi installation is complete.

2014-04-01 20:47:24 u013538664

I recently read the paper describing Kaldi; here is an overview.

Abstract:

We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.

Notes:

1. Kaldi is a free, open-source toolkit for speech recognition research.

2. A finite-state transducer (FST) is a finite-state automaton with two tapes.

3. OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).

4. Means and mixture weights vary in a subspace of the total parameter space; this is called a Subspace Gaussian Mixture Model (SGMM).

I. INTRODUCTION

Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
The goal of Kaldi is to have modern and flexible code that is easy to understand, modify and extend. Kaldi is available on SourceForge (see http://kaldi.sf.net/). The tools compile on the commonly used Unix-like systems and on Microsoft Windows.

Notes:

1. Kaldi is available on SourceForge (see http://kaldi.sf.net/).

2. The tools compile on Unix-like systems and on Microsoft Windows.

Researchers on automatic speech recognition (ASR) have several potential choices of open-source toolkits for building a recognition system. Notable among these are: HTK [1], Julius [2] (both written in C), Sphinx-4 [3] (written in Java), and the RWTH ASR toolkit [4] (written in C++). Yet, our specific requirements—a finite-state transducer (FST) based framework, extensive linear algebra support, and a non-restrictive license—led to the development of Kaldi. Important features of Kaldi include:

Integration with Finite State Transducers: We compile against the OpenFst toolkit [5] (using it as a library).

Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines.

Extensible design: We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol. Thus the decoder could work from any suitable source of scores.

Open license: The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.

Complete recipes: We make available complete recipes for building speech recognition systems that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).

Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.

The main intended use for Kaldi is acoustic modeling research; thus, we view the closest competitors as being HTK
and the RWTH ASR toolkit (RASR). The chief advantage versus HTK is modern, flexible, cleanly structured code and better WFST and math support; also, our license terms are more open than either HTK or RASR.

Notes:

1. Kaldi's main intended use is acoustic modeling research.

2. Advantages: modern, flexible, cleanly structured code; better WFST and math support; and more open license terms.

The paper is organized as follows: we start by describing the structure of the code and design choices (section II). This is followed by describing the individual components of a speech recognition system that the toolkit supports: feature extraction (section III), acoustic modeling (section IV), phonetic decision trees (section V), language modeling (section VI), and decoders (section VIII). Finally, we provide some benchmarking results in section IX.

II. OVERVIEW OF THE TOOLKIT

We give a schematic overview of the Kaldi toolkit in figure 1. The toolkit depends on two external libraries that are
also freely available: one is OpenFst [5] for the finite-state framework, and the other is numerical algebra libraries. We use the standard “Basic Linear Algebra Subroutines” (BLAS) and “Linear Algebra PACKage” (LAPACK) routines for the latter.

Notes:

1. External libraries: OpenFst and numerical algebra libraries (BLAS/LAPACK).


The library modules can be grouped into two distinct halves, each depending on only one of the external libraries
(c.f. Figure 1). A single module, the DecodableInterface (section VIII), bridges these two halves.

Notes:

1. The DecodableInterface bridges these two halves.

Access to the library functionalities is provided through command-line tools written in C++, which are then called
from a scripting language for building and running a speech recognizer. Each tool has very specific functionality with a small set of command line arguments: for example, there are separate executables for accumulating statistics, summing accumulators, and updating a GMM-based acoustic model using maximum likelihood estimation. Moreover, all the tools can read from and write to pipes which makes it easy to chain together different tools.

To avoid “code rot”, we have tried to structure the toolkit in such a way that implementing a new feature will generally involve adding new code and command-line tools rather than modifying existing ones.

Notes:

1. The library functionality is accessed through command-line tools written in C++.

III. FEATURE EXTRACTION

Our feature extraction and waveform-reading code aims to create standard MFCC and PLP features, setting reasonable defaults but leaving available the options that people are most likely to want to tweak (for example, the number of mel bins, minimum and maximum frequency cutoffs, etc.). We support most commonly used feature extraction approaches: e.g. VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.

Notes:

1. Features: MFCC and PLP.

2. Feature extraction and transformation approaches: VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.
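A sketch of tweaking the options mentioned above; the option names exist in compute-mfcc-feats, but the values and file paths here are just placeholders:

# MFCC extraction with a customized mel filterbank and frequency cutoffs
compute-mfcc-feats --num-mel-bins=40 --low-freq=20 --high-freq=7800 \
  scp:data/train/wav.scp ark,scp:mfcc/raw_mfcc.ark,mfcc/raw_mfcc.scp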

IV. ACOUSTIC MODELING

Our aim is for Kaldi to support conventional models (i.e.diagonal GMMs) and Subspace Gaussian Mixture Models
(SGMMs), but to also be easily extensible to new kinds of model.

Notes:

1. Diagonal GMMs.

2. Subspace Gaussian Mixture Models (SGMMs).

A. Gaussian mixture models

We support GMMs with diagonal and full covariance structures. Rather than representing individual Gaussian densities separately, we directly implement a GMM class that is parametrized by the natural parameters, i.e. means times inverse covariances and inverse covariances. The GMM classes also store the constant term in likelihood computation, which consist of all the terms that do not depend on the data vector. Such an implementation is suitable for efficient log-likelihood computation with simple dot-products.
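To see why the natural-parameter form turns the log-likelihood into a dot product (this is standard diagonal-Gaussian algebra written out for illustration, not copied from the Kaldi code): for one diagonal-covariance component with mean \mu and variances \sigma_d^2,

\log \mathcal{N}(x;\mu,\Sigma)
  = \underbrace{-\tfrac{1}{2}\sum_d\Big(\log 2\pi\sigma_d^2 + \frac{\mu_d^2}{\sigma_d^2}\Big)}_{\text{constant term stored with the Gaussian}}
  \;+\; \sum_d \frac{\mu_d}{\sigma_d^2}\,x_d \;-\; \tfrac{1}{2}\sum_d \frac{1}{\sigma_d^2}\,x_d^2 .

Given the precomputed constant, the data-dependent part is a dot product of the stored natural parameters (\mu_d/\sigma_d^2 and 1/\sigma_d^2) with the extended data vector (x_d, x_d^2).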

B. GMM-based acoustic model

The “acoustic model” class AmDiagGmm represents a collection of DiagGmm objects, indexed by “pdf-ids” that correspond to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e.GMMs). There are separate classes that represent the HMM structure, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs, which provide a mapping between the HMM states and the pdf index of the acoustic model class. Speaker adaptation and other linear transforms like maximum likelihood linear transform (MLLT) [6] or semi-tied covariance (STC) [7] are implemented by separate classes.

C. HMM Topology

It is possible in Kaldi to separately specify the HMM topology for each context-independent phone. The topology
format allows nonemitting states, and allows the user to pre-specify tying of the p.d.f.’s in different HMM states.

D. Speaker adaptation

We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space
adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR,
multiple transforms can be estimated using a regression tree [10]. When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pipeline. The toolkit also supports speaker normalization using a linear approximation to VTLN, similar to [11], or conventional feature-level VTLN, or a more generic approach for gender normalization which we call the “exponential transform” [12]. Both fMLLR and VTLN can be used for speaker adaptive training (SAT) of the acoustic models.

Notes:

1. Model-space adaptation uses maximum likelihood linear regression (MLLR); feature-space adaptation uses feature-space MLLR (fMLLR), also known as constrained MLLR.

E. Subspace Gaussian Mixture Models

For subspace Gaussian mixture models (SGMMs), the toolkit provides an implementation of the approach described
in [13]. There is a single class AmSgmm that represents a whole collection of pdf’s; unlike the GMM case there is no class that represents a single pdf of the SGMM. Similar to the GMM case, however, separate classes handle model estimation and speaker adaptation using fMLLR.

2014-04-02 13:19:50 u013538664

V. PHONETIC DECISION TREES

Our goals in building the phonetic decision tree code were to make it efficient for arbitrary context sizes (i.e. we avoided enumerating contexts), and also to make it general enough to support a wide range of approaches. The conventional approach is, in each HMM-state of each monophone, to have a decision tree that asks questions about, say, the left and right phones. In our framework, the decision-tree roots can be shared among the phones and among the states of the phones, and questions can be asked about any phone in the context window, and about the HMM state. Phonetic questions can be supplied based on linguistic knowledge, but in our recipes the questions are generated automatically based on a tree-clustering of the phones. Questions about things like phonetic stress (if marked in the dictionary) and word start/end information are supported via an extended phone set; in this case we share the decision-tree roots among the different versions of the same phone.

Notes:

1. The phonetic decision tree code was built to handle arbitrary context sizes efficiently and to be general enough to support a wide range of approaches.

2. Conventional approach: each HMM state of each monophone has its own decision tree.

3. Kaldi's approach: decision-tree roots can be shared among phones and among the states of those phones.

VI. LANGUAGE MODELING

Since Kaldi uses an FST-based framework, it is possible, in principle, to use any language model that can be represented as an FST. We provide tools for converting LMs in the standard ARPA format to FSTs. In our recipes, we have used the IRSTLM toolkit for purposes like LM pruning. For building LMs from raw text, users may use the IRSTLM toolkit, for which we provide installation help, or a more fully-featured toolkit such as SRILM.

Notes:

1. Kaldi is an FST-based framework, so any language model representable as an FST can be used.

2. The IRSTLM toolkit is used for tasks such as LM pruning.
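A sketch of converting a standard ARPA LM into an FST for the decoding graph. arpa2fst is the Kaldi tool for this; the options shown below reflect current versions of the tool (older recipes did the conversion through a pipeline of FST commands), and the file paths are placeholders:

arpa2fst --disambig-symbol='#0' --read-symbol-table=data/lang/words.txt \
  data/local/lm/lm.arpa data/lang/G.fst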

VII. CREATING DECODING GRAPHS

All our training and decoding algorithms use Weighted Finite State Transducers (WFSTs). In the conventional recipe [14], the input symbols on the decoding graph correspond to context-dependent states (in our toolkit, these symbols are numeric and we call them pdf-ids). However, because we allow different phones to share the same pdf-ids, we would have a number of problems with this approach, including not being able to determinize the FSTs, and not having sufficient information from the Viterbi path through an FST to work out the phone sequence or to train the transition probabilities. In order to fix these problems, we put on the input of the FSTs a slightly more fine-grained integer identifier that we call a “transition-id”, that encodes the pdf-id, the phone it is a member of, and the arc (transition) within the topology specification for that phone. There is a one-to-one mapping between the “transition-ids” and the transition-probability parameters in the model: we decided to make transitions as fine-grained as we could without increasing the size of the decoding graph.

Notes:

1. Problems with using pdf-ids directly: the FSTs cannot be determinized, and the Viterbi path through an FST does not carry enough information to recover the phone sequence or to train the transition probabilities.

2. Fix: the input symbols of the FSTs are slightly more fine-grained integer identifiers called “transition-ids”.

Our decoding-graph construction process is based on the recipe described in [14]; however, there are a number of
differences. One important one relates to the way we handle “weight-pushing”, which is the operation that is supposed to ensure that the FST is stochastic. “Stochastic” means that the weights in the FST sum to one in the appropriate sense, for each state (like a properly normalized HMM). Weight pushing may fail or may lead to bad pruning behavior if the FST representing the grammar or language model (G) is not stochastic, e.g. for backoff language models. Our approach is to avoid weight-pushing altogether, but to ensure that each stage of graph creation “preserves stochasticity” in an appropriate sense. Informally, what this means is that the “non-sum-to-one-ness” (the failure to sum to one) will never get worse than what was originally present in G.

VIII. DECODERS

We have several decoders, from simple to highly optimized; more will be added to handle things like on-the-fly language model rescoring and lattice generation. By “decoder” we mean a C++ class that implements the core decoding algorithm. The decoders do not require a particular type of acoustic model: they need an object satisfying a very simple interface with a function that provides some kind of acoustic model score for a particular (input-symbol and frame).

class DecodableInterface {
 public:
  virtual float LogLikelihood(int frame, int index) = 0;
  virtual bool IsLastFrame(int frame) = 0;
  virtual int NumIndices() = 0;
  virtual ~DecodableInterface() {}
};
Command-line decoding programs are all quite simple, do just one pass of decoding, and are all specialized for one
decoder and one acoustic-model type. Multi-pass decoding is implemented at the script level.

IX. EXPERIMENTS
We report experimental results on the Resource Management (RM) corpus and on Wall Street Journal. The results reported here correspond to version 1.0 of Kaldi; the scripts that correspond to these experiments may be found in egs/rm/s1 and egs/wsj/s1.

A. Comparison with previously published results

Table I shows the results of a context-dependent triphone system with mixture-of-Gaussian densities; the HTK baseline numbers are taken from [15] and the systems use essentially the same algorithms. The features are MFCCs with per-speaker cepstral mean subtraction. The language model is the word-pair bigram language model supplied with the RM corpus. The WERs are essentially the same. Decoding time was about 0.13×RT, measured on an Intel Xeon CPU at 2.27GHz. The system identifier for the Kaldi results is tri3c.

Table II shows similar results for the Wall Street Journal system, this time without cepstral mean subtraction. The WSJ corpus comes with bigram and trigram language models, and we compare with published numbers using the bigram language model. The baseline results are reported in [16], which we refer to as “Bell Labs” (for the authors' affiliation), and a HTK system described in [17]. The HTK system was gender-dependent (a gender-independent baseline was not reported), so the HTK results are slightly better. Our decoding time was about 0.5×RT.
B. Other experiments
Here we report some more results on both the WSJ test sets (Nov’92 and Nov’93) using systems trained on just the SI-84 part of the training data, that demonstrate different features that are supported by Kaldi. We also report results on the RM task, averaged over 6 test sets: the 4 mentioned in table I together with Mar’87 and Oct’87. The best result for a conventional GMM system is achieved by a SAT system that splices 9 frames (4 on each side of the current frame) and uses LDA to project down to 40 dimensions, together with MLLT. We achieve better performance on average, with an SGMM system trained on the same features, with speaker vectors and fMLLR adaptation. The last line, with the best results, includes the “exponential transform” [12] in the features.


X. CONCLUSIONS
We described the design of Kaldi, a free and open-source speech recognition toolkit. The toolkit currently supports modeling of context-dependent phones of arbitrary context lengths, and all commonly used techniques that can be estimated using maximum likelihood. It also supports the recently proposed SGMMs. Development of Kaldi is continuing and we are working on using large language models in the FST framework, lattice generation and discriminative training.

REFERENCES

[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4). Cambridge University Engineering Department, 2009.

[2] A. Lee, T. Kawahara, and K. Shikano, “Julius – an open source real-time large vocabulary recognition engine,” in EUROSPEECH, 2001, pp. 1691–1694.

[3] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, “Sphinx-4: A flexible open source framework for speech recognition,” Sun Microsystems Inc., Technical Report SML1 TR2004-0811, 2004.

[4] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, “The RWTH Aachen University Open Source Speech Recognition System,” in INTERSPEECH, 2009, pp. 2111–2114.

[5] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: a general and efficient weighted finite-state transducer library,” in Proc. CIAA, 2007.

[6] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classification,” in Proc. IEEE ICASSP, vol. 2, 1998, pp. 661–664.

[7] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech and Audio Proc., vol. 7, no. 3, pp. 272–281, May 1999.

[8] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9, no. 2, pp. 171–185, 1995.

[9] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998.

[10] M. J. F. Gales, “The generation and use of regression class trees for MLLR adaptation,” Cambridge University Engineering Department, Technical Report CUED/F-INFENG/TR.263, August 1996.

[11] D. Y. Kim, S. Umesh, M. J. F. Gales, T. Hain, and P. C. Woodland, “Using VTLN for broadcast news transcription,” in Proc. ICSLP, 2004, pp. 1953–1956.

[12] D. Povey, G. Zweig, and A. Acero, “The Exponential Transform as a generic substitute for VTLN,” in IEEE ASRU, 2011.

[13] D. Povey, L. Burget et al., “The subspace Gaussian mixture model—A structured model for speech recognition,” Computer Speech & Language, vol. 25, no. 2, pp. 404–439, April 2011.

[14] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech and Language, vol. 20, no. 1, pp. 69–88, 2002.

[15] D. Povey and P. C. Woodland, “Frame discrimination training for HMMs for large vocabulary speech recognition,” in Proc. IEEE ICASSP, vol. 1, 1999, pp. 333–336.

[16] W. Reichl and W. Chou, “Robust decision tree state tying for continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 555–566, September 2000.

[17] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, “Large vocabulary continuous speech recognition using HTK,” in Proc. IEEE ICASSP, vol. 2, 1994, pp. II/125–II/128.

2019-04-19 19:51:26 weixin_44461682

A note up front:
I am currently a graduate student and still a beginner; my research directions are speech recognition and singing/speech synthesis. Here I record my own process of learning this software, so please go easy on me.

What is Kaldi?

Kaldi is a speech recognition toolkit written in C++, intended for use by speech recognition researchers. It can also be used for speaker recognition. For a detailed introduction, see the official Kaldi documentation.

Kaldi is a very powerful speech recognition library, developed and maintained mainly by Daniel Povey. It currently supports training and inference for many kinds of speech recognition models, such as GMM-HMM, SGMM-HMM, and DNN-HMM. The neural network in a DNN-HMM system can be customized via configuration files, and architectures such as DNN, CNN, TDNN, LSTM, and bidirectional LSTM are all supported.

Kaldi and Mandarin speech recognition

the thchs30 dataset released by Tsinghua University (many thanks to CSLT)
the CVTE Mandarin Model released by CVTE
the aishell dataset released by Beijing Shell Shell Technology

Building and installing Kaldi

Note: to speed up training, Kaldi is best installed on a GPU cloud server. If you do not have a server, a virtual machine should also work, but be sure to allocate enough memory and disk space. Below I describe the build and installation using my own CentOS server as an example.

The build and installation consist of roughly three steps:

1. install git and download the Kaldi source code
2. install the packages needed for compilation
3. configure and compile Kaldi

  • Download and installation

As with other open-source software, first clone the code from GitHub.
The project is still very active on GitHub; the code can be downloaded at https://github.com/kaldi-asr/kaldi and the documentation is at http://kaldi-asr.org/.

$ git clone https://github.com/kaldi-asr/kaldi

After cloning, follow the instructions in the INSTALL file: first complete the build in the tools directory, and only then build the contents of src. So, go to the tools directory first:

$ cd kaldi/tools

There is another INSTALL file inside tools; we follow its instructions step by step.
First, run the extras/check_dependencies.sh script to check whether the required dependencies are present and correctly configured.

$  extras/check_dependencies.sh
extras/check_dependencies.sh: python3 is not installed.
extras/check_dependencies.sh: we recommend that you run (our best guess):
  sudo yum install python3

The output here will differ from machine to machine.
The shell script tells you which packages the system still needs; install them as prompted (for example with the suggested sudo yum install python3; there are several ways to get Python 3, so pick whichever suits you).
In my case I installed Anaconda 2, activated an environment with source activate kaldi, and then re-ran extras/check_dependencies.sh to verify.
Once everything is installed, run the script again; if it reports OK, you can start compiling Kaldi.

  • Building tools

First build tools: in the kaldi/tools directory, run

make

Running make -j 8 instead uses 8 cores in parallel and speeds up the build.

Then switch to the kaldi/src directory:

cd ../src

In the src directory, first run configure. Since everyone's CUDA version and installation path may differ, you may need to adjust things for your own server:

./configure 

If, like me, configure complains that your CUDA version is not supported, you can open the configure script in the src directory and adjust the CUDA version check by hand:

cd
cd kaldi/src
vim configure

Press a to enter insert mode and make the change, press Esc to leave insert mode, then (with an English keyboard layout) type :wq to save and quit; that completes the CUDA version change.
Then run ./configure again.
Once configuration succeeds, you can build src.
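If you do not plan to use a GPU at all, an alternative to editing the script (a sketch; --use-cuda is an option of Kaldi's src/configure, but check ./configure --help for your version) is simply to disable CUDA:

./configure --shared --use-cuda=no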

  • Building src

    make depend -j 8
    make -j 8

As before, if you have a multi-core CPU you can use make depend -j 8 and make -j 8 to speed up the build.

When it finishes it reports success or failure; once make completes, the Kaldi build and installation are done.

  • Verifying the installation

You can run the yesno example to check whether the installation succeeded:

cd
cd kaldi/egs/yesno/s5
./run.sh

If the recipe runs through and prints the final word error rate, the installation succeeded. Congratulations!
