  • deep speech2

    2018-07-16 09:38:32
    Baidu's second-generation speech recognition system, built on the PaddlePaddle platform and again using CTC for end-to-end recognition.
  • deepspeech2

    2019-08-24 15:49:46

    Copyright notice: this is an original article by CSDN blogger hw200855, licensed under the CC 4.0 BY-SA agreement. Please attach the original source link and this notice when reposting.
    Original link: https://blog.csdn.net/hw200855/article/details/89639304

    Code: https://github.com/SeanNaren/deepspeech.pytorch

    The Chinese speech corpus is THCHS-30.

    (1) First, extract the .trn transcripts under the data directory and build a character vocabulary (including the space character), saved in JSON format as lexicon.json. It is a dictionary of Chinese characters, not pinyin. I was stuck on this step for a long time before discovering that data_loader can only read single characters, so the vocabulary for Chinese recognition is the set of distinct characters appearing in the transcripts. A sketch of this step follows.
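
    A minimal sketch of step (1), assuming each THCHS-30 .trn file carries the character transcript on its first line; the glob pattern is illustrative, while the lexicon.json name comes from the post:

    import glob
    import json

    chars = set(' ')  # the vocabulary includes the space character
    for trn in glob.glob('data/*.trn'):
        with open(trn, encoding='utf-8') as f:
            transcript = f.readline().strip()
        chars.update(transcript.replace(' ', ''))

    with open('lexicon.json', 'w', encoding='utf-8') as f:
        json.dump(sorted(chars), f, ensure_ascii=False)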

    (2) Generate the manifest files train.csv, dev.csv, and test.csv, each line containing a wav path and the path of the corresponding .trn transcript (sketched below).
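
    A sketch of step (2), assuming the standard THCHS-30 layout where every X.wav sits next to an X.wav.trn transcript; the directory names are illustrative:

    import glob

    def write_manifest(wav_dir, out_csv):
        # one "wav_path,trn_path" pair per line, as deepspeech.pytorch expects
        with open(out_csv, 'w', encoding='utf-8') as f:
            for wav in sorted(glob.glob(wav_dir + '/*.wav')):
                f.write('{},{}\n'.format(wav, wav + '.trn'))

    write_manifest('data_thchs30/train', 'train.csv')
    write_manifest('data_thchs30/dev', 'dev.csv')
    write_manifest('data_thchs30/test', 'test.csv')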

    (3) Set these three train.py arguments, which point at the training set, the validation set, and the vocabulary file respectively (an example invocation follows the list):

    '--train-manifest'

    '--val-manifest'

    '--labels-path'
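
    Putting the three flags together, the invocation would look roughly like this (file names follow the steps above):

    python train.py --train-manifest train.csv --val-manifest dev.csv --labels-path lexicon.json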

    (4) data_loader.py reads the transcripts with spaces separating words, which works very poorly in practice: the loss never comes down. Following Deep Speech v1, convert the transcripts to character-plus-space format.

    Add two lines where the transcript is read (around line 165) to get transcripts of the form "char char char ..."; for example, "今天 天气" becomes "今 天 天 气".

    # strip the original word-level spaces
    transcript = transcript.replace(' ', '')
    # re-insert a space after every character
    transcript = ''.join([ch + ' ' for ch in transcript])
    

    (5) Train. After 30 epochs, WER on the validation set drops to about 5% and CER to about 2.5%; on the test set, WER is 50% and CER 25%.

    Analyzing the THCHS-30 dataset shows that the transcripts contain only 1000 sentences, of which the training set covers 750 and the test set 250, and the validation set reuses sentences from the training set. This explains why recognition is excellent on the validation set and very poor on the test set: with so few samples, the model overfits during training.

    Next, the AISHELL dataset will be used to test DeepSpeech v2's performance further.

    Code for generating the THCHS-30 character vocabulary and manifest files:

    Link: https://pan.baidu.com/s/1GUnsLbVweDrnZnmYdssMYg
    Extraction code: y38d

  • deepspeech v2

    2020-04-25 11:14:13
    https://blog.csdn.net/qq_27842551/article/details/100054007
  • An implementation of DeepSpeech2 for PyTorch. It supports training/testing and inference with the model; optionally, a language model can be used at inference time. Installation: several libraries need to be installed for training to work. I will assume everything is installed into an Anaconda installation on Ubuntu, with ...
  • DeepSpeech2 详解

    2019-07-19 12:13:04

    Paper: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
    Paper link: https://arxiv.org/pdf/1512.02595.pdf
    TensorFlow version: https://github.com/mozilla/DeepSpeech
    PyTorch version: http://www.github.com/SeanNaren/deepspeech.pytorch

    I. Paper Overview

    Reference: https://blog.csdn.net/Code_Mart/article/details/87291644

    1. Abstract
    2. Model architecture
    3. Highlights

    1. Abstract

    This paper was published by Baidu AI Lab in 2015 and continues the direction of its predecessor: abandon the complex traditional pipeline and embrace an end-to-end neural-network model. The paper has three highlights: 1. the trained model recognizes both English and Mandarin speech; 2. the authors apply HPC (High-Performance Computing) techniques, which greatly improve overall system performance (training speed rises substantially as a result, which enables the third point); 3. the authors make extensive modifications and experiments on top of Deep Speech: deeper networks, trials of (bi-directional) vanilla RNNs and GRUs, 1D/2D invariant convolutions, and Batch Normalization.

    2. Model architecture

    model input: spectrogram of power-normalized audio clips as the features
    model output: 1. English: English characters or the blank symbol; 2. Mandarin: simplified Chinese characters
    loss function: CTC loss
    Authors: 'We report Word Error Rate (WER) for the English system and Character Error Rate (CER) for the Mandarin system'
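
    As an illustration of the model input, a rough sketch of a power-normalized spectrogram in plain numpy; n_fft=320 at a 16 kHz sample rate gives 161 frequency bins (the linear-feature size these DeepSpeech2 ports use), but every parameter value here is an assumption:

    import numpy as np

    def power_spectrogram(samples, n_fft=320, hop=160):
        # frame the waveform and apply a Hann window
        frames = np.lib.stride_tricks.sliding_window_view(samples, n_fft)[::hop]
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
        spec = np.log1p(spec)
        # per-utterance power normalization
        spec = (spec - spec.mean()) / (spec.std() + 1e-8)
        return spec.T  # [D=161, T]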

    The model is composed of three parts: a Conv layer stack, a Recurrent layer stack, and an FC layer, in that order (architecture figures omitted). On top of this, the authors add Batch Normalization to the model (figure omitted).

    3. Highlights

    1. Data parallelism
    2. The CTC loss function
    3. HPC

    High-performance computing (HPC) refers to computing systems and environments that use many processors (as part of a single machine) or several computers organized into a cluster (operated as a single computing resource).

    1. Cold Fusion (Deep Speech v3)

    Baidu developed Cold Fusion, which uses a pre-trained language model while training a Seq2Seq model. Baidu's paper shows that a Seq2Seq model with Cold Fusion makes better use of linguistic information, giving better generalization and faster convergence, and that it can be fully transferred to a new domain with less than 10% of the labeled training data. Cold Fusion can also switch between different language models at test time to optimize for any content. And although Cold Fusion is applied to Seq2Seq models, it should work just as well on RNN transducers.
    Related papers:
    1. Exploring Neural Transducers for End-to-End Speech Recognition https://arxiv.org/abs/1707.07413
    2. Cold Fusion: Training Seq2Seq Models Together with Language Model https://arxiv.org/abs/1708.06426
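
    A rough sketch of the fine-grained gating mechanism described in the Cold Fusion paper (layer names and sizes are illustrative, not the paper's code): the LM logits are projected, gated against the decoder state, and the fused vector produces the output distribution.

    import torch
    from torch import nn

    class ColdFusion(nn.Module):
        def __init__(self, dec_size, lm_vocab, fused_size, out_vocab):
            super().__init__()
            self.lm_proj = nn.Linear(lm_vocab, fused_size)            # h_LM = DNN(l_LM)
            self.gate = nn.Linear(dec_size + fused_size, fused_size)  # g = sigma(W[s; h_LM])
            self.out = nn.Linear(dec_size + fused_size, out_vocab)

        def forward(self, dec_state, lm_logits):
            h_lm = self.lm_proj(lm_logits)
            g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
            fused = torch.cat([dec_state, g * h_lm], dim=-1)  # [s; g * h_LM]
            return torch.log_softmax(self.out(fused), dim=-1)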

    II. Environment Setup

    deepspeech.pytorch is the PyTorch version written by SeanNaren, who also contributed the PyTorch binding of Warp-CTC; it is one of the most widely used CTCLoss implementations (log_softmax is applied inside the loss function).
    GitHub: https://github.com/SeanNaren/deepspeech.pytorch

    1. Install Warp-CTC

    git clone https://github.com/SeanNaren/warp-ctc.git
    cd warp-ctc; mkdir build; cd build; cmake ..; make
    cd ../pytorch_binding && python setup.py install
    

    Notes (a usage sketch follows):

    1. cmake is required.
    2. If import warpctc_pytorch fails right after installation, close the terminal and reopen it.
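
    A minimal usage sketch of the binding, assuming the module name warpctc_pytorch and its usual calling convention (activations [T, B, C] without log_softmax, since the loss applies it internally; targets are a flattened 1-D int tensor):

    import torch
    from warpctc_pytorch import CTCLoss

    ctc_loss = CTCLoss()
    acts = torch.randn(50, 2, 20, requires_grad=True)        # [T, B, C] raw activations
    act_lens = torch.tensor([50, 50], dtype=torch.int)       # valid frames per sample
    labels = torch.tensor([3, 1, 4, 1, 5], dtype=torch.int)  # both targets, concatenated
    label_lens = torch.tensor([3, 2], dtype=torch.int)       # split: [3,1,4] and [1,5]
    loss = ctc_loss(acts, labels, act_lens, label_lens)
    loss.backward()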

    2. Install pytorch audio

    sudo apt-get install sox libsox-dev libsox-fmt-all
    git clone https://github.com/pytorch/audio.git
    cd audio && python setup.py install
    

    Notes:

    1. This is only a library for reading audio; many similar libraries, such as librosa or scipy, provide the same functionality. It failed to install for me, so I skipped it.
    2. If you choose not to install it, just modify data/data_loader.py (see the sketch below).
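
    If you skip torchaudio, here is a sketch of a drop-in loader built on scipy; the function name load_audio mirrors what data/data_loader.py needs, but the exact integration is left to the reader:

    import numpy as np
    from scipy.io import wavfile

    def load_audio(path):
        sample_rate, sound = wavfile.read(path)
        sound = sound.astype('float32') / 32767.0  # int16 -> [-1, 1]
        if sound.ndim > 1:
            sound = sound.mean(axis=1)  # mix multi-channel down to mono
        return sound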

    3. Install NVIDIA apex

    git clone --recursive https://github.com/NVIDIA/apex.git
    cd apex && pip install .
    

    Notes:

    1. All sorts of problems can come up during installation; check the apex repository for the exact installation instructions.
    2. apex enables FP16 half-precision computation; skip it if you don't need that (PyTorch-specific; faster, and GPU memory use is roughly halved).

    4. Beam search support

    git clone --recursive https://github.com/parlance/ctcdecode.git
    cd ctcdecode && pip install .
    

    Notes:

    1. The decoder has two modes, greedy search and beam search;
      greedy search can be seen as a special case of beam search (beam width 1).
    2. Both algorithms find a local optimum. The global optimum is usually found with the Viterbi algorithm, which does not apply here; beam search can be viewed as a simplified, pruned version of Viterbi. No more detail here, but a conceptual greedy decoder is sketched below.
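
    For intuition, a conceptual greedy (best-path) CTC decoder, not the ctcdecode API: take the argmax at every frame, collapse consecutive repeats, then drop blanks.

    import torch

    def ctc_greedy_decode(probs, blank=0):
        # probs: [T, C] frame-level posteriors for one utterance
        best = probs.argmax(dim=-1).tolist()
        out, prev = [], blank
        for idx in best:
            if idx != prev and idx != blank:
                out.append(idx)
            prev = idx
        return out  # token ids; map through the vocabulary to get text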

    III. Example Code

    1. Data preparation

    Four datasets are currently supported, AN4 (100 MB), TED-LIUM (30 GB), Voxforge (15 GB), and LibriSpeech (100 GB), and the preprocessing scripts are already written.

    1.1 AN4

    cd data
    python3 an4.py 
    

    This automatically downloads the dataset and creates an4_train_manifest.csv (the file pairing each wav with its txt).

    1.2 Voxforge

    cd data
    python3 voxforge.py --target-dir DIR
    

    Since the files are large, point DIR at your own dataset storage path (I initially left the default, the current directory, and it nearly filled my disk).

    1.3 TEDLIUM

    cd data
    python3 ted.py
    

    1.4 LibriSpeech

    cd data/
    mkdir LibriSpeech/ # This can be anything as long as you specify the directory path as --target-dir when running the librispeech.py script
    mkdir LibriSpeech/val/
    mkdir LibriSpeech/test/
    mkdir LibriSpeech/train/
    python librispeech.py --files-to-use "train-clean-100.tar.gz, train-clean-360.tar.gz,train-other-500.tar.gz, dev-clean.tar.gz,dev-other.tar.gz, test-clean.tar.gz,test-other.tar.gz"
    

    1.5 Custom datasets

    /path/to/audio.wav,/path/to/text.txt
    /path/to/audio2.wav,/path/to/text2.txt
    

    Create a CSV file pairing each wav with its txt transcript, one pair per line as above.

    2. Training the model

    python3 train.py
    

    --train-manifest   # required=True, path to the training CSV
    --val-manifest     # required=True, path to the validation CSV
    --tensorboard --log-dir   # TensorBoard visualization and its log directory
    --cuda             # train on the GPU
    --mixed-precision  # FP16 half-precision computation, requires apex
    --visdom           # an alternative visualization
    --checkpoint       # save checkpoints during training
    --checkpoint-per-batch   # checkpoint interval in batches; default 0 (disabled)
    --epochs           # number of epochs, default 70
    --batch-size       # batch size, default 20
    --save-folder      # folder in which to save the model after every epoch
    --model-path       # path at which to save the best model
    There are further flags; see the parser section of train.py. An example invocation is sketched below.
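
    A sketch of a full invocation combining several of the flags above (all paths are illustrative):

    python3 train.py --train-manifest data/train_manifest.csv --val-manifest data/val_manifest.csv \
        --cuda --epochs 70 --batch-size 32 --checkpoint \
        --save-folder models/ --model-path models/deepspeech_final.pth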

    3. Training results

    3.1 AN4

    (figure omitted)

    3.2 Voxforge

    Results after two epochs, with a brief note on the effect of batch size
    (figure omitted)

    1. Batch size affects training speed: with batch=128 one epoch takes 1560 s, with batch=32 one epoch takes 1736 s.
    2. With batch-size = 32, WER, CER, and the loss all decrease normally.
    3. With batch-size = 128, the loss and CER decrease normally, but WER falls slowly and stops improving at around 80.

    4. TODO

    1. The remaining two datasets are still downloading; tests will follow once they are done.
    2. The next post will walk through the code of each module in detail.

  • Papers on Baidu's second-generation speech recognition system, Deep Speech 2; related code to be uploaded later.
  • DeepSpeech2 model code in PyTorch (the net_test, mask, rnn, conv, act, and deepspeechv2 modules below)

    net_test

    import numpy as np
    import torch

    from deepspeech.models.deepspeech2 import DeepSpeech2Model

    if __name__ == '__main__':
        batch_size = 2
        feat_dim = 161  # D: number of spectrogram feature bins
        max_len = 100   # T: padded number of frames

        # random features [B, D, T] with valid lengths [B]
        audio = np.random.randn(batch_size, feat_dim, max_len)
        # keep lengths comfortably above the target length for the CTC loss
        audio_len = np.random.randint(10, max_len, size=batch_size, dtype='int32')
        audio_len[-1] = max_len  # at least one sample spans the padded length
        text = np.array([[1, 2], [1, 2]], dtype='int32')  # token ids [B, L]
        text_len = np.array([2] * batch_size, dtype='int32')

        audio = torch.Tensor(audio)
        audio_len = torch.Tensor(audio_len)
        text = torch.Tensor(text)
        text_len = torch.Tensor(text_len)

        print(audio.shape)
        print(audio_len.shape)
        print(text.shape)
        print(text_len.shape)
        print("-----------------")

        # exercise all four (use_gru, share_rnn_weights) combinations
        for use_gru in (False, True):
            for share_rnn_weights in (False, True):
                model = DeepSpeech2Model(
                    feat_size=feat_dim,
                    dict_size=10,
                    num_conv_layers=2,
                    num_rnn_layers=3,
                    rnn_size=1024,
                    use_gru=use_gru,
                    share_rnn_weights=share_rnn_weights)
                loss, probs, out_lens = model(audio, text, audio_len, text_len)
                print('probs.shape', probs.shape)
                print("-----------------")
    
    

    mask

    
    # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.

    import logging

    import torch

    logger = logging.getLogger(__name__)

    __all__ = ['sequence_mask']


    def sequence_mask(x_len, max_len=None):
        """Boolean padding mask of shape [B, T]: True inside each valid length."""
        if max_len is None:
            max_len = int(x_len.max())
        x_len = torch.unsqueeze(x_len, -1)  # [B, 1]
        row_vector = torch.arange(max_len, device=x_len.device)  # [T]
        # broadcasting [T] against [B, 1] yields [B, T]
        mask = row_vector < x_len
        return mask
    
    

    rnn

    # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    
    import logging
    
    import torch
    from torch import nn
    
    from deepspeech.modules.mask import sequence_mask
    
    logger = logging.getLogger(__name__)
    
    __all__ = ['RNNStack']
    
    
    
    class BiRNNWithBN(nn.Module):
        """Bidirectional simple RNN layer with sequence-wise batch normalization.
        The batch normalization is only performed on the input-state projection.

        :param i_size: Dimension of the input features.
        :type i_size: int
        :param h_size: Dimension of the RNN cells.
        :type h_size: int
        :param share_weights: Whether to share the input-hidden weights between
                              the forward and backward directional RNNs.
        :type share_weights: bool
        """

        def __init__(self, i_size, h_size, share_weights):
            super().__init__()
            self.share_weights = share_weights
            if self.share_weights:
                # input-hidden projection shared between the two directions
                self.fw_fc = nn.Linear(i_size, h_size)
                # batch norm is only performed on the input-state projection
                self.fw_bn = nn.BatchNorm1d(h_size)
                self.bw_fc = self.fw_fc
                self.bw_bn = self.fw_bn
            else:
                self.fw_fc = nn.Linear(i_size, h_size)
                self.bw_fc = nn.Linear(i_size, h_size)
                self.fw_bn = nn.BatchNorm1d(h_size)
                self.bw_bn = nn.BatchNorm1d(h_size)

            self.fw_rnn = nn.RNN(h_size, h_size, nonlinearity="tanh", batch_first=True)
            self.bw_rnn = nn.RNN(h_size, h_size, nonlinearity="tanh", batch_first=True)

        def forward(self, x, x_len):
            # x: [B, T, D]
            # project, then batch-normalize over the feature dim
            # (BatchNorm1d expects [B, H, T], so transpose around it)
            fw_x = self.fw_bn(self.fw_fc(x).transpose(1, 2)).transpose(1, 2)
            bw_x = self.bw_bn(self.bw_fc(x).transpose(1, 2)).transpose(1, 2)
            fw_x, _ = self.fw_rnn(fw_x)
            # the backward direction runs on the time-reversed sequence
            bw_x, _ = self.bw_rnn(bw_x.flip(1))
            bw_x = bw_x.flip(1)
            x = torch.cat([fw_x, bw_x], dim=-1)
            return x, x_len
    
    
    class BiGRUWithBN(nn.Module):
        """Bidirectional GRU layer with sequence-wise batch normalization.
        The batch normalization is only performed on the input-state projection.

        :param i_size: Dimension of the input features.
        :type i_size: int
        :param h_size: Dimension of the GRU cells.
        :type h_size: int
        :param act: Activation type (kept for interface parity; nn.GRU uses its
                    own fixed gate activations, so this argument is unused).
        :type act: string
        """

        def __init__(self, i_size, h_size, act):
            super().__init__()
            hidden_size = h_size * 3  # one block per GRU gate, as in the DS2 projection

            self.fw_fc = nn.Linear(i_size, hidden_size, bias=True)
            self.fw_bn = nn.BatchNorm1d(hidden_size)
            self.bw_fc = nn.Linear(i_size, hidden_size, bias=True)
            self.bw_bn = nn.BatchNorm1d(hidden_size)

            self.fw_rnn = nn.GRU(input_size=hidden_size, hidden_size=h_size, batch_first=True)
            self.bw_rnn = nn.GRU(input_size=hidden_size, hidden_size=h_size, batch_first=True)

        def forward(self, x, x_len):
            # x: [B, T, D]
            fw_x = self.fw_bn(self.fw_fc(x).transpose(1, 2)).transpose(1, 2)
            bw_x = self.bw_bn(self.bw_fc(x).transpose(1, 2)).transpose(1, 2)
            fw_x, _ = self.fw_rnn(fw_x)
            # the backward direction runs on the time-reversed sequence
            bw_x, _ = self.bw_rnn(bw_x.flip(1))
            bw_x = bw_x.flip(1)
            x = torch.cat([fw_x, bw_x], dim=-1)

            return x, x_len
    
    
    class RNNStack(nn.Module):
        """RNN group with stacked bidirectional simple RNN or GRU layers.

        :param i_size: Dimension of the input features.
        :type i_size: int
        :param h_size: Dimension of the RNN cells in each layer.
        :type h_size: int
        :param num_stacks: Number of stacked rnn layers.
        :type num_stacks: int
        :param use_gru: Use gru if set True. Use simple rnn if set False.
        :type use_gru: bool
        :param share_rnn_weights: Whether to share input-hidden weights between
                                  forward and backward directional RNNs.
                                  It is only available when use_gru=False.
        :type share_rnn_weights: bool
        """

        def __init__(self, i_size, h_size, num_stacks, use_gru, share_rnn_weights):
            super().__init__()
            self.rnn_stacks = nn.ModuleList()
            for i in range(num_stacks):
                if use_gru:
                    self.rnn_stacks.append(
                        BiGRUWithBN(i_size=i_size, h_size=h_size, act="relu"))
                else:
                    self.rnn_stacks.append(
                        BiRNNWithBN(
                            i_size=i_size,
                            h_size=h_size,
                            share_weights=share_rnn_weights))
                i_size = h_size * 2  # each layer outputs both directions

        def forward(self, x, x_len):
            """
            x: shape [B, T, D]
            x_len: shape [B]
            """
            for i, rnn in enumerate(self.rnn_stacks):
                x, x_len = rnn(x, x_len)
                # zero out the padding part after every layer
                masks = sequence_mask(x_len, max_len=x.shape[1])  # [B, T]
                masks = masks.unsqueeze(-1)  # [B, T, 1]
                x = x * masks.to(x.device)
            return x, x_len
    
    

    conv

    # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    import logging
    
    
    from torch import nn
    from torch.nn import functional as F
    
    import torch
    from deepspeech.modules.mask import sequence_mask
    from deepspeech.modules.activation import brelu
    
    logger = logging.getLogger(__name__)
    
    __all__ = ['ConvStack']
    
    
    class ConvBn(nn.Module):
        """Convolution layer with batch normalization.

        :param kernel_size: The kernel size, either an int or a tuple covering
                            the two image dimensions (D, T).
        :type kernel_size: int|tuple|list
        :param num_channels_in: Number of input channels.
        :type num_channels_in: int
        :param num_channels_out: Number of output channels.
        :type num_channels_out: int
        :param stride: The stride, either an int or a tuple covering the two
                       image dimensions.
        :type stride: int|tuple|list
        :param padding: The padding, either an int or a tuple covering the two
                        image dimensions.
        :type padding: int|tuple|list
        :param act: Activation type, relu|brelu
        :type act: string
        """
    
        def __init__(self, num_channels_in, num_channels_out, kernel_size, stride,
                     padding, act):
    
            super().__init__()
            assert len(kernel_size) == 2
            assert len(stride) == 2
            assert len(padding) == 2
            self.kernel_size = kernel_size
            self.stride = stride
            self.padding = padding
    
            self.conv = nn.Conv2d(
                num_channels_in,
                num_channels_out,
                kernel_size=kernel_size,
                stride=stride,
                padding=padding)
    
            self.bn = nn.BatchNorm2d(
                num_channels_out)
    
            if act == 'relu':
                self.act = F.relu
            else:
                self.act= brelu
    
        def forward(self, x, x_len):
            """
            x (Tensor): audio, shape [B, C, D, T]
            x_len (Tensor): valid lengths along T, shape [B]
            """
            x = self.conv(x)
            x = self.bn(x)
            x = self.act(x)

            # output length along T after this convolution
            x_len = (x_len - self.kernel_size[1] + 2 * self.padding[1]
                     ) // self.stride[1] + 1

            # reset the padding part to 0
            masks = sequence_mask(x_len, max_len=x.shape[3])  # [B, T]
            masks = masks.unsqueeze(1).unsqueeze(1)  # [B, 1, 1, T]
            x = x * masks.to(x.device)

            return x, x_len
    
    
    class ConvStack(nn.Module):
        """Convolution group with stacked convolution layers.
    
        :param feat_size: audio feature dim.
        :type feat_size: int
        :param num_stacks: Number of stacked convolution layers.
        :type num_stacks: int
        """
    
        def __init__(self, feat_size, num_stacks):
            super().__init__()
            self.feat_size = feat_size  # D
            self.num_stacks = num_stacks
    
            self.conv_in = ConvBn(
                num_channels_in=1,
                num_channels_out=32,
                kernel_size=(41, 11),  #[D, T]
                stride=(2, 3),
                padding=(20, 5),
                act='brelu')
    
            out_channel = 32
            self.conv_stack = nn.ModuleList([
                ConvBn(
                    num_channels_in=32,
                    num_channels_out=out_channel,
                    kernel_size=(21, 11),
                    stride=(2, 1),
                    padding=(10, 5),
                    act='brelu') for i in range(num_stacks - 1)
            ])
    
            # conv output feat_dim
            output_height = (feat_size - 1) // 2 + 1
            for i in range(self.num_stacks - 1):
                output_height = (output_height - 1) // 2 + 1
            self.output_height = out_channel * output_height
    
        def forward(self, x, x_len):
            """
            x: shape [B, C, D, T]
            x_len : shape [B]
            """
            x, x_len = self.conv_in(x, x_len)
            for i, conv in enumerate(self.conv_stack):
                x, x_len = conv(x, x_len)
            return x, x_len
    
    

    act

    # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    import logging
    
    import math
    
    import torch
    from torch import nn
    from torch.nn import functional as F
    
    
    logger = logging.getLogger(__name__)
    
    __all__ = ['brelu', "softplus", "gelu_accurate", "gelu", 'Swish']
    
    
    def brelu(x, t_min=0.0, t_max=24.0, name=None):
        """Bounded ReLU: clip x into [t_min, t_max]."""
        # build the bounds on the same device/dtype as x
        # (torch.to_tensor is dygraph-only and cannot work under JIT)
        t_min = torch.full([1], t_min, dtype=x.dtype, device=x.device)
        t_max = torch.full([1], t_max, dtype=x.dtype, device=x.device)
        return x.maximum(t_min).minimum(t_max)
    
    
    def softplus(x):
        """Softplus function."""
        if hasattr(torch.nn.functional, 'softplus'):
            #return torch.nn.functional.softplus(x.float()).type_as(x)
            return torch.nn.functional.softplus(x)
        else:
            raise NotImplementedError
    
    
    def gelu_accurate(x):
        """Gaussian Error Linear Units (GELU) activation."""
        # [reference] https://github.com/pytorch/fairseq/blob/e75cff5f2c1d62f12dc911e0bf420025eb1a4e33/fairseq/modules/gelu.py
        if not hasattr(gelu_accurate, "_a"):
            gelu_accurate._a = math.sqrt(2 / math.pi)
        return 0.5 * x * (1 + torch.tanh(gelu_accurate._a *
                                          (x + 0.044715 * torch.pow(x, 3))))
    
    
    def gelu(x):
        """Gaussian Error Linear Units (GELU) activation."""
        if hasattr(torch.nn.functional, 'gelu'):
            #return torch.nn.functional.gelu(x.float()).type_as(x)
            return torch.nn.functional.gelu(x)
        else:
            return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    
    
    class Swish(nn.Module):
        """Construct a Swish object."""

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            """Return the Swish activation: x * sigmoid(x)."""
            return x * torch.sigmoid(x)
    
    

    deepspeechv2

    
    
    import torch
    from torch import nn
    from deepspeech.modules.conv import ConvStack
    from deepspeech.modules.rnn import RNNStack
    from deepspeech.modules.ctc import CTCDecoder
    from deepspeech.utils import checkpoint
    from deepspeech.utils import layer_tools
    
    __all__ = ['DeepSpeech2Model']
    
    
    class CRNNEncoder(nn.Module):
        def __init__(self,
                     feat_size,
                     dict_size,
                     num_conv_layers=2,
                     num_rnn_layers=3,
                     rnn_size=1024,
                     use_gru=False,
                     share_rnn_weights=True):
            super().__init__()
            self.rnn_size = rnn_size
            self.feat_size = feat_size  # 161 for linear
            self.dict_size = dict_size
    
            self.conv = ConvStack(feat_size, num_conv_layers)
    
            i_size = self.conv.output_height  # H after conv stack
            self.rnn = RNNStack(
                i_size=i_size,
                h_size=rnn_size,
                num_stacks=num_rnn_layers,
                use_gru=use_gru,
                share_rnn_weights=share_rnn_weights)
        self.rnn_bn = nn.BatchNorm1d(rnn_size * 2)
    
        @property
        def output_size(self):
            return self.rnn_size * 2
    
        def forward(self, audio, audio_len):
            """Compute encoder outputs.

            Args:
                audio (Tensor): [B, D, T]
                audio_len (Tensor): [B]
            Returns:
                x (Tensor): encoder outputs, [B, T, D]
                x_lens (Tensor): encoder output lengths, [B]
            """
            # [B, D, T] -> [B, C=1, D, T]
            x = audio.unsqueeze(1)
            x_lens = audio_len
    
            # convolution group
            x, x_lens = self.conv(x, x_lens)

            # convert the convolution feature map to a sequence of vectors:
            # [B, C, D, T] -> [B, T, D, C] -> [B, T, D*C]
            # (transpose + reshape rather than view, which would scramble the data)
            x = x.transpose(1, 3)
            x = x.reshape([x.shape[0], x.shape[1], -1])

            # recurrent group (padding is masked inside)
            x, x_lens = self.rnn(x, x_lens)  # [B, T, D]
            # batch-normalize the RNN output over the feature dim
            x = self.rnn_bn(x.transpose(1, 2)).transpose(1, 2)

            return x, x_lens
    
    
    class DeepSpeech2Model(nn.Module):
        """The DeepSpeech2 network structure.

        :param feat_size: Dimension of the input audio features.
        :type feat_size: int
        :param dict_size: Dictionary size for the tokenized transcription.
        :type dict_size: int
        :param num_conv_layers: Number of stacked convolution layers.
        :type num_conv_layers: int
        :param num_rnn_layers: Number of stacked RNN layers.
        :type num_rnn_layers: int
        :param rnn_size: RNN layer size (dimension of the RNN cells).
        :type rnn_size: int
        :param use_gru: Use gru if set True. Use simple rnn if set False.
        :type use_gru: bool
        :param share_rnn_weights: Whether to share input-hidden weights between
                                  forward and backward direction RNNs.
                                  It is only available when use_gru=False.
        :type share_rnn_weights: bool
        """
    
        # @classmethod
        # def params(cls, config: Optional[CfgNode]=None) -> CfgNode:
        #     default = CfgNode(
        #         dict(
        #             num_conv_layers=2,  #Number of stacking convolution layers.
        #             num_rnn_layers=3,  #Number of stacking RNN layers.
        #             rnn_layer_size=1024,  #RNN layer size (number of RNN cells).
        #             use_gru=True,  #Use gru if set True. Use simple rnn if set False.
        #             share_rnn_weights=True  #Whether to share input-hidden weights between forward and backward directional RNNs.Notice that for GRU, weight sharing is not supported.
        #         ))
        #     if config is not None:
        #         config.merge_from_other_cfg(default)
        #     return default
    
        def __init__(self,
                     feat_size,
                     dict_size,
                     num_conv_layers=2,
                     num_rnn_layers=3,
                     rnn_size=1024,
                     use_gru=False,
                     share_rnn_weights=True):
            super().__init__()
            self.encoder = CRNNEncoder(
                feat_size=feat_size,
                dict_size=dict_size,
                num_conv_layers=num_conv_layers,
                num_rnn_layers=num_rnn_layers,
                rnn_size=rnn_size,
                use_gru=use_gru,
                share_rnn_weights=share_rnn_weights)
            assert (self.encoder.output_size == rnn_size * 2)
    
            self.decoder = CTCDecoder(
                enc_n_units=self.encoder.output_size,
                odim=dict_size + 1,  # <blank> is appended after the vocab
                blank_id=dict_size,  # the last token is <blank>
                dropout_rate=0.5,
                reduction=True,  # sum
                batch_average=True)  # sum / batch_size
    
        def forward(self, audio, text, audio_len, text_len):
            """Compute the model loss.

            Args:
                audio (Tensor): [B, D, T]
                text (Tensor): [B, T]
                audio_len (Tensor): [B]
                text_len (Tensor): [B]

            Returns:
                loss (Tensor): [1]
                probs (Tensor): frame-level posteriors, [B, T, odim]
                eouts_len (Tensor): encoder output lengths, [B]
            """
            eouts, eouts_len = self.encoder(audio, audio_len)
            loss, probs = self.decoder(eouts, eouts_len, text, text_len)
            return loss, probs, eouts_len
    
        @torch.no_grad()
        def decode(self, audio, audio_len, vocab_list, decoding_method,
                   lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
                   cutoff_top_n, num_processes):
            # init once
            # decoders only accept string encoded in utf-8
            self.decoder.init_decode(
                beam_alpha=beam_alpha,
                beam_beta=beam_beta,
                lang_model_path=lang_model_path,
                vocab_list=vocab_list,
                decoding_method=decoding_method)
    
            eouts, eouts_len = self.encoder(audio, audio_len)
            probs = self.decoder.probs(eouts)
            # move to host memory before handing off to the numpy-based decoders
            return self.decoder.decode_probs(
                probs.cpu().numpy(), eouts_len, vocab_list, decoding_method,
                lang_model_path, beam_alpha, beam_beta, beam_size, cutoff_prob,
                cutoff_top_n, num_processes)
    
        @classmethod
        def from_pretrained(cls, dataset, config, checkpoint_path):
            """Build a DeepSpeech2Model model from a pretrained model.
            Parameters
            ----------
            dataset: torch.utils.data.Dataset, exposing feature_size and vocab_size
    
            config: yacs.config.CfgNode
                model configs
            
            checkpoint_path: Path or str
                the path of pretrained model checkpoint, without extension name
            
            Returns
            -------
            DeepSpeech2Model
                The model built from pretrained result.
            """
            model = cls(feat_size=dataset.feature_size,
                        dict_size=dataset.vocab_size,
                        num_conv_layers=config.model.num_conv_layers,
                        num_rnn_layers=config.model.num_rnn_layers,
                        rnn_size=config.model.rnn_layer_size,
                        use_gru=config.model.use_gru,
                        share_rnn_weights=config.model.share_rnn_weights)
            checkpoint.load_parameters(model, checkpoint_path=checkpoint_path)
            layer_tools.summary(model)
            return model
    
    
    class DeepSpeech2InferModel(DeepSpeech2Model):
        def __init__(self,
                     feat_size,
                     dict_size,
                     num_conv_layers=2,
                     num_rnn_layers=3,
                     rnn_size=1024,
                     use_gru=False,
                     share_rnn_weights=True):
            super().__init__(
                feat_size=feat_size,
                dict_size=dict_size,
                num_conv_layers=num_conv_layers,
                num_rnn_layers=num_rnn_layers,
                rnn_size=rnn_size,
                use_gru=use_gru,
                share_rnn_weights=share_rnn_weights)
    
        def forward(self, audio, audio_len):
            """export model function
    
            Args:
                audio (Tensor): [B, D, T]
                audio_len (Tensor): [B]
    
            Returns:
                probs: probs after softmax
            """
            eouts, eouts_len = self.encoder(audio, audio_len)
            probs = self.decoder.probs(eouts)
            return probs
    
    
    
    
    
    
    ctc

    from torch import nn
    from torch.nn import functional as F
    from torch.nn import CTCLoss

    __all__ = ['CTCDecoder']
    
    class CTCDecoder(nn.Module):
        def __init__(self,
                     enc_n_units,
                     odim,
                     blank_id=0,
                     dropout_rate: float=0.0,
                     reduction: bool=True,
                     batch_average: bool=False):
            """CTC decoder

            Args:
                enc_n_units ([int]): encoder output dimension
                odim ([int]): output dimension (vocabulary size incl. <blank>)
                blank_id ([int]): index of the <blank> token
                dropout_rate (float): dropout rate (0.0 ~ 1.0)
                reduction (bool): reduce the CTC loss into a scalar, True for 'sum' or 'none'
                batch_average (bool): divide the summed loss by the batch size.
            """
            super().__init__()

            self.blank_id = blank_id
            self.odim = odim
            self.dropout_rate = dropout_rate
            self.batch_average = batch_average
            self.ctc_lo = nn.Linear(enc_n_units, self.odim)
            reduction_type = "sum" if reduction else "none"
            self.criterion = CTCLoss(
                blank=self.blank_id,
                reduction=reduction_type)

            self._ext_scorer = None

        def probs(self, hs_pad):
            """Frame-level posteriors after softmax, shape [B, T, odim]."""
            return F.softmax(self.ctc_lo(hs_pad), dim=-1)

        def forward(self, hs_pad, hlens, ys_pad, ys_lens):
            """Calculate CTC loss.

            Args:
                hs_pad (Tensor): batch of padded hidden state sequences (B, Tmax, D)
                hlens (Tensor): batch of lengths of hidden state sequences (B)
                ys_pad (Tensor): batch of padded character id sequence tensor (B, Lmax)
                ys_lens (Tensor): batch of lengths of character sequence (B)
            Returns:
                loss (Tensor): scalar CTC loss.
                probs (Tensor): frame-level posteriors, (B, Tmax, odim).
            """
            logits = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate))
            # nn.CTCLoss expects [T, B, C] log-probabilities; transpose rather
            # than view, which would scramble the batch and time dimensions
            log_probs = logits.transpose(0, 1).log_softmax(dim=-1)
            loss = self.criterion(log_probs, ys_pad.long(), hlens.long(),
                                  ys_lens.long())
            if self.batch_average:
                loss = loss / hs_pad.shape[0]
            probs = log_probs.exp().transpose(0, 1)  # [B, T, odim]
            return loss, probs
    
    
    
  • DeepSpeech2_Mandarin_PyTorch: in this project we build an ASR model for Mandarin speech transcription based on the Deep Speech 2 architecture. Our code is partly taken from
  • 基于deepspeech2的语音识别模型

    2019-10-22 21:40:25
    The deepspeech2 GitHub repo, its Chinese Readme, and the paper link. deepspeech2 was run without Docker, with the dependencies installed directly. Problems hit while running the tiny demo: Q1: the PaddlePaddle build did not match the installed CUDA and cuDNN versions; PaddlePaddle version, reference link 1...
  • deepspeech2: Baidu Research's Deep Speech 2 model, released in 2015, converts speech end-to-end from normalized spectrograms into character sequences. It consists of several convolutional layers over time and frequency, followed by gated recurrent unit (GRU) layers (with additional batch normalization for ...
  • deepspeech 2 (Baidu 2016 paper review)

    2019-10-20 22:11:25
    Title: Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Abstract: We show that end-to-end deep learning can be used to recognize either English or Mandarin Chinese, two vastly different languages. Because it replaces hand-engineered components with neural networks ...
  • An end-to-end Chinese speech recognition model based on DeepSpeech2, implemented in PaddlePaddle (1300-hour dataset). Source: https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech
  • PaddlePaddle + DeepSpeech2 automatic speech recognition deployment. Background: speech recognition. Environment: DeepSpeech2, PaddlePaddle 1.8.5, Python 2.7, nvidia-docker, Ubuntu 1~18.04. Installation and configuration: you can skip nvidia-docker and jump straight to step 5. 1. First install nvidia-...
  • This post records the notes I took while reading Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.
  • deepspeech v2.rar

    2021-04-19 23:54:14
    Added labels, removed the non-Chinese-character check, and used permute and contiguous.
  • This project is modified from the PaddlePaddle DeepSpeech project to make it easy to train on custom Chinese datasets. Environment used: Python 2.7, PaddlePaddle 1.8.0. Source: https://github.com/yeyupiaoling/DeepSpeech. Environment setup: please ...
  • deepspeech2 代码之模型构建

    2019-07-29 12:19:56
    Building the model. The overall framework is shown in the (omitted) figure; the model consists of the following parts: the DeepSpeech model, MaskConv, BatchRNN, and fc. model = DeepSpeech(rnn_hidden_size=args.hidden_size, nb_layers=args.hidden_layers, ...
