  • A React component for the Web Speech API. The Web Speech API aims to enable web developers to provide speech-input and text-to-speech output in web browsers. It is split into two parts, SpeechSynthesis and SpeechRecognition, and this React component supports speech... Speech text="Welcome to react sp
  • Link ...automatic speech recognition/speech synthesis paper roadmap, including HMM, DNN, RNN, CNN, Seq2Seq, Attention Introd...

    Link: https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers

    automatic speech recognition/speech synthesis paper roadmap, including HMM, DNN, RNN, CNN, Seq2Seq, Attention

    Introduction

    Automatic Speech Recognition has been investigated for several decades, and speech recognition models have evolved from HMM-GMM systems to today's deep neural networks. This paper roadmap makes it easy to follow that history. I will cover papers from traditional models to today's popular models, including not only acoustic models and ASR systems but also many interesting language models.

    Paper List

    Automatic Speech Recognition

    • An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. LEVINSON et al. [pdf]

    • A Maximum Likelihood Approach to Continuous Speech Recognition(1983), LALIT R. BAHL et al. [pdf]

    • Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1986), Andrew K. Halberstadt. [pdf]

    • Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahl et al. [pdf]

    • A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R Rabiner. [pdf]

    • Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]

    • Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]

    • Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]

    • Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]

    • A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J.G. Fiscus. [pdf]

    • Speech recognition with weighted finite-state transducers(2001), M Mohri et al. [pdf]

    • Review of Tdnn (time Delay Neural Network) Architectures for Speech Recognition(2014), Masahide Sugiyamat et al. [pdf]

    • Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]

    • Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]

    • The kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]

    • Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]

    • Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]

    • Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]

    • Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]

    • Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]

    • Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]

    • Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]

    • Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]

    • Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]

    • Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]

    • Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]

    • Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]

    • Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]

    • Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]

    • Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]

    • End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]

    • First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]

    • Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]

    • Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]

    • Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]

    • Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]

    • A neural transducer(2015), N Jaitly et al. [pdf]

    • Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]

    • Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]

    • Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]

    • Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]

    • Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]

    • EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding(2015), Y Miao et al. [pdf]

    • Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]

    • Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]

    • Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]

    • Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]

    • Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]

    • End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]

    • Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]

    • Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]

    • End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]

    • Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]

    • Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]

    • Latent Sequence Decompositions(2016), William Chan et al. [pdf]

    • Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]

    • Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]

    • Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]

    • Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]

    • Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]

    • Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]

    • Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]

    • Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]

    • WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]

    • Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]

    • An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]

    • Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]

    • A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]

    • An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]

    • An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]

    • Attention-Based End-to-End Speech Recognition in Mandarin(2017), C Shan et al. [pdf]

    • Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]

    • Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]

    • Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]

    • English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]

    • ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]

    • Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition(2017), Chris Donahue et al. [pdf]

    • Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]

    • Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition(2017), Taesup Kim et al. [pdf]

    • Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]

    • Improving the Performance of Online Neural Transducer Models(2017), Tara N. Sainath et al. [pdf]

    • Learning Filterbanks from Raw Speech for Phone Recognition(2017), Neil Zeghidour et al. [pdf]

    • Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]

    • Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]

    • Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]

    • Multilingual Speech Recognition With A Single End-To-End Model(2017), Shubham Toshniwal et al. [pdf]

    • Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]

    • Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]

    • Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]

    • Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2017), Suyoun Kim et al. [pdf]

    • Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]

    • Robust Speech Recognition Using Generative Adversarial Networks(2017), Anuroop Sriram et al. [pdf]

    • State-of-the-art Speech Recognition With Sequence-to-Sequence Models(2017), Chung-Cheng Chiu et al. [pdf]

    • Towards Language-Universal End-to-End Speech Recognition(2017), Suyoun Kim et al. [pdf]

    • Accelerating recurrent neural network language model based online speech recognition system(2018), K Lee et al. [pdf]

    Speaker Verification

    • Speaker Verification Using Adapted Gaussian Mixture Models(2000), Douglas A.Reynolds et al. [pdf]

    • A tutorial on text-independent speaker verification(2004), Frédéric Bimbot et al. [pdf]

    • Deep neural networks for small footprint text-dependent speaker verification(2014), E Variani et al. [pdf]

    • Deep Speaker Vectors for Semi Text-independent Speaker Verification(2015), Lantian Li et al. [pdf]

    • Deep Speaker: an End-to-End Neural Speaker Embedding System(2017), Chao Li et al. [pdf]

    • Deep Speaker Feature Learning for Text-independent Speaker Verification(2017), Lantian Li et al. [pdf]

    • Deep Speaker Verification: Do We Need End to End?(2017), Dong Wang et al. [pdf]

    • Speaker Diarization with LSTM(2017), Quan Wang et al. [pdf]

    • Text-Independent Speaker Verification Using 3D Convolutional Neural Networks(2017), Amirsina Torfi et al. [pdf]

    Speech Synthesis

    • Signal estimation from modified short-time Fourier transform(1993), Daniel W. Griffin et al. [pdf]

    • Text-to-speech synthesis(2009), Paul Taylor et al. [pdf]

    • A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]

    • First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]

    • Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]

    • SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]

    • WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]

    • Char2Wav: End-to-end speech synthesis(2017), J Sotelo et al. [pdf]

    • Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]

    • Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]

    • Deep Voice 3: 2000-Speaker Neural Text-to-speech(2017), Wei Ping et al. [pdf]

    • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]

    • Parallel WaveNet: Fast High-Fidelity Speech Synthesis(2017), Aaron van den Oord et al. [pdf]

    • Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework(2017), S Yang et al. [pdf]

    • Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]

    • Uncovering Latent Style Factors for Expressive Speech Synthesis(2017), Yuxuan Wang et al. [pdf]

    • VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop(2017), Yaniv Taigman et al. [pdf]

    • Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions(2017), Jonathan Shen et al. [pdf]

    • Neural Voice Cloning with a Few Samples(2018), Sercan O. Arık , Jitong Chen , 1 Kainan Peng , Wei Ping * et al. [pdf]

    Language Modelling

    • Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]

    • An empirical study of smoothing techniques for language modeling(1996), Stanley F. Chen et al. [pdf]

    • A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]

    • A new statistical approach to Chinese Pinyin input(2000), Zheng Chen et al. [pdf]

    • Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]

    • Neural Network Language Model for Chinese Pinyin Input Method Engine(2015), S Chen et al. [pdf]

    • Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]

    • Exploring the limits of language modeling(2016), R Jozefowicz et al. [pdf]

    • On the State of the Art of Evaluation in Neural Language Models(2016), G Melis et al. [pdf]

    Contact Me

    For any questions, feel free to send an email to zzw922cn@gmail.com. Thanks!

  • OpenEars — includes offline speech processing and more. http://www.politepix.com/openears/ ...Welcome ... to OpenEars: free speech recognition and speech synthesis for the iPhone Introduction Installation Basic conce

    OpenEars — includes offline speech processing and more

    http://www.politepix.com/openears/


    If you aren't quite ready to read the documentation, visit the quickstart tool so you can get started with OpenEars in just a few minutes! You can come back and read the docs or the FAQ once you have specific questions.

    Introduction

    OpenEars is a shared-source iOS framework for iPhone voice recognition and speech synthesis (TTS). It lets you easily implement round-trip English language speech recognition and text-to-speech on the iPhone and iPad and uses the open source CMU Pocketsphinx, CMU Flite, and CMUCLMTK libraries, and it is free to use in an iPhone or iPad app. It is the most popular offline framework for speech recognition and speech synthesis on iOS and has been featured in development books such as O'Reilly's Basic Sensors in iOS by Alasdair Allan and Cocos2d for iPhone 1 Game Development Cookbook by Nathan Burba.


    Highly-accurate large-vocabulary recognition (that is, trying to recognize any word the user speaks out of many thousands of known words) is not yet a reality for local in-app processing on the iPhone given the hardware limitations of the platform; even Siri does its large-vocabulary recognition on the server side. However, Pocketsphinx (the open source voice recognition engine that OpenEars uses) is capable of local recognition on the iPhone of vocabularies with hundreds of words depending on the environment and other factors, and performs very well with command-and-control language models. The best part is that it uses no network connectivity because all processing occurs locally on the device.

    The current version of OpenEars is 1.2.4. Download OpenEars 1.2.4 or read its changelog.

    Features of OpenEars

    OpenEars can:

    • Listen continuously for speech on a background thread, while suspending or resuming speech processing on demand, all while using less than 4% CPU on average on an iPhone 4 (decoding speech, text-to-speech, updating the UI and other intermittent functions use more CPU),
    • Use any of 9 voices for speech, including male and female voices with a range of speed/quality level, and switch between them on the fly,
    • Change the pitch, speed and variance of any text-to-speech voice,
    • Know whether headphones are plugged in and continue voice recognition during text-to-speech only when they are plugged in,
    • Support bluetooth audio devices (experimental),
    • Dispatch information to any part of your app about the results of speech recognition and speech, or changes in the state of the audio session (such as an incoming phone call or headphones being plugged in),
    • Deliver level metering for both speech input and speech output so you can design visual feedback for both states.
    • Support JSGF grammars,
    • Dynamically generate new ARPA language models in-app based on input from an NSArray of NSStrings,
    • Switch between ARPA language models or JSGF grammars on the fly,
    • Get n-best lists with scoring,
    • Test existing recordings,
    • Be easily interacted with via standard and simple Objective-C methods,
    • Control all audio functions with text-to-speech and speech recognition in memory instead of writing audio files to disk and then reading them,
    • Drive speech recognition with a low-latency Audio Unit driver for highest responsiveness,
    • Be installed in a Cocoa-standard fashion using an easy-peasy already-compiled framework.
    • In addition to its various new features and faster recognition/text-to-speech responsiveness, OpenEars now has improved recognition accuracy.
    • OpenEars is free to use in an iPhone or iPad app.
    Warning
    Before using OpenEars, please note it has to use a different audio driver on the Simulator that is less accurate, so it is always necessary to evaluate accuracy on a real device. Please don't submit support requests for accuracy issues with the Simulator.


    Warning
    Because Apple has removed armv6 architecture compiling in Xcode 4.5, and it is only possible to support upcoming devices using the armv7s architecture available in Xcode 4.5, there was no other option than to end support for armv6 devices after OpenEars 1.2. That means that the current version of OpenEars only supports armv7 and armv7s devices (iPhone 3GS and later). If your app supports older devices like the first generation iPhone or the iPhone 3G, you can continue to download the legacy edition of OpenEars 1.2 here, but that edition will not update further – all updated versions of OpenEars starting with 1.2.1 will not support armv6 devices, just armv7 and armv7s. If you have previously been supporting older devices and you want to submit an app update removing that support, you must set your minimum deployment target to iOS 4.3 or later, or your app will be rejected by Apple. The framework is 100% compatible with LLVM-using versions of Xcode which precede version 4.5, but your app must be set to not compile the armv6 architecture in order to use it.

    Installation

    To use OpenEars:

    • Create your own app, and add the iOS frameworks AudioToolbox and AVFoundation to it.
    • Inside your downloaded distribution there is a folder called "Frameworks". Drag the "Frameworks" folder into your app project in Xcode.

    OK, now that you've finished laying the groundwork, you have to...wait, that's everything. You're ready to start using OpenEars. Give the sample app a spin to try out the features (the sample app uses ARC so you'll need a recent Xcode version) and then visit the Politepix interactive tutorial generator for a customized tutorial showing you exactly what code to add to your app for all of the different functionality of OpenEars.

    If the steps on this page didn't work for you, you can get free support at the forums, read the FAQ, brush up on the documentation, or open a private email support incident at the Politepix shop. If you'd like to read the documentation, simply read onward.

    Basic concepts

    There are a few basic concepts to understand about voice recognition and OpenEars that will make it easiest to create an app.

    • Local or offline speech recognition versus server-based or online speech recognition: most speech recognition on the iPhone is done by streaming the speech audio to servers. OpenEars works by doing the recognition inside the iPhone without using the network. This saves bandwidth and results in faster response, but since a server is much more powerful than a phone it means that we have to work with much smaller vocabularies to get accurate recognition.
    • Language Models. The language model is the vocabulary that you want OpenEars to understand, in a format that its speech recognition engine can understand. The smaller and better-adapted to your users' real usage cases the language model is, the better the accuracy. An ideal language model for PocketsphinxController has fewer than 200 words.
    • The parts of OpenEars. OpenEars has a simple, flexible and very powerful architecture. PocketsphinxController recognizes speech using a language model that was dynamically created by LanguageModelGenerator. FliteController creates synthesized speech (TTS). And OpenEarsEventsObserver dispatches messages about every feature of OpenEars (what speech was understood by the engine, whether synthesized speech is in progress, if there was an audio interruption) to any part of your app.
    BACK TO TOP

    FliteController Class Reference

    Detailed Description

    The class that controls speech synthesis (TTS) in OpenEars.

    Usage examples

    Preparing to use the class:

    To use FliteController, you need to have at least one Flite voice added to your project. When you added the "Frameworks" folder of OpenEars to your app, you already imported a voice called Slt, so these instructions will use the Slt voice. You can get eight more free voices in OpenEarsExtras, available at https://bitbucket.org/Politepix/openearsextras

    What to add to your header:

    Add the following lines to your header (the .h file). Under the imports at the very top:
    #import <Slt/Slt.h>
    #import <OpenEars/FliteController.h>
    
    In the middle part where instance variables go:
    FliteController *fliteController;
    Slt *slt;
    
    In the bottom part where class properties go:
    @property (strong, nonatomic) FliteController *fliteController;
    @property (strong, nonatomic) Slt *slt;
    

    What to add to your implementation:

    Add the following to your implementation (the .m file). Under the @implementation keyword at the top:
    @synthesize fliteController;
    @synthesize slt;
    
    Among the other methods of the class, add these lazy accessor methods for confident memory management of the object:
    - (FliteController *)fliteController {
    	if (fliteController == nil) {
    		fliteController = [[FliteController alloc] init];
    	}
    	return fliteController;
    }
    
    - (Slt *)slt {
    	if (slt == nil) {
    		slt = [[Slt alloc] init];
    	}
    	return slt;
    }
    

    How to use the class methods:

    In the method where you want to call speech (to test this out, add it to your viewDidLoad method), add the following method call:
    [self.fliteController say:@"A short statement" withVoice:self.slt];
    
    Warning
    There can only be one  FliteController instance in your app at any given moment.

    Method Documentation

    - (void) say:   (NSString *)  statement
    withVoice:   (FliteVoice *)  voiceToUse 
           

    This takes an NSString which is the word or phrase you want to say, and the FliteVoice to use to say the phrase. Usage Example:

    [self.fliteController say:@"Say it, don't spray it." withVoice:self.slt];

    There are a total of nine FliteVoices available for use with OpenEars. The Slt voice is the most popular one and it ships with OpenEars. The other eight voices can be downloaded as part of the OpenEarsExtras package available at the URL http://bitbucket.org/Politepix/openearsextras. To use them, just drag the desired downloaded voice's framework into your app, import its header at the top of your calling class (e.g. import <Slt/Slt.h> or import <Rms/Rms.h>) and instantiate it as you would any other object, then pass the instantiated voice to this method.
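
    For example, here is a minimal sketch of using the Rms voice from OpenEarsExtras instead of Slt, assuming you have dragged Rms.framework into your project and declared an rms property with a lazy accessor exactly like the slt one shown above (the property name rms is only an illustrative choice):
    #import <Rms/Rms.h>
    
    // Speak with the alternate voice by passing it to say:withVoice: instead of self.slt.
    [self.fliteController say:@"This statement uses a different voice." withVoice:self.rms];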

    - (Float32) fliteOutputLevel      

    A read-only attribute that tells you the volume level of synthesized speech in progress. This is a UI hook. You can't read it on the main thread.

    Property Documentation

    - (float) duration_stretch

    duration_stretch changes the speed of the voice. It is on a scale of 0.0-2.0 where 1.0 is the default.

    - (float) target_mean

    target_mean changes the pitch of the voice. It is on a scale of 0.0-2.0 where 1.0 is the default.

    - (float) target_stddev

    target_stddev changes convolution of the voice. It is on a scale of 0.0-2.0 where 1.0 is the default.

    - (BOOL) userCanInterruptSpeech

    Set userCanInterruptSpeech to TRUE in order to let new incoming human speech cut off synthesized speech in progress.
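
    As a rough sketch, these properties could be set on the lazily instantiated fliteController before speaking; the values below are chosen only for illustration:
    self.fliteController.duration_stretch = 1.2; // speak a little more slowly than the default of 1.0
    self.fliteController.target_mean = 1.1; // raise the pitch slightly
    self.fliteController.target_stddev = 1.0; // keep the default variation
    self.fliteController.userCanInterruptSpeech = TRUE; // let incoming speech cut off synthesized speech
    [self.fliteController say:@"These settings affect this FliteController instance." withVoice:self.slt];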

    BACK TO TOP

    LanguageModelGenerator Class Reference

    Detailed Description

    The class that generates the vocabulary the PocketsphinxController is able to understand.

    Usage examples

    What to add to your implementation:

    Add the following to your implementation (the .m file), under the imports at the very top:
    #import <OpenEars/LanguageModelGenerator.h>
    
    Wherever you need to instantiate the language model generator, do it as follows:
    LanguageModelGenerator *lmGenerator = [[LanguageModelGenerator alloc] init];
    

    How to use the class methods:

    In the method where you want to create your language model (for instance your viewDidLoad method), add the following method call (replacing the placeholders like "WORD" and "A PHRASE" with actual words and phrases you want to be able to recognize):
    NSArray *words = [NSArray arrayWithObjects:@"WORD", @"STATEMENT", @"OTHER WORD", @"A PHRASE", nil];
    NSString *name = @"NameIWantForMyLanguageModelFiles";
    NSError *err = [lmGenerator generateLanguageModelFromArray:words withFilesNamed:name];
    
    
    NSDictionary *languageGeneratorResults = nil;
    
    NSString *lmPath = nil;
    NSString *dicPath = nil;
    	
    if([err code] == noErr) {
    	
    	languageGeneratorResults = [err userInfo];
    		
    	lmPath = [languageGeneratorResults objectForKey:@"LMPath"];
    	dicPath = [languageGeneratorResults objectForKey:@"DictionaryPath"];
    		
    } else {
    	NSLog(@"Error: %@",[err localizedDescription]);
    }
    
    If you are using the default English-language model generation, it is a requirement to enter your words and phrases in all capital letters, since the model is generated against a dictionary in which the entries are capitalized (meaning that if the words in the array aren't capitalized, they will not match the dictionary and you will not have the widest variety of pronunciations understood for the word you are using).

    If you need to create a fixed language model ahead of time instead of creating it dynamically in your app, just use this method (or generateLanguageModelFromTextFile:withFilesNamed:) to submit your full language model using the Simulator and then use the Simulator documents folder script to get the language model and dictionary file out of the documents folder and add it to your app bundle, referencing it from there.

    Method Documentation

    - (NSError *) generateLanguageModelFromArray:   (NSArray *)  languageModelArray
    withFilesNamed:   (NSString *)  fileName 
           

    Generate a language model from an array of NSStrings which are the words and phrases you want PocketsphinxController or PocketsphinxController+RapidEars to understand. Putting a phrase in as a string makes it somewhat more probable that the phrase will be recognized as a phrase when spoken. fileName is the way you want the output files to be named, for instance if you enter "MyDynamicLanguageModel" you will receive files output to your Documents directory titled MyDynamicLanguageModel.dic, MyDynamicLanguageModel.arpa, and MyDynamicLanguageModel.DMP. The error that this method returns contains the paths to the files that were created in a successful generation effort in its userInfo when NSError == noErr. The words and phrases in languageModelArray must be written with capital letters exclusively, for instance "word" must appear in the array as "WORD".

    - (NSError *) generateLanguageModelFromTextFile:   (NSString *)  pathToTextFile
    withFilesNamed:   (NSString *)  fileName 
           

    Generate a language model from a text file containing words and phrases you want PocketsphinxController to understand. The file should be formatted with every word or contiguous phrase on its own line with a line break afterwards. Putting a phrase in on its own line makes it somewhat more probable that the phrase will be recognized as a phrase when spoken. Give the correct full path to the text file as a string. fileName is the way you want the output files to be named, for instance if you enter "MyDynamicLanguageModel" you will receive files output to your Documents directory titled MyDynamicLanguageModel.dic, MyDynamicLanguageModel.arpa, and MyDynamicLanguageModel.DMP. The error that this method returns contains the paths to the files that were created in a successful generation effort in its userInfo when NSError == noErr. The words and phrases in the text file must be written with capital letters exclusively, for instance "word" must appear in the file as "WORD".
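
    A minimal sketch of calling this method, assuming a text file named MyCorpus.txt has been added to your app bundle (the file name is purely hypothetical); the error handling mirrors the generateLanguageModelFromArray:withFilesNamed: example above:
    NSString *pathToTextFile = [[NSBundle mainBundle] pathForResource:@"MyCorpus" ofType:@"txt"]; // hypothetical corpus file
    NSError *err = [lmGenerator generateLanguageModelFromTextFile:pathToTextFile withFilesNamed:@"NameIWantForMyLanguageModelFiles"];
    
    if([err code] == noErr) {
    	NSString *lmPath = [[err userInfo] objectForKey:@"LMPath"];
    	NSString *dicPath = [[err userInfo] objectForKey:@"DictionaryPath"];
    	// Pass lmPath and dicPath to PocketsphinxController as shown below.
    } else {
    	NSLog(@"Error: %@", [err localizedDescription]);
    }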

    Property Documentation

    - (BOOL) verboseLanguageModelGenerator

    Set this to TRUE to get verbose output.

    - (BOOL) useFallbackMethod

    Advanced: turn this off if the words in your input array or text file aren't in English and you are using a custom dictionary file.

    - (NSString *) dictionaryPathAsString

    Advanced: if you have your own pronunciation dictionary you want to use instead of CMU07a.dic you can assign its full path to this property before running the language model generation.
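
    As a sketch, these properties would be set on the generator before calling one of the generation methods; the custom dictionary file name below is purely hypothetical:
    lmGenerator.verboseLanguageModelGenerator = TRUE; // log the details of the generation run
    // Only needed when your input isn't in English and you supply your own pronunciation dictionary:
    lmGenerator.useFallbackMethod = FALSE;
    lmGenerator.dictionaryPathAsString = [[NSBundle mainBundle] pathForResource:@"MyCustomDictionary" ofType:@"dic"]; // hypothetical custom dictionary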

    BACK TO TOP

    OpenEarsEventsObserver Class Reference

    Detailed Description

    OpenEarsEventsObserver provides a large set of delegate methods that allow you to receive information about the events in OpenEars from anywhere in your app. You can create as many OpenEarsEventsObservers as you need and receive information using them simultaneously. All of the documentation for the use of OpenEarsEventsObserver is found in the section OpenEarsEventsObserverDelegate.

    Property Documentation

    - (id< OpenEarsEventsObserverDelegate >) delegate

    To use the OpenEarsEventsObserverDelegate methods, assign this delegate to the class hosting OpenEarsEventsObserver and then use the delegate methods documented under OpenEarsEventsObserverDelegate. There is a complete example of how to do this explained under the OpenEarsEventsObserverDelegate documentation.

    BACK TO TOP

    OpenEarsLogging Class Reference

    Detailed Description

    A singleton which turns logging on or off for the entire framework. The type of logging is related to overall framework functionality such as the audio session and timing operations. Please turn OpenEarsLogging on for any issue you encounter. It will probably show the problem, but if not you can show the log on the forum and get help.

    Warning
    The individual classes such as  PocketsphinxController and LanguageModelGenerator have their own verbose flags which are separate from  OpenEarsLogging.

    Method Documentation

    + (id) startOpenEarsLogging      

    This just turns on logging. If you don't want logging in your session, don't send the startOpenEarsLogging message.

    Example Usage:

    Before implementation:

    #import <OpenEars/OpenEarsLogging.h>

    In implementation:

    [OpenEarsLogging startOpenEarsLogging];

    BACK TO TOP

    PocketsphinxController Class Reference

    Detailed Description

    The class that controls local speech recognition in OpenEars.

    Usage examples

    Preparing to use the class:

    To use PocketsphinxController, you need a language model and a phonetic dictionary for it. These files define which words PocketsphinxController is capable of recognizing. They are created above by using LanguageModelGenerator.

    What to add to your header:

    Add the following lines to your header (the .h file). Under the imports at the very top:
    #import <OpenEars/PocketsphinxController.h>
    
    In the middle part where instance variables go:
    PocketsphinxController *pocketsphinxController;
    
    In the bottom part where class properties go:
    @property (strong, nonatomic) PocketsphinxController *pocketsphinxController;
    

    What to add to your implementation:

    Add the following to your implementation (the .m file). Under the @implementation keyword at the top:
    @synthesize pocketsphinxController;
    
    Among the other methods of the class, add this lazy accessor method for confident memory management of the object:
    - (PocketsphinxController *)pocketsphinxController {
    	if (pocketsphinxController == nil) {
    		pocketsphinxController = [[PocketsphinxController alloc] init];
    	}
    	return pocketsphinxController;
    }
    

    How to use the class methods:

    In the method where you want to recognize speech (to test this out, add it to your viewDidLoad method), add the following method call:
    [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath dictionaryAtPath:dicPath languageModelIsJSGF:NO];
    
    Warning
    There can only be one  PocketsphinxController instance in your app.

    Method Documentation

    - (void) startListeningWithLanguageModelAtPath:   (NSString *)  languageModelPath
    dictionaryAtPath:   (NSString *)  dictionaryPath
    languageModelIsJSGF:   (BOOL)  languageModelIsJSGF 
           

    Start the speech recognition engine up. You provide the full paths to a language model and a dictionary file which are created using LanguageModelGenerator.

    - (void) stopListening      

    Shut down the engine. You must do this before releasing a parent view controller that contains PocketsphinxController.

    - (void) suspendRecognition      

    Keep the engine going but stop listening to speech until resumeRecognition is called. Takes effect instantly.

    - (void) resumeRecognition      

    Resume listening for speech after suspendRecognition has been called.

    - (void) changeLanguageModelToFile:   (NSString *)  languageModelPathAsString
    withDictionary:   (NSString *)  dictionaryPathAsString 
           

    Change from one language model to another. This lets you change which words you are listening for depending on the context in your app.
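
    For instance, assuming you have generated a second language model and dictionary with LanguageModelGenerator and kept their paths (the variable names below are only illustrative), switching while listening might look like this sketch:
    // secondLmPath and secondDicPath are the paths returned by LanguageModelGenerator for the second model.
    [self.pocketsphinxController changeLanguageModelToFile:secondLmPath withDictionary:secondDicPath];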

    - (Float32) pocketsphinxInputLevel      

    Gives the volume of the incoming speech. This is a UI hook. You can't read it on the main thread or it will block.

    - (void) runRecognitionOnWavFileAtPath:   (NSString *)  wavPath
    usingLanguageModelAtPath:   (NSString *)  languageModelPath
    dictionaryAtPath:   (NSString *)  dictionaryPath
    languageModelIsJSGF:   (BOOL)  languageModelIsJSGF 
           

    You can use this to run recognition on an already-recorded WAV file for testing. The WAV file has to be 16-bit and 16000 samples per second.
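
    A sketch of testing recognition against a bundled recording; the file name test.wav is only an example, and lmPath and dicPath are the paths produced by LanguageModelGenerator above:
    NSString *wavPath = [[NSBundle mainBundle] pathForResource:@"test" ofType:@"wav"]; // hypothetical 16-bit, 16000 Hz recording
    [self.pocketsphinxController runRecognitionOnWavFileAtPath:wavPath usingLanguageModelAtPath:lmPath dictionaryAtPath:dicPath languageModelIsJSGF:NO];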

    Property Documentation

    - (float) secondsOfSilenceToDetect

    This is how long PocketsphinxController should wait after speech ends to attempt to recognize speech. This defaults to .7 seconds.

    - (BOOL) returnNbest

    Advanced: set this to TRUE to receive n-best results.

    - (int) nBestNumber

    Advanced: the number of n-best results to return. This is a maximum number to return; if there are null hypotheses, fewer than this number will be returned.

    - (int) calibrationTime

    How long to calibrate for. This can only be one of the values '1', '2', or '3'. Defaults to 1.

    - (BOOL) verbosePocketSphinx

    Turn on verbose output. Do this any time you encounter an issue and any time you need to report an issue on the forums.

    - (BOOL) returnNullHypotheses

    By default, PocketsphinxController won't return a hypothesis if for some reason the hypothesis is null (this can happen if the perceived sound was just noise). If you need even empty hypotheses to be returned, you can set this to TRUE before starting PocketsphinxController.
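
    A sketch of configuring these properties before starting to listen; the values are only examples:
    self.pocketsphinxController.secondsOfSilenceToDetect = 0.5; // end utterances after half a second of silence
    self.pocketsphinxController.returnNbest = TRUE; // deliver n-best results
    self.pocketsphinxController.nBestNumber = 5; // return at most five hypotheses
    [self.pocketsphinxController startListeningWithLanguageModelAtPath:lmPath dictionaryAtPath:dicPath languageModelIsJSGF:NO];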

    BACK TO TOP

    <OpenEarsEventsObserverDelegate> Protocol Reference

    Detailed Description

    OpenEarsEventsObserver provides a large set of delegate methods that allow you to receive information about the events in OpenEars from anywhere in your app. You can create as many OpenEarsEventsObservers as you need and receive information using them simultaneously.

    Usage examples

    What to add to your header:

    Add the following lines to your header (the .h file). Under the imports at the very top:
    #import <OpenEars/OpenEarsEventsObserver.h>
    
    At the @interface declaration, add the OpenEarsEventsObserverDelegate inheritance. An example of this for a view controller called ViewController would look like this:
    @interface ViewController : UIViewController <OpenEarsEventsObserverDelegate> {
    
    In the middle part where instance variables go:
    OpenEarsEventsObserver *openEarsEventsObserver;
    
    In the bottom part where class properties go:
    @property (strong, nonatomic) OpenEarsEventsObserver *openEarsEventsObserver;
    

    What to add to your implementation:

    Add the following to your implementation (the .m file). Under the @implementation keyword at the top:
    @synthesize openEarsEventsObserver;
    
    Among the other methods of the class, add this lazy accessor method for confident memory management of the object:
    - (OpenEarsEventsObserver *)openEarsEventsObserver {
    	if (openEarsEventsObserver == nil) {
    		openEarsEventsObserver = [[OpenEarsEventsObserver alloc] init];
    	}
    	return openEarsEventsObserver;
    }
    
    and then right before you start your first OpenEars functionality (for instance, right before your first self.fliteController say:withVoice: message or right before your first self.pocketsphinxController startListeningWithLanguageModelAtPath:dictionaryAtPath:languageModelIsJSGF: message) send this message:
    [self.openEarsEventsObserver setDelegate:self];
    

    How to use the class methods:

    Add these delegate methods of OpenEarsEventsObserver to your class:
    - (void) pocketsphinxDidReceiveHypothesis:(NSString *)hypothesis recognitionScore:(NSString *)recognitionScore utteranceID:(NSString *)utteranceID {
    	NSLog(@"The received hypothesis is %@ with a score of %@ and an ID of %@", hypothesis, recognitionScore, utteranceID);
    }
    
    - (void) pocketsphinxDidStartCalibration {
    	NSLog(@"Pocketsphinx calibration has started.");
    }
    
    - (void) pocketsphinxDidCompleteCalibration {
    	NSLog(@"Pocketsphinx calibration is complete.");
    }
    
    - (void) pocketsphinxDidStartListening {
    	NSLog(@"Pocketsphinx is now listening.");
    }
    
    - (void) pocketsphinxDidDetectSpeech {
    	NSLog(@"Pocketsphinx has detected speech.");
    }
    
    - (void) pocketsphinxDidDetectFinishedSpeech {
    	NSLog(@"Pocketsphinx has detected a period of silence, concluding an utterance.");
    }
    
    - (void) pocketsphinxDidStopListening {
    	NSLog(@"Pocketsphinx has stopped listening.");
    }
    
    - (void) pocketsphinxDidSuspendRecognition {
    	NSLog(@"Pocketsphinx has suspended recognition.");
    }
    
    - (void) pocketsphinxDidResumeRecognition {
    	NSLog(@"Pocketsphinx has resumed recognition."); 
    }
    
    - (void) pocketsphinxDidChangeLanguageModelToFile:(NSString *)newLanguageModelPathAsString andDictionary:(NSString *)newDictionaryPathAsString {
    	NSLog(@"Pocketsphinx is now using the following language model: \n%@ and the following dictionary: %@",newLanguageModelPathAsString,newDictionaryPathAsString);
    }
    
    - (void) pocketSphinxContinuousSetupDidFail { // This can let you know that something went wrong with the recognition loop startup. Turn on OPENEARSLOGGING to learn why.
    	NSLog(@"Setting up the continuous recognition loop has failed for some reason, please turn on OpenEarsLogging to learn more.");
    }
    

    Method Documentation

    - (void) audioSessionInterruptionDidBegin      

    There was an interruption.

    - (void) audioSessionInterruptionDidEnd      

    The interruption ended.

    - (void) audioInputDidBecomeUnavailable      

    The input became unavailable.

    - (void) audioInputDidBecomeAvailable      

    The input became available again.

    - (void) audioRouteDidChangeToRoute:   (NSString *)  newRoute  

    The audio route changed.

    - (void) pocketsphinxDidStartCalibration      

    Pocketsphinx isn't listening yet but it started calibration.

    - (void) pocketsphinxDidCompleteCalibration      

    Pocketsphinx isn't listening yet but calibration completed.

    - (void) pocketsphinxRecognitionLoopDidStart      

    Pocketsphinx isn't listening yet but it has entered the main recognition loop.

    - (void) pocketsphinxDidStartListening      

    Pocketsphinx is now listening.

    - (void) pocketsphinxDidDetectSpeech      

    Pocketsphinx heard speech and is about to process it.

    - (void) pocketsphinxDidDetectFinishedSpeech      

    Pocketsphinx detected a second of silence indicating the end of an utterance.

    - (void) pocketsphinxDidReceiveHypothesis:   (NSString *)  hypothesis
    recognitionScore:   (NSString *)  recognitionScore
    utteranceID:   (NSString *)  utteranceID 
           

    Pocketsphinx has a hypothesis.

    - (void) pocketsphinxDidReceiveNBestHypothesisArray:   (NSArray *)  hypothesisArray  

    Pocketsphinx has an n-best hypothesis dictionary.
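
    If returnNbest has been set to TRUE on PocketsphinxController, a corresponding delegate method implementation might look like this sketch, which simply logs the received array:
    - (void) pocketsphinxDidReceiveNBestHypothesisArray:(NSArray *)hypothesisArray {
    	NSLog(@"The n-best hypotheses are: %@", hypothesisArray);
    }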

    - (void) pocketsphinxDidStopListening      

    Pocketsphinx has exited the continuous listening loop.

    - (void) pocketsphinxDidSuspendRecognition      

    Pocketsphinx has not exited the continuous listening loop but it will not attempt recognition.

    - (void) pocketsphinxDidResumeRecognition      

    Pocketsphinx has not exited the continuous listening loop and it will now start attempting recognition again.

    - (void) pocketsphinxDidChangeLanguageModelToFile:   (NSString *)  newLanguageModelPathAsString
    andDictionary:   (NSString *)  newDictionaryPathAsString 
           

    Pocketsphinx switched language models inline.

    - (void) pocketSphinxContinuousSetupDidFail      

    Some aspect of setting up the continuous loop failed; turn on OpenEarsLogging for more info.

    - (void) fliteDidStartSpeaking      

    Flite started speaking. You probably don't have to do anything about this.

    - (void) fliteDidFinishSpeaking      

    Flite finished speaking. You probably don't have to do anything about this.


  • Automatic_Speech_Recognition


    https://github.com/zzw922cn/Automatic_Speech_Recognition

    Automatic-Speech-Recognition

    End-to-end automatic speech recognition system implemented in TensorFlow.

    Recent Updates

    •  Support TensorFlow r1.0 (2017-02-24)
    •  Support dropout for dynamic rnn (2017-03-11)
    •  Support running in shell file (2017-03-11)
    •  Support automatic evaluation every few training epochs (2017-03-11)
    •  Fix bugs for character-level automatic speech recognition (2017-03-14)
    •  Improve some function APIs for reusability (2017-03-14)
    •  Add scaling for data preprocessing (2017-03-15)
    •  Add reusable support for LibriSpeech training (2017-03-15)
    •  Add simple n-gram model for random generation or statistical use (2017-03-23)
    •  Improve some code for pre-processing and training (2017-03-23)
    •  Replace TABs with blanks and add nist2wav converter script (2017-04-20)
    •  Add some data preparation code (2017-05-01)
    •  Add WSJ corpus standard preprocessing by s5 recipe (2017-05-05)
    •  Restructure the project; update train.py for usage convenience (2017-05-06)
    •  Finish feature module for timit, libri, wsj; support training for LibriSpeech (2017-05-14)

    Recommendation

    If you want to replace the feed dict operation with a multi-threaded TensorFlow FIFOQueue input pipeline, you can refer to my repo TensorFlow-Input-Pipeline for more example code. In my own experience, a FIFOQueue input pipeline can improve training speed in some cases; a minimal sketch is shown below.
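
    As a minimal sketch of that idea (assuming the TensorFlow 1.x queue API; the tensor shapes and variable names below are illustrative, not taken from this repo), a FIFOQueue can be filled from a background thread while the training graph dequeues from it:

    import threading
    import tensorflow as tf

    # Placeholders are used only by the feeding thread, not by the training step.
    feat_ph = tf.placeholder(tf.float32, shape=[None, 39])   # one utterance: (frames, 39)
    label_ph = tf.placeholder(tf.int32, shape=[None])        # its label sequence

    queue = tf.FIFOQueue(capacity=64, dtypes=[tf.float32, tf.int32])
    enqueue_op = queue.enqueue([feat_ph, label_ph])
    feat, label = queue.dequeue()   # the training graph consumes these tensors directly

    def feed_queue(sess, dataset):
        # Background thread that keeps the queue filled while training runs.
        for feats, labels in dataset:
            sess.run(enqueue_op, feed_dict={feat_ph: feats, label_ph: labels})

    # Usage: t = threading.Thread(target=feed_queue, args=(sess, data)); t.daemon = True; t.start()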

    If you want to look at the history of speech recognition, I have collected the significant papers in the ASR field since 1981. You can read the paper list in my repo awesome-speech-recognition-papers; download links for all papers are provided. I will update it every week with new papers on speech recognition, speech synthesis and language modelling, so that we won't miss any important papers in the speech domain.

    All my public repos will be kept updated in the future; thanks for your stars!

    Install and Usage

    Currently only Python 2.7 is supported.

    This project depends on scikit.audiolab, for which you need to have libsndfile installed on your system. Clone the repository to your preferred directory and install the dependencies using:

    pip install -r requirements.txt
    

    To use, simply run the following command:

    python main/timit_train.py [-h] [--mode MODE] [--keep [KEEP]] [--nokeep]
                          [--level LEVEL] [--model MODEL] [--rnncell RNNCELL]
                          [--num_layer NUM_LAYER] [--activation ACTIVATION]
                          [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]
                          [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]
                          [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]
                          [--lr LR] [--dropout_prob DROPOUT_PROB]
                          [--grad_clip GRAD_CLIP] [--datadir DATADIR]
                          [--logdir LOGDIR]
    
    optional arguments:
      -h, --help            show this help message and exit
      --mode MODE           set whether to train or test
      --keep [KEEP]         set whether to restore a model, when test mode, keep
                            should be set to True
      --nokeep
      --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will
                            be supported soon
      --model MODEL         set the model to use, DBiRNN, BiRNN, ResNet..
      --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...
      --num_layer NUM_LAYER
                            set the layers for rnn
      --activation ACTIVATION
                            set the activation to use, sigmoid, tanh, relu, elu...
      --optimizer OPTIMIZER
                            set the optimizer to use, sgd, adam...
      --batch_size BATCH_SIZE
                            set the batch size
      --num_hidden NUM_HIDDEN
                            set the hidden size of rnn cell
      --num_feature NUM_FEATURE
                            set the size of input feature
      --num_classes NUM_CLASSES
                            set the number of output classes
      --num_epochs NUM_EPOCHS
                            set the number of epochs
      --lr LR               set the learning rate
      --dropout_prob DROPOUT_PROB
                            set probability of dropout
      --grad_clip GRAD_CLIP
                            set the threshold of gradient clipping
      --datadir DATADIR     set the data root directory
      --logdir LOGDIR       set the log directory
    
    

    Instead of configuring everything on the command line, you can also set the arguments above directly in train.py.

    Besides, you can also run main/run.sh for both training and testing simultaneously! See run.sh for details.
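
    For example, a phoneme-level training run could be launched like this (the paths and hyperparameter values are illustrative, not the repository's defaults):

    python main/timit_train.py --mode train --level phn --model DBiRNN \
        --rnncell lstm --num_layer 2 --num_hidden 256 --batch_size 32 \
        --num_epochs 50 --lr 0.001 --datadir /path/to/timit_features \
        --logdir /path/to/logs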

    Performance

    PER of a dynamic BLSTM on the TIMIT database, with only casual tuning because time was limited

    [figure: PER curve on TIMIT]

    LibriSpeech recognition result without LM

    Label:

    it was about noon when captain waverley entered the straggling village or rather hamlet of tully veolan close to which was situated the mansion of the proprietor

    Prediction:

    it was about noon when captain wavraly entered the stragling bilagor of rather hamlent of tulevallon close to which wi situated the mantion of the propriater

    Label:

    the english it is evident had they not been previously assured of receiving the king would never have parted with so considerable a sum and while they weakened themselves by the same measure have strengthened a people with whom they must afterwards have so material an interest to discuss

    Prediction:

    the onglish it is evident had they not being previously showed of receiving the king would never have parted with so considerable a some an quile they weakene themselves by the same measure haf streigth and de people with whom they must afterwards have so material and interest to discuss

    Label:

    one who writes of such an era labours under a troublesome disadvantage

    Prediction:

    one how rights of such an er a labours onder a troubles hom disadvantage

    Label:

    then they started on again and two hours later came in sight of the house of doctor pipt

    Prediction:

    then they started on again and two hours laytor came in sight of the house of doctor pipd

    Label:

    what does he want

    Prediction:

    whit daes he want

    Label:

    there just in front

    Prediction:

    there just infront

    Label:

    under ordinary circumstances the abalone is tough and unpalatable but after the deft manipulation of herbert they are tender and make a fine dish either fried as chowder or a la newberg

    Prediction:

    under ordinary circumstancesi the abl ony is tufgh and unpelitable but after the deftominiculation of hurbourt and they are tender and make a fine dish either fride as choder or alanuburg

    Label:

    by degrees all his happiness all his brilliancy subsided into regret and uneasiness so that his limbs lost their power his arms hung heavily by his sides and his head drooped as though he was stupefied

    Prediction:

    by degrees all his happiness ill his brilliancy subsited inter regret and aneasiness so that his limbs lost their power his arms hung heavily by his sides and his head druped as though he was stupified

    Label:

    i am the one to go after walt if anyone has to i'll go down mister thomas

    Prediction:

    i have the one to go after walt if ety wod hastu i'll go down mister thommas

    Label:

    i had to read it over carefully as the text must be absolutely correct

    Prediction:

    i had to readit over carefully as the tex must be absolutely correct

    Label:

    with a shout the boys dashed pell mell to meet the pack train and falling in behind the slow moving burros urged them on with derisive shouts and sundry resounding slaps on the animals flanks

    Prediction:

    with a shok the boy stash pale mele to meek the pecktrait ane falling in behind the slow lelicg burs ersh tlan with deressive shouts and sudery resounding sleps on the animal slankes

    Label:

    i suppose though it's too early for them then came the explosion

    Prediction:

    i suppouse gho waths two early for them then came the explosion

    Content

    This is a powerful library for automatic speech recognition. It is implemented in TensorFlow and supports training on CPU or GPU. The library contains the following components that you can use to train your own model:

    • Data Pre-processing
    • Acoustic Modeling
      • RNN
      • BRNN
      • LSTM
      • BLSTM
      • GRU
      • BGRU
      • Dynamic RNN
      • Deep Residual Network
      • Seq2Seq with attention decoder
      • etc.
    • CTC Decoding
    • Evaluation(Mapping some similar phonemes)
    • Saving or Restoring Model
    • Mini-batch Training
    • Training with GPU or CPU with TensorFlow
    • Keeping logging of epoch time and error rate in disk

    Implementation Details

    Data preprocessing

    TIMIT corpus

    The original TIMIT database contains 6300 utterances, but the 'SA' sentences are read by every speaker, which would introduce a bad bias into our speech recognition system. Therefore, we removed all 'SA' files from the original dataset, obtaining a reduced TIMIT dataset of 5040 utterances: a standard training set of 3696 utterances and a test set of 1344 utterances.

    Automatic Speech Recognition transcribes a raw audio file into character sequences; the preprocessing stage converts a raw audio file into feature vectors of several frames. We first split each audio file into 20 ms Hamming windows with an overlap of 10 ms, and then calculate the 12 mel-frequency cepstral coefficients, appending an energy value to each frame, which results in a vector of length 13. We then calculate the delta and delta-delta coefficients, giving a total of 39 coefficients per frame. In other words, each audio file is split into frames using the Hamming window function, and each frame is converted into a feature vector of length 39 (to obtain feature vectors of a different length, modify the settings in timit_preprocess.py).
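
    The following sketch reproduces this 39-dimensional feature computation with the python_speech_features package (a hedged example: the repository's own calcmfcc.py may differ in details such as scaling and window handling):

    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc, delta

    def extract_39dim_features(wav_path):
        # Returns a (num_frames, 39) matrix: 13 MFCCs (with energy) + deltas + delta-deltas.
        rate, signal = wav.read(wav_path)
        feat = mfcc(signal, samplerate=rate, winlen=0.020, winstep=0.010,
                    numcep=13, appendEnergy=True, winfunc=np.hamming)
        d1 = delta(feat, 2)    # delta coefficients
        d2 = delta(d1, 2)      # delta-delta coefficients
        return np.hstack([feat, d1, d2])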

    In the folder data/mfcc, each file is a feature matrix of size timeLength*39 for one audio file; in the folder data/label, each file is the label vector corresponding to that mfcc file.

    If you want to customize the data preprocessing, you can edit calcmfcc.py or timit_preprocess.py.

    The original TIMIT dataset contains 61 phonemes. We use all 61 phonemes for training and evaluation, but when scoring we map the 61 phonemes onto 39 phonemes for better performance. This mapping follows the paper Speaker-independent phone recognition using hidden Markov models. The mapping details are as follows:

    Original Phoneme(s) -> Mapped Phoneme
    iy -> iy
    ix, ih -> ix
    eh -> eh
    ae -> ae
    ax, ah, ax-h -> ax
    uw, ux -> uw
    uh -> uh
    ao, aa -> ao
    ey -> ey
    ay -> ay
    oy -> oy
    aw -> aw
    ow -> ow
    er, axr -> er
    l, el -> l
    r -> r
    w -> w
    y -> y
    m, em -> m
    n, en, nx -> n
    ng, eng -> ng
    v -> v
    f -> f
    dh -> dh
    th -> th
    z -> z
    s -> s
    zh, sh -> zh
    jh -> jh
    ch -> ch
    b -> b
    p -> p
    d -> d
    dx -> dx
    t -> t
    g -> g
    k -> k
    hh, hv -> hh
    bcl, pcl, dcl, tcl, gcl, kcl, q, epi, pau, h# -> h#
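
    A simple way to apply this folding at scoring time is sketched below (a hypothetical helper for illustration, not the repository's evaluation code; phones not listed map to themselves):

    # 61 -> 39 folding from the table above; unlisted phones map to themselves.
    PHONE_MAP_61_TO_39 = {
        'ih': 'ix', 'ah': 'ax', 'ax-h': 'ax', 'ux': 'uw', 'aa': 'ao',
        'axr': 'er', 'el': 'l', 'em': 'm', 'en': 'n', 'nx': 'n',
        'eng': 'ng', 'sh': 'zh', 'hv': 'hh',
        'bcl': 'h#', 'pcl': 'h#', 'dcl': 'h#', 'tcl': 'h#', 'gcl': 'h#',
        'kcl': 'h#', 'q': 'h#', 'epi': 'h#', 'pau': 'h#',
    }

    def fold_phones(phones):
        # Map a decoded 61-phoneme sequence onto the 39-phoneme scoring set.
        return [PHONE_MAP_61_TO_39.get(p, p) for p in phones]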

    LibriSpeech corpus

    LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech. It can be downloaded from here

    In order to preprocess the LibriSpeech data, download the dataset from the link above, extract it, and run the following:

    cd feature/libri
    python libri_preprocess.py -h 
    usage: libri_preprocess [-h]
                            [-n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}]
                            [-m {mfcc,fbank}] [--featlen FEATLEN] [-s]
                            [-wl WINLEN] [-ws WINSTEP]
                            path save
    
    Script to preprocess libri data
    
    positional arguments:
      path                  Directory of LibriSpeech dataset
      save                  Directory where preprocessed arrays are to be saved
    
    optional arguments:
      -h, --help            show this help message and exit
      -n {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}, --name {dev-clean,dev-other,test-clean,test-other,train-clean-100,train-clean-360,train-other-500}
                            Name of the dataset
      -m {mfcc,fbank}, --mode {mfcc,fbank}
                            Mode
      --featlen FEATLEN     Features length
      -s, --seq2seq         set this flag to use seq2seq
      -wl WINLEN, --winlen WINLEN
                            specify the window length of feature
      -ws WINSTEP, --winstep WINSTEP
                            specify the window step length of feature
    

    The processed data will be saved in the "save" path.
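
    For example (the paths below are placeholders):

    python libri_preprocess.py -n train-clean-100 -m mfcc --featlen 13 \
        /path/to/LibriSpeech /path/to/save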

    To train the model, run the following:

    python main/libri_train.py -h 
    usage: libri_train.py [-h] [--task TASK] [--train_dataset TRAIN_DATASET]
                          [--dev_dataset DEV_DATASET]
                          [--test_dataset TEST_DATASET] [--mode MODE]
                          [--keep [KEEP]] [--nokeep] [--level LEVEL]
                          [--model MODEL] [--rnncell RNNCELL]
                          [--num_layer NUM_LAYER] [--activation ACTIVATION]
                          [--optimizer OPTIMIZER] [--batch_size BATCH_SIZE]
                          [--num_hidden NUM_HIDDEN] [--num_feature NUM_FEATURE]
                          [--num_classes NUM_CLASSES] [--num_epochs NUM_EPOCHS]
                          [--lr LR] [--dropout_prob DROPOUT_PROB]
                          [--grad_clip GRAD_CLIP] [--datadir DATADIR]
                          [--logdir LOGDIR]
    
    optional arguments:
      -h, --help            show this help message and exit
      --task TASK           set task name of this program
      --train_dataset TRAIN_DATASET
                            set the training dataset
      --dev_dataset DEV_DATASET
                            set the development dataset
      --test_dataset TEST_DATASET
                            set the test dataset
      --mode MODE           set whether to train, dev or test
      --keep [KEEP]         set whether to restore a model, when test mode, keep
                            should be set to True
      --nokeep
      --level LEVEL         set the task level, phn, cha, or seq2seq, seq2seq will
                            be supported soon
      --model MODEL         set the model to use, DBiRNN, BiRNN, ResNet..
      --rnncell RNNCELL     set the rnncell to use, rnn, gru, lstm...
      --num_layer NUM_LAYER
                            set the layers for rnn
      --activation ACTIVATION
                            set the activation to use, sigmoid, tanh, relu, elu...
      --optimizer OPTIMIZER
                            set the optimizer to use, sgd, adam...
      --batch_size BATCH_SIZE
                            set the batch size
      --num_hidden NUM_HIDDEN
                            set the hidden size of rnn cell
      --num_feature NUM_FEATURE
                            set the size of input feature
      --num_classes NUM_CLASSES
                            set the number of output classes
      --num_epochs NUM_EPOCHS
                            set the number of epochs
      --lr LR               set the learning rate
      --dropout_prob DROPOUT_PROB
                            set probability of dropout
      --grad_clip GRAD_CLIP
                            set the threshold of gradient clipping, -1 denotes no
                            clipping
      --datadir DATADIR     set the data root directory
      --logdir LOGDIR       set the log directory
    

    where the "datadir" is the "save" path used in preprocess stage.

    Wall Street Journal corpus

    TODO

    Core Features

    • dynamic RNN(GRU, LSTM)
    • Residual Network(Deep CNN)
    • CTC Decoding
    • TIMIT Phoneme Edit Distance(PER)

    Future Work

    •  Add Attention Mechanism
    •  Add more efficient dynamic computation graph without padding
    •  List experimental results
    •  Implement more ASR models following newest investigations
    •  Provide fast TensorFlow Input Pipeline

    License

    MIT

    Contact Us

    If this program is helpful to you, please give us a star or a fork to encourage us to keep updating it. Thank you! Any issues or pull requests are also appreciated.

    For any questions, feel free to send an email to zzw922cn@gmail.com.

    Collaborators:

    zzw922cn

    hiteshpaul

    xxxxyzt

