  • A Survey of Natural Language Processing

    2019-08-30 16:56:56
  • A survey of natural language processing for Central Asian languages
  • A Survey of Natural Language Processing. Weren't we all surprised at first when smart devices understood what we were telling them? And they answered in the most friendly manner too, didn't they? Like Apple’s Siri ...

    A Survey of Natural Language Processing

    Weren't we all surprised at first when smart devices understood what we were telling them? And they answered in the most friendly manner too, didn't they? Apple’s Siri and Amazon’s Alexa comprehend when we ask about the weather, ask for directions, or ask them to play a certain genre of music. Ever since then I have been wondering how these computers get our language. That long-overdue curiosity was rekindled, and I thought I would write a blog on this as a newbie.

    In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit, or NLTK, is one of the most powerful and probably the most popular natural language processing libraries. Not only does it offer one of the most comprehensive toolkits for Python-based text processing, it also supports a large number of human languages.
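
    If you want to follow along, NLTK has to be installed and a few of its data packages downloaded. A minimal setup sketch (the resource list simply mirrors what the later examples use; adjust as needed):

    # pip install nltk
    import nltk

    # Download the corpora and models used by the examples below
    for resource in ["punkt", "stopwords", "wordnet",
                     "averaged_perceptron_tagger", "universal_tagset"]:
        nltk.download(resource)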

    What is Natural Language Processing?

    Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.

    Why is sorting unstructured data so important?

    With every tick of the clock, the world generates an overwhelming amount of data, and the majority of it is unstructured. Formats such as text, audio, video, and images are classic examples of unstructured data. Unstructured data has no fixed dimensions or structure like the traditional rows and columns of a relational database, so it is harder to analyze and not easily searchable. That said, business organizations still need to find ways of addressing these challenges and embracing the opportunities to derive insights if they are to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.

    Are computers confused by our natural language?

    Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all convey information. There are countless ways of assembling words into a phrase, and words can carry many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself, for example, “oh, this is my open secret” or “can you please act naturally”. Although these sound pointedly foolish, we humans can understand and use them in everyday speech; for machines, the ambiguity and imprecision of natural language are the main hurdles.

    Most used NLP Libraries

    In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could be part of NLP projects. Now developers can use ready-made libraries that simplify text pre-processing, so they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis with only a few lines of code. The most popular NLP libraries are:

    Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architect, scikit-learn.

    The question is: where should we start, and how?

    Have you ever observed how kids start to understand and learn a language? Yes, by picking up individual words first and then sentence formations. Making computers understand our language works in more or less the same way.

    Pre-processing Steps:

    1. Sentence Tokenization
    2. Word Tokenization
    3. Text Lemmatization and Stemming
    4. Stop Words
    5. POS Tagging
    6. Chunking
    7. Wordnet
    8. Bag-of-Words
    9. TF-IDF

    1. Sentence Tokenization (Sentence Segmentation)

    To make computers understand natural language, the first step is to break paragraphs into sentences. Punctuation marks are an easy way to split the sentences apart.

    import nltk
    nltk.download('punkt')

    text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

    sentences = nltk.sent_tokenize(text)
    print("The number of sentences in the paragraph:", len(sentences))
    for sentence in sentences:
        print(sentence)

    OUTPUT:
    The number of sentences in the paragraph: 3
    Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland.
    However, the link between Home Farm and the senior team was severed in the late 1990s.
    The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area.

    2. Word Tokenization (Word Segmentation)

    By now we have the sentences separated, and the next step is to break these sentences into words, which are often called tokens.

    Just as creating space in one's own life helps for the good, the spaces between words help break a phrase apart into its words. We can treat punctuation marks as separate tokens as well, since punctuation has a purpose too.

    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        print("The number of words in a sentence:", len(words))
        print(words)

    OUTPUT:
    The number of words in a sentence: 32
    ['Home', 'Farm', 'is', 'one', 'of', 'the', 'biggest', 'junior', 'football', 'clubs', 'in', 'Ireland', 'and', 'their', 'senior', 'team', ',', 'from', '1970', 'up', 'to', 'the', 'late', '1990s', ',', 'played', 'in', 'the', 'League', 'of', 'Ireland', '.']
    The number of words in a sentence: 18
    ['However', ',', 'the', 'link', 'between', 'Home', 'Farm', 'and', 'the', 'senior', 'team', 'was', 'severed', 'in', 'the', 'late', '1990s', '.']
    The number of words in a sentence: 22
    ['The', 'senior', 'side', 'was', 'briefly', 'known', 'as', 'Home', 'Farm', 'Fingal', 'in', 'an', 'effort', 'to', 'identify', 'it', 'with', 'the', 'north', 'Dublin', 'area', '.']

    As a prerequisite for using the word_tokenize() or sent_tokenize() functions, the punkt package must be downloaded.

    3. Stemming and Text Lemmatization

    In every text document, we usually come across different forms of a word, such as write, writes, and writing, with similar meanings and the same base word. But how do we make a computer analyze such words? That is where text lemmatization and stemming come into the picture.

    Stemming and text lemmatization are normalization techniques built on the same idea: chopping the ends off a word to get down to its core form. While both aim to solve the same problem, they go about it in entirely different ways. Stemming is often a crude heuristic process, whereas lemmatization reduces a word to its vocabulary-based morphological base form. Let's take a closer look!

    Stemming: words are reduced to their stem. A word stem need not be the same as the dictionary-based morphological root (the smallest meaningful unit); it is simply an equal or shorter form of the word.

    from nltk.stem import PorterStemmer

    # create an object of class PorterStemmer
    porter = PorterStemmer()

    # A list of words to be stemmed
    word_list = ['running', ',', 'driving', 'sung', 'between', 'lasted', 'was', 'participated', 'before', 'severed', '1990s', '.']

    print("{0:20}{1:20}".format("Word", "Porter Stemmer"))
    for word in word_list:
        print("{0:20}{1:20}".format(word, porter.stem(word)))

    OUTPUT:
    Word                Porter Stemmer
    running             run
    ,                   ,
    driving             drive
    sung                sung
    between             between
    lasted              last
    was                 wa
    participated        particip
    before              befor
    severed             sever
    1990s               1990
    .                   .

    Stemming is not as easy as it looks: we might run into two issues, under-stemming and over-stemming of a word.
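
    To make those two failure modes concrete, here is a small illustrative sketch (the example words are mine, not from the original article): the Porter stemmer collapses the unrelated words "university" and "universe" into the same stem (over-stemming), while the related forms "data" and "datum" keep different stems (under-stemming).

    from nltk.stem import PorterStemmer

    porter = PorterStemmer()

    # Over-stemming: two unrelated words end up with the same stem
    print(porter.stem("university"), porter.stem("universe"))  # univers univers

    # Under-stemming: two related forms keep different stems
    print(porter.stem("data"), porter.stem("datum"))           # data datum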

    Lemmatization: where stemming is a best-estimate method that snips a word based on how it appears, lemmatization prunes the word in a far more planned way, resolving it against a dictionary. Indeed, a word's lemma is its dictionary or canonical form.

    nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer

    wordnet_lemmatizer = WordNetLemmatizer()

    # A list of words to lemmatize
    word_list = ['running', ',', 'drives', 'sung', 'between', 'lasted', 'was', 'participated', 'before', 'severed', '1990s', '.']

    print("{0:20}{1:20}".format("Word", "Lemma"))
    for word in word_list:
        print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))

    OUTPUT:
    Word                Lemma
    running             running
    ,                   ,
    drives              drive
    sung                sung
    between             between
    lasted              lasted
    was                 wa
    participated        participated
    before              before
    severed             severed
    1990s               1990s
    .                   .
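
    By default, WordNetLemmatizer treats every word as a noun, which is why "was" comes back as "wa" above. Passing the part of speech usually gives a more sensible lemma; a quick sketch:

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # Lemmatize as verbs (pos='v') instead of the default noun
    print(lemmatizer.lemmatize("was", pos="v"))      # be
    print(lemmatizer.lemmatize("running", pos="v"))  # run
    print(lemmatizer.lemmatize("lasted", pos="v"))   # last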

    If speed is needed, then resorting to stemming is better. But it’s better to use lemmatization when accuracy is needed.

    4. Stop Words

    Words such as 'in', 'at', 'on', and 'so' are considered stop words. Stop words don't play an important role in NLP on their own, but removing them plays an important part when preparing text for tasks such as sentiment analysis.

    NLTK ships with stop word lists for 16 different languages.

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    stop_words = set(stopwords.words('english'))
    print("The stop words in NLTK lib are:", stop_words)

    para = """Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."""

    tokenized_para = word_tokenize(para)
    modified_token_list = [word for word in tokenized_para if not word in stop_words]
    print("After removing the stop words in the sentence:")
    print(modified_token_list)

    OUTPUT:
    The stop words in NLTK lib are: {'about', 'ma', "shouldn't", 's', 'does', 't', 'our', 'mightn', 'doing', 'while', 'ourselves', 'themselves', 'will', 'some', 'you', "aren't", 'by', "needn't", 'in', 'can', 'he', 'into', 'as', 'being', 'between', 'very', 'after', 'couldn', 'himself', 'herself', 'had', 'its', 've', 'him', 'll', "isn't", 'through', 'should', 'was', 'now', 'them', "you'll", 'again', 'who', 'don', 'been', 'they', 'weren', "you're", 'both', 'd', 'me', 'didn', "won't", "you'd", 'only', 'itself', 'hadn', "should've", 'than', 'how', 'few', 're', 'down', 'these', 'y', "haven't", "mightn't", 'won', "hadn't", 'other', 'above', 'all', "doesn't", 'isn', "that'll", 'not', 'yourselves', 'at', 'mustn', "it's", 'on', 'the', 'for', "didn't", 'what', "mustn't", 'his', 'haven', 'doesn', "you've", 'are', 'out', 'hers', 'with', 'has', 'she', 'most', 'ain', 'those', 'when', 'myself', 'before', 'their', 'during', 'there', 'or', 'until', 'that', 'more', "hasn't", 'o', 'we', 'and', "shan't", 'which', 'because', "don't", 'why', 'shan', 'an', 'my', 'if', 'did', 'having', "couldn't", 'your', 'theirs', 'aren', 'just', 'further', 'here', 'of', "wouldn't", 'be', 'too', 'her', 'no', 'same', 'it', 'is', 'were', 'yourself', 'have', 'off', 'this', 'needn', 'once', "wasn't", 'against', 'wouldn', 'up', 'a', 'i', 'below', "weren't", 'over', 'own', 'then', 'so', 'do', 'from', 'shouldn', 'am', 'under', 'any', 'yours', 'ours', 'hasn', 'such', 'nor', 'wasn', 'to', 'where', 'm', "she's", 'each', 'whom', 'but'}
    After removing the stop words in the sentence:
    ['Home', 'Farm', 'one', 'biggest', 'junior', 'football', 'clubs', 'Ireland', 'senior', 'team', ',', '1970', 'late', '1990s', ',', 'played', 'League', 'Ireland', '.', 'However', ',', 'link', 'Home', 'Farm', 'senior', 'team', 'severed', 'late', '1990s', '.', 'The', 'senior', 'side', 'briefly', 'known', 'Home', 'Farm', 'Fingal', 'effort', 'identify', 'north', 'Dublin', 'area', '.']

    5. POS Tagging

    Down memory lane to our early English grammar classes: do we all remember how our teachers used to give instruction on the basic parts of speech for effective communication? Good old days! Let's teach the parts of speech to our computers too. :)

    The eight parts of speech are nouns, verbs, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections.

    POS tagging is the ability to identify the parts of speech and assign them to the words in a sentence. There are different tag sets available, but we will be using the universal tagset.

    nltk.download('averaged_perceptron_tagger')
    nltk.download('universal_tagset')

    pos_tags = [nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal") for sentence in sentences]
    print(pos_tags)

    OUTPUT (first sentence shown):
    [[('Home', 'NOUN'), ('Farm', 'NOUN'), ('is', 'VERB'), ('one', 'NUM'), ('of', 'ADP'), ('the', 'DET'), ('biggest', 'ADJ'), ('junior', 'NOUN'), ('football', 'NOUN'), ('clubs', 'NOUN'), ('in', 'ADP'), ('Ireland', 'NOUN'), ('and', 'CONJ'), ('their', 'PRON'), ('senior', 'ADJ'), ('team', 'NOUN'), (',', '.'), ('from', 'ADP'), ('1970', 'NUM'), ('up', 'ADP'), ('to', 'PRT'), ('the', 'DET'), ('late', 'ADJ'), ('1990s', 'NUM'), (',', '.'), ('played', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('League', 'NOUN'), ('of', 'ADP'), ('Ireland', 'NOUN'), ('.', '.')]

    One application of POS tagging is analyzing the qualities of a product from feedback: by sorting out the adjectives in customers' reviews, we can evaluate the sentiment of the feedback. Say, for example, "How was your shopping with us?"
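
    As a small illustration of that idea (the review sentence below is made up for this sketch, not taken from real feedback), we can keep only the tokens tagged as adjectives:

    import nltk

    review = "The delivery was quick and the packaging was excellent, but the product felt cheap."
    tags = nltk.pos_tag(nltk.word_tokenize(review), tagset="universal")

    # Keep only the adjectives; these carry most of the sentiment
    adjectives = [word for word, tag in tags if tag == "ADJ"]
    print(adjectives)  # expected: ['quick', 'excellent', 'cheap']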

    6. Chunking

    Chunking is used to add more structure to a sentence by grouping words according to their parts of speech (POS); it is also known as shallow parsing. The resulting word groups are called "chunks." There are no predefined rules for performing chunking.

    Phrase structure conventions:

    • S (Sentence) → NP VP.
    • NP → {Determiner, Noun, Pronoun, Proper name}.
    • VP → V (NP) (PP) (Adverb).
    • PP → Preposition (NP).
    • AP → Adjective (PP).

    I never had a good time with complex regular expressions; I used to stay away from them as far as I could, but of late I have realized how important it is to have a grip on regular expressions in data science. Let's start with a simple instance.

    Suppose we need to tag the nouns, verbs (past tense), adjectives, and coordinating conjunctions in a sentence. We can use the rule below:

    chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

    import nltk
    from nltk import pos_tag, RegexpParser
    from nltk.tokenize import word_tokenize

    content = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

    tokenized_text = nltk.word_tokenize(content)
    print("After Split:", tokenized_text)

    tokens_tag = pos_tag(tokenized_text)
    print("After Token:", tokens_tag)

    patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
    chunker = RegexpParser(patterns)
    print("After Regex:", chunker)

    output = chunker.parse(tokens_tag)
    print("After Chunking:", output)

    OUTPUT:
    After Regex: chunk.RegexpParser with 1 stages:
    RegexpChunkParser with 1 rules:
    <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
    After Chunking:
    (S (mychunk Home/NN Farm/NN) is/VBZ one/CD of/IN the/DT
    (mychunk biggest/JJS)
    (mychunk junior/NN football/NN clubs/NNS) in/IN
    (mychunk Ireland/NNP and/CC) their/PRP$
    (mychunk senior/JJ)
    (mychunk team/NN) ,/, from/IN 1970/CD up/IN to/TO the/DT (mychunk late/JJ) 1990s/CD ,/, played/VBN in/IN the/DT (mychunk League/NNP) of/IN (mychunk Ireland/NNP) ./.)
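
    If we want to pull the chunks out of the resulting tree instead of just printing it, nltk.Tree lets us filter subtrees by label. A small sketch continuing from the output object above:

    # Each chunk is a subtree labelled with the rule name ("mychunk")
    for subtree in output.subtrees(filter=lambda t: t.label() == "mychunk"):
        print(" ".join(word for word, tag in subtree.leaves()))
        # e.g. "Home Farm", "biggest", "junior football clubs", ...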

    7. Wordnet

    WordNet is an NLTK corpus reader, a lexical database for English. It can be used to find the synonyms and antonyms of a word.

    from nltk.corpus import wordnet

    synonyms = []
    antonyms = []

    for syn in wordnet.synsets("active"):
        for lemmas in syn.lemmas():
            synonyms.append(lemmas.name())

    for syn in wordnet.synsets("active"):
        for lemmas in syn.lemmas():
            if lemmas.antonyms():
                antonyms.append(lemmas.antonyms()[0].name())

    print("Synonyms are:", synonyms)
    print("Antonyms are:", antonyms)

    OUTPUT:
    Synonyms are: ['active_agent', 'active', 'active_voice', 'active', 'active', 'active', 'active', 'combat-ready', 'fighting', 'active', 'active', 'participating', 'active', 'active', 'active', 'active', 'alive', 'active', 'active', 'active', 'dynamic', 'active', 'active', 'active']
    Antonyms are: ['passive_voice', 'inactive', 'passive', 'inactive', 'inactive', 'inactive', 'quiet', 'passive', 'stative', 'extinct', 'dormant', 'inactive']

    8. Bag of Words

    A bag-of-words model turns the raw text into individual words and counts how frequently each word occurs in the text.

    import nltk
    import re  # to match regular expressions
    import numpy as np

    text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

    sentences = nltk.sent_tokenize(text)
    for i in range(len(sentences)):
        sentences[i] = sentences[i].lower()
        sentences[i] = re.sub(r'\W', ' ', sentences[i])
        sentences[i] = re.sub(r'\s+', ' ', sentences[i])

    bag_of_words = {}
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        for word in words:
            if word not in bag_of_words.keys():
                bag_of_words[word] = 1
            else:
                bag_of_words[word] += 1

    print(bag_of_words)

    OUTPUT:
    {'home': 3, 'farm': 3, 'is': 1, 'one': 1, 'of': 2, 'the': 8, 'biggest': 1, 'junior': 1, 'football': 1, 'clubs': 1, 'in': 4, 'ireland': 2, 'and': 2, 'their': 1, 'senior': 3, 'team': 2, 'from': 1, '1970': 1, 'up': 1, 'to': 2, 'late': 2, '1990s': 2, 'played': 1, 'league': 1, 'however': 1, 'link': 1, 'between': 1, 'was': 2, 'severed': 1, 'side': 1, 'briefly': 1, 'known': 1, 'as': 1, 'fingal': 1, 'an': 1, 'effort': 1, 'identify': 1, 'it': 1, 'with': 1, 'north': 1, 'dublin': 1, 'area': 1}

    9. TF-IDF

    TF-IDF stands for Term Frequency — Inverse document frequency.

    Text data needs to be converted into a numerical format in which each word is represented in matrix form. The simplest encoding of a given word is a one-hot vector, in which the corresponding element is set to one and all other elements are zero; TF-IDF goes a step further and weights each word by how informative it is across the documents. Such fixed-length vector representations of text are sometimes loosely grouped with word embeddings, although TF-IDF is a count-based weighting rather than a learned embedding.

    TF-IDF works on two concepts:

    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

    IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

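    A tiny worked example of these two formulas in plain Python (the three one-line "documents" are invented for illustration):

    import math

    docs = ["home farm is a junior football club",
            "the senior team played in the league",
            "home farm and the senior team"]

    term = "home"
    first_doc = docs[0].split()

    # TF(t) = (times t appears in the document) / (total terms in the document)
    tf = first_doc.count(term) / len(first_doc)          # 1 / 7

    # IDF(t) = log_e(total documents / documents containing t)
    n_containing = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / n_containing)             # log(3 / 2)

    print(round(tf * idf, 4))  # 0.0579

    The scikit-learn code below computes the same kind of scores for all terms at once (with smoothed IDF, so the exact numbers differ).
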
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    docs = ["Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland",
            "However, the link between Home Farm and the senior team was severed in the late 1990s",
            "The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area"]

    # instantiate CountVectorizer()
    cv = CountVectorizer()

    # this step generates word counts for the words in your docs
    word_count_vector = cv.fit_transform(docs)

    tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
    tfidf_transformer.fit(word_count_vector)

    # print idf values
    df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["idf_weights"])

    # sort ascending
    df_idf.sort_values(by=['idf_weights'])

    # count matrix
    count_vector = cv.transform(docs)

    # tf-idf scores
    tf_idf_vector = tfidf_transformer.transform(count_vector)

    feature_names = cv.get_feature_names()

    # get the tf-idf vector for the first document
    first_document_vector = tf_idf_vector[0]

    # print the scores
    df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
    df.sort_values(by=["tfidf"], ascending=False)

    OUTPUT:
                  tfidf
    of         0.374810
    ireland    0.374810
    the        0.332054
    in         0.221369
    1970       0.187405
    football   0.187405
    up         0.187405
    as         0.000000
    an         0.000000
    ... and so on

    What are these scores telling us? The more common a word is across the documents, the lower its score; the more unique a word is, the higher its score will be.

    So far, we have learned the steps for cleaning and preprocessing text. What can we do with the processed data after all this? We could use it for sentiment analysis, chatbots, or market intelligence, or perhaps build a recommender system based on user purchases or item reviews, or perform customer segmentation with clustering.

    Computers are still not as accurate with human language as they are with numbers. With the massive amount of text data generated every day, NLP is becoming ever more significant for making sense of that data, and it is being used in many other applications. Hence there are endless ways to explore NLP.

    Translated from: https://medium.com/analytics-vidhya/natural-language-processing-bedb2e1c8ceb

  • 【JMBook】Daniel Jurafsky and James H. Martin, 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech ... The Chinese edition of this natural language processing survey.
  • Natural Language Processing Survey (Chinese Edition). Critical fixes in HttpClient 4.0.3: an emergency release of HttpClient is now available. HttpClient 4.0.3 fixes a critical regression in the SSL connection management code and introduces improved support for multi-homed hosts over version 4.0.2. Upstream projects are encouraged to ...

    Natural Language Processing Survey (Chinese Edition)

    Critical fixes in HttpClient 4.0.3

    An emergency release of HttpClient is now available.

    HttpClient 4.0.3 fixes a critical regression in the SSL connection management code and introduces improved support for multi-homed hosts over version 4.0.2. Upstream projects are encouraged to upgrade.

    HttpClient aims to provide a package implementing the client side of the most recent Hypertext Transfer Protocol (HTTP) standards and recommendations. It also provides support for the base HTTP protocol. Note that HttpClient 4.0 currently offers only limited support for NTLM authentication.

    See the release notes for full details.

    Apache Chemistry OpenCMIS 0.1.0 for CMIS

    Version 0.1.0 of Apache Chemistry OpenCMIS is now available.

    Apache Chemistry OpenCMIS aims to simplify Content Management Interoperability Services (CMIS) for Java client and server developers by providing APIs, Java libraries, a framework, SPIs, and test tools, so that developers can focus on the ECM domain model.

    See the release notes for more information.

    A second release of Apache Vysper

    The Apache MINA project has announced the second release of Apache Vysper: version 0.6.

    Vysper is a modular XMPP server written in Java. This release introduces a complete implementation of BOSH and improves the multi-user chat implementation. A pubsub demo has been added to the Maven build, and the nbxml-sax branch has been merged into trunk.

    Translated from: https://jaxenter.com/daily-roundup-apache-edition-102254.html

  • The third edition of Speech and Language Processing, the classic survey textbook of the natural language processing field (its Chinese title is 《自然语言处理综述》), has been released. The book is written by leading figures in NLP: Professor Daniel Jurafsky of Stanford University and Professor James H. Martin of the University of Colorado, together with ...

    The third edition of Speech and Language Processing, the classic survey textbook of the natural language processing field (its Chinese title is 《自然语言处理综述》), has been released. The book is co-written by leading figures in NLP: Professor Daniel Jurafsky of Stanford University and Professor James H. Martin of the University of Colorado. Daniel Jurafsky is a professor of computer science at Stanford whose main research areas are computational linguistics and natural language processing. James H. Martin is a professor in the Department of Computer Science at the University of Colorado Boulder; both are well-known scholars in the NLP field.

     

    Latest PDF: https://mp.weixin.qq.com/s?__biz=MzIxNDgzNDg3NQ==&mid=2247488070&idx=1&sn=f56cf2433b69040eb3e8d28a4aecc74d&chksm=97a0d992a0d750842ade299f263b23c30b8228825b9a5f7d63c9fe449967462868629131e4c1&token=974303003&lang=zh_CN#rd

     

    Compared with previous versions, the latest draft of Speech and Language Processing contains substantial changes:

    1. New content, including new Chapters 10, 22, 23, and 27.

    2. Extensive revisions to Chapters 9, 19, and 26, plus updates throughout all the other chapters.

    3. Many typos fixed in response to reader suggestions.

    4. Updated slides for some chapters.

     

    When will the whole book be finished? Probably sometime after the summer of 2020, since most of the writing happens over the summer and fall.

     

    Table of contents

    Excerpts from the chapters

     

     

     

     

    PDF download: https://mp.weixin.qq.com/s?__biz=MzIxNDgzNDg3NQ==&mid=2247488070&idx=1&sn=f56cf2433b69040eb3e8d28a4aecc74d&chksm=97a0d992a0d750842ade299f263b23c30b8228825b9a5f7d63c9fe449967462868629131e4c1&token=974303003&lang=zh_CN#rd

     


  • This paper describes in considerable detail the state of research on natural language processing built on deep learning; it is a good survey-style article and well worth consulting.
  • 1. The basic scope of natural language processing. Language is the vehicle of thought and the most natural, direct, and convenient tool humans have for exchanging ideas and expressing emotions. More than 80% of the knowledge recorded and handed down in human history is in the form of written language, and 87.8% of the web page content on the Chinese internet is represented as text...
  • Speech and Language Processing An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition Third Edition draft
  • Sometimes it feels funny: every day I say I work on natural language processing, yet what natural language processing actually does I had never really figured out; my understanding was not thorough and the whole background was still missing, so now I am filling in this part of my knowledge. I am reading the lecture notes for Professor Zong Chengqing's book. What we call ...
  • This is part of a series of introductions to natural language processing. The article does not involve formula derivations; it is mainly an informal record of algorithmic ideas. Intended audience: beginners in natural language processing and developers moving into AI. Programming language: Python. Reference books: 《数学之美》 (The Beauty of Mathematics), 《统计自然语言处理》 (Statistical Natural Language Processing) by Zong Chengqing ...
  • In recent years, a wide variety of model designs and methods have flourished in the context of natural language processing (NLP). In this survey, we review the important deep-learning-related models and methods used in many NLP tasks, covering sequence generation methods, neural machine translation, dialogue ...
  • Red Stone previously put together an article on a roadmap for getting started with machine learning: 【干货】我的机器学习入门路线图. So are there good learning resources for the deep learning side of natural language processing (NLP)? The ones we know well ...
  • Text Normalization: any NLP model first requires text normalization, which includes at least the following three parts: (1) segmenting/tokenizing words from running text, (2) normalizing word ...
  • Contents: 9.0 Preface; 9.1 Simple recurrent neural networks. 9.0 Preface: language is a temporal phenomenon. When we understand and produce speech, we are processing a continuous input stream of indefinite length. ... These temporal properties are also reflected in the algorithms used for language processing. When applied to part-of-speech ...
  • Neural networks are the core computational tool in language processing, and they appeared very early. The name "neural" originally comes from the McCulloch-Pitts neuron (1943), a simplified model of the human neuron that can be understood as a computing unit in propositional logic. In today's language processing, however, we no longer ...
  • Here we introduce another grammatical formalism, dependency grammar, which is very important in contemporary language processing. In this kind of formalism, phrasal constituents and phrase-structure rules play no direct role. Instead, the syntactic structure of a sentence is expressed through the words (or lemmas) of the sentence and ...
  • How do we build a computational model that can effectively handle the various aspects of word meaning mentioned above, such as word similarity, word relatedness, semantic fields, semantic frames, and connotation? The best current model is vector semantics, which draws its inspiration from linguistics and philosophy of the 1950s. Wittgenstein ...
  • Logistic regression is used to discover the relationship between features and outputs. It is one of the most important analysis tools in both the social and natural sciences; it is used for classification problems in supervised machine learning and is closely related to neural networks, since a neural network can be viewed as a stack of logistic regression classifiers. ...
  • Contents: 10.0 Preface; 10.1 Neural language models and generation revisited. 10.0 Preface: It is all well and good to copy what one sees, but it is much better to draw only what remains in one's memory. This is a transformation in which ...
  • Natural language processing means using computers to process the form, sound, and meaning of natural language, that is, the input, output, recognition, analysis, understanding, and generation of characters, words, sentences, and full texts. Achieving information exchange between humans and machines is a goal pursued jointly by the artificial intelligence, computer science, and linguistics communities ...
  • Written by Harbin Institute of Technology and iFLYTEK, this survey of international frontier developments in natural language processing presents the latest NLP research trends and academic results, offering a glimpse of new changes in the field.
  • NLP (Natural Language Processing) is an interdisciplinary field of computer science, artificial intelligence, and linguistics whose goal is to let computers process or "understand" natural language. A natural language is usually one that has evolved naturally with a culture, such as Chinese, English, or Japanese. NLP ...
  • A survey of language models in natural language processing

    Read 1,000+ times  2016-04-15 10:35:13
    Grammar-based language models are hand-crafted linguistic grammars; their rules come from the linguistic and domain knowledge of linguists, but such language models cannot handle large-scale real text. 1 Statistical language models: 1) no history: the unigram model; 2) the most recent history: the bigram model ...
  • Getting started with NLP, survey reading: "A Survey of Natural Language Processing Research Based on Deep Learning". Abstract; 0 Introduction; 1 Overview of deep learning: convolutional neural networks, recurrent neural networks; 2 Progress in NLP application research; 3 Pre-trained language models: BERT, XLNet, ERNIE; 4 Conclusion and personal summary ...
  • A survey of deep learning applications in natural language processing. Zhihu: what is the fastest way to get started with natural language processing? From linguistics to deep learning NLP: an overview of natural language processing in one article. http://www.360doc.com/content/17/1114/13/5315_703729214.shtml
