  • Word-frequency counting for English text, which can be split on whitespace (Chinese text needs a word-segmentation library). The example text is one of Obama's speeches.

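    This walkthrough targets English text, because English uses spaces to separate words. Chinese text needs a word-segmentation library instead.

    Below, one of Obama's speeches is used as the text for the word-frequency count.

    1. The text to analyze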

    speech_etxt = '''

    My fellow citizens: I stand here today humbled by the task before us, grateful for the trust you've bestowed, mindful of the sacrifices borne by our ancestors.

    I thank President Bush for his service to our nation -- (applause) -- as well as the generosity and cooperation he has shown throughout this transition.

    Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often, the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because we, the people, have remained faithful to the ideals of our forebears and true to our founding documents.

    So it has been; so it must be with this generation of Americans.

    That we are in the midst of crisis is now well understood. Our nation is at war against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost, jobs shed, businesses shuttered. Our health care is too costly, our schools fail too many -- and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.

    These are the indicators of crisis, subject to data and statistics. Less measurable, but no less profound, is a sapping of confidence across our land; a nagging fear that America's decline is inevitable, that the next generation must lower its sights.

    Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know this America: They will be met. (Applause.)

    On this day, we gather because we have chosen hope over fear, unity of purpose over conflict and discord. On this day, we come to proclaim an end to the petty grievances and false promises, the recriminations and worn-out dogmas that for far too long have strangled our politics. We remain a young nation. But in the words of Scripture, the time has come to set aside childish things. The time has come to reaffirm our enduring spirit; to choose our better history; to carry forward that precious gift, that noble idea passed on from generation to generation: the God-given promise that all are equal, all are free, and all deserve a chance to pursue their full measure of happiness. (Applause.)

    In reaffirming the greatness of our nation we understand that greatness is never a given. It must be earned. Our journey has never been one of short-cuts or settling for less. It has not been the path for the faint-hearted, for those that prefer leisure over work, or seek only the pleasures of riches and fame. Rather, it has been the risk-takers, the doers, the makers of things -- some celebrated, but more often men and women obscure in their labor -- who have carried us up the long rugged path towards prosperity and freedom.

    For us, they packed up their few worldly possessions and traveled across oceans in search of a new life. For us, they toiled in sweatshops, and settled the West, endured the lash of the whip, and plowed the hard earth. For us, they fought and died in places like Concord and Gettysburg, Normandy and Khe Sahn.

    Time and again these men and women struggled and sacrificed and worked till their hands were raw so that we might live a better life. They saw America as bigger than the sum of our individual ambitions, greater than all the differences of birth or wealth or faction.

    This is the journey we continue today. We remain the most prosperous, powerful nation on Earth. Our workers are no less productive than when this crisis began. Our minds are no less inventive, our goods and services no less needed than they were last week, or last month, or last year. Our capacity remains undiminished. But our time of standing pat, of protecting narrow interests and putting off unpleasant decisions -- that time has surely passed. Starting today, we must pick ourselves up, dust ourselves off, and begin again the work of remaking America. (Applause.)

    For everywhere we look, there is work to be done. The state of our economy calls for action, bold and swift. And we will act, not only to create new jobs, but to lay a new foundation for growth. We will build the roads and bridges, the electric grids and digital lines that feed our commerce and bind us together. We'll restore science to its rightful place, and wield technology's wonders to raise health care's quality and lower its cost. We will harness the sun and the winds and the soil to fuel our cars and run our factories. And we will transform our schools and colleges and universities to meet the demands of a new age. All this we can do. All this we will do.

    Now, there are some who question the scale of our ambitions, who suggest that our system cannot tolerate too many big plans. Their memories are short, for they have forgotten what this country has already done, what free men and women can achieve when imagination is joined to common purpose, and necessity to courage. What the cynics fail to understand is that the ground has shifted beneath them, that the stale political arguments that have consumed us for so long no longer apply.

    The question we ask today is not whether our government is too big or too small, but whether it works -- whether it helps families find jobs at a decent wage, care they can afford, a retirement that is dignified. Where the answer is yes, we intend to move forward. Where the answer is no, programs will end. And those of us who manage the public's dollars will be held to account, to spend wisely, reform bad habits, and do our business in the light of day, because only then can we restore the vital trust between a people and their government.

    Nor is the question before us whether the market is a force for good or ill. Its power to generate wealth and expand freedom is unmatched. But this crisis has reminded us that without a watchful eye, the market can spin out of control. The nation cannot prosper long when it favors only the prosperous. The success of our economy has always depended not just on the size of our gross domestic product, but on the reach of our prosperity, on the ability to extend opportunity to every willing heart -- not out of charity, but because it is the surest route to our common good. (Applause.)

    As for our common defense, we reject as false the choice between our safety and our ideals. Our Founding Fathers -- (applause) -- our Founding Fathers, faced with perils that we can scarcely imagine, drafted a charter to assure the rule of law and the rights of man -- a charter expanded by the blood of generations. Those ideals still light the world, and we will not give them up for expedience sake. (Applause.)

    And so, to all the other peoples and governments who are watching today, from the grandest capitals to the small village where my father was born, know that America is a friend of each nation, and every man, woman and child who seeks a future of peace and dignity. And we are ready to lead once more. (Applause.)

    Recall that earlier generations faced down fascism and communism not just with missiles and tanks, but with the sturdy alliances and enduring convictions. They understood that our power alone cannot protect us, nor does it entitle us to do as we please. Instead they knew that our power grows through its prudent use; our security emanates from the justness of our cause, the force of our example, the tempering qualities of humility and restraint.

    We are the keepers of this legacy. Guided by these principles once more we can meet those new threats that demand even greater effort, even greater cooperation and understanding between nations. We will begin to responsibly leave Iraq to its people and forge a hard-earned peace in Afghanistan. With old friends and former foes, we'll work tirelessly to lessen the nuclear threat, and roll back the specter of a warming planet.

    We will not apologize for our way of life, nor will we waver in its defense. And for those who seek to advance their aims by inducing terror and slaughtering innocents, we say to you now that our spirit is stronger and cannot be broken -- you cannot outlast us, and we will defeat you. (Applause.)

    For we know that our patchwork heritage is a strength, not a weakness. We are a nation of Christians and Muslims, Jews and Hindus, and non-believers. We are shaped by every language and culture, drawn from every end of this Earth; and because we have tasted the bitter swill of civil war and segregation, and emerged from that dark chapter stronger and more united, we cannot help but believe that the old hatreds shall someday pass; that the lines of tribe shall soon dissolve; that as the world grows smaller, our common humanity shall reveal itself; and that America must play its role in ushering in a new era of peace.

    To the Muslim world, we seek a new way forward, based on mutual interest and mutual respect. To those leaders around the globe who seek to sow conflict, or blame their society's ills on the West, know that your people will judge you on what you can build, not what you destroy. (Applause.)

    To those who cling to power through corruption and deceit and the silencing of dissent, know that you are on the wrong side of history, but that we will extend a hand if you are willing to unclench your fist. (Applause.)

    To the people of poor nations, we pledge to work alongside you to make your farms flourish and let clean waters flow; to nourish starved bodies and feed hungry minds. And to those nations like ours that enjoy relative plenty, we say we can no longer afford indifference to the suffering outside our borders, nor can we consume the world's resources without regard to effect. For the world has changed, and we must change with it.

    As we consider the role that unfolds before us, we remember with humble gratitude those brave Americans who at this very hour patrol far-off deserts and distant mountains. They have something to tell us, just as the fallen heroes who lie in Arlington whisper through the ages.

    We honor them not only because they are the guardians of our liberty, but because they embody the spirit of service -- a willingness to find meaning in something greater than themselves.

    And yet at this moment, a moment that will define a generation, it is precisely this spirit that must inhabit us all. For as much as government can do, and must do, it is ultimately the faith and determination of the American people upon which this nation relies. It is the kindness to take in a stranger when the levees break, the selflessness of workers who would rather cut their hours than see a friend lose their job which sees us through our darkest hours. It is the firefighter's courage to storm a stairway filled with smoke, but also a parent's willingness to nurture a child that finally decides our fate.

    Our challenges may be new. The instruments with which we meet them may be new. But those values upon which our success depends -- honesty and hard work, courage and fair play, tolerance and curiosity, loyalty and patriotism -- these things are old. These things are true. They have been the quiet force of progress throughout our history.

    What is demanded, then, is a return to these truths. What is required of us now is a new era of responsibility -- a recognition on the part of every American that we have duties to ourselves, our nation and the world; duties that we do not grudgingly accept, but rather seize gladly, firm in the knowledge that there is nothing so satisfying to the spirit, so defining of our character than giving our all to a difficult task.

    This is the price and the promise of citizenship. This is the source of our confidence -- the knowledge that God calls on us to shape an uncertain destiny. This is the meaning of our liberty and our creed, why men and women and children of every race and every faith can join in celebration across this magnificent mall; and why a man whose father less than 60 years ago might not have been served in a local restaurant can now stand before you to take a most sacred oath. (Applause.)

    So let us mark this day with remembrance of who we are and how far we have traveled. In the year of America's birth, in the coldest of months, a small band of patriots huddled by dying campfires on the shores of an icy river. The capital was abandoned. The enemy was advancing. The snow was stained with blood. At the moment when the outcome of our revolution was most in doubt, the father of our nation ordered these words to be read to the people:

    "Let it be told to the future world...that in the depth of winter, when nothing but hope and virtue could survive... that the city and the country, alarmed at one common danger, came forth to meet [it]."

    America: In the face of our common dangers, in this winter of our hardship, let us remember these timeless words. With hope and virtue, let us brave once more the icy currents, and endure what storms may come. Let it be said by our children's children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God's grace upon us, we carried forth that great gift of freedom and delivered it safely to future generations.

    Thank you. God bless you. And God bless the United States of America. (Applause.)

    '''

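    2. Split the text on whitespace

    # The text is English, so "The" and "the" would otherwise be counted as two different words.
    # Convert everything to lowercase before splitting.
    speech = speech_etxt.lower().split()
    print(speech)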
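Splitting on whitespace leaves punctuation attached to the words, so `us,` and `us` end up as separate entries. The sketch below shows one possible way to handle this, by stripping leading and trailing punctuation with the standard `string.punctuation` constant. The helper name `tokenize` and the sample sentence are illustrative and not part of the original post.

```python
import string

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)  # "us," -> "us", "(applause.)" -> "applause"
        if word:  # skip tokens that were pure punctuation, such as "--"
            tokens.append(word)
    return tokens

sample = "For us, they fought -- for us, they toiled."
print(tokenize(sample))
```

Only the edges of each token are stripped, so contractions such as `you've` stay intact. Passing `tokenize(speech_etxt)` instead of `speech_etxt.lower().split()` to the later steps would be one way to use it.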

    3. Count word frequencies with a dictionary

    # Use a dictionary that maps each word to its count
    dic = {}
    for word in speech:
        if word not in dic:
            dic[word] = 1
        else:
            dic[word] = dic[word] + 1

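    4. Sort the word counts with operator

    import operator

    # dic.items() yields (key, value) tuples: the first element is the word, the second is its count.
    # key=operator.itemgetter(1) sorts by the count.
    # Without the key argument, the tuples are compared by the word first.
    # reverse=True sorts in descending order.
    swd = sorted(dic.items(), key=operator.itemgetter(1), reverse=True)
    print(swd)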
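To make the sorting step concrete, here is a small sketch on a hand-made dictionary (the sample counts are invented for illustration). It also shows one possible way to break ties alphabetically, which the original code does not do.

```python
import operator

sample_counts = {"we": 3, "our": 3, "nation": 2, "hope": 1}

# Sort by count, highest first. Words with equal counts keep their
# insertion order because Python's sort is stable.
by_count = sorted(sample_counts.items(), key=operator.itemgetter(1), reverse=True)

# Sort by count descending, then alphabetically by word for ties.
by_count_then_word = sorted(sample_counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_count)
print(by_count_then_word)
```

Negating the count in the tuple key gives a descending order on counts while keeping the ascending order on words, which a single `reverse=True` could not do.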

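    5. The words at the top of the list are mostly stop words, which carry little meaning for the analysis, so they should be removed before going further.

    # Stop words dominate the result and distort it.
    # NLTK ships a basic stop-word list; run nltk.download('stopwords') once if the corpus is not installed.
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')  # the corpus file name is lowercase
    for k, v in swd:
        if k not in stop_words:
            print(k, v)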

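    The steps above rely only on Python's built-in data structures. The same result can be obtained with the standard library.

    Counting word frequencies with Counter

    1. Import the class, remove the stop words as before, and take the 10 most frequent words with their counts

    from collections import Counter

    wd = Counter(speech)
    # wd.most_common(10)

    # Remove the stop words; deleting a key that is absent from a Counter does not raise an error
    for sw in stop_words:
        del wd[sw]

    wd.most_common(10)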
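As a self-contained variant of the Counter approach, the sketch below filters stop words while counting instead of deleting them afterwards. The tiny `STOP_WORDS` set, the helper name `top_words` and the sample sentence are placeholders chosen for illustration; in practice the NLTK list from step 5 would take the place of the hand-written set.

```python
from collections import Counter

# Illustrative stop-word set; a real run would use a fuller list such as NLTK's.
STOP_WORDS = {"the", "of", "and", "to", "we", "our", "is", "a"}

def top_words(words, n=10):
    """Count the words that are not stop words and return the n most common."""
    counter = Counter(w for w in words if w not in STOP_WORDS)
    return counter.most_common(n)

sample = "we remain a young nation and the nation is our hope".split()
print(top_words(sample, 3))
```

Using a set for the stop words keeps each membership test cheap, which matters more as the text grows.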

  • Counting word frequencies in a document with Python, using the novel Romance of the Three Kingdoms (《三国演义》) as the example: which characters appear most often?


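    Problem description

    Python has a natural advantage in natural language processing: it is simple and quick to write. A common task is counting word frequencies in a document. Take the classic novel Romance of the Three Kingdoms (《三国演义》) as an example: which characters appear most often? Let's solve it with Python.

    Solution

    Real programs often need to handle many pieces of data at once, which is why composite data types exist. The one used here is the mapping type, the dictionary. A dictionary can hold any number of key-value pairs, and the values can be of mixed types.

    With that background, look at the problem again. Counting word frequencies means counting how many times each word occurs: scan the text and add 1 to a word's count each time it appears.

    The IPO description of the problem:

    Input: read the text of Romance of the Three Kingdoms

    Process: use a dictionary to count the occurrences of each word

    Output: print the most frequent character names and their counts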

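    Since the task is about word frequency, the key question is how to extract words accurately from running text. This is where the third-party library jieba comes in. jieba is a widely used third-party Chinese word-segmentation library for Python. It matches the input against its dictionary, builds a graph of the possible words, and uses dynamic programming to find the most probable segmentation.

    So line 1 imports the jieba library.

    Line 2 performs the input step: it uses open() to open the text of the novel. The file is named sanguoyanyi.txt.

    Line 3 segments the text with jieba.lcut(), which runs in precise mode and returns a list. We name the list words.

    Next, a dictionary named counts is created, and each word in the list is counted with counts.get().

    From line 10 on, counts is converted to a list.

    Line 11 sorts that list. A for i in range(...) loop then prints the 20 most frequent names.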
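The listing that the line numbers refer to is not reproduced here. What follows is only a minimal sketch of the steps just described, so its line numbers will not match the ones in the text. To keep it runnable without jieba or the novel's text file, a short hand-segmented list stands in for the output of `jieba.lcut()`. The function names and the single-character filter are assumptions, not details confirmed by the original post.

```python
def count_words(words):
    """Count every token that is longer than one character."""
    counts = {}
    for word in words:
        if len(word) == 1:  # assumption: single characters are rarely names, so skip them
            continue
        counts[word] = counts.get(word, 0) + 1
    return counts

def top_items(counts, n):
    """Return the n (word, count) pairs with the highest counts."""
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:n]

# In the real program the tokens would come from something like:
#   import jieba
#   words = jieba.lcut(open("sanguoyanyi.txt", encoding="utf-8").read())
# (the utf-8 encoding is an assumption about the file).
words = ["却说", "曹操", "引", "兵", "追", "玄德", "，", "孔明", "曰", "曹操", "必", "败"]
for word, count in top_items(count_words(words), 3):
    print(word, count)
```

Slicing with `items[:n]` avoids an IndexError when the text contains fewer than n distinct words, which a fixed `range(20)` loop would not.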

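    The printed result, however, contains words such as “却说” (“meanwhile”), “二人” (“the two of them”) and “不可” (“must not”), which are not names. In addition, “孔明” and “孔明曰” (“Kongming said”) refer to the same person. So the count is not very accurate, and the code needs adjusting, for example by adding a list of words to exclude.

    Problem one is the non-name words. To exclude them, we build a set called excludes containing the frequent words that are not names. How do we find those words? There is only a crude way: trial and error. Run the program repeatedly, let more such words surface, and add them to excludes until the top 20 entries are all names.

    Problem two is the duplicate names. They are merged into one entry, which is what lines 9 to 18 of the code do.

    Because building the exclusion list is so much work, I only printed the top 8 characters.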
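The two fixes can be sketched as follows. This is an illustration under stated assumptions rather than the author's complete code: the exclusion set contains only the three words mentioned above, the alias table is a hypothetical example of merging name variants (the entries for “诸葛亮” and “玄德曰” are added here for illustration), and a hand-segmented list again replaces the jieba output.

```python
# Words that are frequent but are not names; extended by trial and error.
EXCLUDES = {"却说", "二人", "不可"}

# Map variant forms to one canonical name (illustrative entries only).
ALIASES = {"孔明曰": "孔明", "诸葛亮": "孔明", "玄德曰": "玄德"}

def count_names(words):
    """Count tokens after merging aliases and dropping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:  # assumption: skip single characters
            continue
        name = ALIASES.get(word, word)  # fall back to the token itself
        if name in EXCLUDES:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts

words = ["却说", "孔明", "孔明曰", "诸葛亮", "玄德曰", "二人", "玄德", "不可"]
counts = count_names(words)
for name, count in sorted(counts.items(), key=lambda x: x[1], reverse=True)[:8]:
    print(name, count)
```

Filtering while counting means an excluded word never enters the dictionary, so there is no need to delete keys afterwards.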

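    Conclusion

    The point of this example is to practice using composite data types for word-frequency counting. It draws on both lists and dictionaries, which are worth reviewing. A list is a good way to manage collected items and to build data structures; a dictionary handles more complex, keyed data.

    In short, you have only really learned this material once you can complete an example, write the code yourself and understand it. And remember that good code comes from many small revisions.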

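  • I. Word-frequency counting with Python: (1) the method commonly used in the computer rank examination; (2) an improved method, covering the core syntax and three example approaches. II. Word-frequency counting with MapReduce for large files.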

    I. Word-frequency counting with Python

    (1) The method commonly used in the computer rank examination
    First, a fairly standard method used in the exam, for English text:

    def getText():
        txt = open(r"E:\hamlet.txt", "r").read()   # read the Hamlet text file into txt
        txt = txt.lower()          # convert every word to lowercase
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            txt = txt.replace(ch, " ")   # replace special characters with spaces
        return txt

    hamletTxt = getText()
    words  = hamletTxt.split() # split the text on whitespace
    counts = {}
    for word in words:  # count each word's occurrences in the counts dictionary
        counts[word] = counts.get(word,0) + 1  # get() returns 0 when word is not yet a key
    items = list(counts.items())   # convert the dictionary to a list so it can be sorted
    items.sort(key=lambda x:x[1], reverse=True)  # sort by count, descending
    for i in range(10):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))
    

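    For Chinese text, the jieba library is generally used. Here is an example (though it is not tested very often):

    # Word-frequency counting with the jieba library
    import jieba
    txt = open("Jieba词频统计素材.txt", "r", encoding='utf-8').read()  # pass encoding to avoid decoding problems
    words  = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # skip single-character words such as “的” and “好”
        counts[word] = counts.get(word,0) + 1
        # add the word to the dictionary and update its count
    # To leave out specific words, list them and remove them from counts before sorting, for example:
    #exclude_words=["统计","排除"]
    #for word in exclude_words:
    #    counts.pop(word, None)
    # That keeps the unwanted words out of the result.
    # Sort the dictionary entries by count and print the top 10
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))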
    

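    (2) An improved method

    1. The core syntax of word-frequency counting with Python
      To get comfortable with word-frequency counting in Python (specifically the simplest method above), I think the following points matter most.
      (1) Putting the words into a dictionary while counting them
    words  = txt_file.split()      # split the text on whitespace
    words2 = jieba.lcut(txt_file)  # or, for Chinese text, segment it with jieba
    counts = {}
    for word in words:
        counts[word]=counts.get(word,0)+1  # dict.get(key, default) returns default when the key is missing; this one line does the counting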
    

    (2) Converting the dictionary's key-value pairs to a list and sorting it

    items = list(counts.items())  # items() returns the key-value pairs
    items.sort(key=lambda x:x[1], reverse=True)
    

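    A brief word on lambda: lambda x: y takes x and returns y. sort() calls the key function once for each element of the list and orders the elements by the returned values. Here x is one element of items, a (word, count) tuple, and x[1] is its second field, the count.
    In other words, items is sorted by count.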
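A small sketch may help show what the key function receives. The sample list and the helper name `count_of` are made up for illustration. The named function does the same job as the lambda, and the list comprehension prints the values that the sort compares.

```python
items = [("monday", 3), ("zeus", 2), ("venus", 3), ("tuesday", 1)]

def count_of(pair):
    """Return the count field of a (word, count) tuple."""
    return pair[1]

# The key function is applied to each element, not to the list as a whole.
print([count_of(pair) for pair in items])

# A lambda and a named function are interchangeable as the key.
by_lambda = sorted(items, key=lambda x: x[1], reverse=True)
by_function = sorted(items, key=count_of, reverse=True)
print(by_lambda == by_function)
print(by_lambda)
```

`sorted()` returns a new list, while `items.sort()` as used above reorders the list in place; both accept the same `key` and `reverse` arguments.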
    2. Three example ways to count word frequencies with Python

    import pandas as pd
    from collections import Counter
    words_list = ["Monday","Tuesday","Thursday","Zeus","Venus","Monday","Monday","Zeus","Venus","Venus"]
    # Method 1: a plain dictionary (named counts to avoid shadowing the built-in dict)
    counts = {}
    for word in words_list:
        counts[word] = counts.get(word, 0) + 1
    print ("Result1:\n",counts)
    # Method 2: collections.Counter
    result2 =Counter(words_list)
    print("Result2:\n",result2)
    # Method 3: pandas; recent pandas releases deprecate pd.value_counts in favor of pd.Series(words_list).value_counts()
    result3 =pd.value_counts(words_list)
    print("Result3:\n",result3)

    Output:

    Result1:
     {'Monday': 3, 'Tuesday': 1, 'Thursday': 1, 'Zeus': 2, 'Venus': 3}
    Result2:
     Counter({'Monday': 3, 'Venus': 3, 'Zeus': 2, 'Tuesday': 1, 'Thursday': 1})
    Result3:
     Monday      3
    Venus       3
    Zeus        2
    Thursday    1
    Tuesday     1
    dtype: int64
    

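    II. Word-frequency counting with MapReduce

    Counting words in very large files calls for a cluster. The plan is to run the Python programs on Hadoop so the work is distributed and the count finishes faster, which is why the job is written in MapReduce style.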

    (1) Code examples and explanation
    Map:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys

    def main():
        # input comes from STDIN (standard input)
        for line in sys.stdin:
            # remove leading and trailing whitespace
            line = line.strip()
            # split the line into words
            words = line.split()
            # increase counters
            for word in words:
                # write the results to STDOUT (standard output);
                # what we output here will be the input for the
                # Reduce step, i.e. the input for reducer.py
                # tab-delimited; the trivial word count is 1
                print('%s\t%s' % (word, 1))

    if __name__ == "__main__":
        main()
    

    Reduce:

    #!/usr/bin/env python

    import sys

    current_word = None
    current_count = 0
    word = None

    # input comes from STDIN
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        try:
            count = int(count)
        except ValueError:
            # count was not a number, so silently
            # ignore/discard this line
            continue

        # this IF-switch only works because Hadoop sorts map output
        # by key (here: word) before it is passed to the reducer
        if current_word == word:
            current_count += count
        else:
            if current_word:
                # write result to STDOUT
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    # do not forget to output the last word if needed!
    if current_word == word:
        print('%s\t%s' % (current_word, current_count))
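The map, sort and reduce flow can be tried without a cluster. The sketch below simulates it in a single process: `map_phase` emits (word, 1) pairs as the mapper does, `sorted()` stands in for the sort-by-key step that Hadoop performs between the two scripts, and `reduce_phase` sums runs of equal keys with `itertools.groupby`. The function names and the sample lines are illustrative. A commonly used shell equivalent, assuming the scripts are saved as mapper.py and reducer.py and that a hypothetical input.txt exists, is `cat input.txt | python mapper.py | sort | python reducer.py`.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, like the mapper."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive pairs that share a word, like the reducer."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the night begin to shine", "the night begin"]
shuffled = sorted(map_phase(lines))  # stands in for Hadoop's sort-by-key step
result = dict(reduce_phase(shuffled))
print(result)
```

`groupby` only merges adjacent items with the same key, which mirrors why the reducer depends on its input arriving sorted by word.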
    

    (2) A closer look at the core syntax

