  • Word frequency counter: fast word-frequency analysis
  • C - Word frequency counter

    2021-03-31 17:12:03
    * decreasing order of frequency of occurrence. Precede each word by its count. * * WordFrequencyCo

    /*
     * Write a program that prints the distinct words in its input sorted into 
     * decreasing order of frequency of occurrence. Precede each word by its count.
     *
     * WordFrequencyCounter.c - by FreeMan
     */
    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <assert.h>
    
    typedef struct WORD
    {
        char *Word;
        size_t Count;
        struct WORD *Left;
        struct WORD *Right;
    } WORD;
    
    /*
      Assumptions: input is on stdin, output to stdout.
    
      Plan: read the words into a tree, keeping a count of how many we have,
            allocate an array big enough to hold Treecount (WORD *)'s
            walk the tree to populate the array.
            qsort the array, based on size.
            printf the array
            free the array
            free the tree
            free tibet (optional)
            free international shipping!
    */
    
    #define SUCCESS                      0
    #define CANNOT_MALLOC_WORDARRAY      1
    #define NO_WORDS_ON_INPUT            2
    #define NO_MEMORY_FOR_WORDNODE       3
    #define NO_MEMORY_FOR_WORD           4
    #define NONALPHA "1234567890 \v\f\n\t\r+=-*/\\,.;:'#~?<>|{}[]`!\"£$%^&()"
    
    int ReadInputToTree(WORD **DestTree, size_t *Treecount, FILE *Input);
    int AddToTree(WORD **DestTree, size_t *Treecount, char *Word);
    int WalkTree(WORD **DestArray, WORD *Word);
    int CompareCounts(const void *vWord1, const void *vWord2);
    int OutputWords(FILE *Dest, size_t Count, WORD **WordArray);
    void FreeTree(WORD *W);
    char *dupstr(char *s);
    
    int main(void)
    {
        int Status = SUCCESS;
        WORD *Words = NULL;
        size_t Treecount = 0;
        WORD **WordArray = NULL;
    
        /* Read the words on stdin into a tree */
        if (SUCCESS == Status)
        {
            Status = ReadInputToTree(&Words, &Treecount, stdin);
        }
    
        /* Sanity check for no sensible input */
        if (SUCCESS == Status)
        {
            if (0 == Treecount)
            {
                Status = NO_WORDS_ON_INPUT;
            }
        }
    
        /* Allocate a sufficiently large array */
        if (SUCCESS == Status)
        {
            WordArray = malloc(Treecount * sizeof * WordArray);
            if (NULL == WordArray)
            {
                Status = CANNOT_MALLOC_WORDARRAY;
            }
        }
    
        /* Walk the tree into the array */
        if (SUCCESS == Status)
        {
            Status = WalkTree(WordArray, Words);
        }
    
        /* Quick sort the array */
        if (SUCCESS == Status)
        {
            qsort(WordArray, Treecount, sizeof * WordArray, CompareCounts);
        }
    
        /* Walk down the WordArray outputting the values */
        if (SUCCESS == Status)
        {
            Status = OutputWords(stdout, Treecount, WordArray);
        }
    
        /* Free the word array */
        if (NULL != WordArray)
        {
            free(WordArray);
            WordArray = NULL;
        }
    
        /* Free the tree memory */
        if (NULL != Words)
        {
            FreeTree(Words);
            Words = NULL;
        }
    
        /* Error report and we are finshed */
        if (SUCCESS != Status)
        {
            fprintf(stderr, "Program failed with code %d\n", Status);
        }
        return (SUCCESS == Status ? EXIT_SUCCESS : EXIT_FAILURE);
    }
    
    void FreeTree(WORD *W)
    {
        if (NULL != W)
        {
            if (NULL != W->Word)
            {
                free(W->Word);
                W->Word = NULL;
            }
            if (NULL != W->Left)
            {
                FreeTree(W->Left);
                W->Left = NULL;
            }
            if (NULL != W->Right)
            {
                FreeTree(W->Right);
                W->Right = NULL;
            }
            /* Finally free this node itself */
            free(W);
        }
    }
    
    int AddToTree(WORD **DestTree, size_t *Treecount, char *Word)
    {
        int Status = SUCCESS;
        int CompResult = 0;
    
        /* Safety check */
        assert(NULL != DestTree);
        assert(NULL != Treecount);
        assert(NULL != Word);
    
        /* Ok, either *DestTree is NULL or it isn't (deep huh?) */
        if (NULL == *DestTree)  /* This is the place to add it then */
        {
            *DestTree = malloc(sizeof **DestTree);
            if (NULL == *DestTree)
            {
                /* Horrible - we're out of memory */
                Status = NO_MEMORY_FOR_WORDNODE;
            }
            else
            {
                (*DestTree)->Left = NULL;
                (*DestTree)->Right = NULL;
                (*DestTree)->Count = 1;
                (*DestTree)->Word = dupstr(Word);
                if (NULL == (*DestTree)->Word)
                {
                    /* Even more horrible - we've run out of memory in the middle */
                    Status = NO_MEMORY_FOR_WORD;
                    free(*DestTree);
                    *DestTree = NULL;
                }
                else
                {
                    /* Everything was successful, add one to the tree nodes count */
                    ++ *Treecount;
                }
            }
        }
        else /* We need to make a decision */
        {
            CompResult = strcmp(Word, (*DestTree)->Word);
            if (0 < CompResult)
            {
                Status = AddToTree(&(*DestTree)->Right, Treecount, Word);
            }
            else if (0 > CompResult)
            {
                Status = AddToTree(&(*DestTree)->Left, Treecount, Word);
            }
            else
            {
                /* Add one to the count - this is the same node */
                ++(*DestTree)->Count;
            }
        }
    
        return Status;
    }
    
    int ReadInputToTree(WORD **DestTree, size_t *Treecount, FILE *Input)
    {
        int Status = SUCCESS;
        char Buf[8192] = { 0 };
        char *Word = NULL;
    
        /* Safety check */
        assert(NULL != DestTree);
        assert(NULL != Treecount);
        assert(NULL != Input);
    
        /* For every line */
        while (NULL != fgets(Buf, sizeof Buf, Input))
        {
            /* Strtok the input to get only alpha character words */
            Word = strtok(Buf, NONALPHA);
            while (SUCCESS == Status && NULL != Word)
            {
                /* Deal with this word by adding it to the tree */
                Status = AddToTree(DestTree, Treecount, Word);
    
                /* Next word */
                if (SUCCESS == Status)
                {
                    Word = strtok(NULL, NONALPHA);
                }
            }
        }
    
        return Status;
    }
    
    int WalkTree(WORD **DestArray, WORD *Word)
    {
        int Status = SUCCESS;
        static WORD **Write = NULL;
    
        /* Safety check */
        assert(NULL != Word);
    
        /* Store the starting point if this is the first call */
        if (NULL != DestArray)
        {
            Write = DestArray;
        }
    
        /* Now add this node and it's kids */
        if (NULL != Word)
        {
            *Write = Word;
            ++Write;
            if (NULL != Word->Left)
            {
                Status = WalkTree(NULL, Word->Left);
            }
            if (NULL != Word->Right)
            {
                Status = WalkTree(NULL, Word->Right);
            }
        }
    
        return Status;
    }
    
    /*
     * CompareCounts is called by qsort. This means that it gets pointers to the
     * data items being compared. In this case the data items are pointers too.
     */
    int CompareCounts(const void *vWord1, const void *vWord2)
    {
        int Result = 0;
        WORD *const *Word1 = vWord1;
        WORD *const *Word2 = vWord2;
    
        assert(NULL != vWord1);
        assert(NULL != vWord2);
    
        /* Ensure the result is either 1, 0 or -1 */
        if ((*Word1)->Count < (*Word2)->Count)
        {
            Result = 1;
        }
        else if ((*Word1)->Count > (*Word2)->Count)
        {
            Result = -1;
        }
        else
        {
            Result = 0;
        }
    
        return Result;
    }
    
    int OutputWords(FILE *Dest, size_t Count, WORD **WordArray)
    {
        int Status = SUCCESS;
        size_t Pos = 0;
    
        /* Safety check */
        assert(NULL != Dest);
        assert(NULL != WordArray);
    
        /* Print a header */
        fprintf(Dest, "Total Words : %lu\n", (unsigned long)Count);
    
        /* Print the words in descending order */
        while (SUCCESS == Status && Pos < Count)
        {
            fprintf(Dest, "%10lu %s\n", (unsigned long)WordArray[Pos]->Count, WordArray[Pos]->Word);
            ++Pos;
        }
    
        return Status;
    }
    
    
    /*
     * dupstr: Duplicate a string
     */
    char *dupstr(char *s)
    {
        char *Result = NULL;
        size_t slen = 0;
    
        /* Sanity check */
        assert(NULL != s);
    
        /* Get string length */
        slen = strlen(s);
    
        /* Allocate enough storage */
        Result = malloc(slen + 1);
    
        /* Populate string */
        if (NULL != Result)
        {
            memcpy(Result, s, slen);
            *(Result + slen) = '\0';
        }
    
        return Result;
    }
    
    // Output:
    /*
    These are short, famous texts in English from classic sources like the Bible or Shakespeare. Some texts have word definitions and explanations to help you. Some
     of these texts are written in an old style of English. Try to understand them, because the English that we speak today is based on what our great, great, great
    , great grandparents spoke before! Of course, not all these texts were originally written in English. The Bible, for example, is a translation. But they are all
     well known in English today, and many of them express beautiful thoughts.
    ^Z
    Total Words : 66
             5 English
             4 great
             4 texts
             4 in
             3 are
             3 of
             2 is
             2 today
             2 them
             2 written
             2 all
             2 the
             2 Bible
             2 these
             2 to
             2 Some
             2 and
             1 word
             1 definitions
             1 have
             1 explanations
             1 Shakespeare
             1 help
             1 you
             1 or
             1 like
             1 sources
             1 an
             1 old
             1 style
             1 Try
             1 understand
             1 classic
             1 because
             1 that
             1 we
             1 speak
             1 from
             1 famous
             1 based
             1 on
             1 what
             1 our
             1 short
             1 grandparents
             1 spoke
             1 before
             1 Of
             1 course
             1 not
             1 These
             1 were
             1 originally
             1 The
             1 for
             1 example
             1 a
             1 translation
             1 But
             1 they
             1 well
             1 known
             1 many
             1 express
             1 beautiful
             1 thoughts
    
    */

     

  • Word frequency counter
  • #Step 3: build a dictionary with words and frequency info and get it sorted def create_word_dictionary (clean_word_list) : word_count = {} for word in clean_word_list: if word in word_...

    Following Bucky Roberts' tutorial, I wrote a simple web-page word-frequency code block.

    Goal: given a web page, scrape the words on it (English words here) and sort them by how often they occur.

    Steps:
    1. Create a list and put every string on the page into it
    2. Clean the list, removing special symbols
    3. Create a dictionary from the list contents and sort it by word frequency

    Required modules: requests, BeautifulSoup, operator

    The code, with comments, is as follows:

    import requests
    from bs4 import BeautifulSoup
    import operator
    
    
    url = 'https://www.python.org/events/'
    
    #Step 1: create a list with every word in
    def start(url):
    
        #set up a blank list to store words
        word_list = []
        #get source code from url, pick the content word by word and put in the list
        #internet request, connect the url
        source_code = requests.get(url).text
        #turn into soup object to work with
        soup = BeautifulSoup(source_code,"html.parser")
    
        #inspect the unique element for the content you need
        for post_text in soup.findAll('span',{'class':'event-location'}):
            #lower the content and split the sentences
            content = post_text.string
            words = content.lower().split()
            for each_word in words:
                word_list.append(each_word)
        clean_up_list(word_list)
    
    #Step 2: clean up the list, take out things which are not words
    def clean_up_list(word_list):
        clean_word_list = []
        for word in word_list:
            symbols = '`~!@#$%^&*()-=_+[]\|;/\':""?/,.<>{}'
            for i in range(0,len(symbols)):
                word = word.replace(symbols[i],'')
            if len(word) > 0:
                clean_word_list.append(word)
        create_word_dictionary(clean_word_list)
    
    
    #Step 3: build a dictionary with words and frequency info and get it sorted
    def create_word_dictionary(clean_word_list):
        word_count = {}
        for word in clean_word_list:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
        for key,value in sorted(word_count.items(), key = operator.itemgetter(1)):
            print (key, value)
    
    start(url)

    The python.org events page is used as the example here [counting word frequencies for this page is not particularly useful in itself; it only demonstrates the functionality].

    The output is as follows [still cannot paste images, so once again the results are shown as a quotation]:

    basel 1
    centre 2
    and 2
    city 2
    hotel 2
    campus 2
    republic 2
    switzerland 2
    new 2
    ossa 2
    convention 2
    ohio 2
    1 3
    singapore 3
    south 3
    usa 3
    computing 3
    germany 5
    the 5
    university 6
    of 9
    Process finished with exit code 0

    The results are sorted in ascending order of word frequency.
    key = operator.itemgetter(1) sorts by the frequency; with operator.itemgetter(0) the sort is alphabetical, by the word itself.

    for key,value in sorted(word_count.items(), key = operator.itemgetter(1)):
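
    To print the most frequent words first instead, pass reverse=True to sorted(); a minimal variation on the loop above:

    # Descending order: most frequent word first
    for key, value in sorted(word_count.items(), key = operator.itemgetter(1), reverse = True):
        print (key, value)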

    That is the simple implementation. If anything is unclear, please point it out; discussion is welcome.

    Reference tutorials:
    http://www.bilibili.com/video/av2847788/index_35.html
    http://www.bilibili.com/video/av2847788/index_36.html
    http://www.bilibili.com/video/av2847788/index_37.html
    Referenced page:
    https://www.python.org/events/

  • Python solution for calculating word frequency (https://github.com/tmbdev/hocr-tools/wiki/Calculate-word-frequency). This question comes from the open-source project: ocropus/hocr-tools
  • Word_Frequency_counter: implement the function so that it returns a list of strings sorted by word frequency, most frequent word first. The function takes two arguments: a string representing a text document, and an integer giving the number of items to return. Running time no more than O(n).
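
    A rough Python sketch of such a function (the name top_words and the whitespace tokenization are my own assumptions; collections.Counter does the counting, and its most_common(k) is O(n log k) rather than strictly O(n), so a hard O(n) bound would need a bucket-by-count pass instead):

    from collections import Counter

    def top_words(text, k):
        # Split the document into lowercase words, count them, and return
        # the k most frequent words, most common first.
        counts = Counter(text.lower().split())
        return [word for word, _ in counts.most_common(k)]

    # Example: top_words("a a a b b c", 2) returns ['a', 'b']
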
  • Word frequency counter: several C programs for working with text documents. The programs count the frequency of every distinct word in a text document and print the 3 most frequent words for each supplied text document. The user can pass several files at once for processing. rvw.c handles them sequentially in the main process...
  • $ git clone https://github.com/DeepGunner/Word-Frequency-Counter $ npm install $ npm start Resources used: Code components: wordArray = text.toLowerCase().split(/\W+/) wordArray.forEach((key) => {...
  • Frequency (frequency statistics)

    1,000+ views 2017-11-27 16:46:37
     res[word[0]] = counter  counter += 1  return res # Record the standard word positions standard_position_dict = position_lookup(standard_freq_vector) print(standard_position_dict) # we get a ...
    # -*- coding: utf-8 -*-
    
    """
    Created on Fri Oct 20 19:16:41 2017

    @author: ESRI
    """
    import nltk
    from nltk import FreqDist
    # Build a small corpus first
    corpus = 'this is my sentence ' \
    'this is my life ' \
    'this is the day'
    # Tokenize it casually
    # Obviously, as mentioned above,
    # any preprocessing you need can be done here:
    # stopwords, lemma, stemming, etc.
    tokens = nltk.word_tokenize(corpus)
    print(tokens)
    # We get the tokenized word list
    # ['this', 'is', 'my', 'sentence',
    # 'this', 'is', 'my', 'life', 'this',
    # 'is', 'the', 'day']

    # Use NLTK's FreqDist to count how often each word appears
    fdist = FreqDist(tokens)
    print(fdist)
    # It behaves much like a dict:
    # index it with a word to see how many times that word occurs in the whole text
    print(fdist['is'])
    # 3
    # Good. Now we can take out the 50 most common words
    standard_freq_vector = fdist.most_common(50)
    size = len(standard_freq_vector)
    print(standard_freq_vector)
    # [('is', 3), ('this', 3), ('my', 2),
    # ('the', 1), ('day', 1), ('sentence', 1),
    # ('life', 1)
    # Func: record each word's position, ordered by frequency of occurrence
    def position_lookup(v):
        res = {}
        counter = 0
        for word in v:
            #print(word[0])
            res[word[0]] = counter
            counter += 1
        return res

    # Record the standard word positions
    standard_position_dict = position_lookup(standard_freq_vector)
    print(standard_position_dict)
    # We get a position lookup table
    # {'is': 0, 'the': 3, 'day': 4, 'this': 1,
    # 'sentence': 5, 'my': 2, 'life': 6}

    # Now, suppose we have a new sentence:
    sentence = 'this is cool'
    # First create a vector the same size as our standard vector
    freq_vector = [0] * size
    # Simple preprocessing
    tokens = nltk.word_tokenize(sentence)
    # For every word in the new sentence
    for word in tokens:
        try:
            # If it appeared in our corpus,
            # add 1 at its "standard position"
            freq_vector[standard_position_dict[word]] += 1
        except KeyError:
            # If it is a new word, just skip it
            continue
    print(freq_vector)
    # [1, 1, 0, 0, 0, 0, 0]
    # The first position represents 'is', which appeared once
    # The second position represents 'this', which appeared once
    # Everything after that is zero
  • Hbase Map Reduce Example - Frequency Counter

    1,000+ views 2011-01-19 12:50:00
    Source: http://sujee.net/tech/articles/hbase-map-reduce-freq-counter/ Through this introductory example, you can learn to use the HBase API to write MapReduce programs and build your own test cases! This is a tutorial on how to run a map reduce job ...

    Source: http://sujee.net/tech/articles/hbase-map-reduce-freq-counter/

     

    Through this introductory example, you can learn to use the HBase API to write MapReduce programs and implement your own test cases!

     

    This is a tutorial on how to run a map reduce job on Hbase.  This covers version 0.20 and later.

    Recommended Readings:
    - Hbase home
    - Hbase map reduce Wiki
    - Hbase Map Reduce Package
    - Great intro to Hbase map reduce by George Lars

    Version Difference

    Hadoop map reduce API changed around v0.20.  So did Hbase map reduce package.
    - org.apache.hadoop.hbase.mapred   : older API, pre v0.20
    - org.apache.hadoop.hbase.mapreduce : newer API,  post v0.20

    We will be using the newer API.

    Frequency Counter

    For this tutorial let's say our Hbase has records of web_access_logs.  We record each web page access by a user.  To keep things simple, we are only logging the user_id and the page they visit.  You can imagine all sorts of other stats could be gathered, such as ip_address, referrer_page, etc.

    The schema looks like this:

    [schema figure omitted in this copy: row key = userID + timestamp (composite), column family 'details' with a 'page' column]

    To make the row key unique, we append a timestamp at the end, making up a composite key.

    So the sample data might look like this:

    row         details:page
    user1_t1    a.html
    user2_t2    b.html
    user3_t4    a.html
    user1_t5    c.html
    user1_t6    b.html
    user2_t7    c.html
    user4_t8    a.html



    we want to count how many times we have seen each user.  The result we want is:

    user     count (frequency)
    user1    3
    user2    2
    user3    1
    user4    1



    So we will write a map reduce program.  It is similar to the popular word-count example, with a couple of differences: our input source is an Hbase table, and the output is also sent to an Hbase table.

    First, code access & Hbase setup


    The code is in GIT repository at GitHub : http://github.com/sujee/hbase-mapreduce
    You can get it by

    git clone git://github.com/sujee/hbase-mapreduce.git

    This is an Eclipse project. To compile it, define HBASE_HOME to point to the Hbase install directory.

    Let's also set up our Hbase tables:
    0) For map reduce to run, Hadoop needs to know about the Hbase classes. Edit 'hadoop/conf/hadoop-env.sh':

    # Extra Java CLASSPATH elements.  add hbase jars
    export HADOOP_CLASSPATH=/hadoop/hbase/hbase-0.20.3.jar:/hadoop/hbase/hbase-0.20.3-test.jar:/hadoop/hbase/conf:/hadoop/hbase/lib/zookeeper-3.2.2.jar

    Change this to reflect your Hbase installation. Instructions for modifying the Hbase configuration are here: (http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html)
    1) restart Hadoop in pseudo-distributed (single server) mode
    2) restart Hbase in pseudo-distributed (single server) mode.
    3)

    hbase shell
        create 'access_logs', 'details'
        create 'summary_user', {NAME=>'details', VERSIONS=>1}


    'access_logs' is the table that has 'raw' logs and will serve as our Input Source for mapreduce.  'summary_user' table is where we will write out the final results.

    Some Test Data ...

    So let's get some sample data into our tables.  The 'Importer1' class will fill 'access_logs' with some sample data.

    package hbase_mapred1;

    import java.util.Random;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /**
     * writes random access logs into hbase table
     *
     * userID_count => {
     *     details => {
     *         page
     *     }
     * }
     *
     * @author sujee ==at== sujee.net
     */
    public class Importer1 {

        public static void main(String[] args) throws Exception {

            String[] pages = {"/", "/a.html", "/b.html", "/c.html"};

            HBaseConfiguration hbaseConfig = new HBaseConfiguration();
            HTable htable = new HTable(hbaseConfig, "access_logs");
            htable.setAutoFlush(false);
            htable.setWriteBufferSize(1024 * 1024 * 12);

            int totalRecords = 100000;
            int maxID = totalRecords / 1000;
            Random rand = new Random();
            System.out.println("importing " + totalRecords + " records ....");
            for (int i = 0; i < totalRecords; i++)
            {
                int userID = rand.nextInt(maxID) + 1;
                byte[] rowkey = Bytes.add(Bytes.toBytes(userID), Bytes.toBytes(i));
                String randomPage = pages[rand.nextInt(pages.length)];
                Put put = new Put(rowkey);
                put.add(Bytes.toBytes("details"), Bytes.toBytes("page"), Bytes.toBytes(randomPage));
                htable.put(put);
            }
            htable.flushCommits();
            htable.close();
            System.out.println("done");
        }
    }


    Go ahead and run 'Importer1' in Eclipse.

    In the hbase shell, let's see how our data looks:

    hbase(main):004:0> scan 'access_logs', {LIMIT => 5}
    ROW                               COLUMN+CELL
     /x00/x00/x00/x01/x00/x00/x00r    column=details:page, timestamp=1269330405067, value=/
     /x00/x00/x00/x01/x00/x00/x00/xE7 column=details:page, timestamp=1269330405068, value=/a.html
     /x00/x00/x00/x01/x00/x00/x00/xFC column=details:page, timestamp=1269330405068, value=/a.html
     /x00/x00/x00/x01/x00/x00/x01a    column=details:page, timestamp=1269330405068, value=/b.html
     /x00/x00/x00/x01/x00/x00/x02/xC6 column=details:page, timestamp=1269330405068, value=/a.html
    5 row(s) in 0.0470 seconds

     

    About Hbase Mapreduce

    Let's take a minute and examine the Hbase map reduce classes.

    A Hadoop mapper can take in (KEY1, VALUE1) and output (KEY2, VALUE2).  The Reducer can take (KEY2, VALUE2) and output (KEY3, VALUE3).
    [figure: MapReduce key/value flow]
    (image credit : http://www.larsgeorge.com/2009/05/hbase-mapreduce-101-part-i.html)


    Hbase provides convenient Mapper & Reducer classes - org.apache.hadoop.hbase.mapreduce.TableMapper and org.apache.hadoop.hbase.mapreduce.TableReducer. These classes extend Mapper and Reducer and make it easier to read from and write to Hbase tables.

    [figures omitted: TableMapper and TableReducer class signatures]

    TableMapper:

    Hbase TableMapper is an abstract class extending Hadoop Mapper.
    The source can be found at :  HBASE_HOME/src/java/org/apache/hadoop/hbase/mapreduce/TableMapper.java

    
     

    Notice how TableMapper parameterizes the Mapper class.

    Param          Class                    Comment
    KEYIN (k1)     ImmutableBytesWritable   fixed.  This is the row_key of the current row being processed
    VALUEIN (v1)   Result                   fixed.  This is the value (result) of the row
    KEYOUT (k2)    user specified           customizable
    VALUEOUT (v2)  user specified           customizable




    The input key/value for TableMapper is fixed.  We are free to customize output key/value classes.  This is a noticeable difference compared to writing a straight hadoop mapper.

    TableReducer

    src  : HBASE_HOME/src/java/org/apache/hadoop/hbase/mapreduce/TableReducer.java

     


    Let's look at the parameters:

    Param                                   Class
    KEYIN (k2 - same as mapper keyout)      user-specified (same class as K2 output from mapper)
    VALUEIN (v2 - same as mapper valueout)  user-specified (same class as V2 output from mapper)
    KEYOUT (k3)                             user-specified
    VALUEOUT (v4)                           must be Writable

    TableReducer can take any KEY2 / VALUE2 class and emit any KEY3 class, and a Writable VALUE4 class.


    Back to Frequency Counting

    We will extend TableMapper and TableReducer with our custom classes.

    Mapper

    Input:
      ImmutableBytesWritable  (RowKey = userID + timestamp)
      Result                  (Row Result)
    Output:
      ImmutableBytesWritable  (userID)
      IntWritable             (always ONE)


    Reducer

    Input:
      ImmutableBytesWritable  (userID, from output K2 of the mapper)
      Iterable<IntWritable>   (all the ONEs for this key, combined into a 'list' from the mapper's V2 output)
    Output:
      ImmutableBytesWritable  (userID, same as the input; this will be the KEYOUT k3 and will serve as the 'rowkey' for the output Hbase table)
      IntWritable             (total of all the ONEs for this key; this will be the VALUEOUT v3 and will be the PUT value for the Hbase table)



    In mapper we extract the USERID from the composite rowkey (userID + timestamp).  Then we just emit the userID and ONE - as in number ONE.

    Visualizing Mapper output

       (user1, 1)
    (user2, 1)
    (user1, 1)
    (user3, 1)


    The map-reduce framework collects matching output keys together and sends them to the reducer.  This is why we see a 'list' or 'iterable' for each userID key at the reducer.  In the Reducer, we simply add all the values and emit <UserID, total Count>.

    Visualizing Input to Reducer:

       (user1, [1, 1])
    (user2, [1])
    (user3, [1])


    And the output of reducer:

       (user1, 2)
    (user2, 1)
    (user3, 1)

    Ok, now onto the code.

    Frequency Counter Map Reduce Code

    [code listing omitted in this copy: FreqCounter1.java from the GitHub repository linked above; the line numbers in the walk-through below refer to it]

    Code Walk-through

    • Since our mapper/reducer code is pretty compact, we have it all in one file

    • At line 26 :
          static class Mapper1 extends TableMapper<ImmutableBytesWritable, IntWritable> {
      we configure the class types emitted from the mapper. Remember, map inputs are already defined for us by TableMapper (as ImmutableBytesWritable and Result)

    • At line 34:
      ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
      we are extracting userID from the composite key (userID + timestamp = INT + INT). This will be the key that we will emit.

    • at line 36:
              context.write(userKey, one);
      Here is where we EMIT our output. Notice we always output ONE (which is IntWritable(1)).

    • At line 46, we configure our reducer to accept the values emitted from the mapper (ImmutableBytesWritable, IntWritable)

    • line 52:
                  for (IntWritable val : values) {
      sum += val.get();
      we simply aggregate the count. Since each count is ONE, the sum is the total number of values.

    • At line 56:
                  Put put = new Put(key.get());
      put.add(Bytes.toBytes("details"), Bytes.toBytes("total"), Bytes.toBytes(sum));
      context.write(key, put);
      Here we see the familiar Hbase PUT being created. The key being used is USERID (passed on from mapper, and used unmodified here). The value is SUM. This PUT will be saved into our target Hbase Table ('summary_user').
      Notice, however, that we don't write directly to the output table. This is done by the superclass 'TableReducer'.

    • Finally, let's look at the job setup.
      
       
      We set up the Hbase configuration, Job and Scanner. Optionally, we also configure the scanner with which columns to read, and use 'TableMapReduceUtil' to set up the mapper class.
               TableMapReduceUtil.initTableMapperJob(
      "access_logs", // table to read data from
      scan, // scanner
      Mapper1.class, // map class
      ImmutableBytesWritable.class, // mapper output KEY class
      IntWritable.class, // mapper output VALUE class
      job // job
      );

      Similarly we setup Reducer
            TableMapReduceUtil.initTableReducerJob(
      "summary_user", // table to write to
      Reducer1.class, // reducer class
      job); // job

     

    Running the Job

     

    Single Server mode

    We can just run the code from Eclipse: run 'FreqCounter1'. (You may need to increase the JVM memory using -Xmx300m in the launch configuration.) The output looks like this:

    ...
    10/04/09 15:08:32 INFO mapred.JobClient: map 0% reduce 0%
    10/04/09 15:08:37 INFO mapred.LocalJobRunner: mapper processed 10000 records so far
    10/04/09 15:08:40 INFO mapred.LocalJobRunner: mapper processed 30000 records so far
    ...
    10/04/09 15:08:55 INFO mapred.JobClient: map 100% reduce 0%
    ...
    stats : key : 1, count : 999
    stats : key : 2, count : 1040
    stats : key : 3, count : 986
    stats : key : 4, count : 983
    stats : key : 5, count : 967
    ...
    10/04/09 15:08:56 INFO mapred.JobClient: map 100% reduce 100%

    Alright... we see mapper progressing and then we see 'frequency output' from our Reducer! Neat !!

    Running this on a Hbase cluster (multi machines)

    For this we need to make a JAR file of our classes.
    Open a terminal and navigate to the directory of the project.

    jar cf freqCounter.jar -C classes .

    This will create a jar file 'freqCounter.jar'. Use this jar file with 'hadoop jar' command to launch the MR job

    hadoop jar freqCounter.jar hbase_mapred1.FreqCounter1


    You can track the progress of the job at task tracker : http://localhost:50030
    Plus you can monitor the program output on the task-tracker website as well.

    Checking The Result


    Lets do a scan of results table

    hbase(main):002:0> scan 'summary_user', {LIMIT => 5}
    ROW                          COLUMN+CELL                                                                     
     /x00/x00/x00/x00            column=details:total, timestamp=1269330349590, value=/x00/x00/x04/x0A           
     /x00/x00/x00/x01            column=details:total, timestamp=1270856929004, value=/x00/x00/x03/xE7           
     /x00/x00/x00/x02            column=details:total, timestamp=1270856929004, value=/x00/x00/x04/x10           
     /x00/x00/x00/x03            column=details:total, timestamp=1270856929004, value=/x00/x00/x03/xDA           
     /x00/x00/x00/x04            column=details:total, timestamp=1270856929005, value=/x00/x00/x03/xD7           
    5 row(s) in 0.0750 seconds


    OK, looks like we have our frequency counts, but they are all in byte display.  Let's write a quick scanner to print a more user-friendly display.

    [scanner code omitted in this copy]


    Running this will print out output like ...

    key: 0,  count: 1034
    key: 1, count: 999
    key: 2, count: 1040
    key: 3, count: 986
    key: 4, count: 983
    key: 5, count: 967
    key: 6, count: 987
    ...
    ...

    That's it

    thanks!

  • https://leetcode-cn.com/problems/sort-characters-by-frequency/ Example 1: Input: "tree" Output: "eert" Explanation: 'e' appears twice, while 'r' and 't' each appear only once, so 'e' must come before 'r' and 't'. Also...
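
    One way to do this in Python (my own sketch, not code from the linked post): count with collections.Counter and rebuild the string from the most frequent character down:

    from collections import Counter

    def frequency_sort(s):
        # Count each character, then repeat each character by its count,
        # most frequent first (characters with equal counts may appear in either order).
        return ''.join(ch * cnt for ch, cnt in Counter(s).most_common())

    # frequency_sort("tree") returns "eetr" or "eert"; both are accepted answers
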
  • Counter: counting how many times each character appears

    1,000+ views 2016-06-19 20:02:16
    The Python API provides Counter, which has counting functionality. Below is the demo I made: 1. Count how many times each character appears in a custom string. 2. Read a file, turn its content into a string, and count how many times each character appears in that string. ...
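
    A minimal sketch of those two steps (the file name demo.txt is just a placeholder):

    from collections import Counter

    # 1. Count how many times each character appears in a custom string
    text = "hello world"
    print(Counter(text))                      # Counter({'l': 3, 'o': 2, 'h': 1, ...})

    # 2. Read a file, turn its content into a string, and count its characters
    with open("demo.txt", encoding="utf-8") as f:
        content = f.read()
    print(Counter(content).most_common(10))   # the ten most frequent characters
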
  • Implementing Word2Vec in PyTorch

    1,000+ views 2020-04-22 17:10:49
    This post mainly reproduces the word2vec paper using PyTorch. nn.Embedding in PyTorch: the key to the implementation is the nn.Embedding() API; first, a look at its parameter description...
  • # Note: Counter() takes a string and returns an object for counting the characters in that string. word_counts.most_common() with an integer C returns the top C entries; with no argument it sorts all entries by count. word_counts.most_common() ...
  • counter vector matrix: each document is represented as a combination of word vectors, and each word's weight is the number of times it occurs. Of course, if the corpus is very large, the dictionary will also be very large, so the matrix above is necessarily sparse, which brings considerable ... to subsequent computation
  • BUAA Advanced Software ...Project: Individual Project - Word frequency program Ryan Mao (毛宇)-1106116_11061171 Implement a console application to tally the frequency of words under a direc...
  • Here to make good on a flag. This is the first post in a detailed NLP techniques series. ...Represent the Meaning of a Word WordNet WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of co
  • Training word2vec with PyTorch

    1,000+ views 2020-02-06 22:13:54
    Implementing word2vec in PyTorch. Main content: there is plenty of material online on how Word2Vec works, so it is not repeated here. I use PyTorch to reproduce, as closely as possible, the word-vector training from the paper Distributed Representations of Words and Phrases and their Compositionality...
  • A knowledge summary of Word Embedding

    10,000+ views, many likes 2019-04-15 12:44:26
    1. Basic concepts of Word Embedding. 1.1 What is Word Embedding? ... If a word is treated as the smallest unit of text, Word Embedding can be understood as a kind of mapping whose process is: take a word in text space and, by some method, map or embed (embe...
  • A detailed walk-through of the tensorflow word2vec demo

    1,000+ views 2018-07-28 19:04:01
    word2vec has the CBOW and Skip-Gram models. CBOW predicts the middle word from its context, while Skip-Gram does exactly the opposite. This article first introduces the Skip-Gram model, based on a demo provided officially by tensorflow; the second major part is a lightly modified CBOW model. Main references: ...
  • Table of contents: 1 Using WordNet 1.1 Introduction 1.2 Usage 2 Early development of word vectors 2.1 one-hot 2.2 SVD Based Methods 2.2.1 Word-Document Matrix 2.2.2 Window based Co-occurrence Matrix 3 Word2vec 3.1 n-gram 3.2 skip-gram 3.3 CBOW...
  • Word Embedding

    2020-07-21 21:42:02
    Word Embedding 1. Basic concepts 1.1 What is Word Embedding? Existing machine learning methods often cannot process text data directly, so a suitable way is needed to convert text data into numeric data; this is what leads to the idea of Word Embedding. If a word is treated as the smallest unit of text...
  • Hadoop programming tips (2): the Counter

    1,000+ views 2014-07-16 17:30:50
    When programming with Hadoop, while working on our algorithm logic we sometimes also want to learn a few properties of the data along the way, for example how many records there are in total, how many map outputs there are, and so on (these are available directly once the job has finished); for that we can use Counters. ...
  • (3) Frequency statistics: the FreqDist counter does its counting over tokens. Step 1: split the text with word_tokenize or jieba into a word list. Step 2: count with FreqDist, passing in the tokenized tokens directly...
  • downsampling = 1e-2 # threshold for configuring which higher-frequency words are randomly downsampled # Initialize and train the model model = word2vec.Word2Vec(all_receipes_ingredients, workers=num_...
  • A simplified word2vec implementation. Skip-gram in brief: skip-gram is one way to train word2vec; its core idea is to predict the surrounding words from the center word. Compared with CBOW, which predicts the center word from the surrounding words, skip-gram training is "harder", so the word vectors it produces...
  • word2vec

    2020-10-31 19:13:08
    word2vec: definition, representations, language models. Definition: what is word2vec? Word vectors: as the name suggests, the finest-grained unit in NLP is the word, and to solve NLP problems we always have to pay attention to the basic words. The words in NLP are converted into numeric form, or embedded into a mathematical space, ...
    word2vec, proposed by Google in 2013, is a model for training word vectors from large corpora; it is used in many scenarios, such as information extraction and similarity computation. It is also with word2vec that embeddings started to become popular across many fields, so word2vec could not be a more fitting opening topic...
