  • Most Common Word (LeetCode)

    [Problem statement]:

    Given a paragraph and a list of banned words, return the most frequent word that is not in the list of banned words.  It is guaranteed there is at least one word that isn't banned, and that the answer is unique.

    Words in the list of banned words are given in lowercase, and free of punctuation.  Words in the paragraph are not case sensitive.  The answer is in lowercase.

    Example:
    Input: 
    paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
    banned = ["hit"]
    Output: "ball"
    Explanation: 
    "hit" occurs 3 times, but it is a banned word.
    "ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph. 
    Note that words in the paragraph are not case sensitive,
    that punctuation is ignored (even if adjacent to words, such as "ball,"), 
    and that "hit" isn't the answer even though it occurs more because it is banned.

    [Brute-force solution]:

    Time analysis:

    Space analysis:

    [Optimized]:

    Time analysis:

    Space analysis:

    [Unusual output conditions]:

    [Unusual corner cases]:

    [Conceptual pitfalls]:

    [One-sentence idea]:

    [Input cases]: empty: normal: very large: very small: special cases handled in the code: invalid input (illegal or unreasonable):

    [Diagram]:

    [First pass]:

    1. The slashes are backslashes, not forward slashes.

    [Second pass]:

    [Third pass]:

    [Fourth pass]:

    [Fifth pass]:

    [Findings from a five-minute eyeball debug]:

    [Summary]:

    Strip punctuation and split on whitespace, both with regular expressions.

    [Complexity]: Time complexity: O(n). Space complexity: O(n).

    [Data structure/algorithm used (in English), and why not another]:

    1. Arrays.asList() returns a List, and a fixed-length one at that. It is what lets the String[] of banned words be loaded into a HashSet; note that it treats a primitive array such as int[] as a single element rather than spreading it out:
    import java.util.Arrays;

    public class Demo {
        public static void main(String[] args) {
            int[] a1 = new int[]{1, 2, 3};
            String[] a2 = new String[]{"a", "b", "c"};

            // the int[] is treated as one element; the String[] is spread into elements
            System.out.println(Arrays.asList(a1));
            System.out.println(Arrays.asList(a2));
        }
    }
      The printed output is:
      [[I@dc8569]
      [a, b, c]

    Removing punctuation and splitting on whitespace (note the backslashes):

    String[] words = p.replaceAll("\\pP" , "").toLowerCase().split("\\s+");

    [Key template code]:

    [Other solutions]:

    [Follow Up]:

    [LeetCode's variations on the problem]:

    [Code style]:

    class Solution {
        public String mostCommonWord(String paragraph, String[] banned) {
            // init: banned-word set, frequency map, running best
            Set<String> set = new HashSet<>(Arrays.asList(banned));
            Map<String, Integer> map = new HashMap<>();
            String res = "";
            int max = Integer.MIN_VALUE;
            
            // count each non-banned word, tracking the running max
            String[] words = paragraph.replaceAll("\\pP", "").toLowerCase().split("\\s+");
            for (String w : words) {
                if (!set.contains(w)) {
                    map.put(w, map.getOrDefault(w, 0) + 1);
                    if (map.get(w) > max) {
                        res = w;
                        max = map.get(w);
                    }
                } 
            }
            
            //return
            return res;
        }
    }

     

    Reposted from: https://www.cnblogs.com/immiao0319/p/8974022.html

  • Counting high-frequency words

    2018-06-27 20:37:41
    # return the most frequent word
    import re
    import operator

    def word(file):
        with open(file) as f:
            data = f.read()  # read the entire file at once
            contents = re.findall(r'\w+', data)  # match every word token
            count = {}
            for content in contents:
                count[content] = count.get(content, 0) + 1  # default to 0 if missing, then add 1
            print(count)
            sortedcount = sorted(count.items(), key=operator.itemgetter(1))  # sort by value (index 1)
            print(sortedcount[-1][0])  # last item of the ascending sort = most frequent word

  • Counting high-frequency words in Python

    2019-06-26 14:08:33
    # -*- coding: utf-8 -*-
    import jieba

    txt = open('gaopin.txt', 'r').read()
    words = jieba.lcut(txt)  # segment the text with jieba
    print(words)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # skip single-character tokens (mostly punctuation and particles)
        counts[word] = counts.get(word, 0) + 1
    print(counts)
    jieguo = sorted(counts.items(), key=lambda x: x[1], reverse=True)  # sort by count, descending
    print(jieguo)
    

     

  • High-frequency word counting with Python's jieba module

    2018-10-30 22:37:31
    import jieba

    txt = open('gaopin.txt', 'r').read()
    words = jieba.lcut(txt)  # segment the text with jieba
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # skip single-character tokens
        counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending

    for i in range(20):  # print the 20 most frequent words
        word, count = items[i]
        print(word, count)
    #    print('{0:<10}{1:>5}'.format(word, count))
    
  • import jieba ls="中国是一个伟大的国家,是一个好的...counts={} # define the counting dict words=jieba.lcut(ls) print('分好的词组为:',words) for word in words: counts[word]=counts.get(word,0)+1 print...
  • NLP: removing high-frequency stop words

    2019-05-06 10:05:29
    Removing high-frequency stop words reduces model noise and speeds up training: def remove_fre_stop_word(words): t = 1e-5 # t value threshold = 0.8 # drop-probability threshold # count word frequencies int_word_counts = collections.Counter(words) total_count = ...
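
    A minimal sketch of how the truncated snippet above plausibly continues, using the word2vec subsampling heuristic (drop probability 1 - sqrt(t / f(w))); everything past total_count is an assumption, not the original post's code.

    import collections

    def remove_fre_stop_word(words):
        t = 1e-5         # t value
        threshold = 0.8  # drop-probability threshold
        # count word frequencies
        int_word_counts = collections.Counter(words)
        total_count = len(words)
        word_freqs = {w: c / total_count for w, c in int_word_counts.items()}
        # word2vec-style drop probability: approaches 1 for very frequent words
        prob_drop = {w: 1 - (t / word_freqs[w]) ** 0.5 for w in int_word_counts}
        # keep only words whose drop probability stays below the threshold
        return [w for w in words if prob_drop[w] < threshold]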
  • Continuing from the previous post: with the word frequencies counted, it's time for frequency analysis. Since I've been using Spotfire at work lately, and it feels rather high-end, let's hand this task to... The top 100 high-frequency words are: No. word count 1 ACCOUNT 1938 2 RIDER
  • This post doesn't dwell on the phenomenon itself; it uses Python's word-cloud tooling to run a simple frequency analysis of "dagongren" (打工人) quotes and find the high-frequency words. When working with text content it's easy to lose track of the key points, whereas a word-cloud chart, through simple frequency counting, can make the most frequent...
  • Word vector computation: 2.1 Review of word2vec computation 2.2 word2vec computation in detail 2.3 Problems caused by high-frequency words (the) 3. Optimization basics: 3.1 Gradient descent 3.2 Stochastic gradient descent (SGD) 4. Word vector optimization: 4.1 Sparse data caused by SGD 4.2 Two kinds of word...
  • Word Vectors and Word Senses 1. Word vector computation: 1 Review of word2vec computation 2 word2vec computation in detail 3 Problems caused by high-frequency words (the) 2. Optimization basics: 1 Gradient descent 2 Stochastic gradient descent (SGD) 3. Word vector optimization: 1 SGD-induced...
    Stop words (Chinese)

    2009-09-16 15:08:19
    Many text-processing systems include a stop-word filtering step that removes the high-frequency words contributing nothing to the informational content of the text. A stop-word strategy saves storage, improves classification and counting accuracy, and reduces computation.
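
    As a concrete illustration of that filtering step, here is a minimal sketch; the file name stopwords.txt and its one-word-per-line format are assumptions made for the example.

    def load_stopwords(path="stopwords.txt"):
        # assumed format: one stop word per line
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def filter_stopwords(tokens, stopwords):
        # drop the high-frequency function words before counting or indexing
        return [t for t in tokens if t not in stopwords]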
  • Word frequency counting

    2019-05-06 15:08:58
    Use Python to extract the ten most frequent words from a text file by completing the function: (1) frequency-extraction function, prototype: def word_freq(path). Parameter path: a string, the path of the text file to process.... High-frequency words (see sight word.txt) are excluded from the count; a hedged sketch follows. It can...
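
    A sketch of what word_freq might look like; only the path parameter, the top-ten requirement, and the sight word.txt exclusion list come from the exercise, while the regex tokenization and Counter are my own choices.

    import re
    from collections import Counter

    def word_freq(path):
        with open(path, encoding="utf-8") as f:
            words = re.findall(r"[a-z']+", f.read().lower())
        # exclusion list named in the exercise
        with open("sight word.txt", encoding="utf-8") as f:
            sight = set(f.read().lower().split())
        counts = Counter(w for w in words if w not in sight)
        for word, count in counts.most_common(10):
            print(word, count)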
  • Subword embedding: FastText represents each word more finely as a collection of subwords in the form of fixed-size n-grams, while the BPE (byte pair encoding) algorithm can automatically and dynamically generate a collection of high-frequency subwords from corpus statistics (a toy sketch of the merge loop follows); GloVe: global...
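
    To make the BPE idea concrete, here is a compact sketch of the classic merge loop: count adjacent symbol pairs across the vocabulary, merge the most frequent pair, and repeat. The toy vocabulary is invented, and the naive textual replace is a simplification of real implementations.

    import collections

    def most_frequent_pair(vocab):
        # vocab maps space-separated symbol sequences to word counts
        pairs = collections.Counter()
        for word, count in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        return max(pairs, key=pairs.get)

    def merge_pair(pair, vocab):
        old, new = " ".join(pair), "".join(pair)
        return {w.replace(old, new): c for w, c in vocab.items()}

    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}
    for _ in range(3):
        pair = most_frequent_pair(vocab)
        vocab = merge_pair(pair, vocab)
        print(pair, vocab)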
  • Counting word frequencies and visualizing them

    2018-07-21 22:49:45
    Because the data is anonymized, word2vec has to be run before any NLP, which means punctuation and meaningless words (such as 的) must be removed first. Our approach is to drop the high-frequency words, so the first step is finding them. How to find the high-frequency words? A dict would certainly do here, but...
  • An approach to counting word frequencies in Python over text data at the scale of tens of millions of items and above. A research project needed the high-frequency words in jieba segmentation output; I had been tallying frequencies with Python's dict and the **not in** idiom, as in this snippet (a streaming alternative is sketched below): # create an empty dict word_dict = {} #...
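
    One way to avoid the dict-and-not-in pattern at that scale is to stream the file in fixed-size chunks and let collections.Counter do the tallying; this sketch assumes UTF-8 plain text, and a chunk boundary may occasionally split a word.

    import collections
    import jieba

    def count_tokens(path, chunk_size=1 << 20):
        counts = collections.Counter()
        with open(path, encoding="utf-8") as f:
            while True:
                chunk = f.read(chunk_size)  # read about 1 MB of text at a time
                if not chunk:
                    break
                counts.update(jieba.lcut(chunk))  # segment, then tally
        return counts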
  • Natural Language Processing Notes

    2019-05-28 00:25:31
    Subsampling tries to reduce, as far as possible, the influence of high-frequency words on training the word-embedding model. 10.4. Subword embedding (fastText) 10.5. Global vectors for word embedding (GloVe). In some situations the cross-entropy loss function is at a disadvantage; the GloVe model adopts a squared loss and uses word vectors to fit...
  • Handy Python features

    2019-12-27 09:10:19
    # word frequency count word_counts = collections.Counter...word_counts_top = word_counts.most_common() # an argument n can be passed to get the top n most frequent words print(word_counts_top) # inspect the output word_list, count_list = list(), list() ...
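
    A self-contained version of the Counter idiom from the truncated preview above; the sample sentence is invented for illustration.

    import collections

    words = "the quick brown fox jumps over the lazy dog the fox".split()
    word_counts = collections.Counter(words)
    word_counts_top = word_counts.most_common(3)  # pass n to get only the top n
    print(word_counts_top)  # [('the', 3), ('fox', 2), ('quick', 1)]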
  • Alice in Wonderland: a Python word cloud

    2018-11-09 18:04:22
    Contents: WordCloud features; source of the text and the mask image ...(3) render the high-frequency words in color as an image. Text and mask source: https://github.com/amueller/word_cloud/tree/master/examples Word cloud without a mask from os import path...
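
    A minimal end-to-end sketch using the wordcloud package from the linked repository; alice.txt and alice.png are placeholder file names, not the post's actual assets.

    from wordcloud import WordCloud

    text = open("alice.txt", encoding="utf-8").read()
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    wc.to_file("alice.png")  # render the high-frequency words as an image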
  • Today I'm sharing a very interesting data... At the top are counts for the 5 most frequent words, with a different color for each album. Today I'd like to share an interesting data visualization project that shows how often American singer Taylor Swift uses
  • Comparative experiments were conducted on the summary quality produced under three parameters (N-gram co-occurrence frequency, topic-word coverage, and high-frequency-word coverage) by three different summarization systems: Coverage Baseline, Centroid-Based Summary, and Word Mining based Summary (WMS); the results show that WMS...
  • A word cloud, also called a text cloud, is a visual way of presenting results, often used in web-scraping data analysis. The idea is to count the high-frequency words in a text, filter out certain noise words, and generate an image from the result, so the key information in the data can be grasped at a glance. Today we'll learn the common ways Python generates word...
