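  • English stop words for word segmentation, NLP, and similar tasks
  • This resource gathers an extensive collection of stop-word lists covering both Chinese and English, with notes on building a stop-word list in PyCharm. Stop-word filtering is a preprocessing step in text analysis: it removes noise from segmentation output (for example, 的, 是, 啊).
  • Common English stop words (essential for English NLP), including basic interjections, pronouns, and question words; useful for text-related competitions and for learning natural language processing
  • Chinese and English stop-word lists commonly used when building a word segmentation system; they can be used to remove stop words from segmentation output. Common segmenters include jieba (结巴分词) and NLPIR from the Chinese Academy of Sciences.
  • Chinese and English stop-word lists gathered from several sources (Harbin Institute of Technology, Baidu, the Sichuan Artificial Intelligence Laboratory, and others), merged and cleaned into one collection for project use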
  • English stop words

    2018-10-25 11:34:13
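    Combines stop words from Harbin Institute of Technology and several other institutions with the uploader's own additions, more than 3,900 entries in total
  • For example: ㉡㉢㉣㉤㉥㉦㉧㉨㉩㉪㉫㉬㉭㉮㉯㉰㉱㉲㉳㉴㉵㉶㉷㉸㉹㉺㉻㈀㈁㈂㈃㈄㈅㈆㈇㈈㈉㈊㈋㈌㈍㈎㈏㈐㈑㈒㈓㈔㈕㈖㈗㈘㈙㈚㈛А... Chinese and English stop words, an indispensable part of word segmentation; covers all characters, supports custom editing, personally compiled.
  • Chinese and English stop words for ik-analyzer: chinese_stopword.txt, english_stopword.txt
  • Chinese and English stop words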

    2019-05-08 11:36:01
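    Contains common Chinese and English stop words, for example 的 and 如果 in Chinese and "if" and "but" in English
  • [Compiled and cleaned up in May 2020; tested and working] A collection of common Chinese and English stop words, including the Harbin Institute of Technology, Sichuan Artificial Intelligence Laboratory, and Baidu lists
  • English stop-word dictionary (for use in text tokenization); tested and working, fairly complete
  • List of English stop words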

    1,000+ views 2018-10-15 15:22:17

    To make stop-word removal easier when processing English documents, here is a list of common English stop words:

    stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours', 
                'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their', 
                'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once', 
                'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you', 
                'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will', 
                'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be', 
                'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself', 
                'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both', 
                'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn', 
                'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about', 
                'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn', 
                'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']
    

    The list above contains the English stop words; copy it into your code to use it.
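A minimal sketch of how such a list might be used, assuming plain whitespace tokenization and lowercasing (neither is specified in the post). The names `STOPLIST_SAMPLE` and `remove_stopwords` are hypothetical, and only a small sample of the list is reproduced to keep the example short. Converting the list to a `set` makes each membership test constant-time on average.

```python
# A short sample of the stop-word list; in practice, paste the full `stoplist` here.
STOPLIST_SAMPLE = ['very', 'am', 'through', 'me', 'is', 'it', 'in', 'the', 'a', 'of', 'and', 'to']

# A set gives O(1) average membership checks, unlike a list.
STOPWORDS = set(STOPLIST_SAMPLE)


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the whitespace-separated tokens of `text` that are not stop words."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


if __name__ == "__main__":
    print(remove_stopwords("The cat is in the garden"))
```

The full list also contains fragments such as 'doesn', 't', and 'll', which suggests it is meant for a tokenizer that splits contractions at the apostrophe; with whitespace splitting alone, a token like "doesn't" would not be filtered.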

  • How to remove English stop words

    1,000+ views 2018-08-07 09:23:44

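    Before running an LDA model, stop words need to be removed from the documents. Python's nltk module includes a stop-word corpus for this:
    ## Installing the nltk module
    If Anaconda is already installed, nltk comes bundled with it, but the stopwords corpus does not and has to be downloaded separately (at least it was missing from my installation):
    pip install nltk
    Then start Python:
    >>> import nltk
    >>> nltk.download('stopwords')
    Reference: https://blog.csdn.net/qq_27717921/article/details/60975835 (this link gives a more detailed introduction to nltk)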
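The install steps can be wrapped in a small loader that degrades gracefully when nltk or its stopwords corpus is missing. This is only a sketch: the function name `load_english_stopwords` and the tiny fallback tuple are hypothetical, and the function deliberately does not trigger a download itself.

```python
def load_english_stopwords(fallback=("a", "an", "and", "the", "of", "to", "in", "is")):
    """Return NLTK's English stop words as a set, or `fallback` if unavailable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: the 'stopwords' corpus has not been downloaded yet;
        # running nltk.download('stopwords') once should fix that.
        return set(fallback)


if __name__ == "__main__":
    words = load_english_stopwords()
    print(len(words), "stop words loaded")
```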

    Stop-word removal code

    from nltk.corpus import stopwords as pw

    # Use a set so that each membership test is O(1) on average.
    cacheStopWords = set(pw.words("english"))
    # Input files: documents that still contain stop words.
    trnfile_all=['D:\\合并程序代码\\NHL\\2016\\result_NHL2016.txt',
                'D:\\合并程序代码\\NHL\\2017\\result_NHL2017.txt',
                'D:\\合并程序代码\\pancreatic_cancer\\2015\\result_pancreatic_cancer2015.txt',
                'D:\\合并程序代码\\pancreatic_cancer\\2016\\result_pancreatic_cancer2016.txt',
                'D:\\合并程序代码\\pancreatic_cancer\\2017\\result_pancreatic_cancer2017.txt',
               ]
    # Output files: one per input file, matched by position in the list.
    stop_word_all=['D:\\合并程序代码\\breast_cancer\\2015\\pw_breast_cancer2015.txt',
                'D:\\合并程序代码\\breast_cancer\\2016\\pw_breast_cancer2016.txt',
                'D:\\合并程序代码\\breast_cancer\\2017\\pw_breast_cancer2017.txt',
                'D:\\合并程序代码\\colon_cancer\\2015\\pw_colon_cancer2015.txt',
                'D:\\合并程序代码\\colon_cancer\\2016\\pw_colon_cancer2016.txt',
                ]
    def stopWords():
        for filePath, outPath in zip(trnfile_all, stop_word_all):
            with open(filePath) as fin:
                lines = fin.readlines()
            # Append mode creates the output file if it does not exist yet
            # (the original 'r+' mode fails in that case), and the file is
            # opened once per document instead of once per line.
            with open(outPath, 'a', encoding='utf-8') as fout:
                for line in lines:
                    kept = ' '.join(word for word in line.split() if word not in cacheStopWords)
                    fout.write('\n' + kept)

    if __name__=="__main__":
        stopWords()
    
    

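    The first list holds the documents from which stop words have not yet been removed; the second holds the files that receive the text after stop-word removal.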
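The same input-list/output-list idea can be sketched as a reusable function that takes the stop-word set as a parameter, so it can be tried without nltk or the original `D:\` paths. Everything here is an assumption made for illustration: the names `strip_stopwords_file` and `demo`, the temporary files, the tiny stand-in stop-word set, and the case-insensitive comparison (the original script compares tokens as-is).

```python
import tempfile
from pathlib import Path


def strip_stopwords_file(src, dst, stopwords):
    """Copy `src` to `dst` line by line, dropping tokens found in `stopwords`."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            kept = [w for w in line.split() if w.lower() not in stopwords]
            fout.write(" ".join(kept) + "\n")


def demo():
    """Run the filter on a throwaway file and return the output text."""
    stopwords = {"the", "is", "a", "of"}  # tiny stand-in for the NLTK list
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.txt"
        dst = Path(tmp) / "output.txt"
        src.write_text("The tumor is a mass of cells\nSurvival is the endpoint\n", encoding="utf-8")
        strip_stopwords_file(src, dst, stopwords)
        return dst.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(demo())
```

Writing with mode "w" overwrites the output on each run, which avoids the duplicated lines that appending would produce when the script is run twice.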

  • A comprehensive list of English stop words for NLP

    1,000+ views 2020-07-06 14:56:01
    'd
    'll
    'm
    're
    's
    't
    've
    ZT
    ZZ
    a
    a's
    able
    about
    above
    abst
    accordance
    according
    accordingly
    across
    act
    actually
    added
    adj
    adopted
    affected
    affecting
    affects
    after
    afterwards
    again
    against
    ah
    ain't
    all
    allow
    allows
    almost
    alone
    along
    already
    also
    although
    always
    am
    among
    amongst
    an
    and
    announce
    another
    any
    anybody
    anyhow
    anymore
    anyone
    anything
    anyway
    anyways
    anywhere
    apart
    apparently
    appear
    appreciate
    appropriate
    approximately
    are
    area
    areas
    aren
    aren't
    arent
    arise
    around
    as
    aside
    ask
    asked
    asking
    asks
    associated
    at
    auth
    available
    away
    awfully
    b
    back
    backed
    backing
    backs
    be
    became
    because
    become
    becomes
    becoming
    been
    before
    beforehand
    began
    begin
    beginning
    beginnings
    begins
    behind
    being
    beings
    believe
    below
    beside
    besides
    best
    better
    between
    beyond
    big
    biol
    both
    brief
    briefly
    but
    by
    c
    c'mon
    c's
    ca
    came
    can
    can't
    cannot
    cant
    case
    cases
    cause
    causes
    certain
    certainly
    changes
    clear
    clearly
    co
    com
    come
    comes
    concerning
    consequently
    consider
    considering
    contain
    containing
    contains
    corresponding
    could
    couldn't
    couldnt
    course
    currently
    d
    date
    definitely
    describe
    described
    despite
    did
    didn't
    differ
    different
    differently
    discuss
    do
    does
    doesn't
    doing
    don't
    done
    down
    downed
    downing
    downs
    downwards
    due
    during
    e
    each
    early
    ed
    edu
    effect
    eg
    eight
    eighty
    either
    else
    elsewhere
    end
    ended
    ending
    ends
    enough
    entirely
    especially
    et
    et-al
    etc
    even
    evenly
    ever
    every
    everybody
    everyone
    everything
    everywhere
    ex
    exactly
    example
    except
    f
    face
    faces
    fact
    facts
    far
    felt
    few
    ff
    fifth
    find
    finds
    first
    five
    fix
    followed
    following
    follows
    for
    former
    formerly
    forth
    found
    four
    from
    full
    fully
    further
    furthered
    furthering
    furthermore
    furthers
    g
    gave
    general
    generally
    get
    gets
    getting
    give
    given
    gives
    giving
    go
    goes
    going
    gone
    good
    goods
    got
    gotten
    great
    greater
    greatest
    greetings
    group
    grouped
    grouping
    groups
    h
    had
    hadn't
    happens
    hardly
    has
    hasn't
    have
    haven't
    having
    he
    he's
    hed
    hello
    help
    hence
    her
    here
    here's
    hereafter
    hereby
    herein
    heres
    hereupon
    hers
    herself
    hes
    hi
    hid
    high
    higher
    highest
    him
    himself
    his
    hither
    home
    hopefully
    how
    howbeit
    however
    hundred
    i
    i'd
    i'll
    i'm
    i've
    id
    ie
    if
    ignored
    im
    immediate
    immediately
    importance
    important
    in
    inasmuch
    inc
    include
    indeed
    index
    indicate
    indicated
    indicates
    information
    inner
    insofar
    instead
    interest
    interested
    interesting
    interests
    into
    invention
    inward
    is
    isn't
    it
    it'd
    it'll
    it's
    itd
    its
    itself
    j
    just
    k
    keep
    keeps
    kept
    keys
    kg
    kind
    km
    knew
    know
    known
    knows
    l
    large
    largely
    last
    lately
    later
    latest
    latter
    latterly
    least
    less
    lest
    let
    let's
    lets
    like
    liked
    likely
    line
    little
    long
    longer
    longest
    look
    looking
    looks
    ltd
    m
    made
    mainly
    make
    makes
    making
    man
    many
    may
    maybe
    me
    mean
    means
    meantime
    meanwhile
    member
    members
    men
    merely
    mg
    might
    million
    miss
    ml
    more
    moreover
    most
    mostly
    mr
    mrs
    much
    mug
    must
    my
    myself
    n
    n't
    na
    name
    namely
    nay
    nd
    near
    nearly
    necessarily
    necessary
    need
    needed
    needing
    needs
    neither
    never
    nevertheless
    new
    newer
    newest
    next
    nine
    ninety
    no
    nobody
    non
    none
    nonetheless
    noone
    nor
    normally
    nos
    not
    noted
    nothing
    novel
    now
    nowhere
    number
    numbers
    o
    obtain
    obtained
    obviously
    of
    off
    often
    oh
    ok
    okay
    old
    older
    oldest
    omitted
    on
    once
    one
    ones
    only
    onto
    open
    opened
    opening
    opens
    or
    ord
    order
    ordered
    ordering
    orders
    other
    others
    otherwise
    ought
    our
    ours
    ourselves
    out
    outside
    over
    overall
    owing
    own
    p
    page
    pages
    part
    parted
    particular
    particularly
    parting
    parts
    past
    per
    perhaps
    place
    placed
    places
    please
    plus
    point
    pointed
    pointing
    points
    poorly
    possible
    possibly
    potentially
    pp
    predominantly
    present
    presented
    presenting
    presents
    presumably
    previously
    primarily
    probably
    problem
    problems
    promptly
    proud
    provides
    put
    puts
    q
    que
    quickly
    quite
    qv
    r
    ran
    rather
    rd
    re
    readily
    really
    reasonably
    recent
    recently
    ref
    refs
    regarding
    regardless
    regards
    related
    relatively
    research
    respectively
    resulted
    resulting
    results
    right
    room
    rooms
    run
    s
    said
    same
    saw
    say
    saying
    says
    sec
    second
    secondly
    seconds
    section
    see
    seeing
    seem
    seemed
    seeming
    seems
    seen
    sees
    self
    selves
    sensible
    sent
    serious
    seriously
    seven
    several
    shall
    she
    she'll
    shed
    shes
    should
    shouldn't
    show
    showed
    showing
    shown
    showns
    shows
    side
    sides
    significant
    significantly
    similar
    similarly
    since
    six
    slightly
    small
    smaller
    smallest
    so
    some
    somebody
    somehow
    someone
    somethan
    something
    sometime
    sometimes
    somewhat
    somewhere
    soon
    sorry
    specifically
    specified
    specify
    specifying
    state
    states
    still
    stop
    strongly
    sub
    substantially
    successfully
    such
    sufficiently
    suggest
    sup
    sure
    t
    t's
    take
    taken
    taking
    tell
    tends
    th
    than
    thank
    thanks
    thanx
    that
    that'll
    that's
    that've
    thats
    the
    their
    theirs
    them
    themselves
    then
    thence
    there
    there'll
    there's
    there've
    thereafter
    thereby
    thered
    therefore
    therein
    thereof
    therere
    theres
    thereto
    thereupon
    these
    they
    they'd
    they'll
    they're
    they've
    theyd
    theyre
    thing
    things
    think
    thinks
    third
    this
    thorough
    thoroughly
    those
    thou
    though
    thoughh
    thought
    thoughts
    thousand
    three
    throug
    through
    throughout
    thru
    thus
    til
    tip
    to
    today
    together
    too
    took
    toward
    towards
    tried
    tries
    truly
    try
    trying
    ts
    turn
    turned
    turning
    turns
    twice
    two
    u
    un
    under
    unfortunately
    unless
    unlike
    unlikely
    until
    unto
    up
    upon
    ups
    us
    use
    used
    useful
    usefully
    usefulness
    uses
    using
    usually
    uucp
    v
    value
    various
    very
    via
    viz
    vol
    vols
    vs
    w
    want
    wanted
    wanting
    wants
    was
    wasn't
    way
    ways
    we
    we'd
    we'll
    we're
    we've
    wed
    welcome
    well
    wells
    went
    were
    weren't
    what
    what'll
    what's
    whatever
    whats
    when
    whence
    whenever
    where
    where's
    whereafter
    whereas
    whereby
    wherein
    wheres
    whereupon
    wherever
    whether
    which
    while
    whim
    whither
    who
    who'll
    who's
    whod
    whoever
    whole
    whom
    whomever
    whos
    whose
    why
    widely
    will
    willing
    wish
    with
    within
    without
    won't
    wonder
    words
    work
    worked
    working
    works
    world
    would
    wouldn't
    www
    x
    y
    year
    years
    yes
    yet
    you
    you'd
    you'll
    you're
    you've
    youd
    young
    younger
    youngest
    your
    youre
    yours
    yourself
    yourselves
    z
    zero
    zt
    zz
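A list in this one-word-per-line layout can be loaded into a set and applied to tokens. The sketch below is an assumption-laden illustration: `SAMPLE_LIST` holds only a handful of entries, the helper names are hypothetical, and the regular-expression tokenizer (which keeps an apostrophe inside a word so that entries like "can't" can match) is one possible choice that the post does not prescribe.

```python
import re

# A few entries in the same one-word-per-line layout as the full list.
SAMPLE_LIST = """\
'll
a
about
can't
don't
i
the
you
"""


def parse_stopword_list(raw):
    """Turn one-word-per-line text into a set, skipping blank lines."""
    return {line.strip().lower() for line in raw.splitlines() if line.strip()}


def tokenize(text):
    """Lowercase `text` and keep apostrophes inside words, e.g. "don't"."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def filter_tokens(text, stopwords):
    """Return the tokens of `text` that are not in `stopwords`."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


if __name__ == "__main__":
    stopwords = parse_stopword_list(SAMPLE_LIST)
    print(filter_tokens("You can't publish the genome", stopwords))
```

Entries such as 'll and n't in the full list only match if the tokenizer emits those fragments as separate tokens, so the tokenizer and the list need to agree on how contractions are split.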

     

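  • English stop words such as 'a' and 'can' contribute nothing to text analysis, so they are removed during preprocessing. The following uses Java to remove stop words; the program can also be adapted to extract high-frequency and low-frequency words.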

    Author: Qian Yang, School of Management, Hefei University of Technology. Email: 1563178220@qq.com. There may be oversights in this post; feedback is welcome.

    Preprocessing English corpora

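    When preprocessing an English corpus, we usually stem the words and then remove stop words. English stop words such as 'a' and 'can' contribute nothing to text analysis, so they are removed during preprocessing. The following Java program removes stop words; it can also be adapted to extract high-frequency and low-frequency words.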
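The adaptation to high-frequency and low-frequency words can be sketched briefly. The example is in Python rather than Java and is only an illustration under stated assumptions: the function name `frequency_bands` and its `top_n` and `rare_max` thresholds are hypothetical, stemming is omitted, and the stop-word set is a tiny stand-in.

```python
from collections import Counter


def frequency_bands(tokens, stopwords, top_n=2, rare_max=1):
    """Return (most frequent words, rare words) after stop-word removal."""
    counts = Counter(tok.lower() for tok in tokens if tok.lower() not in stopwords)
    # The top_n most frequent remaining words.
    high = [word for word, _ in counts.most_common(top_n)]
    # Words occurring at most rare_max times, sorted alphabetically.
    low = sorted(word for word, n in counts.items() if n <= rare_max)
    return high, low


if __name__ == "__main__":
    stopwords = {"a", "an", "can", "the", "to", "and"}
    text = "the api can list instances and the api can stop instances to manage an api key"
    print(frequency_bands(text.split(), stopwords))
```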

    Removing English stop words in Java

    package clouddataprocess;
    
    
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Date;
    import java.util.Enumeration;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Vector;
    public class Stopwords {
    
      /** The hash set containing the list of stopwords */
      protected HashSet m_Words = null;
    
      /** The default stopwords object (stoplist based on Rainbow) */
      protected static Stopwords m_Stopwords;
    
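      // The block below is a static initializer: it runs automatically when the class is loaded, whereas a static method runs only when it is called through the class name. A static initializer also runs before any constructor.
      // For details, see http://www.cnblogs.com/panjun-Donet/archive/2010/08/10/1796209.html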
      static {
        if (m_Stopwords == null) {
          m_Stopwords = new Stopwords();
        }
      }
    
      /**
       * initializes the stopwords (based on <a href="http://www.cs.cmu.edu/~mccallum/bow/rainbow/" target="_blank">Rainbow</a>).
       */
      // Constructor
      public Stopwords() {
        m_Words = new HashSet();
    
        //Stopwords list from Rainbow
        add("a");
        add("able");
        add("about");
        add("above");
        add("according");
        add("accordingly");
        add("across");
        add("actually");
        add("after");
        add("afterwards");
        add("again");
    //    add("against");
        add("all");
    //    add("allow");
    //    add("allows");
    //    add("almost");
        add("alone");
        add("along");
        add("already");
        add("also");
    //    add("although");
        add("always");
        add("am");
        add("among");
        add("amongst");
        add("an");
        add("and");
        add("another");
        add("any");
        add("anybody");
        add("anyhow");
        add("anyone");
        add("anything");
        add("anyway");
        add("anyways");
        add("anywhere");
        add("apart");
        add("appear");
    //    add("appreciate");
        add("appropriate");
        add("are");
        add("around");
        add("as");
        add("aside");
        add("ask");
        add("asking");
        add("associated");
        add("at");
        add("available");
        add("away");
    //    add("awfully");
        add("b");
        add("be");
        add("became");
        add("because");
        add("become");
        add("becomes");
        add("becoming");
        add("been");
        add("before");
        add("beforehand");
        add("behind");
        add("being");
        add("believe");
        add("below");
        add("beside");
        add("besides");
    //    add("best");
    //    add("better");
        add("between");
        add("beyond");
        add("both");
        add("but");    
        add("brief");
        add("by");
        add("c");
        add("came");
        add("can");
        add("certain");
        add("certainly");
        add("clearly");
        add("co");
        add("com");
        add("come");
        add("comes");
        add("contain");
        add("containing");
        add("contains");
        add("corresponding");
        add("could");
        add("course");
        add("currently");
        add("d");
        add("definitely");
        add("described");
        add("despite");
        add("did");
        add("different");
        add("do");
        add("does");
        add("doing");
        add("done");
        add("down");
        add("downwards");
        add("during");
        add("e");
        add("each");
        add("edu");
        add("eg");
        add("eight");
        add("either");
        add("else");
        add("elsewhere");
        add("enough");
        add("entirely");
        add("especially");
        add("et");
        add("etc");
        add("even");
        add("ever");
        add("every");
        add("everybody");
        add("everyone");
        add("everything");
        add("everywhere");
        add("ex");
        add("exactly");
        add("example");
        add("except");
        add("f");
        add("far");
        add("few");
        add("fifth");
        add("first");
        add("five");
        add("followed");
        add("following");
        add("follows");
        add("for");
        add("former");
        add("formerly");
        add("forth");
        add("four");
        add("from");
        add("further");
        add("furthermore");
        add("g");
        add("get");
        add("gets");
        add("getting");
        add("given");
        add("gives");
        add("go");
        add("goes");
        add("going");
        add("gone");
        add("got");
        add("gotten");
    //    add("greetings");
        add("h");
        add("had");
        add("happens");
    //    add("hardly");
        add("has");
        add("have");
        add("having");
        add("he");
        add("hello");
        add("help");
        add("hence");
        add("her");
        add("here");
        add("hereafter");
        add("hereby");
        add("herein");
        add("hereupon");
        add("hers");
        add("herself");
        add("hi");
        add("him");
        add("himself");
        add("his");
        add("hither");
    //    add("hopefully");
        add("how");
        add("howbeit");
        add("however");
        add("i");
        add("ie");
        add("if");
    //    add("ignored");
        add("immediate");
        add("in");
        add("inasmuch");
        add("inc");
        add("indeed");
        add("indicate");
        add("indicated");
        add("indicates");
        add("inner");
        add("insofar");
        add("instead");
        add("into");
        add("inward");
        add("is");
        add("it");
        add("its");
        add("itself");
        add("j");
        add("just");
        add("k");
        add("keep");
        add("keeps");
        add("kept");
        add("know");
        add("knows");
        add("known");
        add("l");
        add("last");
        add("lately");
        add("later");
        add("latter");
        add("latterly");
        add("least");
        add("less");
        add("lest");
        add("let");
        add("like");
        add("liked");
        add("likely");
        add("little");
        add("ll"); //added to avoid words like you'll,I'll etc.
        add("look");
        add("looking");
        add("looks");
        add("ltd");
        add("m");
        add("mainly");
        add("many");
        add("may");
        add("maybe");
        add("me");
    //    add("mean");
        add("meanwhile");
    //    add("merely");
        add("might");
        add("more");
        add("moreover");
        add("most");
        add("mostly");
        add("much");
        add("must");
        add("my");
        add("myself");
        add("n");
        add("name");
        add("namely");
        add("nd");
        add("near");
        add("nearly");
        add("necessary");
        add("need");
        add("needs");
    //    add("neither");
    //    add("never");
    //    add("nevertheless");
        add("new");
        add("next");
        add("nine");
        add("normally");
    //    add("novel");
        add("no");
        add("nobody");
        add("non");
        add("none");
        add("noone");
        add("nor");
        add("not");
        add("n't");
        add("nothing");
        add("novel");
        add("now");
        add("nowhere");
        add("o");
        add("obviously");
        add("of");
        add("off");
        add("often");
        add("oh");
        add("ok");
        add("okay");
    //    add("old");
        add("on");
        add("once");
        add("one");
        add("ones");
        add("only");
        add("onto");
        add("or");
        add("other");
        add("others");
        add("otherwise");
        add("ought");
        add("our");
        add("ours");
        add("ourselves");
        add("out");
        add("outside");
        add("over");
        add("overall");
        add("own");
        add("p");
        add("particular");
        add("particularly");
        add("per");
        add("perhaps");
        add("placed");
        add("please");
        add("plus");
        add("possible");
        add("presumably");
        add("probably");
        add("provides");
        add("q");
        add("que");
        add("quite");
        add("qv");
        add("r");
        add("rather");
        add("rd");
        add("re");
        add("really");
        add("reasonably");
        add("regarding");
        add("regardless");
        add("regards");
        add("relatively");
        add("respectively");
        add("right");
        add("s");
        add("said");
        add("same");
        add("saw");
        add("say");
        add("saying");
        add("says");
        add("second");
        add("secondly");
        add("see");
        add("seeing");
    //    add("seem");
    //    add("seemed");
    //    add("seeming");
    //    add("seems");
        add("seen");
        add("self");
        add("selves");
        add("sensible");
        add("sent");
       // add("serious");
       // add("seriously");
        add("seven");
        add("several");
        add("shall");
        add("she");
        add("should");
        add("since");
        add("six");
        add("so");
        add("some");
        add("somebody");
        add("somehow");
        add("someone");
        add("something");
        add("sometime");
        add("sometimes");
        add("somewhat");
        add("somewhere");
        add("soon");
        add("sorry");
        add("specified");
        add("specify");
        add("specifying");
        add("still");
        add("sub");
        add("such");
        add("sup");
        add("sure");
        add("t");
        add("take");
        add("taken");
        add("tell");
        add("tends");
        add("th");
        add("than");
        add("thank");
        add("thanks");
        add("thanx");
        add("that");
        add("thats");
        add("the");
        add("their");
        add("theirs");
        add("them");
        add("themselves");
        add("then");
        add("thence");
        add("there");
        add("thereafter");
        add("thereby");
        add("therefore");
        add("therein");
        add("theres");
        add("thereupon");
        add("these");
        add("they");
        add("think");
        add("third");
        add("this");
        add("thorough");
        add("thoroughly");
        add("those");
        add("though");
        add("three");
        add("through");
        add("throughout");
        add("thru");
        add("thus");
        add("to");
        add("together");
        add("too");
        add("took");
        add("toward");
        add("towards");
        add("tried");
        add("tries");
        add("truly");
        add("try");
        add("trying");
        add("twice");
        add("two");
        add("u");
        add("un");
        add("under");
    //    add("unfortunately");
    //    add("unless");
    //    add("unlikely");
        add("until");
        add("unto");
        add("up");
        add("upon");
        add("us");
        add("use");
        add("used");
    //    add("useful");
        add("uses");
        add("using");
        add("usually");
        add("uucp");
        add("v");
        add("value");
        add("various");
        add("ve"); //added to avoid words like I've,you've etc.
        add("very");
        add("via");
        add("viz");
        add("vs");
        add("w");
        add("want");
        add("wants");
        add("was");
    //    add("way");
        add("we");
        add("welcome");
    //    add("well");
        add("went");
        add("were");
        add("what");
    //    add("whatever");
        add("when");
        add("whence");
        add("whenever");
        add("where");
        add("whereafter");
        add("whereas");
        add("whereby");
        add("wherein");
        add("whereupon");
        add("wherever");
        add("whether");
        add("which");
        add("while");
        add("whither");
        add("who");
        add("whoever");
        add("whole");
        add("whom");
        add("whose");
        add("why");
        add("will");
        add("willing");
        add("wish");
        add("with");
        add("within");
        add("without");
        add("wonder");
        add("would");
        add("x");
        add("y");
    //    add("yes");
        add("yet");
        add("you");
        add("your");
        add("yours");
        add("yourself");
        add("yourselves");
        add("z");
        add("zero");
        // add new
        add("i'm");
        add("he's");
        add("she's");
        add("you're");
        add("i'll");
        add("you'll");
        add("she'll");
        add("he'll");
        add("it's");
        add("don't");
        add("can't");
        add("didn't");
        add("i've");
        add("that's");
        add("there's");
        add("isn't");
        add("what's");
        add("rt");
        add("doesn't");
        add("w/");
        add("w/o");
      }
    
      /**
       * removes all stopwords
       */
      public void clear() {
        m_Words.clear();
      }
    
      /**
       * adds the given word to the stopword list (is automatically converted to
       * lower case and trimmed, i.e. leading and trailing whitespace is removed)
       *
       * @param word: the word to add
       */
      // Adds the given word to the stop-word list
      public void add(String word) {
        if (word.trim().length() > 0)
          m_Words.add(word.trim().toLowerCase());
      }
    
      /**
       * removes the word from the stopword list
       *
       * @param word: the word to remove
       * @return true if the word was found in the list and then removed
       */
      public boolean remove(String word) {
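        // Normalize the same way add() does, so that remove("The ") matches the stored "the"
        return m_Words.remove(word.trim().toLowerCase());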
      }
    
      /**
       * Returns a sorted enumeration over all stored stopwords
       *
       * @return the enumeration over all stopwords
       */
      public Enumeration elements() {
        Iterator    iter;
        Vector      list;
    
        iter = m_Words.iterator();
        list = new Vector();
    
        while (iter.hasNext())
          list.add(iter.next());
    
        // sort list
        Collections.sort(list);
    
        return list.elements();
      }
    
      /**
       * Generates a new Stopwords object from the given file
       *
       * @param filename: the file to read the stopwords from
       * @throws Exception if reading fails
       */
      public void read(String filename) throws Exception {
        read(new File(filename));
      }
    
      /**
       * Generates a new Stopwords object from the given file
       *
       * @param file: the file to read the stopwords from
       * @throws Exception if reading fails
       */
      public void read(File file) throws Exception {
        read(new BufferedReader(new FileReader(file)));
      }
    
      /**
       * Generates a new Stopwords object from the reader. The reader is
       * closed automatically.
       *
       * @param reader: the reader to get the stopwords from
       * @throws Exception if reading fails
       */
      public void read(BufferedReader reader) throws Exception {
        String   line;
    
        clear();
    
        while ((line = reader.readLine()) != null) {
          line = line.trim();
          // comment?
          if (line.startsWith("#"))
            continue;
          add(line);
        }
    
        reader.close();
      }
    
      /**
       * Writes the current stopwords to the given file
       *
       * @param filename the file to write the stopwords to
       * @throws Exception if writing fails
       */
      public void write(String filename) throws Exception {
        write(new File(filename));
      }
    
      /**
       * Writes the current stopwords to the given file
       *
       * @param file the file to write the stopwords to
       * @throws Exception if writing fails
       */
      public void write(File file) throws Exception {
        write(new BufferedWriter(new FileWriter(file)));
      }
    
      /**
       * Writes the current stopwords to the given writer. The writer is closed
       * automatically.
       *
       * @param writer the writer to write the stopwords to
       * @throws Exception if writing fails
       */
      public void write(BufferedWriter writer) throws Exception {
        Enumeration   enm;
    
        // header
        writer.write("# generated " + new Date());
        writer.newLine();
    
        enm = elements();
    
        while (enm.hasMoreElements()) {
          writer.write(enm.nextElement().toString());
          writer.newLine();
        }
    
        writer.flush();
        writer.close();
      }
    
      /**
       * returns the current stopwords in a string
       *
       * @return the current stopwords
       */
      public String toString() {
        Enumeration   enm;
        StringBuffer  result;
    
        result = new StringBuffer();
        enm    = elements();
        while (enm.hasMoreElements()) {
          result.append(enm.nextElement().toString());
          if (enm.hasMoreElements())
            result.append(",");
        }
    
        return result.toString();
      }
    
      /** 
       * Returns true if the given string is a stop word.
       * 
       * @param word the word to test
       * @return true if the word is a stopword
       */
      public boolean is(String word) {
        return m_Words.contains(word.toLowerCase());
      }
    
      /**
       * Returns true if the given string is a stop word.
       *
       * @param str the word to test
       * @return true if the word is a stopword
       */
      // m_Stopwords is an instance of the Stopwords class, so its instance methods can be called through it
      public static boolean isStopword(String str) {
        return m_Stopwords.is(str.toLowerCase());
      }
      // Test program
      public static void main(String[] args) throws UnsupportedEncodingException, FileNotFoundException {
    
            String text = "Gridspot can link up idle computers instances across the world to provide large scale efforts with the computing power they require at affordable prices 0103 centsCPU hour These Linux instances run Ubuntu inside a virtual machine You are able to bid on access to these instances and specify the requirements of your tasks or jobs When your bid is fulfilled you can start running the instances using SSH anywhere youd like There are grant options available to defray costs for researchers and nonprofits The Gridspot API allows you to manage instances and identify new ones You can list available instances access them and stop the instances if you so choose Each API call requires an API key that can be generated from your account page";
            String[] wordarr = text.split("\\s+");
            ArrayList<String> words = new ArrayList<String>();
            for (int i = 0; i < wordarr.length; i++) {
                words.add(wordarr[i]);
            }
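            // Remove stop words; after each removal, step the index back so the
            // element that shifts into position i is not skipped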
            for(int i = 0; i < words.size(); i++){
                if(Stopwords.isStopword(words.get(i))){
                    words.remove(i);
                    i--;
                }
            }
            String textremoveStopword = "";
            for (int i = 0; i < words.size(); i++) {
                textremoveStopword += words.get(i)+" ";
            }
            System.out.println(textremoveStopword);
      }
    
    }
    

    Here is an example; the program produces the following output:

    Gridspot link idle computers instances world provide large scale efforts computing power require affordable prices 0103 centsCPU hour Linux instances run Ubuntu inside virtual machine bid access instances requirements tasks jobs bid fulfilled start running instances SSH youd grant options defray costs researchers nonprofits Gridspot API allows manage instances identify list instances access stop instances choose API call requires API key generated account page

    As you can see, stop words such as 'can' and 'the' have been removed.
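The filtering step in the test program above can also be sketched without index bookkeeping. The following is a minimal, self-contained sketch and not part of the original class: the class name `StopwordFilterSketch`, the methods `load` and `removeStopwords`, and the three-word sample list are all hypothetical. It assumes the same file convention as `read` above (one word per line, lines starting with `#` are comments) and matches words case-insensitively.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration; not the Stopwords class shown above.
public class StopwordFilterSketch {

    /** Loads one stop word per line, skipping blank lines and '#' comment lines. */
    public static Set<String> load(BufferedReader reader) throws IOException {
        Set<String> words = new HashSet<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            words.add(line.toLowerCase(Locale.ROOT));
        }
        return words;
    }

    /** Returns the whitespace-separated tokens that are not stop words, in their original order. */
    public static List<String> removeStopwords(String text, Set<String> stopwords) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(token -> !stopwords.contains(token.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed sample list with three entries; a real list would be much longer.
        String listFile = "# sample list\ncan\nthe\nto\n";
        Set<String> stopwords;
        try (BufferedReader reader = new BufferedReader(new StringReader(listFile))) {
            stopwords = load(reader);
        }
        List<String> kept = removeStopwords(
                "Gridspot can link up idle computers across the world", stopwords);
        System.out.println(String.join(" ", kept));
    }
}
```

Filtering into a new list avoids removing elements from an `ArrayList` while iterating over it by index, and joining with `String.join` avoids the trailing space left by repeated string concatenation.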

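  • This should be a fairly complete list; it covers most Chinese and English punctuation marks and special symbols.
  • Removing English stop words with nltk

    1,000+ reads 2019-12-02 15:39:00
    Of course, in many tasks (dialogue tasks, for example), the stop words also include the following symbols and suffixes: [ '!' , ',' , '.' , '?' , '-s' , '-ly' , '</s> ' , 's' ] Add them with the code below: for w in [ '!' , ',' ...
  • This resource contains several versions of Chinese and English stop word lists. In information retrieval, stop words are the characters or words that are automatically filtered out before or after processing natural language data (or text) in order to save storage space and improve search efficiency. These stop words...
  • Machine learning and data analysis: Chinese and English stop word collection 2
  • Yesterday I spent ages looking for a free stop word list for a project and could not find one, which was infuriating. So I typed the stop words from a PDF on Docin into a txt file (luckily QQ has a scan feature that recognizes text in images!). Here is the Baidu Netdisk link to the Chinese and English stop word lists; take what you need. Extraction code: 9g92 ...
  • English stop word list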

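    1,000+ reads 2018-11-19 17:32:04
    Stop words are words that occur frequently in the text we process but carry no statistical significance. Generally, when processing statistical text information...I collected the Chinese and English stop words that people have posted online for later use. English stop words able about above according accordingly across a...
  • English stop word dictionary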

    2016-11-09 21:55:00
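    English stop word dictionary: 'd 'll 'm 're 's 't 've ZT ZZ a a's able about above abst accordance according accordingly across act actually added adj adopted affected affecting ...
  • Article summary: if you only want the Chinese stop word list, go straight to the end of the article and download the project files. This post and its links are updated periodically. Last updated 2017/07/04 (second update).
  • stopwords: multilingual stop word lists in R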
  • 900英文停用词.txt

    2020-04-05 11:01:46
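    Contains 900+ English stop words, which can be used when building word clouds to remove meaningless noise words. Copyright notice: downloaded resources are for personal study only; do not use them commercially. Violators will be held liable.
  • Dictionary of common stop words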

    2018-09-28 20:41:07
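    Contains common stop words (including general English words). Useful for stop word filtering in NLP work.
  • A personal collection of common Chinese and English stop word lists; I hope it is useful to you.
  • Common English stop words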

    2019-05-15 16:59:46
    About 1,200 characters in total; the full document can be downloaded at https://download.csdn.net/download/qq_32251537/11166157
  • English stop words (reposted)

    2017-05-18 17:29:01
    This article is reposted from http://blog.csdn.net/shijiebei2009/article/details/39696523 Complete list of English stop words 'd 'll 'm 're 's 't 've ZT ZZ a a's able about above abst accordance according ac
