  • Python English word segmentation

    10k+ views 2015-12-03 20:19:14

    Python English word segmentation and an inverted word index

    【1. General multi-word query】

    '''
    Created on 2015-11-18
    '''
    #encoding=utf-8
    
    # List Of English Stop Words
    # http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
    _WORD_MIN_LENGTH = 3
    _STOP_WORDS = frozenset([
    'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again', 
    'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
    'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
    'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
    'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been', 
    'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 
    'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can', 
    'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 
    'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 
    'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 
    'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 
    'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 
    'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
    'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 
    'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 
    'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc', 
    'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 
    'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 
    'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 
    'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 
    'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 
    'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
    'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
    'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
    'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 
    'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
    'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 
    'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 
    'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 
    'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third',
    'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 
    'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 
    'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 
    'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
    'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 
    'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 
    'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
    'yourselves', 'the'])
    
    def word_split_out(text):
        word_list = []
        wcurrent = []
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append(word)
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append(word)
    
        return word_list
    
    def word_split(text):
        """
        Split a text into words. Returns a list of (location, word) tuples,
        where location is the ordinal position of the word in the text.
        """
        word_list = []
        wcurrent = []
        windex = 0
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append((windex, word))
                windex += 1
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex, word))
            windex += 1
    
        return word_list
    
    def words_cleanup(words):
        """
        Remove words shorter than a minimum length, as well as stopwords.
        """
        cleaned_words = []
        for index, word in words:
            if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
                continue
            cleaned_words.append((index, word))
        return cleaned_words
    
    def words_normalize(words):
        """
        Do a normalization process on words. In this case it is just lower-casing,
        but you could add accent stripping, conversion to singular and so on...
        """
        normalized_words = []
        for index, word in words:
            wnormalized = word.lower()
            normalized_words.append((index, wnormalized))
        return normalized_words
    
    def word_index(text):
        """
        Just a helper method to process a text.
        It calls word split, normalize and cleanup.
        """
        words = word_split(text)
        words = words_normalize(words)
        words = words_cleanup(words)
        return words
    
    def inverted_index(text):
        """
        Create an Inverted-Index of the specified text document.
            {word:[locations]}
        """
        inverted = {}
    
        for index, word in word_index(text):
            locations = inverted.setdefault(word, [])
            locations.append(index)
    
        return inverted
    
    def inverted_index_add(inverted, doc_id, doc_index):
        """
        Add the Inverted-Index doc_index of the document doc_id to the
        Multi-Document Inverted-Index (inverted), 
        using doc_id as document identifier.
            {word:{doc_id:[locations]}}
        """
        for word, locations in doc_index.iteritems():
            indices = inverted.setdefault(word, {})
            indices[doc_id] = locations
        return inverted
    
    def search(inverted, query):
        """
        Returns the set of document ids that contain all the words in the query.
        """
        words = [word for _, word in word_index(query) if word in inverted]
        results = [set(inverted[word].keys()) for word in words]
        return reduce(lambda x, y: x & y, results) if results else []
    
    if __name__ == '__main__':
        doc1 = """
    Niners head coach Mike Singletary will let Alex Smith remain his starting 
    quarterback, but his vote of confidence is anything but a long-term mandate.
    Smith now will work on a week-to-week basis, because Singletary has voided 
    his year-long lease on the job.
    "I think from this point on, you have to do what's best for the football team,"
    Singletary said Monday, one day after threatening to bench Smith during a 
    27-24 loss to the visiting Eagles.
    """
    
        doc2 = """
    The fifth edition of West Coast Green, a conference focusing on "green" home 
    innovations and products, rolled into San Francisco's Fort Mason last week 
    intent, per usual, on making our living spaces more environmentally friendly 
    - one used-tire house at a time.
    To that end, there were presentations on topics such as water efficiency and 
    the burgeoning future of Net Zero-rated buildings that consume no energy and 
    produce no carbon emissions.
    """
    
        # Build Inverted-Index for documents
        inverted = {}
        documents = {'doc1':doc1, 'doc2':doc2}
        for doc_id, text in documents.iteritems():
            doc_index = inverted_index(text)
            inverted_index_add(inverted, doc_id, doc_index)
    
        # Print Inverted-Index
        for word, doc_locations in inverted.iteritems():
            print word, doc_locations
    
        # Search something and print results
        queries = ['Week', 'Niners week', 'West-coast Week']
        for query in queries:
            result_docs = search(inverted, query)
            print "Search for '%s': %r" % (query, result_docs)
            for _, word in word_index(query):
                def extract_text(doc, index): 
                    word_list = word_split_out(documents[doc])
                    word_string = ""
                    for i in range(index, index +4):
                        word_string += word_list[i] + " "
                    word_string = word_string.replace("\n", "")
                    return word_string
    
                for doc in result_docs:
                    for index in inverted[word][doc]:
                        print '   - %s...' % extract_text(doc, index)
            print

    【2. Phrase query】

    '''
    Created on 2015-11-18
    '''
    #encoding=utf-8
    
    # List Of English Stop Words
    # http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
    _WORD_MIN_LENGTH = 3
    _STOP_WORDS = frozenset([
    'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again', 
    'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
    'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
    'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
    'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been', 
    'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 
    'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can', 
    'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 
    'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 
    'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 
    'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 
    'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 
    'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
    'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 
    'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 
    'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc', 
    'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 
    'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 
    'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 
    'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 
    'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 
    'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
    'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
    'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
    'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 
    'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
    'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 
    'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 
    'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 
    'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third',
    'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 
    'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 
    'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 
    'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
    'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 
    'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 
    'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
    'yourselves', 'the'])
    
    def word_split_out(text):
        word_list = []
        wcurrent = []
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append(word)
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append(word)
    
        return word_list
    
    def word_split(text):
        """
        Split a text into words. Returns a list of (location, word) tuples,
        where location is the ordinal position of the word in the text.
        """
        word_list = []
        wcurrent = []
        windex = 0
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append((windex, word))
                windex += 1
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex, word))
            windex += 1
    
        return word_list
    
    def words_cleanup(words):
        """
        Remove words shorter than a minimum length, as well as stopwords.
        """
        cleaned_words = []
        for index, word in words:
            if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
                continue
            cleaned_words.append((index, word))
        return cleaned_words
    
    def words_normalize(words):
        """
        Do a normalization process on words. In this case it is just lower-casing,
        but you could add accent stripping, conversion to singular and so on...
        """
        normalized_words = []
        for index, word in words:
            wnormalized = word.lower()
            normalized_words.append((index, wnormalized))
        return normalized_words
    
    def word_index(text):
        """
        Just a helper method to process a text.
        It calls word split, normalize and cleanup.
        """
        words = word_split(text)
        words = words_normalize(words)
        words = words_cleanup(words)
        return words
    
    def inverted_index(text):
        """
        Create an Inverted-Index of the specified text document.
            {word:[locations]}
        """
        inverted = {}
    
        for index, word in word_index(text):
            locations = inverted.setdefault(word, [])
            locations.append(index)
    
        return inverted
    
    def inverted_index_add(inverted, doc_id, doc_index):
        """
        Add the Inverted-Index doc_index of the document doc_id to the
        Multi-Document Inverted-Index (inverted), 
        using doc_id as document identifier.
            {word:{doc_id:[locations]}}
        """
        for word, locations in doc_index.iteritems():
            indices = inverted.setdefault(word, {})
            indices[doc_id] = locations
        return inverted
    
    def search(inverted, query):
        """
        Returns the set of document ids that contain all the words in the query.
        """
        words = [word for _, word in word_index(query) if word in inverted]
        results = [set(inverted[word].keys()) for word in words]
        return reduce(lambda x, y: x & y, results) if results else []
    
    def distance_between_word(word_index_1, word_index_2, distance):
        """
        Check whether the distance between the two word positions equals the given distance.
        """ 
        distance_list = []
        for index_1 in word_index_1:
            for index_2 in word_index_2:
                if (index_1 < index_2):
                    if(index_2 - index_1 == distance):
                        distance_list.append(index_1)
                else:
                    continue        
        return distance_list
    
    def extract_text(doc, index): 
        """
        Output search results
        """
        word_list = word_split_out(documents[doc])
        word_string = ""
        for i in range(index, index +4):
            word_string += word_list[i] + " "
        word_string = word_string.replace("\n", "")
        return word_string
    
    if __name__ == '__main__':
        doc1 = """
    Niners head coach Mike Singletary will let Alex Smith remain his starting 
    quarterback, but his vote of confidence is anything but a long-term mandate.
    Smith now will work on a week-to-week basis, because Singletary has voided 
    his year-long lease on the job.
    "I think from this point on, you have to do what's best for the football team,"
    Singletary said Monday, one day after threatening to bench Smith during a 
    27-24 loss to the visiting Eagles.
    """
    
        doc2 = """
    The fifth edition of West Coast Green, a conference focusing on "green" home 
    innovations and products, rolled into San Francisco's Fort Mason last week 
    intent, per usual, on making our living spaces more environmentally friendly 
    - one used-tire house at a time.
    To that end, there were presentations on topics such as water efficiency and 
    the burgeoning future of Net Zero-rated buildings that consume no energy and 
    produce no carbon emissions.
    """
    
        # Build Inverted-Index for documents
        inverted = {}
        documents = {'doc1':doc1, 'doc2':doc2}
        for doc_id, text in documents.iteritems():
            doc_index = inverted_index(text)
            inverted_index_add(inverted, doc_id, doc_index)
    
        # Print Inverted-Index
        for word, doc_locations in inverted.iteritems():
            print word, doc_locations
    
        # Search something and print results
        queries = ['Week', 'water efficiency', 'Singletary said Monday']
        for query in queries:
            result_docs = search(inverted, query)
            print "Search for '%s': %r" % (query, result_docs)
            query_word_list = word_index(query)
            for doc in result_docs:
                index_first = []
                distance = 1
                for _, word in query_word_list:
                    index_second = inverted[word][doc]
                    index_new = []
                    if(index_first != []):
                        index_first = distance_between_word(index_first, index_second, distance)
                        distance += 1
                    else:
                        index_first = index_second
                for index in index_first:
                    print '   - %s...' % extract_text(doc, index)
                
            print

    【3. Proximity query】

    '''
    Created on 2015-11-18
    '''
    #encoding=utf-8
    
    # List Of English Stop Words
    # http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/
    _WORD_MIN_LENGTH = 3
    _STOP_WORDS = frozenset([
    'a', 'about', 'above', 'above', 'across', 'after', 'afterwards', 'again', 
    'against', 'all', 'almost', 'alone', 'along', 'already', 'also','although',
    'always','am','among', 'amongst', 'amoungst', 'amount',  'an', 'and', 'another',
    'any','anyhow','anyone','anything','anyway', 'anywhere', 'are', 'around', 'as',
    'at', 'back','be','became', 'because','become','becomes', 'becoming', 'been', 
    'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 
    'between', 'beyond', 'bill', 'both', 'bottom','but', 'by', 'call', 'can', 
    'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 
    'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 
    'either', 'eleven','else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 
    'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 
    'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 
    'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get',
    'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 
    'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 
    'himself', 'his', 'how', 'however', 'hundred', 'ie', 'if', 'in', 'inc', 
    'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 
    'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 
    'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 
    'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 
    'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 
    'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only',
    'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out',
    'over', 'own','part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same',
    'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 
    'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 
    'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 
    'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 
    'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 
    'therefore', 'therein', 'thereupon', 'these', 'they', 'thickv', 'thin', 'third',
    'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 
    'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 
    'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 
    'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter',
    'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 
    'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 
    'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself',
    'yourselves', 'the'])
    
    def word_split_out(text):
        word_list = []
        wcurrent = []
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append(word)
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append(word)
    
        return word_list
    
    def word_split(text):
        """
        Split a text into words. Returns a list of (location, word) tuples,
        where location is the ordinal position of the word in the text.
        """
        word_list = []
        wcurrent = []
        windex = 0
    
        for i, c in enumerate(text):
            if c.isalnum():
                wcurrent.append(c)
            elif wcurrent:
                word = u''.join(wcurrent)
                word_list.append((windex, word))
                windex += 1
                wcurrent = []
    
        if wcurrent:
            word = u''.join(wcurrent)
            word_list.append((windex, word))
            windex += 1
    
        return word_list
    
    def words_cleanup(words):
        """
        Remove words shorter than a minimum length, as well as stopwords.
        """
        cleaned_words = []
        for index, word in words:
            if len(word) < _WORD_MIN_LENGTH or word in _STOP_WORDS:
                continue
            cleaned_words.append((index, word))
        return cleaned_words
    
    def words_normalize(words):
        """
        Do a normalization process on words. In this case it is just lower-casing,
        but you could add accent stripping, conversion to singular and so on...
        """
        normalized_words = []
        for index, word in words:
            wnormalized = word.lower()
            normalized_words.append((index, wnormalized))
        return normalized_words
    
    def word_index(text):
        """
        Just a helper method to process a text.
        It calls word split, normalize and cleanup.
        """
        words = word_split(text)
        words = words_normalize(words)
        words = words_cleanup(words)
        return words
    
    def inverted_index(text):
        """
        Create an Inverted-Index of the specified text document.
            {word:[locations]}
        """
        inverted = {}
    
        for index, word in word_index(text):
            locations = inverted.setdefault(word, [])
            locations.append(index)
    
        return inverted
    
    def inverted_index_add(inverted, doc_id, doc_index):
        """
        Add the Inverted-Index doc_index of the document doc_id to the
        Multi-Document Inverted-Index (inverted), 
        using doc_id as document identifier.
            {word:{doc_id:[locations]}}
        """
        for word, locations in doc_index.iteritems():
            indices = inverted.setdefault(word, {})
            indices[doc_id] = locations
        return inverted
    
    def search(inverted, query):
        """
        Returns the set of document ids that contain all the words in the query.
        """
        words = [word for _, word in word_index(query) if word in inverted]
        results = [set(inverted[word].keys()) for word in words]
        return reduce(lambda x, y: x & y, results) if results else []
    
    def distance_between_word(word_index_1, word_index_2, distance):
        """
        Check whether the distance between the two word positions is at most the given distance.
        """ 
        distance_list = []
        for index_1 in word_index_1:
            for index_2 in word_index_2:
                if (index_1 < index_2):
                    if(index_2 - index_1 <= distance):
                        distance_list.append(index_1)
                else:
                    continue       
        return distance_list
    
    def extract_text(doc, index): 
        """
        Output search results
        """
        word_list = word_split_out(documents[doc])
        word_string = ""
        for i in range(index, index + 7):
            word_string += word_list[i] + " "
        word_string = word_string.replace("\n", "")
        return word_string
    
    if __name__ == '__main__':
        doc1 = """
    Niners head coach Mike Singletary will let Alex Smith remain his starting 
    quarterback, but his vote of confidence is anything but a long-term mandate.
    Smith now will work on a week-to-week basis, because Singletary has voided 
    his year-long lease on the job.
    "I think from this point on, you have to do what's best for the football team,"
    Singletary said Monday, one day after threatening to bench Smith during a 
    27-24 loss to the visiting Eagles.
    """
    
        doc2 = """
    The fifth edition of West Coast Green, a conference focusing on "green" home 
    innovations and products, rolled into San Francisco's Fort Mason last week 
    intent, per usual, on making our living spaces more environmentally friendly 
    - one used-tire house at a time.
    To that end, there were presentations on topics such as water efficiency and 
    the burgeoning future of Net Zero-rated buildings that consume no energy and 
    produce no carbon emissions.
    """
    
        # Build Inverted-Index for documents
        inverted = {}
        documents = {'doc1':doc1, 'doc2':doc2}
        for doc_id, text in documents.iteritems():
            doc_index = inverted_index(text)
            inverted_index_add(inverted, doc_id, doc_index)
    
        # Print Inverted-Index
        for word, doc_locations in inverted.iteritems():
            print word, doc_locations
    
        # Search something and print results
        queries = ['Week', 'buildings consume', 'Alex remain quarterback']
        for query in queries:
            result_docs = search(inverted, query)
            print "Search for '%s': %r" % (query, result_docs)
            query_word_list = word_index(query)
            for doc in result_docs:
                index_first = []
                step = 3
                distance = 3
                for _, word in query_word_list:
                    index_second = inverted[word][doc]
                    index_new = []
                    if(index_first != []):
                        index_first = distance_between_word(index_first, index_second, distance)
                        distance += step 
                    else:
                        index_first = index_second
                for index in index_first:
                    print '   - %s...' % extract_text(doc, index)
                
            print



  • Today's topic is word segmentation. The Python extension libraries jieba and snownlp provide good support for Chinese word segmentation and can be installed with pip. In natural language processing, text often needs to be segmented, and segmentation accuracy directly affects the final results of downstream text processing and mining algorithms...

    First, the answer to the little puzzle at the end of yesterday's article (original link):

    Since the elements chosen must be distinct, trying to pick 500 of them from an interval like [1, 100] is of course impossible; but the machine does not know that, so it keeps trying and has no energy left for anything else.

    Today's topic is word segmentation: the Python extension libraries jieba and snownlp provide good support for Chinese word segmentation and can be installed with pip. In natural language processing, text often needs to be segmented, and segmentation accuracy directly affects the final results of downstream text processing and mining algorithms.

    >>> import jieba  # import the jieba module

    >>> x = '分词的准确度直接影响了后续文本处理和挖掘算法的最终效果。'

    >>> jieba.cut(x)  # segment with the default dictionary

    >>> list(_)

    ['分词', '的', '准确度', '直接', '影响', '了', '后续', '文本处理', '和', '挖掘', '算法', '的', '最终', '效果', '。']

    >>> list(jieba.cut('纸杯'))

    ['纸杯']

    >>> list(jieba.cut('花纸杯'))

    ['花', '纸杯']

    >>> jieba.add_word('花纸杯')  # add a new dictionary entry

    >>> list(jieba.cut('花纸杯'))  # segment again with the updated dictionary

    ['花纸杯']

    >>> import snownlp  # import the snownlp module

    >>> snownlp.SnowNLP('学而时习之,不亦说乎').words

    ['学而', '时习', '之', ',', '不亦', '说乎']

    >>> snownlp.SnowNLP(x).words

    ['分词', '的', '准确度', '直接', '影响', '了', '后续', '文本', '处理', '和', '挖掘', '算法', '的', '最终', '效果', '。']
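
    For readers who prefer a script to an interactive session, the following is a minimal sketch of the same steps in plain Python 3 syntax. It assumes only that jieba and snownlp have been installed with pip; the sample sentence is the same x used above.

    # -*- coding: utf-8 -*-
    # Sketch of the session above as a script (assumes: pip install jieba snownlp).
    import jieba
    import snownlp

    x = '分词的准确度直接影响了后续文本处理和挖掘算法的最终效果。'

    print(list(jieba.cut(x)))           # segment with the default dictionary
    jieba.add_word('花纸杯')            # add a new dictionary entry
    print(list(jieba.cut('花纸杯')))    # now kept as a single token
    print(snownlp.SnowNLP(x).words)     # snownlp's segmentation of the same sentence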

    If there were a Python book written in the writing style shown in the screenshots, wouldn't you like it? At least I would.


  • Then you first need to learn how to segment Chinese text. Follow our tutorial and practice step by step with Python. (Because of the limits WeChat public accounts place on external links, some links in the article may not open correctly. If needed, click the 'Read the original' button at the end of the article to visit a version that displays...


    Planning to draw a Chinese word cloud? Then you first need to learn how to segment Chinese text. Follow our tutorial and practice step by step with Python. (Because of the limits WeChat public accounts place on external links, some links in the article may not open correctly. If needed, click the 'Read the original' button at the end of the article to visit a version where external links display properly.) Requirement: in 'How to make a word cloud with Python' we covered word clouds for English text...


    LSI computes the probability distribution of the topics in a text by singular value decomposition; see the relevant papers for the rigorous mathematics. Suppose there are 5 topics: with an LSI model the text vector can then be reduced to 5 dimensions, each component representing the weight of the corresponding topic. Python implementation: jieba is used for segmentation, and the bag-of-words, TF-IDF and LSI models are implemented with the gensim library. import jieba.posseg as pseg; import codec...
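
    The code in that excerpt is cut off, so purely as an illustration, here is a minimal sketch of the gensim side of such a pipeline with 5 topics. The documents are assumed to be already segmented (for example with jieba), and the variable names are my own rather than the original author's.

    # Sketch (assumptions: gensim installed; documents already segmented into token lists).
    from gensim import corpora, models

    tokenized_docs = [['中文', '分词', '工具'], ['主题', '模型', '需要', '分词']]  # placeholder data
    dictionary = corpora.Dictionary(tokenized_docs)                    # build the vocabulary
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]   # bag-of-words vectors
    tfidf = models.TfidfModel(bow_corpus)                              # TF-IDF weighting
    lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=5)
    print(lsi[tfidf[bow_corpus[0]]])   # topic weights for the first document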


    Preface: when doing text mining, the first preprocessing step is word segmentation. English words come naturally separated by spaces and are easy to split on whitespace, but sometimes several words have to be treated as a single token, for example nouns such as 'new york'...
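
    That excerpt stops mid-sentence. Purely as an illustration of keeping a multi-word English token together (the excerpt does not say which tool it uses for this), one option is NLTK's MWETokenizer:

    # Sketch only, not necessarily what the original article used (assumes: pip install nltk).
    from nltk.tokenize import MWETokenizer

    tokenizer = MWETokenizer([('new', 'york')], separator=' ')      # treat "new york" as one token
    print(tokenizer.tokenize('i love new york in the fall'.split()))
    # ['i', 'love', 'new york', 'in', 'the', 'fall']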


    Here we use the 'jieba' segmentation tool. For a detailed introduction to this tool and its other uses, see 'How to do Chinese word segmentation with Python?' (article link: http://www.jianshu.com/p/721190534061). We first import the jieba package. What we need to process this time is not a single text but more than 1,000 texts, so we need to parallelize the work. That means first writing a function that handles...

    NLPIR is a complete software suite for processing and transforming raw text collections. It offers visualization of intermediate processing results and can also serve as a processing tool for small-scale data. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, user dictionaries, new-word discovery and keyword extraction. The segmentation function also has a Python implementation; GitHub link: https://github.com/tsroten/pynlpir...


    Here we use the 'jieba' segmentation tool. For a detailed introduction to this tool and its other uses, see 'How to do Chinese word segmentation with Python?'. We first import the jieba package: import jieba. What we need to process this time is not a single text but more than 1,000 texts, so we need to parallelize the work. That means first writing a function that segments a single text: def chinese_word...
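
    The excerpt is truncated at that function definition. Below is a minimal sketch of what such a helper and its batch application might look like; the name chinese_word_cut, the use of pandas, and the plain row-by-row apply (rather than true parallelization) are my assumptions, not taken from the original article.

    # Sketch (assumptions: pandas and jieba installed; df has a 'text' column).
    import jieba
    import pandas as pd

    def chinese_word_cut(text):                 # hypothetical helper: segment one document
        return ' '.join(jieba.cut(text))

    df = pd.DataFrame({'text': ['我爱自然语言处理', '中文分词很重要']})   # placeholder data
    df['text_cut'] = df['text'].apply(chinese_word_cut)                  # segment every row
    print(df['text_cut'].tolist())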


    This article uses sklearn to implement a naive Bayes model (the theory is explained later in the post). A scikit-learn cheat sheet comes first (a high-resolution download link appears below). The rough workflow is: load the data, split the data, preprocess it, train the model, test the model. jieba segmentation: first we segment the review data. Why segment at all? Chinese and English are different: 'i love python' is segmented by its spaces, whereas Chinese is not, for example '我...
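
    To make that workflow concrete, here is a minimal, hedged sketch of a bag-of-words naive Bayes classifier in scikit-learn. The toy reviews and labels are invented for illustration, and the texts are assumed to be already space-separated by jieba as described above.

    # Sketch (assumptions: scikit-learn installed; reviews already segmented by jieba).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    texts = ['味道 很 好', '服务 太 差', '非常 满意', '再也 不 来 了']   # toy segmented reviews
    labels = [1, 0, 1, 0]                                               # 1 = positive, 0 = negative

    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=0)
    vectorizer = CountVectorizer()                                      # bag-of-words features
    clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)
    print(clf.score(vectorizer.transform(X_test), y_test))              # accuracy on the held-out half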


    Text classification is a supervised learning task; common applications include news classification and sentiment analysis. It involves many key techniques from machine learning and data mining: word segmentation, feature extraction, feature selection, dimensionality reduction, cross-validation, model tuning, model evaluation and so on, and mastering it deepens your understanding of machine learning. This time we implement text classification with Python's scikit-learn module. The text classification process first...


    1) jieba: https://github.com/fxsjy/jieba — 'Jieba' Chinese word segmentation: built to be the best Python Chinese word segmentation component. jieba (Chinese for 'to stutter') Chinese text segmentation: built to be the best Python Chinese word segmentation module. Features: supports three segmentation modes: precise mode, which tries to cut the sentence as accurately as possible and suits text analysis; full mode, which scans out all the...
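
    As a quick, hedged illustration of the first two modes described there (assuming jieba is installed; the sample sentence follows jieba's own README demo):

    # Sketch: precise mode vs. full mode in jieba.
    import jieba

    sent = '我来到北京清华大学'
    print('/'.join(jieba.cut(sent, cut_all=False)))   # precise mode (the default)
    print('/'.join(jieba.cut(sent, cut_all=True)))    # full mode: every possible word, ambiguity unresolved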


    ...Cloud Computing Application Engineering Technology Research Center)) and THULAC (from the Tsinghua University Natural Language Processing and Social Humanities Computing Lab). All four provide word segmentation; this blog only covers the parts of each module the author finds interesting. jieba is not introduced here; see the post 'python+gensim: jieba segmentation, bag-of-words doc2bow, tfidf text mining'. 1. SnowNLP only handles Unicode, so please decode your input to...


    As noted above, this article needs two Python libraries: jieba, a Chinese word segmentation tool, and wordcloud, a word cloud generator for Python. Writing this article took an hour and a half; reading it takes ten minutes, and after reading it you will know how to turn any Chinese text into a word cloud. Code layout: the code comes from other people's blogs, but because of bugs or poor efficiency I changed it substantially. The first part of the code sets up what the run needs...

    This post introduces SnowNLP, a handy, multi-functional Python Chinese text processing tool (full name: Simplified Chinese Text Processing); original post: https://blog.csdn.net/xiaosongshine/article/details/101439157. Besides segmentation, it also offers conversion to pinyin (a trie-based maximum...

    Taking the most basic text-analysis operation, word segmentation, as its entry point, this article introduces the basic tools and methods artificial intelligence uses to process natural language, opening the door to language analysis and understanding for the reader. Author: Zhu Chenguang; source: Big Data DT... 我 来到 北京 清华大学. English tokenization can be done with the spaCy package: # install spacy # pip install spacy # python -m spacy download en_core_web_sm import...
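
    That excerpt breaks off at the import. A minimal sketch of how the rest of such a spaCy example might look (my own completion, assuming the small English model has been downloaded as shown) is:

    # Sketch (assumes: pip install spacy && python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load('en_core_web_sm')             # load the small English pipeline
    doc = nlp('I came to Tsinghua University in Beijing.')
    print([token.text for token in doc])           # one string per token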

    Entity-linking API: the input is Chinese text, the output is the segmented text plus the recognized entities, in JSON format. Returned fields: cuts, the segmentation result as a list of strings; entities, the entities recognized from the text... Next we discuss the basic principle of the implementation. Python 3.6 is used here; the difference from Python 2 is that urllib.request is used instead of urllib2. # coding=utf-8 import urllib...
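
    The code is cut off right after the import, so the following is only a rough sketch of the kind of urllib.request call such a client might make. The endpoint URL is a placeholder and the field names (text, cuts, entities) merely follow the description above.

    # coding=utf-8
    # Sketch only: hypothetical entity-linking endpoint, Python 3 standard library only.
    import json
    import urllib.request

    def link_entities(text, api_url='http://example.com/entity_link'):   # placeholder URL
        payload = json.dumps({'text': text}).encode('utf-8')
        req = urllib.request.Request(api_url, data=payload,
                                     headers={'Content-Type': 'application/json'})
        with urllib.request.urlopen(req) as resp:                        # POST, since data is supplied
            result = json.loads(resp.read().decode('utf-8'))
        return result.get('cuts'), result.get('entities')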


    I previously wrote 'How to extract topics from massive text with Python?', which contains this passage: to keep the demonstration flowing, we ignored many details here. Much of it uses preset default parameters and completely skips the step of configuring Chinese stop words, which is why stop words such as '这个', '如果', '可能' and '就是' swagger into the results. But never mind; done matters far more than perfect...

    In 'A summary of the Chinese text mining preprocessing pipeline' we summarized preprocessing for Chinese text mining; here we give a similar summary for English text mining. 1. Characteristics of English text mining preprocessing. English preprocessing differs in part from Chinese. First, English text mining generally needs no word segmentation step (apart from special requirements), whereas segmentation is an indispensable step when preprocessing Chinese. Second, most...
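
    To make that first difference concrete, here is a small, hedged comparison: whitespace plus NLTK's word_tokenize is usually enough for English, while Chinese needs a segmenter such as jieba. The pairing is my illustration (the excerpt names no specific libraries), and word_tokenize needs NLTK's 'punkt' resource downloaded once (newer NLTK versions may also ask for 'punkt_tab').

    # Sketch (assumptions: pip install nltk jieba; nltk.download('punkt') already run).
    import jieba
    from nltk.tokenize import word_tokenize

    print(word_tokenize("English words are already separated by spaces."))
    print(list(jieba.cut('中文词语之间没有空格，必须先分词。')))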


    GitHub home page: https://github.com/saffsd/langid.py. 7. jieba, 'Jieba' Chinese word segmentation: built to be the best Python Chinese word segmentation component. 'jieba' (Chinese for 'to stutter') Chinese text segmentation: built to be the best Python Chinese word segmentation module. Good, at last we can mention a Python text processing toolkit developed in China: jieba...

  • An example of using wordninja, a Python module for segmenting English text that has no spaces. Published 2020-08-31 23:40:00; source: 脚本之家; views: 77. In NLP, data cleaning and word segmentation are often the first steps of a project; in most work only Chinese corpora need segmentation, and plenty of existing segmentation tools...

    An example of using wordninja, a Python module for segmenting English text without spaces

    Published: 2020-08-31 23:40:00

    Source: 脚本之家

    Views: 77

    In NLP, data cleaning and word segmentation are often the first steps of a project. In most work only Chinese corpora need to be segmented, and plenty of segmentation tools already exist, so they are not introduced again here. English corpora come with spaces and therefore do not need the same treatment as Chinese, but what should you do if the spaces are missing from your English data?

    Today we introduce a tool built exactly for that situation: wordninja (repository linked in the code below).

    Let's look at what it does with a simple example:

    def wordinjaFunc():
        '''
        https://github.com/yishuihanhan/wordninja
        '''
        import wordninja
        print wordninja.split('derekanderson')
        print wordninja.split('imateapot')
        print wordninja.split('wethepeopleoftheunitedstatesinordertoformamoreperfectunionestablishjusticeinsuredomestictranquilityprovideforthecommondefencepromotethegeneralwelfareandsecuretheblessingsoflibertytoourselvesandourposteritydoordainandestablishthisconstitutionfortheunitedstatesofamerica')
        print wordninja.split('littlelittlestar')

    wordinjaFunc()  # run the examples; this call is implied by the output shown below

    The results are as follows:

    ['derek', 'anderson']

    ['im', 'a', 'teapot']

    ['we', 'the', 'people', 'of', 'the', 'united', 'states', 'in', 'order', 'to', 'form', 'a', 'more', 'perfect', 'union', 'establish', 'justice', 'in', 'sure', 'domestic', 'tranquility', 'provide', 'for', 'the', 'common', 'defence', 'promote', 'the', 'general', 'welfare', 'and', 'secure', 'the', 'blessings', 'of', 'liberty', 'to', 'ourselves', 'and', 'our', 'posterity', 'do', 'ordain', 'and', 'establish', 'this', 'constitution', 'for', 'the', 'united', 'states', 'of', 'america']

    ['little', 'little', 'star']

    Judging from these simple results, it works quite well; I will keep evaluating it in real use.

    Summary

    That is all for this article. I hope it offers some useful reference for your study or work, and thank you for supporting 亿速云. For more related content, see the links below.

  • When it comes to word segmentation, most people assume it is deep, advanced technology, but today the author gets it done in just a few dozen lines of code; Python is powerful indeed, and so is the author. Note, though, that this is only forward maximum matching, with no machine-learning ability. Note: download the Sogou dictionary before use. # -*- coding:utf-8...
  • Requirement: in 'How to make a word cloud with Python' we introduced how to build word clouds from English text. Did you enjoy it? As mentioned there, English text was chosen as the example because it is the simplest to process, but readers soon tried making word clouds from Chinese text. Following the method in that article, ...
  • 1. Installing NLTK. In a Python 2.x environment, the install command is: sudo pip install nltk. In a Python 3.x environment it is: sudo pip3 install nltk. After these commands succeed, the NLTK installation is still not fully complete; you also need to ...
  • spaCy is a Python natural language processing toolkit, born in mid-2014 and billed as "Industrial-Strength Natural Language Processing in Python". spaCy makes heavy use of Cython to speed up its...
  • The Porter Stemming Algorithm has implementations in many languages. porterstemmer example: from porterstemmer import Stemmer; stemmer = Stemmer(); print stemmer("foo"); print stemmer(u"foo"); print stemmer("...
  • Text preprocessing turns text into a format computers can recognize and is an important step in text classification, text visualization, text analysis and similar work. The workflow includes word segmentation, stop-word removal, ... Word segmentation breaks text into word units; English text consists of English words joined by spaces...
  • Python English word segmentation and dictionary sorting

    1k+ views 2018-05-28 14:40:31
    speak='''Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans and people of the world, thank you.We, the citizens of America, are now joined in...
  • Python Chinese word segmentation with jieba: a fifteen-minute introduction and beyond

    10k+ views, many likes 2017-05-27 16:21:04
    Overall introduction: jieba is a Python-based Chinese word segmentation tool. It is very easy to install and use (just pip it), works on both Python 2 and 3, and is powerful; the blogger highly recommends it. GitHub: https://github.com/fxsjy/jieba; OSChina page: ...
  • Preface: as everyone knows, English segmentation is relatively easy because the words are separated by spaces, while Chinese is different: a Chinese sentence is delimited only character by character. So-called forward maximum matching and backward maximum matching are dictionary-matching segmentation methods; here, using...
  • The principles of Chinese word segmentation and an introduction to common Python Chinese segmentation libraries (reposted from the 进击的Coder public account). Principle: Chinese word segmentation means cutting a sequence of Chinese characters into individual words. On the surface segmentation looks like no big deal, but segmentation...
  • Python Chinese word segmentation

    2015-08-05 20:18:35
    English words have spaces as natural delimiters, while Chinese uses the character as its basic writing unit with no obvious markers between words, so Chinese word analysis is the foundation and key of Chinese information processing. Segmentation algorithms fall into three broad categories: methods based on dictionary and lexicon matching; methods based on word...
  • Stanford Python Chinese word segmentation with stanza

    1k+ views 2019-12-12 15:04:49
    Stanford Python Chinese word segmentation with stanza. 1. Download the Stanford CoreNLP files: download the complete package from https://stanfordnlp.github.io/CoreNLP/index.html, download the Chinese model file, and unpack stanford-corenlp-full-2018-02-...
  • Python Chinese word segmentation

    2012-05-19 08:48:00
    Python Chinese word segmentation, March 17, 2012, isnowfy, algorithm, program. Compared with English, Chinese has a problem that computer processing must face: Chinese word segmentation. English words are all separated by spaces, while Chinese words are not, so using...
  • Python word segmentation

    2016-01-07 09:45:14
    Use Python to segment Chinese and English text; Chinese and English indexing is also supported.
  • 'Jieba' Chinese word segmentation: built to be the best Python Chinese word segmentation component. "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module. See README.md for the full documentation ...
  • Fun with word segmentation: basic usage of the Python jieba module. jieba is a powerful segmentation library with excellent support for Chinese; this article gives a brief summary of its basic usage. Install jieba: pip install jieba. Simple usage: jieba segmentation is divided into...
  • Chinese word segmentation is a fundamental task in Chinese text processing, and jieba is used to perform it here. Its implementation rests on three main ideas: efficient word-graph scanning based on a Trie structure, generating a directed acyclic graph (DAG) of all the possible words the characters of the sentence can form; dynamic programming to find...
  • Follow the blogger's footsteps and improve a little every day. This article actually consolidates the introductions from earlier articles and adds some other Python Chinese segmentation resources, and even non-Python segmentation tools, for reference only. First it introduces...
  • That is, the tag may only be lowercase letters and the word frequency may only be a number. 3) Change the main segmentation dictionary: jieba.set_dictionary(dict_path). Note: dictionary-update operations all modify the jieba dictionary and are lost once the dictionary is reloaded. jieba's segmentation principle: the candidate scored highest by word-frequency statistics is taken as the final...
  • This article mainly describes how to segment text with Python's jieba, count word frequencies, and write the results to Excel and txt files; it has some reference value, and interested readers can take a look.
  • 2. Segmentation: jieba.cut accepts three input parameters: the string to segment; cut_all, which controls whether full mode is used; and HMM, which controls whether the HMM model is used. jieba.cut_for_search accepts two parameters: the string to segment and whether to use the HMM...
  • jieba segmentation supports three modes: precise mode, which tries to cut the sentence as accurately as possible and suits text analysis; full mode, which scans out every fragment of the sentence that can form a word, very fast but unable to resolve ambiguity; ... it is compatible with both Python 2.x and Python 3.x, ...
