  • This post shares a method for implementing an inverted index in Python. It makes a useful reference; let's take a look.
    # -*- coding: utf-8 -*-
    Str = "文档"    # label prefix: document i is tagged 文档_i
    # Each document is a flat list of alternating (word, position) entries.
    wendang = [["你好", 0, "搜索引擎", 2],
               ["搜索引擎", 0, "技术", 4, "窥视", 6],
               ["你好", 0, "搜索", 2, "技术", 4]]
    res = {}    # word -> list of [document label, position] postings
    for i, s in enumerate(wendang):      # enumerate instead of wendang.index(s),
        for j in range(0, len(s), 2):    # which would break on duplicate documents
            res.setdefault(s[j], []).append([Str + "_" + str(i + 1), s[j + 1]])
    print(res)
    


  • This method works at function granularity and uses simhash plus inverted indexing to quickly trace similar functions in massive codebases. It first builds, from a large sample set, a code library with a three-level inverted-index structure based on simhash; for a function to be traced, the simhash values of its code blocks are used to quickly find similar code... A rough simhash sketch follows.
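    A rough sketch of the simhash fingerprint itself (not the paper's three-level index): hash each token, accumulate signed bit votes, and keep the sign bits. The token streams and the use of MD5 here are illustrative assumptions.

    import hashlib

    def simhash(tokens, bits=64):
        v = [0] * bits
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16)  # any stable hash works
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1               # signed bit vote
        return sum(1 << i for i in range(bits) if v[i] > 0)     # sign bits -> fingerprint

    # Similar token streams give fingerprints with a small Hamming distance.
    a = simhash("load store add mul".split())
    b = simhash("load store add sub".split())
    print(bin(a ^ b).count("1"))  # Hamming distance between the fingerprints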
  • Implementing an inverted index in Python


    • A brief outline of the inverted-index process: (figure omitted)

    • The input format (figure omitted):

      • The leading number is the document ID; every two lines form one document.
    • Requirements:

      • Write a program (any language; python3 recommended) that builds an inverted index for the 1.txt file in this directory and saves it to 2_generated.txt (each line formatted as term\tDocFreq\tdocID docID, where \t is the tab character and docIDs are separated by spaces)
      • Text-processing requirements: no lemmatization such as friends -> friend; split on spaces only; lowercase everything (A -> a); symbols (e.g. ,), symbol-letter mixtures (e.g. 68-years-old), empty strings (produced by the split function), and other non-standard tokens all count as words in the statistics, with no special handling (i.e. strings obtained by space-splitting are not processed further); remove the 100 strings with the highest occurrence counts
    • Implementation code

    filename = './1.txt'

    '''
        Func: read the file and return the word list, one entry per document
    '''
    def read_file(filename):
        result = []
        count = 0                           # used to join the two lines of each document
        with open(filename, 'r') as file_to_read:
            while True:
                lines = file_to_read.readline()  # read one line
                if len(lines) == 0:
                    break
                lsplit = lines.split("\t")
                count = count + 1
                if count % 2 == 0:          # second line of a document: append and store
                    for word in lsplit[1].split(" "):
                        tmp.append(word)
                    result.append(tmp)
                else:                       # first line of a document: start a new word list
                    tmp = []
                    for word in lsplit[1].split(" "):
                        tmp.append(word)
        return result


    '''
        Step 1:
        Create 2 dictionaries:
        dict1 records word -> set of docIDs,
        dict2 records word -> word count, used to delete the top 100
    '''

    dict1 = {}
    dict2 = {}
    sentences = read_file(filename)
    # print(sentences)
    '''
        Fill the 2 dicts, skipping empty strings
    '''
    for i in range(100):                    # the file holds 100 documents (two lines each)
        sentence = sentences[i]
        for word in sentence:
            if word == '': continue
            if word.lower() not in dict1:
                dict1[word.lower()] = set()     # new word
                dict2[word.lower()] = 1
            else:
                dict2[word.lower()] += 1
            dict1[word.lower()].add(i + 1)      # docIDs are 1-based

    # print(dict1)
    # print(dict2)

    '''
        Step 2:
        Sort dict2 by word count and get answer_list,
        get rid of the top 100,
        then sort the rest in ASCII order
    '''
    answer_list = sorted(dict2.items(), key=lambda d: d[1], reverse=True)   # sort by word count
    answer_list_delete_top100 = answer_list[100:]
    answer_sort_ascii = sorted(answer_list_delete_top100, key=lambda x: x[0])


    '''
        Step 3:
        Write into 2_generated.txt in the required format
    '''

    with open('./2_generated.txt', 'w') as f:
        for word in answer_sort_ascii:
            f.write("%s\t%d\t" % (word[0], word[1]))
            sort_docid = sorted(dict1[word[0]])
            for i in range(len(sort_docid)):
                f.write("%d" % sort_docid[i])
                if i != len(sort_docid) - 1:
                    f.write(" ")
            f.write('\n')
    
    
    
  • Inverted index in Python


    Code link

    Preprocessing

    word stemming

    A word can appear in different forms; in English, for example, verbs vary with voice and nouns with number, e.g. live/lives/lived.

    English processing already looks complicated, but handling Chinese is far more complex still.

    stop words

    Words like "a" and "the" carry no real meaning for retrieval. Here we first compute word frequencies, designate the stop words manually, and simply replace them all with spaces. This does not work in every case: a query like "To be or not to be" is hard to handle, as the sketch below shows.
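    A small illustration, assuming NLTK is installed (the word lists are made up): stemming unifies inflected forms, while naive stop-word replacement can erase a query such as "to be or not to be" entirely.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ['live', 'lives', 'lived']])  # ['live', 'live', 'live']

    stopwords = ['to', 'be', 'or', 'not']
    query = 'to be or not to be'
    for w in stopwords:
        query = query.replace(w, ' ')   # the same naive replacement used below
    print(repr(query.strip()))          # '' -- nothing left to search for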

    Implementation

    Index.txt lists the files to be processed.

    Building the inverted index is divided into three steps, producing three files:

    thefile.txt: every word that occurs, ordered by descending frequency

    stop_word.txt: the stop words

    data.pkl: the index that is built

    1 count.py determines the stop words

    2 index.py builds the inverted index

    3 query.py answers queries

    When building the inverted index, only the names of the files in which a word occurs are recorded, not the positions within those files.

    (Figure: word-frequency statistics produced by count.py)

    count.py

    #-*- coding:utf-8 -*-
    '''
    @author birdy qian
    '''
    import sys
    from nltk import *                   # import natural-language-toolkit
    from operator import itemgetter      # for sort

    def output_count(fdist):             # output the relevant information
        vocabulary = fdist.items()       # get all the vocabulary
        vocabulary = sorted(vocabulary, key=itemgetter(1), reverse=True)  # sort the vocabulary in decreasing order
        print vocabulary[:250]           # print the top 250 words and their counts on the screen
        print 'drawing plot.....'        # show progress
        fdist.plot(120, cumulative=False)    # draw the frequency plot
        # output to file
        file_object = open('thefile.txt', 'w')   # prepare the file for writing
        for j in vocabulary:
            file_object.write(j[0] + ' ')    # write all the words in decreasing order
        file_object.close()              # close the file

    def pre_file(filename):
        print("read file %s.txt....." % filename)    # show progress
        content = open(str(filename) + '.txt', "r").read()
        content = content.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':    # strip punctuation
            content = content.replace(ch, " ")
        plurals = content.split()        # split the file at '\n' or ' '
        stemmer = PorterStemmer()        # prepare for stemming
        singles = [stemmer.stem(plural) for plural in plurals]   # apply stemming
        return singles

    # main function
    def main():
        print "read index....."          # show progress
        input = open('index.txt', 'r')   # titles that need to be handled
        all_the_file = input.read()
        file = all_the_file.split()
        input.close()                    # close the file
        fdist1 = FreqDist()              # create a new frequency distribution
        for x in range(0, len(file)):
            txt = pre_file(file[x])      # preprocess the txt
            for words in txt:
                words = words.decode('utf-8').encode(sys.getfilesystemencoding())   # change string type from utf-8 to the local encoding (e.g. gbk)
                fdist1[words] += 1       # count the word
        output_count(fdist1)

    # runfile
    if __name__ == '__main__':
        main()

    index.py

    #-*- coding:utf-8 -*-
    '''
    @author birdy qian
    '''
    import sys
    import pickle
    from nltk import *                   # import natural-language-toolkit
    from operator import itemgetter      # for sort

    STOPWORDS = []                       # global variable

    def output_index(result):
        output = open('data.pkl', 'wb')
        pickle.dump(result, output)      # pickle the index dictionary
        output.close()

    def pre_file(filename):
        global STOPWORDS
        print("read file %s.txt....." % filename)    # show progress
        content = open(str(filename) + '.txt', "r").read()
        content = content.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':    # strip punctuation
            content = content.replace(ch, " ")
        for ch in STOPWORDS:             # remove the stopwords
            content = content.replace(ch, " ")
        plurals = content.split()        # split the file at '\n' or ' '
        stemmer = PorterStemmer()        # prepare for stemming
        singles = [stemmer.stem(plural) for plural in plurals]   # apply stemming
        return singles

    def readfile(filename):
        input = open(filename, 'r')      # titles that need to be handled
        all_the_file = input.read()
        words = all_the_file.split()     # split the file at '\n' or ' '
        input.close()
        return words

    # main function
    def main():
        global STOPWORDS
        print "read index....."          # show progress
        file = readfile('index.txt')
        print "read stopwords....."
        STOPWORDS = readfile('stop_word.txt')
        print "create word list....."
        word = list(readfile('thefile.txt'))     # the file with all the words in all the books
        result = {}                      # holds the inverted index
        for x in range(0, len(file)):
            txt = pre_file(file[x])      # file[x] is the title
            txt = {}.fromkeys(txt).keys()    # remove repeated words (could also use set())
            for words in txt:
                words = words.decode('utf-8').encode(sys.getfilesystemencoding())   # change string type from utf-8 to the local encoding (e.g. gbk)
                if result.get(words) == None:    # the word is not in the dictionary yet
                    result[words] = [file[x]]
                else:                    # the word is already in the dictionary
                    t = result.get(words)
                    t.append(file[x])
                    result[words] = t
        output_index(result)

    # runfile
    if __name__ == '__main__':
        main()

    query.py

    #-*- coding:utf-8 -*-
    '''
    @author birdy qian
    '''
    import os
    import sys
    import pprint, pickle
    from nltk import PorterStemmer

    def readfile(filename):
        input = open(filename, 'r')      # titles that need to be handled
        all_the_file = input.read()
        words = all_the_file.split()     # split the file at '\n' or ' '
        input.close()                    # close the file
        return words

    def getdata():
        pkl_file = open('data.pkl', 'rb')    # the index is saved in 'data.pkl'
        data1 = pickle.load(pkl_file)    # load the pickled index
        pkl_file.close()                 # close the file
        return data1

    def output(result):
        if result == None:               # the word is not in the index (one word returns None)
            print None
            return
        if len(result) == 0:             # the words are not in the index (several words return [])
            print None
            return
        if len(result) < 10:             # fewer than 10 records
            print result
        else:                            # more than 10 records: page through them
            print 'get ' + str(len(result)) + ' records'     # the record count
            for i in range(0, len(result) / 10 + 1):
                print '10 records start from ' + str(i * 10 + 1)
                if 10 * i + 9 < len(result):     # print records 10*i .. 10*i+9
                    print result[10 * i : 10 * i + 10]
                else:                    # print from 10*i to the end
                    print result[10 * i : len(result)]
                    break
                getstr = raw_input("Enter 'N' for next ten records & other input to quit : ")
                if getstr != 'N':
                    break

    # main function
    def main():
        data_list = getdata()            # read the index
        STOPWORDS = readfile('stop_word.txt')
        stemmer = PorterStemmer()        # prepare for stemming
        while True:
            get_str = raw_input("Enter your query('\\'to quit): ")
            if get_str == '\\':          # leave the loop
                break
            get_str = get_str.lower()
            for ch in STOPWORDS:         # remove the stopwords
                get_str = get_str.replace(ch, " ")
            query_list = get_str.split()     # split at '\n' or ' '
            query_list = [stemmer.stem(plural) for plural in query_list]    # apply stemming
            while True:
                if query_list != []:
                    break
                get_str = raw_input("Please enter more information: ")
                get_str = get_str.lower()
                for ch in STOPWORDS:     # remove the stopwords
                    get_str = get_str.replace(ch, " ")
                query_list = get_str.split()
                query_list = [stemmer.stem(plural) for plural in query_list]    # apply stemming
            result = []
            for k in range(0, len(query_list)):
                if k == 0:               # first word: start from its posting list
                    result = data_list.get(query_list[0])
                else:                    # intersect with the next word's posting list
                    result = list(set(result).intersection(data_list.get(query_list[k])))
            output(result)

    # runfile
    if __name__ == '__main__':
        main()

  • An inverted index is a retrieval structure: a database stores data one whole article at a time, but retrieval usually happens by keyword, so building an inverted index in advance makes keyword search convenient and avoids a full-table scan. It is a space-for-time trade-off.

    Building the inverted index with a dictionary

    sentences = ['This is the first word',
                 'This is the second text Hello How are you',
                 'This is the third, this is it now']
    sentence_dict = {}
    index_dict = {}
    for index, line in enumerate(sentences):
        line = line.lower()  # lowercase
        sentence_dict[index] = line
        for word in line.split(' '):
            if not word.strip():
                continue
            if word in index_dict:
                index_dict[word].add(index)
            else:
                index_dict[word] = {index}
    

    Retrieval

    At query time, walk the posting list of each query word and take the intersection of those lists; a document in the intersection contains every query word.

    def list_intersection(list1: list, list2: list) -> list:  # intersection
        return list(set(list1).intersection(set(list2)))
    
    
    index_value = list(sentence_dict.keys())
    search_key = ['this', 'first']
    for key in search_key:
        index_value = list_intersection(index_dict[key], index_value)
    print(index_value)
    

    Complete code

    sentences = ['This is the first word',
                 'This is the second text Hello How are you',
                 'This is the third, this is it now']
    sentence_dict = {}
    index_dict = {}
    for index, line in enumerate(sentences):
        line = line.lower()  # lowercase
        sentence_dict[index] = line
        for word in line.split(' '):
            if not word.strip():
                continue
            if word in index_dict:
                index_dict[word].add(index)
            else:
                index_dict[word] = {index}
    
    
    # at query time
    def list_intersection(list1: list, list2: list) -> list:  # intersection
        return list(set(list1).intersection(set(list2)))
    
    
    index_value = list(sentence_dict.keys())
    search_key = ['this', 'first']
    for key in search_key:
        index_value = list_intersection(index_dict[key], index_value)
    print(index_value)
    
  • Inverted index and Boolean queries

    Build an inverted index over the given Tweets dataset; implement a Boolean Retrieval Model and test it with the TREC 2014 test topics; the Boolean Retrieval Model supports and, or, not (query optimization is optional). A minimal sketch of the Boolean operators follows.
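    A minimal sketch of the three Boolean operators over posting sets; the toy index and document universe are assumptions, not the post's code.

    index = {'rain': {1, 3}, 'snow': {2, 3}, 'sun': {1, 4}}
    all_docs = {1, 2, 3, 4}

    def AND(a, b): return a & b
    def OR(a, b): return a | b
    def NOT(a): return all_docs - a

    # "rain AND NOT snow" -> documents containing rain but not snow
    print(AND(index['rain'], NOT(index['snow'])))  # {1}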
  • Implementing an inverted index in Python

    Build the forward index first: the ID of "document 1" -> word1: position list; word2: position list; ...; the ID of "document 2" -> the list of keywords appearing in that document. ''' forward_index = {} for line in fin: line = line.strip().... A hedged sketch of inverting such a structure follows.
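    A hedged sketch of the conversion the excerpt implies: invert a forward index (docID -> word -> positions) into an inverted one (word -> docID -> positions). The sample data is made up.

    forward_index = {
        'doc1': {'hello': [0], 'search': [2]},
        'doc2': {'search': [0, 5], 'engine': [1]},
    }
    inverted_index = {}
    for doc_id, words in forward_index.items():
        for word, positions in words.items():
            inverted_index.setdefault(word, {})[doc_id] = positions
    print(inverted_index['search'])  # {'doc1': [2], 'doc2': [0, 5]}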
  • What is an inverted index? An inverted index (English: Inverted index), also known as a reverse index, postings file, or inverted file, is an indexing method used to store, for full-text search, a mapping from a word to its storage locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems...
  • This article introduces the index used in ES, the inverted index. Analysis: before an index is created, the strings in a document are tokenized. ES strings come in two types, keyword and text. A keyword string is not tokenized and is matched whole at search time; a text string is tokenized, and searches... A hedged mapping sketch follows.
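    A sketch of what such a mapping looks like (the field names are invented; the type semantics follow the ES docs): keyword fields are matched whole, text fields are analyzed into terms.

    mapping = {
        "mappings": {
            "properties": {
                "tag":  {"type": "keyword"},  # stored as-is, exact match only
                "body": {"type": "text"},     # tokenized for full-text search
            }
        }
    }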
  • Implementing an inverted index in Python and building a simple search engine

    This article implements an inverted index in Python. A data table docu_set holds three documents d1, d2, d3: docu_set={'d1':'i love shanghai', 'd2':'i am from shanghai now i study in tongji university', 'd3':'i am from ... (a sketch of the construction follows)
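    A minimal sketch of that construction; the d3 text is truncated in the excerpt, so only the two complete documents are used here.

    docu_set = {'d1': 'i love shanghai',
                'd2': 'i am from shanghai now i study in tongji university'}
    inverted = {}
    for doc_id, text in docu_set.items():
        for term in text.split():
            inverted.setdefault(term, set()).add(doc_id)
    print(inverted['shanghai'])  # {'d1', 'd2'}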
  • About inverted indexes: a search engine's usual retrieval scenario is: given a few keywords, find the documents that contain them. Quickly finding the documents that contain a given keyword is therefore the key to search. Here we use the word-document matrix model, through which we can easily tell which documents a word... (a toy matrix follows)
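    A toy word-document matrix (terms and documents are invented): each row marks which documents contain a term, so the documents containing a keyword can be read straight off its row.

    docs = ['d1', 'd2', 'd3']
    matrix = {
        'search': [1, 0, 1],  # 'search' occurs in d1 and d3
        'python': [0, 1, 1],
    }
    print([d for d, bit in zip(docs, matrix['search']) if bit])  # ['d1', 'd3']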
  • Problem description: I want to implement a simple search engine in Python, which requires building an inverted index, but my method is too slow (see figure). How should the algorithm be improved? My idea: the pages fetched by the crawler are stored in MySQL; after reading them from the database I tokenize them and temporarily store the results in...
  • First the process, then the principles. 1. The ES write path: (1) the client picks a node at random and sends it the request; that node is the coordinating node; (2) the coordinating node routes the document and forwards the request to the index's primary shard... (a routing sketch follows)
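    For step (2), the ES documentation gives the routing rule as shard = hash(_routing) % number_of_primary_shards, with _routing defaulting to the document ID. A sketch, with Python's hash() standing in for ES's actual Murmur3 hash:

    def route(doc_id: str, num_primary_shards: int) -> int:
        return hash(doc_id) % num_primary_shards  # not ES's real hash function

    print(route("doc-42", 5))  # a shard number in [0, 5)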
  • A Python implementation of an inverted index: spider.py crawls 10 bilingual (Chinese-English) Andersen fairy tales into the "documents_cn" directory; inverted_index_cn.py builds an inverted index over the documents in "documents_cn"; querying "第三根火柴", ...
  • Indexes can be classified in several ways depending on the criterion, and there are more than four common types; the most crucial of them is the inverted index. This article introduces how to create an inverted index. 1. Concepts and terminology: the word-document matrix expresses the containment relation between the two...
  • This is lab 2 of the Shandong University big-data course: implementing a document inverted index with Hadoop.
  • Building an inverted index in Python

    cdays-3-test.txt contents: 1 key1 / 2 key2 / 3 key1 / 7 key3 / 8 key2 / 10 key1 / 14 key2 / 19 key4 / 20 key1 / 30 key3. Read this simple index file, in which each line has the format "document number keyword", and convert the information into an inverted index, i.e. count, for each keyword, the... (a sketch follows)
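    A minimal sketch under the stated format, one "docID keyword" pair per line:

    from collections import defaultdict

    inverted = defaultdict(list)
    with open('cdays-3-test.txt') as f:
        for line in f:
            if line.strip():
                doc_id, keyword = line.split()
                inverted[keyword].append(int(doc_id))
    print(dict(inverted))  # e.g. {'key1': [1, 3, 10, 20], ...}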
  • Inverted index in Hadoop (with complete code)

    The inverted index is the most commonly used data structure in document retrieval systems and is widely applied in full-text search engines. It stores a mapping from a word (or phrase) to its storage locations in a document or a set of documents, i.e. it provides a way to look documents up by their content. Since the lookup is not by...
  • An information-retrieval system built with an inverted index and the vector space model. Completed work: positional inverted index; vector space model; TOP-K query; Boolean query; basic query; spelling correction; noun query; spelling correction (below). Running: requires python3; before the first run, please download... A generic ranking sketch follows.
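    A compact, generic sketch of vector-space ranking (TF-IDF with a log IDF) standing in for the TOP-K query; the scoring details and sample documents are assumptions, not the repository's code.

    import math
    from collections import Counter

    docs = {'d1': 'information retrieval system',
            'd2': 'retrieval of inverted index',
            'd3': 'vector space model'}
    N = len(docs)
    tf = {d: Counter(t.split()) for d, t in docs.items()}      # term frequencies
    df = Counter(w for c in tf.values() for w in c)            # document frequencies

    def score(query, d):
        return sum(tf[d][w] * math.log(N / df[w]) for w in query.split() if w in df)

    print(sorted(docs, key=lambda d: score('inverted retrieval', d), reverse=True)[:2])  # top 2 documents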
  • http://bitjoy.net/2016/01/04/introduction-to-building-a-search-engine-1/
