  • Inverted index in C++
    2016-12-17 11:41:13

    1. Introduction

    An inverted index is one of the core techniques of modern search engines. Its purpose is to make the task of finding, out of a large collection, the set of documents that contain certain words very cheap: the posting list of a term can be located in O(1) time with a hash table or in O(log V) time with a balanced tree, where V is the number of distinct terms in the index, and the remaining cost of a query is proportional to the lengths of the posting lists involved rather than to the number of documents. In other words, with an inverted index the retrieval cost is largely independent of the size of the document collection, which is essential when searching massive amounts of content.

    2. Example

    Suppose we have the following documents (the document texts and tokens are left in Chinese because the code below uses them verbatim as string literals; English glosses are added in parentheses):

    D1 = “谷歌地图之父跳槽Facebook” (the father of Google Maps jumps ship to Facebook)
    D2 = “谷歌地图之父加盟Facebook” (the father of Google Maps joins Facebook)
    D3 = “谷歌地图创始人拉斯离开谷歌加盟Facebook” (Google Maps creator Lars leaves Google and joins Facebook)
    D4 = “谷歌地图创始人跳槽Facebook与Wave项目取消有关” (the Google Maps creator's move to Facebook is related to the cancellation of the Wave project)
    D5 = “谷歌地图创始人拉斯加盟社交网站Facebook” (Google Maps creator Lars joins the social networking site Facebook)

    After segmenting each document into words, the keywords contained in these documents are: {谷歌, 地图, 之父, 跳槽, Facebook, 加盟, 创始人, 拉斯, 离开, 与, Wave, 项目, 取消, 有关, 社交, 网站}

    First, stop words such as “与” (“and/with”), which carry no real meaning, are removed.
      Then, for every word a linked list is built, each element of which is the identifier of a document containing that word.
      This gives the following set of posting lists.

    谷歌—{D1,D2,D3,D4,D5},地图—{D1,D2,D3,D4,D5},之父—{D1,D2},跳槽—{D1,D2},Facebook—{D1,D2,D3,D4,D5},创始人—{D3,D4,D5},加盟—{D2,D3,D5},拉斯—{D3,D5},离开—{D3},Wave—{D4},取消—{D4},项目—{D4},有关—{D4},社交—{D5},网站—{D5}
      
    3. Implementation details

    We describe an inverted index with a class derived from a map whose keys are the terms (the code below derives from std::map; a hash map such as unordered_map would serve equally well). Typically the key is of type string, but it may vary, so a template parameter is introduced to keep the logic generic.

    An inverted index supports two basic operations: one is adding a new document to the index; the other is, given a query consisting of several keywords, returning the corresponding set of documents.

    In the inverted index, document IDs are assigned on the fly, in increasing order, as documents are added, so every posting list is guaranteed to be sorted.
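
    Because the posting lists are sorted, an AND query over several terms can be answered by a linear-time merge of their lists. The retrieve() method shown below only takes the union of the posting lists; as a hypothetical complement (not part of the original class), a two-list intersection could look like this:

    #include <algorithm>
    #include <iterator>
    #include <list>
    #include <vector>
    
    // Hypothetical helper: intersect two sorted posting lists (AND semantics).
    // Runs in O(|a| + |b|), which is only valid because document IDs are
    // assigned in increasing order, so both lists are already sorted.
    std::vector<int> intersectPostings(const std::list<int>& a, const std::list<int>& b)
    {
        std::vector<int> result;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::back_inserter(result));
        return result;
    }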

    4. Code

    The main function contains a test of the example above.

    #include <iostream>
    #include <map>
    #include <list>
    #include <vector>
    #include <string>
    #include <set>
    
    using namespace std;
    
    template <class TKey>
    class InvIndex : public map<TKey, list<int>>
    {
    public:
    	vector<vector<TKey>> docs; // forward index: docID -> the document's terms
    
    public:
    	// add one document to the index
    	void add(vector<TKey>& doc)
    	{
    		// record the document in the forward index
    		docs.push_back(doc);
    		int curDocID = (int)docs.size();  // document IDs start at 1 (the original code used docs.size()-1, i.e. IDs from 0)
    
    		// walk over every term of the document
    		for (size_t w = 0; w < doc.size(); w++)
    		{
    			typename map<TKey,list<int>>::iterator it;
    			it = this->find(doc[w]);
    
    			// if this term has no posting list yet, create one
    			if (it == this->end())
    			{
    				list<int> newList;
    				(*this)[doc[w]] = newList;
    				it = this->find(doc[w]);
    			}
    
    			// append the new document at the end of the posting list
    			it->second.push_back(curDocID);
    		}
    	}
    
    	// run one query against the index
    	void retrieve(vector<TKey>& query, set<int>& docIDs)
    	{
    		size_t termNum = query.size();
    
    		// merge (union) the posting lists of all query terms
    		docIDs.clear();
    		for (size_t t = 0; t < termNum; t++)
    		{
    			typename map<TKey,list<int>>::iterator it;
    			// skip terms that have no posting list
    			if ((it = this->find(query[t])) != this->end())
    				docIDs.insert(it->second.begin(),it->second.end());
    		}
    
    	}
    };
    
    
    int main()
    {
    
    	string D1_tmp[] = {"谷歌","地图","之父","跳槽","Facebook"};
    	int D1_tmp_size = sizeof(D1_tmp)/sizeof(string);
    	vector<string> D1(D1_tmp,D1_tmp+D1_tmp_size);
    
    	string D2_tmp[] = {"谷歌","地图","之父","加盟","Facebook"};
    	int D2_tmp_size = sizeof(D2_tmp)/sizeof(string);
    	vector<string> D2(D2_tmp,D2_tmp+D2_tmp_size);
    
    	string D3_tmp[] = {"谷歌","地图","创始人","拉斯","离开","谷歌","加盟","Facebook"};
    	int D3_tmp_size = sizeof(D3_tmp)/sizeof(string);
    	vector<string> D3(D3_tmp,D3_tmp+D3_tmp_size);
    
    	string D4_tmp[] = {"谷歌","地图","创始人","跳槽","Facebook","与","Wave","项目","取消","有关"};
    	int D4_tmp_size = sizeof(D4_tmp)/sizeof(string);
    	vector<string> D4(D4_tmp,D4_tmp+D4_tmp_size);
    
    	string D5_tmp[] = {"谷歌","地图","创始人","拉斯","加盟","社交","网站","Facebook"};
    	int D5_tmp_size = sizeof(D5_tmp)/sizeof(string);
    	vector<string> D5(D5_tmp,D5_tmp+D5_tmp_size);
    
    	InvIndex<string> inverted_index; // a stack object is enough; no need for new/delete
    	inverted_index.add(D1);
    	inverted_index.add(D2);
    	inverted_index.add(D3);
    	inverted_index.add(D4);
    	inverted_index.add(D5);
    
    	
    	string str_query[] = {"谷歌","地图","之父","跳槽","Facebook","创始人","加盟","拉斯","离开","与","Wave","项目","取消","有关","社交","网站"};
    
    	
    
    	for(int i = 0; i < sizeof(str_query)/sizeof(string); i++)
    	{
    		vector<string> query;
    		query.push_back(str_query[i]);
    
    		cout << str_query[i] << " ";
    
    		set<int> docSet;
    
    		inverted_index.retrieve(query,docSet);
    
    		set<int>::iterator it;
    		for (it = docSet.begin(); it != docSet.end(); it++)
    		{
    			cout << "D" << *it << " ";
    		}
    		cout << endl;
    
    	}
    
    	return 0;
    }
    

    Output:

    ![](https://imgconvert.csdnimg.cn/aHR0cDovL2ltZy5ibG9nLmNzZG4ubmV0LzIwMTYxMjE3MTEzOTQ3MTQ0?x-oss-process=image/format,png)

    References
    (1) Computational Advertising (《计算广告》)

  • An inverted index implemented in C++

    2020-06-11 20:52:25

    What is an inverted index (reverse index)?

    Indexing with single characters or words as the keys.

    A forward index maps documents to keywords: given a document, we look up its keywords. An inverted index maps keywords to documents: given a keyword, we look up the documents that contain it.
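
    As a minimal sketch of the two directions (the names and data here are invented for illustration and are not taken from the code further below):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>
    
    int main()
    {
        // forward index: document -> the terms it contains
        std::vector<std::vector<std::string>> forward = {
            {"hello", "world"},   // doc 0
            {"hello", "search"}   // doc 1
        };
    
        // inverted index: term -> the documents that contain it
        std::map<std::string, std::vector<int>> inverted;
        for (size_t d = 0; d < forward.size(); ++d)
            for (const std::string& w : forward[d])
                inverted[w].push_back((int)d);
    
        for (int d : inverted["hello"]) std::cout << d << " "; // prints: 0 1
        std::cout << "\n";
        return 0;
    }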

    Why is Baidu search so fast?

    Because it uses an inverted index. The real implementation is of course much more involved; here we only build a simple inverted index.

    Concretely, the words of several documents are split out (for simplicity the words are separated by spaces), and the data structure described above is built (the original post shows a figure of it here): every word has a chain of documents attached to it.

    This way, when the user types a word, we can find out which documents the word appears in (for simplicity, the positions of the word inside each document are not stored).

    For looking up the word the user types, i.e. for storing the vocabulary, a trie (prefix tree) is used here.

    Implementation

    #include <algorithm>
    #include <cctype>    // tolower
    #include <fstream>
    #include <iostream>
    #include <vector>
    #include <string>
    
    const std::string _CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.:-_/";
    const size_t MAX_NODES = 41;
    
    class node
    {
    public:
        node() { clear(); }
        node(char z) { clear(); }
        ~node()
        {
            for (size_t x = 0; x < MAX_NODES; x++)
            {
                if (next[x])
                    delete next[x];
            }
        }
        void clear()
        {
            isWord = false;
            for (size_t x = 0; x < MAX_NODES; x++)
            {
                next[x] = 0;
            }
        }
        bool isWord;                    // does a word end at this node?
        std::vector<std::string> files; // list of files containing that word
        node *next[MAX_NODES];          // children of this trie node
    };
    
    class Index
    {
    public:
        void add(std::string s, std::string fileName)
        {
            std::transform(s.begin(), s.end(), s.begin(), ::tolower);
            std::string h;
            for (std::string::iterator i = s.begin(); i != s.end(); i++)
            {
                if (*i == ' ')
                { // a space ends the current word
                    if (h.length())
                        pushFileName(addWord(h), fileName);
                    h.clear();
                    continue; // do not carry the space into the next word
                }
                h.append(1, *i);
            }
            if (h.length())
            {
                pushFileName(addWord(h), fileName);
            }
        }
        void findWord(std::string s)
        {
            std::vector<std::string> v = find(s);
            if (!v.size())
            {
                std::cout << s + " was not found!\n";
                return;
            }
            std::cout << s << " found in:\n";
            for (std::vector<std::string>::iterator i = v.begin(); i != v.end(); i++)
            {
                std::cout << *i << "\n";
            }
            std::cout << "\n";
        }
    
    private:
        node root;
        /*
            Walk the trie under root for the input string s and set isWord at the
            node of the word's last character.
            Returns a pointer to that last node.
        */
        node *addWord(std::string s)
        {
            size_t idx;
            node *rt = &root, *n;
            for (std::string::iterator i = s.begin(); i != s.end(); i++)
            {
                idx = _CHARS.find(*i);
                if (idx < MAX_NODES)
                {
                    n = rt->next[idx];
                    if (n)
                    {
                        rt = n;
                        continue;
                    }
                    n = new node(*i);
                    rt->next[idx] = n;
                    rt = n;
                }
            }
            rt->isWord = true;
            return rt;
        }
        /*
            Record the document name at the word's final trie node (avoiding duplicates).
        */
        void pushFileName(node *n, std::string fn)
        {
            std::vector<std::string>::iterator i = std::find(n->files.begin(), n->files.end(), fn);
            if (i == n->files.end())
                n->files.push_back(fn);
        }
        // Return (a copy of) the file list recorded for word s; empty if s is unknown.
        std::vector<std::string> find(std::string s)
        {
            size_t idx;
            std::transform(s.begin(), s.end(), s.begin(), ::tolower);
            node *rt = &root;
            for (std::string::iterator i = s.begin(); i != s.end(); i++)
            {
                idx = _CHARS.find(*i);
                if (idx < MAX_NODES)
                {
                    if (!rt->next[idx])
                        return std::vector<std::string>();
                    rt = rt->next[idx];
                }
            }
            if (rt->isWord)
                return rt->files;
            return std::vector<std::string>();
        }
    };
    
    int main(int argc, char *argv[])
    {
        Index t;
        std::string s;
        std::string files[] = {"file1.txt", "f_text.txt", "text_1b.txt"};
    
        for (int x = 0; x < 3; x++)
        {
            std::ifstream f;
            f.open(files[x].c_str(), std::ios::in);
            if (f.good())
            {
                while (f >> s)   // read word by word; stops cleanly at end of file
                {
                    t.add(s, files[x]);
                    s.clear();
                }
                f.close();
            }
        }
    
        while (true)
        {
            std::cout << "Enter one word to search for ,return to exit; ";
            std::getline(std::cin, s);
            if (!s.length())
                break;
            t.findWord(s);
        }
    
        return 0;
    }

    Reference link


  • C++ inverted index

    2020-02-28 04:30:17
    Reads in a text collection and builds an inverted index; the bundled TXT files can be replaced, and the source code runs as-is.
  • Building a simple inverted index in C++

    2021-04-21 17:45:18


    The first lab assignment of the Intelligent Information Retrieval course.
    The problem statement is as follows:
    1. For the 10 text files (doc01.txt~doc10.txt) in a directory on disk, build an inverted index in memory.
    2. Build a retrieval system: given an input query word, output the IDs of the documents that contain it.


    Approach

    To build the inverted index, we first give every document a number (an ID). We define a struct named Doc that ties a document to its ID, put the Doc objects into an array (an STL vector here) to obtain the document collection we want to index, then carry out the central step of building the inverted index itself (see that section for the details), and finally write a simple query function, and we are done. (The headers the code fragments below rely on are listed right after this paragraph.)
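
    The code fragments below are shown without their headers; assuming they are pasted into a single file, a preamble along the following lines is needed (the original post does not show it):

    #include <iostream>
    #include <fstream>
    #include <string>
    #include <vector>
    #include <set>
    #include <map>
    #include <utility>  // pair
    #include <cctype>   // isalpha, tolower
    using namespace std;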


    Part 1: Building the documents

    1. List the names of the documents we want to index

    static vector<string> files={ 
    "doc1.txt","doc2.txt", "doc3.txt",
    "doc4.txt","doc5.txt", "doc6.txt", 
    "doc7.txt","doc8.txt","doc9.txt","doc10.txt" };
    

    2. Define a document struct that stores the document name and the document ID

    struct Doc {
    	string docName; // document name
    	int docID;      // document ID
    	Doc() { docName = ""; docID = -1; }
    };
    

    3. Build the document collection

    vector<Doc> Docs; // the document collection
    void makeDocs(vector<Doc> &Docs) { // build the document collection
    	// number the documents one by one to form the collection
    	for (int i = 0; i < (int)files.size(); ++i) {
    		Doc doc;
    		doc.docName = files[i];
    		doc.docID = 1 + i;
    		Docs.push_back(doc);
    	}
    }
    

    4. Display the document collection we built (a small extra feature)

    void showDocs() { // display the document collection
    	for (int i = 0; i < (int)Docs.size(); ++i) {
    		cout << "DocName: " << Docs[i].docName << "\tdocID: " << Docs[i].docID << endl;
    	}
    }
    

    Part 2: Building the inverted index

    The detailed steps are in the comments (the author is being lazy here):

    map<string,set<int>> indexList; // the posting table (term -> set of doc IDs)
    void index(vector<Doc> &Docs) {
    	for (int i = 0; i < (int)Docs.size(); ++i) {
    		ifstream in(Docs[i].docName); // open the documents one by one
    		int ch;    // the character being scanned
    		string s;  // the word being assembled
    		map<string, set<int>>::iterator it; // used to test whether a term is already in the table
    		if (in.is_open()) {
    			while (!in.eof()) { // each iteration extracts one word
    				// find the first letter
    				do {
    					ch = in.get();          // read one character
    					if (in.eof()) break;    // stop at end of file
    				} while (!isalpha(ch));
    				if (in.eof()) break;        // guard against trailing blank lines
    				while (isalpha(ch)) {
    					ch = tolower(char(ch)); // lowercase the letter
    					s += ch;                // append it to the word
    					ch = in.get();
    				}
    				it = indexList.find(s);      // is s already in the posting table?
    				if (it != indexList.end()) { // s already has an entry
    					it->second.insert(i+1);  // just add this document's ID to its posting set
    				}
    				else {                       // s is not in the table yet
    					set<int> pset{i+1};      // posting set holding this document's ID
    					pair<string, set<int>> newTerm(s, pset); // the new term and its posting set
    					indexList.insert(newTerm);               // insert them into the table
    				}
    				s.clear(); // reset s for the next word
    			}
    
    		}
    		in.close(); // close the file before moving on to the next document
    	}
    	cout << "Inverted index built successfully" << endl;
    }
    

    A small extra feature (display the posting table)

    void showIndexList(const map<string, set<int>> &indexList) { // display the posting table
    	for (auto c : indexList) {
    		cout << c.first << "\t";
    		for (auto c1 : c.second) {
    			cout << c1 << " ";
    		}
    		cout << endl;
    	}
    }
    

    Part 3: Query

    void query(string s) { // query a single term
    	map<string, set<int>>::iterator it = indexList.find(s); // is s in the posting table?
    	if (it != indexList.end()) { // yes
    		cout << s << "\t" << "appears in documents: ";
    		for (auto c : it->second) { // print the term's posting set
    			cout << c << " ";
    		}
    		cout << endl;
    	}
    	else { // no
    		cout << s << " does not appear in any document" << endl;
    	}
    }
    

    The main function

    int main() {
    	makeDocs(Docs);
    	showDocs();
    	index(Docs);
    	showIndexList(indexList);
    	while (true) {
    		cout << "Enter a query word: ";
    		string s;
    		if (!(cin >> s)) break; // stop on end of input
    		query(s);
    		cout << endl;
    	}
    }
    

    Summary

    This program uses vector, set, map, pair and other parts of the STL. set and map are both built on red-black trees, so every lookup and insertion runs in O(log n). Copying the code from top to bottom gives the complete source. The author is no expert and spent a long time on this, so a like would be appreciated.
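
    If average O(1) lookups are preferred over the O(log n) of a red-black tree, the posting table can be backed by a hash map instead. A minimal sketch, assuming the same term-to-set-of-doc-IDs layout as the indexList above (this variant is not part of the original post):

    #include <iostream>
    #include <set>
    #include <string>
    #include <unordered_map>
    
    // Hypothetical variant: same posting layout as indexList, but a hash table
    // gives average O(1) term lookup instead of O(log n).
    std::unordered_map<std::string, std::set<int>> hashIndexList;
    
    void addPosting(const std::string &term, int docID)
    {
        hashIndexList[term].insert(docID); // creates the posting set on first use
    }
    
    int main()
    {
        addPosting("transfer", 1);
        addPosting("transfer", 3);
        addPosting("learning", 3);
        for (int id : hashIndexList["transfer"]) std::cout << id << " ";
        std::cout << "\n"; // prints: 1 3
        return 0;
    }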

  • An inverted index algorithm implemented in C++

    2020-12-03 17:17:12
    A C++ inverted index algorithm.
  • Reads 10 .txt files to build the index lists, sorts them and outputs the sorted list. Enter two words separated by a space to search; the program outputs the texts shared by the two words.
  • Uses MFC for visualization; implements a simple search function by building an inverted index table.
  • Absolutely the simplest version, for reference only; please don't mock it, and feel free to point out its shortcomings =.=
  • hello c++ hello c++; the required output format is: c++ c.txt-->2 hello a.txt-->4 b.txt-->4 c.txt-->4 jack b.txt-->1 java c.txt-->1 jerry b.txt-->1 c.txt-->1 jim a.txt-->1 b.txt-->1
  • INVERTED_INDEX: an implementation of the inverted index structure, written in C++. Copyright (C) 2013 Nick Georgiadis
  • The complete project source for the author's original article series, The Art of Programming, Chapter 26: Coding and Practice of Building an Inverted Index from Given Documents. See: http://blog.csdn.net/v_july_v/article/details/7109500. Built with VS2010 on Windows, and in a Linux environment as well...
  • Builds the inverted index with the SPIMI algorithm, compresses the inverted index with Gamma encoding and the dictionary with single-string compression, and writes them to a binary inverted-index file and a dictionary file respectively.
  • A small document search engine based on an inverted index, implemented in C/C++.
  • Information retrieval: inverted index

    2014-04-02 22:04:51
    A program that builds an inverted index file il.txt for the txt files in a given directory. After running, 1.txt, 2.txt and 4.txt are generated automatically; you have to fill 1.txt and 2.txt with the documents to be indexed (for example Shakespeare's collected works), and the result is written to il.txt.
  • [Information Retrieval] Boolean retrieval and inverted indexes

    2022-03-15 05:13:31
    Building Boolean retrieval and an inverted index; basic knowledge of information retrieval theory.

    Purpose of the experiment

    Master the process of building an inverted index, and master the merge algorithm for postings lists.

    Procedure

    1. Inverted index

    Following the detailed construction of an inverted index described in Figure 1.4 on page 8 of the textbook Introduction to Information Retrieval, implement the construction process over the 60 documents (one document per line) in the attached HW1.txt (included at the end of this post), using Java or another common language.

    Language used: Python 3

    1. First, read the documents into a one-dimensional array.
      Code:
    import pandas as pd  # assumed import; the snippets use pd without showing it
    
    # open the file
    f = open('HW1.txt')
    # read the documents and drop the newline at the end of every line
    doc = pd.Series(f.read().splitlines())
    

    Result: print the array to inspect it

    0 Adaptive Pairwise Preference Learning for Coll…
    1 HGMF: Hierarchical Group Matrix Factorization …
    2 Adaptive Bayesian Personalized Ranking for Het…
    3 Compressed Knowledge Transfer via Factorizatio…
    4 A Survey of Transfer Learning for Collaborativ…
    5 Mixed Factorization for Collaborative Recommen…
    6 Transfer Learning for Heterogeneous One-Class …
    7 Group Bayesian Personalized Ranking with Rich …
    8 Transfer Learning for Semi-Supervised Collabor…
    9 Transfer Learning for Behavior Prediction
    10 TOCCF: Time-Aware One-Class Collaborative Filt…

    2. Turn every document into a list of tokens.
      Steps:
      a. Convert all the text to lowercase.
      b. Split the text on non-letter characters with a regular expression (note: two consecutive non-letter characters produce an empty string, which has to be removed afterwards).
      Code:
    import re  # assumed import for the regular-expression split
    
    # convert to lowercase and split with a regular expression
    doc = doc.apply(lambda x: re.split('[^a-zA-Z]', x.lower()))
    # remove the empty strings
    for tokens in doc:
        while '' in tokens:
            tokens.remove('')
    

    Result: print the token lists to inspect them

    0 [adaptive, pairwise, preference, learning, for…
    1 [hgmf, hierarchical, group, matrix, factorizat…
    2 [adaptive, bayesian, personalized, ranking, fo…
    3 [compressed, knowledge, transfer, via, factori…
    4 [a, survey, of, transfer, learning, for, colla…
    5 [mixed, factorization, for, collaborative, rec…
    6 [transfer, learning, for, heterogeneous, one, …
    7 [group, bayesian, personalized, ranking, with,…
    8 [transfer, learning, for, semi, supervised, co…
    9 [transfer, learning, for, behavior, prediction]
    10 [toccf, time, aware, one, class, collaborative…

    3. Build the inverted index.
      Steps:
      a. Set up the following data structure:
      a hash table (dict) whose keys are strings and whose values are lists.
      The keys store all the words and serve as the index of the table; in each value list, position 0 records the length of the postings list, and the positions from 1 onward record the IDs of the documents in which the word appears.
      b. Walk over the token lists:
      1. If the word has already been seen, append the document ID to the end of its list and increase the length by one.
      2. When a word appears for the first time, insert it into the hash table.

    Code:

    hashtable = {}
    for index, list in enumerate(doc):
        for word in list:
            if word in hashtable:
                hashtable[word].append(index+1)
                hashtable[word][0] += 1
            else:
                hashtable[word] = [1, index+1]
    hashtable = dict(
        sorted(hashtable.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True))
    
    

    Result:

    inverted_index = pd.DataFrame(columns=['term', 'doc.freq', 'postings list'])
    for term in hashtable:
        inverted_index = inverted_index.append(
            {'term': term, 'doc.freq': hashtable[term][0], 'postings list': hashtable[term][1:]}, ignore_index=True)
    display(inverted_index[:20])
    
    

    (The original post shows the first 20 rows of the resulting index table as an image here.)

    2. Boolean retrieval

    Following the postings-list merge algorithm described in Figure 1.6 on page 11 of the textbook Introduction to Information Retrieval, and using the inverted index from question (1), implement the following Boolean queries in Java or another common language:
    a. transfer AND learning
    b. transfer AND learning AND filtering
    c. recommendation AND filtering
    d. recommendation OR filtering
    e. transfer AND NOT (recommendation OR filtering)

    Before working through the queries, we first write the three functions AND, OR and AND NOT.

    1. AND
      Idea:
      The arguments are the postings lists of the two words; the return value is the result list of the AND operation.
      We need to keep the doc IDs that appear in both list1 and list2. Both lists are sorted in increasing order, so we simply keep advancing the pointer that points at the smaller value; whenever the two values are equal we append that ID to the result list, until one of the pointers reaches the end.
    def And(list1, list2):
        i, j = 0, 0
        res = []
        while i < len(list1) and j < len(list2):
            # appears in both lists: add it to the result
            if list1[i] == list2[j]:
                res.append(list1[i])
                i += 1
                j += 1
            # advance the pointer at the smaller value
            elif list1[i] < list2[j]:
                i += 1
            else:
                j += 1
        return res
    
    
    2. OR
      Idea:
      The arguments are the postings lists of the two words; the return value is the result list of the OR operation.
      We need to collect every doc ID that appears in list1 or list2. The approach is much like AND: both lists are sorted in increasing order, so again we keep advancing the pointer at the smaller value; the difference is that the ID is appended to the result list even when the two values differ, until one pointer reaches the end.
      Because OR merges the two lists, the IDs left unvisited in the remaining list also have to be appended to the result list.
    def Or(list1, list2):
        i, j = 0, 0
        res = []
        while i < len(list1) and j < len(list2):
            # appears in both lists: add it only once
            if list1[i] == list2[j]:
                res.append(list1[i])
                i += 1
                j += 1
            # advance the pointer at the smaller value and add it to the result
            elif list1[i] < list2[j]:
                res.append(list1[i])
                i += 1
            else:
                res.append(list2[j])
                j += 1
        # append the IDs that were not yet visited
        res.extend(list1[i:]) if j == len(list2) else res.extend(list2[j:])
        return res
    
    
    3. AND NOT
      Idea:
      The arguments are the postings lists of the two words; the return value is the result list of the AND NOT operation.
      We need to keep the doc IDs that appear in list1 but not in list2. Both lists are sorted in increasing order, so we keep advancing the pointer at the smaller value, and whenever the ID in list1 is the smaller one we append it to the result list, until one pointer reaches the end.
      If list2 is exhausted while list1 is not, the remaining IDs of list1 cannot appear in list2, so they are appended to the result list as well.
    def AndNot(list1, list2):
        i, j = 0, 0
        res = []
        while i < len(list1) and j < len(list2):
            # equal IDs: advance both pointers
            if list1[i] == list2[j]:
                i += 1
                j += 1
            # the ID in list1 is smaller: add it to the result
            elif list1[i] < list2[j]:
                res.append(list1[i])
                i += 1
            else:
                j += 1
        # list1 not exhausted: append its remaining IDs
        if i != len(list1):
            res.extend(list1[i:])
        return res
    
    
    4. A helper function: fetch a word's postings list from the hash table and drop its first element (which records the number of entries).
    def getList(word):
        return hashtable[word][1:]
    

    a) transfer AND learning

    Result: transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

    print('transfer AND learning:', And(getList('transfer'), getList('learning')),'\n')
    
    print('transfer:', getList('transfer'))
    print('learning:',getList('learning'))
    
    

    transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

    transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
    learning: [1, 5, 7, 9, 10, 14, 16, 17, 19, 20, 25, 30, 32, 33, 36, 41, 47, 49, 54, 55, 56, 58]

    The result is correct.

    b) transfer AND learning AND filtering

    First compute transfer AND learning, then AND the result with filtering.
    Result: transfer AND learning AND filtering: [7, 25, 33, 55, 56]

    print('transfer AND learning AND filtering:', And(And(getList('transfer'), getList('learning')), getList('filtering')), '\n')
    
    print('transfer:', getList('transfer'))
    print('learning:', getList('learning'))
    print('filtering:', getList('filtering'))
    print('transfer AND learning:', And(getList('transfer'), getList('learning')))
    

    transfer AND learning AND filtering: [7, 25, 33, 55, 56]

    transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
    learning: [1, 5, 7, 9, 10, 14, 16, 17, 19, 20, 25, 30, 32, 33, 36, 41, 47, 49, 54, 55, 56, 58]
    filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]
    transfer AND learning: [5, 7, 9, 10, 16, 17, 25, 32, 33, 49, 55, 56]

    The result is correct.

    c) recommendation AND filtering

    Result: recommendation AND filtering: [13, 26, 38]

    print('recommendation AND filtering:', And(getList('recommendation'), getList('filtering')), '\n')
    
    print('recommendation:', getList('recommendation'))
    print('filtering:', getList('filtering'))
    

    recommendation AND filtering: [13, 26, 38]

    recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
    filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]

    The result is correct.

    d) recommendation OR filtering

    Result: recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

    print('recommendation OR filtering:', Or(getList('recommendation'), getList('filtering')), '\n')
    
    print('recommendation:', getList('recommendation'))
    print('filtering:', getList('filtering'))
    

    recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

    recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
    filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]

    The result is correct.

    e) transfer AND NOT (recommendation OR filtering)

    First compute recommendation OR filtering, then apply AND NOT.
    Result: transfer AND NOT (recommendation OR filtering) [10, 16]

    print('transfer AND NOT (recommendation OR filtering)', \
        AndNot(getList('transfer'), Or(getList('recommendation'), getList('filtering'))), '\n')
    
    print('transfer:', getList('transfer'))
    print('recommendation:', getList('recommendation'))
    print('filtering:', getList('filtering'))
    print('recommendation OR filtering:', Or(getList('recommendation'), getList('filtering')))
    

    transfer AND NOT (recommendation OR filtering) [10, 16]

    transfer: [4, 5, 7, 9, 10, 16, 17, 24, 25, 29, 32, 33, 43, 49, 55, 56]
    recommendation: [1, 2, 4, 5, 6, 9, 13, 14, 15, 17, 20, 21, 26, 29, 31, 32, 34, 35, 38, 39, 43, 44, 45, 47, 48, 49, 50, 51, 53, 59, 60]
    filtering: [7, 8, 11, 12, 13, 18, 19, 22, 24, 25, 26, 27, 30, 33, 36, 37, 38, 41, 42, 46, 52, 54, 55, 56, 57, 58]
    recommendation OR filtering: [1, 2, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 20, 21, 22, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

    Appendix: HW1.txt

    Adaptive Pairwise Preference Learning for Collaborative Recommendation with Implicit Feedbacks
    HGMF: Hierarchical Group Matrix Factorization for Collaborative Recommendation
    Adaptive Bayesian Personalized Ranking for Heterogeneous Implicit Feedbacks
    Compressed Knowledge Transfer via Factorization Machine for Heterogeneous Collaborative Recommendation
    A Survey of Transfer Learning for Collaborative Recommendation with Auxiliary Data
    Mixed Factorization for Collaborative Recommendation with Heterogeneous Explicit Feedbacks
    Transfer Learning for Heterogeneous One-Class Collaborative Filtering
    Group Bayesian Personalized Ranking with Rich Interactions for One-Class Collaborative Filtering
    Transfer Learning for Semi-Supervised Collaborative Recommendation
    Transfer Learning for Behavior Prediction
    TOCCF: Time-Aware One-Class Collaborative Filtering
    RBPR: Role-based Bayesian Personalized Ranking for Heterogeneous One-Class Collaborative Filtering
    Hybrid One-Class Collaborative Filtering for Job Recommendation
    Mixed Similarity Learning for Recommendation with Implicit Feedback
    Collaborative Recommendation with Multiclass Preference Context
    Transfer Learning for Behavior Ranking
    Transfer Learning from APP Domain to News Domain for Dual Cold-Start Recommendation
    k-CoFi: Modeling k-Granularity Preference Context in Collaborative Filtering
    CoFi-points: Collaborative Filtering via Pointwise Preference Learning over Item-Sets
    Personalized Recommendation with Implicit Feedback via Learning Pairwise Preferences over Item-sets
    BIS: Bidirectional Item Similarity for Next-Item Recommendation
    RLT: Residual-Loop Training in Collaborative Filtering for Combining Factorization and Global-Local Neighborhood
    MF-DMPC: Matrix Factorization with Dual Multiclass Preference Context for Rating Prediction
    Transfer to Rank for Heterogeneous One-Class Collaborative Filtering
    Neighborhood-Enhanced Transfer Learning for One-Class Collaborative Filtering
    Next-Item Recommendation via Collaborative Filtering with Bidirectional Item Similarity
    Asymmetric Bayesian Personalized Ranking for One-Class Collaborative Filtering
    Context-aware Collaborative Ranking
    Transfer to Rank for Top-N Recommendation
    Dual Similarity Learning for Heterogeneous One-Class Collaborative Filtering
    Sequence-Aware Factored Mixed Similarity Model for Next-Item Recommendation
    PAT: Preference-Aware Transfer Learning for Recommendation with Heterogeneous Feedback
    Adaptive Transfer Learning for Heterogeneous One-Class Collaborative Filtering
    A General Knowledge Distillation Framework for Counterfactual Recommendation via Uniform Data
    FISSA: Fusing Item Similarity Models with Self-Attention Networks for Sequential Recommendation
    Asymmetric Pairwise Preference Learning for Heterogeneous One-Class Collaborative Filtering
    k-Reciprocal Nearest Neighbors Algorithm for One-Class Collaborative Filtering
    CoFiGAN: Collaborative Filtering by Generative and Discriminative Training for One-Class Recommendation
    Conditional Restricted Boltzmann Machine for Item Recommendation
    Matrix Factorization with Heterogeneous Multiclass Preference Context
    CoFi-points: Collaborative Filtering via Pointwise Preference Learning on User/Item-Set
    A Survey on Heterogeneous One-Class Collaborative Filtering
    Holistic Transfer to Rank for Top-N Recommendation
    FedRec: Federated Recommendation with Explicit Feedback
    Factored Heterogeneous Similarity Model for Recommendation with Implicit Feedback
    FCMF: Federated Collective Matrix Factorization for Heterogeneous Collaborative Filtering
    Sequence-Aware Similarity Learning for Next-Item Recommendation
    FR-FMSS: Federated Recommendation via Fake Marks and Secret Sharing
    Transfer Learning in Collaborative Recommendation for Bias Reduction
    Mitigating Confounding Bias in Recommendation via Information Bottleneck
    FedRec++: Lossless Federated Recommendation with Explicit Feedback
    VAE++: Variational AutoEncoder for Heterogeneous One-Class Collaborative Filtering
    TransRec++: Translation-based Sequential Recommendation with Heterogeneous Feedback
    Collaborative filtering with implicit feedback via learning pairwise preferences over user-groups and item-sets
    Interaction-Rich Transfer Learning for Collaborative Filtering with Heterogeneous User Feedback
    Transfer Learning in Heterogeneous Collaborative Filtering Domains
    GBPR: Group Preference based Bayesian Personalized Ranking for One-Class Collaborative Filtering
    CoFiSet: Collaborative Filtering via Learning Pairwise Preferences over Item-sets
    Modeling Item Category for Effective Recommendation
    Position-Aware Context Attention for Session-Based Recommendation

  • Starting from search engines; the basic concept of an inverted index; an example of an inverted index; a brief look at forward indexes; how an inverted index is implemented
  • Inverted index

    2016-02-09 01:07:59
    An implementation of an inverted index. One file contains the names of several other files; the program opens it, reads the contents of those files, and outputs the file numbers in which each piece of content appears.
  • A simple inverted index implemented in C++

    2019-04-04 22:05:08
    A lab assignment for the Intelligent Information Retrieval course; the task is to implement an inverted index. It uses the STL vector, which I had not studied before. Link to this article on my personal blog. Do not plagiarize the code; it is for reference only, and credit the source when reposting. A brief introduction to inverted indexes: in order to ... from a document collection (collection)...
  • How is content recalled in recommender systems? How inverted-index techniques are applied in the recall stage
  • A classroom search-engine exercise: a first implementation, in plain C, of inverted-index construction and short Chinese-text search. The code can help beginners understand the basic structure of a search engine, runs as-is, and comes with a Word document explaining the details.
  • I implemented an inverted index with a Python dict. #hash_list is a list of simhash objects; the simhash value of each element H in the list is H.hash hash_dict = {} for i,h in enumerate(hash_list): #convert the hash value into a 0/1 sequence and pad it to 64 dimensions hash...
  • C-style and C++-style file read/write operations; correct splitting of English text into words; Trie-based word-frequency statistics over files; Trie-based word-frequency statistics with an inverted index.
  • Cpp: inverted index and WordNet

    2021-02-16 11:49:22
  • Implementing an inverted index in C++

    2018-06-12 23:17:24
    * Implementing an inverted index in C++ * Author: Steven' * E-mail: steven_zhaosh@163.com, feel free to contact me! * Documents are read in from files; I cannot yet write a standalone Chinese word segmenter myself, so only English documents are supported for now. ...
  • Sparse similarity with an inverted index

    2020-11-09 13:56:56
    How to control the number of digits and round correctly when converting a double to a string; the following code... Inverted index. Problem: problem link, interview question 17.26, Sparse Similarity. Problem description: the number of elements in the intersection of two documents (with distinct words) divided by the number of elements in their union...
  • Clustered Elias-Fano indexes. This is the experimental code for the paper by Giulio Ermanno Pibiri and Rossano Venturini published in ACM TOIS 2017 [1]. This guide gives a brief overview of the library and illustrates its functionality with a few examples. Contents: building the code. The...
  • An implementation of an inverted index in C++

    2015-04-05 10:31:21
    1.1 Basic introduction ... An inverted index is a very basic concept in search engines; almost all search engines use inverted indexes. 1.2 Preparation: 5 source files, Test0.txt, Test1.txt, Test2.txt, Test3.txt, Test4.txt, which contain
