  • GraphSAGE Code Walkthrough - PyTorch version

    1. Introduction to GraphSAGE

    • Paper: Inductive Representation Learning on Large Graphs

    • Authors: William L. Hamilton, Rex Ying and Jure Leskovec

      Graph neural network methods proposed before GraphSAGE, such as DeepWalk and GCN, are transductive models: even a small change to the graph structure requires retraining the whole model, so they cannot produce embeddings for new nodes quickly. To address this, the authors propose an inductive model, GraphSAGE, whose goal is to train an aggregator that collects information from a target node's neighbors, so that low-dimensional representations of previously unseen nodes can be generated quickly.

    • The basic GraphSAGE pipeline is shown below:
      [figure: GraphSAGE sampling and aggregation pipeline]
      1) First, a fixed-size neighborhood is sampled around each target node; 2) then an aggregator pulls the features of neighbors within a bounded number of hops into the target node. The pseudocode is as follows:
      [figure: GraphSAGE embedding-generation pseudocode (Algorithm 1)]
      As the pseudocode shows, GraphSAGE takes as input the graph $G$, the node feature vectors $x_v$, the weight matrices $W^k$, a non-linear activation $\sigma$, the aggregator functions, and the neighborhood function $N$. 1) $h_v^0$ is initialized to the node's feature vector, and the outer loop runs for $K$ steps. 2) In each step, for every node, the neighbors' representations from step $k-1$ are aggregated, the result is concatenated with the node's own step $k-1$ representation, multiplied by $W^k$ and passed through the activation. 3) After the $K$ steps, $h_v^k$ is divided by $\|h_v^k\|_2$ to obtain the node's low-dimensional representation.

    • Aggregators: as the pipeline and pseudocode above show, GraphSAGE relies on aggregators. What is an aggregator and what does it do? Its job is simply to aggregate the information of the target node's neighbors. The paper proposes three different aggregators (a minimal sketch of the mean variant follows this list):
      1) Mean aggregator: averages the feature vectors of the neighbor nodes and the target node.
      [equation: mean aggregator]
      2) LSTM aggregator: uses an LSTM to aggregate the neighbors' information.
      3) Pooling aggregator: each neighbor's feature vector is first passed through a fully connected layer, and a pooling operation is then applied.
      [equation: pooling aggregator]
      Here max is an element-wise max.
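
      To make the update rule concrete, below is a minimal NumPy sketch of a single GraphSAGE layer with the mean aggregator (the weight matrix W and the neighbors dict are illustrative placeholders, not part of the reference implementation that follows):

      import numpy as np

      def sage_mean_layer(h, neighbors, W):
          """One GraphSAGE step with the mean aggregator.
          h         : (N, d) node representations at step k-1
          neighbors : dict mapping node id -> list of neighbor ids
          W         : (d_out, 2*d) weight matrix applied to CONCAT(self, neighbors)
          """
          out = []
          for v in range(h.shape[0]):
              h_neigh = h[neighbors[v]].mean(axis=0)        # AGGREGATE over N(v)
              combined = np.concatenate([h[v], h_neigh])    # CONCAT(h_v^{k-1}, h_{N(v)}^k)
              out.append(np.maximum(W @ combined, 0))       # sigma(W^k . combined)
          out = np.stack(out)
          return out / np.linalg.norm(out, axis=1, keepdims=True)  # normalize by ||h||_2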

    2. Code Walkthrough

    import numpy as np
    import pandas as pd
    import os,sys
    import argparse
    import torch 
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import random
    import math
    from sklearn.utils import shuffle
    from sklearn.metrics import f1_score
    from collections import defaultdict
    

    2.1 Loading the data

    • The Cora data used by this code consists of two files, cora_content and cora_cite. For a detailed description of the dataset, see: Cora dataset description
    class DataCenter(object):
        """Load the dataset.
        Parameters:
            file_paths: {name of data file: path of data file, ...}
        """
        def __init__(self, file_paths):
            """file_paths: {name:root,...,}"""
            super(DataCenter, self).__init__()
            self.file_paths = file_paths

        def load_Dataset(self, dataset='cora'):
            """Read the dataset stored at the configured paths."""
            feat_list = []   # feature vector of every node
            label_list = []  # class label of every node
            node_map = {}    # re-index the nodes
            label_map = {}   # map each label string to an integer

            if dataset == 'cora':
                content = self.file_paths['cora_content']  # path of cora_content
                cite = self.file_paths['cora_cite']        # path of cora_cite
                with open(content) as f1:
                    for i, each_sample in enumerate(f1.readlines()):  # iterate over the samples
                        sample_clean = each_sample.strip().split()
                        # the first element is the sample name and the last one its label;
                        # everything in between is the feature vector
                        feat_list.append(sample_clean[1:-1])
                        # map the node name to a node index
                        node_map[sample_clean[0]] = i
                        label = sample_clean[-1]
                        if label not in label_map.keys():
                            # map the label to an integer
                            label_map[label] = len(label_map)
                        label_list.append(label_map[label])
                    feat_list = np.asarray(feat_list, dtype=np.float64)
                    label_list = np.asarray(label_list, dtype=np.int64)

                # adjacency lists of every node: {v0: [neighbors of v0], v1: [neighbors of v1], ...}
                adj_lists = defaultdict(set)
                with open(cite) as f2:
                    for j, each_pair in enumerate(f2.readlines()):
                        pair = each_pair.strip().split()
                        assert len(pair) == 2
                        adj_lists[node_map[pair[0]]].add(node_map[pair[1]])
                        adj_lists[node_map[pair[1]]].add(node_map[pair[0]])

                assert len(feat_list) == len(label_list) == len(adj_lists)
                train_index, test_index, val_index = self._split_data(feat_list.shape[0])
                # the arrays can later be retrieved with getattr()
                setattr(self, dataset+'_test', test_index)
                setattr(self, dataset+'_val', val_index)
                setattr(self, dataset+'_train', train_index)
                setattr(self, dataset+'_feats', feat_list)
                setattr(self, dataset+'_labels', label_list)
                setattr(self, dataset+'_adj_lists', adj_lists)

        def _split_data(self, number_of_nodes, test_split=3, val_split=6):
            """Split the nodes into training, validation and test sets."""
            # shuffle the node indices
            rand_indices = np.random.permutation(number_of_nodes)
            test_size = number_of_nodes // test_split
            val_size = number_of_nodes // val_split
            test_index = rand_indices[:test_size]
            val_index = rand_indices[test_size:test_size+val_size]
            train_index = rand_indices[test_size+val_size:]
            return train_index, test_index, val_index
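
    A minimal usage sketch (the two file paths are placeholders for wherever the Cora files live locally):

    file_paths = {'cora_content': './cora.content', 'cora_cite': './cora.cites'}
    datacenter = DataCenter(file_paths)
    datacenter.load_Dataset('cora')
    feats = getattr(datacenter, 'cora_feats')          # (2708, 1433) feature matrix
    labels = getattr(datacenter, 'cora_labels')        # (2708,) integer labels
    adj_lists = getattr(datacenter, 'cora_adj_lists')  # {node: set of neighbor nodes}
    train_index = getattr(datacenter, 'cora_train')    # indices of the training nodes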
    

    2.2 Unsupervised Loss

    GraphSAGE defines the following unsupervised loss: $J_G(z_u) = -\log\big(\sigma(z_u^\top z_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\big(\sigma(-z_u^\top z_{v_n})\big)$
    where $Q$ is the number of negative samples, $v$ is a node that co-occurs with $u$ on a short random walk, and $P_n$ is the negative-sampling distribution; the first term is the loss computed on positive pairs and the second term the loss computed on negative pairs.

    class UnsupervisedLoss(object):
    	"""docstring for UnsupervisedLoss"""
    	def __init__(self, adj_lists, train_nodes, device):
            """初始化参数"""
    		super(UnsupervisedLoss, self).__init__()
    		self.Q = 10 # 负样本的数量
    		self.N_WALKS = 6 # 每个节点随机游走的次数
    		self.WALK_LEN = 1 # 每次随机游走的步长
    		self.N_WALK_LEN = 5 # 每次负样本随机游走几个节点
    		self.MARGIN = 3 
    		self.adj_lists = adj_lists #{v0:[v0的邻居集合],v1:[v1的邻居集合],...,vn:[vn的邻居集合]}
    		self.train_nodes = train_nodes # 训练节点
    		self.device = device # cpu or gpu
    
    		self.target_nodes = None
    		self.positive_pairs = [] # 存放正例样本 [(v0,v0邻居中采样到的正例节点),....,]
    		self.negtive_pairs = [] # 存放负例样本 [(v0,v0邻居中采样到的负例节点),....,]
    		self.node_positive_pairs = {} # {v0:[(v0,从v0开始随机游走采样到的正例节点),(v0,从v0开始随机游走采样到的正例节点)],...,vn:[(vn,从vn开始随机游走采样到的正例节点)]}
    		self.node_negtive_pairs = {} # {v0:[(v0,从v0开始随机游走采样到的负例节点),(v0,从v0开始随机游走采样到的负例节点)],...,vn:[(vn,从vn开始随机游走采样到的负例节点)]}
    		self.unique_nodes_batch = [] # 一个batch所有会用到的节点及其邻居节点
    
    	def get_loss_sage(self, embeddings, nodes):
            """根据论文里的公式计算损失函数"""
    		assert len(embeddings) == len(self.unique_nodes_batch) #判断是不是每个节点都有了embeddings
    		assert False not in [nodes[i]==self.unique_nodes_batch[i] for i in range(len(nodes))] # 判断目标节点集和unique集里的节点是否1一一对应
    		node2index = {n:i for i,n in enumerate(self.unique_nodes_batch)} # 把节点重新编码
    
    		nodes_score = []
    		assert len(self.node_positive_pairs) == len(self.node_negtive_pairs) # 确定正例节点对和负例节点对的数量是否相同
    		for node in self.node_positive_pairs: # 遍历所有节点
    			pps = self.node_positive_pairs[node] # 获得对应的正例 [(v0,v0正例样本1),(v0,v0正例样本2),...,(v0,v0正例样本n)]
    			nps = self.node_negtive_pairs[node] # 获得每个节点对应的负例 [(v0,v0负例样本1),(v0,v0负例样本2),...,(v0,v0负例样本n)]
    			if len(pps) == 0 or len(nps) == 0: # 判断是否都有正例和负例
    				continue
    
    			# Q * Exception(negative score)计算负例样本的Loss,即Loss函数的后一项
    			indexs = [list(x) for x in zip(*nps)] # [[源节点,...,源节点],[采样得到的负节点1,...,采样得到的负节点n]]
    			node_indexs = [node2index[x] for x in indexs[0]] # 获得源节点的编号
    			neighb_indexs = [node2index[x] for x in indexs[1]] # 负样本节点的编号
    			neg_score = F.cosine_similarity(embeddings[node_indexs], embeddings[neighb_indexs]) # 计算余弦相似性
    			neg_score = self.Q*torch.mean(torch.log(torch.sigmoid(-neg_score)), 0) # 计算损失的后一项
    			#print(neg_score)
    
    			# multiple positive score 计算正列样本的Loss,即Loss函数的前一项
    			indexs = [list(x) for x in zip(*pps)]
    			node_indexs = [node2index[x] for x in indexs[0]]
    			neighb_indexs = [node2index[x] for x in indexs[1]]
    			pos_score = F.cosine_similarity(embeddings[node_indexs], embeddings[neighb_indexs])
    			pos_score = torch.log(torch.sigmoid(pos_score)) # 计算损失的前一项
    			#print(pos_score)
    
    			nodes_score.append(torch.mean(- pos_score - neg_score).view(1,-1)) # 把每个节点的损失加入到列表中
    				
    		loss = torch.mean(torch.cat(nodes_score, 0)) # 求平均
    		
    		return loss
    
    	def get_loss_margin(self, embeddings, nodes):
    		assert len(embeddings) == len(self.unique_nodes_batch)
    		assert False not in [nodes[i]==self.unique_nodes_batch[i] for i in range(len(nodes))]
    		node2index = {n:i for i,n in enumerate(self.unique_nodes_batch)}
    
    		nodes_score = []
    		assert len(self.node_positive_pairs) == len(self.node_negtive_pairs)
    		for node in self.node_positive_pairs:
    			pps = self.node_positive_pairs[node]
    			nps = self.node_negtive_pairs[node]
    			if len(pps) == 0 or len(nps) == 0:
    				continue
    
    			indexs = [list(x) for x in zip(*pps)]
    			node_indexs = [node2index[x] for x in indexs[0]]
    			neighb_indexs = [node2index[x] for x in indexs[1]]
    			pos_score = F.cosine_similarity(embeddings[node_indexs], embeddings[neighb_indexs])
    			pos_score, _ = torch.min(torch.log(torch.sigmoid(pos_score)), 0)
    
    			indexs = [list(x) for x in zip(*nps)]
    			node_indexs = [node2index[x] for x in indexs[0]]
    			neighb_indexs = [node2index[x] for x in indexs[1]]
    			neg_score = F.cosine_similarity(embeddings[node_indexs], embeddings[neighb_indexs])
    			neg_score, _ = torch.max(torch.log(torch.sigmoid(neg_score)), 0)
    
    			nodes_score.append(torch.max(torch.tensor(0.0).to(self.device), neg_score-pos_score+self.MARGIN).view(1,-1))
    			# nodes_score.append((-pos_score - neg_score).view(1,-1))
    
    		loss = torch.mean(torch.cat(nodes_score, 0),0)
    
    		# loss = -torch.log(torch.sigmoid(pos_score))-4*torch.log(torch.sigmoid(-neg_score))
    		
    		return loss
    
    
    	def extend_nodes(self, nodes, num_neg=6):
    		"""获得目标节点集的正样本和负样本,输出这些节点的集合"""
    		self.positive_pairs = []
    		self.node_positive_pairs = {}
    		self.negtive_pairs = []
    		self.node_negtive_pairs = {}
    
    		self.target_nodes = nodes
    		self.get_positive_nodes(nodes)
    		# print(self.positive_pairs)
    		self.get_negtive_nodes(nodes, num_neg) 
    		# print(self.negtive_pairs)
    		self.unique_nodes_batch = list(set([i for x in self.positive_pairs for i in x]) | set([i for x in self.negtive_pairs for i in x]))
    		assert set(self.target_nodes) < set(self.unique_nodes_batch)
    		return self.unique_nodes_batch
    
    	def get_positive_nodes(self, nodes):
    		return self._run_random_walks(nodes) # 通过随机游走获得正列样本
    
    	def get_negtive_nodes(self, nodes, num_neg):
            """
            生成负样本,也就是让目标节点与目标节点相隔很远的节点组成一个负例
            """
    		for node in nodes: # 遍历每个节点
    			neighbors = set([node])
    			frontier = set([node])
    			for i in range(self.N_WALK_LEN):
    				current = set() 
    				for outer in frontier:
    					current |= self.adj_lists[int(outer)] #获取frontier中所有的邻居节点
    				frontier = current - neighbors #去除源节点
    				neighbors |= current # 源节点+邻居节点
    			far_nodes = set(self.train_nodes) - neighbors # 减去train_nodes里源节点及其一阶邻居
    			neg_samples = random.sample(far_nodes, num_neg) if num_neg < len(far_nodes) else far_nodes # 从二阶邻居开始采样
    			self.negtive_pairs.extend([(node, neg_node) for neg_node in neg_samples])
    			self.node_negtive_pairs[node] = [(node, neg_node) for neg_node in neg_samples]
    		return self.negtive_pairs
    
    	def _run_random_walks(self, nodes):
    		for node in nodes: # 遍历每个节点
    			if len(self.adj_lists[int(node)]) == 0: # 若该节点没有邻居节点则跳过
    				continue
    			cur_pairs = [] # 创建一个
    			for i in range(self.N_WALKS): # 每个节点会有N_WALKS次的随机游走
    				curr_node = node # 
    				for j in range(self.WALK_LEN): # 每次随机游走走WALK_LEN的长度
    					neighs = self.adj_lists[int(curr_node)]
    					next_node = random.choice(list(neighs))
    					# self co-occurrences are useless
    					if next_node != node and next_node in self.train_nodes:
    						self.positive_pairs.append((node,next_node))
    						cur_pairs.append((node,next_node))
    					curr_node = next_node
    
    			self.node_positive_pairs[node] = cur_pairs
    		return self.positive_pairs
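
    A short sketch of how this loss object is driven during training (it mirrors what apply_model in section 2.4 does; graphSage is the model defined in the next section, and the batch of node ids is illustrative):

    unsupervised_loss = UnsupervisedLoss(adj_lists, train_nodes, device)
    nodes_batch = train_nodes[:20]
    # extend the batch with sampled positive/negative nodes and every node they involve
    extended_batch = np.asarray(list(unsupervised_loss.extend_nodes(nodes_batch, num_neg=6)))
    embs_batch = graphSage(extended_batch)   # embeddings for every node in the extended batch
    loss_net = unsupervised_loss.get_loss_sage(embs_batch, extended_batch)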
    

    2.3 Models

    • Classification model
    class Classification(nn.Module):
        """A minimal one-layer classifier.
        Parameters:
            input_size: input dimension
            num_classes: number of classes
        return:
            logists: log-probabilities over the classes
        """
        def __init__(self,input_size,num_classes):
            super(Classification,self).__init__()
            self.fc1 = nn.Linear(input_size,num_classes) # a single input_size x num_classes linear layer
            self.init_params() # initialize the weight parameters
            
        def init_params(self):
            for param in self.parameters():
                if len(param.size()) == 2: # re-initialize only matrix-shaped parameters (not biases)
                    nn.init.xavier_uniform_(param) 
        
        def forward(self,x):
            logists = torch.log_softmax(self.fc1(x),1) # log_softmax gives the final class log-probabilities
            return logists
    
    • GraphSAGE
    class SageLayer(nn.Module):
    	"""
    	一层SageLayer
    	"""
    	def __init__(self, input_size, out_size, gcn=False): 
    		super(SageLayer, self).__init__()
    		self.input_size = input_size
    		self.out_size = out_size
    		self.gcn = gcn
    		self.weight = nn.Parameter(torch.FloatTensor(out_size, self.input_size if self.gcn else 2 * self.input_size)) #初始化权重参数w*input.T
    		self.init_params() # 调整权重参数分布
    
    	def init_params(self):
    		for param in self.parameters():
    			nn.init.xavier_uniform_(param)
    
    	def forward(self, self_feats, aggregate_feats, neighs=None):
    		"""
    		Parameters:
    			self_feats:源节点的特征向量
    			aggregate_feats:聚合后的邻居节点特征
    		"""
    		if not self.gcn: # 如果不是gcn的话就要进行concatenate
    			combined = torch.cat([self_feats, aggregate_feats], dim=1)
    		else:
    			combined = aggregate_feats
    		combined = F.relu(self.weight.mm(combined.t())).t()
    		return combined
    
    class GraphSage(nn.Module):
    	"""定义一个GraphSage模型"""
    	def __init__(self, num_layers, input_size, out_size, raw_features, adj_lists, device, gcn=False, agg_func='MEAN'):
    		super(GraphSage, self).__init__()
    		self.input_size = input_size 
    		self.out_size = out_size
    		self.num_layers = num_layers # Graphsage的层数
    		self.gcn = gcn
    		self.device = device
    		self.agg_func = agg_func
    		self.raw_features = raw_features
    		self.adj_lists = adj_lists
    		# 定义每一层的输入和输出
    		for index in range(1, num_layers+1):
    			layer_size = out_size if index != 1 else input_size
    			setattr(self, 'sage_layer'+str(index), SageLayer(layer_size, out_size, gcn=self.gcn))#除了第1层的输入为input_size,其余层的输入和输出均为outsize
    
    	def forward(self, nodes_batch):
    		"""
    		为一批节点生成嵌入表示
    		Parameters:
    			nodes_batch:目标批次的节点
    		"""
    		lower_layer_nodes = list(nodes_batch) # 初始化第一层节点
    		nodes_batch_layers = [(lower_layer_nodes,)] # 存放每一层的节点信息
    		for i in range(self.num_layers):
    			lower_samp_neighs, lower_layer_nodes_dict, lower_layer_nodes= self._get_unique_neighs_list(lower_layer_nodes) # 根据当前层节点获得下一层节点
    			nodes_batch_layers.insert(0, (lower_layer_nodes, lower_samp_neighs, lower_layer_nodes_dict))
    
    		assert len(nodes_batch_layers) == self.num_layers + 1
    
    		pre_hidden_embs = self.raw_features # 初始化h0
    		for index in range(1, self.num_layers+1):
    			nb = nodes_batch_layers[index][0]  #所有邻居节点
    			pre_neighs = nodes_batch_layers[index-1] # 上一层的邻居节点
    			aggregate_feats = self.aggregate(nb, pre_hidden_embs, pre_neighs)
    			sage_layer = getattr(self, 'sage_layer'+str(index))
    			if index > 1:
    				nb = self._nodes_map(nb, pre_hidden_embs, pre_neighs)
    			# self.dc.logger.info('sage_layer.')
    			cur_hidden_embs = sage_layer(self_feats=pre_hidden_embs[nb],
    										aggregate_feats=aggregate_feats)
    			pre_hidden_embs = cur_hidden_embs
    
    		return pre_hidden_embs
    
    	def _nodes_map(self, nodes, hidden_embs, neighs):
    		layer_nodes, samp_neighs, layer_nodes_dict = neighs
    		assert len(samp_neighs) == len(nodes)
    		index = [layer_nodes_dict[x] for x in nodes]
    		return index
    
    	def _get_unique_neighs_list(self, nodes, num_sample=10):
    		_set = set 
    		to_neighs = [self.adj_lists[int(node)] for node in nodes] # 获取目标节点集的所有邻居节点[[v0的邻居],[v1的邻居],[v2的邻居]]
    		if not num_sample is None: # 如果num_sample为实数的话
    			_sample = random.sample  
    			samp_neighs = [_set(_sample(to_neigh, num_sample)) if len(to_neigh) >= num_sample else to_neigh for to_neigh in to_neighs] # [set(随机采样的邻居集合),set(),set()]
                # 遍历所有邻居集合如果邻居节点数>=num_sample,就从邻居节点集中随机采样num_sample个邻居节点,否则直接把邻居节点集放进去
    		else:
    			samp_neighs = to_neighs 
    		samp_neighs = [samp_neigh | set([nodes[i]]) for i, samp_neigh in enumerate(samp_neighs)] # 把源节点也放进去
    		_unique_nodes_list = list(set.union(*samp_neighs)) #展平
    		i = list(range(len(_unique_nodes_list))) # 重新编号
    		unique_nodes = dict(list(zip(_unique_nodes_list, i)))
    		return samp_neighs, unique_nodes, _unique_nodes_list
    
    	def aggregate(self, nodes, pre_hidden_embs, pre_neighs, num_sample=10):
            """聚合邻居节点信息
            Parameters:
                nodes:从最外层开始的节点集合
                pre_hidden_embs:上一层的节点嵌入
                pre_neighs:上一层的节点
            """
    		unique_nodes_list, samp_neighs, unique_nodes = pre_neighs # 上一层的源节点,...,....,
    		assert len(nodes) == len(samp_neighs) 
    		indicator = [(nodes[i] in samp_neighs[i]) for i in range(len(samp_neighs))] # 判断每个节点是否出现在邻居节点中
    		assert (False not in indicator)
    		if not self.gcn:
    			# 如果不适用gcn就要把源节点去除
    			samp_neighs = [(samp_neighs[i]-set([nodes[i]])) for i in range(len(samp_neighs))]
    		if len(pre_hidden_embs) == len(unique_nodes):
    			embed_matrix = pre_hidden_embs
    		else:
    			embed_matrix = pre_hidden_embs[torch.LongTensor(unique_nodes_list)]
    		# self.dc.logger.info('3')
    		mask = torch.zeros(len(samp_neighs), len(unique_nodes))
    		column_indices = [unique_nodes[n] for samp_neigh in samp_neighs for n in samp_neigh]
    		row_indices = [i for i in range(len(samp_neighs)) for j in range(len(samp_neighs[i]))]
    		mask[row_indices, column_indices] = 1 # 每个源节点为一行,一行元素中1对应的就是邻居节点的位置
    
    		if self.agg_func == 'MEAN':
    			num_neigh = mask.sum(1, keepdim=True) # 计算每个源节点有多少个邻居节点
    			mask = mask.div(num_neigh).to(embed_matrix.device) # 
    			aggregate_feats = mask.mm(embed_matrix)
    
    		elif self.agg_func == 'MAX':
    			# print(mask)
    			indexs = [x.nonzero() for x in mask==1]
    			aggregate_feats = []
    			for feat in [embed_matrix[x.squeeze()] for x in indexs]:
    				if len(feat.size()) == 1:
    					aggregate_feats.append(feat.view(1, -1))
    				else:
    					aggregate_feats.append(torch.max(feat,0)[0].view(1, -1))
    			aggregate_feats = torch.cat(aggregate_feats, 0)
    		return aggregate_feats
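
    The mask trick inside aggregate() deserves a small standalone example: each row of mask is a normalized indicator of one node's sampled neighbors, so a single matrix multiplication performs mean aggregation for the whole batch. A toy sketch with made-up numbers:

    # 3 target nodes, 4 unique neighbor nodes, 2-dimensional embeddings (illustrative values)
    embed_matrix = torch.tensor([[1., 0.], [0., 1.], [2., 2.], [4., 0.]])
    mask = torch.tensor([[1., 1., 0., 0.],   # node 0 has neighbors {0, 1}
                         [0., 0., 1., 1.],   # node 1 has neighbors {2, 3}
                         [1., 0., 0., 1.]])  # node 2 has neighbors {0, 3}
    mask = mask.div(mask.sum(1, keepdim=True))   # divide each row by the neighbor count
    aggregate_feats = mask.mm(embed_matrix)      # -> [[0.5, 0.5], [3.0, 1.0], [2.5, 0.0]]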
    

    2.4 Evaluation and using the model

    def evaluate(dataCenter, ds, graphSage, classification, device, max_vali_f1, name, cur_epoch):
        """
        测试模型的性能
        Parameters:
        	datacenter:创建好的datacenter对像
        	ds:数据集的名称
        	graphSage:训练好的graphSage对像
        	classification:训练好的classificator
        	
        """
    	test_nodes = getattr(dataCenter, ds+'_test') # 获得测试集
    	val_nodes = getattr(dataCenter, ds+'_val') # 获得验证集
    	labels = getattr(dataCenter, ds+'_labels') # 获得标签
    
    	models = [graphSage, classification] 
    
    	params = [] # 将两个模型的参数存入一个列表中
    	for model in models:
    		for param in model.parameters():
    			if param.requires_grad:
    				param.requires_grad = False
    				params.append(param)
    
    	embs = graphSage(val_nodes)
    	logists = classification(embs)
    	_, predicts = torch.max(logists, 1)
    	labels_val = labels[val_nodes]
    	assert len(labels_val) == len(predicts)
    	comps = zip(labels_val, predicts.data)
    
    	vali_f1 = f1_score(labels_val, predicts.cpu().data, average="micro")
    	print("Validation F1:", vali_f1)
    
    	if vali_f1 > max_vali_f1:
    		max_vali_f1 = vali_f1
    		embs = graphSage(test_nodes)
    		logists = classification(embs)
    		_, predicts = torch.max(logists, 1)
    		labels_test = labels[test_nodes]
    		assert len(labels_test) == len(predicts)
    		comps = zip(labels_test, predicts.data)
    
    		test_f1 = f1_score(labels_test, predicts.cpu().data, average="micro")
    		print("Test F1:", test_f1)
    
    		for param in params:
    			param.requires_grad = True
    
    		torch.save(models, './model_best_{}_ep{}_{:.4f}.torch'.format(name, cur_epoch, test_f1))
    
    	for param in params:
    		param.requires_grad = True
    
    	return max_vali_f1
    
    def get_gnn_embeddings(gnn_model, dataCenter, ds):
    	"""使用GraphSage获得节点的嵌入表示"""
        print('Loading embeddings from trained GraphSAGE model.')
        features = np.zeros((len(getattr(dataCenter, ds+'_labels')), gnn_model.out_size))
        nodes = np.arange(len(getattr(dataCenter, ds+'_labels'))).tolist()
        b_sz = 500
        batches = math.ceil(len(nodes) / b_sz)
        embs = []
        for index in range(batches):
            nodes_batch = nodes[index*b_sz:(index+1)*b_sz]
            embs_batch = gnn_model(nodes_batch)
            assert len(embs_batch) == len(nodes_batch)
            embs.append(embs_batch)
            # if ((index+1)*b_sz) % 10000 == 0:
            #     print(f'Dealed Nodes [{(index+1)*b_sz}/{len(nodes)}]')
    
        assert len(embs) == batches
        embs = torch.cat(embs, 0)
        assert len(embs) == len(nodes)
        print('Embeddings loaded.')
        return embs.detach()
    
    def train_classification(dataCenter, graphSage, classification, ds, device, max_vali_f1, name, epochs=800):
    	"""训练分类器"""
    	print('Training Classification ...')
    	c_optimizer = torch.optim.SGD(classification.parameters(), lr=0.5)
    	# train classification, detached from the current graph
    	#classification.init_params()
    	b_sz = 50
    	train_nodes = getattr(dataCenter, ds+'_train')
    	labels = getattr(dataCenter, ds+'_labels')
    	features = get_gnn_embeddings(graphSage, dataCenter, ds)
    	for epoch in range(epochs):
    		train_nodes = shuffle(train_nodes)
    		batches = math.ceil(len(train_nodes) / b_sz)
    		visited_nodes = set()
    		for index in range(batches):
    			nodes_batch = train_nodes[index*b_sz:(index+1)*b_sz]
    			visited_nodes |= set(nodes_batch)
    			labels_batch = labels[nodes_batch]
    			embs_batch = features[nodes_batch]
    
    			logists = classification(embs_batch)
    			loss = -torch.sum(logists[range(logists.size(0)), labels_batch], 0)
    			loss /= len(nodes_batch)
    			# print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Dealed Nodes [{}/{}] '.format(epoch+1, epochs, index, batches, loss.item(), len(visited_nodes), len(train_nodes)))
    
    			loss.backward()
    			
    			nn.utils.clip_grad_norm_(classification.parameters(), 5)
    			c_optimizer.step()
    			c_optimizer.zero_grad()
    
    		max_vali_f1 = evaluate(dataCenter, ds, graphSage, classification, device, max_vali_f1, name, epoch)
    	return classification, max_vali_f1
    
    def apply_model(dataCenter, ds, graphSage, classification, unsupervised_loss, b_sz, unsup_loss, device, learn_method):
    	test_nodes = getattr(dataCenter, ds+'_test')
    	val_nodes = getattr(dataCenter, ds+'_val')
    	train_nodes = getattr(dataCenter, ds+'_train')
    	labels = getattr(dataCenter, ds+'_labels')
    
    	if unsup_loss == 'margin':
    		num_neg = 6
    	elif unsup_loss == 'normal':
    		num_neg = 100
    	else:
    		print("unsup_loss can be only 'margin' or 'normal'.")
    		sys.exit(1)
    
    	train_nodes = shuffle(train_nodes)
    
    	models = [graphSage, classification]
    	params = []
    	for model in models:
    		for param in model.parameters():
    			if param.requires_grad:
    				params.append(param)
    
    	optimizer = torch.optim.SGD(params, lr=0.7)
    	optimizer.zero_grad()
    	for model in models:
    		model.zero_grad()
    
    	batches = math.ceil(len(train_nodes) / b_sz)
    
    	visited_nodes = set()
    	for index in range(batches):
    		nodes_batch = train_nodes[index*b_sz:(index+1)*b_sz]
    
    		# extend nodes batch for unspervised learning
    		# no conflicts with supervised learning
    		nodes_batch = np.asarray(list(unsupervised_loss.extend_nodes(nodes_batch, num_neg=num_neg)))
    		visited_nodes |= set(nodes_batch)
    
    		# get ground-truth for the nodes batch
    		labels_batch = labels[nodes_batch]
    
    		# feed nodes batch to the graphSAGE
    		# returning the nodes embeddings
    		embs_batch = graphSage(nodes_batch)
    
    		if learn_method == 'sup':
    			# superivsed learning
    			logists = classification(embs_batch)
    			loss_sup = -torch.sum(logists[range(logists.size(0)), labels_batch], 0)
    			loss_sup /= len(nodes_batch)
    			loss = loss_sup
    		elif learn_method == 'plus_unsup':
    			# superivsed learning
    			logists = classification(embs_batch)
    			loss_sup = -torch.sum(logists[range(logists.size(0)), labels_batch], 0)
    			loss_sup /= len(nodes_batch)
    			# unsuperivsed learning
    			if unsup_loss == 'margin':
    				loss_net = unsupervised_loss.get_loss_margin(embs_batch, nodes_batch)
    			elif unsup_loss == 'normal':
    				loss_net = unsupervised_loss.get_loss_sage(embs_batch, nodes_batch)
    			loss = loss_sup + loss_net
    		else:
    			if unsup_loss == 'margin':
    				loss_net = unsupervised_loss.get_loss_margin(embs_batch, nodes_batch)
    			elif unsup_loss == 'normal':
    				loss_net = unsupervised_loss.get_loss_sage(embs_batch, nodes_batch)
    			loss = loss_net
    
    		print('Step [{}/{}], Loss: {:.4f}, Dealed Nodes [{}/{}] '.format(index+1, batches, loss.item(), len(visited_nodes), len(train_nodes)))
    		loss.backward()
    		for model in models:
    			nn.utils.clip_grad_norm_(model.parameters(), 5)
    		optimizer.step()
    
    		optimizer.zero_grad()
    		for model in models:
    			model.zero_grad()
    
    	return graphSage, classification
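
    A side note on the supervised loss used in train_classification and apply_model above: because Classification returns log-probabilities (log_softmax), the expression -torch.sum(logists[range(logists.size(0)), labels_batch], 0) / len(nodes_batch) is exactly the mean negative log-likelihood of the true labels. Assuming labels_batch is converted to a LongTensor, it could equivalently be written as:

    loss_sup = F.nll_loss(logists, torch.LongTensor(labels_batch))  # same value as the manual expression above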
    

    2.5 Main

    file_paths = {'cora_content':'./cora.content','cora_cite':'./cora.cites'}
    datacenter  = DataCenter(file_paths)
    datacenter.load_Dataset()
    feature_data = torch.FloatTensor(getattr(datacenter, 'cora'+'_feats'))
    label_data = torch.from_numpy(getattr(datacenter,'cora'+'_labels')).long()
    adj_lists = getattr(datacenter,'cora'+'_adj_lists')
    random.seed(824)
    np.random.seed(824)
    torch.manual_seed(824)
    torch.cuda.manual_seed_all(824)
    learn_method = 'sup'
    ds = 'cora'
    epochs = 50
    max_vali_f1=0
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    graphSage = GraphSage(2, feature_data.size(1), 128, feature_data, getattr(datacenter, ds+'_adj_lists'), device, gcn=False, agg_func='MEAN')  # gcn expects a bool; passing the string 'store_true' would silently enable GCN mode
    num_labels = len(set(getattr(datacenter, ds+'_labels')))
    classification = Classification(128, num_labels)
    unsupervised_loss = UnsupervisedLoss(getattr(datacenter, ds+'_adj_lists'), getattr(datacenter, ds+'_train'), device)
    if learn_method == 'sup':
        print('GraphSage with Supervised Learning')
    elif learn_method == 'plus_unsup':
        print('GraphSage with Supervised Learning plus Net Unsupervised Learning')
    else:
        print('GraphSage with Net Unsupervised Learning')
    
    for epoch in range(epochs):
        print('----------------------EPOCH %d-----------------------' % epoch)
        graphSage, classification = apply_model(datacenter, ds, graphSage, classification, unsupervised_loss, 20, 'normal', device, learn_method)
        if (epoch+1) % 2 == 0 and learn_method == 'unsup':
            classification, max_vali_f1 = train_classification(datacenter, graphSage, classification, ds, device, max_vali_f1, 'debug')
        if learn_method != 'unsup':
            max_vali_f1 = evaluate(datacenter, ds, graphSage, classification, device, max_vali_f1, 'debug', epoch)
    
    • The output looks like this:
      [screenshot of the training output omitted]

    References

    [1] Hamilton W L, Ying R, Leskovec J. Inductive representation learning on large graphs[J]. arXiv preprint arXiv:1706.02216, 2017.
    [2] https://github.com/twjiang/graphSAGE-pytorch

  • GraphSAGE Code Walkthrough for Graph Neural Networks (model.py / aggregators.py / encoders.py)

    1. Introduction

    I have recently been learning about graph neural networks. Transductive graph neural networks are expensive to train, so this post walks through the code of GraphSAGE, an inductive-learning framework that trains an aggregation function able to generate embeddings for unseen (new) nodes. Since I am a beginner myself, I am writing this to reinforce my own understanding; the downloaded code came without comments, so I have added comments that are as clear as I can make them. If there are mistakes, corrections are very welcome!

    2. Downloading the code

    The code comes from GitHub; it is a simplified PyTorch implementation of GraphSAGE, well suited for beginners.
    Download link: https://pan.baidu.com/s/1WW0mkHXupl6kkyyzOG9pBA
    Extraction code: v06v

    3. Dataset analysis

    The code ships with two datasets, Cora and PubMed; the analysis below focuses on Cora.
    The samples in the Cora dataset are machine-learning papers, divided into 7 classes:

    1. Case_Based
    2. Genetic_Algorithms
    3. Neural_Networks
    4. Probabilistic_Methods
    5. Reinforcement_Learning
    6. Rule_Learning
    7. Theory

    The dataset contains 2708 papers in total, split across two files:

    1. cora.cites
    2. cora.content

    The first file, cora.content, has the following format:

    <paper_id> <word_attributes>+ <class_label>
    <paper_id>: the paper ID (i.e. the ID of the node in the graph)
    <word_attributes>: the node's feature vector (0/1 encoded)
    <class_label>: the node's class
    

    The second file, cora.cites, has the following format:

    <ID of cited paper> <ID of citing paper>
    <ID of cited paper>: ID of the paper being cited
    <ID of citing paper>: ID of the citing paper
    Each line can be read as an edge between two node IDs in the graph.
    

    4. Code analysis

    There are three main source files: aggregators.py, encoders.py and model.py.
    aggregators.py: aggregates the features of neighboring nodes and returns the aggregated neighbor features.
    encoders.py: performs the convolution-style transformation on the neighbor features produced by the aggregators.
    model.py: the main file; it loads the dataset, builds the model and runs the training loop.

    4.1 model.py

    First comes the SupervisedGraphSage class, used to build the GraphSAGE model.

    class SupervisedGraphSage(nn.Module):
    
        def __init__(self, num_classes, enc):
            super(SupervisedGraphSage, self).__init__()
            self.enc = enc
            self.xent = nn.CrossEntropyLoss()
        # num_classes: number of node classes
        # enc: the Encoder that performs the convolution; enc.embed_dim is the output embedding dimension
            self.weight = nn.Parameter(torch.FloatTensor(num_classes, enc.embed_dim))
            init.xavier_uniform(self.weight)
    
        def forward(self, nodes):
            embeds = self.enc(nodes)
            scores = self.weight.mm(embeds)
            return scores.t()
    
        def loss(self, nodes, labels):
            scores = self.forward(nodes)
            return self.xent(scores, labels.squeeze())
    

    The code above is fairly simple; next, let's look at the data-loading module.

    def load_cora():
        num_nodes = 2708   #节点数量
        num_feats = 1433   #节点特征维度
        #创建一个特征矩阵,维度大小[2708,1433]
        feat_data = np.zeros((num_nodes, num_feats))
        #创建一个类别矩阵,相当于y_train的意思,维度大小[2708,1]
        labels = np.empty((num_nodes,1), dtype=np.int64)
        #节点的map映射 节点ID -> 索引
        node_map = {}
        #节点的列别标签映射 节点类别 -> 节点的类别对应的index
        label_map = {}
        with open("D:\workspace\pythonSpace\pythonProject\\nlp\graphNetwork\graphsage-simple-master\cora\cora.content") as fp:
            for i,line in enumerate(fp):
                # Each line is split into three parts:
                # info[0]: the node ID
                # info[1:-1]: the node's 1433-dimensional feature vector
                # info[-1]: the node's class label
                info = line.strip().split()
                #把加载进来的节点特征维度赋给特征矩阵feat_data
                feat_data[i,:] = list(map(float, info[1:-1]))
                #构造节点编号映射,{节点编号:索引号}
                node_map[info[0]] = i
                if not info[-1] in label_map:
                #构造节点类别映射,len(label_map)的值域[0,6],对应七种论文的类别。
                    label_map[info[-1]] = len(label_map)
                #把该论文类别对应的标签[0,6]存储在labels列表中
                labels[i] = label_map[info[-1]]
    	#创建一个空的邻接矩阵集合
        adj_lists = defaultdict(set)
        with open("D:\workspace\pythonSpace\pythonProject\\nlp\graphNetwork\graphsage-simple-master\cora\cora.cites") as fp:
            for i,line in enumerate(fp):
                # info has two parts:
                # info[0]: ID of the cited paper
                # info[1]: ID of the citing paper
                info = line.strip().split()
                #拿到这两个节点对应的索引
                paper1 = node_map[info[0]]
                paper2 = node_map[info[1]]
                #在邻接矩阵中互相添加相邻节点ID
                adj_lists[paper1].add(paper2)
                adj_lists[paper2].add(paper1)
        return feat_data, labels, adj_lists
    

    Now let's look at the training function.

    def run_cora():
    	#设置随机种子
        np.random.seed(1)
        random.seed(1)
        num_nodes = 2708
        #加载数据集
        feat_data, labels, adj_lists = load_cora()
        #随机生成[2708,1433]维度的特征向量
        features = nn.Embedding(2708, 1433)
        #使用特征维度[2708,1433]的feat_data数据替换features中的特征向量值
        features.weight = nn.Parameter(torch.FloatTensor(feat_data), requires_grad=False)
       # features.cuda()
    	#创建一个两层的graphsage网络,MeanAggregator和Encoder下面会单独解释
        agg1 = MeanAggregator(features, cuda=True)
        enc1 = Encoder(features, 1433, 128, adj_lists, agg1, gcn=True, cuda=False)
        agg2 = MeanAggregator(lambda nodes : enc1(nodes).t(), cuda=False)
        enc2 = Encoder(lambda nodes : enc1(nodes).t(), enc1.embed_dim, 128, adj_lists, agg2,
                base_model=enc1, gcn=True, cuda=False)
        #设置两层网络各自的邻居节点采样数
        enc1.num_samples = 5
        enc2.num_samples = 5
    	
        graphsage = SupervisedGraphSage(7, enc2)
    #    graphsage.cuda()
    	#打乱节点的顺序,生成一个乱序的1-2708的列表
        rand_indices = np.random.permutation(num_nodes)
        #划分测试集和训练集
        test = rand_indices[:1000]
        val = rand_indices[1000:1500]
        train = list(rand_indices[1500:])
    	#创建优化器对象
        optimizer = torch.optim.SGD(filter(lambda p : p.requires_grad, graphsage.parameters()), lr=0.7)
        times = []
        #迭代循环训练
        for batch in range(100):
        	#挑选256个节点为一个批次进行训练
            batch_nodes = train[:256]
            #打乱训练集节点的顺序,方便下一次训练选取节点
            random.shuffle(train)
            start_time = time.time()
            #梯度清零
            optimizer.zero_grad()
            #计算损失
            loss = graphsage.loss(batch_nodes, 
                    Variable(torch.LongTensor(labels[np.array(batch_nodes)])))
            #执行反向传播
            loss.backward()
            #更新参数
            optimizer.step()
            end_time = time.time()
            times.append(end_time-start_time)
            #打印每一轮的损失值
            print( batch, loss.data)
    	#查看验证集
        val_output = graphsage.forward(val) 
        print( "Validation F1:", f1_score(labels[val], val_output.data.numpy().argmax(axis=1), average="micro"))
        print ("Average batch time:", np.mean(times))
    

    4.2 aggregators.py

    Next, let's see how GraphSAGE gathers and aggregates the features of neighboring nodes.

    class MeanAggregator(nn.Module):
        """
        Aggregates a node's embeddings using mean of neighbors' embeddings
        """
        def __init__(self, features, cuda=False, gcn=False): 
            """
            Initializes the aggregator for a specific graph.
    
            features -- function mapping LongTensor of node ids to FloatTensor of feature values.
            cuda -- whether to use GPU
            gcn --- whether to perform concatenation GraphSAGE-style, or add self-loops GCN-style
            """
    
            super(MeanAggregator, self).__init__()
    
            self.features = features
            self.cuda = cuda
            #是否使用GCN模式的均值聚合方式,详细请看GraphSAGE的聚合方式。
            self.gcn = gcn
            
        def forward(self, nodes, to_neighs, num_sample=10):
            """
            nodes --- 一个批次的节点编号
            to_neighs --- 每个节点对应的邻居节点编号集合
            num_sample --- 每个节点对邻居的采样数量
            """
            # Local pointers to functions (speed hack)
            _set = set
            if not num_sample is None:
                _sample = random.sample
                #如果邻居节点数目大于num_sample,就随机选取num_sample个节点,否则选取仅有的邻居节点编号即可。
                samp_neighs = [_set(_sample(to_neigh, 
                                num_sample,
                                )) if len(to_neigh) >= num_sample else to_neigh for to_neigh in to_neighs]
            else:
                samp_neighs = to_neighs
    
            if self.gcn:
                # GCN-style: also include the node itself when aggregating neighbor information
                samp_neighs = [samp_neigh | set([nodes[i]]) for i, samp_neigh in enumerate(samp_neighs)]
            #把一个批次内的所有节点的邻居节点编号聚集在一块并去重
            unique_nodes_list = list(set.union(*samp_neighs))
            #为所有的邻居节点建立一个索引映射
            unique_nodes = {n:i for i,n in enumerate(unique_nodes_list)}
            #创建mask的目的,其实就是为了创建一个邻接矩阵
            #该临界矩阵的维度为[一个批次的节点数,一个批次内所有邻居节点的总数目]
            mask = Variable(torch.zeros(len(samp_neighs), len(unique_nodes)))
            #所有邻居节点的列索引
            column_indices = [unique_nodes[n] for samp_neigh in samp_neighs for n in samp_neigh]   
            #所有邻居节点的行索引
            row_indices = [i for i in range(len(samp_neighs)) for j in range(len(samp_neighs[i]))]
            #将对应行列索引的节点值赋为1,就构成了邻接矩阵
            mask[row_indices, column_indices] = 1
            if self.cuda:
                mask = mask.cuda()
            #每个节点对应的邻居节点的数据
            num_neigh = mask.sum(1, keepdim=True)
            #除以对应的邻居节点个数,求均值
            mask = mask.div(num_neigh)
            # if self.cuda:
            #     embed_matrix = self.features(torch.LongTensor(unique_nodes_list).cuda())
            # else:
            #得到unique_nodes_list列表中各个邻居节点的特征
            #embed_matrix 的维度[一个批次内所有邻居节点的总数目,1433]
            embed_matrix = self.features(torch.LongTensor(unique_nodes_list))
            #用邻接矩阵乘上所有邻居节点的特征矩阵,就得到了聚合邻居节点后的各个节点的特征矩阵
            to_feats = mask.mm(embed_matrix)
            return to_feats
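
    A toy usage sketch of the aggregator (the feature values and neighbor sets are made up; torch and nn are assumed to be imported as in model.py):

    features = nn.Embedding(4, 2)   # 4 nodes with 2-dimensional features
    features.weight = nn.Parameter(torch.FloatTensor([[1., 0.], [0., 1.], [2., 2.], [4., 0.]]),
                                   requires_grad=False)
    agg = MeanAggregator(features, cuda=False, gcn=False)
    # aggregate for nodes 0 and 1, given their neighbor sets
    to_feats = agg.forward(nodes=[0, 1], to_neighs=[{1, 2}, {0, 3}], num_sample=10)
    # row 0 = mean of the features of {1, 2}; row 1 = mean of the features of {0, 3}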
    
    

    4.3 encoders.py

    Once the aggregated neighbor feature vectors have been obtained, the convolution-style transformation is performed as follows:

    class Encoder(nn.Module):
        """
        Encodes a node using the 'convolutional' GraphSAGE approach
        """
        def __init__(self, features, feature_dim, 
                embed_dim, adj_lists, aggregator,
                num_sample=10,
                base_model=None, gcn=False, cuda=False, 
                feature_transform=False): 
            super(Encoder, self).__init__()
    		#特征矩阵信息
            self.features = features
            #特征矩阵的向量维度 
            self.feat_dim = feature_dim
            #每个节点对应的邻居节点的编码集合,例如[1:{2,3,4},2:{1,5,6}]
            self.adj_lists = adj_lists
            self.aggregator = aggregator
            self.num_sample = num_sample
            if base_model != None:
                self.base_model = base_model
    
            self.gcn = gcn
            #输出向量的维度
            self.embed_dim = embed_dim
            self.cuda = cuda
            self.aggregator.cuda = cuda
            #执行卷积的参数矩阵,如果不使用GCN模式,需要执行一个concat拼接操作,所以向量维度为2倍的feat_dim
            self.weight = nn.Parameter(
                    torch.FloatTensor(embed_dim, self.feat_dim if self.gcn else 2 * self.feat_dim))
            init.xavier_uniform(self.weight)
    
        def forward(self, nodes):
            """
            Generates embeddings for a batch of nodes.
    
            nodes     -- list of nodes
            """
            #获得聚合了邻居节点后的节点特征信息
            neigh_feats = self.aggregator.forward(nodes, [self.adj_lists[int(node)] for node in nodes], 
                    self.num_sample)
            if not self.gcn:
                if self.cuda:
                    self_feats = self.features(torch.LongTensor(nodes).cuda())
                else:
                	#获得这一个批次的节点本身的特征信息
                    self_feats = self.features(torch.LongTensor(nodes))
                #将节点本身的特征信息和邻居节点的特征信息拼接一起
                combined = torch.cat([self_feats, neigh_feats], dim=1)
            else:
            	#使用GCN的均值聚合方式,直接使用聚合了本身信息的邻居节点信息即可
                combined = neigh_feats
            #线性转换后再经过一个relu激活函数,得到最终的聚合结果
            combined = F.relu(self.weight.mm(combined.t()))
            return combined
    

    5. Summary

    That completes a GraphSAGE implementation with the mean aggregator (MeanAggregator). I have tried to comment as many lines as possible; if there are mistakes, please point them out.
    Besides mean aggregation there are also the LSTM and pooling aggregators, as well as the unsupervised way of training GraphSAGE; if I get the chance I will study those and share them in a later post.

  • GraphSAGE Code Notes - layers.py (companion posts cover unsupervised_train.py, aggregators.py and models.py)

     

     1 # global unique layer ID dictionary for layer name assignment
     2 _LAYER_UIDS = {}
     3 
     4 def get_layer_uid(layer_name=''):
     5     """Helper function, assigns unique layer IDs."""
     6     if layer_name not in _LAYER_UIDS:
     7         _LAYER_UIDS[layer_name] = 1
     8         return 1
     9     else:
    10         _LAYER_UIDS[layer_name] += 1
    11         return _LAYER_UIDS[layer_name]

     

     

    Here _LAYER_UIDS = {} is a dictionary recording each layer name and how many times it has appeared.

    In get_layer_uid(), if layer_name has never appeared before, _LAYER_UIDS[layer_name] is set to 1; otherwise it is incremented.

    Purpose: in class Layer, when no variable-scope name is given, the number of times Layer has been instantiated is used to produce a distinct layer id.

    Example: simplifying class Layer a little makes this clear:

     1 class Layer():
     2     def __init__(self):
     3         layer = self.__class__.__name__
     4         name = layer + '_' + str(get_layer_uid(layer))
     5         print(name) 
     6 
     7 layer1 = Layer()
     8 layer2 = Layer()
     9 
    10 # Output:
    11 # Layer_1
    12 # Layer_2

     2. class Layer

    class Layer mainly defines the basic API shared by all layer objects.

     1 class Layer(object):
     2     """Base layer class. Defines basic API for all layer objects.
     3     Implementation inspired by keras (http://keras.io).
     4     # Properties
     5         name: String, defines the variable scope of the layer.
     6         logging: Boolean, switches Tensorflow histogram logging on/off
     7 
     8     # Methods
     9         _call(inputs): Defines computation graph of layer
    10             (i.e. takes input, returns output)
    11         __call__(inputs): Wrapper for _call()
    12         _log_vars(): Log all variables
    13     """
    14 
    15     def __init__(self, **kwargs):
    16         allowed_kwargs = {'name', 'logging', 'model_size'}
    17         for kwarg in kwargs.keys():
    18             assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
    19         name = kwargs.get('name')
    20         if not name:
    21             layer = self.__class__.__name__.lower() # "layer"
    22             name = layer + '_' + str(get_layer_uid(layer))
    23         self.name = name
    24         self.vars = {}
    25         logging = kwargs.get('logging', False)
    26         self.logging = logging
    27         self.sparse_inputs = False
    28 
    29     def _call(self, inputs):
    30         return inputs
    31 
    32     def __call__(self, inputs):
    33         with tf.name_scope(self.name):
    34             if self.logging and not self.sparse_inputs:
    35                 tf.summary.histogram(self.name + '/inputs', inputs)
    36             outputs = self._call(inputs)
    37             if self.logging:
    38                 tf.summary.histogram(self.name + '/outputs', outputs)
    39             return outputs
    40 
    41     def _log_vars(self):
    42         for var in self.vars:
    43             tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])

    Methods:

    __init__(): reads the name, logging and model_size keyword arguments and initializes the instance variables name, vars{}, logging and sparse_inputs.

    _call(inputs): defines the layer's computation graph: takes the input and returns the output.

    __call__(inputs): essentially a wrapper around _call(); on top of the basic behaviour it uses tf.summary.histogram() to record histograms of the input and output distributions.

    _log_vars(): logs all variables, writing each entry of vars as a histogram.

    3. class Dense

    The Dense layer implements the basic functionality of a fully connected layer, i.e. it ultimately computes Relu(Wx + b).

    __init__(): reads and initializes the member variables. The exact role of num_features_nonzero and featureless is not clear to me yet.

    _call(): computes and returns Relu(Wx + b).

     1 class Dense(Layer):
     2     """Dense layer."""
     3 
     4     def __init__(self, input_dim, output_dim, dropout=0.,
     5                  act=tf.nn.relu, placeholders=None, bias=True, featureless=False,
     6                  sparse_inputs=False, **kwargs):
     7         super(Dense, self).__init__(**kwargs)
     8 
     9         self.dropout = dropout
    10 
    11         self.act = act
    12         self.featureless = featureless
    13         self.bias = bias
    14         self.input_dim = input_dim
    15         self.output_dim = output_dim
    16 
    17         # helper variable for sparse dropout
    18         self.sparse_inputs = sparse_inputs
    19         if sparse_inputs:
    20             self.num_features_nonzero = placeholders['num_features_nonzero']
    21 
    22         with tf.variable_scope(self.name + '_vars'):
    23             self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim),
    24         dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(),                                             
    25         regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay))
    26             if self.bias:
    27                 self.vars['bias'] = zeros([output_dim], name='bias')
    28 
    29         if self.logging:
    30             self._log_vars()
    31 
    32     def _call(self, inputs):
    33         x = inputs
    34         x = tf.nn.dropout(x, 1 - self.dropout)
    35 
    36         # transform
    37         output = tf.matmul(x, self.vars['weights'])
    38 
    39         # bias
    40         if self.bias:
    41             output += self.vars['bias']
    42 
    43         return self.act(output)

     

     

    Reposted from: https://www.cnblogs.com/shiyublog/p/9894617.html

  • GraphSAGE Code Notes - models.py (original post; companion posts cover unsupervised_train.py, layers.py and aggregators.py)

    Original article; please credit the source when reposting. The other parts of the series are linked below:

    GraphSAGE Code Notes (Part 1) - unsupervised_train.py

    GraphSAGE Code Notes (Part 2) - layers.py

    GraphSAGE Code Notes (Part 3) - aggregators.py

    1. Classes and their inheritance hierarchy

         Model 
         /   \
        /     \
      MLP   GeneralizedModel
              /  \
             /    \
    Node2VecModel  SampleAndAggregate

    First, look at how the three classes Model, GeneralizedModel and SampleAndAggregate relate to each other.

    The difference between Model and GeneralizedModel is that Model's build() assembles a sequential stack of layers, while that part is removed in GeneralizedModel; self.outputs must therefore be assigned in the build() of GeneralizedModel's subclasses.

    The build() function of class Model(object) looks like this:

     1 def build(self):
     2     """ Wrapper for _build() """
     3     with tf.variable_scope(self.name):
     4         self._build()
     5 
     6     # Build sequential layer model
     7     self.activations.append(self.inputs)
     8     for layer in self.layers:
     9         hidden = layer(self.activations[-1])
    10         self.activations.append(hidden)
    11     self.outputs = self.activations[-1]
    12     # 这部分sequential layer model模型在GeneralizedModel的build()中被删去
    13 
    14     # Store model variables for easy access
    15     variables = tf.get_collection(
    16         tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
    17     self.vars = {var.name: var for var in variables}
    18 
    19     # Build metrics
    20     self._loss()
    21     self._accuracy()
    22 
    23     self.opt_op = self.optimizer.minimize(self.loss)

    The sequential-layer logic is: the input is passed to layer() to produce an output, that output is fed as the input of the next layer(), and the result of the last layer is taken as the final output.

    2. class SampleAndAggregate(GeneralizedModel)

    1. __init__():

    (1) Where self.features comes from:

    para: features    tf.get_variable()-> identity features
         |                   |
    self.features     self.embeds   --> At least one is not None
          \                 /       --> Concat if both are not None 
           \               /
            \             /
             self.features

    (2) self.dims:

    self.dims is a list whose entries record the dimensionality of each network layer.

    self.dims[0] equals the number of columns of self.features, i.e. (0 if features is None else features.shape[1]) + identity_dim (note: features here is the constructor argument, not self.features).

    The following entries are the output_dim of each layer, i.e. the number of hidden units.

    (3) The __init__() code

     1 def __init__(self, placeholders, features, adj, degrees,
     2          layer_infos, concat=True, aggregator_type="mean",
     3          model_size="small", identity_dim=0,
     4          **kwargs):
     5 '''
     6 Args:
     7     - placeholders: Stanford TensorFlow placeholder object.
     8     - features: Numpy array with node features. 
     9                 NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
    10     - adj: Numpy array with adjacency lists (padded with random re-samples)
    11     - degrees: Numpy array with node degrees. 
    12     - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all 
    13            the recursive layers. See SAGEInfo definition above.
    14     - concat: whether to concatenate during recursive iterations
    15     - aggregator_type: how to aggregate neighbor information
    16     - model_size: one of "small" and "big"
    17     - identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
    18 '''
    19 super(SampleAndAggregate, self).__init__(**kwargs)
    20 if aggregator_type == "mean":
    21     self.aggregator_cls = MeanAggregator
    22 elif aggregator_type == "seq":
    23     self.aggregator_cls = SeqAggregator
    24 elif aggregator_type == "maxpool":
    25     self.aggregator_cls = MaxPoolingAggregator
    26 elif aggregator_type == "meanpool":
    27     self.aggregator_cls = MeanPoolingAggregator
    28 elif aggregator_type == "gcn":
    29     self.aggregator_cls = GCNAggregator
    30 else:
    31     raise Exception("Unknown aggregator: ", self.aggregator_cls)
    32 
    33 # get info from placeholders...
    34 self.inputs1 = placeholders["batch1"]
    35 self.inputs2 = placeholders["batch2"]
    36 self.model_size = model_size
    37 self.adj_info = adj
    38 if identity_dim > 0:
    39     self.embeds = tf.get_variable(
    40         "node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
    41     # self.embeds: record the neigh features embeddings
    42     # number of features = identity_dim
    43     # number of neighbors = adj.get_shape().as_list()[0]
    44 else:
    45     self.embeds = None
    46 if features is None:
    47     if identity_dim == 0:
    48         raise Exception(
    49             "Must have a positive value for identity feature dimension if no input features given.")
    50     self.features = self.embeds
    51 else:
    52     self.features = tf.Variable(tf.constant(
    53         features, dtype=tf.float32), trainable=False)
    54     if not self.embeds is None:
    55         self.features = tf.concat([self.embeds, self.features], axis=1)
    56 self.degrees = degrees
    57 self.concat = concat
    58 
    59 self.dims = [
    60     (0 if features is None else features.shape[1]) + identity_dim]
    61 self.dims.extend(
    62     [layer_infos[i].output_dim for i in range(len(layer_infos))])
    63 self.batch_size = placeholders["batch_size"]
    64 self.placeholders = placeholders
    65 self.layer_infos = layer_infos
    66 
    67 self.optimizer = tf.train.AdamOptimizer(
    68     learning_rate=FLAGS.learning_rate)
    69 
    70 self.build()

    (2) sample(inputs, layer_infos, batch_size=None)

    For a description of the sampling algorithm, see Appendix A, Algorithm 2 of the paper.

    Code:

     1 def sample(self, inputs, layer_infos, batch_size=None):
     2     """ Sample neighbors to be the supportive fields for multi-layer convolutions.
     3 
     4     Args:
     5         inputs: batch inputs
     6         batch_size: the number of inputs (different for batch inputs and negative samples).
     7     """
     8 
     9     if batch_size is None:
    10         batch_size = self.batch_size
    11     samples = [inputs]
    12     # size of convolution support at each layer per node
    13     support_size = 1
    14     support_sizes = [support_size]
    15 
    16     for k in range(len(layer_infos)):
    17         t = len(layer_infos) - k - 1
    18         support_size *= layer_infos[t].num_samples
    19         sampler = layer_infos[t].neigh_sampler
    20         
    21         node = sampler((samples[k], layer_infos[t].num_samples))
    22         samples.append(tf.reshape(node, [support_size * batch_size, ]))
    23         support_sizes.append(support_size)
    24 
    25     return samples, support_sizes

    sampler = layer_infos[t].neigh_sampler

    When the function is called, layer_infos has already been filled in; in unsupervised_train.py, neigh_sampler is set to UniformNeighborSampler, which is defined in neigh_samplers.py as class UniformNeighborSampler(Layer).

    The goal is, for the input samples[k] (the nodes obtained in the previous sampling step; in the original post's figure these are the samples[0], samples[1], samples[2] regions, where samples[k] is obtained by sampling the neighbors of the nodes in samples[k-1]), to select the ids of num_samples neighbors for each node (the N(u) of the paper). (The return value is adj_lists, i.e. the adjacency-list matrix truncated to num_samples columns.)

    Note the difference between support_size and num_samples:

    num_samples is the number of neighbor nodes sampled for each node u at the current depth;

    support_size is the number of nodes whose information influences the embedding of the current node u: u is influenced by its num_samples direct neighbors at the current layer, each of those neighbors is in turn influenced by num_samples neighbors at the previous depth, and so on. support_size is therefore the running product of num_samples over all depths so far, and for batch_size input nodes the total number of support nodes is support_size * batch_size.

    Finally, support_size is appended to the support_sizes array.

    sample() eventually returns the samples array, holding the nodes sampled at each depth, and the support_sizes array, holding the number of supporting nodes at each depth.
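
    As a concrete (illustrative) example: with two layers whose SAGEInfo entries specify num_samples = 25 and 10 respectively, the loop runs t = 1 and then t = 0, so support_size grows as 1 -> 10 -> 250 and the function returns support_sizes = [1, 10, 250]; samples[1] then holds 10 * batch_size node ids and samples[2] holds 250 * batch_size node ids.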

    2. def _build(self):

    1 self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
    2     true_classes=labels,
    3     num_true=1,
    4     num_sampled=FLAGS.neg_sample_size,
    5     unique=False,
    6     range_max=len(self.degrees),
    7     distortion=0.75,
    8     unigrams=self.degrees.tolist()))

    (1) tf.nn.fixed_unigram_candidate_sampler:

    Samples according to a user-provided probability distribution.
    If the classes follow a uniform distribution, use uniform_candidate_sampler;
    if the classes are words, which are known to follow a Zipfian distribution, use log_uniform_candidate_sampler;
    if the class distribution is known from statistics or some other source, use nn.fixed_unigram_candidate_sampler;
    if the class distribution is really unknown, tf.nn.learned_unigram_candidate_sampler can be used.
    
    (2) Parameters:
    a. num_sampled, unique:
    
    The elements of sampled_candidates are drawn without replacement (if unique = True)
    or with replacement (if unique = False) from the base distribution.
    
    unique = True can be read as sampling without replacement; unique = False as sampling with replacement.
    
    b. distortion:
    
    distortion uses the word2vec frequency "energy table" formulation
    f^(3/4) / total(f^(3/4));
    in word2vec the energy is counted by word frequency,
    in GraphSAGE it is counted by node degree,
    so each entry of unigrams = [] records one node's degree.
    
    c. unigrams:
    
    The degree of each node.
    
    (3) Returns:
    a. sampled_candidates: A tensor of type int64 and shape [num_sampled]. The sampled classes.
    b. true_expected_count: A tensor of type float. Same shape as true_classes. The expected counts under the sampling distribution of each of true_classes.
    c. sampled_expected_count: A tensor of type float. Same shape as sampled_candidates. The expected counts under the sampling distribution of each of sampled_candidates.

     

    ------- additional ---------------

    1. self.__class__.__name__.lower()

    1 if not name:
    2             name = self.__class__.__name__.lower()

    self.__class__.__name__.lower(): https://stackoverflow.com/questions/36367736/use-name-as-attribute

    1 class MyClass:
    2     def __str__(self):
    3         return str(self.__class__)
    >>> instance = MyClass()
    >>> print(instance)
    __main__.MyClass

    That is because the string version of the class includes the module that it is defined in. In this case, it is defined in the module that is currently being executed, the shell, so it shows up as __main__.MyClass. If we use self.__class__.__name__, however:

    1 class MyClass:
    2     def __str__(self):
    3         return self.__class__.__name__
    4 
    5 instance = MyClass()
    6 print(instance)

    it outputs:

    MyClass

    The __name__ attribute of the class does not include the module.

    Note: The __name__ attribute gives the name originally given to the class. Any copies will keep the name. For example:

    1 class MyClass:
    2     def __str__(self):
    3         return self.__class__.__name__
    4 
    5 SecondClass = MyClass
    6 
    7 instance = SecondClass()
    8 print(instance)

    output:

    MyClass

    That is because the __name__ attribute is defined as part of the class definition. Using SecondClass = MyClass is just assigning another name to the class. It does not modify the class or its name in any way.

    2. allowed_kwargs = {'name', 'logging', 'model_size'}

    其中name,logging,model_size指什么?

    name: String, defines the variable scope of the layer.
    logging: Boolean, switches Tensorflow histogram logging on/off
    model_size: small / big; see aggregators.py: small: hidden_dim = 512; big: hidden_dim = 1024

    3. Python's *args and **kwargs parameters

    https://blog.csdn.net/anhuidelinger/article/details/10011013

    def foo(*args, **kwargs):
        print 'args = ', args
        print 'kwargs = ', kwargs
        print '---------------------------------------'

    if __name__ == '__main__':
        foo(1,2,3,4)
        foo(a=1,b=2,c=3)
        foo(1,2,3,4, a=1,b=2,c=3)
        foo('a', 1, None, a=1, b='2', c=3)

    # Output:
    # args = (1, 2, 3, 4)        kwargs = {}
    # args = ()                  kwargs = {'a': 1, 'c': 3, 'b': 2}
    # args = (1, 2, 3, 4)        kwargs = {'a': 1, 'c': 3, 'b': 2}
    # args = ('a', 1, None)      kwargs = {'a': 1, 'c': 3, 'b': '2'}

    1. As you can see, these are Python's variadic parameters.

    *args collects any number of extra positional arguments into a tuple;

    **kwargs collects keyword arguments into a dict.

    When both are used, the *args parameter must come before **kwargs.

    A call like foo(a=1, b='2', c=3, 'a', 1, None) raises the syntax error "SyntaxError: non-keyword arg after keyword arg".

    2. When to use **kwargs:

    Using **kwargs and default values is easy. Sometimes, however, you shouldn't be using **kwargs in the first place.

    In this case, we're not really making best use of **kwargs.

    1 class ExampleClass( object ):
    2     def __init__(self, **kwargs):
    3         self.val = kwargs.get('val',"default1")
    4         self.val2 = kwargs.get('val2',"default2")

    The above is a "why bother?" declaration. It is the same as

    1 class ExampleClass( object ):
    2     def __init__(self, val="default1", val2="default2"):
    3         self.val = val
    4         self.val2 = val2

    When you're using **kwargs, you mean that a keyword is not just optional, but conditional. There are more complex rules than simple default values.

    When you're using **kwargs, you usually mean something more like the following, where simple defaults don't apply.

    class ExampleClass( object ):
        def __init__(self, **kwargs):
            self.val = "default1"
            self.val2 = "default2"
            if "val" in kwargs:
                self.val = kwargs["val"]
                self.val2 = 2*self.val
            elif "val2" in kwargs:
                self.val2 = kwargs["val2"]
                self.val = self.val2 / 2
            else:
                raise TypeError( "must provide val= or val2= parameter values" )

    3. logging = kwargs.get('logging', False) : default value: false

    https://stackoverflow.com/questions/1098549/proper-way-to-use-kwargs-in-python

    You can pass a default value to get() for keys that are not in the dictionary:

    1 self.val2 = kwargs.get('val2',"default value")

    However, if you plan on using a particular argument with a particular default value, why not use named arguments in the first place?

    1 def __init__(self, val2="default value", **kwargs):

     4. tf.variable_scope() 

    https://blog.csdn.net/IB_H20/article/details/72936574

    5. masked_softmax_cross_entropy? See metrics.py

    # Cross entropy error
    if self.categorical:
        self.loss += metrics.masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
                self.placeholders['labels_mask'])

    def masked_logit_cross_entropy(preds, labels, mask):
        """Logit cross-entropy loss with masking."""
        loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels)
        loss = tf.reduce_sum(loss, axis=1)
        mask = tf.cast(mask, dtype=tf.float32)
        mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
        loss *= mask
        return tf.reduce_mean(loss)


    Reposted from: https://www.cnblogs.com/shiyublog/p/9879875.html
