
Introduction

To address the problem of erroneous pseudo-labels in existing work, the paper improves the inter-sample similarity measurement and the confidence-evaluation strategy for pseudo-labels, thereby improving model performance. Specifically, the paper proposes the concept of variance confidence and designs a variance subsampling algorithm that combines variance confidence with distance confidence as the sampling criterion; it also proposes a variance decay strategy to better exploit the selected pseudo-label samples. With these changes, the method improves mAP and Rank-1 on the MARS dataset by 3.94% and 4.55%, respectively.

Citation

    @article{DBLP:journals/jvca/ZhaoYYHYZ20,
    author = {Jing Zhao and
    Wenjing Yang and
    Mingliang Yang and
    Wanrong Huang and
    Qiong Yang and
    Hongguang Zhang},
    title = {One-shot video-based person re-identification with variance subsampling
    algorithm},
    journal = {Comput. Animat. Virtual Worlds},
    volume = {31},
    number = {4-5},
    year = {2020}
    }

Related links

Original article: https://onlinelibrary.wiley.com/doi/10.1002/cav.1964

What problem does it solve?

    Previous works propose the distance-based sampling for unlabeled datapoints to address the few-shot person re-identification task, however, many selected samples may be assigned with wrong labels due to poor feature quality in these works, which negatively affects the learning procedure.

Main contributions and innovations


    • We propose the variance confidence to measure the credibility of pseudo-labels, which can be widely used as a general similarity measurement.
• We propose a novel VSA (variance subsampling algorithm) to improve the accuracy of pseudo-labels for selected samples. It combines distance confidence and variance confidence as the sampling criterion, and adopts a variance decay strategy as the sampling strategy.

There are three main innovations:

• First, the concept of variance confidence is proposed.
• Second, the VSA (variance subsampling algorithm) is proposed.
• Third, a variance decay strategy is proposed.

Overall framework

The dataset extension process. Both labeled and unlabeled samples are extracted into the feature space through the backbone network in step 1. As shown in feature space (a), the gray points indicate unlabeled samples, and the colored hollow points indicate labeled samples. Different colors indicate different person identity. Then label estimation is performed according to the criterion that the unlabeled sample has the same label as the nearest labeled sample in step 2. We call the sample after label estimation a pseudo-label sample, which is the colored solid point in the feature space (b). Finally, the pseudo-label samples with higher confidence are preferred, which are closer to the labeled samples in feature space (c)

The overall framework alternates between supervised training and dataset extension in a cross-iterative manner. The dataset extension process, shown in the figure above, consists of three steps: feature extraction, label estimation, and pseudo-label sample selection.

Proposed method

01 Variance confidence

A distribution situation in the feature space. U1 and U2 represent two unlabeled samples, and L1 and L2 are the two labeled samples with the closest distance to both U1 and U2 in the feature space. d_i, i ∈ [1,4], represent the Euclidean distances between samples and satisfy Equation (7). The solid line represents the distance between the unlabeled sample and its nearest labeled sample. While U1 is similar to L1, it is also similar to L2 with the same extent. On the contrary, although U2 is slightly similar to L1, it is very different from L2. Therefore, it is more believable that U2 is more likely than U1 to fall into the same category as L1

The authors illustrate one possible distribution in the feature space. U1 and U2 are unlabeled samples, and L1 and L2 are the labeled samples closest to them. d_i denotes the Euclidean distance between samples, with $d_1 < d_2 < d_3 < d_4$. If label reliability were measured by distance alone, U1 would be preferred over U2 (since $d_1 < d_3$). The authors argue, however, that when a sample (U1) is almost equally similar to two different labeled samples (L1 and L2, i.e., $d_1$ and $d_2$ are nearly equal), it cannot be confidently said to resemble either one.

The variance confidence of an unlabeled sample is defined as the variance of its distances to the two nearest labeled samples: the larger the variance, the higher the confidence.
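As a rough illustration of these two confidence measures (a minimal NumPy sketch rather than the authors' code; the function name and the restriction to the two nearest labeled neighbors are assumptions based on the description above):

```python
import numpy as np

def distance_and_variance_confidence(unlabeled_feats, labeled_feats):
    """Return (distance confidence, variance confidence) per unlabeled sample.

    Distance confidence: negative distance to the nearest labeled sample
    (smaller distance = more confident). Variance confidence: variance of
    the distances to the two nearest labeled samples (larger = more confident).
    """
    # Pairwise Euclidean distances, shape (num_unlabeled, num_labeled).
    dists = np.linalg.norm(
        unlabeled_feats[:, None, :] - labeled_feats[None, :, :], axis=-1)
    two_nearest = np.sort(dists, axis=1)[:, :2]   # d1 <= d2 for each sample
    distance_conf = -two_nearest[:, 0]
    variance_conf = np.var(two_nearest, axis=1)   # large when d1 and d2 differ
    return distance_conf, variance_conf
```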

02 Variance Subsampling Algorithm


The sampling criterion of the variance subsampling algorithm. Hollow points in the feature space represent labeled samples, and solid points represent pseudo-label samples. The first sampling is based on the distance confidence. The number of samples is extended to e, corresponding to the range of the red squares and circles in the figure. The second sampling is based on the variance confidence, and the number of samples is restored to ns, which corresponds to the range of the yellow box and the circle in the figure.

The authors combine distance confidence and variance confidence into the sampling criterion through this two-round (sub)sampling procedure.
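A minimal sketch of the two-round selection, reusing e and ns from the caption above (everything else here is an assumption, not the authors' code):

```python
import numpy as np

def variance_subsampling(distance_conf, variance_conf, ns, e):
    """Two-round sampling: first keep the e candidates with the highest
    distance confidence, then keep the ns of those with the highest
    variance confidence. Returns indices into the unlabeled set."""
    assert e >= ns
    first_round = np.argsort(-distance_conf)[:e]          # round 1: distance
    order = np.argsort(-variance_conf[first_round])[:ns]  # round 2: variance
    return first_round[order]
```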

03 Variance decay strategy

The partial distribution of the real feature space. Colors are used to distinguish different person identities, and shapes are used to distinguish the cameras. Black dots in the center of a sample indicate that it is an original labeled sample. The distribution in the first iteration is relatively uniform, while the distribution after the seventh iteration already shows a clustered distribution.

During the experiments, the authors visualized the actual distribution of the feature space and found that by the middle and later stages of training, the extracted feature space already exhibits a clustered distribution.

    Obviously, in the case of the feature distribution of model 7 in Figure 5, the situation described in Figure 3 will hardly occur. This shows that the situation described in Figure 3 is gradually reduced during the iteration process. Therefore, a variance decay strategy is proposed as the sampling strategy. A stopping factor 𝜂 is taken to control the number of steps in which the variance confidence is disabled. In addition, 𝜖 is also set to be variable and calculated by Equation (13).

Once such a clustered distribution has formed, variance confidence is no longer applicable, so the authors propose the variance decay strategy.

The overall algorithm, given as pseudo-code in the paper, combines the two-round sampling criterion and the variance decay strategy within the iterative training loop.

Experiments

01 Performance


Our method is evaluated on the MARS and DukeMTMC-VideoReID datasets and compared with recent related works, including the baseline work EUG. Table 1 reports the final results of different methods. One-shot refers to the results obtained by supervised learning on the labeled dataset L only. DGM and SMP do not train their models in a cross-iterative manner. The results demonstrate that our method performs better than EUG on both datasets. Specifically, our mAP, Rank-1, Rank-5, and Rank-20 on the MARS dataset are 38.62%, 62.17%, 74.34%, and 83.43%, which surpass the baseline by 3.94%, 4.55%, 4.70%, and 5.35%, respectively. Though we outperform the baseline on all four accuracy metrics, the benefit of our method on the DukeMTMC-VideoReID dataset is lower than the improvement on MARS, as the number of unlabeled samples in DukeMTMC-VideoReID is merely 1/5 of that in MARS, which reduces the impact of the sampling strategies.

02 Ablation: sampling criteria

    We compared the number of error labels contained in the selected pseudo-label samples under different sampling strategies. Specifically, there are 1,494 unlabeled samples in the DukeMTMC-VideoReID dataset, of which 663 samples have correct labels after label estimation. As shown in Figure 7, the colored bar is where the wrong label is located. Among the 300 samples, 29 samples have error labels according to distance confidence only and 19 samples have error labels according to the variance confidence only, respectively. When we combine distance confidence and variance confidence as a two-round sampling criterion, the number of erroneous labels drops to 16, which means that the accuracy of selected sample labels is improved to 95% (90% with the distance confidence only). The result effectively illustrates that the VSA does effectively reduce the number of wrong labels in the selected samples.

Combining distance confidence with variance confidence effectively improves the label accuracy of the selected pseudo-label samples.


Averaged-DQN: Variance Reduction and Stabilization for Deep Reinforcement Learning

    Abstract

    Instability and variability of Deep Reinforcement Learning (DRL) algorithms tend to adversely affect their performance. Averaged-DQN is a sim-ple extension to the DQN algorithm, based on averaging previously learned Q-values estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. To understand the effect of the algorithm, we examine the source of value function estimation errors and provide an analytical comparison within a simplified model. We further present experiments on the Arcade Learning Environment benchmark that demonstrate significantly improved stability and performance due to the proposed extension.

    Introduction

In Reinforcement Learning (RL) an agent seeks an optimal policy for a sequential decision making problem (Sutton & Barto, 1998). It does so by learning which action is optimal for each environment state. Over the course of time, many algorithms have been introduced for solving RL problems including Q-learning (Watkins & Dayan, 1992), SARSA (Rummery & Niranjan, 1994; Sutton & Barto, 1998), and policy gradient methods (Sutton et al., 1999). These methods are often analyzed in the setup of linear function approximation, where convergence is guaranteed under mild assumptions (Tsitsiklis, 1994; Jaakkola et al., 1994; Tsitsiklis & Van Roy, 1997; Even-Dar & Mansour, 2003). In practice, real-world problems usually involve high-dimensional inputs forcing linear function approximation methods to rely upon hand engineered features for problem-specific state representation. These problem-specific features diminish the agent flexibility, and so the need of an expressive and flexible non-linear function approximation emerges. Except for few successful attempts (e.g., TD-gammon, Tesauro (1995)), the combination of non-linear function approximation and RL was considered unstable and was shown to diverge even in simple domains (Boyan & Moore, 1995).

The recent Deep Q-Network (DQN) algorithm (Mnih et al., 2013) was the first to successfully combine a powerful non-linear function approximation technique known as Deep Neural Network (DNN) (LeCun et al., 1998; Krizhevsky et al., 2012) together with the Q-learning algorithm. DQN presented a remarkably flexible and stable algorithm, showing success in the majority of games within the Arcade Learning Environment (ALE) (Bellemare et al., 2013). DQN increased the training stability by breaking the RL problem into sequential supervised learning tasks. To do so, DQN introduces the concept of a target network and uses an Experience Replay buffer (ER) (Lin, 1993).

Following the DQN work, additional modifications and extensions to the basic algorithm further increased training stability. Schaul et al. (2015) suggested a sophisticated ER sampling strategy. Several works extended standard RL exploration techniques to deal with high-dimensional input (Bellemare et al., 2016; Tang et al., 2016; Osband et al., 2016). Mnih et al. (2016) showed that sampling from ER could be replaced with asynchronous updates from parallel environments (which enables the use of on-policy methods). Wang et al. (2015) suggested a network architecture based on the advantage function decomposition (Baird III, 1993).
In this work we address issues that arise from the combination of Q-learning and function approximation. Thrun & Schwartz (1993) were first to investigate one of these issues which they have termed as the overestimation phenomena. The max operator in Q-learning can lead to overestimation of state-action values in the presence of noise. Van Hasselt et al. (2015) suggest the Double-DQN that uses the Double Q-learning estimator (Van Hasselt, 2010) method as a solution to the problem. Additionally, Van Hasselt et al. (2015) showed that Q-learning overestimation does occur in practice (at least in the ALE).

This work suggests a different solution to the overestimation phenomena, named Averaged-DQN (Section 3), based on averaging previously learned Q-values estimates. The averaging reduces the target approximation error variance (Sections 4 and 5) which leads to stability and improved results. Additionally, we provide experimental results on selected games of the Arcade Learning Environment.

We summarize the main contributions of this paper as follows:

• A novel extension to the DQN algorithm which stabilizes training, and improves the attained performance, by averaging over previously learned Q-values.

    • Variance analysis that explains some of the DQN problems, and how the proposed extension addresses them.

    • Experiments with several ALE games demonstrating the favorable effect of the proposed scheme.

2. Background

    In this section we elaborate on relevant RL background, and specifically on the Q-learning algorithm.

    2.1. Reinforcement Learning


    Value-based methods for solving RL problems encode poli-cies through the use of value functions, which denote the expected discounted cumulative reward from a given state s, following a policy π. Specifically we are interested in state-action value functions:

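For reference, the state-action value function under a policy π is conventionally defined as

$$Q^{\pi}(s,a) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t,a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],$$

with discount factor γ ∈ [0, 1).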

    2.2. Q-learning

One of the most popular RL algorithms is the Q-learning algorithm (Watkins & Dayan, 1992). This algorithm is based on a simple value iteration update (Bellman, 1957), directly estimating the optimal value function Q*. Tabular Q-learning assumes a table that contains the old action-value function estimates and performs updates using the following update rule:

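The update rule referred to above is the standard tabular Q-learning update (Watkins & Dayan, 1992):

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right],$$

where α is the learning rate.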

    2.3. Deep Q Networks (DQN)

We present in Algorithm 1 a slightly different formulation of the DQN algorithm (Mnih et al., 2013). In iteration i the DQN algorithm solves a supervised learning problem to approximate the action-value function Q(s, a; θ) (line 6). This is an extension of implementing (1) in its function approximation form (Riedmiller, 2005).
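The supervised problem solved in line 6 is conventionally written with a frozen target network θ⁻ and transitions sampled from the ER buffer:

$$L_i(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathrm{ER}}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\right)^{2}\right].$$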

Note that in the original implementation (Mnih et al., 2013; 2015), transitions are added to the ER buffer simultaneously with the minimization of the DQN loss (line 6). Using the hyperparameters employed by Mnih et al. (2013; 2015) (detailed for completeness in Appendix E), 1% of the experience transitions in the ER buffer are replaced between target network parameter updates, and 8% are sampled for minimization.

3. Averaged-DQN

The Averaged-DQN algorithm (Algorithm 2) is an extension of the DQN algorithm. Averaged-DQN uses the K previously learned Q-value estimates to produce the current action-value estimate (line 5). The Averaged-DQN algorithm stabilizes the training process (see Figure 1) by reducing the variance of the target approximation error, as we elaborate in Section 5. Compared to DQN, the computational effort is K-fold more forward passes through a Q-network while minimizing the DQN loss (line 7). The number of back-propagation updates remains the same as in DQN. Computational cost experiments are provided in Appendix D. The output of the algorithm is the average over the last K previously learned Q-networks.

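A minimal sketch of the averaged target computation (an illustrative NumPy version, not the paper's code; the callable interface of the Q-networks is an assumption):

```python
import numpy as np

def averaged_dqn_target(q_networks, rewards, next_states, dones, gamma):
    """Averaged-DQN style target: average the action-value estimates of the
    K previously learned Q-networks, then bootstrap on the greedy action.

    q_networks: list of K callables, each mapping a batch of states to an
                array of shape (batch, num_actions).
    """
    # Average over the K previously learned Q-value estimates.
    q_avg = np.mean([q(next_states) for q in q_networks], axis=0)
    # Zero the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones) * q_avg.max(axis=1)
```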

Figure 1. DQN and Averaged-DQN performance in the Atari game of BREAKOUT. The bold lines are averages over seven independent learning trials. Every 1M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. For both DQN and Averaged-DQN the hyperparameters used were taken from Mnih et al. (2015).
In Figures 1 and 2 we can see the performance of Averaged-DQN compared to DQN (and Double-DQN); further experimental results are given in Section 6. We note that recently-learned state-action value estimates are likely to be better than older ones, therefore we have also considered a recency-weighted average. In practice, a weighted average scheme did not improve performance and therefore is not presented here.

    4. Overestimation and Approximation Errors

Next, we discuss the various types of errors that arise due to the combination of Q-learning and function approximation in the DQN algorithm, and their effect on training stability. We refer to DQN’s performance in the BREAKOUT game in Figure 1. The source of the learning curve variance in DQN’s performance is an occasional sudden drop in the average score that is usually recovered in the next evaluation phase (for another illustration of the variance source see Appendix A). Another phenomenon can be observed in Figure 2, where DQN initially reaches a steady state (after 20 million frames), followed by a gradual deterioration in performance.

For the rest of this section, we list the above mentioned errors, and discuss our hypothesis as to the relations between each error and the instability phenomena depicted in Figures 1 and 2.

Figure 2. DQN, Double-DQN, and Averaged-DQN performance (left), and average value estimates (right) in the Atari game of ASTERIX. The bold lines are averages over seven independent learning trials. The shaded area presents one standard deviation. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The hyperparameters used were taken from Mnih et al. (2015).


    The optimality difference can be seen as the error of a standard tabular Q-learning, here we address the other errors. We next discuss each error in turn.

    4.1. Target Approximation Error (TAE)

Roughly, the TAE is the error introduced by the fitting step itself: the difference between the Q-values the network actually learns in iteration i and the target values it was regressed toward. It is treated as a noise term whose variance drives the instability analyzed in Section 5.

    We hypothesize that the variability in DQN’s performance in Figure 1, that was discussed at the start of this section, is related to deviating from a steady-state policy induced by the TAE.

    4.2. Overestimation Error


The overestimation error is different in its nature from the TAE since it presents a positive bias that can cause asymptotically sub-optimal policies, as was shown by Thrun & Schwartz (1993), and later by Van Hasselt et al. (2015) in the ALE environment. Note that a uniform bias in the action-value function will not cause a change in the induced policy. Unfortunately, the overestimation bias is uneven and is bigger in states where the Q-values are similar for the different actions, or in states which are the start of a long trajectory (as we discuss in Section 5 on accumulation of TAE variance).

    Following from the above mentioned overestimation upper bound, the magnitude of the bias is controlled by the variance of the TAE.

    The Double Q-learning and its DQN implementation (Double-DQN) (Van Hasselt et al., 2015; Van Hasselt, 2010) is one possible approach to tackle the overestimation problem, which replaces the positive bias with a negative one. Another possible remedy to the adverse effects of this error is to directly reduce the variance of the TAE, as in our proposed scheme (Section 5).

In Figure 2 we repeated the experiment presented in Van Hasselt et al. (2015) (along with the application of Averaged-DQN). This experiment is discussed in Van Hasselt et al. (2015) as an example of overestimation that leads to asymptotically sub-optimal policies. Since Averaged-DQN reduces the TAE variance, this experiment supports the hypothesis that the main cause for overestimation in DQN is the TAE variance.


    5. TAE Variance Reduction


    5.1. DQN Variance

We assume the statistical model mentioned at the start of this section. Consider a unidirectional Markov Decision Process (MDP) as in Figure 3, where the agent starts at state s_0, state s_{M−1} is a terminal state, and the reward in any state is equal to zero.

    Employing DQN on this MDP model, we get that for i > M:

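The omitted expression can be recovered by unrolling the updates along the chain: the TAE at each state reaches s_0 discounted by γ per step, so under the section's simplified statistical model (uncorrelated TAEs with a common variance σ²) the DQN estimate satisfies

$$\mathrm{Var}\left[Q^{i}_{\mathrm{DQN}}(s_0,a)\right] = \sum_{m=0}^{M-1} \gamma^{2m}\,\sigma^{2}.$$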
The above example gives intuition about the behavior of the TAE variance in DQN. The TAE is accumulated over the past DQN iterations on the updates trajectory. Accumulation of TAE errors results in bigger variance with its associated adverse effect, as was discussed in Section 4.


    5.2. Ensemble DQN Variance

We consider two approaches for TAE variance reduction. The first one is the Averaged-DQN and the second we term Ensemble-DQN. We start with Ensemble-DQN, which is a straightforward way to obtain a 1/K variance reduction, with a computational effort of K-fold learning problems, compared to DQN. Ensemble-DQN (Algorithm 3) solves K DQN losses in parallel, then averages over the resulting Q-value estimates.
    For Ensemble-DQN on the unidirectional MDP in Figure 3, we get for i > M:

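By the same argument, averaging K independently trained networks divides each TAE variance by K, giving

$$\mathrm{Var}\left[Q^{i}_{\mathrm{E}}(s_0,a)\right] = \frac{1}{K}\sum_{m=0}^{M-1} \gamma^{2m}\,\sigma^{2},$$

a K-fold reduction relative to DQN.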

    5.3. Averaged DQN Variance

We continue with Averaged-DQN, and calculate the variance in state s_0 for the unidirectional MDP in Figure 3. We get that for i > KM:


    meaning that Averaged-DQN is theoretically more efficient in TAE variance reduction than Ensemble-DQN, and at least K times better than DQN. The intuition here is that Averaged-DQN averages over TAEs averages, which are the value estimates of the next states.

    6. Experiments

    The experiments were designed to address the following questions:

    • How does the number K of averaged target networks affect the error in value estimates, and in particular the overestimation error.
• How does the averaging affect the learned policies' quality.

    To that end, we ran Averaged-DQN and DQN on the ALE benchmark. Additionally, we ran Averaged-DQN, Ensemble-DQN, and DQN on a Gridworld toy problem where the optimal value function can be computed exactly.


    6.1. Arcade Learning Environment (ALE)

To evaluate Averaged-DQN, we adopt the typical RL methodology where agent performance is measured at the end of training. We refer the reader to Liang et al. (2016) for further discussion about DQN evaluation methods on the ALE benchmark. The hyperparameters used were taken from Mnih et al. (2015), and are presented for completeness in Appendix E. DQN code was taken from McGill University RLLAB, and is available online (together with the Averaged-DQN implementation).

    We have evaluated the Averaged-DQN algorithm on three Atari games from the Arcade Learning Environment (Bellemare et al., 2013). The game of BREAKOUT was selected due to its popularity and the relative ease of the DQN to reach a steady state policy. In contrast, the game of SEAQUEST was selected due to its relative complexity, and the significant improvement in performance obtained by other DQN variants (e.g., Schaul et al. (2015); Wang et al. (2015)). Finally, the game of ASTERIX was presented in Van Hasselt et al. (2015) as an example to overestimation in DQN that leads to divergence.

As can be seen in Figure 4 and in Table 1 for all three games, increasing the number of averaged networks in Averaged-DQN results in lower average value estimates, better-performing policies, and less variability between the runs of independent learning trials. For the game of ASTERIX, we see, similarly to Van Hasselt et al. (2015), that the divergence of DQN can be prevented by averaging.

Overall, the results suggest that in practice Averaged-DQN reduces the TAE variance, which leads to smaller overestimation, stabilized learning curves and significantly improved performance.


    6.2. Gridworld

The Gridworld problem (Figure 5) is a common RL benchmark (e.g., Boyan & Moore (1995)). As opposed to the ALE, Gridworld has a smaller state space that allows the ER buffer to contain all possible state-action pairs. Additionally, it allows the optimal value function Q* to be accurately computed.

For the experiments, we have used Averaged-DQN and Ensemble-DQN with an ER buffer containing all possible state-action pairs. The network architecture used was a small fully connected neural network with one hidden layer of 80 neurons. For minimization of the DQN loss, the ADAM optimizer (Kingma & Ba, 2014) was used on 100 mini-batches of 32 samples per target network parameter update in the first experiment, and 300 mini-batches in the second.


Figure 4. The top row shows Averaged-DQN performance for the different number K of averaged networks on three Atari games. For K = 1 Averaged-DQN is reduced to DQN. The bold lines are averaged over seven independent learning trials. Every 2M frames, a performance test using an ε-greedy policy with ε = 0.05 for 500000 frames was conducted. The shaded area presents one standard deviation. The bottom row shows the average value estimates for the three games. It can be seen that as the number of averaged networks is increased, overestimation of the values is reduced, performance improves, and less variability is observed. The hyperparameters used were taken from Mnih et al. (2015).


6.2.1. ENVIRONMENT SETUP


    Figure 5. Gridworld problem. The agent starts at the left-bottom of the grid. In the upper-right corner, a reward of +1 is obtained.


    6.2.2. OVERESTIMATION

In Figure 6 it can be seen that increasing the number K of averaged target networks eventually leads to reduced overestimation. Also, more averaged target networks seem to reduce the overshoot of the values, and lead to smoother and less inconsistent convergence.


    6.2.3. AVERAGED VERSUS ENSEMBLE DQN

In Figure 7, it can be seen that, as was predicted by the analysis in Section 5, Ensemble-DQN is also inferior to Averaged-DQN regarding variance reduction, and as a consequence overestimates the values far more. We note that Ensemble-DQN was not implemented for the ALE experiments due to its demanding computational effort, and the empirical evidence that was already obtained in this simple Gridworld domain.

Figure 6. Averaged-DQN average predicted value in Gridworld. Increasing the number K of averaged target networks leads to a faster convergence with less overestimation (positive bias). The bold lines are averages over 40 independent learning trials, and the shaded area presents one standard deviation. In the figure, A, B, C, D present DQN, and Averaged-DQN for K = 5, 10, 20 average overestimation.

    Figure 7. Averaged-DQN and Ensemble-DQN predicted value in Gridworld. Averaging of past learned value is more beneficial than learning in parallel. The bold lines are averages over 20 independent learning trials, where the shaded area presents one standard deviation.

    7. Discussion and Future Directions

In this work, we have presented the Averaged-DQN algorithm, an extension to DQN that stabilizes training, and improves performance by efficient TAE variance reduction. We have shown both in theory and in practice that the proposed scheme is superior in TAE variance reduction, compared to a straightforward but computationally demanding approach such as Ensemble-DQN (Algorithm 3). We have demonstrated in several games of Atari that increasing the number K of averaged target networks leads to better policies while reducing overestimation. Averaged-DQN is a simple extension that can be easily integrated with other DQN variants such as Schaul et al. (2015); Van Hasselt et al. (2015); Wang et al. (2015); Bellemare et al. (2016); He et al. (2016). Indeed, it would be of interest to study the added value of averaging when combined with these variants. Also, since Averaged-DQN has a variance reduction effect on the learning curve, a more systematic comparison between the different variants can be facilitated as discussed in (Liang et al., 2016).

In future work, we may dynamically learn when and how many networks to average for best results. One simple suggestion may be to correlate the number of networks with the state TD-error, similarly to Schaul et al. (2015). Finally, incorporating averaging techniques similar to Averaged-DQN within on-policy methods such as SARSA and Actor-Critic methods (Mnih et al., 2016) can further stabilize these algorithms.


    References

    Arthur E Bryson, Yu Chi Ho. Applied Optimal Control: Optimization Estimation and Control. Hemisphere Pub-lishing, 1975.

    Baird III, Leemon C. Advantage updating. Technical re-port, DTIC Document, 1993.

    Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation plat-form for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

    Bellemare, Marc G, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Uni-fying count-based exploration and intrinsic motivation. arXiv preprint arXiv:1606.01868, 2016.

    Bellman, Richard. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957.

    Boyan, Justin and Moore, Andrew W. Generalization in reinforcement learning: Safely approximating the value function. Advances in neural information processing systems, pp. 369–376, 1995.

    Even-Dar, Eyal and Mansour, Yishay. Learning rates for q-learning. Journal of Machine Learning Research, 5 (Dec):1–25, 2003.

    He, Frank S., Yang Liu, Alexander G. Schwing, and Peng, Jian. Learning to play in a day: Faster deep reinforce-ment learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

    Jaakkola, Tommi, Jordan, Michael I, and Singh, Satinder P. On the convergence of stochastic iterative dynamic pro-gramming algorithms. Neural Computation, 6(6):1185– 1201, 1994.

    Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv: 1412.6980, 2014.

    Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in NIPS, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

    Liang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowl-ing, Michael. State of the art control of Atari games using shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 485–493, 2016.

    Lin, Long-Ji. Reinforcement learning for robots using neu-ral networks. Technical report, DTIC Document, 1993.

    Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforce-ment learning. arXiv preprint arXiv:1312.5602, 2013.

    Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529– 533, 2015.

    Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

    Osband, Ian, Blundell, Charles, Pritzel, Alexander, and Van Roy, Benjamin. Deep exploration via bootstrapped DQN. arXiv preprint arXiv:1602.04621, 2016.

    Riedmiller, Martin. Neural fitted Q iteration–first experi-ences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

    Rummery, Gavin A and Niranjan, Mahesan. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, 1994.

    Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Sil-ver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

    Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press Cambridge, 1998.

    Sutton, Richard S, McAllester, David A, Singh, Satinder P, and Mansour, Yishay. Policy gradient methods for re-inforcement learning with function approximation. In NIPS, volume 99, pp. 1057–1063, 1999.

    Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, and Filip De Turck, Pieter Abbeel. #exploration: A study of count-based ex-ploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717, 2016.

    Tesauro, Gerald. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.

Thrun, Sebastian and Schwartz, Anton. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.

Tsitsiklis, John N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Tsitsiklis, John N and Van Roy, Benjamin. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

Van Hasselt, Hado. Double Q-learning. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2613–2621. 2010.

Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

Wang, Ziyu, de Freitas, Nando, and Lanctot, Marc. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Layered Variance Shadow Maps


I am attempting to translate this article, which is by Andrew Lauritzen and Michael McCool. A disclaimer up front: my translation skills are limited, so treat this translation as a reference only.

First, the title: Layered Variance Shadow Maps.

    ABSTRACT
    Shadow maps are commonly used in real-time rendering, but they
    cannot be filtered linearly like standard color, resulting in severe
    aliasing. Variance shadow maps resolve this problem by representing
    the depth distribution using moments, which can be linearly filtered.
    However, variance shadow maps suffer from “light bleeding”
    artifacts and require high-precision texture filtering hardware.
    We introduce layered variance shadow maps, which provide simultaneous
    solutions to both of these limitations. By partitioning
    the shadow map depth range into multiple layers, we eliminate
    all light bleeding between different layers. Using more layers increases
    the quality of the shadows at the expense of additional storage.
    Because each of these layers covers a reduced depth range,
    they can be stored in lower precision than would be required with
    typical variance shadow maps, enabling their use on a much wider
    range of graphics hardware. We also describe an iterative optimization
    algorithm to automatically position layers so as to maximize
    the utility of each.
    Our algorithm is easy to implement on current graphics hardware
    and provides an efficient, scalable solution to the problem of
    shadow map filtering.
    Index Terms: I.3.7 [Computer Graphics]: Three-Dimensional
    Graphics and Realism—Color, shading, shadowing, and texture

     

Translator's notes on the abstract: an ordinary shadow map cannot be filtered linearly because it stores depths and shadowing is the result of a per-sample depth comparison; a VSM instead stores the mean and second moment of depth (depth and depth squared), which can be filtered linearly, and the shadow term is then recovered from the filtered statistics via Chebyshev's inequality. Light bleeding appears because two moments carry too little information about the depth distribution, and VSMs need high-precision render targets (e.g., two R32F render targets) that not all hardware can filter. LVSMs split the shadow map's depth range into multiple layers, eliminating bleeding between layers at the cost of extra storage; each layer covers a smaller depth range, so lower precision suffices, which most hardware supports. Layer placement is found iteratively with a Lloyd-relaxation (k-means-like) scheme so that each layer covers the most useful depth range.



     

    1 INTRODUCTION
    Shadow maps [12] are the most commonly used shadowing technique
    in real-time applications, providing several advantages over
    shadow volumes [3]. Most notably, shadow maps can be queried at
    arbitrary locations, they work with anything that can be rasterized
    including alpha tested geometry, and they are much less sensitive
    to geometric complexity than shadow volumes. Like most textures,
    however, they suffer from aliasing if not properly filtered. Unfortunately,
    common linear filtering algorithms such as mipmapping
    and anisotropic filtering are inapplicable to ordinary shadow maps,
    which require a non-linear depth comparison per sample.


     

    Variance shadow maps (VSMs) [4, 7] provide a solution to this
    problem by representing a probability distribution of depths at each
    shadow map texel. To achieve constant space usage, the distributions
    are approximated using their first two moments. The visibility
    function over an arbitrary filter region can be estimated using
    Chebyshev’s Inequality, which yields an upper bound on the
    amount of light reaching the fragment being shaded. In the case of
    a single planar occluder and a single planar receiver the reconstruction
    is exact [4], motivating its direct application to shading.


     

    However, because only two moments are stored, complex visibility
    functions cannot be reconstructed perfectly. This leads to
    regions that should be in shadow being lit, an artifact known as
    “light bleeding”. Storing more moments could eliminate the problem,
    but estimating the visibility function becomes very computationally
intensive as more moments are used. Furthermore higher-order
    moments are numerically unstable, making them difficult to
    use in practice. Even the second moment requires fairly high precision
    storage and filtering to avoid numeric problems with large light
ranges. Graphics hardware that supports 32-bit floating point filtering is sufficient, but hardware with less precision (16-bit for example)
    often is not. This limits the applicability of VSMs to high-end
    hardware.


     

    Rather than using more moments to improve the quality of
    VSMs, we can instead alter the depth metric. The key observation
    is that the visibility test is a function of the depth ordering in the
    scene, not the specific depth values. We can thus choose any monotonic
    warp of the depth metric and still obtain an upper bound on
    the visibility using Chebyshev’s Inequality as with standard VSMs.


     

    Moreover we can use any number of uniquely chosen warps.
    Each warp will produce a different upper bound on the visibility
    function, some tighter than others. We use this fact to partition the
    light’s depth range into different layers, each of which has a unique
    warping function. We choose these warps so that for a given depth
    d, we know in advance which layer will provide the best approximation.
    With this approach, only a single texture sample from one
    layer is required for each fragment while shading, allowing us to
    scale to a large number of layers.


     

    A remaining problem is where to place the layer boundaries to
    eliminate the most light bleeding. While uniformly or manually
    placed layer boundaries may be sufficient for many applications, it
    is desirable to have an automated method available. We describe an
    iterative image-space optimization algorithm based on Lloyd relaxation
    [5] in Section 4.


     

    In summary, layered variance shadow maps (LVSMs) generalize
    variance shadow maps and provide a scalable way to increase
    shadow quality beyond what is possible with standard VSMs. Additionally,
    LVSMs can be used with a wider range of graphics hardware
    than VSMs since they do not require high-precision texture
    filtering.


     

    2 RELATED WORK
    Shadow maps [12] are an efficient method of computing shadows
    in general scenes. Unfortunately, they suffer from several forms of
    aliasing. There are two orthogonal bodies of research that address
    these aliasing problems: projection optimization and shadow map
    filtering. Lloyd [8] gives an excellent survey and analysis of various
    warping an partitioning algorithms that aim to improve the shadow
    projection. We address the filtering problem, but projection optimization
    algorithms can be used in conjunction with our approach
    to obtain high quality shadows.


     

    Percentage-closer filtering (PCF) [10] provides a way to filter
    shadow maps by observing that the results of many depth comparisons
    over a filter region should be averaged, rather than the depth
    values themselves. This method does not support pre-filtering,
    however, and so for large filter sizes it is very expensive. Additionally,
    PCF does not consider the receiver geometry within the
    filter region, which causes significant biasing problems.


     

    Deep shadow maps [9] store a distribution of depths for each
    shadow map texel and allow pre-filtering. Unfortunately general
    deep shadow maps require a variable amount of storage per pixel,
    making them unsuitable for implementation on current graphics
    hardware. Furthermore averaging two distributions is non-trivial,
    which complicates filtering.


     

    Opacity shadow maps [6] and convolution shadow maps [1] represent
    the visibility function with respect to a basis to allow linear
    filtering. These algorithms are limited in their ability to represent
    discontinuities in the visibility function, and thus suffer from
    light bleeding near all occluders. Properly rendering any reasonably
    sized scene requires a huge number of layers (in the case of opacity
    shadow maps) or coefficients (in the case of convolution shadow
    maps), which makes these solutions impractical for real-time rendering
    of complex scenes. Furthermore, convolution shadow maps
    need to sample all of the basis coefficients for a given shadow map
texel, which scales poorly as the number of coefficients increases. In contrast, layered variance shadow maps require only a single texture
    sample per shaded pixel, just like variance shadow maps.


     

    Our algorithm also uses a piecewise representation of the visibility
    function, but depth tests within each piece are resolved in
    the same way as with standard variance shadow maps (VSMs)
    [4]. Thus like VSMs our algorithm has no problem reconstructing
    the visibility function in the case of a single planar occluder
    and receiver. Unlike VSMs, however, we avoid multi-occluder
    light bleeding artifacts by partitioning the depth range into multiple
    pieces.


     

    3 ALGORITHM OVERVIEW
    The algorithm is built on top of variance shadow maps [4, 7], which
    we will review briefly here.


     

We begin by rendering both depth and squared depth into a two-component
    variance shadow map. The VSM is sampled while shading
    a fragment to produce the moments M1 and M2 of the depth
    distribution F(x) over the texture filter region, defined as:

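In the standard VSM formulation, these are simply the first two moments of the filtered depth distribution (reconstructed here in the usual notation, since the formulas are not shown):

$$M_1 = E[x] = \int x\,F(x)\,dx, \qquad M_2 = E[x^2] = \int x^2\,F(x)\,dx.$$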

     

    From these we compute the mean and variance of the distribution:


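In terms of the two moments, the mean and variance are

$$\mu = M_1, \qquad \sigma^2 = M_2 - M_1^2.$$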

     

    We then apply the one-tailed version of Chebyshev’s Inequality to
    estimate the percentage of light reaching a surface at depth t. In
particular, for some random variable x drawn from a depth distribution
    recovered from the filtered variance shadow map, we have

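The one-tailed Chebyshev (Cantelli) bound used here states that, for t > μ,

$$P(x \ge t) \le p_{\max}(t) \equiv \frac{\sigma^2}{\sigma^2 + (t-\mu)^2}.$$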

     

Finally, we use p(t) to attenuate the amount of light reaching the fragment due to shadowing. Usually it is desirable to clamp the variance to some minimum value σ²_min to avoid any numeric problems when t ≈ μ.


     

    3.1 Light Bleeding
    Variance shadow maps are simple and efficient, but they suffer from
    light bleeding. While Chebyshev’s Inequality gives an upper bound
on P(x ≥ t), there is no guarantee that the upper bound is a good
    approximation. As an example, consider the situation shown in
    Figure 2.


     

Let objects A, B and C be at depths a, b and c respectively. Note that only objects A and B will be represented in the filter region since object C is not visible from the light. Thus if we are shading a fragment at the center of the outlined region, we will recover (1) the moments

M1 = (a + b)/2,  M2 = (a² + b²)/2.

Then from (2), we compute:

μ = (a + b)/2,  σ² = M2 − M1² = (b − a)²/4.


    Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance

    2018-12-19 13:02:45

     

    This blog is copied from: https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/ 

     

    Deep learning neural networks are nonlinear methods.

    They offer increased flexibility and can scale in proportion to the amount of training data available. A downside of this flexibility is that they learn via a stochastic training algorithm which means that they are sensitive to the specifics of the training data and may find a different set of weights each time they are trained, which in turn produce different predictions.

    Generally, this is referred to as neural networks having a high variance and it can be frustrating when trying to develop a final model to use for making predictions.

    A successful approach to reducing the variance of neural network models is to train multiple models instead of a single model and to combine the predictions from these models. This is called ensemble learning and not only reduces the variance of predictions but also can result in predictions that are better than any single model.

    In this post, you will discover methods for deep learning neural networks to reduce variance and improve prediction performance.

    After reading this post, you will know:

    • Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
    • Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
    • Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

    Let’s get started.

Ensemble Methods to Reduce Variance and Improve Performance of Deep Learning Neural Networks
Photo by University of San Francisco’s Performing Arts, some rights reserved.

    Overview

    This tutorial is divided into four parts; they are:

    1. High Variance of Neural Network Models
    2. Reduce Variance Using an Ensemble of Models
    3. How to Ensemble Neural Network Models
    4. Summary of Ensemble Techniques

    High Variance of Neural Network Models

    Training deep neural networks can be very computationally expensive.

    Very deep networks trained on millions of examples may take days, weeks, and sometimes months to train.

    Google’s baseline model […] was a deep convolutional neural network […] that had been trained for about six months using asynchronous stochastic gradient descent on a large number of cores.

    — Distilling the Knowledge in a Neural Network, 2015.

    After the investment of so much time and resources, there is no guarantee that the final model will have low generalization error, performing well on examples not seen during training.

    … train many different candidate networks and then to select the best, […] and to discard the rest. There are two disadvantages with such an approach. First, all of the effort involved in training the remaining networks is wasted. Second, […] the network which had best performance on the validation set might not be the one with the best performance on new test data.

    — Pages 364-365, Neural Networks for Pattern Recognition, 1995.

    Neural network models are a nonlinear method. This means that they can learn complex nonlinear relationships in the data. A downside of this flexibility is that they are sensitive to initial conditions, both in terms of the initial random weights and in terms of the statistical noise in the training dataset.

    This stochastic nature of the learning algorithm means that each time a neural network model is trained, it may learn a slightly (or dramatically) different version of the mapping function from inputs to outputs, that in turn will have different performance on the training and holdout datasets.

    As such, we can think of a neural network as a method that has a low bias and high variance. Even when trained on large datasets to satisfy the high variance, having any variance in a final model that is intended to be used to make predictions can be frustrating.

     


     

    Reduce Variance Using an Ensemble of Models

    A solution to the high variance of neural networks is to train multiple models and combine their predictions.

    The idea is to combine the predictions from multiple good but different models.

    A good model has skill, meaning that its predictions are better than random chance. Importantly, the models must be good in different ways; they must make different prediction errors.

    The reason that model averaging works is that different models will usually not make all the same errors on the test set.

    — Page 256, Deep Learning, 2016.

    Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model. The results are predictions that are less sensitive to the specifics of the training data, choice of training scheme, and the serendipity of a single training run.

    In addition to reducing the variance in the prediction, the ensemble can also result in better predictions than any single best model.

    … the performance of a committee can be better than the performance of the best single network used in isolation.

    — Page 365, Neural Networks for Pattern Recognition, 1995.

    This approach belongs to a general class of methods called “ensemble learning” that describes methods that attempt to make the best use of the predictions from multiple models prepared for the same problem.

    Generally, ensemble learning involves training more than one network on the same dataset, then using each of the trained models to make a prediction before combining the predictions in some way to make a final outcome or prediction.

    In fact, ensembling of models is a standard approach in applied machine learning to ensure that the most stable and best possible prediction is made.

    For example, Alex Krizhevsky, et al. in their famous 2012 paper titled “Imagenet classification with deep convolutional neural networks” that introduced very deep convolutional neural networks for photo classification (i.e. AlexNet) used model averaging across multiple well-performing CNN models to achieve state-of-the-art results at the time. Performance of one model was compared to ensemble predictions averaged over two, five, and seven different models.

    Averaging the predictions of five similar CNNs gives an error rate of 16.4%. […] Averaging the predictions of two CNNs that were pre-trained […] with the aforementioned five CNNs gives an error rate of 15.3%.

    Ensembling is also the approach used by winners in machine learning competitions.

    Another powerful technique for obtaining the best possible results on a task is model ensembling. […] If you look at machine-learning competitions, in particular on Kaggle, you’ll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.

    — Page 264, Deep Learning With Python, 2017.

    How to Ensemble Neural Network Models

    Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a “committee of networks.”

    A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction and the actual prediction is calculated as the average of the predictions.
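
    As a rough sketch of the committee idea, assuming each trained member exposes a scikit-learn style predict_proba and a hypothetical build_and_train helper handles the per-seed training, the combined prediction is simply the mean of the members' probabilities:

```python
import numpy as np

def committee_predict(models, X):
    """Average the predicted class probabilities of the committee members."""
    probs = np.stack([m.predict_proba(X) for m in models], axis=0)
    return probs.mean(axis=0)

# Hypothetical usage: same architecture, same data, different random seeds.
# models = [build_and_train(X_train, y_train, seed=s) for s in range(5)]
# avg_probs = committee_predict(models, X_test)
# y_pred = avg_probs.argmax(axis=1)
```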

    The number of models in the ensemble is often kept small both because of the computational expense in training models and because of the diminishing returns in performance from adding more ensemble members. Ensembles may be as small as three, five, or 10 trained models.

    The field of ensemble learning is well studied and there are many variations on this simple theme.

    It can be helpful to think of varying each of the three major elements of the ensemble method; for example:

    • Training Data: Vary the choice of data used to train each model in the ensemble.
    • Ensemble Models: Vary the choice of the models used in the ensemble.
    • Combinations: Vary the choice of the way that outcomes from ensemble members are combined.

    Let’s take a closer look at each element in turn.

    Varying Training Data

    The data used to train each member of the ensemble can be varied.

    The simplest approach would be to use k-fold cross-validation to estimate the generalization error of the chosen model configuration. In this procedure, k different models are trained on k different subsets of the training data. These k models can then be saved and used as members of an ensemble.
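
    A minimal sketch of that k-fold recipe, using scikit-learn's KFold and a hypothetical train_model(X, y) factory for whatever network configuration is being evaluated:

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_ensemble(X, y, train_model, k=5, seed=0):
    """Train one model per fold's training split and return all k members."""
    members = []
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, _ in kfold.split(X):
        members.append(train_model(X[train_idx], y[train_idx]))
    return members
```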

    Another popular approach involves resampling the training dataset with replacement, then training a network using the resampled dataset. The resampling procedure means that the composition of each training dataset is different with the possibility of duplicated examples allowing the model trained on the dataset to have a slightly different expectation of the density of the samples, and in turn different generalization error.

    This approach is called bootstrap aggregation, or bagging for short, and was designed for use with unpruned decision trees that have high variance and low bias. Typically a large number of decision trees are used, such as hundreds or thousands, given that they are fast to prepare.

    … a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. […] Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set.

    — Pages 316-317, An Introduction to Statistical Learning with Applications in R, 2013.
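
    The bootstrap step itself is small; the sketch below (numpy, with the same hypothetical train_model factory) resamples the training set with replacement for each ensemble member:

```python
import numpy as np

def bagging_ensemble(X, y, train_model, n_members=10, seed=0):
    """Bootstrap aggregation: each member sees a resample-with-replacement of the data."""
    rng = np.random.default_rng(seed)
    members = []
    n = len(X)
    for _ in range(n_members):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap sample (duplicates allowed)
        members.append(train_model(X[idx], y[idx]))
    return members
```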

    An equivalent approach might be to use a smaller subset of the training dataset without regularization to allow faster training and some overfitting.

    The desire for slightly under-optimized models applies to the selection of ensemble members more generally.

    … the members of the committee should not individually be chosen to have optimal trade-off between bias and variance, but should have relatively smaller bias, since the extra variance can be removed by averaging.

    — Page 366, Neural Networks for Pattern Recognition, 1995.

    Other approaches may involve selecting a random subspace of the input space to allocate to each model, such as a subset of the hyper-volume in the input space or a subset of input features.
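
    A sketch of the feature-subset variant, again with the hypothetical train_model factory; the chosen column indices are kept with each member so the same subset can be applied at prediction time:

```python
import numpy as np

def random_subspace_ensemble(X, y, train_model, n_members=10, feature_fraction=0.5, seed=0):
    """Each member is trained on a random subset of the input features."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(feature_fraction * n_features))
    members = []
    for _ in range(n_members):
        cols = rng.choice(n_features, size=k, replace=False)
        members.append((cols, train_model(X[:, cols], y)))
    return members  # each entry is (column indices, fitted model)
```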

    Varying Models

    Training the same under-constrained model on the same data with different initial conditions will result in different models given the difficulty of the problem, and the stochastic nature of the learning algorithm.

    This is because the optimization problem that the network is trying to solve is so challenging that there are many “good” and “different” solutions to map inputs to outputs.

    Most neural network algorithms achieve sub-optimal performance specifically due to the existence of an overwhelming number of sub-optimal local minima. If we take a set of neural networks which have converged to local minima and apply averaging we can construct an improved estimate. One way to understand this fact is to consider that, in general, networks which have fallen into different local minima will perform poorly in different regions of feature space and thus their error terms will not be strongly correlated.

    — When networks disagree: Ensemble methods for hybrid neural networks, 1995.

    This may result in a reduced variance, but may not dramatically improve generalization error. The errors made by the models may still be too highly correlated because the models all have learned similar mapping functions.

    An alternative approach might be to vary the configuration of each ensemble model, such as using networks with different capacity (e.g. number of layers or nodes) or models trained under different conditions (e.g. learning rate or regularization).

    The result may be an ensemble of models that have learned a more heterogeneous collection of mapping functions and in turn have a lower correlation in their predictions and prediction errors.

    Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

    — Pages 257-258, Deep Learning, 2016.

    Such an ensemble of differently configured models can be achieved through the normal process of developing the network and tuning its hyperparameters. Each model could be saved during this process and a subset of better models chosen to comprise the ensemble.

    Slightly inferiorly trained networks are a free by-product of most tuning algorithms; it is desirable to use such extra copies even when their performance is significantly worse than the best performance found. Better performance yet can be achieved through careful planning for an ensemble classification by using the best available parameters and training different copies on different subsets of the available database.

    — Neural Network Ensembles, 1990.

    In cases where a single model may take weeks or months to train, another alternative may be to periodically save the best model during the training process, called snapshot or checkpoint models, then select ensemble members among the saved models. This provides the benefits of having multiple models trained on the same data, although collected during a single training run.

    Snapshot Ensembling produces an ensemble of accurate and diverse models from a single training process. At the heart of Snapshot Ensembling is an optimization process which visits several local minima before converging to a final solution. We take model snapshots at these various minima, and average their predictions at test time.

    — Snapshot Ensembles: Train 1, get M for free, 2017.

    A variation on the Snapshot ensemble is to save models from a range of epochs, perhaps identified by reviewing learning curves of model performance on the train and validation datasets during training. Ensembles from such contiguous sequences of models are referred to as horizontal ensembles.

    First, networks trained for a relatively stable range of epoch are selected. The predictions of the probability of each label are produced by standard classifiers [over] the selected epoch[s], and then averaged.

    — Horizontal and vertical ensemble with deep representation for classification, 2013.
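
    In code, a horizontal ensemble amounts to loading the checkpoints saved over a stable range of epochs and averaging their predicted probabilities; a small sketch with a hypothetical load_checkpoint(epoch) helper:

```python
import numpy as np

def horizontal_ensemble_predict(checkpoint_epochs, load_checkpoint, X):
    """Average predicted probabilities across checkpoints from a stable range of epochs."""
    probs = [load_checkpoint(epoch).predict_proba(X) for epoch in checkpoint_epochs]
    return np.mean(probs, axis=0)

# Hypothetical usage: epochs 90-100 judged stable from the learning curves.
# avg_probs = horizontal_ensemble_predict(range(90, 101), load_checkpoint, X_test)
```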

    A further enhancement of the snapshot ensemble is to systematically vary the optimization procedure during training to force different solutions (i.e. sets of weights), the best of which can be saved to checkpoints. This might involve injecting an oscillating amount of noise over training epochs or oscillating the learning rate during training epochs. A variation of this approach called Stochastic Gradient Descent with Warm Restarts (SGDR) demonstrated faster learning and state-of-the-art results for standard photo classification tasks.

    Our SGDR simulates warm restarts by scheduling the learning rate to achieve competitive results […] roughly two to four times faster. We also achieved new state-of-the-art results with SGDR, mainly by using even wider [models] and ensembles of snapshots from SGDR’s trajectory.

    — SGDR: Stochastic Gradient Descent with Warm Restarts, 2016.
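
    One way to realize this oscillation is the cosine-annealed learning rate with warm restarts used by SGDR; the sketch below gives the per-epoch rate (the bounds and cycle length are illustrative values, not the paper's settings):

```python
import math

def sgdr_learning_rate(epoch, lr_max=0.1, lr_min=1e-4, cycle_len=50):
    """Cosine-annealed learning rate that restarts (jumps back to lr_max) every cycle_len epochs."""
    t = epoch % cycle_len                      # position within the current cycle
    cos = 0.5 * (1.0 + math.cos(math.pi * t / cycle_len))
    return lr_min + (lr_max - lr_min) * cos

# Snapshot ensembling: save a checkpoint at the end of each cycle (just before the restart),
# then average the predictions of the saved snapshots at test time.
```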

    A benefit of very deep neural networks is that the intermediate hidden layers provide a learned representation of the low-resolution input data. The hidden layers can output their internal representations directly, and the output from one or more hidden layers from one very deep network can be used as input to a new classification model. This is perhaps most effective when the deep model is trained using an autoencoder model. This type of ensemble is referred to as a vertical ensemble.

    This method ensembles a series of classifiers whose inputs are the representation of intermediate layers. A lower error rate is expected because these features seem diverse.

    — Horizontal and vertical ensemble with deep representation for classification, 2013.

    Varying Combinations

    The simplest way to combine the predictions is to calculate the average of the predictions from the ensemble members.

    This can be improved slightly by weighting the predictions from each model, where the weights are optimized using a hold-out validation dataset. This provides a weighted average ensemble that is sometimes called model blending.

    … we might expect that some members of the committee will typically make better predictions than other members. We would therefore expect to be able to reduce the error still further if we give greater weight to some committee members than to others. Thus, we consider a generalized committee prediction given by a weighted combination of the predictions of the members …

    — Page 367, Neural Networks for Pattern Recognition, 1995.
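
    As a sketch, the weighted combination itself is a few lines of numpy; how the weights are found (for example, a grid or random search over the simplex scored on a hold-out set) is left as an assumption:

```python
import numpy as np

def weighted_average(member_probs, weights):
    """Combine members' predicted probabilities with normalized non-negative weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(member_probs, axis=0)   # shape: (members, samples, classes)
    return np.tensordot(w, stacked, axes=1)    # weighted sum over the member axis

# Hypothetical usage: weights chosen to maximize hold-out accuracy.
# blended = weighted_average([m.predict_proba(X_val) for m in models], weights=[0.5, 0.3, 0.2])
```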

    One further step in complexity involves using a new model to learn how to best combine the predictions from each ensemble member.

    The model could be a simple linear model (e.g. much like the weighted average), but could be a sophisticated nonlinear method that also considers the specific input sample in addition to the predictions provided by each member. This general approach of learning a new model is called model stacking, or stacked generalization.

    Stacked generalization works by deducing the biases of the generalizer(s) with respect to a provided learning set. This deduction proceeds by generalizing in a second space whose inputs are (for example) the guesses of the original generalizers when taught with part of the learning set and trying to guess the rest of it, and whose output is (for example) the correct guess. […] When used with a single generalizer, stacked generalization is a scheme for estimating (and then correcting for) the error of a generalizer which has been trained on a particular learning set and then asked a particular question.

    — Stacked generalization, 1992.
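
    A minimal stacking sketch, assuming the members expose predict_proba and using a scikit-learn logistic regression as the meta-model (one possible choice of "simple linear model", not prescribed by the source). The meta-model is fit on member predictions for a hold-out set the members never saw:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacker(members, X_holdout, y_holdout):
    """Learn how to combine member predictions using hold-out data unseen by the members."""
    meta_features = np.hstack([m.predict_proba(X_holdout) for m in members])
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(meta_features, y_holdout)
    return meta_model

def stacked_predict(members, meta_model, X):
    meta_features = np.hstack([m.predict_proba(X) for m in members])
    return meta_model.predict(meta_features)
```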

    There are more sophisticated methods for stacking models, such as boosting where ensemble members are added one at a time in order to correct the mistakes of prior models. The added complexity means this approach is less often used with large neural network models.

    Another combination that is a little bit different is to combine the weights of multiple neural networks with the same structure. The weights of multiple networks can be averaged, to hopefully result in a new single model that has better overall performance than any original model. This approach is called model weight averaging.

    … suggests it is promising to average these points in weight space, and use a network with these averaged weights, instead of forming an ensemble by averaging the outputs of networks in model space

    — Averaging Weights Leads to Wider Optima and Better Generalization, 2018.
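
    For identically structured networks, averaging in weight space is an element-wise mean over corresponding weight tensors; the sketch below is framework-agnostic and assumes hypothetical get_weights/set_weights accessors that return and accept lists of numpy arrays:

```python
import numpy as np

def average_weights(weight_lists):
    """Element-wise mean of corresponding weight tensors from identically shaped models."""
    return [np.mean(np.stack(tensors, axis=0), axis=0)
            for tensors in zip(*weight_lists)]

# Hypothetical usage (Keras-style accessors):
# avg = average_weights([m.get_weights() for m in models])
# final_model.set_weights(avg)
```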

    Summary of Ensemble Techniques

    In summary, we can list some of the more common and interesting ensemble methods for neural networks organized by each element of the method that can be varied, as follows:

    • Varying Training Data
      • k-fold Cross-Validation Ensemble
      • Bootstrap Aggregation (bagging) Ensemble
      • Random Training Subset Ensemble
    • Varying Models
      • Multiple Training Run Ensemble
      • Hyperparameter Tuning Ensemble
      • Snapshot Ensemble
      • Horizontal Epochs Ensemble
      • Vertical Representational Ensemble
    • Varying Combinations
      • Model Averaging Ensemble
      • Weighted Average Ensemble
      • Stacked Generalization (stacking) Ensemble
      • Boosting Ensemble
      • Model Weight Averaging Ensemble

    There is no single best ensemble method; perhaps experiment with a few approaches or let the constraints of your project guide you.


    Summary

    In this post, you discovered ensemble methods for deep learning neural networks to reduce variance and improve prediction performance.

    Specifically, you learned:

    • Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
    • Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
    • Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.


     

    Reposted from: https://www.cnblogs.com/wangxiaocvpr/p/10142598.html

  • Layered --> Variance --> Shadow Map

    1,000+ reads · 2011-06-20 11:10:00
     Hardware-based dynamic shadows generally rely on one of two techniques, Shadow Volume and Shadow Map. SV, however, has essentially been phased out; its algorithm is complex and tied directly to occluder complexity, so it is no longer mainstream. SM (well, that name) is algorithmically simple and independent of scene complexity; its only ...
  • Layered --> Variance --> Shadow Map

    1,000+ reads · 2010-11-24 12:23:00
     Hardware-based dynamic shadows generally rely on one of two techniques, Shadow Volume and Shadow Map. SV, however, has essentially been phased out; its algorithm is complex and tied directly to occluder complexity, so it is no longer mainstream. SM (well, that name) is algorithmically simple and independent of scene complexity; its only ...
  • The scenario of STD costing and variance analysis. http://www.accountingformanagement.com/standard_costing_...
  • 1 Variance calculation in Cost Object Accounting is used for monitoring the financial aspects of day to day ...
  • 23, TITLE: Cross-modal Attention for MRI and Ultrasound Volume Registration AUTHORS: XINRUI SONG et. al. CATEGORY: cs.CV [cs.CV] HIGHLIGHT: This paper aims to develop a self-attention mechanism ...
  • [pyradiomics radiomics tutorial] (7) Radiomics features

    10,000+ reads · highly upvoted · 2018-05-07 15:42:13
    Feature category / feature name — Shape features (14): Mesh Volume, Voxel Volume, Surface Area, Surface Area to Volume ratio, Sphericity, Maximum 3D ...
  • Hardware-based dynamic shadows generally rely on one of two techniques, Shadow Volume and Shadow Map. SV, however, has essentially been phased out; its algorithm is complex and tied directly to occluder complexity, so it is no longer mainstream. SM (well, that name) is algorithmically simple and independent of scene complexity; its only ...
  • In this study, optimised combinations of several variance reduction techniques (VRTs) have been implemented in order to achieve a high precision in Monte Carlo (MC) radiation transport simulations ...
  • Correspondence analysis and multidimensional scaling

    1,000+ reads · 2019-07-10 10:44:28
    Package,' Journal of Statistical Software, May 2007, Volume 20, Issue 3. """ def __init__(self, cross_table): N = np.matrix(cross_table, dtype=float) # correspondence ...
  • Paper: "A Few Useful Things to Know About Machine Learning" — translation and commentary ... some useful things to know about machine learning ... Learning = Representation ...
  • iVPF: Numerical Invertible Volume Preserving Flow for Efficient Lossless Compression Shifeng Zhang, Chen Zhang, Ning Kang, Zhenguo Li [pdf] [supp] [arXiv] [bibtex] Pose Recognition With Cascade ...
  • Two notes on Intel MicroCode

    2019-12-08 17:03:46
    The header is documented by Intel in Volume 3 of the Developer's Manual. It contains three pieces of information required for validation: the microcode revision, processor signature, and ...
  • PMP practice exam (2)

    2020-09-09 14:58:21
    200 questions in total, 240 minutes, questions in English
  • Reading notes on the "GPU Gems", "GPU Pro" and "GPU Zen" series

    10,000+ reads · highly upvoted · 2019-02-13 13:37:50
    Take shadows, for example: in the GPU Gems era, Shadow Volume was still used to render shadows, whereas from the GPU Pro series onward Shadow Map became the absolute mainstream, and shadow research shifted to optimizing Shadow Map, e.g. PCF and CSM, and then to how ...
  • Acceleration structure management: on NVIDIA's current Turing architecture, a BVH (bounding volume hierarchy) is built as the acceleration structure. The features available to compute shaders are also available during BVH construction, such as async compute. Here ...
  • Volume Ex-Dividend \ Date 2004 - 08 - 19 100.00 104.06 95.96 100.34 44659000 0 2004 - 08 - 20 101.01 109.08 100.50 108.31 22834300 0 2004 - 08 - 23 110.75 113.48...
  • Chapter 8: Summed-Area Variance Shadow Maps. 3.2 Part II: Light and Shadows. Chapter 9: Interactive Cinematic Relighting with Global Illumination ...
  • Parallel Split Shadow Map (PSSM) [15][16] / Cascaded Shadow Map (CSM) [17]: although Shadow Volume is attractive, it requires a lot of memory bandwidth and usually cannot produce soft shadows. Most games later switched to Shadow Map, which is more ...
  • Recommender systems: collaborative filtering

    10,000+ reads · highly upvoted · 2016-07-03 18:17:48
    Volume 20, Number 4, pp. 539-552.] ItemCF algorithm: pre-compute the similarity of every item pair; find items similar to those the user likes, has purchased, or has in their basket. For each item to score, find the similar ...
  • There is another trick in the implementation: a variance hyperparameter adjusts the detection values, and the boolean parameter variance_encoded_in_target switches between two modes. When it is True, the variance is encoded in the predictions, which is the case above; but if it is False (which most ...
  • : density estimation using real-valued non-volume preserving (real NVP) transformations. rebar : low-variance, unbiased gradient estimates for discrete latent variable models. resnet : deep and ...
  • Batch Normalization paper translation — Chinese-English side by side

    1,000+ reads · highly upvoted · 2017-09-28 15:59:10
    Author: Tyan. Blog: noahsnail.com | CSDN | Jianshu ... Disclaimer: the translation is for learning purposes only; please contact the author to remove the post in case of infringement. ... Batch Normalization: Acceleratin...
