A ride-sharing company collects a dataset of pricing and discount decisions, along with the corresponding changes in customer and driver behavior, in order to optimize its dynamic pricing strategy. An online vendor records orders and inventory levels to generate an inventory management policy. An autonomous car company records driving data from human drivers to train an improved end-to-end vision-based driving controller.
All of these applications have two things in common: we would consider each of them to be a classic example of how machine learning can enable smarter decisions with data, and yet none of them is actually feasible with the machine learning technologies widely in use today without manual design of decision rules and state machines. I will discuss how recent advances in offline reinforcement learning can change that in the next few years. I believe that we stand on the cusp of a revolution in how data can inform automated decision-making. Outside of a few applications in advertising and recommender systems, ML-enabled decision-making systems have generally relied on supervised learning methods for prediction, followed by manually designed decision rules that use these predictions to choose how to act. While reinforcement learning algorithms have made tremendous headway in providing a toolkit for automated end-to-end decision making in research, this toolkit has proven difficult to apply in reality because, in its most common incarnation, it is simply too hard to reconcile with the data-driven machine learning pipelines in use today. This may change once reinforcement learning algorithms can effectively use data, and I will discuss how that might happen.
How machine learning systems make decisions
First, we must draw a distinction between prediction and decision making. Supervised learning systems make predictions. These predictions can then be used to make decisions, but how a prediction turns into a decision is up to the user. If a model forecasts that customer orders will increase 200% in October, a reasonable decision is to increase inventory levels accordingly. However, the model is not telling us that increasing inventory levels will lead to larger profits. Not only does it not account for the distribution shift induced by acting on the model’s own predictions, it also fails to account for the entirety of the decision making process. Real-world decision making systems face a sequential and iterated problem, where each decision influences future events, which in turn influence future decisions. Here are some of the differences between the assumptions made by supervised predictive modeling systems and the properties of real-world sequential decision making problems:
Supervised predictive modeling
- Predicts manually selected quantities (e.g., number of customer orders)
- Decisions must be made from predictions manually, using human intuition and hand-crafted rules
- Assumes i.i.d. (independent and identically distributed) data
- Ignores feedback, which changes how inputs map to outputs when the learning system itself interacts with the world (e.g., customers may not react the same way to auto-generated recommendations as they did during data collection)
Sequential decision making
- Only the objective is manually specified (e.g., maximize profits)
- Requires outputting near-optimal actions that will lead to desired outcomes (e.g., how to alter inventory levels to maximize profits)
- Each observation is part of a sequential process, and each action influences future observations (not i.i.d.)
- Feedback is critical, and may be utilized to achieve desired goals through long-term interaction
Reinforcement learning (RL) is concerned most directly with the decision making problem. RL has attained good results on tasks ranging from playing games to enabling robots to grasp objects. RL algorithms directly aim to optimize long-term performance in the face of a dynamic and changing environment that reacts to each decision. However, most reinforcement learning methods are studied in an active learning setting, where an agent directly interacts with its environment, observes the outcomes of its actions, and uses these attempts to learn through trial and error, as shown below.
Instantiating this framework with real-world data collection is difficult, because partially trained agents interacting with real physical systems require careful oversight and supervision (would you want a partially trained RL policy to make real-world inventory purchasing decisions?). For this reason, most work that utilizes reinforcement learning relies either on meticulously hand-designed simulators, which preclude handling complex real-world situations, especially ones with unpredictable human participants, or on carefully designed real-world learning setups, as in the case of real-world robotic learning. More fundamentally, this precludes combining RL algorithms with the most successful formula in ML. From computer vision to NLP to speech recognition, time and time again we’ve seen that large datasets, combined with large models, can enable effective generalization in complex real-world settings. However, with active online RL algorithms that must recollect their dataset each time a new model is trained, such a formula becomes impractical. Here are some of the differences between the active RL setup and data-driven machine learning:
Active (online) reinforcement learning
- Agent collects data each time it is trained
- Agent must collect data using its own (partially trained) policy
- Either uses narrow datasets (e.g., collected in one environment) or manually designed simulators
- Generalization can be poor due to small, narrow datasets or simulators that differ from reality
Data-driven machine learning
- Data may be collected once and reused for all models
- Data can be collected with any strategy, including a hand-engineered system, humans, or just randomly
- Large and diverse datasets can be collected from all available sources
- Generalization is quite good, due to large and diverse datasets
Offline reinforcement learning
To perform effective end-to-end decision making in the real world, we must combine the formalism of reinforcement learning, which handles feedback and sequential decision making, with data-driven machine learning, which learns from large and diverse datasets, and therefore enables generalization. This necessitates removing the requirement for active data collection and devising RL algorithms that can learn from prior data. Such methods are referred to as batch reinforcement learning algorithms, or offline reinforcement learning (I will use the term “offline reinforcement learning,” since it is more self-explanatory, though the term “batch” is more common in the foundational literature). The diagram below illustrates the differences between classic online reinforcement learning, off-policy reinforcement learning, and offline reinforcement learning:
In online RL, data is collected each time the policy is modified. In off-policy RL, old data is retained, and new data is still collected periodically as the policy changes. In offline RL, the data is collected once, in advance, much like in the supervised learning setting, and is then used to train optimal policies without any additional online data collection. Of course, in practical use, offline RL methods can be combined with modest amounts of online finetuning, where after an initial offline phase, the policy is deployed to collect additional data to improve online.
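To make the offline setting concrete, here is a minimal sketch on a hypothetical two-state, two-action MDP (the MDP, reward structure, and all names are invented for illustration): a dataset is collected once with a random behavior policy, and a tabular Q-function is then trained purely by making repeated passes over that fixed dataset, with no further environment interaction.

```python
import random

def collect_dataset(n=2000, seed=0):
    """Collect transitions ONCE with a random behavior policy (the offline setting)."""
    rng = random.Random(seed)
    data, s = [], 0
    for _ in range(n):
        a = rng.choice([0, 1])                  # random behavior policy
        r = 1.0 if (s == 1 and a == 1) else 0.0  # reward only for action 1 in state 1
        s_next = a                               # action deterministically sets next state
        data.append((s, a, r, s_next))
        s = s_next
    return data

def offline_q_learning(data, gamma=0.9, alpha=0.1, epochs=50):
    """Train by sweeping over the fixed dataset -- no new rollouts are ever collected."""
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(epochs):
        for s, a, r, s_next in data:
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
    return Q

data = collect_dataset()
Q = offline_q_learning(data)
# greedy policy extraction: the learned policy prefers action 1 in both states,
# even though the behavior policy that collected the data acted at random
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

The same dataset could be reused to train any number of models, which is exactly the property that active online RL lacks.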
Crucially, when the need to collect additional data with the latest policy is removed completely, reinforcement learning does not require any capability to interact with the world during training. This removes a wide range of cost, practicality, and safety issues: we no longer need to deploy partially trained and potentially unsafe policies, we no longer need to figure out how to conduct multiple trials in the real world, and we no longer need to build complex simulators. The offline data for this learning process could be collected from a baseline manually designed controller, or even by humans demonstrating a range of behaviors. These behaviors do not need to all be good either, in contrast to imitation learning methods. This approach removes one of the most complex and challenging parts of a real-world reinforcement learning system.
However, the full benefit of offline reinforcement learning goes even further. By making it possible to utilize previously collected datasets, offline RL can utilize large and diverse datasets that are only practical to collect once — datasets on the scale of ImageNet or MS-COCO, which capture a wide, longitudinal slice of real-world situations. For example, an autonomous vehicle could be trained on millions of videos depicting real-world driving. An HVAC controller could be trained using logged data from every single building in which that HVAC system was ever deployed. An algorithm that controls traffic lights to optimize city traffic could utilize data from many different intersections in many different cities. And crucially, all of this could be done end-to-end, training models that directly map rich observations or features directly to decisions that optimize user-specified objective functions.
How do offline reinforcement learning algorithms work?
The fundamental challenge in offline reinforcement learning is distributional shift. The offline training data comes from a fixed distribution (sometimes referred to as the behavior policy). The new policy that we learn from this data induces a different distribution. Every offline RL algorithm must contend with the resulting distributional shift problem. One widely studied approach in the literature is to employ importance sampling, where distributional shift can lead to high variance in the importance weights. Algorithms based on value functions (e.g., deep Q-learning and actor-critic methods) must contend with distributional shift in the inputs to the Q-function: the Q-function is trained under the state-action distribution induced by the behavior policy, but evaluated, for the purpose of policy improvement, under the distribution induced by the latest policy. Using the Q-function to evaluate or improve a learned policy can result in out-of-distribution actions being passed into the Q-function, leading to unpredictable and likely incorrect predictions. When the policy is optimized so as to maximize its predicted Q-values, this leads to a kind of “adversarial example” problem, where the policy learns to produce actions that “fool” the learned Q-function into thinking they are good.
Most successful offline RL methods address this problem with some type of constrained or conservative update. Such updates either avoid excessive distributional shift by limiting how much the learned policy can deviate from the behavior policy, or explicitly regularize learned value functions or Q-functions so that the Q-values of unlikely actions are kept low, which in turn also limits distributional shift by disincentivizing the policy from taking these unlikely, out-of-distribution actions. The intuition is that we should only allow the policy to take those actions for which the data supports viable predictions.
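Both the failure mode and the effect of a conservative fix can be seen even in a tabular toy problem (the numbers here are entirely hypothetical): an action the behavior policy never takes keeps its optimistic initial Q-value forever, and a naive max-backup bootstraps off that phantom value, while restricting the backup to in-support actions keeps the values grounded in the data.

```python
GAMMA = 0.9
rewards = [0.2, 0.5, 0.0]                   # true one-step rewards; action 2 never appears in the data
dataset = [(0, a, rewards[a], 0) for a in (0, 1)] * 100   # (s, a, r, s_next) tuples

def train_q(dataset, restrict_to_support, epochs=300, alpha=0.1):
    Q = [100.0, 100.0, 100.0]               # optimistic initialization
    support = {a for (_, a, _, _) in dataset}
    for _ in range(epochs):
        for s, a, r, s_next in dataset:
            if restrict_to_support:
                bootstrap = max(Q[b] for b in support)   # conservative: in-support actions only
            else:
                bootstrap = max(Q)                       # naive: includes never-seen action 2
            Q[a] += alpha * (r + GAMMA * bootstrap - Q[a])
    return Q

q_naive = train_q(dataset, restrict_to_support=False)
q_safe = train_q(dataset, restrict_to_support=True)
# q_naive[2] is never corrected, so naive values inflate toward 0.5 + 0.9 * 100,
# and the greedy policy picks the never-observed action; the conservative backup
# converges near the true optimum, 0.5 / (1 - 0.9) = 5.
```

The naive Q-function not only overestimates every value, it also hands the greedy policy exactly the out-of-distribution action the data says nothing about.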
Of course, at this point, we might ask: why should we expect offline RL to actually improve over the behavior policy at all? The key to this is the sequential nature of the decision making problem. While at any one time step the actions of the learned policy should remain close to the distribution of behaviors we’ve seen before, across time steps we can combine bits and pieces of different behaviors we’ve seen in the data. Imagine learning to play a new card game. Even if you play your cards at random, on some trials some of your actions will, perhaps by accident, lead to favorable outcomes. By looking back on all of your experiences and combining the best moves into a single policy, you can arrive at a policy that is substantially better than any of your previous plays, despite being composed entirely of actions that you’ve made before.
Building on these ideas, recent advances in offline reinforcement learning have led to substantial improvements in the capabilities of offline RL algorithms. A complete technical discussion of these methods is outside the scope of this article, and I would refer the reader to our recent tutorial paper for more details. However, I will briefly summarize several recent advances that I think are particularly exciting:
Policy constraints: A simple approach to control distributional shift is to limit how much the learned policy can deviate from the behavior policy. This is especially natural for actor-critic algorithms, where policy constraints can be formalized as using the following type of policy update:
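One representative form of such a constrained update (a sketch of the general template; the specific divergence, constraint level, and Q-function estimator vary across methods, with π_β denoting the behavior policy) is:

```latex
\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}\Big[ \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q^{\pi_k}(s, a) \big] \Big]
\quad \text{s.t.} \quad D\big(\pi, \pi_\beta\big) \le \epsilon
```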
The constraint, expressed in terms of some divergence (“D”), limits how far the learned policy deviates from the behavior policy. Examples include KL-divergence constraints and support constraints. This class of methods is summarized in detail in our tutorial. Note that such methods require estimating the behavior policy by using another neural network, which can be a substantial source of error.
Implicit constraints: The AWR and AWAC algorithms instead perform offline RL by using an implicit constraint. Rather than explicitly learning the behavior policy, these methods solve for the optimal policy via a weighted maximum likelihood update of the following form:
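A common way to write this implicit-constraint update (a sketch; the exact weighting and advantage estimator differ between AWR and AWAC, and λ is a temperature hyperparameter) is:

```latex
\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{(s, a) \sim \mathcal{D}}\Big[ \log \pi_\theta(a \mid s) \, \exp\!\big( \tfrac{1}{\lambda} A(s, a) \big) \Big]
```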
Here, A(s,a) is an estimate of the advantage, which is computed in different ways for different algorithms (AWR uses Monte Carlo estimates, while AWAC uses an off-policy Q-function). Using this type of update to enforce constraints has been explored in a number of prior works (see, e.g., REPS), but has only recently been applied to offline RL. Computing the expectation under the behavior policy only requires samples from the behavior policy, which we can obtain directly from the dataset, without actually needing to estimate what the behavior policy is. This makes AWR and AWAC substantially simpler, and enables good performance in practice.
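For a tabular categorical policy, the weighted maximum likelihood step even has a closed form: weight each dataset action by exp(A/λ) and renormalize. The sketch below uses made-up advantage values purely for illustration; real implementations fit a neural network policy by gradient ascent on the weighted log-likelihood.

```python
import numpy as np

lam = 1.0                                            # temperature of the implicit constraint
actions = np.array([0, 0, 1, 1, 1])                  # actions observed in the dataset
advantages = np.array([-1.0, -1.0, 2.0, 2.0, 2.0])   # hypothetical advantage estimates A(s, a)

weights = np.exp(advantages / lam)                   # exponentiated-advantage weights
probs = np.bincount(actions, weights=weights, minlength=2)
probs = probs / probs.sum()                          # closed-form weighted MLE for a categorical policy
# the high-advantage action (1) receives almost all of the probability mass,
# while the policy stays supported on actions that actually appear in the data
```

Note that nothing here required modeling the behavior policy itself, which is exactly the simplification the implicit constraint buys.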
Conservative Q-functions: A very different approach to offline RL, which we explore in our recent conservative Q-learning (CQL) paper, is to not constrain the policy at all, but instead regularize the Q-function to assign lower values to out-of-distribution actions. This prevents the policy from taking these actions, and results in a much simpler algorithm that in practice attains state-of-the-art performance across a wide range of offline RL benchmark problems. This approach also leads to appealing theoretical guarantees, allowing us to show that conservative Q-functions are guaranteed to lower bound the true Q-function with an appropriate choice of regularizer, providing a degree of confidence in the output of the method.
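A rough sketch of the idea follows, in a hypothetical single-state tabular setup (the actual CQL paper works with function approximation and several regularizer variants): alongside the Bellman error, the regularizer pushes down a softmax over all Q-values while pushing up the Q-values of dataset actions, so never-observed actions end up with low values.

```python
import numpy as np

GAMMA, ALPHA_CQL, LR = 0.9, 1.0, 0.1
dataset = [(0, 0, 0.2, 0), (0, 1, 0.5, 0)] * 50      # (s, a, r, s_next); action 2 never appears

def cql_step(Q, dataset):
    grad = np.zeros_like(Q)
    # standard Bellman error term on dataset transitions
    for s, a, r, s_next in dataset:
        target = r + GAMMA * Q.max()
        grad[a] += 2.0 * (Q[a] - target) / len(dataset)
    # conservative term: gradient of logsumexp(Q) pushes down all actions softly...
    soft = np.exp(Q - Q.max())
    soft = soft / soft.sum()
    grad += ALPHA_CQL * soft
    # ...while dataset actions are pushed back up
    for _, a, _, _ in dataset:
        grad[a] -= ALPHA_CQL / len(dataset)
    return Q - LR * grad

Q = np.zeros(3)
for _ in range(500):
    Q = cql_step(Q, dataset)
# the out-of-distribution action 2 ends up with the lowest (negative) value,
# so greedy policy extraction never selects it
```

Because the penalty acts on the Q-function rather than the policy, no behavior policy estimate or explicit policy constraint is needed.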
Despite these advances, I firmly believe that the most effective and elegant offline RL algorithms have yet to be invented, which is why I consider this research area to be so promising both for its practical applications today and for its potential as a topic of research in the future.
What about artificial intelligence?
Aside from its practical value, much of the appeal of reinforcement learning also stems from the widely held belief that reinforcement learning algorithms hold at least part of the answer to the development of intelligent machines — AI systems that emulate or reproduce some or all of the capabilities of the human mind. While a complete solution to this puzzle may be far in the future, I would like to briefly address the relevance of offline RL to this (perhaps distant) vision.
In its classical definition, the active learning framework of RL reflects a very reasonable model of an adaptive natural learning system: an animal observes a stimulus, adjusts its model, and improves its response to that stimulus to attain larger rewards in the future. Indeed, reinforcement learning originated in the study of natural intelligence, and only made its way into artificial intelligence later. It may therefore seem like a step in the wrong direction to remove the “active” part of this learning framework from consideration.
However, I would put forward an alternative argument: in the first few years of your life, your brain processed a broad array of sights, sounds, smells, and motor commands that rival the size and diversity of the largest datasets used in machine learning. While learning online from streaming data is definitely one facet of the AI problem, processing large and diverse experiences seems to be an equally critical facet. Current supervised learning methods operate far more effectively in “batch” mode, making multiple passes over a large dataset, than they do in “online” mode with streaming data. Cracking the puzzle of online continual learning may one day change that, but until then, we can make a great deal of progress with such batch-mode methods. It then stands to reason that a similar logic should be applied to RL: while understanding continual online learning is important, equally important is understanding large-scale learning and generalization. These facets of the problem will likely be far more practical to tackle in the offline setting, and then extend into the online and continual setting once our understanding of online and continual algorithms catches up to our understanding of large-scale learning and generalization. Utilizing large amounts of data effectively for decision making will need to be a part of any generalizable AI solution, and right now, offline RL offers us the most direct path to study how to do that.
Concluding remarks
Offline reinforcement learning algorithms hold the promise of turning data into powerful decision-making strategies, enabling end-to-end learning of policies directly from large and diverse datasets and bringing large datasets and large models to bear on real-world decision-making and control problems. However, the full promise of offline RL has not yet been realized, and major technical hurdles remain. Fundamentally, offline RL algorithms must be able to reason about counterfactuals: what will happen if we take a different action? Will the outcome be better, or worse? Such questions are known to be exceptionally difficult for statistical machine learning systems, and while recent innovations in offline RL based around distributional constraints and conservative targets can provide a partial solution, at its core this problem touches on deep questions in the study of causality, distributional robustness, and invariance, and connects at a fundamental level with some of the most challenging problems facing modern machine learning. While this will present major challenges, it also makes this topic particularly exciting to study.
For readers interested in learning more about this topic, I would recommend a tutorial article that I’ve co-authored with colleagues on this subject, as well as the “Datasets for Data-Driven Deep Reinforcement Learning” benchmark suite, which includes code and implementations for many of the latest algorithms. Aviral Kumar and I will also be giving a tutorial on offline reinforcement learning at NeurIPS 2020. Hope to see you there!
I want to acknowledge helpful feedback from Chelsea Finn and Aviral Kumar on an earlier draft of this article.