
1. 【critic】US [ˈkrɪtɪk]
n. critic; reviewer; commentator; fault-finder
【criticize】US [ˈkrɪtɪˌsaɪz]
vt. & vi. to criticize; to find fault; to analyze, to evaluate
e.g. It criticizes government-funded research.

2. 【effect】US [ɪˈfɛkt]
n. effect; influence; impression; belongings
【effective】US [ɪˈfɛktɪv]
e.g. It introduces an effective means of publication.

3. 【upset】US [ʌpˈsɛt]
n. distress; overturning, upheaval; disorder
【profit-making】US [ˈprɑfɪt ˈmekɪŋ]
operating for profit, money-making
e.g. It upsets profit-making journal publishers.

4. 【considerably】US [kənˈsɪdərəblɪ]
adv. considerably, very, quite
【benefit】US [ˈbɛnəfɪt]
n. benefit, advantage; welfare payment, allowance; charity performance, charity sale; favor, kindness
e.g. It benefits scientific research considerably.

5. 【significant】US [sɪɡˈnɪfɪkənt]
e.g. According to the text, online publication is significant in that...

6. 【emphasize】US [ˈɛmfəˌsaɪz]
vt. to emphasize, to stress; to add emphasis; to make prominent
【crucial role】
key role
e.g. It emphasizes the crucial role of scientific knowledge.

7. 【facilitate】US [fəˈsɪlɪˌtet]
vt. to help; to promote, to further; to make easier
e.g. It facilitates public investment in scientific research.

8. 【summary】US [ˈsʌməri]
n. abstract, outline; summary, overview
e.g. Which of the following best summarizes the main idea of the passage?

9. 【render】US [ˈrɛndɚ]
v. to give; to cause to become; to submit; to express
e.g. Publication is rendered easier by online service.



Common Interview Questions on Digital Integrated Circuits
Common misconceptions on the radiation hardness of integrated circuits

Space exploration has always been fascinating, and recent developments have reignited interest to heights not seen since the last man stood on the Moon. People argue about Mars exploration and the features of spaceships as their grandparents would have done if the internet had existed fifty years ago. I’m an electronics engineer working in the aerospace industry, so I know a thing or two about the technical background of this stuff, and I see that these things aren’t common knowledge and that people often have significantly skewed ideas about the reasons behind some devices and decisions. Specifically, I’d love to talk about some misconceptions related to radiation hardened integrated circuits and the means of protecting them against radiation-induced damage. But, I warn you, this text will be relatively long.
Why do I write this?
The most popular theses about radiation hardness of ICs are the following:
1. Radiation hardened chips are not needed at all. CubeSats are just fine with chips from the nearest store, very ordinary Lenovo laptops work on the ISS without any problems, and even the NASA-commissioned Orion onboard computer is based on a commercial microprocessor!
2. Satellites don’t need computational power, but they need these magical radiation hardened chips, so most of them use very old but extremely robust designs from the eighties, like TTL quad NAND gates.
3. A thesis that complements the previous one: it is impossible to achieve radiation hardness on modern process nodes. Ionizing particles just tear small transistors apart. So, the use of these TTL NAND gates is not just justified, it’s the only way to go.
4. It’s necessary and sufficient to use silicon on insulator (SOI) or silicon on sapphire (SOS) technology to achieve radiation hardness.
5. All military-grade chips are radiation hardened and all radiation hardened chips are military-grade. If you have a military-grade IC, you can safely launch it into outer space.
As one can see, these theses directly contradict each other — which makes arguing on the internet even funnier, especially if you take into account that not a single one of them is true.
Let’s start with an important disclaimer: radiation hardness is not the Holy Grail of integrated circuit design for space and other similar environments. It’s just a bunch of checkboxes in a long requirements list, which typically includes reliability, longevity, wide temperature range, tolerance to electrostatic discharge, vibrations and much more. Everything that can compromise reliable functioning over the entire lifetime is important, and most applications requiring radiation tolerance also assume that repair or replacement is impossible. On the other hand, if something is wrong with one of the parameters, system-level designers of the final product can often find a workaround (tighten temperature requirements, use cold spares or additional protection circuitry), whatever is suitable. The same approach can work for radiation effects: majority voting, supply current control and reset are very common and often effective measures. But often a brand new radiation hardened IC is the only good way to meet mission requirements.
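To make the idea of majority voting concrete, here is a minimal Python sketch of a bitwise two-out-of-three voter over three redundant copies of a word. The function name and the sample values are illustrative assumptions, not something taken from a real flight system.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Three copies of the same 8-bit value; one copy suffers a single bit flip.
original = 0b1011_0010
copy_a = original
copy_b = original ^ 0b0000_1000  # hypothetical upset in bit 3
copy_c = original

voted = majority_vote(copy_a, copy_b, copy_c)
print(f"stored: {original:#010b}, voted: {voted:#010b}")
```

The voter masks any error confined to a single copy, but it cannot help if the same bit is corrupted in two copies before the damaged copy is rewritten.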
It is also useful to remember that the developers of special-purpose systems are people just like any other developers. Like everyone else, they normally write code full of kludges to meet yesterday’s deadline and want more powerful hardware to mask their sloppy work; some would have used Arduino if it were properly certified. It’s also obvious that the people who write requirements are rarely concerned with any limitations and want the same things as in commercial systems, only more reliable and radhard. Therefore, modern processes are more than welcome in radhard electronics: system designers would love to have large amounts of DRAM, multi-core processors, and the most advanced FPGAs. I have already mentioned that there can be workarounds for mediocre radiation tolerance, so the use of commercial chips is limited more by the lack of data on what the problems are than by the problems themselves or by the commercial status of the chips.
The very concepts of “radiation hardness” and “radiation hardened IC” are enormous simplifications. There are many different sources of ionizing and non-ionizing radiation, and they affect the functioning of microelectronic devices in multiple ways. Different applications require tolerance to different sets of conditions and different levels of exposure, so a “radiation hardened” circuit designed for low Earth orbit is by no means guaranteed to work in a robot clearing debris in Chernobyl or Fukushima.
Ionizing radiation is so called because an incoming particle decelerating in a substance releases energy and ionizes the substance. Each material has its own energy required for ionization and the creation of an electron-hole pair: 3.6 eV for silicon, 17 eV for its oxide, 4.8 eV for gallium arsenide. The released energy can also “shift” an atom out of its correct place in the crystal lattice (21 eV must be transferred to shift a silicon atom). Electron-hole pairs created in a substance can produce different effects in an integrated circuit. Therefore, radiation effects can be divided into four large groups: total ionizing dose (TID) effects, dose rate effects, single event effects (SEE), and the non-ionizing effects called displacement damage. This separation is somewhat arbitrary: for example, irradiation with a stream of heavy ions causes both single event effects and the accumulation of total ionizing dose.
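As a rough illustration of what these pair-creation energies mean, the following Python sketch converts a deposited energy into a number of electron-hole pairs and the corresponding charge. The pair energies are the ones quoted above; the 1 MeV deposit and the function names are assumptions chosen for the example.

```python
ELECTRON_CHARGE_C = 1.602e-19  # elementary charge in coulombs

# Mean energy per electron-hole pair in eV, as quoted in the text.
PAIR_ENERGY_EV = {"Si": 3.6, "SiO2": 17.0, "GaAs": 4.8}


def pairs_created(deposited_ev: float, material: str) -> float:
    """Number of electron-hole pairs produced by the deposited energy."""
    return deposited_ev / PAIR_ENERGY_EV[material]


def charge_fc(deposited_ev: float, material: str) -> float:
    """Charge of one carrier type, in femtocoulombs."""
    return pairs_created(deposited_ev, material) * ELECTRON_CHARGE_C * 1e15


deposit_ev = 1e6  # hypothetical 1 MeV deposited by one particle
for material in PAIR_ENERGY_EV:
    print(f"{material}: {pairs_created(deposit_ev, material):,.0f} pairs, "
          f"{charge_fc(deposit_ev, material):.1f} fC")
```

Under these assumptions, 1 MeV deposited in silicon yields roughly 280 thousand pairs, or about 44.5 fC.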
Total ionizing dose
The TID effects are mostly associated with the accumulation of positive charge in dielectrics. They manifest themselves in CMOS circuits in several main ways:
1. Threshold voltage shift. For n-channel transistors, the threshold is usually reduced (but the dependence may be non-monotonic, especially at high doses), while for p-channel transistors it increases. The shift magnitude correlates with gate oxide thickness and decreases as process nodes shrink. In older technologies, the n-MOSFET threshold shift can cause functional failure when n-channel transistors stop closing and p-channel ones stop opening. This effect is less important in submicron technologies, but it can still give analogue designers a lot of headaches.
2. Leakage currents flow through parasitic channels opened by excess charge in the isolation oxides, either from source to drain of the same device or from one transistor to another. In the first case, a parasitic transistor controlled by the total dose is formed in parallel with the main one. The severity of this effect is highly technology-dependent, as the exact shape of the isolation oxide matters. Therefore, there is no direct correlation with process nodes, and there is no good way to guess which commercial device will have better or worse TID hardness.
3. Charge carrier mobility decreases due to scattering on accumulated defects. The influence of this factor on submicron digital circuits on silicon is small, but it is far more important for power transistors (including GaN HEMTs).
4. 1/f noise increase caused by parasitic edge transistors. It is important for analogue and radio frequency circuits and becomes more important at smaller process nodes, where the influence of other TID effects gradually decreases.
A quick word on bipolar transistors: the main TID effect there is a decrease in gain due to a leakage-related increase in base current. Another bipolar-specific effect is their (not universal) strong sensitivity to dose accumulated at a low rate, the so-called ELDRS (Enhanced Low Dose Rate Sensitivity). This effect complicates testing and makes it more expensive. And the worst part is that many CMOS circuits contain a few bipolar transistors (namely in voltage reference circuits) and can therefore also be susceptible.
Dose rate effects
Another effect related to dose rate occurs when the dose accumulates so fast, and so many electron-hole pairs are generated, that a huge excess charge floods every node in the chip, causing a temporary loss of functionality and sometimes a latchup of the parasitic thyristor between supply and ground. The non-functioning time is the usual measure of sensitivity to this kind of effect, and it normally appears in military standards like Mil-Std-883.
Dose rate effects are the reason “silicon on sapphire” (SOS) and “silicon on insulator” (SOI) technologies were created and adopted: the best way to reduce the amount of charge injected into active devices by a flow of ionizing particles is to cut their electrical connection to the enormous substrate (and to each other). Why are these effects important? An extremely high dose rate for a short time is a typical consequence of a nuclear explosion, and military people all around the world care deeply about this matter. Luckily for us, SOI proved to be advantageous in many other applications and therefore became widespread in everyday life.
Single event effects
Single event effects (SEE) are associated with a measurable effect from the strike of a single ionizing particle. They can be divided into two large groups:
1. Non-destructive events include bit flips or upsets (SEU) in a variety of storage elements (cache memory cells, register files, FPGA configuration memory, etc.) and transient voltage spikes (SET) in combinational logic and in analogue circuits. The main feature of these effects is that they do not lead to the physical destruction of the chip and can be corrected by software or hardware. Moreover, single event transients correct themselves after some short time. Memory upsets are the best known of these effects, as they make up the lion’s share of failures due to the enormous amount of memory in modern digital ICs.
2. Destructive events include single event latchup (SEL) and a variety of fortunately rarer catastrophic failures, like transistor burnout or gate rupture. Their distinctive feature is that they are, well, destructive and irreversibly damage the chip if they occur. Latchup is a special case, as a very fast power-off can often (but not always!) save the chip. Circuits for supply current monitoring and power cycling are fairly popular as a latchup protection measure. Other destructive effects are uncommon in CMOS circuitry, but are a serious threat for some types of flash memory and for high voltage devices, including power switches.
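The supply current monitoring and power cycling mentioned for latchup protection can be sketched as a tiny state machine. This is a simplified Python model under stated assumptions: the current limit, the off time and the sample trace are hypothetical, and a real protection circuit is implemented in hardware and reacts far faster than any software loop.

```python
from typing import Iterable, List


def latchup_guard(current_ma: Iterable[float], limit_ma: float,
                  off_samples: int) -> List[str]:
    """Return the power state ('on' or 'off') chosen for each current sample.

    When a sample exceeds limit_ma, power is cut for off_samples samples
    (including the offending one) and then restored.
    """
    states = []
    off_left = 0
    for sample in current_ma:
        if off_left > 0:
            states.append("off")
            off_left -= 1
        elif sample > limit_ma:
            states.append("off")
            off_left = off_samples - 1
        else:
            states.append("on")
    return states


# Hypothetical trace: nominal ~50 mA, then a latchup-like jump to 400 mA.
trace = [50, 52, 400, 0, 0, 51]
print(latchup_guard(trace, limit_ma=150.0, off_samples=3))
```

The threshold has to sit above every legitimate current peak of the protected chip, which is one reason why such protection works often but not always.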

Figure 1. Experimental data on single event effects rate. Taken from J. Barth et al., "Single event effects on commercial SRAMs and power MOSFETs: final results of the CRUX flight experiment on APEX", NSREC Radiation Effects Data Workshop, 1998
Looking at Figure 1, one can see that the worst case is about one upset per two hundred days... per bit. Yes, every bit in memory will be affected roughly twice per year. And when we have megabits or gigabytes of memory, some of it is always compromised, right? Yes, that’s a problem, and there are techniques to address it, but more on that a bit later.
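To put that worst-case figure into perspective, here is a small Python calculation that scales one upset per two hundred days per bit to whole memories. The memory sizes are arbitrary examples, and the rate is the worst case read from Figure 1, not a typical value.

```python
UPSETS_PER_BIT_PER_DAY = 1 / 200  # worst case read from Figure 1
SECONDS_PER_DAY = 86_400


def expected_upsets_per_day(bits: int) -> float:
    """Expected number of upsets per day in a memory of the given size."""
    return bits * UPSETS_PER_BIT_PER_DAY


def mean_seconds_between_upsets(bits: int) -> float:
    """Average time between two upsets anywhere in the memory."""
    return SECONDS_PER_DAY / expected_upsets_per_day(bits)


print(f"per bit: {365 * UPSETS_PER_BIT_PER_DAY:.2f} upsets per year")
for label, bits in [("1 Mbit", 2**20), ("1 GB", 8 * 2**30)]:
    print(f"{label}: {expected_upsets_per_day(bits):,.0f} upsets/day, "
          f"one every {mean_seconds_between_upsets(bits):.4f} s")
```

At this rate a single megabit of unprotected memory would see an upset several times per minute, which is why large memories are never flown without some form of error correction.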
The specific energy yield of an ionizing particle strike is called “linear energy transfer” (LET) and is measured in MeV*cm^2/mg. LET depends non-linearly and non-monotonically on particle energy and is also related to the path length, which can vary from hundreds of nanometers to hundreds of millimeters for relevant particles and materials. Basically, most ionizing particles just punch through an IC and fly back to outer space. Low energy particles are much more common in a real space environment (see Figure 2). Important LET values are 30 (corresponding to iron ions) and 60 or 80 (normally considered the highest LET values to be taken into account). Another important figure is 15 MeV*cm^2/mg, the maximum LET of the products of a nuclear reaction between a silicon atom and a proton or a neutron. Protons are important as they make up a significant part of solar radiation. Whilst they have very low LET on their own, the probability of the above-mentioned nuclear reaction is high enough to create a lot of events, especially in the van Allen belts or during solar flares. Protons can also interact with nuclei of heavier elements, like tungsten (used in contacts) or tantalum (a popular anti-TID shielding material). Such secondary effects are the second most important reason not to pack your space-bound chips into lead covers in an attempt to increase their radiation hardness. The first one is, by the way, the launch price per kilo.
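A LET value becomes easier to interpret once it is converted into charge deposited per micron of track in silicon. The sketch below does this conversion in Python; the silicon density of 2.33 g/cm^3 is a value brought in from outside the text, and the function name is an assumption made for the example.

```python
SI_DENSITY_MG_PER_CM3 = 2330.0  # silicon density, 2.33 g/cm^3 (assumed value)
PAIR_ENERGY_SI_EV = 3.6         # energy per electron-hole pair in silicon
ELECTRON_CHARGE_C = 1.602e-19   # elementary charge in coulombs


def charge_per_micron_fc(let_mev_cm2_per_mg: float) -> float:
    """Charge deposited per micron of track in silicon, in femtocoulombs."""
    mev_per_cm = let_mev_cm2_per_mg * SI_DENSITY_MG_PER_CM3
    ev_per_um = mev_per_cm * 1e6 / 1e4  # MeV -> eV, cm -> um
    pairs_per_um = ev_per_um / PAIR_ENERGY_SI_EV
    return pairs_per_um * ELECTRON_CHARGE_C * 1e15


for let in (1, 15, 30, 60):
    print(f"LET {let:>2} MeV*cm^2/mg -> {charge_per_micron_fc(let):6.1f} fC/um")
```

Under these assumptions, a LET of 1 MeV*cm^2/mg corresponds to roughly 10 fC per micron of track, so the values 30 and 60 mean hundreds of femtocoulombs per micron.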
It’s also worth mentioning helium nuclei (alpha particles) as a source of single event effects, not because solar radiation contains some, but because plenty of alpha sources can be found in ordinary life, like lead solder and some IC packaging materials. If you have heard about low-alpha bumps and underfills, they are about single event mitigation in “mundane” applications not related to aerospace.

Figure 2. Number of different particles observed during a two-year mission in space. Quoted from: Xapsos et al., "Model for Cumulative Solar Heavy Ion Energy and Linear Energy Transfer Spectra", IEEE Transactions on Nuclear Science, Vol. 54, No. 6, 2007
1, 30 or 60 MeV*cm^2/mg: how much is that? The upset threshold of a standard SRAM memory cell in a 7 nm technology is much lower than one, while for 180 nm it can vary from one to ten. A special circuit design makes it possible to raise the threshold up to a hundred, but it is usually wiser to achieve 15 or 30 and handle the rest via error-correcting codes. 60 MeV*cm^2/mg is most often found in requirements for destructive events, to ensure that the chip will very likely survive its full intended lifespan.
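As an illustration of how an error-correcting code cleans up the remaining upsets, here is a textbook Hamming(7,4) code in Python, which corrects any single flipped bit in a 7-bit codeword. It is a minimal sketch of the general technique, not the specific code used in any particular radhard memory, which would typically protect much wider words.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]  # positions 1..7
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(codeword: int) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(codeword >> i) & 1 for i in range(7)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:                  # non-zero syndrome names the flipped position
        bits[syndrome - 1] ^= 1
    data = [bits[2], bits[4], bits[5], bits[6]]
    return sum(bit << i for i, bit in enumerate(data))


word = 0b1011
stored = hamming74_encode(word)
upset = stored ^ (1 << 5)  # hypothetical single event upset in one cell
print(hamming74_decode(upset) == word)
```

A code like this only helps while upsets stay isolated: two flipped bits in the same codeword are miscorrected, which is why protected memories are also scrubbed periodically.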
Displacement damage
Displacement effects are the local destruction of the crystal lattice caused by an atom being “knocked out” of its intended place. The energy required for this is usually quite high, so most irradiating particles do not cause this effect. However, secondary irradiation can, and there are plenty of protons in space. These local lattice defects decrease charge carrier mobility, increase noise and do some other damage. Due to their very local nature, they normally do not significantly affect conventional CMOS chips, but they dominate in solar cells, photodetectors, power transistors and other devices based on compound semiconductors, such as gallium arsenide and gallium nitride. Transistors in compound semiconductors are usually not MOS but JFET or HEMT, so they lack a gate oxide. This explains their high total dose tolerance: they simply do not suffer from the effects causing the rapid degradation of CMOS chips. However, displacement effects are much more significant in these new materials, so they should be considered and weighed appropriately.
Now that we’re finished with the description of the effects, let’s look at where and how they threaten integrated circuits.

Figure 3. Total ionizing dose calculations, for ten years of satellite lifetime, under the shielding of 1 g/cm^2. Adapted from N. Kuznetsov, "Radiation danger on space orbits and interplanetary trajectories of satellites" (in Russian).
Figure 3 shows an example of total ionizing dose calculation for different orbits. There are multiple assumptions there, including solar activity and the shape, material and thickness of the shielding, but you get the idea: the dose rate can vary by five orders of magnitude between orbits. In low orbits under the first van Allen belt, the dose is absorbed so slowly that many off-the-shelf commercial chips can withstand several years in these conditions, like the laptops on the ISS do. Even the much more fragile humans can fly there for years without dramatic health consequences. Low orbits are extremely important as they encompass all crewed spaceflight, Earth remote sensing, many present communication satellites and future internet-from-above constellations. Last but not least, almost all CubeSats are launched into low orbits.
Low orbit
Actually, the importance of low orbits is the root of multiple speculations that expensive radiation hardened chips are not needed at all, and that COTS could do everything if it were not rejected by an overly conservative industry. Yes, COTS can do a decent job, but there are some pitfalls, even in low orbits.
The van Allen belts protect the Earth only from light particles, mainly solar electrons and protons. Heavier particles, even though they are much rarer, easily reach even our last shield, the atmosphere, and cause single event effects, including the catastrophic latchup capable of irreversibly destroying any chip at any moment. Therefore, commercial chips can be used only if they are somehow protected from latchup; otherwise the entire spacecraft can be lost.
Another problem is that the chips used in space are not just processors and memory, but also many other types, including power and analogue ones. The radiation tolerance of non-logic circuits is much more complex, less investigated and less predictable. Moreover, modern SoCs contain a lot of non-digital blocks like PLLs, ADCs and I/O circuits. For example, the most common cause of total dose failure in flash memory is the high-voltage generator used for memory writing. Analogue circuits suffer from increased offset, small leakages can significantly affect the functioning of low-power analogue circuits, power transistors experience breakdown voltage degradation, and so on.
It’s also important to remember that radiation sensitivity is, well, sensitive to process variations, sometimes even small ones. So, if the fab changes the temperature at which some oxide is grown, you can throw your radiation testing results into the trash can. Commercial vendors never guarantee that different batches of the same product will contain the same die, or that the manufacturing process will stay stable for a long time. The Apple A9 processor of the iPhone 6s was produced at both TSMC (16 nm) and Samsung (14 nm) fabs, and the user has no way of knowing which version is inside a specific phone. Such an approach is unfortunately unacceptable for high-reliability circuits, and that’s why radhard chips are often manufactured at some kind of Trusted Foundry, or at least on automotive-grade processes, as the car industry also cares about reliability and needs stable technology.
Other orbits
However, satellites don’t fly only in low orbits. I will take the “Molniya” orbit as an example of very different requirements. This orbit is named after the Soviet satellite which was there first. “There” is a polar orbit with a minimum altitude of around 500 km and a maximum of around 40 000 km. The orbital period is twelve or twenty-four hours, and the satellite spends most of the time near apogee, acting as a pseudo-static object and providing communications for polar regions where geostationary satellites can’t be seen.

Figure 4. Molniya orbit with hours marked. Taken from Wikipedia.
The lifespan of the very first Molniya satellites was very short, just a handful of months, primarily due to the degradation of the solar panels powering the radio transmitters. Why was the degradation so high? A perigee of 500 km and an apogee of 40 000 km mean that the satellite crosses the van Allen belts twice each period, or four times per day. The van Allen belts gather and concentrate solar electrons and protons, so the environment there is among the worst one can have.
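The “four times per day” figure can be checked from the orbit geometry with Kepler’s third law. The Python sketch below assumes a mean Earth radius of 6371 km and the standard gravitational parameter of the Earth; both constants and the function name are additions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0        # mean Earth radius (assumed value)
EARTH_MU_KM3_S2 = 398_600.4418  # Earth's gravitational parameter


def orbital_period_hours(perigee_alt_km: float, apogee_alt_km: float) -> float:
    """Period of an elliptical Earth orbit from its perigee and apogee altitudes."""
    semi_major_axis_km = EARTH_RADIUS_KM + (perigee_alt_km + apogee_alt_km) / 2
    period_s = 2 * math.pi * math.sqrt(semi_major_axis_km**3 / EARTH_MU_KM3_S2)
    return period_s / 3600


period_h = orbital_period_hours(500, 40_000)
orbits_per_day = 24 / period_h
belt_crossings_per_day = 2 * orbits_per_day  # one pass going up, one coming down
print(f"period: {period_h:.2f} h, belt crossings per day: {belt_crossings_per_day:.1f}")
```

With these inputs the period comes out very close to twelve hours, which matches the two orbits and four belt crossings per day described above.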
Figure 3 promises a total dose of some tens of kilorads over a ten-year lifespan in high orbits, and some hundreds of kilorads if the satellite crosses the van Allen belts. That’s more than many commercial chips can withstand, so one will need significantly thicker, heavier and more expensive shielding to use them. It may still be cheaper than buying radiation hardened ICs, but here we descend into the world of satellite design, which is out of this article’s scope. Let’s just say that shielding is heavy and therefore expensive to launch, while it doesn’t solve all problems and can even make some of them worse.
The ultimate answer to the question of whether COTS chips can be used in space is “Yes, but”. There are many opportunities, but also many constraints. Also, if you want to use a COTS chip in your space-related project and invest in its radiation testing, stockpile your ten-year need. By the way, it’s a credible business model: the well-known and very respected company 3DPlus tests a lot of COTS chips, chooses the ones that happen to be better than others, and then packs them into its own hybrid modules, which are found everywhere in space, including the Curiosity rover on Mars.
It’s impossible to avoid the topic of “military-grade” chips while dealing with preconceptions about radiation hardness. They are believed to be radiation hardened, but the real situation is a bit more complicated. Not all military-grade chips are radhard and not all radhard chips are military-grade. If we look into the US military standard Mil-Std-883, we will find a lot of different environmental tests there: for thermal cycling, humidity, air with sea salt, etc.
The following sections deal with radiation:
1017.2 Neutron irradiation
1019.8 Ionizing radiation (total dose) test procedure
1020.1 Dose rate induced latchup test procedure
1021.3 Dose rate upset testing of microcircuits
1023.3 Dose rate response of linear microcircuits
Total dose? Check! Dose rate? Check. Single events? Sorry, nothing to find here. Many specifications for military-grade radhard chips include requirements for single event effects, but they are not part of the military standard. So, “military-grade” status does not guarantee that the chip will be able to work in space, or at the Large Hadron Collider. The best-known example of this misconception in action was the infamous Russian probe called “Phobos-Grunt”. It was launched towards Mars in 2011, but never left Earth orbit. The official investigation concluded that the fatal failure occurred in an American military-grade SRAM chip, which some poor engineer judged to be suitable for space travel while it wasn’t in fact protected from single event latchup.

Recent SEE testing of 1M and 4M monolithic SRAMs at Brookhaven National Laboratories has shown an extreme sensitivity to single-event latchup (SEL). We have observed SEL at the minimum heavy-ion LET available at Brookhaven, 0.375 MeV-cm2/mg

says the report on that very chip. The report was published in 2005, but wasn’t taken into account by “Phobos-Grunt” designers, who just supposed that “military-grade” is enough to fly to the Red planet.
Mundane applications
The importance of radiation hardness is not limited to space and military applications. The atmosphere works as the final shield between space radiation and life on Earth, but it also creates secondary particles, which are plentiful at airliner altitudes (a typical transatlantic flight can see a dozen single event upsets in the onboard computer). Some secondary particles even reach the ground and are seen in the devices with the largest memories, like supercomputers. X-ray radiation is routinely used in medicine, and radiotherapy is an important way to combat malignant tumours. Medical devices need more and more electronics, and these areas are no exception.
And, of course, we should not forget that all the fuss about lead-free solder was partially caused by the fact that lead and some other materials used in IC fabrication contain impurities of heavy elements like uranium. These materials generate a small but still easily measurable flux of alpha particles right next to the vulnerable silicon. In the case of BGA packages or 3D assemblies, that means over the entire surface of the vulnerable silicon.
Luckily, alpha particles have a rather short ionization track (a few tens of microns at most, depending on energy), and multi-layer metallization helps to reduce their influence. The bad news is that at small process nodes the energy required to flip a bit is so low that every alpha particle that reaches the silicon surface can cause an upset. For example, TSMC presented a paper at the 2018 IEEE International Reliability Physics Symposium measuring the number of alpha-related upsets in 7 nm SRAM. So the problem still exists in a largely lead-free world.

Figure 5. Solder bumps as the alpha radiation source. Image courtesy of Mitsubishi Materials.
I also want to say a few words about yet another application of radiation hardened chips: high energy physics and the nuclear industry. Hadron colliders and nuclear power plants require extremely robust electronics capable of working in high-radiation zones for many years. The same applies to robots designed to deal with nuclear catastrophes like Chernobyl or Fukushima. TID requirements in these circumstances can reach dozens or even hundreds of Mrad(Si), which is three orders of magnitude more than in conventional space applications. The problem is further complicated by the fact that such durability is required not just from digital ICs, but also from the power management and analog chips found in multichannel telemetry systems and servo motor drives. These chips can be much more vulnerable to transistor degradation than digital ICs. TID behaviour and hardening of digital circuits are well investigated and well understood, but analog circuits are much more interesting, as every case and every circuit may require an individual approach rather than a semi-automated application of known methods. In analog design the schematic itself is often closely guarded know-how, and this is even more true for radhard analog.
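To put the dose figures side by side, here is a minimal sketch of the unit arithmetic. The only hard fact in it is the definition 1 Gy = 100 rad; the two requirement levels are illustrative assumptions (100 krad(Si) for a conventional space mission, 100 Mrad(Si) for a nuclear application), not values from any particular specification.

```python
# Dose unit helpers: 1 gray (Gy) equals 100 rad by definition.
RAD_PER_GRAY = 100.0


def gray_to_rad(dose_gy: float) -> float:
    """Convert an absorbed dose from gray to rad."""
    return dose_gy * RAD_PER_GRAY


def rad_to_gray(dose_rad: float) -> float:
    """Convert an absorbed dose from rad to gray."""
    return dose_rad / RAD_PER_GRAY


# Assumed, illustrative requirement levels (hypothetical, not from a real spec).
SPACE_TID_RAD = 100e3     # 100 krad(Si)
NUCLEAR_TID_RAD = 100e6   # 100 Mrad(Si)

ratio = NUCLEAR_TID_RAD / SPACE_TID_RAD

print(f"4.5 MGy = {gray_to_rad(4.5e6) / 1e6:.0f} Mrad")
print(f"nuclear / space TID ratio = {ratio:.0f}x")
```

With these assumed levels the ratio comes out at three orders of magnitude, and a dose of 4.5 MGy corresponds to 450 Mrad.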

Figure 6. Normal and radiation hardened bandgap voltage reference. Taken from Y. Cao et al., «A 4.5 MGy TID-Tolerant CMOS Bandgap Reference Circuit Using a Dynamic Base Leakage Compensation Technique», IEEE Transactions on Nuclear Science, Vol.60, No.4, 2013
Let's look at a good (and rare) example of such a task. A bandgap voltage reference is a simple and well-known circuit that can be found in almost any analog IC. This circuit normally contains a pair of bipolar transistors controlled by an operational amplifier. These bipolars show significant leakage under irradiation, and this leakage leads to significant output voltage changes, sometimes 10-20% under high doses, which corresponds to an effective ADC resolution of two to three bits. The circuit on the right shows reference voltage variation within 1% (which gives us roughly 7 bits) under a total dose of 4.5 MGy. As you may see, it wasn’t easy to achieve this outstanding result: local feedback loops are scattered everywhere, subtracting the base current from the equation and therefore getting rid of the leakage current too. This radhard version contains four times as many transistors and consumes twice the power of its conventional counterpart. The worst part is that, as I’ve said, every circuit normally requires an individual approach, making the analog radhard designer's work very challenging. And there is also the problem of single event effects, whose solutions are likewise poorly formalized and very circuit-dependent.
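The link between reference drift and usable ADC resolution can be sketched with a rule of thumb: if the reference moves by a fraction e of full scale, only about log2(1/e) bits stay meaningful. This is a simplifying assumption of mine (it ignores noise, calibration and every other error source), and the function name is hypothetical.

```python
import math


def effective_bits(relative_drift: float) -> float:
    """Rough number of ADC bits that remain meaningful when the voltage
    reference drifts by `relative_drift` (a fraction of full scale).

    Rule of thumb only: bits = log2(1 / drift).
    """
    if not 0.0 < relative_drift < 1.0:
        raise ValueError("relative_drift must be between 0 and 1")
    return math.log2(1.0 / relative_drift)


for drift in (0.20, 0.10, 0.01):
    print(f"{drift:.0%} drift -> about {effective_bits(drift):.1f} bits")
```

Under this rule a 10-20% drift leaves between two and three-and-a-bit bits, while a 1% drift leaves a little under seven.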
Radiation hardness and process node
For a long time, the website of one established microelectronics fab with old links to the aerospace industry contained a statement that radiation tolerance could not be achieved at process nodes below 600 nm, as “charged particles pierce silicon and destroy transistors”. Surprisingly, but likely unrelated, 600 nm was the minimal available process node at that fab. A high-ranking official of another fab said in an interview that it’s “technologically impossible” to create radiation hardened ICs at nodes below 90 nm. You may guess what the minimal node at that fab was. I was quite surprised to read that interview, as I was working on a radhard 65 nm chip at that moment. I can understand some marketing nonsense, but such words are dangerous in the long term, especially when addressed to a wide audience or to decision-makers.
I also regularly see the reasoning that ICs built on coarse process nodes are SEL-immune because of the very high energy required to influence their transistors, so the long-term use of proven technology is not just justified, but simply necessary. Or, conversely, that sub-something process nodes work with very low supply voltages, too low to exhibit SEL because the parasitic thyristor simply can’t turn on. Or there are opinions that the problem is not the process node at all: it’s CMOS technology that is fundamentally weak (as evidenced by some tests the claimant did in the early seventies), while in the good old days radhard ICs were bipolar/SOI/GaAs. So, since CMOS technology is fundamentally flawed, there is no other way than to continue using ancient tech for spaceships. Preferably vacuum tubes.
In fairness, some old ICs built on multi-micron process nodes really are insensitive to single events. But “some” doesn’t mean “all”, and all kinds of problems have been documented throughout the history of space exploration. Large process nodes understandably require a lot of energy from an ionizing particle to flip a bit, but they also require the same amount of energy for each switching event during normal operation. So I wish a lot of luck to anyone willing to build an Intel Core processor equivalent out of 74-series logic, and I would love to see the rocket able to lift such a monster into the air.
On the other hand, microelectronics is not limited to microprocessors and memory. There is a huge variety of tasks for which the latest process nodes are unnecessary, unprofitable or simply unsuitable. The global market for ICs built on 200 mm wafers (process nodes of 90 nm and above) has been growing for several years, to the point of periodic shortages of production equipment. “Outdated” fabs produce both old and new designs, and many manufacturing companies are commercially successful despite not being on par with TSMC and Samsung. So take all the process node fuss with a grain of salt when data processing isn’t the topic.
Other factors encouraging the use of older process nodes in aerospace are the longer life cycle of such products, expensive certification and small production quantities. The design of a simple 180 nm IC can cost a few million euros, and when these millions, plus a few more required for certification and testing, are divided among a thousand ICs, each of them becomes very expensive. And what if we need to recoup a few hundred million for a 7 nm or 5 nm design? These troubles lead to two things. First, the design of most radiation hardened ICs in the world is government-subsidized. Second, successful designs are manufactured for as long as possible, and IP blocks from them are reused again and again to lower costs, forcing the manufacturer to stay at the proven process node. Combined, these factors can create the illusion that most radhard ICs are outdated. Clients also favour proven projects or, to be more precise, flight-proven projects. If a chip has heritage in space, that is a colossal competitive advantage, and you may be sure this advantage is exploited for as long as possible, even when the design itself becomes outdated.
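The amortization argument is easy to put into numbers. The sketch below spreads one-off design and qualification costs over the production volume; every figure in it is a hypothetical placeholder chosen only to match the orders of magnitude mentioned above.

```python
def cost_per_chip(design_cost: float,
                  qualification_cost: float,
                  unit_manufacturing_cost: float,
                  volume: int) -> float:
    """Per-unit cost when one-off (non-recurring) costs are spread
    evenly over the whole production volume."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    non_recurring = design_cost + qualification_cost
    return non_recurring / volume + unit_manufacturing_cost


# Hypothetical radhard part: 3 M design + 3 M certification, 1000 units.
radhard = cost_per_chip(3e6, 3e6, 50.0, 1_000)

# Hypothetical commercial part: similar design cost, ten million units.
commercial = cost_per_chip(3e6, 0.5e6, 5.0, 10_000_000)

print(f"radhard:    {radhard:,.2f} per chip")
print(f"commercial: {commercial:,.2f} per chip")
```

With these placeholder inputs the non-recurring share dominates the radhard price completely, while it all but disappears in the commercial part.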
The public image of radhard ICs is further diminished by the fact that the most famous of them are used in long-term scientific missions. In 2015, I saw a lot of news like “New Horizons has the same CPU as the 20-year-old original Sony PlayStation”. Well said, well said. New Horizons was launched in 2006, and its development had begun in 2000, the year of the first flight of the processor it used. The Mongoose-V processor shares the MIPS ISA with the PlayStation’s MIPS R3000, but it is an entirely different chip released in 1998, some eight years before the launch of New Horizons and seventeen years before it was featured in the news. Here is another example: PowerPC 750 processors came out for commercial applications in 1997, notably for iMac computers. Their radhard counterpart, the RAD750, was released in 2001 and first flew into space in 2005, four years later. It was the highest computational power available for the Curiosity Mars rover, so there was a lot of news about an ancient processor on Mars in 2012. And, to make it even funnier, almost the entire Curiosity design was reused for Perseverance, which is due to produce more stupid processor-related headlines next year.
Despite all of the above, the newest radhard ICs of today are designed at nodes between 45 and 20 nm, like the recent radhard Xilinx Kintex FPGAs. The American RAD5500 series is manufactured at 45 nm, the European DAHLIA, which is due in 2021, uses 28 nm, and so on. GlobalFoundries already offers a 12 nm process for aerospace applications, so modern radhard ICs are definitely modern.
There are many topics still to be researched, and there is no shortage of scientific articles on the radiation hardness of modern technologies, as new challenges tend to emerge with each new generation. Process node shrinking definitely affects radiation hardness, but the effect is complex and not necessarily negative. The general trend is that the influence of TID decreases while the role of single events becomes more important. Thinner gate oxides lead to smaller threshold voltage shifts, but then these gate oxides are not silicon dioxide anymore, their interface with silicon is different, and so on and so on.

Figure 7. Two versions of radiation hardened inverter. Taken from Vaz et al., "Design Flow Methodology for Radiation Hardened by Design CMOS Enclosed-Layout-Transistor-Based Standard-Cell Library", Journal of Electronic Testing, volume 34, 2018
Figure 7 shows two implementations of an inverter. On the left, we see the full treatment: enclosed layout transistors (ELTs) to combat total dose, and individual guard rings against inter-transistor leakage and SEL. On the right is a simpler design for lower total dose requirements, with linear transistors. It’s worth noting that a total dose tolerance of 50-100 krad(Si) is quite sufficient for many space applications, and normal linear transistors do an excellent job there while saving area, avoiding the aspect ratio limitations of ELTs and matching better than ELTs do. Also note that only nMOSFETs suffer from source-drain leakage, so only they need enclosed gates when high total dose tolerance is required, but pMOSFETs are often drawn as ELTs too for easier size balancing between nMOS and pMOS.
The relationship between single events and process nodes is more interesting. The approximate diameter of the charge collection area of an ionizing particle hit is around one micron, which is much bigger than a memory cell in deep submicron process nodes. And indeed, experiments show multiple bit upsets from a single ion strike.
Figure 8. Multiple bit upsets in two different 6T SRAM cell arrays. Taken from M. Gorbunov et al., "Design of 65 nm CMOS SRAM for Space Applications: A Comparative Study", IEEE Transactions on Nuclear Science, Vol.61, No.4, 2014
Figure 8 shows experimental data on single event upsets in 65 nm bulk technology. On the left is a normal commercial 6T SRAM design. Ten upsets from a single hit! A Hamming code won’t protect you from such a disaster. So, when we’re talking about commercial ICs, coarse process nodes are somewhat better than finer ones, as they mostly experience single-bit upsets, which are easier to correct. But when we’re designing a radhard chip from scratch, there is a plethora of architectural, schematic and layout solutions capable of delivering both high single event tolerance and high performance. The right side of Figure 8 shows the results for a 6T SRAM from the same die, with the same schematic but a different layout. The price of getting rid of most multiple bit upsets and latchup, and of increasing total dose hardness, is very simple: a fourfold increase in area. It doesn’t sound nice, but no one said it would be easy. However, if you’re ready for compromises, Radiation Hardening by Design allows achieving any predetermined level of radiation hardness at any bulk technology.
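Why a Hamming code copes with a single upset but not with several in the same word can be shown with a toy example. The sketch below uses the textbook Hamming(7,4) single-error-correcting code; real memories use wider codes, usually with an extra parity bit to detect double errors, so this illustrates the principle and is not a model of any particular chip.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword.
    Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the suspect bit
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bit the syndrome points at
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *indices):
    """Return a copy of the codeword with the given 0-based bits inverted."""
    c = list(codeword)
    for i in indices:
        c[i] ^= 1
    return c


data = [1, 0, 1, 1]
word = hamming74_encode(data)

single_upset = hamming74_decode(flip(word, 4))      # one bit hit
double_upset = hamming74_decode(flip(word, 2, 4))   # two bits hit at once

print("single upset recovered:", single_upset == data)
print("double upset recovered:", double_upset == data)
```

A single flipped bit is repaired, but with two flipped bits the syndrome points at a third, innocent bit and the decoder silently makes things worse. Ten upsets in neighbouring cells are far beyond what such a code can handle unless the layout keeps those cells in different words.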
Why predetermined? Because different requirements can be satisfied by different means. But why not apply all of them at once and be fine for every possible application? Because most radiation hardening methods come at the cost of compromising functional parameters to some extent (supply current, area, speed, etc.). Overengineering therefore leads to uncompetitive products. Sure, such low-volume ICs are rarely made for just one application and should be flexible, but detailed and reasonable radiation requirements are absolutely vital for a successful design.
Silicon on Insulator (SOI)
The attentive reader may have caught the word “bulk” in the phrase “predetermined level of radiation hardness at any bulk technology”. Isn’t it superfluous there? Isn’t it even wrong? It’s widely supposed that all the best radhard ICs are fabricated using “silicon on insulator” or “silicon on sapphire” technology. Right?
The “silicon on insulator” technology has long had a firmly entrenched reputation as “inherently radiation-hard”. The roots of this popular fallacy go back to antiquity, when its predecessor, SOS (silicon on sapphire), was actively used for military designs. Why? Transistors in SOS/SOI are electrically separated from each other and from the substrate. This means a much smaller radiation-induced charge collection volume, which is quite handy for dealing with high dose rate events, as it significantly reduces the chip’s shutdown time right after a nearby nuclear explosion. That was indeed an important trait for a product designed during the Cold War.
Another part of the “SOI = Radhard” myth is insensitivity to latchup, including dose rate latchup. Latchup (also known as the “thyristor effect”) is one of the main headaches for designers of spaceborne systems, as it’s unpredictable and catastrophic. So a technology that deals with it for free could naturally be considered a gift from heaven. But the whole picture is a little more complicated.

Figure 9. CMOS technology cross-section with parts of parasitic thyristor causing the latchup.
The cause of latchup is the parasitic thyristor structure present in bulk CMOS technology. If the resistances Rs and Rw are large enough, a hit by an ionizing particle can deposit enough charge to turn on the parasitic thyristor and create a short between supply and ground. How big are these resistances in real chips? The answer is quite simple: every contact to the substrate or well costs extra area, so their number is usually minimized to make chips cheaper. This means that a random commercial IC is more likely than not to be vulnerable to latchup. Latchup, however, can occur not just after an ion strike, but also due to ESD, high temperature, excessive current density or a door being shut in the next room, so automotive and industrial IC designers are familiar with the topic and more likely to take measures against it.
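The role of Rs and Rw can be illustrated with the usual first-order textbook picture of the parasitic thyristor: a transient current flowing through the well or substrate resistance forward-biases one of the parasitic bipolar transistors once the voltage drop reaches roughly a diode drop, and the latched state sustains itself only if the product of the two bipolar gains exceeds one. The sketch below encodes just that; the 0.6 V turn-on voltage, the function names and all example values are assumptions for illustration, not a device model.

```python
V_BE_ON = 0.6  # volts; assumed base-emitter turn-on voltage


def bjt_turns_on(transient_current_a: float, resistance_ohm: float,
                 v_be_on: float = V_BE_ON) -> bool:
    """True if the I*R drop across the well or substrate resistance is
    large enough to forward-bias the parasitic base-emitter junction."""
    return transient_current_a * resistance_ohm >= v_be_on


def latchup_can_sustain(beta_npn: float, beta_pnp: float) -> bool:
    """First-order regeneration condition for the parasitic thyristor."""
    return beta_npn * beta_pnp > 1.0


def latchup_risk(transient_current_a: float, r_well_ohm: float,
                 r_substrate_ohm: float, beta_npn: float,
                 beta_pnp: float) -> bool:
    """Latchup is possible if either parasitic transistor can be triggered
    and the positive feedback loop can keep itself going."""
    triggered = (bjt_turns_on(transient_current_a, r_well_ohm)
                 or bjt_turns_on(transient_current_a, r_substrate_ohm))
    return triggered and latchup_can_sustain(beta_npn, beta_pnp)


# Hypothetical 1 mA transient: sparse taps (2 kOhm) versus dense taps (100 Ohm).
print("sparse taps:", latchup_risk(1e-3, 2000.0, 2000.0, 5.0, 2.0))
print("dense taps: ", latchup_risk(1e-3, 100.0, 100.0, 5.0, 2.0))
```

In this simplified picture, adding contacts lowers Rs and Rw until the same transient can no longer trigger the structure, which is exactly the area-versus-robustness trade described above.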
A chip can be brought out of the latchup condition by power cycling, and such a reset is quite acceptable in many space applications, so many commercial products can still be used in space, albeit with some caution. So-called latchup current limiters are very popular in radhard systems, especially in those requiring computational performance unattainable with currently available radhard processors. But such a solution has many limitations. A power reset is not always possible, as there is no shortage of real-time tasks. A reboot during an important manoeuvre can put an end to a long mission. The current consumption of a modern IC may vary by a few orders of magnitude depending on its operating mode, so the consumption in the “idle, but latched up” state may be lower than in normal high-performance operation. Where should the current limit be set for such a chip? The required system reaction time also depends on the protected chip, as some chips are very vulnerable while others can survive thousands of latchups if they are reset sufficiently fast.
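The threshold dilemma can be made concrete with a toy supervisor. The sketch below is a hypothetical software model of a latchup current limiter (real ones are usually analog or mixed-signal circuits): it trips when the measured supply current stays above a limit for several consecutive samples. The class name, the mode names and all current values are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LatchupLimiter:
    """Toy latchup current limiter with a separate trip level per mode."""
    limits_a: dict              # operating mode -> trip current in amperes
    trip_samples: int = 3       # consecutive over-limit samples before a trip
    _over_count: int = field(default=0, init=False)

    def sample(self, mode: str, current_a: float) -> bool:
        """Feed one current measurement.
        Returns True when the supply should be power-cycled."""
        if current_a > self.limits_a[mode]:
            self._over_count += 1
        else:
            self._over_count = 0
        if self._over_count >= self.trip_samples:
            self._over_count = 0
            return True
        return False


# Hypothetical chip: 20 mA when idle, up to 1.5 A under load, 0.8 A when latched.
LATCHED_CURRENT_A = 0.8

mode_aware = LatchupLimiter({"idle": 0.05, "compute": 2.0})
single_limit = LatchupLimiter({"idle": 2.0, "compute": 2.0})

mode_aware_trips = [mode_aware.sample("idle", LATCHED_CURRENT_A) for _ in range(3)]
single_limit_trips = [single_limit.sample("idle", LATCHED_CURRENT_A) for _ in range(10)]

print("mode-aware limiter tripped:", any(mode_aware_trips))
print("single-limit limiter tripped:", any(single_limit_trips))
```

A single limit set above the peak operating current never notices a latchup that draws less than full load, while a mode-aware limit catches it when the chip is idle. Neither helps if the latchup happens during heavy computation, which is the limitation the paragraph above points at.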
If a chip is fabricated in SOI technology, none of these problems is a concern anymore. And no protection circuitry is necessary, none at all. That’s why commercial SOI chips are so attractive for space applications. For example, the new American spacecraft Orion is controlled by a commercial SOI-based PowerPC 750 microprocessor rather than its radhard version, the RAD750.

Figure 10. Leakage paths in bulk CMOS technology. Taken from J. Schwank et al., «Radiation effects in MOS oxides», IEEE Transactions on Nuclear Science, Vol. 55, No. 4, 2008
Then what’s the problem? Latchup is not the only radiation effect, and SOI is not inherently better than bulk technology in terms of either TID or SEE hardness. Figure 10 shows two leakage paths in bulk CMOS technology. Both of these paths are easily closed with proper layout design: the first with enclosed (ring-shaped) n-channel transistors, the second with guard rings. These solutions have drawbacks from a circuit design point of view (restrictions on the minimum size of an enclosed transistor, area loss when using guard rings), but they are very effective at ensuring radiation hardness.

Figure 11. SOI buried oxide leakage path. Taken from J. Schwank et al., "Radiation effects in MOS oxides", IEEE Transactions on Nuclear Science, Vol. 55, No. 4, 2008
In SOI technology, there is another leakage path from source to drain, along the interface between the silicon and the buried oxide. The buried oxide is much thicker than the gate oxide, which means that it can accumulate a lot of positive charge. If we consider the "lower" transistor (the right part of Figure 11), for which the buried oxide acts as the gate dielectric, we will see that in a normal situation the gate-source voltage of this transistor is zero and its threshold voltage is several tens of volts, i.e. no current flows through it. Under irradiation, positive charge accumulates in the buried oxide (this process is influenced by the geometry of the main transistor, in particular by the thickness of the silicon device layer), and the threshold voltage of the “lower” n-channel transistor drops. As soon as it falls below zero, current begins to flow freely through the transistor along the uncontrolled bottom channel. Thus, from the point of view of total ionizing dose, SOI technology is fundamentally worse than bulk technology. But maybe there is a way to fix the situation somehow?
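Why oxide thickness matters so much follows from the basic MOS relation between trapped charge and threshold shift, ΔVth = -Q/C_ox with C_ox = ε_ox/t_ox. The sketch below applies it to a thin gate oxide and a thick buried oxide. It assumes, purely for illustration, the same sheet density of trapped holes located at the silicon interface in both cases; the thicknesses and the density are hypothetical. In reality a thicker oxide also traps more charge, so the true dependence on thickness is even steeper.

```python
Q_E = 1.602e-19           # elementary charge, C
EPS_0 = 8.854e-14         # vacuum permittivity, F/cm
EPS_SIO2 = 3.9 * EPS_0    # permittivity of silicon dioxide, F/cm


def vth_shift_v(trapped_holes_per_cm2: float, oxide_thickness_nm: float) -> float:
    """Threshold voltage shift caused by a sheet of trapped positive charge
    at the Si/SiO2 interface: dVth = -q * N / C_ox, C_ox = eps_ox / t_ox."""
    thickness_cm = oxide_thickness_nm * 1e-7
    c_ox = EPS_SIO2 / thickness_cm          # F/cm^2
    return -Q_E * trapped_holes_per_cm2 / c_ox


N_TRAPPED = 1e12   # assumed trapped-hole sheet density, cm^-2

gate_shift = vth_shift_v(N_TRAPPED, 2.0)      # hypothetical 2 nm gate oxide
buried_shift = vth_shift_v(N_TRAPPED, 150.0)  # hypothetical 150 nm buried oxide

print(f"gate oxide shift:   {gate_shift * 1e3:.0f} mV")
print(f"buried oxide shift: {buried_shift:.1f} V")
```

The same trapped charge that moves the front-gate threshold by less than a tenth of a volt moves the back-gate threshold by several volts, which is how a threshold of tens of volts can eventually be driven below zero.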
The substrate is usually grounded (in fact, connected to the lowest available potential), but in SOI nothing prevents us from setting negative voltage there and closing that parasitic back gate. This idea is, in fact, actively used — and in FDSOI technologies active back gate control is even used in their normal operation to minimize leakages in low-power modes and maximize speed when necessary. However, there is a catch: when we apply a high electric field to the buried oxide, we don’t just close the back transistor, but also accelerate the accumulation of positive charge. As a result, depending on technology specifics and the magnitude of the voltage applied, it’s possible that the total dose hardness will become even worse! There are other details, but in general, it’s possible to achieve almost any TID hardness level using standard CMOS technology, but there are some fundamental limitations for SOI. These limitations are normally negligible for low-orbit space applications, but if we’re speaking about multi-Megarad levels that could be present in the nuclear industry, commercially unfeasible technology changes are necessary for SOI.
Single event upsets in SOI are no less interesting. On the one hand, the charge collection volume in SOI is much smaller (although there is a long-running argument about the exact shape of this volume and its possible connection to the bulk). This means that less excess charge is collected and it can be dissipated through the supply lines faster, increasing the chances of logic masking when a non-memory cell is hit.
On the other hand, this small volume has a small capacitance, so even a small deposited charge can raise the body voltage and turn on the parasitic bipolar transistor formed by the source, body and drain. If this happens, the deposited charge is multiplied by the gain of this parasitic transistor. In practice, this means that the threshold LET drops below 1 MeV·cm²/mg, and then effectively any incoming particle will cause a bit upset.
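The effect of the parasitic bipolar gain on the upset threshold can be estimated with a back-of-the-envelope model. In silicon, an LET of 1 MeV·cm²/mg deposits roughly 10 fC per micron of track (it takes about 3.6 eV to create one electron-hole pair). The sketch below multiplies the deposited charge by an assumed bipolar gain and compares it with a critical charge; the film thickness, the gain and the critical charge are hypothetical values chosen for illustration.

```python
# Approximate charge deposited in silicon per micron of track for an
# LET of 1 MeV*cm^2/mg (about 97 MeV*cm^2/mg corresponds to 1 pC/um).
FC_PER_UM_PER_LET = 10.3


def collected_charge_fc(let_mev_cm2_mg: float, path_um: float,
                        bipolar_gain: float = 1.0) -> float:
    """Charge seen by the struck node in femtocoulombs, including
    amplification by the parasitic source-body-drain bipolar transistor."""
    return let_mev_cm2_mg * FC_PER_UM_PER_LET * path_um * bipolar_gain


def causes_upset(let_mev_cm2_mg: float, path_um: float,
                 critical_charge_fc: float, bipolar_gain: float = 1.0) -> bool:
    """True if the collected charge reaches the cell's critical charge."""
    return collected_charge_fc(let_mev_cm2_mg, path_um, bipolar_gain) >= critical_charge_fc


# Hypothetical SOI cell: 0.1 um silicon film, 2 fC critical charge.
FILM_UM = 0.1
Q_CRIT_FC = 2.0

print("body tied, gain 1:     ", causes_upset(1.0, FILM_UM, Q_CRIT_FC, 1.0))
print("floating body, gain 5: ", causes_upset(1.0, FILM_UM, Q_CRIT_FC, 5.0))
```

With these placeholder numbers, a particle with an LET of 1 MeV·cm²/mg is harmless when the body is tied down, but the same particle upsets the cell once the parasitic transistor multiplies the charge by five.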
This negative effect can, of course, be mitigated by a careful low-resistance connection between the transistor body and the respective power bus (or, in some cases, the transistor source). But no one does this in commercial chips, as these connections take up a lot of area and give nothing in return. Even in a radhard chip, losing some area in every transistor can be a significant downside compared to bulk alternatives, where one contact per 4-8 memory cells is often sufficient to prevent both latchup and parasitic bipolar multiplication. Even some guard rings can be added with a smaller area loss.
SOI gets another important advantage at small process nodes where dielectric isolation helps prevent multiple bit upsets from a single particle, but modern cells are so small that a single ion track can directly affect two of them. However, it’s still much better than 10-bit upsets seen in experiments with bulk technology.
Summing things up, SOI is not “inherently radiation hardened”, but it has significant advantages and disadvantages compared to traditional bulk technology. The advantages can be exploited to great effect, while the disadvantages should be mitigated with proper design. But the same is true for bulk technology, so the choice of process is not as trivial as it may seem and should be taken seriously in every single project. One should deeply understand the application in order to achieve the desired level of radiation hardness without making the chip unnecessarily complicated and expensive.
That's all, folks!
Many engineers across the globe are working on radiation hardness, and it’s completely impossible to cover everything in one article, especially one aimed at a wider audience. So my colleagues will probably find plenty of oversimplifications or even mistakes, which I will be happy to discuss and correct. While not trying to be exhaustive, I hope that I have given my readers a basic understanding of what radiation hardening of electronic circuits is and that I was able to dispel some related misconceptions. Microelectronics in general, and its special applications in particular, are among the fastest evolving fields of applied science, so common knowledge becomes outdated very fast, and simple recipes go unused simply because they no longer exist.

Translated from: https://habr.com/en/post/518366/

Competing in the Age of AI

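封人疯语: Close your eyes and picture tomorrow's world. Who will lead the car industry: Baidu and Google, or Toyota and Volvo? Data and algorithms have become the substrate of the entire world, and the traditional logic built on scarce physical resources, either-or trade-offs and the bounded rationality of the human brain seems to be in the process of being overturned completely. The more data and the stronger the algorithms, the stronger the strong become, and the smart take all. It is a frightening picture, and also an exciting and alluring one. Smith described the world's development through the division of labor; Marx analyzed humanity's future through class. As this brand-new era arrives we need a new logic of thinking, and data and algorithms are the key to understanding tomorrow.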

/Main text/

In 2019, just five years after the Ant Financial Services Group was launched, the number of consumers using its services passed the one billion mark. Spun out of Alibaba, Ant Financial uses artificial intelligence and data from Alipay—its core mobile-payments platform—to run an extraordinary variety of businesses, including consumer lending, money market funds, wealth management, health insurance, credit-rating services, and even an online game that encourages people to reduce their carbon footprint. The company serves more than 10 times as many customers as the largest U.S. banks—with less than one-tenth the number of employees. At its last round of funding, in 2018, it had a valuation of \$150 billion—almost half that of JPMorgan Chase, the world’s most valuable financial-services company.

Unlike traditional banks, investment institutions, and insurance companies, Ant Financial is built on a digital core. There are no workers in its “critical path” of operating activities. AI runs the show. There is no manager approving loans, no employee providing financial advice, no representative authorizing consumer medical expenses. And without the operating constraints that limit traditional firms, Ant Financial can compete in unprecedented ways and achieve unbridled growth and impact across a variety of industries.

The age of AI is being ushered in by the emergence of this new kind of firm. Ant Financial’s cohort includes giants like Google, Facebook, Alibaba, and Tencent, and many smaller, rapidly growing firms, from Zebra Medical Vision and Wayfair to Indigo Ag and Ocado. Every time we use a service from one of those companies, the same remarkable thing happens: Rather than relying on traditional business processes operated by workers, managers, process engineers, supervisors, or customer service representatives, the value we get is served up by algorithms. Microsoft’s CEO, Satya Nadella, refers to AI as the new “runtime” of the firm. True, managers and engineers design the AI and the software that makes the algorithms work, but after that, the system delivers value on its own, through digital automation or by leveraging an ecosystem of providers outside the firm. AI sets the prices on Amazon, recommends songs on Spotify, matches buyers and sellers on Indigo’s marketplace, and qualifies borrowers for an Ant Financial loan.

The elimination of traditional constraints transforms the rules of competition. As digital networks and algorithms are woven into the fabric of firms, industries begin to function differently and the lines between them blur. The changes extend well beyond born-digital firms, as more-traditional organizations, confronted by new rivals, move toward AI-based models too. Walmart, Fidelity, Honeywell, and Comcast are now tapping extensively into data, algorithms, and digital networks to compete convincingly in this new era. Whether you’re leading a digital start-up or working to revamp a traditional enterprise, it’s essential to understand the revolutionary impact AI has on operations, strategy, and competition.
消除传统约束无疑改变了竞争规则。随着数字网络和算法被导入企业的体系结构之中，行业开始以不同的方式运作，行业之间的界限开始变得模糊。这些变化不只是由这些新型的数字公司带来的，面对新的竞争对手，传统组织也开始转向基于人工智能的运营模式。沃尔玛、富达(Fidelity)、霍尼韦尔(Honeywell)和康卡斯特(Comcast)正在广泛利用数据、算法和数字网络，以赢得新时代的竞争。显然，无论你是领导一家数字型初创企业，还是致力于改造一家传统企业，理解人工智能对企业运营、战略和竞争的革命性影响都是至关重要的。

The AI Factory
人工智能工厂

At the core of the new firm is a decision factory—what we call the “AI factory.” Its software runs the millions of daily ad auctions at Google and Baidu. Its algorithms decide which cars offer rides on Didi, Grab, Lyft, and Uber. It sets the prices of headphones and polo shirts on Amazon and runs the robots that clean floors in some Walmart locations. It enables customer service bots at Fidelity and interprets X-rays at Zebra Medical. In each case the AI factory treats decision-making as a science. Analytics systematically convert internal and external data into predictions, insights, and choices, which in turn guide and automate operational workflows.
蚂蚁金服这样的新型公司的核心是一个决策工厂—— 我们称之为“人工智能工厂”。在谷歌和百度上，软件每天运营着数百万次广告拍卖。在滴滴、Grab、Lyft和Uber上，算法决定着哪些车可以提供服务。在亚马逊上，智能算法为耳机和polo衫定价。在沃尔玛的一些门店里，机器人在清洁地板。富达使用机器人提供客户服务，斑马医疗利用机器人解释x射线的图像。在每一个案例中，人工智能工厂都将决策视为一门科学，数据分析软件系统地将内外部数据转化为预测、洞察和选择，从而指导和自动化工作流程。

Oddly enough, the AI that can drive the explosive growth of a digital firm often isn’t even all that sophisticated. To bring about dramatic change, AI doesn’t need to be the stuff of science fiction—indistinguishable from human behavior or simulating human reasoning, a capability sometimes referred to as “strong AI.” You need only a computer system to be able to perform tasks traditionally handled by people—what is often referred to as “weak AI.”
奇怪的是，推动数字公司爆炸式增长的人工智能往往并不复杂。尽管带来了戏剧性的变化，但需要的人工智能并不是科幻小说里的那些东西——与人类行为或模拟人类推理没有什么区别的能力，这种能力有时被称为“强人工智能”。事实上，你只需要一个计算机系统就能完成传统上由人来完成的任务——这通常被称为“弱人工智能”。

With weak AI, the AI factory can already take on a range of critical decisions. In some cases it might manage information businesses (such as Google and Facebook). In other cases it will guide how the company builds, delivers, or operates actual physical products (like Amazon’s warehouse robots or Waymo, Google’s self-driving car service). But in all cases digital decision factories handle some of the most critical processes and operating decisions. Software makes up the core of the firm, while humans are moved to the edge.

Four components are essential to every factory. The first is the data pipeline, the semiautomated process that gathers, cleans, integrates, and safeguards data in a systematic, sustainable, and scalable way. The second is algorithms, which generate predictions about future states or actions of the business. The third is an experimentation platform, on which hypotheses regarding new algorithms are tested to ensure that their suggestions are having the intended effect. The fourth is infrastructure, the systems that embed this process in software and connect it to internal and external users.
对于人工智能工厂，有四个要素必不可少。一是数据管道，这是一个半自动化的过程，用一种系统的、可持续和可扩展的方式收集、清理、集成和保护数据。二是算法，生成关于业务未来状态或行动的预测值。三是实验平台，在这个平台上测试新算法的假设，确保具有预期的效果。四是基础设施，将人工智能嵌入软件平台，并将其连接到内外部用户的系统上。

The AI that drives explosive growth often isn't even all that sophisticated
推动爆炸性增长的人工智能通常不是很复杂

Take a search engine like Google or Bing. As soon as someone starts to type a few letters into the search box, algorithms dynamically predict the full search term on the basis of terms that many users have typed in before and this particular user’s past actions. These predictions are captured in a drop-down menu (the “autosuggest box”) that helps the user zero in quickly on a relevant search. Every keystroke and every click are captured as data points, and every data point improves the predictions for future searches. AI also generates the organic search results, which are drawn from a previously assembled index of the web and optimized according to the clicks generated on the results of previous searches. The entry of the term also sets off an automated auction for the ads most relevant to the user’s search, the results of which are shaped by additional experimentation and learning loops. Any click on or away from the search query or search results page provides useful data. The more searches, the better the predictions, and the better the predictions, the more the search engine is used.
以谷歌或Bing这样的搜索引擎为例。一旦有人开始在搜索框中输入几个字母，算法就会根据许多用户之前输入的词汇和该用户过去的行为动态预测整个搜索词。这些预测值会在下拉菜单（“自动建议框”）中显示出来，帮助用户快速锁定相关搜索。每一个击键和每一次点击都被捕获为数据点，每一个数据点都改进了对未来搜索的预测。人工智能还能生成有机的搜索结果，这些搜索结果来自于以前收集的web索引，并根据以前搜索结果产生的点击进行优化。这个词的加入也引发了与用户搜索最相关的广告的自动拍卖，这个结果是由其它的实验和学习循环形成的。任何点击或离开搜索查询或搜索结果页面都会提供有用的数据。搜索越多，预测效果越好，预测效果越好，搜索引擎的使用率就越高。

Removing Limits to Scale, Scope, and Learning
消除规模、范围和学习等因素对企业增长影响的限制

The concept of scale has been central in business since at least the Industrial Revolution. The great Alfred Chandler described how modern industrial firms could reach unprecedented levels of production at much lower unit cost, giving large firms an important edge over smaller rivals. He also highlighted the benefits companies could reap from the ability to achieve greater production scope, or variety. The push for improvement and innovation added a third requirement for firms: learning. Scale, scope, and learning have come to be considered the essential drivers of a firm’s operating performance. And for a long time they’ve been enabled by carefully defined business processes that rely on labor and management to deliver products and services to customers—and that are reinforced by traditional IT systems.
工业革命以来，规模概念一直是商业的核心。伟大的阿尔弗雷德•钱德勒曾经描述过，现代工业企业是怎样以低得多的单位成本达到前所未有的生产水平，从而使大型企业相对于规模较小的竞争对手拥有重要优势。他还强调了企业能够从扩大生产范围或增加品种中获得的好处。随着创新重要性的与日俱增，对企业又增加了学习能力的要求。规模、范围和学习能力被认为是一个公司经营业绩的主要驱动力。很长一段时间以来，它们都是通过精心定义的业务流程来实现的，这些业务流程依赖于劳动力和管理人员向客户交付产品和服务，并由传统的IT系统加以强化。

After hundreds of years of incremental improvements to the industrial model, the digital firm is now radically changing the scale, scope, and learning paradigm. AI-driven processes can be scaled up much more rapidly than traditional processes can, allow for much greater scope because they can easily be connected with other digitized businesses, and create incredibly powerful opportunities for learning and improvement—like the ability to produce ever more accurate and sophisticated customer-behavior models and then tailor services accordingly.
虽然经历了数百年，企业的竞争模式只是在缓慢改变。现在数字公司彻底改变了规模、范围和学习的竞争范式。AI驱动的业务流程相比传统业务流程，以快得多的速度扩大服务能力，拓展服务范围，他们可以很容易的与其他数字化业务实现连接，创造令人难以置信的强大的学习和改进机会，产生更精确和复杂的客户行为模型，定制相应的服务。

In traditional operating models, scale inevitably reaches a point at which it delivers diminishing returns. But we don’t necessarily see this with AI-driven models, in which the return on scale can continue to climb to previously unheard-of levels. Now imagine what happens when an AI-driven firm competes with a traditional firm by serving the same customers with a similar (or better) value proposition and a much more scalable operating model.
在传统运营模式中，规模会达到一个均衡点，之后，回报开始递减。但在人工智能驱动的运营模式下，这种情况可能不会出现，规模回报可能会持续攀升至前所未有的水平。现在，想象一下，当一个人工智能驱动的公司与一个传统公司竞争，人工智能驱动的公司用极具可扩展性的运营模式为相同的客户提供类似（或更好）的价值服务，结果会怎样呢？

How AI-Driven Companies Can Outstrip Traditional Firms
人工智能驱动的公司如何超越传统公司

The value that scale delivers eventually tapers off in traditional operating models, but in digital operating models, it can climb much higher.
在传统的运营模式中，这种规模增长带来的价值最终会逐渐减少，但在数字运营模式中，它可以爬升得更高。

We call this kind of confrontation a “collision.” As both learning and network effects amplify volume’s impact on value creation, firms built on a digital core can overwhelm traditional organizations. Consider the outcome when Amazon collides with traditional retailers, Ant Financial with traditional banks, and Didi and Uber with traditional taxi services. As Clayton Christensen, Michael Raynor, and Rory McDonald argued in “What Is Disruptive Innovation?” (HBR, December 2015), such competitive upsets don’t fit the disruption model. Collisions are not caused by a particular innovation in a technology or a business model. They’re the result of the emergence of a completely different kind of firm. And they can fundamentally alter industries and reshape the nature of competitive advantage.
我们称人工智能驱动的公司与传统公司之间的对抗为“冲突”。由于学习和网络效应放大了数量对价值创造的影响，建立在数字核心之上的公司可以超越传统组织。考虑一下亚马逊与传统零售商、蚂蚁金服与传统银行、滴滴和优步与传统出租车服务发生冲突的后果。正如克莱顿·克里斯坦森、迈克尔·雷诺和罗里·麦克唐纳在《什么是颠覆性创新》（哈佛商业评论，2015年12月）中指出的，这样的竞争性颠覆不符合颠覆创新模式。冲突不是由技术或商业模式中的特定创新引起的。它们是一种完全不同的公司出现的结果。它们可以从根本上改变行业，重塑竞争优势的本质。

Note that it can take quite a while for AI-driven operating models to generate economic value anywhere near the value that traditional operating models generate at scale. Network effects produce little value before they reach critical mass, and most newly applied algorithms suffer from a “cold start” before acquiring adequate data. Ant Financial grew rapidly, but its core payment service, Alipay, which had been launched in 2004 by Alibaba, took years to reach its current volume. This explains why executives ensconced in the traditional model have a difficult time at first believing that the digital model will ever catch up. But once the digital operating model really gets going, it can deliver far superior value and quickly overtake traditional firms.
请注意（如上图），人工智能驱动的运营模式产生的经济价值，可能需要相当长的一段时间才能接近传统运营模式在规模上产生的价值。网络效应在达到临界规模之前产生的价值很小，而大多数新应用的算法在获得足够的数据之前都遭遇了“冷启动”。蚂蚁金服发展迅速，但其核心支付服务——阿里巴巴于2004年推出的支付宝——花了多年时间才达到目前的规模。这就解释了为什么那些安坐在传统模式下的高管们一开始很难相信数字模式会迎头赶上。但一旦数字运营模式真正开始运作，它就能带来远超传统企业的价值，并迅速超越传统企业。

Collisions between AI-driven and traditional firms are happening across industries: software, financial services, retail, telecommunications, media, health care, automobiles, and even agribusiness. It’s hard to think of a business that isn’t facing the pressing need to digitize its operating model and respond to the new threats.
在软件、金融服务、零售、电信、媒体、医疗、汽车甚至农业综合企业等行业，人工智能驱动的企业与传统企业之间的冲突正在发生。很难想象一个企业不面临着将其运营模式数字化和应对新威胁的迫切需要。

Rebuilding Traditional Enterprises
重建传统企业

For leaders of traditional firms, competing with digital rivals involves more than deploying enterprise software or even building data pipelines, understanding algorithms, and experimenting. It requires rearchitecting the firm’s organization and operating model. For a very, very long time, companies have optimized their scale, scope, and learning through greater focus and specialization, which led to the siloed structures that the vast majority of enterprises today have. Generations of information technology didn’t change this pattern. For decades, IT was used to enhance the performance of specific functions and organizational units. Traditional enterprise systems often even reinforced silos and the divisions across functions and products.
对于传统企业的领导者来说，同数字企业的竞争不只是部署企业软件，或者是建立数据管道、理解算法和进行实验。它需要重新架构公司的组织和运营模式。很长一段时间以来，公司通过归核化和专业化在不断优化它们的规模、范围和学习模式，形成了今天绝大多数企业所拥有的烟囱结构。虽然信息技术经历了几代的发展，但并没有改变这种模式。几十年来，信息技术只是被用来提高某些特定功能和组织单元的绩效。这反而强化了传统企业的烟囱结构，促进了企业功能和产品的分散化。

Silos, however, are the enemy of AI-powered growth. Indeed, businesses like Google Ads and Ant Financial’s MyBank deliberately forgo them and are designed to leverage an integrated core of data and a unified, consistent code base. When each silo in a firm has its own data and code, internal development is fragmented, and it’s nearly impossible to build connections across the silos or with external business networks or ecosystems. It’s also nearly impossible to develop a 360-degree understanding of the customer that both serves and draws from every department and function. So when firms set up a new digital core, they should avoid creating deep organizational divisions within it.

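The shift to an AI-driven model is challenging, but many traditional companies, some of which we have worked with, have begun to make it. In a recent study we examined more than 350 traditional enterprises in service and manufacturing industries and found that most had started to put greater emphasis on data and analytics. Many, including Nordstrom, Vodafone, Comcast, and Visa, had already made important progress, digitizing and redesigning key components of their operating models and developing sophisticated data platforms and AI capabilities. You don't have to be a software start-up to digitize the critical elements of your business, but you do have to confront siloed, fragmented legacy systems, add capabilities, and retool your culture.
虽然向人工智能驱动模式转变充满挑战，但许多传统公司，其中一些与我们有过合作，已经开始做出转变。事实上，在最近的一项研究中，我们研究了350多家服务和制造行业的传统企业，发现大多数企业都开始更加注重数据和分析。包括诺德斯特龙、沃达丰、康卡斯特和Visa在内的许多公司已经取得了重要进展，他们将运营模式的关键组件进行了数字化和重新设计，并开发了复杂的数据平台和人工智能。你不必成为一个软件初创公司来数字化你的关键业务元素，但你必须面对烟囱式的、分散的传统信息系统，给它赋能，并重构公司文化。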

Fidelity Investments is using AI to enable processes in important areas, including customer service, customer insights, and investment recommendations. Its AI initiatives build on a multiyear effort to integrate data assets into one digital core and redesign the organization around it. The work is by no means finished, but the impact of AI is already evident in many high-value use cases across the company. To take on Amazon, Walmart is rebuilding its operating model around AI and replacing traditional siloed enterprise software systems with an integrated, cloud-based architecture. That will allow Walmart to use its unique data assets in a variety of powerful new applications and automate or enhance a growing number of operating tasks with AI and analytics. At Microsoft, Nadella is betting the company’s future on a wholesale transformation of its operating model.
富达投资正在利用人工智能为重要领域的业务流程赋能，包括客户服务、客户洞察和投资建议。它的人工智能计划建立在多年的努力之上，将数据资产整合到一个数字核中，并围绕它重新设计组织。虽然这项工作并没有结束，但是人工智能的影响已经在公司的许多高价值应用案例中得到了明显的体现。为了与亚马逊竞争，沃尔玛正围绕人工智能重建其运营模式，以集成的、基于云的架构取代传统的、烟囱式的企业软件系统。这将使沃尔玛能够在各种强大的新应用程序中使用其独特的数据资产，通过人工智能和数据分析让越来越多的任务自动化、并提升效率。在微软，纳德拉正将公司的未来押注于运营模式的整体转型。

Rethinking Strategy and Capabilities
重新思考战略和能力

As AI-powered firms collide with traditional businesses, competitive advantage is increasingly defined by the ability to shape and control digital networks. (See “Why Some Platforms Thrive and Others Don’t,” HBR, January–February 2019.) Organizations that excel at connecting businesses, aggregating the data that flows among them, and extracting its value through analytics and AI will have the upper hand. Traditional network effects and AI-driven learning curves will reinforce each other, multiplying each other’s impact. You can see this dynamic in companies such as Google, Facebook, Tencent, and Alibaba, which have become powerful “hub” firms by accumulating data through their many network connections and building the algorithms necessary to heighten competitive advantages across disparate industries.

Meanwhile, conventional approaches to strategy that focus on traditional industry analysis are becoming increasingly ineffective. Take automotive companies. They’re facing a variety of new digital threats, from Uber to Waymo, each coming from outside traditional industry boundaries. But if auto executives think of cars beyond their traditional industry context, as a highly connected, AI-enabled service, they can not only defend themselves but also unleash new value—through local commerce opportunities, ads, news and entertainment feeds, location-based services, and so on.
与此同时，聚焦传统行业分析的传统战略分析方法正变得越来越无效。以汽车企业为例，他们正面临着各种新的数字威胁，从优步到Waymo，每一种威胁都来自传统行业的边界之外。但是，如果汽车行业的高管们能超越传统思维，把汽车看作是高度互联的、由人工智能驱动的服务，那么他们不仅可以保护好自己，还可以通过车内的商业机会、广告、新闻和娱乐信息、基于位置的服务等来释放新的价值。

The advice to executives was once to stick with businesses they knew, in industries they understood. But synergies in algorithms and data flows do not respect industry boundaries. And organizations that can’t leverage customers and data across those boundaries are likely to be at a big disadvantage. Instead of focusing on industry analysis and on the management of companies’ internal resources, strategy needs to focus on the connections firms create across industries and the flow of data through the networks the firms use.
曾经给高管们的建议是，在熟悉的行业里，坚持做自己熟悉的生意。但算法和数据流的协同效应并不尊重行业边界。而那些不能跨越这些边界利用客户和数据的组织可能会处于很大的劣势。战略需要聚焦的不是行业分析和公司内部资源的管理，而是公司跨行业建立的联系和公司正在使用的网络中的数据流。

All this has major implications for organizations and their employees. Machine learning will transform the nature of almost every job, regardless of occupation, income level, or specialization. Undoubtedly, AI-based operating models can exact a real human toll. Several studies suggest that perhaps half of current work activities may be replaced by AI-enabled systems. We shouldn’t be too surprised by that. After all, operating models have long been designed to make many tasks predictable and repeatable. Processes for scanning products at checkout, making lattes, and removing hernias, for instance, benefit from standardization and don’t require too much human creativity. While AI improvements will enrich many jobs and generate a variety of interesting opportunities, it seems inevitable that they will also cause widespread dislocation in many occupations.
所有这些变化对组织及其雇员都有重大影响。机器学习将改变几乎所有工作的性质，无论职业、收入水平或专业领域。毫无疑问，基于人工智能的运营模式将会对就业造成实实在在的影响。几项研究表明，目前的工作可能有一半将被人工智能系统取代。对此我们不应该感到太惊讶。毕竟，长期以来，运营模式已经被设计成让许多工作任务是可预测和可重复的。例如，检查时扫描产品、制作拿铁和去除疝气的流程都可以标准化，不需要太多的人类创造力。虽然人工智能将使很多工作变得更加丰富，并产生各种有趣的机会，但似乎不可避免的是，它们也将在许多职业中造成广泛的混乱与调整。

The dislocations will include not only job replacement but also the erosion of traditional capabilities. In almost every setting, AI-powered firms are taking on highly specialized organizations. In an AI-driven world, the requirements for competition have less to do with specialization and more to do with a universal set of capabilities in data sourcing, processing, analytics, and algorithm development. These new universal capabilities are reshaping strategy, business design, and even leadership. Strategies in very diverse digital and networked businesses now look similar, as do the drivers of operating performance. Industry expertise has become less critical. When Uber looked for a new CEO, the board hired someone who had previously run a digital firm—Expedia—not a limousine services company.
这种混乱与调整不仅包括工作的替代，还包括传统能力的削弱。在几乎每一种情况下，人工智能公司都在挑战高度专业化的组织。在人工智能驱动的世界中，竞争能力与专门化关系不大，而更多地与数据来源、处理、分析和算法开发方面的通用功能有关。这些新的通用能力正在重塑战略、业务设计，甚至领导力。如今，在非常多样化的数字和网络化公司中，战略看起来都很相似，经营业绩的驱动因素也是如此。行业专长变得不那么重要了。当优步寻找新的首席执行官时，董事会聘请的是一位曾运营过数字公司的人，运营的是艾派迪公司，而不是一家豪华轿车服务公司。

We’re moving from an era of core competencies that differ from industry to industry to an age shaped by data and analytics and powered by algorithms—all hosted in the cloud for anyone to use. This is why Alibaba and Amazon are able to compete in industries as disparate as retail and financial services, and health care and credit scoring. These sectors now have many similar technological foundations and employ common methods and tools. Strategies are shifting away from traditional differentiation based on cost, quality, and brand equity and specialized, vertical expertise and toward advantages like business network position, the accumulation of unique data, and the deployment of sophisticated analytics.
我们正在从一个不同行业拥有不同核心竞争力的时代，进入一个由数据和分析塑造、由算法驱动的核心竞争力时代——所有这些都托管在云端，任何人都可以使用。这就是为什么阿里巴巴和亚马逊能够在零售和金融服务、医疗保健和信用评分等完全不同的行业展开竞争。这些部门现在有许多类似的技术基础，并使用共同的方法和工具。战略正从传统的构建基于成本、质量、品牌价值、专门化和垂直专长等方面的差异，转向打造基于商业网络位置、独特数据积累和复杂分析部署等方面的优势。

The Leadership Challenge
领导力挑战

Though it can unleash enormous growth, the removal of operating constraints isn’t always a good thing. Frictionless systems are prone to instability and hard to stop once they’re in motion. Think of a car without brakes or a skier who can’t slow down. A digital signal—a viral meme, for instance—can spread rapidly through networks and can be just about impossible to halt, even for the organization that launched it in the first place or an entity that controls the key hubs in a network. Without friction, a video inciting violence or a phony or manipulative headline can quickly spread to billions of people on a variety of networks, even morphing to optimize click-throughs and downloads. If you have a message to send, AI offers a fantastic way to reach vast numbers of people and personalize that message for them. But the marketer’s paradise can be a citizen’s nightmare.
尽管它可以释放出巨大的增长，但消除运营约束并不总是一件好事。无摩擦系统容易不稳定，一旦运行就很难停止。想想一辆没有刹车的汽车或者一个不能减速的滑雪者。数字信号——例如，病毒式的模因（meme）—可以通过网络迅速传播，而且几乎不可能被阻止，即使是最初发布它的组织或控制网络关键枢纽的实体也不例外。在没有摩擦的情况下，一个煽动暴力的视频，或者一个虚假或被操纵的标题，都可以通过各种各样的网络迅速传播到数十亿人的手中，甚至可以通过变形来优化点击率和下载。如果你有信息要发送，人工智能提供了一种奇妙的方式来接触大量的人，并为他们个性化信息。但市场营销者的天堂可能是公民的噩梦。

Digital operating models can aggregate harm along with value. Even when the intent is positive, the potential downside can be significant. A mistake can expose a large digital network to a destructive cyberattack. Algorithms, if left unchecked, can exacerbate bias and misinformation on a massive scale. Risks can be greatly magnified. Consider the way that digital banks are aggregating consumer savings in an unprecedented fashion. Ant Financial, which now operates one of the largest money market funds in the world, is entrusted with the savings of hundreds of millions of Chinese consumers. The risks that presents are significant, especially for a relatively unproven institution.
数字运营模式在创造价值的同时也可能聚集与放大伤害。即使意图是积极的，潜在的负面影响也是巨大的。一个错误就能使一个庞大的数字网络遭受毁灭性的网络攻击。如果不加以检查，算法可能会在大规模范围内加剧偏见和错误信息。风险可能被大大放大。想想数字银行正以一种前所未有的方式聚合消费者储蓄。蚂蚁金服目前管理着全球最大的货币市场基金之一，它受托管理数亿中国消费者的储蓄。由此带来的风险是巨大的，尤其是对于一个相对未经验证的机构而言。

Digital scale, scope, and learning create a slew of new challenges—not just privacy and cybersecurity problems, but social turbulence resulting from market concentration, dislocations, and increased inequality. The institutions designed to keep an eye on business—regulatory bodies, for example—are struggling to keep up with all the rapid change.
数字的规模、范围和学习创造了一系列新的挑战——不仅仅是隐私和网络安全问题，还有由市场集中、就业调整和不平等加剧造成的社会动荡。例如，那些监督企业的机构，也就是监管机构，正在努力跟上所有这些快速的改变。

In an AI-driven world, once an offering’s fit with a market is ensured, user numbers, engagement, and revenues can skyrocket. Yet it’s increasingly obvious that unconstrained growth is dangerous. The potential for businesses that embrace digital operating models is huge, but the capacity to inflict widespread harm needs to be explicitly considered. Navigating these opportunities and threats will be a real test of leadership for both businesses and public institutions.
在人工智能驱动的世界里，一旦产品与市场相匹配，用户数、参与度和收入就会飙升。然而，越来越明显的是，无约束的增长是危险的。拥抱数字运营模式的企业潜力巨大，对它们造成广泛伤害的能力也需要认真对待。平衡好这些机遇和威胁将是对企业和公共机构领导力的真正考验。
​

About the Authors
作者介绍

Marco Iansiti is the David Sarnoff Professor of Business Administration at Harvard Business School, where he heads the Technology and Operations Management Unit and the Digital Initiative. He has advised many companies in the technology sector, including Microsoft, Facebook, and Amazon. He is a coauthor (with Karim Lakhani) of the book Competing in the Age of AI (Harvard Business Review Press, 2020).
Karim R. Lakhani is the Charles Edward Wilson Professor of Business Administration and the Dorothy and Michael Hintze Fellow at Harvard Business School and the founder and codirector of the Laboratory for Innovation Science at Harvard. He is a coauthor (with Marco Iansiti) of the book Competing in the Age of AI (Harvard Business Review Press, 2020).
卡里姆·r·拉克哈尼(Karim R. Lakhani) 哈佛商学院工商管理教授，哈佛大学创新科学实验室的创始人和联合主任。他是《人工智能时代的竞争》一书的合著者之一。


Original article: http://www.linuxvirtualserver.org/zh/lvs4.html

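This article describes the connection scheduling algorithms that IPVS, the IP load-balancing software of the LVS cluster, implements in the kernel. Because the service time of requests varies widely, it also presents a dynamic feedback load-balancing algorithm. That algorithm works together with the weighted connection scheduling algorithms in the kernel and adjusts server weights according to load information fed back dynamically, to further avoid load imbalance among servers.

1. Introduction
In the previous article we described the three IP load-balancing techniques implemented in the LVS cluster. They mainly address the scalability and transparency of the system: how the load balancer distributes requests efficiently to different servers, so that a cluster of independent computers acts as one virtual server and client applications interact with the cluster as if with a single high-performance server.
In the description below, we call the data communication between a client socket and a server socket a connection, whether it uses TCP or UDP. For UDP datagrams, the IPVS scheduler also creates a scheduling record with a timeout (for example, 5 minutes); within that time, UDP packets from the same address (IP address and port) are scheduled to the same server.
2. Connection Scheduling Algorithms in the Kernel
IPVS load balancing in the kernel works at connection granularity. In (non-persistent) HTTP, fetching each object from a web server requires its own TCP connection, so different requests from the same user are scheduled to different servers. This fine-grained scheduling therefore avoids, to some extent, the load imbalance among servers caused by bursts of access from a single user.
IPVS implements the following eight connection scheduling algorithms in the kernel: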
Round-Robin Scheduling
Weighted Round-Robin Scheduling
Least-Connection Scheduling
Weighted Least-Connection Scheduling
Locality-Based Least Connections Scheduling
Locality-Based Least Connections with Replication Scheduling
Destination Hashing Scheduling
Source Hashing Scheduling

Below we first describe how these eight algorithms work and their algorithm flows; how to use them will be covered in later articles.
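2.1. Round-Robin Scheduling
The Round-Robin Scheduling algorithm dispatches requests to the servers in turn: each scheduling step computes i = (i + 1) mod n and selects the i-th server. Its advantage is simplicity. It does not need to record the state of all current connections, so it is a stateless scheduling algorithm.
In the implementation we introduce one extra condition: a server whose weight is zero is considered unavailable and is not scheduled. This lets a server be taken out of service (for example, to mask a server failure or for system maintenance) and keeps the algorithm consistent with the weighted algorithms. The algorithm changes accordingly, and its flow is as follows:

Round-robin scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, …, Sn-1}, an indicator variable i denotes the server selected last time, and W(Si) denotes the weight of server Si. The variable i is initialized to n-1, where n > 0.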

    j = i;
    do {
        j = (j + 1) mod n;
        if (W(Sj) > 0) {
            i = j;
            return Si;
        }
    } while (j != i);
    return NULL;


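Round-robin scheduling assumes that all servers have the same processing capacity and ignores each server's current connection count and response time. The algorithm is simple, but it is unsuitable when the servers in the group differ in capacity, and when request service times vary widely it easily leads to load imbalance among servers.
Round-Robin DNS also resolves one domain name to multiple IP addresses in round-robin fashion, but its scheduling granularity is per name server. The caching of name resolutions by name servers keeps the round-robin resolution from taking effect, which leads to severe load imbalance among servers. The IPVS round-robin algorithm, by contrast, works per connection, and different connections from the same user are scheduled to different servers, so this fine-grained round robin is far superior to DNS round robin.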
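
The round-robin flow above translates almost line for line into C. The following is only a minimal sketch under stated assumptions: the `struct server` layout and the name `rr_pick` are hypothetical and are not taken from the IPVS source.

```c
#include <stddef.h>

/* Hypothetical server record; the field names are assumptions for illustration. */
struct server {
    const char *name;
    int weight;            /* W(Si); 0 means "do not schedule" */
};

/*
 * Round-robin pick. `*last` holds the index chosen last time and should start
 * at n - 1 so that the first call can return server 0.
 * Returns NULL when n <= 0 or when every weight is zero.
 */
struct server *rr_pick(struct server *s, int n, int *last)
{
    if (n <= 0)
        return NULL;
    int j = *last;
    do {
        j = (j + 1) % n;
        if (s[j].weight > 0) {
            *last = j;      /* remember the choice for the next call */
            return &s[j];
        }
    } while (j != *last);   /* one full lap without a usable server */
    return NULL;
}
```

The only state kept between calls is the index of the last choice, which is why the algorithm needs no per-connection bookkeeping.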
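2.2. Weighted Round-Robin Scheduling
The Weighted Round-Robin Scheduling algorithm handles servers of differing capacity. Each server's processing capacity is represented by a weight, and the default weight is 1. If server A has weight 1 and server B has weight 2, B is taken to have twice the processing capacity of A. Weighted round robin assigns requests to servers by weight and in round-robin order: servers with higher weights receive connections first and handle more connections than servers with lower weights, and servers with equal weights handle equal numbers of connections. The flow is as follows:

Weighted round-robin scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, …, Sn-1}, W(Si) denotes the weight of server Si, an indicator variable i denotes the server selected last time, an indicator variable cw denotes the current scheduling weight, max(S) denotes the maximum weight of all servers in S, and gcd(S) denotes the greatest common divisor of all server weights in S. The variable i is initialized to -1 and cw to zero.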

    while (true) {
        i = (i + 1) mod n;
        if (i == 0) {
            cw = cw - gcd(S);
            if (cw <= 0) {
                cw = max(S);
                if (cw == 0)
                    return NULL;
            }
        }
        if (W(Si) >= cw)
            return Si;
    }



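For example, with three servers A, B, and C of weights 4, 3, and 2, the scheduling sequence within one scheduling period (mod sum(W(Si))) is AABABCABC. Weighted round robin is still fairly simple and efficient. When request service times vary widely, however, weighted round robin alone can still lead to load imbalance among servers.
From the flow above we can see that a server with zero weight is never scheduled. When all server weights are zero, that is, W(Si) = 0 for every i, no server is available, the algorithm returns NULL, and all new connections are dropped. Weighted round robin does not need to record the state of current connections either, so it is also a stateless scheduling algorithm.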
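
To make the weighted round-robin flow concrete, here is a small C sketch. The types `wrr_server` and `wrr_state` and the helper `gcd2` are hypothetical names chosen for illustration; max(S) and gcd(S) are recomputed on every call for brevity, which a real scheduler would presumably cache.

```c
#include <stddef.h>

struct wrr_server {
    char name;
    int weight;                 /* W(Si), assumed non-negative */
};

struct wrr_state {
    int i;                      /* index chosen last time, starts at -1 */
    int cw;                     /* current scheduling weight, starts at 0 */
};

/* Greatest common divisor; gcd2(0, w) == w, so 0 works as a starting value. */
int gcd2(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

struct wrr_server *wrr_pick(struct wrr_server *s, int n, struct wrr_state *st)
{
    int max = 0, g = 0;
    for (int k = 0; k < n; k++) {
        if (s[k].weight > max)
            max = s[k].weight;
        g = gcd2(g, s[k].weight);
    }
    if (n <= 0 || max == 0)
        return NULL;            /* no server or all weights are zero */
    for (;;) {
        st->i = (st->i + 1) % n;
        if (st->i == 0) {       /* start of a new lap: lower the bar */
            st->cw -= g;
            if (st->cw <= 0)
                st->cw = max;
        }
        if (s[st->i].weight >= st->cw)
            return &s[st->i];
    }
}
```

Starting from `{-1, 0}` with weights 4, 3, and 2, nine consecutive calls return A, A, B, A, B, C, A, B, C, which matches the sequence AABABCABC given above.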
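2.3. Least-Connection Scheduling
The Least-Connection Scheduling algorithm assigns a new connection request to the server with the fewest current connections. It is a dynamic scheduling algorithm that estimates server load from the number of active connections on each server. The scheduler records the number of established connections per server: when a request is scheduled to a server, its connection count is incremented by 1; when a connection terminates or times out, the count is decremented by 1.
In the implementation we again treat a server with zero weight as unavailable and do not schedule it. The flow is as follows:

Least-connection scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, ..., Sn-1}, W(Si) denotes the weight of server Si, and C(Si) denotes the current number of connections of server Si.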

    for (m = 0; m < n; m++) {
        if (W(Sm) > 0) {
            for (i = m+1; i < n; i++) {
                if (W(Si) <= 0)
                    continue;
                if (C(Si) < C(Sm))
                    m = i;
            }
            return Sm;
        }
    }
    return NULL;


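When the servers have the same processing capacity, least-connection scheduling spreads requests of widely varying load smoothly across the servers, so the long-running requests cannot all be sent to the same server. When the servers differ in capacity, however, the algorithm is not ideal. After handling a request, a TCP connection enters the TIME_WAIT state, which typically lasts 2 minutes, and during that time the connection still occupies server resources. The following can then happen: a high-capacity server has already processed the connections it received and they sit in TIME_WAIT, while a low-capacity server is busy processing its connections yet keeps receiving new connection requests.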
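
The least-connection flow can be sketched in C as a single pass over the servers. This is an illustrative rewrite, not the kernel code: `lc_server` and `lc_pick` are hypothetical names, and the nested loop of the flow is collapsed into one loop that yields the same choice (the lowest-indexed server among those with the fewest connections and a positive weight).

```c
#include <stddef.h>

struct lc_server {
    int weight;                 /* W(Si); 0 disables the server */
    int conns;                  /* C(Si): currently established connections */
};

struct lc_server *lc_pick(struct lc_server *s, int n)
{
    struct lc_server *best = NULL;
    for (int k = 0; k < n; k++) {
        if (s[k].weight <= 0)
            continue;           /* unavailable servers are skipped */
        if (best == NULL || s[k].conns < best->conns)
            best = &s[k];       /* strict '<' keeps the earliest server on ties */
    }
    return best;                /* NULL when no server has a positive weight */
}
```

The caller is expected to increment `conns` on the returned server and decrement it when the connection ends or times out, as described in the text.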
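2.4. Weighted Least-Connection Scheduling
The Weighted Least-Connection Scheduling algorithm is a superset of least-connection scheduling in which each server's processing capacity is represented by a weight. The default weight is 1, and the system administrator can set server weights dynamically. When scheduling a new connection, the algorithm tries to keep each server's number of established connections proportional to its weight. The flow is as follows:

Weighted least-connection scheduling algorithm flow


Suppose there is a set of servers S = {S0, S1, ..., Sn-1}, W(Si) denotes the weight of server Si, and C(Si) denotes the current number of connections of server Si. The sum of the current connections of all servers is CSUM = ΣC(Si)  (i = 0, 1, ..., n-1). The new connection request is sent to server Sm if and only if Sm satisfies
(C(Sm) / CSUM) / W(Sm) = min { (C(Si) / CSUM) / W(Si) }  (i = 0, 1, ..., n-1)
where W(Si) is nonzero.
Because CSUM is a constant during this search, the condition simplifies to
C(Sm) / W(Sm) = min { C(Si) / W(Si) }  (i = 0, 1, ..., n-1)
where W(Si) is nonzero.

Division takes more CPU cycles than multiplication, floating-point division is not allowed in the Linux kernel, and the server weights are all greater than zero, so the test C(Sm) / W(Sm) > C(Si) / W(Si) can be further optimized to C(Sm)*W(Si) > C(Si)*W(Sm). This form also guarantees that a server with zero weight is not scheduled. The algorithm therefore only needs to execute the following flow.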

    for (m = 0; m < n; m++) {
        if (W(Sm) > 0) {
            for (i = m+1; i < n; i++) {
                if (C(Sm)*W(Si) > C(Si)*W(Sm))
                    m = i;
            }
            return Sm;
        }
    }
    return NULL;
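
As an illustration of the cross-multiplication trick, the following C sketch compares C(S)/W(S) ratios without any division. The names `wlc_server` and `wlc_pick` are hypothetical, and the products are widened to `long long` as a precaution against overflow, which is an addition of this sketch rather than something stated above.

```c
#include <stddef.h>

struct wlc_server {
    int weight;                 /* W(Si); 0 disables the server */
    int conns;                  /* C(Si) */
};

struct wlc_server *wlc_pick(struct wlc_server *s, int n)
{
    struct wlc_server *best = NULL;
    for (int k = 0; k < n; k++) {
        if (s[k].weight <= 0)
            continue;
        /* C(best)/W(best) > C(k)/W(k)  <=>  C(best)*W(k) > C(k)*W(best) */
        if (best == NULL ||
            (long long)best->conns * s[k].weight >
            (long long)s[k].conns * best->weight)
            best = &s[k];
    }
    return best;
}
```

For example, with server A (weight 1, 3 connections) and server B (weight 2, 4 connections), the ratios are 3 and 2, so B is chosen even though it holds more connections in absolute terms.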


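2.5. Locality-Based Least Connections Scheduling
The Locality-Based Least Connections Scheduling algorithm (LBLC below) balances load by the destination IP address of request packets. It is currently used mainly in cache clusters, because in a cache cluster the destination IP address of client request packets varies. Assuming that any back-end server can handle any request, the design goal is to schedule requests for the same destination IP address to the same server while the server loads are roughly balanced. This improves each server's access locality and main-memory cache hit rate, and thereby the processing capacity of the whole cluster.
LBLC first looks up the server most recently used for the request's destination IP address. If that server is available and not overloaded, the request is sent to it. If no such server exists, or the server is overloaded and some other server is running at less than half of its workload, an available server is chosen by the least-connections principle and the request is sent to it. The detailed flow is as follows:

LBLC scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, ..., Sn-1}, W(Si) denotes the weight of server Si, and C(Si) denotes the current number of connections of server Si. ServerNode[dest_ip] is an associative variable giving the server node for a destination IP address; it is usually implemented with a hash table. WLC(S) denotes the weighted least-connection server in set S, that is, the weighted least-connection scheduling described above. Now is the current system time.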

    if (ServerNode[dest_ip] is NULL) then {
        n = WLC(S);
        if (n is NULL) then return NULL;
        ServerNode[dest_ip].server = n;
    } else {
        n = ServerNode[dest_ip].server;
        if ((n is dead) OR
            (C(n) > W(n) AND
             there is a node m with C(m) < W(m)/2)) then {
            n = WLC(S);
            if (n is NULL) then return NULL;
            ServerNode[dest_ip].server = n;
        }
    }
    ServerNode[dest_ip].lastuse = Now;
    return n;


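In addition, the associative variable ServerNode[dest_ip] needs periodic garbage collection to reclaim expired destination-IP-to-server entries. An entry is expired when the current time (the implementation uses the system clock tick count, jiffies) minus its last-use time exceeds the configured expiration time; the system default is 24 hours.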
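
The LBLC flow can be sketched in C as follows. Everything named here is an assumption made for illustration: the table is a fixed array of `LBLC_SLOTS` entries indexed by `dest_ip % LBLC_SLOTS` in which colliding addresses simply overwrite each other, "dead" is approximated by a zero weight, and garbage collection is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define LBLC_SLOTS 256          /* illustrative table size */

struct lblc_server { int weight; int conns; };

struct lblc_entry {
    uint32_t dest_ip;
    struct lblc_server *server; /* NULL means the slot is empty */
    long lastuse;
};

/* Weighted least-connection choice over all servers (see section 2.4). */
struct lblc_server *lblc_wlc(struct lblc_server *s, int n)
{
    struct lblc_server *best = NULL;
    for (int k = 0; k < n; k++) {
        if (s[k].weight <= 0)
            continue;
        if (best == NULL ||
            (long long)best->conns * s[k].weight >
            (long long)s[k].conns * best->weight)
            best = &s[k];
    }
    return best;
}

/* True when some server satisfies C(m) < W(m)/2, tested as 2*C(m) < W(m). */
int lblc_has_idle(const struct lblc_server *s, int n)
{
    for (int k = 0; k < n; k++)
        if (2 * s[k].conns < s[k].weight)
            return 1;
    return 0;
}

struct lblc_server *lblc_pick(struct lblc_entry *tab, struct lblc_server *s,
                              int n, uint32_t dest_ip, long now)
{
    struct lblc_entry *e = &tab[dest_ip % LBLC_SLOTS];
    struct lblc_server *srv =
        (e->server != NULL && e->dest_ip == dest_ip) ? e->server : NULL;

    if (srv == NULL || srv->weight <= 0 ||
        (srv->conns > srv->weight && lblc_has_idle(s, n))) {
        srv = lblc_wlc(s, n);   /* fall back to weighted least-connection */
        if (srv == NULL)
            return NULL;
        e->dest_ip = dest_ip;
        e->server = srv;
    }
    e->lastuse = now;
    return srv;
}
```

Requests for one destination keep going to the same server until that server is overloaded while another is less than half loaded; only then is the mapping replaced.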
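2.6. Locality-Based Least Connections with Replication Scheduling
The Locality-Based Least Connections with Replication Scheduling algorithm (LBLCR below) also balances load by destination IP address and is currently used mainly in cache clusters. It differs from LBLC in that it maintains a mapping from a destination IP address to a set of servers, whereas LBLC maintains a mapping from a destination IP address to a single server. For requests to a "hot" site, one cache server may be too busy to handle them all. LBLC then picks a cache server from all cache servers by the least-connections principle and maps the hot site to it; soon that cache server is overloaded too, and the process repeats to pick yet another cache server. The hot site's content may thus end up on every cache server, which lowers the efficiency of the cache servers. LBLCR instead maps a hot site to a set of cache servers. When the request load for the hot site grows, cache servers are added to the set to handle the growing load; when the load drops, the number of cache servers in the set is reduced. The hot site's content is then unlikely to appear on every cache server, which improves the efficiency of the cache cluster.
LBLCR first finds the server set for the request's destination IP address and picks a server from that set by the least-connections principle. If the server is not overloaded, the request is sent to it. If it is overloaded, a server is picked from the whole cluster by the least-connections principle, added to the server set, and sent the request. In addition, when the server set has not been modified for some time, the busiest server is removed from the set to reduce the degree of replication. The flow is as follows:

LBLCR scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, ..., Sn-1}, W(Si) denotes the weight of server Si, and C(Si) denotes the current number of connections of server Si. ServerSet[dest_ip] is an associative variable giving the set of servers for a destination IP address; it is usually implemented with a hash table. WLC(S) denotes the weighted least-connection server in set S, that is, the weighted least-connection scheduling described above; WGC(S) denotes the weighted greatest-connection server in set S. Now is the current system time, lastmod is the time the set was last modified, and T is the configured interval for adjusting the set.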

    if (ServerSet[dest_ip] is NULL) then {
        n = WLC(S);
        if (n is NULL) then return NULL;
        add n into ServerSet[dest_ip];
    } else {
        n = WLC(ServerSet[dest_ip]);
        if ((n is NULL) OR
            (C(n) > W(n) AND
             there is a node m with C(m) < W(m)/2)) then {
            n = WLC(S);
            if (n is NULL) then return NULL;
            add n into ServerSet[dest_ip];
        } else
        if (|ServerSet[dest_ip]| > 1 AND
            Now - ServerSet[dest_ip].lastmod > T) then {
            m = WGC(ServerSet[dest_ip]);
            remove m from ServerSet[dest_ip];
        }
    }
    ServerSet[dest_ip].lastuse = Now;
    if (ServerSet[dest_ip] changed) then
        ServerSet[dest_ip].lastmod = Now;
    return n;


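In addition, the associative variable ServerSet[dest_ip] also needs periodic garbage collection to reclaim expired destination-IP-to-server entries. An entry is expired when the current time (the implementation uses the system clock tick count, jiffies) minus its last-use time (lastuse) exceeds the configured expiration time; the system default is 24 hours.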
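
The set maintenance in LBLCR is easier to follow on a single ServerSet entry. The sketch below is a simplification under stated assumptions: the set is a bitmask over at most 32 server indexes, the lookup from dest_ip to its set is omitted, an empty set plays the role of "ServerSet[dest_ip] is NULL", and all names (`lblcr_set`, `lblcr_pick`, and so on) are hypothetical.

```c
#include <stdint.h>

struct lblcr_server { int weight; int conns; };

/* One ServerSet[dest_ip] entry; bit k set means server k is a member. */
struct lblcr_set {
    uint32_t members;
    long lastuse;
    long lastmod;
};

/* Index of the weighted least (want_max == 0) or greatest (want_max != 0)
 * connection server among `mask`; -1 if none has a positive weight. */
int lblcr_extreme(const struct lblcr_server *s, int n, uint32_t mask, int want_max)
{
    int best = -1;
    for (int k = 0; k < n; k++) {
        if (((mask >> k) & 1u) == 0 || s[k].weight <= 0)
            continue;
        if (best < 0) {
            best = k;
            continue;
        }
        long long lhs = (long long)s[best].conns * s[k].weight;
        long long rhs = (long long)s[k].conns * s[best].weight;
        if (want_max ? (lhs < rhs) : (lhs > rhs))
            best = k;
    }
    return best;
}

/* True when some server satisfies C(m) < W(m)/2. */
int lblcr_has_idle(const struct lblcr_server *s, int n)
{
    for (int k = 0; k < n; k++)
        if (2 * s[k].conns < s[k].weight)
            return 1;
    return 0;
}

/* Returns the chosen server index, or -1 if no server is available (n <= 32). */
int lblcr_pick(struct lblcr_set *set, const struct lblcr_server *s, int n,
               long now, long T)
{
    uint32_t all = (n >= 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);
    uint32_t before = set->members;
    int pick = lblcr_extreme(s, n, set->members, 0);

    if (pick < 0 || (s[pick].conns > s[pick].weight && lblcr_has_idle(s, n))) {
        pick = lblcr_extreme(s, n, all, 0);      /* WLC over the whole cluster */
        if (pick < 0)
            return -1;
        set->members |= 1u << pick;              /* grow the set */
    } else if ((set->members & (set->members - 1u)) != 0 &&
               now - set->lastmod > T) {
        int busiest = lblcr_extreme(s, n, set->members, 1);
        set->members &= ~(1u << busiest);        /* shrink the set */
    }
    set->lastuse = now;
    if (set->members != before)
        set->lastmod = now;
    return pick;
}
```

The set grows only when its least-loaded member is itself overloaded, and it shrinks by one server each time it has gone unmodified for longer than T.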
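2.7. Destination Hashing Scheduling
The Destination Hashing Scheduling algorithm also balances load by destination IP address, but it is a static mapping algorithm: a hash function maps a destination IP address to one server.
The algorithm uses the request's destination IP address as the hash key to look up the corresponding server in a statically assigned hash table. If that server is available and not overloaded, the request is sent to it; otherwise the algorithm returns NULL. The flow is as follows:

Destination hashing scheduling algorithm flow

Suppose there is a set of servers S = {S0, S1, ..., Sn-1}, W(Si) denotes the weight of server Si, and C(Si) denotes the current number of connections of server Si. ServerNode[] is a hash table with 256 buckets. In general the number of servers is far smaller than 256, and the table size can of course be adjusted. The algorithm is initialized by placing all servers into the ServerNode table sequentially and cyclically. A server is considered overloaded if its number of connections is greater than twice its weight.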

    n = ServerNode[hashkey(dest_ip)];
    if ((n is dead) OR
        (W(n) == 0) OR
        (C(n) > 2*W(n))) then
        return NULL;
    return n;


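The implementation uses a prime-multiplication hash function; multiplying by a prime spreads the hash key values as evenly as possible. The function is as follows:

Prime-multiplication hash function

    static inline unsigned hashkey(unsigned int dest_ip)
    {
        /* HASH_TAB_MASK is the table size minus one (255 for 256 buckets) */
        return (dest_ip * 2654435761UL) & HASH_TAB_MASK;
    }

Here 2654435761UL is a prime between 2 and 2^32 (4294967296) that lies close to the golden-ratio point:
(sqrt(5) - 1) / 2 =  0.618033989
2654435761 / 4294967296 = 0.618033987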
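
As a self-contained illustration of the hash and of the "sequential and cyclic" table fill, consider the following C sketch. The `DH_`-prefixed macros and the functions `dh_hashkey` and `dh_fill` are hypothetical names; the table stores server indexes rather than pointers, and the 32-bit multiplication is assumed to wrap modulo 2^32.

```c
#include <stdint.h>

#define DH_TAB_BITS 8
#define DH_TAB_SIZE (1u << DH_TAB_BITS)   /* 256 buckets, as in the text */
#define DH_TAB_MASK (DH_TAB_SIZE - 1u)

/* Multiply by the prime and keep the low DH_TAB_BITS bits. */
unsigned dh_hashkey(uint32_t dest_ip)
{
    return (unsigned)((dest_ip * 2654435761u) & DH_TAB_MASK);
}

/* Place server indexes 0..n_servers-1 into the table in order, cyclically.
 * A bucket holds -1 when there are no servers at all. */
void dh_fill(int table[DH_TAB_SIZE], int n_servers)
{
    for (unsigned b = 0; b < DH_TAB_SIZE; b++)
        table[b] = (n_servers > 0) ? (int)(b % (unsigned)n_servers) : -1;
}
```

Because 2654435761 is 0x9E3779B1, `dh_hashkey(1)` yields 0xB1, that is 177, and `dh_hashkey(2)` yields 98; consecutive addresses therefore land in widely separated buckets.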


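2.8. Source Hashing Scheduling
The Source Hashing Scheduling algorithm is the exact counterpart of destination hashing: it uses the request's source IP address as the hash key to look up the server in a statically assigned hash table. If that server is available and not overloaded, the request is sent to it; otherwise the algorithm returns NULL. It uses the same hash function as destination hashing, and its flow is essentially the same except that the destination IP address is replaced by the source IP address, so we do not repeat it here.
In practice, source hashing and destination hashing can be used together in a firewall cluster, where they guarantee a unique entry and exit point for the whole system.
3. Dynamic Feedback Load-Balancing Algorithm
The dynamic feedback load-balancing algorithm takes the servers' real-time load and responsiveness into account and keeps adjusting the proportion of requests each server handles, so that overloaded servers do not continue to receive large numbers of requests. This raises the throughput of the whole system. Figure 1 shows the environment in which the algorithm works. A Monitor Daemon process runs on the load balancer and monitors and collects load information from each server. From several kinds of load information it computes an aggregate load value. From each server's aggregate load and current weight it then computes a new set of weights. If the difference between a server's new weight and its current weight exceeds a configured threshold, the Monitor Daemon sets the new weight in the IPVS scheduler in the kernel. Connection scheduling in the kernel generally uses weighted round-robin or weighted least-connection scheduling.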

Figure 1: Working environment of the dynamic feedback load-balancing algorithm

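3.1. Connection Scheduling
When clients access network services over TCP connections, the time a service takes and the computing resources it consumes vary enormously and depend on many factors: the type of service requested, the current network bandwidth, and the current utilization of server resources. Heavy requests involve compute-intensive queries, database access, or very long response streams, while light requests often only read an HTML page or perform a very simple computation.
This wide variation in processing time can skew server utilization, that is, leave the load unbalanced across servers. For example, suppose a Web page consists of files A, B, C and D, where D is a large image file, and a browser opens four connections to fetch them. When many users request the page at the same time, in the most extreme case all requests for D go to the same server. Some servers may therefore be overloaded while others sit almost idle. At the same time, servers that are already too busy and have long request queues keep receiving new requests. In turn, clients wait a long time and perceive the quality of service as poor.
3.1.1. Simple Connection Scheduling
Simple connection scheduling can cause server skew. In the example above, if round-robin scheduling is used and the cluster has exactly four servers, one server will always receive the requests for file D. Such a scheduling policy leads to low utilization of the system as a whole: some resources are exhausted, forcing clients to wait a long time, while other resources stay idle.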
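The skew in this example can be reproduced with a small simulation. It is a sketch under a simplifying assumption that is not stated in the text: each page load issues its four requests back to back in the order A, B, C, D, with no interleaving between users. The function name is hypothetical.

```c
#include <stddef.h>

#define NUM_SERVERS 4
#define FILES_PER_PAGE 4   /* A, B, C and D; D is the large image */

/* Count how many requests for file D each server receives when `users`
   page loads are dispatched by plain round-robin. */
void count_d_requests(size_t users, unsigned d_hits[NUM_SERVERS])
{
    size_t next = 0;   /* round-robin cursor */
    for (size_t s = 0; s < NUM_SERVERS; s++)
        d_hits[s] = 0;
    for (size_t u = 0; u < users; u++) {
        for (size_t f = 0; f < FILES_PER_PAGE; f++) {
            size_t server = next;
            next = (next + 1) % NUM_SERVERS;
            if (f == FILES_PER_PAGE - 1)   /* the request for file D */
                d_hits[server]++;
        }
    }
}
```

With four servers and four files per page the cursor returns to the same position after every page load, so every request for D reaches the last server and the other three never see one.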
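3.1.2. Characteristics of Real TCP/IP Traffic
Reference [1] shows that network traffic arrives in waves: a fairly long period of light traffic is followed by a period of heavy traffic, then light traffic again, recurring periodically like waves. References [2,3,4,5] reveal that traffic on WANs and LANs is self-similar, and that Web access streams are self-similar as well. This calls for a dynamic feedback mechanism that uses the state of the server group to cope with the self-similarity of the access stream.
3.2. Dynamic Feedback Load Balancing Mechanism
Put simply, TCP/IP traffic consists of many short transactions and some long ones, and the long transactions account for a large share of the total work. We therefore want a load balancing algorithm that keeps requests for long transactions from always being assigned to the same few machines and instead breaks a bursty distribution into a relatively even one.
We propose a dynamic feedback load balancing mechanism that controls the assignment of new connections and thereby the load on each server. For example, the IPVS scheduler in the kernel uses the Weighted Round-Robin Scheduling algorithm to schedule new connections, and a Monitor Daemon runs in user space on the load scheduler. The Monitor Daemon periodically monitors and collects load information from each server and computes an aggregate load value from it. From each server's aggregate load and current weight, the Monitor Daemon computes a new set of weights. When the aggregate load indicates that a server is busy, the new weight is smaller than the current one, so fewer new requests are assigned to that server. When the aggregate load indicates low utilization, the new weight is larger than the current one, so more new requests are assigned to it. If the difference between the new weight and the current weight is greater than the configured threshold, the Monitor Daemon sets that server's weight in the IPVS scheduler in the kernel. After a fixed interval (for example 2 seconds), the Monitor Daemon queries the servers again and adjusts their weights accordingly, and this repeats periodically. This is a negative feedback mechanism that keeps the servers well utilized.
In weighted round-robin scheduling, when a server's weight is zero, established connections continue to be served by that server but no new connections are assigned to it. A system administrator can set a server's weight to zero to quiesce it; once all existing connections have finished, the administrator can take the server out of the cluster for maintenance. Maintenance such as hardware upgrades and software updates is indispensable to any system, so the ability to quiesce a server with a zero weight is important. The dynamic feedback load balancing mechanism must therefore preserve this feature: when a server's weight is zero, we do not adjust it.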
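3.3. Aggregate Load
The aggregate load is computed from two main classes of load information: input metrics and server metrics. Input metrics are collected on the scheduler, while server metrics are the various kinds of load information gathered on the servers themselves. The aggregate load is meant to reflect a server's current load fairly accurately. Because different applications load servers differently, we introduce a coefficient for each piece of load information to express its importance in the aggregate load. The system administrator adjusts these coefficients to suit the application and also sets the interval at which load information is collected.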
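The input metric is the ratio of the number of new connections a server receives per unit time to the average number of connections. It is collected on the scheduler, so it is an estimate of the server's load. The scheduler keeps a counter of connections received by each server. For server Si, the counter values Ci1 and Ci2 at times T1 and T2 give the number of new connections received in the interval T2-T1 as Ni = Ci2 - Ci1. This yields the set {Ni} of new connections received by the servers in the interval T2-T1. The input metric INPUTi of server Si is the ratio of its new connections to the average number of connections received by the n servers:

$$\mathrm{INPUT}_i = \frac{N_i}{\frac{1}{n}\sum_{j=1}^{n} N_j}$$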
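The input metric can be computed from two snapshots of the per-server connection counters, as sketched below. The function name and the error convention (returning -1 when no new connections arrived, so that the average is zero) are assumptions made for the example.

```c
#include <stddef.h>

/* Compute INPUT_i = N_i / (sum of N_j / n), where N_i = c2[i] - c1[i]
   is the number of new connections server i received between two snapshots.
   Returns 0 on success, -1 if there are no servers or no new connections. */
int input_index(const unsigned long *c1, const unsigned long *c2,
                size_t n, double *input)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += c2[i] - c1[i];
    if (n == 0 || total == 0)
        return -1;
    double avg = (double)total / (double)n;
    for (size_t i = 0; i < n; i++)
        input[i] = (double)(c2[i] - c1[i]) / avg;
    return 0;
}
```

For counters that grow by 10, 20 and 30 connections, the average is 20 and the metrics are 0.5, 1.0 and 1.5: the first server received half the average share and the third received one and a half times as much.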

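Server metrics record the various kinds of load information on a server, such as its current CPU load LOADi, current disk usage Di, current memory utilization Mi and current number of processes Pi. There are two ways to obtain this information. One is to run an SNMP (Simple Network Management Protocol) service on every server and have the Monitor Daemon on the scheduler query each server through SNMP. The other is to implement and run an information-collecting Agent on each server, which periodically reports load information to the Monitor Daemon. If a server does not respond within the configured interval, the Monitor Daemon considers it unreachable and sets its weight in the scheduler to zero, so no new connections are assigned to it; if the server responds the next time, its weight is adjusted again. The data are then processed so that they fall in the interval [0, ∞), where 1 means the load is just right, a value greater than 1 means the server is overloaded, and a value less than 1 means the server is lightly loaded. The adjusted values are DISKi, MEMORYi and PROCESSi.
Another important server metric is the response time of the service the server provides, which reflects fairly well the length of the request queue on the server and the time needed to process a request. The Monitor Daemon on the scheduler accesses the service as a client and measures its response time. For example, to measure the delay of fetching an HTML page from a Web server, the Monitor Daemon only has to send a "GET /" request to each server and record the response time. If a server does not respond within the configured interval, the Monitor Daemon considers it unreachable and sets its weight in the scheduler to zero. The response time is adjusted in the same way as above to give RESPONSEi.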
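Here we introduce a set of dynamically adjustable coefficients Ri that express the importance of each load parameter, where ΣRi = 1. The aggregate load is computed with the following formula:

$$\mathrm{AGGREGATE\_LOAD}_i = R_1\,\mathrm{INPUT}_i + R_2\,\mathrm{LOAD}_i + R_3\,\mathrm{DISK}_i + R_4\,\mathrm{MEMORY}_i + R_5\,\mathrm{PROCESS}_i + R_6\,\mathrm{RESPONSE}_i$$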

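For example, in a Web server cluster we use the coefficients {0.1, 0.3, 0.1, 0.1, 0.1, 0.3}, treating the server's CPU load and the request response time as more important than the other parameters. If the current coefficients Ri do not reflect the application's load well, the system administrator can keep refining them until a set is found that fits the application.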
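The weighted sum can be sketched as below. The `load_info` structure and function name are hypothetical, and the order in which the six coefficients are applied (input, CPU load, disk, memory, processes, response time) is an assumption; it is chosen so that the example coefficients put 0.3 on CPU load and response time, as the text describes.

```c
/* Normalized load metrics for one server; 1.0 means "just right". */
struct load_info {
    double input;      /* INPUT_i */
    double cpu;        /* LOAD_i */
    double disk;       /* DISK_i */
    double memory;     /* MEMORY_i */
    double process;    /* PROCESS_i */
    double response;   /* RESPONSE_i */
};

/* Example coefficients from the text; they sum to 1. */
static const double WEB_COEFF[6] = {0.1, 0.3, 0.1, 0.1, 0.1, 0.3};

/* Weighted sum of the six metrics, using coefficients r[0..5]. */
double aggregate_load(const struct load_info *l, const double r[6])
{
    return r[0] * l->input   + r[1] * l->cpu
         + r[2] * l->disk    + r[3] * l->memory
         + r[4] * l->process + r[5] * l->response;
}
```

Because the coefficients sum to 1, a server whose six metrics all equal 1 has an aggregate load of 1; doubling only the CPU metric raises it to 1.3.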
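3.4. Weight Calculation
When a server is put into service in the cluster, the system administrator assigns it an initial weight DEFAULT_WEIGHTi, which the IPVS scheduler in the kernel also uses at first. The weight is then adjusted as the server's load changes. To keep the weight from growing very large, we restrict it to the range [DEFAULT_WEIGHTi, SCALE*DEFAULT_WEIGHTi]; SCALE is adjustable and defaults to 10.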

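In the formula, 0.95 is the system utilization we want to reach and A is an adjustable coefficient (default 5). When the aggregate load is 0.95, the server's weight is unchanged; when it is greater than 0.95, the weight decreases; when it is less than 0.95, the weight increases. If the new weight is greater than SCALE*DEFAULT_WEIGHTi, we set it to SCALE*DEFAULT_WEIGHTi. If the new weight differs from the current weight by more than the configured threshold, the new weight is written to the IPVS scheduling parameters in the kernel; otherwise it is not, which avoids the overhead of interrupting IPVS scheduling. This is a negative feedback formula that drives the weight toward a stable point: once the system reaches the desired utilization, the weight no longer changes.
In actual use, if the weights of all servers are below their DEFAULT_WEIGHT, the whole cluster is overloaded and new server nodes should be added to take over part of the load. Conversely, if the weights of all servers are close to SCALE*DEFAULT_WEIGHT, the system is lightly loaded.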
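One adjustment step can be sketched as follows. The sketch assumes an update of the form w ← w + A·∛(0.95 − load); the cube root is an assumption of this example, and any increasing function that is zero at zero would give the sign behaviour described above. The structure, the function names, the floor of 1 (so that a load-driven weight never reaches the quiesce value zero) and the rounding are likewise illustrative choices.

```c
/* Real cube root by Newton's method; stands in for cbrt() from <math.h>. */
static double cube_root(double x)
{
    if (x == 0.0)
        return 0.0;
    double a = x < 0 ? -x : x;
    double y = a > 1.0 ? a : 1.0;        /* start at or above the root */
    for (int i = 0; i < 60; i++)
        y = (2.0 * y + a / (y * y)) / 3.0;
    return x < 0 ? -y : y;
}

struct wt_params {
    double target;    /* desired utilization, 0.95 in the text */
    double a;         /* coefficient A, default 5 */
    int scale;        /* SCALE, default 10 */
    int threshold;    /* smallest change worth writing to the kernel */
};

/* Returns the weight the server should have after this step and sets
   *push to 1 when that weight should be written to the IPVS scheduler. */
int next_weight(int current, int default_weight, double load,
                const struct wt_params *p, int *push)
{
    *push = 0;
    if (current == 0)                    /* quiesced by the administrator */
        return 0;
    double w = current + p->a * cube_root(p->target - load);
    int nw = (int)(w + 0.5);             /* round to nearest for w > 0 */
    int hi = p->scale * default_weight;
    if (nw > hi)
        nw = hi;                         /* upper limit from the text */
    if (nw < 1)
        nw = 1;                          /* assumed floor, see lead-in */
    int diff = nw > current ? nw - current : current - nw;
    if (diff > p->threshold) {
        *push = 1;
        return nw;
    }
    return current;                      /* change too small: keep weight */
}
```

A Monitor Daemon loop would call `next_weight` once per server per interval and issue a kernel update only when `*push` is set, which is what keeps small fluctuations from disturbing the scheduler.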
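3.5. An Implementation Example

The server weight adjustment range is [DEFAULT_WEIGHTi, 10*DEFAULT_WEIGHTi], A is DEFAULT_WEIGHTi/2, and the weight adjustment threshold is DEFAULT_WEIGHTi/4. The target system utilization is 1. Piranha queries the CPU load of each server every 20 seconds and then computes and adjusts the weights.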
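4. Summary
This article has described the eight connection scheduling algorithms that IP Virtual Server implements in the kernel:
Round-Robin Scheduling
Weighted Round-Robin Scheduling
Least-Connection Scheduling
Weighted Least-Connection Scheduling
Locality-Based Least Connections Scheduling
Locality-Based Least Connections with Replication Scheduling
Destination Hashing Scheduling
Source Hashing Scheduling

Because the service time of requests varies widely, the connection scheduling algorithms in the kernel can easily leave the servers skewed. We therefore presented a dynamic feedback load balancing algorithm that works together with the weighted connection scheduling algorithms in the kernel: it uses the load information fed back from the servers to adjust their weights, and thereby the proportion of requests each server handles, so that the load does not become unbalanced across servers. The dynamic feedback algorithm avoids server skew reasonably well, improves resource utilization, and so raises the throughput of the system.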

References
[1] William Stallings, Viewpoint: Self-similarity upsets data traffic assumptions, IEEE Spectrum, January 1997.
[2] Kihong Park, Gitae Kim, Mark Crovella, "On the Effect of Traffic Self-similarity on Network Performance", In Proceedings of the 1997 SPIE International Conference on Performance and Control of Network Systems, 1997.
[3] Nicolas D. Georganas, Self-Similar ("Fractal") Traffic in ATM Networks, In Proceedings of the 2nd International Workshop on Advanced Teleservices and High-Speed Communications Architectures (IWACA'94), pages 1-7, Heidelberg, Germany, September 1994.
[4] Mark Crovella and Azer Bestavros, Explaining World Wide Web Traffic Self-Similarity. Technical report, Boston University, October 1995, TR-95-015.
[5] Bruce A. Mah. An Empirical Model of HTTP Network Traffic. In Proceedings of INFOCOM 97, Kobe, Japan, April 1997.
[6] Red Hat High Availability Server Project, http://ha.redhat.com/
[7] The Linux Virtual Server Project, http://www.LinuxVirtualServer.org/

...