By David Weinberger
AI Outside In is a column by PAIR’s writer-in-residence, David Weinberger, who offers his outsider perspective on key ideas in machine learning. His opinions are his own and do not necessarily reflect those of Google.
Machine learning’s superpower
When we humans argue over what’s fair, sometimes it’s about principles, sometimes about consequences, and sometimes about trade-offs. But machine learning systems can bring us to think about fairness, and many other things, in terms of three interrelated factors: two ways the machine learning (ML) can go wrong, and the most basic way of adjusting the balance between these potential errors. The types of error you’ll prefer to live with depend entirely on the sort of fairness (defined mathematically) you’re aiming your ML system at. But one way or another, you have to decide.
At their heart, many ML systems are classifiers. They ask: Should this photo go into the bucket of beach photos or not? Should this dark spot on a medical scan be classified as a fibrous growth or something else? Should this book go on the “Recommended for You” or “You’re Gonna Hate It” list? ML’s superpower is that it lets computers make these sorts of “decisions” based on what they’ve inferred from looking at thousands or even millions of examples that have already been reliably classified. From these examples they notice patterns that indicate which categories new inputs should be put into.
While this works better than almost anyone would expect — and a tremendous amount of research is devoted to fundamental improvements in classification algorithms — virtually every ML system that classifies inputs mis-classifies some of them. An image classifier might think that the photo of a desert is a photo of a beach. The cellphone you’re dictating into might insist that you said “Wreck a nice beach” instead of “Recognize speech.”
So, researchers and developers typically test and tune their ML systems by having them classify data that’s already been reliably tagged — the same sort of data these systems were trained on. In fact, it’s typical to hold back some of the inputs the system is being trained on so that it can test itself on data it hasn’t yet seen. Since the right classifications are known for the test inputs, the developers can quickly see how well the system has done.
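As a rough illustration of holding back some labeled examples for testing, here is a minimal Python sketch. The function name `split_holdout`, the 20% test fraction, and the toy photo data are all assumptions made up for this example, not anything a particular ML library prescribes.

```python
import random

def split_holdout(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and hold back a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    # The held-back slice is never shown to the system during training.
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical labeled data: (photo id, True if the photo shows a beach).
labeled = [(f"photo_{i}", i % 3 == 0) for i in range(10)]
train, test = split_holdout(labeled)
print(len(train), "examples to train on,", len(test), "held back for testing")
```

Because the right label is known for every held-back example, comparing the system's answers against those labels shows how often it goes wrong.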
In this sort of basic testing, there are two ways the system can go wrong. An image classifier designed simply to identify photos of beaches might, say, put an image of the Sahara into the “Beach” bucket, or it might put an image of a beach into the “Not a Beach” bucket.
For this post’s purposes, let’s call the first “False alarms”: the ML thinks the photo of the Sahara depicts a beach.
Let’s call the second “Missed targets”: the ML failed to recognize an actual beach photo.
ML practitioners use other terms for these errors. False alarms are false positives. Missed targets are false negatives. But just about everyone, even many professionals, finds these names confusing. Non-medical folk can understandably assume that positive test results are always good news. In the ML world, it’s easy to confuse the positivity of the classification with the positivity of the trait being classified. For example, ML might be used to look at lots of metrics to assess whether a car is likely to need service soon. If a healthy car is put into the “Needs Service” bucket, it would count as a false positive even though we might think of needing service as a negative. And logically, shouldn’t a false negative be a positive? The concepts are crucial, but the terms are not intuitive.
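To keep the names straight, it can help to tally the outcomes directly. The sketch below is illustrative only: `count_outcomes` and the five-car “Needs Service” data are hypothetical, and the comments map this column's terms onto the practitioners' ones.

```python
def count_outcomes(predicted, actual):
    """Tally the four possible outcomes of a yes/no classifier."""
    counts = {"hit": 0, "false_alarm": 0, "missed_target": 0, "correct_rejection": 0}
    for said_yes, is_yes in zip(predicted, actual):
        if said_yes and is_yes:
            counts["hit"] += 1                # true positive
        elif said_yes and not is_yes:
            counts["false_alarm"] += 1        # false positive
        elif not said_yes and is_yes:
            counts["missed_target"] += 1      # false negative
        else:
            counts["correct_rejection"] += 1  # true negative
    return counts

# Hypothetical "Needs Service" calls for five cars, and what was actually true.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
counts = count_outcomes(predicted, actual)
print(counts)
```

Here the second car is the healthy one put into the “Needs Service” bucket: a false alarm, or false positive, even though needing service is hardly a positive thing.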
So, let’s go with false alarms and missed targets as we talk about errors.
Deep-reaching consequences
Take an example that doesn’t involve machine learning, at least not yet. Let’s say you’re adjusting a body scanner at an airport security checkpoint. Those who fly often (back in the day) can attest to the fact that most of the people for whom the scanner buzzes are in fact not security threats. They get manually screened by an agent — often a pat-down — and are sent on their way. That’s not an accident or a misadjustment. The scanners are set to generate false alarms rather frequently: if there’s any doubt, the machine beeps a human over to double check.
That’s a bit of a bother for the mis-classified passengers, but if the machine were set to create fewer false alarms, it potentially would miss genuine threats. So it errs on the side of false alarms, rather than missed targets.
There are two things to note here. First, reducing the false alarms can increase the number of missed targets, and vice versa. Second, which is the better thing to do depends on the goal of the machine learning system. And that always depends on the context.
For example, false alarms are not too much of a bother when the result is that more passengers get delayed for a few seconds. But if the ML is being used to recommend preventive surgery, false alarms could potentially lead people to put themselves at unnecessary risk. Having a kidney removed for no good reason is far worse than getting an unnecessary pat down. (This is obviously why a human doctor will be involved in your decision.)
The consequences can reach deep. If your ML system is predicting which areas of town ought to be patrolled most closely by the police, then tolerating a high rate of false alarms may mean that local people will feel targeted for stop-and-frisk operations, potentially alienating them from the police force, which can have its own harmful consequences on a community…as well as other highly consequential outcomes.
False alarms are possible in every system designed by humans. They can be very expensive, in whatever dimensions you’re calculating costs.
It gets no less complex when considering how many missed targets you’re going to design your ML system to accept. If you tune your airport scanner so that it generates fewer false alarms, some people who are genuine threats may be waved on through, endangering an entire airplane. On the other hand, if your ML is deciding who is worthy of being granted a loan, a false alarm — someone who is granted a loan and then defaults on it — may be more costly to the lender than the missed opportunity of turning down someone who would have repaid the loan.
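One way to make the point concrete is to put a price on each kind of error and see which tuning comes out cheaper. Everything in this sketch is an assumption for illustration: the error counts, the cost figures, and the idea that the costs can be captured in a single number at all.

```python
def total_cost(false_alarms, missed_targets, cost_false_alarm, cost_missed_target):
    """Add up what a given mix of errors costs, in whatever unit you choose."""
    return false_alarms * cost_false_alarm + missed_targets * cost_missed_target

# Two hypothetical tunings of the same system, as (false alarms, missed targets).
eager = (30, 5)     # says "yes" readily: many false alarms, few missed targets
strict = (10, 20)   # says "yes" rarely: few false alarms, many missed targets

# Airport scanner: a pat-down is cheap, a missed threat is catastrophic.
airport = {"cost_false_alarm": 1, "cost_missed_target": 1000}
# Lender: a defaulted loan (a false alarm) costs more than a lost customer.
lender = {"cost_false_alarm": 50, "cost_missed_target": 5}

for name, costs in [("airport", airport), ("lender", lender)]:
    best = min([("eager", eager), ("strict", strict)],
               key=lambda tuning: total_cost(*tuning[1], **costs))
    print(name, "prefers the", best[0], "tuning")
```

With these made-up numbers the airport prefers the eager tuning and the lender the strict one. The arithmetic is trivial; deciding what the costs are is the hard, human part.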
Now, to not miss an opportunity to be confusing when talking about ML, consider an online book store that presents each user with suggestions for the next book to buy. What should the ML be told to prefer: Adding false alarms to the list, or avoiding missed opportunities? False alarms in this case are books the ML thinks the reader will be interested in, but the reader in fact doesn’t care about. Missed opportunities are the books the reader might actually buy but that the ML thinks the reader wouldn’t care about. From the store’s point of view, what’s the best adjustment of those two sliders?
That question isn’t easy, and not just because the terms are non-intuitive for most of us. For one thing, should the buckets for books be “User Will Buy It” or, perhaps, “User Will Enjoy It”? Or maybe, “User Will Be Stretched By It”?
Then, for reasons external to ML, not all missed opportunities and false alarms are equal. For example, maybe your loan application ML is doing fine sorting applications into “Approve” and “Disapprove” buckets in terms of the missed opportunities and false alarms your company can tolerate. But suppose many more applications that become missed opportunities are coming from women or racial minorities. The system is performing up to specification, but that specification turns out to have unfair and unacceptable results.
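A simple check for this kind of problem is to compute the missed-opportunity rate separately for each group. The sketch below assumes labeled test data in which we already know who would have repaid, which a real lender never knows for rejected applicants. The group names, the records, and `missed_rate_by_group` are all hypothetical.

```python
from collections import defaultdict

def missed_rate_by_group(records):
    """For each group, the share of would-be repayers who were turned down.

    records: iterable of (group, approved, would_repay) tuples.
    """
    repayers = defaultdict(int)
    missed = defaultdict(int)
    for group, approved, would_repay in records:
        if would_repay:
            repayers[group] += 1
            if not approved:
                missed[group] += 1  # a missed opportunity
    return {group: missed[group] / total for group, total in repayers.items()}

# Hypothetical labeled test data for two groups of applicants.
records = [
    ("A", True, True), ("A", True, True), ("A", False, True), ("A", True, False),
    ("B", False, True), ("B", False, True), ("B", True, True), ("B", False, False),
]
rates = missed_rate_by_group(records)
print(rates)
```

In this toy data the overall error counts might look tolerable, yet group B's creditworthy applicants are turned down twice as often as group A's. Equal rates across groups is only one of several mathematical definitions of fairness you might aim for.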
Think hard and out loud
Adjusting the mix of false alarms and missed targets brings us to the third of the three factors, the last point of the Triangle of Error: the ML’s confidence level.
One of the easiest ways to adjust the percentage of false alarms and missed targets is to change the threshold of confidence required to make it into the bin. (Other ways include training the system on better data or adjusting its classification algorithms.) For example, suppose you’ve trained an ML system on hundreds of thousands of images that have been manually labeled as “Smiling” or “Not Smiling”. From this training, the ML has learned that a broad expanse of light patches towards the bottom of the image is highly correlated with smiles, but then there are the Clint Eastwoods whose smiles are much subtler. When the ML comes across a photo like that, it may classify it as smiling, but not as confidently as the image of the person with the broad, toothy grin.
If you want to lower the percentage of false alarms, you can raise the confidence level required to be put into the “Smiling” bin. Let’s say that on a scale of 0 to 10, the ML gives a particular toothy grin a 9, while Clint gets a 5. If you stipulate that it takes at least a 6 to make it into the “Smiling” bin, Clint won’t make the grade; he’ll become a missed target. Your “Smiling” bucket will become more accurate, but your “Not Smiling” bucket will have at least one more missed target.
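The smile example can be played out in a few lines of Python. The scores of 9 and 5 and the threshold of 6 come from the example above; the two non-smiling photos, their scores, and the function names are assumptions added so that there is a false alarm to remove.

```python
def classify(scores, threshold):
    """Put a photo in the "Smiling" bin if its score meets the threshold."""
    return {name: score >= threshold for name, score in scores.items()}

def count_errors(scores, truly_smiling, threshold):
    """Return (false alarms, missed targets) at a given threshold."""
    predicted = classify(scores, threshold)
    false_alarms = sum(1 for n in scores if predicted[n] and not truly_smiling[n])
    missed_targets = sum(1 for n in scores if truly_smiling[n] and not predicted[n])
    return false_alarms, missed_targets

# Confidence on a 0-to-10 scale. "grimace" and "frown" are hypothetical extras.
scores = {"toothy_grin": 9, "clint": 5, "grimace": 5, "frown": 1}
truly_smiling = {"toothy_grin": True, "clint": True, "grimace": False, "frown": False}

for threshold in (5, 6):
    false_alarms, missed_targets = count_errors(scores, truly_smiling, threshold)
    print(f"threshold {threshold}: {false_alarms} false alarm(s), "
          f"{missed_targets} missed target(s)")
```

Raising the bar from 5 to 6 removes the grimace from the “Smiling” bin and drops Clint along with it: one error is swapped for the other.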
Was that the right choice? That’s not something the machine can answer. It takes humans — design teams, communities, the full range of people affected by the machine learning — to decide what they want from the system, and what the trade-offs should be to best achieve that result.
Deciding on the trade-offs occasions difficult conversations. But perhaps one of the most useful consequences of machine learning at the social level is not only that it requires us humans to think hard and out loud about these issues, but also that the requisite conversations implicitly acknowledge that we can never entirely escape error. At best we can decide how to err in ways that meet our goals and that treat all as fairly as possible.