If you’re currently in the job market or looking to switch careers, you’ve probably noticed the growing popularity of data science jobs. In 2019, LinkedIn ranked “data scientist” the No. 1 most promising job in the U.S. based on job openings, salary, and career advancement opportunities, and reported a 56% rise in job openings for data scientists over the previous year. Despite its popularity, however, data science can be a difficult field to enter, let alone to learn. I know from personal experience that the amount of statistics involved makes it very challenging. Probability, in particular, can be quite complicated, but it is fundamental to many machine learning models, such as decision tree learning. So the purpose of this article is to provide a rudimentary understanding of conditional probability.
How To Calculate Probability
Simply put, the probability of an event happening is equal to the number of ways the event could happen divided by the total number of possible outcomes, assuming every outcome is equally likely. For example, imagine you have a deck of cards and you want to calculate the probability that you’ll randomly pull a king from the deck. How would you calculate that? Well, since there are 4 kings in a deck of cards, there are 4 possible ways you can draw a king from the deck; and since there are 52 cards in the deck, there are 52 possible outcomes. So 4 divided by 52 is about .077, or a 7.7% chance that your card will be a king. Now say you want to figure out the probability of drawing another king. The answer will depend on how you handle replacement. Sampling with replacement means that you place the first card back into the deck, making the two events independent (the probability of drawing a king doesn’t change between draws). Sampling without replacement means you’re not placing the first card back, which affects the probability of drawing the second king (there are now 3 kings left among 51 total outcomes). If event A is drawing the first king and event B is drawing the second king, then we’d say the probability of both A and B occurring is equal to the probability of event A multiplied by the probability of event B given that A has occurred.
P(A and B) = P(A) x P(B|A) = 4/52 x 3/51 ≈ .0045, or about .45%
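The card example above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are made up for illustration, and it assumes a standard 52-card deck with 4 kings. It uses exact fractions so that rounding does not hide the difference between sampling with and without replacement.

```python
from fractions import Fraction

KINGS = 4        # kings in a standard deck (assumed)
DECK_SIZE = 52   # cards in a standard deck (assumed)

# P(A): probability of drawing a king on a single draw.
p_first_king = Fraction(KINGS, DECK_SIZE)

# With replacement: the first card goes back into the deck, so the
# second draw has the same probability and the events are independent.
p_two_kings_with = p_first_king * Fraction(KINGS, DECK_SIZE)

# Without replacement: one king and one card are gone, so P(B|A) = 3/51.
p_second_given_first = Fraction(KINGS - 1, DECK_SIZE - 1)
p_two_kings_without = p_first_king * p_second_given_first

print(f"P(king)                         = {float(p_first_king):.2%}")
print(f"P(two kings, with replacement)  = {float(p_two_kings_with):.2%}")
print(f"P(two kings, no replacement)    = {float(p_two_kings_without):.2%}")
```

Keeping the values as `Fraction` objects until the final print means the multiplication rule is applied to exact numbers (1/13 x 1/17 = 1/221) and only the displayed percentages are rounded.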
Mathematics isn’t intuitive to everyone; it certainly wasn’t for me when I was just starting out in this field. Visualizations, however, can be a great tool for reinforcing complex topics. A tree diagram is one example that can help you break down a general problem into smaller components, which makes it perfect for probability problems that involve multiple events leading to a variety of outcomes. For example, take a look at the diagram I’ve created that helps answer the following question: If you have a bag of 23 marbles (5 green, 8 blue, and 10 red), what’s the probability that you’ll randomly pull out a blue marble and then a green marble, without putting the first one back? Let’s break it down.
- The probability of grabbing a blue marble is about 35%, because there are 8 ways you can get a blue marble and 23 total potential outcomes.
- Now, given that you pulled out a blue marble, the probability of grabbing a green marble from the bag is about 23%: 5 green marbles divided by 22 potential outcomes (notice how the total number of outcomes changes the second time, hence the change in probability).
- Finally, calculating the probability of both these events happening involves multiplying the two probabilities together (.35 x .23 ≈ 8%).
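The marble steps above can also be sketched in Python by computing every branch of a two-draw tree. The function and variable names here are hypothetical, chosen only for illustration, and the sketch assumes both draws are made without replacement, as in the walkthrough.

```python
from fractions import Fraction

# Bag contents from the example: 23 marbles in total.
bag = {"green": 5, "blue": 8, "red": 10}


def draw_two_probability(bag, first, second):
    """P(first colour, then second colour) when drawing without replacement."""
    total = sum(bag.values())
    p_first = Fraction(bag[first], total)

    # Remove the marble that was just drawn before looking at the second draw.
    remaining = dict(bag)
    remaining[first] -= 1
    p_second_given_first = Fraction(remaining[second], total - 1)

    return p_first * p_second_given_first


# Every branch of the tree: (first colour, second colour) -> probability.
tree = {(a, b): draw_two_probability(bag, a, b) for a in bag for b in bag}

p_blue_then_green = tree[("blue", "green")]
print(f"P(blue, then green) = {p_blue_then_green} = {float(p_blue_then_green):.1%}")

# Sanity check: the probabilities of all branches add up to 1.
print("Sum over all branches:", sum(tree.values()))
```

The exact answer is 20/253, or roughly 7.9%, which matches the 8% obtained from the rounded figures. If the order of the two colours did not matter, you would add the green-then-blue branch as well, which doubles the result.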
Hopefully this demonstration has given you a clearer mental picture of statistical probability. Even though conditional probability may seem elementary compared to the more advanced concepts in machine learning, having a solid understanding of the foundation on which data science is built is extremely important. So whenever you begin to learn something new, remember that no topic is too small and that relearning is reinforcement.