“What is data science?” is a simple question, but the answers are often confusing. I regularly hear folks say that data science is nothing more than statistics dressed up in fancy clothes. Data science has jokingly been called statistics on a Mac. And a data scientist has been called a data analyst who lives in California. 😂
While these statements are humorous, it’s not at all obvious what data science encompasses. There have been many data science Venn diagrams and many definitions over the years. However, in my research, the ones I found were either convoluted or missing one of the three core data science functions.
In this article, you’ll learn about the three primary parts of data science. You’ll also learn about an emerging type of data science project. Finally, you’ll see two other areas that are important to data science, but not quite part of the core.
A data science project has one of three goals — either to provide insight, establish causality, or make predictions. These three goals are associated with the domains of data analysis, statistics, and machine learning.
Analysis is used to extract and convey insights from existing data.
Statistics is used to establish causality.
Machine learning has prediction as its goal.
A Definition of Data Science
Here’s my definition that encompasses all three domains:
Data science is using data to make better decisions with analysis for insight, statistics for causality, and machine learning for prediction.
It’s possible to be solely a data analyst, statistician, or machine learning engineer. However, a data scientist is a person who can do all three. 👍
After drafting this article, I saw that the insightful and eloquent Cassie Kozyrkov had come to a similar conclusion: she defines data science as “the discipline of making data useful” and then breaks it into statistics, machine learning, and data-mining/analytics. Her similar breakdown convinced me that it’s valuable to try to help folks understand what data science is by focusing on these three core aspects. 🎉
Also, I was humbly reminded that most any half-way decent thought I’ve had, Cassie probably had years ago. 😀
Data Science Context
Here’s my depiction of how data flows through an organization. The core data science roles are highlighted in peach.
Let’s look at data analysis first.
Data Analysis 📊
The goal of data analysis is to find insights in existing data.
Example analysis question: What caused the sales drop last month?
Alternatively, the question asker might want you to answer their question and craft a narrative. You’ll then use the data as evidence for your argument as to why sales fell.
A data analyst’s job requires exploring questions honestly. You need working hypotheses, but you must follow where the evidence leads and generate new hypotheses based on what you find.
Data analysis requires critical thinking and intellectual honesty.
Data analysis used poorly is obfuscation. It’s How to Lie with Statistics. It’s claiming the USA is doing well in terms of COVID-19 and supporting that claim with a chart showing the number of deaths as a proportion of cases. 🤦‍♂️ A much more relevant metric is the number of deaths as a proportion of the population. You can see that chart for large developed nations below.
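To see how the choice of denominator can flip the story, here is a minimal Python sketch. The two countries and all of their figures are invented for illustration; they are not real COVID-19 data.

```python
# Hypothetical figures for two made-up countries; not real COVID-19 data.
countries = {
    "Country A": {"population": 10_000_000, "cases": 500_000, "deaths": 5_000},
    "Country B": {"population": 10_000_000, "cases": 50_000, "deaths": 1_000},
}


def deaths_per_case(c):
    """Deaths as a proportion of confirmed cases."""
    return c["deaths"] / c["cases"]


def deaths_per_100k(c):
    """Deaths per 100,000 residents."""
    return c["deaths"] / c["population"] * 100_000


for name, c in countries.items():
    print(f"{name}: {deaths_per_case(c):.1%} of cases, "
          f"{deaths_per_100k(c):.0f} deaths per 100k people")
```

In this made-up example, Country A looks better when deaths are divided by cases (1% versus 2%), yet it has five times as many deaths per 100,000 people (50 versus 10). Same data, very different story.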
In data analysis you might be dipping your toes into the territory of causality, but you should be careful about making causal claims. You aren’t creating experiments and using inferential statistics.
Stock market talking heads do analysis all day, and often use the language of causality. Their explanations are plausible, but they generally can’t be tested.
Why did the stock market go up today? Traders liked the jobs numbers.
Why did it go down? Traders brushed off the bad jobs report.
You should be skeptical. These causal statements are not disconfirmable.
When doing data analysis, reports and presentations are often one-off affairs. If you find yourself presenting the same metrics and charts repeatedly, it might be a good idea to set up a dashboard that will automatically grab the data and show those metrics and visualizations to stakeholders.
The most common technology tools for data analysis are SQL, Excel, and Tableau, in that order. Programming with Python and R is next most common. See my analysis of the technologies for data analyst positions here.
Becoming competent with data analysis generally requires the least training time of the three areas. You need to understand how to query data, use descriptive statistics, and display data. You don’t necessarily need programming skills or advanced stats. However, programming skills are often useful to automate analyses.😀
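As a rough illustration of querying data and computing descriptive statistics, here is a small sketch that runs SQL from Python with the built-in `sqlite3` module. The `sales` table, its columns, and every value in it are hypothetical, made up to echo the sales question above.

```python
import sqlite3

# Hypothetical sales records (region, month, amount) invented for illustration.
rows = [
    ("North", "2020-06", 120.0),
    ("North", "2020-07", 80.0),
    ("South", "2020-06", 100.0),
    ("South", "2020-07", 95.0),
    ("South", "2020-07", 105.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Count, total, and average sale per region and month.
query = """
    SELECT region, month, COUNT(*) AS n, SUM(amount) AS total, AVG(amount) AS mean
    FROM sales
    GROUP BY region, month
    ORDER BY region, month
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, month, n, total, mean in summary:
    print(f"{region} {month}: n={n} total={total:.2f} mean={mean:.2f}")
```

A grouped summary like this is often the first step toward spotting where a drop happened. It shows what changed, not why.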
Now let’s look at the second core domain of data science: statistics.
Statistics 🔢
Statistics is a key part of the scientific method. It’s how we take the data we have collected and apply probability and mathematical rigor to make causal claims.
Example statistics question: Which version of our website results in more sales?
Let’s say you want to improve your website’s conversion rate. Of course you’ve used best practices in user experience and website design. 😀 Now you have two different website designs you want to test.
You want to extrapolate to a population based on a randomized controlled experiment with a sample from that population. If you want to do this with an A/B test from frequentist statistics, you need to determine up front how much data to collect and what your cutoff will be to determine significance. No cheating! Statistics is all about rigor. ☝️
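Deciding up front how much data to collect usually means a power calculation. Here is a minimal sketch using a common normal-approximation formula for comparing two proportions. The baseline rate, the expected rate, the 5% significance cutoff, and the 80% power are all assumptions chosen for illustration, and `sample_size_per_variant` is a hypothetical helper, not a library function.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(p_baseline, p_expected, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance cutoff
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    effect = p_expected - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Assumed scenario: 10% baseline conversion, hoping to detect a lift to 12%.
n = sample_size_per_variant(0.10, 0.12)
print(f"Collect roughly {n} visitors per design before looking at the result.")
```

The smaller the lift you want to detect, the more visitors you need. Fixing this number before the test starts is what keeps you from stopping the moment the result happens to look good.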
Alternatively, to answer this question you could use the multi-armed bandit test from Bayesian statistics. Both methods seek to draw inferences from a random sample of the population. See this blog post for more on the topic.
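One popular way to run a Bayesian bandit is Thompson sampling with a Beta posterior for each design. The sketch below simulates it. The true conversion rates are invented so the simulation has something to discover (in a real test you would not know them), and `thompson_sampling` is a hypothetical helper written for this example.

```python
import random


def thompson_sampling(true_rates, n_visitors, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit over website designs."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(n_visitors):
        # Draw a plausible conversion rate for each design from its Beta posterior.
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        # Simulate whether this visitor converts on the chosen design.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


# Assumed (made-up) true conversion rates for design 0 and design 1.
successes, failures = thompson_sampling([0.05, 0.15], n_visitors=5000)
shown = [s + f for s, f in zip(successes, failures)]
print("Visitors sent to each design:", shown)
```

Unlike a fixed A/B split, the bandit shifts traffic toward the design that appears to convert better as evidence accumulates.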
When doing statistics you care about effect sizes, p-values, and confidence intervals or credible intervals. You are down with the central limit theorem. You know statistical distributions and common statistical tests such as Chi-squared tests, ANOVA, linear regression, and logistic regression. You might do time series with ARIMA or Holt-Winters exponential smoothing.
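To make effect size, p-value, and confidence interval concrete, here is a small sketch for the website example using a two-proportion z-test and a Wald interval. The visitor and conversion counts are made up, and `compare_conversion_rates` is a hypothetical helper; a library such as statsmodels offers more complete versions of these tests.

```python
import math
from statistics import NormalDist


def compare_conversion_rates(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return the effect size, two-sided p-value, and confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    effect = p_b - p_a  # absolute difference in conversion rate

    # p-value from a two-sided z-test using the pooled rate under the null.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = effect / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Wald confidence interval using the unpooled standard error.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se
    return effect, p_value, (effect - margin, effect + margin)


# Made-up results: 400 of 4,000 visitors converted on A, 480 of 4,000 on B.
effect, p_value, ci = compare_conversion_rates(400, 4000, 480, 4000)
print(f"effect={effect:.3f}, p={p_value:.4f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

The effect size tells you how big the difference is, the p-value how surprising it would be if the designs were truly equal, and the interval how precisely you have pinned it down.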
You might use R, Python (likely with pandas and statsmodels), or SAS as your tools.
Stats is often concerned with matching data to the underlying theoretical distribution of some phenomenon. Here’s a great article on the history of data science vs. statistics.
Depending on your research problem you may need deep domain expertise or someone who has it on your team. 👍
Now let’s check out machine learning, the cool kid who made data science popular.
Machine Learning 🖥
The key feature of machine learning is that prediction dominates.
Example machine learning question: What will customer churn be next month?
When doing machine learning, you don’t care that much about the assumptions of statistical distributions; you care about what works. You might not care which variables led to the results.
Did the homoscedasticity assumption of linear regression get violated? If you only care about how well the model predicts the outcome variable, then “Who cares, my model predicts well” is a reasonable response. Machine learning folks are pragmatists.
However, many machine learning problems share a fair amount of overlap with statistics.
- Sometimes explainability is key, for example with a bank loan decision. When you need to be able to say why someone was denied a loan, the statistical requirements for interpretability matter. In that case you care about both predictive validity and interpretability. 👍
- Machine learning shares some algorithms with statistics, such as linear and logistic regression.
- Like statisticians, machine learning researchers want the simplest model that works well. That saves time and money for training and inference.
There are also some fundamental differences with statistics, in addition to machine learning being more concerned with prediction.
- Folks doing machine learning are expected to be more adept with programming than folks who strictly do statistics.
- In machine learning you train your models and then evaluate your model type and hyperparameters on data the models haven’t seen before. This hold-out test set isn’t generally found in traditional statistics.
- Machine learning is designed to take advantage of bigger data and more processing power. Its sophisticated algorithms became useful when processing power became cheap and data became plentiful.
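The hold-out idea from the list above can be sketched in a few lines. This example uses synthetic data and a hand-rolled least squares fit so that it runs with only the standard library. The data, the 80/20 split, and the helper names are all assumptions for illustration; in practice you would likely reach for a library such as scikit-learn.

```python
import random

rng = random.Random(42)

# Synthetic data: y is roughly 3x + 2 plus noise (made up for illustration).
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in xs]

# Hold out 20% of the rows; the model never sees them during training.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared prediction error over the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)
test_mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, hold-out MSE={test_mse:.2f}")
```

The number that matters here is the error on the held-out rows, because it estimates how the model will do on data it has never seen.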
Let’s now drill down into deep learning — a family of machine learning algorithms where amazing breakthroughs are occurring. ☀️
Deep learning 🧠
Deep learning uses artificial neural networks with multiple layers of hidden nodes. The network’s weights are iteratively updated as the model is trained, with the goal of minimizing a loss function.
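To show what iteratively updating weights looks like, here is a toy sketch of a neural network trained with gradient descent in plain Python. It has a single hidden layer to keep the code short, so it is not really deep; deep networks stack many such layers and are built with frameworks such as PyTorch or TensorFlow. The data, layer size, and learning rate are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)

# Toy data: learn y = x^2 on a handful of points (made up for illustration).
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]

hidden = 4
w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden weights
b1 = [0.0] * hidden                               # hidden biases
w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output weights
b2 = 0.0                                          # output bias


def forward(x):
    """Return the hidden activations and the network's prediction for x."""
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
    return h, sum(w2[j] * h[j] for j in range(hidden)) + b2


def loss():
    """Mean squared error over the whole toy dataset."""
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


initial_loss = loss()
learning_rate = 0.05
for _ in range(2000):
    # Accumulate gradients of the mean squared error over the whole dataset.
    g_w1, g_b1, g_w2, g_b2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
    for x, y in zip(xs, ys):
        h, pred = forward(x)
        d_pred = 2 * (pred - y) / len(xs)
        g_b2 += d_pred
        for j in range(hidden):
            g_w2[j] += d_pred * h[j]
            d_pre = d_pred * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
            g_w1[j] += d_pre * x
            g_b1[j] += d_pre
    # Gradient descent step: move each weight against its gradient.
    for j in range(hidden):
        w1[j] -= learning_rate * g_w1[j]
        b1[j] -= learning_rate * g_b1[j]
        w2[j] -= learning_rate * g_w2[j]
    b2 -= learning_rate * g_b2

final_loss = loss()
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Each pass nudges every weight a little in the direction that lowers the loss, which is the same loop that much larger networks run at scale.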
Deep learning is eating more and more of the kinds of problems that used to be tackled with more traditional machine learning algorithms.
If the relationship between the predictor variables and the outcome variable is complex, deep learning is often the tool of choice. For example, image classification is one area where deep learning excels. Natural language processing (NLP) is another area where deep learning is being used for many types of artificial intelligence breakthroughs.
Although insights, causality, and prediction are the types of tasks data scientists usually tackle, they have begun using deep learning for a fourth task: creation.
Creative breakthroughs are happening with generative models such as generative adversarial networks (GANs) and large language models. The importance of this domain for the future of humanity is large. Check out the latest GPT-3 NLP model to see what’s possible. 😲
Most data scientists aren’t using deep learning for creation, although its prevalence is increasing. In the future, creation might become a core component of the data scientist’s role.
You’ve seen how a data scientist uses data analysis, statistics, and machine learning. Now let’s briefly look at two areas that are important to data science, but not quite part of the core.
Data engineering
Data collection, storage, and preparation are vitally important for data science. You might have heard the statistic that 80% of a data scientist’s job is data cleaning. However, this number appears to be the result of something akin to a game of telephone. 📞 See great research digging into the sources here.
Below is a chart from the Kaggle 2018 Developer Survey showing how respondents report spending their time. The survey had nearly 24,000 responses from around the world. Note that it includes professions beyond data scientists and a good number of students.
For what it’s worth, gathering and cleaning data took more than 25% of the respondents’ time, on average. That’s not insignificant, but it’s also not 80%. 😉
The last several years have seen the rise of the data engineer as a separate job title. Many organizations now have dedicated data engineers to build pipelines to ingest, transform, and store data for use by other folks. However, in a small organization the data engineering and machine learning might be done by the same person.
All this is to say that many data scientists do some data gathering and cleaning, too — it just isn’t their core function.
Next, let’s shift our focus to what happens to machine learning models after they leave the data scientist. 🚀
Machine learning engineering
Machine learning engineers productionize a model. They have to ensure models will be continually updated and that they will reliably make fast, high-quality predictions at scale. See my article on in-demand tech skills for machine learning engineers here.
Data engineering and machine learning engineering are important to data science, but not in its primary domain.
The major goals of a data science project are either insight, causality, or prediction.
Doing data science well requires a foundation in coding, ethics, communication, math, and critical thinking. It also requires an understanding of the problem context. Although this skill set is large, data scientists are not unicorns. 🦄
Data scientists need an understanding of the analysis, stats, and machine learning processes. Understanding data engineering and machine learning engineering is valuable, too. However, you don’t need to be a pro at everything. That’s what teammates are for. 😀
Above all, data scientists are continually learning. I suggest you focus on your own learning journey and embrace the humility that comes with feeling like there is always more to learn. It’s better than feeling overwhelmed. 😂
I hope you’ve enjoyed this article and found it helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀
I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍
Happy data-sciencing! 😀