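  • Descriptive statistics

    2019-10-09 02:46:27

    Descriptive statistics is a way of summarizing and presenting quantitative data so that the characteristics of its distribution become visible. It mainly covers frequency analysis, analysis of central tendency, analysis of dispersion, the distribution of the data, and some basic statistical charts. Descriptive statistics is a family of statistical methods that offers an effective and relatively simple way to summarize and characterize data. The results are usually presented graphically, which makes them easy to read, reveals regularities in the distribution and trend of a quality characteristic (the population), and makes it easier to decide on corrective action. These summaries are typically the basis for further quantitative analysis, or a useful complement to inferential statistical methods. Common descriptive methods fall into three groups: description with summary statistics, such as the mean and standard deviation; description with graphical techniques, such as histograms, scatter plots, trend charts, Pareto charts, bar charts, and pie charts; and description in words, such as statistical analysis tables, stratification, cause-and-effect diagrams, affinity diagrams, and flowcharts.

    Reposted from: https://www.cnblogs.com/guo-xiang/p/5761170.html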
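  • Introduces the descriptive statistics part of statistical analysis in MATLAB, including measures of central tendency and dispersion, frequency analysis, and drawing the related statistical charts.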
  • Understand Descriptive Statistics

    DATA ANALYSIS

    Table of contents

    1. Introduction

    2. Data types

    3. Analyzing Quantitative Data
       I. Measures of Center
       II. Measures of Spread
       III. Shape of distribution
       IV. Outliers

    4. Descriptive vs. Inferential Statistics

    5. Looking Ahead

    6. Summary

    Introduction

    The word "data" is defined as distinct pieces of information. You may think of data as simply numbers on a spreadsheet, but it can come in many forms, from text to videos, spreadsheets, databases, images, and audio. Utilizing data is the new way of the world. Data is used to understand and improve nearly every facet of our lives, from early disease detection to social networks that allow us to connect and communicate with people around the world. No matter what field you're in, from insurance and banking to medicine, education, or agriculture, you can utilize data to make better decisions and accomplish your goals. Running descriptive statistics on your datasets is crucial before you begin the process of inferential statistics. Many people, especially novices, do not take the time to carefully run descriptive statistics, clean the data, and check that the data meet the assumptions required by the statistical tests they plan to use, but it is imperative that this process is done correctly.

    Data Types

    • Quantitative data takes on numeric values that allow us to perform mathematical operations.
      - Continuous data can be split into smaller and smaller units, and a still smaller unit always exists (for example, we can measure age in years, months, days, hours, or seconds, and there are still smaller units that could be used).
      - Discrete data only takes on countable values.

    • Categorical data are used to label a group or set of items.
      - Categorical ordinal: data that take on a ranked ordering (for example, a rating on a scale from Very Bad to Very Good).
      - Categorical nominal: data that do not have an order or ranking.

    For more information about data types, check out this story:
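The four data types above can be illustrated with a small pandas table. This is only a sketch: the survey columns and values are made up for illustration, and using an ordered `pd.Categorical` for ordinal data is one possible way to encode a ranking, not something the article prescribes.

```python
import pandas as pd

# Hypothetical survey responses; column names and values are invented.
survey = pd.DataFrame({
    "age_years": [23.5, 31.2, 45.0, 28.7],               # quantitative, continuous
    "n_children": [0, 2, 3, 1],                          # quantitative, discrete
    "rating": ["Good", "Very Bad", "Very Good", "Bad"],  # categorical, ordinal
    "city": ["Tunis", "Paris", "Oslo", "Lima"],          # categorical, nominal
})

# Ordinal data: the categories carry a ranking, so we declare the order.
rating_levels = ["Very Bad", "Bad", "Good", "Very Good"]
survey["rating"] = pd.Categorical(
    survey["rating"], categories=rating_levels, ordered=True
)

# Nominal data: labels only, no order.
survey["city"] = survey["city"].astype("category")

mean_age = survey["age_years"].mean()         # arithmetic is meaningful here
best_rating = survey["rating"].max()          # max is meaningful because of the order
rating_is_ordered = survey["rating"].cat.ordered
city_is_ordered = survey["city"].cat.ordered

print(mean_age, best_rating, rating_is_ordered, city_is_ordered)
```

Arithmetic such as a mean only makes sense for the quantitative columns, while the ordered categorical supports comparisons such as `max` without pretending the labels are numbers.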

    Analyzing Quantitative Data

    Measures of Center

    1. The Mean
       The mean is often called the average or the expected value in mathematics. We calculate the mean by adding all of our values together and dividing by the number of values in our dataset.

    2. The Median
       The median splits our data so that 50% of our values are lower and 50% are higher.
       - If we have an odd number of observations, the median is simply the value in the exact middle.
       - If we have an even number of observations, the median is the average of the two values in the middle.
       Note: In order to compute the median we MUST sort our values first.

    3. The Mode
       The mode is the most frequently observed value in our dataset.
       Note 1: There might be multiple modes for a particular dataset, or no mode at all.
       Note 2: The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple modes depending on the number of peaks in our histogram.
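The three measures of center can be checked by hand on a tiny dataset. The sketch below uses Python's standard `statistics` module; the numbers are made up for illustration.

```python
import statistics

# Made-up values with an odd number of observations.
values = [7, 3, 5, 3, 9, 5, 3]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # sorts internally, then takes the middle
modes = statistics.multimode(values)      # every value tied for most frequent

# With an even number of observations the median is the
# average of the two middle values of the sorted data: (2 + 4) / 2.
even_values = [1, 4, 2, 8]
even_median = statistics.median(even_values)

# A dataset can have more than one mode.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print(mean_value, median_value, modes, even_median, two_modes)
```

`multimode` returns a list precisely because, as Note 1 says, a dataset may have several modes.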

    Measures of Spread

    One of the most common ways to measure the spread of our data is by looking at the Five Number Summary. It consists of five values:

    1. The minimum: The smallest value in the dataset.

    2. The first quartile Q1: The value such that 25% of the data falls below it.

    3. The second quartile Q2 (median): The value such that 50% of the data falls below it.

    4. The third quartile Q3: The value such that 75% of the data falls below it.

    5. The maximum: The largest value in the dataset.

    We represent the five-number summary with a boxplot.

    [Figure: boxplot of the five-number summary. Image by the author.]
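As a rough sketch, the five-number summary can be computed with NumPy's `percentile`. The values are made up, and the quartiles follow NumPy's default linear interpolation; other software may use slightly different quartile conventions and give slightly different Q1 and Q3.

```python
import numpy as np

# Made-up dataset of nine observations.
values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12])

# 0th, 25th, 50th, 75th and 100th percentiles give
# minimum, Q1, median (Q2), Q3 and maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```

These are the five values a boxplot draws: the box spans Q1 to Q3, with a line at the median.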

    Measures of spread give us an idea of how spread out our data are. Common measures of spread include:

    1. Range: The range is the difference between the maximum and the minimum.

    2. Interquartile Range (IQR): The interquartile range is calculated as the difference between Q3 and Q1.

    3. Variance: The variance is used to compare the spread of two different groups. A dataset with a higher variance is more spread out than a dataset with a lower variance. Be careful though: a single outlier (or a few outliers) may be inflating the variance when most of the data are actually very close together. The variance is the average squared difference of each observation from the mean.

    [Image: variance. Source: author.]

    4. Standard Deviation: The standard deviation is one of the most common measures for talking about the spread of data. It is defined as the square root of the variance. The standard deviation is used more often in practice than the variance because it shares the units of the original dataset.

    [Image: standard deviation. Source: author.]

    Note: If you're interested in mathematical writing with LaTeX, check out this article
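A minimal sketch of the four measures of spread using NumPy, on made-up values. One assumption needs flagging: "average squared difference" is taken here as the population variance (dividing by n). Many tools report the sample variance (dividing by n - 1) by default, so both are shown.

```python
import numpy as np

# Made-up dataset whose mean is exactly 5.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

data_range = values.max() - values.min()

# IQR = Q3 - Q1 (NumPy's default linear interpolation for the quartiles).
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Population variance: mean of squared deviations from the mean (ddof=0).
population_variance = np.var(values)
# Sample variance: divides by n - 1 instead of n (ddof=1).
sample_variance = np.var(values, ddof=1)

# The standard deviation is the square root of the variance,
# so it is expressed in the same units as the data.
population_std = np.sqrt(population_variance)

print(data_range, iqr, population_variance, sample_variance, population_std)
```

Note that pandas' `var` and `std` default to the n - 1 version, while NumPy's default to n, which is a common source of small discrepancies.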

    The Shape of the Distribution

    From a histogram, we can quickly identify the shape of our data, which can tell us a lot about the measures of center and spread. The distribution of data is frequently associated with one of three shapes:

    1. Right-skewed
       A histogram that has shorter bars on the right and taller bars on the left is considered right-skewed. In this distribution, the mean is greater than the median.
       Real-world examples: the amount of drug left in your bloodstream over time, human athletic abilities …

    2. Left-skewed
       A histogram that has shorter bars on the left and taller bars on the right is considered left-skewed. In this distribution, the mean is less than the median.
       Real-world examples: the age of death, asset price changes …

    3. Symmetric
       Any distribution where you can draw a line down the middle and the right side mirrors the left side is considered symmetric. One of the most common symmetric distributions is the normal distribution, also called the "bell curve". In a symmetric distribution with a single peak, the mean equals the median, which also equals the mode, and the box plot is symmetric as well.
       Real-world examples: heights, weights, precipitation amounts …

    Note 1: Data in the real world can be messy and might not follow any of these distributions.
    Note 2: In a skewed distribution, the mean is pulled toward the tail of the distribution while the median stays closer to the mode.

    [Figure: the three distribution shapes. Image by the author.]
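The relationship between skew and the mean/median ordering can be seen on simulated data. This is a sketch under stated assumptions: an exponential sample stands in for right-skewed data, its mirror image for left-skewed data, and a normal sample for symmetric data. The seed and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

# Exponential data has a long tail to the right.
right_skewed = rng.exponential(scale=1.0, size=10_000)
# Mirroring the values moves the long tail to the left.
left_skewed = -right_skewed
# Normal data is symmetric around its mean.
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

right_mean, right_median = right_skewed.mean(), np.median(right_skewed)
left_mean, left_median = left_skewed.mean(), np.median(left_skewed)
sym_mean, sym_median = symmetric.mean(), np.median(symmetric)

print("right-skewed:", right_mean, right_median)  # mean pulled above the median
print("left-skewed: ", left_mean, left_median)    # mean pulled below the median
print("symmetric:   ", sym_mean, sym_median)      # mean and median nearly equal
```

The tail drags the mean in its own direction, while the median stays put, which is what Note 2 describes.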

    Outliers

    Outliers are data points that fall very far from the rest of the values in a dataset. There are a number of different methods for deciding what counts as "very far". The method I usually use for detecting outliers isn't very scientific: I just plot the data and see if there is a point really far from all of the other data points. More formal methods and techniques for identifying outliers also exist.

    A quick plot of your data can often help you understand a lot in a short amount of time.

    [Figure. Image by the author.]

    To illustrate the impact that outliers can have on the way we report summary statistics, let's consider the earnings of startups and companies. Imagine I select ten companies: nine are startups with earnings in the thousands of dollars, and the tenth is Facebook or Tesla. The mean, variance, and standard deviation are then incredibly misleading, because none of the ten earnings is even close to the calculated mean. A better measure of center would certainly be the median.
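The startup example can be reproduced with a few invented numbers. Everything below is hypothetical: nine small earnings figures (in thousands of dollars) plus one very large value standing in for the giant company.

```python
import statistics

# Hypothetical earnings in thousands of dollars.
startup_earnings = [40, 45, 50, 52, 55, 58, 60, 62, 65]
giant_earnings = 1_000_000  # stand-in for a Facebook/Tesla-sized company

all_earnings = startup_earnings + [giant_earnings]

mean_with_outlier = statistics.mean(all_earnings)
median_with_outlier = statistics.median(all_earnings)
mean_without_outlier = statistics.mean(startup_earnings)
median_without_outlier = statistics.median(startup_earnings)

print("mean with outlier:     ", mean_with_outlier)
print("median with outlier:   ", median_with_outlier)
print("mean without outlier:  ", mean_without_outlier)
print("median without outlier:", median_without_outlier)
```

The mean with the outlier lands far above every startup and far below the giant, so it describes none of the ten companies, while the median barely moves.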

    Working with outliers

    If you're the one doing the reporting, here are some of my personal guidelines when analyzing data:

    1. Plot your data.

    2. If there are outliers, determine how you should handle them. This might require a domain expert. Should you remove them? Should you fix them? Should you keep them?

    3. If you're working with data that are normally distributed (the bell shape we saw before), the mean and the standard deviation together fully describe the distribution. This may seem surprising, but it's true. However, if you're working with skewed data, the five-number summary provides much more information than the mean and the standard deviation can.
       Note: If you aren't sure whether your data are normally distributed, there are statistical methods, such as the Kolmogorov-Smirnov test, that help you assess whether or not they are.
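As a hedged sketch of the Kolmogorov-Smirnov check mentioned in the note, the code below uses `scipy.stats.kstest` on two simulated samples. The helper `ks_against_normal` is a hypothetical name introduced here. One caveat: estimating the mean and standard deviation from the same sample makes the reported p-value only approximate, so treat it as a rough screen rather than an exact test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Simulated data: one sample that really is normal, one that is skewed.
normal_sample = rng.normal(loc=100.0, scale=15.0, size=2_000)
skewed_sample = rng.exponential(scale=15.0, size=2_000)


def ks_against_normal(sample):
    """Standardize the sample, then compare it with a standard normal (K-S test)."""
    z = (sample - sample.mean()) / sample.std(ddof=1)
    result = stats.kstest(z, "norm")
    return result.statistic, result.pvalue


normal_stat, normal_p = ks_against_normal(normal_sample)
skewed_stat, skewed_p = ks_against_normal(skewed_sample)

# A small p-value is evidence against normality.
print("normal sample:", normal_stat, normal_p)
print("skewed sample:", skewed_stat, skewed_p)
```

The K-S statistic is the largest gap between the sample's empirical distribution and the normal curve, so the skewed sample produces a much larger statistic and a tiny p-value.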

    How should we work with these outliers in practice?

    At the very least, we should note that they exist and be aware of the impact they have on our summary statistics. If the outliers are typos or data entry errors, that is a reason to remove these points or, if we know what they should be, to replace them with the correct values. In cases like the example above (startups vs. Facebook), we might try to understand what was so different about the outlier compared to the other startups. How did this company become so successful? And why are its earnings so large in comparison? There is an entire field aimed at this idea, called anomaly detection.

    A single number can be very misleading about what is actually happening in our data. Some statistics are more misleading than others. If you are a consumer of information based on data, which we all are, it's important to know how to ask the right questions about the statistics around you.

    Descriptive vs. Inferential Statistics

    The topics covered thus far have all been aimed at descriptive statistics, that is, describing the data we've collected. There is an entire other field of statistics, known as inferential statistics, that is aimed at drawing conclusions about a population of individuals based only on a sample of individuals from that population.

    The vocabulary you need to know:

    1. Population: The collection of all the measurements you are interested in.

    2. Sample: A subset of the population.

    3. Statistic: Any numeric summary calculated from the sample.

    4. Parameter: A numeric summary of the population. This is what inferential statistics tries to estimate: we usually don't know this number, because computing it requires information from the entire population.

    Drawing conclusions regarding a parameter based on our statistics is known as inference.
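The vocabulary above can be made concrete with a simulation. In this sketch the "population" is a made-up set of 100,000 heights, which lets us compute the parameter directly. In real studies the parameter is unknown and only the statistic is available.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed

# Hypothetical population: heights in cm for 100,000 individuals.
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Parameter: a numeric summary of the whole population.
parameter = population.mean()

# Sample: a subset of the population, drawn at random without replacement.
sample = rng.choice(population, size=1_000, replace=False)

# Statistic: the same numeric summary, computed from the sample only.
statistic = sample.mean()

print("parameter (population mean):", parameter)
print("statistic (sample mean):    ", statistic)
```

Inference is the step of using the statistic, together with a measure of its uncertainty, to say something about the parameter.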

    Looking Ahead

    Although this article does not dive deep into inferential statistics, you are now aware of the difference between these two branches of statistics. The way we perform inferential statistics is changing as technology evolves. Many career paths involving Machine Learning and Artificial Intelligence are aimed at using collected data to draw conclusions about entire populations at an individual level.

    Summary

    We started by identifying data types as either categorical or quantitative. Then we learned that quantitative data can be either continuous or discrete, and categorical data either ordinal or nominal.

    When analyzing categorical variables, we commonly just look at the count or percentage of a group that falls into each level of a category. When analyzing quantitative data there are four main aspects:

    1. Measures of Center
       I. Mean
       II. Median
       III. Mode

    2. Measures of Spread
       I. Range
       II. Interquartile Range (IQR)
       III. Variance
       IV. Standard Deviation

    3. Shape of distribution
       I. Right-skewed
       II. Left-skewed
       III. Symmetric

    4. Outliers

    There are two types of statistics:

    1. Descriptive statistics: Present, organize, summarize, and describe the collected data using the measures discussed throughout: measures of center, measures of spread, the shape of the distribution, and outliers. We can also use plots of our data to gain a better understanding.

    2. Inferential statistics: This is where you run different tests on your sample and draw conclusions that can be generalized to a larger population. Performing inferential statistics well requires a sample that accurately represents the population of interest.

    Thanks For Reading! 😄

    Khelifi Ahmed Aziz

    Translated from: https://medium.com/dataseries/understand-descriptive-statistics-c29282b7a62e
  • Python descriptive statistics

    Descriptive Statistics

    After data collection, most Psychology researchers use different ways to summarise the data. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a programming language, gives us many ways to carry out descriptive statistics. Pandas makes data manipulation and summary statistics quite similar to how you would do it in R. I believe that the dataframe in R is very intuitive to use, and pandas offers a DataFrame object similar to R's. Also, many Psychology researchers may have experience with R.

    Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start by using Pandas to obtain summary statistics and some variance measures. After that we continue with the central tendency measures (e.g., mean and median) using Pandas and NumPy. The harmonic, geometric, and trimmed means cannot be calculated using Pandas or NumPy, so we use SciPy. Towards the end we learn how to get some measures of variability (e.g., variance using pandas).

    import numpy as np
    import pandas as pd  # later snippets refer to pd.Series
    from pandas import DataFrame as df
    from scipy.stats import trim_mean, kurtosis
    from scipy.stats.mstats import mode, gmean, hmean

    Simulate response time data

    In experimental psychology, response time is often the dependent variable. I simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data further have two independent variables (IVs): "iv1" has 2 levels and "iv2" has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.
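The simulation code itself did not survive in this copy, so the following is only a sketch of what such a simulation might look like, not the author's original. The column names `rt`, `iv1`, and `iv2` and the variable names `data` and `grouped_data` come from the text and later snippets. The level labels, sample size, mean, standard deviation, and seed are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2016)  # arbitrary seed

n = 300  # assumed total number of trials (50 per design cell)

data = pd.DataFrame({
    # Response times in ms; mean and SD are assumed values.
    "rt": rng.normal(loc=600, scale=80, size=n),
    # iv1 has 2 levels, alternating across rows.
    "iv1": np.tile(["low", "high"], n // 2),
    # iv2 has 3 levels, arranged so that every iv1 x iv2 cell is filled equally.
    "iv2": np.tile(np.repeat(["a", "b", "c"], 2), n // 6),
})

# Group by both independent variables, as the later snippets assume.
grouped_data = data.groupby(["iv1", "iv2"])

# First look at the data: count, mean, std, min, quartiles, max.
print(data.describe())
```

Crossing a 2-level and a 3-level factor gives six design cells, which is why the grouped outputs later in the tutorial have six rows.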

    Descriptive statistics using Pandas

    data.describe()

    Pandas outputs summary statistics when this method is called. The output is a table, as you can see below.

    [Figure: output table of data.describe()]

    Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by these. Calling describe on the grouped data gives aggregated statistics for each level of each IV. As can be seen from the output, it is somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes the output somewhat easier to read.

    [Figure: output from describe on the grouped data]
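The grouped `describe` call is not shown in this copy, so here is a hedged sketch of what it might look like. The simulated data are an assumption (see the comments). In recent pandas versions the statistics already come back as columns, so the `unstack` step described above may only be needed on older versions.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the tutorial's simulated data (values are made up).
rng = np.random.default_rng(seed=2016)
n = 300
data = pd.DataFrame({
    "rt": rng.normal(loc=600, scale=80, size=n),
    "iv1": np.tile(["low", "high"], n // 2),
    "iv2": np.tile(np.repeat(["a", "b", "c"], 2), n // 6),
})

# Summary statistics of rt for every iv1 x iv2 cell.
summary = data.groupby(["iv1", "iv2"])["rt"].describe()

# Summary statistics for each level of a single IV.
summary_iv1 = data.groupby("iv1")["rt"].describe()

print(summary)
print(summary_iv1)
```

Each row is one design cell and each column one statistic (count, mean, std, min, quartiles, max).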

    Central tendency

    Often we want to know something about the "average" or "middle" of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode and trimmed mean can also be obtained using Pandas, but I will use methods from SciPy.

    Mean

    There are at least two ways of doing this using our grouped data. First, Pandas has the method mean:

    grouped_data['rt'].mean().reset_index()

    But the method aggregate in combination with NumPy's mean can also be used:

    grouped_data['rt'].aggregate(np.mean).reset_index()

    Both methods give the same output, but the aggregate method has some advantages that I will explain later.

    [Figure: output of mean and aggregate using NumPy (mean)]
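A small self-contained check that the two routes to the group means agree. The simulated data are an assumption. The sketch passes the string `"mean"` to `aggregate` instead of `np.mean`, because recent pandas versions may emit a warning when NumPy functions are passed. That is a substitution on my part, not the article's exact call.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the tutorial's simulated data (values are made up).
rng = np.random.default_rng(seed=2016)
n = 300
data = pd.DataFrame({
    "rt": rng.normal(loc=600, scale=80, size=n),
    "iv1": np.tile(["low", "high"], n // 2),
    "iv2": np.tile(np.repeat(["a", "b", "c"], 2), n // 6),
})
grouped_data = data.groupby(["iv1", "iv2"])

# Route 1: the groupby mean method.
mean_direct = grouped_data["rt"].mean().reset_index()

# Route 2: aggregate with a named aggregation.
mean_via_aggregate = grouped_data["rt"].aggregate("mean").reset_index()

# Cross-check one cell by hand with NumPy.
cell = data[(data["iv1"] == "low") & (data["iv2"] == "a")]["rt"]
manual_cell_mean = np.mean(cell.to_numpy())

print(mean_direct)
print(mean_via_aggregate)
```

The advantage of `aggregate` is that it also accepts a list of functions, so several statistics can be computed in one call.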

     

    Geometric & Harmonic mean

    Sometimes the geometric or harmonic mean can be of interest. These two descriptives can be obtained using the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, there is no method in Pandas or NumPy that enables us to calculate geometric and harmonic means.

    Geometric

    grouped_data['rt'].apply(gmean, axis=None).reset_index()

    Harmonic

    grouped_data['rt'].apply(hmean, axis=None).reset_index()

    Trimmed mean

    Trimmed means are, at times, used. Pandas and NumPy seem not to have methods for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data we can call trim_mean with an argument that removes the largest 10% and the smallest 10% of the values.

    trimmed_mean = grouped_data['rt'].apply(trim_mean, .1)
    trimmed_mean.reset_index()

    Output from the mean values above (trimmed, harmonic, and geometric means):

    [Figure: trimmed mean]

    [Figure: harmonic mean]

    [Figure: geometric mean]
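A runnable sketch of the three SciPy-based means on grouped data. The simulated response times are an assumption, drawn from a uniform range so that all values are positive, which the geometric and harmonic means need. Using a `lambda` with `to_numpy()` is my choice to keep the call explicit.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean, hmean, trim_mean

# Assumed stand-in data: strictly positive response times in ms.
rng = np.random.default_rng(seed=2016)
n = 300
data = pd.DataFrame({
    "rt": rng.uniform(low=300, high=900, size=n),
    "iv1": np.tile(["low", "high"], n // 2),
    "iv2": np.tile(np.repeat(["a", "b", "c"], 2), n // 6),
})
grouped_rt = data.groupby(["iv1", "iv2"])["rt"]

arithmetic = grouped_rt.mean()
geometric = grouped_rt.apply(lambda s: gmean(s.to_numpy()))
harmonic = grouped_rt.apply(lambda s: hmean(s.to_numpy()))
# Cut 10% of the observations from each tail before averaging.
trimmed = grouped_rt.apply(lambda s: trim_mean(s.to_numpy(), 0.1))

means = pd.DataFrame({
    "arithmetic": arithmetic,
    "geometric": geometric,
    "harmonic": harmonic,
    "trimmed": trimmed,
}).reset_index()
print(means)
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean. That ordering is a quick sanity check on the output.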

    Median

    As with the mean, there are also at least two ways of obtaining the median:

    grouped_data['rt'].aggregate(np.median).reset_index()

    [Figure: output of aggregate using NumPy (median)]

    Mode

    There is a method (i.e., pandas.DataFrame.mode()) for getting the mode of a DataFrame object. However, it cannot be used on the grouped data, so I will use mode from SciPy:
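The SciPy call did not survive in this copy. As a substitute, and explicitly not the article's method, the sketch below gets a per-group mode by applying pandas' own `Series.mode` inside each group. The tiny dataset is invented so that the modes are easy to verify by eye.

```python
import pandas as pd

# Tiny made-up dataset with repeated response times.
data = pd.DataFrame({
    "iv1": ["low"] * 4 + ["high"] * 4,
    "rt": [500, 510, 510, 520, 600, 610, 610, 610],
})

# Series.mode returns all tied modes (sorted); take the first one per group.
modes = data.groupby("iv1")["rt"].apply(lambda s: s.mode().iloc[0])

print(modes)
```

With continuous response times every value is usually unique, so in practice the mode is more informative after rounding or binning the data.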

    Most of the time I would probably want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy functions. In the example below the median, standard deviation (std), mean, and trimmed mean are all in the same output. Note that we have to add the trimmed means afterwards.

    descr = grouped_data['rt'].aggregate([np.median, np.std, np.mean]).reset_index()
    # Add the trimmed means computed earlier as an extra column
    descr['trimmed_mean'] = pd.Series(trimmed_mean.values, index=descr.index)
    descr

    [Figure: output of aggregate using some of the methods]

    Measures of variability

    Central tendency (e.g., the mean and median) is not the only type of summary statistic that we want to calculate. When doing data analysis we also want a measure of the variability of the data.

    Standard deviation

    Interquartile range

    Note that unstack() is used here as well, to get the quantiles as columns so that the output is easier to read.

    grouped_data['rt'].quantile([.25, .5, .75]).unstack()

    [Figure: IQR]

    Variance

    [Figure: variance using the pandas var method]
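The standard deviation and variance snippets are missing from this copy. The sketch below shows one way to obtain all three variability measures with pandas' `std`, `var`, and `quantile` on grouped data. The simulated data are an assumption, and computing the IQR as Q3 minus Q1 from the unstacked quantiles is my addition.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the tutorial's simulated data (values are made up).
rng = np.random.default_rng(seed=2016)
n = 300
data = pd.DataFrame({
    "rt": rng.normal(loc=600, scale=80, size=n),
    "iv1": np.tile(["low", "high"], n // 2),
    "iv2": np.tile(np.repeat(["a", "b", "c"], 2), n // 6),
})
grouped_rt = data.groupby(["iv1", "iv2"])["rt"]

# pandas uses the sample versions (dividing by n - 1) by default.
std = grouped_rt.std()
var = grouped_rt.var()

# Quartiles per cell; unstack turns the quantile level into columns.
quartiles = grouped_rt.quantile([0.25, 0.5, 0.75]).unstack()
iqr = quartiles[0.75] - quartiles[0.25]

variability = pd.DataFrame({"std": std, "var": var, "iqr": iqr}).reset_index()
print(variability)
```

Since the variance is the square of the standard deviation, squaring the `std` column should reproduce the `var` column.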

    That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy make these calculations almost as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.

    Translated from: https://www.pybloggers.com/2016/02/descriptive-statistics-using-python/
  • Descriptive statistical analysis

    1,000+ reads 2018-11-28 13:34:44
    Descriptive statistical analysis ...
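  • Getting started with pandas (overview and computation of descriptive statistics in pandas: descriptive and summary statistics) 1. Descriptive and summary statistics ...
  • Descriptive statistics: source code

    2021-02-20 10:02:39
    Descriptive statistics. Overview: This gem adds methods to the Enumerable module that make it easy to compute basic descriptive statistics of numeric sample data in collections that include Enumerable (such as Array, Hash, Set, and Range). The statistics that can be computed are: number, sum, mean, median ...
  • Python descriptive statistics. The field of statistics is often misunderstood, but it plays an essential role in our everyday lives. Statistics, done correctly, allows us to extract knowledge from the vague, ...
  • There are many methods for computing descriptive statistics and related operations on a DataFrame. Most of them are aggregations such as sum() and mean(), but some, such as cumsum(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum,...
  • 描述性统计分析.ipynb

    2019-11-22 17:58:15
    Descriptive statistical analysis, for use on day 1
  • pandas描述性统计.pdf

    2019-01-30 17:53:11
    pandas描述性统计.pdf
  • Applying descriptive statistical analysis: identifying high-quality stocks with descriptive statistics. Introduction: Hello everyone, this is 每天分析一点点 ("A Little Analysis Every Day"). The previous installment covered dispersion; this one covers the basic principles and applications of descriptive statistical analysis, including the concepts of central tendency, dispersion, skewness, and kurtosis, and then...
  • Statistics study: descriptive statistics (theory). Main contents: 1. How to compute each measure of central tendency 2. Characteristics of each measure of central tendency 3. How to compute each measure of dispersion 4. Characteristics of each measure of dispersion 5. How to measure skewness and kurtosis
  • References. 1 The concept of descriptive statistics: descriptive statistics analyzes the data in a dataset and uses charts or summary values to derive descriptive characteristics that reflect objective phenomena and the overall situation, including central tendency, dispersion, and frequency distribution. Using NumPy and ... in Python ...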
