Data Analysis
Table of contents
Introduction
Data types
Analyzing Quantitative Data
I. Measures of Center
II. Measures of Spread
III. Shape of distribution
IV. Outliers
Descriptive vs. Inferential Statistics
The word “data” is defined as distinct pieces of information. You may think of data as simply numbers on a spreadsheet, but it can come in many forms: text, videos, spreadsheets, databases, images, audio … Utilizing data is the new way of the world. Data is used to understand and improve nearly every facet of our lives, from early disease detection to social networks that allow us to connect and communicate with people around the world. No matter what field you’re in, from insurance and banking to medicine, education, and agriculture, you can use data to make better decisions and accomplish your goals.
Running descriptive statistics on your datasets is crucial before you begin inferential statistics. Many people, especially novice researchers, do not take the time to carefully run descriptive statistics, clean the data, and check that the data meet the assumptions required by the statistical tests they plan to use, but it is imperative that this process is done correctly.
Quantitative data takes on numeric values that allow us to perform mathematical operations.
- Continuous data can be split into smaller and smaller units, and still a smaller unit exists (for example, we can measure age in years, months, days, hours, or seconds, but there are still smaller units that could be associated with it).
- Discrete data only takes on countable values.
Categorical data are used to label a group or set of items.
- Categorical Ordinal: data that take on a ranked ordering (for example, a rating on a scale from Very Bad to Very Good).
- Categorical Nominal: data that do not have an order or ranking.
The Median
The median splits our data so that 50% of our values are lower and 50% are higher.
- Median for an odd number of values: If we have an odd number of observations, the median is simply the number in the direct middle.
- Median for an even number of values: If we have an even number of observations, the median is the average of the two values in the middle.
Note: In order to compute the median we MUST sort our values first.
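The odd and even cases above can be sketched in a few lines of Python. The function name `median` and the sample values are made up for illustration; in practice `statistics.median` from the standard library does the same job.

```python
import statistics


def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)  # the values must be sorted first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the value in the direct middle.
        return ordered[mid]
    # Even number of observations: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [5, 1, 9, 3, 7]       # illustrative values
even_sample = [5, 1, 9, 3, 7, 11]  # illustrative values

print(median(odd_sample))   # 5
print(median(even_sample))  # 6.0
print(statistics.median(even_sample))  # 6.0, same result from the standard library
```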
The Mode
The mode is the most frequently observed value in our dataset.
Note 1: There might be multiple modes for a particular dataset or no mode at all.
Note 2: The mode of a distribution is essentially the tallest bar in a histogram. There may be multiple modes depending on the number of peaks in our histogram.
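A minimal sketch of finding the mode, including the multiple-mode and no-mode cases from Note 1. The helper name `modes` is hypothetical, and treating a dataset in which every value occurs exactly once as having no mode is a convention assumed here, not something the text prescribes.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values of a non-empty sequence, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # Assumed convention: if nothing repeats, report no mode at all.
        return []
    return sorted(value for value, count in counts.items() if count == top)


print(modes([4, 4, 1]))           # [4]
print(modes([1, 2, 2, 3, 3, 4]))  # [2, 3]
print(modes([1, 2, 3]))           # []
```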
One of the most common ways to measure the spread of our data is by looking at the Five Number Summary. It consists of five values:
The minimum: The smallest number in the dataset.
The first quartile Q1: The value such that 25% of the data falls below.
The second quartile Q2 (the median): The value such that 50% of the data falls below.
The third quartile Q3: The value such that 75% of the data falls below.
The maximum: The largest number in the dataset.
We can represent the five-number summary with a box plot.
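As a rough sketch, the five values can be computed with NumPy's `percentile`. The data values are made up, and NumPy's default linear interpolation is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3.

```python
import numpy as np

# Illustrative data, not taken from the article.
data = np.array([2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21])

# The 0th, 25th, 50th, 75th and 100th percentiles form the five-number summary.
minimum, q1, q2, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print(f"min={minimum}, Q1={q1}, Q2={q2}, Q3={q3}, max={maximum}")
```

The same five numbers are what a box plot draws: the box spans Q1 to Q3, with a line at Q2.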
Measures of Spread are used to give us an idea of how spread out our data are. Common measures of spread include:
1. Range: The range is the difference between the maximum and the minimum.
2. Interquartile Range (IQR): The interquartile range is calculated as the difference between Q3 and Q1.
3. Variance: The variance is used to compare the spread of two different groups. A dataset with a higher variance is more spread out than a dataset with a lower variance. Be careful though, there might just be an …
4. Standard Deviation: The standard deviation is one of the most common measures for talking about the spread of data. It is defined as the square root of the variance. The standard deviation is used more often in practice than the variance because it shares the units of the original dataset.
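The four measures above can be sketched with NumPy as follows. The data are illustrative, and `ddof=1` (the sample variance, dividing by n - 1) is an assumption: the text does not say whether the sample or the population formula is intended.

```python
import numpy as np

# Illustrative data, not taken from the article.
data = np.array([2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 21])

data_range = data.max() - data.min()    # 1. range = maximum - minimum
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                           # 2. interquartile range = Q3 - Q1
variance = data.var(ddof=1)             # 3. sample variance (assumed ddof=1)
std_dev = data.std(ddof=1)              # 4. standard deviation = sqrt(variance)

print(f"range={data_range}, IQR={iqr}")
print(f"variance={variance:.2f}, std={std_dev:.2f}")
```

The variance is in squared units of the data, while the standard deviation is back in the original units, which is why it is usually the one reported.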
Note: If you’re interested in mathematical writing with LaTeX check out this article
The Shape of the Distribution
From a histogram, we can quickly identify the shape of our data, which can actually tell us a lot about the measures of center and spread. The distribution of data is frequently associated with one of the three shapes:
1. Right-skewed: A histogram that has shorter bins on the right and taller bins on the left is considered right-skewed. In this distribution, the mean is greater than the median. Real-world examples: the amount of drug left in your bloodstream over time, human athletic abilities …
2. Left-skewed: A histogram that has shorter bins on the left and taller bins on the right is considered left-skewed. In this distribution, the mean is less than the median. Real-world examples: the age of death, asset price changes …
3. Symmetric: Any distribution where you can draw a line down the middle and the right side mirrors the left side is considered symmetric. One of the most common symmetric distributions is the normal distribution, also called the “bell curve”.
In a symmetric distribution the mean equals the median; when the distribution has a single peak, as the normal distribution does, both also equal the mode. Its box plot is symmetric as well.
Real-world examples: Heights, weights, precipitation amount …
Note 1: Data in the real world can be messy and might not follow any of these distributions.
Note 2: In a skewed distribution, the mean is pulled by the tail of the distribution while the median stays closer to the mode.
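The relationship between skew and the mean and median can be checked on simulated data. This is only an illustrative sketch: the exponential and normal distributions, the seed, and the sample size are all choices made here, not part of the article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated samples (assumed distributions, chosen only to show each shape).
right_skewed = rng.exponential(scale=1.0, size=10_000)    # long tail to the right
left_skewed = -right_skewed                               # mirrored: tail to the left
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)   # bell curve

for name, sample in [
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
    ("symmetric", symmetric),
]:
    # The mean is pulled toward the tail; the median moves much less.
    print(f"{name:>12}: mean={np.mean(sample):.3f}, median={np.median(sample):.3f}")
```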
Outliers are data points that fall very far from the rest of the values in a dataset. There are a number of different methods for determining what counts as very far. The method I usually use for detecting outliers isn’t very scientific: I just plot the data and see if there is a point really far from any of the other data points. There are also more formal methods and techniques for identifying outliers.
A quick plot of your data can often help you understand a lot in a short amount of time.
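Beyond eyeballing a plot, one widely used rule of thumb flags any point more than 1.5 × IQR below Q1 or above Q3. This is not the method described in the text, just one common formal alternative; the helper name `iqr_outliers`, the multiplier `k`, and the sample values are assumptions for illustration.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the points outside the [lower, upper] fences.
    return values[(values < lower) | (values > upper)]


sample = [12, 13, 14, 15, 15, 16, 17, 18, 95]  # illustrative values
print(iqr_outliers(sample))  # [95.]
```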
In order to illustrate the impact that outliers can have on the way we report summary statistics, let’s consider the earnings of startups/companies. Imagine I select ten companies: nine have earnings in the thousands of dollars, and the tenth is Facebook or Tesla. The mean, variance, and standard deviation are incredibly misleading: none of the ten earnings is even close to the calculated mean. A better measure of center would certainly be the median.
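The startup example can be made concrete with invented numbers. The ten values below are hypothetical (nine modest earnings and one giant, all in thousands of dollars); they are not the figures from the original illustration.

```python
import numpy as np

# Hypothetical earnings in thousands of dollars: nine startups and one giant.
earnings = np.array([35, 42, 48, 50, 55, 58, 61, 64, 70, 2_000_000], dtype=float)

mean = earnings.mean()
median = np.median(earnings)
std_dev = earnings.std(ddof=1)  # sample standard deviation (assumed)

print(f"mean   = {mean:,.1f}")     # dominated by the single outlier
print(f"median = {median:,.1f}")   # 56.5, close to the typical startup
print(f"std    = {std_dev:,.1f}")  # also inflated by the outlier
```

With these numbers the mean is about 200,000, a value that describes neither the nine small companies nor the giant, while the median stays at 56.5.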
If you’re the one doing the reporting, here are some of my personal guidelines when analyzing data:
Plot your data.
If there are outliers, determine how you should handle them. This might require a domain expert. Should you remove them? Should you fix them? Should you keep them?
If you’re working with data that are normally distributed (the bell shape that we saw before), the mean and the standard deviation fully describe the distribution. This may seem surprising, but it’s true. However, if you’re working with skewed data, the five-number summary provides much more information than the mean and the standard deviation can.
Note: If you aren’t sure whether your data are normally distributed, there are statistical methods, such as the Kolmogorov-Smirnov test, that are designed to help you decide.
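A minimal sketch of the Kolmogorov-Smirnov check mentioned in the note, using `scipy.stats.kstest` on simulated data. The helper name `ks_normality_pvalue`, the distributions, and the seed are assumptions. Standardizing with the sample's own mean and standard deviation makes the p-value only approximate, so treat the result as a rough guide rather than a definitive test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
normal_sample = rng.normal(loc=50, scale=5, size=1_000)  # simulated bell-shaped data
skewed_sample = rng.exponential(scale=5, size=1_000)     # simulated right-skewed data


def ks_normality_pvalue(sample):
    """Compare the standardized sample with a standard normal distribution."""
    z = (sample - sample.mean()) / sample.std(ddof=1)
    return stats.kstest(z, "norm").pvalue


p_normal = ks_normality_pvalue(normal_sample)
p_skewed = ks_normality_pvalue(skewed_sample)

# A small p-value is evidence against normality.
print(f"normal sample: p = {p_normal:.3f}")
print(f"skewed sample: p = {p_skewed:.3g}")
```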
How should we work with these outliers in practice?
At the very least, we should note that they exist and recognize the impact they have on our summary statistics. If the outliers are typos or data entry errors, that is a reason to remove these points or, if we know what they should be, to update them with the correct values. In cases like the example above (Startups/Facebook), we might try to understand what was so different about the outlier compared to the other startups. How did this startup/company become so successful? And why are its earnings so large in comparison? There is an entire field aimed at this idea, called anomaly detection.
A single number can be very misleading about what is actually happening in our data. Some statistics are more misleading than others. If you are the consumer of information based on data, which we all are, it’s important to know how to ask the right questions regarding the statistics around you.
Descriptive vs. Inferential Statistics
The topics covered thus far have all been aimed at descriptive statistics, that is, describing the data we’ve collected. There’s an entire other field of statistics known as inferential statistics, which is aimed at drawing conclusions about a population of individuals based only on a sample of individuals from that population.
Drawing conclusions regarding a parameter based on our statistics is known as inference.
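The distinction between a parameter and a statistic can be illustrated with a simulation. Everything here is hypothetical: a made-up population of 100,000 heights, a random sample of 200, and the sample mean used as the statistic that estimates the population mean.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: heights in cm (normally we never observe all of it).
population = rng.normal(loc=170, scale=10, size=100_000)

# What we actually collect: a random sample drawn without replacement.
sample = rng.choice(population, size=200, replace=False)

parameter = population.mean()  # the population parameter (usually unknown)
statistic = sample.mean()      # the sample statistic used to infer it

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```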
Looking Ahead
Although this article does not dive deep into inferential statistics, you’re now aware of the difference between these two branches of statistics. The way we perform inferential statistics is changing as technology evolves. Many career paths involving Machine Learning and Artificial Intelligence are aimed at using collected data to draw conclusions about entire populations at an individual level.
We started with identifying data types as either categorical or quantitative. Then we learned that we could identify quantitative data as either continuous or discrete, and categorical data as either ordinal or nominal.
Measures of Spread
I. Range
II. Interquartile Range (IQR)
III. Variance
IV. Standard Deviation
Shape of distribution
I. Right-skewed
II. Left-skewed
III. Symmetric
There are two types of statistics:
1. Descriptive statistics: Present, organize, summarize, and describe the collected data using the measures discussed throughout: measures of center, measures of spread, the shape of the distribution, and outliers. We can also use plots of our data to gain a better understanding.
2. Inferential statistics: This is where you run different tests on your sample and draw conclusions that can be generalized to a larger population. Performing inferential statistics well requires a sample that accurately represents the population of interest.
Descriptive Statistics in Python
After data collection, most Psychology researchers use different ways to summarise the data. In this tutorial we will learn how to do descriptive statistics in Python. Python, being a programming language, offers many ways to carry out descriptive statistics.
Pandas makes data manipulation and summary statistics quite similar to how you would do it in R. I believe that the dataframe in R is very intuitive to use, and Pandas offers a DataFrame similar to R’s. Also, many Psychology researchers may have experience with R.
Thus, in this tutorial you will learn how to do descriptive statistics using Pandas, but also using NumPy and SciPy. We start with using Pandas for obtaining summary statistics and some variance measures. After that we continue with the central tendency measures (e.g., mean and median) using Pandas and NumPy. Pandas and NumPy have no built-in methods for the harmonic, geometric, and trimmed means, so we use SciPy for those. Towards the end we learn how to get some measures of variability (e.g., variance using Pandas).
import numpy as np
from pandas import DataFrame as df  # the DataFrame class, aliased as df
from scipy.stats import trim_mean, kurtosis  # trimmed mean and kurtosis
from scipy.stats.mstats import mode, gmean, hmean  # mode, geometric mean, harmonic mean
Simulate response time data
Many times in experimental psychology, response time is the dependent variable. I will simulate an experiment in which the dependent variable is response time to some arbitrary targets. The simulated data will, further, have two independent variables (IVs): “iv1” has 2 levels and “iv2” has 3 levels. The data are simulated at the same time as the dataframe is created, and the first descriptive statistics are obtained using the method describe.
Pandas outputs summary statistics with this method. The output is a table.
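The original simulation code is not shown here, so the following is only a sketch of what such a simulation might look like. The column names `rt`, `iv1`, and `iv2`, the level labels, the log-normal distribution, and the seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials

# Simulated experiment: response time (ms) plus two independent variables.
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),  # always positive, right-skewed
    "iv1": np.repeat(["low", "high"], n // 2),          # 2 levels
    "iv2": np.tile(["a", "b", "c"], n // 3),            # 3 levels
})

# describe() returns count, mean, std, min, quartiles and max in one table.
summary = data["rt"].describe()
print(summary)
```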
Typically, a researcher is interested in the descriptive statistics for each level of the IVs. Therefore, I group the data by these. Using describe on the grouped data gives aggregated statistics for each level of each IV. The raw output is somewhat hard to read. Note that the method unstack is used to get the mean, standard deviation (std), etc. as columns, which makes it somewhat easier to read.
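A sketch of the grouped summary on simulated data (column names, levels, and seed are assumptions, as before). In recent pandas versions, describe on grouped data already returns the statistics as columns; calling unstack on that result additionally pivots the innermost grouping level (`iv2` here) into the columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),
    "iv1": np.repeat(["low", "high"], n // 2),
    "iv2": np.tile(["a", "b", "c"], n // 3),
})

# One row per combination of iv1 and iv2, one column per statistic.
summary = data.groupby(["iv1", "iv2"])["rt"].describe()
print(summary)

# unstack() moves the iv2 levels into the columns: one row per iv1 level.
wide = summary.unstack()
print(wide["mean"])
```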
Often we want to know something about the “average” or “middle” of our data. Using Pandas and NumPy, the two most commonly used measures of central tendency can be obtained: the mean and the median. The mode can also be obtained using Pandas, but I will use methods from SciPy for the mode and the trimmed mean.
The method aggregate in combination with NumPy’s mean can also be used.
Both methods give the same output, but the aggregate method has some advantages that I will explain later.
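The two routes to the group means might look like the sketch below, again on simulated data with assumed column names. The string alias `"mean"` is passed to aggregate here instead of NumPy's `mean` function, since recent pandas versions warn when a NumPy callable is passed; that substitution is mine, not the author's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),
    "iv1": np.repeat(["low", "high"], n // 2),
    "iv2": np.tile(["a", "b", "c"], n // 3),
})
grouped = data.groupby(["iv1", "iv2"])["rt"]

means_direct = grouped.mean()          # built-in groupby method
means_agg = grouped.aggregate("mean")  # same result through aggregate
medians = grouped.median()

print(means_direct)
print(medians)
```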
Geometric & Harmonic mean
Sometimes the geometric or harmonic mean can be of interest. These two descriptives can be obtained using the method apply with the functions gmean and hmean (from SciPy) as arguments. That is, there is no built-in method in Pandas or NumPy for calculating geometric and harmonic means.
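A sketch of apply with `gmean` and `hmean` on simulated, strictly positive response times (column names and seed are assumptions). Each group is converted to a NumPy array before being handed to SciPy; that is a cautious choice made here, not a requirement stated in the text.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean, hmean

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),  # positive values only
    "iv1": np.repeat(["low", "high"], n // 2),
    "iv2": np.tile(["a", "b", "c"], n // 3),
})
grouped = data.groupby(["iv1", "iv2"])["rt"]

# apply() calls the function once per group and collects the results.
geometric = grouped.apply(lambda s: gmean(s.to_numpy()))
harmonic = grouped.apply(lambda s: hmean(s.to_numpy()))

print(geometric)
print(harmonic)
```

For positive data the harmonic mean is never larger than the geometric mean, which in turn is never larger than the arithmetic mean.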
Trimmed means are, at times, used. Pandas and NumPy do not seem to have methods for obtaining the trimmed mean. However, we can use the function trim_mean from SciPy. By using apply on our grouped data, we can call trim_mean with an argument that removes 10 % of the largest and 10 % of the smallest values.
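A sketch of the trimmed mean per group with `scipy.stats.trim_mean`, where `proportiontocut=0.1` drops 10 % of the observations from each end before averaging. The simulated data, column names, and seed are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),
    "iv1": np.repeat(["low", "high"], n // 2),
    "iv2": np.tile(["a", "b", "c"], n // 3),
})
grouped = data.groupby(["iv1", "iv2"])["rt"]

# Cut 10 % from each tail of every group, then take the mean of what is left.
trimmed = grouped.apply(lambda s: trim_mean(s.to_numpy(), proportiontocut=0.1))
print(trimmed)
```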
Most of the time I probably would want to see all measures of central tendency at the same time. Luckily, aggregate enables us to use many NumPy and SciPy methods. In the example below the standard deviation (std), mean, harmonic mean, geometric mean, and trimmed mean are all in the same output. Note that we will have to add the trimmed means afterwards.
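Putting the pieces together might look like the sketch below. The wrapper functions `harmonic_mean` and `geometric_mean` are hypothetical helpers introduced here so that the output columns get readable names; the simulated data and column names are assumptions as before. The trimmed mean is added afterwards as an extra column.

```python
import numpy as np
import pandas as pd
from scipy.stats import gmean, hmean, trim_mean

rng = np.random.default_rng(seed=123)
n = 300  # hypothetical number of trials
data = pd.DataFrame({
    "rt": rng.lognormal(mean=6.2, sigma=0.25, size=n),
    "iv1": np.repeat(["low", "high"], n // 2),
    "iv2": np.tile(["a", "b", "c"], n // 3),
})
grouped = data.groupby(["iv1", "iv2"])["rt"]


def harmonic_mean(s):
    """Harmonic mean of one group (hypothetical wrapper around SciPy's hmean)."""
    return hmean(s.to_numpy())


def geometric_mean(s):
    """Geometric mean of one group (hypothetical wrapper around SciPy's gmean)."""
    return gmean(s.to_numpy())


# aggregate accepts a list mixing built-in names and custom functions.
summary = grouped.aggregate(["std", "mean", harmonic_mean, geometric_mean])

# The trimmed mean needs an extra argument, so it is added afterwards.
summary["trimmed_mean"] = grouped.apply(lambda s: trim_mean(s.to_numpy(), 0.1))
print(summary)
```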
That is all. Now you know how to obtain some of the most common descriptive statistics using Python. Pandas, NumPy, and SciPy really make these calculations almost as easy as doing them in graphical statistical software such as SPSS. One great advantage of the methods apply and aggregate is that we can pass in other methods or functions to obtain other types of descriptives.