Sampling distribution of the sample mean
One of the most important concepts discussed in the context of inferential data analysis is the idea of sampling distributions. Understanding sampling distributions helps us better comprehend and interpret results from our descriptive as well as predictive data analysis investigations. Sampling distributions are also frequently used in decision making under uncertainty and hypothesis testing.
You may already be familiar with the idea of probability distributions. A probability distribution describes the probability associated with each value (or range of values) that a random variable may assume. A random variable is a quantity whose value (outcome) is determined randomly. Examples of random variables include the monthly revenue of a retail store, the number of customers arriving at a car wash location on any given day, the number of accidents on a certain highway on any given day, and the weekly sales volume at a retail store. Although the outcome of a random variable is random, the probability distribution allows us to gain an understanding of how likely different values are to occur. Sampling distributions are probability distributions that we attach to the statistics of a sample.
A sample statistic (also known simply as a statistic) is a value computed from a sample. For example, suppose you collect the results of a survey filled out by 250 randomly selected individuals who live in a certain neighborhood. Based on the survey results, you find that the average annual income of the individuals in this sample is $82,512. This is a sample statistic and is denoted by x̅ = $82,512. The sample mean is also a random variable (denoted by X̅) with a probability distribution. The probability distribution for X̅ is called the sampling distribution of the sample mean. Sampling distributions can be defined for other types of sample statistics as well, including the sample proportion, sample regression coefficients, and the sample correlation coefficient.
You might be wondering why X̅ is a random variable while the sample mean is just a single number! The key to understanding this lies in the idea of sample-to-sample variability: samples drawn from the same population are not identical. For example, suppose that instead of conducting only one survey of 250 individuals living in a particular neighborhood, we collected 35 samples of the same size in that neighborhood. If we calculated the sample mean x̅ for each of the 35 samples, we would get 35 different values. Now suppose, hypothetically, that we conducted a very large number of surveys of the same size in that neighborhood. We would get a very large number of (different) values for the sample mean. The distribution of those sample means is what we call the sampling distribution of the sample mean. From this perspective, X̅ (note the capital letter) is the random variable representing sample means, and x̅ (note the lowercase letter) is just one realization of that random variable.
Sampling distribution of the sample mean
Assuming that X represents the data (population), if X has a distribution with mean μ and standard deviation σ, and if X is approximately normally distributed or the sample size n is large, then X̅ is approximately normally distributed with mean μ and standard deviation (standard error) σ/√n:
X̅ ~ N(μ, σ/√n)
The above distribution is only valid if
X is approximately normal or the sample size n is large, and
the data (population) standard deviation σ is known.
If X is normal, then X̅ is also normally distributed regardless of the sample size n. The Central Limit Theorem tells us that even if X is not normal, if the sample size is large enough (usually greater than 30), then the distribution of X̅ is approximately normal (Sharpe, De Veaux, Velleman and Wright, 2020, pp. 318–320). If X̅ is normal, we can easily standardize it and convert it to the standard normal random variable Z:
Z = (X̅ - μ) / (σ/√n)
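To make the standardization Z = (X̅ - μ) / (σ/√n) concrete, here is a minimal sketch using only Python's standard library. The values μ = 20, σ = 3, n = 250 and the threshold 20.3 are illustrative assumptions, and `prob_sample_mean_exceeds` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

# Illustrative values, assumed for this sketch only.
mu, sigma, n = 20.0, 3.0, 250

# Standard deviation of the sample mean (the standard error).
standard_error = sigma / sqrt(n)

def prob_sample_mean_exceeds(x_bar):
    """Return P(X-bar > x_bar) by standardizing to Z."""
    z = (x_bar - mu) / standard_error      # standardize the sample mean
    return 1.0 - NormalDist().cdf(z)       # upper-tail probability of Z

print(f"standard error = {standard_error:.4f}")
print(f"P(X-bar > 20.3) = {prob_sample_mean_exceeds(20.3):.4f}")
```

This only applies when σ is known and X̅ is (approximately) normal, as stated in the conditions above.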
If the population standard deviation σ is not known, we cannot standardize X̅ using σ, so the normal distribution no longer applies directly. Instead, we estimate σ with the sample standard deviation s. If certain conditions are satisfied (explained below), we can then transform X̅ to another random variable t such that
t = (X̅ - μ) / (s/√n)
The random variable t is said to follow the t-distribution with n-1 degrees of freedom, where n is the sample size. The t-distribution is bell-shaped and symmetric (just like the normal distribution) but has fatter tails compared to the normal distribution. This means values further away from the mean have a higher likelihood of occurring compared to that in the normal distribution.
The conditions to use the t-distribution for the random variable t are as follows (Sharpe et al., 2020, pp. 415–420):
If X is normally distributed, the t-distribution can be used even for small sample sizes (n < 15).
If the sample size is between 15 and 40, the t-distribution can be used as long as X is unimodal and reasonably symmetric.
For sample sizes greater than 40, the t-distribution can be used unless X's distribution is heavily skewed.
Simulation with Python
Let's draw a sample of size n=250 from the normal distribution. Here we are assuming that our data is normally distributed with parameters μ = 20 and σ = 3. Collecting one sample from this population gives a single sample mean x̅. Repeating the draw many times and recording each sample mean gives an empirical picture of the distribution of X̅.
The resulting distribution of sample means is approximately symmetric and bell-shaped (just like the normal distribution), with an average of approximately 20 and a standard error approximately equal to 3/sqrt(250) = 0.19.
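The simulation described here can be sketched with Python's standard library. This is a minimal sketch rather than the original listing: the function name `simulate_sample_means`, the number of repetitions (2,000), and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples samples of size n from N(mu, sigma); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(num_samples)
    ]

# Population parameters from the text; 2,000 repetitions is an assumption.
sample_means = simulate_sample_means(mu=20.0, sigma=3.0, n=250, num_samples=2000)

print(f"average of sample means:    {mean(sample_means):.3f}")
print(f"std. dev. of sample means:  {stdev(sample_means):.3f}")
print(f"theoretical standard error: {3.0 / sqrt(250):.3f}")
```

Plotting a histogram of `sample_means` (for example with a plotting library of your choice) should show the bell shape described above.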
Sampling from the same population with different sample sizes will result in different measures of spread in the outcome distribution. As we expect, increasing the sample size will reduce the standard error and therefore, the distribution will be narrower around its average. Note that the distribution of X̅ is normal even for extremely small sample sizes. This is because X is normally distributed.
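The effect of the sample size on the spread can be checked with a short experiment. The sketch below is an illustration under assumed settings: the sample sizes (2, 10, 50, 250), the 2,000 repetitions, and the helper name `sample_mean_spread` are not from the original article.

```python
import random
from math import sqrt
from statistics import stdev

def sample_mean_spread(mu, sigma, n, num_samples=2000, seed=0):
    """Empirical standard deviation of num_samples sample means of size n."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        for _ in range(num_samples)
    ]
    return stdev(means)

mu, sigma = 20.0, 3.0             # population parameters from the text
sample_sizes = [2, 10, 50, 250]   # assumed sizes, chosen for illustration
empirical_se = {n: sample_mean_spread(mu, sigma, n) for n in sample_sizes}

for n in sample_sizes:
    theoretical = sigma / sqrt(n)
    print(f"n={n:>3}  empirical SE={empirical_se[n]:.3f}  sigma/sqrt(n)={theoretical:.3f}")
```

The empirical standard errors should track σ/√n closely and shrink as n grows.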
What if the population (data) is not normal?
No worries! Even if your data is not normally distributed, if the sample size is large enough, the distribution of X̅ can still be approximated using the normal distribution (according to the Central Limit Theorem). The following figure shows the distribution of X̅ when X is heavily skewed to the left. As you can see, X̅'s distribution tends to mimic the distribution of X for small sample sizes. However, as the sample size grows, the distribution of X̅ becomes more symmetric and bell-shaped. As mentioned above, if the sample size is large (usually larger than 30), X̅'s distribution is approximately normal regardless of what the distribution of X is.
Knowing the distribution of X̅ can help us solve problems where we need to use inferential data analysis to make decisions under uncertainty. Many business problems require decision-making tools that can address the stochastic and probabilistic nature of random events. Hypothesis testing is one of those tools and is frequently used in many business domains, including retail operations, marketing, and quality assurance.
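One way to see this numerically is to track the skewness of the simulated sample means as n grows. The sketch below is an assumption-laden illustration: the left-skewed population is taken to be 10 minus an exponential random variable (not necessarily the population used in the original figure), and the helper names, sample sizes, and repetition count are hypothetical.

```python
import random

def skewness(values):
    """Moment-based skewness: third central moment over variance to the 3/2 power."""
    count = len(values)
    avg = sum(values) / count
    m2 = sum((v - avg) ** 2 for v in values) / count
    m3 = sum((v - avg) ** 3 for v in values) / count
    return m3 / m2 ** 1.5

def sample_mean_skewness(n, num_samples=3000, seed=0):
    """Skewness of the sample means when X = 10 - Exponential(1) (left-skewed)."""
    rng = random.Random(seed)
    means = [
        sum(10.0 - rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(num_samples)
    ]
    return skewness(means)

# Assumed sample sizes: very small, moderate, and large.
skew_by_n = {n: sample_mean_skewness(n) for n in (2, 30, 300)}

for n, skew in skew_by_n.items():
    print(f"n={n:>3}  skewness of sample means = {skew:.3f}")
```

A skewness near zero is consistent with a symmetric, bell-shaped distribution; strongly negative values indicate that the left skew of X still shows through.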
For example, suppose a retail store has run a major marketing campaign and is interested in investigating the effect of the campaign on the store's average sales. Suppose that the management would like to investigate whether average daily sales are now greater than $8,000. The following hypotheses express this research question:
H0: μ = $8,000
HA: μ > $8,000
Note that we are conducting a test on the population average sales, hence the μ. To address the test, suppose we record sales volumes over 40 days (a sample with n=40) and calculate the required statistics. Suppose the average and standard deviation of daily sales volumes are calculated as x̅=$8,100 and s=$580, respectively. Since the value of σ is not known, we can convert X̅ to the random variable t with n-1=39 degrees of freedom, where
t = (x̅ - μ) / (s/√n) = (8,100 - 8,000) / (580/√40) ≈ 1.09
To address the test, we need to find the p-value associated with it. This probability is calculated as
p-value = P(t > 1.09), where t follows the t-distribution with 39 degrees of freedom.
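The test statistic t = (x̅ - μ) / (s/√n), evaluated at the hypothesized mean of $8,000, can be checked with a few lines of Python. This is a small sketch; the variable names are illustrative, and the numbers are the ones given in the example.

```python
from math import sqrt

# Values from the retail example in the text.
x_bar = 8100.0   # sample mean of daily sales
s = 580.0        # sample standard deviation of daily sales
n = 40           # number of days in the sample
mu_0 = 8000.0    # hypothesized population mean

standard_error = s / sqrt(n)             # estimated standard error of the mean
t_stat = (x_bar - mu_0) / standard_error
df = n - 1                               # degrees of freedom

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```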
The probability density function for the random variable t along with the p-value of the test are depicted below.
The following will find the p-value for the test.
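The original listing is not reproduced here, so what follows is a minimal sketch that uses only the standard library. It integrates the t-distribution density numerically with Simpson's rule; the helper names `t_pdf` and `t_upper_tail` are hypothetical. If SciPy is available, `scipy.stats.t.sf(t_stat, df)` computes the same upper-tail probability directly.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t_value, df, steps=2000):
    """P(T > t_value), using Simpson's rule on [0, t_value] and symmetry about 0."""
    h = t_value / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t_value, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 0.5 - area_zero_to_t              # half the mass lies above 0

# Values from the retail example in the text.
x_bar, s, n, mu_0 = 8100.0, 580.0, 40, 8000.0
t_stat = (x_bar - mu_0) / (s / sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)

print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.3f}")
```

Because the alternative hypothesis is one-sided (greater than), only the upper tail is used.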
The calculations give a p-value of approximately 0.14. By most standards (significance levels), this is a large p-value, so we fail to reject the null hypothesis. In other words, based on the distribution of X̅ and the sample collected, we cannot conclude that the average daily sales volume at the retail store, μ, is greater than $8,000. This calculation was possible only because we knew what the distribution of X̅ was.
Sampling distributions could be defined for other sample statistics (e.g., sample proportions, regression predictor coefficients, etc.) and are also used in other contexts like confidence and prediction intervals or inferential analysis on regression results.