  • Explaining Math with Python: Quantiles
    2020-12-09 19:34:07

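    1. Worked quantile examples with Python code

    Case 1

    Ex1: Given data = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36], find Q1, Q2, Q3, and the IQR.

    Solution:

    Steps:

    1. Sort the data in ascending order: data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

    2. Compute the position of each quantile

    3. Read off the quantile at that position, interpolating when the position is not an integer

    Quantile method 1

    pos = (n+1)*p, where n is the number of data points and p is a value between 0 and 1. Positions are 1-based.

    Q1: pos = (11 + 1)*0.25 = 3 (p=0.25), so Q1 = 15

    Q2: pos = (11 + 1)*0.5 = 6 (p=0.5), so Q2 = 40

    Q3: pos = (11 + 1)*0.75 = 9 (p=0.75), so Q3 = 43

    IQR = Q3 - Q1 = 43 - 15 = 28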

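    import math

    def quantile_p(data, p):
        # data must already be sorted in ascending order; pos is a 1-based position
        pos = (len(data) + 1)*p
        # pos = 1 + (len(data)-1)*p   # method 2
        pos = min(max(pos, 1), len(data))  # clamp so the indices below stay in range
        pos_integer = int(math.modf(pos)[1])
        pos_decimal = pos - pos_integer
        if pos_decimal == 0:
            return data[pos_integer - 1]
        # linear interpolation between the two neighboring sorted values
        Q = data[pos_integer - 1] + (data[pos_integer] - data[pos_integer - 1])*pos_decimal
        return Q

    data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    Q1 = quantile_p(data, 0.25)
    print("Q1:", Q1)
    Q2 = quantile_p(data, 0.5)
    print("Q2:", Q2)
    Q3 = quantile_p(data, 0.75)
    print("Q3:", Q3)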

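    Quantile method 2

    pos = 1 + (n-1)*p, where n is the number of data points and p is a value between 0 and 1

    Q1: pos = 1 + (11 - 1)*0.25 = 3.5 (p=0.25), so Q1 = 15 + (36 - 15)*0.5 = 25.5

    Q2: pos = 1 + (11 - 1)*0.5 = 6 (p=0.5), so Q2 = 40

    Q3: pos = 1 + (11 - 1)*0.75 = 8.5 (p=0.75), so Q3 = 42 + (43 - 42)*0.5 = 42.5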

    ```

    import math

    def quantile_p(data, p):
        # data must already be sorted in ascending order; pos is a 1-based position
        pos = 1 + (len(data)-1)*p
        pos_integer = int(math.modf(pos)[1])
        pos_decimal = pos - pos_integer
        if pos_decimal == 0:
            return data[pos_integer - 1]
        # linear interpolation between the two neighboring sorted values
        Q = data[pos_integer - 1] + (data[pos_integer] - data[pos_integer - 1])*pos_decimal
        return Q

    data = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
    Q1 = quantile_p(data, 0.25)
    print("Q1:", Q1)
    Q2 = quantile_p(data, 0.5)
    print("Q2:", Q2)
    Q3 = quantile_p(data, 0.75)
    print("Q3:", Q3)

    ```

    ## Case 2

    Given the data set data = [7, 15, 36, 39, 40, 41], find Q1, Q2, and Q3.

    Quantile method 1

    import math

    def quantile_p(data, p):
        data = sorted(data)  # sort a copy so the caller's list is left untouched
        pos = (len(data) + 1)*p
        pos = min(max(pos, 1), len(data))  # clamp so the indices below stay in range
        pos_integer = int(math.modf(pos)[1])
        pos_decimal = pos - pos_integer
        if pos_decimal == 0:
            return data[pos_integer - 1]
        Q = data[pos_integer - 1] + (data[pos_integer] - data[pos_integer - 1])*pos_decimal
        return Q

    data = [7, 15, 36, 39, 40, 41]
    Q1 = quantile_p(data, 0.25)
    print("Q1:", Q1)
    Q2 = quantile_p(data, 0.5)
    print("Q2:", Q2)
    Q3 = quantile_p(data, 0.75)
    print("Q3:", Q3)

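    Results (n = 6, so the positions are 1.75, 3.5, and 5.25):

    Q1 = 7 + (15 - 7) × (1.75 - 1) = 13

    Q2 = 36 + (39 - 36) × (3.5 - 3) = 37.5

    Q3 = 40 + (41 - 40) × (5.25 - 5) = 40.25

    Quantile method 2

    Results (positions 2.25, 3.5, and 4.75):

    Q1: 20.25

    Q2: 37.5

    Q3: 39.75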
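The method 2 figures for Case 2 can be reproduced with a short sketch. The helper name `quantile_method2` is hypothetical (it is not part of the original article); it applies pos = 1 + (n-1)*p with linear interpolation and assumes 0 <= p <= 1.

```python
import math

def quantile_method2(data, p):
    """p-quantile using the 1-based position pos = 1 + (n-1)*p (hypothetical helper)."""
    values = sorted(data)
    pos = 1 + (len(values) - 1) * p
    lower = math.floor(pos)      # 1-based rank of the lower neighbor
    fraction = pos - lower       # how far pos lies past the lower neighbor
    if fraction == 0:
        return float(values[lower - 1])
    # linear interpolation between the two neighboring sorted values
    return values[lower - 1] + (values[lower] - values[lower - 1]) * fraction

data = [7, 15, 36, 39, 40, 41]
for name, p in [("Q1", 0.25), ("Q2", 0.5), ("Q3", 0.75)]:
    print(name, quantile_method2(data, p))
```

The three printed values should match the Q1, Q2, and Q3 listed above.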

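    2. What quantiles mean

    **Quartiles**

    **Concept**: sort the values in ascending order and split them into four equal parts; the values at the three cut points are the quartiles.

    **First quartile (Q1)**, also called the lower quartile: the value at the 25% position of the sorted sample.

    **Second quartile (Q2)**, also called the median: the value at the 50% position of the sorted sample.

    **Third quartile (Q3)**, also called the upper quartile: the value at the 75% position of the sorted sample.

    **Interquartile range** (IQR) = the difference between the third and the first quartile, Q3 - Q1

    Two ways to locate the position of the p-quantile

    position = (n+1)*p

    position = 1 + (n-1)*p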
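As a cross-check, Python's standard library exposes both position rules. The sketch below assumes Python 3.8 or later, where `statistics.quantiles` accepts `method="exclusive"` (positions at (n+1)*p) and `method="inclusive"` (positions at 1 + (n-1)*p). It uses the Case 1 data; the variable names are illustrative only.

```python
import statistics

data = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]

# "exclusive" places the cut points at position (n+1)*p  (method 1 above)
quartiles_exclusive = statistics.quantiles(data, n=4, method="exclusive")

# "inclusive" places the cut points at position 1 + (n-1)*p  (method 2 above)
quartiles_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("method 1:", quartiles_exclusive)
print("method 2:", quartiles_inclusive)
print("IQR (method 1):", quartiles_exclusive[2] - quartiles_exclusive[0])
```

`statistics.quantiles` sorts the data itself and returns the n-1 cut points as a list, so the quartiles arrive in the order Q1, Q2, Q3.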

    3. Quantiles in pandas

    pandas locates the quantile position with position = 1 + (n-1)*p, which corresponds to its default linear interpolation.

    Case 1

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]), columns=['a', 'b'])
    print("Original data:")
    print(df)
    print("Quantiles of columns a and b at p=0.1:")
    print(df.quantile(.1))

    Program output:

    Original data:
       a    b
    0  1    1
    1  2   10
    2  3  100
    3  4  100
    Quantiles of columns a and b at p=0.1:
    a    1.3
    b    3.7
    Name: 0.1, dtype: float64

    Hand calculation:

    Column a

    pos = 1 + (4 - 1)*0.1 = 1.3

    fraction = 0.3

    ret = 1 + (2 - 1) * 0.3 = 1.3

    Column b

    pos = 1.3

    ret = 1 + (10 - 1) * 0.3 = 3.7

    Case 2

    Use pandas to compute the quartiles of data = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36].

    import pandas as pd
    import numpy as np

    dt = pd.Series(np.array([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]))
    print("Data:")
    print(dt)
    # quantile() sorts internally, so the Series does not need to be sorted first
    print('Q1:', dt.quantile(.25))
    print('Q2:', dt.quantile(.5))
    print('Q3:', dt.quantile(.75))

    Result:

    Q1: 25.5

    Q2: 40.0

    Q3: 42.5

    4. Summary

    A custom quantile function in Python

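    import math

    def quantile_p(data, p, method=1):
        data = sorted(data)  # sort a copy so the caller's list is left untouched
        if method == 2:
            pos = 1 + (len(data)-1)*p
        else:
            pos = (len(data) + 1)*p
        pos = min(max(pos, 1), len(data))  # clamp so the indices below stay in range
        pos_integer = int(math.modf(pos)[1])
        pos_decimal = pos - pos_integer
        if pos_decimal == 0:
            return data[pos_integer - 1]
        return data[pos_integer - 1] + (data[pos_integer] - data[pos_integer - 1])*pos_decimal

    # The quartile summary lives in its own function; calling quantile_p
    # from inside itself, as the flattened listing did, would recurse forever.
    def quartiles(data, method=1):
        Q1 = quantile_p(data, 0.25, method)
        Q2 = quantile_p(data, 0.5, method)
        Q3 = quantile_p(data, 0.75, method)
        IQR = Q3 - Q1
        return Q1, Q2, Q3, IQR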

    Quantiles in pandas

    Calling the .quantile(p) method directly returns the quantile; by default it uses the method=2 position rule.

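  • qq_plot(y) displays a quantile-quantile plot of the sample quantiles of y against the theoretical quantiles of a normal distribution. If y is normally distributed, the plot will be close to linear. qq_plot(x,y) displays a quantile-quantile plot of two samples. If the samples come from...
  • With default settings, BINNED_PLOT(X,Y)... Options include, for example: plotting different quantiles; mean/variance instead of quantiles; changing the number of bins; and density-dependent coloring. Example: x=0:0.1:20; y = [sin(x); cos(x)] +randn(2,201); binned_plot(x,y)
  • MATLAB development: plotting the normal distribution of data with highlighted quantiles. Draws a histogram of the data matched to a normal distribution with the quantiles highlighted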
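  • [Data Mining] Quantile-Quantile Plots

    2019-04-17 18:58:56

    Put simply, a q-q plot shows two data sets in one chart. It is drawn the same way as a quantile plot, except that the x-axis becomes the other data set. The chart highlights where the two data sets differ and lets you see whether a shift has occurred.

    2.2.3 Graphic displays of basic statistical descriptions of data (1)

    In this section we study graphic displays of basic statistical descriptions, including quantile plots, quantile-quantile plots, histograms, and scatter plots. These graphs help in inspecting data visually and are useful for data preprocessing. The first three show univariate distributions (data for one attribute), while scatter plots show bivariate distributions (involving two attributes).

    1. Quantile plot

    Here and in the following subsections we cover common graphic displays of data distributions. A quantile plot is a simple and effective way to look at a univariate data distribution. First, it displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual occurrences). Second, it plots quantile information (see Section 2.2.2). For an ordinal or numeric attribute X, let x_i (i = 1, …, N) be the data sorted in increasing order, so that x_1 is the smallest observation and x_N is the largest. Each observation x_i is paired with a percentage f_i, indicating that approximately f_i × 100% of the data are below the value x_i. We say "approximately" because there may be no value with exactly a fraction f_i of the data below x_i. Note that the percentage 0.25 corresponds to quartile Q1, 0.50 to the median, and 0.75 to Q3.

    Let

    $f_i = \frac{i - 0.5}{N}$

    These numbers increase in equal steps of 1/N, from 1/(2N) (slightly above 0) to 1 - 1/(2N) (slightly below 1). On a quantile plot, x_i is graphed against f_i. This lets us compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other f_i values at a glance.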
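The pairing of each sorted observation with f_i = (i - 0.5)/N can be sketched in a few lines. The function name `quantile_plot_points` and the sample values are hypothetical and chosen only for illustration; they are not taken from Table 2.1.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for the sorted values."""
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

# Hypothetical unit prices, used only to show the shape of the output
unit_prices = [75, 40, 47, 43, 74]
points = quantile_plot_points(unit_prices)
for f, x in points:
    print(f"f = {f:.2f}  x = {x}")
```

Plotting x against f with any charting library then gives the quantile plot; the f values start at 1/(2N) and rise in steps of 1/N.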

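    Example 2.13 Quantile plot. Figure 2.4 shows a quantile plot for the unit price data of Table 2.1.

    [Table 2.1, on items sold by one AllElectronics department, is not reproduced here]

    Figure 2.4 Quantile plot for the unit price data of Table 2.1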

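    2. Quantile-quantile plot

    A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. It is a powerful visualization tool that lets the user see whether there is a shift in going from one distribution to the other.

    Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different departments. Let x_1, …, x_N be the data from the first department and y_1, …, y_M the data from the second, where each data set is sorted in increasing order. If M = N (the two sets have the same number of points), we simply plot y_i against x_i, where y_i and x_i are both the (i-0.5)/N quantiles of their respective data sets. If M < N (the second department has fewer observations than the first), there can be only M points on the q-q plot. Here, y_i is the (i-0.5)/M quantile of the y data, which is plotted against the (i-0.5)/M quantile of the x data. This computation typically involves interpolation.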
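The M < N case can be sketched as follows. The text does not specify the interpolation rule, so this sketch assumes linear interpolation between neighboring sorted values, with the i-th sorted value sitting at fraction (i - 0.5)/n. The names `quantile_at` and `qq_points` are hypothetical.

```python
def quantile_at(sorted_values, f):
    """Value at fraction f, assuming the i-th sorted value sits at (i - 0.5) / n."""
    n = len(sorted_values)
    pos = f * n + 0.5                    # fractional 1-based rank
    pos = min(max(pos, 1.0), float(n))   # clamp to the observed range
    lower = int(pos)                     # floor, since pos >= 1
    fraction = pos - lower
    if lower >= n:
        return sorted_values[-1]
    # linear interpolation between the two neighboring sorted values (assumed rule)
    return sorted_values[lower - 1] + (sorted_values[lower] - sorted_values[lower - 1]) * fraction

def qq_points(x, y):
    """Pair each value of the smaller sample y with the matching quantile of x."""
    xs, ys = sorted(x), sorted(y)
    if len(ys) > len(xs):
        raise ValueError("this sketch expects len(y) <= len(x)")
    m = len(ys)
    return [(quantile_at(xs, (i - 0.5) / m), ys[i - 1]) for i in range(1, m + 1)]

# N = 4 observations in x, M = 2 in y: only two points appear on the q-q plot
pairs = qq_points([1, 2, 3, 4], [10, 20])
print(pairs)
```

When both samples have the same size, the interpolation reduces (up to floating-point rounding) to pairing the i-th sorted values directly.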

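    Example 2.14 Quantile-quantile plot. Figure 2.5 shows a quantile-quantile plot for the unit price data of items sold by two different AllElectronics departments during a given time period. Each point corresponds to the same quantile in each data set and shows the unit price of items sold by department 1 versus department 2 at that quantile. (To aid comparison, the plot also includes a straight line representing the case where, for each given quantile, the unit price is the same in both departments. In addition, the darker points correspond to Q1, the median, and Q3.)

    Figure 2.5 Quantile-quantile plot for the unit price data of the two departments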
  • Quantile-Quantile (q-q) Plots

    Author(s): David Scott

    Prerequisites: Histograms, Distributions, Percentiles, Describing Bivariate Data, Normal Distributions

    Introduction

    The quantile-quantile or q-q plot is an exploratory graphical

    device used to check the validity of a distributional assumption

    for a data set. In general, the basic idea is to compute the

    theoretically expected value for each data point based on the

    distribution in question. If the data indeed follow the assumed

    distribution, then the points on the q-q plot will fall

    approximately on a straight line.

    Before delving into the details of q-q plots, we first describe two

    related graphical methods for assessing distributional assumptions:

    the histogram and the cumulative distribution function

    (CDF). As will be seen, q-q plots are more general than

    these alternatives.

    Assessing Distributional Assumptions

    As an example, consider data measured from a physical device such

    as the spinner depicted in Figure 1. The red arrow is spun around

    the center, and when the arrow stops spinning, the number between 0

    and 1 is recorded. Can we determine if the spinner is fair?


    Figure 1. A physical device that gives samples from a uniform

    distribution.

    If the spinner is fair, then these numbers should follow a uniform

    distribution. To investigate whether the spinner is fair, spin the

    arrow n times, and record the measurements by {μ_1, μ_2, ..., μ_n}. In this example, we collect n = 100 samples. The histogram provides a useful visualization of these

    data. In Figure 2, we display three different histograms on a

    probability scale. The histogram should be flat for a uniform

    sample, but the visual perception varies depending on whether the

    histogram has 10, 5, or 3 bins. The last histogram looks flat,

    but the other two histograms are not obviously flat. It is not

    clear which histogram we should base our conclusion on.


    Figure 2. Three histograms of a sample of 100 uniform points.

    Alternatively, we might use the cumulative distribution function (CDF), which is denoted by F(μ). The CDF gives the probability that the spinner gives a value less than or equal to μ, that is, the probability that the red arrow lands in the interval [0, μ]. By simple arithmetic, F(μ) = μ, which is the diagonal straight line y = x. The CDF based upon the sample data is called the empirical CDF (ECDF), is denoted by $\hat{F}_n(\mu)$, and is defined to be the fraction of the data less than or equal to μ; that is,

    $\hat{F}_n(\mu) = \frac{\#\{\mu_i \le \mu\}}{n}$

    In general, the ECDF takes on a ragged staircase

    appearance. For the spinner sample analyzed in Figure 2, we computed the ECDF

    and CDF, which are displayed in Figure 3. In the left frame, the

    ECDF appears close to the line y = x, shown in the middle frame. In

    the right frame, we overlay these two curves and verify that they

    are indeed quite close to each other. Observe that we do not need

    to specify the number of bins as with the histogram.
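A minimal sketch of the ECDF as defined above: the fraction of sample values less than or equal to a given point. The function name `ecdf` and the five sample readings are illustrative assumptions.

```python
def ecdf(sample, u):
    """Fraction of the sample that is less than or equal to u."""
    return sum(1 for value in sample if value <= u) / len(sample)

# Five illustrative spinner readings between 0 and 1
sample = [0.41, 0.24, 0.59, 0.03, 0.67]

for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    # For a fair spinner the theoretical CDF is F(u) = u
    print(f"u = {u:.2f}  ECDF = {ecdf(sample, u):.2f}  CDF = {u:.2f}")
```

Each sample value raises the ECDF by 1/n, which is what produces the staircase shape.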


    Figure 3. The empirical and theoretical cumulative distribution

    functions of a sample of 100 uniform points.

    q-q plot for uniform data

    The q-q plot for uniform data is very similar to the empirical CDF

    graphic, except with the axes reversed. The q-q plot provides a visual comparison of the sample

    quantiles to the corresponding theoretical quantiles. In

    general, if the points in a q-q plot depart from a straight line,

    then the assumed distribution is called into question.

    Here we define the qth quantile of a batch of n numbers as a number ξ_q such that a fraction q of the sample (q × n values) is less than ξ_q, while a fraction 1 - q of the sample ((1 - q) × n values) is greater than ξ_q. The best known quantile is the median, ξ_0.5, which is located in the middle of the sample.

    Consider a small sample of 5 numbers from the spinner: μ_1 = 0.41, μ_2 = 0.24, μ_3 = 0.59, μ_4 = 0.03, and μ_5 = 0.67.

    Based upon our description of the spinner, we expect a uniform

    distribution to model these data. If the sample data were

    “perfect,” then on average there would be an observation in the

    middle of each of the 5 intervals: 0 to .2, .2 to .4, .4 to .6, and

    so on. Table 1 shows the 5 data points (sorted in ascending order)

    and the theoretically expected value of each based on the

    assumption that the distribution is uniform (the middle of the

    interval).

    Table 1. Computing the Expected Quantile Values.

    Data (μ)   Rank (i)   Middle of the ith Interval
    .03        1          .1
    .24        2          .3
    .41        3          .5
    .59        4          .7
    .67        5          .9

    The theoretical and empirical CDFs are shown in Figure 4 and the q-q plot is shown in the left frame of Figure 5.


    Figure 4. The theoretical and empirical CDFs of a small sample of 5

    uniform points, together with the expected values of the 5 points

    (red dots in the right frame).

    In general, we consider the full set of sample quantiles to be the sorted data values

    μ_(1) < μ_(2) < μ_(3) < ··· < μ_(n-1) < μ_(n),

    where the parentheses in the subscript indicate the data have been

    ordered. Roughly speaking, we expect the first ordered value to be

    in the middle of the interval (0, 1/n), the second to be in the

    middle of the interval (1/n, 2/n), and the last to be in the middle

    of the interval ((n - 1)/n, 1). Thus, we take as the theoretical

    quantile the value

    $\xi_q = q = \frac{i - 0.5}{n},$

    where q corresponds to the ith ordered sample value. We

    subtract the quantity 0.5 so that we are exactly in the middle of

    the interval ((i - 1)/n, i/n). These ideas are depicted

    in the right frame of Figure 4 for our small sample of size n =

    5.

    We are now prepared to define the q-q plot precisely. First, we

    compute the n expected values of the data, which we pair with the n

    data points sorted in ascending order. For the uniform density,

    the q-q plot is composed of the n ordered pairs

    $\left(\frac{i - 0.5}{n},\ \mu_{(i)}\right)$ for i = 1, 2, ..., n.

    This definition is slightly different from the ECDF, which includes

    the points (μ_(i), i/n). In the left frame of Figure 5,

    we display the q-q plot of the 5 points in Table 1. In the right

    two frames of Figure 5, we display the q-q plot of the same batch

    of numbers used in Figure 2. In the final frame, we add the

    diagonal line y = x as a point of reference.


    Figure 5. (Left) q-q plot of the 5 uniform points. (Right) q-q plot

    of a sample of 100 uniform points.

    The sample size should be taken into account when judging how close

    the q-q plot is to the straight line. We show two other uniform

    samples of size n = 10 and n = 1000 in Figure 6. Observe that the

    q-q plot when n = 1000 is almost identical to the line y = x, while

    such is not the case when the sample size is only n = 10.


    Figure 6. q-q plots of a sample of 10 and 1000 uniform points.

    In Figure 7, we show the q-q plots of two random samples that are

    not uniform. In both examples, the sample quantiles match the

    theoretical quantiles only at the median and at the extremes.

    Both samples seem to be symmetric around

    the median. But the data in the left frame are closer to the median

    than would be expected if the data were uniform. The data in the

    right frame are further from the median than would be expected if

    the data were uniform.


    Figure 7. q-q plots of two samples of size 1000 that are not

    uniform.

    In fact, the data were generated in the R language from beta

    distributions with parameters a = b = 3 on the left and a = b =0.4

    on the right. In Figure 8 we display histograms of these two data

    sets, which serve to clarify the true shapes of the densities.

    These are clearly non-uniform.


    Figure 8. Histograms of the two non-uniform data sets.

    q-q plot for normal data

    The definition of the q-q plot may be extended to any continuous

    density. The q-q plot will be close to a straight line if the

    assumed density is correct. Because the cumulative distribution

    function of the uniform density was a straight line, the q-q plot

    was very easy to construct. For data that is not uniform, the

    theoretical quantiles must be computed in a different manner.

    Let {z_1, z_2, ..., z_n} denote a random sample from a normal distribution with mean μ = 0 and standard deviation σ = 1. Let the ordered values be denoted by

    z_(1) < z_(2) < z_(3) < ... < z_(n-1) < z_(n).

    These n ordered values will play the role of the sample quantiles.

    Let us consider a sample of 5 values from a distribution to see how

    they compare with what would be expected for a normal distribution.

    The 5 values in ascending order are shown in the first column of

    Table 2.

    Table 2. Computing the expected quantile values for normal data.

    Data (z)   Rank (i)   Middle of the ith Interval   Normal (z)
    -1.96      1          .1                           -1.28
    -.78       2          .3                           -0.52
    .31        3          .5                           0.00
    1.15       4          .7                           0.52
    1.62       5          .9                           1.28

    Just as in the case of the uniform distribution, we have 5

    intervals. However, with a normal distribution the theoretical

    quantile is not the middle of the interval but rather the inverse

    of the normal distribution for the middle of the interval. Taking

    the first interval as an example, we want to know the z value such

    that 0.1 of the area in the normal distribution is below z. This

    can be computed using the Inverse Normal Calculator as shown in

    Figure 9. Simply set the “Shaded Area” field to the middle of the

    interval (0.1) and click on the “Below” button. The result is

    -1.28. Therefore, 10% of the distribution is below a z value of

    -1.28.


    Figure 9. Example of the Inverse Normal Calculator for finding a

    value of the expected quantile from a normal distribution.

    The q-q plot for the data in Table 2 is shown in the left frame of

    Figure 11.

    In general, what should we take as the corresponding theoretical

    quantiles? Let the cumulative distribution function of the normal

    density be denoted by Φ(z). In the previous example, Φ(-1.28) =

    0.10 and Φ(0.00) = 0.50. Using the quantile notation, if ξ_q is the qth quantile of a normal distribution, then

    Φ(ξ_q) = q.

    That is, the probability a normal sample is less than ξ_q is in fact just q.

    Consider the first ordered value, z(1). What might we

    expect the value of Φ(z(1)) to be? Intuitively, we

    expect this probability to take on a value in the interval (0,

    1/n). Likewise, we expect Φ(z(2)) to take on a value in

    the interval (1/n, 2/n). Continuing, we expect Φ(z(n))

    to fall in the interval ((n - 1)/n, 1). Thus, the theoretical

    quantile we desire is defined by the inverse (not reciprocal) of

    the normal CDF. In particular, the theoretical quantile

    corresponding to the empirical quantile z_(i) should be

    $\Phi^{-1}\left(\frac{i - 0.5}{n}\right)$ for i = 1, 2, ..., n.

    The empirical CDF and theoretical quantile construction for the

    small sample given in Table 2 are displayed in Figure 10. For the

    larger sample of size 100, the first few expected quantiles are

    -2.576, -2.170, and -1.960.
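The expected normal quantiles can be computed with the standard library. This is a sketch assuming Python 3.8 or later, where `statistics.NormalDist.inv_cdf` provides the inverse of the normal CDF; the function name `normal_theoretical_quantiles` is hypothetical.

```python
from statistics import NormalDist

def normal_theoretical_quantiles(n):
    """Inverse standard normal CDF at (i - 0.5) / n for i = 1, ..., n."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# n = 5 corresponds to the Normal (z) column of Table 2
print([round(q, 2) for q in normal_theoretical_quantiles(5)])

# First three expected quantiles for a sample of size 100
print([round(q, 3) for q in normal_theoretical_quantiles(100)[:3]])
```

Up to rounding, the first line should agree with Table 2 and the second with the three values quoted for the sample of size 100.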


    Figure 10. The empirical CDF of a small sample of 5 normal points,

    together with the expected values of the 5 points (red dots in the

    right frame).

    In the left frame of Figure 11, we display the q-q plot of the

    small normal sample given in Table 2. The remaining frames in

    Figure 11 display the q-q plots of normal random samples of size n

    = 100 and n = 1000. As the sample size increases, the points in the

    q-q plots lie closer to the line y = x.


    Figure 11. q-q plots of normal data.

    As before, a normal q-q plot can indicate departures from

    normality. The two most common examples are skewed data and data

    with heavy tails (large kurtosis). In Figure 12, we show normal q-q

    plots for a chi-squared (skewed) data set and a Student’s-t

    (kurtotic) data set, both of size n = 1000. The data were first

    standardized. The red line is again y = x. Notice, in particular,

    that the data from the t distribution follow the normal curve

    fairly closely until the last dozen or so points on each

    extreme.


    Figure 12. q-q plots for standardized non-normal data (n =

    1000).

    q-q plots for normal data with general mean and scale

    Our previous discussion of q-q plots for normal data all assumed

    that our data were standardized. One approach to constructing q-q

    plots is to first standardize the data and then proceed as

    described previously. An alternative is to construct the plot

    directly from raw data.

    In this section, we present a general approach for data that are

    not standardized. Why did we standardize the data in Figure 12? The

    q-q plot is comprised of the n points

    $\left(\Phi^{-1}\left(\frac{i - 0.5}{n}\right),\ z_{(i)}\right)$ for i = 1, 2, ..., n.

    If the original data {zi} are normal, but have an

    arbitrary mean μ and standard deviation σ, then the line y = x will

    not match the expected theoretical quantiles. Clearly, the linear

    transformation

    μ + σ ξ_q

    would provide the qth theoretical quantile on the transformed scale. In practice, with a new data set {x_1, x_2, ..., x_n}, the normal q-q plot would consist of the n points

    $\left(\Phi^{-1}\left(\frac{i - 0.5}{n}\right),\ x_{(i)}\right)$ for i = 1, 2, ..., n.

    Instead of plotting the line y = x as a reference line, the

    line

    y = M + s · x

    should be composed, where M and s are the sample moments (mean and

    standard deviation) corresponding to the theoretical moments μ and

    σ. Alternatively, if the data are standardized, then the line y = x

    would be appropriate, since now the sample mean would be 0 and the

    sample standard deviation would be 1.
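The construction for raw (unstandardized) data can be sketched as below. It pairs the standard normal quantiles with the sorted data and builds the reference line y = M + s · x from the sample mean and standard deviation. The function name is hypothetical, and using `statistics.stdev` (the n - 1 divisor) for s is an assumption, since the text does not say which divisor is meant.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_with_reference(data):
    """Return q-q points, reference-line points, and the sample mean and std."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    m = mean(xs)
    s = stdev(xs)  # sample standard deviation with the n - 1 divisor (assumed)
    points = list(zip(theoretical, xs))                  # (theoretical, observed)
    reference = [(q, m + s * q) for q in theoretical]    # the line y = M + s * x
    return points, reference, m, s

points, reference, m, s = normal_qq_with_reference([6.0, 2.0, 4.0])
print("M =", m, "s =", s)
print(points)
print(reference)
```

If the data are roughly normal, the observed points should fall near the reference points at the same x positions.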

    Example: SAT Case Study

    The SAT case study followed the academic achievements of 105

    college students majoring in computer science. The first variable

    is their verbal SAT score and the second is their grade point

    average (GPA) at the university level. Before we compute

    inferential statistics using these variables, we should check if

    their distributions are normal. In Figure 13, we display the q-q

    plots of the verbal SAT and university GPA variables.


    Figure 13. q-q plots for the student data (n = 105).

    The verbal SAT seems to follow a normal distribution reasonably

    well, except in the extreme tails. However, the university GPA

    variable is highly non-normal. Compare the GPA q-q plot to the

    simulation in the right frame of Figure 7. These figures are very

    similar, except for the region where x ≈ -1. To follow these ideas,

    we computed histograms of the variables and their scatter diagram

    in Figure 14. These figures tell quite a different story. The

    university GPA is bimodal, with about 20% of the students falling

    into a separate cluster with a grade of C. The scatter diagram is

    quite unusual. While the students in this cluster all have below

    average verbal SAT scores, there are as many students with low SAT

    scores whose GPAs were quite respectable. We might speculate as to

    the cause(s): different distractions, different study habits, but

    it would only be speculation. But observe that the raw correlation

    between verbal SAT and GPA is a rather high 0.65, but when we

    exclude the cluster, the correlation for the remaining 86 students

    falls a little to 0.59.


    Figure 14. Histograms and scatter diagram of the verbal SAT and GPA

    variables for the 105 students.

    Discussion

    Parametric modeling usually involves making assumptions about the

    shape of data, or the shape of residuals from a regression fit.

    Verifying such assumptions can take many forms, but an exploration

    of the shape using histograms and q-q plots is very effective. The

    q-q plot does not have any design parameters such as the number of

    bins for a histogram.

    In an advanced treatment, the q-q plot can be used to formally test

    the null hypothesis that the data are normal. This is done by

    computing the correlation coefficient of the n points in the q-q

    plot. Depending upon n, the null hypothesis is rejected if the

    correlation coefficient is less than a threshold. The threshold is

    already quite close to 0.95 for modest sample sizes.
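The correlation coefficient of the q-q points can be computed as sketched below. The function name `qq_correlation` is hypothetical, and the rejection threshold is deliberately left out because it depends on n and has to come from published critical values. The data must not be constant, or the denominator is zero.

```python
from math import sqrt
from statistics import NormalDist

def qq_correlation(data):
    """Pearson correlation between normal theoretical quantiles and the sorted data."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    qs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mean_q = sum(qs) / n
    mean_x = sum(xs) / n
    cov = sum((q - mean_q) * (x - mean_x) for q, x in zip(qs, xs))
    var_q = sum((q - mean_q) ** 2 for q in qs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / sqrt(var_q * var_x)

# Data lying exactly on a line in the theoretical quantiles give r close to 1
ideal = [3 + 2 * NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
print(qq_correlation(ideal))

# A strongly skewed batch gives a noticeably smaller r
print(qq_correlation([1, 1, 1, 1, 100]))
```

A value near 1 means the q-q points are nearly collinear; how far below 1 counts as evidence against normality depends on the sample size.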

    We have seen that the q-q plot for uniform data is very closely

    related to the empirical cumulative distribution function. For

    general density functions, the so-called probability integral

    transform takes a random variable X and maps it to the interval (0,

    1) through the CDF of X itself, that is,

    Y = F_X(X)

    which has been shown to be a uniform density. This explains why the

    q-q plot on standardized data is always close to the line y = x

    when the model is correct. Finally, scientists have used special graph paper for years to make

    relationships linear (straight lines). The most common example used

    to be semi-log paper, on which points following the formula $y = a e^{bx}$ appear linear. This

    follows of course since log(y) = log(a) + bx, which is the equation

    for a straight line. The q-q plots may be thought of as being

    “probability graph paper” that makes a plot of the ordered data

    values into a straight line. Every density has its own special

    probability graph paper.

  • Plotting method 5: quantile-quantile plots

    2018-05-24 11:05:32
    ## Use q-q plots to check whether the returns follow a normal distribution
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # log_returns is assumed to be a pandas DataFrame of log returns with at least six columns
    fig, axes = plt.subplots(3,2,figsize=(10,12))
    for i in range(0,3):
        for j in range(0,2):
            sm.qqplot(log_returns.iloc[:,2*i+j].dropna(),line='s',ax=axes[i,j])
            axes[i,j].set_title(log_returns.columns[2*i+j])
            axes[i,j].set_xlabel('Theoretical quantiles')
            axes[i,j].set_ylabel('Sample quantiles')
    plt.subplots_adjust(wspace=0.3,hspace=0.4)  # adjust the spacing between subplots
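  • Quantiles and Q-Q plots

    2019-12-04 17:54:35
    Sample Quantiles: quantile(x, ...). Given a series $x$, you can compute the quantile corresponding to a given cumulative probability $p$. There are 9 methods for computing quantiles. For method $i$ ($1 \le i \le 9$), the formula for probability p is: Q(p)=(1...
  • [Data Mining]: Quantile-Quantile Plots

    2016-06-30 15:27:41
    In this section we study graphic displays of basic statistical descriptions, including quantile plots, quantile-quantile plots, histograms, and scatter plots. These graphs help in inspecting data visually and are useful for data preprocessing. The first three show univariate distributions (data for one attribute), while scatter plots show bivariate...
  • Q-Q (quantile-quantile) plots explained

    2019-10-12 21:17:54
    1. Definition: ... the quantile $Q_i = \frac{x_i - mean(x)}{\delta}$, which in essence is the number of units by which a value deviates from the mean. 2. Construction: 3. Interpretation: if the points lie on a single line, the sample distribution and the theoretical...
  • item (1, item, 0.5) x (rnorm(inds, 1, 1), rnorm(item - inds, 8, 1)) data ("value" = x, "class" = rep(1, length(x))) Plot the density function and add quantile lines # plotting p1 (data, aes(x = value, y = class, fill = ...
  • Overview: a library is needed only for reading the xlsx file; nothing else requires one. Reading the data: library(xlsx) # Hydrocarbon mydata = read.xlsx('D:/Code/R/Data in Excel/Chapter 10/beeswax.xls',1) ... Histogram: hist(mydata[, 2]...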
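  • Quantile Regression

    2019-07-09 22:25:16
    Before introducing quantile regression, let us revisit regression analysis. We previously covered linear regression, polynomial regression, kernel regression, and so on. Essentially, each assumes a function and fits it to the training data as closely as possible to determine its unknown parameters. Fitting the training data as closely as possible is generally done by...
  • Quantile regression and its Python source code

    2020-12-06 14:51:03
    Quantile regression and its Python source code. The sky is clear and the breeze is gentle. Idle at home, it is a good time to read. I could not make sense of earlier write-ups, and the open-source code has no comments. Since nobody else will do it, I will. Enough preamble, straight to the topic. o( ̄︶ ̄)o We want to explore the relationship between the independent variable and the dependent variable; the simplest method...
  • Quantile regression for multiple linear models

    2021-04-11 10:26:42
    Quantile regression study notes. 1. Why use quantile regression? 2. The basic quantile regression model. 3. Quantile regression estimation: the simplex method. 1. Loss function 2. Objective function 3. Algorithm derivation 4. Case study and Python code. 1. Why use quantile regression? ...
  • In the figure, the dashed lines are quantile regression lines and the red line is the ordinary least squares (OLS) regression line. We can observe three things: as income rises, food consumption rises too. As income rises, the gap in food consumption between households widens. The poor have no choice, while the rich can choose their lifestyle,...
  • Quantile regression with R

    2019-10-16 17:21:39
    Quantile regression. Quantile regression is a modeling method that estimates the linear relationship between a set of regressors X and the quantiles of the response variable Y. Conventional regression models actually study the conditional expectation of the response variable. But people also care about how the explanatory variables relate to the median and other quantiles of the response distribution. It...
  • Quantile regression

    2020-03-10 00:24:43
    A quantile, also called a quantile point, is a numerical point that divides the range of a random variable's probability distribution into equal parts. Common ones are the median (the 2-quantile), quartiles, and percentiles.
  • Quantile algorithms

    2022-03-08 16:28:33
    1 Principle and calculation of the p-quantile_juliarjuliar's blog-CSDN blog_quantile 2 What is the 80th percentile in performance metrics? - Juejin. I thought it would be simple, but it turned out to be that complicated.....
  • Quantile regression (Stata)

    2020-01-08 20:53:30
    Data analysis of Chengdu's air quality index based on quantile regression. The air quality index is calculated as: (1) A linear regression model yields a conditional mean and does not account for the overall distribution of the dependent variable; when information about the dependent variable at a given location (quantile) is needed, linear regression...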
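  • A detailed look at quantiles and box plots

    2020-11-18 20:32:13
    While reading a paper recently I came across box plots, which I had not seen before, so I looked them up and found that they are closely tied to quantiles. I then studied quantiles as well and organized what I learned as follows: Quantiles. First, the concept of a quantile. Baidu's definition...
  • Implementing MATLAB quantiles in C

    2019-05-20 15:02:32
    MATLAB's linear-interpolation quantile algorithm is as follows: The C implementation is as follows: //d: input data, len_d: length of input data, rate: quantile point float quantile(float* d, int len_d, float rate) { float dfg2 = 0; float x[M];//array length M for (int i = 1...
  • Drawing histograms and quantile plots with Qt

    2017-05-15 13:36:37
    painter.drawText(rect(), Qt::AlignHCenter, "Histograms and quantile plots of Class A and Class B"); } mainwindow.ui <class>MainWindow <x>0 <y>0 <width>1000 <height>700 <width>1000 <height>...
  • Quantile regression with R

    2017-12-18 17:45:21
    Quantile regression. In this installment we discuss quantile regression. Most readers see conventional regression often; quantile regression is probably less familiar. The method is actually quite old, dating to around 1978, but the theory was not yet complete at that time. ...
  • Contents 1. Introduction 2.... 4.1 Simple conditional quantile regression with `qreg` 4.2 Estimation with `bsqreg` 4.3 Estimation with `qrprocess` 4.4 Estimation with `qreg2` 4.5 Estimation with `ivqreg2` 4.6 Using `xtqreq`
  • Quantiles, upper quantiles, and a Python implementation

    2018-10-03 19:00:26
    Quantile. Definition: for a random variable X and a given $\alpha$ ($0 < \alpha < 1$), if there exists
  • Then the losses of the portfolio for a 100-million-dollar investment are... Sort the losses from largest to smallest and take the 5th (1%), or sort from smallest to largest and take the 496th. You can also plot the loss distribution under simple historical simulation, which is more intuitive visually. Improved historical simulation: 1. Time weighting. In the historical simulation above, each...
  • t-distribution and normal-distribution quantile plots in R

    2016-01-02 10:46:27
    # quantile plot: t-distribution density with p-value x =se q(-6,6,length=1000) ; y =dt( x , 19 ) r1=- 6 ; r2=- 2.89 ; x2=c(r1,r1, x [ x x >r1],r2,r2) y2=c( 0 ,dt(c(r1, x [ x x >r1],r2), 19 ), 0 ) plot( x , y ,type= "l" ,...
  • Study notes on quantile regression models

    2018-03-09 11:10:54
    The first paper my advisor gave me as a master's student was on quantile regression. At the time I thought the model was simple, and I quickly wrote an example from R's sample files. In later research, though, I increasingly felt that the model is not as simple as I had thought and that it has very rich content...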
