
Python credit scorecard (code included, recorded by the blogger) https://etav.github.io/python/vif_factor_python.html

Colinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect colinearity among variables, create a correlation matrix and look for pairs of variables whose correlation has a large absolute value. In R use the cor function; in Python this can be accomplished with numpy's corrcoef function.
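As a rough illustration of the correlation-matrix check described above, here is a minimal sketch. The data is synthetic, and the `high_corr_pairs` helper and its 0.8 threshold are assumptions made for this example, not part of the original post.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed for illustration): b is a noisy copy of a, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "c": c})

# np.corrcoef treats rows as variables by default, so pass rowvar=False
# when each column is a variable.
corr = pd.DataFrame(
    np.corrcoef(demo.values, rowvar=False),
    index=demo.columns,
    columns=demo.columns,
)

def high_corr_pairs(corr, threshold=0.8):
    """Return (var1, var2, r) for every pair whose |r| exceeds the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

print(high_corr_pairs(corr))
```

Only the upper triangle is scanned, so each pair is reported once and the diagonal (always 1) is skipped.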

Multicolinearity, on the other hand, is more troublesome to detect because it emerges when three or more variables, which are highly correlated, are included within a model. To make matters worse, multicolinearity can emerge even when isolated pairs of variables are not colinear.
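The last point can be seen on a small synthetic example. In this sketch (all data and variable names are assumed for illustration), `x4` is almost exactly the sum of three independent predictors: no single pairwise correlation looks alarming, yet `x4` is almost perfectly explained by the other three together.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1, x2, x3 = rng.normal(size=(3, n))
# x4 is almost exactly the sum of the other three predictors.
x4 = x1 + x2 + x3 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3, x4])

# Largest absolute pairwise correlation, ignoring the diagonal.
corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr[~np.eye(4, dtype=bool)]).max()

# R^2 from regressing x4 on x1..x3 (with an intercept).
A = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(A, x4, rcond=None)
resid = x4 - A @ coef
r_squared = 1 - resid.var() / x4.var()

print(f"largest pairwise |r|: {max_pairwise:.2f}")
print(f"R^2 of x4 on the rest: {r_squared:.4f}")
```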

A common R function for testing regression assumptions, and multicolinearity specifically, is "VIF()". Unlike many statistical concepts, its formula is straightforward:

$$V.I.F. = 1 / (1 - R^2).$$

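The Variance Inflation Factor (VIF) is a measure of colinearity among predictor variables within a multiple regression. It is calculated as the ratio of the variance of a given predictor's beta in the full model to the variance of that beta if the predictor were fit alone. In the formula above, $R^2$ is the coefficient of determination obtained by regressing that predictor on all the other predictors.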
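To make the formula concrete, the sketch below computes each VIF directly as 1 / (1 - R^2) using only numpy. The function name `vif_from_r2` and the synthetic data are assumptions for illustration; this is not the statsmodels implementation used later in the post.

```python
import numpy as np

def vif_from_r2(X):
    """VIF for each column of X via 1 / (1 - R^2).

    R^2 comes from regressing that column on all the other columns
    plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(2)
z = rng.normal(size=(1000, 3))
# Make the third column depend on the first two.
z[:, 2] = z[:, 0] + 0.5 * z[:, 1] + rng.normal(scale=0.3, size=1000)

vifs = vif_from_r2(z)
# Cross-check: the same values sit on the diagonal of the inverse correlation matrix.
vifs_inv = np.diag(np.linalg.inv(np.corrcoef(z, rowvar=False)))
print(vifs.round(2), vifs_inv.round(2))
```

The cross-check relies on a standard identity: the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix.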

Steps for Implementing VIF

1. Run a multiple regression.
2. Calculate the VIF factors.
3. Inspect the factors for each predictor variable. If a VIF is between 5 and 10, multicolinearity is likely present and you should consider dropping the variable.
#Imports
import pandas as pd
import numpy as np
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv('loan.csv')
# A bare df.dropna() has no effect unless its result is assigned;
# missing values are dropped later, after subsetting the columns.
df = df.select_dtypes(include='number') #drop non-numeric cols
df.head()
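| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000.0 | 5000.0 | 4975.0 | 10.65 | 162.87 | 24000.0 | 27.65 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1077430 | 1314167 | 2500.0 | 2500.0 | 2500.0 | 15.27 | 59.83 | 30000.0 | 1.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1077175 | 1313524 | 2400.0 | 2400.0 | 2400.0 | 15.96 | 84.33 | 12252.0 | 8.72 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1076863 | 1277178 | 10000.0 | 10000.0 | 10000.0 | 13.49 | 339.31 | 49200.0 | 20.00 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1075358 | 1311748 | 3000.0 | 3000.0 | 3000.0 | 12.69 | 67.79 | 80000.0 | 17.94 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |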

5 rows × 51 columns

df = df[['annual_inc', 'loan_amnt', 'funded_amnt', 'dti']].dropna() #subset the dataframe

Step 1: Run a multiple regression

%%capture
#gather features
# Index.difference replaces the old `df.columns - [...]` syntax, which no longer works
features = "+".join(df.columns.difference(["annual_inc"]))

# get y and X dataframes based on this regression:
y, X = dmatrices('annual_inc ~' + features, df, return_type='dataframe')

Step 2: Calculate VIF Factors

# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
# X.shape is a (rows, columns) tuple, so iterate over the column count
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

Step 3: Inspect VIF Factors

vif.round(1)

| | VIF Factor | features |
|---|---|---|
| 0 | 5.1 | Intercept |
| 1 | 1.0 | dti |
| 2 | 678.4 | funded_amnt |
| 3 | 678.4 | loan_amnt |

As expected, the total funded amount for the loan and the amount of the loan have a high variance inflation factor because they "explain" the same variance within this dataset. We would need to discard one of these variables before moving on to model building or risk building a model with high multicolinearity.
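One way to act on this is to drop the highest-VIF predictor repeatedly until every VIF falls below a cutoff. The sketch below is a minimal illustration, not the post's method. The helper names `vif_table` and `drop_high_vif`, the cutoff of 5, and the synthetic stand-in for the loan data are all assumptions. It uses numpy only, so the VIFs come from 1 / (1 - R^2) rather than from statsmodels.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF per column, computed as 1 / (1 - R^2) with an intercept."""
    out = {}
    for col in df.columns:
        y = df[col].to_numpy(dtype=float)
        others = df.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(df)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(out)

def drop_high_vif(df, threshold=5.0):
    """Drop the highest-VIF column until every VIF is at or below the threshold."""
    df = df.copy()
    while df.shape[1] > 1:
        vifs = vif_table(df)
        if vifs.max() <= threshold:
            break
        df = df.drop(columns=vifs.idxmax())
    return df

# Synthetic stand-in for the loan data (values are made up).
rng = np.random.default_rng(3)
loan_amnt = rng.uniform(1000, 35000, size=1000)
toy = pd.DataFrame({
    "loan_amnt": loan_amnt,
    "funded_amnt": loan_amnt + rng.normal(scale=200, size=1000),
    "dti": rng.uniform(0, 40, size=1000),
})
print(list(drop_high_vif(toy).columns))
```

Which of the two near-duplicate columns gets dropped depends only on which has the marginally larger VIF. In practice that choice is better made on domain grounds.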



var

Variance

Syntax

y = var(X)

y = var(X,1)

y = var(X,W)

y = var(X,W,DIM)

Arguments

X: Financial time series object.

W: Weight vector used in calculating variance.

DIM: Dimension of X used in calculating variance.

Description

var supports financial time series objects based on the MATLAB® var function. See var.

y = var(X), if X is a financial time series object, returns the variance of each series. var normalizes y by N - 1 if N > 1, where N is the sample size. This is an unbiased estimator of the variance of the population from which X is drawn, as long as X consists of independent, identically distributed samples. For N = 1, y is normalized by N.

y = var(X,1) normalizes by N and produces the second moment of the sample about its mean. var(X, 0) is the same as var(X).

y = var(X,W) computes the variance using the weight vector W. The length of W must equal the length of the dimension over which var operates, and its elements must be nonnegative. var normalizes W to sum to 1. Use a value of 0 for W to use the default normalization by N - 1, or use a value of 1 to use N.

y = var(X,W,DIM) takes the variance along the dimension DIM of X.
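For readers working in Python rather than MATLAB, the three normalizations described above can be sketched with numpy. This is an analogue under assumptions, not the MATLAB implementation: `weighted_var` is a hypothetical helper written for this example, and numpy's `ddof` argument plays the role of the 0/1 flag.

```python
import numpy as np

def weighted_var(x, w, axis=0):
    """Variance of x along `axis` using nonnegative weights w, normalized to sum to 1."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if w.ndim != 1 or w.shape[0] != x.shape[axis]:
        raise ValueError("len(w) must match the size of x along the chosen axis")
    if (w < 0).any():
        raise ValueError("weights must be nonnegative")
    w = w / w.sum()
    mean = np.average(x, axis=axis, weights=w)
    dev2 = (x - np.expand_dims(mean, axis)) ** 2
    return np.average(dev2, axis=axis, weights=w)

sample = np.array([2.0, 4.0, 4.0, 6.0])

var_unbiased = np.var(sample, ddof=1)               # normalize by N - 1
var_second_moment = np.var(sample, ddof=0)          # normalize by N
var_equal_w = weighted_var(sample, [1, 1, 1, 1])    # equal weights match ddof=0
var_ends_only = weighted_var(sample, [1, 0, 0, 1])  # only the two end points count

print(var_unbiased, var_second_moment, var_equal_w, var_ends_only)
```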

Examples

The variance is the square of the standard deviation. Consider if

f = fints((today:today+1)', [4 -2 1; 9 5 7])

Warning: FINTS will be removed in a future release. Use TIMETABLE instead.
> In fints (line 165)
Warning: FINTS will be removed in a future release. Use TIMETABLE instead.
> In fints/display (line 66)

f =

    desc:  (none)
    freq:  Unknown (0)

    'dates:  (2)'    'series1:  (2)'    'series2:  (2)'    'series3:  (2)'
    '02-Oct-2017'    [ 4]             [ -2]            [ 1]
    '03-Oct-2017'    [ 9]             [ 5]             [ 7]

then

var(f, 0, 1)

is

Warning: FINTS will be removed in a future release. Use TIMETABLE instead.
> In fints/var (line 49)

[12.5 24.5 18.0]

and

var(f, 0, 2)

is

Warning: FINTS will be removed in a future release. Use TIMETABLE instead.
> In fints/var (line 49)

[9.0; 4.0]
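The two results above can be reproduced outside MATLAB as a sanity check. This numpy sketch assumes the rows are dates and the columns are series, so MATLAB's DIM 1 corresponds to numpy's `axis=0` and DIM 2 to `axis=1`.

```python
import numpy as np

# Same values as the fints example: rows are dates, columns are series.
values = np.array([[4.0, -2.0, 1.0],
                   [9.0, 5.0, 7.0]])

# Counterpart of var(f, 0, 1): N - 1 normalization down each column.
per_series = np.var(values, axis=0, ddof=1)
# Counterpart of var(f, 0, 2): N - 1 normalization across each row.
per_date = np.var(values, axis=1, ddof=1)

print(per_series)
print(per_date)
```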


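Bias describes how far the expected value deviates from the true value, so high bias indicates underfitting.
Variance describes the deviation from the expected value that any particular sample of the data can cause, so high variance indicates overfitting.
The calculation of bias and variance is introduced below.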

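Bias

The bias of an estimator is defined as $$\mathrm{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta.$$ If $\mathrm{bias}(\hat{\theta}_m) = 0$, the estimator is said to be unbiased.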

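Bias calculation for the Bernoulli distribution:
Assume the mean of the distribution is $\theta$. Then for each sample $x^{(i)}$ the distribution function is $P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{1 - x^{(i)}}$. The calculation (given as a figure in the original) yields a bias of 0, so the estimator $\hat{\theta}$ is unbiased.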
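Since the original derivation figure is not available, here is a sketch of how the calculation usually goes. It assumes the estimator is the sample mean of $m$ independent samples, $\hat{\theta}_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$; the notation is an assumption rather than a reproduction of the figure.

$$
\begin{aligned}
\mathrm{bias}(\hat{\theta}_m) &= \mathbb{E}[\hat{\theta}_m] - \theta \\
&= \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right] - \theta \\
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}\left[x^{(i)}\right] - \theta \\
&= \frac{1}{m} \cdot m\theta - \theta = 0
\end{aligned}
$$

The third line uses linearity of expectation, and the fourth uses $\mathbb{E}[x^{(i)}] = \theta$ for a Bernoulli variable.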

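Gaussian distribution

Assume the samples follow a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. The bias of the estimator of the mean (computed in a figure in the original) is zero, so the estimator of the Gaussian mean is unbiased. The estimator of the variance (also computed in a figure in the original) has nonzero bias, so computed this way the Gaussian variance estimator is biased. It can be made unbiased by setting $$\tilde{\sigma}^2_m = \frac{1}{m - 1}\sum_{i=1}^{m}\left(x^{(i)} - \hat{\mu}_m\right)^2.$$

Variance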

The variance computed by numpy is the sample variance itself, with the formula $$\sigma^2 = \frac{\sum\limits_{i=1}^{N}(x_i - \overline{x})^2}{N}$$ whereas the variance computed by pandas is the unbiased sample variance, with the formula ...
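The difference between the two defaults is easy to check directly. A minimal sketch on a made-up five-point sample: `np.var` defaults to `ddof=0` (divide by N) and `pandas.Series.var` defaults to `ddof=1` (divide by N - 1).

```python
import numpy as np
import pandas as pd

x = [1.0, 2.0, 3.0, 4.0, 5.0]

np_var = np.var(x)            # ddof=0 by default: divides by N
pd_var = pd.Series(x).var()   # ddof=1 by default: divides by N - 1

# Either library can reproduce the other's result by setting ddof explicitly.
np_var_unbiased = np.var(x, ddof=1)
pd_var_population = pd.Series(x).var(ddof=0)

print(np_var, pd_var, np_var_unbiased, pd_var_population)
```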
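Variance: the variance is the distance between the mean of a set of data and any point in the data. Variation: the difference between the normally expected result and the observed result ...
One point to mention here: the PCA attribute explained_variance_ratio_ gives the share of variance contributed by each component, and these shares sum to 1 when all components are kept; explained_variance_ gives the variance values themselves. Used sensibly, these two attributes let you plot the variance-ratio curve or the variance curve, which makes it easier to find the best number of dimensions for PCA.
The NFL (No Free Lunch) theorem tells us that the choice of algorithm should match the specific problem. We usually judge an algorithm by its generalization performance, but we lack an understanding of why an algorithm is good or bad. The bias-variance decomposition approaches this from the bias ...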
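To show what these two quantities are, the sketch below computes their equivalents by hand with numpy instead of calling scikit-learn. The synthetic data, the variable names, and the 95% cutoff are assumptions for illustration. The eigenvalues of the sample covariance matrix play the role of `explained_variance_`, and their normalized values play the role of `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data (assumed): 5 features driven mostly by 2 latent factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + rng.normal(scale=0.1, size=(500, 5))

centered = data - data.mean(axis=0)
# Eigenvalues of the sample covariance matrix are the per-component variances;
# eigvalsh returns them in ascending order, so reverse to get largest first.
explained_variance = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
explained_variance_ratio = explained_variance / explained_variance.sum()

cumulative = np.cumsum(explained_variance_ratio)
# Smallest number of components whose cumulative ratio reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

print(explained_variance_ratio.round(3), n_components)
```

Plotting `cumulative` against the component index gives the variance-ratio curve mentioned above. The point where it flattens is a common heuristic for choosing the reduced dimension.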
