Building a Factor Model with Machine Learning
With the surge in e-commerce and digital transactions, identity fraud has also risen and now affects millions of people every year. In 2019, fraud losses in the US alone were estimated at around US$16.9 billion, a substantial portion of which came from credit card fraud¹.
In addition to strengthening cybersecurity measures, financial institutions are increasingly turning to machine learning to identify and reject fraudulent transactions when they happen, so as to limit losses.
I came across a credit card fraud dataset on Kaggle and built a classification model to predict fraudulent transactions. In this article, I will walk through the five steps of building a supervised machine learning model. Below is an outline of the five steps:
- Exploratory Data Analysis
- Train-Test Split
- Modeling
- Hyperparameter Tuning
- Evaluating Final Model Performance
I. Exploratory Data Analysis (EDA)
When starting a new modeling project, it is important to start with EDA in order to understand the dataset. In this case, the credit card fraud dataset from Kaggle contains 284,807 rows with 31 columns. This particular dataset contains no nulls, but note that this may not be the case when dealing with datasets in reality.
Our target variable is named class, and it is a binary output of 0’s and 1’s, with 1’s representing fraudulent transactions and 0’s representing non-fraudulent ones. The remaining 30 columns are features that we will use to train our model. The vast majority of them have been transformed using PCA and are thus anonymized, while only two (time and amount) are labeled.
I.a. Target Variable
Our dataset is highly imbalanced: the vast majority of rows (99.8%) are non-fraudulent transactions with class = 0. Fraudulent transactions represent only ~0.2% of the dataset.
This class imbalance problem is common with fraud detection, as fraud (hopefully) is a rare event. Because of this class imbalance issue, our model may not have enough fraudulent examples to learn from and we will mitigate this by experimenting with sampling methods in the modeling stage.
I.b. Features
To get a preliminary look at our features, I find seaborn’s pairplot function very useful, especially because we can plot the distributions by target variable if we pass the hue='class' argument. Below is a plot showing the first 10 features in our dataset by label, with orange representing 0 or non-fraudulent transactions and blue representing 1 or fraudulent transactions.

As you can see from the pairplot, the distributions of some features differ by label, giving an indication that these features may be useful for the model.
II. Train-Test Split
Since the dataset has already been cleaned, we can move on to split our dataset into the train and test sets. This is an important step as you cannot effectively evaluate the performance of your model on data that it has trained on!
I used scikit-learn’s train_test_split function to split off 75% of our dataset as the train set and the remaining 25% as the test set. It is important to note that I set the stratify argument equal to the label, y, in the train_test_split call so that the label appears in the same proportion in both the train and test sets. Otherwise, if there were no examples where the label is 1 in our train set, the model would not learn what fraudulent transactions are like. Likewise, if there were no examples where the label is 1 in our test set, we would not know how well the model would perform when it encounters fraud.
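from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True, stratify=y)  # 75/25 split, stratified on the label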
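To make the effect of stratification concrete, here is a minimal pure-Python sketch of the idea. It is not how scikit-learn implements train_test_split; the helper name stratified_split and the toy labels are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that every class is split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with 0.2% positives, mirroring the imbalance described in the article.
labels = [1] * 20 + [0] * 9980
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))  # 7500 2500
print(train_pos, test_pos)            # 15 5
```

Both sets keep the same 0.2% fraud rate, which is what the stratify argument is meant to guarantee.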
III. Modeling
Since our dataset is anonymized, there is no feature engineering to be done, so the next step is modeling.
III.a. Choosing an ML Model
There are different classification models to choose from, so I first built simple baseline models and picked the best one, whose hyperparameters we will tune later. In this case, I trained a logistic regression model, a random forest model and an XGBoost model to compare their performance.
Due to the class imbalance, accuracy is not a meaningful metric in this case: a model that labels every transaction as non-fraudulent would already be about 99.8% accurate. Instead I used AUC (the area under the ROC curve) as the evaluation metric, which takes on values between 0 and 1. The AUC measures the probability that the model will rank a random positive example (class = 1) higher than a random negative example.
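The ranking interpretation of AUC can be checked directly on small examples. The sketch below is a brute-force illustration, not the routine used in the article; the function names and toy data are assumptions. It counts the fraction of (positive, negative) pairs that are ranked correctly, with ties counting as half.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)

# A "model" that always answers non-fraud on data with 0.2% fraud.
y_true = [0] * 998 + [1] * 2
always_zero_acc = accuracy(y_true, [0] * 1000)
always_zero_auc = pairwise_auc(y_true, [0.0] * 1000)
print(always_zero_acc, always_zero_auc)  # 0.998 0.5

# A small example where three of the four positive/negative pairs are ordered correctly.
small_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(small_auc)  # 0.75
```

The constant model looks excellent by accuracy but scores 0.5 on AUC, the same as random ranking, which is why AUC is the more informative metric here.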
To evaluate model performances, I used stratified K-Fold Cross Validation to stratify sampling by class label, since our dataset is highly imbalanced. Using the model AUC scores, I made a boxplot to compare the ranges of AUC scores by model.

Not surprisingly, XGBoost appears to be the best of our three models. Its mean AUC score is 0.970, compared to 0.944 for the random forest model and 0.911 for the logistic regression model. So, I selected XGBoost as my model of choice going forward.
III.b. Compare Sampling Methods
As mentioned previously, I also experimented with different sampling techniques to deal with the class imbalance issue. I tried out imblearn's random oversampling, random undersampling and SMOTE functions:
Random oversampling samples the minority class with replacement until it reaches a defined ratio to the majority class. I left this at the default setting, which resamples until the two classes are equal in size, so our new dataset has a 50/50 split between labels of 0’s and 1’s.
Random undersampling samples the majority class (without replacement by default, although sampling with replacement can be enabled) until our dataset has a 50/50 split between labels of 0’s and 1’s.
SMOTE (Synthetic Minority Oversampling Technique) is a data augmentation method that randomly selects an example from the minority class, finds its k nearest neighbours (usually k=5), chooses one of those neighbours at random and creates a synthetic new example in the feature space between this neighbour and the original example.
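The three sampling strategies described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not imblearn's implementation: the function names, the toy 2-D points and the choice of k=3 are all made up for the example.

```python
import random

def random_oversample(minority, n_target, rng):
    """Draw from the minority class with replacement until it has n_target rows."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra

def random_undersample(majority, n_target, rng):
    """Keep n_target majority rows, drawn without replacement."""
    return rng.sample(majority, n_target)

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating toward a random near neighbour."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of base (squared Euclidean distance)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

rng = random.Random(0)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
synthetic = smote_like(minority, n_new=16, k=3, rng=rng)
print(len(oversampled), len(undersampled), len(minority) + len(synthetic))  # 20 4 20
```

Random oversampling only repeats existing rows, whereas the SMOTE-style sampler produces new points that lie between existing minority examples.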
I used the Pipeline class from imblearn.pipeline to avoid leakage, so that resampling is applied only to the training folds and never to the data used for validation. I then used stratified K-Fold Cross Validation to compare the performance of XGBoost models with the three sampling techniques listed above.
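The leakage concern is easiest to see when the folds are built by hand. The following sketch is a simplified stand-in for what a sampler inside a pipeline achieves during cross-validation; the helper names, the round-robin fold assignment and the toy labels are assumptions for illustration only. The key point is that resampling happens after the split and touches only the training indices.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign row indices to folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % n_splits].append(i)
    return folds

def oversample_minority(train_idx, labels, rng):
    """Repeat minority-class indices (with replacement) until both classes are equal."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return train_idx + extra

labels = [1] * 10 + [0] * 990
folds = stratified_folds(labels, n_splits=5)
rng = random.Random(1)
for k, val_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    resampled = oversample_minority(train_idx, labels, rng)
    # No validation row may appear in the resampled training data.
    assert not set(resampled) & set(val_idx)
    val_pos = sum(labels[i] for i in val_idx)
    train_pos = sum(labels[i] for i in resampled)
    print(k, len(val_idx), val_pos, len(resampled), train_pos)
```

Each validation fold keeps the original 1% fraud rate of the toy data, while the training part is balanced. Resampling before splitting would instead let copies of the same fraud row land on both sides of the split and inflate the validation score.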

The mean AUC scores of the three sampling methods are quite close, between 0.974 and 0.976. In the end, I chose SMOTE because of the smaller range in its AUC scores.
IV. Hyperparameter Tuning
I chose to use Bayesian hyperparameter tuning with a package called hyperopt, because it uses the results of earlier trials to decide which values to try next and so tends to need fewer trials than grid search or randomized search. The hyperparameters that I wanted to tune for my XGBoost model were:
- max_depth: maximum depth of a tree; values between 4 and 10.
- min_child_weight: minimum sum of sample weights required in a leaf node (the end of a branch); values between 1 and 20.
- subsample: fraction of observations randomly sampled for each tree; values between 0.5 and 0.9.
- colsample_bytree: fraction of columns (features) randomly sampled for each tree; values between 0.5 and 0.9.
- gamma: minimum loss reduction needed to split a node, used to prevent overfitting; values between 0 and 5.
- eta: the learning rate; values between 0.01 and 0.3.
To use hyperopt, I first set up my search space with hyperparameters and their respective bounds to search through:
from hyperopt import hp

# quniform samples on a grid with step 2; uniform samples continuously
space = {
    'max_depth': hp.quniform('max_depth', 4, 10, 2),
    'min_child_weight': hp.quniform('min_child_weight', 5, 30, 2),
    'gamma': hp.quniform('gamma', 0, 10, 2),
    'subsample': hp.uniform('subsample', 0.5, 0.9),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 0.9),
    'eta': hp.uniform('eta', 0.01, 0.3),
    'objective': 'binary:logistic',
    'eval_metric': 'auc'
}
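For readers unfamiliar with hp.quniform, the hyperopt documentation describes its result as round(uniform(low, high) / q) * q, returned as a float. The stdlib-only sketch below mimics that formula to show which values a setting such as (4, 10, 2) can produce; quniform_like is a hypothetical helper written for this illustration and is not part of hyperopt.

```python
import random

def quniform_like(rng, low, high, q):
    """Mimic the documented formula of hp.quniform: round(uniform(low, high) / q) * q."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)
depth_draws = sorted({quniform_like(rng, 4, 10, 2) for _ in range(1000)})
print(depth_draws)  # [4.0, 6.0, 8.0, 10.0]

# The draws are floats, so integer-valued tree parameters need an explicit cast.
depth_as_int = [int(d) for d in depth_draws]
print(depth_as_int)  # [4, 6, 8, 10]
```

With a step of 2, only even values in the range can be drawn, and the float results explain why parameters such as max_depth are cast with int() before being passed to the model.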
Next, I defined an objective function that receives values from the previously defined search space. Since hyperopt minimizes the objective, the function returns 1 minus the mean cross-validated AUC:
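from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(params):
    # quniform returns floats, so cast the integer-valued tree parameters
    params = {'max_depth': int(params['max_depth']),
              'min_child_weight': int(params['min_child_weight']),
              'gamma': params['gamma'],
              'subsample': params['subsample'],
              'colsample_bytree': params['colsample_bytree'],
              'eta': params['eta'],
              'objective': params['objective'],
              'eval_metric': params['eval_metric']}
    # n_estimators is the scikit-learn wrapper's name for the number of boosting rounds
    # (num_boost_rounds is defined elsewhere in the notebook). Early stopping is left
    # out here because cross_val_score passes no eval_set to fit().
    xgb_clf = XGBClassifier(n_estimators=num_boost_rounds, **params)
    best_score = cross_val_score(xgb_clf, X_train, y_train, scoring='roc_auc', cv=5, n_jobs=3).mean()
    loss = 1 - best_score  # hyperopt minimizes, so a higher AUC means a lower loss
    return loss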
The best hyperparameters returned are listed below, and we will use these to train our final model!
best_params = {'colsample_bytree': 0.7,
               'eta': 0.2,
               'gamma': 1.5,
               'max_depth': 10,
               'min_child_weight': 6,
               'subsample': 0.9}
V. Evaluation of Final Model Performance
To train the final model, I used imblearn's pipeline to avoid leakage. In the pipeline, I first used SMOTE to augment the dataset with more positive examples for the model to learn from, then trained an XGBoost model with the best hyperparameters found in step IV.
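from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

# SMOTE runs only when the pipeline is fitted; prediction uses the data unchanged
final_model = Pipeline([
    ('smote', SMOTE(random_state=1)),
    # n_estimators sets the number of boosting rounds; early stopping would need an eval_set
    ('xgb', XGBClassifier(n_estimators=1000, **best_params))])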
V.a. Metrics
Below are some metrics to evaluate the performance of the final model:
AUC
The AUC score of the final model is 0.991! This indicates that our final model is able to rank order fraud risk quite well.
Classification Report

Precision
True Positives/(True Positives + False Positives)
Precision for class 0 is 1 (after rounding), indicating that essentially all items labeled as belonging to class 0 are indeed non-fraudulent transactions. Precision for class 1 is 0.86, meaning that 86% of items labeled as class 1 are indeed fraudulent transactions. In other words, when the final model predicts that a transaction is non-fraudulent it is almost always right, and when it flags a transaction as fraudulent it is right 86% of the time.
Recall
True Positives/(True Positives + False Negatives)
Recall for class 0 rounds to 1, meaning that nearly all non-fraudulent transactions were labeled as such, i.e. belonging to class 0. Recall for class 1 is 0.9, so 90% of fraudulent transactions were labeled as belonging to class 1 by our final model. This means that the final model is able to catch 90% of all fraudulent transactions.
F1 score
2 * (Recall * Precision)/(Recall + Precision)
The F1 score is the harmonic mean of precision and recall. On the test set, the F1 score of the final model's predictions is 1 for class 0 and 0.88 for class 1.
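The three formulas above can be tied together with a short worked example. The confusion-matrix counts below are hypothetical, chosen only because they roughly reproduce the class 1 scores reported in the article; they are not the author's actual counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Hypothetical counts for the fraud class (class 1), for illustration only.
tp, fp, fn = 90, 15, 10
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.86 0.9 0.88
```

With these counts, 10 of 100 frauds are missed (recall 0.9) and 15 of 105 fraud alerts are false alarms (precision of about 0.86), which combine into an F1 score of about 0.88.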
V.b. Feature Importances
To understand the model, it is useful to look at the SHAP summary and feature importance plots. Unfortunately, most features have been anonymized in this dataset, but the plots show that v14, v4 and v12 are the three most important features in the final model.


Final Thoughts
In merely five steps, we built an XGBoost model capable of predicting whether a transaction is fraudulent or not based on the 30 features provided in this dataset.
Our final model has an AUC score of 0.991, which is incredibly high! However, it is worth noting that this was done with a pre-cleaned (and manipulated) dataset. In reality, feature engineering is a vital step in modeling but we did not have the chance to do so here due to limits of working with an anonymized dataset.
I hope that this hands-on modeling exercise using a real dataset helped you better understand the mechanics behind creating a machine learning model to predict fraud. I am very curious to know what the anonymized features were, especially the most predictive ones. If you have any ideas on what they could be, please comment below!
To see the code, please check out my Jupyter notebook on GitHub. Thank you!
Footnotes
[1]: Javelin Strategy. 2020 Identity Fraud Study: Genesis of the Identity Fraud Crisis. https://www.javelinstrategy.com/coverage-area/2020-identity-fraud-study-genesis-identity-fraud-crisis