• Classification Evaluation Metrics



    If you are familiar with machine learning competitions and you take the time to read through the competition guidelines, you will come across the term Evaluation Metric. The evaluation metric is the basis by which model performance is determined and winning models are placed on the leaderboard. Understanding evaluation metrics will help you build better models and give you an edge over your peers in the event of a competition. We will discuss the common model evaluation metrics, paying attention to when they are used, the range of values for each metric, and most importantly, the values we want to see.


    A prediction model is trained with historical data. To ensure that the model makes accurate predictions — is able to generalize learned rules on new data — the model should be tested using data that it was not trained with.


    This can be done by separating the dataset into training data and testing data. The model is trained using the training data and the model’s predictive accuracy is evaluated using the test set. The dataset can be split using sklearn.model_selection.train_test_split from the scikit-learn library.


    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    • X is the input feature(s) or independent feature(s).

    • y is the target column or dependent feature.

    • test_size denotes the ratio of the total dataset that will be used to test the model.

    • The names X_train, X_test, y_train, y_test are conventional names and can be changed to suit your taste.


    The model prediction accuracy is then determined using Evaluation Metrics. We will discuss popular evaluation metrics for classification models.


    Confusion Matrix

    The confusion matrix is a common method used to determine and visualize the performance of classification models. The confusion matrix is an NxN matrix, where N is the number of target classes or values.

    The rows in a confusion matrix represent actual values while the columns represent predicted values.


    Terms to Note in the Confusion Matrix

    • True positives: True positives occur when the model has correctly predicted a True instance.

    • True Negatives: True negatives are cases when the model accurately predicts False instances.

    • False positives: False positives are a situation where the model has predicted a True value when the actual value is False.

    • False negatives: False negatives are a situation where the model predicts a False value when the actual value is True. (A sketch for extracting all four counts with scikit-learn follows this list.)
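    These four counts can be read directly off scikit-learn's confusion matrix. Here is a minimal sketch with made-up binary labels (in practice y_pred would come from a fitted classifier):

    from sklearn.metrics import confusion_matrix

    y_test = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # for binary input the matrix is [[TN, FP], [FN, TP]],
    # so ravel() unpacks the four counts in that order
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(tn, fp, fn, tp)  # 3 1 1 3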

    To paint a clearer picture, I will use the Titanic dataset as an example. The Titanic dataset is a popular machine learning dataset and is common amongst beginners. It is a binary classification problem and the goal is to accurately predict which passengers survived the Titanic shipwreck. The passengers who survived are denoted by 1 while the passengers who did not survive are denoted by 0 in the target column SURVIVED.


    Now if our model classifies a passenger as having survived (1) and the passenger actually survived (according to our dataset) then that classification is a True Positive — the model accurately predicted a 1.


    If the model predicts that a passenger did not survive and the passenger did not survive, that is a True Negative classification.


    But if the model predicts that a passenger survived when the passenger in fact did not survive, that is a case of False Positive.


    As you may have guessed, when the model predicts a passenger died when the passenger actually survived, then it is a False Negative.


    Hopefully this puts things in perspective for you.


    An ideal confusion matrix will have higher values along its major diagonal, from top left to bottom right.


    The Titanic dataset confusion matrix of a binary classification problem with only two possible outputs — survived or did not survive — is shown below. We have only two target values, therefore we have a 2x2 matrix, with the actual values on the vertical axis and the predicted values on the horizontal axis.


    [Figure: Titanic 2x2 confusion matrix]

    A confusion matrix can also visualize multi-class classification problems. This is no different from binary classification, except that the dimension of the matrix increases. A three-class classification problem will have a 3x3 confusion matrix, and so on.


    [Figure: NxN confusion matrix]

    Specific metrics that can be derived from the confusion matrix include:


    Accuracy

    The accuracy metric measures the proportion of instances that are correctly predicted by the model — the true positives and true negatives.


    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    #calculating accuracy mathematically
    Accuracy = sum(clf.predict(X_test) == y_test)/(1.0*len(y_test))

    #calculating accuracy using sklearn
    from sklearn.metrics import accuracy_score
    print(accuracy_score(clf.predict(X_test), y_test))

    The values for accuracy range from 0 to 1. If a model accuracy is 0, the model is not a good predictor. The accuracy metric is well suited for classification problems where the two classes are balanced or almost balanced.


    Remember, it is advisable to evaluate the model on new data. If the model is evaluated using the same data it was trained on, a high accuracy value is not surprising, as the model remembers the actual values and returns them as predictions. This is called overfitting. Overfitting is a situation whereby the model fits the training data but is not able to accurately predict target values when introduced to new data. To ensure your model is accurate, make sure to evaluate it with a new set of data.


    Precision and Recall

    The precision and recall metrics work hand in hand. While precision measures the proportion of the model's positive predictions that are actually positive, recall determines the proportion of the actual positive values that were accurately predicted.


    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)

    A model that produces a low number of false positives has high precision, while a model with a low number of false negatives has a high recall value.


    [Figure: Titanic confusion matrix]

    We will again use the Titanic dataset as before. If the confusion matrix for our model is as above, then using the equations above we get the following precision and recall values (to 2 decimal places):


    Precision = 0.77


    Recall = 0.86


    If the number of false negatives is decreased, the recall value increases. Likewise, if the number of false positives is reduced, the precision value increases. From the confusion matrix above, the model is able to predict those who died in the shipwreck more accurately than those who survived.


    Another way to understand the relationship between precision and recall is with thresholds. The values above the threshold are assigned positive (survived) and values below the threshold are negative (did not survive).


    If the threshold is 0.5, passengers whose predicted score falls above 0.5 are classified as having survived the Titanic. If the threshold is higher, such as 0.8, the number of passengers who actually survived but are classified as dead (the false negatives) will increase, and the recall value will decrease. At the same time, the number of false positives will reduce, as the model will now classify fewer passengers as positive due to its high threshold, so the precision value will increase.


    In this way, precision and recall can be seen to have a see-saw relationship. If the threshold is lower, the number of positive predictions increases, so the number of false positives rises while the number of false negatives falls. In his book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron describes this tradeoff splendidly.

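    To make the see-saw concrete, here is a small sketch that scores the same predicted probabilities at two thresholds; the labels and probabilities are made up for illustration (in practice y_prob would come from something like clf.predict_proba(X_test)[:, 1]):

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    y_test = np.array([0, 1, 0, 1, 1, 0])
    y_prob = np.array([0.6, 0.9, 0.3, 0.7, 0.55, 0.2])

    for threshold in (0.5, 0.8):
        y_pred = (y_prob >= threshold).astype(int)
        print(threshold,
              precision_score(y_test, y_pred),  # 0.75, then 1.0
              recall_score(y_test, y_pred))     # 1.0, then ~0.33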

    We may choose to focus on precision or recall for a particular problem type. For example, if our model is to classify between malignant and benign cancers, we would want to minimize the chances of a malignant cancer being classified as benign — we want to minimize the false negatives — and therefore we can focus on increasing our recall value rather than precision. In this situation, we are keen on correctly diagnosing as many malignant cancers as possible, even if some of them turn out to be benign, rather than miss a malignant cancer.


    Precision and Recall values range from 0 to 1 and in both cases, the closer the metric value is to 1, the higher the precision or recall. They are also good metrics to use when the classes are imbalanced and can be averaged for multiclass/multilabel classifications.


    You can read more on precision and recall in the scikit-learn documentation.


    from sklearn.metrics import recall_score
    from sklearn.metrics import precision_score

    y_test = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 0, 1, 0, 0, 1]
    recall_score(y_test, y_pred)
    precision_score(y_test, y_pred)

    A metric that takes both precision and recall into consideration is the F1 score.


    F1 Score

    The F1 score takes into account the precision and recall of the model. The F1 score computes the harmonic mean of precision and recall, giving a higher weight to the low values. Therefore, if either of precision or recall has low values, the F1 score will also have a value closer to the lesser metric. This gives a better model evaluation than an arithmetic mean of precision and recall. A model with high recall and precision values will also have a high F1 score. The F1 score ranges from 0 to 1. The closer the F1 score is to 1, the better the model is.


    F1 score also works well with imbalanced classes and for multiclass/multilabel classification targets.


    from sklearn.metrics import f1_score
    f1_score(y_test, y_pred)

    F1 = 2 * (Precision * Recall) / (Precision + Recall)
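    To see why the harmonic mean is preferable to a simple average, consider an illustrative model with high precision but very low recall (the numbers here are made up):

    precision, recall = 0.9, 0.1

    arithmetic_mean = (precision + recall) / 2          # 0.5, hides the weak recall
    f1 = 2 * precision * recall / (precision + recall)  # 0.18, pulled toward the low value
    print(arithmetic_mean, f1)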

    To display a summary of the results of the confusion matrix:


    from sklearn.metrics import classification_report
    y_true = [0, 1, 1, 1, 0]
    y_pred = [0, 0, 1, 1, 1]
    target_names = ['Zeros', 'Ones']
    print(classification_report(y_true, y_pred, target_names=target_names))

                  precision    recall  f1-score   support

           Zeros       0.50      0.50      0.50         2
            Ones       0.67      0.67      0.67         3

        accuracy                           0.60         5
       macro avg       0.58      0.58      0.58         5
    weighted avg       0.60      0.60      0.60         5

    Log Loss

    The Logarithmic Loss metric, or log loss as it is commonly known, measures how far the predicted values are from the actual values. Log loss works by penalizing wrong predictions.


    Log loss does not have negative values. Its output is a float data type with values ranging from 0 to infinity. A model that accurately predicts the target class has a log loss value close to 0. This indicates that the model has made minimal error in prediction. Log loss is used when the output of the classification model is a probability such as in logistic regression or some neural networks.


    from sklearn.metrics import log_loss
    # log loss expects predicted probabilities rather than hard class labels,
    # e.g. y_prob = clf.predict_proba(X_test)[:, 1]
    log_loss(y_test, y_prob)
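    As an illustrative sketch with made-up labels and probabilities, compare confident, mostly correct probabilities with the same probabilities flipped to the wrong side:

    from sklearn.metrics import log_loss

    y_test = [0, 1, 1, 0]
    print(log_loss(y_test, [0.1, 0.9, 0.8, 0.35]))  # ~0.22: confident and mostly correct
    print(log_loss(y_test, [0.9, 0.1, 0.2, 0.65]))  # ~1.82: the same margins on the wrong side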

    In this post, we have discussed some of the most popular evaluation metrics for a classification model: the confusion matrix, accuracy, precision, recall, F1 score and log loss. We have seen instances where these metrics are useful and their possible values. We have also outlined the metric scores for each evaluation metric that indicate our model is doing a great job at predicting the actual values. Next time we will look at the charts and curves that can also be used to evaluate the performance of a classification model.


    Translated from: https://medium.com/swlh/evaluation-metrics-i-classification-a26476dd0146


  • Evaluation Metrics for Multi-Label Classification

    Metrics play quite an important role in the field of Machine Learning or Deep Learning. We start the problems with metric selection as to know the baseline score of a particular model. In this blog, we look into the best and most common metrics for Multi-Label Classification and how they differ from the usual metrics.


    Let me get into what multi-label classification is, just in case you need it. Suppose we have data about the features of a dog and we have to predict both which breed and which pet category it belongs to: each sample can carry more than one label.


    In the case of object detection, multi-label classification gives us a list of all the objects in the image, as follows. We can see that the classifier detects 3 objects in the image. The output can be made into a list as follows, [1 0 1 1], if the total number of trained objects is 4, i.e. [dog, human, bicycle, truck].

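    This list form can be produced with scikit-learn's MultiLabelBinarizer. A small sketch, with the four classes assumed to be ordered as in the example above:

    from sklearn.preprocessing import MultiLabelBinarizer

    classes = ["dog", "human", "bicycle", "truck"]
    mlb = MultiLabelBinarizer(classes=classes)

    # an image containing a dog, a bicycle and a truck
    print(mlb.fit_transform([["dog", "bicycle", "truck"]]))  # [[1 0 1 1]]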

    [Figure: object detection output (multi-label classification)]

    This kind of classification is known as Multi-Label Classification.


    The most common metrics that are used for Multi-Label Classification are as follows:


    1. Precision at k

    2. Avg precision at k

    3. Mean avg precision at k

    4. Sampled F1 Score


    Let’s get into the details of these metrics.


    Precision at k (P@k)

    Given a list of actual classes and predicted classes, precision at k would be defined as the number of correct predictions considering only the top k elements of each class divided by k. The values range between 0 and 1.


    Here is an example explaining the same in code:


    def patk(actual, pred, k):
        # we return 0 if k is 0 because
        # we can't divide the number of common values by 0
        if k == 0:
            return 0

        # taking only the top k predictions of the class
        k_pred = pred[:k]

        # taking the set of the actual values
        actual_set = set(actual)

        # taking the set of the predicted values
        pred_set = set(k_pred)

        # taking the intersection of the actual set and the pred set
        # to find the common values
        common_values = actual_set.intersection(pred_set)

        return len(common_values) / len(k_pred)

    # defining the values of the actual and the predicted class
    y_true = [1, 2, 0]
    y_pred = [1, 1, 0]

    if __name__ == "__main__":
        print(patk(y_true, y_pred, 3))

    Running the following code, we get the following result:

    0.6666666666666666

    In this case, the model predicted the class 2 as 1, thus resulting in the score going down.


    Average Precision at k (AP@k)

    It is defined as the average of all the precision-at-k values for k = 1 up to k. To make it clearer, let's look at some code. The values range between 0 and 1.


    import numpy as np
    import pk  # the patk function from the previous snippet, saved as pk.py

    def apatk(actual, pred, k):
        # creating a list for storing the values of precision for each k
        precision_ = []
        for i in range(1, k + 1):
            # calculating the precision at different values of k
            # and appending them to the list
            precision_.append(pk.patk(actual, pred, i))

        # return 0 if there are no values in the list
        if len(precision_) == 0:
            return 0

        # returning the average of all the precision values
        return np.mean(precision_)

    # defining the values of the actual and the predicted class
    y_true = [[1,2,0,1], [0,4], [3], [1,2]]
    y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

    if __name__ == "__main__":
        for i in range(len(y_true)):
            print(f"y_true = {y_true[i]}")
            print(f"y_pred = {y_pred[i]}")
            for j in range(1, 4):
                print(f"AP@{j} = {apatk(y_true[i], y_pred[i], k=j)}")
            print()

    Here we check the AP@k of each sample for k from 1 to 3. We get the following output.

    y_true = [1, 2, 0, 1]
    y_pred = [1, 1, 0, 1]
    AP@1 = 1.0
    AP@2 = 0.75
    AP@3 = 0.7222222222222222

    y_true = [0, 4]
    y_pred = [1, 4]
    AP@1 = 0.0
    AP@2 = 0.25
    AP@3 = 0.3333333333333333

    y_true = [3]
    y_pred = [2]
    AP@1 = 0.0
    AP@2 = 0.0
    AP@3 = 0.0

    y_true = [1, 2]
    y_pred = [1, 3]
    AP@1 = 1.0
    AP@2 = 0.75
    AP@3 = 0.6666666666666666

    This gives us a clear understanding of how the code works.


    Mean Average Precision at k (MAP@k)

    The average of all the values of AP@k over the whole training data is known as MAP@k. This gives an accurate representation of the accuracy over the whole prediction data. Here is some code for the same.


    The values range between 0 and 1.


    import numpy as np
    import apk  # the apatk function from the previous snippet, saved as apk.py

    def mapk(actual, pred, k):
        # creating a list for storing the average precision values
        average_precision = []
        # iterating through the whole data and calculating the AP@k for each sample
        for i in range(len(actual)):
            average_precision.append(apk.apatk(actual[i], pred[i], k))
        # returning the mean over all the data
        return np.mean(average_precision)

    # defining the values of the actual and the predicted class
    y_true = [[1,2,0,1], [0,4], [3], [1,2]]
    y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

    if __name__ == "__main__":
        print(mapk(y_true, y_pred, 3))

    Running the above code, we get the following output:

    0.4305555555555556

    Here, the score is bad as the prediction set has many errors.


    F1 — Samples

    This metric calculates the F1 score for each instance in the data and then calculates the average of the F1 scores. We will be using sklearn’s implementation of the same in the code.


    Here is the documentation of F1 Scores. The values range between 0 and 1.


    We first convert the data into binary format and then perform f1 on the same. This gives us the required values.


    from sklearn.metrics import f1_score
    from sklearn.preprocessing import MultiLabelBinarizer

    def f1_sampled(actual, pred):
        # converting the multi-label targets to a binary indicator matrix;
        # fitting once on both lists so actual and pred share the same label columns
        mlb = MultiLabelBinarizer()
        mlb.fit(actual + pred)
        actual = mlb.transform(actual)
        pred = mlb.transform(pred)

        # computing the sample-averaged f1 score
        f1 = f1_score(actual, pred, average="samples")
        return f1

    # defining the values of the actual and the predicted class
    y_true = [[1,2,0,1], [0,4], [3], [1,2]]
    y_pred = [[1,1,0,1], [1,4], [2], [1,3]]

    if __name__ == "__main__":
        print(f1_sampled(y_true, y_pred))

    The output of the above code will be the following:

    0.45

    We know that the F1 score lies between 0 and 1, and here we got a score of 0.45. This is because the prediction set is bad. If we had a better prediction set, the value would be closer to 1.

    Hence, based on the problem, we usually use mean average precision at k, the sampled F1 score, or log loss. That is how you set up the metrics for your problem.


    I would like to thank Abhishek for his book Approaching (Any) Machine Learning Problem without which this blog wouldn’t have been possible.


    Translated from: https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3

  • Before introducing the various metrics, we first introduce the confusion matrix, since essentially all evaluation metrics are computed from it. Each row of a confusion matrix represents the true class of the data, and each column represents a predicted class. Both binary and multi-class classification have confusion matrices.


    Confusion Matrix


    • TP: True Positive, a positive sample classified as positive
    • FP: False Positive, a negative sample classified as positive
    • TN: True Negative, a negative sample classified as negative
    • FN: False Negative, a positive sample classified as negative


    1. Accuracy
      Accuracy = (TP+TN)/(TP+FP+TN+FN)
    2. Precision (positive predictive value, PPV)
      Precision = TP/(TP+FP)
    3. Recall (true positive rate, TPR; also known as sensitivity)
      Recall = TP/(TP+FN)
    4. F1-score (see the sketch after this list)
      F1 = 2*Precision*Recall/(Precision+Recall)
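    As a minimal sketch, all four metrics follow directly from the four counts (the counts here are hypothetical):

    def basic_metrics(tp, fp, tn, fn):
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1

    print(basic_metrics(tp=20, fp=30, tn=50, fn=0))  # (0.7, 0.4, 1.0, 0.571...)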


    1. Specificity (true negative rate, TNR)
      TNR = TN/(TN+FP)
    2. False discovery rate (FDR)
      FDR = FP/(FP+TP) = 1 - Precision
    3. Negative predictive value (NPV)
      NPV = TN/(TN+FN)
    4. Kappa coefficient
      The kappa coefficient takes values between -1 and 1, and is usually greater than 0. Agreement can be divided into five levels: 0.0~0.20 slight, 0.21~0.40 fair, 0.41~0.60 moderate, 0.61~0.80 substantial, and 0.81~1 almost perfect.
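    scikit-learn exposes this as cohen_kappa_score; a quick sketch with made-up labels:

    from sklearn.metrics import cohen_kappa_score

    y_true = [0, 1, 1, 0, 1]
    y_pred = [0, 1, 0, 0, 1]
    print(cohen_kappa_score(y_true, y_pred))  # ~0.615, "substantial" on the scale above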


    1. ROC curve / AUC (Area Under the Curve)
      The ROC curve (receiver operating characteristic curve) is a comprehensive indicator that reflects sensitivity and specificity as continuous variables; the larger the area under the curve, the higher the diagnostic accuracy.
    • x-axis: 1 - Specificity, the false positive rate (FPR)
    • y-axis: Sensitivity, the true positive rate (TPR)





    1. Macro-average: compute the metric for each class, then take the unweighted mean.
    2. Weighted-average: compute the metric for each class, then take the mean weighted by class support.
    3. Micro-average: sum the TP, FP and FN over all classes first, then apply the binary-classification formulas (see the sketch below).
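    The three averaging schemes map onto the average= parameter of scikit-learn's metric functions. A sketch with made-up multi-class labels:

    from sklearn.metrics import precision_score

    y_true = [0, 1, 2, 2, 1, 1]
    y_pred = [0, 1, 1, 2, 1, 2]
    for avg in ("macro", "weighted", "micro"):
        print(avg, precision_score(y_true, y_pred, average=avg))
    # macro    ~0.722 (unweighted mean of the per-class precisions 1.0, 2/3 and 1/2)
    # weighted ~0.667 (mean weighted by the class supports 1, 3 and 2)
    # micro    ~0.667 (global TP / (TP + FP))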


  • SKlearn Binary Classification Evaluation Metrics

    SKlearn's metrics module provides many evaluation metrics for binary classification algorithms; here we discuss the most commonly used ones.

    1. Accuracy


    from sklearn.metrics import accuracy_score
    accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

    $accuracy(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} 1(y_i = \hat{y}_i)$

    1.3 Example

    from sklearn.metrics import accuracy_score
    y_true = [1,1,0,1,0]
    y_pred = [1,1,1,0,0]
    score = accuracy_score(y_true, y_pred)                   # 0.6, the fraction of correct predictions
    score1 = accuracy_score(y_true, y_pred, normalize=True)  # identical: normalize defaults to True

    2. Confusion Matrix

    from sklearn.metrics import confusion_matrix
    confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)


    import numpy as np
    from sklearn.metrics import confusion_matrix
    y_true = np.array([0,1,1,1,0,0,1,2,0,1])
    y_pred = np.array([0,0,1,1,1,0,1,2,0,1])
    cm = confusion_matrix(y_true, y_pred)  # renamed so the imported function is not shadowed
    cm

    array([[3, 1, 0],
           [1, 4, 0],
           [0, 0, 1]], dtype=int64)


    import itertools
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_confusion_matrix(cm, classes,
                              normalize=False,
                              title='Confusion matrix',
                              cmap=plt.cm.Blues):
        """
        This function prints and plots the confusion matrix.
        Normalization can be applied by setting `normalize=True`.
        """
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        plt.ylabel('True label')
        plt.xlabel('Predicted label')

    plot_confusion_matrix(cm, classes=[0, 1, 2])

    [Figure: confusion matrix for a gender-classification example]
    TP (True positive): predicted correctly, with a positive prediction (a girl predicted as a girl), TP = 20.
    FP (False positive): predicted wrongly, with a positive prediction (a boy predicted as a girl), FP = 30.
    TN (True negative): predicted correctly, with a negative prediction (a boy predicted as a boy), TN = 50.
    FN (False negative): predicted wrongly, with a negative prediction (a girl predicted as a boy), FN = 0.
    Recall = TP/(TP + FN). Recall measures how many of the actual positives the model predicted correctly. In the example above, Recall = TP/(TP+FN) = 20/(20+0) = 100%.
    Suppose we have 100 patient samples, 5 of whom have cancer (labelled 1) while the remaining 95 healthy patients are labelled 0. If an algorithm predicts 0 for every sample, it gets the 95 healthy patients right, so its accuracy is 95%, which looks quite high. But its accuracy on the cancer patients is 0, so the algorithm is meaningless. This is where the value of recall shows itself: recall looks at how many of the 5 cancer patients can be found; if 3 of them are predicted correctly, recall = 60%.

    3. Classification Report

    from sklearn.metrics import classification_report
    classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False)


    import numpy as np
    from sklearn.metrics import classification_report
    y_true = np.array([0,1,1,0,1,2,1])
    y_pred = np.array([0,1,0,0,1,2,1])
    target_names = ['class0','class1','class2']
    print(classification_report(y_true, y_pred, target_names=target_names))

                  precision    recall  f1-score   support

          class0       0.67      1.00      0.80         2
          class1       1.00      0.75      0.86         4
          class2       1.00      1.00      1.00         1

       micro avg       0.86      0.86      0.86         7
       macro avg       0.89      0.92      0.89         7
    weighted avg       0.90      0.86      0.86         7

    Note: the f1-score is the combined result of recall and precision, expressed as: f1-score = 2 * (precision * recall) / (precision + recall)

    4. ROC Curve / AUC

    from sklearn.metrics import roc_auc_score
    roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)

    max_fpr takes values in (0, 1]; if it is not None, the standardized partial AUC over the range [0, max_fpr] is returned.

    4.2 Notes on the ROC Curve
    As shown in the figure below, the x-axis is the false positive rate and the y-axis is the true positive rate. We want the false positive rate to be as small as possible and the true positive rate to be as large as possible, i.e. we want the curve to bend towards the top-left corner. To quantify the ROC curve we measure the area under it, the AUC (Area Under Curve) value, which ranges over [0, 1].
    [Figure: ROC curve]

    import numpy as np
    from sklearn.metrics import roc_auc_score
    y_true = np.array([0,1,1,0])
    y_score = np.array([0.85,0.78,0.69,0.54])
    print(roc_auc_score(y_true, y_score))  # 0.5
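    To inspect the points that make up the curve itself, scikit-learn's roc_curve returns the FPR/TPR pairs at each candidate threshold. A minimal sketch reusing the arrays above:

    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(fpr)         # false positive rate at each threshold
    print(tpr)         # true positive rate at each threshold
    print(thresholds)  # the score thresholds that generate each point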

