  • Classification Evaluation Metrics


    If you are familiar with machine learning competitions and you take the time to read through the competition guidelines, you will come across the term Evaluation Metric. The evaluation metric is the basis by which model performance is determined and winning models are placed on the leaderboard. Understanding evaluation metrics will help you build better models and give you an edge over your peers in the event of a competition. We will discuss the common model evaluation metrics, paying attention to when they are used, the range of values for each metric and, most importantly, the values we want to see.

    A prediction model is trained with historical data. To ensure that the model makes accurate predictions — is able to generalize learned rules on new data — the model should be tested using data that it was not trained with.

    This can be done by separating the dataset into training data and testing data. The model is trained using the training data and the model’s predictive accuracy is evaluated using the test set. The dataset can be split using sklearn.model_selection.train_test_split from the scikit-learn library.

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    X is the input feature(s) or independent feature(s)

    y is the target column or dependent feature

    test_size denotes the ratio of total dataset that will be used to test the model.

    The names X_train, X_test, y_train, y_test are conventional names and can be changed to suit your taste.

    The model prediction accuracy is then determined using Evaluation Metrics. We will discuss popular evaluation metrics for classification models.

    Confusion Matrix

    The confusion matrix is a common method that is used to determine and visualize the performance of classification models. The confusion matrix is a NxN matrix where N is the number of target classes or values.

    The rows in a confusion matrix represent actual values while the columns represent predicted values.

    Terms to note in the Confusion Matrix

    • True positives: True positives occur when the model has correctly predicted a True instance.

    • True Negatives: True negative are cases when the model accurately predicts False instances.

    • False positives: False positives are a situation where the model has predicted a True value when the actual value is False.

    • False negatives: False negative is a situation where the model predicts a False value when the actual value is True.

    To build a better picture I will use the Titanic dataset as an example. The Titanic dataset is a popular machine learning dataset and is common amongst beginners. It is a binary classification problem and the goal is to accurately predict which passengers survived the Titanic shipwreck. The passengers who survived are denoted by 1 while the passengers who did not survive are denoted by 0 in the target column SURVIVED.

    Now if our model classifies a passenger as having survived (1) and the passenger actually survived (according to our dataset) then that classification is a True Positive — the model accurately predicted a 1.

    If the model predicts that a passenger did not survive and the passenger did not survive, that is a True Negative classification.

    But if the model predicts that a passenger survived when the passenger in fact did not survive, that is a case of False Positive.

    As you may have guessed, when the model predicts a passenger died when the passenger actually survived, then it is a False Negative.

    Hope this illustration puts things in perspective for you.

    An ideal confusion matrix will have higher non-zero values along its major diagonal, from left to right.

    The Titanic dataset confusion matrix of a Binary Classification problem with only two possible outputs — survived or did not survive — is shown below. We have only two target values, therefore we have a 2x2 matrix, with the actual values on the vertical axis and the predicted values on the horizontal axis.

    [Figure: Titanic 2x2 confusion matrix]
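
    As a minimal sketch (not part of the original article), such a 2x2 matrix can be computed with scikit-learn from made-up survival labels, where 1 means survived and 0 means did not survive:

    from sklearn.metrics import confusion_matrix

    # hypothetical actual and predicted survival labels
    y_actual    = [1, 0, 0, 1, 0, 1, 0, 0]
    y_predicted = [1, 0, 1, 0, 0, 1, 0, 0]

    # rows are actual values, columns are predicted values
    print(confusion_matrix(y_actual, y_predicted))
    # [[4 1]
    #  [1 2]]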

    A confusion matrix can also visualize multi-class classification problems. This is not different from the binary classification with the exception of an increase in the dimension of the matrix. A three-class classification problem will have a 3x3 confusion matrix and so on.

    [Figure: NxN confusion matrix]

    Specific Metrics that can be gotten from the Confusion Matrix include:

    Accuracy

    The accuracy metric measures the number of classes that are correctly predicted by the model — the true positives and true negatives.

    [Figure: Model Accuracy = (TP + TN) / (TP + TN + FP + FN)]

    #calculating accuracy mathematically
    Accuracy = sum(clf.predict(X_test) == y_test) / (1.0 * len(y_test))

    #calculating accuracy using sklearn
    from sklearn.metrics import accuracy_score
    print(accuracy_score(clf.predict(X_test), y_test))

    The values for accuracy range from 0 to 1. If a model accuracy is 0, the model is not a good predictor. The accuracy metric is well suited for classification problems where the two classes are balanced or almost balanced.

    Remember, it is advisable to evaluate the model on new data. If the model is evaluated using the same data the model was trained on, high accuracy value is not surprising as the model remembers the actual values and returns them as predictions. This is called overfitting. Overfitting is a situation whereby the model fits the training data but is not able to accurately predict target values when introduced to new data. To ensure your model is accurate , make sure to evaluate with a new set of data.

    Precision and Recall

    The precision and recall metrics work hand-in-hand. While Precision measures the number of positive values predicted by the model that are actually positive, Recall determines the proportion of the positive values that were accurately predicted.

    [Figure: Precision = TP / (TP + FP), Recall = TP / (TP + FN)]

    A model that produces a low number of false positives has high precision, while a model with few false negatives has a high recall value.

    [Figure: Titanic confusion matrix]

    We will again use the Titanic dataset as before. If the confusion matrix for our model is as above, using our equations above, we get precision and recall values (in 2 dp) as follows:

    Precision = 0.77

    Recall = 0.86

    If the number of False Negatives is decreased, the Recall value increases. Likewise, if the number of False Positives is reduced, the Precision value increases. From the confusion matrix above, the model is able to predict those who died in the shipwreck more accurately than those who survived.

    Another way to understand the relationship between precision and recall is with thresholds. The values above the threshold are assigned positive (survived) and values below the threshold are negative (did not survive).

    If the threshold is 0.5, passengers whose predicted value falls above 0.5 are classified as having survived the Titanic. If the threshold is higher, such as 0.8, the number of passengers who survived but are classified as dead (the False Negatives) will increase and the recall value will decrease. At the same time, the number of False Positives will drop, since the model now classifies fewer passengers as positive due to its higher threshold. The Precision value will increase.

    In this way, precision and recall can be seen to have a see-saw relationship. If the threshold is lower, the number of predicted positives increases: False Positives increase while False Negatives decrease. In his book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, Aurélien Géron describes this tradeoff splendidly.
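
    To make the trade-off concrete, here is a small self-contained sketch (an illustration added here, not code from the article): a logistic regression is fitted on synthetic data and precision and recall are recomputed after raising the decision threshold from 0.5 to 0.8.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_score, recall_score

    # synthetic, imbalanced binary data
    X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

    for threshold in (0.5, 0.8):
        y_pred = (proba >= threshold).astype(int)
        print(threshold,
              "precision:", round(precision_score(y_test, y_pred), 2),
              "recall:", round(recall_score(y_test, y_pred), 2))
    # raising the threshold typically increases precision and lowers recall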

    We may choose to focus on precision or recall for a particular problem type. For example, if our model classifies between malignant and benign cancer, we would want to minimize the chances of a malignant cancer being classified as benign — we want to minimize the False Negatives — and therefore we can focus on increasing the Recall value rather than Precision. In this situation, we are keen on correctly diagnosing as many malignant cancers as possible, even if some of them turn out to be benign, rather than miss a malignant one.

    Precision and Recall values range from 0 to 1 and in both cases, the closer the metric value is to 1, the higher the precision or recall. They are also good metrics to use when the classes are imbalanced and can be averaged for multiclass/multilabel classifications.

    You can read more on precision and recall in the scikit-learn documentation.

    from sklearn.metrics import recall_score
    from sklearn.metrics import precision_score

    y_test = [0, 1, 1, 0, 1, 0]
    y_pred = [0, 0, 1, 0, 0, 1]
    recall_score(y_test, y_pred)
    precision_score(y_test, y_pred)

    A metric that takes both precision and recall into consideration is the F1 score.

    F1 Score

    The F1 score takes into account the precision and recall of the model. The F1 score computes the harmonic mean of precision and recall, giving a higher weight to the low values. Therefore, if either of precision or recall has low values, the F1 score will also have a value closer to the lesser metric. This gives a better model evaluation than an arithmetic mean of precision and recall. A model with high recall and precision values will also have a high F1 score. The F1 score ranges from 0 to 1. The closer the F1 score is to 1, the better the model is.

    F1 score also works well with imbalanced classes and for multiclass/multilabel classification targets.

    from sklearn.metrics import f1_score
    f1_score(y_test, y_pred)

    [Figure: F1 score = 2 * (Precision * Recall) / (Precision + Recall)]

    To display a summary of the results of the confusion matrix:

    from sklearn.metrics import classification_report
    y_true = [0, 1, 1, 1, 0]
    y_pred = [0, 0, 1, 1, 1]
    target_names = ['Zeros', 'Ones']
    print(classification_report(y_true, y_pred, target_names=target_names))

                  precision    recall  f1-score   support

           Zeros       0.50      0.50      0.50         2
            Ones       0.67      0.67      0.67         3

        accuracy                           0.60         5
       macro avg       0.58      0.58      0.58         5
    weighted avg       0.60      0.60      0.60         5

    Log Loss

    The Logarithmic Loss metric, or Log Loss as it is commonly known, measures how far the predicted values are from the actual values. Log Loss works by penalizing wrong predictions.

    Log loss does not have negative values. Its output is a float data type with values ranging from 0 to infinity. A model that accurately predicts the target class has a log loss value close to 0. This indicates that the model has made minimal error in prediction. Log loss is used when the output of the classification model is a probability such as in logistic regression or some neural networks.

    from sklearn.metrics import log_loss
    log_loss(y_test, y_pred)
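
    Note that log loss expects predicted probabilities rather than hard class labels. A small sketch (added for illustration, not from the article) showing how confident wrong predictions are penalized much more heavily than confident correct ones:

    from sklearn.metrics import log_loss

    y_true = [1, 1, 0, 0]

    confident_and_right = [0.9, 0.8, 0.1, 0.2]  # predicted probability of the positive class
    confident_but_wrong = [0.1, 0.2, 0.9, 0.8]

    print(log_loss(y_true, confident_and_right))  # small value, close to 0
    print(log_loss(y_true, confident_but_wrong))  # much larger value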

    In this post, we have discussed some of the most popular evaluation metrics for a classification model such as the confusion matrix, accuracy, precision, recall, F1 score and log loss. We have seen instances where these metrices are useful and their possible values. We have also outlined the metric scores for each evaluation metric that indicate our model is doing a great job at predicting the actual values. Next time we will look at the charts and curves that can also be used to evaluate the performance of a classification model.

    Translated from: https://medium.com/swlh/evaluation-metrics-i-classification-a26476dd0146

  • Multi-Label Classification Evaluation Metrics

    Metrics play quite an important role in the field of Machine Learning or Deep Learning. We start the problems with metric selection as to know the baseline score of a particular model. In this blog, we look into the best and most common metrics for Multi-Label Classification and how they differ from the usual metrics.

    Let me get into what Multi-Label Classification is, just in case you need it. If we have data about the features of a dog, we may have to predict both which breed it is and which pet category it belongs to; each sample can carry more than one label.

    In the case of Object Detection, Multi-Label Classification gives us a list of all the objects in the image as follows. We can see that the classifier detects 3 objects in the image. This can be expressed as the list [1 0 1 1] if the total number of trained classes is 4, i.e. [dog, human, bicycle, truck].
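
    As a quick sketch (added here, not part of the original post), scikit-learn's MultiLabelBinarizer produces exactly this kind of binary list from a set of detected labels:

    from sklearn.preprocessing import MultiLabelBinarizer

    classes = ["dog", "human", "bicycle", "truck"]
    mlb = MultiLabelBinarizer(classes=classes)

    # one image in which a dog, a bicycle and a truck were detected
    print(mlb.fit_transform([["dog", "bicycle", "truck"]]))
    # [[1 0 1 1]]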

    [Figure: Object detection output (multi-label classification)]

    This kind of classification is known as Multi-Label Classification.

    The most common metrics that are used for Multi-Label Classification are as follows:

    1. Precision at k (P@k)
    2. Average precision at k (AP@k)
    3. Mean average precision at k (MAP@k)
    4. Sampled F1 Score

    Let’s get into the details of these metrics.

    Precision at k (P@k)

    Given a list of actual classes and predicted classes, precision at k is defined as the number of correct predictions among only the top k predicted elements, divided by k. The values range between 0 and 1.

    Here is an example explaining the same in code:

    def patk(actual, pred, k):
        # we return 0 if k is 0 because
        # we can't divide the number of common values by 0
        if k == 0:
            return 0

        # taking only the top k predictions in a class
        k_pred = pred[:k]

        # taking the set of the actual values
        actual_set = set(actual)

        # taking the set of the predicted values
        pred_set = set(k_pred)

        # taking the intersection of the actual set and the pred set
        # to find the common values
        common_values = actual_set.intersection(pred_set)

        return len(common_values) / len(pred[:k])


    # defining the values of the actual and the predicted class
    y_true = [1, 2, 0]
    y_pred = [1, 1, 0]

    if __name__ == "__main__":
        print(patk(y_true, y_pred, 3))

    Running the code above, we get the following result.

    0.6666666666666666

    In this case, the element whose actual value is 2 was predicted as 1, which brings the score down.

    Average Precision at k (AP@k)

    It is defined as the average of all the P@i values for i = 1 to k. To make it clearer, let's look at some code. The values range between 0 and 1.
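
    Written as a formula, using the P@i defined in the previous section:

    AP@k = (1/k) * (P@1 + P@2 + ... + P@k)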

    import numpy as np
    import pk  # the previous snippet, assumed to be saved as pk.py


    def apatk(actual, pred, k):
        # creating a list for storing the values of precision for each k
        precision_ = []
        for i in range(1, k + 1):
            # calculating the precision at different values of k
            # and appending them to the list
            precision_.append(pk.patk(actual, pred, i))

        # return 0 if there are no values in the list
        if len(precision_) == 0:
            return 0

        # returning the average of all the precision values
        return np.mean(precision_)


    # defining the values of the actual and the predicted class
    y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
    y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

    if __name__ == "__main__":
        for i in range(len(y_true)):
            for j in range(1, 4):
                print(
                    f"""
                    y_true = {y_true[i]}
                    y_pred = {y_pred[i]}
                    AP@{j} = {apatk(y_true[i], y_pred[i], k=j)}
                    """
                )

    Here we check the AP@k for k from 1 to 3. We get the following output.

    y_true = [1, 2, 0, 1]
    y_pred = [1, 1, 0, 1]
    AP@1 = 1.0

    y_true = [1, 2, 0, 1]
    y_pred = [1, 1, 0, 1]
    AP@2 = 0.75

    y_true = [1, 2, 0, 1]
    y_pred = [1, 1, 0, 1]
    AP@3 = 0.7222222222222222

    y_true = [0, 4]
    y_pred = [1, 4]
    AP@1 = 0.0

    y_true = [0, 4]
    y_pred = [1, 4]
    AP@2 = 0.25

    y_true = [0, 4]
    y_pred = [1, 4]
    AP@3 = 0.3333333333333333

    y_true = [3]
    y_pred = [2]
    AP@1 = 0.0

    y_true = [3]
    y_pred = [2]
    AP@2 = 0.0

    y_true = [3]
    y_pred = [2]
    AP@3 = 0.0

    y_true = [1, 2]
    y_pred = [1, 3]
    AP@1 = 1.0

    y_true = [1, 2]
    y_pred = [1, 3]
    AP@2 = 0.75

    y_true = [1, 2]
    y_pred = [1, 3]
    AP@3 = 0.6666666666666666

    This gives us a clear understanding of how the code works.

    Mean Average Precision at k (MAP@k)

    The average of all the values of AP@k over the whole training data is known as MAP@k. This helps us give an accurate representation of the accuracy of whole prediction data. Here is some code for the same.

    The values range between 0 and 1.

    import numpy as np
    import apk  # the previous snippet, assumed to be saved as apk.py


    def mapk(actual, pred, k):
        # creating a list for storing the Average Precision values
        average_precision = []
        # iterating through the whole data and calculating the AP@k for each sample
        for i in range(len(actual)):
            average_precision.append(apk.apatk(actual[i], pred[i], k))

        # returning the mean over all the data
        return np.mean(average_precision)


    # defining the values of the actual and the predicted class
    y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
    y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

    if __name__ == "__main__":
        print(mapk(y_true, y_pred, 3))

    Running the above code, we get the output as follows.

    0.4305555555555556

    Here, the score is bad as the prediction set has many errors.

    Sampled F1 Score

    This metric calculates the F1 score for each instance in the data and then calculates the average of the F1 scores. We will be using sklearn’s implementation of the same in the code.

    The scikit-learn documentation describes the F1 score in detail. The values range between 0 and 1.

    We first convert the data into binary format and then perform f1 on the same. This gives us the required values.

    from sklearn.metrics import f1_score
    from sklearn.preprocessing import MultiLabelBinarizer


    def f1_sampled(actual, pred):
        # converting the multi-label classification to a binary output;
        # fit the binarizer once on all labels so actual and pred share the same columns
        mlb = MultiLabelBinarizer()
        mlb.fit(actual + pred)
        actual = mlb.transform(actual)
        pred = mlb.transform(pred)

        # calculating the per-sample (averaged) f1 score
        f1 = f1_score(actual, pred, average="samples")
        return f1


    # defining the values of the actual and the predicted class
    y_true = [[1, 2, 0, 1], [0, 4], [3], [1, 2]]
    y_pred = [[1, 1, 0, 1], [1, 4], [2], [1, 3]]

    if __name__ == "__main__":
        print(f1_sampled(y_true, y_pred))

    Running the code above produces the following output:

    0.45

    We know that the F1 score lies between 0 and 1 and here we got a score of 0.45. This is because the prediction set is bad. If we had a better prediction set, the value would be closer to 1.

    Hence, depending on the problem, we usually use Mean Average Precision at k, the Sampled F1 score, or Log Loss, and set up the metric for the problem accordingly.

    I would like to thank Abhishek for his book Approaching (Any) Machine Learning Problem without which this blog wouldn’t have been possible.

    Translated from: https://medium.com/analytics-vidhya/metrics-for-multi-label-classification-49cc5aeba1c3

  • Classification evaluation metrics (confusion matrix, binary and multi-class)

    I have been meaning to write an article on evaluation metrics to organize this part of my knowledge. I had thought about it so many times that I assumed I had already written it, but looking through my article list today there is nothing on the topic. Since I am currently building a multi-class model, this is a good opportunity to put it together.

    Confusion matrix

    Before introducing the various metrics, let's first introduce the confusion matrix, since essentially all evaluation metrics are computed from it.
    Each row of a confusion matrix represents the actual class of the data, and each column represents the predicted class.
    Below is the confusion matrix of a three-class problem:
    [Figure: 3x3 confusion matrix]
    Both binary and multi-class problems have confusion matrices. To make the terms in the metric definitions below easier to follow, we use the binary confusion matrix as the example.
    [Figure: 2x2 confusion matrix]

    • TP: True Positive, a positive sample classified as positive
    • FP: False Positive, a negative sample classified as positive
    • TN: True Negative, a negative sample classified as negative
    • FN: False Negative, a positive sample classified as negative

    Common metrics for binary classification

    1. Accuracy
      The proportion of correctly classified samples among all samples.
      Accuracy = (TP + TN) / (TP + FP + TN + FN)
      However, accuracy is a poor measure on imbalanced datasets. For example, if a binary dataset contains 90 samples of class 🐶 and 10 of class 🐱 and the model classifies every sample as 🐶, its accuracy is 90%, yet its actual classification performance is clearly poor.
      This motivates the following metrics.
    2. Precision (positive predictive value, PPV)
      The proportion of samples predicted positive that are actually positive.
      Precision = TP / (TP + FP)
    3. Recall (true positive rate, TPR, sensitivity)
      The proportion of actual positive samples that are predicted positive; in other words, the fraction of positives that the model finds.
      Recall = TP / (TP + FN)
    4. F1-score
      Recall and precision usually trade off against each other: when a model finds more of the positive samples, it also tends to classify more negative samples as positive, so high recall often comes with low precision and vice versa. The F1 score balances the two as their harmonic mean (a small numeric check follows this list).
      F1 = 2 * Precision * Recall / (Precision + Recall)
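
    A small numeric check (added for illustration) that plugs made-up confusion-matrix counts into the four formulas above:

    # hypothetical counts from a binary confusion matrix
    TP, FP, TN, FN = 40, 10, 45, 5

    accuracy  = (TP + TN) / (TP + FP + TN + FN)
    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * precision * recall / (precision + recall)

    print(accuracy, precision, recall, f1)
    # 0.85 0.8 0.888... 0.842...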

    Besides these four most common metrics, several others are worth knowing.

    1. Specificity (true negative rate, TNR)
      Analogous to recall, it is the proportion of negative samples that are predicted negative.
      TNR = TN / (TN + FP)
    2. False discovery rate (FDR)
      The proportion of samples predicted positive that are actually negative.
      FDR = FP / (FP + TP) = 1 - Precision
    3. Negative predictive value (NPV)
      The proportion of samples predicted negative that are actually negative.
      NPV = TN / (TN + FN)
    4. Kappa coefficient
      The kappa coefficient is a measure of agreement that can also be used to evaluate classification: for a classification problem, "agreement" means how consistent the model's predictions are with the actual classes. Kappa was also proposed because of the problems with accuracy, so it penalizes a model's "bias": in the kappa formula, kappa = (Po - Pe) / (1 - Pe), where Po is the observed accuracy and Pe the chance agreement, the more imbalanced the confusion matrix, the higher Pe and therefore the lower kappa, so strongly biased models get a low score (see the sketch after this list).
      Kappa ranges from -1 to 1 and is usually greater than 0. It is commonly split into five levels of agreement: 0.0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1 almost perfect.
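
    Scikit-learn exposes the kappa coefficient as cohen_kappa_score. A quick sketch (with made-up labels) of how it punishes a model that is biased toward the majority class:

    from sklearn.metrics import cohen_kappa_score

    y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class

    # accuracy would be 0.8 here, but kappa is 0.0, exposing the bias
    print(cohen_kappa_score(y_true, y_pred))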

    Because the ratio of positive to negative samples in a dataset may change over time (or with the threshold), and real datasets are often imbalanced, yet another metric was introduced.

    1. ROC curve / AUC (Area Under the Curve)
      The ROC curve (receiver operating characteristic curve) summarizes sensitivity and specificity as the decision threshold varies continuously; the larger the area under the curve, the higher the diagnostic accuracy.
    • x-axis: 1 - Specificity, i.e. the false positive rate (FPR): the proportion of all negative samples that are predicted positive;
    • y-axis: Sensitivity, i.e. the true positive rate (TPR): the proportion of all positive samples that are predicted positive.
      So how is this curve actually drawn?

      Suppose we have a set of samples, each with a score giving the predicted probability that it belongs to the positive class. Taking different thresholds on the score gives different classification results, and therefore different (FPR, TPR) pairs, each of which is one point on the ROC curve; sweeping the threshold traces out the whole curve (see the sketch below).
      Because it does not depend on a single threshold, the ROC curve can be used to evaluate a classifier as a whole, whereas the Precision-Recall curve changes considerably as the numbers of positive and negative samples change.
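
    The threshold sweep described above is what sklearn.metrics.roc_curve does; a small sketch with made-up scores:

    from sklearn.metrics import roc_curve, auc

    y_true  = [0, 0, 1, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]  # predicted probability of the positive class

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(thresholds)  # each threshold gives one (FPR, TPR) point on the curve
    print(fpr)
    print(tpr)
    print(auc(fpr, tpr))  # area under the ROC curve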

    Common metrics for multi-class classification

    The formulas above generally apply only to binary classification. To evaluate a multi-class model, the usual approach is to convert it into several binary problems, compute the metrics for each, and then aggregate them according to some rule.
    For example, if a multi-class model has labels A, B and C, treat it as three binary classifiers: classifier 1 separates A from not-A, classifier 2 separates B from not-B, and classifier 3 separates C from not-C. We already know how to evaluate each binary classifier; to judge the overall model we need to combine the predictive performance over the three classes.

    Three commonly used aggregation rules are listed below (a code sketch follows the list):

    1. Macro-average
      Average the metric over the individual binary classifiers. This method is strongly influenced by classes with few samples.
    2. Weighted-average
      Take a weighted average of the per-class metrics, with each class weighted by its share of the total samples. This method is strongly influenced by classes with many samples.
    3. Micro-average
      First sum the TP, FP and FN counts over all classes, then apply the binary formulas to the totals.
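
    A brief sketch (with made-up labels) of how the three rules are selected in scikit-learn through the average parameter:

    from sklearn.metrics import precision_score

    y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0]
    y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0]

    for avg in ("macro", "weighted", "micro"):
        print(avg, precision_score(y_true, y_pred, average=avg))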

  • SKlearn binary classification evaluation metrics


    The Metrics module of SKlearn provides many evaluation metrics for binary classification algorithms; here we discuss the most commonly used ones.
    [Figure: overview of ML evaluation metrics]

    1. Accuracy

    from sklearn.metrics import accuracy_score
    # signature: accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)
    

    1.1 Parameters
    y_true: the true labels of the data
    y_pred: the predicted labels of the data
    normalize: defaults to True, returning the proportion of correct predictions; if False, the number of correct predictions is returned
    sample_weight: sample weights
    Return value: score, the proportion or number of correct predictions, depending on normalize
    1.2 Mathematical form
    $accuracy(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} 1(y_i = \hat{y}_i)$

    1.3 Example

    import numpy as np
    import pandas as pd
    from sklearn.metrics import accuracy_score
    
    y_true = [1, 1, 0, 1, 0]
    y_pred = [1, 1, 1, 0, 0]
    
    score = accuracy_score(y_true, y_pred)
    print(score)
    # 0.6 -> the proportion of correct predictions
    
    score1 = accuracy_score(y_true, y_pred, normalize=False)
    print(score1)
    # 3 -> the number of correct predictions
    

    2. Confusion Matrix

    from sklearn.metrics import confusion_matrix
    # signature: confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
    

    2.1 Parameters
    y_true: the true labels, a 1-d array (the rows of the matrix)
    y_pred: the predicted labels, a 1-d array (the columns of the matrix)
    labels: optional; if not given, the sorted union of the values appearing in y_true and y_pred is used as the label list
    sample_weight: sample weights
    Return value: the confusion matrix
    2.2 Example

    import numpy as np
    from sklearn.metrics import confusion_matrix
    
    y_true = np.array([0, 1, 1, 1, 0, 0, 1, 2, 0, 1])
    y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 2, 0, 1])
    
    cm = confusion_matrix(y_true, y_pred)
    cm
    array([[3, 1, 0],
           [1, 4, 0],
           [0, 0, 1]], dtype=int64)
    

    The official confusion_matrix documentation provides a template for visualizing the confusion matrix, which we can simply copy. Documentation link: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

    import itertools
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import svm, datasets
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    def plot_confusion_matrix(cm, classes,
                              normalize=False,
                              title='Confusion matrix',
                              cmap=plt.cm.Blues):
        """
        This function prints and plots the confusion matrix.
        Normalization can be applied by setting `normalize=True`.
        """
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')
    
        print(cm)
    
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
    
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.tight_layout()
    
    #visualize the confusion matrix computed above
    plot_confusion_matrix(cm, classes=[0, 1, 2])
    

    As the figure shows, this view is much more intuitive: the 4 in the centre cell means that 4 samples of class 1 were predicted as class 1.
    [Figure: confusion matrix plot]
    3. Precision & Recall
    Precision measures how precise the positive predictions are; recall measures how completely the positives are found.
    Let's use an example to explain what recall means.
    Class 3.1 has 100 students, 80 boys and 20 girls. We want to select all the girls, and we pick 50 of the 100 students, of whom 20 are girls and 30 are boys.
    The four quantities behind recall:
    TP (True positive): predicted as positive and correct (a girl predicted as a girl), TP = 20.
    FP (False positive): predicted as positive but wrong (a boy predicted as a girl), FP = 30.
    TN (True negative): predicted as negative and correct (a boy predicted as a boy), TN = 50.
    FN (False negative): predicted as negative but wrong (a girl predicted as a boy), FN = 0.
    Recall = TP / (TP + FN). Recall measures how many of the actual positives the model predicts correctly. In this example, Recall = TP / (TP + FN) = 20 / (20 + 0) = 100%.
    Why do we need recall at all?
    Suppose we have 100 patient samples, 5 of whom have cancer (labelled 1) and the remaining 95 of whom are healthy (labelled 0). If an algorithm simply predicts 0 for every sample, 95 of its predictions are correct, so its accuracy is 95%, which looks quite high. But its accuracy on the cancer patients is 0, so the algorithm is useless. This is where recall shows its value: recall is computed over the 5 cancer patients, and if 3 of them are predicted correctly, recall = 60%. A quick numeric check follows.
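
    A sketch of that scenario (added for illustration), assuming 100 patients with 5 positives and a model that predicts 0 for everyone:

    from sklearn.metrics import accuracy_score, recall_score

    y_true = [1] * 5 + [0] * 95  # 5 cancer patients among 100 samples
    y_pred = [0] * 100           # the model predicts "healthy" for everyone

    print(accuracy_score(y_true, y_pred))  # 0.95 -- looks good
    print(recall_score(y_true, y_pred))    # 0.0  -- but no cancer patient is found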

    from sklearn.metrics import classification_report
    # signature: classification_report(y_true, y_pred, labels=None, target_names=None,
    #                                   sample_weight=None, digits=2, output_dict=False)
    

    3.1 Parameters
    y_true: the true labels, a 1-d array
    y_pred: the predicted labels, a 1-d array
    labels: optional; if not given, the sorted union of the values in y_true and y_pred is used
    sample_weight: sample weights
    target_names: display names for the rows, in the same order as the labels
    digits: integer, the number of decimal places
    output_dict: output format; defaults to False, and if True a dict is returned
    3.2 Example

    import numpy as np
    from sklearn.metrics import classification_report
    
    y_true = np.array([0, 1, 1, 0, 1, 2, 1])
    y_pred = np.array([0, 1, 0, 0, 1, 2, 1])
    target_names = ['class0', 'class1', 'class2']
    print(classification_report(y_true, y_pred, target_names=target_names))
    # output:
                  precision    recall  f1-score   support
    
          class0       0.67      1.00      0.80         2
          class1       1.00      0.75      0.86         4
          class2       1.00      1.00      1.00         1
    
       micro avg       0.86      0.86      0.86         7
       macro avg       0.89      0.92      0.89         7
    weighted avg       0.90      0.86      0.86         7
    
    

    Note: the f1-score combines recall and precision; its expression is f1-score = 2 * (precision * recall) / (precision + recall).
    4. ROC_AUC

    from sklearn.metrics import roc_auc_score
    # signature: roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
    

    4.1 Parameters
    y_true: the true labels, a 1-d array
    y_score: the predicted probability of the positive class
    average: several options are available; the default is usually fine
    sample_weight: sample weights
    max_fpr: a float in (0, 1]; if not None, the standardized partial AUC over the range [0, max_fpr] is returned
    4.2 The ROC curve
    As shown in the figure below, the x-axis is the false positive rate and the y-axis the true positive rate. We want the false positive rate to be as small as possible and the true positive rate as large as possible, i.e. we want the curve to bend toward the top-left corner. To summarize the ROC curve with a single number we use the area under it, the AUC (Area Under Curve), which takes values in [0, 1].
    [Figure: ROC curve]
    4.3 Example

    import numpy as np
    from sklearn.metrics import roc_auc_score
    
    y_true = np.array([0,1,1,0])
    y_score = np.array([0.85,0.78,0.69,0.54])
    print(roc_auc_score(y_true,y_score))
    0.5
    

    5. Summary
    Evaluating a machine learning algorithm is a very important step in solving a real problem; only with an accurate evaluation can we optimize the algorithm afterwards. Precision, recall, the confusion matrix and ROC/AUC described in this section are the most commonly used evaluation methods for binary classification. When solving real problems, a single metric often fails to reflect an algorithm's true performance, so several evaluation metrics should be combined when judging an algorithm.
