Data Security: Building a Data Classification and Grading System

    (First published 2020-05-01 18:57:15)

    1. The Significance of Data Classification and Grading

    Data classification and grading is critical to data security governance. A dataset's grade is a direct expression of its importance, and it underpins the drafting of the organization's internal management system, the implementation of the technical support system, and the sensible allocation of effort during operations (roughly 80% of attention on important data, 20% on ordinary data).

    Classification and grading links management (above) with technology (below). Upward: management artifacts such as operations procedures, safeguard measures, and role responsibilities should all be drafted against the classification and grading scheme (tying the management system to the scheme strengthens its enforceability). Downward: different data grades call for different protections; for example, high-grade data may require fine-grained rule-based control and encryption, while low-grade data may only need one-way auditing.

    In short, data classification and grading is the foundation for sound management planning, appropriate security controls, and efficient use of staff effort, and it is a key step toward fine-grained data security management.

    2. Architecture of a Data Classification and Grading System

    In today's market, data classification and grading is usually a module within a data asset management system. The typical approach is to discover sensitive data automatically and then grade it with human input (since classification and grading involves substantial subjective judgment). This helps practitioners find sensitive data quickly, but it handles subjective data poorly: the grading mechanisms are inflexible and cannot adapt to the grading needs of different organizations. Overall, existing systems fall short of what data classification and grading demands (largely because no industry standard exists). Most organizations instead rely on staff with combined industry, business, and security experience to sort data manually, which is accurate and effective but slow, lengthy, and without a normative basis.

    To narrow this gap and better support organizations' classification and grading needs, and drawing on our data security experience and understanding of classification and grading, we have sketched the characteristics and functional architecture such a system should have, in the hope of advancing data security governance.

    The characteristics a data classification and grading system should have are as follows:

    2.1 Support for both subjective and objective judgment

    Subjective and objective judgment mainly concern data sensitivity (confidentiality). Data within an organization is commonly split into objective and subjective data for grading purposes: the sensitivity of objective data can be identified directly (phone numbers, ID numbers, and so on), while other data requires subjective judgment.

    2.2 Sensitive data discovery

    Sensitive data discovery is the foundation of classification and grading and a precondition for objective judgment; for example, detecting phone numbers, ID numbers, social security card numbers, and bank account numbers, so that sensitive data inside the organization is found promptly.
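As a minimal sketch of what objective, rule-based discovery can look like (the rule names and patterns below are simplified illustrations, not a production rule set; real systems also validate checksums such as the ID-card check digit and handle many more formats):

```python
import re

# Hypothetical, simplified detection rules -- for illustration only.
SENSITIVE_RULES = {
    "phone": re.compile(r"\b1[3-9]\d{9}\b"),       # mainland-China mobile number
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),    # 18-digit citizen ID number
    "bank_account": re.compile(r"\b\d{16,19}\b"),  # coarse card-number pattern
}

def discover_sensitive(value):
    """Return the names of every rule whose pattern matches the field value."""
    return [name for name, pattern in SENSITIVE_RULES.items()
            if pattern.search(value)]

hits = discover_sensitive("contact: 13812345678")  # ["phone"]
```

In a full system these patterns would live in the rule engine described below and be dynamically extensible, per the requirement in section 2.4.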

    2.3 Mapping of the existing security environment

    Classification and grading must account for several data characteristics, including how controllable the data's security environment is. If an organization already operates a highly controlled security environment, grading delivers limited extra value. If protection capabilities are limited, the organization must consider how to use existing equipment (or a few new purchases) to deepen data protection selectively, reducing the combined cost of funding, staffing, and operations effort; in that situation classification and grading becomes especially important.

    2.4 Dynamic extensibility

    Dynamic extensibility is what lets the system adapt to different scenarios; it determines whether the system can serve different data forms and different classification and grading needs within an organization (a system that merely meets current needs without dynamic extensibility is a project-level deliverable, not a product). It covers dynamic extension of sensitive-data discovery rules, of metadata management, of user-defined standards, and more.

    2.5 Integration with upstream and downstream systems

    The point of classification and grading is not the classification itself but the fine-grained security control applied to the data afterwards, so the system must integrate with upstream and downstream systems (that is, it needs a rich set of interfaces). Upstream it can feed situational visualizations (data distribution, data flows, and so on) and asset applications; downstream it can drive data security controls (auditing, firewalls, masking, encryption, data loss prevention).

    2.6 System architecture design

    Based on these characteristics, the system's functions should include, at a minimum: rule management, metadata management, security mapping management, indicator management, classification and grading management, interface management, and lineage analysis. A simplified architecture, layer by layer, is as follows:

    Application layer: where the value of classification and grading is delivered, including asset management, situational awareness, and security management (auditing, firewalls, and so on). Systems in this layer use the classification and grading of different data for fine-grained operations; for example, a situational awareness system can show how high-grade data is requested, used, and distributed, while a security control system can form targeted protection policies.

    Application support layer: the functions the system itself must provide, including rule management, metadata management, indicator management, security mapping management, classification and grading management, interface management, and lineage analysis.

    Rule management: a rule engine that implements sensitive data discovery (for objective data), the combined execution rules of a scheme (standard), indicator evaluation rules, and so on.

    Metadata management: the system's foundational support function, for example enabling dynamic management of the various indicators in indicator management.

    Indicator management: the evaluation indicators for classification and grading, and the basic building blocks of scheme management.

    Security mapping management: a mapping of the existing security environment, for example crawling the network environment automatically via SNMP and deriving the security posture through rules.

    Data classification management: objective data is classified by the rule engine; subjective categories are classified with the help of machine learning to form a preliminary classification scheme, with human review as the final step.

    Data grading management: objective data is graded by the rule engine; subjective data is graded with the help of machine learning to form a preliminary grading scheme, with human review as the final step.

    Interface management: the sole channel to upstream and downstream applications, covering retrieval of classification and grading information, lookups of a datum's classification and grading result, and so on.

    Data layer: the underlying data that classification and grading operates on.

    That is the broad outline of building a data classification and grading system. With no industry standard and very few best practices, opinions will inevitably differ; if you have a better approach, we would welcome the exchange.


    Data Security Classification and Grading Implementation Guide

    Balance within the imbalance to balance what’s imbalanced — Amadou Jarou Bah

    Disclaimer: This is a comprehensive tutorial on handling imbalanced datasets. Whilst these approaches remain valid for multiclass classification, the main focus of this article will be on binary classification for simplicity.

    Introduction

    As any seasoned data scientist or statistician will be aware of, datasets are rarely distributed evenly across attributes of interest. Let’s imagine we are tasked with discovering fraudulent credit card transactions — naturally, the vast majority of these transactions will be legitimate, and only a very small proportion will be fraudulent. Similarly, if we are testing individuals for cancer, or for the presence of a virus (COVID-19 included), the positive rate will (hopefully) be only a small fraction of those tested. More examples include:

    • An e-commerce company predicting which users will buy items on their platform
    • A manufacturing company analyzing produced materials for defects
    • Spam email filtering trying to differentiate ‘ham’ from ‘spam’
    • Intrusion detection systems examining network traffic for malware signatures or atypical port activity
    • Companies predicting churn rates amongst their customers
    • Number of clients who closed a specific account in a bank or financial organization
    • Prediction of telecommunications equipment failures
    • Detection of oil spills from satellite images
    • Insurance risk modeling
    • Hardware fault detection

    One has usually much fewer datapoints from the adverse class. This is unfortunate as we care a lot about avoiding misclassifying elements of this class.

    In actual fact, it is pretty rare to have perfectly balanced data in classification tasks. Oftentimes the items we are interested in analyzing are inherently ‘rare’ events for the very reason that they are rare and hence difficult to predict. This presents a curious problem for aspiring data scientists since many data science programs do not properly address how to handle imbalanced datasets given their prevalence in industry.

    When does a dataset become ‘imbalanced’?

    The notion of an imbalanced dataset is a somewhat vague one. Generally, a dataset for binary classification with a 49–51 split between the two classes would not be considered imbalanced. However, if we have a dataset with a 90–10 split, it seems obvious to us that this is an imbalanced dataset. Clearly, the boundary for imbalanced data lies somewhere between these two extremes.

    In some sense, the term ‘imbalanced’ is a subjective one and it is left to the discretion of the data scientist. In general, a dataset is considered to be imbalanced when standard classification algorithms — which are inherently biased toward the majority class (further details in a previous article) — return suboptimal solutions. A data scientist may look at a 45–55 split dataset and judge that this is close enough that measures do not need to be taken to correct for the imbalance. However, the more imbalanced the dataset becomes, the greater the need to correct for it.
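As a rough sketch of this judgment call, one might compute the majority-to-minority ratio and compare it against a chosen cutoff (the function name and the 1.5 threshold below are illustrative assumptions, not a standard):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# A 90-10 split gives a ratio of 9.0; a 49-51 split gives roughly 1.04.
y = [0] * 90 + [1] * 10
ratio = imbalance_ratio(y)       # 9.0
needs_correction = ratio > 1.5   # the threshold is the data scientist's call
```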

    In a concept-learning problem, the data set is said to present a class imbalance if it contains many more examples of one class than the other.

    As a result, these classifiers tend to ignore small classes while concentrating on classifying the large ones accurately.

    Imagine you are working for Netflix and are tasked with determining which customers will churn (a customer ‘churning’ means they will stop using your services or products).

    In an ideal world (at least for the data scientist), our training and testing datasets would be close to fully balanced, having around 50% of the dataset containing individuals that will churn and 50% who will not. In this case, a 90% accuracy will more or less indicate a 90% accuracy on both the positively and negatively classed groups. Our errors will be evenly split across both groups. In addition, we have roughly the same number of points in both classes, which from the law of large numbers tells us reduces the overall variance in the class. This is great for us, accuracy is an informative metric in this situation and we can continue with our analysis unimpeded.

    [Figure: A dataset with an even 50–50 split across the binary response variable. There is no majority class in this example.]

    As you may have suspected, most people that already pay for Netflix don't have a 50% chance of stopping their subscription every month. In fact, the percentage of people that will churn is rather small, closer to a 90–10 split. How does the presence of this dataset imbalance complicate matters?

    Assuming a 90–10 split, we now have a very different data story to tell. Giving this data to an algorithm without any further consideration will likely result in an accuracy close to 90%. This seems pretty good, right? It’s about the same as what we got previously. If you try putting this model into production your boss will probably not be so happy.

    [Figure: An imbalanced dataset with a 90–10 split. False positives will be much larger than false negatives. Variance in the minority set will be larger due to fewer data points. The majority class will dominate algorithmic predictions without any correction for imbalance.]

    Given the prevalence of the majority class (the 90% class), our algorithm will likely regress to a prediction of the majority class. The algorithm can pretty closely maximize its accuracy (our scoring metric of choice) by arbitrarily predicting that the majority class occurs every time. This is a trivial result and provides close to zero predictive power.
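This trivial behaviour is easy to reproduce with scikit-learn's DummyClassifier, which always predicts the majority class (the 90–10 synthetic data below is purely illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))        # features carry no signal at all
y = np.array([0] * 900 + [1] * 100)   # 90-10 class split

# Always predicting the majority class scores 90% accuracy...
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, dummy.predict(X))   # 0.9
# ...while having zero predictive power on the minority class.
```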

    [Figure: (Left) A balanced dataset with the same number of items in the positive and negative class; the number of false positives and false negatives in this scenario is roughly equivalent, resulting in little classification bias. (Right) An imbalanced dataset with around 5% of samples in the negative class and 95% in the positive class (this could be the number of people paying for Netflix who decide to quit during the next payment cycle).]

    Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly.

    Visually, this dataset might look something like this:

    Machine learning algorithms by default assume that data is balanced. In classification, this corresponds to a comparative number of instances of each class. Classifiers learn better from a balanced distribution. It is up to the data scientist to correct for imbalances, which can be done in multiple ways.

    Different Types of Imbalance

    We have clearly shown that imbalanced datasets have some additional challenges to standard datasets. To further complicate matters, there are different types of imbalance that can occur in a dataset.

    (1) Between-Class

    A between-class imbalance occurs when there is an imbalance in the number of data points contained within each class. An example of this is shown below:

    [Figure: An illustration of between-class imbalance. We have a large number of data points for the red class but relatively few for the white class.]

    An example of this would be a mammography dataset, which uses images known as mammograms to predict breast cancer. Consider the number of mammograms related to positive and negative cancer diagnoses:

    [Figure: The vast majority of samples (>90%) are negative, whilst relatively few (<10%) are positive.]

    Note that given enough data samples in both classes the accuracy will improve as the sampling distribution is more representative of the data distribution, but by virtue of the law of large numbers, the majority class will have inherently better representation than the minority class.

    (2) Within-Class

    A within-class imbalance occurs when the dataset has balanced between-class data but one of the classes is poorly represented in some regions. An example of this is shown below:

    [Figure: An illustration of within-class imbalance. We have a large number of data points for both classes, but the data points of the white class in the top left corner are very sparse, which can cause complications similar to between-class imbalance for predictions in those regions.]

    (3) Intrinsic and Extrinsic

    An intrinsic imbalance is due to the nature of the dataset, while extrinsic imbalance is related to time, storage, and other factors that limit the dataset or the data analysis. Intrinsic characteristics are relatively simple and are what we commonly see, but extrinsic imbalance can exist separately and can also work to increase the imbalance of a dataset.

    For example, companies often use intrusion detection systems that analyze packets of data sent in and out of networks in order to detect malware or malicious activity. Depending on whether you analyze all data or just data sent through specific ports or specific devices, this will significantly influence the imbalance of the dataset (most network traffic is likely legitimate). Similarly, if log files or data packets related to suspected malicious behavior are commonly stored but normal logs are not (or only a select few types are stored), then this can also influence the imbalance of the dataset. Likewise, if logs are only stored during a normal working day (say, 9 AM–5 PM) instead of 24 hours, this will also affect the imbalance.

    Further Complications of Imbalance

    There are a couple more difficulties introduced by imbalanced datasets. Firstly, we have class overlapping. This is not always a problem, but it often arises in imbalanced learning problems and causes headaches. Class overlapping is illustrated in the below dataset.

    [Figure: Example of class overlapping. Some of the positive data points (stars) are intermixed with the negative data points (circles), which would lead an algorithm to construct an imperfect decision boundary.]

    Class overlapping occurs in normal classification problems, so what is the additional issue here? Well, the class more represented in overlap regions tends to be better classified by methods based on global learning (on the full dataset). This is because the algorithm is able to get a more informed picture of the data distribution of the majority class.

    In contrast, the class less represented in such regions tends to be better classified by local methods. If we take k-NN as an example, as the value of k increases it becomes increasingly global and less local. It can be shown that low values of k give better performance on the minority class, with performance dropping at high values of k. This shift in accuracy is not exhibited for the majority class because it is well represented at all points.

    This suggests that local methods may be better suited for studying the minority class. One method to correct for this is the CBO Method. The CBO Method uses cluster-based resampling to identify ‘rare’ cases and resample them individually, so as to avoid the creation of small disjuncts in the learned hypothesis. This is a method of oversampling — a topic that we will discuss in detail in the following section.

    [Figure: The CBO Method. Once the training examples of each class have been clustered, oversampling starts. In the majority class, all clusters except the largest are randomly oversampled so as to contain the same number of training examples as the largest cluster.]

    Correcting Dataset Imbalance

    There are several techniques to control for dataset imbalance. There are two main types of techniques to handle imbalanced datasets: sampling methods, and cost-sensitive methods.

    The simplest and most commonly used of these are sampling methods called oversampling and undersampling, which we will go into more detail on.

    Oversampling/Undersampling

    Simply stated, oversampling involves generating new data points for the minority class, and undersampling involves removing data points from the majority class. This acts to somewhat reduce the extent of the imbalance in the dataset.

    What does undersampling look like? We continually remove like-samples in close proximity until both classes have the same number of data points.

    [Figure: Undersampling. Imagine you are analysing a dataset for fraudulent transactions. Most of the transactions are not fraudulent, creating a fundamentally imbalanced dataset. Under undersampling, we take fewer samples from the majority class to help reduce the extent of this imbalance.]

    Is undersampling a good idea? Undersampling is recommended by many statistical researchers but is only good if enough data points are available on the undersampled class. Also, since the majority class will end up with the same number of points as the minority class, the statistical properties of the distributions will become ‘looser’ in a sense. However, we have not artificially distorted the data distribution with this method by adding in artificial data points.

    [Figure: Illustration of undersampling. Like-samples in close proximity are removed in an attempt to increase the sparsity of the data distribution.]
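A bare-bones random undersampler can be sketched in a few lines of NumPy (illustrative only; libraries such as imbalanced-learn ship a production-ready RandomUnderSampler, and the function name here is an assumption):

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop rows from the larger classes until every class is the
    same size as the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
X_res, y_res = random_undersample(X, y)   # 10 samples per class
```

Note that this drops rows at random rather than targeting like-samples in close proximity; distance-aware variants refine that idea.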

    What does oversampling look like? In short, the opposite of undersampling. We artificially add data points to our dataset to make the number of instances in each class balanced.

    [Figure: Oversampling. Under oversampling, we oversample from the minority class to help reduce the extent of this imbalance.]

    How do we generate these samples? The most common way is to generate points that are close in dataspace proximity to existing samples or are ‘between’ two samples, as illustrated below.

    [Figure: Illustration of oversampling.]

    As you may have suspected, there are some downsides to adding false data points. Firstly, you risk overfitting, especially if one does this for points that are noise — you end up exacerbating this noise by adding reinforced measurements. In addition, adding these values randomly can also contribute additional noise to our model.

    SMOTE (Synthetic minority oversampling technique)

    Luckily for us, we don’t have to write an algorithm for randomly generating data points for the purpose of oversampling. Instead, we can use the SMOTE algorithm.

    How does SMOTE work? SMOTE generates new samples in between existing data points based on their local density and their borders with the other class. Not only does it perform oversampling, but can subsequently use cleaning techniques (undersampling, more on this shortly) to remove redundancy in the end. Below is an illustration for how SMOTE works when studying class data.

    [Figure: An illustration of how SMOTE functions. The instance on the left is isolated and is thus considered noise by the algorithm; no additional data points are generated in its proximity, or, if they are, they will be in very close proximity to the singular point. The two clusters in the center and right have several data points, indicating that these points are less likely to correspond to random noise. Thus, a larger cluster (empirical data distribution) can be drawn by the algorithm, from which additional samples can be generated.]

    The algorithm for SMOTE is as follows. For each minority sample:

    – Find its k-nearest minority neighbours

    – Randomly select j of these neighbours

    – Randomly generate synthetic samples along the lines joining the minority sample and its j selected neighbours (j depends on the amount of oversampling desired)
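The steps above can be sketched directly in NumPy (a deliberately bare-bones illustration under simplifying assumptions; the function name is hypothetical, and real implementations such as imbalanced-learn's SMOTE handle sampling ratios and edge cases):

```python
import numpy as np

def smote_sample(minority, k=3, n_new=10, seed=0):
    """Generate synthetic samples by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip x itself at index 0
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()                        # random point on the segment
        synthetic.append(x + gap * (neighbour - x))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sample(minority, k=2, n_new=5)
```

Every synthetic point lies on a segment between two existing minority samples, which is the core of the technique.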

    Informed vs. Random Oversampling

    Using random oversampling (with replacement) of the minority class has the effect of making the decision region for the minority class very specific. In a decision tree, it would cause a new split and often lead to overfitting. SMOTE’s informed oversampling generalizes the decision region for the minority class; as a result, larger and less specific regions are learned, paying attention to minority class samples without causing overfitting.

    Drawbacks of SMOTE

    Overgeneralization. SMOTE’s procedure can be dangerous since it blindly generalizes the minority area without regard to the majority class. This strategy is particularly problematic in the case of highly skewed class distributions since, in such cases, the minority class is very sparse with respect to the majority class, thus resulting in a greater chance of class mixture.

    Inflexibility. The number of synthetic samples generated by SMOTE is fixed in advance, thus not allowing for any flexibility in the re-balancing rate.

    Another potential issue is that SMOTE might introduce the artificial minority class examples too deeply in the majority class space. This drawback can be resolved by hybridization: combining SMOTE with undersampling algorithms. One of the most famous of these is Tomek Links. Tomek Links are pairs of instances of opposite classes who are their own nearest neighbors. In other words, they are pairs of opposing instances that are very close together.

    Tomek’s algorithm looks for such pairs and removes the majority instance of the pair. The idea is to clarify the border between the minority and majority classes, making the minority region(s) more distinct. Scikit-learn has no built-in modules for doing this, though there are some independent packages (e.g., TomekLink, imbalanced-learn).
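A brute-force sketch of finding Tomek links (illustrative only; the helper name is an assumption, and an undersampler would then delete the majority-class member of each pair found):

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class points that are
    each other's nearest neighbour -- the definition of a Tomek link."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # a point is not its own neighbour
    nn = d.argmin(axis=1)         # nearest neighbour of every point
    return [(i, j) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < j]

X = np.array([[0.0], [0.1], [5.0], [6.0]])
y = np.array([0, 1, 0, 0])
pairs = tomek_links(X, y)   # [(0, 1)]: the close opposite-class pair
```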

    Thus, Tomek’s algorithm is an undersampling technique that acts as a data cleaning method for SMOTE to regulate against redundancy. As you may have suspected, there are many additional undersampling techniques that can be combined with SMOTE to perform the same function. A comprehensive list of these functions can be found in the functions section of the imbalanced-learn documentation.

    An additional example is Edited Nearest Neighbours (ENN). ENN removes any example whose class label differs from the class of at least two of its neighbours. ENN removes more examples than Tomek links do and can also remove examples from both classes.

    Other more nuanced versions of SMOTE include Borderline SMOTE, SVMSMOTE, and KMeansSMOTE, and more nuanced versions of the undersampling techniques applied in concert with SMOTE are Condensed Nearest Neighbor (CNN), Repeated Edited Nearest Neighbor, and Instance Hardness Threshold.

    Cost-Sensitive Learning

    We have discussed sampling techniques and are now ready to discuss cost-sensitive learning. In many ways, the two approaches are analogous — the main difference being that in cost-sensitive learning we perform under- and over-sampling by altering the relative weighting of individual samples.

    Upweighting. Upweighting is analogous to over-sampling and works by increasing the weight of one of the classes while keeping the weight of the other class at one.

    Down-weighting. Down-weighting is analogous to under-sampling and works by decreasing the weight of one of the classes while keeping the weight of the other class at one.

    An example of how this can be performed using sklearn is via the sklearn.utils.class_weight module; the computed weights can be applied to any sklearn classifier (and within keras).

    import numpy as np
    from sklearn.utils import class_weight

    weights = class_weight.compute_class_weight(
        class_weight='balanced', classes=np.unique(y_train), y=y_train)
    model.fit(X_train, y_train, class_weight=dict(zip(np.unique(y_train), weights)))

    In this case, we have set the instances to be ‘balanced’, meaning that we will treat these instances to have balanced weighting based on their relative number of points — this is what I would recommend unless you have a good reason for setting the values yourself. If you have three classes and wanted to weight one of them 10x larger and another 20x larger (because there are 10x and 20x fewer of these points in the dataset than the majority class), then we can rewrite this as:

    class_weight = {0: 1.,
                    1: 10.,
                    2: 20.}

    Some authors claim that cost-sensitive learning is slightly more effective than random or directed over- or under-sampling, although all approaches are helpful, and directed oversampling is close to cost-sensitive learning in efficacy. Personally, when I am working on a machine learning problem I will use cost-sensitive learning because it is much simpler to implement and communicate. However, there may be additional aspects of sampling techniques that provide superior results of which I am not aware.

    Assessment Metrics

    In this section, I outline several metrics that can be used to analyze the performance of a classifier trained to solve a binary classification problem. These include (1) the confusion matrix, (2) binary classification metrics, (3) the receiver operating characteristic curve, and (4) the precision-recall curve.

    Confusion Matrix

    Despite what you may have garnered from its name, a confusion matrix is not actually that confusing. A confusion matrix is the most basic form of assessment of a binary classifier. Given the prediction outputs of our classifier and the true response variable, a confusion matrix tells us how many of our predictions are correct for each class, and how many are incorrect. The confusion matrix provides a simple visualization of the performance of a classifier based on these factors.

    Here is an example of a confusion matrix:

    [Figure: an example confusion matrix with cells TP, FN, FP, TN.]

    Hopefully what this is showing is relatively clear. The TP cell tells us the number of true positives: the number of positive samples that I predicted were positive.

    The TN cell tells us the number of true negatives: the number of negative samples that I predicted were negative.

    The FP cell tells us the number of false positives: the number of negative samples that I predicted were positive.

    The FN cell tells us the number of false negatives: the number of positive samples that I predicted were negative.

    These numbers are very important as they form the basis of the binary classification metrics discussed next.
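For instance, scikit-learn's confusion_matrix returns these four counts directly; note its convention that rows are true labels and columns are predictions, so for binary labels the flattened order is TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# For labels [0, 1] the matrix layout is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # 4, 1, 1, 2
```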

    Binary Classification Metrics

    There are a plethora of single-value metrics for binary classification. As such, only a few of the most commonly used ones and their different formulations are presented here; more details can be found on scoring metrics in the sklearn documentation, including their relation to confusion matrices and ROC curves (discussed in the next section).

    Arguably the most important five metrics for binary classification are: (1) precision, (2) recall, (3) F1 score, (4) accuracy, and (5) specificity.

    Precision. Precision provides us with the answer to the question “Of all my positive predictions, what proportion of them are correct?”. If you have an algorithm that predicts all of the positive class correctly but also has a large portion of false positives, the precision will be small. It makes sense why this is called precision since it is a measure of how ‘precise’ our predictions are.

    Recall. Recall provides us with the answer to a different question “Of all of the positive samples, what proportion did I predict correctly?”. Instead of false positives, we are now interested in false negatives. These are items that our algorithm missed, and are often the most egregious errors (e.g. failing to diagnose something with cancer that actually has cancer, failing to discover malware when it is present, or failing to spot a defective item). The name ‘recall’ also makes sense for this circumstance as we are seeing how many of the samples the algorithm was able to pick up on.

    It should be clear that these questions, whilst related, are substantially different to each other. It is possible to have a very high precision and simultaneously have a low recall, and vice versa. For example, if you predicted the majority class every time, you would have 100% recall on the majority class, but you would then get a lot of false positives from the minority class.

    One other important point to make is that precision and recall can be determined for each individual class. That is, we can talk about the precision of class A, or the precision of class B, and they will have different values — when doing this, we assume that the class we are interested in is the positive class, regardless of its numeric value.

    F1 Score. The F1 score is a single-value metric that combines precision and recall by using the harmonic mean (a fancy type of averaging). Its generalization, the Fβ score, has a strictly positive β parameter that describes the relative importance of recall to precision. A larger β value puts a higher emphasis on recall than precision, whilst a smaller value puts less emphasis. If the value is 1, precision and recall are treated with equal weighting, recovering the F1 score.

    What does a high F1 score mean? It suggests that both the precision and recall have high values; this is good and is what you would hope to see upon generating a well-functioning classification model on an imbalanced dataset. A low value indicates that either precision or recall is low, and may be a cause for concern. Good F1 scores are generally lower than good accuracies (in many situations, an F1 score of 0.5 would be considered pretty good, such as predicting breast cancer from mammograms).
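    A minimal sketch of how these metrics fall out of the four confusion-matrix counts (the counts below are purely illustrative, not from any real model):

    ```python
    # Precision, recall and F1 from illustrative confusion-matrix counts.
    def precision_recall_f1(tp, fp, fn):
        precision = tp / (tp + fp)  # of my positive predictions, how many were right?
        recall = tp / (tp + fn)     # of the actual positives, how many did I find?
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
        return precision, recall, f1

    print(precision_recall_f1(tp=3, fp=1, fn=1))  # (0.75, 0.75, 0.75)
    ```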

    Specificity. Simply stated, specificity is the recall of negative values. It answers the question “Of all of the negative samples, what proportion did I predict correctly?”. This may be important in situations where examining the relative proportion of false positives is necessary.

    Macro, Micro, and Weighted Scores

    This is where things get a little complicated. Anyone who has delved into these metrics on sklearn may have noticed that we can refer to the recall-macro or f1-weighted score.

    A macro-F1 score is the average of F1 scores across each class.

    This is most useful if we have many classes and we are interested in the average F1 score for each class. If you only care about the F1 score for one class, you probably won’t need a macro-F1 score.

    A micro-F1 score takes all of the true positives, false positives, and false negatives from all the classes and calculates the F1 score.

    The micro-F1 score is pretty similar in utility to the macro-F1 score as it gives an aggregate performance of a classifier over multiple classes. That being said, they will give different results, and understanding the underlying difference between those results may be informative for a given application.

    A weighted-F1 score is the same as the macro-F1 score, but each of the class-specific F1 scores is scaled by the relative number of samples from that class.

    In this case, N refers to the proportion of samples in the dataset belonging to a single class. For class A, where class A is the majority class, this might be equal to 0.8 (80%). The values for B and C might be 0.15 and 0.05, respectively.

    For a highly imbalanced dataset, a large weighted-F1 score might be somewhat misleading because it is overly influenced by the majority class.
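    The three averaging schemes can be sketched with hypothetical per-class (TP, FP, FN) counts, where class A is the majority class; sklearn's f1_score exposes the same choices through its average parameter:

    ```python
    def f1(tp, fp, fn):
        return 2 * tp / (2 * tp + fp + fn)

    # Hypothetical per-class confusion counts (tp, fp, fn); A is the majority class.
    counts = {"A": (80, 10, 10), "B": (10, 5, 5), "C": (2, 3, 3)}
    support = {k: c[0] + c[2] for k, c in counts.items()}  # true samples per class
    total = sum(support.values())

    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # plain average over classes
    micro = f1(sum(c[0] for c in counts.values()),              # pool all counts first
               sum(c[1] for c in counts.values()),
               sum(c[2] for c in counts.values()))
    weighted = sum(support[k] / total * f1(*counts[k]) for k in counts)  # support-scaled

    print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.652 0.836 0.836
    ```

    Note how the majority class drags the weighted (and micro) scores well above the macro score; that these two coincide here is just an artifact of the made-up counts.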

    Other Metrics

    Some other metrics that you may see around that can be informative for binary classification (and multiclass classification to some extent) are:

    Accuracy. If you are reading this, I would imagine you are already familiar with accuracy, but perhaps not so familiar with the others. Cast in the light of a confusion matrix, the accuracy can be described as the ratio of true predictions (positive and negative) to the total number of samples.

    G-Mean. A less common metric that is somewhat analogous to the F1 score is the G-Mean. This is often cast in two different formulations, the first being the precision-recall g-mean, and the second being the sensitivity-specificity g-mean. They can be used in a similar manner to the F1 score in terms of analyzing algorithmic performance. The precision-recall g-mean can also be referred to as the Fowlkes-Mallows Index.

    There are many other metrics that can be used, but most have specialized use cases and offer little additional utility over the metrics described here. Other metrics the reader may be interested in viewing are balanced accuracy, Matthews correlation coefficient, markedness, and informedness.

    Receiver Operating Characteristic (ROC) Curve

    An ROC curve is a two-dimensional graph that depicts the trade-off between benefits (true positives) and costs (false positives). It displays the relation between sensitivity (the true positive rate) and 1 - specificity (the false positive rate) for a given classifier whose score or classification threshold is varied.

    Here is an example of an ROC curve.

    (figure: example ROC curve)

    There is a lot to unpack here. Firstly, the dotted line through the center corresponds to a classifier that acts as a ‘coin flip’. That is, it is correct roughly 50% of the time and is the worst possible classifier (we are just guessing). This acts as our baseline, against which we can compare all other classifiers — these classifiers should be closer to the top left corner of the plot since we want high true positive rates in all cases.

    It should be noted that an ROC curve does not assess a group of classifiers. Rather, it examines a single classifier over a set of classification thresholds.

    What does this mean? It means that for one point, I take my classifier and set the threshold to be 0.3 (30% propensity) and then assess the true positive and false positive rates.

    True Positive Rate: the proportion of positive samples identified correctly (true positives over the sum of true positives and false negatives) for a specific combination of classifier and classification threshold.

    False Positive Rate: the proportion of negative samples incorrectly flagged as positive (false positives over the sum of false positives and true negatives) for a specific combination of classifier and classification threshold.

    This gives me two numbers, which I can then plot on the curve. I then take another threshold, say 0.4, and repeat this process. After doing this for every threshold of interest (perhaps in 0.1, 0.01, or 0.001 increments), we have constructed an ROC curve for this classifier.
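    The sweep just described can be sketched in a few lines (the scores below are made up; sklearn's roc_curve picks the thresholds for you):

    ```python
    # Build ROC points by sweeping a decision threshold over classifier scores.
    def roc_points(y_true, scores, thresholds):
        pos = sum(y_true)
        neg = len(y_true) - pos
        pts = []
        for th in thresholds:
            tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= th)
            fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= th)
            pts.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
        return pts

    y_true = [1, 1, 0, 0, 1, 0]              # toy labels
    scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]  # toy predicted propensities
    print(roc_points(y_true, scores, [0.5, 0.2]))
    ```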

    (Figure) An example ROC curve showing how an individual point is plotted. A classifier is selected along with a classification threshold. Following this, the true positive rate and false positive rate for this combination of classifier and threshold are calculated and subsequently plotted.

    What is the point of doing this? Depending on your application, you may be very averse to false positives as they may be very costly (e.g. launches of nuclear missiles) and thus would like a classifier that has a very low false-positive rate. Conversely, you may not care so much about having a high false positive rate as long as you get a high true positive rate (stopping most fraud events may be worth it even if you have to check many more occurrences that the algorithm flags as suspicious). For the optimal balance between these two ratios (where false positives and false negatives are equally costly), we would take the classification threshold which results in the minimum diagonal distance from the top left corner.

    Why does the top left corner correspond to the ideal classifier? The ideal point on the ROC curve is (0, 1): a false positive rate of 0% and a true positive rate of 100%, meaning all positive examples are classified correctly and no negative examples are misclassified as positive. In a perfect classifier, there would be no misclassification!

    Whilst a graph may not seem pretty useful in itself, it is helpful in comparing classifiers. One particular metric, the Area Under Curve (AUC) score, allows us to compare classifiers by comparing the total area underneath the line produced on the ROC curve. For an ideal classifier, the AUC equals 1, since the curve spans the entire unit square: a 100% (1.0) true positive rate across the full range of false positive rates. If a particular classifier has an AUC of 0.6 and another has an AUC of 0.8, the latter is clearly a better classifier. The AUC has the benefit that it is independent of the decision criteria (the classification threshold) and thus makes it easier to compare these classifiers.
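    One way to sketch the AUC without integrating the curve at all is its rank interpretation: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties count half). The toy scores here are made up:

    ```python
    # AUC via its rank interpretation: P(score of random positive > score of random negative).
    def auc(y_true, scores):
        pos = [s for t, s in zip(y_true, scores) if t == 1]
        neg = [s for t, s in zip(y_true, scores) if t == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    y_true = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
    print(round(auc(y_true, scores), 3))  # 0.778
    ```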

    A question may have come to mind now — what if some classifiers are better at lower thresholds and some are better at higher thresholds? This is where the ROC convex hull comes in. The convex hull provides us with a method of identifying potentially optimal classifiers — even though we may not have directly observed them, we can infer their existence. Consider the following diagram:

    (Figure. Source: Quora)
    Given a family of ROC curves, the ROC convex hull can include points that are more towards the top left corner (perfect classifier) of the ROC space. If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true positive intercept. Thus, the classifier at that point is optimal under any distribution assumptions in tandem with that slope. This is perhaps easier to understand after examining the image.

    How does undersampling/oversampling influence the ROC curve? A famous paper on SMOTE (discussed previously) titled “SMOTE: Synthetic Minority Over-sampling Technique” outlines that by undersampling the majority class, we force the ROC curve to move up and to the right, and thus has the potential to increase the AUC of a given classifier (this is essentially just validation that SMOTE functions correctly, as expected). Similarly, oversampling the minority class has a similar impact.

    (Figure. Source: Researchgate)
    Precision-Recall (PR) Curves

    An analogous diagram to an ROC curve can be recast from ROC space and reformulated into PR space. These diagrams are in many ways analogous to the ROC curve, but instead of plotting recall against fallout (true positive rate vs. false positive rate), we are instead plotting precision against recall. This produces a somewhat mirror-image (the curve itself will look somewhat different) of the ROC curve in the sense that the top right corner of a PR curve designates the ideal classifier. This can often be more understandable than an ROC curve but provides very similar information. The area under a PR curve is the average precision (AP; its mean over classes is often called mAP) and is analogous to the AUC in ROC space.
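    PR points come from the same kind of threshold sweep as in ROC space, except each threshold now yields a (recall, precision) pair; sklearn's precision_recall_curve and average_precision_score cover this directly. A sketch on made-up scores:

    ```python
    # Build precision-recall points by sweeping a decision threshold.
    def pr_points(y_true, scores, thresholds):
        pos = sum(y_true)
        pts = []
        for th in thresholds:
            tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= th)
            fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= th)
            if tp + fp:  # precision is undefined when nothing is predicted positive
                pts.append((tp / pos, tp / (tp + fp)))  # (recall, precision)
        return pts

    y_true = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
    print(pr_points(y_true, scores, [0.8, 0.5, 0.2]))
    ```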

    (Figure. Source: Researchgate — Ten quick tips for machine learning in computational biology)
    Final Comments

    Imbalanced datasets are underrepresented (no pun intended) in many data science programs contrary to their prevalence and importance in many industrial machine learning applications. It is the job of the data scientist to be able to recognize when a dataset is imbalanced and follow procedures and utilize metrics that allow this imbalance to be sufficiently understood and controlled.

    I hope that in the course of reading this article you have learned something about dealing with imbalanced datasets and will in the future be comfortable in the face of such imbalanced problems. If you are a serious data scientist, it is only a matter of time before one of these applications will pop up!

    Translated from: https://towardsdatascience.com/guide-to-classification-on-imbalanced-datasets-d6653aa5fa23

    Implementation Guide to Data Security Classification and Grading


    On August 30, 2019, Information Security Technology - Data Security Capability Maturity Model (GB/T 37988-2019), DSMM (Data Security Maturity Model) for short, was officially released as a national standard, and it took effect in March 2020.

    DSMM evaluates capability at different levels across the stages of the data lifecycle: data collection security, data transmission security, data storage security, data processing security, data exchange security, and data destruction security. It assesses four capability dimensions: organizational development, institutional processes, technical tools, and personnel capability. DSMM divides data security maturity into five levels (informally performed, planned and tracked, well defined, quantitatively controlled, and continuously improving), forming a three-dimensional model for building data security capability in all respects.

    Three-dimensional model of data classification and grading

     

    On this basis, DSMM further subdivides the six lifecycle stages into 30 process areas, distributed across the six stages, with some process areas spanning the entire data lifecycle.

    Data lifecycle security process areas

     

    With the publication of the Data Security Law of the People's Republic of China (draft), DSMM may well become a concrete implementation standard and yardstick for the law; for Chinese enterprises, basing data security governance and product selection on DSMM makes regulatory compliance easier to achieve.

    This series takes the DSMM data security governance approach as its foundation and, for each of the process areas above, offers practical recommendations from the perspective of the well-defined level (Level 3). As the opening article, this piece covers the data classification and grading process area (PA01) of the data collection security stage.

     

    01 Definition

     

    At the well-defined level, the DSMM standard's requirements for data classification and grading are as follows:

    Organizational development

    The organization should establish a management post and personnel responsible for data classification and grading, mainly responsible for defining the organization's overall security principles for data classification and grading (BP.01.04).

    Institutional processes

    1) The principles, methods, and operational guidelines for data classification and grading should be defined (BP.01.05);

    2) The organization's data should be labeled and managed by category and grade (BP.01.06);

    3) Corresponding security management and control measures, such as access control, data encryption and decryption, and data masking, should be established for data of different categories and grades (BP.01.07);

    4) An approval process and mechanism for changes to data classification and grading should be defined, ensuring that change operations and their results meet the organization's requirements (BP.01.08).

    Technical tools

    A data classification and grading labeling tool or data asset management tool should be established, providing automatic labeling of data by category and grade, publication of labeling results, review, and related functions (BP.01.09).

    Personnel capability

    Personnel responsible for this work should understand the compliance requirements of data classification and grading and be able to identify which data is sensitive (BP.01.10).

     

    02 Practice Guide

     

    Organizational development

    Where conditions permit, an organization should set up a data classification and grading department and recruit staff to take charge of the company's overall classification and grading work: defining the organization-wide security principles and operational guidelines, driving their implementation, establishing an approval mechanism for classification and grading, labeling and managing data that has been classified and graded, masking identified sensitive data, and auditing and logging important classification and grading operations.

    Personnel capability

    Personnel in data classification and grading roles need good data-security risk awareness, familiarity with national cybersecurity laws and regulations and with the policies and regulatory requirements of the organization's industry, and must collect data in strict accordance with the Cybersecurity Law, the Personal Information Security Specification, and other relevant national laws and industry norms. They also need a solid grounding in classification and grading: knowing the company's data asset scope and organizational structure, and accurately identifying which data is sensitive. In addition, they should be familiar with the compliance requirements for classification and grading, proficient in data security measures, and experienced in writing standardized processes and policies, so that they can produce classification and grading principles, operational guidelines, management policies, and inventories that fit the company's real environment, and drive their actual implementation.

    Verifying implementation

    Whether the organizational and personnel requirements are actually implemented can be verified through internal and external audits, using interviews, questionnaires, process observation, document review, technical testing, and other means.

    Institutional processes

    1) Principles of data classification and grading

    Classification and grading should fit the organization's actual situation and clarified requirements, take data attributes as its basis, and follow the principles of being scientific, stable, practical, and extensible.

    ❖ Scientific: classify and grade data scientifically and systematically according to its multidimensional characteristics and the objective logical relations among them;

    ❖ Stable: base the classification and grading scheme on the most stable characteristics and attributes of the data;

    ❖ Practical: ensure every category actually contains data; do not create meaningless categories;

    ❖ Extensible: the scheme should be general and inclusive overall, able to cover every type of data, including types that may appear in the future.

    2) Classification and grading methods and details

    ❖ Common classification methods: by relation, e.g. based on business (source), content, or regulation.

    ❖ Common grading methods: by characteristic, e.g. based on value (public, internal, important/core, etc.), sensitivity (public, secret, confidential, top secret, etc.), or jurisdictional scope (within mainland China, cross-region, cross-border, etc.).

    ❖ Common public data categories: important data, personal and enterprise information, and business data. (Important data is data whose leakage could endanger national security or public interests, endanger life and property, harm critical national infrastructure, disrupt market order, or allow state secrets to be inferred.)

    ❖ Personal and enterprise information includes direct personal information: information recorded electronically or otherwise that, alone or combined with other information, can identify a natural person or an enterprise.

    ❖ Business data includes: storable data generated by enterprises or public organizations in operating activities, routine social administration, transaction processing, and similar activities.

    Based on the public categories above, the corresponding grades are as follows:

    Important data grades

     

    Personal and enterprise information grades

     

    Business data grades

     

    Based on the public classification and grading strategies above, and on their own business and compliance needs, enterprises can design their own methods: establish their own classification and grading principles, classify data by importance, and then, on top of that classification, grade the data by the impact and loss the organization would suffer if the data's security were compromised.

    Example of enterprise-defined classification and grading

     

    After classification and grading, targeted protection requirements must be defined: different access permissions, encrypted storage and transmission for important data, masking for sensitive data, and audit logging and analysis of important operations.

    3) Change review

    Classification and grading work requires a clear review and approval mechanism for its content and procedures, ensuring the work complies with the organization's principles and policies. In principle, data that has already been classified and graded may only have its grade raised, never lowered (to prevent leaks), and approval requires multi-person control involving the data owner, the classification and grading manager, and the executive manager.

    4) Overview of technical tools

    The prerequisite for a classification and grading tool is that the organization's methods and strategies, i.e. the classification and grading rules, have been determined. Technically, classification and grading starts from data discovery. Data comes in two types: structured data, such as business data and databases; and unstructured data, such as commercial documents, financial statements, and contracts. Content-recognition techniques such as label libraries, keywords, regular expressions, natural language processing, data mining, and machine learning are used to classify the data; based on the classification results, sensitive data is identified by label, ultimately yielding the grading.
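    As a minimal illustration of the regular-expression side of such content recognition, the sketch below flags two common sensitive-data patterns. Both patterns are deliberately simplified assumptions, not production rules; real tools add checksum validation, keyword context, and the NLP/ML techniques mentioned above.

    ```python
    import re

    # Deliberately simplified, illustrative patterns -- not production-grade rules.
    SENSITIVE_PATTERNS = {
        "phone": re.compile(r"\b1[3-9]\d{9}\b"),     # mainland-China mobile number (assumed format)
        "id_card": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-digit resident ID number (assumed format)
    }

    def classify_field(text):
        """Return the set of sensitive-data labels detected in a text field."""
        return {label for label, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)}

    print(classify_field("联系人 13912345678, 证件号 11010519900307891X"))  # {'phone', 'id_card'} (set order may vary)
    ```

    A real deployment would layer such regex hits under the label library and trained classifiers described below.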

    Techniques by metadata type:

    Content-aware classification determines a category by automatically analyzing the content of unstructured data, using regular expressions, exact matching, partial or full fingerprinting, machine learning, and so on.

    Context-aware classification is based on specific attribute types and uses broad contextual attributes; it applies to data at rest (e.g. by storage path or other file metadata), data in use (e.g. data created by a CAD application), and data in transit (by IP).

    Techniques by application scenario:

    Based on the classification and grading rules, build a label library, train a classifier with machine learning, apply the trained classifier to the document collection awaiting classification to obtain classification results, and label them automatically.

    Space does not permit a fuller treatment of the tools here; the figure below shows the basic workflow a classification and grading tool follows.

    Basic workflow of classification and grading

     

    This article is reproduced from the official account of Hangzhou Meichuang Technology Co., Ltd.; for further reproduction, contact the Meichuang service hotline 4008113777.


    0. Preface

    In October and November last year I took part in DataFountain's competition on intelligent discovery and classification/grading of data content for data security governance (https://www.datafountain.cn/competitions/471), finishing 7th on leaderboard A and 10th on leaderboard B. This post records that experience.

    Code is on github, stars welcome~

    1. Background

    As enterprises' informatization level keeps improving, data sharing and openness play an ever more prominent role in their development, and data has become one of the key factors of production. Activities such as products and services, marketing support, business operations, risk control, information disclosure, and analytical decision-making involve large volumes of business data, which may contain trade secrets, work secrets, and employees' private information. Improper use leading to data leakage can cause huge economic losses, or serious social, legal, credit, and brand damage. Meanwhile, on the compliance side, the state has in recent years issued a dense series of laws around data security, including the Cybersecurity Law, the Civil Code, the Data Security Law (draft), and the Personal Information Protection Law (draft), emphasizing at the national legal level the protection of sensitive data in critical infrastructure and all kinds of apps. To protect enterprise sensitive data effectively and consistently, the first problem is to classify and grade the data so as to identify what is sensitive, and then to carry out open, dynamic data security governance across the full lifecycle of the protected objects, resolving the tension between open data sharing and data privacy protection.
    Sensitive-data identification and classification/grading already widely use NLP-based semantic recognition, but the following problems remain:
    1. Large volumes of high-quality labeled data are needed, consuming substantial manpower and time, so construction costs are high.
    2. Generalization is insufficient: adaptation to new business data is weak, and false positive and false negative rates for sensitive data are high.
    3. Such systems cannot self-optimize or self-learn; business and technical domain experts must intervene manually, making them hard to build.

    2. Task

    Identify sensitive data in the samples and build a classification and grading model based on a sensitive-data ontology to determine each document's category and level.
    1. Use distant supervision to build a document classification and grading sample library from a small number of samples.
    2. Combine state-of-the-art deep learning and machine learning with the constructed sample library to extract semantic features of the text and build a document classification and grading model that generalizes well and can self-learn.

    3. Data description

    (1) Labeled data: 7,000 documents in 7 categories (finance, real estate, home, education, technology, fashion, politics), 1,000 documents per category.
    (2) Unlabeled data: 33,000 documents.
    (3) Classification/grading test data: 20,000 documents in 10 categories: finance, real estate, home, education, technology, fashion, politics, games, entertainment, sports.

    The competition provides the following files: labeled data labeled_data.csv, unlabeled data unlabeled_data.csv, and classification/grading test data test_data.csv.
    (1) Labeled data labeled_data.csv

    Field | Type | Description
    id | String | data ID
    class_label | String | category of the text
    content | String | text content

    (2) Unlabeled data unlabeled_data.csv

    Field | Type | Description
    id | String | data ID
    content | String | text content

    (3) Classification/grading test data test_data.csv

    Field | Type | Description
    id | String | data ID
    content | String | text content

    (4) Level information
    Assume the following mapping between document category and level:

    Document category | Document level
    finance (财经), politics (时政) | high risk (高风险)
    real estate (房产), technology (科技) | medium risk (中风险)
    education (教育), fashion (时尚), games (游戏) | low risk (低风险)
    home (家居), sports (体育), entertainment (娱乐) | publicly releasable (可公开)

    External datasets were not allowed in this competition~

    In short, the competition is a text classification problem; but the training set covers 7 categories while the test set has 10, so part of the unlabeled data (33,000 documents) must first be selected by unsupervised methods as samples of the other 3 categories and added to the training set.
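    Given the fixed category-to-level table in the data description above, the grading step reduces to a dictionary lookup once a document's category has been predicted (a sketch; the classifier producing the category is assumed):

    ```python
    # Category -> level mapping, exactly as given in the competition's level table.
    LEVEL_BY_CATEGORY = {
        "财经": "高风险", "时政": "高风险",                    # finance, politics -> high risk
        "房产": "中风险", "科技": "中风险",                    # real estate, technology -> medium risk
        "教育": "低风险", "时尚": "低风险", "游戏": "低风险",  # education, fashion, games -> low risk
        "家居": "可公开", "体育": "可公开", "娱乐": "可公开",  # home, sports, entertainment -> public
    }

    def grade(category):
        return LEVEL_BY_CATEGORY[category]

    print(grade("科技"))  # 中风险
    ```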

    4. Unsupervised text classification

    Since the provided training set has 7,000 documents in 7 categories, the other 3 categories (games, entertainment, sports) first have to be separated out of the unlabeled set by unsupervised means.

    I used LDA text clustering: cluster the 33,000 unlabeled documents into 10 clusters, then randomly draw 1,000 documents from each of the clusters corresponding to games, entertainment, and sports, forming a 10,000-document training set together with the original data.

    5. Methods

    1. Choosing the sequence length

      The texts are long news articles, while BERT limits the input length; since a news story's key information is usually concentrated at the beginning or the end, overly long texts are truncated from both ends.

      if len(text) > 512:
          text = text[:256] + text[-256:]
      
    2. Models tried

      chinese_roberta_wwm_large_ext_pytorch scored highest (a large model on long texts; GPU-hungry).

    3. Dropout selection

      During training, BERT is followed by a dropout layer. To speed up training we used multi-sample dropout: the dropout layer after BERT is sampled multiple times and the losses of the multiple outputs are averaged, which stabilizes the dropout layer and lets the fully connected layer after BERT get more training relative to the BERT part itself.

      A dropout rate of 0.9 worked best.

    4. Adversarial training on text (https://zhuanlan.zhihu.com/p/91269728)

      Tried PGD and FGM; PGD worked best.

    5. Loss weights

      During the competition we found the test-set classes to be roughly balanced, so we tuned the loss weights according to the predicted class distribution to keep predictions roughly balanced; this was a key source of score gains.

      loss_fn = nn.CrossEntropyLoss(weight=torch.from_numpy(np.array(args.weight_list)).float())
      
    6. Model ensembling

      Several models were built by combining the outputs of BERT's last few layers in different ways, then ensembled by simple voting.

      Examples include BertLastFourCls, BertLastFourEmbedding, BertRCNN, BertLastFourClsPooler, BertLastTwoEmbeddingsPooler, and so on; see the code for details.

    Final scores: 0.91007184 on leaderboard A, 0.91033015 on leaderboard B.

    6. Other tricks tried

    1. EDA text augmentation (https://github.com/jasonwei20/eda_nlp) (no improvement).

    2. A semi-supervised idea: predict on the test set and train on those predictions together with the original data (no improvement).

    3. F1 optimization (unstable).

    4. Pre-training with THUCnews (abandoned because the organizers prohibited external data); training only on part of THUCnews and predicting on the test set scored just 0.5+.

      Pre-training reference: https://github.com/zhusleep/pytorch_chinese_lm_pretrain

    7. Top teams' solutions

    To be updated~
