Decision Trees are among the most interpretable models and can perform both classification and regression tasks. As the name suggests, a Decision Tree is a tree-like model that resembles an upside-down tree. At this point, you might wonder: we already have classical machine learning models like linear regression and logistic regression for regression and classification, so why do we need another model like the Decision Tree? The answer is that classical linear models require the training data to be free from irregularities: missing values and outliers need to be handled, multicollinearity needs to be addressed, and a whole lot of preprocessing must be done beforehand. With Decision Trees, we need not perform any such preprocessing; they are robust enough to handle all these problems and still reach a decision. Decision Trees are also capable of handling nonlinear data that classical linear models fail to capture. Hence Decision Trees are versatile enough to perform both regression and classification tasks. The full set of advantages and disadvantages associated with Decision Trees is discussed in the latter part of this article. Before that, let's start understanding Decision Trees.
Decision Trees build the tree by asking the data a series of questions until a decision is reached. Hence it is said that Decision Trees mimic the human decision-making process. During the tree-building process, the algorithm divides the entire data into subsets until it reaches a decision. Let's go through a few terminologies associated with Decision Trees to understand them better.
Few Terminologies in Decision Trees:
Root Node: The topmost node of the tree is the Root Node. All the data is present at this Root Node. The arrows in a decision tree generally point away from the Root Node.
Leaf Node or Terminal Node: If a particular node cannot be split further, it is considered a Leaf Node (also called a Terminal Node). The decisions or predictions are held by the Leaf Nodes. The arrows in a decision tree generally point toward the Leaf Nodes.
Internal Node or Decision Node: The nodes between the root node and the leaf node are said to be internal nodes. These nodes can be split further into sub-nodes.
Please refer to the below image for a better understanding of the above-mentioned terminology.
Decision Trees are said to be highly interpretable because of their tree-like structure. To interpret a Decision Tree, we traverse down the tree, satisfying the conditions associated with the nodes, until we reach a decision. The term decision refers to a prediction: a class label if performing a classification task, or a value if performing a regression task. Upon interpreting the Decision Tree, we get to know which features lead to a particular decision, though we will not be able to interpret the linear relation between a feature variable and the target variable or its directional effect. If interpretability of the model is of major concern, the Decision Tree would be at the top of the list. On the whole, we can think of interpreting a Decision Tree as a set of logical IF-THEN-ELSE statements, where the AND operator connects the condition at a particular node with the conditions of its ancestor nodes.
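As an illustration of this IF-THEN-ELSE reading, the sketch below fits a small tree and prints its rules. It assumes scikit-learn (the article's text does not name a library); `export_text` renders each root-to-leaf path as nested conditions that are implicitly joined with AND.

```python
# Sketch: reading a fitted Decision Tree as nested IF/ELSE rules.
# Assumes scikit-learn is available; any tree library with a text
# exporter works similarly.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a condition; following one branch down ANDs the
# node's condition with those of all its ancestors, ending in a class label.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Traversing any single branch of the printed output, top to bottom, reproduces exactly the "condition AND condition AND ... THEN decision" interpretation described above.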
Having understood how to interpret a Decision Tree, let's next understand the high-level tree-building process.
The steps involved in the tree-building process are as follows:
1. Recursively partition the data into multiple subsets.
2. At each node, identify the variable and the rule associated with the variable for the best split.
3. Apply the split at that node using the best variable and the rule defined for it.
4. Repeat steps 2 and 3 on the sub-nodes.
5. Repeat this process until a stopping condition is reached.
6. Assign the decisions at the leaf nodes: the majority class label present at the node if performing a classification task, or the average of the target-variable values present at the leaf node if performing a regression task.
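The steps above can be sketched in plain Python. This is a deliberately minimal, hypothetical illustration (helper names are invented, the split score is simple misclassification count, and the stopping condition is just purity or a depth limit), not a production algorithm.

```python
# Minimal sketch of recursive partitioning: split on the (feature, threshold)
# pair with the fewest misclassifications, recurse on each side, and stop at
# pure nodes or a depth limit. Illustrative only.
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, depth=0, max_depth=2):
    # Step 5/6: stopping condition -> leaf holding the majority class label
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": majority_label(labels)}
    best = None
    # Steps 2-3: try every value of every feature as a candidate rule
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i in range(len(rows)) if i not in set(left)]
            if not left or not right:
                continue
            # score a split by misclassifications under a majority vote
            err = sum(labels[i] != majority_label([labels[j] for j in left]) for i in left)
            err += sum(labels[i] != majority_label([labels[j] for j in right]) for i in right)
            if best is None or err < best[0]:
                best = (err, f, t, left, right)
    if best is None:
        return {"leaf": majority_label(labels)}
    _, f, t, left, right = best
    # Step 4: repeat the process on each sub-node
    return {"feature": f, "threshold": t,
            "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
            "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth)}

# Toy data: feature 0 cleanly separates the two classes at value 3.0
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = ["a", "a", "a", "b", "b", "b"]
tree = build_tree(X, y)
print(tree)
```

On this toy data the root split lands at threshold 3.0 and both children are pure, so the recursion stops immediately with two leaf nodes.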
There exist different tree-building algorithms like CART, CHAID, ID3, C4.5, C5.0, etc. The criterion used to select the feature that provides the best split differs between algorithms: the CART algorithm uses the Gini Index impurity measure, ID3 uses Information Gain, C4.5 uses the Gain Ratio, and so on for the other algorithms. But the overall tree-building procedure remains the same as mentioned above.
At this point, you might have questions like: how do we select the feature that provides the best split? How do we define the rule associated with that feature? And finally, what is the stopping condition? These questions will be answered in the latter part of this article.
A few things to note about Decision Tree building: Decision Trees follow a top-down approach in building the tree and are also said to follow a greedy approach. It is greedy because, at every split of a node, the tree is concerned only with the immediate result of that split; it does not take into consideration the effect of the split two or three nodes further down. One important implication of the greedy approach is that it makes Decision Trees high-variance models, meaning a small change in the input data can result in a completely different tree structure and final decisions.
Having gained a high-level understanding of Decision Trees and their building process, let's address one by one the questions that came up along the way.
How to select features which provide us with the best split at a node?
Before addressing this question, we need some understanding of the homogeneity associated with a node in a classification setting; the same notion can be extrapolated to a regression setting. As the name suggests, homogeneity refers to something of the same kind, and the same definition extends to Decision Trees: when performing a classification activity, a particular node is said to be homogenous if the class labels associated with the node belong to a single class. When performing a regression activity, we instead speak in terms of the variance associated with a node.
In classification, the term best split refers to obtaining sub-nodes (child nodes) that are as homogenous as possible upon splitting a parent node: ideally, the class labels of the target variable for the data points present at each sub-node all belong to a single class. In regression, the term best split refers to obtaining low-variance sub-nodes upon splitting the parent node; computing the Mean Square Error tells us about the variance associated with the data points present at a node. Let's now focus on how to identify the feature, and the rule associated with it, whose split of a node results in the best split.
In a classification activity, the feature selected for the best split is the one that creates the largest difference in impurity (the largest purity gain) upon splitting; equivalently, the one that best separates the class labels of the target variable and yields sub-nodes that are as homogenous as possible. At this point, we might ask: how do we quantify the homogeneity of a node? We quantify it using impurity measures; some of the popular metrics are Classification Error, Gini Index, and Entropy. Since these metrics account for the impurity of a node, the lower the metric value, the higher the homogeneity of the node. Let's look at the impurity measures in detail.
Impurity measures to measure Homogeneity at a node:
Classification Error is the error made in assigning class labels to the data points based on the majority class label. Upon computing the probability associated with each class label, we identify the majority class label and assign it to all the data points at the node; the misclassification comes from assigning the majority class label to the data points of the lower-probability classes.
The Gini Index accounts for the chance of a random data point being misclassified. The value varies between 0 and 0.5. The lower the Gini Index, the lesser the chance that a random data point will be misclassified, which helps in assigning decisions or outcomes to leaf nodes without any ambiguity. If all the data points belong to one single class label, the Gini Index value will be 0, as the data points are completely homogenous with a single class label associated with them. Similarly, if there is an equal class distribution among the data points, the Gini Index will be at its maximum of 0.5, as there is complete ambiguity in the class labels and the data points are highly non-homogenous.
On the other hand, Entropy, which is derived from thermodynamics and Information Theory, accounts for the degree of disorder present in the data points at a node. The lower the entropy value, the lower the disorder present in the class labels of the target variable at that node. The value ranges between 0 and 1. If all the data points belong to one single class label, the Entropy value will be at its minimum of 0, as there is the least disorder in the class labels of the target variable. Similarly, if there is an equal distribution of class labels among the data points, the Entropy value will be at its maximum of 1, as there is complete disorder in the class labels of the target variable and the data is completely non-homogenous.
The formulas to compute the Classification Error, Gini Index, and Entropy are as follows:

Classification Error = 1 − max(pᵢ)
Gini Index = 1 − Σ pᵢ²
Entropy = − Σ pᵢ log₂(pᵢ)

where pᵢ corresponds to the probability of a data point belonging to the i-th class label and the sums run over the k different class labels.
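These standard formulas can be sketched in a few lines of plain Python, computing each measure from the class-label counts at a node:

```python
# Sketch: the three impurity measures computed from the class
# probabilities p_i at a node (standard formulas, plain Python).
import math

def class_probs(labels):
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]

def classification_error(labels):
    return 1 - max(class_probs(labels))

def gini_index(labels):
    return 1 - sum(p * p for p in class_probs(labels))

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

pure = ["a"] * 6               # completely homogenous node
mixed = ["a"] * 3 + ["b"] * 3  # equal class distribution: maximum impurity

print(gini_index(pure))    # 0.0
print(gini_index(mixed))   # 0.5
print(entropy(mixed))      # 1.0
```

The two extreme cases match the ranges stated above: a pure node scores 0 on every measure, while a 50/50 node hits the maxima of 0.5 (Gini) and 1 (Entropy).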
Gini Index and Entropy are more widely used in computing the homogeneity of a node than Classification Error, as Classification Error is not as sensitive as the other metrics in identifying the homogeneity of a node. Numerically, Gini Index and Entropy are similar to each other: when we compute the Scaled Entropy, which is Entropy/2, the curves of Scaled Entropy and Gini Index almost touch each other. For better understanding, please refer to the below image:
Hence, the feature which provides the best split is the one that results in sub-nodes with a low value of one of the impurity measures; equivalently, the one that creates the maximum difference between the impurity measures calculated before and after the split, i.e., the maximum purity gain. To compute this difference, we first compute the impurity measure of the node before splitting, then compute the weighted average of the impurity measures of the sub-nodes obtained by splitting the node with some feature and its associated rule. The difference between the impurity measures before and after the split is the purity gain. We aim for a higher purity gain, as it results in more homogenous sub-nodes, so among all the features we select the one whose split creates the highest purity gain.
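The purity-gain computation just described, parent impurity minus the weighted average of the child impurities, can be sketched as follows (Gini is used here; any impurity measure works the same way):

```python
# Sketch of purity gain: impurity of the parent node minus the weighted
# average impurity of the sub-nodes produced by a split (Gini Index here).
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def purity_gain(parent, children):
    n = len(parent)
    # weight each child's impurity by its share of the parent's data points
    weighted = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4            # Gini = 0.5
perfect = [["yes"] * 4, ["no"] * 4]          # perfectly homogenous children
useless = [["yes"] * 2 + ["no"] * 2] * 2     # children as mixed as the parent

print(purity_gain(parent, perfect))   # 0.5 -> the best possible split
print(purity_gain(parent, useless))   # 0.0 -> no purity gained at all
```

A split into pure children realizes the entire parent impurity as gain, while a split whose children mirror the parent's class mix gains nothing, which is exactly why the feature with the highest purity gain is chosen.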
The same thought process extends to the regression setting, in which we select the feature that splits the node into the lowest-variance sub-nodes.
Let's move forward to our next question: how to identify the rule associated with a feature to split a node.
How to identify rule associated with a feature to split a node?
To answer this question we will restrict our discussion to the CART tree building algorithm and explore different methods in defining the rule to a feature in order to split a node.
When building the tree using the CART algorithm, every node is split into two parts, meaning it performs a binary split at every node. Another specification of the CART algorithm is that every node split is univariate: at every node, only one rule associated with one feature is used to split the node; multiple rules associated with different features are never combined. To perform multi-way splits, there are other popular algorithms like ID3, C4.5, C5.0, and CHAID.
Let's come back to our question of how to define a rule, associated with a feature, that provides the best split. If the predictor variable is a nominal categorical variable with k classes, the possible number of binary splits is (2^(k-1) - 1); among all possible splits, the one which results in the most homogenous sub-nodes is chosen. If the predictor variable is an ordinal categorical variable with n classes, the possible number of splits is (n - 1); again, the split resulting in the most homogenous sub-nodes is chosen. If the predictor variable is continuous or numerical, discretization techniques are used to derive the rule. One such technique is to arrange the values in ascending order and evaluate a split at each value, finally taking the one that provides the best split. Other discretization methods, such as considering the mean or percentiles, can likewise define the rule associated with a numerical feature.
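The ascending-order discretization for a numerical feature can be sketched as below: sort the values, try every candidate threshold, and keep the one yielding the lowest weighted Gini impurity in the two resulting sub-nodes (the toy data is invented for illustration):

```python
# Sketch of discretizing a numerical feature: evaluate a binary split
# "value < t" at every observed value t, scored by weighted Gini impurity.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    pairs = sorted(zip(values, labels))          # arrange in ascending order
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        t = pairs[i][0]                          # candidate rule: value < t
        left = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[0]:
            best = (score, t)
    return best[1]

# Hypothetical data: customers under ~46 did not buy, the rest did
ages = [22, 25, 47, 52, 46, 56]
bought = ["no", "no", "yes", "yes", "yes", "yes"]
print(best_threshold(ages, bought))   # 46, which separates the classes cleanly
```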
Now, among all the features with their respective best rules, the one feature that generates the greatest purity gain is finally selected to split the node.
Having understood the tree-building process and the concepts surrounding it, let's answer our last question: what is the stopping condition?
What is the stopping condition?
In general, the tree-building process will continue until all the features have been exhausted for splitting the nodes, or until every leaf node contains only a minimal number of data points from the training data set.
Upon allowing the tree to grow to its complete logical end, the tree becomes a high-variance model, meaning it overfits the training data by memorizing every data point present in it. Once it becomes a high-variance model, a small change in the training data will alter the complete tree structure, and as a result, all the decisions associated with the leaf nodes might change. Hence such trees cannot be trusted to make any decisions.
Stopping conditions can also be defined by assigning values to the hyperparameters via hyperparameter tuning. Before going further, let's understand what hyperparameters and hyperparameter tuning are. Hyperparameters are values defined by the modeler during the model-building process: the learning algorithm takes into consideration these hyperparameters passed by the modeler before producing the final model, and the model is not capable of identifying them implicitly during model building. To find the optimum hyperparameters, we perform hyperparameter tuning. Some of the hyperparameters which control the tree growth are max_depth, max_features, min_samples_leaf, min_samples_split, criterion, etc. By defining these hyperparameters, we can keep the tree from overfitting. Let's look at the methods to control overfitting of trees.
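The hyperparameter names above are scikit-learn's, so a tuning sketch with its GridSearchCV fits naturally. The parameter grid here is illustrative, not a recommendation:

```python
# Sketch: tuning tree-growth hyperparameters with scikit-learn's
# GridSearchCV (library assumed; grid values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {
    "max_depth": [2, 3, 4],           # limits how deep the tree can grow
    "min_samples_leaf": [1, 5],       # minimum data points at a leaf node
    "criterion": ["gini", "entropy"], # impurity measure used for splits
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # the combination with the best CV accuracy
```

The fitted `search.best_estimator_` is a tree whose growth was stopped by the winning hyperparameter values rather than by running to its full logical end.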
Methods to control Overfitting of trees:
Decision Trees have a high chance of overfitting the training data, as a result of which they become high-variance models. To avoid overfitting in trees, we follow one of these methods:
- Tree Truncation or Pre-Pruning Strategies
- Post-Pruning Strategies
Tree Truncation: Also called Pre-Pruning Strategies, as we keep the tree from overfitting during the model building itself. One naive tree truncation strategy, in the case of a classification activity, is to define a threshold homogeneity value to compare against the homogeneity of every node before splitting: if the homogeneity of the node is less than the threshold, we split the node further; if it is greater than the threshold, we convert the node into a leaf node. Another truncation strategy is to define hyperparameters during model building, as described above, to keep the tree from overfitting.
Post-Pruning: In Post-Pruning, we allow the tree to grow to its complete logical end and then perform pruning from the bottom of the tree. Some of the popular post-pruning methods are Reduced Error Pruning and Cost Complexity Pruning. We keep pruning nodes as long as doing so causes no reduction in purity gain.
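Cost Complexity Pruning can be sketched with scikit-learn (assumed here, as elsewhere in this article): grow the full tree first, then refit with a non-zero `ccp_alpha` taken from the tree's own pruning path to cut it back:

```python
# Sketch of Cost Complexity Pruning with scikit-learn: grow the tree to
# its complete logical end, then prune back using a positive ccp_alpha.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The pruning path lists the effective alphas at which subtrees collapse.
path = full_tree.cost_complexity_pruning_path(X, y)

# Refit with a mid-range alpha from the path: the pruned tree is smaller.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
print(full_tree.tree_.node_count, ">", pruned.tree_.node_count)
```

In practice the alpha would be chosen by cross-validation over `path.ccp_alphas`, not picked from the middle of the path as in this sketch.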
Generally, Tree Truncation strategies are preferred over Post-Pruning strategies, as Post-Pruning is an overkill process: the tree is grown to its full extent first and only then cut back.
We have answered all our questions that were lingering in our minds during the tree building process. Let’s finish off this article by discussing the advantages and disadvantages of Decision Trees.
Advantages of Decision Trees:
Versatile: Decision Trees can be used for building both classification and regression models.
Fast: Once the hyperparameters have been defined via hyperparameter tuning, the Decision Tree building process is significantly fast.
Minimal Data Preprocessing: Not much data preprocessing is needed, such as scaling, outlier treatment, etc.
Easily Interpretable: A Decision Tree is like a flow chart that can be easily interpreted without any mathematical understanding.
Able to handle non-linear relationships: If there exists a non-linear relationship between a predictor variable and the target variable, a Decision Tree can capture it by segmenting the data into smaller subsets and then assigning a single decision to each subset via a leaf node.
Handles Multicollinearity: Decision Trees can handle multicollinearity by considering only one among a set of multicorrelated features for node splitting, as there is no benefit in also considering the other multicorrelated features.
Non-parametric Model: Decision Trees do not make any assumptions about the distributions of the predictor variables or the errors associated with the predictions. Hence they are said to be non-parametric models.
Feature Importance: Feature importances can be obtained upon building a Decision Tree, from which we can know the significant features that made a significant contribution to the predictions of the target variable. Knowing these feature importances, we can perform model-based dimensionality reduction by considering only the significant features.
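This can be sketched with scikit-learn's `feature_importances_` attribute (assumed, consistent with the hyperparameter names used earlier); the importances sum to 1 across the features:

```python
# Sketch: reading feature importances from a fitted Decision Tree and
# listing them per feature (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Importances reflect each feature's total contribution to purity gain.
for name, score in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name}: {score:.3f}")
```

Features with near-zero importance are candidates to drop in the model-based dimensionality reduction described above.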
Disadvantages of Decision Trees:
Loss of Inference: Using Decision Trees we can get to know the decision associated with a data point, and also the factors leading to the particular decision held by a leaf node. But we will not know the linear relationship between the predictor variables and the target variable, and as a result we cannot make any inferences about the population.
Loss of the numerical nature of a variable: When a numerical variable is used in tree building, an entire subset of its range is assigned a single prediction value, and as a result some of the information present in the numerical variable is lost.
Overfitting: If the tree is allowed to grow to its complete logical end, it overfits the training data. Though overfitting can be controlled by performing the above-mentioned methods, it is an inherent problem of Decision Trees if left uncontrolled.