If we think about what a Machine Learning model does, we can see that its main job is to find the rules governing the relationship between input and output. Once those rules are found, the idea is to apply them to new data and predict the corresponding output.
Hence, since predictions are the final goal of an ML algorithm, it is pivotal for the model to generalize properly rather than adapt too closely to the data it was trained on.
In this article, we are going to examine the different options you have when training an ML model.
Train and evaluate the model on the whole dataset
Needless to say, this first approach leads to a biased result. If we evaluate the model on the very same dataset it was trained on, we will probably run into overfitting, which happens whenever the model is too closely adapted to the training data. In the evaluation phase, we will probably get a very high score, yet it is not the score we are looking for: it most likely derives from the fact that the algorithm learnt the patterns of that specific dataset and their associated outputs. Hence, while returning the correct output, the model is not exploiting a general rule; it is just reproducing the patterns it has observed.
How is it supposed to work on new, never-before-seen data? Of course, it will not be reliable. We need to improve our training and evaluation phases.
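As a minimal sketch of this pitfall (using scikit-learn and a synthetic dataset purely for illustration; neither is prescribed here), an unpruned decision tree can memorize its training data and report a perfect score when evaluated on the same data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset; a fully grown decision tree can memorize it completely.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X, y)

# Evaluating on the very same data the model trained on:
# the score is perfect, but it only reflects memorization.
print(model.score(X, y))  # 1.0 for an unpruned tree
```

The perfect training score tells us nothing about how the model would behave on data it has never seen.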
Splitting data into training and test set
With this approach, we keep apart one portion of the dataset and train the model on the remaining portion. By doing so, we are left with a small set of data, called the test set, which the model has never seen before; hence it is a more reliable benchmark for evaluation purposes. Indeed, if we evaluate the model on the test set and obtain a great score, we can be more confident that the model generalizes well.
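A quick sketch of this hold-out strategy, again assuming scikit-learn (the `train_test_split` helper and the toy dataset are illustrative choices, not mandated by the article):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Hold out 25% of the data that the model will never see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print(model.score(X_train, y_train))  # training score, near-perfect
print(model.score(X_test, y_test))    # test score, a more honest estimate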
However, there is still one caveat in this method. There are countless possible train-test combinations, yet we have experimented with only one. How do we know that the particular split we obtained is representative? With a different composition of the train and test sets, we might get a very different result.
To bypass this problem, we can introduce the concept of cross-validation.
The idea of cross-validation arises because of the caveat explained above. Its purpose is to guarantee that the score of our model does not depend on the way we picked the train and test sets.
It works as follows. The dataset is split into K folds; the model is then trained on K-1 folds and tested on the remaining one, for K iterations. Because the test fold rotates each time, the model is trained and tested on a new composition of the data at every iteration.
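The rotation described above can be sketched with scikit-learn's `KFold` and `cross_val_score` (the estimator and K = 5 are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# K = 5: train on 4 folds, test on the held-out fold, rotate 5 times.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # one score per fold
print(scores.mean())  # average over the 5 rotations
```

Reporting the mean (and spread) of the fold scores removes the dependence on any single lucky or unlucky split.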
Are we now confident that our model will perform well on new data? Is it generalized enough? Well, not really.
Even though we are getting closer to the “optimal” solution, there is still a drawback to be addressed.
Train, validation and test set
The caveat of cross-validation, as explained above, is that we are evaluating the model on a test set which is not completely extraneous to the model itself. Imagine we start a cross-validation procedure: at the first iteration, the model is evaluated on new data, since this is the very first split of the dataset; at iteration 2, however, the test fold will also include some data points that, in the previous iteration, were part of the training set, hence the model has actually seen them before! Basically, we are falling back into the first scenario of this article.
Luckily, there is an easy way to fix this, and it consists of introducing a third set: the validation set.
Basically, we first split the data into train and test sets. Then, we keep the test set apart and further split the train set into train and validation sets. By doing so, when applying cross-validation, we first evaluate the model over the K possible combinations. Then, once we have obtained a validation score, we are ready to try the model on the test set. Now, a high score on the test set is a much stronger indication of a well-generalized model.
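The full procedure can be sketched as follows, again with scikit-learn (dataset, estimator, and split sizes are illustrative assumptions). The test set is carved off first and never touched until the end; cross-validation on the remaining data plays the role of the validation step:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# First split: keep the test set completely out of model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validate on the remaining data: each fold plays the role
# of the validation set in turn.
model = LogisticRegression(max_iter=1000)
val_scores = cross_val_score(model, X_trainval, y_trainval, cv=5)

# Only once we are satisfied with the validation scores do we refit
# on all non-test data and touch the test set, exactly once.
model.fit(X_trainval, y_trainval)
print(val_scores.mean())            # validation score
print(model.score(X_test, y_test))  # final score on truly unseen data
```

Because the test set took no part in training or model selection, its score is the most trustworthy estimate of real-world performance in this workflow.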
Choosing the proper training and validation approach is crucial whenever you build an ML model. However, it is even more important to be aware that every approach has its drawbacks, and to keep them in mind while drawing conclusions.
Of the approaches in this article, the final one is probably the most accurate, yet it leads to further problems, the first being the reduction of available training data, which is itself another cause of overfitting.
So it is pivotal to choose a technique based on the specific task you are going to solve, keeping in mind all the pros and cons of each method.