fold_folder - CSDN
  • K-Fold Cross Validation

    2019-06-13 17:16:11

    1: K Fold Cross Validation

    In the previous mission, we learned about cross validation, a technique for testing a machine learning model's accuracy on new data that the model wasn't trained on. Specifically, we focused on the holdout validation technique, which involved:

    • splitting the full dataset into 2 partitions:
      • a training set and
      • a test set
    • training the model on the training set,
    • using the trained model to predict labels on the test set,
    • computing an error metric (e.g. simple accuracy) to understand the model's accuracy.
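    As a rough sketch (not part of this mission's code), holdout validation might look like the following, assuming the admissions Dataframe and column names introduced later in this mission; in recent versions of scikit-learn, train_test_split lives in sklearn.model_selection:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Hold out 20% of the rows as the test set
    train_X, test_X, train_y, test_y = train_test_split(
        admissions[["gpa"]], admissions["actual_label"], test_size=0.2)

    model = LogisticRegression()
    model.fit(train_X, train_y)                 # train on the training set
    predictions = model.predict(test_X)         # predict labels on the test set
    print(accuracy_score(test_y, predictions))  # simple accuracy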

    Holdout validation is actually a specific example of a larger class of validation techniques called k-fold cross-validation. K-fold cross-validation works by:

    • splitting the full dataset into k equal length partitions,
      • selecting k-1 partitions as the training set and
      • selecting the remaining partition as the test set
    • training the model on the training set,
    • using the trained model to predict labels on the test set,
    • computing an error metric (e.g. simple accuracy) and setting aside the value for later,
    • repeating all of the above steps k-1 more times, until each partition has been used as the test set exactly once,
    • calculating the mean of the k error values.

    Using 5 or 10 folds is common for k-fold cross-validation. Here's a diagram describing each iteration of 5-fold cross validation:

    [Diagram: the five iterations of 5-fold cross-validation, with a different fold serving as the test set in each iteration]

    Since you're training k models, the more folds you use, the longer the process takes. When working with large datasets, often only a small number of folds is used because of the time and cost involved; the tradeoff is that a large dataset still provides plenty of training examples per fold, which helps keep accuracy high even with fewer folds.
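    In outline, k-fold cross-validation can be sketched as the following loop, where train_model and compute_error are hypothetical helpers standing in for whichever model and error metric you use:

    import numpy as np

    def k_fold_cross_validation(dataset, k):
        folds = np.array_split(dataset, k)  # k roughly equal-length partitions
        errors = []
        for i in range(k):
            test = folds[i]                                  # the held-out partition
            train = np.concatenate(folds[:i] + folds[i+1:])  # the other k-1 partitions
            model = train_model(train)                 # hypothetical: fit a model
            errors.append(compute_error(model, test))  # hypothetical: error metric
        return np.mean(errors)  # mean of the k error values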

    2: Partitioning The Data

    To explore k-fold cross-validation, we'll continue to work with the dataset on graduate admissions. Recall that this dataset contains data on 644 applications with the following columns:

    • gre - applicant's score on the Graduate Record Exam, a generalized test for prospective graduate students.
      • Score ranges from 200 to 800.
    • gpa - college grade point average.
      • Continuous between 0.0 and 4.0.
    • admit - admission outcome.
      • Binary value, 0 or 1, where 1 means the applicant was admitted to the program and 0 means the applicant was rejected.

    To save you time, we've already imported the Pandas library, read in admissions.csv into a Dataframe, renamed the admit column to actual_label, and randomized the ordering of the rows.

    Now, partition the dataset into 5 folds.

    Instructions

    Partition the dataset into 5 folds and store each row's fold in a new integer column named fold:

    • Fold 1: rows from index 0 to 128, including both of those rows.
    • Fold 2: rows from index 129 to 257, including both of those rows.
    • Fold 3: rows from index 258 to 386, including both of those rows.
    • Fold 4: rows from index 387 to 514, including both of those rows.
    • Fold 5: rows from index 515 to 643, including both of those rows.

    Display the first 5 rows and the last 5 rows of the Dataframe to confirm.

    import pandas as pd
    import numpy as np

    admissions = pd.read_csv("admissions.csv")
    admissions["actual_label"] = admissions["admit"]
    admissions = admissions.drop("admit", axis=1)

    # Randomize the row ordering, then reset to a clean integer index
    shuffled_index = np.random.permutation(admissions.index)
    shuffled_admissions = admissions.loc[shuffled_index]
    admissions = shuffled_admissions.reset_index(drop=True)

    # Assign each row's fold; .loc slices include both endpoints
    # (.ix, used in the original, has been removed from modern pandas)
    admissions.loc[0:128, "fold"] = 1
    admissions.loc[129:257, "fold"] = 2
    admissions.loc[258:386, "fold"] = 3
    admissions.loc[387:514, "fold"] = 4
    admissions.loc[515:643, "fold"] = 5
    admissions["fold"] = admissions["fold"].astype("int")

    print(admissions.head())
    print(admissions.tail())

    3: First Iteration

    In the first iteration, let's assign fold 1 as the test set and folds 2 to 5 as the training set. Then, train the model and use it to predict labels for the test set.

    Instructions

    • Train a logistic regression model using the gpa column as the sole feature, with folds 2 to 5 as the training set.
    • Use the model to make predictions on the test set and assign the predicted labels to labels.
    • Calculate the accuracy by comparing the predicted labels with the actual labels from the actual_label column on the test set.
    • Assign the accuracy value to iteration_one_accuracy.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    # Folds 2-5 form the training set; fold 1 is the test set
    train_iteration_one = admissions[admissions["fold"] != 1]
    test_iteration_one = admissions[admissions["fold"] == 1].copy()
    # Pass the target as a Series rather than a one-column Dataframe
    model.fit(train_iteration_one[["gpa"]], train_iteration_one["actual_label"])
    labels = model.predict(test_iteration_one[["gpa"]])
    test_iteration_one["prediction_label"] = labels
    matches = test_iteration_one["prediction_label"] == test_iteration_one["actual_label"]
    correct_predictions = test_iteration_one[matches]
    iteration_one_accuracy = len(correct_predictions) / len(test_iteration_one)
    print(iteration_one_accuracy)

     

    4: Function For Training Models

    From the first iteration, we achieved an accuracy score of 60.5%. Let's now run through the rest of the iterations to see how the accuracy changes and to compute the mean accuracy.

    To make the iteration process easier, wrap the code you wrote in the previous screen in a function.

    Instructions

    • Write a function named train_and_test that takes in a Dataframe and a list of fold id values (1 to 5 in our case) and returns a list of accuracy values, e.g.:
    [0.5, 0.5, 0.5, 0.5, 0.5]
    
    • Use the train_and_test function to return the list of accuracy values for the admissions Dataframe and assign to accuracies, e.g.:
    accuracies = train_and_test(admissions, [1,2,3,4,5])
    
    • Compute the average accuracy and assign to average_accuracy.
    • average_accuracy should be a float value while accuracies should be a list of float values (one float value per iteration).
    • Use the variable inspector or the print function to display the values for accuracies and average_accuracy.

    # Use np.mean to calculate the mean.
    import numpy as np
    fold_ids = [1,2,3,4,5]
    def train_and_test(df, folds):
        fold_accuracies = []
        for fold in folds:
            model = LogisticRegression()
            # Use the df parameter, not the global Dataframe
            train = df[df["fold"] != fold]
            test = df[df["fold"] == fold].copy()
            model.fit(train[["gpa"]], train["actual_label"])
            test["predicted_label"] = model.predict(test[["gpa"]])

            matches = test["predicted_label"] == test["actual_label"]
            correct_predictions = test[matches]
            fold_accuracies.append(len(correct_predictions) / len(test))
        return fold_accuracies

    accuracies = train_and_test(admissions, fold_ids)
    print(accuracies)
    average_accuracy = np.mean(accuracies)
    print(average_accuracy)

    5: Sklearn

    The average accuracy value was 64.8%, compared to an accuracy value of 63.6% using holdout validation from the last mission. In many cases, the resulting accuracy values don't differ much between a simpler, less time-intensive method like holdout validation and a more robust but more time-intensive method like k-fold cross-validation. As you use these and other cross validation techniques more often, you should get a better sense of these tradeoffs and when to use which validation technique.

    In addition, the computed accuracy values for each fold stayed between 61% and 63%, which is a healthy sign. Wild variations in the accuracy values between folds are usually indicative of using too many folds (a too-large k value). By implementing your own k-fold cross-validation function, you hopefully acquired a good understanding of the inner workings of the technique.

    When working in a production environment, however, you should use scikit-learn. Scikit-learn has a few different tools that make performing cross validation easy. Similar to having to instantiate a LinearRegression or LogisticRegression object before you can train one of those models, you need to instantiate a KFold class before you can perform k-fold cross-validation:

    kf = KFold(n, n_folds, shuffle=False, random_state=None)

    where:

    • n is the number of observations in the dataset,
    • n_folds is the number of folds you want to use,
    • shuffle is used to toggle shuffling of the ordering of the observations in the dataset,
    • random_state is used to specify a seed value if shuffle is set to True.

    You'll notice here that only the first parameter depends on the dataset at all. This is because the KFold class returns an iterator object but won't actually handle the training and testing of models. If we're primarily interested in the accuracy and error metrics for each fold, we can use the KFold class in conjunction with the cross_val_score function, which will handle training and testing of the models in each fold.

    Here are the relevant parameters for the cross_val_score function:

    cross_val_score(estimator, X, y, scoring=None, cv=None)

    where:

    • estimator is a sklearn model that implements the fit method (e.g. instance of LinearRegression or LogisticRegression),
    • X is the list or 2D array containing the features you want to train on,
    • y is a list containing the values you want to predict (target column),
    • scoring is a string describing the scoring criteria (list of accepted values here).
    • cv describes the number of folds. Here are some examples of accepted values:
      • an instance of the KFold class,
      • an integer representing the number of folds.

    Depending on the scoring criteria you specify, either a single value is returned (e.g. average_precision) or an array of values (e.g. accuracy), one value for each fold.

    Here's the general workflow for performing k-fold cross-validation using the classes we just described:

    • instantiate the model class you want to fit (e.g. LogisticRegression),
    • instantiate the KFold class, using the parameters to specify the k-fold cross-validation attributes you want,
    • use the cross_val_score function to return the scoring metric you're interested in.

    Instructions

    • Create a new instance of the KFold class with the following properties:

      • n set to the length of admissions,
      • 5 folds,
      • shuffle set to True,
      • random seed set to 8 (so we can answer-check using the same seed),
      • assigned to the variable kf.
    • Create a new instance of the LogisticRegression class and assign to lr.

    • Use the cross_val_score function to perform k-fold cross-validation:

      • using the LogisticRegression instance lr,
      • using the gpa column for training,
      • using the actual_label column as the target column,
      • returning an array of accuracy values (one value for each fold).
    • Assign the resulting array of accuracy values to accuracies, compute the average accuracy, and assign the average to average_accuracy.

    • Use the variable inspector or the print function to display the values for accuracies and average_accuracy.

    from sklearn.cross_validation import KFold
    from sklearn.cross_validation import cross_val_score

    admissions = pd.read_csv("admissions.csv")
    admissions["actual_label"] = admissions["admit"]
    admissions = admissions.drop("admit", axis=1)

    # n observations, 5 folds, shuffled with a fixed seed
    kf = KFold(len(admissions), 5, shuffle=True, random_state=8)
    lr = LogisticRegression()

    accuracies = cross_val_score(lr, admissions[["gpa"]], admissions["actual_label"], scoring="accuracy", cv=kf)
    average_accuracy = sum(accuracies) / len(accuracies)
    print(accuracies)
    print(average_accuracy)
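    Note that the sklearn.cross_validation module used above was removed in scikit-learn 0.20 in favor of sklearn.model_selection, where KFold no longer takes the number of observations. A sketch of the equivalent code on a recent scikit-learn version:

    from sklearn.model_selection import KFold, cross_val_score

    # n_splits replaces the old (n, n_folds) pair; KFold infers n from the data
    kf = KFold(n_splits=5, shuffle=True, random_state=8)
    lr = LogisticRegression()
    accuracies = cross_val_score(lr, admissions[["gpa"]], admissions["actual_label"], scoring="accuracy", cv=kf)
    average_accuracy = accuracies.mean()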

    6: Interpretation

    Using 5-fold cross-validation, we achieved an average accuracy score of 64.4%, which closely matches the 63.6% accuracy score we achieved using holdout validation. When working with simple univariate models, often holdout validation is more than enough and the similar accuracy scores confirm this. When you're using multiple features to train a model (multivariate models), performing k-fold cross-validation can give you a better sense of the accuracy you should expect when you use the model on data it wasn't trained on.

    7: Next Steps

    In this mission, we explored a more robust cross validation technique called k-fold cross-validation. Cross-validation helps us understand a model's generalizability and reduce overfitting. In the next mission, we'll explore overfitting, and techniques for addressing it, in more detail.

     

    Reposted from: https://my.oschina.net/Bettyty/blog/751627

  • Getting started with the fold command

    2018-08-04 22:05:31

    Have you ever found yourself wanting to fold or break a command's output so that it fits a specific width? I've run into this situation a few times while running virtual machines, especially on servers without a GUI. In case you ever want to limit a command's output to a particular width, look here: the fold command comes in handy! fold wraps each line of the input files to fit a specified width and prints the result to standard output.
    In this short tutorial, we'll look at the usage of the fold command, with examples.

    fold command examples

    The fold command is part of the GNU coreutils package, so there's nothing to install.
    Typical syntax of the fold command:

    fold [OPTION]... [FILE]...

    Allow me to show you some examples so you can get a better feel for the fold command. I have a file named linux.txt with random content.
    To wrap each line of that file to the default width, run:

    fold linux.txt

    The default width is 80 columns per line, so the command's output is limited to 80 characters per line.
    Of course, we can specify a preferred width, for example 50, as follows:

    fold -w50 linux.txt

    We can also write the output to a new file, as shown below:

    fold -w50 linux.txt > linux1.txt

    The above command wraps the lines of linux.txt to a width of 50 characters and writes the output to a new file named linux1.txt.
    Let's check the contents of the new file:

    cat linux1.txt

    Did you notice the output of the previous commands? Some words are broken across lines. To fix that, we can use the -s flag to break lines at spaces.
    The following command wraps each line of the given file to a width of 50 and breaks lines at spaces:

    fold -w50 -s linux.txt

    See the difference? Now the output is clean: words moved to a new line are separated at spaces, and a line whose words would exceed 50 characters is wrapped to the next line without splitting words.
    In all the examples above, we limited the output width in columns. However, we can use the -b option to force the width to a specified number of bytes. The following command breaks the output at 20 bytes:

    fold -b20 linux.txt
    

    Original article: https://www.linuxprobe.com/fold-linux.html

  • Understanding and Applying K-Fold Cross-Validation

    Understanding and Applying K-Fold Cross-Validation

    Personal homepage --> http://www.yansongsong.cn/

    1. The concept of K-Fold cross-validation

    In machine learning modeling, the usual practice is to split the data into a training set and a test set. The test set is independent of training and takes no part in it; it is used only for the final evaluation of the model. During training, overfitting is a common problem: the model matches the training data well but predicts data outside the training set poorly. If at that point we used the test data to tune the model's parameters, it would amount to knowing part of the test data's information during training, which would bias the final evaluation. The usual remedy is to split off part of the training data as validation data, used to evaluate how the model's training is going.

    The validation data is taken from the training data but does not participate in training, so it gives a relatively objective estimate of how well the model matches data outside the training set. Evaluating the model on validation data is commonly done with cross-validation, also called rotation validation. It splits the original data into K groups (K-Fold); each subset in turn serves as a validation set while the remaining K-1 subsets form the training set, yielding K models. Each of these K models is evaluated on its own validation set, and the resulting errors, e.g. MSE (Mean Squared Error), are summed and averaged to give the cross-validation error. Cross-validation makes effective use of limited data, and the resulting estimate comes as close as possible to the model's performance on the test set, so it can serve as a criterion for tuning the model.

    2. A worked example

    Here is a concrete example of the K-Fold process, using the following data:

    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    

    split into K=3 groups:

    Fold1: [0.5, 0.2]
    Fold2: [0.1, 0.3]
    Fold3: [0.4, 0.6]
    

    Cross-validation then uses the following three models, each trained and tested separately; summing and averaging the MSE on each test set gives the overall cross-validation score:

    Model1: Trained on Fold1 + Fold2, Tested on Fold3
    Model2: Trained on Fold2 + Fold3, Tested on Fold1
    Model3: Trained on Fold1 + Fold3, Tested on Fold2
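    As an illustration, scikit-learn's KFold class (from sklearn.model_selection) produces this kind of split; with shuffle=True the exact grouping will generally differ from the folds listed above:

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
    kf = KFold(n_splits=3, shuffle=True)
    for train_index, test_index in kf.split(data):
        print("train:", data[train_index], "test:", data[test_index])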

    3. The procedure explained

    1. Split the full training set S into k disjoint subsets. If S contains m training examples, each subset has m/k examples; call these subsets S_1, ..., S_k.

    2. For each model M_i in the model collection M, train it on k-1 of the subsets (each time leaving out one subset S_j), obtaining a hypothesis h_ij. Then test h_ij on the held-out subset S_j to get an empirical error e_ij.

    3. Since each subset S_j (j from 1 to k) is left out exactly once, this yields k empirical errors for M_i, and M_i's empirical error is the average of those k values.

    4. Select the model M_i with the lowest average empirical error, then train it once more on all of S to obtain the final hypothesis.

    Core ideas:

    Steps 1-3 measure a model's performance, taking the average over folds as that model's performance metric.

    Approach 1: fuse (ensemble) all the models trained across the K folds.

    Approach 2: use the performance metric to pick the best model, then carry out step 4 above, retraining on all the data to obtain the final model. A sketch of this procedure follows below.
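    A minimal sketch of steps 1-4, assuming scikit-learn regressors as the candidate model collection, mean squared error as the empirical error, and NumPy arrays X and y standing in for the full training set S:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression, Ridge

    models = [LinearRegression(), Ridge(alpha=1.0)]  # candidate model collection M
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    def cv_error(model, X, y):
        errors = []
        for train_index, test_index in kf.split(X):
            model.fit(X[train_index], y[train_index])  # train on the k-1 subsets
            pred = model.predict(X[test_index])        # test on the held-out subset
            errors.append(np.mean((pred - y[test_index]) ** 2))  # empirical error (MSE)
        return np.mean(errors)  # average of the k empirical errors

    # Step 4: pick the model with the lowest average error, retrain on all of S
    best = min(models, key=lambda m: cv_error(m, X, y))
    best.fit(X, y)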

    Q&A:

    1. Why not simply split the data into a training set and a test set once to measure model performance, instead of the seemingly more laborious repeated splits?

    To guard against overfitting during training, the usual practice is to split the data into a training set and a test set, where the test set is fully independent of training and is used only for the final evaluation of the model. The problem with a single split is that the test set never participates in training; on a small dataset this wastes that portion of the data and keeps the model from reaching its best (the data determines the ceiling on performance; models and algorithms only approach that ceiling). Yet we cannot do without a held-out split, because we need to verify the model's generalization. K-Fold's repeated splits let us make use of the entire dataset, and averaging over the folds gives a reasonable overall measure of model performance.

    2. Why retrain on the whole dataset afterwards? Isn't that a waste of time?

    Training with K-Fold's repeated splits is how we obtain a performance metric for a model; a model trained on a single fold does not represent overall performance. But we can record the better-performing hyperparameters found during K-Fold training and then retrain the best model with the best hyperparameters from scratch, which will generally give a better result.

    Alternatively, follow approach 1 and skip retraining by fusing the fold models.

    3. When should you use K-Fold?

    In my view, when the total amount of data is small and other methods can no longer improve performance, K-Fold is worth trying. In other situations it is hard to recommend: for example, with a large dataset there is no need for more training data, while the training cost grows by a factor of K (mainly in training time).

     

    4. A practical example

    Having trained on the 5 fold combinations above, at test time we have 5 models. Each model predicts the test set once, giving 5 probability matrices, each of shape (number of test samples x 17). We can average the 5 probability matrices and then make the binary per-label predictions, or make the binary predictions per model first and then vote, to obtain the final multi-label prediction; a sketch of both schemes follows below. This result effectively draws on the training data of all 5 folds, and is both more accurate and more stable.

    Of course, if the goal is merely to use all the data, the simpler approach is to run the whole training set through the model once and have the trained model predict the test set. We did not take this second route: first, that model would have "seen" every training sample, leaving no separate validation set with which to assess its generalization; second, we felt that the first approach, a simple ensemble of the 5 models' predictions, would be a bit more stable.
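    A sketch of the two combination schemes, where probs is a hypothetical list holding the 5 fold models' probability matrices, each of shape (number of test samples x 17):

    import numpy as np

    # Scheme 1: average the 5 probability matrices, then threshold each label
    avg_probs = np.mean(probs, axis=0)
    pred_avg = (avg_probs > 0.5).astype(int)

    # Scheme 2: threshold each model's probabilities first, then majority-vote
    votes = np.stack([p > 0.5 for p in probs])        # shape (5, n_test, 17)
    pred_vote = (votes.sum(axis=0) >= 3).astype(int)  # at least 3 of 5 models agree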

    5. References

    1. K-Fold Cross-Validation

    2. Regularization and model selection

    3. Surviving Kaggle: the Amazon rainforest competition

  • About fold

    2017-10-11 15:08:07

    1. In Scala, the fold operation on a collection:

    val l=List(1,2,3)
    val res=l.fold(10)(_+_)

    The result res is 16. Here 10 is an initial value for the collection as a whole and is used only once.

    2. RDDs in Spark:

    val a=sc.makeRDD(List(1,2,3))
    val res=a.fold(10)(_+_)

    The result res is 46. Here the initial value is applied once within each partition and once more when the per-partition results are combined: the RDD was split across 3 partitions, so 10 was added 4 times in total (1+2+3 + 4x10 = 46), not once per element.

    Note the difference.

    foldByKey works like fold, but the former returns an RDD (a transformation) while the latter returns a value (an action):

    val b=sc.parallelize(List((1,2),(3,4),(3,6)))
    val res=b.foldByKey(10)(_+_)
    res foreach println

    The result is (1,12) and (3,30).

  • I recently read a decision-tree paper that mentioned 5-fold cross-validation and 10-fold cross-validation, so I looked up some material to find out what they are. The principle: the original data is split into K groups (K-Fold), each subset serves once as the validation set, and the rest...
  • fold in functional programming

    2014-03-23 13:35:32
    In functional programming, the basic meaning of fold is to traverse a data structure and finally produce an aggregate value. The simplest example is sum list = foldl (+) 0 list. fold abstracts two things: traversing the data structure itself, and the relation between the accumulated value and each element. With fold, you only need to...
  • When the sample size is small, K-fold cross-validation is a common way to train and evaluate models. This article introduces Scikit-learn's dataset-splitting classes for K-fold cross-validation, ShuffleSplit and GroupShuffleSplit, and how to use them.
  • The DIY repair site iFixit tore down the Galaxy Fold, the world's first folding-screen phone, and published its analysis on April 24; the next day, however, iFixit took the article down at Samsung's request. According to iFixit, this Galaxy Fold did not come directly from Samsung but from a reliable...
  • 1. Introducing fold: essentially, the fold function transforms input data of one shape into another shape and returns it. The three functions fold, foldLeft and foldRight do much the same thing apart from some small differences. Below I explain what they have in common and how they differ. I'll start with...
  • k-fold cross-validation: in machine learning, a dataset A is split into a training set B and a test set C. When the sample size is insufficient, to make full use of the dataset for testing the algorithm's performance, dataset A is randomly split into k packs,...
  • fold merges each element of a group with an accumulator (which starts from the initial value initialValue) and returns the accumulator; the accumulator's type may differ from the element type. 2. reduce can be applied to a DataStream or a DataSet, but fold only to a DataStream. Flink reduce ...
  • 1. rdd.fold(value)(func): when talking about the fold() function, one has to mention reduce(); the difference between the two comes down to an initial value. reduce() is written rdd.reduce(func), taking a function that performs some operation over all the data in the RDD...
  • fold change means the fold difference in expression between samples; log2 fold change means taking log2, which narrows the gap between very large and fairly small differences. The Q-value is a corrected P-value; the P-value measures the statistical significance of a difference. Compared with the P-value, the Q-value...
  • C++17 fold expression

    2020-08-04 21:29:40
    1. Introduction: C++11 added a new feature, variadic templates... C++17 solves this problem with fold expressions, which simplify the expansion of a parameter pack. 2. Syntax: fold expressions come in four syntactic forms, namely unary left folds and right fol...
  • Scala study notes on reduce, fold and scan. Contents: `reduce`: concept and differences, with examples: computing the sum of a given collection's elements, computing `n!`; `fold`: concept and differences, with examples: computing `n...
  • In convolutional networks, sliding-window operations come up frequently... Most deep learning frameworks also provide APIs for performing sliding-window operations explicitly; in PyTorch these are unfold and fold. Below we look at how to use these two functions. In PyTorch, related to unfold there are: torch.nn.Unfol...
  • Commonly used reduction methods are reduce and fold; the only difference between the two is that reduce starts the reduction from the first two elements of the container, while fold starts from a supplied initial value. Likewise, for unordered containers, fold does not guarantee the traversal order during reduction; if order must be guaranteed, please...
  • fold change (ratio)

    2019-10-02 13:20:12
    fold change; abbreviation: FC; Chinese term: 倍性变化; category: biological science. Entry summary: a method for describing the quantitative difference between two compared objects. For example, if the amounts of the first and second samples are 50/10, then the FC (ratio) is 5; conversely...
  • In functional programming, Map and Fold are two extremely useful operations that exist in every functional language. If Map and Fold are so powerful and important, yet the Java language lacks Map and Fold mechanisms, how do we explain getting our daily coding work done in Java? In fact you...