  • readr 0.1.0

    2015-04-10 22:13:53
    (This article was first published on  RStudio Blog, and kindly contributed to
    R-bloggers)     
    

    I’m pleased to announce that readr is now available on CRAN. Readr makes it easy to read many types of tabular data:

    • Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().
    • Fixed width files with read_fwf(), and read_table().
    • Web log files with read_log().

    You can install it by running:

    install.packages("readr")

    Compared to the equivalent base functions, readr functions are around 10x faster. They’re also easier to use because they’re more consistent, they produce data frames that are easier to use (no more stringsAsFactors = FALSE!), they have a more flexible column specification, and any parsing problems are recorded in a data frame. Each of these features is described in more detail below.

    Input

    All readr functions work the same way. There are four important arguments:

    • file gives the file to read; a URL or local path. A local path can point to a zipped, bzipped, xzipped, or gzipped file – it’ll be automatically uncompressed in memory before reading. You can also pass in a connection or a raw vector.

      For small examples, you can also supply literal data: if file contains a new line, then the data will be read directly from the string. Thanks to data.table for this great idea!

      library(readr)
      read_csv("x,y\n1,2\n3,4")
      #>   x y
      #> 1 1 2
      #> 2 3 4
    • col_names: describes the column names (equivalent to header in base R). It has three possible values:
      • TRUE will use the first row of data as column names.
      • FALSE will number the columns sequentially.
      • A character vector to use as column names.
    • col_types: overrides the default column types (equivalent to colClasses in base R). More on that below.
    • progress: By default, readr will display a progress bar if the estimated loading time is greater than 5 seconds. Use progress = FALSE to suppress the progress indicator.

    Output

    The output has been designed to make your life easier:

    • Characters are never automatically converted to factors (i.e. no more stringsAsFactors = FALSE!).
    • Column names are left as is, not munged into valid R identifiers (i.e. there is no check.names = TRUE). Use backticks to refer to variables with unusual names, e.g. df$`Income ($000)`.
    • The output has class c("tbl_df", "tbl", "data.frame") so if you also use dplyr you’ll get an enhanced print method (i.e. you’ll see just the first ten rows, not the first 10,000!).
    • Row names are never set.
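
As a small illustration of the points above, the sketch below reads a two-row literal table whose first column has a non-syntactic name. The data and the column names are invented for the example, and wrapping the literal string in I() assumes a reasonably recent readr.

```r
library(readr)

# Hypothetical literal data; the first column name contains spaces and symbols.
df <- read_csv(I("Income ($000),name\n12,a\n34,b\n"), col_types = "dc")

names(df)            # the column name is kept exactly as written in the file
df$`Income ($000)`   # backticks are needed to refer to it
class(df)            # includes "tbl_df" and "data.frame"
is.character(df$name)  # strings stay character vectors, not factors
```
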

    Column types

    Readr heuristically inspects the first 100 rows to guess the type of each column. This is not perfect, but it’s fast and it’s a reasonable start. Readr can automatically detect these column types:

    • col_logical() [l], contains only T, F, TRUE or FALSE.
    • col_integer() [i], integers.
    • col_double() [d], doubles.
    • col_euro_double() [e], “Euro” doubles that use , as the decimal separator.
    • col_date() [D]: Y-m-d dates.
    • col_datetime() [T]: ISO8601 date times
    • col_character() [c], everything else.

    You can manually specify other column types:

    • col_skip() [_], don’t import this column.
    • col_date(format) and col_datetime(format, tz), dates or date times parsed with given format string. Dates and times are rather complex, so they’re described in more detail in the next section.
    • col_numeric() [n], a sloppy numeric parser that ignores everything apart from 0-9, - and . (this is useful for parsing currency data).
    • col_factor(levels, ordered), parse a fixed set of known values into a (optionally ordered) factor.

    There are two ways to override the default choices with the col_types argument:

    • Use a compact string: "dc__d". Each letter corresponds to a column so this specification means: read first column as double, second as character, skip the next two and read the last column as a double. (There’s no way to use this form with column types that need parameters.)
    • With a (named) list of col objects:
      read_csv("iris.csv", col_types = list(
        Sepal.Length = col_double(),
        Sepal.Width = col_double(),
        Petal.Length = col_double(),
        Petal.Width = col_double(),
        Species = col_factor(c("setosa", "versicolor", "virginica"))
      ))

      Any omitted columns will be parsed automatically, so the previous call is equivalent to:

      read_csv("iris.csv", col_types = list(
        Species = col_factor(c("setosa", "versicolor", "virginica"))
      ))
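
The compact string form can be tried on literal data. This is a minimal sketch: the five-column table is made up for illustration, and I() is used to mark the string as literal data (an assumption about the installed readr version).

```r
library(readr)

# Hypothetical five-column table: double, character, two skipped columns, double.
csv <- I("a,b,c,d,e\n1.5,x,skip,skip,2\n3.5,y,skip,skip,4\n")

# "dc__d": d = double, c = character, _ = skip this column.
df <- read_csv(csv, col_types = "dc__d")

names(df)   # only the columns that were not skipped remain
str(df)
```
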

    Dates and times

    One of the most helpful features of readr is its ability to import dates and date times. It can automatically recognise the following formats:

    • Dates in year-month-day form: 2001-10-20 or 2010/10/15 (or any non-numeric separator). It can’t automatically recognise dates in m/d/y or d/m/y format because they’re ambiguous: is 02/01/2015 the 2nd of January or the 1st of February?
    • Date times as ISO8601 form: e.g. 2001-02-03 04:05:06.07 -0800, 20010203 040506, 20010203 etc. I don’t support every possible variant yet, so please let me know if it doesn’t work for your data (more details in ?parse_datetime).

    If your dates are in another format, don’t despair. You can use col_date() and col_datetime() to explicitly specify a format string. Readr implements its own strptime() equivalent which supports the following format strings:

    • Year: %Y (4 digits). %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
    • Month: %m (2 digits), %b (abbreviated name in current locale), %B (full name in current locale).
    • Day: %d (2 digits), %e (optional leading space)
    • Hour: %H
    • Minutes: %M
    • Seconds: %S (integer seconds), %OS (partial seconds)
    • Time zone: %Z (as name, e.g. America/Chicago), %z (as offset from UTC, e.g. +0800)
    • Non-digits: %. skips one non-digit character, %* skips any number of non-digit characters.
    • Shortcuts: %D = %m/%d/%y, %F = %Y-%m-%d, %R = %H:%M, %T = %H:%M:%S, %x = %y/%m/%d.

    To practice parsing date times without having to load the file each time, you can use parse_datetime() and parse_date():

    parse_date("2015-10-10")
    #> [1] "2015-10-10"
    parse_datetime("2015-10-10 15:14")
    #> [1] "2015-10-10 15:14:00 UTC"
    
    parse_date("02/01/2015", "%m/%d/%Y")
    #> [1] "2015-02-01"
    parse_date("02/01/2015", "%d/%m/%Y")
    #> [1] "2015-01-02"
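
The same helpers accept longer format strings. The sketch below combines several of the codes listed above; the input strings are invented for the example.

```r
library(readr)

# Month/day/four-digit year followed by hours and minutes.
dt <- parse_datetime("10/20/2001 14:30", format = "%m/%d/%Y %H:%M")

# Two-digit year: 85 falls in the 70-99 range, so it is read as 1985.
d <- parse_date("02/01/85", format = "%m/%d/%y")

dt
d
```
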

    Problems

    If there are any problems parsing the file, the read_ function will throw a warning telling you how many problems there are. You can then use the problems() function to access a data frame that gives information about each problem:

    csv <- "x,y
    1,a
    b,2
    "
    
    df <- read_csv(csv, col_types = "ii")
    #> Warning: 2 problems parsing literal data. See problems(...) for more
    #> details.
    problems(df)
    #>   row col   expected actual
    #> 1   1   2 an integer      a
    #> 2   2   1 an integer      b
    df
    #>    x  y
    #> 1  1 NA
    #> 2 NA  2

    Helper functions

    Readr also provides a handful of other useful functions:

    • read_lines() works the same way as readLines(), but is a lot faster.
    • read_file() reads a complete file into a string.
    • type_convert() attempts to coerce all character columns to their appropriate type. This is useful if you need to do some manual munging (e.g. with regular expressions) to turn strings into numbers. It uses the same rules as the read_* functions.
    • write_csv() writes a data frame out to a csv file. It’s quite a bit faster than write.csv() and it never writes row.names. It also escapes " embedded in strings in a way that read_csv() can read.
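
A short round trip shows how these helpers fit together. This is only a sketch: the data frame and the temporary file are made up for illustration.

```r
library(readr)

# Hypothetical data with an embedded double quote in one string.
df <- data.frame(x = c(1, 2), y = c('say "hi"', "b"), stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")
write_csv(df, tmp)            # no row names are written

lines <- read_lines(tmp)      # one element per line, including the header
back <- read_csv(tmp, col_types = "dc")

lines
back$y                        # the embedded quotes survive the round trip
```
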

    Development

    Readr is still under very active development. If you have problems loading a dataset, please try the development version, and if that doesn’t work, file an issue.

  • Importing data with readr

    2020-02-16 22:09:00

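    Chapter 8: Importing data with readr

    Reading data with readr
    The parse_*() family of functions: parsing a single vector
    Parsing a file

    Summary:

    1. Among the readr functions, learn read_csv() for comma-delimited files and read_delim() for files with an arbitrary delimiter.


    2. Some arguments of the readr functions

    • skip = n skips the first n lines

    • comment = "#" ignores lines that start with #

    • col_names = FALSE means the first row is not a header

    • col_names = c("x", "y", "z") sets the column names

    • na = "." reads a dot as NA

    • readr_example() returns the path to an example file bundled with the readr package.


    3. Data has to be parsed as it is read, because the same value can be written in several ways

    (1) Numbers: parse_number() ignores non-numeric characters before and after the number

    decimal_mark = ","     # the character used as the decimal mark
    grouping_mark = "."    # the grouping character to ignore
    

    (2) Strings
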

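    1. Getting started
    library(tidyverse)
    Use the readr package to load flat files into R.

    Function       Description
    read_csv()     reads comma-delimited files
    read_csv2()    reads semicolon-delimited files
    read_tsv()     reads tab-delimited files
    read_delim()   reads files with any delimiter
    read_fwf()     reads fixed-width files; read_table() is a variant of it
    read_log()     reads Apache-style log files
    read_lines()   reads a file into a character vector, one element per line
    read_file()    reads a file into a character vector of length 1

    Notes:

    1. The readr functions read data more efficiently than their base R counterparts.
      For example, they are roughly 10 times faster than base R's read.csv().
      data.table::fread() is faster still.

    2. The readr functions produce tibbles; they do not convert character vectors to factors, use row names, or alter column names.

    3. Code that uses the readr functions is easier to reproduce.

    2. Arguments of read_csv()

    1. Use skip = n to skip the first n lines.
    2. Use comment = "#" to drop every line that starts with #.
    3. By default the first row is used as column names. Set col_names = FALSE to turn this off, or pass a character vector to col_names to supply the names yourself.
    4. Use na to name the values that should be read as missing, e.g. na = ".".
    5. "\n" is a convenient shortcut for adding a new line to literal data.

    Examples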

    read_csv("The first line of metadata
    The second line of metadata
    x,y,z
    1,2,3", skip = 2)
    
    read_csv("a,b,c
    1,2,3
    4,5,6")
    # A tibble: 2 x 3
    #      a     b     c
    #    <dbl> <dbl> <dbl>
    # 1     1     2     3
    # 2     4     5     6
    read_csv("a,b,c\n1,2,3\n4,5,6")
    
    read_csv("1,2,3\n4,5,6",col_names=c("x","y","z"))
    

    Exercises

    (1) If the fields in a file are separated by "|", which function should you use to read it?

    read_delim(file, delim = "|")
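
To see the answer in action without a file on disk, the sketch below passes a small pipe-delimited literal string; the data is invented for the example and I() assumes a reasonably recent readr.

```r
library(readr)

# Hypothetical pipe-delimited data: an integer id and a character label.
df <- read_delim(I("id|label\n1|a\n2|b\n"), delim = "|", col_types = "ic")

df
```
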
    

    (2) Apart from file, skip, and comment, which other arguments do read_csv() and read_tsv() have in common?

    intersect(names(formals(read_csv)), names(formals(read_tsv)))
    

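    (3) What is the most important argument of read_fwf()?

    read_fwf() reads "fixed-width format" files, and its most important argument is col_positions, which tells the function where each data column starts and ends.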
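
A minimal sketch of col_positions, using fwf_widths() to describe two five-character columns. The file contents and column names are hypothetical.

```r
library(readr)

# Write a tiny fixed-width file: characters 1-5 are the first name
# (padded with a space), characters 6-10 are the last name.
path <- tempfile(fileext = ".txt")
writeLines(c("John Smith", "Mary Jones"), path)

df <- read_fwf(
  path,
  col_positions = fwf_widths(c(5, 5), col_names = c("first", "last")),
  col_types = "cc"
)

df
```

fwf_positions() is the alternative when you would rather give explicit start and end positions than widths.
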

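    (4) Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character such as " or '. By convention, read_csv() assumes the quoting character is ", and the exercise says that changing it requires read_delim(). Which arguments do you need to specify to read the following text into a data frame? (The answer below passes quote straight to read_csv(), which also accepts it.)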

    "x,y\n1,'a,b'"
    x <- "x,y\n1,'a,b'"
    read_csv(x, quote = "'")
    

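    3. Parsing a vector

    The parse_*() functions take a character vector and return a more specialised vector, such as a logical, integer, or date vector.

    The first argument is the character vector to parse, and the na argument specifies which strings should be treated as missing:

    parse_integer(c("1","231",".","456"),na=".")
    #> [1]   1 231  NA 456
    
    parse_logical()    # logicals
    parse_integer()    # integers
    parse_double()     # strict numeric parser
    parse_number()     # flexible numeric parser
    parse_character()
    parse_factor()
    parse_datetime()   # date-times, as in SQL
    parse_date()
    parse_time()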
    

    Each function takes strings and returns the type it is named for:

    parse_logical(c("TRUE","FALSE","NA"))
    #> [1]  TRUE FALSE    NA
    str(parse_logical(c("TRUE","FALSE","NA")))
    #>  logi [1:3] TRUE FALSE NA
    str(parse_integer(c("1","2","3")))
    #>  int [1:3] 1 2 3
    str(parse_date(c("2010-01-01","1979-10-14")))
    #>  Date[1:2], format: "2010-01-01" "1979-10-14"
    

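    If parsing fails, problems(x) returns the complete set of failures, which you can then work with using dplyr.

    3.1 Numbers

    In practice, the same number can be written in more than one way.

    readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.

    The default decimal mark is ".". Some countries use a comma instead, which you set with
    locale = locale(decimal_mark = ",").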

    parse_double("1.23")
    #> [1] 1.23
    parse_double("1,23", locale = locale(decimal_mark = ","))
    #> [1] 1.23
    

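    Numbers are often surrounded by currency symbols, percent signs, or unrelated text. parse_number() extracts the number, including numbers embedded in text.

    parse_number() addresses the second problem: it ignores non-numeric characters before and after the number.

    parse_number("$100")
    #> [1] 100
    parse_number("20%")
    #> [1] 20
    parse_number("It cost $123.45")
    #> [1] 123.45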
    

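    Combining parse_number() with a locale solves the final problem,
    because parse_number() ignores the "grouping mark":

    # Used in America
    parse_number("$123,456,789")   # the default
    #> [1] 1.23e+08
    
    # Used in many parts of Europe
    parse_number(
    "123.456.789",locale=locale(grouping_mark="."))   # ignore the grouping mark
    #> [1] 1.23e+08
    
    # Used in Switzerland
    parse_number(
    "123'456'789",locale=locale(grouping_mark="'"))   # ignore the grouping mark
    #> [1] 1.23e+08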
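
Both locale settings can be combined when a value uses a comma as the decimal mark and a dot as the grouping mark. The input string below is made up for illustration.

```r
library(readr)

# European-style number: "." groups thousands, "," marks the decimals.
eu <- locale(decimal_mark = ",", grouping_mark = ".")
x <- parse_number("EUR 1.234,56", locale = eu)

x
```
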
    

    3.2 Strings

    In practice, the same string can be represented in more than one way.
    parse_character()
    Use charToRaw() to see the underlying bytes of a string.

    charToRaw("Hadley")
    #> [1] 48 61 64 6c 65 79
    

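    ASCII does a good job of representing English characters.

    readr uses UTF-8 throughout: it assumes your data is UTF-8 encoded when reading, and always writes UTF-8.
    If text comes out garbled, specify the encoding:

    # locale = locale(encoding = "...") sets the encoding used when parsing
    # guess_encoding(charToRaw(x)) guesses the encoding of a string
    
    x1 <- "\xce\xd2\xb0\xae\xc4\xe3"   # bytes of a GBK-encoded string
    parse_character(x1, locale = locale(encoding = "GBK"))
    guess_encoding(charToRaw(x1))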
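
The same idea with a Latin1 string, as a minimal sketch: the byte \xf1 is "ñ" in Latin1, and the sentence is just sample text.

```r
library(readr)

# A string whose bytes are Latin1, not UTF-8.
x_latin1 <- "El Ni\xf1o was particularly bad this year"

# Tell readr how the bytes are encoded; the result is UTF-8.
fixed <- parse_character(x_latin1, locale = locale(encoding = "Latin1"))

fixed
```
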
    

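    3.3 Factors

    R uses factors to represent categorical variables that take values from a known set.
    If you give parse_factor() a vector of known levels,
    it generates a warning whenever a value outside those levels appears:
    fruit <- c("apple", "banana")
    parse_factor(c("apple", "banana", "bananana"), levels = fruit)
    #> Warning: 1 parsing failure.
    #> [1] apple  banana <NA>
    #> Levels: apple banana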

    3.4 Dates

    Note: the same string gives a different date depending on the format you specify:

    parse_date("01/02/15","%m/%d/%y")
    #> [1] "2015-01-02"
    parse_date("01/02/15","%d/%m/%y")
    #> [1] "2015-02-01"
    parse_date("01/02/15","%y/%m/%d")
    #> [1] "2001-02-15"
    

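    1. Date-times
    parse_datetime() expects an ISO8601 date-time.
    ISO8601 is an international standard in which the components of a date are ordered from largest to smallest: year, month, day, hour, minute, second:

    parse_datetime("2010-10-01T2010")
    #> [1] "2010-10-01 20:10:00 UTC"
    # If the time is omitted, it is set to midnight
    parse_datetime("20101010")
    #> [1] "2010-10-10 UTC"
    

    2. Dates
    parse_date() expects a four-digit year, a - or /, the month, a - or /, then the day:

    parse_date("2010-10-01")
    #> [1] "2010-10-01"
    

    3. Times
    parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional a.m./p.m. specifier:

    library(hms)
    parse_time("01:10 am")
    #> 01:10:00
    parse_time("20:10:01")
    #> 20:10:01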
    

    4. Parsing a file

    readr first uses guess_parser() to guess the type of a column,
    then parse_guess() to parse the column with that guess; each column is a vector.

    guess_parser("2010-10-01")
    #> [1] "date"
    guess_parser("15:01")
    #> [1] "time"
    guess_parser(c("TRUE","FALSE"))
    #> [1] "logical"
    guess_parser(c("1","5","9"))
    #> [1] "integer"
    guess_parser(c("12,352,561"))
    #> [1] "number"
    

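    readr reads the first 1000 rows and uses heuristics to guess the type of each column, then parses the file with those guesses.
    It can guess: logical, integer, double, number, time, date, and date-time.
    If none of these match, the column is left as a character vector.

    This default does not always work, for two reasons: the first 1000 rows may be a special case, and a column may contain many missing values.

    readr ships with a challenging CSV file that illustrates both problems.
    readr_example() returns the path to a file bundled with the package.

    challenge <- read_csv(readr_example("challenge.csv"))

    Based on the first 1000 rows, readr guesses that column x is an integer and column y is a character vector.
    Every row after row 1000 then fails to parse in x, because of trailing characters.
    In fact, after row 1000 column x holds doubles and column y holds dates,
    so we add col_types to the call and specify col_double() for x
    and col_date() for y.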

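    The column types can be changed step by step. Start from the guessed specification:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
        x = col_integer(), 
        y = col_character()
    ))
    # Next, change the type of column x:
    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
        x = col_double(),
        y = col_character()
    ))
    # Then change the type of column y:
    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(x = col_double(), y = col_date()))
    
    # You can also raise the number of rows used for guessing:
    challenge2 <- read_csv(
      readr_example("challenge.csv"), guess_max = 1001)
    
    
    
    # Alternatively, read every column in as a character vector: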
    challenge2 <- read_csv(readr_example("challenge.csv"),
                           col_types = cols(.default = col_character())
    )
    

    Use type_convert() to convert the column types:

    df <- tribble(
      ~x, ~y,
      "1", "1.21",
      "2", "2.32",
      "3", "4.56"
    )
    df
    #> # A tibble: 3 × 2
    #> x y
    #> <chr> <chr>
    #> 1 1 1.21
    #> 2 2 2.32
    #> 3 3 4.56
    # Note the column types
    type_convert(df)
    #> Parsed with column specification:
    #> cols(
    #> x = col_integer(),
    #> y = col_double()
    #> )
    #> # A tibble: 3 × 2
    #> x y
    #> <int> <dbl>
    #> 1 1 1.21
    #> 2 2 2.32
    #> 3 3 4.56
    

    5. Saving and exporting

    readr also provides two very useful functions for writing data back to disk:

    write_csv()
    write_tsv()
    

    To export a CSV file for use in Excel, use write_excel_csv().

    The column types are lost when you save with write_csv().
    Alternatives:
    (1) write_rds() and read_rds(), which use RDS, R's own binary format

    write_rds(challenge,"challenge.rds")
    read_rds("challenge.rds")
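
The difference can be seen on a tiny, made-up data frame. This sketch assumes temporary files are acceptable, and reads the CSV back with every column as character to show that a CSV file carries only text, so types have to be guessed or specified again.

```r
library(readr)

# Hypothetical data with a double column and a Date column.
df <- data.frame(
  x = c(1.5, 2.5),
  y = as.Date(c("2020-01-01", "2020-06-30"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write_csv(df, csv_path)
write_rds(df, rds_path)

# CSV: only text is stored; here every column is read back as character.
csv_back <- read_csv(csv_path, col_types = cols(.default = col_character()))

# RDS: the R object is stored, so the Date column comes back as a Date.
rds_back <- read_rds(rds_path)

class(csv_back$y)
class(rds_back$y)
```
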
    

    (2) write_feather() and read_feather() from the feather package, a format that can be shared across programming languages.

    library(feather)
    write_feather(challenge,"challenge.feather")
    read_feather("challenge.feather")
    

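    feather tends to be faster than RDS, and can be used outside R.

    6. Other types of data

    • haven reads SPSS, Stata, and SAS files;
    • readxl reads Excel files (both .xls and .xlsx);
    library(readxl)
    d <- read_excel("d.xlsx")
    View(d)
    • DBI, combined with a database-specific backend (such as RMySQL, RSQLite, or RPostgreSQL),
    lets you run SQL queries against a database and returns a data frame.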

  • It would be nice to have the same column guessing with the same syntax as in readr. https://github.com/hadley/readr/releases/tag/v1.0.0 (This question comes from the open-source project tidyverse/readxl.)
  • The preview to import CSV form ... The problem is that download.file fails from the IDE, while this works from readr since readr appears to be using curls: ...
  • Adopt guess_max a la readr

    2020-12-27 23:03:12
    Propose read_excel() follow readr and gain a guess_max parameter with similar default behaviour: guess_max Maximum number of records to use for guessing column ...
  • R for Data Science summary: readr

    2018-10-07 11:25:12

    R for Data Science summary: readr

    As its name suggests, the readr package is for reading data into R. Here we load it through the tidyverse, which includes readr:

    library(tidyverse)
    

    The main functions are:

    • Delimited files: read_csv(), read_csv2(), read_tsv(), read_delim()
    • Fixed-width and whitespace-separated files: read_fwf(), read_table()
    • Log files: read_log()

    Let's start with read_csv():

    heights <- read_csv("data/heights.csv")
    #> Parsed with column specification:
    #> cols(
    #>   earn = col_double(),
    #>   height = col_double(),
    #>   sex = col_character(),
    #>   ed = col_integer(),
    #>   age = col_integer(),
    #>   race = col_character()
    #> )
    
    read_csv("a,b,c
    1,2,3
    4,5,6")
    #> # A tibble: 2 x 3
    #>       a     b     c
    #>   <int> <int> <int>
    #> 1     1     2     3
    #> 2     4     5     6
    

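    Notice that, unlike read.csv(), read_csv() returns a tibble. This can cause trouble for older code that expects a plain data frame; in that case convert the result with as.data.frame(). Conversely, a data frame read with read.csv() can be turned into a tibble with as_tibble().
    Special cases: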

    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)
    #> # A tibble: 1 x 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
    
    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
    #> # A tibble: 1 x 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
    
    read_csv("1,2,3\n4,5,6", col_names = FALSE)
    #> # A tibble: 2 x 3
    #>      X1    X2    X3
    #>   <int> <int> <int>
    #> 1     1     2     3
    #> 2     4     5     6
    
    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
    #> # A tibble: 2 x 3
    #>       x     y     z
    #>   <int> <int> <int>
    #> 1     1     2     3
    #> 2     4     5     6
    
    read_csv("a,b,c\n1,2,.", na = ".")
    #> # A tibble: 1 x 3
    #>       a     b c    
    #>   <int> <int> <chr>
    #> 1     1     2 <NA>
    

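    These options cover roughly 75% of the files you will meet in practice; tab-separated and fixed-width files can be handled with read_tsv() and read_fwf().

    How readr parses a file

    When readr reads a file it guesses the type of each column, using the functions guess_parser() and parse_guess():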

    guess_parser("2010-10-01")
    #> [1] "date"
    guess_parser("15:01")
    #> [1] "time"
    guess_parser(c("TRUE", "FALSE"))
    #> [1] "logical"
    guess_parser(c("1", "5", "9"))
    #> [1] "integer"
    guess_parser(c("12,352,561"))
    #> [1] "number"
    
    str(parse_guess("2010-10-10"))
    #>  Date[1:1], format: "2010-10-10"
    

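    This approach has two problems:

    • Guessing only looks at the first 1000 rows, so a column that holds numbers in the first 1000 rows and strings later will fail to parse.
    • If the first 1000 rows of a column are all NA, readr guesses a character vector, whatever type the later values have.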
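
Both problems can be sidestepped by stating the column types up front instead of relying on the guess. The sketch below uses a three-row literal table invented for the example, in which the early rows look like integers with missing dates.

```r
library(readr)

# Hypothetical data: x starts with whole numbers, y starts with NA.
csv <- I("x,y\n1,NA\n2,NA\n0.5,2015-01-16\n")

# An explicit specification does not depend on which rows are inspected.
df <- read_csv(csv, col_types = cols(x = col_double(), y = col_date()))

str(df)
```
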

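    We try this on readr_example("challenge.csv"). The data set has two columns, x and y: the first 1000 rows of x are integers and the rest are floating-point numbers, while the first 1000 rows of y are NA and the rest are dates: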

    challenge <- read_csv(readr_example("challenge.csv"))
    #> Parsed with column specification:
    #> cols(
    #>   x = col_integer(),
    #>   y = col_character()
    #> )
    #> Warning in rbind(names(probs), probs_f): number of columns of result is not
    #> a multiple of vector length (arg 1)
    #> Warning: 1000 parsing failures.
    #> # A tibble: 5 x 5
    #>     row col   expected         actual       file                          
    #>   <int> <chr> <chr>            <chr>        <chr>                         
    #> 1  1001 x     no trailing cha… .2383797508… '/home/travis/R/Library/readr…
    #> 2  1002 x     no trailing cha… .4116799717… '/home/travis/R/Library/readr…
    #> 3  1003 x     no trailing cha… .7460716762… '/home/travis/R/Library/readr…
    #> 4  1004 x     no trailing cha… .7234505538… '/home/travis/R/Library/readr…
    #> 5  1005 x     no trailing cha… .6145241374… '/home/travis/R/Library/readr…
    #> See problems(...) for more details.
    

    Use problems() to retrieve the parsing failures:

    problems(challenge)
    #> # A tibble: 1,000 x 5
    #>     row col   expected         actual       file                          
    #>   <int> <chr> <chr>            <chr>        <chr>                         
    #> 1  1001 x     no trailing cha… .2383797508… '/home/travis/R/Library/readr…
    #> 2  1002 x     no trailing cha… .4116799717… '/home/travis/R/Library/readr…
    #> 3  1003 x     no trailing cha… .7460716762… '/home/travis/R/Library/readr…
    #> 4  1004 x     no trailing cha… .7234505538… '/home/travis/R/Library/readr…
    #> 5  1005 x     no trailing cha… .6145241374… '/home/travis/R/Library/readr…
    #> 6  1006 x     no trailing cha… .4739805692… '/home/travis/R/Library/readr…
    #> # ... with 994 more rows
    

    The best approach is to adjust the column types one at a time. Start from the default specification:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
        x = col_integer(),
        y = col_character()
      )
    )
    

    Then adjust the type of column x:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
        x = col_double(),
        y = col_character()
      )
    )
    
    tail(challenge)
    #> # A tibble: 6 x 2
    #>       x y         
    #>   <dbl> <chr>     
    #> 1 0.805 2019-11-21
    #> 2 0.164 2018-03-29
    #> 3 0.472 2014-08-04
    #> 4 0.718 2015-08-16
    #> 5 0.270 2020-02-04
    #> 6 0.608 2019-01-06
    
    

    This fixes the first problem. Next, adjust column y:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
        x = col_double(),
        y = col_date()
      )
    )
    
    tail(challenge)
    #> # A tibble: 6 x 2
    #>       x y         
    #>   <dbl> <date>    
    #> 1 0.805 2019-11-21
    #> 2 0.164 2018-03-29
    #> 3 0.472 2014-08-04
    #> 4 0.718 2015-08-16
    #> 5 0.270 2020-02-04
    #> 6 0.608 2019-01-06
    

    As noted above, type guessing uses only the first 1000 rows by default. Here it is enough to raise guess_max by one row, to 1001:

    challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
    #> Parsed with column specification:
    #> cols(
    #>   x = col_double(),
    #>   y = col_date(format = "")
    #> )
    challenge2
    #> # A tibble: 2,000 x 2
    #>       x y         
    #>   <dbl> <date>    
    #> 1   404 NA        
    #> 2  4172 NA        
    #> 3  3004 NA        
    #> 4   787 NA        
    #> 5    37 NA        
    #> 6  2332 NA        
    #> # ... with 1,994 more rows
    

    Sometimes it is more convenient to read every column in as character:

    challenge2 <- read_csv(readr_example("challenge.csv"), 
      col_types = cols(.default = col_character())
    )
    

    This works well in combination with type_convert():

    df <- tribble(
      ~x,  ~y,
      "1", "1.21",
      "2", "2.32",
      "3", "4.56"
    )
    df
    #> # A tibble: 3 x 2
    #>   x     y    
    #>   <chr> <chr>
    #> 1 1     1.21 
    #> 2 2     2.32 
    #> 3 3     4.56
    
    # Note the column types
    type_convert(df)
    #> Parsed with column specification:
    #> cols(
    #>   x = col_integer(),
    #>   y = col_double()
    #> )
    #> # A tibble: 3 x 2
    #>       x     y
    #>   <int> <dbl>
    #> 1     1  1.21
    #> 2     2  2.32
    #> 3     3  4.56
    

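    Writing files

    write_csv() and write_tsv() are the main functions for writing files. Strings are always written as UTF-8 and dates as ISO8601. To export a CSV file for Excel, use write_excel_csv(), which tells Excel that the file is UTF-8 encoded.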

    write_csv(challenge, "challenge.csv")
    

    Note that the type of every column is lost once the file has been written out:

    challenge
    #> # A tibble: 2,000 x 2
    #>       x y         
    #>   <dbl> <date>    
    #> 1   404 NA        
    #> 2  4172 NA        
    #> 3  3004 NA        
    #> 4   787 NA        
    #> 5    37 NA        
    #> 6  2332 NA        
    #> # ... with 1,994 more rows
    write_csv(challenge, "challenge-2.csv")
    read_csv("challenge-2.csv")
    #> Parsed with column specification:
    #> cols(
    #>   x = col_integer(),
    #>   y = col_character()
    #> )
    #> # A tibble: 2,000 x 2
    #>       x y    
    #>   <int> <chr>
    #> 1   404 <NA> 
    #> 2  4172 <NA> 
    #> 3  3004 <NA> 
    #> 4   787 <NA> 
    #> 5    37 <NA> 
    #> 6  2332 <NA> 
    #> # ... with 1,994 more rows
    

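    A better choice here is write_rds() and read_rds(), which store the data in RDS, R's own binary format. These two functions are wrappers around the base functions readRDS() and saveRDS():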

    write_rds(challenge, "challenge.rds")
    read_rds("challenge.rds")
    #> # A tibble: 2,000 x 2
    #>       x y         
    #>   <dbl> <date>    
    #> 1   404 NA        
    #> 2  4172 NA        
    #> 3  3004 NA        
    #> 4   787 NA        
    #> 5    37 NA        
    #> 6  2332 NA        
    #> # ... with 1,994 more rows
    

    The feather package is also recommended; its binary format is faster to read and write:

    library(feather)
    write_feather(challenge, "challenge.feather")
    read_feather("challenge.feather")
    #> # A tibble: 2,000 x 2
    #>       x      y
    #>   <dbl> <date>
    #> 1   404   <NA>
    #> 2  4172   <NA>
    #> 3  3004   <NA>
    #> 4   787   <NA>
    #> 5    37   <NA>
    #> 6  2332   <NA>
    #> # ... with 1,994 more rows
    

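    Reading other data formats

    • haven reads SPSS, Stata, and SAS files
    • readxl reads .xls and .xlsx files
    • DBI, together with a backend such as RMySQL, RSQLite, or RPostgreSQL, runs SQL queries against a database and returns a data frame
    • jsonlite reads JSON files
    • xml2 reads XML files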


    1. R reads a whole data set into a single variable

    2. Installing the readr package

    install.packages("readr")
    library(readr)

    3. Reading data

    read_delim() reads delimited lines

    read_csv2(file, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = c("", "NA"), quoted_na = TRUE,
      quote = "\"", comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
      guess_max = min(1000, n_max), progress = show_progress())
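    # col_names: whether to use the first row as column names
    # na: strings (the empty string or any values you specify) to treat as missing values
    # delim: the field delimiter within a record (an argument of read_delim())
    # quoted_na: should missing values inside quotes be treated as missing values (the default) or strings
    # skip: number of lines to skip before reading data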
    

    4. Example

    install.packages("readr")
    library(readr)
    setwd("C:\\Users\\yangzifeng_i\\Desktop\\7.23 注销司机去向和挽留分析\\过去一年每天注销司机uid")
    #dta0<-read.csv("logout_driver_daily.csv")  # cannot read this delimited file correctly
    dta0<-read_delim("logout_driver_daily.csv",  col_names =F,delim = "\t", 
                         na = c('', 'NA', 'NULL'),quoted_na = TRUE,skip = 13)
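
The same arguments can be tried on a small literal string before pointing them at a real file. In this sketch the two header lines and the data are invented, and quoted_na is left out because newer readr releases deprecate it (an assumption about the installed version).

```r
library(readr)

# Hypothetical tab-delimited export with two lines of metadata on top.
txt <- I("# header line 1\n# header line 2\n1\tNULL\n2\tb\n")

dta <- read_delim(
  txt,
  delim = "\t",
  col_names = FALSE,            # columns are named X1, X2
  na = c("", "NA", "NULL"),     # treat the string NULL as missing
  skip = 2,                     # drop the two metadata lines
  col_types = "ic"
)

dta
```
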

     

  • In the latest RStudio update the readr package has been linked to the "Import Dataset" button in RStudio, this used to be the utils package. This can cause problems downstream, ...
  • t have the same possible values like readr::col_types, like col_datetime(). I think that's really a cool and useful feature. Is it in the plan of future implementing? Thanks....
  • Compared with read.csv(), the readr functions have the following advantages: 1. They are fast, about 10 times faster. 2. They produce tibbles, do not convert character vectors to factors, do not use row names, and do not change column names. 3. They are easier to reproduce: base R functions depend on operating system features and environment variables, so ...
  • Reading and writing data with the readr and readxl packages. Reading data. Relevant functions: the readr and readxl packages provide a range of functions for reading data; the main ones are: #readr package read_delim(file, delim, quote = "\"", escape_backslash = FALSE, escape_...
  • Reading data into R quickly: the readr and readxl packages

    2017-12-01 18:02:06
    The readr package provides functions for reading text data into R. The readxl package provides functions for reading Excel spreadsheet data into R. Both are much faster than the functions you are probably using now. The readr package provides several functions for reading data into R. We usually use R's ...
  • In addition, and possibly related, when using readr() to import wide CSVs, only the first 50 column headings and attributes are shown in the preview window. This means that it isn't possible to...
  • % readr::type_convert() would be the default output of just running readxl::read_excel("data.xlsx") (This question comes from the open-source project tidyverse/readxl.)
  • I have many many issues with DT not working correctly after doing some dplyr manipulation or using readr to read in csv data which results in the DF to have both DT and tbl_df formats. I have posted ...
  • s not possible to read whitespace-delimited datasets through the readr Import Dataset dialog. Compare e.g. R > input readr::read_table(input) Parsed with column specification: cols( a &...
  • When installing the tidyverse package in R, I get "In install.packages("tidyverse") : installation of package ‘readr’ had non-zero exit status". What is going on? Any help appreciated.
  • Solved: when installing R packages, the message “package ‘readr’ is not available (for R version 3.5.1)” often appears
  • This article was translated and edited by 雪晴数据网 (Xueqing Data); for the original, see New packages for reading data into R — fast by David Smith. ... Hadley Wickham and the RStudio team have written some new R packages that are useful for everyone who needs to read data into R ... the readr package ...
  • tidyverse: readr

    2018-06-04 11:46:00
    The readr package is used for reading data. Compared with base R, its advantage is speed, more than ten times faster; compared with data.table it is somewhat slower (its author, Hadley, puts the gap at about 1.2 to 2 times), but it can parse the data more precisely while reading. Below ...
  • read_stan_csv with readr

    2020-11-20 23:35:39
    Summary: Use Hadley's read_csv to read in CSV files instead of scan. Intended Effect: Speed up CSV reading by nearly a factor of 2....
  • When installing R packages I usually use one of two methods: ... source("http://bioconductor.org/biocLite.R") biocLite("readr") Even with these two methods, the problem "cannot open URL 'https://mirrors.eliteu.cn/CRAN/src/contr...
  • ⑥ Reading with read_csv() from the readr package, which is suitable for ... > test("C:/Users/admin/Desktop/test.csv") Parsed with column specification: cols( X1 = col_character(), mpg = col_...
  • This question comes from the open-source project IndrajeetPatil/ggstatsplot.
  • System details: RStudio Edition : Desktop RStudio Version : 1.2.1226 OS Version : Windows 10 R Version : 3.5.2 Steps to reproduce the problem ...
  • The readr package: reading and writing text data

    2017-09-27 10:08:47
    Another great tool from Hadley: it makes it easy to read and write text data, and is far faster than the traditional functions. Reading data: text is read in as character strings automatically, with no need to set stringsAsFactors = FALSE 1. read_csv Read a delimited file ...
