• Understanding fsync in depth

    2021-01-27 14:30:40

1 Introduction

Database systems have faced a thorny problem since the day they were born: the performance of fsync. Group commit exists precisely to mitigate the cost of fsync. Recently, a business team reported that creating partitioned tables in MySQL was very slow. A closer analysis showed that InnoDB issues many fsync calls while creating a table: four fsync calls per file. Of course, not every fsync is equally expensive.

[figure: fsync calls observed during InnoDB table creation]

This raises several questions:

(1) Question 1: Why is fsync relatively expensive? What exactly does it do?

(2) Question 2: A careful observer will notice that after the data file is opened, the second fsync call takes far less time than the first one. Why?

[figure: timing of the first vs. second fsync call]

(3) Question 3: Can fsync be optimized?

With these questions in mind, let's take a closer look at fsync.

2 Analysis

Let's first use a small test program to study the basic flow of fsync through the block layer.

2.1 Test program 1

    Write page 0

    Sleep 5

    Fsync
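A minimal C version of this test, under stated assumptions (the file name and the 4KiB page size are placeholders):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char page[4096];
        memset(page, 'a', sizeof(page));

        /* ordinary buffered I/O: no O_SYNC, no O_DIRECT */
        int fd = open("test.dat", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        pwrite(fd, page, sizeof(page), 0);  /* write page 0 */
        sleep(5);                           /* separate the two phases in blktrace */
        fsync(fd);                          /* force the page and the journal to disk */

        close(fd);
        return 0;
    }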

Tracing with blktrace gives the following result:

[figure: blktrace trace of pwrite followed by fsync]

The red box in the upper half shows pwrite's path through the block layer; the yellow box in the lower half shows fsync's path. The gap between them is exactly 5 seconds.

Sector 4722712 corresponds to the first block of the test file: 590339 (block number) * 8 = 4722712 (sector number).

[figure: block-to-sector mapping of the test file]

For both pwrite and fsync, the bulk of the cost is incurred between submitting the I/O request to the driver and the completion of the I/O; in other words, the overhead comes from the device driver. It accounts for roughly half of the total cost of the system call.

We can also see that the fsync call triggers three block-layer I/Os, with starting sectors 19240, 19248 and 19256: three physically contiguous blocks. These three blocks are in fact journal records written by the kernel thread kjournald, namely a descriptor block (2405), a data block (2406) and a commit block (2407). To verify this, let's look at the actual contents of these blocks.

Block 2405:

[figure: hex dump of block 2405]

    #define JFS_MAGIC_NUMBER 0xc03b3998U

    #define JFS_DESCRIPTOR_BLOCK 1

    #define JFS_COMMIT_BLOCK 2

The first 4 bytes are JFS_MAGIC_NUMBER, followed by the block type: JFS_DESCRIPTOR_BLOCK.
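These constants belong to the jbd on-disk journal block header. For reference, the header layout as defined in include/linux/jbd.h of ext3-era kernels (quoted from memory, so treat the exact types as an approximation):

    /* On-disk header carried by every journal descriptor/commit block */
    typedef struct journal_header_s {
        __be32 h_magic;      /* JFS_MAGIC_NUMBER, 0xc03b3998 */
        __be32 h_blocktype;  /* JFS_DESCRIPTOR_BLOCK, JFS_COMMIT_BLOCK, ... */
        __be32 h_sequence;   /* id of the transaction this block belongs to */
    } journal_header_t;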

Block 2407:

[figure: hex dump of block 2407]

It is indeed a commit block.

2.2 The implementation of fsync

Since fsync is so expensive, let's look at the code.

The function ext3_sync_file:

[figure: source of ext3_sync_file]

The function log_start_commit is responsible for waking up the kjournald kernel thread; log_wait_commit waits for the jbd transaction commit to complete.

[figure: source of log_start_commit and log_wait_commit]

From the code, fsync's main cost is the wait after calling log_wait_commit; that is, fsync does not return until kjournald has finished committing the transaction.

At this point we know the two main sources of fsync's cost: (1) the device-driver layer; (2) ext3 journal writes.

Moreover, when log_start_commit returns 0, fsync does not wait for the transaction commit to finish. At this point we can essentially confirm why the second fsync is so cheap: it does not wait for a transaction commit.
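The following is a simplified sketch of that commit-and-wait pattern, using the jbd interfaces named in the article; it is a paraphrase for illustration, not the literal ext3_sync_file source:

    /* Ask jbd to commit transaction `tid`, then wait for kjournald.
     * Sketch only: in the kernel these come from include/linux/jbd.h. */
    static int fsync_commit_sketch(journal_t *journal, tid_t tid)
    {
        /* Wakes kjournald if the transaction still has to be committed;
         * returns 0 when there is nothing new to commit. */
        if (log_start_commit(journal, tid))
            /* Sleep until the descriptor, data and commit blocks of the
             * journal have reached the disk: the expensive part of fsync. */
            return log_wait_commit(journal, tid);

        return 0;  /* no pending transaction: fsync returns almost at once */
    }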

Let's verify this idea. To make debugging easier, kernel jbd debug logging was enabled.

2.3 Test program 2

    Write page 0

    Fsync

    Write page 0

    Fsync

    Write page 1

    Fsync

    Write page 2

    Fsync

[figure: jbd debug log for test program 2]

The log in the second red box shows that the second fsync indeed does not wait, which is why its cost is so small, while the other three fsync calls all go through log_wait_commit.

Question 4: Why doesn't the second fsync call log_wait_commit?

Because the file system is mounted with data=writeback, writing the data itself produces no jbd journal records. The second pwrite does not extend the file; it only updates the ext3 inode's i_mtime, and i_mtime is precise only to the second. So the second pwrite changes nothing in the inode, no jbd journal record is generated, and there is no transaction commit to wait for.


Let's verify this.

2.4 Test program 3

    Write page 0

    Fsync

    Sleep 1 second

    Write page 0

    Fsync

    Write page 1

    Fsync

    Write page 2

    Fsync

Before the second pwrite, we sleep for one second to guarantee that the ext3 inode's i_mtime changes.
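A matching C sketch of test program 3 (test program 2 is identical minus the sleep; the file name is again a placeholder):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char page[4096];
        memset(page, 'a', sizeof(page));

        int fd = open("test.dat", O_CREAT | O_RDWR, 0644);
        if (fd < 0) return 1;

        pwrite(fd, page, sizeof(page), 0);     fsync(fd);  /* page 0 */
        sleep(1);                  /* bump i_mtime into a new second */
        pwrite(fd, page, sizeof(page), 0);     fsync(fd);  /* page 0 again */
        pwrite(fd, page, sizeof(page), 4096);  fsync(fd);  /* page 1 */
        pwrite(fd, page, sizeof(page), 8192);  fsync(fd);  /* page 2 */

        close(fd);
        return 0;
    }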

[figure: jbd debug log for test program 3]

The idea is confirmed: the second fsync's time is back to the normal level.

[figure: jbd debug log showing the second fsync committing a new transaction]

We can see that the second fsync now commits a new transaction and calls log_wait_commit to wait for the commit to finish.

3 Optimization

How can fsync be optimized? That is a hard problem.

(1) Have the system issue fewer fsync calls.

Author: YY哥

Source: http://www.cnblogs.com/hustcat/

Copyright belongs jointly to the author and cnblogs. Reposting is welcome, but unless the author agrees otherwise this notice must be retained and a prominent link to the original must appear on the page; the author otherwise reserves the right to pursue legal liability.

• sync, fsync and fflush: flushing cached file data to the storage device

Description:

Synchronize all modified file data in memory to the storage device.

Usage:

#include <unistd.h>

int fsync(int fd);

Parameters:

fd: the file descriptor.

Return value:

Returns 0 on success. On failure, -1 is returned and errno is set to one of the following values:

EBADF: the file descriptor is invalid

EIO: an I/O error occurred while reading or writing

EROFS, EINVAL: the file is on a file system that does not support synchronization

Forcing the system cache to the file: the sync and fsync functions; the relationship and difference between fflush and fsync (2010-05-10 11:25)

Traditional UNIX implementations have a buffer cache in the kernel, and most disk I/O goes through it. When data is written to a file, the kernel normally copies it into a buffer first; if the buffer is not yet full, it is not queued for output. Instead, the kernel waits until the buffer fills up, or until it needs to reuse the buffer for other disk blocks, and only then queues the buffer for output; the actual I/O happens once the buffer reaches the head of the queue. This style of output is called delayed write (Bach [1986], Chapter 3, discusses delayed writes in detail). Delayed writes reduce the number of disk operations, but they also delay updates to file contents: data destined for a file may not reach the disk for some time. If the system crashes, this delay can lose the updated contents. To keep the actual file system on disk consistent with the contents of the buffer cache, UNIX provides the sync and fsync system calls.

#include <unistd.h>

void sync(void);

int fsync(int filedes);

Returns: 0 on success, -1 on error

sync merely queues all modified block buffers for writing and then returns; it does not wait for the actual I/O to finish. A system daemon process (usually called update) typically calls sync every 30 seconds, which guarantees that the kernel's block buffers are flushed regularly. The sync(1) command also calls the sync function.

fsync refers to a single file (specified by the file descriptor filedes) and waits for the I/O to finish before returning. fsync can be used by applications such as databases that must ensure modified blocks are written to disk immediately. Compare fsync with the O_SYNC flag (see Section 3.13): fsync flushes a file's contents when it is called, whereas with O_SYNC the contents are flushed on every write to the file.

The relationship and difference between fflush and fsync

[repost] http://blog.chinaunix.net/u2/73874/showart_1421917.html

1. Provider: fflush is a function provided by libc.a; fsync is a system call provided by the kernel.

2. Prototype: fflush takes a FILE * argument: fflush(FILE *); fsync takes an int file descriptor: fsync(int fd).

3. Function: fflush pushes the C library buffer down via write, i.e., into the kernel's buffer; fsync flushes the kernel buffer to disk.

C library buffer --- fflush ---> kernel buffer --- fsync ---> disk
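A minimal C sketch of that chain: flush the stdio buffer into the kernel, then force the kernel page cache out to the device (fp is assumed to be a stream opened with fopen):

    #include <stdio.h>
    #include <unistd.h>

    /* Push a stdio stream all the way down to the storage device. */
    int flush_stream_to_disk(FILE *fp)
    {
        if (fflush(fp) != 0)          /* C library buffer -> kernel buffer */
            return -1;
        if (fsync(fileno(fp)) != 0)   /* kernel buffer -> disk */
            return -1;
        return 0;
    }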

Here is another piece, reposted in English:

    Write-back support

    UBIFS supports write-back, which means that file changes do not go to the flash media straight away, but they are cached and go to the flash later, when it is absolutely necessary. This helps to greatly reduce the amount of I/O which results in better performance. Write-back caching is a standard technique which is used by most file systems like ext3 or XFS.

In contrast, JFFS2 does not have write-back support and all JFFS2 file system changes go to the flash synchronously. Well, this is not completely true and JFFS2 does have a small buffer of a NAND page size (if the underlying flash is NAND). This buffer contains the last written data and is flushed once it is full. However, because the amount of cached data is very small, JFFS2 is very close to a synchronous file system.

Write-back support requires application programmers to take extra care to synchronize important files in time. Otherwise the files may become corrupted or disappear in case of power cuts, which happen very often in many embedded devices. Let's take a glimpse at the Linux manual pages:

    $ man 2 write

    ....

    NOTES

    A successful return from write() does not make any guarantee that data

    has been committed to disk. In fact, on some buggy implementations, it

    does not even guarantee that space has successfully been reserved for

    the data. The only way to be sure is to call fsync(2) after you are

    done writing all your data.

    ...

This is true for UBIFS (except for the "some buggy implementations" part, because UBIFS does reserve space for cached dirty data). This is also true for JFFS2, as well as for any other Linux file system.

However, some (perhaps not very good) user-space programmers do not take write-back into account. They do not read manual pages carefully. When such applications are used in embedded systems which run JFFS2, they work fine, because JFFS2 is almost synchronous. Of course, the applications are buggy, but they appear to work well enough with JFFS2. The bugs show up when UBIFS is used instead. Please be careful and check/test your applications with respect to power-cut tolerance if you switch from JFFS2 to UBIFS. The following is a list of useful hints and advice.

If you want to switch to synchronous mode, use the -o sync option when mounting UBIFS; however, file system performance will drop, so be careful. Also remember that UBIFS mounted in synchronous mode provides fewer guarantees than JFFS2; refer to this section for details.

Always keep in mind the above statement from the manual pages and run fsync() for all important files you change; of course, there is no need to synchronize "throw-away" temporary files. Just think about how important the file data is and decide; and do not use fsync() unnecessarily, because it will hurt performance.

If you want to be more precise, you may use fdatasync(), in which case only data changes will be flushed, but not inode meta-data changes (e.g., "mtime" or permissions); this may be cheaper than fsync() if the synchronization is done often, e.g., in a loop; otherwise just stick with fsync().

In a shell, the sync command may be used, but it synchronizes the whole file system, which may not be optimal; and there is a similar libc sync() function.

You may use the O_SYNC flag of the open() call; this will make sure all the data (but not meta-data) changes go to the media before the write() operation returns; but in general it is better to use fsync(), because O_SYNC makes each write synchronous, while fsync() allows many writes to be accumulated and synchronized at once.

It is possible to make certain inodes synchronous by default by setting the "sync" inode flag; in a shell, the chattr +S command may be used; in C programs, use the FS_IOC_SETFLAGS ioctl command (see the sketch after this list). Note that the mkfs.ubifs tool checks for the "sync" flag in the original FS tree, so the synchronous files in the original FS tree will be synchronous in the resulting UBIFS image.
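For reference, a user-space sketch of setting the "sync" inode flag with the FS_IOC_SETFLAGS ioctl, the programmatic equivalent of chattr +S (the path argument is illustrative):

    #include <fcntl.h>
    #include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_SYNC_FL */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Make all future I/O to this inode synchronous by default. */
    int make_inode_synchronous(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        int flags;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) { perror("get flags"); close(fd); return -1; }
        flags |= FS_SYNC_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) { perror("set flags"); close(fd); return -1; }

        close(fd);
        return 0;
    }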

    Let us stress that the above items are true for any Linux file system, including JFFS2.

    fsync() may be called for directories - it synchronizes the directory inode meta-data. The "sync" flag may also be set for directories to make the directory inode synchronous. But the flag is inherited, which means all new children of this directory will also have this flag. New files and sub-directories of this directory will also be synchronous, and their children, and so forth. This feature is very useful if one needs to create a whole sub-tree of synchronous files and directories, or to make all new children of some directory to be synchronous by default (e.g., /etc).

The fdatasync() call for directories is a no-op in UBIFS, and all UBIFS operations which change directory entries are synchronous. However, you should not assume this for portability (e.g., this is not true for ext2). Similarly, the "dirsync" inode flag has no effect in UBIFS.

    The functions mentioned above work on file-descriptors, not on streams (FILE *). To synchronize a stream, you should first get its file descriptor using the fileno() libc function, then flush the stream using fflush(), and then synchronize the file using fsync() or fdatasync(). You may use other synchronization methods, but remember to flush the stream before synchronizing the file. The fflush() function flushes the libc-level buffers, while sync(), fsync(), etc flush kernel-level buffers.

Please refer to this FAQ entry for information about how to atomically update the contents of a file. Theodore Tso's article is also good reading.

    Write-back knobs in Linux

Linux has several knobs in "/proc/sys/vm" which you may use to tune write-back. The knobs are global, so they affect all file systems. Please refer to the "Documentation/sysctl/vm.txt" file for more information; the file may be found in the Linux kernel source tree. Below, the interesting knobs are described in a UBIFS context and in simplified form.

    dirty_writeback_centisecs - how often the Linux periodic write-back thread wakes up and writes out dirty data. This is a mechanism which makes sure all dirty data hits the media at some point.

    dirty_expire_centisecs - dirty data expire period. This is maximum time data may stay dirty. After this period of time it will be written back by the Linux periodic write-back thread. IOW, the periodic write-back thread wakes up every "dirty_writeback_centisecs" centi-seconds and synchronizes data which was dirtied "dirty_expire_centisecs" centi-seconds ago.

    dirty_background_ratio - maximum amount of dirty data in percent of total memory. When the amount of dirty data becomes larger, the periodic write-back thread starts synchronizing it until it becomes smaller. Even non-expired data will be synchronized. This may be used to set a "soft" limit for the amount of dirty data in the system.

    dirty_ratio - maximum amount of dirty data at which writers will first synchronize the existing dirty data before adding more. IOW, this is a "hard" limit of the amount of dirty data in the system.

    Note, UBIFS additionally has small write-buffers which are synchronized every 3-5 seconds. This means that most of the dirty data are delayed by dirty_expire_centisecs centi-seconds, but the last few KiB are additionally delayed by 3-5 seconds.

    UBIFS write-buffer

UBIFS is an asynchronous file system (read this section for more information). Like other Linux file systems, it utilizes the page cache. The page cache is a generic Linux memory-management mechanism. It may be very large and cache a lot of data. When you write to a file, the data are written to the page cache, marked as dirty, and the write returns (unless the file is synchronous). Later the data are written back.

The write-buffer is an additional UBIFS buffer, implemented inside UBIFS, which sits between the page cache and the flash. This means that write-back actually writes to the write-buffer, not directly to the flash.

The write-buffer is designed to speed up UBIFS on NAND flashes. NAND flashes consist of NAND pages, which are usually 512 bytes, 2KiB or 4KiB in size. A NAND page is the minimal read/write unit of NAND flash (see this section).

The write-buffer size is equal to the NAND page size (so it is tiny compared to the page cache). Its purpose is to accumulate small writes and write full NAND pages instead of partially filled ones. Indeed, imagine we have to write four 512-byte nodes at half-second intervals, and the NAND page size is 2KiB. Without the write-buffer we would have to write 4 NAND pages and waste 6KiB of flash space, while the write-buffer allows us to write only once and waste nothing. We write less, we create less dirty space so the UBIFS garbage collector will have to do less work, and we save power.

Of course, the example shows an ideal situation, and even with the write-buffer we may waste space, for example in case of synchronous I/O, or if the data arrives at long intervals. This is because the write-buffer has an associated timer, which flushes it every 3-5 seconds, even if it isn't full. We do this for data integrity reasons.

Of course, when UBIFS has to write a lot of data, it does not use the write-buffer. Only the last part of the data, which is smaller than the NAND page, ends up in the write-buffer and waits for more data until it is flushed by the timer.

    The write-buffer implementation is a little more complex, and we actually have several of them - one for each journal head. But this does not change the basic idea behind the write-buffer.

A few notes regarding synchronization:

    "sync()" also synchronizes all write-buffers;

    "fsync(fd)" also synchronizes all write-buffers which contain pieces of "fd";

synchronous files, as well as files opened with "O_SYNC", bypass write-buffers, so the I/O is indeed synchronous for these files;

    write-buffers are also bypassed if the file-system is mounted with the "-o sync" mount option.

Take into account that write-buffers delay the data synchronization timeout defined by "dirty_expire_centisecs" (see here) by 3-5 seconds. However, since write-buffers are small, only a small amount of data is delayed.

    UBIFS in synchronous mode vs JFFS2

When UBIFS is mounted in synchronous mode (the -o sync mount option), all file system operations become synchronous. This means that all data are written to flash before the file system operations return.

    For example, if you write 10MiB of data to a file f.dat using the write() call, and UBIFS is in synchronous mode, then UBIFS guarantees that all 10MiB of data and the meta-data (file size and date changes) will reach the flash media before write() returns. And if a power cut happens after the write() call returns, the file will contain the written data.

The same is true for situations when f.dat was opened with O_SYNC or has the sync flag (see man 2 chattr).

It is well known that the JFFS2 file system is synchronous (except for a small write-buffer). However, UBIFS in synchronous mode is not the same as JFFS2 and provides somewhat fewer guarantees than JFFS2 does with respect to sudden power cuts.

    In JFFS2 all the meta-data (like inode atime/mtime/ctime, inode size, UID/GID, etc) are stored in the data node headers. Data nodes carry 4KiB of (compressed) data. This means that the meta-data information is duplicated in many places, but this also means that every time JFFS2 writes a data node to the flash media, it updates inode size as well. So when JFFS2 mounts it scans the flash media, finds the latest data node, and fetches the inode size from there.

In practice this means that JFFS2 will write these 10MiB of data sequentially, from the beginning to the end. And if you have a power cut, you will just lose some amount of data at the end of the inode. For example, if JFFS2 starts writing those 10MiB of data, writes 5MiB, and a power cut happens, you will end up with a 5MiB f.dat file. You lose only the last 5MiB.

Things are a little more complex in the case of UBIFS, where data are stored in data nodes and meta-data are stored in (separate) inode nodes. The meta-data are not duplicated in each data node as in JFFS2. UBIFS never writes data nodes beyond the on-flash inode size. If it has to write a data node and the data node is beyond the on-flash inode size (the in-memory inode has the up-to-date size, but it is dirty and has not been flushed yet), then UBIFS first writes the inode to the media, and then it starts writing the data. And if a power cut interrupts this, you lose data nodes and you have holes (or old data nodes, if you are overwriting). Let's consider an example.

    User creates an empty file f.dat. The file is synchronous, or UBIFS is mounted in synchronous mode. User calls the write() function with a 10MiB buffer.

    The kernel first copies all 10MiB of the data to the page cache. Inode size is changed to 10MiB as well and the inode is marked as dirty. Nothing has been written to the flash media so far. If a power cut happens at this point, the user will end up with an empty f.dat file.

UBIFS sees that the I/O has to be synchronous and starts synchronizing the inode. First of all, it writes the inode node to the flash media. If a power cut happens at this moment, the user will end up with a 10MiB file which contains no data (a hole), and reading this file will return 10MiB of zeroes.

    UBIFS starts writing the data. If a power cut happens at this point, the user will end up with a 10MiB file containing a hole at the end.

    Note, if the I/O was not synchronous, UBIFS would skip the last step and would just return. And the actual write-back would then happen in back-ground. But power cuts during write-back could anyway lead to files with holes at the end.

Thus, synchronous I/O in UBIFS provides fewer guarantees than JFFS2 I/O: UBIFS can leave holes at the end of files. In an ideal world, applications should not assume anything about the contents of files which were not synchronized before a power cut happened. And "mainstream" file systems like ext3 do not provide JFFS2-like guarantees either.

    However, UBIFS is sometimes used as a JFFS2 replacement and people may want it to behave the same way as JFFS2 if it is mounted synchronously. This is doable, but needs some non-trivial development, so this was not implemented so far. On the other hand, there was no strong demand. You may implement this as an exercise, or you may try to convince UBIFS authors to do this.

    Synchronization exceptions for buggy applications

    As this section describes, UBIFS is an asynchronous file-system, and applications should synchronize their files whenever it is required. The same applies to most Linux file-systems, e.g. XFS.

However, many applications ignore this and do not synchronize files properly. And there was a huge war between user-space and kernel developers related to the ext4 delayed allocation feature. Please see Theodore Tso's blog post. More information may be found in this LWN article.

In short, the flame war was about two cases. The first case was the atomic re-name, where many user-space programs did not synchronize the copy before re-naming it. The second case was about applications which truncate files, then change them. There was no final agreement, but the "we cannot ignore the real world" argument met with the ext4 developers' understanding, and there were two ext4 changes which helped with both problems.

Roughly speaking, the first change made ext4 synchronize files on close if they were previously truncated. This was a hack from the file system point of view, but it "fixed" applications which truncate files, write new contents, and close the files without synchronizing them.

    The second change made ext4 synchronize the renamed file.

Well, this is not an exactly correct description, because ext4 does not write the files synchronously, but actually initiates asynchronous write-out of the files, so the performance hit is not very high. For the truncation case this means that the file is synchronized soon after it is closed. For the re-name case this means that ext4 writes the data before it writes the re-name meta-data.

However, application writers should never rely on these things, because this is not portable. Instead, they should properly synchronize files. The ext4 fixes were made because there were already many broken user-space applications in the wild.

    We have plans to implement these features in UBIFS, but this has not been done yet. The problem is that UBI/MTD are fully synchronous and we cannot initiate asynchronous write-out, so we'd have to synchronously write files on close/rename, which is slow. So implementing these features would require implementing asynchronous I/O in UBI, which is a big job. But feel free to do this :-).

• Understanding MySQL's innodb_flush_method

This article is largely a repost. The official site documents MySQL's flush-method setting, but many people find the description unclear; the text below explains it in detail.

First, the official documentation says the following:

    http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_method

innodb_flush_method

Command-Line Format: --innodb_flush_method=name
Option-File Format: innodb_flush_method
System Variable Name: innodb_flush_method
Variable Scope: Global
Dynamic Variable: No

Permitted Values (<= 5.6.6), Linux / HP-UX / Solaris:
Type: string
Default: fdatasync
Valid Values: O_DSYNC, O_DIRECT

Permitted Values (>= 5.6.7), Linux / HP-UX / Solaris:
Type: string
Default: fdatasync
Valid Values: fdatasync, O_DSYNC, O_DIRECT, O_DIRECT_NO_FSYNC

Controls the system calls used to flush data to the InnoDB data files and log files, which can influence I/O throughput. This variable is relevant only for Unix and Linux systems. On Windows systems, the flush method is always async_unbuffered and cannot be changed.

By default, InnoDB uses the fsync() system call to flush both the data and log files. If the innodb_flush_method option is set to O_DSYNC, InnoDB uses O_SYNC to open and flush the log files, and fsync() to flush the data files.

If O_DIRECT is specified (available on some GNU/Linux versions, FreeBSD, and Solaris), InnoDB uses O_DIRECT (or directio() on Solaris) to open the data files, and uses fsync() to flush both the data and log files. Note that InnoDB uses fsync() instead of fdatasync(), and it does not use O_DSYNC by default because there have been problems with it on many varieties of Unix.

An alternative setting is O_DIRECT_NO_FSYNC: it uses the O_DIRECT flag during flushing I/O, but skips the fsync() system call afterwards. This setting is suitable for some types of file systems but not others. For example, it is not suitable for XFS. If you are not sure whether the file system you use requires an fsync(), for example to preserve all file metadata, use O_DIRECT instead.

Depending on hardware configuration, setting innodb_flush_method to O_DIRECT or O_DIRECT_NO_FSYNC can have either a positive or negative effect on performance. Benchmark your particular configuration to decide which setting to use, or whether to keep the default. Examine the Innodb_data_fsyncs status variable to see the overall number of fsync() calls done with each setting. The mix of read and write operations in your workload can also affect which setting performs better for you. For example, on a system with a hardware RAID controller and battery-backed write cache, O_DIRECT can help to avoid double buffering between the InnoDB buffer pool and the operating system's file system cache. On some systems where InnoDB data and log files are located on a SAN, the default value or O_DSYNC might be faster for a read-heavy workload with mostly SELECT statements. Always test this parameter with the same type of hardware and workload that reflects your production environment. For general I/O tuning advice, see Section 8.5.7, "Optimizing InnoDB Disk I/O".

Formerly, a value of fdatasync also specified the default behavior. This value was removed, due to confusion that a value of fdatasync caused fsync() system calls rather than fdatasync() for flushing. To obtain the default value now, do not set any value for innodb_flush_method at startup.

The passage above mentions the fsync() and fdatasync() system calls; the text below explains them in detail.

Earlier, while studying the MySQL parameter innodb_flush_method, I ran into system calls such as fsync/fdatasync (what is a system call, and how does it differ from a library function? See here). Let's briefly analyze the differences among sync/fsync/fdatasync.

sync(): its prototype is void sync(void). "A call to this function will not return as long as there is data which has not been written to the device": a synchronous write, in theory not returning until the data is on the physical device. In reality that is not quite what happens. The kernel manual explains this in the BUGS section (the BUGS section you see in Linux man pages): "According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)" In other words, sync() puts the requests to write to the physical device into the write queue, but the writes are not necessarily complete when it returns.

fsync(int fd): "The fsync function can be used to make sure all data associated with the open file fildes is written to the device associated with the descriptor." fsync() writes the file referred to by a file descriptor (a file descriptor is how Unix and Unix-like systems refer to an open file, rather like a handle) out to the physical device, and it is a genuinely synchronous write: it does not return until the write has completed, and it also pushes the file's own metadata, such as atime and mtime, to the physical device.

fdatasync(int fd): "When a call to the fdatasync function returns, it is ensured that all of the file data is written to the device." It only guarantees that the file's data has all been written to the physical device; some metadata is not necessarily flushed, which is the difference between it and fsync.

With all three system calls briefly introduced, why do we need all three? Simply put, it comes down to application needs: sync is global and flushes the whole system; fsync targets a single file; and fdatasync was designed for the case where basic metadata such as atime/mtime will not cause inconsistencies for later reads, so skipping the metadata sync may improve performance (how large the performance difference between fsync and fdatasync actually is, I do not know whether anyone has measured). So the three calls serve different needs.

Next, let's talk about flushing dirty pages, i.e., the synchronous writes described above (blocking until the write completes). Why "dirty" pages? A dirty page is a page in the cache (generally in memory) that is inconsistent with the corresponding page on the physical device, the inconsistency arising because it was modified in memory. To persist in-memory modifications to the physical disk, we must flush them from memory to disk. As I understand it, caches generally fall into these categories: (1) the application's own cache, such as InnoDB's buffer pool; (2) the OS-level cache; (3) the storage device's own cache, e.g., a RAID controller usually manages its own cache; (4) the disk itself may have a small cache (I am not sure about this one, it is just my guess; even if it exists it must be tiny). Most of the time, "flushing dirty pages" means application cache -> OS cache -> physical device. If the physical device has no cache, the data is durable at that point. But if the disks sit behind a RAID card with its own cache, the data has not yet truly been persisted, since it has only reached the RAID card's cache rather than the physical device. However, because RAID cards usually carry a backup battery, a power failure at this point still does not lose data.

As mentioned, applications often have their own caching; have you considered that this duplicates the OS cache? The answer is: it does. I noticed all of this while studying the MySQL parameter innodb_flush_method, which selects the flush strategy. MySQL offers the three options fdatasync/O_DSYNC/O_DIRECT, with fdatasync as the default (see the blog post for details). Here I mainly want to explain why the O_DIRECT option is provided: it tells the OS that InnoDB will bypass the OS cache when reading and writing data, because InnoDB maintains its own cache, the buffer pool, and using the OS cache as well would duplicate it to some degree. The article referenced earlier says O_DIRECT improves efficiency for heavy random reads and writes but hurts sequential reads and writes, so choose according to your needs; if your MySQL serves OLTP workloads, O_DIRECT is basically the right choice.
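For illustration, a minimal user-space sketch of what using O_DIRECT involves (the file name is hypothetical, and the 4096-byte alignment is an assumption; the required alignment depends on the file system and device):

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the OS page cache entirely. */
        int fd = open("data.ibd", O_CREAT | O_RDWR | O_DIRECT, 0644);
        if (fd < 0) return 1;

        /* O_DIRECT requires buffer, offset and length to be aligned. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
        memset(buf, 0, 4096);

        pwrite(fd, buf, 4096, 0);  /* goes straight to the device queue */
        fsync(fd);                 /* still needed to flush metadata and the disk cache */

        free(buf);
        close(fd);
        return 0;
    }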

• Is data safe after fsync? (fsync, fwrite, fflush, mmap and write barriers explained)

Series articles:

Linux I/O: is data safe after fsync? (fsync, fwrite, fflush, mmap and write barriers explained)

Linux I/O series: Direct I/O, how it works and how to use it

Linux I/O series: getting data onto disk quickly without fsync (what posix_fadvise does and how to use it in practice)

1. Preface

For a while I had been studying sequential versus random disk writes and Java direct memory, and I kept seeing concepts such as flush and mmap in all sorts of material and source code. I went after them one by one, but after finally sorting them all out and looking back, I found I had become more and more confused, and ended up with the following questions:

1. What is the difference between fsync and the fwrite/fflush combination?

2. What is the relationship between mmap and fsync?

3. Everyone says data will not be lost after fsync; is that really true?

4. Is data safe once it has been written to disk?

5. Why can't we simply close a file; why do we need to flush it first?

With these questions I set out on another long round of research and finally arrived at a fresh understanding of these functions and calls. Let's analyze them step by step and, in the end, answer the questions above.

PS: Detailed side-by-side introductions to these calls are hard to find online, so everything that follows is my own synthesis of various sources. There may be places where my understanding falls short; corrections from experts are very welcome.

2. The system calls involved

Most of the content below comes from Baidu Baike.

1. fsync

Calling fsync guarantees that the file's modification time is updated as well. The fsync system call lets you precisely force each write to be flushed to disk. You can also open a file for synchronous I/O, which causes all written data to be committed to disk immediately; synchronous I/O is enabled by passing the O_SYNC flag to open, as sketched below.
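A minimal sketch of the O_SYNC variant just mentioned (the file name is hypothetical):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_SYNC: each write() returns only after the data (and the
         * metadata needed to retrieve it) has been committed to the device. */
        int fd = open("journal.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) return 1;

        write(fd, "entry\n", 6);   /* already durable when this returns */
        close(fd);
        return 0;
    }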

2. fwrite

fwrite() is a file-handling function in the C standard library. It writes a number of data blocks to the specified file and, if it succeeds, returns the number of blocks actually written. The function operates on files in binary form and is not limited to text files.

3. fflush

fflush is a function in the C standard I/O library that flushes the information buffered in a stream; it is typically used for disk files. fflush() forces the data held in the buffer to be written back to the file specified by the stream argument.

4. mmap

mmap maps a file or other object into memory. The file is mapped onto a number of pages; if the file size is not a multiple of the page size, the unused part of the last page is zero-filled. mmap is heavily used for user-space mappings.

3. How the calls relate to each other

Anyone familiar with these calls knows what each one does, but I suspect many people still cannot explain how they relate to one another at the bottom of the stack. To straighten this out, I drew the following diagram based on material from various experts online:

From the diagram we can see clearly that a file write has to pass through many buffers: the IO buffer, the page cache, the driver cache and the disk cache. All these caches exist to speed up file reads and writes. But in scenarios where data must be 100% safe (such as a WAL), these buffers become obstacles, one after another. To move data from the application layer to the disk without any chance of loss, we have to use the calls above sensibly, and we can combine them differently for different business needs.

1. Writes that survive an application crash

As the diagram shows, only after data has been written into the kernel's page cache will an application crash no longer cause it to be lost. There are normally two ways to get data into the kernel.

A. Ordinary writes (write/flush/close)

When we call write (fwrite), the data merely moves from the application into the C standard library's IO buffer; it is still in user space. If we call close at this point, the data will usually not have reached the kernel, let alone the disk; normally it is only pushed into the kernel's page cache once the C library's IO buffer fills up. As the diagram shows, we can also use flush to push the data into the kernel's page cache proactively. This is why we are usually advised to flush a file before closing it: once the data has entered the kernel, it is safe as far as the application is concerned. If the application crashes at this point, our data is still intact and can later be written to disk by the kernel.

B. mmap

The mmap so often mentioned in persistence contexts really just establishes a mapping between the application's cache and the kernel's page cache, so that every data operation performed in the application layer actually lands in the kernel's page cache. With mmap we therefore do not need to call flush, and we need not worry about data being lost because the application crashed.

Besides letting the application operate directly on data in the kernel, mmap also avoids unnecessary context switches: in the ordinary write path, calling flush requires a context switch, which has a certain cost. This is the main reason mmap is commonly used in persistence scenarios. A sketch follows below.
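A minimal sketch of this pattern, with a hypothetical file name; note that msync is only needed to survive an OS crash, not an application crash:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("wal.log", O_CREAT | O_RDWR, 0644);
        if (fd < 0) return 1;
        if (ftruncate(fd, 4096) < 0) return 1;   /* size the file first */

        /* MAP_SHARED: stores land in the kernel page cache, so they
         * survive an application crash without any flush call. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        memcpy(p, "record", 6);    /* write directly into the page cache */
        msync(p, 4096, MS_SYNC);   /* only needed against OS crash / power loss */

        munmap(p, 4096);
        close(fd);
        return 0;
    }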

2. Writes that survive an operating system crash

As the diagram shows, only after data has been written into the disk cache or onto the disk medium can it survive a system crash without loss (if the data is in the disk cache, the disk needs backup power).

To move the data in the kernel's page cache to the disk (cache), all we need to do is call fsync (or fdatasync). After that, even if the machine goes down, our data is safe. This is why so many WALs are flushed to disk with fsync.

3. Getting data from the disk cache onto the medium

After steps 1 and 2 above, the data has entered the disk, but that still does not guarantee it has landed on the medium; it may be sitting in the disk's cache, and if the machine loses power at that point, our data may still be lost. There are currently two main remedies: backup power, and enabling the OS's write barriers.

A. Backup power

Many commercial disks come with their own backup power, so that when the machine loses power, the cached data can still be written to the medium.

B. Write barriers

In Linux, the ext3 and ext4 file systems are known as journaling file systems, because their write path includes a step that works much like a WAL.

As the figure above shows, in a journaling file system the disk is organized roughly as described: when data is written, it first goes into the cache; then the metadata of this write operation (from which all modifications to the data can be reconstructed) is written to the disk medium; finally a commit record is written to mark the journal as complete, and at that point the data is safe and the write call returns. fsync, for instance, returns as soon as the commit record has been written, even though the real data is still in the cache. Even if the disk then loses power, it can recover the data from the recorded journal after restart. Moreover, the journal and the commit record occupy contiguous space, so they can be written very quickly. This is how a journaling file system achieves fast writes without losing data on power failure; it is, again, the familiar WAL idea.

There is a catch, however: both the journal and the commit record are writes handed to the drive, and modern drives reorder virtually all pending writes to improve throughput. The journal and the commit record may therefore be reordered so that the commit record lands first and the journal afterwards. If the commit record has landed and the power then fails before the journal is written, the data cannot be recovered.

So file systems adopted write barriers: a write barrier is inserted before each commit record, and it guarantees that everything issued before the barrier has landed before anything after it is written. This keeps the journal and the commit record from being reordered and ensures both land correctly.

4. Answering the questions

We now have a complete picture of the OS I/O path. Looking back at the questions raised at the beginning, can they all be answered cleanly?

1. What is the difference between fsync and the fwrite/fflush combination?

The fwrite/fflush combination moves data from the application layer into the C standard library buffer and then flushes it into the kernel's page cache.

fsync writes the data in the kernel's page cache out to the disk (which does not necessarily mean it lands on the medium).

2. What is the relationship between mmap and fsync?

mmap: establishes a mapping between the application and the kernel's page cache, so that the application can operate on the data in the kernel's page cache directly from the application layer.

fsync: flushes the data in the kernel's page cache to the disk.

3. Everyone says data will not be lost after fsync; is that really true?

We know what fsync does: it flushes the kernel's data straight to the disk. But even after that flush the data is not necessarily safe. If the file system has no write barriers, or they are not enabled, and the disk has no backup power, then after a system crash (power loss) any data still in the disk cache is lost. So fsync by itself does not guarantee that data will not be lost.

4. Is data safe once it has been written to disk?

That depends on where on the disk the data has got to. If it is only in the disk cache, there may still be risk; if it has landed on the medium, or the journal and commit record have been written successfully, it is safe.

In fact, even data that has truly landed on the medium is not absolutely safe, because the disk itself may fail. But that is beyond the scope of this article, so we will not pursue it further.

5. Why can't we simply close a file; why do we need to flush it first?

Because data written with write may actually still be in the application's buffer; if we close the file at that point, an application crash could lose the data. So before close, we flush the data into the kernel's page cache.

References:

Linux I/O flow: https://blog.csdn.net/caogenwangbaoqiang/article/details/79358609

Linux OS: Write Barriers: http://www.rosoo.net/a/201211/16373.html

Barriers and journaling filesystems: https://lwn.net/Articles/283161/


