
I. Installing Red Flag Linux 6.0
    Tips:
    1. To enter a directory whose name contains a space, escape the space with a backslash ( \ ); e.g. to enter My Documents, type cd Desktop/My\ Documents
    2. To enter a subdirectory, you can append a slash to the directory name.
       For example, given the directory /root/Desktop/My Documents/QQ:
       cd /root/Desktop/My\ Documents
       cd QQ/
       This takes you into the QQ directory.

    General steps for installing a program from source:
    1. Enter the source directory and run  tar -zxvf a.tar.gz  (for a bare .gz file use gunzip; for a .zip archive use unzip)
    2. ./configure --prefix=<install path>
    3. make
    4. make install
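
    For example, a typical session for a hypothetical package foo-1.0.tar.gz would look like:
    tar -zxvf foo-1.0.tar.gz
    cd foo-1.0
    ./configure --prefix=/usr/local/foo
    make
    make install        (the last step usually needs root privileges)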
     


II. Graphics programming on Linux
    1. When writing the C program, include the curses.h header. The directory where C header files live is /usr/include.
    2. Then compile with gcc -o <output name> <file.c> -lncurses
    Here is a good analogy:
    What you really need is not the header file but the library, understand?
    The header file is merely "the manual that came with the water heater";
    holding only the manual, without the heater itself, can you boil water?

    In RH9, curses.h is just the manual for the water heater libncurses.a.
    Does that make sense now?

    gcc's -lncurses option means: plug the libncurses.a water heater into the power outlet!

    I hope that
    the next time you learn a new technology,
    you will no longer ask:
    [quote]I already included xxxx.h, why do I still get errors?[/quote]
    but instead ask:
    [quote]To do X, which library do I need to link against besides including xxxx.h?[/quote]
    or
    [quote]Which library file does xxxx.h correspond to?[/quote]
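
    To make the point concrete, here is a minimal sketch of a curses program (the file name hello.c and the message text are just illustrative):

    #include <ncurses.h>

    int main(void)
    {
        initscr();                    /* start curses mode */
        printw("Hello, ncurses!");    /* print to the virtual screen */
        refresh();                    /* flush it to the real terminal */
        getch();                      /* wait for one key press */
        endwin();                     /* end curses mode */
        return 0;
    }

    Compile and run it with:
    gcc -o hello hello.c -lncurses
    ./hello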

III. Serial communication on Linux
    On a Linux system, operating a serial port is actually a matter of operating on the files /dev/ttyS0, /dev/ttyS1, /dev/ttyS2, and so on, where ttyS0 corresponds to COM1, ttyS1 to COM2, and ttyS2 to COM3.
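
    A minimal sketch of opening and configuring a serial port through the standard POSIX termios interface (the device path and the 9600 baud rate are just example choices):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <termios.h>

    int main(void)
    {
        int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY);  /* open COM1 */
        if (fd < 0) {
            perror("open /dev/ttyS0");
            return 1;
        }

        struct termios opt;
        tcgetattr(fd, &opt);              /* read the current settings   */
        cfsetispeed(&opt, B9600);         /* input speed: 9600 baud      */
        cfsetospeed(&opt, B9600);         /* output speed: 9600 baud     */
        opt.c_cflag |= (CLOCAL | CREAD);  /* local mode, enable receiver */
        tcsetattr(fd, TCSANOW, &opt);     /* apply the settings now      */

        write(fd, "hello\n", 6);          /* send six bytes out the port */
        close(fd);
        return 0;
    }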

IV. Warnings during compilation
    1. rty.c:21:2: warning: no newline at end of file
       This means the source file does not end with a newline; add a final newline at the end of the file.


Linux UI programming with <ncurses.h>

NCURSES does more than wrap the low-level terminal capabilities: it also provides a solid framework for building attractive user interfaces. It includes functions for creating windows. Its sister libraries Menu, Panel and Form extend the base CURSES library and are generally distributed together with the CURSES package. An application can therefore combine multiple windows, menus, panels and forms, and each window can be managed independently.


I. Headers used when writing the C program (including for displaying Chinese):
    #include <stdlib.h>
    #include <ncurses.h>
    #include <locale.h>

    stdlib.h declares general-purpose utilities such as exit().
    ncurses.h defines the Linux/UNIX text-mode user interface.
    locale.h provides localization; its functions handle the differences between national languages.

    1. To display Chinese, the program must include the locale.h header.
    Sample:
    #include <stdlib.h>
    #include <ncurses.h>
    #include <locale.h>

    int main(void)
    {
        setlocale(LC_ALL, "");     /* use the system default locale */
        initscr();                 /* start curses */
        printw("你好, ncursesw!");  /* with ncursesw, UTF-8 text such as this works */
        refresh();
        getch();
        endwin();                  /* close curses */
        exit(0);
    }

    Compile with gcc:
    gcc wtes.c -lmenuw -lncursesw
    By default the compiled output is the file a.out.
    Run it from that directory with    ./a.out

    Note: when compiling, do not use gcc wtes.c -lncurses; use gcc wtes.c -lncursesw instead. ncursesw is the wide-character build of ncurses, which lets the whole program use arbitrary UTF-8-encoded characters.

II. Displaying Chinese on the Linux console
    To display Chinese on the console, you need to install zhcon.
    Installation:

  The latest version at the time of writing is 0.2.6. The zhcon source code and RPM packages can be downloaded from http://zhcon.sourceforge.net/. From the SourceForge page http://sourceforge.net/project/showfiles.php?group_id=27400, two files are needed: zhcon-0.2.5.tar.gz and zhcon-0.2.5-to-0.2.6.diff.gz.

  1. First unpack zhcon-0.2.5.tar.gz:
  [root@localhost zhcon]# tar zxvf zhcon-0.2.5.tar.gz
  Unpacking produces a directory zhcon-0.2.5 containing all the source code, man pages, etc. of the 0.2.5 release.

  2. Then decompress zhcon-0.2.5-to-0.2.6.diff.gz:
  [root@localhost zhcon]# gzip -d zhcon-0.2.5-to-0.2.6.diff.gz
  This yields zhcon-0.2.5-to-0.2.6.diff, which is in fact a patch file.

  3. Apply zhcon-0.2.5-to-0.2.6.diff to the zhcon-0.2.5 source tree:
  [root@localhost zhcon]# patch -p0 < zhcon-0.2.5-to-0.2.6.diff
  Run this step in the parent directory of zhcon-0.2.5.

  This upgrades the 0.2.5 sources to version 0.2.6. What follows is the standard UNIX "three-step install":

  [root@localhost zhcon]# ./configure --prefix=/usr/local/zhcon

  [root@localhost zhcon]# make

  [root@localhost zhcon]# make install

  4. Usage

  (1) To display Chinese on the console, run:
  [root@localhost zhcon]# /usr/local/zhcon/bin/zhcon --utf8

  That is all it takes.

  (2) To use a Chinese input method on the console:
  press Ctrl+Space or Ctrl+2 to toggle the intelligent Pinyin input method on and off.

 

I. Changing the hostname:

1. hostname crxy0

2. vi /etc/sysconfig/network

      Change it to: HOSTNAME=crxy0.localdomain

3. vi /etc/hosts

     Add: 192.168.203.132 crxy0

4. reboot to restart the system.

5. Run hostname to check that the change took effect.

II. Configuring passwordless SSH login

ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
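
If the login still prompts for a password, permissions on the key files are the usual culprit; a commonly needed fix (assuming the default paths above) is:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

Then verify with ssh localhost, which should open a shell without asking for a password.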

 III. Turning off the firewall

       Stop the firewall:

       Run: service iptables stop

       Verify: service iptables status

       Disable the firewall at boot:

       Run: chkconfig iptables off

       Verify: chkconfig --list | grep iptables

 

 


Reposted from: http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html


Linux Performance Analysis in 60,000 Milliseconds

You log in to a Linux server with a performance issue: what do you check in the first minute? 

At Netflix we have a massive EC2 Linux cloud, and numerous performance analysis tools to monitor and investigate its performance. These include Atlas for cloud-wide monitoring, and Vector for on-demand instance analysis. While those tools help us solve most issues, we sometimes need to login to an instance and run some standard Linux performance tools. 

In this post, the Netflix Performance Engineering team will show you the first 60 seconds of an optimized performance investigation at the command line, using standard Linux tools you should have available. 

First 60 Seconds: Summary

In 60 seconds you can get a high level idea of system resource usage and running processes by running the following ten commands. Look for errors and saturation metrics, as they are both easy to interpret, and then resource utilization. Saturation is where a resource has more load than it can handle, and can be exposed either as the length of a request queue, or time spent waiting.
uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top
Some of these commands require the sysstat package installed. The metrics these commands expose will help you complete some of the USE Method: a methodology for locating performance bottlenecks. This involves checking utilization, saturation, and error metrics for all resources (CPUs, memory, disks, etc.). Also pay attention to when you have checked and exonerated a resource, as by process of elimination this narrows the targets to study, and directs any follow-on investigation. 
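
(For example, on Debian/Ubuntu the package can usually be installed with "apt-get install sysstat", and on RHEL/CentOS with "yum install sysstat"; the package name sysstat is the same on both families.)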

The following sections summarize these commands, with examples from a production system. For more information about these tools, see their man pages. 

1. uptime

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02
This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux systems, these numbers include processes wanting to run on CPU, as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high level idea of resource load (or demand), but can’t be properly understood without other tools. Worth a quick look only. 

The three numbers are exponentially damped moving sum averages with a 1 minute, 5 minute, and 15 minute constant. The three numbers give us some idea of how load is changing over time. For example, if you’ve been asked to check a problem server, and the 1 minute value is much lower than the 15 minute value, then you might have logged in too late and missed the issue. 

In the example above, the load averages show a recent increase, hitting 30 for the 1 minute value, compared to 19 for the 15 minute value. That the numbers are this large means a lot of something: probably CPU demand; vmstat or mpstat will confirm, which are commands 3 and 4 in this sequence. 

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.  Check SNMP counters.
This views the last 10 system messages, if there are any. Look for errors that can cause performance issues. The example above includes the oom-killer, and TCP dropping a request. 

Don’t miss this step! dmesg is always worth checking. 

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0 9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
^C
Short for virtual memory stat, vmstat(8) is a commonly available tool (first created for BSD decades ago). It prints a summary of key server statistics on each line. 

vmstat was run with an argument of 1, to print one second summaries. The first line of output (in this version of vmstat) has some columns that show the average since boot, instead of the previous second. For now, skip the first line, unless you want to learn and remember which column is which. 

Columns to check:
  • r: Number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an “r” value greater than the CPU count is saturation.
  • free: Free memory in kilobytes. If there are too many digits to count, you have enough free memory. The “free -m” command, included as command 7, better explains the state of free memory.
  • si, so: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
  • us, sy, id, wa, st: These are breakdowns of CPU time, on average across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. You can treat wait I/O as another form of CPU idle, one that gives a clue as to why they are idle. 

System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently. 

In the above example, CPU time is almost entirely in user-level, pointing to application level usage instead. The CPUs are also well over 90% utilized on average. This isn’t necessarily a problem; check for the degree of saturation using the “r” column. 

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
[...]
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application. 

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00     0  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C
Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation. 

The above example identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% shows that that java process is consuming almost 16 CPUs. 

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb        0.01     0.00    1.02    8.94   127.97   598.53   145.79     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C


This is a great tool for understanding block devices (disks), both the workload applied and the resulting performance. Look for: 
  • r/s, w/s, rkB/s, wkB/s: These are the delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
  • await: The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
  • avgqu-sz: The average number of requests issued to the device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
  • %util: Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
If the storage device is a logical disk device fronting many back-end disks, then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from saturated, and may be able to handle much more work. 

Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g., read-ahead for reads, and buffering for writes). 

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0
The right two columns show:
  • buffers: For the buffer cache, used for block device I/O.
  • cached: For the page cache, used by file systems.
We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. The above example looks fine, with many Mbytes in each. 

The “-/+ buffers/cache” provides less confusing values for used and free memory. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column, which this line does. There’s even a website, linuxatemyram, about this confusion. 

It can be additionally confusing if ZFS on Linux is used, as we do for some services, as ZFS has its own file system cache that isn’t reflected properly by the free -m columns. It can appear that the system is low on free memory, when that memory is in fact available for use from the ZFS cache as needed. 
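
(If you need to see where that memory went, ZFS on Linux exposes ARC statistics, including the current cache size, under /proc/spl/kstat/zfs/arcstats; this path is an assumption based on the ZFS-on-Linux kstat interface.)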

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015     _x86_64_    (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C
Use this tool to check network interface throughput: rxkB/s and txkB/s, as a measure of workload, and also to check if any limit has been reached. In the above example, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/sec (well under, say, a 1 Gbit/sec limit). 

This version also has %ifutil for device utilization (max of both directions for full duplex), which is something we also use Brendan’s nicstat tool to measure. And like with nicstat, this is hard to get right, and seems to not be working in this example (0.00). 

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015    _x86_64_    (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C
This is a summarized view of some key TCP metrics. These include:
  • active/s: Number of locally-initiated TCP connections per second (e.g., via connect()).
  • passive/s: Number of remotely-initiated TCP connections per second (e.g., via accept()).
  • retrans/s: Number of TCP retransmits per second.
The active and passive counts are often useful as a rough measure of server load: number of new accepted connections (passive), and number of downstream connections (active). It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g., consider a localhost to localhost connection). 

Retransmits are a sign of a network or server issue; it may be an unreliable network (e.g., the public Internet), or it may be due to a server being overloaded and dropping packets. The example above shows just one new TCP connection per-second. 

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:58 java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233:35.37 mesos-slave
 66128 titancl+  20   0   24344   2332   1172 R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched
The top command includes many of the metrics we checked earlier. It can be handy to run it to see if anything looks wildly different from the earlier commands, which would indicate that load is variable. 

A downside to top is that it is harder to see patterns over time, which may be clearer in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you don't pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue), and the screen clears. 

Follow-on Analysis


There are many more commands and methodologies you can apply to drill deeper. See Brendan’s Linux Performance Tools tutorial from Velocity 2015, which works through over 40 commands, covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing. 

Tackling system reliability and performance problems at web scale is one of our passions. If you would like to join us in tackling these kinds of challenges we are hiring!

1. Set aside the Linux-versus-Windows partisanship
2. Learn through case studies
3. Treat it as a tool
4. Linux is just a beginning, just a platform; learn it from a developer's perspective
