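2014-11-24 16:54:40 iamstar001 (890 views)
        In recent years, as markets and technology have evolved, more and more network infrastructure has been converging on architectures built from general-purpose or modular computing platforms, which support a variety of network elements and a rich set of functions such as application processing, control processing, packet processing, and signal processing. Besides cutting cost and shortening time to market, this architecture offers, on rack-mounted systems and network devices of different form factors, the flexibility of a modular design and the ability to upgrade individual system components on demand. In a traditional network architecture, the switch module handles routing and switching between in-band and out-of-band system modules, the processor module provides application-layer and control-layer functions, the packet processing module serves the data plane, and the DSP module provides customized signal-layer functions. With Intel DPDK (Intel Data Plane Development Kit), a processor module based on the Intel x86 architecture can not only carry out the traditional application and control functions but also perform intelligent, efficient packet processing.
        Compared with native Linux, Intel DPDK can greatly improve IP forwarding performance, mainly because of the following features: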

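(1) Polling mode instead of interrupts
        Normally, when a packet arrives, native Linux receives an interrupt from the network interface controller (NIC), schedules a softirq, performs a context switch to handle it, and wakes up system calls such as read() and write(). Intel DPDK instead replaces the default Ethernet driver with an optimized poll mode driver (PMD), which receives packets continuously and avoids software interrupts, context switches, and system-call wake-ups. This saves a significant amount of CPU time and reduces latency.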

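(2) HugePages instead of regular pages
        Compared with the 4 kB pages of native Linux, a larger page size shortens page lookups and reduces the chance of a translation lookaside buffer (TLB) miss. Running as a user-space application, Intel DPDK allocates HugePages in its own memory space to hold frame buffers, rings, and other related buffers; these buffers are outside the control of other applications and even of the Linux kernel. In many examples, 1024 HugePages of 2 MB each are reserved for running an IP forwarding application.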
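To put numbers on the page-size argument, here is a minimal sketch in C. The helper name `pages_needed` is made up for illustration and is not part of DPDK. It shows that the 2 GiB covered by 1024 HugePages of 2 MB would take 524288 regular 4 kB pages, which is 512 times as many translations competing for TLB entries.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Hypothetical helper: number of pages of size page_bytes needed to
 * cover region_bytes, rounded up. */
uint64_t pages_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

For a region of `1024 * 2 * MIB` bytes, `pages_needed(region, 2 * MIB)` is 1024 while `pages_needed(region, 4 * KIB)` is 524288.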

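(3) Zero-copy buffers
        In traditional packet processing, native Linux decapsulates the packet headers and then copies the data to a user-space buffer according to the socket ID. Once the user-space application has finished processing the data, it invokes the write() system call to hand the data to the kernel, which copies it from user space into a kernel buffer, encapsulates the packet headers, and finally sends it out through the appropriate physical port. Clearly, the copies that native Linux makes between kernel buffers and user-space buffers cost a great deal of time and resources.
        By contrast, Intel DPDK receives packets into its own reserved memory region, which lives in user space, and then classifies them into flows according to the configured rules. After decapsulation and processing, it encapsulates the packets with the correct headers in the same user-space buffer and sends them out through the appropriate physical port.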

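(4) Run-to-Completion (RTC) and Core Affinity
        Before running an application, Intel DPDK performs initialization and allocates all low-level resources, such as memory, PCI devices, timers, and the console; these resources are reserved for Intel DPDK based applications only. After initialization, each core (or each hardware thread, when Intel Hyper-Threading Technology is enabled in the BIOS) is launched to take charge of one execution unit and, depending on what the application needs, runs the same or a different workload.
        In addition, Intel DPDK provides a way to pin each execution unit to its own core, which maintains core affinity and so avoids cache misses. Physical ports can likewise be bound to different CPU threads according to affinity.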

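(5) Lock-free execution and cache alignment

        The libraries and APIs provided by Intel DPDK are optimized to be lock-free, which prevents deadlocks in multi-threaded applications. Buffers, rings, and other data structures are also cache-aligned, to make the most efficient use of each cache line and to minimize cache-line contention.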
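The cache-alignment idea can be sketched in plain C11. This is an illustration, not DPDK code: DPDK has its own `__rte_cache_aligned` attribute for this purpose, while the struct, array, and helper names below are hypothetical, and the 64-byte line size is an assumption that holds on typical x86 parts. Giving each core's counters a cache line of their own means two cores never write to the same line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumed line size; typical for x86 */

/* Hypothetical per-core counters. Aligning the first member to the
 * cache-line size aligns the whole struct and pads it to a full line. */
struct lcore_stats {
    alignas(CACHE_LINE_SIZE) uint64_t rx_packets;
    uint64_t tx_packets;
};

static_assert(sizeof(struct lcore_stats) == CACHE_LINE_SIZE,
              "each element should occupy exactly one cache line");

/* One element per core; neighbours never share a cache line. */
struct lcore_stats per_core_stats[4];

/* Returns 1 if both addresses fall within the same cache line. */
int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE_SIZE == (uintptr_t)b / CACHE_LINE_SIZE;
}
```

Without the alignment specifier the struct would be 16 bytes, so four cores' counters would sit in a single line and every update would bounce that line between cores.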


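        The IPv4 forwarding performance achieved with Intel DPDK gives users better cost and performance when they migrate packet processing applications from NPU-based hardware to Intel x86 based platforms. It also lets them deploy different services, such as application processing, control processing, and packet processing, on one unified platform. Note, however, that Intel DPDK is a data-plane development kit that runs in user space; it is not a complete product on which users can directly build applications. In particular, Intel DPDK does not include the tools needed to interact with the control plane (including the kernel and the protocol stack).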

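2017-10-12 16:06:07 pangyemeng (3277 views)

0x01 Motivation

     "What you learn on paper always feels shallow; to truly understand something you have to do it yourself." I previously studied the Linux TCP/IP protocol stack and the basic theory of DPDK. These areas used to fill me with awe and seemed mysterious and advanced, but the lesson I took away is that if you keep exploring, debugging, and looking things up, what you don't understand eventually becomes what you do understand. Studying theory alone is not enough; it has to be applied in practice. At work, however, the company's products sometimes give no chance to apply this knowledge, so I can only reinforce it through examples and by building open-source solutions.
     Over the past year I got some examples running, on and off, but never studied the code in depth. I did not learn the design ideas and the designers' intent, or how things behave under load. Below I go through the examples again, starting with the simplest.

0x02 The helloworld example

     This is the simplest sample program built with the DPDK development kit.
     Source code walkthrough: 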
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <errno.h>
#include <sys/queue.h>
// The headers above are standard glibc headers from the development environment
#include <rte_memory.h>
#include <rte_memzone.h>
#include <rte_launch.h>
#include <rte_eal.h>
#include <rte_per_lcore.h>
#include <rte_lcore.h>
#include <rte_debug.h>
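/* The headers above are DPDK's own common library headers (memory zones, lcore launching, the environment abstraction layer, and so on). DPDK has its own coding style; get used to it. */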
static int
lcore_hello(__attribute__((unused)) void *arg) /* __attribute__((unused)) tells the compiler that arg is intentionally unused, which suppresses the warning */
{

    unsigned lcore_id;
    lcore_id = rte_lcore_id(); // get this logical core's ID, print it, and return
    printf("hello from core %u\n", lcore_id);
    return 0;
}

int
main(int argc, char **argv)
{
    int ret;
    unsigned lcore_id;
    /* Initialization: parse the command-line arguments, detect the environment automatically, and initialize the EAL and related libraries. */
    ret = rte_eal_init(argc, argv);
    if (ret < 0)
        rte_panic("Cannot init EAL\n");

    /* Launch the callback lcore_hello on every slave lcore to print its message. */
    RTE_LCORE_FOREACH_SLAVE(lcore_id) {
        rte_eal_remote_launch(lcore_hello, NULL, lcore_id);
    }

    /* Call it on the master lcore as well. */
    lcore_hello(NULL);
    /* Wait for all slave lcores to return; the main thread blocks here. */
    rte_eal_mp_wait_lcore();
    return 0;
}

Makefile:
# Check that the required environment variable is set
ifeq ($(RTE_SDK),)
$(error "Please define RTE_SDK environment variable")
endif

# Default target platform
RTE_TARGET ?= x86_64-native-linuxapp-gcc

include $(RTE_SDK)/mk/rte.vars.mk

# binary name
APP = helloworld

# all source are stored in SRCS-y
SRCS-y := main.c

CFLAGS += -O3
CFLAGS += $(WERROR_FLAGS)

include $(RTE_SDK)/mk/rte.extapp.mk

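0x03 Setting up the runtime environment

     The runtime environment is covered in another post, so it is not repeated here. http://blog.csdn.net/pangyemeng/article/details/49883717

0x04 Running

     Environment:
          Linux Huawei 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
     DPDK version:
          dpdk-16.04

     1. Default run with no arguments

32 logical cores are detected automatically, and core 0 becomes the master lcore.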
[root@Huawei build]# ./helloworld
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Detected lcore 2 as core 2 on socket 0
EAL: Detected lcore 3 as core 3 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 5 on socket 0
EAL: Detected lcore 6 as core 6 on socket 0
EAL: Detected lcore 7 as core 7 on socket 0
EAL: Detected lcore 8 as core 0 on socket 1
EAL: Detected lcore 9 as core 1 on socket 1
EAL: Detected lcore 10 as core 2 on socket 1
EAL: Detected lcore 11 as core 3 on socket 1
EAL: Detected lcore 12 as core 4 on socket 1
EAL: Detected lcore 13 as core 5 on socket 1
EAL: Detected lcore 14 as core 6 on socket 1
EAL: Detected lcore 15 as core 7 on socket 1
EAL: Detected lcore 16 as core 0 on socket 0
EAL: Detected lcore 17 as core 1 on socket 0
EAL: Detected lcore 18 as core 2 on socket 0
EAL: Detected lcore 19 as core 3 on socket 0
EAL: Detected lcore 20 as core 4 on socket 0
EAL: Detected lcore 21 as core 5 on socket 0
EAL: Detected lcore 22 as core 6 on socket 0
EAL: Detected lcore 23 as core 7 on socket 0
EAL: Detected lcore 24 as core 0 on socket 1
EAL: Detected lcore 25 as core 1 on socket 1
EAL: Detected lcore 26 as core 2 on socket 1
EAL: Detected lcore 27 as core 3 on socket 1
EAL: Detected lcore 28 as core 4 on socket 1
EAL: Detected lcore 29 as core 5 on socket 1
EAL: Detected lcore 30 as core 6 on socket 1
EAL: Detected lcore 31 as core 7 on socket 1
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 32 lcore(s)
EAL: Setting up physically contiguous memory...
EAL: Ask a virtual area of 0xc00000 bytes
EAL: Virtual area found at 0x7fda71e00000 (size = 0xc00000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda71800000 (size = 0x400000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda71200000 (size = 0x400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda70e00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda70a00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda70600000 (size = 0x200000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda70000000 (size = 0x400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6fc00000 (size = 0x200000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda6f600000 (size = 0x400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6f200000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6ee00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6ea00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6e600000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6e200000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6de00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6da00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6d600000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6d200000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6ce00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6ca00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6c600000 (size = 0x200000)
EAL: Ask a virtual area of 0x600000 bytes
EAL: Virtual area found at 0x7fda6be00000 (size = 0x600000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6ba00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6b600000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6b200000 (size = 0x200000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda6ac00000 (size = 0x400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6a800000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6a400000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda6a000000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda69c00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda69800000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda69400000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda69000000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda68c00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda68800000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda68400000 (size = 0x200000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda67e00000 (size = 0x400000)
EAL: Ask a virtual area of 0x400000 bytes
EAL: Virtual area found at 0x7fda67800000 (size = 0x400000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda67400000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda67000000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda66c00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda66800000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda66400000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda66000000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda65c00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda65800000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda65400000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda65000000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda64c00000 (size = 0x200000)
EAL: Ask a virtual area of 0x200000 bytes
EAL: Virtual area found at 0x7fda64800000 (size = 0x200000)
EAL: Requesting 64 pages of size 2MB from socket 0
EAL: TSC frequency is ~2593994 KHz
EAL: Master lcore 0 is ready (tid=72be3880;cpuset=[0])
EAL: lcore 4 is ready (tid=61ffb700;cpuset=[4])
EAL: lcore 15 is ready (tid=5b1f0700;cpuset=[15])
EAL: lcore 27 is ready (tid=539e4700;cpuset=[27])
EAL: lcore 6 is ready (tid=60bf9700;cpuset=[6])
EAL: lcore 8 is ready (tid=5f7f7700;cpuset=[8])
EAL: lcore 10 is ready (tid=5e3f5700;cpuset=[10])
EAL: lcore 12 is ready (tid=5cff3700;cpuset=[12])
EAL: lcore 14 is ready (tid=5bbf1700;cpuset=[14])
EAL: lcore 18 is ready (tid=593ed700;cpuset=[18])
EAL: lcore 16 is ready (tid=5a7ef700;cpuset=[16])
EAL: lcore 19 is ready (tid=589ec700;cpuset=[19])
EAL: lcore 22 is ready (tid=56be9700;cpuset=[22])
EAL: lcore 24 is ready (tid=557e7700;cpuset=[24])
EAL: lcore 25 is ready (tid=54de6700;cpuset=[25])
EAL: lcore 28 is ready (tid=52fe3700;cpuset=[28])
EAL: lcore 5 is ready (tid=615fa700;cpuset=[5])
EAL: lcore 1 is ready (tid=63dfe700;cpuset=[1])
EAL: lcore 7 is ready (tid=601f8700;cpuset=[7])
EAL: lcore 13 is ready (tid=5c5f2700;cpuset=[13])
EAL: lcore 3 is ready (tid=629fc700;cpuset=[3])
EAL: lcore 17 is ready (tid=59dee700;cpuset=[17])
EAL: lcore 23 is ready (tid=561e8700;cpuset=[23])
EAL: lcore 2 is ready (tid=633fd700;cpuset=[2])
EAL: lcore 31 is ready (tid=511e0700;cpuset=[31])
EAL: lcore 9 is ready (tid=5edf6700;cpuset=[9])
EAL: lcore 21 is ready (tid=575ea700;cpuset=[21])
EAL: lcore 26 is ready (tid=543e5700;cpuset=[26])
EAL: lcore 30 is ready (tid=51be1700;cpuset=[30])
EAL: lcore 20 is ready (tid=57feb700;cpuset=[20])
EAL: lcore 11 is ready (tid=5d9f4700;cpuset=[11])
EAL: lcore 29 is ready (tid=525e2700;cpuset=[29])
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:04:00.2 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   PCI memory mapped at 0x7fda72a00000
EAL:   PCI memory mapped at 0x7fda72b00000
PMD: eth_igb_dev_init(): port_id 0 vendorID=0x8086 deviceID=0x1521
EAL: PCI device 0000:04:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1521 rte_igb_pmd
EAL:   PCI memory mapped at 0x7fda71d00000
EAL:   PCI memory mapped at 0x7fda71cfc000
PMD: eth_igb_dev_init(): port_id 1 vendorID=0x8086 deviceID=0x1521
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:05:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:05:00.2 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:05:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
hello from core 1
hello from core 2
hello from core 3
hello from core 4
hello from core 5
hello from core 6
hello from core 7
hello from core 8
hello from core 9
hello from core 10
hello from core 11
hello from core 12
hello from core 13
hello from core 14
hello from core 15
hello from core 16
hello from core 17
hello from core 18
hello from core 19
hello from core 20
hello from core 21
hello from core 22
hello from core 23
hello from core 24
hello from core 25
hello from core 26
hello from core 27
hello from core 28
hello from core 29
hello from core 30
hello from core 31
hello from core 0

   2. Running with -l

     ./helloworld -l 0-3
EAL:   Not managed by a supported kernel driver, skipped
EAL: PCI device 0000:05:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
hello from core 1
hello from core 2
hello from core 3
hello from core 0

   3. Running with -l and --master-lcore

     ./helloworld -l 0-3 --master-lcore=1
........
EAL: PCI device 0000:05:00.3 on NUMA socket 0
EAL:   probe driver: 8086:1522 rte_igb_pmd
EAL:   Not managed by a supported kernel driver, skipped
hello from core 0
hello from core 2
hello from core 3
hello from core 1

   4. The remaining options will be explained one by one in later posts.

0x05 Option reference

[root@dev build]# ./helloworld --help
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Support maximum 128 logical core(s) by configuration.
EAL: Detected 2 lcore(s)

Usage: ./helloworld [options]

EAL common options:
  -c COREMASK         Hexadecimal bitmask of logical cores to run on
  -l CORELIST         List of logical cores to run on
                      The argument format is <c1>[-c2][,c3[-c4],...]
                      where c1, c2, etc are core indexes between 0 and 128
  --lcores COREMAP    Map a set of lcores to a set of physical CPUs
                      The argument format is
                            '<lcores[@cpus]>[<,lcores[@cpus]>...]'
                      lcores and cpus list are grouped by '(' and ')'
                      Within the group, '-' is used for range separator,
                      ',' is used for single number separator.
                      '( )' can be omitted for single element group,
                      '@' can be omitted if cpus and lcores have the same value
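  --master-lcore ID   ID of the logical core used as master
  -n CHANNELS         Number of memory channels
  -m MB               Memory to allocate (see also --socket-mem)
  -r RANKS            Force number of memory ranks (don't detect)
  -b, --pci-blacklist Add a PCI network device to the blacklist so that EAL does not use it; the argument format is <domain:bus:devid.func>
  -w, --pci-whitelist Add a PCI network device to the whitelist so that only the listed PCI devices are used; the argument format is <[domain:]bus:devid.func>
  --vdev              Add a virtual device; the argument format is <driver><id>[,key=val,...]
                      (e.g. --vdev=eth_pcap0,iface=eth2).
  -d LIB.so|DIR       Add a driver or a driver directory (can be used multiple times)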
  --vmware-tsc-map    Use VMware TSC map instead of native RDTSC
  --proc-type         Type of this process (primary|secondary|auto)
  --syslog            Set the syslog facility
  --log-level         Set the default log level
  -v                  Display version information on startup
  -h, --help          This help

EAL options for DEBUG use only:
  --huge-unlink       Unlink hugepage files after initialization
  --no-huge           Use malloc instead of hugetlbfs
  --no-pci            Disable PCI
  --no-hpet           Disable HPET
  --no-shconf         No shared config (mmap'd files)

EAL Linux options:
  --socket-mem        Memory to allocate on each socket
  --huge-dir          Directory where hugetlbfs is mounted
  --file-prefix       Prefix for hugepage file names
  --base-virtaddr     Base virtual address
  --create-uio-dev    Create /dev/uioX (usually done by hotplug)
  --vfio-intr         Interrupt mode for VFIO (legacy|msi|msix)
  --xen-dom0          Support running on Xen dom0 without hugetlbfs

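0x06 Summary

     I did not analyze some of the details carefully here; part of that ground was covered in the earlier theory study. The goals from here on: get productive quickly, become fluent in using DPDK, and learn its design ideas.
2016-09-26 21:24:00 iteye_5484 (178 views)
A snake test typically bounces packets back and forth between ports to build up a heavy, near-saturating load.

testpmd is the tool DPDK uses to measure the performance of two directly connected NICs that blast traffic at each other. If you have no hardware (how come you have nothing at all?), we can still play. On Linux, veth devices (loosely called taps in this post) are virtual NICs that come in pairs; once a pair is created, no bridge is needed, because the two ends are naturally connected to each other.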

# ip link add ep1 type veth peer name ep2
# ifconfig ep1 up; ifconfig ep2 up
Check ifconfig or ip link: do the new interfaces show up?

For installing and running testpmd, see: http://dpdk.org/doc/quick-start
Running multiple testpmd instances requires --no-shconf.
Hugepages do not seem to be released after repeated runs, and running without them costs little performance, hence --no-huge.

# ./testpmd --no-huge -c7 -n3 --vdev="eth_pcap0,iface=ep1" --vdev=eth_pcap1,iface=ep2 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
testpmd> start tx_first
testpmd> show port stats all
testpmd> show port stats all // run it twice
Rx-pps: 418634
Tx-pps: 436095

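Now create another veth pair and run two groups at the same time:
# ip link add ep3 type veth peer name ep4
# ifconfig ep3 up; ifconfig ep4 up
# ./testpmd1 --no-huge --no-shconf -c70 --vdev="eth_pcap2,iface=ep3" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048

Running both at once gives about the same performance, because the -c option spreads the programs across different cores; press "1" in top to see this.

So how does chaining the two pairs in series perform? Traffic used to flow EP1<->EP2 and EP3<->EP4; now change it to EP2<->EP3 and EP4<->EP1.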

# ./testpmd --no-huge --no-shconf -c70 --vdev="eth_pcap1,iface=ep2" --vdev=eth_pcap2,iface=ep3 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
testpmd> show port stats all
At this point all the pps counters are 0, because the packets sent out on one side reach a veth peer that nothing is attached to. Now connect ep4 and ep1 in another window:
# ./testpmd --no-huge -c7 -n3 --vdev="eth_pcap0,iface=ep1" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048
testpmd> start tx_first
testpmd> show port stats all
testpmd> show port stats all
Rx-pps: 433939
Tx-pps: 423428
It's running. Back in the first window, show reports traffic as well, so the snake traffic path is now complete.

Here's the question: why does chaining the two barely change performance?!
# lscpu
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
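top shows the testpmd forwarding cores running on 1-2 and 5-6, which straddle NUMA nodes, so memory efficiency suffers...
OK, change the -c argument to 15. It is a hexadecimal bitmap, so 0x15 actually selects cores 4, 2, and 0. The ep1-ep2 result improves by roughly 50%: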
Rx-pps: 612871
Tx-pps: 597219
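Since the -c value is a hexadecimal bitmap, it helps to decode it before comparing it with the lscpu output. Below is a minimal sketch of that decoding; `coremask_to_cores` is a name invented here, not a DPDK function. On this machine even-numbered CPUs sit on NUMA node0 and odd-numbered ones on node1, so 15 and 1500 decode to node0 cores only, while 2A decodes to node1 cores only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: decode a hexadecimal coremask string such as "15"
 * into the list of selected core indexes (lowest first).
 * Returns the number of cores written to cores[], or -1 on a parse error. */
int coremask_to_cores(const char *hex, int cores[64])
{
    char *end;
    unsigned long long mask;
    int n = 0;

    assert(hex != NULL && cores != NULL);
    mask = strtoull(hex, &end, 16);
    if (end == hex || *end != '\0')
        return -1;                      /* empty or not valid hex */

    for (int bit = 0; bit < 64; bit++)
        if (mask & (1ULL << bit))
            cores[n++] = bit;           /* bit i set means core i is used */
    return n;
}
```

Decoding "15" gives cores 0, 2, 4; "2A" gives 1, 3, 5; "1500" gives 8, 10, 12.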
Back to the snake test with CPU masks 15 and 2A; performance is as follows, and it looks considerably slower:
Rx-pps: 339290
Tx-pps: 336334
With CPU masks 15 and 1500, the result is:
Rx-pps: 540867
Tx-pps: 496891
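Much better than crossing NUMA nodes, but still about 1/6 lower than a single veth pair. Now look at a snake across 3 veth pairs, with the third group on CPU mask 150000, still on the same NUMA node; surprisingly, little changes: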
Rx-pps: 511881
Tx-pps: 503456

Suppose we run out of CPUs and the third testpmd also runs on CPU mask 1500; the result is dismal:
Rx-pps: 1334
Tx-pps: 1334


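The tests above show:
1. Avoid passing data across NUMA nodes whenever possible
2. When CPU-pinned applications process data in a relay, total throughput is set by the slowest application
3. CPUs must not be shared; scheduling switches hurt performance badly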

========================
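Create a bridge br0, add ep1, ep3, and ep5 to it, and use testpmd to test ep2-ep4. This is a standard Linux bridge; let's see how much performance drops:
# brctl addbr br0
# brctl addif br0 ep1; brctl addif br0 ep3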
# ./testpmd --no-huge --no-shconf -c15 --vdev="eth_pcap1,iface=ep2" --vdev=eth_pcap3,iface=ep4 -- -i --nb-cores=2 --nb-ports=2 --total-num-mbufs=2048

Rx-pps: 136157
Tx-pps: 128207
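From 600 kpps down to about 130 kpps, less than 1/4... I'll try OVS when I have time.
2018-05-11 21:00:59 kklvsports (250 views)

    On Linux the TCP/IP protocol stack runs in kernel space (setups such as DPDK, which receive packets in user space, are the exception). To intervene in packet processing from user space, hook functions have to be registered in the kernel; Linux iptables does this through the hook mechanism of the netfilter framework. As the earlier analysis of the kernel's IP packet processing path showed, the kernel has 5 hook points (NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN, NF_INET_FORWARD, NF_INET_LOCAL_OUT, and NF_INET_POST_ROUTING), which correspond one-to-one with the iptables chains PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING.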


The hook functions are invoked in the kernel through NF_HOOK, defined in include/linux/netfilter.h. (The NF_IP_xx names in the figure above are old; the current names are NF_INET_xx.)

/** 
 *  nf_hook_thresh - call a netfilter hook
 *  
 *  Returns 1 if the hook has allowed the packet to pass.  The function
 *  okfn must be invoked by the caller in this case.  Any other return
 *  value indicates the packet has been consumed by the hook.
 */
static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
                 struct sk_buff *skb,
                 struct net_device *indev,
                 struct net_device *outdev,
                 int (*okfn)(struct sk_buff *), int thresh)
{   
#ifndef CONFIG_NETFILTER_DEBUG
    if (list_empty(&nf_hooks[pf][hook]))
        return 1;
#endif
    return nf_hook_slow(pf, hook, skb, indev, outdev, okfn, thresh);
}
static inline int nf_hook(u_int8_t pf, unsigned int hook, struct sk_buff *skb,
              struct net_device *indev, struct net_device *outdev,
              int (*okfn)(struct sk_buff *))
{   
    return nf_hook_thresh(pf, hook, skb, indev, outdev, okfn, INT_MIN);
}

/* Activate hook; either okfn or kfree_skb called, unless a hook
   returns NF_STOLEN (in which case, it's up to the hook to deal with
   the consequences).

   Returns -ERRNO if packet dropped.  Zero means queued, stolen or
   accepted.
*/

/* RR:
   > I don't want nf_hook to return anything because people might forget
   > about async and trust the return value to mean "packet was ok".

   AK:
   Just document it clearly, then you can expect some sense from kernel
   coders :)
*/

static inline int
NF_HOOK_THRESH(uint8_t pf, unsigned int hook, struct sk_buff *skb,
           struct net_device *in, struct net_device *out,
           int (*okfn)(struct sk_buff *), int thresh)
{
    int ret = nf_hook_thresh(pf, hook, skb, in, out, okfn, thresh);
    if (ret == 1) // the hooks let the packet pass: call okfn to continue to the next stage
        ret = okfn(skb);
    return ret;
}
static inline int
NF_HOOK(uint8_t pf, unsigned int hook, struct sk_buff *skb,
    struct net_device *in, struct net_device *out,
    int (*okfn)(struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, skb, in, out, okfn, INT_MIN);
}

/* Returns 1 if okfn() needs to be executed by the caller,
 * -EPERM for NF_DROP, 0 otherwise. */
int nf_hook_slow(u_int8_t pf, unsigned int hook, struct sk_buff *skb,
         struct net_device *indev,
         struct net_device *outdev,
         int (*okfn)(struct sk_buff *),
         int hook_thresh)
{
    struct list_head *elem;
    unsigned int verdict;
    int ret = 0;


    /* We may already have this, but read-locks nest anyway */
    rcu_read_lock();  // RCU protects access to nf_hooks[]


    elem = &nf_hooks[pf][hook];
next_hook:    
    verdict = nf_iterate(&nf_hooks[pf][hook], skb, hook, indev,
                 outdev, &elem, okfn, hook_thresh);
    if (verdict == NF_ACCEPT || verdict == NF_STOP) {
        ret = 1;
    } else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
        kfree_skb(skb); // a hook dropped the packet, so free the skb here
        ret = NF_DROP_GETERR(verdict);
        if (ret == 0)
            ret = -EPERM;
    } else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
        int err = nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
                        verdict >> NF_VERDICT_QBITS);
        if (err < 0) {
            if (err == -ECANCELED)
                goto next_hook;
            if (err == -ESRCH &&
               (verdict & NF_VERDICT_FLAG_QUEUE_BYPASS))
                goto next_hook;
            kfree_skb(skb);
        }
    }
    rcu_read_unlock();
    return ret;
}
unsigned int nf_iterate(struct list_head *head,
            struct sk_buff *skb,
            unsigned int hook,
            const struct net_device *indev,
            const struct net_device *outdev,
            struct list_head **i,
            int (*okfn)(struct sk_buff *),
            int hook_thresh)
{
    unsigned int verdict;


    /*
     * The caller must not block between calls to this
     * function because of risk of continuing from deleted element.
     */
    list_for_each_continue_rcu(*i, head) {
        struct nf_hook_ops *elem = (struct nf_hook_ops *)*i;


        if (hook_thresh > elem->priority)
            continue;


        /* Optimization: we don't need to hold module
           reference here, since function can't sleep. --RR */
repeat: // call each hook function in turn
        verdict = elem->hook(hook, skb, indev, outdev, okfn);
        if (verdict != NF_ACCEPT) {
#ifdef CONFIG_NETFILTER_DEBUG
            if (unlikely((verdict & NF_VERDICT_MASK)
                            > NF_MAX_VERDICT)) {
                NFDEBUG("Evil return from %p(%u).\n",
                    elem->hook, hook);
                continue;
            }
#endif
            if (verdict != NF_REPEAT)
                return verdict;
            goto repeat;
        }
    }
    return NF_ACCEPT;
}

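The process is fairly simple: NF_HOOK walks all the hook functions registered on nf_hooks and hands the packet to each of them; if nf_hook_thresh returns 1, the packet is passed on to okfn for further processing. All hook functions are stored in nf_hooks, a two-dimensional array whose elements are linked lists. The figure below uses NF_INET_LOCAL_IN as an example to show how the data structures relate; each list is ordered by priority.


The kernel API for inserting a hook function is nf_register_hook.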

int nf_register_hook(struct nf_hook_ops *reg)
{
    struct nf_hook_ops *elem;
    int err;

    err = mutex_lock_interruptible(&nf_hook_mutex);
    if (err < 0)
        return err;
    list_for_each_entry(elem, &nf_hooks[reg->pf][reg->hooknum], list) {
        if (reg->priority < elem->priority)
            break;
    }
    list_add_rcu(&reg->list, elem->list.prev);
    mutex_unlock(&nf_hook_mutex);
    return 0;
}
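The ordering rule inside nf_register_hook can be tried out in user space. The following is a simplified sketch, not kernel code: it uses a plain singly linked list instead of the kernel's RCU-protected list_head, takes no lock, and the names `hook_entry` and `hook_list_insert` are invented for illustration. The rule is the same, though: a new entry goes in front of the first existing entry whose priority is strictly greater, so entries with equal priority stay in registration order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct nf_hook_ops: only the fields needed
 * to demonstrate the ordering. */
struct hook_entry {
    int priority;               /* lower values run earlier */
    struct hook_entry *next;
};

/* Insert reg before the first entry with a strictly greater priority,
 * mirroring the "reg->priority < elem->priority" test in the kernel loop. */
void hook_list_insert(struct hook_entry **head, struct hook_entry *reg)
{
    struct hook_entry **pos = head;

    assert(head != NULL && reg != NULL);
    while (*pos != NULL && (*pos)->priority <= reg->priority)
        pos = &(*pos)->next;    /* skip entries that must run first */
    reg->next = *pos;
    *pos = reg;
}
```

Registering priorities 0, -100, and 0 in that order yields the list -100, 0 (first registered), 0 (second registered), which is the order in which nf_iterate would call the hooks.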

References:

https://blog.csdn.net/windeal3203/article/details/51204911



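2018-01-23 13:46:54 zhuyong006 (103 views)

Setup:

  1. The phone sends the numbers 0 to 19 over TCP to a Linux host, and the Linux host runs the listening server
  2. The Linux server's IP address is 192.168.5.174 and its port is 9999
1. Client program on the phone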
     tcp_client.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
int main(void)
{
 int fd;
 struct sockaddr_in addr;
 int r;
 int i=0;
 //1.socket
 fd=socket(AF_INET,SOCK_STREAM,0);
 if(fd==-1) printf("1:%m\n"),exit(-1);
 printf("建立socket成功!\n");
 //2.connect
 addr.sin_family=AF_INET;
 addr.sin_port=htons(9999);
 inet_aton("192.168.5.174",&addr.sin_addr);
 r=connect(fd,
   (struct sockaddr*)&addr,sizeof(addr));
 if(r==-1) printf("2:%m\n"),exit(-1);
 printf("连接服务器成功!\n");
 
 
 for(i=0;i<20;i++)
 {
  send(fd,&i,4,0);
 }
 close(fd);
}

     Android.mk

LOCAL_PATH := $(call my-dir)

include $(CLEAR_VARS)

LOCAL_SRC_FILES := \
	tcp_client.c
LOCAL_CFLAGS += -pie -fPIE
LOCAL_LDFLAGS += -pie -fPIE

LOCAL_MODULE := tcp_client
include $(BUILD_EXECUTABLE)

2.  Listening server on the Linux host

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
int main(void)
{
	int serverfd;
	int cfd;
	int a;
	struct sockaddr_in sadr;
	struct sockaddr_in cadr;
	socklen_t len;
	int r;
	char buf[1024];
	//1.socket
	serverfd=socket(AF_INET,SOCK_STREAM,0);
	if(serverfd==-1) printf("1:%m\n"),exit(-1);
	printf("建立服务器socket成功!\n");
	//2.bind
	sadr.sin_family=AF_INET;
	sadr.sin_port=htons(9999);
	inet_aton("192.168.5.174",&sadr.sin_addr);
	r=bind(serverfd,
			(struct sockaddr*)&sadr,sizeof(sadr));
	if(r==-1) printf("2:%m\n"),exit(-1);
	printf("服务器地址绑定成功!\n");
	
	//3.listen
	r=listen(serverfd,10);
	if(r==-1) printf("3:%m\n"),exit(-1);
	printf("监听服务器成功!\n");
	
	//4.accept
	len=sizeof(cadr);
	cfd=accept(serverfd,
			(struct sockaddr*)&cadr,&len);
	printf("有人连接:%d,IP:%s:%u\n",
			cfd,inet_ntoa(cadr.sin_addr),
			ntohs(cadr.sin_port));		
	
	//5. Handle data from the accepted client descriptor
	while(1)
	{
		r=recv(cfd,&a,4,MSG_WAITALL);		
		if(r>0)
		{
			//buf[r]=0;
			printf("::%d\n",a);
		}
		
		if(r==0)
		{
			printf("连接断开!\n");
			break;
		}
		if(r==-1)
		{
			printf("网络故障!\n");
			break;
		}
	}
	close(cfd);
	close(serverfd);
}

3.  Test results

Client:

root@Hisense:/data # ./tcp_client
建立socket成功!
2:m

Server:

root@zhuyong:/home/zhuyong/test# ./tcp_server
建立服务器socket成功!
服务器地址绑定成功!
监听服务器成功!
有人连接:4,IP:192.168.2.10:41933
::0
::1
::2
::3
::4
::5
::6
::7
::8
::9
::10
::11
::12
::13
::14
::15
::16
::17
::18
::19
连接断开!



