精华内容
下载资源
问答
  • ARM® Cortex™-A8 Series Programmer’s Guide This Cortex-A Series Programmer’s Guide is protected by copyright and the practice or implementation of the information herein may be protected by one or ...
  • ARM® Cortex™-A7 Series Version: 4.0 Programmer’s Guide This Cortex-A Series Programmer’s Guide is protected by copyright and the practice or implementation of the information herein may be ...
  • ARM汇编语言手册,中文版的ARM汇编语言手册,中文版的ARM汇编语言手册,中文版的
  • ARM Cortex-A(armV7)编程手册V4.0,很底层,很不错,很强大
  • ARM Cortex-A 编程手册学习笔记

    千次阅读 2016-04-03 15:52:32
    从前都在X86上分析内核,做开发、trouble shooting,对于其他架构了解较少,对于新架构的学习,甚至还有些抵触,这次趁分析问题的机会,顺便学习了一下ARM架构的基础知识,权当笔记。 这里主要是AArch32架构(即32位...

    闲话

    从前都在X86上分析内核,做开发、trouble shooting,对于其他架构了解较少,对于新架构的学习,甚至还有些抵触,这次趁分析问题的机会,顺便学习了一下ARM架构的基础知识,权当笔记。

    这里主要是AArch32架构(即32位,后面就都简写成ARM了),相对比较简单,入门必备。

    ARM处理器模式

    在引入安全扩展之前,ARM有7中处理器模式,其中6种是特权模式,剩下一种为用户程序运行的非特权模式。特权模式下,可以做一些user模式下不能做的操作,比如mmu配置和cache操作。

    ModeFunctionPrivilege
    User (USR)Mode in which most programs and applications runUnprivileged
    FIQEntered on an FIQ interrupt exception IRQ Entered on an IRQ interrupt exception 
    Supervisor (SVC)Entered on reset or when a Supervisor Call instruction (SVC) 
    is executed  
    Abort (ABT)Entered on a memory access exception Undef (UND) Entered when an undefined instruction executed 
    System (SYS)Mode in which the OS runs, sharing the register view 

    TrustZone Security Extensions引入了独立于模式的两种安全状态,和一种新的Monitor模式。如此就有8种CPU模式了。

    TrustZone Security Extensions的环境中,通过将所有的硬件和软件资源按设备来划分,来提升系统安全性。 在CPU处于Normal (Non-secure) state,其不能访问在Secure state下分配的内存。

    当前处理器所处模式,由CPSR(当前程序状态寄存器)寄存器的值决定,改变cpu模式可以通过特权软件显示设置该寄存器,也可以由异常触发(发生异常后,CPU就会自动切换到对应的模式)。

    寄存器

    ARM架构提供16个32位的通用寄存器(R0-R15),R0-R14可用作普通的数据存储。

    R15为PC寄存器,修改R15可以改变程序的执行流程。PC值指向当前执行指令的地址加8处,PC是有读写限制的。

    R14也称LR(Link register),通常用于保存当前函数的返回地址(上一级函数),当其空闲时,也可以用作普通寄存器用。每种模式下都有自己独立(物理上)的R14。

    R13也称SP,即用于保存堆栈的栈顶指针,当其空闲时,也可以用作普通寄存器用。每一种异常模式都有其自己独立的r13,它通常指向异常模式所专用的堆栈。当ARM进入异常模式的时候,程序就可以把一般通用寄存器压入堆栈,返回时再出栈,保证了各种模式下程序的状态的完整性。

    程序也可以访问CPSR,SPSR是前一个执行模式下CPSR的副本。

    虽然软件可以访问这些寄存器,但是不同模式下,部分寄存器实际对应的物理存储位置可能不同,也就是说,不同模式下,部分寄存器(除R0-7和R15外)实际物理上是不同(虽然对软件来说是透明的,软件看到的是相同的逻辑上的寄存器),这些寄存器只能在相应模式下才能访问。

    Program Status Registers。CPSR用于存储:flags、当前CPU模式、中断禁用标记、当前CPU状态等。在user模式下,CPSR对应的寄存器为APSR(限制版本的CPSR)。

    Coprocessor 15

    CP15,系统控制协处理器,用于控制Core的需要特性,包括c0-c15主要的32位寄存器,这些寄存器通常也通过名称访问,如CP15.SCTLR

    Cache

    ARM中的常见Cache架构如下:

    ARM中Cache架构

    cache问题

    让程序执行时间变得不可知。

    对于实时性要求高的系统来说有点不可接受。

    外设需要用up-to-date的数据,需要cache控制。

    其他

    tag占物理空间,但不计算在cache size内

    invalidate操作:丢掉cache。

    clean操作:将脏数据刷入内存,保证一致。

    Point of coherency and unification

    For set/way based clean and invalidate, the operation is performed on a specific level of cache.For operations that use a virtual address, the architecture defines two conceptual points:

    Point of Coherency (PoC)

    For a particular address, the PoC is the point at which all blocks, for example,cores, DSPs, or DMA engines, that can access memory are guaranteed to see thesame copy of a memory location. Typically, this will be the main external systemmemory.

    Point of Unification (PoU)

    The PoU for a core is the point at which the instruction and data caches of the coreare guaranteed to see the same copy of a memory location. For example, a unifiedlevel 2 cache would be the point of unification in a system with Harvard level 1caches and a TLB for cacheing translation table entries. If no external cache ispresent, main memory would be the Point of unification.

    ARM64中包含如下几种类型的异常:

    1. 中断(Interrupts),就是我们平常理解的中断,主要由外设触发,是典型的异步异常。 ARM64中主要包括两种类型的中断:IRQ(普通中断)和FIQ(高优先级中断,处理更快)。 Linux内核中好像没有使用FIQ,还没有仔细看代码,具体不详述了。
    2. Aborts。 可能是同步或异步异常。包括指令异常(取指令时产生)、数据异常(读写内存数据时产生),可以由MMU产生(比如典型的缺页异常),也可以由外部存储系统产生(通常是硬件问题)。
    3. Reset。复位被视为一种特殊的异常。
    4. Exception generating instructions。由异常触发指令触发的异常,比如Supervisor Call (SVC)、Hypervisor Call (HVC)、Secure monitor Call (SMC)

    异常级别(EL)

    ARM中,异常由级别之分,具体如下图所示,只要关注:

    普通的用户程序处于EL0,级别最低

    内核处于EL1,HyperV处于EL2,EL1-3属于特权级别。

    Arm64中的异常级别

    异常处理

    Arm中的异常处理过程与X86比较相似,同样包括硬件自动完成部分和软件部分,同样需要设置中断向量,保存上下文,不同的异常类型的处理方式可能有细微差别。 这里不详述了。

    需要关注:用户态(EL0)不能处理异常,当异常发生在用户态时,异常级别(EL)会发生切换,默认切换到EL1(内核态)。

    中断向量表

    Arm64架构中的中断向量表有点特别(相对于X86来说~),包含16个entry,这16个entry分为4组,每组包含4个entry,每组中的4个entry分别对应4种类型的异常:

    1. SError
    2. FIQ
    3. IRQ
    4. Synchronous Aborts

    4个组的分类根据发生异常时是否发生异常级别切换、和使用的堆栈指针来区别。分别对应于如下4组:

    1. 异常发生在当前级别且使用SP_EL0(EL0级别对应的堆栈指针),即发生异常时不发生异常级别切换,可以简单理解为异常发生在内核态(EL1),且使用EL0级别对应的SP。 这种情况在Linux内核中未进行实质处理,直接进入bad_mode()流程。
    2. 异常发生在当前级别且使用SP_ELx(ELx级别对应的堆栈指针,x可能为1、2、3),即发生异常时不发生异常级别切换,可以简单理解为异常发生在内核态(EL1),且使用EL1级别对应的SP。 这是比较常见的场景。
    3. 异常发生在更低级别且在异常处理时使用AArch64模式。 可以简单理解为异常发生在用户态,且进入内核处理异常时,使用的是AArch64执行模式(非AArch32模式)。 这也是比较常见的场景。
    4. 异常发生在更低级别且在异常处理时使用AArch32模式。 可以简单理解为异常发生在用户态,且进入内核处理异常时,使用的是AArch32执行模式(非AArch64模式)。 这中场景基本未做处理。

    代码分析

    ## 中断向量表

    Linux内核中,中断向量表实现在entry.S文件中,代码如下:

    	/*
    	 * Exception vectors.
    	 */
    	
    		.align	11
    	/*el1代表内核态,el0代表用户态*/
    	ENTRY(vectors)
    		ventry	el1_sync_invalid		// Synchronous EL1t 
    		ventry	el1_irq_invalid			// IRQ EL1t
    		ventry	el1_fiq_invalid			// FIQ EL1t
    		/*内核态System Error ,使用SP_EL0(用户态栈)*/
    		ventry	el1_error_invalid		// Error EL1t
    	
    		ventry	el1_sync			// Synchronous EL1h
    		ventry	el1_irq				// IRQ EL1h
    		ventry	el1_fiq_invalid			// FIQ EL1h
    		/*内核态System Error ,使用SP_EL1(内核态栈)*/
    		ventry	el1_error_invalid		// Error EL1h
    	
    		ventry	el0_sync			// Synchronous 64-bit EL0
    		ventry	el0_irq				// IRQ 64-bit EL0
    		ventry	el0_fiq_invalid			// FIQ 64-bit EL0
    		/*用户态System Error ,使用SP_EL1(内核态栈)*/
    		ventry	el0_error_invalid		// Error 64-bit EL0
    	
    	#ifdef CONFIG_COMPAT
    		ventry	el0_sync_compat			// Synchronous 32-bit EL0
    		ventry	el0_irq_compat			// IRQ 32-bit EL0
    		ventry	el0_fiq_invalid_compat		// FIQ 32-bit EL0
    		ventry	el0_error_invalid_compat	// Error 32-bit EL0
    	#else
    		ventry	el0_sync_invalid		// Synchronous 32-bit EL0
    		ventry	el0_irq_invalid			// IRQ 32-bit EL0
    		ventry	el0_fiq_invalid			// FIQ 32-bit EL0
    		ventry	el0_error_invalid		// Error 32-bit EL0
    	#endif
    	END(vectors)
    

    可以明显看出分组和分类的情况。

    invalid类处理

    带invalid后缀的向量都是Linux做未做进一步处理的向量,默认都会进入bad_mode()流程,说明这类异常Linux内核无法处理,只能上报给用户进程(用户态,sigkill或sigbus信号)或die(内核态)

    带invalid后缀的向量最终都调用了inv_entry,inv_entry实现如下:

    /*
     * Invalid mode handlers
     */
     	/*Invalid类异常都在这里处理,统一调用bad_mode函数*/
    .macro	inv_entry, el, reason, regsize = 64
    kernel_entry el, \regsize
    /*传入bad_mode的三个参数*/
    mov	x0, sp
    /*reason由上一级传入*/
    mov	x1, #\reason
    /*esr_el1是EL1(内核态)级的ESR(异常状态寄存器),用于记录异常的详细信息,具体内容解析需要参考硬件手册*/
    mrs	x2, esr_el1
    /*调用bad_mode函数*/
    b	bad_mode
    .endm
    

    调用bad_mode,是C函数,通知用户态进程或者panic。

    /*
     * bad_mode handles the impossible case in the exception vector.
     */
    asmlinkage void bad_mode(struct pt_regs *regs, int reason, unsigned int esr)
    {
    	siginfo_t info;
    	/*获取异常时的PC指针*/
    	void __user *pc = (void __user *)instruction_pointer(regs);
    	console_verbose();
    	/*打印异常信息,messages中可以看到。*/
    	pr_crit("Bad mode in %s handler detected, code 0x%08x -- %s\n",
    		handler[reason], esr, esr_get_class_string(esr));
    	/*打印寄存器内容*/
    	__show_regs(regs);
    	/*如果发生在用户态,需要向其发送信号,这种情况下,发送SIGILL信号,所以就不会有core文件产生了*/
    	info.si_signo = SIGILL;
    	info.si_errno = 0;
    	info.si_code  = ILL_ILLOPC;
    	info.si_addr  = pc;
    	/*给用户态进程发生信号,或者die然后panic*/
    	arm64_notify_die("Oops - bad mode", regs, &info, 0);
    }
    

    arm64_notify_die:

    void arm64_notify_die(const char *str, struct pt_regs *regs,
    		      struct siginfo *info, int err)
    {
    	/*如果发生异常的上下文处于用户态,则给相应的用户态进程发送信号*/
    	if (user_mode(regs)) {
    		current->thread.fault_address = 0;
    		current->thread.fault_code = err;
    		force_sig_info(info->si_signo, info, current);
    	} else {
    		/*如果是内核态,则直接die,最终会panic*/
    		die(str, regs, err);
    	}
    }
    

    IRQ中断处理

    场景的场景中(不考虑EL2和EL3),IRQ处理分两种情况:用户态发生的中断和内核态发生的中断,相应的中断处理接口分别为:

    el0_sync
    el1_sync
    

    相应代码如下:

    el1_sync:

    	.align	6
    el1_irq:
    	/*保存中断上下文*/
    	kernel_entry 1
    	enable_dbg
    #ifdef CONFIG_TRACE_IRQFLAGS
    	bl	trace_hardirqs_off
    #endif
    	/*调用中断处理默认函数*/
    	irq_handler
    	/*如果支持抢占,处理稍复杂*/
    #ifdef CONFIG_PREEMPT
    	get_thread_info tsk
    	ldr	w24, [tsk, #TI_PREEMPT]		// get preempt count
    	cbnz	w24, 1f				// preempt count != 0
    	ldr	x0, [tsk, #TI_FLAGS]		// get flags
    	tbz	x0, #TIF_NEED_RESCHED, 1f	// needs rescheduling?
    	bl	el1_preempt
    1:
    #endif
    #ifdef CONFIG_TRACE_IRQFLAGS
    	bl	trace_hardirqs_on
    #endif
    	/*恢复上下文*/
    	kernel_exit 1
    ENDPROC(el1_irq)
    

    代码非常简单,主要就是调用了irq_handler()函数,不做深入解析了,有兴趣可以自己再看看代码。

    el0_sync处理类似,主要区别在于:其涉及用户态和内核态的上下文切换和恢复。

    	.align	6
    el0_irq:
    	kernel_entry 0
    el0_irq_naked:
    	enable_dbg
    #ifdef CONFIG_TRACE_IRQFLAGS
    	bl	trace_hardirqs_off
    #endif
    	/*退出用户上下文*/
    	ct_user_exit
    	irq_handler
    
    #ifdef CONFIG_TRACE_IRQFLAGS
    	bl	trace_hardirqs_on
    #endif
    	/*返回用户态*/
    	b	ret_to_user
    ENDPROC(el0_irq)
    
    
    

    原文地址: https://happyseeker.github.io/kernel/2016/03/05/ARM-Cortex-A-programming-mannual-notes.html

     
    展开全文
  • arm 优化c编程手册

    2018-06-13 10:09:26
    arm c编程扩展手册,支持armv7 armv8,c语言操作手册
  • ARM指令手册

    2019-03-03 15:24:08
    ARM代码的数据手册,学习ARM编程不会的时候方便查找以及解释
  • BK3432芯片是一款高度集成的蓝牙4.2双模式,带2Mbps数据速率选项。它集成了高性能RF收发器、基带、ARM9E内核、丰富的功能外设单元、PR支持BLE应用的可编程协议和概要文件。编程手册用于指导BK3432的开发。
  • Introduction This programming manual provides information for application and system-level software developers. It gives a full description of the STM32F3 and STM32F4 Series Cortex®-M4 ...
  • ARM NEON 编程快速参考手册

    千次阅读 2019-07-10 14:53:35
    原文地址:... ARM NEON programming quick reference 1 Introduction This article aims to introduce ARM NEON technology. Hope that beginners can get...

    原文地址:http://blog.csdn.net/zsc09_leaf/article/details/45825015

     

    ARM NEON programming quick reference

    1 Introduction

    This article aims to introduce ARM NEON technology. Hope that beginners can get started with NEON programming quickly after reading the article. The article will also inform users which documents can be consulted if more detailed information is needed.

    2 NEON overview

    This section describes the NEON technology and supplies some background knowledge.

    2.1 What is NEON?

    NEON technology is an advanced SIMD (Single Instruction, Multiple Data) architecture for the ARM Cortex-A series processors. It can accelerate multimedia and signal processing algorithms such as video encoder/decoder, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound.

    NEON instructions perform "Packed SIMD" processing:

    • Registers are considered as vectors of elements of the same data type
    • Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single-precision floating-point on ARM 32-bit platform, both single-precision floating-point and double-precision floating-point on ARM 64-bit platform.
    • Instructions perform the same operation in all lanes

    2.2 History of ARM Adv SIMD

    ARMv6[i]

    SIMD extension

    ARMv7-A

    NEON

    ARMv8-A AArch64

    NEON

    • Operates on 32-bit general purpose ARM registers
    • 8-bit/16-bit integer
    • 2x16-bit/4x8-bit operations per instruction
    • Separate register bank, 32x64-bit NEON registers
    • 8/16/32/64-bit integer
    • Single precision floating point
    • Up to 16x8-bit operations per instruction
    • Separate register bank, 32x128-bit NEON registers
    • 8/16/32/64-bit integer
    • Single precision floating point
    • double precision floating point, both of them are IEEE compliance
    • Up to 16x8-bit operations per instruction

    [i] The ARM Architecture Version 6 (ARMv6) David Brash: page 13

     

    2.3 Why use NEON

    NEON provides:

    • Support for both integer and floating point operations ensures the adaptability of a broad range of applications, from codecs to High Performance Computing to 3D graphics.
    • Tight coupling to the ARM processor provides a single instruction stream and a unified view of memory, presenting a single development platform target with a simpler tool flow[ii]

    3 ARMv7/v8 comparison

    ARMv8-A is a fundamental change to the ARM architecture. It supports the 64-bit Execution state called “AArch64”, and a new 64-bit instruction set “A64”. To provide compatibility with the ARMv7-A (32-bit architecture) instruction set, a 32-bit variant of ARMv8-A “AArch32” is provided. Most of existing ARMv7-A code can be run in the AArch32 execution state of ARMv8-A.

    This section compares the NEON-related features of both the ARMv7-A and ARMv8-A architectures. In addition, general purpose ARM registers and ARM instructions, which are used often for NEON programming, will also be mentioned. However, the focus is still on the NEON technology.

    3.1 Register

    ARMv7-A and AArch32 have the same general purpose ARM registers – 16 x 32-bit general purpose ARM registers (R0-R15).

    ARMv7-A and AArch32 have 32 x 64-bit NEON registers (D0-D31). These registers can also be viewed as 16x128-bit registers (Q0-Q15). Each of the Q0-Q15 registers maps to a pair of D registers, as shown in the following figure.

    注:V7a 有32个64位的D寄存器[D0-D31], 16个128位的Q寄存器 [Q0-Q15] ,一个Q对应2个D(2个D公用Q的高64位和低64位)。

           

    AArch64 by comparison, has 31 x 64-bit general purpose ARM registers and 1 special register having different names, depending on the context in which it is used. These registers can be viewed as either 31 x 64-bit registers (X0-X30) or as 31 x 32-bit registers (W0-W30).

    注:ARMv8 有31 个64位寄存器,1个不同名字的特殊寄存器,用途取决于上下文, 因此我们可以看成 31个64位的X寄存器或者31个32位的W寄存器(X寄存器的低32位)

    AArch64 has 32 x 128-bit NEON registers (V0-V31). These registers can also be viewed as 32-bit Sn registers or 64-bit Dn registers.

    注:ARMv8有32个128位的V寄存器,相似的,我们同样可以看成是32个32位的S寄存器或者32个64位的D寄存器。

     

    3.2 Instruction set[iii]

    The following figure illustrates the relationship between ARMv7-A, ARMv8-A AArch32 and ARMv8-A AArch64 instruction set.

       

     

    The ARMv8-A AArch32 instruction set consists of A32 (ARM instruction set, a 32-bit fixed length instruction set) and T32 (Thumb instruction set, a 16-bit fixed length instruction set; Thumb2 instruction set, 16 or 32-bit length instruction set). It is a superset of the ARMv7-A instruction set, so that it retains the backwards compatibility necessary to run existing software. There are some additions to A32 and T32 to maintain alignment with the A64 instruction set, including NEON division, and the Cryptographic Extension instructions. NEON double precision floating point (IEEE compliance) is also supported.

    3.3 NEON instruction format

    This section describes the changes to the NEON instruction syntax.

    3.3.1 ARMv7-A/AArch32 instruction syntax[iv]

    All mnemonics for ARMv7-A/AAArch32 NEON instructions (as with VFP) begin with the letter “V”. Instructions are generally able to operate on different data types, with this being specified in the instruction encoding. The size is indicated with a suffix to the instruction. The number of elements is indicated by the specified register size and data type of operation. Instructions have the following general format:

    V{<mod>}<op>{<shape>}{<cond>}{.<dt>}{<dest>}, src1, src2

    Where:

    <mod> - modifiers

    • Q: The instruction uses saturating arithmetic, so that the result is saturated within the range of the specified data type, such as VQABS, VQSHL etc.
    • H: The instruction will halve the result. It does this by shifting right by one place (effectively a divide by two with truncation), such as VHADD, VHSUB.
    • D: The instruction doubles the result, such as VQDMULL, VQDMLAL, VQDMLSL and VQ{R}DMULH
    • R: The instruction will perform rounding on the result, equivalent to adding 0.5 to the result before truncating, such as VRHADD, VRSHR.

    <op> - the operation (for example, ADD, SUB, MUL).

    <shape> - Shape.

    Neon data processing instructions are typically available in Normal, Long, Wide and Narrow variants.

    • Long (L): instructions operate on double-word vector operands and produce a quad-word vector result. The result elements are twice the width of the operands, and of the same type. Lengthening instructions are specified using an L appended to the instruction.

    • Wide (W): instructions operate on a double-word vector operand and a quad-word vector operand, producing a quad-word vector result. The result elements and the first operand are twice the width of the second operand elements. Widening instructions have a W appended to the instruction.

    • Narrow (N): instructions operate on quad-word vector operands, and produce a double-word vector result. The result elements are half the width of the operand elements. Narrowing instructions are specified using an N appended to the instruction.

    <cond> - Condition, used with IT instruction

    <.dt> - Data type, such as s8, u8, f32 etc.

    <dest> - Destination

    <src1> - Source operand 1

    <src2> - Source operand 2

    Note: {} represents and optional parameter.

    For example:

    VADD.I8 D0, D1, D2

    VMULL.S16 Q2, D8, D9

    For more information, please refer to the documents listed in the Appendix.

    3.3.2 AArch64 NEON instruction syntax[v]

    In the AArch64 execution state, the syntax of NEON instruction has changed. It can be described as follows:

    {<prefix>}<op>{<suffix>}  Vd.<T>, Vn.<T>, Vm.<T>

    Where:

    <prefix> - prefix, such as using S/U/F/P to represent signed/unsigned/float/bool data type.

    <op> – operation, such as ADD, AND etc.

    <suffix> - suffix

    • P: “pairwise” operations, such as ADDP.
    • V: the new reduction (across-all-lanes) operations, such as FMAXV.
    • 2:new widening/narrowing “second part” instructions, such as ADDHN2, SADDL2.

    ADDHN2: add two 128-bit vectors and produce a 64-bit vector result which is stored as high 64-bit part of NEON register.

    SADDL2: add two high 64-bit vectors of NEON register and produce a 128-bit vector result.

    <T> - data type, 8B/16B/4H/8H/2S/4S/2D. B represents byte (8-bit). H represents half-word (16-bit). S represents word (32-bit). D represents a double-word (64-bit).

    For example:

    UADDLP    V0.8H, V0.16B

    FADD V0.4S, V0.4S, V0.4S

     

    For more information, please refer to the documents listed in the Appendix.

    3.4 NEON instructions[vi]

    The following table compares the ARMv7-A, AArch32 and AArch64 NEON instruction set.

    “√” indicates that the AArch32 NEON instruction has the same format as ARMv7-A NEON instruction.

    “Y” indicates that the AArch64 NEON instruction has the same functionality as ARMv7-A NEON instructions, but the format is different. Please check the ARMv8-A ISA document.

    If you are familiar with the ARMv7-A NEON instructions, there is a simple way to map the NEON instructions of ARMv7-A and AArch64. It is to check the NEON intrinsics document, so that you can find the AArch64 NEON instruction according to the intrinsics instruction.

    New or changed functionality is highlighted.

     

    ARMv7-A

    AArch32

    AArch64

    logical and compare

    VAND, VBIC, VEOR, VORN, and VORR (register)

    Y

    VBIC and VORR (immediate)

    Y

    VBIF, VBIT, and VBSL

    Y

    VMOV, VMVN (register)

    Y

    VACGE and VACGT

    Y

    VCEQ, VCGE, VCGT, VCLE, and VCLT

    Y

    VTST

    Y

    general data processing

    VCVT (between fixed-point or integer, and floating-point)

    Y

    VCVT (between half-precision and single-precision floating-point)

    Y

    n/a

          n/a

    FCVTXN(double

    to single-precision)

    VDUP

    Y

    VEXT

    Y

    VMOV, VMVN (immediate)

    Y

    VMOVL, V{Q}MOVN, VQMOVUN

    Y

    VREV

    Y

    VSWP

    n/a

    VTBL, VTBX

    Y

    VTRN

    TRN1, TRN2

    VUZP, VZIP

    UZP1,UZP2, ZIP, ZIP2

    n/a

    n/a

    INS

    n/a

    VRINTA, VRINM,

    VRINTN, VRINTP,

    VRINTR, VRINTX,

    VRINTZ

    FRINTA, FRINTI, FRINTM, FRINTN, FRINTP, FRINTX, FRINTZ

    shift

    VSHL, VQSHL, VQSHLU, and VSHLL (by immediate)

    Y

    V{Q}{R}SHL (by signed variable)

    Y

    V{R}SHR

    Y

    V{R}SHRN

    Y

    V{R}SRA

    Y

    VQ{R}SHR{U}N

    Y

    VSLI and VSRI

    Y

    general arithmetic

    VABA{L} and VABD{L}

    Y

    V{Q}ABS and V{Q}NEG

    Y

    V{Q}ADD, VADDL, VADDW, V{Q}SUB, VSUBL, and VSUBW

    Y

    n/a

    n/a

    SUQADD, USQADD

    V{R}ADDHN and V{R}SUBHN

    Y

    V{R}HADD and VHSUB

    Y

    VPADD{L}, VPADAL

    Y

    VMAX, VMIN, VPMAX, and VPMIN

    Y

    n/a

    n/a

    FMAXNMP, FMINNMP

    VCLS, VCLZ, and VCNT

    Y

    VRECPE and VRSQRTE

    Y

    VRECPS and VRSQRTS

    Y

    n/a

    n/a

    FRECPX

    RBIT

    FSQRT

    ADDV

    SADDLV, UADDLV

    SMAXV,UMAXV,FMAXV

    FMAXNMV

    SMINV,UMINV,FMINV

    FMINNMV

    multiply

    VMUL{L}, VMLA{L}, and VMLS{L}

    There isn’t float MLA/MLS

    VMUL{L}, VMLA{L}, and VMLS{L} (by scalar)

    Y

    VFMA, VFMS

    Y

    VQDMULL, VQDMLAL, and VQDMLSL (by vector or by scalar)

    Y

    VQ{R}DMULH (by vector or by scalar)

    Y

    n/a

    n/a

    FMULX

    n/a

    n/a

    FDIV

    load and store

    VLDn/VSTn(n=1, 2, 3, 4)

    Y

    VPUSH/VPOP

    n/a

    Crypto Extension

    n/a

    PMULL, PMULL2

    PMULL, PMULL2

    AESD, AESE

    AESD, AESE

    AESIMC, AESMC

    AESIMC, AESMC

    SHA1C, SHA1H, SHA1M, SHA1P

    SHA1C, SHA1H, SHA1M, SHA1P

    SHA1SU0,

    SHA1SU1

    SHA1SU0,

    SHA1SU1

    SHA256H,

    SHA256H2

    SHA256H,

    SHA256H2

    SHA256SU0,

    SHA256SU1

    SHA256SU0,

    SHA256SU1

     

    4 NEON programming basics

    There are four ways of using NEON[vii]:

    • NEON optimized libraries
    • Vectorizing compilers
    • NEON intrinsics
    • NEON assembly

    4.1 Libraries

    The users can call the NEON optimized libraries directly in their program. Currently, you can use the following libraries:

    • OpenMax DL

    This provides the recommended approach for accelerating AV codecs and supports signal processing and color space conversions.

    • Ne10

    It is ARM’s open source project. Currently, the Ne10 library provides some math, image processing and FFT function. The FFT implementation is faster than other open source FFT implementations.

    4.2 Vectorizing compilers

    Adding vectorizing options in GCC can help C code to generate NEON code. GNU GCC gives you a wide range of options that aim to increase the speed, or reduce the size of the executable files they generate. For each line in your source code there are generally many possible choices of assembly instructions that could be used. The compiler must trade-off a number of resources, such as registers, stack and heap space, code size (number of instructions), compilation time, ease of debug, and number of cycles per instruction in order to produce an optimized image file.

    4.3 NEON intrinsics

    NEON intrinsics provides a C function call interface to NEON operations, and the compiler will automatically generate relevant NEON instructions allowing you to program once and run on either an ARMv7-A or ARMv8-A platform. If you intend to use the AArch64 specific NEON instructions, you can use the (__aarch64__) macro definition to separate these codes, as in the following example.

    NEON intrinsics example:

    //add for float array. assumed that count is multiple of 4

     

    #include<arm_neon.h>

     

    void add_float_c(float* dst, float* src1, float* src2, int count)

    {

         int i;

         for (i = 0; i < count; i++)

             dst[i] = src1[i] + src2[i];

    }

     

    void add_float_neon1(float* dst, float* src1, float* src2, int count)

    {

         int i;

         for (i = 0; i < count; i += 4)

         {

             float32x4_t in1, in2, out;

             in1 = vld1q_f32(src1);

             src1 += 4;

             in2 = vld1q_f32(src2);

             src2 += 4;

             out = vaddq_f32(in1, in2);

             vst1q_f32(dst, out);

             dst += 4;

     

             // The following is only an example describing how to use AArch64 specific NEON

             // instructions.                             

    #if defined (__aarch64__)

             float32_t tmp = vaddvq_f32(in1);

    #endif

     

         }

    }

    Checking disassembly, you can find vld1/vadd/vst1 NEON instruction on ARMv7-A platform and ldr/fadd/str NEON instruction on ARMv8-A platform.

    4.4 NEON assembly

    There are two ways to write NEON assembly:

    • Assembly files
    • Inline assembly

    4.4.1 Assembly files

    You can use ".S" or “.s” as the file suffix. The only difference is that C/C ++ preprocessor will process .S files first. C language features such as macro definitions can be used.

    When writing NEON assembly in a separate file, you need to pay attention to saving the registers. For both ARMv7 and ARMv8, the following registers must be saved:

     

    ARMv7-A/AArch32

    AArch64[viii]

    General purpose registers

    R0-R3 parameters

    R4-R11 need to be saved

    R12 IP

    R13(SP)

    R14(LR) need to be saved

    R0 for return value

    X0-X7 parameters

    X8-X18

    X19-X28 need to be saved

    X29(FP) need to be saved

    X30(LR)

    X0, X1  for return value

    NEON registers

    D8-D15 need to be saved

    D part of V8-V15 need to be saved

    Stack alignment

    64-bit alignment

    128-bit alignment[ix]

    Stack push/pop

    PUSH/POP Rn list

    VPUSH/VPOP Dn list

    LDP/STP register pair

    The following is an example of ARM v7-A and ARM v8-A NEON assembly.

    //header

    void add_float_neon2(float* dst, float* src1, float* src2, int count);

    //assembly code in .S file

    ARMv7-A/AArch32

    AArch64

        .text

        .syntax unified

     

        .align 4

        .global add_float_neon2

        .type add_float_neon2, %function

        .thumb

    .thumb_func

     

    add_float_neon2:

    .L_loop:

        vld1.32  {q0}, [r1]!

        vld1.32  {q1}, [r2]!

        vadd.f32 q0, q0, q1

        subs r3, r3, #4

        vst1.32  {q0}, [r0]!

        bgt .L_loop

     

        bx lr

        .text

     

        .align 4

        .global add_float_neon2

        .type add_float_neon2, %function

     

    add_float_neon2:

     

    .L_loop:

        ld1     {v0.4s}, [x1], #16

        ld1     {v1.4s}, [x2], #16

        fadd    v0.4s, v0.4s, v1.4s

        subs x3, x3, #4

        st1  {v0.4s}, [x0], #16

        bgt .L_loop

     

        ret

    For more examples, see: https://github.com/projectNe10/Ne10/tree/master/modules/dsp

    4.4.2 Inline assembly

    You can use NEON inline assembly directly in C/C++ code.

    Pros:

    • The procedure call standard is simple. You do not need to save registers manually.
    • You can use C / C ++ variables and functions, so it can be easily integrated into C / C ++ code.

    Cons:

    • Inline assembly has a complex syntax.
    • NEON assembly code is embedded in C/C ++ code, and it’s not easily ported to other platforms.

    Example:

    // ARMv7-A/AArch32

    void add_float_neon3(float* dst, float* src1, float* src2, int count)

    {

        asm volatile (

                   "1:                                                                \n"

                   "vld1.32         {q0}, [%[src1]]!                          \n"

                   "vld1.32         {q1}, [%[src2]]!                          \n"

                   "vadd.f32       q0, q0, q1                                 \n"

                   "subs            %[count], %[count], #4              \n"

                   "vst1.32         {q0}, [%[dst]]!                           \n"

                   "bgt             1b                                             \n"

                   : [dst] "+r" (dst)

                   : [src1] "r" (src1), [src2] "r" (src2), [count] "r" (count)

                   : "memory", "q0", "q1"

              );

    }

    // AArch64

    void add_float_neon3(float* dst, float* src1, float* src2, int count)

    {

        asm volatile (

                   "1:                                                              \n"

                   "ld1             {v0.4s}, [%[src1]], #16               \n"

                   "ld1             {v1.4s}, [%[src2]], #16               \n"

                   "fadd            v0.4s, v0.4s, v1.4s                    \n"

                   "subs            %[count],  %[count], #4           \n"

                   "st1             {v0.4s}, [%[dst]], #16                 \n"

                   "bgt             1b                                           \n"

                   : [dst] "+r" (dst)

                   : [src1] "r" (src1), [src2] "r" (src2), [count] "r" (count)

                   : "memory", "v0", "v1"

              );

     

    }

    4.5 NEON intrinsics and assembly

    NEON intrinsics and assembly are the commonly used NEON. The following table describes the pros and cons of these two approaches:

     

    NEON assembly

    NEON intrinsic

    Performance

    Always shows the best performance for thespecified platform for an experienced developer.

    Depends heavily on the toolchain used

    Portability

    The different ISAs (ARMv7-A/AArch32 and AArch64) have different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance between different micro architectures.

    Program once and run on different ISA’s. The compiler may also grant performance fine-tuning for different micro-architectures.

    Maintainability

    Hard to read/write compared to C.

    Similar to C code, it’s easy to read/write.

    This is a simple summary. When applying NEON to more complex scenarios, there will be many special cases. This will be described in a future article ARM NEON Optimization.

    With the above information, you can choose a NEON implementation and start your NEON programming journey.

    For more reference documentation, please check the appendix.

    Appendix: NEON reference document

     


    [i] The ARM Architecture Version 6 (ARMv6) David Brash: page 13

    [ii] ARM Cortex-A Series Programmer’s Guide Version 4.0: page 7-5

    [iii] http://www.arm.com/zh/products/processors/instruction-set-architectures/armv8-architecture.php

     

    [iv] ARM® Compiler toolchain Version 5.02 Assembler Reference: Chapter 4

    NEON and VFP Programming

    ARM Cortex™-A Series Version: 4.0 Programmer’s Guide: 7.2.4 NEON instruction set

    [v] ARMv8 Instruction Set Overview: 5.8 Advanced SIMD

    [vi] ARMv8 Instruction Set Overview: 5.8.25 AArch32 Equivalent Advanced SIMD Mnemonics

    [vii] http://www.arm.com/zh/products/processors/technologies/neon.php

    [viii]Procedure Call Standard for the ARM 64-bit Architecture (AArch64) : 5 THE BASE PROCEDURE CALL STANDARD

    [ix] Procedure Call Standard for the ARM 64-bit Architecture (AArch64) : 5.2.2 The Stack

    展开全文
  • STM32Cortex-M3编程手册

    2019-01-07 17:19:18
    STM32Cortex-M3编程手册,嵌入式开发参考用。
  • bk3432编程手册.pdf

    2020-02-13 20:12:47
    上海博通bk的ble芯片,bk3432 编程指南,应用开发指导,编程手册. BK3432芯片是一款高度集成的蓝牙4.2双模式,带2Mbps数据速率选项。它集成了高性能RF收发器、基带、ARM9E内核、丰富的功能外设单元、PR支持BLE应用...
  • ARM-Cortex_-M4内核参考手册
  • UCD3138数字电源外设编程手册(中文版)doc,UCD3138是TI的数字电源控制器,提供了一流水平的针对高性能隔离电源的应用的单芯片高集成 度解决方案。其核心是数字控制环路的外设,也被称为数字电源外设(DPP)用于控制...
  • ARM Cortex-A(armV8)编程手册V1.0 ,是V7版本的升级版本,可以参考
  • ARM Cortex-M7 内核编程技术说明文档,含编程模型、系统控制、NVIC中断、内存分布、内存保护单元、浮点运算单元、调试单元。
  • arm 内联函数手册

    2018-06-13 10:07:18
    arm算法优化编程手册,提供armv7 armv8的所有内联指令
  • ARM Cortex-A(armV8)编程手册V1.0 ,是V7版本的升级版本,可以参考
  • ARM NEON 查找手册,可以查找neon内建函数的功能以及入参和返回值类型; RVCT 提供在 ARM 和 Thumb 状态下为 Cortex-A8 处理器生成 NEON 代码的内在 函数。 NEON 内在函数在头文件 arm_neon.h 中定义。头文件既...
  • ARM开发指南手册

    2019-04-08 18:23:03
    本资源是嵌入式的中文开发指南手册,配合正点原子的战舰V3的硬件资源,解压密码是zyl
  • Cortex-A系列编程手册

    2014-01-23 09:52:51
    为Cortex-A系列处理器开发应用的程序开发者提供把广泛的不同类别的资源收集到一起集成为一本指南,涵盖那些对应用代码开发者有价值的缓存,内存管理等概念,是为C\编开发者提供有用的信息。
  • arm968 技术手册

    2012-01-17 17:33:44
    arm9 技术手册 相关的arm9的技术内容可以从中查询
  • 压缩包包括 STM32H7xx参考手册(V3中文版) STM32H7xx参考手册 STM32H7xx编程手册三份文档 ST公司的STM32H743I是高性能工作频率400MHz的32位ARM Cortex®-M7MCU,具有浮点单元(FPU),支持Arm®双精度(IEEE 754...
  • Linux-UNIX系统编程手册共两册文字可复制,适合linux开发者。
  • ARM Cortex-M7 内核编程技术说明文档,最近在研究相关的内核编程策略和相比M4的内核特性的进化,所以就把资料传上来,供大家一起研究
  • STM32F103中文编程手册

    热门讨论 2011-09-18 18:44:02
    STM32F103系列的ARM-Cortex3核心 中文说明,非常适用初学者或提升者。
  • ARM处理器是一种16/32位的高性能、低成本、低功耗的嵌入式RISC微处理器,由ARM公司设计,然后授权给各半导体厂商生产,它目前已经成为... 本书既可作为学习ARM技术的培训材料,也可作为嵌入式系统开发人员的参考手册
  • arm7内核Cortex-M3 技术参考手册中文高清PDF版,非常好的学习arm7的资料。。
  • 一步一步写嵌入式操作系统--ARM编程的方法与实践.pdf 最详细的u-boot讲解.pdf 嵌入式Linux系统开发技术详解--基于ARM(完整版).pdf 嵌入式Linux 操作系统基础教程.pdf 嵌入式开发自学指导.pdf [嵌入式Linux应用开发...
  • FM33A0编程手册

    2019-04-22 15:37:56
    FM33A0xx_ds_chs、bootloaderFM33A0xx组合示例v2.0.zip 、FM33A0xx系列ARM固件函数库用户使用手册_V1.0.pdf、FM17520完整数据手册、FM17550、Keil.FM33A0XX_DFP.0.2.00beta.pack、Keil环境下复旦微FM33A0系列ARM芯片...

空空如也

空空如也

1 2 3 4 5 ... 20
收藏数 12,644
精华内容 5,057
关键字:

arm编程手册