Introduction to ARM64 NEON assembly
This article was written back in 2013, right after Apple released ARM64-based iPhones and iPads.
If you own a somewhat recent iPhone or iPad, you already own a shiny ARM64 CPU to play with.
Let’s start with a trivial operation: adding two vectors of 32-bit floats.
C++ code:
auto add_to(float *pDst, const float *pSrc, long size) noexcept -> void {
    for (long i = 0; i < size; i++) {
        *pDst++ += *pSrc++;
    }
}
As we’ll be writing the entire routine in plain assembly (sorry, I hate GCC’s inline syntax), we need to study the architecture and its calling convention a bit before diving in.
From the “Procedure Call Standard for the ARM 64-bit Architecture” and the “ARMv8 Instruction Set Overview” we learn the following:
- Access to a larger general-purpose register file with 31 unbanked registers (0–30), with each register extended to 64 bits.
- Floating point and Advanced SIMD processing share a register file, in a similar manner to AArch32, but extended to thirty-two 128-bit registers. Smaller registers are no longer packed into larger registers, but are mapped one-to-one to the low-order bits of the 128-bit register.
- Unaligned addresses are permitted for most loads and stores, including paired register accesses, floating point and SIMD registers, with the exception of exclusive and ordered accesses.
- There are no multiple register LDM, STM, PUSH and POP instructions, but load-store of a non-contiguous pair of registers is available.
- The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.
- The first eight registers, r0-r7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
- The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).
- Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values.
- Floating point support is similar to AArch32 VFP but with some extensions.
Our function add_to takes three parameters: two pointers and a long (64-bit). So we expect the arguments to be:
'dst' : x0
'src' : x1
'size' : x2
If we were passing floats, registers v0-v7 would hold the FP arguments.
For example, if you have a function prototype like the following:
void foo(float *p1, float x1, float *p2, float x2,
float **array, double factor, long size);
Arguments would be dispatched across registers like this:
'p1' : x0
'x1' : v0
'p2' : x1
'x2' : v1
'array' : x2
'factor' : v2
'size' : x3
That’s plenty of space (x0-x7 + v0-v7) to pass arguments: great, no more stack digging!
OK, now we want to use all thirty-two 128-bit registers (v0 to v31), so we’ll need to save the bottom 64 bits of v8 to v15 on the stack, as required by the calling convention.
Create a new plain assembly file (.S) in your IDE (in my case: Xcode) and let’s write two handy macros for pushing and restoring those registers on the stack. We won’t use any general-purpose callee-saved registers, so we can skip allocating scratch space for those.
VPUSH/VPOP are gone, so we’ll use the new “register pair” load/store instructions to manipulate two vectors at once. We want to save only the bottom 64 bits of each 128-bit register, so we reference those vectors using the letter ‘d’ (double, 64 bits).
The stack grows from “bottom memory” (higher addresses) towards “top memory” (lower addresses), hence the use of pre-indexed addressing to store each pair. Two 64-bit registers take 16 bytes, so we push each register pair 16 bytes below the current stack pointer.
// preserve_caller_vectors(): Push bottom 64 bits of v8-v15 onto the stack (sp)
.macro preserve_caller_vectors
stp d8,d9,[sp,#-16]! // Store pair at [sp - 16], write back sp (pre-index)
stp d10,d11,[sp,#-16]!
stp d12,d13,[sp,#-16]!
stp d14,d15,[sp,#-16]!
.endm
When restoring, sp points at the last pair saved (here: d14 and d15), so we need to load them back in reverse order.
// restore_caller_vectors(): Restore bottom 64 bits of v8-v15 from the stack (sp)
.macro restore_caller_vectors
ldp d14,d15,[sp],#16 // Load pair at [sp], then sp += 16 (post-index)
ldp d12,d13,[sp],#16
ldp d10,d11,[sp],#16
ldp d8,d9,[sp],#16
.endm
Perfect, now we can safely use the whole FP register bank for ourselves. We can store four 32-bit floats per register, so we can fit a total of 32 * 4 = 128 floats. Since we have a destination and a source, we’ll be able to process 128 / 2 = 64 floats per loop iteration.
Let’s load 64 floats from the ‘src’ pointer (second function argument, in ‘x1’) into v0-v15. We use the ‘ld1’ instruction, which loads multiple single-element structures from memory without any de-interleaving (we don’t need interleaving in our case).
ld1.4s {v0, v1, v2, v3},[x1],#64 // Load 16 floats (64 bytes) from 'src', post-increment x1
ld1.4s {v4, v5, v6, v7},[x1],#64
ld1.4s {v8, v9, v10, v11},[x1],#64
ld1.4s {v12, v13, v14, v15},[x1],#64
Nice. We won’t reuse the ‘x1’ pointer until the next loop iteration, so we simply increment it after each read.
Note that you could also make the format of each operand vector explicit:
ld1 {v0.4s, v1.4s, v2.4s, v3.4s},[x1],#64
Since we use the same layout for each vector, we’ve put the desired format as a suffix to the instruction, for clarity.
Time to load 64 floats from the ‘dst’ pointer (first function argument, in ‘x0’), taking care to copy the address into another register so we can store the results later on. We load into v16 to v31, and it goes like this:
mov x3,x0 // Save address 'x0' to 'x3'.
ld1.4s {v16, v17, v18, v19},[x0],#64
ld1.4s {v20, v21, v22, v23},[x0],#64
ld1.4s {v24, v25, v26, v27},[x0],#64
ld1.4s {v28, v29, v30, v31},[x0],#64
Pretty easy, right? Now we’ll do the actual arithmetic: add the values from ‘src’ onto ‘dst’.
fadd.4s v16, v16, v0 // v16 += v0
fadd.4s v17, v17, v1 // v17 += v1
fadd.4s v18, v18, v2 // ...
fadd.4s v19, v19, v3
fadd.4s v20, v20, v4
fadd.4s v21, v21, v5
fadd.4s v22, v22, v6
fadd.4s v23, v23, v7
fadd.4s v24, v24, v8
fadd.4s v25, v25, v9
fadd.4s v26, v26, v10
fadd.4s v27, v27, v11
fadd.4s v28, v28, v12
fadd.4s v29, v29, v13
fadd.4s v30, v30, v14
fadd.4s v31, v31, v15
To store the results, we use st1 (the store counterpart of ld1, again with no interleaving), together with the ‘dst’ address we saved in ‘x3’:
st1.4s {v16, v17, v18, v19},[x3],#64
st1.4s {v20, v21, v22, v23},[x3],#64
st1.4s {v24, v25, v26, v27},[x3],#64
st1.4s {v28, v29, v30, v31},[x3],#64
Oh, we still need a counter and a branch to do the actual looping.
ARM64 keeps conditional instructions to a minimum, as we’ve read in the reference documents. Still, this is very straightforward:
subs x2, x2, #1 // Subtract 1 and update the status flags
cbnz x2, Loop_location // Branch if the register is non-zero
Note that cbnz compares the register against zero directly, so the flags set by subs aren’t actually needed here; a plain sub would work just as well.
Now, why subtract “1” and not “64”? I pre-divided the ‘size’ argument (third function argument, in ‘x2’) by 64 just before entering the loop:
lsr x2, x2, #6 // Logical shift right by 6 => size /= 64
I will not add any checks on the ‘size’ argument, for clarity, but you need to ensure ‘size’ is a non-zero multiple of 64.
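As a sketch of how a caller could enforce that precondition in C++, here is one possible wrapper. The add_to_any_size name is purely illustrative, and neon64_add_to is the C-visible name of the _neon64_add_to label we define below (the compiler prepends the underscore on Apple platforms). It runs the NEON routine on the largest multiple of 64 and finishes the remainder with the plain C++ loop:
// Declaration of the assembly routine defined in the .S file.
extern "C" void neon64_add_to(float *pDst, const float *pSrc, long size);

// Hypothetical wrapper: vectorize the largest multiple of 64, finish the tail in C++.
void add_to_any_size(float *pDst, const float *pSrc, long size) {
    const long vectorized = size & ~63L;       // Largest multiple of 64 <= size
    if (vectorized > 0) {
        neon64_add_to(pDst, pSrc, vectorized); // 64 floats per loop iteration
    }
    for (long i = vectorized; i < size; i++) { // Scalar tail (0 to 63 elements)
        pDst[i] += pSrc[i];
    }
}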
Putting it all together, we get this:
// preserve_caller_vectors(): Push bottom 64 bits of v8-v15 onto the stack (sp)
.macro preserve_caller_vectors
stp d8,d9,[sp,#-16]!
stp d10,d11,[sp,#-16]!
stp d12,d13,[sp,#-16]!
stp d14,d15,[sp,#-16]!
.endm
// restore_caller_vectors(): Restore bottom 64 bits of v8-v15 from the stack (sp)
.macro restore_caller_vectors
ldp d14,d15,[sp],#16
ldp d12,d13,[sp],#16
ldp d10,d11,[sp],#16
ldp d8,d9,[sp],#16
.endm
.globl _neon64_add_to
.align 4
_neon64_add_to:
preserve_caller_vectors()
lsr x2, x2, #6
.p2align 4
Lneon64_add_to:
ld1.4s {v0, v1, v2, v3},[x1],#64
ld1.4s {v4, v5, v6, v7},[x1],#64
ld1.4s {v8, v9, v10, v11},[x1],#64
ld1.4s {v12, v13, v14, v15},[x1],#64
mov x3,x0
ld1.4s {v16, v17, v18, v19},[x0],#64
ld1.4s {v20, v21, v22, v23},[x0],#64
ld1.4s {v24, v25, v26, v27},[x0],#64
ld1.4s {v28, v29, v30, v31},[x0],#64
fadd.4s v16, v16, v0
fadd.4s v17, v17, v1
fadd.4s v18, v18, v2
fadd.4s v19, v19, v3
fadd.4s v20, v20, v4
fadd.4s v21, v21, v5
fadd.4s v22, v22, v6
fadd.4s v23, v23, v7
fadd.4s v24, v24, v8
fadd.4s v25, v25, v9
fadd.4s v26, v26, v10
fadd.4s v27, v27, v11
fadd.4s v28, v28, v12
fadd.4s v29, v29, v13
fadd.4s v30, v30, v14
fadd.4s v31, v31, v15
st1.4s {v16, v17, v18, v19},[x3],#64
st1.4s {v20, v21, v22, v23},[x3],#64
st1.4s {v24, v25, v26, v27},[x3],#64
st1.4s {v28, v29, v30, v31},[x3],#64
subs x2, x2, #1
cbnz x2, Lneon64_add_to
restore_caller_vectors()
ret lr
Once everything is computed and stored, we restore the caller’s FP registers using the macro we wrote earlier. Then we return by simply issuing “ret lr” (lr is an alias for x30, the link register holding the return address; a bare “ret” defaults to it).
That’s it, now you can test it against the C++ routine and benchmark the performance boost.
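If you want a starting point, here is a minimal (and deliberately naive) harness for that comparison. The buffer size, iteration count and timing method are arbitrary choices, and neon64_add_to is again assumed to be the C-visible name of the _neon64_add_to label exported by the assembly file:
#include <chrono>
#include <cstdio>
#include <vector>

// Assembly routine from the .S file above.
extern "C" void neon64_add_to(float *pDst, const float *pSrc, long size);

// Reference C++ routine from the top of the article.
static void add_to(float *pDst, const float *pSrc, long size) noexcept {
    for (long i = 0; i < size; i++) {
        *pDst++ += *pSrc++;
    }
}

int main() {
    const long size = 1 << 20; // 1M floats, a multiple of 64 as required
    std::vector<float> dst(size, 1.0f);
    std::vector<float> src(size, 2.0f);

    using add_fn = void (*)(float *, const float *, long);
    auto bench = [&](add_fn fn, const char *name) {
        const auto t0 = std::chrono::steady_clock::now();
        for (int run = 0; run < 100; run++) {
            fn(dst.data(), src.data(), size);   // Repeatedly accumulate src into dst
        }
        const auto t1 = std::chrono::steady_clock::now();
        const long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%-4s: %lld us\n", name, us);
    };

    bench(add_to, "C++");
    bench(neon64_add_to, "NEON");
    return 0;
}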