NEON HelloWorld
From ArmadeusWiki
Developing with SIMD instructions or intrinsics need a bit of experience. That is a complete different way of thinking and programming.
Contents |
Prerequisites
- Install a emulation environment
- Build a GCC toolchain which support NEON intrinsics
Let's go programming
Here is a brief example of what is possible with SIMD programming. This piece of code only add the value "3" to each value of the SIMD vector.
On the Cortex-A platform there is both 64 bits and 128 bits vector registers. Here we use 128 bits ones then we can code sixteen 8 bits values (unsigned integers).
![]() | Warning: Not all ARM Cortex processors have a NEON unit and old ARM processors may have a SIMD unit not compliant with NEON (cf. ARM reference manuals). |
Source code
#include <stdio.h> #include "arm_neon.h" void add3 (uint8x16_t *data) { /* Set each sixteen values of the vector to 3. * * Remark: a 'q' suffix to intrinsics indicates * the instruction run for 128 bits registers. */ uint8x16_t three = vmovq_n_u8 (3); /* Add 3 to the value given in argument. */ *data = vaddq_u8 (*data, three); } void print_uint8 (uint8x16_t data, char* name) { int i; static uint8_t p[16]; vst1q_u8 (p, data); printf ("%s = ", name); for (i = 0; i < 16; i++) { printf ("%02d ", p[i]); } printf ("\n"); } int main () { /* Create custom arbitrary data. */ const uint8_t uint8_data[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }; /* Create the vector with our data. */ uint8x16_t data; /* Load our custom data into the vector register. */ data = vld1q_u8 (uint8_data); print_uint8 (data, "data"); /* Call of the add3 function. */ add3(&data); print_uint8 (data, "data (new)"); return 0; }
What is done ?
- We first create custom data: an array of 16 uint8_t values.
- We load the custom values in a vector register through the vld1q_u8 intrinsic (note the 'q' suffix which means we use 128 registers)
- We call our add3 function
And in the add3 function ?
- Because we use SIMD vectors and want to add 3 to each value we need to set the same value (3) with the vmovq_n_u8 intrinsic
- We call the vaddq_u8 intrinsic to do the addition
Result
$ ./a.out data = 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 data (new) = 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19
How to compile ?
The gcc options to give in order to compile NEON intrinsics in case of an ARM Cortex-A8 is :
$ gcc -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon <sources>