Getting Started with SIMD

Atharva Dubey
Jul 31, 2021

In the previous post, I introduced SIMD and dived into the generated assembly, showing its benefits in terms of the number of instructions emitted. But that does not really show how fast SIMD actually is. In this post, I will implement vector addition using SIMD intrinsics and compare its runtime against that of a non-vectorized loop. Please note that the SIMD functions used here are for the x86 architecture (AVX2, to be precise).

SIMD Nomenclature

SIMD functions are not named in a particularly user-friendly manner, so before writing any SIMD code, let's get a little familiar with how they are named.

Every SIMD function is named as follows — _mm_<operation>_<suffix>, where <operation> is the operation being performed, such as add, sub, or mul. The suffix denotes the data type being used: s (or no suffix) is for single-precision float, d is for double, and an i in the suffix implies a signed integer representation. The whole list can be found at — https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/naming-and-usage-syntax.html.
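
For example, here is how a few AVX2 intrinsic names decode:

_mm256_add_ps    // add packed single-precision floats (8 floats in a 256-bit register)
_mm256_add_pd    // add packed double-precision floats (4 doubles)
_mm256_add_epi32 // add packed signed 32-bit integers (8 ints)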

First, let's see how to load data into the SIMD registers. Start by creating an array —

#include <immintrin.h> // AVX2 intrinsics and the __m256 type
#include <cstdlib>     // aligned_alloc

size_t size = 1 << 14;
auto matA = (float*)aligned_alloc(32, size * size * sizeof(float));
__m256 a_simd = _mm256_load_ps(matA); // load the first 8 floats into a 256-bit register

Right off the bat, one must have noticed the use of aligned_alloc. So what is this aligned memory? Consider a struct defined as —

typedef struct ExampleStruct {
    char b;
    int a;
    short c;
} ExampleStruct;

Modern processors read memory at the granularity and alignment of their word size. Memory also has multiple levels of hierarchy (caches), and data must be pulled through all of them. Whenever an unaligned access is performed (if it is even possible), the processor must read every word that the requested memory straddles, essentially doubling the number of memory reads/writes. Counterintuitively, this is why reading a single misaligned value can be slower than reading an entire aligned word.

Coming back to ExampleStruct, assume it starts at address 0x0000. One might expect b to sit at 0x0000, a to start at 0x0001, and c at 0x0005. Such a packed layout might be used when data is compressed for transmission efficiency, but it is not efficient for the processor. If you request the 32 bits starting at 0x0001, the processor has to read the word at 0x0000, shift it by one byte, then read the word at 0x0004 and combine the two. This cannot be done in one cycle.

If the memory were aligned, the char would still start at 0x0000 but would be padded, the int would start at 0x0004, and the short at 0x0008. Now each variable can be read in one cycle, reducing the number of memory transactions. One might be surprised how much small details like these contribute to performance.
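
You can verify this padding yourself. A minimal sketch (the exact offsets assume a typical x86-64 compiler and ABI):

#include <cstdio>
#include <cstddef> // offsetof

typedef struct ExampleStruct {
    char b;  // offset 0, followed by 3 bytes of padding
    int a;   // offset 4
    short c; // offset 8, followed by 2 bytes of tail padding
} ExampleStruct;

int main() {
    printf("sizeof(ExampleStruct) = %zu\n", sizeof(ExampleStruct)); // 12, not 7
    printf("offsetof a = %zu\n", offsetof(ExampleStruct, a));       // 4
    printf("offsetof c = %zu\n", offsetof(ExampleStruct, c));       // 8
    return 0;
}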

The CPU can also operate on aligned memory atomically. It can read our aligned memory in one cycle and place it in the vector registers, so it is advisable to use aligned memory when working with SIMD intrinsics. This does not mean, however, that memory must be aligned to load it into vector registers — such a case might arise when you are dealing with memory that you have not allocated yourself. In that case, one must use _mm256_loadu_ps, the u indicating that the memory may be unaligned. The rule of thumb: use aligned memory as much as possible.
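
To make the distinction concrete, a small sketch (the buffer names are illustrative):

#include <immintrin.h>
#include <cstdlib>

void demo() {
    float plain_buf[8] = {0}; // no 32-byte alignment guarantee
    float* aligned_buf = (float*)aligned_alloc(32, 8 * sizeof(float));

    __m256 v1 = _mm256_loadu_ps(plain_buf);  // safe for any address, possibly slower
    __m256 v2 = _mm256_load_ps(aligned_buf); // requires 32-byte alignment; may fault otherwise
    (void)v1; (void)v2;
    free(aligned_buf);
}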

So far, we have allocated memory and loaded it into our vector registers. All that is left is to call the addition function, which one may remember from the previous post as _mm256_add_ps. The overall code will look as follows —
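
A minimal sketch of the full benchmark (assumptions: the element count n is a multiple of 8, timing uses std::chrono, and the scalar loop is compiled without auto-vectorization, e.g. -fno-tree-vectorize on GCC, for a fair comparison):

#include <immintrin.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    size_t n = 1 << 24; // number of floats, a multiple of 8
    auto a = (float*)aligned_alloc(32, n * sizeof(float));
    auto b = (float*)aligned_alloc(32, n * sizeof(float));
    auto c = (float*)aligned_alloc(32, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Non-vectorized addition
    auto t0 = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    // SIMD addition, 8 floats per iteration
    auto t2 = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(c + i, _mm256_add_ps(va, vb));
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    auto us = [](auto d) { return std::chrono::duration_cast<std::chrono::microseconds>(d).count(); };
    printf("scalar: %lld us\n", (long long)us(t1 - t0));
    printf("simd:   %lld us\n", (long long)us(t3 - t2));

    free(a); free(b); free(c);
    return 0;
}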

I have covered an extremely basic scenario, where almost everything is in order. When we move to more complicated operations like reductions and dot products, it will be necessary to use multiple accumulator registers, since otherwise the loop-carried dependency on a single accumulator becomes the bottleneck. I will address these issues as and when required. One thing to remember: a poorly structured SIMD implementation can end up slower than a naive implementation of the same code.
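
As a small preview, a hedged sketch of a sum reduction with four independent accumulators (data is assumed to be a 32-byte-aligned float array and n a multiple of 32):

#include <immintrin.h>
#include <cstddef>

// Accumulate n floats into one vector of partial sums
__m256 partial_sum(const float* data, size_t n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 32) {
        // Four independent adds per iteration can overlap in the pipeline,
        // instead of serializing on a single accumulator
        acc0 = _mm256_add_ps(acc0, _mm256_load_ps(data + i));
        acc1 = _mm256_add_ps(acc1, _mm256_load_ps(data + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_load_ps(data + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_load_ps(data + i + 24));
    }
    // a horizontal sum over the 8 lanes of the result gives the final scalar
    return _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
}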

Thus, in this post, I have covered how to get started with SIMD: its nomenclature, loading memory into vector registers, and some basic operations.

The next couple of posts will talk about the GPU execution model and how to write some simple kernels in CUDA and SYCL.

