Making OpenMP our Acquaintance

Atharva Dubey
May 28, 2021

In this post, I will go over how to install OpenMP, how to configure an OpenMP project, and how to use some of its basic pragmas to parallelize loops and make use of the SIMD registers.

Installing OpenMP

Installing OpenMP on Linux systems is simple: it can be installed via the distribution's package manager. On Ubuntu, run the following —

sudo apt install libomp-dev

For other distributions, replace apt with the appropriate package manager. Once OpenMP is installed, we have to configure it, which is just a fancy way of saying we set the number of threads OpenMP will use. This is done by setting an environment variable named OMP_NUM_THREADS. For example, run the following command to set the number of threads to 12 —

export OMP_NUM_THREADS=12
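A quick sanity check (a minimal sketch; I am assuming a file named check.cpp compiled with g++ -fopenmp check.cpp -o check) is to print what OpenMP reports back:

#include <cstdio>
#include <omp.h>

int main() {
    // Reports the number of threads OpenMP will use, which should
    // reflect OMP_NUM_THREADS if it is set in the environment.
    std::printf("Max threads available to OpenMP: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        // Each thread prints its own id; with OMP_NUM_THREADS=12 you
        // should see 12 distinct ids, in some interleaved order.
        std::printf("Hello from thread %d\n", omp_get_thread_num());
    }
    return 0;
}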

Setting up CMakeLists.txt

Our setup for using OpenMP is ready, but we still have to tell the compiler that we are using OpenMP, which is done by passing compiler flags. For GCC/G++ and Clang/Clang++, pass -fopenmp to enable parallel regions along with the SIMD directives, or -fopenmp-simd to enable only the SIMD directives. A proper CMake solution would be as follows —
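A minimal sketch of such a CMakeLists.txt (assuming the source file is main.cpp and the executable is named First, matching the build steps later in this post) could look like this:

# Minimal CMakeLists.txt sketch for an OpenMP-enabled executable.
cmake_minimum_required(VERSION 3.9)
project(First CXX)

# Finds the OpenMP runtime and attaches the right compiler and linker
# flags (e.g. -fopenmp for GCC/Clang) to the imported target below.
find_package(OpenMP REQUIRED)

add_executable(First main.cpp)
target_link_libraries(First PRIVATE OpenMP::OpenMP_CXX)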

Parallelism and SIMD

Well, now let’s see OpenMP in action and the benefits of parallelization. We will take a simple task: changing the value of every element in a matrix. This is an embarrassingly parallel task, as the threads can divide the data among themselves and get to work, and no communication between threads is required. If the work takes N iterations on a single thread, the work per thread is reduced by a factor equal to the number of threads. In OpenMP, this is enabled via #pragma omp parallel for. It is also a great example of Single Instruction Multiple Data (SIMD): as the name suggests, the task executes the same instruction on all of the data.

But what is the difference between parallelism and using SIMD registers? Parallelism, as explained above, means splitting your work across different threads; each thread gets its own chunk of data on which the instructions are executed. However, another level of parallelism can be achieved within each thread, because every core has dedicated vector registers and vector ALUs. Suppose the width of such a register is 256 bits and we are using 32-bit floats as our datatype. Instead of loading and operating on one floating-point value at a time, why not operate on 256/32 = 8 floating-point values at once? That is exactly what SIMD is, and in the ideal case it makes this loop close to 8x faster. This should serve as the most basic and intuitive understanding of SIMD. In OpenMP, it is enabled with the pragma #pragma omp simd. In the next post, I will also touch on SIMD intrinsics, which are essentially assembly-level operations that can be written directly in your C/C++ code: they look like function calls but each compiles down to a single SIMD instruction. There I will solve this exact problem using SIMD intrinsics, to make SIMD operations clearer.
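As a quick illustration (a minimal sketch with a hypothetical fill() helper, not the benchmark code below), the directive simply goes on top of the loop you want vectorized:

#include <cstddef>

// Ask the compiler to emit SIMD instructions for this loop, so each
// vector operation updates several adjacent floats at once.
void fill(float* data, std::size_t n, float value) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        data[i] = value;
}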

To get a better understanding, let us compare the time taken by a serial loop against the same task accelerated with SIMD, and with SIMD and thread parallelism at the same time. I will initialize a square matrix of side 2¹⁵, which translates to 2³⁰ = 1073741824 32-bit floating-point elements. I will treat the matrix as a flattened array, thus using only one loop; the task is to update the value of each and every element to 69.0. Always remember to treat your array as a flattened one as much as possible: unnecessary nested loops kill the efficiency of the code. The code is as follows —
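The listing below is a sketch of such a main.cpp (I am assuming omp_get_wtime for timing, and timing each version once rather than averaging over several runs; a SIMD-only variant would just use #pragma omp simd on the same loop):

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    const std::size_t side = 1ull << 15;   // 2^15 elements per side
    const std::size_t n = side * side;     // 2^30 floats (~4 GiB)
    float* data = static_cast<float*>(std::malloc(n * sizeof(float)));
    if (data == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // Plain serial loop over the flattened matrix.
    double start = omp_get_wtime();
    for (std::size_t i = 0; i < n; ++i)
        data[i] = 69.0f;
    double serialTime = omp_get_wtime() - start;

    // Threads split the iteration range among themselves, and each
    // thread's chunk is additionally vectorized with SIMD instructions.
    start = omp_get_wtime();
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < n; ++i)
        data[i] = 69.0f;
    double parallelTime = omp_get_wtime() - start;

    std::printf("Single-thread time, in seconds - %f\n", serialTime);
    std::printf("Parallel + SIMD time, in seconds - %f\n", parallelTime);

    std::free(data);
    return 0;
}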

To compile the code, make sure the above code (named main.cpp) and the CMake file (named CMakeLists.txt) are in the same folder and run the following

cmake .
make
./First

The output of the code is as follows —

Average Run Time on a single thread is, in seconds - 1.550000
Average Run time for a parallelized operation is, in seconds - 0.350000

The speedup using multiple threads is phenomenal, approximately 4.4x! Note, however, that parallelism does not always give the desired speedup and can even be slower than a scalar loop; it should only be used when the array is really big (I know that is somewhat subjective). In our example, if the square matrix had only 256 elements in total, 16 along each dimension, the parallel for loop would have been much, much slower than a simple for loop. This is because threading has an overhead: OpenMP has to spawn the threads, distribute the data among them, and schedule the work. By the time that is done, our naive for loop would already be finished with the computation.

OpenMP is much more than #pragma omp parallel for and #pragma omp simd. Just by using pragma directives, one can control the number of threads, decide whether a variable is shared between threads, collapse nested loops, and more. In fact, since OpenMP 4.0, one can even offload code to a specific target device. It is truly incredible that all of this can be achieved just with pragma directives. I will introduce more of OpenMP's functionality as this series progresses.
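As a small taste of those clauses (a hypothetical example, not code used in this post), a single directive can combine several of them:

// Hypothetical example combining clauses: run on 8 threads, collapse the
// two nested loops into a single iteration space, share the matrix data,
// and give every thread its own copy of `factor`.
void scale(float* m, int rows, int cols, float factor) {
    #pragma omp parallel for collapse(2) num_threads(8) shared(m) firstprivate(factor)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            m[i * cols + j] *= factor;
}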

Next in the series, I will introduce you to a rather tough SIMD programming technique called SIMD intrinsics. It is worth learning because it exposes how SIMD actually works.
