SPO600 - Project - Stage Three

In this last stage of my SPO600 project, since I don't have results suitable for upstreaming, I am going to wrap up my project results and do a thorough technical analysis of them.

First of all, I am going to summarize what I did for my project. (If you want to go over the details, you can see my previous posts.)
I picked a piece of software called SSDUP, a traffic-aware SSD burst buffer for HPC systems. I noticed that it uses 3 different MurmurHash3 hash functions: the first two are optimized for x86 platforms and the third is optimized for x64 platforms. I also noticed that it compiles with 'gcc -std=gnu99'. To make these 3 hash functions easier to handle, I split them into 3 files and tested them separately on an AArch64 and an x86_64 system.

Since the professor said my results in stage two were hard to read, I am going to show my results again in table format.

First hash function (MurmurHash3_x86_32): with -O3, execution is roughly 9x as fast as without any optimization option (times in seconds):

                               without -O3    with -O3
No code changes                14.117         1.572
Code changes: i+i and len      14.035         N/A

Second hash function (MurmurHash3_x86_128): with -O3, execution is roughly 10x as fast as without any optimization option (times in seconds):

                               without -O3    with -O3
No code changes                13.332         1.338
Code changes: i+i and len      13.543         N/A

Third hash function (MurmurHash3_x64_128): with -O3, execution is roughly 6x as fast as without any optimization option, and the code changes alone make it about 0.5% faster (times in seconds):

                               without -O3    with -O3
No code changes                8.179          1.315
Code changes: i+i and len      8.137          N/A

All of the tests were first completed on an AArch64 system. My first step to optimize the hash functions was to compile my benchmark program with the -O3 compilation option. The first two hash functions, which are optimized for x86 platforms, showed a large improvement in performance (roughly 9-10x). The third hash function, which is optimized for x64 platforms, also improved with -O3, but by a smaller factor (roughly 6x). My second optimization step was to change some code in the third function, which made it about 0.5% faster than the unchanged version.

Afterward, I ran the benchmark program on an x86_64 system, and the results show that compiling with -O3 produces a significant performance improvement there as well. However, the improvement for the third function on the AArch64 system was not as large as on the x86_64 platform. In conclusion, compiling with the -O3 option gives the best performance for all of these functions and is the most effective optimization I found.
