SPO600 - Project - Stage One

In our final project, the project will split into 3 stages. This is the first stage of my SPO600 course project. In this stage, we are given a task to find an open source software package that includes a CPU-intensive function or method that compiles to machine code. After I chose the open source software package, I will have to benchmark the performance of the software function on an AArach64 system. When the benchmark job is completed, I will have to think about my strategy that attempts to optimize the hash function for better performance on an AArch64 system and identify it, because those strategies will be used in the second stage of the project.

With so many software, I would say picking software is the hardest job in the project, which is the major reason it took me so long to get this post going. But after a lot of research, I picked a software called SSDUP, it is a traffic-aware SSD burst buffer for HPC systems. You can find the source code over here: https://github.com/CGCL-codes/SSDUP

After I looked through the entire software, I noticed that the software use 'gcc -std=gnu89' to compile, which I can try to use different build option on it. I also found the Murmur3 hash function in the file called murmur3.c. Basically, this file is specific to the hash function implementation. In fact, the file contains three Murmur3 hash functions:
-----------------------------------------------------------------
void MurmurHash3_x86_32 (const void *key, int len, uint32_t seed, void *out);
void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out);
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);
-----------------------------------------------------------------
The first function is for 32-bit machines and it produces a 32-bit output. The second function is also for 32-bit machines but it produces a 128-bit output. The third function is for 64-bit machines and it produces a 128-bit output.

Not like the previous lab I did, this hash function accepts input from arguments and be called without Murmur3.c, that means I cannot simply compile and run the hash functions, as well as set a timer for each hash function to count up the time. In this case, in order to benchmark the performance of these hash function, I will have to create and run a script that will call these hash functions with passing the necessary arguments and set a timer to count the time that how long for each hash function execute.

I did some research, I found a file called example.c in Murmur3 repository in Github https://github.com/PeterScott/murmur3,
-----------------------------------------------------------------
/* Murmur3 example: hash the first command line argument. */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "murmur3.h"

int main(int argc, char **argv) {
uint32_t hash[4]; /* Output for the hash */
uint32_t seed = 42; /* Seed value for hash */

if (argc != 2) {
printf("usage: %s \"string to hash\"\n", argv[0]);
exit(1);
}

printf("Input: \"%s\"\n", argv[1]);

MurmurHash3_x86_32(argv[1], strlen(argv[1]), seed, hash);
printf("x86_32: %08x\n", hash[0]);

MurmurHash3_x86_128(argv[1], strlen(argv[1]), seed, hash);
printf("x86_128: %08x %08x %08x %08x\n",
hash[0], hash[1], hash[2], hash[3]);

MurmurHash3_x64_128(argv[1], strlen(argv[1]), seed, hash);
printf("x64_128: %08x %08x %08x %08x\n",
hash[0], hash[1], hash[2], hash[3]);

return 0;
}
-----------------------------------------------------------------
Which is a sample program, it accepts a string from users as an argument and pass it to those three hash functions, and convert it into a hash value and display them on the screen. Each hash function will generate a different hash value.

It is great to found that sample program, it will make me easier to benchmark the performance of them. In order to do it, I am going to create a file called benchmark.c, copy all the code from example.c and Murmur3.c into it. In order to have each hash function to evaluate the performance accurately, I added a loop to each function to execute many times. I also added a timer for each of hash function.
-----------------------------------------------------------------
#include "murmur3.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#ifdef __GNUC__
#define FORCE_INLINE __attribute__((always_inline)) inline
#else
#define FORCE_INLINE
#endif
#define SIZE 100000

static FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}

static FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
{
return (x << r) | (x >> (64 - r));
}

#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)

#define BIG_CONSTANT(x) (x##LLU)

//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here

#define getblock(p, i) (p[i])

//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche

static FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;

return h;
}

//----------

static FORCE_INLINE uint64_t fmix64 ( uint64_t k )
{
k ^= k >> 33;
k *= BIG_CONSTANT(0xff51afd7ed558ccd);
k ^= k >> 33;
k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
k ^= k >> 33;

return k;
}

//-----------------------------------------------------------------------------
int main() {
uint32_t hash[4]; /* Output for the hash */
uint32_t seed = 42; /* Seed value for hash */

clock_t startTime1, startTime2, startTime3;
clock_t endTime1, endTime2, endTime3;
double totalTime1, totalTime2, totalTime3;

char *samples = (char*)calloc(SIZE, sizeof(char));

srand(1);

//generate random characters

for (int i = 0; i < SIZE; i++) {

samples[i] = (random() % 26) + 'A';

}

/*if (argc != 2) {
printf("usage: %s \"string to hash\"\n", argv[0]);
exit(1);
}

printf("Input: \"%s\"\n", argv[1]);*/

startTime1 = clock(); // Start time
//MurmurHash3_x86_32(argv[1], strlen(argv[1]), seed, hash);
for (int i = 0; i < SIZE; i++) {

MurmurHash3_x86_32(samples, strlen(samples), seed, hash);

}
endTime1 = clock(); // End time
totalTime1 = (double)(endTime1 - startTime1) / CLOCKS_PER_SEC;

printf("x86_32: %08x\n", hash[0]);
printf("Total time: %lf seconds\n", totalTime1);

startTime2 = clock(); // Start time
//MurmurHash3_x86_128(argv[1], strlen(argv[1]), seed, hash);
for (int i = 0; i < SIZE; i++) {

MurmurHash3_x86_128(samples, strlen(samples), seed, hash);

}
endTime2 = clock(); // End time
totalTime2 = (double)(endTime2 - startTime2) / CLOCKS_PER_SEC;

printf("x86_128: %08x %08x %08x %08x\n",
hash[0], hash[1], hash[2], hash[3]);
printf("Total time: %lf seconds\n", totalTime2);

startTime3 = clock(); // Start time
//MurmurHash3_x64_128(argv[1], strlen(argv[1]), seed, hash);
for (int i = 0; i < SIZE; i++) {

MurmurHash3_x64_128(samples, strlen(samples), seed, hash);

}
endTime3 = clock(); // End time
totalTime3 = (double)(endTime3 - startTime3) / CLOCKS_PER_SEC;

printf("x64_128: %08x %08x %08x %08x\n",
hash[0], hash[1], hash[2], hash[3]);
printf("Total time: %lf seconds\n", totalTime3);

return 0;
}

void MurmurHash3_x86_32 ( const void * key, int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 4;
int i;

uint32_t h1 = seed;

uint32_t c1 = 0xcc9e2d51;
uint32_t c2 = 0x1b873593;

//----------
// body

const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);

for(i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i);

k1 *= c1;
k1 = ROTL32(k1,15);
k1 *= c2;

h1 ^= k1;
h1 = ROTL32(h1,13);
h1 = h1*5+0xe6546b64;
}

//----------
// tail

const uint8_t * tail = (const uint8_t*)(data + nblocks*4);

uint32_t k1 = 0;

switch(len & 3)
{
case 3: k1 ^= tail[2] << 16;
case 2: k1 ^= tail[1] << 8;
case 1: k1 ^= tail[0];
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};

//----------
// finalization

h1 ^= len;

h1 = fmix32(h1);

*(uint32_t*)out = h1;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x86_128 ( const void * key, const int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
int i;

uint32_t h1 = seed;
uint32_t h2 = seed;
uint32_t h3 = seed;
uint32_t h4 = seed;

uint32_t c1 = 0x239b961b;
uint32_t c2 = 0xab0e9789;
uint32_t c3 = 0x38b34ae5;
uint32_t c4 = 0xa1e38b93;

//----------
// body

const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

for(i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i*4+0);
uint32_t k2 = getblock(blocks,i*4+1);
uint32_t k3 = getblock(blocks,i*4+2);
uint32_t k4 = getblock(blocks,i*4+3);

k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;

h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;

k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;

k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;

k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
}

//----------
// tail

const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

uint32_t k1 = 0;
uint32_t k2 = 0;
uint32_t k3 = 0;
uint32_t k4 = 0;

switch(len & 15)
{
case 15: k4 ^= tail[14] << 16;
case 14: k4 ^= tail[13] << 8;
case 13: k4 ^= tail[12] << 0;
k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

case 12: k3 ^= tail[11] << 24;
case 11: k3 ^= tail[10] << 16;
case 10: k3 ^= tail[ 9] << 8;
case 9: k3 ^= tail[ 8] << 0;
k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

case 8: k2 ^= tail[ 7] << 24;
case 7: k2 ^= tail[ 6] << 16;
case 6: k2 ^= tail[ 5] << 8;
case 5: k2 ^= tail[ 4] << 0;
k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

case 4: k1 ^= tail[ 3] << 24;
case 3: k1 ^= tail[ 2] << 16;
case 2: k1 ^= tail[ 1] << 8;
case 1: k1 ^= tail[ 0] << 0;
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};

//----------
// finalization

h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;

h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;

h1 = fmix32(h1);
h2 = fmix32(h2);
h3 = fmix32(h3);
h4 = fmix32(h4);

h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;

((uint32_t*)out)[0] = h1;
((uint32_t*)out)[1] = h2;
((uint32_t*)out)[2] = h3;
((uint32_t*)out)[3] = h4;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x64_128 ( const void * key, const int len,
const uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
int i;

uint64_t h1 = seed;
uint64_t h2 = seed;

uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);

//----------
// body

const uint64_t * blocks = (const uint64_t *)(data);

for(i = 0; i < nblocks; i++)
{
uint64_t k1 = getblock(blocks,i*2+0);
uint64_t k2 = getblock(blocks,i*2+1);

k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;

h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;

k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
}

//----------
// tail

const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

uint64_t k1 = 0;
uint64_t k2 = 0;

switch(len & 15)
{
case 15: k2 ^= (uint64_t)(tail[14]) << 48;
case 14: k2 ^= (uint64_t)(tail[13]) << 40;
case 13: k2 ^= (uint64_t)(tail[12]) << 32;
case 12: k2 ^= (uint64_t)(tail[11]) << 24;
case 11: k2 ^= (uint64_t)(tail[10]) << 16;
case 10: k2 ^= (uint64_t)(tail[ 9]) << 8;
case 9: k2 ^= (uint64_t)(tail[ 8]) << 0;
k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

case 8: k1 ^= (uint64_t)(tail[ 7]) << 56;
case 7: k1 ^= (uint64_t)(tail[ 6]) << 48;
case 6: k1 ^= (uint64_t)(tail[ 5]) << 40;
case 5: k1 ^= (uint64_t)(tail[ 4]) << 32;
case 4: k1 ^= (uint64_t)(tail[ 3]) << 24;
case 3: k1 ^= (uint64_t)(tail[ 2]) << 16;
case 2: k1 ^= (uint64_t)(tail[ 1]) << 8;
case 1: k1 ^= (uint64_t)(tail[ 0]) << 0;
k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
};

//----------
// finalization

h1 ^= len; h2 ^= len;

h1 += h2;
h2 += h1;

h1 = fmix64(h1);
h2 = fmix64(h2);

h1 += h2;
h2 += h1;

((uint64_t*)out)[0] = h1;
((uint64_t*)out)[1] = h2;
}

-----------------------------------------------------------------

Here is the output:
-----------------------------------------------------------------
[qichang@aarchie hash]$ g++ -o benchmark benchmark.c
[qichang@aarchie hash]$ ./benchmark
x86_32: 48f2284b
Total time: 14.185739 seconds
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.185568 seconds
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.157431 seconds
-----------------------------------------------------------------
The first function spent: 14.185739 seconds
The second function spent: 13.185568 seconds
The third function spent: 8.157431 seconds

After benchmarking the performance of the hash functions, I am going to identify the optimization strategies that I will use in the second stage. On the top of my head, I will use altered build options to compile that hash functions, this is the easiest way in my experience. I might try to use In-line assembler to optimize the hash function, cannot guarantee but I will try.

Qichang's Blog - SPO600

Search This Blog

SPO600 - Project - Stage One

Comments

Post a Comment

Popular posts from this blog

SPO600 - Project - Stage Three

Lab 3