Skip to main content

SPO600 - Project - Stage One

In our final project, the project will split into 3 stages. This is the first stage of my SPO600 course project. In this stage, we are given a task to find an open source software package that includes a CPU-intensive function or method that compiles to machine code. After I chose the open source software package, I will have to benchmark the performance of the software function on an AArach64 system. When the benchmark job is completed, I will have to think about my strategy that attempts to optimize the hash function for better performance on an AArch64 system and identify it, because those strategies will be used in the second stage of the project.

With so many software, I would say picking software is the hardest job in the project, which is the major reason it took me so long to get this post going. But after a lot of research, I picked a software called SSDUP, it is a traffic-aware SSD burst buffer for HPC systems. You can find the source code over here: https://github.com/CGCL-codes/SSDUP

After I looked through the entire software, I noticed that the software use 'gcc -std=gnu89' to compile, which I can try to use different build option on it. I also found the Murmur3 hash function in the file called murmur3.c. Basically, this file is specific to the hash function implementation. In fact, the file contains three Murmur3 hash functions:
-----------------------------------------------------------------
void MurmurHash3_x86_32 (const void *key, int len, uint32_t seed, void *out);
void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out);
void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out);
-----------------------------------------------------------------
The first function is for 32-bit machines and it produces a 32-bit output. The second function is also for 32-bit machines but it produces a 128-bit output. The third function is for 64-bit machines and it produces a 128-bit output.

Not like the previous lab I did, this hash function accepts input from arguments and be called without Murmur3.c, that means I cannot simply compile and run the hash functions, as well as set a timer for each hash function to count up the time. In this case, in order to benchmark the performance of these hash function, I will have to create and run a script that will call these hash functions with passing the necessary arguments and set a timer to count the time that how long for each hash function execute.

I did some research, I found a file called example.c in Murmur3 repository in Github https://github.com/PeterScott/murmur3,
-----------------------------------------------------------------
/* Murmur3 example: hash the first command line argument. */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "murmur3.h"

int main(int argc, char **argv) {
  uint32_t hash[4];                /* Output for the hash */
  uint32_t seed = 42;              /* Seed value for hash */

  if (argc != 2) {
    printf("usage: %s \"string to hash\"\n", argv[0]);
    exit(1);
  }

  printf("Input: \"%s\"\n", argv[1]);
 
  MurmurHash3_x86_32(argv[1], strlen(argv[1]), seed, hash);
  printf("x86_32:  %08x\n", hash[0]);

  MurmurHash3_x86_128(argv[1], strlen(argv[1]), seed, hash);
  printf("x86_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);

  MurmurHash3_x64_128(argv[1], strlen(argv[1]), seed, hash);
  printf("x64_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);

  return 0;
}
-----------------------------------------------------------------
Which is a sample program, it accepts a string from users as an argument and pass it to those three hash functions, and convert it into a hash value and display them on the screen. Each hash function will generate a different hash value.

It is great to found that sample program, it will make me easier to benchmark the performance of them. In order to do it, I am going to create a file called benchmark.c, copy all the code from example.c and Murmur3.c into it. In order to have each hash function to evaluate the performance accurately, I added a loop to each function to execute many times. I also added a timer for each of hash function.
-----------------------------------------------------------------
#include "murmur3.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#ifdef __GNUC__
#define FORCE_INLINE __attribute__((always_inline)) inline
#else
#define FORCE_INLINE
#endif
#define SIZE 100000

static FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
{
  return (x << r) | (x >> (32 - r));
}

static FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
{
  return (x << r) | (x >> (64 - r));
}

#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)

#define BIG_CONSTANT(x) (x##LLU)

//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here

#define getblock(p, i) (p[i])

//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche


static FORCE_INLINE uint32_t fmix32 ( uint32_t h )
{
  h ^= h >> 16;
  h *= 0x85ebca6b;
  h ^= h >> 13;
  h *= 0xc2b2ae35;
  h ^= h >> 16;

  return h;
}

//----------

static FORCE_INLINE uint64_t fmix64 ( uint64_t k )
{
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xff51afd7ed558ccd);
  k ^= k >> 33;
  k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
  k ^= k >> 33;

  return k;
}

//-----------------------------------------------------------------------------
int main() {
  uint32_t hash[4];                /* Output for the hash */
  uint32_t seed = 42;              /* Seed value for hash */

  clock_t startTime1, startTime2, startTime3;
  clock_t endTime1, endTime2, endTime3;
  double totalTime1, totalTime2, totalTime3;

  char *samples = (char*)calloc(SIZE, sizeof(char));

  srand(1);

  //generate random characters

  for (int i = 0; i < SIZE; i++) {

            samples[i] = (random() % 26) + 'A';

  }

  /*if (argc != 2) {
    printf("usage: %s \"string to hash\"\n", argv[0]);
    exit(1);
  }

  printf("Input: \"%s\"\n", argv[1]);*/

  startTime1 = clock();  // Start time
  //MurmurHash3_x86_32(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x86_32(samples, strlen(samples), seed, hash);

  }
  endTime1 = clock();  // End time
  totalTime1 = (double)(endTime1 - startTime1) / CLOCKS_PER_SEC;

  printf("x86_32:  %08x\n", hash[0]);
  printf("Total time: %lf seconds\n", totalTime1);

  startTime2 = clock(); // Start time
  //MurmurHash3_x86_128(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x86_128(samples, strlen(samples), seed, hash);

  }
  endTime2 = clock();  // End time
  totalTime2 = (double)(endTime2 - startTime2) / CLOCKS_PER_SEC; 

  printf("x86_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);
  printf("Total time: %lf seconds\n", totalTime2);

  startTime3 = clock(); // Start time
  //MurmurHash3_x64_128(argv[1], strlen(argv[1]), seed, hash);
  for (int i = 0; i < SIZE; i++) {

  MurmurHash3_x64_128(samples, strlen(samples), seed, hash);

  }
  endTime3 = clock();  // End time
  totalTime3 = (double)(endTime3 - startTime3) / CLOCKS_PER_SEC;

  printf("x64_128: %08x %08x %08x %08x\n",
         hash[0], hash[1], hash[2], hash[3]);
  printf("Total time: %lf seconds\n", totalTime3);

  return 0;
}


void MurmurHash3_x86_32 ( const void * key, int len,
                          uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 4;
  int i;

  uint32_t h1 = seed;

  uint32_t c1 = 0xcc9e2d51;
  uint32_t c2 = 0x1b873593;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);

  for(i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i);

    k1 *= c1;
    k1 = ROTL32(k1,15);
    k1 *= c2;
   
    h1 ^= k1;
    h1 = ROTL32(h1,13);
    h1 = h1*5+0xe6546b64;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*4);

  uint32_t k1 = 0;

  switch(len & 3)
  {
  case 3: k1 ^= tail[2] << 16;
  case 2: k1 ^= tail[1] << 8;
  case 1: k1 ^= tail[0];
          k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len;

  h1 = fmix32(h1);

  *(uint32_t*)out = h1;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x86_128 ( const void * key, const int len,
                           uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;
  int i;

  uint32_t h1 = seed;
  uint32_t h2 = seed;
  uint32_t h3 = seed;
  uint32_t h4 = seed;

  uint32_t c1 = 0x239b961b;
  uint32_t c2 = 0xab0e9789;
  uint32_t c3 = 0x38b34ae5;
  uint32_t c4 = 0xa1e38b93;

  //----------
  // body

  const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);

  for(i = -nblocks; i; i++)
  {
    uint32_t k1 = getblock(blocks,i*4+0);
    uint32_t k2 = getblock(blocks,i*4+1);
    uint32_t k3 = getblock(blocks,i*4+2);
    uint32_t k4 = getblock(blocks,i*4+3);

    k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;

    h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;

    k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

    h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;

    k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

    h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;

    k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

    h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint32_t k1 = 0;
  uint32_t k2 = 0;
  uint32_t k3 = 0;
  uint32_t k4 = 0;

  switch(len & 15)
  {
  case 15: k4 ^= tail[14] << 16;
  case 14: k4 ^= tail[13] << 8;
  case 13: k4 ^= tail[12] << 0;
           k4 *= c4; k4  = ROTL32(k4,18); k4 *= c1; h4 ^= k4;

  case 12: k3 ^= tail[11] << 24;
  case 11: k3 ^= tail[10] << 16;
  case 10: k3 ^= tail[ 9] << 8;
  case  9: k3 ^= tail[ 8] << 0;
           k3 *= c3; k3  = ROTL32(k3,17); k3 *= c4; h3 ^= k3;

  case  8: k2 ^= tail[ 7] << 24;
  case  7: k2 ^= tail[ 6] << 16;
  case  6: k2 ^= tail[ 5] << 8;
  case  5: k2 ^= tail[ 4] << 0;
           k2 *= c2; k2  = ROTL32(k2,16); k2 *= c3; h2 ^= k2;

  case  4: k1 ^= tail[ 3] << 24;
  case  3: k1 ^= tail[ 2] << 16;
  case  2: k1 ^= tail[ 1] << 8;
  case  1: k1 ^= tail[ 0] << 0;
           k1 *= c1; k1  = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  h1 = fmix32(h1);
  h2 = fmix32(h2);
  h3 = fmix32(h3);
  h4 = fmix32(h4);

  h1 += h2; h1 += h3; h1 += h4;
  h2 += h1; h3 += h1; h4 += h1;

  ((uint32_t*)out)[0] = h1;
  ((uint32_t*)out)[1] = h2;
  ((uint32_t*)out)[2] = h3;
  ((uint32_t*)out)[3] = h4;
}

//-----------------------------------------------------------------------------

void MurmurHash3_x64_128 ( const void * key, const int len,
                           const uint32_t seed, void * out )
{
  const uint8_t * data = (const uint8_t*)key;
  const int nblocks = len / 16;
  int i;

  uint64_t h1 = seed;
  uint64_t h2 = seed;

  uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
  uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);

  //----------
  // body

  const uint64_t * blocks = (const uint64_t *)(data);

  for(i = 0; i < nblocks; i++)
  {
    uint64_t k1 = getblock(blocks,i*2+0);
    uint64_t k2 = getblock(blocks,i*2+1);

    k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;

    h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;

    k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

    h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
  }

  //----------
  // tail

  const uint8_t * tail = (const uint8_t*)(data + nblocks*16);

  uint64_t k1 = 0;
  uint64_t k2 = 0;

  switch(len & 15)
  {
  case 15: k2 ^= (uint64_t)(tail[14]) << 48;
  case 14: k2 ^= (uint64_t)(tail[13]) << 40;
  case 13: k2 ^= (uint64_t)(tail[12]) << 32;
  case 12: k2 ^= (uint64_t)(tail[11]) << 24;
  case 11: k2 ^= (uint64_t)(tail[10]) << 16;
  case 10: k2 ^= (uint64_t)(tail[ 9]) << 8;
  case  9: k2 ^= (uint64_t)(tail[ 8]) << 0;
           k2 *= c2; k2  = ROTL64(k2,33); k2 *= c1; h2 ^= k2;

  case  8: k1 ^= (uint64_t)(tail[ 7]) << 56;
  case  7: k1 ^= (uint64_t)(tail[ 6]) << 48;
  case  6: k1 ^= (uint64_t)(tail[ 5]) << 40;
  case  5: k1 ^= (uint64_t)(tail[ 4]) << 32;
  case  4: k1 ^= (uint64_t)(tail[ 3]) << 24;
  case  3: k1 ^= (uint64_t)(tail[ 2]) << 16;
  case  2: k1 ^= (uint64_t)(tail[ 1]) << 8;
  case  1: k1 ^= (uint64_t)(tail[ 0]) << 0;
           k1 *= c1; k1  = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
  };

  //----------
  // finalization

  h1 ^= len; h2 ^= len;

  h1 += h2;
  h2 += h1;

  h1 = fmix64(h1);
  h2 = fmix64(h2);

  h1 += h2;
  h2 += h1;

  ((uint64_t*)out)[0] = h1;
  ((uint64_t*)out)[1] = h2;
}


-----------------------------------------------------------------

Here is the output:
-----------------------------------------------------------------
[qichang@aarchie hash]$ g++ -o benchmark benchmark.c
[qichang@aarchie hash]$ ./benchmark
x86_32:  48f2284b
Total time: 14.185739 seconds
x86_128: afcbbbf4 e795fb5f 25f06a82 bdb0d389
Total time: 13.185568 seconds
x64_128: 1fe39d94 98f9f86a 9f7dfcd5 f01f2d58
Total time: 8.157431 seconds
-----------------------------------------------------------------
The first function spent: 14.185739 seconds
The second function spent: 13.185568 seconds
The third function spent: 8.157431 seconds



After benchmarking the performance of the hash functions, I am going to identify the optimization strategies that I will use in the second stage. On the top of my head, I will use altered build options to compile that hash functions, this is the easiest way in my experience. I might try to use In-line assembler to optimize the hash function, cannot guarantee but I will try.

Comments

Popular posts from this blog

Lab 3

In this lab, we are going to use Assembly language to finish 3 parts. 1. As we are getting familiar with Assembly language, we will create a loop in Assembly to prints out 10 times of "Hello World!". This part is quite easy to do it, here is the source code for x86_64 assembler: ------------------------------------------------------ .text .globl    _start start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */ max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */ _start:     mov     $start,%r15         /* loop index */     mov     %r15,%r10 loop:         /* ... body of the loop ... do something useful here ... */   ...

SPO600 - Project - Stage Three

In this last stage of my SPO600 project, Since I don't have results suitable for upstreaming, I am going to wrap up my project results and do some thorough technical analysis of my results. First of all, I am going to summary what I did for my project. (If you want to go over the details, you can see my previous posts.) I picked a software called SSDUP, it is a traffic-aware SSD burst buffer for HPC systems. I noticed that it uses 3 different Murmurhash3 hash functions, the first two hash functions are optimized for x86 platforms and the third hash function is optimized for x64 platforms. I also noticed that it uses 'gcc -std=gnu99' to compile. In order to easier to handler these 3 hash functions, I split them into 3 files and separately testing them on an AArch64 and x86_64 systems. As the professor said my results in stage two is hard to read, I am going to show my results again in a table format. First hash function (MurmurHash3_x86_32), the execution time for -O3...

Lab 5

In this lab, we are going to use different approaches to scale volume of sound, and the algorithm’s effect on system performance. Here is some basic knowledge of digital sound: Digital sound is usually represented by a signed 16-bit integer signal sample, taken at a rate of around 44.1 or 48 thousand samples per second for one stream of samples for the left and right stereo channels. In order to change the volume of sound, we will have to scale the volume factor for each sample, the range of 0.00 to 1.00 (silence to full volume). Here is the source code I got from professor: (vol1.h) ------------------------------------------------- #include <stdlib.h> #include <stdio.h> #include <stdint.h> #include "vol.h" // Function to scale a sound sample using a volume_factor // in the range of 0.00 to 1.00. static inline int16_t scale_sample(int16_t sample, float volume_factor) { return (int16_t) (volume_factor * (float) sample); } int main() { // Al...