Skip to main content

Lab 6A

This lab is separated into two parts, I'll blog my work in different post.
In the first part, we've got a source code from professor Chris, which is a similar stuff to our lab5, scaling the volume of sound, but it includes inline assembler.
The first thing I'll do is add a timer to the code in order to check the performing time. Build and run the program, here is the output:
-------------------------------------------------------------------------
[qichang@aarchie spo600_20181_inline_assembler_lab]$ ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: -462
Time: 0.024963 seconds.
-------------------------------------------------------------------------

Then I adjusted the number of samples to 5000000 in vol.h:
-------------------------------------------------------------------------
[qichang@aarchie spo600_20181_inline_assembler_lab]$ cat vol_simd.c
// vol_simd.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29-2018.02.20

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

int16_t* in; // input array
int16_t* limit; // end of input array
int16_t* out; // output array

// these variables will be used in our assembler code, so we're going
// to hand-allocate which register they are placed in
// Q: what is an alternate approach?
register int16_t* in_cursor asm("r20"); // input cursor
register int16_t* out_cursor asm("r21"); // output cursor
register int16_t vol_int asm("r22"); // volume as int16_t

int x; // array interator
int ttl; // array total

in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

srand(-1);
printf("Generating sample data.\n");
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}

// --------------------------------------------------------------------

in_cursor = in;
out_cursor = out;
limit = in + SAMPLES ;

// set vol_int to fixed-point representation of 0.75
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t) (0.75 * 32767.0);

printf("Scaling samples.\n");

// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

while ( in_cursor < limit ) {
__asm__ (
"ldr q0, [%[in]],#16 \n\t"
// load eight samples into q0 (v0.8h)
// from in_cursor, and post-increment
// in_cursor by 16 bytes

"sqdmulh v0.8h, v0.8h, v1.8h \n\t"
// multiply each lane in v0 by v1*2
// saturate results
// store upper 16 bits of results into v0

"str q0, [%[out]],#16 \n\t"
// store eight samples to out_cursor
// post-increment out_cursor by 16 bytes

// Q: what happens if we remove the following
// two lines? Why?
: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor)
);
}

// --------------------------------------------------------------------

printf("Summing samples.\n");
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}

// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);

return 0;

}
-------------------------------------------------------------------------

Here is the output:
-------------------------------------------------------------------------
[qichang@aarchie spo600_20181_inline_assembler_lab]$ ./vol_simd
Generating sample data.
Scaling samples.
Summing samples.
Result: 362
Time: 0.251086 seconds.
-------------------------------------------------------------------------

Look back the solution I did in lab5, and I changed the number of samples to 5000000:
-------------------------------------------------------------------------
[qichang@aarchie spo600_20181_vol_skel]$ cat vol2.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <time.h>

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}

int main() {

// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
int16_t* lookupTable = calloc(65536, sizeof(int16_t));
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

int x;
int ttl;
clock_t start, end, total;

// Seed the pseudo-random number generator
srand(-1);

// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}

// Initialize lookup table by multiplied by the volume factor
for(int i=0; i<65536; i++){
lookupTable[i] = (i-32768) * 0.75;
}

start = clock();

// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
out[x] = scale_sample(in[x], 0.75);
}
// ######################################

// Sum up the data
for (x = 0; x < SAMPLES; x++) {
in[x] = (lookupTable[in[x] + 32768]);
ttl = (ttl+out[x])%1000;
}
end = clock();
        total = end - start;

for(int i =0; i<10; i++){
printf("%d \n", in[i]);
}
// Print the sum
printf("Result: %d\n", ttl);
printf("Time spending for scaling the sound samples: %f seconds.\n",
        (float) total/CLOCKS_PER_SEC);
return 0;

}
-------------------------------------------------------------------------

Here is the output:
-------------------------------------------------------------------------
[qichang@aarchie spo600_20181_vol_skel]$ ./vol2
17516
10521
8070
12072
-789
9538
1708
-15747
-20292
-15138
Result: 668
Time spending for scaling the sound samples: 0.036603 seconds.
-------------------------------------------------------------------------

// Q: what is an alternate approach?
register int16_t* in_cursor asm("r20"); // input cursor
register int16_t* out_cursor asm("r21"); // output cursor
register int16_t vol_int asm("r22"); // volume as int16_t
A: Let the compiler assign register by itself, not given.
-------------------------------------------------------------------------

// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t) (0.75 * 32767.0);
A: Because int16_t has its maximum data , which is 32767.
-------------------------------------------------------------------------

// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
A: It means the register given, w0, is used to v1.8h.
-------------------------------------------------------------------------

// Q: what happens if we remove the following
// two lines? Why?
: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor)
);
A: It couldn't even complie.
[qichang@aarchie spo600_20181_inline_assembler_lab]$ make
gcc -g -O3 vol_simd.c -o vol_simd
vol_simd.c: In function ‘main’:
vol_simd.c:71:4: error: undefined named operand ‘in’
    );
    ^
vol_simd.c:71:4: error: undefined named operand ‘out’
make: *** [Makefile:7: vol_simd] Error 1
-------------------------------------------------------------------------

// Q: are the results usable? are they correct?
A: The results are usable, and they are correct.

Comments

Popular posts from this blog

Lab2

Complied C Lab In this lab, we were asked to compile a C program, using gcc command with different options. At the beginning of this lab, we wrote a simple C program that prints a message: Then using gcc command and the following compiler options to compile the program: -g # enable debugging information -O0 # do not optimize (that's a capital letter and then the digit zero) -fno-builtin # do not use builtin function optimizations Note that the size of file is 73088 bytes We can use objdump --source a.out command to show source code, the source code is under <main> section. And  readelf -p .rodata a.out contains the string to be printed. Then we add the option "-static" to recompile the program, found out the size is changed to 696264 bytes, which is bigger than the original program. And section headers are also increased. Next, I removed the builtin function optimization by remove option "-fno-builtin"...

SPO600 - Project - Stage Three

In this last stage of my SPO600 project, Since I don't have results suitable for upstreaming, I am going to wrap up my project results and do some thorough technical analysis of my results. First of all, I am going to summary what I did for my project. (If you want to go over the details, you can see my previous posts.) I picked a software called SSDUP, it is a traffic-aware SSD burst buffer for HPC systems. I noticed that it uses 3 different Murmurhash3 hash functions, the first two hash functions are optimized for x86 platforms and the third hash function is optimized for x64 platforms. I also noticed that it uses 'gcc -std=gnu99' to compile. In order to easier to handler these 3 hash functions, I split them into 3 files and separately testing them on an AArch64 and x86_64 systems. As the professor said my results in stage two is hard to read, I am going to show my results again in a table format. First hash function (MurmurHash3_x86_32), the execution time for -O3...

SPO600 - Project - Stage One

In our final project, the project will split into 3 stages. This is the first stage of my SPO600 course project. In this stage, we are given a task to find an open source software package that includes a CPU-intensive function or method that compiles to machine code. After I chose the open source software package, I will have to benchmark the performance of the software function on an AArach64 system. When the benchmark job is completed, I will have to think about my strategy that attempts to optimize the hash function for better performance on an AArch64 system and identify it, because those strategies will be used in the second stage of the project. With so many software, I would say picking software is the hardest job in the project, which is the major reason it took me so long to get this post going. But after a lot of research, I picked a software called SSDUP , it is a traffic-aware SSD burst buffer for HPC systems. You can find the source code over here: https://github.com/CGC...