Lab 4

This lab explores single instruction/multiple data (SIMD) vectorization and the auto-vectorization capabilities of the GCC compiler. For readers who are not familiar with vectorization, this article will help: Automatic vectorization

In this lab, we are going to write a short program that:
-Creates two 1000-element integer arrays
-Fills them with random numbers in the range -1000 to +1000
-Sums those two arrays element-by-element into a third array
-Sums the third array
-Prints out the result

Here is the source code I wrote:
------------------------------------------------------
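#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(){
    int sum = 0;    /* must start at 0; reading an uninitialized sum is undefined behaviour */
    int arr1[1000];
    int arr2[1000];
    int arr3[1000];

    /* seed the random number generator */
    srand(time(NULL));

    /* fill both arrays with random numbers in the range -1000 to +1000 */
    for(int i=0; i<1000; i++){
        arr1[i] = rand() % 2001 - 1000;
        arr2[i] = rand() % 2001 - 1000;
    }

    /* add the two arrays element-by-element into the third array */
    for(int i=0; i<1000; i++){
        arr3[i] = arr1[i] + arr2[i];
    }

    /* add up the third array */
    for(int i=0; i<1000; i++){
        sum += arr3[i];
    }
    printf("Sum is: %d\n", sum);
    return 0;
}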
------------------------------------------------------

I will use the command 'gcc -O3 -o lab4 lab4.c' to compile this program.
But how do we know this will vectorize our program? According to the GCC article on vectorization, vectorization is enabled by default at the -O3 optimization level.

Check the instructions in <main>:
------------------------------------------------------
Disassembly of section .text:

0000000000400560 <main>:
  400560:       d285e410        mov     x16, #0x2f20                    // #12064
  400564:       cb3063ff        sub     sp, sp, x16
  400568:       d2800000        mov     x0, #0x0                        // #0
  40056c:       a9007bfd        stp     x29, x30, [sp]
  400570:       910003fd        mov     x29, sp
  400574:       a90153f3        stp     x19, x20, [sp, #16]
  400578:       529a9c74        mov     w20, #0xd4e3                    // #54499
  40057c:       a9025bf5        stp     x21, x22, [sp, #32]
  400580:       72a83014        movk    w20, #0x4180, lsl #16
  400584:       f9001bf7        str     x23, [sp, #48]
  400588:       910103b6        add     x22, x29, #0x40
  40058c:       913f83b5        add     x21, x29, #0xfe0
  400590:       5280fa33        mov     w19, #0x7d1                     // #2001
  400594:       d2800017        mov     x23, #0x0                       // #0
  400598:       97ffffd6        bl      4004f0 <time@plt>
  40059c:       97ffffe9        bl      400540 <srand@plt>
  4005a0:       97ffffdc        bl      400510 <rand@plt>
  4005a4:       9b347c01        smull   x1, w0, w20
  4005a8:       9369fc21        asr     x1, x1, #41
  4005ac:       4b807c21        sub     w1, w1, w0, asr #31
  4005b0:       1b138020        msub    w0, w1, w19, w0
  4005b4:       510fa000        sub     w0, w0, #0x3e8
  4005b8:       b8376ac0        str     w0, [x22, x23]
  4005bc:       97ffffd5        bl      400510 <rand@plt>
  4005c0:       9b347c01        smull   x1, w0, w20
  4005c4:       9369fc21        asr     x1, x1, #41
  4005c8:       4b807c21        sub     w1, w1, w0, asr #31
  4005cc:       1b138020        msub    w0, w1, w19, w0
  4005d0:       510fa000        sub     w0, w0, #0x3e8
  4005d4:       b8376aa0        str     w0, [x21, x23]
  4005d8:       910012f7        add     x23, x23, #0x4
  4005dc:       f13e82ff        cmp     x23, #0xfa0
  4005e0:       54fffe01        b.ne    4005a0 <main+0x40>  // b.any
  4005e4:       d283f002        mov     x2, #0x1f80                     // #8064
  4005e8:       8b0203a1        add     x1, x29, x2
  4005ec:       d2800000        mov     x0, #0x0                        // #0
  4005f0:       3ce06ac0        ldr     q0, [x22, x0]
  4005f4:       3ce06aa1        ldr     q1, [x21, x0]
  4005f8:       4ea18400        add     v0.4s, v0.4s, v1.4s
  4005fc:       3ca06820        str     q0, [x1, x0]
  400600:       91004000        add     x0, x0, #0x10
  400604:       f13e801f        cmp     x0, #0xfa0
  400608:       54ffff41        b.ne    4005f0 <main+0x90>  // b.any
  40060c:       4f000400        movi    v0.4s, #0x0
  400610:       aa0103e0        mov     x0, x1
  400614:       d285e401        mov     x1, #0x2f20                     // #12064
  400618:       8b0103a1        add     x1, x29, x1
  40061c:       3cc10401        ldr     q1, [x0], #16
  400620:       4ea18400        add     v0.4s, v0.4s, v1.4s
  400624:       eb01001f        cmp     x0, x1
  400628:       54ffffa1        b.ne    40061c <main+0xbc>  // b.any
  40062c:       4eb1b800        addv    s0, v0.4s
  400630:       90000000        adrp    x0, 400000 <_init-0x4b8>
  400634:       91208000        add     x0, x0, #0x820
  400638:       0e043c01        mov     w1, v0.s[0]
  40063c:       97ffffc5        bl      400550 <printf@plt>
  400640:       f9401bf7        ldr     x23, [sp, #48]
  400644:       a94153f3        ldp     x19, x20, [sp, #16]
  400648:       52800000        mov     w0, #0x0                        // #0
  40064c:       a9425bf5        ldp     x21, x22, [sp, #32]
  400650:       d285e410        mov     x16, #0x2f20                    // #12064
  400654:       a9407bfd        ldp     x29, x30, [sp]
  400658:       8b3063ff        add     sp, sp, x16
  40065c:       d65f03c0        ret
------------------------------------------------------

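SCALAR (NOT SIMD) INSTRUCTIONS:
These two smull instructions work on the general-purpose registers x1, w0 and w20, not on the vector registers, so they are ordinary scalar multiplies. They belong to the first loop, which fills the arrays and computes rand() % 2001. That loop calls rand() for every element and was not vectorized.
------------------------------------------------------
  4005a4:       9b347c01        smull   x1, w0, w20
  4005c0:       9b347c01        smull   x1, w0, w20
------------------------------------------------------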
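
To see what the smull lines are doing, here is a rough C sketch of the sequence smull / asr #41 / sub ... asr #31 / msub / sub #0x3e8 from the first loop. It looks like the compiler replaced the % 2001 division with a multiply by the constant 0x4180d4e3 (the value loaded into w20, which works out to 2^41 / 2001 rounded up) followed by shifts. The function names are made up for this example, and the sketch assumes that right-shifting a negative number is an arithmetic shift, which is what GCC does.

```c
#include <stdint.h>

/* Hypothetical helper: x % 2001 computed the way the disassembly appears to,
   with a multiply and shifts instead of a divide instruction.
   Assumes >> on a negative value is an arithmetic shift (true for GCC). */
int mod2001_by_multiply(int x) {
    const int64_t magic = 0x4180d4e3;           /* w20: 2^41 / 2001, rounded up */
    int q = (int)(((int64_t)x * magic) >> 41);  /* smull, then asr #41 */
    q -= x >> 31;                               /* sub ..., asr #31: adds 1 when x is negative */
    return x - q * 2001;                        /* msub with w19 = 2001 */
}

/* Hypothetical helper: maps a rand() result into the range -1000..+1000. */
int to_lab_range(int x) {
    return mod2001_by_multiply(x) - 1000;       /* sub w0, w0, #0x3e8 */
}
```

Calling to_lab_range(rand()) should give the same value as rand() % 2001 - 1000 in the source code. Everything here works on one number at a time, which is why these instructions are scalar.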

VECTORIZED:
------------------------------------------------------
  4005f8:       4ea18400        add     v0.4s, v0.4s, v1.4s
  40060c:       4f000400        movi    v0.4s, #0x0
  400620:       4ea18400        add     v0.4s, v0.4s, v1.4s
------------------------------------------------------
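
To make the v0.4s notation more concrete, here is a small sketch of what the second and third loops do after vectorization: four 32-bit integers are handled per step, and the four partial sums are combined at the end (the job of addv in the disassembly). It uses the GCC vector extension (the vector_size attribute) rather than the real compiler output. The type name v4si and both function names are my own choices for this example, and the sketch assumes the array length is a multiple of 4, which is true for 1000.

```c
#include <string.h>

/* Four 32-bit ints in one 128-bit value, like one q register viewed as v.4s.
   The typedef name is an assumption for this sketch. */
typedef int v4si __attribute__((vector_size(16)));

/* Hypothetical helper: out[i] = a[i] + b[i], four elements per step.
   n must be a multiple of 4. */
void add_arrays_by_four(const int *a, const int *b, int *out, int n) {
    for (int i = 0; i < n; i += 4) {
        v4si va, vb, vc;
        memcpy(&va, a + i, sizeof va);   /* like ldr q0 */
        memcpy(&vb, b + i, sizeof vb);   /* like ldr q1 */
        vc = va + vb;                    /* like add v0.4s, v0.4s, v1.4s */
        memcpy(out + i, &vc, sizeof vc); /* like str q0 */
    }
}

/* Hypothetical helper: sums an array by keeping four running totals.
   n must be a multiple of 4. */
int sum_by_four(const int *a, int n) {
    v4si acc = {0, 0, 0, 0};             /* like movi v0.4s, #0x0 */
    for (int i = 0; i < n; i += 4) {
        v4si v;
        memcpy(&v, a + i, sizeof v);
        acc += v;                        /* four additions at once */
    }
    /* combine the four lanes into one number, like addv s0, v0.4s */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

With 1000 elements, each of these loops runs 250 times instead of 1000, which matches the disassembly where the loop counter steps by 0x10 (16 bytes, or four ints) per iteration.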

Here are the articles that explain how to tell whether a program was vectorized by looking for the SIMD vector registers: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/a64_simd_vector.html
