Lab 4

This lab explores single instruction/multiple data (SIMD) vectorization and the auto-vectorization capabilities of the GCC compiler. For readers who are not familiar with vectorization, this article will help: Automatic vectorization

In this lab, we are going to write a short program that:
-Creates two 1000-element integer arrays
-Fills them with random numbers in the range -1000 to +1000
-Adds those two arrays element-by-element into a third array
-Sums up the third array
-Prints out the result

Here is the source code I wrote:
------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

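int main(){
    int sum = 0;    // initialize the accumulator before summing into it
    int arr1[1000];
    int arr2[1000];
    int arr3[1000];

    srand(time(NULL));

    // fill both arrays with random values in the range -1000 to +1000
    for(int i=0; i<1000; i++){
        arr1[i] = rand() % 2001 - 1000;
        arr2[i] = rand() % 2001 - 1000;
    }

    // add the two arrays element-by-element into the third
    for(int i=0; i<1000; i++){
        arr3[i] = arr1[i] + arr2[i];
    }

    // sum up the third array
    for(int i=0; i<1000; i++){
        sum += arr3[i];
    }
    printf("Sum is: %d\n", sum);
}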
------------------------------------------------------

I will use the command 'gcc -O3 -o lab4 lab4.c' to compile this program.
But how do we know this will vectorize our program? According to the GCC article on vectorization, vectorization is enabled by default at the -O3 optimization level.

Check the instructions in <main>:
------------------------------------------------------
Disassembly of section .text:

0000000000400560 <main>:
  400560:       d285e410        mov     x16, #0x2f20                    // #12064
  400564:       cb3063ff        sub     sp, sp, x16
  400568:       d2800000        mov     x0, #0x0                        // #0
  40056c:       a9007bfd        stp     x29, x30, [sp]
  400570:       910003fd        mov     x29, sp
  400574:       a90153f3        stp     x19, x20, [sp, #16]
  400578:       529a9c74        mov     w20, #0xd4e3                    // #54499
  40057c:       a9025bf5        stp     x21, x22, [sp, #32]
  400580:       72a83014        movk    w20, #0x4180, lsl #16
  400584:       f9001bf7        str     x23, [sp, #48]
  400588:       910103b6        add     x22, x29, #0x40
  40058c:       913f83b5        add     x21, x29, #0xfe0
  400590:       5280fa33        mov     w19, #0x7d1                     // #2001
  400594:       d2800017        mov     x23, #0x0                       // #0
  400598:       97ffffd6        bl      4004f0 <time@plt>
  40059c:       97ffffe9        bl      400540 <srand@plt>
  4005a0:       97ffffdc        bl      400510 <rand@plt>
  4005a4:       9b347c01        smull   x1, w0, w20
  4005a8:       9369fc21        asr     x1, x1, #41
  4005ac:       4b807c21        sub     w1, w1, w0, asr #31
  4005b0:       1b138020        msub    w0, w1, w19, w0
  4005b4:       510fa000        sub     w0, w0, #0x3e8
  4005b8:       b8376ac0        str     w0, [x22, x23]
  4005bc:       97ffffd5        bl      400510 <rand@plt>
  4005c0:       9b347c01        smull   x1, w0, w20
  4005c4:       9369fc21        asr     x1, x1, #41
  4005c8:       4b807c21        sub     w1, w1, w0, asr #31
  4005cc:       1b138020        msub    w0, w1, w19, w0
  4005d0:       510fa000        sub     w0, w0, #0x3e8
  4005d4:       b8376aa0        str     w0, [x21, x23]
  4005d8:       910012f7        add     x23, x23, #0x4
  4005dc:       f13e82ff        cmp     x23, #0xfa0
  4005e0:       54fffe01        b.ne    4005a0 <main+0x40>  // b.any
  4005e4:       d283f002        mov     x2, #0x1f80                     // #8064
  4005e8:       8b0203a1        add     x1, x29, x2
  4005ec:       d2800000        mov     x0, #0x0                        // #0
  4005f0:       3ce06ac0        ldr     q0, [x22, x0]
  4005f4:       3ce06aa1        ldr     q1, [x21, x0]
  4005f8:       4ea18400        add     v0.4s, v0.4s, v1.4s
  4005fc:       3ca06820        str     q0, [x1, x0]
  400600:       91004000        add     x0, x0, #0x10
  400604:       f13e801f        cmp     x0, #0xfa0
  400608:       54ffff41        b.ne    4005f0 <main+0x90>  // b.any
  40060c:       4f000400        movi    v0.4s, #0x0
  400610:       aa0103e0        mov     x0, x1
  400614:       d285e401        mov     x1, #0x2f20                     // #12064
  400618:       8b0103a1        add     x1, x29, x1
  40061c:       3cc10401        ldr     q1, [x0], #16
  400620:       4ea18400        add     v0.4s, v0.4s, v1.4s
  400624:       eb01001f        cmp     x0, x1
  400628:       54ffffa1        b.ne    40061c <main+0xbc>  // b.any
  40062c:       4eb1b800        addv    s0, v0.4s
  400630:       90000000        adrp    x0, 400000 <_init-0x4b8>
  400634:       91208000        add     x0, x0, #0x820
  400638:       0e043c01        mov     w1, v0.s[0]
  40063c:       97ffffc5        bl      400550 <printf@plt>
  400640:       f9401bf7        ldr     x23, [sp, #48]
  400644:       a94153f3        ldp     x19, x20, [sp, #16]
  400648:       52800000        mov     w0, #0x0                        // #0
  40064c:       a9425bf5        ldp     x21, x22, [sp, #32]
  400650:       d285e410        mov     x16, #0x2f20                    // #12064
  400654:       a9407bfd        ldp     x29, x30, [sp]
  400658:       8b3063ff        add     sp, sp, x16
  40065c:       d65f03c0        ret
------------------------------------------------------

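SCALAR (NOT SIMD) INSTRUCTIONS IN THE FILL LOOP:
------------------------------------------------------
  4005a4:       9b347c01        smull   x1, w0, w20
  4005c0:       9b347c01        smull   x1, w0, w20
------------------------------------------------------
Despite its name, smull here is not a vector instruction: it works on the general-purpose registers x1, w0 and w20, not on the v registers. It is a scalar signed multiply that the compiler uses, together with the asr, sub and msub that follow it, to compute rand() % 2001 without a divide. The fill loop stays scalar, since every element comes from a separate call to rand().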
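The smull / asr / sub / msub / sub sequence in the fill loop (4005a4 to 4005b4) can be read back as plain C. The sketch below is my own reconstruction of that arithmetic, not compiler output: the function names are made up for illustration, and it assumes that right-shifting a negative signed value is an arithmetic shift (true for GCC, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* x % 2001 computed with a multiply and shifts instead of a divide.
 * 0x4180d4e3 is the constant loaded into w20 by mov/movk in the listing. */
int mod2001_via_multiply(int x)
{
    const int64_t magic = 0x4180d4e3;           /* mov w20 / movk w20         */
    int q = (int)(((int64_t)x * magic) >> 41);  /* smull x1, w0, w20; asr #41 */
    q -= x >> 31;                               /* sub w1, w1, w0, asr #31    */
    return x - q * 2001;                        /* msub w0, w1, w19, w0       */
}

/* The value stored into the array: rand() % 2001 - 1000. */
int fill_value(int r)
{
    return mod2001_via_multiply(r) - 1000;      /* sub w0, w0, #0x3e8         */
}

/* Hypothetical helper: compare the sketch against the % operator. */
void check_mod2001(int x)
{
    assert(mod2001_via_multiply(x) == x % 2001);
}
```

Calling check_mod2001 on a range of inputs from your own main is a quick way to convince yourself that the sequence is an ordinary scalar remainder and has nothing to do with SIMD.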

VECTORIZED:
------------------------------------------------------
  4005f8:       4ea18400        add     v0.4s, v0.4s, v1.4s
  40060c:       4f000400        movi    v0.4s, #0x0
  400620:       4ea18400        add     v0.4s, v0.4s, v1.4s
------------------------------------------------------
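To make the vector instructions easier to picture, here is a rough scalar C sketch of what the two vectorized loops do. It is only a model under my own assumptions: the names LANES, add_arrays_4wide and sum_4wide are invented for this example, and the inner "lane" loops stand in for a single instruction acting on four 32-bit integers at once (the .4s suffix).

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* a 128-bit q register holds four 32-bit ints (.4s) */

/* Models the loop at 4005f0: ldr q0 / ldr q1 / add v0.4s / str q0. */
void add_arrays_4wide(const int *a, const int *b, int *out, size_t n)
{
    assert(n % LANES == 0);  /* 1000 elements is a multiple of 4 */
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}

/* Models the loop at 40061c plus the final addv. */
int sum_4wide(const int *v, size_t n)
{
    int acc[LANES] = {0, 0, 0, 0};          /* movi v0.4s, #0x0          */
    assert(n % LANES == 0);
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            acc[lane] += v[i + lane];       /* add v0.4s, v0.4s, v1.4s   */
        }
    }
    /* addv s0, v0.4s: add the four partial sums into one scalar */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Both loops walk the arrays 16 bytes at a time, which matches the add x0, x0, #0x10 and the post-indexed ldr q1, [x0], #16 in the disassembly. The four partial sums are only combined once, after the loop, by addv.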

Here are the articles that explain how we can tell that a program was vectorized by looking for the SIMD vector registers: https://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/a64_simd_vector.html
