Vectorizer Lab

Take a look at the file vec_demo.c. It does a simple summation of integers stored in an array. Each array element holds its own index (element number), so we can easily check that the calculated sum is correct.
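As a rough illustration of that self-checking idea, here is a minimal sketch. It is not taken from vec_demo.c: the function names (fill_with_indices, sum_plain_sketch, expected_sum) are hypothetical, and the real file may use different types and signatures. It relies on the identity 0 + 1 + ... + (n-1) = n(n-1)/2.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store each element's index in the array. */
void fill_with_indices(int *a, size_t n) {
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        a[i] = (int)i;
    }
}

/* Hypothetical plain C summation, one element per iteration. */
long long sum_plain_sketch(const int *a, size_t n) {
    long long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Closed form for 0 + 1 + ... + (n-1), used to check the result. */
long long expected_sum(size_t n) {
    return (long long)n * ((long long)n - 1) / 2;
}
```

Calling fill_with_indices on a buffer and comparing sum_plain_sketch against expected_sum for the same n gives a quick correctness check for any of the four implementations.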

In this file you will find 4 implementations of the code to add up the array.

  • sum_array implements the function in plain old standard C.
  • sum_array_intrinsics uses no assembler but uses SSE intrinsics.
  • sum_array_asm is hand-coded assembler using traditional x86 opcodes.
  • sum_array_asm_sse is a hand-coded assembler function using SSE instructions.
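To give a feel for what an intrinsics-based summation can look like, here is a minimal sketch. It is an assumption-laden illustration, not the code in vec_demo.c: the name sum_sse2_sketch is hypothetical, it assumes 32-bit integers, a 16-byte-aligned input, a total that fits in 32 bits, and SSE2 integer intrinsics from emmintrin.h. The __SSE2__ guard is the macro GCC-compatible compilers define; without it the function falls back to a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: add four 32-bit ints per step using SSE2.
   Assumes a is 16-byte aligned and the total fits in int32_t. */
int32_t sum_sse2_sketch(const int32_t *a, size_t n) {
    size_t i = 0;
    int32_t total = 0;
    assert(a != NULL || n == 0);
#if defined(__SSE2__)
    /* The aligned load below requires a 16-byte-aligned address. */
    assert(((uintptr_t)a & 15u) == 0);
    __m128i acc = _mm_setzero_si128();          /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(a + i));
        acc = _mm_add_epi32(acc, v);            /* lane-wise add */
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);    /* spill partial sums */
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar loop handles the remainder (or everything without SSE2). */
    for (; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

The structure (vector accumulator, horizontal add at the end, scalar tail) is roughly what a vectorizing compiler produces from the plain C loop, which is why the two can end up with similar timings.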

Step 1 Compile and Run

You will compile this program with different compilation options to see the effect of the compiler's optimizations and to compare different ways of implementing vectorizable code.

The compilation commands look like this:

icc -use_msasm -O0     vec_demo.c -o vec_demo
icc -use_msasm -O3     vec_demo.c -o vec_demo
icc -use_msasm -O3 -xW vec_demo.c -o vec_demo

  • We need -use_msasm because I did not use AT&T assembler format for the assembly routines.
  • -xW enables the Pentium 4-compatible vectorizer.
  • -O0 turns off all optimizations. With no -O option, the compiler assumes -O2.
  • Ignore the "no EMMS instruction before return" warnings.

For each compilation, run the code and note your results below:

              -O0       -O3       -O3 -xW
plain C       ____      ____      ____
intrinsics    ____      ____      ____
assembly      ____      ____      ____

Things to think about / try if you have time...

  • How much of an edge does hand-coded assembler have over compiler-optimized plain old C? How about over intrinsics?
  • Which optimizations benefit the intrinsic version?
  • What happens if you use unaligned memory (allocated with malloc, not _mm_malloc) with the plain-C version? (It will crash the 2 SSE-equipped functions.) Is there a benefit?
  • Try the -vec_report3 option with the -xW compilations to see a report about both vectorized and non-vectorized loops. This program features some very simple loops. When programs have more complex loops, this kind of diagnostic data can help you make a loop vectorizable.
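On the alignment question above: a likely reason the SSE versions crash on malloc'd memory is that aligned SSE loads fault when the address is not a multiple of 16 bytes. The sketch below assumes that explanation and shows one way to test a pointer's alignment and fall back to unaligned loads. The names is_aligned_16 and sum_any_alignment are hypothetical and not part of vec_demo.c; the __SSE2__ guard is the macro GCC-compatible compilers define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: true if p sits on a 16-byte boundary. */
int is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical sketch: use aligned loads when possible, otherwise
   unaligned loads. Assumes the total fits in int32_t. */
int32_t sum_any_alignment(const int32_t *a, size_t n) {
    size_t i = 0;
    int32_t total = 0;
    assert(a != NULL || n == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    if (is_aligned_16(a)) {
        for (; i + 4 <= n; i += 4) {
            /* Aligned load: faults on a misaligned address. */
            acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(a + i)));
        }
    } else {
        for (; i + 4 <= n; i += 4) {
            /* Unaligned load: works at any address. */
            acc = _mm_add_epi32(acc, _mm_loadu_si128((const __m128i *)(a + i)));
        }
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; i++) {
        total += a[i];                          /* scalar tail */
    }
    return total;
}
```

Timing the aligned and unaligned paths on the same data is one way to answer the "is there a benefit?" question for your own machine.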

Teacher's notes (don't include in lab)

Here are my results on a 2.8 GHz Prestonia:

                   -O0       -O3       -O3 -xW
plain old C        5.0       .76       .32
C w/ intrinsics    2.5       .37       .38
plain old ASM      1.2       1.18      1.18
ASM w/ SSE         .46       .48       .48

Conclusion: it is usually best to write plain, portable code and leverage the Intel compiler. In this case, with -O3 -xW, we get better performance with plain code than with intrinsics or assembler. Of course, this could be fixed with some careful improvements (e.g. the assembler version could use some prefetch instructions), but the point is that you should always try the compiler's method first.

To do (to enhance the lesson)

  • use the "don't vectorize" pragmas
  • do floating point rather than integer math

See also...

http://bmagic.sourceforge.net/bmsse2opt.html
http://www.tommesani.com/SSE2MMX.html
http://www.codeproject.com/cpp/sseintro.asp?df=100&forumid=16168&exp=0&select=568493

-- MattWalsh - 29 May 2004
