Introduction to Computer Systems Loop Unrolling

Loop Unrolling

Data Dependencies

To understand how a CPU will execute a series of instructions, you need to understand implicit ordering constraints within a set of instructions:


int f(int a, int b, int c)
{
    int t1 = a * b;
    int t2 = b / c;
    int t3 = t1 & t2;
    int t4 = t1 | t2;
    int t5 = t3 - t4;


    return t5;
}

f:
                        ;# a = %rdi
                        ;# b = %rsi
                        ;# c = %rdx
    movl  %esi, %eax
    movl  %edx, %ecx
    imull %esi, %edi
    cltd
    idivl %ecx
    movl  %edi, %edx
    andl  %eax, %edx
    orl   %eax, %edi
    movl  %edx, %eax
    subl  %edi, %eax
    ret
    


Intel Core i7 Functional Units

  1. Fast integer arithmetic, most vector arithmetic, floating point fused multiply and add, integer and floating point division, and branches
  2. All integer arithmetic, most vector arithmetic, floating point fused multiply and add, and load effective address
  3. Load and address computation
  4. Load and address computation
  5. Store Data
  6. Fast integer arithmetic, some vector arithmetic, and fast load effective address
  7. Fast integer arithmetic
  8. Store address computation


Characteristics of CPU instructions


Data Dependencies With Loops


int A[N];
int sum = 0;

for (int i = 0; i < N i++)
{
    sum += A[i];
}

    movl   $0, %eax

;# %rcx points to end of A
.L3:
    addl   (%rdx), %eax
    addq   $4, %rdx ;# adding 4 bits w/ quad = adding one
    compq  %rcx, %rdx
    jne    .L3
    rep ret
    


##Unrolling The Loop By 4


int A[N];
int sum = 0;

for (int i = 0; i < N; i += 4)
{
    sum += A[i];
    sum += A[i + 1];
    sum += A[i + 2];
    sum += A[i + 3];
}

    movl   $0, %edx
    movl   $0, %eax

;# %edi points to the end of A
.L8:
    ...
    

Fixing Potential Errors

Removing Unneeded Dependencies