Introduction to Computer Systems Loop Unrolling

# Loop Unrolling

## Data Dependencies

To understand how a CPU will execute a series of instructions, you need to understand implicit ordering constraints within a set of instructions:

``````
int f(int a, int b, int c)
{
int t1 = a * b;
int t2 = b / c;
int t3 = t1 & t2;
int t4 = t1 | t2;
int t5 = t3 - t4;

return t5;
}
``````
``````
f:
;# a = %rdi
;# b = %rsi
;# c = %rdx
movl  %esi, %eax
movl  %edx, %ecx
imull %esi, %edi
cltd
idivl %ecx
movl  %edi, %edx
andl  %eax, %edx
orl   %eax, %edi
movl  %edx, %eax
subl  %edi, %eax
ret
``````

## Intel Core i7 Functional Units

1. Fast integer arithmetic, most vector arithmetic, floating point fused multiply and add, integer and floating point division, and branches
2. All integer arithmetic, most vector arithmetic, floating point fused multiply and add, and load effective address
3. Load and address computation
4. Load and address computation
5. Store Data
6. Fast integer arithmetic, some vector arithmetic, and fast load effective address
7. Fast integer arithmetic
8. Store address computation

## Characteristics of CPU instructions

• Latency - the amount of time between the start of an individual instruction and its completion (data-dependent)
• Issue - (AKA Throughput) the amount of time between the start of an individual instruction and the start of the next instruction within the same functional unit
• Capacity - the amount of functional units that can concurrently accept a specific instruction

## Data Dependencies With Loops

``````
int A[N];
int sum = 0;

for (int i = 0; i < N i++)
{
sum += A[i];
}
``````
``````
movl   \$0, %eax

;# %rcx points to end of A
.L3:
addl   (%rdx), %eax
addq   \$4, %rdx ;# adding 4 bits w/ quad = adding one
compq  %rcx, %rdx
jne    .L3
rep ret
``````
• Only uses 4 of the functional units

##Unrolling The Loop By 4

• same code unrolled by 4:
``````
int A[N];
int sum = 0;

for (int i = 0; i < N; i += 4)
{
sum += A[i];
sum += A[i + 1];
sum += A[i + 2];
sum += A[i + 3];
}
``````
``````
movl   \$0, %edx
movl   \$0, %eax

;# %edi points to the end of A
.L8:
...
``````