MCS-284 Lab 3: Measuring Processor Architectures' Performance

For next time:  Edit the instructions to include...

Here is what I propose you may want do for lab 3:

(1)  Recompile without optimization to get a .s file.

(2)  One loop will have one fewer instruction.  Insert that
     missing instruction.

(3)  Compile the .s into an executable.  I found I had to do so on a
     P3 to get machine code which would run on either platform without
     seg faulting.  (Not sure why yet.)

(4)  When you take data, TAKE ONLY MINIMAL DATA.  In particular,
     you might just take data on the two values n=100 and n=200.
     You can get a clear slope from that.  (If you want to confirm the
     slope, you can take n=300 as well.)

Now you can focus your attention on content rather than on the
presentation data; most of the data was tangential to your main points
anyway.
We've composed a list of processors found in the lab which are Pentium III's. You guys are usually working on Pentium 4's. In any case, be sure to double-check by looking at /proc/cpuinfo

Introduction

The goal of this lab is to see how the performance of an advanced pipelined architecture can be empirically measured, using the processor's cycle counter, and in particular to see the performance interaction between different hardware approaches to control hazards and different software approaches to jumping. As usual, I expect most of you to work in pairs.

You should be able to finish this lab quite quickly, and I therefore expect each lab group to find a way to enhance their lab report by conducting an additional experiment. I recommend that the experiment be simple and that you investigate other differences between the Intel Pentium II (or III) and Pentium 4 pipelines.

Measuring performance

You will measure the performance of two different programs each on two different computers, for a total of four combinations. The two programs will be nearly identical. Both read in a number, n, and then loop n times, doing nothing in the loop except incrementing the counter i until it reaches n. Each program then prints out the number of clock cycles that the loop took. (This is found by using a special cycle counter built into the computer.) The only difference between the two programs is in how they jump back from the bottom of the loop to the top. One jumps to a label, like the MIPS j instruction, while the other jumps to an address stored in a register, like the MIPS jr instruction. (The register is initialized before the loop with the address of the label at the top of the loop.) The two computers you will use are not variants of the MIPS architecture, but rather of the IA-32 architecture:

You can find out what kind of processor a computer has by in a shell window giving the command

 cat /proc/cpuinfo 
Your lab report should specify the particular processor model name.

It would be unreasonable for me to expect you to learn Intel's assembly language after having just struggled with MIPS's. So instead, you'll use a different approach: we'll write the programs in C and compiling them with the gcc compiler. You'll then look at the assembly language output produced by the compiler so that you can see the difference between the two versions, but it is much easier to read an unfamiliar assembly language than to write it.

Compiling, linking and running loop.c

I have written the first version of the program, which jumps to a label, and have included some extra code which will make it trivial to generate the second version of the program. It is in the loop.c file linked from the web version of this lab handout. In order to use the cycle counter, it makes use of a second file, cyccounter.h, which is also on the web. You will need to download copies of these two files into your own directory. You are not expected to be able to read and understand cyccounter.h.

To compile this program using gcc, you can use the following two commands:

gcc -S loop.c
gcc -o loop loop.s
The first command compiles the C program loop.c and stops at assembly language (that is what the -S option does). This assembly language version is left in the file loop.s. The second command says to compile (actually assemble) the loop.s assembly language program the rest of the way to an executable machine language program, which it should put in the output file loop. (The -o option just says that what follows it is the output file.)

In order to run the loop program, choose a number of iterations n, and issue the command

./loop n
The program will then print out the number of cycles the loop took. You can repeat this several times with the same value of n to get an idea how repeatable it is, and repeat it with several different values of n to see how the number of cycles grows with n, in particular, how many additional cycles are done for each additional trip through the loop. If there is some variation in the cycle counts for a particular n, you should use the smallest rather than taking an average. This is because we only want to pay attention to cycles actually spent in the processor, not the extra cycles that can sometimes be needed because of memory delays or other effects outside of chapter 6's scope.

Counting cycles per iteration

The question of how many cycles each iteration of the loop takes can be answered in two ways: the average number of cycles per iteration or the marginal number of cycles per iteration. The average is found by taking the total number of cycles for n iterations and dividing it by n. The marginal value is found by seeing how many more cycles the loop takes to do n1 iterations than n0 (where n1 > n0) and dividing that difference by n1 - n0. The marginal value is generally the more interesting one, because it excludes any fixed costs. Some of those fixed costs are artifacts of the measurement study: starting and stopping the timer takes some time. Even fixed costs that aren't artificial are of diminishing relevance as the number of iterations grows large, and if the number of iterations is small, then performance is probably not crucial in the first place. Besides, we are particularly interested in the performance of the jumping to the top of the loop, and that is a per-iteration phenomenon, not a start-up cost like initializing i to 0. Therefore, you should calculate marginal, not average, values in this lab.

Which values of n are appropriate?

I'd recommend using an n of (say) 100, and one of 200 and extract the slope of the line. (Or, you could try 1000 and 2000.) Then, to be sure, you might try one or to more values (like 300) to make sure the slope persists as you'd expect. If you use values which are too small, may of the cycles will be due to loop start-up costs. If you use values which are too large, you risk other processes sneaking in and stealing cycles while your loop is running.

Modifying the program to jump using a register

To make the alternative version of the program, where the loop is closed not with a jump to a label but rather a jump using a register, you will need to edit the C program (or rather, a copy of it with a different name) and recompile. The loop in the C program ends with the statement
 goto loop; 
This is what gets translated by the C compiler into a jump to a label. We want to change this to the C equivalent of jumping to an address specified by a register, which is using goto with the target pointed to by a variable. This can't be done in standard C, but can using a non-standard extension to C that the gcc compiler provides. To learn about this non-standard extension, you'll need to look at the gcc documentation. Here are two ways to locate the documentation: This should now have the information you need. If you have trouble understanding it, be sure to ask. (Note that you should only have to change one line of the C code.) One likely trouble spot is that foo is used in the example without explanation. The idea is that this is a label appearing in part of the example program that isn't shown. That is, some statement would be labeled with foo: (note the colon here). Then after assigning &&foo to ptr, the statement goto *ptr; has the same effect as goto foo; would.

You can compile and run the new version of the program the same way as the original. You should also compare the two assembly language versions (the .s files) to make sure that the loop is really the same except for the jump instruction at the end, and possibly which specific register has been assigned to hold each variable. The jump instruction in the IA-32 assembly language is jmp. This is used for both jumping to a label (MIPS's j) and jumping via a register (MIPS's jr). The difference is that if you are jumping to a label, the label appears after the jmp, while if you are jumping to a location specified by a register, the jmp is followed by a * and then the register name. The labels generated by gcc start with .L, and the register names all start with % (which is like $ for MIPS). If you have trouble identifying what lines of assembly language are the loop, or understanding the difference, please ask for help.

You may observe another difference between the loops in the .s files. While the loops should have different jump instructions, they should not differ in other ways, since that would introduce too many variables. You'd like to just compare how the two jump instructions affect the pipelines. Consequently, you should edit one the shorter .s files so that it includes the additional instruction in the loop of the larger .s file.

Note, by the way, that neither version of the program is good C programming style, and there certainly is no reason why a programmer would use the second version, where a variable (or register) is used to jump to the fixed address at the top of the loop. However, these programs let us try out in a simple context the two forms of jump instructions that real programs do use. For example, jumping to an address contained in a register is an important part of how modern object-oriented languages are implemented.

Counting instructions

While you are looking at the assembly language versions, you should also count how many instructions are executed each time through the loop. By putting this together with your cycle count data, you should be able to calculate the CPI for each version of the loop. By this I mean not the overall CPI of the full program (or even of that portion that is between the starting and stopping of the timer) but rather the CPI of the looping itself, based on just how many instructions are done per additional trip around the loop and how many extra cycles that takes.

Presenting data

You should be careful about choosing the best way to present your data. Sometimes a graph is clearer than a table, and sometimes a table is clearer than a graph, depending on what information you are trying to convey. Sometimes, including both tables and graphs is best, but only if you are trying to communicate different ideas using the two presentations styles.

If one goal is to compare data, it's best of all the data to be compared is in one table or graph so that the reader doesn't have to glance back and forth between two tables or graphs.

Graphs can be useful for showing your data, but you need to be careful about some common pitfalls, particularly if you use a computer to generate the graphs rather than doing it by hand:

In your lab report, you should show what the difference in assembly language code is, explain how you determined the cycles-per-iteration and CPI of each version on each processor, and provide your results. You should also try to draw some bigger picture conclusions from these results. What can you say about the pipelines of the two processors? How coping with control hazards is influenced by the kind of jump?