MCS-284 Lab 4: Studying Cache Auto-Prefetch Effectiveness

In this lab, you will work together with a partner to gain experience in how hypotheses about memory system performance can be tested using trace-driven cache simulation. As usual, I expect most of you will work in pairs. Choose a partner who you have not worked with yet. Each group should submit a single jointly authored lab report.

Important: It is important that you complete the Pre-Lab portion of this assignment before you start the In Lab portion.


Rather than waiting for the first access to a block to start fetching that block from main memory to the cache, it is possible to prefetch a block likely to be accessed soon. Our textbook mentions one form of prefetching, compiler-directed prefetching, on page 555. In this lab we will study a different form of prefetching, hardware prefetching or auto-prefetching, in which the cache hardware automatically chooses when to prefetch a block rather than relying on ``hints'' from the compiler. Generally this involves a very simple rule such as prefetching the next block after the one currently being accessed.

Auto-prefetch can decrease the miss rate if the prefetched blocks are used, but may also increase the amount of traffic between the main memory and the cache if the blocks go unused. In fact, if the prefetched blocks go unused, and bump other blocks out of the cache that would have been used, the miss rate can actually go up rather than down. A prefetch strategy is highly effective if it causes a substantial decrease in the miss rate with only a minor increase in memory traffic. For example, the miss rate might go down by 50% while the memory traffic goes up by 1%. In the other scenarios, where the miss rate doesn't go down much and/or the traffic goes up a lot, we can term the prefetch strategy ineffective (if the miss rate doesn't go down by much), costly (if the miss rate is substantially reduced, but at a high price in memory traffic), or counterproductive (if miss rate is marginally improved at best, and traffic goes way up).

The question you are to answer is this: what is the relationship between cache block size and total cache size and the cost and effectiveness of auto-prefetch? That is, if you consider a two-dimensional space of possible caches, where one dimension is block size and the other is total cache size, and you were to map out for each point in that two-dimensional space the percent increase in memory traffic that auto-prefetching would cause, what would it look like? Now suppose you mapped out for each point in the two-dimensional space the percent reduction (or increase!) in misses that auto-prefetching would cause, what does that look like? Putting these together, in what regions of the space is auto-prefetching cheap and highly effective? In which regions is it ineffective to the extent of being pointless? In which is it out and out harmful?

Before starting the lab, you should decide and write down the following two things:

  1. What is your hypothesis regarding the answer to the above questions? What is the rationale behind your hypothesis?
  2. What is your experimental plan for testing your hypothesis? In particular, what values of block size and total cache size are you going to use? (Note: all the block sizes must be multiples of four bytes, since all accesses are of four-byte words. Additionally, the block sizes and cache sizes should be powers of two.) What values are you going to use for the other cache parameters that you hold constant? Which of the various auto-prefetch variants supported by the dineroIII simulator are you going to use? Which address trace(s) are you going to use? In order to answer these questions, you will need to read ahead into the In Lab section below.

In Lab

You will use the dineroIII trace-driven cache simulator. A trace-driven cache simulator reads in a trace, that is, a record of the memory references from a particular execution (with each address tagged with whether it was a read, a write, or an instruction fetch). The simulator is parameterizable so that the same trace can be tried with a variety of different parameters for block size, associativity, cache size, prefetch strategy, etc. At the end of the simulation, a variety of statistics are produced, including the miss rate for the cache and the amount of memory traffic. DineroIII is a good quality, flexible simulator often used in such work.

We have available on-line traces from three benchmarks (tex, gcc, and spice). Each of the traces contains about a million memory references. The traces reside in the directory ~max/MC48, and have filenames tex.din.gz, cc1.din.gz, and spice.din.gz respectively. The .gz suffix indicates that they are compressed using the gzip program (to save disk space); to decompress them at the same time you run dineroIII, you can use a command that decompresses the file and "pipes" the decompressed version directly into dineroIII. On one of our Linux PCs, the command would be

zcat ~max/MC48/name.din.gz | ~max/MC48/dineroIII options
where name is the name of the trace you want to use and options is a list of dineroIII parameters as specified in the accompanying manual page for dineroIII. Note that at a bare minimum you need to specify the -b option and either the -u option or the -i and -d options. The -f option will also be particularly relevant to your investigation. Remember that the block size for the -b option is measured in bytes, so needs to be a multiple of four.

You should observe the improvement prefetching makes in the number of misses, specifically the so-called ``demand'' misses, i.e., the misses on those blocks actually requested by the CPU, rather than by the prefetching. You should also observe the increase in memory traffic caused by unnecessary prefetches. Further, it would be interesting to see whether instruction and data references are equally suited to auto-prefetching. (DineroIII prints out separate statistics in each category as well as totals.)

Collecting data

You may find it convenient to automate the data collection using a feature of the unix shell known as ``foreach.'' In particular, you can set up a shell script that uses nested foreach loops to run systematically through a collection of cache sizes, and for each of them through a collection of block sizes, and for each of them through a collection of traces, and for each of them through two prefetching strategies (don't prefetch, or some particular form of auto-prefetch). The exact nature of this will depend on what you are trying to do. The below is just an example, you should work from your own experimental design choices (from the pre-lab) regarding what ranges of values to iterate over. Keep in mind that you will need to analyze whatever data you generate, and it may be easier to generate lots of data than to analyze it. This example shell script, for instance, will do 120 experimental runs, creating an output file for each one. Each of those files will contain at least 4 interesting numbers. That would leave you with 480 numbers to make sense of.
for size in 16k 32k 64k; do 
  for block in 4 8; do 
    for bench in cc1 spice tex; do
      for prefetch in d a; do
         echo -n $bench $size $block $prefetch
         zcat ~max/MC48/"$bench".din.gz | ~max/MC48/dineroIII -b$block -d$size \
           -i$size -f$prefetch > "$bench"."$size"."$block"."$prefetch"
         echo ""
Note that although you could literally type something like the above into a shell window it is probably more sensible to store it in a file and then run it from the file. (Be sure to end the file with a newline after the last "end", i.e., press the enter key after the last "end".) If you have your script in a file called script, you could run it using the command
bash script

Processing data

A script like the following could be used to process the data files you collected. This short script calculates the difference in the number of Demand Misses between each .a file and its corresponding .d file, divided by the original total in the .d file.
declare -i a  # Optional:  declared a and d to be integers
declare -i d  # (This just effects whether whitespace gets removed.)
for file in *.a; do
  a=`head -26 $base.a | tail -1 | cut -f 3`
  d=`head -22 $base.d | tail -1 | cut -f 3`
  result=`echo "scale = 2 ; ($d - $a) / $d" | bc`
  echo "$base: ($d - $a) / $d = $result"

Here is the same script with comments so you can get a feel for what it's doing.
declare -i a  # Optional:  declared a and d to be integers
declare -i d
for file in *.a; do
  base=${file%.*}            # remove the extension (that is the .a) from the file name
  # head is a program which gets the first n lines of a file (or standard input)
  # tail gets the last n lines of a file (or standard input)
  # cut extracts a range of characters (-c) or tab-separated fields (-f)
  a=`head -26 $base.a | tail -1 | cut -f 3`
  d=`head -22 $base.d | tail -1 | cut -f 3`
  # bc is a simple calculator.  The "scale" is the number of digits after the decimal point to print.
  result=`echo "scale = 2 ; ($d - $a) / $d" | bc`
  echo "$base: ($d - $a) / $d = $result"

Note that you can only do arithmetic on integer fields using the @ symbol. If you want to do more complicated arithmetic, try the reverse polish (i.e., stack) calculator dc. For example, in
  set ratio = `dc -e "2 k $d $a / p"`
the -e means execute the argument, 2 k sets the precision to 2 decimal places, $d $a / divides $d by $a, and p prints the result.

Summary of tasks


For your lab report, you can assume the reader already knows what the assignment is. In order to assess your report I'll be looking for: