Homework 2
CS 345
Computer Organization
10 Points
Due Friday, Jan. 30, 2015 at the beginning of class
You must submit a hard copy of this assignment or turn it in to the dropbox and show all work!
1. Assume you have a program that runs in 84 s. Assume that two-thirds of the program can be parallelized. What are the run time and speedup if the parallelism is implemented ideally using 4 processors? 8 processors? 16 processors? What is the theoretical shortest run time for the program? Show all work.
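The model usually applied to this kind of question is Amdahl's law: the serial part of the run time stays fixed while the parallelizable part is divided by the processor count. The sketch below illustrates that relationship under the ideal-scaling assumption (no communication or scheduling overhead). The function names and the example values in the note that follows are hypothetical and are not the values from this problem.

```c
/* Ideal run time on n processors: the serial part is unchanged and the
   parallelizable part is divided evenly among the processors. */
double ideal_run_time(double total_seconds, double parallel_fraction, int n)
{
    double serial = total_seconds * (1.0 - parallel_fraction);
    double parallel = total_seconds * parallel_fraction;
    return serial + parallel / n;
}

/* Speedup: single-processor run time divided by n-processor run time. */
double ideal_speedup(double total_seconds, double parallel_fraction, int n)
{
    return total_seconds / ideal_run_time(total_seconds, parallel_fraction, n);
}
```

For a hypothetical 100 s program with half of its work parallelizable, ideal_run_time(100.0, 0.5, 4) gives 62.5 s and ideal_speedup(100.0, 0.5, 4) gives 1.6. As n grows, the run time approaches the serial part alone.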
2. Find two free (or at least free to try) benchmarking tools. Hint: one such tool is on the links page of the class website. For each tool, answer the following questions.
   What is the benchmark tool called and where can it be found? Provide a URL.
   Is the tool synthetic?
   What kind(s) of measurements are performed by the tool?
   What are the results of running the tool on two different computers?
   Summarize your experiments, explaining whether or not you think the tools are useful and in which situations.
3. a) Write your own simple benchmarking tool to test a particular floating-point operation (e.g. addition, subtraction, multiplication, division, sine, cosine, etc.). Your tool should run the same set of computations for a significant number of iterations (e.g. 10 million, 100 million, or 1 billion) and time the result using a system utility such as System.nanoTime(). Run your program several times on your laptop computer. Be sure to average the results. How did your results compare to those from the previous question? Explain.
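The general shape of such a tool can be sketched in C as follows. This is only an illustration, not a required design. It assumes a single-threaded loop and uses the standard clock() function, which reports processor time, where the assignment mentions System.nanoTime(). The function names are hypothetical, and addition is an arbitrary choice of operation.

```c
#include <time.h>

/* Repeats one floating-point addition `iterations` times and returns the
   elapsed processor time in seconds. The final sum is stored in *result.
   `volatile` keeps the compiler from optimizing the loop away. */
double time_additions(long iterations, double *result)
{
    volatile double acc = 0.0;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        acc = acc + 1.5;
    clock_t end = clock();
    *result = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Averages the elapsed time over `runs` repetitions of the same loop. */
double average_time(long iterations, int runs)
{
    double total = 0.0;
    double ignored;
    for (int r = 0; r < runs; r++)
        total += time_additions(iterations, &ignored);
    return total / runs;
}
```

Calling average_time(100000000L, 5), for example, would time 100 million additions five times and return the mean. Dividing that mean by the iteration count gives an approximate cost per operation, which includes the loop overhead.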
b) Log on to the LittleFe cluster computer at littlefe2.nwmissouri.edu using PuTTY. Login information will be provided in class. Instructions for using the system are provided at the following links: PuTTY.html, LittleFeTutorial.html. You do not need to submit the tutorial files as part of this homework.
After learning how to log in and create a .c file, create a file called OMPTest.c with the following code:
#include <stdio.h>
#include <time.h>
#include <omp.h>
#define NUM_ITERATIONS 100000000L
#define NUM_THREADS 8
int main(int argc, char ** argv)
{
    long i;
    double x = 3.8;
    double start = clock()/(double)CLOCKS_PER_SEC;
    #pragma omp parallel for num_threads(NUM_THREADS)
    for(i = 0; i < NUM_ITERATIONS/NUM_THREADS; i++)
        x = 8.2 * 7.2 / 1000.24;
    printf("run time: %lf s\n",clock()/((double)CLOCKS_PER_SEC*NUM_THREADS));
    return 0;
}
Compile the code on LittleFe using the following command:
gcc OMPTest.c -fopenmp -o OMPTest.exe
Run OMPTest.exe as follows:
./OMPTest.exe
Change NUM_THREADS to 6, 4, and 2, recompiling and re-running the program each time. Record the run time of each run and explain the results.
4. The newest supercomputers are able to offload computations to a graphics card such as the NVIDIA Tesla K80 or a co-processor such as the Xeon Phi. Processor counts differ greatly between these two architectures: 4992 CUDA cores and 24 GB of RAM for the NVIDIA card vs. 60 processors for the Xeon Phi. They report approximately 2.91 and 1.2 TFLOPS of double-precision performance, respectively. Why is this the case? Are these two architectures best suited for the same types of tasks? What kinds of tasks are these? Explain. Be sure to perform an internet search on both products, examine their specifications, and determine the difference between a CUDA core (processor) and a Xeon Phi core (processor).
5. Given a program that makes use of 300,000 instructions, half of which are ADD, 29% are JUMP, 11% are DIV, and 10% are MOVE, compute the user run time of the program on a 2.2 GHz machine by computing the total cycles first and then the run time. Assume that ADD instructions cost 4 cycles, JUMP instructions cost 1 cycle, DIV instructions cost 8 cycles, and MOVE instructions cost 4 cycles.
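The two-step calculation asked for here (total cycles, then run time) can be sketched in C as below. The struct and function names are hypothetical, and the example mix in the note that follows is deliberately different from the one in this problem.

```c
/* One class of instruction: how many are executed and the cost of each. */
struct instr_class {
    long count;
    int cycles_each;
};

/* Total cycles = sum over all classes of (count * cycles per instruction). */
long total_cycles(const struct instr_class *mix, int n)
{
    long cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += mix[i].count * mix[i].cycles_each;
    return cycles;
}

/* Run time in seconds = total cycles / clock rate in Hz. */
double run_time_seconds(long cycles, double clock_hz)
{
    return (double)cycles / clock_hz;
}
```

For a hypothetical 1,000-instruction program with 600 instructions costing 2 cycles each and 400 costing 5 cycles each, total_cycles returns 3,200. On a 1 GHz machine, run_time_seconds(3200, 1e9) is 3.2 microseconds.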