Finding an Efficient Algorithm for Matrix Multiplication Utilizing Cache Friendliness and Parallelization

(1)

Finding an Efficient Algorithm for Matrix Multiplication Utilizing

Cache Friendliness and Parallelization

Tiril Voss Stabell

Thesis submitted for the degree of

Master in Informatics: Programming and System Architecture

60 credits

Institute for Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

(2)

(3)

Finding an Efficient Algorithm for Matrix Multiplication

Utilizing Cache Friendliness and Parallelization

Tiril Voss Stabell

(4)

Finding an Efficient Algorithm for Matrix Multiplication Utilizing Cache Friendliness and Parallelization

http://www.duo.uio.no/

Printed: Reprosentralen, University of Oslo

(5)

Abstract

This Master Thesis examines if a matrix multiplication program that combines the two efficiency strategies of parallelization and cache friendlinesscan be faster and more efficient than programs that only use one of the strategies.

Programs that utilize each strategy are coded and tested, and used as benchmarks for two separate iterations of combination programs.

The thesis demonstrates that it is possible to create efficient algorithms for matrix multiplication that utilize both cache friendliness and paral- lelization, and that outperform the benchmark programs by a speedup- factor of 7.8 for 1000×1000 matrices on a computer with sixteen cores.

(6)

(7)

List of Figures

1.1 Runtime comparison for 1000×1000 matrices run on Sixteen Cores. All programs tested. X-axis in milliseconds. . . 7 3.1 Moore’s Law. Source: Lecture notes IN3030, UiO, Prof. Eric

Bartley Jul, Spring 2021 [13] . . . 16 3.2 Clock Rates. Source: Lecture notes IN3030, UiO, Prof. Eric

Bartley Jul, Spring 2021 [13] . . . 17 3.3 Volume of data/information created, captured, copied, and

consumed worldwide from 2010 to 2025 in zettabytes (predicted). [12] . . . 18 3.4 US online sales vs total sales. U.S. Department of Commerce.

[8] . . . 19 3.5 Elkjøp Nordic figures for Black Friday preparations 2021.

Source: Trond Andersen, Head of IT Innovation and Archi- tecture, Elkjøp Nordic, 2021 [2] . . . 20 7.1 The first step of the Strassen method of matrix multiplication 34 7.2 The second step of the Strassen method of matrix multiplic-

ation . . . 35 11.1 Project Plan . . . 49 12.1 A Visual Representation of How a Matrix is Stored in Java.

The Numbers Reference Each Element’s Position in the Array. 53 12.2 Transposing of a Matrix . . . 54 16.1 Runtime comparison of the Classical algorithm vs the

Parallel program run on four Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 70 16.2 Runtime comparison of the Classical algorithm vs the

Parallel program run on eight Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 70 16.3 Runtime comparison of the Classical algorithm vs the

Parallelprogram run on sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 71 16.4 Speedup of the Classical Method vs theParallelprogram on

Four Cores. . . 72 16.5 Speedup of the Classical Method vs theParallelprogram on

Eight Cores. . . 72

(12)

16.6 Speedup of the Classical Method vs theParallelprogram on Sixteen Cores. . . 73 16.7 Runtime comparison of the Classical algorithm vs the

CacheFriendlyprogram run on Four Cores. Note that the Y- axis is logarithmic and shows time in milliseconds. . . 74 16.8 Runtime comparison of the Classical algorithm vs the

CacheFriendlyprogram run on Eight Cores. Note that the Y- axis is logarithmic and shows time in milliseconds. . . 74 16.9 Runtime comparison of the Classical algorithm vs the

CacheFriendlyprogram run on Sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 75 16.10Speedup of the Classical Method vs theCacheFriendly pro-

gram on Four Cores. . . 75 16.11Speedup of the Classical Method vs theCacheFriendly pro-

gram on Eight Cores. . . 76 16.12Speedup of the Classical Method vs theCacheFriendly pro-

gram on Sixteen Cores. . . 76 16.13Runtime comparison of theParallelprogram vsIteration1run

on four Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 77 16.14Runtime comparison of the Parallel program vs Iteration1

run on Eight Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 77 16.15Runtime comparison of theParallelprogram vsIteration1run

on Sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 78 16.16Speedup of the Parallel program vs Iteration1 run on Four

Cores. . . 79 16.17Speedup of the Parallel program vs Iteration1 run on Eight

Cores. . . 79 16.18Speedup of theParallelprogram vsIteration1run on Sixteen

Cores. . . 80 16.19Runtime comparison of theParallelprogram vsIteration2run

on Four Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 81 16.20Runtime comparison of the Parallel program vs Iteration2

run on Eight Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 81 16.21Runtime comparison of theParallelprogram vsIteration2run

on Sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 82 16.22Runtime comparison of theParallelprogram vsIteration2run

on Four Cores. . . 82 16.23Runtime comparison of theParallelprogram vsIteration2run

on Eight Cores. . . 83 16.24Runtime comparison of theParallelprogram vsIteration2run

on Sixteen Cores. . . 83

(13)

16.25Runtime comparison of the CacheFriendly program vs Iter- ation1run on Four Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 84 16.26Runtime comparison of the CacheFriendly program vs Iter-

ation1run on eight Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 85 16.27Runtime comparison of the CacheFriendly program vs Iter-

ation1 run on Sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 85 16.28Runtime comparison of the CacheFriendly program vs Iter-

ation1run on Four Cores. . . 86 16.29Runtime comparison of the CacheFriendly program vs Iter-

ation1run on Eight Cores. . . 86 16.30Runtime comparison of the CacheFriendly program vs Iter-

ation1run on Sixteen Cores. . . 87 16.31Runtime comparison of the CacheFriendly program vs Iter-

ation2run on Four Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 87 16.32Runtime comparison of the CacheFriendly program vs Iter-

ation2run on Eight Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 88 16.33Runtime comparison of the CacheFriendly program vs Iter-

ation2 run on Sixteen Cores. Note that the Y-axis is logarithmic and shows time in milliseconds. . . 88 16.34Runtime comparison of the CacheFriendly program vs Iter-

ation2run on Four Cores. . . 89 16.35Runtime comparison of the CacheFriendly program vs Iter-

ation2run on Eight Cores. . . 90 16.36Runtime comparison of the CacheFriendly program vs Iter-

ation2run on Sixteen Cores. . . 90 16.37Runtime comparison ofIteration1 vs Iteration2 run on Four

Cores. . . 91 16.38Runtime comparison ofIteration1 vsIteration2 run on Eight

Cores. . . 91 16.39Runtime comparison ofIteration1vsIteration2run on Sixteen

Cores. . . 92 16.40Speedup ofIteration1vsIteration2run on Four Cores. . . 93 16.41Speedup ofIteration1vsIteration2run on Eight Cores. . . 93 16.42Speedup ofIteration1vsIteration2run on Sixteen Cores. . . . 94 16.43Runtime comparison for 100×100 matrices run on Sixteen

Cores. All programs tested. X-axis in milliseconds. . . 94 16.44Runtime comparison for 200×200 matrices run on Sixteen

Cores. All programs tested. X-axis in milliseconds. . . 97

(14)

(15)

List of Tables

7.1 Pseudo Code of The Classic Algorithm for Matrix Multiplic- ation . . . 34 11.1 High-Level Project Plan . . . 48 12.1 Speedup FromClassictoParallelRun with four Cores . . . . 52 12.2 Pseudo Code of the Transposing Algorithm used in our

Cache-Friendly Program for Matrix Multiplication . . . 55 12.3 Pseudo code of the Algorithm for Matrix Multiplication used

in our Cache-Friendly Program . . . 55 12.4 Speedup FromClassictoCacheFriendlyRun with four Cores . 56 13.1 Speedup FromParalleltoIteration1Run with four Cores . . . 59 14.1 Speedup FromIteration1toIteration2Run with Four Cores . 63 17.1 Speedup FromClassictoParallelRun with Four Cores . . . . 103 17.2 Speedup FromClassictoParallelRun with Eight Cores . . . . 103 17.3 Speedup FromClassictoParallelRun with Sixteen Cores . . 103 17.4 Speedup FromClassictoCacheFriendlyRun with Four Cores 104 17.5 Speedup FromClassictoCacheFriendlyRun with Eight Cores 104 17.6 Speedup FromClassictoCacheFriendlyRun with Sixteen Cores 104 17.7 Speedup FromParalleltoIteration1Run with Four Cores . . 104 17.8 Speedup FromParalleltoIteration1Run with Eight Cores . . 105 17.9 Speedup FromParalleltoIteration1Run with Sixteen Cores . 105 17.10Speedup FromCacheFriendlytoIteration1Run with Four Cores105 17.11Speedup FromCacheFriendlytoIteration1Run with Eight Cores105 17.12Speedup FromCacheFriendly to Iteration1 Run with Sixteen

Cores . . . 105 17.13Speedup FromParalleltoIteration2Run with Four Cores . . 106 17.14Speedup FromParalleltoIteration2Run with Eight Cores . . 106 17.15Speedup FromParalleltoIteration2Run with Sixteen Cores . 106 17.16Speedup FromCacheFriendlytoIteration2Run with Four Cores106 17.17Speedup FromCacheFriendlytoIteration2Run with Eight Cores106 17.18Speedup FromCacheFriendly to Iteration2 Run with Sixteen

Cores . . . 107 17.19Speedup FromIteration1toIteration2Run with Four Cores . 107 17.20Speedup FromIteration1toIteration2Run with Eight Cores . 107 17.21Speedup FromIteration1toIteration2Run with Sixteen Cores 107

(16)

(17)

Preface

This Master Thesis is the result of much hard work and many setbacks.

Not least the Covid-19 pandemic that put an unfortunate, but albeit temporary pause to my efforts and made the journey a lonely one. I would like to thank my supervisor, Professor Eric Bartley Jul for his patience, wit, sense of humour and essential advice during this journey.

For the background information, I must thank Trond Andersen, Head of IT Innovation and Architecture at Elkjøp Nordic for sharing his thoughts and actual figures from Elkjøp’s Prep4Peak project, and for shedding light on why performance of IT systems is critical in their operations.

I must also thank my family for their eternal support. Especially my grandfather, Ulf Stabell, author of an impressive number of scientific articles from the University of Oslo, for his advice on scientific methods, and teaching me the difference between theory and facts, and my dad, Sindre Stabell, for making an effort to read up on the subject, and proofreading my thesis.

Lastly, I must thank my teachers at the University of Oslo that have managed to knock some knowledge and skills into me during the past five years. I am grateful and proud to have attended this institution.

(18)

(19)

Part I

Introduction

(20)

(21)

Chapter 1

Thesis Introduction

A key factor for most computer programs is speed. This has been an important element in the past decades when CPUs were much slower than today. CPUs have evolved and become faster, and thus computer programs have been able to run faster, opening up new possibilities for what they can be used for. One of the key drivers of this development has been clock speed of the CPUs. This feature has not seen the same rate of improvement in the last few years, however.

Better speed in computer programs is even more important today, with a rapidly increasing volume of data and new emerging technologies like self-driving cars on the horizon. These technologies require the processing of huge amounts of data in as short a time as possible and as efficiently as possible. To tackle the challenges of tomorrow, new solutions must be found.

This thesis is about making fast and efficient computer programs utilizing key innovations in modern CPUs. Modern CPUs are typically equipped with multiple CPU cores and fast CPU memory cache. We will take advantage of these features and attempt to create fast computer programs by utilizing the methods of parallelization, to take advantage of multiple CPU cores, and cache friendliness, to take advantage of the fast CPU memory cache. We will also study the relationship between these two methods. We will look at the possibilities in this relationship when building algorithms for matrix multiplication.

1.1 Motivation

Technology has changed the lives of humanity, and these changes have accelerated in the past decades. Inventions like production robots in factories, self check-out systems in stores and mobile phones connected to the internet have contributed to major changes in how we work and live. These changes have been made possible by better and faster computer hardware and software. As new technologies attempt to facilitate more advanced operations and functions, there is a constant

(22)

need to research new and improved computer hardware and software and better ways for these to work together.

We seem to have hit the ceiling for computer CPUs when it comes to clock rate while the number of transistors is still increasing [13].

This phenomenon makes innovations in efficient computer programs more important. Much effort is put into this problem worldwide, and we would like to contribute in a small way to this and increase our understanding of how computer programs can be made efficient and fast by utilizing modern computer hardware in the best way possible.

To examine this in a practical way, we want to study how the CPU and the CPU cache can be utilized in an optimal way. We want to develop and measure a program that solves matrix multiplication, using a combination of parallelization and cache friendliness to make the program run as efficiently as possible.

1.2 Problem Statement

A cache-friendly program strives to be fast and efficient by accessing data elements that are already stored in the cache, and never reload data into the cache if avoidable. A parallelized program strives to be fast and efficient by splitting up the calculations in multiple independent threads. This can be counter-productive to the cache hit rates as the parallelized program threads are not working on the same cached elements at the same time. Parallelizing an otherwise cache-friendly program can severely damage the cache friendliness and thus decrease the efficiency of the program, even as the parallelization itself is increasing the efficiency. On the surface, these two methods can seem to be conflicting strategies that can not be effectively combined, if the objective is a fast and efficient program.

Obviously, different computer programs can be created to solve different types of tasks. Some of these tasks are not relevant for cache friendliness, and many more are not relevant for parallelization. To study the problem of combining the two often conflicting efficiency strategies, we must find a relevant task where both strategies are applicable.

Matrix multiplication is such a task. The classic algorithm for matrix multiplication (described in detail in chapter 7) results in programs that show no consideration for how modern caches perform when storing and accessing information, causing substantial amounts of delay from cache misses. A cache-friendly program can show a significant speedup compared to the classic algorithm. Matrix multiplication is also a task that can be parallelized for increased effectiveness. A program optimized for parallelization can also show a significant speedup compared to the classic algorithm. Since Matrix Multiplication is an area that benefits

(23)

from both strategies independently, this would seem an ideal area to investigate whether combinations of the two strategies can be effective or not.

We investigate the problem of combining the two strategies of cache friendliness and parallelization in matrix multiplication programs in an optimal way.

1.3 Goal

Based on the problem at hand we assume that it is possible to develop programs that combine the two efficiency strategies, and that are faster and more efficient than similar programs that only utilize one of the strategies.

The goal is to show that a matrix multiplication program that combines the two efficiency strategies of parallelization and cache friendliness can outperform programs that only use one of the strategies.

1.4 Approach

To meet the goal, we must study scientific papers on the topic of programming efficient matrix multiplication as well as papers on cache friendliness and effective parallelism. After this we must experiment by creating programs that combine the two efficiency strategies, and measure these against programs utilizing only one of the strategies. The results from these measurements will show if we reached the goal or not and if our method for combining the two strategies is more efficient than using only one of them.

1.5 Work Done

We have studied scientific papers on programming efficient matrix multiplication as well as papers on cache friendliness and effective parallelism.

We have also experimented by writing several programs with different approaches towards parallelization and cache friendliness, and analyzed and reported on the results received from running these programs. We decided to use Java as the programming language, and have installed and used IntelliJ as IDE. The programs and tests were run on three different Windows computers using Intel processors with four, eight and sixteen cores.

(24)

1.6 Evaluation

To figure out whether or not we have reached the goal, we must first define a more exact measure for the differences in speed for the different programs. To achieve this we are using a type of measurement called speedup. Speedup is the runtime of program A divided by the runtime of program B. Speedup will be measured on the same computer and on the same sized matrices that will be multiplied. We will have each program run seven times, and pick out the median run for each program, and use these two runtimes to calculate the speedup. Since we do not know what to expect from our programs, it is difficult to set an exact target that is realistic. We need to set a clear evaluation target, however, and we decide on the following:

Target 1: When compared to a program that only utilizes one of the strategies, the program that utilizes both strategies should have a speedup of at least 5 for matrices of a size of500×500cells when run on a computer with eight cores.

We believe that the benefits of a combination of parallelization and cache friendliness should be greater for a higher number of CPU cores and with a greater number of calculations needed. We therefore state a second evaluation target:

Target 2: The speedup should be higher for matrices with a size of 1000×1000for computers with sixteen CPU cores.

We have evaluated the project by measuring the speedup of the runtime for our programs that combine the strategies of Cache Friendliness and Parallelization, comparing them with solid programs that utilize each of the strategies alone, as well as a program that runs the classical algorithm. The results are then evaluated against the goal and targets we have set.

(25)

1.7 Results

Figure 1.1: Runtime comparison for 1000×1000 matrices run on Sixteen Cores. All programs tested. X-axis in milliseconds.

We have successfully managed to create two separate programs that combine the two strategies of Parallelization and Cache Friendliness and that are faster than programs that only run one of the strategies. We were not quite successful in reaching a speedup of 5 for500×500matrices on eight cores measured against theParallelProgram, but we reached much higher speedups for1000×1000matrices. This was especially true when running on sixteen CPU Core configurations.

A runtime comparison of the five different programs can be seen in figure 1.1.

1.8 Conclusion

Both our iterations of programs successfully utilize the two performance strategies simultaneously. Thus we successfully describe two separate methods for combining Cache Friendliness and Parallelization. The high speedups we measured on large matrix sizes with a large number of CPU Cores show that it is better to utilize and combine both of these strategies to achieve fast and efficient programs.

It is important to note that the results in these experiments are not directly transferable to other problems that involve many calculations.

(26)

However, we believe that utilizing the important features in modern CPUs today of fast CPU Cache and multiple cores is a very promising path to solving other real-world challenges.

1.9 Contributions

In the drive for faster and more efficient programs that utilize the latest generation of CPUs in an optimal way, a blend of parallelism and cache friendliness seem to be highly efficient for multiplying matrices. The results from both iterations are promising and prove that combining the two strategies lead to clear improvements compared to the test programs focusing on only one strategy. The improvements are significant when tested on a sixteen Core CPU and the largest matrix sizes.

The iterations in this thesis can hopefully be used as a building block to investigate further optimizations in this area. In particular it would be interesting to research other areas than matrix multiplication and see if the methods applied here can provide the same type of results.

1.10 Limitations

There are several important limitations to this thesis. First of all, there is no guarantee that either of the iterations are optimal, and improvements to the methods could possibly be found.

It is also important to note that this thesis is by no means the first to forage into the problem of combining the strategies of parallelization and cache friendliness. Some of the ideas for the iterations in this thesis are for instance built on the work of Gregory R. Andrews[3] and other ideas are built on lectures by Eric Bartley Jul[13].

Our programs have not been tested on matrices larger than1000×1000 cells or smaller than 100×100 cells. Testing the programs on much larger matrices might lead to different results, and the tests here can not necessarily be extrapolated to predict results on such data. This should be examined further.

Our own programs also require rather large matrices to gain speedup.

For matrices of size50×50or smaller it is likely that the cache-friendly sequential program, or the program using the classic algorithm might be quicker, as there will not be enough gains by splitting the tasks in parallel processes. For very small matrices, the conclusions in this thesis are not relevant.

(27)

The programs have only been tested on Intel CPU computers with four, eight and sixteen cores. Testing the programs on different hardware setup might lead to different results, and the tests are not directly valid for such cases. This should also be examined further.

While our two iterations show a significant speedup, more efficient solutions to programming matrix multiplication theoretically have been made, using big O notation to determine the efficiency. These solutions are, however, only theoretical, as in order to gain speedup they would require matrices so large modern day computers are unable to handle them. There are at least theoretical possibilities to gain far greater speedup than the iterations in this thesis do on very large matrices.

1.11 Outline

1.11.1 Background

This part contains keyword definitions and background information on matrix multiplication, effective parallelization and cache friendliness, along with examples from real life showcasing a great need for quick and scalable programs. A large part of this chapter is based on our studies of scientific papers and describes the most important findings from this.

1.11.2 The Project

This part describes the actual undertaking of the Master Thesis project.

It describes the planning of the project and the building of the comparison programs, as well as explanations of the actual programs that utilize both of the performance strategies, cache friendliness and parallelization. It became necessary to test different methods for optimizing the code and combining parallel processing with cache friendliness, and each of these iterations are described in separate chapters. The chapters contain the ideas we were working towards while writing the programs, as well as descriptions of the programs, and analyses of the code. It also includes initial test results for the programs and a discussion of these results which is expanded upon further in the next part of the thesis.

1.11.3 Evaluation

This part contains the evaluation and testing performed in the project.

It describes and compares the results of the different iterations, and we discuss the implications of these results. We also suggest further experiments and areas for future research into this important field.

(28)

(29)

Part II

Background

(30)

(31)

Chapter 2

Background Introduction

Before embarking on the project itself, we will discuss some techno- logical challenges and key drivers that we observe in the world today.

These provide a backdrop for why we believe it is vital to find new techniques and new strategies for building faster and more efficient pro- grams. We will look at advances in CPU speed, and how that has im- pacted performance and possibilities of computer programs. We will point to Big Data and discuss how that trend puts additional pressure on processing performance. We will point to Distributed Programming as a key technology enabler for processing large volumes of data. We will look at an emerging technology in self-driving cars that will require very fast processing of complex data streams. Lastly, we will point to a cur- rent technology challenge that many online retailers are struggling with at the moment, serving a large number of customers online at the same time during Black Friday sales and share some input we have received from Elkjøp Nordic on how they are working with these challenges.

After discussing some of the background factors, it is necessary to set clear definitions of the key elements that we will base our experiments on. To achieve this we have gone through relevant scientific papers on the different topics and attempted to find clear, short and concise definitions for parallelization, cache friendliness and matrix multiplication. Chapters 4-7 are the result of these studies.

Obviously, our study of the subject of combining the two strategies of parallelization and cache friendliness is not the first study into this field. We go on to discuss other scientific work that we have built our experiments and our work upon.

Based on the previous data, we will define the problem we want to examine, and state a clear problem definition for this. This problem definition will be the north star for our project. We will then go on to analyse the problem and discuss how we can go about tackling it, and set up criteria for how we can measure success or failure in our attempts at this.

(32)

(33)

Chapter 3

Why is Processing Speed Important?

Since the birth of the first enigma machine to the modern super- computers, the world has seen an astonishing increase in processing power and ability to handle a radical increase in data at exceedingly greater speeds. What was believed impossible a few decades ago is now commonplace.

The arrival of the Internet is perhaps the biggest revolution in this century. The Internet has turned the world upside down for newspapers that are now publishing their articles and making their income online, the music industry where music is now mostly sold through streaming services without a physical CD, the film industry where movies and TV-series are now increasingly sold through streaming services, and the gaming industry where games are now mostly sold through online platforms. Likewise, marketing has had to move its focus to the internet including the new social media as opposed to newspapers and linear TV.

The retail industry has experienced a growing share of sales happening online instead of in brick and mortar stores. For humanity, the internet has caused radical changes in the way we live our lives. What is also a revolution is the enormous amount of data processing that takes place and that enables this change.

The future is uncertain, but there are several trends that indicate that the increase in data and requirements for processing power will continue. This will demand fresh ideas on improving performance and developing faster hardware and faster software, and in particular better combinations of the two.

3.1 Moore’s Law

One important enabler for increased performance of computers has been the increase in CPU speeds over the past decades. There are several

(34)

Figure 3.1: Moore’s Law. Source: Lecture notes IN3030, UiO, Prof. Eric Bartley Jul, Spring 2021 [13]

factors that influence the speed of a CPU. One of the most important ones is how small and thus how many transistors can be placed in an integrated circuit. In 1965, Gordon Moore, former CEO of Intel, observed that transistor density had doubled over the past years, and predicted that the number of transistors in a dense integrated circuit would continue to double every year going forward for at least ten years[18]. In 1975 he modified this prediction to be that the number of transistors would double every two years [19]. This prediction has held up surprisingly well even up to this day (see figure 3.1), and this phenomenon is commonly referred to as Moore’s Law. It is clearly not a certain prediction for future innovations in the field, however.

Another factor that is equally important is the clock-frequency on which the CPU operates. These clock rates have also increased dramatically over several decades, but halted around 2005 (see figure 3.2). The main reason is that higher clock rates produce heat, and we have now reached a threshold where CPUs are overheating if they are pushed to higher frequencies.

A new innovation is that CPUs can be built with a number of cores.

Each of these cores can process data individually and in parallel. This can compensate for lack of speed increase in the clock frequency, but only if we write programs that utilize more cores. The main method to do this is to utilize parallelization in our code.

(35)

Figure 3.2: Clock Rates. Source: Lecture notes IN3030, UiO, Prof. Eric Bartley Jul, Spring 2021 [13]

3.2 Big Data

Our ability to generate and store data has increased dramatically the past years. The proliferation of internet usage through computers and mobile phones, and communication of complicated data in streaming services for music and video are some of the key drivers for this. According to Research Lead Arne Holst [12], the amount of data created in 2020 was 64.2 zettabytes (see figure 3.3), while we stored two percent of this (still a staggering amount of data).

The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 64.2 zettabytes in 2020. Over the next five years up to 2025, global data creation is projected to grow to more than 180 zettabytes.(Holst: 2021) [12]

3.3 Distributed Programming

Another important invention in processing speed has been Distributed Programming. With this strategy, programs utilize different computers networked together. Since the number of computers that are connected to each other has exploded over the last decades, especially when we take smart phones into consideration, it seems sensible to make use of all these together to utilize the computing power that they possess.

A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system. (Tanenbaum and Steen: 2002) [26].

(36)

Figure 3.3: Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025 in zettabytes (predicted). [12]

Distributed computing offers scalability by running on a large number of machines, and having distributed objects that initiate new instances of themselves if they are shut down or stopped. This is one of the main principles behind successful cloud software like Kubernetes [14] and Microsoft Azure Microservices [17].

3.4 Self-driving Cars

One exciting technology field currently being researched, developed and tested by several institutions is self-driving cars. The vision of replacing human control with computer control over vehicles has proved to be challenging. Not least on hardware requirements. Processing speed of large amounts of data is critical. Several iterations of new specialised hardware has been created. Programming efficient, reliable and fast code that utilizes this new hardware in an optimal way will be crucial. It is interesting to note that matrix multiplication is a key technique in handling image processing and signal processing that is vital in developing self-driving technology.

One of the companies working on self-driving cars is Tesla [29].

They have released several iterations of their self-driving software and hardware. Obviously, processing speed is a key challenge also for Tesla.

Throughput, latency, correctness and determinism are the main metrics we optimize our code for. Build the Autopilot software foundations up

(37)

from the lowest levels of the stack, tightly integrating with our custom hardware. Implement super-reliable bootloaders with support for over- the-air updates and bring up customized Linux kernels. Write fast, memory-efficient low-level code to capture high-frequency, high-volume data from our sensors, and to share it with multiple consumer processes—

without impacting central memory access latency or starving critical functional code from CPU cycles. Squeeze and pipeline compute across a variety of hardware processing units, distributed across multiple system- on-chips. - Official Tesla website on the subject of code foundations - [28].

As we can see, Tesla focus on many of the same parameters discussed in this thesis. It is clear that fast, memory-efficient code is what they strive to build to process high-volume data from their sensors.

3.5 Online Sales

Figure 3.4: US online sales vs total sales. U.S. Department of Commerce.

[8]

The past years has seen a steady increase in US online sales as shown in figure 3.4. Increased online traffic results in increased requirements for online store programs and hardware to perform. The Covid-19 pandemic has accelerated these sales and put sudden additional pressure on these systems to handle increased volumes of data traffic and processing.

In the Nordic countries, there has been a similar development as in the USA. In particular, Black Friday retail sales have increased dramatically over the past few years. Electronics retailer Elkjøp Nordic, running

(38)

Figure 3.5: Elkjøp Nordic figures for Black Friday preparations 2021.

Source: Trond Andersen, Head of IT Innovation and Architecture, Elkjøp Nordic, 2021 [2]

the brands, Elkjøp (NO), Elgiganten (SE, DK) and Gigantti (FI) have experienced impressive sales of 1.4 billion NOK on this single day and set a new Nordic record with this, according to Trond Andersen, Head of IT Innovation and Architecture at Elkjøp Nordic [2]. This has put a huge strain on their infrastructure and the programs running their online store systems and retail store systems and other back-end systems.

Processing speed is critical, and every year they need to prepare their IT systems for increased loads during the Peak season, and especially for Black Friday. As can be seen in figure 3.5, there are a lot of factors that put strain on their IT hardware and software platforms. Their newly deployed and modern store sales system, Blueberry, is running in Azure in a Kubernetes cluster and requests data from micro-services built in Azure. Blueberry makes 45 million API calls on an average day. This is expected to increase dramatically on Black Friday, but since it is a new system, the exact increase is unknown.

In all coding performed at Elkjøp Nordic, performance is now very much in focus. All programs and systems are now built and tested with Black Friday performance requirements in mind. Performance has become a key driver in all development.

Elkjøp cache their product images, videos and other assets on Akamai [27], and report on Akamai average cache hit ratio (91.47%). They also focus on their web-platform, Intershop and its cache hit ratio that has been between 93.15% and 93.97%. These cache hit ratios are not to be confused with CPU memory cache hit ratio as discussed elsewhere in

(39)

this Thesis, although there are similarities. Cache hit ratio in Elkjøps reporting refers to requests that can be retrieved directly from fast memory as opposed to disk or slower memory storage. It is interesting to note, however, that it has clear similarities to the performance strategies used in Cache Friendliness programming.

It is interesting to learn that Elkjøp Nordic are focusing on similar problems that we are focusing on in this Master Thesis. They have had to handle at least 500.000 page views per minute on their Norwegian web- shop alone. The techniques used to resolve these requests in a timely manner is amongst others Distributed Programming and Parallelization.

3.6 Summary

We have now looked at various aspects of why processing speed is important. From the very first computers and up until today there has been a tremendous increase in CPU performance where CPUs have increased the number of transistors exponentially. This is often referred to as Moore’s Law. Unfortunately, the increase in CPU clock frequency seems to have reached a threshold due to heat generation and other factors. Other paths to increase performance are being developed, and moving the memory access closer to the CPU in so called CPU Cache is one of these paths. Building CPUs with multiple cores is another such path. To make use of these inventions computer programs must utilize these new possibilities.

We have then explored other trends that we can observe today. The massive increase in data being generated, often referred to as Big Data, drives a need for better performance in computer programs that read or write such data.

Another important innovation is the possibilities provided by Distrib- uted Programming. Programs can utilize the processing power of mul- tiple machines connected across vast geographical distances. This also provide new opportunities for faster processing.

We have then looked at an emerging technology with huge potential in Self-Driving Cars. A key driver in the development of this technology is processing speed and performance.

Likewise, we have interviewed Trond Andersen at Elkjøp Nordic and seen how processing performance is a business critical area in their operations that they continuously work to improve.

(40)

Looking at all these trends and background factors, it is clear that discovering new and better ways of building computer programs with improved runtime and performance is of vital importance.

(41)

Chapter 4

Definitions

Before continuing we need to define some key elements that we will base the later discussions around. These are first the two performance strategies of Parallelization and Cache Friendliness. Then we go on to define what we mean by Matrix Multiplication that is the task the programs should solve.

4.1 Parallelization

A parallel program, as opposed to a sequential program, operates with multiple threads performing tasks in parallel. Parallelization is the act of creating parallel code.

4.2 Cache Friendliness

CPU Cache is a quick but small memory unit that is used to store the parts of the memory that have been used most recently by the processor.

Modern computers usually have three levels of these caches, each slower, but larger than the last. It is the first place the processor looks when trying to find a memory location, getting a cache hit if it is found and a cache miss if not. A cache hit is faster than a cache miss, as the data then needs to be fetched from a slower cache, or the main memory if it is not found in any of the accessible caches.

Cache friendliness is a measure of how effectively a program utilizes the CPU cache, meaning the less (avoidable) cache misses the more cache- friendly the program.

4.3 Matrix Multiplication

Matrix multiplication is the act of multiplying two matrices, resulting in a single matrix. It can be defined in the following manner[5]:

(42)

The product ofAandB, whereAis am*nmatrix andBis ap*qmatrix, is defined if and only ifn=p. The result is then anm*qmatrix,C, where

C_ij =a_i1b_1j+a_i2b_2j+...+a_inb_pj.

4.4 Summary

We have now defined the three key elements of our thesis in Paralleliz- ation, Cache Friendliness and Matrix Multiplication, and can move on to look at these topics in more detail.

(43)

Chapter 5

Parallelization

We will now look at the performance strategy of Parallelization. This strategy has arguably been the major performance strategy utilized in computer programs today, and is a vital tool for creating efficient programs.

A large task can sometimes be made faster by splitting it up into several smaller parts that can be performed in parallel. A task where one person is chopping down 100 trees can be made faster by having 100 people chopping down one tree each, if they are organized well. This is not true of all tasks, as not all tasks can be split up into separate tasks. For example, a woman can produce a baby in nine months, but nine women cannot produce a baby in one month. The same is true for tasks that can be solved by a computer. Some tasks are not possible to divide, or sometimes, parts of a task are not possible to divide, while other parts are possible to divide into smaller simultaneous tasks. In cases where this divide is possible, performing the divided tasks in parallel can lead to significant reduction of the time the task takes.

5.1 History of Parallelization

The principle of Parallelization was discovered to be useful for Super Computers that could perform a large number of calculations simultaneously. Programs there have been able to run massive amounts of calculations by utilizing the parallelization strategy for several decades. For personal computers, however, it took a long time for parallelization to be useful, as they were not able to process many calculations simultaneously with their single core CPUs.

A major driver for faster computer programs since the early days of computers and up until around 2005, has been faster clock frequencies.

The clock frequency determines how fast a CPU performs the operations that it gets as input. The higher the clock rate, the faster the CPU gets new instructions and performs them.

(44)

There are several problems with increasing the clock frequency of the CPUs. One of the problems facing the CPU manufacturers is that higher clock rates also produce heat. The manufacturers of CPUs have had to find new ways of dispersing the heat created to avoid overheating or melting of the CPUs. A threshold was reached around 2005, where the growth in clock frequency has stopped. So far we have not found ways to work around the heat issue and other problems like quantum mechanic limitations.

Because there has not been any speed increase due to faster clock rates, other possibilities have had to be explored. The main new development for faster CPUs in the past few years have been the addition of multiple cores in CPUs for PCs and smart phones.

With the introduction of multiple cores, we have had to reface the challenge of parallelization. These multi-core CPUs have the potential to get significant performance improvements by utilizing each of the cores simultaneously.

To utilize multiple cores in a CPU, parallelization is the key. For a program to run faster based on multiple cores, it must be split up into several threads that each perform their tasks on separate cores in parallel.

Because it is the central method to utilize multiple cores for performance gains, parallelization has become a major tool in modern programming.

5.2 Importance

Parallelization allows for tasks to be run in parallel, and well written parallel programs has the potential to gain huge speedups when compared to equivalent sequential programs.

Obviously, the importance of Parallelization depends a great deal on the type of task at hand. If the task contains many operations that can be performed independently, then Parallelization can revolutionize the runtime and efficiency of the task. Having 10.000 operations that each take one second to calculate in a task, and being able to run these in parallel, could mean that the entire task can be performed in a few seconds, instead of 10.000 seconds. If, on the other hand, we are faced with a task that contain operations that must be run sequentially, Parallelization is not important. There will not be any gains by attempting to run the tasks in parallel.

5.3 Guidelines

A parallel program should utilize parallelization as much as possible, and divide each parallelized part evenly between the threads. Places

(45)

that need synchronization should be compressed to not decrease speed unnecessarily. It is however crucial that synchronization blocks are not too small as this will lead to unpredictability and incorrect outcomes when the program runs.

5.4 Drawbacks

It is important to note that writing parallel programs is more difficult than writing a single thread program. The reason is that parallel programming introduces a set of new possible errors and issues.

1. The first area of possible issues is the way the task is split up into separate processes. The process of calculating and setting up the different tasks can sometimes take time and defeat the purpose of running the parallel threads. It is important to strive for threads that will take approximately the same amount of time to finish, in order to gain maximum parallel efficiency. The program will not be quicker than the thread that takes the longest time to complete.

An additional challenge in this area is that it takes time to create the separate threads. This time can sometimes defeat the purpose of creating the tasks if the time spent on this is greater than the gain from running in parallel.

2. The second area of issues and errors is the need for the separate threads to synchronize with one another. Typically, some form of synchronization is required. Parallel programs tend to run faster the less synchronization the program has as the individual threads then spend less time waiting for each other. Synchronization needs to happen when a thread writes to a part of the memory that other threads also access, either through reading or writing. It is therefore beneficial to have threads work on different parts of the memory as much as possible. This becomes a challenge when also considering cache friendliness, as it can lead to more cache misses when the threads keep overwriting the shared cache lines with data only they use.

3. The third area is related to dependent calculations where some threads are dependant on data from other calculations to proceed.

This causes wait situations that must be handled correctly and efficiently.

4. The fourth and final area is related to collection of all the results from the separate threads and combining the results into a single meaningful result. This can sometimes be challenging, especially if one or more of the threads fail to complete their process. More complex error handling is typically required in parallel programs.

(46)

5.5 Measuring Speedup

To check how fast programs can perform a certain task, we need a standardized way to measure this. We are using a concept of speedup, where we measure the relative speed of two or more programs performing the same task. In this case multiplying two matrices of a set size. To avoid freak results and special cases, we have let all the programs run seven times on the same matrices, and then taken the median runtime out of these seven. This runtime in milliseconds is then used to measure the relative speedup against the other program or programs.

5.6 Summary

We have looked at the history and importance of Parallelization as a tool for improving performance. Parallelization has been and continues to be one of the most important methods for achieving fast processing speeds.

The development of multi-core CPUs have been a key driver for this in more recent years.

Parallelization does come at a cost of more complex programming. We have identified four areas of possible issues that do not exist in Classical programs that do not use Parallelization.

Finally, we have described how to measure speedup so that we can say something meaningful about the effect of better performance. Speedup is defined as the relative speed of two or more programs performing the same task.

(47)

Chapter 6

Caching and Cache Friendliness

We will now take a closer look at the performance strategy of Cache Friendliness. Utilizing the fast CPU memory cache is another important tool at our disposal when creating fast computer programs.

6.1 History of the Cache

From the early days of computers in the 1950s and 1960s up until today, different components of a computer has developed at different rates.

Since the 1980s, the rate of improvement for microprocessor speeds have exceeded the rate of improvement for main memory speed (Mahapatra:

1999) [16]. This led to an increasing disparity and a bottleneck for computer performance. A solution to this problem was found by utilizing a small, but fast memory cache very close to the CPU, thereby avoiding the requirement to consult the main memory for all CPU tasks.

This obviously came at a cost of an extra memory unit that needed to synchronize with the main memory, but this cost was found to be very small compared to the gains it gave.

The solution of utilizing CPU Cache has been extremely successful and has gained widespread use. It has been further developed to include several layers of CPU cache that is kept at different distances from the CPU Cores. The closest CPU Cache, called Level 1 cache or just L1 cache, is fastest and usually unique to each core of the CPU, while the next layers (L2 and L3 cache) are slightly slower and can share the cache between different cores and even CPUs. Currently, all main CPUs for PCs[20] and even for mobile phones[21] have CPU cache installed.

6.2 Importance

It has proven to be extremely difficult and expensive to build main memory in a computer that responds so quickly to read and write

(48)

operations that the computer will not be strictly limited by the main memory performance (Smith: 1987) [23]. The main memory is built to be large and dense and physically distant from the other parts of the CPU.

Main memory performance is inherently limited. Cache memory, on the other hand, is built to be very fast, small and physically close to the other parts of the CPU.

No modern CPU today is therefore built without a CPU Cache. It is vital for the performance of the CPU in most tasks. Only very exceptional tasks are unaffected by the presence of CPU cache when it comes to performance.

6.3 Guidelines

A cache-friendly program should work on as few parts part of the memory at a time as possible to utilize what is stored in the cache before changing it. It should also, if possible, avoid reloading parts of the memory into the cache, by doing all tasks related to the same memory allocation before moving on to different tasks. This will prevent some avoidable cache misses.

6.4 Measuring Cache Friendliness

A common way to determine cache misses for parallel programs is to use the big O notation:

O(Q+S∗(M/B))

where Q is the caching cost for the sequential version of the program, S is the number of times a core or processor takes a task from another core/processor, calledsteals,M is the size of the cache, andBis the size of the cache lines.

In their research paper "Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers"[7], Cole and Ramachandran introduce other bounds for cache miss cost for different algorithms, including matrix transposing with O(Q +S ∗B), and two different versions of matrix multiplication, n³ matrix multiplication algorithm withQ+ (n²/B)∗S¹³ +S∗Band Strassen’s algorithm withQ+ (n²/B)∗ S¹⁻²^λ +S∗B.

6.5 Summary

We have looked at the history and importance of the CPU Cache in this chapter. To avoid a situation where main memory of a computer was the main bottleneck for performance, new and fast, but small memory was installed close to the CPU. This has proved to be a very successful

(49)

development for increasing performance, and is now the dominant way of building mainstream CPUs in computers and mobile phones.

We then went on to describe general guidelines for utilizing the CPU Cache and also how to measure Cache Friendliness. Here the big O notation was utilized.

(50)

(51)

Chapter 7

Matrix Multiplication

We will now look at Matrix Multiplication, which is the task we have decided to solve in our programs. We will describe what Matrix Multiplication is, and look at theoretical research that has been done in this field.

7.1 Importance

Matrix multiplication was first described by the French mathematician Jacques Philippe Marie Binet in 1812, to represent the composition of linear maps that are represented by matrices[30]. It has since become an important tool in mathematics. In particular, it is an important tool in linear algebra, where three-dimensional objects can be described in matrices[5]. It has applications in many areas of mathematics, as well as in applied mathematics, statistics, physics, economics, and engineering[31]. Matrix multiplication is a central operation when we utilize computers to calculate linear algebra. In the later years, it has proved useful in computing fields like Image Processing, Signal Processing and even in Artificial Neural Networks.

7.2 Previous Algorithms for Matrix Multiplication

The classical way of programming matrix multiplication is to implement the calculations without regarding run-time of the program. This means that the programs simply perform the calculations directly for each element in the matrices. This process of calculation is also called the n³Algorithm.

Several improvements to the classical algorithm have been found.

The Strassen’s matrix multiplication algorithm is one of the most referenced. Strassen proved that it was possible to reduce the number of multiplications from eight to seven for 2x2 matrices[25].

(52)

The discovery by Strassen led to much more research into optimal ways of performing matrix multiplication. One of the most important findings was the Coppersmith–Winograd algorithm[9].

We will proceed to describe these three methods in more detail.

7.2.1 Then³Algorithm

Parameters: matrixA, matrixB Return: matrixC

For each rowiinA For each columnjinB

Sum=0.0

For each elementkinj Sum= Sum+ (A_ik×B_kj) C_ij =Sum

Table 7.1: Pseudo Code of The Classic Algorithm for Matrix Multiplication The most straight forward matrix multiplication algorithm (see table 7.1) multiplies each element combination in the matrices directly. This will takeO(n³)time, and the method is therefore often referred to as then³ algorithm.

7.2.2 Strassens Matrix Multiplication Algorithm

In 1969, Volker Strassen published an algorithm taking O(n^2.8074) time[25], proving that then³algorithm was not optimal. Although not a drastic improvement big O notation wise, Strassen’s new algorithm was noticeably quicker for large matrices. The method used for achieving the improvement, was to define new matrices by adding and subtracting the different elements in the original matrices.

Figure 7.1: The first step of the Strassen method of matrix multiplication The Strassen method of matrix multiplication first breaks down the original matrices of A and B into four sub-matrices, corresponding to

(53)

the north-west, north-east, south-west and south-east quadrants. These sub-matrices are used in seven calculations denominated as M1 to M7 in figure 7.1. The purpose of this is to reduce the number of multiplications required to solve the matrix multiplication. As a second step, the Strassen method performs arithmetic operations on the results of the seven multiplications.

Figure 7.2: The second step of the Strassen method of matrix multiplication The new M1 to M7 results are added and subtracted from one another as shown in figure 7.2, giving the result in matrix C exactly as the classical n³algorithm does.

As can be seen in figures 7.1 and 7.2, there are only seven multiplica- tions required, instead of the classical eight that are necessary when per- forming matrix multiplication with then³Algorithm. This was a break- through in our understanding of fast matrix multiplication, as it meant that in theory it was possible to perform matrix multiplication faster than the classical method.

7.2.3 The Coppersmith-Winograd Algorithm

In 1987 came the Coppersmith-Winograd algorithm[9] which again improved the time, now toO(n^2.37369). This algorithm, along with newer algorithms based on it, is not used in practice as speedup would only happen for matrices too large to handle by modern computers. It is still an interesting algorithm theoretically, and has been used to further improve the theoretical time of matrix multiplication algorithms by A.

M. Davie and A. J. Stothers[10], François Le Gall[15], and Josh Alman and Virginia Vassilevska Williams[1] among others. At the time of writing this thesis the current best theoretical algorithm runs inO(n^2.3728596)time and was discovered by Alman and Vassilevska in December 2020 [1].

Cohn, Kleinberg, Szegedy and Umans has conjected that the best theoretical algorithm should giveO(n²)[6]. Such an algorithm has not yet been found, however.

7.3 Summary

In this chapter we have described Matrix Multiplication. We have gone through the history of Matrix Multiplication from the French

(54)

mathematician Jacques Philippe Marie Binet first described it, up until current uses in Image Processing and other areas.

We have discussed how Matrix Multiplication has been solved by computer programs and the developments we have seen here. We have looked at the classical algorithm called the n³ Algoritm and how that can be implemented, and then moved on to Volker Strassens discovery in 1969 where a more efficient theoretical model was found. We then looked at the Coppersmith-Winograd algorithm from 1987, which again improved upon the Strassen method. Currently, Alman and Vassilevska have published the best theoretical algoritm that runs in O(n^2.3728596) time.

Finding an Efficient Algorithm for Matrix Multiplication Utilizing Cache Friendliness and Parallelization

Finding an Efficient Algorithm for Matrix Multiplication Utilizing

Cache Friendliness and Parallelization

Tiril Voss Stabell

Thesis submitted for the degree of

Master in Informatics: Programming and System Architecture

60 credits

Institute for Informatics

Faculty of mathematics and natural sciences

UNIVERSITY OF OSLO

Finding an Efficient Algorithm for Matrix Multiplication

Utilizing Cache Friendliness and Parallelization

Tiril Voss Stabell

Abstract

Contents

List of Figures

List of Tables

Preface

Part I

Introduction

Chapter 1

Thesis Introduction

1.1 Motivation

1.2 Problem Statement

1.3 Goal

1.4 Approach

1.5 Work Done

1.6 Evaluation

1.7 Results

1.8 Conclusion

1.9 Contributions

1.10 Limitations

1.11 Outline

Part II

Background

Chapter 2

Background Introduction

Chapter 3

Why is Processing Speed Important?

3.1 Moore’s Law

3.2 Big Data

3.3 Distributed Programming

3.4 Self-driving Cars

3.5 Online Sales

3.6 Summary

Chapter 4

Definitions

4.1 Parallelization

4.2 Cache Friendliness

4.3 Matrix Multiplication

4.4 Summary

Chapter 5

Parallelization

5.1 History of Parallelization

5.2 Importance

5.3 Guidelines

5.4 Drawbacks

5.5 Measuring Speedup

5.6 Summary

Chapter 6

Caching and Cache Friendliness

6.1 History of the Cache

6.2 Importance

6.3 Guidelines

6.4 Measuring Cache Friendliness

6.5 Summary

Chapter 7

Matrix Multiplication

7.1 Importance

7.2 Previous Algorithms for Matrix Multiplication

7.3 Summary