Taxonomy for the Automated Tuning of Matrix Algebra Software
September 10, 2010
4:00 p.m.
Dr. Elizabeth R. Jessup
Abstract
Linear algebra constitutes the most time-consuming part of simulations in many fields of science and engineering. Reducing the costs of those calculations can have a significant impact on overall routine performance, but such optimization is difficult. At each step of the process, the code developer is confronted with many possibilities. Choosing between them generally requires expertise in numerical computation, mathematical software, compilers, and computer architecture, yet few scientists have such broad expertise. This talk will cover two interrelated projects focused on easing the production of high-performance matrix algebra software.
We will first present our work in progress on a taxonomy of software that can be used to build highly-optimized matrix algebra software. The taxonomy will provide an organized anthology of software components and programming tools needed for that task. It will serve as a guide to practitioners seeking to learn what is available for their programming tasks, how to use it, and how the various parts fit together. It will build upon and improve existing collections of numerical software, adding tools for the tuning of matrix algebra computations. Our objective is to build a taxonomy that will provide all of the software needed to take a matrix algebra problem from algorithm description to a high-performance implementation: initially, the tool will accept a MATLAB prototype and produce optimized Fortran or C.
We will then introduce one of the tuning tools to be included in the taxonomy, the Build to Order (BTO) compiler which automates loop fusion in matrix algebra kernels. This optimization serves to reduce the amount of data moved between memory and the processor. In particular, we will describe BTO’s analytic memory model which accelerates the compiler by substantially reducing the number of loop fusion options processed by it. The initial draft of the model took into account traffic through the caches and TLB. We will discuss an example that motivated us to improve the accuracy of the model by adding register allocation.