# Technical Reports

## 2014

### A Hybrid Method for the Verification and Synthesis of Parameterized Self-Stabilizing Protocols

Ali Ebnenasir CS-TR-14-02

This paper presents a hybrid method for verification and synthesis of parameterized self-stabilizing protocols where algorithmic design and mechanical verification techniques/tools are used hand-in-hand. The core idea behind the proposed method includes the automated synthesis of self-stabilizing protocols in a limited scope (i.e., fixed number of processes) and the use of theorem proving methods for the generalization of the solutions produced by the synthesizer. Specifically, we use the Prototype Verification System (PVS) to mechanically verify an algorithm for the synthesis of weakly self-stabilizing protocols. Then, we reuse the proof of correctness of the synthesis algorithm to establish the correctness of the generalized versions of synthesized protocols for an arbitrary number of processes. We demonstrate the proposed approach in the context of an agreement and a coloring protocol of the ring topology.

### Synthesizing Self-Stabilization Through Superposition and Backtracking

Alex P. Klinkhamer and Ali Ebnenasir CS-TR-14-01

This paper presents a sound and complete method for algorithmic design of self-stabilizing network protocols. While the design of self-stabilization is known to be a hard problem, several sound (but incomplete) heuristics exist for algorithmic design of self-stabilization. The essence of the proposed approach is based on variable superposition and backtracking search. We have validated the proposed method by creating both a sequential and a parallel implementation in the context of a software tool, called Protocon. Moreover, we have used Protocon to automatically design self-stabilizing protocols for problems which all existing heuristics fail to solve.

## 2013

### Verifying Livelock Freedom on Parameterized Rings

Alex P. Klinkhamer and Ali Ebnenasir CS-TR-13-01

This paper investigates the complexity of verifying livelock freedom and weak/strong self-stabilization in parameterized unidirectional ring and bidirectional chain topologies. Specifically, we illustrate that verifying livelock freedom of parameterized rings consisting of self-disabling and deterministic processes is undecidable (specifically, Π⁰₁-complete). This result implies that verifying weak/strong self-stabilization for parameterized rings of self-disabling processes is also undecidable. The results of this paper strengthen previous work on the undecidability of verifying temporal logic properties in symmetric rings. The proof of undecidability is based on a reduction from the periodic domino problem.

## 2012

### Exploring Software Cache Design for Remote References in UPC Applications

Wei Wang, Steve Carr, Steven R. Seidel, and Zhenlin Wang CS-TR-12-02

Good performance of partitioned global address space languages running on commodity clusters requires efficient handling of remote memory accesses. Compilers and runtime systems can reduce the costs of such accesses by modeling remote memory access patterns. This paper uses an extended single-threaded reuse distance model that predicts remote memory access behavior for Unified Parallel C (UPC) applications to guide software cache design.

The remote memory access model presented here handles changes in per-thread problem size and changes in access patterns due to problem size increases. We have used this model to design application-specific software caches for remote references, considering cache-line length, total cache size, and other design features, including no cache at all, as design options. For twelve UPC application kernels encompassing five out of seven identified parallel application types, no single cache design works best for all applications. However, our analysis can predict an effective design based upon the reuse characteristics of the application. Finally, we show that this approach can be expected to scale well to thousands of threads.
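The model above is built on reuse distance: the number of distinct addresses touched between two accesses to the same address. A minimal single-threaded sketch of the measurement (illustrative only; the paper's extension to remote UPC references is not reproduced here):

```python
def reuse_distances(trace):
    """Reuse distance of each access in a reference trace.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address (infinite on
    first use). Real tools use an O(log n) tree; this linear scan is
    kept simple for clarity.
    """
    stack = []   # LRU stack: most recently used address last
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            dists.append(len(stack) - pos - 1)
            stack.pop(pos)
        else:
            dists.append(float("inf"))
        stack.append(addr)
    return dists

# For the trace a, b, a, c, b the distances are [inf, inf, 1, inf, 2].
```

An access with a small reuse distance hits in a small cache; histograms of such distances, fitted as a function of problem size, are the raw material for the kind of prediction described above.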

### iMap - A Stable Layout for Navigating Large Image Collections with Embedded Search

Chaoli Wang, John P. Reese, Huan Zhang, Jun Tao, Robert J. Nemiroff CS-TR-12-01

Effective techniques for organizing and visualizing large image collections are in growing demand as visual search becomes increasingly popular. We present our solution for image search and clustering based on the evaluation of image similarity using both visual and textual information. To lay out images, we introduce iMap, a treemap representation for visualizing and navigating image search and clustering results. iMap not only makes effective use of the available display area to arrange images but also maintains stable updates when images are inserted or removed during the query. We also develop an embedded visualization that integrates image tags for in-place search refinement. Using an online astronomy archive with thousands of images, we show the effectiveness of our approach by demonstrating experimental results and conducting a comparative user study.

## 2011

### Local Reasoning for Global Convergence of Parameterized Rings

Aly Farahat and Ali Ebnenasir CS-TR-11-04

This report demonstrates a method that can generate Self-Stabilizing (SS) *parameterized* protocols that are generalizable; i.e., correct for an arbitrary number of finite-state processes. Specifically, we present *necessary and sufficient* conditions for deadlock-freedom specified in the local state space of the representative process of parameterized rings. Moreover, we introduce sufficient conditions that guarantee livelock-freedom in arbitrary-sized unidirectional rings. More importantly, we sketch a methodology for automated design of global convergence in the local state space of the representative process. We illustrate our method in the context of several examples including maximal matching, agreement, two-coloring, three-coloring and sum-not-two protocols.

### UPC-SPIN: A Framework for the Model Checking of UPC Programs

This paper presents a method supported by a software framework for the model checking of Unified Parallel C (UPC) programs. The proposed framework includes a front-end compiler that generates finite models of UPC programs in the modeling language of the SPIN model checker. The model generation is based on a set of sound abstraction rules that transform the UPC synchronization primitives to semantically-equivalent code snippets in SPIN's modeling language. The back-end includes SPIN that verifies the generated model. If the model checking succeeds, then the UPC program is correct with respect to properties of interest such as data race-freedom and/or deadlock-freedom. Otherwise, the back-end provides feedback as sequences of UPC instructions that lead to a data race or a deadlock from initial states, called counterexamples. Using the UPC-SPIN framework, we have detected design flaws in several real-world UPC applications, including a program simulating heat flow in metal rods, parallel bubble sort, parallel data collection, and an integer permutation program. More importantly, for the first time (to the best of our knowledge), we have mechanically verified data race-freedom and deadlock-freedom in a UPC implementation of the Conjugate Gradient (CG) kernel of the NAS Parallel Benchmarks (NPB). We believe that UPC-SPIN provides a valuable tool for developers towards increasing their confidence in the computational results generated by UPC applications.

### Swarm Synthesis of Convergence for Symmetric Protocols

Ali Ebnenasir and Aly Farahat CS-TR-11-02

Due to their increasing complexity, today's distributed systems are subject to a variety of transient faults (e.g., loss of coordination, soft errors, bad initialization), thereby making self-stabilization a highly important property of such systems. However, designing Self-Stabilizing (SS) network protocols is a complex task in part because a SS protocol should recover to a set of legitimate states from any state in its state space; i.e., convergence. Once converged, a SS protocol should remain in its set of legitimate states as long as no faults occur; i.e., closure. The verification of SS protocols is even harder as developers have to prove the interference-freedom of closure and convergence. To facilitate the design and verification of SS protocols, previous work proposes techniques that take a non-stabilizing protocol and automatically add convergence while guaranteeing interference-freedom. Nonetheless, such algorithmic methods must search an exponential space of candidate sequences of transitions that could be included in a SS protocol. This paper presents a novel method for exploiting the computational resources of computer clusters and search diversification towards increasing the chance of success in automated design of finite-state self-stabilizing symmetric protocols. We have implemented our approach in a software tool that enables an embarrassingly parallel platform for the addition of convergence. Our tool has automatically synthesized several SS protocols that cannot be designed by extant automated techniques.

### Efficient LRU-Based Working Set Size Tracking

Weiming Zhao, Xinxin Jin, Zhenlin Wang, Yingwei Luo and Xiaoming Li CS-TR-11-01

Efficient memory resource management requires the knowledge of the memory demands of applications or systems at runtime. A widely proposed approach is to construct an LRU-based miss ratio curve (MRC), which provides not only the current working set size (WSS) but also the relationship between performance and target memory allocation size. Unfortunately, the cost of LRU MRC monitoring is nontrivial. Although previous techniques adopt some optimizations, they still incur large overheads. To attack this problem, this paper proposes a low cost working set size tracking method, which seeks to generate an accurate MRC of programs with negligible overhead. We first introduce two enhancements over current techniques: AVL-tree based LRU structure and dynamic hot set sizing. Nonetheless, the overhead is still as high as 16% on average. Based on a key insight that for most programs the working set sizes are stable most of the time, we design an intermittent tracking scheme, which can temporarily turn off memory tracking when memory demands are predicted to be stable. With the assistance of hardware performance counters, memory tracking can be turned on again if a significant change in memory demands is expected. Experimental results show that, by using this intermittent tracking design, memory tracking can be turned off for 82% of execution time while the accuracy loss is no more than 4%. More importantly, this design is orthogonal to existing optimizing techniques. By combining the three enhancements, the mean overhead is lowered to only 2%. We show that, when applied to memory balancing for virtual machines, our scheme brings a speedup of 2.12.
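The connection between LRU tracking and the MRC can be sketched briefly: under LRU, an access whose reuse distance (distinct addresses touched since its last use) is d hits in a cache of size c exactly when d < c, so the whole curve falls out of a reuse-distance histogram. A minimal sketch, leaving out the AVL-tree, hot-set, and intermittent-tracking machinery described above:

```python
def miss_ratio_curve(distances, max_size):
    """Miss ratio at each LRU cache size 1..max_size.

    `distances` holds one reuse distance per access; cold misses are
    float("inf"). An access hits in a cache of size c iff its reuse
    distance d satisfies d < c, so it misses iff d >= c.
    """
    n = len(distances)
    return [sum(d >= c for d in distances) / n for c in range(1, max_size + 1)]

inf = float("inf")
# Reuse distances for the access trace a, b, a, c, b:
mrc = miss_ratio_curve([inf, inf, 1, inf, 2], 3)   # [1.0, 0.8, 0.6]
```

The working set size is then read off as the smallest cache size at which the curve flattens.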

## 2010

### Towards an Extensible Framework for Automated Design of Self-Stabilization

Ali Ebnenasir and Aly Nour Farahat CS-TR-10-03

This paper proposes a framework for automatic design of (finite-state) self-stabilizing programs from their non-stabilizing versions. A method that automatically adds stabilization to a non-stabilizing program is highly desirable as it generates self-stabilizing programs that are correct by construction, thereby eliminating the need for after-the-fact verification. Moreover, automated design methods enable separation of stabilization from functional concerns. In this paper, we first present a deterministically sound and complete algorithm that adds weak stabilization to programs in polynomial time (in the state space of the non-stabilizing program). We also present a sound method that automatically adds strong stabilization to non-stabilizing programs using the results of the proposed weak stabilization algorithm. These two methods constitute the first elements of the proposed extensible framework. To demonstrate the applicability of our algorithms in automated design of self-stabilization, we have implemented our algorithms in a software tool with which we have synthesized many self-stabilizing programs including Dijkstra’s token ring, graph coloring and maximal matching. While some of the automatically generated self-stabilizing programs are the same as their manually designed versions, our tool has surprisingly synthesized programs that represent alternative solutions for the same problem. Moreover, our algorithms have helped us reveal design flaws in manually designed self-stabilizing programs (claimed to be correct).

### Approximating Weighted Matchings Using the Partitioned Global Address Space Model

Alicia Thorsen and Phillip Merkey CS-TR-10-02

Even though the maximum weighted matching problem can be solved in polynomial time, there are many real world problems for which this is still too costly. As a result, there is a need for fast approximation algorithms for this problem. Manne and Bisseling presented a parallel greedy algorithm which is well suited to distributed memory computers. This paper discusses the implementation of the Manne-Bisseling algorithm using the partitioned global address space (PGAS) model. The PGAS model is a union of the shared address space and distributed memory models which aims to provide the programming conveniences and performance benefits of both. Four variations of the algorithm are presented which explore the design constructs available in the PGAS model. The algorithms are expressed using a generic PGAS description; however, the experiments and empirical performance numbers were obtained from an implementation written in Unified Parallel C with atomic memory operation extensions.
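The local-dominance idea behind such parallel greedy matching can be sketched sequentially (a simplified illustration, not the paper's PGAS implementation): in each round every vertex nominates its heaviest incident edge, and any edge nominated by both endpoints is matched.

```python
def locally_dominant_matching(edges):
    """1/2-approximate weighted matching via locally dominant edges.

    In each round every vertex nominates its heaviest incident edge; an
    edge nominated by both endpoints is locally dominant and is matched.
    Dominant edges in a round are independent, which is what makes the
    scheme parallel-friendly (the rounds here run sequentially).
    Edges are (u, v, weight) tuples; distinct weights are preferred.
    """
    edges = set(edges)
    matching = set()
    while edges:
        best = {}                      # vertex -> heaviest incident edge
        for e in edges:
            u, v, w = e
            for x in (u, v):
                if x not in best or w > best[x][2]:
                    best[x] = e
        dominant = {e for e in edges if best[e[0]] == e and best[e[1]] == e}
        matching |= dominant
        used = {x for e in dominant for x in e[:2]}
        edges = {e for e in edges if e[0] not in used and e[1] not in used}
    return matching

# On the path 0-1-2-3 with weights 3, 5, 4, edge (1, 2, 5) out-weighs
# both neighbors and is matched alone: weight 5 >= half of the optimum 7.
```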

### Identifying Satisfying Subsets: A Method for Algorithmic Correction of Inter-Thread Synchronization Mechanisms

Ali Ebnenasir and Mohammad Amin Alipour CS-TR-10-01

Correcting concurrency failures (e.g., data races and deadlocks/livelocks) is a difficult task in part due to the non-deterministic nature of multithreaded programs, and the fact that a correction often involves multiple threads. As such, manually fixing a failure often results in introducing new flaws, which in turn requires a regression verification step after every fix. Most existing methods for detection and correction of data races rely on code annotation (e.g., atomic sections) or linguistic constructs (e.g., coordination models) that enable programmers to introduce non-determinism in a controlled fashion at the expense of degraded performance. In this report, we present an alternative paradigm for the correction of data races and non-progress failures, where we systematically search for a subset of inter-thread synchronization traces of faulty programs that is data race-free and meets progress requirements, called a satisfying subset of traces. To evaluate the effectiveness of satisfying subsets in the correction of concurrency failures, we have developed explicit-state and symbolic implementations of our algorithms. Our software tools have facilitated the correction of data races and non-progress failures in several Unified Parallel C (UPC) and Java programs that programmers have failed to correct in a reasonable amount of time.

## 2009

### Predicting Remote Reuse Distance Patterns in Unified Parallel C Applications*

Steve Vormwald, Steve Carr, Steven Seidel and Zhenlin Wang CS-TR-09-02

Productivity is becoming increasingly important in high performance computing. Parallel systems, as well as the problems they are being used to solve, are becoming dramatically larger and more complicated. Traditional approaches to programming for these systems, such as MPI, are being regarded as too tedious and too tied to particular machines. Languages such as Unified Parallel C attempt to simplify programming on these systems by abstracting the communication with a global shared memory, partitioned across all the threads in an application. These Partitioned Global Address Space, or PGAS, languages offer the programmer a way to specify programs in a much simpler and more portable fashion.

However, performance of PGAS applications has tended to lag behind applications implemented in a more traditional way. It is hoped that cache optimizations can close this performance gap by providing UPC applications with benefits similar to those they have given single-threaded applications. Memory reuse distance is a critical measure of how much an application will benefit from a cache, as well as an important piece of tuning information for enabling effective cache optimization.

This research explores extending existing reuse distance analysis to remote memory accesses in UPC applications. Existing analyses efficiently store a very good approximation of the reuse distance histogram for each memory access in a program. Reuse data are collected for small test runs, and then used to predict program behavior during full runs by curve fitting the patterns seen in the training runs to a function of the problem size. Reuse data are kept for each UPC thread in a UPC application, and these data are used to predict the data for each UPC thread in a larger run. Both scaling up the problem size and increasing the total number of UPC threads are explored for prediction. Results indicate that good predictions can be made using existing prediction algorithms. However, the choice of training threads can have a dramatic effect on the accuracy of the prediction. Therefore, a simple algorithm is also presented that partitions threads into groups with similar behavior to select threads in the training runs that will lead to good predictions in the full run.

*This work is partially supported by NSF grant CCF-0833082.

### Analysis and Performance of a UPC Implementation of a Parallel Longest Common Subsequence Algorithm

Bryan Franklin and Steven Seidel CS-TR-09-01

An important problem in computational biology is finding the longest common subsequence (LCS) of two nucleotide sequences. This paper examines the correctness and performance of a recently proposed parallel LCS algorithm that uses successor tables and pruning rules to construct a list of sets from which an LCS can be easily reconstructed. Counterexamples are given for two pruning rules that were given with the original algorithm. Because of these errors, performance measurements originally reported cannot be validated. The work presented here shows that speedup can be reliably achieved by an implementation in Unified Parallel C that runs on an Infiniband cluster. This performance is partly facilitated by exploiting the software cache of the MuPC runtime system. In addition, this implementation achieved speedup without bulk memory copy operations and the associated programming complexity of message passing.
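For reference, the standard sequential dynamic-programming formulation of LCS (the textbook baseline, not the successor-table algorithm analyzed in the paper):

```python
def lcs(a, b):
    """Longest common subsequence by textbook dynamic programming.

    dp[i][j] is the LCS length of a[:i] and b[:j]; the subsequence is
    recovered by walking the table backwards. O(len(a) * len(b)) time,
    which is what parallel successor-table schemes aim to beat.
    """
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```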

## 2008

### How Hard Is Aspect-Oriented Programming?

In aspect-oriented programming, it is desirable to algorithmically design correct advised programs that meet a new crosscutting concern while preserving the properties of the base program. Such an automated design eliminates the need for after-the-fact verification, thereby potentially reducing development costs. In this paper, we present a formalization of the problem of designing safety aspects while preserving liveness for finite-state programs, where safety stipulates that nothing bad ever happens, and liveness states that something good will eventually happen in program computations. Subsequently, we illustrate that, for deterministic sequential programs, it is impossible to efficiently design safety aspects and preserve liveness unless P = NP; i.e., the problem is NP-hard in program state space. This is a counterintuitive result, as previous work establishes NP-hardness only for non-deterministic programs.

### Feasibility of Stepwise Addition of Multitolerance to High Atomicity Programs

We present sound and (deterministically) complete algorithms for stepwise design of two families of multitolerant programs in a high atomicity program model, where a program can read and write all its variables in an atomic step. We illustrate that if one needs to add failsafe (respectively, nonmasking) fault-tolerance to one class of faults and masking fault-tolerance to another class of faults, then such an addition can be done in separate polynomial-time (in the state space of the fault-intolerant program) steps regardless of the order of addition. This result has a significant methodological implication in that designers need not be concerned about unknown fault tolerance requirements that may arise due to unanticipated types of faults. Further, we illustrate that if one needs to add failsafe fault-tolerance to one class of faults and nonmasking fault-tolerance to a different class of faults, then the resulting problem is NP-complete in program state space. This is a counterintuitive result in that adding failsafe and nonmasking fault-tolerance to the same class of faults can be done in polynomial time. We also present sufficient conditions for polynomial-time addition of failsafe-nonmasking multitolerance to programs. Finally, we demonstrate the stepwise addition of multitolerance in the context of a token ring protocol and a repetitive agreement protocol that is subject to Byzantine and transient faults.

### ScaleUPC: A UPC Compiler for Multi-Core Systems

Weiming Zhao and Zhenlin Wang CS-TR-08-02

As multi-core computers begin to dominate the market, the lack of a well-accepted parallel programming language and its compilation support targeting this specific architecture has become an immediate problem to solve. Unified Parallel C (UPC), a parallel extension to ANSI C, is gaining great interest from the high performance computing community. UPC was originally designed for large scale parallel computers and cluster environments. However, the partitioned global address space programming model of UPC makes it a natural choice for a single multi-core machine, where the main memory is physically shared. This paper builds a case for UPC as a feasible language for multi-core programming by providing an optimizing compiler, called ScaleUPC, that outperforms other UPC compilers targeting SMPs.

Since all accesses are physically local on a multi-core, and the communication cost of remote accesses thus disappears, we find that the overhead of pointer arithmetic on shared data accesses becomes a prominent bottleneck. The reason is that directly mapping the UPC logical memory layout to physical memory, as used in most existing UPC compilers, incurs prohibitive address calculation overhead. This paper presents an alternative memory layout, which effectively eliminates the overhead without sacrificing the UPC memory semantics. Our research also reveals that a compiler for multi-core systems needs to pay special attention to the memory system. We demonstrate how the compiler can enforce static process/thread binding to improve cache performance.
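For context, the address calculation in question follows from UPC's block-cyclic layout of shared arrays. A sketch of the standard mapping from a shared index to an owning thread and local offset (illustrative only; the paper's alternative layout is designed precisely to avoid recomputing this on every access):

```python
def upc_index(i, block, threads):
    """Owning thread and local offset of element i of a UPC shared array
    declared as `shared [block] T A[N]` over `threads` threads.

    Elements are dealt out in blocks round-robin across threads, and
    each thread stores its own blocks contiguously.
    """
    thread = (i // block) % threads     # which thread owns this block
    course = i // (block * threads)     # which of that thread's blocks
    phase = i % block                   # position inside the block
    return thread, course * block + phase

# With block=2 and threads=2: indices 0,1 -> thread 0; 2,3 -> thread 1;
# 4,5 -> thread 0 again (local offsets 2,3); and so on.
```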

### Action-Level Addition of Leads-To Properties to Shared Memory Parallel Programs

We present a method for large-scale automatic addition of Leads-To properties to shared memory parallel programs. Such an automated addition of Leads-To is highly desirable in facilitating the design of multicore programs for average programmers. Our approach is action-level in that we separately analyze and revise the actions of an existing program in order to accommodate a new Leads-To property while preserving already existing properties. Based on our method, we have developed a software framework that exploits the computational resources of geographically distributed machines to add Leads-To properties to parallel programs. We demonstrate our approach in the context of a token passing and a barrier synchronization program.

## 2007

### A Performance Model for Unified Parallel C

Dr. Zhang Zhang CS-TR-07-04

This research is a performance-centric investigation of Unified Parallel C (UPC), a parallel programming language that belongs to the Partitioned Global Address Space (PGAS) language family. The objective is to develop a performance modeling methodology that targets UPC but can be generalized to other PGAS languages.

The performance modeling methodology relies on platform characterization and program characterization, achieved through shared memory benchmarking and static code analysis, respectively. Models built using this methodology can predict the performance of simple UPC application kernels with relative errors below 15%. Besides performance prediction, this work provides a framework based on shared memory benchmarking and code analysis for platform evaluation and compiler/runtime optimization study. A few platforms are evaluated in terms of their fitness for UPC computing. Some optimization techniques, such as remote reference caching, are studied using this framework.

A UPC implementation, MuPC, is developed along with the performance study. MuPC consists
of a UPC-to-C translator built upon a modified version of the EDG C/C++ front end
and a runtime system built upon MPI and POSIX threads. MuPC performance features include
a runtime software cache for remote accesses and low latency access to shared memory
with affinity to the issuing thread. In this research, MuPC serves as a platform that
facilitates the development, testing, and validation of performance microbenchmarks
and optimization techniques.

### DiConic Addition of Failsafe Fault-Tolerance

Ali Ebnenasir CS-TR-07-03

We present a divide and conquer method, called DiConic, for automatic addition of failsafe fault-tolerance to distributed programs, where a failsafe program guarantees to meet its safety specification even when faults occur. Specifically, instead of adding fault-tolerance to a program as a whole, we separately revise program instructions in a static way so that the entire program becomes failsafe fault-tolerant. Our DiConic algorithm can potentially utilize the processing power of a large number of machines working in parallel, thereby enabling automatic addition of failsafe fault-tolerance to programs with a large number of instructions. We formulate our DiConic synthesis algorithm in terms of the satisfiability problem. We validate our approach using the Yices Satisfiability Modulo Theories (SMT) solver for the Byzantine Generals problem and an industrial application.

### Non-Viable Path Branch Prediction

Ying Xiong and Dr. Soner Onder CS-TR-07-02

For modern superscalar processors that implement deeper and wider pipelines, accurate branch prediction is crucial for feeding a sufficient number of correct instructions into the superscalar's highly parallelized execution core. In this paper, we show that what the branch predictor learns has significant implications for its ability to make effective use of branch correlation and to exploit longer history lengths. Contrary to the commonly employed approach of training each case using both successes and failures of previous branch behavior, a Non-Viable Path branch predictor learns those paths that the program will never follow, given prior history; such a predictor is referred to as a Non-Viable Path Predictor (NVPP). An NVPP can learn any boolean function, given enough space.

In this paper, we present an analysis of the design space and arrive at a practically feasible implementation of the concept, which uses simple traditional one- or two-bit tables and traditional table indexing mechanisms. Because of its unique properties, NVPP can also exploit very long global histories with only a linear increase in the space requirement, and it trains rapidly. Although NVPP exploits long histories, its access time remains relatively constant and is comparable to that of small gshare predictors, permitting a single-cycle implementation. The performance of NVPP is compared with recently published, highly accurate predictors, including the piecewise linear and O-GEometric History Length (O-GEHL) predictors. Our results indicate that NVPP's accuracy compares well with these predictors while maintaining the simplicity and speed of a traditional predictor design.

### A Simple Parallel Approximation Algorithm for the Weighted Matching Problem

Alicia Thorsen, Phillip Merkey and Fredrik Manne (University of Bergen) CS-TR-07-01

Given a weighted graph, the weighted matching problem is to find a matching with maximum weight. The fastest known exact algorithm runs in O(nm + n² log n); however, for many real-world applications this is too costly, and an approximate matching is sufficient. A c-approximation algorithm is one that always finds a matching of weight at least c times the optimal weight. Drake and Hougardy developed a linear-time (2/3 − ε)-approximation algorithm, which is the best known serial algorithm. They also developed a parallel (1 − ε)-approximation algorithm for the PRAM model; however, it requires a large number of processors, which limits its usefulness in practice. Hoepman developed a distributed 1/2-approximation algorithm, which is the best known distributed algorithm. We present a shared memory parallel version of the best (2/3 − ε) algorithm, which is simple to understand and easy to implement.
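The simplest member of this family is the classic greedy 1/2-approximation, which scans edges in order of decreasing weight (shown here as a baseline sketch; the (2/3 − ε) and parallel algorithms discussed above refine this idea):

```python
def greedy_matching(edges):
    """Classic greedy 1/2-approximation for weighted matching.

    Scan edges in order of decreasing weight and keep an edge whenever
    both endpoints are still free. Edges are (u, v, weight) tuples.
    The result's weight is always at least half the optimum.
    """
    matched = set()      # vertices already used by the matching
    matching = []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u not in matched and v not in matched:
            matching.append((u, v, w))
            matched |= {u, v}
    return matching

# On the path 0-1-2-3 with weights 3, 5, 4, the greedy pass keeps only
# (1, 2, 5): weight 5, at least half the optimum of 3 + 4 = 7.
```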

## 2006

### A Hill-Climbing Approach for Planning with Temporal Uncertainty

Janae N. Foss and Dr. Nilufer Onder CS-TR-06-02

We approach the problem of finding temporally contingent plans, i.e., plans with branches
that are based on the duration of an action at execution time, using a hill-climbing
algorithm. We find an optimistic plan that is valid when all actions complete quickly.
We then use efficient temporal reasoning techniques to determine when the plan may
fail. At time points that cause an unsafe situation, we insert temporal contingency
branches. We describe our implemented planner PHOCUS-HC and provide experimental results.

### UPC Collective Conformance Suite

Lisa Begum and Dr. Charles Wallace CS-TR-06-01

The UPC collective conformance suite is a collection of tests that help determine how closely a given implementation of the UPC collective operations conforms to the specifications. The test suite is based on Version 1.1 of the UPC specification and Version 1.0 of the UPC collectives specification. It has been used on the MuPC and HP UPC platforms. These tests exercise all of the collective operations, with the exception of the deprecated upc_all_sort operation, and notify the tester of behavior that departs from the specifications. This document describes the tests in the suite and explains why they were chosen.

## 2005

### A Specification of the Extensions to the Collective Operations of Unified Parallel C

Zinnu Ryne and Dr. Steven Seidel CS-TR-05-08

Unified Parallel C (UPC) is an extension of the C programming language that provides
a partitioned shared-memory model for parallel programming. On such platforms, a simple
assignment operation can perform remote memory reads or writes. Nevertheless, since
bulk transfers of data are more efficient than byte-level copying, collective operations
are needed, and as a result a specification for a standard set of collective operations
was proposed in 2003. This specification is now part of the UPC V1.2 language specification.
However, these operations have limited functionality. This thesis work explores ways
to overcome these limitations by proposing a set of extensions to the standard collective
relocalization operations. The proposal includes performance and convenience improvements
that provide asynchronous behavior, an in-place option, runtime implementation hints,
and a subsetting feature. Also proposed are new one-sided collective operations,
which are alternatives to the traditional MPI way of looking at collective operations.

A comparative performance analysis is done to ensure that the reference implementation
of the extended collective operations performs within reasonable bounds of the
standard collective operations. This provides the necessary support to adopt these
extensions, as they give greater flexibility and power to the programmer with a
negligible performance penalty.

### Implementing UPC's MYSYNC Synchronization Mode Using Pairwise Synchronization of Threads

Prasad Dhamne CS-TR-05-07

UPC (Unified Parallel C) is an extension of ANSI C that provides a partitioned shared
memory model for parallel programming. Synchronization between threads (processes)
in UPC is done through the use of locks or barriers. We have investigated a new synchronization
method that is better suited to situations where threads do not need to synchronize
with all of the other threads in the system.

We implemented pairwise synchronization that can be used to synchronize pairs of threads
while not disturbing the remaining threads. This project also explored how to synchronize
pairs of UPC threads over a Myrinet interconnect. Myrinet is a low-latency, high-bandwidth
local area network with a low-level communication layer called GM. We implemented
UPC's MYSYNC synchronization mode in collective operations that make use of pairwise
synchronization. We compared the performance of the MYSYNC synchronization mode to the
ALLSYNC synchronization mode, which is often implemented using a barrier. For performance
analysis we used a collectives testbed previously developed at MTU.

Results obtained from the testbed indicate that for collectives such as Broadcast,
Scatter, and Gather, MYSYNC synchronization considerably reduces the collective time.
Overall testbed execution time for 100 trials with MYSYNC mode improved by 10-15%
over ALLSYNC synchronization. For the Permute collective operation the performance
improvement is greatest in the pull-based implementation, where it is close to 20%.
For Gatherall and Exchange the performance improvement is minimal, and ALLSYNC mode
outperforms MYSYNC synchronization in the push-based implementation of the Permute
collective operation. In general, the performance advantage of MYSYNC decreases as
the number of threads increases.
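The GM/Myrinet implementation is not reproduced here; as a toy sketch of the pairwise idea behind MYSYNC, two threads can meet at a synchronization point without involving anyone else (Python threading; the class and method names are invented for illustration):

```python
import threading

class PairwiseSync:
    """Pairwise synchronization point between exactly two threads.

    Each side signals its own semaphore and waits on the partner's,
    so the pair meets without disturbing any other thread -- the
    idea behind MYSYNC, as opposed to an ALLSYNC barrier over all
    threads.
    """
    def __init__(self):
        self.sem = [threading.Semaphore(0), threading.Semaphore(0)]

    def sync(self, side):             # side is 0 or 1
        self.sem[side].release()      # announce my arrival
        self.sem[1 - side].acquire()  # wait for the partner

pair = PairwiseSync()
log = []

def worker(side, before, after):
    log.append(before)
    pair.sync(side)      # neither thread passes until both arrive
    log.append(after)

t = threading.Thread(target=worker, args=(1, 'B-in', 'B-out'))
t.start()
worker(0, 'A-in', 'A-out')
t.join()
# Both '-in' entries are logged before either '-out' entry.
print(log)
```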

### Implementing Sort in UPC: Performance and Optimization

Kohinoor (Lisa) Begum and Dr. Steven Seidel CS-TR-05-06

Unified Parallel C (UPC) is a parallel extension of ANSI C that is based on a partitioned
shared address space. It supports development of parallel applications over diverse
hardware platforms and does not require a true shared memory system. It simplifies
programming by providing the same syntax for local references as for remote references.
This helps the user focus on performance and allows for the exploitation of locality.
Our project is to implement a generic sort function in UPC. It will help programmers
focus on the design, implementation, and performance of the entire application rather
than on the details of sorting.

Upc_all_sort is an implementation of a generic sort function in UPC using a bin sort
algorithm. Techniques applied to improve the performance of the sort function include:
sampling the data to ensure more nearly uniform bucket sizes among threads, an all-to-all
exchange of bucket sizes to minimize the amount of dynamically allocated memory, and
rebalancing the data at the end of the sort to ensure that each thread finishes with
the same number of elements as it started with. In this paper, we analyze and compare
the performance of sequential and parallel sort, the potential for performance improvements
over sequential sort, and the bottlenecks limiting these improvements. This sort function
runs 3.9 times faster than the sequential sort for problem sizes of at least 128K when
run on 8 threads. The performance scales up with increasing numbers of threads and
bigger problem sizes, reaching a speedup of 6.8 times.

### A Mechanism for Protecting Passwords from Key Loggers

Lior Shamir CS-TR-05-03

A simple mechanism that protects passwords from being stolen by malicious key loggers
is proposed. The described mechanism is based on feeding potential key loggers
incorrect passwords by generating fake keystrokes. Applications that require their
users to type in passwords (e.g., web browsers visiting password-protected web sites)
can use this mechanism so that passwords are protected even in the absence
of an anti-key logger. Different approaches to protection are proposed for thread-based,
hook-based, and kernel-based key loggers.

### An Overlay Network Simulator in Java

Eric Simonton CS-TR-05-05

It is not too difficult to generate a power-law topology. It is not trivial, however,
to maintain that topology while simulating network churn, nor to keep the network
unpartitioned in the process. This report describes how to do just that, and how to
do it efficiently.
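The report's churn-maintenance algorithm is not shown here; as context for why generating a power-law topology is "not too difficult," here is a minimal preferential-attachment (Barabási–Albert style) generator, written for illustration only:

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow a graph where each new node attaches to m existing
    nodes chosen with probability proportional to their degree.
    Degree-proportional choice is implemented by keeping a list in
    which each node appears once per unit of degree; repeated
    attachment to high-degree nodes yields a power-law degree
    distribution."""
    rng = random.Random(seed)
    edges = [(0, 1)]        # start from a single edge
    targets = [0, 1]        # node id repeated once per degree unit
    for new in range(2, n):
        chosen = set()
        while len(chosen) < min(m, new):
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.append((new, old))
            targets += [new, old]
    return edges

g = preferential_attachment(100)
print(len(g), "edges")   # 1 initial edge + 2 per added node
```

Maintaining this degree distribution under churn (the hard part the report addresses) requires rewiring on departure, which this sketch does not attempt.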

### Eager Misprediction Recovery

Peng Zhou, Dr. Soner Onder, and Dr. Steve Carr CS-TR-05-04

Current trends in modern out-of-order processors involve implementing deeper pipelines
and a large instruction window to achieve high performance. However, as pipeline depth
increases, the branch misprediction penalty becomes a critical factor in overall processor
performance. Current approaches to handling branch mispredictions either incrementally
roll back to in-order state by waiting until the mispredicted branch reaches the head
of the reorder buffer, or utilize checkpointing at branches for faster recovery. Rolling
back to in-order state stalls the pipeline for a significant number of cycles, and checkpointing
is costly. This paper proposes a fast recovery mechanism, called Eager Misprediction
Recovery (EMR), to reduce the branch misprediction penalty. Upon a misprediction,
the processor immediately starts fetching, renaming and executing instructions from
the correct path without restoring the map table. Those instructions that access incorrect
speculative values wait until the correct data are restored; however, instructions
that access correct values continue executing while recovery occurs. Thus, the recovery
mechanism hides the latency of long branch recovery with useful instructions. EMR
achieves a mean performance improvement very close to a recovery mechanism that supports
checkpointing at each branch. In addition, EMR provides an average of 9.0% and up
to 19.9% better performance than traditional sequential misprediction recovery on
the SPEC2000 benchmark suite.

### MuON: Epidemic Based Mutual Anonymity

Neelesh Bansod, Ashish Malgi, Dr. Byung Kyu Choi and Dr. Jean Mayo CS-TR-05-02

A mutually anonymous service hides the identity of a client from the service provider
and vice-versa. Providing mutual anonymity usually requires a large number of participants.
While peer-to-peer (P2P) networks are capable of recruiting a large number of participants,
reliable anonymous communication in these architectures, with low bandwidth usage,
still needs further investigation. This paper presents MuON, a protocol to achieve
mutual anonymity in unstructured P2P networks. MuON leverages epidemic-style data
dissemination to deal with the high churn (changes in system membership) characteristic
of unstructured P2P networks. The results from our security analysis and simulation
show that MuON provides mutual anonymity over unstructured P2P networks while maintaining
predictable latencies, high reliability, and low communication overhead.
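MuON's protocol itself is not reproduced here; a toy round-based push-gossip simulation illustrates why epidemic-style dissemination reaches a large, churning membership quickly (function and parameter names are invented for illustration):

```python
import random

def push_gossip_rounds(n, fanout=3, seed=1):
    """Simulate epidemic (push) dissemination: each round, every
    node that already holds the message forwards it to `fanout`
    peers chosen uniformly at random.  Returns the number of rounds
    until all n nodes hold it -- typically O(log n), which is why
    epidemic protocols tolerate high churn well."""
    rng = random.Random(seed)
    informed = {0}           # node 0 originates the message
    rounds = 0
    while len(informed) < n:
        for node in list(informed):
            for _ in range(fanout):
                informed.add(rng.randrange(n))
        rounds += 1
    return rounds

print(push_gossip_rounds(1000), "rounds for 1000 nodes")
```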

### An Introduction to Flexible Architecture Simulation Tool (FAST) and Architecture Description Language ADL

Dr. Soner Onder CS-TR-05-01

The following document provides an overview of how to use the Flexible Architecture
Simulation Tool and an in-depth look at the Architecture Description Language. The
system has been developed by Dr. Soner Onder and Dr. Rajiv Gupta and, after many revisions,
has become a powerful simulation tool. The documentation you are reading is the result
of a collaboration between the University of Alberta and Michigan Technological
University. Aaron Dittrich, under the direction of Dr. Mike MacGregor, organized
pieces of documentation Dr. Onder had written into a technical report
(TR03-18, Department of Computing Science, University of Alberta, Edmonton, Alberta,
Canada), which later became the basis for this document. The examples presented are
centered around the MIPS RISC architecture. First a single-stage MIPS architecture is
presented, and then the remainder of the examples are centered around a simplified
MIPS R2000/R3000 architecture.

## 2004

### High Performance Unified Parallel C (UPC) Collectives for Linux/Myrinet Platforms

Alok Mishra and Dr. Steven Seidel CS-TR-04-05

Unified Parallel C (UPC) is a partitioned shared memory parallel programming language
that is being developed by a consortium of academia, industry and government. UPC
is an extension of ANSI C. In this project we implemented a high performance UPC collective
communications library of functions to perform data relocalization in UPC programs
for Linux/Myrinet clusters. Myrinet is a low-latency, high-bandwidth local area network.
The library was written using Myrinet's low-level communication layer, called GM. We
implemented the broadcast, scatter, gather, gather all, exchange, and permute collective
functions as defined in the UPC collectives specification document. The performance
of these functions was compared to a UPC-level reference implementation of the collectives
library. We also designed and implemented a micro-benchmarking application to measure
and compare the performance of collective functions. The collective functions implemented
in GM usually ran two to three times faster than the reference implementation over
a wide range of message lengths and numbers of threads.
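The GM-level implementation is not shown here; as a sketch of the communication structure such a broadcast typically uses, here is a binomial-tree schedule in which the set of threads holding the data doubles each round (function name invented for illustration):

```python
def binomial_broadcast_schedule(nthreads, root=0):
    """Rounds of (sender, receiver) pairs for a binomial-tree
    broadcast rooted at thread 0: in round k, every thread that
    already holds the data sends to the thread 2**k away, so all
    nthreads threads are covered in ceil(log2(nthreads)) rounds."""
    rounds = []
    have = {root}
    dist = 1
    while dist < nthreads:
        sends = []
        for t in sorted(have):
            peer = t + dist
            if peer < nthreads and peer not in have:
                sends.append((t, peer))
        for _, r in sends:
            have.add(r)
        rounds.append(sends)
        dist *= 2
    return rounds

# 8 threads: 3 rounds, with 1, 2, then 4 concurrent sends.
print(binomial_broadcast_schedule(8))
```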

### TransMig: Implementing a Transparent Strong Migration of Mobile Agents in Java

Ashish Malgi, Neelesh Bansod, and Dr. Byung Kyu Choi CS-TR-04-04

Mobile agents are programs that ship themselves from one machine to another using
a programmer-defined primitive. They form an interesting paradigm for developing distributed
applications. The key issue in migrating a mobile agent from one machine to another
is the transmission of the execution state of the agent. We propose TransMig, which
is an approach to capture and re-establish the state of mobile agents programmed in
Java without any modifications to the Java virtual machine. Our approach instruments
the original source code and the resultant byte code to achieve the capture and re-establishment
of the execution state of mobile agents at a language level. Currently, our prototype
implementation of TransMig has been used for providing a programming and runtime environment
for mobile agents, but it could easily be extended for any Java program.

### Significant Precursors to UPC

Dr. Phillip Merkey and Jaisudha Purushothaman CS-TR-04-03

Designing parallel distributed languages involves deciding the level of abstraction
that can be supported through the features provided to the programmer and the level
of control the programmer is allowed to wield over the system. C has been one of the
most commonly used languages for almost three decades. It has low level access to
information and commands while still retaining the portability and syntax of a high
level language. Its flexibility and powerful pointer manipulation capabilities are
some of the reasons why it is still actively maintained. It has been the natural base
language on which researchers have developed parallel languages or have added libraries
or extensions for parallelism. For example, Cilk, C*, Split-C, AC, PCP, Handel-C, mpC,
DPCE, C--, C//, Jade, Parsec, MPI, OpenMP, and ParaC are some of the languages
developed as C libraries or extensions. UPC can be viewed as the culmination of previous
attempts at finding the right combination of programming features for the parallel
distributed shared memory environment. In particular, PCP, Split-C, and AC are considered
immediate predecessors of UPC. This essay discusses these languages and UPC in a comparative
manner.

### Exploring Linked Lists in UPC

Hiriomi Suenaga, Dr. Phillip Merkey, and Dr. Steven Seidel CS-TR-04-02

Shared memory multiprocessors offer great computational power while introducing new
problems such as race conditions and deadlock. Concurrent objects can be accessed
and modified by multiple processes simultaneously. In order to avoid corruption of
data, we need to construct carefully planned algorithms that minimize the use of synchronization
schemes that inhibit performance. Unified Parallel C (UPC) is a parallel extension
of ANSI C that presents memory as a partitioned shared memory. Because of this, past
studies that implemented concurrent objects on pure shared memory machines cannot
be applied directly. In order to achieve high performance while avoiding data corruption
or inconsistent results, it is important to understand the underlying architecture
and the memory model. In this study we explored three different ways of implementing
an elemental data structure, the linked list, on which multiple processes can operate
at the same time. Unfortunately, we were not yet able to show good speedup
because of experimentally confirmed limitations of the current compiler; however,
we are confident that our approaches will yield great performance improvements
as the compiler develops.
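The three UPC implementations studied are not reproduced here; a minimal hand-over-hand (lock-coupling) sorted linked list, sketched in Python threading for an insert-only workload, illustrates one standard fine-grained approach to the problem:

```python
import threading

class Node:
    def __init__(self, key):
        self.key = key
        self.next = None
        self.lock = threading.Lock()

class SortedList:
    """Sorted linked list with hand-over-hand locking for inserts:
    a thread holds at most two node locks while walking, so inserts
    into disjoint regions of the list proceed concurrently.  For an
    insert-only workload, holding only the predecessor's lock at
    the splice point is sufficient."""
    def __init__(self):
        self.head = Node(float('-inf'))   # sentinel node

    def insert(self, key):
        prev = self.head
        prev.lock.acquire()
        curr = prev.next
        while curr is not None and curr.key < key:
            curr.lock.acquire()   # lock next node...
            prev.lock.release()   # ...before releasing the previous
            prev, curr = curr, curr.next
        node = Node(key)
        node.next = curr
        prev.next = node          # splice in under prev's lock
        prev.lock.release()

    def keys(self):
        out, n = [], self.head.next
        while n is not None:
            out.append(n.key)
            n = n.next
        return out

lst = SortedList()
threads = [threading.Thread(target=lst.insert, args=(k,))
           for k in [5, 1, 3, 2, 4]]
for t in threads: t.start()
for t in threads: t.join()
print(lst.keys())   # sorted regardless of thread interleaving
```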

### Pairwise Synchronization in UPC

Ernie Fessenden and Dr. Steven Seidel CS-TR-04-01

UPC (Unified Parallel C) is a small extension of C that provides a partitioned shared
memory model for parallel programming. Synchronization between threads (processes)
in UPC is done through the use of locks or barriers. We have investigated a new synchronization
process that we believe is better suited to situations where threads do not need
to synchronize with all of the other threads in the system.

This project explores how to synchronize UPC threads in specialized ways. Our approach
is to use UPC locks to implement pairwise synchronization. Using this new approach
we implemented the MYSYNC synchronization mode and compared it to the ALLSYNC synchronization
mode that is often implemented by using a barrier. Pairwise synchronization can also
be used to synchronize arbitrary subsets of threads while not disturbing the remaining
threads. Such subset-based synchronization does not presently exist in UPC. We also
investigate how to implement the barrier using pairwise synchronization.
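One classic way to build a barrier from nothing but pairwise synchronizations is the dissemination barrier, sketched here in Python threading (the report's own construction is not shown; class and variable names are invented):

```python
import threading
from math import ceil, log2

class DisseminationBarrier:
    """Barrier built purely from pairwise synchronization: in round
    k, each thread i does a pairwise sync with thread (i + 2**k) % n,
    so after ceil(log2(n)) rounds every thread has transitively
    synchronized with every other thread."""
    def __init__(self, n):
        self.n = n
        self.rounds = ceil(log2(n))
        # sems[k][i]: round-k arrival signal destined for thread i
        self.sems = [[threading.Semaphore(0) for _ in range(n)]
                     for _ in range(self.rounds)]

    def wait(self, i):
        for k in range(self.rounds):
            peer = (i + 2 ** k) % self.n
            self.sems[k][peer].release()  # notify round-k partner
            self.sems[k][i].acquire()     # wait for my notifier

n = 4
barrier = DisseminationBarrier(n)
arrived = [0] * n
seen = [0] * n

def worker(i):
    arrived[i] = 1
    barrier.wait(i)
    seen[i] = sum(arrived)   # every arrival precedes every exit

ts = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in ts: t.start()
for t in ts: t.join()
print(seen)   # each thread saw all n arrivals
```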

The MYSYNC synchronization mode has shown better performance than the ALLSYNC synchronization
mode for several of the relocalization collective operations. Specifically, for an
uneven computational load the MYSYNC implementation of upc_all_broadcast, upc_all_scatter,
and upc_all_gather outperforms the ALLSYNC implementation. For upc_all_permute
under even and uneven computational loads the MYSYNC implementation shows large improvements
in run time. However, for upc_all_exchange and upc_all_gather_all the MYSYNC implementation
runs slower than the ALLSYNC implementation.

## 2003

### Illustrative Test Cases for the UPC Memory Model

William Kuchera and Dr. Charles Wallace CS-TR-03-02

The memory model underlying UPC is an important but subtle aspect of the language,
and everyone involved with it must understand its implications. The only resource
currently available with detailed coverage of the memory model is the high-level description
in the UPC specification [1]. As mentioned in our previous report [3], it is difficult
to tie this description to actual behavior of a UPC program. UPC implementers and
application developers must be able to readily distinguish program behavior that complies
with the UPC specification from behavior that does not comply. For instance, implementers
of UPC on a particular platform need to ensure that the optimizations they employ
guarantee compliant behavior. Programmers exploiting relaxed consistency must have
a grasp of what possible behaviors their programs may induce.

To this end, we have devised a set of test cases for the UPC memory model. These cases
fall into two categories:

Compliance tests. These are examples of behavior that falls outside the UPC specification.
They illustrate the consistency guarantees that UPC gives the programmer.

"Dark corner" tests. These are examples of acceptable behavior, according to the UPC
specification, that may be surprising to the UPC novice. The UPC memory model is designed
to be quite lax and allows some cases that programmers may not anticipate. These examples
serve to highlight some of these "dark corners".

Our test cases are at the architecture level rather than the program level; that is,
each thread's execution is expressed in terms of a sequence of read, write, fence,
notify, and wait instructions, rather than UPC statements. Any references to "ordering,"
"following," "intervening," etc. refer to this thread-by-thread instruction ordering.
Read and write instructions are assumed to be relaxed unless specified otherwise.
For each case, we explain why it illustrates compliance or non-compliance, using
the operational semantics devised in our previous report [3].
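In the same spirit as these architecture-level test cases, a toy enumerator of sequentially consistent interleavings for the classic store-buffering litmus test shows which outcomes only a relaxed model can produce (this is an illustrative sketch, not one of the report's cases):

```python
from itertools import permutations

def sc_outcomes():
    """Enumerate outcomes of the store-buffering litmus test under
    sequential consistency.
      Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
    Under SC the outcome (r0, r1) == (0, 0) is impossible; a
    relaxed model may allow it, which is exactly the kind of
    distinction memory-model test cases must draw."""
    outcomes = set()
    for order in permutations(['0a', '0b', '1a', '1b']):
        # keep only interleavings respecting each thread's order
        if order.index('0a') < order.index('0b') and \
           order.index('1a') < order.index('1b'):
            mem, regs = {'x': 0, 'y': 0}, {}
            for op in order:
                if op == '0a':   mem['x'] = 1
                elif op == '0b': regs['r0'] = mem['y']
                elif op == '1a': mem['y'] = 1
                elif op == '1b': regs['r1'] = mem['x']
            outcomes.add((regs['r0'], regs['r1']))
    return outcomes

print(sorted(sc_outcomes()))   # (0, 0) is absent under SC
```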

### Toward a Programmer-Friendly Formal Specification of the UPC Memory Model

William Kuchera and Dr. Charles Wallace CS-TR-03-01

As part of our efforts to elucidate the UPC memory model, we have closely examined
the memory model definition given in the official UPC specification [3] (hereafter,
"the UPC spec"). This document gives most of the relevant details about the memory
model.

## 2002

### ASM 101: An Abstract State Machine Primer

James K. Huggins and Dr. Charles Wallace CS-TR-02-04

Over the last twenty years, *Abstract State Machines (ASMs)* have been used to describe and validate a wide variety of computing systems [9].
Numerous introductions to ASMs have been written (see for example [20, 10, 21, 36]).
In addition, many ASM papers include brief overviews for the casual reader. In this
paper we attempt to provide a gentler introduction, focusing more on the use of the
technique than on formal definitions. Our target audience is a senior-level undergraduate
class in software engineering, but we hope that this work proves helpful to other
readers as well.

While there are references to "software" throughout, the reader should be aware that
ASMs are relevant to other kinds of system development as well. For simplicity, our
focus is on ASMs with only a single agent; distributed ASMs may be considered in a
later paper.
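A single-agent ASM step can be sketched in a few lines (the state and rule representation here is invented for illustration, not taken from the primer): state is a finite function, and all enabled updates fire simultaneously.

```python
def asm_step(state, rules):
    """One step of a single-agent Abstract State Machine: evaluate
    every rule's guard and update against the *current* state,
    collect the updates, then apply them simultaneously -- the
    hallmark of ASM semantics.  (Clashing updates are not checked
    in this sketch; a full ASM would reject inconsistent updates.)"""
    updates = {}
    for guard, loc, expr in rules:
        if guard(state):
            updates[loc] = expr(state)
    new_state = dict(state)
    new_state.update(updates)
    return new_state

# Example: swap a and b in one step -- impossible with sequential
# assignment, natural with simultaneous ASM updates.
rules = [
    (lambda s: True, 'a', lambda s: s['b']),
    (lambda s: True, 'b', lambda s: s['a']),
]
print(asm_step({'a': 1, 'b': 2}, rules))
```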

### MuPC: A Run Time System for Unified Parallel C

Jeevan Savant CS-TR-02-03

Unified Parallel C is an extension to the C programming language intended for parallel
programming. UPC adds a small number of parallel programming constructs to C. These
constructs enable users to write parallel programs based on a distributed shared memory
model. MuPC (MTU's UPC) is a run time system for UPC, enabling it to run on a wide
variety of architectures including networks of workstations, Beowulf clusters and
shared memory machines.

### Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: II - Theory and Software Tools

Dr. Adrian Sandu, Dacian N. Daescu and Gregory R. Carmichael CS-TR-02-02

The Kinetic PreProcessor KPP was extended to generate the building blocks needed for
the direct and adjoint sensitivity analysis of chemical kinetic systems. An overview
of the theoretical aspects of sensitivity calculations and a discussion of the KPP
software tools is presented in the companion paper.

In this work the correctness and efficiency of the KPP generated code for direct and
adjoint sensitivity studies are analyzed through an extensive set of numerical experiments.
Direct decoupled Rosenbrock methods are shown to be cost-effective for providing sensitivities
at low and medium accuracies. A validation of the discrete-adjoint evaluated gradients
is performed against the finite difference gradients. Consistency between the discrete
and continuous adjoint models is shown through step size reduction. The accuracy of
the adjoint gradients is measured against a reference gradient value obtained with
a standard direct-decoupled method. The accuracy is studied for both constant step
size and variable step size integration of the forward model.

Applications of the KPP-1.2 software package to direct and adjoint sensitivity studies,
variational data assimilation and parameter identification are considered for the
comprehensive chemical mechanism SAPRC-99.

### Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: I - Theory and Software Tools

Dr. Adrian Sandu, Dacian N. Daescu and Gregory R. Carmichael CS-TR-02-01

The analysis of comprehensive chemical reaction mechanisms, parameter estimation
techniques, and variational chemical data assimilation applications requires the development
of efficient sensitivity methods for chemical kinetic systems. The new release (KPP-1.2)
of the Kinetic PreProcessor KPP contains software tools that facilitate direct and
adjoint sensitivity analysis. The direct decoupled method, built using BDF formulas,
has been the method of choice for direct sensitivity studies. In this work we extend
the direct decoupled approach to Rosenbrock stiff integration methods. The need for
Jacobian derivatives has prevented Rosenbrock methods from being used extensively in direct
sensitivity calculations; however, new automatic differentiation and symbolic
differentiation technologies make the computation of these derivatives feasible. The
adjoint modeling is presented as an efficient tool to evaluate the sensitivity of
a scalar response function with respect to the initial conditions and model parameters.
In addition, sensitivity with respect to time dependent model parameters may be obtained
through a single backward integration of the adjoint model. KPP software may be used
to completely generate the continuous and discrete adjoint models taking full advantage
of the sparsity of the chemical mechanism. Flexible direct-decoupled and adjoint sensitivity
code implementations are achieved with minimal user intervention. In the companion
paper [6] we present an extensive set of numerical experiments that validate the KPP
software tools for several direct/adjoint sensitivity applications, and demonstrate
the efficiency of KPP-generated sensitivity code implementations.
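The direct and adjoint approaches the abstract contrasts can be summarized by the standard sensitivity equations; the notation below (y for concentrations, p for parameters, J for the Jacobian, g for a scalar response) is the conventional one, not quoted from the paper. The direct method integrates one extra linear system per parameter:

```latex
\frac{dy}{dt} = f(t, y, p), \qquad
S_i = \frac{\partial y}{\partial p_i}, \qquad
\frac{dS_i}{dt} = J(t, y, p)\, S_i + \frac{\partial f}{\partial p_i},
\qquad J = \frac{\partial f}{\partial y},
```

while the adjoint method obtains the gradient of a single response with respect to all initial conditions through one backward integration:

```latex
\frac{d\lambda}{dt} = -J^{T}(t, y, p)\,\lambda, \qquad
\lambda(T) = \left(\frac{\partial g}{\partial y}\Big|_{t=T}\right)^{T},
\qquad \nabla_{y_0}\, g = \lambda(t_0).
```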

## 2001

### A TCP/UDP Protocol Visualization Tool: Visual TCP/UDP Animator (VTA)

Ping Chen, Rong Ge, Yin Ma, Dr. Jean Mayo, Yuping Zhang, and Chunhua Zhao CS-TR-01-10

The TCP/UDP/IP/Ethernet protocol suite is commonly covered within an undergraduate
curriculum. Often, this is the first exposure students have to a network protocol.
We have found that students have difficulty understanding the unique function of these
protocols and how the protocols are related. This is due at least in part to the fact
that packet specification and construction takes place within the operating system,
while covering a typical operating system protocol stack implementation would be beyond
the scope or time constraints of many of these undergraduate courses. In order to
address this problem we have written a visualization tool that illustrates the operation
of TCP and UDP over the IP and Ethernet protocols using actual packets, captured via
"libpcap," live from the network or stored in a file. The tool provides several views
of captured data that, when used individually or in combination, illustrate the operation
of these protocols.
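The tool itself captures packets via libpcap; as a minimal, self-contained illustration of the first parsing step such a tool performs on a captured frame, the 14-byte Ethernet header can be split off with the standard library alone (this sketch is not VTA code):

```python
import struct

def parse_ethernet(frame):
    """Split off the 14-byte Ethernet header: destination MAC,
    source MAC, and EtherType (0x0800 = IPv4), returning the header
    fields plus the payload handed to the next protocol layer --
    the layering relationship VTA's views make visible."""
    dst, src, ethertype = struct.unpack('!6s6sH', frame[:14])
    as_mac = lambda b: ':'.join('%02x' % x for x in b)
    return {'dst': as_mac(dst), 'src': as_mac(src),
            'type': hex(ethertype), 'payload': frame[14:]}

# A hand-built broadcast frame carrying a (fake) IPv4 payload.
frame = bytes.fromhex('ffffffffffff' '001122334455' '0800') + b'IP...'
print(parse_ethernet(frame))
```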

### Discretizing Aerosol Dynamics with B-Splines

Dr. Adrian Sandu and Christian T. Borden CS-TR-01-09

This paper presents a discretization technique for the particle dynamics equation
based on B-spline interpolation of the solution. The method is developed in the general
framework recently proposed by the authors, and extends the previous work of Pilinis
and Seinfeld to arbitrary order splines and arbitrary knot sequences. Numerical tests
include the coagulation-growth of the exponential distribution and of a cosine hill
in logarithmic coordinates.

### Reduced Atmospheric Chemistry Models by Sensitivity Matrix Approach

Dr. Adrian Sandu CS-TR-01-08

Simulation of chemical kinetics typically claims about 80 to 90% of the total computational
time in three-dimensional atmospheric chemistry-transport models; to alleviate this
computational burden, reduced chemical models are often employed. Reduced models are
built via a dynamical analysis that results in the elimination of the fastest time
scales, or by tuning a simpler model to reproduce the input-output relations of the
kinetic system.

The sensitivity matrix is a linear entity associated with the dynamics of a nonlinear
system. In this communication we show that sensitivity matrices can be used to identify
low-dimensional linear evolution manifolds and to construct reduced kinetic models.

### The Kinetic PreProcessor KPP - A Software Environment for Solving Chemical Kinetics

Valeriu Damian, Dr. Adrian Sandu, Mirela Damian, Florian Potra and Gregory Carmichael
CS-TR-01-07

The KPP kinetic preprocessor is a software tool that assists the computer simulation
of chemical kinetic systems. The concentrations of a chemical system evolve in time
according to the differential law of mass action kinetics. A computer simulation requires
the implementation of the differential system and its numerical integration in time.

KPP translates a specification of the chemical mechanism into FORTRAN or C simulation
code that implements the concentration time derivative function and its Jacobian,
together with a suitable numerical integration scheme. Sparsity in the Jacobian is carefully
exploited in order to obtain computational efficiency.

KPP incorporates a library with several widely used atmospheric chemistry mechanisms;
users can add their own chemical mechanisms to the library. KPP also includes a comprehensive
suite of stiff numerical integrators. The KPP development environment is designed
in a modular fashion and allows for rapid prototyping of new chemical kinetic schemes
as well as new numerical integration methods.

### Rosenbrock-Nystrom Integrators for SSODE of Multibody Dynamics

Dr. Adrian Sandu, D. Negrut, E.J. Haug, F.A. Potra and C. Sandu CS-TR-01-06

When performing dynamic analysis of a constrained mechanical system, a set of index
3 Differential-Algebraic Equations (DAE) describes the time evolution of the system
model. The paper presents a state-space based method for the numerical solution of
the resulting DAE. A subset of so-called independent generalized coordinates, equal
in number to the number of degrees of freedom of the mechanical system, is used to
express the time evolution of the mechanical system. The second-order state-space
ordinary differential equations (SSODE) that describe the time variation of the independent
coordinates are numerically integrated using a Rosenbrock-type formula. For stiff
mechanical systems, the proposed algorithm is shown to significantly reduce simulation
times compared to state-of-the-art existing algorithms. The improved efficiency
is due to the use of an L-stable integrator and a rigorous, general approach to
providing the analytical derivatives it requires.

### A Communication Library for the Parallelization of Air Quality Models on Structured Grids

Philip Miehe, Dr. Adrian Sandu, Gregory R. Carmichael, Youhua Tang, and Dacian Daescu
CS-TR-01-05

PAQMSG is an MPI-based, Fortran 90 communication library for the parallelization of
Air Quality Models on structured grids. It consists of distribution, gathering and
repartitioning routines for different domain decompositions implementing a master-worker
strategy. The library is architecture and application independent and includes optimization
strategies for different architectures. This paper presents the library from a user
perspective. Results are shown from the parallelization of STEM-III on Beowulf clusters.

The PAQMSG library is available on the web. The communication routines are easy to
use, and should allow for an immediate parallelization of existing air quality models.
PAQMSG can also be used for constructing new models.

### A Spectral Method for Solving Aerosol Dynamics

Dr. Adrian Sandu CS-TR-01-04

This paper presents a discretization of the particle dynamics equation based on spectral
polynomial interpolation of the solution. Numerical tests show that the method achieves
spectral accuracy in linear coordinates, and is able to reproduce well the reference
solutions in logarithmic coordinates.

### A Framework for the Numerical Treatment of Aerosol Dynamics

Dr. Adrian Sandu and Christian Borden CS-TR-01-03

This paper presents a general framework for the discretization of the particle dynamics
equation by projection methods, which include Galerkin and collocation techniques.
Based on this framework a discretization over piecewise polynomial spaces is discussed.
Numerical examples are given using both linear and logarithmic coordinates; the results
show that this discretization approach is able to accurately solve aerosol size distribution
using a relatively small number of "size bins."

### A Newton-Cotes Quadrature Approach for Solving the Aerosol Coagulation Equation

Dr. Adrian Sandu CS-TR-01-02

This paper presents a direct approach to solving the aerosol coagulation equation.
Newton-Cotes formulas are used to discretize the integral terms, and the semi-discrete
system is built using collocation. A semi-implicit Gauss-Seidel time integration method
is employed. The approach generalizes the semi-implicit method of Jacobson.
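As an illustration, such a discretization can be sketched for the Smoluchowski coagulation equation on a uniform volume grid. The sketch below uses the two-point closed Newton-Cotes (trapezoidal) rule for the integral terms and a semi-implicit, in-place update sweep; the grid layout, kernel interface, and domain truncation are illustrative assumptions, not the paper's exact scheme.

```python
def coagulation_step(n, beta, dv, dt):
    """One semi-implicit step for the discretized coagulation equation.

    n[i] approximates the number density at volume v_i = (i + 1) * dv on a
    uniform grid, and beta(v, u) is the coagulation kernel. Grid, kernel,
    and truncation choices here are illustrative assumptions.
    """
    nb = len(n)
    for i in range(nb):  # in-place sweep, small bins first (Gauss-Seidel style)
        vi = (i + 1) * dv
        # Gain: 0.5 * integral_0^{vi} beta(vi - u, u) n(vi - u) n(u) du,
        # discretized with the trapezoidal rule (two-point Newton-Cotes);
        # the endpoint terms vanish because n(0) = 0.
        gain = 0.0
        for k in range(1, i + 1):
            gain += beta((i + 1 - k) * dv, k * dv) * n[i - k] * n[k - 1]
        gain *= 0.5 * dv
        # Loss coefficient: integral_0^inf beta(vi, u) n(u) du,
        # truncated at the end of the grid.
        loss = dv * sum(beta(vi, (j + 1) * dv) * n[j] for j in range(nb))
        # Treating the loss term implicitly keeps the update non-negative.
        n[i] = (n[i] + dt * gain) / (1.0 + dt * loss)
    return n
```

For example, starting with all particles in the smallest bin under a constant kernel, one step leaves every concentration non-negative and reduces the total number density, as coagulation should.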

### The Location Consistency Memory Model and Cache Protocol: Specification and Verification

Dr. Charles Wallace, Guy Tremblay, and Jose Nelson Amaral CS-TR-01-01

We use the *Abstract State Machine* methodology to give formal operational semantics for the *Location Consistency* memory model and cache protocol. With these formal models, we prove that the cache
protocol satisfies the memory model, but in a way that is strictly stronger than necessary,
disallowing certain behavior allowed by the memory model.

## 2000

### Tools for Distributed Systems Instruction

Meghana Gothe and Dr. Jean Mayo CS-TR-00-02

Teaching a distributed systems class becomes a difficult task when the student community
comes from varied backgrounds. These systems are complex, as they are distributed across
a network of component processes. A majority of the students have a sequential programming
background; hence, distributed systems concepts pose a challenge of re-learning, making
the teacher's job all the more difficult. Furthermore, distributed systems rely on
message passing for communication between component processes, and the large amount of
system-specific network interface detail obscures the actual concepts. Tools that
attempt to eliminate this obscurity are needed. The students are introduced to numerous
protocols, which form the basic foundation of distributed systems programming, but a
shortage of time prevents students from implementing them all. A simulation of these
protocols that students can use to build understanding is therefore needed.

With this motivation, we have implemented a tool that provides certain communication
primitives and some important distributed system protocols. The communication primitives
provide an easy interface that students can use in their assignments. Protocol execution
dumps enable the students to correlate executions with the topics taught in class.

### Time-Stepping Methods that Favor Positivity for Atmospheric Chemistry Modeling

Dr. Adrian Sandu CS-TR-00-01

Chemical kinetics conserves mass and renders non-negative solutions; a good numerical
simulation would ideally produce mass-balanced, positive concentration vectors. Many
time-stepping methods are mass conservative; however, unconditional positivity restricts
the order of a traditional method to one. The projection method presented in [13]
post-processes the solution of a one-step integration method to ensure mass conservation
and positivity. It works best when the underlying time-stepping scheme favors positivity.
In this paper several Rosenbrock-type methods that favor positivity are presented;
they are designed such that their transfer functions, together with several derivatives,
are non-negative for all real, negative arguments.
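The design criterion can be made concrete on the scalar test problem y′ = λy with real λ ≤ 0 (a standard illustration, not taken from the report): a one-step method advances the solution through its transfer function R,

```latex
y_{n+1} = R(h\lambda)\, y_n, \qquad
\text{positivity is favored when } R(z) \ge 0 \text{ for all real } z \le 0 .
```

For instance, implicit Euler has R(z) = 1/(1 − z) > 0 for every z ≤ 0, while explicit Euler has R(z) = 1 + z, which is negative for z < −1 and can therefore drive concentrations negative.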

## 1999

### Adjoint Implementation of Rosenbrock Methods Applied to Variational Data Assimilation Problems

Dacian Daescu, Gregory Carmichael and Dr. Adrian Sandu CS-TR-99-06

In the past decade the variational method has become an attractive alternative to
the traditional Kalman filter in data assimilation problems for atmospheric chemistry
models. From the computational point of view, its major advantage is that the gradient
of the cost function is computed quickly, at the expense of only a few function evaluations,
making the optimization process very efficient. The drawbacks are the high storage
requirements and the difficulty of implementing adjoint code when sophisticated integrators
are used to solve the stiff chemistry. If the sparse structure of the models is carefully
exploited, Rosenbrock solvers have proved to be a reliable alternative to classical
schemes. In this paper we present an efficient implementation of the adjoint code
for Rosenbrock-type methods which can reduce the storage requirements of the forward
model and is suitable for automation.

### Positive Numerical Integration Methods for Chemical Kinetic Systems

Dr. Adrian Sandu CS-TR-99-05

Mass action chemical kinetics conserves mass and renders non-negative solutions; a
good numerical simulation would ideally produce a mass balanced, positive numerical
concentration vector. Many time stepping methods are mass conservative; however, unconditional
positivity restricts the order of a traditional method to one. The positive projection
method presented in the paper ensures mass conservation and positivity. First, a numerical
approximation is computed with one step of a mass-preserving traditional scheme. If
there are negative components the nearest vector in the reaction simplex is found
using a primal-dual optimization routine; this vector is shown to better approximate
the true solution. The technique works best when the underlying time-stepping scheme
favors positivity. Projected methods are able to use larger integration time steps,
making them more efficient than traditional methods for systems that are unstable outside
the positive quadrant.

### Stable Predicate Evaluation with Probabilistically Synchronized Clocks

Dr. Jean Mayo and Don Darling CS-TR-99-04

The evaluation of a global state in a distributed system is very difficult, since
the individual processes that comprise the system share no common memory or time base,
and can only communicate with each other via message passing. Although no common time
base exists in these systems, the system clocks are often roughly synchronized within
some known upper bound. In previous work, Mayo and Kearns developed a schema for evaluating
a class of stable predicates on systems with a rough global time base. These global
predicates are comprised of the conjunction of the local process states. The fact
that the predicates are stable means that once they become true, they remain true
forever. The ability to detect stable predicates in the system allows us to detect
critical system states such as program termination and deadlock.

We propose to develop a protocol based on Mayo and Kearns's schema, which will also
handle systems where clocks are synchronized with a probabilistic clock synchronization
protocol. Also, our protocol will allow processes to enter and leave the system, facilitating
its use in dynamic systems. Finally, we plan to implement our protocol on an actual
system in order to see how it performs in general use.

### Do Blending and Offsetting Commute for Dupin Cyclides?

Dr. Ching-Kuang Shene CS-TR-99-03

CS-TR-99-03 is no longer available. This work has been published in "Computer-Aided
Geometric Design," Vol. 17, pp. 891-910, 2000.

### Register Assignment for Software Pipelining with Partitioned Register Banks

Jason Hiser, Dr. Philip Sweany, Dr. Steven Carr, and Dr. Steven Beaty CS-TR 99-02

Trends in instruction-level parallelism (ILP) are putting increased demands on a machine's
register resources. Computer architects are adding increased numbers of functional
units to allow more instructions to be executed in parallel, thereby increasing the
number of register ports needed. This increase in ports lengthens access time. Additionally,
aggressive scheduling techniques, such as software pipelining, are requiring more
registers to attain high levels of ILP. These two factors make it difficult to maintain
a single monolithic register bank.

One possible solution to the problem is to partition the register banks and connect
each functional unit to a subset of the register banks, thereby decreasing the number
of register ports required. In this type of architecture, if a functional unit needs
access to a register to which it is not directly connected, a delay is incurred to
copy the value so that the proper functional unit has access to it. As a consequence,
the compiler now must not only schedule code for maximum parallelism but also for
minimal delays due to copies.

In this paper, we present a greedy framework for generating code for architectures
with partitioned register banks. Our technique is global in nature and can be applied
using any scheduling method (e.g., trace scheduling, modulo scheduling) and any register
assignment technique. We present an experiment using our partitioning techniques with
modulo scheduling and Chaitin/Briggs register assignment. Our results show that it
is possible to generate good code for partitioned-register-bank architectures.

### Register Assignment for Architectures with Partitioned Register Banks

Jason Hiser, Dr. Philip Sweany and Dr. Steven Carr CS-TR 99-01

Modern computers take advantage of the instruction-level parallelism (ILP) available
in programs with advances in both architecture and compiler design. Unfortunately,
large amounts of ILP hardware and aggressive instruction scheduling techniques put
great demands on a machine's register resources. With increasing ILP, it becomes difficult
to maintain a single monolithic register bank and a high clock rate. The number of
ports required for such a register bank severely hampers access time. To provide support
for large amounts of ILP while retaining a high clock rate, registers can be partitioned
among several different register banks. Each bank is directly accessible by only a
subset of the functional units with explicit inter-bank copies required to move data
between banks. Therefore, a compiler must deal not only with achieving maximal parallelism
via aggressive scheduling, but also with data placement to limit inter-bank copies.
Our approach to code generation for ILP architectures with partitioned register resources
provides flexibility by representing machine dependent features as node and edge weights
and by remaining independent of scheduling and register allocation methods. Preliminary
experimentation with our framework has shown a degradation in execution performance
of 10% on average when compared to an unrealizable monolithic-register-bank architecture
with the same level of ILP. This compares very favorably with other approaches to
the same problem.

## 1998

### Instruction Manual for Schemm's Configurable Analyzer v0.1

Evan Schemm CS-TR 98-02

In a research environment, it is often useful to check the effects of changes or additions
to existing architectures. Building each variation of a design used in a comparative
study would be difficult, as well as prohibitively expensive. Usually a simulator
is used to generate a trace of the executed instructions that can be run through an
analyzer. This analyzer is designed to be configurable for a wide range of target architectures.
Different architectural variations are specified via a configuration file, alleviating
the need to recompile. The analyzer returns statistics on instruction and data cache
hit rates, branch prediction, and total execution time.

### An Efficient Intersection Detection Algorithm for Truncated Cylinders

Dr. Ching Kuang Shene and Yuan Zhao CS-TR 98-01

This paper presents an efficient algorithm for detecting the intersection of two truncated
cylinders. A truncated cylinder is the solid bounded by a section of a cylinder and
two cap disks whose planes are perpendicular to the axis of the cylinder. This paper
characterizes the intersection and reduces the problem to the intersection detection
of a disk and a truncated cylinder which, in turn, is reduced to testing if a degree
four polynomial has a real root in an interval. The method of choice is a case-by-case
analysis. This type of geometric analysis has been known for its robustness and efficiency,
although the program that implements the algorithm could be larger.

## 1997

### Unroll-and-Jam Using Uniformly Generated Sets

Dr. Steven Carr and Yiping Guan CS-TR 97-04

Modern architectural trends in instruction-level parallelism (ILP) have significantly
increased the computational power of microprocessors. As a result, the demands
on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical
cache structures are ineffective if programs do not exhibit cache locality. Because
of this, compilers need to be concerned not only with finding ILP to utilize machine
resources effectively, but also with ensuring that the resulting code has a high degree
of cache locality.

One compiler transformation that is essential for a compiler to meet the above objectives
is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based
model to compute unroll amounts, significantly increasing the size of the dependence
graph, or has applied brute-force techniques. In this paper, we present an algorithm
that uses linear-algebra-based techniques to compute unroll amounts that save 84%
of the dependence-graph space needed by dependence-based techniques, while retaining
performance, and that gives a more elegant solution.
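To fix ideas, unroll-and-jam can be illustrated on a simple doubly nested loop. The sketch below (a hypothetical matrix-vector kernel, not an example from the paper) unrolls the outer loop by two and jams the resulting inner loops together, so each element of `x` fetched in the inner loop is reused across two outer iterations; a compiler would choose the unroll amount from its model.

```python
def matvec(A, x):
    """Baseline doubly nested loop: y[i] += A[i][j] * x[j]."""
    n, m = len(A), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y

def matvec_unroll_and_jam(A, x):
    """Outer loop unrolled by 2, inner loops jammed: each x[j] fetched
    in the inner loop now serves two rows (illustrative sketch)."""
    n, m = len(A), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        for j in range(m):
            xj = x[j]                     # one fetch reused for two rows
            y[i] += A[i][j] * xj
            y[i + 1] += A[i + 1][j] * xj
        i += 2
    if i < n:                             # cleanup row when n is odd
        for j in range(m):
            y[i] += A[i][j] * x[j]
    return y
```

Both versions compute the same result; the transformed loop simply exposes more reuse per inner-loop iteration.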

### A Fast Geometric Conic Reconstruction Algorithm

Dr. Ching Kuang Shene and Dr. George Paschos CS-TR 97-03

Conics are the first curvilinear feature beyond points and line segments that are
commonly used in computer vision. This paper addresses the problem of determining
if two conics, from two views, are the projections of a conic in space. If the answer
is affirmative, the conic is recovered from the information given in the images and
the camera positions. This problem is important in the study of stereo vision and shape from
motion.

### Geometric Computing in the Undergraduate Computer Science Curricula

Dr. John Lowther and Dr. Ching Kuang Shene CS-TR 97-02

Geometric computing is a rapidly evolving interdisciplinary field involving computer
science, engineering and mathematics. It has relationships to many other areas within
computer science such as computational geometry, computer-aided design, computer graphics,
computer vision, geometric/solid modeling, robotics, and visualization. Due to its
interdisciplinary nature, geometric computing also serves as a vehicle for engineering
students to approach important topics in manufacturing processes (NC-machining, molding,
and milling), geometric design/modeling (feature-based design and constraint solving),
and computational metrology (sampling strategy and tolerancing design). Unfortunately,
in a typical computer science curriculum, computing with geometry
is virtually missing in spite of its tremendous impact on computer science and the
other fields mentioned above and its importance in increasing students' employability.

This paper investigates this issue and suggests a possible remedy by designing an
elementary and interdisciplinary intermediate level geometric computing course. The
rationale, goals and objectives, and the impact and significance of the proposed course
are discussed. A review of previous work, including textbooks, software systems, and
other media materials, is included. Finally, after presenting the merit of designing
this course, this paper proposes a sample ten-week syllabus with software modules
for lab use.

### An Optimal In-Place Array Rotation Algorithm

Dr. Ching Kuang Shene CS-TR 97-01

This paper presents a simple optimal algorithm for rotating an array of *n* elements to the right by *D* positions using *n* + gcd(*n*, *D*) assignments and one extra memory location. Its performance is compared against
the HP STL implementation, which uses 3(*n* − gcd(*n*, *D*)) assignments, and a commonly seen algorithm that uses an auxiliary array.
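An algorithm with this operation count can be sketched as the classic cycle-following ("juggling") rotation: each of the gcd(n, D) cycles is walked once, saving one element in a temporary. This is a sketch consistent with the stated cost, not necessarily the paper's exact formulation.

```python
from math import gcd

def rotate_right(a, d):
    """Rotate list `a` right by `d` positions in place.

    Cycle-following ("juggling") scheme: n + gcd(n, d) element
    assignments and a single temporary variable.
    """
    n = len(a)
    if n == 0:
        return a
    d %= n
    if d == 0:
        return a
    for start in range(gcd(n, d)):
        tmp = a[start]          # the one extra memory location
        i = start
        while True:
            j = (i - d) % n     # element that moves into slot i
            if j == start:
                break
            a[i] = a[j]
            i = j
        a[i] = tmp              # close the cycle
    return a
```

Each cycle of length L costs L + 1 assignments (one into the temporary, L back into the array), and there are gcd(n, d) cycles, giving n + gcd(n, d) in total.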

## 1996

### Modulo Scheduling with Cache-Reuse Information

Chen Ding, Dr. Steven Carr and Dr. Philip Sweany CS-TR 96-08

Instruction scheduling in general, and software pipelining in particular, face the difficult task of scheduling operations in the presence of uncertain latencies. The largest contributor to these uncertain latencies is the use of cache memories required to provide adequate memory access speed in modern processors. Scheduling for instruction-level parallel architectures with non-blocking caches usually assigns memory access latency by assuming either that all accesses are cache hits or that all are cache misses. We contend that allowing memory latencies to be set by cache reuse analysis leads to better software pipelining than using either the all-hit or all-miss assumption. Using a simple cache reuse model in our modulo scheduling software pipelining optimization, we achieved a benefit of 10% improved execution performance over assuming all-cache-hits and we used 18% fewer registers than were required by an all-cache-miss assumption. In addition, we outline refinements to our simple reuse model that should allow modulo scheduling with reuse to achieve improved execution performance over the all-cache-miss assumption as well. Therefore, we conclude that software pipelining algorithms for target architectures with non-blocking cache, but without rotating register files, should use a memory-reuse latency model.

### Blending with Affine and Projective Dupin Cyclides

Dr. Ching Kuang Shene CS-TR 96-07

This paper presents three new construction algorithms for blending a plane and a quadric
cone with affine and projective Dupin cyclides. The first algorithm is capable of
constructing an affine Dupin cyclide that blends a quadric cone (along an ellipse)
and a plane. The second one is equivalent to the first but is based on geometric methods.
The third one is able to construct a projective Dupin cyclide that blends a plane
and a quadric cone along a hyperbola or a parabola. Thus, the constructed projective
Dupin cyclide "smooths/covers" the intersection angle made by the given surfaces.
All algorithms use affine or projective transformations to reduce the given configuration
to a right cone and a plane so that the blending Dupin cyclide construction can be
applied.

### Rotating an Array In-Place

Dr. Ching Kuang Shene CS-TR 96-06

A simple algorithm that rotates an array of *n* elements to the right by *D* positions using *n* + gcd(*n*, *D*) assignments and no extra space is presented in this paper. Its performance with
different GCD algorithms is compared against a commonly seen algorithm that uses an
auxiliary array. Techniques used include regression analysis for establishing linear
models and the Kolmogorov-Smirnov test for comparing performance data. The algorithm
and the comparison techniques could serve as a model for CS2 students to carry out
experimental analysis of algorithms.

### Planar Intersection, Common Inscribed Sphere and Blending with Dupin Cyclides

Dr. Ching Kuang Shene CS-TR 96-05

In his paper "Cyclides in Computer Aided Geometric Design," Computer Aided Geometric
Design, Vol. 7 (1990), pp. 221--242, Pratt announced the following result: two cones
can be blended by a Dupin cyclide if and only if they have a common inscribed sphere.
However, only a proof for the sufficient part is given. This paper fills this gap
with a sequence of existential results. More precisely, for two cones with distinct
axes and distinct vertices, it will be shown that the existence of a planar intersection,
the existence of a common inscribed sphere, the existence of a pair of common tangent
planes, and the existence of a blending Dupin cyclide are all equivalent. Moreover,
the existence of a pair of common tangent planes is reduced to a test of equal heights,
simplifying Piegl's conic intersection detection and calculation algorithm. Degenerate
cases will also be addressed.

### Scalar Replacement and the Law of Diminishing Returns

Dr. Steven Carr and Qunyan Wu and Dr. Philip Sweany CS-TR 96-04

Scalar replacement is a transformation that replaces references to arrays with sequences
of scalar temporaries to effect register allocation of array values. Scalar replacement
algorithms with varying degrees of power have been proposed; however, no study has
been done to show what degree of power is necessary in practice. Are simple algorithms
enough? In addition, each algorithm may be applied using any number of dependence
tests. More powerful tests should yield larger subsets of references that can be scalar
replaced. Is this true?

This paper addresses these two questions. Our experiments show that nearly all opportunities
for scalar replacement that exist in scientific benchmark code can be handled by simple
algorithms. Also, powerful dependence analysis is not needed. Our results show that
scalar replacement is an effective transformation that is cheap to implement.
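The transformation itself is easy to picture. In the sketch below (an illustrative kernel, not taken from the paper), the reference `a[i-1]` is replaced by a scalar temporary carried across iterations, so the reused value can live in a register instead of being reloaded from memory:

```python
def smooth(a):
    """Baseline: reads a[i] and a[i-1] from the array each iteration."""
    b = [0.0] * len(a)
    for i in range(1, len(a)):
        b[i] = a[i] + a[i - 1]
    return b

def smooth_scalar_replaced(a):
    """Scalar replacement: a[i-1] is carried in a scalar temporary
    across iterations, so each element of `a` is loaded only once."""
    b = [0.0] * len(a)
    if len(a) < 2:
        return b
    prev = a[0]           # scalar temporary holding a[i-1]
    for i in range(1, len(a)):
        cur = a[i]
        b[i] = cur + prev
        prev = cur        # reuse flows through the scalar, not memory
    return b
```

A dependence test only has to establish that the value of `a[i-1]` is exactly the `a[i]` of the previous iteration; as the study notes, simple analysis suffices for most such patterns.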

### Eager Prediction

Spiridon A. Messaris and Dr. David A. Poplawski CS-TR 96-03

Speculative execution, in the form of branch prediction, has become a common feature
in superscalar processors and has been shown to be essential in achieving high performance.
Eager execution, the fetching, issuing and execution of instructions down both paths
of a conditional branch, has the potential to improve performance even more, but suffers
from several problems when several successive branches are encountered. In this paper
we show that eager prediction, defined to be eager execution to a certain level followed
by ordinary branch prediction afterwards, produces performance improvements over pure
branch prediction alone when sufficient resources are available; pure prediction still
wins on machines with small issue window sizes and few functional units. Given a sufficiently
large machine, the marginal cost, in terms of functional-unit needs and register and
memory bandwidth requirements, is shown to be minimal.

### Blending Two Cones with Dupin Cyclides

Dr. Ching Kuang Shene CS-TR 96-02

This paper provides a complete theory of blending cones with Dupin cyclides and consists
of four major contributions. First, a necessary and sufficient condition for two cones
to have a blending Dupin cyclide will be established. Except for a very special case
in which the cones have a double line and parallel axes, two cones have a blending
Dupin cyclide if and only if they have planar intersection. Second, based on the intersection
structure of the cones, finer characterization results are possible. For example,
all blending Dupin cyclides for two cones with a double line, distinct vertices and
intersecting axes must be cubic. Third, a new construction algorithm that establishes
a correspondence between points on one or two coplanar lines and all constructed blending
Dupin cyclides makes the construction easy and well-organized. Fourth, the completeness
of the construction algorithm will also be proved. As a result, any blending Dupin
cyclide can be constructed by the algorithm. Consequently, all blending Dupin cyclides
are organized into one to four one-parameter families, each of which is "parameterized"
by points on a special line. It will be shown that each family contains an infinite
number of ring cyclides, ensuring the existence of singularity-free blending surfaces.
Finally, color plates are provided to illustrate the various cyclides.

### On Parametric Representations of Cubic Dupin Cyclides

Dr. Ching Kuang Shene CS-TR 96-01

Srinivas and Dutta recently constructed a parametric representation for cubic Dupin
cyclides. They assumed two skew lines on a cubic Dupin cyclide are available and used
the third intersection point of a moving line that touches both skew lines to generate
a rational parameterization. In this note, it will be shown that this method is just
another form of a popular circle parameterization. A new method based on the geometry
of cubic Dupin cyclides will also be presented. Unlike Srinivas and Dutta's method,
this new method does not require the cubic Dupin cyclide to be in a special orientation,
and a parametric representation is constructed directly without consulting an implicit
equation. Implicitizing the parametric equations obtained from this new method yields
an implicit equation.

## 1995

### Dupin Cyclides as Blending Surfaces for CSG Primitives

Dr. Ching Kuang Shene CS-TR 95-10

This paper extends the existing work of using Dupin cyclides as blending surfaces
for CSG primitives. Necessary and sufficient conditions as well as construction algorithms
are presented for the cone-torus and the torus-torus cases, completing the study of
using a single cyclide to blend a pair of CSG primitives. The common inscribed sphere
criterion for two axial natural quadrics and the offset property of cyclides are used
extensively to derive important results and to construct the blending surfaces. This
paper also reveals new geometric insights as well as some interesting ways of blending
a cone and a torus, and two tori.

### Improving Data Locality with Loop Transformations

Kathryn S. McKinley (Univ. of Massachusetts-Amherst), Dr. Steve Carr and Chau-Wen
Tseng (Univ. of Maryland) CS-TR 95-09

In the past decade, processor speed has become significantly faster than memory speed.
Small, fast cache memories are designed to overcome this discrepancy, but they are
only effective when programs exhibit data locality. In this paper, we present compiler
optimizations to improve data locality based on a simple yet accurate cost model.
The model computes both temporal and spatial reuse of cache lines to find desirable
loop organizations. The cost model drives the application of compound transformations
consisting of loop permutation, loop fusion, loop distribution, and loop reversal.
We demonstrate that these program transformations are useful for optimizing many programs.
To validate our optimization strategy, we implemented our algorithms and ran experiments
on a large collection of scientific programs and kernels. Experiments with kernels
illustrate that our model and algorithm can select and achieve the best performance.
For over thirty complete applications, we executed the original and transformed versions
and simulated cache hit rates. We collected statistics about the inherent characteristics
of these programs and our ability to improve their data locality. To our knowledge,
these studies are the first of such breadth and depth. We found performance improvements
were difficult to achieve because benchmark programs typically have high hit rates
even for small data caches; however, our optimizations significantly improved several
programs.

### Compiler Blockability of Dense Matrix Factorizations

Dr. Steve Carr and R.B. Lehoucq (Rice University) CS-TR 95-08

Recent architectural advances have made memory accesses a significant bottleneck for
computational problems. Even though cache memory helps in some cases, it fails to
alleviate the bottleneck for problems with large working sets. As a result, scientists
are forced to restructure their codes by hand to reduce the working-set size to fit
a particular machine. Unfortunately, these hand optimizations create machine-specific
code that is not portable across multiple architectures without a significant loss
in performance or a significant effort to re-optimize the code. It is the thesis of
this paper that most of the hand optimizations performed on matrix factorization codes
are unnecessary because they can and should be performed by the compiler. It is better
for the programmer to express algorithms in a machine-independent form and allow the
compiler to handle the machine-dependent details. This gives the algorithms portability
across architectures and removes the error-prone, expensive and tedious process of
hand optimization. In this paper, we show that LU and Cholesky factorizations may
be optimized automatically by the compiler to be as efficient as the hand-optimized
versions found in LAPACK. We also show that QR factorization may be optimized by the
compiler to perform comparably with the hand-optimized LAPACK version on matrix sizes
that are typically run on nodes of massively parallel systems. Our approach allows
us to conclude that matrix factorizations can be expressed in a machine-independent
form with the expectation of good memory performance across a variety of architectures.

### Communication on the Intel Paragon

Hazel Shirley, Robert Reynolds, and Dr. Steven Seidel CS-TR 95-07

Simple cost models have proved inadequate to predict the cost of complex communication
operations such as the complete exchange on the Intel Paragon. To understand why this
is true this work investigated several attributes of interprocessor communication
on the Intel Paragon. It describes a standard of operation that offers the best combination
of global synchronization, message passing primitives and protocols, memory allocation
decisions, and other features of an application's communication code that affect performance.
Descriptions of the simple communication operations and two algorithms for the complete
exchange problem are given along with their observed costs. Finally, this work
reveals why simple message-passing cost models for the Intel Paragon are inadequate.
It is shown that system interrupts disrupt the synchronization of complex communication
algorithms and that a bimodal latency cost function adds to the cost of communication.
In consideration of these factors, a new model is constructed that more accurately
reflects the behavior of communication on the Intel Paragon.

### Combining Optimization for Cache and Instruction-Level Parallelism

Dr. Steven Carr CS-TR 95-06

Current architectural trends in instruction-level parallelism have dramatically increased
the computational power of microprocessors. As a result, the demands on the memory
system have increased dramatically. Not only do compilers need to be concerned with
finding enough ILP to utilize machine resources effectively, but they also need to
be concerned with ensuring that the resulting code has a high degree of cache locality.
Previous work has concentrated either on improving ILP in nested loops or on improving
cache performance. This paper presents a performance metric that can be used to guide
the optimization of loops considering the combined effects of ILP, data reuse and
latency hiding techniques. Preliminary experiments reveal that dramatic performance
improvements for loops are obtainable (a factor of nearly 9 is achieved on two kernels
on the IBM RS/6000).

### An Analysis of Unroll-and-Jam on the HP 715/50

Dr. Steve Carr and Qunyan Wu CS-TR 95-05

Current directions in computer design have resulted in dramatic improvements in the
power of microprocessors. These gains have been achieved by cycle-time improvements,
by multiple-instruction issue and by pipelined functional units. As a result of these
architectural improvements, today's microprocessors, like the HP PA-RISC, can perform
many more operations per machine cycle than their predecessors. But with these gains
have come problems. Because the latency and bandwidth of memory systems have not kept
pace with processor speed, computations are often delayed waiting for data from memory.
As a result, processors see idle computational cycles and empty pipeline stages more
frequently. In fact, memory and pipeline delays have become the principal performance
bottleneck for high-performance microprocessors. To overcome these performance problems,
programmers have learned to employ a coding style that achieves a good balance between
memory references and floating-point operations. The goal of this coding methodology
is to eliminate pipeline interlock and to match the ratio of memory operations to
floating-point operations in program loops to the optimal ratio that the target
machine can deliver. In this report, we present an experiment with an automatic compiler
method to match balance ratios that obviates the need for programmer optimization.
The goal is to improve the efficiency of schedules generated by the back end of the
HP compiler.
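
The idea of matching a loop's balance to the machine's can be made concrete with a small sketch. This is an illustrative assumption about the metric's structure, not the HP compiler's actual implementation:

```c
#include <assert.h>

/* "Loop balance" is the ratio of memory operations to floating-point
 * operations per iteration.  A loop is matched to a machine when its
 * balance does not exceed the machine balance (the memory accesses the
 * hardware can sustain per FP operation).  Field names are hypothetical. */
typedef struct {
    int mem_ops;  /* loads + stores per iteration */
    int fp_ops;   /* floating-point operations per iteration */
} loop_stats;

/* Returns 1 when the loop is memory-bound on this machine, i.e. its
 * balance exceeds the machine balance, so a balance-improving
 * transformation such as unroll-and-jam is worth attempting. */
int memory_bound(loop_stats s, double machine_balance)
{
    double loop_balance = (double)s.mem_ops / (double)s.fp_ops;
    return loop_balance > machine_balance;
}
```

For example, a DAXPY iteration (two loads, one store, one multiply, one add) has balance 3/2 and is memory-bound on a machine that can deliver only one memory operation per FP operation.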

**Improving Software Pipelining with Unroll-and-Jam**

Dr. Steve Carr, Chen Ding and Philip Sweany CS-TR 95-04

To take advantage of recent architectural improvements in microprocessors, advanced
compiler optimizations such as software pipelining have been developed. Unfortunately,
not all loops have enough parallelism in the innermost loop body to take advantage
of all of the resources a machine provides. Parallelism is normally inhibited either
by inner-loop recurrences or by a mismatch between the resource requirements of a loop
and the resources provided by the target architecture. Unroll-and-jam is a transformation
that can be used to increase the amount of parallelism in the innermost loop body
by making better use of resources and limiting the effects of recurrences. In this
paper, we demonstrate how unroll-and-jam can significantly improve the initiation
interval in a software-pipelined loop. Improvements in the initiation interval of
greater than 35% are common, while dramatic improvements of a factor of 5 are possible.
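
As a minimal illustration of the transformation (not the paper's implementation), consider a matrix-vector product: unrolling the outer loop by two and jamming the copies of the inner loop lets one load of x[j] feed two multiply-adds and gives the scheduler two independent recurrences to overlap.

```c
#include <assert.h>

#define N 8   /* assumed even, so the unrolled loop needs no cleanup code */

/* Point version: the inner loop carries a single recurrence on sum, and
 * every iteration loads both a[i][j] and x[j]. */
void matvec(double a[N][N], double x[N], double y[N])
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += a[i][j] * x[j];
        y[i] = sum;
    }
}

/* Unroll-and-jam: unroll the outer i loop by 2 and fuse (jam) the two
 * copies of the inner loop.  Each load of x[j] now feeds two multiplies,
 * improving the memory/FP balance, and the two partial sums are
 * independent recurrences the software pipeliner can interleave. */
void matvec_uaj(double a[N][N], double x[N], double y[N])
{
    for (int i = 0; i < N; i += 2) {
        double sum0 = 0.0, sum1 = 0.0;
        for (int j = 0; j < N; j++) {
            double xj = x[j];          /* one load shared by two rows */
            sum0 += a[i][j] * xj;
            sum1 += a[i + 1][j] * xj;
        }
        y[i] = sum0;
        y[i + 1] = sum1;
    }
}
```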

**An Analysis of Loop Permutation on the HP PA-RISC**

Dr. Steve Carr and Qunyan Wu CS-TR 95-03

In modern computers, processor speed has become significantly faster than memory speed.
Cache memories are designed to overcome this difference, but they are only effective
when programs exhibit data locality. In this report, we present an experiment with
compiler optimizations to improve data locality based on a simple cost model. The
model computes both temporal and spatial reuse of cache lines to find desirable loop
permutations. The cost model drives the application of compound transformations consisting
of loop permutation, loop distribution, and loop reversal. We implemented the transformations
and ran experiments on a large collection of scientific programs on both the HP 715/50
and HP 712/80. For over thirty complete applications, we executed the original and
transformed versions and simulated cache hit rates. We collected statistics about
the inherent data-locality characteristics of these programs and our ability to improve
their locality. Performance improvements were difficult to achieve because benchmark
programs typically have high hit rates; however, our optimizations significantly improved
several programs.
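
A sketch of the kind of permutation such a cost model selects (illustrative, not the report's implementation): for a row-major C array, making the inner loop walk a row turns large-stride accesses into stride-1 accesses with spatial reuse of each cache line.

```c
#include <assert.h>

#define M 64

/* Column-major traversal of a row-major array: consecutive inner
 * iterations are M doubles apart, so nearly every access touches a
 * different cache line. */
double sum_ji(double a[M][M])
{
    double s = 0.0;
    for (int j = 0; j < M; j++)
        for (int i = 0; i < M; i++)
            s += a[i][j];
    return s;
}

/* After loop permutation the inner loop walks a row with stride 1,
 * so every element of a fetched cache line is used before eviction.
 * The result is identical; only the access order changes. */
double sum_ij(double a[M][M])
{
    double s = 0.0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            s += a[i][j];
    return s;
}
```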

**The Performance of Scalar Replacement on the HP 715/50**

Dr. Steve Carr and Qunyan Wu CS-TR 95-02

Most conventional compilers fail to allocate array elements to registers because standard
data-flow analysis treats arrays like scalars, making it impossible to analyze the
definitions and uses of individual array elements. This deficiency is particularly
troublesome for floating-point registers, which are most often used as temporary repositories
for subscripted variables. This paper presents an experiment on an HP 715/50 with
a source-to-source transformation, called scalar replacement, that finds opportunities
for reuse of subscripted variables and replaces the references involved by references
to temporary scalar variables. The scalar replaced variables are more likely to be
assigned to registers by the coloring-based register allocators found in most compilers
than are their unreplaced counterparts. Experimental results show that scalar replacement
is extremely effective. On kernels, integer-factor improvements over code generated
by a good optimizing compiler of conventional design are possible.
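
A minimal before/after sketch of scalar replacement (illustrative, not the paper's tool):

```c
#include <assert.h>

#define NA 4
#define NB 5

/* Before: a[i] is re-loaded and re-stored on every j iteration, because
 * standard data-flow analysis treats the whole array a as one object
 * and cannot prove the reuse. */
void accumulate(double a[NA], double b[NB])
{
    for (int i = 0; i < NA; i++)
        for (int j = 0; j < NB; j++)
            a[i] = a[i] + b[j];
}

/* After scalar replacement: the reuse of a[i] across the j loop is
 * carried by the scalar t, which a coloring-based register allocator
 * can keep in a floating-point register for the whole inner loop. */
void accumulate_sr(double a[NA], double b[NB])
{
    for (int i = 0; i < NA; i++) {
        double t = a[i];
        for (int j = 0; j < NB; j++)
            t = t + b[j];
        a[i] = t;
    }
}
```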

**The Unlimited Resource Machine (URM)**

Dr. David A. Poplawski CS-TR 95-01

The Unlimited Resource Machine (URM) is a hypothetical instruction set architecture that was designed specifically for the study of the architecture of, and code generation techniques for, machines exhibiting instruction-level parallelism. A set of tools has been implemented to support this machine model, including compilers, a (linking) assembler, simulators, and trace analyzers. The architecture and its tools are being used in a variety of research projects at MTU. This report describes the philosophy behind the URM, the architecture and instruction set, the tools and their interfaces, and how to use the tools to model various ILP architectures.

## 1994

**Overview of the ROCKET Retargetable C Compiler**

Dr. Philip H. Sweany and Steven J. Beaty (Cray Computer Corporation) CS-TR 94-01

This report describes the ROCKET compiler, a highly optimizing C translator which
is retargetable for a broad class of ILP architectures.

**Compiler Optimizations for Improving Data Locality**

Dr. Steven Carr, Dr. Kathryn S. McKinley (Univ. of Mass. at Amherst), and Dr. Chau-Wen Tseng (Stanford Univ.) CS-TR 94-02

In the past decade, processor speed has become significantly faster than memory speed.
Small, fast cache memories are designed to overcome this discrepancy, but they are
only effective when programs exhibit data locality. In this paper, we present compiler
optimizations to improve data locality based on a simple yet accurate cost model.
The model computes both temporal and spatial reuse of cache lines to find desirable
loop organizations. The cost model drives the application of compound transformations,
consisting of loop permutation, loop distribution and loop fusion. We show that these
transformations are required to optimize many programs. To validate our optimization
strategy, we implemented our algorithms and ran experiments on a large collection
of scientific programs and kernels. Experiments with kernels illustrate that our model
and algorithm can select and achieve the best performance. For complete applications,
we executed the original and transformed versions, simulated cache hit rates, and
collected statistics about the inherent characteristics of these programs and our
ability to improve them. To our knowledge, these studies are the first with this breadth
and depth.
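
Of the compound transformations mentioned, loop fusion is the easiest to sketch. The example below is illustrative, not taken from the paper: fusing two sweeps over b[] means each element is used twice while it is still in a register or a resident cache line.

```c
#include <assert.h>

#define NF 1024

/* Two adjacent loops each sweep b[]; if b[] is larger than the cache,
 * the second loop re-fetches every one of its cache lines from memory. */
void separate(double b[NF], double c[NF], double d[NF])
{
    for (int i = 0; i < NF; i++)
        c[i] = b[i] + 1.0;
    for (int i = 0; i < NF; i++)
        d[i] = b[i] * 2.0;
}

/* Loop fusion merges the two bodies, so both uses of b[i] happen while
 * the value is still at hand.  The results are identical. */
void fused(double b[NF], double c[NF], double d[NF])
{
    for (int i = 0; i < NF; i++) {
        double t = b[i];
        c[i] = t + 1.0;
        d[i] = t * 2.0;
    }
}
```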

**An Examination of the Behavior of Slice Based Cohesion Measures**

Sakari Karstu and Dr. Linda M. Ott CS-TR 94-03

Is cohesion really indicative of other software attributes, like reliability or maintainability,
as the software community believes? This question has been difficult to answer since
there have been no automated tools for measuring the cohesion of software modules.
The development of an automated tool for collecting cohesion data made it possible
to validate the measured cohesion against the original definition of cohesion using
production software.

This study is one of the first to relate cohesion to other attributes
of software, in this case, revisions made to the software. It is commonly known that
the number of revisions is strongly related to the length of the module. As a result
of the study, we present data that supports our intuition that in addition to length,
the level of cohesion of a module is a factor in estimating the quality of the software.
In particular, the data suggests a correlation between modular
cohesion and revisions made to the module. This would suggest that highly cohesive
modules are less likely to need changes.

## 1993

**A Compiler Blockable Algorithm for QR Decomposition**

Dr. Steve Carr and Dr. Richard Lehoucq (Rice University) CS-TR 93-02

Over the past decade, microprocessor design strategies have focused on increasing
the computational power on a single chip. Unfortunately, memory speeds have not kept
pace, resulting in an imbalance between computation speed and memory speed. This imbalance
is leading machine designers to use more complicated memory hierarchies. In turn,
programmers are explicitly restructuring codes to perform well on particular memory
systems, leading to machine-specific programs. The compiler should automatically perform
restructuring well enough to obviate the need for hand-optimization, thus allowing
machine-independent expression of algorithms with the expectation of good performance.

This paper extends previous work on the ability of a compiler to restructure codes
automatically to improve memory-hierarchy performance. Specifically, the paper extends
work in the automatic derivation of block algorithms from their corresponding point
algorithms. Previous work showed that the compiler could not automatically derive
the block algorithm for QR factorization found in LAPACK from its corresponding point
algorithm because the block LAPACK version is a new algorithm rather than a reshaping
of the point algorithm. This paper presents a compiler-derivable block algorithm.
On smaller array sizes, the algorithm is shown to be effective enough to obviate the
need for the tediously hand-optimized algorithm found in LAPACK implementations. By
relying on the compiler to do the optimization, the algorithm remains machine independent
and retains high-performance across multiple architectures, without programmer intervention.

**Profiling the Communication Workload of an iPSC/860**

Brian VanVoorst and Dr. Steven Seidel CS-TR 93-03

A communication profiler was developed for the Intel iPSC/860 hypercube. The profiler
was implemented as a kernel modification so that communication traffic could be monitored
transparently. The profiler was used to gather information about the communication
traffic in applications that comprised an iPSC/860's daily workload. Since communication
is an expensive operation in a MIMD machine, the profiler was designed to show where
those costs were concentrated and to show where efforts to improve communication performance
should be directed. The 128 node iPSC/860 at the NASA Ames Research Center was profiled
for several days. Summaries of data are presented here, along with recommendations
that include proposed changes to the long message protocol to reduce transmission
latency.

**Broadcast and Complete Exchange Algorithms for Mesh Topologies**

Siddhartha Takkella and Dr. Steven Seidel CS-TR 93-04

The availability of multicomputers with a mesh interconnection network has created
a need to develop efficient global communication algorithms for that architecture.
The broadcast and the complete exchange are the two communication problems of primary
interest in this work. Improved complete exchange algorithms for meshes of arbitrary
size, and for square meshes in particular, have been developed. Motivated by the efficiency
of pipelined broadcast algorithms for hypercubes, efforts were concentrated on developing
similar broadcast algorithms for meshes, resulting in an improved pipelined broadcast
algorithm for meshes. Performance results from the Intel Delta mesh are also given
for one of the new complete exchange algorithms.
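
The advantage of pipelining a broadcast can be seen with a back-of-the-envelope timing model (an illustrative assumption, not the report's analysis): forwarding a message of m packets over a path of d hops store-and-forward takes d*m packet times, while pipelining the packets overlaps the hops.

```c
#include <assert.h>

/* Store-and-forward: each of the d hops waits for the whole m-packet
 * message before forwarding it. */
int store_and_forward_time(int d, int m)
{
    return d * m;
}

/* Pipelined: the first packet reaches the far end after d packet times,
 * and the remaining m - 1 packets stream out one per packet time. */
int pipelined_time(int d, int m)
{
    return d + m - 1;
}
```

For long messages on long paths the pipelined cost approaches d + m, which is why pipelined hypercube broadcasts motivated the mesh algorithms above.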

**An Algorithm for Identifying Regions of a DNA Sequence that Satisfy a Content Requirement**

Dr. Xiaoqiu Huang CS-TR 93-05

We present a dynamic programming algorithm for identifying regions of a DNA sequence
that meet a user-specified compositional requirement. Applications of the algorithm
include finding C+G-rich regions, locating TA+CG-deficient regions, identifying CpG
islands, and finding regions rich in periodic three-base patterns. The algorithm
has an advantage over the simple window method in that the algorithm shows the exact
location of each identified region. The algorithm has been implemented as a portable
C program called LCP (Local Content Program). LCP is extremely efficient in computer
time and memory; it instantly locates all regions of a long DNA sequence meeting a
given requirement. The LCP program was used to analyze the rabbit β-like globin gene
cluster sequence.
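
The flavor of content-based region finding can be shown with a small sketch. LCP's actual recurrence and its exact-boundary guarantees are in the report; the illustrative version below scores C/G as +1 and A/T as -2 (assumed weights) and finds the highest-scoring contiguous region with a single linear scan.

```c
#include <assert.h>

/* Find the maximum-scoring contiguous C+G-rich region of seq under a
 * +1 (C/G) / -2 (A/T) scoring.  Writes the region's inclusive start and
 * end indices and returns its score.  Illustrative sketch only. */
int best_cg_region(const char *seq, int *start, int *end)
{
    int best = 0, cur = 0, cur_start = 0;
    *start = *end = -1;
    for (int i = 0; seq[i]; i++) {
        int s = (seq[i] == 'C' || seq[i] == 'G') ? 1 : -2;
        if (cur + s > 0) {
            cur += s;                 /* extend the current region */
        } else {
            cur = 0;                  /* abandon it; restart after i */
            cur_start = i + 1;
        }
        if (cur > best) {             /* record the best region so far */
            best = cur;
            *start = cur_start;
            *end = i;
        }
    }
    return best;
}
```

Unlike a fixed-size sliding window, this scan reports the exact boundaries of the identified region, which is the advantage the abstract highlights.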

**A Context Dependent Method for Comparing Sequences**

Dr. Xiaoqiu Huang CS-TR 93-06

A scoring scheme is presented to measure the similarity between two biological
sequences, in which matches are weighted according to their context. The scheme generalizes
a widely used scoring scheme. A dynamic programming algorithm is developed to compute
a largest-scoring alignment of two sequences.
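
The context-dependent weighting is the report's contribution; as a baseline, here is the standard global-alignment recurrence it generalizes, with an assumed fixed scoring of +1 for a match, -1 for a mismatch, and -1 per gap.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64

static int max3(int a, int b, int c)
{
    int m = a > b ? a : b;
    return m > c ? m : c;
}

/* Standard global-alignment score of x and y by dynamic programming:
 * d[i][j] is the best score aligning the first i characters of x with
 * the first j characters of y.  Sequences assumed at most MAXLEN long. */
int align_score(const char *x, const char *y)
{
    int m = strlen(x), n = strlen(y);
    static int d[MAXLEN + 1][MAXLEN + 1];

    for (int i = 0; i <= m; i++) d[i][0] = -i;   /* leading gaps in y */
    for (int j = 0; j <= n; j++) d[0][j] = -j;   /* leading gaps in x */

    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int sub = d[i-1][j-1] + (x[i-1] == y[j-1] ? 1 : -1);
            int del = d[i-1][j] - 1;             /* gap in y */
            int ins = d[i][j-1] - 1;             /* gap in x */
            d[i][j] = max3(sub, del, ins);
        }
    return d[m][n];
}
```

In the context-dependent scheme, the fixed +1 match weight would instead depend on the characters surrounding the match.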

**Alignment of Three Sequences in Quadratic Space**

Dr. Xiaoqiu Huang CS-TR 93-07

We present a dynamic programming algorithm for computing an optimal global alignment
of three sequences in quadratic space. The algorithm has been implemented as a portable
C program. The program makes it possible to align simultaneously any three protein
sequences on workstations with tens of megabytes of main memory. This rigorous program
is used as a benchmark to evaluate the performance of a heuristic multiple sequence
alignment program on protein sequences from the Protein Identification Resource protein
sequence database. Experimental results indicate that the rigorous program may perform
significantly better than the heuristic program on three sequences of weak similarity.
We also describe algorithms for computing local alignments among three sequences in
quadratic space.

**Improving the Ratio of Memory Operations to Floating-Point in Loops**

Dr. Steven Carr and Dr. Ken Kennedy (Rice University) CS-TR 93-08

Over the past decade, microprocessor design strategies have focused on increasing
the computational power on a single chip. Because computations often require more
data from cache per floating-point operation than a machine can deliver and because
operations are pipelined, idle computational cycles are common when scientific applications
are run on modern microprocessors. To overcome these bottlenecks, programmers have
learned to use a coding style that ensures a better balance between memory references
and floating-point operations. In our view, this is a step in the wrong direction
because it makes programs more machine-specific. A programmer should not be required
to write a new program version for each new machine; instead, the task of specializing
a program to a target machine should be left to the compiler.

But is our view practical? Can a sophisticated optimizing compiler obviate the need
for the myriad of programming tricks that have found their way into practice to improve
performance of memory hierarchy? In this paper we attempt to answer that question.
To do so, we develop and evaluate techniques that automatically restructure program
loops to achieve high performance on specific target architectures. These methods
attempt to balance computation and memory accesses and seek to eliminate or reduce
pipeline interlock. To do this, they statically estimate the performance of each loop
in a particular program and use these estimates to determine whether to apply various
loop transformations.