Exploring Software Cache Design for Remote References in UPC Applications
Wei Wang, Steve Carr, Steven R. Seidel, and Zhenlin Wang CS-TR-12-02
Good performance of partitioned global address space languages running on commodity clusters requires efficient handling of remote memory accesses. Compilers and runtime systems can reduce the costs of such accesses by modeling remote memory access patterns. This paper uses an extended single-threaded reuse distance model that predicts remote memory access behavior for Unified Parallel C (UPC) applications to guide software cache design.
The remote memory access model presented here handles changes in per-thread problem size and changes in access patterns due to problem size increases. We have used this model to design application-specific software caches for remote references, considering cache-line length, total cache size, and other design features, including no cache at all, as design options. For twelve UPC application kernels encompassing five out of seven identified parallel application types, no single cache design works best for all applications. However, our analysis can predict an effective design based upon the reuse characteristics of the application. Finally, we show that this approach can be expected to scale well to thousands of threads.
iMap - A Stable Layout for Navigating Large Image Collections
with Embedded Search
Chaoli Wang, John P. Reese, Huan Zhang, Jun Tao, Robert J. Nemiroff CS-TR-12-01
Effective techniques for organizing and visualizing large image collections are in growing demand as visual search gets increasingly popular. We present our solution for image search and clustering based on the evaluation image similarity using both visual and textual information. To lay out images, we introduce
iMap, a treemap representation for visualizing and navigating image search and clustering results. iMap not only makes effective use of available display area to arrange images but also maintains stable update when images are inserted or removed during the query. We also develop an embedded visualization that integrates image tags for in-place search refinement. Using an online astronomy archive with thousands of images, we show the effectiveness of our approach by demonstrating experimental results and conducting a comparative user study.
Local Reasoning for Global Convergence of Parameterized Rings
This report demonstrates a method that can generate Self-Stabilizing (SS) parameterized protocols that are generalizable; i.e., correct for arbitrary number of finite-state processes. Specifically, we present necessary and sufficient conditions for deadlock-freedom specified in the local state space of the representative process of parameterized rings. Moreover, we introduce sufficient conditions that guarantee live lock-freedom in arbitrary-sized unidirectional rings. More importantly, we sketch a methodology for automated design of global convergence in the local state space of the representative process. We illustrate our method in the context of several examples including maximal matching, agreement, two-coloring, three-coloring and sum-not-two protocols.
UPC-SPIN: A Framework for the Model Checking of UPC Programs
This paper presents a method supported by a software framework for the model checking of Unified Parallel C (UPC) programs. The proposed framework includes a front-end compiler that generates finite models of UPC programs in the modeling language of the SPIN model checker. The model generation is based on a set of sound abstraction rules that transform the UPC synchronization primitives to semantically-equivalent code snippets in SPIN's modeling language. The back-end includes SPIN that verifies the generated model. If the model checking succeeds, then the UPC program is correct with respect to properties of interest such as data race-freedom and/or deadlock-freedom. Otherwise, the back-end provides feedback as sequences of UPC instructions that lead to a data race or a deadlock from initial states, called counterexamples. Using the UPC-SPIN framework, we have detected design flaws in several real-world UPC applications, including a program simulating heat flow in metal rods, parallel bubble sort, parallel data collection, and an integer permutation program. More importantly, for the first time (to the best of our knowledge), we have mechanically verified data race-freedom and deadlock-freedom in a UPC implementation of the Conjugate Gradient (CG) kernel of the NAS Parallel Benchmarks (NPB).We believe that UPC-SPIN provides a valuable tool for developers towards increasing their confidence in the computational results generated by UPC applications.
Swarm Synthesis of Convergence for Symmetric Protocols
Due to their increasing complexity, today's distributed systems are subject to a variety of transient faults (e.g., loss of coordination, soft errors, bad initialization), thereby making self-stabilization a highly important property of such systems. However, designing Self-Stabilizing (SS) network protocols is a complex task in part because a SS protocol should recover to a set of legitimate states from any state in its state space; i.e., convergence. Once converged, a SS protocol should remain in its set of legitimate states as long as no faults occur; i.e., closure. The verification of SS protocols is even harder as developers have to prove the interference-freedom of closure and convergence. To facilitate the design and verification of SS protocols, previous work proposes techniques that take a non-stabilizing protocol and automatically add convergence while guaranteeing interference-freedom. Nonetheless, such algorithmic methods must search an exponential space of candidate sequences of transitions that could be included in a SS protocol. This paper presents a novel method for exploiting the computational resources of computer clusters and search diversification towards increasing the chance of success in automated design of finite-state self- stabilizing symmetric protocols. We have implemented our approach in a software tool that enables an embarrassingly parallel platform for the addition of convergence. Our tool has automatically synthesized several SS protocols that cannot be designed by extant automated techniques.
Efficient LRU-Based Working Set Size Tracking
Efficient memory resource management requires the knowledge of the memory demands of applications or systems at runtime. A widely proposed approach is to construct an LRU-based miss ratio curve (MRC), which provides not only the current working set size (WSS) but also the relationship between performance and target memory allocation size. Unfortunately, the cost of LRU MRC monitoring is nontrivial. Although previous techniques adopt some optimizations, they still incur large overheads. To attack this problem, this paper proposes a low cost working set size tracking method, which seeks to generate an accurate MRC of programs with negligible overhead. We first introduce two enhancements over current techniques: AVL-tree based LRU structure and dynamic hot set sizing. Nonetheless, the overhead is still as high as 16% on average. Based on a key insight that for most programs the working set sizes are stable most of the time, we design an intermittent tracking scheme, which can temporarily turn off memory tracking when memory demands are predicted to be stable. With the assistance of hardware performance counters, memory tracking can be turned on again if a significant change in memory demands is expected. Experimental results show that, by using this intermittent tracking design, memory tracking can be turned off for 82% of execution time while the accuracy loss is no more than 4%. More importantly, this design is orthogonal to existing optimizing techniques. By combining the three enhancements, the mean overhead is lowered to only 2%. We show that when applied it to memory balancing for virtual machines, our scheme brings a speedup of 2.12.
Towards an Extensible Framework for Automated Design of Self-Stabilization
This paper proposes a framework for automatic design of (finite-state) self-stabilizing programs from their non-stabilizing versions. A method that automatically adds stabilization to non-stabilizing program is highly desirable as it generates self-stabilizing programs that are correct by construction, thereby eliminating the need for after-the-fact verification. Moreover, automated design methods enable separation of stabilization from functional concerns. In this paper, we first present a deterministically sound and complete algorithm that adds weak stabilization to programs in polynomial time (in the state space of the non-stabilizing program). We also present a sound method that automatically adds strong stabilization to non-stabilizing programs using the results of the proposed weak stabilization algorithm. These two methods constitute the first elements of the proposed extensible framework. To demonstrate the applicability of our algorithms in automated design of self-stabilization, we have implemented our algorithms in a software tool using which we have synthesized many self-stabilizing programs including Dijkstra’s token ring, graph coloring and maximal matching. While some of the automatically generated self-stabilizing programs are the same as their manually designed versions, our tool surprisingly has synthesized programs that represent alternative solutions for the same problem. Moreover, our algorithms have helped us reveal design flaws in manually designed self-stabilizing programs (claimed to be correct).
Approximating Weighted Matchings Using the Partitioned Global Address Space Model
Alicia Thorsen and Phillip Merkey CS-TR-10-02
Even though the maximum weighted matching problem can be solved in polynomial time, there are many real world problems for which this is still too costly. As a result, there is a need for fast approximation algorithms for this problem. Manne and Bisseling presented a parallel greedy algorithm which is well suited to distributed memory computers. This paper discusses the implementation of the Manne-Bisseling algorithm using the partitioned global address space (PGAS) model. The PGAS model is a union of the shared address space and distributed models which aims to provide the programming conveniences and performance benefits of these models. Four variations of the algorithm are presented which explore the design constructs available in the PGAS model. The algorithms are expressed using a generic PGAS description, however the experiments and empirical performance numbers were obtained from an implementation written in Unified Parallel C with atomic memory operation extensions.
Identifying Satisfying Subsets: A Method for Algorithmic Correction of Inter-Thread Synchronization Mechanisms
Correcting concurrency failures (e.g., data races and deadlocks/livelocks) is a difficult task in part due to the non-deterministic nature of multithreaded programs, and the fact that a correction often involves multiple threads. As such, manually fixing a failure often results in introducing new flaws, which in turn requires a regression verification step after every fix. Most existing methods for detection and correction of data races rely on code annotation (e.g., atomic sections) or linguistic constructs (e.g., coordination models) that enable programmers to introduce non-determinism in a controlled fashion at the expense of degraded performance. In this report, we present an alternative paradigm for the correction of data races and non-progress failures, where we systematically search for a subset of inter-thread synchronization traces of faulty programs that is data race-free and meets progress requirements, called a satisfying subset of traces. To evaluate the effectiveness of satisfying subsets in the correction of concurrency failures, we have developed explicit-state and symbolic implementations of our algorithms. Our software tools have facilitated the correction of data races and non-progress failures in several Unified Parallel C (UPC) and Java programs that programmers have failed to correct in a reasonable amount of time.
Predicting Remote Reuse Distance Patterns in Unified Parallel C Applications*
Productivity is becoming increasingly important in high performance computing. Parallel systems, as well as the problems they are being used to solve, are becoming dramatically larger and more complicated. Traditional approaches to programming for these systems, such as MPI, are being regarded as too tedious and too tied to particular machines. Languages such as Unified Parallel C attempt to simplify programming on these systems by abstracting the communication with a global shared memory, partitioned across all the threads in an application. These Partitioned Global Address Space, or PGAS, languages offer the programmer a way to specify programs in a much simpler and more portable fashion.
However, performance of PGAS applications has tended to lag behind applications implemented in a more traditional way. It is hoped that cache optimizations can provide similar benefits to UPC applications as they have given single-threaded applications to close this performance gap. Memory resuse distance is a critical measure of how much an application will benefit from a cache, as well as an important piece of tuning information for enabling effective cache optimization.
This research explores extending existing reuse distance analysis to remote memory accesses in UPC applications. Existing analyses store a very good approximation of the reuse distance histogram for each memory access in a program efficiently. Reuse data are collected for small test runs, and then used to predict program behavior during full runs by curve fitting the patterns seen in the training runs to a function of the problem size. Reuse data are kept for each UPC thread in a UPC application, and these data are used to predict the data for each UPC thread in a larger run. Both scaling up the problem size and the increasing the total number of UPC threads are explored for prediction. Results indicate that good predictions can be made using existing prediction algorithms. However, it is noted that choice of training threads can have a dramatic effect on the accuracy of the prediction. Therefore, a simple algorithm is also presented that partitions threads into groups with similar behavior to select threads in the training runs that will lead to good predictions in the full run.
*This work is partially supported by NSF grant CCF-0833082.
Analysis and Performance of a UPC Implementation of a Parallel Longest Common Subsequence Algorithm
An important problem in computational biology is finding the longest common subsequence (LCS) of two nucleotide sequences. This paper examines the correctness and performance of a recently proposed parallel LCS algorithm that uses successor tables and pruning rules to construct a list of sets from which an LCS can be easily reconstructed. Counterexamples are given for two pruning rules that were given with the original algorithm. Because of these errors, performance measurements originally reported cannot be validated. The work presented here shows that speedup can be reliably achieved by an implementation in Unified Parallel C that runs on an Infiniband cluster. This performance is partly facilitated by exploiting the software cache of the MuPC runtime system. In addition, this implementation achieved speedup without bulk memory copy operations and the associated programming complexity of message passing.
How Hard Is Aspect-Oriented Programming?
In aspect-oriented programming, it is desirable to algorithmically design correct advised programs that meet a new crosscutting concern while preserving the properties of the base program. Such an automated design eliminates the need for after-the-fact verification, thereby potentially reducing development costs. In this paper, we present a formalization of the problem of designing safety aspects while preserving liveness for finite-state programs, where safety stipulates that nothing bad ever happens, and liveness states that something good will eventually happen in program computations. Subsequently, we illustrate that, for deterministic sequential programs, it is impossible to efficiently design safety aspects and preserve liveness unless P = NP; i.e., the problem is NP-hard in program state space. This is a counterintuitive result as previous work illustrates that this problem is NP-hard for non-deterministic programs.
Feasibility of Stepwise Addition of Multitolerance to High Atomicity Programs
We present sound and (deterministically) complete algorithms for stepwise design of two families of multitolerant programs in a high atomicity program model, where a program can read and write all its variables in an atomic step. We illustrate that if one needs to add failsafe (respectively, nonmasking) fault-tolerance to one class of faults and masking fault-tolerance to another class of faults, then such an addition can be done in separate polynomial-time (in the state space of the fault-intolerant program) steps regardless of the order of addition. This result has a significant methodological implication in that designers need not be concerned about unknown fault tolerance requirements that may arise due to unanticipated types of faults. Further, we illustrate that if one needs to add failsafe fault-tolerance to one class of faults and nonmasking fault-tolerance to a different class of faults, then the resulting problem is NP-complete in program state space. This is a counterintuitive result in that adding failsafe and nonmasking fault-tolerance to the same class of faults can be done in polynomial time. We also present sufficient conditions for polynomial-time addition of failsafe-nonmasking multitolerance to programs. Finally, we demonstrate the stepwise addition of multitolerance in the context of a token ring protocol and a repetitive agreement protocol that is subject to Byzantine and transient faults.
ScaleUPC: A UPC Compiler for Mulit-Core Systems
As multi-core computers begin to dominate the market, the lack of a well-accepted parallel programming language and its compilation support targeting this specific architecture has become an immediate problem to solve. Unified Parallel C (UPC), a parallel extension to ANSI C, is gaining great interest from the high performance computing community. UPC was originally designed for large scale parallel computers and cluster environments. However, the partitioned global address space programming model of UPC makes it a natural choice for a single multi-core machine, where the main memory is physically shared. This paper builds a case for UPC as a feasible language for multi-core programming by providing an optimizing compiler, called ScaleUPC, that outperforms other UPC compilers targeting SMPs.
Since the communication cost for remote accesses is removed because all accesses are physically local in a multi-core, we find that the overhead of pointer arithmetic on shared data accesses becomes a prominent bottleneck. The reason is that directly mapping the UPC logical memory layout to physical memory, as used in most of the existing UPC compilers, incurs prohibitive address calculation overhead. This paper presents an alternative memory layout, which effectively eliminates the overhead without sacrificing the UPC memory semantics. Our research also reveals that the compiler for multi-core systems needs to pay special attention to the memory system. We demonstrate how the compiler can enforce static process/thread binding to improve cache performance.
Action-Level Addition of Leads-To Properties to Shared Memory Parallel Programs
We present a method for large-scale automatic addition of Leads-To properties to shared memory parallel programs. Such an automated addition of Leads-To is highly desirable in facilitating the design of multicore programs for average programmers. Our approach is action-level in that we separately analyze and revise the actions of an existing program in order to accommodate a new Leads-To property while preserving already existing properties. Based on our method, we have developed a software framework that exploits the computational resources of geographically distributed machines to add Leads-To properties to parallel programs. We demonstrate our approach in the context of a token passing and a barrier synchronization program.
A Performance Model for Unified Parallel C
Dr. Zhang Zhang CS-TR-07-04
This research is a performance centric investigation of the Unified Parallel C (UPC), a parallel programming language that belong to the Partitioned Global Address Space (PGAS) language family. The objective is to develop performance modeling methodology that targets UPC but can be generalized for other PGAS languages.
The performance modeling methodology relies on platform characterization and program characterization, achieved through shared memory benchmarking and static code analysis, respectively. Models built using this methodology can predict the performance of simple UPC application kernels with relative errors below 15%. Beside performance prediction, this work provide a framework based on shared memory benchmarking and code analysis for platform evaluation and compiler/runtime optimization study. A few platforms are evaluated in terms of their fitness to UPC computing. Some optimization techniques, such as remote reference caching, is studied using this framework.
A UPC implementation, MuPC, is developed along with the performance study. MuPC consists of a UPC-to-C translator built upon a modified version of the EDG C/C++ front end and a runtime system built upon MPI and POSIX threads. MuPC performance features include a runtime software cache for remote accesses and low latency access to shared memory with affinity to the issuing thread. In this research, MuPC serves as a platform that facilitates the development, testing, and validation of performance microbenchmarks and optimization techniques.
DiConic Addition of Failsafe Fault-Tolerance
Dr. Ali Ebnenasir CS-TR-07-03
We present a divide and conquer method, called DiConic, for automatic addition of failsafe fault-tolerance to distributed programs, where a failsafe program guarantees to meet its safety specification even when faults occur. Specifically, instead of adding fault-tolerance to a program as a whole, we separately revise program instructions in a static way so that the entire program becomes failsafe fault-tolerant. Our DiConic algorithm can potentially utilize the processing power of a large number of machines working in parallel, thereby enabling automatic addition of failsafe fault-tolerance to programs with a large number of instructions. We formulate our DiConic synthesis algorithm in term of the satisfiability problem. We validate our approach using the Yices Satisfiability Modulo Theories (SMT) solver for the Byzantine Generals problem and an industrial application.
Non-Viable Path Branch Prediction
Ying Xiong and Dr. Soner Onder CS-TR-07-02
For modern superscalar processors which implement deeper and wider pipelines, accurate branch prediction is crucial for feeding sufficient number of correct instructions into the superscalar's highly-parallelized execution core. In this paper, we show that what the branch predictor is learning has significant implications for its ability to make effective use of branch correlation and its ability to exploit longer history lengths. Contrary to the commonly employed approach of training each case using both successes and failures of previous branch behavior, a Non-Viable Path branch predictor learns those paths that the program will never follow, given prior history; such a predictor is referred to as Non-Viable Path Predictor (NVPP). An NVPP can learn any boolean function, given enough space.
In this paper, we present an analysis of the design space and arrive at a practically feasible implementation of the concept which uses simple traditional one or two-bit tables and traditional table indexing mechanisms. Because of its unique properties, NVPP can also exploit very long global histories with only a linear increase in the space requirement and provide rapid training times. Although NVPP exploits large histories, its access time remains relatively constant and it is comparable to small Gshare predictors, permitting a single cycle implementation. The performance of NVPP is compared with most recently published highly accurate predictors including piecewise linear and the O-GEometric History Length predictors. Our results indicate that the NVPP accuracy compares well with these predictors while maintaining the advantages of simplicity and the speed of a traditional style predictor design.
A Simple Parallel Approximation Algorithm for the Weighted Matching Problem
Alicia Thorsen, Phillip Merkey and Fredrik Manne (Univeristy of Bergen) CS-TR-07-01
Given a weighted graph, the weighted matching problem is to find a matching with maximum weight. The fastest known exact algorithm runs in O(nm + n2 log n) however for many real world applications this is too costly, and an approximate matching is sufficient. A c-approximation algorithm is one which always finds a weight of at least c times the optimal weight. Drake and Hougardy developed a linear time 2/3 - epsilon approximation algorithm which is the best known serial algorithm. They also developed a parallel 1- epsilon approximation algorithm for the PRAM model, however it requires a large number of processors which is not as useful in practice. Hoepman developed a distributed 1/2 approximation algorithm which is the best known distributed algorithm. We present a shared memory parallel version of the best 2/3 - epsilon algorithm, which is simple to understand and easy to implement.
A Hill-Climbing Approach for Planning with Temporal Uncertainty
Janae N. Foss and Dr. Nilufer Onder CS-TR-06-02
We approach the problem of finding temporally contingent plans, i.e., plans with branches that are based on the duration of an action at execution time, using a hill-climbing algorithm. We find an optimistic plan that is valid when all actions complete quickly. We then use efficient temporal reasoning techniques to determine when the plan may fail. At time points that cause an unsafe situation, we insert temporal contingency branches. We describe our implemented planner PHOCUS-HC and provide experimental results.
UPC Collective Conformance Suite
Lisa Begum and Dr. Charles Wallace CS-TR-06-01
The UPC collective conformance suite is a collection of tests that help determine how closely a given implementation of the UPC collective operations conform to the specifications. The test suite is based on Version 1.1 of the UPC specification and Version 1.0 of the UPC collectives specifications. It has been used on the MuPC and HP UPC platforms. These tests exercise all of the collective operations, with the exception of the deprecated upc all sort operation, and notify the tester of behavior that departs from the specifications. This document describes the tests in the suite and explains why they were chosen.
A Specification of the Extensions to the Collective Operations of Unified Parallel C
Zinnu Ryne and Dr. Steven Seidel CS-TR-05-08
Unified Parallel C (UPC) is an extension of the C programming language that provides a partitioned shared-memory model for parallel programming. On such platforms, a simple assignment operation can perform remote memory reads or writes. Nevertheless, since bulk transfers of data are more efficient than byte-level copying, collective operations are needed, and as a result a specification for a standard set of collective operations was proposed in 2003. This specification is now part of the UPC V1.2 language specification. However, these operations have limited functionality. This thesis work explores ways to overcome these limitations by proposing a set of extensions to the standard collective relocalization operations. The proposal includes performance and convenience improvements that provide asynchronous behavior, an in-place option, runtime implementation hints and subsetting feature. Also proposed are the new one-sided collective operations which are alternatives to the traditional MPI way of looking at collective operations.
A comparative performance analysis is done to ensure that the reference implementation of the extended collective operations perform within reasonable bounds of that of the standard collective operations. This provides necessary support to adopt these extensions as they provide greater flexibility and power to the programmer with a negligible performance penalty.
Implementing UPC's MYSYNC Synchronization Mode Using Pairwise Synchronization of Threads
Prasad Dhamne CS-TR-05-07
UPC (Unifed Parallel C) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming. Synchronization between threads (processes) in UPC is done through the use of locks or barriers. We have investigated a new synchronization method which is better suited in situations where threads do not need to synchronize with all of the other threads in the system.
We implemented pairwise synchronization that can be used to synchronize pairs of threads while not disturbing the remaining threads. This project also explored how to synchronize pairs of UPC threads over a Myrinet interconnect. Myrinet is a low-latency, high bandwidth local area network with a low level communication layer called GM. We implemented UPC's MYSYNC synchronization mode in collective operations which make use of pairwise synchronization. We compared the performance of MYSYNC synchronization mode to the ALLSYNC synchronization mode, that is often implemented by using a barrier. For performance analysis we used a collectives testbed previously developed at MTU.
Results obtained from the testbed indicate that for collectives, such as Broadcast, Scatter and Gather, MYSYNC synchronization considerably reduces the collective time. Overall testbed execution time for 100 trials with MYSYNC mode was improved by 10-15% compared to the ALLSYNC synchronization. For the Permute collective operation the performance improvement is maximum in pull based implementation and it is close to 20%. For Gatherall and Exchange the performance improvement is minimal. However, ALLSYNC mode performs better over MYSYNC synchronization in push based implementation of Permute collective operation. In general, the performance of MYSYNC reduces with increase in number of threads.
Implementing Sort in UPC: Performance and Optimization
Kohinoor (Lisa) Begum and Dr. Steven Seidel CS-TR-05-06
Unifed Parallel C (UPC) is a parallel extension of ANSI C that is based on a partitioned shared address space. It supports development of parallel applications over diverse hardware platforms and does not require a true shared memory system. It simplifes programming by providing the same syntax for local references as for remote references. This helps the user focus on performance, and allows for the exploitation of locality. Our project is to implement a generic sort function in UPC. It will help programmers focus on design, implementation and performance of the entire application rather than work on details of sorting.
A Mechanism for Protecting Passwords from Key Loggers
Lior Shamir CS-TR-05-03
A simple mechanism that protects passwords from being stolen by malicious key loggers is proposed. The described mechanism is based on feeding potential key logers with incorrect passwords by generating faked keystrokes. Applications that require their user to type in passwords (e.g. internet browsers surfing to password protected web sites) can use this mechanism so that passwords can be protected even in the absence of an anti-key logger. Different approaches of protecction are proposed for thread based, hook based and kernel based key loggers.
Upc_all_sort is an implementation of a generic sort function in UPC using a bin sort algorithm. Techniques applied to improve the performance of the sort function include: sampling of data to ensure more nearly uniform bucket sizes among threads, an all-to-all exchange of bucket sizes to minimize the amount of dynamically allocated memory, rebalancing data at the end of the sort to ensure that each thread finishes with the same number of elements as it started with. In this paper, we analyze and compare the performance of sequential and parallel sort, the potential of performance improvements over sequential sort and also the bottlenecks limiting these improvements. This sort function runs 3.9 times faster than the sequential sort for problem sizes of at least 128K when run on 8 threads. The performance scales up with the increasing number of threads and bigger problem sizes. It runs 6.8 times.
An Overlay Network Simulator in Java
Eric Simonton CS-TR-05-05
It is not too difficult to generate a power-law topology. It is not trivial, however to maintain that topology while simulating network churn, nor to keep the network unpartitioned in the process. This report describes how to do just that, and do it efficiently.
Eager Misprediction Recovery
Peng Zhou, Dr. Soner Onder, and Dr. Steve Carr CS-TR-05-04
Current trends in modern out-of-order processors involve implementing deeper pipelines and a large instruction window to achieve high performance. However, as pipeline depth increases, the branch misprediction penalty becomes a critical factor in overall processor performance. Current approaches to handling branch mispredictions either incrementally roll back to in-order state by waiting until the mispredicted branch reaches the head of the reorder buffer, or utilize checkpointing at branches for faster recovery. Rolling back to inorder state stalls the pipeline for a significant number of cycles and checkpointing is costly. This paper proposes a fast recovery mechanism, called Eager Misprediction Recovery (EMR), to reduce the branch misprediction penalty. Upon a misprediction, the processor immediately starts fetching, renaming and executing instructions from the correct path without restoring the map table. Those instructions that access incorrect speculative values wait until the correct data are restored; however, instructions that access correct values continue executing while recovery occurs. Thus, the recovery mechanism hides the latency of long branch recovery with useful instructions. EMR achieves a mean performance improvement very close to a recovery mechanism that supports checkpointing at each branch. In addition, EMR provides an average of 9.0% and up to 19.9% better performance than traditional sequential misprediction recovery on the SPEC2000 benchmark suite.
MuON: Epidemic Based Mutual Anonymity
Neelesh Bansod, Ashish Malgi, Dr. Byung Kyu Choi and Dr. Jean Mayo CS-TR-05-02
A mutually anonymous service hides the identity of a client from the service provider and vice-versa. Providing mutual anonymity usually requires a large number of participants. While peer-to-peer (P2P) networks are capable of recruiting a large number of participants, reliable anonymous communication in these architectures, with low bandwidth usage, still needs further investigation. This paper presents MuON, a protocol to achieve mutual anonymity in unstructured P2P networks. MuON leverages epidemic-style data dissemination to deal with the high churn (changes in system membership) characteristic of unstructured P2P networks. The results from our security analysis and simulation show that MuON provides mutual anonymity over unstructured P2P networks while maintaining predictable latencies, high reliability, and low communication overhead.
An Introduction to Flexible Architecture Simulation Tool (FAST) and Architecture Description Language ADL
Dr. Soner Onder CS-TR-05-01
The following document provides an overview of how to use the Flexible Architecture Simulation Tool, and an indepth look at the Architecture Description Language. The system has been developed by Dr. Soner Onder and Dr. Rajiv Gupta. After many revisions, it has become a powerful simulation tool. The documentation that you are reading is a result of the collaboration between University of Alberta and Michigan Technological University. Aaron Dittrich under the direction of Dr. Mike MacGregor has put together pieces of documentation Dr. Onder had written into the form of a technical report (TR03-18, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada) in a nicely organized way which has later became the basis for this document. The examples presented are centered around the MIPS RISC architecture. At first a single-stage MIPS architecture is presented, and then the remainder of the examples are centered around a simplified MIPS R2000/R3000 architecture.
High Performance Unified Parallel C (UPC) Collectives for Linux/Myrinet Platforms
Alok Mishra and Dr. Steven Seidel CS-TR-04-05
Unified Parallel C (UPC) is a partitioned shared memory parallel programming language that is being developed by a consortium of academia, industry and government. UPC is an extension of ANSI C. In this project we implemented a high performance UPC collective communications library of functions to perform data relocalization in UPC programs for Linux/Myrinet clusters. Myrinet is a low latency, high bandwidth local area network. The library was written using Myrinet?s low level communication layer called GM. We implemented the broadcast, scatter, gather, gather all, exchange and permute collective functions as defined in the UPC collectives specification document. The performance of these functions was compared to a UPC-level reference implementation of the collectives library. We also designed and implemented a micro-benchmarking application to measure and compare the performance of collective functions. The collective functions implemented in GM usually ran two to three times faster than the reference implementation over a wide range of message lengths and numbers of threads.
TransMig: Implementing a Transparent Strong Migration of Mobile Agents in Java
Ashish Malgi, Neelesh Bansod, and Dr. Byung Kyu Choi CS-TR-04-04
Mobile agents are programs which ship themselves from one machine to another with a programmer defined primitive. They form an interesting paradigm to develop distributed applications. The key issue for migrating a mobile agent from one machine to another is the transmission of the execution state of the agent. We propose TransMig, which is an approach to capture and re-establish the state of mobile agents programmed in Java without any modifications to the Java virtual machine. Our approach instruments the original source code and the resultant byte code to achieve the capture and re-establishment of the execution state of mobile agents at a language level. Currently, our prototype implementation of TransMig has been used for providing a programming and runtime environment for mobile agents, but it could easily be extended for any Java program.
Significant Precursors to UPC
Dr. Phillip Merkey and Jaisudha Purushothaman CS-TR-04-03
Designing parallel distributed languages involves deciding the level of abstraction that can be supported through the features provided to the programmer and the level of control the programmer is allowed to wield over the system. C has been one of the most commonly used languages for almost three decades. It has low level access to information and commands while still retaining the portability and syntax of a high level language. Its flexibility and powerful pointer manipulations capabilities are some of the reasons why its still actively maintained. It has been the natural base language on which researchers have developed parallel languages or have added libraries or extension for parallelism. For example, Cilk, C*, Split-C, AC, PCP, Handel-C, mpC, DPCE, C- -, C//, Jade, Parsec, MPI, OpenMP,Parsec,ParaC are some of the languages developed as C libraries or extensions. UPC can be viewed as the culmination of previous attempts at coming up with the right combination of programming features for the parallel distributed shared memory environment. In particular, PCP1, Split-C and AC2 are consider immediate predecessors of UPC. This essay discusses these languages and UPC in a comparative manner.
Exploring Linked Lists in UPC
Hiriomi Suenaga, Dr. Phillip Merkey, and Dr. Steven Seidel CS-TR-04-02
Shared memory multiprocessors offer great computational power while introducing new problems such as race conditions and deadlock. Concurrent objects can be accessed and modified by multiple processes simultaneously. In order to avoid corruption of data, we need to construct carefully planned algorithms that minimize the use of synchronization schemes which inhibit performance. Unified Parallel C (UPC) is a parallel extension of ANSI C. It sees its memory as a partitioned shared memory. Because of this, past studies that implemented concurrent objects in pure shared memory machines cannot be applied directly. In order to achieve high performance while avoiding data corruptions or inconsistent results, it is important to understand the underlying architecture and the memory model. In this study we explored three different ways of implementing an elemental data structure, the linked-list, on which multiple processes can operate at the same time. Unfortunately, we were not able to show good speedup at this point because of experimentally confirmed limitations of the current compiler, however, we are confident that our approaches will give us a great performance improvement as the compiler develops.
Pairwise Synchronization in UPC
Ernie Fessenden and Dr. Steven Seidel CS-TR-04-01
UPC (Unified Parallel C) is a small extension of C that provides a partitioned shared memory model for parallel programming. Synchronization between threads (processes) in UPC is done through the use of locks or barriers. We have investigated a new synchronization process that we believe to be better suited in situations where threads do not need to synchronize with all of the other threads in the system.
This project explores how to synchronize UPC threads in specialized ways. Our approach is to use UPC locks to implement pairwise synchronization. Using this new approach we implemented the MYSYNC synchronization mode and compared it to the ALLSYNC synchronization mode that is often implemented by using a barrier. Pairwise synchronization can also be used to synchronize arbitrary subsets of threads while not disturbing the remaining threads. Such subset-based synchronization does not presently exist in UPC. We also investigate how to implement the barrier using pairwise synchronization.
The MYSYNC synchronization mode has shown better performance than the ALLSYNC synchronization mode for several of the relocalization collective operations. Specifically, for an uneven computational load the MYSYNC implementation of upc all broadcast, upc all scatter, and upc all gather outperforms the ALLSYNC implementation. For upc all permute under even and uneven computational loads the MYSYNC implementation shows large improvements in run time. However, for upc all exchange and upc all gather all the MYSYNC implementation runs slower than the ALLSYNC implementation.
Illustrative Test Cases for the UPC Memory Model
William Kuchera and Dr. Charles Wallace CS-TR-03-02
The memory model underlying UPC is an important but subtle aspect of the language, and everyone involved with it must understand its implications. The only resource currently available with detailed coverage of the memory model is the high-level description in the UPC specification . As mentioned in our previous report , it is difficult to tie this description to actual behavior of a UPC program. UPC implementors and application developers must be able to readily distinguish program behavior that complies with the UPC specification from behavior that does not comply. For instance, implementors of UPC on a particular platform need to ensure that the optimizations they employ guarantee compliant behavior. Programmers exploiting relaxed consistency must have a grasp of what possible behaviors their programs may induce.
To this end, we have devised a set of test cases for the UPC memory model. These cases fall into two categories:
Compliance tests. These are examples of behavior that falls outside the UPC specification. They illustrate the consistency guarantees that UPC gives the programmer.
?Dark corner? tests. These are examples of acceptable behavior, according to the UPC specification, that may be surprising to the UPC novice. The UPC memory model is designed to be quite lax and allows some cases that programmers may not anticipate. These examples serve to highlight some of these ?dark corners?.
Our test cases are at the architecture level rather than the program level; that is, each thread?s execution is expressed in terms of a sequence of read, write, fence, notify, and wait instructions, rather than UPC statements. Any references to ?ordering,? ?following,? ?intervening,? etc. refer to this thread-by-thread instruction ordering. Read and write instructions are assumed to be relaxed unless specified otherwise. For each case, we illustrate why it illustrates compliance or non-compliance, using the operational semantics devised in our previous report .
Toward a Programmer-Friendly Formal Specification of the UPC Memory Model
William Kuchera and Dr. Charles Wallace CS-TR-03-01
As part of our efforts to elucidate the UPC memory model, we have closely examined the memory model definition given in the official UPC specification  (hereafter, "the UPC spec"). This document gives most of the relevant details about the memory model.
ASM 101: An Abstract State Machine Primer
James K. Huggins and Dr. Charles Wallace CS-TR-02-04
Over the last twenty years, Abstract State Machines (ASMs) have been used to describe and validate a wide variety of computing systems . Numerous introductions to ASMs have been written (see for example [20, 10, 21,36]). In addition, many ASM papers include brief overviews for the casual reader. In this paper we attempt to provide a gentler introduction, focusing more on the use of the technique than on formal definitions. Our target audience is a senior-level undergraduate class in software engineering, but we hope that this work proves helpful to other readers as well.
While there are references to "software" throughout, the reader should be aware that ASMs are relevant to other kinds of system development as well. For simplicity, our focus is on ASMs with only a single agent; distributed ASMs may be considered in a later paper.
MuPC: A Run Time System for Unified Parallel C
Jeevan Savant CS-TR-02-03
Unified Parallel C is an extension to the C programming language intended for parallel programming. UPC adds a small number of parallel programming constructs to C. These constructs enable users to write parallel programs based on a distributed shared memory model. MuPC (MTU's UPC) is a run time system for UPC, enabling it to run on a wide variety of architectures including networks of workstations, Beowulf clusters and shared memory machines.
Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: II - Theory and Software Tools
Dr. Adrian Sandu, Dacian N. Daescu and Gregory R. Carmichael CS-TR-02-02
The Kinetic PreProcessor KPP was extended to generate the building blocks needed for the direct and adjoint sensitivity analysis of chemical kinetic systems. An overview of the theoretical aspects of sensitivity calculations and a discussion of the KPP software tools is presented in the companion paper.
In this work the correctness and efficiency of the KPP generated code for direct and adjoint sensitivity studies are analyzed through an extensive set of numerical experiments. Direct decoupled Rosenbrock methods are shown to be cost-effective for providing sensitivities at low and medium accuracies. A validation of the discrete-adjoint evaluated gradients is performed against the finite difference gradients. Consistency between the discrete and continuous adjoint models is shown through step size reduction. The accuracy of the adjoint gradients is measured against a reference gradient value obtained with a standard direct-decoupled method. The accuracy is studied for both constant step size and variable step size integration of the forward model.
Applications of the KPP-1.2 software package to direct and adjoint sensitivity studies, variational data assimilation and parameter identification are considered for the comprehensive chemical mechanism SAPRC-99.
Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: I - Theory and Software Tools
Dr. Adrian Sandu, Dacian N. Daescu and Gregory R. Carmichael CS-TR-02-01
The analysis of comprehensive chemical reactions mechanisms, parameter estimation techniques, and variational chemical data assimilation applications require the development of efficient sensitivity methods for chemical kinetics systems. The new release (KPP-1.2) of the Kinetic PreProcessor KPP contains software tools that facilitate direct and adjoint sensitivity analysis. The direct decoupled method, build using BDF formulas, has been the method of choice for direct sensitivity studies. In this work we extend the direct decoupled approach to Rosenbrock stiff integration methods. The need for Jacobian derivatives prevented Rosenbrock methods to be used extensively in direct sensitivity calculations; however, the new automatic differentiation and symbolic differentiation technologies make the computation of these derivatives feasible. The adjoint modeling is presented as an efficient tool to evaluate the sensitivity of a scalar response function with respect to the initial conditions and model parameters. In addition, sensitivity with respect to time dependent model parameters may be obtained through a single backward integration of the adjoint model. KPP software may be used to completely generate the continuous and discrete adjoint models taking full advantage of the sparsity of the chemical mechanism. Flexible direct-decoupled and adjoint sensitivity code implementations are achieved with minimal user intervention. In the companion paper  we present an extensive set of numerical experiments that validate the KPP software tools for several direct/adjoint sensitivity applications, and demonstrate the efficiency of KPP-generated sensitivity code implementations.
A TCP/UDP Protocol Visualization Tool: Visual TCP/UDP Animator (VTA)
Ping Chen, Rong Ge, Yin Ma, Dr. Jean Mayo, Yuping Zhang, and Chunhua Zhao CS-TR-01-10
The TCP/UDP/IP/Ethernet protocol suite is commonly covered within an undergraduate curriculum. Often, this is the first exposure students have to a network protocol. We have found that students have difficulty understanding the unique function of these protocols and how the protocols are related. This is due at least in part to the fact that packet specification and construction takes place within the operating system, while covering a typical operating system protocol stack implementation would be beyond the scope or time constraints of many of these undergraduate courses. In order to address this problem we have written a visualization tool that illustrates the operation of TCP and UDP over the IP and Ethernet protocols using actual packets, captured via "libpcap," live from the network or stored in a file. The tool provides several views of captured data that, when used individually or in combination, illustrate the operation of these protocols.
Discretizing Aerosol Dynamics with B-Splines
Dr. Adrian Sandu and Christian T. Borden CS-TR-01-09
This paper presents a discretization technique for particle dynamics equation based on the B-spline interpolation of the solution. The method is developed in the general framework recently proposed by the authors, and extends the previous work of Pilinis and Seinfeld to arbitrary order splines and arbitrary knot sequences. Numerical tests include the coagulation-growth of the exponential distribution and of a cosine hill in logarithmic coordinates.
Reduced Atmospheric Chemistry Models by Sensitivity Matrix Approach
Dr. Adrian Sandu CS-TR-01-08
Simulation of chemical kinetics typically claims about 80 to 90% of the total computational time in three-demensional atmospheric chemistry-transport models; to alleviation this computational burden reduced chemical models are often employed. Reduced models are built via a dynamical analysis that results in the elimination of the fastest time scales, or by tuning a simpler model to reproduce the input-output relations of the kinetic system.
The sensitivity matrix is a linear entity associated with the dynamics of a nonlinear system. In this communicatino we show that sensitivity matrices can be used to identify low-demensional linear evolution manifolds and to construct reduced kinetic models.
The Kinetic PreProcessor KPP - A Software Environment for Solving Chemical Kinetics
Valeriu Damian, Dr. Adrian Sandu, Mirela Damian, Florian Potra and Gregory Carmichael CS-TR-01-07
The KPP kinetic preprocessor is a software tool that assists the computer simulation of chemical kinetic systems. The concentrations of a chemical system evolve in time according to the differential law of mass action kinetics. A computer simulation requires the implementation of the differential system and its numerical integration in time.
KPP translates a specification of the chemical mechanism into FORTRAN or C simulation code that implements the concentration time derivative function and its Jacobian, together with a suitable numerical integration scheme. Sparsity in Jacobian is carefully exploited in order to obtain computational efficiency.
KPP incorporates a library with several widely used atmospheric chemistry mechanisms; users can add their own chemical mechanisms to the library. KPP also includes a comprehensive suite of stiff numerical integrators. The KPP development environment is designed in a modular fashion and allows for rapid prototyping of a new chemical kinetic schemes as well as new numerical integration methods.
Rosenbrock-Nystrom Integrators for SSODE of Multibody Dynamics
Dr. Adrian Sandu, D. Negrut, E.J. Haug, F.A. Potra and C. Sandu CS-TR-01-06
When performing dynamic analysis of a constrained mechanical system, a set of index 3 Differential-Algebraic Equations (DAE) describes the time evolution of the system model. The paper presents a state-space based method for the numerical solution of the resulting DAE. A subset of so called independent generalized coordinates, equal in number to the number of degrees of freedom of the mechanical system, is used to express the time evolution of the mechanical system. The second order state-space ordinary differential equations (SSODE) that describe the time variation of independent coordinates are numerically integrated using a Rosenbrock type formula. For stiff mechanical systems, the proposed algorithm is shown to significantly reduce simulation times when compared to state of the art existent algorithms. The better efficiency is due to the use of an L-stable integrator and a rigourous and general approach to providing analytical derivatives required by it.
A Communication Library for the Parallelization of Air Quality Models on Structured Grids
Philip Miehe, Dr. Adrian Sandu, Gregory R. Carmichael, Youhua Tang, and Dacian Daescu CS-TR-01-05
PAQMSG is an MPI-based, Fortran 90 communication library for the parallelization of Air Quality Models on structured grids. It consists of distribution, gathering and repartitioning routines for different domain decompositions implementing a master-worker strategy. The library is architecture and application independent and includes optimization strategies for different architectures. This paper presents the library from a user perspective. Results are shown from the parallelization of STEM-III on Beowulf clusters.
The PAQMSG library is available on the web. The communication routines are easy to use, and should allow for an immediate parallelization of existing air quality models. PAQMSG can also be used for constructing new models.
A Spectral Method for Solving Aerosol Dynamics
Dr. Adrian Sandu CS-TR-01-04
This paper presents a discretization of particle dynamics equation based on spectral polynomial interpolation of the solution. Numerical tests show that the method achieves spectral accuracy in linear coordinates, and is able to reproduce well the reference solutions in logarithmic coordinates.
A Framework for the Numerical Treatment of Aerosol Dynamics
Dr. Adrian Sandu and Christian Borden CS-TR-01-03
This paper presents a general framework for the discretization of particle dynamics equation by projection methods, which include Galerkin and collocation techniques. Based on this framework a discretization over piecewise polynomial spaces is discussed. Numerical examples are given using both linear and logarithmic coordinates; the results show that this discretization approach is able to accurately solve aerosol size distribution using a relatively small number of "size bins."
A Newton-Cotes Quadrature Approach for Solving the Aerosol Coagulation Equation
Dr. Adrian Sandu CS-TR-01-02
This paper presents a direct approach to solving the aerosol coagulation equation. Newton-Cotes formulas are used to discretize the integral terms, and the semi-discrete system is built using collocation. A semi-implicit Gauss-Seidel time integration method is employed. The approach generalizes the semi-implicit method of Jacobson.
The Location Consistency Memory Model and Cache Protocol: Specification and Verification
Dr. Charles Wallace, Guy Tremblay, and Jose Nelson Amaral CS-TR-01-01
We use the Abstract State Machine methodology to give formal operational semanitics for the Location Consistency memory model and cache protocol. With these formal models, we prove that the cache protocol satisfies the memory model, but in a way that is strictly stronger than necessary, disallowing certain behavior allowed by the memory model.
Tools for Distributed Systems Instruction
Meghana Gothe and Dr. Jean Mayo CS-TR-00-02
Teaching a distributed system class becomes a difficult task when the student community is from a varied background. These systems are complex, as they are distributed across the network of component processes. A majority of the students have a sequential programming background. Hence, distributed systems concepts pose a challenge of re-learning, making the teachers job all the more difficult. Furthermore, distributed systems rely on message passing for communication between component processes. The large amount of system specific network interface detail obscures the actual concepts. Tools that attempt to eliminate this obsurity are needed. The students are introdudced to numerous protocols, which form the basic foundation of distributed systems programming. Shortage of time prvents students from implementing them all. A simulation of the protocols which can be used by the students for understanding is a need of time.
With this motivation, we have implemented a tool which provides certain communication primitives and some important distributed system protocols. The communication primitives will provide an easy interface that can be used by the students in their assignments. Protocol excecution dumps will enable the students to correlate to the topics taught in class.
Time-Stepping Methods that Favor Positivity for Atmospheric Chemistry Modeling
Dr. Adrian Sandu CS-TR-00-01
Chemical kinetics conserves mass and renders non-negative solutions; a good numerical simulation would ideally produce mass balanced, positive concentration vectors. Many time-steping methods are mass conservative; however, unconditional positivity restricts the order of a traditional method to one. The projection method presented in  post-processes the solution of a one-step integration method to ensure mass conservation and positivity. It works best when the underlying time-steping scheme favors positivity. In this paper several Rosenbrock-type methods that favor positivity are presented; they are designed such that their transfer functions together with several derivatives are non-negative for all real, negative arguments.
Adjoint Implementation of Rosenbrock Methods Applied to Variational Data Assimilation Problems
Dacian Daescu, Gregory Carmichael and Dr. Adrian Sandu CS-TR-99-06
In the past decade the variational method has become an attractive alternative to the traditional Kalman filter in data assimilation problems for atmospheric chemistry models. From the computational point of view its major advantage is that the gradient of the cost function is computed fast, at the expense of few function evaluations, making the optimization process very efficient. The drawbacks are the high storage requirements and the difficulty to implement adjoint code when sophisticated integrators are used to solve the stiff chemistry. If the sparse structure of the models is carefully exploited Rosenbrock solvers have been proved a reliable alternative to classical schemes. In this paper we present an efficient implementation of the adjoint code for the Rosenbrocj type methods which can reduce the storage requirements of the forward model and is suitable for automatization.
Positive Numerical Integration Methods for Chemical Kinetic Systems
Dr. Adrian Sandu CS-TR-99-05
Mass action chemical kinetics conserves mass and renders non-negative solutions; a good numerical simulation would ideally produce a mass balanced, positive numerical concentration vector. Many time stepping methods are mass conservative; however, unconditional positivity restricts the order of a traditional method to one. The positive projection method presented in the paper ensures mass conservation and positivity. First, a numerical approximation is computed with one step of a mass-preserving traditional scheme. If there are negative components the nearest vector in the reaction simplex is found using a primal-dual optimization routine; this vector is shown to better approximate the true solution. The technique works best when the underlying time-stepping scheme favors positivity. Projected methods are able to use larger integration time steps, being more efficient then traditional methods for systems which are unstable outside the positive quadrant.
Stable Predicate Evaluation with Probabilistically Synchronized Clocks
Dr. Jean Mayo and Don Darling CS-TR-99-04
The evaluation of a global state in a distributed system is very difficult, since the individual processes that comprise the system share no common memory or time base, and can only communicate with each other via message passing. Although no common time base exists in these systems, the system clocks are often roughly synchronized within some known upper bound. In previous work, Mayo and Kearns developed a schema for evaluating a class of stable predicates on systems with a rough global time base. These global predicates are comprised of the conjunction of the local process states. The fact that the predicates are stable means that once they become true, they remain true forever. The ability to detect stable predicates in the system allows us to detect critical system states such as program termination and deadlock.
We propose to develop a protocol based on Mayo and Kearns's schema, which will also handle systems where clocks are synchronized with a probabilistic clock synchronization protocol. Also, our protocol will allow processes to enter and leave the system, facilitating its use in dynamic systems. Finally, we plan to implement our protocol on an actual system in order to see how it performs in general use.
Do Blending and Offsetting Commute for Dupin Cyclides?
Dr. Ching-Kuang Shene CS-TR-99-03
CS-TR-99-03 is no longer available. This work has been published in "Computer-Aided Geometric Design," Vol. 17, pp. 891-910, 2000.
Register Assignment for Software Pipelining with Partitioned Register Banks
Jason Hiser, Dr. Philip Sweany, Dr. Steven Carr, and Dr. Steven Beaty CS-TR 99-02
Trends in instruction-level parallelism (ILP) are putting increased demands on a machine's register resources. Computer architects are adding increased numbers of functional units to allow more instructions to be executed in parallel, thereby increasing the number of register ports needed. This increase in ports hampers access time. Additionally, aggressive scheduling techniques, such as software pipelining, are requiring more registers to attain high levels of ILP. These two factors make it difficult to maintain a single monolithic register bank.
One possible solution to the problem is to partition the register banks and connect each functional unit to a subset of the register banks, thereby decreasing the number of register ports required. In this type of architecture, if a functional unit needs access to a register to which it is not directly connected, a delay is incurred to copy the value so that the proper functional unit has access to it. As a consequence, the compiler now must not only schedule code for maximum parallelism but also for minimal delays due to copies.
In this paper, we present a greedy framework for generating code for architectures with partitioned register banks. Our technique is global in nature and can be applied using any scheduling method (e.g, trace scheduling, modulo scheduling) and any register assignment technique. We present an experiment using our partitioning techniques with modulo scheduling and Chaitin/Briggs register assignment. Our results show that it is possible to generate good code for partitioned-register-bank architectures.
Register Assignment for Architectures with Partitioned Register Banks
Jason Hiser, Dr. Philip Sweany and Dr. Steven Carr CS-TR 99-01
Modern computers take advantage of the instruction-level parallelism (ILP) available in programs with advances in both architecture and compiler design. Unfortunately, large amounts of ILP hardware and aggressive instruction scheduling techniques put great demands on a machine's register resources. With increasing ILP, it becomes difficult to maintain a single monolithic register bank and a high clock rate. The number of ports required for such a register bank severely hampers access time. To provide support for large amounts of ILP while retaining a high clock rate, registers can be partitioned among several different register banks. Each bank is directly accessible by only a subset of the functional units with explicit inter-bank copies required to move data between banks. Therefore, a compiler must deal not only with achieving maximal parallelism via aggressive scheduling, but also with data placement to limit inter-bank copies. Our approach to code generation for ILP architectures with partitioned register resources provides flexibility by representing machine dependent features as node and edge weights and by remaining independent of scheduling and register allocation methods Preliminary experimentation with our framework has shown a degradation in execution performance of 10% on average when compared to an unrealizable monolithic-register-bank architecture with the same level of ILP. This compares very favorably with other approaches to the same problem.
Instruction Manual for Schemm's Configurable Analyzer v0.1 0.1
Evan Schemm CS-TR 98-02
In a research environment, it is often nice to check the effects of changes or additions to existing architectures. To build each variation of a design used in a comparative study would be difficult, as well as prohibitively expensive. Usually a simulator is used to generate a trace of the executed instructions that can be run through an analyzer. This analyzer is designed to be configurable for wide range of target architectures. Different architectural variations are specified via a configuration file, alleviating the need to recompile. The analyzer returns statistics on instruction and data cache hit rates, branch prediction, and total execution time.
An Efficient Intersection Detection Algorithm for Truncated Cylinders
Dr. Ching Kuang Shene and Yuan Zhao CS-TR 98-01
This paper presents an efficient algorithm for detecting the intersection of two truncated cylinders. A truncated cylinder is the solid bounded by a section of a cylinder and two cap disks whose planes are perpendicular to the axis of the cylinderl. This paper characterizes the intersection and reduces the problem to the intersection detetion of a disk and a truncated cylinder which, in turn, is reduced to testing if a degree four polynomial has a real root in an interval. The method of choice is a case-by-case analysis. This type of geometric analysis has been known for its robustness and efficiency, although the program that implements the algorithm could be larger.
Unroll-and-Jam Using Uniformly Generated Sets
Dr. Steven Carr and Yiping Guan CS-TR 97-04
Modern architectural trends in instruction-level parallelism (ILP) are to increase the computational power of microprocessors significantly. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Because of this compilers need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality.
One compiler transformation that is essential for a compiler to meet the above objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence based model to compute unroll amounts, significantly increasing the size of the dependence graph, or has applied brute force techniques. In this paper, we present an algorithm that uses linear-algebra-based techniques to compute unroll amounts that save 84% of the dependence-graph space needed by dependence-based techniques, while retaining performance, and that gives a more elegant solution.
A Fast Geometric Conic Reconstruction Algorithm
Dr. Ching-Kuang Shene and Dr. George Paschos CS-TR 97-03
Conics are the first curvilinear feature beyond points and line segments that are commonly used in computer vision. This paper addresses the problem of determining if two conics, from two views, are the projections of a conic in space. If the answer is affirmative, recover this conic from the information given in the images and camera positions. This problem is important in the study of stereo vision and shape from motion.
Geometric Computing in the Undergraduate Computer Science Curricula
Dr. John Lowther and Dr. Ching-Kuang Shene CS-TR 97-02
Geometric computing is a rapidly evolving interdisciplinary field involving computer science, engineering and mathematics. It has relationships to many other areas within computer science such as computational geometry, computer-aided design, computer graphics, computer vision, geometric/solid modeling, robotics, and visualization. Due to its interdisciplinary nature, geometric computing also serves as a vehicle for engineering students to approach important topics in manufacturing processes, NC-machining, molding, and milling), geometric design/modeling, (feature-based design and constraint solving), and computational metrology, (sampling strategy and tolerancing design). Unfortunately, in a typical computer science curriculum, computing with geometry is virtually missing in spite of its tremendous impact to computer science and the other fields mentioned above and its importance to increase students' employability.
This paper investigates this issue and suggests a possible remedy by designing an elementary and interdisciplinary intermediate level geometric computing course. The rationale, goals and objectives, and the impact and significance of the proposed course are discussed. A review of previous work, including textbooks, software systems, and other media materials, is included. Finally, after presenting the merit of designing this course, this paper proposes a sample ten-week syllabus with software modules for lab use.
An Optimal In-Place Array Rotation Algorithm
Dr. Ching-Kuang Shene CS-TR 97-01
This paper presents a simple optimal algorithm of rotating an array of n elements to the right D positions with n + gcd (n, D) assignments and one extra memory location. Its performance is compared against the HP STL implementation, which uses3 (n - gcd(n, D)) assignments, and a commonly seen algorithm that uses an auxiliary array.
Modulo Scheduling with Cache-Reuse Information
Chen Ding, Dr. Steven Carr and Dr. Philip Sweany CS-TR 96-08
Instruction scheduling in general, and software pipelining in particular face the difficult task of scheduling operations in the presence of uncertain latencies. The largest contributor to these uncertain latencies is the use of cache memories required to provide adequate memory access speed in modern processors. Scheduling for instruction-level parallel architectures with non-blocking caches usually assigns memory access latency by assuming either that all accesses are cache hits or that all are cache misses. We contend that allowing memory latencies to be set by cache reuse analysis leads to better software pipelining than using either the all-hit or all-miss assumption. Using a simple cache reuse model in our modulo scheduling software pipelining optimization, we achieved a benefit of 10% improved execution performance over assuming all-cache-hits and we used 18% fewer registers than were required by an all-cache-miss assumption. In addition, we outline refinements to our simple reuse model that should allow modulo scheduling with reuse to achieve improved execution performance over the all-cache-miss assumption as well. Therefore, we conclude that software pipelining algorithms for target architectures with non-blocking cache, but without rotating register files, should use a memory-reuse latency model.
Blending with Affine and Projective Dupin Cyclides
Dr. Ching-Kuang Shene CS-TR 96-07
This paper presents three new construction algorithms for blending a plane and a quadric cone with affine and projective Dupin cyclides. The first algorithm is capable of constructing an affine Dupin cyclide that blends a quadric cone (along an ellipse) and a plane. The second one is equivalent to the first but is based on geometric methods. The third one is able to construct a projective Dupin cyclide that blends a plane and a quadric cone along a hyperbola or a parabola.Thus, the constructed projective Dupin cyclide "smooths/covers'' the intersection angle made by the given surfaces. All algorithms use affine or projective transformations to reduce the given configuration to a right cone and a plane so that the blending Dupin cyclide construction can be applied.
Rotating an Array In-Place
Dr. Ching-Kuang Shene CS-TR 96-06
A simple algorithm that is capable of rotating an array of n elements to the D right positions using n + GCD ( n, D) assignments and no extra space is presented in this paper. Its performance with different GCD algorithms is compared against a commonly seen algorithm that uses an auxiliary array. Techniques used include regression analysis for establishing linear models and the Komogorov-Smirnov test for comparing performance data. The algorithm and the comparison techniques could serve as a model for CS2 students to carry out experimental analysis of algorithms.
Planar Intersection, Common Inscribed Sphere and Blending with Dupin Cyclides
Dr. Ching-Kuang Shene CS-TR 96-05
In his paper "Cyclides in Computer Aided Geometric Design," Computer Aided Geometric Design, Vol. 7 (1990), pp. 221--242, Pratt announced the following result: two cones can be blended by a Dupin cyclide if and only if they have a common inscribed sphere. However, only a proof for the sufficient part is given. This paper fills this gap with a sequence of existential results. More precisely, for two cones with distinct axes and distinct vertices, it will be shown that the existence of a planar intersection, the existence of a common inscribed sphere, the existence of a pair of common tangent planes, and the existence of a blending Dupin cyclide are all equivalent. Moreover, the existence of a pair of common tangent planes is reduced to a test of equal heights, simplifying Piegl's conic intersection detection and calculation algorithm. Degenerate cases will also be addressed.
Scalar Replacement and the Law of Diminishing Returns
Dr. Steven Carr and Qunyan Wu and Dr. Philip Sweany CS-TR 96-04
Scalar replacement is a transformation that replaces references to arrays with sequences of scalar temporaries to effect register allocation of array values. Scalar replacement algorithms with varying degrees of power have been proposed; however, no study has been done to show what degree of power is necessary in practice. Are simple algorithms enough? In addition, each algorithm may be applied using any number of dependence tests. More powerful tests should yield larger subsets of references that can be scalar replaced. Is this true?
This paper addresses these two questions. Our experiments show that nearly all opportunities for scalar replacement that exist in scientific benchmark code can be handled by simple algorithms. Also, powerful dependence analysis is not needed. Our results show that scalar replacement is an effective transformation that is cheap to implement.
Spiridon A. Messaris and Dr. David A. Poplawski CS-TR 96-03
Speculative execution, in the form of branch prediction, has become a common feature in superscalar processors and has been shown to be essential in achieving high performance. Eager execution, the fetching, issuing and execution of instructions down both paths of a conditional branch, has the potential to improve performance even more, but suffers from several problems when several successive branches are encountered. In this paper we show that eager prediction, defined to be eager execution to a certain level followed by ordinary branch prediction afterwards, produces performance improvements over pure branch prediction alone when sufficient resources are available; pure prediction still wins on machines with small issue window sizes and few functional units. Given a sufficiently large machine, the marginal cost, in terms of functional unit needs and register and memory bandwidth requirements are shown to be very minimal.
Blending Two Cones with Dupin Cyclides
Dr. Ching-Kuang Shene CS-TR 96-02
This paper provides a complete theory of blending cones with Dupin cyclides and consists of four major contributions. First, a necessary and sufficient condition for two cones to have a blending Dupin cyclide will be established. Except for a very special case in which the cones have a double line and parallel axes, two cones have a blending Dupin cyclide if and only if they have planar intersection. Second, based on the intersection structure of the cones, finer characterization results are possible. For example, all blending Dupin cyclides for two cones with a double line, distinct vertices and intersecting axes must be cubic. Third, a new construction algorithm that establishes a correspondence between points on one or two coplanar lines and all constructed blending Dupin cyclides makes the construction easy and well- organized. Fourth, the completeness of the construction algorithm will also be proved. As a result, any blending Dupin cyclide can be constructed by the algorithm. Consequently, all blending Dupin cyclides are organized into one to four one-parameter families, each of which is ``parameterized'' by points on a special line. It will be shown that each family contains an infinite number of ring cyclides, ensuring the existence of singularity free blending surfaces. Finally, color plates are provided to illustrate various cyclide.
On Parametric Representations of Cubic Dupin Cyclides
Dr. Ching-Kuang Shene CS-TR 96-01
Srinivas and Dutta recently constructed a parametric representation for cubic Dupin cyclides. They assumed two skew lines on a cubic Dupin cyclide are available and used the third intersection point of a moving line that touches both skew lines to generate a rational parameterization. In this note, it will be shown that this method is just another form of a popular circle parameterization. A new method based on the geometry of cubic Dupin cyclides will also be presented. Unlike Srinivas and Dutta's method, this new method does not require the cubic Dupin cyclide in a special orientation and a parametric representation is constructed directly without consulting an implicit equation. Implicitizing the parametric equations obtained from this new method yields an implicit equation.
Dupin Cyclides as Blending Surfaces for CSG Primitives
Dr. Ching-Kuang Shene CS-TR 95-10
This paper extends the existing work of using Dupin cyclides as blending surfaces for CSG primitives. Necessary and sufficient conditions as well as construction algorithms are presented for the cone-torus and the torus-torus cases, completing the study of using a single cyclide to blend a pair of CSG primitives. The common inscribed sphere criterion for two axial natural quadrics and the offset property of cyclides are used extensively to derive important results and to construct the blending surfaces. This paper also reveals new geometric insights as well as some interesting ways of blending a cone and a torus, and two tori.
Improving Data Locality with Loop Transformations
Kathryn S. McKinley (Univ. of Massachusetts-Amherst), Dr. Steve Carr and Chau-Wen Tseng (Univ. of Maryland) CS-TR 95-09
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
Compiler Blockability of Dense Matrix Factorizations
Dr. Steve Carr and R.B. Lehoucq (Rice University) CS-TR 95-08
Recent architectural advances have made memory accesses a significant bottleneck for computational problems. Even though cache memory helps in some cases, it fails to alleviate the bottleneck for problems with large working sets. As a result, scientists are forced to restructure their codes by hand to reduce the working-set size to fit a particular machine. Unfortunately, these hand optimizations create machine-specific code that is not portable across multiple architectures without a significant loss in performance or a significant effort to re-optimize the code. It is the thesis of this paper that most of the hand optimizations performed on matrix factorization codes are unnecessary because they can and should be performed by the compiler. It is better for the programmer to express algorithms in a machine-independent form and allow the compiler to handle the machine-dependent details. This gives the algorithms portability across architectures and removes the error-prone, expensive and tedious process of hand optimization. In this paper, we show that LU and Cholesky factorizations may be optimized automatically by the compiler to as efficient as the same hand-optimized version found in LAPACK. We also show that QR factorization may be optimized by the compiler to perform comparably with the hand-optimized LAPACK version on matrix sizes that are typically run on nodes of massively parallel systems. Our approach allows us to conclude that matrix factorizations can be expressed in a machine-independent form with the expectation of good memory performance across a variety of architectures.
Communication on the Intel Paragon
Hazel Shirley, Robert Reynolds, and Dr. Steven Seidel CS-TR 95-07
Simple cost models have proved inadequate to predict the cost of complex communication operations such as the complete exchange on the Intel Paragon. To understand why this is true this work investigated several attributes of interprocessor communication on the Intel Paragon. It describes a standard of operation that offers the best combination of global synchronization, message passing primitives and protocols, memory allocation decisions, and other features of an application's communication code that effect performance. Descriptions of the simple communication operations and two algorithms for the complete exchange problem are given along with their observed costs. And finally, this work reveals why simple message passing cost models for the Intel Paragon are inadequate. It is shown that system interrupts disrupt the synchronization of complex communication algorithms and that a bimodal latency cost function adds to the cost of communication. In consideration of these factors, a new model is constructed that more accurately reflects the behavior of communication on the Intel Paragon.
Combining Optimization for Cache and Instruction-Level Parallelism
Dr. Steven Carr CS-TR 95-06
Current architectural trends in instruction-level parallelism have dramatically increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Not only do compilers need to be concerned with finding enough ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops or on improving cache performance. This paper presents a performance metric that can be used to guide the optimization of loops considering the combined effects of ILP, data reuse and latency hiding techniques. Preliminary experiments reveal that dramatic performance improvements for loops are obtainable (a factor of nearly 9 is achieved on two kernels on the IBM RS/6000).
An Analysis of Unroll-and-Jam on the HP 715/50
Dr. Steve Carr and Qunyan Wu CS-TR 95-05
Current directions in computer design have resulted in dramatic improvements in the power of microprocessors. These gains have been achieved by cycle-time improvements, by multiple-instruction issue and by pipelined functional units. As a result of these architectural improvements, today's microprocessors, like the HP PA-RISC, can perform many more operations per machine cycle than their predecessors. But with these gains have come problems. Because the latency and bandwidth of memory systems have not kept pace with processor speed, computations are often delayed waiting for data from memory. As a result, processors see idle computational cycles and empty pipeline stages more frequently. In fact, memory and pipeline delays have become the principal performance bottleneck for high-performance microprocessors. To overcome these performance problems, programmers have learned to employ a coding style that achieves a good balance between memory references and floating-point operations. The goal of this coding methodology is to eliminate pipeline interlock and to match the ratio of memory operations to floating-point operations in program loops to the optimum such ratio that the target machine can deliver. In this report, we present an experiment with an automatic compiler method to match balance ratios that obviates the need for programmer optimization. The goal is to improve the efficiency of schedules generated by the back end of the HP compiler.
Improving Software Pipelining with Unroll-and-Jam
Dr. Steve Carr, Chen Ding and Philip Sweany CS-TR 95-04
To take advantage of recent architectural improvements in microprocessors, advanced compiler optimizations such as software pipelining have been developed. Unfortunately, not all loops have enough parallelism in the innermost loop body to take advantage of all of the resources a machine provides. Parallelism is normally inhibited by either inner-loop recurrences, or by a mismatch between the resource requirements of a loop and the resources provided by the target architecture. Unroll-and-jam is a transformation that can be used to increase the amount of parallelism in the innermost loop body by making better use of resources and limiting the effects of recurrences. In this paper, we demonstrate how unroll-and-jam can significantly improve the initiation interval in a software-pipelined loop. Improvements in the initiation interval of greater than 35% are common, while dramatic improvements of a factor of 5 are possible.
An Analysis of Loop Permutation on the HP PA-RISC
Dr. Steve Carr and Qunyan Wu CS-TR 95-03
In modern computers, processor speed has become significantly faster than memory speed. Cache memories are designed to overcome this difference, but they are only effective when programs exhibit data locality. In this report, we present an experiment with compiler optimizations to improve data locality based on a simple cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop permutations. The cost model drives the application of compound transformations consisting of loop permutation, loop distribution, and loop reversal. We implemented the transformations and ran experiments on a large collection of scientific programs on both the HP 715/50 and HP 712/80. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent data-locality characteristics of these programs and our ability to improve their locality. Performance improvements were difficult to achieve because benchmark programs typically have high hit rates; however, our optimizations significantly improved several programs.
The Performance of Scalar Replacement on the HP 715/50
Dr. Steve Carr and Qunyan Wu CS-TR 95-02
Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. This paper presents an experiment on an HP 715/50 with a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The scalar replaced variables are more likely to be assigned to registers by the coloring-based register allocators found in most compilers than are their unreplaced counterparts. Experimental results show that scalar replacement is extremely effective. On kernels, integer- factor improvements over code generated by a good optimizing compiler of conventional design are possible.
The Unlimited Resource Machine (URM)
The Unlimited Resource Machine (URM) is a hypothetical instruction set architecture that was designed specifically for study of the architecture of and code generation techniques for machines exhibiting instruction level parallelism. A set of tools have been implemented to support this machine model, including compilers, a (linking) assembler, simulators, and trace analyzers. The architecture and its tools are being used in a variety of research projects at MTU. This report describes the philosophy behind the URM, the architecture and instruction set, the tools and their interfaces, and how to use the tools to model various ILP architectures.
Overview of the ROCKET Retargetable C Compiler
Dr. Philip H. Sweany and Steven J. Beaty (Cray Computer Corporation). CS-TR 94-01
This report describes the ROCKET compiler, a highly optimizing C translator which is retargetable for a broad class of ILP architectures.
Compiler Optimizations for Improving Data Locality
Dr. Steven Carr, Dr. Kathryn S. McKinley (Univ. of Mass. at Amherst)
and Dr. Chau-Wen Tseng (Stanford Univ.) CS-TR 94-02
In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations, consisting of loop permutation, loop distribution and loop fusion. We show that these transformations are required to optimize many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For complete applications, we executed the original and transformed versions, simulated cache hit rates, and collected statistics about the inherent characteristics of these programs and our ability to improve them. To our knowledge, these studies are the first with this breadth and depth.
An Examination of the Behavior of Slice Based Cohesion Measures
Sakari Karstu and Dr. Linda M. Ott CS-TR 94-03
Is cohesion really indicative of other software attributes, like reliability or maintainability, as the software community believes? This question has been difficult to answer since there have been no automated tools for measuring the cohesion of software modules. With the development of an automated tool for collecting cohesion data, the validation of the measured cohesion with the original definition of cohesion was possible, using production software.
This study is one of the first studies to relate cohesion to any other attributes of software, in this case, revisions made to the software. It is commonly known that the number of revisions is strongly related to the length of the module. As a result of the study, we present data that supports our intuition that in addition to length, the level of cohesion of a module is a factor in estimating the quality of the software. In particular the data indicates that there appears to be a correlation between modular cohesion and revisions made to the module. This would suggest that highly cohesive modules are less likely to need changes.
A Compiler Blockable Algorithm for QR Decomposition
Dr. Steve Carr and Dr. Richard Lehoucq (Rice University) CS-TR 93-02
Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Unfortunately, memory speeds have not kept pace, resulting in an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. The compiler should automatically perform restructuring well enough to obviate the need for hand-optimization, thus allowing machine-independent expression of algorithms with the expectation of good performance.
This paper extends previous work on the ability of a compiler to restructure codes automatically to improve memory-hierarchy performance. Specifically, the paper extends work in the automatic derivation of block algorithms from their corresponding point algorithms. Previous work showed that the compiler could not automatically derive the block algorithm for QR factorization found in LAPACK from its corresponding point algorithm because the block LAPACK version is a new algorithm rather than a reshaping of the point algorithm. This paper presents a compiler-derivable block algorithm. The algorithm is shown on smaller array sizes to be effective enough to obviate the need for the tediously hand-optimized algorithm found in LAPACK implementations. By relying on the compiler to do the optimization, the algorithm remains machine independent and retains high-performance across multiple architectures, without programmer intervention.
Profiling the Communication Workload of an iPSC/860
Brian VanVoorst and Dr. Steven Seidel CS-TR 93-03
A communication profiler was developed for the Intel iPSC/860 hypercube. The profiler was implemented as a kernel modification so that communication traffic could be monitored transparently. The profiler was used to gather information about the communication traffic in applications that comprised an iPSC/860's daily workload. Since communication is an expensive operation in a MIMD machine, the profiler was designed to show where those costs were concentrated and to show where efforts to improve communication performance should be directed. The 128 node iPSC/860 at the NASA Ames Research Center was profiled for several days. Summaries of data are presented here, along with recommendations that include proposed changes to the long message protocol to reduce transmission latency.
Broadcast and Complete Exchange Algorithms for Mesh Topologies
Siddhartha Takkella and Dr. Steven Seidel CS-TR 93-04
The availability of multicomputers with a mesh interconnection network has created a need to develop efficient global communication algorithms for that architecture. The broadcast and the complete exchange are the two communication problems of primary interest in this work. Improved complete exchange algorithms for meshes of arbitrary size, and for square meshes in particular, have been developed. Motivated by the efficiency of pipelined broadcast algorithms for hypercubes, efforts were concentrated to develop similar broadcast algorithms for meshes, resulting in an improved pipelined broadcast algorithm for meshes. Performance results from the Intel Delta mesh are also given for one of the new complete exchange algorithms.
An Algorithm for Identifying Regions of a DNA Sequence that Satisfy a Content Requirement
Dr. Xiaoqiu Huang CS-TR 93-05
We present a dynamic programming algorithm for identifying regions of a DNA sequence that meet a user-specified compositional requirement. Applications of the algorithm include finding C+G-rich regions, locating TA+CG-deficient regions, identifying CpG islands, and finding regions rich in periodical three-base patterns. The algorithm has an advantage over the simple window method in that the algorithm shows the exact location of each identified region. The algorithm has been implemented as a portable C program called LCP (Local Content Program). LCP is extremely efficient in computer time and memory; it instantly locates all regions of a long DNA sequence meeting a given requirement. The LCP program was used to analyze the rabbit A-like goblin gene cluster sequence.
A Context Dependent Method for Comparing Sequences
Dr. Xiaoqiu Huang CS-TR 93-06
A scoring scheme is presented to measure the similarity score between two biological sequences, where matches are weighted dependent on their context. The scheme generalizes a widely used scoring scheme. A dynamic programming algorithm is developed to compute a largest-scoring alignment of two sequences of lengths.
Alignment of Three Sequences in Quadratic Space
Dr. Xiaoqiu Huang CS-TR 93-07
We present a dynamic programming algorithm for computing an optimal global alignment of three sequences in quadratic space. The algorithm has been implemented as a portable C program. The program makes it possible to align simultaneously any three protein sequences on workstations with tens of megabytes of main memory. This rigorous program is used as a benchmark to evaluate the performance of a heuristic multiple sequence alignment program on protein sequences from the Protein Identification Resource protein sequence database. Experimental results indicate that the rigorous program may perform significantly better than the heuristic program on three sequences of weak similarity. We also describe algorithms for computing local alignments among three sequences in quadratic space.
Improving the Ratio of Memory Operations to Floating-Point in Loops
Dr. Steven Carr and Dr. Ken Kennedy (Rice University) CS-TR 93-08
Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Because computations often require more data from cache per floating-point operation than a machine can deliver and because operations are pipelined, idle computational cycles are common when scientific applications are run on modern microprocessors. To overcome these bottlenecks, programmers have learned to use a coding style that ensures a better balance between memory references and floating-point operations. In our view, this is a step in the wrong direction because it makes programs more machine-specific. A programmer should not be required to write a new program version for each new machine; instead, the task of specializing a program to a target machine should be left to the compiler.
But is our view practical? Can a sophisticated optimizing compiler obviate the need for the myriad of programming tricks that have found their way into practice to improve performance of memory hierarchy? In this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the performance of each loop in a particular program and use these estimates to determine whether to apply various loop transformations.