Abstracts

Jon Weissman: Nebula: Distributed Edge Cloud for Data Intensive Computing

Centralized cloud infrastructures have become the de facto platform for data-intensive computing today. However, they suffer from inefficient data mobility due to the centralization of cloud resources, and are therefore ill-suited to dispersed data-intensive applications, where the data may be spread across multiple geographic locations. In this paper, we present Nebula: a dispersed cloud infrastructure that uses voluntary edge resources for both computation and data storage. We describe the lightweight Nebula architecture and the scheduling challenges associated with location-aware data and computation placement, replication, and recovery. We evaluate Nebula on an emulated volunteer platform spanning over 50 PlanetLab nodes distributed across Europe, and show promising results for data-intensive computing applications based on MapReduce.

 


 

Padma Raghavan: Utilizing Fine-Grain Parallelism in Sparse Computations through Scheduling
Joint work with Humayun Kabir and Joshua Booth.

A key challenge in achieving high performance in the parallel processing of sparse matrix and graph computations is that the available parallelism is largely fine-grained. Conceptually, such computations amount to processing, in parallel, task graphs with complex dependencies and fine-grain tasks. In practice, however, their implementations are not stated in this form; they are instead formulated to exploit parallelism in nested loops. We propose implementations based on scheduling fine-grain task graph representations of sparse matrix and graph computations as an alternative to existing approaches. Our scheduling approach can exploit properties of NUMA multicores and restructuring techniques to yield task graphs with different levels of available parallelism. We will discuss our approach and provide performance results for kernels such as sparse matrix-vector multiplication and breadth-first search. Our preliminary results are promising, with performance comparable to or better than that of traditional implementations.
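The fine-grain task view advocated above can be illustrated with a minimal sketch (not the authors' implementation): sparse matrix-vector multiplication over a CSR matrix, expressed as an explicit list of row-block tasks rather than a nested loop. The block size and the list-order schedule are assumptions for illustration.

```python
def csr_spmv_tasks(indptr, indices, data, x, block=2):
    """SpMV y = A @ x, with A in CSR form, executed as fine-grain tasks."""
    n = len(indptr) - 1
    y = [0.0] * n
    # One task per block of rows; for SpMV the tasks are independent,
    # so any schedule (here, simple list order) is valid.
    tasks = [range(r, min(r + block, n)) for r in range(0, n, block)]
    for task in tasks:
        for i in task:
            y[i] = sum(data[k] * x[indices[k]]
                       for k in range(indptr[i], indptr[i + 1]))
    return y
```

For kernels with real dependencies (e.g. sparse triangular solve or level-synchronized BFS), the task list becomes a DAG and the schedule must respect a topological order.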

 


 

Alix Munier: Dominance of K-Periodic schedules for evaluating the maximum throughput of a Synchronous Dataflow Graph
Joint work with Bruno Bodin and Benoît Dupont de Dinechin.

Synchronous Dataflow graphs, introduced by Lee and Messerschmitt in 1987, are a well-known formalism commonly used to model data exchanges between parallel programs. This model has been studied extensively over the last two decades because of the importance of its applications. However, determining the maximum throughput is a difficult question for which no polynomial-time algorithm is known to date. In this context, several authors have proved that a K-Periodic schedule, where K is a vector whose values are not polynomially bounded, reaches the maximum throughput. On the other hand, a 1-Periodic schedule can be built in polynomial time, but without any guarantee on the throughput achieved. The question investigated here is therefore the trade-off between the schedule size induced by the vector K (called the periodicity vector) and the corresponding throughput. The aim of this talk is to show that, for any fixed vector K on the actors, a K-Periodic schedule of minimum period can be built in which each actor t is repeated K_t times. We observed experimentally that the maximum throughput of a K-Periodic schedule does not necessarily increase when the values of K increase. A set S of dominant values of K is exhibited, and a relationship between the optimal throughputs of these values is proved. We very recently proved that any vector K that does not belong to S leads to the same optimal throughput as an element of S of smaller value. Real-life experiments measure the variation of the throughput according to K.
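As background for the periodicity vector, recall that a consistent SDF graph has a repetition vector solving the balance equations. A minimal sketch for a single edge (illustrative, not taken from the talk): if actor a produces p tokens per firing and actor b consumes c tokens per firing, the minimal positive solution of q_a * p = q_b * c is given below.

```python
from math import gcd

def repetition_vector(p, c):
    """Minimal repetitions (q_a, q_b) for one SDF edge a --p/c--> b."""
    g = gcd(p, c)
    # q_a * p == q_b * c, with q_a, q_b as small as possible.
    return c // g, p // g
```

A K-Periodic schedule then repeats each actor t a chosen multiple K_t of these base repetitions within one period; the talk studies which vectors K are worth considering.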

 


 

Abdou Guermouche: A runtime approach to dynamic resource allocation for sparse direct solvers

With the advent of multicore processors and the ever-increasing complexity of hardware architectures, programming models based on DAG-of-tasks parallelism have regained popularity in the high-performance scientific computing community. In this context, enabling HPC applications to perform efficiently when dealing with graphs of parallel tasks that could potentially run simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling multiple parallel tasks over the same set of hardware resources introduces many issues, such as undesirable cache flushes or memory bus contention. In this paper, we show how runtime-system-based scheduling contexts can be used to dynamically enforce the locality of parallel tasks on multicore machines. We extend an existing generic sparse direct solver to use our mechanism and introduce a new decomposition method, based on proportional mapping, that is used to build the scheduling contexts. We propose a runtime-level dynamic context management policy to cope with the very irregular behavior of the application. A detailed performance analysis shows significant performance improvements of the solver on various multicore hardware.
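The proportional-mapping idea used to build the scheduling contexts can be sketched in a few lines (an illustrative simplification, not the solver's code): processors are divided among the subtrees of the elimination tree in proportion to their workloads, here with largest-remainder rounding as an assumed tie-breaking rule.

```python
def proportional_mapping(weights, nprocs):
    """Split nprocs among subtrees proportionally to their workloads."""
    total = sum(weights)
    shares = [nprocs * w / total for w in weights]   # ideal fractional shares
    alloc = [int(s) for s in shares]                 # round down first
    leftover = nprocs - sum(alloc)
    # Hand out remaining processors to the largest fractional remainders.
    order = sorted(range(len(weights)),
                   key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

Each resulting processor group would then form one scheduling context, confining a subtree's tasks to a locality domain.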

 


 

Véronika Sonigo: Optimizing Buffer Sizes for Pipeline Workflow Scheduling With Setup Times

Mapping linear workflow applications onto a set of homogeneous processors can be solved optimally in polynomial time for the throughput objective when there are fewer processors than stages. This result holds even when setup times occur during execution and homogeneous buffers are available for storing intermediate results. In this kind of application, several computation stages are interconnected as a linear application graph; each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time is incurred when passing from one stage to another. In this paper, we tackle the problem where the buffer sizes are not given beforehand and have to be fixed before execution so as to maximize the throughput within each processor. The goal of this work is to minimize the cost induced by the setup times by allocating buffers whose sizes are proportional to one another. We present a closed formula to compute the optimal buffer allocation in the case of non-decreasing setup costs along the linear application. For the case of unsorted setup times, we provide competitive heuristics that are validated via extensive simulation.
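For intuition about why buffer sizes matter, here is a continuous-relaxation sketch (an illustrative analogue, not the paper's closed formula): if each setup s_i is paid once per b_i items flowing through stage i's buffer, minimizing the total amortized setup cost under a total buffer budget B yields sizes proportional to the square roots of the setup times.

```python
from math import sqrt

def setup_cost(setups, buffers):
    """Amortized setup overhead per item: each s_i is paid once per b_i items."""
    return sum(s / b for s, b in zip(setups, buffers))

def sqrt_proportional(setups, budget):
    """Lagrange solution of min sum s_i/b_i s.t. sum b_i = budget."""
    total = sum(sqrt(s) for s in setups)
    return [budget * sqrt(s) / total for s in setups]
```

The actual problem also requires integer sizes and handles sorted versus unsorted setup times differently, which is where the paper's formula and heuristics come in.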

 


 

Amina Guermouche: Minimizing Energy Consumption of MPI Programs in Realistic Environment.

Dynamic voltage and frequency scaling has proved to be an efficient way of reducing the energy consumption of servers. Energy savings are typically achieved by setting a well-chosen frequency during certain program phases. However, determining suitable program phases and their associated optimal frequencies is a complex problem. Moreover, hardware is constrained by non-negligible frequency transition latencies and a limited number of power islands. Thus, various heuristics have been proposed to determine and apply frequencies, but evaluating their efficiency remains an issue. To evaluate the accuracy of energy-saving heuristics, we present a system able to generate the frequency schedule leading to minimal energy consumption. It profiles parallel programs and builds an adequate ILP problem that specifically models realistic hardware limitations. We show that, in most cases, providing a solution is impossible because of the problem's complexity. We then present how some constraints can be relaxed in order to provide a solution.
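The core optimization can be sketched by brute force on a toy instance (the talk uses an ILP with realistic hardware constraints; the cubic power model and the deadline constraint below are stand-in assumptions): choose one frequency per program phase so as to minimize energy while staying within a time budget.

```python
from itertools import product

def min_energy_schedule(work, freqs, deadline):
    """Exhaustively pick a frequency per phase minimizing energy.

    work[i]  -- cycles of phase i; time at frequency f is work[i] / f.
    Power model p(f) = f**3 is an illustrative assumption.
    Returns (energy, frequency_tuple) or None if no schedule meets the deadline.
    """
    best = None
    for assign in product(freqs, repeat=len(work)):
        time = sum(w / f for w, f in zip(work, assign))
        if time > deadline:
            continue
        energy = sum((f ** 3) * (w / f) for w, f in zip(work, assign))
        if best is None or energy < best[0]:
            best = (energy, assign)
    return best
```

The real problem adds frequency transition latencies and power-island coupling between cores, which is precisely what makes the exact ILP hard to solve.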

 


 

Guillaume Aupy: Power-aware replica placement in tree networks with multiple servers per client

In this paper, we revisit the well-studied problem of replica placement in tree networks. Rather than minimizing the number of servers needed to serve all client requests, we aim at minimizing the total power consumed by these servers. In addition, we use the most general (and powerful) server assignment policy, where the requests of a client can be served by multiple servers located in the (unique) path from this client to the root of the tree. We consider multi-modal servers that can operate at a set of discrete speeds, using the dynamic voltage and frequency scaling (DVFS) technique. The optimization problem is to determine an optimal location of the servers in the tree, as well as the speed at which each server is operated. A major result is the NP-completeness of this problem, to be contrasted with the minimization of the number of servers, which has polynomial complexity. Another important contribution is the formulation of a Mixed Integer Linear Program (MILP) for the problem, together with the design of several polynomial-time heuristics. We assess the efficiency of these heuristics by simulation. For mid-size instances (up to 30 nodes in the tree), we evaluate their absolute performance by comparison with the optimal solution (obtained via the MILP). The most efficient heuristics provide satisfactory results, within 20% of the optimal solution.
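A toy brute force conveys why the multiple-servers-per-client policy matters (an illustrative sketch, not the paper's MILP; the cubic power-per-speed model is an assumption): for a single client whose requests may be split among servers on its path to the root, splitting across several slow servers can consume less power than one fast server.

```python
from itertools import product

def min_power_path(requests, speeds, path_len):
    """Place servers on one client's root path to serve all requests.

    Each node on the path gets a speed from `speeds` or 0 (no server);
    power of a server running at speed s is s**3 (DVFS-style assumption).
    Returns (power, speed_tuple) for a cheapest feasible placement.
    """
    best = None
    for assign in product([0] + list(speeds), repeat=path_len):
        if sum(assign) >= requests:           # requests can be split freely
            power = sum(s ** 3 for s in assign)
            if best is None or power < best[0]:
                best = (power, assign)
    return best
```

With convex power, serving 3 requests with speeds 1 and 2 costs 1 + 8 = 9, versus 27 for a single speed-3 server; in the general tree setting, interactions between clients sharing servers are what makes the problem NP-complete.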

 


 

Samuel Thibault: StarPU: a platform for experimenting with schedulers in the wild

In this talk, I will present how StarPU can be used as a platform for experimenting with new scheduling strategies, applied over a range of real-world applications. StarPU is a runtime system which provides dynamic execution support, for applications expressed as a DAG, on hybrid CPU+GPU systems. Particular care was taken with the scheduling part, which uses state-of-the-art algorithms for handling the heterogeneity of the resources. A wide range of applications have been ported to StarPU, including dense and sparse linear algebra, finite element methods, stencils, etc. Implementing a new StarPU scheduler has been made seamless, with all the technical details of task management handled by the StarPU core. This has been made even easier by allowing schedulers to be written as a combination of elementary components (FIFO, decision function, prefetch trigger, etc.).
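The component-composition idea can be mimicked in a toy sketch (this illustrates the concept only; StarPU's actual scheduler interface is a C API, and the names below are invented for the example): a decision-function component routes each pushed task into one of several per-worker FIFO components.

```python
from collections import deque

class Fifo:
    """Elementary queue component: per-worker task storage."""
    def __init__(self):
        self.q = deque()
    def push(self, task):
        self.q.append(task)
    def pop(self):
        return self.q.popleft() if self.q else None

class DecisionFunction:
    """Routing component: sends each pushed task to one child component."""
    def __init__(self, children, choose):
        self.children = children
        self.choose = choose          # task -> child index
    def push(self, task):
        self.children[self.choose(task)].push(task)

# Compose a two-worker scheduler: tasks tagged "gpu" go to worker 1.
workers = [Fifo(), Fifo()]
sched = DecisionFunction(workers,
                         choose=lambda t: 1 if t.endswith("gpu") else 0)
```

Swapping the `choose` function or stacking further components (e.g. a prefetch trigger between the decision and the FIFOs) changes the policy without touching task management, which is the point of the modular design.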
