# Analysis of Data Reuse in Task-Parallel Runtimes

## Miquel Pericàs\*, Abdelhalim Amer†

\*Global Scientific Information and Computing Center †Department of Mathematical and Computing Sciences Tokyo Institute of Technology

> \*pericas.m.aa@m.titech.ac.jp †amer@matsulab.is.titech.ac.jp

## Kenjiro Taura

Graduate School of Information Science and Technology The University of Tokyo tau@eidos.ic.i.u-tokyo.ac.jp

## Satoshi Matsuoka

Global Scientific Information and Computing Center Department of Mathematical and Computing Sciences Tokyo Institute of Technology matsu@is.titech.ac.jp

Abstract—This paper proposes a methodology to study the data reuse quality of task-parallel runtimes. We introduce an extension to the reuse distance method called the Kernel Reuse Distance (KRD). The metric is a low-overhead alternative designed to analyze data reuse at the socket level while minimizing perturbation to the parallel schedule. Using the KRD metric we show that reuse depends considerably on the system configuration (sockets, cores) and on the runtime scheduler. Furthermore, we correlate KRD with hardware metrics such as cache misses and work time inflation. Overall we found that KRD can be used effectively to assess data reuse in parallel applications. The study also revealed that several current runtimes suffer from severe bottlenecks at scale which often dominate performance.

#### I. INTRODUCTION AND BACKGROUND

Tasking has become an established technique to program multicore systems. This programming scheme supports many variations of parallel control, including nested, recursive and irregular parallelism. Task-parallel models, such as OpenMP [1], Threading Building Blocks [2] or Cilk [3], allow the developer to annotate functions or code blocks for asynchronous task execution and add synchronization points to process the children tasks' outputs. u An underlying runtime tracks dependencies among tasks and *schedules* ready tasks to physical cores.

#### A. Scalability of Runtimes

Although the functionality of a runtime for homogeneous multicore systems may seem simple, developing efficient and scalable implementations is challenging. Design decisions can adversely affect execution time:

- a) Runtime Overheads: Operations such as task creation, synchronization or scheduling introduce non-work cycles that can considerably increase execution time. Runtime pressure grows with the number of workers and with finer task granularities. Contention can easily occur at scale. Runtimes should be as lightweight as possible to avoid such bottlenecks.
- b) Scheduling Constraints: Runtimes may place restrictions on task scheduling to simplify implementation or to set bounds on resource consumption. For example, some runtimes never migrate tasks once they have started. Some runtimes also limit the depth of nesting to avoid unlimited stack growth. Such constraints limit dynamic parallelism which manifests as non-work overheads in the form of processor idle time.
- c) Resource Sharing: Scheduling policies, such as work-first [4] or its dual help-first, and work stealing [5] techniques, determine the execution order of tasks. The resulting schedule defines the order of work kernels and their sharing of resources. A task order that ignores data locality issues can increase cache misses and generate work time inflation (WTI) [6].

In this work we use the term *non-work* for any kind of processor activity that is not directly related to the program's main

functionality, which is carried out by work kernels. This includes runtime overheads and parallel idleness [7]. Tasks may include several kernels, but the kernels themselves do not generate any new tasks.  $\text{OVR}_N$  and  $\text{WTI}_N$  (Non-work Overheads and Work Time Inflation at N cores) are scaling factors which represent the increase of execution time compared to the ideal parallel execution time (i.e.  $T_{\text{parallel}} = (T_{\text{serial}}/N) \times \text{OVR}_N \times \text{WTI}_N$ ). Using these definitions the speed-up at N cores is defined as follows: Speed-Up $_N = N/(\text{OVR}_N \times \text{WTI}_N)$ 

## B. Performance Tools

Application developers are often unaware of such issues and are then surprised by the bad performance of their applications as they scale to many cores. Quality tools are needed to detect these problems. Profilers and tracers provide insight into *non-work overheads* [7]–[10] by quantifying load imbalance and runtime activity overhead. *Scheduling constraints* are more difficult to analyze, since they relate to algorithmic decisions inside the runtimes. Similarly, *caching problems* caused by scheduling decisions may be hard to identify. Low locality exploitation in users' code, on the other hand, is a well known topic addressed by several tools [11]–[13].

This paper focuses on the problem of understanding caching problems introduced by the runtime scheduler in task-parallel applications. To analyze the impact of schedulers on data reuse we propose a methodology based on the concept of the reuse distance [14]. By analyzing the reuse distance observed at each last level cache, the metric allows to make a system-level assessment on the reuse performance of different runtimes.

#### C. Contributions

This paper makes the following contributions: 1) We describe the implementation of the Kernel Reuse Distance (KRD), an extension to the reuse distance metric targeting the analysis of temporal locality in task-parallel applications. 2) Using KRD we evaluate the temporal locality of two benchmarks using four schedulers. Our analysis reveals that differences in reuse increase with the number of cores and sockets. 3) We study the correlation between the KRD metric and hardware metrics such as cache misses and work time inflation. As part of this research we also observed that, at scale, performance and work time inflation are often dominated by runtime bottlenecks.

This paper is organized as follows: Section II sets the scenario by analyzing the scalability of two benchmark applications. Section III describes the KRD metric and its implementation. The metric is applied in Section IV to observe how temporal locality is influenced by runtime schedulers and to study its correlation with performance metrics. We conclude in Sections V and VI by discussing weaknesses of the approach and by summarizing the main conclusions.

## II. CASE STUDY: MATRIX MULTIPLICATION AND FAST MULTIPOLE METHOD

The development of KRD is motivated with a scalability study of two codes: Matrix Multiplication (MATMUL) and the Fast Multipole Method (FMM).

#### A. Benchmarks

The *MATMUL* code is a SIMD-optimized divide-and-conquer implementation which includes a task parallel implementation based on Cilk-like spawn and sync constructs [4]. The code recursively bisects the matrices until all three matrices A, B and C fit in the L1 cache. For the experiment we use matrices A, B and C of size 4096×4096, which translates into 64MB per matrix (single precision). On our test environment (described below) the granularity of each task (kernel) is about 17 microseconds.

The Fast Multipole Method is based on the exaFMM-dev implementation developed by Rio Yokota [15]. The FMM algorithm contains multiple steps. We focus only on the dominant phase: the dual tree traversal, which includes the two main kernels: M2L (multipole-to-local) and P2P (particle-to-particle). We run one FMM timestep on 1 million particles organized in a plummer distribution. The multipole expansion coefficient is set to 5 and the number of particles per leaf box is 32. The tree traversal phase is also parallelized by a divide and conquer approach, and uses the same Cilk-like constructs as MATMUL. The FMM kernels are very small, with each call to M2L only 500 nanoseconds. To avoid excessive overhead the recursion stops when less than 300 bodies are remaining under the current subtree, yielding multiple kernels per task. On our test system, the average size of one task is 3.25 microseconds.

## B. Experimental Infrastructure

We benchmark the codes on a 4-socket x86-64 server featuring  $4\times$  Intel Xeon E7-4807 (Westmere-EX) processors, each with 6 cores clocked at 1.86GHz. The cores have a 32KB L1 data cache (8-way set associative) and a 256KB L2 cache (8-way). The six cores share a 18MB last level cache (L3) with 16 ways. Hyperthreading is not used. When scaling to multiple cores, we first allocate all the cores in one socket and then fill the cores from a different socket. All codes were compiled using gcc version 4.7. The research platform runs a Linux distribution with kernel version 2.6.32.

The Cilk-like constructs are translated into API calls for three runtimes, identified as follows:

a) MTH: MassiveThreads [16], [17] is a lightweight task-parallel library that features a work-first scheduler, per-core LIFO task queues, and a random work stealer similar to the MIT-Cilk design. We use version 0.3beta.

*b)* TBB: Threading Building Blocks [2], [18] is a C++ template library for task parallelism with a help-first approach, per-core LIFO task queues and a locality aware work stealer [19]. We use version tbb41\_20130116oss.

c) QTH: Qthread [20]–[22] is a lightweight threading package that implements a help-first scheduler otherwise similar to Cilk. Qthread adds a new level to the task queues' hierarchy called *shepherds*. Shepherds can be assigned per socket to create a shared LIFO task queue among the workers (i.e. cores) of the socket. The goal is to improve the use of the shared cache. We refer to this configuration as QThread/Socket. We also test a configuration with one shepherd per core, we which identify as QThread/Core. The qthread version we use is 1.9.

## C. Scalability Analysis

The applications were manually instrumented with our own profiling library (described later). This library measures execution times, work time inflation and non-work overheads. Figures 1 and 2 show the speed-ups and non-work overheads (OVR $_N$ ) for the two applications and four schedulers when scaling from 1 to 24 cores. We also show the product of the speed-up and overhead normalized by the number of cores. Using the earlier equation we derive (Speed-Up $_N \times$  OVR $_N/N$ ) =  $1/\text{WTI}_N$ . The product is thus a measure of the speed-down caused by work time inflation.

The figures show that scalability of these applications is highly dependent on the runtime. Using MassiveThreads, speedups of up to  $21\times$  and  $18\times$  are achieved for MATMUL and FMM at 24 cores, respectively. TBB displays very good scaling until the first socket is filled, but performance decays at higher core counts. Qthreads performance is already degraded in the single socket scenario. However, it scales better than TBB for multiple sockets.

These results are highly correlated with the non-work overheads. MassiveThreads is the only runtime that does not suffer from a large increase, with 30% overhead in the worst case. The other runtimes suffer about 3-4× higher overheads at high core counts. Both Matmul and FMM have small task sizes, which MassiveThreads is designed to handle efficiently. Othread's single core overheads demonstrate that it is more heavy and suffers under fine-grained parallelism. However, scaling to higher cores reveals just a smooth degradation. TBB's overheads are lower than MassiveThreads for a single socket but increase fast for multiple sockets. The QTH/Socket overheads are consistently larger than those of QTH/Core. QTH/Socket features a per-socket shared LIFO task queue which is accessed by all workers in a shepherd. The frequency of accesses to the queue is proportional to the number of workers sharing it and inversely proportional to the average task size. For small task sizes and large number of workers this method is likely to suffer from contention.

The third plot shows the Speed-Up × Non-work Overhead product normalized by the number of cores. In the ideal case this metric should yield 1.0. A value below 1.0 indicates work time inflation. The plots show that work time inflation is an important issue, contributing a further performance reduction of up to 20% in the worst case (FMM with TBB). Since kernels never block, WTI can only be attributed to destructive resource sharing. This effect is mainly observed as an increased number of cache misses or increased memory access latencies. Two factors can cause this: 1) When the memory subsystem or system interconnect is overloaded, average memory access latency increases. In addition, runtime bottlenecks -such as excessive contention on a global lock- can steal bus cycles from the memory subsystem which further contribute to increase latencies. 2) A change in the work time can also be caused by data locality variations. Different kernel schedules, for example, impact temporal locality and cache misses.

Measuring how much of work time inflation is caused by the runtime and how much is due to locality is difficult because of the small kernel sizes and because of the high overheads of accessing hardware performance monitors in Linux kernels [23]. To identify work time inflation due to temporal locality issues we look for a scenario with minimal non-work overheads. For the case of MATMUL, MassiveThreads and TBB have overheads around or below  $1.1\times$  until 12 cores (2 sockets). Figure 1 (c) shows a work time difference of about 2% between





Fig. 2. Case Study: Scalability and Overheads of the Fast Multipole Method

MTH and TBB at 12 cores that must be related to different task orders. At 24 cores this difference is around 8%. The KRD metric defined in the next section can provide additional insight regarding the origin of additional observed cache misses.

## III. KERNEL REUSE DISTANCE

To characterize the effects of task ordering on temporal locality we start with the reuse distance metric [14]. The reuse distance has traditionally been used as a measure of cache performance [24]. It processes traces of memory accesses and counts the number of unique addresses between two accesses to the same element. This count is also called the *stack distance*.

When analyzing task-parallel applications it is important to minimize perturbation to the runtime task schedule. Heavyweight instrumentation to generate address traces may impact the execution and result in a parallel schedule that is not representative. To reduce overheads we extend the method to collect data accesses only in bulk at kernel execution times. For each data structure that is an input or output to a kernel, an identifier (usually its base address), a timestamp, and its size in bytes are recorded. We rely on manual instrumentation to perform these actions.

A trace of data accesses is recorded separately for each core. To analyze the reuse on a per-node<sup>1</sup> basis we process a merged trace containing all the kernel inputs and outputs accessed by the cores sharing the same last level cache. The trace is synchronized using the timestamps. Using this merged trace, the stack distances are computed and the histogram is generated. When a system contains multiple nodes, we summarize the contribution of each by generating per-node histograms and then reporting their summation.

Altogether, this set of modifications on top of the reuse distance is called the Kernel Reuse Distance metric (KRD). KRD is is a low-overhead and architecture independent method that provides an intuitive measure of data reuse. Its correlation to hardware metrics such as cache misses and performance is analyzed later. Figure 3 shows a diagram explaining the methodology in a single socket environment with two cores. Two workers are running, one on each core, and generating a series of kernel accesses. To analyze the last level cache and memory access, the traces are merged and the reuse histogram is generated. The histogram shows the ratio of data reuses that occur within a certain data window, shown on the x-axis. All elements have a first access. This event is included in the last data point labeled as INF (infinity). In the multiple nodes scenario, work steal activity across nodes introduces such cold accesses. By looking at the number of accesses that contribute to the INF category, one can observe the effects of inter-node work steals.

For visualization purposes, we subdivide the histogram into close, near, and far reuses. This choice is arbitrary but will help us later in describing the plots. As a rule of thumb, we use close reuses for those that fall within L2 cache size, near reuses for those within last level cache (LLC) size, and far reuses for those beyond the size of the LLC.

## A. Implementation Details

We implemented KRD as a set of tools that can compute the histograms from traces generated by our own low overhead profiling and tracing facility called LoI (low-overhead instrumentation). LoI is designed to analyze task-parallel applications with fine grained kernels. LoI attempts to be as lightweight as

<sup>&</sup>lt;sup>1</sup>in this work we use *node* as shorthand for *NUMA node* 



Fig. 3. Generation of the KRD metric for a single socket with two cores

possible in order to not influence the task parallel schedule. The library associates timestamps to events, and either aggregates execution times for individual kernels or generates timestamped traces. Timestamps are obtained by using the x86 TSC facility [25]. For both applications the tracing facility increases execution time less than 5% in the worst case.

#### IV. EXPERIMENTAL EVALUATION

This section describes two experiments. We begin by generating KRD profiles for MATMUL and FMM to display how reuse changes with the runtime scheduler. Next we analyze the correlation between the KRD metric and hardware performance counters.

## A. KRD correlation with runtime schedulers

Figures 4 and 5 show the KRD plots for the two benchmarks using the four tested runtime schedulers on three hardware configurations: single core, one socket and 4 sockets.

The *single core* histograms show that in the absence of work steals, different schedulers have little impact on the temporal reuse of recursive divide and conquer task-parallel codes. In fact, for MATMUL, the KRDs of both work-first and help-first policies are identical. This is not surprising as the recursive bisecting of the matrices and corresponding task generation are symmetric. For FMM the decomposition is not completely symmetric because of a property of the algorithm which allows to discard one of the branches based on a condition (mutual interactions). However, differences between schedulers are still barely noticeable.

Differences start to emerge when one socket is occupied, particularly on MATMUL. QThread/Socket stands out, having the highest reuse ratio at almost all distances. This good performance results from QThread/Socket's usage of a global LIFO queue shared by all the workers. In this design, workers tend to execute tasks that have been recently generated by other workers. Since programs are commonly optimized for data reuse on the serial path, this policy improves cache sharing [22]. TBB also shows improved reuse compared to MTH and QTH/Core when executing MATMUL. As with QThread/Socket we attribute this to TBB's locality-aware design [18], [19]. MTH and QTH/Core, on the other hand, implement just a fully distributed random work stealer. It has the worst reuse performance, but offers the advantage of simplicity. The differences between schedulers are considerably smaller

| Runtime    | Exec. Time | LLC Misses            | Kernel Time & Inflation |
|------------|------------|-----------------------|-------------------------|
| MTH        | 1.642 sec  | $1.829 \times 10^{6}$ | 17441ns (1.0250×)       |
| TBB        | 1.742 sec  | $2.807 \times 10^{6}$ | 17898ns (1.0519×)       |
| QTH/Core   | 1.859 sec  | $2.339 \times 10^{6}$ | 17767ns (1.0441×)       |
| QTH/Socket | 2.111 sec  | $1.987 \times 10^{6}$ | 18401ns (1.0814×)       |

TABLE I. HARDWARE METRICS AND WTI FOR 2-SOCKET SCENARIO

in the case of FMM. QTH/Socket is still better for close reuses, but the difference is only 3% at most. Other schedulers show almost no differences. The similarity between histograms is likely a result of FMM's tree traversal algorithm, which conditionally executes two kernels that operate on independent data. Furthermore, the non-homogeneous input (*plummer* distribution) generates an irregular kernel pattern that is hard for schedulers to optimize.

This histograms for the 4-socket scenario are similar to the 1-socket case. QTH/Socket again shows the best reuse performance, but this time it is closely followed by MTH for far reuses. Surprisingly, in this multi-socket scenario, TBB has the worst reuse performance for far reuses, trailing the other schedulers at a noticeable distance. Compared to the singlesocket plots, one important fact revealed by the four-socket KRD histograms is the larger amount of cold accesses. This is expected, as separate sockets have disjoint caches and will thus need to be warmed up separately. KRD can be used to measure how many first time accesses occurred, which indirectly correlates to the size of the working set observed at each socket. QThread/Socket shows lowest ratio of cold accesses while TBB shows the highest amount. A larger number of cold accesses means that the scheduler is distributing tasks that share the same working set across different nodes. TBB implements several restrictions in its scheduling algorithm that limit which tasks can be stolen and disallows the migration of tasks that have already started [18]. These limitations might be forcing TBB into a suboptimal work partitioning.

The fact that the KRD histograms can be correlated with different schedulers is an encouraging result. Next we address the question whether these plots can be correlated with actual performance.

#### B. KRD correlation with Hardware Metrics

In the second experiment, we attempt to correlate the results of the KRD metric with last level cache misses and work time increase. To do so we select a scenario with low runtime overheads to minimize possible perturbation. For MATMUL using 2 sockets (12 cores), MTH and TBB present non-work inflation of about  $1.1\times$ , while the QTH/Core overhead is about  $1.2\times$ . The KRD plot of far reuses (i.e. beyond 18MB) for this configuration is reported in Figure 6. Table I reports hardware performance counters and time measurements collected as averages of five runs. The kernel times are average over all kernel executions ( $\sim 1\times 10^6$ ) and have been collected by reading the x86 timestamp counter at each kernel call (RDTSC). The LLC misses column reports the per-core LAST\_LEVEL\_CACHE\_MISSES metric from Intel's Architectural PerfMon [25], as reported by PAPI [26].

We first compare MTH and TBB, which have similar overheads. The KRD plot in Figure 6 shows that for all distances beyond 16 MB, MTH has a higher percentage of reuses than TBB. The LLC size of the featured Westmere-EX chip is 18 MB, which makes it worth to analyze of the data point at 32 MB. For MTH, 3.57% of the kernel loads access data with a



(b) 1 socket / 6 cores

Fig. 5. Kernel Reuse Distance plots for Fast Multipole Method

(a) 1 socket / 1 core



Fig. 6. Far reuses for MATMUL in the 2-socket, 12 core scenario

reuse distance beyond 32 MB, while for TBB the number of far reuses is 4.5%. This 25% difference correlates to a 53% increase in LLC misses and to a work time inflation of 2.7% compared to MassiveThreads.

For QTH/Core, the KRD plots show that it has higher number of far reuses than MTH but less than TBB for distances of more than 32MB. The number of LLC misses and the work time inflation are between those of TBB and MTH. This order is also clearly observed at high distances (e.g. 128MB) and also cold misses. It suggests that these data points might be good

indicators for cache misses and WTI.

The KRD plot also shows that QTH/Socket has overall the smallest amount of far reuses (3.15% at 32MB). However, its number of cache misses is higher than MassiveThreads, and its work time inflation is the highest of all four schedulers. QTH/Socket has comparatively high overheads (1.33×). A closer analysis using perf record revealed that the MAT-MUL benchmark spends about 25% in two Qthread functions (qt\_scheduler\_get\_thread and qt\_hash\_lock), both of which include memory bus locking activity. Bus locking increases memory access latencies, and is a probable explanation for the observed work time inflation.

(c) 4 sockets / 24 cores

The 2.7% difference between MTH and TBB may seem very small, but is also expected since the studied algorithm (MATMUL) is not particularly memory intensive. At 4 sockets TBB has about 10% higher work time inflation compared to MTH. In the case of FMM, the relative inflation reaches 45% for the memory bound M2L kernel on 4 sockets. It shows that depending on the algorithm WTI can become an important problem.

## V. DISCUSSION

Although KRD shows correlation with work inflation and cache misses, it should be used mainly as an intuitive model. The KRD metric contains many simplifications that are the result of the constraints set by our original goal: to qualitatively measure temporal reuse in task-parallel programs. The requirement of minimal overhead is an important consideration which enables only a coarse-grained, manually-instrumented tracking of data accesses. The model does not consider other accesses such as stack accesses, based on the assumption that kernel (heap) data accesses dominate cache performance.

KRD does also not attempt to measure spatial locality among individual accesses. Our original goal was to analyze the effects of different schedulers on data reuse. Different schedules might, however, benefit more or less from prefetchers. If such an effect is large, then extending KRD with a metric to quantify spatial locality [27] might be a worthy addition.

One limitation of the current model is that it does not provide enough information to model the effects of cache coherence protocols [28], [29]. When one core writes a data structure allocated in the last level cache of a different socket, this will conceptually result in a cache-to-cache *transfer*. The KRD metric currently uses only the notion of intra-socket data accesses. It can report increases in cold misses due to work stealing operations, but it cannot model misses due to cache line invalidations. As part of our future work we plan to extend KRD by classifying accesses into reads and writes. This will allow a simple modeling of the effects of cache coherence.

Finally we would like to note that, while the KRD model has been developed with task-parallel runtimes in mind, it is actually quite generic as it does not instrument tasks, but the kernels inside tasks. This allow it to be applied to study any kind of shared memory parallel framework.

## VI. CONCLUSIONS

In this work we have attempted to provide some insight on the impact of task-parallel schedulers on temporal locality and its effect on performance. We developed a coarse-grained version of the reuse distance metric to study reuse in task parallel executions. Based on our analysis of two benchmarks and four runtime schedulers we observed that schedulers can have considerable impact on the reuse distance, and that the reuse quality depends considerably on the system configuration. Furthermore we observed correlation between the KRD metric and hardware metrics such as last level cache misses and average kernel execution time. However, we also observed that runtime contention can be dominant in high core count scenarios, thus minimizing overheads should take precedence over locality optimizations.

#### ACKNOWLEDGMENT

This work has been supported by a JSPS postdoctoral fellowship (P-12044). We would like to thank the anonymous reviewers for their valuable feedback.

## REFERENCES

- [1] OpenMP Specification, OpenMP ARB Std. 4.0, Jul. 2013. [Online]. Available: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
- [2] Intel Corporation. Threading building blocks. [Online]. Available: https://www.threadingbuildingblocks.org/
- [3] MIT CSAIL Supertech Research Group. The cilk project. [Online]. Available: http://supertech.csail.mit.edu/cilk/
- [4] M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," in *Proceedings of SIGPLAN 1998*, Jun 1998
- [5] E. Mohr, D. A. Kranz, and R. H. Halstead, "Lazy Task Creation: A technique for Increasing the Granularity of Parallel Programs," *IEEE Transactions on Parallel and Distributed Systems*, vol. 2, no. 3, Jul. 1991.
- [6] S. L. Olivier, B. R. de Supinski, M. Schulz, and J. F. Prins, "Characterizing and Mitigating Work Time Inflation in Task Parallel Programs," in *Proceedings of SC12*, Nov. 2012.
- [7] N. R. Tallent and J. M. Mellor-Crummey, "Effective Performance Measurement and Analysis of Multithreaded Applications," in *Proceedings of PPoPP'09*, Feb. 2009.
- [8] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M. S. Müller, and W. E. Nagel, *The Vampir Performance Analysis Tool-Set*. Springer Berlin Heidelberg, 2008, pp. 139–155.

- [9] Extrae User Guide Manual, Barcelona Supercomputing Center, May 2013.
- [10] Vi-HPS, "SCORE-P User Manual," Tech. Rep. 4859, 2013.
- [11] C. McCurdy and J. Vetter, "Memphis: Finding and Fixing NUMArelated Performance Problems on Multi-core Platforms," in *Proceedings* of ISPASS 2010, Mar. 2010.
- [12] X. Liu and J. Mellor-Crummey, "Pinpointing Data Locality Problems Using Data-centric Analysis," in *Proceedings of CGO'11*, Apr. 2011.
- [13] Intel Corporation. Intel VTune Amplifier XE 2013. [Online]. Available: http://software.intel.com/en-us/intel-vtune-amplifier-xe
- [14] R. Mattson, J. Gecsei, D. Slutz, and I. Traiger, "Evaluation techniques for storage hierarchies," *IBM Systems Journal*, vol. 9, no. 2, pp. 78–117, 1970.
- [15] Rio Yokota. exafmm-dev. [Online]. Available: https://bitbucket.org/ rioyokota/exafmm-dev
- [16] Massivethreads: A lightweight thread library for high productivity languages. [Online]. Available: http://code.google.com/p/massivethreads/
- [17] J. Nakashima, S. Nakatani, and K. Taura, "Design and implementation of a customizable work stealing scheduler," in *Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Super-computers*, ser. ROSS '13, 2013, pp. 9:1–9:8.
- [18] Intel Corporation. TBB: Scheduling algorithm. [Online]. Available: http://www.threadingbuildingblocks.org/docs/help/reference/task\_scheduler/scheduling\_algorithm.htm
- [19] U. A. Acar, G. E. Blelloch, and R. D. Blumofe, "The Data Locality of Work Stealing," in *Proceedings of SPAA'00*, 2000.
- [20] The qthread library. [Online]. Available: http://www.cs.sandia.gov/qthreads/
- [21] K. Wheeler, R. Murphy, and D. Thain, "Qthreads: An api for programming with millions of lightweight threads," in *IEEE International Symposium on Parallel and Distributed Processing*, 2008, pp. 1–8.
- [22] S. L. Olivier, A. K. Porterfield, K. B. Wheeler, and J. F. Prins, "Scheduling Task Parallelism on Multi-Socket Multicore Systems," in *Proceedings of ROSS'11*, 2011, pp. 49–56.
- [23] V. M.Weaver, "Linux perf\_event Features and Overhead," in *Proceedings of the 2013 FastPath Workshop*, 2013.
- [24] K. Beyls and E. H. D'Hollander, "Reuse distance as a metric for cache behavior," in in Proceedings of the iasted conference on parallel and distributed computing and systems, 2001, pp. 617–662.
- [25] Intel Corporation. Intel 64 and ia-32 architectures software developer's manual volume 3b:. [Online]. Available: http://www.intel.com/content/ www/us/en/processors/architectures-software-developer-manuals.html
- [26] PAPI Team. Performance application programming interface. [Online]. Available: http://icl.cs.utk.edu/papi/
- [27] J. Weinberg, M. O. McCracken, E. Strohmaier, and A. Snavely, "Quantifying Locality In The Memory Access Patterns of HPC Applications," in *Proceedings of the 2005 ACM/IEEE conference on Supercomputing*, Nov. 2005.
- [28] Intel Corporation, An Introduction to the Intel QuickPath Interconnect, 2009.
- [29] D. Hackenberg, D. Molka, and W. E. Nagel, "Comparing Cache Architectures and Coherency Protocols on x86-64 Multicore SMP Systems," in *Proceedings of MICRO09*, Dec. 2009.