<table>
<thead>
<tr>
<th><strong>Title</strong></th>
<th>Cache affinity optimization techniques for scaling software transactional memory systems on multi-CMP architectures</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Author(s)</strong></td>
<td>Chan, K; Lam, KT; Wang, CL</td>
</tr>
<tr>
<td><strong>Citation</strong></td>
<td>The 14th IEEE International Symposium on Parallel and Distributed Computing (ISPDC 2015), Limassol, Cyprus, 29 June-1 July 2015. In Conference Proceedings, 2015, p. 56-65</td>
</tr>
<tr>
<td><strong>Issued Date</strong></td>
<td>2015</td>
</tr>
<tr>
<td><strong>URL</strong></td>
<td><a href="http://hdl.handle.net/10722/219217">http://hdl.handle.net/10722/219217</a></td>
</tr>
<tr>
<td><strong>Rights</strong></td>
<td>International Symposium on Parallel and Distributed Computing (ISPDC). Copyright © IEEE.; ©2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.; This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.</td>
</tr>
</tbody>
</table>
Cache Affinity Optimization Techniques for Scaling Software Transactional Memory Systems on Multi-CMP Architectures

Kinison Chan  
The University of Hong Kong  
Hong Kong, China  
Email: kchan@cs.hku.hk

King Tin Lam  
The University of Hong Kong  
Hong Kong, China  
Email: ktlam@cs.hku.hk

Cho-Li Wang  
The University of Hong Kong  
Hong Kong, China  
Email: clwang@cs.hku.hk

Abstract—Software transactional memory (STM) enhances both ease-of-use and concurrency, and is considered one of the next-generation paradigms for parallel programming. Application programs may see hotspots where data conflicts are intensive and seriously degrade the performance. So advanced STM systems employ dynamic concurrency control techniques to curb the conflict rate through properly throttling the rate of spawning transactions. High-end computers may have two or more multicore processors so that data sharing among cores goes through a non-uniform cache memory hierarchy. This poses challenges to concurrency control designs as improper metadata placement and sharing will introduce scalability issues to the system. Poor thread-to-core mappings that induce excessive cache invalidation are also detrimental to the overall performance. In this paper, we share our experience in designing and implementing a new dynamic concurrency controller for TinySTM, which helps keeping the system concurrency at a near-optimal level. By decoupling unfavorable metadata sharing, our controller design avoids costly inter-processor communications. It also features an affinity-aware thread migration technique that fine-tunes thread placements by observing inter-thread transactional conflicts. We evaluate our implementation using the STAMP benchmark suite and show that the controller can bring around 21% average speedup over the baseline execution.

Keywords—Software transactional memory, concurrency control, thread migration, multicore processors, cache affinity

I. INTRODUCTION

During the last decade, we have witnessed the dominance of multicore processors (a.k.a. chip multiprocessors, CMPs) and their success in fueling the next computing performance leap beyond the ultimate limit of CPU clock rate scaling. A multi-level cache is one of the most important resources of a CMP. Some CMP designs prefer keeping the last-level cache private to each core for simplicity, while other architectures expose it for sharing among different cores for faster communication and improved resource utilization. To unleash extreme parallel performance, it is common for high-end machines to have two or more CMPs interconnected through high-speed links such as Intel QuickPath Interconnect and AMD HyperTransport. Such multi-CMP systems present both optimization chances and challenges to system architects in regard to data sharing between threads along the hierarchy of private, shared and remote caches.

Considering the significant gap between intra- and inter-CMP access latencies, incorporating cache affinity techniques into software systems design is an inexorable step to harness the full power of multi-CMP architectures. Apart from this, a longstanding barrier to fully unlocking such power is rooted in traditional lock-based synchronization that creates serialized sections of execution limiting the available parallelism. This calls for a radical change in the parallel programming model. Software transactional memory (STM), being the most heavily purported programming paradigm to replace locks for multi-core systems, is undoubtedly a kind of system software that needs to apply the cache affinity principles. Access conflict in a transaction is a good indicator of data sharing between a pair of threads. By collecting sufficient contention statistics, the STM may depict the inter-thread sharing patterns, and can make use of the CPU affinity mask to reconfigure thread-to-core mappings and pin down favorable ones. Both the application code and the STM system activities such as conflict resolution will see enhanced cache locality when shared data stay in high-speed caches most of the time. Besides application data, the STM itself has extra internal data structures for keeping metadata. When implementing an STM, care must be taken to avoid penalizing the overall system execution speed by adding code that entails intensive metadata updates going through inter-CMP links.

The STM paradigm has many potential merits like much higher concurrency than locks. But for high-contention applications, STMs can perform even worse than locks due to too many access conflicts, rollbacks and retries. For high concurrency and high commit throughput to coexist at runtime, we cling to the notion of supervised concurrency level. With adaptive thread scheduling mechanisms, the STM runtime tunes the level of concurrency dynamically. The scheduler may suspend some active threads in place to let others proceed their commit procedures smoothly, thereby minimizing the chance of access conflicts when they are to grow in number. On the other hand, if the scheduler observes that the present execution seldom encounters conflicts, it may raise the concurrency level on the fly to approach the full application parallelism.

There have been several research efforts on adaptive concurrency control [1], [2], [3], [4] but we observe two areas...
falling short thus far. First, the existing feedback-controlled concurrency tuning mechanisms and policies are rather simple. They could be misled by the observed commit ratios, resulting in overly strict concurrency control—the worst case could be a slowdown of nearly 30%. Second, they were not carefully designed with respect to the multicore cache hierarchy, especially on a multi-CMP system.

In this paper, we demonstrate an advanced word-based STM combining the concepts of supervised concurrency and cache affinity to boost the overall system performance. Based on our previous work, we develop an enhanced version of the Probe concurrency control protocol [5] for tuning the runtime concurrency with low overheads. In terms of novel technical contributions, we propose two cache-affinity techniques, namely metadata zoning and affinity-driven thread migration, to enhance the performance of the Probe-controlled system. Metadata zoning aims to confine metadata sharing to within a CMP rather than going through the inter-CMP paths. Thread migration aims to reconfigure thread-to-core mappings to improve cache affinity. We implement the methodology into the TinySTM library and evaluate its performance. Experimental results show that the enhanced system can obtain performance improvement of about 100% (best-case) and 21% (average over all benchmarks) compared with the baseline execution without concurrency control and also outstrip previous work like Shrink [1] as well as Yoo and Lee’s ATS [3] significantly.

For the rest of the paper, we will give the background information on modern cache hierarchy, STM basics and the relationship between commit throughput and active thread count in Section II. Section III presents the Probe concurrency control enhanced with the metadata zoning technique. Section IV discusses the affinity-driven thread migration algorithm. We explain our system implementation in Section V and evaluate the performance in Section VI. Related work is reviewed in Section VII. Finally, Section VIII concludes this paper.

II. BACKGROUND

In this section we discuss multicore (CMP) processors and software transactional memory systems. We highlight the difficulties in having a precise concurrency control in STM, especially on multi-CMP systems.

A. Modern Computer Cache Hierarchy

Most modern computers employ a multicore processor (a.k.a. chip multiprocessor, CMP). A multicore processor is a single chip containing two or more cores, each of which can run at least one hardware thread. Threads within the same core share the L1 cache while threads on different cores of the same CMP share L2 or L3 caches.

To pursue high performance, high-end computers can have multiple multicore processors, complicating the cache sharing performance. Threads on different processors do not share any common cache. Fig. 1 depicts the organization of a common two-way multicore multiprocessor (or multi-CMP) computer. In the example, we see that thread 1 and thread 3 share the same L3 cache for fast data sharing, whereas thread 3 and thread 5 share data through the narrower and longer inter-processor communication channel.

To characterize intra-processor and inter-processor communication speeds, we run ping-pong tests on several systems equipped with common multicore processors. Table I shows the results (the last two columns show the number of round-trip transfers of a cache line between two cores). We notice that the inter-CMP communication cost is up to a factor of 4 higher than the intra-CMP one.

B. Software Transactional Memory

Software transactional memory is an optimistic approach of parallel programming. Programmers mark critical sections as transactions and the STM per se will ensure that each of the transactions is executed atomically. STM speculatively runs the transactions in parallel and checks for conflicts (i.e. violation to the atomic property). Some transactions involved in a conflict will be aborted (with their updates rolled back) during conflict resolution. When a transaction reaches to the end without encountering any conflict, it commits and makes the updates visible to upcoming transactions. STMs provide interfaces \texttt{stm\_begin()} and \texttt{stm\_commit()} for purpose of starting and finishing a transaction.\footnote{These functions are common among most STMs but their names may vary.} When a conflict is detected, the system automatically aborts at least one of the transactions by calling \texttt{stm\_abort()}, through which each of the victim threads is reset to where \texttt{stm\_begin()} was called and all its speculative updates are discarded.

STMs can be classified in a few dimensions. We can classify the systems according to progress guarantee and the granularity of data sharing. There are several common progress guarantees—lock-free, obstruction-free [6] and blocking. Lock-free systems guarantee that at least one of the transactions can proceed when multiple transactions run into conflicts. Obstruction-free systems only guarantee that a transaction can be eventually successful at repeated retries. While there exists an obstruction-free word-based STM contributed by Marathe et al. [7], lock- or obstruction-freedom makes

\begin{table}[h]
\centering
\caption{Ping-Pong Rates of Different Intel Multicore Processors}
\begin{tabular}{|c|c|c|c|}
\hline
Processor & Clockspeed & Intra-Pro & Inter-Pro \\
\hline
Core 2 Quad Q6600 & 2.40 GHz & $1.6 \times 10^7$ & $3.9 \times 10^6$ \\
Xeon E5540 & 2.53 GHz & $1.3 \times 10^7$ & $5.4 \times 10^6$ \\
Xeon E5550 & 2.66 GHz & $1.5 \times 10^7$ & $6.3 \times 10^6$ \\
Xeon X5670 & 2.93 GHz & $1.4 \times 10^7$ & $6.2 \times 10^6$ \\
\hline
\end{tabular}
\end{table}

\footnote{Although Q6600 comes in a single quad-core package, it features two sets of shared L2 cache, each to be shared by two cores only. By our definition in Section I, we treat it as two processors.}

\footnote{State-of-the-art CPUs leverage chip multithreading technology to enable two or more threads to run on each core.}
efficient implementation typically difficult. Inspired by Ennals’ work [8], blocking (or lock-based) STMs are recently under the research spotlight as they can simplify the design and provide higher performance. As a result, most of the latest word-based STM implementations such as TL2 [9], TinySTM [10] and SwissTM [11] are blocking.

Blocking STMs do not have progress guarantee. Doomed transactions (transactions that will eventually roll back) may hold ownership of some data items, obstructing the progress of other transactions. In these systems, it becomes more important to control the concurrency since excessive threading is just wasting processor time and hurting system performance.

C. Dynamic Concurrency Control for STM

Transactional memory helps tapping into the maximal parallelism of a program while maintaining the ease of programming similar to coarse-grained locking. Performance of software transactional memory can be quantitatively represented by the commit rate, which can be further identified as a product of the transaction attempt rate and commit ratio (committed transactions divided by attempted transactions).

Transaction attempt rate is dependent on various conditions. First of all, it is related to the number of active threads. When there are more threads, there are more concurrent transaction attempts. It is also related to the commit ratio—there are more retry attempts when there are more rollbacks. The attempt rate is affected by the transaction length as well. Longer transactions lead to fewer attempts in a unit time as each of them takes longer time to finish. Commit ratio also depends on several system conditions, including the number of active threads and the application nature. Commit ratio is lower when there are more threads because it is more likely to have collisions. Some applications are more sensitive to the number of active threads than others as the threads have more access to common data locations. We notice that both of the two factors are related to the number of threads. When there are more threads, there are more transaction attempts but with less success to commit (lower commit ratio).

We need an effective adaptive concurrency control (ACC) algorithm adjusting the number of concurrent threads based on observing the rate of transactional conflicts in order to maximize the throughput (commit rate) of the STM system. There are two challenges in designing such a control algorithm. First of all, it is non-trivial to determine whether a commit rate is regarded high or low. Second, it is tricky to obtain the live commit rate at runtime. It involves collecting total amount of commits, which forces the threads to share data across the processors.

### TABLE II. CONCURRENCE CONTROL VARIABLES

<table>
<thead>
<tr>
<th>variable</th>
<th>type</th>
<th>usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>quota</td>
<td>integer</td>
<td>concurrency quota—number of threads allowed to run transactions</td>
</tr>
<tr>
<td>active</td>
<td>integer</td>
<td>number of threads that have entered into transactions and not yet exited</td>
</tr>
<tr>
<td>peak</td>
<td>integer</td>
<td>the peak number of threads that runs transactions simultaneously</td>
</tr>
<tr>
<td>commits</td>
<td>integer</td>
<td>number of committed transactions</td>
</tr>
<tr>
<td>aborts</td>
<td>integer</td>
<td>number of aborted transactions</td>
</tr>
<tr>
<td>stalled</td>
<td>boolean</td>
<td>indicates whether some threads are stalled at transactions’ entrance</td>
</tr>
</tbody>
</table>

### TABLE III. CONCURRENCE QUOTA MECHANISM

```plaintext
1 function onBegin
2  retry:
3    if active ≥ quota then
4      stalled ← true
5      yield
6    goto retry
7  end if
8  if active ← active + 1
9    if peak < active then
10      peak ← active
11    end if
12  end
13
14 function onCommit
15  active ← active - 1
16  commits ← commits + 1
17  end
18
19 function onAbort
20  active ← active - 1
21  aborts ← aborts + 1
22  yield
23 end
```

(Section III-A), and then the control policy named Probe (Section III-B).

A. Concurrency Quota Control Mechanism

To facilitate concurrency control, we introduce several variables listed in Table II. These variables are shared among all the transactions in the system.

Table III shows the basic mechanism to operate on the concurrency quota concept. There is a daemon thread running in the background to control a system-wide quota parameter which limits the number of active threads to run on the STM. When a thread begins a new transaction (i.e. calling `stm_begin()` function), it has to check if the current number of active threads has reached the current quota allowed (line 3). If this is the case, it has to wait in place until either some active threads get exited transactions (either committed or aborted) or the current quota is lifted on demand by the background daemon. We apply an extra heuristic to the mechanism: we record the active thread count that once happens to be the highest into the variable `peak` (line 10). If the daemon finds that the current quota is even higher than this peak value, it will set `quota` to `peak`. This measure is to avoid `quota` being incremented beyond the highest concurrency requirement. Superfluous quota will simply oppose the virtue of putting a bound on concurrent thread count.

B. The Probe Concurrency Control Policy

Table IV shows the algorithm of Probe. The name implies the algorithm is probing for a concurrency quota that optimizes commits by some “trial and error” method. In theory, as
discussed in Section II-C, there exists an optimal active thread count that corresponds to the crest of the commit rate curve which looks bell-shaped as shown in Fig. 2. Note that in this policy, we use commit rate, i.e. the number of committed transactions per unit time, rather than commit ratio as other researchers like Ansari et al. [12] do. We quantify the unit time logically by a counter called laps which is being incremented per sampling event. To avoid reacting too fast, the policy proactively postpone quota updates until commits + aborts ≥ warmup, where warmup is a counter we add to record the number of daemon sleeping cycles that have passed until the system is “warmed up”. A warmup phase is necessary to accumulate sufficient statistics upon which the ACC policy can be based in order to make correct decisions. Therefore, commit rate can be calculated by dividing commits by laps. We also keep record of the previous sample in variables last_commits and last_laps.

In reality, the optimal point for the current active thread count is floating—it may shift horizontally along the execution. We could make the system stay close to the sweet spot continuously by a probing technique as follows. If we use fewer threads but observe a drop of commit rate (line 19), we know we are falling down the hill of commit rate and getting further away from the optimal. Therefore we should reverse the tuning direction from down to up, and keep incrementing the quota until we start to see another drop of commit rate, which is a signal for us to reverse the tuning direction from up to down.

C. Metadata Zoning

We divide the compute cores into zones where cores in a zone belong to the same processor. As we have shown in Section II-A, data sharing is more efficient among cores in a zone, than cores from different zones. On each zone there is a separate set of concurrency control metadata. These metadata are not involved in conflict detection so this change does not affect the STM correctness. Each thread is set to belong to a zone and only access metadata related to the zone. This ensures the metadata remains exclusively shared among the cores in the same processor. The intra-processor data sharing allows the metadata to be read and updated more frequently, allowing higher amount of transactions to be processed within same amount of time. We also create an independent concurrency control daemons on each zone. Just in case transactions on different zones behave differently, the different daemons can react independently.


IV. AFFINITY-AWARE THREAD MIGRATION

In this section, we discuss a thread migration technique made to enhance the STM system for multi-CMP systems.

The aim of having thread migration among multiple processors is to better utilize the partially shared cache. The compute cores do not have access to all the cache units so that there are often some cache lines duplicated in different units, which reduce the effective cache size. There are also cache invalidation overheads when two threads on different processors modify the same data, stalling the thread for some considerable delays explained in Section II-A. By finding out the thread correlations, we can identify some threads having access to some common data. They can be rearranged to execute on cores of the same processor, reducing cross-die cache invalidation overheads and bringing about an overall speed improvement.

While runtime thread correlation is hard to track in a lightweight manner as our past experience tells [13], we can estimate the correlation by counting the conflicts between individual threads as two transactions conflict only when they access common data. Even if there can be a false conflict, they access some common pieces of STM metadata.

We compute and store the likelihood of threads to conflict in a 2D table of pairwise contention intensity. The original definition of contention intensity, proposed by Yoo and Lee [3], is a single weighted average value of a thread’s conflict likelihood across time and ranges from 0 (never conflict) to 1 (always conflict). In our case, we borrow the concept for estimating inter-thread conflict probability. There are \( n^2 \) values for \( n \) threads and each thread \( T_i \) possess an array of \( n \) numbers, \( \{ C_{ij} \} \), which represent how likely it is to be obstructed by thread \( T_j \). While we could save half of storage by using a triangular matrix, we store \( C_{ij} \) and \( C_{ji} \) separately to eliminate inter-thread synchronization.
and let each thread have exclusive write access to its array. The 2D array can be taken as an directed adjacency matrix which represents the edge presences and weights. Our aim is to reduce the sum of edges that span across different CMPs. Upon successful partitioning, we have fewer closely related threads running on different CMPs. Graph partitioning problem is NP-complete and less expensive heuristics are required. We derive a solution based on the work of Kernighan and Lin [14]. The heuristic accepts a graph partitioned in two sets, or pause execution of most threads until it makes a decision. The heuristic only sets processor affinities of particular threads and it does not interfere with the conflict detection mechanism, so the correctness of execution is not affected.

Although we now schedule the threads, we observe the overall cross-boundary edge value is substantially reduced from 0.8 to 0.1.

Although the pickPair process is having O(n^2) time complexity, we do not regard it as a serious burden because the procedure is only executed once at a time and it does not delay or pause execution of most threads until it makes a decision.

The migration policy is currently designed for STMs running on dual-processor systems, which are commonly used as workstations and general-purpose servers. When there are more than two processors, we can run pickPair in multiple iterations to swap threads around different processors, or apply multilevel partitioning. However, their implementation is out of the scope of this paper.

### V. IMPLEMENTATION

We implement all the above-mentioned techniques in TinySTM 0.9.5 [10], one of the state-of-the-art word-based

---

4To facilitate fast commit in the implementation, we adopt a lazy update mechanism which updates one variable only.
STM framework [15]. It is encouraging to see that Probe-Z and
Shrink and Yoo, can gain over the plain execution
without concurrency control. The first two policies (with
a suffix Z) are our zone-based versions of the Probe policy, i.e.
they have a separate set of metadata and control daemon on
each processor. Probe-ZM implements also the thread migra-
tion policy besides zoning. For comparison purpose, we also
include the original Probe, which has only one system-wide
control daemon and no partitioning of threads into zones. For
ease of reference, we use the term ACC (adaptive concurrency
control) in later context to collectively refer to any of these
policies.

While the system supports only 16 simultaneous threads,
we intentionally have a case of running 32 threads to see how
the adaptive concurrency control protocols handle the adverse
condition of excessive threading. Some applications show high
discrepancies in execution time across runs. We handled this
by running each of the tests five times and taking the average.

B. Experimental Results

Fig. 4 shows speedup charts of different TM benchmarks
in the STAMP suite, with and without ACC. Each separate
chart corresponds to one application. (There are two test cases
for each of kmeans and vacation: low (1) and high (2) in
terms of inherent contention). The x-axis represents the static
initial thread count (ITC) spawned by the benchmark. Except
“original” (i.e., the baseline execution with ACC disabled), the
actual number of active threads is under the influence of the
ACC and changes from time to time.

TinySTM generally shows increasing speedup when ITC
is between 2 to 8. Speedup drops when ITC reaches 16 or 32,
depending on application nature. STAMP is a transactional
benchmark suite with applications of different natures. There
are several relatively non-scalable benchmarks in the STAMP
suite—kmeans, ssca and yada, as well as some scalable
benchmarks such as genome, labyrinth and vacation. We
observe irregular speedup graphs in bayes.

With the ACCs enabled, the speedup drops less rapidly
when there are excessive threads running. However, there are
also cases where ACCs result in degraded STM performance
for a number of benchmarks, when there are moderate amount
of initial threads.

Table VI shows the best-case, worst-case and average gain
in speedup of various ACCs, compared to the baseline execu-
tion without ACCs. We have skipped bayes from comparison
because its execution time depends on the order of graph
nodes being processed, instead of true performance of the
STM framework [15]. It is encouraging to see that Probe-Z and
Fig. 4. Speedup of STAMP applications under different concurrency control policies.
Probe-ZM outperform other ACCs in most of the test cases, and obtain 21% and 20% average speedup respectively.

### C. Discussion on the Results

Our policies generally perform the best on vacation, which is a travel agency simulator. Clients make random reservations so that the data accesses in different transactions are not necessarily related. Both our policy and Yoo’s algorithm detect undesirable commit rate and ratio, and lower the concurrency accordingly. While we believe Shrink detects undesirable commit ratio as well, the nature of random data access makes the hotspot detection useless. Nevertheless, the hotspot detection hinders the loser threads’ progress, and hence slightly reduces the contention.

Our policies bring about a slight slowdown to ssca while Yoo and Shrink have performance on a par with the baseline. This is due to the already saturated communication channel for cache-level data transfer. Any additional implementations that share data cross cores can really slow down the system significantly. Due to the high commit ratio, both Yoo’s and Shrink policies do not react to reduce concurrency.

Our Probe policy, as well as Shrink, achieve excellent improvement for yada when there are 16 initial threads. These algorithms do not rely purely on commit ratio for decision making. As yada has a commit ratio well below 20%, other ACCs are confused and get the program over-serialized.

Performance of Probe and Probe-Z is very similar for a lot of applications. However, Probe-Z is much more well-rounded, providing much better worst-case performance. Among the STAMP applications, intruder, ssca, and kmeans have by nature high commit rates. Probe cannot perform well for these applications because the concurrency control metadata have to be frequently transferred between the two disjoint processor caches. By partitioning the threads into zones, the performance degradation is greatly relieved.

To further investigate the difference in performance of the ACCs, we carried out an analysis on some of the benchmark applications, and the results are shown in Fig. 5. The performance of the yada application gets quite close to the sweet spot with 8 threads although the commit ratio is well below 10%. Our Probe policy improves the situation by stalling a few threads, getting the program accelerated by about 10%. Shrink also performs well by stalling some threads. Before stalling a thread, it performs hotspot detection to ensure only transactions accessing hotspots are stalled. Since it finds only some amount of hotspots, the system stalls only a few threads and achieves a good result. Yoo’s policy does not work well with yada. It keeps watching the contention intensities of individual threads and is thus mislead by the commit ratios to stall too many transactional threads. While they achieved much higher commit ratios than Probe and Shrink, they have actually dragged down the commit rate, and hence lengthened the program execution.

For kmeans executed with 16 threads, there exists severe contention so that the commit ratio is below 25% when ACC is not activated. This application performs clustering on multi-dimensional data points. In each transaction, the center of a cluster is modified. As there are only a few clusters, they become memory hotspots in the program. By adaptively stalling different amounts of threads, Probe drastically improves the commit rate. Yoo keeps the commit ratio constant, at about 65%, which is the same as that in yada. Shrink keeps the commit ratio at about 80%, which is close to the threshold of activating hotspot detection. As all the transactions make access to hotspots, a transaction is definitely stalled when the thread-local contention drops below the threshold.

The performance results of Probe-Z and Probe-ZM on STAMP benchmarks are quite similar. This is because there are not any specific thread correlations in the STAMP benchmark inputs. It would be worse if the system spends extra effort on detection and migration but ends up achieving nothing better in terms of cache sharing. To demonstrate the potential of the system, we develop a micro-benchmark called dual red-black tree. In this program, each thread is given either a group of odd numbers or a group of even numbers, which are to be inserted into two red-black trees. Without proper thread placement, the threads will contend for the same memory location to make their updates. As shown in Fig. 6, Probe-ZM reduces cross-partition contention intensity and makes the system scale almost linearly when there are 8 threads. In comparison, the original TinySTM yields less than a speedup of 4 times in the same situation.

### VII. Related Work

Ansari et al [2], [12] propose a P-only control algorithm for tuning the active thread count according to the observed commit ratio. They presented preliminary benchmarking results, tested with eight threads only, to show the effectiveness of their algorithm compared with other simple adaptive concurrency control schemes. In contrast, we propose a new policy (Probe) that observes commit rate instead of commit ratio, and also conduct experimental evaluation of a bigger scale and a wider range of benchmarks.

Yoo and Lee [3] proposed another mechanism to adjust number of active threads. Unlike the work by Ansari, Yoo’s policy does not require heuristic data sharing among threads. It computes contention intensity on each thread, which, when above a threshold, the thread acquires a common lock before starting a new transaction. Meanwhile, counting number of conflicts is not reliable as large amount of conflicts (e.g. in Yada) may mislead the policy to unnecessarily serialize the transactions. Our Probe policy disregards the rollback counts and does not suffer from the same problem.

Shrink [1] assumes conflicts are induced by memory hotspots. It adaptively activates hotspot detection when there are repeated conflicts encountered by a thread. Transactions that are known making accesses to hotspots are required to acquire a common lock so that they serialize among themselves without affecting other threads. Unfortunately, if data access of a problem is purely random, all these efforts do nothing helpful.

<table>
<thead>
<tr>
<th>Policy</th>
<th>Best-Case</th>
<th>Worst-Case</th>
<th>Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>Probe</td>
<td>+101.79% (yada)</td>
<td>-26.51% (ssca)</td>
<td>+11.03%</td>
</tr>
<tr>
<td>Probe-Z</td>
<td>+81.59% (yada)</td>
<td>-6.73% (labyrinth)</td>
<td>+21.04%</td>
</tr>
<tr>
<td>Probe-ZM</td>
<td>+94.70% (yada)</td>
<td>-9.18% (ssca)</td>
<td>+20.08%</td>
</tr>
<tr>
<td>Yoo</td>
<td>+70.50% (yada)</td>
<td>-12.95% (intruder)</td>
<td>+9.95%</td>
</tr>
<tr>
<td>Shrink</td>
<td>+61.02% (yada)</td>
<td>-28.96% (kmeans-1)</td>
<td>-3.23%</td>
</tr>
</tbody>
</table>
but waste even more computing time, as shown in Vacation benchmark. On the contrary our heuristics handle this case properly by reducing number of active threads globally.

Key-based adaptive transactional memory executor [17] reduces conflicts by introducing the concept of key. Transactions modifying similar data are given similar keys. The executor schedules unrelated transactions (of dissimilar keys) on different processors for maximal parallelism and related transactions on the same processor for conflict prevention plus better cache utilization. Unfortunately, the design lacks a well-defined strategy to assign proper keys to transactions.

Dependence-aware transactional memory (DATM) [18] is a recently proposed model for increasing concurrency of memory transactions without complicating their interface. DATM manages dependencies between conflicting, uncommitted transactions so that they commit safely. DATM accepts all conflict-serializable concurrent interleavings. DATM outperforms TL2 by 4.86 times at 16 threads for Vacation. However it lacks evidence for other applications, especially the ones with low commit ratio such as Kmeans. The model also has to deal with deadlocks that can arise in commit protocol for various reasons.

Maldonado, et al [4] proposes various methods to schedule transactional threads. They extensively evaluated various methods of scheduling, including spin-locks, condition variables (CVs), and extending the kernel schedulers. For instance, they noticed using CVs to wake up loser threads adds huge overhead to the commit procedure. While transaction-aware schedulers perform better than user-space contention management in excessive threading situations, most of the implementations perform similarly in situations where there are more threads than cores.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we demonstrate an advanced software transactional memory (STM) tailored to platforms of multiple chip multiprocessors (CMPs). To optimize the throughput of the STM, a cache-aware design of dynamic concurrency control is important for regulating the rate of spawning transactions with low overheads on a multi-CMP system. The concurrency control aims for seeking the sweet spot on commit rate by varying the number of active threads, and is found more robust than other solutions which might get misled by low commit ratios. We found that control overheads are very sensitive to the underlying cache sharing and illustrate the zoning technique to localize the traffic of metadata sharing across cores. We also refine our implementation with a novel thread migration policy that considers pairwise contention intensities and swaps threads in a sense to improve the cache affinity. In future, we may consider new adaptive scheduling policies that observe more STM parameters for smarter decisions, including transaction length, read-write ratio, etc.

ACKNOWLEDGMENT

This research is supported by Hong Kong RGC Grant HKU 718011E and China 863 grant 2014AA01A302.

REFERENCES

[1] A. Dragojević, A. Guerraoui, R. amd Singh, and V. Singh, “Preventing versus curing: Avoiding conflicts in transactional memories,” in Pro-


