Linked Lists and Balanced Search Trees are familiar data structures, but can they make the leap to parallelized environments?
By Herb Sutter
June 27, 2008
URL: http://www.ddj.com/architecture-and-design/208801371
Herb is a software development consultant, a software architect at Microsoft, and chair of the ISO C++ Standards committee. He can be contacted at www.gotw.ca.
What is a high-performance data structure? To answer that question, we're used to applying the normal considerations: Big-Oh complexity, memory overhead, locality, and traversal order. All of those apply to both sequential and concurrent software.
But in concurrent code, we need to consider two additional things to help us pick a data structure that is also sufficiently concurrency-friendly:

- How much concurrency does it allow? Can multiple threads actively use different parts of the data structure at the same time?
- How much memory synchronization does it require? When one part of the structure is updated, how much changed memory has to be made visible to other cores that use the structure?
It turns out that both of these answers are directly influenced by whether the data structure allows truly localized updates. If making what appears to be a small change in one part of the data structure actually ends up reading or writing other parts of the structure, then we lose locality; those other parts need to be locked, too, and all of the memory that has changed needs to be synchronized.
To illustrate, let's consider two common data structures: linked lists and balanced trees.
Linked lists are wonderfully concurrency-friendly data structures because they support highly localized updates. In particular, as illustrated in Figure 1, to insert a new node into a doubly linked list, you only need to touch two existing nodes, namely the ones immediately adjacent to the position the new node will occupy, to splice the new node into the list. To erase a node, you only need to touch three nodes: the one that is being erased, and its two immediately adjacent nodes.
This locality enables the option of using fine-grained locking: We can allow a potentially large number of threads to be actively working inside the same list, knowing that they won't conflict as long as they are manipulating different parts of the list. Each operation only needs to lock enough of the list to cover the nodes it actually uses.
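The localized splice and unlink can be sketched in a few lines. (This is a minimal illustration with a hypothetical Node type, not any library's actual internals; locking is omitted here to show just which nodes each operation writes.)

```cpp
#include <cassert>

// Minimal doubly linked list node, for illustration only.
struct Node {
    int value = 0;
    Node* prev = nullptr;
    Node* next = nullptr;
};

// Splice new_node in after pos: writes touch only pos, pos's old
// successor, and the new node itself.
void insert_after(Node* pos, Node* new_node) {
    new_node->prev = pos;
    new_node->next = pos->next;
    if (pos->next) pos->next->prev = new_node;
    pos->next = new_node;
}

// Unlink node: writes touch only the node and its two neighbors.
void erase(Node* node) {
    if (node->prev) node->prev->next = node->next;
    if (node->next) node->next->prev = node->prev;
    node->prev = node->next = nullptr;
}
```

Note that no node outside the immediate neighborhood of the change is read or written, which is exactly the locality the fine-grained locking below exploits.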
For example, consider Figure 2, which illustrates the technique of hand-over-hand locking. The basic idea is this: Each segment of the list, or even each individual node, is protected by its own mutex. Each thread that may add or remove nodes from the list takes a lock on the first node; then, while still holding that, takes a lock on the next node; then it lets go of the first node and, while still holding a lock on the second node, it takes a lock on the third node; and so on. (To delete a node requires locking three nodes.) While traversing the list, each such thread always holds at least two locks, and the locks are always taken in the same order.
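A hand-over-hand traversal can be sketched as follows (a minimal sketch assuming a hypothetical LockedNode type with one mutex per node; a real implementation would also have to handle insertion and deletion, which lock two and three nodes respectively):

```cpp
#include <mutex>

// One mutex per node, per the hand-over-hand scheme.
struct LockedNode {
    int value = 0;
    LockedNode* next = nullptr;
    std::mutex m;
};

// Search the list hand-over-hand: always acquire the next node's lock
// before releasing the current one, and always acquire in list order,
// so concurrent traversals cannot deadlock.
bool contains(LockedNode* head, int target) {
    if (!head) return false;
    std::unique_lock<std::mutex> cur(head->m);
    LockedNode* node = head;
    while (node) {
        if (node->value == target) return true;
        LockedNode* nxt = node->next;
        if (!nxt) break;
        std::unique_lock<std::mutex> next_lock(nxt->m); // take next lock...
        cur.swap(next_lock);   // ...then next_lock releases the previous
                               // node's mutex when it goes out of scope
        node = nxt;
    }
    return false;
}
```

The swap is what makes this "hand-over-hand": for a moment the thread holds both locks, then the trailing lock is dropped as the thread steps forward.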
This technique delivers a number of benefits, including the following:

- Multiple threads can be actively traversing and updating different parts of the same list at the same time.
- Because every thread acquires the node locks in the same order, these traversing and updating operations cannot deadlock with one another.
- Contention stays localized: a thread blocks only a neighbor that is working on the same few nodes, not every other user of the list.
Aside: If we always traverse the list in the same order, why does the figure show a doubly linked list? Because not all operations need to take multiple locks; those that use individual segments or nodes in-place, one at a time, without taking more than one node's or chunk's lock at a time, can traverse the list in any order without deadlock. (For more on avoiding deadlock, see [1].)
Besides being well suited for concurrent traversal and update, linked lists also are cache-friendly on parallel hardware. When one thread removes a node, for example, the only memory that needs to be transferred to every other core that subsequently reads the list is the memory containing the two adjacent nodes. If the rest of the list hasn't been changed, multiple cores can happily store read-only copies of the list in their caches without expensive memory fetches and synchronization. (Remember, writes are always more expensive than reads because writes need to be broadcast. In turn, "lots of writes" are always more expensive than "limited writes.")
Clearly, one benefit lists enjoy is that they are node-based containers: Each element is stored in its own node, unlike an array or vector, where elements are contiguous and inserting or erasing typically involves copying an arbitrary number of elements to one side or the other of the inserted or erased value. We might therefore anticipate that perhaps all node-based containers will be good for concurrency. Unfortunately, we would be wrong.
The story isn't nearly as good for another popular data structure: the balanced search tree. (Important note: This section refers only to balanced trees; unbalanced trees that support localized updates don't suffer from the problems we'll describe next.)
Consider a red-black tree: The tree stays balanced by marking each node as either "red" or "black," and applying rules that call for optionally rebalancing the tree on each insert or erase to avoid having different branches of the tree become too uneven. In particular, rebalancing is done by rotating subtrees, which involves touching an inserted or erased node's parent and/or uncle node, that node's own parent and/or uncle, and so on to the grandparents and granduncles up the tree, possibly as far as the root.
For example, consider Figure 3. To start with, the tree contains three nodes with the values 1, 2, and 3. To insert the value 4, we simply make it a child of node 3, as we would in a nonbalanced binary search tree. Clearly, that involves writing to node 3, to set its right-child pointer. However, to satisfy the red-black tree mechanics, we must also change node 3's and node 1's color to black. That adds overhead and loses some concurrency; for example, inserting 4 would conflict with adding 1.5 concurrently, because both inserts would need to touch both nodes 1 and 3.
Next, to insert the value 5, we need to touch all but one of the nodes in the tree: We first make node 4 point to the new node 5 as its right child, then recolor both node 4 and node 3, and then because the tree is out of balance we also rotate 3-4-5 to make node 4 the root of that subtree, which means also touching node 2 to install node 4 as its new right child.
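To see concretely why even one rebalancing step is non-local, here is the textbook left rotation (a sketch with a hypothetical TreeNode type; a real red-black insert would also perform the recoloring described above). Rotating the 3-4-5 subtree writes to the rotated node, its right child, and the parent above the whole subtree:

```cpp
// Minimal red-black-style node, for illustration only.
struct TreeNode {
    int value = 0;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
    TreeNode* parent = nullptr;
    bool red = true;  // color bit; unused here, shown for completeness
};

// Left-rotate the subtree rooted at x (node 3 in Figure 3), making
// x->right (node 4) the new subtree root. Note the writes spread
// across three nodes plus, possibly, the tree's root pointer.
void rotate_left(TreeNode*& root, TreeNode* x) {
    TreeNode* y = x->right;
    x->right = y->left;                        // write to x
    if (y->left) y->left->parent = x;          // write to y's old child
    y->parent = x->parent;                     // write to y
    if (!x->parent)                root = y;   // write to the root pointer...
    else if (x == x->parent->left) x->parent->left = y;   // ...or to
    else                           x->parent->right = y;  // x's parent
    y->left = x;
    x->parent = y;
}
```

Even this single step forces a concurrent implementation to lock (and other cores to re-fetch) nodes well outside the one being inserted.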
So red-black trees cause some problems for concurrent code:

- Updates are not localized: a single insert or erase can touch nodes far from the point of change, possibly all the way up to the root.
- Fine-grained locking is hard to apply well: nodes near the root are touched by many different updates, so they become high-contention bottlenecks.
- Every node that is touched is memory that must be synchronized to other cores' caches, adding invisible overhead.
"But wait," I can hear some people saying, "why can't we just put a mutex inside each node and take the locks in a single direction (up the tree) like we could do with the linked list and hand-over-hand locking? Wouldn't that let us regain the ability to have concurrent use of the data structure, at least?" Short answer: That's easy to do, but hard to do right. Unlike the linked list case: (a) you may need to take many more locks, even all the way up to the root; and (b) the higher-level nodes will still end up being high-contention resources that bottleneck scalability. Also, the code to do this is much more complicated. As Fraser noted in 2004: "One superficially attractive solution is to read-lock down the tree and then write-lock on the way back up, just as far as rebalancing operations are required. This scheme would acquire exclusive access to the minimal number of nodes (those that are actually modified), but can result in deadlock with search operations (which are locking down the tree)." [2] He also proposed a fine-grained locking technique that does allow some concurrency, but notes that it "is significantly more complicated." There are easy answers, but few easy and correct answers.
To get around these limitations, researchers have worked on alternative structures such as skip lists [4], and on variants of red-black trees that can be more amenable to concurrency, for example by doing relaxed balancing instead of rebalancing immediately after each update. Some of these are significantly more complex, which incurs its own costs in both performance and correctness/maintainability (for example, relaxed balancing was first suggested in 1978 but not implemented successfully until five years later). For more information, and some relative performance measurements showing how even concurrent versions can still limit scalability, see [3].
Concurrency-friendliness alone doesn't singlehandedly trump other performance requirements. The usual performance considerations of Big-Oh complexity, memory overhead, locality, and traversal order all still apply. Even when writing parallel code, you shouldn't choose a data structure only because it's concurrency-friendly; you should choose the right one that meets all your performance needs. Lists may be more concurrency-friendly than balanced trees, but trees are faster to search, and "individual searches are fast" can outweigh "multiple searches can run in parallel." (If you need both, try an alternative like skip lists.)
Remember: Concurrency-friendliness matters most when your code runs on parallel hardware and multiple threads will share the same container. In those situations, prefer concurrency-friendly data structures. The more a container supports truly localized updates, the more concurrency you can have, as multiple threads can actively use different parts of the data structure at the same time, and (secondarily, but still sometimes importantly) the more you can avoid invisible memory synchronization overhead in your high-performance code.
[1] H. Sutter. "Use Lock Hierarchies to Avoid Deadlock" (Dr. Dobb's Journal, January 2008).
[2] K. Fraser. "Practical lock-freedom" (University of Cambridge Computer Laboratory Technical Report #579, February 2004).
[3] S. Hanke. "The Performance of Concurrent Red-Black Tree Algorithms" (Lecture Notes in Computer Science, 1668:286-300, Springer, 1999).
[4] M. Fomitchev and E. Ruppert. "Lock-Free Linked Lists and Skip Lists" (PODC '04, July 2004).