Shared access with TBB read/write locks
By Wooyoung Kim (Intel) (5 posts) on April 2, 2009 at 3:05 pm

Lately a sequence of interesting things happened to me. It started with finding an aptly titled paper, "TLRW: Return of the Read-Write Lock" by Dave Dice and Nir Shavit. A couple of weeks later, I got an email asking how to use TBB read/write locks (i.e., spin_rw_mutex). A week or so later, my officemate hit a roadblock while preparing for a release, which appeared to be a typical reader-writer race. (It turned out to be a false alarm, though.) Then, the following Monday, I received a question about the difference between TBB spin_rw_mutex and the POSIX read/write lock (i.e., pthread_rwlock). All of these things happened within a month, as if they had been arranged in advance.

These events made me wonder what interesting things have happened to read/write locks since the last time I worked on spin_rw_mutex. So, I did some googling. The following is a list of the things I found most informative.

1. On Windows, slim reader/writer locks (SRWLock) are available on Vista and later (http://msdn.microsoft.com/en-us/library/aa904937(VS.85).aspx). The TryAcquireSRWLockShared/TryAcquireSRWLockExclusive variants are available on Windows 7/Windows Server 2008 R2.

2. A futex-based read-write lock (furwock) was described in "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux" (http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf). Old but still useful, I think. A link to an example implementation can be found at the bottom of http://everything2.com/title/futex.

3. The Linux Kernel spinlock documentation mentions reader-writer locks (http://www.mjmwired.net/kernel/Documentation/spinlocks.txt).

4. An LKML message thread about 64-bit futexes (http://lkml.org/lkml/2008/5/30/458). This was something new to me. Linux kernel bigwigs including Linus Torvalds, Ulrich Drepper, and Ingo Molnar were debating whether extending futex variables to 64 bits is necessary or whether 32-bit futex variables are sufficient. What does this have to do with read-write locks? Ulrich Drepper believed it was needed for a faster read-write lock implementation (http://udrepper.livejournal.com/13123.html). He is currently the foremost contributor to the GNU C library, and the thread is almost a year old, so I expect some form of the new read-write lock implementation to be available in glibc in the near future, if it has not already made its way into the latest release.

5. Boost.Thread, a publicly available C++ thread library, has provided a multiple-reader/single-writer mutex named shared_mutex since its 1.35 release (1.38.0 is the latest as of 3/27/2009). http://www.boost.org/doc/libs/1_35_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_types.shared_mutex.

6. The new C++0x standard, now being finalized, does not include a read/write mutex in its thread support library. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2798.pdf.

TBB spin_rw_mutex, POSIX pthread_rwlock, and boost::shared_mutex

Considering that many platforms implement POSIX Threads, pthread_rwlock seems the most representative to date. Like TBB, Boost.Thread is written in C++. So, what similarities and differences do these read/write locks have?

1. spin_rw_mutex lives strictly in user space and blocks by spin-waiting. boost::shared_mutex is also in user space, but its slow path may require a trip to the kernel. Being in user space, spin_rw_mutex and shared_mutex can protect only accesses to data within a single process. In contrast, pthread_rwlock spans user space and the kernel. Its specification requires that a blocked thread release processor resources (see the Single UNIX Specification v2, System Interfaces, 2.9.6 Thread Read-Write Locks: http://www.opengroup.org/onlinepubs/009695399/). As a result, its slow path includes a trip to the kernel: when a writer holds the lock and a reader arrives, the reader blocks and goes to sleep. Moreover, pthread_rwlock may be initialized to protect accesses to data shared by multiple processes (cf. pthread_rwlockattr_setpshared). More capabilities mean more bookkeeping and more cost.

2. TBB spin_rw_mutex is almost fair. More specifically, it favors writers slightly over readers. This makes sense because spin_rw_mutex assumes many readers/frequent reads and few writers/occasional writes. In theory, readers may starve, though the implementation makes sure reader starvation is unlikely (see below for how TBB deals with it). TBB offers a fair sibling of spin_rw_mutex, queuing_rw_mutex.

As for pthread_rwlock and boost::shared_mutex, it is not clear to me what fairness policy they implement: fair, reader-over-writer, or writer-over-reader. The Single UNIX Specification v2 leaves the decision in the hands of implementers, and the Boost.Thread documentation does not mention it either. For example, the GLIBC NPTL pthread_rwlock uses writer-over-reader in some releases and reader-over-writer in others (cf. pthread.h). On platforms that conform to the UNIX® 98 standards (i.e., __USE_UNIX98) or to the X/Open Portability Guide (i.e., __USE_XOPEN2K), users can specify the preference via the non-portable pthread_rwlockattr_setkind_np.

3. TBB spin_rw_mutex provides the scoped_lock RAII pattern and uses a Boolean parameter to distinguish read from write lock acquisition. By contrast, pthread_rwlock requires explicit initialization/destruction in addition to lock acquisition/release, and provides separate APIs for read and write locks. boost::shared_mutex takes an in-between approach: in addition to the member functions supplied by shared_mutex, Boost.Thread defines lock_guard and friends for more elaborate RAII support.

TBB spin_rw_mutex implementation details

As the name suggests, spin_rw_mutex is built upon spin-waiting. Like any spin lock, TBB spin_rw_mutex assumes that access to a critical section is brief and contention is light. On top of that, it assumes many readers/few writers (or frequent reads/occasional updates). A concurrent hash table is a good example of such access patterns. With these assumptions, spin_rw_mutex is optimized to the point where all operations except one take only a single atomic operation on their fast paths. Applications with equal numbers of readers and writers, or with many writers/few readers, would perform better with TBB spin_mutex, as the latter incurs less overhead per lock acquisition/release.

A variable of type intptr_t is used as the lock variable. The two least significant bits are used by writers to mark that a writer is present (WRITER) and that writers are waiting (WRITER_PENDING). The remaining bits keep track of the number of readers in the critical section.

When a writer wants to enter a critical section, it checks whether someone is already inside. If so, the writer marks that it is waiting and spin-waits. When the critical section becomes unoccupied, the writer attempts to set the WRITER bit using a 'locked cmpxchg'. If successful, it holds the lock and enters the critical section. Otherwise, it repeats the process.

A writer releases the lock by clearing both the WRITER_PENDING and WRITER bits using a 'locked and'. Note that it cannot simply store 0 because of the way readers acquire the read lock (see below). Clearing the WRITER_PENDING bit keeps pending writers from being over-favored.

Read lock acquisition is optimistic. When a reader wants to enter a critical section and sees no writer or pending writer, it immediately increments the reader count using a 'locked xadd'. Then, if no writer has cut in line, it enters the critical section. Otherwise, it undoes its speculative increment and repeats the process. Optimism pays off when readers outnumber writers, because 'locked xadd' is cheaper than 'locked cmpxchg' and always succeeds.

A reader releases the read lock by atomically decrementing the reader count.

In addition, TBB spin_rw_mutex offers upgrade_to_writer and downgrade_to_reader operations. When successful (i.e., when the return value is true), they upgrade/downgrade the caller to a writer/reader without losing the lock. The two operations are mostly for convenience and can be thought of as equivalent to their corresponding release-acquire pairs despite the differences described below.

downgrade_to_reader takes a single atomic operation and never fails. By comparison, the writer-release/reader-acquire pair takes two atomic operations.

upgrade_to_writer is a bit trickier because, in a poor (and incorrect) implementation, simultaneous upgrade requests by two or more readers may lead to indefinite waiting (e.g., two readers each waiting for the other to leave). If a reader is alone in the critical section, it is allowed to make an upgrade attempt; if successful, it becomes a writer and releases its read lock. When multiple readers wish to upgrade, they get a green light as long as no writer is waiting. They make upgrade attempts by trying to set both the WRITER and WRITER_PENDING bits atomically. One of them succeeds: this reader becomes a writer without losing the lock, spin-waits for all its peer readers to leave, releases its read lock, and returns true. The rest follow the slow path of reader-release/writer-acquire and return false. Note that a writer may step in and acquire the write lock before any of the readers; in that case no reader succeeds. So don't forget to check what upgrade_to_writer returns: when a reader upgrades to a writer, the state it sees may differ from what it saw when it made the request.
Categories: Open Source, Software Tools
Comments (9)
April 2, 2009 6:36 PM PDT

jseigh
If you're talking about pure spin locks, you might look at bakery style spin locks here (pseudo code)
http://groups.google.com/group/comp.programming.threads/msg/3add8a83d4142d85

You can implement them with XADD. They're FIFO. Probably a bit more cache friendly than conventional spin locks. And upgrade is unconditional.
April 3, 2009 12:19 AM PDT

Dmitriy Vyukov
Hi Joe, it's nice to see you again :)

I submitted an eventcount implementation to TBB as a contribution some time ago:
http://software.intel.com/en-us/forums/intel-threading-build.....pic/62364/

And I wanted to contact you to get some information about the origins of eventcounts (including the term itself), but unfortunately I was unable to reach you via email at xemaps.com. I can't find clear information about this on c.p.t. Are you the inventor of the original eventcount algorithm? Can you provide some links?
Thanks.
April 3, 2009 12:57 AM PDT

Dmitriy Vyukov
Here is another idea on reader-writer locks from Joe Duffy (somehow similar to TLRW's byte-lock):
http://www.bluebytesoftware.com/blog/CommentView,guid,c69cb0.....99ffa.aspx

However, IMHO both have to consider grouping reader counters by thread id (processor id). Joe Duffy's version just increases memory consumption from number_of_processors bytes to number_of_processors * 16 * 128 bytes to overcome false sharing. And David Dice's version is totally prone to false sharing.
April 3, 2009 1:02 AM PDT

Dmitriy Vyukov
Another idea for a reader-writer mutex is my asymmetric mutex:
http://groups.google.com/group/lock-free/browse_frm/thread/1efdc652571c6137
Under a high read load it totally slaughters all other rw mutexes. The performance difference I have already seen on a quad-core was some 300x. With higher core counts and more distributed systems, the difference will be much bigger.
April 3, 2009 2:31 AM PDT

jseigh
I didn't invent eventcounts. That concept has been around for a while. They're just a shareable form of events. Events have the problem of having to coordinate the reset of the event without causing race conditions. "Resetting" an eventcount by one thread doesn't affect other threads.

http://portal.acm.org/citation.cfm?id=359076
April 3, 2009 3:25 AM PDT

Dmitriy Vyukov
Thank you!
April 6, 2009 9:14 AM PDT

Wooyoung Kim (Intel)
Thank you for the pointer.
I will probably need some time to digest it :^)
BTW, upgrade appears to require that readers decide whether they want to upgrade before acquiring their read locks (i.e., it seems very similar, or even equivalent, to write lock acquisition).
December 21, 2009 6:55 PM PST

softarts
Does TBB implement any kind of bakery-style lock? What is the main lock type in TBB?
December 22, 2009 7:34 AM PST

Wooyoung Kim (Intel)
Softarts,
We don't have Lamport's bakery lock in TBB. The closest thing is queuing_mutex, which ensures FIFO-style access to a critical section.

We prototyped a bakery-style mutex and experimented with it, but its performance degrades quickly even when a machine is slightly oversubscribed.

TBB does not have a "main" lock; instead, it provides a gamut of locks with different costs and functionalities. Users should choose the lock that best fits their synchronization needs. For example, our experiments show that tbb::spin_mutex is the most efficient, but it is inherently unfair. If you need to guarantee fair access, use a queue-based lock.

I hope my response answered your question properly. Let us know if you have any other related questions.
