Bug #61406 for Cache-FastMmap: race condition with signals

Thu Sep 16 18:20:00 2010 j.david.lowe [...] gmail.com - Ticket created

Subject:

race condition with signals

There's a race condition in the page locking code. If the page locking is interrupted in just the right spot, it will render the cache unusable -- all subsequent accesses will die with "page X is already locked, can't lock multiple pages..." errors. Here is a test which reproduces the issue on my system; you might have to adjust the alarm length a bit to make the error reproduce elsewhere: #!/usr/bin/env perl use strict; use warnings; use Cache::FastMmap (); use File::Temp (); use Time::HiRes (); use List::Util (); use Test::More qw(no_plan); my $share_file = File::Temp::tmpnam(); my $c = new Cache::FastMmap( init_file => 1, share_file => $share_file, page_size => 65536, num_pages => 89, raw_values => 1, compress => 0, enable_stats => 0, expire_time => 0, unlink_on_exit => 0, empty_on_exit => 0, ); my @keys = qw(foo bar baz barney poot boot zoot toot); for my $key (@keys) { $c->set($key => 'the quick brown fox'); } ## this is an inherently unreliable test. In particular, the rand(3) below is specific to the ## speed of the machine where I developed the test. I don't have a solution. eval { for (1 .. 100) { eval { local $SIG{ALRM} = sub { die 'alarm' }; while (1) { Time::HiRes::alarm(0.000001 * int(rand(3))); $c->get((List::Util::shuffle(@keys))[0]); Time::HiRes::alarm(0); } }; ok($@ =~ m/^alarm /, "exited the loop by alarm..."); ## if the race condition occurs, this will die() because it thinks ## it's still holding a lock. If the race does not occur, this will ## not die. $c->get('foo'); } }; ok($@ eq '', "edge cache remains in a usable state") or diag($@);

Thu Sep 16 18:22:49 2010 j.david.lowe [...] gmail.com - Correspondence added

This issue can be worked around. In our code base we have surrounded all calls into Cache::FastMmap methods with: my $set = POSIX::SigSet->new(POSIX::SIGALRM()); POSIX::sigprocmask(POSIX::SIG_BLOCK(), $set); ## Cache::FastMmap call POSIX::sigprocmask(POSIX::SIG_UNBLOCK(), $set); Which solves the issue for us.

Thu Sep 16 18:22:50 2010 j.david.lowe [...] gmail.com - Status changed from 'new' to 'open'

Tue Sep 28 23:25:12 2010 cpan [...] robm.fastmail.fm - Correspondence added

I couldn't reproduce this problem with the code you included. The fcntl() call should catch EINTR return codes and try again, I'm not sure why it wouldn't be. Anyway I've made the whole setting alarm() thing optional now, you have to explicitly set the Catch_deadlocks => 1 argument in 1.36. Assuming you don't set that, it doesn't set any alarm() calls, and it should retry the fcntl call if it gets an EINTR. On Thu Sep 16 18:22:49 2010, DLOWE wrote: Show quoted text

> This issue can be worked around. In our code base we have surrounded all > calls into Cache::FastMmap methods with: > > my $set = POSIX::SigSet->new(POSIX::SIGALRM()); > POSIX::sigprocmask(POSIX::SIG_BLOCK(), $set); > > ## Cache::FastMmap call > > POSIX::sigprocmask(POSIX::SIG_UNBLOCK(), $set); > > Which solves the issue for us.

Tue Sep 28 23:25:14 2010 cpan [...] robm.fastmail.fm - Status changed from 'open' to 'resolved'

Fri Oct 01 19:33:14 2010 j.david.lowe [...] gmail.com - Correspondence added

On Tue Sep 28 23:25:12 2010, ROBM wrote: Show quoted text

> I couldn't reproduce this problem with the code you included. The > fcntl() call should catch EINTR return codes and try again, I'm not sure > why it wouldn't be.

You might need to fiddle with the timer a bit to get it to reproduce in another environment; it is an extremely small race. We see it in production, but that's because we're handling ~4 billion requests/day via Cache::FastMmap under very strict latency controls. This isn't an issue with getting the signal inside fcntl(), though. The problem occurs when a signal is delivered between the lock and unlock steps, while cache->p_cur is != -1. There's no way for a caller to ensure that the lock is ever released & it renders the cache object unusable to that process. There's even an extremely short race interval in unix.c mmc_unlock_page() *after* the fcntl() completes, but *before* p_cur is set to -1, where if a signal is delivered, the library *thinks* a page is being held, but it actually isn't. But basically anytime a cache action is interrupted while p_cur is set will break things. Show quoted text

> Anyway I've made the whole setting alarm() thing optional now, you have > to explicitly set the Catch_deadlocks => 1 argument in 1.36. Assuming > you don't set that, it doesn't set any alarm() calls, and it should > retry the fcntl call if it gets an EINTR.

This doesn't have anything to do with your internal alarms (though that does help with the other ticket I filed, and thank you!) Resolving this may be as simple as documenting the fact that the cache object isn't always recoverable if it gets a signal during a get() or set(), and that callers should ensure that signals which they expect to be able to handle and recover from should be blocked during those calls; I'm not sure whether there's a sane way for you to fix this inside the library (you have no way to know whether your callers want to catch and recover from a given signal...)

Fri Oct 01 19:33:20 2010 The RT System itself - Status changed from 'resolved' to 'open'

Sun Jul 17 21:54:15 2011 cpan [...] robm.fastmail.fm - Correspondence added

Ok I understand what was happening here. I've changed the page locking/unlocking so it uses an object with a DESTROY method to do the unlocking now. That means however you exit a sub (eg die from a $SIG{ALRM}), it should unlock the page correctly.

Sun Jul 17 21:54:16 2011 cpan [...] robm.fastmail.fm - Status changed from 'open' to 'resolved'