Skip Menu |

This queue is for tickets about the Cache-FastMmap CPAN distribution.

Report information
The Basics
Id: 61406
Status: resolved
Priority: 0/
Queue: Cache-FastMmap

People
Owner: Nobody in particular
Requestors: j.david.lowe [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: race condition with signals
There's a race condition in the page locking code. If the page locking is interrupted in just the right spot, it will render the cache unusable -- all subsequent accesses will die with "page X is already locked, can't lock multiple pages..." errors. Here is a test which reproduces the issue on my system; you might have to adjust the alarm length a bit to make the error reproduce elsewhere: #!/usr/bin/env perl use strict; use warnings; use Cache::FastMmap (); use File::Temp (); use Time::HiRes (); use List::Util (); use Test::More qw(no_plan); my $share_file = File::Temp::tmpnam(); my $c = new Cache::FastMmap( init_file => 1, share_file => $share_file, page_size => 65536, num_pages => 89, raw_values => 1, compress => 0, enable_stats => 0, expire_time => 0, unlink_on_exit => 0, empty_on_exit => 0, ); my @keys = qw(foo bar baz barney poot boot zoot toot); for my $key (@keys) { $c->set($key => 'the quick brown fox'); } ## this is an inherently unreliable test. In particular, the rand(3) below is specific to the ## speed of the machine where I developed the test. I don't have a solution. eval { for (1 .. 100) { eval { local $SIG{ALRM} = sub { die 'alarm' }; while (1) { Time::HiRes::alarm(0.000001 * int(rand(3))); $c->get((List::Util::shuffle(@keys))[0]); Time::HiRes::alarm(0); } }; ok($@ =~ m/^alarm /, "exited the loop by alarm..."); ## if the race condition occurs, this will die() because it thinks ## it's still holding a lock. If the race does not occur, this will ## not die. $c->get('foo'); } }; ok($@ eq '', "edge cache remains in a usable state") or diag($@);
This issue can be worked around. In our code base we have surrounded all calls into Cache::FastMmap methods with: my $set = POSIX::SigSet->new(POSIX::SIGALRM()); POSIX::sigprocmask(POSIX::SIG_BLOCK(), $set); ## Cache::FastMmap call POSIX::sigprocmask(POSIX::SIG_UNBLOCK(), $set); Which solves the issue for us.
I couldn't reproduce this problem with the code you included. The fcntl() call should catch EINTR return codes and try again, I'm not sure why it wouldn't be. Anyway I've made the whole setting alarm() thing optional now, you have to explicitly set the Catch_deadlocks => 1 argument in 1.36. Assuming you don't set that, it doesn't set any alarm() calls, and it should retry the fcntl call if it gets an EINTR. On Thu Sep 16 18:22:49 2010, DLOWE wrote: Show quoted text
> This issue can be worked around. In our code base we have surrounded all > calls into Cache::FastMmap methods with: > > my $set = POSIX::SigSet->new(POSIX::SIGALRM()); > POSIX::sigprocmask(POSIX::SIG_BLOCK(), $set); > > ## Cache::FastMmap call > > POSIX::sigprocmask(POSIX::SIG_UNBLOCK(), $set); > > Which solves the issue for us.
On Tue Sep 28 23:25:12 2010, ROBM wrote: Show quoted text
> I couldn't reproduce this problem with the code you included. The > fcntl() call should catch EINTR return codes and try again, I'm not sure > why it wouldn't be.
You might need to fiddle with the timer a bit to get it to reproduce in another environment; it is an extremely small race. We see it in production, but that's because we're handling ~4 billion requests/day via Cache::FastMmap under very strict latency controls. This isn't an issue with getting the signal inside fcntl(), though. The problem occurs when a signal is delivered between the lock and unlock steps, while cache->p_cur is != -1. There's no way for a caller to ensure that the lock is ever released & it renders the cache object unusable to that process. There's even an extremely short race interval in unix.c mmc_unlock_page() *after* the fcntl() completes, but *before* p_cur is set to -1, where if a signal is delivered, the library *thinks* a page is being held, but it actually isn't. But basically anytime a cache action is interrupted while p_cur is set will break things. Show quoted text
> Anyway I've made the whole setting alarm() thing optional now, you have > to explicitly set the Catch_deadlocks => 1 argument in 1.36. Assuming > you don't set that, it doesn't set any alarm() calls, and it should > retry the fcntl call if it gets an EINTR.
This doesn't have anything to do with your internal alarms (though that does help with the other ticket I filed, and thank you!) Resolving this may be as simple as documenting the fact that the cache object isn't always recoverable if it gets a signal during a get() or set(), and that callers should ensure that signals which they expect to be able to handle and recover from should be blocked during those calls; I'm not sure whether there's a sane way for you to fix this inside the library (you have no way to know whether your callers want to catch and recover from a given signal...)
Ok I understand what was happening here. I've changed the page locking/unlocking so it uses an object with a DESTROY method to do the unlocking now. That means however you exit a sub (eg die from a $SIG{ALRM}), it should unlock the page correctly.