Bug #17335 for Cache-FastMmap: Attempt to work with a partially corrupted sharefile leads to a SIGSEGV

Sat Jan 28 17:28:01 2006 Guest - Ticket created

Subject:

Attempt to work with a partially corrupted sharefile leads to a SIGSEGV

Module: Cache-FastMmap-1.09. Perl version: v5.8.5 built for sun4-solaris OS version: SunOS 5.8 Generic_108528-19 sun4u sparc SUNW,UltraAX-i2 Problem description: An attempt to work with a partially corrupted sharefile crashes the process. As far as I could understand during debug of the process, the reason for crash is an uninitialised "last_error" variable of the "mmap_cache" structure. In my case the crash happens after the mmc_lock() function has revealed the corrupted state of some page in the file. The function returns an error condition (-1) without calling the internal _mmc_set_error() function. The _mmc_set_error(), if called, should store the error message in the "last_error" variable of the "mmap_cache" structure. If it doesn't, the uninitialized value (which points somewhere beyond the process address space) is used as an input for croak() which leads to a crash. Proposed patch: The least proposed patch is to initialise the "last_error" variable in the mmc_init() function with some constant string value like "unknown error" or something. Additionally, it will be a good idea to enrich error code returns from mmc_lock() function with corresponding _mmc_set_error() calls. A scratch patch file is included in the attachment.

Subject:

Cache-FastMmap-1.09-patch.txt

--- Cache-FastMmap-CImpl/mmap_cache.c.orig 2006-01-29 00:52:23.099017000 +0300 +++ Cache-FastMmap-CImpl/mmap_cache.c 2006-01-29 01:08:36.075541000 +0300 @@ -83,7 +83,7 @@ /* Internal functions */ -void _mmc_set_error(mmap_cache *, int, char *, ...); +int _mmc_set_error(mmap_cache *, int, char *, ...); void _mmc_init_page(mmap_cache *, MU32); MU32 * _mmc_find_slot(mmap_cache * , MU32 , void *, int ); @@ -222,6 +222,8 @@ return -1; } + cache->last_error = "Unknown error"; + /* Basic cache params */ c_num_pages = cache->c_num_pages; ASSERT(c_num_pages >= 1 && c_num_pages <= 1000); @@ -449,7 +451,8 @@ return -1; } - if (!(P_Magic(p_ptr) == 0x92f7e3b1)) return -1; + if (!(P_Magic(p_ptr) == 0x92f7e3b1)) + return -1 + _mmc_set_error( cache, 0, "magic page start marker not found. p_cur is %u, offset is %u", p_cur, p_offset); /* Copy to cache structure */ cache->p_num_slots = P_NumSlots(p_ptr); @@ -459,10 +462,10 @@ cache->p_free_bytes = P_FreeBytes(p_ptr); /* Reality check */ - if (!(cache->p_num_slots >= 89 && cache->p_num_slots < cache->c_page_size)) return -1; - if (!(cache->p_free_slots > 0 && cache->p_free_slots <= cache->p_num_slots)) return -1; - if (!(cache->p_old_slots <= cache->p_free_slots)) return -1; - if (!(cache->p_free_data + cache->p_free_bytes == cache->c_page_size)) return -1; + if (!(cache->p_num_slots >= 89 && cache->p_num_slots < cache->c_page_size)) return -1 + _mmc_set_error( cache, 0, "cache num_slots mistmatch"); + if (!(cache->p_free_slots > 0 && cache->p_free_slots <= cache->p_num_slots)) return -1 + _mmc_set_error( cache, 0, "cache free slots mustmatch"); + if (!(cache->p_old_slots <= cache->p_free_slots)) return -1 + _mmc_set_error( cache, 0, "cache old slots mistmatch"); + if (!(cache->p_free_data + cache->p_free_bytes == cache->c_page_size)) return -1 + _mmc_set_error( cache, 0, "cache free data mistmatch"); /* Check page header */ ASSERT(P_Magic(p_ptr) == 0x92f7e3b1); @@ -1228,12 +1231,12 @@ } /* - * void _mmc_set_error(mmap_cache *cache, int err, char * error_string, ...) + * int _mmc_set_error(mmap_cache *cache, int err, char * error_string, ...) * * Set internal error string/state * */ -void _mmc_set_error(mmap_cache *cache, int err, char * error_string, ...) { +int _mmc_set_error(mmap_cache *cache, int err, char * error_string, ...) { va_list ap; static char errbuf[1024]; @@ -1256,7 +1259,7 @@ va_end(ap); - return; + return 0; } /*

Tue Jan 31 19:26:20 2006 Guest - Correspondence added

Thanks for the report. I'll integrate your changes into the next version. FYI, I've been working on and off on a new version of Cache::FastMmap that will include an additional "has been modified" flag on each page. Basically before any modification occurs, this flag will be set, and only after any changes have been made which make the page consistent will the flag be unset. Basically there is currently a race condition where if a process is killed while in the middle of making changes to a page, it can leave the page corrupted (and the OS will release the lock when the process is killed) and cause other processes attached to it to crash. While this is very rare, it's still a possibility that should be dealt with. Sorry, no timeframe at the moment, but keep an eye out on CPAN. Rob

Tue Jan 31 19:26:21 2006 The RT System itself - Status changed from 'new' to 'open'

Fri Feb 10 16:49:16 2006 cpan [...] my-e-mail.ru - Correspondence added

Subject:	RE: [rt.cpan.org #17335] Attempt to work with a partially corrupted sharefile leads to a SIGSEGV
Date:	Sat, 11 Feb 2006 00:47:52 +0300
To:	<bug-Cache-FastMmap [...] rt.cpan.org>
From:	"Quonsto" <cpan [...] my-e-mail.ru>

Thank you for the answer. I'll keep an eye on cpan for the next version. I just want to notice, that in my case it's not Cache::FastMmap's responsibility for spoiling the cache file. It's definitely not a situation you described, that a process is being killed during page update. The page gets almost completely corrupted with some un-identifiable garbage. I have a lot of different modules working together under mod_perl and I guess some of them could be responsible for it. The only goal for me was that Cache::FastMmap shouldn't crash in this case. Thank you for your work, the module became really useful for me. I even had to refresh my memory about debugging C programs to keep it with me :) Regards, Quonsto. Show quoted text

> -----Original Message----- > From: Guest via RT [mailto:bug-Cache-FastMmap@rt.cpan.org] > Sent: Wednesday, February 01, 2006 3:26 AM > To: cpan@my-e-mail.ru > Subject: [rt.cpan.org #17335] Attempt to work with a > partially corrupted sharefile leads to a SIGSEGV > > > Thanks for the report. I'll integrate your changes into the > next version. > > FYI, I've been working on and off on a new version of > Cache::FastMmap that will include an additional "has been > modified" flag on each page. Basically before any > modification occurs, this flag will be set, and only after > any changes have been made which make the page consistent > will the flag be unset. Basically there is currently a race > condition where if a process is killed while in the middle of > making changes to a page, it can leave the page corrupted > (and the OS will release the lock when the process is killed) > and cause other processes attached to it to crash. While this > is very rare, it's still a possibility that should be dealt with. > > Sorry, no timeframe at the moment, but keep an eye out on CPAN. > > Rob > >

Sun Feb 19 19:06:40 2006 robm [...] fastmail.fm - Correspondence added

Subject:	Re: [rt.cpan.org #17335] Attempt to work with a partially corrupted sharefile leads to a SIGSEGV
Date:	Mon, 20 Feb 2006 11:06:02 +1100
To:	<bug-Cache-FastMmap [...] rt.cpan.org>
From:	"Rob Mueller" <robm [...] fastmail.fm>

Show quoted text

> I just want to notice, that in my case it's not Cache::FastMmap's > responsibility for spoiling the cache file. > It's definitely not a situation you described, that a process is being > killed during page update. > The page gets almost completely corrupted with some un-identifiable > garbage. > I have a lot of different modules working together under mod_perl and I > guess some of them could be responsible for it. > The only goal for me was that Cache::FastMmap shouldn't crash in this > case.

That is quite strange. It sounds like something is randomly trashing memory or something? Ouch. Because of the new "page dirty" flag, I have code to completely clean and empty a particular page (eg if a page does get left in an undefined dirty state when a process is killed, the next process to try and access that page has to completely wipe it), i'll make it that if the magic code at the start of the page is wrong, it also wipes the page clean. I might also make it log something to STDERR since really this should almost never happen, it implies that something trashed memory, or that a process was killed in the middle of modifying a page. Rob

Tue Sep 28 23:32:51 2010 cpan [...] robm.fastmail.fm - Correspondence added

I believe this was fixed some time back. After locking a page, the first thing it checks is that magic page marker is correct. if (!(P_Magic(p_ptr) == 0x92f7e3b1)) return -1 + _mmc_set_error(cache, 0, "magic page start marker not found. p_cur is %u, offset is %u", p_cur, p_offset); It also initialises last_error. cache->last_error = 0; And the code to get the error does. char * mmc_error(mmap_cache * cache) { if (cache->last_error) return cache->last_error; return "Unknown error"; }

Tue Sep 28 23:32:53 2010 cpan [...] robm.fastmail.fm - Status changed from 'open' to 'resolved'