Bug #20942 for Text-Unaccent: segmentation fault

Fri Aug 11 07:17:31 2006 cedric [...] over-blog.com - Ticket created

Subject:	segmentation fault
Date:	Fri, 11 Aug 2006 13:17:10 +0200
To:	bug-Text-Unaccent [...] rt.cpan.org
From:	cedric <cedric [...] over-blog.com>

I am using unaccent on a lot of content, fetching it with DBI. Usually it works fine, but sometimes I get a segmentation fault. I cannot reproduce it without DBI (no time, and the data is in the database). It happens after 1000 or more iterations, but I cannot be sure it could not happen earlier.

Here is an extract of the code:

    #!/usr/bin/perl -w
    use strict;
    use DBI;
    use Text::Unaccent;

    my $charset = 'ISO-8859-15';
    ......
    while (@ids) {
        ....
        $clean_text = unac_string($charset, lc( $clean_text ));
        ....
    }

Fri Aug 31 18:51:47 2007 EWILHELM [...] cpan.org - Correspondence added

From:	EWILHELM [...] cpan.org

On Fri Aug 11 07:17:31 2006, cedric@over-blog.com wrote:

> I am using unaccent on a lot of content, getting them using DBI.
> Usually it works fine, but sometimes I get a segmentation fault.

This may or may not be the same issue, but I came across this unfiled bug via musicbrainz. Their finding was that the code would segfault if it hit an undef or otherwise non-string scalar.

http://users.musicbrainz.org/~dave/Text-Unaccent-1.07-svrok.patch

Attaching patch for permanent reference.

--Eric

A patch for Text::Unaccent 1.07 (http://www.senga.org/unac/) to safely handle non-string scalars - e.g. undef, references, etc. It's not as "complete" as maybe it could be - it doesn't "stringize" overloaded scalars, for example - but at least it stops the lockups and core dumps.

diff -aur Text-Unaccent-1.07/Unaccent.xs Text-Unaccent-1.07-patched/Unaccent.xs
--- Text-Unaccent-1.07/Unaccent.xs	2002-09-02 15:16:06.000000000 +0100
+++ Text-Unaccent-1.07-patched/Unaccent.xs	2004-03-29 19:35:40.000000000 +0100
@@ -65,7 +65,7 @@
 	PROTOTYPE: $$
 	CODE:
 	STRLEN in_length;
-	in_length = SvCUR(ST(1));
+	in_length = (SvPOK(ST(1)) ? SvCUR(ST(1)) : 0);
 	if(unac_string(charset, in, in_length, &buffer, &buffer_length) == 0) {
@@ -83,7 +83,7 @@
 	PROTOTYPE: $
 	CODE:
 	STRLEN in_length;
-	in_length = SvCUR(ST(0));
+	in_length = (SvPOK(ST(0)) ? SvCUR(ST(0)) : 0);
 	if(unac_string_utf16(in, in_length, &buffer, &buffer_length) == 0) {
 	RETVAL = newSVpv(buffer, buffer_length);
diff -aur Text-Unaccent-1.07/t/unac.t Text-Unaccent-1.07-patched/t/unac.t
--- Text-Unaccent-1.07/t/unac.t	2002-09-02 15:16:06.000000000 +0100
+++ Text-Unaccent-1.07-patched/t/unac.t	2004-03-29 19:36:04.000000000 +0100
@@ -19,7 +19,7 @@
 use Text::Unaccent;
 
-plan test => 4;
+plan test => 8;
 
 ok(unac_string("ISO-8859-1", "été"), "ete", "removing accents from été (1)");
 ok(unac_string("ISO-8859-1", "été"), "ete", "removing accents from été (2)");
@@ -30,6 +30,11 @@
 # ok(unac_debug($Text::Unaccent::DEBUG_HIGH), undef, "setting debug level");
 
+ok(unac_string("UTF-8", $a="abc"), "abc", "SvROK test (string)");
+ok(unac_string("UTF-8", $a=[]), "", "SvROK test (ref)");
+ok(unac_string("UTF-8", $a="abc"), "abc", "SvROK test (string)");
+ok(unac_string("UTF-8", $a=undef), "", "SvROK test (undef)");
+
 # Local Variables: ***
 # mode: perl ***
 # End: ***
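For anyone who cannot rebuild the XS module, a similar guard could be applied on the Perl side before calling unac_string. The following is only a sketch: safe_string is a hypothetical helper, not part of Text::Unaccent, and it follows the patch's choice of treating undef and references as the empty string. Whether this addresses the crash in the original report is not established.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Unaccent): always hand back a
# plain string, treating undef and references as empty, as the patch does.
sub safe_string {
    my ($value) = @_;
    return '' unless defined $value;   # undef, e.g. a NULL column fetched via DBI
    return '' if ref $value;           # array refs, hash refs, objects
    return "$value";                   # fresh copy built by string interpolation
}

# Intended use, assuming Text::Unaccent is installed:
#   my $clean = unac_string($charset, safe_string($text));

# Small demonstration of the helper on the kinds of values the patch tests.
print "[", safe_string($_), "]\n" for undef, [], "abc", 42;
```

Like the patch, this does not stringify overloaded objects; they are simply mapped to the empty string.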

Fri Aug 31 18:51:51 2007 The RT System itself - Status changed from 'new' to 'open'