This queue is for tickets about the Lingua-BrillTagger CPAN distribution.

Report information
The Basics
Id: 76575
Status: new
Priority: 0/
Queue: Lingua-BrillTagger

People
Owner: Nobody in particular
Requestors: cpanbt.10.eveland [...] spamgourmet.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Core Dump For Large Tokens
Date: Sun, 15 Apr 2012 13:42:53 -0400
To: bug-lingua-brilltagger [...] rt.cpan.org
From: cpanbt.10.eveland [...] spamgourmet.com
The Brill Tagger library core dumps on tokens longer than 256 characters. If you add:

    $text = [ map { substr $_, 0, 250 } @$text ];

right after the call to tokenize in tag, you won’t hit this. Presumably 255 would work just as well as 250, but I didn’t take the time to fully test and didn’t want to find the exact boundary condition. :)

I’m using Lingua::BrillTagger 0.02 on perl v5.10.1, Linux 2.6.32-35-server #78-Ubuntu SMP Tue Oct 11 16:26:12 UTC 2011 x86_64 GNU/Linux.
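
For illustration, a standalone sketch of the truncation step described above; the sample tokens and surrounding boilerplate are assumptions for demonstration, with $text standing in for the arrayref of tokens that tokenize produces inside tag:

    use strict;
    use warnings;

    # Made-up tokens; the second one is 300 characters long.
    my $text = [ 'ordinary', 'x' x 300, 'tokens' ];

    # The suggested guard: clamp every token to 250 characters so nothing
    # longer than the 256-character limit described in the report reaches
    # the tagger.
    $text = [ map { substr $_, 0, 250 } @$text ];

    print length($_), "\n" for @$text;   # 8, 250, 6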
Subject: Re: [rt.cpan.org #76575] AutoReply: Core Dump For Large Tokens
Date: Mon, 16 Apr 2012 09:18:26 -0400
To: bug-lingua-brilltagger [...] rt.cpan.org
From: cpanbt.10.eveland [...] spamgourmet.com
To prevent the same buffer overflow from occurring on Unicode strings, the following line is an improvement over the previous suggestion:

    $text = [ map { decode('utf8', substr(encode('utf8', $_), 0, 250), Encode::FB_QUIET) } @$text ];
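
Likewise, a self-contained sketch of the Unicode-safe variant; the Encode import and the sample tokens are illustrative additions, and only the map line comes from the message above:

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Made-up tokens; the second is 300 two-byte characters (600 bytes as UTF-8).
    my $text = [ 'plain', "\x{e9}" x 300 ];

    # Truncate on UTF-8 bytes rather than characters: encode, clamp to 250
    # bytes, then decode; FB_QUIET quietly drops a trailing partial character
    # instead of raising an error.
    $text = [ map { decode('utf8', substr(encode('utf8', $_), 0, 250), Encode::FB_QUIET) } @$text ];

    # Each token is now at most 250 bytes when encoded back to UTF-8.
    print length(encode('utf8', $_)), " bytes\n" for @$text;   # 5 bytes, 250 bytes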