Bug #27482 for Text-Markdown: wide characters cause croak from Digest::MD5 (test and patch included)

Thu Jun 07 11:24:04 2007 st [...] istic.org - Ticket created

Subject:

wide characters cause croak from Digest::MD5 (test and patch included)

First off, I thank you for providing this module to the community. I really appreciate Markdown's syntax and chose it over more featureful alternatives like Textile because it makes it easy to write simple HTML and then gets out of my way if I want to write more complex markup. However, I have a bug to report.

Text::Markdown uses Digest::MD5 to store hashes of HTML blocks in the text passed to it, but if you pass Digest::MD5 some text with wide characters in it, it dies, because MD5 works on octet strings, not character strings.

It seems that any text with block elements (such as blockquotes) that contain wide characters triggers this.

The workaround recommended in the POD of Digest::MD5 is to encode the text to UTF-8 before hashing, because UTF-8 round-trips all UCS text. The attached patch inserts a utility function that Encode::encode()s text before md5_hex()ing it, and uses it in all the places that handle input text. The patched Text::Markdown passes the test and works in the longer case that originally alerted me to the error.

There is one design decision involved in the patch: encoding every time we do an MD5 sum rather than encoding the text once. While this appears inefficient, on many texts it will be much more efficient: only the block elements are encoded rather than the whole text, so on input with no block elements (on the input) it makes no difference. Of course, on input that is 8-bit-only the only overhead is the additional sub call per MD5 sum. The other advantage to this approach is that it will make it easier to later add markup that has wide characters as its input or output, should you ever wish to extend Markdown in this way. Examples might be to specially treat the UCS line separator and paragraph separator characters, or to make "..." map to the UCS ellipsis character.

I assert no copyright on the attached patch and test: you may do with them what you like. I hope you will consider uploading an updated version of Text::Markdown to CPAN.
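The failure described above can be reproduced outside Text::Markdown. The following is a minimal sketch, not code from the module: the sample string and the variable names are made up for illustration, and it only assumes the documented behaviour that md5_hex() croaks when given a character above 0xFF.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Illustrative input: a blockquote line containing characters above 0xFF.
my $text = "> \x{3bc}\x{3bf}\x{3c5}";

# Hashing the character string directly croaks, so trap it with eval.
my $direct = eval { md5_hex($text) };
print "direct hashing failed: $@" unless defined $direct;

# Encoding to UTF-8 first hands Digest::MD5 the octet string it expects.
my $digest = md5_hex( encode('UTF-8', $text) );
print "digest of UTF-8 octets: $digest\n";
```

The second call succeeds because encode() returns a byte string, which is the only kind of input MD5 is defined over.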

Subject:

unicode.t

use utf8;
use warnings;
use Test::More tests => 2;
use_ok('Text::Markdown', 'markdown');
my $m = Text::Markdown->new;
my $html1;
$html1 = eval { $m->markdown(<<"EOF"); };
> Foâo

Î¼Î¿ÏÎµÎ¿ÏÎµÏ

> ÃÃ¥Å
EOF
is( <<"EOF", $html1 );
<blockquote>
Foâo
</blockquote>

Î¼Î¿ÏÎµÎ¿ÏÎµÏ

<blockquote>
ÃÃ¥Å
</blockquote>
EOF

Subject:

markdown.diff

diff --git a/lib/Text/Markdown.pm b/lib/Text/Markdown.pm
index ece5c7a..90d250a 100644
--- a/lib/Text/Markdown.pm
+++ b/lib/Text/Markdown.pm
@@ -12,6 +12,7 @@ use strict;
 use warnings;
 
 use Digest::MD5 qw(md5_hex);
+use Encode;
 use base 'Exporter';
 
 our $VERSION = '1.0.3';
@@ -79,6 +80,17 @@ my %g_html_blocks;
 # (see _ProcessListItems() for details):
 my $g_list_level = 0;
 
+sub md5_utf8 {
+# Internal function used to safely MD5sum chunks of the input, which might be Unicode in Perl's internal representation.
+    my $input = shift;
+    return undef unless defined $input;
+    if (Encode::is_utf8 $input) {
+        return md5_hex(encode('utf8', $input));
+    } else {
+        return md5_hex($input);
+    }
+}
+
 sub Markdown {
 #
 # Main function. The order in which other subs are called here is
@@ -201,7 +213,7 @@ sub _HashHTMLBlocks {
                     (?=\n+|\Z) # followed by a newline or end of document
                 )
             }{
-                my $key = md5_hex($1);
+                my $key = md5_utf8($1);
                 $g_html_blocks{$key} = $1;
                 "\n\n" . $key . "\n\n";
             }egmx;
@@ -221,7 +233,7 @@ sub _HashHTMLBlocks {
                     (?=\n+|\Z) # followed by a newline or end of document
                 )
             }{
-                my $key = md5_hex($1);
+                my $key = md5_utf8($1);
                 $g_html_blocks{$key} = $1;
                 "\n\n" . $key . "\n\n";
             }egmx;
@@ -243,7 +255,7 @@ sub _HashHTMLBlocks {
                     (?=\n{2,}|\Z) # followed by a blank line or end of document
                 )
             }{
-                my $key = md5_hex($1);
+                my $key = md5_utf8($1);
                 $g_html_blocks{$key} = $1;
                 "\n\n" . $key . "\n\n";
             }egx;
@@ -266,7 +278,7 @@ sub _HashHTMLBlocks {
                     (?=\n{2,}|\Z) # followed by a blank line or end of document
                 )
             }{
-                my $key = md5_hex($1);
+                my $key = md5_utf8($1);
                 $g_html_blocks{$key} = $1;
                 "\n\n" . $key . "\n\n";
             }egx;
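To show why only the digest call needs to change, here is a small standalone sketch of the hash-and-restore pattern the patch touches. It is not the module's code: `%blocks` is a hypothetical stand-in for `%g_html_blocks`, the `<div>` regex is a deliberately simplified placeholder for the real block patterns, and `md5_utf8` is a condensed variant of the helper from the patch.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode is_utf8);

# Condensed variant of the patch's helper: always hash octets.
sub md5_utf8 {
    my $input = shift;
    return undef unless defined $input;
    return md5_hex( is_utf8($input) ? encode('utf8', $input) : $input );
}

my %blocks;    # hypothetical stand-in for %g_html_blocks

# Illustrative input with wide characters inside an HTML block.
my $text = "intro\n\n<div>\x{3bc}\x{3b5}</div>\n\noutro\n";

# Swap each block for its digest, remembering the original text.
( my $hashed = $text ) =~ s{(<div>.*?</div>)}{
    my $key = md5_utf8($1);
    $blocks{$key} = $1;
    $key;
}egs;

# Later, substitute the stored blocks back in place of their keys.
( my $restored = $hashed ) =~ s{([0-9a-f]{32})}{
    exists $blocks{$1} ? $blocks{$1} : $1
}eg;

print $restored eq $text ? "round trip ok\n" : "round trip failed\n";
```

Only the key is derived from encoded octets; the stored value stays a character string, which is why the rest of the text never has to be encoded.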

Sun Nov 04 00:41:23 2007 hanenkamp [...] cpan.org - Correspondence added

From:

HANENKAMP [...] cpan.org

The patch works great. +1

On Thu Jun 07 11:24:04 2007, danh wrote:

> First off, I thank you for providing this module to the community. I
> really appreciate Markdown's syntax and chose it over more featureful
> alternatives like Textile because it makes it easy to write simple HTML
> and then gets out of my way if I want to write more complex markup.
> However, I have a bug to report.
>
> Text::Markdown uses Digest::MD5 to store hashes of HTML blocks in the
> text passed to it, but if you pass Digest::MD5 some text with wide
> characters in it, it dies, because MD5 works on octet strings, not
> character strings.
>
> It seems that any text with block
> elements (such as blockquotes) that contain wide characters triggers
> this.
>
> The workaround recommended in the POD of Digest::MD5 is to encode the
> text to UTF-8 before hashing, because UTF-8 round-trips all UCS text.
> The attached patch inserts a utility function that Encode::encode()s
> text before md5_hex()ing it, and uses it in all the places that handle
> input text. The patched Text::Markdown passes the test and works in the
> longer case that originally alerted me to the error.
>
> There is one design decision involved in the patch: encoding every time
> we do an MD5 sum rather than encoding the text once. While this appears
> inefficient, on many texts it will be much more efficient: only the
> block elements are encoded rather than the whole text, so on input with
> no block elements (on the input) it makes no difference. Of course, on
> input that is 8-bit-only the only overhead is the additional sub call
> per MD5 sum. The other advantage to this approach is that it will make
> it easier to later add markup that has wide characters as its input or
> output, should you ever wish to extend Markdown in this way. Examples
> might be to specially treat the UCS line separator and paragraph
> separator characters, or to make "..." map to the UCS ellipsis
> character.
>
> I assert no copyright on the attached patch and test: you may do with
> them what you like. I hope you will consider uploading an updated
> version of Text::Markdown to CPAN.

Sun Nov 04 00:41:25 2007 The RT System itself - Status changed from 'new' to 'open'

Sat Jan 05 11:28:15 2008 bobtfish [...] bobtfish.net - Cc BOBTFISH added

Sat Jan 05 11:41:01 2008 bobtfish [...] bobtfish.net - Correspondence added

Hi,

I'm the maintainer of Text::MultiMarkdown. I'm currently trying to reach SRI to take over maintenance of Text::Markdown as well, as Text::MultiMarkdown will shortly be at the point where you can 'turn off' the extra MultiMarkdown features (and emulate Text::Markdown). I also have a pretty decent test suite :_)

I'll be putting your test into my next point release: http://svn.kulp.ch/cpan/text_multimarkdown/branches/1.0.6-dev-t0m/Todo

If you don't mind the MultiMarkdown features as well for the moment, I'd be really glad if you could give me some assistance testing when I release 1.0.6.

Thanks in advance.
Tom

Sun Jan 06 08:21:15 2008 bobtfish [...] bobtfish.net - Correspondence added

FYI - Text::MultiMarkdown 1.0.6 just hit CPAN, containing your test / patch and full compatibility with the original Markdown test suite.

Thu Jan 10 13:12:14 2008 bobtfish [...] bobtfish.net - Taken

Thu Jan 10 17:59:47 2008 bobtfish [...] bobtfish.net - Correspondence added

Fixed in 1.0.4 (just uploaded, should be on CPAN shortly).

Thu Jan 10 17:59:51 2008 bobtfish [...] bobtfish.net - Status changed from 'open' to 'resolved'

Thu Jan 10 18:39:51 2008 st [...] istic.org - Correspondence added

Subject:	Re: [rt.cpan.org #27482] wide characters cause croak from Digest::MD5 (test and patch included)
Date:	Thu, 10 Jan 2008 23:38:51 +0000
To:	Tomas Doran via RT <bug-Text-Markdown [...] rt.cpan.org>
From:	Daniel Hulme <st [...] istic.org>

That's great. Thanks for putting the effort in.

Thu Jan 10 18:39:54 2008 The RT System itself - Status changed from 'resolved' to 'open'

Fri Jan 18 05:19:35 2008 bobtfish [...] bobtfish.net - Status changed from 'open' to 'resolved'
