This queue is for tickets about the Text-Markdown CPAN distribution.

Report information
The Basics
Id: 27482
Status: resolved
Priority: 0/
Queue: Text-Markdown

People
Owner: bobtfish [...] bobtfish.net
Requestors: st [...] istic.org
Cc: bobtfish [...] bobtfish.net
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.0.3
Fixed in: (no value)



Subject: wide characters cause croak from Digest::MD5 (test and patch included)
First off, I thank you for providing this module to the community. I really appreciate Markdown's syntax and chose it over more featureful alternatives like Textile because it makes it easy to write simple HTML and then gets out of my way if I want to write more complex markup. However, I have a bug to report.

Text::Markdown uses Digest::MD5 to store hashes of HTML blocks in the text passed to it, but if you pass Digest::MD5 some text with wide characters in it, it dies, because MD5 works on octet strings, not character strings.

It seems that any text with block elements (such as blockquotes) that contain wide characters triggers this.

The workaround recommended in the POD of Digest::MD5 is to encode the text to UTF-8 before hashing, because UTF-8 round-trips all UCS text. The attached patch inserts a utility function that Encode::encode()s text before md5_hex()ing it, and uses it in all the places that handle input text. The patched Text::Markdown passes the test and works in the longer case that originally alerted me to the error.

There is one design decision involved in the patch: encoding every time we do an MD5 sum rather than encoding the text once. While this appears inefficient, on many texts it will be much more efficient: only the block elements are encoded rather than the whole text, so on input with no block elements (on the input) it makes no difference. Of course, on input that is 8-bit-only the only overhead is the additional sub call per MD5 sum. The other advantage to this approach is that it will make it easier to later add markup that has wide characters as its input or output, should you ever wish to extend Markdown in this way. Examples might be to specially treat the UCS line separator and paragraph separator characters, or to make "..." map to the UCS ellipsis character.

I assert no copyright on the attached patch and test: you may do with them what you like. I hope you will consider uploading an updated version of Text::Markdown to CPAN.
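For illustration only, here is a minimal sketch of the failure and the workaround described above. It is not part of the attached patch; the sample string and the variable names are arbitrary assumptions, and it uses only md5_hex() from Digest::MD5 and encode() from Encode.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# A character string containing a code point above 0xFF (a "wide character").
# The sample text is an assumption chosen for this sketch.
my $text = "Fo\x{2014}o";

# Hashing the character string directly croaks, because MD5 is defined
# over octets. The eval traps the exception so the script can continue.
my $direct = eval { md5_hex($text) };
print defined $direct ? "direct: $direct\n" : "direct: died: $@";

# Encoding to UTF-8 first yields an octet string that MD5 can digest.
my $encoded = md5_hex(encode('UTF-8', $text));
print "encoded: $encoded\n";
```

The patch applies the same idea inside a small helper, so that only the chunks that are actually hashed get encoded.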
Subject: unicode.t
use utf8;
use warnings;
use Test::More tests => 2;

use_ok('Text::Markdown', 'markdown');

my $m = Text::Markdown->new;
my $html1;

$html1 = eval { $m->markdown(<<"EOF"); };
> Fo—o

μορεοϋερ

> ßåř
EOF

is( <<"EOF", $html1 );
<blockquote>
  <p>Fo—o</p>
</blockquote>

<p>μορεοϋερ</p>

<blockquote>
  <p>ßåř</p>
</blockquote>
EOF
Subject: markdown.diff
diff --git a/lib/Text/Markdown.pm b/lib/Text/Markdown.pm
index ece5c7a..90d250a 100644
--- a/lib/Text/Markdown.pm
+++ b/lib/Text/Markdown.pm
@@ -12,6 +12,7 @@ use strict;
 use warnings;
 
 use Digest::MD5 qw(md5_hex);
+use Encode;
 use base 'Exporter';
 
 our $VERSION = '1.0.3';
@@ -79,6 +80,17 @@ my %g_html_blocks;
 # (see _ProcessListItems() for details):
 my $g_list_level = 0;
 
+sub md5_utf8 {
+# Internal function used to safely MD5sum chunks of the input, which might be Unicode in Perl's internal representation.
+    my $input = shift;
+    return undef unless defined $input;
+    if (Encode::is_utf8 $input) {
+        return md5_hex(encode('utf8', $input));
+    } else {
+        return md5_hex($input);
+    }
+}
+
 sub Markdown {
 #
 # Main function. The order in which other subs are called here is
@@ -201,7 +213,7 @@ sub _HashHTMLBlocks {
                 (?=\n+|\Z) # followed by a newline or end of document
             )
         }{
-            my $key = md5_hex($1);
+            my $key = md5_utf8($1);
             $g_html_blocks{$key} = $1;
             "\n\n" . $key . "\n\n";
         }egmx;
@@ -221,7 +233,7 @@ sub _HashHTMLBlocks {
                 (?=\n+|\Z) # followed by a newline or end of document
             )
         }{
-            my $key = md5_hex($1);
+            my $key = md5_utf8($1);
             $g_html_blocks{$key} = $1;
             "\n\n" . $key . "\n\n";
         }egmx;
@@ -243,7 +255,7 @@ sub _HashHTMLBlocks {
                 (?=\n{2,}|\Z) # followed by a blank line or end of document
             )
         }{
-            my $key = md5_hex($1);
+            my $key = md5_utf8($1);
             $g_html_blocks{$key} = $1;
             "\n\n" . $key . "\n\n";
         }egx;
@@ -266,7 +278,7 @@ sub _HashHTMLBlocks {
                 (?=\n{2,}|\Z) # followed by a blank line or end of document
             )
         }{
-            my $key = md5_hex($1);
+            my $key = md5_utf8($1);
             $g_html_blocks{$key} = $1;
             "\n\n" . $key . "\n\n";
         }egx;
From: HANENKAMP [...] cpan.org
The patch works great. +1

On Thu Jun 07 11:24:04 2007, danh wrote:
> First off, I thank you for providing this module to the community. I
> really appreciate Markdown's syntax and chose it over more featureful
> alternatives like Textile because it makes it easy to write simple HTML
> and then gets out of my way if I want to write more complex markup.
> However, I have a bug to report.
>
> Text::Markdown uses Digest::MD5 to store hashes of HTML blocks in the
> text passed to it, but if you pass Digest::MD5 some text with wide
> characters in it, it dies, because MD5 works on octet strings, not
> character strings.
>
> It seems that any text with block
> elements (such as blockquotes) that contain wide characters triggers
> this.
>
> The workaround recommended in the POD of Digest::MD5 is to encode the
> text to UTF-8 before hashing, because UTF-8 round-trips all UCS text.
> The attached patch inserts a utility function that Encode::encode()s
> text before md5_hex()ing it, and uses it in all the places that handle
> input text. The patched Text::Markdown passes the test and works in the
> longer case that originally alerted me to the error.
>
> There is one design decision involved in the patch: encoding every time
> we do an MD5 sum rather than encoding the text once. While this appears
> inefficient, on many texts it will be much more efficient: only the
> block elements are encoded rather than the whole text, so on input with
> no block elements (on the input) it makes no difference. Of course, on
> input that is 8-bit-only the only overhead is the additional sub call
> per MD5 sum. The other advantage to this approach is that it will make
> it easier to later add markup that has wide characters as its input or
> output, should you ever wish to extend Markdown in this way. Examples
> might be to specially treat the UCS line separator and paragraph
> separator characters, or to make "..." map to the UCS ellipsis
> character.
>
> I assert no copyright on the attached patch and test: you may do with
> them what you like. I hope you will consider uploading an updated
> version of Text::Markdown to CPAN.
Hi,

I'm the maintainer of Text::MultiMarkdown. I'm currently trying to reach SRI to take over maintenance of Text::Markdown as well, since Text::MultiMarkdown will shortly be at the point where you can 'turn off' the extra MultiMarkdown features (and emulate Text::Markdown). I also have a pretty decent test suite :)

I'll be putting your test into my next point release:
http://svn.kulp.ch/cpan/text_multimarkdown/branches/1.0.6-dev-t0m/Todo

If you don't mind having the MultiMarkdown features as well for the moment, I'd be really glad if you could give me some assistance with testing when I release 1.0.6.

Thanks in advance.
Tom
FYI - Text::MultiMarkdown 1.0.6 just hit CPAN, containing your test / patch and full compatibility with the original Markdown test suite.
Fixed in 1.0.4 (just uploaded, should be on CPAN shortly).
Subject: Re: [rt.cpan.org #27482] wide characters cause croak from Digest::MD5 (test and patch included)
Date: Thu, 10 Jan 2008 23:38:51 +0000
To: Tomas Doran via RT <bug-Text-Markdown [...] rt.cpan.org>
From: Daniel Hulme <st [...] istic.org>
That's great. Thanks for putting the effort in.