Subject: | wide characters cause croak from Digest::MD5 (test and patch included) |
First off, I thank you for providing this module to the community. I
really appreciate Markdown's syntax and chose it over more featureful
alternatives like Textile because it makes it easy to write simple HTML
and then gets out of my way if I want to write more complex markup.
However, I have a bug to report.
Text::Markdown uses Digest::MD5 to key the HTML blocks it finds in its
input by their MD5 hash. If such a block contains wide characters,
md5_hex() croaks, because MD5 operates on octet strings, not character
strings.
It seems that any text with block
elements (such as blockquotes) that contain wide characters triggers
this.
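The underlying failure can be reproduced without Text::Markdown. The sketch below is only an illustration: the sample string and variable names are my own, and the exact wording of the error may vary between Perl and Digest::MD5 versions.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# A character string containing a code point above 0xFF (a "wide" character).
my $wide = "caf\x{e9} \x{3bc}";    # U+03BC GREEK SMALL LETTER MU

# md5_hex() wants octets, so it croaks on this string; eval traps the croak.
my $digest = eval { md5_hex($wide) };
my $error  = $@;

print defined($digest) ? "digest: $digest\n" : "md5_hex died: $error";
```

Any markup that routes such a string through md5_hex() unencoded should fail the same way.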
The workaround recommended in the POD of Digest::MD5 is to encode the
text to UTF-8 before hashing, because UTF-8 round-trips all UCS text.
The attached patch inserts a utility function that Encode::encode()s
text before md5_hex()ing it, and uses it in all the places that handle
input text. The patched Text::Markdown passes the test and works in the
longer case that originally alerted me to the error.
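The encode-before-hash workaround can also be sketched on its own. The helper name md5_hex_text is hypothetical (the attached patch calls its helper md5_utf8 and only encodes strings that carry Perl's UTF-8 flag). This variant always encodes, on the assumption that its input is a character string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: turn a character string into UTF-8 octets, then hash.
sub md5_hex_text {
    my ($text) = @_;
    return unless defined $text;
    return md5_hex(encode_utf8($text));
}

my $ascii = "<div>plain</div>";
my $wide  = "<div>\x{3bc}\x{3bf}</div>";    # contains two Greek letters

# Wide characters no longer croak, and the key is still 32 hex digits.
my $wide_key = md5_hex_text($wide);

# For pure ASCII text, encoding changes nothing, so keys match plain md5_hex().
my $same = md5_hex_text($ascii) eq md5_hex($ascii);

print "$wide_key\n", ($same ? "ASCII keys unchanged\n" : "ASCII keys differ\n");
```

Because ASCII text encodes to identical octets, existing keys for ASCII-only blocks would not change under this approach.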
There is one design decision involved in the patch: it encodes each time
an MD5 sum is taken rather than encoding the whole text once. While this
appears inefficient, on many texts it will be much more efficient: only
the block elements are encoded rather than the whole text, so input with
no block elements is never encoded at all. On input that is 8-bit-only,
the only overhead is the additional sub call per MD5 sum. The other
advantage to this approach is that it will make
it easier to later add markup that has wide characters as its input or
output, should you ever wish to extend Markdown in this way. Examples
might be to specially treat the UCS line separator and paragraph
separator characters, or to make "..." map to the UCS ellipsis
character.
I assert no copyright on the attached patch and test: you may do with
them what you like. I hope you will consider uploading an updated
version of Text::Markdown to CPAN.
Subject: | unicode.t |
use utf8;
use strict;
use warnings;
use Test::More tests => 2;
use_ok('Text::Markdown', 'markdown');
my $m = Text::Markdown->new;
my $html1;
$html1 = eval { $m->markdown(<<"EOF"); };
> Foâo
μοÏεοÏεÏ
> ÃÃ¥Å
EOF
is( $html1, <<"EOF" );
<blockquote>
<p>Foâo</p>
</blockquote>
<p>μοÏεοÏεÏ</p>
<blockquote>
<p>ÃÃ¥Å</p>
</blockquote>
EOF
Subject: | markdown.diff |
diff --git a/lib/Text/Markdown.pm b/lib/Text/Markdown.pm
index ece5c7a..90d250a 100644
--- a/lib/Text/Markdown.pm
+++ b/lib/Text/Markdown.pm
@@ -12,6 +12,7 @@ use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
+use Encode;
use base 'Exporter';
our $VERSION = '1.0.3';
@@ -79,6 +80,17 @@ my %g_html_blocks;
# (see _ProcessListItems() for details):
my $g_list_level = 0;
+sub md5_utf8 {
+    # Internal function used to safely MD5sum chunks of the input, which might be Unicode in Perl's internal representation.
+    my $input = shift;
+    return undef unless defined $input;
+    if (Encode::is_utf8 $input) {
+        return md5_hex(encode('utf8', $input));
+    } else {
+        return md5_hex($input);
+    }
+}
+
sub Markdown {
#
# Main function. The order in which other subs are called here is
@@ -201,7 +213,7 @@ sub _HashHTMLBlocks {
(?=\n+|\Z) # followed by a newline or end of document
)
}{
- my $key = md5_hex($1);
+ my $key = md5_utf8($1);
$g_html_blocks{$key} = $1;
"\n\n" . $key . "\n\n";
}egmx;
@@ -221,7 +233,7 @@ sub _HashHTMLBlocks {
(?=\n+|\Z) # followed by a newline or end of document
)
}{
- my $key = md5_hex($1);
+ my $key = md5_utf8($1);
$g_html_blocks{$key} = $1;
"\n\n" . $key . "\n\n";
}egmx;
@@ -243,7 +255,7 @@ sub _HashHTMLBlocks {
(?=\n{2,}|\Z) # followed by a blank line or end of document
)
}{
- my $key = md5_hex($1);
+ my $key = md5_utf8($1);
$g_html_blocks{$key} = $1;
"\n\n" . $key . "\n\n";
}egx;
@@ -266,7 +278,7 @@ sub _HashHTMLBlocks {
(?=\n{2,}|\Z) # followed by a blank line or end of document
)
}{
- my $key = md5_hex($1);
+ my $key = md5_utf8($1);
$g_html_blocks{$key} = $1;
"\n\n" . $key . "\n\n";
}egx;