This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 93660
Status: rejected
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: alexander.danel [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 5.03
Fixed in: (no value)



Subject: Entity 'ndash' converts to Octal(342,200,223), want 226
The entity 'ndash' converts to three octal bytes: 342, 200, 223. This is also true for the entity '#8209'. When sent through HTML::PrettyPrinter->format() this causes the warning "Wide character in print...", and the result is incorrect.

It seems to me these entities should be converted to the single character \o{226}, which is decimal 150; which is "en dash".

Setting "$root->no_expand_entities(1);" is not helpful; the entity stays unexpanded, then PrettyPrinter does not recognize that it is an entity and converts the leading '&' into "&amp" for every entity in the document. (And, I don't want to turn off conversion, because there might be real ampersands.)

My work-around will be to convert these ndash entities into normal hyphen characters.

I am working in a CygWin environment.
From: alexander.danel [...] gmail.com
OK, well, once again, I may need to withdraw my ticket minutes after creating it.

I seem to be having two issues:

(1) There might be a problem in HTML::Entities, rather than HTML::Tree. I just looked into the "Entities.pm" file, and found this:

    'ndash;' => chr(8211),

This should probably say "8209", not "8211".

(2) There doesn't seem to be an easy way to tell PrettyPrint that ALL entities should be converted. (Any advice?)

Sorry to be causing trouble, but I am trying to be helpful, not trying to be a pain.

Alexander Danel
From: alexander.danel [...] gmail.com
Just checked Wikipedia, it lists char(8211) as "en dash", so that agrees with HTML::Entities.

I can tell you that both entities I mentioned, '&#8209;' and '&ndash;', when processed by HTML::TreeBuilder and then by HTML::PrettyPrinter, get converted to three octal bytes; for '8209' the result is [342 200 221] and for 'ndash' it is [342 200 223].

Somehow, these represent decimal 8209 and 8211, I guess. (I don't get how, but that's beside the point.)

The reason I got interested in &#8209; is because I have a Microsoft Word document that uses that entity for the minus in "N-1" and the hyphen in "x-ray".

Alexander Danel
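For reference, the three-byte sequences above are what UTF-8 encoding produces for those two code points. A minimal sketch using the core Encode module; the helper name utf8_octal is made up for illustration and is not part of any module discussed in this ticket:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a code point, each as octal.
sub utf8_octal {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return map { sprintf('%o', ord($_)) } split //, $bytes;
}

for my $codepoint (8209, 8211) {
    printf "%d (U+%04X): %s\n", $codepoint, $codepoint,
        join(' ', utf8_octal($codepoint));
}
# prints:
# 8209 (U+2011): 342 200 221
# 8211 (U+2013): 342 200 223
```

So the bytes are one character each, written out in a multi-byte encoding, rather than three separate characters.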
This kind of question is much more appropriate for http://stackoverflow.com/

HTML-Tree is a long-established and heavily used module. If its basic functionality isn't working, you should assume you're doing something wrong. Please don't report a bug unless you're sure it's really a problem with the module.

You didn't give a sample program to illustrate the problem you're having, but I'm guessing you haven't set the proper encoding on your output filehandle.
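If the guess about the output filehandle is right, the usual remedy is an encoding layer on the handle. A minimal sketch in core Perl, assuming UTF-8 is the desired output encoding; the helper written_bytes and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: print decoded text through a handle that has a UTF-8
# encoding layer and return the bytes that arrive (in-memory buffer here).
sub written_bytes {
    my ($text) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open: $!";
    print {$fh} $text;
    close($fh) or die "close: $!";
    return $buffer;
}

my $text  = "10\x{2013}12";    # contains one EN DASH character
my $bytes = written_bytes($text);
printf "%d characters in, %d bytes out\n", length($text), length($bytes);

# In a real script the same layer goes on the actual output handle. Without
# it, printing a character above 0xFF warns "Wide character in print".
binmode(STDOUT, ':encoding(UTF-8)');
print "$text\n";
```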
> Somehow, these represent decimal 8209 and 8211, I guess. (I don't get how, but that's besides the point.)
http://en.wikipedia.org/wiki/UTF-8

Here are some other links you should read up on:
http://www.joelonsoftware.com/articles/Unicode.html
http://perldoc.perl.org/perlunicode.html
http://www.simon-cozens.org/content/everything-you-need-know-about-unicode

&#8209; is not an en-dash. It is Unicode Character 'NON-BREAKING HYPHEN' U+2011 (Unicode codepoints are normally written in hexadecimal notation). http://www.fileformat.info/info/unicode/char/2011/index.htm In other words, it's a hyphen that word-wrapping code is not supposed to allow a line-break after.

Here's what I pass to HTML::PrettyPrinter for entity encoding when I want all non-ASCII chars as entities:

    entities => do { "<>&\x7F-\x{10FFFF}" }

The "no warnings 'utf8';" suppresses a warning in some versions of Perl (complaining that U+10FFFF hasn't actually been allocated yet).
That was supposed to be

    entities => do { no warnings 'utf8'; "<>&\x7F-\x{10FFFF}" }
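To see what that character set does on its own, it can be passed straight to encode_entities() from HTML::Entities, which takes the set of characters to encode as its second argument. A small sketch, assuming HTML::Entities is installed; the sample string is made up for illustration:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Same set as above: markup characters plus everything from 0x7F upward.
my $unsafe = do { no warnings 'utf8'; "<>&\x7F-\x{10FFFF}" };

# Made-up sample: a non-breaking hyphen, a real ampersand, and an en dash.
my $text = "x\x{2011}ray & pages 10\x{2013}12";

my $encoded = encode_entities($text, $unsafe);
print "$encoded\n";    # output is plain ASCII, so no encoding layer is needed
```

The real ampersand is escaped along with the non-ASCII characters, so nothing above 0x7E is left in the output to trigger the "Wide character" warning.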
From: alexander.danel [...] gmail.com
So, I have a work-around, and my problem was with "PrettyPrinter.pm"; sorry for the drama.

My workaround is:

    $prettyPrinter->entities($prettyPrinter->entities() . "\o{200}-\o{377}" . "\o{400}-\o{177777}" );

I looked into the "PrettyPrinter.pm" file and found that it calls "as_HTML()" and "encode_entities()". In contrast to "as_HTML()", which allows "undef" to stand for "all unsafe values", PrettyPrinter does not allow "undef" as a value; probably because of the attempt to be clever in creating the "setter" routines.

I read the "Entities.pm" code, and decided to try a range that goes beyond "\377" and was surprised to find it worked; I didn't know Perl does that.

    $ perl -e 'print "\o{400}-\o{177777}\n";' | od -c
    Wide character in print at -e line 1.
    0000000 304 200 - 357 277 277 \n

Again, sorry for the "sky is falling" messages.

Alexander Danel
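The od output above can be checked against the UTF-8 encoding of the two range endpoints, and the same sketch shows where the workaround's range stops. This uses core Perl only; the helper utf8_octal and the test code point U+1F600 are chosen for illustration and do not come from the ticket:

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8 bytes of a one-character string, each in octal.
sub utf8_octal {
    my ($char) = @_;
    utf8::encode($char);    # converts the local copy to UTF-8 bytes in place
    return join ' ', map { sprintf('%o', ord($_)) } split //, $char;
}

for my $endpoint ("\o{400}", "\o{177777}") {
    printf "U+%04X -> %s\n", ord($endpoint), utf8_octal($endpoint);
}
# prints:
# U+0100 -> 304 200
# U+FFFF -> 357 277 277

# The ranges appended by the workaround, used as a character class.
my $workaround = "\o{200}-\o{377}" . "\o{400}-\o{177777}";
my $class      = qr/[$workaround]/;

for my $codepoint (0x2011, 0x2013, 0x1F600) {
    printf "U+%04X %s\n", $codepoint,
        chr($codepoint) =~ $class ? 'in range' : 'not in range';
}
# prints:
# U+2011 in range
# U+2013 in range
# U+1F600 not in range
```

Both entities from this ticket fall inside the range, but characters above U+FFFF do not; an upper bound of \x{10FFFF} would cover them as well.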
From: alexander.danel [...] gmail.com
Christopher,

Great response; thanks much. Will read up, and will use "stackoverflow.com" in the future.

Alexander