Bug #112631 for Pod-Markdown: Bug: E<lt> results in '&lt;' instead of '<'

Wed Mar 02 08:59:34 2016 boesen [...] belwue.de - Ticket created

Subject:	Bug: E<lt> results in '&lt;' instead of '<'
Date:	Wed, 2 Mar 2016 14:59:11 +0100
To:	bug-Pod-Markdown [...] rt.cpan.org
From:	Andreas Boesen <boesen [...] belwue.de>

Hi,

there is a bug in Pod::Markdown v3.003 that results in 'E<lt>' being substituted with '&lt;' instead of '<'.

One can trigger the bug by creating a pod file (see snip below) and using the command-line interface `pod2markdown file.pod`.

<snip file.pod>
=head1 AUTHOR

Foo Bar E<lt>foo@bar.comE<gt>

=cut
</snip>

The output is:

<snip `pod2markdown file.pm > output.md`>
# AUTHOR

Foo Bar &lt;foo@bar.com>
</snip>

I would expect that 'E<lt>' is replaced by '<' in the .md file because I'm not aware that HTML entities belong in a markdown file because markdown is obviously not html. :-)

Initially I created a bug report[1] in Minilla[2] and got pointed to the Changelog[3] of Pod::Markdown v3.000.

I honestly do not fully understand the issue on stackoverflow.com[4] because if you for example want to display a markdown file on a website (like github/gitlab/...) the software running the website (i.e. gitlab) should do the job of parsing the markdown file correctly and substitute '<' by '&lt;'. Also just replacing '<' with '&lt;' seems a bit inconsistent to me because the '>' is not replaced. But to me it seems that putting HTML entities into markdown files is not something desirable.

[1] <https://github.com/tokuhirom/Minilla/issues/186>
[2] <http://search.cpan.org/~tokuhirom/Minilla-v3.0.1/lib/Minilla.pm>
[3] <https://metacpan.org/source/RWSTAUNER/Pod-Markdown-3.000/Changes>
[4] <https://stackoverflow.com/questions/28496298/escape-angle-brackets-using-podmarkdown>

Also if I use pandoc (<http://pandoc.org>) on output.md (attached file) it does NOT recognise that it is an email address:

<snip `pandoc -f markdown -t html output.md`>
<h1 id="author">AUTHOR</h1>
<p>Foo Bar &lt;foo@bar.com&gt;</p>
</snip>

If you replace the '&lt;' with '<' in the attached output.md, pandoc recognises the email address and sets an anchor (`<a href="mailto:...`).

Distribution name and version: Pod-Markdown-3.003.tar.gz
Perl Version: This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-thread-multi
Operating System vendor and version: Linux $Hostname 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux

Best regards,
Andreas

--
Andreas Boesen, BelWü-Koordination, Universität Stuttgart
Industriestr. 28, 70565 Stuttgart
Tel. 0711/685-65750 - Fax 0711/6788363
boesen@belwue.de - http://www.belwue.de
~cd in and find out


Wed Mar 02 10:08:36 2016 boesen [...] belwue.de - Correspondence added

Subject:	Re: [rt.cpan.org #112631] Bug: E<lt> results in '&lt;' instead of '<'
Date:	Wed, 2 Mar 2016 16:08:10 +0100
To:	bug-Pod-Markdown [...] rt.cpan.org
From:	Andreas Boesen <boesen [...] belwue.de>

Hi,

I've had a look at commit #1d2ad592 (starting line 414 in lib/Pod/Markdown.pm).

<https://github.com/rwstauner/Pod-Markdown/commit/1d2ad592ef989b9229cfaf6a68af56076aa172ea>

Quote:

> +# In order to only encode the occurrences that require it (something that
> +# could be interpreted as an entity) we escape them all so that we can do the
> +# suffix test later after the string is complete (since we don't know what
> +# strings might come after this one).

As far as I understand that means that if $something in 'E<lt>$somethingE<gt>' does NOT equal an HTML tag it should "re-replace" '&lt;' back to '<' after initially replacing E<lt> with '&lt;'. (Is my assumption correct?)

But unfortunately the "re-replace" seems not to be working. :-/

Best regards,
Andreas

--
Andreas Boesen, BelWü-Koordination, Universität Stuttgart
Industriestr. 28, 70565 Stuttgart
Tel. 0711/685-65750 - Fax 0711/6788363
boesen@belwue.de - http://www.belwue.de
~cd in and find out


Sat Mar 05 21:44:17 2016 RWSTAUNER [...] cpan.org - Correspondence added

Thanks, I appreciate your thorough bug report.

> I would expect that 'E<lt>' is replaced by '<' in the .md file because I'm not aware that HTML entities belong in a markdown file because markdown is obviously not html. :-)

Markdown is very specifically a format for producing HTML. That is why it accepts HTML inline with its own syntax.

> I honestly do not fully understand the issue on stackoverflow.com[4] because if you for example want to display a markdown file on a website (like github/gitlab/...) the software running the website (i.e. gitlab) should do the job of parsing the markdown file correctly and substitute '<' by '&lt;'. Also just replacing '<' with '&lt;' seems a bit inconsistent to me because the '>' is not replaced. But to me it seems that putting HTML entities into markdown files is not something desirable.

The issue (represented on stackoverflow) is that Markdown will treat things that look like HTML as inline HTML instead of as Markdown. As it says on http://daringfireball.net/projects/markdown/syntax :

> In HTML, there are two characters that demand special treatment: < and &. Left angle brackets are used to start tags; ampersands are used to denote HTML entities. If you want to use them as literal characters, you must escape them as entities, e.g. &lt;, and &amp;.

However, many people (including myself) agree with what you have expressed: This is markdown, there should be as few HTML entities as possible.

The job of this module, however, is to faithfully represent the Pod that it takes in as Markdown so that when the Markdown is processed the HTML will be accurate. So, if there is a bare left angle bracket (<) in the pod, it should show up in the HTML. The issue is that if it happens to be followed by text that could be confused with an HTML tag, leaving it verbatim in the Markdown will make Markdown treat it as HTML, and it will disappear from the page.

See this example: https://gist.github.com/rwstauner/543ed03ba3a830870db8
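To illustrate the point about tag-like text disappearing, here is a minimal Python sketch (not the module's Perl code, and not a Markdown processor). It assumes that Markdown passes inline HTML through untouched, so the browser's HTML parser is what finally decides what is visible. The helper name `visible_text` and the sample strings are made up for illustration; only the standard library's `html.parser.HTMLParser` is used.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only character data, roughly what a reader would see."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; back into
        # characters before they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html_fragment):
    """Hypothetical helper: return the text an HTML parser treats as content."""
    collector = _TextCollector()
    collector.feed(html_fragment)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.chunks)


# A bare '<' followed by something tag-like is consumed as markup...
print(repr(visible_text("use <strong> for emphasis")))
# ...while the escaped form survives as literal text.
print(repr(visible_text("use &lt;strong> for emphasis")))
```

The first call loses the word "strong" entirely, which is the failure the escaping is meant to prevent; the second keeps it.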

> As far as I understand that means that if $something in 'E<lt>$somethingE<gt>' does NOT equal an HTML tag it should "re-replace" '&lt;' back to '<' after initially replacing E<lt> with '&lt;'. (Is my assumption correct?) But unfortunately the "re-replace" seems not to be working. :-/

The escape-and-later-unescape (or re-replace) dance is done so that we only escape the things that need to be. I took a regular expression directly from the original Markdown processor (http://daringfireball.net/projects/markdown/) and inverted it so that Pod::Markdown will only HTML escape the things that Markdown would specifically treat as HTML (in an attempt to produce as little HTML as possible in order to keep the output as clean as possible).

What I had overlooked, however, is that Markdown treats things that look like email addresses (<foo@bar.com>) specially (and turns them into links). I have added this functionality of preserving email addresses so that Markdown will still see them:

https://github.com/rwstauner/Pod-Markdown/commit/db076c33891ae193782b6bff45a572755b71a9a6

The new release should be on CPAN shortly.
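The selective escaping described above can be sketched roughly as follows. This is a simplified Python illustration, not the Perl code from either commit: the two patterns (`EMAIL_AUTOLINK`, `TAG_LIKE`) and the function name `escape_for_markdown` are assumptions chosen for the example, and ampersand handling is left out entirely.

```python
import re

# Assumed, simplified patterns for illustration only. The real module derives
# its test from a regular expression in the original Markdown processor.
EMAIL_AUTOLINK = re.compile(r"<[^\s<>@]+@[^\s<>@]+\.[^\s<>@]+>")
TAG_LIKE = re.compile(r"<(?=[A-Za-z/?$!])")


def escape_for_markdown(text):
    """Escape only those '<' that Markdown would likely read as inline HTML."""
    out = []
    pos = 0
    while pos < len(text):
        email = EMAIL_AUTOLINK.match(text, pos)
        if email:
            # Leave <user@host.tld> alone so Markdown can turn it into a link.
            out.append(email.group(0))
            pos = email.end()
        elif TAG_LIKE.match(text, pos):
            # Looks like the start of a tag: escape it so it stays visible.
            out.append("&lt;")
            pos += 1
        else:
            # Anything else, including a lone '<', is copied verbatim.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(escape_for_markdown("Foo Bar <foo@bar.com>"))
print(escape_for_markdown("use <strong> for emphasis"))
print(escape_for_markdown("a < b"))
```

Under these assumptions the email address and the lone `<` pass through unchanged, and only the tag-like `<strong>` gets an entity, which keeps the amount of HTML in the Markdown output small.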

Sat Mar 05 21:44:17 2016 The RT System itself - Status changed from 'new' to 'open'

Sat Mar 05 21:44:45 2016 RWSTAUNER [...] cpan.org - Status changed from 'open' to 'resolved'
