
This queue is for tickets about the Perl6-Perldoc CPAN distribution.

Report information
The Basics
Id: 26383
Status: open
Priority: 0/
Queue: Perl6-Perldoc

People
Owner: Nobody in particular
Requestors: perl [...] perceptualsolutions.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: No warning for invalid bytes in data assumed to be UTF8
According to the Perldoc spec, files that lack an =encoding declaration are assumed to be in UTF-8:

"By default, Perldoc assumes that documents are Unicode, encoded in one of the three common schemes (UTF-8, UTF-16, or UTF-32). The particular scheme a document uses is autodiscovered by examination of the first few bytes of the file (where possible). If the autodiscovery fails, UTF-8 is assumed,"

However, if bytes that are not valid UTF-8 are present in the source file, they are copied into the output XHTML without any warning. The XHTML then fails validation because it contains invalid data.

The spec itself adds: "and parsers may treat any non-UTF-8 bytes later in the document as fatal errors." This feature is therefore not required, but highly desirable.

Minimal test document attached.
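For illustration only, the requested behaviour can be sketched in Python rather than in the Perl of the distribution itself. This is not the Perl6::Perldoc parser: the function names `detect_encoding` and `decode_or_report` are hypothetical, and treating "the first few bytes" as a byte-order mark check is an assumption about how the autodiscovery might work.

```python
import codecs

# BOM signatures, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
# Assumption: autodiscovery is done by looking for a byte-order mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data: bytes):
    """Return (encoding, bom_length); fall back to UTF-8 when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_or_report(data: bytes):
    """Return (text, problem). Exactly one of the two is None."""
    encoding, skip = detect_encoding(data)
    try:
        # Strict decoding raises on the first byte sequence that is invalid.
        return data[skip:].decode(encoding), None
    except UnicodeDecodeError as err:
        offset = skip + err.start  # position in the original byte string
        return None, f"invalid {encoding} byte 0x{data[offset]:02x} at offset {offset}"
```

A caller could print the returned message as a warning and continue, or treat it as a fatal error, which the spec permits.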
Subject: pod6_test.pod6
application/octet-stream 188b


Subject: Re: [rt.cpan.org #26383] No warning for invalid bytes in data assumed to be UTF8
Date: Fri, 20 Apr 2007 11:21:35 +1000
To: bug-Perl6-Perldoc [...] rt.cpan.org
From: Damian Conway <damian [...] conway.org>
Nick Johnston via RT wrote:
> According to the Perldoc spec, files that lack an =encoding declaration
> are assumed to be in UTF8

The parser doesn't currently support this assumption. I've now documented the fact as an unresolved bug.

Of course, patches are always welcome! :-)

Damian
Subject: Re: [rt.cpan.org #26383] No warning for invalid bytes in data assumed to be UTF8
Date: Wed, 25 Apr 2007 21:05:41 +0100
To: bug-Perl6-Perldoc [...] rt.cpan.org
From: Nick Johnston <nick [...] perceptualsolutions.com>
damian@conway.org via RT wrote:
> <URL: http://rt.cpan.org/Ticket/Display.html?id=26383 >
>
> The parser doesn't currently support this assumption. I've now documented the
> fact as an unresolved bug.
>
> Of course, patches are always welcome! :-)
I've been thinking about how to do this, and can only come up with one sane solution: specify the encoding for each item in the PDOM structure, but convert all text nodes to UTF8. It would then be up to the formatter to encode the output as the original encoding if desired.

Does this sound sensible?

Thanks,
Nick
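A rough sketch of that proposal, written in Python purely for illustration. `TextNode`, `make_node` and `render` are hypothetical names, not part of Perl6::Perldoc or its PDOM, and Python's `str` stands in for the normalized text: each node remembers its source encoding, holds decoded text, and the formatter chooses the output encoding.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    """Hypothetical stand-in for a PDOM text node."""
    text: str             # decoded, normalized text
    source_encoding: str  # encoding the original bytes were in


def make_node(raw: bytes, encoding: str) -> TextNode:
    # Decode once at parse time; strict decoding rejects invalid bytes.
    return TextNode(raw.decode(encoding), encoding)


def render(node: TextNode, original: bool = False) -> bytes:
    # The formatter decides: UTF-8 by default, or the node's own encoding.
    target = node.source_encoding if original else "utf-8"
    return node.text.encode(target)


node = make_node(b"Caf\xe9", "latin-1")
utf8_bytes = render(node)                     # re-encoded as UTF-8
original_bytes = render(node, original=True)  # round-trips to Latin-1
```

Keeping the encoding on the node rather than on the whole document would let a single file mix encodings across sections, at the cost of one extra field per node.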