Bug #26383 for Perl6-Perldoc: No warning for invalid bytes in data assumed to be UTF8

Sun Apr 15 15:32:16 2007 perl [...] perceptualsolutions.com - Ticket created

Subject:

No warning for invalid bytes in data assumed to be UTF8

According to the Perldoc spec, files that lack an =encoding declaration are assumed to be in UTF-8:

"By default, Perldoc assumes that documents are Unicode, encoded in one of the three common schemes (UTF-8, UTF-16, or UTF-32). The particular scheme a document uses is autodiscovered by examination of the first few bytes of the file (where possible). If the autodiscovery fails, UTF-8 is assumed,"

However, if the source file contains bytes that are not valid UTF-8, they are copied into the output XHTML without any warning. The XHTML then contains invalid data and does not validate.

The spec itself adds: "and parsers may treat any non-UTF-8 bytes later in the document as fatal errors." This feature is therefore not required, but it is highly desirable.

Minimal test document attached.
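To illustrate the check being requested, here is a minimal sketch in Python (not the Perl6::Perldoc implementation). It assumes that "examination of the first few bytes" means looking for a byte-order mark, which the spec excerpt does not state, and the function names `sniff_encoding` and `check_document` are hypothetical. It falls back to UTF-8 when no BOM is found and reports the offset of the first invalid byte sequence instead of passing it through silently.

```python
import codecs

# Assumption: autodiscovery is done via byte-order marks.
# UTF-32 BOMs are tested first because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data):
    """Return (encoding, bom_length); default to UTF-8 if no BOM matches."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def check_document(data):
    """Return (encoding, text, warning).

    text is None and warning describes the problem when the bytes
    are not valid in the discovered (or assumed) encoding.
    """
    encoding, skip = sniff_encoding(data)
    try:
        # bytes.decode is strict by default, so invalid input raises.
        text = data[skip:].decode(encoding)
    except UnicodeDecodeError as err:
        offset = skip + err.start
        bad = data[offset:skip + err.end]
        warning = "invalid %s byte sequence at offset %d: %r" % (encoding, offset, bad)
        return encoding, None, warning
    return encoding, text, None
```

For a document with no BOM that contains a Latin-1 byte such as 0xE9, `check_document` returns a warning rather than text. A parser could print that warning or treat it as a fatal error, as the spec permits.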

Subject:

pod6_test.pod6

application/octet-stream 188b


Thu Apr 19 21:22:05 2007 damian [...] conway.org - Correspondence added

Subject:	Re: [rt.cpan.org #26383] No warning for invalid bytes in data assumed to be UTF8
Date:	Fri, 20 Apr 2007 11:21:35 +1000
To:	bug-Perl6-Perldoc [...] rt.cpan.org
From:	Damian Conway <damian [...] conway.org>

Nick Johnston via RT wrote:

> According to the Perldoc spec, files that lack an =encoding declaration
> are assumed to be in UTF8

The parser doesn't currently support this assumption. I've now documented the fact as an unresolved bug.

Of course, patches are always welcome! :-)

Damian

Thu Apr 19 21:22:35 2007 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 25 15:58:08 2007 nick [...] perceptualsolutions.com - Correspondence added

Subject:	Re: [rt.cpan.org #26383] No warning for invalid bytes in data assumed to be UTF8
Date:	Wed, 25 Apr 2007 21:05:41 +0100
To:	bug-Perl6-Perldoc [...] rt.cpan.org
From:	Nick Johnston <nick [...] perceptualsolutions.com>

damian@conway.org via RT wrote:

><URL: http://rt.cpan.org/Ticket/Display.html?id=26383 >
>
>The parser doesn't currently support this assumption. I've now documented the
>fact as an unresolved bug.
>
>Of course, patches are always welcome! :-)
>

I've been thinking about how to do this, and can only come up with one sane solution: record the source encoding for each item in the PDOM structure, but convert all text nodes to UTF-8. It would then be up to the formatter to encode the output in the original encoding if desired.

Does this sound sensible?

Thanks,
Nick
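The proposal above could look roughly like the following Python sketch. This is not PDOM code: `TextNode`, `make_text_node`, and `render` are hypothetical names, and a Python `str` stands in for "text converted to UTF-8". Each node remembers the encoding it arrived in, holds its text in one normalised form, and the formatter decides whether to emit UTF-8 or the original encoding.

```python
from dataclasses import dataclass


@dataclass
class TextNode:
    # Hypothetical stand-in for a PDOM text node.
    text: str             # normalised (decoded) content
    source_encoding: str  # encoding the raw bytes originally used


def make_text_node(raw, source_encoding):
    """Decode raw bytes once, at parse time, and remember their encoding."""
    return TextNode(text=raw.decode(source_encoding),
                    source_encoding=source_encoding)


def render(node, keep_original_encoding=False):
    """Formatter side: emit UTF-8 by default, or the original encoding on request."""
    target = node.source_encoding if keep_original_encoding else "utf-8"
    return node.text.encode(target)
```

With this shape the parser only ever handles one internal representation, and round-tripping to the source encoding stays an opt-in choice made by the formatter.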