Bug #89355 for podlators: documentation: "generating ascii text" is out of date

Tue Oct 08 19:04:44 2013 ether [...] cpan.org - Ticket created

Subject: No way to fetch the document encoding

I see, from a Data::Dumper of a Pod::Text object, that there is an 'encoding' field, but there is no method accessor to get at this information. Also, the documentation describes Pod::Text as generating "ascii text", but if encodings are respected, it's not ascii, is it? Is the intent to provide encoding support?

Tue Oct 08 19:09:51 2013 rra [...] stanford.edu - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Tue, 08 Oct 2013 16:09:38 -0700
To:	bug-podlators [...] rt.cpan.org
From:	Russ Allbery <rra [...] stanford.edu>

"Karen Etheridge via RT" <bug-podlators@rt.cpan.org> writes:

> Tue Oct 08 19:04:44 2013: Request 89355 was acted upon.
> Transaction: Ticket created by ETHER
> Queue: podlators
> Subject: No way to fetch the document encoding
> Broken in: (no value)
> Severity: Normal
> Owner: Nobody
> Requestors: ether@cpan.org
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=89355 >

> I see, from a Data::Dumper of a Pod::Text object, that there is an
> 'encoding' field, but there is no method accessor to get at this
> information.

There is. It's even called encoding(). :) But it's provided by Pod::Simple and therefore not separately documented. As the Pod::Text man page says:

    As a derived class from Pod::Simple, Pod::Text supports the same
    methods and interfaces. See Pod::Simple for all the details

That said, Pod::Text, like most POD formatters, generally acts in streaming mode, so there isn't a great point in the process to ask it what the encoding is prior to outputting the results.

> Also, the documentation describes Pod::Text as generating "ascii text",
> but if encodings are respected, it's not ascii, is it? Is the intent to
> provide encoding support?

This is just a documentation bug. I'll correct it. Pod::Text has supported various encodings for some time; see the documentation of the utf8 constructor argument for the details of how encoding works.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>

Tue Oct 08 19:09:51 2013 The RT System itself - Status changed from 'new' to 'open'

Tue Oct 08 19:25:27 2013 ether [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Tue, 8 Oct 2013 16:25:10 -0700
To:	"rra [...] stanford.edu via RT" <bug-podlators [...] rt.cpan.org>
From:	Karen Etheridge <ether [...] cpan.org>

On Tue, Oct 08, 2013 at 07:09:51PM -0400, rra@stanford.edu via RT wrote:

> There is. It's even called encoding(). :) But it's provided by
> Pod::Simple and therefore not separately documented. As the Pod::Text man
> page says:

Aha, I missed that!

> That said, Pod::Text, like most POD formatters, generally acts in
> streaming mode, so there isn't a great point in the process to ask it what
> the encoding is prior to outputting the results.

I suppose I'm confused about the API then... If I'm getting the content in string form, rather than as a file, I need to know what layers I should put on the filehandle when I write it. I will read the Pod::Simple docs!

> This is just a documentation bug. I'll correct it. Pod::Text has
> supported various encodings for some time; see the documentation of the
> utf8 constructor argument for the details of how encoding works.

kk. if all the encoding logic is in Pod::Simple, I can see how the discrepancy in the docs would sneak in.

Tue Oct 08 19:30:22 2013 rra [...] stanford.edu - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Tue, 08 Oct 2013 16:30:08 -0700
To:	bug-podlators [...] rt.cpan.org
From:	Russ Allbery <rra [...] stanford.edu>

"Karen Etheridge via RT" <bug-podlators@rt.cpan.org> writes:

>> That said, Pod::Text, like most POD formatters, generally acts in
>> streaming mode, so there isn't a great point in the process to ask it
>> what the encoding is prior to outputting the results.

> I suppose I'm confused about the API then... If I'm getting the content
> in string form, rather than as a file, I need to know what layers I
> should put on the filehandle when I write it. I will read the
> Pod::Simple docs!

Ah! The general rule of thumb is that you shouldn't have to worry about it provided that you write an =encoding directive at the start of the stream. If you do that, the default is for Pod::Text to output the same encoding that it got in, and the output will already be encoded before it's written to the file handle.

At least, that's how it's all supposed to work, but I've gotten this wrong several times, so it wouldn't surprise me a great deal if there are still bugs lurking.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>

Tue Oct 08 19:37:51 2013 ether [...] cpan.org - Correspondence added

On 2013-10-08 16:30:22, rra@stanford.edu wrote:

> Ah! The general rule of thumb is that you shouldn't have to worry about
> it provided that you write an =encoding directive at the start of the
> stream. If you do that, the default is for Pod::Text to output the same
> encoding that it got in, and the output will already be encoded before
> it's written to the file handle.

Is it possible to get the text back, decoded, as a string, so I can do further processing on it before writing it out to a file myself? In that case, I'd need to know what the encoding is so I can write it properly. (One use case here is Dist::Zilla file generation - files are 'created' using an object that stores the content, with the content made available to other plugins for more massaging, and then written out to disk at the very end all at once. And I'm working on a bug in a plugin that is currently doing this wrong - i.e. it's encoding-blind.)

> At least, that's how it's all supposed to work, but I've gotten this wrong
> several times, so it wouldn't surprise me a great deal if there are still
> bugs lurking.

Heh, I think unicode has that effect on everyone :D

Tue Oct 08 19:39:18 2013 ether [...] cpan.org - Subject changed from 'No way to fetch the document encoding' to 'documentation: "generating ascii text" is out of date'

Tue Oct 08 20:35:38 2013 rra [...] stanford.edu - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Tue, 08 Oct 2013 17:35:20 -0700
To:	bug-podlators [...] rt.cpan.org
From:	Russ Allbery <rra [...] stanford.edu>

"Karen Etheridge via RT" <bug-podlators@rt.cpan.org> writes:

> Queue: podlators
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=89355 >

> On 2013-10-08 16:30:22, rra@stanford.edu wrote:

>> Ah! The general rule of thumb is that you shouldn't have to worry
>> about it provided that you write an =encoding directive at the start of
>> the stream. If you do that, the default is for Pod::Text to output the
>> same encoding that it got in, and the output will already be encoded
>> before it's written to the file handle.

> Is it possible to get the text back, decoded, as a string, so I can do
> further processing on it before writing it out to a file myself? In that
> case, I'd need to know what the encoding is so I can write it properly.
> (One usecase here is Dist::Zilla file generation - files are 'created'
> using an object that stores the content, with the content made available
> to other plugins for more massaging, and then written out to disk at the
> very end all at once. And I'm working on a bug in a plugin that is
> currently doing this wrong - i.e. it's encoding-blind.)

$parser->output_string(\$output) will put all of the output into $output. I believe that Pod::Simple treats this as equivalent to a file handle and does output encoding before adding it to that string, but I'm not positive about that (I haven't tested). If so, it will be encoded in whatever encoding is declared by the =encoding string.

That said, I would recommend standardizing on UTF-8, because that will let you pass utf8 => 1 to Pod::Text's constructor, which will override all that and force the encoding to be in UTF-8. Then you can just decode the $output stream with decode('UTF-8', $output) and you should get back Perl internal strings.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>
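For readers following along, the approach suggested in this reply might look roughly like the sketch below. It assumes, as the reply itself does without having tested it, that output_string() captures already-encoded octets when utf8 => 1 is set; that behaviour should be checked against the installed podlators release. The POD text and the substitution at the end are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Pod::Text;

# POD source as raw UTF-8 octets ("\xc3\xa9" is the encoded form of e-acute).
my $pod = "=encoding UTF-8\n\n=head1 NAME\n\ncaf\xc3\xa9 - example document\n";

# utf8 => 1 asks Pod::Text to emit UTF-8 whatever the input encoding was.
my $parser = Pod::Text->new(utf8 => 1);

# Collect the formatted output in a scalar instead of a file handle.
my $output;
$parser->output_string(\$output);
$parser->parse_string_document($pod);

# Assumption: $output holds UTF-8 octets at this point, so decode it
# into a Perl character string before doing any further processing.
my $text = decode('UTF-8', $output);

# Character-level processing is now safe; re-encode when writing to disk.
$text =~ s/example/sample/;
```

Forcing UTF-8 sidesteps the question of which encoding the source declared, at the cost of not preserving a non-UTF-8 source encoding in the output.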

Wed Oct 09 14:11:47 2013 ether [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Wed, 9 Oct 2013 11:11:24 -0700
To:	"rra [...] stanford.edu via RT" <bug-podlators [...] rt.cpan.org>
From:	Karen Etheridge <ether [...] cpan.org>

On Tue, Oct 08, 2013 at 08:35:38PM -0400, rra@stanford.edu via RT wrote:

> $parser->output_string(\$output) will put all of the output into $output.
> I believe that Pod::Simple treats this as equivalent to a file handle and
> does output encoding before adding it to that string, but I'm not positive
> about that (I haven't tested). If so, it will be encoded in whatever
> encoding is declared by the =encoding string.

Right - what I'm saying is that sometimes we don't *want* the content encoded yet, because we'll be doing some more processing of that text before we write it out to a file ourselves -- so we'll need to know what the encoding is, so we can apply the right layer to the $fh when we write it. For example, Pod::Weaver... working on already-encoded octets isn't optimal here.

> That said, I would recommend standardizing on UTF-8, because that will let
> you pass utf8 => 1 to Pod::Text's constructor, which will override all
> that and force the encoding to be in UTF-8. Then you can just decode the
> $output stream with decode('UTF-8', $output) and you should get back Perl
> internal strings.

This might be enough, but users upstream might be using encodings other than UTF-8 (and I believe Dist::Zilla intends to support that). However, as long as we know the actual encoding of the string (actually a bytestring) we get back, we can decode it properly, do our munging, and then re-encode without generating mojibake.
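The decode, munge, re-encode round trip described above can be sketched with the core Encode module alone. The helper name munge_octets, the sample ISO-8859-1 input, and the substitution are hypothetical and only illustrate the idea; the encoding name has to come from somewhere reliable, which is the open question in this ticket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode octets using a known encoding, apply a
# character-level edit, and encode the result back to the same encoding.
sub munge_octets {
    my ($octets, $encoding, $munger) = @_;

    # FB_CROAK dies on malformed or unmappable data instead of silently
    # substituting characters; LEAVE_SRC keeps the input untouched.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my $text = decode($encoding, $octets, $check);
    $text = $munger->($text);
    return encode($encoding, $text, $check);
}

# "\xef" is i-diaeresis in ISO-8859-1, so this is not valid UTF-8 input.
my $latin1 = "na\xefve text\n";

my $result = munge_octets(
    $latin1,
    'ISO-8859-1',
    sub { my $t = shift; $t =~ s/text/document/; return $t },
);
```

Because the same encoding is used in both directions, the non-ASCII octet survives the round trip unchanged while the edit is applied to characters rather than bytes.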

Thu Oct 10 19:25:16 2013 rra [...] stanford.edu - Correspondence added

Subject:	Re: [rt.cpan.org #89355] No way to fetch the document encoding
Date:	Thu, 10 Oct 2013 16:25:01 -0700
To:	bug-podlators [...] rt.cpan.org
From:	Russ Allbery <rra [...] stanford.edu>

"Karen Etheridge via RT" <bug-podlators@rt.cpan.org> writes:

> Right - what I'm saying is that sometimes we don't *want* the content
> encoded yet, because we'll be doing some more processing of that text
> before we write it out to a file ourselves -- so we'll need to know what
> the encoding is, so we can apply the right layer to the $fh when we
> write it. for example, Pod::Weaver... working on already-encoded octets
> isn't optimal here.

I'm not sure there's any way to get Pod::Simple to not encode, so I'm not sure whether this is something you can deal with via layers. But I haven't investigated this myself.

>> That said, I would recommend standardizing on UTF-8, because that will
>> let you pass utf8 => 1 to Pod::Text's constructor, which will override
>> all that and force the encoding to be in UTF-8. Then you can just
>> decode the $output stream with decode('UTF-8', $output) and you should
>> get back Perl internal strings.

> This might be enough, but users upstream might be using other encodings
> other than UTF-8 (and I believe Dist::Zilla intends to support that).

> However, as long as we know the actual encoding of the string (actually
> a bytestring) we get back, we can decode it properly, do our munging,
> and the re-encode without generating mojibake.

Yeah, that was the strategy that I was thinking of. I don't know if you always know the encoding of the string, but I think you're going to want to know that regardless for other reasons. Dealing safely with strings in unknown encodings is rather difficult.

--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>

Wed Dec 02 00:49:07 2015 RRA [...] cpan.org - Fixed in 4.00 added

Wed Dec 02 00:49:07 2015 RRA [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Dec 02 00:49:14 2015 RRA [...] cpan.org - Taken