Skip Menu |

This queue is for tickets about the POE-Component-Client-HTTP CPAN distribution.

Report information
The Basics
Id: 35538
Status: resolved
Priority: 0/
Queue: POE-Component-Client-HTTP

People
Owner: Nobody in particular
Requestors: YKAR [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: (no value)



Subject: Content-Encoding is not handled properly
Hello In method POE::Component::Client::HTTP::return_response if header Content-Encoding is seen attempt to decode content is performed. I don't think that it is proper to do so (I think receiver of HTTP::Response object should use method 'decoded_content'). I think PoCoCl::HTTP should niether set Accept-Encoding nor decode contend based on Content-Encoding. It is all higher level things. On my opinion PoCoCl::HTTP must only decode Transfer-Encoding (and set TE: which transfer encodings supported). Anyway, if you think that current design is correct, I suggest to add some fixes to current implementation. ----------------[cut]-------------------- if ($response->header('content-encoding')) { my $content; # LWP likes to die on error. eval { $content = $response->decoded_content }; if ($content) { $response->content($content); } } ----------------[cut]--------------------- 1. decoded_content not only decode content based on Content-Encoding, but also does charset decoding. To disable charset decoding need to use $response->decoded_content(charset => 'none'); 2. if content was reset with decoded content, I think it is right to delete Content-Encoding header. Leaving this header confuse user of HTTP::Response (I'm using decode_content, and it confusing because Content-Encoding is set but content already decoded). 3. some performance improvement decoded_content can return reference if called with ref => 1 parameter, it can save some cpu cycles if Content-Encoding is identity (it would not copy content, but just return reference to it) Please either remove code which deals with Accept-Encoding/Content-Encoding things or at least apply proposed fixes. It is not possible to work correctly in application which doing requests with Accept-Encoding header is set. Thank you in advance!
One more thing. I forgot to say why is better to disable charset decoding. If document charset (obtained from Content-Type or any other place) is unknown to 'Encode' module, decoded_content will fail (even if content can be unzipped successfully, for example if it Content-Encoding was gzip).
From: overlordq [...] gmail.com
On a related note, it seems that the latest version of libwww-perl makes this a slightly bigger problem. === Release 5.810 Don't allow HTTP::Message content to be set to Unicode strings. === My problems: http://www.perlmonks.org/?node_id=683833 Seems to come back to the above content encoding problems.
Funny that one small piece of code caused problems for different people practically at the same time ;) I'm not yet upgdaded to newer libwww so I didn't stuck with your problem. My problem was caused by invalid response (Content-Encoding header is still set, but content is not gzipped), so when my program was trying to get decoded_content it failes because content already decoded. PS. There is a reply at perlmonks by Juerd. He saying exactlyt what I said earlier in this ticket and I'm 100% agree with him.
On Thu May 01 02:23:17 2008, YKAR wrote: Show quoted text
> Funny that one small piece of code caused problems for different people > practically at the same time ;)
If time permits, I will address this issue over the coming weekend. Just so I'm clear, should this reliably produce the error: 1. Upgrade to latest libwww. 2. Fetch a page with a Content-Encoding header. ?
On Thu May 01 13:01:18 2008, RCAPUTO wrote: Show quoted text
> On Thu May 01 02:23:17 2008, YKAR wrote:
> > Funny that one small piece of code caused problems for different people > > practically at the same time ;)
> > If time permits, I will address this issue over the coming weekend. > > Just so I'm clear, should this reliably produce the error: > > 1. Upgrade to latest libwww. > 2. Fetch a page with a Content-Encoding header. > > ?
Essentially yes, attached is very simple script which triggers: HTTP::Message content must be bytes at /usr/local/share/perl/5.8.8/POE/Component/Client/HTTP/Request.pm line 187
#!/usr/bin/perl use POE qw(Component::Client::HTTP); use HTTP::Request; POE::Component::Client::HTTP->spawn(Alias => 'ua'); POE::Session->create( inline_states => { _start => sub { POE::Kernel->post( 'ua', 'request', 'response', HTTP::Request->new(GET => 'http://en.wikipedia.org/wiki/Main_Page'), ); }, _stop => sub {}, response => \&response_handler, }, ); POE::Kernel->run(); exit;
Show quoted text
> If time permits, I will address this issue over the coming weekend.
Thank you. overlordq passed ahead of me with the example ;) Show quoted text
> Just so I'm clear, should this reliably produce the error: > > 1. Upgrade to latest libwww. > 2. Fetch a page with a Content-Encoding header. > > ?
On Wed Apr 30 04:14:27 2008, YKAR wrote: Show quoted text
> Hello > > In method POE::Component::Client::HTTP::return_response if header > Content-Encoding is seen attempt to decode content is performed. > > I don't think that it is proper to do so (I think receiver of > HTTP::Response object should use method 'decoded_content').
I understand this is probably the more correct behavior, but I currently expect the user to always call decoded_content(). If I understood cases where the user would avoid decoded_content(), then I would be more inclined to change this behavior. Can you help by explaining such cases? Show quoted text
> Anyway, if you think that current design is correct, I suggest to add > some fixes to current implementation.
I am undecided, but I'm leaning towards continuing to support the existing behavior. I'm willing to break existing code for a good reason... I just need to know what that is. :) Here's an alternative implementation based on your suggestions, committed as revision 326. Please let me know what you think of it. ---cut--- if ($response->header('content-encoding')) { # LWP likes to die on error. my $content = eval { $response->decoded_content(charset => 'none', ref => 1) }; if ($content) { $response->content_ref($content); $response->content_length(length($$content)); $response->header('content-encoding' => undef); } } ---cut--- To recap: 1. Disable charset decoding with decoded_content() Done. 2. Remove Content-Encoding header if using decoded content. Done. I also adjusted the Content-Length to reflect the expanded content length. 3. Pass by reference to avoid unnecessary copying. Done. Also: * The test case passes, thank you. * "make test" passes. :)
Show quoted text
> I understand this is probably the more correct behavior, but I currently > expect the user to always call decoded_content(). If I understood cases > where the user would avoid decoded_content(), then I would be more > inclined to change this behavior. Can you help by explaining such cases?
Probably I explained uncertainly. I think user should call decoded_content() always, but PoCoCl::HTTP must not call decoded_content() at all. LWP::UserAgent never did decoding transparently. I think better to maintain compatible behavior (because many developers have expirience of using LWP::UserAgent). Gisle Aas (author of libwww-perl) always objected against handling Content-Encoding transparently. For example here (there was many discussion on this subject, it is just one from many): http://www.mail-archive.com/libwww@perl.org/msg04703.html One of reasons he mention, that apache sets Content-Encoding: x-gzip header for gzipped files. And it is unwise to decode content in library, probably user want just to save gzipped file, or compute md5 sum or any other case where user needs original not decoded content. Other reason is because Transfer-Encoding and Content-Encoding "works at different levels in the HTTP protocol" (afair I already mentioned this). I consider Gisle Aas an authority. He is writting libwww-perl library long time, he read HTTP RFC from cover to cover many times, so I inclined to trust him. Show quoted text
> > Anyway, if you think that current design is correct, I suggest to add > > some fixes to current implementation.
> > I am undecided, but I'm leaning towards continuing to support the > existing behavior. I'm willing to break existing code for a good > reason... I just need to know what that is. :) > > Here's an alternative implementation based on your suggestions, > committed as revision 326. Please let me know what you think of it. > > ---cut--- > if ($response->header('content-encoding')) { > # LWP likes to die on error. > my $content = eval { > $response->decoded_content(charset => 'none', ref => 1) > }; > if ($content) { > $response->content_ref($content); > $response->content_length(length($$content)); > $response->header('content-encoding' => undef); > } > } > ---cut---
Thank you :) Even better than I thought to do. I did not even knew about existence of content_ref method ;) I think problem of overlordq solved just by setting charset => 'none', because if charset decoding is not performed no wide characters will appear.
On Sun May 04 08:34:58 2008, YKAR wrote: Show quoted text
> > Probably I explained uncertainly. I think user should call > decoded_content() always, but PoCoCl::HTTP must not call > decoded_content() at all.
Thank you for the link to Gisle Aas' discussion about the matter. He makes good points, and I will refer to them if anyone gives me a hard time about the change. I understand your point that libraries should not transparently decode content, but I'm reluctant to break everybody's code just for correctness. I think the best way to remove the feature would be to phase it out over several releases. Here's the plan I suggested to POE's mailing list. What do you think? 1a. Only decode content for text/* that is gzip encoded. All other content would be passed through unchanged. 1b. Add a new option to do transparent decoding, and leave it on by default. 2. Later, throw a warning if the option is turned on. Remind the application developer to use decoded_content() if they want decoded content. 3. Later, throw a hard error (and reminder) at the developer if the option is enabled. 4. Later, remove the transparent decoding feature and option altogether. The option lets developers explicitly state what they want, and then deprecation guides them into the new behavior.
Show quoted text
> I understand your point that libraries should not transparently decode > content, but I'm reluctant to break everybody's code just for correctness.
I don't think that this change will break much (any?) code. Thats why: There are two cases (at least which I know) when server sends Content-Encoding header. 1. Request has header Accept-Encoding. 2. Requesting ".gz" file from apache. In first case if developer setting Accept-Encoding header, developer knows what he is doing and he ready to get response with encoded content. In this case transparent decoding even make situation confusing (expected encoded content but get decoded). In second case, transparent decoding, I guess, is not the thing which developer expected. And I'm doubt that someone relies on such behavior (I think if developer request some file with .gz suffix it expect to get gzipped data but not already unzipped file). In other cases (when niether Accept-Encoding header is set nor .gz file requested) developer does not expect to get encoded content (and developer should not get it). Current implementation of PoCoCl::HTTP is adding Accept-Encoding: gzip to request and trying to handle response with Content-Encoding. Actually PoCoCl::HTTP is trying to deceive client program. Even if program do not want to use compression the PoCoCl::HTTP request a document with compression and decompress it for program to present it as uncompressed. I want to emphasize that http server does not compress document if it is not asked for. So if Accept-Encoding is not set, server will not respond with encoded content and Content-Encoding header (The exception is when document already compressed, like static gzipped file, in this case server has no other option just like to send content as is and set Content-Type and Content-Encoding. But it is not the case where transpared decompression is expected). So I think the only consequence of disabling transparent compression is increased consumed bandwitdh (compression is requested implicitly by PoCoCl::HTTP, but after disabling transparent compression, compression will be used only if client application requested it). By the way, I see Transfer-Encoding handling code exists in PoCoCl::HTTP, but a) compression filters is commented out in te_possible_filters b) TE: header is not added to request. If this code works, it seems it is not difficult to enable transparent compression on this level? PS. I thought out some improvement to current workaround. Currently PoCoCl::HTTP does not set Accept-Encoding header if user set it already. But if it sees Content-Encoding, content decoded is unconditionally (no matter who set Accept-Encoding user or library). It is possible to make some flag in PoCoCl::HTTP::Request that Accept-Encoding was set by library, and handle Content-Encoding only if flag is set. Thus if user set Accept-Encoding header, user get "untouched" content. With this change compatibilty (from application developer's perspective) with LWP::UserAgent will be full. Show quoted text
> I think the best way to remove the feature would be to phase it out over > several releases. Here's the plan I suggested to POE's mailing list. > What do you think?
As I explained above, I think this change does not hit most of users. So probably it is ok to disable transparent decoding by default. And if this will breake some code (either make fixes to work with this code, or revert change and implement this change by phases like you proposed). Show quoted text
> 1a. Only decode content for text/* that is gzip encoded. All other > content would be passed through unchanged. > > 1b. Add a new option to do transparent decoding, and leave it on by
default. Show quoted text
> > 2. Later, throw a warning if the option is turned on. Remind the > application developer to use decoded_content() if they want decoded
content. Show quoted text
> > 3. Later, throw a hard error (and reminder) at the developer if the > option is enabled. > > 4. Later, remove the transparent decoding feature and option altogether. > > The option lets developers explicitly state what they want, and then > deprecation guides them into the new behavior.
Thank you for your patient explanation. Now I understand that it's safe to remove the transparent gzip handling. I have done this in revision 327, and I will shortly release the fix in version 0.84.