This queue is for tickets about the File-MimeInfo CPAN distribution.

Report information
The Basics
Id: 30958
Status: rejected
Priority: 0/
Queue: File-MimeInfo

People
Owner: Nobody in particular
Requestors: kas [...] fi.muni.cz
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 0.14
Fixed in: 0.12



Subject: File::MimeInfo text/plain heuristics broken
The heuristics for distinguishing between text/plain and application/octet-stream in MimeInfo.pm are broken: they behave differently when the file is passed in as a file name than when it is passed in as a filehandle.

The test case:

- Create a file which is not UTF-8 encoded but contains only printable Latin-<n> characters (i.e. nothing in [\x00-\x1f\x7f] except \t, \r, and \n):

    $ perl -e 'open my $fh, ">/tmp/xxx"; print $fh "\xb0\xb1\xb2\xf2"'
    $ perl -MFileHandle -MFile::MimeInfo::Magic -e 'print mimetype("/tmp/xxx"), "\n"'
    text/plain

- So far so good. However:

    $ perl -MFileHandle -MFile::MimeInfo::Magic -e 'open my $fh, "<", "/tmp/xxx"; print mimetype($fh), "\n"'
    application/octet-stream

- This is incorrect (at the very least the two results should be the same, but I think text/plain is the correct choice).

It worked correctly in 0.12, and it is broken in 0.14. The reason is that 0.14 runs "binmode FILE, ':utf8'" when the file is given as a file name, but not when it is given as a filehandle. That is probably a good approach in itself, since the module should not mess with the I/O layer settings of someone else's filehandle. However, the result is incorrect detection of plain text files when they are given as a filehandle.
Subject: RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date: Sun, 25 Nov 2007 11:52:50 +0100
To: <bug-File-MimeInfo [...] rt.cpan.org>, <undisclosed-recipients:;>
From: <j.g.karssenberg [...] student.utwente.nl>
Hi,

I do not agree that this is a bug. When a file handle is given through the API, I assume that whoever opened the file knows best what the encoding is. When you are handling a utf8 file using a non-utf8 binmode, the rest of your program will probably also see control characters rather than plain text, so the mimetype returned is a good result.

What I can do is add a recommendation to the documentation to use binmode utf8 for filehandles.

Regards,
Jaap <pardus@cpan.org>
From: kas [...] informatics.muni.cz
On Sun 25 Nov 2007 05:53:33, j.g.karssenberg@student.utwente.nl wrote:
> I do not agree that this is a bug. When a file handle is given through
> the API I assume that whoever opened the file knows best what the
> encoding is.
This is ridiculous: we are inside a module which is designed to _guess_ what the MIME type is, so I don't think the caller necessarily has any idea which encoding the filehandle has. I still think tasks like "what MIME type is my STDIN" are legitimate tasks which File::MimeInfo::Magic should be able to perform.

I think the correct solution would be to duplicate the filehandle and set its own I/O layer, using something like:

    open my $fh, "<&", $filehandle;
    binmode $fh, ':utf8' if $] >= 5.008
    ... do the magic using [:printable:] or /[\x0-\x1f\x7f]/ ...
    close $fh;
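The duplication idea proposed above can be sketched as follows. This is only an illustration under stated assumptions, not code from File::MimeInfo: the helper name `peek_bytes`, its `$layer` parameter, and the default of `:raw` are all hypothetical, and the handle is assumed to be seekable. One detail worth making explicit is that a handle duplicated with `<&` shares the underlying file offset with the original, so the sketch rewinds the duplicate and restores the caller's position afterwards.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of File::MimeInfo): read the first $len bytes
# of a seekable handle through a duplicate, so that the layer set here is
# never applied to the caller's handle.
sub peek_bytes {
    my ($orig, $len, $layer) = @_;
    $layer = ':raw' unless defined $layer;   # assumption: inspect raw bytes by default

    my $pos = tell($orig);                   # remember the caller's logical position
    open(my $dup, '<&', $orig) or die "Cannot dup filehandle: $!";
    binmode($dup, $layer);                   # affects the duplicate only

    # The duplicate shares the underlying file offset with the original,
    # so rewind explicitly and restore the caller's position afterwards.
    seek($dup, 0, 0) or die "Cannot seek duplicate: $!";
    my $buf = '';
    defined read($dup, $buf, $len) or die "Cannot read: $!";
    close($dup);
    seek($orig, $pos, 0) or die "Cannot restore position: $!";

    return $buf;
}

# Demo with the sample bytes from the report, written to a temporary file.
my ($out, $name) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xb0\xb1\xb2\xf2";
close($out);

open(my $in, '<', $name) or die "Cannot open $name: $!";
my $head = peek_bytes($in, 32);
printf "read %d bytes, caller position is still %d\n", length($head), tell($in);
close($in);
```

The bytes returned by such a helper could then be handed to whatever text/binary check is wanted, without the caller's handle ever having its layers changed.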
Subject: RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date: Sun, 25 Nov 2007 13:07:45 +0100
To: <bug-File-MimeInfo [...] rt.cpan.org>
From: <j.g.karssenberg [...] student.utwente.nl>
Well, that depends on the use case. E.g. if you take a file from STDIN that can be of any type and you want to open it, you should save it to a file anyway, because external programs also need a file to open. If you read data from STDIN to use internally in your program, you usually already have a subset of file types that you can handle. If that subset is text based, or contains text based types, you should think about setting the right encoding. Please keep in mind that the MimeInfo module does not and should not guess character encoding.

I do admit that my use of utf8 is an extension of the actual spec. The spec says to check for ascii control chars and ignore all chars with the high bit set because they are probably ascii. However, I wanted to use the perl definition of utf8 chars as a more sophisticated way to do this. If I revert the behavior to what the spec states, both cases in your example would work.

Please provide a detailed use case for a situation where you should not be concerned with char sets and where this default method poses a problem. In case of doubt I will revert the behavior to what the spec requires, but at the moment I do not see a direct reason for that.

Regards,
Jaap <pardus@cpan.org>
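For illustration, the spec-style check described above (look only for ASCII control characters and ignore bytes with the high bit set) might look like the sketch below. It is not the module's actual implementation: the function name `default_type` and the 32-byte window are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not File::MimeInfo's actual code: decide text vs.
# binary from ASCII control characters only, ignoring high-bit bytes.
sub default_type {
    my ($bytes, $window) = @_;
    $window = 32 unless defined $window;   # assumption: size of the inspected prefix
    my $head = substr($bytes, 0, $window);

    # Control characters other than \t (\x09), \n (\x0a) and \r (\x0d).
    return $head =~ /[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]/
        ? 'application/octet-stream'
        : 'text/plain';
}

print default_type("\xb0\xb1\xb2\xf2"), "\n";   # sample bytes from the report
print default_type("PK\x03\x04"), "\n";         # bytes containing control characters
```

With this byte-oriented check, the sample file from the report is classified as text/plain regardless of which I/O layer the handle was opened with, which matches the statement that both cases would work under the spec's behavior.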
On Sun 25 Nov 2007 07:08:03, j.g.karssenberg@student.utwente.nl wrote:
> Please provide a detailed use case for a situation where you should
> not be concerned with char sets where this default method poses a
> problem. In case of doubt I will reverse behavior to what the spec
> requires, but at the moment I do not see a direct reason for that.
I ran into this problem in a web-based information system, where I use File::MimeInfo::Magic to guess the MIME type of uploaded files. The inner workings of this system give me only a filehandle (although a seekable one) from which to guess the MIME type. Only when mimetype() returns text/* do I try to guess the file's charset with my own heuristics (based on which languages and charsets are primarily used in this system, etc.). If I get a MIME type of application/octet-stream, I do not try to do anything else with the file.
Subject: RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date: Mon, 26 Nov 2007 02:20:29 +0100
To: <bug-File-MimeInfo [...] rt.cpan.org>
From: <j.g.karssenberg [...] student.utwente.nl>
The correct way of working would be to guess the encoding before you run the default method. Without information on the encoding, the result of the default function is not trustworthy anyway. For example, you might have an encoding which uses character codes that happen to be control chars in ascii or utf8. In that case the default method does not return text/plain, and you do not try to read the file.

What you should do is first call magic() on the file handle; this will return a mimetype or undef. If you get undef, you try to guess the encoding. Only after setting the encoding for the filehandle can you call default() to check whether it is indeed plain text.

Regards,
Jaap <pardus@cpan.org>
From: kas [...] informatics.muni.cz
On Sun 25 Nov 2007 20:20:52, j.g.karssenberg@student.utwente.nl wrote:
> The correct way of working would be to guess the encoding before you
> run the default method. Without info on the encoding the result of
> the default function is not trustworthy anyway. For example you
> might have an encoding which uses character codes which happen to
> be control chars in ascii or utf8, in that case the default method
> does not return text/plain, and you do not try to read it.
Well, I am not aware of any such strange encoding that is used in real life (maybe UTF-16, but nobody here uses that). The problem is that you do the heuristics anyway, but only when a file name (and not a file handle) is given to mimetype(). I still think both should work the same way, as they did in 0.12.
Added documentation to warn about this issue, no code changed.