Bug #30958 for File-MimeInfo: File::MimeInfo text/plain heuristics broken

Sat Nov 24 16:17:28 2007 kas [...] fi.muni.cz - Ticket created

Subject:	File::MimeInfo text/plain heuristics broken

The heuristics for distinguishing between text/plain and application/octet-stream in MimeInfo.pm are broken. They behave differently when the file is passed in as a file name than when it is passed in as a filehandle.

The test case:

- create a file which is not UTF-8 encoded but contains only printable Latin-<n> characters (i.e. not in [\x00-\x1f\x7f] except \t, \r, and \n):

    $ perl -e 'open my $fh, ">/tmp/xxx"; print $fh "\xb0\xb1\xb2\xf2"'
    $ perl -MFileHandle -MFile::MimeInfo::Magic -e 'print mimetype("/tmp/xxx"), "\n"'
    text/plain

- so far it is good, however:

    $ perl -MFileHandle -MFile::MimeInfo::Magic -e 'open my $fh, "<", "/tmp/xxx"; print mimetype($fh), "\n"'
    application/octet-stream

- this is incorrect (at least the results should be the same; but I think text/plain is the correct choice).

It worked correctly in 0.12, and it is broken in 0.14. The reason is that 0.14 tries to run "binmode FILE, ':utf8'" when the file is given as a file name, but not when it is given as a filehandle (which is probably a good approach; it should not mess with the I/O layer settings of someone else's filehandle). However, the result is incorrect detection of plain text files when they are given as a filehandle.

Sun Nov 25 05:53:33 2007 j.g.karssenberg [...] student.utwente.nl - Correspondence added

Subject:	RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date:	Sun, 25 Nov 2007 11:52:50 +0100
To:	<bug-File-MimeInfo [...] rt.cpan.org>, <undisclosed-recipients:;>
From:	<j.g.karssenberg [...] student.utwente.nl>

Hi,

I do not agree that this is a bug. When a file handle is given through the API, I assume that whoever opened the file knows best what the encoding is. When you are handling a utf8 file with a non-utf8 binmode, the rest of your program will probably also see control characters, not plain text, so the returned mimetype is a reasonable result.

What I can do is add a recommendation to the documentation to use binmode utf8 for filehandles.

Regards,
Jaap <pardus@cpan.org>

Sun Nov 25 05:53:38 2007 The RT System itself - Status changed from 'new' to 'open'

Sun Nov 25 06:42:39 2007 kas [...] fi.muni.cz - Correspondence added

From:	kas [...] informatics.muni.cz

On Sun 25 Nov 2007 05:53:33, j.g.karssenberg@student.utwente.nl wrote:

> I do not agree that this is a bug. When a file handle is given through
> the API I assume that whoever opened the file knows best what the
> encoding is.

This is ridiculous: we are inside a module which is designed to _guess_ what the MIME type is. So I don't think the caller necessarily has to have any idea of which encoding the filehandle has. I still think tasks like "what MIME type is my STDIN" are legitimate tasks which File::MimeInfo::Magic should be able to perform.

I think the correct solution would be to duplicate the filehandle and set its own I/O layer using something like

    open my $fh, "<&", $filehandle;
    binmode $fh, ':utf8' if $] >= 5.008;
    # ... do the magic using [:print:] or /[\x0-\x1f\x7f]/ ...
    close $fh;
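The duplicate-the-filehandle idea can be sketched in core Perl as below. This is only an illustration, not code from File::MimeInfo: the helper name `peek_sample`, the default `:raw` layer, and the 32-byte sample size are all assumptions made for the example. It also shows one caveat of the approach: a duplicated handle shares its OS-level file offset with the original, so the caller's position has to be saved and restored.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::MimeInfo.
# Reads the first $len bytes of a seekable handle through a private
# duplicate, so the I/O layers of the caller's handle are never changed.
sub peek_sample {
    my ($fh, $layer, $len) = @_;
    $layer = ':raw' unless defined $layer;   # assumed default layer
    $len   = 32     unless defined $len;     # assumed sample size

    my $pos = tell($fh);
    return if $pos < 0;                      # handle is not seekable

    # The duplicate gets its own PerlIO layers.
    open(my $dup, '<&', $fh) or return;
    binmode($dup, $layer);

    my $sample = '';
    my $ok = seek($dup, 0, 0) && defined read($dup, $sample, $len);
    close($dup);

    # Both handles share one file offset, so put the caller's back.
    seek($fh, $pos, 0);

    return $ok ? $sample : undef;
}
```

A non-seekable handle such as a pipe makes the helper return undef, so a caller handling arbitrary STDIN would still need a fallback such as spooling the data to a temporary file.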

Sun Nov 25 07:08:03 2007 j.g.karssenberg [...] student.utwente.nl - Correspondence added

Subject:	RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date:	Sun, 25 Nov 2007 13:07:45 +0100
To:	<bug-File-MimeInfo [...] rt.cpan.org>
From:	<j.g.karssenberg [...] student.utwente.nl>

Well, it depends on the use case. E.g. if you take a file from STDIN that can be of any type and you want to open it, you should save it to a file anyway, because external programs also need a file to open. If you read data from STDIN to use internally in your program, you usually already have a subset of filetypes that you can handle; if that subset is text based or contains text based types, you should think about setting the right encoding. Please keep in mind that the MimeInfo module does not and should not guess character encoding.

I do admit that my use of utf8 is an extension of the actual spec. The spec says to check for ascii control chars and to ignore all chars with the high bit set, because they are probably text. However, I wanted to use the perl definition of utf8 chars as a more sophisticated way to do this. If I revert the behavior to what the spec states, both cases in your example would work.

Please provide a detailed use case for a situation where you should not be concerned with char sets and where this default method poses a problem. In case of doubt I will revert the behavior to what the spec requires, but at the moment I do not see a direct reason for that.

Regards,
Jaap <pardus@cpan.org>
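For illustration, the spec-style check described in this message (look only for ASCII control characters, ignore bytes with the high bit set) could look roughly like the sketch below. It is not the module's actual code: the function name `guess_default_type` is hypothetical, and the caller is assumed to pass in the first few raw bytes of the file.

```perl
use strict;
use warnings;

# Hypothetical helper, not File::MimeInfo's implementation.
# $bytes is assumed to hold the first raw bytes of the file.
sub guess_default_type {
    my ($bytes) = @_;

    # ASCII control characters other than \t, \n and \r mark the data
    # as binary. Bytes >= 0x80 are deliberately not examined.
    return 'application/octet-stream'
        if $bytes =~ /[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]/;

    return 'text/plain';
}

# The Latin-<n> bytes from the original report:
print guess_default_type("\xb0\xb1\xb2\xf2"), "\n";   # text/plain
# Data containing NUL and other control bytes:
print guess_default_type("\x00\x01\x02"), "\n";       # application/octet-stream
```

Because the check never interprets the bytes as characters, it gives the same answer whichever I/O layer the data was read through, as long as the raw bytes are what gets passed in.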

Sun Nov 25 15:32:46 2007 kas [...] fi.muni.cz - Correspondence added

On Sun 25 Nov 2007 07:08:03, j.g.karssenberg@student.utwente.nl wrote:

> Please provide a detailed use case for a situation where you should
> not be concerned with char sets where this default method poses a
> problem. In case of doubt I will reverse behavior to what the spec
> requires, but at the moment I do not see a direct reason for that.

I ran into this problem in a web-based information system, where I use File::MimeInfo::Magic for guessing the MIME type of uploaded files. The inner workings of this system give me only a filehandle (although a seekable one) from which to guess the MIME type. Only when mimetype() returns text/* do I try to guess the file charset using my own heuristics (based on what languages and charsets are primarily used in this system, etc.). If I get a MIME type of application/octet-stream, I do not try to do anything else with the file.

Sun Nov 25 20:20:52 2007 j.g.karssenberg [...] student.utwente.nl - Correspondence added

Subject:	RE: [rt.cpan.org #30958] File::MimeInfo text/plain heuristics broken
Date:	Mon, 26 Nov 2007 02:20:29 +0100
To:	<bug-File-MimeInfo [...] rt.cpan.org>
From:	<j.g.karssenberg [...] student.utwente.nl>

The correct way of working would be to guess the encoding before you run the default method. Without info on the encoding, the result of the default function is not trustworthy anyway. For example, you might have an encoding which uses character codes which happen to be control chars in ascii or utf8; in that case the default method does not return text/plain, and you do not try to read it.

What you should do is first call magic() on the file handle; this will return a mimetype or undef. If you get undef, you try to guess the encoding. Only after setting the encoding for the filehandle can you call default() to check if it is indeed plain text.

Regards,
Jaap <pardus@cpan.org>

Mon Nov 26 04:29:44 2007 kas [...] fi.muni.cz - Correspondence added

From:	kas [...] informatics.muni.cz

On Sun 25 Nov 2007 20:20:52, j.g.karssenberg@student.utwente.nl wrote:

> The correct way of working would be to guess the encoding before you
> run the default method. Without info on the encoding the result of
> the default function is not trustworthy anyway. For example you
> might have an encoding which uses character codes which happen to
> be control chars in ascii or utf8, in that case the default method
> does not return text/plain, and you do not try to read it.

Well, I am not aware of any such strange encoding that is used in real life (maybe UTF-16, but nobody here uses that). The problem is that you do the heuristics anyway, but only when a file name (and not a file handle) is given to mimetype(). I still think both should work the same way (as they did in 0.12).

Thu Feb 14 04:00:46 2008 pardus [...] cpan.org - Correspondence added

Added documentation to warn about this issue, no code changed.

Thu Feb 14 04:00:48 2008 pardus [...] cpan.org - Status changed from 'open' to 'rejected'
