Skip Menu |

Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the File-MimeInfo CPAN distribution.

Report information
The Basics
Id: 11765
Status: resolved
Priority: 0/
Queue: File-MimeInfo

People
Owner: pardus [...] cpan.org
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 0.10
Fixed in: (no value)



Subject: [PATCH] provide file extension information
The attached patch extends the mimetype method to return both a mime type and file extension when called in list context.
diff -ru File-MimeInfo-0.10-orig/MimeInfo/Magic.pm File-MimeInfo-0.10/MimeInfo/Magic.pm --- File-MimeInfo-0.10-orig/MimeInfo/Magic.pm 2003-12-05 08:35:14.000000000 -0800 +++ File-MimeInfo-0.10/MimeInfo/Magic.pm 2005-03-04 15:20:24.000000000 -0800 @@ -31,12 +31,19 @@ sub mimetype { my $file = pop || croak 'subroutine "mimetype" needs a filename as argument'; - return magic($file) || default($file) if ref $file; - return - inodetype($file) || + my ($mime, $ext); + + if (ref $file) { + $mime = magic($file) || default($file); + } else { + $mime = inodetype($file) || globs($file) || magic($file) || default($file); + } + $ext = File::MimeInfo::_mime_to_ext($mime) if $mime; + $ext = '' if ! defined($ext); + return wantarray ? ($mime, $ext) : $mime; } sub magic { @@ -205,7 +212,9 @@ =item C<mimetype($file)> -Returns a mime-type string for C<$file>, returns undef on failure. +In scalar context, returns a mime-type string for C<$file>. +In list context, returns mime-type and file extension strings. +Returns undef on failure. This method bundles C<inodetype>, C<globs> and C<magic>. Only in File-MimeInfo-0.10/MimeInfo: Magic.pm~ diff -ru File-MimeInfo-0.10-orig/MimeInfo.pm File-MimeInfo-0.10/MimeInfo.pm --- File-MimeInfo-0.10-orig/MimeInfo.pm 2004-02-08 03:32:17.000000000 -0800 +++ File-MimeInfo-0.10/MimeInfo.pm 2005-03-04 15:00:04.000000000 -0800 @@ -12,10 +12,11 @@ our $VERSION = '0.10'; our $DEBUG; -our (@globs, %literal, %extension, $LANG); +our (@globs, %literal, %extension, %mime2ext, $LANG); # @globs = [ [ 'glob', qr//, $mime_string ], ... ] # %literal contains literal matches # %extension contains extensions (globs matching /^\*(\.\w)+$/ ) +# %mime2ext is used for looking up extension by mime type # $LANG can be used to set a default language for the comments rehash(); # initialise data @@ -126,7 +127,8 @@ chomp; ($string, $glob) = split /:/, $_, 2; unless ($glob =~ /[\?\*\[]/) { $literal{$glob} = $string } - elsif ($glob =~ /^\*\.(\w+(\.\w+)*)$/) { $extension{$1} = $string } + elsif ($glob =~ /^\*\.(\w+(\.\w+)*)$/) { + $extension{$1} = $string; $mime2ext{$string} = $1; } else { unshift @globs, [$glob, _glob_to_regexp($glob), $string] } } close GLOB || croak "Could not open file '$file' for reading" ; @@ -162,6 +164,10 @@ return $desc; } +sub _mime_to_ext { + return $mime2ext{$_[0]}; +} + 1; __END__
[guest - Fri Mar 4 18:23:57 2005]: Show quoted text
> The attached patch extends the mimetype method to return both a mime > type and file extension when called in list context.
I'll regard this as a feature request for a mimetype => extensions conversion. Not sure what you want to do with it, but I suppose it could be useful in some cases. But I have a few problems with the way your patch works: I) The extension you return has no connection to the filename you are matching, so it can't be used to for example determine the basename of the file. This is so because: Ia) The way you build the mime2ext hash works you suppose there is only one extension for each mimetype, this is wrong for example '.tar.gz' and '.tgz' both map to 'application/x-compressed-tar'. Your %mime2ext will only contain the last one in the database (which can be arbitrary). Ib) You also try to return a extension when an other rule was used to determine the mimetype of the file, like magic or inode type. II) To me this would be an optional feature that people only would use in certain special cases, but the mime2ext hash can potentialy be quite large; so I would like to have %mime2ext only be build when this feature is used. So please let me know what problem you try to solve here and I'll see about adding a conversion routine to File::MimeInfo.
Aah, what you basicly want is to generate a filename for given content with an extension matching the real content type. This is a interesting problem :) Just out of curiosity, do you use the IO::Scalar + magic method for this, or some other trick ? The order in the database depends partially on the heuristic of the update-mime-database command, which might be subject to change. But luckily the database knows priorities, you could adjust these in order to have the extension you want outranking others. A reverse lookup table is needed. Of course we could try a reverse lookup in the existing extensions hash, but then the ranking is lost. ( Also it makes little sense to be greedy with memory, the module isn't memory optimised anyway ). Because probably only relatively few users will use this hash it makes sense to make it optional, since the module isn't really object oriented I'm thinking of putting the option in the 'import' function (but I'm open to suggestions). Secondly I would like to make the feature more general by having all extensions in the hash, rather than just the first one; so the hash can contain both scalars and arrays of scalars. (Since there are relatively few mimetypes claiming more then one extension this probably won't hurt performance.) So this boils down to the following changes: A) An option somewhere, either 'import' or 'new', called something like 'hash_reversed_lookup' B) A method extensions($mimetype) returning a list of extensions, highest ranking - or dieing if the option under A wasn't set first And just for good measures: C) If the extension was used for matching (and only if ..) return the extension that was matched from mimetype() if called in wantarray context - this can be used to determine the basename of a file ( for example some program tells you the file is called "foo.jpg.exe" and you want to know what the real extension is ;) ) There is no reason to have the extension for the mimetype directly returned from mimetype($file) because it doesn't depend on the file; but I don't suppose that would cause you trouble. Your code would probably boil down to something like: # set option somewhere earlier my $mime = mimetype($whateveryouusehere); my @ext = extensions($mime); if (@ext) { my $regex = join('|', @ext); $filename = "$filename.$ext[0]" unless $filename =~ /($regex)$/; } You are welcome to submit a new patch; I will probably need to edit some of it myself anyway before releasing. I would also like to include a small cookbook doc with a few code examples. Although I have currently a daytime occupation conflicting the time I can spare for programming, I am willing to put the release of a new File::MimeInfo on my to todo list for the next two weeks. If you want to prepare a patch drop me a note and i'll start writing the cookbook and wait with code till your patch is in. Probably better to mail me if there are any details left to flesh out. cheers!
From: jhe [...] proofpoint.com
[PARDUS - Sat Mar 5 05:37:37 2005]: I am attaching the changed files here for your reference. Here is the change summary: 1) Fixed a bug in Magic.pm. (qr/$reg/ -> qr/$reg/sm). Before the fix, it had trouble matching magic strings in multi-line buffer (only worked for single line). 2) Added Magic.pm::extensions(). See doc section for details. 3) Added Magic.pm::priority(). See doc section for details. By making the priority() call, we can tell which mime-types are user-defined ones (since we always set priority to 100 (highest) for user-defined types). This is needed in resolving conflicts between "shared-mime" and other mime detecting tools. Show quoted text
> [guest - Fri Mar 4 18:23:57 2005]:
> > The attached patch extends the mimetype method to return both a mime > > type and file extension when called in list context.
> > I'll regard this as a feature request for a mimetype => extensions > conversion. Not sure what you want to do with it, but I suppose it could > be useful in some cases. > > But I have a few problems with the way your patch works: > > I) The extension you return has no connection to the filename you are > matching, so it can't be used to for example determine the basename of > the file. This is so because: > > Ia) The way you build the mime2ext hash works you suppose there is only > one extension for each mimetype, this is wrong for example '.tar.gz' and > '.tgz' both map to 'application/x-compressed-tar'. Your %mime2ext will > only contain the last one in the database (which can be arbitrary). > > Ib) You also try to return a extension when an other rule was used to > determine the mimetype of the file, like magic or inode type. > > II) To me this would be an optional feature that people only would use > in certain special cases, but the mime2ext hash can potentialy be quite > large; so I would like to have %mime2ext only be build when this feature > is used. > > So please let me know what problem you try to solve here and I'll see > about adding a conversion routine to File::MimeInfo.
package File::MimeInfo::Magic; use strict; use Carp; use Fcntl 'SEEK_SET'; use File::BaseDir qw/xdg_data_files/; require File::MimeInfo; require Exporter; BEGIN { no strict "refs"; for (qw/describe globs inodetype default/) { *{$_} = \&{"File::MimeInfo::$_"}; } } our @ISA = qw(Exporter File::MimeInfo); our @EXPORT = qw(mimetype); our @EXPORT_OK = qw(describe globs inodetype magic); our $VERSION = '0.9'; our $DEBUG; our (@magic); # used to store parse tree of magic data our (%mime2pri); # used to store the priority of each mime type _rehash(); # initialize data # use Data::Dumper; # print Dumper \@magic; sub mimetype { my $file = pop || croak 'subroutine "mimetype" needs a filename as argument'; return magic($file) || default($file) if ref $file; return inodetype($file) || globs($file) || magic($file) || default($file); } sub extensions { my $aref = $File::MimeInfo::mime2ext{$_[0]}; return $aref ? @{$aref} : undef if wantarray; return $aref ? @{$aref}[0] : ''; } sub priority { return $mime2pri{$_[0]} || ''; } sub magic { my $file = pop || croak 'subroutine "magic" needs a filename as argument'; return undef unless ref($file) || -s $file; print STDERR "> Checking magic rules\n" if $DEBUG; my $fh; unless (ref $file) { open $fh, $file || return undef; binmode $fh; } else { $fh = $file } # allowing for IO::Something for my $type (@magic) { for (2..$#$type) { next unless _check_rule($$type[$_], $fh, 0); close $fh unless ref $file; return $$type[1]; } } close $fh unless ref $file; return undef; } sub _check_rule { my ($ref, $fh, $lev) = @_; my $line; if (ref $fh eq 'GLOB') { seek($fh, $$ref[1][0], SEEK_SET); read($fh, $line, $$ref[1][1]); } else { # allowing for IO::Something $fh->seek($$ref[1][0], SEEK_SET); $fh->read($line, $$ref[1][1]); } return undef unless $line =~ $$ref[2]; my $succes; unless ($$ref[3]) { $succes++ } else { # mask stuff my $v = $2 & $$ref[3][1]; $succes++ if $v eq $$ref[3][0]; } print STDERR '>', '>'x$lev, ' Value "', _escape_bytes($2), '" at offset ', $$ref[1][0]+length($1), " matches line $$ref[0]\n" if $succes && $DEBUG; return undef unless $succes; if ($#$ref > 3) { for (4..$#$ref) { # recurs $succes = _check_rule($$ref[$_], $fh, $lev+1); last if $succes; } } print STDERR "> Failed nested rules\n" if $DEBUG && !($lev || $succes); return $succes; } sub rehash { &File::MimeInfo::rehash; &_rehash } sub _rehash { @magic = (); %mime2pri = (); _hash_magic($_) for reverse xdg_data_files('mime/magic'); @magic = sort {$$b[0] <=> $$a[0]} @magic; } sub _hash_magic { my $file = shift; open MAGIC, $file || croak "Could not open file '$file' for reading"; binmode MAGIC; <MAGIC> eq "MIME-Magic\x00\n" or carp "Magic file '$file' doesn't seem to be a magic file"; @magic = (); my $line = 1; while (<MAGIC>) { $line++; if (/^\[(\d+):(.*?)\]\n$/) { push @magic, [$1,$2]; $mime2pri{$2} = $1; next; } s/^(\d*)>(\d+)=(.{2})//s || carp "$file line $line skipped\n" && next; my ($i, $o, $l) = ($1, $2, unpack 'n', $3); # indent, offset, value length while (length($_) <= $l) { $_ .= <MAGIC>; $line++; } my $v = substr $_, 0, $l, ''; # value /^(?:&(.{$l}))?(?:~(\d+))?(?:\+(\d+))?\n$/s || carp "$file line $line skipped\n" && next; my ($m, $w, $r) = ($1, $2, $3 || 0); # mask, word size, range # the word size is given for big endian to little endian conversion # dunno whether we need to do that in perl my $ref = $i ? _find_branch($i) : $magic[-1]; my $reg = '^' . ( $r ? "(.{0,$r}?)" : '()' ) . ( $m ? "(.{$l})" : '('.quotemeta($v).')' ) ; push @$ref, [ $line, [$o, $o+$l+$r], qr/$reg/sm, $m ? [$v, $m] : 0 ]; } close MAGIC; } sub _find_branch { my $i = shift; my $ref = $magic[-1]; for (1..$i) { $ref = $$ref[-1] } return $ref; } sub _escape_bytes { my $string = shift; if ($string =~ /[\x00-\x1F\xF7]/) { $string = join '', map { my $o = ord($_); ($o < 32) ? '^' . chr($o + 64) : ($o == 127) ? '^?' : $_ ; } split '', $string; } return $string; } 1; __END__ =head1 NAME File::MimeInfo::Magic - Determine file type with magic =head1 SYNOPSIS use File::MimeInfo::Magic; my $mime_type = mimetype($file); =head1 DESCRIPTION This module inherits from L<File::MimeInfo>, it is transparant to its functions but adds support for the freedesktop magic file. =head1 EXPORT The method C<mimetype> is exported by default. The methods C<magic>, C<inodetype>, C<globs> and C<describe> can be exported on demand. =head1 METHODS See also L<File::MimeInfo> for methods that are inherited. =over 4 =item C<mimetype($file)> Returns a mime-type string for C<$file>, returns undef on failure. This method bundles C<inodetype>, C<globs> and C<magic>. If this doesn't work the file is read and the mime-type defaults to 'text/plain' or to 'application/octet-stream' when the first ten chars of the file match ascii control chars (white spaces excluded). If the file doesn't exist or isn't readable C<undef> is returned. If C<$file> is an object reference only C<magic> and the default method are used. =item C<extensions($mime)> In list context, returns the list of extensions of a given mime-type; In scalar context, returns the first extension (which has the highest priority) of a given mime-type; =item C<priority($mime)> Returns the priority of a given mime-type =item C<magic($file)> Returns a mime-type string for C<$file> based on the magic rules, returns undef on failure. C<$file> can be an object reference, in that case it is supposed to have a C<seek()> and a C<read()> method. This allows you for example to determine the mimetype of data in memory by using L<IO::Scalar>. =item C<rehash()> Rehash the data files. Glob and magic information is preparsed when this method is called. =back =head1 SEE ALSO L<File::MimeInfo> =head1 BUGS Please mail the author when you encounter any bugs. Most likely to cause bugs is the fact that I partially used line based parsing while the source data is binary and can contain newlines on strange places. I tested with the 0.11 version of the database I found no problems, but I can think of configurations that can cause problems. =head1 AUTHOR Jaap Karssenberg || Pardus [Larus] E<lt>pardus@cpan.orgE<gt> Copyright (c) 2003 Jaap G Karssenberg. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
From: jhe [...] proofpoint.com
[PARDUS - Mon Mar 7 18:11:07 2005]: attaching one more file...
package File::MimeInfo; use strict; use Carp; use Fcntl 'SEEK_SET'; use File::BaseDir qw/xdg_data_files/; require Exporter; our @ISA = qw(Exporter); our @EXPORT = qw(mimetype); our @EXPORT_OK = qw(describe globs inodetype); our $VERSION = '0.10'; our $DEBUG; our (@globs, %literal, %extension, %mime2ext, $LANG); # @globs = [ [ 'glob', qr//, $mime_string ], ... ] # %literal contains literal matches # %extension contains extensions (globs matching /^\*(\.\w)+$/ ) # %mime2ext is used for looking up extension by mime type # $LANG can be used to set a default language for the comments rehash(); # initialise data sub new { bless \$VERSION, shift } # what else is there to bless ? sub mimetype { my $file = pop || croak 'subroutine "mimetype" needs a filename as argument'; croak 'You should use File::MimeInfo::Magic to check open filehandles' if ref $file; return inodetype($file) || globs($file) || default($file); } sub inodetype { my $file = pop; print STDERR "> Checking inode type\n" if $DEBUG; return (-d $file) ? 'inode/directory' : (-l $file) ? 'inode/symlink' : (-p $file) ? 'inode/fifo' : (-c $file) ? 'inode/chardevice' : (-b $file) ? 'inode/blockdevice' : (-S $file) ? 'inode/socket' : undef ; } sub globs { my $file = pop || croak 'subroutine "globs" needs a filename as argument'; print STDERR "> Checking globs for basename '$file'\n" if $DEBUG; return $literal{$file} if exists $literal{$file}; if ($file =~ /\.(\w+(\.\w+)*)$/) { my @ext = split /\./, $1; if ($#ext) { while (@ext) { my $ext = join('.', @ext); print STDERR "> Checking for extension '.$ext'\n" if $DEBUG; return $extension{$ext} if exists $extension{$ext}; shift @ext; } } else { print STDERR "> Checking for extension '.$ext[0]'\n" if $DEBUG; return $extension{$ext[0]} if exists $extension{$ext[0]}; } } for (@globs) { next unless $file =~ $_->[1]; print STDERR "> This file name matches \"$_->[0]\"\n" if $DEBUG; return $_->[2]; } return globs(lc $file) if $file =~ /[A-Z]/; # recurs return undef; } sub default { my $file = pop || croak 'subroutine "default" needs a filename as argument'; my $line; unless (ref $file) { return undef unless -f $file; print STDERR "> File exists, trying default method\n" if $DEBUG; return 'text/plain' if -z $file; open FILE, $file || return undef; binmode FILE, ':utf8' unless $] < 5.008; read FILE, $line, 10; close FILE; } else { print STDERR "> Trying default method on object\n" if $DEBUG; $file->seek(0, SEEK_SET); $file->read($line, 10); } { no warnings; # warnings can be thrown when input is neither ascii or utf8 $line =~ s/\s//g; # \n and \t are also control chars return 'text/plain' unless $line =~ /[\x00-\x1F\xF7]/; } print STDERR "> First 10 bytes of the file contain control chars\n" if $DEBUG; return 'application/octet-stream'; } sub rehash { (@globs, %literal, %extension, %mime2ext) = ((), (), (), ()); # clear data my $done; ++$done && _hash_globs($_) for reverse xdg_data_files('mime/globs'); print STDERR << 'EOE' unless $done; You don't seem to have a mime-info database. See http://freedesktop.org/Software/shared-mime-info EOE } sub _hash_globs { my $file = shift; open GLOB, $file || croak "Could not open file '$file' for reading" ; my ($string, $glob); while (<GLOB>) { next if /^\s*#/; # skip comments chomp; ($string, $glob) = split /:/, $_, 2; unless ($glob =~ /[\?\*\[]/) { $literal{$glob} = $string } elsif ($glob =~ /^\*\.(\w+(\.\w+)*)$/) { $extension{$1} = $string; $mime2ext{$string} = [] if !defined($mime2ext{$string}); push @{$mime2ext{$string}}, $1; } else { unshift @globs, [$glob, _glob_to_regexp($glob), $string] } } close GLOB || croak "Could not open file '$file' for reading" ; } sub _glob_to_regexp { my $glob = shift; $glob =~ s/\./\\./g; $glob =~ s/([?*])/.$1/g; $glob =~ s/([^\w\/\\\.\?\*\[\]])/\\$1/g; qr/^$glob$/; } sub describe { shift if ref $_[0]; my ($mt, $lang) = @_; croak 'subroutine "describe" needs a mimetype as argument' unless $mt; $lang = $LANG unless defined $lang; my $att = $lang ? qq{xml:lang="$lang"} : ''; my $desc; for my $file (xdg_data_files('mime', split '/', "$mt.xml")) { $desc = ''; # if a file was found, return at least empty string open XML, $file || croak "Could not open file '$file' for reading"; binmode XML, ':utf8' unless $] < 5.008; while (<XML>) { next unless m!<comment\s*$att>(.*?)</comment>!; $desc = $1; last; } close XML || croak "Could not open file '$file' for reading"; last if $desc; } return $desc; } 1; __END__ =head1 NAME File::MimeInfo - Determine file type =head1 SYNOPSIS use File::MimeInfo; my $mime_type = mimetype($file); =head1 DESCRIPTION This module can be used to determine the mime type of a file. It tries to implement the freedesktop specification for a shared MIME database. For this module shared-mime-info-spec 0.12 was used. This package only uses the globs file. No real magic checking is used. The L<File::MimeInfo::Magic> package is provided for magic typing. If you want to detemine the mimetype of data in a memory buffer you should use L<File::MimeInfo::Magic> in combination with L<IO::Scalar>. =head1 EXPORT The method C<mimetype> is exported by default. The methods C<inodetype>, C<globs> and C<describe> can be exported on demand. =head1 METHODS =over 4 =item C<new()> Simple constructor to allow Object Oriented use of this module. If you want to use this, use the package as C<use File::MimeInfo ();> to avoid importing sub C<mimetype>. =item C<mimetype($file)> Returns a mime-type string for C<$file>, returns undef on failure. This method bundles C<inodetype> and C<globs>. If these methods are unsuccessfull the file is read and the mime-type defaults to 'text/plain' or to 'application/octet-stream' when the first ten chars of the file match ascii control chars (white spaces excluded). If the file doesn't exist or isn't readable C<undef> is returned. =item C<inodetype($file)> Returns a mimetype in the 'inode' namespace or undef when the file is actually a normal file. =item C<globs($file)> Returns a mime-type string for C<$file> based on the glob rules, returns undef on failure. The file doesn't need to exist. =item C<describe($mimetype, $lang)> Returns a description of this mimetype as supplied by the mime info database. You can specify a language with the optional parameter C<$lang>, this should be the two letter language code used in the xml files. Also you can set the global variable C<$File::MimeInfo::LANG> to specify a language. This method returns undef when no xml file was found (i.e. the mimetype doesn't exist in the database). It returns an empty string when the xml file doesn't contain a description in the language you specified. I<Currently no real xml parsing is done, it trust the xml files are nicely formatted.> =item C<rehash()> Rehash the data files. Glob information is preparsed when this method is called. =back =head1 DIAGNOSTICS This module throws an exception when it can't find any data files, when it can't open a data file it found for reading or when a subroutine doesn't get enough arguments. In the first case youn either don't have the freedesktop mime info database installed, or your environment variables point to the wrong places, in the second case you have the database installed, but it is broken (the mime info database should logically be world readable). =head1 TODO Make an option for using some caching mechanism to reduce init time. Make L</describe> do real xml parsing ? =head1 BUGS Perl versions prior to 5.8.0 do not have the ':utf8' IO Layer, thus for the default method and for reading the xml files utf8 is not supported for these versions. Since it is not possible to distinguishe between encoding types (utf8, latin1, latin2 etc.) in a straightforward manner only utf8 is supported (because the spec recommends this). Please mail the author when you encounter any other bugs. =head1 AUTHOR Jaap Karssenberg || Pardus [Larus] E<lt>pardus@cpan.orgE<gt> Copyright (c) 2003 Jaap G Karssenberg. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =head1 SEE ALSO L<File::BaseDir>, L<File::MimeInfo::Magic>, L<File::MimeInfo::Rox> =over 4 =item related CPAN modules L<File::MMagic> =item freedesktop specifications used L<http://freedesktop.org/Standards/shared-mime-info-spec>, L<http://freedesktop.org/Standards/basedir-spec> =item freedesktop mime database L<http://freedesktop.org/Software/shared-mime-info> =item other programs using this mime system L<http://rox.sourceforge.net> =cut
Merging code and preparing for an release - only now I see the last one was more then a year ago, if only all my packages were that stable :) Two points though: 1) I'll put the extensions() method in MimeInfo.pm - this won't change the interface, Magic.pm imports from MimeInfo.pm anyway. 2) I'm not putting the priority code in because: a) priorities are assigned to rules, not perse to mimetypes b) the way you use it is a hack and won't be useful for others c) there is an easy work-around: # HACK alert: using special priority number to label mimetypes my @prefered_mimetypes = map { $$_[1] } grep { $$_[0] == 100 } @File::MimeInfo::Magic::magic; if (grep {$mimetype eq $_} @prefered_mimetypes) { # we put it there } If nothing else pops up I'll close this ticket as soon as the release goes to CPAN, if all goes well it will be there before the weekend. Thanks for your input !