Bug #58152 for Image-ExifTool: png tEXt vs iTXt

Fri Jun 04 19:30:30 2010 user42 [...] zip.com.au - Ticket created

Subject:	png tEXt vs iTXt
Date:	Sat, 05 Jun 2010 09:30:10 +1000
To:	bug-Image-ExifTool [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

In the ImageInfo() from a png file, is it possible to tell when a tag was an iTXt as opposed to a tEXt? I hoped to see something about that in the pod of Image::ExifTool::PNG perhaps. I think it's fairly important since a tEXt is supposed to be latin-1 whereas an iTXt is supposed to be utf-8, so you have to know which it is to print or display the bytes correctly (if the ImageInfo always returns bytes). iTXt is rather rare, but it'd be good to be able to do the right thing with it. I suppose there's a language and translated tag name in it too though I'm not particularly interested in those.
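To illustrate the distinction raised above, here is a minimal Python sketch (not ExifTool code) that walks the chunks of a PNG and decodes textual data by chunk type: Latin-1 for tEXt and zTXt, UTF-8 for iTXt. The names `iter_chunks` and `decode_text_chunk` are hypothetical, CRCs are not verified, and the iTXt language tag and translated keyword are parsed but discarded.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_chunks(data):
    """Yield (chunk_type, payload) pairs from raw PNG bytes (CRC not checked)."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + payload + 4 CRC


def decode_text_chunk(ctype, payload):
    """Return (keyword, text) for a tEXt/zTXt/iTXt chunk, or None otherwise."""
    keyword, _, rest = payload.partition(b"\x00")
    if ctype == b"tEXt":
        return keyword.decode("latin-1"), rest.decode("latin-1")
    if ctype == b"zTXt":
        # rest[0] is the compression method byte (0 = deflate)
        return keyword.decode("latin-1"), zlib.decompress(rest[1:]).decode("latin-1")
    if ctype == b"iTXt":
        compressed = rest[0]  # rest[1] is the compression method
        _lang, _, rest = rest[2:].partition(b"\x00")
        _translated_keyword, _, raw = rest.partition(b"\x00")
        if compressed:
            raw = zlib.decompress(raw)
        return keyword.decode("latin-1"), raw.decode("utf-8")
    return None
```

A caller would loop over `iter_chunks(data)` and keep the non-None results of `decode_text_chunk`; the same bytes 0xC3 0xA9 come out as "Ã©" from a tEXt chunk but as "é" from an iTXt chunk, which is why the chunk type matters.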

Sat Jun 05 07:16:56 2010 EXIFTOOL [...] cpan.org - Correspondence added

Thanks for pointing this out. ExifTool should be doing this conversion for you. I will add the ability to translate PNG encodings properly so you don't have to worry about this. Also, I probably should be handling the PNG alternate languages properly (as I am doing with XMP), but maybe it can wait until I actually see an example of this used in the wild. :) - Phil

Sat Jun 05 07:16:57 2010 The RT System itself - Status changed from 'new' to 'open'

Sat Jun 05 07:16:57 2010 EXIFTOOL [...] cpan.org - Given to EXIFTOOL

Mon Jun 07 19:41:52 2010 user42 [...] zip.com.au - Correspondence added

Subject:	Re: [rt.cpan.org #58152] png tEXt vs iTXt
Date:	Tue, 08 Jun 2010 09:40:45 +1000
To:	bug-Image-ExifTool [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

"Phil Harvey via RT" <bug-Image-ExifTool@rt.cpan.org> writes:

> ExifTool should be doing this conversion for you.

I saw the pod "Notes:" bit saying all returns are bytes. I suppose if there's a chance of bad encoding in the file that lets you see the raw stuff. Maybe a conversion would have to be an option to be upwardly compatible.

> I will add the ability to translate PNG encodings properly so you don't have to worry about this.

For myself I just wanted to pick out the Title decoded to perl wide-chars to display (in a mail message as it happens). For bad bytes I would probably go for either escapes or substitutions (Encode::FB_PERLQQ perhaps), but maybe others would want an error throw to not let bad inputs go unnoticed.

> XMP

I don't think I know anything about that. :)
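The three ways of handling bad bytes mentioned in this message (escapes, substitutions, or an error throw) can be sketched in Python as follows. This is only an illustration of the trade-off, not what ExifTool or Perl's Encode module does internally; `decode_with_policy` and its policy names are hypothetical.

```python
def decode_with_policy(raw, encoding="utf-8", policy="escape"):
    """Decode bytes, handling invalid sequences according to `policy`.

    policy is one of:
      "escape"     - keep bad bytes visible as \\xHH escapes
      "substitute" - replace bad bytes with U+FFFD
      "strict"     - raise UnicodeDecodeError so bad input is noticed
    """
    handlers = {
        "escape": "backslashreplace",
        "substitute": "replace",
        "strict": "strict",
    }
    return raw.decode(encoding, errors=handlers[policy])
```

For display in a mail message, "escape" or "substitute" keeps the output readable; "strict" suits callers who would rather fail than silently show damaged text.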

Tue Jun 08 08:30:24 2010 EXIFTOOL [...] cpan.org - Correspondence added 180 min

On Mon Jun 07 19:41:52 2010, user42@zip.com.au wrote:

> I saw the pod "Notes:" bit saying all returns are bytes.

Yes. All this means is that the utf8 flag will not be set on returned strings, even if they are UTF-8.

> I suppose if there's a chance of bad encoding in the file that lets you see the raw stuff. Maybe a conversion would have to be an option to be upwardly compatible.

I agree that this isn't upwardly compatible, but in the past changes like this have solved more problems than they have caused because most people won't be handling the character encoding themselves. And in this case, as you point out, the encoding can't be handled properly because you don't know if it is UTF-8 (iTXt) or Latin (tEXt/zTXt). I have gone ahead and made this change. Also, while I was at it I added alternate language support.

> For myself I just wanted to pick out the Title decoded to perl wide-chars to display (in a mail message as it happens). For bad bytes I would probably go for either escapes or substitutions (Encode::FB_PERLQQ perhaps), but maybe others would want an error throw to not let bad inputs go unnoticed.

ExifTool will convert bad characters to question marks without issuing a warning. - Phil P.S. Sorry about the multiple copies of my last post. That's what I get for using the web interface with an unreliable internet connection.

Wed Jun 09 08:53:02 2010 EXIFTOOL [...] cpan.org - Correspondence added

I have just released version 8.22 which should handle special PNG characters properly. Also, I was wrong about what I said before regarding the warning. A Warning tag ("Malformed UTF-8 character(s)") is generated if ExifTool tries to convert bad UTF-8. However, under normal conditions no conversion of UTF-8 is done since ExifTool's default character set is UTF-8. - Phil

Thu Jun 10 20:14:28 2010 user42 [...] zip.com.au - Correspondence added

Subject:	Re: [rt.cpan.org #58152] png tEXt vs iTXt
Date:	Fri, 11 Jun 2010 10:13:28 +1000
To:	bug-Image-ExifTool [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

"Phil Harvey via RT" <bug-Image-ExifTool@rt.cpan.org> writes:

> All this means is that the utf8 flag will not be set on returned strings, even if they are UTF-8.

Ah, I didn't understand that. I took the pod to mean the return was bytes, raw and undecoded, straight from the image file's format. You might make it clearer by saying the returns are byte strings of utf8 encoded characters recoded where necessary/appropriate/whatever from the image file's native format. Or where the format specifies an encoding, or where the fields are characters, or whatnot ...

Fri Jun 11 07:45:12 2010 EXIFTOOL [...] cpan.org - Correspondence added

I will try to make it clearer, but am struggling a bit with the wording. How is this?: ExifTool returns all values as byte strings of encoded characters, not as Perl character strings with wide characters. For tags which are translated, the encoding is set by the Charset option. See FAQ number 10 in html/faq.html of the ExifTool distribution for more details about character encodings. - Phil

Mon Jun 14 20:09:35 2010 user42 [...] zip.com.au - Correspondence added

Subject:	Re: [rt.cpan.org #58152] png tEXt vs iTXt
Date:	Tue, 15 Jun 2010 10:08:26 +1000
To:	bug-Image-ExifTool [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

"Phil Harvey via RT" <bug-Image-ExifTool@rt.cpan.org> writes:

> ExifTool returns all values as byte strings of encoded characters, not as Perl character strings with wide characters. For tags which are translated, the encoding is set by the Charset option.

I didn't at first understand what "translated" meant. Perhaps:

ExifTool returns all values as byte strings. If the file format has a specified character encoding then strings are re-coded from that to UTF8 bytes, or to bytes of the requested Charset option. If the format doesn't have a specified encoding then the bytes are directly from the file. Perl wide characters are not used.

Is this true though?

> See FAQ number 10 in html/faq.html of the ExifTool distribution for more details about character encodings.

I wonder if some of that would go in the pod somewhere.

Tue Jun 15 07:49:26 2010 EXIFTOOL [...] cpan.org - Correspondence added

Thanks for the good suggestions.

On Mon Jun 14 20:09:35 2010, user42@zip.com.au wrote:

> Is this true though?

This is the way things are going, and the major metadata formats now have this behaviour, but I haven't implemented this for all formats yet. Originally, there was no re-coding of 8-bit character sets. I want to keep this section short, so maybe something like this: "ExifTool returns all values as byte strings. Perl wide characters are not used. See CHARACTER ENCODINGS for details about the encodings." And the CHARACTER ENCODINGS section will duplicate most of the information in FAQ 10. - Phil

Tue Jun 15 11:23:16 2010 EXIFTOOL [...] cpan.org - Correspondence added

And here is POD source for the new section I am considering:

=head1 CHARACTER ENCODINGS

Certain meta information formats allow coded character sets other than plain ASCII. When reading, 8-bit encodings are passed straight through ExifTool without re-coding unless specified otherwise below, and multi-byte encodings are converted according to the L</Charset> option ('UTF8' by default). When writing, the inverse conversions are performed. See the L</Charset> option for a list of valid character sets.

More specific details are given below about how character coding is handled for EXIF, IPTC, XMP, PNG, ID3, PDF and MIE information:

=head2 EXIF

Most textual information in EXIF is stored in ASCII format, and ExifTool does not convert these tags. However it is not uncommon for applications to write UTF-8 or other encodings where ASCII is expected, and ExifTool will quite happily read/write any encoding without conversion.

For a few EXIF tags (UserComment, GPSProcessingMethod and GPSAreaInformation) the stored text may be encoded either in ASCII, Unicode (UCS-2) or JIS. When reading these tags, Unicode and JIS are converted to the character set specified by the L</Charset> option. Other encodings are not converted. When writing, text is stored as ASCII unless the string contains special characters, in which case it is converted from the specified character set and stored as Unicode. ExifTool writes Unicode in native EXIF byte ordering by default, but this may be changed by setting the ExifUnicodeByteOrder tag.

The EXIF "XP" tags (XPTitle, XPComment, etc) are always stored as little-endian Unicode, and are read and written using the specified character set.

=head2 IPTC

The value of the IPTC:CodedCharacterSet tag determines how the internal IPTC string values are interpreted. If CodedCharacterSet exists and has a value of 'UTF8' (or 'ESC % G') then string values are assumed to be stored as UTF-8, otherwise Windows Latin1 (cp1252, 'Latin') coding is assumed by default, but this can be changed with the L</CharsetIPTC> option.

When reading, these strings are converted to the character set specified by the L</Charset> option. When writing, the inverse conversions are performed. No conversion is done if the internal (IPTC) and external (ExifTool) character sets are the same.

Note that ISO 2022 character set shifting is not supported. Instead, a warning is issued and the string is not converted if an ISO 2022 shift code is found. See L<http://www.iptc.org/IIM/> for the official IPTC specification.

=head2 XMP

ExifTool reads XMP encoded as UTF-8, UTF-16 or UTF-32, and converts them all to UTF-8 internally. Also, all XML character entity references and numeric character references are converted. When writing, ExifTool always encodes XMP as UTF-8, converting the following 5 characters to XML character references: E<amp> E<lt> E<gt> E<39> E<quot>. By default no further conversion is performed, however if the L</Charset> option is other than 'UTF8' then text is converted to/from a specified character set when reading/writing.

=head2 PNG

L<PNG TextualData tags|Image::ExifTool::TagNames/"PNG TextualData Tags"> are stored as tEXt, zTXt and iTXt chunks in PNG images. The tEXt and zTXt chunks use ISO 8859-1 encoding, while iTXt uses UTF-8. When reading, ExifTool converts all PNG textual data to the character set specified by the L</Charset> option. When writing, ExifTool generates a tEXt chunk (or zTXt with the L</Compress> option) if the text doesn't contain special characters or if Latin encoding is specified; otherwise an iTXt chunk is used and the text is converted from the specified character set and stored as UTF-8.

=head2 ID3

The ID3v1 specification officially supports only ISO 8859-1 encoding (a subset of Windows Latin1), although some applications may incorrectly use other character sets. By default ExifTool converts ID3v1 text from Latin to the character set specified by the L</Charset> option. However, the internal ID3v1 charset may be specified with the L</CharsetID3> option. The encoding for ID3v2 information is stored in the file, so ExifTool converts ID3v2 text from this encoding to the character set specified by the L</Charset> option. ExifTool does not currently write ID3 information.

=head2 PDF

PDF text strings are stored in either PDFDocEncoding (similar to Windows Latin1) or Unicode (UCS-2). When reading, ExifTool converts to the character set specified by the L</Charset> option. When writing, ExifTool encodes input text from the specified character set as Unicode only if the string contains special characters, otherwise PDFDocEncoding is used.

=head2 MIE

MIE strings are stored as either UTF-8 or ISO 8859-1. When reading, UTF-8 strings are converted according to the L</Charset> option, and ISO 8859-1 strings are never converted. When writing, input strings are converted from the specified character set to UTF-8. The resulting strings are stored as UTF-8 if they contain multi-byte UTF-8 character sequences, otherwise they are stored as ISO 8859-1.
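The PNG writing rule described in the POD above (tEXt or zTXt when the text has no special characters or Latin is specified, otherwise iTXt stored as UTF-8) can be sketched in Python as below. This is a reading of the prose, not ExifTool's implementation: `choose_png_text_chunk` is a hypothetical name, "special characters" is assumed to mean any byte above 0x7F, and ExifTool's 'Latin' charset is approximated here by Python's Latin-1 codec names.

```python
def choose_png_text_chunk(raw, charset="utf-8", compress=False):
    """Pick a PNG text chunk type for `raw`, which holds bytes in `charset`.

    Returns (chunk_type, text_bytes). Only the text portion is returned,
    not a complete chunk payload with keyword and flags.
    """
    is_ascii = all(byte < 0x80 for byte in raw)  # assumption: "no special characters"
    is_latin = charset.lower() in ("latin-1", "latin1", "iso-8859-1")
    if is_ascii or is_latin:
        # Plain ASCII or Latin input is stored unchanged in tEXt/zTXt
        return (b"zTXt" if compress else b"tEXt"), raw
    # Anything else is converted from the caller's charset and stored as UTF-8
    return b"iTXt", raw.decode(charset).encode("utf-8")
```

Under these assumptions, pure-ASCII titles stay in the widely supported tEXt chunk, and iTXt is only used when the text actually needs it.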

Tue Jun 15 19:37:04 2010 user42 [...] zip.com.au - Correspondence added

Subject:	Re: [rt.cpan.org #58152] png tEXt vs iTXt
Date:	Wed, 16 Jun 2010 09:35:51 +1000
To:	bug-Image-ExifTool [...] rt.cpan.org
From:	Kevin Ryde <user42 [...] zip.com.au>

"Phil Harvey via RT" <bug-Image-ExifTool@rt.cpan.org> writes:

> And here is POD source for the new section I am considering:

Looks good.

> When reading, 8-bit encodings are passed straight through ExifTool without re-coding unless specified otherwise below, and multi-byte encodings are converted according to the L</Charset> option ('UTF8' by default).

I wonder if I still need to know whether conversion has been done, ie. whether the string is in the requested Charset, or is raw and perhaps unknown. You'd be tempted to have a Charset=>'wide' or some such for new enough perl to ask for wide returns for converted strings. (As I said first I'm only looking to display a couple of strings in a mail message. The same would apply for getting non-ascii to display nicely in say a gtk gui or whatnot.)

Wed Jun 16 08:58:50 2010 EXIFTOOL [...] cpan.org - Correspondence added

On Tue Jun 15 19:37:04 2010, user42@zip.com.au wrote:

> I wonder if I still need to know whether conversion has been done, ie. whether the string is in the requested Charset, or is raw and perhaps unknown. You'd be tempted to have a Charset=>'wide' or some such for new enough perl to ask for wide returns for converted strings.

If you just treat all returned text as UTF-8 encoded (with Charset set to the default of 'UTF8'), you can't go far wrong. There are some exceptions of course, but if you want to be really safe you can add your own test for valid UTF-8. I really don't want to add something like a Wide setting as you suggest unless absolutely necessary. - Phil
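The "add your own test for valid UTF-8" suggestion might look like the following Python sketch. The function names are hypothetical, and the Latin-1 fallback for invalid input is an assumption made for illustration; nothing in this thread prescribes what to do when the test fails.

```python
def is_valid_utf8(raw):
    """Return True if `raw` decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_text(raw, fallback="latin-1"):
    """Treat bytes as UTF-8 when valid, otherwise decode with `fallback`.

    The fallback choice is an assumption: Latin-1 never fails to decode,
    so the caller always gets a displayable string.
    """
    if is_valid_utf8(raw):
        return raw.decode("utf-8")
    return raw.decode(fallback)
```

Note that pure ASCII passes the test trivially, and a short Latin-1 string can occasionally also be valid UTF-8 by coincidence, so the check is a safety net and not proof of the original encoding.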

Wed Jul 14 08:01:58 2010 EXIFTOOL [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Jul 14 08:01:58 2010 EXIFTOOL [...] cpan.org - Fixed in 8.25 added