
This queue is for tickets about the Git-Repository-Plugin-Log CPAN distribution.

Report information
The Basics
Id: 97045
Status: open
Priority: 0/
Queue: Git-Repository-Plugin-Log

People
Owner: Nobody in particular
Requestors: RCAPUTO [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 1.309
  • 1.310
  • 1.311
Fixed in: (no value)



Subject: Git::Repository::Log::Iterator is not decoding author names (more?)
git log for my project Dist::Zilla::Plugin::ChangeLogFromGit includes this change:

commit 8507a954f6e8096802a80299233b0fb3345eae26
Author: André Santos <andrefs@cpan.org>
Date: Wed Sep 26 21:58:19 2012 +0100

The log objects returned by the iterator include UTF-8 octets rather than decoded characters. For example:

Andr\x{c3}\x{a9} Santos

I understand that getting the decodes right will be hard without character set information. Maybe the iterator can be given an expected character set during construction?
On Tue Jul 08 03:10:02 2014, RCAPUTO wrote:
> git log for my project Dist::Zilla::Plugin::ChangeLogFromGit includes
> this change:
>
> commit 8507a954f6e8096802a80299233b0fb3345eae26
> Author: André Santos <andrefs@cpan.org>
> Date: Wed Sep 26 21:58:19 2012 +0100
>
> The log objects returned by the iterator include UTF-8 octets rather
> than decoded characters. For example:
>
> Andr\x{c3}\x{a9} Santos
>
> I understand that getting the decodes right will be hard without
> character set information. Maybe the iterator can be given an
> expected character set during construction?
Actually, git supports setting a commit encoding, and the default is utf8.

I suppose I should explicitly decode the octet stream, using utf8 (default) or the provided encoding.

I'll try to reproduce the bug, thanks for the report.

-- BooK
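For illustration, here is a minimal sketch of the explicit decoding described above, using the core Encode module. The helper name `decode_field` is hypothetical and is not part of Git::Repository::Plugin::Log. It only shows the idea of decoding an octet string with UTF-8 as the default and an optional caller-supplied encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octets as reported in the ticket: "André Santos" encoded as UTF-8.
my $octets = "Andr\xc3\xa9 Santos";

# Hypothetical helper (not part of the distribution): decode a field
# using the given encoding, falling back to UTF-8 when none is given.
sub decode_field {
    my ( $bytes, $encoding ) = @_;
    $encoding = 'UTF-8' unless defined $encoding;
    return decode( $encoding, $bytes );
}

# Default case: the octets are UTF-8.
my $name = decode_field($octets);

# Same name stored as Latin-1 octets, decoded with an explicit encoding.
my $latin1_name = decode_field( "Andr\xe9 Santos", 'ISO-8859-1' );

# Two octets collapse into one character once decoded.
printf "%d octets, %d characters\n", length $octets, length $name;
```

Both calls yield the same character string, with é as the single code point U+00E9.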
On Wed Jul 16 11:27:37 2014, BOOK wrote:
> Actually, git supports setting a commit encoding, and the default is
> utf8.
>
> I suppose I should explicitly decode the octet stream, using utf8
> (default) or the provided encoding.
>
> I'll try to reproduce the bug, thanks for the report.
Some of the tests in that repository use a specially generated git repository with as many edge cases as I know of. A version of the repository is published at https://github.com/book/git-test-repository .

So I started by adding some commits with an author name containing accented letters, and saved the commits in both utf-8 and latin-1.

As expected, when the commit encoding is provided, git properly outputs the author name in utf-8, unless the environment declares another encoding (e.g. latin-1).

In other words, the encoding used by git when running git log depends on the encoding declared in the environment. Which I think means that the iterator could indeed be told what encoding to expect (if not UTF-8), and do the decoding accordingly.

I'll keep you posted.

-- BooK
On Wed Jul 16 12:29:33 2014, BOOK wrote:
> In other words, the encoding used by git when running git log depends
> on the encoding declared in the environment. Which I think means that
> indeed the iterator could be told what encoding to expect (if not
> UTF-8), and do the decoding accordingly.
So, the good news is: you can easily know which encoding to use for the data provided by git. And you can also force it by setting the environment in which the git commands will be run. (In my test script I set LC_ALL=C, which forces git to use UTF-8.)

The fields in Git::Repository::Log that are affected by this are: author, author_name, committer, committer_name, raw_message, message, subject and body.

I tend to think that it's not the iterator's job to do the decoding.

If you had access to the filehandle from which the data is read, you could probably do something like binmode $it->{fh}, ':encoding(utf8)'.

I'll make the filehandle available in the next version.

Regards,

-- BooK
On Mon Jul 21 12:58:42 2014, BOOK wrote:
> I tend to think that it's not the iterator's job to do the decoding.
>
> If you had access to the filehandle from which the data is read, you could
> probably do something like binmode $it->{fh}, ':encoding(utf8)'.
>
> I'll make the filehandle available in the next version.
Git-Repository-Plugin-Log 1.312 is on CPAN, and the Git::Repository::Log::Iterator object has a 'fh' attribute on which you should be able to call binmode.

Let me know if that works for you.

The next step would be to add an 'encoding' attribute to System::Command (similar to what Sys::Cmd, a fork of System::Command, does), but I'd rather leave things open for the user to decide.

-- BooK
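To show what the binmode call does without needing a git repository, here is a small self-contained sketch. The in-memory filehandle is an assumption standing in for the iterator's 'fh' attribute; with 1.312 the same binmode call would presumably be made on $it->{fh} before reading any log entries.

```perl
use strict;
use warnings;

# Stand-in for the iterator's filehandle (an assumption for this sketch):
# an in-memory handle over UTF-8 octets, as git log might emit them.
my $octets = "Author: Andr\xc3\xa9 Santos\n";
open my $fh, '<', \$octets or die "Cannot open in-memory handle: $!";

# The call suggested in this ticket, applied to the stand-in handle.
# Everything read from $fh afterwards is decoded from UTF-8.
binmode $fh, ':encoding(UTF-8)';

my $line = <$fh>;
chomp $line;
close $fh;

# The two octets of the accented letter are now one character.
printf "%d characters\n", length $line;
```

Using ':encoding(UTF-8)' rather than ':encoding(utf8)' selects Perl's strict UTF-8 decoder, which rejects malformed sequences; either spelling decodes well-formed input the same way.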