Bug #113716 for Image-Libpuzzle: signature_as_char_string possible not correct at all

Tue Apr 12 16:09:24 2016 gnu.oracle [...] gmail.com - Ticket created

CC:	estrabd [...] cpan.org
Subject:	signature_as_char_string possible not correct at all
Date:	Tue, 12 Apr 2016 23:09:10 +0300
To:	bug-Image-Libpuzzle [...] rt.cpan.org
From:	Сергей Лукашевич <gnu.oracle [...] gmail.com>

Hi!

I found your library useful, but noticed that the 'signature_as_char_string' method is not correct. It treats the cvec as UNSIGNED char because you use unpack("C*"):

    # from lib/Image/Libpuzzle.pm
    # uses unpack as bin to char and $self accessor to get signature directly from the internal cvec
    sub signature_as_char_string {
        my $self = shift;
        my @sig = unpack("C*", $self->get_signature());
        my $sig = q{};
        foreach my $i (@sig) {
            $sig .= sprintf("%02d", $i);
        }
        return $sig;
    }

But the cvec is an array of SIGNED bytes with values between -2 and 2 (5 possible values: -2, -1, 0, 1, 2). See the original typedef from puzzle.h:

    typedef struct PuzzleCvec_ {
        size_t sizeof_vec;
        signed char *vec;
    } PuzzleCvec;

As a result, signature_as_char_string yields chars in the range ['0'..'5'] (6 possible values). And what is probably worse, the length of its character output varies from one image to another (is the "%02d" format not working as expected?), even though the binary cvecs all have the same length. This makes character-string cvecs (and ngrams made from them) probably unusable for image indexing. At the least, using them would not be the correct way to index images.

I know there can be cases where interpreting SIGNED bytes as UNSIGNED makes sense, but I think this time it is wrong. At the least, the char cvec length should not vary, yet it changes from one image to another (check length($it)).

A better idea might be to interpret the cvec as SIGNED numbers and add +2 to each of them. Then we get a range of ['0'..'4'], which fits in one digit, not two.

The best idea would probably be to use characters other than the digits 0-4 to encode cvecs (A-Z, a-z, etc.). Then a word INDEX composed from ngrams would be significantly better.

--
best regards
Sergey Lukashevich
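The effect described in the report can be reproduced without libpuzzle. The following is only a sketch: the five-byte string is a hypothetical stand-in for a real cvec, built with pack("c*") so that it holds the signed values -2..2. It compares the unsigned "C*" unpack with "%02d" against a signed "c*" unpack with the +2 offset suggested above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a libpuzzle cvec: signed bytes in -2..2.
my @values = (-2, -1, 0, 1, 2);
my $cvec   = pack("c*", @values);

# Unsigned unpack reads the bytes for -2 and -1 as 254 and 255,
# so "%02d" emits three characters for them and two for the rest.
my @unsigned = unpack("C*", $cvec);
my $buggy    = join "", map { sprintf("%02d", $_) } @unsigned;

# Signed unpack plus an offset of 2 keeps every value in 0..4,
# which is exactly one character per cvec entry.
my @signed = unpack("c*", $cvec);
my $fixed  = join "", map { $_ + 2 } @signed;

print "unsigned values: @unsigned\n";
print "buggy string:    $buggy (", length($buggy), " chars)\n";
print "fixed string:    $fixed (", length($fixed), " chars)\n";
```

Because the number of negative entries differs between images, the length of the "%02d" string would differ as well, which matches the varying lengths reported here.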

Wed May 25 12:38:16 2016 estrabd [...] gmail.com - Correspondence added

Thank you for letting me know. This is my first attempt at such a module, and your comments are helpful. It might also explain some issues I've had in using it.

Would you PLEASE create a github issue for this?

https://github.com/estrabd/Image-Libpuzzle/issues

Also, please see this issue - it may very well be caused by the issue you're bringing up.

https://github.com/estrabd/Image-Libpuzzle/issues/5

Brett

Wed May 25 12:38:16 2016 The RT System itself - Status changed from 'new' to 'open'

Wed May 25 14:12:55 2016 estrabd [...] gmail.com - Correspondence added

Hi, after thinking more about your report I came up with this change. I think it is what you meant:

https://github.com/estrabd/Image-Libpuzzle/commit/43cdae1ed5fe6990900256cca05ccf5b026aeea0

Can you please review/test that and give me some feedback? If it is correct, I will push out a new version to CPAN with the fix.

Thank you for your report.

Wed May 25 16:15:21 2016 gnu.oracle [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113716] signature_as_char_string possible not correct at all
Date:	Wed, 25 May 2016 23:15:08 +0300
To:	bug-Image-Libpuzzle [...] rt.cpan.org
From:	Сергей Лукашевич <gnu.oracle [...] gmail.com>

Thank you for the answer. Sorry, I am too lazy to use github, but I will check your changes in a few days.

Personally, I like using your module like this:

    my @letters = split(//, "ABCDEFGHIJKLMNOPQRSTUWXYZ?");
    my %letter_hash;
    my $hash_ind = 0;

    # map each pair of values (0..4, 0..4) to one letter
    for (my $i = 0; $i <= 4; $i++) {
        for (my $j = 0; $j <= 4; $j++) {
            $letter_hash{ ($i << 4) + $j } = $letters[$hash_ind++];
        }
    }

    sub signature_as_char_string2 {
        # signed unpack, shifted from -2..2 into 0..4
        my (@signature) = map($_ + 2, unpack("c*", $_[0]));
        my $octets = "";
        my $i;
        for ($i = 0; $i < $#signature; $i += 2) {
            my $ind = ($signature[$i] << 4) + ($signature[$i+1]);
            $octets .= $letter_hash{$ind};
        }
        return $octets;
    }

    $str = signature_as_char_string2($pic->fill_cvec_from_file($file));

As a result I get pretty letter strings, all of the same length, like this:

AJYSABEERXZXTZJZUKBKJIQQWSIIBGPWDJFKDXAAAFAAYUJEUPTSGQFGSTYSGAAFYODPZZTZZWYZYUYZBPPMBGIGSQYNSIHIQWDDFUJSIAGWQAUWECPKSNNQIPGGIDGIAPSJEPYTZGGPFBTYUJBFFSGSQFBZZYRGABAWEDRYZTZTWZZSUZTBFFGGQJIWQQIJOIMYWCJKKDSAABFABYUJDPPSSIPFGSOYSAAAAHJDPZZOTZXYZYUYZBFFAGQIGQYLWLWDEGZAFPEJEPSQ

I do not yet know whether these strings are very useful for comparing images. I will try to investigate it further.

-- Sergey Lukashevich
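Since the letter strings all have the same length, one simple way to compare two of them is to count the positions where they differ. The sketch below is an illustration only: hamming_distance is a hypothetical helper, not part of Image::Libpuzzle, and the two short inputs are made up. Whether such a count tracks visual similarity is exactly the open question raised in the message above.

```perl
use strict;
use warnings;

# Hypothetical helper: count positions where two equal-length
# signature strings differ.
sub hamming_distance {
    my ($x, $y) = @_;
    die "signatures differ in length\n" if length($x) != length($y);
    my $diff = 0;
    for my $i (0 .. length($x) - 1) {
        $diff++ if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $diff;
}

# Made-up inputs that differ at two positions.
my $distance = hamming_distance("AJYSABEE", "AJYSXBEF");
print "distance: $distance\n";
```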

Wed May 25 17:40:05 2016 estrabd [...] gmail.com - Correspondence added

No problem =) you waited 6 weeks for a response, so I can't expect you to jump on a change.

Based on your code, I think I fixed unpack as you had described. I am happy to add in the code you have that creates an A-Z representation if you find that it works for you.

Please let me know what you find. I will be afk until next week, then I will start looking at how this change affects indexing and will likely push out another release to CPAN.

Cheers,
Brett

Wed May 25 17:41:46 2016 estrabd [...] gmail.com - Correspondence added

Also, don't worry about creating a separate issue in Github. RT is fine.

Mon Jun 06 15:23:32 2016 gnu.oracle [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113716] signature_as_char_string possible not correct at all
Date:	Mon, 6 Jun 2016 22:23:18 +0300
To:	bug-Image-Libpuzzle [...] rt.cpan.org
From:	Сергей Лукашевич <gnu.oracle [...] gmail.com>

Well, as far as I can see, using the "%02d" format instead of just "%d" is wrong. The main idea of libpuzzle is to get the narrowest possible fingerprint of an image. Adding extra zeros when only one digit [0-4] is used works against that. So please fix it if you agree.

Next, a comment on my attempt to narrow the image fingerprint using letters of the alphabet, as I mentioned before. Obviously this trick makes the image signature half as long as the one from signature_as_char_string. That is good for storing it in a database, and it still allows signatures to be compared effectively using byte-by-byte comparison or Text::Levenshtein. But such an alphabet format is not as effective as the [0-4] format when indexing libpuzzle signatures as words for a quick check (see http://stackoverflow.com/questions/9703762/libpuzzle-indexing-millions-of-pictures for the idea of indexing). One cannot index half of a letter (4 bits), only a whole letter (4*2 bits). That's the difference. Additionally, Levenshtein distances would look different but would still be meaningful.

You can safely ignore my last paragraph if you are not planning to implement this kind of "signature compression".

-- Sergey Lukashevich
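To make the indexing argument concrete, here is a minimal sketch of cutting a one-digit-per-value signature into overlapping fixed-length words, each tagged with its starting offset. The helper name signature_words, the word length of 4, the "offset:word" layout, and the sample signature are all assumptions for illustration, loosely modeled on the word-index idea in the linked Stack Overflow question; none of this is part of Image::Libpuzzle.

```perl
use strict;
use warnings;

# Hypothetical helper: split a digit signature into overlapping words,
# each prefixed with the offset at which it starts.
sub signature_words {
    my ($sig, $word_len) = @_;
    my @words;
    for my $pos (0 .. length($sig) - $word_len) {
        push @words, $pos . ":" . substr($sig, $pos, $word_len);
    }
    return @words;
}

# Made-up signature with one digit (0..4) per cvec entry.
my @words = signature_words("0123443210", 4);
print "$_\n" for @words;
```

With one digit per cvec entry, a word can start at any entry. With the letter encoding each character covers two entries, so words can only start at every second entry, which is the granularity difference described above.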

Wed Jun 08 13:56:57 2016 estrabd [...] gmail.com - Correspondence added

On Mon Jun 06 15:23:32 2016, gnu.oracle@gmail.com wrote: Show quoted text

> Well, as I can see you are wrong using the "%02d" format instead of > just > "%d". The main idea of libpuzzle is to get the maximum possible narrow > fingerprint of an image. It is not the case when one adds extra nulls > when > only one digit [0-4] used. So please fix it if you agree. >

Thanks! Yes, I noticed this before merging to master yesterday. https://github.com/estrabd/Image-Libpuzzle/commit/c93502487b8190199fac91b5546ecc5e190b1ec9 Show quoted text

> Next I can comment my attemp to narrow the image fingerprint using > alphabet > (letters) as I mentioned before. Obviously this hint makes image > signature > twice shorter than one from signature_as_char_string. Which is good > for > storing it in a database and it still allows effectively comparing > signatures using byte-to-byte comparison or Text::Levenshtein. But > such > alphabet format of a signature not as effective as [0-4] signature > format > when indexing libpuzzle signatires as words for quick check (see > http://stackoverflow.com/questions/9703762/libpuzzle-indexing- > millions-of-pictures > for the idea of indexing). One cannot index half of a letter (4 bits), > just > a single letter (4*2 bits). That's the difference. Additionally > Levenshtein > distances would look different but still meaningfull. > > You can safely ignore my last paragraph if you are not planning to > implement such king of "signature compression".

While I do not require compression, if you have a way to compress in a manner that would allow for indexing/comparing (like in the millions of images post), I would most definitely welcome a pull request. Off topic, but I filed an issue upstream (libpuzzle itself) regarding inconsistencies observed when comparing scaled images: https://github.com/jedisct1/libpuzzle/issues/16 Thanks again for this report. If you're so included, would you agree that this issue has been resolved? I'll try to get a new release out soon if so. Cheers, Brett Show quoted text

> > > 2016-05-26 0:41 GMT+03:00 B. D. Estrade via RT < > bug-Image-Libpuzzle@rt.cpan.org>: >

> > <URL: https://rt.cpan.org/Ticket/Display.html?id=113716 > > > > > Also, don't worry about creating a separate issue in Github. RT is > > fine. > > > > On Wed May 25 17:40:05 2016, ESTRABD wrote:

> > > No problem =) you waited 6 weeks for a response, I can't expect you > > > to > > > jump on a change. > > > > > > I think based on your code, that I fixed unpack as you had > > > described. > > > I am happy to add in the code you have that creates an A-Z > > > representation if you find that it works for you. > > > > > > Please let me know what you find. I will be afk until next week, > > > then > > > I will start looking at how this change affects indexing and will > > > likely push out another release to CPAN. > > > > > > Cheers, > > > Brett > > > > > > On Wed May 25 16:15:21 2016, gnu.oracle@gmail.com wrote:

> > > > Thank you for an answer. Sorry, I am too lazy to use github, but I
> > > > will check your changes in a few days.
> > > >
> > > > Personally I like using your module like that:
> > > >
> > > > my @letters=split(//,"ABCDEFGHIJKLMNOPQRSTUWXYZ?");
> > > > my %letter_hash;
> > > > my $hash_ind=0;
> > > >
> > > > for(my $i=0;$i<=4;$i++) {
> > > >     for(my $j=0;$j<=4;$j++) {
> > > >         $letter_hash{ ($i<<4)+$j } = $letters[$hash_ind++];
> > > >     }
> > > > }
> > > >
> > > > sub signature_as_char_string2 {
> > > >     my(@signature)=map($_+2,unpack("c*", $_[0]));
> > > >     my $octets="";
> > > >     my $i;
> > > >     for($i=0; $i<$#signature; $i+=2) {
> > > >         my $ind=($signature[$i]<<4) + ($signature[$i+1]);
> > > >         $octets .= $letter_hash{($signature[$i]<<4) + ($signature[$i+1])};
> > > >     }
> > > >     return $octets;
> > > > }
> > > >
> > > > $str = signature_as_char_string2($pic->fill_cvec_from_file($file));
> > > >
> > > > As a result I receive pretty letter strings of the same length, like
> > > > this:
> > > >
> > > > AJYSABEERXZXTZJZUKBKJIQQWSIIBGPWDJFKDXAAAFAAYUJEUPTSGQFGSTYSGAAFYODPZZTZZWYZYUYZBPPMBGIGSQYNSIHIQWDDFUJSIAGWQAUWECPKSNNQIPGGIDGIAPSJEPYTZGGPFBTYUJBFFSGSQFBZZYRGABAWEDRYZTZTWZZSUZTBFFGGQJIWQQIJOIMYWCJKKDSAABFABYUJDPPSSIPFGSOYSAAAAHJDPZZOTZXYZYUYZBFFAGQIGQYLWLWDEGZAFPEJEPSQ
> > > >
> > > > Yet I do not know whether these strings are very useful for comparing
> > > > images. I will try to investigate it further.
> > > >
> > > >
> > > > 2016-05-25 21:12 GMT+03:00 B. D. Estrade via RT <bug-Image-Libpuzzle@rt.cpan.org>:
> > > >

> > > > > <URL: https://rt.cpan.org/Ticket/Display.html?id=113716 >
> > > > >
> > > > > Hi, after thinking more about your report I came up with this change.
> > > > > I think it is what you were meaning:
> > > > >
> > > > > https://github.com/estrabd/Image-Libpuzzle/commit/43cdae1ed5fe6990900256cca05ccf5b026aeea0
> > > > >
> > > > > Can you please review/test that and provide me with some feedback. If
> > > > > it is then correct, I will push out a new version to CPAN with the fix.
> > > > >
> > > > > Thank you for your report.
> > > > >
> > > > > On Tue Apr 12 16:09:24 2016, gnu.oracle@gmail.com wrote:

> > > > > > Hi!
> > > > > >
> > > > > > I found useful your library but noticed the
> > > > > > 'signature_as_char_string' method is not correct. It treats cvec
> > > > > > as UNSIGNED char because you use unpack("C*"):
> > > > > >
> > > > > > # from lib/Image/Libpuzzle.pm
> > > > > >
> > > > > > # uses unpack as bin to char and $self accessor to get signature
> > > > > > # directly from the internal cvec
> > > > > > sub signature_as_char_string {
> > > > > >     my $self = shift;
> > > > > >     my @sig = unpack("C*", $self->get_signature());
> > > > > >     my $sig = q{};
> > > > > >     foreach my $i (@sig) {
> > > > > >         $sig .= sprintf("%02d", $i);
> > > > > >     }
> > > > > >     return $sig;
> > > > > > }
> > > > > >
> > > > > > but cvec is an array of SIGNED bytes having values between -2 and 2
> > > > > > (5 possible values: -2,-1,0,1,2) -- see original typedef from the
> > > > > > puzzle.h:
> > > > > >
> > > > > > typedef struct PuzzleCvec_ {
> > > > > >     size_t sizeof_vec;
> > > > > >     signed char *vec;
> > > > > > } PuzzleCvec;
> > > > > >
> > > > > > As a result signature_as_char_string yields chars in range of
> > > > > > ['0'..'5'] (6 possible values). And what is probably worse -- its
> > > > > > character output length varies from one image to another
> > > > > > (printf("%02d") not works as expected?). Though binary cvecs all
> > > > > > have same length. This fact makes character string cvecs (and
> > > > > > ngrams made from such char cvecs) not probably usable for image
> > > > > > indexing. At least using them would not be the correct way of
> > > > > > indexing images.
> > > > > >
> > > > > > I know there could be cases when interpreting SIGNED bytes as
> > > > > > UNSIGNED make sense. But I think this time you are wrong. At least
> > > > > > char cvec length should not vary. But it changes from one image to
> > > > > > another (check length($it)).
> > > > > >
> > > > > > Might be a better idea is interpreting cvecs as SIGNED numbers but
> > > > > > do add +2 to all of them. Then we get a range of ['0'..'4'] which
> > > > > > best fits in only one digit, not two.
> > > > > >
> > > > > > Best idea would probably be using another chars but digits 0-4 to
> > > > > > encode cvecs (A-Z,a-z, etc). Then word INDEX composed from ngrams
> > > > > > would be significantly better.
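
The varying length described in the original report can be reproduced without the library. The following is only a minimal sketch: it assumes a hand-built five-byte string standing in for a real cvec, and the sample bytes and variable names are illustrative, not taken from Image::Libpuzzle.

```perl
use strict;
use warnings;

# Hypothetical sample: five signed bytes covering the cvec value range -2..2.
my $cvec = pack("c*", -2, -1, 0, 1, 2);

# Unsigned template: the bytes for -2 and -1 are read back as 254 and 255.
my @unsigned = unpack("C*", $cvec);
my $old = join "", map { sprintf("%02d", $_) } @unsigned;

# Signed template plus an offset of 2: every value maps to one digit 0..4.
my @signed = unpack("c*", $cvec);
my $new = join "", map { $_ + 2 } @signed;

print "unsigned: @unsigned -> $old (", length($old), " chars)\n";
print "signed:   @signed -> $new (", length($new), " chars)\n";
```

With the unsigned template, -2 and -1 become 254 and 255, which "%02d" prints as three characters each, so the string length depends on how many negative values a signature contains. The signed template with the +2 offset produces exactly one digit per byte.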

Wed Jun 08 14:59:48 2016 gnu.oracle [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113716] signature_as_char_string possible not correct at all
Date:	Wed, 8 Jun 2016 21:59:30 +0300
To:	bug-Image-Libpuzzle [...] rt.cpan.org
From:	Сергей Лукашевич <gnu.oracle [...] gmail.com>

Message body is not shown because it is too large.

Message body is not shown because sender requested not to inline it.

Download cvec1.sh
application/x-sh 132b

Message body not shown because it is not plain text.

Download cvec.sh
application/x-sh 181b

Message body not shown because it is not plain text.

Wed Jun 08 15:40:34 2016 estrabd [...] gmail.com - Correspondence added

Message body is not shown because it is too large.