
This queue is for tickets about the Redis-JobQueue CPAN distribution.

Report information
The Basics
Id: 87807
Status: rejected
Priority: 0/
Queue: Redis-JobQueue

People
Owner: Nobody in particular
Requestors: victor [...] vsespb.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: possible issue with detecting utf8
I see the following code here:
https://metacpan.org/source/SGLADKOV/Redis-JobQueue-1.03/lib/Redis/JobQueue.pm#L1000

    elsif ( $method eq 'HSET' and !$self->_redis->{encoding} and utf8::is_utf8( $_[2] ) )
    {
        # For non-serialized fields: UTF8 can not be transferred to the Redis server in mode of 'encoding => undef'
        confess $self->_error( E_MISMATCH_ARG )." (utf8 in $_[1])";
    }

The thing is, plain 7-bit ASCII data can have the UTF-8 flag on.

Example:

    use strict;
    use warnings;
    use utf8;

    my $utfstr = "\x{442}\x{435}\x{441}\x{442}";
    my $s = "x $utfstr";

    my ($ascii_u, undef) = split (' ', $s);

    die "its not ascii" unless $ascii_u eq 'x';
    die "utf8 on" if utf8::is_utf8($ascii_u);

    __END__

This dies with the "utf8 on" message.
RT-Send-CC: victor [...] vsespb.ru
This is not an issue with UTF-8 detection in Redis::JobQueue; it is implemented this way by design, combined with the way Perl handles UTF-8 strings internally. In the provided example, any string produced by a string operation on a UTF-8 string will have the UTF8 flag set, even if the resulting string doesn't contain any UTF-8-specific characters.

By design, Redis::JobQueue uses freeze before storing job data in Redis (the workload and result containers). This ensures, among other things, that UTF-8-encoded strings are safe when passed this way. Custom-named fields, however, are not processed in any way and are passed to Redis as-is. They are designed as an easy and fast way for a software developer to store some internal or supplemental data among the job details.

As a workaround for this behavior you can do one of the following:
- forcefully downgrade the string to ASCII (see perldoc utf8) before attempting to pass it to Redis::JobQueue as a custom-named field
- use freeze (Storable) before passing it to Redis
- store such a string as part of the workload / result data structures

On Tue Aug 13 05:06:41 2013, vsespb wrote:
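The first workaround can be sketched roughly as follows. This is a minimal illustration, not code from Redis::JobQueue: the helper name `to_bytes_or_undef` is hypothetical, and only the core `utf8::downgrade` and `utf8::is_utf8` functions are used. Note that `utf8::downgrade` succeeds for any string whose characters are all at or below 0xFF, not only for 7-bit ASCII.

```perl
use strict;
use warnings;

# Reproduce the situation from the report: an ASCII-only string that
# carries the UTF8 flag because it was split off a character string.
my $wide = "x \x{442}\x{435}\x{441}\x{442}";
my ($field) = split ' ', $wide;

# Hypothetical helper (not part of Redis::JobQueue): return a copy of the
# value with the UTF8 flag cleared, or undef if the value holds characters
# above 0xFF and therefore cannot be represented as plain bytes.
sub to_bytes_or_undef {
    my ($value) = @_;
    my $copy = $value;    # work on a copy, leave the caller's scalar alone
    # Second argument 1 is FAIL_OK: return false instead of dying on failure.
    return utf8::downgrade( $copy, 1 ) ? $copy : undef;
}

my $safe = to_bytes_or_undef($field);
if ( defined $safe ) {
    print "ok to pass as a custom-named field\n";
}
else {
    print "contains wide characters; encode or serialize it first\n";
}
```

Working on a copy avoids silently changing the caller's data, which is the documentation concern raised later in this thread.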
From: victor [...] vsespb.ru
> In provided example any string produced with a string operand from an UTF8 string will have UTF8 flag set on it, even if the resulting string doesn't contain any UTF-8 specific characters.
Yes, agreed. That was the point of this example.
> - forcefully downgrade string to ASCII (see perldoc utf8)
The point was that the Redis::JobQueue code itself should try to downgrade the string, because the programmer cannot really control whether his ASCII string carries the UTF-8 flag or not (this is shown in the example).

That is why, by the way, use of utf8::is_utf8() is described as an indication of a wrong workflow. From perlunifaq:
> Please, unless you're hacking the internals, or debugging weirdness, don't think about the UTF8 flag at all.
> That means that you very probably shouldn't use is_utf8, _utf8_on or _utf8_off at all.
Also, in the core module Digest::SHA::PurePerl you can see similar code which uses utf8::downgrade() (note that such use had better be mentioned in the documentation if input parameters are altered and downgraded).

On Wed Aug 14 02:37:59 2013, SGLADKOV wrote:
The Digest::SHA example does not apply here: it uses user-supplied data one-way only, to compute a digest, and the original data is never received back. In our case, the user sends data to the job queue and then gets it back.

Workload and result are fine with Unicode: they are properly serialized by Storable and stored as bytes in Redis; Storable takes care of Unicode etc. Metadata, however, is not serialized, for performance and convenience reasons, and is stored in Redis as-is. If there is Unicode in the metadata, we have the following options:

1. Assume everything is Unicode, turn on utf-8 encoding in the Redis.pm settings and take a substantial performance hit; as we already store the biggest parts of the job data (workload and result) serialized, encoding and decoding them again is not a good idea.
2. Assume that all metadata is Unicode, and encode and decode it; this may lead to subtle errors if the user provides metadata which is binary, not Unicode.
3. Detect Unicode metadata and store a "utf-8" flag along with the metadata in Redis, so that only utf-8 metadata is decoded when it is requested by the user. This makes metadata management more complicated.
4. Assume that metadata is for the application's internal use, and that the application must ensure that it does not contain Unicode; if Unicode is really needed, it should either be stored in the workload or result, or the application must take care of encoding and decoding Unicode metadata before sending it to the job queue. The job queue will throw an exception if Unicode metadata is encountered.

We chose (4) as it is consistent, does not degrade performance and does not cause subtle errors with damaged data.

On Tue Aug 13 18:49:15 2013, vsespb wrote:
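Under option (4), the application encodes Unicode metadata to bytes itself before handing it to the queue and decodes it after reading it back. A minimal sketch of that round trip, assuming the standard Encode module; the variable names are illustrative and the actual storing and fetching of the field is left out:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Character string with non-ASCII content (the Russian word from the report).
my $title = "\x{442}\x{435}\x{441}\x{442}";

# Before handing the value to the queue: turn characters into UTF-8 bytes.
# The resulting byte string does not have the UTF8 flag set, so a
# flag-based check such as utf8::is_utf8() accepts it.
my $stored = encode_utf8($title);

# ... $stored would be written as a metadata field and read back later ...

# After reading the value back: turn the bytes into characters again.
my $restored = decode_utf8($stored);

if ( $restored eq $title && !utf8::is_utf8($stored) ) {
    print "round trip ok\n";
}
```

This keeps the queue itself free of any encoding logic, at the cost of every producer and consumer having to agree on which fields are encoded.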
From: victor [...] vsespb.ru
    - and utf8::is_utf8( $_[2] )
    + and utf8::is_utf8( $_[2] ) and length($_[2]) != bytes::length($_[2])

This does not decrease performance at all (because length() and bytes::length() are usually never executed). I assume utf8::downgrade does not decrease performance either (if the utf8 flag is off, or at least if the data is ASCII).

And the current code (4) is simply broken, because the programmer really cannot _control_ whether his ASCII data has the utf flag or not. Any ASCII data can have the utf flag on (depending on how it was processed before).

On Wed Aug 14 18:50:37 2013, SGLADKOV wrote:
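The condition proposed in the patch above can be tried in isolation. The following is only a sketch: the predicate name `has_non_ascii_chars` is hypothetical and this is not the module's actual code. It relies on `bytes::length`, which is available after loading the core `bytes` module without importing it.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

# Hypothetical predicate mirroring the proposed condition: true only when
# the string has the UTF8 flag set AND its internal byte length differs
# from its character length, i.e. it really holds non-ASCII characters.
sub has_non_ascii_chars {
    my ($value) = @_;
    return ( utf8::is_utf8($value)
          && length($value) != bytes::length($value) ) ? 1 : 0;
}

# ASCII-only string that carries the flag, as in the original report.
my ($ascii_u) = split ' ', "x \x{442}\x{435}\x{441}\x{442}";

printf "flagged ASCII rejected: %s\n", has_non_ascii_chars($ascii_u)    ? 'yes' : 'no';
printf "Cyrillic rejected:      %s\n", has_non_ascii_chars("\x{442}")   ? 'yes' : 'no';
```

Because the length comparison sits behind the flag test, it only runs for flagged strings. One difference from utf8::downgrade is that this condition also rejects flagged strings with characters in the 0x80 to 0xFF range, which downgrade would accept.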