Bug #91640 for REST-Neo4p: UTF-8 in REST::Neo4p::Query doesn't work? What I'm doing wrong?

Tue Dec 24 12:48:02 2013 stesin [...] gmail.com - Ticket created

Subject:

UTF-8 in REST::Neo4p::Query doesn't work? What I'm doing wrong?

Dear Mark Allen Jensen,

Merry Christmas, and thank you so much for the really useful module REST::Neo4p!

I have a question which is not covered by the documentation. As you certainly know, UTF-8 is a somewhat "native" encoding for Perl, so I write my Perl scripts on Windows using a UTF-8 capable text editor. I reopen the console with

    open STDERR, '>:encoding(cp1251)', 'CON:' or die...

which does the job, and naturally all Cyrillic string literals in my scripts are UTF-8. This works OK for me.

But as soon as I started learning Neo4j and playing with it, I discovered the following. Suppose I have a perfectly valid Cypher query which contains UTF-8 encoded Cyrillic literals. If I copy and paste it (Ctrl-C, Ctrl-V) into the Neo4j web interface, it works.

If I pack my query into a text string and pass it to the REST::Neo4p::Query->new() constructor, that works OK too. But as soon as I call $qry->execute(), the program fails with an uninformative message on stderr:

    HTTP::Message content must be bytes at D:/Strawberry_Perl_5.18.1.1_x64/perl/site/lib/HTTP/Request/Common.pm line 94.

As soon as I remove all Cyrillic characters from the literals in the query, it works perfectly OK. The same happens when I try to use REST::Neo4p::Node methods, BTW.

Would you mind giving me some advice on this matter? Should I do something special with my Cypher queries? Is it a bug in the HTTP::Request module? What should I do to make the REST interface UTF-8 transparent in both directions?

Thank you so much for your work and your kind attention!

With best regards,
Andrii Stesin
Kyiv, Ukraine (it's where Maidan goes now).

Wed Dec 25 13:04:25 2013 maj.fortinbras [...] gmail.com - Taken

Wed Dec 25 13:06:47 2013 maj.fortinbras [...] gmail.com - Correspondence added

Hi Andrii-

I will look into this. I'm sure I am not handling non-UTF-8 very well. I appreciate the report; I hope I can fix it soon.

best!
Mark


Wed Dec 25 13:06:47 2013 The RT System itself - Status changed from 'new' to 'open'

Thu Dec 26 02:47:43 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

It seems that the root of the problem lies inside https://metacpan.org/pod/HTTP::Message, namely:

    $mess->content( $bytes )
    ...
    Note that the content should be a string of bytes. Strings in perl can contain characters outside the range of a byte. The Encode module can be used to turn such strings into a string of bytes.

Probably the best way to ensure that $bytes is a string of bytes is to force the conversion inside the REST::Neo4p module where appropriate?


Thu Dec 26 02:58:48 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

In HTTP::Message they also have this:

    $mess->add_content_utf8( $string )
    The add_content_utf8() method appends the UTF-8 bytes representing the string to the end of the current content buffer.

    $mess->content_charset
    This returns the charset used by the content in the message. The charset is either found as the charset attribute of the Content-Type header or by guessing.

and finally:

    $mess->decoded_content( %options )
    Returns the content with any Content-Encoding undone and for textual content the raw content encoded to Perl's Unicode strings. If the Content-Encoding or charset of the message is unknown this method will fail by returning undef.
    ...

It seems to me that it would be OK to add a parameter to REST::Neo4p which explicitly initialises it into a "utf8 only" mode. Or, maybe even better, just make it always interact with the HTTP service in UTF-8. No matter which language your strings are in, make sure they are always UTF-8 encoded and be happy with REST::Neo4p :) Is that an option?

Thank you once again for your kind attention! Please forgive me for being so persistent; my whole project critically depends on this issue.

With best regards,
Andrii Stesin
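The two HTTP::Message methods quoted above can be tried together in a short round trip. This is only a sketch: it assumes HTTP::Message is installed, the Content-Type header value and the sample text are chosen here for illustration, and it says nothing about how REST::Neo4p itself builds its requests.

```perl
use strict;
use warnings;
use HTTP::Message;

# Six Cyrillic code points, written as escapes so the example does not
# depend on the encoding this file is saved in.
my $text = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# The charset attribute tells decoded_content() how to interpret the bytes.
my $mess = HTTP::Message->new([ 'Content-Type' => 'text/plain; charset=UTF-8' ]);

# add_content_utf8() accepts a character string and stores UTF-8 bytes.
$mess->add_content_utf8($text);

my $raw     = $mess->content;          # byte string as it would go on the wire
my $decoded = $mess->decoded_content;  # character string again

printf "characters in: %d, bytes stored: %d, characters out: %d\n",
    length($text), length($raw), length($decoded);
```

Each of these Cyrillic letters takes two bytes in UTF-8, which is why the stored content is twice as long as the character string.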

Thu Dec 26 03:47:19 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

Yet more news. I have narrowed the problem down further. Here is a test script (a UTF-8 .pl file) and its output at the console:

    Neo4j v.2.0.0 ready to serve
    Creating index.

    Zero node created with payload e43b2e72.

    Test qry for mouse copy-paste test from screen:
        MATCH (i:Attic { payload: 'e43b2e72' } ) RETURN i, i.payload
    Test qry returned 0 rows.

If I copy and paste the test query with the mouse into the web interface, it also returns 0 rows. The same happens when the test script is imported through the web interface and run with the green round button: 0 rows.

But my test node with payload e43b2e72 IS in the database! I can see it with my own eyes in the web interface. Moreover, the same test query typed by hand into the web interface's interpreter line does not work either. At the same time,

    match (i:Attic) return i.payload

returns

    i.payload
    d6335ad8
    3391776b
    e6335ad8
    7cf07a09
    e43b2e72

    Returned 5 rows in 94 ms

What am I doing wrong? :(

Subject:

Sample_UTF-8_test.pl

#
# when I use utf-8 encoded .pl file, I can't even correctly MATCH nodes with ASCII payload
# this is sample UTF-8 cyrillic string, just for test: Привет участникам соревнований! :)
#
use utf8;
use POSIX;
use Encode;
use REST::Neo4p;
use Data::GUID;
use Digest::CRC qw(crc64 crc32 crc16 crcccitt crc crc8 crcopenpgparmor);

system( "chcp 1251" ) ; # this sets ccmd.exe console to a known state, 1251 is Ok for me
STDOUT->flush;
STDERR->flush;
close STDERR;
open STDERR, '>:encoding(cp1251)', 'CON:' or die "Open STDERR failed: " . $! . "\n";
close STDOUT;
open STDOUT, '>:encoding(utf8)', 'TestScript.cypher' or die "Open STDOUT failed: " . $! . "\n";
close STDIN;

REST::Neo4p->connect('http://127.0.0.1:7474') or die "Neo4j connect failed: " . $! . "\n";
$version = REST::Neo4p->neo4j_version;
print STDERR "Neo4j v." . $version . " ready to serve\n" ;

$payload0 = sprintf( "%08x", crc32( Data::GUID->new ) ); # just some random but certainly ASCII value

print STDERR "Creating index.\n";
$node_idx = REST::Neo4p::Index->new('node', 'Attic_node_index') or die "Failed to create node index.\n";
$node0 = $node_idx->create_unique( payload => qq( $payload0 ) , { payload => qq( $payload0 ) }, 'fail' ) or die "Failed to create node with payload $i\n";
$node0->add_labels( 'Attic') or die "Failed to set label at node 0\n";
print STDERR "\nZero node created with payload $payload0.\n\n";

#
# Now I want to MATCH it
#
$test_match_qry = sprintf ( "MATCH (i:Attic { payload: '%s' } ) RETURN i, i.payload", $payload0 );
print STDERR "Test qry for mouse copy-paste test from screen:\n\t" . $test_match_qry . "\n";
print STDOUT $test_match_qry . "\n"; # this is to test as a script through neo4j web interface

$qry_test_match = REST::Neo4p::Query->new( $test_match_qry ) or die "Qry compile failed\n$test_match_qry\n";
$numrows = $qry_test_match->execute();
if ($qry_test_match->err) {
    print STDERR "DIED $test_match_qry\t\n";
    printf STDERR "status code: '%s'\t", $qry_test_match->err;
    printf STDERR "error message: %s\n", $qry_test_match->errstr;
    die $! . "\n";
};
print STDERR "Test qry returned " . $numrows . " rows.\n";
while( --$numrows >= 0 ) {
    print STDERR $qry_test_match->fetch->[$numrows] ;
} ;

$node_idx->remove or die "Failed to remove index.\n";
exit $!

Thu Dec 26 05:13:04 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

Please disregard all my additions from 03:47:19 on Dec. 26, 2013 until now. My test script contains an elementary typo: qq($a) != qq( $a ), because the latter also captures the two surrounding ' ' spaces :) Sorry for that :)

WBR, Andrii
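The typo described above is easy to reproduce in isolation. A minimal sketch follows; the variable names are illustrative and the payload value is the one from the console output earlier in this ticket.

```perl
use strict;
use warnings;

my $payload = 'e43b2e72';      # sample value from the console output in this ticket

my $padded = qq( $payload );   # whitespace inside the delimiters is part of the string
my $exact  = qq($payload);     # no padding, so this equals $payload

# Brackets make the leading and trailing spaces visible.
printf "padded: [%s] length %d\n", $padded, length($padded);
printf "exact:  [%s] length %d\n", $exact,  length($exact);
```

A node stored with the padded value will not match a query that looks for the unpadded one, which is consistent with the zero rows reported in the earlier message.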

Thu Dec 26 05:16:00 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

Here http://perldoc.perl.org/perluniintro.html#Displaying-Unicode-As-Text there is a nice simple sub nice_string which helps to catch the invisible :)
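For readers without the link at hand, a helper along the lines of the perluniintro one might look like this. It is a sketch written for this ticket, not a verbatim copy of the documentation, and it uses only core Perl.

```perl
use strict;
use warnings;

# Render a string so that wide characters, control characters and other
# easily overlooked characters (such as spaces) become visible.
sub nice_string {
    my ($str) = @_;
    return join '', map {
        my $code = ord;
        $code > 255                          ? sprintf('\\x{%04X}', $code)  # wide character
        : chr($code) =~ /[[:cntrl:]]/        ? sprintf('\\x%02X', $code)    # control character
        :                                      quotemeta(chr($code));       # everything else, escaped if non-word
    } split //, $str;
}

# The padded payload from the test script: each space is printed with a backslash in front.
print nice_string(' e43b2e72 '), "\n";
```

Because quotemeta() escapes every non-word character, stray spaces show up with a backslash in front of them instead of blending into the surrounding output.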

Thu Dec 26 12:58:35 2013 maj.fortinbras [...] gmail.com - Correspondence added

Andrii--

Awesome work, this is really helpful for me (a poor American with a simple uninteresting alphabet!). I think your solution to just make sure everything is encoded would work. I also found this: http://stackoverflow.com/questions/2951466/httpmessage-content-must-be-bytes-error-when-trying-to-post, which pinpoints the issue in HTTP::Message. There, a possible solution is to just

    use utf8;

as a pragma. But I'm not sure. This pragma tells the compiler that the *script* is written in UTF-8, but it may not do the conversion of the contents of the variables automatically. There are methods in the utf8 module to do that explicitly, and maybe 'nice_string' is a way to go as well.

I agree that the code (or the option) should be in Neo4p, rather than making you work around it in your code.

I think we are very close!
Mark


Thu Dec 26 13:00:49 2013 maj.fortinbras [...] gmail.com - Correspondence added

Also, I will try to replicate that strangeness you saw in your direct queries to the server. It would be worth understanding that.


Thu Dec 26 14:17:19 2013 maj.fortinbras [...] gmail.com - Correspondence added

Sorry Andrii-- now I'm starting to read carefully. Thanks for the nice_string advice, I see that will help me make sense of what I'm seeing! And I see you are useing utf8. What I found so far is that

    use utf8;
    use HTTP::Message;
    use strict;
    use warnings;
    my $s = 'Сохранить';
    utf8::encode($s);
    my $m = HTTP::Message->new([ Content_type => 'text/plain' ], $s);

will create message $m without throwing the "must be bytes" error. Also, the following (without utf8 at all) creates the message without error:

    use HTTP::Message;
    use strict;
    use warnings;
    my $s = 'Сохранить';
    my $m = HTTP::Message->new([ Content_type => 'text/plain' ], $s);

But, if you use utf8, and do NOT encode, HTTP::Message throws:

    use utf8;
    use HTTP::Message;
    use strict;
    use warnings;
    my $s = 'Сохранить';
    # utf8::encode($s);
    my $m = HTTP::Message->new([ Content_type => 'text/plain' ], $s); # throws 'must be bytes'

More as I continue exploring-
MAJ
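The three cases above come down to whether the string holds characters above 255 or plain bytes. The sketch below illustrates that distinction with core Perl only. The helper name fits_in_bytes is made up for this example, and the assumption that HTTP::Message performs a comparable downgrade check internally is an inference from its error message, not something confirmed in this ticket.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Сохранить" as nine code points, written as escapes so the result does
# not depend on the encoding this file is saved in.
my $chars = "\x{421}\x{43E}\x{445}\x{440}\x{430}\x{43D}\x{438}\x{442}\x{44C}";

# The same text as UTF-8 bytes. This is what utf8::encode() produces, and
# also what a UTF-8 source literal holds when "use utf8" is absent.
my $bytes = encode('UTF-8', $chars);

# Hypothetical helper: true if every character of the string fits in one byte.
sub fits_in_bytes {
    my ($copy) = @_;                       # downgrade a copy, leave the caller's string alone
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

printf "character string: length %2d, fits in bytes: %d\n",
    length($chars), fits_in_bytes($chars);
printf "encoded string:   length %2d, fits in bytes: %d\n",
    length($bytes), fits_in_bytes($bytes);
```

Seen this way, the script without the pragma never holds Cyrillic characters at all, only their UTF-8 bytes, which is why it passes the check by accident.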


Thu Dec 26 22:27:53 2013 maj.fortinbras [...] gmail.com - Correspondence added

Andrii -

I believe I have solved the problem. I have uploaded v0.2230 to CPAN; please give it a try. The solution was to make sure that the JSON created for the HTTP requests was encoded as utf8. This was a very small change. I'm still trying to figure out exactly why it works.

Please let me know if the fix works for you.

thanks
Mark
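The actual change in v0.2230 is not shown in this ticket, so the following is only a sketch of the general idea, using the core JSON::PP module rather than whatever JSON encoder REST::Neo4p uses. The query text and the helper name fits_in_bytes are made up for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# A Cypher query containing a Cyrillic literal, written with escapes.
my $query = "MATCH (n { name: '\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}' }) RETURN n";

# Without ->utf8 the encoder returns a character string; the Cyrillic
# letters are still code points above 255.
my $json_chars = JSON::PP->new->encode({ query => $query });

# With ->utf8 the encoder returns UTF-8 bytes, suitable as an HTTP body.
my $json_bytes = JSON::PP->new->utf8->encode({ query => $query });

# Hypothetical helper: true if every character of the string fits in one byte.
sub fits_in_bytes {
    my ($copy) = @_;
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

printf "plain encode: fits in bytes = %d\n", fits_in_bytes($json_chars);
printf "utf8 encode:  fits in bytes = %d\n", fits_in_bytes($json_bytes);

# Decoding with the same ->utf8 setting restores the original characters.
my $round_trip = JSON::PP->new->utf8->decode($json_bytes)->{query};
print "round trip ok\n" if $round_trip eq $query;
```

This also suggests why the fix works: once the request body is a byte string, nothing in it exceeds 255, so a "content must be bytes" check has nothing to object to.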


Thu Dec 26 23:00:57 2013 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

Dear Mark, thank you for your effort! I'll try it ASAP, most probably just today, and report. With best regards, Andrii

Sat Jan 11 16:39:27 2014 stesin [...] gmail.com - Correspondence added

From:

stesin [...] gmail.com

Dear Mark, thank you so much for your help. As of 0.2230, utf8 strings are working Ok (sorry, didn't have time to check this during the New Year vacation). Great job! I think (sincerely hope ;) that this issue is closed already. With best regards, Andrii

Sun Jan 12 23:40:01 2014 maj.fortinbras [...] gmail.com - Correspondence added

Thanks Andrii-- I will mark this resolved. MAJ

Sun Jan 12 23:40:02 2014 maj.fortinbras [...] gmail.com - Status changed from 'open' to 'resolved'