
This queue is for tickets about the K CPAN distribution.

Report information
The Basics
Id: 76680
Status: resolved
Priority: 0/
Queue: K

People
Owner: Nobody in particular
Requestors: calid1984 [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: patch: return K char vectors as Perl string scalars
Subject: k_char2str.patch
diff -u -r ./kparse.c /home/nbkp1nr/github/k-perl/kparse.c
--- ./kparse.c  2012-04-08 15:39:27.000000000 -0500
+++ /home/nbkp1nr/github/k-perl/kparse.c  2012-04-18 20:57:37.361327186 -0500
@@ -165,6 +165,10 @@
             break;
     }

+    if ( k->t == KC || k->t == KG ) {
+        return result;
+    }
+
     return newRV_noinc((SV*)result);
 }

@@ -382,21 +386,18 @@
 }

 SV* byte_vector_from_k(K k) {
-    AV *av = newAV();
-    char byte_str[1];
+    char byte_str[k->n];
     int i = 0;

     for (i = 0; i < k->n; i++) {
         if (kG(k)[i] == 0) {
-            av_push(av, &PL_sv_undef);
             continue;
         }

-        byte_str[0] = kG(k)[i];
-        av_push(av, newSVpvn(byte_str, 1));
+        byte_str[i] = kG(k)[i];
     }

-    return (SV*)av;
+    return newSVpvn(byte_str, k->n);
 }

 SV* short_vector_from_k(K k) {
diff -u -r ./t/k.t /home/nbkp1nr/github/k-perl/t/k.t
--- ./t/k.t  2012-04-08 16:46:39.000000000 -0500
+++ /home/nbkp1nr/github/k-perl/t/k.t  2012-04-18 20:45:13.190224758 -0500
@@ -12,7 +12,7 @@
 is $k->cmd('4 + 4'), 8, 'make an int';

-is_deeply $k->cmd(q/"abc"/), [qw/a b c/], 'make char vector';
+is $k->cmd(q/"abc"/), "abc", 'make string';

 my $timestamp = $k->cmd(q/2012.03.24D12:13:14.15161728/);
 is "$timestamp", '385906394151617280', 'timestamp';
diff -u -r ./t/raw.t /home/nbkp1nr/github/k-perl/t/raw.t
--- ./t/raw.t  2012-04-08 16:46:39.000000000 -0500
+++ /home/nbkp1nr/github/k-perl/t/raw.t  2012-04-18 21:06:51.104563448 -0500
@@ -122,13 +122,13 @@
     my ($handle) = @_;

     is_deeply k( $handle, '(),0b' ), [ undef ], 'null boolean vector';
-    is_deeply k( $handle, '(),0x00'), [ undef ], 'null byte vector';
+    is k( $handle, '(),0x00'), "\000", 'null byte vector';
     is_deeply k( $handle, '(),0Nh' ), [ undef ], 'null short vector';
     is_deeply k( $handle, '(),0N' ), [ undef ], 'null int vector';
     is_deeply k( $handle, '(),0Nj' ), [ undef ], 'null long vector';
     is_deeply k( $handle, '(),0Ne' ), [ undef ], 'null real vector';
     is_deeply k( $handle, '(),0n' ), [ undef ], 'null float vector';
-    is_deeply k( $handle, '()," "' ), [ ' ' ], 'null char vector'; # this ones weird
+    is_deeply k( $handle, '()," "' ), ' ', 'null char vector'; # this ones weird
     is_deeply k( $handle, '(),`' ), [ undef ], 'null sym vector';
     is_deeply k( $handle, '(),0Nm' ), [ undef ], 'null month vector';
     is_deeply k( $handle, '(),0Nd' ), [ undef ], 'null day vector';
@@ -165,7 +165,7 @@
     my ($handle) = @_;

     is_deeply k($handle, '(0b;1b;0b)'), [undef, 1, undef], 'parse bool vector';
-    is_deeply k($handle, '"abc"'), [qw(a b c)], 'parse char vector';
+    is_deeply k($handle, '"abc"'), "abc", 'parse char vector';
     is_deeply k($handle, '(7h;8h;9h)'), [7, 8, 9], 'parse short vector';
     is_deeply k($handle, '(7i;8i;9i)'), [7, 8, 9], 'parse int vector';
I came across a problem with the approach taken by this patch. Take the following table definition:

    ([] foo: (`a`b`c); bar: (1;2;3))

In q it looks like this:

    foo bar
    -------
    a   1
    b   2
    c   3

In Perl it looks like this:

    {
        'foo' => [ 'a', 'b', 'c' ],
        'bar' => [ 1, 2, 3 ],
    }

All is good.

Now change the table definition to this:

    ([] foo: ("a";"b";"c"); bar: (1;2;3))

In q it looks like this:

    foo bar
    -------
    a   1
    b   2
    c   3

With your patch applied, in Perl it looks like this:

    {
        'foo' => 'abc',
        'bar' => [ 1, 2, 3 ],
    }

Q gets away with the vector/string duality because the syntax consistency allows you to not care. In Perl, though, getting the nth character in a string is completely different syntax from getting the nth element of a list.

So while it's a pain to deal with char vectors as arrays in Perl, I think it might be worse to silently devectorize them to strings where you want to be able to assume you have an array.
From: calid1984 [...] gmail.com
On Sat Apr 21 19:49:42 2012, WHITNEY wrote:
> I came across a problem with the approach taken by this patch. Take the
> following table definition:
>
> ([] foo: (`a`b`c); bar: (1;2;3))
>
> In q it looks like this:
>
> foo bar
> -------
> a   1
> b   2
> c   3
>
> In Perl it looks like this:
>
> {
>     'foo' => [ 'a', 'b', 'c' ],
>     'bar' => [ 1, 2, 3 ],
> }
>
> All is good.
>
> Now change the table definition to this:
>
> ([] foo: ("a";"b";"c"); bar: (1;2;3))
>
> In q it looks like this:
>
> foo bar
> -------
> a   1
> b   2
> c   3
>
> With your patch applied, in Perl it looks like this:
>
> {
>     'foo' => 'abc',
>     'bar' => [ 1, 2, 3 ],
> }
>
> Q gets away with the vector/string duality because the syntax
> consistency allows you to not care. In Perl, though, getting the nth
> character in a string is completely different syntax from getting the
> nth element of a list.
>
> So while it's a pain to deal with char vectors as arrays in Perl, I
> think it might be worse to silently devectorize them to strings where
> you want to be able to assume you have an array.
I think there is some confusion here. In q, "foo" is the same as ("f";"o";"o"):

    q)"foo" ~ ("f";"o";"o")
    1b

So saying that getting a string in Perl is okay for the first case, but not for the second case, doesn't really make sense, since they're the same case (flat char vector). And indeed, q itself tries to be helpful by presenting the more 'usable' form:

    q)("f";"o";"o")
    "foo"

In my opinion, it's best to optimize for the more common case. The vast majority of the time, in Perl, we are going to want to deal with a flat char vector as a string. For the boundary case when (if) we *really* want a list of chars, vectorizing is trivial (e.g. split //, $str). Since we consistently return a char vector as a string, blindly 'vectorizing' if necessary is not dangerous.

Also, we still get nested char vectors represented accurately:

    warn Dumper( $k->cmd('(enlist "f"; enlist "o"; enlist "o")'));
    $VAR1 = [ 'f', 'o', 'o' ];

which of course then extends to making a list of strings much more friendly to deal with programmatically in Perl:

    warn Dumper( $k->cmd('("the"; "quick"; "whittns")'));
    $VAR1 = [ 'the', 'quick', 'whittns' ];
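A minimal sketch of the "vectorize on demand" idea mentioned above, assuming the binding has already handed back a flat char vector as a plain Perl string (the variable names are illustrative only and no K connection is involved):

```perl
use strict;
use warnings;

# A flat char vector as the patched module would return it: a plain string.
my $str = 'foo';

# "Vectorize" on demand: one array element per character.
my @chars = split //, $str;

# And back again: join the characters into a single string.
my $round_trip = join '', @chars;

print scalar(@chars), " chars: @chars\n";    # 3 chars: f o o
print "round trip ok\n" if $round_trip eq $str;
```

The conversion is cheap in either direction, but the caller has to know which form it currently holds, which is the point the rest of the thread argues about.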
> I think there is some confusion here. In q, "foo" is the same as
> ("f";"o";"o"):

I think there is some confusion here about there being some confusion here :) I've got a good handle on how strings work in q.

> In my opinion, it's best to optimize for the more common case. The vast
> majority of time, in Perl, we are going to want to deal with a flat char
> vector as a string. For the boundary case when (if) we *really* want a
> list of chars, vectorizing is trivial (e.g. split //, $str).

I see what you're saying. You're clearly right that in the common case you want your char vector as a string. However, I'd rather deal with the annoyance of having to stringify arrays of chars than encounter the case where I intentionally produce a vector of chars only to be returned a string.

I brought up the table example to illustrate a case where the auto-stringification behavior is surprising and annoying. You want to write code that can safely treat all tables as hashes of array refs. But now all your table processing code has to have a special condition in it to deal with the fact that some columns may in fact be strings. Yuk.

Given that Q is all about vector processing, I expect this to come up in lots of places. Having to check all the time to see if your vector is really a vector isn't acceptable. Not checking means you're sacrificing correctness.
From: calid1984 [...] gmail.com
> I think there is some confusion here about there being some
> confusion here :) I've got a good handle on how strings work in q.

> I see what you're saying. You're clearly right that in the common
> case you want your char vector as a string. However, I'd rather deal
> with the annoyance of having to stringify arrays of chars than
> encounter the case where I intentionally produce a vector of chars
> only to be returned a string. I brought up the table example to
> illustrate a case where the auto-stringification behavior is
> surprising and annoying. You want to write code that can safely treat
> all tables as hashes of array refs. But now all your table processing
> code has to have a special condition in it to deal with the fact that
> some columns may in fact be strings. Yuk.
>
> Given that Q is all about vector processing, I expect this to come up
> in lots of places. Having to check all the time to see if your vector
> is really a vector isn't acceptable. Not checking means you're
> sacrificing correctness.
I've thought about this further, and realized that returning an 'arrayref of chars' in Perl is not only unusable, it's also incorrect.

A string *is* a char vector. Even in Perl, there is this implicit understanding. That's why list-like operators such as index, substr, and chop work on strings as though they are a list of chars... because they are.

So if we're talking about correctness, where correctness is mapping the corresponding Q -> Perl data structure, it is actually *incorrect* to return a char vector as an array ref of chars in Perl. The corresponding Perl data structure for a flat vector of chars *is* a string.

This is a win-win. It is both the correct mapping, and eminently more usable.
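For reference, a small sketch of the built-ins named above operating on a string position by position. It uses only core Perl functions; the sample value is made up:

```perl
use strict;
use warnings;

my $str = 'abc';

# substr: positional access, comparable to indexing into a char vector.
my $second = substr $str, 1, 1;

# index: find the position of a character within the string.
my $pos = index $str, 'c';

# chop: remove and return the last character, modifying $str in place.
my $last = chop $str;

print "second=$second pos=$pos last=$last rest=$str\n";
# second=b pos=2 last=c rest=ab
```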
From: calid1984 [...] gmail.com
Well, I've had a change of heart, and I'm now leaning towards returning an array of chars, but before I explain why, a few more nitpicks/devil's advocate arguments :)

> auto stringification behavior is surprising and annoying.

No. If you document that's what your library does, then it is consistent and predictable.

> Having to check all the time to see if your vector is really a vector
> isn't acceptable. Not checking means you're sacrificing correctness.

You don't have to check all the time, since you've documented that a flat char vector is a string all the time, every time.

> But now all your table processing code has to have a special condition
> in it to deal with the fact that some columns may in fact be strings. Yuk.

Again, you don't need 'special condition' code. As the client, when I get a column of strings I simply have to iterate over the column treating each value as a string... why do I need a 'special condition'?

If you want to talk about special processing, this happens when you return an array of chars. Now, when I get a column of strings, for each value I have to splice together the chars into a usable string form. Now *that's* YUK.

Ok, I prefaced this by saying that, as much as I think it's wrong, inelegant, and annoying, I'm starting to lean towards returning an array of chars for each string... why?

First, there is a valid argument that it's important to maintain the same behavior as Kx's own language wrappers. Their Java wrapper, for instance, returns a char vector given a string, and a string given a symbol.

The real problem with auto-stringification is distinguishing between symbols and strings, and this to me is the most persuasive argument in favor of returning a string for the former and an array of chars for the latter...

That said, if there were an easy way to distinguish between a 'stringified' array of chars and a symbol, then I would still prefer stringifying char vectors, even at the expense of having different behavior than Kx's wrappers.
Cool. This is a pick-your-poison type of situation, so I guess it's not surprising that there is disagreement, or agreement for different reasons :) I still want to weigh in on some of the points you brought up:

> > auto stringification behavior is surprising and annoying.
> No. If you document that's what your library does, then it is
> consistent and predictable

Sure. My point is that you want to write code like:

    map { first @{ $_ } } values %{ $table }

But you can't, because you forgot to check whether $_ is really an array ref and then deal with the special case where it's actually a string.
> > Having to check all the time to see if your vector is really a vector
> > isn't acceptable. Not checking means you're sacrificing correctness.
>
> You don't have to check all the time, since you've documented that a
> flat char vector is a string all the time every time.

I should have written "Having to check all the time to see if your vector is really an *array* isn't acceptable." So sorry for muddying the waters there. As you say, a string is an equally correct representation of a char vector. My point is that having two equally valid representations of a vector means I have to sprinkle my generic vector processing code with conditions to see which representation is being used so I can apply the right syntax. If I don't put those conditionals in place, then I risk sacrificing correctness. The above map command is an example of what I'm talking about.
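To make the "sprinkled conditionals" concrete, here is a rough sketch of the kind of check generic table code would need if char columns arrive as strings. The helper name as_array_ref and the sample table are hypothetical, not part of the K distribution:

```perl
use strict;
use warnings;

# Hypothetical helper: coerce a column to an array ref, whether the
# binding returned a string (stringified char vector) or an array ref.
sub as_array_ref {
    my ($col) = @_;
    return ref($col) eq 'ARRAY' ? $col : [ split //, $col ];
}

# Sample table shaped like the one discussed in this ticket.
my $table = {
    foo => 'abc',          # char column, stringified by the patch
    bar => [ 1, 2, 3 ],    # int column, still an array ref
};

# First element of every column, with the representation check applied.
my %first = map { $_ => as_array_ref( $table->{$_} )->[0] } keys %$table;

print "$_ => $first{$_}\n" for sort keys %first;
# bar => 1
# foo => a
```

Without the check, dereferencing 'abc' as an array under "use strict" dies at run time, which is the failure mode described above.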
> If you want to talk about special processing, this happens when you
> return an array of chars. Now, when I get a column of strings, for each
> value I have to splice together the chars into a usable string form.
> Now *that's* YUK.

Yep. That's the downside. That's why this is a trade-off. To me this is the less yukky of the two yuks.
> Ok, I prefaced this by saying that, as much as I think it's wrong,
> inelegant, and annoying, I'm starting to lean towards returning an
> array of chars for each string... why?
>
> First, there is a valid argument that it's important to maintain the
> same behavior as Kx's own language wrappers. Their Java wrapper, for
> instance, returns a char vector given a string, and a string given a symbol.

Good to know.
> The real problem with auto-stringification is distinguishing between
> symbols and strings, and this to me is the most persuasive argument in
> favor of returning a string for the former and an array of chars for the
> latter...

Hmm... interesting. Sadly, there will still be cases where you can't distinguish:

$k->cmd(q/`a/) and $k->cmd(q/"a"/) both return the Perl string "a".

$k->cmd(q/`a`b`c/) and $k->cmd(q/"abc"/) both return [qw/a b c/].

But with no q symbol concept in Perl, maybe it doesn't matter that much?
> That said, if there was an easy way to distinguish between a
> 'stringified' array of chars and a symbol, then I would still prefer
> stringifying char vectors, even at the expense of having different
> behavior than Kx's wrappers

I've been thinking about implementing a strict mode or some such thing where we only return object representations of q data. That way we can preserve all information, like the q type and attributes, in situations where we care about those things. That would prevent any information loss when we go from q to Perl and take care of any ambiguity problems. Also, you could make char vector objects stringify sanely. There might be significant performance implications, though. We'll have to see.
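One possible shape for such an object, sketched with Perl's core overload pragma. The class name CharVector, its constructor, and the 'KC' type label are assumptions made for illustration; nothing like this exists in the K distribution as discussed here:

```perl
use strict;
use warnings;

package CharVector;    # hypothetical class, for illustration only

use overload
    '""'     => sub { join '', @{ $_[0]{chars} } },    # stringifies as "abc"
    '@{}'    => sub { $_[0]{chars} },                  # dereferences as a list of chars
    fallback => 1;

sub new {
    my ( $class, $string ) = @_;
    # Keep the q type next to the data so it is not lost in translation.
    return bless { chars => [ split //, $string ], type => 'KC' }, $class;
}

sub type { return $_[0]{type} }

package main;

my $v = CharVector->new('abc');

print "as string: $v\n";                  # as string: abc
print "first char: ", $v->[0], "\n";      # first char: a
print "length: ", scalar(@$v), "\n";      # length: 3
print "q type: ", $v->type, "\n";         # q type: KC
```

An object like this would let table code dereference every column as an array while still printing char columns as strings, at the cost of an object allocation per vector, which is presumably where the performance concern comes from.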