Bug #5473 for RTF-Tokenizer: Order of operations problem (and small enhancement)

Fri Feb 27 15:35:41 2004 Guest - Ticket created

Subject:

Order of operations problem (and small enhancement)

Problem observed with: RTF::Tokenizer 1.06
Perl version used: v5.8.2 built for MSWin32-x86-multi-thread
OSs behavior observed: MS Win98SE, Win2k, WinXP

When an RTF file contains Unicode characters, the grab_control subroutine in Tokenizer.pm returns the parameter as a control word rather than as a Unicode character. The net effect is nearly the same: it returns the correct type and parameter, it just doesn't follow the expected code path.

The reason I noticed this: apparently Microsoft Wordpad stores Unicode characters as \udddd? - with a question mark terminating each one. When the routine is run, it returns the expected character code but leaves the question mark after each character. I changed the unicode filter from:

    # Unicode characters
    } elsif ( $self->{_BUFFER} =~ s/^u(\d+)// ) {
        return( 'u', $1 );
    }

to

    # Unicode characters
    } elsif ( $self->{_BUFFER} =~ s/^u(\d+)\??// ) {
        return( 'u', $1 );
    }

to strip off the trailing question mark if it exists, but was quite befuddled when it didn't work. I finally puzzled out that an RTF Unicode character will be caught by the control word filter block, so the unicode character filter block never executes. Moving the unicode filter block to just before the control word block made it work as expected.

Thanks,
Steve Schulze
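The ordering effect described above can be reproduced in isolation. The following is a minimal sketch, not the code from Tokenizer.pm: grab_control_sketch and its $unicode_first flag are hypothetical names, and the control word pattern is an assumed approximation of what the module uses. Only the unicode pattern is taken from the report.

```perl
use strict;
use warnings;

# Hypothetical stand-in for grab_control: the same two branches,
# with the order selectable so both behaviours can be compared.
sub grab_control_sketch {
    my ( $buffer, $unicode_first ) = @_;

    # Assumed approximation of the control word pattern (not copied from the module).
    my $control_word = qr/^([a-z]{1,32})(-?\d+)?(?:\s|(?=[^a-z0-9]))/i;

    # Unicode pattern from the report, including the optional trailing "?".
    my $unicode = qr/^u(\d+)\??/;

    # Reordered variant: try the unicode branch before the control word branch.
    if ( $unicode_first && $buffer =~ s/$unicode// ) {
        return ( 'u', $1, $buffer );
    }
    if ( $buffer =~ s/$control_word// ) {
        return ( $1, $2, $buffer );
    }
    # Original position of the unicode branch: only reached if the
    # control word pattern did not match.
    if ( $buffer =~ s/$unicode// ) {
        return ( 'u', $1, $buffer );
    }
    return ( undef, undef, $buffer );
}

# Each call returns (type, parameter, remaining buffer).
my @original  = grab_control_sketch( 'u8364?rest', 0 );
my @reordered = grab_control_sketch( 'u8364?rest', 1 );
print "original:  @original\n";
print "reordered: @reordered\n";
```

Under these assumptions both orders return type 'u' and parameter 8364, but the original order leaves "?rest" in the buffer while the reordered version leaves "rest".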

Sun Mar 14 17:23:24 2004 sargie [...] cpan.org - Correspondence added


> The reason I noticed this: apparently Microsoft Wordpad stores unicode
> characters as \udddd? - with a question mark terminating each one.
> When the routine is run, it returns the expected character code
> but leaves the question mark after each character. I changed the
> unicode filter from:

Yes. The question mark is the ASCII representation of the unicode character in question. So you could put any character there you wanted, and in fact you could put several characters there, but you'd need to precede that with a \ucX control, where X is the number of characters that make up the ASCII representation. This way, old viewers can make a sensible representation of documents they don't understand.
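As an illustration of the \ucX rule, here is a rough sketch of a decoder that skips the fallback characters after each \u control. decode_unicode_escapes is a hypothetical helper, not part of RTF::Tokenizer, and it makes simplifying assumptions: it works on a raw string rather than tokens, treats each fallback as a single plain character (so escaped fallbacks such as \'xx are not handled), and ignores negative parameters.

```perl
use strict;
use warnings;

# Hypothetical helper: decode \uN controls and drop the fallback characters
# that follow each one. The count comes from the most recent \ucN and
# defaults to 1 when no \ucN has been seen.
sub decode_unicode_escapes {
    my ($rtf) = @_;
    my $skip = 1;
    my $out  = '';

    while ( length $rtf ) {
        if ( $rtf =~ s/^\\uc(\d+) ?// ) {
            # \ucN: number of fallback characters after each later \uN
            $skip = $1;
        }
        elsif ( $rtf =~ s/^\\u(\d+) ?// ) {
            # \uN: emit the character, then discard its fallback
            $out .= chr $1;
            substr( $rtf, 0, $skip, '' );
        }
        else {
            # Anything else is copied through one character at a time
            $out .= substr( $rtf, 0, 1, '' );
        }
    }
    return $out;
}

# Print code points rather than characters to avoid wide character warnings.
print join( ',', map { ord } split //, decode_unicode_escapes('a\u8364?b') ), "\n";
print join( ',', map { ord } split //, decode_unicode_escapes('\uc3\u8364EURb') ), "\n";
```

In the first call the single "?" is dropped as the default one-character fallback; in the second, \uc3 causes the three characters "EUR" to be dropped instead.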

> to strip off the trailing question mark if it exists, but was quite
> befuddled when it didn't work. I finally puzzled out that an rtf
> unicode character will be caught by the control word filter block
> and will never execute the unicode character filter block.

Yes it will, just not for the example you gave. \u123A, for example, would get caught by it, because that wouldn't normally be a valid control, but it is a valid unicode control.

Makes my head hurt too... Will put some comments in the code about this...

Thanks,
+Pete
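The difference between the two inputs can be checked with a small sketch. The control word pattern here is an assumption (a word, an optional numeric parameter, then whitespace or a non-alphanumeric character), not the pattern copied from Tokenizer.pm, and which_branch is a hypothetical helper that reports which branch would fire when the control word test comes first.

```perl
use strict;
use warnings;

# Assumed approximation of the control word pattern, not the module's own.
my $control_word = qr/^([a-z]{1,32})(-?\d+)?(?:\s|(?=[^a-z0-9]))/i;

# Unicode pattern from the report.
my $unicode = qr/^u(\d+)\??/;

# Hypothetical helper: which branch matches first, in the original order.
sub which_branch {
    my ($input) = @_;
    return 'control word' if $input =~ $control_word;
    return 'unicode'      if $input =~ $unicode;
    return 'neither';
}

for my $input ( 'u8364?', 'u123A' ) {
    print "$input -> ", which_branch($input), "\n";
}
```

With this assumed pattern, "u8364?" is taken by the control word branch because "?" ends the parameter cleanly, while "u123A" fails the control word test (a letter follows the digits) and falls through to the unicode branch.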

Sun Mar 14 17:23:39 2004 sargie [...] cpan.org - Status changed from 'new' to 'resolved'