Subject: Order of operations problem (and small enhancement)
Problem observed with: RTF::Tokenizer 1.06
perl version used: v5.8.2 built for MSWin32-x86-multi-thread
OSes where behavior was observed: MS Win98SE, Win2k, WinXP
When an RTF file contains Unicode characters, the grab_control subroutine in Tokenizer.pm returns each one as a control word with a parameter rather than through the Unicode character branch. The net effect is nearly the same: it returns the correct type and parameter, but the code does not follow the expected path.
I noticed this because Microsoft WordPad apparently stores Unicode characters as \udddd?, with a question mark terminating each one. When the routine runs, it returns the expected character code but leaves the question mark in the buffer after each character. I changed the Unicode filter from:
# Unicode characters
} elsif ( $self->{_BUFFER} =~ s/^u(\d+)// ) {
return( 'u', $1 );
}
to
# Unicode characters
} elsif ( $self->{_BUFFER} =~ s/^u(\d+)\??// ) {
return( 'u', $1 );
}
to strip off the trailing question mark if it exists, but was quite befuddled when it didn't work. I finally puzzled out that an RTF Unicode character is caught by the control word filter block first (a "u" followed by digits looks like a control word with a numeric parameter), so the Unicode character filter block never executes.
Moving the Unicode filter block to just before the control word block made it work as expected.
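To illustrate the ordering issue, here is a minimal standalone sketch. It is not the actual Tokenizer.pm code: try_control, try_unicode, grab and show are hypothetical helpers, and the control word pattern is only an assumed approximation (letters, an optional numeric parameter, an optional trailing space). The Unicode pattern is the patched one from above. The buffer is assumed to hold the text just after the leading backslash has been consumed.

```perl
use strict;
use warnings;

# Assumed approximation of the control word filter (not the module's regex).
sub try_control {
    my ($buf) = @_;
    if ( $$buf =~ s/^([a-z]+)(-?\d+)?[ ]?//i ) {
        my ( $word, $param ) = ( $1, $2 );
        return ( 'control word', $word, $param );
    }
    return;
}

# The patched Unicode filter from this report.
sub try_unicode {
    my ($buf) = @_;
    if ( $$buf =~ s/^u(\d+)\??// ) {
        my $code = $1;
        return ( 'unicode', 'u', $code );
    }
    return;
}

# Try each filter in the order given; the first one that matches wins.
# Returns (branch, word, parameter, remaining buffer).
sub grab {
    my ( $buffer, @filters ) = @_;
    for my $filter (@filters) {
        my @hit = $filter->( \$buffer );
        return ( @hit, $buffer ) if @hit;
    }
    return ( 'none', undef, undef, $buffer );
}

sub show {
    my ( $label, @result ) = @_;
    print "$label: ", join( ' | ', map { defined $_ ? $_ : '' } @result ), "\n";
}

show( 'control word first', grab( 'u8364?100', \&try_control, \&try_unicode ) );
show( 'unicode first',      grab( 'u8364?100', \&try_unicode, \&try_control ) );
show( 'other control word', grab( 'uc1 rest',  \&try_unicode, \&try_control ) );
```

With the control word filter first, the sketch reports the control word branch and leaves "?100" in the buffer. With the Unicode filter first, it reports the unicode branch and leaves "100". A control word such as uc1 still reaches the control word branch, because the Unicode pattern requires a digit immediately after the "u". Note that, per the RTF specification, the character following \uN is a fallback for readers without Unicode support and is not necessarily "?", so this only covers the WordPad case described here.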
Thanks,
Steve Schulze