
This queue is for tickets about the Lingua-Treebank CPAN distribution.

Report information
The Basics
Id: 15079
Status: resolved
Worked: 20 min
Priority: 0/
Queue: Lingua-Treebank

People
Owner: kahn [...] cpan.org
Requestors: vassilii [...] tarunz.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: newer treebank format not supported?
I'm new to the Penn Treebank area, and I have heard that there was a format change at some point. I have access to the most recent PT CD-ROM data, and when I try to parse the following tree (from the beginning of wsj2300):

    ( (S (INTJ No) , (NP-SBJ it) (VP was n't (NP-PRD Black Monday)) .))

I get the following error:

    at /usr/local/share/perl/5.8.4/Lingua/Treebank.pm line 74

even if I use one of your scripts, e.g., tree-collapse.

I was able to manually correct the input to make it work by wrapping every punctuation mark and wrapping the implicit children, as in:

    ( (S (INTJ No) (, ,) (NP-SBJ it) (VP (VB was n't) (NP-PRD Black Monday)) (. .)))

Similar problems happen with other constructs of the form (<TAG> <words> (<subtree>) ...), where <words> belong to an implicitly started tag (inferred from the parent type?) that is omitted, like the VB above.

While the punctuation problem is trivial to fix, I don't have a patch ready for the implicit tags, and hence would appreciate any assistance here.

Kind regards,
Vassilii
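The punctuation half of the manual fix described above could be automated roughly as follows. This is a minimal standalone Python sketch, not part of Lingua-Treebank: the function names and the `PUNCT` set are hypothetical, and it assumes that a bare punctuation token sitting next to bracketed siblings should be wrapped in a node whose tag is the token itself. It does not attempt the harder problem of the omitted tags.

```python
import re

# Hypothetical set of tokens treated as punctuation (an assumption, not
# taken from the Penn Treebank documentation).
PUNCT = {",", ".", ":", ";", "?", "!", "``", "''"}

def parse(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]       # first token after "(" is the label
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return [label] + children

    return node()

def wrap_punct(n):
    """Wrap bare punctuation that appears alongside bracketed siblings."""
    has_subtree = any(isinstance(c, list) for c in n[1:])
    out = [n[0]]
    for c in n[1:]:
        if isinstance(c, list):
            out.append(wrap_punct(c))
        elif has_subtree and c in PUNCT:
            out.append([c, c])        # "," becomes "(, ,)"
        else:
            out.append(c)
    return out

def to_text(n):
    """Serialize nested lists back to bracketed text."""
    parts = [to_text(c) if isinstance(c, list) else c for c in n[1:]]
    return "(" + " ".join([n[0]] + parts) + ")"

def wrap_bare_punctuation(text):
    return to_text(wrap_punct(parse(text)))

tree = "( (S (INTJ No) , (NP-SBJ it) (VP was n't (NP-PRD Black Monday)) .))"
print(wrap_bare_punctuation(tree))
```

On the tree from this ticket, the sketch wraps the comma and the period but leaves `was n't` untouched, since choosing a tag for those words needs information that the input does not carry.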
Vassilii--

The treebank format that this system works with is the .mrg format, which includes both the part-of-speech tags and the non-terminals. Your example comes from a .psd file, which contains only the non-terminals.

I have resolved this by including a note in the latest version (0.12) explaining that you need to use the .mrg files. Hope that's adequate.

--Jeremy
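To check quickly which kind of file a tree came from, one could test whether every word sits alone inside its own bracket, as it would when part-of-speech tags are present. The following Python sketch is a heuristic under that assumption. The function name is hypothetical, the check is independent of Lingua-Treebank, and the tagged tree in the usage example is an illustration with plausible tags rather than a quotation from the corpus.

```python
import re

def untagged_words(tree):
    """Return the words that are not the sole content of a (TAG word) bracket.

    Heuristic only: a single word under a phrase label, such as (INTJ No),
    looks the same as a tagged word and is therefore not reported.
    """
    tokens = re.findall(r"[()]|[^\s()]+", tree)
    missing = []
    for i, tok in enumerate(tokens):
        if tok in ("(", ")"):
            continue
        if i > 0 and tokens[i - 1] == "(":
            continue                  # token right after "(" is a label
        tagged = (
            i >= 2
            and tokens[i - 2] == "("              # bracket opens ...
            and tokens[i - 1] not in ("(", ")")   # ... followed by a tag ...
            and i + 1 < len(tokens)
            and tokens[i + 1] == ")"              # ... and closes after the word
        )
        if not tagged:
            missing.append(tok)
    return missing

# Tree from the ticket (non-terminals only).
bare = "( (S (INTJ No) , (NP-SBJ it) (VP was n't (NP-PRD Black Monday)) .))"
# Illustrative tagged version; the tags here are an assumption.
tagged = ("( (S (INTJ (UH No)) (, ,) (NP-SBJ (PRP it)) "
          "(VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .)))")

print(untagged_words(bare))
print(untagged_words(tagged))
```

An empty result suggests the input has the tag-per-word shape described for the .mrg files. A non-empty result lists the words that would need a tag before the tree could be treated that way.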