Subject: newer treebank format not supported?
I'm new to the Penn Treebank area, and I have heard that at some point there was a format change. I have access to the most recent Penn Treebank CD-ROM data, and when I try to parse the following tree (from the beginning of wsj2300):
( (S (INTJ No)
,
(NP-SBJ it)
(VP was n't
(NP-PRD Black Monday))
.))
I get the following error:
at /usr/local/share/perl/5.8.4/Lingua/Treebank.pm line 74
even if I use one of your scripts, e.g., tree-collapse.
I was able to manually correct the input to make it work
by wrapping every punctuation mark in its own node and
wrapping the implicit children in an explicit tag, as in:
( (S (INTJ No)
(, ,)
(NP-SBJ it)
(VP (VB was n't)
(NP-PRD Black Monday))
(. .)))
Similar problems happen with other constructs of the form
(<TAG> <words> (<subtree>) ...)
where <words> belong to a tag that is left implicit (inferred
from the parent type?) and omitted, like the VB above.
While the punctuation problem is trivial to fix, I don't
have a patch ready for the implicit tags, and hence would
appreciate any assistance here.
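
For reference, the punctuation half of the workaround can be sketched as a preprocessing step that runs before the tree reaches the parser. This is a standalone Python sketch, not a patch to Lingua::Treebank: the function name wrap_punctuation and the PUNCT set are assumptions made for illustration, and the bare words under VP (was n't) are deliberately left untouched, since inferring their tag is the open question.

```python
import re

# Assumed set of tokens to treat as punctuation; extend as needed.
PUNCT = {",", ".", ":", ";", "?", "!", "``", "''"}


def wrap_punctuation(tree: str) -> str:
    """Wrap bare punctuation tokens as (tok tok), e.g. ',' -> '(, ,)'.

    Hypothetical helper: it only rewrites punctuation and does not try
    to infer tags for other bare words.
    """
    # Split into "(", ")" and whitespace-delimited atoms.
    tokens = re.findall(r"[()]|[^\s()]+", tree)
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else None
        prev2 = tokens[i - 2] if i > 1 else None
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        is_tag = prev == "("  # an atom right after "(" is a node label
        # The only child of a "(TAG word)" node is already wrapped.
        already_wrapped = prev2 == "(" and nxt == ")"
        if tok in PUNCT and not is_tag and not already_wrapped:
            out.append(f"({tok} {tok})")
        else:
            out.append(tok)
    # Re-join on one line and tidy the spacing around brackets.
    return " ".join(out).replace("( ", "(").replace(" )", ")")


if __name__ == "__main__":
    raw = """( (S (INTJ No)
    ,
    (NP-SBJ it)
    (VP was n't
    (NP-PRD Black Monday))
    .))"""
    print(wrap_punctuation(raw))
```

The output is written on a single line, so the original line breaks and indentation are not preserved. Running the function on input that is already wrapped leaves it unchanged.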
Kind regards,
Vassilii