Subject: possible bug in hso, strong matching of compounds
This was reported by Hideki Shima of CMU.
-----------------------------------------------------
(4) HSO: strong match with compound words
-----------------------------------------------------
According to the definition from the paper by Hirst and St-Onge,
"any link between two synsets if one word is a compound word
or phrase that includes the other word" is a "strong relation"
(score of 16).
For example, the two synsets 01124794 (n) and 01125562 (n) have a
hypernym/hyponym link between them, and one word associated with one
synset contains a word associated with the other
(government <--> misgovernment). So, following the definition, I think
there is a "strong relation" between the two synsets.
Now, using word-pos-sensenumber notation, the synset 01124794 (n)
can be represented as government#n#2 etc., and the other
synset 01125562 (n) can be represented in two ways:
"misgovernment#n#1" and "misrule#n#1" (using WordNet 3.0).
However, WordNet::Similarity gives different results for different wps
strings of the same synset:
The relatedness of government#n#2 and misgovernment#n#1 using hso is 16.
The relatedness of government#n#2 and misrule#n#1 using hso is 4.
I was wondering whether line 329 in hso.pm:
if($word1 =~ /$word2/ || $word2 =~ /$word1/) {
should ideally compare all words associated with the two synsets,
rather than only the two words taken from the wps notation.
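To make the suggestion concrete, here is a minimal sketch of the two
comparisons. It is written in Python purely for illustration and is not a
patch to hso.pm. The function names are hypothetical, and the word lists
are hard-coded from the examples in this report instead of being looked up
in WordNet, so they may be incomplete.

```python
def contains(word_a, word_b):
    """True if either word occurs inside the other.

    This is a literal substring test, so characters such as "." or "/"
    in a word are not treated as regex metacharacters.
    """
    return word_a in word_b or word_b in word_a


def wps_only_match(wps1, wps2):
    """Compare only the words named in the two word#pos#sense strings
    (roughly what the quoted line appears to do)."""
    word1 = wps1.split("#")[0]
    word2 = wps2.split("#")[0]
    return contains(word1, word2)


def synset_match(words1, words2):
    """Compare every word of one synset against every word of the other
    (the behaviour suggested in this report)."""
    return any(contains(a, b) for a in words1 for b in words2)


# Assumption: only the synset members mentioned in this report are listed;
# the real synsets may contain further words.
government_words = ["government"]
misgovernment_words = ["misgovernment", "misrule"]

# The wps-only check depends on which member was used to name the synset.
print(wps_only_match("government#n#2", "misgovernment#n#1"))
print(wps_only_match("government#n#2", "misrule#n#1"))

# The synset-level check gives one answer for the pair of synsets.
print(synset_match(government_words, misgovernment_words))
```

With a synset-level check like this, the result would no longer depend on
which wps string the caller happened to pass in.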
Below are some more examples, where the containment only shows up
through the synonym given in parentheses.
protocol#n#1 tcp/ip#n#1(=transmission_control_protocol/internet_protocol#n#1)
company#n#1 ltd.#n#1(=limited_company#n#1)
cell_phone#v#1 call#v#3(=phone#v#1)
This phenomenon is also very rare: it was not observed in 10k
randomly generated noun-noun pairs of synsets.