Subject: | bug in lesk normalization |
Lesk normalization has always been a little unstable (and can provide scores greater than 1). The following report is from Ryan Simmons and provides more details.
--------------------------------------------
Over the past few days I have been working with the Lesk normalization feature, which (as has been mentioned previously) doesn't always constrain the output to its supposed upper bound of 1. I am not sure whether I have found the problem or merely another symptom of it; I haven't had the chance to experiment much, but I figured I would let you know what I found. I am not an expert on WordNet or Perl programming in general, so please point out any mistakes.
When Lesk is calculated, overlap scores are obtained for the various relation pairs listed in the lesk-relation.dat file. The default/example file has 88 pairs (also-also, also-attr, etc.). For each pair, an overlap score is calculated (and normalized, if that option is activated), and the per-pair scores are then added together to form the main score. So the main score is the sum of the individual scores for each relation pair within the super gloss. In lesk.pm, the score obtained from the glosses for each relation pair is normalized according to the size of those glosses; however, these already-normalized numbers are still added together, so the main score can exceed 1.
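To make the arithmetic concrete, here is a minimal Python sketch of that summation (my own illustration, not the actual lesk.pm code; the function names and the normalization-by-gloss-size-product are assumptions for the example): each pair's score is normalized to [0,1] individually, but the pair scores are then summed, so the total is bounded only by the number of relation pairs.

```python
# Illustrative sketch only -- NOT the actual lesk.pm implementation.
# Each relation pair's overlap is normalized to [0, 1] on its own,
# but the per-pair scores are summed, so the total can exceed 1.

def normalized_pair_score(overlap, gloss_size_product):
    """Normalize one relation pair's overlap by the (assumed) gloss-size product."""
    return overlap / gloss_size_product if gloss_size_product else 0.0

def lesk_score(pair_overlaps):
    """Sum the already-normalized scores across all relation pairs."""
    return sum(normalized_pair_score(o, g) for o, g in pair_overlaps)

# Hypothetical overlaps for three relation pairs, each a perfect match:
pairs = [(10, 10), (25, 25), (7, 7)]   # (overlap, gloss-size product)
print(lesk_score(pairs))               # 3.0 -- already above the supposed bound of 1
```

With three perfectly matching pairs the "normalized" total is 3.0, which is exactly the shape of the problem in the dog#n#1 example below.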
For example:
Say you compare "dog#n#1" and "dog#n#1" with Lesk. To make my example a little simpler, I used the following lesk-relation.dat file, instead of the default one:
RelationFile
also-also
attr-attr
caus-caus
enta-enta
example-example
glosexample-glosexample
glos-glos
holo-holo
hype-hype
hypo-hypo
mero-mero
part-part
pert-pert
sim-sim
syns-syns
The output for this is 5.15428512949297. Then I ran Lesk again, using 15 separate relation.dat files, each containing only a single relation pair: first "also-also", then "attr-attr", then "caus-caus", and so on. Here are the values:
also-also = 0
attr-attr = 0
caus-caus = 0
enta-enta = 0
example-example = 1
glosexample-glosexample = 1
glos-glos = 1
holo-holo = 0.505190311418685
hype-hype = 0.470663265306122
hypo-hypo = 0.0584315527681661
mero-mero = 1
part-part = 0
pert-pert = 0
sim-sim = 0
syns-syns = 0.12
Predictably, these add up to 5.15428512949297.
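As a quick sanity check on that arithmetic, the fifteen per-pair scores above can be summed directly (values copied from the single-pair runs):

```python
# Per-pair Lesk scores from the 15 single-relation-pair files above.
scores = {
    "also-also": 0, "attr-attr": 0, "caus-caus": 0, "enta-enta": 0,
    "example-example": 1, "glosexample-glosexample": 1, "glos-glos": 1,
    "holo-holo": 0.505190311418685, "hype-hype": 0.470663265306122,
    "hypo-hypo": 0.0584315527681661, "mero-mero": 1,
    "part-part": 0, "pert-pert": 0, "sim-sim": 0, "syns-syns": 0.12,
}
total = sum(scores.values())
print(total)  # matches the combined-file output of 5.15428512949297 (up to float rounding)
```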
Now, this example isn't perfect (under the default relation.dat file, the score for dog#n#1 and dog#n#1 is 4.25107026707742), but I think it illustrates the issue. Since the [0,1] normalization only happens at the level of individual relation pairs, in cases of identity the summed score will exceed 1. I am not sure what the best fix would be (or even whether I have really nailed the problem down; my example may not be representative, and I haven't had time to test under more varied conditions). But, as far as I can tell from the output and the .pm file, this is where the problem is occurring.
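One direction a fix might take (purely speculation on my part, not anything currently in lesk.pm) is to apply a second normalization after the summation, e.g. dividing the summed score by the number of relation pairs, which would pin the final value back into [0,1]:

```python
# Speculative sketch of post-summation normalization -- NOT the lesk.pm code.
def normalized_lesk(pair_scores):
    """Average the per-pair normalized scores so the result stays in [0, 1]."""
    return sum(pair_scores) / len(pair_scores) if pair_scores else 0.0

# The 15 per-pair scores from the dog#n#1 vs. dog#n#1 runs above:
scores = [0, 0, 0, 0, 1, 1, 1, 0.505190311418685, 0.470663265306122,
          0.0584315527681661, 1, 0, 0, 0, 0.12]
print(normalized_lesk(scores))  # about 0.3436 -- now within [0, 1]
```

Whether a plain average is the right weighting is another question; it treats every relation pair as equally important, which may not be what the measure intends.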