REL: Incorrect start positions for mentions
Issue
The API returns the same start position for different mentions of identical surface forms.
Examples
Example 1
In this example, the mention `[128, 5, 'Paris', …]` indicated by the second arrow has a wrong start position. (The true start position is 449.)
```python
import requests

IP_ADDRESS = "https://rel.cs.ru.nl/api"

def el(text):
    document = {
        "text": text,
        "spans": []
    }
    API_result = requests.post("{}".format(IP_ADDRESS), json=document).json()
    return API_result
# Input text
dialogue = "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"
# Entity Linking
el(dialogue)
# Output
#[
# [33, 10, 'Long Beach', 'Long_Beach,_California', 0.37835263573950106, 0.9858004450798035, 'LOC'],
# [97, 5, 'Paris', 'Paris', 0.8168372931596612, 0.9900407195091248, 'LOC'],
# ---> [128, 5, 'Paris', 'Paris', 0.8369923447610185, 0.9948850274085999, 'LOC'],
# [254, 5, 'Paris', 'Paris', 0.7787875371965208, 0.9929811954498291, 'LOC'],
# ...
# [381, 5, 'Paris', 'Paris', 0.8693552725957252, 0.9956340193748474, 'LOC'],
# ---> [128, 5, 'Paris', 'Paris', 0.9030778907304463, 0.9946692585945129, 'LOC'],
# [499, 10, 'Long Beach', 'Long_Beach,_California', 0.38097207439050973, 0.9697867035865784, 'LOC'],
# [545, 5, 'Paris', 'Paris', 0.8518866064471391, 0.9992392063140869, 'LOC'],
# ...
#]
```
Example 2
```python
dialogue = 'Paris. Paris. Paris. Paris.'
el(dialogue)
# Output
# [[0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
#  [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
#  [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
#  [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC']]
```
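(For reference, the correct start positions here would be 0, 7, 14, and 21.)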
Cause
- The piece of code in [1] is meant to translate local span indices (within a single sentence) into global indices (within the whole text). However, if sentences are identical (in this case "Paris.", "Paris.", etc.), `text.find(sent)` only ever finds the first one, and hence the offsets are not correct (see the snippet below).
- The same issue can be found in [2].
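A quick standalone illustration of why `text.find(sent)` collapses identical sentences onto the first occurrence (plain Python, outside REL):

```python
text = "Paris. Paris. Paris. Paris."
for sent in ["Paris.", "Paris.", "Paris.", "Paris."]:
    # str.find always returns the index of the FIRST match, so this prints 0 four times
    print(text.find(sent))
```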
Update 14-10-2020
How issues [1] and [2] affect the results (see [3]):
- If the input spans' length is `len(spans) == 0`, then `include_offset` becomes True and issue [1] affects the result.
- If the input spans' length is `len(spans) > 0`, then `include_offset` becomes False and issue [2] affects the result, because `start_pos` becomes the same as `ment["pos"]`, which is the result of [2]. (Maybe)
A paraphrase of this branching is sketched below.
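```python
# Simplified paraphrase of the branching described above; not REL's verbatim code.
spans = []  # the "spans" field of the request document
if len(spans) == 0:
    include_offset = True   # start positions go through the offset code in [1] -> issue [1]
else:
    include_offset = False  # start_pos stays ment["pos"] from mention detection [2] -> issue [2]
```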
How to fix
Idea
Use `pos_start = text[pos_start:].find(sent) + pos_start` instead of `pos_start = text.find(sent)`.
E.g., regarding [2]:

```python
...
i = 0
pos_start = 0  # Added
for sent in sentences:
    if len(sent.strip()) == 0:
        continue
    # Match gt to sentence.
    # pos_start = text.find(sent)  # Original
    pos_start = text[pos_start:].find(sent) + pos_start  # Added
    ...
```
I believe this correction also works for [1], though I have not checked it yet.
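As a sanity check, a self-contained version of this cumulative search on Example 2's input. Note that for identical sentences the search position must also advance past each match between iterations; the sketch makes that explicit (in REL the elided remainder of the loop presumably handles it):

```python
text = "Paris. Paris. Paris. Paris."
sentences = ["Paris.", "Paris.", "Paris.", "Paris."]

pos_start = 0
for sent in sentences:
    pos_start = text[pos_start:].find(sent) + pos_start
    print(pos_start)          # 0, 7, 14, 21 -- each occurrence found in turn
    pos_start += len(sent)    # advance past this match before the next search
```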
Todo
- Decide how to fix both issues
- Fix the first issue mentioned in the "Cause" section
- Fix the second issue
References
[1] https://github.com/informagi/REL/blob/master/REL/utils.py#L92
[2] https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L87
[3] https://github.com/informagi/REL/blob/master/REL/utils.py#L94
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 17 (17 by maintainers)
Thanks, looks great. I have merged your PR and will create some tests for this later today (and I'll close this issue after). I can quickly walk you through it maybe next week; the tests are all in the `tests` folder and run using `pytest` through GitHub workflows (see the `.github` folder).

Hi guys,
As I can't push my changes to Hideaki's repo, here's my proposed change. The root cause is actually in `find_mention`, so let's correct that with the following code for `find_mentions` and `process_results` respectively. Note that this way we don't have to create an additional dictionary, thus saving us an additional loop. Let me know if I missed anything!
Detailed explanation: In `find_mentions` we now add an offset if we are using Flair. This consists of a three-step process:

Okay, so I have an idea to "fix" this fairly easily, but I ran into another problem while exploring my solution… More on that later.
I thought of using the field `sent_idx` (https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L174). Since every mention has that field, we could calculate the offset per mention based on the total sentence length up to that point. My solution looks something like this:

So once you have these offsets, you can write something for the for-loop from utils.py#L91:

This also works properly for both examples. Effectively the same as your proposition @hideaki-j.
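A minimal sketch of the idea just described (illustrative names; `offsets` and `len_until_now` follow the line quoted in the next comment, the rest is assumed, and the accumulation step is written in the whitespace-safe form the thread converges on below):

```python
text = "Paris. Paris. Paris. Paris."
sentences = ["Paris."] * 4

# Build one global offset per sentence index.
offsets = {}
len_until_now = 0
for sent_idx, sent in enumerate(sentences):
    offset = len_until_now + text[len_until_now:].find(sent)
    offsets[sent_idx] = offset
    len_until_now = offset + len(sent)   # advance past this sentence

print(offsets)  # {0: 0, 1: 7, 2: 14, 3: 21}

# In the for-loop from utils.py#L91, each mention's global start then becomes:
# start_pos = offsets[ment["sent_idx"]] + ment["pos"]
```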
Now for the new problem: because the text is split up and parsed into Flair
Sentence
objects, the extra whitespace between sentences is lost. This is partly resolved by the lineoffset = len_until_now + text[len_until_now:].find(sent)
, where the cumulative length is summed with an additional index increase from.find
. However, if the amount of whitespace between sentences is significant, this doesn’t work. For example, the input:Which of course is not correct. If we want to solve this, we also need to keep track of whitespace between sentences, which is actually not trivial I think as sentence segmentation is handled using a different library (
segtok
). It does not seem to have a built-in way of keeping track of the whitespace.Of course this new problem is a crafted example that you probably won’t find in the wild, so we can choose to live with it. What do you think @hideaki-j ?
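A minimal trace with an illustrative whitespace-heavy input shows the failure mode:

```python
text = "Paris.      Paris.      Paris."   # six spaces between sentences
sentences = ["Paris.", "Paris.", "Paris."]

len_until_now = 0
for sent in sentences:
    offset = len_until_now + text[len_until_now:].find(sent)
    print(offset)               # 0, 12, 12 -- the third offset is wrong (should be 24)
    len_until_now += len(sent)  # ignores inter-sentence whitespace, so the cursor
                                # lands back on the second "Paris." for sentence 3
```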
I can add some tests later, good idea.
This case works in my environment.
Regarding the unit test, I agree with you. (I am still figuring out how I can create a unit test in GitHub, though.)
Thank you all!
One minor change. In the current code, `(offset - cum_sent_length)` could become negative when a sentence does not contain any entity. Easily fixed by moving the calculation from its original place:

to inside the for loop:

The fixed code looks like this:
I tested (maybe) all of the examples mentioned in our discussion and confirmed that it works. All test cases are shown below.
If you do not have any more ideas for this correction, I will create a pull request.
Updated code! Better safe than sorry, although the ‘many whitespaces’ example is hopefully something that’ll never occur xD.
Nice Mick, thanks 😃 One thing that needs to change is the line `cum_sent_length += len(sentence)`, since it doesn't deal properly with e.g. the "much whitespace" example ("Paris. Paris. Paris."). Easily fixed, however, by changing it to read `cum_sent_length += len(sentence) + (offset - cum_sent_length)`.
Also, the offset calculation should be outside of the `for entity in ...` loop. In its current place, you will get an `UnboundLocalError` by trying to reference the `offset` variable for a sentence that didn't have any mentions (e.g. "Hello. Hello.").

You're right, I missed that… In that case I agree with your idea of creating the offset dictionary earlier in the pipeline (for example in `find_mentions`). Let me know how you get on! And thanks for the feedback 👍

You're right @hideaki-j, the version on Gem accumulates the offsets in a different way, and then it doesn't happen. I played around a bit more this morning; I think I have it fixed completely now with some slight tweaks:
In our test examples, this works as intended. Could you verify, @hideaki-j? If it works for you as well, I will open a separate PR for this.
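For anyone wanting to verify locally, here is a pure-Python approximation of the repaired offset logic, checked against inputs from this thread (REL's real pipeline goes through Flair and per-mention lookups, so this is only a sketch with illustrative names):

```python
def sentence_offsets(text, sentences):
    """Global start offset per sentence index (sketch; not REL's verbatim code)."""
    offsets, cum_sent_length = {}, 0
    for sent_idx, sent in enumerate(sentences):
        offset = cum_sent_length + text[cum_sent_length:].find(sent)
        offsets[sent_idx] = offset
        # equivalent to: cum_sent_length += len(sent) + (offset - cum_sent_length)
        cum_sent_length = offset + len(sent)
    return offsets

print(sentence_offsets("Paris. Paris. Paris. Paris.", ["Paris."] * 4))
# {0: 0, 1: 7, 2: 14, 3: 21}
print(sentence_offsets("Paris.      Paris.      Paris.", ["Paris."] * 3))
# {0: 0, 1: 12, 2: 24}  -- whitespace-heavy input now handled correctly
```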
Ideally I’d like to solve this a bit neater, feels a bit hacky like this… But it will do for now, I’ll put a note in the project board 😃