REL: Error for start position of mentions

Issue

The same start position is returned for different mentions.

Examples

Example 1

In this example, the mention “[128, 5, ‘Paris’…”, indicated by the second arrow, has a wrong start position (the true start position is 449).

import requests

IP_ADDRESS = "https://rel.cs.ru.nl/api"

def el(text):
    document = {
        "text": text,
        "spans": []
    }
    API_result = requests.post("{}".format(IP_ADDRESS), json=document).json()
    return API_result

# Input text
dialogue =  "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"

# Entity Linking
el(dialogue)

# Output
#[
#    [33, 10, 'Long Beach', 'Long_Beach,_California', 0.37835263573950106, 0.9858004450798035, 'LOC'], 
#    [97, 5, 'Paris', 'Paris', 0.8168372931596612, 0.9900407195091248, 'LOC'], 
#    ---> [128, 5, 'Paris', 'Paris', 0.8369923447610185, 0.9948850274085999, 'LOC'], 
#    [254, 5, 'Paris', 'Paris', 0.7787875371965208, 0.9929811954498291, 'LOC'], 
#    ...
#    [381, 5, 'Paris', 'Paris', 0.8693552725957252, 0.9956340193748474, 'LOC'], 
#    ---> [128, 5, 'Paris', 'Paris', 0.9030778907304463, 0.9946692585945129, 'LOC'], 
#    [499, 10, 'Long Beach', 'Long_Beach,_California', 0.38097207439050973, 0.9697867035865784, 'LOC'], 
#    [545, 5, 'Paris', 'Paris', 0.8518866064471391, 0.9992392063140869, 'LOC'], 
#    ...
#]

Example 2

dialogue = 'Paris. Paris. Paris. Paris.'
el(dialogue)

# Output
# [[0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
# [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
# [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC'],
# [0, 5, 'Paris', 'Paris', 0.871144498463729, 0.9808972477912903, 'LOC']]

Cause

  • The piece of code in [1] is meant to translate local span indices (within a single sentence) to global indices (within the whole text). However, text.find(sent) only ever finds the first occurrence, so if the sentences are identical (in this case “Paris.”, “Paris.”, etc.) every sentence is mapped to the position of the first one and the offsets are not correct (see the sketch below).
  • The same issue can be found in [2].
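
To see the root cause in isolation, here is a plain-Python illustration (independent of REL; the advancing-search variant at the end only illustrates the general remedy, it is not the actual patch):

text = "Paris. Paris. Paris."
sentences = ["Paris.", "Paris.", "Paris."]

# text.find(sent) always returns the first occurrence, so every identical
# sentence is mapped to offset 0:
print([text.find(sent) for sent in sentences])  # [0, 0, 0]

# Searching from a position that advances past each matched sentence avoids this:
pos, offsets = 0, []
for sent in sentences:
    pos = text.find(sent, pos)
    offsets.append(pos)
    pos += len(sent)
print(offsets)  # [0, 7, 14]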

Update 14-10-2020

How issues [1] and [2] affect the results (see [3]); a toy sketch of this branching follows after the list.

  • If no spans are supplied as input (len(spans) == 0), then include_offset becomes True and issue [1] affects the result.
  • If spans are supplied (len(spans) > 0), then include_offset becomes False and issue [2] affects the result.
    • This is because start_pos then becomes the same as ment["pos"], which is produced by [2]. (Maybe.)
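
Sketching this branching with toy data (my own reconstruction of the two code paths, not REL’s actual code; the key names mirror the snippets later in this thread):

# Hypothetical mention data for one document: two identical sentences, each
# containing one mention whose *local* position within its sentence is 0.
text = "Paris. Paris."
ment_doc = [
    {"sentence": "Paris.", "pos": 0},  # mention in the first sentence
    {"sentence": "Paris.", "pos": 0},  # mention in the second sentence
]

for include_offset in (True, False):  # True when len(spans) == 0, False otherwise
    starts = []
    for ment in ment_doc:
        if include_offset:
            # issue [1]: text.find only ever returns the first occurrence
            starts.append(text.find(ment["sentence"]) + ment["pos"])
        else:
            # issue [2]: ment["pos"] itself was computed with the same
            # first-occurrence logic during mention detection
            starts.append(ment["pos"])
    print(include_offset, starts)
# Both branches print [0, 0], whereas the correct global starts are [0, 7].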

How to fix

Idea

Use pos_start = text[pos_start:].find(sent) + pos_start instead of using pos_start = text.find(sent)

E.g., Regarding [2]:

...
i = 0
pos_start = 0 # Added
for sent in sentences:
    if len(sent.strip()) == 0:
        continue
    # Match gt to sentence.
    #pos_start = text.find(sent) # Original
    pos_start = text[pos_start:].find(sent) + pos_start # Added
    ...

I believe this correction also works for [1], though I have not checked it yet.

Todo

  • Decide how to fix both issues
  • Fix the first issue mentioned in the “Cause” section
  • Fix the second issue

References

[1] https://github.com/informagi/REL/blob/master/REL/utils.py#L92
[2] https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L87
[3] https://github.com/informagi/REL/blob/master/REL/utils.py#L94

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Thanks, looks great. I have merged your PR and will create some tests for this later today (and I’ll close this issue after). I can quickly walk you through it maybe next week; the tests are all in the tests folder and are run using pytest through GitHub workflows (see the .github folder).
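
For reference, a minimal sketch of what such a test might look like (this is not the test that was actually added; the file name is hypothetical, and it reuses the requests-based helper from the top of this issue, so it needs the public API to be reachable):

# test_start_positions.py -- hypothetical sketch, not the maintainers' actual tests
import pytest
import requests

API_URL = "https://rel.cs.ru.nl/api"

def el(text):
    # same helper as at the top of this issue
    return requests.post(API_URL, json={"text": text, "spans": []}).json()

@pytest.mark.parametrize(
    "text",
    [
        "Paris. Paris. Paris. Paris.",
        "Paris, Paris. Paris.",
    ],
)
def test_start_positions_match_surface_forms(text):
    for start, length, surface, *rest in el(text):
        # every reported span must reproduce its surface form in the raw text
        assert text[start:start + length] == surface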

Hi guys,

As I can’t push my changes to Hideaki’s repo, here’s my proposed change. The root cause is actually in find_mentions, so let’s correct that with the following code for find_mentions and process_results respectively. Note that this way we don’t have to create an additional dictionary, which saves us an extra loop.

Let me know if I missed anything!

Detailed explanation: In find_mentions we now add an offset if we are using Flair. This consists of a three-step process:

  1. Compute the cumulative sentence length.
  2. Compute the offset based on the raw text and the cumulative sentence length. This is required because some whitespace between sentences may be removed during splitting.
  3. Add the offset to the start position in the respective sentence.
def find_mentions(self, dataset, tagger=None):
        """
        Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
        it returns the mention, its left/right context and a set of candidates.
        :return: Dictionary with mentions per document.
        """
        if tagger is None:
            raise Exception(
                "No NER tagger is set, but you are attempting to perform Mention Detection.."
            )
        # Verify if Flair, else ngram or custom.
        is_flair = isinstance(tagger, SequenceTagger)
        dataset_sentences_raw, processed_sentences, splits = self.split_text(dataset, is_flair)
        results = {}
        total_ment = 0
        if is_flair:
            tagger.predict(processed_sentences)
        for i, doc in enumerate(dataset_sentences_raw):
            contents = dataset_sentences_raw[doc]
            raw_text = dataset[doc][0]
            sentences_doc = [v[0] for v in contents.values()]
            sentences = processed_sentences[splits[i] : splits[i + 1]]
            result_doc = []
            cum_sent_length = 0
            offset = 0
            for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
                contents.items(), sentences
            ):
                for entity in (
                    snt.get_spans("ner")
                    if is_flair
                    else tagger.predict(snt, processed_sentences)
                ):
                    text, start_pos, end_pos, conf, tag = (
                        entity.text,
                        entity.start_pos,
                        entity.end_pos,
                        entity.score,
                        entity.tag,
                    )
                    total_ment += 1
                    m = self.preprocess_mention(text)
                    cands = self.get_candidates(m)
                    if len(cands) == 0:
                        continue
                    # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                    ngram = sentence[start_pos:end_pos]
                    left_ctxt, right_ctxt = self.get_ctxt(
                        start_pos, end_pos, idx_sent, sentence, sentences_doc
                    )
                    # Only include offset if using Flair.
                    if is_flair:
                        offset = raw_text.find(sentence, cum_sent_length)
                    res = {
                        "mention": m,
                        "context": (left_ctxt, right_ctxt),
                        "candidates": cands,
                        "gold": ["NONE"],
                        "pos": start_pos+offset,
                        "sent_idx": idx_sent,
                        "ngram": ngram,
                        "end_pos": end_pos+offset,
                        "sentence": sentence,
                        "conf_md": conf,
                        "tag": tag,
                    }
                    result_doc.append(res)
                cum_sent_length += len(sentence) + (offset - cum_sent_length)
            results[doc] = result_doc
        return results, total_ment 

def process_results(
    mentions_dataset,
    predictions,
    processed,
    include_offset=False,
):
    """
    Function that can be used to process the End-to-End results.
    :return: dictionary with results and document as key.
    """
    res = {}
    for doc in mentions_dataset:
        if doc not in predictions:
            # No mentions found, we return empty list.
            continue
        pred_doc = predictions[doc]
        ment_doc = mentions_dataset[doc]
        text = processed[doc][0]
        res_doc = []

        for pred, ment in zip(pred_doc, ment_doc):
            sent = ment["sentence"]
            idx = ment['sent_idx']
            start_pos = ment["pos"]
            mention_length = int(ment["end_pos"] - ment["pos"])

            if pred["prediction"] != "NIL":
                temp = (
                    start_pos,
                    mention_length,
                    ment["ngram"],
                    pred["prediction"],
                    pred["conf_ed"],
                    ment["conf_md"] if "conf_md" in ment else 0.0,
                    ment["tag"] if "tag" in ment else "NULL",
                )
                res_doc.append(temp)
        res[doc] = res_doc
    return res

Okay, so I have an idea to “fix” this fairly easily but I ran into another problem while exploring my solution… More on that later.


I thought of using the field sent_idx (https://github.com/informagi/REL/blob/master/REL/mention_detection.py#L174). Since every mention has that field, we could calculate the offset per mention based on the total sentence length up to that point. My solution looks something like this:

import numpy

input = "Paris, Paris. Paris."

# then mentions look something like:
[{"mention": "Paris", ... "sent_idx": 0, "pos": 0}, {"mention": "Paris", ... "sent_idx": 0, "pos": 7}, {"mention": "Paris", ... "sent_idx": 1, "pos": 0}]

# so we can calculate an offset based on sentence lengths, something like:
sentence_lengths = {m["sent_idx"] + 1: len(m["sentence"]) for m in mentions}
# the "+ 1" is necessary because we do not want any offset with sent_idx = 0, so everything needs to shift over by 1 index
# e.g. for sent_idx = 1, we want offset = len(sent_idx 0)
cumulative_offsets = dict(zip(sentence_lengths.keys(), numpy.cumsum(list(sentence_lengths.values())).tolist()))
# after adding {0: 0}, this results in something like:
{0: 0, 1: 13, 2: 19}

So once you have these offsets, you can write something for the for-loop from utils.py#L91:

if include_offset:
    len_until_now = cumulative_offsets[ment["sent_idx"]]
    offset = len_until_now + text[len_until_now:].find(sent)
start_pos = offset + ment["pos"]
...

This also works properly for both examples; it is effectively the same as your proposal, @hideaki-j.


Now for the new problem: because the text is split up and parsed into Flair Sentence objects, the extra whitespace between sentences is lost. This is partly compensated for by the line offset = len_until_now + text[len_until_now:].find(sent), where the cumulative length is corrected by however many extra characters .find has to skip. However, if the amount of whitespace between sentences is large, this doesn’t work. For example, the input:

example = "Paris.              Paris.               Paris."

#> outputs:
[[0, 5, "Paris"...], [20, 5, "Paris"...], [20, 5, "Paris"...]]

This is of course not correct. If we want to solve it properly, we also need to keep track of the whitespace between sentences, which I think is not trivial, since sentence segmentation is handled by a separate library (segtok) that does not seem to have a built-in way of preserving that whitespace.
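
To make the failure concrete, here is the arithmetic on a version of this input with fixed gap widths (14 and 15 spaces, chosen by me to reproduce the [0, 20, 20] output above):

text = "Paris." + " " * 14 + "Paris." + " " * 15 + "Paris."
# true sentence starts: 0, 20, 41

# cumulative sentence lengths only (the whitespace is lost during splitting):
cumulative_offsets = {0: 0, 1: 6, 2: 12}

for idx, len_until_now in cumulative_offsets.items():
    offset = len_until_now + text[len_until_now:].find("Paris.")
    print(idx, offset)
# 0 0   -- correct
# 1 20  -- correct: .find() skips the 14 lost spaces
# 2 20  -- wrong: the search starts at position 12, still inside the first gap,
#          so it finds the second "Paris." again instead of the third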


Of course this new problem is a crafted example that you probably won’t find in the wild, so we can choose to live with it. What do you think @hideaki-j ?

I can add some tests later, good idea.

text = "Hello. Hello. Hello"

This case works in my environment.

Regarding the unit test, I agree with you. (I am still figuring out how to create a unit test on GitHub, though.)

Thank you all!

One minor change: in the current code, (offset - cum_sent_length) can become negative when a sentence does not contain any entity. This is easily fixed by moving

if is_flair:
    offset = raw_text.find(sentence, cum_sent_length)

from inside the inner loop:

for entity in (
...

to the top of the outer loop:

for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
...

The fixed code looks like this:

def find_mentions(self, dataset, tagger=None):
        """
        Responsible for finding mentions given a set of documents in a batch-wise manner. More specifically,
        it returns the mention, its left/right context and a set of candidates.
        :return: Dictionary with mentions per document.
        """
        if tagger is None:
            raise Exception(
                "No NER tagger is set, but you are attempting to perform Mention Detection.."
            )
        # Verify if Flair, else ngram or custom.
        is_flair = isinstance(tagger, SequenceTagger)
        dataset_sentences_raw, processed_sentences, splits = self.split_text(dataset, is_flair)
        results = {}
        total_ment = 0
        if is_flair:
            tagger.predict(processed_sentences)
        for i, doc in enumerate(dataset_sentences_raw):
            contents = dataset_sentences_raw[doc]
            raw_text = dataset[doc][0]
            sentences_doc = [v[0] for v in contents.values()]
            sentences = processed_sentences[splits[i] : splits[i + 1]]
            result_doc = []
            cum_sent_length = 0
            offset = 0
            for (idx_sent, (sentence, ground_truth_sentence)), snt in zip(
                contents.items(), sentences
            ):
                
                # Only include offset if using Flair.
                if is_flair:
                    offset = raw_text.find(sentence, cum_sent_length)
                
                for entity in (
                    snt.get_spans("ner")
                    if is_flair
                    else tagger.predict(snt, processed_sentences)
                ):
                    text, start_pos, end_pos, conf, tag = (
                        entity.text,
                        entity.start_pos,
                        entity.end_pos,
                        entity.score,
                        entity.tag,
                    )
                    total_ment += 1
                    m = self.preprocess_mention(text)
                    cands = self.get_candidates(m)
                    if len(cands) == 0:
                        continue
                    # Re-create ngram as 'text' is at times changed by Flair (e.g. double spaces are removed).
                    ngram = sentence[start_pos:end_pos]
                    left_ctxt, right_ctxt = self.get_ctxt(
                        start_pos, end_pos, idx_sent, sentence, sentences_doc
                    )
                    res = {
                        "mention": m,
                        "context": (left_ctxt, right_ctxt),
                        "candidates": cands,
                        "gold": ["NONE"],
                        "pos": start_pos+offset,
                        "sent_idx": idx_sent,
                        "ngram": ngram,
                        "end_pos": end_pos+offset,
                        "sentence": sentence,
                        "conf_md": conf,
                        "tag": tag,
                    }
                    result_doc.append(res)
                cum_sent_length += len(sentence) + (offset - cum_sent_length)
            results[doc] = result_doc
        return results, total_ment 

I tested (I believe) all of the examples mentioned in our discussion and confirmed that it works. All test cases are shown below.

# Case 1
text =  "Hi. I need to book a vacation to Long Beach between August 25 and September 3. Departure is from Paris\tWould you like to depart Paris on September 6th?\tPreferably by the 3rd. Is there anything available?\tSorry, looks like there is nothing available from Paris to Long Beach on September 3rd.\tI'm not quite sure I understand, is there anything available leaving Long Beach to go to Paris between August 25 and September 3rd?\tWould you like to depart Paris on September 6th?\tNo. I would like to leave Long Beach around the 25th of August to go to Paris for some reason. What is so confusing about that!?\tYou can leave Long Beach, USA and go to Paris, France on Tuesday, August 30th. Will I book this?\tFinally! No, don't book yet, I would like to know more about the hotel. Is there free breakfast?\tThere is free wifi.\tLook buddy, is there free breakfast or not? Tell me, am I gonna get eggs, toast, cereal, etc? You know? The good stuff?\tThere is free wifi at the hotel. \tWhat is the price of this toastless package?\tThis package costs 2191.55USD.\tIs this the cheapest option?\tYes, this is the cheapest option from Long Beach to Paris.\tAnd the hotel has how many stars?\tMuse Hotel has 2.0 stars.\tOk I will book this one.\tGreat. You will leave Paris on September 7"

# Case 2
text = "Paris, Paris. Paris."

# Case 3
text = "Paris.              Paris.               Paris."

# Case 4
text = 'Pars. Paris. Paris. Paris.'

# Case 5
text = 'Paris. Paris. JParis. Paris. ' # JParis is the entity but this is not in Wikipedia, so REL cannot detect it.

If you do not have any more ideas for this correction, I will create a pull request.

Updated code! Better safe than sorry, although the ‘many whitespaces’ example is hopefully something that’ll never occur xD.

Nice Mick, thanks 😃 One thing that needs to change is the line cum_sent_length += len(sentence), since it doesn’t deal properly with e.g. the ‘many whitespaces’ example above (“Paris.” repeated with wide gaps between the sentences).

Easily fixed however by changing it to read cum_sent_length += len(sentence) + (offset - cum_sent_length).

Also, the offset calculation should be outside of the for entity in ... loop. In its current place, you will get an UnboundLocalError by trying to reference the offset variable for a sentence that didn’t have any mentions (e.g. “Hello. Hello.”).

Not all sentences have a mention.

You’re right, I missed that… In that case I agree with your idea of creating the offset dictionary earlier in the pipeline (for example in find_mentions). Let me know how you get on! And thanks for the feedback 👍

You’re right @hideaki-j, the version on Gem accumulates the offsets in a different way, so that problem doesn’t happen there. I played around a bit more this morning; I think I have it fixed completely now with some slight tweaks:

# in utils.py - process_results

sents = {m["sent_idx"]: m["sentence"] for m in ment_doc}
cum_offsets = {0: 0}
# e.g. with 4 sentences, loop through 1, 2, 3 and skip 0
for idx in range(1, max(sents.keys())+1):
    # current sentence text
    cur_sent = sents[idx]
    # length of previous sentence
    prev_sent_len = len(sents[idx-1])
    # number of chars up until now: previous sentence offset + length of previous sentence
    until_now = cum_offsets[idx-1] + prev_sent_len
    # total offset is until_now + whitespace compensation through .find(cur_sent)
    cum_offsets[idx] = until_now + text[until_now:].find(cur_sent)

...

if include_offset:
    offset = cum_offsets[ment["sent_idx"]]
start_pos = offset + ment["pos"]
...
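
For reference, running the same accumulation on a standalone version of the wide-whitespace example (gap widths of 14 and 15 spaces assumed, as in the worked example earlier in this thread) gives the expected sentence offsets:

text = "Paris." + " " * 14 + "Paris." + " " * 15 + "Paris."
sents = {0: "Paris.", 1: "Paris.", 2: "Paris."}

cum_offsets = {0: 0}
for idx in range(1, max(sents.keys()) + 1):
    prev_sent_len = len(sents[idx - 1])
    until_now = cum_offsets[idx - 1] + prev_sent_len
    cum_offsets[idx] = until_now + text[until_now:].find(sents[idx])

print(cum_offsets)  # {0: 0, 1: 20, 2: 41} -- matches the true sentence starts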

In our test examples, this works as intended. Could you verify, @hideaki-j? If it works for you as well, I will open a separate PR for this.


Ideally I’d like to solve this a bit more neatly; it feels a bit hacky like this… But it will do for now, I’ll put a note on the project board 😃