aleph: Highlighting of search terms does not work on Documents

I have set ALEPH_RESULT_HIGHLIGHT=true in aleph.env, and highlighting of the search term seems to work on entity types other than Documents. For example, searching for a term that produces hits in CourtCase entities displays a highlight of the search term if the term is in, e.g., the Summary field of CourtCase.

For Documents highlighting works if the search term is in the title attribute of the Document entity (or other attributes), but not if the search term is in the indexed text of the document.

I can see from the text tab in the detail view of the document that the text has been indexed correctly, but still no highlight is displayed.

A search limited to CourtCase entities returns the highlight attribute. A search limited to Document (Pages) entities does not return the highlight attribute on the results.

Screenshot 2022-02-16 at 10.24.38

Screenshot 2022-02-16 at 10.25.04
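
To make the comparison above easier to reproduce, here is a minimal sketch that counts, per schema, how many search results carry a highlight attribute. It assumes the search response has already been parsed into a list of result dicts with `schema` and `highlight` keys; that shape, the helper name `highlight_coverage`, and the sample data are illustrative assumptions, not taken from the Aleph API documentation.

```python
from collections import defaultdict


def highlight_coverage(results):
    """Count, per schema, how many results have a non-empty 'highlight'.

    `results` is assumed to be a list of dicts, each with an optional
    'schema' and an optional 'highlight' key (hypothetical shape).
    """
    counts = defaultdict(lambda: {"with": 0, "without": 0})
    for result in results:
        schema = result.get("schema", "unknown")
        key = "with" if result.get("highlight") else "without"
        counts[schema][key] += 1
    return {schema: dict(c) for schema, c in counts.items()}


# Made-up sample results mirroring the behaviour described in this issue.
sample = [
    {"schema": "CourtCase", "highlight": ["... <em>term</em> ..."]},
    {"schema": "Pages"},
    {"schema": "Image", "highlight": ["<em>term</em> in OCR text"]},
]
print(highlight_coverage(sample))
```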

The strange thing is that highlighting seems to work if the entity is of type Image which has been OCRed into text. Then the search term (if found in OCRed text) will appear as highlighted.

If I search within the document when viewing the Document entity itself (so the scope is only that Document), highlighting works and the Pages returned contain the highlight attribute.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

The search matches on a text field that is excluded from search results by default. Also, this text field is not part of the properties.* field that is used in the entities highlight query, so won’t be included in the highlight.
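
As a rough illustration of the point above, the sketch below builds an Elasticsearch request body in which the full-text field is excluded from the returned `_source` and the highlight section only covers `properties.*` unless the text field is added explicitly. The field name `text`, the `query_string` query, and the helper name `build_search_body` are assumptions made for illustration; this is not the query Aleph actually builds.

```python
def build_search_body(term, highlight_text=False):
    """Build a hypothetical Elasticsearch search body for an entity search.

    By default only `properties.*` is highlighted, so a match that occurs
    solely in the (assumed) `text` field yields no highlight fragments.
    """
    highlight_fields = {"properties.*": {}}
    if highlight_text:
        # Explicitly request highlights on the full-text field as well.
        highlight_fields["text"] = {}
    return {
        "query": {"query_string": {"query": term}},
        # Source filtering: keep the large text field out of the response.
        "_source": {"excludes": ["text"]},
        "highlight": {
            "fields": highlight_fields,
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
        },
    }


default_body = build_search_body("fraud")
patched_body = build_search_body("fraud", highlight_text=True)
print(sorted(default_body["highlight"]["fields"]))
print(sorted(patched_body["highlight"]["fields"]))
```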

After including this field in results and highlights, the highlight results are weird: Elasticsearch highlights the wrong parts of the text. This seems to be caused by the term_vector configuration; when I comment out this line, I get proper search highlights.
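
One way to narrow this down is to compare a mapping with and without `term_vector`, and to request the `plain` highlighter, which re-analyzes the field text instead of relying on stored term vectors. The sketch below only builds the relevant dict fragments. The field name `text` and the value `with_positions_offsets` are assumptions, since the actual setting in the Aleph mapping is not quoted here, and both helper names are hypothetical.

```python
def text_field_mapping(use_term_vectors=True):
    """Return a hypothetical mapping fragment for the full-text field."""
    mapping = {"type": "text"}
    if use_term_vectors:
        # Assumed value; the real Aleph mapping may use a different option.
        mapping["term_vector"] = "with_positions_offsets"
    return mapping


def highlight_spec(field="text", highlighter="plain"):
    """Return a highlight section that forces a specific highlighter type.

    Elasticsearch accepts "unified", "plain" and "fvh" as highlighter types.
    """
    return {"fields": {field: {"type": highlighter}}}


with_vectors = text_field_mapping(use_term_vectors=True)
without_vectors = text_field_mapping(use_term_vectors=False)
print(with_vectors, without_vectors, highlight_spec())
```

If the `plain` highlighter returns correct fragments while the default one does not, that would point at the stored term vectors rather than at the query itself.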

I’ve asked @pudo for a bit of explanation before I start tackling this, as I’d love to know more about the reasoning behind excluding these text fields.

Hmm, after looking at this more closely, I think we’re actually discussing two separate things here.

  1. The search highlights are currently not working for Pages entities, which is an issue.
  2. How to improve the search for documents to provide more context about what page a search term is found on, which is more like a feature request.

Would it be an idea to open a separate issue to track point 2, the page context in search?

Also, when I’ve got some time next week, I’ll see if I can investigate the missing highlights.

Hi @anderser, @sunu, how are you? ^^ @monneyboi said I should come to the party 😃

I think knowing the page number is relevant because you might want to look for more context on the pages: the context might be a lot larger than the snippet. You could also see whether a search term is concentrated in a certain part of the document. It is valuable info, imho.