aleph: BUG: Languages and countries filter not showing results

Describe the bug

When searching an investigation, I see that some files have an associated Country and Language. Nevertheless, when I click on the Languages or Countries filters in the left sidebar, no results are shown.

To Reproduce

Steps to reproduce the behavior:

  1. Go to an investigation with documents that have a detected language and country
  2. Click on Countries in the left sidebar
  3. No Countries are shown

Expected behavior

The detected countries and languages should be listed so that documents can be filtered by them.

Additional context

I’m running 3.12.6, and I was not able to reproduce the error on OCCRP Aleph’s open instance.

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 29 (15 by maintainers)

Most upvoted comments

I have been able to reproduce the issue with a plaintext version of the 1st Book of the King James Bible

Awesome, thanks for that @tillprochaska

@lyz-code just FYI I’m going to have a look at this right after we release the current pending version of ingest-file, there’s one more bug I need to look at.

Hi @stchris, I have been able to reproduce the issue with a plaintext version of the 1st Book of the King James Bible (available from Project Gutenberg). Maybe the test documents we tried before weren’t long enough for language detection to work.

Steps to reproduce:

  1. Create an investigation.
  2. Upload the document.
  3. Make an empty search in the investigation (i.e. simply press enter while the investigation search input is focused).
  4. If the language column isn’t visible on the search results page, click “Configure columns” and select “Languages”.
  5. You should see the document in the search results list, including “English” in the “Languages” column. However, opening the “Languages” facet in the left-hand sidebar doesn’t show any options.
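
To check the same thing outside the UI, one could ask Elasticsearch which values it would offer for a facet. The sketch below only builds the request body. It assumes, and this is not confirmed in the thread, that the sidebar facet is backed by a terms aggregation over a top-level field named after the facet (for example languages) and that documents carry a top-level collection_id field. facet_query is a hypothetical helper name.

```python
"""Build an Elasticsearch request body that lists the values of a facet."""

from typing import Optional


def facet_query(field: str, collection_id: Optional[str] = None, size: int = 20) -> dict:
    """Return a search body that counts documents per value of `field`."""
    query: dict = {"match_all": {}}
    if collection_id is not None:
        # Assumption: documents carry a top-level "collection_id" field.
        query = {"bool": {"filter": [{"term": {"collection_id": collection_id}}]}}
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": query,
        "aggs": {field: {"terms": {"field": field, "size": size}}},
    }


body = facet_query("languages")
print(body)
```

If a search with this body returned no buckets for languages while the hits themselves carry a detectedLanguage property, that would be consistent with the behaviour described in step 5.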

Note: Contrary to what I said earlier, the inverted type groups added to the serialized entity data by to_full_dict aren’t relevant when we display search results in the UI. We generate the type groups on the fly in the UI based on the raw properties. That’s why languages and countries are displayed perfectly fine in search results, even though the filter options aren’t populated correctly.
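
The difference between the two code paths can be illustrated with a small sketch. It uses a hypothetical mapping from property names to type groups (PROPERTY_GROUPS) and a hypothetical helper (type_groups); neither is Aleph's or the UI's actual code. Given the raw properties, the groups can always be computed on the fly, which is what the note above says the search results page does. A facet, presumably, only sees groups that were written to the index.

```python
from collections import defaultdict

# Hypothetical mapping from property name to type group, for illustration only.
PROPERTY_GROUPS = {
    "language": "languages",
    "detectedLanguage": "languages",
    "country": "countries",
    "detectedCountry": "countries",
}


def type_groups(properties: dict) -> dict:
    """Collect property values under their type group, without duplicates."""
    groups = defaultdict(list)
    for name, values in properties.items():
        group = PROPERTY_GROUPS.get(name)
        if group is None:
            continue  # property does not belong to any group we track
        for value in values:
            if value not in groups[group]:
                groups[group].append(value)
    return dict(groups)


entity = {"properties": {"fileName": ["kjv.txt"], "detectedLanguage": ["eng"]}}
print(type_groups(entity["properties"]))  # {'languages': ['eng']}
```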

I copied the detectedLanguage and detectedCountry properties of each file into languages and countries fields with the following script, and afterwards the UI populated both the Languages and the Countries facets:

"""Script that adds required fields to Aleph documents."""

import logging
from contextlib import suppress

from elasticsearch import Elasticsearch

INDEX = "aleph-entity-pages-v1"
ELASTICSEARCH_URL = "http://aleph_elasticsearch_1:9200"

# Configure the logging
log = logging.getLogger(__name__)

logging.basicConfig(level=logging.INFO)

log.info("Creating the connection of the database")
client = Elasticsearch(ELASTICSEARCH_URL)

log.info("Getting the documents to edit")
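# Note: a single search returns at most `size` hits, so only the first 9999
# documents are processed; larger indexes would need pagination.
resp = client.search(index=INDEX, query={"match_all": {}}, size=9999)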
documents = resp["hits"]["hits"]

for document in documents:
    for old_field, new_field in [
        ("detectedCountry", "countries"),
        ("detectedLanguage", "languages"),
    ]:
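        # Skip documents whose properties already contain the target field.
        if new_field in document["_source"]["properties"]:
            continue

        # Copy the detected values, if any, into a top-level field of the
        # indexed document; suppress() leaves doc empty when they are missing.
        doc = {}
        with suppress(KeyError):
            doc[new_field] = document["_source"]["properties"][old_field]

        if doc: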
            id_ = document["_id"]

            log.info(f"Updating the document {id_} with body {doc}")
            resp = client.update(index=INDEX, id=id_, doc=doc)
            if resp["result"] != "updated":
                log.warning(
                    f"The update of the file {id_} with doc {doc} "
                    f"returned {resp['result']}"
                )
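
Before running a script like this against a live index, it may help to separate the decision about what to write from the Elasticsearch calls. The sketch below is a hypothetical refactoring (plan_updates and FIELD_MAP are not part of the original script). It takes search hits shaped like the ones above and returns the partial documents that would be sent, so the logic can be checked on sample data without a cluster. Unlike the script, it merges both fields into a single partial document per hit and tolerates hits without a properties key.

```python
FIELD_MAP = [
    ("detectedCountry", "countries"),
    ("detectedLanguage", "languages"),
]


def plan_updates(hits: list) -> list:
    """Return (document id, partial doc) pairs for hits that need an update."""
    updates = []
    for hit in hits:
        properties = hit["_source"].get("properties", {})
        doc = {}
        for old_field, new_field in FIELD_MAP:
            # Same rule as the script: skip if the target name is already
            # present in the properties, or if there is no detected value.
            if new_field in properties or old_field not in properties:
                continue
            doc[new_field] = properties[old_field]
        if doc:
            updates.append((hit["_id"], doc))
    return updates


sample = [
    {"_id": "a", "_source": {"properties": {"detectedLanguage": ["eng"]}}},
    {"_id": "b", "_source": {"properties": {"fileName": ["empty.txt"]}}},
]
print(plan_updates(sample))  # [('a', {'languages': ['eng']})]
```

Each returned pair could then be passed to client.update(index=INDEX, id=id_, doc=doc) as in the script, once the plan looks right.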