datahub: DataHub is unable to search/list more than 10k datasets

Describe the bug DataHub is unable to search or list more than 10k datasets.

To Reproduce Steps to reproduce the behavior:

  1. Have a DataHub instance with more than 10k datasets.
  2. Run a search query in the UI, e.g. * or snowflake (if there are more than 10k snowflake datasets). Only the first 10k results are displayed.
  3. The same happens for a GraphQL search query: the searchResult total returned is 10000 even though more than 10000 datasets exist.
query {
  search(input: {
    type: DATASET,
    query: "*",
    count: 500,
    start: 0,
  }) {
    total
    searchResults {
      entity {
        ... on Dataset {
          urn
        }
      }
    }
  }
}

Expected behavior DataHub should be able to list / search for more than 10k datasets.

Screenshots Unable to post.

Desktop (please complete the following information):

  • OS: Mac
  • Browser: Chrome
  • Version: 99.0.4844.84 (Official Build) (x86_64)

Additional context We are using the search query to list all the datasets we have, but due to this limitation, we are unable to list all the datasets.

Use Case We have a scheduled job that generates a report of all datasets. The job calls the GraphQL Search API to list all of the datasets and the relevant information we need. Our requirement is to catalog all of the datasets along with relevant information such as field tags / terms.

Problem Our DataHub instance has more than 10k datasets, so there is no way to pull all of the datasets via an API.

We tried using the Search API, but since we have more than 10k datasets, it does not work. To clarify, we page through 500 datasets per call, but searchResults only lets you retrieve the first 10k results in total, even with pagination.

So, for example, if there are 25k datasets, the searchResults total will still be 10000 and there’s no way to get the remaining 15k datasets.
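To make the arithmetic above concrete, here is a minimal sketch. It assumes a hard cap of 10000 reachable results, as observed in this issue; the function name and its parameters are illustrative only and are not part of any DataHub API.

```python
import math


def reachable_pages(total_datasets: int, page_size: int, result_cap: int = 10_000):
    """Return (pages_reachable, datasets_reachable, datasets_missed).

    result_cap is the limit observed in this issue, not a documented constant.
    """
    reachable = min(total_datasets, result_cap)
    pages = math.ceil(reachable / page_size)
    return pages, reachable, total_datasets - reachable


# The scenario from the issue: 25k datasets, 500 per call.
pages, reachable, missed = reachable_pages(25_000, 500)
print(pages, reachable, missed)  # 20 10000 15000
```

With 500 results per call, paging stops being useful after 20 calls, and the remaining 15k datasets are never returned.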

Potential Solutions Use another DataHub API that returns all of the datasets, if one exists, rather than a “search” API.

Or, if DataHub had an API to list all the URNs, we could then query each dataset individually (https://datahubproject.io/docs/graphql/queries#dataset)

Or, staying with the search API, we could segment the calls by providing different “filters” for the search term so that each call returns fewer than 10k results. This is not as reliable because:

  1. May not ensure coverage of all datasets
  2. Requires manual updates to the search filters as the number of datasets increases
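The segmentation idea can be sketched as follows. This is only an illustration under stated assumptions: `count_for` is a hypothetical callable that returns the search total for one segment (for example, one platform filter), and the segment names are made up. It detects segments that have grown past the cap, which partly addresses the second concern, but it does nothing to guarantee coverage of every dataset.

```python
from typing import Callable, Dict, List

RESULT_CAP = 10_000  # cap observed in this issue


def plan_segments(
    segments: List[str],
    count_for: Callable[[str], int],
    cap: int = RESULT_CAP,
) -> Dict[str, List[str]]:
    """Sort segments into those safe to page through and those that hit the cap."""
    plan: Dict[str, List[str]] = {"ok": [], "too_large": []}
    for segment in segments:
        total = count_for(segment)
        # A reported total at or above the cap may be truncated, so treat it as unsafe.
        plan["too_large" if total >= cap else "ok"].append(segment)
    return plan


# Hypothetical per-platform totals, for illustration only.
fake_totals = {"snowflake": 10_000, "hive": 4_200, "kafka": 800}
print(plan_segments(list(fake_totals), fake_totals.__getitem__))
# {'ok': ['hive', 'kafka'], 'too_large': ['snowflake']}
```

Any segment listed under `too_large` would need to be split further with a narrower filter before paging through it.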

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (10 by maintainers)

Most upvoted comments

Try using this REST endpoint, which goes through all entities in MySQL:

curl --location --request POST 'http://localhost:8080/entities?action=listUrns' \
--header 'X-RestLi-Protocol-Version: 2.0.0' \
--header 'Content-Type: application/json' \
--data-raw '{
    "entity": "dataset",
    "start": 0,
    "count": 10
}'

Note that we do not have any guarantees on the latency of this endpoint. We have seen it take on the order of minutes when there are more than a million datasets.
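A paging loop over this endpoint might look like the sketch below. The URL, headers, and request body are taken from the curl example; the response shape (`{"value": {"entities": [...]}}`) is an assumption that should be checked against your DataHub version, and `fetch_page` and `iter_urns` are illustrative names, not DataHub functions.

```python
import json
import urllib.request
from typing import Callable, Iterator, List

GMS_URL = "http://localhost:8080/entities?action=listUrns"  # from the curl example


def fetch_page(entity: str, start: int, count: int) -> dict:
    """POST one listUrns request and return the decoded JSON body."""
    body = json.dumps({"entity": entity, "start": start, "count": count}).encode()
    request = urllib.request.Request(
        GMS_URL,
        data=body,
        method="POST",
        headers={
            "X-RestLi-Protocol-Version": "2.0.0",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_urns(
    entity: str,
    page_size: int = 500,
    fetch: Callable[[str, int, int], dict] = fetch_page,
) -> Iterator[str]:
    """Yield every URN for the entity type, one page at a time."""
    start = 0
    while True:
        page = fetch(entity, start, page_size)
        # ASSUMPTION: URNs are returned under value.entities.
        urns: List[str] = page.get("value", {}).get("entities", [])
        if not urns:
            return
        yield from urns
        start += len(urns)
```

Calling `list(iter_urns("dataset"))` against a running instance would collect all dataset URNs, which could then be queried individually. Given the latency note above, a larger `page_size` means fewer round trips but slower individual calls.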