mlflow: [BUG] No index on foreign keys, in postgres store

MLflow Roadmap Item

This is an MLflow Roadmap item that has been prioritized by the MLflow maintainers. We’re seeking help with the implementation of roadmap items tagged with the help wanted label.

For requirements clarifications and implementation questions, or to request a PR review, please tag @WeichenXu123 in your communications related to this issue.

System information

  • Linux Ubuntu 18.04
  • MLflow installed from pip
  • MLFlow version 1.11:
  • Python version 3.8.3

Describe the problem

This might not be true for other dialects, but Postgres doesn’t automatically index on FKs. It makes something as simple as querying all run details extremely slow, as the number of experiments / runs (the UI only allows to fetch results 100 at time).

Code to reproduce issue

The following has about 10k rows but essentially times out.

SELECT * 
FROM runs
JOIN params on params.run_uuid = runs.run_uuid
JOIN metrics on metrics.run_uuid = params.run_uuid
WHERE runs.experiment_id = 30

Other info / logs

Query to get all the missing FKs (credit)

SELECT c.conrelid::regclass AS "table",
       /* list of key column names in order */
       string_agg(a.attname, ',' ORDER BY x.n) AS columns,
       pg_catalog.pg_size_pretty(
          pg_catalog.pg_relation_size(c.conrelid)
       ) AS size,
       c.conname AS constraint,
       c.confrelid::regclass AS referenced_table
FROM pg_catalog.pg_constraint c
   /* enumerated key column numbers per foreign key */
   CROSS JOIN LATERAL
      unnest(c.conkey) WITH ORDINALITY AS x(attnum, n)
   /* name for each key column */
   JOIN pg_catalog.pg_attribute a
      ON a.attnum = x.attnum
         AND a.attrelid = c.conrelid
WHERE NOT EXISTS
        /* is there a matching index for the constraint? */
        (SELECT 1 FROM pg_catalog.pg_index i
         WHERE i.indrelid = c.conrelid
           /* the first index columns must be the same as the
              key columns, but order doesn't matter */
           AND (i.indkey::smallint[])[0:cardinality(c.conkey)-1]
               OPERATOR(pg_catalog.@>) c.conkey)
  AND c.contype = 'f'
GROUP BY c.conrelid, c.conname, c.confrelid
ORDER BY pg_catalog.pg_relation_size(c.conrelid) DESC;

What component(s), interfaces, languages, and integrations does this bug affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/server-infra: MLflow server, JavaScript dev server
  • [ x] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow’s components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 24 (11 by maintainers)

Most upvoted comments

@mberr Is it possible to benchmark this query in your environment?

SELECT *
FROM metrics
WHERE metrics.run_uuid = <run id> AND metrics.key = <key>
-- Please replace <run_id> and <key>

This is a query executed when the get_metric_history API is called.

There is not much difference here, since the primary key’s index is used (cf. details below).

b-tree index hash index execution time
yes yes 702.850 ms
no yes 702.080 ms
yes no 743.145 ms
no no 702.850 ms
Details

both

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using metric_pk on metrics  (cost=0.70..345344.60 rows=1165 width=87) (actual time=59.415..702.164 rows=2000 loops=1)
   Index Cond: ((key = 'eval.validation.left.num_rank'::text) AND (run_uuid = '8f9c932d81164360969ed9b724601eb1'::text))
   Heap Fetches: 0
 Planning Time: 31.427 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.394 ms, Inlining 0.000 ms, Optimization 0.146 ms, Emission 1.969 ms, Total 2.509 ms
 Execution Time: 702.850 ms
(9 rows)

btree

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using metric_pk on metrics  (cost=0.70..345344.60 rows=1165 width=87) (actual time=67.880..742.411 rows=2000 loops=1)
   Index Cond: ((key = 'eval.validation.left.num_rank'::text) AND (run_uuid = '8f9c932d81164360969ed9b724601eb1'::text))
   Heap Fetches: 0
 Planning Time: 3.881 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.436 ms, Inlining 0.000 ms, Optimization 0.172 ms, Emission 2.229 ms, Total 2.836 ms
 Execution Time: 743.145 ms
(9 rows)

hash tree

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using metric_pk on metrics  (cost=0.70..345344.60 rows=1165 width=87) (actual time=55.568..701.436 rows=2000 loops=1)
   Index Cond: ((key = 'eval.validation.left.num_rank'::text) AND (run_uuid = '8f9c932d81164360969ed9b724601eb1'::text))
   Heap Fetches: 0
 Planning Time: 0.501 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.363 ms, Inlining 0.000 ms, Optimization 0.145 ms, Emission 1.876 ms, Total 2.384 ms
 Execution Time: 702.080 ms
(9 rows)

none

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using metric_pk on metrics  (cost=0.70..345344.60 rows=1165 width=87) (actual time=59.415..702.164 rows=2000 loops=1)
   Index Cond: ((key = 'eval.validation.left.num_rank'::text) AND (run_uuid = '8f9c932d81164360969ed9b724601eb1'::text))
   Heap Fetches: 0
 Planning Time: 31.427 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.394 ms, Inlining 0.000 ms, Optimization 0.146 ms, Emission 1.969 ms, Total 2.509 ms
 Execution Time: 702.850 ms
(9 rows)

table schema

mlflow_db=# \d metrics
                         Table "public.metrics"
  Column   |          Type          | Collation | Nullable |   Default   
-----------+------------------------+-----------+----------+-------------
 key       | character varying(250) |           | not null | 
 value     | double precision       |           | not null | 
 timestamp | bigint                 |           | not null | 
 run_uuid  | character varying(32)  |           | not null | 
 step      | bigint                 |           | not null | '0'::bigint
 is_nan    | boolean                |           | not null | false
Indexes:
    "metric_pk" PRIMARY KEY, btree (key, "timestamp", step, run_uuid, value, is_nan)
    "metrics_run_uuid_idx" btree (run_uuid)
    "metrics_run_uuid_idx1" hash (run_uuid)
Foreign-key constraints:
    "metrics_run_uuid_fkey" FOREIGN KEY (run_uuid) REFERENCES runs(run_uuid)