NVTabular: [BUG] Simplify Categorify encoding for better standardization and easier reverse mapping
Context
Categorify encodes categorical columns into contiguous integer ids. It offers functionality for dealing with high-cardinality features, such as simple hashing and frequency capping / frequency hashing. When Categorify runs, it creates a mapping between original and encoded values.
Problem
There are currently some issues in the encoding of Categorify:
- Collision of special encodings - Some special values – Nulls, Out-Of-Vocabulary and infrequent items (when using frequency capping) – are all encoded to id 0, so it is not possible to differentiate between them for modeling purposes.
- Inconsistent mapping in unique values parquet - When the NVTabular workflow fits, the encoding mapping is persisted to a parquet file with the unique values. But the mapping in the parquet file does not match the actual mapping performed by Categorify (e.g. because it does not account for `start_index` or `max_size`, see #1736). There are more examples of these mismatches in this doc (Nvidia internal). It is important for Merlin that mapping the encoded ids back to the original values is straightforward, using just a mapping table, without needing to know the complex logic inside Categorify that covers the available hashing options. In RecSys, reverse mapping is critical for the item id, as models predict encoded ids and those need to be presented to the user as original ids.
Proposed solution
This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal). Check it for more details.
Save a consistent mapping table to parquet
- As discussed above, ensure that when Categorify saves the unique values parquet, the mapping matches exactly the encoded item ids, so that it is easy to use that parquet mapping to reverse back to the original ids (a minimal lookup sketch follows below).
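For illustration, a minimal sketch (pandas only) of the lookup this would enable. The path `categories/unique.item_id.parquet` and the column name `item_id` are assumptions; the key assumption is that the row position in the persisted parquet equals the encoded id.

```python
import pandas as pd

def decode_ids(encoded_ids, mapping_path="categories/unique.item_id.parquet",
               column="item_id"):
    """Map encoded ids back to original values using only the persisted parquet.

    Assumes (per this proposal) that row position == encoded id, so a plain
    positional lookup suffices and no knowledge of Categorify internals is needed.
    """
    mapping = pd.read_parquet(mapping_path)[column]
    return mapping.iloc[list(encoded_ids)].tolist()

# e.g. decode_ids([3, 5, 4]) would return the original item ids behind the
# model's predicted encoded ids.
```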
Fixing the first 3 values of the encoding table: <PADDING>, <OOV>, <NULL>
- Eliminating the `start_index` argument. It was created to let users reserve a range of control ids (the first N encoded ids) so that no original value is mapped to them by Categorify; the user could then do some post-processing after Categorify to assign their own values in that range. The only use case we have found so far is reserving 0 for padding sequence features, which is common in sequential recommendation. So we remove `start_index` and reserve just id 0 for padding (or for whatever other purpose the user might have). That simplifies the logic within Categorify considerably, since `start_index` shifts all other values.
- Mapping Out-of-Vocabulary (OOV) values during `workflow.transform()` always to id 1.
- Eliminating `na_sentinel`, as nulls will always be mapped to a single id (2). A toy illustration of these fixed ids follows this list.
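As a toy, plain-Python illustration of the fixed ids proposed above (this is not a call into NVTabular; the vocabulary below is made up):

```python
PAD_ID, OOV_ID, NULL_ID = 0, 1, 2

# Frequent values start at id 3, ordered by frequency (toy vocabulary).
vocab = {"banana": 3, "apple": 4, "cherry": 5}

def encode(value):
    if value is None:                # nulls always map to the fixed id 2
        return NULL_ID
    return vocab.get(value, OOV_ID)  # unseen values always map to id 1

assert encode(None) == NULL_ID
assert encode("unseen-item") == OOV_ID
assert encode("banana") == 3
# Id 0 is never produced by encoding; it stays free for sequence padding.
```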
Hashing / Frequency hashing
- Rename `max_size` (used for frequency capping and frequency hashing) to `top_frequent_values`, so that users don't assume the value is the maximum cardinality (the final cardinality will also include the special ids and the `num_buckets`). A rough cardinality illustration follows below.
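A rough, hypothetical sketch of why the rename matters (the helper below is not an NVTabular API): the final cardinality also counts the three reserved ids and any infrequent/hash buckets, so the old name `max_size` would understate the real embedding-table size.

```python
def proposed_cardinality(top_frequent_values, num_infrequent_buckets=1):
    """Hypothetical helper: total number of encoded ids under this proposal."""
    reserved = 3  # <PADDING>, <OOV>, <NULL>
    return reserved + num_infrequent_buckets + top_frequent_values

# A user might read max_size=1000 as "1000 ids total", but under the proposed
# layout e.g. proposed_cardinality(1000, num_infrequent_buckets=10) == 1013.
```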
Pre-defined vocabulary
- When the user provides the `vocabs` argument with a pre-defined mapping (from original values to encoded ids), our encoding standard does not apply and we just use that mapping. We should reserve an extra position at the end of their mapping table to assign values that are potentially not found in the `vocabs` mapping (including nulls). See the sketch below.
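A small sketch of the proposed extra slot for user-provided vocabularies; the vocabulary below is made up, and the trailing "unknown" position reflects the behaviour proposed here, not what Categorify currently does.

```python
import pandas as pd

# User-provided mapping: row position is the encoded id the user expects.
user_vocab = pd.Series(["S", "M", "L", "XL"])  # ids 0..3

# Proposed: reserve one extra id at the end of the user's table for values
# (including nulls) that are not found in the provided vocabulary.
unknown_id = len(user_vocab)                   # id 4 in this toy example

def encode_with_vocab(value):
    matches = user_vocab[user_vocab == value]
    return int(matches.index[0]) if len(matches) else unknown_id

assert encode_with_vocab("M") == 1
assert encode_with_vocab("XXL") == unknown_id
assert encode_with_vocab(None) == unknown_id
```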
Proposed encoding strategy
| Original value | Encoded id |
|---|---|
| Fixed | |
| <PADDING> | 0 |
| <OOV> | 1 |
| <NULL> | 2 |
| ========= When not using hashing ========= | |
| 1st most frequent value | 3 |
| 2nd most frequent value | 4 |
| 3rd most frequent value | 5 |
| … | … |
| ========= When using simple hashing - Categorify(num_buckets=3) ========= | |
| <HASHED> Hash bucket #1 | 3 |
| <HASHED> Hash bucket #2 | 4 |
| <HASHED> Hash bucket #3 | 5 |
| ========= When using frequency capping based on threshold (freq_threshold) or number of top-k values (max_size -> top_frequent_values) ========= | |
| <HASHED> Infrequent bucket | 3 |
| 1st most frequent value | 4 |
| 2nd most frequent value | 5 |
| 3rd most frequent value | 6 |
| … | … |
| ========= When using frequency hashing - (num_buckets, max_size -> top_frequent_values) ========= | |
| <HASHED> Infrequent hash bucket #1 | 3 |
| <HASHED> Infrequent hash bucket #2 | 4 |
| <HASHED> Infrequent hash bucket #3 | 5 |
| … | … |
| 1st most frequent value | n-4 |
| 2nd most frequent value | n-3 |
| 3rd most frequent value | n-2 |
| 4th most frequent value | n-1 |
| 5th most frequent value | n |
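The table above can be summarised by a small, hypothetical helper that classifies an encoded id given the number of hash/infrequent buckets in use (0 when neither hashing nor capping is enabled, 1 for plain frequency capping, `num_buckets` otherwise). This illustrates the proposed layout and is not NVTabular code.

```python
def describe_encoded_id(encoded_id, num_hash_buckets=0):
    """Classify an id under the proposed layout (illustration only)."""
    if encoded_id == 0:
        return "<PADDING>"
    if encoded_id == 1:
        return "<OOV>"
    if encoded_id == 2:
        return "<NULL>"
    if encoded_id < 3 + num_hash_buckets:
        return f"<HASHED> bucket #{encoded_id - 2}"
    return f"frequent value #{encoded_id - 2 - num_hash_buckets}"

# Frequency hashing with 3 buckets: ids 3-5 are hash buckets, id 6 is the
# most frequent "real" value.
assert describe_encoded_id(4, num_hash_buckets=3) == "<HASHED> bucket #2"
assert describe_encoded_id(6, num_hash_buckets=3) == "frequent value #1"
```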
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 51 (29 by maintainers)
Yes, I agree that it will be best to provide a simple reverse mapping utility.

Okay, given the useful comments and information from both @gabrielspmoreira and @karlhigley, I'm thinking the best solution here is to keep the `encodings.x.parquet` standard as simple as possible. That is:
- Use a `RangeIndex` to specify the encoded integer values.
- The `encodings.x.parquet` data must include an `x` column (where `x` is the name of the column or column group), and an `x_size` column.
- The `x` column shall use a null value for any category that does not correspond to a direct unique-value mapping.
- The `x_size` column should include the observed-value counts collected during the original `Categorify.fit` operation.
- The `0` encoding will be reserved for padding.
- The `1` encoding will be reserved for null values.
- Encodings `2 : 2 + num_buckets` will be used for OOV, infrequent values, or hash buckets. Note that the default and minimum value for `num_buckets` shall be `1`.
- Encodings `2 + num_buckets` and greater will be used for literal unique-value encodings.

As far as I understand, this standard makes the encoding rules and results pretty clear. Encodings `0` and `1` will always mean the same thing (padding and nulls), and so the value of `num_buckets` can be easily inferred from the number of null values in the `x` column (`num_buckets = null_count - 2` in all cases). We can also infer the number of unique-value encodings from the number of non-null values in the `x` column. This all means that a simple Merlin utility could easily use this file to provide a reverse mapping (as long as they don't expect us to specify a list of all possible values observed for a specific "bucket" encoding - where we would probably want the utility to return either "infreq" or "hashed", depending on the value of `num_buckets`).
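A minimal sketch (pandas only; the file and column names are assumptions for illustration) of the reverse-mapping utility described in the comment above, following the proposed `encodings.x.parquet` layout: the `RangeIndex` position is the encoded id, the `x` column is null for the padding/null/bucket rows, and `num_buckets` is inferred as `null_count - 2`.

```python
import pandas as pd

def reverse_map(encoded_ids, path="encodings.item_id.parquet", column="item_id"):
    """Map encoded ids back to original values or a bucket label."""
    values = pd.read_parquet(path)[column]
    num_buckets = int(values.isna().sum()) - 2   # rows 0 and 1 are also null
    bucket_label = "hashed" if num_buckets > 1 else "infreq"

    decoded = []
    for i in encoded_ids:
        if i == 0:
            decoded.append("<padding>")
        elif i == 1:
            decoded.append("<null>")
        elif i < 2 + num_buckets:
            decoded.append(bucket_label)     # no literal value exists for buckets
        else:
            decoded.append(values.iloc[i])   # literal unique-value encoding
    return decoded
```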
Are you asking why I have `<NA>` in the OOV row? If so, this is because the "value" for the OOV row would need to be null. I don't think we should arbitrarily choose a string or numerical value to represent OOV.

Yes, it makes sense to reserve index `2` for OOV, and that is what I was attempting to show (`1` is for nulls, and `2` is for OOV - however, it doesn't make sense to include a literal "value" for either of these rows).