NVTabular: [BUG] Simplify Categorify encoding for better standardization and easier reverse mapping

Context

Categorify encodes categorical columns into contiguous integer ids. It offers functionality for dealing with high-cardinality features, such as simple hashing and frequency capping / frequency hashing. When Categorify runs, it creates a mapping between the original and encoded values.
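
For reference, a minimal sketch of how Categorify is typically applied (paths, column names, and the freq_threshold value are illustrative):

```python
import nvtabular as nvt

# Encode two categorical columns; values seen fewer than 5 times are
# treated as infrequent (frequency capping).
cat_features = ["user_id", "item_id"] >> nvt.ops.Categorify(freq_threshold=5)

workflow = nvt.Workflow(cat_features)
workflow.fit(nvt.Dataset("train/*.parquet"))
encoded = workflow.transform(nvt.Dataset("train/*.parquet")).to_ddf().compute()
```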

Problem

There are currently some issues in the encoding of Categorify:

  • Collision of special encodings - Some special values – Nulls, Out-Of-Vocabulary and infrequent items (when using frequency capping) – are all encoded to id 0, so it is not possible to differentiate between them for modeling purposes.
  • Inconsistent mapping in unique values parquet - When the NVTabular workflow is fit, the encoding mapping is persisted to a parquet file with the unique values. However, the mapping in that parquet file does not always match the actual mapping performed by Categorify (e.g. because it does not account for start_index or max_size, #1736). There are more examples of these mismatches in this doc (Nvidia internal). For Merlin, it is important to make it straightforward to map encoded ids back to the original values using just a mapping table, without needing to be aware of the complex logic inside Categorify that covers the available hashing options. In RecSys, reverse mapping is critical for the item id: models predict encoded ids, which need to be presented to the user as original ids (see the sketch after this list).
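
For illustration, this is roughly the reverse mapping users attempt today, treating the row position in the unique-values parquet as the encoded id (the path follows NVTabular's usual categories/ output layout and is an assumption here):

```python
import pandas as pd

# Load the unique-values mapping written during workflow.fit()
unique_items = pd.read_parquet("categories/unique.item_id.parquet")

encoded_ids = [3, 7, 42]                               # e.g. ids predicted by a model
original_ids = unique_items["item_id"].iloc[encoded_ids]

# This breaks whenever the row position does not match the actual encoding,
# e.g. when start_index or max_size shift the encoded ids (#1736).
```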

Proposed solution

This task proposes some simplifications of the Categorify encoding and is based on the discussions from this doc (Nvidia internal). Check it for more details.

Save a consistent mapping table to parquet

  • As discussed above, ensure that when Categorify saves the unique values parquet, the mapping exactly matches the encoded item ids, so that it is easy to use that parquet mapping to reverse encoded ids back to the original ids.

Fixing the first 3 values of the encoding table: <PADDING>, <OOV>, <NULL>

  • Eliminating the start_index argument. It was created to let users reserve a range of control ids (the first N encoded ids) to which Categorify maps no original value, so that the user could do some post-processing after Categorify to assign values in that range. The only use case we have found so far is reserving id 0 for padding sequence features, which is common in sequential recommendation. So we remove start_index and reserve only id 0 for padding (or for whatever other purpose the user chooses). That simplifies the logic within Categorify considerably, since start_index shifts all other values.
  • Always mapping Out-of-Vocabulary (OOV) values encountered during workflow.transform() to id 1.
  • Eliminating na_sentinel, as nulls will always be mapped to a single id (2). The sketch after this list illustrates the proposed fixed ids.
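
A toy illustration of the proposed fixed ids (this shows the intended behavior of the proposal, not the current Categorify output; the data is made up):

```python
import pandas as pd

# Training data: "a" is the most frequent value, then "b", then "c".
train = pd.DataFrame({"item_id": ["a", "a", "a", "b", "b", "c"]})

# After fitting, the proposed encoding would be:
#   <PADDING> -> 0, <OOV> -> 1, <NULL> -> 2, "a" -> 3, "b" -> 4, "c" -> 5
valid = pd.DataFrame({"item_id": ["b", "zzz", None]})   # "zzz" was never seen

# Expected transform() output under the proposal:
#   "b" -> 4, OOV "zzz" -> 1, null -> 2
expected = [4, 1, 2]
```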

Hashing / Frequency hashing

  • Rename max_size (used for frequency capping and frequency hashing) to top_frequent_values, so that users don’t assume that the value is the maximum cardinality (the actual cardinality also includes the special ids and the num_buckets hash buckets); see the sketch below.
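
A hedged sketch of the options this refers to, using the current argument names (the rename to top_frequent_values is only proposed here):

```python
import nvtabular as nvt

# Frequency capping: keep at most 100k values per column; everything else
# falls into a single infrequent bucket.
capped = ["item_id"] >> nvt.ops.Categorify(max_size=100_000)

# Frequency hashing: keep the most frequent values and hash the remaining
# (infrequent) values into 10 buckets.
hashed = ["item_id"] >> nvt.ops.Categorify(max_size=100_000, num_buckets=10)
```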

Pre-defined vocabulary

  • When the user provides the vocabs argument with a pre-defined mapping (from original values to encoded ids), our encoding standard does not apply and we just use that mapping. We should reserve an extra position at the end of their mapping table to assign values (including nulls) that are not found in the vocabs mapping; see the sketch below.
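
A hedged sketch of the vocabs usage this refers to (the exact accepted types may differ; a pandas Series per column is assumed here):

```python
import pandas as pd
import nvtabular as nvt

# Pre-defined vocabulary: the row position defines the encoded id.
item_vocab = pd.Series(["item_a", "item_b", "item_c"])
cats = ["item_id"] >> nvt.ops.Categorify(vocabs={"item_id": item_vocab})

# Under the proposal, one extra position would be reserved at the end of this
# mapping for values (including nulls) that are not found in the vocabulary.
```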

Proposed encoding strategy

| Original value | Encoded id |
| --- | --- |
| **Fixed** | |
| <PADDING> | 0 |
| <OOV> | 1 |
| <NULL> | 2 |
| **When not using hashing** | |
| 1st most frequent value | 3 |
| 2nd most frequent value | 4 |
| 3rd most frequent value | 5 |
| **When using simple hashing - Categorify(num_buckets=3)** | |
| <HASHED> Hash bucket #1 | 3 |
| <HASHED> Hash bucket #2 | 4 |
| <HASHED> Hash bucket #3 | 5 |
| **When using frequency capping based on a threshold (freq_threshold) or number of top-k values (max_size -> top_frequent_values)** | |
| <HASHED> Infrequent bucket | 3 |
| 1st most frequent value | 4 |
| 2nd most frequent value | 5 |
| 3rd most frequent value | 6 |
| **When using frequency hashing - (num_buckets, max_size -> top_frequent_values)** | |
| <HASHED> Infrequent hash bucket #1 | 3 |
| <HASHED> Infrequent hash bucket #2 | 4 |
| <HASHED> Infrequent hash bucket #3 | 5 |
| 1st most frequent value | n-4 |
| 2nd most frequent value | n-3 |
| 3rd most frequent value | n-2 |
| 4th most frequent value | n-1 |
| 5th most frequent value | n |
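
To make the table above concrete, here is a small sketch (a hypothetical helper, not an NVTabular API) of how the total cardinality would follow from the proposed scheme:

```python
def proposed_cardinality(num_frequent_values, num_buckets=0):
    """Embedding-table size under the proposed scheme: 3 fixed ids
    (<PADDING>=0, <OOV>=1, <NULL>=2), then `num_buckets` hash/infrequent
    buckets, then the explicitly kept frequent values."""
    return 3 + num_buckets + num_frequent_values

# No hashing, 3 unique values -> ids 0..5
assert proposed_cardinality(3) == 6

# Frequency capping (single infrequent bucket), 3 frequent values -> ids 0..6
assert proposed_cardinality(3, num_buckets=1) == 7

# Frequency hashing with 3 buckets and 5 frequent values -> ids 0..10,
# so the 5th most frequent value gets the last id n = 10.
assert proposed_cardinality(5, num_buckets=3) == 11
```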

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 51 (29 by maintainers)

Most upvoted comments

I’m just saying that we (me, in all likelihood) need to cut users off at the pass by providing reverse mapping functionality before they try to implement reverse mapping themselves again.

Yes, I agree that it will be best to provide a simple reverse mapping utility.

Okay, given the useful comments and information from both @gabrielspmoreira and @karlhigley, I’m thinking the best solution here is to keep the encodings.x.parquet standard as simple as possible. That is,

  1. The parquet files will use a simple RangeIndex to specify the encoded integer values.
  2. The encodings.x.parquet data must include an x column (where x is the name of the column or column group), and an x_size column:
    • The x column shall use a null value for any category that does not correspond to a direct unique-value mapping.
    • The x_size column should include the observed-value counts collected during the original Categorify.fit operation.
  3. The 0 encoding will be reserved for padding
  4. The 1 encoding will be reserved for null values
  5. Encodings 2 : 2 + num_buckets will be used for OOV, infrequent values, or hash buckets. Note that the default and minimum value for num_buckets shall be 1.
  6. Encodings 2 + num_buckets and greater will be used for literal unique-value encodings.

As far as I understand, this standard makes the encoding rules and results pretty clear. Encodings 0 and 1 will always mean the same thing (padding and nulls), and so the value of num_buckets can be easily inferred from the number of null values in the x column (num_buckets = null_count - 2 in all cases). We can also infer the number of unique-value encodings from the number of non-null values in the x column. This all means that a simple Merlin utility could easily use this file to provide a reverse mapping (as long as they don’t expect us to specify a list of all possible values observed for a specific “bucket” encoding - where we would probably want the utility to return either “infreq” or “hashed”, depending on the value of num_buckets).
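
A hedged sketch of the kind of reverse-mapping utility this describes, against the proposed encodings.x.parquet layout (the file layout and column names follow the comment above; none of this is an existing NVTabular API, and the placeholder strings are arbitrary):

```python
import pandas as pd

def reverse_map(encoded_ids, encodings_path, col):
    mapping = pd.read_parquet(encodings_path)        # RangeIndex == encoded id
    values = mapping[col]
    num_buckets = values.isnull().sum() - 2          # nulls: padding, null id, buckets

    def decode(i):
        if i == 0:
            return "<padding>"
        if i == 1:
            return "<null>"
        if 2 <= i < 2 + num_buckets:
            # OOV / infrequent / hashed bucket; original values are not recoverable
            return "<infreq>" if num_buckets == 1 else "<hashed>"
        return values.iloc[i]                        # literal unique-value encoding

    return [decode(i) for i in encoded_ids]
```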

For the default scenario, can’t the item_id be at index 2 (like below)? Why does it read as <NA>?

Are you asking why I have <NA> in the OOV row? If so, this is because the “value” for the OOV row would need to be null. I don’t think we should arbitrarily choose a string or numerical value to represent OOV.

We won’t have any OOVs in training, but in the valid set we can; therefore, when we see encoded categories (e.g. item_ids) with value 2 in the valid set, we’d know these are OOVs.

Yes, it makes sense to reserve index 2 for OOV, and that is what I was attempting to show (1 is for nulls and 2 is for OOV; however, it doesn’t make sense to include a literal “value” for either of these rows).