NVTabular: [BUG] No schema.pbtxt File is being generated from NVTabular workflow
Describe the bug
I am trying to generate a schema.pbtxt file from an NVTabular workflow using the exact same script posted here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1156
import numpy as np
import pandas as pd
import nvtabular as nvt
from pathlib import Path
# NOTE: the Schema import is not shown in the original snippet; the module it
# comes from depends on the installed NVTabular/Merlin version.

NUM_ROWS = 1000
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., NUM_ROWS).astype(np.int32), 1, 50000)
# generate random item interaction features
df = pd.DataFrame(np.random.randint(70000, 80000, NUM_ROWS), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['timestamp/age_days'] = np.random.uniform(0, 1, NUM_ROWS)
df['timestamp/weekday/sin']= np.random.uniform(0, 1, NUM_ROWS)
# generate day mapping for each session
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)
# Categorify categorical features
categ_feats = ['session_id', 'item_id', 'category'] >> nvt.ops.Categorify(start_index=1)
# Define Groupby Workflow
groupby_feats = categ_feats + ['day', 'timestamp/age_days', 'timestamp/weekday/sin']
# Groups interaction features by session and sorted by timestamp
groupby_features = groupby_feats >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={
        "item_id": ["list", "count"],
        "category": ["list"],
        "day": ["first"],
        "timestamp/age_days": ["list"],
        "timestamp/weekday/sin": ["list"],
    },
    name_sep="-")
# Select and truncate the sequential features
sequence_features_truncated = (
    groupby_features['category-list', 'item_id-list',
                     'timestamp/age_days-list', 'timestamp/weekday/sin-list']
    >> nvt.ops.ListSlice(0, 20)
    >> nvt.ops.Rename(postfix='_trim')
)
# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = groupby_features['item_id-count', 'day-first', 'session_id'] + sequence_features_truncated
filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
workflow = nvt.Workflow(filtered_sessions)
dataset = nvt.Dataset(df, cpu=False)
# Generating statistics for the features
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
    './schema',
    out_files_per_proc=1,
)
schema_path = Path('./schema')
proto_schema = Schema.read_protobuf(schema_path / "schema.pbtxt")
Expected behavior
When I check the contents of the ./schema folder, the only files present are:
_file_list.txt _metadata _metadata.json part_0.parquet
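In the meantime, the schema information is still held on the fitted workflow object, so it can be inspected even though no schema.pbtxt file is written. Below is a minimal sketch, assuming a recent NVTabular where Workflow exposes an output_schema property after fit(); the exact attribute and its type may differ between versions.

# Minimal sketch (assumption: Workflow exposes `output_schema` after fit(),
# as in recent NVTabular releases; adjust to your installed version).
workflow.fit(dataset)
print(workflow.output_schema)               # columns with dtypes and tags
print(workflow.output_schema.column_names)  # just the column names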
Hey @karlhigley, sure, I got "no module named nvtabular.io.dataset" when running the training cells in the 2nd getting-started notebook until I downgraded to nvtabular==0.10.0.
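In case it helps anyone else hitting the same import error: the Dataset module path moved between releases, so a version-tolerant import is one way to cope without pinning. This is only a sketch, assuming the newer packages re-export Dataset from merlin.io; adjust to whatever your installed version actually provides.

# Sketch of a version-tolerant import (assumption: older NVTabular exposed
# nvtabular.io.dataset.Dataset, newer Merlin packages expose merlin.io.Dataset).
try:
    from nvtabular.io.dataset import Dataset  # older NVTabular layout
except ModuleNotFoundError:
    from merlin.io import Dataset             # newer Merlin-core layout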
Hey @jperez999 and @rnyak, sorry for the late response. I'm happy to report that I got it working. The most recent issue was running out of memory, which was solved by increasing the size of the cluster. We also had an issue with NVTabular/T4R, which was solved by downgrading NVTabular. I'm able to run the tutorial cells now.
Let me know if you’d like me to give more info on anything.