NVTabular: [BUG] No schema.pbtxt File is being generated from NVTabular workflow
Describe the bug
I am trying to generate a schema.pbtxt file from an NVTabular workflow using the exact same script posted here: https://github.com/NVIDIA-Merlin/NVTabular/issues/1156
import numpy as np
import pandas as pd
import nvtabular as nvt
from pathlib import Path
# NOTE: the Schema import is not shown in the original snippet; the module it
# comes from depends on the installed NVTabular/Merlin version.

NUM_ROWS = 1000
long_tailed_item_distribution = np.clip(np.random.lognormal(3., 1., NUM_ROWS).astype(np.int32), 1, 50000)
# generate random item interaction features
df = pd.DataFrame(np.random.randint(70000, 80000, NUM_ROWS), columns=['session_id'])
df['item_id'] = long_tailed_item_distribution
# generate category mapping for each item-id
df['category'] = pd.cut(df['item_id'], bins=334, labels=np.arange(1, 335)).astype(np.int32)
df['timestamp/age_days'] = np.random.uniform(0, 1, NUM_ROWS)
df['timestamp/weekday/sin']= np.random.uniform(0, 1, NUM_ROWS)
# generate day mapping for each session
map_day = dict(zip(df.session_id.unique(), np.random.randint(1, 10, size=(df.session_id.nunique()))))
df['day'] = df.session_id.map(map_day)
# Categorify categorical features
categ_feats = ['session_id', 'item_id', 'category'] >> nvt.ops.Categorify(start_index=1)
# Define Groupby Workflow
groupby_feats = categ_feats + ['day', 'timestamp/age_days', 'timestamp/weekday/sin']
# Groups interaction features by session and sorted by timestamp
groupby_features = groupby_feats >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={
        "item_id": ["list", "count"],
        "category": ["list"],
        "day": ["first"],
        "timestamp/age_days": ["list"],
        "timestamp/weekday/sin": ["list"],
    },
    name_sep="-")
# Select and truncate the sequential features
sequence_features_truncated = (
    groupby_features['category-list', 'item_id-list',
                     'timestamp/age_days-list', 'timestamp/weekday/sin-list']
    >> nvt.ops.ListSlice(0, 20)
    >> nvt.ops.Rename(postfix='_trim')
)
# Filter out sessions with length 1 (not valid for next-item prediction training and evaluation)
MINIMUM_SESSION_LENGTH = 2
selected_features = groupby_features['item_id-count', 'day-first', 'session_id'] + sequence_features_truncated
filtered_sessions = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
workflow = nvt.Workflow(filtered_sessions)
dataset = nvt.Dataset(df, cpu=False)
# Generating statistics for the features
workflow.fit(dataset)
workflow.transform(dataset).to_parquet(
    './schema',
    out_files_per_proc=1,
)
schema_path = Path('./schema')
proto_schema = Schema.read_protobuf(schema_path / "schema.pbtxt")
Expected behavior
When I check the contents of the ./schema folder, the only files present are:
_file_list.txt _metadata _metadata.json part_0.parquet
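In the meantime, the schema information is still held on the fitted workflow object, so it can be inspected even though no schema.pbtxt file is written. Below is a minimal sketch, assuming a recent NVTabular where Workflow exposes an output_schema property after fit(); the exact attribute and its type may differ between versions.

# Minimal sketch (assumption: Workflow exposes `output_schema` after fit(),
# as in recent NVTabular releases; adjust to your installed version).
workflow.fit(dataset)
print(workflow.output_schema)               # columns with dtypes and tags
print(workflow.output_schema.column_names)  # just the column names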
Hey @karlhigley, sure, I got "no module named nvtabular.io.dataset" when running the training cells in the 2nd getting-started notebook until I downgraded to nvtabular==0.10.0.
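In case it helps anyone else hitting the same import error: the Dataset module path moved between releases, so a version-tolerant import is one way to cope without pinning. This is only a sketch, assuming the newer packages re-export Dataset from merlin.io; adjust to whatever your installed version actually provides.

# Sketch of a version-tolerant import (assumption: older NVTabular exposed
# nvtabular.io.dataset.Dataset, newer Merlin packages expose merlin.io.Dataset).
try:
    from nvtabular.io.dataset import Dataset  # older NVTabular layout
except ModuleNotFoundError:
    from merlin.io import Dataset             # newer Merlin-core layout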
Hey @jperez999 and @rnyak, sorry for the late response. I'm happy to report that I got it working. The most recent issue was running out of memory, which was solved by increasing the size of the cluster. We also had an issue with NVTabular/T4R, which was solved by downgrading NVTabular. I'm able to run the tutorial cells now.
Let me know if you’d like me to give more info on anything.