sagemaker-python-sdk: sagemaker job failing in transformation step
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
- Framework Version: 1.12.0
- Python Version: Python 3.6.8
- CPU or GPU: CPU
- Python SDK Version: 1.18.13
- Are you using a custom image: No
Describe the problem
We’re using SageMaker to parallelize a TensorFlow job. We create a model using TensorFlow, and training completes successfully. When the job moves on to the batch transform step, it fails with the error “Unable to get response from algorithm.”
Minimal repro / logs
Stack trace:
File "/home/abexecutor/ml-conversion/dcs-analytics/ML_models/sage_maker_job_runner.py", line 160, in _predict
transformer.wait()
File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 135, in wait
self.latest_transform_job.wait()
File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 209, in wait
self.sagemaker_session.wait_for_transform_job(self.job_name)
File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 886, in wait_for_transform_job
self._check_job_status(job, desc, 'TransformJobStatus')
File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 908, in _check_job_status
raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
ValueError: Error for Transform job sagemaker-tensorflow-2019-04-16-20-36-36-284: Failed Reason: AlgorithmError: See job logs for more information
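The exception above only says to see the job logs. As a hedged sketch, the job description can be fetched with boto3's `describe_transform_job` to view the failure reason next to the effective batch settings. The helper name `summarize_transform_failure` is made up for illustration and is not part of the SageMaker SDK; the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def summarize_transform_failure(sm_client, job_name):
    """Return the status, failure reason, and batch settings of a transform job.

    `sm_client` is assumed to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker"). This helper is hypothetical.
    """
    desc = sm_client.describe_transform_job(TransformJobName=job_name)
    return {
        "status": desc["TransformJobStatus"],
        # FailureReason is only present when the job has failed.
        "reason": desc.get("FailureReason", ""),
        "max_concurrent_transforms": desc.get("MaxConcurrentTransforms"),
        "max_payload_mb": desc.get("MaxPayloadInMB"),
        "batch_strategy": desc.get("BatchStrategy"),
    }


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   summary = summarize_transform_failure(
#       boto3.client("sagemaker"),
#       "sagemaker-tensorflow-2019-04-16-20-36-36-284",
#   )
#   print(summary)
```

The container logs themselves are written to CloudWatch, so this summary is a starting point rather than a replacement for reading them.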
Transformation logs:
2019-04-16T20:39:55.259:[sagemaker logs]: MaxConcurrentTransforms=100, MaxPayloadInMB=1, BatchStrategy=SINGLE_RECORD
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.289:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
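When a transform job emits many lines like these, it can help to tally which input objects failed. Below is a minimal sketch, assuming the line layout seen in the logs above (timestamp, `[sagemaker logs]`, S3 key, message). The helper name `failed_inputs` is hypothetical and not part of the SageMaker SDK.

```python
import re
from collections import Counter

# Matches "<timestamp>:[sagemaker logs]: <s3 key>: <message>" anywhere in a
# line, so a stack-trace fragment fused onto the front does not break parsing.
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}):\[sagemaker logs\]: "
    r"(?P<key>\S+): (?P<msg>.+)$"
)

FAILURE_TEXT = "Unable to get response from algorithm"


def failed_inputs(lines):
    """Count 'Unable to get response' log entries per input S3 key."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match and FAILURE_TEXT in match.group("msg"):
            counts[match.group("key")] += 1
    return counts
```

Calling `failed_inputs(open("transform.log"))` on a saved copy of the log (the file name is only an example) returns a `Counter` keyed by input object, which shows whether failures are concentrated in a few files or spread across all of them.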
- Exact command to reproduce:
Tensorflow model settings:
import sagemaker
from sagemaker.tensorflow import TensorFlow

# Train on two instances; ROLE is the SageMaker execution role ARN.
estimator = TensorFlow(entry_point='ML_models/tensorflow_entry_point.py',
                       role=ROLE,
                       framework_version='1.12.0',
                       training_steps=1000,
                       evaluation_steps=100,
                       train_instance_count=2,
                       train_instance_type='ml.m5.xlarge')
estimator.output_path = f's3://JOB_PATH/model/TRAIN_TABLE_NAME'

# Validation channel, sharded across the training instances by S3 key.
validation_prefix = f's3://JOB_PATH/validation/TRAIN_TABLE_NAME'
s3_eval_train = sagemaker.s3_input(
    s3_data=validation_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')

# Training channel, sharded the same way.
input_prefix = f's3://JOB_PATH/train/TRAIN_TABLE_NAME'
s3_input_train = sagemaker.s3_input(
    s3_data=input_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')

estimator.fit({'train': s3_input_train, 'validation': s3_eval_train})
Transformation settings:
# Create a batch transformer from the trained estimator.
transformer = estimator.transformer(instance_count=10,
                                    instance_type='ml.c5.2xlarge',
                                    strategy='SingleRecord',
                                    assemble_with='Line',
                                    max_payload=1,
                                    max_concurrent_transforms=100)
transformer.output_path = f's3://JOB_PATH/predictions/PREDICTION_TABLE_NAME'

# Send the test CSV one line per request.
transformer.transform(f's3://JOB_PATH/test/PREDICTION_TABLE_NAME',
                      content_type='text/csv',
                      split_type='Line')
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (9 by maintainers)
Commits related to this issue
- fix flaky metrics test (#753) — committed to qidewenwhen/sagemaker-python-sdk by danabens 2 years ago
- feature: Add SageMaker Experiment (#3536) * feature: Add experiment plus Run class (#691) * feature: Add Experiment helper classes (#646) * feature: Add Experiment helper classes feature: Ad... — committed to aws/sagemaker-python-sdk by qidewenwhen 2 years ago
- feature: Add SageMaker Experiment (#3536) * feature: Add experiment plus Run class (#691) * feature: Add Experiment helper classes (#646) * feature: Add Experiment helper classes feature: Ad... — committed to claytonparnell/sagemaker-python-sdk by qidewenwhen 2 years ago
- feature: Add SageMaker Experiment (#3536) * feature: Add experiment plus Run class (#691) * feature: Add Experiment helper classes (#646) * feature: Add Experiment helper classes feature: Ad... — committed to mufaddal-rohawala/sagemaker-python-sdk by qidewenwhen 2 years ago
- feature: Add SageMaker Experiment (#3536) * feature: Add experiment plus Run class (#691) * feature: Add Experiment helper classes (#646) * feature: Add Experiment helper classes feature: Ad... — committed to JoseJuan98/sagemaker-python-sdk by qidewenwhen 2 years ago
- feature: Add SageMaker Experiment (#3536) * feature: Add experiment plus Run class (#691) * feature: Add Experiment helper classes (#646) * feature: Add Experiment helper classes feature: Ad... — committed to nmadan/sagemaker-python-sdk by qidewenwhen 2 years ago
It looks like the question has been solved, so I am closing this issue.