sagemaker-python-sdk: sagemaker job failing in transformation step

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: 1.12.0
  • Python Version: Python 3.6.8
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.13
  • Are you using a custom image: No

Describe the problem

We’re using SageMaker to parallelize a TensorFlow job. We create a model with the TensorFlow estimator, and training completes successfully. When the job moves on to the batch transform step, it fails with the error “Unable to get response from algorithm.”

Minimal repro / logs

Stack trace:

  File "/home/abexecutor/ml-conversion/dcs-analytics/ML_models/sage_maker_job_runner.py", line 160, in _predict
    transformer.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 135, in wait
    self.latest_transform_job.wait()
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/transformer.py", line 209, in wait
    self.sagemaker_session.wait_for_transform_job(self.job_name)
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 886, in wait_for_transform_job
    self._check_job_status(job, desc, 'TransformJobStatus')
  File "/home/abexecutor/Environments/ml_py_36/lib/python3.6/site-packages/sagemaker/session.py", line 908, in _check_job_status
    raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
ValueError: Error for Transform job sagemaker-tensorflow-2019-04-16-20-36-36-284: Failed Reason: AlgorithmError: See job logs for more information
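
The ValueError above only repeats the short failure reason and points at the job logs. A minimal sketch of how the job description and its log stream names might be looked up, assuming boto3 is installed and AWS credentials are configured. The helper name `summarize_transform_failure` is hypothetical; `describe_transform_job` and `describe_log_streams` are regular boto3 client calls, and `/aws/sagemaker/TransformJobs` is the CloudWatch log group used by batch transform jobs.

```python
def summarize_transform_failure(job_name, sm_client, logs_client):
    """Collect status, failure reason, and log stream names for a transform job.

    Hypothetical helper. The clients are passed in so the function can be
    exercised with stand-ins; in real use they would be
    boto3.client("sagemaker") and boto3.client("logs").
    """
    desc = sm_client.describe_transform_job(TransformJobName=job_name)
    streams = logs_client.describe_log_streams(
        logGroupName="/aws/sagemaker/TransformJobs",
        logStreamNamePrefix=job_name,
    )
    return {
        "status": desc["TransformJobStatus"],
        # FailureReason is only present when the job has failed.
        "reason": desc.get("FailureReason", ""),
        "log_streams": [s["logStreamName"] for s in streams["logStreams"]],
    }


# Possible usage (not executed here, needs AWS access):
#   import boto3
#   info = summarize_transform_failure(
#       "sagemaker-tensorflow-2019-04-16-20-36-36-284",
#       boto3.client("sagemaker"),
#       boto3.client("logs"),
#   )
#   print(info["reason"], info["log_streams"])
```

The returned stream names can then be read in the CloudWatch console or with the logs client to see the container-side error behind "AlgorithmError".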

Transformation logs:

2019-04-16T20:39:55.259:[sagemaker logs]: MaxConcurrentTransforms=100, MaxPayloadInMB=1, BatchStrategy=SINGLE_RECORD

    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.253:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
    at java.lang.Thread.run(Thread.java:748)
2019-04-16T20:40:23.289:[sagemaker logs]: normalizedetl/ml-clicks-lookalike/campaign-1863917904632580551/1555446017448/test/testing_data_encoded_mean_aligned_0_2_1.csv: Unable to get response from algorithm
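
When the transform log is long, it can help to see which input objects triggered the "Unable to get response from algorithm" message and how often. The following is a small sketch that assumes log lines shaped like the ones above; the function name `count_failed_inputs` is hypothetical.

```python
import re
from collections import Counter

# Matches the part of a line that looks like:
#   [sagemaker logs]: <input key>: Unable to get response from algorithm
FAILED_RECORD = re.compile(
    r"\[sagemaker logs\]: (?P<key>\S+): Unable to get response from algorithm"
)


def count_failed_inputs(log_text):
    """Count 'Unable to get response' messages per input object key."""
    return Counter(m.group("key") for m in FAILED_RECORD.finditer(log_text))
```

Because the pattern is searched for anywhere in the text, it also works when a stack-trace fragment is fused onto the front of the line.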
  • Exact command to reproduce:

Tensorflow model settings:

import sagemaker
from sagemaker.tensorflow import TensorFlow

# ROLE, JOB_PATH and TRAIN_TABLE_NAME stand in for the real values.
estimator = TensorFlow(entry_point='ML_models/tensorflow_entry_point.py',
                       role=ROLE,
                       framework_version='1.12.0',
                       training_steps=1000,
                       evaluation_steps=100,
                       train_instance_count=2,
                       train_instance_type='ml.m5.xlarge')
estimator.output_path = f's3://JOB_PATH/model/TRAIN_TABLE_NAME'

validation_prefix = f's3://JOB_PATH/validation/TRAIN_TABLE_NAME'

s3_eval_train = sagemaker.s3_input(
    s3_data=validation_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
input_prefix = f's3://JOB_PATH/train/TRAIN_TABLE_NAME'

s3_input_train = sagemaker.s3_input(
    s3_data=input_prefix,
    content_type='csv',
    distribution='ShardedByS3Key')
    
estimator.fit({'train': s3_input_train, 'validation': s3_eval_train})

Transformation settings:

transformer = estimator.transformer(instance_count=10,
                                    instance_type='ml.c5.2xlarge',
                                    strategy='SingleRecord',
                                    assemble_with='Line',
                                    max_payload=1,
                                    max_concurrent_transforms=100)

transformer.output_path = f's3://JOB_PATH/predictions/PREDICTION_TABLE_NAME'

transformer.transform(f's3://JOB_PATH/test/PREDICTION_TABLE_NAME',
                              content_type='text/csv',
                              split_type='Line')
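
For context on the settings above, `max_concurrent_transforms` is the number of parallel requests sent to each instance, and an ml.c5.2xlarge has 8 vCPUs. The arithmetic below is only a rough sketch of the load these settings imply. Whether this load is related to the failure is an assumption, since the issue text does not record the resolution, and the helper name `requests_per_vcpu` is hypothetical.

```python
def requests_per_vcpu(max_concurrent_transforms, vcpus_per_instance):
    """Parallel requests each vCPU would have to serve on one instance."""
    return max_concurrent_transforms / vcpus_per_instance


VCPUS_ML_C5_2XLARGE = 8   # published vCPU count for the c5.2xlarge size
INSTANCE_COUNT = 10       # from the transformer settings
MAX_CONCURRENT = 100      # from the transformer settings

ratio = requests_per_vcpu(MAX_CONCURRENT, VCPUS_ML_C5_2XLARGE)
total_in_flight = INSTANCE_COUNT * MAX_CONCURRENT

print(f"{ratio} parallel requests per vCPU")         # 12.5
print(f"{total_in_flight} requests in flight total")  # 1000
```

If the container logs show timeouts, rerunning with a much smaller `max_concurrent_transforms` would be one way to check whether concurrency plays a role.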

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

It looks like the question has been solved, so I am closing this issue.