carla: TCP Connection Error Always Breaks (Closed or Refused)

I am running CARLA across 4 GPUs on a server, following the setup docs, and using them to generate experience for a reinforcement learning agent.

My main issue is that during training, the server seems to close the connection (not necessarily at the beginning of training, but rather after approximately 12K timesteps), despite both the client and server timeouts being set to extremely high values. Interestingly, if I don't run this across multiple GPUs, the connection never seems to close.

My code used to look like this:

    self.client.start_episode(self.start_idx)
    measurements, sensor_data = self.client.read_data()

But I would always get a TCP error on the start_episode line. Using some of the work done by NervanaSystems and their CARLA wrapper, I changed my code to look similar to theirs (i.e., reconnect if you get a TCP error on start_episode), but since the connection is either closed or refused, the reconnect also times out and then my environment crashes, which stops training on all of my agents.

    # Blocking function until episode is ready
    while True:
        try:
            self.client.start_episode(self.start_idx)
            break
        except TCPConnectionError:
            self.client.connect()
            print('TCP Connection error: {}'.format(self.port))
            time.sleep(1)
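
One weakness of the loop above is that it never gives up, and an exception raised by the reconnect call itself escapes the except block. A minimal sketch of a bounded variant with exponential backoff follows. The helper name retry_with_backoff and its parameters (max_attempts, base_delay, max_delay, sleep) are hypothetical and not part of CARLA; the helper only assumes that the caller passes in the action to retry, a recovery callable, and the exception type to catch.

```python
import time


def retry_with_backoff(action, recover, retryable, max_attempts=5,
                       base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call action(); on a retryable error, call recover() and try again.

    Hypothetical helper, not part of CARLA. Re-raises the last error once
    max_attempts is exhausted so the caller can decide what to do next
    (for example, restart the server) instead of blocking forever.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            try:
                recover()
            except retryable:
                pass  # the reconnect itself failed; wait and try again
            sleep(delay)
            delay = min(delay * 2, max_delay)  # back off, capped at max_delay
```

In the reset function this could be called as retry_with_backoff(lambda: self.client.start_episode(self.start_idx), self.client.connect, TCPConnectionError). If it still raises after the last attempt, the server is most likely gone and reconnecting alone will not help.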

I am using 8 workers of PPO, a synchronous RL algorithm. I know A3C as described in the paper would be able to get around this problem by restarting the server and then reconnecting the client without interrupting the training of other agents due to the asynchrony. Is there anything that can be done about this? I am not super sure what else I could be doing to help with this problem, so I wanted to post this and see if anyone could find some incorrect logic in what I am doing here. (This code lies in the reset function of my agent environment)

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 15

Most upvoted comments

Hi guys,

I had the same problem. I think it happens because your client is trying to read data from the server before the server is ready. I solved this by adding a sleep before my client reads data from a new episode; 2-3 seconds works perfectly for me.
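
A minimal sketch of that workaround, assuming a client object that exposes start_episode and read_data as in the question. The function name start_episode_and_read and the settle_seconds parameter are hypothetical, and the 2-second default is simply the low end of the range mentioned above, not a tuned value.

```python
import time


def start_episode_and_read(client, start_idx, settle_seconds=2.0,
                           sleep=time.sleep):
    """Start an episode, give the server time to settle, then read once.

    Hypothetical helper: `client` is assumed to provide start_episode()
    and read_data() as used in the question.
    """
    client.start_episode(start_idx)
    sleep(settle_seconds)  # wait before the first read of the new episode
    return client.read_data()
```

A fixed sleep is a blunt fix: it costs a couple of seconds on every reset, and whether it is long enough may depend on the machine and the map being loaded.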