Mephisto: ParlAI Chat Demo - Something wrong with worker pairing and agent status updates

Hi,

I have been running some pilot studies for a crowdsourcing task that pairs 2 MTurk workers for a conversation, and I noticed some unusual behavior in a few assignments. Specifically, there are cases where a worker starts the HIT and gets the partner timeout message within a few seconds. The steps to replicate the task setup and the logs are below.

I forked Mephisto (vaibhavad/Mephisto) (just a few hours ago) and made some minor changes:

  1. Changed the logging (vaibhavad/Mephisto@f5158ed0ce0fd91a215cd13b6482ef21e6c510d9) in supervisor.py, operator.py, and blueprint.py so that issue-specific information is logged (a rough sketch of the idea is shown right after this list).
  2. Statically linked the packages (vaibhavad/Mephisto@a18ac6e9967fd016be3dafb740744fcee3e37fb6) because the node modules were somehow not working on my system (#325).
  3. Changed the task configuration to 10 conversations (vaibhavad/Mephisto@c00c4b488146f247697580c2f0086373623ff02a), and used custom_prebuilt.yaml so that bundle.js is picked up from the webapp directory.
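
The exact diffs are in the commits linked above; for change 1, the idea is simply to surface DEBUG-level output from the supervisor logger. A minimal sketch using only the standard library (the handler, file name, and format string here are illustrative, not exactly what I committed):

# Sketch only: turn on DEBUG output for the supervisor logger and write it to
# a file. The handler, file name, and format are illustrative choices.
import logging

supervisor_logger = logging.getLogger("mephisto.operations.supervisor")
supervisor_logger.setLevel(logging.DEBUG)

handler = logging.FileHandler("mephisto_debug.log")
handler.setFormatter(
    logging.Formatter("[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")
)
supervisor_logger.addHandler(handler)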

To run the task, I used the following steps:

git clone https://github.com/vaibhavad/Mephisto.git
pip install parlai
cd Mephisto
pip install -e .
mephisto register mturk_sandbox name=my_mturk_user_sandbox access_key_id=<ACCESS_KEY> secret_access_key=<SECRET_KEY>
cd packages/mephisto-task
npm install; npm run dev
cd ../bootstrap-chat
npm install; npm run dev
cd ../../examples/parlai_chat_task_demo/webapp
npm install; npm run dev
cd ..
python parlai_test_script.py mephisto/architect=heroku mephisto.provider.requester_name=my_mturk_user_sandbox

I tested the system using three different MTurk Sandbox accounts, so 3 workers are registered. I frequently returned tasks midway and started new ones from different accounts to replicate the scenario I observed in production. Here are some observations (with references to the attached logs, mephisto_logs.txt):

  1. (Line 79) Task available on Worker Sandbox
  2. (Line 159) Worker 2 returned the HIT as Agent 2 and started to work on a new task as Agent 4, but the returned status is only updated ~1-2 minutes later (Line 194). In some cases the delay before the agent status update was even longer.
  3. (Line 327) As we are testing with only 3 workers, having 6 in_task agent statuses means some of these states are stale.
  4. The specific abnormal behavior we observed is visible in Lines 340-347. Assignment 3 was originally launched with Agents 5 and 6 (Line 206), and their statuses changed in Line 331. Agent 11 is then created and paired with Agent 5 without any waiting (Line 340), even though Agent 5's status was returned, not waiting (see the sketch after this list). Agent 6's status updated from partner disconnect to completed, which is also unusual. Finally, Agent 11 gets a partner disconnect within 5 seconds of starting the assignment (Line 347).
  5. Similar behavior is also observed in Lines 410-414: Agent 14 is paired with Agent 5 even though Agent 5's status is timeout.
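
This is the kind of guard I would expect before an agent is paired. It is purely my own sketch (not Mephisto code), using the status strings that appear in the attached logs:

# My own sketch, not Mephisto code: only agents that are actually waiting
# should be eligible for pairing.
from typing import Dict, List

def pairable_agents(agent_statuses: Dict[str, str]) -> List[str]:
    """Return only the agents whose status is 'waiting'."""
    return [
        agent_id
        for agent_id, status in agent_statuses.items()
        if status == "waiting"
    ]

# Example with statuses similar to those around Line 340 of mephisto_logs.txt:
statuses = {"5": "returned", "6": "partner disconnect", "11": "waiting"}
print(pairable_agents(statuses))  # -> ['11'], so Agent 5 would never be paired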

Is there a way to make sure that only agents with status waiting are paired? How does the Heroku server declare an agent disconnected/returned? I am assuming it must be based on the frequency of alive signals received; can you point me to that specific code section? Also, the unusual pairing mentioned above is always preceded by the line "Updating a final status, was timeout/returned and want to set to in task", which I’m assuming refers to the status of Agent 5 (Lines 339 and 409). It comes from the update_status function in data_model/agent.py, although I don’t quite understand the sequence of calls that leads to it being called.
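
To make that assumption about alive signals concrete, this is roughly the model I have in mind. It is entirely my own illustration, not Mephisto's implementation; the real threshold and bookkeeping are exactly what I'm asking about:

# My own illustration of the assumption above (not Mephisto's implementation):
# an agent counts as disconnected once no alive signal has arrived within some
# threshold.
import time

HEARTBEAT_TIMEOUT_SECONDS = 60  # hypothetical threshold

# Example data: Agent 5 last seen two minutes ago, Agent 11 just now.
last_alive_time = {"5": time.time() - 120, "11": time.time()}

def is_disconnected(agent_id: str, now: float) -> bool:
    """True if this agent has not sent an alive signal recently enough."""
    return now - last_alive_time[agent_id] > HEARTBEAT_TIMEOUT_SECONDS

now = time.time()
print(is_disconnected("5", now))   # True  -> should not be eligible for pairing
print(is_disconnected("11", now))  # False -> still connected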

I’ll be very grateful if you can help me with this. 😃

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

#347 is now in, so I’ll be closing this issue. Huge thanks for helping us debug what was going on here! Let us know if anything else comes up 👍

Stale states definitely look like a Heroku issue. I tried MockProvider with both LocalArchitect and HerokuArchitect, as you suggested. A socket disconnect or socket error is not being detected by HerokuArchitect. I tried with both Safari and Google Chrome. It is very likely that all the other issues originate from this.

In one of the runs, the rare case happened. Here are the relevant Mephisto logs (HerokuArchitect and MockProvider).

[2020-12-08 15:16:52,506][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:16:56] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting'}
[2020-12-08 15:16:56,519][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:17:00] p42509 {supervisor.py:527} DEBUG - Incoming request to register agent 220-5.
[2020-12-08 15:17:00,388][mephisto.operations.supervisor][DEBUG] - Incoming request to register agent 220-5.
[12-08 15:17:00] p42509 {supervisor.py:405} DEBUG - Worker 5 is being assigned one of 19 units.
[2020-12-08 15:17:00,398][mephisto.operations.supervisor][DEBUG] - Worker 5 is being assigned one of 19 units.
[12-08 15:17:00] p42509 {supervisor.py:425} DEBUG - Created agent 23, y.
[2020-12-08 15:17:00,405][mephisto.operations.supervisor][DEBUG] - Created agent 23, y.
Assignment 101 is launching with ['22', '23']
[12-08 15:17:00] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting'}
[2020-12-08 15:17:00,567][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting'}
[12-08 15:17:04] p42509 {supervisor.py:686} DEBUG - Agent statuses received - {'22': 'waiting', '23': 'in task'}
[2020-12-08 15:17:04,652][mephisto.operations.supervisor][DEBUG] - Agent statuses received - {'22': 'waiting', '23': 'in task'}

Agent 22 was actually disconnected, but HerokuArchitect could not detect the socket disconnect, so the status stayed waiting. But somehow the status was also not updated to in task even though the assignment had been launched with this agent. In 4-5 test runs, this happened only once.
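
In case it helps, this is how I have been double-checking the statuses Mephisto recorded after a run. The import path and attribute names below match the version in my fork and may differ in newer Mephisto versions:

# Dump the recorded status of every agent from the local Mephisto database.
# Import path and attribute names are from the version in my fork; they may
# differ on current main.
from mephisto.abstractions.databases.local_database import LocalMephistoDB

db = LocalMephistoDB()
for agent in db.find_agents():
    print(agent.db_id, agent.db_status)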