MARL XAI

Project

Google Drive Link

Pettingzoo MPE

simple_adversary_v3 simple_crypto_v3 simple_push_v3 simple_reference_v3 simple_speaker_listener_v4 simple_spread_v3 simple_tag_v3 simple_world_comm_v3
------------------------------------
simple_adversary_v3
=> adversary_0 : Box(-inf, inf, (8,), float32) / Discrete(5)
=> agent_0 : Box(-inf, inf, (10,), float32) / Discrete(5)
=> agent_1 : Box(-inf, inf, (10,), float32) / Discrete(5)
- [size] raw action: (3,)
- [size] raw next obs, reward: (3, 10) (3,)
------------------------------------
simple_crypto_v3
=> eve_0 : Box(-inf, inf, (4,), float32) / Discrete(4)
=> bob_0 : Box(-inf, inf, (8,), float32) / Discrete(4)
=> alice_0 : Box(-inf, inf, (8,), float32) / Discrete(4)
- [size] raw action: (3,)
- [size] raw next obs, reward: (3, 8) (3,)
------------------------------------
simple_push_v3
=> adversary_0 : Box(-inf, inf, (8,), float32) / Discrete(5)
=> agent_0 : Box(-inf, inf, (19,), float32) / Discrete(5)
- [size] raw action: (2,)
- [size] raw next obs, reward: (2, 19) (2,)
------------------------------------
simple_reference_v3
=> agent_0 : Box(-inf, inf, (21,), float32) / Discrete(50)
=> agent_1 : Box(-inf, inf, (21,), float32) / Discrete(50)
- [size] raw action: (2,)
- [size] raw next obs, reward: (2, 21) (2,)
------------------------------------
simple_speaker_listener_v4
=> speaker_0 : Box(-inf, inf, (3,), float32) / Discrete(3)
=> listener_0 : Box(-inf, inf, (11,), float32) / Discrete(5)
- [size] raw action: (2,)
- [size] raw next obs, reward: (2, 11) (2,)
------------------------------------
simple_spread_v3
=> agent_0 : Box(-inf, inf, (18,), float32) / Discrete(5)
=> agent_1 : Box(-inf, inf, (18,), float32) / Discrete(5)
=> agent_2 : Box(-inf, inf, (18,), float32) / Discrete(5)
- [size] raw action: (3,)
- [size] raw next obs, reward: (3, 18) (3,)
------------------------------------
simple_tag_v3
=> adversary_0 : Box(-inf, inf, (16,), float32) / Discrete(5)
=> adversary_1 : Box(-inf, inf, (16,), float32) / Discrete(5)
=> adversary_2 : Box(-inf, inf, (16,), float32) / Discrete(5)
=> agent_0 : Box(-inf, inf, (14,), float32) / Discrete(5)
- [size] raw action: (4,)
- [size] raw next obs, reward: (4, 16) (4,)

Direct step in environment for rendering

------------------------
simple_adversary_v3
obs length: [11, 7, 7] | act length: [5, 5, 5]
input action: [4, 3, 2]
------------------------
simple_crypto_v3
obs length: [5, 5, 7] | act length: [4, 4, 4]
input action: [3, 0, 3]
------------------------
simple_push_v3
obs length: [11, 7] | act length: [5, 5]
input action: [1, 1]
------------------------
simple_reference_v3
obs length: [7, 7] | act length: [50, 50]
input action: [48, 31]
------------------------
simple_speaker_listener_v4
obs length: [9, 10] | act length: [3, 5]
input action: [2, 3]
------------------------
simple_spread_v3
obs length: [7, 7, 7] | act length: [5, 5, 5]
input action: [4, 2, 2]
------------------------
simple_tag_v3
obs length: [11, 11, 11, 7] | act length: [5, 5, 5, 5]
input action: [0, 1, 2, 1]

Direct step in environment with Communication Model

------------------------------------
simple_crypto_v3
actions: tensor([[3, 1, 3]]) | obs_size: (3, 8) | termination: [0 0 0]
actions: tensor([[3, 2, 2]]) | obs_size: (3, 8) | termination: [0 0 0]
actions: tensor([[0, 1, 1]]) | obs_size: (3, 8) | termination: [0 0 0]
------------------------------------
simple_push_v3
actions: tensor([[1, 4]]) | obs_size: (2, 19) | termination: [0 0]
actions: tensor([[3, 0]]) | obs_size: (2, 19) | termination: [0 0]
actions: tensor([[0, 2]]) | obs_size: (2, 19) | termination: [0 0]
------------------------------------
simple_reference_v3
actions: tensor([[29, 22]]) | obs_size: (2, 21) | termination: [0 0]
actions: tensor([[17, 38]]) | obs_size: (2, 21) | termination: [0 0]
actions: tensor([[41, 7]]) | obs_size: (2, 21) | termination: [0 0]
------------------------------------
simple_speaker_listener_v4
actions: tensor([[1, 0]]) | obs_size: (2, 11) | termination: [0 0]
actions: tensor([[1, 2]]) | obs_size: (2, 11) | termination: [0 0]
actions: tensor([[0, 1]]) | obs_size: (2, 11) | termination: [0 0]
------------------------------------
simple_spread_v3
actions: tensor([[3, 2, 3]]) | obs_size: (3, 18) | termination: [0 0 0]
actions: tensor([[3, 2, 3]]) | obs_size: (3, 18) | termination: [0 0 0]
actions: tensor([[2, 1, 4]]) | obs_size: (3, 18) | termination: [0 0 0]
------------------------------------
simple_tag_v3
actions: tensor([[0, 0, 4, 4]]) | obs_size: (4, 16) | termination: [0 0 0 0]
actions: tensor([[3, 1, 1, 0]]) | obs_size: (4, 16) | termination: [0 0 0 0]
actions: tensor([[0, 1, 3, 1]]) | obs_size: (4, 16) | termination: [0 0 0 0]


simple adversary

Trained Model's Return


simple push

Trained Model's Return


simple crypto

Trained Model's Return


simple reference

Trained Model's Return


simple speaker and listener

Trained Model's Return


simple spread

Trained Model's Return


simple tag

Trained Model's Return


simple world comm

Trained Model's Return


Three types of models and training results

To analyze how message congestion varies with the environment and with model complexity, we first train the models and evaluate their performance. Each environment contains three kinds of entities (good agents, adversaries, and obstacles), and their counts can be configured to increase the environment's complexity. In this experiment, the agents were trained with the following settings.

num_steps = 128
update_epochs = 4
num_layers = 4
total_timesteps = 1000000
hidden_dim = 128
env_max_cycles = 50
seed = 0
msg_activation = Sigmoid

message_dim : [1, 2, 4, 8, 16]
activation : [ReLU, Sigmoid]
num_layers : 3
update_epochs : 4
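The `message_dim` and `activation` lists above read as a sweep grid. A small sketch of how such a grid could be enumerated (the variable names here are illustrative, not the project's actual config code):

```python
# Illustrative sketch: enumerate the (message_dim, activation) sweep grid
# listed above. Names are hypothetical, not the project's config code.
from itertools import product

message_dims = [1, 2, 4, 8, 16]
activations = ["ReLU", "Sigmoid"]

runs = [{"message_dim": d, "activation": a}
        for d, a in product(message_dims, activations)]
print(len(runs))  # 10 runs per environment
```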

Model V1

V1 generates no messages; each agent acts on its own observation alone.

def step1(self, obs):
    # V1 emits no message
    message = None
    return message

def step2(self, obs, messages):
    # No message fusion: the policy consumes the raw observation alone
    combined = obs
    return combined

def forward(self, obs):
    message = self.step1(obs)
    combined = self.step2(obs, message)
    return combined

Model V2

V2 generates a message for each agent and combines the messages from allied agents by average pooling.

# Element-wise mean (average pooling) over the allied agents' messages
pooled_message = torch.stack(gathered_messages, dim=0).mean(dim=0)
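As a plain-Python illustration of the same averaging (the message values below are made up; the project does this with `torch.stack(...).mean(dim=0)`):

```python
# Plain-Python illustration of V2's average pooling; message values are made up.
gathered_messages = [
    [0.2, 0.8],  # message from agent_0
    [0.4, 0.6],  # message from agent_1
    [0.6, 0.4],  # message from agent_2
]

# Element-wise mean across agents, matching torch.stack(msgs, dim=0).mean(dim=0)
pooled_message = [sum(col) / len(gathered_messages)
                  for col in zip(*gathered_messages)]
print(pooled_message)
```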

Model V3

V3 generates a message for each agent and combines the allied messages with attention, using a query derived from the agent's own observation.

# The agent's own observation produces the query; allied messages serve as keys and values
query = self.queries[group_id](agent_obs).unsqueeze(1)
gathered_messages = torch.stack(gathered_messages, dim=1)
pooled_message, scores = self.attentions[group_id](query, gathered_messages, gathered_messages)
pooled_message = pooled_message.squeeze(1)
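The project uses a learned query and an attention module for this; the plain-Python sketch below shows the underlying scaled dot-product pooling with made-up numbers:

```python
# Plain-Python sketch of attention pooling: one query attends over three
# allied messages (keys == values). All numbers are made up for illustration.
import math

query = [1.0, 0.0]  # would come from the agent's own observation
messages = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

d = len(query)
# Scaled dot-product scores, one per message
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in messages]
# Softmax over messages
exp = [math.exp(s) for s in scores]
weights = [e / sum(exp) for e in exp]
# Weighted sum of the messages is the pooled message
pooled = [sum(w * m[i] for w, m in zip(weights, messages)) for i in range(d)]
```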

Training results: Google Drive
Full training-result PNGs: Google Drive