• State data: This information includes the position of the agent, the number of balls it currently holds, the number of hits it can take before being eliminated, as well as information about the flags in Capture the Flag mode. Agents use this information to strategize and determine their chances of winning.
  • Other agents’ state data: This information includes the position and health of the agent’s teammates, and whether any of them are holding a flag. Note that, since the number of agents is not fixed (agents can be eliminated at any time), we use a Buffer Sensor so each agent can process a variable number of observations. Here, the number of observations refers to the number of teammates still in the game (see the sketch after this list).
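
As an illustration of how a Buffer Sensor handles this variable-length observation, the sketch below appends one entry per surviving teammate to a BufferSensorComponent inside CollectObservations. The class, field, and list names (DodgeBallAgentSketch, m_Teammates, BallsHeld, HitPointsRemaining, HasFlag) are hypothetical placeholders, not the project's actual code.

```csharp
using System.Collections.Generic;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using UnityEngine;

public class DodgeBallAgentSketch : Agent
{
    // BufferSensorComponent attached to the agent, configured in the Inspector
    // with an observable size of 4 (x, z, hit points, has-flag flag).
    [SerializeField] private BufferSensorComponent m_TeammateSensor;

    // Hypothetical references to the agent's teammates and its own state;
    // the real DodgeBall project organizes this differently.
    [SerializeField] private List<DodgeBallAgentSketch> m_Teammates;
    public int BallsHeld;
    public int HitPointsRemaining;
    public bool HasFlag;

    public override void CollectObservations(VectorSensor sensor)
    {
        // Fixed-size observations about this agent.
        sensor.AddObservation(transform.localPosition);
        sensor.AddObservation(BallsHeld);
        sensor.AddObservation(HitPointsRemaining);
        sensor.AddObservation(HasFlag);

        // Variable-size observations: one entry per teammate still in the game.
        foreach (var teammate in m_Teammates)
        {
            if (!teammate.gameObject.activeSelf) continue; // skip eliminated teammates
            m_TeammateSensor.AppendObservation(new float[]
            {
                teammate.transform.localPosition.x,
                teammate.transform.localPosition.z,
                teammate.HitPointsRemaining,
                teammate.HasFlag ? 1f : 0f
            });
        }
    }
}
```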

The DodgeBall environment also uses hybrid actions, a mix of continuous and discrete actions. The agent has three continuous actions for movement: one to move forward, one to move sideways, and one to rotate. It also has two discrete actions: one to throw a ball and one to dash. This action space corresponds to the actions a human player can perform in both the Capture the Flag and Elimination scenarios.
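
As a rough sketch of how such a hybrid action space could be consumed in an ML-Agents OnActionReceived callback (MoveAgent, ThrowBall, and Dash are hypothetical helpers, not the project's actual methods):

```csharp
using Unity.MLAgents;
using Unity.MLAgents.Actuators;

public class DodgeBallActionsSketch : Agent
{
    // Behavior Parameters would be set to 3 continuous actions and
    // 2 discrete branches of size 2 (0 = do nothing, 1 = act).
    public override void OnActionReceived(ActionBuffers actions)
    {
        // Continuous actions: forward movement, sideways movement, rotation.
        float forward  = actions.ContinuousActions[0];
        float sideways = actions.ContinuousActions[1];
        float rotation = actions.ContinuousActions[2];
        MoveAgent(forward, sideways, rotation);

        // Discrete actions: throw a held ball, dash.
        if (actions.DiscreteActions[0] == 1) ThrowBall();
        if (actions.DiscreteActions[1] == 1) Dash();
    }

    // Hypothetical helpers standing in for the game's movement and ability code.
    private void MoveAgent(float forward, float sideways, float rotation) { /* ... */ }
    private void ThrowBall() { /* ... */ }
    private void Dash() { /* ... */ }
}
```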

Meanwhile, we intentionally keep the rewards given to the agents simple: a large final reward for winning or losing, and a few intermediate rewards for learning how to play the game.

For Elimination:

  • Agents are given a +0.1 reward for hitting an opponent with a ball.
  • The team is given +1 for winning the game (eliminating all opponents), or -1 for losing.
  • The winning team is also awarded a time bonus for winning quickly, equal to 1 × (remaining time) / (maximum time).

For Capture the Flag:

  • Agents are given a +0.02 reward for hitting an opponent with a ball.
  • The team is given +2 for winning the game (returning the opponent’s flag to base), or -1 for losing.
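
To make the reward structure concrete, the following is a minimal sketch of how both sets of rewards could be wired up with ML-Agents group rewards. The class and method names (DodgeBallRewardsSketch, OnOpponentHit, OnEliminationEnd, OnFlagReturned) are hypothetical, and the assumption that each team is managed through a SimpleMultiAgentGroup is ours, not a statement about the project's actual code.

```csharp
using Unity.MLAgents;

// Illustrative reward logic for both game modes, assuming each team's agents
// are registered in an ML-Agents SimpleMultiAgentGroup so that win/lose
// rewards are shared by the whole team.
public class DodgeBallRewardsSketch
{
    // Individual reward for hitting an opponent:
    // +0.1 in Elimination, +0.02 in Capture the Flag.
    public void OnOpponentHit(Agent thrower, bool captureTheFlag)
    {
        thrower.AddReward(captureTheFlag ? 0.02f : 0.1f);
    }

    // Elimination: +1 to the winning team plus a time bonus of
    // 1 × (remaining time) / (maximum time), and -1 to the losing team.
    public void OnEliminationEnd(SimpleMultiAgentGroup winners, SimpleMultiAgentGroup losers,
                                 float remainingTime, float maximumTime)
    {
        winners.AddGroupReward(1f + remainingTime / maximumTime);
        losers.AddGroupReward(-1f);
        winners.EndGroupEpisode();
        losers.EndGroupEpisode();
    }

    // Capture the Flag: +2 to the team that returns the opponent's flag,
    // -1 to the other team.
    public void OnFlagReturned(SimpleMultiAgentGroup winners, SimpleMultiAgentGroup losers)
    {
        winners.AddGroupReward(2f);
        losers.AddGroupReward(-1f);
        winners.EndGroupEpisode();
        losers.EndGroupEpisode();
    }
}
```

Because the win/lose rewards are assigned to the group rather than to individual agents, every teammate shares the outcome signal, matching the "team is given" wording above.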

While it is tempting to give agents many small rewards to encourage desirable behaviors, we must avoid overprescribing the strategy that agents should pursue. For instance, if we gave a reward for picking up balls in Elimination, agents might focus solely on picking up balls rather than hitting their opponents. By keeping our rewards as “sparse” as possible, we leave the agents free to discover their own strategies in the game, even if this prolongs the training period.

Because there are so many different possible winning strategies that can earn agents these rewards, we had to determine what optimal behaviors would look like. For instance, would the best strategy be to hoard the balls, or to move them around so they can be conveniently grabbed later? Would it be wise to stick together as a team, or to spread out to find the enemy faster? The answers to these questions depended on the game design choices we made: if balls were scarce, agents would hold on to them longer to prevent the enemies from getting them; if agents were allowed to know where the enemy was at all times, they would stay together as a group as much as possible. That said, when we wanted to make changes to the game, we did not have to make any code changes to the AI. We simply retrained a new behavior that adapted to the new environment.

Source: Unity Technologies Blog