When used effectively, Reinforcement Learning (RL) can learn complex behaviors and play the role of Non-playable Characters (NPCs) or bots in a game with minimal code. One necessary and challenging aspect of using RL to learn agent behaviors is specifying an algorithm's hyperparameters, such as the learning rate and network size. These parameters can be unintuitive and brittle: an incorrect setting can prevent the agent from learning desirable behaviors at all. Consequently, trial and error, along with RL expertise, is often required to apply RL successfully. For this project, we focused on discovering the relationships between various hyperparameters, with the goal of reducing the number of changes required to maximize performance.

The project's main contribution is a modification of Proximal Policy Optimization (PPO), a common and effective RL algorithm, so that the user needs to adjust only two hyperparameters instead of five of the most commonly tweaked ones. We reduce the count by grouping those five parameters under two higher-level knobs: frequency and work. The frequency parameter controls how often the agent's behavior is allowed to change, while the work parameter controls how much the agent's behavior is allowed to change per step of the learning process. Additionally, because the relationships between these parameters are now better understood, some of them can be adjusted automatically to maximize computational efficiency during training. To test the algorithm changes, we built custom environments (Unity games used to train RL agents) whose characteristics, such as the reward function and the number of parallel simulations, can be changed without rebuilding the Unity game each time. Once this work is integrated into the ML-Agents Toolkit, users will be able to train a good behavior with far less trial and error and far less tweaking of opaque parameters than before.
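To make the frequency/work grouping more concrete, here is a minimal Python sketch of how two user-facing knobs might be expanded into a bundle of lower-level PPO settings. The blog post does not say which five hyperparameters are grouped or how, so the parameter names (buffer size, batch size, number of epochs, learning rate, clip epsilon) and the mapping formulas below are illustrative assumptions, not the project's actual parameterization or the ML-Agents Toolkit API.

```python
# Illustrative sketch only: the specific hyperparameters and formulas are
# assumptions chosen to show the idea of deriving several low-level PPO
# settings from two higher-level knobs ("frequency" and "work").

from dataclasses import dataclass


@dataclass
class PPOSettings:
    """A hypothetical bundle of low-level PPO hyperparameters."""
    buffer_size: int      # experiences collected between policy updates
    batch_size: int       # mini-batch size per gradient step
    num_epochs: int       # passes over the buffer per update
    learning_rate: float  # gradient step size
    clip_epsilon: float   # PPO policy-change clipping range


def derive_ppo_settings(frequency: int, work: float) -> PPOSettings:
    """Map the two user-facing knobs to five low-level settings.

    frequency: steps of experience gathered before the policy is allowed
               to change (how often behavior updates).
    work:      a 0-1 scale for how much the policy may change per update
               (how much behavior changes each learning step).
    The formulas are placeholders, not Unity's actual mapping.
    """
    buffer_size = frequency
    batch_size = max(32, frequency // 8)         # a fraction of the buffer
    num_epochs = max(1, round(1 + 7 * work))     # more work -> more passes
    learning_rate = 1e-5 + (3e-4 - 1e-5) * work  # more work -> bigger steps
    clip_epsilon = 0.1 + 0.2 * work              # more work -> looser clipping
    return PPOSettings(buffer_size, batch_size, num_epochs,
                       learning_rate, clip_epsilon)


if __name__ == "__main__":
    # Example: update behavior every 2048 steps with a moderate amount of change.
    print(derive_ppo_settings(frequency=2048, work=0.5))
```

The appeal of a grouping like this is that each knob answers a single intuitive question (how often should behavior change, and by how much), while the interdependent low-level settings are kept consistent with one another automatically.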

Source: Unity Technologies Blog