
Training AI Agents

Effective training requires well-designed environments, carefully shaped rewards, curriculum strategies, and proper hyperparameter configuration.

Training Environment Design

  • Parallel environments: Run multiple training instances simultaneously to collect experience faster; ML-Agents can launch several copies of a compiled build in parallel via the --num-envs option of mlagents-learn.
  • Domain randomization: Vary environment parameters (colors, sizes, physics) during training so the agent generalizes better.
  • Simplified visuals: Training environments do not need full graphics. Disable shadows, particles, and post-processing to speed up simulation.
  • Time scale: Increase Time.timeScale to accelerate training. Values of 10-20x are common, and 20x is ML-Agents' usual default during training. Both the time scale and randomized parameters can also be set from the Python side, as shown in the sketch after this list.
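
When connecting to a build through the low-level Python API, the time scale and randomized parameters can be set through side channels. The sketch below uses mlagents_envs' EngineConfigurationChannel and EnvironmentParametersChannel, both part of ML-Agents; the build path "MyBuild" and the parameter name "obstacle_scale" are placeholders that must match your own project.

Python - Side Channels for Time Scale and Randomization (sketch)
import random

from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.side_channel.engine_configuration_channel import (
    EngineConfigurationChannel,
)
from mlagents_envs.side_channel.environment_parameters_channel import (
    EnvironmentParametersChannel,
)

engine = EngineConfigurationChannel()
params = EnvironmentParametersChannel()

# "MyBuild" is a placeholder path to a compiled training build.
env = UnityEnvironment(file_name="MyBuild", side_channels=[engine, params])

# Accelerate the simulation; 20x is the usual training default.
engine.set_configuration_parameters(time_scale=20.0)

# Domain randomization: resample a parameter each episode. The key
# "obstacle_scale" is hypothetical and must match what the Unity side
# reads via Academy.Instance.EnvironmentParameters.
for episode in range(10):
    params.set_float_parameter("obstacle_scale", random.uniform(0.5, 2.0))
    env.reset()
    # ... step the environment and collect experience here ...

env.close()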

Reward Shaping

The reward function is the most critical part of training. A poorly designed reward leads to unintended behaviors (reward hacking).

Strategy         Description                      Example
Sparse rewards   Only reward on success/failure   +1 for reaching the goal, -1 for falling
Dense rewards    Continuous feedback signals      Small reward proportional to distance decrease
Shaping rewards  Intermediate milestones          Reward for picking up a key, opening a door, reaching the exit
Penalty signals  Discourage undesired behavior    Small negative reward per time step to encourage speed
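
In ML-Agents itself these terms are accumulated inside the agent's C# code (via AddReward or SetReward); the function below is a framework-neutral Python sketch of how the four strategies combine into one per-step signal. All names and coefficients are illustrative.

Python - Combined Reward Function (sketch)
def shaped_reward(reached_goal, fell, prev_dist, curr_dist, picked_up_key):
    """Combine the four strategies from the table into one signal."""
    reward = 0.0
    # Sparse terminal rewards: success and failure.
    if reached_goal:
        reward += 1.0
    if fell:
        reward -= 1.0
    # Dense reward: proportional to progress toward the goal.
    reward += 0.1 * (prev_dist - curr_dist)
    # Shaping reward: one-time milestone bonus.
    if picked_up_key:
        reward += 0.25
    # Penalty signal: small cost per step to encourage speed.
    reward -= 0.001
    return reward

Keep milestone and per-step terms small relative to the terminal reward, or the agent may learn to farm them instead of finishing the task (reward hacking).
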
YAML - ML-Agents Training Configuration
behaviors:
  MyAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      num_epoch: 3
    network_settings:
      hidden_units: 256
      num_layers: 2
      normalize: true
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
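
Training is then launched from the command line with this file. A minimal sketch, assuming the configuration above is saved as config/my_agent.yaml; --env, --run-id, and --num-envs are standard mlagents-learn options, and all values here are placeholders.

Python - Launching the Trainer (sketch)
import subprocess

# mlagents-learn reads the behavior configuration from the YAML file;
# the build path, run ID, and environment count are placeholders.
subprocess.run(
    [
        "mlagents-learn",
        "config/my_agent.yaml",
        "--env=MyBuild",          # compiled training build
        "--run-id=my_agent_run",  # names the results/ subfolder
        "--num-envs=8",           # parallel copies of the build
    ],
    check=True,
)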

Curriculum Learning

Curriculum learning starts the agent on easy tasks and gradually increases difficulty. This prevents the agent from getting stuck on tasks that are too hard to solve from random exploration.

YAML - Curriculum Configuration
environment_parameters:
  wall_height:
    curriculum:
      - name: EasyWalls
        completion_criteria:
          measure: reward
          behavior: MyAgent
          min_lesson_length: 100
          threshold: 0.8
        value: 0.5
      - name: MediumWalls
        completion_criteria:
          measure: reward
          behavior: MyAgent
          min_lesson_length: 100
          threshold: 0.8
        value: 1.5
      - name: HardWalls
        value: 3.0
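
The trainer advances lessons on its own; the sketch below only approximates the rule that each completion_criteria block encodes, to make the semantics concrete: move to the next lesson once the mean reward over at least min_lesson_length episodes reaches the threshold. This is an illustration, not ML-Agents code.

Python - Lesson Advancement Rule (sketch)
from collections import deque

# Lesson names and wall heights mirror the YAML above.
LESSONS = [("EasyWalls", 0.5), ("MediumWalls", 1.5), ("HardWalls", 3.0)]
THRESHOLD = 0.8
MIN_LESSON_LENGTH = 100

lesson = 0
recent_rewards = deque(maxlen=MIN_LESSON_LENGTH)

def record_episode(episode_reward):
    """Call once per finished episode; returns the wall height to use."""
    global lesson
    recent_rewards.append(episode_reward)
    enough_episodes = len(recent_rewards) == MIN_LESSON_LENGTH
    if enough_episodes and lesson < len(LESSONS) - 1:
        if sum(recent_rewards) / len(recent_rewards) >= THRESHOLD:
            lesson += 1             # advance to the harder lesson
            recent_rewards.clear()  # restart the measurement window
    return LESSONS[lesson][1]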

Monitoring Training

  • TensorBoard: Monitor cumulative reward, episode length, policy loss, and value loss in real-time.
  • Cumulative reward: Should trend upward. Plateaus may indicate learning-rate issues or reward-function problems; a crude plateau check is sketched after this list.
  • Episode length: Should decrease on goal-reaching tasks as the agent solves them more efficiently, and increase on survival or balancing tasks, where staying alive longer is the point.
  • Policy entropy: Should decrease slowly as the agent becomes more confident in its decisions; a very rapid drop suggests insufficient exploration (try raising beta).
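
These curves can also be read programmatically from the event files ML-Agents writes under results/<run-id>/. A minimal sketch using TensorBoard's EventAccumulator; the run ID, behavior name, and scalar tag below are assumptions to check against your own logs (acc.Tags() lists what is available).

Python - Plateau Check on the Reward Curve (sketch)
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Assumes the default ML-Agents layout: results/<run-id>/<behavior-name>/
acc = EventAccumulator("results/my_agent_run/MyAgent")
acc.Reload()

# "Environment/Cumulative Reward" is the tag ML-Agents uses for the
# reward curve; verify with acc.Tags() if your version differs.
events = acc.Scalars("Environment/Cumulative Reward")
values = [e.value for e in events]

# Crude plateau check: compare the last quarter of the run
# against the quarter before it.
q = max(1, len(values) // 4)
recent, earlier = values[-q:], values[-2 * q : -q]
if earlier and sum(recent) / len(recent) <= sum(earlier) / len(earlier):
    print("Reward has plateaued; revisit the learning rate or reward design.")
else:
    print("Reward is still improving.")
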
Key takeaway: Successful training depends on good environment design, careful reward shaping, and proper hyperparameter tuning. Use curriculum learning for hard tasks, monitor with TensorBoard, and run parallel environments to speed up training.