Training AI Agents
Effective training requires well-designed environments, carefully shaped rewards, curriculum strategies, and proper hyperparameter configuration.
Training Environment Design
- Parallel environments: Run multiple environment instances simultaneously to collect experience faster. ML-Agents can launch many instances in parallel via the `--num-envs` option of `mlagents-learn`.
- Domain randomization: Vary environment parameters (colors, sizes, physics) during training so the agent generalizes better.
- Simplified visuals: Training environments do not need full graphics. Disable shadows, particles, and post-processing to speed up simulation.
- Time scale: Increase `Time.timeScale` to accelerate training. Values of 10-20x are common.
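The idea behind domain randomization can be sketched in a few lines of Python. The parameter names and ranges below are purely illustrative; in an actual ML-Agents project you would expose such values as environment parameters and read them on the C# side at episode reset.

```python
import random

# Hypothetical parameter ranges to randomize each episode; the names are
# illustrative, not part of the ML-Agents API.
PARAM_RANGES = {
    "light_intensity": (0.5, 1.5),
    "obstacle_scale": (0.8, 1.2),
    "gravity_y": (-11.0, -8.0),
}

def sample_episode_params(ranges, rng=random):
    """Draw a fresh value for every parameter at episode reset, so the
    agent never trains on exactly the same environment twice."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

# Each call yields a new random configuration within the declared ranges.
params = sample_episode_params(PARAM_RANGES)
```

Because every episode draws from the full range, the trained policy cannot overfit to one fixed appearance or physics setting.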
Reward Shaping
The reward function is the most critical part of training. A poorly designed reward leads to unintended behaviors (reward hacking).
| Strategy | Description | Example |
|---|---|---|
| Sparse rewards | Only reward on success/failure | +1 for reaching goal, -1 for falling |
| Dense rewards | Continuous feedback signals | Small reward proportional to distance decrease |
| Shaping rewards | Intermediate milestones | Reward for picking up key, opening door, reaching exit |
| Penalty signals | Discourage undesired behavior | Small negative reward per time step to encourage speed |
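The strategies in the table are usually combined. Here is a minimal Python sketch of a shaped reward that mixes a dense progress term, sparse terminal rewards, and a per-step time penalty; the coefficients are illustrative and would need tuning for a real task.

```python
def shaped_reward(prev_dist, curr_dist, reached_goal, fell, step_penalty=0.001):
    """Combine the table's strategies: dense progress toward the goal,
    sparse terminal outcomes, and a small time penalty. All weights here
    are illustrative starting points, not tuned values."""
    reward = 0.0
    reward += 0.1 * (prev_dist - curr_dist)  # dense: reward reducing distance
    reward -= step_penalty                   # penalty: encourage speed
    if reached_goal:
        reward += 1.0                        # sparse: success
    if fell:
        reward -= 1.0                        # sparse: failure
    return reward

# Moving 0.5 units closer to the goal with no terminal event:
r = shaped_reward(prev_dist=2.0, curr_dist=1.5, reached_goal=False, fell=False)
# r = 0.1 * 0.5 - 0.001 = 0.049
```

Keeping the dense and penalty terms small relative to the terminal rewards reduces the risk of reward hacking, where the agent farms the shaping signal instead of solving the task.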
YAML - ML-Agents Training Configuration
```yaml
behaviors:
  MyAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 1024
      buffer_size: 10240
      learning_rate: 3.0e-4
      beta: 5.0e-3
      epsilon: 0.2
      num_epoch: 3
    network_settings:
      hidden_units: 256
      num_layers: 2
      normalize: true
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    time_horizon: 64
    summary_freq: 10000
```
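These hyperparameters interact: each time the buffer fills, it is split into minibatches of `batch_size` and iterated over `num_epoch` times. A quick sketch of the resulting update count under the values above:

```python
# Hyperparameters from the config above.
batch_size = 1024
buffer_size = 10240
num_epoch = 3

# Each buffer fill is split into minibatches of batch_size...
minibatches_per_buffer = buffer_size // batch_size   # 10
# ...and each minibatch is revisited num_epoch times.
updates_per_buffer = minibatches_per_buffer * num_epoch  # 30

print(updates_per_buffer)  # 30 gradient updates per buffer fill
```

A larger `buffer_size`-to-`batch_size` ratio gives more stable gradient estimates per collection phase at the cost of less frequent policy updates.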
Curriculum Learning
Curriculum learning starts the agent on easy tasks and gradually increases difficulty. This prevents the agent from getting stuck on tasks that are too hard to solve from random exploration.
YAML - Curriculum Configuration
```yaml
environment_parameters:
  wall_height:
    curriculum:
      - name: EasyWalls
        completion_criteria:
          measure: reward
          behavior: MyAgent
          min_lesson_length: 100
          threshold: 0.8
        value: 0.5
      - name: MediumWalls
        completion_criteria:
          measure: reward
          behavior: MyAgent
          min_lesson_length: 100
          threshold: 0.8
        value: 1.5
      - name: HardWalls
        value: 3.0
```
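The `completion_criteria` logic can be illustrated with a simplified Python stand-in: advance only after enough episodes have run in the current lesson and the mean reward clears the threshold. (The real trainer tracks a smoothed progress measure internally; this only shows the idea.)

```python
from statistics import mean

def should_advance(episode_rewards, min_lesson_length=100, threshold=0.8):
    """Simplified stand-in for the completion_criteria above: move to the
    next lesson once at least min_lesson_length episodes have completed
    and the mean reward clears the threshold."""
    if len(episode_rewards) < min_lesson_length:
        return False  # not enough data in this lesson yet
    return mean(episode_rewards) >= threshold
```

The `min_lesson_length` guard matters: without it, a few lucky early episodes could push the agent to a harder lesson before the policy has actually stabilized.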
Monitoring Training
- TensorBoard: Monitor cumulative reward, episode length, policy loss, and value loss in real-time.
- Cumulative reward: Should trend upward. Plateaus may indicate learning rate issues or reward function problems.
- Episode length: Should trend downward for goal-reaching tasks as the agent learns to solve them more efficiently; for survival-style tasks it should trend upward instead.
- Policy entropy: Should decrease over time as the agent becomes more confident in its decisions.
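A plateau in cumulative reward is easier to spot programmatically than by eye. Below is a hypothetical helper (not part of ML-Agents or TensorBoard) that compares the mean of the most recent window of summary points against the window before it:

```python
def is_plateaued(rewards, window=10, tolerance=0.01):
    """Flag a plateau when the mean of the last `window` summary points
    improves by less than `tolerance` over the previous window.
    Illustrative helper for exported reward curves, not an ML-Agents API."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare two windows
    prev = sum(rewards[-2 * window:-window]) / window
    curr = sum(rewards[-window:]) / window
    return curr - prev < tolerance
```

If this fires repeatedly, revisit the learning rate and the reward function before simply extending `max_steps`.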
Key takeaway: Successful training depends on good environment design, careful reward shaping, and proper hyperparameter tuning. Use curriculum learning for hard tasks, monitor with TensorBoard, and run parallel environments to speed up training.
Lilly Tech Systems