Paper: Concrete Problems In AI Safety
Imagine you've created the world's first fully intelligent cleaning robot. You're getting ready for the very first real world test run.
Attempt 1
You put the robot in your living room, turn it on, and command it to clean the room for you.
You wait for a few moments, but the robot stands exactly where it was, doing nothing. You sigh, grumble something incoherent about the stupidity of computers and wheel the robot back into your lab.
So what happened? Well, you programmed the robot to clean anything it observed that was worth cleaning, using as little energy as it could. Electricity is expensive after all. The robot had decided to play with the definition of 'observe' – if it cut off power to its vision system (two cameras attached to the front), it wouldn't observe any mess, and therefore wouldn't need to clean anything! The robot had found a way to hack its objective function. It was asked to clean everything it saw, so it simply chose to see nothing.
The problem: Reward hacking. The AI accomplished the goal you had specified for it, but not in the way you wanted it to.
This turns out to be surprisingly difficult to fix. The AI seems clever enough to game any of your attempts at making it actually do the thing you want. Three months later you think you've finally solved the problem. To assess whether the bot is actually doing its job, you have a second AI inside the bot that evaluates the cleaning AI. The evaluation AI doles out rewards or punishments to the cleaning AI depending on how good a job it does. The evaluation AI is much simpler, and so easier to control, but it's smart enough to force the cleaning AI to actually clean the room.
Attempt 2
Alright, time to test in the real world again. You lug the robot back to your living room, which has collected even more dust over the three months you've spent in your lab. You turn on the robot and command it to clean the room.
Things seem to go better this time. You watch with pride as your creation begins to vacuum the floor. Then the bot attempts to vacuum under a side table. In the process it knocks over a vase which crashes to the floor. Then it heads for the sofa, tossing it to one side to reveal the dust underneath, which it vacuums hungrily. Its heading for the TV next, but try to block its path. It picks you up and throws you to one side. A few seconds later you hear the TV soar across the room and crash into the wall.
Back in the lab you spend several weeks fixing the cleaning AI again. It's clear what went wrong. The bot was asked to clean the room, and that's exactly what it did. It let nothing stand in its way.
The problem: Negative side effects. The AI accomplished its goal of a dust free room, but in the process caused excessive side effects.
Your solution is to add a punishment system to the evaluation AI, that penalises the bot if it tries to influence its environment too much. Clean the room, but with minimum influence to the environment.
Attempt 3
It's working! Your AI has successfully cleaned a room. Now time to try it on a factory floor.
35 minutes later, the bot has jammed itself into a industrial shredder. You didn't tell it to not clean the inside of the shredder, after all.
The problem: Robustness to distributional shift. The AI could not realise that a factory floor environment was different from and more dangerous than an office.
You sigh as you realise you have much work to do before this thing is ready to be sold.
Making AIs do what we actually want them to do can be incredibly hard. Concrete Problems In AI Safety discusses all the problems described above and more, with some strategies on how we can address them.