At some point I'll get around to reading the paper, but meanwhile Jay Alammar has managed to explain how GPT-3 works in a short Twitter thread.

It doesn't seem all that different from a typical language model. It's trained to predict the next token given a sequence of tokens. The sequence of tokens (roughly words, or fragments of words) serves as context for deciding what to generate next, and can be up to 2,048 tokens long.
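As a rough illustration of that loop, here's a minimal sketch of autoregressive generation. `predict_next_token` is a hypothetical stand-in for the model itself, which is only available behind OpenAI's API:

```python
CONTEXT_WINDOW = 2048  # maximum number of tokens the model can attend to

def generate(prompt_tokens, predict_next_token, max_new_tokens=50):
    """Repeatedly predict the next token and append it to the context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The model only ever sees the most recent 2,048 tokens.
        context = tokens[-CONTEXT_WINDOW:]
        next_token = predict_next_token(context)
        tokens.append(next_token)
    return tokens
```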

The main innovation seems to be the scale: 175 billion parameters, an estimated $4.6M to train, and that 2,048-token context window. It's really impressive (and a little scary) that such a crude technique can be scaled up to deliver results this good.