The ATmega is running a USB stack + virtual Ethernet stack + Minecraft server.
The major RAM consumers are probably the USB buffers, Ethernet buffers and server connections.
The protocols are all implemented in flash, and 32 kB sounds like plenty for simple USB + TCP/IP + his own protocol on top of TCP/IP.
The server may be limited to only 1 client, so he doesn't have to share other players' coordinates. That could simplify the implementation.
He isn't building anything, so the world is probably loaded from flash and, except for switch levels, never has to change.
My estimates of RAM usage:
USB is probably 2x8 bytes for endpoint 0 (enumeration & setup), and 2x64 bytes for endpoint 1 (assuming you only need 1 for a virtual USB Ethernet device). That's 16 + 128 = 144 bytes of RAM.
TCP/IP could require about 1518 bytes for a full Ethernet frame according to spec. But if the Minecraft protocol is very simple, there is little need for a buffer that big. He may very well be getting away with a 512-byte buffer. Sure, packets will be dropped/the controller will crap itself if you send more than 512 bytes per packet, but that's not the point of his video.
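To illustrate the "drop it if it's too big" idea, here is a rough sketch of a fixed 512-byte receive buffer that is filled from USB-sized chunks and throws away any frame that doesn't fit. This is only my guess at how such a thing could look, not how his code works; `RX_BUF_SIZE`, `rx_begin`, `rx_chunk` and `rx_end` are names I made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 512u          /* assumed buffer size from the estimate above */

static uint8_t  rx_buf[RX_BUF_SIZE];
static uint16_t rx_len;           /* bytes stored for the current frame */
static uint8_t  rx_overflow;      /* set once the current frame outgrew rx_buf */

/* Call at the start of every incoming frame. */
void rx_begin(void)
{
    rx_len = 0;
    rx_overflow = 0;
}

/* Append one chunk (e.g. one 64-byte USB packet) to the current frame. */
void rx_chunk(const uint8_t *data, size_t n)
{
    if (rx_overflow)
        return;                   /* frame already marked as too big */
    if (n > (size_t)(RX_BUF_SIZE - rx_len)) {
        rx_overflow = 1;          /* does not fit: drop the whole frame */
        return;
    }
    memcpy(rx_buf + rx_len, data, n);
    rx_len += (uint16_t)n;
}

/* Call at the end of the frame: returns its length, or 0 if it was dropped. */
uint16_t rx_end(void)
{
    return rx_overflow ? 0 : rx_len;
}
```

Anything up to 512 bytes comes through intact; a larger frame is silently discarded and TCP on the other side would just have to retransmit (or give up).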
On a chip with 1 kB of RAM, 144 bytes of USB buffers plus a 512-byte frame buffer leave 368 bytes for other stuff. You may need a few bytes for a TCP/IP client, but I'm sure it doesn't have to be more than 64 bytes for a very simple implementation.
So it sounds very reasonable that it can be done.
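The same budget written out as C constants, in case anyone wants to play with the numbers. All the sizes are my estimates from above, not values taken from his firmware, and the names are made up.

```c
#include <stdint.h>

enum {
    RAM_TOTAL     = 1024,    /* assumed: 1 kB of SRAM */
    USB_EP0_BUFS  = 2 * 8,   /* endpoint 0: two 8-byte buffers */
    USB_EP1_BUFS  = 2 * 64,  /* endpoint 1: two 64-byte buffers */
    NET_FRAME_BUF = 512,     /* cut-down frame buffer instead of a full frame */
    TCP_CLIENT    = 64       /* guessed state for a single TCP client */
};

/* RAM left once the USB and network buffers are allocated. */
int16_t ram_left_after_buffers(void)
{
    return RAM_TOTAL - (USB_EP0_BUFS + USB_EP1_BUFS) - NET_FRAME_BUF;
}

/* RAM left once the single client's state is also accounted for. */
int16_t ram_left_after_client(void)
{
    return ram_left_after_buffers() - TCP_CLIENT;
}
```

`ram_left_after_buffers()` gives 368 and `ram_left_after_client()` gives 304, and that remainder has to cover the stack and every other variable.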
In fact:
https://github.com/cnlohr/avrcraft