This still needs cleanup, in particular so that the texture size is
passed around properly: a few unnecessary '128x256' magic numbers are
still scattered through the code.
I think the original version is more readable, but the newer version
is better overall because it never reads from dst and can write
directly to a 32-bit dst.
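
As a rough sketch of the cleanup mentioned above (not the actual code),
the texture size could be passed in explicitly instead of being
hard-coded as 128x256. Everything named here is hypothetical: the
TexSize struct, the expand_indexed_to_32 function, and the assumption
that the source holds 8-bit palette indices. The loop only writes to
the 32-bit dst and never reads it back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size descriptor, so dimensions travel with the call
 * instead of living in the code as magic numbers. */
typedef struct {
    size_t width;
    size_t height;
} TexSize;

/* Hypothetical conversion: expand 8-bit palette indices into 32-bit
 * pixels. dst is write-only; each element is stored exactly once. */
static void expand_indexed_to_32(uint32_t *dst,
                                 const uint8_t *src,
                                 const uint32_t *palette,
                                 TexSize size)
{
    assert(dst != NULL && src != NULL && palette != NULL);

    const size_t count = size.width * size.height;
    for (size_t i = 0; i < count; i++) {
        dst[i] = palette[src[i]];
    }
}
```

A caller handling the current texture would pass (TexSize){128, 256}
(assuming 128 is the width), which keeps those numbers in one place.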