Yeah, it more or less goes back to Philo T. Farnsworth scanning the CRT electron beam the way English text is read: lines scanned left to right, then top to bottom.
Probably if it had been an Arabic invention, it would scan right to left first; or if Chinese, maybe top to bottom(??).
Diversion:
The display is scanned in real time -- there is no memory, no buffering, nothing at all between the video generator and the screen. Once you understand an analog video signal, it's really quite a visceral thing: you can almost read the video straight off the oscilloscope without it being rasterized. It's easier to recognize patterns, contrasts and animations than exact imagery, of course, but sometimes that's enough.
This had been true for almost the entire history of television: what's being viewed in the studio, in real time, is in exact lockstep synchronization (save for the propagation delays through cables and radio transmitters) with what's appearing on television screens across the country -- indeed across the globe for international telecasts, within some limits (NTSC vs. PAL, for example, needs a scan converter of some sort, AFAIK usually nothing fancier than a camera watching a CRT).
Of course there was always the option of filming, and rebroadcasting the film reel -- an easy way to distribute edited programs and movies to affiliates across the country, allowing them to show material at location-appropriate times. As technology improved, somewhat cheaper tapes (rewritable!) showed up (in the 50s), giving similar benefits with shorter turn-around -- no film developing and production needed. Tape loops allowed sanitization of live video feeds. Later still (~70s), real-time digital signal processing became available, so that entire frames could be buffered, synchronized, mixed, scaled and composited, without having to synchronize (genlock) multiple sources together. Television networks began to link up with worldwide microwave and satellite feeds -- feeds which, in the early days (~80s), were sent in the clear, so that anyone with a suitably equipped satellite receiver could watch them! Come the present day, memory and processing are so cheap that video streams, heavily compressed, are stored automatically at regional cache servers, transmitted over wired and sometimes wireless networks, and parked in a local buffer for seconds at a time; finally, the video decoder and output chain buffer several more frames still, before the final result is sent to the physical display. Which still receives its video (albeit in a digital format)...in the same raster-scanned sequence as a century ago.
As for alternate coordinate systems -- probably there are more interesting examples from oddball tubes and systems. The first high-definition display was developed in the late 50s, believe it or not: by marrying two 10-bit DACs to a mainframe computer, you can steer a beam around fast enough to form images, with a total resolution of 1024 x 1024 addressable points! Of course, the analog bandwidth of such a system limits how many points or line segments can be drawn per refresh, a very different and much harsher drawback compared to the 525-line television of the time.
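To illustrate the principle, a point-plotting display driver boils down to a loop like this (a minimal sketch in C; the memory-mapped DAC registers here are made up, standing in for whatever I/O the real machine had):

#include <stdint.h>

/* Hypothetical memory-mapped 10-bit DACs driving the X/Y deflection
   amplifiers; the addresses are placeholders, not any real machine's. */
#define DAC_X (*(volatile uint16_t *)0xFF00)
#define DAC_Y (*(volatile uint16_t *)0xFF02)

struct point { uint16_t x, y; };    /* 0..1023 each */

/* One refresh pass over a display list of points: park the beam at each
   coordinate long enough to leave a visible dot, and repeat the whole
   list fast enough (30+ Hz) that the eye sees a steady image. */
void refresh(const struct point *list, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        DAC_X = list[i].x & 0x3FF;  /* 10 bits = 1024 positions */
        DAC_Y = list[i].y & 0x3FF;
        /* ...unblank the beam, wait for the deflection to settle... */
    }
}

The analog bandwidth bites in that settling step: every point costs real time, so the more you draw, the slower the refresh, until the image starts to flicker.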
RADAR sets used a polar coordinate system, although I don't know how many were purely electrical versus electromechanical (e.g., the CRT deflection yoke physically rotated, in sync with the RADAR antenna, to implement the azimuth scan). You might be able to implement image rotation with a solenoidal deflection coil (this is how a Tek 475 implements its trace rotation trim, for example), though I'm not sure how well that works out for large angles (the center and edge of the image may not rotate by the same amount, unavoidably bending a line scan into an 'S' shape?). Alternatively, a rather complicated polar-to-cartesian converter circuit could be used, which could still be analog (ah, the heady days of analog computing) and so not compromise the granularity or bandwidth too badly.
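For what it's worth, the digital version of that converter is just trigonometry; a trivial sketch (not any particular set's implementation):

#include <math.h>

/* Convert one radar return (range r, azimuth theta in radians, measured
   clockwise from north) into cartesian deflection values, normalized so
   r = 1.0 lands at the screen edge. The analog version does the same job
   with resolvers or sin/cos multiplier circuits. */
void polar_to_xy(double r, double theta, double *x, double *y)
{
    *x = r * sin(theta);   /* east  */
    *y = r * cos(theta);   /* north */
}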
Some chipsets scanned memory in peculiar ways. The ZX Spectrum, for example, divided its bitmap into three regions (top, middle, bottom), and within each third the scanlines are interleaved by character row rather than stored top to bottom; the colour attributes (foreground/background for each 8x8 cell) live in a separate, linearly addressed block. (In graphics, "linear" always refers to a simple analog-style raster order.)
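The bitmap address works out to a bit-shuffle of the Y coordinate; an illustrative helper in C (the layout itself is well documented, the function is just my sketch):

#include <stdint.h>

/* ZX Spectrum: address of the byte holding pixel (x, y), with x = 0..255
   and y = 0..191. Y's bits get permuted: bits 6-7 pick the screen third,
   bits 0-2 (scanline within a character row) land high, and bits 3-5
   (the character row) land low -- hence the interleaved appearance. */
uint16_t spectrum_bitmap_addr(uint8_t x, uint8_t y)
{
    return 0x4000
         | ((uint16_t)(y & 0xC0) << 5)   /* third of the screen  */
         | ((uint16_t)(y & 0x07) << 8)   /* scanline within row  */
         | ((uint16_t)(y & 0x38) << 2)   /* character row        */
         | (x >> 3);                     /* byte within the line */
}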
I do know the IBM-PC-compatible systems very well: CGA, EGA and VGA. Text modes are character-based (effectively, a fixed grid of sprites, laid out in linear fashion), selecting from 256 characters. Each character has a foreground and background color (choosing from 16 and 8 options respectively) and a flashing attribute, for a total of 16 bits per character. Typical modes are 40x25, 80x25, 80x43, etc.
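Writing to text memory is about as direct as it sounds; e.g., under DOS with a far pointer (a minimal sketch in Borland/Turbo C style -- other compilers spell the far-pointer business differently):

/* Color text VRAM lives at segment B800h. Each cell is two bytes:
   the character code, then the attribute byte
   (blink | 3-bit background | 4-bit foreground). */
unsigned char far *text = (unsigned char far *)0xB8000000L;

void put_char(int col, int row, char c, unsigned char attr)
{
    unsigned int off = (row * 80 + col) * 2;   /* assuming an 80-column mode */
    text[off]     = (unsigned char)c;
    text[off + 1] = attr;                      /* e.g., 0x1E = yellow on blue */
}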
CGA is linear, and free to access -- but accessing video RAM during the active scan is likely to cause "snow" (the CPU access steals memory cycles from the display refresh, so garbage bits get read into the palette decoder and onto the screen). 16 colors in text, 4 colors at 320x200 (linear pixel raster), 2 colors at 640x200.
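The usual workaround was to wait for retrace before touching VRAM, by polling the CGA status register at port 3DAh -- something like this (Borland-style port I/O assumed):

#include <dos.h>   /* inportb(); Borland/Turbo C */

/* Block until the start of vertical retrace: first wait out any retrace
   already in progress, then wait for the next one to begin. Bit 3 of
   the status register (port 0x3DA) is set while vertical retrace is
   active, and the CPU can access VRAM then without causing snow. */
void wait_vretrace(void)
{
    while (inportb(0x3DA) & 0x08)
        ;   /* in retrace now: let it finish      */
    while (!(inportb(0x3DA) & 0x08))
        ;   /* wait for the next retrace to start */
}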
EGA introduced bit planes, where each color channel (R, G, B and intensity) is bit-linear (1 byte = 8 consecutive pixels), and memory is read from four different locations to make each run of eight pixels. This sounds kind of horrible, but it's both better and worse: I/O is mediated by the controller, which allows simultaneous writes to any combination of planes, with logic operations. You don't have sprites moving over a background -- there's only the one frame of memory that's drawn to the display -- but you can emulate that with masking and compositing (at the expense of CPU and I/O cycles, so it goes rather slowly).
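As a sketch of that mechanism (Borland-style I/O again; a real program would also configure the write mode and ALU function first):

#include <dos.h>   /* outportb(); Borland/Turbo C */

unsigned char far *vram = (unsigned char far *)0xA0000000L;

/* Sequencer Map Mask (index 2 at ports 3C4h/3C5h): each of the low four
   bits enables one plane (blue, green, red, intensity). One CPU write
   then lands in every enabled plane at once -- eight pixels' worth of
   bits in all four planes for the cost of a single byte write. */
void write_planes(unsigned int offset, unsigned char planes, unsigned char bits)
{
    outportb(0x3C4, 0x02);     /* select the Map Mask register */
    outportb(0x3C5, planes);   /* e.g., 0x0F = all four planes */
    vram[offset] = bits;       /* 8 consecutive pixels         */
}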
VGA introduced higher resolutions, a few hardware touch-ups (e.g., being able to, y'know, read back the I/O register states?), and a 256-color mode that's linear on its face, but actually chained to the internal planes. Downside: a lot of VRAM is wasted (you don't get any hidden pages that you can write to while viewing a different, static memory area). This led to the development of "Mode X", the unchained version of this mode, where pixel x mod 4 selects the plane, and within each plane consecutive bytes step 4 pixels at a time: byte 0 of plane 0 is (0,0), byte 1 is (4,0), byte 2 is (8,0), and so on, an entire 320-pixel row taking 80 bytes; planes 1-3 hold the pixels at x = 1, 2, 3 (mod 4) at the same offsets. This seems terribly inconvenient, but when you can organize writes into columns it's no imposition (which happens to be what WOLF3D, DOOM and others did naturally), and because the memory isn't fragmented, you get two (or more) whole pages, so you can double-buffer the output to eliminate frame tearing (draw to the invisible page while viewing the active one; when done, swap them and redraw).
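In code, the plane arithmetic makes a Mode X putpixel look something like this (a sketch; the mode setup -- unchaining via the Sequencer's Memory Mode register -- is omitted):

#include <dos.h>   /* outportb(); Borland/Turbo C */

unsigned char far *vram = (unsigned char far *)0xA0000000L;

/* Mode X putpixel: x mod 4 picks the plane (via the Sequencer Map Mask),
   and within a plane, consecutive bytes are 4 pixels apart, so a 320-wide
   row occupies 80 bytes. For double buffering, alternate 'page' between
   0 and 16000 and point the CRTC start address at the finished page. */
void putpixel(unsigned int page, int x, int y, unsigned char color)
{
    outportb(0x3C4, 0x02);           /* select the Map Mask register */
    outportb(0x3C5, 1 << (x & 3));   /* enable plane x mod 4         */
    vram[page + y * 80 + (x >> 2)] = color;
}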
Which is the reason for a peculiar fault, speaking of DOOM: the Venetian Blind crash. When a particularly nasty exception occurred (one with an unimplemented handler), the protected-mode backend would ragequit back to DOS without cleaning up after itself. That left interrupts (timer, keyboard) hooked, so the prompt was unresponsive and you had to reset; and it left the video in its unchained ("Mode X") state, which was funny because the error message emitted (and dutifully printed by DOS, and the video BIOS routine in turn) just became colorful vertical lines of gibberish!
If you're curious and want more depth, there's quite a lot of info on the various sprite systems used through the 80s and 90s -- the C64, the NES and SNES, the Master System/Genesis, even into the 3D era with the PS1, N64, and to a lesser extent the GBA or DS, as well as the many ways modern GPUs model scenes and composite graphics.
Tim