What you are doing in the software may be of vital importance. I don't know the layout of the panels you have, but I'd guess serpentine: the first row runs in one direction, then the second row runs in the opposite direction. It makes manufacture easier. Unfortunately, that breaks the simple trick of using a double-length buffer and moving the start point within it, because a shift along the strip moves alternate rows in opposite directions.
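To make the serpentine guess concrete, here is a minimal sketch of the coordinate mapping. It is in Python for readability; on the MCU it would be a few lines of C. The function name and the assumption that even rows run left to right (row 0 first) are mine, so check them against how your panels are actually wired.

```python
def serpentine_index(x, y, width):
    """Map panel coordinates (x, y) to a position along the LED strip.

    Assumption: even rows run left to right, odd rows run right to left.
    """
    if y % 2 == 0:
        return y * width + x
    return y * width + (width - 1 - x)


# On a 4-wide panel the second row is wired backwards, so the pixel
# directly below strip position 3 is strip position 4, not 7.
row0 = [serpentine_index(x, 0, 4) for x in range(4)]  # [0, 1, 2, 3]
row1 = [serpentine_index(x, 1, 4) for x in range(4)]  # [7, 6, 5, 4]
```

Because the rows alternate direction, adding 1 to every strip index slides row 0 one way and row 1 the other, which is why a plain offset into a double-length buffer does not scroll a serpentine panel.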
Next idea: treat the drawing buffer as a ring buffer, but in both the X and Y directions. To scroll, you change the start X/Y pointer and redraw only the part between the old and new locations of the start point. A routine will need to be written to copy from the start point to the output buffer, wrapping around to the beginning of the physical buffer when it reaches the end. I'd also look into double buffering the output buffer. That way one buffer can be sent to the display by DMA while the other is being filled.
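A rough sketch of that two-dimensional ring buffer, in Python for readability. The class and method names are hypothetical, the output buffer is assumed to be a flat row-major list, and DMA and double buffering are left out.

```python
class ScrollBuffer:
    """Drawing buffer that wraps around in both X and Y."""

    def __init__(self, width, height, fill=0):
        self.width = width
        self.height = height
        self.pixels = [[fill] * width for _ in range(height)]
        self.start_x = 0  # physical column shown at display x = 0
        self.start_y = 0  # physical row shown at display y = 0

    def scroll(self, dx, dy):
        """Move the start point; no pixel data is copied."""
        self.start_x = (self.start_x + dx) % self.width
        self.start_y = (self.start_y + dy) % self.height

    def set_pixel(self, x, y, value):
        """Write a pixel at coordinates relative to the start point."""
        px = (self.start_x + x) % self.width
        py = (self.start_y + y) % self.height
        self.pixels[py][px] = value

    def copy_to_output(self, out, out_width, out_height):
        """Copy the visible window into a flat row-major output buffer."""
        for y in range(out_height):
            row = self.pixels[(self.start_y + y) % self.height]
            for x in range(out_width):
                out[y * out_width + x] = row[(self.start_x + x) % self.width]
```

To scroll left by one column when the buffer is the same width as the display, call `scroll(1, 0)` and then redraw only the column at `x = width - 1` with `set_pixel`; everything else is already in place.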
A few notes:
* The drawing buffer can be larger than the display.
* With a drawing buffer larger than the display, the area being redrawn can lie outside the part that is currently being displayed.
* With the drawing buffer as an X/Y array, it should be possible to scroll up and down and left and right.
* If you want to have both fixed and scrolling areas, they can be implemented with separate drawing buffers. The copy to output buffer routine then selects which one to grab pixels from based on where it is writing into the output buffer.
* I've heard of this technique being used decades ago, when computers were much slower. XWindows has routines to handle this, and I know the technique existed before XWindows did.
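The note about fixed and scrolling areas can be sketched like this. The layout is an assumption made for illustration: the top `fixed_rows` rows come from a fixed buffer and the rest from a scrolling buffer that is wider than the display and wraps horizontally. The function and parameter names are hypothetical.

```python
def compose(fixed, scrolling, start_x, fixed_rows, out_width, out_height):
    """Build one flat row-major output frame from two drawing buffers."""
    out = [0] * (out_width * out_height)
    scroll_width = len(scrolling[0])
    for y in range(out_height):
        for x in range(out_width):
            if y < fixed_rows:
                # Fixed area: copied straight through, never scrolls.
                pixel = fixed[y][x]
            else:
                # Scrolling area: offset by the start point, with wrap.
                pixel = scrolling[y - fixed_rows][(start_x + x) % scroll_width]
            out[y * out_width + x] = pixel
    return out


fixed = [[9, 9, 9]]            # one fixed row, 3 pixels wide
scrolling = [[1, 2, 3, 4, 5]]  # one scrolling row, wider than the display
frame = compose(fixed, scrolling, 3, 1, 3, 2)
```

Here `frame` holds the fixed row followed by the scrolling row starting at column 3 and wrapping back to column 0.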
In regards to slow display scrolling: yes, I've had that before, but usually it's because of a non-optimized display library. Also, SPI usually will not work at speeds above 20 MHz, so there is no point in pairing a serial display with a 120 MHz MCU. Have a look at the datasheet for the top speed of the I2C/SPI buses. If you're using a parallel display driver, that's when it all starts to work really fast. You can also use the older SAM4L MCUs if you want high CPU frequency, an FPU and other things combined with low-power features.
I'll second all of this. My setup is much different, and has far fewer LEDs. I'm not familiar with the drive requirements for your display panels, but I'm assuming an array of the panels you mentioned. My MCU dev board is a Teensy 3.6, which has an MK66FX1M0VMD18 Cortex-M4F ARM MCU at 180 MHz. On a 288-LED APA102 array I get almost 1000 calculate-and-display cycles per second when free running, using simple integer math and very simple display calculations. Half my display is written at 8 MHz and the other half at 4 MHz by the FastLED library. They are on two different SPI buses. I normally use floats to do all my arithmetic for HSV selection and conversion to RGB, and my displayed images are much more complex than the simple integer test. That slows the update rate down to around 400 updates per second when free running. Normally I use a 120 Hz update cycle. Most of my newer drawing objects are defined by mathematical functions, so I do a lot of floating point work. For them I just calculate the function values for HSV at each point on the display. There is no mixing of adjacent pixels for those objects, except at their edges. I need to work on the display speed, and make sure DMA is being used.
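As an illustration of drawing objects defined by functions, here is a small Python sketch that evaluates HSV at each display point in floating point and converts to 8-bit RGB. The particular pattern (hue sweeping across X, brightness following a sine in Y) is made up for the example and is not one of my actual objects. `colorsys.hsv_to_rgb` is from the Python standard library; on the Teensy the conversion would be done in C++.

```python
import colorsys
import math


def render(width, height, t):
    """Return a row-major list of (r, g, b) tuples, each channel 0..255."""
    frame = []
    for y in range(height):
        for x in range(width):
            # Hypothetical pattern: hue varies with x and time,
            # brightness varies with y.
            hue = (x / width + t) % 1.0
            value = 0.5 + 0.5 * math.sin(2.0 * math.pi * y / height)
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            frame.append((round(r * 255), round(g * 255), round(b * 255)))
    return frame


frame = render(4, 2, 0.0)
```

Every pixel costs a few floating point operations plus the HSV conversion, which is where the update rate goes compared with the integer test.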
For hooking multiple APA102 strings to one SPI port I use TI 74HCT125 quad buffer chips. Powered from 5 VDC, they accept 3.3 VDC logic levels on their inputs. Check the specs on them; I do think it is the HCT line you need.
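To see how the SPI clock caps the update rate of one APA102 string, here is a back-of-the-envelope sketch in Python. It assumes the usual APA102 framing: a 32-bit start frame, 32 bits per LED, and an end frame of at least one extra clock bit per two LEDs. Rounding the end frame up to whole bytes is my assumption about how a driver pads it. The result is an upper bound only; calculation time and gaps between transfers come on top.

```python
import math


def apa102_frame_bits(num_leds):
    """Bits clocked out for one full update of an APA102 string."""
    start_bits = 32
    led_bits = 32 * num_leds
    # At least num_leds / 2 extra clock bits, padded to whole bytes.
    end_bits = 8 * math.ceil(num_leds / 16)
    return start_bits + led_bits + end_bits


def max_updates_per_second(num_leds, spi_hz):
    """Upper bound on updates per second from SPI transfer time alone."""
    return spi_hz / apa102_frame_bits(num_leds)
```

For example, a hypothetical 256-LED string at 8 MHz needs 8352 bits per update, so the bus alone limits it to roughly 958 updates per second.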
For larger displays I'm thinking of dedicating 2 CPU cores to calculating the display on a multi-core ARM chip like the one used in the Raspberry Pi. I'll have another core that handles updating the display and gives the marching orders to the calculation cores. The last core will be left for running Raspbian, reading the GPS, syncing time with time servers on the net, etc. Raspberry Pis have a nice fast SPI port that can drive up to 32 MHz. I just wish I could get a 4-core one in a Pi Zero form factor. The normal format is too large for some of my artwork.