You can use the RTL viewer to see what kind of logic is generated from your code, but the main problem is probably that the synthesizer can't use the hardware RAM blocks in the FPGA and has to implement the arrays using LUTs only. That already means roughly one LUT per bit, so 4096 LUTs plus the address decoding logic. If you do this in several different places, and they each run in parallel, each implementation will have its own set of address decoding LUTs. And if the logic gets too complicated, the synthesizer may also have to duplicate some of it to meet the timing requirements.
If you need to reduce LUT usage, you'll have to look into using the RAM blocks to store the arrays. Depending on the synthesizer you use, you may have to write your code in a certain way (for example, register the input and/or the output, and limit access to two ports or fewer) or instantiate a pre-existing RAM IP block in your code. You will probably have to redesign your code along those lines, to adjust to the extra latency from the input/output registering, and try to centralize access to each array instead of spreading it over several places.
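To get a feel for the extra latency before touching the HDL, here is a minimal behavioural sketch in Python of a single-port RAM with a registered output. It is only a software model, not synthesizable code, and the names `SyncRam` and `step` are made up for this illustration. It also assumes read-first behaviour on a simultaneous read and write, which may not match the RAM blocks or settings of your particular device.

```python
class SyncRam:
    """Behavioural model of a single-port RAM with a registered read output.

    Hypothetical helper for illustration only; real block RAMs differ in
    read-during-write behaviour and in how many register stages they have.
    """

    def __init__(self, depth, width):
        self.mask = (1 << width) - 1
        self.mem = [0] * depth
        self.q = 0  # output register, updated on every clock edge

    def step(self, addr, we=False, din=0):
        """Simulate one clock cycle and return the data visible during it."""
        visible = self.q             # result of the address from the previous cycle
        self.q = self.mem[addr]      # clock edge: capture read data (read-first, assumed)
        if we:
            self.mem[addr] = din & self.mask
        return visible


ram = SyncRam(depth=16, width=8)

# Three write cycles fill addresses 0..2.
for addr, value in enumerate((10, 20, 30)):
    ram.step(addr, we=True, din=value)

# Four read cycles: each result shows up one cycle after its address.
reads = [ram.step(addr) for addr in (0, 1, 2, 0)]
print(reads)
```

The data for each address only becomes visible in the following cycle, so any logic that used to index the array combinationally needs to be rescheduled around that one-cycle delay. Funnelling all accesses through one such port (or two, for a dual-port RAM) is what lets the synthesizer map the array onto a RAM block.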