Truly impressive work - out of interest how much time would you get from instruction to completion for a project like this?
This was a fairly short timescale for something on this scale, being installed in such a difficult location - I think it kicked off about last October, & install started early May, so about 6 months. The biggest problem is usually getting a definite "Go" from the end client to start manufacturing, as it often needs signoff from all sorts of different people, many of whom rarely appreciate leadtime issues. For stuff on this scale you also have things like signoff from structural engineers and the people in charge of the building. Leadtime on this was a particular issue as carbon fibre has fairly long cure times, and the manufacturer has finite autoclave capacity. They were working round the clock for at least a couple of weeks.
I'm just wondering whether you get a lot of test time on the bench before everything gets hoisted into the air.
I do try to insist on some burn-in time where possible, but this very often disappears due to other parts of the project running late. Failures after install can be costly, but that is sometimes preferable to not being able to finish in time for an opening.
It can also be difficult because, as well as the actual fixtures, you need all the cabling, PSUs etc., and something as simple as having the physical space to set it all up can be a major problem.
One thing I always do at the design stage is test PSUs, and anything else that handles significant power, at full load for at least a day or two. PSUs are also never run at more than about 50-75% of their rating. Many PSU specs are somewhat optimistic - e.g. I once had a Traco DIN-rail unit where, running at 70% load, the air coming out of the top hit 75 deg.C at 25 deg.C ambient. Needless to say that didn't make it to the final install - we used a PULS unit that ran at about 40 deg.C.
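To put numbers on the 50-75% rule, here is a rough Python sketch of the derating arithmetic. The function names, the 90 W load and the list of PSU sizes are all made up for illustration; only the 50-75% ceiling comes from the rule above.

```python
def min_psu_rating_w(load_w, max_fraction=0.75):
    """Smallest PSU rating (watts) that keeps load_w at or below max_fraction of it."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    return load_w / max_fraction


def pick_psu(load_w, available_ratings_w, max_fraction=0.75):
    """Return the smallest available rating that satisfies the derating, or None."""
    needed = min_psu_rating_w(load_w, max_fraction)
    for rating in sorted(available_ratings_w):
        if rating >= needed:
            return rating
    return None


# Hypothetical example: a 90 W load and an invented list of PSU sizes.
ratings = [100, 120, 150, 240]
print(pick_psu(90, ratings, 0.75))  # running the PSU at no more than 75%
print(pick_psu(90, ratings, 0.50))  # running the PSU at no more than 50%
```

This only picks a size on paper - it doesn't replace the full-load soak test, since the whole point is that the datasheet rating can't be trusted on its own.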
My general rule on heating is the finger test - if you can't hold your finger on it, it's too hot.
Some things can be done at the design stage to reduce risk, e.g. designing an architecture which is at some level a number of identical parts which aren't dependent on others, so if one works, there is negligible risk that you will hit scaling issues.
A good example of this is wireless setups - if the architecture is broadcast, with receive-only nodes, you know you won't hit bandwidth issues with larger numbers of nodes, however architectures like mesh networks or poll-reply & message forwarding can be very hard to fully test, and tend to have exponential failure modes as traffic increases. They can also be a complete bitch to faultfind.
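A back-of-envelope way to see the scaling difference, as a Python sketch. The model is deliberately crude and is my own simplification: one transmission per update for broadcast, one poll plus one reply per node for poll-reply, with no retries, collisions or forwarding. The packet time and update rate in the example are invented numbers.

```python
def transmissions_per_update(n_nodes, scheme):
    """Packets on the air for one update of every node (simplified model)."""
    if scheme == "broadcast":
        return 1              # one packet, every node listens
    if scheme == "poll_reply":
        return 2 * n_nodes    # one poll and one reply per node
    raise ValueError("unknown scheme: " + scheme)


def airtime_us_per_second(n_nodes, scheme, packet_us, updates_per_s):
    """Microseconds of channel time used per second of wall-clock time."""
    return transmissions_per_update(n_nodes, scheme) * packet_us * updates_per_s


# Hypothetical numbers: 500 us packets, 30 updates per second.
for n in (10, 100):
    for scheme in ("broadcast", "poll_reply"):
        used = airtime_us_per_second(n, scheme, 500, 30)
        print(n, scheme, used, "OVER CAPACITY" if used > 1_000_000 else "ok")
```

Even this best-case poll-reply model runs out of channel at 100 nodes with these figures, while broadcast doesn't change at all. Real mesh or forwarding schemes add retries and relayed traffic on top, which is where the nasty failure modes come from.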
This is one reason I like one-way comms with no readback mechanism - there are far fewer things to go wrong, and for a light installation, it's obvious visually if something isn't working, unlike something like a sensor network, where you want to know the difference between "sensor not currently detecting anything" and "sensor broken or missing".
Where you do need 2-way comms, I like using a time-slotted protocol so bandwidth is very well defined, and USB latency issues can be minimised (e.g. you send a "give me data" command, and each node replies in a timeslot defined by its ID, so you effectively get a single packet back with all the nodes' data).
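As a rough illustration of the timeslot idea, here is a Python sketch of the timing sums. The assumptions are mine, not a description of any particular protocol: node IDs run 0..N-1 with no gaps, framing is 10 bits per byte (8N1), every node sends the same fixed-length reply, and there is a fixed guard gap between slots. All the names and the example figures are hypothetical.

```python
def slot_length_us(reply_bytes, baud, guard_us=40, bits_per_byte=10):
    """Length of one reply slot in microseconds, rounded up, plus a guard gap."""
    wire_bits = reply_bytes * bits_per_byte
    wire_us = -(-wire_bits * 1_000_000 // baud)  # ceiling division
    return wire_us + guard_us


def reply_offset_us(node_id, reply_bytes, baud, guard_us=40):
    """Delay from the end of the request to the start of this node's reply."""
    return node_id * slot_length_us(reply_bytes, baud, guard_us)


def cycle_time_us(n_nodes, reply_bytes, baud, guard_us=40):
    """Time for all nodes to answer one request."""
    return n_nodes * slot_length_us(reply_bytes, baud, guard_us)


# Hypothetical example: 16 nodes, 4-byte replies, 250 kbaud.
print(slot_length_us(4, 250_000))      # one slot
print(reply_offset_us(3, 4, 250_000))  # when node 3 starts talking
print(cycle_time_us(16, 4, 250_000))   # whole readback
```

The nice property is that the total readback time depends only on the node count, reply size and baudrate, so it can be worked out up front, and the host sees one request out and one contiguous burst back rather than a USB round trip per node.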
For this install, there is one thing you do need to read back, which is an indication of when the SPI flash has completed its erase, which can typically take 10-20 seconds. As this is only done occasionally, and manually, I handled it by having the controller display a bargraph on the LEDs indicating erase progress - the person doing the content uploading sends an erase command, and then watches for this to complete before programming.
The (much shorter) flash write times during programming are ensured by adding dummy bytes to the serial stream, so the baudrate guarantees sufficient write time.
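The dummy-byte count is just a ratio of write time to byte time. Here is a Python sketch of the sum, plus a padding helper. The specifics are assumptions for illustration only: 10 bits per byte on the wire, a 3 ms worst-case page program time as a placeholder (use the maximum from the flash datasheet), 0xFF as the filler value, and padding placed after each page. The actual stream format isn't described here.

```python
def dummy_bytes_needed(write_time_us, baud, bits_per_byte=10):
    """Number of filler bytes whose wire time covers write_time_us (rounded up)."""
    return -(-write_time_us * baud // (bits_per_byte * 1_000_000))


def pad_pages(data, page_size, n_dummy, dummy=0xFF):
    """Insert n_dummy filler bytes after each page of data (hypothetical framing)."""
    out = bytearray()
    for start in range(0, len(data), page_size):
        out += data[start:start + page_size]
        out += bytes([dummy]) * n_dummy
    return bytes(out)


# Hypothetical example: 3 ms page program time at 115200 baud.
n = dummy_bytes_needed(3000, 115_200)
print(n)
print(len(pad_pages(bytes(512), 256, n)))
```

The receiver simply throws the filler away. Because the sender can't push bytes faster than the baudrate allows, the gap is guaranteed without any handshake coming back the other way.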