Author Topic: Airbus finds control systems in the A320 family to be affected by solar activity


Online mawyatt

  • Super Contributor
  • ***
  • Posts: 5161
  • Country: us
I seem to remember that many years ago Xilinx had trouble with a flip-chip package where the internal solder used lead that was too "new"; the decay flipped a bit roughly every 20 minutes, I think.

I believe this was related to radioactive contamination of the lead in the solder. I recall folks using very old lead, since the short half-life of the radioactive contaminants meant the activity had largely died away. I also remember a story about searching out and using very old cannon balls, which were lead and obviously "very old"!!

Another issue with radioactive contamination occurred even longer ago when the US memory chips were causing issues (flipped bits), and the Japanese memory chips didn't exhibit this behavior. Everyone jumped on the bandwagon blaming US semiconductor manufacturing as "inferior".

Intel discovered the root cause and published it: the ceramic package contained small traces of radioactive material, and because the package ceramic was in close proximity to the memory chip it could cause bit flipping!!  Credit to Intel, as they could have made significant $ from this discovery by keeping it quiet and just switching package sources.

3M was the US ceramic source for the package and Kyocera was the Japanese source. Kyocera's ceramic didn't have the trace radioactive materials; 3M's did!! 3M quickly "cleaned up" their ceramic and everyone's (US and Japanese) memory chips behaved as expected!!

Best
Curiosity killed the cat, also depleted my wallet!
~Wyatt Labs by Mike~
 
The following users thanked this post: tom66, MK14

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote
And if the potential was there, I'm surprised it wasn't factored into the design from the outset.

It would have been, and the fact that they're fixing it with a rollback would suggest that they made some changes that didn't deal with faults as well as the original code did.

Quote
I'm a little surprised that the remedy is being touted as a software fix, as this must surely be a hardware issue at heart?

That's fairly standard; all the radiation-tolerant stuff I've worked with uses COTS industrial-grade components with software mitigation for faults.  Rad-hard components are (1) eyewateringly expensive, (2) a giant PITA to get because of export controls, and (3) thoroughly obsolete.  Given the amount of fault mitigation present in current hardware (data-centre and workstation CPUs, for example, are essentially rad-hard components) it makes a lot more sense to mitigate in software than to go through the excruciating pain and expense of dealing with rad-hard hardware.
 

Online tom66

  • Super Contributor
  • ***
  • Posts: 8104
  • Country: gb
  • Professional HW / FPGA / Embedded Engr. & Hobbyist
SpaceX are using conventional x86 processors in their rockets in a triplicate configuration for flight control.  The processors are kept in lockstep (somehow) and the outputs are compared, with a master controller determining which computer gets voted off if something goes wrong.  It doesn't seem to cause them issues.
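
Conceptually the voting stage itself is simple - here's a minimal C sketch of a 2-of-3 voter (my own illustration with made-up names, not SpaceX's actual code):

Code: [Select]
#include <stdint.h>

/* Illustrative 2-of-3 voter. Each flight computer produces a command
 * word; any value that at least two computers agree on passes through,
 * and a dissenting channel is flagged so it can be voted off. */
static uint32_t vote_2oo3(const uint32_t cmd[3], int *dissenter)
{
    *dissenter = -1;                      /* -1: all three agree */
    if (cmd[0] == cmd[1]) {
        if (cmd[0] != cmd[2])
            *dissenter = 2;
        return cmd[0];
    }
    if (cmd[0] == cmd[2]) { *dissenter = 1; return cmd[0]; }
    if (cmd[1] == cmd[2]) { *dissenter = 0; return cmd[1]; }
    *dissenter = -2;                      /* no majority at all */
    return cmd[0];                        /* real code would fail over */
}

The hard part isn't the vote, it's keeping three free-running CPUs in lockstep so their outputs are comparable at the same instant.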
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: mawyatt
Intel discovered the root cause and published it: the ceramic package contained small traces of radioactive material, and because the package ceramic was in close proximity to the memory chip it could cause bit flipping!!  Credit to Intel, as they could have made significant $ from this discovery by keeping it quiet and just switching package sources.

It was discovered by Tim May at Intel, who found that thorium and uranium naturally present at ppm levels in the ceramic clay were causing SEUs (single event upsets) through alpha particle emissions.  A switch to plastic packaging solved the problem.  May used the financial compensation from helping save Intel to retire early and help start the cypherpunk group alongside Eric Hughes and John Gilmore.
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: tom66
SpaceX are using conventional x86 processors in their rockets in a triplicate configuration for flight control.  The processors are kept in lockstep (somehow) and the outputs are compared, with a master controller determining which computer gets voted off if something goes wrong.  It doesn't seem to cause them issues.

They're some of the ones that are essentially rad-hard; they can take astounding doses of gamma rays without blinking (I'd have to look up the figures if anyone wants them).  There was an experiment done about 25 years ago, when Pentium IIIs were current, where a lab wanted to evaluate their radiation resistance.  What failed wasn't the exposed PIII but the VRM driving it, which wasn't exposed to the gamma-ray beam but was hit by enough backscatter to disturb its operation.
 

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 5398
  • Country: dk
Quote from: mawyatt
I seem to remember that many years ago Xilinx had trouble with a flip-chip package where the internal solder used lead that was too "new"; the decay flipped a bit roughly every 20 minutes, I think.

I believe this was related to radioactive contamination of the lead in the solder. I recall folks using very old lead, since the short half-life of the radioactive contaminants meant the activity had largely died away. I also remember a story about searching out and using very old cannon balls, which were lead and obviously "very old"!!

Cannon balls were usually iron, not lead, but they were salvaging lead ballast from old shipwrecks, and also pre-1945 steel.
There were companies replacing the lead roofs on old buildings just to get the old lead.


 

Offline negativ3

  • Regular Contributor
  • *
  • Posts: 179
  • Country: th
Sounds like something was broken by a hastily rolled out update.
Bug fixes are rolled out waaay too fast in many industries without full/documented system testing, especially fundamental operation testing.

Tweaking a bit of code "over here" shouldn't break functionality "over there", but it is amusing how often it can and does.
 

Online mawyatt

  • Super Contributor
  • ***
  • Posts: 5161
  • Country: us
The IBM PowerPC was in a CMOS SOI process and inherently somewhat rad-hard. We tried unsuccessfully to get our folks to look into this, but they had already invested heavily in a GaAs-type CMOS process and a custom Space Processor, which eventually died a painful financial death!!

Iridium used the PowerPC, but I don't know if it was Motorola's or IBM's version; I don't think Motorola had an SOI CMOS process like IBM tho.

Also recall in the early 90s some folks buying up all of Intel's old 8086-something-based wafers and making the 3-orthogonal (XYZ) voting Space Processors using these Intel chips. They created a custom "voting" chip in a Space Qualified Rad Hard process (Honeywell RICMOS) and got these Processors Space Qualified!!

Best
Curiosity killed the cat, also depleted my wallet!
~Wyatt Labs by Mike~
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 5245
  • Country: gb
  • Doing electronics since the 1960s...
I am sure the "solar flare" is BS and this is just another rare bug. According to reports, exactly the same thing has been seen before, c. 2008.

Autopilot / flight control system certification is a funny business. Obviously the mfgs do their best, especially on airliners, but for a fine example of a cocked-up autopilot, google kfc225 :) And that was done by Honeywell, who do tons of bizjet and airliner stuff. I spoke to one of the guys who worked on that (c. 2000), and to make the cert easier they used a separate .c file for each function, so if they had to edit one function they would have to submit much less to the FAA. Obviously the resulting code was pretty much unreadable, and the serious servo burnout issue was never fixed (partly because anybody who could do anything had retired or gone to Garmin after the Honeywell takeover).

When Airbus went to FBW many years ago they (IIRC) got multiple teams to write the dual-redundant stuff in different languages. But we all know that coders, writing to the same spec, will make the same mistakes, so this basically guards against defective compilers more than anything else.

The only assurance of a lack of bugs in really complex stuff is millions of hours of trouble-free operation.
« Last Edit: December 02, 2025, 10:12:45 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: peter-h
But we all know that coders, writing to the same spec, will make the same mistakes, so this basically guards against defective compilers more than anything else.

It wasn't just the software; the hardware was diversified as well.  It's been a long time since I read the (very dry) design docs, but from memory one of the CPUs was an 88000 and the other I think was an 80186, with the hardware built by different companies.  Everything was duplicated: two power buses, two generators with an emergency backup, and a few things were in triplicate with voting.  They also had things like hydraulic systems in triplicate (red, green, and blue circuits), or duplicated in some cases.

Disclaimer: My neurons are not duplicated so I may have remembered some of the above wrong.
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
A friend just commented that the Russian authorities have said the remaining Airbuses there are safe from this problem because it's like the situation where you're still running Windows 7 and the problem is with Windows 10  :-DD.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 16978
  • Country: fr
If you have 3 parallel systems written in different languages and running on different hardware that all fail at the same time due to alleged radiation, then you probably have a pretty serious problem somewhere.
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6553
  • Country: au
Quote from: 5U4GB
A friend just commented that the Russian authorities have said the remaining Airbuses there are safe from this problem because it's like the situation where you're still running Windows 7 and the problem is with Windows 10  :-DD.

Is that what happened to Tupolev?
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: Halcyon
Quote from: 5U4GB
A friend just commented that the Russian authorities have said the remaining Airbuses there are safe from this problem because it's like the situation where you're still running Windows 7 and the problem is with Windows 10  :-DD.

Is that what happened to Tupolev?

You mean the Tu-204?  I think that's running DOS.
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 5245
  • Country: gb
  • Doing electronics since the 1960s...
I thought avionics mostly runs under specific "certified" RTOSes.

But Russia has always done things differently. Interesting history of hardware development, for example. I recall talking, many years ago, to a guy who was on the VAX-11/780 reverse-engineering team :)
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: peter-h
I thought avionics mostly runs under specific "certified" RTOSes.

AFAIK it used to be VxWorks Cert Edition; they've been working on their own RTOS, JetOS, for a while now, but like many other Russian technologies there seems to be quite a gap between claims and reality - news around it is always "JetOS will" rather than "JetOS does" - so it's hard to tell what the real status is.
 

Online Gyro

  • Super Contributor
  • ***
  • Posts: 10744
  • Country: gb
Quote from: peter-h
I thought avionics mostly runs under specific "certified" RTOSes.

But Russia has always done things differently. Interesting history of hardware development, for example. I recall talking, many years ago, to a guy who was on the VAX-11/780 reverse-engineering team :)

I remember security suddenly getting tightened up. Full height 'cozy' turnstiles and cameras everywhere, enhanced access control systems, seminars on export controls and security, we were told to challenge anyone we didn't recognize etc.

DEC (and the US in general) were a bit slow with espionage and export controls, but measures were put in place in time to at least hamper copying of the more advanced stuff. I'm glad I was in R&D rather than sales; it was a nightmare for them.
« Last Edit: December 03, 2025, 12:21:55 pm by Gyro »
Best Regards, Chris
 

Offline paulca

  • Super Contributor
  • ***
  • Posts: 5253
  • Country: gb
Quote from: peter-h
But we all know that coders, writing to the same spec, will make the same mistakes, so this basically guards against defective compilers more than anything else.

Actually the evidence is quite different.  If you give the same spec to a dozen engineers/coders you will likely get a dozen different solutions and 11 of them might work.

I believe the sentiment and understanding of this dates back to Margaret Hamilton from the Apollo systems projects.

EDIT:  It is however part of the challenge.  Because later, when the one you picked blows up anyway, you now have dozens of different conventions and ways of doing things to contend with while trying to debug a production issue years later.  "Why?!.... wait, no... WHY?"  I have been here, felt the pain of someone's soapboxing or hobby-horsing, or coupling "self learning" to the project as a vehicle.
« Last Edit: December 03, 2025, 04:11:30 pm by paulca »
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 

Offline langwadt

  • Super Contributor
  • ***
  • Posts: 5398
  • Country: dk
Quote from: paulca
Quote from: peter-h
But we all know that coders, writing to the same spec, will make the same mistakes, so this basically guards against defective compilers more than anything else.

Actually the evidence is quite different.  If you give the same spec to a dozen engineers/coders you will likely get a dozen different solutions and 11 of them might work.

Yeah, I remember ~25 years ago, when I worked on the first Bluetooth, all the different companies came together for an "unplugfest" to try connecting to each other's implementations; there were always minor differences in how each company had interpreted the spec.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 10309
  • Country: fi
Quote from: langwadt
Quote from: paulca
Quote from: peter-h
But we all know that coders, writing to the same spec, will make the same mistakes, so this basically guards against defective compilers more than anything else.

Actually the evidence is quite different.  If you give the same spec to a dozen engineers/coders you will likely get a dozen different solutions and 11 of them might work.

Yeah, I remember ~25 years ago, when I worked on the first Bluetooth, all the different companies came together for an "unplugfest" to try connecting to each other's implementations; there were always minor differences in how each company had interpreted the spec.

And, what I like to point out:

If the spec is complete, unambiguous, and correct, then it is the program. A separate implementation step is superfluous and can only do harm. Such a complete spec should be in a formal language, so that a compiler can implement it. This is called a "programming language".

OTOH, if the spec is incomplete, ambiguous, or has mistakes in it, then surely the next step, "implementation", is very important, and as such different implementations behave differently, and it's quite possible the "most incorrect" of them happens to be the best (desired) outcome. In that case, "implementation" is not just implementation anymore; it is design.

Therefore, my opinion (which paulca will point out I have no right to express because I'm not professional enough a software developer) is that tightly separating specification from implementation, and especially hiring "code monkeys" to do the implementation, has been a huge mistake. Those who decide what the system should do should definitely be working with the code as well. And vice versa: everybody working with the code should have an opinion on what the right outcome should be. (I'm not saying make random decisions alone, of course. Communicate.) In that case there is no clear line between specification and code, no waterfall model, but rather a collection of code, natural-language descriptions, simulations etc. which are all kept in sync in a bidirectional process.

But what we still see is the idea of a Perfect Specification unidirectionally flowing (waterfall) into a lower-level implementation job, and instead of adding understanding and design to the whole process, companies apply all kinds of super-weird stuff like getting separate teams to do the implementation and comparing the results in the hope that somebody got the desired outcome. Like, why not instead make the implementers part of the design team and ask them, "what do you think the plane should do"? And if they are monkeys, how about: don't hire monkeys, get the wise guys, the designers, to write the code?
« Last Edit: December 03, 2025, 06:54:24 pm by Siwastaja »
 
The following users thanked this post: tom66, langwadt

Offline paulca

  • Super Contributor
  • ***
  • Posts: 5253
  • Country: gb
It really depends on the market these days as to whether there is a "Specification" at all.

"Define up front on paper, review, and only then code" is probably reserved for the likes of airliners and other "life critical"s.

In most other fields it was realised that, honestly, the only people who know what is possible are the dev team, and with modern tooling, developing a demonstrable and testable "first version" is an order of magnitude faster than asking them or anyone else to try and spec it first.

Waterfall, outside of government projects and some with very high traceability requirements, doesn't exist today.  Where and when it does, it's a butchered, over-simplified version with most of the important (and costly) steps taken out to fit budgets.

I had a one-month experience in November of writing a "Functional Design Specification".  A month to waffle and explain what we hope to do, which is of maybe 10% worth to the next developer, because the audience for the document are "ivory tower" architects who don't even have access to the code repos and probably haven't looked at a bit of the project code in years.  Worse, they give it to a single developer to write, when the full spec spans a 23-service architecture of which I know about 10% intimately and 50% I never even knew existed.  A task from which it is next to impossible to produce anything of quality or worth.

I am also leaving that company, not least because of that month of torture; it has been one of the last straws.  I didn't go through 4 years of uni, further years of training and 20 years of career to write fucking Word docs to keep a customer happy.  Do they want working software, or pretty worthless "business ceremony" documents?
« Last Edit: December 03, 2025, 07:51:11 pm by paulca »
"What could possibly go wrong?"
Current Open Projects:  STM32F411RE+ESP32+TFT for home IoT (NoT) projects.  Child's advent xmas countdown toy.  Digital audio routing board.
 
The following users thanked this post: Siwastaja

Online 5U4GB

  • Super Contributor
  • ***
  • Posts: 1236
  • Country: au
Quote from: langwadt
Yeah, I remember ~25 years ago, when I worked on the first Bluetooth, all the different companies came together for an "unplugfest" to try connecting to each other's implementations; there were always minor differences in how each company had interpreted the spec.

It depends on how dysfunctional the standards body, or a group within it, is.  Years ago I took part in an interop for an awful IETF standard.  The report from the first (non-)interop was, in one sentence, "this standard does not work".  They made a few minor tweaks and pushed it through anyway, cancelling any further interops so there'd be no record of it not working.
 

Online tom66

  • Super Contributor
  • ***
  • Posts: 8104
  • Country: gb
  • Professional HW / FPGA / Embedded Engr. & Hobbyist
As much as we may like to complain about dodgy software engineering practices, the reality is that aviation software is very safe.  Boeing's 737 MAX is probably the exception.  There was an Airbus incident on a Qantas aircraft where buggy ADIRU data caused uncommanded descents under autopilot, leading to a number of serious injuries on board (Qantas Flight 72).  But bugs that lead to actual accidents are very rare indeed.
 
The following users thanked this post: Siwastaja

Offline hwasti

  • Regular Contributor
  • *
  • Posts: 54
  • Country: us
This is a very serious problem that puts society as a whole at risk: clueless media posting sensational and misleading headlines devoid of any actual knowledge or insight.

That is the primary problem in this story. The technical issues are all well known and benign, even boring.

As for the technical issues:

SEU has been known since the 70s. What is considered by many to be the seminal paper on the subject was published by IBM in 1996: "IBM experiments in soft fails in computer electronics (1978-1994)" by J. F. Ziegler et al. It can be downloaded from http://www.pld.ttu.ee/IAF0030/curtis.pdf  The paper has 27 co-authors and references papers going back to 1975 (assuming you ignore citations of Faraday from 1839). While the physics are still valid, a lot of the specifics have changed, since the processes are different and the structures are orders of magnitude smaller.

Boeing also did a lot of work on the field, led by Eugene Normand, who published extensively on the subject till his retirement about 15 years ago. Normand was considered the foremost expert in the field when it came to avionics.

What could have gone wrong here:

SEUs can be mitigated by ECC hardware, and on many processors that has to be explicitly enabled. Was the enabling code accidentally eliminated?

FLASH PROM can be considered immune to SEU. On certain microprocessors with an ongoing CRC check, you can assume the code is not going to get corrupted, but the data can be. Data is protected by having multiple copies and comparing them. Did someone enable a cache, making code susceptible to SEU?

Constants located in FLASH PROM can be considered incorruptible. Did a definition change or an optimization move a constant into RAM, making it susceptible to SEU?

Without knowing anything at all about the hardware or software, I can come up with these hypothetical examples of purely software changes that could make a previously impervious module susceptible to SEU.
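
To make the multiple-copy mitigation concrete, here is a minimal C sketch of the general technique - hypothetical names and structure, nothing to do with the actual ELAC code:

Code: [Select]
#include <stdint.h>

/* Hypothetical example: a critical RAM variable stored in triplicate.
 * A single-event upset flips bits in at most one copy, so a bitwise
 * 2-of-3 majority vote recovers the value, and the copies are then
 * rewritten ("scrubbed") with the voted result. */
typedef struct {
    volatile uint32_t copy[3];  /* volatile: reads must hit RAM, not a cached register */
} tmr_u32;

static void tmr_write(tmr_u32 *v, uint32_t val)
{
    v->copy[0] = v->copy[1] = v->copy[2] = val;
}

static uint32_t tmr_read(tmr_u32 *v)
{
    /* Each result bit is 1 iff it is 1 in at least two of the copies. */
    uint32_t voted = (v->copy[0] & v->copy[1]) |
                     (v->copy[1] & v->copy[2]) |
                     (v->copy[0] & v->copy[2]);
    tmr_write(v, voted);        /* scrub any corrupted copy */
    return voted;
}

Drop the scrubbing, or let an optimization cache the copies in a register, and the protection quietly evaporates - exactly the sort of change that only shows up under fault-injection testing.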

In reality, none of these are code bugs. They are process bugs: a failure of the processes and procedures to prevent or detect these issues in pre-release testing.
 

Offline aeg

  • Regular Contributor
  • *
  • Posts: 249
  • Country: us
Maybe the SEU is occurring in a sensor and not in the ELAC itself. Then the software problem could be in the majority-vote algorithm that filters the sensor readings.
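
For analogue sensor readings the usual filter is a mid-value select rather than an exact-match vote - an illustrative C sketch of the general idea, not the actual algorithm:

Code: [Select]
/* Mid-value select over three redundant sensor channels: the median
 * ignores a single wild reading, so one upset sensor (or one bit flip
 * in its data path) cannot steer the output on its own. */
static double mid_value_select(double a, double b, double c)
{
    if ((a >= b && a <= c) || (a <= b && a >= c)) return a;
    if ((b >= a && b <= c) || (b <= a && b >= c)) return b;
    return c;
}

A bug in the surrounding logic - say, in how long a channel may disagree before it's declared failed - would look exactly like a "software fix for a hardware problem".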
 

