Author Topic: Boeing Starliner: 2 SW bugs found, patched, uploaded in-flight to avoid disaster  (Read 1613 times)


Online splin

  • Frequent Contributor
  • **
  • Posts: 994
  • Country: gb
It seems the software problems were worse than previously reported. The known issue was a timer error that set the clock 11 hours off, but there was another serious bug:

https://www.theregister.co.uk/2020/02/10/more_software_errors_beset_boeings_calamity_capsule/

Quote
Firstly, that timer wasn't the only software glitch. The Service Module (SM) Disposal Sequence was incorrectly translated into the SM Integrated Propulsion Controller (IPC). The result was that rather than performing a burn to dispose of the SM prior to re-entry, the bug could have actually sent the SM bouncing off the Crew Module.

Fortunately, the team noticed that second error while reviewing the code following the first, and uploaded the fix prior to landing.

The comments are well worth reading, specifically this:

Quote
Re: "re-verifying flight software code"
It’s worse.

They cut and pasted their own code, from the capsule to the service module, but then didn’t update what lookup tables it used.

If not spotted and rushed patched, when the two parts separated, the service module would have used completely wrong thrust, then try to self correct using even more incorrect thrust and so on until it’s out of thrust or has crashed back into the capsule.

Boeing sent multiple, untested, unvalidated software patches, written on the fly to the star liner whilst on mission, just to get it to return safely and it still failed to reach the iss.

The approach and docking at the iss hasn’t been tested.

Let’s not forget this was a proof it all works mission. That without direct intervention would have resulted in total loss.
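
The runaway described in that comment can be illustrated with a toy sketch. This is not Boeing's code: the one-dimensional model, the gain and both lookup tables are assumptions made up for illustration. It only shows why a controller fed a thruster table with the wrong mapping makes every "correction" worse.

```python
# Toy 1-D model: a controller tries to null an error by firing thrusters.
# Both tables are hypothetical; they map a commanded direction (+1 / -1)
# to the acceleration the hardware actually produces.
CORRECT_TABLE = {+1: +1.0, -1: -1.0}
WRONG_TABLE = {+1: -1.0, -1: +1.0}   # copied from a vehicle with the opposite jet layout


def run_controller(table, error=1.0, gain=0.5, steps=10):
    """Return the history of the error over `steps` control cycles."""
    history = [error]
    for _ in range(steps):
        direction = -1 if error > 0 else +1      # command that should reduce the error
        magnitude = gain * abs(error)            # proportional correction
        error += table[direction] * magnitude    # what the hardware really does
        history.append(error)
    return history


good = run_controller(CORRECT_TABLE)
bad = run_controller(WRONG_TABLE)
print("correct table:", [round(e, 3) for e in good])
print("wrong table:  ", [round(e, 3) for e in bad])
```

With the correct table the error halves every cycle; with the wrong one it grows by half every cycle, which is the "self correct using even more incorrect thrust" loop in miniature.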


I'm not sure what the source for this claim is, but if it is true, I'd struggle to reach a much different conclusion than this later comment:

Quote
This situation in this case is a whole different kettle of fuck up. Its not a one point failure situation - its the exact opposite - you'd be hard pressed to find anything that these people did right. I think my A level computer studies group could have done better (in BASIC) than this shower. They cut/pasted code and worst of all, look up tables, into software for a different system, didn't check it visually or otherwise, didn't simulate it on ground systems that should have been available as part of the contract.

The reviewer, if there was one, didn't verify the contents of the look-up tables, which are just as important as the code itself - unforgivable.  :palm:
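
One way to make look-up tables reviewable in the same way as code is to tag each table with the vehicle it belongs to and compare its contents against a digest recorded at review time. The sketch below is only an assumption about how such a check could look; the table names, jet names and values are invented.

```python
import hashlib
import json

# Hypothetical tables: names and values are invented for illustration only.
CREW_MODULE_TABLE = {"vehicle": "crew_module", "jets": {"A1": [0, 0, 1], "A2": [0, 0, -1]}}
SERVICE_MODULE_TABLE = {"vehicle": "service_module", "jets": {"S1": [1, 0, 0], "S2": [-1, 0, 0]}}


def table_digest(table):
    """Stable SHA-256 digest of a table, for comparison against a reviewed baseline."""
    canonical = json.dumps(table, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def check_table(table, expected_vehicle, reviewed_digest):
    """Raise ValueError if the table is for the wrong vehicle or differs from the reviewed copy."""
    if table["vehicle"] != expected_vehicle:
        raise ValueError(f"table is for {table['vehicle']!r}, expected {expected_vehicle!r}")
    if table_digest(table) != reviewed_digest:
        raise ValueError("table contents differ from the reviewed baseline")


# The digest would normally be recorded when the table is signed off in review.
reviewed = table_digest(SERVICE_MODULE_TABLE)
check_table(SERVICE_MODULE_TABLE, "service_module", reviewed)   # passes silently
```

Calling `check_table(CREW_MODULE_TABLE, "service_module", reviewed)` raises, which is the cut-and-paste case: the code moved to a new system but the table it points at did not change.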

This wasn't one of those obscure and totally unexpected bugs, such as a subtle compiler fault that very rarely causes a latent data error to be introduced into the software state, which only reveals itself long after the initial source of the error and in a totally unrelated subsystem. Nor was it a race condition that only occurs in very specific circumstances not exercised by even the most extensive test suites. In such cases one can have a great deal of sympathy with the developers when these faults make it undetected into a released product.

The atmosphere must be very uncomfortable in Boeing's development teams right now. Hmm, my tablet auto-corrected Boeing to Boring - how wrong could it get? :-DD
« Last Edit: February 11, 2020, 07:00:03 pm by splin »
 

Online SilverSolder

  • Super Contributor
  • ***
  • Posts: 2114
  • Country: 00

If true, these are signs of near total decay at Boeing - it needs a "back to basics" cultural change.
 
The following users thanked this post: tooki

Offline edy

  • Super Contributor
  • ***
  • Posts: 2149
  • Country: ca
    • DevHackMod Channel
Whatever happened to "measure 1,000,000 times, cut once"? I would have expected there to be extensive simulator and systems testing, and not just on one but several different independently made simulation/testing platforms, just to hammer out every possible issue there could ever be. I guess the code is so huge now that it is becoming impossible to manage. That, or there could be several generations of code in there now that newer people in the development cycle are having a hard time understanding. Perhaps a "reboot" and recoding from basic fundamentals is needed every once in a while to build up the system from scratch again (although that may also introduce errors... isn't there a concept of "leave spaghetti code alone if it seems to work"?).

Anyway, there is still a lot of guesswork about exactly what went wrong, but it seems like one problem (timing error) was compounded by another (copying code without associated tables) when they tried to fix it mid-mission. Honestly I didn't even know Boeing and NASA were up to this... I think it is cool that they are using unmanned flights for supply missions.
YouTube: www.devhackmod.com LBRY: https://lbry.tv/@winegaming:b
"Ye cannae change the laws of physics, captain" - Scotty
 
The following users thanked this post: SilverSolder, tooki

Offline Tomorokoshi

  • Frequent Contributor
  • **
  • Posts: 869
  • Country: us
What is it they say? "Test what you fly, fly what you test".

https://appel.nasa.gov/2002/10/01/test-what-you-fly/
 

Offline WastelandTek

  • Frequent Contributor
  • **
  • Posts: 589
  • Country: 00

Quote
If true, these are signs of near total decay at Boeing - it needs a "back to basics" cultural change.

It's not just Boeing, it's culture-wide. We can't even do the choo choo trains any more, and there never seem to be any real consequences for the perpetrators of failure.
I'm new here, but I tend to be pretty gregarious, so if I'm out of my lane please call me out.
 
The following users thanked this post: SilverSolder, tooki

Offline chris_leyson

  • Super Contributor
  • ***
  • Posts: 1408
  • Country: wales
Quote
They cut and pasted their own code
They must have been using the new StackOverflow keyboard. I've seen people write code just by cutting and pasting. It's not just software people; hardware people do it as well. Was going to say engineers but that would be inappropriate. ESA did the same with the first Ariane 5 launch.
« Last Edit: February 11, 2020, 07:40:45 pm by chris_leyson »
 

Online splin

  • Frequent Contributor
  • **
  • Posts: 994
  • Country: gb
Quote
but it seems like one problem (timing error) was compounded by another (copying code without associated tables) when they tried to fix it mid-mission. Honestly I didn't even know Boeing and NASA were up to this... I think it is cool that they are using unmanned flights for supply missions.

I think you misunderstood - they didn't create the second fault whilst fixing the first. It seems that while investigating the errant flight they discovered two serious bugs: the clock problem that instigated the (presumably frantic, white-knuckle) real-time code review, and the cut-and-paste error, which would have manifested itself disastrously later in the flight. They managed to fix both errors and upload the patches in time to avoid catastrophe.

Maybe the first error was added deliberately by a conscientious engineer knowing it was the only way to ensure the rest of the steaming pile got a proper review.   >:D
 
The following users thanked this post: edy

Online rdl

  • Super Contributor
  • ***
  • Posts: 2961
  • Country: us
Because of this latest screw-up, Boeing will have to undergo a full "invasive" safety review, similar to what was done at SpaceX. They had basically bought their way out of that originally. Now SpaceX will probably beat them to launching astronauts to the ISS by months if not years.
 

Offline DBecker

  • Frequent Contributor
  • **
  • Posts: 320
  • Country: us
Quote
Because of this latest screw-up, Boeing will have to undergo a full "invasive" safety review, similar to what was done at SpaceX. They had basically bought their way out of that originally. Now SpaceX will probably beat them to launching astronauts to the ISS by months if not years.

There were claims that the NASA safety review of SpaceX was a move by Boeing lobbyists in DC trying to slow SpaceX while justifying their own slow delivery.

karma++
 

Online Gyro

  • Super Contributor
  • ***
  • Posts: 5585
  • Country: gb
Quote
Because of this latest screw-up, Boeing will have to undergo a full "invasive" safety review, similar to what was done at SpaceX. They had basically bought their way out of that originally. Now SpaceX will probably beat them to launching astronauts to the ISS by months if not years.

Quote
There were claims that the NASA safety review of SpaceX was a move by Boeing lobbyists in DC trying to slow SpaceX while justifying their own slow delivery.

karma++


What goes around comes around.  >:D
« Last Edit: February 11, 2020, 09:37:04 pm by Gyro »
Chris

"Victor Meldrew, the Crimson Avenger!"
 

Offline Red Squirrel

  • Super Contributor
  • ***
  • Posts: 2459
  • Country: ca
Seems like Boeing is taking the game approach to software development.  It compiles!  Ship it, we'll test and patch later.
 
The following users thanked this post: Domagoj T

Online wraper

  • Supporter
  • ****
  • Posts: 11500
  • Country: lv
Quote
Seems like Boeing is taking the game approach to software development.  It compiles!  Ship it, we'll test and patch later.
There is even a name for it: Early access. https://en.wikipedia.org/wiki/Early_access
 

Offline Homer J Simpson

  • Super Contributor
  • ***
  • Posts: 1113
  • Country: us
 
The following users thanked this post: splin, SilverSolder

Offline donotdespisethesnake

  • Super Contributor
  • ***
  • Posts: 1106
  • Country: gb
  • Embedded stuff
And if you are wondering "Is 1 million lines of code a lot?" then  https://informationisbeautiful.net/visualizations/million-lines-of-code/

Some comparisons with similar projects (the "machine" category):

Space shuttle             0.4 million
F22 Raptor fighter        1.7 million
Hubble space telescope    2 million
Boeing 787                6.5 million (avionics and support)
F35 fighter               24 million

So by modern standards it is not that much. However, a comprehensive code review is still a lot of effort, let's say a ballpark figure of about 1 person-year. With 50 engineers it should be done in less than ... 1 year.  ::)
Bob
"All you said is just a bunch of opinions."
 

Offline Tomorokoshi

  • Frequent Contributor
  • **
  • Posts: 869
  • Country: us
Quote
So by modern standards it is not that much. However, a comprehensive code review is still a lot of effort, let's say a ballpark figure of about 1 person-year. With 50 engineers it should be done in less than ... 1 year.  ::)

One year... that's a problem. Let's double that and add 50 more engineers:

https://en.wikipedia.org/wiki/Brooks%27s_law
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 5317
  • Country: fr
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 5464
  • Country: gb
Quote
I would have expected there to be extensive simulator and systems testing, and not just on one but several different independently made simulation/testing platforms, just to hammer out every possible issue there could ever be.
The key problem with most of this kind of work is that it takes a very specific kind of character to do it well: a character that largely ignores most of the obvious stuff that almost any simplistic testing will find, and seeks out the funky corner cases where most of the serious issues tend to be found too late, and often disastrously. These people need to be a part of every simulation team, and need to be the most highly regarded and rewarded. Instead they are too often suppressed, because they increase the test complexity to the point where it's effective, while management just wants to ship stuff. People love talking about flushing out corner case issues, but it has a nasty tendency to go little further than talk.
 
The following users thanked this post: tooki

Online splin

  • Frequent Contributor
  • **
  • Posts: 994
  • Country: gb
Boeing have finally fessed up to their failure to properly test the software:

https://www.theregister.co.uk/2020/03/03/space_roundup/

Quote
Boeing vice president and program manager John Mulholland put on a brave face
....
The latter was caused by spacecraft's propulsion controller not being available for testing (it was being used in a hot fire test of the service module). The gang made do instead with what Boeing described as an "incorrect emulator" which didn't have the correct jet mapping. It was only once the mission was under way and the hardware returned to the lab that the team could re-run the test and discover the issue. A hurried patch was uploaded.

 :palm:
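
An emulator that stands in for missing hardware is only useful if its configuration matches the flight configuration. As a rough sketch of the kind of pre-test check that would flag an "incorrect emulator", the snippet below compares two jet mappings and reports every difference. The jet names and axes are hypothetical; nothing here is taken from Boeing's actual tooling.

```python
def mapping_mismatches(reference, emulator):
    """Return {jet: (reference_value, emulator_value)} for every jet that differs or is missing."""
    mismatches = {}
    for jet in sorted(set(reference) | set(emulator)):
        ref_val = reference.get(jet)
        emu_val = emulator.get(jet)
        if ref_val != emu_val:
            mismatches[jet] = (ref_val, emu_val)
    return mismatches


# Hypothetical mappings from jet name to thrust axis.
FLIGHT_JET_MAP = {"J1": "+X", "J2": "-X", "J3": "+Y"}
EMULATOR_JET_MAP = {"J1": "+X", "J2": "+Y", "J3": "-X"}

problems = mapping_mismatches(FLIGHT_JET_MAP, EMULATOR_JET_MAP)
for jet, (expected, actual) in problems.items():
    print(f"{jet}: flight config says {expected}, emulator says {actual}")
```

Run as a gate before any test campaign, an empty result means the emulator at least agrees with the flight configuration on the mapping; it says nothing about whether the flight configuration itself is right.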

Quote
The approach to testing was eye-opening. It transpired that Boeing had not run an end-to-end test of the entire mission, opting to segment things instead. As far as the timer issue was concerned, where the Starliner's clock was set incorrectly and the ISS docking cancelled, the company had elected to end the test segment concerning launch at the point of spacecraft separation. A few more minutes of testing would have shown the problem.

 :palm:
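
To see why stopping the launch test at separation hides a clock error, consider this toy model. Every number and phase name in it is invented for illustration, apart from the 11-hour offset mentioned earlier in the thread. The spacecraft clock is only consulted after separation, so a test segment that ends at separation passes no matter how wrong the clock is.

```python
# All numbers and phase names below are invented for illustration.
SEPARATION_S = 900            # assumed time of spacecraft separation
INSERTION_DEADLINE_S = 2400   # assumed latest time for the insertion burn
CLOCK_ERROR_S = 11 * 3600     # the 11-hour error mentioned in this thread


def believed_phase(true_time_s, clock_error_s):
    """Phase the toy spacecraft thinks it is in at a given true mission time."""
    if true_time_s < SEPARATION_S:
        return "ascent"       # launcher in control, spacecraft clock not consulted
    believed_time_s = true_time_s + clock_error_s
    if believed_time_s < INSERTION_DEADLINE_S:
        return "orbit_insertion"
    return "on_orbit_ops"     # burn window believed to be long past


def launch_segment_test(clock_error_s):
    """Segmented test that stops at separation, so the clock is never exercised."""
    return all(believed_phase(t, clock_error_s) == "ascent"
               for t in range(0, SEPARATION_S, 60))


def end_to_end_test(clock_error_s):
    """Same test, continued a couple of minutes past separation."""
    return (launch_segment_test(clock_error_s)
            and believed_phase(SEPARATION_S + 120, clock_error_s) == "orbit_insertion")


print("segmented test passes with bad clock: ", launch_segment_test(CLOCK_ERROR_S))
print("end-to-end test passes with bad clock:", end_to_end_test(CLOCK_ERROR_S))
print("end-to-end test passes with good clock:", end_to_end_test(0))
```

The segmented test passes with the bad clock, while the version that runs two minutes past separation fails, which matches the "a few more minutes of testing" remark in the quote.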

Remind me again how much Boeing were billing for this?
 

Offline tom66

  • Super Contributor
  • ***
  • Posts: 3756
  • Country: gb
  • Electron Fiddler, FPGA Hacker, Embedded Systems EE
Boeing has some serious core engineering problems that need to be addressed.

If this was truly a free market, rather than Boeing being in the pocket of the US Military and Gov't, then the 737 MAX 8 disasters would have been enough to take the company down. Think how it only took one accident to put people off the de Havilland Comet, or the accidents that brought down McDonnell Douglas (perhaps some of the engineering transplants from MD are responsible for the fuckups at Boeing?)
 

Online wraper

  • Supporter
  • ****
  • Posts: 11500
  • Country: lv
Quote
Boeing has some serious core engineering problems that need to be addressed.
More like management problems.
Quote
(perhaps some of the engineering transplants from MD are responsible for the fuckups at Boeing?)
Management transplants. There was an internal joke that McDonnell bought Boeing with Boeing's money.
« Last Edit: March 04, 2020, 08:08:10 am by wraper »
 
The following users thanked this post: tooki

Online wraper

  • Supporter
  • ****
  • Posts: 11500
  • Country: lv
Quote
By the time I visited the company - for Fortune, in 2000 - that had begun to change. In Condit’s office, overlooking Boeing Field, were 54 white roses to celebrate the day’s closing stock price. The shift had started three years earlier, with Boeing’s “reverse takeover” of McDonnell Douglas - so-called because it was McDonnell executives who perversely ended up in charge of the combined entity, and it was McDonnell’s culture that became ascendant. “McDonnell Douglas bought Boeing with Boeing’s money,” went the joke around Seattle.
https://www.theatlantic.com/ideas/archive/2019/11/how-boeing-lost-its-bearings/602188/
 

Offline tooki

  • Super Contributor
  • ***
  • Posts: 4987
  • Country: ch
Quote
Boeing has some serious core engineering problems that need to be addressed.
More like management problems.
Quote
(perhaps some of the engineering transplants from MD are responsible for the fuckups at Boeing?)
Management transplants. There was an internal joke that McDonnell bought Boeing with Boeing's money.
Heh, the same thing happened with Apple, where we say that NeXT bought Apple with negative money. The difference is that NeXT’s management replacing Apple’s (in 1997) was an indisputable win, whereas McDD’s management was a clear downgrade from Boeing’s.
 

Offline Rerouter

  • Super Contributor
  • ***
  • Posts: 4610
  • Country: au
  • Question Everything... Except This Statement
When you skip testing, or just have horrible tools for testing, things start catching up with you in short order.

About a year ago I was finally able to pry loose enough information about how a bit of automotive electronics scripting was made. It turns out it's simple stack scripts that let you read in data from the CAN bus, process it and react accordingly, including writing back to the CAN bus.

These scripts were mostly copy-pasta, only 300-700 lines, of the "the grey beard who retired 5 years ago wrote it, so we don't question it" kind of crap.

Well, I went and wrote an emulator for their scripting language, just to check things out, as we kept seeing weird behavior in the data it was logging, and little hard-to-replicate glitches.

It turns out the grey beard code was pushing an error value onto the stack any time it could not handle a packet, but never removed it, so the glitches were the stack hitting something it shouldn't, causing a very rapid reboot. However, the debugger they test scripts with would remove that value, so they would never see the issue in testing. Combine that with only feeding "good" data to the scripts, so they rarely handled packets that would trigger this behavior.
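
The failure mode above can be reproduced in a few lines. This is a toy stand-in, not the vendor's scripting language: the stack limit, the packet format and the `leaky` switch are all assumptions made for illustration. It shows how a leaked error marker eventually overflows the stack, and why a test run with only "good" packets never notices.

```python
STACK_LIMIT = 8   # assumed tiny stack so the leak shows up quickly


class ScriptVM:
    """Toy stack machine that 'reboots' when the stack would overflow."""

    def __init__(self):
        self.stack = []
        self.reboots = 0

    def push(self, value):
        if len(self.stack) >= STACK_LIMIT:
            # Model "the stack hitting something it shouldn't" as a reboot.
            self.stack.clear()
            self.reboots += 1
        self.stack.append(value)

    def handle_packet(self, packet, leaky=True):
        if packet.get("valid"):
            self.push(packet["value"])
            self.stack.pop()          # normal path balances push and pop
        else:
            self.push(-1)             # error marker for an unhandled packet
            if not leaky:
                self.stack.pop()      # the fix: remove the marker again


def run(packets, leaky):
    vm = ScriptVM()
    for packet in packets:
        vm.handle_packet(packet, leaky=leaky)
    return vm


good_only = [{"valid": True, "value": i} for i in range(20)]
mixed = good_only + [{"valid": False}] * 20

print("good data only, leaky code:", run(good_only, leaky=True).reboots, "reboots")
print("mixed data, leaky code:    ", run(mixed, leaky=True).reboots, "reboots")
print("mixed data, fixed code:    ", run(mixed, leaky=False).reboots, "reboots")
```

A useful test here asserts that the stack depth is back at its baseline after every packet, including malformed ones, rather than only checking the logged output.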

I went and rewrote the problematic scripts, explained what was going wrong and how to fix it, and even shared the emulator with the company. I don't think I have ever had a door slammed harder in my face, with all future updates shipping with the same bugs, which I would then patch.... Company culture and ego play a big part in these things.
 

Online SilverSolder

  • Super Contributor
  • ***
  • Posts: 2114
  • Country: 00
Quote
When you skip testing, or just have horrible tools for testing, things start catching up with you in short order.

About a year ago I was finally able to pry loose enough information about how a bit of automotive electronics scripting was made. It turns out it's simple stack scripts that let you read in data from the CAN bus, process it and react accordingly, including writing back to the CAN bus.

These scripts were mostly copy-pasta, only 300-700 lines, of the "the grey beard who retired 5 years ago wrote it, so we don't question it" kind of crap.

Well, I went and wrote an emulator for their scripting language, just to check things out, as we kept seeing weird behavior in the data it was logging, and little hard-to-replicate glitches.

It turns out the grey beard code was pushing an error value onto the stack any time it could not handle a packet, but never removed it, so the glitches were the stack hitting something it shouldn't, causing a very rapid reboot. However, the debugger they test scripts with would remove that value, so they would never see the issue in testing. Combine that with only feeding "good" data to the scripts, so they rarely handled packets that would trigger this behavior.

I went and rewrote the problematic scripts, explained what was going wrong and how to fix it, and even shared the emulator with the company. I don't think I have ever had a door slammed harder in my face, with all future updates shipping with the same bugs, which I would then patch.... Company culture and ego play a big part in these things.

Perhaps echoes of what happened at Volkswagen, with the emissions scandal.  Many people knew...  but did not want to know.
 

Offline tooki

  • Super Contributor
  • ***
  • Posts: 4987
  • Country: ch
Quote
I went and rewrote the problematic scripts, explained what was going wrong and how to fix it, and even shared the emulator with the company. I don't think I have ever had a door slammed harder in my face, with all future updates shipping with the same bugs, which I would then patch.... Company culture and ego play a big part in these things.
Oh man, I hear you.

Very few people have the magical skills to successfully approach companies with a “hey, your product/website/menu/whatever sucks, and I’m the right person to fix it!”
 

