Sanding our assholes with 150 grit. Slowly. Lovingly.

Mars Global Surveyor Died from Single Bad Command

http://science.slashdot.org/science/07/04/14/0529246.shtml

<quote>
The LA Times reports that a single wrong command sent to the wrong computer address caused a cascade of events that led to the loss of the Mars Global Surveyor spacecraft last November. The command was an orientation instruction for the spacecraft's main communications antenna. The mistake caused a problem with the positioning of the solar power panels, which in turn caused one of the batteries to overheat, shutting down the solar power system and draining the batteries some 12 hours later. 'The review panel found the management team followed existing procedures in dealing with the problem, but those procedures were inadequate to catch the errors that occurred. The review also said the spacecraft's onboard fault-protection system failed to respond correctly to the errors. Instead of protecting the spacecraft, the programmed response made it worse.'
</quote>

Having worked on problems like this I really feel for these guys. You just feel horrible. You feel responsible. But you can't foresee everything. These are very tight environments with an infinite number of things that can go wrong and where actions can have unpredictable consequences. Our bodies have evolved over many millions of years to have a fairly small set of responses to bad things. Fight or flight. Immune system. Healing. And finally the brain. And even the brain does a pretty poor job of it.
Permalink son of parnas 
April 14th, 2007 10:40am
> Having worked on problems like this ....

Good old pointers!

> And even the brain does a pretty poor job of it.

Relative to what?  Suicide is not a good sign, but open heart surgery is a neat trick.
Permalink z/\xon 
April 14th, 2007 11:19am
rm -rf /*

is a doozy too.
Permalink Send private email strawberry snowflake 
April 14th, 2007 11:22am
Ouch.  I haven't read the article yet, but I too work with spacecraft operators.  Even after double and triple checking, and years of operations of a particular craft, STILL amazing things can go wrong.
Permalink SaveTheHubble 
April 14th, 2007 11:34am
> Relative to what? 

Humans are pretty poor at quickly solving complex problems. We rely primarily on instinct and learned behaviors. In emergencies our responses are rather limited.
Permalink son of parnas 
April 14th, 2007 11:51am
Which is why on Apollo they used simulation a LOT to train both the astronauts and the ground crew.

In fact, the failure mode of Apollo 13 had been tried in a simulation 2 months before -- and in the simulation everybody died.  As a result, a response plan was prepared for what to do should something similar happen in the future.

Thus when the Apollo 13 failure DID occur, there was a plan (with checklists) already on the shelf, so in the last 20 minutes of power before everybody would have died, the Lunar Module was prepared as a "lifeboat".

Also, every catastrophic error is usually found to be a chain of lesser errors.  For Global Surveyor, the error handling code actually made things worse.
Permalink SaveTheHubble 
April 14th, 2007 11:59am
> the error handling code actually made things worse.

That's not unusual at all. It's really common. The world is in an unknown state, your own internal state is stressed and fluctuating, the situation is novel, and the bots need to act. The chances are bad for the little fella.
Permalink son of parnas 
April 14th, 2007 12:04pm
http://www.latimes.com/news/science/la-sci-mars14apr14,0,4365613.story?coll=la-home-headlines

Wow, and Dolly Perkins gets a quote.  I've worked on her projects before -- a very good lady.

It's sad that in recent years, NASA has been trying to save so much money (to go to Mars) that Operations Funding has had to be cut.

I don't know if "a new operator mistake" was a problem here, of course.
Permalink SaveTheHubble 
April 14th, 2007 12:04pm
There is only one good way to write completely robust and failsafe code, and that is to have one single highly competent person *totally* responsible for the entire design. That person owns every line of code in the system and nobody else gets to so much as touch it without the explicit approval and oversight of the designer. No coding in teams, no division of responsibilities for different modules, no bug fixing by other coders.

Everyone else on the team gets to treat the code like a puzzle and their objective is to find fault with it. They spend days coming up with creative ways to make the code fail, and the code designer responds each time by adjusting the design until the weakness is eliminated.

If you don't follow the 'one single designer' rule, you get design by committee, and nothing good ever comes out of committees. Design, management, government, whatever...
Permalink Send private email bon vivant 
April 14th, 2007 1:41pm
> Humans are pretty poor at quickly
> solving complex problems.

Relative to what? The solutions you have seen non-humans come up with?  I notice the wiggle word "quickly" has been added.
Permalink z/\xon 
April 14th, 2007 2:02pm
> Relative to what?

Relative to the problems we must solve.

> The solutions you have seen non-humans come up with?

It's more the error in our solutions rather than saying X is better, you suck.

> There is only one good way to write completely robust and failsafe code, and that is to have one single highly competent person *totally* responsible for the entire design.

Not a chance. The scenarios are so complex and so vast no one can anticipate them all. Reality is more complex than you can imagine.
Permalink son of parnas 
April 14th, 2007 2:21pm
The chief designer doesn't have to think of everything, but does have to keep the whole design consistent. Everyone else on the project can beat on the design and find its weaknesses.
Permalink Send private email bon vivant 
April 14th, 2007 2:30pm
"It's sad that in recent years, NASA has been trying to save so much money (to go to Mars) that Operations Funding has had to be cut."

All of 8 posts until it was Bush's fault? You guys need to get on the ball a little quicker! This should have been 4 posts tops until we hit that.
Permalink Practical Economist 
April 14th, 2007 2:41pm
Fuk Yoo, manager of the Mars exploration program at JPL, said an "end-to-end" review of all missions would be undertaken to make sure the mistakes made with the spacecraft were not repeated.
Permalink wrong boy 
April 14th, 2007 2:44pm
Anyway the satellite lasted 5 times longer than it was expected to, which seems to me to be a pretty fucking big success story and not the tale of failure due to incompetence and/or lack of funding that is being painted. What percentage of software programs launched ten years ago are still running without incident?
Permalink Practical Economist 
April 14th, 2007 2:45pm
You don't get it PE -- it was Clinton who released sufficient funds for it to be 5x more successful than planned, but Bush who froze funds (you know, after it was already launched, etc) to make it not 12x more successful.

Causality is not hard once you get the hang of it.
Permalink Send private email strawberry snowflake 
April 14th, 2007 2:48pm
bon vivant's single designer rule is extremely interesting. Are there papers on this or is this your own innovation? The obvious counterexample would be to discuss designs that are simply too big for one person, but maybe they are very rare. Wernher von Braun was personally responsible for almost all of the design of the Saturn V, so that might even be called a single designer project. Same goes for some of the largest integrated circuit microprocessors - one guy and the rest is support. I mention those since they are two of the most complex engineering problems I can think of.
Permalink Practical Economist 
April 14th, 2007 2:49pm
But strawberry, it was 5x more successful during Bush. If Gore had been elected, it would have failed in 2000.
Permalink Practical Economist 
April 14th, 2007 2:50pm
How could it have failed under Gore? Gore invented the Mars Global Surveyor.
Permalink Send private email strawberry snowflake 
April 14th, 2007 2:52pm
True, but he was the one that designed the error recovery mode that caused it to fuck up. The mission was important to Gore and his investment consortium because it would show exactly how far Bush's fuck-ups extend:

"The Mars Global Surveyor also found that carbon dioxide ice was disappearing from the planet's south pole, raising the possibility that a new round of global warming was underway on the planet."

Bush's refusing to sign the Kyoto accord is even fucking up the Red Planet! We are screwed now - I was planning to move to Mars!
Permalink Practical Economist 
April 14th, 2007 2:58pm
> Bush who froze funds

Bush simply wanted the surveyor to become part of the ownership society because of its own efforts. Didn't want to make it too easy. Work for it, like he did.
Permalink son of parnas 
April 14th, 2007 3:39pm
"Same goes for some of the largest integrated circuit microprocessors - one guy and the rest is support. I mention those since they are two of the most complex engineering problems I can think of."

Seymour Cray.

"Cray Research corporate headquarters is in Mendota Heights, but Cray works at its Chippewa Falls plant. As the stories go, when the spirit moves him he walks to his nearby cottage on Lake Wissota, sets up a card table on the porch, grabs a basket of computer chips, a pair of tweezers, and a soldering gun, and puts together supercomputer parts as if he were building a Heathkit.

"The truth is somewhat less fanciful. Cray does retire to his cottage, where he can concentrate hours at a time in solitude. But computer building is a highly abstract exercise. He uses only pencil and paper--"about a pad a day" of 8 1/2 by 11 inch quadrille-ruled paper. His calculations are reviewed by a 30-person development team, modified if necessary, and converted into a computer module of microcircuit chips."

http://mbbnet.umn.edu/hoff/hoff_sc.html

"I of course, needed a lot of support people to implement the ideas but the basic concept I thought could not be and should not be a group effort. Designing by committee is not appropriate for computers. You pretty much need one person to say "This is the way it's going to be for this machine.""

http://americanhistory.si.edu/collections/comphist/cray.htm#tc13
Permalink Rocky Mountain "Hi!" 
April 14th, 2007 4:19pm
"There is only one good way to write completely robust and failsafe code, and that is to have one single highly competent person *totally* responsible for the entire design."

Sorry, Bon, that will work for a nice, 5,000 to 20,000 line of code program.  Maybe.

Otherwise, I think the history of software, "The Mythical Man Month", Microsoft, Sun, Apple, and many many different software projects, have all combined to say "completely robust and failsafe code" is an oxymoron.

The world is too complicated, and software (as complicated as it is) is too simple, for any software package to be able to deal with EVERY eventuality it faces.

I suppose it can be "completely robust" or "failsafe" in a fairly limited environment.  That's not what we find in space.
Permalink SaveTheHubble 
April 14th, 2007 4:32pm
Very very nice example there, Rocky Mountain - the world's fastest supercomputers. Can't get much more complex than that. And even a quote about how one-person design is the way to go, excellent find. Thanks.
Permalink Practical Economist 
April 14th, 2007 5:42pm
Where is Cray now?
Permalink son of parnas 
April 14th, 2007 5:57pm
He's dead.
Permalink Send private email Ward 
April 14th, 2007 6:00pm
See.
Permalink son of parnas 
April 14th, 2007 6:02pm
A friend's father worked at CDC in the early days & told me there were about 30 engineers a day or two behind Cray, making sure he didn't screw up too badly.  Google did the rest.
Permalink Rocky Mountain "Hi!" 
April 15th, 2007 5:01am
> Google did the rest. ?

http://en.wikipedia.org/wiki/Seymour_Cray#Control_Data_Corporation
Permalink a 1950s version of google 
April 15th, 2007 7:28am
Having one person in charge of everything and a huge support team wasn't common back then (check out, for example, IBM's revolutionary ACS project) and is totally unheard of now that processors are 100x more complex.
Permalink random vlsi guy 
April 15th, 2007 1:06pm

This topic is archived. No further replies will be accepted.
