Date: Wed, 4 Mar 1998 21:21:23 -0600
Reply-To: ddejongh@worldnet.att.net
Sender: IBM Mainframe Assembler List
From: David de Jongh
Subject: Re: Save Area Chaining
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
John R. Ehrman wrote:
>
> As Stephen Bacher noted, it's not a good idea to omit the forward link
> in save area chaining. The reason it's there is in case R13 is destroyed:
> the only way to provide a save area trace is to start from the top-level
> save area and work down the chain until it appears to end. If only back
> links are saved, any corruption of R13 will produce nonsense tracebacks.
> John Ehrman (ehrman@vnet.ibm.com)
I'm glad to see there are still a few people out there who understand
what's going on. Here's the deal:
We have program-check recovery exits in all of our main batch-processing
programs. I'm talking serious load modules here, with static links of
1,000+ compiled source programs in assembler and COBOL. These programs
process millions of records in a nightly run. You don't want anything as
trivial as bad data in a record causing the run to die. You certainly
don't want to be called in at night, and you don't want a 200,000 line
dump.
When a program check occurs, we trap it in an ESPIE exit, and produce an
intelligent mini-dump, showing the current calling chain and all
associated storage, similar to AbendAid, but with GETMAINed areas,
application-specific parm areas and other goodies. We put the record
causing the program check into "suspended" status and go to the next
one. If this happens to enough records, we abend the task, because we
figure it's a programming error.
We start with the registers returned in the EPIE, walk the backward
chain from R13 until we reach the control program, and dump what we find
on the way. We assume the chain is good if a back chain address points
to a save area whose forward chain points to the current save area.
This is problem A, but no big deal, because we can find out if the chain
is good by walking up until we find the LE/370 dummy DSA (or a zero, or
we get a S0C4 for trying).
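The back-chain check described above can be sketched as follows. This is
a minimal illustration in Python over simulated storage, not the actual
ESPIE exit: the SaveArea class, the walk_back function, the dictionary
standing in for storage and the sample addresses are all hypothetical.
Only the +4 (back) and +8 (forward) words of the standard save area are
modelled.

```python
from dataclasses import dataclass

@dataclass
class SaveArea:
    back: int = 0     # +4: caller's save area (0 = top of chain)
    forward: int = 0  # +8: callee's save area (0 = nothing called yet)

def walk_back(storage, r13):
    """Walk the back chain from r13.

    Returns (addresses visited, reason the walk stopped). A link is
    trusted only if the caller's forward pointer points back at us.
    """
    visited = []
    current = r13
    while current not in visited:
        area = storage.get(current)
        if area is None:
            # Stand-in for the S0C4 a real walk would take here.
            return visited, "unaddressable"
        visited.append(current)
        if area.back == 0:
            return visited, "top"
        caller = storage.get(area.back)
        if caller is None:
            return visited, "unaddressable"
        if caller.forward != current:
            return visited, "broken"
        current = area.back
    return visited, "loop"

# Three chained save areas at made-up addresses.
storage = {
    0x1000: SaveArea(back=0, forward=0x2000),
    0x2000: SaveArea(back=0x1000, forward=0x3000),
    0x3000: SaveArea(back=0x2000, forward=0),
}
good = walk_back(storage, 0x3000)

# Simulate an overlay of the middle save area's forward pointer.
storage[0x2000].forward = 0x0BAD0BAD
bad = walk_back(storage, 0x3000)
```

With the intact chain the walk reaches the top; after the overlay it
stops at the first save area whose caller no longer points forward to it.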
Only thing, it's pretty easy to walk on storage in re-entrant COBOL,
e.g., by a runaway indexed move over the end of working storage, or by
moving to an incorrectly defined linkage section. If you do this, your
chances of walking on at least one save area and/or TGT are pretty high.
Hence, you can lose the backward chain, and often the current registers,
including R13, because the program check didn't happen until you
reloaded the registers from the corrupted storage.
If this happens, and all your programs perform (what has been, for the
past 30-odd years) standard save area chaining, you can start from the
A(CVT) at location X'00000010', find the current TCB, look in TCBFSA and
walk the forward chain until you find the damage.
Phew!
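The forward walk from TCBFSA can be sketched the same way. Again this is
a minimal Python illustration over simulated storage, under stated
assumptions: walk_forward, the dictionary layout and the addresses are
hypothetical, and the CVT and TCB lookup is left out (the function is
simply handed the first save area address). A forward link is trusted
only if the next save area's back pointer agrees with it.

```python
def walk_forward(storage, first_save_area):
    """Walk the forward chain from the top-level save area.

    storage maps an address to {"back": addr, "forward": addr}.
    Returns (addresses that chain cleanly, "end" or "damaged").
    """
    chain = []
    current = first_save_area
    while current in storage and current not in chain:
        chain.append(current)
        nxt = storage[current]["forward"]
        if nxt == 0:
            return chain, "end"
        callee = storage.get(nxt)
        # The callee's back pointer must point at the area we came from.
        if callee is None or callee["back"] != current:
            return chain, "damaged"
        current = nxt
    return chain, "damaged"

# Three chained save areas at made-up addresses.
storage = {
    0x1000: {"back": 0, "forward": 0x2000},
    0x2000: {"back": 0x1000, "forward": 0x3000},
    0x3000: {"back": 0x2000, "forward": 0},
}
intact = walk_forward(storage, 0x1000)

# Simulate a runaway move that overlays the lowest save area.
storage[0x3000]["back"] = 0x40404040
damaged = walk_forward(storage, 0x1000)
```

The last address in the returned list is the deepest save area that
still chains cleanly from the top, which is where the mini-dump would
start looking for the damage. Note that a forward pointer left behind by
a routine that has already returned can make the chain look longer than
the active calling sequence, so the "end" is only where the chain
appears to end.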
I've complained to IBM about this, but they say that no one else has
complained. Apparently, they got the idea from their C++ and PL/I
compilers, which they claim have never done forward chaining. My
response: ever tried to read a PL/I dump?
Anyway, if you think this is worth complaining about, especially if
you're not on LE/370 yet, but are planning to migrate soon, let your IBM
rep know that this is JUST UNACCEPTABLE!