Assembler style discussion

The following dialog took place on the IBM mainframe assembler mailing list a while back. I didn't think to save the participants' names at the time, but I wish I could credit them!

> ... and assembly language programming is largely about a culture and
> a mode of expression shared by a group of specialized people.

Well said. Human readability is more important than mere assemblability.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

 > That said, what do people think about things like:

 >         BCTR    R6,R0
 > vs.
 >         BCTR    R6,0

 > I prefer, nay insist on, the latter because you're not really using
 > register 0 and it should not be counted by the assembler (or the
 > editor's FIND command) as a "reference".

There we agree.

 > Problems:  you can't use them intelligently in instructions like BXLE
 > or MR which use register pairs;

You can if the author has done things systematically instead of
haphazardly.

Like any tool, EQU is a good servant but a poor master. Used properly,
it can save you a lot of time and make the code easier to maintain and more
readable; used improperly, it can sink you in a morass.
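
As a rough sketch of what "systematically" might look like for register
pairs (PRODHI and PRODLO are names invented for this illustration, not
taken from the thread): the odd register is defined relative to the even
one, so the pair cannot drift apart, and a bare 0 is kept for operands
that do not really name a register.

R5       EQU   5
R6       EQU   6
R7       EQU   7
* Hypothetical names for an even/odd pair, tied together by arithmetic
PRODHI   EQU   4                  even register of the pair
PRODLO   EQU   PRODHI+1           odd register, always follows PRODHI
*
         BCTR  R6,0               decrement R6; the 0 means "no branch"
         LR    PRODLO,R5          multiplicand goes in the odd register
         MR    PRODHI,R7          64-bit product in PRODHI/PRODLO

MR never names PRODLO itself, so the cross-reference only shows the odd
register where the code loads or reads it explicitly, as the LR does here.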

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

> Maybe, but those numbers look awfully bare without the preceding "R"
> to me.  What difference does one extra character make, especially when
> assembly source is typically 80 column "card" image?

If you don't like the constraint of 80-character "card" images, you can
use the input exit ASMAXINV shipped (as sample code) with High Level
Assembler Release 2. It allows you to create V-format input to the
assembler; the exit then takes care of converting this user-friendly
input to the traditional fixed format the assembler digests. (See p.324
of the HLASM Programmer's Guide for details.)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

 How about clearing registers?  Again, the performance difference
  won't be dramatic in any case, but as obvious choices we have
    LA  R15,0(0,0)
    XR  R15,R15
    SR  R15,R15
    SLR R15,R15
  which can be used interchangeably if the value of the CC does not
  matter.  I dislike the first because I have a hard time doing in
  4 bytes what can be done in 2.  Of the remaining choices, I prefer
  XR because it is like XC, which can be similarly used to initialize
  a field to binary zeros.  On the other hand IBM code I've read usually
  chooses SR or occasionally SLR.  I recall reading somewhere that XR
  should be fastest, but I have never tested this and doubt there is
  much of a difference.
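
  For reference, a sketch of the same choices annotated with instruction
  length and the condition code each one leaves behind.  FIELD is a
  hypothetical work field added only to show the XC analogue, and R15 is
  assumed to be equated to 15.

         LA    R15,0(0,0)         4 bytes, CC unchanged
         XR    R15,R15            2 bytes, CC=0
         SR    R15,R15            2 bytes, CC=0
         SLR   R15,R15            2 bytes, CC=2 (zero, no borrow)
         XC    FIELD,FIELD        storage analogue of XR, CC=0
*        ... in a data area, outside the execution path ...
FIELD    DS    XL8                hypothetical 8-byte work field

  So the four are only interchangeable when nothing downstream tests the
  CC: LA leaves the old value in place, and SLR sets 2 rather than 0.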

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 >   matter. I recall reading somewhere that XR should be fastest

   Once upon a time, a long long time ago, IBM used to publish a "Functional
Characteristics" book on each of its processors, and in this book they published
instruction timings. I recall that SLR was the fastest way.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

   Speaking of style, let me add my own element. I prefer NOT to put labels on
instructions. Labels always go on a statement by themselves along with a DS 0H
(e.g. RETURN  DS    0H). This ensures that a label doesn't get deleted when
deleting the instruction it is attached to and also ensures halfword alignment,
unlike "RETURN   EQU   *". It also makes it easier to comment out sections of
code because column 1 is always blank so you can just put an asterisk there.
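
A small sketch of both points.  MSG, BADLBL and GOODLBL are names made up
for the illustration, the R-names are assumed to be the usual register
equates, and MSG is assumed to start on an even address.

RETURN   DS    0H                 label stands alone, halfword aligned
*        LM    R14,R12,12(R13)    disabled by the * in column 1
         L     R14,12(,R13)       replacement; label untouched
         BR    R14
*
MSG      DC    C'ABC'             3 bytes, so the next address is odd
BADLBL   EQU   *                  takes the odd address as it is
GOODLBL  DS    0H                 pads to the next halfword boundary

A branch to BADLBL would assemble, but it would target an odd address;
GOODLBL cannot have that problem.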


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 > All-in-all style, like other things is in the eye of the beholder,
 > IMHO the idea is to make the program work.

   I can't agree with you on that point. The only time
that is true is when there is only a single programmer
working on the system and the program will be run once and then
thrown away.

> BAL is my first love also, but I take the other view. It's lots
> of relative addressing and registers. I guess it's my scientific
> background, lots of table searching does that to you.
> The code is the only true document.
> Back in the 360 days I would LPSW rather than a B.
>

   Many of the things mentioned in "style" also have efficiency
implications. If you write an operating system exit that is executed
hundreds or thousands of times each day, you had better make sure that
you have given some thought to efficiency.

   Code readability is also important. Many of the "one time programs"
that I have written have survived longer than those written to be
part of a major application system. When there is a problem in the
program and you need to look at the source, documentation and
consistency make the job much, much easier. While using LPSW instead
of a branch may seem obvious to us dinosaurs, the code may have to be
picked up, when there is a problem, by some less experienced person
who doesn't know much about the PSW other than that it's where to look
in a dump to see where the program blew up.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

I use not only DS 0H, but also (in large programs, when I'm up to the
additional effort) DS 0Y.  The DS 0Y go on labels which are only
referenced "locally", i.e., within a few lines of them.  For example,
the oh-so-common construct:

         TM    SOMEFLAG,SOMEBIT
         BO    SKIPONE
         
SKIPONE  DS    0Y

The label on the subroutine would be a DS 0H because it's referenced
elsewhere.

This isn't perfect, of course, but if followed, it helps avoid extra
searching for "Who else might get us to this line?"

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

>if the instruction stream is terminated by an unconditional branch, as
>it is almost certain to be.  Sure, there may be a level at which bytes
>are blindly fetched ahead of the instruction counter, but it won't last
>long - the I- unit will soon realize that it's a bad place to be fetching
>from.

  While the I-unit should be able to detect an unconditional branch,
  that is only part of the problem.  Cache lines are a good deal longer
  than instructions are, so you end up with cache lines containing
  both instructions and data.  If the data changes, the entire
  line is modified.  This isn't really much of a problem for a set
  of instructions that are only executed a few times, but within
  loops this could be noticeable, especially if the data is modified.
  An annoying aspect of this problem is that it can get better or worse
  depending on how the code is aligned relative to cache line boundaries,
  so making an unrelated change in the program or running on a different
  processor may unexpectedly degrade performance.
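
  To make the concern concrete, here is a sketch only (all labels are
  invented, and R10 is assumed to point at separately obtained storage).
  The first loop stores into a word that sits a few bytes past its own
  instructions; the second keeps the counter in a work area away from
  the code.

* Stored-into data just after the loop, likely in the same cache line
LOOP1    DS    0H
         L     R1,COUNT1
         LA    R1,1(,R1)
         ST    R1,COUNT1          store into the line holding LOOP1
         BCT   R6,LOOP1
         B     NEXT
COUNT1   DC    F'0'
*
* Same loop with the counter in a separate work area
NEXT     DS    0H
         USING WORKAREA,R10       R10 assumed to address the work area
LOOP2    DS    0H
         L     R1,COUNT2
         LA    R1,1(,R1)
         ST    R1,COUNT2          store goes to the work area instead
         BCT   R6,LOOP2
*
WORKAREA DSECT ,
COUNT2   DS    F

  Whether the first form actually costs anything depends on the cache
  line size and on where the code happens to fall, which is the point
  made above.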


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

>From:         "Shmuel (Seymour J.) Metz"
> >  Once upon a time, a long long time ago IBM used to publish a
> > "Functional Characteristics" book on each of its processors and
> > in this book they published instruction timings. I recall that SLR
> > was the fastest way.
>
> >                     Ken (kgunther@delphi.com)
>
>Only on specific models; on some XR was faster and on some LA was
>faster. BTW, they still publish the "funky specs", but they no longer
>contain timings. In the last one (370/168) that had timings, you had
>timing formulae rather than simple numbers, and I'm sure that if they
>published timing information on current models you wouldn't want to
>drop one of the manuals on your foot.
>
  Style can't be answered definitively since it is subjective, but
  times can be measured.  To see how LA, XR, SR, and SLR compare for
  clearing a register, I wrote a program to generate and then time the
  execution of N successive copies of each instruction.  Varying N from
  50000 to 1000000, I got the following results:

       N      9021-580, CMS version   9672, MVS version
                LA    others            LA    others
     50 000     27      15              16      14
    100 000     33      15              21      13
    200 000     32      16              22      16
    500 000     32      16              22      17
  1 000 000     32      17              22      17

  All times are in ns, and hopefully no errors...
  That is, XR, SR, and SLR are practically the same (the run-to-run
  variation was more than the time differences), and about 1/2 to
  3/4 of the time of LA.  The increased time with longer series may
  be due to cache delays.  (Note that LA is 2X longer and is affected
  first.)

  I was planning to append the program here, but it grew to almost
  500 lines, so I thought it might be better not to.  However, if anyone
  would like a copy, send me e-mail, and I'll send one back.  The
  program is conditionally assembled based on &SYSTEM_ID, and should
  work in CMS or MVS without modification.
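
  The program itself is not reproduced here.  As a rough sketch of the
  general idea only, and not the author's code, a macro can generate N
  copies of a register-clearing instruction between two STCK
  instructions.  GENCOPY, BEFORE and AFTER are invented names, R15 is
  assumed to be equated to 15, and LA would need its own operand form.

         MACRO
         GENCOPY &OP,&N
         LCLA  &I
         ACTR  2*&N+10            generous branch limit for the loop
.NEXT    AIF   (&I GE &N).DONE
         &OP   R15,R15            one generated copy
&I       SETA  &I+1
         AGO   .NEXT
.DONE    MEND
*
         STCK  BEFORE             TOD clock before the series
         GENCOPY XR,1000
         STCK  AFTER              TOD clock after the series
*        ... in a data area, outside the execution path ...
BEFORE   DS    D
AFTER    DS    D

  The difference AFTER-BEFORE divided by N gives a per-instruction time
  in TOD clock units (bit 51 is one microsecond), with the cost of the
  STCK instructions themselves included.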

If you can help me identify the authors of the dialog above, let me know.