GCC optimization (Chris Torek)

Index Home About Blog

From: Chris Torek <torek@elf.bsdi.com>
Newsgroups: comp.lang.c.moderated
Subject: Re: Pointer Arithmetic
Date: 23 Mar 1996 00:42:18 -0600

In article <4ip9k0$sqs@solutions.solon.com> Ari Lukumies
<aril@cmt.lpr.mail.carel.fi> writes:
[given m,n of type short]
>	m = n << 8;
>... when compiled using optimization, the compiler (MSC 7 for NT)
>produced incorrect code:
>
>	MOV	AX,[n]
>	MOV	AH,AL
>... it left out XOR AL,AL ...

(Actually, I would expect to see something along the lines of `xor
al,al; mov ah,[n]' for a case like this -- note that my optimization
depends on byte order.)  I would find a bug this serious quite
troubling, and perhaps look for another compiler.

In article <4imabr$duc@solutions.solon.com> Stan Ryckman wrote:
>>It's not that much harder to implement [multiply by shifting] in a
>>compiler.

This is true; also, if it is done correctly, index-multiplication by
shifting falls naturally out of multiplication by shifting (simply
turn the indexing into multiplication before doing the multiplication
optimizations).

>That's a clever compiler. I haven't come across one yet in the Intel world,

Gcc 2.7.2 generates this (perhaps surprising) code for `i*15':

	movl 8(%ebp),%eax	# i [supplied as a parameter, hence 8(frame)]
	leal (%eax,%eax,2),%eax # i <- i + i*2 = i*3
	leal (%eax,%eax,4),%eax # i <- i + i*4 = i*5
				# since 3*5=15, final i <- original i * 15

For i*31 we get:

	movl 8(%ebp),%eax
	sall $5,%eax
	subl 8(%ebp),%eax

which is just `(i << 5) - i'.

GCC also optimizes division by constants, particularly in the case
when the division is known to have no remainder (e.g., when producing
an index from pointer subtraction).

To see how it works, pick up the gcc source and look at `synth_mult'
in expmed.c (touching on another thread, I note that this makes
gratuitous use of `alloca'; the two `struct algorithm' objects
could simply be local variables inside synth_mult).  Note also that
there are three variants using synth_mult (`basic', `negate', and
`add'; see expand_mult for details).
-- 
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA	Domain:	torek@bsdi.com	+1 510 234 3167

From: torek@elf.bsdi.com (Chris Torek)
Newsgroups: comp.lang.c
Subject: gcc and assembly (not Re: Top 10 subjects in comp.lang.c)
Date: 28 Jan 1997 13:23:57 -0800

In article <5cipjg$hhj@ousrvr3.oulu.fi> Vesa Karvonen
<vkarvone@raita.oulu.fi> writes:
>You [Peter Seebach] also think that RISC assembly language is not
>human readable. I have to disagree to a point. The point is that any
>assembly language is readable to a person who is experienced with
>the assembly language.

I will agree with that---sparc, mips, alpha, etc., are all readable
enough to someone with experience.

>Unfortunately not very many people are experienced in programming RISC
>chips. Some months ago I pointed out that you can not use the carry flag
>in C (whether or not it is useful - it is a FACT).

Indeed, you cannot use it directly.  There are lots of reasons,
but one particularly good example exists on the MIPS: there *is*
no carry flag.  There are no `add with carry' instructions, no `add
and set carry' instructions---there are no condition-code flags
whatsoever.  (Instead, there are logical `set' instructions and
branch-on-register-value instructions, so that you can `sltu' to
set a register to 1 if $v0<$v1, or `bnz' to branch if $a0 != 0,
etc.)

>... This is something I saved back then (this is quoted
>exactly, but I don't remember who I am quoting...):

You are quoting me.  We can drop the C code for now, and just go with
the generated sparc assembly, minus the stores:

torek>		add %o0,%o1,%o1
torek>		cmp %o1,%o0
torek>		addx %o2,%o3,%o2

torek>This is one instruction away from optimal for most SPARCs --
torek>it should omit the `cmp' and use `addcc' instead of `add'.

torek>The point here is not that gcc produces suboptimal code (it does),
torek>but rather that gcc can figure out that `sum is less than either
torek>operand' is equivalent to `a carry occurred in the add'.

I goofed here; I should not have said `a carry occurred in the add'
but rather, simply `a carry occurred' -- it was not in the add per se.

>Unfortunately I have to say that I am not experienced with Sparc assembly
>language, but that is ok, because the writer of the above text
>probably knows even less about assembly language (in general).

I daresay you are wrong about that. :-)  Going back to the carry
bit, the straightforward (aka `stupid') way to use the comparison
result is to branch:

	add	%o0, %o1, %o1
	add	%o2, %o3, %o2
	cmp	%o1, %o0
	bgeu	Laround		! do not add 1 if no carry
	 nop			! (delay slot)
	inc	%o2
Laround:

But gcc did not generate the obvious (and bad for the pipeline)
code; instead, it used an add, then a `set carry flag', then an
`add with carry'.  Now:

>Why?  Because gcc didn't figure out that 'sum is less than either
>operand' is equivalent to 'a carry occured in the add'. It was the
>writer who figured it out.

This is certainly true.  In fact, this is required in C, because
the language lacks the notion of `carry out of an add', so we must
use the identical notion `sum is less than an operand'.  (You can
consider this a language failing if you like; but it is more just
a way of doing things than a `failing' per se.)  The semantics are
the same, so either one suffices, though one certainly maps more
more obviously onto machines with carries.

Nonetheless, gcc *did* in fact use the carry flag, via the `addx'
instruction, to compute `t2 = c + d + (t1 < a)'.  As I recall, the
original claim was that `compilers cannot use the carry flag.'  Of
course, gcc did what it did because it has been told how.  It failed
to use `addcc' to set the carry flag in the first place because it
has *not* been told how to do *that*.  So, I took a quick stab at it.

Now, I am definitely not an expert at gcc machine descriptions; it
took me several tries to get something that worked at all.  I used
a simplistic [see footnote 1] peephole definition to turn an `add'
followed by the right kind of `compare' into an `addcc':

	(define_peephole
	  [(set (match_operand:SI 0 "register_operand" "=r")
		(plus:SI (match_operand:SI 1 "arith_operand" "%r")
			 (match_operand:SI 2 "arith_operand" "rI")))
	   (set (reg:CC 0) (compare:CC (match_dup 0) (match_dup 1)))]
	  ""
	  "addcc %1,%2,%0")

This got gcc to produce the desired sequence, as:

	void
	f(a, b, c, d)
		unsigned long a, b, c, d;
	{
		unsigned long t;
		t = a + b;
		use(t, c + d + (t < a));
	}

[see footnote 2] now compiles to:

	_f:
		save %sp,-104,%sp
		addcc %i0,%i1,%o0
		call _use,0
		addx %i2,%i3,%o1
		ret
		restore

which is optimal in terms of the `add' sequence.  (It would be nice
to have gcc avoid the save and restore, and simply jump directly
to _use(), but that is another optimization entirely.)

>The point isn't that C compilers would not be able to use the carry flag.
>There is no doubt about it. There a two points: 1. assembly language is
>readable, but only to persons who are experienced with the assembly
>language and even then the code must be short and fairly well organized;
>2. compilers are not as intelligent as you are (hopefully ;).

I will definitely agree with both of these points (though actually,
the assembly code can get rather long as long as it remains well
organized, and of course `short' and `long' here are rather loosely
defined).  However, compilers also do not get tired; once we teach
them how to transform some particular source-language construct
into an equivalent and efficient machine-level construct, they will
do so without error, much faster than we can.

I still write some (indeed, too many) things in assembly, but I
also try to get the compilers to write the best assembly for me,
so that I will not have to do so again.
-----
Footnotes:

1 There is at least one obvious stumbling block: gcc represents
  the entire 32-bit condition code register (`reg:CC 0' above---
  there is also a 64-bit CCR on V9, and up to four FP CCRs) as a
  single entity.  It can then make use of the individual bits (e.g.,
  carry and zero) in later instructions (whether they are branches,
  conditional moves, or what-have-you), after it has set `reg:CC
  0' via a `compare'.  However, while the peephole replacement sets
  the `C' flag in the same way, it sets the N, Z, and V flags
  differently, and my attempt does not capture this at all.  I
  think this could be handled by splitting the single `cc0' register
  into four registers, one per flag, but also I think this would
  also lead to terrible code generation.  Alternatively, the empty
  string above can be replaced with C code that verifies that this
  peephole optimization is being used correctly (though I have no
  clear idea what that code would be).  It might be `best', though,
  given gcc's current setup, to add a separate notion of `add and
  set carry', then handle this in the combiner, which is where gcc
  does algebraic transforms---the transform would then turn `sum
  is less than an operand' back into `add and set carry'---but this
  would require far more work than I am willing to attempt here.

2 Astute readers may note that I changed the example C source here.
  There is a reason I did that: the peephole recognizer fails to
  match across a store, and by the time the peephole optimization
  pass runs, the global variable store has been pipeline-scheduled
  up in between the add and the compare.  This is another reason
  my attempt is not ready for production use:  It does not catch
  enough cases.
--
In-Real-Life: Chris Torek, Berkeley Software Design Inc
El Cerrito, CA	Domain:	torek@bsdi.com	+1 510 234 3167

Index Home About Blog