BASM for beginners, lesson 6

This is lesson number 6 and the topic is CharPos. This function searchs for the first occurrence of a character in a string, and returns the position of it when found. If nothing is found it returns zero. The Delphi library function does basically the same thing, but with the difference that it searches for a substring in a string. Passing a character to Pos as the substring is possible and this way use Pos as a CharPos. In this lesson we will develop a CharPos that is nearly 4 times faster than Delphi's Pos.

As usual we start with a Pascal implementation of the algorithm.

function CharPos2(Chr : Char; const Str : AnsiString) : Cardinal;
var
I : Integer;

begin
if (Str <> '') then
begin
I := 0;
repeat
Inc(I);
until((Str[I] = Chr) or (Str[I] = #0));
if (Str[I] <> #0) then
Result := I
else
Result := 0;
end
else
Result := 0;
end;

Input to the function is a Char and a string. The string is passed as const. The functions Result is a Cardinal. First the function checks that the string is not empty and if is empty it returns zero. If there is a string we iterate over it with a repeat until loop searching for a match to the input char. If an occurrence of the char is not found before we reach the zero terminator this will also terminate the loop. Because the loop can terminate on either of the two conditions it is necessary to check what terminated the search before we know what to return as result. If the loop was ended by a successful search we return the value of the loop counter, and if the search was unsuccessful the result is zero.

It is also possible to use the length of the string as a condition for terminating the loop on an unsuccessful search. Then the code looks like this.

function CharPos1(Chr : Char; const Str : AnsiString) : Cardinal;
var
StrLenght, I : Integer;

begin
StrLenght := Length(Str);
if StrLenght > 0 then
begin
I := 0;
repeat
Inc(I);
until((Str[I] = Chr) or (I > StrLenght));
if I <= StrLenght then
Result := I
else
Result := 0;
end
else
Result := 0;
end;

Before we start jumping into ASM it is a good idea to see which of the Pascal implementations is the fastest. For doing this a benchmark must be established.

const
NOOFLOOPS : Cardinal = 200000;
SCALE : Cardinal = 1000;

procedure Benchmark;
var
lpPerformanceCount, StartCount, EndCount : TLargeInteger;
Succes : Boolean;
Str, Str1, FunctionName : AnsiString;
Chr1, Chr2 : Char;
I, CharPos, J, K, Bench, SumBench : Cardinal;
StringArray : array[1..255] of AnsiString;

begin
Series1.Clear;
Str1 := 'T';
for J := 1 to 255 do
begin
StringArray[J] := Str1;
Str1 := 'A' + Str1;
end;
SumBench := 0;
Chr1 := 'T';
Chr2 := 'X';
for K := 1 to 255 do
begin
Str := StringArray[K];
Succes := QueryPerformanceCounter(lpPerformanceCount);
if Succes then
StartCount := lpPerformanceCount
else
raise Exception.Create('QueryPerformanceCounter failed');
for I := 1 to NOOFLOOPS do
begin
CharPos := CharPosFunction(Chr1, Str);
end;
for I := 1 to NOOFLOOPS do
begin
CharPos := CharPosFunction(Chr2, Str);
end;
Succes := QueryPerformanceCounter(lpPerformanceCount);
if Succes then
EndCount := lpPerformanceCount
else
raise Exception.Create('QueryPerformanceCounter failed');
Bench := Round((EndCount - StartCount) / SCALE);
Series1.AddXY(K, Bench, '', clBlue);
Bench := Round(Bench / K);
SumBench := SumBench + Bench;
Update;
end;
FunctionName := FunctionSelectRadioGroup.Items[FunctionSelectRadioGroup.ItemIndex];
ReportRichEdit.Lines.Add(FunctionName + #9 + IntToStr(SumBench));
end;

The benchmark builds an array of 255 AnsiStrings. The first string is 'T'. 'T' is also the character we use as the search character. String number 1 is therefore the shortest possible successful match. The next strings are 'AT', 'AAT' and 'AAAT'. I guess the pattern is clear. It is equally important to measure the performance on unsuccessful searches. For this purpose we pass 'X' as search character. The benchmark does NOOFLOOPS searches on each string and measures the time used on each string. Because we want results from each string length to contribute approximately the same to the benchmark, each timing result is divided by the length of the string.

On this benchmark CharPos1 obtains the score 767 on a P4 1600A clocked at 1920 and CharPos2 obtains the score 791. For comparison Delphi Pos scores 2637.

Because CharPos1 is slightly better than CharPos2 we select this as basis for optimization. This is the ASM code Delphi 6 compiled it into with optimizations turned on.

function CharPos14(Chr : Char; const Str : AnsiString) : Cardinal;
var
StrLenght, I : Integer;

begin
{
push ebx
push esi
mov esi,edx
mov ebx,eax
}
StrLenght := Length(Str);
{
mov eax,esi
call @LStrLen
mov edx,eax
}
if StrLenght > 0 then
{
test edx,edx
jle @Else1Begin
}
begin
I := 0;
{
xor eax,eax
}
repeat
{
@RepeatBegin :
}
Inc(I);
{
inc eax
}
until((Str[I] = Chr) or (I > StrLenght));
{
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin :
}
if I <= StrLenght then
{
@If2 :
cmp edx,eax
jnl @Exit
}
Result := I
{
}
else
Result := 0;
{
xor eax,eax
pop esi
pop ebx
ret
}
end
else
Result := 0;
{
@Else1Begin :
xor eax,eax
}
{
@Exit :
pop esi
pop ebx
}
end;

This time there is no stackframe. Ebx and esi is used and needs to be backed up and restored by pushing and popping them on the stack. Because the function does not have its own stackframe they are simply pushed on the stack on the top of the calling functions frame. The input parameters are received in eax an edx and they are as first thing copied into esi and ebx. The function Length has a secret internal name which is LStrLen. This function is called on Str which is passed in eax. We see that LStrLen is following the register calling convention. Str was received in edx copied to esi and then to eax. LStrLen delivers its result, StrLenght, in eax. This result is copied into edx and compared to 0. Test edx,edx is the same as cmp edx,0 and it sets the flags. The jle instruction checks the flags and pass execution to the else part of the if-else code if StrLenght is lower than or equal to zero. In this else code we have one line of Pascal, which is Result := 0;. Because our function must return its result in eax we create a zero here by x-oring eax by itself. If the string is longer than zero execution is continued in the if-code. The first line here initializes the loop counter I to zero. Again a xor instruction is used for this. The loop body has one line only and this is easy to understand Inc(I); = inc eax. Pascal and ASM is nearly the same thing ;-)
The implementation of the loop closing test is where the real meat of this function is. It is made up of these four lines of ASM.

cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin :

We see that there are two jump instructions. The last one jumps to the start of the loop and the first one jmps out of the loop. There are two cmp instructions to set the flags. Number two is the easiest to understand. It compares eax with edx. A fast look at the Pascal code tells us that I and StrLenght must be in these two registers. In fact we have just seen that eax is I and we also saw in the top of the function that StrLenght was in edx.
In line number 4 Chr was copied into ebx, but a char is only one byte. Therefore the first cmp instruction compares something to bl which is the lowest byte of the four bytes in ebx. We expect that the search character - Chr - is compared to character number 1, 2, 3.. of the input string. [esi+eax-$01] must therefore be a pointer into this string. eax is the loop counter I, which is 1 at the first iteration. esi must be the address of the Str parameter that was received in edx and immediately copied into esi. -$01 is a constant offset which is needed because the first character in an AnsiString is located at position 0. The position where Str points to.
Where has the or from the Pascal code gone? To understand this we need to talk about the optimization called incomplete boolean evaluation. This optimization can be applied to the boolean operator and, and to the boolean operator or. Let us take a look at the truth table for and first.

false and false is false
false and true is false
true and false is false
true and true is true

The and-operator is only returning true if both operands are true. This is taken advantage of by the incomplete boolean evaluation optimization. If one operand is found false there is no need to check the second one, because the result of and will be false no matter what it is.

The truth table for or is

false or false is false
false or true is true
true or false is true
true or true is true

The result of an or is true if one or both of the operands are true. If one is seen to be true there is no need to check the second operand.

Our loop closing test takes advantage of this by jumping out of the loop if the first compare is successful. This is the case if we found a match for the search character in the string. If it was a match it does not make sense to see if it was the zero terminator too! The last compare is only executed on unsuccessful matches.

If we knew something about the strings and characters our function was called upon most often we could take advantage of it by changing the order of the tests such that the most often true one is located first.
Try switching on the compiler switch "complete boolean evaluation", in project options, and see what code is created then.

The rest of the code resembles what we have seen in earlier lessons and I will skip the explanation and go get a cup of coffee instead ;-)

Now it is time to do some optimizations. At first the function is made pure BASM. The labels were introduced in the previous code listing. Here it is seen that they follow the Pascal code in an intuitive way.

function CharPos15(Chr : Char; const Str : AnsiString) : Cardinal;
//var
//StrLenght, I : Integer;

asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//StrLenght := Length(Str);
mov eax,esi
call System.@LStrLen
mov edx,eax
//if StrLenght > 0 then
test edx,edx
jle @Else1Begin
//I := 0;
xor eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := I
//else
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//else
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

The call to LStrLen has been prefixed with “System”, otherwise the compiler will not recognize it. LStrLen is implemented in the System unit.
The var section is removed because we do not reference any variables by name.

function CharPos16(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//StrLenght := Length(Str);
mov eax,esi
//call System.@LStrLen
//*************
test eax,eax
jz @LStrLenExit
mov eax,[eax-$04]
@LStrLenExit :
//*************
mov edx,eax
//if StrLenght > 0 then
test edx,edx
jle @Else1Begin
//I := 0;
xor eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := I
//else
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//else
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

The first thing we do is to inline LStrLen. This is done by tracing into it and copying the body of the function from the CPU view. It is made up of these four lines.

test eax,eax
jz +$03
mov eax,[eax-$04]
ret

If a nil pointer is passed to LStrLen in eax nothing is done but returning. If the pointer is valid the length of the string is found at the 4 bytes preceding the start of the string. These 4 bytes are returned in eax. To inline it we replace the call with the 4 lines. Jz is transferring execution to the ret instruction. Instead of this ret instruction we place a label called LStrLenExit. The ret instruction returned execution to the line after the call of the function. This ret has to be removed, otherwise it would pass execution to the function that called CharPos, not exactly what we want. This is what the inlined LStrLen ended up looking like

test eax,eax
jz @LStrLenExit
mov eax,[eax-$04]
@LStrLenExit :

function CharPos17(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//StrLenght := Length(Str);
mov eax,esi
//*************
test eax,eax
//jz @LStrLenExit
jz @Else1Begin
mov eax,[eax-$04]
//@LStrLenExit :
//*************
mov edx,eax
//if StrLenght > 0 then
//test edx,edx
//jle @Else1Begin
//I := 0;
xor eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := I
//else
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//else
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

Now we observe that the Pascal line; if StrLenght > 0 then, is testing the same thing as the first line in the inlined LStrLen. Testing Str for being nil one time must be enough ;-). The second test is removed and the first is changed such that we jump to @Else1Begin instead of just "out of" the inlined StrLen function if Str is nil. Then there is no need for the LStrLenExit label.

function CharPos18(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//StrLenght := Length(Str);
//mov eax,esi
//if StrLenght > 0 then
//test eax,eax
test esi,esi
jz @Else1Begin
//mov eax,[eax-$04]
mov eax,[esi-$04]
mov edx,eax
//I := 0;
xor eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := I
//else
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//else
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

We moved the test of StrLenght = 0 and the comment //if StrLenght > 0 then must be moved too. After the inlining it becomes possible to copy propagate esi in these lines.

mov eax,esi
//*************
test eax,eax
jz @Else1Begin
mov eax,[eax-$04]

The last of the lines overwrites eax and is the last use of the value in eax that was copied from esi.

mov eax,esi
//*************
//test eax,eax
test esi,esi
jz @Else1Begin
//mov eax,[eax-$04]
mov eax,[esi-$04]

Actually we must also follow the possible branch to Else1Begin and see whether the value in eax is used here. We see that eax is zeroed out in the line immediately after the label and therefore is not used. This way the first line is seen to be dead and can be removed.

//mov eax,esi
test esi,esi
jz @Else1Begin
mov eax,[esi-$04]

function CharPos19(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//if StrLenght > 0 then
test esi,esi
jz @Else1Begin
//StrLenght := Length(Str);
//mov eax,[esi-$04]
mov edx,[esi-$04]
//mov edx,eax
//I := 0;
xor eax,eax
//repeat
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
cmp bl,[esi+eax-$01]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := I
//else
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//else
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

Also as a result of inlining LStrLen we can remove one more mov. LStrLen returned its result in eax and then it was copied into edx. mov eax,[esi-$04] can then be changed to mov edx,[esi-$04] and the mov edx, eax can be removed.
After this we change focus to the loop. This is far more important because it is executed many more times, depending on the length of the string or the position of a match.

function CharPos20(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//if StrLenght > 0 then
test esi,esi
jz @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
xor eax,eax
dec esi
@RepeatBegin :
//Inc(I);
inc eax
//until((Str[I] = Chr) or (I > StrLenght));
//cmp bl,[esi+eax-$01]
cmp bl,[esi+eax]
jz @If2
cmp edx,eax
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,eax
jnl @Exit
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
@Exit :
pop esi
pop ebx
end;

When we analyzed the code we saw that there was an offset of -1 in the addressing into the string. There is no need to subtract this offset at each iteration. It is a better idea to decrement the pointer to Str in esi once before entering the loop. We could also have decremented the loop counter in eax, but then we would have to add 1 again before returning the result.

At the very top of the function the two input parameters are copied to new registers. This is redundant and we would like to copy propagate both, but eax is used as loop counter and we must first find another register for this purpose.

function CharPos22(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
mov ebx,eax
//if StrLenght > 0 then
test esi,esi
jz @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
//xor eax,eax
xor ecx,ecx
dec esi
@RepeatBegin :
//Inc(I);
//inc eax
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp bl,[esi+eax]
cmp bl,[esi+ecx]
jz @If2
//cmp edx,eax
cmp edx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
//cmp edx,eax
cmp edx,ecx
jnl @Exit
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop esi //New
pop ebx //New
ret //New
@Exit :
mov eax, ecx
pop esi
pop ebx
end;

All lines where eax is in use for I, eax is changed to ecx. Because I is the return value of the function on a match and the return value is in eax, we have to copy ecx into eax just beneath the @Exit label. This introduces a little problem, because a jump to Else1Begin also brings us here, and in this situation we copy a value from ecx into eax, which we have just cleared. The fix is to add the lines marked new.
Then we are ready to copy propagate eax. Only one line uses ebx. This is cmp bl,[esi+ecx], which is changed to cmp al,[esi+ecx]. Then the copy mov ebx,eax becomes dead and is removed. This was copy propagation followed by dead code removal once more and we begin realising how important this optimization is.

function CharPos23(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
//mov ebx,eax
//if StrLenght > 0 then
test esi,esi
jz @Else1Begin
//StrLenght := Length(Str);
mov edx,[esi-$04]
//I := 0;
xor ecx,ecx
dec esi
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp bl,[esi+ecx]
cmp al,[esi+ecx]
jz @If2
cmp edx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp edx,ecx
jnl @Exit
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop esi
pop ebx
ret
@Exit :
mov eax, ecx
pop esi
pop ebx
end;

Before we can copy propagate edx (holding the pointer to Str), we must free edx from other uses. It is used for StrLenght and we allocate ebx for this instead of edx.

function CharPos24(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
mov esi,edx
//if StrLenght > 0 then
test esi,esi
jz @Else1Begin
//StrLenght := Length(Str);
//mov edx,[esi-$04]
mov ebx,[esi-$04]
//I := 0;
xor ecx,ecx
dec esi
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp al,[esi+ecx]
jz @If2
//cmp edx,ecx
cmp ebx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
//cmp edx,ecx
cmp ebx,ecx
jnl @Exit
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop esi
pop ebx
ret
@Exit :
mov eax, ecx
pop esi
pop ebx
end;

Then edx is copy propagated and the mov esi,edx becomes dead.

function CharPos25(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push esi
//mov esi,edx
//if StrLenght > 0 then
//test esi,esi
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
//mov ebx,[esi-$04]
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
//dec esi
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
//cmp al,[esi+ecx]
cmp al,[edx+ecx]
jz @If2
cmp ebx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp ebx,ecx
jnl @Exit
//Result := 0;
xor eax,eax
pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop esi
pop ebx
ret
@Exit :
mov eax, ecx
pop esi
pop ebx
end;

This removed the use of esi and it does not need to be saved and restored any more. When removing the pop esi's, we remember that there are 3 exit paths all with each own pop esi.

function CharPos26(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//push esi
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp al,[edx+ecx]
jz @If2
cmp ebx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
@If2 :
cmp ebx,ecx
jnl @Exit
//Result := 0;
xor eax,eax
//pop esi
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
//pop esi
pop ebx
ret
@Exit :
mov eax, ecx
//pop esi
pop ebx
end;

In the line after the If2 label there is a line which is identical to the second compare in the loop closing test. In Pascal it was necessary to have the line if I <= StrLenght after the loop because we could not know which condition the loop terminated on. This line generated the extra cmp ebx,ecx instruction, which looks a little redundant now. It is in fact not redundant because two paths of execution lead to the If2 Label and only one of them has the test. If we split the two exit paths such that only one of them goes to If2 we can remove the extra check. Instead of jumping to If2 on a match we can jump directly to Exit.

function CharPos27(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp al,[edx+ecx]
//jz @If2
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin
//if I <= StrLenght then
//@If2 :
//cmp ebx,ecx
//jnl @Exit
//Result := 0;
xor eax,eax
pop ebx
ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop ebx
ret
@Exit :
mov eax,ecx
pop ebx
end;

Then the If2 label goes out of use and when we reach the code at this position we know that the zero terminator was reached and there is no need to test for it again.

There are two sections of identical code just before the Else1Begin label and just after it. The upper section can be deleted.

function CharPos28(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp al,[edx+ecx]
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin
//Result := 0;
//xor eax,eax
//pop ebx
//ret
//Result := 0;
@Else1Begin :
xor eax,eax
pop ebx
ret
@Exit :
mov eax,ecx
pop ebx
end;

This ended our search for redundant code to remove. The cleaned up function looks like this.

function CharPos29(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
cmp al,[edx+ecx]
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin
@Else1Begin :
//Result := 0;
xor eax,eax
pop ebx
ret
@Exit :
mov eax,ecx
pop ebx
end;

When the loop is iterating in search for a match or the end of the string, these lines of code are executed over and over again

inc ecx
cmp al,[edx+ecx]
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin

Let us make some variants of them and see how they perform. The most exiting line is

cmp al,[edx+ecx]

It generates two microinstructions, one for loading a byte from the address [edx+ecx] and one to compare it against al. This line could also be coded as

mov ah, byte ptr [edx+ecx]
cmp al, ah

This variant also generates two microinstructions, but it needs one more register (ah).

If we allocate an extra register it could also be coded as

movzx efx, byte ptr [edx+ecx]
cmp al, fh

movzx is mov with zero extension. It loads one byte into the lowest of efx and fills the three remaining bytes with zeroes. Of course there is no such thing as an efx register, but the two unused registers esi & edi can not be accessed bytewise. Therefore it is necessary to free eax, ebx, ecx or edx, by substituting it by edi or esi and then use eg. ebx instead of "efx".

This function demonstrates the first variant.

function CharPos30(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov ebx,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
mov ah, [edx+ecx]
//cmp al,[edx+ecx]
cmp al,ah
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin
@Else1Begin :
//Result := 0;
xor eax,eax
pop ebx
ret
@Exit :
mov eax,ecx
pop ebx
end;

This function demonstrates the second variant.

function CharPos31(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push edi
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov edi,[edx-$04]
//I := 0;
xor ecx,ecx
dec edx
@RepeatBegin :
//Inc(I);
inc ecx
//until((Str[I] = Chr) or (I > StrLenght));
movzx ebx, byte ptr [edx+ecx]
//cmp al,[edx+ecx]
cmp al, bl
jz @Exit
cmp edi,ecx
jnl @RepeatBegin
@Else1Begin :
//Result := 0;
xor eax,eax
pop edi
pop ebx
ret
@Exit :
mov eax,ecx
pop edi
pop ebx
end;

Instead of adding edx and ecx in the address calculation at every iteration, we could add them prior to entering the loop. Then it is necessary to subtract them from each other again to extract the loop counter value for the result. This is done with the sub instruction in line two after the Exit label.

function CharPos32(Chr : Char; const Str : AnsiString) : Cardinal;
asm
push ebx
push edi
//if StrLenght > 0 then
test edx,edx
jz @Else1Begin
//StrLenght := Length(Str);
mov edi,[edx-$04]
//I := 0;
xor ecx,ecx
//dec edx
add ecx,edx
@RepeatBegin :
//Inc(I);
//until((Str[I] = Chr) or (I > StrLenght));
movzx ebx, byte ptr [ecx]
inc ecx
cmp al, bl
jz @Exit
//cmp edi,ecx
test bl, bl
jnz @RepeatBegin
@Else1Begin :
//Result := 0;
xor eax,eax
pop edi
pop ebx
ret
@Exit :
mov eax,ecx
sub eax,edx
pop edi
pop ebx
end;

Then there are 4 functions to compare the performance of; CharPos29, CharPos30, CharPos31 and CharPos32.

Results on P4 1920 are

CharPos29 716
CharPos30 973
CharPos31 710
CharPos32 702

Winner is CharPos32

Results on P3 1400 are

CharPos29 949
CharPos30 921
CharPos31 950
CharPos32 1403

Winner is CharPos30

Summed time

CharPos29 716 + 949 = 1665
CharPos30 973 + 921 = 1894
CharPos31 710 + 950 = 1660
CharPos32 702 + 1403 = 2105

Winner is CharPos31

On P4 the winning loop is

@RepeatBegin :
movzx ebx, byte ptr [ecx]
inc ecx
cmp al, bl
jz @Exit
test bl, bl
jnz @RepeatBegin

On P3 the winning loop is

@RepeatBegin :
inc ecx
mov ah, [edx+ecx]
cmp al,ah
jz @Exit
cmp ebx,ecx
jnl @RepeatBegin

on a blend of targets the winning loop is

@RepeatBegin :
inc ecx
movzx ebx, byte ptr [edx+ecx]
cmp al, bl
jz @Exit
cmp edi,ecx
jnl @RepeatBegin

The P4 winner performs very bad on P3 and could not be used in a library targeting more targets than P4, such as the Delphi RTL. The winner on P3 is performing quite badly on P4 and this one should not be used in a blended target library either. The winner on a blend of the two targets is CharPos31, which performs near to optimal on P4 and also performs optimal within a few percent on P3. This function would make a perfect choice for the Delphi RTL, where I have missed a CharPos function. It is a relief to see that it is possible to optimize for both processors at the same time without sacrificing more than a few percent of performance.

Comparing performance of P3 and P4 on a clock by clock basis is always a hit. There is broad tendency to think at the P4 as having an inferior design, but this is not proved by this code. Taking the blended winner's performance and scaling it to a 1400 MHz processor it is 950 on P3 and 710 * (1920/1400) = 973 on P4. The processors are performing very much the same at the same clock.

Regards
Dennis

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。