打开APP
userphoto
未登录

开通VIP,畅享免费电子书等14项超值服

开通VIP
BASM for beginners, lesson 6
This is lesson number 6 and the topic is CharPos. This function searchs for the first occurrence of a character in a string, and returns the position of it when found. If nothing is found it returns zero. The Delphi library function does basically the same thing, but with the difference that it searches for a substring in a string. Passing a character to Pos as the substring is possible and this way use Pos as a CharPos. In this lesson we will develop a CharPos that is nearly 4 times faster than Delphi's Pos.

 As usual we start with a Pascal implementation of the algorithm.

 function CharPos2(Chr : Char; const Str : AnsiString) : Cardinal;
 var
 I : Integer;

 begin
 if (Str <> '') then
 begin
 I := 0;
 repeat
 Inc(I);
 until((Str[I] = Chr) or (Str[I] = #0));
 if (Str[I] <> #0) then
 Result := I
 else
 Result := 0;
 end
 else
 Result := 0;
 end;

 Input to the function is a Char and a string. The string is passed as const. The functions Result is a Cardinal. First the function checks that the string is not empty and if is empty it returns zero. If there is a string we iterate over it with a repeat until loop searching for a match to the input char. If an occurrence of the char is not found before we reach the zero terminator this will also terminate the loop. Because the loop can terminate on either of the two conditions it is necessary to check what terminated the search before we know what to return as result. If the loop was ended by a successful search we return the value of the loop counter, and if the search was unsuccessful the result is zero.

 It is also possible to use the length of the string as a condition for terminating the loop on an unsuccessful search. Then the code looks like this.

 function CharPos1(Chr : Char; const Str : AnsiString) : Cardinal;
 var
 StrLenght, I : Integer;

 begin
 StrLenght := Length(Str);
 if StrLenght > 0 then
 begin
 I := 0;
 repeat
 Inc(I);
 until((Str[I] = Chr) or (I > StrLenght));
 if I <= StrLenght then
 Result := I
 else
 Result := 0;
 end
 else
 Result := 0;
 end;

 Before we start jumping into ASM it is a good idea to see which of the Pascal implementations is the fastest. For doing this a benchmark must be established.

 const
 NOOFLOOPS : Cardinal = 200000;
 SCALE : Cardinal = 1000;

 procedure Benchmark;
 var
 lpPerformanceCount, StartCount, EndCount : TLargeInteger;
 Succes : Boolean;
 Str, Str1, FunctionName : AnsiString;
 Chr1, Chr2 : Char;
 I, CharPos, J, K, Bench, SumBench : Cardinal;
 StringArray : array[1..255] of AnsiString;

 begin
 Series1.Clear;
 Str1 := 'T';
 for J := 1 to 255 do
 begin
 StringArray[J] := Str1;
 Str1 := 'A' + Str1;
 end;
 SumBench := 0;
 Chr1 := 'T';
 Chr2 := 'X';
 for K := 1 to 255 do
 begin
 Str := StringArray[K];
 Succes := QueryPerformanceCounter(lpPerformanceCount);
 if Succes then
 StartCount := lpPerformanceCount
 else
 raise Exception.Create('QueryPerformanceCounter failed');
 for I := 1 to NOOFLOOPS do
 begin
 CharPos := CharPosFunction(Chr1, Str);
 end;
 for I := 1 to NOOFLOOPS do
 begin
 CharPos := CharPosFunction(Chr2, Str);
 end;
 Succes := QueryPerformanceCounter(lpPerformanceCount);
 if Succes then
 EndCount := lpPerformanceCount
 else
 raise Exception.Create('QueryPerformanceCounter failed');
 Bench := Round((EndCount - StartCount) / SCALE);
 Series1.AddXY(K, Bench, '', clBlue);
 Bench := Round(Bench / K);
 SumBench := SumBench + Bench;
 Update;
 end;
 FunctionName := FunctionSelectRadioGroup.Items[FunctionSelectRadioGroup.ItemIndex];
 ReportRichEdit.Lines.Add(FunctionName + #9 + IntToStr(SumBench));
 end;

 The benchmark builds an array of 255 AnsiStrings. The first string is 'T'. 'T' is also the character we use as the search character. String number 1 is therefore the shortest possible successful match. The next strings are 'AT', 'AAT' and 'AAAT'. I guess the pattern is clear. It is equally important to measure the performance on unsuccessful searches. For this purpose we pass 'X' as search character. The benchmark does NOOFLOOPS searches on each string and measures the time used on each string. Because we want results from each string length to contribute approximately the same to the benchmark, each timing result is divided by the length of the string.

 On this benchmark CharPos1 obtains the score 767 on a P4 1600A clocked at 1920 and CharPos2 obtains the score 791. For comparison Delphi Pos scores 2637.

 Because CharPos1 is slightly better than CharPos2 we select this as basis for optimization. This is the ASM code Delphi 6 compiled it into with optimizations turned on.

 function CharPos14(Chr : Char; const Str : AnsiString) : Cardinal;
 var
 StrLenght, I : Integer;

 begin
 {
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 }
 StrLenght := Length(Str);
 {
 mov eax,esi
 call @LStrLen
 mov edx,eax
 }
 if StrLenght > 0 then
 {
 test edx,edx
 jle @Else1Begin
 }
 begin
 I := 0;
 {
 xor eax,eax
 }
 repeat
 {
 @RepeatBegin :
 }
 Inc(I);
 {
 inc eax
 }
 until((Str[I] = Chr) or (I > StrLenght));
 {
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin :
 }
 if I <= StrLenght then
 {
 @If2 :
 cmp edx,eax
 jnl @Exit
 }
 Result := I
 {
 }
 else
 Result := 0;
 {
 xor eax,eax
 pop esi
 pop ebx
 ret
 }
 end
 else
 Result := 0;
 {
 @Else1Begin :
 xor eax,eax
 }
 {
 @Exit :
 pop esi
 pop ebx
 }
 end;

 This time there is no stackframe. Ebx and esi is used and needs to be backed up and restored by pushing and popping them on the stack. Because the function does not have its own stackframe they are simply pushed on the stack on the top of the calling functions frame. The input parameters are received in eax an edx and they are as first thing copied into esi and ebx. The function Length has a secret internal name which is LStrLen. This function is called on Str which is passed in eax. We see that LStrLen is following the register calling convention. Str was received in edx copied to esi and then to eax. LStrLen delivers its result, StrLenght, in eax. This result is copied into edx and compared to 0. Test edx,edx is the same as cmp edx,0 and it sets the flags. The jle instruction checks the flags and pass execution to the else part of the if-else code if StrLenght is lower than or equal to zero. In this else code we have one line of Pascal, which is Result := 0;. Because our function must return its result in eax we create a zero here by x-oring eax by itself. If the string is longer than zero execution is continued in the if-code. The first line here initializes the loop counter I to zero. Again a xor instruction is used for this. The loop body has one line only and this is easy to understand Inc(I); = inc eax. Pascal and ASM is nearly the same thing ;-)
 The implementation of the loop closing test is where the real meat of this function is. It is made up of these four lines of ASM.

 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin :

 We see that there are two jump instructions. The last one jumps to the start of the loop and the first one jmps out of the loop. There are two cmp instructions to set the flags. Number two is the easiest to understand. It compares eax with edx. A fast look at the Pascal code tells us that I and StrLenght must be in these two registers. In fact we have just seen that eax is I and we also saw in the top of the function that StrLenght was in edx.
 In line number 4 Chr was copied into ebx, but a char is only one byte. Therefore the first cmp instruction compares something to bl which is the lowest byte of the four bytes in ebx. We expect that the search character - Chr - is compared to character number 1, 2, 3.. of the input string. [esi+eax-$01] must therefore be a pointer into this string. eax is the loop counter I, which is 1 at the first iteration. esi must be the address of the Str parameter that was received in edx and immediately copied into esi. -$01 is a constant offset which is needed because the first character in an AnsiString is located at position 0. The position where Str points to.
 Where has the or from the Pascal code gone? To understand this we need to talk about the optimization called incomplete boolean evaluation. This optimization can be applied to the boolean operator and, and to the boolean operator or. Let us take a look at the truth table for and first.

 false and false is false
 false and true is false
 true and false is false
 true and true is true

 The and-operator is only returning true if both operands are true. This is taken advantage of by the incomplete boolean evaluation optimization. If one operand is found false there is no need to check the second one, because the result of and will be false no matter what it is.

 The truth table for or is

 false or false is false
 false or true is true
 true or false is true
 true or true is true

 The result of an or is true if one or both of the operands are true. If one is seen to be true there is no need to check the second operand.

 Our loop closing test takes advantage of this by jumping out of the loop if the first compare is successful. This is the case if we found a match for the search character in the string. If it was a match it does not make sense to see if it was the zero terminator too! The last compare is only executed on unsuccessful matches.

 If we knew something about the strings and characters our function was called upon most often we could take advantage of it by changing the order of the tests such that the most often true one is located first.
 Try switching on the compiler switch "complete boolean evaluation", in project options, and see what code is created then.

 The rest of the code resembles what we have seen in earlier lessons and I will skip the explanation and go get a cup of coffee instead ;-)

 Now it is time to do some optimizations. At first the function is made pure BASM. The labels were introduced in the previous code listing. Here it is seen that they follow the Pascal code in an intuitive way.

 function CharPos15(Chr : Char; const Str : AnsiString) : Cardinal;
 //var
 //StrLenght, I : Integer;

 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //StrLenght := Length(Str);
 mov eax,esi
 call System.@LStrLen
 mov edx,eax
 //if StrLenght > 0 then
 test edx,edx
 jle @Else1Begin
 //I := 0;
 xor eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := I
 //else
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;

 The call to LStrLen has been prefixed with “System”, otherwise the compiler will not recognize it. LStrLen is implemented in the System unit.
 The var section is removed because we do not reference any variables by name.

 function CharPos16(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //StrLenght := Length(Str);
 mov eax,esi
 //call System.@LStrLen
 //*************
 test eax,eax
 jz @LStrLenExit
 mov eax,[eax-$04]
 @LStrLenExit :
 //*************
 mov edx,eax
 //if StrLenght > 0 then
 test edx,edx
 jle @Else1Begin
 //I := 0;
 xor eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := I
 //else
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;

 The first thing we do is to inline LStrLen. This is done by tracing into it and copying the body of the function from the CPU view. It is made up of these four lines.

 test eax,eax
 jz +$03
 mov eax,[eax-$04]
 ret

 If a nil pointer is passed to LStrLen in eax nothing is done but returning. If the pointer is valid the length of the string is found at the 4 bytes preceding the start of the string. These 4 bytes are returned in eax. To inline it we replace the call with the 4 lines. Jz is transferring execution to the ret instruction. Instead of this ret instruction we place a label called LStrLenExit. The ret instruction returned execution to the line after the call of the function. This ret has to be removed, otherwise it would pass execution to the function that called CharPos, not exactly what we want. This is what the inlined LStrLen ended up looking like

 test eax,eax
 jz @LStrLenExit
 mov eax,[eax-$04]
 @LStrLenExit :

 function CharPos17(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //StrLenght := Length(Str);
 mov eax,esi
 //*************
 test eax,eax
 //jz @LStrLenExit
 jz @Else1Begin
 mov eax,[eax-$04]
 //@LStrLenExit :
 //*************
 mov edx,eax
 //if StrLenght > 0 then
 //test edx,edx
 //jle @Else1Begin
 //I := 0;
 xor eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := I
 //else
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;


 Now we observe that the Pascal line; if StrLenght > 0 then, is testing the same thing as the first line in the inlined LStrLen. Testing Str for being nil one time must be enough ;-). The second test is removed and the first is changed such that we jump to @Else1Begin instead of just "out of" the inlined StrLen function if Str is nil. Then there is no need for the LStrLenExit label.

 function CharPos18(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //StrLenght := Length(Str);
 //mov eax,esi
 //if StrLenght > 0 then
 //test eax,eax
 test esi,esi
 jz @Else1Begin
 //mov eax,[eax-$04]
 mov eax,[esi-$04]
 mov edx,eax
 //I := 0;
 xor eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := I
 //else
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;

 We moved the test of StrLenght = 0 and the comment //if StrLenght > 0 then must be moved too. After the inlining it becomes possible to copy propagate esi in these lines.

 mov eax,esi
 //*************
 test eax,eax
 jz @Else1Begin
 mov eax,[eax-$04]

 The last of the lines overwrites eax and is the last use of the value in eax that was copied from esi.

 mov eax,esi
 //*************
 //test eax,eax
 test esi,esi
 jz @Else1Begin
 //mov eax,[eax-$04]
 mov eax,[esi-$04]

 Actually we must also follow the possible branch to Else1Begin and see whether the value in eax is used here. We see that eax is zeroed out in the line immediately after the label and therefore is not used. This way the first line is seen to be dead and can be removed.

 //mov eax,esi
 test esi,esi
 jz @Else1Begin
 mov eax,[esi-$04]

 function CharPos19(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz @Else1Begin
 //StrLenght := Length(Str);
 //mov eax,[esi-$04]
 mov edx,[esi-$04]
 //mov edx,eax
 //I := 0;
 xor eax,eax
 //repeat
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp bl,[esi+eax-$01]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := I
 //else
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //else
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;

 Also as a result of inlining LStrLen we can remove one more mov. LStrLen returned its result in eax and then it was copied into edx. mov eax,[esi-$04] can then be changed to mov edx,[esi-$04] and the mov edx, eax can be removed.
 After this we change focus to the loop. This is far more important because it is executed many more times, depending on the length of the string or the position of a match.

 function CharPos20(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 xor eax,eax
 dec esi
 @RepeatBegin :
 //Inc(I);
 inc eax
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp bl,[esi+eax-$01]
 cmp bl,[esi+eax]
 jz @If2
 cmp edx,eax
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,eax
 jnl @Exit
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 @Exit :
 pop esi
 pop ebx
 end;

 When we analyzed the code we saw that there was an offset of -1 in the addressing into the string. There is no need to subtract this offset at each iteration. It is a better idea to decrement the pointer to Str in esi once before entering the loop. We could also have decremented the loop counter in eax, but then we would have to add 1 again before returning the result.

 At the very top of the function the two input parameters are copied to new registers. This is redundant and we would like to copy propagate both, but eax is used as loop counter and we must first find another register for this purpose.

 function CharPos22(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 mov ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 //xor eax,eax
 xor ecx,ecx
 dec esi
 @RepeatBegin :
 //Inc(I);
 //inc eax
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp bl,[esi+eax]
 cmp bl,[esi+ecx]
 jz @If2
 //cmp edx,eax
 cmp edx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 //cmp edx,eax
 cmp edx,ecx
 jnl @Exit
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop esi //New
 pop ebx //New
 ret //New
 @Exit :
 mov eax, ecx
 pop esi
 pop ebx
 end;

 All lines where eax is in use for I, eax is changed to ecx. Because I is the return value of the function on a match and the return value is in eax, we have to copy ecx into eax just beneath the @Exit label. This introduces a little problem, because a jump to Else1Begin also brings us here, and in this situation we copy a value from ecx into eax, which we have just cleared. The fix is to add the lines marked new.
 Then we are ready to copy propagate eax. Only one line uses ebx. This is cmp bl,[esi+ecx], which is changed to cmp al,[esi+ecx]. Then the copy mov ebx,eax becomes dead and is removed. This was copy propagation followed by dead code removal once more and we begin realising how important this optimization is.

 function CharPos23(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 //mov ebx,eax
 //if StrLenght > 0 then
 test esi,esi
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov edx,[esi-$04]
 //I := 0;
 xor ecx,ecx
 dec esi
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp bl,[esi+ecx]
 cmp al,[esi+ecx]
 jz @If2
 cmp edx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp edx,ecx
 jnl @Exit
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop esi
 pop ebx
 ret
 @Exit :
 mov eax, ecx
 pop esi
 pop ebx
 end;

 Before we can copy propagate edx (holding the pointer to Str), we must free edx from other uses. It is used for StrLenght and we allocate ebx for this instead of edx.

 function CharPos24(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 mov esi,edx
 //if StrLenght > 0 then
 test esi,esi
 jz @Else1Begin
 //StrLenght := Length(Str);
 //mov edx,[esi-$04]
 mov ebx,[esi-$04]
 //I := 0;
 xor ecx,ecx
 dec esi
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp al,[esi+ecx]
 jz @If2
 //cmp edx,ecx
 cmp ebx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 //cmp edx,ecx
 cmp ebx,ecx
 jnl @Exit
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop esi
 pop ebx
 ret
 @Exit :
 mov eax, ecx
 pop esi
 pop ebx
 end;

 Then edx is copy propagated and the mov esi,edx becomes dead.

 function CharPos25(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push esi
 //mov esi,edx
 //if StrLenght > 0 then
 //test esi,esi
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 //mov ebx,[esi-$04]
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 //dec esi
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 //cmp al,[esi+ecx]
 cmp al,[edx+ecx]
 jz @If2
 cmp ebx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp ebx,ecx
 jnl @Exit
 //Result := 0;
 xor eax,eax
 pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop esi
 pop ebx
 ret
 @Exit :
 mov eax, ecx
 pop esi
 pop ebx
 end;

 This removed the use of esi and it does not need to be saved and restored any more. When removing the pop esi's, we remember that there are 3 exit paths all with each own pop esi.

 function CharPos26(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 //push esi
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp al,[edx+ecx]
 jz @If2
 cmp ebx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 @If2 :
 cmp ebx,ecx
 jnl @Exit
 //Result := 0;
 xor eax,eax
 //pop esi
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 //pop esi
 pop ebx
 ret
 @Exit :
 mov eax, ecx
 //pop esi
 pop ebx
 end;

 In the line after the If2 label there is a line which is identical to the second compare in the loop closing test. In Pascal it was necessary to have the line if I <= StrLenght after the loop because we could not know which condition the loop terminated on. This line generated the extra cmp ebx,ecx instruction, which looks a little redundant now. It is in fact not redundant because two paths of execution lead to the If2 Label and only one of them has the test. If we split the two exit paths such that only one of them goes to If2 we can remove the extra check. Instead of jumping to If2 on a match we can jump directly to Exit.

 function CharPos27(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp al,[edx+ecx]
 //jz @If2
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin
 //if I <= StrLenght then
 //@If2 :
 //cmp ebx,ecx
 //jnl @Exit
 //Result := 0;
 xor eax,eax
 pop ebx
 ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 pop ebx
 end;

 Then the If2 label goes out of use and when we reach the code at this position we know that the zero terminator was reached and there is no need to test for it again.

 There are two sections of identical code just before the Else1Begin label and just after it. The upper section can be deleted.

 function CharPos28(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp al,[edx+ecx]
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin
 //Result := 0;
 //xor eax,eax
 //pop ebx
 //ret
 //Result := 0;
 @Else1Begin :
 xor eax,eax
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 pop ebx
 end;

 This ended our search for redundant code to remove. The cleaned up function looks like this.

 function CharPos29(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 cmp al,[edx+ecx]
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor eax,eax
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 pop ebx
 end;

 When the loop is iterating in search for a match or the end of the string, these lines of code are executed over and over again

 inc ecx
 cmp al,[edx+ecx]
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin

 Let us make some variants of them and see how they perform. The most exiting line is

 cmp al,[edx+ecx]

 It generates two microinstructions, one for loading a byte from the address [edx+ecx] and one to compare it against al. This line could also be coded as

 mov ah, byte ptr [edx+ecx]
 cmp al, ah

 This variant also generates two microinstructions, but it needs one more register (ah).

 If we allocate an extra register it could also be coded as

 movzx efx, byte ptr [edx+ecx]
 cmp al, fh

 movzx is mov with zero extension. It loads one byte into the lowest of efx and fills the three remaining bytes with zeroes. Of course there is no such thing as an efx register, but the two unused registers esi & edi can not be accessed bytewise. Therefore it is necessary to free eax, ebx, ecx or edx, by substituting it by edi or esi and then use eg. ebx instead of "efx".

 This function demonstrates the first variant.

 function CharPos30(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov ebx,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 mov ah, [edx+ecx]
 //cmp al,[edx+ecx]
 cmp al,ah
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor eax,eax
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 pop ebx
 end;

 This function demonstrates the second variant.

 function CharPos31(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push edi
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov edi,[edx-$04]
 //I := 0;
 xor ecx,ecx
 dec edx
 @RepeatBegin :
 //Inc(I);
 inc ecx
 //until((Str[I] = Chr) or (I > StrLenght));
 movzx ebx, byte ptr [edx+ecx]
 //cmp al,[edx+ecx]
 cmp al, bl
 jz @Exit
 cmp edi,ecx
 jnl @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor eax,eax
 pop edi
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 pop edi
 pop ebx
 end;

 Instead of adding edx and ecx in the address calculation at every iteration, we could add them prior to entering the loop. Then it is necessary to subtract them from each other again to extract the loop counter value for the result. This is done with the sub instruction in line two after the Exit label.

 function CharPos32(Chr : Char; const Str : AnsiString) : Cardinal;
 asm
 push ebx
 push edi
 //if StrLenght > 0 then
 test edx,edx
 jz @Else1Begin
 //StrLenght := Length(Str);
 mov edi,[edx-$04]
 //I := 0;
 xor ecx,ecx
 //dec edx
 add ecx,edx
 @RepeatBegin :
 //Inc(I);
 //until((Str[I] = Chr) or (I > StrLenght));
 movzx ebx, byte ptr [ecx]
 inc ecx
 cmp al, bl
 jz @Exit
 //cmp edi,ecx
 test bl, bl
 jnz @RepeatBegin
 @Else1Begin :
 //Result := 0;
 xor eax,eax
 pop edi
 pop ebx
 ret
 @Exit :
 mov eax,ecx
 sub eax,edx
 pop edi
 pop ebx
 end;

 Then there are 4 functions to compare the performance of; CharPos29, CharPos30, CharPos31 and CharPos32.

 Results on P4 1920 are

 CharPos29 716
 CharPos30 973
 CharPos31 710
 CharPos32 702

 Winner is CharPos32

 Results on P3 1400 are

 CharPos29 949
 CharPos30 921
 CharPos31 950
 CharPos32 1403

 Winner is CharPos30

 Summed time

 CharPos29 716 + 949 = 1665
 CharPos30 973 + 921 = 1894
 CharPos31 710 + 950 = 1660
 CharPos32 702 + 1403 = 2105

 Winner is CharPos31

 On P4 the winning loop is

 @RepeatBegin :
 movzx ebx, byte ptr [ecx]
 inc ecx
 cmp al, bl
 jz @Exit
 test bl, bl
 jnz @RepeatBegin

 On P3 the winning loop is

 @RepeatBegin :
 inc ecx
 mov ah, [edx+ecx]
 cmp al,ah
 jz @Exit
 cmp ebx,ecx
 jnl @RepeatBegin

 on a blend of targets the winning loop is

 @RepeatBegin :
 inc ecx
 movzx ebx, byte ptr [edx+ecx]
 cmp al, bl
 jz @Exit
 cmp edi,ecx
 jnl @RepeatBegin

 The P4 winner performs very bad on P3 and could not be used in a library targeting more targets than P4, such as the Delphi RTL. The winner on P3 is performing quite badly on P4 and this one should not be used in a blended target library either. The winner on a blend of the two targets is CharPos31, which performs near to optimal on P4 and also performs optimal within a few percent on P3. This function would make a perfect choice for the Delphi RTL, where I have missed a CharPos function. It is a relief to see that it is possible to optimize for both processors at the same time without sacrificing more than a few percent of performance.

 Comparing performance of P3 and P4 on a clock by clock basis is always a hit. There is broad tendency to think at the P4 as having an inferior design, but this is not proved by this code. Taking the blended winner's performance and scaling it to a 1400 MHz processor it is 950 on P3 and 710 * (1920/1400) = 973 on P4. The processors are performing very much the same at the same clock.

 Regards
 Dennis
 

 
本站仅提供存储服务,所有内容均由用户发布,如发现有害或侵权内容,请点击举报
打开APP,阅读全文并永久保存 查看更多类似文章
猜你喜欢
类似文章
【热】打开小程序,算一算2024你的财运
IDA简易教程
软件脱壳、破重启验证、追码、制作内存注册机
qq反汇编日志
去除天狼星视频加密系统的各种限制
For Delphi,让你的注册机变小一些
浅谈游戏外挂——外挂篇2(完美封包浅谈) - 心动--完美空间 - yifeigzs - ...
更多类似文章 >>
生活服务
热点新闻
分享 收藏 导长图 关注 下载文章
绑定账号成功
后续可登录账号畅享VIP特权!
如果VIP功能使用有故障,
可点击这里联系客服!

联系客服