Performance

In this section I run a few simple test programs to get a feel for the run-time cost of the elements of SML/NJ programs. In the basic performance section I look at simple loops and memory allocation. After that I measure the cost of CML operations such as message passing and thread creation.

The times that I measure are wall times, because I can get finer resolution that way on my Linux system. The CPU timers in the SML/NJ Timer module have only 10 millisecond resolution, which comes from the kernel's internal clock tick. As long as the system is idle while running the programs, the two kinds of time should agree closely enough.
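
The test programs time their runs with a Timing.timeIt helper, which appears in the code below. As a rough sketch, a wall-clock version of such a helper might look like the following; this is my reconstruction, not the actual module from the test harness.

structure Timing =
struct
    (*  Measure the wall time taken by f and report it.
        Time.now reads the wall clock, which is what the
        measurements in this section use.  *)
    fun timeIt name (f : unit -> unit) =
    let
        val start = Time.now ()
        val ()    = f ()
        val stop  = Time.now ()
    in
        print (concat [name, ": ",
                       Time.toString (Time.- (stop, start)),
                       " seconds\n"])
    end
end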

These tests are run on a 1GHz Athlon system running the Linux 2.2.19 kernel. There is 256MB of PC133 memory.

Basic SML/NJ Performance

The tests I describe in this section cover some of the basic code examples to give you a feel for how fast SML/NJ runs. Remember that SML/NJ compiles directly to machine code. It's not some interpreted toy. I compare the speed to similar C code. The test programs are called speed.sml and cspeed.c. The figures are execution times (wall time) averaged over five runs on a quiet system. The C program is compiled with gcc 2.96 using just the basic optimisation level, "cc -O".

The first test is just a simple loop counting to 100,000,000. I've tried two loops in SML, one counting down and one counting up, to see what difference the direction makes.

fun countdown args =
let
    val max_cnt = int_arg args

    fun loop 0 = 0                  (* it returns something *)
    |   loop n = loop (n-1)
in
    Timing.timeIt "countdown" (fn () => ignore(loop max_cnt))
end


fun countup args =
let
    val max_cnt = int_arg args

    fun loop n =
    (
        if n = max_cnt
        then
            0
        else
            loop (n+1)
    )
in
    Timing.timeIt "countup" (fn () => ignore(loop 0))
end

The countdown function compares against the constant 0. The countup function compares with a variable. The C functions use a for loop while SML uses recursion. Table 7-2 shows the figures.

Table 7-2. Speed of the Counting Functions.

Function      SML (millisec)   C (millisec)
---------     --------------   ------------
countdown     399              200
countup       449              199

You can see that SML/NJ runs at half the speed of C. The countup function is around 10% slower than countdown because it compares against a variable rather than a constant. The C functions run at the same speed in both directions. It would be interesting to study the machine code that SML/NJ generates, but it is not readily accessible.

The next set of functions all count the number of lines in a text file. The file has 10000 lines, each 60 characters long. This tests how SML/NJ handles character processing and I/O. Both programs read the entire file into memory and then count the new-line characters.

In the SML program I've tried a number of different ways to do the counting. The straightforward C-like way is the slow index function:

fun count_slowix text =
let
    val len = size text

    fun loop ~1 l = l               (* stop after index 0 has been tested *)
    |   loop n l = 
    (
        loop (n-1) (if S.sub(text, n) = #"\n" then l+1 else l)
    )
in
    loop (len-1) 0
end

This indexes into the text to test each character. A faster version uses the Unsafe index function (see the section called The Unsafe API in Chapter 4):

fun count_fastix text =
let
    val len = size text

    fun loop ~1 l = l               (* stop after index 0 has been tested *)
    |   loop n l = 
    (
        loop (n-1) (if Unsafe.CharVector.sub(text, n) = #"\n"
                    then l+1 else l)
    )
in
    loop (len-1) 0
end

The remaining functions use the Substring module, just to see how much slower they are than direct indexing. The first two split the text into tokens on new-line characters, using two different tests for a new-line. The third uses the Substring.getc function to step through the text character by character.

fun count_tokens text =
let
    val lines = SS.tokens (fn c => c = #"\n") (SS.all text)
in
    length lines
end


(*  See if isCntrl is faster. *)
fun count_cntrl text =
let
    val lines = SS.tokens Char.isCntrl (SS.all text)
in
    length lines
end


(*  Count the characters individually using substring. *)
fun count_getc text =
let
    fun loop ss n =
    (
        case SS.getc ss of
          NONE => n
        | SOME (c, rest) =>
            loop rest (if c = #"\n" then n+1 else n)
    )
in
    loop (SS.all text) 0
end

The C program reads the entire file into a malloc-ed buffer using fread and counts the new-lines in the usual way. Table 7-3 shows the figures. The time to read the file is included in the readall entry. The length entry is the time to find the length of the string; it shows that the length is stored in a field of the string object rather than found by counting characters the way C's strlen does.
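
On the SML side the whole file can be read with TextIO.inputAll. Here is a minimal sketch of such a readall function; the actual one in the test program may differ.

fun readall name =
let
    val strm = TextIO.openIn name

    (*  inputAll reads the rest of the stream into one string. *)
    val text = TextIO.inputAll strm
in
    TextIO.closeIn strm;
    text
end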

Table 7-3. Speed of the Line Counting Functions.

Function   SML (microsec)   C (microsec)
--------   --------------   ------------
readall    4980             4609
length     1                -
slowix     21975            -
fastix     13792            1854
tokens     54856            -
cntrl      61299            -
getc       59050            -

SML/NJ does well reading in the file. Counting the characters is woeful though. The compiler is supposed to generate in-line machine code for Unsafe.CharVector.sub, yet the SML version still ends up around 7 times slower than C. The Unsafe function is certainly faster than the normal one, which performs a bounds check on each call.

The Substring functions use the Unsafe functions internally. I'm surprised to see that the getc version is slower than tokens.

Memory Performance

This test explores the performance of memory allocation. The program builds a linked list of integers and then frees it. For the SML/NJ program, freeing consists of dropping the reference to the list and triggering a garbage collection. Here is the test code.

(* lst should be garbage after this function ends *)
fun build max_cnt =
let
    fun loop 0 rslt = rslt
    |   loop n rslt = loop (n-1) (n::rslt)

    val lst = loop max_cnt []
in
    print(concat["Built a list with length ",
                  Int.toString(length lst), "\n"])
end


fun linkedlist args =
let
    val max_cnt = int_arg args

    fun run() =
    (
        build max_cnt;
        SMLofNJ.Internals.GC.doGC 0
    )
in
    run(); run();                   (* go for steady state *)
    SMLofNJ.Internals.GC.messages true;
    SMLofNJ.Internals.GC.doGC 10;   (* clear the heap *)
    print "Starting the run\n";
    Timing.timeIt "linkedlist" run;
    SMLofNJ.Internals.GC.messages false;
    ()
end

A separate top-level function is used for building the list to ensure that the list is truly garbage by the time the function returns. If the building were nested within another function, some compilers might retain a reference to the list in the outer function's scope.

I ran the program with different list lengths to see how the performance scales. To try to ensure there is only one collection I increased the heap size by adding @SMLalloc=4096 to the run-time command line. This sets an allocation arena of 4MB rather than the default of 256KB, and the heap arenas are scaled accordingly. But I found that the speed doesn't improve for allocation sizes over 1MB, and I always ended up with an additional major collection for lengths over 50000, which cost around 10-20 milliseconds.

Table 7-4 shows the figures for the linked list program. An estimate of the amount of time doing a major collection is included. The collection times have a 10 millisecond resolution so they are only rough.

For small list sizes SML/NJ is 3 times faster than C at allocating and freeing heap memory. The speed advantage largely disappears at larger sizes. The C figures are linear in the list length. The SML/NJ figures have a hump around the 200000 mark, where the major collection kicks in.

Table 7-4. Speed of Linked List Building.

Length     SML (millisec)   GC (millisec)   C (millisec)
-------    --------------   -------------   ------------
50000      5.2              -               15.6
100000     10.6             -               31.6
200000     47.2             10              64.3
500000     142.1            30              161.5
1000000    252.9            30              323.4

The bottom line is that the gain from faster memory allocation can compensate for the loss in raw code speed, resulting in execution times for SML/NJ comparable to C (or C++).

CML Channel Communication and Scheduling

This test measures how CML performs when sending messages through a channel. The test sets up a number of receiver threads, each blocked on its own channel. A matching set of sender threads is started, but they all first wait on a time-out event. The time-out uses CML.atTimeEvt to produce a single event that enables all of the senders at the same time. Each receiver records the time it received its message and the transmission delay; these records are printed at the end of the test.

When the event becomes enabled, the CML time-out code will put all of the sender threads onto the ready-to-run queue before switching to any of them. This tests how the scheduler behaves when it has a large number of threads ready. When a sender thread runs and sends its message, CML immediately switches to the matching receiver thread, so the transmission delay is a measure of the overhead of sending a message and switching threads. The receiver saves its record and exits, which lets CML select the next sender thread to run.
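
Here is a minimal sketch of one sender/receiver pair to make the setup concrete. This is my reconstruction, not the actual test program; Time.now stamps stand in for its timers, and a driver would create one event with CML.atTimeEvt and call pair once for each of the 100 pairs.

fun pair (go : unit CML.event) id =
let
    val ch : Time.time CML.chan = CML.channel ()

    fun sender () =
    (
        CML.sync go;                (* all senders are enabled together *)
        CML.send (ch, Time.now ())  (* stamp the moment of sending *)
    )

    fun receiver () =
    let
        val sent  = CML.recv ch     (* blocks until the sender runs *)
        val delay = Time.- (Time.now (), sent)
    in
        print (concat ["Pair ", Int.toString id, " delay ",
                       LargeInt.toString (Time.toMicroseconds delay),
                       " usec\n"])
    end
in
    ignore (CML.spawn receiver);
    ignore (CML.spawn sender)
end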

Here are figures for a run with 100 threads.

Pair 99 receives at 1004543299.986603 after 9
Pair 98 receives at 1004543299.986618 after 2
Pair 97 receives at 1004543299.986622 after 2
Pair 96 receives at 1004543299.986626 after 2
Pair 95 receives at 1004543299.986630 after 1
Pair 94 receives at 1004543299.986633 after 2
Pair 93 receives at 1004543299.986637 after 1
Pair 92 receives at 1004543299.986640 after 3
Pair 91 receives at 1004543299.986645 after 1
Pair 90 receives at 1004543299.986648 after 2
...
Pair 9 receives at 1004543299.987025 after 3
Pair 8 receives at 1004543299.987030 after 3
Pair 7 receives at 1004543299.987035 after 2
Pair 6 receives at 1004543299.987040 after 2
Pair 5 receives at 1004543299.987045 after 2
Pair 4 receives at 1004543299.987049 after 3
Pair 3 receives at 1004543299.987055 after 2
Pair 2 receives at 1004543299.987060 after 2
Pair 1 receives at 1004543299.987064 after 3
Pair 0 receives at 1004543299.987069 after 3
...
Timing Rx 0 8
Timing Sn 0 13
Timing Rx 1 3
Timing Sn 1 4
Timing Rx 2 2
Timing Sn 2 15
Timing Rx 3 2
Timing Sn 3 3
Timing Rx 4 2
Timing Sn 4 16
...
Timing Rx 92 2
Timing Sn 92 3
Timing Rx 93 1352
Timing Sn 93 10
Timing Rx 94 4
Timing Sn 94 5
Timing Rx 95 3
Timing Sn 95 2
Timing Rx 96 2
Timing Sn 96 3
Timing Rx 97 4
Timing Sn 97 4
Timing Rx 98 3
Timing Sn 98 4
Timing Rx 99 3
Timing Sn 99 4

The first lines show the receiver records. The first number is the time when the message arrived and the second is the transmission delay in microseconds. The delay stays at around 2 microseconds for all threads without growing. The receivers run at around 4 microsecond intervals, which includes the time to switch to a new thread, send the message and save the record. This interval does not grow as the number of threads increases.

The second set of lines shows the time taken to spawn each sender and receiver thread, in microseconds in the last column. This time stays fairly stable. There are occasional spikes, which may be some house-keeping inside CML.

I would rate this as good performance.

Spawning Threads for Time-outs

In this section I examine the cost of spawning a thread and how it scales to large numbers of threads. The first test, the thr_scaling program, uses time-out events. It spawns 5000 threads, numbered from 5000 down to 1. The main thread creates a time-out event using timeOutEvt and passes it to each spawned thread, which immediately waits on the event. This models an early implementation of time-outs in the Swerve server (see the section called Time-outs in Chapter 8). The time-out expires well after all of the threads have been spawned. The program reports the time taken to spawn each thread and the order in which the threads wake up.
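
A minimal sketch of the spawning loop might look like this, assuming a fixed 10 second delay; the real program also measures the time taken by each CML.spawn call.

fun thr_scaling nthreads =
let
    fun spawn_all 0 = ()
    |   spawn_all n =
    let
        val evt = CML.timeOutEvt (Time.fromSeconds 10)

        fun thread () =
        (
            CML.sync evt;           (* joins the time-out queue *)
            print (concat ["Thread ", Int.toString n, " finishes\n"])
        )
    in
        ignore (CML.spawn thread);  (* the new thread runs first, then blocks *)
        spawn_all (n-1)
    end
in
    spawn_all nthreads
end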

When a thread is spawned it starts running immediately while the parent thread is blocked. The new thread then blocks on the time-out event, which transfers control to the CML scheduler to choose a new thread to run. Blocked threads are placed on a time-out queue. If nothing else is happening in the program then this queue is examined at each time slice, typically every 20 milliseconds.

The time-out queue is kept sorted in order of increasing expiry time, so as more threads are created with later time-outs they are appended to the end of the queue, which takes longer and longer. But the thr_scaling program creates time-outs with 1 second resolution, and the CML scheduler uses an internal clock with the resolution of the time slice. This results in batches of threads on the queue with the same expiry time. The size of a batch is determined by how many threads can be spawned within one time slice.

Each new member of a batch goes to the front of the queue section for that batch, since its expiry time is not greater than those already there. For example, each member of the first batch goes to the front of the whole queue, so inserting it is fast. The threads of a batch therefore sit in the queue in reverse order and will be woken in the reverse of their spawning order within the batch.
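
The following is a simplified model of this queue discipline, not CML's actual code. Entries carry an expiry time, the queue is kept sorted by increasing time, and a new entry stops in front of the first entry that does not expire earlier, which places it at the front of its batch.

(*  Insert an entry into a queue sorted by increasing expiry time. *)
fun insert entry [] = [entry]
|   insert (entry as (t, _)) ((e as (t', _)) :: rest) =
        if Time.<= (t, t')
        then entry :: e :: rest     (* equal time: front of the batch *)
        else e :: insert entry rest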

These quirks of the implementation help to explain the measured timing. The following data shows the time taken to perform the CML.spawn in microseconds and the finishing order.

Timing Thread 5000 18
Timing Thread 4999 17
Timing Thread 4998 5
Timing Thread 4997 3
Timing Thread 4996 16
Timing Thread 4995 3
Timing Thread 4994 3
Timing Thread 4993 4
Timing Thread 4992 4
...
Timing Thread 4374 5
Timing Thread 4373 5
Timing Thread 4372 5
Timing Thread 4371 3
Timing Thread 4370 527
Timing Thread 4369 208
Timing Thread 4368 234
Timing Thread 4367 249
...
Timing Thread 1160 9844
Timing Thread 1159 5257
Timing Thread 1158 5586
Timing Thread 1157 8607
Timing Thread 1156 5032
Timing Thread 1155 5354
Timing Thread 1154 3641
Timing Thread 1153 10322
Timing Thread 1152 3774
Timing Thread 1151 4902
...
Timing Thread 4 11012
Timing Thread 3 7851
Timing Thread 2 16275
Timing Thread 1 6906
Thread 4371 finishes
Thread 4372 finishes
Thread 4373 finishes
Thread 4374 finishes
...
Thread 4998 finishes
Thread 4999 finishes
Thread 5000 finishes
Thread 4312 finishes
Thread 4313 finishes
...
Thread 5 finishes
Thread 2 finishes
Thread 3 finishes
Thread 4 finishes
Thread 1 finishes

The first 630 threads spawn quickly, in only a few microseconds each. This would be the first batch. As the number of threads increases, the time to spawn grows rapidly to over 5 milliseconds each, with some much longer times of up to 16 milliseconds. This is rather a long time just to set up a time-out, and it resulted in poor performance in the Swerve server.

The first thread to finish is number 4371, which corresponds to the last thread in the first batch, where the spawning time jumps suddenly. This confirms the time-out queue behaviour described above.

Behaviour of Timeout Events

The timeout_evt program shows some odd behavioural differences between the events produced by the CML.atTimeEvt and CML.timeOutEvt functions.

The program spawns 1000 threads, all waiting on the same time-out event. I expect that as soon as the event is enabled all of the threads will wake and terminate. If the command-line argument is "time" then the program uses CML.atTimeEvt, otherwise it uses CML.timeOutEvt. It reports the duration of each spawn operation and the finish time of each thread.
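
Here is a minimal sketch of the program, assuming a one second delay; with atTimeEvt all threads share one absolute expiry time, while with timeOutEvt each thread's CML.sync computes its own.

fun timeout_evt use_at_time =
let
    val evt =
        if use_at_time
        then CML.atTimeEvt (Time.+ (Time.now (), Time.fromSeconds 1))
        else CML.timeOutEvt (Time.fromSeconds 1)

    fun waiter n () =
    (
        CML.sync evt;
        print (concat ["Thread ", Int.toString n, " finishes\n"])
    )

    fun spawn_all 0 = ()
    |   spawn_all n = (ignore (CML.spawn (waiter n)); spawn_all (n-1))
in
    spawn_all 1000
end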

If I use CML.atTimeEvt to create the event then all of the spawn operations run in practically constant time of well under 10 microseconds, with only a few blips where an operation takes 500 microseconds or more. When the threads wake and terminate, all 1000 finish within an interval of around 20 milliseconds.

If I use CML.timeOutEvt then the time for the spawn operation starts small but grows rapidly to several hundred microseconds. When the threads wake it takes around 200 milliseconds for them all to terminate.

The reason for this behaviour stems from the implementation of the time-out queue within CML, as described in the section called Spawning Threads for Time-outs. When using CML.timeOutEvt, each thread gets its own individual time-out (to the resolution of a time slice), calculated at the moment the thread attempts the CML.sync on the event. Since the threads start at slightly different times this produces many distinct expiry times, which makes the time-out queue quite long. The queue is kept in time order by an insertion sort, which is rather slow. This slows down the spawn operation, since the CML.sync is performed before the spawn returns.

When using CML.atTimeEvt there is exactly one expiry time, so the time-out queue stays small.
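
For example, converting a relative delay to a single absolute expiry time up front lets any number of threads share one event. This is a sketch of the pattern rather than code from the test programs.

(*  One expiry time shared by all waiters keeps queue insertion cheap. *)
val expiry = Time.+ (Time.now (), Time.fromSeconds 1)
val go     = CML.atTimeEvt expiry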

So the lesson for good time-out performance is to keep the number of distinct expiry times and the number of waiting threads small. The final implementation of time-outs in the Swerve server goes to some lengths to achieve this. See the section called The Abort Module in Chapter 9.