graph of scores of piping stdin on Just(js), Node.js and plain C

19 February 2021

On Javascript Performance

2. Syscall Overhead in Javascript

by billwhizz

Measuring Syscall Overhead in Just(js)

The goal of this demo is to establish the overhead of making syscalls in just(js) runtime for Javascript. We will write a simple console application which counts the number of bytes piped to it over stdin. The application will call the read syscall repeatedly in various chunk sizes.

We will implement the demo in C, in Javascript on just(js) using both sync and async interfaces. We will also write a comparison application on node.js to establish if there is any extra overhead on that platform.

Prerequisites

See README.md for initial setup. In order to run this test you will need linux strace and time tools. These should be pre-installed if you run in the docker created in README.md.

Initial Investigation - Using 'wc'

First of all, let's see what performance we get from using the wc) tool that comes with most if not all unix-y systems. We will pipe 500k 64k chunks from /dev/zero into 'wc -c' command to count the number of bytes.


time dd if=/dev/zero bs=65536 count=500000 | wc -c

Run it a few times to rule out any contention and we can see best performance on my setup of something like this


32768000000 bytes (33 GB, 31 GiB) copied, 8.20788 s, 4.0 GB/s

real    0m8.212s

So it takes 8.2 seconds in total and runs at a rate of 4 GB per second

Establishing a Baseline - Optimised C

We should be able to improve slightly on system wc if we write an optimised c version as the wc application does more than just count bytes and also likely does not have any processor specific optimisations that may help.

Here is our little program. It shouldn't need any explantion.

wc.c


#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
  unsigned int blocksize = 65536;
  if (argc == 2) {
    blocksize = atoi(argv[1]);
  }
  char buf[blocksize];
  unsigned long size = 0;
  unsigned int reads = 0;
  int n = read(STDIN_FILENO, buf, blocksize);
  while (n > 0) {
    reads++;
    size += n;
    n = read(STDIN_FILENO, buf, blocksize);
  }
  if (n < 0) {
    fprintf(stderr, "read: %s (%i)\n", strerror(errno), errno);
    exit(1);
  }
  fprintf(stdout, "size %lu reads %u blocksize %u\n", size, reads, blocksize);
}

Let's compile it with all optimizations...

gcc -O3 -o wc wc.c

... and see what we get ...


time dd if=/dev/zero bs=65536 count=500000 | ./wc

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 7.24869 s, 4.5 GB/s
size: 32768000000

real    m7.53s
user    m0.24s
sys     m14.77s

Ok, so that's 7.2 seconds and 4.5 GB/s - a nice 11% improvement over the builtin wc tool.

While we are at it, let's see what syscalls our program makes


dd if=/dev/zero bs=65536 count=500000 | strace -c ./wc

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    5.447454          11    500002           read
  0.00    0.000013           4         3           brk
  0.00    0.000009           3         3           fstat

So 500k read syscalls sounds about right and there are no other syscalls to speak of. We can compare this to the other programs later. This program is just running read() in a tight loop and doing little else, which is exactly what we want for our test.

Just Blocking/Sync Performance

We will run with blocking/synchronous syscalls in wc.js and see how close we get to raw C performance.

wc.js


const { print, error, exit, memoryUsage, sys, net } = just
const { read } = net
const { strerror, errno, STDIN_FILENO } = sys
const blocksize = parseInt(just.args[2] || 65536)
const buf = new ArrayBuffer(blocksize)
let size = 0
let reads = 0
let n = read(STDIN_FILENO, buf)
while (n > 0) {
  reads++
  size += n
  n = read(STDIN_FILENO, buf)
}
if (n < 0) {
  error(`read: ${strerror(errno())} (${errno()})`)
  exit(1)
}
print(`size ${size} reads ${reads} blocksize ${blocksize}`)

It shouldn't need much explanation. We import the internal modules we need from just runtime, create a 64k ArrayBuffer and then do blocking reads on stdin in a loop until we get 0 or an error code back.

Let's see what the results are like...


time dd if=/dev/zero bs=65536 count=500000 | just wc.js

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 7.19774 s, 4.6 GB/s
size 32768000000 rss 18542592

real    0m7.205s
user    0m0.649s
sys     0m13.740s

So, looks like performance is exactly the same as the C version, if not a little faster. How can this be? Let's look at the syscalls to make sure we are making the same number of read() calls as the C version.


dd if=/dev/zero bs=65536 count=500000 | strace -c just wc.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    5.341601          11    500014           read
  0.02    0.000878           7       133         6 futex
  0.01    0.000495           5       110           mprotect
  0.00    0.000196          33         6           munmap
  0.00    0.000164           5        36           madvise

This looks good. The same 500k syscalls. We can also see some mprotect and futex calls. These are due to locking issues with ArrayBuffers in v8 - something I have looked into before but need to investigate further.

Let's just take a quick look at garbage collection stats and see what is happening. We are only allocating a single buffer outside the loop and on the C++ side we are only creating short lived v8 objects which should get marked as "young generation" and collected by the scavenger which is very quick.


dd if=/dev/zero bs=65536 count=500000 | just --trace-gc wc.js

[8451:0x298000000000]      688 ms: Scavenge 1.3 (1.5) -> 0.9 (2.0) MB, 0.3 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     1201 ms: Scavenge 1.3 (2.0) -> 0.9 (1.8) MB, 0.3 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     1819 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     2441 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     3059 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     3678 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     4304 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     4934 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     5555 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     6175 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
[8451:0x298000000000]     6799 ms: Scavenge 1.4 (1.8) -> 0.9 (1.8) MB, 0.1 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure 
500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 7.38609 s, 4.4 GB/s
size 32768000000 rss 18587648
[8451:0x298000000000]     7381 ms: Mark-sweep (reduce) 1.3 (1.8) -> 0.1 (1.8) MB, 0.4 / 0.0 ms  (average mu = 1.000, current mu = 1.000) low memory notification GC in old space requested
[8451:0x298000000000]     7382 ms: Mark-sweep (reduce) 0.1 (1.8) -> 0.1 (1.0) MB, 0.4 / 0.0 ms  (average mu = 0.068, current mu = 0.068) low memory notification GC in old space requested

We can see the scavenger working away roughly every 600ms but having negligible impact on performance. The full Mark-sweep collection at the end is because just(js) runtime forces garbage collection and clean up of memory before exiting.

If you want to dive deeper into V8 garbage collection there are two very good articles here and here on the V8 team blog.

So, I think we are ok in asserting that there is practically zero overhead to a simple syscall from JS compared to plain C.

Just Non-Blocking / Async Performance

Let's do the same thing but using the event loop and only counting the bytes as the event loop tells us data is ready. The code here is slightly more complex as we need to mark the stdin file descriptor as non-blocking and add it to the default event loop of the just runtime.

wc-async.js


const { net, sys, print, error, memoryUsage } = just
const { read, O_NONBLOCK } = net
const { strerror, errno, STDIN_FILENO } = sys
const { EPOLLIN } = just.loop
const { loop } = just.factory
const blocksize = parseInt(just.args[2] || 65536)
const buf = new ArrayBuffer(blocksize)
const flags = sys.fcntl(STDIN_FILENO, sys.F_GETFL, 0) | O_NONBLOCK
sys.fcntl(STDIN_FILENO, sys.F_SETFL, flags)
let size = 0
let reads = 0
function onData (fd) {
  const n = read(fd, buf)
  if (n < 0) {
    error(`read: ${strerror(errno())} (${errno()})`)
    loop.remove(fd)
    just.exit(1)
  }
  if (n === 0) {
    print(`size ${size} reads ${reads} blocksize ${blocksize} rss ${memoryUsage().rss}`)
    loop.remove(fd)
    return
  }
  reads++
  size += n
}
loop.add(STDIN_FILENO, onData, EPOLLIN)

We import the default event loop from the just runtime, use fcntl to set the fd as non blocking, and then we tell the event loop that we are only listening for data coming in (EPOLLIN).

By default we are using level triggered events which means the event will carry on firing if we do not read all the available data on each iteration. If we had added EPOLLET to our mask as follows:


const { EPOLLIN, EPOLLET } = just.loop
...
loop.add(STDIN_FILENO, onData, EPOLLIN | EPOLLET)

then we would only receive an event when data was detected and would be expected to keep on reading until the read syscall returns EAGAIN to indicate the kernel buffer for that file descriptor is empty. If you want to delve deeper into this, I would suggest this great article from Cindy Sridharan.

So, let's see what the async results look like.


time dd if=/dev/zero bs=65536 count=500000 | just wc-async.js

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 7.16841 s, 4.6 GB/s
size 32768000000 rss 19324928

real    0m7.175s
user    0m0.796s
sys     0m13.533s

Nice. So it looks like doing async adds little or no overhead to this program. Let's see what syscalls were made.


time dd if=/dev/zero bs=65536 count=500000 | strace -c just wc-async.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.89    5.228990          10    500013           read
 23.10    1.570899           3    500001           epoll_wait
  0.01    0.000503           6        83         1 futex
  0.00    0.000209           2        93           mprotect

Interesting. So even though we are making twice as many syscalls we get the same performance. What is going on here? I'll have to delve a little deeper into why this is the case but I am guessing it is something to do with the blocking calls in the first example having some additional overhead over the non-blocking reads in the async version.

Just(js) Static Build

Before we finish with just(js), let's have a look at the memory Usage. We can run a quick eval to see what default memory usage is on startup


just eval 'just.print(just.memoryUsage().rss)'
13529088

So, about 13.5 MB. In the synchronous example we saw usage after the run was 18.5 MB and 19.3 MB for the async version. So the overhead of loading the various modules and running the code is around 5-6MB. If we are interested in reducing memory usage further we can compile the application into a static binary and see what happens.

First, we make a few changes to take into account our static app will not use any of the standard just(js) runtime created here. Instead, we will compile a bare application that only uses the small number of functions (currently print, error, load, exit, pid, chdir, sleep, builtin, memoryUsage and version) in the just(js) core and the builder will figure out what other modules it needs and compile those in.

wcb.js


// we need to use just.library() to load modules for now as that is how builder pickes them up
just.library = (name, lib = name) => just.load(name)
const { sys } = just.library('sys')
const { net } = just.library('net')
const { read } = net
const { strerror, errno, STDIN_FILENO } = sys
const blocksize = parseInt(just.args[1] || 65536)
const buf = new ArrayBuffer(blocksize)
let size = 0
let reads = 0
let n = 0
while ((n = read(STDIN_FILENO, buf))) {
  reads++
  size += n
}
if (n < 0) {
  just.error(`read: ${strerror(errno())} (${errno()})`)
  just.exit(1)
}
just.print(`size ${size} reads ${reads} blocksize ${blocksize}`)

Let's run a build with the --dump flag to see what kind of configuration this generates. We will also pass --clean and --static flags to tell it to clean up existing auto generated object files and to build a statically linked binary so we will have no runtime dependencies on system libraries.


just build wcb.js --clean --static --dump

{
  "version": "0.0.21",
  "libs": [
    "lib/fs.js",
    "lib/loop.js",
    "lib/path.js",
    "lib/process.js",
    "lib/build.js",
    "lib/repl.js",
    "lib/configure.js",
    "lib/acorn.js"
  ],
  "modules": [
    {
      "name": "sys",
      "obj": [
        "modules/sys/sys.o"
      ],
      "lib": [
        "rt"
      ]
    },
    {
      "name": "fs",
      "obj": [
        "modules/fs/fs.o"
      ]
    },
    {
      "name": "net",
      "obj": [
        "modules/net/net.o"
      ]
    },
    {
      "name": "vm",
      "obj": [
        "modules/vm/vm.o"
      ]
    },
    {
      "name": "epoll",
      "obj": [
        "modules/epoll/epoll.o"
      ]
    }
  ],
  "capabilities": [],
  "target": "wcb",
  "main": "just.js",
  "v8flags": "--stack-trace-limit=10 --use-strict --disallow-code-generation-from-strings",
  "embeds": [
    "just.js",
    "config.js",
    "wcb.js"
  ],
  "static": true,
  "debug": false,
  "v8flagsFromCommandLine": true,
  "external": {},
  "index": "wcb.js",
  "LIBS": "lib/fs.js lib/loop.js lib/path.js lib/process.js lib/build.js lib/repl.js lib/configure.js lib/acorn.js",
  "EMBEDS": "just.js config.js wcb.js",
  "MODULES": "modules/sys/sys.o modules/fs/fs.o modules/net/net.o modules/vm/vm.o modules/epoll/epoll.o",
  "LIB": "-lrt",
  "justDir": "/home/andrew/.just",
  "build": "main-static",
  "moduleBuild": "module-static"
}

Ok, so that looks like it has a bunch of extra junk in there that we don't need for our mini-app. So, let's create a config for the app to tell the builder we want to use 'wcb.js' as our main script and not 'just.js'.

wcb.config.js


module.exports = { main: 'wcb.js' }

And let's dump the config again and see what comes out this time.


just build wcb.js --clean --static --dump

{
  "main": "wcb.js",
  "external": {},
  "modules": [
    {
      "name": "sys",
      "obj": [
        "modules/sys/sys.o"
      ],
      "lib": [
        "rt"
      ]
    },
    {
      "name": "net",
      "obj": [
        "modules/net/net.o"
      ]
    }
  ],
  "target": "wcb",
  "version": "0.0.21",
  "v8flags": "--stack-trace-limit=10 --use-strict --disallow-code-generation-from-strings",
  "debug": false,
  "capabilities": [],
  "static": true,
  "libs": [],
  "embeds": [
    "wcb.js"
  ],
  "LIBS": "",
  "EMBEDS": "wcb.js",
  "MODULES": "modules/sys/sys.o modules/net/net.o",
  "LIB": "-lrt",
  "justDir": "/home/andrew/.just",
  "build": "main-static",
  "moduleBuild": "module-static"
}

That looks better. Now we are only importing the sys and net C++ modules and none of the JS files and c++ libraries not required for this program. Let's go ahead and build it. In order to build it an app, we will need to set JUST_HOME environment variable to tell the builder where to download and build the various things it needs. You can set this to the directory you compiled the runtime in or if you are using docker it will already be set for you.


just build wcb.js --cleanall --clean --static --silent

clean wcb complete in 0.01 sec
clean modules/sys complete in 0.02 sec
build modules/sys complete in 1.35 sec
clean modules/net complete in 1.36 sec
build modules/net complete in 2.37 sec
build wcb 0.0.21 (wcb.js) complete in 5.99 sec

If you have issues with previous versions of object files that were generated you can pass the --cleanall flag also which will clean the C++ modules and rebuild them. The build system is very much a work in progress so please report any issues here.

We should now have a nice small binary called 'wcb' in the current directory. On my system this comes out around 13MB which is quite a bit smaller than the dynamic build. Bear in mind that a static build will not be able to import any libraries using dlopen at run-time so this option is very much for those cases where the app has all the dependencies it needs bundled up within it.

Let's run the test on our new binary and see what the memory usage is like.


time dd if=/dev/zero bs=65536 count=500000 | ./wcb

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 7.2249 s, 4.5 GB/s
size 32768000000 rss 12763136

real    0m7.232s
user    0m0.581s
sys     0m13.869s

Nice! we just saved about 6MB of memory over the version running under just(js) runtime.

Comparing to node.js

Ok, we will now do a quick comparison against a typical node.js program which meets our requirement.

wc-node.js


let size = 0
const { stdin } = process
let reads = 0
const blocksize = parseInt(process.argv[2] || 65536, 10)
stdin.on('readable', () => {
  let chunk
  while ((chunk = stdin.read())) {
    reads++
    size += chunk.length
  }
})
stdin.on('end', () => {
  console.log(`size ${size} reads ${reads} blocksize ${blocksize}`)
})
stdin.on('error', err => {
  console.error(err.stack)
})

Let's run it and see how it does. I am running v15.6.0 of node.js on Ubuntu 18.04.


time dd if=/dev/zero bs=65536 count=500000 | node wc-node.js

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 8.93132 s, 3.7 GB/s
size 32768000000 rss 56827904

real    0m8.941s
user    0m2.432s
sys     0m15.505s

So, we are seeing a run time of 8.9 seconds and a processing rate of 3.7 GB/s. This is about a 23% decrease in throughput. If we look at memory usage we can see the node.js process is consuming 56MB compared to 19MB or 13MB with just(js), a factor of 3-4 higher.

Let's see if we can make the node.js program do a little better than this. We'll write something that calls the C++ bindings directly. There used to be a way of doing this in a blocking fashion in node.js many moons ago but i was unable to figure it out when digging through the node.js source code for this demo.

wc-node-fast.js


const { Pipe } = process.binding('pipe_wrap')
const blocksize = parseInt(process.argv[2] || 65536, 10)
const stdin = new Pipe(0)
stdin.open(0)
let size = 0
let reads = 0
stdin.onread = buf => {
  if (!buf) {
    console.log(`size ${size} reads ${reads} blocksize ${blocksize} rss ${process.memoryUsage().rss}`)
    return
  }
  reads++
  size += buf.byteLength
}
stdin.readStart()

Let's run it and see if we get a better result.


time dd if=/dev/zero bs=65536 count=500000 | node wc-node-fast.js

500000+0 records in
500000+0 records out
32768000000 bytes (33 GB, 31 GiB) copied, 8.85562 s, 3.7 GB/s
size 32768000000 rss 66383872

real    0m8.867s
user    0m1.587s
sys     0m16.226s

Hmmm... this is no faster and now memory usage is 10MB more. Let's take a look at syscalls in the node.js version.


dd if=/dev/zero bs=65536 count=500000 | strace -c node wc-node-fast.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.06    5.654736          11    500019         5 read
  1.01    0.058474           3     17994      6502 futex
  0.78    0.044935           3     15671           epoll_wait
  0.09    0.005060           3      1965           getpid

and with the first node.js version...


dd if=/dev/zero bs=65536 count=500000 | strace -c node wc-node.js

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.62    5.446947          11    500032        15 read
  1.39    0.078514           3     31270     15633 epoll_ctl
  1.21    0.068166           3     21504      6990 futex
  0.68    0.038313           2     15643           epoll_wait
  0.08    0.004764           2      1966           getpid

Ok, we can see a significant number of extra syscalls here. The second program is calling futex and epoll_wait a lot more than the just(js) versions. This needs further investigation and is likely to do with extra garbage collection and a different polling mechanism on the event loop. We can see both node.js programs do the same number of reads so we are reading in 64k chunks across the board.

For now, I will have to leave the experiment with node.js as I cannot figure out a way to make this do better without writing a C++ module. Let's do some runs for different block sizes and see what the overall results look like.

Different Block Sizes

The script bench.js will perform ten runs for each program across all power of 2 block sizes from 256 bytes to 65536 bytes. It saves all results in a file names 'all.json'. It will take the best score across the ten rounds for each program and save them in a file named 'results.json'. If you then run analyse.js it will produce a html page with the results plotted using HighCharts.


# run the benchmark and produce all.json and results.json
just bench.js
# generate a report on the results and display in the browser
just analyse.js

Throughput in MBps

Throughput in MBps

This shows the best performance over 10 runs for each program at the different chunk sizes. We can see pretty clearly there is no significant difference between the JS and c versions of the programs. In fact, we see slightly better performance from the JS programs with smaller chunk sizes.

We can see here node.js throughput is approx. 30% lower than the c or just(js) programs.

Time Taken in Milliseconds

Time Taken

And the same picture when we look at time taken.

Unfortunately, i could not find any way of forcing node.js to read from the pipe in explicitly sized chunks for the comparison above. It's so frustrating to have to wade through tons and tons of web pages and stack overflow questions and the source code itself only to find there doesn't seem to be any way to do this. If anyone knows how, please ping me.

Conclusions and Assertions

So, that brings the investigations to an end for now. Let's see if we make make some assertions.

Further Work

Please feel free to point out any mistakes/omissions and point out any improvements that could be made.

References

  1. Just(js) Javascript Platform: https://github.com/just-js
  2. wc (Unix): https://en.wikipedia.org/wiki/Wc_(Unix)
  3. Trash Talk: The Orinoco Garbage Collector: https://v8.dev/blog/trash-talk
  4. High Performance Garbage Collection for C++: https://v8.dev/blog/high-performance-cpp-gc
  5. fcntl(2): https://man7.org/linux/man-pages/man2/fcntl.2.html
  6. epoll(7): https://man7.org/linux/man-pages/man7/epoll.7.html
  7. The method to epoll's Madness: https://copyconstruct.medium.com/the-method-to-epolls-madness-d9d2d6378642
  8. Just(js) Runtime Bootstrap: https://github.com/just-js/just/blob/0.0.21/just.js#L245
  9. Just(js) Standard Library Initialisation: https://github.com/just-js/just/blob/0.0.21/just.cc#L462
  10. Just(js) Github Issues: https://github.com/just-js/just/issues