
Parallelizing the Output of Simplified Saaze

The original post is at eklausmeier.goip.de/blog/2024/02-27-parallelizing-simplified-saaze-output.


This blog uses Simplified Saaze as its static site generator. Generating all 561 HTML pages takes 0.25 seconds. The environment used is shown in the table below.

Type              Value
CPU               AMD Ryzen 7 5700G
RAM               64 GB
OS                Arch Linux 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC
PHP               PHP 8.3.3 (cli)
PHP with JIT      PHP 8.3.3 (cli), Zend Engine v4.3.3 with Zend OPcache v8.3.3
Simplified Saaze  2.0

1. Runtimes in serial mode. The following runs use PHP without JIT. The serial runtime for this very blog is:

$ time php saaze -mortb /tmp/build
Building static site in /tmp/build...
    execute(): filePath=./content/aux.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/blog.yml, nSIentries=452, totalPages=23, entries_per_page=20
    execute(): filePath=./content/gallery.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/music.yml, nSIentries=69, totalPages=4, entries_per_page=20
    execute(): filePath=./content/error.yml, nSIentries=0, totalPages=0, entries_per_page=20
Finished creating 5 collections, 4 with index, and 561 entries (0.25 secs / 24.46MB)
#collections=5, parseEntry=0.0103/563-5, md2html=0.0201, MathParser=0.0141/561, renderEntry=0.1573/561, renderCollection=0.0058/33, content=561/0, excerpt=0/0
    real 0.28s
    user 0.16s
    sys 0
    swapped 0
    total space 0

The renderEntry() function accounts for 0.1573 of the overall 0.25 seconds, i.e., more than 60%. These 561 calls will now be parallelized. The rest stays serial.

For the Lemire blog we have:

$ time php saaze -rb /tmp/buildLemire
Building static site in /tmp/buildLemire...
        execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nSIentries=2771, totalPages=139, entries_per_page=20
Finished creating 1 collections, 1 with index, and 4483 entries (1.01 secs / 97.18MB)
#collections=1, parseEntry=0.0702/4483-1, md2html=0.1003, MathParser=0.0594/4483, renderEntry=0.4121/4483, renderCollection=0.0225/140, content=4483/0, excerpt=0/0
        real 1.03s
        user 0.64s
        sys 0
        swapped 0
        total space 0

In this case the output template processing takes 0.4121 of the overall 1.01 seconds, that is about 40%. This shows that the Lemire templates are simpler. No wonder: they do not use categories and tags, or the many other gimmicks which I use in this blog. But still, 40% of the runtime is spent on output rendering.
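The two percentages can be recomputed directly from the log lines. The following is a small sketch; the helper name renderShare() is made up for this illustration and is not part of Simplified Saaze.

```php
<?php
// Timings copied from the two build logs above (seconds, PHP without JIT).
$runs = [
    'this blog' => ['renderEntry' => 0.1573, 'total' => 0.25],
    'Lemire'    => ['renderEntry' => 0.4121, 'total' => 1.01],
];

// Share of the total build time spent in renderEntry(), in percent.
// renderShare() is a hypothetical helper, used only for this example.
function renderShare(float $renderEntry, float $total): float {
    return 100.0 * $renderEntry / $total;
}

foreach ($runs as $name => $t) {
    printf("%-10s renderEntry share: %.1f%%\n", $name, renderShare($t['renderEntry'], $t['total']));
}
```

This gives roughly 62.9% for this blog and 40.8% for the Lemire blog.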

In Performance Comparison Saaze vs. Hugo vs. Zola I wrote:

It would be quite easy to use threads in Saaze, i.e., so-called entries and the chunks of collections could easily be processed in parallel.

It is even easier to parallelize the generation of the output files when the PHP templating is in place. We will see that parallelizing can be done in less than 20 lines of PHP code.

2. Runtimes in serial mode with JIT enabled. Below are the runtimes with JIT and OPcache enabled for PHP.

$ time php saaze -mortb /tmp/build
Building static site in /tmp/build...
        execute(): filePath=./content/aux.yml, nSIentries=7, totalPages=1, entries_per_page=20
        execute(): filePath=./content/blog.yml, nSIentries=453, totalPages=23, entries_per_page=20
        execute(): filePath=./content/gallery.yml, nSIentries=7, totalPages=1, entries_per_page=20
        execute(): filePath=./content/music.yml, nSIentries=69, totalPages=4, entries_per_page=20
        execute(): filePath=./content/error.yml, nSIentries=0, totalPages=0, entries_per_page=20
Finished creating 5 collections, 4 with index, and 562 entries (0.16 secs / 20.36MB)
#collections=5, parseEntry=0.0104/564-5, md2html=0.0219, MathParser=0.0203/562, renderEntry=0.0521/562, renderCollection=0.0022/33, content=562/0, excerpt=0/0
        real 0.19s
        user 0.11s
        sys 0
        swapped 0
        total space 0

The previously massive renderEntry() share of the runtime shrank from 0.1573 seconds to 0.0521 seconds. I think this is mainly due to OPcache, which now avoids reparsing and recompiling the PHP output template.
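The post does not list the ini settings behind "JIT enabled". As a hedged sketch, the script below prints the OPcache and JIT settings the current PHP CLI process sees. The three ini names are standard OPcache directives; the helper opcacheSettings() is invented for this example. Note that OPcache is disabled for the CLI by default, so opcache.enable_cli has to be switched on for it to take effect in runs like the ones above.

```php
<?php
// Report the OPcache/JIT settings visible to the current PHP process.
// ini_get() returns false when the setting does not exist, e.g. when
// the OPcache extension is not loaded.
// opcacheSettings() is a hypothetical helper for this illustration.
function opcacheSettings(): array {
    $names = ['opcache.enable_cli', 'opcache.jit', 'opcache.jit_buffer_size'];
    $out = [];
    foreach ($names as $name) {
        $value = ini_get($name);
        $out[$name] = ($value === false) ? '(not available)' : $value;
    }
    return $out;
}

foreach (opcacheSettings() as $name => $value) {
    printf("%-26s %s\n", $name, $value);
}
```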

For the Lemire blog with JIT enabled we have:

$ time php saaze -rb /tmp/buildLemire
Building static site in /tmp/buildLemire...
        execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nSIentries=2771, totalPages=139, entries_per_page=20
Finished creating 1 collections, 1 with index, and 4483 entries (0.62 secs / 96.24MB)
#collections=1, parseEntry=0.0655/4483-1, md2html=0.0974, MathParser=0.0586/4483, renderEntry=0.0707/4483, renderCollection=0.0110/140, content=4483/0, excerpt=0/0
        real 0.65s
        user 0.40s
        sys 0
        swapped 0
        total space 0

Similar picture to the above: the renderEntry() part dropped from 0.4121 seconds to 0.0707 seconds. That's massive.

3. Unix forks in PHP. As a brief introduction to pcntl_fork() in PHP, consider the simple PHP code below.

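<?php
    for ($i=1; $i<=4; ++$i) {
        if (($pid = pcntl_fork())) {	// non-zero: this is the forking process, $pid is the new child
            printf("i=%d, pid=%d\n",$i,$pid);
            sleep(1);
            exit(0);
        }
        // zero: this is the new child; it continues the loop and forks in turn
    }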

Running this script:

$ php forktst.php
i=1, pid=15082
i=2, pid=15083
i=3, pid=15084
i=4, pid=15085

The fork and join method of parallelization is easy to use, but it has the disadvantage that communicating results from the children to the parent is "difficult". Communicating data from the parent to its children is "easy": everything is copied over.
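To make the asymmetry concrete, here is a minimal sketch of a fork with an explicit join via pcntl_waitpid(). It is not the code used in Simplified Saaze; the function forkAndJoin() is invented for this example. The only thing each child reports back is its exit code, which is limited to a small integer, and that shows why returning real results to the parent needs extra machinery such as files or pipes.

```php
<?php
// Sketch: fork $n children, let each one "work", then join them all.
// Requires the pcntl extension (Unix-like systems, CLI).
// forkAndJoin() is a hypothetical helper, not part of Simplified Saaze.
function forkAndJoin(int $n): array {
    $children = [];
    for ($k = 0; $k < $n; ++$k) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: pcntl_fork() returned 0. It inherited a copy of all
            // parent data; its only way to report back here is the exit code.
            exit($k);
        }
        $children[$pid] = $k;	// parent: remember which child got which slot
    }
    $codes = [];
    foreach ($children as $pid => $slot) {
        pcntl_waitpid($pid, $status);	// the "join": block until this child ends
        $codes[$slot] = pcntl_wexitstatus($status);
    }
    ksort($codes);
    return $codes;	// slot number => exit code of that child
}

if (function_exists('pcntl_fork')) {
    print_r(forkAndJoin(4));
}
```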

4. Implementation in BuildCommand.php. The command-line version of Simplified Saaze calls buildAllStatic(). This routine iterates through all collections, and for each collection it iterates through all entries.

  1. Function getEntries() reads the Markdown files and converts them to HTML using MD4C, all in memory.
  2. Function buildEntry() uses the entry in question and writes the HTML to disk by processing it through our PHP templates.

PHP function buildEntry() is essentially:

private function buildEntry(Collection $collection, Entry $entry, string $dest) : void {
    ...
    file_put_contents($entryDir, $this->templateManager->renderEntry($entry));
}

The loop calling buildEntry() is now bracketed by beginParallel() and endParallel(). That's it.

foreach ($collections as $collection) {
    $entries    = $collection->getEntries();	# finally calls getContentAndExcerpt() and sorts
    $nentries   = count($entries);
    $nSIentries = count($collection->entriesSansIndex);
    $entries_per_page = $collection->data['entries_per_page'] ?? \Saaze\Config::$H['global_config_entries_per_page'];
    $totalPages = ceil($nSIentries / $entries_per_page);
    printf("\texecute(): filePath=%s, nSIentries=%d, totalPages=%d, entries_per_page=%d\n",$collection->filePath,$nSIentries,$totalPages,$entries_per_page);

    $this->beginParallel($nentries,$aprocs);
    $i = 0;
    foreach ($entries as $entry) {
        if ($this->nprocs > 0  &&  ($i++ % $this->nprocs) != $this->procnr) continue;	// distribute work among nprocs processes
        if ($entry->data['entry'] ?? true) {
            $this->buildEntry($collection, $entry, $dest);
            $entryCount++;
        }
    }
    $this->endParallel();

    if ($tags) {	// populate cat_and_tag[][] array
        foreach ($entries as $entry) {
            if ($entry->data['entry'] ?? true)
                $this->build_cat_and_tag($entry,$collection->draftOverride);
        }
    }

    ++$totalCollection;
    if ($this->buildCollectionIndex($collection, 0, $dest)) $collectionCount++;

    for ($page=1; $page <= $totalPages; $page++)
        $this->buildCollectionIndex($collection, $page, $dest);
}
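The modulo test in the inner loop distributes the entries round-robin. The following sketch reproduces that test without any forking, so the assignment can be inspected; indicesFor() and the numbers 10 and 4 are made up for the illustration.

```php
<?php
// Which entry indices does process $procnr handle when $nprocs processes
// run the same loop? Mirrors the test ($i++ % $nprocs) != $procnr from the
// loop above, but without forking.
// indicesFor() is a hypothetical helper for this illustration.
function indicesFor(int $nentries, int $nprocs, int $procnr): array {
    $mine = [];
    for ($i = 0; $i < $nentries; ++$i) {
        if ($nprocs > 0 && ($i % $nprocs) != $procnr) continue;	// not this process's turn
        $mine[] = $i;
    }
    return $mine;
}

// Hypothetical small case: 10 entries spread over 4 processes.
for ($procnr = 0; $procnr < 4; ++$procnr) {
    printf("procnr=%d: %s\n", $procnr, implode(',', indicesFor(10, 4, $procnr)));
}
```

Every index is taken by exactly one process, and the shares differ by at most one entry. With nprocs=1 and procnr=0, which is the case below the forking threshold, the single process handles all entries.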

The two PHP functions for fork and join are thus:

protected function beginParallel(int $nentries, int $aprocs) : void {
    $this->pid = 0;
    $this->procnr = 0;
    $this->nprocs = 1;
    if ($nentries < 128) return;	// too few entries to warrant forking
    $this->nprocs = $aprocs;	// aprocs = allowed procs, specified on command-line
    for ($this->procnr=0; $this->procnr<$this->nprocs; ++$this->procnr)
        if (($this->pid = pcntl_fork())) return;	// non-zero pid: the forking process returns to work as worker procnr
}

protected function endParallel() : void {
    if ($this->pid) exit(0);	// workers that forked exit here; the last process in the chain (pid=0) continues
}

This fork and join via pcntl_fork() does not work on Microsoft Windows.
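One way to cope with platforms lacking pcntl is to fall back to serial processing. The sketch below is an assumption about how such a guard could look, not something Simplified Saaze is known to do; usableProcs() and its parameter are invented names.

```php
<?php
// Decide how many processes to use; fall back to 1 (serial) when the
// pcntl extension is unavailable, e.g. on Microsoft Windows.
// usableProcs() is hypothetical; $requested stands for a value such as
// the one given on the command line.
function usableProcs(int $requested): int {
    if (!function_exists('pcntl_fork')) return 1;	// no fork available: stay serial
    return max(1, $requested);
}

printf("using %d process(es)\n", usableProcs(16));
```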

5. Benchmarking. How much of an improvement do we get from this? For this very blog with 561 entries, the runtime can be more than halved. This is in line with the 60% of the runtime used by the output template processing. It should be noted that this blog consists of five collections:

  1. aux: 7 entries
  2. blog: 452 entries, only these are parallelized!
  3. gallery: 7 entries
  4. music: 69 entries
  5. error: 1 entry

Parallelization kicks in only for collections with at least 128 entries. I.e., only the blog collection is parallelized; music and the other collections are not.
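As a rough plausibility check one can apply Amdahl's law, which the post does not mention and which is brought in here only as an estimate. It bounds the speedup when a fraction f of the runtime is spread over n processes and the rest stays serial. The sketch takes f from the renderEntry share in the first log; since only the blog collection is actually parallelized, this f is slightly too optimistic. amdahlSpeedup() is a name made up for the example.

```php
<?php
// Amdahl's law: upper bound on the speedup when a fraction $f of the
// runtime is spread over $n processes and the rest stays serial.
// amdahlSpeedup() is a hypothetical helper for this illustration.
function amdahlSpeedup(float $f, int $n): float {
    return 1.0 / ((1.0 - $f) + $f / $n);
}

$f = 0.1573 / 0.25;	// renderEntry share for this blog without JIT, from the log
foreach ([1, 2, 4, 8, 16] as $n) {
    printf("p=%2d  speedup <= %.2f\n", $n, amdahlSpeedup($f, $n));
}
```

For 16 processes the bound comes out at about 2.4, which fits "more than halved".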

Another benchmark is the Lemire blog converted to Simplified Saaze, see Example Theme for Simplified Saaze: Lemire.

Command-lines are:

time php saaze -p16 -mortb /tmp/build
time php saaze -p16 -rb /tmp/buildLemire

We then vary the parameter -p. All output goes to /tmp, which is a RAM disk in Arch Linux. Obviously, I do not want to measure disk read or write speed. I want to measure the processing speed of Simplified Saaze.

Timings are the real time reported by time, in seconds.

Blog entries             p=1    p=2    p=4    p=8    p=16
561 posts / this blog    0.28   0.18   0.16   0.13   0.12
561 posts with JIT       0.19   0.17   0.14   0.13   0.12
4483 posts in Lemire     1.03   1.02   0.65   0.54   0.52
4483 posts with JIT      0.65   0.64   0.53   0.47   0.46

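Overall, with just 20 lines of PHP we can halve the runtime: from 0.28 to 0.12 seconds for this blog and from 1.03 to 0.52 seconds for the Lemire blog. With JIT enabled the drop is less pronounced, roughly a third (0.19 to 0.12 and 0.65 to 0.46 seconds).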

The very good performance of JIT, which we can see here, is in line with the findings in Phoronix: PHP 8.0 JIT Is Offering Very Compelling Performance Ahead Of Its Alpha.