
Leading items

Welcome to the LWN.net Weekly Edition for June 15, 2017

This edition contains the following feature content:

  • Making Python faster: a report from a PyCon talk on what has been done to improve the performance of recent (and future) Python releases.
  • Assembling the history of Unix: the Unix Heritage Project is managing the history of the Unix system; that history is now available in a Git repository. This article includes a discussion with the founder of the project on what he is doing and why.
  • Shrinking the scheduler: how far are the kernel developers willing to go to support tiny systems?
  • A survey of scheduler benchmarks: there is a wealth of tools out there for benchmarking the kernel's CPU scheduler.
  • Alioth moving toward pagure: Debian's aging Alioth code forge looks to be replaced.
  • A beta for PostgreSQL 10: Josh Berkus gives us an overview of what's coming in the PostgreSQL 10 release.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (4 posted)

Making Python faster

By Jake Edge
June 14, 2017
PyCon

The Python core developers, and Victor Stinner in particular, have been focusing on improving the performance of Python 3 over the last few years. At PyCon 2017, Stinner gave a talk on some of the optimizations that have been added recently and the effect they have had on various benchmarks. Along the way, he took a detour into some improvements that have been made for benchmarking Python.

He started his talk by noting that he has been working on porting OpenStack to Python 3 as part of his day job at Red Hat. So far, most of the unit tests are passing. That means that an enormous Python program (with some 3 million lines of code) has largely made the transition to the Python 3 world.

Benchmarks

[Victor Stinner]

Back in March 2016, developers did not really trust the Python benchmark suite, he said. The benchmark results were not stable, which made it impossible to tell if a particular optimization made CPython (the Python reference implementation) faster or slower. So he set out to improve that situation.

He created a new module, perf, as a framework for running benchmarks. It calibrates the number of loops to run the benchmark based on a time budget. Each benchmark run then consists of sequentially spawning twenty processes, each of which performs the appropriate number of loops three times. That generates 60 time values; the average and standard deviation are calculated from those. He noted that the standard deviation can be used to spot problems in the benchmark or the system; if it is large, meaning lots of variation, that could indicate a problem.
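
For illustration, a benchmark written against the perf module looks roughly like this (a minimal sketch based on the module's documented Runner API; the module has since been renamed pyperf). The Runner object takes care of the loop calibration and worker-process spawning described above:

    import perf

    runner = perf.Runner()
    # bench_func() times the given call; the runner calibrates the number
    # of loops, spawns the worker processes sequentially, and reports the
    # mean and standard deviation of the collected samples.
    runner.bench_func('sort', sorted, list(range(1000)))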

Using perf has provided more stable and predictable results, he said. That has led to a new Python performance benchmark suite. It is being used at the speed.python.org site to provide benchmark numbers for Python. Part of that work has resulted in CPython being compiled with link-time optimization and profile-guided optimization by default.

The perf module has a "system tune" command that can be used to tune a Linux system for benchmarking. That includes using a fixed CPU frequency, rather than allowing each core's frequency to change all the time, disabling the Intel Turbo Boost feature, using CPU pinning, and running the benchmarks on an isolated CPU if that feature is enabled in the kernel.
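
A rough idea of what that tuning involves can be sketched with a few of the Linux sysfs knobs it touches (a simplified, hypothetical illustration rather than the module's actual code; it assumes root privileges, an Intel CPU using the intel_pstate driver, and an arbitrarily chosen CPU for pinning):

    import os

    # Disable Intel Turbo Boost so the clock frequency stays predictable.
    with open('/sys/devices/system/cpu/intel_pstate/no_turbo', 'w') as f:
        f.write('1')

    # Switch every core to the "performance" governor for a fixed frequency.
    for cpu in range(os.cpu_count()):
        path = '/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor' % cpu
        with open(path, 'w') as f:
            f.write('performance')

    # Pin this process to one CPU (ideally one isolated with isolcpus=).
    os.sched_setaffinity(0, {3})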

Having stable benchmarks makes it much easier to spot a performance regression, Stinner said. For a real example, he pointed to a graph in his slides [PDF] that showed the python_startup benchmark time increasing dramatically during the development of 3.6 (from 20ms to 27ms). The problem was a new import in the code; the fix dropped the benchmark to 17ms.

The speed.python.org site allows developers to look at a timeline of the performance of CPython since April 2014 on various benchmarks. Sometimes it makes sense to focus on micro-benchmarks, he said, but the timelines of the larger benchmarks can be even more useful for finding regressions.

Stinner put up a series of graphs showing that 3.6 is faster than 3.5 and 2.7 on multiple benchmarks. He chose the most significant changes to show in the graphs, and there are a few benchmarks that go against these trends. The differences between 3.6 and 2.7 are larger than those for 3.6 versus 3.5, which is probably not a huge surprise.

The SymPy benchmarks show some of the largest performance increases. They are 22-42% faster in 3.6 than they are in 2.7. The largest increase, though, was on the telco benchmark, which is 40x faster on 3.6 versus 2.7. That is because the decimal module was rewritten in C for Python 3.3.

Preliminary results indicate that the in-development Python 3.7 is faster than 3.6, as well. There were some optimizations that were merged just after the 3.6 release; there were worries about regressions, which is why they were held back, he said.

Optimizations

Stinner then turned to some of the optimizations that have made those benchmarks faster. For 3.5, several developers rewrote the functools.lru_cache() decorator in C. That made the SymPy benchmarks 20% faster. The cache is "quite complex" with many corner cases, which made it hard to get right. In fact, it took three and a half years to close the bug associated with it.
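
From the user's point of view the decorator is unchanged; code like the following simply got faster because the caching machinery now lives in C (a generic usage example, not taken from SymPy):

    from functools import lru_cache

    @lru_cache(maxsize=128)
    def fib(n):
        # Repeated calls with the same argument are served from the
        # C-level cache instead of being recomputed.
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    fib(100)
    print(fib.cache_info())  # hits, misses, maxsize, currsize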

Another 3.5 optimization was for ordered dictionaries (collections.OrderedDict). Rewriting it in C made the html5lib benchmark 20% faster, but it was also tricky code. It took two and a half years to close that bug, he said.

Moving on to optimizations for 3.6, he described the change he made for memory allocation in CPython. Instead of using PyMem_Malloc() for smaller allocations, he switched to the Python fast memory allocator that is used for Python objects. It only changed two lines of code, but resulted in many benchmarks getting 5-22% faster—and no benchmarks ran slower.

The xml.etree.ElementTree.iterparse() routine was optimized in response to a PyCon Canada 2015 keynote [YouTube video] by Brett Cannon. That resulted in the etree_parse and etree_iterparse benchmarks running twice as fast, which Stinner called "quite good". As noted in the bug report, though, it is still somewhat slower than 2.7.
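
For reference, iterparse() provides incremental, event-driven parsing of documents too large to hold in memory as a full tree; typical usage looks like this (a generic sketch; the file name and element names are hypothetical):

    import xml.etree.ElementTree as ET

    # Stream through a large document without building the whole tree.
    for event, elem in ET.iterparse('catalog.xml', events=('end',)):
        if elem.tag == 'book':
            print(elem.findtext('title'))
            elem.clear()  # drop the handled element to keep memory bounded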

The profile-guided optimization for CPython was improved by using the Python test suite. Previously, CPython would be compiled twice using the pidigits module to guide the optimization. That only tested a few, math-oriented Python functions, so using the test suite instead covers more of the interpreter. That resulted in many benchmarks showing 5-27% improvement just by changing the build process.

In 3.6, Python moved from using a bytecode for its virtual machine to a "wordcode". Instead of instructions being either one or three bytes long, all instructions are now two bytes long. That removed an if statement from the hot path in ceval.c (the main execution loop).

Stinner added a way to make C function calls faster using a new internal _PyObject_FastCall() routine. Creating and destroying the tuple that is used to call C functions would take around 20ns, which is expensive if the call itself is only, say, 100ns. So the new function dispenses with creating the tuple to pass the function arguments. It shows a 12-50% speedup for many micro-benchmarks.

He also optimized the ASCII and UTF-8 codecs when using the "ignore", "replace", "surrogateescape", and "surrogatepass" error handlers. Those codecs were full of "bad code", he said. His work resulted in UTF-8 decoding being 15x faster and encoding to be 75x faster. For ASCII, decoding is now 60x faster, while encoding is 3x faster.
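
Those error handlers control what happens to bytes that are invalid in the codec; a couple of examples of the code paths that were sped up (standard library behavior):

    raw = b'caf\xe9'   # not valid ASCII or UTF-8

    raw.decode('ascii', 'replace')          # 'caf\ufffd'
    raw.decode('utf-8', 'surrogateescape')  # 'caf\udce9'

    # surrogateescape round-trips the undecodable byte back out unchanged:
    raw.decode('utf-8', 'surrogateescape').encode('utf-8', 'surrogateescape') == raw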

Python 3.5 added byte-string formatting back into the language as a result of PEP 461, but the code was inefficient. He used the _PyBytesWriter() interface to handle byte-string formatting. That resulted in 2-3x speedups for those types of operations.
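
PEP 461 reinstated printf-style formatting for bytes objects, so the operations that benefit are of this form (a trivial example):

    b'GET %s HTTP/1.%d\r\n' % (b'/index.html', 1)
    # -> b'GET /index.html HTTP/1.1\r\n'

    b'%05d|%x' % (42, 255)
    # -> b'00042|ff'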

There were also improvements to the filename pattern matching or "globbing" operations (in the glob module and in the pathlib.Path.glob() routine). Those improved glob by 3-6x and pathlib globbing by 1.5-4x by using the new os.scandir() iterator that was added in Python 3.5.
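
The speedup comes from os.scandir() returning directory entries with file-type information already attached, so globbing no longer needs an extra stat() call per file. A simplified comparison (not the glob module's actual implementation):

    import os

    # listdir() returns bare names, so checking each entry's type costs
    # an additional stat() system call.
    files = [n for n in os.listdir('.') if os.path.isfile(n)]

    # scandir() (Python 3.5+) yields DirEntry objects whose is_file()
    # usually answers from data returned with the directory listing itself.
    with os.scandir('.') as it:
        files = [e.name for e in it if e.is_file()]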

The last 3.6 optimization that Stinner described was an improvement for the asyncio module that increased the performance of some asynchronous programs by 30%. The asyncio.Future and asyncio.Task classes were rewritten in C (for reference, here is the bug for Future and the bug for Task).

There are "lots of ideas" for optimizations for 3.7, Stinner said, but he is not sure which will be implemented or if they will be helpful. One that has been merged already is to add new opcodes (LOAD_METHOD and CALL_METHOD) to support making method calls as fast calls, which makes method calls 10-20% faster. It is an idea that has come to CPython from PyPy.

He concluded his talk by pointing out that on some benchmarks, Python 3.7 is still slower than 2.7. Most of those are on the order of 10-20% slower, but the python_startup benchmarks are 2-3x slower. There is a need to find a way to optimize interpreter startup in Python 3. There are, of course, more opportunities to optimize the language and he encouraged those interested to check out speed.python.org, as well as his Faster CPython site (which he mentioned in his Python Language Summit session earlier in the week).

A YouTube video of Stinner's talk is also available.

[I would like to thank the Linux Foundation for travel assistance to Portland for PyCon.]

Comments (15 posted)

Assembling the history of Unix

June 14, 2017

This article was contributed by A. Jesse Jiryu Davis

The moment when an antique operating system that has not run in decades boots and presents a command prompt is thrilling for Warren Toomey. He compares it to restoring an old Model-T. "An old car looks pretty, but at the end of the day its purpose is to drive you somewhere. I love being able to turn the engine over and actually get it to do its job."

Toomey, an Australian university lecturer, founded the Unix Heritage Society to reconstruct the early history of the Unix operating system. Recently this historical code has become much more accessible: we can now browse it in an instant on GitHub, thanks to the efforts of a computer science professor at the Athens University of Economics and Business named Diomidis Spinellis. The 50th anniversary of the invention of Unix will be in 2019; the painstaking work of Toomey and Spinellis makes it possible for us to appreciate Unix's epic story.

The Unix Heritage Society

Around 1993, while he was a researcher at the University of New South Wales, Toomey began asking on mailing lists and news groups for old Unix versions with the intent to run them on a PDP-11 simulator. He began a group called the PDP-11 Unix Preservation Society, whose mission grew to encompass all old Unix releases and was renamed the Unix Heritage Society in 2000. "I think the title is a bit grandiose," he said in an interview with LWN. "It's not really a society, just me and the mailing list."

Toomey's project faced two obstacles; the first was simply to locate enough parts of each old Unix version to assemble a complete copy. He haunted the newsgroups and mailing lists of old Unix hackers, and he heard rumors of people who knew where to get historical artifacts. Most of his requests went unanswered. He recalls spending five or six years repeatedly asking for specific files, until eventually someone would respond, "Oh, actually, I have it." By chance, Toomey discovered in his own university's computer room a dozen tapes with backups of the 6th and 7th Editions of Unix. The backups weren't bootable—there wasn't even a complete backup of either edition—but the discovery accelerated his project nevertheless.

His second obstacle was the long shadow of AT&T's original copyright. AT&T and other corporations allowed individuals to own copies of Unix, but not to share them. Toomey had found his university's copy of a System V source license, but this only provided a small bit of legal cover to ask strangers to share their vintage files with him. Occasionally, one of Toomey's inside informants might give him a 15-year-old copy of some file, saying, "Just don't tell anyone where you got it."

Whenever Toomey acquired what seemed to be a complete version of Unix, he had to get it up and running, without any documentation to guide him. "You've got an artifact," he said, "It might be a binary or source code and there's no Makefile, you've got no idea what was the right sequence of things to do to build it."

Last year, for example, Toomey and his friends from the Unix Heritage Society resuscitated the first version of Unix for the PDP-7, written in mid-1970. The primary source was a dot-matrix printout containing PDP-7 assembly code, badly printed with notes and corrections scribbled on it. The members of the society converted the blurred copy to digital text with an OCR program, but they knew there were transcription errors that they'd have to backtrack and fix. Undaunted, they proceeded to the next stage: they learned the syntax of PDP-7 assembly code and wrote an assembler to convert the badly scanned text to machine code.

Now, with a set of executable binaries, the team had to store them in a filesystem, and here they hit a circular dependency. They didn't know the binary format of the filesystem for that version of the Unix kernel. The kernel itself implemented this filesystem, but they had to get the kernel to boot in order to use it for that purpose. Toomey decided to use a PDP-7 simulator to reverse-engineer the basic layout of a bootable disk image, and wrote a tool to create such an image containing the executables that he and his friends had assembled. "It's chicken-and-egg, but you work in stages," he said. "You get one little bit working and you use that to leverage up the next bit."

Unix's two inventors have helped him along the way. "Ken Thompson is minimalist in his communication," Toomey said. When the Unix Heritage Society brought up a PDP-11 version of Unix, he sent Thompson a series of emails about it, to which Thompson responded with single-word messages: "Amazing," or, "Incredible." Toomey said that while Dennis Ritchie was alive, he enthusiastically supported the project. "I really miss him an awful lot."

The Unix History Repository on GitHub

It's valuable to preserve snapshots of old-fashioned systems, but these snapshots don't fit modern programmers' methods for exploring the history of an evolving code base. Today, we read history with tools like Git. Spinellis has imported over 44 years of Unix code history into Git and published the repository on GitHub. The project builds on Toomey's accomplishments, but Spinellis wants more than just the code: he is building a moment-by-moment history of its evolution, and line-by-line attribution of each author's contributions.

Unix was developed without any version control at first. When development moved to the University of California at Berkeley in the late 1970s, coders began tracking certain files in an early version control system called SCCS, but even then it was not used for all files. Spinellis reconstructed as much history as he could by importing entire snapshots of early Unix versions into Git as if they were single commits. He researched primary sources like publications, technical reports, man pages, or names written in comments in the source code to attribute particular parts of the code to their authors.

Since publishing the repository on GitHub, Spinellis has continued to refine it periodically. He recently discovered an author unacknowledged in the Git logs whose contributions he wants to add. This March, the copyright holders for Unix Research Editions 8, 9, and 10 granted permission to distribute those versions, so that history can now be integrated into the repository. Additionally, Spinellis points out that he only followed one Unix variant to its conclusion: FreeBSD. Other variants like NetBSD and OpenBSD are just as old and interesting; their stories could be added to the repository as distinct branches.

But why?

Both Spinellis and Toomey enjoy reading old Unix code to see how much power the early programmers could jam into a tiny memory footprint. For example, the PDP-7 Unix that Toomey recovered last year is a minimalist masterpiece. It is a recognizable Unix system, including the fork() and exec() system calls, multiple user accounts, file permissions, and a directory structure, all implemented in only 4000 words of memory.

"But it's really not the source code that's important," said Toomey. "It's the ideas that are embodied in it." AT&T's efforts to protect the Unix code were irrelevant, he said, because the real value lies in concepts like connecting small utilities together with pipes and implementing the system in a portable programming language.

Spinellis agrees: when he examined the 1970 edition of Unix, he saw that, even though it was a bare prototype system, it already contained key architectural elements of modern Unix, such as abstracting I/O and separating the kernel from the command-line interpreter. Within a few years, even more powerful concepts became visible: devices that appeared as files, a hierarchical filesystem, and a shell that ran as a user process distinct from the kernel. The very first Unix versions contained the basic ideas that inspired the modern operating system that dominates computing today.

Comments (31 posted)

Shrinking the scheduler

By Jonathan Corbet
June 14, 2017

The ups and downs of patching the kernel to wedge Linux into tiny systems have been debated numerous times over the years, most recently in the context of Nicolas Pitre's alternative TTY layer patches posted in April. Pitre is driving the debate again, this time by trying to shrink the kernel's CPU scheduler. In the process, he has exposed a couple of areas of fundamental disagreement on the value of this kind of work.

Pitre's goal is to make it possible to run a system based on a Linux kernel on a processor with as little as 256KB of memory. Doing so requires more than just making code and data structures smaller; he simply has to eliminate as much code as possible. With the TTY patches, he replaced the TTY subsystem outright with a much smaller (and less capable) alternative. His approach with the scheduler is different: the kernel's core scheduler remains when his patches are active, but a number of features, including the realtime and deadline scheduler classes, are compiled out. The resulting scheduler is 25% smaller at the cost of features that will almost certainly not be used on tiny systems anyway.

As is often the case with "tinification" patches, the scheduler patches were given a chilly reception, with Ingo Molnar rejecting them outright. In the ensuing discussion, it became clear that there were two points of core disagreement on the value of these patches.

Alan Cox argued that a kernel configured for such small systems isn't really Linux anymore:

So once you've rewritten the tty layer, the device drivers, the VFS and removed most of the syscalls why even pretend it's Linux any more. It's something else, and that something else is totally architecturally incompatible with Linux. That's btw a good thing - trying to fit Linux directly into such a tiny device isn't sensible because the core assumptions you make about scalability are just totally different.

His suggestion was to create an entirely new kernel, borrowing bits of Linux source where it helps. Pitre's response was that this approach has been tried many times without success. Special-purpose kernels tend to be small projects with few developers; they progress slowly, are poorly maintained, and suffer from a lack of code review. By making it possible to use much of the Linux kernel, including, importantly, its device drivers, Pitre hopes to pull together the tiny-systems community around a single, well-supported alternative.

Molnar's argument, instead, was that the value obtained by supporting such small systems is not worth the cost, as measured in code complexity. Moore's law may be slowing down, he said, but these systems will still become more capable over time. Given the time that passes between the application of a scheduler patch and its appearance in distributions and products — between two and five years — there may be no need for this work by the time it becomes widely available. Given that, he argued, it is far more important to reduce the complexity of the scheduler than to reduce its size:

So while it obviously the "complexity vs. kernel size" trade-off will always be a judgment call, for the scheduler it's not really an open question what we need to do at this stage: we need to reduce complexity and #ifdef variants, not increase it.

Pitre disagreed with Molnar on every point. The smallest systems, he said, will remain small for economic reasons:

Your prediction is based on a false premise. There is simply no money to be made with IoT hardware, especially in the low end. Those little devices will be given away for free because it is in the service subscription that the money is. So the hardware has to, and will be, extremely cheap to produce.

The need for extremely low power consumption, so that a system can run for months or years on a single battery, will also keep these systems small. With regard to the time lag for adoption of the changes, he pointed out that much of the Android-related code that has gone into the mainline has been merged years after being deployed in products; the timing tends to be reversed in that part of the market. He also argued that his patches actually reduce the complexity of the scheduler code by factoring out the different scheduler classes and making it possible to remove them.

The conversation did not progress much beyond that point. There was one important bit of progress, though: Molnar agreed that Pitre's code-movement patches make the scheduler more maintainable. He requested that those patches be posted on their own so that they can be merged; that, he said, "should make future arguments easier". Pitre has obliged, but there has been no discussion of the new patches as of this writing. Should they be accepted, the remaining changes, which actually compile out those scheduler classes, should be quite small.

It is clear that getting core kernel maintainers to accept the costs associated with supporting tiny systems will always be a hard sell. In this case, the tiny-systems community has a developer who is determined to get the job done, but who is also familiar with how kernel development works and is willing to make the changes needed to get his patches merged. That still doesn't guarantee success in this inherently difficult — if not quixotic — endeavor, but the odds this time around would seem to be better than with previous attempts which, as can be seen here or here, have not always gone well.

Comments (31 posted)

A survey of scheduler benchmarks

June 14, 2017

This article was contributed by Matt Fleming

Many benchmarks have been used by kernel developers over the years to test the performance of the scheduler. But recent kernel commit messages have shown a particular pattern of tools being used (some relatively new), all of which were created specifically for developing scheduler patches. While each benchmark is different, having its own unique genesis story and intended testing scenario, there is a unifying attribute; they were all written to scratch a developer's itch.

Hackbench

Hackbench is a message-passing scheduler benchmark that allows developers to configure both the communication mechanism (pipes or sockets) and the task configuration (POSIX threads or processes). This benchmark is a stalwart of kernel scheduler testing, and has had more versions than the Batman franchise. It was originally created in 2001 by Rusty Russell to demonstrate the improved performance of the multi-queue scheduler patch series. Over the years, many people have added their contributions to Russell's version, including Ingo Molnar, Yanmin Zhang, and David Sommerseth. Hitoshi Mitake added the most recent incarnation to the kernel source tree as part of the perf-bench tool in 2009.

Here's an example of the output of perf-bench:

    $ perf bench sched pipe
    # Running 'sched/pipe' benchmark:
    # Executed 1000000 pipe operations between two processes

         Total time: 3.643 [sec]

           3.643867 usecs/op
             274433 ops/sec

The output of the benchmark is the average scheduler wakeup latency — the duration between telling a task it needs to wake up to perform work and that task running on a CPU. When analyzing latency, it's important to look at as many latency samples as possible because outliers (high-latency values) can be hidden by a summary statistic, such as the arithmetic mean. It's quite easy to miss those high latency events if the only data you have is the average latency, but scheduler wakeup delays can quickly lead to major performance issues.
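
A small, self-contained illustration of why the mean hides the tail (with made-up latency numbers):

    import math
    import statistics

    # 97 fast wakeups and three slow ones, in microseconds.
    latencies = [15] * 97 + [7000, 7200, 7300]

    mean = statistics.mean(latencies)                   # ~230 us: looks harmless
    samples = sorted(latencies)
    p99 = samples[math.ceil(0.99 * len(samples)) - 1]   # 7200 us: the tail shows up
    print(mean, p99)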

Because hackbench calculates an average latency for communicating a fixed amount of data between two tasks, it is most often used by developers who are making changes to the scheduler's load-balancing code. On the flip side, the lack of data for analyzing the entire latency distribution makes it difficult to dig into scheduler latency wakeup issues without using tracing tools.

Schbench

One benchmark that does provide detailed latency distribution statistics for scheduler wakeups is schbench. It allows users to configure not only the usual parameters — such as the number of tasks and the test duration — but also the time between wakeups (--sleeptime) and the time spent spinning once woken (--cputime); it can also automatically increase the task count until the 99th percentile wakeup latencies become extreme.

Schbench was created by Chris Mason in 2016 while forward porting some kernel patches that Facebook was carrying to improve the performance of its workloads. "Schbench allowed me to quickly test a variety of theories as we were forward porting our old patches", Mason said in a private email. It has since become useful for more than that, and Facebook now uses it for performance regression detection, investigating performance issues, and benchmarking patches before they're posted upstream.

Here's an example showing the detailed statistics produced by schbench:

    $ ./schbench -t 16 -m 2
    Latency percentiles (usec)
	50.0000th: 15
	75.0000th: 24
	90.0000th: 26
	95.0000th: 30
	*99.0000th: 85
	99.5000th: 1190
	99.9000th: 7272
	min=0, max=7270

The scheduler wakeup latency distribution that schbench prints at the end of the benchmark run is one of its distinguishing features, and was one of the main rationales for creating it. Mason continued: "The focus on p99 latencies instead of average latencies is the most important part. For us, lots of problems only show up when you start looking at the long tail in the latency graphs." It's also a true micro-benchmark, including only the bare minimum code required to simulate Facebook's workloads while ensuring the scheduler is the slowest part of the code path.

Publishing this benchmark has provided a common tool for discussing Facebook's workloads with upstream developers, and non-Facebook engineers are now using it to test their scheduler changes, which Mason is very happy with: "I'm really grateful when I see people using schbench to help validate new patches going in."

Adrestia

Adrestia is a dirt-simple scheduler wakeup latency micro-benchmark that contains even less code than schbench. I wrote it in 2016 to measure scheduler latency without using the futex() system call (as schbench does), in order to provide more coverage by exercising a different kernel subsystem in the scheduler wakeup path. I also needed something that had fewer bells and whistles and was trivial to configure. While schbench models Facebook's workloads, Adrestia is designed only to provide the 95th-percentile wakeup latency value, which provides a simple answer to the question: "What is the typical maximum wakeup latency value?"

I use adrestia to detect performance regressions of patches merged, and to validate potential patches as they're posted to the linux-kernel mailing list. It has been particularly useful for triggering regressions caused by changes to the cpufreq code, mainly because I test with wakeup times that are a multiple of 32ms — the Linux scheduling period. Using multiples of the scheduling period allows the CPU frequency to be reduced before the next wakeup, and thus provides an understanding of the effects of frequency selection on scheduler wakeup latencies. This turns out to be important when validating performance because many enterprise distributions ship with the intel_pstate driver enabled and the default governor set to "powersave".
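
The basic idea (ask to sleep for one 32ms scheduling period, then see how late the wakeup actually arrived) can be sketched in a few lines of Python; this is a rough approximation of the approach, not adrestia's actual code, and it measures timer plus scheduler latency together:

    import math
    import time

    def wakeup_latencies(n=1000, period=0.032):
        """Sleep for one scheduling period and record how late we woke up."""
        lat = []
        for _ in range(n):
            start = time.perf_counter()
            time.sleep(period)
            # Anything beyond the requested sleep time is wakeup latency.
            lat.append(time.perf_counter() - start - period)
        return sorted(lat)

    lat = wakeup_latencies()
    p95 = lat[math.ceil(0.95 * len(lat)) - 1]
    print('95th percentile wakeup latency: %.0f us' % (p95 * 1e6))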

Rt-app

Rt-app is a highly configurable real-time workload simulator that accepts a JSON grammar for describing task execution and periodicity. It was originally created by Giacomo Bagnoli as part of his master's thesis so that he could create background tasks to induce scheduler latency and test his Linux kernel changes for low-latency audio. Juri Lelli started working on it around 2010 when he began his efforts on the deadline scheduler project, again, for his master's thesis [PDF]. Lelli said (in a private email) that he used rt-app while writing his thesis because it was the in-house testing solution at RetisLab (Scuola Superiore Sant'Anna University) at the time: "I didn't also know about any other tool that was able to create synthetic sets base on a JSON description".

Today, ARM and Linaro are using rt-app to trigger specific scheduler code paths. It is a flexible tool that can be used to test small scheduling and load-balancing changes; it is also useful for generating end-to-end workload performance and power figures. Because of its flexibility (and expressive JSON grammar) it is heavily used to model workloads when they are impractical to run directly, such as Android benchmarks on mainline Linux. "You want to use it to abstract complexity and test for regressions across different platforms/os stacks/back/forward ports", said Lelli.

Lelli himself uses it primarily for handling bug reports because he can model problematic workloads without having to run the actual application stack. He also uses it for regression testing; the rt-app source repository has amassed a large collection of configurations for workloads that have caused regressions in the past. Many developers run rt-app indirectly via ARM's LISA framework, since LISA further abstracts the creation of rt-app configuration files and also includes libraries to post-process the rt-app trace data.

If modeling of complex workloads is needed when testing scheduler changes, rt-app appears to be the obvious choice. "It's useful to model (almost) any sort of real-world application without coding it from scratch - you just need to be fluent with its own JSON grammar… I'm actually relatively confident that for example it shouldn't be too difficult to create {hackbench,cyclictest,etc.}-like type of workloads with rt-app".

In closing

Benchmarks offer benefits that no other tool can; they can help developers communicate the important bits of a workload by paring it back to its core, making it simple to reproduce reported performance issues, and ensuring that performance doesn't regress. Yet a large number of performance-improving kernel patches contain no benchmark numbers at all. That's slowly starting to change for the scheduler subsystem with the help of the benchmarks mentioned above. But if you can't find a benchmark that represents your workload, maybe it's time to write your own, and finally scratch that itch.

Comments (1 posted)

Alioth moving toward pagure

June 14, 2017

This article was contributed by Antoine Beaupré

Since 2003, the Debian project has been running a server called Alioth to host source code version control systems. The server runs the Debian LTS release (Wheezy), which will reach its end of life next year; that deadline raised some questions regarding the plans for the server over the coming years. Naturally, that led to a discussion regarding possible replacements.

In response, the current Alioth maintainer, Alexander Wirt, announced a sprint to migrate to pagure, a free-software "Git-centered forge" written in Python for the Fedora project, which LWN covered last year. Alioth currently runs FusionForge, previously known as GForge, the free-software fork of the SourceForge code base created when that service closed its source in 2001. Alioth hosts source code repositories, mainly Git and Subversion (SVN), and, like other "forge" sites, also offers forums, issue trackers, and mailing list services. While other alternatives are still being evaluated, a consensus has emerged on a migration plan from FusionForge to a more modern and minimal platform based on pagure.

Why not GitLab?

While this may come as a surprise to some who would expect Debian to use the more popular GitLab project, the discussion and decision actually took place a while back. During a lengthy debate last year, Debian contributors discussed the relative merits of different code-hosting platforms, following the initiative of Debian Developer "Pirate" Praveen Arimbrathodiyil to package GitLab for Debian. At that time, Praveen also got a public GitLab instance running for Debian (gitlab.debian.net), which was sponsored by GitLab B.V. — the commercial entity behind the GitLab project. The sponsorship was originally offered in 2015 by the GitLab CEO, presumably to counter a possible move to GitHub, as there was a discussion about creating a GitHub Organization for Debian at the time. The deployment of a Debian-specific GitLab instance then raised the question of the overlap with the already existing git.debian.org service, which is backed by Alioth's FusionForge deployment. It then seemed natural that the new GitLab instance would replace Alioth.

But when Praveen directly proposed to move to GitLab, Wirt stepped in and explained that a migration plan was already in progress. The plan then was to migrate to a simpler gitolite-based setup, a decision that was apparently made in corridor discussions surrounding the Alioth Git replacement BoF held during Debconf 2015. The first objection raised by Wirt against GitLab was its "huge number of dependencies". Another issue Wirt identified was the "open core / enterprise model", preferring a "real open source system", an opinion which seems shared by other participants on the mailing list. Wirt backed his concerns with a hypothetical example:

Debian needs feature X but it is already in the enterprise version. We make a patch and, for commercial reasons, it never gets merged (they already sell it in the enterprise version). Which means we will have to fork the software and keep those patches forever. Been there done that. For me, that isn't acceptable.

This concern was further deepened when GitLab's Director of Strategic Partnerships, Eliran Mesika, explained the company's stewardship policy, which describes how GitLab decides which features end up in the proprietary version. Praveen pointed out that:

[...] basically it boils down to features that they consider important for organizations with less than 100 developers may get accepted. I see that as a red flag for a big community like debian.

Since there are over 600 Debian Developers, the community seems to fall within the needs of "enterprise" users. The features the Debian community may need are, by definition, appropriate only to the "Enterprise Edition" (GitLab EE), the non-free version, and are therefore unlikely to end up in the "Community Edition" (GitLab CE), the free-software version.

Interestingly, Mesika asked for clarification on which features were missing, explaining that GitLab is actually open to adding features to GitLab CE. The response from Debian Developer Holger Levsen was categorical: "It's not about a specific patch. Free GitLab and we can talk again." But beyond the practical and ethical concerns, some specific features Debian needs are currently only in GitLab EE. For example, debian.org systems use LDAP for authentication, which would obviously be useful in a GitLab deployment; GitLab CE supports basic LDAP authentication, but advanced features, like group or SSH-key synchronization, are only available in GitLab EE.

Wirt also expressed concern about the Contributor License Agreement that GitLab B.V. requires contributors to sign when they send patches, which forces users to allow the release of their code under a non-free license.

The debate then went through an exhaustive inventory of different free-software alternatives:

  • GitLab, a Ruby-based GitHub replacement, dual-licensed MIT/Commercial
  • Gogs, Go, MIT
  • Gitblit, Java, Apache-licensed
  • Kallithea, in Python, also supports Mercurial, GPLv3
  • and finally, pagure, also written in Python, GPLv2

A feature comparison between each project was created in the Debian wiki as well. In the end, however, Praveen gave up on replacing Alioth with GitLab because of the controversy and moved on to support the pagure migration, which resolved the discussion in July 2016.

More recently, Wirt admitted in an IRC conversation that "on the technical side I like GitLab a lot more than pagure" and that "as a user, GitLab is much nicer than pagure and it has those nice CI [continuous integration] features". However, as he explained in his blog "GitLab is Opencore, [and] that it is not entirely opensource. I don't think we should use software licensed under such a model for one of our core services" which leaves pagure as the only stable candidate. Other candidates were excluded on technical grounds, according to Wirt: Gogs "doesn't scale well" and a quick security check didn't yield satisfactory results; "Gitblit is Java" and Kallithea doesn't have support for accessing repositories over SSH (although there is a pending pull request to add the feature).

In an email interview, Sid Sijbrandij, CEO of GitLab, did say that "we want to make sure that our open source edition can be used by open source projects". He gave examples of features liberated following requests by the community, such as branded login pages for the VLC project and GitLab Pages after popular demand. He stressed that "There are no artificial limits in our open source edition and some organizations use it with more than 20.000 users." So if the concern of the Debian community is that features may be missing from GitLab CE, there is definitely an opening from GitLab to add those features. If, however, the concern is purely ethical, it's hard to see how an agreement could be reached. As Sijbrandij put it:

On the mailinglist it seemed that some Debian maintainers do not agree with our open core business model and demand that there is no proprietary version. We respect that position but we don't think we can compete with the purely proprietary software like GitHub with this model.

Working toward a pagure migration

The issue of Alioth maintenance came up again last month when Boyuan Yang asked what would happen to Alioth when support for Debian LTS (Wheezy) ends next year. Wirt brought up the pagure migration proposal and the community tried to make a plan for the migration.

One of the issues raised was the question of the non-Git repositories hosted on Alioth, as pagure, like GitLab, only supports Git. Indeed, Ben Hutchings calculated that while 90% (~19,000) of the repositories currently on Alioth are Git, there are 2,400 SVN repositories and a handful of Mercurial, Bazaar (bzr), Darcs, Arch, and even CVS repositories. As part of an informal survey, however, most packaging teams explained they either had already migrated away from SVN to Git or were in the process of doing so. The largest CVS user, the web site team, also explained it was progressively migrating to Git. Mattia Rizzolo then proposed that older repository services like SVN could continue running even if FusionForge goes down, as FusionForge is, after all, just a web interface to manage those back-end services. Repository creation would be disabled, but older repositories would stay operational until they migrate to Git. This would, effectively, mean the end of non-Git repository support for new projects in the Debian community, at least officially.

Another issue is the creation of a Debian package for pagure. Ironically, while Praveen and other Debian maintainers have been working for 5 years to package GitLab for Debian, pagure isn't packaged yet. Antonio Terceiro, another Debian Developer, explained this isn't actually a large problem for debian.org services: "note that DSA [Debian System Administrator team] does not need/want the service software itself packaged, only its dependencies". Indeed, for Debian-specific code bases like ci.debian.net or tracker.debian.org, it may not make sense to have the overhead of maintaining Debian packages since those tools have limited use outside of the Debian project directly. While Debian derivatives and other distributions could reuse them, what usually happens is that other distributions roll their own software, like Ubuntu did with the Launchpad project. Still, Paul Wise, a member of the DSA team, reasoned that it was better, in the long term, to have Debian packages for debian.org services:

Personally I'm leaning towards the feeling that all configuration, code and dependencies for Debian services should be packaged and subjected to the usual Debian QA activities but I acknowledge that the current archive setup (testing migration plus backporting etc) doesn't necessarily make this easy.

Wise did say that "DSA doesn't have any hard rules/policy written down, just evaluation on a case-by-case basis" which probably means that pagure packaging will not be a blocker for deployment.

The last pending issue is the question of the mailing lists hosted on Alioth, as pagure doesn't offer mailing list management (nor does GitLab). In fact, there are three different mailing list services for the Debian project.

Wirt, with his "list-master hat" on, explained that the main mailing list service is "not really suited as a self-service" and expressed concern at the idea of migrating the large number of mailing lists hosted on Alioth. Indeed, there are around 1,400 lists on Alioth while the main service has a set of 300 lists selected by the list masters. No solution for those mailing lists was found at the time of this writing.

In the end, it seems like the Debian project has chosen pagure, the simpler, less featureful, but also less controversial, solution and will use the same hosting software as their fellow Linux distribution, Fedora. Wirt is also considering using FreeIPA for account management on top of pagure. The plan is to migrate away from FusionForge one bit at a time, and pagure is the solution for the first step: the Git repositories. Lists, other repositories, and additional features of FusionForge will be dealt with later on, but Wirt expects a plan to come out of the upcoming sprint.

It will also be interesting to see how the interoperability promises of pagure will play out in the Debian world. Even though the federation features of pagure are still at the early stages, one can already clone issues and pull requests as Git repositories, which allows for a crude federation mechanism.

In any case, given the long history and the wide variety of workflows in the Debian project, it is unlikely that a single tool will solve all problems. Alioth itself has significant overlap with other Debian services; not only does it handle mailing lists and forums, but it also has its own issue tracker that overlaps with the Debian bug tracking system (BTS). This is just the way things are in Debian: it is an old project with lots of moving parts. As Jonathan Dowland put it: "The nature of the project is loosely-coupled, some redundancy, lots of legacy cruft, and sadly more than one way to do it."

Hopefully, pagure will not become part of that "legacy redundant cruft". But at this point, the focus is on keeping the services running in a simpler, more maintainable way. The discussions between Debian and GitLab are still going on as we speak, but given how controversial the "open core" model used by GitLab is for the Debian community, pagure does seem like a more logical alternative.

Comments (36 posted)

A beta for PostgreSQL 10

June 9, 2017

This article was contributed by Josh Berkus

PostgreSQL version 10 had its first beta release on May 18, just in time for the annual PGCon developer conference. The latest annual release comes with a host of major features, including new versions of replication and partitioning, and enhanced parallel query. Version 10 includes 451 commits, nearly half a million lines of code and documentation, and over 150 new or changed features since version 9.6. The PostgreSQL community will find a lot to get excited about in this release, as the project has delivered a long list of enhancements to existing functionality. There are also a few features aimed at fulfilling new use cases, particularly in the "big data" industry sector.

Built-in logical replication

The built-in, single-master replication that has shipped with PostgreSQL since version 9.0 is known as "binary replication." This means that it replicates 8KB data pages, not logical database objects like tables or rows. This has a number of advantages, the biggest being easy administration. However, it has a few major disadvantages: you can never replicate less than your entire server instance, no writes of any kind are permitted on the replica, and you can't replicate between different PostgreSQL versions.

In contrast, logical replication does allow replicating individual tables, or between versions, because it replicates tables and rows. For several years, there have been a number of third-party projects for logical replication, including Slony-I, Londiste, and Bucardo. These systems are more involved to administer and have lower performance than many users need, though, so several developers have been working on a built-in logical replication system. This work has been led by developers working for the European consulting company 2nd Quadrant.

The new replication is designed to operate concurrently with, and be as similar as possible to, the existing binary replication. As such, it also uses a transaction log stream as the data transport, and requires features like replication slots that are already familiar to PostgreSQL database administrators. The primary new concept is the "pub/sub" (publication and subscription) model for tables or groups of tables. Each node in a cluster can be a publisher or a subscriber for each specific set of tables.

Suppose I decide I want to replicate just the fines and loans tables from my public library database to the billing system so that it can process amounts owed. I would create a publication from those two tables with one command. This command includes the word "ONLY" because we only want to replicate the tables named, not other tables linked to them:

    libdata=# CREATE PUBLICATION financials FOR TABLE ONLY loans, ONLY fines;
    CREATE PUBLICATION

Then, in the billing database, I would create two tables that looked identical to the tables I'm replicating, and have the same names. They can have additional columns and a few other differences. Particularly, since I'm not copying the patrons or books tables, I'll want to drop some foreign keys that the origin database has. I also need to create any special data types or other database artifacts required for those tables. Often the easiest way to do this is selective use of the pg_dump and pg_restore backup utilities:

    origin# pg_dump libdata -Fc -f /netshare/libdata.dump

    replica# pg_restore -d libdata -s -t loans -t fines /netshare/libdata.dump

Following that, I can start a subscription to those two tables:

    libdata=# CREATE SUBSCRIPTION financials
                  CONNECTION 'dbname=libdata user=postgres host=origin.example.com'
                  PUBLICATION financials;
    NOTICE:  synchronized table states
    NOTICE:  created replication slot "financials" on publisher
    CREATE SUBSCRIPTION

This will first copy a snapshot of the data currently in the tables, and then start catching up from the transaction log. Once it's caught up, you can check status in pg_stat_subscription:

    libdata=# select * from pg_stat_subscription;
    -[ RECORD 1 ]---------+---------------------
    subid                 | 16475
    subname               | financials
    pid                   | 167
    relid                 |
    received_lsn          | 0/1FBEAF0
    last_msg_send_time    | 2017-06-07 00:59:44
    last_msg_receipt_time | 2017-06-07 00:59:44
    latest_end_lsn        | 0/1FBEAF0
    latest_end_time       | 2017-06-07 00:59:44

Possibly the biggest benefit to this new logical replication is that it will make "upgrade by replication" feasible for a lot more users. Databases that aren't permitted to be down will be able to upgrade from PostgreSQL 10 to 11 using logical replication to a new cluster and failing over.

The other major replication feature involves PostgreSQL's synchronous replication. It now supports quorum commit, allowing administrators to define a specific number of nodes in a pool of replicas that must receive a transaction for it to be complete. This will allow PostgreSQL to be used in a similar way to several scalable, consensus-based database systems.

Native partitioning

The second aspect of PostgreSQL to get a major ease-of-use upgrade is table partitioning. Partitioning is used to split up a huge table into multiple sub-tables, improving maintenance, backup, and the performance of some queries. While PostgreSQL has had partitioning since version 8.0, the existing implementation has been difficult for administrators, requiring multiple steps and a lot of unintuitive SQL syntax, or installing PostgreSQL extensions.

The new partitioning feature uses a simple declarative syntax, as well as enforcing some sensible defaults. It currently supports both "list" and "range" partitioning. List partitioning is where each item in a list of potential values (such as weekdays or Australian territories) receives a partition and data related to that value goes in that partition. Range partitioning is when each partition holds a defined upper and lower limit, as one would use for dates or numeric values. For example, we might decide to partition the book_history table; that's probably a good idea since that table is liable to accumulate data forever. Since the table is essentially a log file, we'll range partition it, with one partition per month.

First, we create a "master" partition table, which will hold no data but forms a template for the rest of the partitions:

    libdata=# CREATE TABLE book_history (
                  book_id INTEGER NOT NULL,
                  status BOOK_STATUS NOT NULL,
                  period TSTZRANGE NOT NULL )
              PARTITION BY RANGE ( lower (period) );

Then we create several partitions, one per month:

    libdata=# CREATE TABLE book_history_2016_09
              PARTITION OF book_history
              FOR VALUES FROM ('2016-09-01 00:00:00') TO
	                      ('2016-10-01 00:00:00');
    CREATE TABLE
    libdata=# CREATE TABLE book_history_2016_08
              PARTITION OF book_history
              FOR VALUES FROM ('2016-08-01 00:00:00') TO
	                      ('2016-09-01 00:00:00');
    CREATE TABLE
    libdata=# CREATE TABLE book_history_2016_07
              PARTITION OF book_history
              FOR VALUES FROM ('2016-07-01 00:00:00') TO
	                      ('2016-09-01 00:00:00');
    ERROR:  partition "book_history_2016_07" would overlap \
    partition "book_history_2016_08"

As you can see, the system even prevents accidental overlap. New rows will automatically be stored in the correct partition, and SELECT queries will search the appropriate partitions. If we decide to sunset the log after 12 months, we can delete old data by dropping partitions:

    libdata=# DROP TABLE book_history_2016_05;

Since dropping partitions is just a file unlink operation, it is orders of magnitude faster than deleting thousands or millions of rows.

There's still some work to be done on the new partitioning. Contributor Yugo Nagata is already hard at work on a hash partitioning option. Others want to add automatic creation of partition constraints and keys. Eventually, PostgreSQL will also support automated creation of new partitions, and performance improvements in searching partitions. Regardless, the new version of partitioning will be accessible to many users who avoided the complexity of the prior implementation.

For upgrade compatibility, the old form of partitioning will still work in PostgreSQL for the foreseeable future.

More parallel query operations

Parallel query was introduced as a PostgreSQL feature in version 9.6. It allows a single query to make use of multiple processes and cores in order to speed up execution. Implementing parallelism has been a matter of parallelizing one query operation at a time across successive PostgreSQL releases. In this beta, the most common operations in read queries can all be distributed across multiple cores.

In version 9.6, full table scans, aggregates, nested loop joins, and hash joins could be executed in parallel. Version 10 has added three new parallel operations: btree index scans, bitmap scans, and merge joins. With the addition of these query operation types, most read queries can be executed in parallel. This means that users who generally have fewer database connections than cores (something that is common in analytics databases) can count on speeding up much of their database workload through parallelism.

For example, a search of financial transaction history by an indexed column can now be executed in roughly one-quarter of the time by using four parallel workers:

    accounts=# \timing
    Timing is on.
    accounts=# select bid, count(*) from account_history
      where delta > 1000 group by bid;
    ...
    Time: 324.903 ms

    accounts=# set max_parallel_workers_per_gather=4;
    SET
    Time: 0.822 ms
    accounts=# select bid, count(*) from account_history
    where delta > 1000 group by bid;
    ...
    Time: 72.864 ms

The project has added a few new configuration options for resource controls over the number of workers that PostgreSQL uses for various things, including parallel query.

Future work on parallel query is likely to include parallel bulk loading (in a Google Summer of Code project) and parallel utility commands, such as building indexes. The project will also work on parallel scan for other types of indexes, such as geographic GiST indexes and full-text GIN indexes. Parallel execution of write queries has also been discussed, but there are some fundamental technical hurdles to making it work.

JSON full-text search

PostgreSQL is known both for having powerful JSON features for a relational database and very good full-text search for a general-purpose database. In version 10, Dmitry Dolgov decided to combine both of these features in order to make JSON fields fully searchable, both as JSON and as human-readable text. Combined with other features and extensions, this makes PostgreSQL a superior option to dedicated document databases for some JSON use cases.

The new feature works with both text JSON and binary JSONB types. You can index your JSONB field using a full-text index. This involves converting the JSONB field to a tsvector, then creating a language-specific full-text index on it:

    libdata=# CREATE INDEX bookdata_fts ON bookdata
                  USING gin (( to_tsvector('english',bookdata) ));
    CREATE INDEX
(Note that this feature currently has a bug that will be fixed in the next beta.)

Once that's set up, you can do full-text searching against all of the values in your JSON documents:

    libdata=# SELECT bookdata -> 'title'
		  FROM bookdata
		  WHERE to_tsvector('english',bookdata) @@
			to_tsquery('duke');  
    --------------------------------------------------------
     "The Tattooed Duke"
     "She Tempts the Duke"
     "The Duke Is Mine"
     "What I Did For a Duke"

Combined with the JsQuery extension, this provides PostgreSQL with a full set of JSON search tools that rival dedicated JSON databases. Its community expects more JSON applications to switch to PostgreSQL, or to combine PostgreSQL with non-relational databases like MongoDB and Couchbase in hybrid infrastructure.

Other features

PostgreSQL 10 includes new and improved security features, chief among them support for SCRAM authentication. This provides a much more secure password authentication method than the prior MD5 hashing used by the libpq library. Also, contributor Stephen Frost has added restrictive row-level security policies, enabling tighter permissions management of high-security data. Previously, all row-level security policies were permissive, meaning that they could only be used to grant access to additional rows; restrictive policies can now be used to further limit which rows are visible.

There are many other interesting features among the 150 added in this release. XMLTABLE permits manipulating XML data in the database like it was a SQL table. Multi-column correlation statistics introduces a first-in-the-industry method of dealing with a chronic database performance issue: estimating selectivity for conditions on multiple columns. Oracle database administrators will be happy to now find "latch wait time" statistics available in PostgreSQL monitoring. PostgreSQL has also added support for the global standard ICU library for language collations.

With all these major features, though, there's a larger-than-normal number of backward-incompatible changes in this release. The first such change is the version number, which now has just two components instead of three. There are also a number of changes to the client library that could break some drivers. Version 10 also drops support for some antiquated data types: floating-point timestamps and the "tsearch2" full-text search indexes.

Possibly the most disruptive change for database administrators is the global renaming of everything that was called "xlog" to "wal", including directories, filenames, and administrative functions. The two abbreviations — the former standing for "transaction log", and the latter for "write-ahead log" — have been used as somewhat confusing synonyms for some time. The developers made this change in order to deter the common data loss scenario when a user confuses the disposable activity logs with the essential transaction logs.

Final release, version 11, and more

Users can expect to see at least two more beta releases, and possibly more, before a final release toward the end of the year. The target release date is early September, but release dates have slipped several times over the last five years. There are quite a few open issues against the current beta, including some pieces of user-visible behavior that may change before release. Mostly, the developers want driver authors to get working on some of the new client features now, particularly supporting SCRAM authentication.

During beta testing, work on PostgreSQL 11 (the next version according to the new numbering scheme) has already started. Features that already have patches include Write Amplification Reduction Method (WARM), designed to combat some of the I/O issues that Uber complained about, Auto Prewarm to speed up database server restarts, cascading logical replication, automatic per-statement savepoints, and support for the recent SQL/JSON standard. More speculatively, there's some hope that some types of multi-master replication will be integrated with the mainline in that release.

Regardless of what's in future versions, PostgreSQL 10 looks like a landmark release for the project. Users will want to test it now because they will likely want to upgrade soon after the 10.0 release.

[Josh Berkus is a contributor to PostgreSQL. He works for Red Hat.]

Comments (none posted)

Page editor: Jonathan Corbet


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds