Leading items

Grammar and style-checking tools for Emacs

By Nathan Willis
June 22, 2016

Grammar be hard. Both for human beings and for software programs. These days, writers who use free software generally have their choice of reliable utilities for catching spelling mistakes, regardless of what editors or word processors they use. The outlook for grammar-and-style checking is not nearly as rosy. I recently explored the options available for Emacs, and was underwhelmed with the status quo.

But the limited options available testify primarily to the difficulty of the problem, rather than indicting the development community. Natural-language processing is at the heart of grammar-checking, and there are few relevant projects available to the public that offer much of a general solution. Those that do exist (and have free-software compatible licenses) tend to come from academia. As a result, users can choose either lightweight programs that offer only a limited set of simple grammar checks, or more complete grammar-checkers that can involve awkward glue code to hook Emacs into an external service.

Perhaps it goes without saying, but I limited my research to grammar utilities that support my native language, English. Yet, as far as I can discern, the situation is not dramatically better for any other languages—in fact, once one ventures too far outside of the European languages, the situation seems to be much worse on a practical level. The theoretical problems abound, and one is at the mercy of whoever has the funds to support the necessary research. I also limited this search to tools with Emacs integration, but a bit of looking suggests the number and variety of solutions available for Vim and other editors is similar.

Limited tools, unlimited problem space

Among the first discoveries one makes when reading about grammar checking is that there is a wide range of errors that someone might consider a grammatical mistake. The simplest are obvious syntactic errors, like repeated words—and there are, indeed, quite a few options available to catch duplicates. Simple string matching will catch most of these, although false positives are possible. Programs can also easily check documents against a blacklist, so commonly misused patterns (such as "rather then") can be highlighted. Only slightly more complicated to catch are grammatical constructions like the passive voice; regular expressions can match the most common forms of verbs as they are used in the passive voice (such as "are used").
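To make those three kinds of check concrete, here is a minimal Python sketch. It is not taken from any of the tools discussed in this article; the blacklist entries, the list of "to be" forms, and the function name check() are all assumptions chosen for illustration.

```python
import re

# Hypothetical blacklist; the entries are illustrative only.
BLACKLIST = ["rather then", "could of"]

# A word, some whitespace, and then the same word again (case-insensitive).
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

# A form of "to be" followed by a word ending in "ed": a rough passive-voice
# test that misses irregular participles such as "written".
PASSIVE = re.compile(
    r"\b(?:am|is|are|was|were|be|been|being)\s+\w+ed\b", re.IGNORECASE
)

def check(text):
    """Return a list of (kind, matched_text) pairs found in text."""
    findings = []
    for match in REPEATED_WORD.finditer(text):
        findings.append(("repeat", match.group(0)))
    lowered = text.lower()
    for phrase in BLACKLIST:
        if phrase in lowered:
            findings.append(("blacklist", phrase))
    for match in PASSIVE.finditer(text):
        findings.append(("passive", match.group(0)))
    return findings
```

Calling check("I would rather then go to the the store.") reports both the repeated "the the" and the blacklisted phrase. A legitimate repetition such as "that that" would be flagged too, which is the kind of false positive mentioned above.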

But not everyone agrees on whether or not many such stylistic rules are genuinely grammatical rules. It is common for textbooks and schools to teach students to avoid slang, contractions, and the like (especially in "formal" writing), but those are conventions largely about what is appropriate, not what is correct. Detecting genuine grammar mistakes like subject/verb disagreement, misplaced commas, or dangling participles is apparently far more difficult.

Consequently, there are multiple options available to tackle the syntactic issues that can be dealt with through regular expressions and simple blacklists. But the use of a single blacklist for grammatical mistakes and for words that are undesirable for stylistic reasons (for example, words that are regarded as imprecise, like "some," or that add no information, like "very") muddles the picture. Fortunately, some of these tools are flexible enough that users can adapt them to issue warnings about their particular set of concerns.

Duplicate words

At the simple end of the offerings is the dupwords.el package written in the 1990s by Stephen Eglen. Naive double-word detection is almost trivial; some existing spell-checkers for Emacs already perform the function. Eglen's script improves matters by being able to detect repeats that are separated by a user-configurable threshold of other words. Setting the variable dw-forward-words changes this threshold; the default is one (which catches adjacent duplicates only). Setting it to a negative value will catch duplicates anywhere within the same sentence.

Eglen's script is sentence-oriented; it will not catch situations where the same word ends one sentence and starts the next (for that, there are other solutions to be found with a bit of searching, such as this function by Matthew Morley). The script must be called explicitly; M-x dw-check-to-end will check from the cursor point to the end of the active buffer.
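The idea of a per-sentence search with a configurable distance can be sketched in a few lines of Python. This is not a port of dupwords.el: the sentence splitting, the word pattern, and the names repeated_within() and window are assumptions, and only the convention that a negative threshold means "anywhere in the sentence" is borrowed from the description above.

```python
import re

def repeated_within(text, window=1):
    """Return (word, sentence_index) for each word that recurs nearby.

    A word is reported when it appears again among the next `window` words
    of the same sentence. A negative window means "anywhere later in the
    same sentence".
    """
    hits = []
    # Naive sentence split on ., !, or ? (an assumption for this sketch).
    sentences = re.split(r"[.!?]+", text)
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(words):
            end = len(words) if window < 0 else i + 1 + window
            if word in words[i + 1:end]:
                hits.append((word, index))
    return hits
```

Because each sentence is examined separately, a text like "I like tea. Tea is good." produces no hits, which mirrors the sentence-boundary limitation described above.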

Diction

A step up from dupwords.el is the diction.el package by Sven Utke, which depends on the operating system's GNU diction package. While perhaps not terribly well-known, diction is a classic UNIX text-processing utility. It can find duplicate words as well as match problematic words from the program's rules database. The default databases are stored in /usr/share/diction/, and currently cover English, German, Dutch, and C. Each entry can include a recommended substitute or a brief explanation of why the word in question is frowned upon.

The English database focuses on unnecessarily verbose language, such as recommending that "along the lines of" be replaced with "like," and on pointing out the distinctions between often confused pairs of words (such as proceed and precede). Many of the recommendations are drawn from Strunk and White's The Elements of Style, which is a classic manual on writing style. But the book has its share of critics, who contend that it contains lots of "rules" that are little more than opinion on whether or not certain words and phrases are "inelegant" or overused.

For Emacs users, GNU diction is likely to highlight an excessive number of words, many of which are hits on Strunk and White's stylistic recommendations—at least, that is the case when using the built-in diction database. But it is possible to create a custom database that is more useful for a particular user or writing project. The diction.el script contains some logic to automatically deduce the correct database to use based on the ispell dictionary in use in the active buffer; to point the script to a different database, this value needs to be overwritten using the command:

    M-x set-variable RET diction-ruleset RET "databasename"

Like the previous tool, diction.el must be invoked by the user. Calling M-x diction-buffer will scan the current Emacs buffer. The diction-ruleset variable is per-buffer, so users who wish to use different custom databases for different files will either need to set the variable separately or add the command to the relevant mode hooks for each file type.

Write good

Benjamin Beckwith's writegood-mode uses a similar approach to diction.el, but it relies on a custom blacklist that covers slightly different ground. It matches three classes of error: duplicate words, passive voice constructions, and "weasel words," a term more-or-less synonymous with "stylistic problems" as listed in the GNU diction database.

The writegood-mode blacklist, however, is adapted from a set of shell scripts by Matt Might at the University of Utah. Might's list was assembled from years of reading student papers; it breaks "weasel words" into three categories:

Salt and pepper words that "look and feel like technical words, but convey nothing." Examples include "various," "fairly," and "a number of."
Beholder words that tell the reader how to react, such as "interestingly," "clearly," or "surprisingly."
Adverbs, which Might says should be removed from all "technical" writing.

Writegood-mode's list of weasel words is editable; one only needs to add a string to the writegood-weasel-words list. But, notably, the list consists of string literals, not regular expressions; if one decides to supplement it in bulk or to add a lot of variations, it could grow unwieldy.

On the plus side, writegood-mode is an Emacs minor mode, which is a class of feature commonly used to perform on-the-fly syntax highlighting and indentation. Thus, when activated, writegood-mode highlights all of the matching words in the current buffer as one continues to work on the document. That is more convenient than periodically stopping to re-run a command, and users can selectively enable the mode based on the type of document (in addition to enabling it manually). In addition, using syntax highlighting makes it simple for the user to ignore false positives, whereas using a function that steps through each flagged word sequentially can quickly become an interminable chore.

Art Bollocks

Another minor-mode option worth considering is artbollocks-mode, which was originally written by Rob Myers and was later revived by Sacha Chua. The name, incidentally, is a reference to a famous article criticizing postmodern art, which contended that postmodernism is more of a linguistic argument about art than it is an approach to creativity itself.

In a sense, the original Art Bollocks was an attack on weasel words, and that is what artbollocks-mode focuses on as well. It includes checks to highlight passive-voice constructions, "jargon" words, duplicated words, and a set of weasel words that covers the same general categories described by Might. In addition, each of these checks can be enabled or disabled individually, and there are commands available to compute some statistics about the active buffer (such as its Flesch-Kincaid readability score).
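For readers unfamiliar with the readability statistics, the two standard Flesch formulas are easy to sketch in Python. The code below is not how artbollocks-mode computes its numbers; the syllable counter in particular is a crude assumption (it counts groups of vowels), so the results are only approximate.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def _counts(text):
    """Return (sentences, words, syllables) using naive tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(text):
    """Higher scores indicate text that is easier to read."""
    sentences, words, syllables = _counts(text)
    if not sentences or not words:
        return 0.0
    return 206.835 - 1.015 * words / sentences - 84.6 * syllables / words

def flesch_kincaid_grade(text):
    """Approximate US school grade level needed to read the text."""
    sentences, words, syllables = _counts(text)
    if not sentences or not words:
        return 0.0
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59
```

Short sentences built from short words score a low grade level, while a sentence like "Postmodernism interrogates institutional representation." scores a very high one, which is roughly the effect the jargon checks are meant to expose.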

Writegood-mode is newer, but artbollocks-mode includes a larger list of weasel and jargon words—although, it should be pointed out, some of those words originate from art criticism and may not be useful in other disciplines. The distinction between weasel words and jargon could be useful for anyone hoping to tailor artbollocks-mode to their own writing; the different categories are highlighted in different colors.

As far as modifications go, artbollocks-mode is not as simple to update as writegood-mode. Rather than a list of literal strings to search for, artbollocks-mode uses a single regular expression for each of its checks, and those regular expressions are optimized with Emacs's regexp-opt function. On the plus side, this results in faster string matching, but it also requires the user to regenerate the optimized regular expressions, out of band, in order to update the mode.
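The trade-off between a list of literals and one combined regular expression can be illustrated outside of Emacs. The Python sketch below joins a word list into a single alternation; it does not perform the prefix-sharing optimization that an optimizing function would, and the word list and function names are assumptions made for the example.

```python
import re

# Illustrative entries drawn from the categories quoted earlier;
# this is not either mode's real list.
WEASEL_WORDS = ["various", "fairly", "a number of", "clearly"]

def compile_word_list(words):
    """Build one case-insensitive regex matching any of the literal strings."""
    # Longest entries first, so multi-word phrases are tried before
    # shorter alternatives.
    ordered = sorted(words, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

WEASEL_RE = compile_word_list(WEASEL_WORDS)

def find_weasels(text):
    """Return the matched words, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in WEASEL_RE.finditer(text)]
```

Keeping the editable list and generating the pattern from it at load time gives the convenience of the literal list with a single matching pass, at the cost of recompiling whenever the list changes.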

Style versus grammar

The utilities examined above focus on writing style, rather than fundamental English grammar. But, in a lot of the online debates, mailing-list threads, and Stack Overflow answers that I examined when looking for Emacs grammar-checking tools, users were interested in stylistic issues. After all, the underlying concern is users wanting their writing to be clear; whether the problem at hand is a vague adverb or a split infinitive, the user wants it fixed.

So the style-oriented tools clearly have their place, and many writers seem to find them useful. Nevertheless, many of the same writers probably have "real" grammar-checking in mind when they first go looking for such an Emacs utility. Next time, we'll take a look at the tools available for assessing grammatical correctness from Emacs. All of the tools involve linking to external processes or even remote servers, which raises its own set of hurdles for those intent on working with a purely free-software solution.


Twisted in an asyncio world

By Jake Edge
June 22, 2016

PyCon 2016

At PyCon 2016, Amber Brown gave a presentation on the advent of the asyncio module for handling asynchronous I/O in Python 3 and what that means for the Twisted event-driven networking framework. There is some thinking that asyncio "kills" Twisted, but that's not how she sees things. Brown is a core Twisted developer and the release manager for the project. Over the last year or so, she has ported 40,000 lines of Twisted code to Python 3. She has also ported Autobahn|Python and Crossbar.io to Python 3 as part of her day job working on Crossbar.io.

The inspiration behind the talk came from two places. Russell Keith-Magee asked her at one point why Twisted was still relevant now that asyncio had been added to Python. In addition, Twisted's lead architect Glyph Lefkowitz posted that the "report of Twisted's death was an exaggeration" to his blog in May 2014. She believes that she is in a unique position to explain what asyncio means for Twisted and what the future holds, thus the talk.

The basic problem that Twisted addresses is handling multiple concurrent I/O operations, generally network I/O. The way that web frameworks (e.g. Django) typically do that is with multiple "runners" to handle requests. These runners are either threads or processes.

But neither threads nor processes will help with the C10k problem—handling 10,000 concurrent connections. Threads are "hard to get right" and have high overhead. A 128KB stack per thread means that 10,000 connections require 1.3GB just for the stacks. Beyond that, the Python global interpreter lock (GIL) means there will be no parallelism anyway. Furthermore, "you won't do threads properly"—she suggested posting that statement "above your computer".
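The stack figure is simple arithmetic, shown here in Python. The only assumption is that "128KB" means 128 × 1024 bytes and that "GB" means 10^9 bytes.

```python
STACK_BYTES = 128 * 1024   # 128KB of stack per thread (assumed binary KB)
CONNECTIONS = 10_000       # one thread per connection

total_bytes = STACK_BYTES * CONNECTIONS
total_gb = total_bytes / 1e9   # decimal gigabytes

print(f"{total_gb:.2f} GB of stack for {CONNECTIONS} threads")
```

That works out to about 1.3GB, before counting any memory the threads actually use for their work.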

The only good way to handle that many connections in Python is by using non-threaded asynchronous I/O. Twisted is one of the first Python asynchronous I/O frameworks, going back to 2001, while asyncio is much newer. But they are identical at the core, she said.

In general, asynchronous I/O uses select() and friends to wait for a list of file descriptors, which can be sockets, files, or other events, to become ready for read or write. When the call returns, it indicates which of the descriptors is ready. Those calls allow programs to handle "thousands and thousands of concurrent connections", she said.
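A minimal sketch of that pattern, using Python's standard selectors module (a wrapper around select() and its relatives), might look like the following. It was not shown in the talk; the function name echo_once() and the use of local socket pairs in place of real network connections are assumptions made to keep the example self-contained.

```python
import selectors
import socket

def echo_once(messages):
    """Send each message down its own socket pair and read them all back
    from a single selector loop, without threads."""
    sel = selectors.DefaultSelector()
    pairs = []
    for index, message in enumerate(messages):
        writer, reader = socket.socketpair()
        reader.setblocking(False)
        # Ask to be told when this socket has data waiting to be read.
        sel.register(reader, selectors.EVENT_READ, data=index)
        writer.sendall(message)
        pairs.append((writer, reader))

    received = {}
    while len(received) < len(messages):
        ready = sel.select(timeout=1)
        if not ready:
            break  # nothing became readable; give up rather than spin
        for key, _events in ready:
            # select() reported this descriptor ready, so recv() will not block.
            received[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)

    for writer, reader in pairs:
        writer.close()
        reader.close()
    sel.close()
    return [received.get(i) for i in range(len(messages))]
```

One loop services every descriptor, and it only touches those that the operating system has reported as ready.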

To demonstrate that, she ran a live demo on her Mac laptop. Using Twisted running under PyPy, she ran a client and server that made over 10,000 concurrent connections sending ping messages back and forth. Handling more than 10,000 pings per second on consumer hardware shows what asynchronous I/O can do, she said. That's probably more concurrent connections "than your site needs".

Twisted, asyncio, and others rely on "selector loops" so that they do not block. Data is queued to be sent when the network is ready and reads are only done when it is already known that there is data available to be read. These selector loops, also called "I/O loops" or "reactors", allow a higher density per CPU core, without threads. There is no parallelism, but there is concurrency: "you are still handling one thing at a time, but you are a bit smarter about what one thing you are handling when".

This works well when there is high I/O throughput, high-latency clients such as mobile phones, and low CPU processing needed for each request. Calculating pi to a million digits for each connection is not going to work well, but in most cases, the program is waiting for the client or for the database.

Asynchronous I/O frameworks generally provide users with an object that is a stand-in for a pending result. Twisted uses Deferred objects to do so, while asyncio uses Future objects. They are similar, though a Deferred will run its callback as soon as possible, while a Future will schedule it for the next reactor loop.
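The asyncio half of that comparison can be observed directly. In this sketch (not from the talk; the names are illustrative), a callback attached to a Future does not run at the moment the result is set, but only once control returns to the event loop. The Twisted side is not shown, to avoid depending on a third-party package.

```python
import asyncio

async def demo():
    order = []
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    future.add_done_callback(lambda f: order.append("callback ran"))

    future.set_result(42)             # the callback is only scheduled here
    order.append("after set_result")  # so this line is recorded first

    await asyncio.sleep(0)            # yield to the event loop once
    return order, future.result()

order, value = asyncio.run(demo())
print(order, value)
```

The recorded order shows "after set_result" before "callback ran", matching the description of a Future deferring its callbacks to the loop.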

In 2012, the asynchronous I/O situation in Python 3 was "a mess". Twisted was not available, but Node.js was exploding in popularity and .NET had recently added async/await for asynchronous I/O support. Python 3 needed a "killer feature", she said. Enter asyncio. It was designed with coroutines in mind. Coroutines in Python are a special type of Generator. Python 3.5 added async and await to make Future objects act like coroutines.

The asyncio module will help reduce the library API fragmentation that has occurred over time and will also reduce duplication. Other frameworks, such as Twisted, Tornado, and gevent, will be able to adapt their event loops to fit into the asyncio model. None of those will have to duplicate what is already available in the language. She quoted extensively from the "Interoperability" section of PEP 3156, which is the basis of asyncio, in her slides [Speaker Deck].

So that leads to a question, she said: "Doesn't asyncio replace Twisted?". They are both cooperative, single-threaded frameworks with primitives to support asynchronous programming. They use the same system calls and their I/O loops are architecturally similar. The asyncio transports and protocols were directly inspired by Twisted. Asyncio comes as a standard feature in Python 3.4 and beyond, so perhaps Twisted is not needed any longer?

But Brown begs to differ: "asyncio is an apple, Twisted is a fruit salad". There is a huge amount of code and comments in Twisted, nearly 300,000 lines of code (Python and C with tests), including over 100,000 lines of comments. Asyncio has some 24,000 lines currently. That size difference is not from bloat, she said; there are lots of places where the standard library is deficient in terms of networking protocols and the like, so Twisted has filled in a lot of those gaps. There are many features in Twisted that are not available in asyncio, as well.

Tornado is an asynchronous web server framework that has many similar concepts and constructs to those in Twisted. It has its own I/O loop, though it integrates with either Twisted or asyncio. Ultimately, the project may remove its I/O loop and move to using the asyncio version. Over the years, Tornado has changed to adopt the standard Python mechanisms as they have become available. She wondered if that was a model for Twisted moving forward.

But interoperability turns out to be hard. Asyncio is similar, but not the same, and there is no way to directly map Twisted to asyncio. Her focus is on getting async and await working with Twisted. await gets the result of a coroutine, but without blocking waiting for the result. It allows writing asynchronous code in a synchronous style. Since coroutines are a special form of Generator, the "trampoline" that will turn a Deferred into a Generator, which has been in Twisted since 2006, can be used to make that work.
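As an illustration of what "asynchronous code in a synchronous style" means, here is a small asyncio-only sketch using Python 3.5+ syntax. It is not Twisted code and was not part of the talk; the coroutine names and delays are made up, and asyncio.sleep() stands in for waiting on a network or a database.

```python
import asyncio

async def fetch(name, delay, log):
    """Pretend to wait on I/O, then return a result."""
    log.append(f"start {name}")
    await asyncio.sleep(delay)   # suspends this coroutine, not the whole program
    log.append(f"done {name}")
    return name.upper()

async def main():
    log = []
    # Both coroutines run concurrently on one thread.
    results = await asyncio.gather(
        fetch("a", 0.05, log),
        fetch("b", 0.01, log),
    )
    return results, log

results, log = asyncio.run(main())
print(results, log)
```

Each coroutine reads top to bottom like blocking code, yet "b" finishes while "a" is still waiting; the results still come back in the order the coroutines were passed to gather().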

Two features are coming soon that will help with interoperability. The @deferredCoroutine decorator will allow coroutines to be wrapped in a Deferred so that await can be used on a Deferred. The second is the asyncioreactor, which is a Twisted reactor built on top of asyncio. The patches for those have not been reviewed yet and require changes to asyncio, so they may still be a ways out.

There are good reasons to continue to use Twisted, Brown said. It is released often, typically three times per year, though 2016 is set to have five. These are time-based releases that come directly from the trunk. Because of its stability, some people actually deploy from the Twisted trunk, though she is "not going to say it's a good idea."

There are a large number of protocols available in Twisted right out of the box. She put up a list of a dozen or so (e.g. HTTP, DNS, IRC, FTP, POP3, IMAP4), all of which can be glued together in various ways. It is also easy to add protocols. Support for HTTP/2 is coming soon.

There are a number of libraries and frameworks that use Twisted under the hood. These include txacme and txsni for supporting automatic certificate renewal of Let's Encrypt certificates, the hendrix web server, and Autobahn|Python for WebSocket handling, which is "really fast under PyPy", she said.

Twisted is a dependable base; "we try not to break your code". It has deprecation cycles that give a year's warning when things are being removed. It undergoes a lot of code review and automated testing, which allows users to "upgrade with impunity". Twisted is also fast, especially when it is run with PyPy.

Beyond that, Twisted officially supports multiple platforms (most major Linux distributions, FreeBSD, Windows, and OS X). That means that all tests must pass on each supported platform before a branch can be merged to the trunk. It runs on Python 2.7 for all of those platforms; it also supports 3.4 and 3.5 (though there are still some protocols and such that need to be ported) on Linux and FreeBSD. There are only a handful of tests that do not pass under PyPy, almost all of which are due to the code making assumptions that it is running on CPython.

Competition is good, she said. The arrival of asyncio helped get Twisted moving to support Python 3 better. Eventually, Twisted will be calling asyncio and vice versa and there will be full interoperability between them. Those wishing to help make that happen should follow the async-sig mailing list.

A YouTube video of the talk is available for those interested in more details.

[ I would like to thank LWN subscribers for supporting my travel to Portland for PyCon. ]


SourceForge eyes a comeback

By Nathan Willis
June 22, 2016

Years ago, SourceForge.net was the premier hosting service for open-source and free-software projects. But, after changing hands several times, the site ran seriously afoul of the development community in 2015; its staff was accused of secretly commandeering inactive project accounts and of replacing project downloads with installers side-loaded with adware or even malware. In early 2016, however, the site changed hands yet again, and its new owners have set out to regain the community's trust.

To recap, SourceForge was launched in 1999 by VA Linux Systems, which was initially a hardware vendor. Over the next few years, the company acquired several other free-software related sites, including Freshmeat, Slashdot, and NewsForge (where I worked for several years). For a while, VA operated SourceForge.net for "community" open-source projects and offered a separate "enterprise" edition to corporate clients.

After some rearranging, the enterprise version of the hosting platform became the primary product and the company became SourceForge, Inc. Eventually, various pieces of the business (including the enterprise edition of the hosting software) were sold or spun off in different directions, and all that remained was the SourceForge.net hosting service and Slashdot, which were acquired by the job-search site Dice.com in 2012.

Such corporate shuffling is commonplace but, while SourceForge's popularity as a service had been waning in favor of GitHub, under the Dice.com regime, events took a dark turn. In 2013, the company announced a "revenue sharing" plan called DevShare, through which hosted projects could earn a little cash by allowing SourceForge to package their releases into an installer that would also include third-party, side-loaded apps.

Although the DevShare program was initially touted as something voluntary that would be completely under the individual project's control, in May 2015, the GIMP for Windows project discovered that SourceForge had unilaterally (and, some would say, secretly) replaced its release packages with adware-loaded DevShare installers. To make matters worse, the owner of the GIMP for Windows project account, Jernej Simončič, then discovered that he had been locked out of his account and found that SourceForge employees would not restore his access.

As it turned out, SourceForge had quietly started what it called the "SourceForge Open Source Mirror Directory," a program that seemed to entail site management taking over the accounts of project teams that had migrated their software off of SourceForge onto other hosting services. Needless to say, those project teams did not find the "mirroring" welcome: at best, it confused users about where to look for new releases or where to go for help. At worst, users could find downloads coupled with annoying adware, spyware, or other undesirable add-ons that neither the user nor the developers agreed to.

An uphill battle

Shortly after the 2015 controversy, Dice.com announced that it planned to sell off Slashdot and SourceForge. The sale took place in January 2016 to BIZX, LLC. New president Logan Abbott posted a message announcing plans for the site soon afterward, including:

Our first order of business was to terminate the “DevShare” program. As of last week, the DevShare program was completely eliminated. The DevShare program delivered installer bundles as part of the download for participating projects. We want to restore our reputation as a trusted home for open source software, and this was a clear first step towards that.

In May, SourceForge started scanning all downloads for malware. On June 8, Abbott appeared on Reddit to hold an unofficial "Ask Me Anything" session about the site. He highlighted the fact that the new management team was unrelated to the one on hand for the DevShare debacle, promised to continue modernizing the site's functionality, and vowed to further move away from the shady practices of the recent past.

For instance, Abbott noted that SourceForge has been criticized for allowing misleading ads that look like download buttons; staff is now working to remove these ads, he said, and users would soon be able to report them. He also reported that SourceForge has given all seized project accounts back to their original owners, and added:

I do not have insight into the decisions made under the old ownership. I can tell you that everyone here now loathes the fact that DevShare happened in the first place.

To be sure, restoring the community's trust in the SourceForge name will require quite a bit of effort. On top of that, rebuilding the hosting service into a competitive challenger to current market leader GitHub will not happen overnight, either. In the first comment linked-to above, Abbott noted that the site still gets over one million unique visitors every day, and hosts more than half a million active projects.

In the discussion, Abbott noted a few changes already in the works, like a site redesign and the addition of servers outside the US.

What might surprise some in the free-software community is that so many participants in the discussion expressed optimism that SourceForge could be resurrected and turned into a modern project-hosting service. Evidently, the site still has its fans. User "FluentInTypo" commented that SourceForge is more convenient for hosting large binaries (including ISO images) for download. When Abbott asked for input, user "pier4r" made several feature requests, such as integrated Gist-like code-sharing and support for holding discussions outside of issue comments. Many others in the thread shared their own feature requests.

Looking forward

To be sure, many of the feature requests offered in the debate are not new; most (like the lack of a general-purpose discussion tool) are common requests whenever users lament gaps in GitHub's feature set, too. But that is telling in and of itself; despite all the conventional wisdom that GitHub is "too big" to catch, complaints about the service's shortcomings are not difficult to find.

It is also interesting to note how many projects still use at least some SourceForge services, even if they also use GitHub, GNOME or KDE servers, or their own infrastructure. To cite one example, FontForge has migrated its code to GitHub, but continues to use its old SourceForge-based mailing lists. Such inertia is a subtle issue keeping projects on their existing infrastructure; while GitHub and other services make it relatively easy to migrate code repositories, migrating bug trackers, mailing lists, and other materials is rarely as smooth.

And, of course, many long-term developers may be reluctant to jump ship to whatever the latest hosting site du jour is. SourceForge's day in the sun may have been a long time ago, but one must not forget that not all that long ago, Google Code had its turn as the go-to hosting destination. When Google Code was launched in 2006, it was seen as vastly superior to SourceForge; it took several years for GitHub to supplant it.

Whatever the cause, there are still many free-software developers using SourceForge. The new owners have a lot of ground to make up, but it is far from out of the question that SourceForge could make a recovery and prove itself relevant again.


Page editor: Jonathan Corbet