Regex Utils

Zero-dependency TypeScript library for regex intersection, complement and other utilities that go beyond string matching. These are surprisingly hard to come by for any programming language.

import { intersection, size, enumerate } from '@gruhn/regex-utils'

// `intersection` combines multiple regex into one:
const passwordRegex = intersection(
  /^[a-zA-Z0-9]{12,32}$/, // 12-32 alphanumeric characters
  /[0-9]/, // at least one number
  /[A-Z]/, // at least one upper case letter
  /[a-z]/, // at least one lower case letter
)

// `size` calculates the number of strings matching the regex:
console.log(size(passwordRegex))
// 2301586451429392354821768871006991487961066695735482449920n

// `enumerate` returns a stream of strings matching the regex:
for (const sample of enumerate(passwordRegex).take(10)) {
  console.log(sample)
}
// aaaaaaaaaaA0
// aaaaaaaaaa0A
// aaaaaaaaaAA0
// aaaaaaaaaA00
// aaaaaaaaaaA1
// aaaaaaaaa00A
// baaaaaaaaaA0
// AAAAAAAAAA0a
// aaaaaaaaaAA1
// aaaaaaaaaa0B
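
As a sanity check on what `size` reports, the count for this particular password regex can also be derived by hand with inclusion-exclusion. The sketch below does not use the library. `countPasswords` is a hypothetical helper written for this illustration, and it assumes the regex matches only ASCII letters and digits:

```typescript
// Counts strings over [a-zA-Z0-9] with a length in [minLen, maxLen] that
// contain at least one digit, one upper case and one lower case letter.
// Hypothetical helper for illustration; not part of @gruhn/regex-utils.
function countPasswords(minLen: number, maxLen: number): bigint {
  let total = 0n
  for (let n = BigInt(minLen); n <= BigInt(maxLen); n++) {
    total +=
      62n ** n          // all alphanumeric strings of length n
      - 52n ** n        // minus those without any digit
      - 2n * 36n ** n   // minus those without upper case, and those without lower case
      + 2n * 26n ** n   // plus back: only lower case letters, and only upper case letters
      + 10n ** n        // plus back: only digits
  }
  return total
}

console.log(countPasswords(12, 32))
```

If the assumptions hold, the printed value should agree with the result of `size(passwordRegex)` above, which makes this a cheap cross-check for regexes whose structure is simple enough to count by hand.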

Installation

npm install @gruhn/regex-utils

High- vs. Low-Level API

There is a high-level API and a low-level API:

The high-level API operates directly on native JavaScript RegExp instances, which is more convenient but also requires parsing the regular expression. The low-level API operates on an internal representation, which skips the parsing step and is more efficient when combining multiple functions. For example, say you want to know how many strings match the intersection of two regular expressions:

import { size, intersection } from '@gruhn/regex-utils'

size(intersection(regex1, regex2))

This:

1. parses the two input RegExp
2. computes the intersection
3. converts the result back to RegExp
4. parses that again
5. computes the size

Step (1) should be fast for small handwritten regex. But the intersection of two regex can be quite large, which can make steps (3) and (4) quite costly. With the low-level API, steps (3) and (4) can be eliminated:

import * as RE from '@gruhn/regex-utils/low-level-api'

RE.size(
  RE.toStdRegex(
    RE.and(
      RE.parse(regex1),
      RE.parse(regex2)
    )
  )
)
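
To make the difference between the two call patterns concrete, here is a toy model that does not use the library. Finite languages written as `^(a|b|c)$` stand in for regular expressions, a `Set<string>` stands in for the internal representation, and a counter tracks how often parsing happens. All names (`toyParse`, `toyToRegExp`, `lowAnd`, `highIntersection`, ...) are hypothetical:

```typescript
// Toy model of the two API styles. Nothing here is part of @gruhn/regex-utils:
// a Set<string> stands in for the library's internal representation.
let parseCalls = 0

// "Parse" a RegExp of the restricted form ^(a|b|c)$ into a set of words.
function toyParse(regex: RegExp): Set<string> {
  parseCalls++
  const body = regex.source.slice(2, -2) // strip "^(" and ")$"
  return new Set(body.split('|'))
}

// Serialize the internal representation back to a RegExp.
function toyToRegExp(words: Set<string>): RegExp {
  return new RegExp(`^(${[...words].join('|')})$`)
}

// Low-level operations work on the internal representation directly.
function lowAnd(a: Set<string>, b: Set<string>): Set<string> {
  return new Set([...a].filter(word => b.has(word)))
}
const lowSize = (words: Set<string>): number => words.size

// High-level operations accept and return RegExp, so every call re-parses.
const highIntersection = (r1: RegExp, r2: RegExp): RegExp =>
  toyToRegExp(lowAnd(toyParse(r1), toyParse(r2)))
const highSize = (regex: RegExp): number => lowSize(toyParse(regex))

const toyRegex1 = /^(foo|bar|baz)$/
const toyRegex2 = /^(bar|baz|qux)$/

parseCalls = 0
const viaHighLevel = highSize(highIntersection(toyRegex1, toyRegex2))
const highLevelParses = parseCalls // two inputs plus the intermediate result

parseCalls = 0
const viaLowLevel = lowSize(lowAnd(toyParse(toyRegex1), toyParse(toyRegex2)))
const lowLevelParses = parseCalls // only the two inputs

console.log({ viaHighLevel, highLevelParses, viaLowLevel, lowLevelParses })
```

Both paths compute the same size, but the high-level path serializes and re-parses the intermediate result. In this toy the extra work is negligible; with real regex intersections the intermediate expression can be large, which is where the saving matters.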

Limitations

Syntax support
- The library implements a custom parser for regular expressions, so only a subset of the syntax is supported:
  - quantifiers: *, +, ?, {3,5}, ...
  - alternation: |
  - character classes: ., \w, [a-z], ...
  - optional start/end markers: ^ / $, but only at the very start/end of the expression (regular expression syntax technically allows them anywhere)
  - escaped meta characters: \$, \., ...
  - capturing groups: (...)
- Regex flags are not supported at all.
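
Since flags are unsupported, it can help to reject flagged inputs before they reach the library. The guard below is a minimal sketch that relies only on the standard `RegExp.prototype.flags` property. `assertNoFlags` is a hypothetical helper, and how the library itself reacts to flagged input is not covered here:

```typescript
// Hypothetical guard, not part of @gruhn/regex-utils: rejects RegExp
// instances that carry any flag (g, i, m, s, u, y, ...) and returns
// flag-free instances unchanged.
function assertNoFlags(regex: RegExp): RegExp {
  if (regex.flags !== '') {
    throw new Error(`Unsupported flags "${regex.flags}" on ${regex}`)
  }
  return regex
}

assertNoFlags(/^[a-z]+$/) // passes through unchanged

try {
  assertNoFlags(/^[a-z]+$/i)
} catch (error) {
  console.log((error as Error).message)
}
```

Wrapping each input as `intersection(assertNoFlags(regex1), assertNoFlags(regex2))` would make the failure explicit at the call site.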
Performance of intersection and complement
- These functions have worst-case exponential complexity, but the worst case is often not realized.
  - Nested quantifiers are especially dangerous, e.g. (a*|b)*.
- A bigger problem: even if the computation is fast, the output regex can be so large that the new RegExp(...) constructor crashes.
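
For intuition on where exponential growth can come from: complementing a regular language generally requires a deterministic automaton, and determinization can blow up. The sketch below runs the textbook subset construction on the classic example `(a|b)*a(a|b){n}`. It is a general illustration and makes no claim about the algorithm this library uses. `countDfaStates` is a hypothetical helper:

```typescript
// Counts the states produced by the subset construction for the regex
// (a|b)*a(a|b){n}. General illustration only; this is not claimed to be the
// algorithm used by @gruhn/regex-utils.
function countDfaStates(n: number): number {
  // NFA states 0..n+1, encoded as bits of a number (so n must stay small):
  //   state 0 loops on 'a' and 'b', and additionally moves to state 1 on 'a'
  //   state i (1 <= i <= n) moves to state i+1 on 'a' and 'b'
  //   state n+1 is accepting and has no outgoing transitions
  const start = 1
  const seen = new Set<number>([start])
  const pending = [start]
  while (pending.length > 0) {
    const current = pending.pop()!
    for (const symbol of ['a', 'b']) {
      let next = 0
      for (let state = 0; state <= n; state++) {
        if ((current & (1 << state)) === 0) continue
        if (state === 0) {
          next |= 1
          if (symbol === 'a') next |= 1 << 1
        } else {
          next |= 1 << (state + 1)
        }
      }
      if (!seen.has(next)) {
        seen.add(next)
        pending.push(next)
      }
    }
  }
  return seen.size
}

for (const n of [1, 2, 4, 8]) {
  console.log(`n = ${n}: ${countDfaStates(n)} DFA states`)
}
```

The regex grows linearly in `n`, while the number of reachable DFA states doubles with every increment (2^(n+1) in total), because the automaton has to remember the last `n + 1` symbols.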

References

Heavily informed by these papers:
