Introduction to Regular Expressions: In Search of Star Height
JOI 2014 Spring Training Camp

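About this lecture
★ Regular expressions are a notation for pattern matching, widely used as a convenient tool for string search. This lecture starts from the basics of regular expressions and then focuses on a property called "star height" to explore the mathematical structure hidden behind them.
Cover image: http://www.superbwallpapers.com/nature/starry-night-sky-over-the-mountains-14678/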

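Speaker: Ryoma Sin'ya (新屋良磨, @sinya8282)
★ Likes regular expressions
★ Likes automata even more
★ Earned a master's degree on regular languages
★ First-year doctoral student at Tokyo Institute of Technology (Mathematical and Computing Sciences)
Master's thesis: "An application of Abstract Numeration Systems on regular languages to string compression", Department of Mathematical and Computing Sciences, Graduate School of Information Science and Engineering, Tokyo Institute of Technology (advisor: Masataka Sassa, September 25, 2013)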

Graph paths and how to represent them
(●(●●|●)*●●|●)(●●|(●|●●)(●|●●)*●●)*((●●|●)(●●|●)*)|●(●●|●)*

Everyone's favorite: graphs
A graph of this shape is called the "Petersen graph".

A common problem: shortest paths
★ What is the length of the shortest path between A and C?
★ Length 2
●●

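A common problem: enumerating paths
★ Enumerate the paths of length 4 between A and C.
●●●●,●●●●,●●●●
●●●●,●●●●,●●●●
●●●●,●●●●,●●●●
In this lecture a "path" may repeat edges and vertices (it is what is also called a walk).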

A common problem: counting paths
★ How many paths of length 100 are there between A and C?
★ 7886
Incidentally, merely counting paths can be done directly with the adjacency matrix.

Graphs and adjacency matrices (a small example)
[Figure: a graph on nodes 1 to 5 and its adjacency matrix M]
★ The adjacency matrix is a data structure with (number of vertices) × (number of vertices) entries.
★ The (1,1) entry is 0: there is no edge from node 1 to node 1 (no loop).
★ All diagonal entries are 0: the graph has no loops anywhere.
★ The (1,3) entry is 1: there is an edge from node 1 to node 3.
★ The N-th power of the adjacency matrix gives the number of paths of length N.
★ The (1,3) entry of M^10 is 165: there are 165 paths of length 10 from node 1 to node 3.
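
The counting rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the small 5-node matrix from the slide is not available here, so the sketch builds the Petersen graph in its standard drawing (outer pentagon 0 to 4, inner pentagram 5 to 9), and vertices 0 and 2 are a hypothetical stand-in for A and C (any two vertices at distance 2). The helper names `mat_mul`, `mat_pow`, `petersen`, and `count_walks` are made up for this example.

```python
def mat_mul(x, y):
    """Multiply two square integer matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, e):
    """Raise a square matrix to the e-th power by repeated squaring."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    while e > 0:
        if e & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        e >>= 1
    return result


def petersen():
    """Adjacency matrix of the Petersen graph (assumed vertex numbering)."""
    adj = [[0] * 10 for _ in range(10)]

    def link(u, v):
        adj[u][v] = adj[v][u] = 1

    for i in range(5):
        link(i, (i + 1) % 5)          # outer pentagon
        link(i, i + 5)                # spokes
        link(i + 5, (i + 2) % 5 + 5)  # inner pentagram
    return adj


def count_walks(adj, src, dst, length):
    """Number of walks of the given length from src to dst."""
    return mat_pow(adj, length)[src][dst]


print(count_walks(petersen(), 0, 2, 4))  # 9, matching the nine paths listed earlier
```

Because Python integers have arbitrary precision, the same call with `length=100` also works without overflow.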

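For this material, reading the "Ant Book" (蟻本, the Programming Contest Challenge Book) should be enough.

For the body of theory connecting graphs and adjacency matrices (linear algebra), "graph spectra theory" is also recommended.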

Let's enumerate paths once more
★ Enumerate the paths of length 100 between A and C.
★ Too tedious.

Enumerating paths of arbitrary length
★ Enumerate the paths of any length between A and C.
!!!??? But there are infinitely many of them...

We want some way to represent infinitely many paths
★ Consider a small directed graph.
★ We want to enumerate the closed paths of any length from A back to A.
★ In the graph on the right, following ● any number of times gives a closed path from A to A:
ε, ●, ●●, ●●●, ●●●●, ●●●●●, ●●●●●●, ●●●●●●●, ...
ε is a special symbol that denotes "the path of length 0".
Even though there are infinitely many paths, we would like to be able to express a "simple infinity" like this one.
Let us write the paths in which "● repeats any number of times" as ●*.

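[Figure: a directed graph on nodes A and B]
★ Consider a small directed graph.
★ We want to enumerate the closed paths of any length from A back to A:
ε, ●, ●●, ●●, ●●●, ●●●, ●●●, ●●●●, ●●●●, ●●●●, ●●●●, ...
Let us write the paths in which "● or ●●" repeats any number of times as (●|●●)*.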

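Representing infinitely many paths with "*" and "|"
★ In fact, the set of paths between any two nodes of any finite graph can be expressed using "*" and "|".
[Figure: a directed graph on nodes A, B, C]
Let us express the "A to B" and "A to C" paths of this graph.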

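★ Step 1. First, write the color of each arrow as a symbol on the arrow.
★ Step 2. Add two special nodes D and E. Draw an ε arrow from D to the start node of the paths we want to express, and an ε arrow from each end node to E.
[Figure: the graph on A, B, C with ●-labelled arrows, plus D with an ε arrow to A and ε arrows from B and C to E]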

★ Step 3. Eliminate the states other than the added D and E one by one, expressing paths with "*" and "|" as we go.
First eliminate A. Of course we do not just delete it: we eliminate it while preserving the path information.
★ The path ●● from B through A back to B is treated as a "loop from B to B". B already has a loop labelled ●, so the new label is added with "|", which means "or": the loop becomes ●|●●.
★ The path ●● from B through A to C is treated as an "edge from B to C" and added with the label ●●.
★ For C, likewise, the "paths through A" are added as edges.
★ The same is done for D.
Every path through A has now been converted into an "edge that does not pass through A". All that is left is to delete A.
[Figure: the graph on B, C, D, E after A and its arrows have been removed]

Redraw the graph so that it is easier to read.
[Figure: the graph on B, C, D, E after A has been eliminated]
Next, eliminate B. As with A, convert the "paths through B" into "edges that do not pass through B".
★ Replace the "path from C through B back to C" with a "loop on C".
★ However, B has the loop ●|●●, so we express it as (●|●●)*.
★ The paths C → B → ... → B → C are added to the loop on C: |(●|●●)(●|●●)*●●
★ Replace the other "paths through B" in the same way (taking care to express loops with *).

[Figure: the graph on C, D, E after B has been eliminated, with edge labels ●(●●|●)*, ●●|(●|●●)(●|●●)*●●, (●●|●)(●●|●)*|ε, and ●(●●|●)*●●|●]
Finally, eliminating C leaves a single edge from D to E:
(●(●●|●)*●●|●)(●●|(●|●●)(●|●●)*●●)*((●●|●)(●●|●)*|ε)|●(●●|●)*
This is exactly the expression we wanted for the "A to B" and "A to C" paths.
Building a path pattern by eliminating states one after another in this way is called the state elimination method [3].
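
The elimination steps walked through above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the lecture's own code: edge labels are stored as Python `re` pattern strings, the empty string stands for ε, and the example graph is a hypothetical three-state automaton over a and b (b cycles 0 → 1 → 2 → 0, a loops on every state) rather than the colored graph on the slides. The names `union`, `star`, and `eliminate` are invented for illustration.

```python
import re
from itertools import product


def union(r, s):
    """Label for "r or s"; None means "no edge"."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"


def star(r):
    """Label for "r repeated any number of times" (nothing for no loop or an ε loop)."""
    return "" if not r else f"(?:{r})*"


def eliminate(order, edges, start, accept):
    """Eliminate the states in `order`; return the label of the start -> accept edge.

    edges maps (source, target) to a regular expression string.
    """
    edges = dict(edges)
    for k in order:
        loop = star(edges.get((k, k)))
        preds = [p for (p, q) in edges if q == k and p != k]
        succs = [q for (p, q) in edges if p == k and q != k]
        for p in preds:
            for q in succs:
                # Every path p -> k -> ... -> k -> q becomes one direct edge p -> q.
                bypass = f"(?:{edges[(p, k)]}){loop}(?:{edges[(k, q)]})"
                edges[(p, q)] = union(edges.get((p, q)), bypass)
        # Now k can be deleted together with its edges.
        edges = {(p, q): r for (p, q), r in edges.items() if k not in (p, q)}
    return edges.get((start, accept))


# Hypothetical example: D and E are the added nodes, joined to state 0 by ε edges.
EXAMPLE = {
    ("D", 0): "", (0, "E"): "",
    (0, 0): "a", (1, 1): "a", (2, 2): "a",
    (0, 1): "b", (1, 2): "b", (2, 0): "b",
}
pattern = eliminate([1, 2, 0], EXAMPLE, "D", "E")

# Brute-force sanity check: the result should accept exactly the words
# whose number of b's is a multiple of 3.
for n in range(7):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert bool(re.fullmatch(pattern, word)) == (word.count("b") % 3 == 0)
```

The order of elimination changes the shape (and length) of the resulting expression but not the set of paths it describes.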

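Expressions using "*" and "|": regular expressions
★ A regular expression is an "expression for patterns of paths".
★ The paths of any (finite) graph can be expressed by a regular expression.
★ A regular expression can describe "infinitely many paths".
★ Today, regular expressions are valued mostly for expressing "patterns of strings" → regular expression matching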

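Regular expressions for programmers
★ Often used in programming and in shell commands
★ For validating and searching strings
★ Many notations and features have been added beyond "|" and "*"
★ "." (dot) stands for any single character
★ "(a|b|c|d|e)" can be written "[abcde]" or "[a-e]"
★ "\d" stands for a single digit, (0|1|2|3|4|5|6|7|8|9), and so on
★ "+" means one or more repetitions
★ "*" means any number of repetitions (zero or more)
★ "{n}" means exactly n repetitions
★ And many more... (no need to memorize them)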
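
The notations listed above can be tried out with Python's standard `re` module. The sketch below is only an illustration; the helper `matches` and the sample strings are made up for this example, and other tools (grep, Perl, and so on) differ in details of syntax.

```python
import re


def matches(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


checks = [
    (r"a.c", "abc", True),          # "." is any single character
    (r"(a|b|c|d|e)", "c", True),    # alternation ...
    (r"[a-e]", "c", True),          # ... written as a character class
    (r"\d", "7", True),             # one digit
    (r"a+", "", False),             # "+" needs at least one repetition
    (r"a*", "", True),              # "*" allows zero repetitions
    (r"\d{4}", "2014", True),       # exactly four digits
    (r"\d{4}", "201", False),
]
for pattern, text, expected in checks:
    assert matches(pattern, text) == expected
```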

A practical (?) regular expression, example 1
★ A regular expression for IP phone numbers http://blog.livedoor.jp/nipotan/archives/51644244.html
★ (You do not need to understand it!!!)
050-(8(8(10|0\d|6[4-8]|8[0-6])|0([0-2]\d|3[0-8])|
2(0\d|1[0-2])|6(86|0[01]))|7(6([01]\d|2[0-5])|7(88|
7[0-5])|1(0\d|1[0-3])|30[0-3]|00[01]|5\d\d)|
3(8([01]\d|2[0-5])|2([0-4]\d|5[01])|[013-7]\d\d|
90[01])|5(([02]0|5[0-6])\d|8([0-3]\d|4[0-2])|79[89])|
2(0([0-2]\d|3[0-6])|20[01]|403|525)|6(6(19|2[0-2])|
86[0-8]|[01]00)|1(8(0\d|1[0-2])|[0-7]\d\d)|90(0\d|
1[0-5]))\d{4}

A practical (?) regular expression, example 2
★ A regular expression for URIs http://swatmac.info/?p=1064
★ (You do not need to understand it!!!!!!)
[a-z][x2bx2dx2e0-9a-z]*:(//(([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!x24&-,:;=])*@)?(x5b(([0-9a-f]{1,4}:){6}([0-9a-f]{1,4}:[0-9a-f]
{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|
1d{2}|2[0-4]d|25[0-5]))|::([0-9a-f]{1,4}:){5}([0-9a-f]{1,4}:[0-9a-f]{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|
2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5]))|([0-9a-f]{1,4})?::([0-9a-f]{1,4}:){4}
([0-9a-f]{1,4}:[0-9a-f]{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|
25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5]))|(([0-9a-f]{1,4}:)?[0-9a-f]{1,4})?::([0-9a-f]{1,4}:){3}([0-9a-f]{1,4}:[0-9a-f]{1,4}|(d|[1-9]d|
1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|
25[0-5]))|(([0-9a-f]{1,4}:){0,2}[0-9a-f]{1,4})?::([0-9a-f]{1,4}:){2}([0-9a-f]{1,4}:[0-9a-f]{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|
[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5]))|(([0-9a-f]{1,4}:){0,3}
[0-9a-f]{1,4})?::[0-9a-f]{1,4}:([0-9a-f]{1,4}:[0-9a-f]{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])
x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5]))|(([0-9a-f]{1,4}:){0,4}[0-9a-f]{1,4})?::([0-9a-f]{1,4}:[0-9a-
f]{1,4}|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|
[1-9]d|1d{2}|2[0-4]d|25[0-5]))|(([0-9a-f]{1,4}:){0,5}[0-9a-f]{1,4})?::[0-9a-f]{1,4}|(([0-9a-f]{1,4}:){0,6}[0-9a-f]{1,4})?::|v[0-9a-f]+x2e[!
x24&-x2e0-;=_a-z~]+)x5d|(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|
2[0-4]d|25[0-5])x2e(d|[1-9]d|1d{2}|2[0-4]d|25[0-5])|([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!x24&-,;=])*)(:d*)?(/([x2dx2e0-9_a-
z~]|%[0-9a-f][0-9a-f]|[!x24&-,:;=@])*)*|/(([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!x24&-,:;=@])+(/([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-
f]|[!x24&-,:;=@])*)*)?|([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!x24&-,:;=@])+(/([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!
x24&-,:;=@])*)*)?(x3f([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!x24&-,/:;=x3f@])*)?(x23([x2dx2e0-9_a-z~]|%[0-9a-f][0-9a-f]|[!
x24&-,/:;=x3f@])*)?

The history of regular expressions
– Stephen Cole Kleene (1909-1994)

[Figure: first page of Warren S. McCulloch and Walter Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, Volume 5, 1943, p. 115]

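A computational model of the brain: the formal neuron
★ In 1943, McCulloch and Pitts proposed the "formal neuron", a computational model of the brain [5].
★ A formal neuron is a simple graph-like model with a "weight w" on each edge and a "threshold θ" on each node.
[Figure: a nerve cell (neuron) of the brain and a formal neuron. Images quoted from [4], pp. 16 and 18]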

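The birth of regular expressions
★ In 1951, Kleene formalized the class of things computable by formal neurons using two tools, "regular expressions" and "automata" [6].
★ "Computable by formal neurons" ⇔ "expressible by a regular expression" ⇔ "computable by an automaton"
★ The name "regular expression" comes from Kleene's 1951 paper.
★ In other words, the father of regular expressions is the great Kleene.
★ He also studied rewriting rules (axioms) for regular expressions.
[Figure: one page quoted from the paper [6]]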

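The development of automata
★ What is an automaton? → A finite graph model that recognizes strings.
★ Starting from the initial state, it moves from state to state following the input. If it ends up in an accepting state the input is "accepted"; otherwise it is "rejected".
[Figure: an automaton with states 0, 1, 2 over the letters a and b]
An automaton that accepts the strings over a, b in which the number of occurrences of b is a multiple of 3. State 0 is both the initial state and the accepting state.
Written as a regular expression:
a*(ba*ba*ba*)*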
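
How such an automaton processes its input can be sketched as follows. The transition table `DELTA` is an assumption reconstructed from the description (a keeps the current state, b moves 0 → 1 → 2 → 0), since the figure itself is not reproduced here; `accepts` is a hypothetical helper name.

```python
import re
from itertools import product

# Assumed transition table: (state, letter) -> next state.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 0,
}


def accepts(word, delta=DELTA, start=0, accepting=frozenset({0})):
    """Follow one transition per input letter, then check the final state."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting


# The automaton and the regular expression should agree on every short word.
PATTERN = r"a*(ba*ba*ba*)*"
for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert accepts(word) == bool(re.fullmatch(PATTERN, word))
```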

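★ Automata come in two kinds: nondeterministic and deterministic.
★ An automaton is deterministic (a DFA) if it has exactly one initial state and every state has at most one destination for each character. Otherwise it is nondeterministic (an NFA).
[Figure: two automata over the letters a and b]
Two automata that accept the strings over a, b whose second-to-last character is a. The top one is an NFA and the bottom one is a DFA!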

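★ In 1959, Rabin and Scott showed that any automaton can be converted into a deterministic automaton (the subset construction) [7].
★ For this work the two received the Turing Award, the highest honor for computer scientists.
[Figure: determinization by the subset construction. The NFA with states 0, 1, 2 becomes a DFA with states {0}, {0,1}, {0,1,2}, {0,2}]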
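
The subset construction can be sketched as below. The NFA table is an assumption based on the description of the example (state 0 loops on a and b and guesses "this a is second to last" by moving to 1; state 1 moves to the accepting state 2 on any letter); `determinize` and `run` are hypothetical names.

```python
from itertools import product

# Assumed NFA: (state, letter) -> set of possible next states.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
}


def determinize(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset({start})
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for letter in alphabet:
            target = frozenset(q for p in current for q in nfa.get((p, letter), ()))
            dfa[(current, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A subset is accepting if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & set(accepting)}
    return dfa, start_set, dfa_accepting


def run(dfa, start, accepting, word):
    """Run the resulting DFA on a word."""
    state = start
    for letter in word:
        state = dfa[(state, letter)]
    return state in accepting


dfa, dfa_start, dfa_accepting = determinize(NFA, 0, {2}, "ab")
print(sorted(sorted(s) for (s, _) in dfa if _ == "a"))
# [[0], [0, 1], [0, 1, 2], [0, 2]]
```

Only the subsets reachable from {0} are built, which is why the example ends up with four DFA states rather than all eight subsets.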

Becoming a tool for programmers
★ In 1968, Ken Thompson (the real-deal hacker who created UNIX and the B language, the direct predecessor of C) published the first paper on "converting a regular expression to an NFA (Thompson's construction)" and on "how to implement an NFA" [8].
★ This work of Thompson's led to one of the earliest software patents (1971).
[Figure: first page of Ken Thompson, "Regular Expression Search Algorithm", Communications of the ACM, Volume 11, Number 6, June 1968, p. 419]

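★ In 1973, Thompson developed grep, a string search tool based on regular expressions, for UNIX.
★ Since then regular expressions have been built into programming languages such as Perl, and today almost every programming language supports them. They have become something many programmers love.
[Figure: cover of Jeffrey E. F. Friedl, "Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools"]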

The theory of regular expressions, and today
★ Regular expressions are handy in practice, but they are also a very deep subject of theoretical study. Many corresponding computational, algebraic, and logical models exist.
★ "A universal structure has many different characterizations."
★ Purely theoretical research is still going on.
★ One such topic is "star height".
[Figure: characterizations of a regular language (a set of strings), e.g. {ε, a, aa, bb, aaa, abb, bab, bbb, ...}: finite automata (computational model), finite monoids (algebraic model), monadic second-order logic (logical formulas), regular grammars (generative grammars), regular expressions (expressions), e.g. a*(ba*ba*)*, and clopen sets over profinite words (topology)]

Regular expressions and star height
– Vincent van Gogh (1888)
“Starry Night over the Rhone.”

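Stars in regular expressions
★ The repetition operator "*" is called
★ the "Kleene star", the "Kleene closure",
★ or the "star operator".
★ Intuitively, the "star height" of a regular expression is the nesting depth of this star operator.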

Star height
★ The function h giving the "star height" of a regular expression is defined recursively as follows:
★ For any letter σ of the alphabet, h(σ) = 0
★ For regular expressions r and s:
★ h(rs) = h(r|s) = max(h(r), h(s))
★ h(r*) = h(r) + 1
★ In short, it is the depth of the most deeply nested *.

Examples of star height
★ (a|bc)d has star height 0; a* has star height 1
★ a*(ba*ba*)* has star height 2
★ ((a*)*)* has star height 3
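
The recursive definition of h translates directly into code. The sketch below assumes a made-up representation in which a regular expression is a nested tuple, built with the hypothetical helpers `sym`, `cat`, `alt`, and `star`; it then recomputes the examples above.

```python
from functools import reduce


def sym(letter):
    return ("sym", letter)


def cat(*parts):
    """Concatenation of two or more expressions."""
    return reduce(lambda r, s: ("cat", r, s), parts)


def alt(*parts):
    """Alternation (the "|" operator) of two or more expressions."""
    return reduce(lambda r, s: ("alt", r, s), parts)


def star(r):
    return ("star", r)


def h(r):
    """Star height, following the recursive definition."""
    kind = r[0]
    if kind == "sym":
        return 0
    if kind in ("cat", "alt"):
        return max(h(r[1]), h(r[2]))
    if kind == "star":
        return h(r[1]) + 1
    raise ValueError(f"unknown node: {kind}")


a, b, c, d = (sym(x) for x in "abcd")
print(h(cat(alt(a, cat(b, c)), d)))                             # (a|bc)d      -> 0
print(h(star(a)))                                               # a*           -> 1
print(h(cat(star(a), star(cat(b, star(a), b, star(a))))))       # a*(ba*ba*)*  -> 2
print(h(star(star(star(a)))))                                   # ((a*)*)*     -> 3
```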

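Regular expressions that look different but mean the same
★ a* and (a*)* clearly denote the same language (why?)
★ a*(ba*ba*)* and (a*|a*b(a|ba*b)*ba*) also denote the same language
★ Regular expressions that denote the same language can have different star heights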
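
A quick way to gain confidence in such equivalences is to compare the two expressions on every short string. This is a sanity check, not a proof: agreement up to some length does not guarantee equality of the languages. The function name `same_language_up_to` and its parameters are made up for this sketch, which relies on Python's `re.fullmatch`.

```python
import re
from itertools import product


def same_language_up_to(p, q, alphabet="ab", max_len=8):
    """True if patterns p and q agree on all words of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if bool(re.fullmatch(p, word)) != bool(re.fullmatch(q, word)):
                return False
    return True


assert same_language_up_to(r"a*", r"(a*)*")
assert same_language_up_to(r"a*(ba*ba*)*", r"(a*|a*b(a|ba*b)*ba*)")
assert not same_language_up_to(r"a*", r"a*(ba*ba*)*")
```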

Star height of regular "languages"
★ A set of strings that can be expressed by a regular expression is called a regular language.
★ The "star height of a regular language L" is defined as the smallest star height among the regular expressions that express L:
★ h(L) = min{ h(r) | r is a regular expression that expresses L }
★ The problem of determining the star height of a given regular language L is called the "star height problem".
★ Posed by Eggan in 1963 [9]

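Can star height be measured?
★ Is there an algorithm that determines the star height of a given regular language L?
★ There is! Settled affirmatively by Hashiguchi in 1988 [10]
★ It is a proof of a very general statement
★ The proof is extremely difficult...
★ The computational complexity is brutal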

The proof for the "star height problem" is difficult
“Indeed, the existing proof, putting all pieces together, takes more than a hundred pages of very heavy combinatorial reasoning.”
I. Simon, MFCS’88 Proceedings, 1988
“The proof is very difficult to understand and a lot remains to be done to make it a tutorial presentation.”
D. Perrin, Finite Automata, Handbook of Theor. Comp. Sc., 1990
“Hashiguchi’s solution for arbitrary star height relies on a complicated induction, which makes the proof very difficult to follow.”
J.-E. Pin, Tropical Semirings, Idempotency, 1998

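The "star height problem" is difficult
★ There are infinitely many regular expressions for the same language.
★ So enumerating and checking them all is impossible (it never terminates).
★ No well-behaved minimal form of regular expressions is known.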

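Eggan's theorem
★ h(L) equals the minimum cycle rank among the NFAs that recognize L
★ A result from Eggan's 1963 paper [9]
★ Strictly speaking, Eggan proved the theorem for "the minimum cycle rank of ε-NFAs", and in 1972 Rina S. Cohen [11] extended it to the stronger statement about "the minimum cycle rank of NFAs".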

Cycle rank of a directed graph
★ The cycle rank r(G) of a directed graph G = (V, E) is defined recursively as follows:
★ If G contains no cycle:
★ r(G) = 0
★ If G is strongly connected (and contains a cycle):
★ r(G) = 1 + min{ r(G - v) | v is a node of G }
★ If G is not strongly connected:
★ r(G) = max{ r(G’) | G’ is a strongly connected component of G }
★ (the largest cycle rank among the strongly connected components of G)
★ Intuitively, it can be understood as "how hard the cycles are to get rid of".
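
The recursive definition can be turned into a small (exponential-time) program. The sketch below is an illustration under assumptions: graphs are given as a set of nodes and a set of directed edges, strongly connected components are found by naive mutual reachability, and a self-loop counts as a cycle. The names `sccs` and `cycle_rank` are hypothetical, and the approach is only practical for tiny graphs.

```python
def sccs(nodes, edges):
    """Strongly connected components via mutual reachability (small graphs only)."""
    def reach(src):
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            for (p, q) in edges:
                if p == u and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    reachable = {v: reach(v) for v in nodes}
    components, assigned = [], set()
    for v in nodes:
        if v not in assigned:
            comp = frozenset(w for w in nodes
                             if w in reachable[v] and v in reachable[w])
            components.append(comp)
            assigned |= comp
    return components


def cycle_rank(nodes, edges):
    """Cycle rank of the directed graph (nodes, edges)."""
    nodes = frozenset(nodes)
    edges = frozenset((p, q) for (p, q) in edges if p in nodes and q in nodes)
    ranks = []
    for comp in sccs(nodes, edges):
        inner = frozenset((p, q) for (p, q) in edges if p in comp and q in comp)
        if not inner:
            ranks.append(0)  # a single node without a self-loop has no cycle
        else:
            # Strongly connected with a cycle: remove the best possible node.
            ranks.append(1 + min(cycle_rank(comp - {v}, inner) for v in comp))
    return max(ranks, default=0)


# b cycles through 0 -> 1 -> 2 -> 0 and every state has a self-loop on a.
mod3 = {(0, 1), (1, 2), (2, 0), (0, 0), (1, 1), (2, 2)}
print(cycle_rank({0, 1, 2}, mod3))  # 2
```

The last example is the shape of the "number of b's is a multiple of 3" automaton, and its cycle rank 2 agrees with the star height of the expression a*(ba*ba*ba*)*.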

Examples of cycle rank
[Figure: Examples 6.2, graphs with loop complexity (cycle rank) 1, 2, and 3]

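Minimum cycle rank = star height
★ The proof is not that hard, but arguing every detail is laborious.
★ Proof outline
★ Show that for any regular expression r, an equivalent NFA with cycle rank at most h(r) can be constructed (Thompson's construction)
★ Show that for any NFA N, an equivalent regular expression with star height at most r(N) can be constructed (state elimination)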

Eggan's other theorem
★ Star height forms an infinite hierarchy
★ That is, for every natural number n there is a regular language L with h(L) ≧ n!
★ Proved by Eggan in 1963.
★ Re-proved in a stronger form by Sakarovitch in 2007 [1].
★ The language L of all strings over a, b in which the number of a's and the number of b's are equal modulo 2^q has star height exactly q (Sakarovitch, [1]).

Star height does not collapse
[Figure 6.7: an automaton]
The language accepted by the automaton on the right consists of the strings in which "the numbers of a's and b's agree modulo 2^3". So the star height of the language accepted by this automaton is 3, and the cycle rank of this automaton is also 3.
★ Starting from automata of high cycle rank, the languages are built up hierarchically, and the rest is proved the hard way from the regular expression side (the proof is difficult).

Nice properties of regular languages
★ Many corresponding representations and computational models
★ Closed under many operations
★ Regular languages can be negated (complemented).
★ Put more stylishly: "regular languages are closed under the Boolean operations."
Showing that regular languages are closed under negation using regular expressions is difficult. But with finite automata, monoids, and so on, the proof is a breeze (one step)!! This is the advantage of having many characterizations.

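Negation of a regular language (thinking in regular expressions)
★ The negation L̄ of a regular language L is "the set of strings that do not belong to L"
★ In other words, the complement of L
★ a*(ba*ba*)* is "the strings over a, b in which b occurs an even number of times"
★ Question: then what is its negation?
★ Naturally, "the strings over a, b in which b occurs an odd number of times"
★ Written as a regular expression: a*ba*(ba*ba*)*
★ Transforming a regular expression into "the regular expression of its negation" is very hard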

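Negation of a regular language (thinking in automata)
[Figure: a DFA with states 0 and 1 over the letters a and b, and its negation]
★ What is the complement of "the set of paths from 0 to 0"?
★ It is nothing other than "the set of paths from 0 to 1"!
★ All we have to do is take the complement of the set of end points (accepting states)!
Note: taking the negation this way works for DFAs. In general it does not work for NFAs (why?)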
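
For a complete DFA, the complement step is just a set difference on the accepting states. The sketch below assumes the two-state automaton for "an even number of b's" (a keeps the state, b switches between 0 and 1); the table `DELTA` and the helper names are assumptions made for this example.

```python
import re
from itertools import product

STATES = frozenset({0, 1})
# Assumed complete DFA: a keeps the state, b toggles it.
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 0}


def accepts(word, accepting, delta=DELTA, start=0):
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state in accepting


def complement(accepting, states=STATES):
    """Accepting states of the complement DFA: every state that was not accepting."""
    return states - accepting


EVEN = frozenset({0})      # paths from 0 to 0
ODD = complement(EVEN)     # paths from 0 to 1

for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert accepts(word, EVEN) == bool(re.fullmatch(r"a*(ba*ba*)*", word))
        assert accepts(word, ODD) == bool(re.fullmatch(r"a*ba*(ba*ba*)*", word))
        assert accepts(word, EVEN) != accepts(word, ODD)
```

The trick relies on every word ending in exactly one state, which is what determinism (with a complete transition table) provides.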

Building the regular expression of the negation from a regular expression
★ Step 1. Build an NFA from the regular expression (Thompson's construction)
★ Step 1.5. Remove the ε-transitions
★ Step 2. Build a DFA from the NFA (subset construction)
★ Step 3. Negate the DFA (swap accepting and non-accepting states)
★ Step 4. Build a regular expression from the negated DFA (state elimination)
That's all!

The generalized star height problem
– Rufino Tamayo (1950)
“Man Before the Infinite.”

Generalized star height
★ The star height obtained when a negation operator "!" is added to regular expressions
★ Regular expressions that may use "*", "!", and "|" are called extended regular expressions
★ The generalized star height gh(r) of a regular expression r is defined as follows:
★ For any letter σ of the alphabet, gh(σ) = 0
★ gh(rs) = gh(r|s) = max(gh(r), gh(s))
★ gh(r*) = gh(r) + 1
★ gh(!r) = gh(r) ← New!
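
As a small illustration, the definition of gh can be written out over a made-up tuple representation of extended regular expressions, where the tags "sym", "cat", "alt", "star", and "not" are assumptions of this sketch. The only difference from ordinary star height is the last case: negation costs nothing.

```python
def gh(r):
    """Generalized star height of an extended regular expression given as nested tuples."""
    kind = r[0]
    if kind == "sym":
        return 0
    if kind in ("cat", "alt"):
        return max(gh(r[1]), gh(r[2]))
    if kind == "star":
        return gh(r[1]) + 1
    if kind == "not":
        return gh(r[1])  # negation does not add height
    raise ValueError(f"unknown node: {kind}")


a, b = ("sym", "a"), ("sym", "b")
print(gh(("cat", ("not", a), b)))           # (!a)b    -> 0
print(gh(("not", ("star", a))))             # !(a*)    -> 1
print(gh(("star", ("not", ("star", a)))))   # (!(a*))* -> 2
```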

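The generalized star height problem
★ Is there an algorithm that computes the generalized star height of a given regular language L?

The generalized star height hierarchy
★ Does generalized star height form an infinite hierarchy?
★ That is, for every natural number n, is there some L with gh(L) ≧ n?

The generalized star height problem and its hierarchy problem have been...
★ open for half a century (1960s to the present)
★ It is unknown whether generalized star height forms an infinite hierarchy
★ Worse still, it is not even known whether a regular language L with gh(L) = 2 exists!!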

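Schützenberger's theorem
A famous theorem on generalized star height
★ A language L is star-free ⇔ the (finite) syntactic monoid of L is aperiodic
★ Star-free: a language that can be written (as an extended regular expression) without using "*"
★ A very beautiful theorem, but explaining it would need a lecture on semigroups, so it is omitted
★ Proved by Schützenberger in 1965 [12]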

Star-free languages
Languages that need no star
★ (ab)* is a star-free language (think about it)
★ (aa)* is not a star-free language (by Schützenberger's theorem)
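
For readers who want to check their answer to the first exercise, here is one possible star-free description, offered as a sketch rather than the intended solution. Over the alphabet {a, b}, write Σ* as (a|!a); then (ab)* is the negation of bΣ* | Σ*a | Σ*aaΣ* | Σ*bbΣ*: a word is in (ab)* exactly when it does not start with b, does not end with a, and contains neither aa nor bb. The Python predicates below (hypothetical names) mirror that description and compare it with (ab)* by brute force on short words.

```python
from itertools import product


def in_ab_star(word):
    """Direct membership test for (ab)*."""
    return len(word) % 2 == 0 and all(
        word[i:i + 2] == "ab" for i in range(0, len(word), 2))


def in_star_free_description(word):
    """Negation of: starts with b, or ends with a, or contains aa, or contains bb."""
    return not (word.startswith("b") or word.endswith("a")
                or "aa" in word or "bb" in word)


def agree_up_to(max_len, alphabet="ab"):
    """True if both descriptions agree on every word of length <= max_len."""
    return all(
        in_ab_star(word) == in_star_free_description(word)
        for n in range(max_len + 1)
        for word in map("".join, product(alphabet, repeat=n)))


assert agree_up_to(10)
```

The check covers only words up to the chosen length; the general argument is that a word with no aa and no bb must alternate, and the conditions on the first and last letters force it to be a repetition of ab.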

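★ How did you like regular expressions? Did you feel the romance of star height?
★ The "generalized star height problem" is an open problem that has shone brilliantly in the theory of regular languages since the 1960s.
★ Re-solving "problems that have been solved" is fine, but tackling "problems that have not been solved" might be fun too.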

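How it went at the JOI 2014 camp
★ I think the young students listened with great interest
★ I was amazed that junior high and high school students breezed through "(ab)* is star-free"!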

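There are few researchers of regular languages in Japan
★ Everyone, let's do research on regular languages.
★ I am always happy to talk. I can also suggest research topics.
★ Questions about regular languages are welcome at any time.
Twitter ( @sinya8282 )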

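[1] Sakarovitch's masterpiece, known as "EAT" (Elements of Automata Theory) (2007)
★ Covers topics on automata broadly and deeply
★ Includes the proof of Eggan's theorem
★ Reading this book is what made me fall for "regular languages"
★ Thick (about 760 pages)
★ Beautifully typeset

[2] Lawson's introductory book (2003)
★ Treats the "algebraic theory" of regular languages thoroughly
★ Including the proof of Schützenberger's theorem
★ A book written by a semigroup researcher
★ Ideal as an introduction to variety theory

[3] Sipser's introductory book, third edition (2012)
★ An introduction to the theory of computation
★ The most readable introduction to automata. A quick read.
★ A Japanese translation (『計算理論の基礎』) is available, so it is recommended.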

★ [4] Michael A. Arbib: Brains, Machines, and Mathematics (1987)
★ [5] Warren S. McCulloch, Walter Pitts: A Logical Calculus of the Ideas Immanent in Nervous Activity (1943)
★ [6] S. C. Kleene: Representation of Events in Nerve Nets and Finite Automata (1951)
★ [7] M. O. Rabin, D. Scott: Finite Automata and Their Decision Problems (1959)
★ [8] Ken Thompson: Regular Expression Search Algorithm (1968)
★ [9] L. C. Eggan: Transition graphs and the star-height of regular events (1963)
★ [10] K. Hashiguchi: Algorithms for Determining Relative Star Height and Star Height (1988)
★ [11] Rina S. Cohen: Rank-non-increasing transformations on transition graphs (1971)
★ [12] M. P. Schützenberger: On finite monoids having only trivial subgroups (1965)
