Abstract
We review the grammatical inference problem for regular languages which aims to generate a deterministic finite automaton from a representative set of training sample strings known to be in or not in the language. Although the general problem of producing a minimal DFA consistent with a given sample is known to be NP-hard, it is possible to generate minimal consistent DFA in polynomial time if certain constraints are satisfied by the given samples. In this work we propose a new algorithm which generates minimal DFA if the given training samples satisfy a certain sufficient condition. On the negative side, we also show that this problem is indeed hard, such that even for a more restricted class of training sets, the problem of generating minimal consistent DFA is already intractable.
Similar content being viewed by others
Notes
Although the general results seem negative, Gold has also proved that from a given regular language, it is possible to generate a characteristic sample set S, such that learning from that training set yields a minimal DFA in polynomial time.
Recall that Myhill–Nerode theorem defines equivalence classes on members of a language \(L\subseteq {\varSigma }^*\). If \(w_1,w_2\in {\varSigma }^*\) are defined as equivalent, then for all \(w\in {\varSigma }^*\), \(w_1\cdot w\) and \(w_2\cdot w\) are either both accepted by the language or both rejected by the language. The number of such equivalence classes for members of a regular language is always finite, which is equivalent to the number of states in a minimal DFA.
This is similar to [5], in which it is shown that there may be multiple minimal cover-automata that are consistent with a given finite input sample. Also in the original TB algorithm, the construction of \(\delta _M(w,a)\) is nondeterministic as if \(w\cdot a\in S^+\setminus {\texttt {red}}\), the choice of \(w'\) is any compatible state in \({\texttt {red}}\).
Philosophically, one may argue that when studying an automaton it is more reasonable to assume that the acceptability of all prefixes of a word should be observed.
Being maximal means when compared with others there needs to have semantic evidences, either by showing it is larger or smaller by containment, or neither larger nor smaller (i.e., incomparable by an evidence trace). The term semantic-completeness enforces that such a measure always exists at least for the maximal elements.
References
Angluin, D.: On the complexity of minimum inference of regular sets. Inf. Control 39, 337–350 (1978)
Angluin, D.: Learning regular sets from queries and counterexamples. Inf. Comput. 75, 87–106 (1987)
Angluin, D., Smith, C.H.: Inductive inference: theory and methods. ACM Comput. Surv. 15(3), 237–269 (1983)
Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.K.: Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
Câmpeanu, C., Sântean, N., Yu, S.: Minimal cover-automata for finite languages. Theor. Comput. Sci. 267, 3–16 (2001)
de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Mach. Learn. 27(2), 125–138 (1997)
de la Higuera, C., Oncina, J., Vidal, E.: Identification of DFA: data-dependent versus data-independent algorithms. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’96), pp. 313–325 (1996)
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)
García, P., López, D., Vázquez de Parga, M.: Polynomial characteristic sets for DFA identification. Theor. Comput. Sci. 448, 41–46 (2012)
Gold, M.E.: Language identification in the limit. Inf. Control 10, 447–474 (1967)
Gold, M.E.: Complexity of automaton identificaton from given data. Inf. Control 37, 302–320 (1978)
Gentilini, R., Piazza, C., Policriti, A.: From bisimulation to simulation: coarsest partition problems. J. Autom. Reason. 31(1), 73–103 (2003)
Juillé, H., Pollack, J.B.: A stochastic search approach to grammar induction. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’98), pp. 126–137 (1998)
Lang, K.J., Pearlmutter, B.A., Price, R.A.: Results of the Abbadingo one DFA learning competition and a new evidence-driven state merging algorithm. In: Proceedings of International Colloquium on Grammatical Inference (ICGI’98), pp. 1–12 (1998)
Oncina, J., García, P.: Inferring regular languages in polynomial update time. In: Paredes, R., Cardoso, J.S., Pardo, X.M. (eds.) Pattern Recognition and Image Analysis, vol. 1, pp. 49–61. World Scientific, Singapore (1992)
Parekh, R., Honavar, V.: Learning DFA from simple examples. Mach. Learn. 44(1–2), 9–35 (2001)
Pitt, L., Warmuth, M.K.: The minimum consistent DFA problem cannot be approximated within any polynomial. J. ACM 40(1), 95–142 (1993)
Trakhtenbrot, B.A., Barzdin, Y.M.: Finite Automata; Behavior and Synthesis (Fundamental Studies in Computer Science). North-Holland Publishing Co., Amsterdam (1973)
Valiant, L.: A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)
van Glabbeek, R.: The linear time-branching time spectrum I. The semantics of concrete, sequential processes. In: Bergstra, J.A., Ponse, A., Smolka, S.A. (eds.) Handbook of Process Algebra, pp. 3–99. Elsevier, Amsterdam (2001)
van Glabbeek, R., Ploeger, B.: Correcting a space-efficient simulation algorithm. In: Proceedings of International Conference on Computer Aided Verification (CAV’08), pp. 517–529 (2008)
Vázquez de Parga, M., García, P., López, D.: Minimal consistent DFA revisited. Theor. Comput. Sci. 647, 43–49 (2016)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Zhang, C. Minimal consistent DFA from sample strings. Acta Informatica 57, 657–670 (2020). https://doi.org/10.1007/s00236-020-00365-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00236-020-00365-8