Finite State Morphology, by Kenneth R. Beesley and Lauri Karttunen (CSLI Publications), is a reference guide to the finite-state computational tools developed by Xerox Corporation over the past decades, and an introduction to the underlying theory. Morphological analysers are important NLP tools, in particular for morphologically rich languages.
Kenneth R. Beesley, Xerox Research Centre Europe. Twenty years ago, morphological analysis of natural language was a challenge to computational linguists. Simple cut-and-paste programs could be, and were, written to analyze strings in particular languages, but there was no general, language-independent method available.
Furthermore, cut-and-paste programs for analysis were not reversible; they could not be used to generate words. Generative phonologists of that time described morphological alternations by means of ordered rewrite rules, but it was not understood how such rules could be used for analysis.
This was the situation in the early 1980s, when Kimmo Koskenniemi came to a conference on parsing that Lauri Karttunen had organized at the University of Texas at Austin, where he also met Ronald M. Kaplan and Martin Kay. The four K's discovered that all of them were interested in, and had been working on, the problem of morphological analysis. This was the beginning of Two-Level Morphology, the first general model in the history of computational linguistics for the analysis and generation of morphologically complex languages.
The language-specific components, the lexicon and the rules, were combined with a runtime engine applicable to all languages. In this article we trace the development of the finite-state technology that Two-Level Morphology is based on. The rewrite rules of generative phonology are, in mathematical terms, context-sensitive rewrite rules [Partee et al.]. C. Douglas Johnson observed that while the same context-sensitive rule could be applied several times recursively to its own output, phonologists have always implicitly assumed that the site of application moves to the right or to the left in the string after each application.
Johnson demonstrated that the effect of this constraint is that the pairs of inputs and outputs of any such rule can be modeled by a finite-state transducer. Unfortunately, this result was largely overlooked at the time and was rediscovered by Ronald M. Kaplan and Martin Kay. Any cascade of rule transducers could in principle be composed into a single transducer that maps lexical forms directly into the corresponding surface forms, and vice versa, without any intermediate representations.
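For illustration, the effect of such a cascade can be imitated with ordinary string rewriting. Here is a minimal Python sketch based on the book's well-known kaNpat example, where an underspecified nasal N surfaces as m before p, and p then assimilates to a preceding m (the rule names and regular expressions are my own):

```python
import re

# Two ordered rewrite rules (illustrative reconstruction of kaNpat):
# Rule 1: N is realized as m before p.
# Rule 2: p is realized as m after m.
def rule1(s):
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    return re.sub(r"(?<=m)p", "m", s)

def cascade(s):
    # Applying the rules in a fixed order mimics composing the two
    # transducers into a single lexical-to-surface mapping.
    return rule2(rule1(s))

print(cascade("kaNpat"))  # kammat
```

Running the cascade on the lexical form kaNpat yields kampat after the first rule and kammat after the second, just as a single composed transducer would in one step.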
These theoretical insights did not immediately lead to practical results. The development of a compiler for rewrite rules turned out to be a very complex task. It became clear that it required as a first step a complete implementation of basic finite-state operations such as union, intersection, complementation, and composition.
Developing a complete finite-state calculus was a challenge in itself on the computers that were available at the time. Another reason for the slow progress may have been that there were persistent doubts about the practicality of the approach for morphological analysis.
Traditional phonological rewrite rules describe the correspondence between lexical forms and surface forms as a one-directional, sequential mapping from lexical forms to surface forms.
Even if it was possible to model the generation of surface forms efficiently by means of finite-state transducers, it was not evident that this would lead to an efficient analysis procedure going in the reverse direction, from surface forms to lexical forms. This asymmetry is an inherent property of the generative approach to phonological description.
If all the rules are deterministic and obligatory and the order of the rules is fixed, each lexical form generates only one surface form. But a surface form can typically be generated in more than one way, and the number of possible analyses grows with the number of rules that are involved.
But in order to look the candidate lexical forms up in the lexicon, the system must first complete the analysis. Depending on the number of rules involved, a surface form could easily have dozens of potential lexical forms, even an infinite number in the case of certain deletion rules.
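The overanalysis problem can be made concrete by naively inverting such a cascade: every surface m might, a priori, be an underlying m, N, or p. A toy sketch, reusing the illustrative kaNpat-style rules (the symbols and rules are illustrative, not from a real grammar):

```python
import re
from itertools import product

# Toy ordered rules (illustrative): N -> m before p; p -> m after m.
def cascade(s):
    s = re.sub(r"N(?=p)", "m", s)
    return re.sub(r"(?<=m)p", "m", s)

def candidate_analyses(surface):
    # Naive inversion: each surface m might be an underlying m, N, or p.
    options = [("m", "N", "p") if c == "m" else (c,) for c in surface]
    return ["".join(t) for t in product(*options)
            if cascade("".join(t)) == surface]

print(candidate_analyses("kammat"))  # ['kammat', 'kampat', 'kaNpat']
```

Even with just two rules, the single surface form kammat has three possible underlying sources; with a realistic rule set the candidate space grows much faster.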
Although the generation problem had been solved by Johnson, Kaplan and Kay, at least in principle, the problem of efficient morphological analysis in the Chomsky-Halle paradigm was still seen as a formidable challenge.
As counterintuitive as it was from the psycholinguistic point of view, it appeared that analysis was computationally much harder than generation. Composing the rule cascade does not remove the difficulty: because the resulting single transducer is equivalent to the original cascade, the ambiguity remains.
The solution to the overanalysis problem should have been obvious: with the lexicon included in the composition, all the spurious ambiguities produced by the rules are eliminated at compile time. The runtime analysis becomes more efficient because the resulting single transducer contains only lexical forms that actually exist in the language.
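A minimal sketch of the effect of that composition, using a hypothetical one-word lexicon and the illustrative kaNpat-style rules; filtering candidates against the lexicon at runtime gives the same result that composing the lexicon in at compile time would:

```python
import re
from itertools import product

LEXICON = {"kaNpat"}  # hypothetical miniature lexicon of valid lexical forms

# Toy ordered rules (illustrative): N -> m before p; p -> m after m.
def cascade(s):
    s = re.sub(r"N(?=p)", "m", s)
    return re.sub(r"(?<=m)p", "m", s)

def analyze(surface):
    # Composing the lexicon into the transducer has the same effect as
    # this runtime filter: spurious lexical forms never survive.
    options = [("m", "N", "p") if c == "m" else (c,) for c in surface]
    return ["".join(t) for t in product(*options)
            if "".join(t) in LEXICON and cascade("".join(t)) == surface]

print(analyze("kammat"))  # ['kaNpat']
```

Of the three candidate analyses of kammat, only kaNpat exists in the lexicon, so it is the only analysis returned.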
The idea of composing the lexicon and the rules together is not mentioned in Johnson’s book or in the early Xerox work.
Although there obviously had to be some interface relating a lexicon component to a rule component, these were traditionally thought of as different types of objects. Furthermore, rules were traditionally conceived as applying to individual word forms; the idea of applying them simultaneously to a lexicon as a whole required a new mindset and computational tools that were not yet available. The fact that finite-state networks could be used to represent both the inventory of valid lexical forms and the rules for mapping them to surface forms took a while to emerge.
When the idea first appeared in print [Karttunen et al.], the authors were not yet aware of Johnson's publication. Xerox had begun work on the finite-state algorithms, but they would prove to be many years in the making. Koskenniemi was not convinced that efficient morphological analysis would ever be practical with generative rules, even if they were compiled into finite-state transducers.
Some other way to use finite automata might be more efficient. Back in Finland, Koskenniemi invented a new way to describe phonological alternations in finite-state terms. Instead of cascaded rules with intermediate stages and the computational problems they seemed to lead to, rules could be thought of as statements that directly constrain the surface realization of lexical strings. The rules would not be applied sequentially but in parallel.
Two-level morphology is based on three ideas:

1. Rules are symbol-to-symbol constraints that are applied in parallel, not sequentially like rewrite rules.
2. The constraints can refer to the lexical context, to the surface context, or to both contexts at the same time.
3. Lexical lookup and morphological analysis are performed in tandem.

To illustrate the first two principles we can turn to the kaNpat example. Two-level rules may refer to both sides of the context at the same time.
The y/ie alternation in English plural nouns (spy ~ spies) could be described by two rules. Like replace rules, two-level rules describe regular relations, but there is an important difference.
Because the zeros in two-level rules are in fact ordinary symbols, a two-level rule represents an equal-length relation. This has an important consequence: although finite-state transducers cannot in general be intersected, Koskenniemi's constraint transducers can be, because equal-length relations are closed under intersection. In fact, the apply function that maps surface strings to lexical strings, or vice versa, using a set of two-level rules in parallel simulates the intersection of the rule automata.
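For instance, the y/ie alternation in English plurals becomes an equal-length pair string once the plural morpheme boundary + is kept as a symbol: lexical spy+s aligns pair by pair with surface spies. A hypothetical reconstruction of the two constraints (the book's actual rules and notation differ in detail):

```python
def check_y_ie(pairs):
    # Illustrative two-level constraints over aligned (lexical, surface)
    # symbol pairs -- a hypothetical reconstruction, not the book's rules:
    # Rule 1: y:i <=> _ +:     y surfaces as i iff followed by the boundary
    # Rule 2: +:e <=> y: _ s:  the boundary surfaces as e iff between y and s
    for i, (lex, surf) in enumerate(pairs):
        if lex == "y":
            ctx = i + 1 < len(pairs) and pairs[i + 1][0] == "+"
            if (surf == "i") != ctx:
                return False
        if lex == "+":
            ctx = (i > 0 and pairs[i - 1][0] == "y"
                   and i + 1 < len(pairs) and pairs[i + 1][0] == "s")
            if (surf == "e") != ctx:
                return False
    return True

print(check_y_ie(list(zip("spy+s", "spies"))))  # True
```

Because both strings have the same length symbol for symbol, each constraint is a predicate over the same pair alphabet, which is what makes intersection of the rule automata possible.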
The apply routine also simulates, at the same time, the composition of the input string with the constraint networks, just like the ordinary apply function. At each point in the process, all lexical candidates corresponding to the current surface symbol are considered one by one. If both rules accept the pair, the process moves on to the next point in the input.
Spurious candidates will all be rejected by one constraint or another. Applying the rules in parallel does not in itself solve the overanalysis problem discussed in the previous section. However, the problem is easy to manage in a system that has only two levels: the possible upper-side symbols are constrained at each step by consulting the lexicon. In Koskenniemi's two-level system, lexical lookup and the analysis of the surface form are performed in tandem.
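A minimal sketch of this parallel checking, with the two kaNpat constraints written as iff-conditions over aligned (lexical, surface) pairs (an illustrative reconstruction, not the book's exact notation):

```python
def check(pairs):
    """Apply two two-level constraints in parallel to aligned
    (lexical, surface) symbol pairs. Illustrative rules:
      N:m <=> _ p:   N surfaces as m iff the next lexical symbol is p
      p:m <=> :m _   p surfaces as m iff the previous surface symbol is m
    """
    for i, (lex, surf) in enumerate(pairs):
        if lex == "N":
            before_p = i + 1 < len(pairs) and pairs[i + 1][0] == "p"
            if (surf == "m") != before_p:
                return False
        if lex == "p":
            after_m = i > 0 and pairs[i - 1][1] == "m"
            if (surf == "m") != after_m:
                return False
    return True

print(check(list(zip("kaNpat", "kammat"))))  # True: both rules accept every pair
print(check(list(zip("kaNpat", "kampat"))))  # False: p:p violates the second rule
```

Both constraints inspect every pair simultaneously; there is no intermediate string between the lexical and surface levels.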
The lexicon acts as a continuous lexical filter.
The analysis routine only considers symbol pairs whose lexical side matches one of the outgoing arcs in the current state.
In the Xerox lexc tool, the lexicon is a minimized network, typically a transducer, but the filtering principle is the same. The lookup utility in lexc matches the lexical string proposed by the rules directly against the lower side of the lexicon. It does not pursue analyses that have no matching lexical path. Furthermore, the lexicon may be composed with the rules in lexc to produce a single transducer that maps surface forms directly to lexical forms, and vice versa.
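The filtering can be sketched with a simple trie standing in for the minimized lexicon network. The candidate table below is hypothetical, and the two-level rule checks are omitted so that only the lexicon filter is shown:

```python
def build_trie(words):
    # Build a character trie; "#" marks the end of a lexical form.
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["#"] = {}
    return root

# Hypothetical inverse-rule table: which lexical symbols can underlie
# a given surface symbol (here, surface m may be lexical m, N, or p).
LEX_CANDIDATES = {"m": ["m", "N", "p"]}

def analyze(surface, trie):
    # Lexical lookup and analysis proceed in tandem: at each surface
    # symbol we only pursue lexical symbols with an outgoing trie arc,
    # so the lexicon acts as a continuous filter.
    results = []

    def walk(i, node, lexical):
        if i == len(surface):
            if "#" in node:
                results.append(lexical)
            return
        for lex in LEX_CANDIDATES.get(surface[i], [surface[i]]):
            if lex in node:
                walk(i + 1, node[lex], lexical + lex)

    walk(0, trie, "")
    return results

trie = build_trie(["kaNpat"])
print(analyze("kammat", trie))  # ['kaNpat']
```

The spurious candidates (kammat, kampat) are never even generated, because no trie arc matches them at the point where they diverge from kaNpat.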
Koskenniemi’s two-level morphology was the first practical general model in the history of computational linguistics for the analysis of morphologically complex languages.
The language-specific components, the rules and the lexicon, were combined with a universal runtime engine applicable to all languages. The original implementation was primarily intended for analysis, but the model was in principle bidirectional and could be used for generation.
Linguistic Issues

Although the two-level approach to morphological analysis was quickly accepted as a useful practical method, the linguistic insight behind it was not picked up by mainstream linguists. The idea of rules as parallel constraints between a lexical symbol and its surface counterpart was not taken seriously at the time outside the circle of computational linguists.
Many arguments had been advanced in the literature to show that phonological alternations could not be described or explained adequately without sequential rewrite rules. It went largely unnoticed that two-level rules could have the same effect as ordered rewrite rules, because two-level rules allow the realization of a lexical symbol to be constrained either by the lexical side or by the surface side.
The traditional arguments for rule ordering were based on the a priori assumption that a rule can refer only to the input context.
But the theoretical landscape has since changed, most notably with the rise of Optimality Theory. There are of course many other differences between two-level rules and OT constraints. Most importantly, OT constraints are meant to be universal. The fact that two-level rules can describe orthographic idiosyncrasies such as the y/ie alternation in English with no help from universal principles makes the approach uninteresting from the OT point of view.
The colon in the right context of the first rule, p:, indicates that the constraint refers only to the lexical symbol: how the p itself is realized is irrelevant. The semantics of two-level rules were well defined, but there was no rule compiler available at the time. Koskenniemi and other early practitioners of two-level morphology had to compile their rules by hand into finite-state transducers.
This is tedious in the extreme and demands a detailed understanding of transducers and rule semantics that few human beings can be expected to grasp. In practice, linguists using two-level morphology consciously or unconsciously tended to postulate quite surfacy lexical strings, which kept the two-level rules relatively simple.
Although two-level rules are formally quite different from the rewrite rules studied by Kaplan and Kay, the basic finite-state methods that had been developed for compiling rewrite rules were applicable to two-level rules as well. In both formalisms, the most difficult case is a rule in which the symbol that is replaced or constrained also appears in the context part of the rule. Kaplan and Kay had already solved this problem with an ingenious technique for introducing and then eliminating auxiliary symbols that mark context boundaries.
Kaplan and Koskenniemi worked out the basic compilation algorithm for two-level rules one summer when Koskenniemi was a visitor at Stanford. The first two-level rule compiler was written in InterLisp by Koskenniemi and Karttunen, using Kaplan's implementation of the finite-state calculus [Koskenniemi; Karttunen et al.].
The landmark article by Kaplan and Kay on the mathematical foundations of finite-state linguistics gives a compilation algorithm for phonological rewrite rules and for Koskenniemi's two-level rules. In the course of this work, it soon became evident that the two-level formalism was difficult for linguists to master.
It is far too easy to write rules that conflict with one another. It was necessary to make the compiler check for, and automatically eliminate, the most common types of conflicts. For example, in Finnish consonant gradation, an intervocalic k generally disappears in the weak grade.