EFFECTIVE ALGORITHM FOR PARSING SENTENCES USING SEMANTICALLY ATTRIBUTED WEIGHTED AFFIX CONTEXT-FREE GRAMMAR

Context. The problem of increasing the efficiency of affix grammars over a finite lattice (AGFL) is considered. AGFL is a context-free grammar formalism with a flexible and compact form of productions for parsing texts in natural languages. Objective. The goal of the work is to increase the efficiency of parsing sentences with AGFL through a modification that adds semantical attributes to the productions and introduces a new form of production called the “template production”. This modification decreases the number of productions required to describe a language and reduces the computational complexity of the parsing algorithm. Method. A mathematical model of the template production is developed, and a theorem is proved stating that a normal form of the template production exists and that the normalization procedure produces an equivalent grammar. The normal form is used to increase the efficiency of parsing Ukrainian sentences. The template production helps to represent ontology-based rules in a short and computationally inexpensive way. The normal form of the template production is studied, and an efficient algorithm for parsing sentences is proposed. The worst-case complexity of the proposed algorithm is cubic in $n$ and grows with $m_p$ and $m_r$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbol and attributes that can produce the same string of terminals, and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part. When parsing sentences from the test database of Ukrainian fiction literature, the parsing time turned out to grow almost linearly with the number of words in a sentence. Results. The developed method has been implemented in the UkrParser software, which is available as open source on GitHub. Conclusions. The developed algorithm was tested on the database of Ukrainian sentences and parsed them about ten times faster than the Stanford parser.
Future research can focus on the development of grammatically attributed ontologies for a wider set of topics, which should improve the results of semantical sentence parsing.

$w$ is a multiplicative weight of a production. The weight symbol is omitted where it is equal to 1.

INTRODUCTION
The problem of natural text parsing arises in such areas of computer applications as text summarization, machine translation, information retrieval, document classification, human-computer interaction, question answering systems, social networks monitoring, expert systems, etc.
The task of semantic parsing is a complex problem of artificial intelligence because its comprehensive solution requires the construction of a complete human knowledge model. Although such models are currently under development [1], no viable solution is available yet.
We propose an approach for partial syntactic and semantic parsing by means of a weighted affix grammar over a finite lattice (WAGFL). WAGFL combines the benefits of probabilistic context-free grammars (PCFG) [2] and affix grammars over a finite lattice (AGFL) developed by C.H.A. Koster [3]. Weighted and stochastic grammars are known to be equally expressive [4], but the approach based on weights is less restrictive and thus more flexible.
This article supports an approach where semantic analysis is integrated into the syntax parsing algorithm. This approach helps to decrease the number of intermediate constructions that have to be considered. It is especially important for languages with flexible word order, such as Ukrainian and other Slavic languages.
The main contribution of this work is an approach to the effective representation of a weighted affix context-free grammar using a special form of "template productions". A short review of the existing methods is given in Section 2, "template productions" and the algorithm for parsing sentences are introduced in Section 3, experiments are described in Section 4, parsing results are given in Section 5, and the results are discussed in Section 6.

PROBLEM STATEMENT
The problem is to develop effective methods for integrating semantic attributes into productions of a weighted affix context-free grammar and to develop a computationally effective algorithm for parsing sentences. The sentence is considered as a list of words $w_1 w_2 \ldots w_n$ that is converted to a sequence of terminal symbols $t_1 t_2 \ldots t_n$ of the WAGFL grammar. Sentence parsing is formulated as the problem of finding a sequence of productions that has the maximum weight and can be applied sequentially to the starting symbol $S$ to produce the given sequence of terminals $t_1 t_2 \ldots t_n$. The weight of the sequence is calculated as the product of the weights of all contained productions.
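The objective above can also be written compactly. The notation in this sketch is an assumption introduced only for illustration: $d = (p_1, \ldots, p_k)$ stands for a sequence of productions, $w(p_i)$ for the weight of production $p_i$, and $W(d)$ for the weight of the whole sequence.

```latex
\[
  W(d) = \prod_{i=1}^{k} w(p_i), \qquad
  d^{*} = \arg\max_{d \,:\, S \Rightarrow_{d} t_1 t_2 \ldots t_n} W(d)
\]
```

Here $S \Rightarrow_{d} t_1 t_2 \ldots t_n$ means that applying the productions of $d$ in order to the starting symbol $S$ yields the given sequence of terminals.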

REVIEW OF THE LITERATURE
The problem of syntactic sentence parsing has been studied for a long time. Among many methods of parsing sentences, the approach based on generative grammars introduced by Noam Chomsky [5] is one of the most studied. Extended affix grammars (EAG) [6] and probabilistic context-free grammars (PCFG) [2] are fundamental extensions of generative grammars that are widely used in linguistic applications nowadays.
Affix grammars, which belong to the family of two-level grammars, are a subset of augmented grammars. Productions of affix grammars are the productions that are extended with attributes. The domain of attributes is defined by a meta-grammar.
The efficient affix grammars over a finite lattice (AGFL) formalism and its parsing algorithm were developed by C. H. A. Koster [3]. The formalism imposes restrictions on the set of productions and attributes to make the parsing computationally inexpensive. However, it remains expressive enough to parse most natural sentence structures. AGFL extensions that are based on probabilities were also studied by T. C. Smith and J. G. Cleary [7].
Our approach is based on a weighted affix grammar over a finite lattice. It is close to the method introduced by C.H.A. Koster. However, we formulate the lattice grammar and productions in a slightly different way, which leads to a shorter form of productions and a more compact sentence parsing algorithm.

MATERIALS AND METHODS
For the purpose of partial semantic-syntactic parsing of sentences, a new parser was developed. It is based on the weighted affix grammar over a finite lattice. This grammar extends symbols of a generative grammar with affixes, which can be used to decrease the number of productions required to describe a language. Our definition of the affix grammar over a finite lattice is slightly different from the original given by C.H.A. Koster, but it follows the same idea. This new definition was used to prove that some transformation rules can be applied to the grammar to speed up the process of parsing.
The weighted affix grammar over a finite lattice $G$ is defined as a 5-tuple (...), in which ... represents all non-empty strings of attributed ...
Terminal symbols $t_i \in T$ do not have attributes. They usually represent words of parsed sentences. For example, the word "student" can be a male or female singular noun until it is known from the context. In terms of generative grammar, it can be written in the following way: ... It represents both cases given above. Productions that generate terminal symbols are added by a morphological parser. If some word is a homograph, the morphological parser generates one production for every meaning of the word. The weight of each production represents the admissibility of this meaning in the parsed context. In the example above, ... are grammatical attributes, and $a_{STUDENT}$ is a semantical attribute. Semantical attributes are elements of domain $D_{SEM}$.
Providing regular productions for all possible combinations of affixes can be inefficient. Thus, a template form of production is introduced. This form is tailored to the needs of computationally efficient language processing.
The template production has the form ..., where ..., $A_1^{req}, \ldots, A_m^{req} \subset A$ are required affixes for symbols in the right part of the production, and $w$ is a multiplicative weight of the production.
The template form is equivalent to a set of regular productions by definition. Consider the following template and regular productions (1) and (2): ... Let ... denote the intersection of all attributes that should be uniform in the right part of regular production $p$ in order to conform to template production $q$: ... We say that regular production $p$ conforms to template production $q$ if requirements R1-R3 are met: ...
Requirement R1 ensures that for each unified domain there is at least one common affix. Requirement R2 describes how required attributes are treated, and requirement R3 states how attributes of symbols in the left part of the production are obtained.
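The idea behind requirement R1 (each unified domain must keep at least one common affix) can be illustrated with a small sketch. This is not the formal definition of R1-R3: the domain names, affix names, and the function `unify` are hypothetical and chosen only for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Set

# Hypothetical affix domains; the names are illustrative only.
DOMAINS: Dict[str, FrozenSet[str]] = {
    "CASE": frozenset({"NOM", "GEN", "ACC"}),
    "GENDER": frozenset({"MASC", "FEM", "NEUT"}),
    "NUMBER": frozenset({"SING", "PLUR"}),
}


def unify(affix_sets: List[Set[str]], unified_domains: List[str]) -> Optional[Set[str]]:
    """Intersect the affixes of several symbols, one domain at a time.

    Returns the affixes shared by all symbols in the unified domains,
    or None when some unified domain has no common affix.
    """
    common: Set[str] = set()
    for name in unified_domains:
        shared = set(DOMAINS[name])
        for affixes in affix_sets:
            shared &= affixes
        if not shared:
            return None  # no common affix in this domain: the check fails
        common |= shared
    return common


# An adjective and a noun that agree in case, gender, and number.
adjective = {"NOM", "ACC", "FEM", "SING"}
noun = {"NOM", "FEM", "SING"}
agreed = unify([adjective, noun], ["CASE", "GENDER", "NUMBER"])

# A gender mismatch leaves the GENDER domain without a common affix.
mismatch = unify([{"NOM", "MASC", "SING"}, noun], ["CASE", "GENDER", "NUMBER"])
```

In this sketch `agreed` holds the shared affixes and `mismatch` is `None`, which mirrors the way a regular production would fail to conform when a unified domain has no common affix.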
For instance, the Ukrainian equivalent of the English noun phrase "BEAUTIFUL STREET OF THE CITY" is "ГАРНА ВУЛИЦЯ МІСТА". In this noun phrase, the case, gender, and number of the adjective (ГАРНА) agree with the case, gender, and number of the first noun (ВУЛИЦЯ), and the case of the second noun (МІСТА) should be GENITIVE. The semantical attribute for the whole phrase is taken from the word "STREET". The template production for this phrase in Ukrainian is ...
The Normal Form of Template Productions. The length of the right-hand side of a production is called its rank. Effective parsing of sentences using generative grammars can be achieved when the grammar is in Chomsky normal form (CNF), the form that ensures that all productions of the grammar have rank at most 2. Template productions can also be converted to a form that has at most two symbols in the right part. This conversion is performed by applying simplification steps to all productions that have rank greater than 2. Every step takes one template production with rank $m > 2$ and produces two template productions, one with rank 2 and one with rank $m - 1$. The process stops when there are no more productions with rank 3 or above.
The simplification step takes one template production $q$ of the form (1) and produces two template productions, $q_1$ and $q_2$: ..., where $v'_2$ is a new non-terminal symbol.
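The rank-reduction loop can be sketched as follows. This is a simplified illustration that ignores affixes and required attributes entirely. Grouping the first two symbols under the new non-terminal, naming it with a numeric suffix, and giving the rank-2 production weight 1 are all assumptions made for the example, not details confirmed by the text.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Tuple


@dataclass(frozen=True)
class Production:
    left: str
    right: Tuple[str, ...]
    weight: float = 1.0


def binarize(productions: List[Production]) -> List[Production]:
    """Split every production of rank m > 2 into rank 2 and rank m - 1."""
    fresh = count(1)
    queue = list(productions)
    result: List[Production] = []
    while queue:
        q = queue.pop()
        if len(q.right) <= 2:
            result.append(q)
            continue
        # Hypothetical naming scheme for the new non-terminal symbol.
        new_symbol = f"{q.left}'{next(fresh)}"
        # Rank-2 production over the first two symbols (weight 1 is assumed).
        result.append(Production(new_symbol, q.right[:2], 1.0))
        # Rank m-1 production keeps the original weight, so the product
        # of weights along a derivation is unchanged.
        queue.append(Production(q.left, (new_symbol,) + q.right[2:], q.weight))
    return result


grammar = [
    Production("NP", ("ADJ", "N", "N", "PP"), 0.5),
    Production("S", ("NP", "VP"), 1.0),
]
binary_grammar = binarize(grammar)
```

Each pass lowers the rank of one production by one, so the loop terminates once no production has rank 3 or above.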
Theorem 1: The grammar obtained from the original grammar $G$ by replacing template production $q$ with template productions $q_1$ and $q_2$ produces the same language.
In order to prove this, it is sufficient to show that: 1) all regular productions of form (2) that conform to template production (1) can be split into two productions $p_1$ and $p_2$ that conform to template productions $q_1$ and $q_2$, respectively; 2) all productions that conform to template productions $q_1$ and $q_2$ define the same grammar as productions that conform to template production $q$.
1) The First Part of the Proof: Suppose that production $p$ conforms to template production $q$. It should be proven that there exists a split of $p$ into two productions $p_1$ and $p_2$ such that $p_1$ conforms to $q_1$ and $p_2$ conforms to $q_2$. This split can be found by the assignment ...
2) The Second Part of the Proof: Given that $p_1$ conforms to $q_1$ and $p_2$ conforms to $q_2$, it can be proved that they can be composed into a single production that conforms to $q$.
First of all, it should be noted that symbol ... Therefore, requirements R1 and R3 are satisfied for productions $p$ and $q$ because they are satisfied for $p_1$ and $q_1$; requirement R2 follows from the fact that $p_1$ and $p_2$ conform to $q_1$ and $q_2$, respectively. Thus, production $p$, which is obtained from $p_1$ and $p_2$, conforms to template production $q$. This concludes the proof of Theorem 1.
Algorithm for Parsing Sentences. The problem of sentence parsing is formulated as the problem of finding a sequence of productions that has the maximum weight and can be applied sequentially to some starting attributed symbol to produce a given sequence of terminals $t_1 t_2 \ldots t_n$. The weight of the sequence is calculated as the product of the weights of all contained productions. If the right part of a production contains only one symbol, the weight of the production should not exceed 1. This avoids cyclic productions that could increase the weight of non-terminal symbols during the bottom-up parsing procedure.
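The reason for the restriction on rank-1 productions can be seen from a small worked case. The symbols $A$, $B$ and weights $w_1$, $w_2$ below are hypothetical and serve only as an illustration. Suppose the grammar contains the cycle $A \to B$ with weight $w_1$ and $B \to A$ with weight $w_2$. Passing through the cycle $k$ times multiplies the weight of a derivation by

```latex
\[
  (w_1 w_2)^{k}, \qquad
  \lim_{k \to \infty} (w_1 w_2)^{k} = \infty \ \text{ if } w_1 w_2 > 1, \qquad
  (w_1 w_2)^{k} \le 1 \ \text{ if } w_1 \le 1 \text{ and } w_2 \le 1 .
\]
```

With both weights at most 1, repeating the cycle can never raise the weight, so the maximum-weight derivation is well defined.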
The developed algorithm for parsing sentences is based mostly on the probabilistic CYK algorithm. The main difference is that symbols are compared not only by weight but also by the set of affixes. The algorithm uses the notion of a weighted attributed symbol, a 3-tuple $(w, v, A_v)$ that contains weight $w$, symbol $v$, and set of affixes $A_v$. We say that weighted attributed symbol ...
In the worst-case scenario, the computational complexity of the algorithm is cubic in $n$ and grows with $m_p$ and $m_r$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbols and attributes that can produce the same string of terminals (this value can be treated as the ambiguity of the language being parsed), and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part.
The parsing algorithm can be described by the following pseudocode:
Input. String of terminals $s = t_1 t_2 \ldots t_n$.
Output. Sequence of productions that produce string $s$.
...
If $t$ is not dominated by any element of $L$, then append $t$ to $L$.
Add elements of $L$ to $P[k, j]$.
If $P[1, n]$ doesn't contain any triple $(w, S, A_S)$, where $S$ is the starting symbol of the grammar, the parsing is impossible. If it does, select a triple with the maximum weight $w$ among them and reconstruct all productions that are required to produce string $t_1 t_2 \ldots t_n$.
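A minimal sketch of a weighted CYK-style parser over attributed symbols is given below. It is not the UkrParser implementation. The dominance test (same symbol, affix superset, weight not smaller), the `agree` flag that stands in for affix unification, and all names are assumptions made for illustration. The sketch handles only rank-2 productions, uses 0-based positions, and returns weights without reconstructing the productions.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple


class Item(NamedTuple):
    weight: float
    symbol: str
    affixes: FrozenSet[str]


class Rule(NamedTuple):
    left: str
    first: str
    second: str
    weight: float
    agree: bool  # hypothetical flag: the two children must share an affix


def dominates(u: Item, t: Item) -> bool:
    # Assumed dominance: same symbol, no fewer affixes, no smaller weight.
    return u.symbol == t.symbol and u.affixes >= t.affixes and u.weight >= t.weight


def add_item(cell: List[Item], t: Item) -> None:
    # Append t only if no element already in the cell dominates it.
    if not any(dominates(u, t) for u in cell):
        cell.append(t)


Table = Dict[Tuple[int, int], List[Item]]


def parse(lexical: List[List[Item]], rules: List[Rule]) -> Table:
    n = len(lexical)
    # Index rules by the first symbol of the right part.
    by_first: Dict[str, List[Rule]] = {}
    for r in rules:
        by_first.setdefault(r.first, []).append(r)
    table: Table = {}
    for i, items in enumerate(lexical):
        cell: List[Item] = []
        for t in items:
            add_item(cell, t)
        table[(i, i)] = cell
    for length in range(2, n + 1):
        for k in range(0, n - length + 1):
            j = k + length - 1
            cell = []
            for split in range(k, j):
                for a in table[(k, split)]:
                    for r in by_first.get(a.symbol, []):
                        for b in table[(split + 1, j)]:
                            if b.symbol != r.second:
                                continue
                            affixes = a.affixes & b.affixes if r.agree else a.affixes
                            if r.agree and not affixes:
                                continue
                            add_item(cell, Item(a.weight * b.weight * r.weight, r.left, affixes))
            table[(k, j)] = cell
    return table


def best(table: Table, n: int, start: str = "S") -> Optional[Item]:
    candidates = [t for t in table[(0, n - 1)] if t.symbol == start]
    return max(candidates, key=lambda t: t.weight) if candidates else None


# Toy input: an adjective followed by a noun with two morphological readings.
words = [
    [Item(1.0, "ADJ", frozenset({"NOM", "FEM"}))],
    [Item(0.6, "N", frozenset({"NOM", "FEM"})), Item(0.4, "N", frozenset({"GEN", "FEM"}))],
]
toy_rules = [Rule("S", "ADJ", "N", 1.0, True)]
toy_table = parse(words, toy_rules)
toy_best = best(toy_table, len(words))
```

Indexing the rules by the first symbol of the right part corresponds to the quantity $m_r$ in the complexity estimate, and the number of items kept per cell corresponds to $m_p$. In the toy input, the second reading of the noun yields a candidate that is dominated by the first one, so only one item survives in the top cell.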

EXPERIMENTS
The algorithm for parsing sentences was implemented in the UkrParser open source software project. This project contains classes for morphological analysis and a sentence parser. The morphology for the Ukrainian language is implemented in the com.langproc.UkrainianISpellMorphology and com.langproc.UkrainianGrammarlyMorphology classes. The first class is based on the open source project iSpell-uk by Andriy Rysin, and the second is based on a Ukrainian morphology database graciously provided by Mariana Romanyshyn from Grammarly. The algorithm for parsing sentences is implemented in the com.langproc.APCFGParser class, and productions for parsing Ukrainian sentences are placed in the com.langproc.APCFGUkrainian class.
Computational efficiency of the developed algorithm was tested on a database of 500 sentences from the "Fata Morgana" story by Michael Kotsyubynsky. The average sentence parsing time as a function of the sentence length is depicted in Fig. 1. These results were obtained on a computer with a 2.4 GHz Intel Core i5 CPU. The growth of parsing time turned out to be almost linear, notwithstanding the worst-case cubic estimate provided in Section 3.

RESULTS
The developed approach for mixed semantic and syntactic sentence parsing was used for parsing and translation of the annotated Ukrainian Sign language and the Ukrainian Spoken language [8]. In that task, the translation based on the parser that used productions generated from ontologies outperformed the parser that used only syntactic productions by 25 percentage points (90% of correct translations compared to 65% when using only syntactical productions).
An example of parsing the sentence "Моя донька ходить у дитячий садок" (My daughter attends nursery school) by means of the developed method is shown in Figure 2.
In this example the following rules were added from subject area "Education": NG

DISCUSSION
The experimental results on the database of Ukrainian sentences show a significant speed-up in comparison with well-known context-free grammar parsers. This result was achieved by using a compact form of production with syntactical and semantical attributes. In comparison with the Stanford Parser, the average sentence parsing time decreased by a factor of about 10.

CONCLUSIONS
This article demonstrated an efficient algorithm for parsing sentences by means of a weighted affix context-free grammar with semantical attributes. The developed algorithm is based on the normal form of the "template productions" that were introduced. The algorithm has worst-case cubic complexity, which turned out to be almost linear on a real example.
The obtained sentence parsing trees are more semantically rich than the parsing trees obtained by means of a regular syntactic parser. The additional computational cost is not very high because only hypernyms of words that are present in the sentence, together with the corresponding expressions, are included in the grammar.
Future research will focus on optimal weight assignment and automatic extraction of productions that are specific to a particular subject area.