Filter Bubble

STUDY OF SPELL CHECKING COMPLEXITY THROUGH COMPLEX NETWORK

INTRODUCTION:

Spell-checking is a natural language processing phenomenon refers to identifying misspelled words and correcting them in a electronic text document. Two type of spelling error can take place one is Real Word Error (RWE) and Non Word Error (NWE). If the misspell word itself is a part of lexicon of the domain language then it refers to RWE otherwise it will be NWE. e.g. ‘then’ in place of ‘than’ is an RWE while ‘lo’ in place of ‘go’ is an NWE. Identifying NWE is a trivial problem, however correcting the word is not that easy. Also it is almost impossible to find any RWE in the part of a context-insensitive spell checker. Often the complexity of building a perfect spell checker depends on the lexicon of a language itself. In this blog, we are going to discuss such factors through complex network concepts. The whole idea is taken from [1] ChoudhuryM 2007 et.al. subject to availability of language lexicon. Also we are going to discuss the spell checking complexity across English, Hindi and Bangla. Currently analyses with respect to RWE are only taken into consideration.

THE PROCESS:

As a first step, Spelling Network (SpelNet) was created. It is a undirected weighted graph G(V,E) where each word w in lexicon (Λ) is a vertex (w∈ V | w ∈ Λ). The lexicon is created from given data source .The edge between two words/ vertices (w1, w2) is attributed with a weight edit distance ed(w1,w2). Edit distance between two words can be calculated as minimum number of insertion, deletion and substitution ( each with weight 1) on one word needs to be done in order to get the other one. E.g. edit distance between fun and gun is 1 ( 0 insertion, 0 deletion , 1 substitution so total 1). Figure 1(a) shows a sample SpellNet over few words.

Figure 1: Sample SpellNet (a) and SpellNet with threshold (θ)

Each vertex is also associated with a weight called vertex weight Wt_V(w). Vertex weight of a node or word w is defined as frequency of occurrence of that word in document/ data source. The unweighted version of the graph is represented in figure 1 (B). This is constructed with the help of a threshold. The nodes in this graph remain same as that of weighted graph. We include an edge between(w1,w2) in current graph if the weight of the edge between them in weighted graph is less then threshold θ. There are some mathematical concepts and formulas mentioned for measuring various features of the network.

Degree of a node represents number of edges incident on the node. The factor P_k represents probability that a node in the graph will have degree k or more. The average degree <k> and weighted average degree <k_wt> of a node in the network is given by

<k> =

<k_wt> =

Where N is the number of nodes in the network. K_wt can be interpreted as the probability of RWE in a language. This is correlated as given bellow.

P_rwe(Λ) =

P_rwe represents probability of RWE in lexicon Λ.

The neighbor(w,d) defined as number of words in lexicon Λ that are at an edit distance d from w. So in terms of neighbor(w,d) the P_rwe can be defined as

P_rwe(Λ) =

For practicality it has been assumed that d is bounded by a mall integer. That is the number of errors simultaneously made on a word is always small (usually assumed to be 1 or a slowly growing function of the word length. Let’s assume this bound as

. So,

P_rwe (Λ)

Since

< 1, we can substitute

to get an upper bound on prwe(Λ), which gives

P_rwe (Λ)

The sum of neighbor(w,d) over d (from 1 to

) is nothing but k and k_wt depends on k(w) and Wt_V(w).

prwe(Λ) < C1 k_wt

Where c1 is some constant. for

= 1, prwe(Λ) ∝ k_wt. If we ignore the distribution of the words, that is if we assume p(w) = 1/N, then prwe(Λ) ∝ k. Thus, the quantity k_wt provides a good estimate of the probability of RWE in a language.

RESULTS AND ANALYSIS:

SpellNet was constructed for three languages English, Hindi, Bengali in order to do comparative study of lexicon effect on spell check complexity across these three languages. For Bengali and Hindi Brahmi-derived script and Devnagri scripts were used respectively. For English they used Roman script. Lexica of three languages were collected from open sources i.e for English source was www.audiencedialogue.org/susteng.html; for Hindi and Bengali it was Central Institute of Indian Languages. Around 10000 words were collected all together with near equal amount for each language. In the current study

is considered to 1, 3 and 5 respectively. Since

at 5 makes the whole graph connected so taking

more then 5 doesn’t have any significance. The statistical information obtained for all three SpellNets for all three value of

is as shown in bellow table.

TABLE 1: Statistical Analysis (M: number of Edges k: average degree and k_w weighted average degree)

The cumulative degree distribution of all three languages at different threshold is represented in figure 2

Figure 2: Cumulative degree distribution at different thresholds

(Figure 3: Scatter-plots for degree versus unigram frequency at different threshold for Hindi)

= 1, the average weighted degrees for Hindi, Bengali and English are 13.81, 7.73 and 6.61 respectively. Thus, the probability of RWE in Hindi is significantly higher than that of Bengali, which in turn is higher than that of English. Similar thing has also been observed for other values of

This is also evident from Figures 2, which show the distribution of Hindi to lie above that of Bengali, which lies above English (for all thresholds). The average degree k is substantially smaller (0.5 to 0.33 times) than the average weighted degree kwi for all the 9 SpellNets. This suggests that the higher degree nodes in SpellNet have higher node weight (i.e. occurrence frequency). Indeed, as shown in Figure 3 for Hindi, the high unigram frequency of a node implies higher degree, though the reverse is not true. The scatter-plots for the other languages are similar in nature.

Choudhury, Monojit, et al. "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach." arXiv preprint physics/0703198 (2007).
R. Albert and A. L. Barab´asi. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97.
M. Choudhury, A. Mukherjee, A. Basu, and N. Ganguly. 2006. Analysis and synthesis of the distribution of consonants over languages: A complex network approach. In Proceedings of the COLING/ACL Main Conference Poster Sessions, pages 128–135.

Thursday, 31 March 2016