Ugrás a tartalomhoz

## ALGORITHMS OF INFORMATICS vol. 3.

Antal Iványi

Kempelen Farkas Hallgatói Információs Központ

Chapter 26. Complexity of Words

## Chapter 26. Complexity of Words

The complexity of words is a continuously growing field of the combinatorics of words. Hundreds of papers are devoted to different kind of complexities. We try to present in this chapter far from being exhaustive the basic notions and results for finite and infinite words.

First of all we summarize the simple (classical) complexity measures, giving formulas and algorithms for several cases. After this, generalized complexities are treated, with different type of algorithms. We finish this chapter by presenting the palindrome complexity.

Finally, references from a rich bibliography are given.

## 26.1 Simple complexity measures

In this section simple (classical) complexities, as measures of the diversity of the subwords in finite and infinite words, are discussed. First, we present some useful notions related to the finite and infinite words with examples. Word graphs, which play an important role in understanding and obtaining the complexity, are presented in detail with pertinent examples. After this, the subword complexity (as number of subwords), with related notions, is expansively presented.

### 26.1.1 Finite words

Let be a finite, nonempty set, called alphabet. Its elements are called letters or symbols. A string , formed by (not necessary different) elements of , is a word. The length of the word is , and is denoted by . The word without any element is the empty word, denoted by (sometimes ). The set of all finite words over is denoted by . We will use the following notations too:

that is is the set of all finite and nonempty words over , whilst is the set of all words of length over . Obviously . The sets and are infinite denumerable sets.

We define in the binary operation called concatenation (shortly catenation). If and , then

This binary operation is associative, but not commutative. Its neutral element is because . The set with this neutral element is a monoid. We introduce recursively the power of a word:

, if .

A word is primitive if it is no power of any word, so is primitive if

For example, is a primitive word, whilst is not.

The word is periodic if there is a value , such that

and is the period of . The least such is the least period of .

The word is periodic with the least period .

Let us denote by the greatest common divisor of the naturals and . The following result is obvious.

Theorem 26.1 If is periodic, and and are periods, then is a period too, provided .

The reversal (or mirror image) of the word is . Obviously . If , then is a palindrome.

The word is a subword (or factor) of if there exist the words and such that . If , then is a proper subword of . If , then is a prefix of , and if , then is a suffix of . The set of all subwords of length of is denoted by . is the set of nonempty subwords of , so

For example, if , then

,

.

The words and are equal, if

and

, for .

Theorem 26.2 (Fine–Wilf) If and are words of length , respective , and if there are the natural numbers and , such that and have a common prefix of length , then and are powers of the same word.

The value in the theorem is tight. This can be illustrated by the following example. Here the words and have a common prefix of length , but and are not powers of the same word.

,      ,      ,

,         ,       .

By the theorem a common prefix of length 7 would ensure that and are powers of the same word. We can see that and have a common prefix of length 6 (), but and are not powers of the same word, so the length of the common prefix given by the theorem is tight.

### 26.1.2 Infinite words

Beside the finite words we consider infinite (more precisely infinite at right) words too:

The set of infinite words over the alphabet is denoted by . If we will study together finite and infinite words the following notation will be useful:

The notions as subwords, prefixes, suffixes can be defined similarly for infinite words too. The word is a subword of if there are the words , , such that . If , then is a prefix of , whilst is a suffix of . Here also represents the set of all subwords of length of .

Examples of infinite words over a binary alphabet:

1) The power word is defined as:

It can be seen that

,

The Champernowne word is obtained by writing in binary representation the natural numbers :

It can be seen that

,

3) The finite Fibonacci words can be defined recursively as:

,

, if .

From this definition we obtain:

,

,

,

,

,

,

.

The infinite Fibonacci word can be defined as the limit of the sequence of finite Fibonacci words:

The subwords of this word are:

,

.

The name of Fibonacci words stems from the Fibonacci numbers, because the length of finite Fibonacci words is related to the Fibonacci numbers: , i.e. the length of the th finite Fibonacci word is equal to the th Fibonacci number.

The infinite Fibonacci word has a lot of interesting properties. For example, from the definition, we can see that it cannot contain the subword 11. The number of 1's in a word will be denoted by . An infinite word is balanced, if for arbitrary subwords and of the same length, we have , i.e.

Theorem 26.3 The infinite Fibonacci word is balanced.

Theorem 26.4 has elements.

If word is concatenated by itself infinitely, then the result is denoted by .

The infinite word is periodic, if there is a finite word , such that . This is a generalization of the finite case periodicity. The infinite word is ultimately periodic, if there are the words and , such that .

The infinite Fibonacci word can be generated by a (homo)morphism too. Let us define this morphism:

Based on this definition, the function can be defined on letters only. A morphism can be extended for infinite words too:

The finite Fibonacci word can be generated by the following morphism:

In this case we have the following theorem.

Theorem 26.5 .

Proof. The proof is by induction. Obviously . Let us presume that for all . Because

,

by the induction hypothesis

.

From this we obtain:

Theorem 26.6 .

The infinite Fibonacci word is the fixed point of the morphism .

### 26.1.3 Word graphs

Let be a set of words of length over , and . We define a digraph, whose vertices are from , and whose arcs from . There is an arc from the vertex to the vertex if

that is the last letters in the first word are identical to the first letters in the second word. This arc is labelled by (or ).

#### De Bruijn graphs

If and , where , then the graph is called De Bruijn graph, denoted by .

Figures 26.1 and 26.2 illustrate De Bruijn graphs and .

To a walk in the De Bruijn graph we attach the label , which is obtained by maximum overlap of the vertices of the walk.

Footnote. In a graph a walk is a sequence of neighbouring edges (or arcs with the same orientation). If the edges or arcs of the walk are all different the walk is called trail, and when all vertices are different, the walk is a path.

In Figure26.1 in the graph the label attached to the walk (which is a path) is . The word attached to a Hamiltonian path (which contains all vertices of the graph) in the graph is an -type De Bruijn word. For example, words and are -type De Bruijn word. An -type De Bruijn word contains all words of length .

A connected digraph is Eulerian if the in-degree of each vertex is equal to its out-degree.

Footnote. A digraph (oriented graph) is connected if between every pair of vertices there is an oriented path at least in a direction.

Footnote. A digraph is Eulerian if it contains a closed oriented trail with all arcs of the graph.

Footnote. In-degree (out-degree) of a vertex is the number of arcs which enter (leave) this vertex.

Theorem 26.7 The De Bruijn graph is Eulerian.

Proof. a) The graph is connected because between all pair of vertices and there is an oriented path. For vertex there are leaving arcs, which enter vertices whose first letters are , and the last letters in this words are all different. Therefore, there is the path , .

b) There are incoming arcs to vertex from vertices , where ( is the alphabet of the graph, i.e. ). The arcs leaving vertex enter vertices , where . Therefore, the graph is Eulerian, because the in-degree and out-degree of each vertex are equal.

From this the following theorem is simply obtained.

Theorem 26.8 An oriented Eulerian trail of the graph (which contains all arcs of graph) is a Hamiltonian path in the graph , preserving the order.

For example, in the sequence 000, 001, 010, 101, 011, 111, 110, 100 of arcs is an Eulerian trail. At the same time these words are vertices of a Hamiltonian path in .

#### Algorithm to generate De Bruijn words

Generating De Bruijn words is a common task with respectable number of algorithms. We present here the well-known Martin algorithm. Let be an alphabet. Our goal is to generate an -type De Bruijn word over the alphabet .

We begin the algorithm with the word , and add at its right end the letter with the greatest possible subscript, such that the suffix of length of the obtained word does not duplicate a previously occurring subword of length . Repeat this until such a prolongation is impossible. When we cannot continue, a De Bruijn word is obtained, with the length . In the following detailed algorithm, is the -letters alphabet, and represents the result, an -type De Bruijn word.

Martin()

  1  FOR  TO    2    DO    3     4  REPEAT   5    TRUE   6       7    WHILE     8       DO IF        Not a subword.   9          THEN   10               11             FALSE  12             EXIT WHILE  13          ELSE   14  UNTIL   15  RETURN         .

Because this algorithm generates all letters of a De Bruijn word of length , and and are independent, its time complexity is . The more precise characterization of the running time depends on the implementation of line 8. The rEPEAT statement is executed times. The wHILE statement is executed at most times for each step of the rEPEAT. The test can be made in the worst case in steps. So, the total number of steps is not greater than , resulting a worst case bound . If we use Knuth-Morris-Pratt string mathching algorithm, then the worst case running time is .

In chapter 29 a more efficient implementation of the idea of Martin is presented.

Based on this algorithm the following theorem can be stated.

Theorem 26.9 An -type De Bruijn word is the shortest possible among all words containing all words of length over an alphabet with letters.

To generate all ()-type De Bruijn words the following recursive algorithm is given. Here is also an alphabet with letters, and represents an -type De Bruijn word. The algorithm is called for each position with .

All-De-Bruijn()

  1  FOR  TO    2    DO    3       IF        Not a subword.   4          THEN All-De-Bruijn()   5          ELSE IF    6             THEN print         A De Bruijn word.   7                EXIT FOR

The call of the procedure:

       FOR  TO           DO        ALL-DE-BRUIJN .

This algorithm naturally is exponential.

In following, related to the De Bruijn graphs, the so-called De Bruijn trees will play an important role.

A De Bruijn tree with the root is a -ary tree defined recursively as follows:

i. The word of length over the alphabet is the root of .

ii. If is a leaf in the tree , then each word of the form , , will be a descendent of , if in the path from root to the vertex does not appears.

iii. The rule ii is applied as many as it can.

In Figure 26.3 the De Bruijn tree is given.

#### Rauzy graphs

If the word is infinite, and , , then the corresponding word graph is called Rauzy graph (or subword graph). Figure 26.4 illustrates the Rauzy graphs of the infinite Fibonacci word for and . As we have seen, the infinite Fibonacci word is

and , , .

In the case of the power word , where , , the corresponding Rauzy graphs are given in Figure 26.5.

As we can see in Fig, 26.4 and 26.5 there are subwords of length which can be continued only in a single way (by adding a letter), and there are subwords which can be continued in two different ways (by adding two different letters). These latter subwords are called special subwords. A subword is a right special subword, if there are at least two different letters , such that . Similarly, is left special subword, if there are at least two different letters , such that . A subword is bispecial, if at the same time is right and left special. For example, the special subwords in Figures 26.4 and 26.5) are:

left special subwords:         010, 0100 (Figure 26.4),

110, 000, 111, 1110, 0001, 1111, 0011 (Figure 26.5),

right special subwords:       010, 0010 (Figure 26.4),

011, 000, 111, 0111, 1111, 0011 (Figure 26.5)

bispecial subwords:            010 (Figure 26.4),

000, 111, 1111, 0011 (Figure 26.5).

### 26.1.4 Complexity of words

The complexity of words measures the diversity of the subwords of a word. In this regard the word has smaller complexity then the word .

We define the following complexities for a word.

1) The subword complexity or simply the complexity of a word assigns to each the number of different subwords of length . For a word the number of different subwords of length is denoted by .

If the word is finite, then , if .

2) The maximal complexity is considered only for finite words.

If is an infinite word, then is the lower maximal complexity, respectively the upper maximal complexity.

3) The global maximal complexity is defined on the set :

4) The total complexity for a finite word is the number of all different nonempty subwords

For an infinite word is the lower total complexity, and is the upper total complexity:

Footnote. Sometimes the empty subword is considered too. In this case the value of total complexity is increased by 1.

5) A decomposition is called a factorization of . If each (with the possible exception of ) is the shortest prefix of which does not occur before in , then this factorization is called the Lempel-Ziv factorization. The number of subwords in such a factorization is the Lempel-Ziv factorization complexity of . For example for the word the Lempel-Ziv factorization is: . So, the Lempel-Ziv factorization complexity of is .

6) If in a factorization each is the longest possible palindrome, then the factorization is called a palindromic factorization, and the number of subwords in this is the palindromic factorization complexity. For , so the palindromic factorization complexity of is .

7) The window complexity is defined for infinite words only. For the window complexity is

#### Subword complexity

As we have seen

, if .

For example, in the case of :

In Theorem 26.4 was stated that for the infinite Fibonacci word:

In the case of the power word the complexity is:

This can be proved if we determine the difference , which is equal to the number of words of length which can be continued in two different ways to obtain words of length . In two different ways can be extended only the words of the form (it can be followed by 1 always, and by 0 when ) and (it can be followed by 0 always, and by 1 when ). Considering separately the cases when is odd and even, we can see that:

and from this we get

In the case of the Champernowne word

the complexity is .

Theorem 26.10 If for the infinite word there exists an such that , then is ultimately periodic.

Proof. , otherwise the word is trivial (contains just equal letters). Therefore there is a , such that . But

It follows that each subword has only one extension to obtain . So, if , then . Because is a finite set, and is infinite, there are and (), for which , but in this case is true too. Then from we obtain the following equality results: , therefore is true for all values. Therefore is ultimately periodic.

A word is Sturmian, if for all .

Sturmian words are the least complexity infinite and non periodic words. The infinite Fibonacci word is Sturmian. Because , the Sturmian words are two-letters words.

From the Theorem 26.10 it follows that each infinite and not ultimately periodic word has complexity at least , i.e.

The equality holds for Sturmian words.

Infinite words can be characterized using the lower and upper total complexity too.

Theorem 26.11 If an infinite word is not ultimately periodic and , then

For the Sturmian words equality holds.

Let us denote by the fractional part of , and by its integer part. Obviously . The composition of a function by itself times will be denoted by . So ( times). Sturmian words can be characterized in the following way too:

Theorem 26.12 A word is Sturmian if and only if there exists an irrational number and a real number , such that for

or

In the case of the infinite Fibonacci number, these numbers are: .

Sturmian words can be generated by the orbit of a billiard ball inside a square too. A billiard ball is launched under an irrational angle from a boundary point of the square. If we consider an endless move of the ball with reflection on boundaries and without friction, an infinite trajectory will result. We put an 0 in the word if the ball reaches a horizontal boundary, and 1 when it reaches a vertical one. In such a way we generate an infinite word. This can be generalized using an -letter alphabet and an -dimensional hypercube. In this case the complexity is

If , then , and if , then .

#### Maximal complexity

For a finite word

is the maximal complexity. In Figure 26.6 the values of the complexity function for several words are given for all possible length. From this, we can see for example that , etc.

For the complexity of finite words the following interesting result is true.

Theorem 26.13 If is a finite word, is its complexity, then there are the natural numbers and with such that

,        for ,

,        for ,

,   for .

From the Figure 26.6, for example, if , then , ,

, then , ,

, then , .

#### Global maximal complexity

The global maximal complexity is

that is the greatest (maximal) complexity in the set of all words of length on a given alphabet. The following problems arise:

what is length of the subwords for which the global maximal complexity is equal to the maximal complexity?

how many such words exist?

Example 26.1 For the alphabet the Figure 26.7 and 26.8 contain the complexity of all 3-length and 4-length words.

In this case of the 3-length words (Figure 26.7) the global maximal complexity is 2, and this value is obtained for 1-length and 2-length subwords. There are 6 such words.

For 4-length words (Figure 26.8) the global maximal complexity is 3, and this value is obtained for 2-length words. The number of such words is 8.

To solve the above two problems, the following notations will be used:

In the table of Figure 26.9 values of , , are given for length up to 20 over on a binary alphabet.

We shall use the following result to prove some theorems on maximal complexity.

Lemma 26.14 For each , the shortest word containing all the words of length over an alphabet with letters has letters (hence in this word each of the words of length appears only once).

Theorem 26.15 If and , then .

Proof. Let us consider at first the case , . From Lemma 26.14 we obtain the existence of a word of length which contains all the words of length , hence . It is obvious that for and for . Any other word of length will have the maximal complexity less than or equal to , hence we have .

For we consider now the values of of the form with , hence . If from the word of length considered above we delete the last letters, we obtain a word of length with . This word will have and this value will be its maximal complexity. Indeed, it is obvious that for ; for it follows that , hence . Because it is not possible for a word of length , with to have the maximal complexity greater than , it follows that .

Theorem 26.16 If and then ; if then .

Proof. In the first part of the proof of Theorem 26.15, we proved for , , the existence of a word of length for which . This means that . For the word , as well as for any other word of length , we have , , because of the special construction of , which contains all the words of length in the most compact way. It follows that .

As in the second part of the proof of Theorem 26.15, we consider with and the word for which . We have again . For , it is obvious that the complexity function of , or of any other word of length , is strictly less than . We examine now the possibility of finding a word with for which for . We have , hence the equality holds only for and , that is for .

We show that for we have indeed . If we start with the word of length generated by the Martin's algorithm (or with another De Bruijn word) and add to this any letter from , we obtain obviously a word of length , which contains all the words of length and words of length , hence .

Having in mind the Martin algorithm (or other more efficient algorithms), words with maximal complexity can be easily constructed for each and for both situations in Theorem 26.16.

Theorem 26.17 If and then is equal to the number of different paths of length in the de Bruijn graph .

Proof. From Theorems 26.15 and 26.16 it follows that the number of the words of length with global maximal complexity is given by the number of words with . It means that these words contain subwords of length , all of them distinct. To enumerate all of them we start successively with each word of letters (hence with each vertex in ) and we add at each step, in turn, one of the symbols from which does not duplicate a word of length which has already appeared. Of course, not all of the trials will finish in a word of length , but those which do this, are precisely paths in starting with each vertex in turn and having the length . Hence to each word of length with we can associate a path and only one of length starting from the vertex given by the first letters of the initial word; conversely, any path of length will provide a word of length which contains distinct subwords of length .

can be expressed also as the number of vertices at level in the set of De Bruijn trees.

Theorem 26.18 If , then .

Proof. In the De Bruijn graph there are different Hamiltonian cycles. With each vertex of a Hamiltonian cycle a De Bruijn word begins (containing all -length subwords), which has maximal complexity, so , which proves the theorem.

A generalization for an alphabet with letters:

Theorem 26.19 If , then .

#### Total complexity

The total complexity is the number of different nonempty subwords of a given word:

The total complexity of a trivial word of length (of the form , ) is equal to . The total complexity of a rainbow word (with pairwise different letters) of length is equal to .

The problem of existence of words with a given total complexity are studied in the following theorems.

Theorem 26.20 If is a natural number different from 1, 2 and 4, then there exists a nontrivial word of total complexity equal to .

Proof. To prove this theorem we give the total complexity of the following -length words:

These can be proved immediately from the definition of the total complexity.

1. If is odd then we can write for a given . It follows that , and the word has total complexity .

2. If is even, then .

2.1. If , then gives , and from this results. The word has total complexity .

2.2. If then gives , and from this results. The word has total complexity .

In the proof we have used more than two letters in a word only in the case of the numbers of the form (case 2.2 above). The new question is, if there exist always nontrivial words formed only of two letters with a given total complexity. The answer is yes anew. We must prove this only for the numbers of the form . If and , we use the followings:

If , then gives , and the word has total complexity .

If , then gives , and the word has total complexity . For only 14, 26 and 30 are feasible. The word has total complexity 14, has 26, and 30. Easily it can be proved, using a tree, that for 6, 10, 18 and 22 such words does not exist. Then the following theorem is true.

Theorem 26.21 If is a natural number different from 1, 2, 4, 6, 10, 18 and 22, then there exists a nontrivial word formed only of two letters, with the given total complexity .

The existence of a word with a given length and total complexity is not always assured, as we will prove in what follows.

In relation with the second problem a new one arises: How many words of length and complexity there exist? For small this problem can be studied exhaustively. Let be of letters, and let us consider all words of length over . By a computer program we have got Figure 26.10, which contains the frequency of words with given length and total complexity.

Let and let denote the frequency of the words of length over having a complexity . Then we have the following easy to prove results:

,        if or ,

,

,

,

As regards the distribution of the frequency , the following are true:

If , then .

If , then .

The question is, if there exists a value from which up to no more 0 frequency exist. The answer is positive. Let us denote by the least number between and for which

for all with .

The number exists for any (in the worst case it may be equal to ):

Theorem 26.22 If , , , then