This chapter introduces concepts in algorithms, data structures, program design, and applied Python programming. It also contains review of the basic mathematical notions of set, relation, and function, and illustrates them in terms of Python data structures. It contains many working program fragments that you should try yourself.
naming variables
Modules
Codecs for processing Chinese text have been incorporated into Python (since version 2.4).
|
With appropriate support on your terminal, the escaped text string inside the <SENT> element above will be rendered as the following string of ideographs: 甚至猫以人贵.
We can also read in the contents of an XML file using the etree package (at least, if the file is encoded as UTF-8 — as of writing, there seems to be a problem reading GB2312-encoded files in etree).
|
An algorithm is a "recipe" for solving a problem. For example, to multiply 16 by 12 we might use any of the following methods:
Each of these methods is a different algorithm, and requires different amounts of computation time and different amounts of intermediate information to store. A similar situation holds for many other superficially simple tasks, such as sorting a list of words.
Now, as we saw above, Python provides a built-in function sort() that performs this task efficiently. However, NLTK also provides several algorithms for sorting lists, to illustrate the variety of possible methods. To illustrate the difference in efficiency, we will create a list of 1000 numbers, randomize the list, then sort it, counting the number of list manipulations required.
|
Now we can try a simple sort method called bubble sort, that scans through the list many times, exchanging adjacent items if they are out of order. It sorts the list a in-place, and returns the number of times it modified the list:
|
We can try the same task using various sorting algorithms. Evidently merge sort is much better than bubble sort, and quicksort is better still.
|
Readers are encouraged to look at nltk.misc.sort to see how these different methods work. The collection of NLTK modules exemplify a variety of algorithm design techniques, including brute-force, divide-and-conquer, dynamic programming, and greedy search. Readers who would like a systematic introduction to algorithm design should consult the resources mentioned at the end of this tutorial.
In Chapter 6 we saw how to sort a list of items according to some property of the list.
|
This is inefficient when the list of items gets long, as we compute len() twice for every comparison (about 2nlog(n) times). The following is more efficient:
|
This technique is called decorate-sort-undecorate. We can compare its performance by timing how long it takes to execute it a million times.
|
MORE: consider what happens as the lists get longer...
Another example: sorting dates of the form "1 Jan 1970"
|
The comparison function says that we compare two times of the form ('Mar', '2004') by reversing the order of the month and year, and converting the month into a number to get ('2004', '3'), then using Python's built-in cmp function to compare them.
Now do this using decorate-sort-undecorate, for large data size
Time comparison
Wordfinder Puzzle
Here we will generate a grid of letters, containing words found in the dictionary. First we remove any duplicates and disregard the order in which the lexemes appeared in the dictionary. We do this by converting it to a set, then back to a list. Then we select the first 200 words, and then only keep those words having a reasonable length.
|
Now we generate the wordfinder grid, and print it out.
|
Finally we generate the words which need to be found.
|
Find words which, when reversed, make legal words. Extremely wasteful brute force solution:
|
More efficient:
|
Observe that this output contains redundant information; each word and its reverse is included. How could we remove this redundancy?
Presorting, sets:
Find words which have at least (or exactly) one instance of all vowels. Instead of writing extremely complex regular expressions, some simple preprocessing does the trick:
|
Indexing
Fuzzy Spelling
Object-Oriented Programming is a programming paradigm in which complex structures and processes are decomposed into classes, each encapsulating a single data type and the legal operations on that type. In this section we show you how to create simple data classes and processing classes by example. For a systematic introduction to Object-Oriented design, please consult the large literature of books on this topic.
An important data type in language processing is the syntactic tree. Here we will review the parts of the NLTK code that defines the Tree class.
The first line of a class definition is the class keyword followed by the class name, in this case Tree. This class is derived from Python's built-in list class, permitting us to use standard list operations to access the children of a tree node.
|
Next we define the initializer __init__(); Python knows to call this function when you ask for a new tree object by writing t = Tree(node, children). The constructor's first argument is special, and is standardly called self, giving us a way to refer to the current object from within its definition. This particular constructor calls the list initializer (similar to calling self = list(children)), then defines the node property of a tree.
|
Next we define another special function that Python knows to call when we index a Tree. The first case is the simplest, when the index is an integer, e.g. t[2], we just ask for the list item in the obvious way. The other cases are for handling slices, like t[1:2], or t[:].
|
This method was for accessing a child node. Similar methods are provided for setting and deleting a child (using __setitem__) and __delitem__).
Two other special member functions are __repr__() and __str__(). The __repr__() function produces a string representation of the object, one that can be executed to re-create the object, and is accessed from the interpreter simply by typing the name of the object and pressing 'enter'. The __str__() function produces a human-readable version of the object; here we call a pretty-printing function we have defined called pp().
|
Next we define some member functions that do other standard operations on trees. First, for accessing the leaves:
|
Next, for computing the height:
|
And finally, for enumerating all the subtrees (optionally filtered):
|
This section will discuss the tag.ngram module.
[to be written]
About this document...
This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit [http://www.nltk.org/], Version 0.9.8b1, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].
This document is Revision: 7606 Sat Feb 14 17:10:35 EST 2009
About this document...
This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit [http://www.nltk.org/], Version 0.9.8b1, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License [http://creativecommons.org/licenses/by-nc-nd/3.0/us/].
This document is Revision: 7606 Sat Feb 14 17:10:35 EST 2009