Natural Language Processing

Authors: Steven Bird, Ewan Klein, Edward Loper
Version: 0.9.6 (draft only, please send feedback to authors)
Copyright: © 2001-2008 the authors
License: Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License



This is a book about Natural Language Processing. By natural language we mean a language that is used for everyday communication by humans; languages like English, Hindi or Portuguese. In contrast to artificial languages such as programming languages and mathematical notations, natural languages have evolved as they pass from generation to generation, and are hard to pin down with explicit rules. We will take Natural Language Processing (or NLP for short) in a wide sense to cover any kind of computer manipulation of natural language. At one extreme, it could be as simple as counting the number of times the letter t occurs in a paragraph of text. At the other extreme, NLP involves "understanding" complete human utterances, at least to the extent of being able to give useful responses to them.

Technologies based on NLP are becoming increasingly widespread. For example, handheld computers (PDAs) support predictive text and handwriting recognition; web search engines give access to information locked up in unstructured text; machine translation allows us to retrieve texts written in Chinese and read them in Spanish. By providing more natural human-machine interfaces, and more sophisticated access to stored information, language processing has come to play a central role in the multilingual information society.

This book provides a comprehensive introduction to the field of NLP. It can be used for individual study or as the textbook for a course on natural language processing or computational linguistics. The book is intensely practical, containing hundreds of fully-worked examples and graded exercises. It is based on the Python programming language together with an open source library called the Natural Language Toolkit (NLTK). NLTK includes software, data, and documentation, all freely downloadable from the NLTK website. Distributions are provided for Windows, Macintosh and Unix platforms. We encourage you, the reader, to download Python and NLTK, and try out the examples and exercises along the way.


NLP is important for scientific, economic, social, and cultural reasons. NLP is experiencing rapid growth as its theories and methods are deployed in a variety of new language technologies. For this reason it is important for a wide range of people to have a working knowledge of NLP. Within industry, this includes people in human-computer interaction, business information analysis, and Web software development. Within academia, this includes people in areas from humanities computing and corpus linguistics through to computer science and artificial intelligence.

This book is intended for a diverse range of people who want to learn how to write programs that analyze written language:

New to Programming?:
 The book is suitable for readers with no prior knowledge of programming, and the early chapters contain many examples that you can simply copy and try for yourself, together with graded exercises. If you decide you need a more general introduction to Python, we recommend you read Learning Python (O'Reilly) in conjunction with this book.
New to Python?:
 Experienced programmers can quickly learn enough Python using this book to get immersed in natural language processing. All relevant Python features are carefully explained and exemplified, and you will quickly come to appreciate Python's suitability for this application area.
Already dreaming in Python?:
 Skip the Python examples and dig into the interesting language analysis material that starts in Chapter 1. Soon you'll be applying your skills to this exciting new application area.


This book is a practical introduction to NLP. You will learn by example, write real programs, and grasp the value of being able to test an idea through implementation. If you haven't learnt already, this book will teach you programming. Unlike other programming books, we provide extensive illustrations and exercises from NLP. The approach we have taken is also principled, in that we cover the theoretical underpinnings and don't shy away from careful linguistic and computational analysis. We have tried to be pragmatic in striking a balance between theory and application, identifying the connections and the tensions. Finally, we recognize that you won't get through this unless it is also pleasurable, so we have tried to include many applications and examples that are interesting and entertaining, sometimes whimsical.

What You Will Learn

By digging into the material presented here, you will learn:

  • how simple programs can help you manipulate and analyze language data, and how to write these programs;
  • how key concepts from NLP and linguistics are used to describe and analyze language;
  • how data structures and algorithms are used in NLP;
  • how language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques.

Depending on your background, and your motivation for being interested in NLP, you will gain different kinds of skills and knowledge from this book, as set out below:

Table I.1

Goal: Language Analysis
  Background in Arts and Humanities: Programming to manage language data, explore linguistic models, and test empirical claims
  Background in Science and Engineering: Language as a source of interesting problems in data modeling, data mining, and knowledge discovery

Goal: Language Technology
  Background in Arts and Humanities: Learning to program, with applications to familiar problems, to work in language technology or other technical field
  Background in Science and Engineering: Knowledge of linguistic algorithms and data structures for high quality, maintainable language processing software


The early chapters are organized in order of conceptual difficulty, starting with a gentle introduction to language processing and Python, before proceeding to fundamental topics such as tokenization, tagging, and evaluation. After this, a sequence of chapters covers topics in grammars and parsing, which have long been central tasks in language processing. The last third of the book contains chapters on advanced topics, which can be read independently of each other.

Each chapter consists of an introduction, a sequence of sections that progress from elementary to advanced material, and finally a summary and suggestions for further reading. Most sections include exercises that are graded according to the following scheme: ☼ is for easy exercises that involve minor modifications to supplied code samples or other simple activities; ◑ is for intermediate exercises that explore an aspect of the material in more depth, requiring careful analysis and design; ★ is for difficult, open-ended tasks that will challenge your understanding of the material and force you to think independently (readers new to programming are encouraged to skip these); ☺ is for non-programming exercises for reflection or discussion. The exercises are important for consolidating the material in each section, and we strongly encourage you to try a few before continuing with the rest of the chapter.

Within each chapter, we'll be switching between different styles of presentation. In one style, natural language will be the driver. We'll analyze language, explore linguistic concepts, and use programming examples to support the discussion. Sometimes we'll present Python constructs that have not been introduced systematically; this way you will see useful idioms early, even though you might not appreciate their workings until later. In the other style, the programming language will be the driver. We'll analyze programs, explore algorithms, and use linguistic examples to support the discussion.

Why Python?

Python is a simple yet powerful programming language with excellent functionality for processing linguistic data. Python can be downloaded for free, and installers are available for all platforms.

Here is a four-line Python program that processes file.txt and prints all the words ending in ing:

>>> for line in open("file.txt"):      # for each line of input text
...     for word in line.split():      # for each word in the line
...         if word.endswith('ing'):   # does the word end in 'ing'?
...             print(word)            # if so, print the word

This program illustrates some of the main features of Python. First, whitespace is used to nest lines of code; thus the line starting with if falls inside the scope of the previous line starting with for, which ensures that the ing test is performed for each word. Second, Python is object-oriented; each variable is an entity that has certain defined attributes and methods. For example, the value of the variable line is more than a sequence of characters. It is a string object that has a method (or operation) called split() that we can use to break a line into its words. To apply a method to an object, we write the object name, followed by a period, followed by the method name, i.e., line.split(). Third, methods have arguments expressed inside parentheses. For instance, in the example above, split() had no argument because we were splitting the string wherever there was white space, and we could therefore use empty parentheses. To split a string into sentences delimited by a period, we would write split('.'). Finally, and most importantly, Python is highly readable, so much so that it is fairly easy to guess what the above program does even if you have never written a program before.
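To make the behaviour of split() concrete, here is a small sketch that works on a string held in a variable instead of a file. The sample sentence and the variable names are our own, chosen only for illustration.

```python
# A sample line of text, standing in for one line read from file.txt.
line = "He kept singing while she was dancing. They left"

# With empty parentheses, split() breaks the string wherever there is whitespace.
words = line.split()

# With an argument, split() breaks the string at each occurrence of that argument.
pieces = line.split('.')

# Collect the words ending in 'ing', using the same test as the program above.
ing_words = []
for word in words:
    if word.endswith('ing'):
        ing_words.append(word)

print(words)
print(pieces)
print(ing_words)
```

Notice that dancing. is not collected: splitting on whitespace leaves the period attached to the word, so it no longer ends in ing.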

We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality. As a scripting language, Python facilitates interactive exploration. As an object-oriented language, Python permits data and methods to be encapsulated and re-used easily. As a dynamic language, Python permits attributes to be added to objects on the fly, and permits variables to be typed dynamically, facilitating rapid development. Python comes with an extensive standard library, including components for graphical programming, numerical processing, and web data processing.

Python is heavily used in industry, scientific research, and education around the world. Python is often praised for the way it facilitates productivity, quality, and maintainability of software. A collection of Python success stories is posted online.

NLTK defines an infrastructure that can be used to build NLP programs in Python. It provides basic classes for representing data relevant to natural language processing; standard interfaces for performing tasks such as tokenization, part-of-speech tagging, and syntactic parsing; and standard implementations for each task which can be combined to solve complex problems.

NLTK comes with extensive documentation. In addition to this book, the website provides API documentation which covers every module, class and function in the toolkit, specifying parameters and giving examples of usage. The website also provides module guides; these contain extensive examples and test cases, and are intended for users, developers and instructors.

Learning Python for Natural Language Processing

This book contains self-paced learning materials including many examples and exercises. An effective way to learn is simply to work through the materials. The program fragments can be copied directly into a Python interactive session. Any questions concerning the book, or Python and NLP more generally, can be posted to the NLTK-Users mailing list.

Python Environments:
 The easiest way to start developing Python code, and to run interactive Python demonstrations, is to use the simple editor and interpreter GUI that comes with Python called IDLE, the Integrated DeveLopment Environment for Python.
NLTK Community:
 NLTK has a large and growing user base. There are mailing lists for announcements about NLTK, for developers and for teachers. NLTK and materials from this book have been adopted in many courses around the world, which are a useful source of extra materials including slides and exercises.

The Design of NLTK

NLTK was designed with four primary goals in mind:

Simplicity: We have tried to provide an intuitive and appealing framework along with substantial building blocks, so you can gain a practical knowledge of NLP without getting bogged down in the tedious house-keeping usually associated with processing annotated language data. We have provided software distributions for several platforms, along with platform-specific instructions, to make the toolkit easy to install.
Consistency: We have made a significant effort to ensure that all the data structures and interfaces are consistent, making it easy to carry out a variety of tasks using a uniform framework.
Extensibility: The toolkit easily accommodates new components, whether those components replicate or extend existing functionality. Moreover, the toolkit is organized so that it is usually obvious where extensions would fit into the toolkit's infrastructure.
Modularity: The interaction between different components of the toolkit uses simple, well-defined interfaces. It is possible to complete individual projects using small parts of the toolkit, without needing to understand how they interact with the rest of the toolkit. Modularity also makes it easier to change and extend the toolkit.

Contrasting with these goals are three non-requirements — potentially useful features that we have deliberately avoided. First, while the toolkit provides a wide range of functions, it is not encyclopedic; it will continue to evolve with the field of NLP. Second, while the toolkit should be efficient enough to support meaningful tasks, it does not need to be highly optimized for runtime performance; such optimizations often involve more complex algorithms, and sometimes require the use of programming languages like C or C++. This would make the toolkit less accessible and more difficult to install. Third, we have tried to avoid clever programming tricks, since clear implementations are preferable to ingenious yet indecipherable ones.

For Instructors

Natural Language Processing (NLP) is often taught within the confines of a single-semester course at advanced undergraduate level or postgraduate level. Many instructors have found that it is difficult to cover both the theoretical and practical sides of the subject in such a short span of time. Some courses focus on theory to the exclusion of practical exercises, and deprive students of the challenge and excitement of writing programs to automatically process language. Other courses are simply designed to teach programming for linguists, and do not manage to cover any significant NLP content. NLTK was originally developed to address this problem, making it feasible to cover a substantial amount of theory and practice within a single-semester course, even if students have no prior programming experience.

A significant fraction of any NLP syllabus deals with algorithms and data structures. On their own these can be rather dry, but NLTK brings them to life with the help of interactive graphical user interfaces making it possible to view algorithms step-by-step. Most NLTK components include a demonstration which performs an interesting task without requiring any special input from the user. An effective way to deliver the materials is through interactive presentation of the examples, entering them in a Python session, observing what they do, and modifying them to explore some empirical or theoretical issue.

The book contains hundreds of examples and exercises which can be used as the basis for student assignments. The simplest exercises involve modifying a supplied program fragment in a specified way in order to answer a concrete question. At the other end of the spectrum, NLTK provides a flexible framework for graduate-level research projects, with standard implementations of all the basic data structures and algorithms, interfaces to dozens of widely used data-sets (corpora), and a flexible and extensible architecture. Additional support for teaching using NLTK is available on the NLTK website, and on a closed mailing list for instructors.

We believe this book is unique in providing a comprehensive framework for students to learn about NLP in the context of learning to program. What sets these materials apart is the tight coupling of the chapters and exercises with NLTK, giving students — even those with no prior programming experience — a practical introduction to NLP. After completing these materials, students will be ready to attempt one of the more advanced textbooks, such as Speech and Language Processing, by Jurafsky and Martin (Prentice Hall, 2008).

This book presents programming concepts in an unusual order, beginning with a non-trivial data type — lists of strings — before introducing non-trivial control structures like comprehensions and conditionals. These idioms permit us to do useful language processing from the start. Once this motivation is in place we deal with the fundamental concepts systematically. Thus we cover the same ground as more conventional approaches, without expecting readers to be interested in the programming language for its own sake.

Table I.2: Suggested Course Plans; Lectures/Lab Sessions per Chapter

Chapter                                  Arts and Humanities   Science and Engineering
1 Language Processing and Python         2-4                   2
2 Text Corpora and Lexical Resources     2-4                   2
3 Processing Raw Text                    2-4                   2
4 Categorizing and Tagging Words         2-4                   2-4
5 Data-Intensive Language Processing     0-2                   2-4
6 Structured Programming                 2-4                   0
7 Partial Parsing and Interpretation     2                     2
8 Grammars and Parsing                   2-4                   2-4
9 Advanced Parsing                       0                     1-4
10 Feature Based Grammar                 2-4                   1-4
11 Logical Semantics                     1                     1-4
12 Linguistic Data Management            0-2                   0-4
13 Conclusion                            1                     1
Total                                    18-36                 18-36


NLTK was originally created as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania in 2001. Since then it has been developed and expanded with the help of dozens of contributors. It has now been adopted in courses in dozens of universities, and serves as the basis of many research projects.

In particular, we're grateful to the following people for their feedback, comments on earlier drafts, advice, and contributions: Michaela Atterer, Greg Aumann, Kenneth Beesley, Steven Bethard, Ondrej Bojar, Trevor Cohn, Grev Corbett, James Curran, Jean Mark Gawron, Baden Hughes, Gwillim Law, Mark Liberman, Christopher Maloof, Stefan Müller, Stuart Robinson, Jussi Salmela, Rob Speer. Many others have contributed to the toolkit. We are grateful to many colleagues and students for feedback on the text.

We are grateful to the US National Science Foundation, the Linguistic Data Consortium, and the Universities of Pennsylvania, Edinburgh, and Melbourne for supporting our work on this book.

About the Authors


Figure I.1: Edward Loper, Ewan Klein, and Steven Bird, Stanford, July 2007

Steven Bird is Associate Professor in the Department of Computer Science and Software Engineering at the University of Melbourne, and Senior Research Associate in the Linguistic Data Consortium at the University of Pennsylvania. After completing his undergraduate training in computer science and mathematics at the University of Melbourne, Steven went to the University of Edinburgh to study computational linguistics, and completed his PhD in 1990 under the supervision of Ewan Klein. He later moved to Cameroon to conduct linguistic fieldwork on the Grassfields Bantu languages under the auspices of the Summer Institute of Linguistics. More recently, he spent several years as Associate Director of the Linguistic Data Consortium where he led an R&D team to create models and tools for large databases of annotated text. Back at Melbourne University, he established a language technology research group and has taught at all levels of the undergraduate computer science curriculum. Steven is Vice President of the Association for Computational Linguistics.

Ewan Klein is Professor of Language Technology in the School of Informatics at the University of Edinburgh. He completed a PhD on formal semantics at the University of Cambridge in 1978. After some years working at the Universities of Sussex and Newcastle upon Tyne, Ewan took up a teaching position at Edinburgh. He was involved in the establishment of Edinburgh's Language Technology Group in 1993, and has been closely associated with it ever since. From 2000 to 2002, he took leave from the University to act as Research Manager for the Edinburgh-based Natural Language Research Group of Edify Corporation, Santa Clara, and was responsible for spoken dialogue processing. Ewan is a past President of the European Chapter of the Association for Computational Linguistics and was a founding member and Coordinator of the European Network of Excellence in Human Language Technologies (ELSNET). He has been involved in leading numerous academic-industrial collaborative projects, the most recent of which is a biological text mining initiative funded by ITI Life Sciences, Scotland, in collaboration with Cognia Corporation, NY.

Edward Loper is a doctoral student in the Department of Computer and Information Sciences at the University of Pennsylvania, conducting research on machine learning in natural language processing. Edward was a student in Steven's graduate course on computational linguistics in the fall of 2000, and went on to be a TA and share in the development of NLTK. In addition to NLTK, he has helped develop other major packages for documenting and testing Python software, epydoc and doctest.

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

1   Language Processing and Python

It is easy to get our hands on millions of words of text. What can we do with it, assuming we can write some simple programs? In this chapter we'll tackle the following questions:

  1. what can we achieve by combining simple programming techniques with large quantities of text?
  2. how can we automatically extract key words and phrases that sum up the style and content of a text?
  3. is the Python programming language suitable for such work?
  4. what are some of the interesting challenges of natural language processing?

This chapter is divided into sections that skip between two quite different styles. In the "computing with language" sections we will take on some linguistically-motivated programming tasks without necessarily understanding how they work. In the "closer look at Python" sections we will systematically review key programming concepts. We'll flag the two styles in the section titles, but later chapters will mix both styles without being so up-front about it. We hope this style of introduction gives you an authentic taste of what will come later, while covering a range of elementary concepts in linguistics and computer science. If you have basic familiarity with both areas you can skip to Section 1.5; we will repeat any important points in later chapters, and if you miss anything you can easily consult the online reference material.

1.1   Computing with Language: Texts and Words

We're all very familiar with text, since we read and write it every day. But here we will treat text as raw data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways. Before we can do this, we have to get started with the Python interpreter.

Getting Started

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter, the program that will be running your Python programs. You can access the Python interpreter using a simple graphical interface called the Integrated DeveLopment Environment (IDLE). On a Mac you can find this under Applications→MacPython, and on Windows under All Programs→Python. Under Unix you can run Python from the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your Python version; simply check that you are running Python 2.4 or greater (here it is 2.5.1):

Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book be sure not to type in the >>> prompt yourself. Now, let's begin by using Python as a calculator:

>>> 1 + 5 * 2 - 3
8

Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction.


Your Turn: Enter a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions. Note that division doesn't always behave as you might expect — it does integer division or floating point division depending on whether you type 1/3 or 1.0/3.0.
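The following sketch illustrates the two kinds of division. It relies on one fact not stated above: under the Python 2 interpreter described here, 1/3 performs integer division, whereas under Python 3 the same expression gives a floating point result. To get the same answers under either version, the sketch uses the explicit floor-division operator // and writes the floating point operands with a decimal point.

```python
# Floor division discards the fractional part of the result.
whole = 1 // 3
larger_whole = 7 // 2

# Floating point division keeps the fractional part.
fraction = 1.0 / 3.0

print(whole)          # 0
print(larger_whole)   # 3
print(fraction)
```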

These examples demonstrate how you can work interactively with the interpreter, allowing you to experiment and explore. Now let's try a nonsensical expression to see how the interpreter handles it:

>>> 1 +
  File "<stdin>", line 1
    1 +
      ^
SyntaxError: invalid syntax

Here we have produced a syntax error. It doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred (line 1 of "standard input").

Searching Text

Now that we can use the Python interpreter, let's see how we can harness its power to process text. The first step is to type a special command at the Python prompt which tells the interpreter to load some texts for us to explore: from nltk.book import *, i.e. load NLTK's book module, which contains the examples you'll be working with as you read this chapter. After printing a welcome message, it loads the text of several books, including Moby Dick. Type the following, taking care to get spelling and punctuation exactly right:

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading: text1, ..., text8 and sent1, ..., sent8
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>

We can examine the contents of a text in a variety of ways. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous.

>>> text1.concordance("monstrous")
mong the former , one was of a most monstrous size . ... This came towards us , o
ION OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have re
all over with a heathenish array of monstrous clubs and spears . Some were thickl
ed as you gazed , and wondered what monstrous cannibal and savage could ever have
 that has survived the flood ; most monstrous and most mountainous ! That Himmale
 they might scout at Moby Dick as a monstrous fable , or still worse and more det
ath of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere lo
ling Scenes . In connexion with the monstrous pictures of whales , I am strongly
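To see what a concordance involves, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name concordance_lines, its width parameter (counted in words, not characters), and the toy text are all our own assumptions for illustration.

```python
def concordance_lines(tokens, target, width=3):
    """Return one string per occurrence of target, with up to
    `width` words of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - width):i]      # words before the match
            right = tokens[i + 1:i + 1 + width]     # words after the match
            lines.append(' '.join(left + [token] + right))
    return lines

# A toy text, already split into words.
tokens = "the monstrous whale and the monstrous pictures of whales".split()

for line in concordance_lines(tokens, "monstrous", width=2):
    print(line)
```

Each printed line shows one occurrence of the word together with its neighbours, which is the essential idea behind the display above.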


Your Turn: Try searching for other words; you can use the up-arrow key to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using: text3.concordance("lived"). You could look at text4, the US Presidential Inaugural Addresses to see examples of English dating back to 1789, and search for words like nation, terror, god to see how these words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)

Once you've spent a few minutes examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

It is one thing to automatically detect that a particular word occurs in a text and to display some words that appear in the same context. We can also determine the location of a word in the text: how many words in from the beginning it appears. This positional information can be displayed using a so-called dispersion plot. Each stripe represents an instance of a word and each row represents the entire text. In Figure 1.1 we see some striking patterns of word usage over the last 220 years. You can produce this plot as shown below. You might like to try different words, and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Figure 1.1: Lexical Dispersion Plot for Words in US Presidential Inaugural Addresses


You need to have Python's Numpy and Pylab packages installed in order to produce the graphical plots used in this book. Please see the NLTK website for installation instructions.
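The positional information behind a dispersion plot can be computed without any plotting packages. The sketch below is our own illustration, not NLTK's code: the helper word_offsets and the toy text are assumptions. It records how many words in from the beginning each occurrence of a word appears.

```python
def word_offsets(tokens, target):
    """Return the positions, counted in words from the start of the
    text, at which target occurs."""
    return [i for i, token in enumerate(tokens) if token == target]

# A toy text, already split into words.
tokens = "we the citizens value freedom and the duties of citizens".split()

for word in ["citizens", "freedom", "duties"]:
    print("%s: %s" % (word, word_offsets(tokens, word)))
```

A dispersion plot draws one stripe at each of these offsets, with one row per word.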

A concordance permits us to see words in context, e.g. we saw that monstrous appeared in the context the monstrous pictures. What other words appear in the same contexts that monstrous appears in? We can find out as follows:

>>> text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
>>> text2.similar("monstrous")
great very so good vast a exceedingly heartily amazingly as sweet
remarkably extremely

Observe that we get different results for different books. Melville and Austen use this word quite differently. For Austen monstrous has positive connotations, and might even function as an intensifier, like the word very. Let's examine the contexts that are shared by monstrous and very:

>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
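One way to think about shared contexts is to treat the context of a word as the pair of words on either side of it. The sketch below takes that view; it is a simplification under our own assumptions (the helper contexts and the toy text are hypothetical) and is not how NLTK necessarily computes its result.

```python
def contexts(tokens, target):
    """Return the set of (previous word, next word) pairs that
    surround each occurrence of target."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

# A toy text in which two words occur in the same surroundings.
tokens = ("i am very glad it is a very pretty day "
          "i am monstrous glad it is a monstrous pretty house").split()

# Contexts shared by both words: the intersection of the two sets.
shared = contexts(tokens, "monstrous") & contexts(tokens, "very")

for left, right in sorted(shared):
    print(left + "_" + right)
```

The output uses the same left_right notation as above, with the underscore marking the position of the word itself.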


Your Turn: Pick another word and compare its usage in two different texts, using the similar() and common_contexts() methods.

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text, followed by a period and the method name generate():

>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she

Note that the first time you run this, it is slow because it gathers statistics about word sequences. Each time you run it, you will get different output text. Now try generating random text in the style of an inaugural address or an internet chat room. Although the text is random, it re-uses common words and phrases from the source text and gives us a sense of its style and content.
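To give a feel for what "statistics about word sequences" might mean, here is a minimal sketch that records, for each word, the words that have been seen to follow it, and then walks through those records at random. This is an assumption made for illustration: the functions build_followers and generate below are our own, and NLTK's generate() may use a different model.

```python
import random

def build_followers(tokens):
    """Map each word to the list of words that follow it in the text."""
    followers = {}
    for current, following in zip(tokens, tokens[1:]):
        followers.setdefault(current, []).append(following)
    return followers

def generate(tokens, start, length, seed=None):
    """Produce up to `length` words, each chosen at random from the
    words observed to follow the previous one."""
    rng = random.Random(seed)
    followers = build_followers(tokens)
    word = start
    output = [word]
    while len(output) < length and word in followers:
        word = rng.choice(followers[word])
        output.append(word)
    return output

# A toy text, already split into words.
tokens = "in the beginning god created the heaven and the earth".split()

print(' '.join(generate(tokens, "in", 8)))
```

Because every step re-uses a word pair that occurs in the source, the output echoes the source's phrasing even though the overall sequence is random. Passing a value for seed makes the output repeatable.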


When text is printed, punctuation has been split off from the previous word. Although this is not correct formatting for English text, we do this to make it clear that punctuation does not belong to the word. This is called "tokenization", and you will learn about it in Chapter 3.
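As a rough illustration of splitting punctuation off from words, the sketch below uses a regular expression from Python's standard re module. It is a deliberately crude rule of our own (the function simple_tokenize is hypothetical), not the tokenizer used by NLTK; for example, it would also break a word like don't into three pieces.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single
    punctuation marks, discarding whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("So shall thy wages be? And they made their father;")
print(' '.join(tokens))
```

Joining the tokens with spaces reproduces the formatting seen in the output above, with each punctuation mark standing apart from the preceding word.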

Counting Vocabulary

The most obvious fact about texts that emerges from the previous section is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text, in a variety of useful ways. As before you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We'll use the text of Moby Dick again:

>>> len(text1)

That's a quarter of a million words long! But how many distinct words does this text contain? To work this out in Python we have to pose the question slightly differently. The vocabulary of a text is just the set of words that it uses, and in Python we can list the vocabulary of text3 with the command: set(text3) (many screens of words will fly past). Now try the following:

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
>>> len(text3) / len(set(text3))

Here we can see a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. All capitalized words precede lowercase words. We discover the size of the vocabulary indirectly, by asking for the length of the set. There are fewer than 3,000 distinct words in this book. Finally, we can calculate a measure of the lexical richness of the text and learn that each word is used 16 times on average.

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

>>> text3.count("smote")
>>> 100.0 * text4.count('a') / len(text4)
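
The same counts can be tried on a tiny hand-made text, where the numbers are easy to check by eye. The sketch below is written for Python 3 (an assumption; the sessions in this chapter use an older interpreter), and the list tokens is made up for illustration rather than taken from any of the texts.

```python
# A made-up token list standing in for a real text such as text3.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'The', 'end', '.']

total = len(tokens)             # every word and punctuation symbol
vocab = sorted(set(tokens))     # distinct items, in sorted order
richness = total / len(vocab)   # average number of uses per distinct item

the_count = tokens.count('the')           # exact, case-sensitive matches only
the_percent = 100.0 * the_count / total   # share of the text taken up by 'the'

print(total, len(vocab), richness)
print(vocab)
print(the_count, the_percent)
```

Note that count is case-sensitive, so 'The' is not counted as 'the'. In Python 3 the / operator always gives a fractional result, whereas dividing two whole numbers in the older interpreter discards the remainder.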


Your Turn: How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?

You might like to repeat such calculations on several texts, but it is tedious to keep retyping them for different texts. Instead, we can come up with our own name for a task, e.g. "score", and associate it with a block of code. Now we only have to type a short name instead of one or more complete lines of Python code, and we can re-use it as often as we like:

>>> def score(text):
...     return len(text) / len(set(text))
>>> score(text3)
>>> score(text5)


The Python interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line. The ... prompt indicates that Python expects an indented code block to appear next. It is up to you to do the indentation, by typing four spaces. To finish the indented block just enter a blank line.

The keyword def is short for "define", and the above code defines a function called "score". We used the function by typing its name, followed by an open parenthesis, the name of the text, then a close parenthesis. This is just what we did for the len and set functions earlier. These parentheses will show up often: their role is to separate the name of a task (such as score) from the data that the task is to be performed on (such as text3). Functions are an advanced concept in programming and we only mention them at the outset to give newcomers a sense of the power and creativity of programming. Later we'll see how to use such functions when tabulating data, like Table 1.1. Each row of the table will involve the same computation but with different data, and we'll do this repetitive work using functions.
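
As a minimal sketch of the same idea in Python 3 (an assumption, since the sessions above use an older interpreter), here is a function like score next to a second helper, percentage. The name percentage is hypothetical and is not part of NLTK. Both work on any list of words, here a made-up one.

```python
def score(text):
    """Average number of times each distinct item is used in text."""
    return len(text) / len(set(text))

def percentage(count, total):
    """Hypothetical helper: express count as a percentage of total."""
    return 100 * count / total

sent = ['more', 'is', 'said', 'than', 'done', ',', 'more', 'is', 'done', '.']
print(score(sent))                                  # 10 items, 7 distinct
print(percentage(sent.count('done'), len(sent)))    # 'done' occurs twice
```

Once defined, each function can be reused on a different list just by changing what goes between the parentheses.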

Table 1.1:

Lexical Diversity of Various Genres in the Brown Corpus

Genre               Token Count   Type Count   Score
skill and hobbies         82345        11935     6.9
humor                     21695         5017     4.3
fiction: science          14470         3233     4.5
press: reportage         100554        14394     7.0
fiction: romance          70022         8452     8.3
religion                  39399         6373     6.2
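
The Score column of Table 1.1 appears to be the token count divided by the type count, rounded to one decimal place. The following Python 3 sketch recomputes it from the figures in the table; the names rows and diversity are illustrative only.

```python
# (genre, token count, type count) rows copied from Table 1.1.
rows = [
    ('skill and hobbies', 82345, 11935),
    ('humor', 21695, 5017),
    ('fiction: science', 14470, 3233),
    ('press: reportage', 100554, 14394),
    ('fiction: romance', 70022, 8452),
    ('religion', 39399, 6373),
]

def diversity(tokens, types):
    """Tokens per type, rounded to one decimal place as in the table."""
    return round(tokens / types, 1)

scores = {genre: diversity(tokens, types) for genre, tokens, types in rows}
for genre, tokens, types in rows:
    print('%-20s %7d %6d %4.1f' % (genre, tokens, types, scores[genre]))
```

Every row runs the same computation on different data, which is the kind of repetitive work a function is suited to.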

1.2   A Closer Look at Python: Texts as Lists of Words

You've seen some important building blocks of the Python programming language. Let's review them systematically.


What is a text? At one level, it is a sequence of symbols on a page, such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here's how we represent text in Python, in this case the opening sentence of Moby Dick:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name, and we can ask for its length:

>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> len(sent1)
>>> score(sent1)

We can even apply our own "score" function to it. Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 through sent8. We inspect two of them here; you can see the rest for yourself using the Python interpreter.

>>> sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long',
'been', 'settled', 'in', 'Sussex', '.']
>>> sent3
['In', 'the', 'beginning', 'God', 'created', 'the',
'heaven', 'and', 'the', 'earth', '.']

We can also do arithmetic operations with lists in Python. Multiplying a list by a number, e.g. sent1 * 2, creates a longer list with multiple copies of the items in the original list. Adding two lists, e.g. sent4 + sent1, creates a new list with everything from the first list, followed by everything from the second list:

>>> sent1 * 2
['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']

Indexing Lists

As we have seen, a text in Python is just a list of words, represented using a particular combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words using len(text1), and count the occurrences of a particular word using text1.count('heaven'). And just as we can pick out the first, tenth, or even 14,278th word in a printed text, we can identify the elements of a list by their number, or index, by following the name of the text with the index inside brackets. We can also find the index of the first occurrence of any word:

>>> text4[173]
>>> text4.index('awaken')

Indexes turn out to be a common way to access the words of a text, or — more generally — the elements of a list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

>>> text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
'buying', 'it']
>>> text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We',
'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive',
'officer', 'for', 'the', 'week']

Indexes have some subtleties, and we'll explore these with the help of an artificial sentence:

>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10',
...         'word11', 'word12', 'word13', 'word14', 'word15',
...         'word16', 'word17', 'word18', 'word19', 'word20']
>>> sent[0]
>>> sent[19]

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', while sent element 19 is 'word20'. The reason is simple: the moment Python accesses the content of a list from the computer's memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.


This is initially confusing, but typical of modern programming languages. You'll quickly get the hang of this if you've mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.

Now, if we tell it to go too far, by using an index value that is too large, we get an error:

>>> sent[20]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range

This time it is not a syntax error, for the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.

Let's take a closer look at slicing, using our artificial sentence again:

>>> sent[17:20]
['word18', 'word19', 'word20']
>>> sent[17]
>>> sent[18]
>>> sent[19]

Thus, the slice 17:20 includes sent elements 17, 18, and 19. By convention, m:n means elements m through n-1. We can omit the first number if the slice begins at the start of the list, and we can omit the second number if the slice goes to the end:

>>> sent[:3]
['word1', 'word2', 'word3']
>>> text2[141525:]
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne',
',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',',
'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of',
'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between',
'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
'THE', 'END']
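
The slice conventions can be checked on a small list. This sketch (Python 3, with made-up data) confirms that m:n covers positions m through n-1, so the slice has n-m elements, and that an omitted bound defaults to the start or the end of the list.

```python
# Build 'word1' ... 'word20' without typing them all out.
sent = ['word%d' % i for i in range(1, 21)]

middle = sent[17:20]     # positions 17, 18 and 19
head = sent[:3]          # same as sent[0:3]
tail = sent[17:]         # same as sent[17:len(sent)]

print(middle)
print(len(sent[5:12]))           # n - m = 12 - 5 elements
print(head + sent[3:] == sent)   # the two pieces rebuild the whole list
```

Because the end position is excluded, sent[:k] and sent[k:] always split a list into two pieces with nothing lost or repeated.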

We can modify an element of a list by assigning to one of its index values, e.g. putting sent[0] on the left of the equals sign. We can also replace an entire slice with new material:

>>> sent[0] = 'First Word'
>>> sent[19] = 'Last Word'
>>> sent[1:19] = ['Second Word', 'Third Word']
>>> sent
['First Word', 'Second Word', 'Third Word', 'Last Word']

Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used above. Check your understanding by trying the exercises on lists at the end of this chapter.


From the start of Section 1.1, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g. defining a variable sent1 as follows:

>>> sent1 = ['Call', 'me', 'Ishmael', '.']

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is copied from the right side to the left. It might help to think of it as a left-arrow. The variable can be anything you like, e.g. my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are some examples of variables and assignments:

>>> mySent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth',
...          'from', 'Camelot', '.']
>>> noun_phrase = mySent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']

It is good to choose meaningful variable names to help you — and anyone who reads your Python code — to understand what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. A variable name cannot be any of Python's reserved words, such as if, not, and import. If you use a reserved word, Python will produce a syntax error:

>>> not = 'Camelot'
File "<stdin>", line 1
    not = 'Camelot'
SyntaxError: invalid syntax

We can use variables to hold intermediate steps of a computation. This may make the Python code easier to follow. Thus len(set(text1)) could also be written:

>>> vocab = set(text1)
>>> vocab_size = len(vocab)
>>> vocab_size

1.3   Computing with Language: Simple Statistics

Let's return to our exploration of the ways we can bring our computational resources to bear on large quantities of text. We began this discussion in Section 1.1, and saw how to search for words in context, how to compile the vocabulary of a text, how to generate random text in the same style, and so on.

In this section we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and expressions of a text. As in Section 1.1, you will try new features of the Python language by copying them into the interpreter, and you'll learn about these features systematically in the following section.

Before continuing with this section, check your understanding of the previous section by predicting the output of the following code, and using the interpreter to check if you got it right. If you found it difficult to do this task, it would be a good idea to review the previous section before continuing further.

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
...           'more', 'is', 'said', 'than', 'done', '.']
>>> words = set(saying)
>>> words = sorted(words)
>>> words[:2]

Frequency Distributions

How could we automatically identify the words of a text that are most informative about the topic and genre of the text? Let's begin by finding the most frequent words of the text. Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in Figure 1.2. We would need thousands of counters and it would be a laborious process, so laborious that we would rather assign the task to a machine.


Figure 1.2: Counting Words Appearing in a Text (a frequency distribution)

The table in Figure 1.2 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text (in general it could count any kind of observable event). It is a "distribution" since it tells us how the total number of words in the text (260,819 in the case of Moby Dick) is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick.

>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 samples>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
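
To see what a frequency distribution does without loading a whole book, here is a minimal sketch that uses collections.Counter from Python's standard library as a stand-in. This is an assumption made for illustration: Counter is not NLTK's FreqDist, although both map each item to its count. The token list is made up, and the code is written for Python 3.

```python
from collections import Counter

tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'and', 'the', 'ship', '.']

counts = Counter(tokens)              # tally every item in one pass
total = sum(counts.values())          # total number of items counted
top_two = counts.most_common(2)       # (item, count) pairs, most frequent first

print(total)
print(top_two)
print(counts['whale'], counts['squid'])   # an unseen item has count 0
```

Even in this toy text the top of the list is taken by a function word and a punctuation mark rather than by anything topical.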


Your Turn: Try the above frequency distribution example for yourself, for text2. Be careful to use the correct parentheses and uppercase letters. If you get an error message NameError: name 'FreqDist' is not defined, you need to start your work with from import *.

Do any words in the above list help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing." What proportion of English text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(cumulative=True), to produce the graph in Figure 1.3. These 50 words account for nearly half the book!


Figure 1.3: Cumulative Frequency Plot for 50 Most Frequent Words in Moby Dick

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? See them using fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of them mean in any case! Neither frequent nor infrequent words help, so we need to try something else.
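
Both observations, that a handful of frequent items covers much of a text and that many items occur only once, can be checked on a toy example. The sketch below again uses collections.Counter as a stand-in for FreqDist (an assumption, not NLTK's own API), with a made-up token list and a hypothetical helper named cumulative_share. It is written for Python 3.

```python
from collections import Counter

tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'and', 'the', 'ship', '.']
counts = Counter(tokens)

def cumulative_share(counts, k):
    """Fraction of all tokens covered by the k most frequent items."""
    covered = sum(n for _, n in counts.most_common(k))
    return covered / sum(counts.values())

# Hapaxes are the items that occur exactly once.
hapaxes = sorted(w for w in counts if counts[w] == 1)

print(cumulative_share(counts, 2))   # share covered by 'the' and ','
print(hapaxes)
```

Here two items out of seven cover half of the tokens, and the other five are all hapaxes.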

Fine-grained Selection of Words

Next let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let's call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation as shown in (1a). This means "the set of all w such that w is an element of V (the vocabulary) and w has property P".


a. {w | w ∈ V & P(w)}

b. [w for w in V if p(w)]

The equivalent Python expression is given in (1b). Notice how similar the two notations are. Let's go one more step and write executable Python code:

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

For each word w in the vocabulary V, we check if len(w) is greater than 15; all other words will be ignored. We will discuss this syntax more carefully later.
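
To make the parallel with the set notation concrete, the property P can be written as a small function and used in the same comprehension. In this sketch (Python 3) the vocabulary V is a made-up set, and is_long is a hypothetical name chosen for P.

```python
V = {'whale', 'circumnavigation', 'sea', 'uncomfortableness', 'ship'}

def is_long(w):
    """P(w): true if and only if w has more than 15 characters."""
    return len(w) > 15

# Reads almost like the set notation: all w in V such that P(w) holds.
long_words = sorted(w for w in V if is_long(w))
print(long_words)
```

Swapping in a different function for is_long changes which words are selected without touching the rest of the expression.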


Your Turn: Try out the above statements in the Python interpreter, and try changing the text, and changing the length condition. Also try changing the variable names, e.g. using [word for word in vocab if ...].

Let's return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its national focus: constitutionally, transcontinental, while those in text5 reflect its informal content: boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically extracting words that typify a text? Well, these very long words are often hapaxes (i.e. unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g. the) and infrequent long words (e.g. antiphilosophists). Here are all words from the chat corpus that are longer than 7 characters, that occur more than 7 times:

>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven characters, and fdist5[w] > 7 ensures that these words occur more than seven times. At last we have managed to automatically identify the frequently occurring content-bearing words of the text. It is a modest but important milestone: a tiny piece of code, processing thousands of words, produces some informative output.
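
The effect of combining a length condition with a frequency condition is easy to follow on a handful of tokens. This is a sketch under stated assumptions: it is written for Python 3, the tokens are made up, the thresholds are lowered to suit such a small sample, and collections.Counter stands in for FreqDist.

```python
from collections import Counter

tokens = ['something', 'lol', 'something', 'hi', 'lol', 'lol',
          'everybody', 'something']
counts = Counter(tokens)

# Keep items that are both long (more than 4 characters)
# and frequent (more than 2 occurrences).
frequent_long = sorted(w for w in set(tokens)
                       if len(w) > 4 and counts[w] > 2)
print(frequent_long)
```

Here lol is frequent but short, and everybody is long but rare, so only something passes both tests.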

Counting Other Things

Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> fdist = FreqDist(len(w) for w in text1)
>>> fdist
<FreqDist with 260819 samples>
>>> fdist.samples()
[3, 1, 4, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]

The material being counted up in the frequency distribution consists of the numbers [1, 4, 4, 2, ...], and the result is a distribution containing a quarter of a million items, one per word. There are only 19 distinct items being counted: the numbers 1 through 18, and 20 (no word in the text is 19 characters long). Let's look at the frequency of each sample:

>>> fdist.items()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
(8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
(15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
>>> fdist.max()
>>> fdist[3]
>>> fdist.freq(3)

From this we see that the most frequent word length is 3, and that words of length 3 account for roughly 50,000 (or 20%) of the words of the book. Further analysis of word length might help us understand differences between authors, genres or languages. Table 1.2 summarizes the methods defined in frequency distributions.
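
The same kind of tally can be reproduced on a short token list. As before, this is only a sketch: it is written for Python 3, the tokens are made up for illustration, and collections.Counter is used as a stand-in for FreqDist.

```python
from collections import Counter

tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago', ',',
          'never', 'mind', 'how', 'long']

lengths = Counter(len(w) for w in tokens)      # counts word lengths, not words
most_common_length, n = lengths.most_common(1)[0]
share = n / len(tokens)                        # relative frequency of that length

print(sorted(lengths.items()))
print(most_common_length, n, share)
```

The three printed values play the roles of the most frequent sample, its count, and its relative frequency.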

Table 1.2:

Methods Defined for NLTK's Frequency Distributions

Example                       Description
fdist = FreqDist(samples)     create a frequency distribution containing the given samples
                              increment the count for this sample
fdist['monstrous']            count of the number of times a given sample occurred
fdist.freq('monstrous')       frequency of a given sample
fdist.N()                     total number of samples
fdist.keys()                  the samples sorted in order of decreasing frequency
for sample in fdist:          iterate over the samples, in order of decreasing frequency
fdist.max()                   sample with the greatest count
fdist.tabulate()              tabulate the frequency distribution
fdist.plot()                  graphical plot of the frequency distribution
fdist.plot(cumulative=True)   cumulative plot of the frequency distribution
fdist1 < fdist2               samples in fdist1 occur less frequently than in fdist2

Our discussion of frequency distributions has introduced some important Python concepts, and we will look at them systematically in Section 1.4. We've also touched on the topic of normalization, and we'll explore this in depth in Chapter 3.

1.4   Back to Python: Making Decisions and Taking Control

So far, our little programs have had some interesting qualities: (i) the ability to work with language, and (ii) the potential to save human effort through automation. A key feature of programming is the ability of machines to make decisions on our behalf, executing instructions when certain conditions are met, or repeatedly looping through text data until some condition is satisfied. This feature is known as control, and is the focus of this section.


Python supports a wide range of operators like < and >= for testing the relationship between values. The full set of these relational operators is shown in Table 1.3.

Table 1.3:

Numerical Comparison Operators

Operator   Relationship
<          less than
<=         less than or equal to
==         equal to (note this is two = signs, not one)
!=         not equal to
>          greater than
>=         greater than or equal to

We can use these to select different words from a sentence of news text. Here are some examples — only the operator is changed from one line to the next. They all use sent7, the first sentence from text7 (Wall Street Journal).

>>> [w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
>>> [w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
>>> [w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
>>> [w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board',
'as', 'a', 'nonexecutive', 'director', '29', '.']

Notice the pattern in all of these examples: [w for w in text if condition ]. In these cases the condition is always a numerical comparison. However, we can also test various properties of words, using the functions listed in Table 1.4.

Table 1.4:

Some Word Comparison Operators

Function          Meaning
s.startswith(t)   s starts with t
s.endswith(t)     s ends with t
t in s            t is contained inside s
s.islower()       all cased characters in s are lowercase
s.isupper()       all cased characters in s are uppercase
s.isalpha()       all characters in s are alphabetic
s.isalnum()       all characters in s are alphanumeric
s.isdigit()       all characters in s are digits
s.istitle()       s is titlecased (all words have initial capital)

Here are some examples of these operators being used to select words from our texts: words ending with -ableness; words containing gnt; words having an initial capital; and words consisting entirely of digits.

>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
>>> sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
>>> sorted(item for item in set(text6) if item.istitle())
['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
>>> sorted(item for item in set(sent7) if item.isdigit())
['29', '61']

We can also create more complex conditions. If c is a condition, then not c is also a condition. If we have two conditions c1 and c2, then we can combine them to form a new condition using and and or: c1 and c2, c1 or c2.
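
A short sketch of not, and and or in use, written for Python 3. The list words holds a few tokens copied from sent7, and the variable names are made up for illustration.

```python
words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', 'Nov.', '29', '.']

# not c: keep items that are not purely alphabetic.
not_alpha = [w for w in words if not w.isalpha()]

# c1 and c2: keep items that are titlecased and longer than 4 characters.
title_and_long = [w for w in words if w.istitle() and len(w) > 4]

# c1 or c2: keep items that are all digits or a single character.
digits_or_single = [w for w in words if w.isdigit() or len(w) == 1]

print(not_alpha)
print(title_and_long)
print(digits_or_single)
```

Notice that Nov. is titlecased but only four characters long, so the and condition rejects it.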


Your Turn: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.

>>> sorted(w for w in set(text7) if '-' in w and 'index' in w)
>>> sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10)
>>> sorted(w for w in set(sent7) if not w.islower())
>>> sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)

Operating on Every Element

In Section 1.3, we saw some examples of counting items other than words. Let's take a closer look at the notation we used:

>>> [len(w) for w in text1]
[1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
>>> [w.upper() for w in text1]
['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]

These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a word to compute its length, or to convert it to uppercase. For now, you don't need to understand the difference between the notations f(w) and w.f(). Instead, simply learn this Python idiom which performs the same operation on every element of a list. In the above examples, it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.


The above notation is called a "list comprehension". This is our first example of a Python idiom, a fixed notation that we use habitually without bothering to analyze each time. Mastering such idioms is an important part of becoming a fluent Python programmer.

Let's return to the question of vocabulary size, and apply the same idiom here:

>>> len(text1)
>>> len(set(text1))
>>> len(set(word.lower() for word in text1))

Now that we are not double-counting words like This and this, which differ only in capitalization, we've wiped 2,000 off the vocabulary count! We can go a step further and eliminate numbers and punctuation from the vocabulary count, by filtering out any non-alphabetic items:

>>> len(set(word.lower() for word in text1 if word.isalpha()))

This example is slightly complicated: it lowercases all the purely alphabetic items. Perhaps it would have been simpler just to count the lowercase-only items, but this gives the incorrect result (why?). Don't worry if you don't feel confident with these already. You might like to try some of the exercises at the end of this chapter, or wait until we come back to these again in the next chapter.
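
The successive vocabulary counts can be followed step by step on a small made-up token list. This is a sketch for Python 3; the variable names are illustrative.

```python
tokens = ['This', 'is', 'it', ',', 'and', 'this', 'is', 'THE', 'end',
          'of', 'the', '1851', 'book', '.']

raw = len(set(tokens))                            # distinct items as written
lowered = len(set(w.lower() for w in tokens))     # ignore capitalization
alpha_lowered = len(set(w.lower() for w in tokens
                        if w.isalpha()))          # also drop numbers and punctuation

print(raw, lowered, alpha_lowered)
```

Each step merges or removes some items, so the count can only stay the same or shrink.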

Nested Code Blocks

Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. In the following program, we have created a variable called word containing the string value 'cat'. The if statement checks whether the conditional expression len(word) < 5 is true. It is, so the body of the if statement is invoked and the print statement is executed, displaying a message to the user. Remember to indent the print statement by typing four spaces.

>>> word = 'cat'
>>> if len(word) < 5:
...     print 'word length is less than 5'
word length is less than 5

When we use the Python interpreter we have to have an extra blank line in order for it to detect that the nested block is complete.

If we change the conditional expression to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the conditional expression will no longer be true. This time, the body of the if statement will not be executed, and no message is shown to the user:

>>> if len(word) >= 5:
...     print 'word length is greater than or equal to 5'

An if statement is known as a control structure because it controls whether the code in the indented block will be run. Another control structure is the for loop. Don't forget the colon and the four spaces:

>>> for word in ['Call', 'me', 'Ishmael', '.']:
...     print word

This is called a loop because Python executes the code in circular fashion. It starts by performing the assignment word = 'Call', effectively using the word variable to name the first item of the list. Then it displays the value of word to the user. Next, it goes back to the for statement, and performs the assignment word = 'me', before displaying this new value to the user, and so on. It continues in this fashion until every item of the list has been processed.

Looping with Conditions

Now we can combine the if and for statements. We will loop over every item of the list, and only print the item if it ends with the letter "l". We'll pick another name for the variable to demonstrate that Python doesn't try to make sense of variable names.

>>> sent1 = ['Call', 'me', 'Ishmael', '.']
>>> for xyzzy in sent1:
...     if xyzzy.endswith('l'):
...         print xyzzy

You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the indented block that follows.

We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif "else if" statement, and the else statement. Notice that these also have colons before the indented code.

>>> for token in sent1:
...     if token.islower():
...         print 'lowercase word'
...     elif token.istitle():
...         print 'titlecase word'
...     else:
...         print 'punctuation'
titlecase word
lowercase word
titlecase word
punctuation

As you can see, even with this small amount of Python knowledge, you can start to build multi-line Python programs. It's important to develop such programs in pieces, testing that each piece does what you expect before combining them into a program. This is why the Python interactive interpreter is so invaluable, and why you should get comfortable using it.
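
One way to test such a piece on its own is to wrap the if/elif/else chain in a function and try it on individual tokens. This is a minimal sketch for Python 3; describe is a hypothetical name, not something defined elsewhere in the book.

```python
def describe(token):
    """Label a token using the same tests as the loop above."""
    if token.islower():
        return 'lowercase word'
    elif token.istitle():
        return 'titlecase word'
    else:
        return 'punctuation'

sent1 = ['Call', 'me', 'Ishmael', '.']
labels = [describe(token) for token in sent1]
print(labels)
print(describe('THE'))   # neither lowercase nor titlecase
```

Trying a token such as 'THE' shows that the else branch is a catch-all: anything that fails the first two tests is labelled punctuation, whether or not it really is.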

Finally, let's combine the idioms we've been exploring. First we create a list of cie and cei words, then we loop over each item and print it. Notice the comma at the end of the print statement, which tells Python to produce its output on a single line.

>>> confusing = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
>>> for word in confusing:
...     print word,
ancient ceiling conceit conceited conceive conscience
conscientious conscientiously deceitful deceive ...
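
The trailing comma belongs to the print statement of the interpreter used in this chapter. For readers on Python 3 (an assumption about your setup), where print is a function, a similar effect can be had with the end argument or by joining the words first. A minimal sketch with a made-up word list:

```python
confusing = ['ancient', 'ceiling', 'conceit', 'deceive']

# Option 1: finish each item with a space rather than a newline.
for word in confusing:
    print(word, end=' ')
print()   # finish the line

# Option 2: build the whole line first, then print it once.
line = ' '.join(confusing)
print(line)
```

The second option also leaves the line in a variable, which is convenient if the text is needed again later.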

1.5   Automatic Natural Language Understanding

We have been exploring language bottom-up, with the help of texts, dictionaries, and a programming language. However, we're also interested in exploiting our knowledge of language and computation by building useful language technologies.

At a purely practical level, we all need help to navigate the universe of information locked up in text on the Web. Search engines have been crucial to the growth and popularity of the Web, but have some shortcomings. It takes skill, knowledge, and some luck, to extract answers to such questions as What tourist sites can I visit between Philadelphia and Pittsburgh on a limited budget? What do experts say about digital SLR cameras? What predictions about the steel market were made by credible commentators in the past week? Getting a computer to answer them automatically involves a range of language processing tasks, including information extraction, inference, and summarization, and would need to be carried out on a scale and with a level of robustness that is still beyond our current capabilities.

On a more philosophical level, a long-standing challenge within artificial intelligence has been to build intelligent machines, and a major part of intelligent behaviour is understanding language. For many years this goal has been seen as too difficult. However, as NLP technologies become more mature, and robust methods for analysing unrestricted text become more widespread, the prospect of natural language understanding has re-emerged as a plausible goal.

In this section we describe some language processing components and systems, to give you a sense of the interesting challenges that are waiting for you.

Pronoun Resolution

A deeper kind of language understanding is to work out who did what to whom, i.e. to detect the subjects and objects of verbs. You learnt to do this in elementary school, but it's harder than you might think. In the sentence the thieves stole the paintings it is easy to tell who performed the stealing action. Consider the three possible following sentences in (a)-(c) below, and try to determine what was sold, caught, and found (one case is ambiguous).


a.The thieves stole the paintings. They were subsequently sold.

b.The thieves stole the paintings. They were subsequently caught.

c.The thieves stole the paintings. They were subsequently found.

Answering this question involves finding the antecedent of the pronoun they (the thieves or the paintings). Computational techniques for tackling this problem include anaphora resolution (identifying what a pronoun or noun phrase refers to) and semantic role labeling (identifying how a noun phrase relates to the verb, as agent, patient, instrument, and so on).
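
To make the difficulty concrete, here is a deliberately naive sketch in Python 3. It assumes a hand-written table saying which candidate antecedents can plausibly undergo each action. The table PLAUSIBLE_PATIENTS and the function resolve_they are invented for illustration. Real systems must acquire this kind of knowledge on a much larger scale.

```python
# Hand-written, purely illustrative knowledge: which candidates can plausibly
# be sold, caught or found. Nothing here comes from NLTK.
PLAUSIBLE_PATIENTS = {
    'sold': {'paintings'},
    'caught': {'thieves'},
    'found': {'thieves', 'paintings'},
}

def resolve_they(verb, candidates=('thieves', 'paintings')):
    """Return the candidate antecedents of 'they' compatible with the verb."""
    allowed = PLAUSIBLE_PATIENTS.get(verb, set(candidates))
    return [c for c in candidates if c in allowed]

for verb in ['sold', 'caught', 'found']:
    print(verb, resolve_they(verb))
```

When more than one candidate survives, as with found, the table alone cannot decide, which is exactly the ambiguous case.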

Generating Language Output

If we can automatically solve such problems, we will have understood enough of the text to perform some tasks that involve generating language output, such as question answering and machine translation. In the first case, a machine should be able to answer a user's questions relating to a collection of texts:


a.Text: ... The thieves stole the paintings. They were subsequently sold. ...

b.Human: Who or what was sold?

c.Machine: The paintings.

The machine's answer demonstrates that it has correctly worked out that they refers to paintings and not to thieves. In the second case, the machine should be able to translate the text into another language, accurately conveying the meaning of the original text. In translating a text like the one below into French, we are forced to choose the gender of the pronoun in the second sentence: ils (masculine) if the thieves are found, and elles (feminine) if the paintings are found. Correct translation actually depends on correct understanding of the pronoun.


a.The thieves stole the paintings. They were subsequently found.

b.Les voleurs ont volé les peintures. Ils ont été trouvés plus tard. (the thieves)

c.Les voleurs ont volé les peintures. Elles ont été trouvées plus tard. (the paintings)

All of the above examples (working out the sense of a word, the subject of a verb, the antecedent of a pronoun) are steps in establishing the meaning of a sentence, things we would expect a language understanding system to be able to do. We'll come back to some of these topics later in the book.

Spoken Dialog Systems

In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user's text input, perform so naturally that we cannot distinguish it from a human-generated response? In contrast, today's commercial dialogue systems are very limited, but still perform useful functions in narrowly-defined domains, as we see below:

S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but
it's playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.

You could not ask this system to provide driving instructions or details of nearby restaurants unless the required information had already been stored and suitable question-answer pairs had been incorporated into the language processing system.

Observe that the above system seems to understand the user's goals: the user asks when a movie is showing and the system correctly determines from this that the user wants to see the movie. This inference seems so obvious that you probably didn't notice it was made, yet a natural language system needs to be endowed with this capability in order to interact naturally. Without it, when asked Do you know when Saving Private Ryan is playing, a system might unhelpfully respond with a cold Yes. However, the developers of commercial dialogue systems use contextual assumptions and business logic to ensure that the different ways in which a user might express requests or provide information are handled in a way that makes sense for the particular application. So, if you type When is ..., or I want to know when ..., or Can you tell me when ..., simple rules will always yield screening times. This is enough for the system to provide a useful service.
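
The "simple rules" mentioned above can be sketched with a regular expression. This is only an illustration in Python 3, not a description of any commercial system. The names SCREENINGS, REQUEST and respond are invented, and the screening data is copied from the sample dialogue.

```python
import re

# Hypothetical screening data, copied from the sample dialogue above.
SCREENINGS = {
    'saving private ryan': {'Madison': ['3:00', '5:30', '8:00', '10:30']},
}

# Several surface forms of the same request map onto one pattern.
REQUEST = re.compile(
    r'^(?:do you know |can you tell me |i want to know )?'
    r'when (?:is )?(?P<title>.+?)(?: is)? (?:playing|showing)\??$',
    re.IGNORECASE)

def respond(utterance):
    """Return screening times for a recognised request, else a fallback."""
    match = REQUEST.match(utterance.strip())
    if match is None:
        return 'Sorry, I can only answer questions about screening times.'
    theaters = SCREENINGS.get(match.group('title').lower())
    if theaters is None:
        return 'I have no screenings listed for that movie.'
    parts = ['%s theater at %s' % (theater, ', '.join(times))
             for theater, times in sorted(theaters.items())]
    return 'It is playing at the ' + '; '.join(parts) + '.'

print(respond('Do you know when Saving Private Ryan is playing'))
```

Because the polite prefix is optional in the pattern, the yes/no question is treated as a request for times rather than answered with a bare Yes. Anything outside the pattern falls through to the fallback reply, which reflects how narrow such a system is.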

Dialogue systems give us an opportunity to mention the complete processing pipeline for NLP. Figure 1.4 shows the architecture of a simple dialogue system.


Figure 1.4: Simple Pipeline Architecture for a Spoken Dialogue System

Along the top of the diagram, moving from left to right, is a "pipeline" of some language understanding components. These map from speech input via syntactic parsing to some kind of meaning representation. Along the middle, moving from right to left, is the reverse pipeline of components for converting concepts to speech. These components make up the dynamic aspects of the system. At the bottom of the diagram are some representative bodies of static information: the repositories of language-related data that the processing components draw on to do their work.

Textual Entailment

The challenge of language understanding has been brought into focus in recent years by a public "shared task" called Recognizing Textual Entailment (RTE). The basic scenario is simple. Suppose you want to find evidence to support the hypothesis: Sandra Goudie was defeated by Max Purnell, and that you have another short text that seems to be relevant, for example, Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour candidate Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into third place. Does the text provide enough evidence for you to accept the hypothesis? In this particular case, the answer will be No. You can draw this conclusion easily, but it is very hard to come up with automated methods for making the right decision. The RTE Challenges provide data which allow competitors to develop their systems, but not enough data for brute-force approaches using standard machine learning techniques. Consequently, some linguistic analysis is crucial. In the above example, it is important for the system to note that Sandra Goudie names the person being defeated in the hypothesis, but the person doing the defeating in the text. As another illustration of the difficulty of the task, consider the following text/hypothesis pair:


a.David Golinkin is the editor or author of eighteen books, and over 150 responsa, articles, sermons and books

b.Golinkin has written eighteen books

In order to determine whether or not the hypothesis is supported by the text, the system needs the following background knowledge: (i) if someone is an author of a book, then he/she has written that book; (ii) if someone is an editor of a book, then he/she has not written that book; (iii) if someone is editor or author of eighteen books, then one cannot conclude that he/she is author of eighteen books.
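
To see why shallow methods fall short, consider a simple word-overlap score applied to the Sandra Goudie example above. The following Python 3 sketch is our own illustration and not a method used in the RTE Challenges. The functions tokens and overlap are invented names.

```python
import re

def tokens(s):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r'[a-z0-9]+', s.lower())

def overlap(text, hypothesis):
    """Fraction of hypothesis word types that also occur in the text."""
    text_words = set(tokens(text))
    hyp_words = set(tokens(hypothesis))
    return len(hyp_words & text_words) / len(hyp_words)

text = ("Sandra Goudie was first elected to Parliament in the 2002 elections, "
        "narrowly winning the seat of Coromandel by defeating Labour candidate "
        "Max Purnell and pushing incumbent Green MP Jeanette Fitzsimons into "
        "third place.")
hypothesis = "Sandra Goudie was defeated by Max Purnell"

score = overlap(text, hypothesis)
print(round(score, 2))
```

Six of the seven hypothesis words occur in the text, so the overlap score is high even though the text contradicts the hypothesis. Counting shared words cannot tell who defeated whom.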

Limitations of NLP

Despite the research-led advances in tasks like RTE, natural language systems that have been deployed for real-world applications still cannot perform common-sense reasoning or draw on world knowledge in a general and robust manner. We can wait for these difficult artificial intelligence problems to be solved, but in the meantime it is necessary to live with some severe limitations on the reasoning and knowledge capabilities of natural language systems. Accordingly, right from the beginning, an important goal of NLP research has been to make progress on the holy grail of natural language understanding, using superficial yet powerful counting and symbol manipulation techniques, but without recourse to this unrestricted knowledge and reasoning capability.

This is one of the goals of this book, and we hope to equip you with the knowledge and skills to build useful NLP systems, and to contribute to the long-term vision of building intelligent machines.

1.6   Summary

  • Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing and the len() function on lists.
  • We get the vocabulary of a text t using sorted(set(t)).
  • To get the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set(w.lower() for w in text if w.isalpha()).
  • We operate on each item of a text using [f(x) for x in text].
  • We process each word in a text using a for statement such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
  • We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
  • A frequency distribution is a collection of items along with their frequency counts (e.g. the words of a text and their frequency of appearance).
  • WordNet is a semantically-oriented dictionary of English, consisting of synonym sets — or synsets — and organized into a hierarchical network.
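
The idioms summarized above can be tried together on a small list. This is a sketch in Python 3 using a hand-made text. Counter from the standard library stands in for a frequency distribution here and is not NLTK's own class.

```python
from collections import Counter

# A tiny hand-made text standing in for the book's text objects.
text = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail', ',', 'the', 'film', '.']

first_two = text[:2]                    # slicing
vocab = sorted(set(text))               # vocabulary
norm_vocab = sorted(set(w.lower() for w in text if w.isalpha()))
lengths = [len(w) for w in text]        # operate on each item

short_words = []
for w in text:                          # for statement: colon, indented block
    if len(w) < 5:                      # if statement: colon, indented block
        short_words.append(w)

counts = Counter(text)                  # items with their frequency counts
print(norm_vocab)
print(counts['the'])
```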

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

2   Text Corpora and Lexical Resources

Practical work in Natural Language Processing usually involves a variety of established bodies of linguistic data. Such a body of text is called a corpus (plural corpora). The goal of this chapter is to answer the following questions:

  1. What are some useful text corpora and lexical resources, and how can we access them with Python?
  2. Which Python constructs are most helpful for this work?
  3. How do we re-use code effectively?

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait till later before exploring each Python construct systematically. Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and — if you're game — modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

2.1   Accessing Text Corpora

As just mentioned, a text corpus is any large body of text. Many, but not all, corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts — one per address — but we glued them end-to-end and treated them as a single text. In this section we will examine a variety of text corpora and will see how to select individual texts, and how to work with them.

The Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.files(), the files in NLTK's corpus of Gutenberg texts:

>>> import nltk
>>> nltk.corpus.gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
'whitman-leaves.txt')

Let's pick out the first of these texts — Emma by Jane Austen — and give it a short name emma, then find out how many words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)


You cannot carry out concordancing (and other tasks from Section 1.1) using a text defined this way. Instead you have to make the following statement:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

When we defined emma, we invoked the words() function of the gutenberg module in NLTK's corpus package. Since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

>>> from nltk.corpus import gutenberg
>>> gutenberg.files()
('austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
'whitman-leaves.txt')

Let's write a short program to display other information about each text:

>>> for file in gutenberg.files():
...     num_chars = len(gutenberg.raw(file))
...     num_words = len(gutenberg.words(file))
...     num_sents = len(gutenberg.sents(file))
...     num_vocab = len(set(w.lower() for w in gutenberg.words(file)))
...     print num_chars/num_words, num_words/num_sents, num_words/num_vocab, file
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 24 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 12 8 shakespeare-caesar.txt
4 13 7 shakespeare-hamlet.txt
4 13 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt

This program has displayed three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it is always 4. (This figure is only a rough estimate: the division rounds down to a whole number, and num_chars also counts the spaces between words.) Average sentence length and lexical diversity appear to be characteristics of particular authors.

This example also showed how we can access the "raw" text of the book, not split up into words. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare',
'1603', ']'], ['Actus', 'Primus', '.'], ...]
>>> macbeth_sentences[1038]
['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';',
'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble']
>>> longest_len = max(len(s) for s in macbeth_sentences)
>>> [s for s in macbeth_sentences if len(s) == longest_len]
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that',
'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The',
'mercilesse', 'Macdonwald', ...], ...]


Most NLTK corpus readers include a variety of access methods apart from words(). We access the raw file contents using raw(), and get the content sentence by sentence using sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.
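
The three statistics computed earlier in this section can be reproduced on a toy text without NLTK. This is a minimal sketch in Python 3. The variables raw and sents are invented stand-ins for what raw() and sents() would return.

```python
# Hypothetical miniature "corpus": raw text plus a hand-made tokenisation.
raw = "The cat sat. The cat ran away."
sents = [['The', 'cat', 'sat', '.'], ['The', 'cat', 'ran', 'away', '.']]
words = [w for s in sents for w in s]

num_chars = len(raw)            # counts spaces and punctuation too
num_words = len(words)
num_sents = len(sents)
num_vocab = len(set(w.lower() for w in words))

# In Python 3, / is true division; // rounds down like the session above.
avg_word_len = num_chars / num_words
avg_sent_len = num_words / num_sents
lexical_diversity = num_words / num_vocab

print(avg_word_len, avg_sent_len, lexical_diversity)
print(num_chars // num_words, num_words // num_sents, num_words // num_vocab)
```

Comparing the two printed lines shows how much detail whole-number division hides, which is worth bearing in mind when reading the table of results for the Gutenberg texts.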

Web and Chat Text

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews:

>>> from nltk.corpus import webtext
>>> for f in webtext.files():
...     print f, webtext.raw(f)[:70]
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to set fut
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop clop
overheard.txt White guy: So, do you have any plans for this evening? Asian girl: Yea
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terry Ros
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encounters.
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawberrie

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chatroom, and number of posts, e.g. 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',',
'I', 'can', 'look', 'in', 'a', 'mirror', '.']
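
Since the filenames of the chat corpus follow a fixed convention, the date, chatroom and post count can be recovered from the name alone. The following Python 3 sketch assumes every name has the same shape as the example above. The pattern FILENAME and the function parse_chat_filename are our own and are not part of NLTK.

```python
import re

# Pattern for names like '10-19-20s_706posts.xml' (month-day-room_Nposts.xml).
FILENAME = re.compile(r'^(\d{2})-(\d{2})-(\w+?)_(\d+)posts\.xml$')

def parse_chat_filename(name):
    """Split a chat-corpus filename into month, day, chatroom and post count."""
    match = FILENAME.match(name)
    if match is None:
        raise ValueError('unexpected filename: %r' % name)
    month, day, room, posts = match.groups()
    return {'month': int(month), 'day': int(day),
            'chatroom': room, 'posts': int(posts)}

info = parse_chat_filename('10-19-20s_706posts.xml')
print(info)
```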

The Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2.1 gives an example of each genre (for a complete list, see

Table 2.1: Example Document for Each Section of the Brown Corpus

ID   File  Genre            Description
A16  ca16  news             Chicago Tribune: Society Reportage
B02  cb02  editorial        Christian Science Monitor: Editorials
C17  cc17  reviews          Time Magazine: Reviews
D12  cd12  religion         Underwood: Probing the Ethics of Realtors
E36  ce36  hobbies          Norling: Renting a Car in Europe
F25  cf25  lore             Boroff: Jewish Teenage Culture
G22  cg22  belles_lettres   Reiner: Coping with Runaway Technology
H15  ch15  government       US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17  cj19  learned          Mosteller: Probability with Statistical Applications
K04  ck04  fiction          W.E.B. Du Bois: Worlds of Color
L13  cl13  mystery          Hitchens: Footsteps in the Night
M01  cm01  science_fiction  Heinlein: Stranger in a Strange Land
N14  cn15  adventure        Field: Rattlesnake Ridge
P12  cp12  romance          Callaghan: A Passion in Rome
R06  cr06  humor            Thurber: The Future, If Any, of Comedy

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(files=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

We can use the Brown Corpus to study systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let's compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre:

>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389


Your Turn: Choose a different section of the Brown Corpus, and adapt the above method to count a selection of wh words, such as what, when, where, who and why.

Next, we need to obtain counts for each genre of interest. To save re-typing, we can put the above code into a function, and use the function several times over. (We discuss functions in more detail in Section 2.3.) However, there is an even better way, using NLTK's support for conditional frequency distributions (Section 2.2), as follows:

>>> cfd = nltk.ConditionalFreqDist((g,w)
...           for g in brown.categories()
...           for w in brown.words(categories=g))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                 can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13

Observe that the most frequent modal in the news genre is will, suggesting a focus on the future, while the most frequent modal in the romance genre is could, suggesting a focus on possibilities.
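
This observation can be checked mechanically. The Python 3 sketch below copies the news and romance rows from the table above into a plain dictionary, so no corpus is needed. The helper functions most_frequent and format_row are invented for illustration and only loosely imitate what tabulate() prints.

```python
# Counts copied from two rows of the table above.
modals = ['can', 'could', 'may', 'might', 'must', 'will']
table = {
    'news':    [93, 86, 66, 38, 50, 389],
    'romance': [74, 193, 11, 51, 45, 43],
}

def most_frequent(genre):
    """Return the modal with the highest count for the given genre."""
    counts = dict(zip(modals, table[genre]))
    return max(counts, key=counts.get)

def format_row(label, values, width=6):
    """Right-align a row label followed by its values."""
    return label.rjust(10) + ''.join(str(v).rjust(width) for v in values)

print(format_row('', modals))
for genre, counts in table.items():
    print(format_row(genre, counts), '->', most_frequent(genre))
```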

Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called "training" and "test" (for training and testing algorithms that automatically detect the topic of a document, as we will explore further in Chapter 5).

>>> from nltk.corpus import reuters
>>> reuters.files()
('test/14826', 'test/14828', 'test/14829', 'test/14832', ...)
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]

Unlike the Brown Corpus, categories in the Reuters corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single name or a list of names.

>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.files('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.files(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]
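
The two-way lookup between documents and overlapping topics can be pictured as a table and its inverse. The following Python 3 sketch is our own and says nothing about how NLTK stores the Reuters categories. The topics for 'training/9865' are taken from the output above, while the single topic given for 'training/9880' is an assumption that is merely consistent with that output.

```python
from collections import defaultdict

# Toy document-to-topic table (the 'training/9880' entry is assumed).
doc_topics = {
    'training/9865': ['barley', 'corn', 'grain', 'wheat'],
    'training/9880': ['money-fx'],
}

# Invert the table so we can also go from a topic to its documents.
topic_docs = defaultdict(list)
for doc, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].append(doc)

def as_list(names):
    """Accept a single name or a list of names."""
    return [names] if isinstance(names, str) else list(names)

def categories(docs):
    """Sorted topics covered by one or more documents."""
    return sorted(set(t for d in as_list(docs) for t in doc_topics[d]))

def files(topics):
    """Sorted documents filed under one or more topics."""
    return sorted(set(d for t in as_list(topics) for d in topic_docs[t]))

print(categories(['training/9865', 'training/9880']))
print(files(['barley', 'corn']))
```

Collecting results in a set before sorting means a document that belongs to several of the requested topics is reported only once.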

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as upper case.

>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS',
'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
>>> reuters.words(categories='barley')
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]


Many other English text corpora are provided with NLTK. For a list see Appendix D.1. For more examples of how to access NLTK corpora, please consult the online guide at

US Presidential Inaugural Addresses

In Section 1.1, we looked at the US Presidential Inaugural Addresses corpus, but treated it as a single text. The graph in Figure 1.1 used word offset as one of the axes, but this is difficult to interpret. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

>>> from nltk.corpus import inaugural
>>> inaugural.files()
('1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...)
>>> [file[:4] for file in inaugural.files()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

Notice that the year of each text appears in its filename. To get the year out of the file name, we extracted the first four characters, using file[:4].

Let's look at how the words America and citizen are used over time. The following code will count similar words, such as plurals of these words, or the word Citizens as it would appear at the start of a sentence (how?). The result is shown in Figure 2.1.

>>> cfd = nltk.ConditionalFreqDist((target, file[:4])
...           for file in inaugural.files()
...           for w in inaugural.words(file)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
>>> cfd.plot()

Figure 2.1: Conditional Frequency Distribution for Two Words in the Inaugural Address Corpus
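
The counting behind this plot can be followed step by step on a toy collection. In the Python 3 sketch below the filenames follow the year-name pattern of the inaugural corpus, but the word lists are invented. A defaultdict of Counter objects stands in for NLTK's ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Made-up miniature "addresses"; only the filename pattern is from the corpus.
addresses = {
    '1789-Washington.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '1793-Washington.txt': ['Fellow', 'citizens', 'of', 'America', ',',
                            'American', 'citizens'],
}
targets = ['america', 'citizen']

# One Counter of years per target word: counts[target][year]
counts = defaultdict(Counter)
for filename, words in addresses.items():
    year = filename[:4]                      # first four characters
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                counts[target][year] += 1

for target in targets:
    print(target, sorted(counts[target].items()))
```

Lowercasing each word before calling startswith() is what lets Citizens, citizens and American all be counted under their targets.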

Corpora in Other Languages

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see Appendix B).

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.udhr.files()
('Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...)
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]

One of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. (Note that the names of the files in this corpus include information about character encoding; for now we will stick with texts in ISO Latin-1 or ASCII.)

Let's use a conditional frequency distribution to examine the differences in word lengths, for a selection of languages included in this corpus. The output is shown in Figure 2.2 (run the program yourself to see a color plot).

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist((lang, len(word))
...          for lang in languages
...          for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)

Figure 2.2: Cumulative Word Length Distributions for Several Languages
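
A cumulative plot shows, for each word length, the share of words that are that long or shorter. The Python 3 sketch below computes these percentages for an invented sample of words. The function cumulative_length_distribution is our own and is not part of NLTK.

```python
from collections import Counter
from itertools import accumulate

def cumulative_length_distribution(words):
    """Return (length, cumulative_percent) pairs in increasing length order."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    sizes = sorted(lengths)
    running = accumulate(lengths[n] for n in sizes)
    return [(n, 100.0 * c / total) for n, c in zip(sizes, running)]

# Invented sample; any list of word strings would do.
sample = ['all', 'human', 'beings', 'are', 'born', 'free', 'and', 'equal']
dist = cumulative_length_distribution(sample)
for length, percent in dist:
    print(length, percent)
```

A language with many long words reaches 100% later, so its curve rises more slowly than that of a language with mostly short words.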


Your Turn: Pick a language of interest in udhr.files(), and define a variable raw_text = udhr.raw('Language-Latin1'). Now plot a frequency distribution of the letters of the text using nltk.FreqDist(raw_text).plot().

Unfortunately, for many languages, substantial corpora are not yet available. Often there is no government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use. Some languages have no established writing system, or are endangered. A good place to check is the search service of the Open Language Archives Community, at This service indexes the catalogs of dozens of language resource archives and publishers.


The most complete inventory of the world's languages is Ethnologue,

Text Corpus Structure

The corpora we have seen exemplify a variety of common corpus structures, summarized in Figure 2.3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, since a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common.


Figure 2.3: Common Structures for Text Corpora (one point per text)

NLTK's corpus readers support efficient access to a variety of corpora, and can easily be extended to work with new corpora [REF]. Table 2.2 lists the basic methods provided by the corpus readers.

Table 2.2: Basic Methods Defined in NLTK's Corpus Package

Example                     Description
files()                     the files of the corpus
categories()                the categories of the corpus
abspath(file)               the location of the given file on disk
words()                     the words of the whole corpus
words(files=[f1,f2,f3])     the words of the specified files
words(categories=[c1,c2])   the words of the specified categories
sents()                     the sentences of the whole corpus
sents(files=[f1,f2,f3])     the sentences of the specified files
sents(categories=[c1,c2])   the sentences of the specified categories


For more information about NLTK's Corpus Package, type help(nltk.corpus.reader) at the Python prompt, or see You will probably have other text sources, stored in files on your computer or accessible via the web. We'll discuss how to work with these in Chapter 3.

Loading your own Corpus

If you have a collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader as follows:

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.files()
('README', 'connectives', 'propernames', 'web2', 'web2a', 'words')
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

The second parameter of the PlaintextCorpusReader can be a list of file pathnames, like ['a.txt', 'test/b.txt'], or a pattern that matches all file pathnames, like '[abc]/.*\.txt' (see Section 3.3 for information about regular expressions).
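
The two ways of specifying files can be illustrated without touching the file system. The Python 3 sketch below is not NLTK's implementation. The function select_files and the directory listing are invented, and the sketch assumes that a pattern has to match the whole pathname.

```python
import re

def select_files(available, spec):
    """Pick files by an explicit list of names or by a regular expression.

    Assumption: a pattern must match the entire pathname.
    """
    if isinstance(spec, str):
        pattern = re.compile(spec)
        return [f for f in available if pattern.fullmatch(f)]
    wanted = set(spec)
    return [f for f in available if f in wanted]

# Hypothetical directory listing, relative to some corpus root.
available = ['a.txt', 'a/one.txt', 'b/two.txt', 'c/notes.md', 'test/b.txt']

print(select_files(available, ['a.txt', 'test/b.txt']))
print(select_files(available, r'[abc]/.*\.txt'))
```

The pattern '[abc]/.*\.txt' picks out only text files that sit inside a directory named a, b or c, which is why a.txt and test/b.txt are not selected by it.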

2.2   Conditional Frequency Distributions

We introduced frequency distributions in Chapter 1, and saw that given some list mylist of words or other items, FreqDist(mylist) would compute the number of occurrences of each item in the list. When the texts of a corpus are divided into several categories, by genre, topic, author, etc., we can maintain separate frequency distributions for each category to enable study of systematic differences between the categories. In the previous section we achieved this using NLTK's ConditionalFreqDist data type. A conditional frequency distribution is a collection of frequency distributions, each one for a different "condition". The condition will often be the category of the text. Figure 2.4 depicts a fragment of a conditional frequency distribution having just two conditions, one for news text and one for romance text.


Figure 2.4: Counting Words Appearing in a Text Collection (a conditional frequency distribution)

Conditions and Events

As we saw in Chapter 1, a frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each such event with a condition. So instead of processing a text (a sequence of words), we have to process a sequence of pairs:

>>> text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

Each pair has the form (condition, event). If we were processing the entire Brown Corpus by genre there would be 15 conditions (one for each genre), and 1,161,192 events (one for each word).
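
The idea of counting events separately for each condition can be sketched with the Python standard library alone. This is not how NLTK implements ConditionalFreqDist; it is only an illustration in Python 3, and the romance pairs below are invented.

```python
from collections import Counter, defaultdict

# A few (condition, event) pairs; the romance words are invented for illustration
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'),
         ('romance', 'They'), ('romance', 'The'), ('news', 'The')]

# One Counter (a simple frequency distribution) per condition
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(sorted(cfd))           # the conditions
print(cfd['news']['The'])    # count of 'The' under the 'news' condition
```

Each condition owns its own table of counts, so the same word can have different frequencies under different conditions.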


Counting Words by Genre

In Section 2.1 we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.

>>> cfd = nltk.ConditionalFreqDist((g,w)
...                                for g in brown.categories()
...                                for w in brown.words(categories=g))

Let's break this down, and look at just two genres, news and romance. For each genre, we loop over every word in the genre, producing pairs consisting of the genre and the word:

>>> genre_word = [(g,w) for g in ['news', 'romance'] for w in brown.words(categories=g)]
>>> len(genre_word)

So pairs at the beginning of the list genre_word will be of the form ('news', word) while those at the end will be of the form ('romance', word). (Recall that [-4:] gives us a slice consisting of the last four items of a sequence.)

>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it, and verify it has two conditions:

>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']

Let's access the two conditions, and satisfy ourselves that each is just a frequency distribution:

>>> cfd['news']
<FreqDist with 100554 samples>
>>> cfd['romance']
<FreqDist with 70022 samples>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him',
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']

Apart from combining two or more frequency distributions, and being easy to initialize, a ConditionalFreqDist provides some useful methods for tabulation and plotting. We can optionally specify which conditions to display with a conditions= parameter. When we omit it, we get all the conditions.


Your Turn: Find out which days of the week are most newsworthy, and which are most romantic. Define a variable called days containing a list of days of the week, i.e. ['Monday', ...]. Now tabulate the counts for these words using cfd.tabulate(samples=days). Now try the same thing using plot in place of tabulate.

Other Conditions

The plot in Figure 2.2 is based on a conditional frequency distribution where the condition is the name of the language and the counts being plotted are derived from word lengths. It exploits the fact that the filename for each language is the language name followed by '-Latin1' (the character encoding).

>>> cfd = nltk.ConditionalFreqDist((lang, len(word))
...          for lang in languages
...          for word in udhr.words(lang + '-Latin1'))

The plot in Figure 2.1 is based on a conditional frequency distribution where the condition is either of two words america or citizen, and the counts being plotted are the number of times the word occurs in a particular speech. It exploits the fact that the filename for each speech, e.g. 1865-Lincoln.txt, contains the year as the first four characters.

>>> cfd = nltk.ConditionalFreqDist((target, file[:4])
...           for file in inaugural.files()
...           for w in inaugural.words(file)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))

This code will generate the tuple ('america', '1865') for every instance of a word whose lowercased form starts with "america" — such as "Americans" — in the file 1865-Lincoln.txt.
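
The same comprehension can be tried on a small stand-in for the corpus. In this sketch (Python 3, no NLTK) the dictionary speeches and its word lists are invented; only the filename convention and the logic of the comprehension are taken from the example above.

```python
from collections import Counter

# Hypothetical stand-in for the corpus: filename -> words of that speech
speeches = {
    '1865-Lincoln.txt': ['Fellow', 'citizens', 'of', 'America', ',', 'Americans'],
    '1961-Kennedy.txt': ['my', 'fellow', 'Americans', 'and', 'citizens'],
}

pairs = [(target, filename[:4])                # the year is the first four characters
         for filename in sorted(speeches)
         for w in speeches[filename]
         for target in ['america', 'citizen']
         if w.lower().startswith(target)]

counts = Counter(pairs)
print(counts[('america', '1865')])
```

Both 'America' and 'Americans' contribute a ('america', '1865') pair, because the test looks only at the start of the lowercased word.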

Generating Random Text with Bigrams

We can use a conditional frequency distribution to create a table of bigrams (word pairs). (We introduced bigrams in Section 1.3.) The bigrams() function takes a list of words and builds a list of consecutive word pairs:

>>> sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
...   'and', 'the', 'earth', '.']
>>> nltk.bigrams(sent)
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'),
('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'),
('the', 'earth'), ('earth', '.')]
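
For readers curious about what such a function has to do, consecutive pairs can be built by pairing a list with a copy of itself shifted by one position. This is a sketch in plain Python 3, not NLTK's own implementation.

```python
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
        'and', 'the', 'earth', '.']

# Pair each word with its successor; zip stops when the shorter list runs out
pairs = list(zip(sent, sent[1:]))
print(pairs[:3])
```

A sentence of eleven tokens yields ten pairs, since the last token has no successor.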

In Figure 2.5, we treat each word as a condition, and for each one we effectively create a frequency distribution over the following words. The function generate_model() contains a simple loop to generate text. When we call the function, we choose a word (such as 'living') as our initial context, then once inside the loop, we print the current value of the variable word, and reset word to be the most likely token in that context (using max()); next time through the loop, we use that word as our new context. As you can see by inspecting the output, this simple approach to text generation tends to get stuck in loops; another method would be to randomly choose the next word from among the available words.

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()
>>> bigrams = nltk.bigrams(nltk.corpus.genesis.words('english-kjv.txt'))
>>> cfd = nltk.ConditionalFreqDist(bigrams)
>>> print cfd['living']
<FreqDist: 'creature': 7, 'thing': 4, 'substance': 2, ',': 1, '.': 1, 'soul': 1>
>>> generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land

Figure 2.5: Generating Random Text in the Style of Genesis
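
The alternative mentioned above, choosing the next word at random, might look like the following sketch. It is written in Python 3 with the standard library only; the names build_model and generate_random are hypothetical, and weighting each candidate by its frequency is one design choice among several.

```python
import random
from collections import Counter, defaultdict

def build_model(words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        model[first][second] += 1
    return model

def generate_random(model, word, num=15, seed=None):
    """Generate up to num words, sampling each successor by its frequency."""
    rng = random.Random(seed)
    output = []
    for _ in range(num):
        output.append(word)
        followers = model.get(word)
        if not followers:            # no known successor: stop early
            break
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights)[0]
    return output

words = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven',
         'and', 'the', 'earth', '.']
model = build_model(words)
text = generate_random(model, 'the', num=8, seed=0)
print(' '.join(text))
```

Because the successor is sampled rather than always taken to be the most frequent one, the output is less likely to fall into a fixed cycle, although every adjacent pair is still a bigram seen in the training text.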


Table 2.3: Methods Defined for NLTK's Conditional Frequency Distributions

Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist1 < cfdist2                      samples in cfdist1 occur less frequently than in cfdist2

2.3   More Python: Reusing Code

By this time you've probably retyped a lot of code. If you mess up when retyping a complex example you have to enter it again. Using the arrow keys to access and modify previous commands is helpful but only goes so far. In this section we see two important ways to reuse code: text editors and Python functions.

Creating Programs with a Text Editor

The Python interactive interpreter performs your instructions as soon as you type them. Often, it is better to compose a multi-line program using a text editor, then ask Python to run the whole program at once. Using IDLE, you can do this by going to the File menu and opening a new window. Try this now, and enter the following one-line program:

msg = 'Monty Python'

Save this program in a file, then go to the Run menu and select the command Run Module. The result in the main IDLE window should look like this:

>>> ================================ RESTART ================================

Now, where is the output showing the value of msg? The answer is that the program in the file will show a value only if you explicitly tell it to, using the print statement. So add another line to the file so that it looks as follows:

msg = 'Monty Python'
print msg

Select Run Module again, and this time you should get output that looks like this:

>>> ================================ RESTART ================================
Monty Python

From now on, you have a choice of using the interactive interpreter or a text editor to create your programs. It is often convenient to test your ideas using the interpreter, revising a line of code until it does what you expect, and consulting the interactive help facility. Once you're ready, you can paste the code (minus any >>> prompts) into the text editor, continue to expand it, and finally save the program in a file so that you don't have to type it in again later. Give the file a short but descriptive name, using all lowercase letters, separating words with underscores, and using the .py filename extension.


Our inline code examples will continue to include the >>> and ... prompts as if we are interacting directly with the interpreter. As they get more complicated, you should instead type them into the editor, without the prompts, and run them from the editor as shown above.


Suppose that you work on analyzing text that involves different forms of the same word, and that part of your program needs to work out the plural form of a given singular noun. Suppose it needs to do this work in two places, once when it is processing some texts, and again when it is processing user input.

Rather than repeating the same code several times over, it is more efficient and reliable to localize this work inside a function. A function is just a named block of code that performs some well-defined task. It usually has some inputs, also known as parameters, and it may produce a result, also known as a return value. We define a function using the keyword def followed by the function name and any input parameters, followed by the body of the function. Here's the function we saw in Section 1.1:

>>> def score(text):
...     return len(text) / len(set(text))

We use the keyword return to indicate the value that is produced as output by the function. In the above example, all the work of the function is done in the return statement. Here's an equivalent definition which does the same work using multiple lines of code. We'll change the parameter name to remind you that this is an arbitrary choice:

>>> def score(my_text_data):
...     word_count = len(my_text_data)
...     vocab_size = len(set(my_text_data))
...     richness_score = word_count / vocab_size
...     return richness_score

Notice that we've created some new variables inside the body of the function. These are local variables and are not accessible outside the function. Notice also that defining a function like this produces no output. Functions do nothing until they are "called" (or "invoked").
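
The point about local variables can be checked directly. The sketch below is written for Python 3, where / performs true division, so the score is a fraction rather than a whole number; the sample word list is invented.

```python
def score(my_text_data):
    word_count = len(my_text_data)
    vocab_size = len(set(my_text_data))
    return word_count / vocab_size

# Nothing happens until the function is called
result = score(['to', 'be', 'or', 'not', 'to', 'be'])
print(result)

# word_count exists only while score() is running
try:
    word_count
    visible = True
except NameError:
    visible = False
print(visible)
```

Six tokens over four distinct words gives 1.5, and the attempt to read word_count outside the function raises a NameError, which the sketch catches.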

Let's return to our earlier scenario, and actually define a simple plural function. The function plural() in Figure 2.6 takes a singular noun and generates a plural form (one which is not always correct).

def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    return word + 's'
>>> plural('fairy')
'fairies'
>>> plural('woman')
'women'

Figure 2.6: Example of a Python function

(There is much more to be said about functions, but we will hold off until Section 6.2.)


Over time you will find that you create a variety of useful little text processing functions, and you end up copy-pasting them from old programs to new ones. Which file contains the latest version of the function you want to use? It makes life a lot easier if you can collect your work into a single place, and access previously defined functions without any copying and pasting.

To do this, save your function(s) in a file called (say) textproc.py. Now, you can access your work simply by importing it from the file:

>>> from textproc import plural
>>> plural('wish')
'wishes'
>>> plural('fan')
'fen'

Our plural function has an error, and we'll need to fix it. This time, we won't produce another version, but instead we'll fix the existing one. Thus, at every stage, there is only one version of our plural function, and no confusion about which one we should use.
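
One possible repair, offered only as a sketch and not as the authors' fix, is to narrow the third rule so that it applies to words ending in man rather than to every word ending in an. The narrower condition is an assumption, and the function remains far from complete (it still turns human into humen, for example).

```python
def plural(word):
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ['sh', 'ch']:
        return word + 'es'
    elif word.endswith('man'):       # assumed narrower rule: woman -> women
        return word[:-2] + 'en'
    return word + 's'

print(plural('fan'), plural('woman'))
```

Because the function keeps its name and lives in a single file, every program that imports it picks up the corrected behavior.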

A collection of variable and function definitions in a file is called a Python module. A collection of related modules is called a package. NLTK's code for processing the Brown Corpus is an example of a module, and its collection of code for processing all the different corpora is an example of a package. NLTK itself is a set of packages, sometimes called a library.

In general, we use import statements when we want to get access to Python code that doesn't already come as part of core Python. This code will exist somewhere as one or more files. Each such file corresponds to a Python module, which is a way of grouping together code and data that we regard as reusable. When you write down some Python statements in a file, you are in effect creating a new Python module. And you can make your code depend on another module by using the import statement.


If you are creating a file to contain some of your Python code, do not name your file nltk.py: it may get imported in place of the "real" NLTK package. (When it imports modules, Python first looks in the current folder / directory.)

2.4   Lexical Resources

A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, if we have defined a text my_text, then vocab = sorted(set(my_text)) builds the vocabulary of my_text, while word_freq = FreqDist(my_text) counts the frequency of each word in the text. Both vocab and word_freq are simple lexical resources. Similarly, a concordance (Section 1.1) gives us information about word usage that might help in the preparation of a dictionary.

Standard terminology for lexicons is illustrated in Figure 2.7.


Figure 2.7: Lexicon Terminology

The simplest kind of lexicon is nothing more than a sorted list of words. Sophisticated lexicons include complex structure within and across the individual entries. In this section we'll look at some lexical resources included with NLTK.

Wordlist Corpora

NLTK includes some corpora that are nothing more than wordlists. The Words corpus is the /usr/dict/words file from Unix, used by some spell checkers. We can use it to find unusual or mis-spelt words in a text corpus, as shown in Figure 2.8.

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)
>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
['abbeyland', 'abhorrence', 'abominably', 'abridgement', 'accordant', 'accustomary',
'adieus', 'affability', 'affectedly', 'aggrandizement', 'alighted', 'allenham',
'amiably', 'annamaria', 'annuities', 'apologising', 'arbour', 'archness', ...]
>>> unusual_words(nltk.corpus.nps_chat.words())
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abou', 'abourted', 'abs', 'ack', 'acros',
'actualy', 'adduser', 'addy', 'adoted', 'adreniline', 'ae', 'afe', 'affari', 'afk',
'agaibn', 'agurlwithbigguns', 'ahah', 'ahahah', 'ahahh', 'ahahha', 'ahem', 'ahh', ...]

Figure 2.8: Using a Lexical Resource to Filter a Text

There is also a corpus of stopwords, that is, high-frequency words like the, to and also that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',
'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow',
'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]

Let's define a function to compute what fraction of words in a text are not in the stopwords list:

>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return 1.0 * len(content) / len(text)
>>> content_fraction(nltk.corpus.reuters.words())

Thus, with the help of stopwords we filter out a third of the words of the text. Notice that we've combined two different kinds of corpus here, using a lexical resource to filter the content of a text corpus.
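
The same calculation can be tried without any corpus data. In this Python 3 sketch the stopword list and the text are tiny invented samples, and the function takes the stopwords as a second parameter, which differs from the definition above. Converting the list to a set is a small optimization, since membership tests on a set do not have to scan every item.

```python
def content_fraction(text, stopwords):
    stopset = set(stopwords)         # set membership tests are fast
    content = [w for w in text if w.lower() not in stopset]
    return len(content) / len(text)

# Hypothetical miniature stopword list and text
stopwords = ['the', 'to', 'also', 'a', 'of']
text = ['The', 'jury', 'also', 'praised', 'the', 'administration', 'of', 'elections']
fraction = content_fraction(text, stopwords)
print(fraction)
```

Four of the eight tokens are stopwords once case is ignored, so half of the text counts as content.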


Figure 2.9: A Word Puzzle Known as "Target"

A wordlist is useful for solving word puzzles, such as the one in Figure 2.9. Our program iterates through every word and, for each one, checks whether it meets the conditions. The obligatory letter and length constraint are easy to check (and we'll only look for words with six or more letters here). It is trickier to check that candidate solutions only use combinations of the supplied letters, especially since some of the latter appear twice (here, the letter v). We use the FreqDist comparison method to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> puzzle_letters = nltk.FreqDist('egivrvonl')
>>> obligatory = 'r'
>>> wordlist = nltk.corpus.words.words()
>>> [w for w in wordlist if len(w) >= 6
...                      and obligatory in w
...                      and nltk.FreqDist(w) <= puzzle_letters]
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor',
'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi',
'revolving', 'ringle', 'roving', 'violer', 'virole']
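
The letter-frequency check can also be written with collections.Counter from the standard library. This is a sketch of the same idea in Python 3, not the FreqDist comparison used above; the helper name fits is hypothetical.

```python
from collections import Counter

def fits(word, letters):
    """True if word uses each letter no more often than it occurs in letters."""
    # Counter subtraction keeps only positive counts, so an empty result
    # means the word never exceeds the supply of any letter.
    return not (Counter(word) - Counter(letters))

puzzle = 'egivrvonl'
print(fits('govern', puzzle), fits('revolving', puzzle), fits('green', puzzle))
```

The word revolving is accepted because the puzzle supplies v twice, while green is rejected because it needs two copies of e and only one is available.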


Your Turn: Can you think of an English word that contains gnt? Write Python code to find any such words in the wordlist.

One more wordlist corpus is the Names corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names which appear in both files, i.e. names that are ambiguous for gender:

>>> names = nltk.corpus.names
>>> names.files()
('female.txt', 'male.txt')
>>> male_names = names.words('male.txt')
>>> female_names = names.words('female.txt')
>>> [w for w in male_names if w in female_names]
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]

It is well known that names ending in the letter a are almost always female. We can see this and some other patterns in the graph in Figure 2.10, produced by the following code:

>>> cfd = nltk.ConditionalFreqDist((file, name[-1])
...           for file in names.files()
...           for name in names.words(file))
>>> cfd.plot()

Figure 2.10: Frequency of Final Letter of Female vs Male Names

A Pronouncing Dictionary

As we have seen, the entries in a wordlist lack internal structure — they are just words. A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for US English, which was designed for use by speech synthesizers.

>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)
>>> for entry in entries[39943:39951]:
...     print entry
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])

For each word, this lexicon provides a list of phonetic codes (distinct labels for each contrastive sound) known as phones. Observe that fire has two pronunciations (in US English): the one-syllable F AY1 R, and the two-syllable F AY1 ER0. The symbols in the CMU Pronouncing Dictionary are from the Arpabet.

Each entry consists of two parts, and we can process these individually, using a more complex version of the for statement. Instead of writing for entry in entries:, we replace entry with two variable names. Now, each time through the loop, word is assigned the first part of the entry, and pron is assigned the second part of the entry:

>>> for word, pron in entries:
...     if len(pron) == 3:
...         ph1, ph2, ph3 = pron
...         if ph1 == 'P' and ph3 == 'T':
...             print word, ph2,
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1
pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1
pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1

The above program scans the lexicon looking for entries whose pronunciation consists of three phones (len(pron) == 3). If the condition is true, we assign the contents of pron to three new variables ph1, ph2 and ph3. Notice the unusual form of the statement which does that work: ph1, ph2, ph3 = pron.
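
Both kinds of unpacking can be seen on a handful of entries typed in by hand. The list below is a small sample modeled on the output shown above, not the full dictionary, and the sketch collects results in a list (a choice made here so they can be inspected) instead of printing them.

```python
entries = [('pat', ['P', 'AE1', 'T']),
           ('pet', ['P', 'EH1', 'T']),
           ('pit', ['P', 'IH1', 'T']),
           ('fir', ['F', 'ER1']),
           ('fire', ['F', 'AY1', 'R'])]

found = []
for word, pron in entries:           # unpack each (word, pron) pair
    if len(pron) == 3:
        ph1, ph2, ph3 = pron         # unpack the three phones
        if ph1 == 'P' and ph3 == 'T':
            found.append((word, ph2))
print(found)
```

The length test matters: unpacking a two-phone pronunciation such as that of fir into three variables would raise a ValueError.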

Here's another example of the same for statement, this time used inside a list comprehension. This program finds all words whose pronunciation ends with a syllable sounding like nicks. You could use this method to find rhyming words.

>>> syllable = ['N', 'IH0', 'K', 'S']
>>> [word for word, pron in entries if pron[-4:] == syllable]
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics',
'chetniks', "clinic's", 'clinics', 'conics', 'cynics', 'diasonics', "dominic's",
'ebonics', 'electronics', "electronics'", 'endotronics', "endotronics'", 'enix', ...]

Notice that the one pronunciation is spelt in several ways: nics, niks, nix, even ntic's with a silent t, for the word atlantic's. Let's look for some other mismatches between pronunciation and writing. Can you summarize the purpose of the following examples and explain how they work?

>>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']
>>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']

The phones contain digits, to represent primary stress (1), secondary stress (2) and no stress (0). As our final example, we define a function to extract the stress digits and then scan our lexicon to find words having a particular stress pattern.

>>> def stress(pron):
...     return [int(char) for phone in pron for char in phone if char.isdigit()]
>>> [w for w, pron in entries if stress(pron) == [0, 1, 0, 2, 0]]
['abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator',
'accentuated', 'accentuating', 'accommodated', 'accommodating', 'accommodative',
'accumulated', 'accumulating', 'accumulative', 'accumulator', 'accumulators', ...]
>>> [w for w, pron in entries if stress(pron) == [0, 2, 0, 1, 0]]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients',
'academicians', 'accommodation', 'accommodations', 'accreditation', 'accreditations',
'accumulation', 'accumulations', 'acetylcholine', 'acetylcholine', 'adjudication', ...]

Note that this example has a user-defined function inside the condition of a list comprehension.
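
To see what stress() returns for a single pronunciation, it can be applied to two of the entries listed earlier. The sketch runs in Python 3 without NLTK, with the pronunciations copied in by hand.

```python
def stress(pron):
    return [int(char) for phone in pron for char in phone if char.isdigit()]

firearm = ['F', 'AY1', 'ER0', 'AA2', 'R', 'M']
fir = ['F', 'ER1']

print(stress(firearm))
print(stress(fir))
```

Consonant phones carry no digit and are skipped, so the result has one number per vowel phone.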

Rather than iterating over the whole dictionary, we can also access it by looking up particular words. (This uses Python's dictionary data structure, which we will study in Section 4.3.)

>>> prondict = nltk.corpus.cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
>>> prondict['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> prondict['blog'] = [['B', 'L', 'AA1', 'G']]
>>> prondict['blog']
[['B', 'L', 'AA1', 'G']]

We look up a dictionary by specifying its name, followed by a key (such as the word fire) inside square brackets: prondict['fire']. If we try to look up a non-existent key, we get a KeyError. This is similar to what happens when we index a list with an integer that is too large, which produces an IndexError. The word blog is missing from the pronouncing dictionary, so we tweak our version by assigning a value for this key (this has no effect on the NLTK corpus; next time we access it, blog will still be absent).
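
When a missing key is a normal occurrence, the KeyError can be avoided by testing for the key first, or by asking for a default value. The following Python 3 sketch uses a small hand-built dictionary in place of the real one.

```python
# Hand-built stand-in: each word maps to a list of pronunciations
prondict = {'fire': [['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']],
            'fir': [['F', 'ER1']]}

for w in ['fire', 'blog']:
    if w in prondict:                # test membership before indexing
        print(w, prondict[w][0])
    else:
        print(w, 'not found')

# get() returns the supplied default instead of raising KeyError
fallback = prondict.get('blog', [])
print(fallback)
```

Neither approach modifies the dictionary, so blog is still absent afterwards.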

We can use any lexical resource to process a text, e.g. to filter out words having some lexical property (like nouns), or to map every word of the text to something else. For example, the following text-to-speech function looks up each word of the text in the pronunciation dictionary.

>>> text = ['natural', 'language', 'processing']
>>> [ph for w in text for ph in prondict[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH',
'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']

Comparative Wordlists

Another example of a tabular lexicon is the comparative wordlist. NLTK includes so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified using an ISO 639 two-letter code.

>>> from nltk.corpus import swadesh
>>> swadesh.files()
('be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk',
'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk')
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that',
'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some',
'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', ...]

We can access cognate words from multiple languages using the entries() method, specifying a list of languages. With one further step we can convert this into a simple dictionary.

>>> fr2en = swadesh.entries(['fr', 'en'])
>>> fr2en
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ...]
>>> translate = dict(fr2en)
>>> translate['chien']
>>> translate['jeter']

We can make our simple translator more useful by adding other source languages. Let's get the German-English and Spanish-English pairs, convert each to a dictionary, then update our original translate dictionary with these additional mappings:

>>> de2en = swadesh.entries(['de', 'en'])    # German-English
>>> es2en = swadesh.entries(['es', 'en'])    # Spanish-English
>>> translate.update(dict(de2en))
>>> translate.update(dict(es2en))
>>> translate['Hund']
>>> translate['perro']
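
The effect of dict() and update() can be followed on a few pairs typed in by hand. The pairs in this Python 3 sketch are taken from the wordlist rows shown in this section; only a handful are used, so this is not the full table.

```python
fr2en = [('je', 'I'), ('il', 'he'), ('nous', 'we')]
de2en = [('sagen', 'say'), ('singen', 'sing')]
es2en = [('decir', 'say'), ('cantar', 'sing')]

translate = dict(fr2en)              # pairs become key -> value mappings
translate.update(dict(de2en))        # add the German keys
translate.update(dict(es2en))        # add the Spanish keys

print(translate['nous'], translate['sagen'], translate['cantar'])
```

One consequence of merging everything into a single dictionary is that a later update overwrites any existing entry with the same key, so a word spelled identically in two source languages keeps only the translation added last.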

(We will return to Python's dictionary data type dict() in Section 4.3.) We can compare words in various Germanic and Romance languages:

>>> languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'it', 'la']
>>> for i in [139, 140, 141, 142]:
...     print swadesh.entries(languages)[i]
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dire', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'cantare', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'giocare', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'galleggiare', 'fluctuare')

Shoebox and Toolbox Lexicons

Perhaps the single most popular tool used by linguists for managing data is Toolbox, previously known as Shoebox, which is freely downloadable. A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet.

Here is a dictionary for the Rotokas language. We see just the first entry, for the word kaa meaning "to gag":

>>> from nltk.corpus import toolbox
>>> toolbox.entries('rotokas.dic')
[('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'),
('vx', '1'), ('sc', '???'), ('dt', '29/Oct/2005'),
('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
('xe', 'Apoka is gagging from food while talking.')]), ...]

Entries consist of a series of attribute-value pairs, like ('ps', 'V') to indicate that the part-of-speech is 'V' (verb), and ('ge', 'gag') to indicate that the gloss-into-English is 'gag'. The last three pairs contain an example sentence in Rotokas and its translations into Tok Pisin and English.
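
Since a field may occur more than once in an entry, one simple way to look fields up by name is to collect a list of values for each field marker. The Python 3 sketch below does this for the entry shown above, copied in by hand; the variable name by_marker is hypothetical.

```python
from collections import defaultdict

entry = ('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'),
                 ('dcsv', 'true'), ('vx', '1'), ('sc', '???'), ('dt', '29/Oct/2005'),
                 ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'),
                 ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'),
                 ('xe', 'Apoka is gagging from food while talking.')])

headword, fields = entry

# Fields may repeat, so keep a list of values for each field marker
by_marker = defaultdict(list)
for marker, value in fields:
    by_marker[marker].append(value)

print(headword, by_marker['ps'], by_marker['ge'])
```

A plain dict(fields) would silently keep only the last value of any repeated field, which is why a list per marker is used here.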

The loose structure of Toolbox files makes it hard for us to do much more with them at this stage. XML provides a powerful way to process this kind of corpus and we will return to this topic in Chapter 12.


The Rotokas language is spoken on the island of Bougainville, Papua New Guinea. This lexicon was contributed to NLTK by Stuart Robinson. Rotokas is notable for having an inventory of just 12 phonemes (contrastive sounds).

2.5   WordNet

WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 "synonym sets". We'll begin by looking at synonyms and how they are accessed in WordNet.

Senses and Synonyms

Consider the sentence in (8a). If we replace the word motorcar in (8a) by automobile, to get (8b), the meaning of the sentence stays pretty much the same:


(8a) Benz is credited with the invention of the motorcar.

(8b) Benz is credited with the invention of the automobile.

Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e. they are synonyms. Let's explore these words with the help of WordNet:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]

Thus, motorcar has just one possible meaning and it is identified as car.n.01, the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set", a collection of synonymous words (or "lemmas"):

>>> wn.synset('car.n.01').lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']

Each word of a synset can have several meanings, e.g. car can also signify a train carriage, a gondola, or an elevator car. However, we are only interested in the single meaning that is common to all words of the above synset. Synsets also come with a prose definition and some example sentences:

>>> wn.synset('car.n.01').definition
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').examples
['he needs a car to get to work']

Although these help humans understand the intended meaning of a synset, the words of the synset are often more useful for our programs. To eliminate ambiguity, we will identify these words as car.n.01.automobile, car.n.01.motorcar, and so on. This pairing of a synset with a word is called a lemma, and here's how to access them:

>>> wn.synset('car.n.01').lemmas
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'),
Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
>>> wn.lemma('car.n.01.automobile')
Lemma('car.n.01.automobile')
>>> wn.lemma('car.n.01.automobile').synset
Synset('car.n.01')
>>> wn.lemma('car.n.01.automobile').name
'automobile'
Unlike the words automobile and motorcar, the word car itself is ambiguous, having five synsets:

>>> wn.synsets('car')
[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'),
Synset('cable_car.n.01')]
>>> for synset in wn.synsets('car'):
...     print synset.lemma_names
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']

For convenience, we can access all the lemmas involving the word car as follows:

>>> wn.lemmas('car')
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'),
Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]

Observe that there is a one-to-one correspondence between the synsets of car and the lemmas of car.


Your Turn: Write down all the senses of the word dish that you can think of. Now, explore this word with the help of WordNet, using the same operations we used above.

The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event — these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2.11. The edges between nodes indicate the hypernym/hyponym relation...


Figure 2.11: Fragment of WordNet Concept Hierarchy

WordNet makes it easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific: the (immediate) hyponyms.

>>> motorcar = wn.synset('car.n.01')
>>> types_of_motorcar = motorcar.hyponyms()
>>> types_of_motorcar[26]
>>> sorted([lemma.name for synset in types_of_motorcar for lemma in synset.lemmas])
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon',
'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible',
'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car',
'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap',
'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover',
'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car',
'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer',
'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan',
'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car',
'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car',
'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']

We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way. There are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified either as a vehicle or as a container.

>>> motorcar.hypernyms()
>>> [synset.name for synset in motorcar.hypernym_paths()[1]]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01',
'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01',
'car.n.01']
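
Why a synset can have more than one hypernym path is easiest to see on a small graph. The sketch below uses a hand-built dictionary, a heavily simplified fragment rather than WordNet's actual data, in which wheeled_vehicle has two hypernyms. The function name all_paths is hypothetical.

```python
# A simplified, hand-built fragment of a hypernym graph: each concept maps to
# its immediate hypernyms. Only the two parents of wheeled_vehicle follow the
# text above; the rest of the chain is abbreviated for illustration.
hypernyms = {
    'car': ['motor_vehicle'],
    'motor_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['vehicle', 'container'],
    'vehicle': ['entity'],
    'container': ['entity'],
    'entity': [],
}

def all_paths(concept, graph):
    """Return every path from a root down to concept, root first."""
    parents = graph[concept]
    if not parents:                 # a root has no hypernyms
        return [[concept]]
    paths = []
    for parent in parents:
        for path in all_paths(parent, graph):
            paths.append(path + [concept])
    return paths

for path in all_paths('car', hypernyms):
    print(' > '.join(path))
```

Each concept with two hypernyms doubles the number of paths that pass through it, which is why the toy car has two paths to entity.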

We can get the most general hypernyms (or root hypernyms) of a synset as follows:

>>> motorcar.root_hypernyms()


NLTK includes a convenient web-browser interface to WordNet: nltk.wordnet.browser().

More Lexical Relations

Hypernyms and hyponyms are called lexical "relations" because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy. Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; these are its part_meronyms(). The substances a tree is made of include heartwood and sapwood; these are its substance_meronyms(). A collection of trees forms a forest; this is among its member_holonyms():

>>> wn.synset('tree.n.01').part_meronyms()
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('stump.n.01'),
Synset('trunk.n.01'), Synset('limb.n.02')]
>>> wn.synset('tree.n.01').substance_meronyms()
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
>>> wn.synset('tree.n.01').member_holonyms()

To see just how intricate things can get, consider the word mint, which has several closely-related senses. We can see that mint.n.04 is part of mint.n.02 and the substance from which mint.n.05 is made.

>>> for synset in wn.synsets('mint', wn.NOUN):
...     print synset.name + ':', synset.definition
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
>>> wn.synset('mint.n.04').part_holonyms()
>>> wn.synset('mint.n.04').substance_holonyms()

There are also relationships between verbs. For example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments:

>>> wn.synset('walk.v.01').entailments()
>>> wn.synset('eat.v.01').entailments()
[Synset('swallow.v.01'), Synset('chew.v.01')]
>>> wn.synset('tease.v.03').entailments()
[Synset('arouse.v.07'), Synset('disappoint.v.01')]

Some lexical relationships hold between lemmas, e.g. antonymy:

>>> wn.lemma('').antonyms()
>>> wn.lemma('rush.v.01.rush').antonyms()
>>> wn.lemma('horizontal.a.01.horizontal').antonyms()
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
>>> wn.lemma('staccato.r.01.staccato').antonyms()

Semantic Similarity

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine.

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common. If two synsets share a very specific hypernym — one that is low down in the hypernym hierarchy — they must be closely related.

>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> orca.lowest_common_hypernyms(minke)
>>> orca.lowest_common_hypernyms(tortoise)
>>> orca.lowest_common_hypernyms(novel)

Of course we know that whale is very specific, vertebrate is more general, and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

>>> wn.synset('whale.n.02').min_depth()
>>> wn.synset('vertebrate.n.01').min_depth()
>>> wn.synset('entity.n.01').min_depth()
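
The idea of a lowest common hypernym and its depth can be tried out without WordNet. The following sketch uses a made-up miniature taxonomy in which every concept has a single hypernym; the chains are far shorter than WordNet's, so the depths are illustrative only, and the function names are hypothetical.

```python
# A made-up miniature taxonomy: each concept maps to its single hypernym.
# Real WordNet chains are much longer, so the depths here are not WordNet's.
parent = {
    'orca': 'whale', 'minke_whale': 'whale',
    'whale': 'vertebrate', 'tortoise': 'vertebrate',
    'vertebrate': 'entity', 'novel': 'entity',
    'entity': None,
}

def ancestors(concept):
    """List concept and all its hypernyms, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent[concept]
    return chain

def depth(concept):
    """Number of hypernym links between concept and the root."""
    return len(ancestors(concept)) - 1

def lowest_common_hypernym(a, b):
    """First hypernym of a (most specific first) that is also above b."""
    above_b = set(ancestors(b))
    for concept in ancestors(a):
        if concept in above_b:
            return concept

for other in ['minke_whale', 'tortoise', 'novel']:
    shared = lowest_common_hypernym('orca', other)
    print('%s: %s (depth %d)' % (other, shared, depth(shared)))
```

The deeper the shared hypernym, the more closely related the two concepts are in this toy hierarchy.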

The WordNet package includes a variety of sophisticated measures that incorporate this basic insight. For example, path_similarity assigns a score in the range 0 to 1, based on the shortest path that connects the concepts in the hypernym hierarchy (-1 is returned in those cases where a path cannot be found). Comparing a synset with itself will return 1.

>>> orca.path_similarity(minke)
>>> orca.path_similarity(tortoise)
>>> orca.path_similarity(novel)

This is a convenient interface, and gives us the same relative ordering as before. Several other similarity measures are available (see help(wn)).
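
To make the notion of a path-based score concrete, here is a minimal sketch on a made-up taxonomy. It assumes the score is 1 / (1 + length of the shortest path), which is one simple way to map path lengths into the range 0 to 1; the text above does not state the formula, so treat that as an assumption of this example. The concept run is added as a separate root to show the -1 case.

```python
# A made-up taxonomy (not WordNet data); 'run' is a second root with no link
# to the nouns, so no path connects it to them.
parent = {
    'orca': 'whale', 'minke_whale': 'whale',
    'whale': 'vertebrate', 'tortoise': 'vertebrate',
    'vertebrate': 'entity', 'novel': 'entity',
    'entity': None, 'run': None,
}

def ancestors(concept):
    """List concept and all its hypernyms, most specific first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent[concept]
    return chain

def path_length(a, b):
    """Edges on the shortest path between a and b, or None if unconnected."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, concept in enumerate(up_a):
        if concept in up_b:
            # Climb steps_a links from a, then descend to b.
            return steps_a + up_b.index(concept)
    return None

def path_similarity(a, b):
    """Assumed scoring: 1 / (1 + shortest path length), or -1 if no path."""
    length = path_length(a, b)
    if length is None:
        return -1
    return 1.0 / (1 + length)

for other in ['orca', 'minke_whale', 'tortoise', 'novel', 'run']:
    print('%s %s' % (other, path_similarity('orca', other)))
```

In this toy tree the scores fall as the shared hypernym becomes more general, matching the ordering discussed above.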

NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verbnet.

2.6   Summary

  • A text corpus is a large, structured collection of texts. NLTK comes with many corpora, e.g. the Brown Corpus, nltk.corpus.brown.
  • Some text corpora are categorized, e.g. by genre or topic; sometimes the categories of a corpus overlap each other.
  • To find out about some variable v that you have created, type help(v) to read the help entry for this kind of object.
  • Some functions are not available by default, but must be accessed using Python's import statement.

2.7   Further Reading (NOTES)

Natural Language Processing

Several websites have useful information about NLP, including conferences, resources, and special-interest groups. The website of the Association for Computational Linguistics contains an overview of computational linguistics, including copies of introductory chapters from recent textbooks. Wikipedia has entries for NLP and its subfields (but don't confuse natural language processing with the other NLP: neuro-linguistic programming). The new, second edition of Speech and Language Processing is a more advanced textbook that builds on the material presented here. Three books provide comprehensive surveys of the field: [Cole, 1997], [Dale, Moisl, & Somers, 2000], [Mitkov, 2002]. Several NLP systems have online interfaces that you might like to experiment with, e.g.:

  • WordNet:
  • Translation:
  • ChatterBots:
  • Question Answering:
  • Summarization:


[Rossum & Drake, 2006] is a Python tutorial by Guido van Rossum, the inventor of Python, and Fred Drake, the official editor of the Python documentation. It is available online. A more detailed but still introductory text is [Lutz & Ascher, 2003], which covers the essential features of Python, and also provides an overview of the standard libraries. A more advanced text, [Rossum & Drake, 2006], is the official reference for the Python language itself, and describes the syntax of Python and its built-in datatypes in depth. It is also available online. [Beazley, 2006] is a succinct reference book; although not suitable as an introduction to Python, it is an excellent resource for intermediate and advanced programmers. Finally, it is always worth checking the official Python Documentation.

Two freely available online texts are the following:

  • Josh Cogliati, Non-Programmer's Tutorial for Python
  • Jeffrey Elkner, Allen B. Downey and Chris Meyers, How to Think Like a Computer Scientist: Learning with Python (Second Edition)

Learn more about functions in Python by reading Chapter 4 of [Lutz & Ascher, 2003].

Archives of the CORPORA mailing list.

[Woods, Fletcher, & Hughes, 1986]


The online API documentation contains extensive reference material for all NLTK modules.

Although WordNet was originally developed for research in psycholinguistics, it is widely used in NLP and Information Retrieval. WordNets are being developed for many other languages.

For a detailed comparison of WordNet similarity measures, see [Budanitsky & Hirst, 2006].

2.8   Exercises

  1. ☼ How many words are there in text2? How many distinct words are there?
  2. ☼ Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
  3. ☼ Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?
  4. ☼ According to Strunk and White's Elements of Style, the word however, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in the various texts we have been considering.
  5. ☼ Create a variable phrase containing a list of words. Experiment with the operations described in this chapter, including addition, multiplication, indexing, slicing, and sorting.
  6. ☼ The first sentence of text3 is provided to you in the variable sent3. The index of the in sent3 is 1, because sent3[1] gives us 'the'. What are the indexes of the two other occurrences of this word in sent3?
  7. ☼ Using the Python interactive interpreter, experiment with the examples in this section. Think of a short phrase and represent it as a list of strings, e.g. ['Monty', 'Python']. Try the various operations for indexing, slicing and sorting the elements of your list.
  8. ☼ Investigate the holonym / meronym relations for some nouns. Note that there are three kinds (member, part, substance), so access is more specific, e.g., wordnet.MEMBER_MERONYM, wordnet.SUBSTANCE_HOLONYM.
  9. ☼ The polysemy of a word is the number of senses it has. Using WordNet, we can determine that the noun dog has 7 senses with: len(nltk.wordnet.N['dog']). Compute the average polysemy of nouns, verbs, adjectives and adverbs according to WordNet.
  10. ☼ Using the Python interpreter in interactive mode, experiment with the dictionary examples in this chapter. Create a dictionary d, and add some entries. What happens if you try to access a non-existent entry, e.g. d['xyz']?
  11. ☼ Try deleting an element from a dictionary, using the syntax del d['abc']. Check that the item was deleted.
  12. ☼ Create a dictionary e, to represent a single lexical entry for some word of your choice. Define keys like headword, part-of-speech, sense, and example, and assign them suitable values.
  13. ☼ Try the examples in this section, then try the following.
    1. Create a variable called msg and put a message of your own in this variable. Remember that strings need to be quoted, so you will need to type something like: msg = "I like NLP!"
    2. Now print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print statement.
    3. Try various arithmetic expressions using this string, e.g. msg + msg, and 5 * msg.
    4. Define a new string hello, and then try hello + msg. Change the hello string so that it ends with a space character, and then try hello + msg again.
  14. ☼ Consider the following two expressions which have the same result. Which one will typically be more relevant in NLP? Why?
    1. "Monty Python"[6:12]
    2. ["Monty", "Python"][1]
  15. ☼ Define a string s = 'colorless'. Write a Python statement that changes this to "colourless" using only the slice and concatenation operations.
  16. ☼ Try the slice examples from this section using the interactive interpreter. Then try some more of your own. Guess what the result will be before executing the command.
  17. ☼ We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.
  18. ☼ We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?
  19. ☼ We can also specify a "step" size for the slice. The following returns every second character within the slice: msg[6:11:2]. It also works in the reverse direction: msg[10:5:-2]. Try these for yourself, then experiment with different step values.
  20. ☼ What happens if you ask the interpreter to evaluate msg[::-1]? Explain why this is a reasonable result.
  21. ☼ Define a conditional frequency distribution over the Names corpus that allows you to see which initial letters are more frequent for males vs females (cf. Figure 2.10).
  22. ☼ Use the corpus module to read austen-persuasion.txt. How many word tokens does this book have? How many word types?
  23. ☼ Use the Brown corpus reader nltk.corpus.brown.words() or the Web text corpus reader nltk.corpus.webtext.words() to access some sample text in two different genres.
  24. ☼ Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?
  25. ◑ Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
  26. ◑ Pick a pair of texts and study the differences between them, in terms of vocabulary, vocabulary richness, genre, etc. Can you find pairs of words which have quite different meanings across the two texts, such as monstrous in Moby Dick and in Sense and Sensibility?
  27. ◑ Use text9.index(??) to find the index of the word sunset. By a process of trial and error, find the slice for the complete sentence that contains this word.
  28. ◑ Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ... sent8.
  29. ◑ What is the difference between sorted(set(w.lower() for w in text1)) and sorted(w.lower() for w in set(text1))? Which one will give a larger value? Will this be the case for other texts?
  30. ◑ Write the slice expression that produces the last two words of text2.
  31. ◑ Read the BBC News article UK's Vicky Pollards 'left behind'. The article gives the following statistic about teen language: "the top 20 words used, including yeah, no, but and like, account for around a third of all words." How many word types account for a third of all word tokens, for a variety of text sources? What do you conclude about this statistic? Read more about this on LanguageLog.
  32. ◑ Assign a new value to sent, namely the sentence ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore'], then write code to perform the following tasks:
    1. Print all words beginning with 'sh'.
    2. Print all words longer than 4 characters.
  33. ◑ What does the following Python code do? sum(len(w) for w in text1) Can you use it to work out the average word length of a text?
  34. ◑ What is the difference between the following two tests: w.isupper(), not w.islower()?
  35. ◑ Investigate the table of modal distributions and look for other patterns. Try to explain them in terms of your own impressionistic understanding of the different genres. Can you find other closed classes of words that exhibit significant differences across different genres?
  36. ◑ The CMU Pronouncing Dictionary contains multiple pronunciations for certain words. How many distinct words does it contain? What fraction of words in this dictionary have more than one possible pronunciation?
  37. ◑ What is the branching factor of the noun hypernym hierarchy? (For all noun synsets that have hyponyms, how many do they have on average?)
  38. ◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the glosses of s, all hypernyms of s, and all hyponyms of s.
  39. ☺ Review the mappings in Table 4.4. Discuss any other examples of mappings you can think of. What type of information do they map from and to?
  40. ◑ Write a program to find all words that occur at least three times in the Brown Corpus.
  41. ◑ Write a program to generate a table of token/type ratios, as we saw in Table 1.1. Include the full set of Brown Corpus genres (nltk.corpus.brown.categories()). Which genre has the lowest diversity (greatest number of tokens per type)? Is this what you would have expected?
  42. ◑ Modify the text generation program in Figure 2.5 further, to do the following tasks:
    1. Store the n most likely words in a list lwords then randomly choose a word from the list using random.choice().
    2. Select a particular genre, such as a section of the Brown Corpus, or a genesis translation, one of the Gutenberg texts, or one of the Web texts. Train the model on this corpus and get it to generate random text. You may have to experiment with different start words. How intelligible is the text? Discuss the strengths and weaknesses of this method of generating random text.
    3. Now train your system using two distinct genres and experiment with generating text in the hybrid genre. Discuss your observations.
  43. ◑ Write a program to print the most frequent bigrams (pairs of adjacent words) of a text, omitting non-content words, in order of decreasing frequency.
  44. ◑ Write a program to create a table of word frequencies by genre, like the one given above for modals. Choose your own words and try to find words whose presence (or absence) is typical of a genre. Discuss your findings.
  45. ◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.
  46. ◑ Write a function tf() that takes a word and the name of a section of the Brown Corpus as arguments, and computes the text frequency of the word in that section of the corpus.
  47. ◑ Write a program to guess the number of syllables contained in a text, making use of the CMU Pronouncing Dictionary.
  48. ◑ Define a function hedge(text) which processes a text and produces a new version with the word 'like' between every third word.
  49. Zipf's Law: Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. f.r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.
    1. Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale). What is going on at the extreme ends of the plotted line?
    2. Generate random text, e.g. using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, and generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?
  50. ★ Modify the generate_model() function in Figure 2.5 to use Python's random.choice() function to randomly pick the next word from the available set of words.
  51. ★ Define a function find_language() that takes a string as its argument, and returns a list of languages that have that string as a word. Use the udhr corpus and limit your searches to files in the Latin-1 encoding.
  52. ★ Use one of the predefined similarity measures to score the similarity of each of the following pairs of words. Rank the pairs in order of decreasing similarity. How close is your ranking to the order given here? (Note that this order was established experimentally by [Miller & Charles, 1998].)
car-automobile, gem-jewel, journey-voyage, boy-lad, coast-shore, asylum-madhouse, magician-wizard, midday-noon, furnace-stove, food-fruit, bird-cock, bird-crane, tool-implement, brother-monk, lad-brother, crane-implement, journey-car, monk-oracle, cemetery-woodland, food-rooster, coast-hill, forest-graveyard, shore-woodland, monk-slave, coast-forest, lad-wizard, chord-smile, glass-magician, rooster-voyage, noon-string.

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

3   Processing Raw Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.

The goal of this chapter is to answer the following questions:

  1. How can we write programs to access text from local files and from the web, in order to get hold of an unlimited range of language material?
  2. How can we split documents up into individual words and punctuation symbols, so we can do the same kinds of analysis we did with text corpora in earlier chapters?
  3. What features of the Python programming language are needed to do this?

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the web is in HTML format, we will also see how to dispense with markup.


From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statement: import nltk, re, pprint

3.1   Accessing Text from the Web and from Disk

Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows:

>>> from urllib import urlopen
>>> url = ""
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'


The read() process will take a few seconds as it downloads this large book. If you're using an internet proxy which is not correctly detected by Python, you may need to specify the proxy manually as follows:

>>> proxies = {'http': ''}
>>> raw = urlopen(url, proxies=proxies).read()

The variable raw contains a string with 1,176,831 characters. This is the raw content of the book, including many details we are not interested in such as whitespace, line breaks and blank lines. Instead, we want to break it up into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation. From now on we will call these tokens.

>>> tokens = nltk.wordpunct_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
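
To get a feel for what a word-and-punctuation tokenizer does, here is a rough approximation that uses only the re module. The pattern is an assumption made for this sketch and is not guaranteed to agree with nltk.wordpunct_tokenize() on every input; simple_wordpunct is a hypothetical name.

```python
import re

def simple_wordpunct(s):
    """Split s into runs of word characters and runs of punctuation."""
    # \w+ matches letters, digits and underscore; [^\w\s]+ matches anything
    # that is neither a word character nor whitespace, i.e. punctuation.
    return re.findall(r'\w+|[^\w\s]+', s)

sample = 'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
tokens = simple_wordpunct(sample)
print(tokens[:10])
print(simple_wordpunct('S. Place'))
```

Whitespace, including the line break at the end of the sample, is discarded, while each punctuation mark becomes a token of its own.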

If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations like slicing:

>>> text = nltk.Text(tokens)
>>> type(text)
<type 'nltk.text.Text'>
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in',
'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in',
'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly',
',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
>>> text.collocations()
Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr
Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch;
Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey
Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great
deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:

>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]

The find() and rfind() ("reverse find") functions help us get the right index values. Now the raw text begins with "PART I", and goes up to (but not including) the phrase that marks the end of the content.
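
The same trimming steps can be tried on a short string before applying them to a whole book. This is a sketch only: the header and footer wording below is invented as a stand-in for a downloaded file, and the check for -1 is an extra precaution not shown above.

```python
# A stand-in for a downloaded file; the header and footer wording is invented.
raw = ("Title and license header\n"
       "PART I\n"
       "CHAPTER I\n"
       "On an exceptionally hot evening early in July...\n"
       "End of the sample text\n"
       "Footer with more license details\n")

start = raw.find("PART I")                   # index of the first occurrence
end = raw.rfind("End of the sample text")    # index of the last occurrence

# find() and rfind() return -1 when the marker is missing, and a slice using
# -1 would silently give the wrong text, so check before trimming.
if start != -1 and end != -1:
    content = raw[start:end]
else:
    content = raw

print(content)
```

Because the slice stops at the index returned by rfind(), the end marker itself is excluded from the result.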

This was our first brush with reality: texts found on the web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

Dealing with HTML

Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the section on files below. However, if you're going to do this a lot, it's easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called Blondes to die out in 200 years, an urban legend reported as established scientific fact:

>>> url = ""
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

You can type print html to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

Getting text out of HTML is a sufficiently common task that NLTK provides a helper function nltk.clean_html(), which takes an HTML string and returns raw text. We can then tokenize this to get our familiar text structure:

>>> raw = nltk.clean_html(html)
>>> tokens = nltk.wordpunct_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content and select the tokens of interest, and initialize a text as before.

>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
 they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
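
To see roughly what stripping markup involves, the following sketch extracts text with the HTML parser from Python's standard library. It is not how nltk.clean_html() is implemented; the class TextExtractor, the function strip_markup and the sample page are all hypothetical, and joining text chunks with spaces is a simplification that can leave a stray space before punctuation.

```python
try:
    from html.parser import HTMLParser      # Python 3
except ImportError:
    from HTMLParser import HTMLParser       # Python 2

class TextExtractor(HTMLParser):
    """Collect text content, skipping the insides of script and style elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.chunks.append(data)

def strip_markup(html):
    """Return the text of an HTML string with tags and scripts removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return ' '.join(' '.join(parser.chunks).split())

page = ('<html><head><title>BBC NEWS | Health</title>'
        '<script>var x = 1;</script></head>'
        '<body><p>Blonde hair is caused by a <b>recessive</b> gene.</p></body></html>')
print(strip_markup(page))
```

The result is a plain string, so it can be tokenized in the same way as any other raw text.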


For more sophisticated processing of HTML, use the Beautiful Soup package.

Processing Google Results

[how to extract google hits]

LanguageLog example for absolutely

Table 3.1: Absolutely vs Definitely (Liberman 2005)

Google hits   adore     love      like      prefer
absolutely    289,000   905,000   16,200    644
definitely    1,460     51,000    158,000   62,600
ratio         198:1     18:1      1:10      1:97
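
The ratio row can be recomputed from the hit counts. This short sketch copies the counts from Table 3.1 into a dictionary; the helper name ratio and the rounding to whole numbers are choices made for this example.

```python
# Google hit counts copied from Table 3.1.
hits = {
    'absolutely': {'adore': 289000, 'love': 905000, 'like': 16200, 'prefer': 644},
    'definitely': {'adore': 1460, 'love': 51000, 'like': 158000, 'prefer': 62600},
}

def ratio(a, b):
    """Express a:b with the smaller count scaled to 1, rounded to a whole number."""
    if a >= b:
        return '%d:1' % round(float(a) / b)
    return '1:%d' % round(float(b) / a)

for verb in ['adore', 'love', 'like', 'prefer']:
    print(verb + ' ' + ratio(hits['absolutely'][verb], hits['definitely'][verb]))
```

For example, 289,000 / 1,460 is about 197.9, which rounds to the 198:1 shown for adore.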

Reading Local Files


Your Turn: Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read()

Various things might have gone wrong when you tried this. If the interpreter couldn't find your file, you would have seen an error like this:

>>> f = open('document.txt')
Traceback (most recent call last):
File "<pyshell#7>", line 1, in -toplevel-
f = open('document.txt')
IOError: [Errno 2] No such file or directory: 'document.txt'
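
In a program, as opposed to an interactive session, it is often more convenient to catch this error than to let it stop the run. The following is a minimal sketch; the function read_text and the file name are hypothetical.

```python
import os

def read_text(path):
    """Return the contents of path, or None if the file cannot be opened."""
    try:
        f = open(path)
    except IOError as err:          # raised when the file is missing or unreadable
        print('Could not open %s: %s' % (path, err))
        print('Files in the current directory: %s' % sorted(os.listdir('.')))
        return None
    try:
        return f.read()
    finally:
        f.close()

# A deliberately non-existent file name, to trigger the error branch.
contents = read_text('no-such-file-for-this-example.txt')
```

Listing the current directory in the error message makes it easier to spot a misspelled name or a file saved in the wrong place.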

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:

>>> import os
>>> os.listdir('.')

Another possible problem you might have encountered when accessing a text file is the newline conventions, which are different for different operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'rU'). Here 'r' means to open the file for reading (the default), and 'U' stands for "Universal", which lets us ignore the different conventions used for marking newlines.

Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:

>>> f.read()
'Time flies like an arrow.\nFruit flies like a banana.\n'

Recall that the '\n' characters are newlines; this is equivalent to pressing Enter on a keyboard and starting a new line.

We can also read a file one line at a time using a for loop:

>>> f = open('document.txt', 'rU')
>>> for line in f:
...     print line.strip()
Time flies like an arrow.
Fruit flies like a banana.

Here we use the strip() function to remove the newline character at the end of the input line.
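
The two ways of reading a file, and the effect of newline translation, can be checked with a file the program creates itself. This sketch is an assumption-laden variant of the session above: it uses io.open rather than open(..., 'rU') so that newline translation does not depend on the 'U' flag, and it writes the sample file to a temporary directory.

```python
import io
import os
import tempfile

# Write a small sample file with Windows-style line endings into a temporary
# directory, so the example does not depend on any file you already have.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'document.txt')
with io.open(path, 'wb') as out:
    out.write(b'Time flies like an arrow.\r\nFruit flies like a banana.\r\n')

# io.open in text mode translates '\r\n' and '\r' to '\n' by default.
with io.open(path, 'r', encoding='ascii') as f:
    whole = f.read()                          # the entire file as one string

with io.open(path, 'r', encoding='ascii') as f:
    lines = [line.strip() for line in f]      # one stripped string per line

print(repr(whole))
print(lines)

# Remove the temporary file and directory again.
os.remove(path)
os.rmdir(folder)
```

Reading the whole file keeps the newline characters inside the string, whereas the line-by-line version removes them with strip().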

NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open it in the usual way:

>>> file = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(file, 'rU').read()

Extracting Text from PDF, MSWord and other Binary Formats

ASCII text and HTML text are human readable formats. Text often comes in binary formats, like PDF and MSWord, that can only be opened using specialized software. Third-party libraries such as pypdf and pywin32 can be used to access these formats. Extracting text from multi-column documents can be particularly challenging. For once-off conversion of a few documents, it is simpler to open the document with a suitable application, then save it as text to your local drive, and access it as described above. If the document is already on the web, you can enter its URL in Google's search box. The search result often includes a link to an HTML version of the document, which you can save as text.

Getting User Input

Another source of text is a user interacting with our program. We can prompt the user to type a line of input using the Python function raw_input(). We can save that to a variable and manipulate it just as we have done for other strings.

>>> s = raw_input("Enter some text: ")
Enter some text: On an exceptionally hot evening early in July
>>> print "You typed", len(nltk.wordpunct_tokenize(s)), "words."
You typed 8 words.


Figure 3.1 summarizes what we have covered in this section, including the process of building a vocabulary that we saw in Chapter 1. (One step, normalization, will be discussed in section 3.5).


Figure 3.1: The Processing Pipeline

There's a lot going on in this pipeline. To understand it properly, it helps to be clear about the type of each variable that it mentions. We find out the type of any Python object x using type(x), e.g. type(1) is <int> since 1 is an integer.

When we load the contents of a URL or file, and when we strip out HTML markup, we are dealing with strings, Python's <str> data type (we will learn more about strings in section 3.2):

>>> raw = open('document.txt').read()
>>> type(raw)
<type 'str'>

When we tokenize a string we produce a list (of words), and this is Python's <list> type. Normalizing and sorting lists produces other lists:

>>> tokens = nltk.wordpunct_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> words = [w.lower() for w in tokens]
>>> type(words)
<type 'list'>
>>> vocab = sorted(set(words))
>>> type(vocab)
<type 'list'>

The type of an object determines what operations you can perform on it. So, for example, we can append to a list but not to a string:

>>> vocab.append('blog')
>>> raw.append('blog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'append'

Similarly, we can concatenate strings with strings, and lists with lists, but we cannot concatenate strings with lists:

>>> query = 'Who knows?'
>>> beatles = ['john', 'paul', 'george', 'ringo']
>>> query + beatles
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'list' objects

You may also have noticed that our analogy between operations on strings and numbers works for multiplication and addition, but not subtraction or division:

>>> 'very' - 'y'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for -: 'str' and 'str'
>>> 'very' / 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'str' and 'int'

These error messages are another example of Python telling us that we have got our data types in a muddle. In the first case, we are told that the operation of subtraction (i.e., -) cannot apply to objects of type str (strings), while in the second, we are told that division cannot take str and int as its two operands.
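The type of each stage in the pipeline can be checked in a few lines. The following is only a sketch, assuming Python 3; the regular expression is a stand-in tokenizer chosen so the example runs without NLTK, and is not claimed to behave exactly like nltk.wordpunct_tokenize.

```python
import re

raw = 'Time flies like an arrow.\nFruit flies like a banana.\n'

# Stand-in tokenizer (an assumption, not NLTK's): runs of word
# characters, or runs of other non-space characters.
tokens = re.findall(r'\w+|[^\w\s]+', raw)
words = [w.lower() for w in tokens]
vocab = sorted(set(words))

for name, value in [('raw', raw), ('tokens', tokens), ('vocab', vocab)]:
    print(name, type(value).__name__)

# Mixing types fails, so convert explicitly before combining.
query = 'Who knows?'
beatles = ['john', 'paul', 'george', 'ringo']
try:
    query + beatles
except TypeError as err:
    print('TypeError:', err)
combined = query + ' ' + ', '.join(beatles)
print(combined)
```

Converting the list to a string with join() before concatenating is one way to get out of the muddle; which conversion is appropriate depends on what the result is for.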

3.2   Strings: Text Processing at the Lowest Level

It's time to study a fundamental data type that we've been studiously avoiding so far. In earlier chapters we focussed on a text as a list of words. We didn't look too closely at words and how they are handled in the programming language. By using NLTK's corpus interface we were able to ignore the files that these texts had come from. The contents of a word, and of a file, are represented by programming languages as a fundamental data type known as a string. In this section we explore strings in detail, and show the connection between strings, words, texts and files.

Printing Strings

So far, when we have wanted to look at the contents of a variable or see the result of a calculation, we have just typed the variable name into the interpreter. We can also see the contents of a variable using the print statement:

>>> print monty
Monty Python

Notice that there are no quotation marks this time. When we inspect a variable by typing its name in the interpreter, the interpreter prints the Python representation of its value. Since it's a string, the result is quoted. However, when we tell the interpreter to print the contents of the variable, we don't see quotation characters since there are none inside the string.

The print statement allows us to display more than one item on a line in various ways, as shown below:

>>> grail = 'Holy Grail'
>>> print monty + grail
Monty PythonHoly Grail
>>> print monty, grail
Monty Python Holy Grail
>>> print monty, "and the", grail
Monty Python and the Holy Grail

Accessing Individual Characters

As we saw in Section 1.2 for lists, strings are indexed, starting from zero. When we index a string, we get one of its characters (or letters):

>>> monty[0]
'M'
>>> monty[3]
't'
>>> monty[5]
' '

As with lists, if we try to access an index that is outside of the string we get an error:

>>> monty[20]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range

Again as with lists, we can use negative indexes for strings, where -1 is the index of the last character. Using positive and negative indexes, we have two ways to refer to any position in a string. Here the string has a length of 12, so indexes 5 and -7 both refer to the same character (a space), since 5 = len(monty) - 7.

>>> monty[-1]
'n'
>>> monty[-7]
' '
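The relationship between the two kinds of index can be checked for every position at once. This is a small sketch, assuming Python 3 and that monty holds the string 'Monty Python' as in the earlier examples.

```python
monty = 'Monty Python'
n = len(monty)

# Every position has a non-negative index i and a negative index i - n.
for i in range(n):
    assert monty[i] == monty[i - n]

same_char = monty[5] == monty[-7]   # both refer to the space
last_char = monty[-1]
print(n, same_char, repr(last_char))
```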

We can write for loops to iterate over the characters in strings. This print statement ends with a trailing comma, which is how we tell Python not to print a newline at the end.

>>> sent = 'colorless green ideas sleep furiously'
>>> for char in sent:
...     print char,
c o l o r l e s s   g r e e n   i d e a s   s l e e p   f u r i o u s l y

We can count individual characters as well. We should ignore the case distinction by normalizing everything to lowercase, and filter out non-alphabetic characters:

>>> from nltk.corpus import gutenberg
>>> raw = gutenberg.raw('melville-moby_dick.txt')
>>> fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
>>> fdist.keys()
['e', 't', 'a', 'o', 'n', 'i', 's', 'h', 'r', 'l', 'd', 'u', 'm', 'c', 'w',
'f', 'g', 'p', 'b', 'y', 'v', 'k', 'q', 'j', 'x', 'z']

This gives us the letters of the alphabet, with the most frequently occurring letters listed first (this is quite complicated and we'll explain it more carefully below). You might like to visualize the distribution using fdist.plot(). The relative character frequencies of a text can be used in automatically identifying the language of the text.
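The same counting idea can be tried on a short string without any corpus. The sketch below assumes Python 3 and uses collections.Counter from the standard library as a stand-in for nltk.FreqDist; the two are not identical, but both map each item to its count.

```python
from collections import Counter

raw = 'Time flies like an arrow. Fruit flies like a banana.'

# Lowercase each character and keep only the alphabetic ones.
counts = Counter(ch.lower() for ch in raw if ch.isalpha())

# most_common() lists (letter, count) pairs, most frequent first.
for letter, count in counts.most_common(5):
    print(letter, count)
```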

Accessing Substrings

A substring is any continuous section of a string that we want to pull out for further processing. We can easily access substrings using the same slice notation we used for lists. For example, the following code accesses the substring starting at index 6, up to (but not including) index 10:

>>> monty[6:10]
'Pyth'

Here we see the characters are 'P', 'y', 't', and 'h' which correspond to monty[6] ... monty[9] but not monty[10]. This is because a slice starts at the first index but finishes one before the end index.

We can also slice with negative indices — the same basic rule of starting from the start index and stopping one before the end index applies; here we stop before the space character.

>>> monty[0:-7]
'Monty'

As with list slices, if we omit the first value, the substring begins at the start of the string. If we omit the second value, the substring continues to the end of the string:

>>> monty[:5]
'Monty'
>>> monty[6:]
'Python'

We can also find the position of a substring within a string, using find():

>>> monty.find('Python')
6
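The following sketch, assuming Python 3 and monty = 'Monty Python', puts slicing and searching side by side. It also shows what happens when the substring is absent, which the example above does not cover.

```python
monty = 'Monty Python'

print(monty[6:10])          # 'Pyth'
first_y = monty.find('y')   # first occurrence, searching from the left
last_y = monty.rfind('y')   # last occurrence, searching from the right
missing = monty.find('Grail')  # -1 signals "not found"

# index() behaves like find() but raises an error when nothing matches.
try:
    monty.index('Grail')
except ValueError:
    print('no match')

# A position returned by find() can be used directly in a slice.
start = monty.find('Python')
found = monty[start:start + len('Python')]
print(first_y, last_y, missing, found)
```

Because find() returns -1 rather than raising an error, it is worth checking the result before using it as a slice boundary; -1 is itself a valid index.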

Analyzing Strings

  • character frequency plot, e.g. get text in some language using language_x = nltk.corpus.udhr.raw(x), then construct its frequency distribution fdist = FreqDist(language_x), then view the distribution with fdist.keys() and fdist.plot().
  • functions involving strings, e.g. determining past tense
  • built-ins, find(), rfind(), index(), rindex()
  • revisit string tests like endswith() from chapter 1

The Difference between Lists and Strings

Strings and lists are both kinds of sequence. We can pull them apart by indexing and slicing them, and we can join them together by concatenating them. However, we cannot join strings and lists:

>>> query = 'Who knows?'
>>> beatles = ['John', 'Paul', 'George', 'Ringo']
>>> query[2]
'o'
>>> beatles[2]
'George'
>>> query[:2]
'Wh'
>>> beatles[:2]
['John', 'Paul']
>>> query + " I don't"
"Who knows? I don't"
>>> beatles + 'Brian'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "str") to list
>>> beatles + ['Brian']
['John', 'Paul', 'George', 'Ringo', 'Brian']

When we open a file for reading into a Python program, we get a string corresponding to the contents of the whole file. If we use a for loop to process the elements of this string, all we can pick out are the individual characters; we don't get to choose the granularity. By contrast, the elements of a list can be as big or small as we like: for example, they could be paragraphs, sentences, phrases, words, or characters. So lists have the advantage that we can be flexible about the elements they contain, and correspondingly flexible about any downstream processing. So one of the first things we are likely to do in a piece of NLP code is tokenize a string into a list of strings (Section 3.6). Conversely, when we want to write our results to a file, or to a terminal, we will usually format them as a string (Section 3.8).

Lists and strings do not have exactly the same functionality. Lists have the added power that you can change their elements:

>>> beatles[0] = "John Lennon"
>>> del beatles[-1]
>>> beatles
['John Lennon', 'Paul', 'George']

On the other hand if we try to do that with a string — changing the 0th character in query to 'F' — we get:

>>> query[0] = 'F'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object does not support item assignment

This is because strings are immutable — you can't change a string once you have created it. However, lists are mutable, and their contents can be modified at any time. As a result, lists support operations that modify the original value rather than producing a new value.
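In practice, "changing" a string means building a new one from pieces of the old, whereas a list really is modified in place. The sketch below assumes Python 3; the names new_query and alias are introduced here purely for illustration.

```python
query = 'Who knows?'
beatles = ['John', 'Paul', 'George', 'Ringo']

# Strings: build a new string from pieces of the old one.
new_query = 'F' + query[1:]

# Lists: the same object is changed in place, so every name
# that refers to it sees the change.
alias = beatles
beatles[0] = 'John Lennon'

print(query, '->', new_query)
print(alias)
```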

3.3   Regular Expressions for Detecting Word Patterns

Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in Figure 1.4. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.


There are many other published introductions to regular expressions, organized around the syntax of regular expressions and applied to searching text files. Instead of doing this again, we focus on the use of regular expressions at different stages of linguistic processing. As usual, we'll adopt a problem-based approach and present new features only as they are needed to solve practical problems. In our discussion we will mark regular expressions using chevrons like this: «patt».

To use regular expressions in Python we need to import the re library using: import re. We also need a list of words to search; we'll use the words corpus again (Section 2.4). We will preprocess it to remove any proper names.

>>> import re
>>> wordlist = [w for w in nltk.corpus.words.words() if w.islower()]

Ranges and Closures


Figure 3.2: T9: Text on 9 Keys

The T9 system is used for entering text on mobile phones. Two or more words that are entered using the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered using 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part of the expression, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are also constrained. Only four words satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.
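Rather than writing such a pattern by hand for each keystroke sequence, we could build it from a table of keys. This is only a sketch, assuming Python 3; the function name t9_pattern and the tiny word list are hypothetical and stand in for the full words corpus.

```python
import re

# Letters on each key of a standard phone keypad.
T9_KEYS = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
           '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build an anchored regular expression for a keystroke sequence."""
    return '^' + ''.join('[%s]' % T9_KEYS[d] for d in digits) + '$'

# Hypothetical mini word list, for illustration only.
wordlist = ['gold', 'golf', 'good', 'hold', 'hole', 'home', 'holes']

pattern = t9_pattern('4653')
textonyms = [w for w in wordlist if re.search(pattern, w)]
print(pattern)
print(textonyms)
```

The anchors ^ and $ matter here: without them, holes would also match, because the pattern would be satisfied by its first four letters.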


Your Turn: Look for some "finger-twisters", by searching for words that only use part of the number-pad. For example «^[g-o]+$» will match words that only use keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do "-" and "+" mean?

Let's explore the "+" symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', ...]
>>> [w for w in chat_words if re.search('^[ha]+$', w)]
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]

It should be clear that "+" simply means "one or more instances of the preceding item", which could be an individual character like m, a set like [fed] or a range like [d-f]. Now let's replace "+" with "*" which means "zero or more instances of the preceding item". The regular expression «^m*i*n*e*$» will match everything that we found using «^m+i+n+e+$», but also words where some of the letters don't appear at all, e.g. me, min, and mmmmm. Note that the "+" and "*" symbols are sometimes referred to as Kleene closures, or simply closures.

The "^" operator has another function when it appears inside square brackets. For example «[^aeiouAEIOU]» matches any character other than a vowel. We can search the Chat corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r and zzzzzzzz. Notice this includes non-alphabetic characters.
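The difference between "+", "*" and a negated set is easy to see on a handful of made-up tokens. The sketch below assumes Python 3, and the token list is hypothetical rather than drawn from the Chat corpus.

```python
import re

# Hypothetical sample tokens, for illustration only.
tokens = ['mine', 'miiinnee', 'me', 'min', 'mmmmm',
          'grrr', 'cyb3r', ':)', 'haha', 'hello', '']

# One or more of each letter, in order.
plus = [w for w in tokens if re.search('^m+i+n+e+$', w)]

# Zero or more of each letter, so letters may be missing altogether.
star = [w for w in tokens if re.search('^m*i*n*e*$', w)]

# One or more characters, none of which is a vowel.
no_vowels = [w for w in tokens if re.search('^[^aeiouAEIOU]+$', w)]

print(plus)
print(star)
print(no_vowels)
```

Note that the starred pattern even matches the empty string, since every one of its parts is allowed to occur zero times.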


Your Turn: Study the following examples and work out what the \, {} and | notations mean:

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',
'0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',
'1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]
>>> [w for w in wsj if re.search('^[A-Z]+\$$', w)]
['C$', 'US$']
>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]
>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]
>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', ...]
>>> [w for w in wsj if re.search('(ed|ing)$', w)]
['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', ...]

You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while '.' is special, '\.' only matches a period. The brace characters are used to specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left and the material on its right.

The meta-characters we have seen are summarized in Table 3.2.

Table 3.2: Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures

Operator    Behavior
.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches a set of characters
[A-Z0-9]    Matches a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+           One or more of previous item, e.g. a+, [a-z]+
?           Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n}         Exactly n repeats where n is a non-negative integer
{m,n}       At least m and no more than n repeats (m, n optional)
(ab|c)+     Parentheses that indicate the scope of the operators
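Several of the operators in Table 3.2 can be exercised together in a short sketch. It assumes Python 3; the token list and the helper function matching() are hypothetical, chosen so that each pattern picks out an obvious subset.

```python
import re

# Hypothetical sample tokens, for illustration only.
tokens = ['color', 'colour', '0.05', '1929', '10-year', 'US$',
          'walked', 'walking', 'abab', 'x.y']

def matching(pattern):
    """Return the tokens that the pattern matches somewhere."""
    return [w for w in tokens if re.search(pattern, w)]

optional_u = matching(r'^colou?r$')           # ? makes the u optional
decimals = matching(r'^[0-9]+\.[0-9]+$')      # \. is a literal period
four_digits = matching(r'^[0-9]{4}$')         # exactly four digits
num_word = matching(r'^[0-9]+-[a-z]{3,5}$')   # three to five letters
currency = matching(r'^[A-Z]+\$$')            # \$ is a literal dollar sign
endings = matching(r'(ed|ing)$')              # disjunction in parentheses
repeated = matching(r'^(ab)+$')               # + applies to the whole group

for result in (optional_u, decimals, four_digits, num_word,
               currency, endings, repeated):
    print(result)
```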

3.4   Useful Applications of Regular Expressions

The above examples all involved searching for words w that match some regular expression regexp using re.search(regexp, w). Apart from checking if a regular expression matches a word, we can use regular expressions to extract material from words, or to modify words in specific ways.

Extracting Word Pieces

The re.findall() ("find all") function finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:

>>> word = 'supercalifragulisticexpialidocious'
>>> re.findall('[aeiou]', word)
['u', 'e', 'a', 'i', 'a', 'u', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u']
>>> len(re.findall('[aeiou]', word))
16

Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
...                       for vs in re.findall('[aeiou]{2,}', word))
>>> fd.items()
[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253),
('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95),
('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ...]


Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31].

[int(n) for n in re.findall(?, '2009-12-31')]

Doing More with Word Pieces

Once we can use re.findall() to extract material from words, there are interesting things to do with the pieces, like glue them back together or plot them.

It is sometimes noted that English text is highly redundant, and it is still easy to read when word-internal vowels are left out. For example, declaration becomes dclrtn, and inalienable becomes inlnble, retaining any initial or final vowel sequences. This regular expression matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored. We use re.findall() to extract all the matching pieces, and ''.join() to join them together (see Section 3.8 for more about the join operation).

>>> regexp = '^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
>>> def compress(word):
...     pieces = re.findall(regexp, word)
...     return ''.join(pieces)
>>> english_udhr = nltk.corpus.udhr.words('English-Latin1')
>>> print nltk.tokenwrap(compress(w) for w in english_udhr[:75])
Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and

Next, let's combine regular expressions with conditional frequency distributions. Here we will extract all consonant-vowel sequences from the words of Rotokas, such as ka and si. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair:

>>> rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
>>> cvs = [cv for w in rotokas_words for cv in re.findall('[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(cvs)
>>> cfd.tabulate()
     a    e    i    o    u
k  418  148   94  420  173
p   83   31  105   34   51
r  187   63   84   89   79
s    0    0  100    2    1
t   47    8    0  148   37
v   93   27  105   48   49

Examining the rows for s and t, we see they are in partial "complementary distribution", which is evidence that they are not distinct phonemes in the language. Thus, we could conceivably drop s from the Rotokas alphabet and simply have a pronunciation rule that the letter t is pronounced s when followed by i.

If we want to be able to inspect the words behind the numbers in the above table, it would be helpful to have an index, allowing us to quickly find the list of words that contains a given consonant-vowel pair, e.g. cv_index['su'] should give us all words containing su. Here's how we can do this:

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall('[ptksvr][aeiou]', w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index['su']
['kasuari']
>>> cv_index['po']
['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa',
'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', ...]

This program processes each word w in turn, and for each one, finds every substring that matches the regular expression «[ptksvr][aeiou]». In the case of the word kasuari, it finds ka, su and ri. Therefore, the cv_word_pairs list will contain ('ka', 'kasuari'), ('su', 'kasuari') and ('ri', 'kasuari'). One further step, using nltk.Index(), converts this into a useful index.
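To see what such an index looks like inside, here is a minimal sketch that builds one by hand. It assumes Python 3 and uses collections.defaultdict as a stand-in for nltk.Index; the four-word list is hypothetical and far smaller than the real lexicon.

```python
import re
from collections import defaultdict

# Hypothetical handful of words; the real lexicon is much larger.
words = ['kasuari', 'kapo', 'kapoa', 'kaapo']

cv_word_pairs = [(cv, w) for w in words
                 for cv in re.findall('[ptksvr][aeiou]', w)]

# Stand-in for the index: each key maps to the list of words seen with it.
cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)

print(cv_word_pairs[:3])
print(cv_index['su'])
print(cv_index['po'])
```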

Finding Word Stems

When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same word. For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. Here's a simple-minded approach which just strips off anything that looks like a suffix:

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word

Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.

>>> re.findall('^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses for scoping the disjunction but not for selecting output, we have to add ?: (just one of many arcane subtleties of regular expressions). Here's the revised version.

>>> re.findall('^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

However, we'd actually like to split the word into stem and suffix. To do this, we should parenthesize both parts of the regular expression:

>>> re.findall('^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

This looks promising, but still has a problem. Let's look at a different word, processes:

>>> re.findall('^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:

>>> re.findall('^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]

This works even when we allow an empty suffix, by making the content of the second pair of parentheses optional:

>>> re.findall('^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
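The interaction between greediness and the optional suffix can be laid out side by side. This is a sketch assuming Python 3; the variable names are introduced here for illustration, and the fourth pattern (greedy star with an optional suffix) is an extra case not shown above.

```python
import re

SUFFIXES = 'ing|ly|ed|ious|ies|ive|es|s|ment'

greedy = '^(.*)(%s)$' % SUFFIXES              # .* grabs as much as it can
lazy = '^(.*?)(%s)$' % SUFFIXES               # .*? grabs as little as it can
lazy_optional = '^(.*?)(%s)?$' % SUFFIXES     # the suffix may be absent
greedy_optional = '^(.*)(%s)?$' % SUFFIXES    # greedy star, optional suffix

a = re.findall(greedy, 'processes')
b = re.findall(lazy, 'processes')
c = re.findall(lazy_optional, 'language')
d = re.findall(greedy_optional, 'processes')

for result in (a, b, c, d):
    print(result)
```

In the last case the greedy star swallows the entire word and the optional suffix matches nothing, so no suffix is ever found. This is why the non-greedy star is needed once the suffix is made optional.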

This approach still has many problems (can you spot them?) but we will move on to define a stemming function and apply it to a whole text:

>>> def stem(word):
...     regexp = '^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.wordpunct_tokenize(raw)
>>> [stem(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond',
'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems.

Searching Tokenized Text

You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens).

>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> chat.findall("<l.*>{3,}")
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la


Your Turn: Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s) which annotates the string s to show every place where pattern p was matched, and nltk.draw.finding_nemo() which provides a graphical interface for exploring regular expressions.

3.5   Normalizing Text

In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g. set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this, and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn.


NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter Stemmer strips affixes and knows about some special cases, e.g. that lie not ly is the stem of lying.

>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']

Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in Figure 3.3, which uses object oriented programming techniques that will be covered in Chapter REF, and string formatting techniques to be covered in section 3.8).

class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width/4                # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s'  % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print ldisplay, rdisplay

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t

Figure 3.3: Indexing a Text Using a Stemmer


The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary (and this additional checking process makes it slower). It doesn't handle lying, but it converts women to woman.

>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']

The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lexical items.
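The idea of removing an affix only when the result is a known word can be illustrated with a toy example. The sketch below assumes Python 3 and is not the WordNet lemmatizer's algorithm: the lexicon, the suffix rules and the function toy_lemmatize() are all hypothetical.

```python
# Hypothetical mini-lexicon standing in for a real dictionary.
LEXICON = {'pond', 'sword', 'basis', 'mass'}

# Hypothetical suffixes, tried in order.
SUFFIXES = ['es', 's']

def toy_lemmatize(word):
    """Strip a suffix only if the result is a known word."""
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)]
            if candidate in LEXICON:
                return candidate
    return word

results = [toy_lemmatize(w) for w in ['ponds', 'basis', 'is', 'masses', 'lying']]
print(results)
```

Unlike the regular expression stemmer of the previous section, this leaves is and basis alone, because stripping the s would not produce a word in the lexicon. The price is a dictionary lookup for every candidate.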

3.6   Regular Expressions for Tokenizing Text

Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data. Although it is a fundamental task, we have been able to delay it until now because many corpora are already tokenized, and because NLTK includes some tokenizers. Now that you are familiar with regular expressions, you can learn how to use them to tokenize text, and to have much more control over the process.

Simple Approaches to Tokenization

The very simplest method for tokenizing text is to split on whitespace. Consider the following text from Alice's Adventures in Wonderland:

>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""

We could split this raw text on whitespace using raw.split(). To do the same using a regular expression, we need to match any number of spaces, tabs, or newlines.

>>> re.split(r'[ \t\n]+', raw)
["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a',
'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in',
'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe',
"it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]

The regular expression «[ \t\n]+» matches one or more spaces, tabs (\t) or newlines (\n). Other whitespace characters, such as carriage-return and form-feed, should really be included too. Instead, we can use a built-in re abbreviation, \s, which means any whitespace character. The above statement can be rewritten as re.split(r'\s+', raw).
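As a quick check of the difference between the two patterns, here is a minimal sketch using a short made-up string (sample is a hypothetical name, not the Alice text). It suggests that «\s+» also splits on carriage returns and form feeds, which «[ \t\n]+» leaves inside the tokens.

```python
import re

# Hypothetical sample containing a carriage return (\r) and a form feed (\x0c).
sample = "one two\tthree\r\nfour\x0cfive"

narrow = re.split(r'[ \t\n]+', sample)   # spaces, tabs and newlines only
broad = re.split(r'\s+', sample)         # any whitespace character

print(narrow)
print(broad)
```

In the first list the carriage return and the form feed survive inside the tokens; in the second they are treated as separators, just as sample.split() would treat them.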


When using regular expressions that contain the backslash character, you should prefix the string with the letter r (meaning "raw"), which instructs the Python interpreter to leave the backslashes in the string as they are, rather than processing them as escape sequences.
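The effect of the r prefix can be seen with the word-boundary symbol \b, which Python would otherwise read as a backspace character. The following is only an illustrative sketch; plain_pattern, raw_pattern and text are names made up for this example.

```python
import re

plain_pattern = '\bthe\b'    # Python turns each \b into a backspace character
raw_pattern = r'\bthe\b'     # the backslashes reach the re module intact

text = 'the other the'
print(len(plain_pattern))               # 5 characters
print(len(raw_pattern))                 # 7 characters
print(re.findall(plain_pattern, text))  # no match: the text has no backspaces
print(re.findall(raw_pattern, text))    # matches the two standalone words
```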

Splitting on whitespace gives us tokens like '(not' and 'herself,'. An alternative is to use the fact that Python provides us with a character class \w for word characters (letters, digits and underscore, equivalent to [a-zA-Z0-9_]) and also the complement of this class, \W. So, we can split on anything other than a word character:

>>> re.split(r'\W+', raw)
['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in',
'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in',
'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe',
'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']

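Observe that this gives us empty strings at the start and the end of the list. They arise because raw begins and ends with non-word characters, so re.split() reports the empty string that lies before the first separator and the one that lies after the last. We get the same tokens, without the empty strings, using re.findall(r'\w+', raw), which uses a pattern that matches the words instead of the spaces. Now that we are matching the words, we can extend the regular expression to cover more cases: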

>>> re.findall(r'\w+|\S\w*', raw)
["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that',
'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']

The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g. 's) but that sequences of two or more punctuation characters are separated. Let's generalize the \w+ in the above expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.

>>> print re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw)
["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',',
'(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't",
'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does',
'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that',
'makes', 'people', 'hot-tempered', ',', "'", '...']

The above expression also included «[-.(]+», which causes the double hyphen, ellipsis, and open bracket to be tokenized separately.
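The role of ?: in the expression above can be checked directly. In this small sketch (sample is a made-up string), the only difference between the two calls is whether the parenthesized group is capturing.

```python
import re

sample = "hot-tempered it's"

# With a capturing group, findall() reports only what the group last matched.
captured = re.findall(r"\w+([-']\w+)*", sample)
# With a non-capturing group (?:...), findall() reports the whole match.
whole = re.findall(r"\w+(?:[-']\w+)*", sample)

print(captured)
print(whole)
```

The first call returns the fragments matched by the group ('-tempered' and "'s"), while the second returns the complete tokens, which is what a tokenizer needs.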

Table 3.3 lists the regular expression character class symbols we have seen in this section.

Table 3.3: Regular Expression Symbols

Symbol   Function
\b       Word boundary (zero width)
\d       Any decimal digit (equivalent to [0-9])
\D       Any non-digit character (equivalent to [^0-9])
\s       Any whitespace character (equivalent to [ \t\n\r\f\v])
\S       Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w       Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W       Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t       The tab character
\n       The newline character
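To get a feel for these symbols, it can help to try a few of them on a short string. The sketch below uses a made-up sample containing only ASCII characters; note that under Python 3 the classes \w, \d and \s also match non-ASCII letters, digits and whitespace unless the re.ASCII flag is given, so the equivalences in the table are exact only for ASCII text.

```python
import re

sample = "Room 42\tcosts $12.40\n"

digits = re.findall(r'\d', sample)      # decimal digits
spaces = re.findall(r'\s', sample)      # whitespace characters
words = re.findall(r'\w+', sample)      # runs of alphanumeric characters
initials = re.findall(r'\b\w', sample)  # first character of each such run

print(digits)
print(spaces)
print(words)
print(initials)
```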

NLTK's Regular Expression Tokenizer

The function nltk.regexp_tokenize() is like re.findall, except it is more efficient and it avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) "verbose flag" tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():_`-]  # these are separate tokens
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression is applied to the gaps between tokens (cf re.split()).


We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and reporting any tokens that don't appear in the wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first.
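The comparison described above might be sketched as follows. The tiny wordlist here is a hypothetical stand-in for a real one (such as the words corpus in nltk.corpus.words), and the tokenizing pattern is just one plausible choice.

```python
import re

# Hypothetical stand-in for a real wordlist.
wordlist = set(['that', 'poster', 'costs', 'a', 'lot'])

text = "That poster-print costs a lot..."
tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]+", text)

# Lowercase the tokens, then report those that the wordlist does not contain.
unknown = set(t.lower() for t in tokens).difference(wordlist)
print(sorted(unknown))
```

Here the report contains the ellipsis and the hyphenated form poster-print, which prompts the question of whether the tokenizer or the wordlist should change.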

Dealing with Contractions

A final issue for tokenization is the presence of contractions, such as didn't. If we are analyzing the meaning of a sentence, it would probably be more useful to normalize this form to two separate forms: did and n't (or not).
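One simple way to approach this, sketched below, is to post-process a list of tokens rather than to change the tokenizer itself. The function split_contractions() is a hypothetical helper written for this illustration; it handles only the n't case and makes no attempt to restore irregular stems.

```python
def split_contractions(tokens):
    """Split each token ending in n't into a stem and the clitic n't."""
    result = []
    for token in tokens:
        if token.lower().endswith("n't") and len(token) > 3:
            result.extend([token[:-3], token[-3:]])
        else:
            result.append(token)
    return result

print(split_contractions(['I', "didn't", 'see', 'it']))
```

Note that this splits won't into wo and n't, and can't into ca and n't; mapping such stems back to will and can would need a lookup table.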

3.7   Sentence Segmentation


Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:

>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())

In other cases, the text is only available as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter [Kiss & Strunk, 2006], along with supporting data for English. Here is an example of its use in segmenting the text of a novel:

>>> import pprint
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['" said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired, so very sad and tired?',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\nis because they know that whatever place they have taken a ticket\nfor that place they will reach.',
 'It is because after they have\npassed Sloane Square they know that the next station must be\nVictoria, and nothing but Victoria.',
 'Oh, their wild rapture!',
 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation were unaccountably Baker Street!',
 '"\n\n"It is you who are unpoetical," replied the poet Syme.']

Notice that this example is really a single sentence, reporting the speech of Mr Lucian Gregory. However, the quoted speech contains several sentences, and these have been split into individual strings. This is reasonable behavior for most applications.
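To see why a trained segmenter is worth having, compare it with a naive rule that ends a sentence at every period, exclamation mark or question mark followed by whitespace. The sketch below is not how Punkt works; it is a deliberately simple baseline applied to a made-up text.

```python
import re

text = "Oh, their wild rapture! I will tell you. Mr. Gregory was very rational."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sents = re.split(r'(?<=[.!?])\s+', text)
for sent in naive_sents:
    print(sent)
```

The rule copes with the first two sentences but wrongly breaks after the abbreviation Mr., which is the kind of case a segmenter has to learn to handle.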

3.8   Formatting: From Lists to Strings

Often we write a program to report a single data item, such as a particular element in a corpus that meets some complicated criterion, or a single summary statistic such as a word-count or the performance of a tagger. More often, we write a program to produce a structured result, such as a tabulation of numbers or linguistic forms, or a reformatting of the original data. When the results to be presented are linguistic, textual output is usually the most natural choice. However, when the results are numerical, it may be preferable to produce graphical output. In this section you will learn about a variety of ways to present program output.

Converting Between Strings and Lists

To join a list of strings into a single string, we specify the string to be used as the "glue", followed by a period, followed by the join() function.

>>> silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
>>> ' '.join(silly)
'We called him Tortoise because he taught us .'
>>> ';'.join(silly)
'We;called;him;Tortoise;because;he;taught;us;.'

So ' '.join(silly) means: take all the items in silly and concatenate them as one big string, using ' ' as a spacer between the items. (Many people find the notation for join() rather unintuitive.)

Notice that join() only works on a list of strings (what we have been calling a text).
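The restriction to strings is easy to run into when a list holds numbers. The following sketch (lengths is a made-up list) shows the error and one common way around it, which is to convert each item with str() first.

```python
lengths = [5, 3, 2]

try:
    ' '.join(lengths)            # fails: the items are integers, not strings
except TypeError as err:
    print('TypeError: %s' % err)

joined = ' '.join(str(n) for n in lengths)   # convert each item first
print(joined)
```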

Formatting Output

The output of a program is usually structured to make the information easily digestible by a reader. Instead of running some code and then manually inspecting the contents of a variable, we would like the code to tabulate some output. There are many ways we might want to format the output of a program. For instance, we might want to place the length value in parentheses after the word, and print all the output on a single line:

>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
...           'more', 'is', 'said', 'than', 'done', '.']
>>> for word in saying:
...     print word, '(' + str(len(word)) + '),',
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),

However, this approach has some problems. First, the print statement intermingles variables and punctuation, making it a little difficult to read. Second, the output has spaces around every item that was printed. Third, we have to convert the length of the word to a string so that we can surround it with parentheses. A cleaner way to produce structured output uses Python's string formatting expressions. Before diving into clever formatting tricks, however, let's look at a really simple example. We are going to use a special symbol, %s, as a placeholder in strings. Once we have a string containing this placeholder, we follow it with a single % and then a value v. Python then returns a new string where v has been slotted in to replace %s:

>>> "I want a %s right now" % "coffee"
'I want a coffee right now'

In fact, we can have a number of placeholders, but following the % operator we need to specify a tuple with exactly the same number of values.

>>> "%s wants a %s %s" % ("Lee", "sandwich", "for lunch")
'Lee wants a sandwich for lunch'

We can also provide the values for the placeholders indirectly. Here's an example using a for loop:

>>> menu = ['sandwich', 'spam fritter', 'pancake']
>>> for snack in menu:
...     "Lee wants a %s right now" % snack
'Lee wants a sandwich right now'
'Lee wants a spam fritter right now'
'Lee wants a pancake right now'

We oversimplified things when we said that placeholders were of the form %s; in fact, a placeholder is a more complex object, called a conversion specifier. It has to start with the % character, and it ends with a conversion character such as s or d. The %s specifier tells Python that the corresponding variable is a string (or should be converted into a string), while the %d specifier indicates that the corresponding variable should be converted into a decimal representation. The string containing conversion specifiers is called a format string.

Picking up on the print example that we opened this section with, here's how we can use two different kinds of conversion specifier:

>>> for word in saying:
...     print "%s (%d)," % (word, len(word)),
After (5), all (3), is (2), said (4), and (3), done (4), , (1), more (4), is (2), said (4), than (4), done (4), . (1),

To summarize, string formatting is accomplished with a three-part expression having the syntax: format % values. The format section is a string containing format specifiers such as %s and %d that Python will replace with the supplied values. The values section is a tuple (a parenthesized, comma-separated sequence) containing exactly as many items as there are format specifiers in the format section. In the case that there is just one item, the parentheses can be left out.

In the above example, we used a trailing comma to suppress the printing of a newline. Suppose, on the other hand, that we want to introduce some additional newlines in our output. We can accomplish this by inserting the "special" character \n into the print string:

>>> for i, word in enumerate(saying[:6]):
...    print "Word = %s\nIndex = %s" % (word, i)
Word = After
Index = 0
Word = all
Index = 1
Word = is
Index = 2
Word = said
Index = 3
Word = and
Index = 4
Word = done
Index = 5
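Because a formatting expression simply produces a string, it can be combined with join() to control separators exactly, instead of relying on the spacing that print adds. This is a minimal sketch of that idea, reusing the saying list from earlier in this section; line is a name introduced here for illustration.

```python
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

# Format each item, then let join() supply the separators.
line = ', '.join('%s (%d)' % (word, len(word)) for word in saying)
print(line)
```

With this approach there is no trailing comma after the final item, since join() only places the glue between items.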

Strings and Formats

We have seen that there are two ways to display the contents of an object:

>>> word = 'cat'
>>> sentence = """hello
... world"""
>>> print word
cat
>>> print sentence
hello
world
>>> word
'cat'
>>> sentence
'hello\nworld'

The print command yields Python's attempt to produce the most human-readable form of an object. The second method — naming the variable at a prompt — shows us a string that can be used to recreate this object. It is important to keep in mind that both of these are just strings, displayed for the benefit of you, the user. They do not give us any clue as to the actual internal representation of the object.
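The two display forms correspond to Python's built-in functions str() and repr(). The sketch below makes this explicit; it uses ast.literal_eval() from the standard library to confirm that the second form can indeed be used to recreate the string. The variable names are chosen for this illustration only.

```python
import ast

sentence = """hello
world"""

readable = str(sentence)        # the form that print displays
recreatable = repr(sentence)    # the form shown at the interactive prompt

print(readable)
print(recreatable)
print(ast.literal_eval(recreatable) == sentence)
```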

There are many other useful ways to display an object as a string of characters. This may be for the benefit of a human reader, or because we want to export our data to a particular file format for use in an external program.

Formatted output typically contains a combination of variables and pre-specified strings, e.g. given a dictionary wordcount consisting of words and their frequencies we could do:

>>> wordcount = {'cat':3, 'dog':4, 'snake':1}
>>> for word in sorted(wordcount):
...     print word, '->', wordcount[word], ';',
cat -> 3 ; dog -> 4 ; snake -> 1 ;

Apart from the problem of unwanted whitespace, print statements that contain alternating variables and constants can be difficult to read and maintain. A better solution is to use formatting strings:

>>> for word in sorted(wordcount):
...    print '%s->%d;' % (word, wordcount[word]),
cat->3; dog->4; snake->1;

Lining Things Up

So far our formatting strings have produced output of whatever width the values happened to need. We can also specify a fixed width, as in %6s, which produces a string that is padded to width 6 and right-justified. We can include a minus sign to make it left-justified. In case we don't know in advance how wide a displayed value should be, the width value can be replaced with a star in the formatting string, then specified using a variable:

>>> '%6s' % 'dog'
'   dog'
>>> '%-6s' % 'dog'
'dog   '
>>> width = 6
>>> '%-*s' % (width, 'dog')
'dog   '

Other conversion characters are used for decimal integers and floating point numbers. Since the percent character % has a special interpretation in formatting strings, we have to precede it with another % to get it in the output:

>>> "accuracy for %d words: %2.4f%%" % (9375, 100.0 * 3205/9375)
'accuracy for 9375 words: 34.1867%'
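Width and precision can be combined in a single specifier. The following sketch reuses the figures from the example above; the variable names are introduced only for this illustration.

```python
ratio = 100.0 * 3205 / 9375

one_place = '%.1f%%' % ratio     # one digit after the decimal point
padded = '%8.2f%%' % ratio       # two digits, number padded to a width of 8
left = '%-8d|' % 9375            # left-justified integer in a field of width 8

print(one_place)
print(padded)
print(left)
```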

An important use of formatting strings is for tabulating data. Recall that in section 2.1 we saw data being tabulated from a conditional frequency distribution. Let's perform the tabulation ourselves, exercising full control of headings and column widths. Note the clear separation between the language processing work, and the tabulation of results.

def tabulate(cfdist, words, categories):
    print '%-16s' % 'Category',
    for word in words:                                  # column headings
        print '%6s' % word,
    print                                               # end the heading row
    for category in categories:
        print '%-16s' % category,                       # row heading
        for word in words:                              # for each word
            print '%6d' % cfdist[category][word],       # print table cell
        print                                           # end the row
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist((g,w)
...                                for g in brown.categories()
...                                for w in brown.words(categories=g))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> tabulate(cfd, modals, genres)
Category            can  could    may  might   must   will
news                 93     86     66     38     50    389
religion             82     59     78     12     54     71
hobbies             268     58    131     22     83    264
science_fiction      16     49      4     12      8     16
romance              74    193     11     51     45     43
humor                16     30      8      8      9     13

Figure 3.4: Frequency of Modals in Different Sections of the Brown Corpus

Recall from the listing in Figure 3.3 that we used a formatting string "%*s". This allows us to specify the width of a field using a variable.

>>> '%*s' % (15, "Monty Python")
'   Monty Python'

We could use this to automatically customize the width of a column to be the smallest value required to fit all the words, using width = max(len(w) for w in words). Remember that the comma at the end of print statements adds an extra space, and this is sufficient to prevent the column headings from running into each other.
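A variable column width might be computed and applied as in the sketch below, where the widest word determines the width of every column. The header string is built with join() rather than with print and a trailing comma, which is a choice made for this illustration.

```python
words = ['can', 'could', 'may', 'might', 'must', 'will']

width = max(len(w) for w in words)   # the widest entry sets the column width
header = ' '.join('%*s' % (width, w) for w in words)
print(header)
```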

Writing Results to a File

We have seen how to read text from files (Section 3.1). It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.

>>> file = open('output.txt', 'w')
>>> words = set(nltk.corpus.genesis.words('english-kjv.txt'))
>>> for word in sorted(words):
...     file.write(word + "\n")

When we write non-text data to a file we must convert it to a string first. We can do this conversion using formatting strings, as we saw above. We can also do it using Python's backquote notation, which converts any object into a string. Let's write the total number of words to our file, before closing it.

>>> len(words)
>>> `len(words)`
>>> file.write(`len(words)` + "\n")
>>> file.close()
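The backquote notation is specific to Python 2 and was removed in Python 3, where str() or repr() does the same job. The sketch below shows a variant of the code above that avoids backquotes and uses a with statement so that the file is closed automatically. The short word list is a stand-in for the corpus vocabulary, and a temporary directory is used so that the example does not overwrite any existing file.

```python
import os
import tempfile

words = ['and', 'beginning', 'created', 'earth', 'heaven']   # stand-in vocabulary

path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(path, 'w') as out:              # the file is closed when the block ends
    for word in sorted(words):
        out.write(word + "\n")
    out.write(str(len(words)) + "\n")     # str() in place of backquotes

with open(path) as f:
    print(f.read())
```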

3.9   Conclusion

In this chapter we saw that we can do a variety of interesting language processing tasks that focus solely on words. Tokenization turns out to be far more difficult than expected. No single solution works well across the board, and we must decide what counts as a token depending on the application domain. We also looked at normalization (including lemmatization) and saw how it collapses distinctions between tokens. In the next chapter we will look at word classes and automatic tagging.

3.10   Summary

  • In this book we view a text as a list of words. A "raw text" is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
  • A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
  • The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[1] gives the value o. The length of a string is found using len().
  • Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
  • Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.
  • We can read text from a file f using text = open(f).read().
  • We can read text from a URL u using text = urlopen(u).read().
  • Texts found on the web may contain unwanted material (such as headers, footers, markup) that needs to be removed before we do any linguistic processing.
  • A word token is an individual occurrence of a word in a particular context.
  • A word type is the vocabulary item, independent of any particular use of that item.
  • Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation.
  • Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words.
  • Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear).
  • Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern, and we can use re.sub() to replace substrings of one sort with another.
  • If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
  • Normalization of words collapses distinctions, and is useful when indexing texts.

3.11   Further Reading

To learn about Unicode, see 1.


For more examples of processing words with NLTK, please see the NLTK guides; a guide on accessing NLTK corpora is also available. Chapters 2 and 3 of [Jurafsky & Martin, 2008] contain more advanced material on regular expressions and morphology.

For languages with a non-Roman script, tokenizing text is even more challenging. For example, in Chinese text there is no visual representation of word boundaries. The three-character string: 爱国人 (ai4 "love" (verb), guo3 "country", ren2 "person") could be tokenized as 爱国 / 人, "country-loving person" or as 爱 / 国人, "love country-person." The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing.

Regular Expressions

There are many references for regular expressions, both practical and theoretical. [Friedl, 2002] is a comprehensive and detailed manual in using regular expressions, covering their syntax in most major programming languages, including Python.

For an introductory tutorial to using regular expressions in Python with the re module, see A. M. Kuchling, Regular Expression HOWTO.

Chapter 3 of [Mertz, 2003] provides a more extended tutorial on Python's facilities for text processing with regular expressions. is a useful online resource, providing a tutorial and references to tools and other sources of information.

Unicode Regular Expressions:

Regex Library:

3.12   Exercises

  1. ☼ Describe the class of strings matched by the following regular expressions.

    1. [a-zA-Z]+
    2. [A-Z][a-z]*
    3. p[aeiou]{,2}t
    4. \d+(\.\d+)?
    5. ([^aeiou][aeiou][^aeiou])*
    6. \w+|[^\w\s]+

    Test your answers using re_show().

  2. ☼ Write regular expressions to match the following classes of strings:

    1. A single determiner (assume that a, an, and the are the only determiners).
    2. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.
  3. ☼ Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g. raw_contents = urllib.urlopen('').read().

  4. ☼ Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

    1. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use a single regular expression, with inline comments using the re.VERBOSE flag.
    2. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and companies.
  5. ☼ Rewrite the following loop as a list comprehension:

    >>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
    >>> result = []
    >>> for word in sent:
    ...     word_len = (word, len(word))
    ...     result.append(word_len)
    >>> result
    [('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]
  6. ☼ Split sent on some other character, such as 's'.

  7. ☼ We pointed out that when phrase is a list, phrase.reverse() modifies phrase in place rather than returning a new list. On the other hand, we can use the slice trick mentioned in the exercises for the previous section, [::-1], to create a new reversed list without changing phrase. Show how you can confirm this difference in behavior.

  8. ☼ We have seen how to represent a sentence as a list of words, where each word is a sequence of characters. What does phrase1[2][2] do? Why? Experiment with other index values.

  9. ☼ Write a for loop to print out the characters of a string, one per line.

  10. ☼ What is the difference between calling split on a string with no argument or with ' ' as the argument, e.g. sent.split() versus sent.split(' ')? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use '\t' to enter a tab character.)

  11. ☼ Create a variable words containing a list of words. Experiment with words.sort() and sorted(words). What is the difference?

  12. ☼ Earlier, we asked you to use a text editor to create a file called, containing the single line msg = 'Monty Python'. If you haven't already done this (or can't find the file), go ahead and do it now. Next, start up a new session with the Python interpreter, and enter the expression msg at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the .py part of the filename):

    >>> from test import msg
    >>> msg

    This time, Python should return with a value. You can also try import test, in which case Python should be able to evaluate the expression test.msg at the prompt.

  13. ◑ Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

  14. ◑ Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

  15. ◑ Write a function unknown() that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using re.findall()) and remove any items from this set that occur in the words corpus (nltk.corpus.words). Try to categorize these words manually and discuss your findings.

  16. ◑ Examine the results of processing the URL using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

  17. ◑ Define a function ghits() that takes a word as its argument and builds a Google query string of the form Strip the HTML markup and normalize whitespace. Search for a substring of the form Results 1 - 10 of about, followed by some number n, and extract n. Convert this to an integer and return it.

  18. ◑ The above example of extracting (name, domain) pairs from text does not work when there is more than one email address on a line, because the + operator is "greedy" and consumes too much of the input.

    1. Experiment with input text containing more than one email address per line, such as that shown below. What happens?
    2. Using re.findall(), write another regular expression to extract email addresses, replacing the period character with a range or negated range, such as [a-z]+ or [^ >]+.
    3. Now try to match email addresses by changing the regular expression .+ to its "non-greedy" counterpart, .+?
    >>> s = """
    ...  (internet)  hart@uiucvmd (bitnet)
    ... austen-emma.txt:Internet (; TEL: (212-254-5093)
    ... austen-persuasion.txt:Editing by Martin Ward (
    ... blake-songs.txt:Prepared by David Price, email
    ... """
  19. ◑ Are you able to write a regular expression to tokenize text in such a way that the word don't is tokenized into do and n't? Explain why this regular expression won't work: «n't|\w+».

  20. ◑ Write code to convert text into hAck3r again, this time using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.

  21. Pig Latin is a simple transliteration of English. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay.

    1. Write a function to convert a word to Pig Latin.
    2. Write code that converts text, instead of individual words.
    3. Extend it further to preserve capitalization, to keep qu together (i.e. so that quiet becomes ietquay), and to detect when y is used as a consonant (e.g. yellow) vs a vowel (e.g. style).
  22. ◑ Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

  23. ◑ Consider the numeric expressions in the following sentence from the MedLine corpus: The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15% and 8.16 +/- 0.23%, respectively. Should we say that the numeric expression 4.53 +/- 0.15% is three words? Or should we say that it's a single compound word? Or should we say that it is actually nine words, since it's read "four point five three, plus or minus fifteen percent"? Or should we say that it's not a "real" word at all, since it wouldn't appear in any dictionary? Discuss these different possibilities. Can you think of application domains that motivate at least two of these answers?

  24. ◑ Readability measures are used to score the reading difficulty of a text, for the purposes of selecting texts of appropriate difficulty for language learners. Let us define μw to be the average number of letters per word, and μs to be the average number of words per sentence, in a given text. The Automated Readability Index (ARI) of the text is defined to be: 4.71 * μw + 0.5 * μs - 21.43. Compute the ARI score for various sections of the Brown Corpus, including section f (popular lore) and j (learned). Make use of the fact that nltk.corpus.brown.words() produces a sequence of words, while nltk.corpus.brown.sents() produces a sequence of sentences.

  25. ◑ Use the Porter Stemmer to normalize some tokenized text, calling the stemmer on each word. Do the same thing with the Lancaster Stemmer and see if you observe any differences.

  26. ◑ Process the list saying using a for loop, and store the result in a new list lengths. Hint: begin by assigning the empty list to lengths, using lengths = []. Then each time through the loop, use append() to add another length value to the list.

  27. ◑ Define a variable silly to contain the string: 'newly formed bland ideas are inexpressible in an infuriating way'. (This happens to be the legitimate interpretation that bilingual English-Spanish speakers can assign to Chomsky's famous nonsense phrase, colorless green ideas sleep furiously according to Wikipedia). Now write code to perform the following tasks:

    1. Split silly into a list of strings, one per word, using Python's split() operation, and save this to a variable called bland.
    2. Extract the second letter of each word in silly and join them into a string, to get 'eoldrnnnna'.
    3. Combine the words in bland back into a single string, using join(). Make sure the words in the resulting string are separated with whitespace.
    4. Print the words of silly in alphabetical order, one per line.
  28. ◑ The index() function can be used to look up items in sequences. For example, 'inexpressible'.index('e') tells us the index of the first position of the letter e.

    1. What happens when you look up a substring, e.g. 'inexpressible'.index('re')?
    2. Define a variable words containing a list of words. Now use words.index() to look up the position of an individual word.
    3. Define a variable silly as in the exercise above. Use the index() function in combination with list slicing to build a list phrase consisting of all the words up to (but not including) the word 'in' in silly.
  29. ◑ Write code to abbreviate text by removing all the vowels. Define sentence to hold any string you like, then initialize a new string result to hold the empty string ''. Now write a for loop to process the string, one character at a time, and append any non-vowel characters to the result string.

  30. ◑ Write code to convert nationality adjectives like Canadian and Australian to their corresponding nouns Canada and Australia. (see

  31. ★ An interesting challenge for tokenization is words that have been split across a line-break. E.g. if long-term is split, then we have the string long-\nterm.

    1. Write a regular expression that identifies words that are hyphenated at a line-break. The expression will need to include the \n character.
    2. Use re.sub() to remove the \n character from these words.
  32. ★ Read the Wikipedia entry on Soundex. Implement this algorithm in Python.

  33. ★ Define a function percent(word, text) that calculates how often a given word occurs in a text, and expresses the result as a percentage.

  34. ★ Obtain raw texts from two or more genres and compute their respective reading difficulty scores as in the previous exercise. E.g. compare ABC Rural News and ABC Science News ( Use Punkt to perform sentence segmentation.

  35. ★ Rewrite the following nested loop as a nested list comprehension:

    >>> words = ['attribution', 'confabulation', 'elocution',
    ...          'sequoia', 'tenacious', 'unidirectional']
    >>> vsequences = set()
    >>> for word in words:
    ...     vowels = []
    ...     for char in word:
    ...         if char in 'aeiou':
    ...             vowels.append(char)
    ...     vsequences.add(''.join(vowels))
    >>> sorted(vsequences)
    ['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']
  36. ★ Write a program that processes a text and discovers cases where a word has been used with a novel sense. For each word, compute the wordnet similarity between all synsets of the word and all synsets of the words in its context. (Note that this is a crude approach; doing it well is an open research problem.)

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

4   Categorizing and Tagging Words

Back in elementary school you learnt the difference between nouns, verbs, adjectives, and adverbs. These "word classes" are not just the idle invention of grammarians, but are useful categories for many language processing tasks. As we will see, they arise from simple analysis of the distribution of words in text. The goal of this chapter is to answer the following questions:

  1. What are lexical categories and how are they used in natural language processing?
  2. What is a good Python data structure for storing words and their categories?
  3. How can we automatically tag each word of a text with its word class?

Along the way, we'll cover some fundamental techniques in NLP, including sequence labeling, n-gram models, backoff, and evaluation. These techniques are useful in many areas, and tagging gives us a simple context in which to present them.

The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags, and tagging text automatically.

4.1   Applications of Tagging

Automatic tagging has several applications. We have already seen an example of how to exploit tags in corpus analysis — we get a clear understanding of the distribution of often by looking at the tags of adjacent words. Automatic tagging also helps predict the behavior of previously unseen words. For example, if we encounter the word scrobbling we can probably infer that it is a verb, with the root scrobble, and likely to occur after forms of the auxiliary to be (e.g. he was scrobbling). Parts-of-speech are also used in speech synthesis and recognition. For example, wind/NN, as in the wind blew, is pronounced with a short vowel, whereas wind/VB, as in to wind the clock, is pronounced with a long vowel. Other examples can be found where the stress pattern differs depending on whether the word is a noun or a verb, e.g. contest, insult, present, protest, rebel, suspect. Without knowing the part-of-speech we cannot be sure of pronouncing the word correctly. Finally, there are many applications where automatic part-of-speech tagging is a vital step that feeds into later processing. We will look at many examples of this in later chapters.

Evidence for Lexical Categories: Distributional Similarity

Before we go further, let's look for words based on similar distribution in a text. We will look up woman (a noun), bought (a verb), over (a preposition), and the (a determiner), using NLTK's Text.similar() function:

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man number fact end time world use kind state matter house result way
group part day rest sense couple be
>>> text.similar('bought')
able been made found used was had said have that given in expected as
told put taken got seen done
>>> text.similar('over')
of in to on at for was is with that from and into by all as out up back the
>>> text.similar('the')
a his this that it their one her an all in its any which our some he
these my be

This function takes a word w, finds all contexts w1 w w2, then finds all words w' that appear in the same context, i.e. w1 w' w2. (You can find the implementation online at

Observe that searching for woman finds nouns; searching for bought finds verbs; searching for over generally finds prepositions; searching for the finds several determiners.
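
The context-matching idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name similar_words and the toy sentence are hypothetical, it runs without NLTK, and it is not the implementation behind Text.similar().

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Return words that share at least one (left, right) context with target."""
    # Map each (left neighbour, right neighbour) context to the words seen in it.
    context_to_words = defaultdict(set)
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        context_to_words[(left, right)].add(word)
    # Count, for every other word, how many of the target's contexts it shares.
    shared = defaultdict(int)
    for words in context_to_words.values():
        if target in words:
            for other in words:
                if other != target:
                    shared[other] += 1
    # Most widely shared first; ties broken alphabetically for stable output.
    return sorted(shared, key=lambda w: (-shared[w], w))

# Hypothetical toy text, not drawn from the Brown Corpus.
tokens = "the woman sat down and the man sat down while the dog ran off".split()
result = similar_words(tokens, "woman")
```

Here woman and man both occur in the context "the _ sat", so man is the only word reported as similar to woman. On a real corpus the same counting yields the longer lists shown above.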

These groups of words are so important that they have several names, all in common use: word classes, lexical categories, and parts of speech. We'll use these names interchangeably.


Remember that our program samples assume you begin your interactive session or your program with: import nltk, re, pprint

4.2   Tagged Corpora

A Simplified Part-of-Speech Tagset

Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset (shown in Table 4.1).

Table 4.1: Simplified Part-of-Speech Tagset

Tag  Meaning             Examples
ADJ  adjective           new, good, high, special, big, local
ADV  adverb              really, already, still, early, now
CNJ  conjunction         and, or, but, if, while, although
DET  determiner          the, a, some, most, every, no
EX   existential         there, there's
FW   foreign word        dolce, ersatz, esprit, quo, maitre
MOD  modal verb          will, can, would, may, must, should
N    noun                year, home, costs, time, education
NP   proper noun         Alison, Africa, April, Washington
NUM  number              twenty-four, fourth, 1991, 14:24
PRO  pronoun             he, their, her, its, my, I, us
P    preposition         on, of, at, with, by, into, under
TO   the word to         to
UH   interjection        ah, bang, ha, whee, hmpf, oops
V    verb                is, has, get, do, make, see, run
VD   past tense          said, took, told, made, asked
VG   present participle  making, going, playing, working
VN   past participle     given, taken, begun, sung
WH   wh determiner       who, which, when, what, where, how

Let's see which of these tags are the most common in the news category of the Brown corpus:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]


Your Turn: Plot the above frequency distribution using tag_fd.plot(cumulative=True). What percentage of words are tagged using the first five tags of the above list?

We can use these tags to do powerful searches using a graphical POS-concordance tool nltk.draw.pos_concordance(). Use it to search for any combination of words and POS tags, e.g. N N N N, hit/VD, hit/VN, the ADJ man.

Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

However, other tagged corpus files represent their part-of-speech tags in different ways. NLTK's corpus readers provide a uniform interface to these various formats so that you don't have to be concerned with them. By contrast with the text extract shown above, the corpus reader for the Brown Corpus presents the data as follows:

>>> list(nltk.corpus.brown.tagged_words())[:25]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'),
('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ...]

Part-of-speech tags have been converted to uppercase, which has become standard practice since the Brown Corpus was published.

Whenever a corpus contains tagged text, it will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus:

>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Not all corpora employ the same set of tags; please see Appendix A for a comprehensive list of tags for some popular tagsets. (Note that each NLTK corpus has a README file which may also have documentation on tagsets.) Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text, and Python always displays this in hexadecimal when printing a larger structure such as a list.

>>> nltk.corpus.sinica_treebank.tagged_words()
[('\xe4\xb8\x80', 'Neu'), ('\xe5\x8f\x8b\xe6\x83\x85', 'Nad'), ...]
>>> nltk.corpus.indian.tagged_words()
[('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'),
('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'), ...]
>>> nltk.corpus.mac_morpho.tagged_words()
[('Jersei', 'N'), ('atinge', 'V'), ('m\xe9dia', 'N'), ...]
>>> nltk.corpus.conll2002.tagged_words()
[('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]
>>> nltk.corpus.cess_cat.tagged_words()
[('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 4.1 shows the output of the demonstration code nltk.corpus.indian.demo().


Figure 4.1: POS-Tagged Data from Four Indian Languages

If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they typically function on a sentence at a time.


Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb, as shown in Table 4.2.

Table 4.2: Syntactic Patterns involving some Nouns

Word          After a determiner                             Subject of the verb
woman         the woman who I saw yesterday ...              the woman sat down
Scotland      the Scotland I remember as a child ...         Scotland has five million people
book          the book I bought yesterday ...                this book recounts the colonization of Australia
intelligence  the intelligence displayed by the child ...    Mary's intelligence impressed her teachers

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.

Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')). Then we construct a FreqDist from the tag parts of the bigrams.

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]

This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as NUM).
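
The same bigram-and-count step can be tried without a corpus. The following is a minimal sketch on a hypothetical hand-tagged word list, using collections.Counter from the standard library in place of nltk.FreqDist (an assumption made so that the example runs on its own).

```python
from collections import Counter

# Hypothetical hand-tagged tokens using the simplified tagset.
tagged = [('the', 'DET'), ('old', 'ADJ'), ('jury', 'N'), ('praised', 'VD'),
          ('the', 'DET'), ('city', 'N'), ('and', 'CNJ'), ('a', 'DET'),
          ('new', 'ADJ'), ('election', 'N')]

# Pair each token with its successor, then count the tag of the first member
# whenever the second member is a noun.
pairs = zip(tagged, tagged[1:])
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == 'N')

# Tags that precede a noun, most frequent first.
ranked = [tag for tag, count in before_noun.most_common()]
```

With so little data the ranking is not meaningful in itself; the point is only that each bigram contributes the tag of its first member when its second member is tagged N.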


Verbs are words that describe events and actions, e.g. fall, eat in Table 4.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.

Table 4.3: Syntactic Patterns involving some Verbs

Word  Simple           With modifiers and adjuncts (italicized)
fall  Rome fell        Dot com stocks suddenly fell like a stone
eat   Mice eat cheese  John ate the pizza with gusto

What are the most common verbs in news text? Let's sort all the verbs by frequency:

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',
'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',
'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',
'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:

>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']

We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events. Now we can see likely words for a given tag:

>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',
'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
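
To make the condition/event idea concrete, here is a small sketch that builds both directions of the mapping by hand. It assumes a hypothetical list of tagged tokens and uses a defaultdict of Counter objects from the standard library as a stand-in for nltk.ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical tagged tokens; not taken from any corpus.
tagged = [('cut', 'V'), ('cut', 'N'), ('cut', 'V'), ('yield', 'N'),
          ('yield', 'V'), ('taken', 'VN'), ('made', 'VN'), ('taken', 'VN')]

tags_for_word = defaultdict(Counter)   # condition = word, event = tag
words_for_tag = defaultdict(Counter)   # condition = tag, event = word
for word, tag in tagged:
    tags_for_word[word][tag] += 1
    words_for_tag[tag][word] += 1

# Frequency-ordered tags for a word, and frequency-ordered words for a tag.
cut_tags = [t for t, n in tags_for_word['cut'].most_common()]
vn_words = [w for w, n in words_for_tag['VN'].most_common()]
```

Swapping the roles of word and tag is just a matter of which member of the pair is used as the outer key.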

To clarify the distinction between VD (past tense) and VN (past participle), let's find words which can be both VD and VN, and see some surrounding text:

>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

In this case, we see that the past participle kicked is preceded by a form of the auxiliary verb have. Is this generally true?


Your Turn: Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.


By convention in NLTK, a tagged token is represented using a Python tuple. Python tuples are just like lists, except for one important difference: tuples cannot be changed in place, for example by sort() or reverse(). In other words, like strings, they are immutable. Tuples are formed with the comma operator, and typically enclosed using parentheses. Like lists, tuples can be indexed and sliced:

>>> t = ('walk', 'fem', 3)
>>> t[0]
'walk'
>>> t[1:]
('fem', 3)
>>> t[0] = 'run'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object does not support item assignment

A tagged token is represented using a tuple consisting of just two items. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
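
The conversion from a word/tag string to a tuple can be sketched as a split at the last separator. The helper split_tagged below is hypothetical and is not NLTK's str2tuple(); it only illustrates why splitting at the last slash matters for tokens that themselves contain a slash.

```python
def split_tagged(token, sep='/'):
    """Split 'word/TAG' at the last separator (illustrative helper only)."""
    position = token.rfind(sep)
    if position < 0:
        return (token, None)          # no tag present
    return (token[:position], token[position + len(sep):])

# Hypothetical tagged string in the same word/TAG notation as above.
sent = "The/AT grand/JJ jury/NN said/VBD ``/`` yes/UH ''/'' ./."
tagged = [split_tagged(t) for t in sent.split()]

# A token containing the separator keeps everything before the last slash.
fraction = split_tagged('1/2/CD')
```

Because the split is made at the last slash, a token such as 1/2/CD keeps 1/2 as the word and CD as the tag.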

Unsimplified Tags

Let's find the most frequent nouns of each noun part-of-speech type. The program in Figure 4.2 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s) and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines and -TL for titles.

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                  if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
NN ['year', 'time', 'state', 'week', 'man']
NN$ ["year's", "world's", "state's", "nation's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "University's", "League's", "Gallery's", "Army's"]
NN-HL ['cut', 'Salary', 'condition', 'Question', 'business']
NN-NC ['eva', 'ova', 'aya']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'City', 'Commissioner', 'Grove', 'House']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "men's", "janitors'", "taxpayers'"]
NNS$-HL ["Dealers'", "Idols'"]
NNS$-TL ["Women's", "States'", "Giants'", "Officers'", "Bombers'"]
NNS-HL ['years', 'idols', 'Creations', 'thanks', 'centers']
NNS-TL ['States', 'Nations', 'Masters', 'Rules', 'Communists']
NNS-TL-HL ['Nations']

Figure 4.2: Program to Find the Most Frequent Noun Tags

When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.

Exploring Tagged Corpora (NOTES)

We can continue the kinds of corpus exploration we saw in previous chapters, this time exploiting the tags.

Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often:

>>> brown_learned_text = brown.words(categories='learned')
>>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',
'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]

However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tag of the following words:

>>> brown_learned_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
>>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_learned_tagged) if a[0] == 'often']
>>> list(nltk.FreqDist(tags))
['VN', 'V', 'VD', 'DET', 'ADJ', 'ADV', 'P', 'CNJ', ',', 'TO', 'VG', 'WH', 'VBZ', '.']

Notice that the most frequent parts of speech following often are verbs. Nouns never appear in this position (in this particular corpus).

4.3   Mapping Words to Properties Using Python Dictionaries

As we have seen, a tagged word of the form (word, tag) is an association between a word and a part-of-speech tag. Once we start doing part-of-speech tagging, we will be creating programs that assign a tag to a word, the tag which is most likely in a given context. We can think of this process as mapping from words to tags. The most natural way to store mappings in Python uses the dictionary data type. In this section we look at dictionaries and see how they can represent a variety of language information, including parts of speech.

Indexing Lists vs Dictionaries

A text, as we have seen, is treated in Python as a list of words. An important property of lists is that we can "look up" a particular item by giving its index, e.g. text1[100]. Notice how we specify a number, and get back a word. We can think of a list as a simple kind of table, as shown in Figure 4.3.


Figure 4.3: List Look-up

Contrast this situation with frequency distributions (section 1.3), where we specify a word, and get back a number, e.g. fdist['monstrous'], which tells us the number of times a given word has occurred in a text. Look-up using words is familiar to anyone who has used a dictionary. Some more examples are shown in Figure 4.4.


Figure 4.4: Dictionary Look-up

In the case of a phonebook, we look up an entry using a name, and get back a number. When we type a domain name in a web browser, the computer looks this up to get back an IP address. A word frequency table allows us to look up a word and find its frequency in a text collection. In all these cases, we are mapping from names to numbers, rather than the other way round as with a list. In general, we would like to be able to map between arbitrary types of information. Table 4.4 lists a variety of linguistic objects, along with what they map.

Table 4.4: Linguistic Objects as Mappings from Keys to Values

Linguistic Object     Maps From     Maps To
Document Index        Word          List of pages (where word is found)
Thesaurus             Word sense    List of synonyms
Dictionary            Headword      Entry (part-of-speech, sense definitions, etymology)
Comparative Wordlist  Gloss term    Cognates (list of words, one per language)
Morph Analyzer        Surface form  Morphological analysis (list of component morphemes)

Most often, we are mapping from a "word" to some structured object. For example, a document index maps from a word (which we can represent as a string), to a list of pages (represented as a list of integers). In this section, we will see how to represent such mappings in Python.

Dictionaries in Python

Python provides a dictionary data type that can be used for mapping between arbitrary types. It is like a conventional dictionary, in that it gives you an efficient way to look things up. However, as we see from Table 4.4, it has a much wider range of uses.

To illustrate, we define pos to be an empty dictionary and then add four entries to it, specifying the part-of-speech of some words. We add entries to a dictionary using the familiar square bracket notation:

>>> pos = {}
>>> pos['colorless'] = 'ADJ'
>>> pos['ideas'] = 'N'
>>> pos['sleep'] = 'V'
>>> pos['furiously'] = 'ADV'

So, for example, pos['colorless'] = 'ADJ' says that the part-of-speech of colorless is adjective, or more specifically, that the key 'colorless' is assigned the value 'ADJ' in dictionary pos. Once we have populated the dictionary in this way, we can employ the keys to retrieve values:

>>> pos['ideas']
'N'
>>> pos['colorless']
'ADJ'

Of course, we might accidentally use a key that hasn't been assigned a value.

>>> pos['green']
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'green'

This raises an important question. Unlike lists and strings, where we can use len() to work out which integers will be legal indices, how do we work out the legal keys for a dictionary? If the dictionary is not too big, we can simply inspect its contents by evaluating the variable pos.

>>> pos
{'furiously': 'ADV', 'ideas': 'N', 'colorless': 'ADJ', 'sleep': 'V'}

Here, the contents of the dictionary are shown as key-value pairs, separated by a colon. The order of the key-value pairs is different from the order in which they were originally entered; this is because dictionaries are not sequences but mappings (cf. Figure 4.4), and the keys are not inherently ordered.

Alternatively, to just find the keys, we can convert the dictionary to a list — or use the dictionary in a context where a list is expected, as the parameter of sorted() or in a for loop:

>>> list(pos)
['ideas', 'furiously', 'colorless', 'sleep']
>>> sorted(pos)
['colorless', 'furiously', 'ideas', 'sleep']
>>> [w for w in pos if w.endswith('s')]
['colorless', 'ideas']


When you type list(pos) you might see a different order to the one shown above. If you want to see the keys in order, just sort them.

Since iterating over a dictionary visits its keys, we can use a for loop to print its contents, just as we did for lists:

>>> for word in sorted(pos):
...     print word + ":", pos[word]
colorless: ADJ
furiously: ADV
ideas: N
sleep: V

Finally, the dictionary methods keys(), values() and items() allow us to access the keys, values, and key-value pairs as separate lists:

>>> pos.keys()
['colorless', 'furiously', 'sleep', 'ideas']
>>> pos.values()
['ADJ', 'ADV', 'V', 'N']
>>> pos.items()
[('colorless', 'ADJ'), ('furiously', 'ADV'), ('sleep', 'V'), ('ideas', 'N')]
>>> for key, val in sorted(pos.items()):
...     print key + ":", val
colorless: ADJ
furiously: ADV
ideas: N
sleep: V

We want to be sure that when we look something up in a dictionary, we only get one value for each key. Now suppose we try to use a dictionary to store the fact that the word sleep can be used as both a verb and a noun:

>>> pos['sleep'] = 'V'
>>> pos['sleep'] = 'N'
>>> pos['sleep']
'N'

Initially, pos['sleep'] is given the value 'V'. But this is immediately overwritten with the new value 'N'. In other words, there can only be one entry in the dictionary for 'sleep'. However, there is a way of storing multiple values in that entry: we use a list value, e.g. pos['sleep'] = ['N', 'V']. In fact, this is what we saw in Section 2.4 for the CMU Pronouncing Dictionary, which stores multiple pronunciations for a single word.
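
As a minimal sketch of the list-valued approach, the following loop collects every tag seen for a word instead of overwriting the previous one. The input pairs are hypothetical, and a plain dict is used so that nothing beyond the notation already introduced is needed.

```python
pos = {}
for word, tag in [('sleep', 'V'), ('sleep', 'N'), ('ideas', 'N')]:
    if word not in pos:
        pos[word] = []            # first time we see this word
    if tag not in pos[word]:
        pos[word].append(tag)     # keep each tag once, in order of appearance
```

After the loop, pos['sleep'] holds both tags, while there is still only one entry for the key 'sleep'.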

Default Dictionaries

Since Python 2.5, a special kind of dictionary has been available, which can automatically create a default entry for a given key. (It is provided as nltk.defaultdict for the benefit of readers who are using Python 2.4). In order to use it, we have to supply a parameter which can be used to create the right kind of initial entry, e.g. int or list:

>>> frequency = nltk.defaultdict(int)
>>> frequency['colorless'] = 4
>>> frequency['ideas']
0
>>> pos = nltk.defaultdict(list)
>>> pos['sleep'] = ['N', 'V']
>>> pos['ideas']
[]

If we want to supply our own default value, we have to supply it as a function that creates the value. Let's return to our part-of-speech example, and create a dictionary whose default value for any entry is 'N'.

>>> pos = nltk.defaultdict(lambda: 'N')
>>> pos['colorless'] = 'ADJ'
>>> pos['blog']
'N'


The above example used a lambda expression, an advanced feature we will study in section 6.2. For now you just need to know that lambda: 'N' creates a function, and when we call this function it produces the value 'N':

>>> f = lambda: 'N'
>>> f()
'N'
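
One detail worth checking for yourself is when the default function is actually called. The sketch below uses collections.defaultdict from the standard library directly, on the assumption that it behaves like the nltk.defaultdict used in this section; the words are arbitrary.

```python
from collections import defaultdict

pos = defaultdict(lambda: 'N')
pos['colorless'] = 'ADJ'

guess = pos['blog']          # missing key: the function is called, giving 'N'
stored = 'blog' in pos       # the lookup also stored the new entry
probe = pos.get('wug')       # get() does not call the default function
untouched = 'wug' not in pos
```

Looking up a missing key with square brackets both returns the default and adds the key to the dictionary, whereas get() leaves the dictionary unchanged.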

Incrementally Updating a Dictionary

We can employ dictionaries to count occurrences, emulating the method for tallying words shown in Figure 1.2 of Chapter 1. We begin by initializing an empty defaultdict, then process each part-of-speech tag in the text. If the tag hasn't been seen before, it will have a zero count by default. Each time we encounter a tag, we increment its count using the += operator.

>>> counts = nltk.defaultdict(int)
>>> for (word, tag) in brown_news_tagged:
...     counts[tag] += 1
>>> counts['N']
22226
>>> list(counts)
['FW', 'DET', 'WH', "''", 'VBZ', 'VB+PPO', "'", ')', 'ADJ', 'PRO', '*', '-', ...]

>>> from operator import itemgetter
>>> sorted(counts.items(), key=itemgetter(1), reverse=True)
[('N', 22226), ('P', 10845), ('DET', 10648), ('NP', 8336), ('V', 7313), ...]
>>> [t for t,c in sorted(counts.items(), key=itemgetter(1), reverse=True)]
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

Figure 4.5: Incrementally Updating a Dictionary, and Sorting by Value

The listing in Figure 4.5 illustrates an important idiom for sorting a dictionary by its values, in order to show the tags in decreasing order of frequency. The sorted() function takes key and reverse as parameters, and the required key is the second element of each tag-count pair. Although the second member of a pair is normally accessed with index [1], this expression on its own (i.e. key=[1]) cannot be passed as the parameter value, since [1] looks like a list containing the integer 1. An alternative is to set the value of key to be itemgetter(1), a function which has the same effect as indexing into a tuple:

>>> pair = ('NP', 8336)
>>> pair[1]
8336
>>> itemgetter(1)(pair)
8336
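
The sorting idiom can be tried on a small dictionary. In this sketch the counts are made up, and the lambda version is shown only as an equivalent way of writing the same key function.

```python
from operator import itemgetter

# Hypothetical tag counts.
counts = {'N': 5, 'DET': 3, 'V': 4}

# Sort the (tag, count) pairs by their second member, largest first.
by_getter = sorted(counts.items(), key=itemgetter(1), reverse=True)

# The same ordering with an explicit function that indexes into each pair.
by_lambda = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

tags_only = [tag for tag, count in by_getter]
```

Both calls give the pairs in decreasing order of count, so itemgetter(1) can be read as shorthand for a function that returns pair[1].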

There's a second useful programming idiom at the beginning of Figure 4.5, where we initialize a defaultdict and then use a for loop to update its values. Here's a schematic version:

my_dictionary = nltk.defaultdict(function to create default value)
for item in sequence:
    update my_dictionary[item_key] with information about item

Here's another instance of this pattern, where we index words according to their last two letters:

>>> last_letters = nltk.defaultdict(list)
>>> words = nltk.corpus.words.words('en')
>>> for word in words:
...     key = word[-2:]
...     last_letters[key].append(word)
>>> last_letters['ly']
['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately',
'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', ...]
>>> last_letters['zy']
['blazy', 'bleezy', 'blowzy', 'boozy', 'breezy', 'bronzy', 'buzzy', 'Chazy', 'cozy', ...]

The following example uses the same pattern to create an anagram dictionary. (You might experiment with the third line to get an idea of why this program works.)

>>> anagrams = nltk.defaultdict(list)
>>> for word in words:
...     key = ''.join(sorted(word))
...     anagrams[key].append(word)
>>> anagrams['aegilnrt']
['alerting', 'altering', 'integral', 'relating', 'triangle']

Since accumulating words like this is such a common task, NLTK provides a more convenient way of creating a defaultdict(list):

>>> anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)


nltk.FreqDist is essentially a defaultdict(int) with extra support for initialization, sorting and plotting that are needed in language processing. Similarly nltk.Index is a defaultdict(list) with extra support for initialization.
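To make the relationship concrete, here is a minimal sketch of what such an index might look like, built on collections.defaultdict from the standard library. SimpleIndex is a hypothetical name used for illustration; it is not NLTK's implementation, and the short word list is invented so that the example runs without the corpus.

```python
from collections import defaultdict

class SimpleIndex(defaultdict):
    """Hypothetical stand-in for nltk.Index: a defaultdict(list) that can
    be initialized from an iterable of (key, value) pairs."""

    def __init__(self, pairs=()):
        defaultdict.__init__(self, list)
        for key, value in pairs:
            self[key].append(value)

# A tiny invented word list, so no corpus is needed.
words = ['alerting', 'altering', 'integral', 'listen', 'silent', 'cat']

# Index each word under its sorted letters, as in the anagram example.
anagrams = SimpleIndex((''.join(sorted(w)), w) for w in words)
print(anagrams['aegilnrt'])
print(anagrams['eilnst'])
```

The only thing the class adds to defaultdict(list) is the loop in the initializer, which is the "extra support for initialization" mentioned above.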

We can use default dictionaries with complex keys and values. Let's study the range of possible tags for a word, given the word itself, and the tag of the previous word. We will see how this information can be used by a POS tagger.

>>> pos = nltk.defaultdict(lambda: nltk.defaultdict(int))
>>> for ((w1,t1), (w2,t2)) in nltk.ibigrams(brown_news_tagged):
...     pos[(t1,w2)][t2] += 1
>>> pos[('N', 'that')]
defaultdict(<type 'int'>, {'V': 10, 'CNJ': 145, 'WH': 112})
>>> pos[('DET', 'right')]
defaultdict(<type 'int'>, {'ADV': 3, 'ADJ': 9, 'N': 3})

This example uses a dictionary whose default value for an entry is a dictionary (whose default value is int(), i.e. zero). There is some new notation here (the lambda), and we will return to this in chapter 6. For now, notice how we iterated over the bigrams of the tagged corpus, processing a pair of word-tag pairs for each iteration. Each time through the loop we updated our pos dictionary's entry for (t1,w2), a tag and its following word. The entry for ('DET', 'right') is itself a dictionary of counts. A POS tagger could use such information to decide to tag the word right as ADJ when it is preceded by a determiner.
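The same nested-dictionary pattern can be tried out on a handful of tokens. The sketch below uses collections.defaultdict from the standard library and an invented, hand-tagged word list in place of the Brown Corpus; zip(tagged, tagged[1:]) is used here as an assumed substitute for the bigram iterator.

```python
from collections import defaultdict

# Invented toy data with simplified tags; not taken from the corpus.
tagged = [('the', 'DET'), ('right', 'ADJ'), ('answer', 'N'),
          ('the', 'DET'), ('right', 'N'), ('to', 'P'), ('vote', 'V'),
          ('a', 'DET'), ('right', 'ADJ'), ('turn', 'N')]

# Outer default: a fresh inner dictionary; inner default: int(), i.e. 0.
pos = defaultdict(lambda: defaultdict(int))

# Pair each token with its successor to get the bigrams.
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    pos[(t1, w2)][t2] += 1

# Counts of tags for 'right' when the previous tag is DET.
right_after_det = dict(pos[('DET', 'right')])
print(right_after_det)

# The most frequent tag in that context.
best = max(right_after_det, key=right_after_det.get)
print(best)
```

Because both levels have defaults, the loop body never needs to check whether a context or a tag has been seen before.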

Inverting a Dictionary

Dictionaries support efficient lookup, so long as you want to get the value for any key. If d is a dictionary and k is a key, we type d[k] and immediately obtain the value. Finding a key given a value is slower and more cumbersome:

>>> [key for (key, value) in counts.items() if value == 16]
['call', 'sleepe', 'take', 'where', 'Your', 'Father', 'looke', 'owne']

If we expect to do this kind of "reverse lookup" often, it helps to construct a dictionary that maps values to keys. In the case that no two keys have the same value, this is an easy thing to do. We just get all the key-value pairs in the dictionary, and create a new dictionary of value-key pairs. The next example also illustrates another way of initializing a dictionary pos with key-value pairs.

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos2 = dict((value, key) for (key, value) in pos.items())
>>> pos2['N']
'ideas'

Let's first make our part-of-speech dictionary a bit more realistic and add some more words to pos using the dictionary update() method, to create the situation where multiple keys have the same value. Then the technique just shown for reverse lookup will no longer work (why not?). Instead, we have to incrementally add new values to the dictionary pos2, as follows:

>>> pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
>>> pos2 = nltk.defaultdict(list)
>>> for key, value in pos.items():
...     pos2[value].append(key)
>>> pos2['ADV']
['peacefully', 'furiously']

Now we have inverted the pos dictionary, and can look up any part-of-speech and find all words having that part-of-speech. We can do the same thing even more simply using NLTK's support for indexing as follows:

>>> pos2 = nltk.Index((value, key) for (key, value) in pos.items())


Thanks to their versatility, Python dictionaries are extremely useful in most areas of NLP. We already made heavy use of dictionaries in Chapter 1, since NLTK's FreqDist objects are just a special case of dictionaries for counting things. Table 4.5 lists the most important dictionary methods you should know.

Table 4.5: Summary of Python's Dictionary Methods

Example                          Description
d = {}                           create an empty dictionary and assign it to d
d[key] = value                   assign a value to a given dictionary key
list(d), d.keys()                the list of keys of the dictionary
sorted(d)                        the keys of the dictionary, sorted
key in d                         test whether a particular key is in the dictionary
for key in d                     iterate over the keys of the dictionary
d.values()                       the list of values in the dictionary
dict([(k1,v1), (k2,v2), ...])    create a dictionary from a list of key-value pairs
d1.update(d2)                    add all items from d2 to d1
defaultdict(int)                 a dictionary whose default value is zero

4.4   Automatic Tagging

In this and the following sections we will explore various ways to automatically add part-of-speech tags to some text. We'll begin by loading the data we will be using.

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged(categories='news')
>>> brown_news_text = brown.words(categories='news')

The Default Tagger

The simplest possible tagger assigns the same tag to each token. This may seem to be a rather banal step, but it establishes an important baseline for tagger performance. In order to get the best result, we tag each word with the most likely tag. Let's find out which tag is most likely (now using the unsimplified tagset):

>>> nltk.FreqDist(tag for (word, tag) in brown_news_tagged).max()
'NN'

Now we can create a tagger that tags everything as NN.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.wordpunct_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

Unsurprisingly, this method performs rather poorly. On a typical corpus, it will tag only about an eighth of the tokens correctly:

>>> nltk.tag.accuracy(default_tagger, brown_news_tagged)

Default taggers assign their tag to every single word, even words that have never been encountered before. As it happens, most new words are nouns. As we will see, this means that default taggers can help to improve the robustness of a language processing system. We will return to them shortly.
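The baseline computation can be sketched without NLTK. In the following, the three tagged sentences are invented, and most_common_tag(), default_tag() and accuracy() are hypothetical helpers written for this illustration; they are not the NLTK functions used above.

```python
from collections import Counter

# Invented gold-standard data: a list of tagged sentences.
gold_sents = [
    [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('fair', 'JJ')],
    [('voters', 'NNS'), ('like', 'VB'), ('city', 'NN'), ('life', 'NN')],
]

def most_common_tag(tagged_sents):
    """Return the single most frequent tag in the data."""
    tag_counts = Counter(tag for sent in tagged_sents for (word, tag) in sent)
    return tag_counts.most_common(1)[0][0]

def default_tag(tokens, tag):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

def accuracy(tagged_sents, tag):
    """Fraction of tokens whose gold tag equals the default tag."""
    correct = total = 0
    for sent in tagged_sents:
        tokens = [word for (word, gold_tag) in sent]
        guesses = default_tag(tokens, tag)
        for (word, guess), (_, gold_tag) in zip(guesses, sent):
            if guess == gold_tag:
                correct += 1
            total += 1
    return correct / float(total)

best_tag = most_common_tag(gold_sents)
score = accuracy(gold_sents, best_tag)
print(best_tag)
print(score)
```

On this toy data the score is simply the proportion of tokens that carry the most frequent tag, which is why choosing that tag gives the best possible default tagger.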

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns. For instance, we might guess that any word ending in ed is the past participle of a verb, and any word ending with 's is a possessive noun. We can express these as a list of regular expressions:

>>> patterns = [
...     (r'.*ing$', 'VBG'),               # gerunds
...     (r'.*ed$', 'VBD'),                # simple past
...     (r'.*es$', 'VBZ'),                # 3rd singular present
...     (r'.*ould$', 'MD'),               # modals
...     (r'.*\'s$', 'NN$'),               # possessive nouns
...     (r'.*s$', 'NNS'),                 # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
...     (r'.*', 'NN')                     # nouns (default)
... ]

Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence.

>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown.sents(categories='news')[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'),
('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'),
('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'),
('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'),
('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ..., ('.', 'NN')]
>>> nltk.tag.accuracy(regexp_tagger, brown_news_tagged)

The final regular expression «.*» is a catch-all that tags everything as a noun. This is equivalent to the default tagger (only much less efficient). Instead of re-specifying this as part of the regular expression tagger, is there a way to combine this tagger with the default tagger? We will see how to do this shortly.
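The first-match behaviour can be sketched with Python's re module alone. This is an illustration of the idea rather than NLTK's RegexpTagger; compiled and regexp_tag() are hypothetical names. The decimal point in the cardinal-number pattern is escaped here so that it matches only a literal dot.

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*es$', 'VBZ'),                  # 3rd singular present
    (r'.*ould$', 'MD'),                 # modals
    (r".*'s$", 'NN$'),                  # possessive nouns
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # nouns (default)
]

# Compile once, keeping the original order.
compiled = [(re.compile(regexp), tag) for (regexp, tag) in patterns]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for regexp, tag in compiled:
            if regexp.match(token):
                tagged.append((token, tag))
                break  # later patterns are never consulted
    return tagged

tokens = ['considering', 'received', 'was', 'reports', '13.5', 'jury', "board's"]
result = regexp_tag(tokens)
print(result)
```

Since the patterns are tried in order, moving the catch-all to the front of the list would tag every token as NN; the order of the list is part of the model.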

The Lookup Tagger

A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger".

>>> fd = nltk.FreqDist(brown_news_text)
>>> cfd = nltk.ConditionalFreqDist(brown_news_tagged)
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> nltk.tag.accuracy(baseline_tagger, brown.tagged_sents(categories='news'))

It should come as no surprise by now that simply knowing the tags for the 100 most frequent words enables us to tag nearly half of all words correctly. Let's see what it does on some untagged input text:

>>> baseline_tagger.tag(brown.sents(categories='news')[3])
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None),
('handful', None), ('of', 'IN'), ('such', None), ('reports', None),
('was', 'BEDZ'), ('received', None), ("''", "''"), (',', ','),
('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','),
('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None),
('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None),
(',', ','), ('the', 'AT'), ('number', None), ('of', 'IN'),
('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', None),
('of', 'IN'), ('this', 'DT'), ('city', None), ("''", "''"), ('.', '.')]

Many words have been assigned a tag of None, because they were not among the 100 most frequent words. In these cases we would like to assign the default tag of NN, a process known as backoff.

Getting Better Coverage with Backoff

How do we combine these taggers? We want to use the lookup table first, and if it is unable to assign a tag, then use the default tagger. We do this by specifying the default tagger as a parameter to the lookup tagger. The lookup tagger will invoke the default tagger when it can't assign a tag itself.

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))
>>> nltk.tag.accuracy(baseline_tagger, brown_news_tagged)
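The effect of backoff on a lookup tagger can be seen in a small standalone sketch. The training words below are invented, and train_lookup() and lookup_tag() are hypothetical helpers that mimic the behaviour described above using only the standard library.

```python
from collections import Counter, defaultdict

# Invented training data: a flat list of (word, tag) pairs.
tagged_words = [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'),
                ('the', 'AT'), ('number', 'NN'), ('of', 'IN'),
                ('voters', 'NNS'), ('of', 'IN'), ('the', 'AT'),
                ('city', 'NN')]

def train_lookup(tagged_words, size):
    """Store the most likely tag for each of the `size` most frequent words."""
    word_counts = Counter(word for (word, tag) in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    model = {}
    for word, count in word_counts.most_common(size):
        model[word] = tag_counts[word].most_common(1)[0][0]
    return model

def lookup_tag(tokens, model, default=None):
    """Look each token up in the model; use `default` when it is missing."""
    return [(token, model.get(token, default)) for token in tokens]

model = train_lookup(tagged_words, 2)
tokens = ['the', 'size', 'of', 'it']
without_backoff = lookup_tag(tokens, model)
with_backoff = lookup_tag(tokens, model, default='NN')
print(without_backoff)
print(with_backoff)
```

Without a default, unknown words come back tagged None; with the default, every token receives some tag, which is the behaviour we want from backoff.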

We can put all this together to write a simple (but somewhat inefficient) program to create and evaluate lookup taggers having a range of sizes, as shown in Figure 4.6. We include a backoff tagger that tags everything as a noun. A consequence of using this backoff tagger is that the lookup tagger only has to store word/tag pairs for words other than nouns.

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return nltk.tag.accuracy(baseline_tagger, brown.tagged_sents(categories='news'))

def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
>>> display()

Figure 4.6: Lookup Tagger Performance with Varying Model Size


Figure 4.7: Lookup Tagger

Observe that performance initially increases rapidly as the model size grows, eventually reaching a plateau, when large increases in model size yield little improvement in performance. (This example used the pylab plotting package; we will return to this in Section 6.2.)


In the above examples, you will have noticed an emphasis on accuracy scores. In fact, evaluating the performance of such tools is a central theme in NLP. Recall the processing pipeline in Figure 1.4; any errors in the output of one module are greatly multiplied in the downstream modules.

We evaluate the performance of a tagger relative to the tags a human expert would assign. Since we don't usually have access to an expert and impartial human judge, we make do instead with gold standard test data. This is a corpus which has been manually annotated and which is accepted as a standard against which the guesses of an automatic system are assessed. The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.

Of course, the humans who designed and carried out the original gold standard annotation were only human. Further analysis might show mistakes in the gold standard, or may eventually lead to a revised tagset and more elaborate guidelines. Nevertheless, the gold standard is by definition "correct" as far as the evaluation of an automatic tagger is concerned.


Developing an annotated corpus is a major undertaking. Apart from the data, it generates sophisticated tools, documentation, and practices for ensuring high quality annotation. The tagsets and other coding schemes inevitably depend on some theoretical position that is not shared by all; however, corpus creators often go to great lengths to make their work as theory-neutral as possible in order to maximize its usefulness.

4.5   N-Gram Tagging

Separating the Training and Testing Data

Now that we are training a tagger on some data, we must be careful not to test it on the same data, as we did in the above example. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would also be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10%:

>>> size = int(len(brown_news_tagged) * 0.9)
>>> brown_news_train = brown_news_tagged[:size]
>>> brown_news_test = brown_news_tagged[size:]
>>> unigram_tagger = nltk.UnigramTagger(brown_news_train)
>>> nltk.tag.accuracy(unigram_tagger, brown_news_test)

Although the score is worse, we now have a better picture of the usefulness of this tagger, i.e. its performance on previously unseen text.

N-Gram Tagging

When we perform a language processing task based on unigrams, we are using one item of context. In the case of tagging, we only consider the current token, in isolation from any larger context. Given such a model, the best we can do is tag each word with its a priori most likely tag. This means we would tag a word such as wind with the same tag, regardless of whether it appears in the context the wind or to wind.

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens, as shown in Figure 4.8. The tag to be chosen, t_n, is circled, and the context is shaded in grey. In the example of an n-gram tagger shown in Figure 4.8, we have n=3; that is, we consider the tags of the two preceding words in addition to the current word. An n-gram tagger picks the tag that is most likely in the given context.


Figure 4.8: Tagger Context


A 1-gram tagger is another term for a unigram tagger: i.e., the context used to tag a token is just the text of the token itself. 2-gram taggers are also called bigram taggers, and 3-gram taggers are called trigram taggers.

The NgramTagger class uses a tagged training corpus to determine which part-of-speech tag is most likely for each context. Here we see a special case of an n-gram tagger, namely a bigram tagger. First we train it, then use it to tag untagged sentences:

>>> bigram_tagger = nltk.BigramTagger(brown_news_train)
>>> bigram_tagger.tag(sent)
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'),
('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'),
('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'),
('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
>>> unseen_sent = brown.sents(categories='news')[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'),
('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None),
('into', None), ('at', None), ('least', None), ('seven', None), ('major', None),
('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400', None),
('separate', None), ('dialects', None), ('.', None)]

Notice that the bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word (i.e., 13.5), it is unable to assign a tag. It cannot tag the following word (i.e., million) even if it was seen during training, simply because it never saw it during training with a None tag on the previous word. Consequently, the tagger fails to tag the rest of the sentence. Its overall accuracy score is very low:

>>> nltk.tag.accuracy(bigram_tagger, brown_news_test)

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval).


n-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, t_(n-1) and preceding tags are set to None.
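To see why one unknown word can derail the rest of a sentence, consider the following minimal sketch of a bigram tagger with no backoff. It is an illustration under simplifying assumptions, not NLTK's BigramTagger: train_bigram() and bigram_tag() are hypothetical helpers, and the single training sentence is invented.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # context never crosses a sentence boundary
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return dict((context, c.most_common(1)[0][0])
                for (context, c) in counts.items())

def bigram_tag(tokens, model):
    """Tag tokens left to right; unseen contexts receive None."""
    tagged = []
    prev_tag = None
    for token in tokens:
        tag = model.get((prev_tag, token))  # None if the context is unseen
        tagged.append((token, tag))
        prev_tag = tag  # a None here changes the next token's context
    return tagged

# Invented training data: a single tagged sentence.
train = [[('the', 'AT'), ('population', 'NN'), ('is', 'BEZ'),
          ('large', 'JJ'), ('.', '.')]]
model = train_bigram(train)

seen = bigram_tag(['the', 'population', 'is', 'large', '.'], model)
unseen = bigram_tag(['the', 'population', 'was', 'large', '.'], model)
print(seen)
print(unseen)
```

In the second sentence the word was has not been seen, so it is tagged None. The next word, large, was only ever seen after BEZ, never after None, so it too is left untagged, and the failure propagates to the end of the sentence.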

Combining Taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:

  1. Try tagging the token with the bigram tagger.
  2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
  3. If the unigram tagger is also unable to find a tag, use a default tagger.

Most NLTK taggers permit a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger:

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(brown_news_train, backoff=t0)
>>> t2 = nltk.BigramTagger(brown_news_train, backoff=t1)
>>> nltk.tag.accuracy(t2, brown_news_test)


We specify the backoff tagger when the tagger is initialized, so that training can take advantage of the backoff tagger. Thus, if the bigram tagger would assign the same tag as its unigram backoff tagger in a certain context, the bigram tagger discards the training instance. This keeps the bigram tagger model as small as possible. We can further specify that a tagger needs to see more than one instance of a context in order to retain it, e.g. nltk.BigramTagger(sents, cutoff=2, backoff=t1) will discard contexts that have only been seen once or twice.
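The idea of discarding contexts that the backoff tagger already handles can be sketched as follows. This is a deliberately simplified version of the idea and does not reproduce NLTK's actual training procedure; all function names are hypothetical, the training sentences are invented, and the pruning step simply drops any bigram entry whose tag agrees with what the unigram model (or the default tag) would assign.

```python
from collections import Counter, defaultdict

def most_likely(counts):
    """Reduce a table of tag counts to the most frequent tag per context."""
    return dict((ctx, c.most_common(1)[0][0]) for (ctx, c) in counts.items())

def train_unigram(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return most_likely(counts)

def train_bigram(tagged_sents, unigram_model, default_tag):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    model = most_likely(counts)
    # Keep only the contexts where the bigram model disagrees with backoff.
    return dict((ctx, tag) for (ctx, tag) in model.items()
                if tag != unigram_model.get(ctx[1], default_tag))

def tag_with_backoff(tokens, bigram_model, unigram_model, default_tag):
    tagged = []
    prev_tag = None
    for token in tokens:
        tag = bigram_model.get((prev_tag, token))
        if tag is None:  # back off to the unigram model, then the default
            tag = unigram_model.get(token, default_tag)
        tagged.append((token, tag))
        prev_tag = tag
    return tagged

# Invented training sentences.
train = [
    [('the', 'AT'), ('wind', 'NN'), ('blows', 'VBZ')],
    [('the', 'AT'), ('wind', 'NN'), ('howls', 'VBZ')],
    [('to', 'TO'), ('wind', 'VB'), ('clocks', 'NNS')],
]
uni = train_unigram(train)
bi = train_bigram(train, uni, 'NN')
print(bi)
print(tag_with_backoff(['to', 'wind', 'the', 'clocks'], bi, uni, 'NN'))
```

After pruning, the bigram table holds a single entry, the one context where knowing the previous tag changes the decision; everything else is left to the backoff taggers.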

Storing Taggers

Training a tagger on a large corpus may take a significant time. Instead of training a tagger every time we need one, it is convenient to save a trained tagger in a file for later re-use. Let's save our tagger t2 to a file t2.pkl.

>>> from cPickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()

Now, in a separate Python process, we can load our saved tagger.

>>> from cPickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()

Now let's check that it can be used for tagging.

>>> text = """The board's action shows what free enterprise
...     is up against in our complex maze of regulatory laws ."""
>>> tokens = text.split()
>>> tagger.tag(tokens)
[('The', 'AT'), ("board's", 'NN$'), ('action', 'NN'), ('shows', 'NNS'),
('what', 'WDT'), ('free', 'JJ'), ('enterprise', 'NN'), ('is', 'BEZ'),
('up', 'RP'), ('against', 'IN'), ('in', 'IN'), ('our', 'PP$'), ('complex', 'JJ'),
('maze', 'NN'), ('of', 'IN'), ('regulatory', 'NN'), ('laws', 'NNS'), ('.', '.')]

Performance Limitations

What is the upper limit to the performance of an n-gram tagger? Consider the case of a trigram tagger. How many cases of part-of-speech ambiguity does it encounter? We can determine the answer to this question empirically:

>>> cfd = nltk.ConditionalFreqDist(
...            ((x[1], y[1], z[0]), z[1])
...            for sent in brown.tagged_sents(categories='news')
...            for x, y, z in nltk.trigrams(sent))
>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]
>>> sum(cfd[c].N() for c in ambiguous_contexts) / float(cfd.N())

Thus, one out of twenty trigrams is ambiguous [EXAMPLES]. Given the current word and the previous two tags, in 5% of cases there is more than one tag that could be legitimately assigned to the current word according to the training data. Assuming we always pick the most likely tag in such ambiguous contexts, we can derive an empirical upper bound on the performance of a trigram tagger.
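The same calculation can be carried out on toy data using only the standard library. In this sketch the tagged sentences are invented (and linguistically crude), and trigrams() and ambiguity() are hypothetical helpers; the second value returned is the proportion of tokens that would be tagged correctly if we always picked the most likely tag in each context.

```python
from collections import Counter, defaultdict

def trigrams(seq):
    """Yield consecutive triples from a sequence."""
    return zip(seq, seq[1:], seq[2:])

def ambiguity(tagged_sents):
    """Return (fraction of ambiguous trigrams, upper bound on accuracy)."""
    cfd = defaultdict(Counter)
    for sent in tagged_sents:
        for x, y, z in trigrams(sent):
            # Context: two previous tags plus the current word.
            cfd[(x[1], y[1], z[0])][z[1]] += 1
    total = sum(sum(c.values()) for c in cfd.values())
    ambiguous = sum(sum(c.values()) for c in cfd.values() if len(c) > 1)
    best = sum(max(c.values()) for c in cfd.values())
    return ambiguous / float(total), best / float(total)

# Invented data: the same context occurs with two different tags for 'fish'.
sents = [
    [('they', 'PPSS'), ('can', 'MD'), ('fish', 'VB'), ('.', '.')],
    [('they', 'PPSS'), ('can', 'MD'), ('fish', 'VB'), ('.', '.')],
    [('they', 'PPSS'), ('can', 'MD'), ('fish', 'NNS'), ('.', '.')],
]
fraction, bound = ambiguity(sents)
print(fraction)
print(bound)
```

Here half of the trigram tokens fall in an ambiguous context, and choosing the majority tag each time gets five of the six right, so no trigram tagger trained on this data could score higher on it.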

Another way to investigate the performance of a tagger is to study its mistakes. Some tags may be harder than others to assign, and it might be possible to treat them specially by pre- or post-processing the data. A convenient way to look at tagging errors is the confusion matrix. It charts expected tags (the gold standard) against actual tags generated by a tagger:

>>> def tag_list(tagged_sents):
...     return [tag for sent in tagged_sents for (word, tag) in sent]
>>> def apply_tagger(tagger, corpus):
...     return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]
>>> gold = tag_list(brown.tagged_sents(categories='editorial'))
>>> test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
>>> print nltk.ConfusionMatrix(gold, test)                
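The information in a confusion matrix amounts to a count of (gold tag, guessed tag) pairs. The sketch below builds such counts with collections.Counter for two short, invented tag lists; confusion_counts() is a hypothetical helper and does not reproduce the layout printed by nltk.ConfusionMatrix.

```python
from collections import Counter

def confusion_counts(gold, test):
    """Count how often each gold tag was paired with each guessed tag."""
    return Counter(zip(gold, test))

# Invented gold-standard tags and tagger output for ten tokens.
gold = ['AT', 'NN', 'VBD', 'TO', 'VB', 'NNS', 'TO', 'VB', 'IN', 'NN']
test = ['AT', 'NN', 'VBD', 'TO', 'NN', 'NNS', 'TO', 'NN', 'TO', 'NN']

matrix = confusion_counts(gold, test)

# Off-diagonal cells, where the two tags differ, are the tagger's errors.
errors = dict((pair, n) for (pair, n) in matrix.items() if pair[0] != pair[1])

for (gold_tag, test_tag), n in sorted(errors.items()):
    print('%s tagged as %s: %d' % (gold_tag, test_tag, n))
```

Listing the off-diagonal cells in this way shows at a glance which pairs of tags the tagger confuses most often.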


Based on such analysis we may decide to modify the tagset. Perhaps a distinction between tags that is difficult to make can be dropped, since it is not important in the context of some larger processing task.

Another way to analyze the performance bound on a tagger comes from the less than 100% agreement between human annotators. [MORE]

In general, observe that the tagging process simultaneously collapses distinctions (e.g., lexical identity is usually lost when all personal pronouns are tagged PRP), while introducing distinctions and removing ambiguities (e.g. deal tagged as VB or NN). This move facilitates classification and prediction. When we introduce finer distinctions in a tagset, we get better information about linguistic context, but we have to do more work to classify the current token (there are more tags to choose from). Conversely, with fewer distinctions (as with the simplified tagset), we have less work to do for classifying the current token, but less information about the context to draw on.

We have seen that ambiguity in the training data leads to an upper limit in tagger performance. Sometimes more context will resolve the ambiguity. In other cases however, as noted by [Church, Young, & Bloothooft, 1996], the ambiguity can only be resolved with reference to syntax, or to world knowledge. Despite these imperfections, part-of-speech tagging has played a central role in the rise of statistical approaches to natural language processing. In the early 1990s, the surprising accuracy of statistical taggers was a striking demonstration that it was possible to solve one small part of the language understanding problem, namely part-of-speech disambiguation, without reference to deeper sources of linguistic knowledge. Can this idea be pushed further? In Chapter 7, on chunk parsing, we shall see that it can.

4.6   Transformation-Based Tagging

A potential issue with n-gram taggers is the size of their n-gram table (or language model). If tagging is to be employed in a variety of language technologies deployed on mobile computing devices, it is important to strike a balance between model size and tagger performance. An n-gram tagger with backoff may store trigram and bigram tables, large sparse arrays which may have hundreds of millions of entries.

A second issue concerns context. The only information an n-gram tagger considers from prior context is tags, even though words themselves might be a useful source of information. It is simply impractical for n-gram models to be conditioned on the identities of words in the context. In this section we examine Brill tagging, a statistical tagging method which performs very well using models that are only a tiny fraction of the size of n-gram taggers.

Brill Tagging

Brill tagging is a kind of transformation-based learning, named after its inventor [REF]. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes. In this way, a Brill tagger successively transforms a bad tagging of a text into a better one. As with n-gram tagging, this is a supervised learning method, since we need annotated training data to figure out whether the tagger's guess is a mistake or not. However, unlike n-gram tagging, it does not count observations but compiles a list of transformational correction rules.

The process of Brill tagging is usually explained by analogy with painting. Suppose we were painting a tree, with all its details of boughs, branches, twigs and leaves, against a uniform sky-blue background. Instead of painting the tree first then trying to paint blue in the gaps, it is simpler to paint the whole canvas blue, then "correct" the tree section by over-painting the blue background. In the same fashion we might paint the trunk a uniform brown before going back to over-paint further details with even finer brushes. Brill tagging uses the same idea: begin with broad brush strokes then fix up the details, with successively finer changes. Let's look at an example involving the following sentence:

(9) The President said he will ask Congress to increase grants to states for vocational rehabilitation

We will examine the operation of two rules: (a) Replace NN with VB when the previous word is tagged TO; (b) Replace TO with IN when the next tag is NNS. Table 4.6 illustrates this process, first tagging with the unigram tagger, then applying the rules to fix the errors.

Table 4.6: Steps in Brill Tagging

Phrase   to   increase   grants   to   states   for   vocational   rehabilitation
Rule 1        VB
Rule 2                            IN

In this table we see two rules. All such rules are generated from a template of the following form: "replace T1 with T2 in the context C". Typical contexts are the identity or the tag of the preceding or following word, or the appearance of a specific tag within 2-3 words of the current word. During its training phase, the tagger guesses values for T1, T2 and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.
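The two rules and the scoring scheme can be sketched in a few lines. In this illustration, apply_rule() and score() are hypothetical helpers, and the initial and gold tag sequences are assumptions chosen to be consistent with the two rules discussed above; they are not output from an actual tagger.

```python
words = ['to', 'increase', 'grants', 'to', 'states', 'for',
         'vocational', 'rehabilitation']

# Assumed tags, for illustration only.
initial = ['TO', 'NN', 'NNS', 'TO', 'NNS', 'IN', 'JJ', 'NN']
gold = ['TO', 'VB', 'NNS', 'IN', 'NNS', 'IN', 'JJ', 'NN']

def apply_rule(tags, old, new, offset, context_tag):
    """Replace old with new wherever the tag at position i+offset
    equals context_tag (offset -1: previous tag, +1: next tag)."""
    result = list(tags)
    for i, tag in enumerate(tags):
        j = i + offset
        if tag == old and 0 <= j < len(tags) and tags[j] == context_tag:
            result[i] = new
    return result

def score(before, after, gold):
    """Net benefit: tags fixed minus tags broken by a rule."""
    triples = list(zip(before, after, gold))
    fixed = sum(1 for (b, a, g) in triples if b != g and a == g)
    broken = sum(1 for (b, a, g) in triples if b == g and a != g)
    return fixed - broken

# Rule (a): NN -> VB when the previous tag is TO.
step1 = apply_rule(initial, 'NN', 'VB', -1, 'TO')
# Rule (b): TO -> IN when the next tag is NNS.
step2 = apply_rule(step1, 'TO', 'IN', +1, 'NNS')

print(list(zip(words, step2)))
print(score(initial, step1, gold), score(step1, step2, gold))
```

Each rule corrects one tag and breaks none, so each has a net benefit of 1 on this phrase. During training, a candidate rule that changed the final NN on rehabilitation would be penalized, since that tag was already correct.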

Using NLTK's Brill Tagger

Figure 4.9 demonstrates NLTK's Brill tagger...

>>> nltk.tag.brill.demo()
Training Brill tagger on 80 sentences...
Finding initial useful rules...
    Found 6555 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
  12  13   1   4  | NN -> VB if the tag of the preceding word is 'TO'
   8   9   1  23  | NN -> VBD if the tag of the following word is 'DT'
   8   8   0   9  | NN -> VBD if the tag of the preceding word is 'NNS'
   6   9   3  16  | NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
   5   8   3   6  | NN -> NNP if the tag of the following word is 'NNP'
   5   6   1   0  | NN -> NNP if the text of words i-2...i-1 is 'like'
   5   5   0   3  | NN -> VBN if the text of the following word is '*-1'
>>> print(open("errors.out").read())
             left context |    word/test->gold     | right context
                          |      Then/NN->RB       | ,/, in/IN the/DT guests/N
, in/IN the/DT guests/NNS |       '/VBD->POS       | honor/NN ,/, the/DT speed
'/POS honor/NN ,/, the/DT |    speedway/JJ->NN     | hauled/VBD out/RP four/CD
NN ,/, the/DT speedway/NN |     hauled/NN->VBD     | out/RP four/CD drivers/NN
DT speedway/NN hauled/VBD |      out/NNP->RP       | four/CD drivers/NNS ,/, c
dway/NN hauled/VBD out/RP |      four/NNP->CD      | drivers/NNS ,/, crews/NNS
hauled/VBD out/RP four/CD |    drivers/NNP->NNS    | ,/, crews/NNS and/CC even
P four/CD drivers/NNS ,/, |     crews/NN->NNS      | and/CC even/RB the/DT off
NNS and/CC even/RB the/DT |    official/NNP->JJ    | Indianapolis/NNP 500/CD a
                          |     After/VBD->IN      | the/DT race/NN ,/, Fortun
ter/IN the/DT race/NN ,/, |    Fortune/IN->NNP     | 500/CD executives/NNS dro
s/NNS drooled/VBD like/IN |  schoolboys/NNP->NNS   | over/IN the/DT cars/NNS a
olboys/NNS over/IN the/DT |      cars/NN->NNS      | and/CC drivers/NNS ./.

Figure 4.9: NLTK's Brill tagger

Brill taggers have another interesting property: the rules are linguistically interpretable. Compare this with the n-gram taggers, which employ a potentially massive table of n-grams. We cannot learn much from direct inspection of such a table, in comparison to the rules learned by the Brill tagger.

4.7   The TnT Tagger

[NLTK contains a pure Python implementation of the TnT tagger nltk.tag.tnt, and also an interface to an external TnT tagger nltk_contrib.tag.tnt. These will be described in a later version of this chapter.]

4.8   How to Determine the Category of a Word

Now that we have examined word classes in detail, we turn to a more basic question: how do we decide what category a word belongs to in the first place? In general, linguists use morphological, syntactic, and semantic clues to determine the category of a word.

Morphological Clues

The internal structure of a word may give useful clues as to the word's category. For example, -ness is a suffix that combines with an adjective to produce a noun, e.g. happy → happiness, ill → illness. So if we encounter a word that ends in -ness, this is very likely to be a noun. Similarly, -ment is a suffix that combines with some verbs to produce a noun, e.g. govern → government and establish → establishment.

English verbs can also be morphologically complex. For instance, the present participle of a verb ends in -ing, and expresses the idea of ongoing, incomplete action (e.g. falling, eating). The -ing suffix also appears on nouns derived from verbs, e.g. the falling of the leaves (this is known as the gerund). (Since the present participle and the gerund cannot be systematically distinguished, they are often tagged with the same tag, i.e. VBG in the Brown Corpus tagset.)

Syntactic Clues

Another source of information is the typical contexts in which a word can occur. For example, assume that we have already determined the category of nouns. Then we might say that a syntactic criterion for an adjective in English is that it can occur immediately before a noun, or immediately following the words be or very. According to these tests, near should be categorized as an adjective:


a. the near window

b. The end is (very) near.

Semantic Clues

Finally, the meaning of a word is a useful clue as to its lexical category. For example, the best-known definition of a noun is semantic: "the name of a person, place or thing". Within modern linguistics, semantic criteria for word classes are treated with suspicion, mainly because they are hard to formalize. Nevertheless, semantic criteria underpin many of our intuitions about word classes, and enable us to make a good guess about the categorization of words in languages that we are unfamiliar with. For example, if all we know about the Dutch word verjaardag is that it means the same as the English word birthday, then we can guess that verjaardag is a noun in Dutch. However, some care is needed: although we might translate zij is vandaag jarig as it's her birthday today, the word jarig is in fact an adjective in Dutch, and has no exact equivalent in English.

Morphology in Part of Speech Tagsets

Common tagsets often capture some morpho-syntactic information; that is, information about the kind of morphological markings that words receive by virtue of their syntactic role. Consider, for example, the selection of distinct grammatical forms of the word go illustrated in the following sentences:


a. Go away!

b. He sometimes goes to the cafe.

c. All the cakes have gone.

d. We went on the excursion.

Each of these forms — go, goes, gone, and went — is morphologically distinct from the others. Consider the form, goes. This occurs in a restricted set of grammatical contexts, and requires a third person singular subject. Thus, the following sentences are ungrammatical.


a. *They sometimes goes to the cafe.

b. *I sometimes goes to the cafe.

By contrast, gone is the past participle form; it is required after have (and cannot be replaced in this context by goes), and cannot occur as the main verb of a clause.


a. *All the cakes have goes.

b. *He sometimes gone to the cafe.

We can easily imagine a tagset in which the four distinct grammatical forms just discussed were all tagged as VB. Although this would be adequate for some purposes, a more fine-grained tagset provides useful information about these forms that can help other processors that try to detect patterns in tag sequences. The Brown tagset captures these distinctions, as summarized in Table 4.7.

Table 4.7:

Some morphosyntactic distinctions in the Brown tagset

Form     Category                Tag
go       base                    VB
goes     3rd singular present    VBZ
gone     past participle         VBN
going    gerund                  VBG
went     simple past             VBD

In addition to this set of verb tags, the various forms of the verb to be have special tags: be/BE, being/BEG, am/BEM, are/BER, is/BEZ, been/BEN, were/BED and was/BEDZ (plus extra tags for negative forms of the verb). All told, this fine-grained tagging of verbs means that an automatic tagger that uses this tagset is effectively carrying out a limited amount of "morphological analysis."

Most part-of-speech tagsets make use of the same basic categories, such as noun, verb, adjective, and preposition. However, tagsets differ both in how finely they divide words into categories, and in how they define their categories. For example, is might be tagged simply as a verb in one tagset; but as a distinct form of the lexeme BE in another tagset (as in the Brown Corpus). This variation in tagsets is unavoidable, since part-of-speech tags are used in different ways for different tasks. In other words, there is no one 'right way' to assign tags, only more or less useful ways depending on one's goals.
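The difference in granularity between tagsets can be made concrete by collapsing the fine-grained Brown verb tags into one coarse category. This is a sketch under stated assumptions: the coarse label "V", the mapping table and the function simplify are hypothetical names chosen for illustration, and only the tags mentioned in this section are covered.

```python
# Brown verb tags from Table 4.7 and the special tags for forms of "be".
# The coarse label "V" is a hypothetical choice for this illustration.
FINE_TO_COARSE = {
    "VB": "V", "VBZ": "V", "VBN": "V", "VBG": "V", "VBD": "V",
    "BE": "V", "BEG": "V", "BEM": "V", "BER": "V",
    "BEZ": "V", "BEN": "V", "BED": "V", "BEDZ": "V",
}

def simplify(tagged_words):
    """Map each fine-grained tag to a coarse one, leaving other tags unchanged."""
    return [(word, FINE_TO_COARSE.get(tag, tag)) for word, tag in tagged_words]

fine = [("He", "PPS"), ("sometimes", "RB"), ("goes", "VBZ"),
        ("to", "IN"), ("the", "AT"), ("cafe", "NN")]
print(simplify(fine))
```

After simplification, goes and is would receive the same tag, so the information that distinguishes a third person singular form from a form of BE is no longer available to later processing.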

4.9   Summary

  • Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN and VB.
  • The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
  • Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
  • A variety of tagging methods are possible, e.g. default tagger, regular expression tagger, unigram tagger and n-gram taggers. These can be combined using a technique known as backoff.
  • Taggers can be trained and evaluated using tagged corpora.
  • Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
  • A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
  • N-gram taggers can be defined for large values of n, but once n is larger than 3 we usually encounter the sparse data problem; even with a large quantity of training data we only see a tiny fraction of possible contexts.
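The ideas of dictionary lookup and backoff from the summary can be sketched in plain Python. This is an illustration only and does not use NLTK's tagger classes: the functions train_unigram and tag are hypothetical, and the training sentences are invented.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Build a dictionary mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, table, default="NN"):
    """Tag words using the table, backing off to a default tag for unseen words."""
    return [(w, table.get(w, default)) for w in words]

# Toy training data, invented for illustration (tags follow the Brown tagset).
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
table = train_unigram(train)
print(tag(["the", "cat", "purrs"], table))
```

The unseen word purrs falls through to the default tag NN, which is wrong here; this is the kind of case that a regular expression tagger placed before the default tagger is meant to catch.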

4.10   Further Reading

[Recommended readings on lexical categories...]

Appendix A contains details of popular tagsets.

For more examples of tagging with NLTK, please see the tagging HOWTO on the NLTK website. Chapters 4 and 5 of [Jurafsky & Martin, 2008] contain more advanced material on n-grams and part-of-speech tagging.

There are several other important approaches to tagging involving Transformation-Based Learning, Markov Modeling, and Finite State Methods. (We will discuss some of these in Chapter 5.) In Chapter 7 we will see a generalization of tagging called chunking in which a contiguous sequence of words is assigned a single tag.

Part-of-speech tagging is just one kind of tagging, one that does not depend on deep linguistic analysis. There are many other kinds of tagging. Words can be tagged with directives to a speech synthesizer, indicating which words should be emphasized. Words can be tagged with sense numbers, indicating which sense of the word was used. Words can also be tagged with morphological features. Examples of each of these kinds of tags are shown below. For space reasons, we only show the tag for a single word. Note also that the first two examples use XML-style tags, where elements in angle brackets enclose the word that is tagged.

  1. Speech Synthesis Markup Language (W3C SSML): That is a <emphasis>big</emphasis> car!
  2. SemCor: Brown Corpus tagged with WordNet senses: Space in any <wf pos="NN" lemma="form" wnsn="4">form</wf> is completely measured by the three dimensions. (Wordnet form/nn sense 4: "shape, form, configuration, contour, conformation")
  3. Morphological tagging, from the Turin University Italian Treebank: E' italiano , come progetto e realizzazione , il primo (PRIMO ADJ ORDIN M SING) porto turistico dell' Albania .
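Because the first two examples use XML-style tags, they can be read with a standard XML parser. The sketch below parses the SemCor fragment using Python's xml.etree.ElementTree; the enclosing <s> element is an assumption added so that the fragment forms a well-formed document, and is not part of the SemCor example as shown.

```python
import xml.etree.ElementTree as ET

# The SemCor fragment from the list above, wrapped in a hypothetical <s>
# element so that it forms a well-formed XML document.
fragment = ('<s>Space in any <wf pos="NN" lemma="form" wnsn="4">form</wf> '
            'is completely measured by the three dimensions.</s>')

root = ET.fromstring(fragment)
for wf in root.iter("wf"):
    # Each <wf> element carries the word as text and the tags as attributes.
    print(wf.text, wf.attrib["pos"], wf.attrib["lemma"], wf.attrib["wnsn"])
```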

Tagging exhibits several properties that are characteristic of natural language processing. First, tagging involves classification: words have properties; many words share the same property (e.g. cat and dog are both nouns), while some words can have multiple such properties (e.g. wind is a noun and a verb). Second, in tagging, disambiguation occurs via representation: we augment the representation of tokens with part-of-speech tags. Third, training a tagger involves sequence learning from annotated corpora. Finally, tagging uses simple, general methods such as conditional frequency distributions and transformation-based learning.

Note that tagging is also performed at higher levels. Here is an example of dialogue act tagging, from the NPS Chat Corpus [Forsyth & Martell, 2007], included with NLTK.

Statement User117 Dude..., I wanted some of that
ynQuestion User120 m I missing something?
Bye User117 I'm gonna go fix food, I'll be back later.
System User122 JOIN
System User2 slaps User122 around a bit with a large trout.
Statement User121 18/m pm me if u tryin to chat

List of available taggers:

NLTK's HMM tagger, nltk.HiddenMarkovModelTagger

[Church, Young, & Bloothooft, 1996]

4.11   Exercises

  1. ☼ Search the web for "spoof newspaper headlines", to find such gems as: British Left Waffles on Falkland Islands, and Juvenile Court to Try Shooting Defendant. Manually tag these headlines to see if knowledge of the part-of-speech tags removes the ambiguity.
  2. ☼ Working with someone else, take turns to pick a word that can be either a noun or a verb (e.g. contest); the opponent has to predict which one is likely to be the most frequent in the Brown corpus; check the opponent's prediction, and tally the score over several turns.
  3. ◑ Write programs to process the Brown Corpus and find answers to the following questions:
    1. Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.)
    2. Which word has the greatest number of distinct tags? What are they, and what do they represent?
    3. List tags in order of decreasing frequency. What do the 20 most frequent tags represent?
    4. Which tags are nouns most commonly found after? What do these tags represent?
  4. ◑ Explore the following issues that arise in connection with the lookup tagger:
    1. What happens to the tagger performance for the various model sizes when a backoff tagger is omitted?
    2. Consider the curve in Figure 4.7; suggest a good size for a lookup tagger that balances memory and performance. Can you come up with scenarios where it would be preferable to minimize memory usage, or to maximize performance with no regard for memory usage?
  5. ◑ What is the upper limit of performance for a lookup tagger, assuming no limit to the size of its table? (Hint: write a program to work out what percentage of tokens of a word are assigned the most likely tag for that word, on average.)
  6. ◑ Generate some statistics for tagged data to answer the following questions:
    1. What proportion of word types are always assigned the same part-of-speech tag?
    2. How many words are ambiguous, in the sense that they appear with at least two tags?
    3. What percentage of word occurrences in the Brown Corpus involve these ambiguous words?
  7. ◑ Above we gave an example of the nltk.tag.accuracy() function. It has two arguments, a tagger and some tagged text, and it works out how accurately the tagger performs on this text. For example, if the supplied tagged text was [('the', 'DT'), ('dog', 'NN')] and the tagger produced the output [('the', 'NN'), ('dog', 'NN')], then the accuracy score would be 0.5. Can you figure out how the nltk.tag.accuracy() function works?
    1. A tagger takes a list of words as input, and produces a list of tagged words as output. However, nltk.tag.accuracy() is given correctly tagged text as its input. What must the nltk.tag.accuracy() function do with this input before performing the tagging?
    2. Once the supplied tagger has created newly tagged text, how would nltk.tag.accuracy() go about comparing it with the original tagged text and computing the accuracy score?
  8. ☼ Satisfy yourself that there are restrictions on the distribution of go and went, in the sense that they cannot be freely interchanged in the kinds of contexts illustrated in (3d).
  9. ◑ Write code to search the Brown Corpus for particular words and phrases according to tags, to answer the following questions:
    1. Produce an alphabetically sorted list of the distinct words tagged as MD.
    2. Identify words that can be plural nouns or third person singular verbs (e.g. deals, flies).
    3. Identify three-word prepositional phrases of the form IN + DET + NN (e.g. in the lab).
    4. What is the ratio of masculine to feminine pronouns?
  10. ◑ In the introduction we saw a table involving frequency counts for the verbs adore, love, like, prefer and preceding qualifiers such as really. Investigate the full range of qualifiers (Brown tag QL) that appear before these four verbs.
  11. ◑ We defined the regexp_tagger that can be used as a fall-back tagger for unknown words. This tagger only checks for cardinal numbers. By testing for particular prefix or suffix strings, it should be possible to guess other tags. For example, we could tag any word that ends with -s as a plural noun. Define a regular expression tagger (using nltk.RegexpTagger) that tests for at least five other patterns in the spelling of words. (Use inline documentation to explain the rules.)
  12. ◑ Consider the regular expression tagger developed in the exercises in the previous section. Evaluate the tagger using nltk.tag.accuracy(), and try to come up with ways to improve its performance. Discuss your findings. How does objective evaluation help in the development process?
  13. ★ There are 264 distinct words in the Brown Corpus having exactly three possible tags.
    1. Print a table with the integers 1..10 in one column, and the number of distinct words in the corpus having 1..10 distinct tags in the other column.
    2. For the word with the greatest number of distinct tags, print out sentences from the corpus containing the word, one for each possible tag.
  14. ★ Write a program to classify contexts involving the word must according to the tag of the following word. Can this be used to discriminate between the epistemic and deontic uses of must?
  15. ☼ Train a unigram tagger and run it on some new text. Observe that some words are not assigned a tag. Why not?
  16. ☼ Train an affix tagger AffixTagger() and run it on some new text. Experiment with different settings for the affix length and the minimum word length. Can you find a setting that seems to perform better than the one described above? Discuss your findings.
  17. ☼ Train a bigram tagger with no backoff tagger, and run it on some of the training data. Next, run it on some new data. What happens to the performance of the tagger? Why?
  18. ◑ Write a program that calls AffixTagger() repeatedly, using different settings for the affix length and the minimum word length. What parameter values give the best overall performance? Why do you think this is the case?
  19. ◑ How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy score. Estimate the training data required for these taggers, assuming a vocabulary size of 10^5 and a tagset size of 10^2.
  20. ◑ Obtain some tagged data for another language, and train and evaluate a variety of taggers on it. If the language is morphologically complex, or if there are any orthographic clues (e.g. capitalization) to word classes, consider developing a regular expression tagger for it (ordered after the unigram tagger, and before the default tagger). How does the accuracy of your tagger(s) compare with the same taggers run on English data? Discuss any issues you encounter in applying these methods to the language.
  21. ◑ Inspect the confusion matrix for the bigram tagger t2 defined in Section 4.5, and identify one or more sets of tags to collapse. Define a dictionary to do the mapping, and evaluate the tagger on the simplified data.
  22. ◑ Experiment with taggers using the simplified tagset (or make one of your own by discarding all but the first character of each tag name). Such a tagger has fewer distinctions to make, but much less information on which to base its work. Discuss your findings.
  23. ◑ Recall the example of a bigram tagger which encountered a word it hadn't seen during training, and tagged the rest of the sentence as None. It is possible for a bigram tagger to fail part way through a sentence even if it contains no unseen words (even if the sentence was used during training). In what circumstance can this happen? Can you write a program to find some examples of this?
  24. ◑ Modify the program in Figure 4.7 to use a logarithmic scale on the x-axis, by replacing pylab.plot() with pylab.semilogx(). What do you notice about the shape of the resulting plot? Does the gradient tell you anything?
  25. ★ Create a default tagger and various unigram and n-gram taggers, incorporating backoff, and train them on part of the Brown corpus.
    1. Create three different combinations of the taggers. Test the accuracy of each combined tagger. Which combination works best?
    2. Try varying the size of the training corpus. How does it affect your results?
  26. ★ Our approach for tagging an unknown word has been to consider the letters of the word (using RegexpTagger() and AffixTagger()), or to ignore the word altogether and tag it as a noun (using nltk.DefaultTagger()). These methods will not do well for texts having new words that are not nouns. Consider the sentence I like to blog on Kim's blog. If blog is a new word, then looking at the previous tag (TO vs NP$) would probably be helpful. I.e. we need a default tagger that is sensitive to the preceding tag.
    1. Create a new kind of unigram tagger that looks at the tag of the previous word, and ignores the current word. (The best way to do this is to modify the source code for UnigramTagger(), which presumes knowledge of Python classes discussed in Section 6.6.)
    2. Add this tagger to the sequence of backoff taggers (including ordinary trigram and bigram taggers that look at words), right before the usual default tagger.
    3. Evaluate the contribution of this new unigram tagger.
  27. ★ Write code to preprocess tagged training data, replacing all but the most frequent n words with the special word UNK. Train an n-gram backoff tagger on this data, then use it to tag some new text. Note that you will have to preprocess the text to replace unknown words with UNK, and post-process the tagged output to replace the UNK words with the words from the original input.
  28. ★ Consider the code in 4.5 which determines the upper bound for accuracy of a trigram tagger. Consult the Abney reading and review his discussion of the impossibility of exact tagging. Explain why correct tagging of these examples requires access to other kinds of information than just words and tags. How might you estimate the scale of this problem?
  29. ★ Use some of the estimation techniques in nltk.probability, such as Lidstone or Laplace estimation, to develop a statistical tagger that does a better job than ngram backoff taggers in cases where contexts encountered during testing were not seen during training.
  30. ◑ Consult the documentation for the Brill tagger demo function, using help(nltk.tag.brill.demo). Experiment with the tagger by setting different values for the parameters. Is there any trade-off between training time (corpus size) and performance?
  31. ★ Inspect the diagnostic files created by the tagger rules.out and errors.out. Obtain the demonstration code ( and create your own version of the Brill tagger. Delete some of the rule templates, based on what you learned from inspecting rules.out. Add some new rule templates which employ contexts that might help to correct the errors you saw in errors.out.

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

5   Data-Intensive Language Processing

5.1   Introduction

Language is full of patterns. In Chapter 3 we saw that frequent use of the modal verb will is characteristic of news text, and more generally, that we can use the frequency of a small number of diagnostic words in order to automatically guess the genre of a text (Table 1.1). In Chapter 4 we saw that words ending in -ed tend to be past tense verbs, and more generally, that the internal structure of words tells us something about their part of speech. Detecting and understanding such patterns is central to many NLP tasks, particularly those that try to access the meaning of a text.

In order to study and model these linguistic patterns we need to be able to write programs to process large quantities of annotated text. In this chapter we will focus on data-intensive language processing, covering manual approaches to exploring linguistic data in Section 5.2 and automatic approaches in Section 5.5.

We have already seen a simple application of classification in the case of part-of-speech tagging (Chapter 4). Although this is a humble beginning, it actually holds the key for a range of more difficult classification tasks, including those mentioned above. Recall that adjectives (tagged JJ) tend to precede nouns (tagged NN), and that we can use this information to predict that the word deal is a noun in the context good deal (and not a verb, as in to deal cards).

5.2   Exploratory Data Analysis

As language speakers, we all have intuitions about how language works, and what patterns it contains. Unfortunately, those intuitions are notoriously unreliable. We tend to notice unusual words and constructions, and to be oblivious to high-frequency cases. Many public commentators go further to make pronouncements about statistics and usage which turn out to be false. Many examples are documented on LanguageLog, e.g.

In order to get an accurate idea of how language works, and what patterns it contains, we must study language, in a wide variety of forms and contexts, as impartial observers. To facilitate this endeavor, researchers and organizations have created many large collections of real-world language, or corpora. These corpora are collected from a wide variety of sources, including literature, journalism, telephone conversations, instant messaging, and web pages.

Exploratory data analysis, the focus of this section, is a technique for learning about a specific linguistic pattern, or construction. It consists of four steps, illustrated in Figure 5.1.


Figure 5.1: Exploratory Corpus Analysis

First, we must find the occurrences of the construction that we're interested in, by searching the corpus. Ideally, we would like to find all occurrences of the construction, but sometimes that may not be possible, and we have to be careful not to over-generalize our findings. In particular, we should be careful not to conclude that something doesn't happen simply because we were unable to find any examples; it's also possible that our corpus is deficient.

Once we've found the constructions of interest, we can then categorize them, using two sources of information: content and context. In some cases, like identifying date and time expressions in text, we can simply write a set of rules to cover the various cases. In general, we can't just enumerate the cases but we have to manually annotate a corpus of text and then train systems to do the task automatically.

Having collected and categorized the constructions of interest, we can proceed to look for patterns. Typically, this involves describing patterns as combinations of categories, and counting how often different patterns occur. We can check for both graded distinctions and categorical distinctions...

  • center-embedding suddenly gets bad after two levels
  • examples from probabilistic syntax / gradient grammaticality

Finally, the information that we discovered about patterns in the corpus can be used to refine our understanding of how constructions work. We can then continue to perform exploratory data analysis, both by adjusting our characterizations of the constructions to better fit the data, and by building on our better understanding of simple constructions to investigate more complex constructions.

Although we have described exploratory data analysis as a cycle of four steps, it should be noted that any of these steps may be skipped or re-arranged, depending on the nature of the corpus and the constructions that we're interested in understanding. For example, we can skip the search step if we already have a corpus of the relevant constructions; and we can skip categorization if the constructions are already labeled.
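The search, categorize and count steps described above can be sketched on a very small scale. The tagged sample below is invented for illustration; in practice the data would come from a corpus, and the construction and categories would be chosen to suit the question being asked.

```python
from collections import Counter

# A tiny tagged sample, invented for illustration; a real study would use a corpus.
tagged = [("the", "AT"), ("falling", "VBG"), ("leaves", "NNS"),
          ("were", "BED"), ("a", "AT"), ("good", "JJ"), ("thing", "NN"),
          ("during", "IN"), ("spring", "NN"),
          ("she", "PPS"), ("was", "BEDZ"), ("eating", "VBG")]

# Step 1 (search): find occurrences of the construction, here words ending in -ing.
hits = [(word, tag) for word, tag in tagged if word.endswith("ing")]

# Step 2 (categorize): here the existing tag serves as the category.
# Step 3 (count patterns): how often does each category occur?
pattern_counts = Counter(tag for word, tag in hits)

print(hits)
print(pattern_counts.most_common())
```

The fourth step is then a matter of interpretation: the counts show that a plain -ing search also matches nouns and a preposition, which suggests refining the characterization of the construction before searching again.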

5.3   Selecting a Corpus

In exploratory data analysis, we learn about a specific linguistic pattern by objectively examining how it is used. We therefore must begin by selecting a corpus (i.e., a collection of language data) containing the pattern we are interested in. Often, we can use one of the many existing corpora that have been made freely-available by the researchers who assembled them. Sometimes, we may choose instead to assemble a derived corpus by combining several existing corpora, or by selecting out specific subsets of a corpus (e.g., only news stories containing interviews). Occasionally, we may decide to build a new corpus from scratch (e.g., if we wish to learn about a previously undocumented language).

The results of our analysis will be highly dependent on the corpus that we select. This should hardly be surprising, since many linguistic phenomena pattern differently in different contexts -- for example, <<add a good example -- child vs adult? written vs spoken? some specific phenomenon?>>. When selecting the corpus for analysis, it is therefore important to understand how the characteristics of the corpus will affect the results of the data analysis.

Source of Language Data

Language is used for many different purposes, in many different contexts. For example, language can be used in a novel to tell a complex fictional story, or it can be used in an internet chat room to exchange rumors about celebrities. It can be used in a newspaper to report a story about a sports team, or in a workplace to form a collaborative plan for building a new product. Although some linguistic phenomena will act uniformly across these different contexts, other phenomena may vary depending on the context.

Therefore, one of the most important characteristics defining a corpus is the source (or sources) from which its language data is drawn, which will determine the types of language data it includes. Attributes that characterize the type of language data contained in a corpus include:

  • Domain: What subject matters does the language data talk about?

  • Mode: Does the corpus contain spoken language data, written language data, or both?

  • Number of Speakers: Does the corpus contain texts that are produced by a single speaker, such as books or news stories, or does it contain dialogues?

  • Register: Is the language data in the corpus formal or informal?

  • Communicative Intent: For what purpose was the language generated (e.g., to communicate, to entertain, or to persuade)?

  • Dialect: Do the speakers use any specific dialects?

  • Language: What language or languages are used?

When making conclusions on the basis of exploratory data analysis, it is important to consider the extent to which those conclusions are dependent on the type of language data included in the corpus. For example, if we discover a pattern in a corpus of newspaper articles, we should not necessarily assume that the same pattern will hold in spoken discourse.

In order to allow more general conclusions to be drawn about linguistic patterns, several balanced corpora have been created, which include language data from a wide variety of different language sources. For example, the Brown Corpus contains documents ranging from science fiction to how-to guidebooks to legislative council transcripts. But it's worth noting that since language use is so diverse, it would be almost impossible to create a single corpus that includes all of the contexts in which language gets used. Thus, even balanced corpora should be considered to cover only a subset of the possible linguistic sources (even if that subset is larger than the subset covered by many other corpora).

Information Content

Corpora can vary in the amount of information they contain about the language data they describe. At a minimum, a corpus will typically contain a sequence of sounds or orthographic symbols. At the other end of the spectrum, a corpus could contain a large amount of information about the syntactic structure, morphology, prosody, and semantic content of every sentence. This extra information is called annotation, and can be very helpful when performing exploratory data analysis. For example, it may be much easier to find a given linguistic pattern if we can search for specific syntactic structures; and it may be easier to categorize a linguistic pattern if every word has been tagged with its word sense.

Corpora vary widely in the amount and types of annotation that they include. Some common types of information that can be annotated include:

Unfortunately, there is not much consistency between existing corpora in how they represent their annotations. However, two general classes of annotation representation should be distinguished. Inline annotation modifies the original document by inserting special symbols or control sequences that carry the annotated information. For example, when part-of-speech tagging a document, the string "fly" might be replaced with the string "fly/NN", to indicate that the word fly is a noun in this context. In contrast, standoff annotation does not modify the original document, but instead creates a new file that adds annotation information using pointers into the original document. For example, this new document might contain the string "<word start=8 end=11 pos='NN'/>", to indicate that the word starting at character 8 and ending at character 11 is a noun.
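The relationship between the two representations can be sketched by converting inline word/TAG text into plain text plus standoff records. The function name inline_to_standoff and the record layout (a dictionary with start, end and pos) are assumptions made for this illustration and do not correspond to any particular corpus format; the end offset is treated as exclusive, and every token is assumed to carry a tag.

```python
def inline_to_standoff(inline_text):
    """Convert 'word/TAG' tokens into plain text plus standoff records."""
    words, records, offset = [], [], 0
    for token in inline_text.split():
        # Split on the last slash, so that words containing "/" are preserved.
        word, _, tag = token.rpartition("/")
        # Record character offsets into the plain text (end is exclusive).
        records.append({"start": offset, "end": offset + len(word), "pos": tag})
        words.append(word)
        offset += len(word) + 1  # +1 for the space that joins the words
    return " ".join(words), records

text, annotations = inline_to_standoff("the/AT fly/NN buzzed/VBD")
print(text)
for a in annotations:
    print(a, repr(text[a["start"]:a["end"]]))
```

Note that the plain text is left untouched by the standoff records, so several independent layers of annotation could point into the same document.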

Corpus Size

The size of corpora can vary widely, from tiny corpora containing just a few hundred sentences up to enormous corpora containing a billion words or more. In general, we perform exploratory data analysis using the largest appropriate corpus that's available. This ensures that the results of our analysis don't just reflect a quirk of the particular language data contained in the corpus.

However, in some circumstances, we may be forced to perform our analysis using small corpora. For example, if we are examining linguistic patterns in a language that is not well studied, or if our analysis requires specific annotations, then no large corpora may be available. In these cases, we should be careful when interpreting the results of an exploratory data analysis. In particular, we should avoid concluding that a linguistic pattern or phenomenon never occurs, just because we did not find it in our small sample of language data.

Table 5.1:

Example Corpora. This table summarizes some important properties of several popular corpora.

Corpus Name Contents Size Annotations etc.
Penn Treebank News stories 1m words etc.  
Web (google) etc.      

5.5   Data Modeling

Exploratory data analysis helps us to understand the linguistic patterns that occur in natural language corpora. Once we have a basic understanding of those patterns, we can attempt to create models that capture those patterns. Typically, these models will be constructed automatically, using algorithms that attempt to select a model that accurately describes an existing corpus; but it is also possible to build analytically motivated models. Either way, these explicit models serve two important purposes: they help us to understand the linguistic patterns, and they can be used to make predictions about new language data.

The extent to which explicit models can give us insight into linguistic patterns depends largely on what kind of model is used. Some models, such as decision trees, are relatively transparent, and give us direct information about which factors are important in making decisions, and about which factors are related to one another. Other models, such as multi-level neural networks, are much more opaque -- although it can be possible to gain insight by studying them, it typically takes a lot more work.

But all explicit models can make predictions about new "unseen" language data that was not included in the corpus used to build the model. These predictions can be evaluated to assess the accuracy of the model. Once a model is deemed sufficiently accurate, it can then be used to automatically predict information about new language data. These predictive models can be combined into systems that perform many useful language processing tasks, such as document classification, automatic translation, and question answering.

What do models tell us?

Before we delve into the mechanics of different models, it's important to spend some time looking at exactly what automatically constructed models can tell us about language.

One important consideration when dealing with language models is the distinction between descriptive models and explanatory models. Descriptive models capture patterns in the data but they don't provide any information about why the data contains those patterns. For example, as we saw in Table 3.1, the synonyms absolutely and definitely are not interchangeable: we say absolutely adore not definitely adore, and definitely prefer not absolutely prefer. A descriptive model would simply record these preferences. In contrast, explanatory models attempt to capture properties and relationships that underlie the linguistic patterns. For example, we might introduce the abstract concept of "polar verb", as one that has an extreme meaning, and categorize some verbs like adore and detest as polar. Our explanatory model would contain the constraint that absolutely can only combine with polar verbs, and definitely can only combine with non-polar verbs. In summary, descriptive models provide information about correlations in the data, while explanatory models go further to postulate causal relationships.

Most models that are automatically constructed from a corpus are descriptive models; in other words, they can tell us what features are relevant to a given pattern or construction, but they can't necessarily tell us how those features and patterns relate to one another. If our goal is to understand the linguistic patterns, then we can use this information about which features are related as a starting point for further experiments designed to tease apart the relationships between features and patterns. On the other hand, if we're just interested in using the model to make predictions (e.g., as part of a language processing system), then we can use the model to make predictions about new data, without worrying about the precise nature of the underlying causal relationships.

Feature Extraction

The first step in creating a model is deciding what information about the input might be relevant to the classification task, and how to encode that information. In other words, we must decide which features of the input are relevant, and how to encode those features. Most automatic learning methods restrict features to have simple value types, such as booleans, numbers, and strings. But note that just because a feature has a simple type does not necessarily mean that the feature's value is simple to express or compute; indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.


Figure 5.5: Supervised Classification. (a) During training, a feature extractor is used to convert each input value to a feature set. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.

For NLTK's classifiers, the features for each input are stored using a dictionary that maps feature names to corresponding values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are simple-typed values, such as booleans, numbers, and strings. For example, if we had built an animal_classifier model for classifying animals, then we might provide it with the following feature set:

>>> animal = {'fur': True, 'legs': 4,
...           'size': 'large', 'spots': True}
>>> animal_classifier.classify(animal)

Generally, feature sets are constructed from inputs using a feature extraction function. This function takes an input value, and possibly its context, as parameters, and returns a corresponding feature set. This feature set can then be passed to the machine learning algorithm for training, or to the learned model for prediction. For example, we might use the following function to extract features for a document classification task:

def extract_features(document):
   features = {}
   for word in document:
       features['contains(%s)' % word] = True
   return features
>>> extract_features(nltk.corpus.brown.words('cj79'))
{'contains(of)': True, 'contains(components)': True,
 'contains(some)': True, 'contains(that)': True,
 'contains(passage)': True, 'contains(table)': True, ...}

Figure 5.6

In addition to a feature extractor, we need to select or build a training corpus, consisting of a list of examples and corresponding class labels. For many interesting tasks, appropriate corpora have already been assembled. Given a feature extractor and a training corpus, we can train a classifier. First, we run the feature extractor on each instance in the training corpus, and build a list of (featureset, label) tuples. Then, we pass this list to the classifier's train method:

>>> train = [(extract_features(word), label)
...          for (word, label) in labeled_words]
>>> classifier = nltk.NaiveBayesClassifier.train(train)

The constructed model classifier can then be used to predict the labels for unseen inputs:

>>> test = [extract_features(word)
...         for word in unseen_words]
>>> predicted = classifier.batch_classify(test)


When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, we can make use of the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all values in memory:

>>> from nltk.classify import apply_features
>>> train = apply_features(extract_features, labeled_words)
>>> test = apply_features(extract_features, unseen_words)

Selecting relevant features, and deciding how to encode them for the learning method, can have an enormous impact on its ability to extract a good model. Much of the interesting work in modeling a phenomenon is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based on an understanding of the task at hand.

Typically, feature extractors are built through a process of trial-and-error, guided by intuitions about what information is relevant to the problem at hand. It's often useful to start with a "kitchen sink" approach, including all the features that you can think of, and then checking to see which features actually appear to be helpful. However, there are usually limits to the number of features that you should use with a given learning algorithm -- if you provide too many features, then the algorithm will have a higher chance of relying on idiosyncrasies of your training data that don't generalize well to new examples. This problem is known as overfitting, and can be especially problematic when working with small training sets.

Once a basic system is in place, a very productive method for refining the feature set is error analysis. First, the training corpus is split into two pieces: a training subcorpus, and a development subcorpus. The model is trained on the training subcorpus, and then run on the development subcorpus. We can then examine individual cases in the development subcorpus where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly, and the error analysis procedure can be repeated, ideally using a different development/training split.
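The split-and-inspect loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not NLTK code: the helper names split_corpus and find_errors, the toy last_letter_rule classifier, and the hand-made list of names are all hypothetical and introduced here for the example.

```python
import random

def split_corpus(labeled, dev_size, seed=0):
    """Shuffle (input, label) pairs and split them into a training
    subcorpus and a development subcorpus of dev_size examples."""
    items = list(labeled)
    random.Random(seed).shuffle(items)
    return items[dev_size:], items[:dev_size]

def find_errors(classify, dev):
    """Return (input, correct label, predicted label) for every
    development example that the classifier labels incorrectly."""
    errors = []
    for (value, label) in dev:
        guess = classify(value)
        if guess != label:
            errors.append((value, label, guess))
    return errors

# Hypothetical stand-in for a trained model: guess 'female' for names
# that end in a vowel, and 'male' otherwise.
def last_letter_rule(name):
    return 'female' if name[-1].lower() in 'aeiou' else 'male'

# Hand-made examples, used here in place of a real corpus.
labeled_names = [('Anna', 'female'), ('John', 'male'), ('Luca', 'male'),
                 ('Ruth', 'female'), ('Maria', 'female'), ('Peter', 'male'),
                 ('Joshua', 'male'), ('Megan', 'female'), ('Ella', 'female'),
                 ('Mark', 'male')]

train, dev = split_corpus(labeled_names, dev_size=3, seed=1)
errors = find_errors(last_letter_rule, dev)
```

Changing the seed gives a different development/training split, so repeated rounds of error analysis need not keep examining the same examples.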

Example: Predicting Name Genders

In section 5.2, we looked at some of the factors that might influence whether an English name sounds more like a male name or a female name. Now we can build a simple model for this classification task. We'll use the same names corpus that we used for exploratory data analysis, divided into a training set and an evaluation set:

>>> from nltk.corpus import names
>>> import random
>>> # Construct a list of classified names, using the names corpus.
>>> namelist = ([(name, 'male') for name in names.words('male')] +
...             [(name, 'female') for name in names.words('female')])
>>> # Randomly split the names into a test & train set.
>>> random.shuffle(namelist)
>>> train = namelist[500:]
>>> test = namelist[:500]

Next, we'll build a simple feature extractor, using some of the features that appeared to be useful in the exploratory data analysis. We'll also throw in a number of features that seem like they might be useful:

def gender_features(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
>>> gender_features('John')
{'count(j)': 1, 'has(d)': False, 'count(b)': 0, ...}

Figure 5.7

Now that we have a corpus and a feature extractor, we can train a classifier. We'll use a "Naive Bayes" classifier, which will be described in more detail in section 5.8.1.
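Training follows the same pattern as in the previous section: each (name, label) pair is converted into a (featureset, label) pair, and the resulting list is handed to the classifier's train method. The sketch below shows this preparation step, assuming a cut-down feature extractor and a handful of made-up names in place of the names corpus; the NLTK call is given as a comment so that the sketch runs without NLTK installed.

```python
def gender_features(name):
    # Cut-down stand-in for the extractor above: first and last letter only.
    return {'firstletter': name[0].lower(), 'lastletter': name[-1].lower()}

# Made-up stand-in for the training portion of the shuffled names list.
train = [('John', 'male'), ('Anna', 'female'),
         ('Peter', 'male'), ('Maria', 'female')]

# Pair each feature set with its label, as the training method expects.
train_featuresets = [(gender_features(n), g) for (n, g) in train]

# With NLTK available, the model would then be built with:
#   classifier = nltk.NaiveBayesClassifier.train(train_featuresets)
```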

Now we can use the classifier to predict the gender for unseen names:

>>> classifier.classify(gender_features('Blorgy'))
>>> classifier.classify(gender_features('Alaphina'))

And using the test corpus, we can check the overall accuracy of the classifier across a collection of unseen names with known labels:

>>> test_featuresets = [(gender_features(n),g) for (n,g) in test]
>>> print nltk.classify.accuracy(classifier, test_featuresets)

Example: Predicting Sentiment

This example uses the movie review domain, following the ACL 2004 paper by Lillian Lee and Bo Pang. The movie review corpus is included with NLTK.

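import nltk, random

# Number of reviews held out for testing (an assumed value).
TEST_SIZE = 500

def word_features(doc):
    # The feature set is the document's word frequency distribution;
    # the label is the first character of the file name.
    words = nltk.corpus.movie_reviews.words(doc)
    return nltk.FreqDist(words), doc[0]

def get_data():
    featuresets = [word_features(doc)
                   for doc in nltk.corpus.movie_reviews.files()]
    # Shuffle so that the held-out slice contains both labels.
    random.shuffle(featuresets)
    return featuresets[TEST_SIZE:], featuresets[:TEST_SIZE]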
>>> train_featuresets, test_featuresets = get_data()
>>> c1 = nltk.NaiveBayesClassifier.train(train_featuresets)
>>> print nltk.classify.accuracy(c1, test_featuresets)
>>> c2 = nltk.DecisionTreeClassifier.train(train_featuresets)
>>> print nltk.classify.accuracy(c2, test_featuresets)

Figure 5.8

Initial work on a classifier to use frequency of modal verbs to classify documents by genre:

import nltk, math
modals = ['can', 'could', 'may', 'might', 'must', 'will']

def modal_counts(tokens):
    return nltk.FreqDist(word for word in tokens if word in modals)

# just the most frequent modal verb
def modal_features1(tokens):
    return dict(most_frequent_modal = modal_counts(tokens).max())

# one feature per verb, set to True if the verb occurs more than once
def modal_features2(tokens):
    fd = modal_counts(tokens)
    return dict( (word,(fd[word]>1)) for word in modals)

# one feature per verb, with a small number of scalar values
def modal_features3(tokens):
    fd = modal_counts(tokens)
    features = {}
    for word in modals:
        try:
            features[word] = int(-math.log10(float(fd[word])/len(tokens)))
        except (OverflowError, ValueError):
            # log10(0) fails when the verb does not occur at all
            features[word] = 1000
    return features

# 4 bins per verb based on frequency
def modal_features4(tokens):
    fd = modal_counts(tokens)
    features = {}
    for word in modals:
        freq = float(fd[word])/len(tokens)
        for logfreq in range(3,7):
            features["%s(%d)" % (word, logfreq)] = (freq < 10**(-logfreq))
    return features
>>> genres = ['hobbies', 'humor', 'science_fiction', 'news', 'romance', 'religion']
>>> train = [(modal_features4(nltk.corpus.brown.words(g)[:2000]), g) for g in genres]
>>> test = [(modal_features4(nltk.corpus.brown.words(g)[2000:4000]), g) for g in genres]
>>> classifier = nltk.NaiveBayesClassifier.train(train)
>>> print 'Accuracy: %6.4f' % nltk.classify.accuracy(classifier, test)

Figure 5.9


Figure 5.10: Feature Extraction


Figure 5.11: Document Classification

5.6   Evaluation

In order to decide whether a classification model is accurately capturing a pattern, we must evaluate that model. The result of this evaluation is important for deciding how trustworthy the model is, and for what purposes we can use it. Evaluation can also be a useful tool for guiding us in making future improvements to the model.

Evaluation Set

Most evaluation techniques calculate a score for a model by comparing the labels that it generates for the inputs in an evaluation set with the correct labels for those inputs. This evaluation set typically has the same format as the training corpus. However, it is very important that the evaluation set be distinct from the training corpus: if we simply re-used the training corpus as the evaluation set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive very high scores. Similarly, if we use a development corpus, then it must be distinct from the evaluation set as well. Otherwise, we risk building a model that does not generalize well to new inputs, and our evaluation scores may be misleadingly high.

If we are actively developing a model, by adjusting the features that it uses or any hand-tuned parameters, then we may want to make use of two evaluation sets. We would use the first evaluation set while developing the model, to evaluate whether specific changes to the model are beneficial. However, once we've made use of this first evaluation set to help develop the model, we can no longer trust that it will give us an accurate idea of how well the model would perform on new data. We therefore save the second evaluation set until our model development is complete, at which point we can use it to check how well our model will perform on new input values.

When building an evaluation set, we must be careful to ensure that it is sufficiently different from the training corpus that it will effectively evaluate the performance of the model on new inputs. For example, if our evaluation set and training corpus are both drawn from the same underlying data source, then the results of our evaluation will only tell us how well the model is likely to do on other texts that come from the same (or a similar) data source.

Precision and Recall


Figure 5.12: True and False Positives and Negatives

Consider Figure 5.12. The set of items that the model labels as positive and the set of items that are actually positive intersect to define four regions: the true positives (TP), true negatives (TN), false positives (FP) or Type I errors, and false negatives (FN) or Type II errors. Two standard measures are precision, the fraction of guessed items that were correct, TP/(TP+FP), and recall, the fraction of correct items that were identified, TP/(TP+FN). A third measure, the F measure, is the harmonic mean of precision and recall, i.e. 1/(0.5/Precision + 0.5/Recall).
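As a worked illustration of these three measures, the following sketch counts TP, FP and FN for a single label and applies the formulas given above. The function name precision_recall_f and the two short label lists are hypothetical, made up for this example.

```python
def precision_recall_f(reference, predicted, positive):
    """Compute precision, recall and F measure for the label `positive`.
    `reference` holds the correct labels and `predicted` the model's
    labels, in the same order."""
    tp = fp = fn = 0
    for (ref, guess) in zip(reference, predicted):
        if guess == positive and ref == positive:
            tp += 1          # true positive
        elif guess == positive:
            fp += 1          # false positive (Type I error)
        elif ref == positive:
            fn += 1          # false negative (Type II error)
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 or recall == 0.0:
        f_measure = 0.0
    else:
        f_measure = 1.0 / (0.5 / precision + 0.5 / recall)
    return precision, recall, f_measure

reference = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos']
predicted = ['pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos']
p, r, f = precision_recall_f(reference, predicted, 'pos')
```

Here TP=3, FP=1 and FN=2, so precision is 3/4 and recall is 3/5; the F measure lies between the two, closer to the lower value.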


To do evaluation, we need to keep some of the data back, since we must not test on the training data. But that means we have less data available for training. Also, what if our training set has idiosyncrasies?

Cross-validation: run training and testing multiple times, with different training sets.

  • Lets us get away with smaller training sets
  • Lets us get a feel for how much the performance varies based on different training sets.
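A minimal sketch of k-fold cross-validation is shown below. It assumes a generic interface that is not part of NLTK: train_fn builds a model from a list of labeled examples, and accuracy_fn scores a model on a list of labeled examples. The majority-label "learner" and the toy data are hypothetical, chosen only to keep the example short.

```python
def cross_validate(labeled, k, train_fn, accuracy_fn):
    """Return one accuracy score per fold. In fold i, every k-th example
    (starting at position i) is held out for testing, and the remaining
    examples are used for training."""
    scores = []
    for i in range(k):
        test = labeled[i::k]
        train = [ex for (j, ex) in enumerate(labeled) if j % k != i]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return scores

# Toy learner: the "model" is simply the most frequent training label.
def train_majority(train):
    counts = {}
    for (_, label) in train:
        counts[label] = counts.get(label, 0) + 1
    return max(sorted(counts), key=counts.get)

def accuracy(model, test):
    correct = sum(1 for (_, label) in test if label == model)
    return correct / float(len(test))

data = [(i, 'b' if i in (0, 1, 2, 4) else 'a') for i in range(12)]
scores = cross_validate(data, 4, train_majority, accuracy)
mean_score = sum(scores) / len(scores)
spread = max(scores) - min(scores)
```

The mean of the fold scores summarizes overall performance, while the spread between the best and worst fold gives a rough feel for how much the result depends on the particular training set.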

Error Analysis

The metrics above give us a general feel for how well a system does, but they don't tell us much about why it gets that performance. Are there patterns in what it gets wrong? If so, that can help us to improve the system, or if we can't improve it, then at least make us more aware of what the limitations of the system are, and for what kind of data it will produce more reliable or less reliable results.


5.7   Classification Methods

In this section, we'll take a closer look at three machine learning methods that can be used to automatically build classification models: Decision Trees, Naive Bayes classifiers, and Maximum Entropy classifiers. As we've seen, it's possible to treat these learning methods as black boxes, simply training models and using them for prediction without understanding how they work. But there's a lot to be learned from taking a closer look at how these learning methods select models based on the data in a training corpus. An understanding of these methods can help guide our selection of appropriate features, and especially our decisions about how those features should be encoded. And an understanding of the generated models can allow us to extract useful information about which features are most informative, and how those features relate to one another.

5.8   Decision Trees

A decision tree is a tree-structured flowchart used to choose labels for input values. This flowchart consists of decision nodes, which check feature values, and leaf nodes, which assign labels. To choose the label for an input value, we begin at the flowchart's initial decision node, known as its root node. This node contains a condition that checks one of the input value's features, and selects a branch based on that feature's value. Following the branch that describes our input value, we arrive at a new decision node, with a new condition on the input value's features. We continue following the branch selected by each node's condition, until we arrive at a leaf node, which provides a label for the input value. Figure 5.13 shows an example decision tree model for the name gender task.


Figure 5.13: Decision Tree model for the name gender task. Note that tree diagrams are conventionally drawn "upside down," with the root at the top, and the leaves at the bottom.

Once we have a decision tree, it is thus fairly straightforward to use it to assign labels to new input values. What's less straightforward is how we can build a decision tree that models a given training corpus. But before we look at the learning algorithm for building decision trees, we'll consider a simpler task: picking the best "decision stump" for a corpus. A decision stump is a decision tree with a single node, that decides how to classify inputs based on a single feature. It contains one leaf for each possible feature value, specifying the class label that should be assigned to inputs whose features have that value. In order to build a decision stump, we must first decide which feature should be used. The simplest method is to just build a decision stump for each possible feature, and see which one achieves the highest accuracy on the training data; but we'll discuss some other alternatives below. Once we've picked a feature, we can build the decision stump by assigning a label to each leaf based on the most frequent label for the selected examples in the training corpus (i.e., the examples where the selected feature has that value).
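The simplest stump-selection method described above (one stump per feature, keeping the one with the highest training accuracy) can be sketched as follows. This is a minimal illustration rather than NLTK's own implementation; the function names and the six hand-made feature sets are hypothetical.

```python
from collections import Counter, defaultdict

def build_stump(feature, labeled_featuresets):
    """Map each observed value of `feature` to the most frequent label
    among the training examples that have that value."""
    label_counts = defaultdict(Counter)
    for (featureset, label) in labeled_featuresets:
        label_counts[featureset[feature]][label] += 1
    return dict((value, counts.most_common(1)[0][0])
                for (value, counts) in label_counts.items())

def stump_accuracy(feature, stump, labeled_featuresets):
    correct = sum(1 for (fs, label) in labeled_featuresets
                  if stump.get(fs[feature]) == label)
    return correct / float(len(labeled_featuresets))

def best_stump(labeled_featuresets):
    """Build one stump per feature and return (feature, stump, accuracy)
    for the stump that is most accurate on the training data."""
    best = None
    for feature in sorted(labeled_featuresets[0][0]):
        stump = build_stump(feature, labeled_featuresets)
        acc = stump_accuracy(feature, stump, labeled_featuresets)
        if best is None or acc > best[2]:
            best = (feature, stump, acc)
    return best

train = [
    ({'firstletter': 'j', 'lastletter': 'n'}, 'male'),    # John
    ({'firstletter': 'a', 'lastletter': 'a'}, 'female'),  # Anna
    ({'firstletter': 'j', 'lastletter': 'a'}, 'female'),  # Julia
    ({'firstletter': 'p', 'lastletter': 'r'}, 'male'),    # Peter
    ({'firstletter': 'a', 'lastletter': 'n'}, 'male'),    # Alan
    ({'firstletter': 'p', 'lastletter': 'a'}, 'female'),  # Paula
]
feature, stump, acc = best_stump(train)
```

On this toy data the last letter separates the two labels perfectly, while the first letter does no better than chance, so the lastletter stump is selected.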

Given the algorithm for choosing decision stumps, the algorithm for growing larger decision trees is straightforward. We begin by selecting the overall best decision stump for the corpus. We then check the accuracy of each of the leaves on the training corpus. Any leaves that do not achieve sufficiently good accuracy are then replaced by new decision stumps, trained on the subset of the training corpus that is selected by the path to the leaf. For example, we could grow the decision tree in Figure 5.13 by replacing the leftmost leaf with a new decision stump, trained on the subset of the training corpus names that do not start with a "k" or end with a vowel or an "l."

As we mentioned before, there are a number of methods that can be used to select the most informative feature for a decision stump. One popular alternative is to use information gain, a measure of how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which is defined as:

Entropy(S) = - sum_{label} freq(label) * log_2(freq(label))


If most input values have the same label, then the entropy of their labels will be low. In particular, labels that have low frequency will not contribute much to the entropy (since the first term, freq(label), will be close to zero); and labels with high frequency will also not contribute much to the entropy (since log_2(freq(label)) will be close to zero). On the other hand, if the input values have a wide variety of labels, then there will be many labels with a "medium" frequency, where neither freq(label) nor log_2(freq(label)) is close to zero, so the entropy will be high.

Once we have calculated the entropy of the original set of input values' labels, we can figure out how much more organized the labels become once we apply the decision stump. To do so, we calculate the entropy for each of the decision stump's leaves, and take the average of those leaf entropy values (weighted by the number of samples in each leaf). The information gain is then equal to the original entropy minus this new, reduced entropy. The higher the information gain, the better job the decision stump does of dividing the input values into coherent groups, so we can build decision trees by selecting the decision stumps with the highest information gain.
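The calculation can be sketched directly from this description: the entropy of the labels before the split, minus the size-weighted average entropy of the leaves after the split. The function names and the toy feature sets below are hypothetical, and entropy is taken to be the negated sum of freq(label) * log_2(freq(label)), so that it is never negative.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy, in bits, of a list of labels."""
    total = float(len(labels))
    return -sum((n / total) * math.log(n / total, 2)
                for n in Counter(labels).values())

def information_gain(feature, labeled_featuresets):
    """Original label entropy minus the weighted average entropy of the
    leaves produced by splitting on `feature`."""
    labels = [label for (_, label) in labeled_featuresets]
    leaves = defaultdict(list)
    for (featureset, label) in labeled_featuresets:
        leaves[featureset[feature]].append(label)
    total = float(len(labels))
    remaining = sum((len(leaf) / total) * entropy(leaf)
                    for leaf in leaves.values())
    return entropy(labels) - remaining

train = [
    ({'firstletter': 'j', 'lastletter': 'n'}, 'male'),
    ({'firstletter': 'a', 'lastletter': 'a'}, 'female'),
    ({'firstletter': 'j', 'lastletter': 'a'}, 'female'),
    ({'firstletter': 'p', 'lastletter': 'r'}, 'male'),
    ({'firstletter': 'a', 'lastletter': 'n'}, 'male'),
    ({'firstletter': 'p', 'lastletter': 'a'}, 'female'),
]
gain_last = information_gain('lastletter', train)
gain_first = information_gain('firstletter', train)
```

With three male and three female examples the original entropy is one bit. Splitting on the last letter leaves only pure leaves, so its gain is the full bit; splitting on the first letter leaves every leaf evenly mixed, so its gain is zero.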

Another consideration for decision trees is efficiency. The simple algorithm for selecting decision stumps described above must construct a candidate decision stump for every possible feature; and this process must be repeated for every node in the constructed decision tree. A number of algorithms have been developed to cut down on the training time by storing and reusing information about previously evaluated examples. <<references>>.

Decision trees have a number of useful qualities. To begin with, they're simple to understand, and easy to interpret. This is especially true near the top of the decision tree, where it is usually possible for the learning algorithm to find very useful features. Decision trees are especially well suited to cases where many hierarchical categorical distinctions can be made. For example, decision trees can be very effective at modelling phylogeny trees.

However, decision trees also have a few disadvantages. One problem is that, since each branch in the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small. As a result, these lower decision nodes may overfit the training corpus, learning patterns that reflect idiosyncrasies of the training corpus, rather than genuine patterns in the underlying problem. One solution to this problem is to stop dividing nodes once the amount of training data becomes too small. Another solution is to grow a full decision tree, but then to prune decision nodes that do not improve performance on a development corpus.

A second problem with decision trees is that they force features to be checked in a specific order, even when features may act relatively independently of one another. For example, when classifying documents into topics (such as sports, automotive, or murder mystery), features such as hasword(football) are highly indicative of a specific label, regardless of what the other feature values are. Since there is limited space near the top of the decision tree, most of these features will need to be repeated on many different branches in the tree. And since the number of branches increases exponentially as we go down the tree, the amount of repetition can be very large.

A related problem is that decision trees are not good at making use of features that are weak predictors of the correct label. Since these features make relatively small incremental improvements, they tend to occur very low in the decision tree. But by the time the decision tree learner has descended far enough to use these features, there is not enough training data left to reliably determine what effect they should have. If we could instead look at the effect of these features across the entire training corpus, then we might be able to draw some conclusions about how they should affect the choice of label.

The fact that decision trees require that features be checked in a specific order limits their ability to make use of features that are relatively independent of one another. The Naive Bayes classification method, which we'll discuss next, overcomes this limitation by allowing all features to act "in parallel."

Naive Bayes Classifiers

In Naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the Naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training corpus. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value. Figure 5.14 illustrates this process.


Figure 5.14: An abstract illustration of the procedure used by the Naive Bayes classifier to choose the topic for a document. In the training corpus, most documents are automotive, so the classifier starts out at a point closer to the "automotive" label. But it then considers the effect of each feature. In this example, the input document contains the word "dark," which is a weak indicator for murder mysteries; but it also contains the word "football," which is a strong indicator for sports documents. After every feature has made its contribution, the classifier checks which label it is closest to, and assigns that label to the input.

Individual features make their contribution to the overall decision by "voting against" labels that don't occur with that feature very often. In particular, the likelihood score for each label is reduced by multiplying it by the probability that an input value with that label would have the feature. For example, if the word "run" occurs in 12% of the sports documents, 10% of the murder mystery documents, and 2% of the automotive documents, then the likelihood score for the sports label will be multiplied by 0.12; the likelihood score for the murder mystery label will be multiplied by 0.1; and the likelihood score for the automotive label will be multiplied by 0.02. The overall effect will be to reduce the score of the murder mystery label slightly more than the score of the sports label; and to significantly reduce the automotive label with respect to the other two labels. This overall process is illustrated in Figure 5.15.


Figure 5.15: Calculating label likelihoods with Naive Bayes. Naive Bayes begins by calculating the prior probability of each label, based on how frequently each label occurs in the training data. Every feature then contributes to the likelihood estimate for each label, by multiplying it by the probability that input values with that label will have that feature. The resulting likelihood score can be thought of as an estimate of the probability that a randomly selected value from the training corpus would have both the given label and the set of features, assuming that the feature probabilities are all independent.

Underlying Probabilistic Model

Another way of understanding the Naive Bayes classifier is that it chooses the most likely label for an input, under the assumption that every input value is generated by first choosing a class label for that input value, and then generating each feature, entirely independent of every other feature. Of course, this assumption is unrealistic: features are often highly dependent on one another in ways that don't just reflect differences in the class label. We'll return to some of the consequences of this assumption at the end of this section. But making this simplifying assumption makes it much easier to combine the contributions of the different features, since we don't need to worry about how they should interact with one another.

Based on this assumption, we can calculate an expression for P(label|features), the probability that an input will have a particular label, given that it has a particular set of features. To choose a label for a new input, we can then simply pick the label l that maximizes P(l|features).

To begin, we note that P(label|features) is equal to the probability that an input has a particular label and the specified set of features, divided by the probability that it has the specified set of features:

P(label|features) = P(features, label)/P(features)

Next, we note that P(features) will be the same for every choice of label, so if we are simply interested in finding the most likely label, it suffices to calculate P(features, label), which we'll call the label likelihood.


If we want to generate a probability estimate for each label, rather than just choosing the most likely label, then the easiest way to compute P(features) is to simply calculate the sum over labels of P(features, label):

P(features) = sum_{l in labels} P(features, l)

The label likelihood can be expanded out as the probability of the label times the probability of the features given the label:

P(features, label) = P(label) * P(features|label)

Furthermore, since the features are all independent of one another (given the label), we can separate out the probability of each individual feature:

P(features, label) = P(label) * prod_{f in features} P(f|label)

This is exactly the equation we discussed above for calculating the label likelihood: P(label) is the prior probability for a given label, and each P(f|label) is the contribution of a single feature to the label likelihood.
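The equation can be turned into a few lines of code. The sketch below multiplies each label's prior by the per-feature probabilities, and then normalizes by the sum over labels to obtain P(label|features). The P("run"|label) values are the ones used in the example above (0.12, 0.1 and 0.02); the prior probabilities are assumed values, chosen only so that automotive is the most frequent label, and the function names are hypothetical.

```python
def label_likelihoods(priors, feature_probs, features):
    """P(features, label) = P(label) * prod_{f in features} P(f|label),
    computed for every label."""
    scores = {}
    for label in priors:
        score = priors[label]
        for f in features:
            score *= feature_probs[label][f]
        scores[label] = score
    return scores

def normalize(scores):
    """Divide each label likelihood by P(features), the sum over labels."""
    total = sum(scores.values())
    return dict((label, s / total) for (label, s) in scores.items())

# Assumed priors (not taken from any corpus).
priors = {'automotive': 0.6, 'sports': 0.25, 'murder mystery': 0.15}
# P(run|label), as in the example in the text.
feature_probs = {
    'automotive': {'run': 0.02},
    'sports': {'run': 0.12},
    'murder mystery': {'run': 0.10},
}

scores = label_likelihoods(priors, feature_probs, ['run'])
posterior = normalize(scores)
best_label = max(scores, key=scores.get)
```

Although automotive starts with the highest prior, the single feature "run" is enough to move the sports label into first place under these assumed numbers.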

Zero Counts and Smoothing

The simplest way to calculate P(f|label), the contribution of a feature f toward the label likelihood for a label label, is to take the percentage of training instances with the given label that also have the given feature:

P(f|label) = count(f, label) / count(label)

However, this simple approach can become problematic when a feature never occurs with a given label in the training corpus. In this case, our calculated value for P(f|label) will be zero, which will cause the label likelihood for the given label to be zero. Thus, the input will never be assigned this label, regardless of how well the other features fit the label.

The basic problem here is with our calculation of P(f|label), the probability that an input will have a feature, given a label. In particular, just because we haven't seen a feature/label combination occur in the training corpus, doesn't mean it's impossible for that combination to occur. For example, we may not have seen any murder mystery documents that contained the word "football," but we wouldn't want to conclude that it's completely impossible for such documents to exist.

Thus, although count(f,label)/count(label) is a good estimate for P(f|label) when count(f, label) is relatively high, this estimate becomes less reliable when count(f, label) becomes smaller. Therefore, when building Naive Bayes models, we usually make use of more sophisticated techniques, known as smoothing techniques, for calculating P(f|label), the probability of a feature given a label. For example, the "Expected Likelihood Estimation" for the probability of a feature given a label basically adds 0.5 to each count(f,label) value; and the "Heldout Estimation" uses a heldout corpus to calculate the relationship between feature frequencies and feature probabilities. For more information on smoothing techniques, see <<ref -- manning & schutze?>>.
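The effect of adding 0.5 to each count can be seen in a small sketch. It assumes that the feature takes one of a fixed number of values (the bins parameter; 2 for a boolean feature), so that 0.5 is added once per possible value in the denominator and the estimates still sum to one. The function names and the counts are hypothetical.

```python
def mle_estimate(count_f_label, count_label):
    """Unsmoothed estimate: count(f, label) / count(label)."""
    return count_f_label / float(count_label)

def ele_estimate(count_f_label, count_label, bins=2):
    """Expected Likelihood Estimation: add 0.5 to the count of every
    possible feature value. `bins` is the number of possible values."""
    return (count_f_label + 0.5) / (count_label + 0.5 * bins)

# Suppose "football" occurs in 0 of 40 murder mystery training documents.
unsmoothed = mle_estimate(0, 40)
smoothed = ele_estimate(0, 40)
# The complementary value (the word is absent) is seen in all 40 documents.
smoothed_absent = ele_estimate(40, 40)
```

The unsmoothed estimate is exactly zero, which would rule out the murder mystery label for any document containing "football"; the smoothed estimate is small but non-zero, so the other features can still outweigh it.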

The Naivete of Independence

The reason that Naive Bayes classifiers are called "naive" is that it's unreasonable to assume that all features are independent of one another (given the label). In particular, almost all real-world problems contain features with varying degrees of dependence on one another. If we had to avoid any features that were dependent on one another, it would be very difficult to construct good feature sets that provide the required information to the machine learning algorithm.

So what happens when we ignore the independence assumption, and use the Naive Bayes classifier with features that are not independent? One problem that arises is that the classifier can end up "double-counting" the effect of highly correlated features, pushing the classifier closer to a given label than is justified.

To see how this can occur, consider a name gender classifier that contains two identical features, f_1 and f_2. In other words, f_2 is an exact copy of f_1, and contains no new information. Nevertheless, when the classifier is considering an input, it will include the contribution of both f_1 and f_2 when deciding which label to choose. Thus, the information content of these two features is given more weight than it should be.
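The size of this effect can be checked with a few lines of arithmetic. In the sketch below, the priors and the feature probabilities are assumed values chosen for illustration, and the function name posterior is hypothetical; f_2 is given exactly the same probabilities as f_1.

```python
def posterior(priors, feature_probs, features):
    """P(label|features) under the Naive Bayes independence assumption."""
    scores = {}
    for label in priors:
        score = priors[label]
        for f in features:
            score *= feature_probs[label][f]
        scores[label] = score
    total = sum(scores.values())
    return dict((label, s / total) for (label, s) in scores.items())

# Assumed values: both labels equally likely a priori, and the feature
# four times as likely to occur with female names as with male names.
priors = {'female': 0.5, 'male': 0.5}
feature_probs = {
    'female': {'f_1': 0.4, 'f_2': 0.4},   # f_2 is an exact copy of f_1
    'male':   {'f_1': 0.1, 'f_2': 0.1},
}

once = posterior(priors, feature_probs, ['f_1'])
twice = posterior(priors, feature_probs, ['f_1', 'f_2'])
```

With f_1 alone, the probability of the female label is 0.8. Adding the copy raises it to roughly 0.94, even though no new information has been supplied.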

Of course, we don't usually build Naive Bayes classifiers that contain two identical features. However, we do build classifiers that contain features which are dependent on one another. For example, the features ends-with(a) and ends-with(vowel) are dependent on one another, because if an input value has the first feature, then it must also have the second feature. For features like these, the duplicated information may be given more weight than is justified by the training corpus.

5.9   Maximum Entropy Classifiers

The Maximum Entropy classifier uses a model that is very similar to the model used by the Naive Bayes classifier. But rather than using probabilities to set the model's parameters, it uses search techniques to find a set of parameters that will maximize the performance of the classifier. In particular, it looks for the set of parameters that maximizes the total likelihood of the training corpus, which is defined as:

\sum_{x \in corpus} P(label(x) | features(x))

Where P(label|features), the probability that an input whose features are features will have class label label, is defined as:

P(label | features) = P(label, features) / \sum_{label'} P(label', features)

Because of the potentially complex interactions between the effects of related features, there is no way to directly calculate the model parameters that maximize the likelihood of the training corpus. Therefore, Maximum Entropy classifiers choose the model parameters using iterative optimization techniques, which initialize the model's parameters to random values, and then repeatedly refine those parameters to bring them closer to the optimal solution. The iterative optimization techniques guarantee that each refinement of the parameters will bring them closer to the optimal values, but do not necessarily provide a means of determining when those optimal values have been reached. Because the parameters for Maximum Entropy classifiers are selected using iterative optimization techniques, they can take a long time to train. This is especially true when the size of the training corpus, the number of features, and the number of labels are all large.
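The idea of repeatedly refining parameters can be illustrated with a minimal sketch. It is not NLTK's implementation and does not use any of the optimization methods named in this section: it uses plain gradient ascent, and it maximizes the logarithm of the likelihood, which is a common choice. The toy corpus, the feature names and the function names are all invented for illustration, and the parameters start at zero rather than at random values so that the run is repeatable.

```python
import math

# Invented toy corpus of (features, label) pairs; not taken from NLTK.
corpus = [
    ({'ends_a': 1, 'ends_vowel': 1}, 'female'),
    ({'ends_a': 0, 'ends_vowel': 1}, 'female'),
    ({'ends_a': 0, 'ends_vowel': 0}, 'male'),
    ({'ends_a': 0, 'ends_vowel': 1}, 'male'),
    ({'ends_a': 1, 'ends_vowel': 1}, 'female'),
    ({'ends_a': 0, 'ends_vowel': 0}, 'male'),
]
labels = ['female', 'male']
feature_names = ['ends_a', 'ends_vowel']

def prob(weights, features, label):
    """P(label|features): each label's score is exponentiated, then
    normalized by the sum over all labels."""
    scores = dict((l, math.exp(sum(weights[l, f] * features[f]
                                   for f in feature_names)))
                  for l in labels)
    return scores[label] / sum(scores.values())

def log_likelihood(weights):
    """Sum of log P(label(x)|features(x)) over the corpus."""
    return sum(math.log(prob(weights, feats, label))
               for feats, label in corpus)

def train(iterations=100, rate=0.1):
    weights = dict(((l, f), 0.0) for l in labels for f in feature_names)
    history = [log_likelihood(weights)]
    for step in range(iterations):
        # Gradient: observed feature count minus expected feature count.
        gradient = dict((key, 0.0) for key in weights)
        for feats, label in corpus:
            for l in labels:
                p = prob(weights, feats, l)
                for f in feature_names:
                    observed = feats[f] if l == label else 0
                    gradient[l, f] += observed - p * feats[f]
        for key in weights:
            weights[key] += rate * gradient[key]
        history.append(log_likelihood(weights))
    return weights, history

weights, history = train()
print("log-likelihood: %.3f -> %.3f" % (history[0], history[-1]))
```

Each pass over the corpus nudges every parameter in the direction that raises the log-likelihood; the list history records how the objective improves from one refinement to the next.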


Some iterative optimization techniques are much faster than others. When training Maximum Entropy models, avoid the use of Generalized Iterative Scaling (GIS) or Improved Iterative Scaling (IIS), which are both considerably slower than the Conjugate Gradient (CG) and the BFGS optimization methods.

  • The technique of fixing the form of the model and then searching for model parameters that optimize some evaluation metric is called optimization.
  • A number of other machine learning algorithms can be thought of as optimization systems.

5.10   Exercises

  1. ☼ Read up on one of the language technologies mentioned in this section, such as word sense disambiguation, semantic role labeling, question answering, machine translation, named entity detection. Find out what type and quantity of annotated data is required for developing such systems. Why do you think a large amount of data is required?
  2. ☼ Exercise: compare the performance of different machine learning methods. (they're still black boxes at this point)
  3. ☼ The synonyms strong and powerful pattern differently (try combining them with chip and sales).
  4. ◑ Accessing extra features from WordNet to augment those that appear directly in the text (e.g. hypernym of any monosemous word)
  5. ★ Task involving PP Attachment data; predict choice of preposition from the nouns.
  6. ★ Suppose you wanted to automatically generate a prose description of a scene, and already had a word to uniquely describe each entity, such as the jar, and simply wanted to decide whether to use in or on in relating various items, e.g. the book is in the cupboard vs the book is on the shelf. Explore this issue by looking at corpus data; writing programs as needed.

(14)
  a. in the car vs on the train
  b. in town vs on campus
  c. in the picture vs on the screen
  d. in Macbeth vs on Letterman

About this document...

This chapter is a draft from Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2008 the authors. It is distributed with the Natural Language Toolkit, Version 0.9.6, under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

This document is Revision: 7166 Mon Dec 8 21:47:15 EST 2008

6   Structured Programming in Python

By now you will have a sense of the capabilities of the Python programming language for processing natural language. However, if you're new to Python or to programming, you may still be wrestling with Python and not feel like you are in full control yet. In this chapter we'll address the following questions:

  1. how can we write well-structured, readable programs that you and others will be able to re-use easily?
  2. how do the fundamental building blocks work, such as loops, functions and assignment?
  3. what are some of the pitfalls with Python programming and how can we avoid them?

Along the way, you will consolidate your knowledge of fundamental programming constructs, learn more about using features of the Python language in a natural and concise way, and learn some useful techniques in visualizing natural language data. As before, this chapter contains many examples and exercises (and as before, some exercises introduce new material). Readers new to programming should work through them carefully and consult other introductions to programming if necessary; experienced programmers can quickly skim this chapter.


Remember that our program samples assume you begin your interactive session or your program with: import nltk, re, pprint

6.1   Back to the Basics


If you've come to Python from another programming language, you may be inclined to make heavy use of loop variables, and no use of list comprehensions. In this section we'll see how avoiding loop variables can lead to more readable code. We'll also look at the relationship between loops with nested blocks vs list comprehensions, and discuss when to use either construct.

Let's look at a familiar technique for iterating over the members of a list by initializing an index i and then incrementing the index each time we pass through the loop:

>>> sent = ['I', 'am', 'the', 'Walrus']
>>> i = 0
>>> while i < len(sent):
...     print sent[i].lower(),
...     i += 1
i am the walrus

Although this does the job, it is not idiomatic Python. It is almost never a good idea to use loop variables in this way. Observe that Python's for statement allows us to achieve the same effect, and that the code is not just more succinct, but more readable:

>>> for w in sent:
...     print w.lower(),
i am the walrus

This is much more readable, and for loops with nested code blocks will be our preferred way of producing printed output. However, the above two programs have the same subtle problem: both print a trailing space character. If we meant to include this output with surrounding markup, we would see something like <sent>i am the walrus </sent>. Similarly, both programs need an extra print statement in order to produce a newline character at the end. A third solution, using a generator expression (a close relative of the list comprehension), is more compact again:

>>> print ' '.join(w.lower() for w in sent)
i am the walrus

This doesn't produce an extra space character, and does produce the required newline. However, it is less readable thanks to the use of ' '.join(), and we will usually prefer to use the second solution above for code that prints output.

Another case where loop variables seem to be necessary is for printing the value of a counter with each line of output. Instead, we can use enumerate(), which processes a sequence s and produces a tuple of the form (i, s[i]) for each item in s, starting with (0, s[0]). Here we enumerate the keys of the frequency distribution, and capture the integer-string pair in the variables rank and word. We print rank+1 so that the counting appears to start from 1, as required when producing a list of ranked items.

>>> fd = nltk.FreqDist(nltk.corpus.brown.words())
>>> cumulative = 0.0
>>> for rank, word in enumerate(fd):
...     cumulative += fd[word] * 100.0 / fd.N()
...     print "%3d %6.2f%% %s" % (rank+1, cumulative, word)
...     if cumulative > 25:
...         break
  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in
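The same pattern can be tried without loading a corpus. The following sketch makes two assumptions that are not in the session above: it uses collections.Counter from the standard library in place of nltk.FreqDist, and it passes a second argument to enumerate() so that the ranks start from 1 and no rank+1 is needed. The text is a tiny invented example.

```python
from collections import Counter

# A tiny invented text standing in for a corpus.
tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(tokens)
total = float(len(tokens))

cumulative = 0.0
rows = []
# most_common() lists (word, count) pairs from most to least frequent;
# the second argument of enumerate() makes the ranks start at 1.
for rank, (word, count) in enumerate(counts.most_common(), 1):
    cumulative += count * 100.0 / total
    rows.append((rank, word))
    print("%3d %6.2f%% %s" % (rank, cumulative, word))
    if cumulative > 50:
        break
```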

It's sometimes tempting to use loop variables to store a maximum or minimum value seen so far. Let's use this method to find the longest word in a text.

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> longest = ''
>>> for word in text:
...     if len(word) > len(longest):
...         longest = word
>>> longest
'unextinguishable'

However, a better solution uses two list comprehensions as shown below. We sacrifice some efficiency by having two passes through the data, but the result is more transparent.

>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']

Note that our first solution found the first word having the longest length, while the second solution found all of the longest words. In most cases we actually prefer to get all solutions, though it's easy to find them all and only report one back to the user. In contrast, it is difficult to modify the first program to find all solutions. Although there's a theoretical efficiency difference between the two solutions, the main overhead is reading the data into main memory; once it's there, a second pass through the data is very fast. We also need to balance our concerns about program efficiency with programmer efficiency. A fast but cryptic solution will be harder to understand and maintain.

List comprehensions have a surprising range of uses. Here's an example of how they can be used to generate all combinations of some collections of words. Here we generate all combinations of two determiners, two adjectives, and two nouns. The list comprehension is split across three lines for readability.

>>> [(det,adj,noun) for det in ('two', 'three')
...                 for adj in ('old', 'blind')
...                 for noun in ('men', 'mice')]
[('two', 'old', 'men'), ('two', 'old', 'mice'), ('two', 'blind', 'men'),
 ('two', 'blind', 'mice'), ('three', 'old', 'men'), ('three', 'old', 'mice'),
 ('three', 'blind', 'men'), ('three', 'blind', 'mice')]

Our use of list comprehensions has helped us avoid loop variables. However, there are cases where we still want to use loop variables in a list comprehension. For example, we need to use a loop variable to extract successive overlapping n-grams from a list:

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> n = 3
>>> [sent[i:i+n] for i in range(len(sent)-n+1)]
[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]

It is quite tricky to get the range of the loop variable right. Since this is a common operation in NLP, NLTK supports it with functions bigrams(text) and trigrams(text), and a general purpose ngrams(text, n).
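To see why the range is len(sent)-n+1, it can help to wrap the expression in a small function and try different values of n. The function name extract_ngrams is hypothetical, chosen so as not to be confused with the NLTK functions just mentioned, and returning tuples instead of lists is a choice made for this sketch.

```python
def extract_ngrams(tokens, n):
    """Return the successive overlapping n-grams of tokens as a list of tuples."""
    # The last n-gram starts at index len(tokens)-n, so there are
    # len(tokens)-n+1 starting positions in all.
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
bigram_list = extract_ngrams(sent, 2)
trigram_list = extract_ngrams(sent, 3)
print(len(bigram_list))
print(trigram_list[0])
```

A six-word sentence yields five bigrams and four trigrams; when n is larger than the sentence, the range is empty and so is the result.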

Here's an example of how we can use loop variables in building multidimensional structures. For example, to build an array with m rows and n columns, where each cell is a set, we could use a nested list comprehension:

>>> m, n = 3, 7
>>> array = [[set() for i in range(n)] for j in range(m)]
>>> array[2][5].add('Alice')
>>> pprint.pprint(array)
[[set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set(['Alice']), set([])]]

Observe that the loop variables i and j are not used anywhere in the resulting object; they are just needed for a syntactically correct for statement. As another example of this usage, observe that the expression ['very' for i in range(3)] produces a list containing three instances of 'very', with no integers in sight.

Note that it would be incorrect to do this work using multiplication, for reasons that will be discussed in the next section.

>>> array = [[set()] * n] * m
>>> array[2][5].add(7)
>>> pprint.pprint(array)
[[set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])]]
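The difference between the two constructions can be checked directly with the is operator, which tests whether two expressions refer to the very same object. This is a small sketch; the variable names shared and separate are chosen for illustration.

```python
m, n = 3, 7
shared = [[set()] * n] * m                                  # one set, m*n references
separate = [[set() for i in range(n)] for j in range(m)]    # m*n distinct sets

print(shared[0][0] is shared[2][5])        # True
print(separate[0][0] is separate[2][5])    # False

shared[2][5].add('Alice')
separate[2][5].add('Alice')
# Count how many cells now appear to contain an item.
filled_shared = sum(len(cell) for row in shared for cell in row)
filled_separate = sum(len(cell) for row in separate for cell in row)
print(filled_shared)      # 21
print(filled_separate)    # 1
```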


Python's assignment statement operates on values. But what is a value? Consider the following code fragment:

>>> foo = 'Monty'
>>> bar = foo
>>> foo = 'Python'
>>> bar
'Monty'

This code shows that when we write bar = foo, the value of foo (the string 'Monty') is assigned to bar. That is, bar is a copy of foo, so when we overwrite foo with a new string 'Python', the value of bar is not affected.

However, assignment statements do not always involve making copies in this way. An important subtlety of Python is that the "value" of a structured object (such as a list) is actually a reference to the object. In the following example, we assign the reference of foo to the new variable bar. When we modify something inside foo, we can see that the contents of bar have also been changed.

>>> foo = ['Monty', 'Python']
>>> bar = foo
>>> foo[1] = 'Bodkin'
>>> bar
['Monty', 'Bodkin']

Figure 6.1: List Assignment and Computer Memory

Thus, the line bar = foo does not copy the contents of the variable, only its "object reference". To understand what is going on here, we need to know how lists are stored in the computer's memory. In Figure 6.1, we see that a list sent1 is a reference to an object stored at location 3133 (which is itself a series of pointers to other locations holding strings). When we assign sent2 = sent1, it is just the object reference 3133 that gets copied.

This behavior extends to other aspects of the Python language. In Section 6.2 we will see how it affects the way parameters are passed into functions. Here's an example of how it applies to copying:

>>> empty = []
>>> nested = [empty, empty, empty]
>>> nested
[[], [], []]
>>> nested[1].append('Python')
>>> nested
[['Python'], ['Python'], ['Python']]

Observe that changing one of the items inside our nested list of lists changed them all. This is because each of the three elements is actually just a reference to one and the same list in memory.


Your Turn: Use multiplication to create a list of lists: nested = [[]] * 3. Now modify one of the elements of the list, and observe that all the elements are changed.

Now, notice that when we assign a new value to one of the elements of the list, it does not propagate to the others:

>>> nested = [[]] * 3
>>> nested[1].append('Python')
>>> nested[1] = ['Monty']
>>> nested
[['Python'], ['Monty'], ['Python']]

We began with a list containing three references to a single empty list object. Then we modified that object by appending 'Python' to it, resulting in a list containing three references to a single list object ['Python']. Next, we overwrote one of those references with a reference to a new object ['Monty']. This last step modified the object references, but not the objects themselves. The ['Python'] object wasn't changed, and is still referenced from two places in our nested list of lists. It is crucial to appreciate this difference between modifying an object via an object reference, and overwriting an object reference.


To copy the items from a list foo to a new list bar, you can write bar = foo[:]. This copies the object references inside the list. To copy a structure without copying any object references, use copy.deepcopy().
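The three options (plain assignment, a slice copy, and copy.deepcopy()) can be compared side by side. In this sketch the variable names alias, shallow and deep are chosen for illustration.

```python
import copy

foo = [['Monty'], ['Python']]
alias = foo                  # same list object
shallow = foo[:]             # new outer list, same inner lists
deep = copy.deepcopy(foo)    # new outer list and new inner lists

foo[0].append('Bodkin')      # modify an inner list
foo[1] = ['Idle']            # overwrite a reference in the outer list

print(alias)     # [['Monty', 'Bodkin'], ['Idle']]
print(shallow)   # [['Monty', 'Bodkin'], ['Python']]
print(deep)      # [['Monty'], ['Python']]
```

The slice copy is unaffected by overwriting foo[1], but it still sees the change made inside the shared inner list; only the deep copy is fully independent.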


We have seen three kinds of sequence object: strings, lists, and tuples. As sequences, they have some common properties: they can be indexed and they have a length:

>>> text = 'I turned off the spectroroute'
>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> pair = (6, 'turned')
>>> text[2], words[3], pair[1]
('t', 'the', 'turned')
>>> len(text), len(words), len(pair)
(29, 5, 2)

We can iterate over the items in a sequence s in a variety of useful ways, as shown in Table 6.1.

Table 6.1:

Various ways to iterate over sequences

Python Expression                        Comment
for item in s                            iterate over the items of s
for item in sorted(s)                    iterate over the items of s in order
for item in set(s)                       iterate over unique elements of s
for item in reversed(s)                  iterate over elements of s in reverse
for item in set(s).difference(t)         iterate over elements of s not in t
for item in random.sample(s, len(s))     iterate over elements of s in random order

The sequence functions illustrated in Table 6.1 can be combined in various ways; for example, to get unique elements of s sorted in reverse, use reversed(sorted(set(s))).
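A short sketch of such combinations, using an invented word list:

```python
s = ['the', 'cat', 'sat', 'on', 'the', 'mat']
t = ['the', 'on']

# Unique elements of s, sorted, then reversed.
unique_reversed = list(reversed(sorted(set(s))))
# Elements of s that do not occur in t, in sorted order.
not_in_t = sorted(set(s).difference(t))

print(unique_reversed)   # ['the', 'sat', 'on', 'mat', 'cat']
print(not_in_t)          # ['cat', 'mat', 'sat']
```

The same reversed ordering can also be obtained with sorted(set(s), reverse=True).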

We can convert between these sequence types. For example, tuple(s) converts any kind of sequence into a tuple, and list(s) converts any kind of sequence into a list. We can convert a list of strings to a single string using the join() function, e.g. ':'.join(words).

Notice in the above code sample that we computed multiple values on a single line, separated by commas. These comma-separated expressions are actually just tuples — Python allows us to omit the parentheses around tuples if there is no ambiguity. When we print a tuple, the parentheses are always displayed. By using tuples in this way, we are implicitly aggregating items together.

In the next example, we use tuples to re-arrange the contents of our list. (We can omit the parentheses because the comma has higher precedence than assignment.)

>>> words[2], words[3], words[4] = words[3], words[4], words[2]
>>> words
['I', 'turned', 'the', 'spectroroute', 'off']

This is an idiomatic and readable way to move items inside a list. It is equivalent to the following traditional way of doing such tasks that does not use tuples (notice that this method needs a temporary variable tmp).

>>> tmp = words[2]
>>> words[2] = words[3]
>>> words[3] = words[4]
>>> words[4] = tmp

As we have seen, Python has sequence functions such as sorted() and reversed() that rearrange the items of a sequence. There are also functions that modify the structure of a sequence and which can be handy for language processing. Thus, zip() takes the items of two sequences and "zips" them together into a single list of pairs. Given a sequence s, enumerate(s) returns an iterator that produces a pair of an index and the item at that index.

>>> words = ['I', 'turned', 'off', 'the', 'spectroroute']
>>> tags = ['NNP', 'VBD', 'IN', 'DT', 'NN']
>>> zip(words, tags)
[('I', 'NNP'), ('turned', 'VBD'), ('off', 'IN'),
('the', 'DT'), ('spectroroute', 'NN')]
>>> list(enumerate(words))
[(0, 'I'), (1, 'turned'), (2, 'off'), (3, 'the'), (4, 'spectroroute')]

Combining Different Sequence Types

Let's combine our knowledge of these three sequence types, together with list comprehensions, to perform the task of sorting the words in a string by their length.

>>> words = 'I turned off the spectroroute'.split()     [1]
>>> wordlens = [(len(word), word) for word in words]    [2]
>>> wordlens
[(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')]
>>> wordlens.sort()                                     [3]
>>> ' '.join([word for (count, word) in wordlens])      [4]
'I off the turned spectroroute'

Each of the above lines of code contains a significant feature. Line [1] demonstrates that a simple string is actually an object with methods defined on it, such as split(). Line [2] shows the construction of a list of tuples, where each tuple consists of a number (the word length) and the word, e.g. (3, 'the'). Line [3] sorts the list, modifying the list in-place. Finally, line [4] discards the length information then joins the words back into a single string.
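For comparison, here is a sketch of the same task written with sorted() and its key parameter, which sorts by a computed value without building the (length, word) tuples by hand. The variable names by_tuple and by_key are chosen for illustration.

```python
words = 'I turned off the spectroroute'.split()

# Build (length, word) tuples, sort them, then discard the lengths.
decorated = sorted((len(word), word) for word in words)
by_tuple = ' '.join(word for (length, word) in decorated)

# The same ordering by length using a key function; ties keep their
# original order because Python's sort is stable.
by_key = ' '.join(sorted(words, key=len))
print(by_key)
```

The two versions agree on this input. They can differ when words of equal length are not already in alphabetical order, since the tuple version breaks ties by comparing the words themselves.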

We began by talking about the commonalities in these sequence types, but the above code illustrates important differences in their roles. First, strings appear at the beginning and the end: this is typical in the context where our program is reading in some text and producing output for us to read. Lists and tuples are used in the middle, but for different purposes. A list is typically a sequence of objects all having the same type, of arbitrary length. We often use lists to hold sequences of words. In contrast, a tuple is typically a collection of objects of different types, of fixed length. We often use a tuple to hold a record, a collection of different fields relating to some entity. This distinction between the use of lists and tuples takes some getting used to, so here is another example:

>>> lexicon = [
...     ('the', 'DT', ['Di:', 'D@']),
...     ('off', 'IN', ['Qf', 'O:f'])
... ]

Here, a lexicon is represented as a list because it is a collection of objects of a single type — lexical entries — of no predetermined length. An individual entry is represented as a tuple because it is a collection of objects with different interpretations, such as the orthographic form, the part of speech, and the pronunciations represented in the SAMPA computer readable phonetic alphabet. Note that these pronunciations are stored using a list. (Why?)

The distinction between lists and tuples has been described in terms of usage. However, there is a more fundamental difference: in Python, lists are mutable, while tuples are immutable. In other words, lists can be modified, while tuples cannot. Here are some of the operations on lists that do in-place modification of the list. None of these operations is permitted on a tuple, a fact you should confirm for yourself.

>>> lexicon.sort()
>>> lexicon[1] = ('turned', 'VBD', ['t3:nd', 't3`nd'])
>>> del lexicon[0]
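One way to confirm this is to attempt the same operations on a tuple and catch the resulting errors, as in the following sketch. It also shows that a list stored inside a tuple can still be modified.

```python
entry = ('the', 'DT', ['Di:', 'D@'])

errors = []
try:
    entry[1] = 'NN'              # item assignment is not allowed on a tuple
except TypeError as e:
    errors.append(str(e))
try:
    del entry[0]                 # neither is deleting an item
except TypeError as e:
    errors.append(str(e))

has_sort = hasattr(entry, 'sort')   # tuples have no in-place sort() method
print(len(errors), has_sort)

# The tuple itself cannot change, but the list inside it is still mutable.
entry[2].reverse()
print(entry)
```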

Stacks and Queues

Lists are a particularly versatile data type. We can use lists to implement higher-level data types such as stacks and queues. A stack is a container that has a last-in-first-out policy for adding and removing items (see Figure 6.2).


Figure 6.2: Stacks and Queues

Stacks are used to keep track of the current context in computer processing of natural languages (and programming languages too). We will seldom have to deal with stacks explicitly, as the implementation of NLTK parsers, treebank corpus readers, (and even Python functions), all use stacks behind the scenes. However, it is important to understand what stacks are and how they work.

def check_parens(tokens):
    stack = []
    for token in tokens:
        if token == '(':     # push
            stack.append(token)
        elif token == ')':   # pop
            stack.pop()
    return stack
>>> phrase = "( the cat ) ( sat ( on ( the mat )"
>>> print check_parens(phrase.split())
['(', '(']

Figure 6.3: Check parentheses are balanced

In Python, we can treat a list as a stack by limiting ourselves to the three operations defined on stacks: append(item) (to push item onto the stack), pop() to pop the item off the top of the stack, and [-1] to access the item on the top of the stack. The program in Figure 6.3 processes a sentence with phrase markers, and checks that the parentheses are balanced. The loop pushes material onto the stack when it gets an open parenthesis, and pops the stack when it gets a close parenthesis. We see that two are left on the stack at the end; i.e. the parentheses are not balanced.
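Calling pop() on an empty list raises an IndexError, so a checker built this way needs a guard if a close parenthesis can arrive before any open parenthesis. The following is a sketch of a hypothetical variant, is_balanced(), which returns a truth value and handles that case.

```python
def is_balanced(tokens):
    """Return True if every '(' is closed and every ')' has a matching '('."""
    stack = []
    for token in tokens:
        if token == '(':
            stack.append(token)          # push
        elif token == ')':
            if not stack:                # nothing to pop: unmatched ')'
                return False
            stack.pop()                  # pop
    return not stack                     # leftovers mean unmatched '('

print(is_balanced("( the cat ) ( sat ( on ( the mat )".split()))       # False
print(is_balanced("( the cat ) ( sat ( on ( the mat ) ) )".split()))   # True
print(is_balanced(") the cat (".split()))                              # False
```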

Although the program in Figure 6.3 is a useful illustration of stacks, it is overkill because we could have done a direct count: phrase.count('(') == phrase.count(')'). However, we can use stacks for more sophisticated processing of strings containing nested structure, as shown in Figure 6.4. Here we build a (potentially deeply-nested) list of lists. Whenever a token other than a parenthesis is encountered, we add it to a list at the appropriate level of nesting. The stack cleverly keeps track of this level of nesting, exploiting the fact that the item at the top of the stack is actually shared with a more deeply nested item. (Hint: add diagnostic print statements to the function to help you see what it is doing.)

def convert_parens(tokens):
    stack = [[]]
    for token in tokens:
        if token == '(':     # push
            sublist = []
            stack[-1].append(sublist)
            stack.append(sublist)
        elif token == ')':   # pop
            stack.pop()
        else:                # update top of stack
            stack[-1].append(token)
    return stack[0]
>>> phrase = "( the cat ) ( sat ( on ( the mat ) ) )"
>>> print convert_parens(phrase.split())
[['the', 'cat'], ['sat', ['on', ['the', 'mat']]]]

Figure 6.4: Convert a nested phrase into a nested list using a stack

Lists can be used to represent another important data structure. A queue is a container that has a first-in-first-out policy for adding and removing items (see Figure 6.2). Queues are used for scheduling activities or resources. As with stacks, we will seldom have to deal with queues explicitly, as the implementation of NLTK n-gram taggers (Section 4.5) and chart parsers (Section 8.5) use queues behind the scenes. However, we will take a brief look at how queues are implemented using lists.

>>> queue = ['the', 'cat', 'sat']
>>> queue.append('on')
>>> queue.append('the')
>>> queue.append('mat')
>>> queue.pop(0)
'the'
>>> queue.pop(0)
'cat'
>>> queue
['sat', 'on', 'the', 'mat']
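Removing the first item of a list with pop(0) requires all the remaining items to be shifted along, which becomes slow for long queues. As an alternative that this section does not cover, the standard library provides collections.deque, which supports efficient removal from either end. A sketch of the same session using a deque:

```python
from collections import deque

queue = deque(['the', 'cat', 'sat'])
queue.append('on')
queue.append('the')
queue.append('mat')
first = queue.popleft()     # 'the'
second = queue.popleft()    # 'cat'
print(list(queue))          # ['sat', 'on', 'the', 'mat']
```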


In the condition part of an if statement, a nonempty string or list is evaluated as true, while an empty string or list evaluates as false.

>>> mixed = ['cat', '', ['dog'], []]
>>> for element in mixed:
...     if element:
...         print element
cat
['dog']

That is, we don't need to say if len(element) > 0: in the condition.

What's the difference between using if...elif as opposed to using a couple of if statements in a row? Well, consider the following situation:

>>> animals = ['cat', 'dog']
>>> if 'cat' in animals:
...     print 1
... elif 'dog' in animals:
...     print 2
1

Since the if clause of the statement is satisfied, Python never tries to evaluate the elif clause, so we never get to print out 2. By contrast, if we replaced the elif by an if, then we would print out both 1 and 2. So an elif clause potentially gives us more information than a bare if clause; when it evaluates to true, it tells us not only that the condition is satisfied, but also that the condition of the main if clause was not satisfied.
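The contrast can be checked by wrapping each version in a function and collecting the values instead of printing them. The function names with_elif and with_two_ifs are chosen for this sketch.

```python
def with_elif(animals):
    found = []
    if 'cat' in animals:
        found.append(1)
    elif 'dog' in animals:      # only tested when the first condition fails
        found.append(2)
    return found

def with_two_ifs(animals):
    found = []
    if 'cat' in animals:
        found.append(1)
    if 'dog' in animals:        # always tested
        found.append(2)
    return found

print(with_elif(['cat', 'dog']))      # [1]
print(with_two_ifs(['cat', 'dog']))   # [1, 2]
print(with_elif(['dog']))             # [2]
```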

6.2   Functions

Once you have been programming for a while, you will find that you need to perform a task that you have done in the past. In fact, over time, the number of completely novel things you have to do in creating a program decreases significantly. Half of the work may involve simple tasks that you have done before. Thus it is important for your code to be re-usable. One effective way to do this is to abstract commonly used sequences of steps into a function.

For example, suppose we find that we often want to read text from an HTML file. This involves several steps: opening the file, reading it in, normalizing whitespace, and stripping HTML markup. We can collect these steps into a function, and give it a name such as get_text():

import re
def get_text(file):
    """Read text from a file, normalizing whitespace
    and stripping HTML markup."""
    text = open(file).read()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    return text

Figure 6.5: Read text from a file

Now, any time we want to get cleaned-up text from an HTML file, we can just call get_text() with the name of the file as its only argument. It will return a string, and we can assign this to a variable, e.g.: contents = get_text("test.html"). Each time we want to use this series of steps we only have to call the function.
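To try the function without an existing HTML file, one possibility is to write a small file to a temporary location first. In the sketch below the file contents are invented, the function is repeated so that the example is self-contained, and it opens the file in a with statement so that the file is closed promptly.

```python
import os
import re
import tempfile

def get_text(file):
    """Read text from a file, normalizing whitespace
    and stripping HTML markup."""
    with open(file) as f:
        text = f.read()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    return text

# Write a small invented HTML file to a temporary location, then read it back.
handle, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(handle, 'w') as f:
    f.write("<html>\n<body>\n<p>I am the\n   <b>Walrus</b></p>\n</body>\n</html>\n")
contents = get_text(path)
os.remove(path)
print(contents.split())
```

Because the markup is replaced by spaces after the whitespace has been normalized, the returned string can still contain runs of spaces; splitting it into words, as here, hides them.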

Notice that a function definition consists of the keyword def (short for "define"), followed by the function name, followed by a sequence of parameters enclosed in parentheses, then a colon. The following lines contain an indented block of code, the function body.

Using functions has the benefit of saving space in our program. More importantly, our choice of name for the function helps make the program readable. In the case of the above example, whenever our program needs to read cleaned-up text from a file we don't have to clutter the program with four lines of code, we simply need to call get_text(). This naming helps to provide some "semantic interpretation" — it helps a reader of our program to see what the program "means".

Notice that the above function definition contains a string. The first string inside a function definition is called a docstring. Not only does it document the purpose of the function to someone reading the code, it is accessible to a programmer who has loaded the code from a file:

>>> help(get_text)
Help on function get_text:
Read text from a file, normalizing whitespace and stripping HTML markup.

We have seen that functions help to make our work reusable and readable. They also help make it reliable. When we re-use code that has already been developed and tested, we can be more confident that it handles a variety of cases correctly. We also remove the risk that we forget some important step, or introduce a bug. The program that calls our function also has increased reliability. The author of that program is dealing with a shorter program, and its components behave transparently.

  • [More: overview of section]

Function Arguments

  • multiple arguments
  • named arguments
  • default values
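A brief sketch of these three points, using a hypothetical function repeat() that is defined here only to illustrate argument passing:

```python
def repeat(word, times=2, sep=' '):
    """Return word repeated `times` times, separated by sep."""
    return sep.join([word] * times)

print(repeat('very'))                    # default values for times and sep
print(repeat('very', 3))                 # multiple positional arguments
print(repeat('very', sep='-'))           # named argument, skipping times
print(repeat(times=3, word='very'))      # named arguments in any order
```

Parameters with default values must come after those without, and a named argument lets the caller override one default while leaving the others alone.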

Python is a dynamically typed language. It does not force us to declare the type of a variable when we write a program. This feature is often useful, as it permits us to define functions that are flexible about the type of their arguments. For example, a tagger might expect a sequence of words, but it wouldn't care whether this sequence is expressed as a list, a tuple, or an iterator.

However, often we want to write programs for later use by others, and want to program in a defensive style, providing useful warnings when functions have not been invoked correctly. Observe that the tag() function in Figure 6.6 behaves sensibly for string arguments, but that it does not complain when it is passed a dictionary.

def tag(word):
    if word in ['a', 'the', 'all']:
        return 'DT'
    else:
        return 'NN'
>>> tag('the')
'DT'
>>> tag('dog')
'NN'
>>> tag({'lexeme':'turned', 'pos':'VBD', 'pron':['t3:nd', 't3`nd']})
'NN'

Figure 6.6: A tagger that tags anything

It would be helpful if the author of this function took some extra steps to ensure that the word parameter of the tag() function is a string. A naive approach would be to check the type of the argument and return a diagnostic value, such as Python's special empty value, None, as shown in Figure 6.7.

def tag(word):
    if not type(word) is str:
        return None
    if word in ['a', 'the', 'all']:
        return 'DT'
    else:
        return 'NN'

Figure 6.7: A tagger that only tags strings

However, this approach is dangerous because the calling program may not detect the error, and the diagnostic return value may be propagated to later parts of the program with unpredictable consequences. A better solution is shown in Figure 6.8.

def tag(word):
    if not type(word) is str:
        raise ValueError("argument to tag() must be a string")
    if word in ['a', 'the', 'all']:
        return 'DT'
    else:
        return 'NN'

Figure 6.8: A tagger that generates an error message when not passed a string

This produces an error that cannot be ignored, since it halts program execution. Additionally, the error message is easy to interpret. (We will see an even better approach, known as "duck typing" in Section [XREF].)
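Although an uncaught error halts the program, a caller that anticipates the problem can still handle it deliberately with try and except. The sketch below repeats the function so that it is self-contained; it uses isinstance() for the check, which is a choice made for this sketch and also accepts subclasses of str.

```python
def tag(word):
    if not isinstance(word, str):
        raise ValueError("argument to tag() must be a string")
    if word in ['a', 'the', 'all']:
        return 'DT'
    else:
        return 'NN'

try:
    tag({'lexeme': 'turned', 'pos': 'VBD'})
    message = None
except ValueError as e:
    message = str(e)      # the caller decides what to do with the error
print(message)
```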

Another aspect of defensive programming concerns the return statement of a function. In order to be confident that all execution paths through a function lead to a return statement, it is best to have a single return statement at the end of the function definition. This approach has a further benefit: it makes it more likely that the function will only return a single type. Thus, the following version of our tag() function is safer:

>>> def tag(word):
...     result = 'NN'                       # default value, a string
...     if word in ['a', 'the', 'all']:     # in certain cases...
...         result = 'DT'                   #   overwrite the value
...     return result                       # all paths end here

A return statement can be used to pass multiple values back to the calling program, by packing them into a tuple. Here we define a function that returns a tuple consisting of the average word length of a sentence, and the inventory of letters used in the sentence. It would have been clearer to write two separate functions.

>>> def proc_words(words):
...     avg_wordlen = sum(len(word) for word in words)/len(words)
...     chars_used = ''.join(sorted(set(''.join(words))))
...     return avg_wordlen, chars_used
>>> proc_words(['Not', 'a', 'good', 'way', 'to', 'write', 'functions'])
(3, 'Nacdefginorstuwy')

Functions do not need to have a return statement at all. Some functions do their work as a side effect, printing a result, modifying a file, or updating the contents of a parameter to the function. Consider the following three sort functions; the last approach is dangerous because a programmer could use it without realizing that it had modified its input.

>>> def my_sort1(l):      # good: modifies its argument, no return value
...     l.sort()
>>> def my_sort2(l):      # good: doesn't touch its argument, returns value
...     return sorted(l)
>>> def my_sort3(l):      # bad: modifies its argument and also returns it
...     l.sort()
...     return l
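To see why the third version is risky, here is a short sketch (Python 3; the variable names are illustrative). The caller might assume that the returned list is a fresh copy, when it is in fact the very same object as the input.

```python
def my_sort3(l):
    l.sort()
    return l

words = ['pig', 'ant', 'dog']
ordered = my_sort3(words)

# Both names refer to one list, so the caller's original order is lost.
same_object = ordered is words
```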

An Important Subtlety

Back in Section 6.1 you saw that in Python, assignment works on values, but that the value of a structured object is a reference to that object. The same is true for functions. Python interprets function parameters as values (this is known as call-by-value). Consider Figure 6.9. Function set_up() has two parameters, both of which are modified inside the function. We begin by assigning an empty string to w and an empty dictionary to p. After calling the function, w is unchanged, while p is changed:

def set_up(word, properties):
    word = 'cat'
    properties['pos'] = 'noun'
>>> w = ''
>>> p = {}
>>> set_up(w, p)
>>> w
''
>>> p
{'pos': 'noun'}

Figure 6.9

To understand why w was not changed, it is necessary to understand call-by-value. When we called set_up(w, p), the value of w (an empty string) was assigned to a new variable word. Inside the function, the value of word was modified. However, that had no effect on the external value of w. This parameter passing is identical to the following sequence of assignments:

>>> w = ''
>>> word = w
>>> word = 'cat'
>>> w
''

In the case of the structured object, matters are quite different. When we called set_up(w, p), the value of p (an empty dictionary) was assigned to a new local variable properties. Since the value of p is an object reference, both variables now reference the same memory location. Modifying something inside properties will also change p, just as if we had done the following sequence of assignments:

>>> p = {}
>>> properties = p
>>> properties['pos'] = 'noun'
>>> p
{'pos': 'noun'}

Thus, to understand Python's call-by-value parameter passing, it is enough to understand Python's assignment operation. We will address some closely related issues in our discussion of variable scope later in this section.
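The distinction between rebinding a parameter and modifying the object it refers to can be sketched as follows (Python 3). The function set_up_rebind is a hypothetical variant added for contrast: it assigns a new dictionary to its parameter instead of modifying the one it was given, so the caller sees no change.

```python
def set_up(word, properties):
    word = 'cat'                    # rebinds the local name only
    properties['pos'] = 'noun'      # modifies the object shared with the caller

def set_up_rebind(word, properties):
    properties = {'pos': 'noun'}    # rebinds the local name; caller's dict untouched

w, p = '', {}
set_up(w, p)

q = {}
set_up_rebind('', q)
```

After these calls, w is still the empty string, p contains the new entry, and q is still empty.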

Functional Decomposition

Well-structured programs usually make extensive use of functions. When a block of program code grows longer than 10-20 lines, it is a great help to readability if the code is broken up into one or more functions, each one having a clear purpose. This is analogous to the way a good essay is divided into paragraphs, each expressing one main idea.

Functions provide an important kind of abstraction. They allow us to group multiple actions into a single, complex action, and associate a name with it. (Compare this with the way we combine the actions of go and bring back into a single more complex action fetch.) When we use functions, the main program can be written at a higher level of abstraction, making its structure transparent, e.g.

>>> data = load_corpus()
>>> results = analyze(data)
>>> present(results)

Appropriate use of functions makes programs more readable and maintainable. Additionally, it becomes possible to reimplement a function — replacing the function's body with more efficient code — without having to be concerned with the rest of the program.

Consider the freq_words function in Figure 6.10. It updates the contents of a frequency distribution that is passed in as a parameter, and it also prints a list of the n most frequent words.

def freq_words(url, freqdist, n):
    text = nltk.clean_url(url)
    for word in nltk.wordpunct_tokenize(text):
        freqdist.inc(word.lower())
    print freqdist.keys()[:n]
>>> constitution = ""
>>> fd = nltk.FreqDist()
>>> freq_words(constitution, fd, 20)
['the', 'of', 'charters', 'bill', 'constitution', 'rights', ',',
'declaration', 'impact', 'freedom', '-', 'making', 'independence']

Figure 6.10

This function has a number of problems. It has two side-effects: it modifies the contents of its second parameter, and it prints a selection of the results it has computed. The function would be easier to understand and to reuse elsewhere if we initialized the FreqDist() object inside the function (in the same place it is populated), and if we moved the selection and display of results to the calling program. In Figure 6.11 we refactor this function, and simplify its interface by providing a single url parameter.

def freq_words(url):
    freqdist = nltk.FreqDist()
    text = nltk.clean_url(url)
    for word in nltk.wordpunct_tokenize(text):
        freqdist.inc(word.lower())
    return freqdist
>>> fd = freq_words(constitution)
>>> print fd.keys()[:20]
['the', 'of', 'charters', 'bill', 'constitution', 'rights', ',',
'declaration', 'impact', 'freedom', '-', 'making', 'independence']

Figure 6.11

Note that we have now simplified the work of freq_words to the point that we can do its work with three lines of code:

>>> words = nltk.wordpunct_tokenize(nltk.clean_url(constitution))
>>> fd = nltk.FreqDist(word.lower() for word in words)
>>> fd.keys()[:20]
['the', 'of', 'charters', 'bill', 'constitution', 'rights', ',',
'declaration', 'impact', 'freedom', '-', 'making', 'independence']


Variable scope

  • local and global variables
  • scope rules
  • global variables introduce dependency on context and limit the reusability of a function
  • importance of avoiding side-effects
  • functions hide implementation details
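The points above can be illustrated with a small sketch (Python 3; all names here are hypothetical). The first function depends on a global variable, so its behavior changes with the context in which it is called; the second receives everything it needs as parameters; the third shows that assigning to a name inside a function creates a local variable and leaves the global untouched.

```python
threshold = 3                       # a global variable

def long_words_global(words):
    # Depends on the global threshold, so it is harder to reuse elsewhere
    return [w for w in words if len(w) > threshold]

def long_words(words, threshold):
    # Self-contained: the parameter hides the global of the same name
    return [w for w in words if len(w) > threshold]

def shadow():
    threshold = 10                  # a new local variable; the global is unchanged
    return threshold

sample = ['the', 'sense', 'of', 'sounds']
```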

Functions as Arguments

So far the arguments we have passed into functions have been simple objects like strings, or structured objects like lists. These arguments allow us to parameterize the behavior of a function. As a result, functions are very flexible and powerful abstractions, permitting us to repeatedly apply the same operation on different data. Python also lets us pass a function as an argument to another function. Now we can abstract out the operation, and apply a different operation on the same data. As the following examples show, we can pass the built-in function len() or a user-defined function last_letter() as parameters to another function:

>>> sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
...         'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
>>> def extract_property(prop):
...     return [prop(word) for word in sent]
>>> extract_property(len)
[4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
>>> def last_letter(word):
...     return word[-1]
>>> extract_property(last_letter)
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

Surprisingly, len and last_letter are objects that can be passed around like lists and dictionaries. Notice that parentheses are only used after a function name if we are invoking the function; when we are simply passing the function around as an object these are not used.

Python provides us with one more way to define functions as arguments to other functions, so-called lambda expressions. Suppose there was no need to use the above last_letter() function in multiple places, and thus no need to give it a name. We can equivalently write the following:

>>> extract_property(lambda w: w[-1])
['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']

Our next example illustrates passing a function to the sorted() function. When we call the latter with a single argument (the list to be sorted), it uses the built-in lexicographic comparison function cmp(). However, we can supply our own sort function, e.g. to sort by decreasing length.

>>> sorted(sent)
[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds',
'take', 'the', 'the', 'themselves', 'will']
>>> sorted(sent, cmp)
[',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds',
'take', 'the', 'the', 'themselves', 'will']
>>> sorted(sent, lambda x, y: cmp(len(y), len(x)))
['themselves', 'sounds', 'sense', 'Take', 'care', 'will', 'take', 'care',
'the', 'and', 'the', 'of', 'of', ',', '.']
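Readers using Python 3 will find that cmp() and the comparison-function argument of sorted() are no longer available. A possible equivalent, sketched below, uses the key argument of sorted(), either directly with len or by wrapping a comparison function with functools.cmp_to_key(). The function name by_decreasing_length is hypothetical.

```python
from functools import cmp_to_key

sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
        'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']

def by_decreasing_length(x, y):
    # Negative if x should come first, positive if y should, zero if tied
    return len(y) - len(x)

by_cmp = sorted(sent, key=cmp_to_key(by_decreasing_length))
by_key = sorted(sent, key=len, reverse=True)
```

Both calls rely on the fact that Python's sort is stable, so words of equal length keep their original relative order.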

Higher-Order Functions

In 6.1 we saw an example of filtering out some items in a list comprehension, using an if test. Sometimes list comprehensions get cumbersome, since they can mention the same variable many times, e.g.: [word for word in sent if property(word)]. We can perform the same task more succinctly as follows:

>>> def is_lexical(word):
...     return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
>>> filter(is_lexical, sent)
['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']

The function is_lexical(word) returns True just in case word, when normalized to lowercase, is not in the given list. This function is itself used as an argument to filter(). The filter() function applies its first argument (a function) to each item of its second (a sequence), only passing it through if the function returns true for that item. Thus filter(f, seq) is equivalent to [item for item in seq if f(item)].

Another helpful function, which like filter() applies a function to a sequence, is map(). Here is a simple way to find the average length of a sentence in a section of the Brown Corpus:

>>> lengths = map(len, nltk.corpus.brown.sents(categories='news'))
>>> sum(lengths) / float(len(lengths))

Instead of len(), we could have passed in any other function we liked:

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> def is_vowel(letter):
...     return letter.lower() in "aeiou"
>>> def vowelcount(word):
...     return len(filter(is_vowel, word))
>>> map(vowelcount, sent)
[1, 1, 2, 1, 1, 3]

Instead of using filter() to call a named function is_vowel, we can define a lambda expression as follows:

>>> map(lambda w: len(filter(lambda c: c.lower() in "aeiou", w)), sent)
[1, 1, 2, 1, 1, 3]

We can check that all or any items meet some condition:

>>> all(len(w) > 4 for w in sent)
False
>>> any(len(w) > 4 for w in sent)
True

The higher order functions like map and filter are certainly useful, but in general it is better to stick to using list comprehensions since they are often more readable.
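As a sketch of this equivalence (Python 3, where map() and filter() return lazy iterators and so are wrapped in list() here), each higher-order call below is paired with a list comprehension that produces the same result.

```python
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']

def vowelcount(word):
    return len([c for c in word if c.lower() in 'aeiou'])

counts_map = list(map(vowelcount, sent))
counts_comp = [vowelcount(w) for w in sent]

long_filter = list(filter(lambda w: len(w) > 3, sent))
long_comp = [w for w in sent if len(w) > 3]
```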

Named Arguments

One of the difficulties in re-using functions is remembering the order of arguments. Consider the following function, which finds the num most frequent words that are at least min characters long:

>>> def freq_words(file, min, num):
...     text = open(file).read()
...     tokens = nltk.wordpunct_tokenize(text)
...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
...     return freqdist.keys()[:num]
>>> freq_words('ch01.rst', 4, 10)
['words', 'that', 'text', 'word', 'Python', 'with', 'this', 'have', 'language', 'from']

This function has three arguments. It follows the convention of listing the most basic and substantial argument first (the file). However, it might be hard to remember the order of the second and third arguments on subsequent use. We can make this function more readable by using keyword arguments. These appear in the function's argument list with an equals sign and a default value:

>>> def freq_words(file, min=1, num=10):
...     text = open(file).read()
...     tokens = nltk.wordpunct_tokenize(text)
...     freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
...     return freqdist.keys()[:num]

Now there are several equivalent ways to call this function: freq_words('ch01.rst', 4, 10), freq_words('ch01.rst', min=4, num=10), freq_words('ch01.rst', num=10, min=4).

When we use an integrated development environment such as IDLE, simply typing the name of a function at the command prompt will list the arguments. Using named arguments helps someone to re-use the code...

A side-effect of having named arguments is that they permit optionality. Thus we can leave out any arguments where we are happy with the default value: freq_words('ch01.rst', min=4), freq_words('ch01.rst', 4).
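The calling conventions described above can be tried out without any files or NLTK. The following sketch (Python 3) uses a hypothetical variant of freq_words() that takes a list of tokens and counts them with collections.Counter; the parameter name min_len is also an assumption, chosen to avoid hiding the built-in min().

```python
from collections import Counter

def freq_words(tokens, min_len=1, num=10):
    # Hypothetical variant: takes tokens rather than a file name
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return [word for word, _ in counts.most_common(num)]

tokens = 'the cat sat on the mat and the cat slept'.split()

a = freq_words(tokens, 3, 2)                # positional arguments
b = freq_words(tokens, min_len=3, num=2)    # named arguments
c = freq_words(tokens, num=2, min_len=3)    # named, in a different order
d = freq_words(tokens, num=1)               # min_len takes its default value
```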

Another common use of optional arguments is to permit a flag, e.g.:

>>> def freq_words(file, min=1, num=10, trace=False):
...     freqdist = nltk.FreqDist()
...     if trace: print "Opening", file
...     text = open(file).read()
...     if trace: print "Read in %d characters" % len(text)
...     for word in nltk.wordpunct_tokenize(text):
...         if len(word) >= min:
...             freqdist.inc(word)
...             if trace and freqdist.N() % 100 == 0: print "."
...     if trace: print
...     return freqdist.keys()[:num]

6.3   Iterators

[itertools, bigrams vs ibigrams, efficiency, ...]
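As a rough illustration of the contrast between a list-building function and an iterator, here is a sketch of a lazy bigram function (Python 3). The name ibigrams follows the note above, but this is an assumed implementation for illustration, not NLTK's own code. It produces each pair only when asked, so itertools.islice() can take the first few pairs without processing the whole input.

```python
from itertools import islice

def ibigrams(tokens):
    # Sketch of a lazy bigram generator: yields one pair at a time
    it = iter(tokens)
    prev = next(it, None)
    for tok in it:
        yield (prev, tok)
        prev = tok

words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
first_two = list(islice(ibigrams(words), 2))
all_pairs = list(ibigrams(words))
```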

Accumulative Functions

These functions start by initializing some storage, and iterate over input to build it up, before returning some final object (a large structure or aggregated result). The standard way to do this is to initialize an empty list, accumulate the material, then return the list, as shown in function find_nouns1() in Figure 6.12.

def find_nouns1(tagged_text):
    nouns = []
    for word, tag in tagged_text:
        if tag[:2] == 'NN':
            nouns.append(word)
    return nouns
>>> tagged_text = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
...                ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
>>> find_nouns1(tagged_text)
['cat', 'mat']

Figure 6.12: Accumulating Output into a List

A superior way to perform this operation is to define the function to be a generator, as shown in Figure 6.13. The first time this function is called, it gets as far as the yield statement and stops. The calling program gets the first word and does any necessary processing. Once the calling program is ready for another word, execution of the function is continued from where it stopped, until the next time it encounters a yield statement. This approach is typically more efficient, as the function only generates the data as it is required by the calling program, and does not need to allocate additional memory to store the output.

def find_nouns2(tagged_text):
    for word, tag in tagged_text:
        if tag[:2] == 'NN':
            yield word
>>> tagged_text = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
...                ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
>>> find_nouns2(tagged_text)
<generator object at 0x14b2f30>
>>> for noun in find_nouns2(tagged_text):
...     print noun,
cat mat
>>> list(find_nouns2(tagged_text))
['cat', 'mat']

Figure 6.13: Defining a Generator Function

If we call the function directly we see that it returns a "generator object", which is not very useful to us. Instead, we can iterate over it directly, using for noun in find_nouns2(tagged_text), or convert it into a list, using list(find_nouns2(tagged_text)).
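The stop-and-resume behavior of a generator can be observed by stepping through it by hand with the built-in next() function. This is a sketch in Python 3; the variable names are illustrative.

```python
def find_nouns2(tagged_text):
    for word, tag in tagged_text:
        if tag[:2] == 'NN':
            yield word

tagged_text = [('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD'),
               ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

gen = find_nouns2(tagged_text)
first = next(gen)       # runs the function body up to the first yield
second = next(gen)      # resumes after the yield and runs to the next one
rest = list(gen)        # nothing is left: the generator is exhausted
```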

6.4   Algorithm Design Strategies

A major part of algorithmic problem solving is selecting or adapting an appropriate algorithm for the problem at hand. Whole books are written on this topic (e.g. [Levitin, 2004]) and we only have space to introduce some key concepts and elaborate on the approaches that are most prevalent in natural language processing.

The best-known strategy is divide-and-conquer. We attack a problem of size n by dividing it into two problems of size n/2, solve these problems, and combine their results into a solution of the original problem. Figure 6.14 illustrates this approach for sorting a list of words.


Figure 6.14: Sorting by Divide-and-Conquer (Mergesort)
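A minimal sketch of mergesort in Python 3 follows, assuming the usual formulation: split the list in half, sort each half recursively, then merge the two sorted halves. The function name merge_sort is our own.

```python
def merge_sort(words):
    # Base case: a list of zero or one items is already sorted
    if len(words) <= 1:
        return list(words)
    # Divide: sort each half independently
    mid = len(words) // 2
    left = merge_sort(words[:mid])
    right = merge_sort(words[mid:])
    # Combine: repeatedly take the smaller of the two front items
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

In practice the built-in sorted() should be preferred; the sketch is only meant to make the strategy concrete.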

Another strategy is decrease-and-conquer. In this approach, a small amount of work on a problem of size n permits us to reduce it to a problem of size n/2. Figure 6.15 illustrates this approach for the problem of finding the index of an item in a sorted list.
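The index-finding problem mentioned here is usually solved by binary search. The following sketch (Python 3) is one possible version, not necessarily the one intended for the figure; the name find_index is hypothetical. Each comparison with the middle item discards half of the remaining list. In keeping with the earlier advice on defensive programming, it raises an exception instead of returning a special value when the item is missing.

```python
def find_index(sorted_words, target):
    lo, hi = 0, len(sorted_words)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_words[mid] < target:
            lo = mid + 1        # target can only be in the upper half
        else:
            hi = mid            # target is at mid or in the lower half
    if lo < len(sorted_words) and sorted_words[lo] == target:
        return lo
    raise ValueError("%r is not in the list" % (target,))
```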

A third well-known strategy is transform-and-conquer. We attack a problem by transforming it into an instance of a problem we already know how to solve. For example, in order to detect duplicate entries in a list, we can pre-sort the list, then look for adjacent identical items, as shown in Figure 6.16.

def duplicates(words):
    prev = None
    dup = [None]
    for word in sorted(words):
        if word == prev and word != dup[-1]:
            dup.append(word)
        else:
            prev = word
    return dup[1:]
>>> duplicates(['cat', 'dog', 'cat', 'pig', 'dog', 'cat', 'ant', 'cat'])
['cat', 'dog']

Figure 6.16: Presorting a list for duplicate detection

Recursion (notes)

We first saw recursion in Chapter 3, in a function that navigated the hypernym hierarchy of WordNet...

Iterative solution:

>>> def factorial(n):
...     result = 1
...     for i in range(n):
...         result *= (i+1)
...     return result

Recursive solution (base case, induction step)

>>> def factorial(n):
...     if n <= 1:
...         return 1
...     else:
...         return n * factorial(n-1)

[Simple example of recursion on strings.]
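One possible example of recursion on strings, offered only as a sketch (Python 3), is reversing a string. The base case is a string of length zero or one, which is its own reverse; the recursive step reverses everything after the first character and then appends that character.

```python
def reverse(s):
    if len(s) <= 1:                     # base case
        return s
    return reverse(s[1:]) + s[0]        # recursive step
```

The slice s[::-1] gives the same result far more efficiently; the recursive version is shown only to illustrate the pattern.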

Generating all permutations of words, to check which ones are grammatical:

>>> def perms(seq):
...     if len(seq) <= 1:
...         yield seq
...     else:
...         for perm in perms(seq[1:]):
...             for i in range(len(perm)+1):
...                 yield perm[:i] + seq[0:1] + perm[i:]
>>> list(perms(['police', 'fish', 'cream']))
[['police', 'fish', 'cream'], ['fish', 'police', 'cream'],
 ['fish', 'cream', 'police'], ['police', 'cream', 'fish'],
 ['cream', 'police', 'fish'], ['cream', 'fish', 'police']]

Deeply Nested Objects (notes)

We can use recursive functions to build deeply-nested objects. Building a letter trie, Figure 6.17.

def insert(trie, key, value):
    if key:
        first, rest = key[0], key[1:]
        if first not in trie:
            trie[first] = {}
        insert(trie[first], rest, value)
    else:
        trie['value'] = value
>>> trie = {}
>>> insert(trie, 'chat', 'cat')
>>> insert(trie, 'chien', 'dog')
>>> trie['c']['h']
{'a': {'t': {'value': 'cat'}}, 'i': {'e': {'n': {'value': 'dog'}}}}
>>> trie['c']['h']['a']['t']['value']
'cat'
>>> import pprint
>>> pprint.pprint(trie)
{'c': {'h': {'a': {'t': {'value': 'cat'}},
             'i': {'e': {'n': {'value': 'dog'}}}}}}

Figure 6.17: Building a Letter Trie


A tree is a set of connected nodes, each of which is labeled with a category. It is common to use a 'family' metaphor to talk about the relationships of nodes in a tree: for example, s is the parent of vp; conversely vp is a daughter (or child) of s. Also, since np and vp are both daughters of s, they are also sisters. Here is an example of a tree:


Although it is helpful to represent trees in a graphical format, for computational purposes we usually need a more text-oriented representation. We will use the same format as the Penn Treebank, a combination of brackets and labels:

(S
   (NP Lee)
   (VP
      (V saw)
      (NP
         (Det the)
         (N dog))))

Here, the node value is a constituent type (e.g., np or vp), and the children encode the hierarchical contents of the tree.

Although we will focus on syntactic trees, trees can be used to encode any homogeneous hierarchical structure that spans a sequence of linguistic forms (e.g. morphological structure, discourse structure). In the general case, leaves and node values do not have to be strings.

In NLTK, trees are created with the Tree constructor, which takes a node value and a list of zero or more children. Here's a couple of simple trees:

>>> tree1 = nltk.Tree('NP', ['John'])
>>> print tree1
(NP John)
>>> tree2 = nltk.Tree('NP', ['the', 'man'])
>>> print tree2
(NP the man)

We can incorporate these into successively larger trees as follows:

>>> tree3 = nltk.Tree('VP', ['saw', tree2])
>>> tree4 = nltk.Tree('S', [tree1, tree3])
>>> print tree4
(S (NP John) (VP saw (NP the man)))

Here are some of the methods available for tree objects:

>>> print tree4[1]
(VP saw (NP the man))
>>> tree4[1].node
'VP'
>>> tree4.leaves()
['John', 'saw', 'the', 'man']
>>> tree4[1,1,1]
'man'

The printed representation for complex trees can be difficult to read. In these cases, the draw method can be very useful. It opens a new window, containing a graphical representation of the tree. The tree display window allows you to zoom in and out; to collapse and expand subtrees; and to print the graphical representation to a postscript file (for inclusion in a document).

>>> tree3.draw()                           

[To do: recursion on trees]
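Pending that material, the general pattern of recursion on trees can be sketched without NLTK (Python 3). Here a tree is assumed to be a tuple whose first item is the node label and whose remaining items are children; a leaf is a plain string. This representation, and the function names, are assumptions made for the illustration. Each function handles the leaf as its base case and otherwise combines the results from the children.

```python
def leaves(tree):
    if isinstance(tree, str):           # base case: a leaf
        return [tree]
    result = []
    for child in tree[1:]:              # tree[0] is the node label
        result.extend(leaves(child))
    return result

def height(tree):
    if isinstance(tree, str):           # a leaf has height 1
        return 1
    return 1 + max((height(child) for child in tree[1:]), default=0)

tree4 = ('S', ('NP', 'John'), ('VP', 'saw', ('NP', 'the', 'man')))
```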

Dynamic Programming

Dynamic programming is a general technique for designing algorithms which is widely used in natural language processing. The term 'programming' is used in a different sense to what you might expect, to mean planning or scheduling. Dynamic programming is used when a problem contains overlapping sub-problems. Instead of computing solutions to these sub-problems repeatedly, we simply store them in a lookup table. In the remainder of this section we will introduce dynamic programming, but in a rather different context to syntactic parsing.

Pingala was an Indian author who lived around the 5th century B.C., and wrote a treatise on Sanskrit prosody called the Chandas Shastra. Virahanka extended this work around the 6th century A.D., studying the number of ways of combining short and long syllables to create a meter of length n. He found, for example, that there are five ways to construct a meter of length 4: V4 = {LL, SSL, SLS, LSS, SSSS}. Observe that we can split V4 into two subsets, those starting with L and those starting with S, as shown in (2).

(2) V4 =
      LL, LSS
        i.e. L prefixed to each item of V2 = {L, SS}
      SSL, SLS, SSSS
        i.e. S prefixed to each item of V3 = {SL, LS, SSS}

def virahanka1(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka1(n-1)]
        l = ["L" + prosody for prosody in virahanka1(n-2)]
        return s + l

def virahanka2(n):
    lookup = [[""], ["S"]]
    for i in range(n-1):
        s = ["S" + prosody for prosody in lookup[i+1]]
        l = ["L" + prosody for prosody in lookup[i]]
        lookup.append(s + l)
    return lookup[n]

def virahanka3(n, lookup={0:[""], 1:["S"]}):
    if n not in lookup:
        s = ["S" + prosody for prosody in virahanka3(n-1)]
        l = ["L" + prosody for prosody in virahanka3(n-2)]
        lookup[n] = s + l
    return lookup[n]

from nltk import memoize
@memoize
def virahanka4(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    else:
        s = ["S" + prosody for prosody in virahanka4(n-1)]
        l = ["L" + prosody for prosody in virahanka4(n-2)]
        return s + l
>>> virahanka1(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka2(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka3(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
>>> virahanka4(4)
['SSSS', 'SSL', 'SLS', 'LSS', 'LL']

Figure 6.18: Four Ways to Compute Sanskrit Meter

With this observation, we can write a little recursive function called virahanka1() to compute these meters, shown in Figure 6.18. Notice that, in order to compute V4 we first compute V3 and V2. But to compute V3, we need to first compute V2 and V1. This call structure is depicted in (3).


As you can see, V2 is computed twice. This might not seem like a significant problem, but it turns out to be rather wasteful as n gets large: to compute V20 using this recursive technique, we would compute V2 4,181 times; and for V40 we would compute V2 63,245,986 times! A much better alternative is to store the value of V2 in a table and look it up whenever we need it. The same goes for other values, such as V3 and so on. Function virahanka2() implements a dynamic programming approach to the problem. It works by filling up a table (called lookup) with solutions to all smaller instances of the problem, stopping as soon as we reach the value we're interested in. At this point we read off the value and return it. Crucially, each sub-problem is only ever solved once.
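The figures quoted above can be checked with a short calculation (Python 3). The number of times V2 is computed while computing Vn equals the number of times it is computed for Vn-1 plus the number for Vn-2, since those are the two recursive calls made. The function name v2_calls is hypothetical, and the counts are built up in a table so that the check itself stays fast.

```python
def v2_calls(n):
    # calls[i] = how many times V2 is computed by the naive recursion for Vi
    calls = {0: 0, 1: 0, 2: 1}
    for i in range(3, n + 1):
        calls[i] = calls[i - 1] + calls[i - 2]
    return calls[n]
```

For example, v2_calls(4) is 2, matching the call structure described above, and v2_calls(20) is 4181.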

Notice that the approach taken in virahanka2() is to solve smaller problems on the way to solving larger problems. Accordingly, this is known as the bottom-up approach to dynamic programming. Unfortunately it turns out to be quite wasteful for some applications, since it may compute solutions to sub-problems that are never required for solving the main problem. This wasted computation can be avoided using the top-down approach to dynamic programming, which is illustrated in the function virahanka3() in Figure 6.18. Unlike the bottom-up approach, this approach is recursive. It avoids the huge wastage of virahanka1() by checking whether it has previously stored the result. If not, it computes the result recursively and stores it in the table. The last step is to return the stored result. The final method is to use a Python decorator called memoize, which takes care of the housekeeping work done by virahanka3() without cluttering up the program.
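Readers without access to NLTK's memoize decorator can get a similar effect in Python 3 from the standard library's functools.lru_cache. The sketch below is an assumed equivalent, not the NLTK implementation; the name virahanka5 is hypothetical, and the results are returned as tuples so that a cached value cannot be modified by accident.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def virahanka5(n):
    if n == 0:
        return ("",)
    elif n == 1:
        return ("S",)
    else:
        s = tuple("S" + prosody for prosody in virahanka5(n - 1))
        l = tuple("L" + prosody for prosody in virahanka5(n - 2))
        return s + l

meters = virahanka5(4)
```

Calling virahanka5.cache_info() afterwards shows how many results were computed and how many were served from the cache.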

This concludes our brief introduction to dynamic programming. We will encounter it again in Chapter 9.

Timing (notes)

We can easily test the efficiency gains made by the use of dynamic programming, or any other putative performance enhancement, using the timeit module:

>>> from timeit import Timer
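A possible way to use Timer is sketched below (Python 3). To keep the example short it times two hypothetical functions that only count the meters of length n, one by plain recursion and one by filling a table, instead of the functions in Figure 6.18. A Timer is given a callable, and its timeit() method runs it the requested number of times and returns the elapsed time in seconds. Actual timings depend on the machine, so none are shown.

```python
from timeit import Timer

def count_naive(n):
    # Number of meters of length n, by plain recursion
    if n < 2:
        return 1
    return count_naive(n - 1) + count_naive(n - 2)

def count_table(n):
    # The same number, computed bottom-up with a lookup table
    lookup = [1, 1]
    for i in range(n - 1):
        lookup.append(lookup[i + 1] + lookup[i])
    return lookup[n]

t_naive = Timer(lambda: count_naive(15)).timeit(number=20)
t_table = Timer(lambda: count_table(15)).timeit(number=20)
```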


6.5   Visualizing Language Data (DRAFT)

Python has some libraries that are useful for visualizing language data. In this section we will explore two of these, PyLab and NetworkX. The PyLab package supports sophisticated plotting functions with a MATLAB-style interface. The NetworkX package is for displaying network diagrams.


So far we have focused on textual presentation and the use of formatted print statements to get output lined up in columns. It is often very useful to display numerical data in graphical form, since this often makes it easier to detect patterns. For example, in Figure 3.4 we saw a table of numbers showing the frequency of particular modal verbs in the Brown Corpus, classified by genre. The program in Figure 6.19 presents the same information in graphical format. The output is shown in Figure 6.20 (a color figure in the online version).

colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black
def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    import pylab
    ind = pylab.arange(len(words))
    width = 1.0 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pylab.bar(ind+c*width, counts[categories[c]], width, color=colors[c % len(colors)])
        bar_groups.append(bars)
    pylab.xticks(ind+width, words)
    pylab.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pylab.title('Frequency of Six Modal Verbs by Genre')
    pylab.show()
>>> genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfdist = nltk.ConditionalFreqDist((g,w)
...                                   for g in genres
...                                   for w in nltk.corpus.brown.words(categories=g)
...                                   if w in modals)
>>> counts = {}
>>> for genre in genres:
...     counts[genre] = [cfdist[genre][word] for word in modals]
>>> bar_chart(genres, modals, counts)

Figure 6.19: Frequency of Modals in Different Sections of the Brown Corpus

From the bar chart it is immediately obvious that may and must have almost identical relative frequencies. The same goes for could and might.

Using PyLab on the Web

We can generate data visualizations on the fly, based on user input via the web... To do this we have to specify the Agg backend for matplotlib before importing pylab, as follows:

>>> import matplotlib
>>> matplotlib.use('Agg')
>>> import pylab

Next, we use all the same PyLab methods as before, but instead of displaying the result on a graphical terminal using pylab.show(), we save it to a file using pylab.savefig(). We specify the filename (and, optionally, the dpi), then print HTML markup that directs the web browser to load the file.

>>> pylab.savefig('modals.png')
>>> print 'Content-Type: text/html'
>>> print
>>> print '<html><body>'
>>> print '<img src="modals.png"/>'
>>> print '</body></html>'

Network Diagrams

[Section on networkx and displaying network diagrams; example with WordNet visualization]

6.6   Object-Oriented Programming in Python

Object-Oriented Programming is a programming paradigm in which complex structures and processes are decomposed into classes, each encapsulating a single data type and the legal operations on that type. In this section we show you how to create simple data classes and processing classes by example. For a systematic introduction to Object-Oriented design, please see the Further Reading section at the end of this chapter.

Data Classes: Trees in NLTK

An important data type in language processing is the syntactic tree. Here we will review the parts of the NLTK code that define the Tree class.

The first line of a class definition is the class keyword followed by the class name, in this case Tree. This class is derived from Python's built-in list class, permitting us to use standard list operations to access the children of a tree node.

>>> class Tree(list):

Next we define the initializer __init__(); Python knows to call this function when you ask for a new tree object by writing t = Tree(node, children). The constructor's first argument is special, and is standardly called self, giving us a way to refer to the current object from within its definition. This particular constructor calls the list initializer (similar to calling self = list(children)), then defines the node property of a tree.

...     def __init__(self, node, children):
...         list.__init__(self, children)
...         self.node = node

Next we define another special function that Python knows to call when we index a Tree. The first case is the simplest: when the index is an integer, e.g. t[2], we just ask for the list item in the obvious way. The other cases handle an index that is a sequence of integers, like t[1,1,1], which is followed down through the tree one child at a time.

...     def __getitem__(self, index):
...         if isinstance(index, int):
...             return list.__getitem__(self, index)
...         else:
...             if len(index) == 0:
...                 return self
...             elif len(index) == 1:
...                 return self[int(index[0])]
...             else:
...                 return self[int(index[0])][index[1:]]

This method was for accessing a child node. Similar methods are provided for setting and deleting a child (using __setitem__() and __delitem__()).

Two other special member functions are __repr__() and __str__(). The __repr__() function produces a string representation of the object, one that can be executed to re-create the object, and is accessed from the interpreter simply by typing the name of the object and pressing 'enter'. The __str__() function produces a human-readable version of the object; here we call a pretty-printing function we have defined called pp().

...     def __repr__(self):
...         childstr = ' '.join([repr(c) for c in self])
...         return '(%s: %s)' % (self.node, childstr)
...     def __str__(self):
...         return self.pp()

Next we define some member functions that do other standard operations on trees. First, for accessing the leaves:

...     def leaves(self):
...         leaves = []
...         for child in self:
...             if isinstance(child, Tree):
...                 leaves.extend(child.leaves())
...             else:
...                 leaves.append(child)
...         return leaves

Next, for computing the height:

...     def height(self):
...         max_child_height = 0
...         for child in self:
...             if isinstance(child, Tree):
...                 max_child_height = max(max_child_height, child.height())
...             else:
...                 max_child_height = max(max_child_height, 1)
...         return 1 + max_child_height

And finally, for enumerating all the subtrees (optionally filtered):

...     def subtrees(self, filter=None):
...         if not filter or filter(self):
...             yield self
...         for child in self:
...             if isinstance(child, Tree):
...                 for subtree in child.subtrees(filter):
...                     yield subtree
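The following self-contained sketch exercises the three member functions above on a small tree. The class name MiniTree, its constructor, and the example sentence are hypothetical; the bodies of leaves(), height() and subtrees() are copied from the listings above. The code is written to run under Python 3.

```python
class MiniTree(list):
    """Hypothetical stand-in for Tree: a list of children plus a node label."""

    def __init__(self, node, children):
        list.__init__(self, children)
        self.node = node

    def leaves(self):
        leaves = []
        for child in self:
            if isinstance(child, MiniTree):
                leaves.extend(child.leaves())
            else:
                leaves.append(child)
        return leaves

    def height(self):
        max_child_height = 0
        for child in self:
            if isinstance(child, MiniTree):
                max_child_height = max(max_child_height, child.height())
            else:
                max_child_height = max(max_child_height, 1)
        return 1 + max_child_height

    def subtrees(self, filter=None):
        if not filter or filter(self):
            yield self
        for child in self:
            if isinstance(child, MiniTree):
                for subtree in child.subtrees(filter):
                    yield subtree


subj = MiniTree('NP', ['the', 'dog'])
obj = MiniTree('NP', ['the', 'cat'])
t = MiniTree('S', [subj, MiniTree('VP', ['chased', obj])])

words = t.leaves()                          # leaves in left-to-right order
depth = t.height()                          # a leaf counts as height 1
labels = [s.node for s in t.subtrees()]     # root first, then descendants
noun_phrases = list(t.subtrees(lambda s: s.node == 'NP'))
```

Note that the filter only decides which subtrees are yielded. The traversal still descends into subtrees that the filter rejects, which is why both noun phrases are found even though S and VP do not match.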

Processing Classes: N-gram Taggers in NLTK

This section will discuss the tag.ngram module.

6.8   Exercises

  1. ☼ Find out more about sequence objects using Python's help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example, x.__getitem__(y) is just a long-winded way of saying x[y].

  2. ☼ Identify three operations that can be performed on both tuples and lists. Identify three list operations that cannot be performed on tuples. Name a context where using a list instead of a tuple generates a Python error.

  3. ☼ Find out how to create a tuple consisting of a single item. There are at least two ways to do this.

  4. ☼ Create a list words = ['is', 'NLP', 'fun', '?']. Use a series of assignment statements (e.g. words[1] = words[2]) and a temporary variable tmp to transform this list into the list ['NLP', 'is', 'fun', '!']. Now do the same transformation using tuple assignment.

  5. ☼ Does the method for creating a sliding window of n-grams behave correctly for the two limiting cases: n = 1, and n = len(sent)?

  6. ☼ Create two dictionaries, d1 and d2, and add some entries to each. Now issue the command d1.update(d2). What did this do? What might it be useful for?

  7. ☼ We pointed out that when empty strings and empty lists occur in the condition part of an if clause, they evaluate to false. In this case, they are said to be occurring in a Boolean context. Experiment with different kinds of non-Boolean expressions in Boolean contexts, and see whether they evaluate as true or false.

  8. ◑ Create a list of words and store it in a variable sent1. Now assign sent2 = sent1. Modify one of the items in sent1 and verify that sent2 has changed.

    1. Now try the same exercise but instead assign sent2 = sent1[:]. Modify sent1 again and see what happens to sent2. Explain.
    2. Now define text1 to be a list of lists of strings (e.g. to represent a text consisting of multiple sentences). Now assign text2 = text1[:], and assign a new value to one of the words, e.g. text1[1][1] = 'Monty'. Check what this did to text2. Explain.
    3. Load Python's deepcopy() function (i.e. from copy import deepcopy), consult its documentation, and test that it makes a fresh copy of any object.
  9. ◑ Write code that starts with a string of words and results in a new string consisting of the same words, but where the first word swaps places with the second, and so on. For example, 'the cat sat on the mat' will be converted into 'cat the on sat mat the'.

  10. ◑ Initialize an m-by-n list of lists of empty strings (m rows, each of length n) using list multiplication, e.g. word_table = [[''] * n] * m. What happens when you set one of its values, e.g. word_table[1][2] = "hello"? Explain why this happens. Now write an expression using range() to construct a list of lists, and show that it does not have this problem.

  11. ◑ Write code to initialize a two-dimensional array of sets called word_vowels and process a list of words, adding each word to word_vowels[l][v] where l is the length of the word and v is the number of vowels it contains.

  12. ◑ Write a function novel10(text) that prints any word that appeared in the last 10% of a text that had not been encountered earlier.

  13. ◑ Write a program that takes a sentence expressed as a single string, splits it and counts up the words. Get it to print out each word and the word's frequency, one per line, in alphabetical order.

  14. ◑ Write code that builds a dictionary of dictionaries of sets.

  15. ◑ Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.

  16. ◑ Write code to convert text into hAck3r, where characters are mapped according to the following table:

    Table 6.2

  17. ◑ Read up on Gematria, a method for assigning numbers to words, and for mapping between words having the same number to discover the hidden meaning of texts (,

    1. Write a function gematria() that sums the numerical values of the letters of a word, according to the letter values in letter_vals:

      letter_vals = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':80, 'g':3, 'h':8,
                     'i':10, 'j':10, 'k':20, 'l':30, 'm':40, 'n':50, 'o':70, 'p':80,
                     'q':100, 'r':200, 's':300, 't':400, 'u':6, 'v':6, 'w':800,
                     'x':60, 'y':10, 'z':7}

    2. Process a corpus (e.g. nltk.corpus.state_u