Package classify

Classes and interfaces for labeling tokens with category labels (or class labels). Typically, labels are represented with strings (such as 'health' or 'sports'). Classifiers can be used to perform a wide range of classification tasks. For example, classifiers can be used...

Features

To decide which category label is appropriate for a given token, classifiers examine one or more 'features' of the token. These features are typically chosen by hand, and indicate which aspects of the token are relevant to the classification decision. For example, a document classifier might use a separate feature for each word, recording how often that word occurred in the document.
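As a minimal sketch of the word-count idea, the function below records how often each word occurs in a document stored as a list of words. The function name `word_count_features` and the `'count(...)'` naming scheme are illustrative assumptions, not part of NLTK.

```python
from collections import Counter

def word_count_features(document):
    """Map each distinct word in the document to its occurrence count."""
    counts = Counter(document)
    # One feature per distinct word; the value is how often it occurred.
    return {'count(%s)' % word: n for word, n in counts.items()}

features = word_count_features(['the', 'cat', 'saw', 'the', 'dog'])
```

Count-valued features like these keep more information than a plain presence flag, at the cost of a larger range of feature values for the classifier to model.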

Featuresets

The features describing a token are encoded using a featureset, which is a dictionary that maps from feature names to feature values. Feature names are unique strings that indicate what aspect of the token is encoded by the feature. Examples include 'prevword', for a feature whose value is the previous word; and 'contains-word(library)' for a feature that is true when a document contains the word 'library'. Feature values are typically booleans, numbers, or strings, depending on which feature they describe.
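For illustration, a featureset for a single token might look like the dictionary below. The specific values, and the `'length'` feature, are made up for this sketch; only `'prevword'` and `'contains-word(library)'` come from the description above.

```python
# A hypothetical featureset: feature names (unique strings) map to values.
featureset = {
    'prevword': 'the',               # string-valued feature
    'contains-word(library)': True,  # boolean feature
    'length': 7,                     # numeric feature (illustrative name)
}

# Print the features in name order.
for name, value in sorted(featureset.items()):
    print('%-24s %r' % (name, value))
```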

Featuresets are typically constructed using a feature detector (also known as a feature extractor). A feature detector is a function that takes a token (and sometimes information about its context) as its input, and returns a featureset describing that token. For example, the following feature detector converts a document (stored as a list of words) to a featureset describing the set of words included in the document:

>>> # Define a feature detector function.
>>> def document_features(document):
...     return dict([('contains-word(%s)' % w, True) for w in document])

Feature detectors are typically applied to each token before it is fed to the classifier:

>>> # Classify each Gutenberg document.
>>> for file in gutenberg.files():
...     doc = gutenberg.tokenized(file)
...     print file, classifier.classify(document_features(doc))

The parameters that a feature detector expects will vary, depending on the task and the needs of the feature detector. For example, a feature detector for word sense disambiguation (WSD) might take as its input a sentence, and the index of a word that should be classified, and return a featureset for that word. The following feature detector for WSD includes features describing the left and right contexts of the target word:

>>> def wsd_features(sentence, index):
...     featureset = {}
...     for i in range(max(0, index-3), index):
...         featureset['left-context(%s)' % sentence[i]] = True
...     for i in range(index+1, min(index+4, len(sentence))):
...         featureset['right-context(%s)' % sentence[i]] = True
...     return featureset

Training Classifiers

Most classifiers are built by training them on a list of hand-labeled examples, known as the training set. Training sets are represented as lists of (featuredict, label) tuples, where each featuredict is a featureset of the kind described above.
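A training set in this format can be sketched as follows. The toy documents are invented for illustration, and the training step assumes that NLTK is installed and that `NaiveBayesClassifier.train` accepts a list of (featureset, label) tuples; it is skipped if NLTK cannot be imported.

```python
def document_features(document):
    # Same shape as the feature detector shown earlier in this section.
    return {'contains-word(%s)' % w: True for w in document}

# Hand-labeled toy documents (made up for this sketch).
labeled_docs = [
    (['doctor', 'diet', 'exercise'], 'health'),
    (['match', 'goal', 'team'], 'sports'),
    (['vaccine', 'doctor', 'clinic'], 'health'),
    (['team', 'coach', 'season'], 'sports'),
]

# The training set: a list of (featureset, label) tuples.
train_set = [(document_features(doc), label) for doc, label in labeled_docs]

try:
    from nltk.classify import NaiveBayesClassifier
except ImportError:
    NaiveBayesClassifier = None  # NLTK not available; skip training.

if NaiveBayesClassifier is not None:
    classifier = NaiveBayesClassifier.train(train_set)
    print(classifier.classify(document_features(['doctor', 'clinic'])))
```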

Classes
BinaryMaxentFeatureEncoding
A feature encoding that generates vectors containing binary joint-features of the form:
ClassifierI
A processing interface for labeling tokens with a single category label (or class).
ConditionalExponentialClassifier
A maximum entropy classifier (also known as a conditional exponential classifier).
DecisionTreeClassifier
MaxentClassifier
A maximum entropy classifier (also known as a conditional exponential classifier).
MultiClassifierI
A processing interface for labeling tokens with zero or more category labels (or labels).
NaiveBayesClassifier
A Naive Bayes classifier.
RTEFeatureExtractor
This builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.
WekaClassifier
Functions
config_megam(bin=None)
Configure NLTK's interface to the megam maxent optimization package.
config_weka(classpath=None)
rte_classifier(trainer, features=rte_features)
Classify RTEPairs
rte_features(rtepair)
Function Details

config_megam(bin=None)

Configure NLTK's interface to the megam maxent optimization package.

Parameters:
  • bin (string) - The full path to the megam binary. If not specified, NLTK searches the system for a megam binary and raises a LookupError exception if none is found.
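A hedged usage sketch: the wrapper below calls config_megam and turns the documented LookupError into a boolean result. The wrapper name `try_config_megam` and the example path are hypothetical, and the import assumes that NLTK is installed and exposes `config_megam` from `nltk.classify`, as listed above.

```python
def try_config_megam(megam_path=None):
    """Return True if megam was configured, False otherwise."""
    try:
        from nltk.classify import config_megam
    except ImportError:
        return False  # NLTK itself is not installed.
    try:
        # With megam_path=None, NLTK searches the system for the binary.
        config_megam(megam_path)
    except LookupError:
        return False  # No megam binary could be found.
    return True

# Hypothetical path; replace with the real location of the megam binary.
configured = try_config_megam('/path/to/megam')
```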