Package nltk :: Package corpus :: Package reader :: Module wordnet :: Class WordNetCorpusReader

type WordNetCorpusReader


      object --+    
               |    
api.CorpusReader --+
                   |
                  WordNetCorpusReader

A corpus reader used to access WordNet or its variants.

Instance Methods

__init__(self, root)
    Construct a new WordNet corpus reader, with the given root directory.

_load_lemma_pos_offset_map(self)

_load_exception_map(self)

_compute_max_depth(self, pos, simulate_root)
    Compute the maximum depth for the given part of speech.

get_version(self)

lemma(self, name)

lemma_from_key(self, key)

synset(self, name)

_data_file(self, pos)
    Return an open file pointer for the data file for the given part of speech.

_synset_from_pos_and_offset(self, pos, offset)

_synset_from_pos_and_line(self, pos, data_file_line)

synsets(self, lemma, pos=None)
    Load all synsets with a given lemma and part-of-speech tag.

lemmas(self, lemma, pos=None)
    Return all Lemma objects whose name matches the specified lemma name and part-of-speech tag.

all_lemma_names(self, pos=None)
    Return all lemma names for all synsets for the given part-of-speech tag.

all_synsets(self, pos=None)
    Iterate over all synsets with a given part-of-speech tag.

lemma_count(self, lemma)
    Return the frequency count for this Lemma.

path_similarity(self, synset1, synset2, verbose=False, simulate_root=True)
    Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy.

lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True)
    Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

wup_similarity(self, synset1, synset2, verbose=False, simulate_root=True)
    Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

res_similarity(self, synset1, synset2, ic, verbose=False)
    Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

jcn_similarity(self, synset1, synset2, ic, verbose=False)
    Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

lin_similarity(self, synset1, synset2, ic, verbose=False)
    Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.

morphy(self, form, pos=None)
    Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

_morphy(self, form, pos)

ic(self, corpus, weight_senses_equally=False, smoothing=1.0)
    Create an information content lookup dictionary from a corpus.

Inherited from api.CorpusReader: __repr__, abspath, abspaths, encoding, fileids, open, readme

Inherited from api.CorpusReader (private): _get_root

Inherited from api.CorpusReader: files (deprecated since 0.9.7)

Inherited from api.CorpusReader: items (deprecated since 0.9.1)

Inherited from api.CorpusReader (private): _get_items

Class Variables
  _ENCODING = None
  _FILES = ('cntlist.rev', 'lexnames', 'index.sense', 'index.adj...
A list of file identifiers for all the fileids used by this corpus reader.
  MORPHOLOGICAL_SUBSTITUTIONS = {'a': [('er', ''), ('est', ''), ...
  ADJ = 'a'
  ADJ_SAT = 's'
  ADV = 'r'
  NOUN = 'n'
  VERB = 'v'

Filename constants
  _FILEMAP = {'a': 'adj', 'n': 'noun', 'r': 'adv', 'v': 'verb'}

Part of speech constants
  _pos_numbers = {'a': 3, 'n': 1, 'r': 4, 's': 5, 'v': 2}
  _pos_names = {1: 'n', 2: 'v', 3: 'a', 4: 'r', 5: 's'}
Instance Variables
  _lemma_pos_offset_map
An index that provides the file offset.
  _synset_offset_cache
A cache so we don't have to reconstruct synsets.
  _max_depth
A lookup for the maximum depth of each part of speech.

Inherited from api.CorpusReader (private): _encoding, _fileids, _root

Properties

Inherited from api.CorpusReader: root

Method Details

__init__(self, root) (Constructor)

Construct a new WordNet corpus reader, with the given root directory.

Parameters:
  • root - A path pointer identifying the root directory for this corpus. If a string is specified, then it will be converted to a PathPointer automatically.
  • fileids - A list of the files that make up this corpus. This list can either be specified explicitly, as a list of strings; or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader's root to each file name.
  • encoding - The default unicode encoding for the files that make up the corpus. encoding's value can be any of the following:
    • A string: encoding is the encoding name for all files.
    • A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
    • A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple's regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
    • None: the file contents of all files will be processed using non-unicode byte strings.
  • tag_mapping_function - A function for normalizing or simplifying the POS tags returned by the tagged_words() or tagged_sents() methods.
Overrides: api.CorpusReader.__init__

_compute_max_depth(self, pos, simulate_root)

Compute the maximum depth for the given part of speech. This is used by the lch similarity metric.

synsets(self, lemma, pos=None)

Load all synsets with a given lemma and part-of-speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

lemmas(self, lemma, pos=None)

Return all Lemma objects whose name matches the specified lemma name and part-of-speech tag. Matches any part of speech if none is specified.

all_lemma_names(self, pos=None)

Return all lemma names for all synsets for the given part-of-speech tag. If pos is not specified, all synsets for all parts of speech will be used.

all_synsets(self, pos=None)

Iterate over all synsets with a given part-of-speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

path_similarity(self, synset1, synset2, verbose=False, simulate_root=True)

Path Distance Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1, except in those cases where a path cannot be found (true only for verbs, as there are many distinct verb taxonomies), in which case None is returned. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • simulate_root (bool) - The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:
A score denoting the similarity of the two Synsets, normally between 0 and 1. None is returned if no connecting path could be found. 1 is returned if a Synset is compared with itself.

lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True)

Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p / (2 * d)), where p is the shortest path length and d is the taxonomy depth.

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • simulate_root (bool) - The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:
A score denoting the similarity of the two Synsets, normally greater than 0. None is returned if no connecting path could be found. If a Synset is compared with itself, the maximum score is returned, which varies depending on the taxonomy depth.

wup_similarity(self, synset1, synset2, verbose=False, simulate_root=True)

Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen's Perl implementation of WordNet Similarity. With the addition of the simulate_root flag (see below), the scores for verbs now almost always agree, though not always for nouns.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • simulate_root (bool) - The various verb taxonomies do not share a single root which disallows this metric from working for synsets that are not connected. This flag (True by default) creates a fake root that connects all the taxonomies. Set it to false to disable this behavior. For the noun taxonomy, there is usually a default root except for WordNet version 1.6. If you are using wordnet 1.6, a fake root will be added for nouns as well.
Returns:
A float score denoting the similarity of the two Synsets, normally greater than zero. If no connecting path between the two senses can be found, None is returned.

res_similarity(self, synset1, synset2, ic, verbose=False)


Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • ic (dict) - an information content object (as returned by load_ic()).
Returns:
A float score denoting the similarity of the two Synsets. Synsets whose LCS is the root node of the taxonomy will have a score of 0 (e.g. N['dog'][0] and N['table'][0]).

jcn_similarity(self, synset1, synset2, ic, verbose=False)


Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • ic (dict) - an information content object (as returned by load_ic()).
Returns:
A float score denoting the similarity of the two Synsets.

lin_similarity(self, synset1, synset2, ic, verbose=False)


Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

Parameters:
  • other (Synset) - The Synset that this Synset is being compared to.
  • ic (dict) - an information content object (as returned by load_ic()).
Returns:
A float score denoting the similarity of the two Synsets, in the range 0 to 1.

morphy(self, form, pos=None)


Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('dogs')
'dog'
>>> wn.morphy('churches')
'church'
>>> wn.morphy('aardwolves')
'aardwolf'
>>> wn.morphy('abaci')
'abacus'
>>> wn.morphy('hardrock', wn.ADV)
>>> wn.morphy('book', wn.NOUN)
'book'
>>> wn.morphy('book', wn.ADJ)

ic(self, corpus, weight_senses_equally=False, smoothing=1.0)


Create an information content lookup dictionary from a corpus.

Parameters:
  • corpus (CorpusReader) - The corpus from which we create an information content dictionary.
  • weight_senses_equally (bool) - If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets a count of 0.3333 per appearance when this is False, and 1.0 when it is True.)
  • smoothing (float) - The amount of smoothing applied to synset counts (default is 1.0).
Returns:
An information content dictionary

Class Variable Details

_FILES

A list of file identifiers for all the fileids used by this corpus reader.

Value:
('cntlist.rev',
 'lexnames',
 'index.sense',
 'index.adj',
 'index.adv',
 'index.noun',
 'index.verb',
 'data.adj',
...

MORPHOLOGICAL_SUBSTITUTIONS

Value:
{'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
 'n': [('s', ''),
       ('ses', 's'),
       ('ves', 'f'),
       ('xes', 'x'),
       ('zes', 'z'),
       ('ches', 'ch'),
       ('shes', 'sh'),
...

Instance Variable Details

_lemma_pos_offset_map

An index that provides the file offset.

Map from lemma -> pos -> synset_index -> offset

_synset_offset_cache

A cache so we don't have to reconstruct synsets.

Map from pos -> offset -> synset

_max_depth

A lookup for the maximum depth of each part of speech. Useful for the lch similarity metric.