Package nltk :: Module grammar

Module grammar

Basic data classes for representing context free grammars. A grammar specifies which trees can represent the structure of a given text. Each of these trees is called a parse tree for the text (or simply a parse). In a context free grammar, the set of parse trees for any piece of a text can depend only on that piece, and not on the rest of the text (i.e., the piece's context). Context free grammars are often used to find possible syntactic structures for sentences. In this context, the leaves of a parse tree are word tokens; and the node values are phrasal categories, such as NP and VP.

The ContextFreeGrammar class is used to encode context free grammars. Each ContextFreeGrammar consists of a start symbol and a set of productions. The start symbol specifies the root node value for parse trees. For example, the start symbol for syntactic parsing is usually S. Start symbols are encoded using the Nonterminal class, which is discussed below.

A Grammar's productions specify what parent-child relationships a parse tree can contain. Each production specifies that a particular node can be the parent of a particular set of children. For example, the production <S> -> <NP> <VP> specifies that an S node can be the parent of an NP node and a VP node.

Grammar productions are implemented by the Production class. Each Production consists of a left hand side and a right hand side. The left hand side is a Nonterminal that specifies the node type for a potential parent, and the right hand side is a list that specifies allowable children for that parent. This list consists of Nonterminals and text types: each Nonterminal indicates that the corresponding child may be a TreeToken with the specified node type, and each text type indicates that the corresponding child may be a Token with that type.

The Nonterminal class is used to distinguish node values from leaf values. This prevents the grammar from accidentally using a leaf value (such as the English word "A") as the node of a subtree. Within a ContextFreeGrammar, all node values are wrapped in the Nonterminal class. Note, however, that the trees that are specified by the grammar do not include these Nonterminal wrappers.
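The wrapping idea described above can be illustrated with a minimal, self-contained sketch. The classes `SketchNonterminal` and `SketchProduction` are hypothetical stand-ins written for this example; they are not the nltk classes and only mimic the lhs/rhs structure described here.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class SketchNonterminal:
    """Hypothetical stand-in for a nonterminal wrapper (not the nltk class)."""
    symbol: str


@dataclass(frozen=True)
class SketchProduction:
    """Hypothetical stand-in for a production: lhs -> rhs."""
    lhs: SketchNonterminal
    rhs: Tuple[Union[SketchNonterminal, str], ...]


# The category "A" and the English word "A" stay distinct because only
# the category is wrapped.
category_a = SketchNonterminal("A")
word_a = "A"

S, NP, VP = (SketchNonterminal(name) for name in ("S", "NP", "VP"))
s_rule = SketchProduction(S, (NP, VP))             # <S> -> <NP> <VP>
a_rule = SketchProduction(category_a, (word_a,))   # <A> -> "A"
```

Because the wrapped category never compares equal to a plain string, a right hand side can mix node types and leaf values without ambiguity.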

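Grammars can also be given a more procedural interpretation. According to this interpretation, a Grammar specifies any tree structure (tree) that can be produced by the following procedure:

  • Set tree to the start symbol.
  • Repeat until tree contains no more nonterminal leaves:
      • Choose a production (prod) whose left hand side (lhs) is a nonterminal leaf of tree.
      • Replace the nonterminal leaf with a subtree whose node value is the value wrapped by the nonterminal lhs, and whose children are the right hand side of prod.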

The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree (tree) is known as expanding lhs to rhs in tree.
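The expansion operation can be sketched in plain Python as follows. This is an illustration only: the `NT` class, the toy `RULES` table, and the `(label, children)` tree representation are assumptions made for the example and do not come from the module.

```python
import random


class NT:
    """Hypothetical nonterminal wrapper used only for this sketch."""
    def __init__(self, symbol):
        self.symbol = symbol


S, NP, VP = NT("S"), NT("NP"), NT("VP")

# Toy grammar (an assumption for illustration): lhs -> possible right hand sides.
RULES = {
    S: [[NP, VP]],
    NP: [["dogs"], ["cats"]],
    VP: [["sleep"], ["chase", NP]],
}


def expand(symbol, rng):
    """Expand `symbol` until no nonterminal leaves remain."""
    if not isinstance(symbol, NT):
        return symbol                      # a leaf (word) is left unchanged
    rhs = rng.choice(RULES[symbol])        # choose a production for this lhs
    # The resulting subtree is labelled with the unwrapped node value.
    return (symbol.symbol, [expand(child, rng) for child in rhs])


def leaves(tree):
    """Collect the words at the leaves of a (label, children) tree."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


tree = expand(S, random.Random(0))
```

Each recursive call performs one expansion of lhs to rhs, and the finished tree carries plain node values rather than the wrapper objects.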

Classes
Nonterminal
A non-terminal symbol for a context free grammar.
FeatStructNonterminal
A feature structure that's also a nonterminal.
Production
A grammar production.
DependencyProduction
A dependency grammar production.
WeightedProduction
A probabilistic context free grammar production.
ContextFreeGrammar
A context-free grammar.
FeatureGrammar
A feature-based grammar.
FeatureValueType
A helper class for FeatureGrammars, designed to be different from ordinary strings.
DependencyGrammar
A dependency grammar.
StatisticalDependencyGrammar
WeightedGrammar
A probabilistic context-free grammar.
Functions
list of Nonterminal
nonterminals(symbols)
Given a string containing a list of symbol names, return a list of Nonterminals constructed from those symbols.
bool
is_nonterminal(item)
Returns: True if the item is a Nonterminal.
bool
is_terminal(item)
Returns: True if the item is a terminal; currently, any hashable item that is not a Nonterminal counts as a terminal.
induce_pcfg(start, productions)
Induce a PCFG grammar from a list of productions.
parse_cfg_production(input)
Returns: a list of context-free Productions.
parse_cfg(input)
Returns: a ContextFreeGrammar.
parse_pcfg_production(input)
Returns: a list of PCFG WeightedProductions.
parse_pcfg(input)
Returns: a probabilistic WeightedGrammar.
parse_fcfg_production(input, fstruct_parser)
Returns: a list of feature-based Productions.
parse_fcfg(input, features=None, logic_parser=None, fstruct_parser=None)
Returns: a feature structure based FeatureGrammar.
parse_production(line, nonterm_parser, probabilistic=False)
Parse a grammar rule, given as a string, and return a list of productions.
parse_grammar(input, nonterm_parser, probabilistic=False)
Returns: a pair of a starting category and a list of Productions.
standard_nonterm_parser(string, pos)
parse_dependency_grammar(s)
parse_dependency_production(s)
cfg_demo()
A demonstration showing how ContextFreeGrammars can be created and used.
pcfg_demo()
A demonstration showing how WeightedGrammars can be created and used.
fcfg_demo()
dg_demo()
A demonstration showing the creation and inspection of a DependencyGrammar.
sdg_demo()
A demonstration of how to read a string representation of a CoNLL format dependency tree.
demo()
Variables
  _ARROW_RE = re.compile(r'(?x)\s*->\s*')
  _PROBABILITY_RE = re.compile(r'(?x)(\[[\d\.]+\])\s*')
  _TERMINAL_RE = re.compile(r'(?x)("[^"]+"|\'[^\']+\')\s*')
  _DISJUNCTION_RE = re.compile(r'(?x)\|\s*')
  _STANDARD_NONTERM_RE = re.compile(r'(?x)([\w/][\w/\^<>-]*)\s*')
  _PARSE_DG_RE = re.compile(r'(?x)^\s*(\'[^\']+\')\s*(?:[-=]+>)\...
  _SPLIT_DG_RE = re.compile(r'(\'[^\']\'|[-=]+>|"[^"]+"|\'[^\']+...
  toy_pcfg1 = <Grammar with 17 productions>
  toy_pcfg2 = <Grammar with 23 productions>
Function Details

nonterminals(symbols)

Given a string containing a list of symbol names, return a list of Nonterminals constructed from those symbols.

Parameters:
  • symbols (string) - The symbol name string. This string can be delimited by either spaces or commas.
Returns: list of Nonterminal
A list of Nonterminals constructed from the symbol names given in symbols. The Nonterminals appear in the same order as the symbol names.
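The delimiter handling can be sketched as below. The helper `split_symbol_names` is a hypothetical name for this example; it returns plain strings rather than Nonterminal objects, and it assumes that commas take precedence over spaces when both are present.

```python
def split_symbol_names(symbols):
    """Split a symbol-name string on commas if present, else on whitespace."""
    if "," in symbols:
        return [name.strip() for name in symbols.split(",")]
    return symbols.split()


comma_names = split_symbol_names("S, NP, VP")
space_names = split_symbol_names("S NP VP")
```

Both calls yield the names in their original order, which is the ordering guarantee stated above.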

is_nonterminal(item)

Returns: bool
True if the item is a Nonterminal.

is_terminal(item)

Returns: bool
True if the item is a terminal; currently, any hashable item that is not a Nonterminal counts as a terminal.
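A rough sketch of this test, assuming "hashable" means that calling `hash()` on the item succeeds. `SketchNonterminal` and `sketch_is_terminal` are hypothetical names for this example, and the actual implementation may check hashability differently.

```python
class SketchNonterminal:
    """Hypothetical stand-in for the Nonterminal wrapper."""
    def __init__(self, symbol):
        self.symbol = symbol


def sketch_is_terminal(item):
    """Return True for hashable items that are not wrapped nonterminals."""
    try:
        hash(item)
    except TypeError:
        return False                 # unhashable values such as lists
    return not isinstance(item, SketchNonterminal)


word_result = sketch_is_terminal("dog")
category_result = sketch_is_terminal(SketchNonterminal("NP"))
list_result = sketch_is_terminal(["dog"])
```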

induce_pcfg(start, productions)

Induce a PCFG grammar from a list of productions.

The probability of a production A -> B C in a PCFG is:

P(B, C | A) = count(A -> B C) / count(A -> *), where * is any right hand side.

Parameters:
  • start (Nonterminal) - The start symbol
  • productions (list of Production) - The list of productions that defines the grammar
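The relative-frequency estimate can be sketched with plain tuples. This is an illustration of the formula only, not the module's implementation: `relative_frequencies` is a hypothetical helper, and productions are represented here as `(lhs, rhs)` tuples rather than Production objects.

```python
from collections import Counter


def relative_frequencies(productions):
    """Map each (lhs, rhs) pair to count(lhs -> rhs) / count(lhs -> *)."""
    rule_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _rhs in productions)
    return {
        (lhs, rhs): count / lhs_counts[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Toy observations (assumed for illustration), e.g. read off a small treebank.
observed = [
    ("S", ("NP", "VP")),
    ("NP", ("dogs",)),
    ("NP", ("dogs",)),
    ("NP", ("cats",)),
    ("VP", ("sleep",)),
]
probs = relative_frequencies(observed)
```

In this toy data NP is expanded three times, twice to "dogs", so that production receives probability 2/3, and the probabilities for each left hand side sum to one.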

parse_cfg_production(input)

Returns:
a list of context-free Productions.

parse_cfg(input)

Parameters:
  • input - a grammar, either in the form of a string or else as a list of strings.
Returns:
a ContextFreeGrammar.
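As a rough illustration of how one rule line might be broken apart, the sketch below reuses the `_ARROW_RE` and `_DISJUNCTION_RE` patterns listed under Variables. The helper `split_rule` and the sample rule are assumptions made for this example. The sketch splits naively on whitespace and keeps quotes on terminals, so it does not handle quoted terminals that contain spaces or a `|` character.

```python
import re

# Patterns copied from the _ARROW_RE and _DISJUNCTION_RE values on this page.
ARROW_RE = re.compile(r'(?x)\s*->\s*')
DISJUNCTION_RE = re.compile(r'(?x)\|\s*')


def split_rule(line):
    """Split 'LHS -> A B | C' into a list of (lhs, rhs_symbols) pairs."""
    lhs, rhs_text = ARROW_RE.split(line.strip(), maxsplit=1)
    alternatives = DISJUNCTION_RE.split(rhs_text)
    return [(lhs, tuple(alt.split())) for alt in alternatives]


rules = split_rule("NP -> Det N | 'dogs'")
```

One input line with a disjunction yields one pair per alternative, which matches the summary above that a single rule string produces a list of productions.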

parse_pcfg_production(input)

Returns:
a list of PCFG WeightedProductions.

parse_pcfg(input)

Parameters:
  • input - a grammar, either in the form of a string or else as a list of strings.
Returns:
a probabilistic WeightedGrammar.

parse_fcfg_production(input, fstruct_parser)

Returns:
a list of feature-based Productions.

parse_fcfg(input, features=None, logic_parser=None, fstruct_parser=None)

Parameters:
  • input - a grammar, either in the form of a string or else as a list of strings.
  • features - a tuple of features (default: SLASH, TYPE)
  • logic_parser - a parser for lambda-expressions (default: LogicParser())
  • fstruct_parser - a feature structure parser (used only if features and logic_parser are both None)
Returns:
a feature structure based FeatureGrammar.

parse_grammar(input, nonterm_parser, probabilistic=False)

Parameters:
  • input - a grammar, either in the form of a string or else as a list of strings.
  • nonterm_parser - a function for parsing nonterminals. It should take a (string,position) as argument and return a (nonterminal,position) as result.
  • probabilistic - are the grammar rules probabilistic?
Returns:
a pair of
  • a starting category
  • a list of Productions
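The `nonterm_parser` contract, a function from (string, position) to (nonterminal, position), can be sketched with the `_STANDARD_NONTERM_RE` pattern listed under Variables. `sketch_nonterm_parser` is a hypothetical name, and the sketch returns the matched name as a plain string where a real parser would presumably wrap it in a Nonterminal.

```python
import re

# Pattern copied from the _STANDARD_NONTERM_RE value listed on this page.
NONTERM_RE = re.compile(r'(?x)([\w/][\w/\^<>-]*)\s*')


def sketch_nonterm_parser(string, pos):
    """Read one nonterminal name at `pos`; return (name, new position)."""
    match = NONTERM_RE.match(string, pos)
    if match is None:
        raise ValueError("Expected a nonterminal at position %d" % pos)
    return match.group(1), match.end()


line = "NP VP"
first, pos = sketch_nonterm_parser(line, 0)
second, pos = sketch_nonterm_parser(line, pos)
```

Because the pattern also consumes trailing whitespace, the returned position already points at the next symbol, so the caller can chain calls without skipping spaces itself.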

Variables Details

_PARSE_DG_RE

Value:
re.compile(r'(?x)^\s*(\'[^\']+\')\s*(?:[-=]+>)\s*(?:("[^"]+"|\'[^\']+\'|\|)\s*)*$')

_SPLIT_DG_RE

Value:
re.compile(r'(\'[^\']\'|[-=]+>|"[^"]+"|\'[^\']+\'|\|)')
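To show what these two patterns accept, the sketch below applies them to a sample dependency rule. The sample string is an assumption chosen to fit the patterns, and how the module itself uses the expressions is not shown on this page.

```python
import re

# Patterns copied from the _PARSE_DG_RE and _SPLIT_DG_RE values above.
PARSE_DG_RE = re.compile(
    r'(?x)^\s*(\'[^\']+\')\s*(?:[-=]+>)\s*(?:("[^"]+"|\'[^\']+\'|\|)\s*)*$'
)
SPLIT_DG_RE = re.compile(r'(\'[^\']\'|[-=]+>|"[^"]+"|\'[^\']+\'|\|)')

rule = "'taught' -> 'play' | 'man'"

# The whole line must look like: quoted head, arrow, quoted items or bars.
is_well_formed = PARSE_DG_RE.match(rule) is not None

# Pull out the individual pieces: quoted words, the arrow, and bars.
tokens = SPLIT_DG_RE.findall(rule)
```

The first pattern validates the overall shape of a line, while the second extracts its pieces, so the head word is the first token and the dependents follow the arrow.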