"""
The NLTK corpus and module downloader.  This module defines several
interfaces which can be used to download corpora, models, and other
data packages that can be used with NLTK.

Downloading Packages
====================
If called with no arguments, the L{download() <Downloader.download>}
function will display an interactive interface which can be used to
download and install new packages.  If Tkinter is available, then a
graphical interface will be shown; otherwise, a simple text interface
will be provided.

Individual packages can be downloaded by calling the C{download()}
function with a single argument, giving the package identifier for the
package that should be downloaded:

    >>> download('treebank') # doctest: +SKIP
    [nltk_data] Downloading package 'treebank'...
    [nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", each consisting
of a group of related packages.  To download all packages in a
collection, simply call C{download()} with the collection's
identifier:

    >>> download('all-corpora') # doctest: +SKIP
    [nltk_data] Downloading package 'abc'...
    [nltk_data]   Unzipping corpora/abc.zip.
    [nltk_data] Downloading package 'alpino'...
    [nltk_data]   Unzipping corpora/alpino.zip.
      ...
    [nltk_data] Downloading package 'words'...
    [nltk_data]   Unzipping corpora/words.zip.

Download Directory
==================
By default, packages are installed either in a system-wide directory
(if Python has sufficient access to write to it) or in the current
user's home directory.  However, the C{download_dir} argument may be
used to specify a different installation target, if desired.

See L{Downloader.default_download_dir()} for a more detailed
description of how the default download directory is chosen.
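The selection rule can be sketched roughly as follows.  This is a
minimal, standalone illustration and not the module's own code: the
helper name C{choose_download_dir} and its C{environ} and C{platform}
parameters are hypothetical, and C{os.access} stands in for whatever
writability test the downloader actually uses.

```python
import os
import sys

def choose_download_dir(search_path, environ=None, platform=None):
    # Hypothetical helper, written for illustration only.
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    # 1. Prefer the first search-path entry that exists and is writable
    #    (assumption: os.access approximates the real writability test).
    for candidate in search_path:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate

    # 2. Otherwise fall back to a per-user location.
    if platform == 'win32' and 'APPDATA' in environ:
        homedir = environ['APPDATA']
    else:
        homedir = os.path.expanduser('~')
        if homedir == '~':
            raise ValueError('Could not find a default download directory')
    return os.path.join(homedir, 'nltk_data')
```

Passing the search path in as an argument keeps the sketch easy to
try out without touching any real system directories.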
52
NLTK Download Server
====================
Before downloading any packages, the corpus and module downloader
contacts the NLTK download server to retrieve an index file
describing the available packages.  By default, this index file is
loaded from C{<http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml>}.
If necessary, it is possible to create a new L{Downloader} object,
specifying a different URL for the package index file.
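As a rough illustration of what such an index lookup involves, the
sketch below parses a small index and finds a record by identifier.
The C{packages/package} and C{collections/collection} paths and the
C{id} attribute follow the lookups used later in this module; the root
element name, the C{item} element, the sample values and the helper
name C{find_record} are assumptions made only for this example.

```python
import xml.etree.ElementTree as ElementTree

# Hypothetical index contents; a real index file may look different
# and will usually carry more attributes per entry.
SAMPLE_INDEX = (
    '<nltk_data>'
    '<packages>'
    '<package id="treebank" url="http://example.org/treebank.zip" '
    'subdir="corpora" size="100" unzipped_size="250" checksum="abc"/>'
    '</packages>'
    '<collections>'
    '<collection id="all-corpora" name="All corpora">'
    '<item ref="treebank"/>'
    '</collection>'
    '</collections>'
    '</nltk_data>'
)

def find_record(index, item_id):
    # Search packages first, then collections, by their 'id' attribute.
    for path in ('packages/package', 'collections/collection'):
        for record in index.findall(path):
            if record.get('id') == item_id:
                return record
    raise ValueError('Package %r not found in index' % item_id)

index = ElementTree.fromstring(SAMPLE_INDEX)
record = find_record(index, 'treebank')
subdir = record.get('subdir')
```

An index served from a different URL would be handled in the same
way once its contents have been fetched and parsed.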
61
Usage::

    python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or, with Python 2.5+::

    python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
"""
70
71 """
72
73 0 1 2 3
74 [label][----][label][----]
75 [column ][column ]
76
Notes
=====
Handling data files..  Some questions:

* Should the data files be kept zipped or unzipped?  I say zipped.

* Should the data files be kept in svn at all?  Advantages: history;
  automatic version numbers; 'svn up' could be used rather than the
  downloader to update the corpora.  Disadvantages: they're big,
  which makes working from svn a bit of a pain.  And we're planning
  to potentially make them much bigger.  I don't think we want
  people to have to download 400MB corpora just to use nltk from
  svn.

* Compromise: keep the data files in trunk/data rather than in
  trunk/nltk.  That way you can check them out in svn if you want
  to; but you don't need to, and you can use the downloader instead.

* Also: keep models in mind.  When we change the code, we'd
  potentially like the models to get updated.  This could require a
  little thought.

* So.. let's assume we have a trunk/data directory, containing a bunch
  of packages.  The packages should be kept as zip files, because we
  really shouldn't be editing them much (well -- we may edit models
  more, but they tend to be binary-ish files anyway, where diffs
  aren't that helpful).  So we'll have trunk/data, with a bunch of
  files like abc.zip and treebank.zip and propbank.zip.  For each
  package we could also have e.g. treebank.xml and propbank.xml,
  describing the contents of the package (name, copyright, license,
  etc).  Collections would also have .xml files.  Finally, we would
  pull all these together to form a single index.xml file.  Some
  directory structure wouldn't hurt.  So how about::

    /trunk/data/ ....................... root of data svn
      index.xml ........................ main index file
      src/ ............................. python scripts
      packages/ ........................ dir for packages
        corpora/ ....................... zip & xml files for corpora
        grammars/ ...................... zip & xml files for grammars
        taggers/ ....................... zip & xml files for taggers
        tokenizers/ .................... zip & xml files for tokenizers
        etc.
      collections/ ..................... xml files for collections

  Where the root (/trunk/data) would contain a makefile; and src/
  would contain a script to update the index.xml file.  It could also
  contain scripts to rebuild some of the various model files.  The
  script that builds index.xml should probably check that each zip
  file expands entirely into a single subdir, whose name matches the
  package's uid.
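  A check of that kind might look something like the sketch below.
  It is only an illustration of the idea in the note above: the
  function name C{expands_into_single_subdir} is hypothetical, and no
  such script is shown in this module.

```python
import io
import zipfile

def expands_into_single_subdir(zip_file, uid):
    # zip_file: a path or a file-like object; uid: the package identifier.
    with zipfile.ZipFile(zip_file) as archive:
        names = archive.namelist()
    # Every member must live under '<uid>/'; an empty archive fails.
    prefix = uid + '/'
    return bool(names) and all(name.startswith(prefix) for name in names)

# Build a tiny in-memory archive to try the check on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('abc/README', 'toy corpus')
    archive.writestr('abc/data.txt', 'some text')
ok = expands_into_single_subdir(buf, 'abc')
```

  Comparing against the prefix with a trailing slash avoids accepting
  a sibling directory whose name merely starts with the uid.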
128
Changes I need to make:
  - in index: change "size" to "filesize" or "compressed-size"
  - in index: add "unzipped-size"
  - when checking status: check both compressed & uncompressed size.
    uncompressed size is important to make sure we detect a problem
    if something got partially unzipped.  define new status values
    to differentiate stale vs corrupt vs corruptly-uncompressed??
    (we shouldn't need to re-download the file if the zip file is ok
    but it didn't get uncompressed fully.)
  - add other fields to the index: author, license, copyright, contact,
    etc.
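The two-size status check proposed in the list above could be
sketched along these lines.  This is an illustration of the idea, not
the downloader's implementation: the function names and the extra
C{UNZIP_INCOMPLETE} status value are hypothetical, chosen to show how
a good zip file with a bad extraction could be told apart from a file
that must be downloaded again.

```python
import hashlib
import os

INSTALLED = 'installed'
NOT_INSTALLED = 'not installed'
STALE = 'out of date'
UNZIP_INCOMPLETE = 'unzip incomplete'   # hypothetical extra status

def md5_of_file(path, blocksize=1024 * 16):
    # Hash the file in blocks so large packages need not fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as infile:
        for block in iter(lambda: infile.read(blocksize), b''):
            digest.update(block)
    return digest.hexdigest()

def package_status(zip_path, size, checksum, unzipped_size):
    # Compressed file: it must exist, then match in size and checksum.
    if not os.path.exists(zip_path):
        return NOT_INSTALLED
    if os.path.getsize(zip_path) != size:
        return STALE
    if md5_of_file(zip_path) != checksum:
        return STALE
    # Uncompressed copy: optional, but if present it must be complete.
    unzipdir = os.path.splitext(zip_path)[0]
    if not os.path.exists(unzipdir):
        return INSTALLED
    if not os.path.isdir(unzipdir):
        return UNZIP_INCOMPLETE
    actual = sum(os.path.getsize(os.path.join(d, f))
                 for d, _, files in os.walk(unzipdir) for f in files)
    return INSTALLED if actual == unzipped_size else UNZIP_INCOMPLETE
```

With a separate status for a bad extraction, a caller could simply
unzip again instead of re-downloading the package file.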
140
The current grammars/ package would become a single new package (e.g.
toy-grammars or book-grammars).

xml file should have:
  - authorship info
  - license info
  - copyright info
  - contact info
  - info about what type of data/annotation it contains?
  - recommended corpus reader?

Collections can contain other collections.  They can also contain
multiple package types (corpora & models).  Have a single 'basics'
package that includes everything we talk about in the book?

n.b.: there will have to be a fallback to the punkt tokenizer, in case
they didn't download that model.

default: unzip or not?

"""
import time, re, os, zipfile, sys, textwrap, threading, itertools
from cStringIO import StringIO
try:
    from hashlib import md5
except ImportError:
    # hashlib is missing before Python 2.5; fall back to the md5 module.
    from md5 import md5

try:
    TKINTER = True
    from Tkinter import *
    from tkMessageBox import *
    from nltk.draw.table import Table
    from nltk.draw import ShowText
except:
    # Tkinter (or a module that needs it) could not be loaded; the
    # text interface will be used instead.
    TKINTER = False
    TclError = ValueError

from nltk.etree import ElementTree
import nltk
urllib2 = nltk.internals.import_from_stdlib('urllib2')
182
183
184
185
186
187
class Package(object):
    """
    A directory entry for a downloadable package.  These entries are
    extracted from the XML index file that is downloaded by
    L{Downloader}.  Each package consists of a single file; but if
    that file is a zip file, then it can be automatically decompressed
    when the package is installed.
    """
    def __init__(self, id, url, name=None, subdir='',
                 size=None, unzipped_size=None,
                 checksum=None, svn_revision=None,
                 copyright='Unknown', contact='Unknown',
                 license='Unknown', author='Unknown',
                 unzip=True,
                 **kw):
        self.id = id
        """A unique identifier for this package."""

        self.name = name or id
        """A string name for this package."""

        self.subdir = subdir
        """The subdirectory where this package should be installed.
        E.g., C{'corpora'} or C{'taggers'}."""

        self.url = url
        """A URL that can be used to download this package's file."""

        self.size = int(size)
        """The filesize (in bytes) of the package file."""

        self.unzipped_size = int(unzipped_size)
        """The total filesize of the files contained in the package's
        zipfile."""

        self.checksum = checksum
        """The MD-5 checksum of the package file."""

        self.svn_revision = svn_revision
        """A subversion revision number for this package."""

        self.copyright = copyright
        """Copyright holder for this package."""

        self.contact = contact
        """Name & email of the person who should be contacted with
        questions about this package."""

        self.license = license
        """License information for this package."""

        self.author = author
        """Author of this package."""

        ext = os.path.splitext(url.split('/')[-1])[1]
        self.filename = os.path.join(subdir, id+ext)
        """The filename that should be used for this package's file.  It
        is formed by joining C{self.subdir} with C{self.id}, and
        using the same extension as C{url}."""

        self.unzip = bool(int(unzip))
        """A flag indicating whether this corpus should be unzipped by
        default."""

        # Store any additional keyword arguments as attributes.
        self.__dict__.update(kw)
254
255 @staticmethod
260
262 return '<Package %s>' % self.id
263
265 """
266 A directory entry for a collection of downloadable packages.
267 These entries are extracted from the XML index file that is
268 downloaded by L{Downloader}.
269 """
270 - def __init__(self, id, children, name=None, **kw):
271 self.id = id
272 """A unique identifier for this collection."""
273
274 self.name = name or id
275 """A string name for this collection."""
276
277 self.children = children
278 """A list of the L{Collections} or L{Packages} directly
279 contained by this collection."""
280
281 self.packages = None
282 """A list of L{Packages} contained by this collection or any
283 collections it recursively contains."""
284
285
286 self.__dict__.update(kw)
287
288 @staticmethod
294
296 return '<Collection %s>' % self.id
297
303 """A status message object, used by L{incr_download} to
304 communicate its progress."""
306 """Data server has started working on a collection of packages."""
307 - def __init__(self, collection): self.collection = collection
309 """Data server has finished working on a collection of packages."""
310 - def __init__(self, collection): self.collection = collection
312 """Data server has started working on a package."""
313 - def __init__(self, package): self.package = package
315 """Data server has finished working on a package."""
316 - def __init__(self, package): self.package = package
318 """Data server has started downloading a package."""
319 - def __init__(self, package): self.package = package
321 """Data server has finished downloading a package."""
322 - def __init__(self, package): self.package = package
324 """Data server has started unzipping a package."""
325 - def __init__(self, package): self.package = package
327 """Data server has finished unzipping a package."""
328 - def __init__(self, package): self.package = package
330 """The package download file is already up-to-date"""
331 - def __init__(self, package): self.package = package
333 """The package download file is out-of-date or corrupt"""
334 - def __init__(self, package): self.package = package
336 """Data server encountered an error"""
338 self.package = package
339 if isinstance(message, Exception):
340 self.message = str(message)
341 else:
342 self.message = message
343
345 """Indicates how much progress the data server has made"""
346 - def __init__(self, progress): self.progress = progress
348 """Indicates what download directory the data server is using"""
350
356 """
357 A class used to access the NLTK data server, which can be used to
358 download corpora and other data packages.
359 """
360
361
362
363
364
365 INDEX_TIMEOUT = 60*60
366 """The amount of time after which the cached copy of the data
367 server index will be considered 'stale,' and will be
368 re-downloaded."""
369
370 DEFAULT_URL = 'http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml'
371 """The default URL for the NLTK data server's index. An
372 alternative URL can be specified when creating a new
373 C{Downloader} object."""
374
375
376
377
378
379 INSTALLED = 'installed'
380 """A status string indicating that a package or collection is
381 installed and up-to-date."""
382 NOT_INSTALLED = 'not installed'
383 """A status string indicating that a package or collection is
384 not installed."""
385 STALE = 'out of date'
386 """A status string indicating that a package or collection is
387 corrupt or out-of-date."""
388 PARTIAL = 'partial'
389 """A status string indicating that a collection is partially
390 installed (i.e., only some of its packages are installed.)"""
391
392
393
394
395
396 - def __init__(self, server_index_url=None, download_dir=None):
397 self._url = server_index_url or self.DEFAULT_URL
398 """The URL for the data server's index file."""
399
400 self._collections = {}
401 """Dictionary from collection identifier to L{Collection}"""
402
403 self._packages = {}
404 """Dictionary from package identifier to L{Package}"""
405
406 self._download_dir = download_dir
407 """The default directory to which packages will be downloaded."""
408
409 self._index = None
410 """The XML index file downloaded from the data server"""
411
412 self._index_timestamp = None
413 """Time at which L{self._index} was downloaded. If it is more
414 than L{INDEX_TIMEOUT} seconds old, it will be re-downloaded."""
415
416 self._status_cache = {}
417 """Dictionary from package/collection identifier to status
418 string (L{INSTALLED}, L{NOT_INSTALLED}, L{STALE}, or
419 L{PARTIAL}). Cache is used for packages only, not
420 collections."""
421
422 self._errors = None
423 """Flag for telling if all packages got successfully downloaded or not."""
424
425
426 if self._download_dir is None:
427 self._download_dir = self.default_download_dir()
428
429
430
431
432
433 - def list(self, download_dir=None, show_packages=True,
434 show_collections=True, header=True, more_prompt=False,
435 skip_installed=False):
436 lines = 0
437 if download_dir is None:
438 download_dir = self._download_dir
439 print 'Using default data directory (%s)' % download_dir
440 if header:
441 print '='*(26+len(self._url))
442 print ' Data server index for <%s>' % self._url
443 print '='*(26+len(self._url))
444 lines += 3
445 stale = partial = False
446
447 categories = []
448 if show_packages: categories.append('packages')
449 if show_collections: categories.append('collections')
450 for category in categories:
451 print '%s:' % category.capitalize()
452 lines += 1
453 for info in sorted(getattr(self, category)()):
454 status = self.status(info, download_dir)
455 if status == self.INSTALLED and skip_installed: continue
456 if status == self.STALE: stale = True
457 if status == self.PARTIAL: partial = True
458 prefix = {self.INSTALLED:'*', self.STALE:'-',
459 self.PARTIAL:'P', self.NOT_INSTALLED: ' '}[status]
460 name = textwrap.fill('-'*27 + (info.name or info.id),
461 75, subsequent_indent=27*' ')[27:]
462 print ' [%s] %s %s' % (prefix, info.id.ljust(20, '.'), name)
463 lines += len(name.split('\n'))
464 if more_prompt and lines > 20:
465 user_input = raw_input("Hit Enter to continue: ")
466 if (user_input.lower() in ('x', 'q')): return
467 lines = 0
468 print
469 msg = '([*] marks installed packages'
470 if stale: msg += '; [-] marks out-of-date or corrupt packages'
471 if partial: msg += '; [P] marks partially installed collections'
472 print textwrap.fill(msg+')', subsequent_indent=' ', width=76)
473
477
479 self._update_index()
480 return [pkg for (id,pkg) in self._packages.items()
481 if pkg.subdir == 'corpora']
482
484 self._update_index()
485 return [pkg for (id,pkg) in self._packages.items()
486 if pkg.subdir != 'corpora']
487
491
492
493
494
495
497 if isinstance(info_or_id, basestring):
498 return self.info(info_or_id)
499 else:
500 return info_or_id
501
502
503
504
505
506
507
508
509
510
511 - def incr_download(self, info_or_id, download_dir=None, force=False):
512
513 if download_dir is None:
514 download_dir = self._download_dir
515 yield SelectDownloadDirMessage(download_dir)
516
517
518 if isinstance(info_or_id, (list,tuple)):
519 for msg in self._download_list(info_or_id, download_dir, force):
520 yield msg
521 return
522
523
524 try: info = self._info_or_id(info_or_id)
525 except (IOError, ValueError), e:
526 yield ErrorMessage(None, 'Error loading %s: %s' %
527 (info_or_id, e))
528 return
529
530
531 if isinstance(info, Collection):
532 yield StartCollectionMessage(info)
533 for msg in self.incr_download(info.children, download_dir, force):
534 yield msg
535 yield FinishCollectionMessage(info)
536
537
538 else:
539 for msg in self._download_package(info, download_dir, force):
540 yield msg
541
543 if isinstance(item, Package): return 1
544 else: return len(item.packages)
545
567
569 yield StartPackageMessage(info)
570 yield ProgressMessage(0)
571
572
573 status = self.status(info, download_dir)
574 if not force and status == self.INSTALLED:
575 yield UpToDateMessage(info)
576 yield ProgressMessage(100)
577 yield FinishPackageMessage(info)
578 return
579
580
581 self._status_cache.pop(info.id, None)
582
583
584 filepath = os.path.join(download_dir, info.filename)
585 if os.path.exists(filepath):
586 if status == self.STALE:
587 yield StaleMessage(info)
588 os.remove(filepath)
589
590
591 if not os.path.exists(download_dir):
592 os.mkdir(download_dir)
593 if not os.path.exists(os.path.join(download_dir, info.subdir)):
594 os.mkdir(os.path.join(download_dir, info.subdir))
595
596
597
598 yield StartDownloadMessage(info)
599 yield ProgressMessage(5)
600 try:
601 infile = urllib2.urlopen(info.url)
602 outfile = open(filepath, 'wb')
603
604 num_blocks = max(1, float(info.size)/(1024*16))
605 for block in itertools.count():
606 s = infile.read(1024*16)
607 outfile.write(s)
608 if not s: break
609 if block % 2 == 0:
610 yield ProgressMessage(min(80, 5+75*(block/num_blocks)))
611 infile.close()
612 outfile.close()
613 except IOError, e:
614 yield ErrorMessage(info, 'Error downloading %r from <%s>:'
615 '\n %s' % (info.id, info.url, e))
616 return
617 yield FinishDownloadMessage(info)
618 yield ProgressMessage(80)
619
620
621 if info.filename.endswith('.zip'):
622 zipdir = os.path.join(download_dir, info.subdir)
623
624
625 if info.unzip or os.path.exists(os.path.join(zipdir, info.id)):
626 yield StartUnzipMessage(info)
627 for msg in _unzip_iter(filepath, zipdir, verbose=False):
628
629 msg.package = info
630 yield msg
631 yield FinishUnzipMessage(info)
632
633 yield FinishPackageMessage(info)
634
635 - def download(self, info_or_id=None, download_dir=None, quiet=False,
636 force=False, prefix='[nltk_data] ', halt_on_error=True,
637 raise_on_error=False):
638
639 if info_or_id is None:
640
641
642
643 if download_dir is not None: self._download_dir = download_dir
644 self._interactive_download()
645 return True
646
647 else:
648
649 def show(s, prefix2=''):
650 print textwrap.fill(s, initial_indent=prefix+prefix2,
651 subsequent_indent=prefix+prefix2+' '*4)
652
653 for msg in self.incr_download(info_or_id, download_dir, force):
654
655 if isinstance(msg, ErrorMessage):
656 show(msg.message)
657 if raise_on_error:
658 raise ValueError(msg.message)
659 if halt_on_error:
660 return False
661 self._errors = True
662 if not quiet:
663 print "Error installing package. Retry? [n/y/e]"
664 choice = raw_input().strip()
665 if choice in ['y', 'Y']:
666 if not self.download(msg.package.id, download_dir,
667 quiet, force, prefix,
668 halt_on_error, raise_on_error):
669 return False
670 elif choice in ['e', 'E']:
671 return False
672
673
674 if not quiet:
675
676 if isinstance(msg, StartCollectionMessage):
677 show('Downloading collection %r' % msg.collection.id)
678 prefix += ' | '
679 print prefix
680 elif isinstance(msg, FinishCollectionMessage):
681 print prefix
682 prefix = prefix[:-4]
683 if self._errors:
684 show('Downloaded collection %r with errors' %
685 msg.collection.id)
686 else:
687 show('Done downloading collection %r' %
688 msg.collection.id)
689
690
691 elif isinstance(msg, StartPackageMessage):
692 show('Downloading package %r to %s...' %
693 (msg.package.id, download_dir))
694 elif isinstance(msg, UpToDateMessage):
695 show('Package %s is already up-to-date!' %
696 msg.package.id, ' ')
697
698
699
700 elif isinstance(msg, StartUnzipMessage):
701 show('Unzipping %s.' % msg.package.filename, ' ')
702
703
704 elif isinstance(msg, SelectDownloadDirMessage):
705 download_dir = msg.download_dir
706 return True
707
708 - def is_stale(self, info_or_id, download_dir=None):
710
713
715 if id is None:
716 self._status_cache.clear()
717 else:
718 self._status_cache.pop(id, None)
719
720 - def status(self, info_or_id, download_dir=None):
754
756 if not os.path.exists(filepath):
757 return self.NOT_INSTALLED
758
759
760 try: filestat = os.stat(filepath)
761 except OSError: return self.NOT_INSTALLED
762 if filestat.st_size != int(info.size):
763 return self.STALE
764
765
766 if md5_hexdigest(filepath) != info.checksum:
767 return self.STALE
768
769
770
771 if filepath.endswith('.zip'):
772 unzipdir = filepath[:-4]
773 if not os.path.exists(unzipdir):
774 return self.INSTALLED
775 if not os.path.isdir(unzipdir):
776 return self.STALE
777
778 unzipped_size = sum(os.stat(os.path.join(d, f)).st_size
779 for d, _, files in os.walk(unzipdir)
780 for f in files)
781 if unzipped_size != info.unzipped_size:
782 return self.STALE
783
784
785 return self.INSTALLED
786
787 - def update(self, quiet=False, prefix='[nltk_data] '):
795
796
797
798
799
848
850 """
851 Return the XML index describing the packages available from
852 the data server. If necessary, this index will be downloaded
853 from the data server.
854 """
855 self._update_index()
856 return self._index
857
858 - def info(self, id):
859 """Return the L{Package} or L{Collection} record for the
860 given item."""
861 self._update_index()
862 if id in self._packages: return self._packages[id]
863 if id in self._collections: return self._collections[id]
864 raise ValueError('Package %r not found in index' % id)
865
867 """Return the XML info record for the given item"""
868 self._update_index()
869 for package in self._index.findall('packages/package'):
870 if package.get('id') == id:
871 return package
872 for collection in self._index.findall('collections/collection'):
873 if collection.get('id') == id:
874 return collection
875 raise ValueError('Package %r not found in index' % id)
876
877
878
879
880
882
883
884 original_url = self._url
885 try:
886 self._update_index(url)
887 except:
888 self._url = original_url
889 raise
890
891 url = property(lambda self: self._url, _set_url, doc="""
892 The URL for the data server's index file.""")
893
895 """
896 Return the directory to which packages will be downloaded by
897 default. This value can be overridden using the constructor,
898 or on a case-by-case basis using the C{download_dir} argument when
899 calling L{download()}.
900
        The default directory is determined as follows:

          - If any directory in C{nltk.data.path} exists and is
            writable, then return the first such directory.
          - Otherwise, on Windows, if the C{APPDATA} environment
            variable is set, return C{I{APPDATA}/nltk_data}, where
            C{I{APPDATA}} is the value of that variable.
          - Otherwise, return C{~/nltk_data}, where C{~} is the
            current user's home directory.  If the home directory
            cannot be determined, raise C{ValueError}.
918 """
919
920
921 for nltkdir in nltk.data.path:
922 if (os.path.exists(nltkdir) and
923 nltk.internals.is_writable(nltkdir)):
924 return nltkdir
925
926
927 if sys.platform == 'win32' and 'APPDATA' in os.environ:
928 homedir = os.environ['APPDATA']
929
930
931 else:
932 homedir = os.path.expanduser('~/')
933 if homedir == '~/':
934 raise ValueError("Could not find a default download directory")
935
936
937 return os.path.join(homedir, 'nltk_data')
938
943
944 download_dir = property(lambda self: self._download_dir,
945 _set_download_dir, doc="""
946 The default directory to which packages will be downloaded.
947 This defaults to the value returned by L{default_download_dir()}.
948 To override this default on a case-by-case basis, use the
949 C{download_dir} argument when calling L{download()}.""")
950
951
952
953
954
965
968 self._ds = dataserver
969
971 print '-'*75
972 spc = (68 - sum(len(o) for o in options))/(len(options)-1)*' '
973 print ' ' + spc.join(options)
974
975
976
977 print '-'*75
978
979
981 print 'NLTK Downloader'
982 while True:
983 self._simple_interactive_menu(
984 'd) Download', 'l) List', 'c) Config', 'h) Help', 'q) Quit')
985 user_input = raw_input('Downloader> ').strip()
986 if not user_input: print; continue
987 command = user_input.lower().split()[0]
988 args = user_input.split()[1:]
            try:
                if command == 'l':
                    print
                    self._ds.list(self._ds.download_dir, header=False,
                                  more_prompt=True)
                elif command == 'h':
                    self._simple_interactive_help()
                elif command == 'c':
                    self._simple_interactive_config()
                elif command in ('q', 'x'):
                    return
                elif command == 'd':
                    self._simple_interactive_download(args)
                else:
                    print 'Command %r unrecognized' % user_input
            except urllib2.HTTPError, e:
                print 'Error reading from server: %s'%e
            except urllib2.URLError, e:
                print 'Error connecting to server: %s'%e.reason
1008
1009
1010 print
1011
1013 if args:
1014 for arg in args:
1015 try: self._ds.download(arg, prefix=' ')
1016 except (IOError, ValueError), e: print e
1017 else:
1018 while True:
1019 print
1020 print 'Download which package (l=list; x=cancel)?'
1021 user_input = raw_input(' Identifier> ')
1022 if user_input.lower()=='l':
1023 self._ds.list(self._ds.download_dir, header=False,
1024 more_prompt=True, skip_installed=True)
1025 continue
1026 elif user_input.lower() in ('x', 'q', ''):
1027 return
1028 elif user_input:
1029 for id in user_input.split():
1030 try: self._ds.download(id, prefix=' ')
1031 except (IOError, ValueError), e: print e
1032 break
1033
1035 print
1036 print 'Commands:'
1037 print ' d) Download a package or collection h) Help'
1038 print ' l) List packages & collections q) Quit'
1039 print ' c) View & Modify Configuration'
1040
1042 print
1043 print 'Data Server:'
1044 print ' - URL: <%s>' % self._ds.url
1045 print (' - %d Package Collections Available' %
1046 len(self._ds.collections()))
1047 print (' - %d Individual Packages Available' %
1048 len(self._ds.packages()))
1049 print
1050 print 'Local Machine:'
1051 print ' - Data directory: %s' % self._ds.download_dir
1052
1054 self._show_config()
1055 while True:
1056 print
1057 self._simple_interactive_menu(
1058 's) Show Config', 'u) Set Server URL',
1059 'd) Set Data Dir', 'm) Main Menu')
1060 user_input = raw_input('Config> ').strip().lower()
            if user_input == 's':
                self._show_config()
            elif user_input == 'd':
                # Keep the case of the path as typed: directory names
                # are case-sensitive on many platforms.
                new_dl_dir = raw_input(' New Directory> ').strip()
                if new_dl_dir.lower() in ('', 'x', 'q'):
                    print ' Cancelled!'
                elif os.path.isdir(new_dl_dir):
                    self._ds.download_dir = new_dl_dir
                else:
                    print ('Directory %r not found! Create it first.' %
                           new_dl_dir)
            elif user_input == 'u':
                # Keep the case of the URL as typed: URL paths can be
                # case-sensitive.
                new_url = raw_input(' New URL> ').strip()
                if new_url.lower() in ('', 'x', 'q'):
                    print ' Cancelled!'
                else:
                    if not new_url.lower().startswith('http://'):
                        new_url = 'http://'+new_url
                    try: self._ds.url = new_url
                    except Exception, e:
                        print 'Error reading <%r>:\n %s' % (new_url, e)
            elif user_input == 'm':
                break
1084
1086 """
1087 Graphical interface for downloading packages from the NLTK data
1088 server.
1089 """
1090
1091
1092
1093
1094
1095 COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status',
1096 'Unzipped Size',
1097 'Copyright', 'Contact', 'License', 'Author',
1098 'SVN Revision', 'Subdir', 'Checksum']
1099 """A list of the names of columns. This controls the order in
1100 which the columns will appear. If this is edited, then
1101 L{_package_to_columns()} may need to be edited to match."""
1102
1103 COLUMN_WEIGHTS = {'': 0, 'Name': 5, 'Size': 0, 'Status': 0}
1104 """A dictionary specifying how columns should be resized when the
1105 table is resized. Columns with weight 0 will not be resized at
1106 all; and columns with high weight will be resized more.
1107 Default weight (for columns not explicitly listed) is 1."""
1108
1109 COLUMN_WIDTHS = {'':1, 'Identifier':20, 'Name':45,
1110 'Size': 10, 'Unzipped Size': 10,
1111 'Status': 12}
1112 """A dictionary specifying how wide each column should be, in
1113 characters. The default width (for columns not explicitly
1114 listed) is specified by L{DEFAULT_COLUMN_WIDTH}."""
1115
1116 DEFAULT_COLUMN_WIDTH = 30
1117 """The default width for columns that are not explicitly listed
1118 in C{COLUMN_WIDTHS}."""
1119
1120 INITIAL_COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Status']
1121 """The set of columns that should be displayed by default."""
1122
1123
1124
1125 for c in COLUMN_WEIGHTS: assert c in COLUMNS
1126 for c in COLUMN_WIDTHS: assert c in COLUMNS
1127 for c in INITIAL_COLUMNS: assert c in COLUMNS
1128
1129
1130
1131
1132
1133 _BACKDROP_COLOR = ('#000', '#ccc')
1134
1135 _ROW_COLOR = {Downloader.INSTALLED: ('#afa', '#080'),
1136 Downloader.PARTIAL: ('#ffa', '#880'),
1137 Downloader.STALE: ('#faa', '#800'),
1138 Downloader.NOT_INSTALLED: ('#fff', '#888')}
1139
1140 _MARK_COLOR = ('#000', '#ccc')
1141
1142
1143
1144 _FRONT_TAB_COLOR = ('#fff', '#45c')
1145 _BACK_TAB_COLOR = ('#aaa', '#67a')
1146
1147 _PROGRESS_COLOR = ('#f00', '#aaa')
1148
1149 _TAB_FONT = 'helvetica -16 bold'
1150
1151
1152
1153
1154
1155 - def __init__(self, dataserver, use_threads=True):
1156 self._ds = dataserver
1157 self._use_threads = use_threads
1158
1159
1160 self._download_lock = threading.Lock()
1161 self._download_msg_queue = []
1162 self._download_abort_queue = []
1163 self._downloading = False
1164
1165
1166 self._afterid = {}
1167
1168
1169 self._log_messages = []
1170 self._log_indent = 0
1171 self._log('NLTK Downloader Started!')
1172
1173
1174 top = self.top = Tk()
1175 top.geometry('+50+50')
1176 top.title('NLTK Downloader')
1177 top.configure(background=self._BACKDROP_COLOR[1])
1178
1179
1180 top.bind('<Control-q>', self.destroy)
1181 top.bind('<Control-x>', self.destroy)
1182 self._destroyed = False
1183
1184 self._column_vars = {}
1185
1186
1187 self._init_widgets()
1188 self._init_menu()
1189 try:
1190 self._fill_table()
1191 except urllib2.HTTPError, e:
1192 showerror('Error reading from server', e)
1193 except urllib2.URLError, e:
1194 showerror('Error connecting to server', e.reason)
1195
1196 self._show_info()
1197 self._select_columns()
1198 self._table.select(0)
1199
1200
1201
1202 self._table.bind('<Destroy>', self._destroy)
1203
1204 - def _log(self, msg):
1205 self._log_messages.append('%s %s%s' % (time.ctime(),
1206 ' | '*self._log_indent, msg))
1207
1208
1209
1210
1211
1303
1305 menubar = Menu(self.top)
1306
1307 filemenu = Menu(menubar, tearoff=0)
1308 filemenu.add_command(label='Download', underline=0,
1309 command=self._download, accelerator='Return')
1310 filemenu.add_separator()
1311 filemenu.add_command(label='Change Server Index', underline=7,
1312 command=lambda: self._info_edit('url'))
1313 filemenu.add_command(label='Change Download Directory', underline=0,
1314 command=lambda: self._info_edit('download_dir'))
1315 filemenu.add_separator()
1316 filemenu.add_command(label='Show Log', underline=5,
1317 command=self._show_log)
1318 filemenu.add_separator()
1319 filemenu.add_command(label='Exit', underline=1,
1320 command=self.destroy, accelerator='Ctrl-x')
1321 menubar.add_cascade(label='File', underline=0, menu=filemenu)
1322
1323
1324
1325
1326 viewmenu = Menu(menubar, tearoff=0)
1327 for column in self._table.column_names[2:]:
1328 var = IntVar(self.top)
1329 assert column not in self._column_vars
1330 self._column_vars[column] = var
1331 if column in self.INITIAL_COLUMNS: var.set(1)
1332 viewmenu.add_checkbutton(label=column, underline=0, variable=var,
1333 command=self._select_columns)
1334 menubar.add_cascade(label='View', underline=0, menu=viewmenu)
1335
1336
1337
1338
1339 sortmenu = Menu(menubar, tearoff=0)
1340 for column in self._table.column_names[1:]:
1341 sortmenu.add_command(label='Sort by %s' % column,
1342 command=(lambda c=column:
1343 self._table.sort_by(c, 'ascending')))
1344 sortmenu.add_separator()
1345
1346 for column in self._table.column_names[1:]:
1347 sortmenu.add_command(label='Reverse sort by %s' % column,
1348 command=(lambda c=column:
1349 self._table.sort_by(c, 'descending')))
1350 menubar.add_cascade(label='Sort', underline=0, menu=sortmenu)
1351
1352 helpmenu = Menu(menubar, tearoff=0)
1353 helpmenu.add_command(label='About', underline=0,
1354 command=self.about)
1355 helpmenu.add_command(label='Instructions', underline=0,
1356 command=self.help, accelerator='F1')
1357 menubar.add_cascade(label='Help', underline=0, menu=helpmenu)
1358 self.top.bind('<F1>', self.help)
1359
1360 self.top.config(menu=menubar)
1361
    def _select_columns(self):
        # Show or hide each optional column to match its View-menu checkbox.
        for (column, var) in self._column_vars.items():
            if var.get():
                self._table.show_column(column)
            else:
                self._table.hide_column(column)
1368
1370 self._ds.clear_status_cache()
1371 try:
1372 self._fill_table()
1373 except urllib2.HTTPError, e:
1374 showerror('Error reading from server', e)
1375 except urllib2.URLError, e:
1376 showerror('Error connecting to server', e.reason)
1377 self._table.select(0)
1378
1385
1397
1399 if self._table.column_names[col].endswith('Size'):
1400 if isinstance(val, basestring): return ' %s' % val
1401 elif val < 1024**2: return ' %.1f KB' % (val/1024.**1)
1402 elif val < 1024**3: return ' %.1f MB' % (val/1024.**2)
1403 else: return ' %.1f GB' % (val/1024.**3)
1404
1405 if col in (0, ''): return str(val)
1406 else: return ' %s' % val
1407
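# The size column above is rendered with one decimal place in KB, MB or GB.
# A minimal standalone sketch of the same thresholds follows; the name
# format_size is hypothetical and is not part of this module.

```python
def format_size(num_bytes):
    """Render a byte count with one decimal place in KB, MB or GB."""
    if num_bytes < 1024 ** 2:
        return '%.1f KB' % (num_bytes / 1024.0)
    elif num_bytes < 1024 ** 3:
        return '%.1f MB' % (num_bytes / 1024.0 ** 2)
    return '%.1f GB' % (num_bytes / 1024.0 ** 3)


# Example values chosen only for illustration.
for size in (1536, 5 * 1024 ** 2, 3 * 1024 ** 3):
    print(format_size(size))
```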
1409 if url == self._ds.url: return
1410 try:
1411 self._ds.url = url
1412 self._fill_table()
1413 except IOError, e:
1414 showerror('Error Setting Server Index', str(e))
1415 self._show_info()
1416
1417
1431
1441
1443 for i, tab in enumerate(self._tab_names):
1444 if tab.lower() == self._tab and i > 0:
1445 self._tab = self._tab_names[i-1].lower()
1446 try:
1447 return self._fill_table()
1448 except urllib2.HTTPError, e:
1449 showerror('Error reading from server', e)
1450 except urllib2.URLError, e:
1451 showerror('Error connecting to server', e.reason)
1452
1454 for i, tab in enumerate(self._tab_names):
1455 if tab.lower() == self._tab and i < (len(self._tabs)-1):
1456 self._tab = self._tab_names[i+1].lower()
1457 try:
1458 return self._fill_table()
1459 except urllib2.HTTPError, e:
1460 showerror('Error reading from server', e)
1461 except urllib2.URLError, e:
1462 showerror('Error connecting to server', e.reason)
1463
1465 self._tab = event.widget['text'].lower()
1466 try:
1467 self._fill_table()
1468 except urllib2.HTTPError, e:
1469 showerror('Error reading from server', e)
1470 except urllib2.URLError, e:
1471 showerror('Error connecting to server', e.reason)
1472
1473 _tab = 'collections'
1474
1475 _rows = None
1514
    def _update_table_status(self):
        # Refresh the Status cell of every row, then recolor the table.
        for row_num in range(len(self._table)):
            status = self._ds.status(self._table[row_num, 'Identifier'])
            self._table[row_num, 'Status'] = status
        self._color_table()
1520
1537
1538 _DL_DELAY=10
1540 try: msg = download_iter.next()
1541 except StopIteration:
1542
1543 self._update_table_status()
1544 afterid = self.top.after(10, self._show_progress, 0)
1545 self._afterid['_download_cb'] = afterid
1546 return
1547
1548 def show(s):
1549 self._progresslabel['text'] = s
1550 self._log(s)
1551 if isinstance(msg, ProgressMessage):
1552 self._show_progress(msg.progress)
1553 elif isinstance(msg, ErrorMessage):
1554 show(msg.message)
1555 if msg.package is not None:
1556 self._select(msg.package.id)
1557 self._show_progress(None)
1558 return
1559 elif isinstance(msg, StartCollectionMessage):
1560 show('Downloading collection %r' % msg.collection.id)
1561 self._log_indent += 1
1562 elif isinstance(msg, StartPackageMessage):
1563 show('Downloading package %r' % msg.package.id)
1564 elif isinstance(msg, UpToDateMessage):
1565 show('Package %s is up-to-date!' % msg.package.id)
1566
1567
1568 elif isinstance(msg, FinishDownloadMessage):
1569 show('Finished downloading %r.' % msg.package.id)
1570 elif isinstance(msg, StartUnzipMessage):
1571 show('Unzipping %s' % msg.package.filename)
1572 elif isinstance(msg, FinishCollectionMessage):
1573 self._log_indent -= 1
1574 show('Finished downloading collection %r.' % msg.collection.id)
1575 self._clear_mark(msg.collection.id)
1576 elif isinstance(msg, FinishPackageMessage):
1577 self._clear_mark(msg.package.id)
1578 afterid = self.top.after(self._DL_DELAY, self._download_cb,
1579 download_iter, ids)
1580 self._afterid['_download_cb'] = afterid
1581
    def _select(self, id):
        # Select the first row whose identifier matches.
        for row in range(len(self._table)):
            if self._table[row, 'Identifier'] == id:
                self._table.select(row)
                return
1587
    def _color_table(self):
        # Color rows according to status.
        for row in range(len(self._table)):
            bg, sbg = self._ROW_COLOR[self._table[row, 'Status']]
            fg, sfg = ('black', 'white')
            self._table.rowconfig(row, foreground=fg, selectforeground=sfg,
                                  background=bg, selectbackground=sbg)
            # Color the mark column (column 0).
            self._table.itemconfigure(row, 0,
                                      foreground=self._MARK_COLOR[0],
                                      background=self._MARK_COLOR[1])
1599
1600
    def _clear_mark(self, id):
        # Blank the mark column for every row with this identifier.
        for row in range(len(self._table)):
            if self._table[row, 'Identifier'] == id:
                self._table[row, 0] = ''
1605
1607 for row in range(len(self._table)):
1608 self._table[row,0] = 'X'
1609
1618
1622
1624 """
1625 Given a package, return a list of values describing that
1626 package, one for each column in L{self.COLUMNS}.
1627 """
1628 row = []
1629 for column_index, column_name in enumerate(self.COLUMNS):
1630 if column_index == 0:
1631 row.append('')
1632 elif column_name == 'Identifier':
1633 row.append(pkg.id)
1634 elif column_name == 'Status':
1635 row.append(self._ds.status(pkg))
1636 else:
1637 attr = column_name.lower().replace(' ', '_')
1638 row.append(getattr(pkg, attr, 'n/a'))
1639 return row
1640
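# The function above derives each cell from the column name: lowercase it,
# replace spaces with underscores, and look the result up as an attribute,
# falling back to 'n/a'. A small sketch of that mapping follows; FakePackage,
# COLUMNS and package_to_row are hypothetical names used only for
# illustration, and the status lookup is left out.

```python
class FakePackage(object):
    """Hypothetical stand-in for a package record."""
    def __init__(self, id, name, size):
        self.id = id
        self.name = name
        self.size = size


# Assumed column layout; column 0 is the (initially empty) mark column.
COLUMNS = ['', 'Identifier', 'Name', 'Size', 'Unzipped Size']


def package_to_row(pkg):
    row = []
    for index, column_name in enumerate(COLUMNS):
        if index == 0:
            row.append('')
        elif column_name == 'Identifier':
            row.append(pkg.id)
        else:
            # 'Unzipped Size' -> 'unzipped_size'
            attr = column_name.lower().replace(' ', '_')
            row.append(getattr(pkg, attr, 'n/a'))
    return row


row = package_to_row(FakePackage('treebank', 'Penn Treebank Sample', 1740))
print(row)
```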
1641
1642
1643
1644
    def destroy(self, *e):
        if self._destroyed: return
        self.top.destroy()
        self._destroyed = True

    def _destroy(self, *e):
        # Cancel any pending Tk callbacks.
        if self.top is not None:
            for afterid in self._afterid.values():
                self.top.after_cancel(afterid)

        # Abort any download in progress.
        if self._downloading and self._use_threads:
            self._abort_download()

        # Drop the references to the Tk variables.
        self._column_vars.clear()
1663
    def mainloop(self, *args, **kwargs):
        self.top.mainloop(*args, **kwargs)
1666
1667
1668
1669
1670
1671 HELP = textwrap.dedent("""\
1672 This tool can be used to download a variety of corpora and models
1673 that can be used with NLTK. Each corpus or model is distributed
1674 in a single zip file, known as a \"package file.\" You can
1675 download packages individually, or you can download pre-defined
1676 collections of packages.
1677
1678 When you download a package, it will be saved to the \"download
1679 directory.\" A default download directory is chosen when you run
1680
1681 the downloader; but you may also select a different download
1682 directory. On Windows, the default download directory is
1683
1684
1685 \"package.\"
1686
1687 The NLTK downloader can be used to download a variety of corpora,
1688 models, and other data packages.
1689
1690 Keyboard shortcuts::
1691 [return]\t Download
1692 [up]\t Select previous package
1693 [down]\t Select next package
1694 [left]\t Select previous tab
1695 [right]\t Select next tab
1696 """)
1697
    def help(self, *e):
1706
    def about(self, *e):
        ABOUT = ("NLTK Downloader\n"+
                 "Written by Edward Loper")
        TITLE = 'About: NLTK Downloader'
        try:
            from tkMessageBox import Message
            Message(message=ABOUT, title=TITLE).show()
        except ImportError:
            # The toplevel window is stored as self.top, not self._top.
            ShowText(self.top, TITLE, ABOUT)
1716
1717
1718
1719
1720
1721 _gradient_width = 5
1723 c = self._progressbar
1724 width, height = int(c['width']), int(c['height'])
1725 for i in range(0, (int(c['width'])*2)/self._gradient_width):
1726 c.create_line(i*self._gradient_width+20, -20,
1727 i*self._gradient_width-height-20, height+20,
1728 width=self._gradient_width,
1729 fill='#%02x0000' % (80 + abs(i%6-3)*12))
1730 c.addtag_all('gradient')
1731 c.itemconfig('gradient', state='hidden')
1732
1733
1734 c.addtag_withtag('redbox', c.create_rectangle(
1735 0, 0, 0, 0, fill=self._PROGRESS_COLOR[0]))
1736
    def _show_progress(self, percent):
        c = self._progressbar
        if percent is None:
            # Collapse the bar and hide the animated gradient.
            c.coords('redbox', 0, 0, 0, 0)
            c.itemconfig('gradient', state='hidden')
        else:
            width, height = int(c['width']), int(c['height'])
            x = percent * width / 100 + 1
            c.coords('redbox', 0, 0, x, height+1)
1746
    def _progress_alive(self):
        c = self._progressbar
        if not self._downloading:
            c.itemconfig('gradient', state='hidden')
        else:
            # Shift the gradient left, wrapping around once it has moved
            # far enough, and reschedule this callback.
            c.itemconfig('gradient', state='normal')
            x1, y1, x2, y2 = c.bbox('gradient')
            if x1 <= -100:
                c.move('gradient', (self._gradient_width*6)-4, 0)
            else:
                c.move('gradient', -4, 0)
            afterid = self.top.after(200, self._progress_alive)
            self._afterid['_progress_alive'] = afterid
1760
1761
1762
1763
1764
1766
1767
1768 if self._downloading:
1769 self._abort_download()
1770 return
1771
1772
1773 self._download_button['text'] = 'Cancel'
1774
1775 marked = [self._table[row, 'Identifier']
1776 for row in range(len(self._table))
1777 if self._table[row, 0] != '']
1778 selection = self._table.selected_row()
1779 if not marked and selection is not None:
1780 marked = [self._table[selection, 'Identifier']]
1781
1782
1783
1784
1785 ds = Downloader(self._ds.url, self._ds.download_dir)
1786
1787
1788 assert self._download_msg_queue == []
1789 assert self._download_abort_queue == []
1790 self._DownloadThread(ds, marked, self._download_lock,
1791 self._download_msg_queue,
1792 self._download_abort_queue).start()
1793
1794
1795 self._log_indent = 0
1796 self._downloading = True
1797 self._monitor_message_queue()
1798
1799
1800
1801 self._progress_alive()
1802
    def _abort_download(self):
        # Signal the download thread by putting an item on the abort queue.
        if self._downloading:
            self._download_lock.acquire()
            self._download_abort_queue.append('abort')
            self._download_lock.release()
1808
    class _DownloadThread(threading.Thread):
        def __init__(self, data_server, items, lock, message_queue, abort):
            self.data_server = data_server
            self.items = items
            self.lock = lock
            self.message_queue = message_queue
            self.abort = abort
            threading.Thread.__init__(self)

        def run(self):
            for msg in self.data_server.incr_download(self.items):
                self.lock.acquire()
                self.message_queue.append(msg)
                # Check whether the GUI thread has asked us to abort.
                if self.abort:
                    self.message_queue.append('aborted')
                    self.lock.release()
                    return
                self.lock.release()
            self.lock.acquire()
            self.message_queue.append('finished')
            self.lock.release()
1831
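# The download thread above shares two plain lists with the GUI thread, a
# message queue and an abort queue, and guards both with a single lock. The
# sketch below isolates that hand-off without Tk or a data server. Worker and
# its messages are hypothetical, and the lock is taken with a `with` block
# rather than explicit acquire()/release() calls.

```python
import threading


class Worker(threading.Thread):
    def __init__(self, items, lock, message_queue, abort):
        threading.Thread.__init__(self)
        self.items = items
        self.lock = lock
        self.message_queue = message_queue
        self.abort = abort

    def run(self):
        for item in self.items:
            with self.lock:
                self.message_queue.append('processed %s' % item)
                # A non-empty abort list means the consumer wants us to stop.
                if self.abort:
                    self.message_queue.append('aborted')
                    return
        with self.lock:
            self.message_queue.append('finished')


lock = threading.Lock()
messages, abort = [], []
worker = Worker(['a', 'b'], lock, messages, abort)
worker.start()
worker.join()
print(messages)
```

# A consumer would take the same lock, drain the message list, and stop
# polling once it sees 'finished' or 'aborted'.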
1832 _MONITOR_QUEUE_DELAY=100
1834 def show(s):
1835 self._progresslabel['text'] = s
1836 self._log(s)
1837
1838
1839 if not self._download_lock.acquire():
1840 return
1841 for msg in self._download_msg_queue:
1842
1843
1844 if msg == 'finished' or msg == 'aborted':
1845
1846 self._update_table_status()
1847 self._downloading = False
1848 self._download_button['text'] = 'Download'
1849 del self._download_msg_queue[:]
1850 del self._download_abort_queue[:]
1851 self._download_lock.release()
1852 if msg == 'aborted':
1853 show('Download aborted!')
1854 self._show_progress(None)
1855 else:
1856 afterid = self.top.after(100, self._show_progress, None)
1857 self._afterid['_monitor_message_queue'] = afterid
1858 return
1859
1860
1861 elif isinstance(msg, ProgressMessage):
1862 self._show_progress(msg.progress)
1863 elif isinstance(msg, ErrorMessage):
1864 show(msg.message)
1865 if msg.package is not None:
1866 self._select(msg.package.id)
1867 self._show_progress(None)
1868 self._downloading = False
1869 return
1870 elif isinstance(msg, StartCollectionMessage):
1871 show('Downloading collection %r' % msg.collection.id)
1872 self._log_indent += 1
1873 elif isinstance(msg, StartPackageMessage):
1874 self._ds.clear_status_cache(msg.package.id)
1875 show('Downloading package %r' % msg.package.id)
1876 elif isinstance(msg, UpToDateMessage):
1877 show('Package %s is up-to-date!' % msg.package.id)
1878
1879
1880
1881 elif isinstance(msg, FinishDownloadMessage):
1882 show('Finished downloading %r.' % msg.package.id)
1883 elif isinstance(msg, StartUnzipMessage):
1884 show('Unzipping %s' % msg.package.filename)
1885 elif isinstance(msg, FinishUnzipMessage):
1886 show('Finished installing %s' % msg.package.id)
1887 elif isinstance(msg, FinishCollectionMessage):
1888 self._log_indent -= 1
1889 show('Finished downloading collection %r.' % msg.collection.id)
1890 self._clear_mark(msg.collection.id)
1891 elif isinstance(msg, FinishPackageMessage):
1892 self._update_table_status()
1893 self._clear_mark(msg.package.id)
1894
1895
1896
1897
1898 if self._download_abort_queue:
1899 self._progresslabel['text'] = 'Aborting download...'
1900
1901
1902 del self._download_msg_queue[:]
1903 self._download_lock.release()
1904
1905
1906 afterid = self.top.after(self._MONITOR_QUEUE_DELAY,
1907 self._monitor_message_queue)
1908 self._afterid['_monitor_message_queue'] = afterid
1909
def md5_hexdigest(file):
    """
    Calculate and return the MD5 checksum for a given file.  C{file}
    may either be a filename or an open stream.
    """
    if isinstance(file, basestring):
        file = open(file, 'rb')

    md5_digest = md5()
    while True:
        block = file.read(1024*16)  # read in 16 KB blocks
        if not block: break
        md5_digest.update(block)
    return md5_digest.hexdigest()
1929
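# The checksum above is computed incrementally, so a large file never has to
# fit in memory. A minimal sketch of the same block-wise loop using hashlib
# and an in-memory stream follows; md5_of_stream is a hypothetical name.

```python
import hashlib
import io


def md5_of_stream(stream, block_size=16 * 1024):
    """Return the hex MD5 digest of a binary stream, read block by block."""
    digest = hashlib.md5()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        digest.update(block)
    return digest.hexdigest()


data = b'nltk' * 10000
checksum = md5_of_stream(io.BytesIO(data))
print(checksum)
```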
1930
1931
1932
def unzip(filename, root, verbose=True):
    """
    Extract the contents of the zip file C{filename} into the
    directory C{root}.
    """
    for message in _unzip_iter(filename, root, verbose):
        if isinstance(message, ErrorMessage):
            raise Exception(message)
1941
def _unzip_iter(filename, root, verbose=True):
    if verbose:
        sys.stdout.write('Unzipping %s' % os.path.split(filename)[1])
        sys.stdout.flush()

    try: zf = zipfile.ZipFile(filename)
    except zipfile.error, e:
        yield ErrorMessage(filename, 'Error with downloaded zip file')
        return
    except Exception, e:
        yield ErrorMessage(filename, e)
        return

    # Split the archive listing into directories and files.
    namelist = zf.namelist()
    dirlist = [x for x in namelist if x.endswith('/')]
    filelist = [x for x in namelist if not x.endswith('/')]

    # Create the target directory if it doesn't exist.
    if not os.path.exists(root):
        os.mkdir(root)

    # Create the directory structure, parents first.
    for dirname in sorted(dirlist):
        pieces = dirname[:-1].split('/')
        for i in range(len(pieces)):
            dirpath = os.path.join(root, *pieces[:i+1])
            if not os.path.exists(dirpath):
                os.mkdir(dirpath)

    # Extract the files.  Each member is read before its output file is
    # opened, so a failed read does not leave an open, empty file behind.
    for i, member in enumerate(filelist):
        filepath = os.path.join(root, *member.split('/'))
        try: contents = zf.read(member)
        except Exception, e:
            yield ErrorMessage(member, e)
            return
        out = open(filepath, 'wb')
        out.write(contents)
        out.close()
        # Print roughly ten progress dots per archive.
        if verbose and (i*10/len(filelist) > (i-1)*10/len(filelist)):
            sys.stdout.write('.')
            sys.stdout.flush()
    if verbose:
        print
1987
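# The extraction routine above creates the directory tree first and then
# writes each member file. The sketch below shows the same idea in a compact
# form and exercises it on a throwaway archive. extract_all is a hypothetical
# helper; it assumes a trusted archive and does not guard against member
# names that escape the target directory.

```python
import os
import tempfile
import zipfile


def extract_all(zip_path, root):
    """Extract zip_path under root and return the list of files written."""
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            target = os.path.join(root, *name.split('/'))
            if name.endswith('/'):
                # Directory entry: just make sure it exists.
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, 'wb') as out:
                out.write(zf.read(name))
            written.append(target)
    return written


# Build a small archive in a temporary directory, then extract it.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'demo.zip')
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('demo/README', 'hello')
    zf.writestr('demo/data/words.txt', 'a\nb\n')
files = extract_all(archive, os.path.join(workdir, 'out'))
print(files)
```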
1988
1989
1990
1991
1992 import subprocess, zipfile
1993
def build_index(root, base_url):
1995 """
1996 Create a new data.xml index file, by combining the xml description
1997 files for various packages and collections. C{root} should be the
1998 path to a directory containing the package xml and zip files; and
1999 the collection xml files. The C{root} directory is expected to
2000 have the following subdirectories::
2001
2002 root/
2003 packages/ .................. subdirectory for packages
2004 corpora/ ................. zip & xml files for corpora
2005 grammars/ ................ zip & xml files for grammars
2006 taggers/ ................. zip & xml files for taggers
2007 tokenizers/ .............. zip & xml files for tokenizers
2008 etc.
2009 collections/ ............... xml files for collections
2010
2011 For each package, there should be two files: C{I{package}.zip}
2012 contains the package itself, as a compressed zip file; and
2013 C{I{package}.xml} is an xml description of the package. The
2014 zipfile C{I{package}.zip} should expand to a single subdirectory
2015 named C{I{package/}}. The base filename C{I{package}} must match
2016 the identifier given in the package's xml file.
2017
    For each collection, there should be a single file
    C{I{collection}.xml}, describing the collection.
2020
2021 All identifiers (for both packages and collections) must be unique.
2022 """
2023
2024 packages = []
2025 for pkg_xml, zf, subdir in _find_packages(os.path.join(root, 'packages')):
2026 zipstat = os.stat(zf.filename)
2027 url = '%s/%s/%s' % (base_url, subdir, os.path.split(zf.filename)[1])
2028 unzipped_size = sum(zf_info.file_size for zf_info in zf.infolist())
2029
2030
2031 pkg_xml.set('unzipped_size', '%s' % unzipped_size)
2032 pkg_xml.set('size', '%s' % zipstat.st_size)
2033 pkg_xml.set('checksum', '%s' % md5_hexdigest(zf.filename))
2034 pkg_xml.set('subdir', subdir)
2035
2036 pkg_xml.set('url', url)
2037
2038
2039 packages.append(pkg_xml)
2040
2041
2042 collections = list(_find_collections(os.path.join(root, 'collections')))
2043
2044
2045 uids = set()
2046 for item in packages+collections:
2047 if item.get('id') in uids:
2048 raise ValueError('Duplicate UID: %s' % item.get('id'))
2049 uids.add(item.get('id'))
2050
2051
2052 top_elt = ElementTree.Element('nltk_data')
2053 top_elt.append(ElementTree.Element('packages'))
2054 for package in packages: top_elt[0].append(package)
2055 top_elt.append(ElementTree.Element('collections'))
2056 for collection in collections: top_elt[1].append(collection)
2057
2058 _indent_xml(top_elt)
2059 return top_elt
2060
def _indent_xml(xml, prefix=''):
    """
    Helper for L{build_index()}: Given an XML ElementTree, modify the
    C{text} and C{tail} attributes of it and its descendants to
    generate an indented tree, where each nested element is indented
    by 2 spaces with respect to its parent.
    """
    if len(xml) > 0:
        xml.text = (xml.text or '').strip() + '\n' + prefix + '  '
        for child in xml:
            _indent_xml(child, prefix+'  ')
        for child in xml[:-1]:
            child.tail = (child.tail or '').strip() + '\n' + prefix + '  '
        xml[-1].tail = (xml[-1].tail or '').strip() + '\n' + prefix
2075
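# The indentation helper above works purely by rewriting the text and tail
# whitespace of each element. A self-contained sketch of the same recursion
# on a tiny tree follows; indent_xml and the sample element names are used
# only for illustration.

```python
import xml.etree.ElementTree as ET


def indent_xml(elt, prefix=''):
    """Indent elt's children by two spaces per nesting level, in place."""
    if len(elt) > 0:
        elt.text = (elt.text or '').strip() + '\n' + prefix + '  '
        for child in elt:
            indent_xml(child, prefix + '  ')
        # Every child but the last is followed by a sibling at the same depth.
        for child in elt[:-1]:
            child.tail = (child.tail or '').strip() + '\n' + prefix + '  '
        # The last child is followed by the parent's closing tag.
        elt[-1].tail = (elt[-1].tail or '').strip() + '\n' + prefix


root = ET.Element('nltk_data')
packages = ET.SubElement(root, 'packages')
ET.SubElement(packages, 'package', id='treebank')
ET.SubElement(root, 'collections')
indent_xml(root)
text = ET.tostring(root, encoding='unicode')
print(text)
```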
2077 """
2078 Helper for L{build_index()}: Perform some checks to make sure that
2079 the given package is consistent.
2080 """
2081
2082 uid = os.path.splitext(os.path.split(zipfilename)[1])[0]
2083 if pkg_xml.get('id') != uid:
2084 raise ValueError('package identifier mismatch (%s vs %s)' %
2085 (pkg_xml.get('id'), uid))
2086
2087
2088 if sum( (name!=uid and not name.startswith(uid+'/'))
2089 for name in zf.namelist() ):
2090 raise ValueError('Zipfile %s.zip does not expand to a single '
2091 'subdirectory %s/' % (uid, uid))
2092
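# The consistency check above rejects any archive with a member outside the
# package's own top-level directory. The predicate can be sketched on plain
# name lists as follows; expands_to_single_dir is a hypothetical name.

```python
def expands_to_single_dir(names, uid):
    """True if every archive member is uid itself or lives under uid/."""
    return all(name == uid or name.startswith(uid + '/') for name in names)


print(expands_to_single_dir(['abc/', 'abc/README'], 'abc'))
print(expands_to_single_dir(['abc/README', 'other.txt'], 'abc'))
```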
2095 """
2096 Helper for L{build_index()}: Calculate the subversion revision
2097 number for a given file (by using C{subprocess} to run C{svn}).
2098 """
2099 p = subprocess.Popen(['svn', 'status', '-v', filename],
2100 stdout=subprocess.PIPE,
2101 stderr=subprocess.PIPE)
2102 (stdout, stderr) = p.communicate()
2103 if p.returncode != 0 or stderr or not stdout:
2104 raise ValueError('Error determining svn_revision for %s: %s' %
2105 (os.path.split(filename)[1], textwrap.fill(stderr)))
2106 return stdout.split()[2]
2107
2119
def _find_packages(root):
    """
    Helper for L{build_index()}: Yield tuples C{(pkg_xml, zf, subdir)},
    where:
      - C{pkg_xml} is an ElementTree.Element holding the xml for a
        package.
      - C{zf} is a zipfile.ZipFile for the package's contents.
      - C{subdir} is the subdirectory (relative to C{root}) where
        the package was found (e.g. 'corpora' or 'grammars').
    """
2130 from nltk.corpus.reader.util import _path_from
2131
2132 packages = []
2133 for dirname, subdirs, files in os.walk(root):
2134 relpath = '/'.join(_path_from(root, dirname))
2135 for filename in files:
2136 if filename.endswith('.xml'):
2137 xmlfilename = os.path.join(dirname, filename)
2138 zipfilename = xmlfilename[:-4]+'.zip'
2139 try: zf = zipfile.ZipFile(zipfilename)
2140 except Exception, e:
2141 raise ValueError('Error reading file %r!\n%s' %
2142 (zipfilename, e))
2143 try: pkg_xml = ElementTree.parse(xmlfilename).getroot()
2144 except Exception, e:
2145 raise ValueError('Error reading file %r!\n%s' %
2146 (xmlfilename, e))
2147
2148
2149 uid = os.path.split(xmlfilename[:-4])[1]
2150 if pkg_xml.get('id') != uid:
2151 raise ValueError('package identifier mismatch (%s '
2152 'vs %s)' % (pkg_xml.get('id'), uid))
2153
2154
2155
2156 if sum( (name!=uid and not name.startswith(uid+'/'))
2157 for name in zf.namelist() ):
2158 raise ValueError('Zipfile %s.zip does not expand to a '
2159 'single subdirectory %s/' % (uid, uid))
2160
2161 yield pkg_xml, zf, relpath
2162
2163 try: subdirs.remove('.svn')
2164 except ValueError: pass
2165
2166
2167
2168
2169
2170
2171
2172
2173 _downloader = Downloader()
2174 download = _downloader.download
2178
if __name__ == '__main__':
    from optparse import OptionParser
    parser = OptionParser()
    parser.add_option("-d", "--dir", dest="dir",
        help="download package to directory DIR", metavar="DIR")
    parser.add_option("-q", "--quiet", dest="quiet", action="store_true",
        default=False, help="work quietly")
    parser.add_option("-f", "--force", dest="force", action="store_true",
        default=False, help="download even if already installed")
    parser.add_option("-e", "--exit-on-error", dest="halt_on_error",
        action="store_true", default=False, help="exit if an error occurs")

    (options, args) = parser.parse_args()

    if args:
        # Download each package named on the command line.
        for pkg_id in args:
            rv = download(info_or_id=pkg_id, download_dir=options.dir,
                quiet=options.quiet, force=options.force,
                halt_on_error=options.halt_on_error)
            if rv==False and options.halt_on_error:
                break
    else:
        # No package given: fall back to the interactive downloader.
        download(download_dir=options.dir,
            quiet=options.quiet, force=options.force,
            halt_on_error=options.halt_on_error)
2204
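# The command-line block above accepts a target directory, three boolean
# flags, and any number of package identifiers. As a sketch only, the same
# option set could be declared with argparse as shown below. build_parser and
# the `packages` positional are hypothetical, and no download is performed.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description='Download data packages.')
    parser.add_argument('-d', '--dir', dest='dir', metavar='DIR',
                        help='download package to directory DIR')
    parser.add_argument('-q', '--quiet', action='store_true',
                        help='work quietly')
    parser.add_argument('-f', '--force', action='store_true',
                        help='download even if already installed')
    parser.add_argument('-e', '--exit-on-error', dest='halt_on_error',
                        action='store_true', help='exit if an error occurs')
    parser.add_argument('packages', nargs='*',
                        help='identifiers of the packages to download')
    return parser


# Parse an example argument list instead of sys.argv.
options = build_parser().parse_args(['-q', '-d', '/tmp/nltk_data', 'treebank'])
print(options)
```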