diff README.txt @ 0:b0942f44413f

import from git://github.com/mozilla/toolbox.git
author Jeff Hammel <k0scist@gmail.com>
date Sun, 11 May 2014 09:15:35 -0700
children 2ba55733b788
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.txt	Sun May 11 09:15:35 2014 -0700
@@ -0,0 +1,343 @@
+The Story of Toolbox
+Toolbox is fundamentally a document-oriented approach to resource
+indexing.  A "tool" consists of three mandatory string fields -- name,
+description, and URL -- that are generic to the large class of problems
+of web resources, as well as classifiers, such as author, usage, type,
+etc. A tool may have an arbitrary number of classifier fields as
+needed.  Each classifier consists of a set of values with which a tool
+is tagged. This gives toolbox the flexibility to fit a large number of
+data models, such as PyPI, DOAP, and others.
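The shape of a tool described above can be sketched as a plain mapping. This is a minimal illustration, not toolbox's own code: the ``validate`` helper, the sample values, and the example URL are hypothetical, and only the three mandatory field names come from the text.

```python
# Hypothetical sketch of a tool record: three mandatory string fields
# plus any number of classifier fields, each holding a set of values.
REQUIRED = ("name", "description", "url")


def validate(tool):
    """Return a normalized copy of ``tool`` or raise ValueError."""
    missing = [key for key in REQUIRED
               if not isinstance(tool.get(key), str) or not tool[key].strip()]
    if missing:
        raise ValueError("missing required fields: %s" % ", ".join(missing))
    normalized = {key: tool[key] for key in REQUIRED}
    for field, values in tool.items():
        if field in REQUIRED:
            continue
        if isinstance(values, str):
            values = [values]  # allow a single value as shorthand
        # a classifier is a set of values; store it sorted and de-duplicated
        normalized[field] = sorted(set(values))
    return normalized


example = validate({
    "name": "toolbox",
    "description": "an index of software tools",
    "url": "https://example.org/toolbox",  # placeholder URL
    "author": "someone",
    "type": ["webapp", "index", "webapp"],
})
```

Any field other than the mandatory three is treated as a classifier, which is what lets the same record shape cover different data models.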
+Running Toolbox
+You can download and run the toolbox software yourself:
+To serve in baseline mode, install the software and run::
+ paster serve paste.ini
+This will serve the handlers and static content using the paste
+(http://pythonpaste.org) webserver, with ``README.txt`` as the
+``/about`` page and the data taken from ``sample``.
+The dispatcher (``toolbox.dispatcher:Dispatcher``) is the central (WSGI)
+webapp that delegates each request to one of a number of handlers (from
+``handlers.py``).  The dispatcher has a few options:
+* about: path to a reStructuredText file to serve at ``/about``
+* model_type: name of the backend to use (memory_cache, file_cache, or couch)
+* template_dir: extra directory to look for templates
+These may be configured in the ``paste.ini`` file in the
+``[app:toolbox]`` section by prefixing each option name with the namespace
+``toolbox.``. It is advisable that you copy the example ``paste.ini``
+file for your own usage needs.  Additional ``toolbox.``-namespaced
+arguments will be passed to the model.  For instance, to specify the
+directory for the ``file_cache`` model, the provided ``paste.ini`` uses
+``toolbox.directory = %(here)s/sample``.
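How the ``toolbox.`` namespace might be split into dispatcher options and model arguments can be sketched with the standard library. This is an assumption-laden illustration rather than the actual factory code: ``split_options``, the ``debug`` option, and the ``/srv/toolbox`` value for ``here`` are hypothetical; the three dispatcher option names and the ``toolbox.directory`` line are taken from the text.

```python
import configparser

# A cut-down [app:toolbox] section; 'debug' is a hypothetical option that
# lies outside the toolbox. namespace and should therefore be ignored.
INI = """
[app:toolbox]
debug = false
toolbox.about = %(here)s/README.txt
toolbox.model_type = file_cache
toolbox.directory = %(here)s/sample
"""

DISPATCHER_OPTIONS = ("about", "model_type", "template_dir")


def split_options(section, namespace="toolbox."):
    """Separate dispatcher options from arguments passed on to the model."""
    dispatcher, model = {}, {}
    for key, value in section.items():
        if not key.startswith(namespace):
            continue
        name = key[len(namespace):]
        target = dispatcher if name in DISPATCHER_OPTIONS else model
        target[name] = value
    return dispatcher, model


# paste supplies 'here' (the directory of the ini file); it is faked here
parser = configparser.ConfigParser(defaults={"here": "/srv/toolbox"})
parser.read_string(INI)
dispatcher_kw, model_kw = split_options(parser["app:toolbox"])
```

With this input, ``model_type`` and ``about`` end up with the dispatcher while ``directory`` is left over for the ``file_cache`` model.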
+Toolbox uses a fairly simple architecture with a single abstract data
+model allowing an arbitrary number of implementations to be constructed::
+ Interfaces            Implementations
+ +----+              +-+-----+
+ |HTTP|              | |files|
+ +----+---\  +-----+ | +-----+
+           |-|model|-+-+-----+
+ +------+-/  +-----+ | |couch|
+ |script|            | +-----+
+ +------+            +-+------+
+                     | |memory|
+                     | +------+
+                     +-+---+
+                       |...|
+                       +---+
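The single abstract model with interchangeable implementations might look roughly like the following. It is a sketch under stated assumptions: the class names, the ``BACKENDS`` registry, and ``make_model`` are hypothetical, and only the ``update``/``get`` method names are borrowed from the ``export`` snippet shown later in this README.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical abstract data model: one setter, one getter."""

    @abstractmethod
    def update(self, project):
        """create or replace the project stored under project['name']"""

    @abstractmethod
    def get(self):
        """return a list of all stored projects"""


class MemoryModel(Model):
    """Simplest possible implementation: keep everything in a dict."""

    def __init__(self):
        self._projects = {}

    def update(self, project):
        self._projects[project["name"]] = dict(project)

    def get(self):
        return [dict(project) for project in self._projects.values()]


# Interfaces (HTTP, scripts) only ever see Model; the backend is chosen
# by name, much as the model_type option does.
BACKENDS = {"memory": MemoryModel}


def make_model(model_type, **kwargs):
    return BACKENDS[model_type](**kwargs)


model = make_model("memory")
model.update({"name": "toolbox", "description": "tool index",
              "url": "https://example.org/toolbox"})
```

Adding a files or couch backend would then mean writing another ``Model`` subclass and registering it, without touching the interfaces.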
+Toolbox was originally intended to use a directory of files, one per project,
+as the backend. These were originally intended to be HTML files as the
+above model may be clearly mapped as HTML::
+ <div class="project"><h1><a href="{{url}}">{{name}}</a></h1>
+ <p class="description">{{description}}</p>
+ {{for field in fields}}
+  <ul class="{{field}}">
+  {{for value in values[field]}}
+   <li>{{value}}</li>
+  {{endfor}}
+ {{endfor}}
+ </div>
+This microformat approach allows not only easy editing of the HTML
+documents, but the documents may be independently served and displayed
+without the toolbox server-side. 
+The HTML microformat was never implemented (though, since the model
+backend is pluggable, it easily could be). Instead, the original
+implementation used JSON blobs stored in one file per tool. This
+approach loses the displayable aspect, though since JSON is a defined
+format with several good tools for exploring and manipulating the data
+perhaps this disadvantage is offset.
+A couch backend was also written.
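++-----+------------+-----------+------------+
+|     |Displayable?|File-based?|Concurrency?|
++=====+============+===========+============+
+|HTML |Yes         |Yes        |No          |
++-----+------------+-----------+------------+
+|JSON |Not really  |Yes        |No          |
++-----+------------+-----------+------------+
+|Couch|No          |No         |Yes?        |
++-----+------------+-----------+------------+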
+The concurrency issue with file-based document backends may be
+overcome by using locked files.  Ideally, this is accomplished at the
+filesystem level.  If your filesystem does not provide this
+functionality, it may be introduced programmatically.  A rough cartoon
+of a good implementation is as follows:
+1. A worker thread is spawned to write the data asynchronously. The
+data is sent to the worker thread.
+2. The worker checks for the presence of a lockfile (herein further
+detailed). If the lockfile exists and is owned by an active process,
+the worker waits until said process is done with it. (For a more
+robust implementation, the worker sends a request to write the file to
+some controller.)
+3. The worker owns a lockfile based on its PID in some directory
+parallel to the directory root under consideration (for example,
+4. The worker writes to the file.
+5. The worker removes the lock.
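The steps above can be sketched in Python. This is one possible reading, not an implementation from toolbox: the function names are hypothetical, the lockfile is named after the target file and stores the owner's PID (the naming scheme in step 3 is not spelled out here), the PID check via ``os.kill(pid, 0)`` is POSIX-only, and the controller mentioned in step 2 is left out.

```python
import os
import tempfile
import threading
import time


def pid_alive(pid):
    """POSIX-only check: signal 0 tests for existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire(lock_path, poll=0.05, timeout=5.0):
    """Steps 2 and 3: wait for any live owner, then create our lockfile."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one worker succeeds
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                with open(lock_path) as f:
                    owner = int(f.read().strip() or 0)
            except (FileNotFoundError, ValueError):
                owner = 0  # lock vanished or is half-written; just retry
            if owner and not pid_alive(owner):
                try:
                    os.remove(lock_path)  # stale lock from a dead process
                except FileNotFoundError:
                    pass
                continue
            if time.monotonic() > deadline:
                raise TimeoutError("could not lock %s" % lock_path)
            time.sleep(poll)
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return


def write_locked(path, data, lock_dir):
    """Steps 2-5, run inside the worker thread."""
    lock_path = os.path.join(lock_dir, os.path.basename(path) + ".lock")
    acquire(lock_path)
    try:
        with open(path, "w") as f:  # step 4
            f.write(data)
    finally:
        os.remove(lock_path)  # step 5


def write_async(path, data, lock_dir):
    """Step 1: hand the data to a worker thread."""
    worker = threading.Thread(target=write_locked, args=(path, data, lock_dir))
    worker.start()
    return worker


with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "sample")
    lock_dir = os.path.join(root, "locks")  # parallel to the data directory
    os.mkdir(data_dir)
    os.mkdir(lock_dir)
    target = os.path.join(data_dir, "tool.json")
    workers = [write_async(target, '{"n": %d}' % n, lock_dir) for n in range(4)]
    for worker in workers:
        worker.join()
    with open(target) as f:
        final = f.read()
    leftover_locks = os.listdir(lock_dir)
```

Because every write happens under the lock, the file always holds one complete payload, though which writer lands last is not determined.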
+The toolbox web service uses a dispatcher->handler framework.  The
+handlers are loosely pluggable (they are assigned in the dispatcher),
+but could (and probably should) be made completely pluggable.  That
+said, the toolbox web system features an integration of templates,
+static resources (javascript, css, images), and handlers, so true
+pluggability is further away than just supporting pluggable handlers
+in the dispatcher.
+Deployment, however, may be tailored as desired.  Any of the given
+templates may be overridden via passing a ``template_dir`` parameter
+with a path to a directory that has templates of the appropriate
+names as found in toolbox's ``templates`` directory.
+Likewise, the static files (css, js, etc.) are served using ``paste``'s 
+``StaticURLParser`` out of toolbox's ``static`` directory. (See
+toolbox's ``factory.py``.) Notably this is *not* done using the WSGI
+app itself.  Doing it with middleware allows the deployment to be
+customizable by writing your own factory.  For example, instead of
+using the ``paste`` webserver and the included ``paste.ini``, you
+could use nginx or Apache and ``mod_wsgi`` with a factory file
+invoking ``Dispatcher`` with the desired arguments and serving the
+static files with an arbitrary static file server.
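The idea of a factory that wires a dispatcher together with a separate static file server can be sketched using only the standard library. Everything here is a stand-in: ``static_app`` plays the role that ``StaticURLParser`` plays in toolbox, the ``/static`` prefix is an assumption, and ``dispatcher`` is a dummy WSGI app rather than toolbox's ``Dispatcher``.

```python
import mimetypes
import os
import tempfile
from wsgiref.util import setup_testing_defaults


def static_app(directory):
    """Tiny stand-in for a static file server rooted at ``directory``."""
    root = os.path.realpath(directory)

    def app(environ, start_response):
        relative = environ.get("PATH_INFO", "").lstrip("/")
        path = os.path.realpath(os.path.join(root, relative))
        inside = path.startswith(root + os.sep)  # refuse paths escaping root
        if not inside or not os.path.isfile(path):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        with open(path, "rb") as f:
            body = f.read()
        start_response("200 OK", [("Content-Type", content_type),
                                  ("Content-Length", str(len(body)))])
        return [body]

    return app


def factory(dispatcher, static_dir, prefix="/static"):
    """Compose the webapp and the static server outside the webapp itself."""
    static = static_app(static_dir)

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(prefix + "/"):
            shifted = dict(environ, PATH_INFO=path[len(prefix):])
            return static(shifted, start_response)
        return dispatcher(environ, start_response)

    return app


def call(app, path):
    """Invoke a WSGI app once and return (status, body)."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body


def dispatcher(environ, start_response):
    """Dummy webapp standing in for the real dispatcher."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"dispatched"]


with tempfile.TemporaryDirectory() as static_dir:
    with open(os.path.join(static_dir, "style.css"), "wb") as f:
        f.write(b"body{}")
    app = factory(dispatcher, static_dir)
    results = {path: call(app, path)
               for path in ("/static/style.css", "/about",
                            "/static/../style.css")}
```

Swapping ``static_app`` for nginx, Apache, or any other static server changes only the factory, which is the point of keeping static serving out of the webapp.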
+It is common sense, if rarely followed, that deployment should be
+simple.  If you want to get toolbox running on your desktop and/or for
+testing, you should be able to do this easily (see the ``INSTALL.sh``
+for a simple installation using ``bash``; you'll probably want to
+perform these steps by hand for any sort of real-world deployment).
+If you want a highly customized deployment, then this will require
+more expertise and manual setup.
+The template data and the JSON are closely tied together.  This has the
+distinct advantage of avoiding data translation steps and avoiding
+code duplication.
+Toolbox uses several light-footprint libraries:
+* webob for Request/Response handling: http://pythonpaste.org/webob/
+* tempita for (HTML) templates: http://pythonpaste.org/tempita/
+* whoosh for search.  This pure-python implementation of full-text
+  search is relatively fast (for python) and should scale decently to
+  the target scale of toolbox (1000s or 10000s of tools). While not as
+  fast as lucene, whoosh is easy to deploy and has a good API and
+  preserves toolbox as a deployable software product versus an
+  instance that requires the expert configuration, maintenance, and
+  tuning of several disparate software products that is both
+  non-automatable (cannot be installed with a script) and
+  time-consuming. http://packages.python.org/Whoosh/
+* jQuery: jQuery is the best JavaScript library and everyone
+  should use it. http://jquery.com/
+* jeditable for AJAXy editing: http://www.appelsiini.net/projects/jeditable
+* jquery-tokeninput for autocomplete: http://loopj.com/jquery-tokeninput/
+* less for dynamic stylesheets: http://lesscss.org/
+User Interaction
+A user will typically interact with Toolbox through the AJAX web
+interface.  The server side returns relatively simple (HTML) markup,
+but structured in such a way that JavaScript may be utilized to
+promote rich interaction.  The simple HTML + complex JS manifests
+several things:
+1. The document is a document. The tools HTML presented to the user (with
+the current objectionable exception of the per-project Delete button)
+is a document form of the data. It can be clearly and easily
+translated to data (for e.g. import/export) or simply marked up using
+(e.g.) JS to add functionality. By keeping concerns separate
+(presentation layer vs. interaction layer) a self-evident clarity is
+achieved.
+2. Computation is shifted client-side. Often, an otherwise lightweight
+webapp loses considerable performance rendering complex templates. By
+keeping the templates light-weight and doing control presentation and
+handling in JS, high performance is preserved.
+What Toolbox Doesn't Do
+* versioning: toolbox exposes editing towards a canonical document.
+  It doesn't do versioning.  A model instance may do whatever
+  versioning it desires, and since the models are pluggable, it would
+  be relatively painless to subclass e.g. the file-based model and
+  have a post-save hook which does an e.g. ``hg commit``. Customized
+  templates could be used to display this information.
+* authentication: the information presented by toolbox is freely
+  readable and editable. This is by intention, as by going to a "wiki"
+  model and presenting an easy-to-use, context-switching-free interface
+  curation is encouraged (ignoring the possibly imaginary problem of
+  wiki-spam). Access-level auth could be implemented using WSGI
+  middleware (e.g. repoze.who or bitsyauth) or through a front end
+  "webserver" integration layer such as Apache or nginx. Finer grained
+  control of the presentation layer could be realized by using custom
+  templates.
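The versioning idea from the list above (subclass a file-based model and commit from a post-save hook) can be sketched as follows. The class and method names are hypothetical and do not come from toolbox; the sketch assumes the data directory is already a Mercurial repository, and the demonstration records the command instead of running ``hg`` so that it works without Mercurial installed.

```python
import json
import os
import subprocess
import tempfile


class FileModel:
    """Hypothetical file-based model: one JSON blob per tool."""

    def __init__(self, directory):
        self.directory = directory

    def update(self, project):
        # simplification: assumes the tool name is a safe filename
        path = os.path.join(self.directory, project["name"] + ".json")
        with open(path, "w") as f:
            json.dump(project, f, indent=2, sort_keys=True)
        self.post_save(project)

    def post_save(self, project):
        """hook for subclasses; the base model does no versioning"""


class HgFileModel(FileModel):
    """Commit every save; ``run`` is injectable so the hook can be tested."""

    def __init__(self, directory, run=subprocess.check_call):
        super().__init__(directory)
        self.run = run

    def post_save(self, project):
        message = "update %s" % project["name"]
        self.run(["hg", "commit", "--addremove", "-m", message],
                 cwd=self.directory)


commands = []


def record(command, cwd=None):
    """Stand-in for subprocess.check_call that only records the call."""
    commands.append((command, cwd))


with tempfile.TemporaryDirectory() as repo:
    model = HgFileModel(repo, run=record)
    model.update({"name": "toolbox", "description": "tool index",
                  "url": "https://example.org/toolbox"})
    saved = os.listdir(repo)
```

Passing the default ``run`` would invoke ``hg commit`` for real, leaving the history in the repository for customized templates to display.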
+What Toolbox Would Like To Do
+Ultimately, toolbox should be as federated as possible.  The basic
+architecture of toolbox as a web service + supporting scripts makes
+this feasible and more self-contained than most proposed federated
+services.  The basic federated model has proved, in practice,
+difficult to achieve purely through the (HTTP) client-server model:
+without complete federation and adherence to protocol, offline cron
+jobs must be used to pull external data sources. If a webservice
+only desires to talk to others of its own type and is willing to keep
+a queue of requests for when hosts are offline, full HTTP federation
+may be implemented with only a configuration-specified discovery
+service to find the nodes.
+Often, a piece of software is presented as a state out of context (that
+is, minus the evolution which led it to be and which leads it to look
+out beyond the horizon).  While this is an interesting special
+effect for an art project, software is communication, and such a
+presentation only suits software in the darkest of black-box approaches.
+"Beers are like web frameworks: if they're not micro, you don't know
+what you're talking about." - hipsterhacker
+For sites that fit the architecture of a given framework, it may be
+advisable to make use of it.  However, for most webapp/webservice
+categories which have a finite scope and definitive intent, it is
+often easier, more maintainable, and more legible to build a complete
+HTTP->WSGI->app architecture than to try to hammer a framework into
+fitting your problem or redefining the problem to fit the framework.
+This approach was used for toolbox.
+The GenshiView template, http://k0s.org/hg/GenshiView, was invoked to
+generate a basic dispatcher->handler system.  The cruft was removed,
+leaving only the basic structure and the TempitaHandler since tempita
+is lightweight and it was envisioned that filesystem tempita templates
+(MakeItSo!) would be used elsewhere in the project.  The basic
+handlers (project views, field-sorted view, new, etc.) were written
+and soon a usable interface was constructed.
+A ``sample`` directory was created to hold the JSON blobs. Because
+this was done early on, two goals were achieved: 
+1. The software could be dogfooded immediately using actual applicable
+data. This helped expose a number of issues concerning the data format
+right away.
+2. There was a place to put tools before the project reached a
+deployable state (previously, a few had lived in a static state using
+a rough sketch of the HTML microformat discussed above on
+k0s.org). Since the main point of toolbox is to record Mozilla tools,
+the wealth of references mentioned in passing could be put somewhere,
+instead of passed by and forgotten.  One wishes that they do not miss
+the train while purchasing a ticket.
+The original intent, when the file-based JSON blob approach was to be
+the deployed backend, was to have two repositories: one for the code
+and one for the JSON blobs.  When this approach was scrapped, the
+file-based JSON blobs were relegated to the ``sample`` directory, with
+the intent to import them into e.g. a couch database on actual
+deployment (using an import script). The samples could then be used
+for testing.
+The model has a single "setter" function, ``update``, used for
+both creating and updating projects.  Due to this and due to the fact
+the model was ABC/pluggable from the beginning, a converter ``export``
+function could be trivially written at the ABC-level::
+    def export(self, other):
+        """export the current model to another model instance"""
+        for project in self.get():
+            other.update(project)
+This, with an accompanying CLI utility, was used to migrate from JSON
+blob files in the ``sample`` directory to the couch instance.  This
+particular methodology as applied to an unexpected problem (the
+unanticipated switch from JSON blobs to couch) is a good example of
+the power of using a problem to drive the software forward (in this
+case, creation of a universal export function and associated command
+line utility). The alternative, a one-off manual data migration, would
+have been just as time consuming, would not be repeatable, would not
+have extended toolbox, and may have (like many one-offs do) infected
+the code base with associated semi-permanent vestiges.  In general,
+problems should be used to drive innovation.  This can only be done if
+the software is kept in a reasonably good state.  Otherwise,
+considerable (though probably worthwhile) refactoring must be done
+prior to feature extension, which becomes cost-prohibitive in
+time-critical situations where a one-off is (more) likely to be employed.
+Use Cases
+The target use-case is software tools for Mozilla, or, more generally,
+a software index.  For this case, the default fields used are given in
+the paste.ini file: usage, author, type, language. More fields may be
+added to the running instance in the future.
+However, the classifier approach can be used for a wide variety
+of web-locatable resources.  A few examples:
+* songs: artist, album, genre, instruments
+* del.icio.us: type, media, author, site
+* http://readthedocs.org/