diff README.txt @ 0:5dba84370182

initial commit; half-working prototype
author Jeff Hammel <k0scist@gmail.com>
date Sat, 24 Jun 2017 12:03:39 -0700
parents
children d1b99c695511
line wrap: on
line diff
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.txt	Sat Jun 24 12:03:39 2017 -0700
@@ -0,0 +1,96 @@
+# GlobalNeighbors
+
+locate nearest cities
+
+## Making the World Local
+
+## Dataset
+
+The data is taken from http://download.geonames.org/export/dump/cities1000.zip ,
+all cities with a population > 1000 or seats of adm div (ca 150.000),
+according to http://download.geonames.org/export/dump/ .
+`wc -l` indicates that the text file has 149092 records.
+
+The schema is given by:
+
+geonameid         : integer id of record in geonames database
+name              : name of geographical point (utf8) varchar(200)
+asciiname         : name of geographical point in plain ascii characters, varchar(200)
+alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
+latitude          : latitude in decimal degrees (wgs84)
+longitude         : longitude in decimal degrees (wgs84)
+feature class     : see http://www.geonames.org/export/codes.html, char(1)
+feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
+country code      : ISO-3166 2-letter country code, 2 characters
+cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
+admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
+admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
+admin3 code       : code for third level administrative division, varchar(20)
+admin4 code       : code for fourth level administrative division, varchar(20)
+population        : bigint (8 byte int)
+elevation         : in meters, integer
+dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
+timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
+modification date : date of last modification in yyyy-MM-dd format
+
+(Ref: http://download.geonames.org/export/dump/)
+
+
+## Distance Calculation
+
+A spherical projection is assumed for the Earth.
+
+In order to calculated the distance between cities,
+the Haversine formula is used (see
+https://en.wikipedia.org/wiki/Haversine_formula ).
+
+The naive approach, which we take here is to calculate
+all the distances between all cities.  This is notably
+an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`)
+
+In this case roughly:
+```
+O = 0.5*149092*149092 approx 10 billion
+```
+
+This is no small task even with modern computers.  CPU, Memory,
+and Storage are non negligible even for this modern data set.
+
+So first we filter down based on boxing on latitude and longitude.
+This is not 100% correct as it ignores the singularity at the poles
+but this is probably tolerable since there are few cities of magnitude
+in these regions.
+
+So performance is important.
+
+Just iterating over the `0.5*N*(N-1)` without any calculations
+is expensive in python:  just iterating over the items
+is about 30 minutes on this computer (in series).
+Keeping track of neighbors within +/- 1 degree latitude and
+longitude takes closer to an hour.  Boxing this way gets
+a maximum number of neighbors of 2376 and a minimum value of 0
+with a mean of 222.  This is stored in `neighbors.json` in `data`.
+There are 72897 cases with less than 100 in this box and
+18444 with less than 10.
+
+
+## Web Service
+
+A simple web application was developed using
+`gunicorn` (http://gunicorn.org/) as the server.
+
+##  Deployment notes
+
+- parallelism:  the distance calculation is done serially.  As such
+  it is a `O(10^10)` operation on the dataset.  This should be improved
+  and parallelized
+
+
+## (Hopefully) Helpful Links
+
+- http://geojsonlint.com/
+- https://en.wikipedia.org/wiki/Haversine_formula
+
+----
+
+Jeff Hammel