Mercurial > hg > GlobalNeighbors
diff README.txt @ 0:5dba84370182
initial commit; half-working prototype
author | Jeff Hammel <k0scist@gmail.com> |
---|---|
date | Sat, 24 Jun 2017 12:03:39 -0700 |
parents | |
children | d1b99c695511 |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/README.txt Sat Jun 24 12:03:39 2017 -0700 @@ -0,0 +1,96 @@ +# GlobalNeighbors + +locate nearest cities + +## Making the World Local + +## Dataset + +The data is taken from http://download.geonames.org/export/dump/cities1000.zip , +all cities with a population > 1000 or seats of adm div (ca 150.000), +according to http://download.geonames.org/export/dump/ . +`wc -l` indicates that the text file has 149092 records. + +The schema is given by: + +geonameid : integer id of record in geonames database +name : name of geographical point (utf8) varchar(200) +asciiname : name of geographical point in plain ascii characters, varchar(200) +alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) +latitude : latitude in decimal degrees (wgs84) +longitude : longitude in decimal degrees (wgs84) +feature class : see http://www.geonames.org/export/codes.html, char(1) +feature code : see http://www.geonames.org/export/codes.html, varchar(10) +country code : ISO-3166 2-letter country code, 2 characters +cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters +admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20) +admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) +admin3 code : code for third level administrative division, varchar(20) +admin4 code : code for fourth level administrative division, varchar(20) +population : bigint (8 byte int) +elevation : in meters, integer +dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat. +timezone : the iana timezone id (see file timeZone.txt) varchar(40) +modification date : date of last modification in yyyy-MM-dd format + +(Ref: http://download.geonames.org/export/dump/) + + +## Distance Calculation + +A spherical projection is assumed for the Earth. + +In order to calculated the distance between cities, +the Haversine formula is used (see +https://en.wikipedia.org/wiki/Haversine_formula ). + +The naive approach, which we take here is to calculate +all the distances between all cities. This is notably +an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`) + +In this case roughly: +``` +O = 0.5*149092*149092 approx 10 billion +``` + +This is no small task even with modern computers. CPU, Memory, +and Storage are non negligible even for this modern data set. + +So first we filter down based on boxing on latitude and longitude. +This is not 100% correct as it ignores the singularity at the poles +but this is probably tolerable since there are few cities of magnitude +in these regions. + +So performance is important. + +Just iterating over the `0.5*N*(N-1)` without any calculations +is expensive in python: just iterating over the items +is about 30 minutes on this computer (in series). +Keeping track of neighbors within +/- 1 degree latitude and +longitude takes closer to an hour. Boxing this way gets +a maximum number of neighbors of 2376 and a minimum value of 0 +with a mean of 222. This is stored in `neighbors.json` in `data`. +There are 72897 cases with less than 100 in this box and +18444 with less than 10. + + +## Web Service + +A simple web application was developed using +`gunicorn` (http://gunicorn.org/) as the server. + +## Deployment notes + +- parallelism: the distance calculation is done serially. As such + it is a `O(10^10)` operation on the dataset. This should be improved + and parallelized + + +## (Hopefully) Helpful Links + +- http://geojsonlint.com/ +- https://en.wikipedia.org/wiki/Haversine_formula + +---- + +Jeff Hammel