Mercurial > hg > GlobalNeighbors
view README.txt @ 25:991bce6b6881 default tip
[knn] placeholder for planning session
author | Jeff Hammel <k0scist@gmail.com> |
---|---|
date | Sun, 17 Sep 2017 14:35:50 -0700 |
parents | 40a9850076a4 |
children |
line wrap: on
line source
# GlobalNeighbors locate nearest cities ## Making the World Local ## Dataset The data is taken from http://download.geonames.org/export/dump/cities1000.zip , all cities with a population > 1000 or seats of adm div (ca 150.000), according to http://download.geonames.org/export/dump/ . `wc -l` indicates that the text file has 149092 records. The schema is given by: geonameid : integer id of record in geonames database name : name of geographical point (utf8) varchar(200) asciiname : name of geographical point in plain ascii characters, varchar(200) alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) latitude : latitude in decimal degrees (wgs84) longitude : longitude in decimal degrees (wgs84) feature class : see http://www.geonames.org/export/codes.html, char(1) feature code : see http://www.geonames.org/export/codes.html, varchar(10) country code : ISO-3166 2-letter country code, 2 characters cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20) admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) admin3 code : code for third level administrative division, varchar(20) admin4 code : code for fourth level administrative division, varchar(20) population : bigint (8 byte int) elevation : in meters, integer dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat. timezone : the iana timezone id (see file timeZone.txt) varchar(40) modification date : date of last modification in yyyy-MM-dd format (Ref: http://download.geonames.org/export/dump/) ## Distance Calculation A spherical projection is assumed for the Earth. In order to calculated the distance between cities, the Haversine formula is used (see https://en.wikipedia.org/wiki/Haversine_formula ). The naive approach, which we take here is to calculate all the distances between all cities. This is notably an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`) In this case roughly: ``` O = 0.5*149092*149092 approx 10 billion ``` This is no small task even with modern computers. CPU, Memory, and Storage are non negligible even for this modern data set. So first we filter down based on boxing on latitude and longitude. This is not 100% correct as it ignores the singularity at the poles but this is probably tolerable since there are few cities of magnitude in these regions. So performance is important. Just iterating over the `0.5*N*(N-1)` without any calculations is expensive in python: just iterating over the items is about 30 minutes on this computer (in series). Keeping track of neighbors within +/- 1 degree latitude and longitude takes closer to an hour. Boxing this way gets a maximum number of neighbors of 2376 and a minimum value of 0 with a mean of 222. This is stored in `neighbors.json` in `data`. There are 72897 cases with less than 100 in this box and 18444 with less than 10. Because of this, the closest 50 neighbors were precomputed and cached as `neighbors.dat`: ``` find-neighbors /home/jhammel/tensorflow/cities1000.txt -k 50 --latlon 1. 1. neighbors.dat ``` ## Web Service A simple web application was developed using `gunicorn` (http://gunicorn.org/) as the server. `Paste` was used to create a pass-through fileserver. It's multi-threaded server also made testing easier. For fun, each city page links to a google map with its latitude and longitude. I found the secret decoder ring here: http://asnsblues.blogspot.com/2011/11/google-maps-query-string-parameters.html The landing page gives an input box using jQuery's http://easyautocomplete.com/ . ## Deployment notes `supervisor` (http://supervisord.org/) was used to ensure a continuously running web daemon. The included `globalneighbors.ini` should be copied to the appropriate supervisor directory and the service loaded. - parallelism: the distance calculation is done serially. As such it is a `O(10^10)` operation on the dataset. This should be improved and parallelized - no through-the-web (TTW) testing was done except manual. This should be corrected with Selenium and other headless testing ## (Hopefully) Helpful Links - http://geojsonlint.com/ - https://en.wikipedia.org/wiki/Haversine_formula - https://en.wikipedia.org/wiki/K-d_tree - http://scikit-learn.org/stable/modules/neighbors.html - https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/ ---- Jeff Hammel