view README.txt @ 24:40a9850076a4

[documentation] links
author Jeff Hammel <k0scist@gmail.com>
date Mon, 04 Sep 2017 14:54:38 -0700
parents 6891c5523b69
children
line wrap: on
line source

# GlobalNeighbors

locate nearest cities

## Making the World Local

## Dataset

The data is taken from http://download.geonames.org/export/dump/cities1000.zip ,
all cities with a population > 1000 or seats of adm div (ca 150.000),
according to http://download.geonames.org/export/dump/ .
`wc -l` indicates that the text file has 149092 records.

The schema is given by:

geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int)
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format

(Ref: http://download.geonames.org/export/dump/)


## Distance Calculation

A spherical projection is assumed for the Earth.

In order to calculated the distance between cities,
the Haversine formula is used (see
https://en.wikipedia.org/wiki/Haversine_formula ).

The naive approach, which we take here is to calculate
all the distances between all cities.  This is notably
an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`)

In this case roughly:
```
O = 0.5*149092*149092 approx 10 billion
```

This is no small task even with modern computers.  CPU, Memory,
and Storage are non negligible even for this modern data set.

So first we filter down based on boxing on latitude and longitude.
This is not 100% correct as it ignores the singularity at the poles
but this is probably tolerable since there are few cities of magnitude
in these regions.

So performance is important.

Just iterating over the `0.5*N*(N-1)` without any calculations
is expensive in python:  just iterating over the items
is about 30 minutes on this computer (in series).
Keeping track of neighbors within +/- 1 degree latitude and
longitude takes closer to an hour.  Boxing this way gets
a maximum number of neighbors of 2376 and a minimum value of 0
with a mean of 222.  This is stored in `neighbors.json` in `data`.
There are 72897 cases with less than 100 in this box and
18444 with less than 10.

Because of this, the closest 50 neighbors were
precomputed and cached as `neighbors.dat`:

```
find-neighbors /home/jhammel/tensorflow/cities1000.txt -k 50 --latlon 1. 1. neighbors.dat
```


## Web Service

A simple web application was developed using
`gunicorn` (http://gunicorn.org/) as the server.

`Paste` was used to create a pass-through fileserver.
It's multi-threaded server also made testing easier.

For fun, each city page links to a google map with its
latitude and longitude.  I found the secret decoder ring here:
http://asnsblues.blogspot.com/2011/11/google-maps-query-string-parameters.html

The landing page gives an input box using jQuery's
http://easyautocomplete.com/ .


##  Deployment notes

`supervisor` (http://supervisord.org/) was used to ensure
a continuously running web daemon.  The included
`globalneighbors.ini` should be copied to the appropriate
supervisor directory and the service loaded.

- parallelism:  the distance calculation is done serially.  As such
  it is a `O(10^10)` operation on the dataset.  This should be improved
  and parallelized

- no through-the-web (TTW) testing was done except manual.
  This should be corrected with Selenium and other headless testing

## (Hopefully) Helpful Links

- http://geojsonlint.com/
- https://en.wikipedia.org/wiki/Haversine_formula
- https://en.wikipedia.org/wiki/K-d_tree
- http://scikit-learn.org/stable/modules/neighbors.html
- https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/

----

Jeff Hammel