annotate README.txt @ 24:40a9850076a4

[documentation] links
author Jeff Hammel <k0scist@gmail.com>
date Mon, 04 Sep 2017 14:54:38 -0700
parents 6891c5523b69
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
1 # GlobalNeighbors
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
2
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
3 locate nearest cities
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
4
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
5 ## Making the World Local
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
6
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
7 ## Dataset
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
8
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
9 The data is taken from http://download.geonames.org/export/dump/cities1000.zip ,
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
10 all cities with a population > 1000 or seats of adm div (ca 150.000),
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
11 according to http://download.geonames.org/export/dump/ .
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
12 `wc -l` indicates that the text file has 149092 records.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
13
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
14 The schema is given by:
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
15
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
16 geonameid : integer id of record in geonames database
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
17 name : name of geographical point (utf8) varchar(200)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
18 asciiname : name of geographical point in plain ascii characters, varchar(200)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
19 alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
20 latitude : latitude in decimal degrees (wgs84)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
21 longitude : longitude in decimal degrees (wgs84)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
22 feature class : see http://www.geonames.org/export/codes.html, char(1)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
23 feature code : see http://www.geonames.org/export/codes.html, varchar(10)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
24 country code : ISO-3166 2-letter country code, 2 characters
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
25 cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
26 admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
27 admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
28 admin3 code : code for third level administrative division, varchar(20)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
29 admin4 code : code for fourth level administrative division, varchar(20)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
30 population : bigint (8 byte int)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
31 elevation : in meters, integer
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
32 dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
33 timezone : the iana timezone id (see file timeZone.txt) varchar(40)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
34 modification date : date of last modification in yyyy-MM-dd format
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
35
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
36 (Ref: http://download.geonames.org/export/dump/)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
37
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
38
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
39 ## Distance Calculation
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
40
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
41 A spherical projection is assumed for the Earth.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
42
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
43 In order to calculated the distance between cities,
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
44 the Haversine formula is used (see
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
45 https://en.wikipedia.org/wiki/Haversine_formula ).
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
46
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
47 The naive approach, which we take here is to calculate
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
48 all the distances between all cities. This is notably
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
49 an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`)
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
50
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
51 In this case roughly:
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
52 ```
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
53 O = 0.5*149092*149092 approx 10 billion
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
54 ```
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
55
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
56 This is no small task even with modern computers. CPU, Memory,
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
57 and Storage are non negligible even for this modern data set.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
58
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
59 So first we filter down based on boxing on latitude and longitude.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
60 This is not 100% correct as it ignores the singularity at the poles
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
61 but this is probably tolerable since there are few cities of magnitude
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
62 in these regions.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
63
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
64 So performance is important.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
65
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
66 Just iterating over the `0.5*N*(N-1)` without any calculations
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
67 is expensive in python: just iterating over the items
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
68 is about 30 minutes on this computer (in series).
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
69 Keeping track of neighbors within +/- 1 degree latitude and
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
70 longitude takes closer to an hour. Boxing this way gets
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
71 a maximum number of neighbors of 2376 and a minimum value of 0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
72 with a mean of 222. This is stored in `neighbors.json` in `data`.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
73 There are 72897 cases with less than 100 in this box and
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
74 18444 with less than 10.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
75
23
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
76 Because of this, the closest 50 neighbors were
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
77 precomputed and cached as `neighbors.dat`:
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
78
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
79 ```
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
80 find-neighbors /home/jhammel/tensorflow/cities1000.txt -k 50 --latlon 1. 1. neighbors.dat
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
81 ```
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
82
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
83
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
84 ## Web Service
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
85
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
86 A simple web application was developed using
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
87 `gunicorn` (http://gunicorn.org/) as the server.
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
88
11
d1b99c695511 remove obselete data
Jeff Hammel <k0scist@gmail.com>
parents: 0
diff changeset
89 `Paste` was used to create a pass-through fileserver.
d1b99c695511 remove obselete data
Jeff Hammel <k0scist@gmail.com>
parents: 0
diff changeset
90 It's multi-threaded server also made testing easier.
d1b99c695511 remove obselete data
Jeff Hammel <k0scist@gmail.com>
parents: 0
diff changeset
91
19
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
92 For fun, each city page links to a google map with its
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
93 latitude and longitude. I found the secret decoder ring here:
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
94 http://asnsblues.blogspot.com/2011/11/google-maps-query-string-parameters.html
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
95
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
96 The landing page gives an input box using jQuery's
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
97 http://easyautocomplete.com/ .
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
98
811adc9736eb add configuration for gunicorn
Jeff Hammel <k0scist@gmail.com>
parents: 17
diff changeset
99
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
100 ## Deployment notes
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
101
17
6aaf70296c5a move to supervisor for operations
Jeff Hammel <k0scist@gmail.com>
parents: 11
diff changeset
102 `supervisor` (http://supervisord.org/) was used to ensure
6aaf70296c5a move to supervisor for operations
Jeff Hammel <k0scist@gmail.com>
parents: 11
diff changeset
103 a continuously running web daemon. The included
6aaf70296c5a move to supervisor for operations
Jeff Hammel <k0scist@gmail.com>
parents: 11
diff changeset
104 `globalneighbors.ini` should be copied to the appropriate
6aaf70296c5a move to supervisor for operations
Jeff Hammel <k0scist@gmail.com>
parents: 11
diff changeset
105 supervisor directory and the service loaded.
6aaf70296c5a move to supervisor for operations
Jeff Hammel <k0scist@gmail.com>
parents: 11
diff changeset
106
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
107 - parallelism: the distance calculation is done serially. As such
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
108 it is a `O(10^10)` operation on the dataset. This should be improved
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
109 and parallelized
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
110
23
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
111 - no through-the-web (TTW) testing was done except manual.
6891c5523b69 load with neighbors :)
Jeff Hammel <k0scist@gmail.com>
parents: 19
diff changeset
112 This should be corrected with Selenium and other headless testing
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
113
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
114 ## (Hopefully) Helpful Links
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
115
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
116 - http://geojsonlint.com/
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
117 - https://en.wikipedia.org/wiki/Haversine_formula
24
40a9850076a4 [documentation] links
Jeff Hammel <k0scist@gmail.com>
parents: 23
diff changeset
118 - https://en.wikipedia.org/wiki/K-d_tree
40a9850076a4 [documentation] links
Jeff Hammel <k0scist@gmail.com>
parents: 23
diff changeset
119 - http://scikit-learn.org/stable/modules/neighbors.html
40a9850076a4 [documentation] links
Jeff Hammel <k0scist@gmail.com>
parents: 23
diff changeset
120 - https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
0
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
121
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
122 ----
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
123
5dba84370182 initial commit; half-working prototype
Jeff Hammel <k0scist@gmail.com>
parents:
diff changeset
124 Jeff Hammel