comparison README.txt @ 0:5dba84370182

initial commit; half-working prototype
author Jeff Hammel <k0scist@gmail.com>
date Sat, 24 Jun 2017 12:03:39 -0700
parents
children d1b99c695511
comparison
equal deleted inserted replaced
-1:000000000000 0:5dba84370182
1 # GlobalNeighbors
2
3 locate nearest cities
4
5 ## Making the World Local
6
7 ## Dataset
8
9 The data is taken from http://download.geonames.org/export/dump/cities1000.zip ,
10 all cities with a population > 1000 or seats of adm div (ca 150.000),
11 according to http://download.geonames.org/export/dump/ .
12 `wc -l` indicates that the text file has 149092 records.
13
14 The schema is given by:
15
16 geonameid : integer id of record in geonames database
17 name : name of geographical point (utf8) varchar(200)
18 asciiname : name of geographical point in plain ascii characters, varchar(200)
19 alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
20 latitude : latitude in decimal degrees (wgs84)
21 longitude : longitude in decimal degrees (wgs84)
22 feature class : see http://www.geonames.org/export/codes.html, char(1)
23 feature code : see http://www.geonames.org/export/codes.html, varchar(10)
24 country code : ISO-3166 2-letter country code, 2 characters
25 cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
26 admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
27 admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
28 admin3 code : code for third level administrative division, varchar(20)
29 admin4 code : code for fourth level administrative division, varchar(20)
30 population : bigint (8 byte int)
31 elevation : in meters, integer
32 dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
33 timezone : the iana timezone id (see file timeZone.txt) varchar(40)
34 modification date : date of last modification in yyyy-MM-dd format
35
36 (Ref: http://download.geonames.org/export/dump/)
37
38
39 ## Distance Calculation
40
41 A spherical projection is assumed for the Earth.
42
43 In order to calculated the distance between cities,
44 the Haversine formula is used (see
45 https://en.wikipedia.org/wiki/Haversine_formula ).
46
47 The naive approach, which we take here is to calculate
48 all the distances between all cities. This is notably
49 an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`)
50
51 In this case roughly:
52 ```
53 O = 0.5*149092*149092 approx 10 billion
54 ```
55
56 This is no small task even with modern computers. CPU, Memory,
57 and Storage are non negligible even for this modern data set.
58
59 So first we filter down based on boxing on latitude and longitude.
60 This is not 100% correct as it ignores the singularity at the poles
61 but this is probably tolerable since there are few cities of magnitude
62 in these regions.
63
64 So performance is important.
65
66 Just iterating over the `0.5*N*(N-1)` without any calculations
67 is expensive in python: just iterating over the items
68 is about 30 minutes on this computer (in series).
69 Keeping track of neighbors within +/- 1 degree latitude and
70 longitude takes closer to an hour. Boxing this way gets
71 a maximum number of neighbors of 2376 and a minimum value of 0
72 with a mean of 222. This is stored in `neighbors.json` in `data`.
73 There are 72897 cases with less than 100 in this box and
74 18444 with less than 10.
75
76
77 ## Web Service
78
79 A simple web application was developed using
80 `gunicorn` (http://gunicorn.org/) as the server.
81
82 ## Deployment notes
83
84 - parallelism: the distance calculation is done serially. As such
85 it is a `O(10^10)` operation on the dataset. This should be improved
86 and parallelized
87
88
89 ## (Hopefully) Helpful Links
90
91 - http://geojsonlint.com/
92 - https://en.wikipedia.org/wiki/Haversine_formula
93
94 ----
95
96 Jeff Hammel