Mercurial > hg > GlobalNeighbors
comparison README.txt @ 0:5dba84370182
initial commit; half-working prototype
author | Jeff Hammel <k0scist@gmail.com> |
---|---|
date | Sat, 24 Jun 2017 12:03:39 -0700 |
parents | |
children | d1b99c695511 |
comparison
equal
deleted
inserted
replaced
-1:000000000000 | 0:5dba84370182 |
---|---|
1 # GlobalNeighbors | |
2 | |
3 locate nearest cities | |
4 | |
5 ## Making the World Local | |
6 | |
7 ## Dataset | |
8 | |
9 The data is taken from http://download.geonames.org/export/dump/cities1000.zip , | |
10 all cities with a population > 1000 or seats of adm div (ca 150.000), | |
11 according to http://download.geonames.org/export/dump/ . | |
12 `wc -l` indicates that the text file has 149092 records. | |
13 | |
14 The schema is given by: | |
15 | |
16 geonameid : integer id of record in geonames database | |
17 name : name of geographical point (utf8) varchar(200) | |
18 asciiname : name of geographical point in plain ascii characters, varchar(200) | |
19 alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) | |
20 latitude : latitude in decimal degrees (wgs84) | |
21 longitude : longitude in decimal degrees (wgs84) | |
22 feature class : see http://www.geonames.org/export/codes.html, char(1) | |
23 feature code : see http://www.geonames.org/export/codes.html, varchar(10) | |
24 country code : ISO-3166 2-letter country code, 2 characters | |
25 cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters | |
26 admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20) | |
27 admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) | |
28 admin3 code : code for third level administrative division, varchar(20) | |
29 admin4 code : code for fourth level administrative division, varchar(20) | |
30 population : bigint (8 byte int) | |
31 elevation : in meters, integer | |
32 dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat. | |
33 timezone : the iana timezone id (see file timeZone.txt) varchar(40) | |
34 modification date : date of last modification in yyyy-MM-dd format | |
35 | |
36 (Ref: http://download.geonames.org/export/dump/) | |
37 | |
38 | |
39 ## Distance Calculation | |
40 | |
41 A spherical projection is assumed for the Earth. | |
42 | |
43 In order to calculated the distance between cities, | |
44 the Haversine formula is used (see | |
45 https://en.wikipedia.org/wiki/Haversine_formula ). | |
46 | |
47 The naive approach, which we take here is to calculate | |
48 all the distances between all cities. This is notably | |
49 an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`) | |
50 | |
51 In this case roughly: | |
52 ``` | |
53 O = 0.5*149092*149092 approx 10 billion | |
54 ``` | |
55 | |
56 This is no small task even with modern computers. CPU, Memory, | |
57 and Storage are non negligible even for this modern data set. | |
58 | |
59 So first we filter down based on boxing on latitude and longitude. | |
60 This is not 100% correct as it ignores the singularity at the poles | |
61 but this is probably tolerable since there are few cities of magnitude | |
62 in these regions. | |
63 | |
64 So performance is important. | |
65 | |
66 Just iterating over the `0.5*N*(N-1)` without any calculations | |
67 is expensive in python: just iterating over the items | |
68 is about 30 minutes on this computer (in series). | |
69 Keeping track of neighbors within +/- 1 degree latitude and | |
70 longitude takes closer to an hour. Boxing this way gets | |
71 a maximum number of neighbors of 2376 and a minimum value of 0 | |
72 with a mean of 222. This is stored in `neighbors.json` in `data`. | |
73 There are 72897 cases with less than 100 in this box and | |
74 18444 with less than 10. | |
75 | |
76 | |
77 ## Web Service | |
78 | |
79 A simple web application was developed using | |
80 `gunicorn` (http://gunicorn.org/) as the server. | |
81 | |
82 ## Deployment notes | |
83 | |
84 - parallelism: the distance calculation is done serially. As such | |
85 it is a `O(10^10)` operation on the dataset. This should be improved | |
86 and parallelized | |
87 | |
88 | |
89 ## (Hopefully) Helpful Links | |
90 | |
91 - http://geojsonlint.com/ | |
92 - https://en.wikipedia.org/wiki/Haversine_formula | |
93 | |
94 ---- | |
95 | |
96 Jeff Hammel |