### Pre-processing of photo data

The following collection of scripts is designed to run on a backend server
with filesystem access to the photo library.
preprocessing/geocoding.typ
#set document(title: [Pre-processing Photos])
#set par(justify: true)
#show raw.where(lang: "python"): set align(center)
#show raw.where(lang: "sh"): set align(center)
#show <subtitle>: set align(center)

// Typst has no built-in `title` element or `title()` function;
// render the document title manually instead.
#align(center, text(17pt, weight: "bold")[Pre-processing Photos])

The following aims to describe the pre-processing of photo metadata so
that information about a personal collection of photos is available to
support its display in a gallery application.

This will cover functionality including:
- partitioning photos to ease the frontend data volume
- grouping photos by date
- calculating geographical information

Making this information readily available to a frontend enables
a photo browsing experience
_#link("https://medium.com/@YodgorbekKomilo/system-design-of-google-photos-or-an-android-photo-gallery-app-a-complete-guide-a34330dc93cb")[comparable
to Google Photos]_.

= Calculating Geographical Information

Ideally we can compute the city or region in which a photo was taken,
for grouping purposes. The 'nice' name of the region would hopefully
remind the viewer of the place they visited long, long after the photo
was taken, which is a nice touch.

Services exist to do this association, but:
- this is a home-lab oriented project
- the region information accuracy is not critical for this use case
- the freely available APIs ask that users both:
  - ensure requests are rate-limited
  - refrain from batch processing
- it's better if this keeps working
  _#link("https://web.archive.org/web/20180703221551/https://cloud.google.com/maps-platform/user-guide/pricing-changes/")[despite business model changes]_
- the minimum hardware requirements to self-host region lookup at a global scale are not appropriate for such a home-lab oriented project

An approximate level of accuracy can be achieved without the
complexity, cost & uncertainty of continual use of the above
services. GEONAMES provides a
_#link("https://download.geonames.org/export/dump/")[Gazetteer dataset]_
split into many useful components, which fits the bill perfectly.

The GEONAMES admin2 codes export contains entries for level-2
administrative regions globally. It contains unique identifiers, a
potentially familiar local name for each region, the country the
region is within and many other details in a neat little 2.3 MB
package.
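
As a sketch of how one of these entries can be consumed — the column
layout is code, name, asciiname, geonameId, with the code taking the
form `country.admin1.admin2` — the following splits a line into its
parts. The sample line is illustrative only, not a real export entry.

```python
# Sketch: split one tab-separated admin2Codes.txt line into its parts.
def parse_admin2_line(line):
    code, name, asciiname, geonameid = line.rstrip("\n").split("\t")
    # codes take the form country.admin1.admin2
    country, admin1, admin2 = code.split(".", 2)
    return {
        "country": country,
        "admin1": admin1,
        "admin2": admin2,
        "name": name,
        "asciiname": asciiname,
        "geonameid": geonameid,
    }

# Illustrative line only -- not a real export entry.
sample = "XX.01.123\tExample Region\tExample Region\t0"
print(parse_admin2_line(sample)["country"])  # → XX
```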

We can partition this into a neat little filesystem database
of countries and their regions as follows:

```sh
wget https://download.geonames.org/export/dump/admin2Codes.txt
mkdir -p admin2

# partition entries by their two-letter country-code prefix
while IFS= read -r line; do
  prefix=${line:0:2}
  if [[ "$prefix" =~ ^[A-Z][A-Z]$ ]]; then
    echo "$line" >> "admin2/$prefix.txt"
  fi
done < admin2Codes.txt
```

However, this is not enough to associate our photos with a region. The
EXIF location data usually attached to photos records a lat-lon, and a
lat-lon alone cannot place a photo within one of these regions.
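
EXIF stores GPS coordinates as (degrees, minutes, seconds) rationals
plus a hemisphere reference (N/S/E/W). As a sketch of the conversion
involved — assuming the GPS tags have already been read from the file
with an EXIF library — the following produces the signed decimal
degrees used by the GEONAMES exports. The sample values work out to
the Vuadens entry from the cities export.

```python
# Sketch: convert EXIF-style (degrees, minutes, seconds) plus a
# hemisphere reference into signed decimal degrees.
def dms_to_decimal(dms, ref):
    degrees, minutes, seconds = (float(v) for v in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # southern and western hemispheres are negative
    return -value if ref in ("S", "W") else value

# Illustrative values: these work out to Vuadens' lat-lon.
lat = dms_to_decimal((46, 36, 55.62), "N")
lon = dms_to_decimal((7, 1, 2.35), "E")
print(round(lat, 5), round(lon, 5))  # → 46.61545 7.01732
```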

Another available export comes at four granularities of cities. Each
granularity uses a different population threshold to decide whether a
city is included in the export.

```
cities500.zip    2025-11-08 03:34  12M
cities1000.zip   2025-11-08 03:34  9.5M
cities5000.zip   2025-11-08 03:34  4.9M
cities15000.zip  2025-11-08 03:34  2.9M
```

Each entry contains useful information including the city name, the
region the city falls into and a lat-lon for the city. With this
information we can begin our budget-friendly approximations.

```sh
$ wget https://download.geonames.org/export/dump/cities500.zip
$ unzip cities500.zip

$ awk -F'\t' '{print $1, $9, $11, $12, $3, $5, $6}' cities500.txt | grep FR | head
2658090 CH FR 1003 Vuadens 46.61545 7.01732
2658124 CH FR 1002 Villaz-Saint-Pierre 46.72074 6.95638
2658128 CH FR 1004 Villars-sur-Glane 46.79054 7.11717
2658177 CH FR 1003 Vaulruz 46.62164 6.9882
2658281 CH FR 1006 Ueberstorf 46.86587 7.30998
2658317 CH FR 1004 Treyvaux 46.72796 7.13769
2658433 CH FR 1006 Tafers 46.81483 7.21852
2658548 CH FR 1003 Sorens 46.6691 7.05249
2658570 CH FR 1002 Siviriez 46.65852 6.87774
2658629 CH FR 1007 Semsales 46.57321 6.92948
```

For each lat-lon from the photo EXIF data, we can use one of these
datasets to determine which city in the dataset is closest. This
provides an approximate region in which the photo was potentially
taken. Obviously some geographical arrangements will make this
approximation incorrect, and the city datasets containing fewer
cities will further reduce the accuracy.
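
A minimal sketch of this nearest-city lookup: a brute-force scan using
the haversine great-circle distance. The two city tuples are taken
from the sample output above; everything else is illustrative.

```python
import math

# Sketch: brute-force nearest-city lookup by haversine distance.
def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))  # mean Earth radius

def nearest_city(lat, lon, cities):
    return min(cities, key=lambda c: haversine_km(lat, lon, c[0], c[1]))

# (lat, lon, region code) tuples from the cities500 sample above
cities = [
    (46.61545, 7.01732, "CH:FR:1003"),  # Vuadens
    (46.81483, 7.21852, "CH:FR:1006"),  # Tafers
]
print(nearest_city(46.62, 7.00, cities)[2])  # → CH:FR:1003
```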

The performance of computing this hasn't been discussed. Using a
KDTree built directly on lat-lon would improve performance, but
accuracy would suffer further, with some undesirable behaviour near
the poles and at the antimeridian, where longitude wraps back around.
BallTrees may be an interesting approach.
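
One way to keep a k-d tree exact, sketched below, is to index 3D unit
vectors instead of raw lat-lon: chord distance in 3D is monotonic in
great-circle distance, so nearest-by-chord equals nearest-by-arc, and
neither the poles nor the longitude wraparound are special cases. Only
the conversion is shown; the tree itself could come from, e.g.,
`scipy.spatial.cKDTree` built over these vectors.

```python
import math

# Sketch: embed lat-lon on the unit sphere so Euclidean nearest
# neighbours agree with great-circle nearest neighbours.
def to_unit_vector(lat, lon):
    phi, lmb = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lmb),
            math.cos(phi) * math.sin(lmb),
            math.sin(phi))

# Points either side of the antimeridian are close in 3D even though
# their longitudes differ by nearly 360 degrees.
a = to_unit_vector(0.0, 179.9)
b = to_unit_vector(0.0, -179.9)
print(round(math.dist(a, b), 5))  # → 0.00349
```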

A final touch would be to know the actual country name. This can
again be achieved using GEONAMES exports.

```sh
wget https://download.geonames.org/export/dump/countryInfo.txt
```

This small dataset contains country names, local currencies and even
each country's name as it is represented in that country. The export
is processed and included in the frontend during bundling. More could
be done to minify the contents to what is actually consumed by the
frontend; however, given that actual payloads of this gallery
application run in the hundreds of megabytes for high-res thumbnails,
it's probably not worth the effort. The bundler might one day be
smart enough to figure it out itself.
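
As a sketch of the kind of minification mentioned — assuming the
current export layout, where `#`-prefixed lines are comments, fields
are tab-separated, the ISO code comes first and the English country
name is the fifth field — the dataset can be reduced to just the
ISO-code to country-name mapping:

```python
# Sketch: reduce countryInfo.txt to the ISO-code -> country-name map
# the frontend consumes. Field positions assume the current export
# layout; the sample rows below are abbreviated.
def country_names(lines):
    names = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the commented header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4:
            names[fields[0]] = fields[4]
    return names

sample = [
    "#ISO\tISO3\tISO-Numeric\tfips\tCountry\t...",
    "CH\tCHE\t756\tSZ\tSwitzerland\tBern",
]
print(country_names(sample))  # → {'CH': 'Switzerland'}
```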
preprocessing/geokdclust.py
import sys
import json
from kdtree import GeoIndexer

def main():
    if len(sys.argv) < 2:
        print("Usage: python geokdclust.py <directory>")
        sys.exit(1)

    directory = sys.argv[1]

    # Define the column names in order based on the GEONAMES schema
    column_names = [
        'geonameid',
        'name',
        'asciiname',
        'alternatenames',
        'latitude',
        'longitude',
        'feature_class',
        'feature_code',
        'country_code',
        'cc2',
        'admin1_code',
        'admin2_code',
        'admin3_code',
        'admin4_code',
        'population',
        'elevation',
        'dem',
        'timezone',
        'modification_date',
    ]

    # Indices of the required columns (awk columns 1,9,11,12,3,5,6;
    # zero-based here)
    required_indices = [0, 8, 10, 11, 2, 4, 5]

    # Extract the corresponding column names
    required_columns = [column_names[i] for i in required_indices]

    # List to store all rows as dictionaries
    data = []

    # Read the file and process each line
    num_cities = '500'
    with open(f"cities{num_cities}.txt", 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip empty lines

            fields = line.split('\t')
            if len(fields) < max(required_indices) + 1:
                continue  # Skip lines with insufficient data

            row = {}
            for idx, col in zip(required_indices, required_columns):
                row[col] = fields[idx]

            data.append(row)

    city_nodes = [(
        float(r['latitude']),
        float(r['longitude']),
        f"{r['country_code']}:{r['admin1_code']}:{r['admin2_code']}"
    ) for r in data]

    # admin2Codes.txt columns: code, name, asciiname, geonameId
    regions = dict()

    with open('admin2Codes.txt', 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip empty lines

            fields = line.split('\t')
            # keys like 'CH.FR.1003' are rewritten to 'CH:FR:1003' to
            # match the city_nodes key format
            regions[fields[0].replace('.', ':')] = fields[1]

    # Build the index
    print("Building index...")
    indexer = GeoIndexer(city_nodes)
    print("Index built successfully.")

    with open(f"{directory}.clusters.json") as file:
        clusters = json.load(file)[0]

    rbc = dict()  # photo nodes grouped by derived region code
    unk = []      # photo nodes whose nearest city has no known region

    towns = set()

    for c in clusters:
        for n in c:
            # Find the nearest city node to this photo's lat-lon
            nearest_node = indexer.find_nearest(
                float(n['lat']),
                float(n['lon'])
            )

            print(f"\nQuery Point: ({n['lat']}, {n['lon']})")
            print(f"Nearest Node Found: {nearest_node}")

            if nearest_node[2] in regions:
                print(f"Derived Admin Region: {regions[nearest_node[2]]}")
                if nearest_node[2] in rbc:
                    rbc[nearest_node[2]].append(n)
                else:
                    rbc[nearest_node[2]] = [n]

                towns.add(nearest_node)

            else:
                print(f"Unknown Admin Region: {nearest_node[2]}")
                unk.append(n)

    with open(f"{directory}.region.clusters.{num_cities}.json", 'w') as file:
        grouped = list(rbc.values())
        grouped.append(unk)  # keep unmatched nodes as a final group
        json.dump([grouped], file)

    with open(f"{directory}.close_cities.{num_cities}.json", 'w') as file:
        json.dump(list(towns), file)

if __name__ == '__main__':
    main()