### Pre-processing of photo data

The following collection of scripts is designed to run on a backend server
with filesystem access to the photo library.
preprocessing/geocoding.typ
#set document(title: [Pre-processing Photos])
#set par(justify: true)
#show raw.where(lang: "python"): set align(center)
#show raw.where(lang: "sh"): set align(center)
#show <subtitle>: set align(center)

// Typst has no built-in `title` element or `title()` function;
// render the document title manually instead.
#align(center, text(17pt, weight: "bold")[Pre-processing Photos])

The following aims to describe the pre-processing of photo metadata so
that information about a personal collection of photos is available to
support its display in a gallery application.

This will cover functionality including:
- partitioning photos to ease the frontend data volume
- grouping photos by date
- calculating geographical information

Making this information readily available to a frontend enables
a photo browsing experience
_#link("https://medium.com/@YodgorbekKomilo/system-design-of-google-photos-or-an-android-photo-gallery-app-a-complete-guide-a34330dc93cb")[comparable
to Google Photos]_.

= Calculating Geographical Information

Ideally we can compute the city or region in which a photo was taken,
for grouping purposes. The 'nice' name of the region would hopefully
remind the viewer of the place they visited long, long after the photo
was taken, which is a nice touch.

Services exist to do this association, but:
- this is a home-lab oriented project
- the region information accuracy is not critical for this use case
- the freely available APIs ask that users both:
  - ensure requests are rate-limited
  - refrain from batch processing
- it's better if this keeps working
  _#link("https://web.archive.org/web/20180703221551/https://cloud.google.com/maps-platform/user-guide/pricing-changes/")[despite business model changes]_
- the minimum hardware requirements to self-host region lookup at a global scale are not appropriate for such a home-lab oriented project

An approximate level of accuracy can be achieved without the
complexity, cost & uncertainty of continual use of the above
services. GEONAMES provides a
_#link("https://download.geonames.org/export/dump/")[Gazetteer dataset]_
split into many useful components, which fits the bill perfectly.

The GEONAMES admin2 codes export contains entries for level-2
administrative regions globally. It contains unique identifiers, a
potentially familiar local name for each region, the country the
region is within and many other details in a neat little 2.3 MB
package.
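
As a sketch of how one of these entries can be consumed — the column
layout is code, name, asciiname, geonameId, with the code taking the
form `country.admin1.admin2` — the following splits a line into its
parts. The sample line is illustrative only, not a real export entry.

```python
# Sketch: split one tab-separated admin2Codes.txt line into its parts.
def parse_admin2_line(line):
    code, name, asciiname, geonameid = line.rstrip("\n").split("\t")
    # codes take the form country.admin1.admin2
    country, admin1, admin2 = code.split(".", 2)
    return {
        "country": country,
        "admin1": admin1,
        "admin2": admin2,
        "name": name,
        "asciiname": asciiname,
        "geonameid": geonameid,
    }

# Illustrative line only -- not a real export entry.
sample = "XX.01.123\tExample Region\tExample Region\t0"
print(parse_admin2_line(sample)["country"])  # → XX
```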

We can partition this into a neat little filesystem database
of countries and their regions as follows:

```sh
wget https://download.geonames.org/export/dump/admin2Codes.txt
mkdir -p admin2

# partition entries by their two-letter country-code prefix
while IFS= read -r line; do
  prefix=${line:0:2}
  if [[ "$prefix" =~ ^[A-Z][A-Z]$ ]]; then
    echo "$line" >> "admin2/$prefix.txt"
  fi
done < admin2Codes.txt
```

However, this is not enough to associate our photos with a region. The
EXIF location data usually attached to photos records a lat-lon, and a
lat-lon alone cannot place a photo within one of these regions.
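
EXIF stores GPS coordinates as (degrees, minutes, seconds) rationals
plus a hemisphere reference (N/S/E/W). As a sketch of the conversion
involved — assuming the GPS tags have already been read from the file
with an EXIF library — the following produces the signed decimal
degrees used by the GEONAMES exports. The sample values work out to
the Vuadens entry from the cities export.

```python
# Sketch: convert EXIF-style (degrees, minutes, seconds) plus a
# hemisphere reference into signed decimal degrees.
def dms_to_decimal(dms, ref):
    degrees, minutes, seconds = (float(v) for v in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # southern and western hemispheres are negative
    return -value if ref in ("S", "W") else value

# Illustrative values: these work out to Vuadens' lat-lon.
lat = dms_to_decimal((46, 36, 55.62), "N")
lon = dms_to_decimal((7, 1, 2.35), "E")
print(round(lat, 5), round(lon, 5))  # → 46.61545 7.01732
```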

Another available export comes at four granularities of cities. Each
granularity uses a different population threshold to decide whether a
city is included in the export.

```
cities500.zip    2025-11-08 03:34  12M
cities1000.zip   2025-11-08 03:34  9.5M
cities5000.zip   2025-11-08 03:34  4.9M
cities15000.zip  2025-11-08 03:34  2.9M
```

Each entry contains useful information including the city name, the
region the city falls into and a lat-lon for the city. With this
information we can begin our budget-friendly approximations.

```sh
$ wget https://download.geonames.org/export/dump/cities500.zip
$ unzip cities500.zip

$ awk -F'\t' '{print $1, $9, $11, $12, $3, $5, $6}' cities500.txt | grep FR | head
2658090 CH FR 1003 Vuadens 46.61545 7.01732
2658124 CH FR 1002 Villaz-Saint-Pierre 46.72074 6.95638
2658128 CH FR 1004 Villars-sur-Glane 46.79054 7.11717
2658177 CH FR 1003 Vaulruz 46.62164 6.9882
2658281 CH FR 1006 Ueberstorf 46.86587 7.30998
2658317 CH FR 1004 Treyvaux 46.72796 7.13769
2658433 CH FR 1006 Tafers 46.81483 7.21852
2658548 CH FR 1003 Sorens 46.6691 7.05249
2658570 CH FR 1002 Siviriez 46.65852 6.87774
2658629 CH FR 1007 Semsales 46.57321 6.92948
```

For each lat-lon from the photo EXIF data, we can use one of these
datasets to determine which city in the dataset is closest. This
provides an approximate region in which the photo was potentially
taken. Obviously some geographical arrangements will make this
approximation incorrect, and the city datasets containing fewer
cities will further reduce the accuracy.
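
A minimal sketch of this nearest-city lookup: a brute-force scan using
the haversine great-circle distance. The two city tuples are taken
from the sample output above; everything else is illustrative.

```python
import math

# Sketch: brute-force nearest-city lookup by haversine distance.
def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))  # mean Earth radius

def nearest_city(lat, lon, cities):
    return min(cities, key=lambda c: haversine_km(lat, lon, c[0], c[1]))

# (lat, lon, region code) tuples from the cities500 sample above
cities = [
    (46.61545, 7.01732, "CH:FR:1003"),  # Vuadens
    (46.81483, 7.21852, "CH:FR:1006"),  # Tafers
]
print(nearest_city(46.62, 7.00, cities)[2])  # → CH:FR:1003
```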

The performance of computing this hasn't been discussed. Using a
KDTree built directly on lat-lon would improve performance, but
accuracy would suffer further, with some undesirable behaviour near
the poles and at the antimeridian, where longitude wraps back around.
BallTrees may be an interesting approach.
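
One way to keep a k-d tree exact, sketched below, is to index 3D unit
vectors instead of raw lat-lon: chord distance in 3D is monotonic in
great-circle distance, so nearest-by-chord equals nearest-by-arc, and
neither the poles nor the longitude wraparound are special cases. Only
the conversion is shown; the tree itself could come from, e.g.,
`scipy.spatial.cKDTree` built over these vectors.

```python
import math

# Sketch: embed lat-lon on the unit sphere so Euclidean nearest
# neighbours agree with great-circle nearest neighbours.
def to_unit_vector(lat, lon):
    phi, lmb = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lmb),
            math.cos(phi) * math.sin(lmb),
            math.sin(phi))

# Points either side of the antimeridian are close in 3D even though
# their longitudes differ by nearly 360 degrees.
a = to_unit_vector(0.0, 179.9)
b = to_unit_vector(0.0, -179.9)
print(round(math.dist(a, b), 5))  # → 0.00349
```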

A final touch would be to know the actual country name. This can
again be achieved using GEONAMES exports.

```sh
wget https://download.geonames.org/export/dump/countryInfo.txt
```

This small dataset contains country names, local currencies and even
each country's name as it is represented in that country. The export
is processed and included in the frontend during bundling. More could
be done to minify the contents to what is actually consumed by the
frontend; however, given that actual payloads of this gallery
application run in the hundreds of megabytes for high-res thumbnails,
it's probably not worth the effort. The bundler might one day be
smart enough to figure it out itself.
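
As a sketch of the kind of minification mentioned — assuming the
current export layout, where `#`-prefixed lines are comments, fields
are tab-separated, the ISO code comes first and the English country
name is the fifth field — the dataset can be reduced to just the
ISO-code to country-name mapping:

```python
# Sketch: reduce countryInfo.txt to the ISO-code -> country-name map
# the frontend consumes. Field positions assume the current export
# layout; the sample rows below are abbreviated.
def country_names(lines):
    names = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the commented header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4:
            names[fields[0]] = fields[4]
    return names

sample = [
    "#ISO\tISO3\tISO-Numeric\tfips\tCountry\t...",
    "CH\tCHE\t756\tSZ\tSwitzerland\tBern",
]
print(country_names(sample))  # → {'CH': 'Switzerland'}
```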
preprocessing/geokdclust.py
import sys
import json
from kdtree import GeoIndexer

def main():
    if len(sys.argv) < 2:
        print("Usage: python geokdclust.py <directory>")
        sys.exit(1)

    directory = sys.argv[1]

    # Define the column names in order based on the GEONAMES schema
    column_names = [
        'geonameid',
        'name',
        'asciiname',
        'alternatenames',
        'latitude',
        'longitude',
        'feature_class',
        'feature_code',
        'country_code',
        'cc2',
        'admin1_code',
        'admin2_code',
        'admin3_code',
        'admin4_code',
        'population',
        'elevation',
        'dem',
        'timezone',
        'modification_date',
    ]

    # Indices of the required columns (awk columns 1,9,11,12,3,5,6;
    # zero-based here)
    required_indices = [0, 8, 10, 11, 2, 4, 5]

    # Extract the corresponding column names
    required_columns = [column_names[i] for i in required_indices]

    # List to store all rows as dictionaries
    data = []

    # Read the file and process each line
    num_cities = '500'
    with open(f"cities{num_cities}.txt", 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip empty lines

            fields = line.split('\t')
            if len(fields) < max(required_indices) + 1:
                continue  # Skip lines with insufficient data

            row = {}
            for idx, col in zip(required_indices, required_columns):
                row[col] = fields[idx]

            data.append(row)

    city_nodes = [(
        float(r['latitude']),
        float(r['longitude']),
        f"{r['country_code']}:{r['admin1_code']}:{r['admin2_code']}"
    ) for r in data]

    # admin2Codes.txt columns: code, name, asciiname, geonameId
    regions = dict()

    with open('admin2Codes.txt', 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip empty lines

            fields = line.split('\t')
            # keys like 'CH.FR.1003' are rewritten to 'CH:FR:1003' to
            # match the city_nodes key format
            regions[fields[0].replace('.', ':')] = fields[1]

    # Build the index
    print("Building index...")
    indexer = GeoIndexer(city_nodes)
    print("Index built successfully.")

    with open(f"{directory}.clusters.json") as file:
        clusters = json.load(file)[0]

    rbc = dict()  # photo nodes grouped by derived region code
    unk = []      # photo nodes whose nearest city has no known region

    towns = set()

    for c in clusters:
        for n in c:
            # Find the nearest city node to this photo's lat-lon
            nearest_node = indexer.find_nearest(
                float(n['lat']),
                float(n['lon'])
            )

            print(f"\nQuery Point: ({n['lat']}, {n['lon']})")
            print(f"Nearest Node Found: {nearest_node}")

            if nearest_node[2] in regions:
                print(f"Derived Admin Region: {regions[nearest_node[2]]}")
                if nearest_node[2] in rbc:
                    rbc[nearest_node[2]].append(n)
                else:
                    rbc[nearest_node[2]] = [n]

                towns.add(nearest_node)

            else:
                print(f"Unknown Admin Region: {nearest_node[2]}")
                unk.append(n)

    with open(f"{directory}.region.clusters.{num_cities}.json", 'w') as file:
        grouped = list(rbc.values())
        grouped.append(unk)  # keep unmatched nodes as a final group
        json.dump([grouped], file)

    with open(f"{directory}.close_cities.{num_cities}.json", 'w') as file:
        json.dump(list(towns), file)

if __name__ == '__main__':
    main()