A "Database-less" System to Handle 220 Million Spatial Data Points

Or: How I learned to use file-based spatial indexing
The "Why Should I Care?" Part
Picture this: You've got 220 million data points representing every bump and valley across three German cities. Your product manager walks over with that look that says "I have a brilliant idea" and asks, "Can we make this, you know, interactive?"
Naturally, you think PostgreSQL + PostGIS. It's the sensible choice, right? Everyone does it. But here's the thing about 220 million points - they don't care about your sensible choices. They just want to exist, and they want to make your queries take forever.
This is the story of how I built Terrain Mapper, a system that analyzes terrain data around building footprints without giving me nightmares about database performance. And yes, there's a reason I didn't use PostgreSQL.
The "What Even Is This Thing?" Section
Terrain Mapper is what happens when you combine terrain data with building footprints and sprinkle some geospatial magic on top. Users can:
- Draw polygons to analyse terrain characteristics
- Drop pins on specific buildings
- Get insights about soil erosion risk, solar potential, and terrain ruggedness
- Watch their browser not freeze while processing millions of data points
The secret sauce? A file-based spatial partitioning system that would make Apache Hive proud.
System Design: Or Why I Chose the "Wrong" Approach
The Database That Isn't a Database
"Wait, you're not using a database?" - Every enterprise architect ever
Nope. Instead, I use a directory structure organized by Geohash. Think of it as a filing cabinet where each drawer represents a 1.2km x 0.6km area:
/db/
├── u4pruyd/                 # That's Stuttgart
│   ├── buildings.parquet
│   ├── dtm.parquet
│   └── rasters/
└── u4pruyf/                 # Another part of Stuttgart
    ├── buildings.parquet
    ├── dtm.parquet
    └── rasters/
When someone draws a polygon, the algorithm:
- Figures out which "drawers" (Geohash partitions) overlap
- Only opens those specific drawers
- Processes them in parallel
It's like having a very organized librarian who knows exactly where every book is, except the librarian is Python multiprocessing.
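For the curious, here's a minimal sketch of the "which drawers overlap" step. It assumes the pygeohash package and precision-6 cells (roughly the 1.2km x 0.6km drawers above - adjust to whatever precision the partitions on disk actually use); the real logic lives in GeohashService, shown later.

import numpy as np
import pygeohash as pgh
from shapely.geometry import Polygon

def geohashes_covering_polygon(polygon: Polygon, precision: int = 6) -> set:
    # Sample the polygon's bounding box on a grid finer than one geohash cell
    # and collect the distinct hashes. This slightly over-covers, which is
    # harmless: opening an extra drawer or two beats scanning 220 million rows.
    minx, miny, maxx, maxy = polygon.bounds
    step_lon, step_lat = 0.005, 0.0025  # roughly half a precision-6 cell
    hashes = set()
    for lon in np.arange(minx, maxx + step_lon, step_lon):
        for lat in np.arange(miny, maxy + step_lat, step_lat):
            hashes.add(pgh.encode(lat, lon, precision=precision))
    return hashes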
Async? What Async?
"Shouldn't this be async?" - our senior developer, probably
Here's the dirty secret: Not everything needs to be async. When you're doing heavy CPU-bound work like raster interpolation, async just adds complexity without benefits. I use multiprocessing instead:
# Not shown: The 300 lines of error handling that make this actually work
from multiprocessing import Pool, cpu_count

def process_geohash(self, geohash: str, input_geom: Polygon) -> List[dict]:
    # args holds one (building, input_geom) tuple per building in this
    # partition, prepared further up in the real method
    with Pool(cpu_count()) as pool:
        building_reports = pool.starmap(self.process_building, args)
    return building_reports
The only async in the system is where it actually makes sense - handling HTTP requests while file operations happen in the background.
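To make that concrete, here's a hedged sketch of the only async in the system: the FastAPI handler stays async so the event loop keeps serving requests, while the blocking report generation is pushed off the loop. building_service is assumed to be the BuildingService orchestrator from the service-layer section below, constructed once at startup.

import asyncio

from fastapi import FastAPI

app = FastAPI()
building_service = BuildingService()  # assumed: built once at startup

@app.post("/stats")
async def stats(geojson: dict):
    # asyncio.to_thread (Python 3.9+) keeps the event loop free to answer
    # /health while the heavy work runs in a worker thread
    reports = await asyncio.to_thread(
        building_service.generate_building_reports, geojson
    )
    return {"reports": reports}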
The Data Pipeline: Eight Steps to Terrain Enlightenment
Step 1: XYZ to Parquet (Because Text Files Are for Humans)
Starting with raw XYZ files is like trying to eat soup with a fork. Parquet conversion is step one:
# The most boring code you'll see today, but it works
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

df = pd.read_csv(file_path, sep=" ", header=None, names=["x", "y", "height"])
df['geometry'] = df.apply(lambda row: Point(row['x'], row['y']), axis=1)
gdf = gpd.GeoDataFrame(df, geometry='geometry')
gdf.to_parquet(output_file)
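If that row-by-row apply ever becomes the bottleneck, geopandas has a vectorized constructor that builds the same GeoDataFrame without a Python-level loop:

# Same GeoDataFrame, but vectorized: points_from_xy skips the per-row lambda,
# which matters when a single tile holds a few million rows
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["x"], df["y"]))
gdf.to_parquet(output_file)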
Steps 2-8: The Pipeline That Never Ends
The full pipeline looks like this:
- XYZ → Parquet (already covered)
- Parquet → Geohash grids (spatial indexing)
- Grids → Database partitions
- Partitions → Interpolated rasters
- Rasters → Merged global raster
- Global raster → Terrain layers (slope, aspect, etc.)
- Grid cells → Derived variables (soil erosion, solar potential)
- Variables → Final interpolated layers
Each step is a separate script because:
- It's easier to debug
- You can rerun individual steps
- Your computer won't explode processing 220 million points at once
Check out the preprocessing scripts if you want the gory details.
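As a taste of steps 2 and 3, here's a minimal sketch of bucketing points into Geohash partitions. It assumes pygeohash and that x/y are already lon/lat (WGS84); real German elevation data usually arrives in a projected CRS such as UTM, so the actual scripts have to reproject first.

from pathlib import Path

import geopandas as gpd
import pygeohash as pgh

def partition_by_geohash(gdf: gpd.GeoDataFrame, out_dir: str = "db", precision: int = 6) -> None:
    # Tag every point with its geohash, then write one Parquet file per cell,
    # mirroring the /db/<geohash>/dtm.parquet layout shown earlier
    gdf = gdf.copy()
    gdf["geohash"] = [pgh.encode(p.y, p.x, precision=precision) for p in gdf.geometry]
    for gh, chunk in gdf.groupby("geohash"):
        partition = Path(out_dir) / gh
        partition.mkdir(parents=True, exist_ok=True)
        chunk.drop(columns="geohash").to_parquet(partition / "dtm.parquet")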
Code Patterns: The Good, The Bad, and The "It Works"
The Service Layer Pattern
"Why so many service classes?" - Junior developer on code review
Because separation of concerns isn't just a buzzword. Each service has one job:
class RasterService:
    def get_raster_stats(self, raster_key: str, zone_geom: Polygon):
        # Just get stats from rasters, nothing else
        ...

class GeohashService:
    def geohash_grid_covering_polygon(self, polygon: Polygon):
        # Just handle geohash logic, nothing else
        ...

class BuildingService:
    def generate_building_reports(self, geojson: dict):
        # Orchestrate everything, but don't do the work yourself
        ...
This makes testing easier and keeps your future self from wanting to rage-quit when debugging.
The "Clean Your Mess" Pattern
import math

class ReportCleaner:
    @staticmethod
    def remove_nan_values(data):
        # Because NaN values are like glitter - they get everywhere
        if isinstance(data, dict):
            return {k: ReportCleaner.remove_nan_values(v) for k, v in data.items()}
        elif isinstance(data, float) and (math.isnan(data) or math.isinf(data)):
            return None  # Replace the impossible with the plausible
        return data
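A quick, made-up example of why this matters: Python's json.dumps will happily emit NaN, which isn't valid JSON and breaks in the browser, so every report gets scrubbed before it reaches the response.

import json

report = {"slope": float("nan"), "aspect": 112.5}  # illustrative values
print(json.dumps(ReportCleaner.remove_nan_values(report)))
# {"slope": null, "aspect": 112.5} - valid JSON; the raw NaN would not be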
Error Handling: The "Hope for the Best, Plan for the Worst" Approach
"Do we really need all this error handling?" - Product manager, before production
Yes. Yes, we do. Geospatial data is messy. Really messy. Here's what real error handling looks like:
try:
    with rasterio.open(raster_path) as src:
        # The happy path that works 95% of the time
        result = process_raster(src)
except FileNotFoundError:
    # The 3% case where files are missing
    logger.error(f"Missing raster file: {raster_path}")
    return None
except Exception as e:
    # The 2% case where weird stuff happens
    logger.error(f"Unknown error: {e}")
    return None
Performance: Why This Doesn't Suck
Geohash Indexing: The Magic Behind the Speed
"Why not just use a spatial index in PostGIS?" - The same senior guy?
Because Geohash gives you spatial indexing and a natural partitioning strategy in one package. When someone draws a polygon:
- Calculate overlapping Geohash cells (milliseconds)
- Read only those partitions (seconds, not minutes)
- Process in parallel (because CPUs have multiple cores for a reason)
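Put together, the read path looks roughly like this - a sketch that reuses the cover function from earlier and assumes the /db/<geohash>/buildings.parquet layout from the directory tree above:

from pathlib import Path

import geopandas as gpd
import pandas as pd

def buildings_in_polygon(polygon, db_dir: str = "db") -> gpd.GeoDataFrame:
    # One small Parquet file per overlapping drawer instead of a full scan
    frames = []
    for gh in geohashes_covering_polygon(polygon):
        path = Path(db_dir) / gh / "buildings.parquet"
        if path.exists():
            frames.append(gpd.read_parquet(path))
    if not frames:
        return gpd.GeoDataFrame(geometry=[])
    buildings = gpd.GeoDataFrame(pd.concat(frames, ignore_index=True))
    return buildings[buildings.intersects(polygon)]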
COG and Vector Tiles: Because Bandwidth Costs Money
Cloud Optimized GeoTIFFs (COG) and vector tiles mean:
- Only the visible area gets downloaded
- Data is pre-tiled and cached
- Your users don't hate you for slow load times
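Producing the COGs is the unglamorous part. One hedged way to do it from the pipeline (assuming GDAL 3.1+, which ships the COG driver; file names and the compression choice are illustrative):

import subprocess

# gdal_translate with the COG driver rewrites the merged raster into a
# Cloud Optimized GeoTIFF with internal tiling and overviews
subprocess.run(
    [
        "gdal_translate", "-of", "COG",
        "-co", "COMPRESS=DEFLATE",
        "global_dtm.tif", "global_dtm_cog.tif",
    ],
    check=True,
)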
Deployment: Docker All the Things
"Can't we just deploy it on bare metal?" - Infrastructure team
Sure, if you enjoy managing dependencies across different environments. Or we could use Docker:
# docker-compose.yml - The entire application in one file (one of five services shown)
services:
  backend:
    image: credium-be:latest
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 2G
    volumes:
      - ${PWD}/data:/var/task/fastapi/data
Five services, one command, zero headaches (okay maybe a few).
API Design: RESTful-ish
The API is refreshingly simple:
- GET /health - Is it alive?
- POST /rasterstats - Give me min/max values for this area
- POST /stats - Analyse buildings in this polygon
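A hedged example call, assuming the stack is running locally on port 8000 - the exact request and response schema lives in the repo:

import requests

# A rough polygon over central Stuttgart, as a GeoJSON Feature
polygon = {
    "type": "Feature",
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [9.17, 48.77], [9.20, 48.77], [9.20, 48.79],
            [9.17, 48.79], [9.17, 48.77],
        ]],
    },
}
response = requests.post("http://localhost:8000/stats", json=polygon)
print(response.status_code, response.json())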
Testing: Because Future You Will Thank Present You
"Do we really need all these unit tests?" - Developer on a deadline
Yes. Especially for geospatial code. Here's why:
def test_geohash_bbox_generation(self):
    # Test the easy case
    bbox = self.geohash_service.get_geohash_bbox('u4pruyd')
    assert bbox.is_valid

    # Test the edge case that will definitely break in production
    edge_case_bbox = self.geohash_service.get_geohash_bbox('zzzzzzz')
    assert edge_case_bbox.is_empty  # This actually matters
Run the tests with: python docker/backend/fastapi/test_unittests.py
Lessons Learned: The Therapy Session
Database Choice Matters More Than You Think
Sometimes the "wrong" choice is the right choice. File-based storage with spatial indexing outperformed database solutions for our use case.
Don't Overengineer (But Do Engineer)
Async everywhere? Nope. Microservices for every function? Also nope. Find the right balance.
Documentation is Future-Proofing
That README you're tempted to skip? Write it. Your future self (or replacement) will send you thank-you notes.
Interactive Time: Break It!?
Want to get your hands dirty? Here's what you can do right now:
- Draw a huge polygon (the system will sample 10 random buildings)
- Request stats for a non-existent raster
- Monitor the logs through Dozzle
Deploy it locally
git clone https://github.com/purijs/terrain-mapper.git
cd terrain-mapper
docker swarm init
docker stack deploy --compose-file docker-compose.yml terrain
The "What's Next?" Section
Future improvements include:
- Automated data pipeline for new regions
- ML models for risk prediction
- Support for historical terrain change analysis
- Integration with weather data
"When will these features be ready?" - Product manager, inevitably
If you made it this far, you're either: a) Really interested in geospatial system design b) Procrastinating on your own project
Either way, check out the full source code and let me know what you think. Or don't :p
Have questions? Spot a bug? Think the database-less approach is insane? Drop a comment below