A "Database-less" System to Handle 220 Million Spatial Data Points

Or: How I learned to use file-based spatial indexing
The "Why Should I Care?" Part
Picture this: You've got 220 million data points representing every bump and valley across three German cities. Your product manager walks over with that look that says "I have a brilliant idea" and asks, "Can we make this, you know, interactive?"
Naturally, you think PostgreSQL + PostGIS. It's the sensible choice, right? Everyone does it. But here's the thing about 220 million points - they don't care about your sensible choices. They just want to exist, and they want to make your queries take forever.
This is the story of how I built Terrain Mapper, a system that analyzes terrain data around building footprints without giving me nightmares about database performance. And yes, there's a reason I didn't use PostgreSQL.
The "What Even Is This Thing?" Section
Terrain Mapper is what happens when you combine terrain data with building footprints and sprinkle some geospatial magic on top. Users can:
- Draw polygons to analyse terrain characteristics
- Drop pins on specific buildings
- Get insights about soil erosion risk, solar potential, and terrain ruggedness
- Watch their browser not freeze while processing millions of data points
The secret sauce? A file-based spatial partitioning system that would make Apache Hive proud.
System Design: Or Why I Chose the "Wrong" Approach
The Database That Isn't a Database
"Wait, you're not using a database?" - Every enterprise architect ever
Nope. Instead, I use a directory structure organized by Geohash. Think of it as a filing cabinet where each drawer represents a 1.2km x 0.6km area:
/db/
├── u4pruyd/                 # That's Stuttgart
│   ├── buildings.parquet
│   ├── dtm.parquet
│   └── rasters/
└── u4pruyf/                 # Another part of Stuttgart
    ├── buildings.parquet
    ├── dtm.parquet
    └── rasters/
When someone draws a polygon, the algorithm:
- Figures out which "drawers" (Geohash partitions) overlap
- Only opens those specific drawers
- Processes them in parallel
It's like having a very organized librarian who knows exactly where every book is, except the librarian is Python multiprocessing.
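For the curious, here's a minimal sketch of the "which drawers overlap" step. It assumes the pygeohash package and precision-6 cells (roughly the 1.2km x 0.6km drawers above - adjust to whatever precision the partitions on disk actually use); the real logic lives in GeohashService, shown later.

import numpy as np
import pygeohash as pgh
from shapely.geometry import Polygon

def geohashes_covering_polygon(polygon: Polygon, precision: int = 6) -> set:
    # Sample the polygon's bounding box on a grid finer than one geohash cell
    # and collect the distinct hashes. This slightly over-covers, which is
    # harmless: opening an extra drawer or two beats scanning 220 million rows.
    minx, miny, maxx, maxy = polygon.bounds
    step_lon, step_lat = 0.005, 0.0025  # roughly half a precision-6 cell
    hashes = set()
    for lon in np.arange(minx, maxx + step_lon, step_lon):
        for lat in np.arange(miny, maxy + step_lat, step_lat):
            hashes.add(pgh.encode(lat, lon, precision=precision))
    return hashes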
Async? What Async?
"Shouldn't this be async?" - our senior developer, probably
Here's the dirty secret: Not everything needs to be async. When you're doing heavy CPU-bound work like raster interpolation, async just adds complexity without benefits. I use multiprocessing instead:
# Not shown: The 300 lines of error handling that make this actually work
from multiprocessing import Pool, cpu_count

def process_geohash(self, geohash: str, input_geom: Polygon) -> List[dict]:
    # args holds one (building, input_geom) tuple per building in this
    # partition, prepared further up in the real method
    with Pool(cpu_count()) as pool:
        building_reports = pool.starmap(self.process_building, args)
    return building_reports
The only async in the system is where it actually makes sense - handling HTTP requests while file operations happen in the background.
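To make that concrete, here's a hedged sketch of the only async in the system: the FastAPI handler stays async so the event loop keeps serving requests, while the blocking report generation is pushed off the loop. building_service is assumed to be the BuildingService orchestrator from the service-layer section below, constructed once at startup.

import asyncio

from fastapi import FastAPI

app = FastAPI()
building_service = BuildingService()  # assumed: built once at startup

@app.post("/stats")
async def stats(geojson: dict):
    # asyncio.to_thread (Python 3.9+) keeps the event loop free to answer
    # /health while the heavy work runs in a worker thread
    reports = await asyncio.to_thread(
        building_service.generate_building_reports, geojson
    )
    return {"reports": reports}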
The Data Pipeline: Eight Steps to Terrain Enlightenment
Step 1: XYZ to Parquet (Because Text Files Are for Humans)
Starting with raw XYZ files is like trying to eat soup with a fork. Parquet conversion is step one:
# The most boring code you'll see today, but it works
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

df = pd.read_csv(file_path, sep=" ", header=None, names=["x", "y", "height"])
df['geometry'] = df.apply(lambda row: Point(row['x'], row['y']), axis=1)
gdf = gpd.GeoDataFrame(df, geometry='geometry')
gdf.to_parquet(output_file)
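If that row-by-row apply ever becomes the bottleneck, geopandas has a vectorized constructor that builds the same GeoDataFrame without a Python-level loop:

# Same GeoDataFrame, but vectorized: points_from_xy skips the per-row lambda,
# which matters when a single tile holds a few million rows
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df["x"], df["y"]))
gdf.to_parquet(output_file)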
Steps 2-8: The Pipeline That Never Ends
The full pipeline looks like this:
- XYZ → Parquet (already covered)
- Parquet → Geohash grids (spatial indexing)
- Grids → Database partitions
- Partitions → Interpolated rasters
- Rasters → Merged global raster
- Global raster → Terrain layers (slope, aspect, etc.)
- Grid cells → Derived variables (soil erosion, solar potential)
- Variables → Final interpolated layers
Each step is a separate script because:
- It's easier to debug
- You can rerun individual steps
- Your computer won't explode processing 220 million points at once
Check out the preprocessing scripts if you want the gory details.
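As a taste of steps 2 and 3, here's a minimal sketch of bucketing points into Geohash partitions. It assumes pygeohash and that x/y are already lon/lat (WGS84); real German elevation data usually arrives in a projected CRS such as UTM, so the actual scripts have to reproject first.

from pathlib import Path

import geopandas as gpd
import pygeohash as pgh

def partition_by_geohash(gdf: gpd.GeoDataFrame, out_dir: str = "db", precision: int = 6) -> None:
    # Tag every point with its geohash, then write one Parquet file per cell,
    # mirroring the /db/<geohash>/dtm.parquet layout shown earlier
    gdf = gdf.copy()
    gdf["geohash"] = [pgh.encode(p.y, p.x, precision=precision) for p in gdf.geometry]
    for gh, chunk in gdf.groupby("geohash"):
        partition = Path(out_dir) / gh
        partition.mkdir(parents=True, exist_ok=True)
        chunk.drop(columns="geohash").to_parquet(partition / "dtm.parquet")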
Code Patterns: The Good, The Bad, and The "It Works"
The Service Layer Pattern
"Why so many service classes?" - Junior developer on code review
Because separation of concerns isn't just a buzzword. Each service has one job:
class RasterService:
    def get_raster_stats(self, raster_key: str, zone_geom: Polygon):
        # Just get stats from rasters, nothing else
        ...

class GeohashService:
    def geohash_grid_covering_polygon(self, polygon: Polygon):
        # Just handle geohash logic, nothing else
        ...

class BuildingService:
    def generate_building_reports(self, geojson: dict):
        # Orchestrate everything, but don't do the work yourself
        ...
This makes testing easier and keeps your future self from wanting to rage-quit when debugging.
The "Clean Your Mess" Pattern
import math

class ReportCleaner:
    @staticmethod
    def remove_nan_values(data):
        # Because NaN values are like glitter - they get everywhere
        if isinstance(data, dict):
            return {k: ReportCleaner.remove_nan_values(v) for k, v in data.items()}
        elif isinstance(data, float) and (math.isnan(data) or math.isinf(data)):
            return None  # Replace the impossible with the plausible
        return data
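A quick, made-up example of why this matters: Python's json.dumps will happily emit NaN, which isn't valid JSON and breaks in the browser, so every report gets scrubbed before it reaches the response.

import json

report = {"slope": float("nan"), "aspect": 112.5}  # illustrative values
print(json.dumps(ReportCleaner.remove_nan_values(report)))
# {"slope": null, "aspect": 112.5} - valid JSON; the raw NaN would not be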
Error Handling: The "Hope for the Best, Plan for the Worst" Approach
"Do we really need all this error handling?" - Product manager, before production
Yes. Yes, we do. Geospatial data is messy. Really messy. Here's what real error handling looks like:
try:
    with rasterio.open(raster_path) as src:
        # The happy path that works 95% of the time
        result = process_raster(src)
except FileNotFoundError:
    # The 3% case where files are missing
    logger.error(f"Missing raster file: {raster_path}")
    return None
except Exception as e:
    # The 2% case where weird stuff happens
    logger.error(f"Unknown error: {e}")
    return None
Performance: Why This Doesn't Suck
Geohash Indexing: The Magic Behind the Speed
"Why not just use a spatial index in PostGIS?" - The same senior guy?
Because Geohash gives you spatial indexing and a natural partitioning strategy in one package. When someone draws a polygon:
- Calculate overlapping Geohash cells (milliseconds)
- Read only those partitions (seconds, not minutes)
- Process in parallel (because CPUs have multiple cores for a reason)
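Put together, the read path looks roughly like this - a sketch that reuses the cover function from earlier and assumes the /db/<geohash>/buildings.parquet layout from the directory tree above:

from pathlib import Path

import geopandas as gpd
import pandas as pd

def buildings_in_polygon(polygon, db_dir: str = "db") -> gpd.GeoDataFrame:
    # One small Parquet file per overlapping drawer instead of a full scan
    frames = []
    for gh in geohashes_covering_polygon(polygon):
        path = Path(db_dir) / gh / "buildings.parquet"
        if path.exists():
            frames.append(gpd.read_parquet(path))
    if not frames:
        return gpd.GeoDataFrame(geometry=[])
    buildings = gpd.GeoDataFrame(pd.concat(frames, ignore_index=True))
    return buildings[buildings.intersects(polygon)]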
COG and Vector Tiles: Because Bandwidth Costs Money
Cloud Optimized GeoTIFFs (COG) and vector tiles mean:
- Only the visible area gets downloaded
- Data is pre-tiled and cached
- Your users don't hate you for slow load times
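Producing the COGs is the unglamorous part. One hedged way to do it from the pipeline (assuming GDAL 3.1+, which ships the COG driver; file names and the compression choice are illustrative):

import subprocess

# gdal_translate with the COG driver rewrites the merged raster into a
# Cloud Optimized GeoTIFF with internal tiling and overviews
subprocess.run(
    [
        "gdal_translate", "-of", "COG",
        "-co", "COMPRESS=DEFLATE",
        "global_dtm.tif", "global_dtm_cog.tif",
    ],
    check=True,
)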
Deployment: Docker All the Things
"Can't we just deploy it on bare metal?" - Infrastructure team
Sure, if you enjoy managing dependencies across different environments. Or we could use Docker:
# docker-compose.yml - The entire application in one file (one of five services shown)
services:
  backend:
    image: credium-be:latest
    deploy:
      resources:
        limits:
          cpus: "1"
          memory: 2G
    volumes:
      - ${PWD}/data:/var/task/fastapi/data
Five services, one command, zero headaches (okay maybe a few).
API Design: RESTful-ish
The API is refreshingly simple:
- GET /health - Is it alive?
- POST /rasterstats - Give me min/max values for this area
- POST /stats - Analyse buildings in this polygon
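A hedged example call, assuming the stack is running locally on port 8000 - the exact request and response schema lives in the repo:

import requests

# A rough polygon over central Stuttgart, as a GeoJSON Feature
polygon = {
    "type": "Feature",
    "properties": {},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [9.17, 48.77], [9.20, 48.77], [9.20, 48.79],
            [9.17, 48.79], [9.17, 48.77],
        ]],
    },
}
response = requests.post("http://localhost:8000/stats", json=polygon)
print(response.status_code, response.json())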
Testing: Because Future You Will Thank Present You
"Do we really need all these unit tests?" - Developer on a deadline
Yes. Especially for geospatial code. Here's why:
def test_geohash_bbox_generation(self):
    # Test the easy case
    bbox = self.geohash_service.get_geohash_bbox('u4pruyd')
    assert bbox.is_valid

    # Test the edge case that will definitely break in production
    edge_case_bbox = self.geohash_service.get_geohash_bbox('zzzzzzz')
    assert edge_case_bbox.is_empty  # This actually matters
Run the tests with: python docker/backend/fastapi/test_unittests.py
Lessons Learned: The Therapy Session
Database Choice Matters More Than You Think
Sometimes the "wrong" choice is the right choice. File-based storage with spatial indexing outperformed database solutions for our use case.
Don't Overengineer (But Do Engineer)
Async everywhere? Nope. Microservices for every function? Also nope. Find the right balance.
Documentation is Future-Proofing
That README you're tempted to skip? Write it. Your future self (or replacement) will send you thank-you notes.
Interactive Time: Break It!?
Want to get your hands dirty? Here's what you can do right now:
- Draw a huge polygon (the system will sample 10 random buildings)
- Request stats for a non-existent raster
- Monitor the logs through Dozzle
Deploy it locally
git clone https://github.com/purijs/terrain-mapper.git
cd terrain-mapper
docker swarm init
docker stack deploy --compose-file docker-compose.yml terrain
The "What's Next?" Section
Future improvements include:
- Automated data pipeline for new regions
- ML models for risk prediction
- Support for historical terrain change analysis
- Integration with weather data
"When will these features be ready?" - Product manager, inevitably
If you made it this far, you're either: a) Really interested in geospatial system design b) Procrastinating on your own project
Either way, check out the full source code and let me know what you think. Or don't :p
Have questions? Spot a bug? Think the database-less approach is insane? Drop a comment below