
Creating and Using Geospatial Data Files#
So now we know that geospatial data can be in the form of dataframes containing coordinates with associated attributes, and we know how to use different kinds of coordinates. We’re almost ready to use real geospatial data.
But how do we get our real geospatial data into whatever tool we’re using? How do we save geospatial data to a file?
Word documents use .docx files, but not all word processors do; OpenOffice, for example, uses .odt files. Is geospatial data like that, with different programs having different ways of saving files? Can some software only open certain files?
Well, it’s both simpler and more complicated than that, but the good news is that most geospatial tools will be able to read most forms of geospatial data files.
The less good news is that there are a few different formats you should get familiar with. I'll discuss five vector geospatial formats here in particular.
1. CSV - Comma Separated Values#
CSV files are very simple. Each row of the data is on a different line, with the values for different columns separated by commas. Let’s demonstrate using the air pollution data from the Attribute Data notebook:
import pandas as pd
import geopandas as gpd
from shapely import Point
gdf = gpd.GeoDataFrame(
    {
        "Measurement_time": [
            "2023-11-01 08:00:00",
            "2023-11-01 08:00:30",
            "2023-11-01 08:01:00",
            "2023-11-01 08:01:30",
            "2023-11-01 08:02:00",
        ],
        "PM2.5": [3, 4, 6, 2, 3],
        "PM10": [5, 6, 5, 3, 4],
        "Pressure": [1012, 1012, 1012, 1013, 1013],
        "Temperature": [12, 12, 13, 13, 13],
        "Humidity": [86, 86, 87, 87, 87],
        # Every reading comes from the same (stationary) sensor.
        "geometry": [Point(555173, 654321)] * 5,
    }
)
gdf['timestamp'] = pd.to_datetime(gdf['Measurement_time'])
gdf = gdf.set_index('timestamp')
gdf = gdf.drop(columns="Measurement_time")
gdf = gdf.set_crs('epsg:2157')
gdf
| timestamp | PM2.5 | PM10 | Pressure | Temperature | Humidity | geometry |
|---|---|---|---|---|---|---|
| 2023-11-01 08:00:00 | 3 | 5 | 1012 | 12 | 86 | POINT (555173 654321) |
| 2023-11-01 08:00:30 | 4 | 6 | 1012 | 12 | 86 | POINT (555173 654321) |
| 2023-11-01 08:01:00 | 6 | 5 | 1012 | 13 | 87 | POINT (555173 654321) |
| 2023-11-01 08:01:30 | 2 | 3 | 1013 | 13 | 87 | POINT (555173 654321) |
| 2023-11-01 08:02:00 | 3 | 4 | 1013 | 13 | 87 | POINT (555173 654321) |
print(gdf.crs)
epsg:2157
We can save this as a CSV:
gdf.to_csv('../sample_data/ap.csv')
and then look at what the contents of the file look like:
with open('../sample_data/ap.csv', 'r') as f:
    content = f.read()
print(content)
timestamp,PM2.5,PM10,Pressure,Temperature,Humidity,geometry
2023-11-01 08:00:00,3,5,1012,12,86,POINT (555173 654321)
2023-11-01 08:00:30,4,6,1012,12,86,POINT (555173 654321)
2023-11-01 08:01:00,6,5,1012,13,87,POINT (555173 654321)
2023-11-01 08:01:30,2,3,1013,13,87,POINT (555173 654321)
2023-11-01 08:02:00,3,4,1013,13,87,POINT (555173 654321)
It’s very simple and straightforward. Because of this, you’ll find a lot of geospatial data saved as CSV files.
However, the CSV format wasn't designed for geospatial data, and one thing it can't do is store any information about the coordinate reference system. If you're using CSVs, the CRS has to be managed manually. That's often fine, but there are other formats which do save the CRS (and other associated information).
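Reading a CSV back in shows what that manual management looks like: the geometry column comes back as plain text, and the CRS has to be re-attached by hand. Here's a minimal sketch, using an in-memory string to stand in for the file on disk:

```python
import pandas as pd
import geopandas as gpd
from io import StringIO

# A couple of rows in the same shape as the CSV we just wrote
# (StringIO stands in for the file on disk).
csv_text = (
    "timestamp,PM2.5,geometry\n"
    "2023-11-01 08:00:00,3,POINT (555173 654321)\n"
    "2023-11-01 08:00:30,4,POINT (555173 654321)\n"
)
df = pd.read_csv(StringIO(csv_text))

# The geometry column is just text at this point, so parse the WKT
# strings back into real geometries...
geometry = gpd.GeoSeries.from_wkt(df['geometry'])

# ...and re-attach the CRS manually, since the CSV couldn't store it.
gdf_csv = gpd.GeoDataFrame(df.drop(columns='geometry'), geometry=geometry)
gdf_csv = gdf_csv.set_crs('epsg:2157')
```

Forgetting that `set_crs` step is exactly the kind of silent mistake CSVs invite: the data still plots, just in the wrong place.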
2. ESRI Shapefiles#
Probably the most common geospatial data format is the ESRI shapefile. Let’s have a look:
gdf.to_file('../sample_data/ap.shp')
/tmp/ipykernel_617/508446976.py:1: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.
gdf.to_file('../sample_data/ap.shp')
/home/breandan/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pyogrio/raw.py:723: RuntimeWarning: Field timestamp create as date field, though DateTime requested.
ogr_write(
/home/breandan/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pyogrio/raw.py:723: RuntimeWarning: Normalized/laundered field name: 'Temperature' to 'Temperatur'
ogr_write(
Whoops. Okay, well, straight away we can see a disadvantage of the shapefile format. Let’s change the datetime data to text, and try that again:
gdf_shp = gdf.reset_index()
gdf_shp['timestamp'] = gdf_shp['timestamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
gdf_shp
| | timestamp | PM2.5 | PM10 | Pressure | Temperature | Humidity | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 2023-11-01 08:00:00 | 3 | 5 | 1012 | 12 | 86 | POINT (555173 654321) |
| 1 | 2023-11-01 08:00:30 | 4 | 6 | 1012 | 12 | 86 | POINT (555173 654321) |
| 2 | 2023-11-01 08:01:00 | 6 | 5 | 1012 | 13 | 87 | POINT (555173 654321) |
| 3 | 2023-11-01 08:01:30 | 2 | 3 | 1013 | 13 | 87 | POINT (555173 654321) |
| 4 | 2023-11-01 08:02:00 | 3 | 4 | 1013 | 13 | 87 | POINT (555173 654321) |
gdf_shp.to_file('../sample_data/ap.shp')
/tmp/ipykernel_617/612289223.py:1: UserWarning: Column names longer than 10 characters will be truncated when saved to ESRI Shapefile.
gdf_shp.to_file('../sample_data/ap.shp')
/home/breandan/.pyenv/versions/3.11.6/envs/GY4006/lib/python3.11/site-packages/pyogrio/raw.py:723: RuntimeWarning: Normalized/laundered field name: 'Temperature' to 'Temperatur'
ogr_write(
The warnings there are showing us another limitation of the format: column names can be at most 10 characters long.
gdf_from_shp = gpd.read_file('../sample_data/ap.shp')
gdf_from_shp
| | timestamp | PM2.5 | PM10 | Pressure | Temperatur | Humidity | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 2023-11-01 08:00:00 | 3 | 5 | 1012 | 12 | 86 | POINT (555173 654321) |
| 1 | 2023-11-01 08:00:30 | 4 | 6 | 1012 | 12 | 86 | POINT (555173 654321) |
| 2 | 2023-11-01 08:01:00 | 6 | 5 | 1012 | 13 | 87 | POINT (555173 654321) |
| 3 | 2023-11-01 08:01:30 | 2 | 3 | 1013 | 13 | 87 | POINT (555173 654321) |
| 4 | 2023-11-01 08:02:00 | 3 | 4 | 1013 | 13 | 87 | POINT (555173 654321) |
I guess we just lost the ‘e’ at the end of ‘Temperatur’, so it could be worse, but that could be very irritating.
Now, I can’t show you the inside of a shapefile in the same way that I could with a CSV, because there’s a fundamental difference here. CSVs are text files - whatever you type in, that’s how it’s saved. But shapefiles are a binary format - the data is saved as the binary representation rather than the straight text representation. I can show you the contents translated from binary:
with open('../sample_data/ap.shp', 'rb') as f:
    content = f.read()
print(content)
b"\x00\x00'\n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00x\xe8\x03\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\n\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x02\x00\x00\x00\n\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x03\x00\x00\x00\n\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x04\x00\x00\x00\n\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A\x00\x00\x00\x05\x00\x00\x00\n\x01\x00\x00\x00\x00\x00\x00\x00J\xf1 A\x00\x00\x00\x00\xe2\xf7#A"
Not really all that informative, to be honest.
Now, there is another issue with shapefiles as well. If you look into the sample_data folder where it was saved, you’ll also see files called ap.cpg, ap.dbf, and ap.shx. We didn’t ask to create any of those - but they are created with a shapefile.
The reason for this is ultimately the same as the 10-character limit on column names and the lack of support for datetime columns. The shapefile format was created in the 1980s, when GIS was just getting started. Computers then were far less powerful than modern machines, and over time, as more became possible in GIS, more data needed to be saved than the original shapefile format allowed. Rather than switching to a new format, the old one was simply extended, with the additional information saved in these sidecar files.
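You can see the whole family of files with a quick bit of Python. A sketch (it assumes the sample_data folder used above exists; the glob will also catch the other ap.* files we've saved in this notebook):

```python
from pathlib import Path

# List every file sharing the shapefile's base name: the .shp itself
# plus the sidecar files (.shx, .dbf, .cpg, ...) that must travel with it.
folder = Path('../sample_data')
if folder.exists():
    for path in sorted(folder.glob('ap.*')):
        print(path.name)
```

If you ever copy or email a shapefile, this is the set of files that has to move together.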
All of this is very irritating. True, it’s just a minor irritation, but every year since I started teaching GIS, I’ve had an issue where a student comes to me in a panic saying “my map has stopped working, help!”. And when I go to help them, I see that their data folder has the .shp file, but none of the additional sidecar files. “Oh”, will say the student, “I noticed all those extra files in the folder and I wanted to tidy it up, so I deleted them all”.
Every year. Anyway, don’t do that. Or even better, just don’t use shapefiles.
Even all these minor irritations aren't the main reason not to use shapefiles. The main reason is that the format is horribly slow and memory-inefficient. There are formats which take up less space and are literally hundreds of times faster than shapefiles. The amount of compute time wasted across the world reading and writing shapefiles probably accounts for a measurable amount of climate change.
But somehow, despite all the limitations, it’s still the default geospatial vector file format, so you’re going to come across it a lot, unfortunately - especially if you’re using ArcGIS Pro, because it’s still ESRI’s default format.
3. GeoPackage and GeoDataBase#
GeoPackage files, the default format in QGIS, save geospatial data in a database format, using SQLite. Now, I'm not an expert in databases, but using a database format means the files are optimised for fast retrieval of data.
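In fact, because a GeoPackage really is just a SQLite database, you can open one with Python's built-in sqlite3 module and look inside. A sketch, writing a tiny stand-in file to a temporary folder first:

```python
import os
import sqlite3
import tempfile

import geopandas as gpd
from shapely import Point

# A tiny stand-in GeoDataFrame, written out as a GeoPackage.
gdf_tiny = gpd.GeoDataFrame({'PM2.5': [3]},
                            geometry=[Point(555173, 654321)],
                            crs='epsg:2157')
path = os.path.join(tempfile.mkdtemp(), 'ap.gpkg')
gdf_tiny.to_file(path)

# Open it as an ordinary SQLite database. gpkg_contents is a table
# the GeoPackage standard requires; it lists the layers in the file.
con = sqlite3.connect(path)
layers = [row[0] for row in con.execute("SELECT table_name FROM gpkg_contents")]
con.close()
print(layers)
```

This is just to demonstrate the point; for real work you'd let GeoPandas or QGIS handle the database side for you.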
Geopackage files are essentially the standard open source alternative to shapefiles, and are a vast improvement. They don’t have a limitation on datetime fields, or the length of column names; and not only are there no sidecar files, you can actually save multiple data layers to a single geopackage file. We can do:
gdf.to_file('../sample_data/ap.gpkg')
and that works, which the shapefile version didn’t until we’d made some changes - but if we also had data from another sensor, or map data for the area around the location, we could save it to the same file, with a different layer name. So, if you have multiple layers, rather than 4-6 files per layer with shapefiles, you could have one file for all of your layers. That’s tidy.
GeoPackages are still a bit slow, though. They're an order of magnitude faster than shapefiles, but newer formats are faster still.
GeoDataBase files are ESRI's equivalent database-based format.
Both QGIS and ArcGIS Pro can read both GeoPackage and GeoDataBase files, so there are no real concerns about sharing data between users of the two programs.
4. Feather and Parquet#
Without getting even further into computer science than I already have, it would be difficult to really explain just why these formats are so good. They're closely tied to the Apache Arrow data format; for now, let's just say that they're incredibly efficient binary formats. The files are smaller, and they can be written and read literally hundreds of times faster than other formats. Geospatial versions of both exist, GeoFeather and GeoParquet (the latter has been standardised by the Open Geospatial Consortium), and that's what GeoPandas is actually using when we run:
gdf.to_parquet('../sample_data/ap.parquet')
gdf.to_feather('../sample_data/ap.feather')
(If either of these raises an ImportError about a missing optional dependency, the pyarrow package isn't installed in your environment; install it with pip or conda and run the cell again.)
If we read them back in, the geospatial information isn't lost:
gdf_p = gpd.read_parquet('../sample_data/ap.parquet')
gdf_p.crs
gdf_f = gpd.read_feather('../sample_data/ap.feather')
gdf_f.crs
These formats are very new, so they're not enormously widespread yet, and they're still under active development. But if you expect to do more than occasionally play with geospatial data, they're well worth knowing about, and you can expect to see them become a lot more common in the future.
5. GeoJSON#
GeoJSON files are more or less the standard for geospatial data transmitted over the internet, for example sensor data. If you've ever used a real-time public transport information app, it was receiving GeoJSON data to tell you how long you'd be waiting.
gdf.to_file('../sample_data/ap.geojson', driver='GeoJSON')
with open('../sample_data/ap.geojson', 'r') as f:
    content = f.read()
print(content)
{
"type": "FeatureCollection",
"name": "ap",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::2157" } },
"features": [
{ "type": "Feature", "properties": { "timestamp": "2023-11-01T08:00:00", "PM2.5": 3, "PM10": 5, "Pressure": 1012, "Temperature": 12, "Humidity": 86 }, "geometry": { "type": "Point", "coordinates": [ 555173.0, 654321.0 ] } },
{ "type": "Feature", "properties": { "timestamp": "2023-11-01T08:00:30", "PM2.5": 4, "PM10": 6, "Pressure": 1012, "Temperature": 12, "Humidity": 86 }, "geometry": { "type": "Point", "coordinates": [ 555173.0, 654321.0 ] } },
{ "type": "Feature", "properties": { "timestamp": "2023-11-01T08:01:00", "PM2.5": 6, "PM10": 5, "Pressure": 1012, "Temperature": 13, "Humidity": 87 }, "geometry": { "type": "Point", "coordinates": [ 555173.0, 654321.0 ] } },
{ "type": "Feature", "properties": { "timestamp": "2023-11-01T08:01:30", "PM2.5": 2, "PM10": 3, "Pressure": 1013, "Temperature": 13, "Humidity": 87 }, "geometry": { "type": "Point", "coordinates": [ 555173.0, 654321.0 ] } },
{ "type": "Feature", "properties": { "timestamp": "2023-11-01T08:02:00", "PM2.5": 3, "PM10": 4, "Pressure": 1013, "Temperature": 13, "Humidity": 87 }, "geometry": { "type": "Point", "coordinates": [ 555173.0, 654321.0 ] } }
]
}
As you can see there, the GeoJSON format is somewhat human readable. You can see each feature separately, and every piece of data for each feature is labelled.
Of course, this readability comes at a massive cost in file size. A GeoJSON file is essentially like a CSV, but with the relevant column header repeated before every single value. Even in this short example, you can see how often those labels are repeated.
But this format isn’t intended for storage of huge datasets. It’s intended for sending short snippets of data in a format that a person can understand. You wouldn’t want to be storing 5 million data points in this format - but it has its place.
Next#
Now that we’ve been introduced to the main geospatial vector data formats, we’re ready to start looking at doing some geoprocessing of vector geospatial data. Next Notebook uses some real data.