One of our client has a tremendous amount of geo-coordinates from marine traffic across the globe and they wanted to geotag each coordinate for Marine Region, IHO Sea, Country and Sovereign.
Data Required for each coordinate
Marine Region, IHO Sea, Country and Sovereign
If any of above data is not available find the nearest data
Marine Geotagging has several challenges in terms of data availability with the latest and enough information, huge cost involves to purchase Geolocation APIs either from Google or any other service provider and when you have more than 500 Billion coordinates to geotag the problem become more challenging.
A summary of major problems are as follows:
More than 500 Billion geo-coordinates
Not has enough budget to spend on third-party Geolocation services like Google and etc.
Very limited resource to calculate this amount of data
Third Party APIs
There are many third party Geolocation Service APIs available to Geotag GPS coordinates and find the required information. The problem with these APIs are
They are very slow as geotagging takes time
Very costly, mostly per coordinate basis
Not available for all the Marine Regions
These API Services includes Google Geolocation Tool, NeutrinoApi, GeoMaker APIs and etc.
We have started our research keeping two things in our mind
A super fast and reliable solution
Data accuracy for at least 100 meters radius
We have divided the problem into two major steps, one is to find a reliable approach with accuracy, test the same on a small chunk of data and finally scale the same solution for all 500 billion records.
After a week of effort and energy spend in the same direction we find out that shapefile approach is the only reliable solution which will have the desired accuracy on the data.
The Solution - Part 1
The solution was very simple, find or create a multi-polygon shapefile for Marine Regions including required information like Marine Region name, IHO Sea, Country and Sovereign.
Match all 500 billion geo-coordinates against this shapefile and check if exists within any polygon into the shapefile.
If not contains within the polygon, find the nearest polygon to that coordinate.
Finding shapefile on open source was very difficult and we have spent enough time to search on various shape file repositories looking for the exact shapefile, we have found few but was not enough and we have started thinking to create our own using QGis.
Soon we realize that creating an accurate shapefile will take a lot of time and effort and best case scenario if we will be able to create the desired shapefile mapping the same with Marine Region and other required information will be another problem to solve.
Luckily, we have found one shapefile containing 525 polygons with the required information. Intersect_IHO_EEZ_v2_2012.zip available for free http://www.marineregions.org/downloads.php
This file contains information in every polygon
MarRegion - Marine Region
MARGID - Marine Region Group ID
IHO_Sea - IHO Sea information
IHO_ID - IHO Sea ID
IHO_MARGID - IHO Marine Region Group ID
Country - Country of the region
Sovereign - Sovereign of the region
Sov_ID - Sovereign ID
EEZ_MARGID - EEZ Marine Region Group ID
Area_m2 - Area of the Region in Meter square
We have imported the shapefile in QGis and it looks like this
Initially, we have tried Python to process the shapefile using PyShp package and match our test data (GPS coordinates) with the polygons and check if they fall within any of these 525 polygons.
This was a working solution, we were able to match our test records and find the desired information from the shapefile but it was very slow.
The problem was the linear match of the 525 shapes against all the records, we have tested the solution against a smaller data size (10000 geo-coordinates) and it took more than ~14 minutes to process on my machine (MacBook Pro, i7, 16GB RAM, 512GB SSD), we needed a different approach as the data was very huge (500 billion geo-coordinates)
We have been using Postgre for a long time and have used in quite a few of our projects as an alternative to MySQL but we have never used Postgre with any plugin.
While researching about the same we find out about PostGis a Postgre plugin for Geospatial data, we have installed and started looking into the same.
The result was pretty amazing, it was able to process 10000 geo-coordinates in less than ~0.02 seconds as it was using indexes on the 525 shapes we have imported into the PostGis.
We have decided to dig more into PostGis to make it even faster to process all 500 billion geo-coordinates at once.
Later, we realized that it is not that easy to process 500 billion records at once and we must divide the same into smaller chunks and it worked.
Split the main CSV file containing 500 billion geo-coordinates to 1000 CSV files containing 500 million geo-coordinates each using Python.
Written a Postgre function to read the CSV file from disk and import into a Postgre table
Generate point using longitude and latitude and store into the same table
Match each point against the 525 polygons if they fall within any of them
Store the matching ID of the shapes into the same table
Populate a new table with matching records along with the geo-coordinates
The Solution - Part 2
The actual implementation, commands and table and data format are as follows:
The actual CSV of all records contains only two columns for longitude and latitude and have more than 500 billion records,
The first step to split the CSV into 1000 CSVs containing only 500 million records each. We have used Python to do the same
# split CSV for every 1 million records # Filename will be csvfile = open('all_records.csv', 'r').readlines() filename = 1 for i in range(len(csvfile)): if i % 1000000 == 0: open(str(filename) + '.csv', 'w+').writelines(csvfile[i:i+1000000]) filename += 1
After running the above, we have 1000 CSV files containing 1 million records each.
Postgre Import - Shapefile
We have used shp2pgsql command from PostGIS to import shapefile into Postgre
shp2pgsql -I -s . | psql -U postgres -d ;
Postgre Import - CSV
To import CSV file in Postgre we have used COPY command of Postgre, see the documentation on below link
COPY lat_long (long, lat) FROM '/path/to/csv-1.csv' WITH DELIMITER ',';
Postgre (PostGIS) - Make Point
We only have longitude and latitude in our records and to match the same with shapes we need a point, so we have used ST_MakePiont command to generate point from the given longitude and latitude
UPDATE oceans.all_records SET geom = ST_MakePoint(long, lat);
See the ST_MakePoint documentation at below link
Postgre - If the point is within the shape
To check if a point is within the shape we have used the ST_Within method of PostGIS
CREATE OR REPLACE FUNCTION GET_SID(p geometry) RETURNS INTEGER AS $$ declare result INT; BEGIN SELECT id INTO result FROM oceans.shapes WHERE ST_Within(geom, p) IS NOT NULL; return result; END; $$ LANGUAGE plpgsql;
UPDATE oceans.all_records SET sid = GET_SID(geom);
We have created a function which takes POINT geometry as input and returns the Shape ID which contains the passed geometry.
Using UPDATE statement we have updated our all_records table.
Postgre - Nearest shape
There are few records which don’t have shape information as they are not within any of the 525 shapes we have imported, to add the nearest shape into those records we have used ST_Distance method from PostGIS.
CREATE OR REPLACE FUNCTION GET_SID_MIN_DISTANCE(d geometry) RETURNS INTEGER AS $$ declare distance INT; id INT; BEGIN SELECT Min(ST_Distance(s.geom, d)) AS distance, s.id INTO distance, id FROM oceans.shapes AS s GROUP BY s.gid ORDER BY distance LIMIT 1; return sid; END; $$ LANGUAGE plpgsql;
UPDATE oceans.all_records SET sid = GET_SID_MIN_DISTANCE(geom) WHERE sid IS NULL;
We have created a function GET_SID_MIN_DISTANCE which takes POINT geometry and returns the minimum distance shape from the shapes table.
Using UPDATE command we have updated the rest of the SIDs in the all_records table.
After all the above operations we have the all_records table with SID containing Shape ID from shapes table, which can be directly joined to populate the final records including Marine Region, IHO Sea, Country, Sovereign from Shapes and Longitude, Latitude from Records.
SELECT p.lat, p.long, s.marregion, s.iho_sea, s.country, s.sovereign FROM oceans.all_records AS p JOIN oceans.shapes AS s ON p.sid;
By spending time in research we were able to Geotag all the 500 billion geo-coordinates with the matching records without spending a huge amount of money on third party Geo Tagging services.
When there is a challenge we tend to spend most of the time researching, reading and testing various available options to find the best fit within the requirements and budget. Sometimes it takes a while to make a decision but knowing the boundary is very important.
We are not saying that spend most of the time in research for everything, there are things which lots of people have already spent the time it required we should leverage their effort and try not to reinvent the wheel but we should double check the environment that specific solution is designed or developed.
Let us know if have faced similar challenges, we would be really happy to know how you have overcome and what you have built.