Cyber Boxer

Marine Geotagging

Published on May 16, 2018, at 14:41

Requirements

One of our clients has a tremendous amount of geo-coordinates from marine traffic across the globe, and they wanted to geotag each coordinate with Marine Region, IHO Sea, Country, and Sovereign.


Data Required for each coordinate

Marine Region, IHO Sea, Country and Sovereign


Conditions

If any of the above data is not available, find the nearest data


Problem

Marine geotagging has several challenges: finding data that is both current and complete, and the huge cost of purchasing Geolocation APIs from Google or any other service provider. When you have more than 500 billion coordinates to geotag, the problem becomes even more challenging.

A summary of the major problems is as follows:

  • More than 500 Billion geo-coordinates

  • Not enough budget to spend on third-party geolocation services like Google

  • Very limited resources to process this amount of data


Third Party APIs

There are many third-party geolocation service APIs available to geotag GPS coordinates and find the required information. The problems with these APIs are:

  • They are slow, as geotagging takes time

  • Very costly, mostly per coordinate basis

  • Not available for all the Marine Regions

These API services include Google Geolocation Tool, NeutrinoApi, GeoMaker APIs, etc.


The Research

We started our research keeping two things in mind:

  • A super fast and reliable solution

  • Data accuracy within at least a 100-meter radius

We divided the problem into two major steps: first, find a reliable approach with the required accuracy and test it on a small chunk of data; then scale the same solution to all 500 billion records.

After a week of effort spent in this direction, we found that the shapefile approach was the only reliable solution with the desired accuracy on the data.


The Solution - Part 1

The solution was very simple: find or create a multi-polygon shapefile for Marine Regions, including the required information such as Marine Region name, IHO Sea, Country, and Sovereign.

Match all 500 billion geo-coordinates against this shapefile and check whether each one falls within any polygon in the shapefile.

If a coordinate is not contained within any polygon, find the nearest polygon to that coordinate.


Shapefile

Finding an open-source shapefile was very difficult. We spent considerable time searching various shapefile repositories for the exact shapefile we needed; we found a few, but they were not enough, so we started thinking about creating our own using QGIS.

Soon we realized that creating an accurate shapefile would take a lot of time and effort, and even in the best case, if we managed to create the desired shapefile, mapping it to Marine Regions and the other required information would be another problem to solve.

Luckily, we found one shapefile containing 525 polygons with the required information: Intersect_IHO_EEZ_v2_2012.zip, available for free at http://www.marineregions.org/downloads.php

Every polygon in this file contains the following information:

  1. MarRegion - Marine Region

  2. MARGID - Marine Region Group ID

  3. IHO_Sea - IHO Sea information

  4. IHO_ID - IHO Sea ID

  5. IHO_MARGID - IHO Marine Region Group ID

  6. EEZ - Exclusive Economic Zone

  7. EEZ_ID - Exclusive Economic Zone ID

  8. Country - Country of the region

  9. Sovereign - Sovereign of the region

  10. Sov_ID - Sovereign ID

  11. EEZ_MARGID - EEZ Marine Region Group ID

  12. Latitude

  13. Longitude

  14. Area_m2 - Area of the region in square meters

We imported the shapefile into QGIS to take a look at the 525 polygons.



Python

Initially, we tried Python, processing the shapefile with the PyShp package and matching our test data (GPS coordinates) against the polygons to check whether each point falls within any of the 525 polygons.

This was a working solution: we were able to match our test records and pull the desired information from the shapefile, but it was very slow.

The problem was the linear matching of the 525 shapes against every record. We tested the solution on a smaller data set (10,000 geo-coordinates), and it took around 14 minutes to process on my machine (MacBook Pro, i7, 16GB RAM, 512GB SSD). We needed a different approach, as the full data set was huge (500 billion geo-coordinates).
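For reference, here is a minimal sketch of that linear-scan approach, assuming the PyShp and Shapely packages and the downloaded shapefile; the path and the geotag function are our illustration, not the original script:

# A sketch of the linear point-in-polygon scan, assuming PyShp and Shapely.
# The shapefile path is an assumption.
import shapefile                       # PyShp
from shapely.geometry import shape, Point

sf = shapefile.Reader('Intersect_IHO_EEZ_v2_2012.shp')
polygons = [shape(s.__geo_interface__) for s in sf.shapes()]
records = sf.records()

def geotag(long, lat):
    # Test the point against all 525 polygons, one by one.
    point = Point(long, lat)
    for polygon, record in zip(polygons, records):
        if polygon.contains(point):
            return record
    return None  # not inside any polygon; a nearest-shape lookup handles this later

Every lookup walks the full polygon list, which is exactly why the runtime grows so quickly with the number of coordinates.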


Postgres

We have been using Postgres for a long time and have used it in quite a few of our projects as an alternative to MySQL, but we had never used Postgres with any plugin.

While researching, we found PostGIS, a Postgres extension for geospatial data; we installed it and started digging in.

The result was pretty amazing: PostGIS processed the 10,000 geo-coordinates in about 0.02 seconds, because it uses spatial indexes on the 525 shapes we imported.
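The spatial index is what makes this fast. A minimal sketch of that index, assuming the table and column names used later in this post (the -I flag of shp2pgsql, shown below, creates it automatically on import):

-- GIST spatial index on the shapes' geometry column.
CREATE INDEX shapes_geom_idx ON oceans.shapes USING GIST (geom);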

We decided to dig deeper into PostGIS to make it even faster and process all 500 billion geo-coordinates at once.

Later, we realized that it is not that easy to process 500 billion records at once; we had to divide the work into smaller chunks, and that worked:

  1. Split the main CSV file containing 500 billion geo-coordinates into 1000 CSV files of 500 million geo-coordinates each, using Python.

  2. Write a Postgres function to read each CSV file from disk and import it into a Postgres table.

  3. Generate a point from the longitude and latitude and store it in the same table.

  4. Match each point against the 525 polygons to check whether it falls within any of them.

  5. Store the ID of the matching shape in the same table.

  6. Populate a new table with the matching records along with the geo-coordinates.


The Solution - Part 2

The actual implementation, commands, table layout, and data formats are as follows:


Splitting CSV 

The actual CSV of all records contains only two columns, longitude and latitude, and has more than 500 billion records.

The first step is to split this CSV into 1000 CSVs containing 500 million records each. We used Python to do this:


# Split the master CSV into chunks of 500 million records each.
# Output files will be named 1.csv, 2.csv, 3.csv, ...
# Streaming line by line keeps memory flat; readlines() would not fit at this scale.
CHUNK_SIZE = 500_000_000

with open('all_records.csv', 'r') as source:
    filename = 1
    out = None
    for i, line in enumerate(source):
        # Start a new output file at every chunk boundary.
        if i % CHUNK_SIZE == 0:
            if out:
                out.close()
            out = open(str(filename) + '.csv', 'w')
            filename += 1
        out.write(line)
    if out:
        out.close()

After running the above, we have 1000 CSV files containing 500 million records each.


Postgres Import - Shapefile

We used the shp2pgsql command from PostGIS to import the shapefile into Postgres:


shp2pgsql -I -s <SRID> <shapefile.shp> <schema>.<table> | psql -U postgres -d <database>;
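For example, assuming SRID 4326, the downloaded shapefile, and the oceans.shapes table used throughout this post (the database name is hypothetical):

shp2pgsql -I -s 4326 Intersect_IHO_EEZ_v2_2012.shp oceans.shapes | psql -U postgres -d oceans;

The -I flag also creates the GIST spatial index on the geometry column.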


Postgres Import - CSV

To import the CSV files into Postgres we used the COPY command; see the documentation at the link below:

https://www.postgresql.org/docs/9.2/static/sql-copy.html


COPY oceans.all_records (long, lat) FROM '/path/to/csv-1.csv' WITH DELIMITER ',';
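The target table must exist before the COPY runs. A minimal sketch of its layout, assuming the column names used in the statements below (the SRID is an assumption):

-- Staging table for the imported coordinates; geom and sid are
-- populated by the later steps.
CREATE TABLE oceans.all_records (
    long DOUBLE PRECISION,
    lat  DOUBLE PRECISION,
    geom GEOMETRY(Point, 4326),  -- point built from long/lat (assumed SRID)
    sid  INTEGER                 -- ID of the matching shape, filled in later
);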


Postgres (PostGIS) - Make Point

We only have longitude and latitude in our records, and to match them against the shapes we need a point, so we used the ST_MakePoint function to generate one from the given longitude and latitude:


UPDATE oceans.all_records SET geom = ST_MakePoint(long, lat);

See the ST_MakePoint documentation at the link below:

http://www.postgis.net/docs/ST_MakePoint.html
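One caveat worth noting: if the shapes were imported with an explicit SRID (for example -s 4326 above), the points must carry the same SRID, otherwise PostGIS rejects the mixed-SRID comparison. A sketch of the SRID-aware variant, assuming 4326:

-- Assumption: the shapes table uses SRID 4326, so the points must match it.
UPDATE oceans.all_records SET geom = ST_SetSRID(ST_MakePoint(long, lat), 4326);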


Postgres - Check whether a point is within a shape

To check whether a point falls within a shape, we used the ST_Within function of PostGIS:


CREATE OR REPLACE FUNCTION GET_SID(p geometry)
RETURNS INTEGER AS $$
DECLARE result INT;
BEGIN
  -- ST_Within(A, B) is true when A lies completely inside B,
  -- so the point comes first and the shape polygon second.
  SELECT id INTO result
  FROM oceans.shapes
  WHERE ST_Within(p, geom)
  LIMIT 1;
  RETURN result;
END;
$$ LANGUAGE plpgsql;


UPDATE oceans.all_records SET sid = GET_SID(geom);

We created a function that takes a POINT geometry as input and returns the ID of the shape that contains it.

Using the UPDATE statement above, we updated our all_records table.
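To sanity-check the function on a single coordinate before running the full UPDATE, a spot check like this works (the coordinate pair is just an illustrative value):

-- Hypothetical spot check: geotag one coordinate without touching the table.
SELECT GET_SID(ST_SetSRID(ST_MakePoint(72.8777, 19.0760), 4326));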


Postgres - Nearest shape

A few records have no shape information because they do not fall within any of the 525 shapes we imported. To assign the nearest shape to those records, we used the ST_Distance function from PostGIS:


CREATE OR REPLACE FUNCTION GET_SID_MIN_DISTANCE(d geometry)
RETURNS INTEGER AS $$
DECLARE result INT;
BEGIN
  -- Order all shapes by their distance to the point and keep the closest one.
  SELECT s.id INTO result
  FROM oceans.shapes AS s
  ORDER BY ST_Distance(s.geom, d)
  LIMIT 1;
  RETURN result;
END;
$$ LANGUAGE plpgsql;


UPDATE oceans.all_records SET sid = GET_SID_MIN_DISTANCE(geom) WHERE sid IS NULL;

We created a function, GET_SID_MIN_DISTANCE, that takes a POINT geometry and returns the ID of the nearest shape from the shapes table.

Using the UPDATE command above, we filled in the remaining SIDs in the all_records table.
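As a side note, PostGIS also offers the KNN distance operator <->, which can answer the same nearest-shape question straight off the GIST index. A sketch of that alternative (not what we ran; on older PostGIS versions the operator orders by bounding-box distance rather than true distance), where d stands for the input point as in the function above:

-- Index-assisted nearest-neighbor variant using the KNN operator.
SELECT s.id
FROM oceans.shapes AS s
ORDER BY s.geom <-> d
LIMIT 1;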

After all the above operations, we have the all_records table with a SID column holding the matching shape ID from the shapes table, which can be joined directly to produce the final records: Marine Region, IHO Sea, Country, and Sovereign from shapes, plus longitude and latitude from the records.


SELECT p.lat, p.long, s.marregion, s.iho_sea, s.country, s.sovereign
FROM oceans.all_records AS p
JOIN oceans.shapes AS s
ON s.id = p.sid;


Conclusion

By spending time on research, we were able to geotag all 500 billion geo-coordinates with matching records without spending a huge amount of money on third-party geotagging services.

When there is a challenge, we tend to spend most of our time researching, reading, and testing the available options to find the best fit within the requirements and budget. Sometimes it takes a while to make a decision, but knowing your boundaries is very important.

We are not saying you should spend most of your time researching everything; for many problems, plenty of people have already put in the time required, and we should leverage their effort rather than reinvent the wheel. But we should double-check the environment a specific solution was designed or developed for.

Let us know if you have faced similar challenges; we would be really happy to hear how you overcame them and what you built.

Cheers!