Joining Address Data Involving Sub Units ("Line 2")

This tutorial walks through how to add a Placekey to locations that include sub units ("Line 2" addresses).

(Note: this works with both commercial and residential addresses.)

We have a Google Colab notebook for this tutorial. Open it up to access the code and run it yourself!

Summary

Placekey is a geospatially-aware universal place identifier useful for joining together disparate geospatial, address, and point-of-interest datasets
Placekey is open and free to use
Here we demonstrate how to obtain Placekeys for point-of-interest or address-based data using the placekey-py Python library and the Placekey API
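
For orientation, here is a minimal sketch of a single lookup with placekey-py (the address values are made up; the batch workflow below is what this tutorial actually uses):

from placekey.api import PlacekeyAPI

pk_api = PlacekeyAPI("--YOUR API KEY HERE--")  # free key; registration link below

# Hypothetical single-address lookup
print(pk_api.lookup_placekey(
    street_address="598 Portola Dr",
    city="San Francisco",
    region="CA",
    postal_code="94131",
    iso_country_code="US"
))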

What do you need?

Your Placekey API key --> Register Here - it is FREE and approval is automatic after signup!
Link to Apartment Sub Unit Data [300k rows]

Have questions or ideas?

Join thousands of geospatial data scientists in the Placekey Slack Community to get help and share ideas
Follow Placekey on Github
Tweet @placekeyio

Setup

This section authenticates your Google Drive and downloads/reads the appropriate files from their respective Google Drive locations.

# special libraries to allow file access
from google.colab import drive as mountGoogleDrive 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd

# These commands allow you to read directly from SafeGraph's public GoogleDrive containing Census Data and Sample Data
auth.authenticate_user()  # Authenticate and create the PyDrive client. 
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
print("You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!")

Expected output:

You are fully authenticated and can edit and re-run everything in the notebook. Enjoy!

# Helper: download a file from Google Drive by id and read it into a pandas DataFrame
def pd_read_csv_drive(id, drive, dtype=None):
    downloaded = drive.CreateFile({'id': id})
    downloaded.GetContentFile('Filename.csv')
    return(pd.read_csv('Filename.csv', dtype=dtype))

def get_drive_id(filename):
    # Note: Project Folder - https://drive.google.com/drive/u/0/folders/1GJHGVEjsVKaiE53fug-EXdc7kaEE2IZk
    drive_ids = {
                  'sub_location_data' : '1dPRx2TjcnAixXEOrhWqtzovwye71HKEf', ## 300k version
                  'sub_location_1k_demo': '1SbObArFX3Onch3XMJ6QtxpysU5FTVnP_' ## 1k version
                 }
    return(drive_ids[filename])

Setting dtypes is very important when working with address data, so that leading zeros are not dropped and the data stays accurate (a tiny illustration follows the read cells below).

places_df = pd_read_csv_drive(get_drive_id('sub_location_data'), drive=drive, dtype={'POSTCODE':str}) # It is very important to keep columns like POSTCODE as string to make sure Pandas doesn't mistake it for integers and drop leading zeros
places_df.head()

####### Uncomment code below to run the faster 1k version #######

# places_df = pd_read_csv_drive(get_drive_id('sub_location_1k_demo'), drive=drive, dtype={'POSTCODE':str})
# places_df.head()
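
To see why this matters, here is a tiny standalone illustration (not part of the tutorial data) of a ZIP code losing its leading zero when the column is not read as a string:

import io

snippet = io.StringIO("POSTCODE\n01234")
print(pd.read_csv(snippet).POSTCODE.iloc[0])  # 1234 -- parsed as an integer, leading zero lost
snippet.seek(0)
print(pd.read_csv(snippet, dtype={'POSTCODE': str}).POSTCODE.iloc[0])  # '01234' -- preserved as a string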

Install the Placekey library using Pip and then import it

!pip install placekey

import numpy as np
import json
from placekey.api import PlacekeyAPI

Demo Code Starts HERE
Add your API Key below

Note:

You will need to add your own API key in the cell below for this tutorial to work (Free and Easy!)
The commented-out file path variable called data_path is for non-Colab use

placekey_api_key = "--YOUR API KEY HERE--"  # paste your own API key here
pk_api = PlacekeyAPI(placekey_api_key)

Column Mapping

In the cell below you tell the API, in JSON format, which columns in your dataframe correspond to the API query fields. Note: the country 'US' is hardcoded below. While not every field is required, the more data you give Placekey, the easier the match process will be.

orig_df = places_df.copy()  # Only required because of the auto read-in for Colab

orig_df['iso_country_code'] = 'US'

print(orig_df.shape)
orig_df.head()

The optional step below is only needed if you don't already have a unique identifier for your data; ours does not have one by default. You can either convert the index to a column or run a simple for loop. I chose a for loop on the off chance your data uses a non-standard index (an index-based alternative is sketched after the loop).

The unique identifier ensures that we can remerge on the correct rows later.

li = []

for x in range(orig_df.shape[0]):
    li.append(f"{x}")

orig_df['id_num'] = li
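
If you would rather convert the index to a column instead, an equivalent (commented-out) sketch, assuming a default RangeIndex, is:

# Equivalent alternative: use the row index itself as the unique id
# orig_df = orig_df.reset_index(drop=True)
# orig_df['id_num'] = orig_df.index.astype(str)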

We want to map the columns of our DF to what the Placekey API is expecting.

The values on the left of the column map should match the column names in our DF while the values on the right should not change.

query_id_col = "id_num"  # this column in your data should be unique for every row
column_map = {query_id_col:"query_id",
                # "location": "location_name", -- This is where you would add in location name if we had it
                "full_addr": "street_address",
                "CITY": "city",
                "REGION": "region",
                "POSTCODE": "postal_code",
                "iso_country_code": "iso_country_code",
                "LAT": "latitude",
                "LON": "longitude"}

df_for_api = orig_df.rename(columns=column_map)
cols = list(column_map.values())
df_for_api = df_for_api[cols]

df_for_api.head()

Check for Null, None, NaN, etc. rows

This will alert you to any incomplete data and help you determine if you should drop those rows or not.

For instance, if you notice you have 5,000 rows missing a latitude, you could save yourself some time by dropping those rows up front (that was not the case for this data); a commented-out sketch follows the check below.

df_clean = df_for_api.copy()
possible_bad_values = ["", " ", "null", "Null", "None", "nan", "Nan"] # Any other dirty data you need to clean up?
for bad_value in possible_bad_values:
  df_clean = df_clean.replace(to_replace=bad_value, value=np.nan)

print("FYI data missing from at least 1 column in the following number of rows:")
print(df_clean.shape[0] - df_clean.dropna().shape[0])
print("Some examples of rows with missing data (if any...)")
print(df_clean[df_clean.isnull().any(axis=1)].head())
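
If you did decide to drop incomplete rows, for example rows missing coordinates, a minimal (commented-out) sketch would be:

# Optional: drop rows missing coordinates before querying the API (not needed for this dataset)
# df_clean = df_clean.dropna(subset=['latitude', 'longitude']).reset_index(drop=True)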


data_jsoned = json.loads(df_clean.to_json(orient="records"))
print("number of records: ", len(data_jsoned))
print("example record:")
print(data_jsoned[0])

Check Placekey Request

The cell below will query your first input to ensure everything is formatted correctly before moving on to the batch processing

See matching behavior below

single_place_example = data_jsoned[0]
print("input: \n",single_place_example)
print("\nresult: \n",pk_api.lookup_placekey(**single_place_example))

Warning: this next step takes 35-40 minutes to run to completion

This file is 300k rows of data, processed in batches of 1,000 rows at a time.
See the cell near the beginning to run the smaller 1k file instead.
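
If you are worried about losing a long run to a dropped connection, one option (not part of the original notebook; file names here are made up) is to call lookup_placekeys on slices of the input yourself and checkpoint each slice to disk:

# Optional sketch: query in slices and save each batch of responses so the run can be resumed
# chunk = 10000
# for start in range(0, len(data_jsoned), chunk):
#     part = pk_api.lookup_placekeys(data_jsoned[start:start + chunk], verbose=True)
#     with open(f"placekey_responses_{start}.json", "w") as f:
#         json.dump(part, f)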

While the code runs, you can read up on how Placekey tiers the matching process:

Matching Behaviour

Our matching algorithm tries a few different queries, sequentially, and returns the best match from the first query that comes back with a high enough score to be confident it's a true match. Here are the queries it runs, in order:

  • If the address you've sent in is valid, then we search for a POI at that address placekey with a name that exactly case-insensitively matches the location_name you've sent in.
  • If this does not match (or if the address you sent in wasn't valid) but you've sent in a latitude and longitude with your query, then we search for that location_name and a fuzzy street address within 1 km of your coordinates.
  • If this still does not match but you've sent in a postal code, then we search specifically for a POI in that postal code and look for a location_name match and a fuzzy street address match.
  • If none of the above match and you have sent in a city and a region, then we require a strict match on city/region, a match on POI name, and a fuzzy match on street address.
  • Finally, if none of the above match, we stop searching for POI and perform an address match.

Read more HERE

%%time
responses = pk_api.lookup_placekeys(data_jsoned,
                                    strict_address_match=False,
                                    strict_name_match=False,
                                    verbose=True)

def clean_api_responses(data_jsoned, responses):

    print("number of original records: ", len(data_jsoned))
    print("total individual queries returned:", len(responses))

    # filter out invalid responses
    responses_cleaned = [resp for resp in responses if 'query_id' in resp]
    print("total successful query responses:", len(responses_cleaned))
    return(responses_cleaned)

responses_cleaned = clean_api_responses(data_jsoned, responses)

df_placekeys = pd.read_json(json.dumps(responses_cleaned), dtype={'query_id':str})
df_placekeys.head(10)

Remerge on your unique ID

You can now see your newly added Placekeys, as well as an 'error' column that leaves breadcrumbs to help you troubleshoot any queries that come back blank (a quick tally of this column is sketched after the merge below).

Error Codes

We use standard HTTP status codes to communicate success or failure.

200 OK: The request was successful.
400 Bad Request: The request is invalid. Read the message or error fields in the response for information on how to correct it.
401 Unauthorized: Your API key is invalid. Check that you haven't removed it and that you've used the right header: apikey.
429 Too Many Requests: You have exceeded the permitted rate limit. Check your dashboard to see how many requests have been made recently.
50x Internal Server Error: An error occurred within our API. If this occurs, you may need to contact us to resolve it.

Read more about the errors HERE
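
These status codes surface if you call the API over HTTP directly; the placekey-py client used in this tutorial handles the requests for you. A hedged sketch of a direct call (endpoint and payload shape taken from the Placekey docs; verify against the current documentation):

import requests

# Hypothetical direct HTTP lookup, just to show where the status codes above appear
resp = requests.post(
    "https://api.placekey.io/v1/placekey",
    headers={"apikey": placekey_api_key, "Content-Type": "application/json"},
    json={"query": {"street_address": "598 Portola Dr", "city": "San Francisco",
                    "region": "CA", "postal_code": "94131", "iso_country_code": "US"}},
)
print(resp.status_code)  # 200, 400, 401, 429, or 50x as described above
print(resp.json())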

df_join_placekey = pd.merge(orig_df, df_placekeys, left_on=query_id_col, right_on="query_id", how='left')
final_cols = list(df_placekeys.columns) + list(orig_df.columns)
df_join_placekey = df_join_placekey[final_cols]
df_join_placekey.head()
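
To see what those breadcrumbs look like, a quick sketch that tallies the 'error' column (rows that matched successfully have no error, so NaN dominates; if every row matched, the column may simply be absent):

# Count the distinct error messages returned by the API (NaN means no error)
print(df_join_placekey['error'].value_counts(dropna=False))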

print("Summary of results:")
total_recs = df_join_placekey.shape[0]
print("total records:", total_recs)
print("records with a placekey: {0} | {1:.2f}%".format(df_join_placekey[~df_join_placekey.placekey.isnull()].shape[0],  df_join_placekey[~df_join_placekey.placekey.isnull()].shape[0]*100/total_recs))
print("records missing a placekey: {0} | {1:.2f}% ".format(df_join_placekey[df_join_placekey.placekey.isnull()].shape[0], df_join_placekey[df_join_placekey.placekey.isnull()].shape[0]*100/total_recs))
print("records missing a query_id: {0} | {1:.2f}% ".format(df_join_placekey[df_join_placekey.query_id.isnull()].shape[0], df_join_placekey[df_join_placekey.query_id.isnull()].shape[0]*100/total_recs))

Conclusion

Using Placekey, we were able to differentiate between sub units within an apartment/condo building (note: this works for commercial entities as well). If you look at the output above, you will see that while the right side of each Placekey (the WHERE component) is the same, thanks to the matching street addresses, the left side (the WHAT component) changes because the unit numbers differ. You can effectively differentiate between apartments/condos within the same building using Placekey!
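
If you want to verify this programmatically rather than by eye, a small sketch (relying on the standard "what@where" Placekey format) splits each Placekey and shows that rows sharing a building share the WHERE part while the WHAT part differs by unit:

# Split each Placekey into its WHAT and WHERE components
pk = df_join_placekey[~df_join_placekey.placekey.isnull()].copy()
pk['what_part'] = pk['placekey'].str.split('@').str[0]
pk['where_part'] = pk['placekey'].str.split('@').str[1]

# Multiple WHAT parts under a single WHERE part correspond to different sub units at the same location
print(pk.groupby('where_part')['what_part'].nunique().sort_values(ascending=False).head())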