Caching

Applying caching in a data analysis workflow

Backstory: Why I decided to write about caching.

So that you don’t get your hopes up, this isn’t some drama-filled story. It’s a simple one about a task I worked on, and I think you could pick up a thing or two about caching from it (I’d prefer you pick up two, though).

I am a data analyst, and sourcing data is one part of a data analyst’s job. Sometimes a task requires you to source all the data you need from external sources; other times, you only need some external data to complement data already available to you.

In the task I was working on, I needed to analyse the distribution of records by country of origin, but there was no field containing the country. The dataset did, however, have a field containing the IP address each record originated from. From networking classes, I knew that the body controlling the allocation of IP addresses maps particular ranges of addresses to particular countries. So, given an IP address, I could look up its location.

After consulting a senior techie, I was referred to a tool that could return the location of an IP address, but the platform provided no API, which meant I’d have to input the IP addresses manually. If you know me well, you’d know I’d never follow a manual process, especially because manual processes aren’t scalable and are extremely boring! The other option was web scraping, which is a longer and rougher solution. After a few searches, I found a platform that offered an API: ipapi.co.

I’ve worked with APIs a few times, but I had never really had to make a large number of calls to one. This time, using the API would require pulling the country information for a large number of rows. Here is the short Python function I used to pull the country information through the API:

import requests

def get_location(ip_address):
    """
    Returns the country mapped to an IP address.

    :param ip_address: str, IP address
    """
    response = requests.get(f"https://ipapi.co/{ip_address}/json/").json()
    country = response.get("country_name")
    return country

I applied the function to my pandas DataFrame using the .apply() method. This is what it looked like:

df["country"] = df["IP address"].apply(lambda x: get_location(x))

After running the line above multiple times because of network interruptions, my account was flagged: “Rate Limited!”

Don't fret, getting your account flagged with 'Rate Limited' has no legal implications 😏. It usually means you have exceeded the number of calls you are allowed to make to the API within a particular time frame. Other times, it means you have exhausted the number of calls you’re allowed to make on free access and may have to subscribe to a paid plan. That is everyone's nightmare: paying for something you formerly had free access to.
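
As an aside, if you expect to bump into rate limits occasionally, one defensive pattern is to back off and retry. Here’s a minimal sketch, assuming the service signals throttling with an HTTP 429 status code (check your provider’s docs):

import time
import requests

def get_with_backoff(url, max_retries=5):
    """Fetch a URL as JSON, backing off when the server signals rate limiting."""
    delay = 1  # seconds; doubled after every throttled attempt
    for _ in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:  # 429 = Too Many Requests
            return response.json()
        time.sleep(delay)
        delay *= 2  # exponential backoff
    raise RuntimeError("Still rate limited after retries")

Backing off only spreads your calls out, though; it doesn’t reduce how many you make.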

After realizing I had been flagged for making too many calls to the API, it dawned on me that I could have avoided being flagged. This leads me to caching.

Caching

Every day, we experience caching without necessarily knowing it. The reason a website loads faster the second time you visit it is that the results of the request your client (phone/PC) sent to the web server hosting the site were saved temporarily on your client.

Caching is an optimization technique for keeping recently or frequently accessed data in a memory location with a cheaper cost of retrieval. There are different approaches and strategies for implementing caching, depending on the context.

A simple way to implement caching in Python is with a dictionary. Using my scenario as an example, a basic cache can be built with the code below:

import requests

ip_location_dict = {}

def get_location(ip_address):
    """
    Returns the country mapped to an IP address.

    :param ip_address: str, IP address
    """
    # Return the cached country if we have already looked this IP up
    if ip_address in ip_location_dict:
        return ip_location_dict[ip_address]
    # Otherwise call the API and cache the result
    response = requests.get(f"https://ipapi.co/{ip_address}/json/").json()
    country = response.get("country_name")
    ip_location_dict[ip_address] = country
    return country

First, the code checks whether the IP address is already mapped to a country in the ip_location_dict dictionary. If it is, we return the country mapped to that IP address. If it is not, we call the API and then store the IP-to-country key-value pair in the dictionary. This way, we limit the number of calls we make to the API by first checking whether we already have the location of a particular IP address stored locally. The cost of a dictionary lookup is far lower than the cost of an API call. This is caching.
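
By the way, Python’s standard library can do this dictionary bookkeeping for you. The sketch below uses functools.lru_cache, which memoizes a function’s return values keyed by its arguments (the IP address here), so repeated calls with the same IP never hit the API:

import functools
import requests

@functools.lru_cache(maxsize=None)  # keep every distinct IP address we look up
def get_location(ip_address):
    """Returns the country mapped to an IP address, caching results in memory."""
    response = requests.get(f"https://ipapi.co/{ip_address}/json/").json()
    return response.get("country_name")

Like the plain dictionary, though, this cache lives only in memory, which brings me to the next problem I ran into.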

**Putting it all together**

After trying the caching concept and confirming it worked, I realized the code I had written required me to finish the task in a single run. On every subsequent run, the ip_location_dict dictionary would be re-initialized, and I would be back to an empty cache.

What I did to fix this

I invoked the power of the ancient os library😏. This is what it looked like 👇🏽

import os
import json
import requests
import pandas as pd

df = pd.read_csv('/Users/dummy-folder/dummy-file.csv')
ip_location_dict = {}
path_name = '/Users/dummy-folder/IP-Country.json'

# Load the cache from disk if a previous session saved one
if os.path.exists(path_name):
    with open(path_name) as ip_location_json:
        ip_location_dict = json.load(ip_location_json)

def get_location(ip_address):
    """
    Returns the country mapped to an IP address.

    :param ip_address: str, IP address
    """
    if ip_address in ip_location_dict:
        return ip_location_dict[ip_address]
    response = requests.get(f"https://ipapi.co/{ip_address}/json/").json()
    country = response.get("country_name")
    ip_location_dict[ip_address] = country
    return country

df["country"] = df["IP address"].apply(get_location)

# Persist the updated cache for the next session
with open(path_name, "w") as output:
    json.dump(ip_location_dict, output)

This way, at the start of every session we load the locally stored ip_location_dict, and at the end of every session we update the file to include any IP addresses that were not previously captured.
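
If you’d rather not manage the JSON file yourself, the standard library’s shelve module gives you a dictionary-like object that persists to disk automatically. A minimal sketch of the same idea (the cache path is illustrative):

import shelve
import requests

# Opens (or creates) an on-disk, dictionary-like cache
cache = shelve.open('/Users/dummy-folder/ip-country-cache')

def get_location(ip_address):
    """Returns the country mapped to an IP address, caching results on disk."""
    if ip_address in cache:
        return cache[ip_address]
    response = requests.get(f"https://ipapi.co/{ip_address}/json/").json()
    country = response.get("country_name")
    cache[ip_address] = country  # persisted as it is written
    return country

# ... apply get_location as before, then:
cache.close()  # flushes pending writes to disk

A nice side effect: because entries are written as they are added, a crash midway through a run is less likely to cost you everything you’ve fetched so far.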

A key question to ask when building a cache is 'How often should my cache be cleared?' or, put another way, 'When does the content of my cache become stale?'. The answer informs the optimal caching strategy to implement. In my case, IP-to-country mappings change very rarely, so a long-lived cache was safe.
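
For data that does go stale, a common strategy is a time-to-live (TTL) cache: each entry is stored with a timestamp and ignored once it is older than some cutoff. A minimal sketch of the idea (the one-week TTL is purely illustrative):

import time

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # one week, purely illustrative
ttl_cache = {}  # maps key -> (value, time stored)

def cache_get(key):
    """Return the cached value for key, or None if missing or expired."""
    entry = ttl_cache.get(key)
    if entry is None:
        return None
    value, stored_at = entry
    if time.time() - stored_at > CACHE_TTL_SECONDS:
        del ttl_cache[key]  # stale entry: drop it so it gets refetched
        return None
    return value

def cache_set(key, value):
    """Store value for key along with the time it was stored."""
    ttl_cache[key] = (value, time.time())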

To learn more about caching and different approaches to its implementation, check out the following writing:

I really hope you've learnt a thing or two about caching. Thank you for reading this far. Please share the link with someone you think could learn a thing or two as well.

Gracias!