Python implementation of Normalized Google Distance (Simple web scraping example)
Introduction
The number of Google results for a word gives a rough indication of its popularity. Likewise, how often two words appear together, compared with how often each appears on its own, is a useful measure of how closely the two words are related.
The Normalized Google Distance is built on these ideas. In this post I show how to implement it in Python using basic web scraping tools. The final code can be found here.
The Code
Importing libraries
import requests
from bs4 import BeautifulSoup
import math
import sys
Doing the search and getting the count
This function sends a GET request to Google with headers that identify the client as a desktop browser (and not a phone). I also set the gl parameter so that the search is performed as if I were in the USA.
def number_of_results(text):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    r = requests.get("https://www.google.com/search?q="+text.replace(" ","+"),params={"gl":"us"},headers=headers)
Then BeautifulSoup is used to extract the part of the HTML that contains the result count.
    soup = BeautifulSoup(r.text, "lxml")
    res = soup.find("div", {"id": "resultStats"})
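Finally, it prints and returns the parsed number.
    # soup.find returns None when the response has no result counter
    if res is None:
        raise Exception("Couldn't find the result count in Google's response")
    print(res.text)
    for t in res.text.split():
        try:
            number = float(t.replace(",",""))
            print("{} results for {}".format(number,text))
            return number
        except ValueError:
            # This token is not a number, try the next one
            pass
    raise Exception("Couldn't find a valid number of results on Google")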
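The number-parsing step can also be tried on its own, without querying Google. The following is a minimal sketch under stated assumptions: parse_result_count is a hypothetical helper name that is not part of the original script, and it assumes the counter text looks like "About 1,230,000 results (0.45 seconds)", with commas as thousands separators. Unlike the loop above, it picks the first run of digits with a regular expression and returns None instead of raising when no number is present.

```python
import re


def parse_result_count(stats_text):
    """Return the first number found in stats_text as a float, or None.

    Hypothetical helper, not part of the original script.
    """
    if not stats_text:
        # Covers both None and an empty string
        return None
    # A digit followed by any mix of digits and thousands separators
    match = re.search(r"\d[\d,]*", stats_text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_result_count("About 1,230,000 results (0.45 seconds)"))
print(parse_result_count("No results found"))
```

Returning None leaves it to the caller to decide whether a missing counter means zero results or a failed request.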
Compute the formula
Here we implement the formula specified on Wikipedia.
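For reference, this is the formula as I understand it from the Wikipedia article, written with the usual symbols: f(x) is the number of results for the term x, f(x, y) is the number of results for both terms together, and N is a normalizing constant that stands for the size of the index. Because the expression is a ratio of logarithms, the base of the logarithm does not change the result.

```latex
\mathrm{NGD}(x, y) =
  \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}
       {\log N - \min\{\log f(x), \log f(y)\}}
```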
# N = number_of_results("the")
N = 25270000000.0
N = math.log(N,2)
def normalized_google_distance(w1, w2):
    f_w1 = math.log(number_of_results(w1),2)
    f_w2 = math.log(number_of_results(w2),2)
    f_w1_w2 = math.log(number_of_results(w1+" "+w2),2)
    return (max(f_w1,f_w2) - f_w1_w2) / (N - min(f_w1,f_w2))
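To check the arithmetic without hitting Google, the same computation can be run on fixed counts. This is only a sketch: ngd_from_counts is a hypothetical helper that is not in the original script, the counts are made up, and returning infinity when a count is zero is an assumed convention (the original code would fail on the logarithm instead).

```python
import math


def ngd_from_counts(f_x, f_y, f_xy, n):
    """Normalized Google Distance from raw result counts (hypothetical helper)."""
    if f_x <= 0 or f_y <= 0 or f_xy <= 0:
        # Assumption: terms that never occur together are infinitely distant
        return float("inf")
    log_x, log_y, log_xy = math.log2(f_x), math.log2(f_y), math.log2(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log2(n) - min(log_x, log_y))


# Made-up counts, chosen as powers of two so the logarithms are whole numbers:
# this evaluates (10 - 6) / (20 - 8)
print(ngd_from_counts(1024, 256, 64, 2 ** 20))
```

A result of 0 means the two terms always appear together, and larger values mean the terms are less related.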
Main Function
All the code is executed from the main function
def main(argv):
    w1 = argv[1]
    w2 = argv[2]
    score = normalized_google_distance(w1,w2)
    print("Score is",round(score,2))
    print("W1='"+ w1+ "' W2='"+ w2+ "'")
# Usage example
# python normalized_google_distance.py shakespeare macbeth
# python normalized_google_distance.py "shakespeare " "macbeth"
main(sys.argv)
Hey Mathias, I am getting the same error as Zaman: after a few queries the code stops working and gives exactly the same error.
It might be due to a limit Google places on the number of queries, but in that case the limit is very low, since I made only around 20 queries.
Let me know if there’s already a solution to this.
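If the failures really are caused by throttling, which nothing in this thread confirms, one common workaround is to wait and try again with increasing pauses. The following is a minimal sketch under that assumption: call_with_backoff is a hypothetical helper, fetch stands for any function without arguments (for example a lambda wrapping number_of_results), and the delays are arbitrary.

```python
import time


def call_with_backoff(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure.

    Hypothetical helper; re-raises the last error once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                # No attempts left, let the caller see the error
                raise
            # Pauses of base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

It could be used as call_with_backoff(lambda: number_of_results("macbeth")). The sleep parameter exists only so the pauses can be replaced in tests.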
Hi Mathias. Your code was working fine for me. However, after a few runs, I am getting the following error:
  File "normalizedgoogledistance.py", line 12, in number_of_results
    print(res.text)
AttributeError: 'NoneType' object has no attribute 'text'
I hope you can suggest a solution for the error above.
Hi Zaman, how are you running it? One possible problem is that one of the words (or the combination of both) doesn’t show up on Google, that is, it has 0 results. In that case the program fails with an error.
It’s working again now! So sometimes the same code works and sometimes it does not!
When I get the error, I even get the error for similar test scripts such as the one here: https://gist.github.com/yxlao/ad429b65ec1b3836da8f06fbd9fa8c54
I couldn’t reproduce your error. I would check the value of “r.text” when you get the error. That’s the full HTML response from Google, and it seems that for some reason it is not a valid search result page.
Try adding something like:
with open("response.html", "w") as f:
    f.write(r.text)
Before the:
soup = BeautifulSoup(r.text, "lxml")
If you open the file with a web browser you should see a google search result screen, but if there is a problem it might not be the case.
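To avoid writing a file on every request, the dump could be limited to responses that look wrong. This is only a sketch: save_if_unexpected is a hypothetical helper, and checking for the literal text id="resultStats" is a crude assumption about how the counter appears in the HTML.

```python
from pathlib import Path


def save_if_unexpected(html, marker='id="resultStats"', path="response.html"):
    """Write html to path when marker is missing; return True if a file was written.

    Hypothetical helper for debugging, not part of the original script.
    """
    if marker in html:
        # The page looks like a normal result page, nothing to save
        return False
    Path(path).write_text(html, encoding="utf-8")
    return True
```

Inside number_of_results it would be called as save_if_unexpected(r.text) just before the BeautifulSoup line.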
Thanks Mathias. At present I am trying an alternative Java implementation.
Does BeautifulSoup work for dynamically loaded DOM elements, i.e. via ajax calls?
It depends on the website, but it’s usually possible to request a URL that returns the desired DOM elements already loaded and then extract those values as regular HTML.
Here is an example: https://stackoverflow.com/questions/5913280/beautifulsoup-and-ajax-table-problem
Thanks, Mathias. I’ll give it a try. Appreciate the prompt response.