Florents Tselai

Python, Machine Learning, Data Science & Engineering

Analyzing 1000+ Greek Wines With Python

Published Dec 28, 2017Last updated Jun 25, 2018

This was originally posted on my website - tselai.com

One of my most enjoyable guilty pleasures has always been web scraping. Especially during the past year I have scraped countless websites, both for fun and profit. From niche and mainstream e-shops to news outlets and literary blogs, it is amazing how much interesting and clean data one can get by using simple tools like BeautifulSoup- I won't even mention what a joy Chrome's Headless mode has been.

In this post I'll play with the data I scraped from a Greek wine e-shop. In lieu of apology for sending a few more requests to their server I urge everyone browse through their catalog and maybe even buy a few bottles.

The scraper itself is fairly simple and can be found on my GitHub page. Here I'll focus on doing some quick exploratory analysis on the data I got (1125 unique labels) by using standard python packages.

The scraper itself exposes a fairly simple API. You request data for a given wine page and it returns a nice dict with the data, like this:

from houseofwine_gr import get

get('http://www.houseofwine.gr/how/megas-oinos-skoura.html')

{'ageable': True,
 'alcohol_%': 13.5,
 'avg_rating_%': 82,
 'color': 'Ερυθρός',
 'description': 'Στιβαρό ερυθρό κρασί παλαίωσης. Βασίζεται σε χαρμάνι που παντρεύει το πικάντικο σκέρτσο του Αγιωργίτικου με την αυστηρή δύναμη του Cabernet Sauvignon, στη ζεστή και βανιλάτη αγκαλιά του δρύινου βαρελιού που τα φιλοξένησε κατά την 20μηνη παλαίωσή του. Ποιοτική ετικέτα για μεγάλα φαγοπότια ή εκλεκτά κελάρια.',
 'drink_now': False,
 'keep_2_3_years': False,
 'n_votes': 21,
 'name': 'Μέγας Οίνος Σκούρα 2014',
 'price': 20.9,
 'tags': ['Ήπιος', 'Ξηρός', 'Cabernet Sauvignon', 'Αγιωργίτικο'],
 'url': 'http://www.houseofwine.gr/how/megas-oinos-skoura.html',
 'year': 2014}

We start by defining some matplotlib aesthetics.

%matplotlib inline 
import matplotlib.pyplot as plt 
import matplotlib as mpl
import seaborn as sns 
plt.style.use('fivethirtyeight') 
mpl.rcParams['figure.figsize'] = (8,6)
mpl.rcParams['font.size'] = 14
mpl.rcParams['font.family'] = 'Serif'
mpl.rcParams['figure.facecolor'] = 'white'
plt.rcParams['axes.facecolor']='white'
plt.rcParams['axes.grid'] =False
plt.rcParams['figure.facecolor']='white'

We load a dump of the data produced by the houseofwine_gr.dump module. You can find the dataset in .json, .csv and .xlsx in the GitHub page as well.

import pandas as pd df = pd.read_json('./data/houseofwine.gr-wines.json', encoding='utf-8')

We replace empty strings with np.nan to make them easier to handle for Pandas

from numpy import nan
df = df.replace('', nan, regex=True)

We rename some column names containing special characters in order to use them as native DataFrame accessors.

df = df.rename(columns={'alcohol_%': 'alcohol', 'avg_rating_%': 'avg_rating'}, inplace=False)

We also assign appropriate types to the columns.

df['alcohol'] = df.alcohol.astype(float)
df['n_votes'] = df.n_votes.astype(int, errors='ignore')
df['price'] = df.price.astype(float)
df['year'] = df.year.astype(int, errors='ignore')

Let's translate the color column values from Greek to English.

df['color'] = df.color.replace({'Λευκός': 'White', 'Ερυθρός': 'Red', 'Ροζέ': 'Rosé' } )

Here is a color histogram for our dataset.

ax = df['color'].value_counts().plot('bar')
ax.set(ylabel='Wines', title='Wine Color Frequency');

Let's check out the distribution of some straightforward metrics for each wine.

fig, ((ax1, ax2), (ax3, ax4) ) = plt.subplots(ncols=2, nrows=2, figsize=(12,8)) 
df.year.dropna().astype(int).value_counts().plot('bar', ax=ax1)
ax1.set(title='Production Year Frequency', xlabel='Year'); 

sns.distplot(df[df.alcohol < 100].alcohol.dropna(), ax=ax2)
ax2.set(xlabel='Alcohol %', title='Alcohol % Distribution'); 

sns.distplot(df[df.price < 100].price.dropna(), ax=ax3)
ax3.set(xlabel='Price (< 100)') 

sns.distplot(df.avg_rating.dropna(), ax=ax4)
ax4.set(xlabel='Average Rating');

An immediate observation is the almost-normal distribution one sees for the Average Rating column with a high mean of 85+.Kroutoner on Reddit explains why this happens (and also corrects a previous mistake of mine):

The typical wine rating scale is on 50-100, not 0-100. So what looked like was a distribution with only half support is actually a nearly fully supported distribution with slight left skew. Further, there's a huge difference culturally between wines with ratings below and above 90, with wines receiving a rating above 90 generally being considered significantly better and also selling significantly better. This cultural fact completely changes the reasonable interpretation of the data. Most wines are rated as ok and only a small portion being rated as really good.

To move things further, let's have a look at the tags column.

0 [Riesling, Ήπιος, Ημίγλυκος]
1 [Ήπιος, Sauvignon Blanc, Ξηρός]
2 [Ήπιος, Ξηρός, Ξινόμαυρο, Merlot]
3 [Ήπιος, Ξηρός, Ασύρτικο, Μαλαγουζιά]
4 [Ήπιος, Ξηρός, Pinotage]
5 [Ήπιος, Sangiovese, Ξηρός]
6 [Ήπιος, Ξηρός, Ξινόμαυρο]
7 [Cabernet Sauvignon, Ήπιος, 13, Ξηρός, Merlot,...
8 [Ήπιος, Γλυκός]
9 [Ήπιος, Αηδάνι, Ξηρός]
Name: tags, dtype: object

It looks like each list of tags can give us info about various wine attributes (varietal(s), sweetness etc.), so we would be better of if we made these attributes separate columns.

First we convert tags column elements from list to set ones, as this will make manipulations easier. That is, instead of getting messed up in an if x in -else-try-except-IndexError we're going to use set operations. Say what you want about the 79 character limit and readability; in interactive mode one-liners and immutable operations rock.

df['tags'] = df.tags.map(set)

We now do some simple operations to extract info about sweetness, mildness and wether or not is sparkling. We also translate from Greek to English.

sweetness_values = {'Γλυκός', 'Ημίγλυκος', 'Ξηρός', 'Ημίξηρος'}
df['sweetness'] = df.tags.map(sweetness_values.intersection).map(lambda x: x.pop() if x else None) 

translations = {'Γλυκός': 'Sweet', 'Ημίγλυκος': 'Semi-Sweet', 'Ξηρός': 'Dry', 'Ημίξηρος': 'Semi-Dry'}
df['sweetness'] = df['sweetness'].replace(translations)
df['sparkling'] = df.tags.map({'Αφρώδης', 'Ημιαφρώδης'}.intersection ).map(lambda x: x.pop() if x else None ).replace({'Αφρώδης': 'Sparkling', 'Ημιαφρώδης': 'Semi-Sparkling'})
df['sparkling'] = df.sparkling.fillna('Not Sparkling')

df['is_mild'] = df.tags.map(lambda x: 'Ήπιος' in x)

Here are the histogram for each one of these 4 attributes.

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, squeeze=False) for attr, ax in zip(['sweetness', 'color', 'sparkling', 'is_mild'], (ax1, ax2, ax3, ax4)): df[attr].value_counts().sort_values(ascending=True).plot(kind='barh', ax=ax) attr_str = 'Mildness' if attr is 'is_mild' else attr.title() ax.set(xlabel='Number of Wines', title='{} Frequency'.format(attr_str)); fig.set_size_inches((12,8))
fig.tight_layout()

At this point we can (almost) safely assume that all remaining tags indicate varietal info for each wine and so we define a new column to store them.

non_varietal_tags = {'Αφρώδης', 'Ημιαφρώδης', 'Ξηρός', 'Ημίξηρος', 'Γλυκός', 'Ημίγλυκος', 'Ήπιος'}
df['varieties'] = df.tags.map(lambda t: t.difference(non_varietal_tags))

Due to a parsing bug there are some integers appearing as varieties in the column, so we filter these out.

def is_not_int(x):
    try:
        int(x)
        return False
    except ValueError:
        return True
    
df['varieties'] = df.varieties.map(lambda x: set(filter(is_not_int, x)))

We can also add a boolean variable of wether or not a wine is varietal. We define as varietal wine, one that has only one variety in its mix. The rest - with at least two varieties in their mix - are blends.

df['is_varietal'] = df.varieties.map(set. __len__ ) == 1

For varietal wines in particular we set a single_variety - this value will be NaN for the rest of the non-varietal wines.

df.loc[df.is_varietal, 'single_variety'] = df.loc[df.is_varietal, 'varieties'].map(lambda v: next(iter(v)))

df['is_blend'] = df.varieties.map(set. __len__ ) >= 2

df.loc[df.is_varietal, 'variety_type'] = 'Varietal'
df.loc[df.is_blend, 'variety_type'] = 'Blend'

Let's see how the Varietal / Blend distribution looks like.

ax = df.is_varietal.replace({True: 'Varietal', False: 'Blend'} ).value_counts().plot('barh')
ax.set(title='How many wines are varietals and how many blends ?');

Let's dig into the varieties stuff - should be interesting.

Here are some indicative plots.

fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(12,14))
fig.tight_layout varieties_hist = df[df.is_varietal].varieties.map(lambda x: next(iter(x))).value_counts()

varieties_hist.head(10).sort_values(ascending=True).plot('barh', ax=ax1)
ax1.set(title='Most Frequent Varietal Wines (Top 10)'); 
varietals_mean_price = df[['single_variety', 'price']].groupby('single_variety').mean().dropna()['price']

varietals_mean_price.sort_values(ascending=False).head(10).sort_values(ascending=True).plot('barh', ax=ax2)

ax2.set(title='Most Expensive Varietals', xlabel='', ylabel=''); varietals_mean_rating = df[['single_variety', 'avg_rating']].groupby('single_variety').mean().dropna()['avg_rating']

varietals_mean_rating.sort_values(ascending=False).head(10).sort_values(ascending=True).plot('barh', ax=ax3);
ax3.set(title='Top Rated Varietals', ylabel=''); fig.tight_layout()

Looks likes Chardonnay is the most popular varietal while Vidal and Sangiovese are the most expensive ones. The top rated one is Malvasia but all varieties are pretty close.

Shifting our attention to blends, we do some Numpy and Scikit-Learn magic to produce a coocurrence matrix of the varieties appearing in blends.

def create_coocurrence_df(docs):
    # Source: https://stackoverflow.com/a/37822989
    count_model = CountVectorizer(lowercase=False, min_df=.1) # default unigram model
    X = count_model.fit_transform(docs)
    Xc = (X.T * X) # this is co-occurrence matrix in sparse csr format
    Xc.setdiag(0) # sometimes you want to fill same word cooccurence to 0
    ret = pd.DataFrame(Xc.todense())#, index=count_model.get_feature_names(), columnscou=count_model.get_feature_names())
    ret.index = ret.columns = list(map(lambda f: f.replace('_', ' '), count_model.get_feature_names()))
    return ret

from sklearn.feature_extraction.text import CountVectorizer
docs = df.loc[df.is_blend, 'varieties'].map(lambda x: {s.replace(' ', '_') for s in x}).map(lambda x: ' '.join(x)) 
coocurrence = create_coocurrence_df(docs)

These are the varieties that appear most frequently in blends.

ax = coocurrence.sum().sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh')
ax.set(title='Most Frequent Varieties Appearing in Blends (Top 10)');

And here is a heatmap showing which varieties are usually blended together.

ax = sns.heatmap(coocurrence, square=True, annot=True, fmt="d", cmap='Blues')
ax.set(title='Which varieties are usually blended together ?\n'.title());
plt.gcf().set_size_inches((8,6))

fig, axes = plt.subplots(nrows=3, figsize=(12, 18))
for c, ax in zip(['Red', 'White', 'Rosé'], axes):
    docs = df.loc[df.is_blend & (df.color==c), 'varieties'].map(lambda x: {s.replace(' ', '_') for s in x}).map(lambda x: ' '.join(x))
    cooc = create_coocurrence_df(docs)
    cmaps = {'Red': 'Reds', 'White': 'Blues', 'Rosé': sns.cubehelix_palette(8)}
    sns.heatmap(cooc, square=True, annot=True, fmt="d", cmap=cmaps.get(c), ax=ax)
    ax.set(title='Usually blended varieties for {}\n'.format(c).title());
plt.tight_layout()

That's all folks. A potential next step would be a classifier based on the textual description column I didn't touch here. Feel free to comment with ideas or play with the data yourself.

This was originally posted on my website - tselai.com

Python Pandas Web scraping Data Science Data analytics

Report

Enjoy this post? Give Florents Tselai a like if it's helpful.

Florents Tselai

Python, Machine Learning, Data Science & Engineering

I use mathematics and machine learning to analyze & explain data. When the machine doesn't learn, I do it the old-fashioned way: I use my brain and do my own learning instead. I consider my skillset to lie somewhere in the inters...

Discover and read more posts from Florents Tselai

get started

4Replies

Vignesh M

8 years ago

Nicely curated!

Florents Tselai

8 years ago

Thanks :)

oyedayo oyelowo

8 years ago

Interesting! thanks for this

Florents Tselai

8 years ago

Thanks :)

Show more replies