Codementor Events

Data Science with Python & R: Exploratory Data Analysis

Published Jul 27, 2015Last updated Feb 09, 2017
Data Science with Python & R: Exploratory Data Analysis

Introduction and Getting Data refresh

Here we are again, with a new episode in our series about doing data science with the two most popular open-source platforms you can use for the job nowadays. In this case we will have a look at a crucial step of the data analytics process, that of the Exploratory Data Analysis.

Exploratory data analysis takes place after gathering and cleaning data, and before any modeling and visualisation/presentation of results. However, it is part of an iterative process. After doing some EDA we can try to build some models or present some visualisations. At the same time, based on the results of the later we can perform some more EDA and so on. It is all about quickly finding clues and not so much about details or aesthetics. Among the main purposes of this type of analysis are of course getting to know our data, its tendencies and its quality, and also to check or even start formulating our hypothesis.

And with that idea in mind we will explain how to use descriptive statistics and basic plotting, together with data frames, in order to answer some questions and guide our further data analysis.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Getting data

We will continue using the same datasets we already loaded in the part introducing data frames. So you can either continue where you left in that tutorial, or re-run the section that gets and prepares the data.

Questions we want to answer

In any data analysis process, there is one or more questions we want to answer. That is the most basic and important step in the whole process, to define these questions. Since we are going to perform some Exploratory Data Analysis in our TB dataset, these are the questions we want to answer:

  • Which are the countries with the highest and infectious TB incidence?
  • What is the general world tendency in the period from 1990 to 2007?
  • What countries don't follow that tendency?
  • What other facts about the disease do we know that we can check with our data?

Descriptive Statistics

Python

The basic data descriptive statistics method for a pandas.DataFrame is describe(). It is the equivalent to R data.frame function summary().

df_summary = existing_df.describe()
df_summary
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
count 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 ... 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000
mean 353.333333 36.944444 47.388889 12.277778 25.277778 413.444444 35.611111 10.833333 61.222222 74.944444 ... 28.055556 128.888889 186.000000 40.888889 282.666667 126.222222 43.388889 194.333333 535.277778 512.833333
std 64.708396 6.915220 4.487091 9.886447 7.274497 97.751318 1.243283 2.812786 20.232634 16.129885 ... 3.717561 15.911109 62.027508 2.422660 57.322616 86.784083 8.332353 52.158131 91.975576 113.411925
min 238.000000 22.000000 42.000000 0.000000 17.000000 281.000000 34.000000 7.000000 35.000000 49.000000 ... 23.000000 102.000000 102.000000 38.000000 220.000000 13.000000 31.000000 130.000000 387.000000 392.000000
25% 305.000000 32.000000 44.000000 6.000000 19.250000 321.250000 35.000000 9.000000 41.250000 62.000000 ... 25.000000 116.500000 128.750000 39.000000 234.250000 63.250000 36.250000 146.750000 459.000000 420.750000
50% 373.500000 40.500000 45.500000 9.000000 22.500000 399.000000 35.000000 10.000000 60.500000 77.000000 ... 27.500000 131.500000 185.000000 41.000000 257.000000 106.000000 43.000000 184.500000 521.500000 466.000000
75% 404.500000 42.000000 50.750000 16.250000 31.500000 512.000000 36.000000 12.750000 77.000000 85.750000 ... 30.750000 143.000000 240.000000 42.000000 349.000000 165.750000 51.500000 248.500000 620.000000 616.750000
max 436.000000 44.000000 56.000000 42.000000 39.000000 530.000000 38.000000 16.000000 96.000000 99.000000 ... 35.000000 152.000000 278.000000 46.000000 365.000000 352.000000 55.000000 265.000000 680.000000 714.000000
8 rows × 207 columns

There is a lot of information there. We can access individual summaries as follows.

df_summary[['Spain','United Kingdom']]
country Spain United Kingdom
count 18.000000 18.000000
mean 30.666667 9.611111
std 6.677442 0.916444
min 23.000000 9.000000
25% 25.250000 9.000000
50% 29.000000 9.000000
75% 34.750000 10.000000
max 44.000000 12.000000

There is a plethora of descriptive statistics methods in Pandas (check the documentation). Some of them are already included in our summary object, but there are many more. In following tutorials we will make good use of them in order to better understand our data.

For example, we can obtain the percentage change over the years for the number of tuberculosis cases in Spain.

tb_pct_change_spain = existing_df.Spain.pct_change()
tb_pct_change_spain
    year
    1990         NaN
    1991   -0.045455
    1992   -0.047619
    1993   -0.075000
    1994   -0.054054
    1995   -0.028571
    1996   -0.029412
    1997   -0.090909
    1998    0.000000
    1999   -0.066667
    2000   -0.035714
    2001   -0.037037
    2002    0.000000
    2003   -0.038462
    2004   -0.040000
    2005    0.000000
    2006    0.000000
    2007   -0.041667
    Name: Spain, dtype: float64

And from there get the maximum value.

tb_pct_change_spain.max()
    0.0

And do the same for the United Kingdom.

existing_df['United Kingdom'].pct_change().max()
    0.11111111111111116

If we want to know the index value (year) we use argmax (callex idmax in later versions of Pandas) as follows.

existing_df['Spain'].pct_change().argmax()
    '1998'
existing_df['United Kingdom'].pct_change().argmax()
    '1992'

That is, 1998 and 1992 were the worst years in Spain and the UK respectibely regarding the increase of infectious TB cases.

R

The basic descriptive statistics method in R is, as we said, the function summary().

existing_summary <- summary(existing_df)
str(existing_summary)
##  'table' chr [1:6, 1:207] "Min.   :238.0  " "1st Qu.:305.0  " ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "" "" "" "" ...
##   ..$ : chr [1:207] " Afghanistan" "   Albania" "   Algeria" "American Samoa" ...

It returns a table object where we have summary statistics for each of the columns in a data frame. A table object is good for visualising data, but not so good for accessing and indexing it as a data frame. Basically we use positional indexing to access it as a matrix. This way, if we want the first column, that corresponding to Afghanistan, we do:

existing_summary[,1]
##                                                                         
## "Min.   :238.0  " "1st Qu.:305.0  " "Median :373.5  " "Mean   :353.3  " 
##                                     
## "3rd Qu.:404.5  " "Max.   :436.0  "

A trick we can use to access by column name is use the column names in the original data frame together with which(). We also can build a new data frame with the results.

data.frame(
    Spain=existing_summary[,which(colnames(existing_df)=='Spain')],
    UK=existing_summary[,which(colnames(existing_df)=='United Kingdom')])
##             Spain               UK
## 1 Min.   :23.00   Min.   : 9.000  
## 2 1st Qu.:25.25   1st Qu.: 9.000  
## 3 Median :29.00   Median : 9.000  
## 4 Mean   :30.67   Mean   : 9.611  
## 5 3rd Qu.:34.75   3rd Qu.:10.000  
## 6 Max.   :44.00   Max.   :12.000

Being R a functional language, we can apply functions such as sum, mean, sd, etc. to vectors. Remember that a data frame is a list of vectors (i.e. each column is a vector of values), so we can easily use these functions with columns. We can finally combine these functions with lapply or sapply and apply them to multiple columns in a data frame.

However, there is a family of functions in R that can be applied to columns or rows in order to get means and sums directly. These are more efficient than using apply functions, and also allows us to apply them not just by columns but also by row. If you type `?colSums' for example, the help page describes all of them.

Let's say we wan to obtain the average number of existing cases per year. We need a single function call.

rowMeans(existing_df)
##    X1990    X1991    X1992    X1993    X1994    X1995    X1996    X1997 
## 196.9662 196.4686 192.8116 191.1739 188.7246 187.9420 178.8986 180.9758 
##    X1998    X1999    X2000    X2001    X2002    X2003    X2004    X2005 
## 178.1208 180.4734 177.5217 177.7971 179.5169 176.4058 173.9227 171.1836 
##    X2006    X2007 
## 169.0193 167.2560

Plotting

In this section we will take a look at the basic plotting functionality in Python/Pandas and R. However, there are more powerful alternatives like ggplot2 that, although originally created for R, has its own implementation for Python from the Yhat guys.

Python

Pandas DataFrames implement up to three plotting methods out of the box (check the documentation). The first one is a basic line plot for each of the series we include in the indexing. The first line might be needed when plotting while using IPython notebook.

%matplotlib inline

 existing_df[['United Kingdom', 'Spain', 'Colombia']].plot()

enter image description here

Or we can use box plots to obtain a summarised view of a given series as follows.

 existing_df[['United Kingdom', 'Spain', 'Colombia']].boxplot()

enter image description here

There is also a histogram() method, but we can't use it with this type of data right now.

R

Base plotting in R is not very sophisticated when compared with ggplot2, but still is powerful and handy because many data types have implemented custom plot() methods that allow us to plot them with a single method call. However this is not always the case, and more often than not we will need to pass the right set of elements to our basic plotting functions.

Let's start with a basic line chart like we did with Python/Pandas.

uk_series <- existing_df[,c("United Kingdom")]
spain_series <- existing_df[,c("Spain")]
colombia_series <- existing_df[,c("Colombia")]
xrange <- 1990:2007
plot(xrange, uk_series, 
     type='l', xlab="Year", 
     ylab="Existing cases per 100K", 
     col = "blue", 
     ylim=c(0,100))
lines(xrange, spain_series, 
      col = "darkgreen")
lines(xrange, colombia_series,
      col = "red")
legend(x=2003, y=100, 
       lty=1, 
       col=c("blue","darkgreen","red"), 
       legend=c("UK","Spain","Colombia"))

enter image description here

You can compare how easy it was to plot three series in Pandas, and how doing the
same thing with basic plotting in R gets more verbose. At least we need
three function calls, those for plot and line, and then we have the legend, etc. The base plotting in R is really intended to make quick and dirty charts.

Let's use now box plots.

boxplot(uk_series, spain_series, colombia_series, 
        names=c("UK","Spain","Colombia"),
        xlab="Year", 
        ylab="Existing cases per 100K")

enter image description here

This one was way shorter, and we don't even need colours or a legend.

Answering Questions

Let's now start with the real fun. Once we know our tools (from the previous tutorial about data frames and this one), let's use them to answer some questions about the incidence and prevalence of infectious tuberculosis in the world.

Question: We want to know, per year, what country has the highest number of existing and new TB cases.

Python

If we want just the top ones we can make use of apply and argmax. Remember that, by default, apply works with columns (the countries in our case), and we want to apply it to each year. Therefore we need to transpose the data frame before using it, or we can pass the argument axis=1.

 existing_df.apply(pd.Series.argmax, axis=1)
    year
    1990            Djibouti
    1991            Djibouti
    1992            Djibouti
    1993            Djibouti
    1994            Djibouti
    1995            Djibouti
    1996            Kiribati
    1997            Kiribati
    1998            Cambodia
    1999    Korea, Dem. Rep.
    2000            Djibouti
    2001           Swaziland
    2002            Djibouti
    2003            Djibouti
    2004            Djibouti
    2005            Djibouti
    2006            Djibouti
    2007            Djibouti
    dtype: object

But this is too simplistic. Instead, we want to get those countries that are in the fourth quartile. But first we need to find out the world general tendency.

In order to explore the world's general trend, we need to sum up every countries' values for the three datasets, per year.

deaths_total_per_year_df = deaths_df.sum(axis=1)
existing_total_per_year_df = existing_df.sum(axis=1)
new_total_per_year_df = new_df.sum(axis=1)

Now we will create a new DataFrame with each sum in a series that we will plot using the data frame plot() method.

world_trends_df = pd.DataFrame({
           'Total deaths per 100K' : deaths_total_per_year_df, 
           'Total existing cases per 100K' : existing_total_per_year_df, 
           'Total new cases per 100K' : new_total_per_year_df}, 
       index=deaths_total_per_year_df.index)
world_trends_df.plot(figsize=(12,6)).legend(
    loc='center left', 
    bbox_to_anchor=(1, 0.5))

enter image description here

It seems that the general tendency is for a decrease in the total number of existing cases per 100K. However the number of new cases has been increasing, although it seems reverting from 2005. So how is possible that the total number of existing cases is decreasing if the total number of new cases has been growing? One of the reasons could be the observed increase in the number of deaths per 100K, but the main reason we have to consider is that people recover from tuberculosis thanks to treatment. The sum of the recovery rate plus the death rate is greater than the new cases rate. In any case, it seems that there are more new cases, but also that we cure them better. We need to improve prevention and epidemics control.

Countries out of tendency

So the previous was the general trend of the world as a whole. So what countries with a different trend (for worse)? In order to find this out, first we need to know the distribution of deaths by countries in an average year.

deaths_by_country_mean = deaths_df.mean()
deaths_by_country_mean_summary = deaths_by_country_mean.describe()
existing_by_country_mean = existing_df.mean()
existing_by_country_mean_summary = existing_by_country_mean.describe()
new_by_country_mean = new_df.mean()
new_by_country_mean_summary = new_by_country_mean.describe()

We can plot these distributions to have an idea of how the countries are distributed in an average year.

deaths_by_country_mean.order().plot(kind='bar', figsize=(24,6))

enter image description here

We want those countries beyond 1.5 times the inter quartile range (50%). We have these values in:

deaths_outlier = deaths_by_country_mean_summary['50%']*1.5
existing_outlier = existing_by_country_mean_summary['50%']*1.5
new_outlier = new_by_country_mean_summary['50%']*1.5

Now we can use these values to get those countries that, across the period 1990-2007, have exceeded beyond those levels.

# Now compare with the outlier threshold
outlier_countries_by_deaths_index = 
    deaths_by_country_mean > deaths_outlier
outlier_countries_by_existing_index = 
   existing_by_country_mean > existing_outlier
outlier_countries_by_new_index = 
    new_by_country_mean > new_outlier

What proportion of countries do we have out of trend? For deaths:

num_countries = len(deaths_df.T)
sum(outlier_countries_by_deaths_index)/num_countries
    0.39613526570048307

For existing cases (prevalence):

 sum(outlier_countries_by_existing_index)/num_countries
    0.39613526570048307

For new cases (incidence):

 sum(outlier_countries_by_new_index)/num_countries
    0.38647342995169082

Now we can use these indices to filter our original data frames.

outlier_deaths_df = deaths_df.T[ outlier_countries_by_deaths_index ].T
outlier_existing_df = existing_df.T[ outlier_countries_by_existing_index ].T
outlier_new_df = new_df.T[ outlier_countries_by_new_index ].T

This is serious stuff. We have more than one third of the world being outliers on the distribution of existings cases, new cases, and deaths by infectious tuberculosis. But what if we consider an outlier to be 5 times the IQR? Let's repeat the previous process.

deaths_super_outlier = deaths_by_country_mean_summary['50%']*5
existing_super_outlier = existing_by_country_mean_summary['50%']*5
new_super_outlier = new_by_country_mean_summary['50%']*5
    
super_outlier_countries_by_deaths_index = 
    deaths_by_country_mean > deaths_super_outlier
super_outlier_countries_by_existing_index = 
    existing_by_country_mean > existing_super_outlier
super_outlier_countries_by_new_index = 
    new_by_country_mean > new_super_outlier

What proportion do we have now?

sum(super_outlier_countries_by_deaths_index)/num_countries
    0.21739130434782608

Let's get the data frames.

super_outlier_deaths_df = 
    deaths_df.T[ super_outlier_countries_by_deaths_index ].T
super_outlier_existing_df = 
    existing_df.T[ super_outlier_countries_by_existing_index ].T
super_outlier_new_df = 
    new_df.T[ super_outlier_countries_by_new_index ].T

Let's concentrate on epidemics control and have a look at the new cases' data frame.

super_outlier_new_df
country Bhutan Botswana Cambodia Congo, Rep. Cote d'Ivoire Korea, Dem. Rep. Djibouti Kiribati Lesotho Malawi ... Philippines Rwanda Sierra Leone South Africa Swaziland Timor-Leste Togo Uganda Zambia Zimbabwe
year
1990 540 307 585 169 177 344 582 513 184 258 ... 393 167 207 301 267 322 308 163 297 329
1991 516 341 579 188 196 344 594 503 201 286 ... 386 185 220 301 266 322 314 250 349 364
1992 492 364 574 200 209 344 606 493 218 314 ... 380 197 233 302 260 322 320 272 411 389
1993 470 390 568 215 224 344 618 483 244 343 ... 373 212 248 305 267 322 326 296 460 417
1994 449 415 563 229 239 344 630 474 280 373 ... 366 225 263 309 293 322 333 306 501 444
1995 428 444 557 245 255 344 642 464 323 390 ... 360 241 279 317 337 322 339 319 536 474
1996 409 468 552 258 269 344 655 455 362 389 ... 353 254 297 332 398 322 346 314 554 501
1997 391 503 546 277 289 344 668 446 409 401 ... 347 273 315 360 474 322 353 320 576 538
1998 373 542 541 299 312 344 681 437 461 412 ... 341 294 334 406 558 322 360 326 583 580
1999 356 588 536 324 338 344 695 428 519 417 ... 335 319 355 479 691 322 367 324 603 628
2000 340 640 530 353 368 344 708 420 553 425 ... 329 348 377 576 801 322 374 340 602 685
2001 325 692 525 382 398 344 722 412 576 414 ... 323 376 400 683 916 322 382 360 627 740
2002 310 740 520 408 425 344 737 403 613 416 ... 317 402 425 780 994 322 389 386 632 791
2003 296 772 515 425 444 344 751 396 635 410 ... 312 419 451 852 1075 322 397 396 652 825
2004 283 780 510 430 448 344 766 388 643 405 ... 306 423 479 898 1127 322 405 385 623 834
2005 270 770 505 425 443 344 781 380 639 391 ... 301 418 509 925 1141 322 413 370 588 824
2006 258 751 500 414 432 344 797 372 638 368 ... 295 408 540 940 1169 322 421 350 547 803
2007 246 731 495 403 420 344 813 365 637 346 ... 290 397 574 948 1198 322 429 330 506 782
18 rows × 22 columns

Let's make some plots to get a better impression.

super_outlier_new_df.plot(figsize=(12,4)).legend(loc='center left', bbox_to_anchor=(1, 0.5))

enter image description here

We have 22 countries where the number of new cases during an average year is greater than 5 times the median value of the distribution. Let's create a country that represents the average of these 22 countries.

average_super_outlier_country = super_outlier_new_df.mean(axis=1)
average_super_outlier_country
    year
    1990    314.363636
    1991    330.136364
    1992    340.681818
    1993    352.909091
    1994    365.363636
    1995    379.227273
    1996    390.863636
    1997    408.000000
    1998    427.000000
    1999    451.409091
    2000    476.545455
    2001    502.409091
    2002    525.727273
    2003    543.318182
    2004    548.909091
    2005    546.409091
    2006    540.863636
    2007    535.181818
    dtype: float64

Now let's create a country that represents the rest of the world.

avearge_better_world_country = 
    new_df.T[ - super_outlier_countries_by_new_index ].T.mean(axis=1)
avearge_better_world_country
    year
    1990    80.751351
    1991    81.216216
    1992    80.681081
    1993    81.470270
    1994    81.832432
    1995    82.681081
    1996    82.589189
    1997    84.497297
    1998    85.189189
    1999    86.232432
    2000    86.378378
    2001    86.551351
    2002    89.848649
    2003    87.778378
    2004    87.978378
    2005    87.086022
    2006    86.559140
    2007    85.605405
    dtype: float64

Now let's plot this country with the average world country.

two_world_df = 
    pd.DataFrame({ 
            'Average Better World Country': avearge_better_world_country,
            'Average Outlier Country' : average_super_outlier_country},
        index = new_df.index)
two_world_df.plot(title="Estimated new TB cases per 100K",figsize=(12,8))

enter image description here

The increase in new cases' tendency is really stronger in the average super outlier country, so much stronger that is difficult to perceive that same tendency in the better world country. The 90's decade brought a terrible increase in the number of TB cases in those countries. But let's have a look at the exact numbers.

two_world_df.pct_change().plot(title="Percentage change in estimated new TB cases", figsize=(12,8))

enter image description here

Based on this plot, the decceleration and reversion of that tendency seem to happen at the same time in both average better and outlier countries, and something happened around 2002. We will try to find out what's going on in the next section.

R

We already know that we can use max with a data frame column in R and get the maximum value. Additionally, we can use which.max in order to get its position (similarly to the use og argmax in Pandas). If we use the transposed data frame, we can use lapply or sapply to perform this operation in every year column, getting then either a list or a vector of indices (we will use sapply that returns a vector). We just need a little tweak and use a countries vector that we will index to get the country name instead of the index as a result.

country_names <- rownames(existing_df_t)
sapply(existing_df_t, function(x) {country_names[which.max(x)]})
##              X1990              X1991              X1992 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X1993              X1994              X1995 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X1996              X1997              X1998 
##         "Kiribati"         "Kiribati"         "Cambodia" 
##              X1999              X2000              X2001 
## "Korea, Dem. Rep."         "Djibouti"        "Swaziland" 
##              X2002              X2003              X2004 
##         "Djibouti"         "Djibouti"         "Djibouti" 
##              X2005              X2006              X2007 
##         "Djibouti"         "Djibouti"         "Djibouti"

Again, in order to explore the world general tendency, we need to sum up every countries’ values for the three datasets, per year.

But first we need to load the other two datasets for number of deaths and number of new cases.

# Download files
deaths_file <- getURL("https://docs.google.com/spreadsheets/d/12uWVH_IlmzJX_75bJ3IH5E-Gqx6-zfbDKNvZqYjUuso/pub?gid=0&output=CSV")
new_cases_file <- getURL("https://docs.google.com/spreadsheets/d/1Pl51PcEGlO9Hp4Uh0x2_QM0xVb53p2UDBMPwcnSjFTk/pub?gid=0&output=csv")

# Read into data frames
deaths_df <- read.csv(
    text = deaths_file, 
    row.names=1, 
    stringsAsFactor=F)
new_df <- read.csv(
    text = new_cases_file, 
    row.names=1, 
    stringsAsFactor=F)

# Cast data to int (deaths doesn't need it)
new_df[1:18] <- lapply(
    new_df[1:18], 
    function(x) { as.integer(gsub(',', '', x) )})

# Transpose
deaths_df_t <- deaths_df
deaths_df <- as.data.frame(t(deaths_df))
new_df_t <- new_df
new_df <- as.data.frame(t(new_df))

And now the sums by row. We need to convert this to a data frame since the function returns a numeric vector.

deaths_total_per_year_df <- data.frame(total=rowSums(deaths_df))
existing_total_per_year_df <- data.frame(total=rowSums(existing_df))
# We pass na.rm = TRUE in order to ignore missing values in the new
# cases data frame when summing (no missing values in other dataframes though)
new_total_per_year_df <- data.frame(total=rowSums(new_df, na.rm = TRUE))

Now we can plot each line using what we have learnt so far. In order to get a vector with the counts to pass to each plotting function, we use R data frame indexing by column name.

xrange <- 1990:2007
plot(xrange, deaths_total_per_year_df$total, 
     type='l', xlab="Year", 
     ylab="Count per 100K", 
     col = "blue", 
     ylim=c(0,50000))
lines(xrange, existing_total_per_year_df$total,
      col = "darkgreen")
lines(xrange, new_total_per_year_df$total, 
      col = "red")
legend(x=1992, y=52000, 
       lty=1, 
       cex = .7,
       ncol = 3,
       col=c("blue","darkgreen","red"), 
       legend=c("Deaths","Existing cases","New cases"))

enter image description here

The conclusions are obviously the same as when using Python.

Countries out of tendency

So what countries are outliers of the trend (for the worse)? Again, in order to find this out, first we need to know the distribution of countries in an average year. We use colMeans for that purpose.

deaths_by_country_mean <- data.frame(mean=colMeans(deaths_df))
existing_by_country_mean <- data.frame(mean=colMeans(existing_df))
new_by_country_mean <- data.frame(mean=colMeans(new_df, na.rm=TRUE))

We can plot these distributions to have an idea of how the countries are distributed in an average year. We are not so interested about the individual countries but about the distribution itself.

barplot(sort(deaths_by_country_mean$mean))

enter image description here

Again we can see there are three trends in the plot, with a slowly decreasing part at the beginning, a second more step section, and a final peak that is clearly apart from the rest.

Let's skip this time the 1.5-outlier part and go diretcly to the 5.0-outliers. In R we will use a different approach. We will use the quantile() function in order to get the inter-quartile range and determine the outlier threshold.

Since we already know the results from our Python section, let's do it just for the new cases, so we generate also the plots we did before.

new_super_outlier <- 
    quantile(new_by_country_mean$mean, probs = c(.5)) * 5.0
super_outlier_countries_by_new_index <- 
    new_by_country_mean > new_super_outlier

And the proportion is.

sum(super_outlier_countries_by_new_index)/208
## [1] 0.1057692

Let's obtain a data frame from this, with just those countries we consider to be outliers.

super_outlier_new_df <- 
    new_df[, super_outlier_countries_by_new_index ]

Now we are ready to plot them.

xrange <- 1990:2007
plot(xrange, super_outlier_new_df[,1], 
     type='l', xlab="Year", 
     ylab="New cases per 100K", 
     col = 1, 
     ylim=c(0,1800))
for (i in seq(2:ncol(super_outlier_new_df))) {
    lines(xrange, super_outlier_new_df[,i],
    col = i)
}
legend(x=1990, y=1800, 
       lty=1, cex = 0.5,
       ncol = 7,
       col=1:22,
       legend=colnames(super_outlier_new_df))

enter image description here

Definitely we can see here an advantage of using Pandas basic plotting versus R basic plotting!

So far our results match. We have 22 countries where the number of new cases on an average year is greater than 5 times the median value of the distribution. Let’s create a country that represents on average these 22. We will use rowMeans() here.

average_countries_df <- 
    data.frame(
        averageOutlierMean=rowMeans(super_outlier_new_df, na.rm=T)
    )
average_countries_df
##       averageOutlierMean
## X1990           314.3636
## X1991           330.1364
## X1992           340.6818
## X1993           352.9091
## X1994           365.3636
## X1995           379.2273
## X1996           390.8636
## X1997           408.0000
## X1998           427.0000
## X1999           451.4091
## X2000           476.5455
## X2001           502.4091
## X2002           525.7273
## X2003           543.3182
## X2004           548.9091
## X2005           546.4091
## X2006           540.8636
## X2007           535.1818

Now let’s create a country that represents the rest of the world.

average_countries_df$averageBetterWorldMean <- 
    rowMeans(new_df[ ,- super_outlier_countries_by_new_index ], na.rm=T)
average_countries_df
##       averageOutlierMean averageBetterWorldMean
## X1990           314.3636               105.2767
## X1991           330.1364               107.3786
## X1992           340.6818               108.0243
## X1993           352.9091               110.0388
## X1994           365.3636               111.6942
## X1995           379.2273               113.9369
## X1996           390.8636               115.0971
## X1997           408.0000               118.6408
## X1998           427.0000               121.2913
## X1999           451.4091               124.8350
## X2000           476.5455               127.6505
## X2001           502.4091               130.5680
## X2002           525.7273               136.0194
## X2003           543.3182               136.0388
## X2004           548.9091               136.8155
## X2005           546.4091               135.5121
## X2006           540.8636               134.4493
## X2007           535.1818               133.2184

Now let’s plot the outlier country with the average world country.

xrange <- 1990:2007
plot(xrange, average_countries_df$averageOutlierMean, 
     type='l', xlab="Year", 
     ylab="New cases per 100K", 
     col = "darkgreen", 
     ylim=c(0,600))
lines(xrange, average_countries_df$averageBetterWorldMean, col = "blue")
legend(x=1990, y=600, 
       lty=1, cex = 0.7,
       ncol = 2,
       col=c("darkgreen","blue"),
       legend=c("Average outlier country", "Average World Country"))

enter image description here

Googling about events and dates in Tuberculosis

We will use just Python in this section. About googling, actually we just went straight to Wikipedia's entry about the disease. In the epidemics sections we found the following:

  • The total number of tuberculosis cases has been decreasing since 2005, while new cases have decreased since 2002.

  • This is confirmed by our previous analysis.

  • China has achieved particularly dramatic progress, with about an 80% reduction in its TB mortality rate between 1990 and 2010. Let's check it:

existing_df.China.plot(title="Estimated existing TB cases in China")

enter image description here

  • In 2007, the country with the highest estimated incidence rate of TB was Swaziland, with 1,200 cases per 100,000 people.
new_df.apply(pd.Series.argmax, axis=1)['2007']
    'Swaziland'

There are many more findings Wikipedia that we can confirm with these or other datasets from Gapminder world. For example, TB and HIV are frequently associated, together with poverty levels. It would be interesting to join datasets and explore tendencies in each of them. We challenge the reader to give them a try and share with us their findings.

Other web pages to explore

Some interesting resources about tuberculosis apart from the Gapminder website:

  • Gates foundation:
  • http://www.gatesfoundation.org/What-We-Do/Global-Health/Tuberculosis
  • http://www.gatesfoundation.org/Media-Center/Press-Releases/2007/09/New-Grants-to-Fight-Tuberculosis-Epidemic

Conclusions

Exploratory data analysis is a key step in data analysis. It is during this stage when we start shaping any later work. It precedes any data visualisation or machine learning work, by showing us good or bad our data and our hypothesis are.

Traditionally, R has been the weapon of choice for most EDA work, although the use of a more expressive plotting library such as gglot2 is quite convenient. In fact, the base plotting functionality incorporated in Pandas makes the process cleaner and quicker when using Python. However, the questions we have answered here were very simple and didn't include multiple variables and encodings. In such cases an advanced library like ggplot2 will shine. Apart from providing nicer charts, it will saves us quite a lot of time due to its expressiveness and reusability.

But as simple as our analysis and charts are, we have been able to make the point about how serious the humanitarian crisis is regarding a disease like tuberculosis, specially when considering that the disease is relatively well controlled in more developed countries. We have seen how some coding skills and a good amount of curiosity allows us to create awareness in these and other world issues.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Discover and read more posts from Jose A Dianes
get started
post comments7Replies
Steve
9 years ago

The python or R, both became almost equal when choice is given. The data science applications are increasing day by day. So both are equally preferred.

Evan
9 years ago

Very helpful documents.I greatly benefit from it.Thanks for your work.

Jose A Dianes
9 years ago

Hi Evanzhu2013. Many thanks for you’re comment. I’m glad to hear that our tutorials are being useful for you! Jose.

Radhouane
9 years ago

Sorry but I don’t see data science in this article (which is great btw). Is it more appropriate to rename the article : Data wrangling … instead ?
(just saying)

Jose A Dianes
9 years ago

Hi Radhouane,

Many thanks for your comment! You’re probably right about the outcomes of this particular article. However the article is part of a longer series about different parts of the data science process. The guys at codementor thought it was good to keep a common title as part of all of them. About the title you propose I disagree. Data wrangling is an important part of the data science process, but is related with transforming raw data into a format that can be used for data analysis. Here we do very mild transformations and get quickly into exploring data using charts.

Radhouane
9 years ago

Awesome sauce, it makes more sense when you mention its gonna be part of a long series. Great stuff !

Jose A Dianes
9 years ago

Glad you like it! I hope we will keep you interested :)

Show more replies