Data Science with Python & R: Exploratory Data Analysis
Introduction and Getting Data refresh
Here we are again, with a new episode in our series about doing data science with the two most popular open-source platforms available for the job today. This time we will look at a crucial step of the data analytics process: Exploratory Data Analysis (EDA).
Exploratory data analysis takes place after gathering and cleaning data, and before any modeling or visualisation/presentation of results. However, it is part of an iterative process: after doing some EDA we can try to build some models or present some visualisations, and based on the results of the latter we can perform some more EDA, and so on. It is all about quickly finding clues, and not so much about details or aesthetics. Among the main purposes of this type of analysis are getting to know our data, its tendencies and its quality, and also checking, or even starting to formulate, our hypotheses.
And with that idea in mind we will explain how to use descriptive statistics and basic plotting, together with data frames, in order to answer some questions and guide our further data analysis.
All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!
Getting data
We will continue using the same datasets we already loaded in the part introducing data frames. So you can either continue where you left in that tutorial, or re-run the section that gets and prepares the data.
Questions we want to answer
In any data analysis process, there are one or more questions we want to answer. Defining these questions is the most basic and important step in the whole process. Since we are going to perform some Exploratory Data Analysis on our TB dataset, these are the questions we want to answer:
- Which countries have the highest infectious TB incidence?
- What is the general world tendency in the period from 1990 to 2007?
- What countries don't follow that tendency?
- What other facts about the disease do we know that we can check with our data?
Descriptive Statistics
Python
The basic descriptive statistics method for a pandas `DataFrame` is `describe()`. It is the equivalent of the R `data.frame` function `summary()`.
df_summary = existing_df.describe()
df_summary
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | ... | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 | 18.000000 |
mean | 353.333333 | 36.944444 | 47.388889 | 12.277778 | 25.277778 | 413.444444 | 35.611111 | 10.833333 | 61.222222 | 74.944444 | ... | 28.055556 | 128.888889 | 186.000000 | 40.888889 | 282.666667 | 126.222222 | 43.388889 | 194.333333 | 535.277778 | 512.833333 |
std | 64.708396 | 6.915220 | 4.487091 | 9.886447 | 7.274497 | 97.751318 | 1.243283 | 2.812786 | 20.232634 | 16.129885 | ... | 3.717561 | 15.911109 | 62.027508 | 2.422660 | 57.322616 | 86.784083 | 8.332353 | 52.158131 | 91.975576 | 113.411925 |
min | 238.000000 | 22.000000 | 42.000000 | 0.000000 | 17.000000 | 281.000000 | 34.000000 | 7.000000 | 35.000000 | 49.000000 | ... | 23.000000 | 102.000000 | 102.000000 | 38.000000 | 220.000000 | 13.000000 | 31.000000 | 130.000000 | 387.000000 | 392.000000 |
25% | 305.000000 | 32.000000 | 44.000000 | 6.000000 | 19.250000 | 321.250000 | 35.000000 | 9.000000 | 41.250000 | 62.000000 | ... | 25.000000 | 116.500000 | 128.750000 | 39.000000 | 234.250000 | 63.250000 | 36.250000 | 146.750000 | 459.000000 | 420.750000 |
50% | 373.500000 | 40.500000 | 45.500000 | 9.000000 | 22.500000 | 399.000000 | 35.000000 | 10.000000 | 60.500000 | 77.000000 | ... | 27.500000 | 131.500000 | 185.000000 | 41.000000 | 257.000000 | 106.000000 | 43.000000 | 184.500000 | 521.500000 | 466.000000 |
75% | 404.500000 | 42.000000 | 50.750000 | 16.250000 | 31.500000 | 512.000000 | 36.000000 | 12.750000 | 77.000000 | 85.750000 | ... | 30.750000 | 143.000000 | 240.000000 | 42.000000 | 349.000000 | 165.750000 | 51.500000 | 248.500000 | 620.000000 | 616.750000 |
max | 436.000000 | 44.000000 | 56.000000 | 42.000000 | 39.000000 | 530.000000 | 38.000000 | 16.000000 | 96.000000 | 99.000000 | ... | 35.000000 | 152.000000 | 278.000000 | 46.000000 | 365.000000 | 352.000000 | 55.000000 | 265.000000 | 680.000000 | 714.000000 |
8 rows × 207 columns
There is a lot of information there. We can access individual summaries as follows.
df_summary[['Spain','United Kingdom']]
country | Spain | United Kingdom |
---|---|---|
count | 18.000000 | 18.000000 |
mean | 30.666667 | 9.611111 |
std | 6.677442 | 0.916444 |
min | 23.000000 | 9.000000 |
25% | 25.250000 | 9.000000 |
50% | 29.000000 | 9.000000 |
75% | 34.750000 | 10.000000 |
max | 44.000000 | 12.000000 |
There is a plethora of descriptive statistics methods in Pandas (check the documentation). Some of them are already included in our summary object, but there are many more. In following tutorials we will make good use of them in order to better understand our data.
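As a quick, self-contained illustration of a few of those methods, here they are applied to a small series of made-up yearly rates (not taken from the TB dataset):

```python
import pandas as pd

# Made-up yearly rates, purely illustrative
s = pd.Series([23, 25, 29, 35, 44])

print(s.mean())          # 31.2
print(s.median())        # 29.0
print(s.quantile(0.75))  # 35.0, the value describe() reports as 75%
print(s.std())           # sample standard deviation
```

These are exactly the statistics that `describe()` bundles together for every column.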
For example, we can obtain the percentage change over the years for the number of tuberculosis cases in Spain.
tb_pct_change_spain = existing_df.Spain.pct_change()
tb_pct_change_spain
year
1990 NaN
1991 -0.045455
1992 -0.047619
1993 -0.075000
1994 -0.054054
1995 -0.028571
1996 -0.029412
1997 -0.090909
1998 0.000000
1999 -0.066667
2000 -0.035714
2001 -0.037037
2002 0.000000
2003 -0.038462
2004 -0.040000
2005 0.000000
2006 0.000000
2007 -0.041667
Name: Spain, dtype: float64
And from there get the maximum value.
tb_pct_change_spain.max()
0.0
And do the same for the United Kingdom.
existing_df['United Kingdom'].pct_change().max()
0.11111111111111116
If we want to know the index value (year) we use `argmax` (renamed `idxmax` in later versions of Pandas) as follows.
existing_df['Spain'].pct_change().argmax()
'1998'
existing_df['United Kingdom'].pct_change().argmax()
'1992'
That is, 1998 and 1992 were the worst years in Spain and the UK respectively regarding the increase of infectious TB cases.
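To make the `pct_change`/`idxmax` pattern concrete, here is a minimal sketch on a hypothetical mini data frame laid out like ours (years as rows, countries as columns); the numbers are invented:

```python
import pandas as pd

# Hypothetical mini dataset, same layout as existing_df
df = pd.DataFrame(
    {"Spain": [44, 42, 40, 37], "United Kingdom": [9, 9, 10, 12]},
    index=["1990", "1991", "1992", "1993"],
)

change = df["United Kingdom"].pct_change()  # NaN, 0.0, 0.111..., 0.2
print(change.max())     # 0.2: the largest year-on-year relative increase
print(change.idxmax())  # '1993': the year it happened
```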
R
The basic descriptive statistics method in R is, as we said, the function `summary()`.
existing_summary <- summary(existing_df)
str(existing_summary)
## 'table' chr [1:6, 1:207] "Min. :238.0 " "1st Qu.:305.0 " ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:6] "" "" "" "" ...
## ..$ : chr [1:207] " Afghanistan" " Albania" " Algeria" "American Samoa" ...
It returns a table object containing summary statistics for each of the columns in a data frame. A table object is good for visualising data, but not so good for accessing and indexing it as a data frame. Basically, we use positional indexing to access it as a matrix. So if we want the first column, the one corresponding to Afghanistan, we do:
existing_summary[,1]
##
## "Min. :238.0 " "1st Qu.:305.0 " "Median :373.5 " "Mean :353.3 "
##
## "3rd Qu.:404.5 " "Max. :436.0 "
A trick we can use to access by column name is to combine the column names in the original data frame with `which()`. We can also build a new data frame with the results.
data.frame(
Spain=existing_summary[,which(colnames(existing_df)=='Spain')],
UK=existing_summary[,which(colnames(existing_df)=='United Kingdom')])
## Spain UK
## 1 Min. :23.00 Min. : 9.000
## 2 1st Qu.:25.25 1st Qu.: 9.000
## 3 Median :29.00 Median : 9.000
## 4 Mean :30.67 Mean : 9.611
## 5 3rd Qu.:34.75 3rd Qu.:10.000
## 6 Max. :44.00 Max. :12.000
R being a functional language, we can apply functions such as `sum`, `mean`, `sd`, etc. to vectors. Remember that a data frame is a list of vectors (i.e. each column is a vector of values), so we can easily use these functions with columns. We can then combine these functions with `lapply` or `sapply` and apply them to multiple columns in a data frame.
However, there is a family of functions in R that can be applied to columns or rows in order to get means and sums directly. These are more efficient than the apply functions, and also allow us to work not just by column but also by row. If you type `?colSums`, for example, the help page describes the whole family.
Let's say we want to obtain the average number of existing cases per year. We need a single function call.
rowMeans(existing_df)
## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997
## 196.9662 196.4686 192.8116 191.1739 188.7246 187.9420 178.8986 180.9758
## X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005
## 178.1208 180.4734 177.5217 177.7971 179.5169 176.4058 173.9227 171.1836
## X2006 X2007
## 169.0193 167.2560
Plotting
In this section we will take a look at the basic plotting functionality in Python/Pandas and R. However, there are more powerful alternatives, such as ggplot2, which, although originally created for R, has its own Python implementation from the Yhat folks.
Python
Pandas DataFrames implement up to three plotting methods out of the box (check the documentation). The first one is a basic line plot for each of the series we include in the indexing. The first line below may be needed when plotting inside an IPython notebook.
%matplotlib inline
existing_df[['United Kingdom', 'Spain', 'Colombia']].plot()
Or we can use box plots to obtain a summarised view of a given series as follows.
existing_df[['United Kingdom', 'Spain', 'Colombia']].boxplot()
There is also a `hist()` method for histograms, but we can't use it with this type of data right now.
R
Base plotting in R is not very sophisticated compared with ggplot2, but it is still powerful and handy because many data types have custom `plot()` methods that let us plot them with a single call. However, this is not always the case; more often than not we will need to pass the right set of elements to the basic plotting functions.
Let's start with a basic line chart like we did with Python/Pandas.
uk_series <- existing_df[,c("United Kingdom")]
spain_series <- existing_df[,c("Spain")]
colombia_series <- existing_df[,c("Colombia")]
xrange <- 1990:2007
plot(xrange, uk_series,
type='l', xlab="Year",
ylab="Existing cases per 100K",
col = "blue",
ylim=c(0,100))
lines(xrange, spain_series,
col = "darkgreen")
lines(xrange, colombia_series,
col = "red")
legend(x=2003, y=100,
lty=1,
col=c("blue","darkgreen","red"),
legend=c("UK","Spain","Colombia"))
You can compare how easy it was to plot three series in Pandas with how much more verbose the same thing gets with base plotting in R. We need at least three function calls (one `plot()` and two `lines()`), plus the legend, and so on. Base plotting in R is really intended for quick and dirty charts.
Let's use now box plots.
boxplot(uk_series, spain_series, colombia_series,
names=c("UK","Spain","Colombia"),
xlab="Year",
ylab="Existing cases per 100K")
This one was way shorter, and we don't even need colours or a legend.
Answering Questions
Let's now start with the real fun. Once we know our tools (from the previous tutorial about data frames and this one), let's use them to answer some questions about the incidence and prevalence of infectious tuberculosis in the world.
Question: We want to know, per year, what country has the highest number of existing and new TB cases.
Python
If we want just the top ones we can make use of `apply` and `argmax`. Remember that, by default, `apply` works with columns (the countries in our case), and we want to apply it to each year. Therefore we either need to transpose the data frame before using it, or pass the argument `axis=1`.
existing_df.apply(pd.Series.argmax, axis=1)
year
1990 Djibouti
1991 Djibouti
1992 Djibouti
1993 Djibouti
1994 Djibouti
1995 Djibouti
1996 Kiribati
1997 Kiribati
1998 Cambodia
1999 Korea, Dem. Rep.
2000 Djibouti
2001 Swaziland
2002 Djibouti
2003 Djibouti
2004 Djibouti
2005 Djibouti
2006 Djibouti
2007 Djibouti
dtype: object
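In recent versions of Pandas the same per-row maximum label can be obtained directly with `DataFrame.idxmax(axis=1)`, without `apply`; a sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {"Djibouti": [582, 594], "Kiribati": [513, 620]},
    index=["1990", "1991"],
)

# Label of the column holding each row's maximum
print(df.idxmax(axis=1).tolist())  # ['Djibouti', 'Kiribati']
```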
But this is too simplistic. Instead, we want to get those countries that are in the fourth quartile. First, though, we need to find out the world's general tendency.
World trends in TB cases
In order to explore the world's general trend, we need to sum up every country's values, per year, for each of the three datasets.
deaths_total_per_year_df = deaths_df.sum(axis=1)
existing_total_per_year_df = existing_df.sum(axis=1)
new_total_per_year_df = new_df.sum(axis=1)
Now we will create a new `DataFrame` with each sum as a series, and plot it using the data frame `plot()` method.
world_trends_df = pd.DataFrame({
'Total deaths per 100K' : deaths_total_per_year_df,
'Total existing cases per 100K' : existing_total_per_year_df,
'Total new cases per 100K' : new_total_per_year_df},
index=deaths_total_per_year_df.index)
world_trends_df.plot(figsize=(12,6)).legend(
loc='center left',
bbox_to_anchor=(1, 0.5))
It seems that the general tendency is a decrease in the total number of existing cases per 100K. However, the number of new cases had been increasing, although it seems to start reverting around 2005. So how is it possible that the total number of existing cases is decreasing if the total number of new cases has been growing? One reason could be the observed increase in the number of deaths per 100K, but the main one is that people recover from tuberculosis thanks to treatment: the sum of the recovery rate plus the death rate is greater than the new cases rate. In any case, it seems that there are more new cases, but also that we cure them better. We need to improve prevention and epidemics control.
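The bookkeeping behind that explanation can be sketched with toy numbers (purely illustrative, not taken from the dataset):

```python
# Prevalence balance for one year, per 100K (invented numbers)
existing = 100   # existing cases at the start of the year
new = 20         # new cases appearing during the year
deaths = 5       # deaths among TB cases
recovered = 18   # cases cured by treatment

# Since deaths + recovered > new, prevalence falls despite positive incidence
existing_next = existing + new - deaths - recovered
print(existing_next)  # 97
```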
Countries out of tendency
So the previous was the general trend of the world as a whole. Which countries follow a different trend (for the worse)? In order to find this out, first we need to know the distribution of deaths by country in an average year.
deaths_by_country_mean = deaths_df.mean()
deaths_by_country_mean_summary = deaths_by_country_mean.describe()
existing_by_country_mean = existing_df.mean()
existing_by_country_mean_summary = existing_by_country_mean.describe()
new_by_country_mean = new_df.mean()
new_by_country_mean_summary = new_by_country_mean.describe()
We can plot these distributions to have an idea of how the countries are distributed in an average year.
deaths_by_country_mean.order().plot(kind='bar', figsize=(24,6))
We want those countries beyond 1.5 times the median (the 50% value in the summary). We have these thresholds in:
deaths_outlier = deaths_by_country_mean_summary['50%']*1.5
existing_outlier = existing_by_country_mean_summary['50%']*1.5
new_outlier = new_by_country_mean_summary['50%']*1.5
Now we can use these values to get those countries that, across the period 1990-2007, have exceeded those levels.
# Now compare with the outlier threshold
outlier_countries_by_deaths_index = deaths_by_country_mean > deaths_outlier
outlier_countries_by_existing_index = existing_by_country_mean > existing_outlier
outlier_countries_by_new_index = new_by_country_mean > new_outlier
What proportion of countries do we have out of trend? For deaths:
num_countries = len(deaths_df.T)
sum(outlier_countries_by_deaths_index)/num_countries
0.39613526570048307
For existing cases (prevalence):
sum(outlier_countries_by_existing_index)/num_countries
0.39613526570048307
For new cases (incidence):
sum(outlier_countries_by_new_index)/num_countries
0.38647342995169082
Now we can use these indices to filter our original data frames.
outlier_deaths_df = deaths_df.T[ outlier_countries_by_deaths_index ].T
outlier_existing_df = existing_df.T[ outlier_countries_by_existing_index ].T
outlier_new_df = new_df.T[ outlier_countries_by_new_index ].T
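The whole pattern (column means, a median-based threshold, boolean filtering) can be sketched end to end on a toy frame. Note that `df.loc[:, mask]` selects columns directly and avoids the double transpose (names and numbers here are invented):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [100, 200], "C": [3, 4]})

col_means = df.mean()                          # mean per column (country)
threshold = col_means.describe()["50%"] * 1.5  # 1.5 times the median
mask = col_means > threshold                   # boolean Series over columns

# Equivalent to df.T[mask].T, but without transposing twice
outliers = df.loc[:, mask]
print(outliers.columns.tolist())  # ['B']
```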
This is serious stuff. More than one third of the world's countries are outliers in the distribution of existing cases, new cases, and deaths from infectious tuberculosis. But what if we consider an outlier to be 5 times the median? Let's repeat the previous process.
deaths_super_outlier = deaths_by_country_mean_summary['50%']*5
existing_super_outlier = existing_by_country_mean_summary['50%']*5
new_super_outlier = new_by_country_mean_summary['50%']*5
super_outlier_countries_by_deaths_index = deaths_by_country_mean > deaths_super_outlier
super_outlier_countries_by_existing_index = existing_by_country_mean > existing_super_outlier
super_outlier_countries_by_new_index = new_by_country_mean > new_super_outlier
What proportion do we have now?
sum(super_outlier_countries_by_deaths_index)/num_countries
0.21739130434782608
Let's get the data frames.
super_outlier_deaths_df = deaths_df.T[ super_outlier_countries_by_deaths_index ].T
super_outlier_existing_df = existing_df.T[ super_outlier_countries_by_existing_index ].T
super_outlier_new_df = new_df.T[ super_outlier_countries_by_new_index ].T
Let's concentrate on epidemics control and have a look at the new cases' data frame.
super_outlier_new_df
country | Bhutan | Botswana | Cambodia | Congo, Rep. | Cote d'Ivoire | Korea, Dem. Rep. | Djibouti | Kiribati | Lesotho | Malawi | ... | Philippines | Rwanda | Sierra Leone | South Africa | Swaziland | Timor-Leste | Togo | Uganda | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | 540 | 307 | 585 | 169 | 177 | 344 | 582 | 513 | 184 | 258 | ... | 393 | 167 | 207 | 301 | 267 | 322 | 308 | 163 | 297 | 329 |
1991 | 516 | 341 | 579 | 188 | 196 | 344 | 594 | 503 | 201 | 286 | ... | 386 | 185 | 220 | 301 | 266 | 322 | 314 | 250 | 349 | 364 |
1992 | 492 | 364 | 574 | 200 | 209 | 344 | 606 | 493 | 218 | 314 | ... | 380 | 197 | 233 | 302 | 260 | 322 | 320 | 272 | 411 | 389 |
1993 | 470 | 390 | 568 | 215 | 224 | 344 | 618 | 483 | 244 | 343 | ... | 373 | 212 | 248 | 305 | 267 | 322 | 326 | 296 | 460 | 417 |
1994 | 449 | 415 | 563 | 229 | 239 | 344 | 630 | 474 | 280 | 373 | ... | 366 | 225 | 263 | 309 | 293 | 322 | 333 | 306 | 501 | 444 |
1995 | 428 | 444 | 557 | 245 | 255 | 344 | 642 | 464 | 323 | 390 | ... | 360 | 241 | 279 | 317 | 337 | 322 | 339 | 319 | 536 | 474 |
1996 | 409 | 468 | 552 | 258 | 269 | 344 | 655 | 455 | 362 | 389 | ... | 353 | 254 | 297 | 332 | 398 | 322 | 346 | 314 | 554 | 501 |
1997 | 391 | 503 | 546 | 277 | 289 | 344 | 668 | 446 | 409 | 401 | ... | 347 | 273 | 315 | 360 | 474 | 322 | 353 | 320 | 576 | 538 |
1998 | 373 | 542 | 541 | 299 | 312 | 344 | 681 | 437 | 461 | 412 | ... | 341 | 294 | 334 | 406 | 558 | 322 | 360 | 326 | 583 | 580 |
1999 | 356 | 588 | 536 | 324 | 338 | 344 | 695 | 428 | 519 | 417 | ... | 335 | 319 | 355 | 479 | 691 | 322 | 367 | 324 | 603 | 628 |
2000 | 340 | 640 | 530 | 353 | 368 | 344 | 708 | 420 | 553 | 425 | ... | 329 | 348 | 377 | 576 | 801 | 322 | 374 | 340 | 602 | 685 |
2001 | 325 | 692 | 525 | 382 | 398 | 344 | 722 | 412 | 576 | 414 | ... | 323 | 376 | 400 | 683 | 916 | 322 | 382 | 360 | 627 | 740 |
2002 | 310 | 740 | 520 | 408 | 425 | 344 | 737 | 403 | 613 | 416 | ... | 317 | 402 | 425 | 780 | 994 | 322 | 389 | 386 | 632 | 791 |
2003 | 296 | 772 | 515 | 425 | 444 | 344 | 751 | 396 | 635 | 410 | ... | 312 | 419 | 451 | 852 | 1075 | 322 | 397 | 396 | 652 | 825 |
2004 | 283 | 780 | 510 | 430 | 448 | 344 | 766 | 388 | 643 | 405 | ... | 306 | 423 | 479 | 898 | 1127 | 322 | 405 | 385 | 623 | 834 |
2005 | 270 | 770 | 505 | 425 | 443 | 344 | 781 | 380 | 639 | 391 | ... | 301 | 418 | 509 | 925 | 1141 | 322 | 413 | 370 | 588 | 824 |
2006 | 258 | 751 | 500 | 414 | 432 | 344 | 797 | 372 | 638 | 368 | ... | 295 | 408 | 540 | 940 | 1169 | 322 | 421 | 350 | 547 | 803 |
2007 | 246 | 731 | 495 | 403 | 420 | 344 | 813 | 365 | 637 | 346 | ... | 290 | 397 | 574 | 948 | 1198 | 322 | 429 | 330 | 506 | 782 |
18 rows × 22 columns
Let's make some plots to get a better impression.
super_outlier_new_df.plot(figsize=(12,4)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
We have 22 countries where the number of new cases during an average year is greater than 5 times the median value of the distribution. Let's create a country that represents the average of these 22 countries.
average_super_outlier_country = super_outlier_new_df.mean(axis=1)
average_super_outlier_country
year
1990 314.363636
1991 330.136364
1992 340.681818
1993 352.909091
1994 365.363636
1995 379.227273
1996 390.863636
1997 408.000000
1998 427.000000
1999 451.409091
2000 476.545455
2001 502.409091
2002 525.727273
2003 543.318182
2004 548.909091
2005 546.409091
2006 540.863636
2007 535.181818
dtype: float64
Now let's create a country that represents the rest of the world.
avearge_better_world_country = new_df.T[ ~super_outlier_countries_by_new_index ].T.mean(axis=1)
avearge_better_world_country
year
1990 80.751351
1991 81.216216
1992 80.681081
1993 81.470270
1994 81.832432
1995 82.681081
1996 82.589189
1997 84.497297
1998 85.189189
1999 86.232432
2000 86.378378
2001 86.551351
2002 89.848649
2003 87.778378
2004 87.978378
2005 87.086022
2006 86.559140
2007 85.605405
dtype: float64
Now let's plot this country with the average world country.
two_world_df = pd.DataFrame({
    'Average Better World Country': avearge_better_world_country,
    'Average Outlier Country': average_super_outlier_country},
    index=new_df.index)
two_world_df.plot(title="Estimated new TB cases per 100K",figsize=(12,8))
The increasing tendency in new cases is much stronger in the average super outlier country, so much so that it is difficult to perceive the same tendency in the better world country. The 1990s brought a terrible increase in the number of TB cases in those countries. But let's have a look at the exact numbers.
two_world_df.pct_change().plot(title="Percentage change in estimated new TB cases", figsize=(12,8))
Based on this plot, the deceleration and reversion of that tendency seem to happen at the same time in both the average better and outlier countries; something happened around 2002. We will try to find out what was going on in the next section.
R
We already know that we can use `max` with a data frame column in R to get the maximum value. Additionally, we can use `which.max` in order to get its position (similar to the use of `argmax` in Pandas). If we use the transposed data frame, we can use `lapply` or `sapply` to perform this operation on every year column, getting either a list or a vector of indices (we will use `sapply`, which returns a vector). We just need a little tweak: we index a vector of country names to get the country name, instead of the index, as the result.
country_names <- rownames(existing_df_t)
sapply(existing_df_t, function(x) {country_names[which.max(x)]})
## X1990 X1991 X1992
## "Djibouti" "Djibouti" "Djibouti"
## X1993 X1994 X1995
## "Djibouti" "Djibouti" "Djibouti"
## X1996 X1997 X1998
## "Kiribati" "Kiribati" "Cambodia"
## X1999 X2000 X2001
## "Korea, Dem. Rep." "Djibouti" "Swaziland"
## X2002 X2003 X2004
## "Djibouti" "Djibouti" "Djibouti"
## X2005 X2006 X2007
## "Djibouti" "Djibouti" "Djibouti"
World trends in TB cases
Again, in order to explore the world's general tendency, we need to sum up every country's values, per year, for each of the three datasets.
But first we need to load the other two datasets for number of deaths and number of new cases.
# Download files
deaths_file <- getURL("https://docs.google.com/spreadsheets/d/12uWVH_IlmzJX_75bJ3IH5E-Gqx6-zfbDKNvZqYjUuso/pub?gid=0&output=CSV")
new_cases_file <- getURL("https://docs.google.com/spreadsheets/d/1Pl51PcEGlO9Hp4Uh0x2_QM0xVb53p2UDBMPwcnSjFTk/pub?gid=0&output=csv")
# Read into data frames
deaths_df <- read.csv(
text = deaths_file,
row.names=1,
stringsAsFactors=F)
new_df <- read.csv(
text = new_cases_file,
row.names=1,
stringsAsFactors=F)
# Cast data to int (deaths doesn't need it)
new_df[1:18] <- lapply(
new_df[1:18],
function(x) { as.integer(gsub(',', '', x) )})
# Transpose
deaths_df_t <- deaths_df
deaths_df <- as.data.frame(t(deaths_df))
new_df_t <- new_df
new_df <- as.data.frame(t(new_df))
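For comparison, the comma-stripping cast done above with `gsub` has a direct pandas counterpart via the `.str` accessor (illustrative values):

```python
import pandas as pd

# Hypothetical raw column with thousands separators, as in the new-cases file
s = pd.Series(["1,200", "980", "15,000"])

cleaned = s.str.replace(",", "", regex=False).astype(int)
print(cleaned.tolist())  # [1200, 980, 15000]
```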
And now the sums by row. We need to convert the result to a data frame, since the function returns a numeric vector.
deaths_total_per_year_df <- data.frame(total=rowSums(deaths_df))
existing_total_per_year_df <- data.frame(total=rowSums(existing_df))
# We pass na.rm = TRUE in order to ignore missing values in the new
# cases data frame when summing (no missing values in other dataframes though)
new_total_per_year_df <- data.frame(total=rowSums(new_df, na.rm = TRUE))
Now we can plot each line using what we have learnt so far. In order to get a vector with the counts to pass to each plotting function, we use R data frame indexing by column name.
xrange <- 1990:2007
plot(xrange, deaths_total_per_year_df$total,
type='l', xlab="Year",
ylab="Count per 100K",
col = "blue",
ylim=c(0,50000))
lines(xrange, existing_total_per_year_df$total,
col = "darkgreen")
lines(xrange, new_total_per_year_df$total,
col = "red")
legend(x=1992, y=52000,
lty=1,
cex = .7,
ncol = 3,
col=c("blue","darkgreen","red"),
legend=c("Deaths","Existing cases","New cases"))
The conclusions are obviously the same as when using Python.
Countries out of tendency
So which countries are outliers of the trend (for the worse)? Again, in order to find this out, first we need to know the distribution of countries in an average year. We use `colMeans` for that purpose.
deaths_by_country_mean <- data.frame(mean=colMeans(deaths_df))
existing_by_country_mean <- data.frame(mean=colMeans(existing_df))
new_by_country_mean <- data.frame(mean=colMeans(new_df, na.rm=TRUE))
We can plot these distributions to get an idea of how the countries are distributed in an average year. We are not so interested in the individual countries as in the distribution itself.
barplot(sort(deaths_by_country_mean$mean))
Again we can see three regions in the plot: a slowly increasing part at the beginning, a second, steeper section, and a final peak that is clearly apart from the rest.
Let's skip the 1.5 threshold this time and go directly to the 5.0 outliers. In R we will use a different approach: the `quantile()` function, to obtain the median and from it the outlier threshold.
Since we already know the results from our Python section, let's do it just for the new cases, and also generate the plots we did before.
new_super_outlier <-
quantile(new_by_country_mean$mean, probs = c(.5)) * 5.0
super_outlier_countries_by_new_index <-
new_by_country_mean > new_super_outlier
And the proportion is.
sum(super_outlier_countries_by_new_index)/208
## [1] 0.1057692
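The pandas counterpart of that `quantile()` call is `Series.quantile`; a minimal sketch with invented numbers:

```python
import pandas as pd

means = pd.Series([1.0, 2.0, 3.0, 100.0])  # invented per-country means

# quantile(x, probs = c(.5)) in R is the median in pandas
threshold = means.quantile(0.5) * 5.0
print(threshold)  # 12.5
```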
Let's obtain a data frame from this, with just those countries we consider to be outliers.
super_outlier_new_df <-
new_df[, super_outlier_countries_by_new_index ]
Now we are ready to plot them.
xrange <- 1990:2007
plot(xrange, super_outlier_new_df[,1],
type='l', xlab="Year",
ylab="New cases per 100K",
col = 1,
ylim=c(0,1800))
# Note: the range must be 2:ncol(...); wrapping it in seq() would
# restart at 1 and miss the last column
for (i in 2:ncol(super_outlier_new_df)) {
lines(xrange, super_outlier_new_df[,i],
col = i)
}
legend(x=1990, y=1800,
lty=1, cex = 0.5,
ncol = 7,
col=1:22,
legend=colnames(super_outlier_new_df))
Here we can definitely see an advantage of Pandas' basic plotting over R's! So far our results match: we have 22 countries where the number of new cases in an average year is greater than 5 times the median of the distribution. Let's create a country that represents the average of these 22. We will use `rowMeans()` here.
average_countries_df <-
data.frame(
averageOutlierMean=rowMeans(super_outlier_new_df, na.rm=T)
)
average_countries_df
## averageOutlierMean
## X1990 314.3636
## X1991 330.1364
## X1992 340.6818
## X1993 352.9091
## X1994 365.3636
## X1995 379.2273
## X1996 390.8636
## X1997 408.0000
## X1998 427.0000
## X1999 451.4091
## X2000 476.5455
## X2001 502.4091
## X2002 525.7273
## X2003 543.3182
## X2004 548.9091
## X2005 546.4091
## X2006 540.8636
## X2007 535.1818
Now let’s create a country that represents the rest of the world.
# Use logical negation (!) to drop the outlier columns; a numeric minus
# on a logical index would silently mis-index the columns
average_countries_df$averageBetterWorldMean <-
rowMeans(new_df[ , !super_outlier_countries_by_new_index ], na.rm=T)
average_countries_df
## averageOutlierMean averageBetterWorldMean
## X1990 314.3636 105.2767
## X1991 330.1364 107.3786
## X1992 340.6818 108.0243
## X1993 352.9091 110.0388
## X1994 365.3636 111.6942
## X1995 379.2273 113.9369
## X1996 390.8636 115.0971
## X1997 408.0000 118.6408
## X1998 427.0000 121.2913
## X1999 451.4091 124.8350
## X2000 476.5455 127.6505
## X2001 502.4091 130.5680
## X2002 525.7273 136.0194
## X2003 543.3182 136.0388
## X2004 548.9091 136.8155
## X2005 546.4091 135.5121
## X2006 540.8636 134.4493
## X2007 535.1818 133.2184
Now let’s plot the outlier country with the average world country.
xrange <- 1990:2007
plot(xrange, average_countries_df$averageOutlierMean,
type='l', xlab="Year",
ylab="New cases per 100K",
col = "darkgreen",
ylim=c(0,600))
lines(xrange, average_countries_df$averageBetterWorldMean, col = "blue")
legend(x=1990, y=600,
lty=1, cex = 0.7,
ncol = 2,
col=c("darkgreen","blue"),
legend=c("Average outlier country", "Average World Country"))
Googling about events and dates in Tuberculosis
We will use just Python in this section. As for the googling, we actually went straight to Wikipedia's entry on the disease. In the epidemics section we found the following:
- The total number of tuberculosis cases has been decreasing since 2005, while new cases have decreased since 2002. This is confirmed by our previous analysis.
- China has achieved particularly dramatic progress, with about an 80% reduction in its TB mortality rate between 1990 and 2010. Let's check it:
existing_df.China.plot(title="Estimated existing TB cases in China")
- In 2007, the country with the highest estimated incidence rate of TB was Swaziland, with 1,200 cases per 100,000 people.
new_df.apply(pd.Series.argmax, axis=1)['2007']
'Swaziland'
There are many more findings on Wikipedia that we can confirm with these or other datasets from Gapminder World. For example, TB and HIV are frequently associated, together with poverty levels. It would be interesting to join datasets and explore the tendencies in each of them. We challenge the reader to give it a try and share their findings with us.
Other web pages to explore
Some interesting resources about tuberculosis apart from the Gapminder website:
- Gates foundation:
- http://www.gatesfoundation.org/What-We-Do/Global-Health/Tuberculosis
- http://www.gatesfoundation.org/Media-Center/Press-Releases/2007/09/New-Grants-to-Fight-Tuberculosis-Epidemic
Conclusions
Exploratory data analysis is a key step in data analysis. It is during this stage that we start shaping any later work. It precedes any data visualisation or machine learning work, showing us how good or bad our data and our hypotheses are.
Traditionally, R has been the weapon of choice for most EDA work, although using a more expressive plotting library such as ggplot2 is quite convenient. In fact, the basic plotting functionality built into Pandas makes the process cleaner and quicker when using Python. However, the questions we have answered here were very simple and didn't involve multiple variables and encodings. In such cases an advanced library like ggplot2 will shine: apart from producing nicer charts, it will save us quite a lot of time thanks to its expressiveness and reusability.
But as simple as our analysis and charts are, we have been able to make the point about how serious the humanitarian crisis around a disease like tuberculosis is, especially considering that the disease is relatively well controlled in more developed countries. We have seen how some coding skills and a good amount of curiosity allow us to raise awareness of these and other world issues.
Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!