
Data Science with Python & R: Data Frames II

Published Jul 15, 2015. Last updated Feb 09, 2017.

We continue here our tutorial on data frames with Python and R. The first part introduced the concept of a data frame and explained how to create and index one in Python and R. This part concentrates on data selection and function mapping.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Data Selection

In this section we will show how to select data from data frames based on their values, by using logical expressions.

Python

With Pandas, we can use logical expressions to select just the data that satisfies certain conditions. So first, let's see what happens when we use logical operators with data frames or series objects.

existing_df>10
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 True True True True True True True True True True ... True True True True True True True True True True
1991 True True True True True True True True True True ... True True True True True True True True True True
1992 True True True False True True True True True True ... True True True True True True True True True True
1993 True True True True True True True True True True ... True True True True True True True True True True
1994 True True True True True True True True True True ... True True True True True True True True True True
1995 True True True True True True True True True True ... True True True True True True True True True True
1996 True True True False True True True True True True ... True True True True True True True True True True
1997 True True True True True True True True True True ... True True True True True True True True True True
1998 True True True True True True True True True True ... True True True True True True True True True True
1999 True True True False True True True False True True ... True True True True True True True True True True
2000 True True True False True True True False True True ... True True True True True True True True True True
2001 True True True False True True True False True True ... True True True True True True True True True True
2002 True True True False True True True False True True ... True True True True True True True True True True
2003 True True True False True True True False True True ... True True True True True True True True True True
2004 True True True False True True True False True True ... True True True True True True True True True True
2005 True True True True True True True False True True ... True True True True True True True True True True
2006 True True True False True True True False True True ... True True True True True True True True True True
2007 True True True False True True True False True True ... True True True True True True True True True True
18 rows × 207 columns

And this is the result when the comparison is applied to an individual series.

existing_df['United Kingdom'] > 10
    year
    1990    False
    1991    False
    1992    False
    1993    False
    1994    False
    1995    False
    1996    False
    1997    False
    1998    False
    1999    False
    2000    False
    2001    False
    2002    False
    2003    False
    2004    False
    2005     True
    2006     True
    2007     True
    Name: United Kingdom, dtype: bool

The result of these expressions can be used as an indexing vector (with [] or .loc) as follows.

existing_df.Spain[existing_df['United Kingdom'] > 10]
    year
    2005    24
    2006    24
    2007    23
    Name: Spain, dtype: int64
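The selection above can be reproduced on a small scale. The following is a minimal sketch that uses a made-up three-row data frame (toy_df and its values are illustrative assumptions, not the tutorial's TB dataset) to show that the comparison produces a boolean Series, which can then select rows either with [] or with .loc.

```python
import pandas as pd

# Toy data (made-up values, not the tutorial's TB dataset).
toy_df = pd.DataFrame(
    {"United Kingdom": [9, 11, 12], "Spain": [25, 24, 23]},
    index=pd.Index(["2003", "2005", "2007"], name="year"),
)

# A comparison on a column yields a boolean Series aligned on the index.
mask = toy_df["United Kingdom"] > 10

# The mask can select rows of another column with [] ...
spain_selected = toy_df["Spain"][mask]
# ... or rows and columns at the same time with .loc.
both_selected = toy_df.loc[mask, ["United Kingdom", "Spain"]]

print(spain_selected.tolist())  # [24, 23]
print(both_selected.shape)      # (2, 2)
```

Using .loc with a mask and a column list keeps the row filter and the column choice in a single expression.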

An interesting case arises when we index with a whole boolean data frame, where some columns hold False and others True at the same row position. For example:

existing_df[ existing_df > 10 ]
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 436 42 45 42 39 514 38 16 96 52 ... 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 ... 34 105 268 45 361 352 54 261 456 417
1992 422 41 44 NaN 35 513 37 15 86 51 ... 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 ... 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 ... 31 116 242 42 350 172 52 250 556 426
1995 397 43 42 22 30 508 35 12 74 68 ... 30 119 234 42 346 93 50 244 585 439
1996 397 42 43 NaN 28 512 35 12 71 74 ... 28 111 226 41 312 123 49 233 602 453
1997 387 44 44 25 23 363 36 11 67 75 ... 27 122 218 41 273 213 46 207 626 481
1998 374 43 45 12 24 414 36 11 63 74 ... 28 129 211 40 261 107 44 194 634 392
1999 373 42 46 NaN 22 384 36 NaN 58 86 ... 28 134 159 39 253 105 42 175 657 430
2000 346 40 48 NaN 20 530 35 NaN 52 94 ... 27 139 143 39 248 103 40 164 658 479
2001 326 34 49 NaN 20 335 35 NaN 51 99 ... 25 148 128 41 243 13 39 154 680 523
2002 304 32 50 NaN 21 307 35 NaN 42 97 ... 27 144 149 41 235 275 37 149 517 571
2003 308 32 51 NaN 18 281 35 NaN 41 91 ... 25 152 128 39 234 147 36 146 478 632
2004 283 29 52 NaN 19 318 35 NaN 39 85 ... 23 149 118 38 226 63 35 138 468 652
2005 267 29 53 11 18 331 34 NaN 39 79 ... 24 144 131 38 227 57 33 137 453 680
2006 251 26 55 NaN 17 302 34 NaN 37 79 ... 25 134 104 38 222 60 32 135 422 699
2007 238 22 56 NaN 19 294 34 NaN 35 81 ... 23 140 102 39 220 25 31 130 387 714
18 rows × 207 columns

Those cells where existing_df does not have more than 10 cases per 100K give False for indexing, and the resulting data frame has a NaN value in those cells. One way of solving that (if we need to) is the where() method, which, apart from providing a more expressive way of reading data selection, accepts a second argument that we can use to replace the NaN values. For example, if we want 0 as the value:

existing_df.where(existing_df > 10, 0)
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 436 42 45 42 39 514 38 16 96 52 ... 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 ... 34 105 268 45 361 352 54 261 456 417
1992 422 41 44 0 35 513 37 15 86 51 ... 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 ... 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 ... 31 116 242 42 350 172 52 250 556 426
1995 397 43 42 22 30 508 35 12 74 68 ... 30 119 234 42 346 93 50 244 585 439
1996 397 42 43 0 28 512 35 12 71 74 ... 28 111 226 41 312 123 49 233 602 453
1997 387 44 44 25 23 363 36 11 67 75 ... 27 122 218 41 273 213 46 207 626 481
1998 374 43 45 12 24 414 36 11 63 74 ... 28 129 211 40 261 107 44 194 634 392
1999 373 42 46 0 22 384 36 0 58 86 ... 28 134 159 39 253 105 42 175 657 430
2000 346 40 48 0 20 530 35 0 52 94 ... 27 139 143 39 248 103 40 164 658 479
2001 326 34 49 0 20 335 35 0 51 99 ... 25 148 128 41 243 13 39 154 680 523
2002 304 32 50 0 21 307 35 0 42 97 ... 27 144 149 41 235 275 37 149 517 571
2003 308 32 51 0 18 281 35 0 41 91 ... 25 152 128 39 234 147 36 146 478 632
2004 283 29 52 0 19 318 35 0 39 85 ... 23 149 118 38 226 63 35 138 468 652
2005 267 29 53 11 18 331 34 0 39 79 ... 24 144 131 38 227 57 33 137 453 680
2006 251 26 55 0 17 302 34 0 37 79 ... 25 134 104 38 222 60 32 135 422 699
2007 238 22 56 0 19 294 34 0 35 81 ... 23 140 102 39 220 25 31 130 387 714
18 rows × 207 columns
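To make the difference between the two forms easier to see, here is a small sketch on a made-up data frame (toy_df and its numbers are assumptions for illustration only). It compares plain boolean indexing, which leaves NaN in the cells that fail the condition, with where() and a replacement value.

```python
import pandas as pd

# Toy data (made-up values, not the tutorial's TB dataset).
toy_df = pd.DataFrame(
    {"A": [436, 5, 42], "B": [8, 16, 3]},
    index=["1990", "1991", "1992"],
)

# Boolean-frame indexing: cells that fail the condition become NaN.
masked = toy_df[toy_df > 10]

# where(): same selection, but failing cells take the second argument instead.
filled = toy_df.where(toy_df > 10, 0)

print(int(masked.isna().sum().sum()))  # 3
print(filled["B"].tolist())            # [0, 16, 0]
```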

R

As we did with Pandas, let's check the result of using a data.frame in a logical
or boolean expression.

existing_df_gt10 <- existing_df>10
head(existing_df_gt10,2) # check just a couple of rows
##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE
## X1991        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE
##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE
## X1991                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE
##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE
## X1991    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE
##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE
## X1991   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE
##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990                   TRUE              TRUE     TRUE         TRUE
## X1991                   TRUE              TRUE     TRUE         TRUE
##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE
## X1991    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE
##       Central African Republic Chad Chile China Colombia Comoros
## X1990                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE
## X1991                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE
##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990        TRUE        FALSE       TRUE    TRUE TRUE   TRUE
## X1991        TRUE        FALSE       TRUE    TRUE TRUE   TRUE
##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990           TRUE          TRUE             TRUE             TRUE
## X1991           TRUE          TRUE             TRUE             TRUE
##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE
## X1991    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE
##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE
## X1991        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE
##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE
## X1991   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE
##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE
## X1991   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE
##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE   TRUE  TRUE
## X1991    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE  FALSE FALSE
##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE
## X1991   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE
##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE
## X1991 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE
##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE
## X1991       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE
##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE
## X1991      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE
##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE        TRUE
## X1991    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE       FALSE
##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990                 TRUE          TRUE       FALSE      TRUE  TRUE
## X1991                 TRUE          TRUE       FALSE      TRUE  TRUE
##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE
## X1991    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE
##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE
## X1991   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE
##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE
## X1991        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE
##       Rwanda Saint Kitts and Nevis Saint Lucia
## X1990   TRUE                  TRUE        TRUE
## X1991   TRUE                  TRUE        TRUE
##       Saint Vincent and the Grenadines Samoa San Marino
## X1990                             TRUE  TRUE      FALSE
## X1991                             TRUE  TRUE      FALSE
##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990                  TRUE         TRUE    TRUE       TRUE         TRUE
## X1991                  TRUE         TRUE    TRUE       TRUE         TRUE
##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE
## X1991      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE
##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE
## X1991  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE
##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990                 TRUE       TRUE     TRUE           TRUE        TRUE
## X1991                 TRUE       TRUE     TRUE           TRUE        TRUE
##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE
## X1991 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE
##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990                     TRUE   TRUE   TRUE    TRUE                 TRUE
## X1991                     TRUE   TRUE   TRUE    TRUE                 TRUE
##       United Kingdom Tanzania Virgin Islands (U.S.)
## X1990          FALSE     TRUE                  TRUE
## X1991          FALSE     TRUE                  TRUE
##       United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990                    FALSE    TRUE       TRUE    TRUE      TRUE
## X1991                    FALSE    TRUE       TRUE    TRUE      TRUE
##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE
## X1991     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE

In this case we get a matrix with boolean values. Let's see what happens when we
apply the expression to an individual column.

existing_df['United Kingdom'] > 10
##       United Kingdom
## X1990          FALSE
## X1991          FALSE
## X1992          FALSE
## X1993          FALSE
## X1994          FALSE
## X1995          FALSE
## X1996          FALSE
## X1997          FALSE
## X1998          FALSE
## X1999          FALSE
## X2000          FALSE
## X2001          FALSE
## X2002          FALSE
## X2003          FALSE
## X2004          FALSE
## X2005           TRUE
## X2006           TRUE
## X2007           TRUE

The result (and the syntax) is equivalent to that of Pandas, and can be used for
indexing as follows.

existing_df$Spain[existing_df['United Kingdom'] > 10]
## [1] 24 24 23

As we did in Python/Pandas, let's use the whole boolean matrix we got before.

head(existing_df[ existing_df_gt10 ]) # check first few elements
## [1] 436 429 422 415 407 397

But hey, the results are quite different from what we would expect coming from
using Pandas. We got a long vector of values, not a data frame. The problem is
that the [ ] operator, when passed a matrix, first coerces the data frame to a
matrix. Basically we cannot seamlessly work with R data.frames and boolean matrices
as we did with Pandas. We should instead index in both dimensions, columns and rows,
separately.

But still, we can use matrix indexing with a data frame to replace elements.

existing_df_2 <- existing_df
existing_df_2[ existing_df_gt10 ] <- -1
head(existing_df_2,2) # check just a couple of rows
##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990          -1      -1      -1             -1      -1     -1       -1
## X1991          -1      -1      -1             -1      -1     -1       -1
##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990                  -1        -1      -1         7      -1         -1
## X1991                  -1        -1      -1         7      -1         -1
##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990      -1      -1         -1        8      -1      -1     -1    -1
## X1991      -1      -1         -1        8      -1      -1     -1    -1
##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990      10     -1      -1                     -1       -1     -1
## X1991      10     -1      -1                     -1       -1     -1
##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990                     -1                -1       -1           -1
## X1991                     -1                -1       -1           -1
##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990      -1       -1       -1      7         -1             10
## X1991      -1       -1       -1      7         -1             10
##       Central African Republic Chad Chile China Colombia Comoros
## X1990                       -1   -1    -1    -1       -1      -1
## X1991                       -1   -1    -1    -1       -1      -1
##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990          -1            0         -1      -1   -1     -1
## X1991          -1           10         -1      -1   -1     -1
##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990             -1            -1               -1               -1
## X1991             -1            -1               -1               -1
##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990      -1       -1       -1                 -1      -1    -1
## X1991      -1       -1       -1                 -1      -1    -1
##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990          -1                -1      -1      -1       -1   -1      -1
## X1991          -1                -1      -1      -1       -1   -1      -1
##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990     -1               -1    -1     -1      -1      -1    -1     -1
## X1991     -1               -1    -1     -1      -1      -1    -1     -1
##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990       7   -1        -1     -1            -1     -1    -1       -1
## X1991       7   -1        -1     -1            -1     -1    -1       -1
##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990      -1       5    -1        -1   -1   -1      -1     -1    -1
## X1991      -1       4    -1        -1   -1   -1      -1     10    10
##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990      10    -1     -1         -1    -1       -1     -1         -1
## X1991      10    -1     -1         -1    -1       -1     -1         -1
##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990   -1     -1      -1      -1      -1                     -1        -1
## X1991   -1     -1      -1      -1      -1                     -1        -1
##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990         -1         -1     -1       -1       -1   -1    10         -1
## X1991         -1         -1     -1       -1       -1   -1     9         -1
##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990        -1     -1                    -1      3       -1         -1
## X1991        -1     -1                    -1      3       -1         -1
##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990      -1         -1      -1      -1    -1    -1          -1
## X1991      -1         -1      -1      -1    -1    -1          10
##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990                   -1            -1          10        -1    -1
## X1991                   -1            -1          10        -1    -1
##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990      -1   -1                       -1      8   -1       -1    -1
## X1991      -1   -1                       -1      8   -1       -1    -1
##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990     -1               -1       -1   -1          -1     -1       -1
## X1991     -1               -1       -1   -1          -1     -1       -1
##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990          -1    -1          -1      -1      -1                 -1
## X1991          -1    -1          -1      -1      -1                 -1
##       Rwanda Saint Kitts and Nevis Saint Lucia
## X1990     -1                    -1          -1
## X1991     -1                    -1          -1
##       Saint Vincent and the Grenadines Samoa San Marino
## X1990                               -1    -1          9
## X1991                               -1    -1          9
##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990                    -1           -1      -1         -1           -1
## X1991                    -1           -1      -1         -1           -1
##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990        -1       -1       -1              -1      -1           -1
## X1991        -1       -1       -1              -1      -1           -1
##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990    -1        -1    -1       -1        -1      5          -1
## X1991    -1        -1    -1       -1        -1      5          -1
##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990                   -1         -1       -1             -1          -1
## X1991                   -1         -1       -1             -1          -1
##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990   -1      -1    -1                  -1      -1     -1           -1
## X1991   -1      -1    -1                  -1      -1     -1           -1
##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990                       -1     -1     -1      -1                   -1
## X1991                       -1     -1     -1      -1                   -1
##       United Kingdom Tanzania Virgin Islands (U.S.)
## X1990              9       -1                    -1
## X1991              9       -1                    -1
##       United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990                        7      -1         -1      -1        -1
## X1991                        7      -1         -1      -1        -1
##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990       -1               -1                 -1    -1     -1       -1
## X1991       -1               -1                 -1    -1     -1       -1

We can see how many of the elements, those where we had more than 10 cases, were
assigned a -1 value.

The most expressive way of selecting from a data.frame in R is by using the
subset function (type ?subset in your R console to
read about this function). The function is applied by row in the data frame.
The second argument can include any condition using column names. The third argument
can include a list of columns. The resulting data frame will contain those rows
that satisfy the condition in the second argument, including just those columns listed
in the third argument (all of them by default). For example, if we want to select
those years when the United Kingdom had more than 10 cases, and list the resulting
rows for three countries (UK, Spain, and Colombia), we will use:

# If a column name contains blanks, we have to quote it with backticks ` `
subset(existing_df,  `United Kingdom`>10, c('United Kingdom', 'Spain','Colombia'))
##       United Kingdom Spain Colombia
## X2005             11    24       53
## X2006             11    24       44
## X2007             12    23       43

We can do the same thing using [ ] as follows.

existing_df[existing_df["United Kingdom"]>10, c('United Kingdom', 'Spain','Colombia')]
##       United Kingdom Spain Colombia
## X2005             11    24       53
## X2006             11    24       44
## X2007             12    23       43

Function mapping and data grouping

Python

The pandas.DataFrame class defines several ways of applying functions, both index-wise and element-wise. Some of them are predefined and are part of the descriptive statistics methods we will talk about when performing exploratory data analysis.

existing_df.sum()
    country
    Afghanistan            6360
    Albania                 665
    Algeria                 853
    American Samoa          221
    Andorra                 455
    Angola                 7442
    Anguilla                641
    Antigua and Barbuda     195
    Argentina              1102
    Armenia                1349
    Australia               116
    Austria                 228
    Azerbaijan             1541
    Bahamas                 920
    Bahrain                1375
    ...
    United Arab Emirates         577
    United Kingdom               173
    Tanzania                    5713
    Virgin Islands (U.S.)        367
    United States of America      88
    Uruguay                      505
    Uzbekistan                  2320
    Vanuatu                     3348
    Venezuela                    736
    Viet Nam                    5088
    Wallis et Futuna            2272
    West Bank and Gaza           781
    Yemen                       3498
    Zambia                      9635
    Zimbabwe                    9231
    Length: 207, dtype: int64

We have just added up the yearly values (existing TB cases per 100K) from 1990 to 2007 for each country. We can do the same by year, summing across countries, if we pass axis=1 so that the sum runs along the columns instead of along the index.

existing_df.sum(axis=1)
    year
    1990    40772
    1991    40669
    1992    39912
    1993    39573
    1994    39066
    1995    38904
    1996    37032
    1997    37462
    1998    36871
    1999    37358
    2000    36747
    2001    36804
    2002    37160
    2003    36516
    2004    36002
    2005    35435
    2006    34987
    2007    34622
    dtype: int64

It looks like there is a decline in the existing number of TB cases per 100K across the world.
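The effect of the axis argument is easy to check on a tiny example. This sketch uses a made-up data frame (toy_df is an assumption, not the TB data) with years as the index and two columns.

```python
import pandas as pd

# Toy data (made-up values, not the tutorial's TB dataset).
toy_df = pd.DataFrame(
    {"A": [10, 20, 30], "B": [1, 2, 3]},
    index=pd.Index(["1990", "1991", "1992"], name="year"),
)

per_column = toy_df.sum()       # axis=0 (default): one total per column
per_row = toy_df.sum(axis=1)    # axis=1: one total per row (year)

print(per_column.tolist())  # [60, 6]
print(per_row.tolist())     # [11, 22, 33]
```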

Pandas also provides methods to apply other functions to data frames. There are three: apply, applymap, and groupby.

apply and applymap

By using apply() we can apply a function along an axis of a DataFrame. The objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or its columns (axis=1). The return type depends on whether the passed function aggregates (older pandas versions also consulted a reduce argument when the DataFrame was empty). For example, if we want to obtain the number of existing cases per 10K (instead of per 100K), we can divide every value by 10 as follows.

from __future__ import division # only needed on Python 2; in Python 3, / is already float division
existing_df.apply(lambda x: x/10)
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 43.6 4.2 4.5 4.2 3.9 51.4 3.8 1.6 9.6 5.2 ... 3.5 11.4 27.8 4.6 36.5 12.6 5.5 26.5 43.6 40.9
1991 42.9 4.0 4.4 1.4 3.7 51.4 3.8 1.5 9.1 4.9 ... 3.4 10.5 26.8 4.5 36.1 35.2 5.4 26.1 45.6 41.7
1992 42.2 4.1 4.4 0.4 3.5 51.3 3.7 1.5 8.6 5.1 ... 3.3 10.2 25.9 4.4 35.8 6.4 5.4 26.3 49.4 41.5
1993 41.5 4.2 4.3 1.8 3.3 51.2 3.7 1.4 8.2 5.5 ... 3.2 11.8 25.0 4.3 35.4 17.4 5.2 25.3 52.6 41.9
1994 40.7 4.2 4.3 1.7 3.2 51.0 3.6 1.3 7.8 6.0 ... 3.1 11.6 24.2 4.2 35.0 17.2 5.2 25.0 55.6 42.6
1995 39.7 4.3 4.2 2.2 3.0 50.8 3.5 1.2 7.4 6.8 ... 3.0 11.9 23.4 4.2 34.6 9.3 5.0 24.4 58.5 43.9
1996 39.7 4.2 4.3 0.0 2.8 51.2 3.5 1.2 7.1 7.4 ... 2.8 11.1 22.6 4.1 31.2 12.3 4.9 23.3 60.2 45.3
1997 38.7 4.4 4.4 2.5 2.3 36.3 3.6 1.1 6.7 7.5 ... 2.7 12.2 21.8 4.1 27.3 21.3 4.6 20.7 62.6 48.1
1998 37.4 4.3 4.5 1.2 2.4 41.4 3.6 1.1 6.3 7.4 ... 2.8 12.9 21.1 4.0 26.1 10.7 4.4 19.4 63.4 39.2
1999 37.3 4.2 4.6 0.8 2.2 38.4 3.6 0.9 5.8 8.6 ... 2.8 13.4 15.9 3.9 25.3 10.5 4.2 17.5 65.7 43.0
2000 34.6 4.0 4.8 0.8 2.0 53.0 3.5 0.8 5.2 9.4 ... 2.7 13.9 14.3 3.9 24.8 10.3 4.0 16.4 65.8 47.9
2001 32.6 3.4 4.9 0.6 2.0 33.5 3.5 0.9 5.1 9.9 ... 2.5 14.8 12.8 4.1 24.3 1.3 3.9 15.4 68.0 52.3
2002 30.4 3.2 5.0 0.5 2.1 30.7 3.5 0.7 4.2 9.7 ... 2.7 14.4 14.9 4.1 23.5 27.5 3.7 14.9 51.7 57.1
2003 30.8 3.2 5.1 0.6 1.8 28.1 3.5 0.9 4.1 9.1 ... 2.5 15.2 12.8 3.9 23.4 14.7 3.6 14.6 47.8 63.2
2004 28.3 2.9 5.2 0.9 1.9 31.8 3.5 0.8 3.9 8.5 ... 2.3 14.9 11.8 3.8 22.6 6.3 3.5 13.8 46.8 65.2
2005 26.7 2.9 5.3 1.1 1.8 33.1 3.4 0.8 3.9 7.9 ... 2.4 14.4 13.1 3.8 22.7 5.7 3.3 13.7 45.3 68.0
2006 25.1 2.6 5.5 0.9 1.7 30.2 3.4 0.9 3.7 7.9 ... 2.5 13.4 10.4 3.8 22.2 6.0 3.2 13.5 42.2 69.9
2007 23.8 2.2 5.6 0.5 1.9 29.4 3.4 0.9 3.5 8.1 ... 2.3 14.0 10.2 3.9 22.0 2.5 3.1 13.0 38.7 71.4
18 rows × 207 columns

Here apply received each column as a Series. Because division is vectorised, dividing a Series by 10 divides every one of its elements, so each call returns a Series and the overall result is again a data frame. However, the function intended for element-wise maps is applymap.
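The distinction between column-wise and element-wise application can be sketched as follows, on a made-up data frame (toy_df, the "high"/"low" labels, and the helper name elementwise are illustrative assumptions). Note that recent pandas releases (2.1 and later) rename applymap to DataFrame.map, so the sketch picks whichever method is available.

```python
import pandas as pd

# Toy data (made-up values, not the tutorial's TB dataset).
toy_df = pd.DataFrame({"A": [10, 20], "B": [30, 40]}, index=["1990", "1991"])

# apply hands each column to the function as a whole Series.
seen_types = toy_df.apply(lambda col: type(col).__name__)

# An aggregating function collapses each column to a single value.
col_ranges = toy_df.apply(lambda col: col.max() - col.min())

# A vectorised operation returns a Series per column, so we get a DataFrame.
scaled = toy_df.apply(lambda col: col / 10)

# Element-wise mapping: DataFrame.map in pandas >= 2.1, applymap before that.
elementwise = getattr(toy_df, "map", None) or toy_df.applymap
labels = elementwise(lambda v: "high" if v > 25 else "low")

print(seen_types.tolist())   # ['Series', 'Series']
print(col_ranges.tolist())   # [10, 10]
print(labels["B"].tolist())  # ['high', 'high']
```

The labelling function only makes sense for a single value, which is why it goes through the element-wise method rather than apply.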

groupby

Grouping is a powerful and important data frame operation in Exploratory Data Analysis. In Pandas we can do this easily. For example, imagine we want the mean number of existing cases per year in two different periods, before and after the year 2000. We can do the following.

mean_cases_by_period = existing_df.groupby(lambda x: int(x)>1999).mean()
mean_cases_by_period.index = ['1990-1999', '2000-2007']
mean_cases_by_period
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
1990-1999 403.700 42.1 43.90 16.200 30.3 474.40 36.400 12.800 76.6 64.400 ... 30.600 117.00 234.500 42.300 323.300 152.900 49.800 234.500 557.200 428.10
2000-2007 290.375 30.5 51.75 7.375 19.0 337.25 34.625 8.375 42.0 88.125 ... 24.875 143.75 125.375 39.125 231.875 92.875 35.375 144.125 507.875 618.75
2 rows × 207 columns

The groupby method accepts different types of grouping, including a mapping function as we passed, a dictionary, a Series, or a tuple / list of column names. The mapping function for example will be called on each element of the object .index (the year string in our case) to determine the groups. If a dict or Series is passed, the Series or dict values are used to determine the groups (e.g. we can pass a column that contains categorical values).
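As a rough illustration of two of these grouping styles, the sketch below groups the rows of a made-up data frame first with a mapping function and then with a dictionary (toy_df, period_of, and the period labels are assumptions for illustration).

```python
import pandas as pd

# Toy data (made-up values, not the tutorial's TB dataset).
toy_df = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=["1998", "1999", "2000", "2001"],
)

# Function: called once per index label; its return value is the group key.
by_function = toy_df.groupby(lambda year: int(year) > 1999).mean()

# Dict: maps each index label to a group key explicitly.
period_of = {"1998": "1990s", "1999": "1990s", "2000": "2000s", "2001": "2000s"}
by_dict = toy_df.groupby(period_of).mean()

print(by_function.index.tolist())  # [False, True]
print(by_dict["A"].tolist())       # [15.0, 35.0]
```

The dictionary form yields readable group labels directly, so there is no need to overwrite the index afterwards.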

We can index the resulting data frame as usual.

mean_cases_by_period[['United Kingdom', 'Spain', 'Colombia']]
country United Kingdom Spain Colombia
1990-1999 9.200 35.300 75.10
2000-2007 10.125 24.875 53.25

R

lapply

R has a large family of apply functions that can be used to apply functions to
elements within vectors, matrices, lists, and data frames. The one we will introduce here
is lapply (type ?lapply in your R console). It is the one we use with lists and,
since a data frame is a list of column vectors, it works with them as well.

For example, we can repeat the sum over years we did with Pandas, which gives one total per country, as follows.

existing_df_sum_years <- lapply(existing_df, function(x) { sum(x) })
existing_df_sum_years <- as.data.frame(existing_df_sum_years)
existing_df_sum_years
##   Afghanistan Albania Algeria American.Samoa Andorra Angola Anguilla
## 1        6360     665     853            221     455   7442      641
##   Antigua.and.Barbuda Argentina Armenia Australia Austria Azerbaijan
## 1                 195      1102    1349       116     228       1541
##   Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin Bermuda
## 1     920    1375       9278       95    1446     229    864  2384     133
##   Bhutan Bolivia Bosnia.and.Herzegovina Botswana Brazil
## 1  10579    4806                   1817     8067   1585
##   British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi
## 1                    383              1492      960         5583    8097
##   Cambodia Cameroon Canada Cape.Verde Cayman.Islands
## 1    14015     3787     92       6712            129
##   Central.African.Republic Chad Chile China Colombia Comoros Congo..Rep.
## 1                     7557 7316   452  4854     1177    2310        6755
##   Cook.Islands Costa.Rica Croatia Cuba Cyprus Czech.Republic Cote.d.Ivoire
## 1          357        349    1637  295    163            304          7900
##   Korea..Dem..Rep. Congo..Dem..Rep. Denmark Djibouti Dominica
## 1            12359             9343     151    19155      375
##   Dominican.Republic Ecuador Egypt El.Salvador Equatorial.Guinea Eritrea
## 1               2252    3676   700        1483              5303    3181
##   Estonia Ethiopia Fiji Finland France French.Polynesia Gabon Gambia
## 1    1214     8432  811     153    263              974  5949   6700
##   Georgia Germany Ghana Greece Grenada Guam Guatemala Guinea Guinea.Bissau
## 1    1406     180  7368    380     125 1340      1716   5853          6207
##   Guyana Haiti Honduras Hungary Iceland India Indonesia Iran Iraq Ireland
## 1   1621  7428     1756     930      58  8107      6131  789 1433     233
##   Israel Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait
## 1    138   139     142   822    236       2249  5117    12652    928
##   Kyrgyzstan Laos Latvia Lebanon Lesotho Liberia Libyan.Arab.Jamahiriya
## 1       2354 6460   1351     783    6059    7707                    559
##   Lithuania Luxembourg Madagascar Malawi Malaysia Maldives  Mali Malta
## 1      1579        233       6691   6290     2615     1638 10611   120
##   Mauritania Mauritius Mexico Micronesia..Fed..Sts. Monaco Mongolia
## 1      10698       817    978                  3570     44     6127
##   Montserrat Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## 1        227    1873       7992    5061    9990  2860  7398         138
##   Netherlands.Antilles New.Caledonia New.Zealand Nicaragua Niger Nigeria
## 1                  355          1095         176      1708  5360    7968
##   Niue Northern.Mariana.Islands Norway Oman Pakistan Palau Panama
## 1 1494                     3033    103  337     6889  2258   1073
##   Papua.New.Guinea Paraguay Peru Philippines Poland Portugal Puerto.Rico
## 1             8652     1559 4352       11604   1064      677         206
##   Qatar Korea..Rep. Moldova Romania Russian.Federation Rwanda
## 1  1380        2353    2781    2891               2170   7216
##   Saint.Kitts.and.Nevis Saint.Lucia Saint.Vincent.and.the.Grenadines Samoa
## 1                   259         371                              709   568
##   San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Seychelles
## 1        118                  5129         1171    7423       1347
##   Sierra.Leone Singapore Slovakia Slovenia Solomon.Islands Somalia
## 1        11756       751      700      639            6623    8128
##   South.Africa Spain Sri.Lanka Sudan Suriname Swaziland Sweden Switzerland
## 1        10788   552      1695  7062     1975     11460     82         149
##   Syrian.Arab.Republic Tajikistan Thailand Macedonia..FYR Timor.Leste
## 1                  986       3438     4442           1108       10118
##    Togo Tokelau Tonga Trinidad.and.Tobago Tunisia Turkey Turkmenistan
## 1 12111    1283   679                 282     685   1023         1866
##   Turks.and.Caicos.Islands Tuvalu Uganda Ukraine United.Arab.Emirates
## 1                      485   7795   7069    1778                  577
##   United.Kingdom Tanzania Virgin.Islands..U.S.. United.States.of.America
## 1            173     5713                   367                       88
##   Uruguay Uzbekistan Vanuatu Venezuela Viet.Nam Wallis.et.Futuna
## 1     505       2320    3348       736     5088             2272
##   West.Bank.and.Gaza Yemen Zambia Zimbabwe
## 1                781  3498   9635     9231

What did we do there? Very simple. The lapply function takes a list and a function
that will be applied to each element. It returns the result as a list. The function
is defined in-line (i.e. like a lambda in Python). For a given x, it sums its elements.
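For readers coming from the Pandas side, the same idea can be sketched in Python: visit each column, apply a function, and collect the results by column name. This is only a rough analogue on made-up toy data (toy_df and the variable names are illustrative), not a translation of how lapply is implemented.

```python
import pandas as pd

# Toy data (made-up numbers): rows are years, columns are countries.
toy_df = pd.DataFrame(
    {"Spain": [3, 2], "Colombia": [8, 7]},
    index=["1990", "1991"],
)

# Like lapply: visit each column, apply a function, collect results by name.
sums_as_dict = {name: sum(column) for name, column in toy_df.items()}

# The idiomatic Pandas shortcut returns the same totals as a Series.
sums_as_series = toy_df.sum()

print(sums_as_dict)
print(sums_as_series)
```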

If we want one total per year instead, summing over every country, we can use the
transposed data frame we stored before.

existing_df_sum_countries <- lapply(existing_df_t, function(x) { sum(x) })
existing_df_sum_countries <- as.data.frame(existing_df_sum_countries)
existing_df_sum_countries
##   X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001
## 1 40772 40669 39912 39573 39066 38904 37032 37462 36871 37358 36747 36804
##   X2002 X2003 X2004 X2005 X2006 X2007
## 1 37160 36516 36002 35435 34987 34622

aggregate

R provides basic grouping functionality through aggregate. Another option is
to have a look at the powerful dplyr library, which I highly recommend.

But aggregate is quite powerful as well. It accepts a data frame, a list of
grouping elements, and a function to apply to each group. First we need to define
a grouping vector with one label per row (year) of the data frame.

before_2000 <- c('1990-99','1990-99','1990-99','1990-99','1990-99',
                 '1990-99','1990-99','1990-99','1990-99','1990-99',
                 '2000-07','2000-07','2000-07','2000-07','2000-07',
                 '2000-07','2000-07','2000-07')
before_2000
##  [1] "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99"
##  [8] "1990-99" "1990-99" "1990-99" "2000-07" "2000-07" "2000-07" "2000-07"
## [15] "2000-07" "2000-07" "2000-07" "2000-07"

Then we can use that vector as the grouping element and apply the function mean.

mean_cases_by_period <- aggregate(existing_df, list(Period = before_2000), mean)
mean_cases_by_period
##    Period Afghanistan Albania Algeria American Samoa Andorra Angola
## 1 1990-99     403.700    42.1   43.90         16.200    30.3 474.40
## 2 2000-07     290.375    30.5   51.75          7.375    19.0 337.25
##   Anguilla Antigua and Barbuda Argentina Armenia Australia Austria
## 1   36.400              12.800      76.6  64.400       6.8  14.500
## 2   34.625               8.375      42.0  88.125       6.0  10.375
##   Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize
## 1     75.600  52.700  95.600     571.20    6.400  80.500  14.000  54.60
## 2     98.125  49.125  52.375     445.75    3.875  80.125  11.125  39.75
##     Benin Bermuda  Bhutan Bolivia Bosnia and Herzegovina Botswana  Brazil
## 1 131.300   8.400 699.600   308.2                  132.9  356.400 103.400
## 2 133.875   6.125 447.875   215.5                   61.0  562.875  68.875
##   British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso Burundi
## 1                 24.600             90.60   57.700        239.9  332.30
## 2                 17.125             73.25   47.875        398.0  596.75
##   Cambodia Cameroon Canada Cape Verde Cayman Islands
## 1    835.9  201.400  5.900    409.500          8.400
## 2    707.0  221.625  4.125    327.125          5.625
##   Central African Republic    Chad Chile  China Colombia Comoros
## 1                  360.000 330.300  32.0 300.00    75.10 152.500
## 2                  494.625 501.625  16.5 231.75    53.25  98.125
##   Congo, Rep. Cook Islands Costa Rica Croatia  Cuba Cyprus Czech Republic
## 1     322.200       23.400       24.5 110.000 21.70  10.90           20.8
## 2     441.625       15.375       13.0  67.125  9.75   6.75           12.0
##   Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep. Denmark Djibouti
## 1        331.00          794.400           393.30    9.70 1145.000
## 2        573.75          551.875           676.25    6.75  963.125
##   Dominica Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea
## 1   22.000             148.20 236.700  45.6       101.9            206.50
## 2   19.375              96.25 163.625  30.5        58.0            404.75
##   Eritrea Estonia Ethiopia  Fiji Finland France French Polynesia   Gabon
## 1 221.200  77.700  382.900 54.50  10.400  16.90           70.900 330.800
## 2 121.125  54.625  575.375 33.25   6.125  11.75           33.125 330.125
##   Gambia Georgia Germany   Ghana Greece Grenada   Guam Guatemala  Guinea
## 1 352.20    68.2    12.8 450.100 24.300   7.000 100.20   101.500 274.200
## 2 397.25    90.5     6.5 358.375 17.125   6.875  42.25    87.625 388.875
##   Guinea-Bissau  Guyana   Haiti Honduras Hungary Iceland   India Indonesia
## 1        394.10  61.800 438.100  118.900  68.300   3.700 533.200    387.70
## 2        283.25 125.375 380.875   70.875  30.875   2.625 346.875    281.75
##     Iran   Iraq Ireland Israel Italy Jamaica  Japan Jordan Kazakhstan
## 1 52.000 85.800    14.9   8.80 8.800     8.6 53.700 16.300      107.3
## 2 33.625 71.875    10.5   6.25 6.375     7.0 35.625  9.125      147.0
##   Kenya Kiribati Kuwait Kyrgyzstan   Laos Latvia Lebanon Lesotho Liberia
## 1 208.9  874.900  69.40    118.700 393.40 75.400    57.9   271.5   444.7
## 2 378.5  487.875  29.25    145.875 315.75 74.625    25.5   418.0   407.5
##   Libyan Arab Jamahiriya Lithuania Luxembourg Madagascar Malawi Malaysia
## 1                 40.200     94.10      15.10      359.5  355.0   158.90
## 2                 19.625     79.75      10.25      387.0  342.5   128.25
##   Maldives    Mali Malta Mauritania Mauritius Mexico Micronesia, Fed. Sts.
## 1  105.500 595.200  7.80    600.700    50.200  72.40                246.80
## 2   72.875 582.375  5.25    586.375    39.375  31.75                137.75
##   Monaco Mongolia Montserrat Morocco Mozambique Myanmar Namibia   Nauru
## 1    2.8   412.50       13.5 116.600    368.300  352.70 566.900 216.500
## 2    2.0   250.25       11.5  88.375    538.625  191.75 540.125  86.875
##     Nepal Netherlands Netherlands Antilles New Caledonia New Zealand
## 1 523.300        8.80                 22.7          83.1      10.100
## 2 270.625        6.25                 16.0          33.0       9.375
##   Nicaragua  Niger Nigeria  Niue Northern Mariana Islands Norway   Oman
## 1    113.40 308.60 361.500 98.80                  228.200    6.7 23.200
## 2     71.75 284.25 544.125 63.25                   93.875    4.5 13.125
##   Pakistan   Palau Panama Papua New Guinea Paraguay   Peru Philippines
## 1  423.400 164.100 68.800          494.900   89.400 297.40       726.4
## 2  331.875  77.125 48.125          462.875   83.125 172.25       542.5
##   Poland Portugal Puerto Rico Qatar Korea, Rep. Moldova Romania
## 1 77.100    43.90      15.300    78     141.600 140.000   153.1
## 2 36.625    29.75       6.625    75     117.125 172.625   170.0
##   Russian Federation Rwanda Saint Kitts and Nevis Saint Lucia
## 1             107.20 274.20                  15.1       22.50
## 2             137.25 559.25                  13.5       18.25
##   Saint Vincent and the Grenadines Samoa San Marino Sao Tome and Principe
## 1                            42.30 35.00      7.500                 306.1
## 2                            35.75 27.25      5.375                 258.5
##   Saudi Arabia Senegal Seychelles Sierra Leone Singapore Slovakia Slovenia
## 1       67.000 385.000     91.400      531.900     49.70   49.700   47.800
## 2       62.625 446.625     54.125      804.625     31.75   25.375   20.125
##   Solomon Islands Somalia South Africa  Spain Sri Lanka   Sudan Suriname
## 1         469.600 521.100        569.2 35.300      99.1 401.100     95.1
## 2         240.875 364.625        637.0 24.875      88.0 381.375    128.0
##   Swaziland Sweden Switzerland Syrian Arab Republic Tajikistan Thailand
## 1   527.900  4.900       10.30               72.300     134.00    288.6
## 2   772.625  4.125        5.75               32.875     262.25    194.5
##   Macedonia, FYR Timor-Leste   Togo Tokelau Tonga Trinidad and Tobago
## 1         80.100       662.6 650.10   105.9  39.9              16.100
## 2         38.375       436.5 701.25    28.0  35.0              15.125
##   Tunisia Turkey Turkmenistan Turks and Caicos Islands Tuvalu Uganda
## 1  46.400 68.800      105.900                   32.200 511.30 352.70
## 2  27.625 41.875      100.875                   20.375 335.25 442.75
##   Ukraine United Arab Emirates United Kingdom Tanzania
## 1   81.60               37.400          9.200  279.200
## 2  120.25               25.375         10.125  365.125
##   Virgin Islands (U.S.) United States of America Uruguay Uzbekistan
## 1                23.000                      6.0  30.600     117.00
## 2                17.125                      3.5  24.875     143.75
##   Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza   Yemen
## 1 234.500    42.300  323.300          152.900             49.800 234.500
## 2 125.375    39.125  231.875           92.875             35.375 144.125
##    Zambia Zimbabwe
## 1 557.200   428.10
## 2 507.875   618.75

The aggregate function also lets us pass a subset of the data frame as the first
parameter, pass multiple grouping elements, and define our own functions (either
anonymous or predefined). And again, the result is a data frame that we can index
as usual.
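The same pattern (a subset of the data frame, a vector of group labels, and a user-defined function) can be sketched on the Pandas side for comparison. The toy data, the period labels and the spread function below are made up for illustration; they only mirror the shape of the R call above.

```python
import pandas as pd

# Toy data (made-up numbers), indexed by year strings.
toy_df = pd.DataFrame(
    {"X": [10.0, 20.0, 30.0, 50.0], "Y": [1.0, 3.0, 5.0, 7.0]},
    index=["1998", "1999", "2000", "2001"],
)

# One label per row, playing the role of the before_2000 vector in R.
period = pd.Series(
    ["1990-99", "1990-99", "2000-07", "2000-07"],
    index=toy_df.index,
    name="Period",
)

# Subset to one column, group by the labels, and apply our own function
# (here the range, max minus min, of each group).
spread = toy_df[["X"]].groupby(period).agg(lambda values: values.max() - values.min())

print(spread)
```

Passing the labels as a Series aligned on the data frame's index avoids any ambiguity with column names, which a plain list of strings could otherwise be mistaken for.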

mean_cases_by_period[,c('United Kingdom','Spain','Colombia')]
##   United Kingdom  Spain Colombia
## 1          9.200 35.300    75.10
## 2         10.125 24.875    53.25

Conclusions

This two-part tutorial has introduced the concept of the data frame, together with how to use it in the two most popular Data Science ecosystems nowadays, R and Python. We have seen how Pandas is inspired by R, and how Python/Pandas offers constructs very similar to those present in the R language. Python is also a language widely used by software developers of all kinds, and Pandas benefits from that: it offers a more consistent programming interface, and one that is more efficient in many situations. It is also commonly said in the community that, if you come from a software development background, you will feel more comfortable with a language like Python and with how DataFrame is defined as an object-oriented concept. If you come instead from a maths and statistics background, you will appreciate a language like R: very interactive and totally function-based, with libraries made by statisticians for statisticians. It is not a language meant to be used in complex software architectures on its own, but to be used in a powerful dialog with data.

Additionally, we have introduced a few datasets from Gapminder World related to Infectious Tuberculosis, a very serious epidemic disease that is sometimes forgotten in developed countries but is nowadays the second leading cause of death of its kind, just after HIV (and often associated with it). In the next tutorial in the series, we will use these datasets to perform some Exploratory Analysis in both Python and R, to better understand the world situation regarding the disease.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Discover and read more posts from Jose A Dianes