Data Science with Python & R: Data Frames II
We continue here our tutorial on data frames with Python and R. The first part introduced the concept of a data frame and explained how to create and index data frames in Python and R. This part concentrates on data selection and function mapping.
All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!
Data Selection
In this section we will show how to select data from data frames based on their values, by using logical expressions.
Python
With Pandas, we can use logical expressions to select just the data that satisfy certain conditions. So first, let's see what happens when we use logical operators with data frames or series objects.
existing_df>10
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1991 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1992 | True | True | True | False | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1993 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1994 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1995 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1996 | True | True | True | False | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1997 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1998 | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | True | True | True |
1999 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2000 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2001 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2002 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2003 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2004 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2005 | True | True | True | True | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2006 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
2007 | True | True | True | False | True | True | True | False | True | True | ... | True | True | True | True | True | True | True | True | True | True |
18 rows × 207 columns
The same works when applied to an individual series.
existing_df['United Kingdom'] > 10
year
1990 False
1991 False
1992 False
1993 False
1994 False
1995 False
1996 False
1997 False
1998 False
1999 False
2000 False
2001 False
2002 False
2003 False
2004 False
2005 True
2006 True
2007 True
Name: United Kingdom, dtype: bool
The result of these expressions can be used as an indexing vector (with [] or .loc) as follows.
existing_df.Spain[existing_df['United Kingdom'] > 10]
year
2005 24
2006 24
2007 23
Name: Spain, dtype: int64
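To see the mechanics in isolation, here is a minimal sketch on a tiny data frame. The names toy_df and mask and all the values are made up for illustration; they are not part of the TB dataset.

```python
import pandas as pd

# Toy data (made-up values, not the TB dataset used in this tutorial).
toy_df = pd.DataFrame(
    {"A": [5, 12, 20], "B": [30, 8, 15]},
    index=["1990", "1991", "1992"],
)

# Comparing a column with a scalar gives a boolean Series with the same index.
mask = toy_df["A"] > 10

# The mask keeps only the positions where it is True.
selected_b = toy_df["B"][mask]         # index a Series with []
selected_rows = toy_df.loc[mask]       # select whole rows with .loc
selected_cell = toy_df.loc[mask, "B"]  # select rows and one column at once

print(selected_b)
```

Both toy_df["B"][mask] and toy_df.loc[mask, "B"] select the same values; the .loc form has the advantage of doing the row and column selection in a single step.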
An interesting case arises when we index a whole data frame with a boolean data frame, since at the same row position some series may have False and others True. For example:
existing_df[ existing_df > 10 ]
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | 436 | 42 | 45 | 42 | 39 | 514 | 38 | 16 | 96 | 52 | ... | 35 | 114 | 278 | 46 | 365 | 126 | 55 | 265 | 436 | 409 |
1991 | 429 | 40 | 44 | 14 | 37 | 514 | 38 | 15 | 91 | 49 | ... | 34 | 105 | 268 | 45 | 361 | 352 | 54 | 261 | 456 | 417 |
1992 | 422 | 41 | 44 | NaN | 35 | 513 | 37 | 15 | 86 | 51 | ... | 33 | 102 | 259 | 44 | 358 | 64 | 54 | 263 | 494 | 415 |
1993 | 415 | 42 | 43 | 18 | 33 | 512 | 37 | 14 | 82 | 55 | ... | 32 | 118 | 250 | 43 | 354 | 174 | 52 | 253 | 526 | 419 |
1994 | 407 | 42 | 43 | 17 | 32 | 510 | 36 | 13 | 78 | 60 | ... | 31 | 116 | 242 | 42 | 350 | 172 | 52 | 250 | 556 | 426 |
1995 | 397 | 43 | 42 | 22 | 30 | 508 | 35 | 12 | 74 | 68 | ... | 30 | 119 | 234 | 42 | 346 | 93 | 50 | 244 | 585 | 439 |
1996 | 397 | 42 | 43 | NaN | 28 | 512 | 35 | 12 | 71 | 74 | ... | 28 | 111 | 226 | 41 | 312 | 123 | 49 | 233 | 602 | 453 |
1997 | 387 | 44 | 44 | 25 | 23 | 363 | 36 | 11 | 67 | 75 | ... | 27 | 122 | 218 | 41 | 273 | 213 | 46 | 207 | 626 | 481 |
1998 | 374 | 43 | 45 | 12 | 24 | 414 | 36 | 11 | 63 | 74 | ... | 28 | 129 | 211 | 40 | 261 | 107 | 44 | 194 | 634 | 392 |
1999 | 373 | 42 | 46 | NaN | 22 | 384 | 36 | NaN | 58 | 86 | ... | 28 | 134 | 159 | 39 | 253 | 105 | 42 | 175 | 657 | 430 |
2000 | 346 | 40 | 48 | NaN | 20 | 530 | 35 | NaN | 52 | 94 | ... | 27 | 139 | 143 | 39 | 248 | 103 | 40 | 164 | 658 | 479 |
2001 | 326 | 34 | 49 | NaN | 20 | 335 | 35 | NaN | 51 | 99 | ... | 25 | 148 | 128 | 41 | 243 | 13 | 39 | 154 | 680 | 523 |
2002 | 304 | 32 | 50 | NaN | 21 | 307 | 35 | NaN | 42 | 97 | ... | 27 | 144 | 149 | 41 | 235 | 275 | 37 | 149 | 517 | 571 |
2003 | 308 | 32 | 51 | NaN | 18 | 281 | 35 | NaN | 41 | 91 | ... | 25 | 152 | 128 | 39 | 234 | 147 | 36 | 146 | 478 | 632 |
2004 | 283 | 29 | 52 | NaN | 19 | 318 | 35 | NaN | 39 | 85 | ... | 23 | 149 | 118 | 38 | 226 | 63 | 35 | 138 | 468 | 652 |
2005 | 267 | 29 | 53 | 11 | 18 | 331 | 34 | NaN | 39 | 79 | ... | 24 | 144 | 131 | 38 | 227 | 57 | 33 | 137 | 453 | 680 |
2006 | 251 | 26 | 55 | NaN | 17 | 302 | 34 | NaN | 37 | 79 | ... | 25 | 134 | 104 | 38 | 222 | 60 | 32 | 135 | 422 | 699 |
2007 | 238 | 22 | 56 | NaN | 19 | 294 | 34 | NaN | 35 | 81 | ... | 23 | 140 | 102 | 39 | 220 | 25 | 31 | 130 | 387 | 714 |
18 rows × 207 columns
Cells where existing_df does not have more than 10 cases per 100K give False for indexing, and the resulting data frame has a NaN value in those cells. A way of solving that (if we need to) is to use the where() method which, apart from providing a more expressive way of reading data selection, accepts a second argument that we can use to impute the NaN values. For example, if we want to have 0 as a value:
existing_df.where(existing_df > 10, 0)
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | 436 | 42 | 45 | 42 | 39 | 514 | 38 | 16 | 96 | 52 | ... | 35 | 114 | 278 | 46 | 365 | 126 | 55 | 265 | 436 | 409 |
1991 | 429 | 40 | 44 | 14 | 37 | 514 | 38 | 15 | 91 | 49 | ... | 34 | 105 | 268 | 45 | 361 | 352 | 54 | 261 | 456 | 417 |
1992 | 422 | 41 | 44 | 0 | 35 | 513 | 37 | 15 | 86 | 51 | ... | 33 | 102 | 259 | 44 | 358 | 64 | 54 | 263 | 494 | 415 |
1993 | 415 | 42 | 43 | 18 | 33 | 512 | 37 | 14 | 82 | 55 | ... | 32 | 118 | 250 | 43 | 354 | 174 | 52 | 253 | 526 | 419 |
1994 | 407 | 42 | 43 | 17 | 32 | 510 | 36 | 13 | 78 | 60 | ... | 31 | 116 | 242 | 42 | 350 | 172 | 52 | 250 | 556 | 426 |
1995 | 397 | 43 | 42 | 22 | 30 | 508 | 35 | 12 | 74 | 68 | ... | 30 | 119 | 234 | 42 | 346 | 93 | 50 | 244 | 585 | 439 |
1996 | 397 | 42 | 43 | 0 | 28 | 512 | 35 | 12 | 71 | 74 | ... | 28 | 111 | 226 | 41 | 312 | 123 | 49 | 233 | 602 | 453 |
1997 | 387 | 44 | 44 | 25 | 23 | 363 | 36 | 11 | 67 | 75 | ... | 27 | 122 | 218 | 41 | 273 | 213 | 46 | 207 | 626 | 481 |
1998 | 374 | 43 | 45 | 12 | 24 | 414 | 36 | 11 | 63 | 74 | ... | 28 | 129 | 211 | 40 | 261 | 107 | 44 | 194 | 634 | 392 |
1999 | 373 | 42 | 46 | 0 | 22 | 384 | 36 | 0 | 58 | 86 | ... | 28 | 134 | 159 | 39 | 253 | 105 | 42 | 175 | 657 | 430 |
2000 | 346 | 40 | 48 | 0 | 20 | 530 | 35 | 0 | 52 | 94 | ... | 27 | 139 | 143 | 39 | 248 | 103 | 40 | 164 | 658 | 479 |
2001 | 326 | 34 | 49 | 0 | 20 | 335 | 35 | 0 | 51 | 99 | ... | 25 | 148 | 128 | 41 | 243 | 13 | 39 | 154 | 680 | 523 |
2002 | 304 | 32 | 50 | 0 | 21 | 307 | 35 | 0 | 42 | 97 | ... | 27 | 144 | 149 | 41 | 235 | 275 | 37 | 149 | 517 | 571 |
2003 | 308 | 32 | 51 | 0 | 18 | 281 | 35 | 0 | 41 | 91 | ... | 25 | 152 | 128 | 39 | 234 | 147 | 36 | 146 | 478 | 632 |
2004 | 283 | 29 | 52 | 0 | 19 | 318 | 35 | 0 | 39 | 85 | ... | 23 | 149 | 118 | 38 | 226 | 63 | 35 | 138 | 468 | 652 |
2005 | 267 | 29 | 53 | 11 | 18 | 331 | 34 | 0 | 39 | 79 | ... | 24 | 144 | 131 | 38 | 227 | 57 | 33 | 137 | 453 | 680 |
2006 | 251 | 26 | 55 | 0 | 17 | 302 | 34 | 0 | 37 | 79 | ... | 25 | 134 | 104 | 38 | 222 | 60 | 32 | 135 | 422 | 699 |
2007 | 238 | 22 | 56 | 0 | 19 | 294 | 34 | 0 | 35 | 81 | ... | 23 | 140 | 102 | 39 | 220 | 25 | 31 | 130 | 387 | 714 |
18 rows × 207 columns
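The difference between plain boolean indexing and where() is easier to see on a small example. The sketch below uses a made-up data frame (toy_df is a hypothetical name, and its values are not from the TB dataset).

```python
import pandas as pd

# Toy data (made-up values).
toy_df = pd.DataFrame(
    {"A": [5, 12, 20], "B": [30, 8, 15]},
    index=["1990", "1991", "1992"],
)
mask = toy_df > 10  # boolean data frame, same shape as toy_df

# Indexing with the boolean data frame keeps the shape and
# puts NaN in every cell where the mask is False.
masked = toy_df[mask]

# where() does the same selection, but its second argument
# replaces the failing cells with a value of our choice.
filled = toy_df.where(mask, 0)

# Without a second argument, where() gives the same result as [].
same = toy_df.where(mask).equals(masked)

print(masked)
print(filled)
```

Note that the NaN values force the masked columns to a floating point type, which is one more reason to supply a replacement value when the data are counts.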
R
As we did with Pandas, let's check the result of using a data.frame
in a logical
or boolean expression.
existing_df_gt10 <- existing_df>10
head(existing_df_gt10,2) # check just a couple of rows
## Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990 TRUE TRUE TRUE FALSE TRUE TRUE
## X1991 TRUE TRUE TRUE FALSE TRUE TRUE
## Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990 FALSE TRUE TRUE TRUE TRUE TRUE
## X1991 FALSE TRUE TRUE TRUE TRUE TRUE
## British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990 TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE
## Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990 TRUE TRUE TRUE FALSE TRUE FALSE
## X1991 TRUE TRUE TRUE FALSE TRUE FALSE
## Central African Republic Chad Chile China Colombia Comoros
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE
## Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990 TRUE FALSE TRUE TRUE TRUE TRUE
## X1991 TRUE FALSE TRUE TRUE TRUE TRUE
## Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990 TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE
## Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE
## El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990 TRUE TRUE TRUE FALSE TRUE TRUE
## X1991 TRUE TRUE TRUE FALSE TRUE TRUE
## Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990 TRUE TRUE FALSE TRUE TRUE
## X1991 TRUE TRUE FALSE TRUE TRUE
## Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE
## Rwanda Saint Kitts and Nevis Saint Lucia
## X1990 TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE
## Saint Vincent and the Grenadines Samoa San Marino
## X1990 TRUE TRUE FALSE
## X1991 TRUE TRUE FALSE
## Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990 TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE
## Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE
## Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990 TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE
## Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990 TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE
## United Kingdom Tanzania Virgin Islands (U.S.)
## X1990 FALSE TRUE TRUE
## X1991 FALSE TRUE TRUE
## United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990 FALSE TRUE TRUE TRUE TRUE
## X1991 FALSE TRUE TRUE TRUE TRUE
## Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990 TRUE TRUE TRUE TRUE TRUE TRUE
## X1991 TRUE TRUE TRUE TRUE TRUE TRUE
In this case we get a matrix variable with boolean values. The same comparison can be applied to individual columns.
existing_df['United Kingdom'] > 10
## United Kingdom
## X1990 FALSE
## X1991 FALSE
## X1992 FALSE
## X1993 FALSE
## X1994 FALSE
## X1995 FALSE
## X1996 FALSE
## X1997 FALSE
## X1998 FALSE
## X1999 FALSE
## X2000 FALSE
## X2001 FALSE
## X2002 FALSE
## X2003 FALSE
## X2004 FALSE
## X2005 TRUE
## X2006 TRUE
## X2007 TRUE
The result (and the syntax) is equivalent to that of Pandas, and can be used for
indexing as follows.
existing_df$Spain[existing_df['United Kingdom'] > 10]
## [1] 24 24 23
As we did in Python/Pandas, let's use the whole boolean matrix we got before.
head(existing_df[ existing_df_gt10 ]) # check first few elements
## [1] 436 429 422 415 407 397
But hey, the results are quite different from what we would expect coming from
using Pandas. We got a long vector of values, not a data frame. The problem is
that the [ ]
operator, when passed a matrix, first coerces the data frame to a
matrix. Basically we cannot seamlessly work with R data.frames and boolean matrices
as we did with Pandas. We should instead index in both dimensions, columns and rows,
separately.
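For readers coming from Python, a similar flattening happens if we leave Pandas and index the underlying NumPy array with a boolean array. The following sketch uses a made-up data frame (toy_df and its values are assumptions for illustration). Keep in mind that NumPy walks the values row by row, while the R output above goes down each column, so transposing first reproduces the R ordering.

```python
import pandas as pd

# Toy data (made-up values).
toy_df = pd.DataFrame(
    {"A": [5, 12, 20], "B": [30, 8, 15]},
    index=["1990", "1991", "1992"],
)
mask = toy_df > 10

values = toy_df.to_numpy()
bool_matrix = mask.to_numpy()

# Boolean-matrix indexing on a NumPy array returns a flat vector,
# reading the selected cells row by row.
flat_by_row = values[bool_matrix]

# Transposing both arrays first reads the cells column by column instead.
flat_by_column = values.T[bool_matrix.T]

print(flat_by_row)
print(flat_by_column)
```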
But still, we can use matrix indexing with a data frame to replace elements.
existing_df_2 <- existing_df
existing_df_2[ existing_df_gt10 ] <- -1
head(existing_df_2,2) # check just a couple of rows
## Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1
## Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990 -1 -1 -1 7 -1 -1
## X1991 -1 -1 -1 7 -1 -1
## Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990 -1 -1 -1 8 -1 -1 -1 -1
## X1991 -1 -1 -1 8 -1 -1 -1 -1
## Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990 10 -1 -1 -1 -1 -1
## X1991 10 -1 -1 -1 -1 -1
## British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990 -1 -1 -1 -1
## X1991 -1 -1 -1 -1
## Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990 -1 -1 -1 7 -1 10
## X1991 -1 -1 -1 7 -1 10
## Central African Republic Chad Chile China Colombia Comoros
## X1990 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1
## Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990 -1 0 -1 -1 -1 -1
## X1991 -1 10 -1 -1 -1 -1
## Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990 -1 -1 -1 -1
## X1991 -1 -1 -1 -1
## Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1
## El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1
## France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990 -1 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1 -1
## Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990 7 -1 -1 -1 -1 -1 -1 -1
## X1991 7 -1 -1 -1 -1 -1 -1 -1
## Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990 -1 5 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 4 -1 -1 -1 -1 -1 10 10
## Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990 10 -1 -1 -1 -1 -1 -1 -1
## X1991 10 -1 -1 -1 -1 -1 -1 -1
## Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1
## Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990 -1 -1 -1 -1 -1 -1 10 -1
## X1991 -1 -1 -1 -1 -1 -1 9 -1
## Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990 -1 -1 -1 3 -1 -1
## X1991 -1 -1 -1 3 -1 -1
## Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 10
## Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990 -1 -1 10 -1 -1
## X1991 -1 -1 10 -1 -1
## Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990 -1 -1 -1 8 -1 -1 -1
## X1991 -1 -1 -1 8 -1 -1 -1
## Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1
## Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1
## Rwanda Saint Kitts and Nevis Saint Lucia
## X1990 -1 -1 -1
## X1991 -1 -1 -1
## Saint Vincent and the Grenadines Samoa San Marino
## X1990 -1 -1 9
## X1991 -1 -1 9
## Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1
## Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1
## Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990 -1 -1 -1 -1 -1 5 -1
## X1991 -1 -1 -1 -1 -1 5 -1
## Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1
## Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990 -1 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1 -1
## Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1
## United Kingdom Tanzania Virgin Islands (U.S.)
## X1990 9 -1 -1
## X1991 9 -1 -1
## United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990 7 -1 -1 -1 -1
## X1991 7 -1 -1 -1 -1
## Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990 -1 -1 -1 -1 -1 -1
## X1991 -1 -1 -1 -1 -1 -1
We can see how many of the elements, those where we had more than 10 cases, were
assigned a -1 value.
The most expressive way of selecting from a data.frame in R is by using the
subset function (type ?subset in your R console to read about this function).
The function is applied by row in the data frame. The second argument can include
any condition using column names. The third argument can include a list of columns.
The resulting data frame will contain those rows that satisfy the condition in the
second argument, including just those columns listed in the third argument (all of
them by default). For example, if we want to select those years when the United
Kingdom had more than 10 cases, and list the resulting rows for three countries
(UK, Spain, and Colombia) we will use:
# If a column name contains blanks, we have to quote it with backticks ` `
subset(existing_df, `United Kingdom`>10, c('United Kingdom', 'Spain','Colombia'))
## United Kingdom Spain Colombia
## X2005 11 24 53
## X2006 11 24 44
## X2007 12 23 43
We can do the same thing using [ ]
as follows.
existing_df[existing_df["United Kingdom"]>10, c('United Kingdom', 'Spain','Colombia')]
## United Kingdom Spain Colombia
## X2005 11 24 53
## X2006 11 24 44
## X2007 12 23 43
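For comparison, the same row-and-column selection can be written in Pandas in one step. This is only a sketch on a made-up data frame (toy_df and its values are assumptions, not the real TB figures); it shows .loc with a boolean condition plus a list of columns, and query(), which reads much like subset and also needs backticks for column names containing blanks.

```python
import pandas as pd

# Toy data (made-up values, only the column names mirror the tutorial).
toy_df = pd.DataFrame(
    {
        "United Kingdom": [9, 11, 12],
        "Spain": [40, 30, 20],
        "Colombia": [80, 60, 50],
    },
    index=["1990", "2005", "2007"],
)
wanted = ["United Kingdom", "Spain", "Colombia"]

# Rows by condition and columns by name, in a single .loc call.
by_loc = toy_df.loc[toy_df["United Kingdom"] > 10, wanted]

# The same selection with query(); backticks quote names with blanks.
by_query = toy_df.query("`United Kingdom` > 10")[wanted]

print(by_loc)
```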
Function mapping and data grouping
Python
The pandas.DataFrame class defines several ways of applying functions, both index-wise and element-wise. Some of them are predefined and are part of the descriptive statistics methods we will talk about when performing exploratory data analysis.
existing_df.sum()
country
Afghanistan 6360
Albania 665
Algeria 853
American Samoa 221
Andorra 455
Angola 7442
Anguilla 641
Antigua and Barbuda 195
Argentina 1102
Armenia 1349
Australia 116
Austria 228
Azerbaijan 1541
Bahamas 920
Bahrain 1375
...
United Arab Emirates 577
United Kingdom 173
Tanzania 5713
Virgin Islands (U.S.) 367
United States of America 88
Uruguay 505
Uzbekistan 2320
Vanuatu 3348
Venezuela 736
Viet Nam 5088
Wallis et Futuna 2272
West Bank and Gaza 781
Yemen 3498
Zambia 9635
Zimbabwe 9231
Length: 207, dtype: int64
We have just calculated, for each country, the sum of its existing TB cases per 100K over the years 1990 to 2007. We can do the same by year if we pass axis=1 to use columns instead of index as axis.
existing_df.sum(axis=1)
year
1990 40772
1991 40669
1992 39912
1993 39573
1994 39066
1995 38904
1996 37032
1997 37462
1998 36871
1999 37358
2000 36747
2001 36804
2002 37160
2003 36516
2004 36002
2005 35435
2006 34987
2007 34622
dtype: int64
It looks like there is a decline in the existing number of TB cases per 100K across the world.
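The meaning of the axis argument is easy to mix up, so here is a minimal sketch on a made-up data frame (toy_df and its values are assumptions for illustration).

```python
import pandas as pd

# Toy data (made-up values): rows are years, columns are countries.
toy_df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [10, 20, 30]},
    index=["1990", "1991", "1992"],
)

# Default axis=0: collapse the index, giving one total per column.
per_column = toy_df.sum()

# axis=1: collapse the columns, giving one total per row.
per_row = toy_df.sum(axis=1)

print(per_column)
print(per_row)
```

The result of sum() is indexed by the column labels, while the result of sum(axis=1) is indexed by the row labels, which matches the two outputs shown above.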
Pandas also provides methods to apply other functions to data frames. There are three of them: apply, applymap, and groupby.
apply and applymap
By using apply() we can apply a function along an input axis of a DataFrame. Objects passed to the functions we apply are Series objects having as index either the DataFrame's index (axis=0) or the columns (axis=1). The return type depends on whether the passed function aggregates (or, in older pandas versions, on the reduce argument if the DataFrame is empty). For example, if we want to obtain the number of existing cases per 10K (instead of per 100K), we can divide every value by 10 as follows.
from __future__ import division # only needed in Python 2, to get float division without using a cast
existing_df.apply(lambda x: x/10)
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
year | |||||||||||||||||||||
1990 | 43.6 | 4.2 | 4.5 | 4.2 | 3.9 | 51.4 | 3.8 | 1.6 | 9.6 | 5.2 | ... | 3.5 | 11.4 | 27.8 | 4.6 | 36.5 | 12.6 | 5.5 | 26.5 | 43.6 | 40.9 |
1991 | 42.9 | 4.0 | 4.4 | 1.4 | 3.7 | 51.4 | 3.8 | 1.5 | 9.1 | 4.9 | ... | 3.4 | 10.5 | 26.8 | 4.5 | 36.1 | 35.2 | 5.4 | 26.1 | 45.6 | 41.7 |
1992 | 42.2 | 4.1 | 4.4 | 0.4 | 3.5 | 51.3 | 3.7 | 1.5 | 8.6 | 5.1 | ... | 3.3 | 10.2 | 25.9 | 4.4 | 35.8 | 6.4 | 5.4 | 26.3 | 49.4 | 41.5 |
1993 | 41.5 | 4.2 | 4.3 | 1.8 | 3.3 | 51.2 | 3.7 | 1.4 | 8.2 | 5.5 | ... | 3.2 | 11.8 | 25.0 | 4.3 | 35.4 | 17.4 | 5.2 | 25.3 | 52.6 | 41.9 |
1994 | 40.7 | 4.2 | 4.3 | 1.7 | 3.2 | 51.0 | 3.6 | 1.3 | 7.8 | 6.0 | ... | 3.1 | 11.6 | 24.2 | 4.2 | 35.0 | 17.2 | 5.2 | 25.0 | 55.6 | 42.6 |
1995 | 39.7 | 4.3 | 4.2 | 2.2 | 3.0 | 50.8 | 3.5 | 1.2 | 7.4 | 6.8 | ... | 3.0 | 11.9 | 23.4 | 4.2 | 34.6 | 9.3 | 5.0 | 24.4 | 58.5 | 43.9 |
1996 | 39.7 | 4.2 | 4.3 | 0.0 | 2.8 | 51.2 | 3.5 | 1.2 | 7.1 | 7.4 | ... | 2.8 | 11.1 | 22.6 | 4.1 | 31.2 | 12.3 | 4.9 | 23.3 | 60.2 | 45.3 |
1997 | 38.7 | 4.4 | 4.4 | 2.5 | 2.3 | 36.3 | 3.6 | 1.1 | 6.7 | 7.5 | ... | 2.7 | 12.2 | 21.8 | 4.1 | 27.3 | 21.3 | 4.6 | 20.7 | 62.6 | 48.1 |
1998 | 37.4 | 4.3 | 4.5 | 1.2 | 2.4 | 41.4 | 3.6 | 1.1 | 6.3 | 7.4 | ... | 2.8 | 12.9 | 21.1 | 4.0 | 26.1 | 10.7 | 4.4 | 19.4 | 63.4 | 39.2 |
1999 | 37.3 | 4.2 | 4.6 | 0.8 | 2.2 | 38.4 | 3.6 | 0.9 | 5.8 | 8.6 | ... | 2.8 | 13.4 | 15.9 | 3.9 | 25.3 | 10.5 | 4.2 | 17.5 | 65.7 | 43.0 |
2000 | 34.6 | 4.0 | 4.8 | 0.8 | 2.0 | 53.0 | 3.5 | 0.8 | 5.2 | 9.4 | ... | 2.7 | 13.9 | 14.3 | 3.9 | 24.8 | 10.3 | 4.0 | 16.4 | 65.8 | 47.9 |
2001 | 32.6 | 3.4 | 4.9 | 0.6 | 2.0 | 33.5 | 3.5 | 0.9 | 5.1 | 9.9 | ... | 2.5 | 14.8 | 12.8 | 4.1 | 24.3 | 1.3 | 3.9 | 15.4 | 68.0 | 52.3 |
2002 | 30.4 | 3.2 | 5.0 | 0.5 | 2.1 | 30.7 | 3.5 | 0.7 | 4.2 | 9.7 | ... | 2.7 | 14.4 | 14.9 | 4.1 | 23.5 | 27.5 | 3.7 | 14.9 | 51.7 | 57.1 |
2003 | 30.8 | 3.2 | 5.1 | 0.6 | 1.8 | 28.1 | 3.5 | 0.9 | 4.1 | 9.1 | ... | 2.5 | 15.2 | 12.8 | 3.9 | 23.4 | 14.7 | 3.6 | 14.6 | 47.8 | 63.2 |
2004 | 28.3 | 2.9 | 5.2 | 0.9 | 1.9 | 31.8 | 3.5 | 0.8 | 3.9 | 8.5 | ... | 2.3 | 14.9 | 11.8 | 3.8 | 22.6 | 6.3 | 3.5 | 13.8 | 46.8 | 65.2 |
2005 | 26.7 | 2.9 | 5.3 | 1.1 | 1.8 | 33.1 | 3.4 | 0.8 | 3.9 | 7.9 | ... | 2.4 | 14.4 | 13.1 | 3.8 | 22.7 | 5.7 | 3.3 | 13.7 | 45.3 | 68.0 |
2006 | 25.1 | 2.6 | 5.5 | 0.9 | 1.7 | 30.2 | 3.4 | 0.9 | 3.7 | 7.9 | ... | 2.5 | 13.4 | 10.4 | 3.8 | 22.2 | 6.0 | 3.2 | 13.5 | 42.2 | 69.9 |
2007 | 23.8 | 2.2 | 5.6 | 0.5 | 1.9 | 29.4 | 3.4 | 0.9 | 3.5 | 8.1 | ... | 2.3 | 14.0 | 10.2 | 3.9 | 22.0 | 2.5 | 3.1 | 13.0 | 38.7 | 71.4 |
18 rows × 207 columns
We have seen how apply works element-wise. If the function we pass is applicable to single elements (e.g. division), pandas will broadcast it to every element of each Series, returning a Series with the function applied to each element and hence, in our case, a data frame as a result. However, the method intended for element-wise maps is applymap.
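The sketch below, on a made-up data frame (toy_df and its values are assumptions for illustration), shows how the return type of apply depends on the function we pass: an aggregating function gives a Series, while a function that returns a Series of the same length gives back a data frame.

```python
import pandas as pd

# Toy data (made-up values).
toy_df = pd.DataFrame(
    {"A": [10, 20, 30], "B": [5, 15, 45]},
    index=["1990", "1991", "1992"],
)

# An aggregating function returns one value per column (axis=0, the default)...
col_range = toy_df.apply(lambda col: col.max() - col.min())

# ...or one value per row when we pass axis=1.
row_range = toy_df.apply(lambda row: row.max() - row.min(), axis=1)

# A function that keeps the length of each Series returns a data frame.
scaled_apply = toy_df.apply(lambda col: col / 10)

# For simple arithmetic, operating on the data frame directly is equivalent.
scaled_direct = toy_df / 10

print(col_range)
print(row_range)
print(scaled_apply)
```

For plain arithmetic like this division, writing toy_df / 10 is usually the simpler choice; apply becomes useful when the function needs to see a whole row or column at once.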
groupby
Grouping is a powerful and important data frame operation in exploratory data analysis. In Pandas we can do this easily. For example, imagine we want the mean number of existing cases per year in two different periods, before 2000 and from 2000 onwards. We can do the following.
mean_cases_by_period = existing_df.groupby(lambda x: int(x)>1999).mean()
mean_cases_by_period.index = ['1990-1999', '2000-2007']
mean_cases_by_period
country | Afghanistan | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antigua and Barbuda | Argentina | Armenia | ... | Uruguay | Uzbekistan | Vanuatu | Venezuela | Viet Nam | Wallis et Futuna | West Bank and Gaza | Yemen | Zambia | Zimbabwe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1990-1999 | 403.700 | 42.1 | 43.90 | 16.200 | 30.3 | 474.40 | 36.400 | 12.800 | 76.6 | 64.400 | ... | 30.600 | 117.00 | 234.500 | 42.300 | 323.300 | 152.900 | 49.800 | 234.500 | 557.200 | 428.10 |
2000-2007 | 290.375 | 30.5 | 51.75 | 7.375 | 19.0 | 337.25 | 34.625 | 8.375 | 42.0 | 88.125 | ... | 24.875 | 143.75 | 125.375 | 39.125 | 231.875 | 92.875 | 35.375 | 144.125 | 507.875 | 618.75 |
2 rows × 207 columns
The groupby method accepts different types of grouping, including a mapping function as we passed, a dictionary, a Series, or a tuple / list of column names. The mapping function, for example, will be called on each element of the object's .index (the year string in our case) to determine the groups. If a dict or Series is passed, its values are used to determine the groups (e.g. we can pass a column that contains categorical values).
We can index the resulting data frame as usual.
mean_cases_by_period[['United Kingdom', 'Spain', 'Colombia']]
country | United Kingdom | Spain | Colombia |
---|---|---|---|
1990-1999 | 9.200 | 35.300 | 75.10 |
2000-2007 | 10.125 | 24.875 | 53.25 |
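To compare two of these grouping styles side by side, here is a minimal sketch on a made-up data frame. The names toy_df and period_of and all the values are assumptions for illustration.

```python
import pandas as pd

# Toy data (made-up values), indexed by year strings as in the tutorial.
toy_df = pd.DataFrame(
    {"A": [10, 20, 30, 50], "B": [4, 6, 8, 12]},
    index=["1998", "1999", "2000", "2001"],
)

# Grouping with a function: it is called on every index label,
# and rows with the same return value end up in the same group.
by_function = toy_df.groupby(lambda year: int(year) > 1999).mean()

# Grouping with a dict: each index label is looked up in the dict,
# and the dict values become the group labels directly.
period_of = {"1998": "1990s", "1999": "1990s", "2000": "2000s", "2001": "2000s"}
by_dict = toy_df.groupby(period_of).mean()

print(by_function)
print(by_dict)
```

With the dict we get readable group labels straight away, so there is no need to overwrite the index of the result afterwards as we did above.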
R
lapply
R has a large collection of apply functions that can be used to apply functions to
elements within vectors, matrices, lists, and data frames. The one we will introduce here
is lapply (type ?lapply in your R console). It is the one we use with lists and,
since a data frame is a list of column vectors, it will work with them as well.
For example, we can repeat the per-country sum (over all the years) we did with Pandas as follows.
existing_df_sum_years <- lapply(existing_df, function(x) { sum(x) })
existing_df_sum_years <- as.data.frame(existing_df_sum_years)
existing_df_sum_years
## Afghanistan Albania Algeria American.Samoa Andorra Angola Anguilla
## 1 6360 665 853 221 455 7442 641
## Antigua.and.Barbuda Argentina Armenia Australia Austria Azerbaijan
## 1 195 1102 1349 116 228 1541
## Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin Bermuda
## 1 920 1375 9278 95 1446 229 864 2384 133
## Bhutan Bolivia Bosnia.and.Herzegovina Botswana Brazil
## 1 10579 4806 1817 8067 1585
## British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi
## 1 383 1492 960 5583 8097
## Cambodia Cameroon Canada Cape.Verde Cayman.Islands
## 1 14015 3787 92 6712 129
## Central.African.Republic Chad Chile China Colombia Comoros Congo..Rep.
## 1 7557 7316 452 4854 1177 2310 6755
## Cook.Islands Costa.Rica Croatia Cuba Cyprus Czech.Republic Cote.d.Ivoire
## 1 357 349 1637 295 163 304 7900
## Korea..Dem..Rep. Congo..Dem..Rep. Denmark Djibouti Dominica
## 1 12359 9343 151 19155 375
## Dominican.Republic Ecuador Egypt El.Salvador Equatorial.Guinea Eritrea
## 1 2252 3676 700 1483 5303 3181
## Estonia Ethiopia Fiji Finland France French.Polynesia Gabon Gambia
## 1 1214 8432 811 153 263 974 5949 6700
## Georgia Germany Ghana Greece Grenada Guam Guatemala Guinea Guinea.Bissau
## 1 1406 180 7368 380 125 1340 1716 5853 6207
## Guyana Haiti Honduras Hungary Iceland India Indonesia Iran Iraq Ireland
## 1 1621 7428 1756 930 58 8107 6131 789 1433 233
## Israel Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait
## 1 138 139 142 822 236 2249 5117 12652 928
## Kyrgyzstan Laos Latvia Lebanon Lesotho Liberia Libyan.Arab.Jamahiriya
## 1 2354 6460 1351 783 6059 7707 559
## Lithuania Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta
## 1 1579 233 6691 6290 2615 1638 10611 120
## Mauritania Mauritius Mexico Micronesia..Fed..Sts. Monaco Mongolia
## 1 10698 817 978 3570 44 6127
## Montserrat Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## 1 227 1873 7992 5061 9990 2860 7398 138
## Netherlands.Antilles New.Caledonia New.Zealand Nicaragua Niger Nigeria
## 1 355 1095 176 1708 5360 7968
## Niue Northern.Mariana.Islands Norway Oman Pakistan Palau Panama
## 1 1494 3033 103 337 6889 2258 1073
## Papua.New.Guinea Paraguay Peru Philippines Poland Portugal Puerto.Rico
## 1 8652 1559 4352 11604 1064 677 206
## Qatar Korea..Rep. Moldova Romania Russian.Federation Rwanda
## 1 1380 2353 2781 2891 2170 7216
## Saint.Kitts.and.Nevis Saint.Lucia Saint.Vincent.and.the.Grenadines Samoa
## 1 259 371 709 568
## San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Seychelles
## 1 118 5129 1171 7423 1347
## Sierra.Leone Singapore Slovakia Slovenia Solomon.Islands Somalia
## 1 11756 751 700 639 6623 8128
## South.Africa Spain Sri.Lanka Sudan Suriname Swaziland Sweden Switzerland
## 1 10788 552 1695 7062 1975 11460 82 149
## Syrian.Arab.Republic Tajikistan Thailand Macedonia..FYR Timor.Leste
## 1 986 3438 4442 1108 10118
## Togo Tokelau Tonga Trinidad.and.Tobago Tunisia Turkey Turkmenistan
## 1 12111 1283 679 282 685 1023 1866
## Turks.and.Caicos.Islands Tuvalu Uganda Ukraine United.Arab.Emirates
## 1 485 7795 7069 1778 577
## United.Kingdom Tanzania Virgin.Islands..U.S.. United.States.of.America
## 1 173 5713 367 88
## Uruguay Uzbekistan Vanuatu Venezuela Viet.Nam Wallis.et.Futuna
## 1 505 2320 3348 736 5088 2272
## West.Bank.and.Gaza Yemen Zambia Zimbabwe
## 1 781 3498 9635 9231
What did we do there? Very simple. The lapply function takes a list and a function that will be applied to each element, and returns the results as a list. A data frame is a list of columns, so the function is applied once per country. The function is defined in-line (i.e. like a lambda in Python): for a given x, it sums its elements.
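For comparison, the same "apply a function to every column" idea can be sketched in Python. This is only an illustration: toy_df and its values are made up here and are not the tuberculosis data used in the tutorial.

```python
import pandas as pd

# Hypothetical toy data: rows are years, columns are countries.
toy_df = pd.DataFrame(
    {"Spain": [44, 42, 40], "Colombia": [88, 85, 82]},
    index=[1990, 1991, 1992],
)

# Like lapply: visit each column, apply a function, keep one result per name.
sums_by_country = {name: int(column.sum()) for name, column in toy_df.items()}

# DataFrame.apply does the same in a single call and returns a Series.
sums_series = toy_df.apply(lambda x: x.sum())

print(sums_by_country)
print(sums_series)
```

The dictionary version mirrors the list that lapply returns, while apply is probably closer to what most Pandas users would write.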
If we want to sum by year instead, across all countries, we can use the transposed data frame we stored before.
existing_df_sum_countries <- lapply(existing_df_t, function(x) { sum(x) })
existing_df_sum_countries <- as.data.frame(existing_df_sum_countries)
existing_df_sum_countries
## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001
## 1 40772 40669 39912 39573 39066 38904 37032 37462 36871 37358 36747 36804
## X2002 X2003 X2004 X2005 X2006 X2007
## 1 37160 36516 36002 35435 34987 34622
aggregate
R provides basic grouping functionality through aggregate. Another option is the powerful dplyr library, which I highly recommend. But aggregate is quite powerful as well. It accepts a data frame, a list of grouping elements, and a function to apply to each group. First we need to define a grouping vector.
before_2000 <- c('1990-99','1990-99','1990-99','1990-99','1990-99',
'1990-99','1990-99','1990-99','1990-99','1990-99',
'2000-07','2000-07','2000-07','2000-07','2000-07',
'2000-07','2000-07','2000-07')
before_2000
## [1] "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99"
## [8] "1990-99" "1990-99" "1990-99" "2000-07" "2000-07" "2000-07" "2000-07"
## [15] "2000-07" "2000-07" "2000-07" "2000-07"
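The grouping vector above was typed out by hand. As a hedged sketch, the same labels could also be derived from the years themselves. The snippet below does this in plain Python, assuming the years run from 1990 to 2007 as in the dataset; the names years and period_labels are made up for illustration.

```python
# Assumed year range of the dataset: 1990 to 2007 inclusive.
years = list(range(1990, 2008))

# One label per year: everything before 2000 goes into the first period.
period_labels = ["1990-99" if year < 2000 else "2000-07" for year in years]

print(period_labels)
```

Deriving the labels from the data avoids miscounting when the year range changes.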
Then we can use that vector as the grouping element and apply the function mean.
mean_cases_by_period <- aggregate(existing_df, list(Period = before_2000), mean)
mean_cases_by_period
## Period Afghanistan Albania Algeria American Samoa Andorra Angola
## 1 1990-99 403.700 42.1 43.90 16.200 30.3 474.40
## 2 2000-07 290.375 30.5 51.75 7.375 19.0 337.25
## Anguilla Antigua and Barbuda Argentina Armenia Australia Austria
## 1 36.400 12.800 76.6 64.400 6.8 14.500
## 2 34.625 8.375 42.0 88.125 6.0 10.375
## Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize
## 1 75.600 52.700 95.600 571.20 6.400 80.500 14.000 54.60
## 2 98.125 49.125 52.375 445.75 3.875 80.125 11.125 39.75
## Benin Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## 1 131.300 8.400 699.600 308.2 132.9 356.400 103.400
## 2 133.875 6.125 447.875 215.5 61.0 562.875 68.875
## British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso Burundi
## 1 24.600 90.60 57.700 239.9 332.30
## 2 17.125 73.25 47.875 398.0 596.75
## Cambodia Cameroon Canada Cape Verde Cayman Islands
## 1 835.9 201.400 5.900 409.500 8.400
## 2 707.0 221.625 4.125 327.125 5.625
## Central African Republic Chad Chile China Colombia Comoros
## 1 360.000 330.300 32.0 300.00 75.10 152.500
## 2 494.625 501.625 16.5 231.75 53.25 98.125
## Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus Czech Republic
## 1 322.200 23.400 24.5 110.000 21.70 10.90 20.8
## 2 441.625 15.375 13.0 67.125 9.75 6.75 12.0
## Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep. Denmark Djibouti
## 1 331.00 794.400 393.30 9.70 1145.000
## 2 573.75 551.875 676.25 6.75 963.125
## Dominica Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea
## 1 22.000 148.20 236.700 45.6 101.9 206.50
## 2 19.375 96.25 163.625 30.5 58.0 404.75
## Eritrea Estonia Ethiopia Fiji Finland France French Polynesia Gabon
## 1 221.200 77.700 382.900 54.50 10.400 16.90 70.900 330.800
## 2 121.125 54.625 575.375 33.25 6.125 11.75 33.125 330.125
## Gambia Georgia Germany Ghana Greece Grenada Guam Guatemala Guinea
## 1 352.20 68.2 12.8 450.100 24.300 7.000 100.20 101.500 274.200
## 2 397.25 90.5 6.5 358.375 17.125 6.875 42.25 87.625 388.875
## Guinea-Bissau Guyana Haiti Honduras Hungary Iceland India Indonesia
## 1 394.10 61.800 438.100 118.900 68.300 3.700 533.200 387.70
## 2 283.25 125.375 380.875 70.875 30.875 2.625 346.875 281.75
## Iran Iraq Ireland Israel Italy Jamaica Japan Jordan Kazakhstan
## 1 52.000 85.800 14.9 8.80 8.800 8.6 53.700 16.300 107.3
## 2 33.625 71.875 10.5 6.25 6.375 7.0 35.625 9.125 147.0
## Kenya Kiribati Kuwait Kyrgyzstan Laos Latvia Lebanon Lesotho Liberia
## 1 208.9 874.900 69.40 118.700 393.40 75.400 57.9 271.5 444.7
## 2 378.5 487.875 29.25 145.875 315.75 74.625 25.5 418.0 407.5
## Libyan Arab Jamahiriya Lithuania Luxembourg Madagascar Malawi Malaysia
## 1 40.200 94.10 15.10 359.5 355.0 158.90
## 2 19.625 79.75 10.25 387.0 342.5 128.25
## Maldives Mali Malta Mauritania Mauritius Mexico Micronesia, Fed. Sts.
## 1 105.500 595.200 7.80 600.700 50.200 72.40 246.80
## 2 72.875 582.375 5.25 586.375 39.375 31.75 137.75
## Monaco Mongolia Montserrat Morocco Mozambique Myanmar Namibia Nauru
## 1 2.8 412.50 13.5 116.600 368.300 352.70 566.900 216.500
## 2 2.0 250.25 11.5 88.375 538.625 191.75 540.125 86.875
## Nepal Netherlands Netherlands Antilles New Caledonia New Zealand
## 1 523.300 8.80 22.7 83.1 10.100
## 2 270.625 6.25 16.0 33.0 9.375
## Nicaragua Niger Nigeria Niue Northern Mariana Islands Norway Oman
## 1 113.40 308.60 361.500 98.80 228.200 6.7 23.200
## 2 71.75 284.25 544.125 63.25 93.875 4.5 13.125
## Pakistan Palau Panama Papua New Guinea Paraguay Peru Philippines
## 1 423.400 164.100 68.800 494.900 89.400 297.40 726.4
## 2 331.875 77.125 48.125 462.875 83.125 172.25 542.5
## Poland Portugal Puerto Rico Qatar Korea, Rep. Moldova Romania
## 1 77.100 43.90 15.300 78 141.600 140.000 153.1
## 2 36.625 29.75 6.625 75 117.125 172.625 170.0
## Russian Federation Rwanda Saint Kitts and Nevis Saint Lucia
## 1 107.20 274.20 15.1 22.50
## 2 137.25 559.25 13.5 18.25
## Saint Vincent and the Grenadines Samoa San Marino Sao Tome and Principe
## 1 42.30 35.00 7.500 306.1
## 2 35.75 27.25 5.375 258.5
## Saudi Arabia Senegal Seychelles Sierra Leone Singapore Slovakia Slovenia
## 1 67.000 385.000 91.400 531.900 49.70 49.700 47.800
## 2 62.625 446.625 54.125 804.625 31.75 25.375 20.125
## Solomon Islands Somalia South Africa Spain Sri Lanka Sudan Suriname
## 1 469.600 521.100 569.2 35.300 99.1 401.100 95.1
## 2 240.875 364.625 637.0 24.875 88.0 381.375 128.0
## Swaziland Sweden Switzerland Syrian Arab Republic Tajikistan Thailand
## 1 527.900 4.900 10.30 72.300 134.00 288.6
## 2 772.625 4.125 5.75 32.875 262.25 194.5
## Macedonia, FYR Timor-Leste Togo Tokelau Tonga Trinidad and Tobago
## 1 80.100 662.6 650.10 105.9 39.9 16.100
## 2 38.375 436.5 701.25 28.0 35.0 15.125
## Tunisia Turkey Turkmenistan Turks and Caicos Islands Tuvalu Uganda
## 1 46.400 68.800 105.900 32.200 511.30 352.70
## 2 27.625 41.875 100.875 20.375 335.25 442.75
## Ukraine United Arab Emirates United Kingdom Tanzania
## 1 81.60 37.400 9.200 279.200
## 2 120.25 25.375 10.125 365.125
## Virgin Islands (U.S.) United States of America Uruguay Uzbekistan
## 1 23.000 6.0 30.600 117.00
## 2 17.125 3.5 24.875 143.75
## Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen
## 1 234.500 42.300 323.300 152.900 49.800 234.500
## 2 125.375 39.125 231.875 92.875 35.375 144.125
## Zambia Zimbabwe
## 1 557.200 428.10
## 2 507.875 618.75
The aggregate function can of course take a subset of the data frame as its first parameter. It also lets us pass multiple grouping elements and define our own functions (either in-line or predefined). And again, the result is a data frame that we can index as usual.
mean_cases_by_period[,c('United Kingdom','Spain','Colombia')]
## United Kingdom Spain Colombia
## 1 9.200 35.300 75.10
## 2 10.125 24.875 53.25
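For readers following the Python side, a rough Pandas counterpart of this grouping step could look as follows. It is a minimal sketch on made-up data: toy_df, period and the numbers are assumptions for illustration, and the custom function (the range within each period) is just one example of a user-defined aggregation.

```python
import pandas as pd

# Hypothetical toy data: rows are years, columns are countries.
toy_df = pd.DataFrame(
    {"Spain": [40, 30, 20, 10], "Colombia": [80, 60, 50, 30]},
    index=[1998, 1999, 2000, 2001],
)

# One label per row, playing the role of the grouping vector.
period = pd.Series(
    ["1990-99" if year < 2000 else "2000-07" for year in toy_df.index],
    index=toy_df.index,
    name="Period",
)

# Group rows by label and average each column within each group.
mean_by_period = toy_df.groupby(period).mean()

# A custom function works too, here max minus min within each period.
range_by_period = toy_df.groupby(period).agg(lambda x: x.max() - x.min())

# The result is a data frame that can be indexed as usual.
print(mean_by_period[["Spain", "Colombia"]])
print(range_by_period)
```

Passing the labels as a Series that shares the data frame's index makes the row-to-group mapping explicit.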
Conclusions
This two-part tutorial has introduced the concept of the data frame, together with how to use it in the two most popular Data Science ecosystems nowadays, R and Python. We have seen how Pandas is inspired by R, and how in Python/Pandas we can use constructs very similar to those present in the R language. Python is also a language widely used by software developers of all kinds. All this means that Pandas offers a more consistent programming interface, one that is more efficient in many situations. It is also commonly held in the community that, if you come from a software development background, you will feel more comfortable with a language like Python and with how DataFrame is defined as an object-oriented concept. If you come instead from a maths and statistics background, you will appreciate a language like R, very interactive and function-based, with libraries made by statisticians for statisticians. It is not a language meant to be used in complex software architectures on its own, but to be used in a powerful dialogue with data.
Additionally, we have introduced a few datasets from Gapminder World related to infectious tuberculosis, a very serious epidemic disease that is sometimes forgotten in developed countries but that nowadays is the second leading cause of death of its kind, just after HIV (and often associated with HIV). In the next tutorial in the series, we will use these datasets to perform some exploratory analysis in both Python and R, to better understand the world situation regarding the disease.
Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!