Our work today
This code was prepared for an introduction to R for the undergraduate seminar Tweeting Political Crisis, offered by Ernesto Calvo at the University of Maryland, College Park.
I cover some basic steps for learning R in this tutorial. The goal is to teach students how to understand basic programming in R, how to navigate the environment, and how and where to ask for help and learn more in the future.
This presentation was built using Rmarkdown, but I strongly suggest that students work directly in the R code attached to this presentation. All the materials can be downloaded here.
This code has been adapted from previous materials by Eric Dunford, Natalia Bueno, and Rochelle Terman.
Who am I and how did I learn R?
My name is Tiago Ventura. I am a Ph.D. student in Government and Politics at the University of Maryland, College Park. I have been working in R for the last 5 years, since I started my Master's degree in Brazil. You can find my website here, and feel free to send me emails if you need help with R. You can also find me at Chincoteague 4118.
I learned R through a few different sources. These are some of them:
What’s R?
R is a versatile, open source programming/scripting language that’s useful for both statistics and data science. It was inspired by the programming language S.
- Open source software under GPL.
- Superior (if not just comparable) to commercial alternatives. As of January 2019, R ranks 12th in the TIOBE index, which measures the popularity of programming languages. It’s widely used both in academia and industry especially in the circle of data scientists.
- Available on all major platforms (Windows, macOS, Linux).
- As a result, if you do your analysis in R, anyone can easily replicate it.
- Not just for statistics, but also general purpose programming.
- Is object oriented (= R has objects) and functional (= You can write functions).
- Large and growing community of peers.
Golden Rules of R
- Everything that exists is an object (object oriented).
- Everything that happens is a function call (functional).
RStudio
RStudio is the premier R graphical user interface (GUI) and integrated development environment (IDE) that makes R easier to use.
Tools –> Global Options
Before we begin, let’s set a few RStudio settings to improve your experience.
Click “Tools –> Global Options –> Appearance” to change your color and font settings.
Click “Tools –> Global Options –> Code” and check the box that says “Soft-wrap R source files” to wrap the text in your script to the width of the script pane.
Click “Tools –> Global Options –> Code –> Display” and check the boxes that say “Highlight selected line” and “Highlight R function calls”.
Installing a package in R
A number of packages are supplied with the R distribution. These are known as “base packages”, and they are loaded in the background the second one starts a session in R.
Packages are collections of R functions, data, and compiled code in a well-defined format.
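A minimal sketch of the usual workflow: install a package once, then load it in every new session. The package name tidyverse is used only because it appears later in this tutorial, and the install and load lines are commented out so the sketch runs without an internet connection.

```r
# Install once per computer (needs an internet connection):
# install.packages("tidyverse")

# Load in every new session before using the package:
# library(tidyverse)

# Base packages are already attached when R starts; search() lists them
attached <- search()
attached
"package:base" %in% attached
```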
Asking for help
? + object opens a help page for that specific object
?? + object searches help pages containing the name of the object
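A short sketch of both help operators, using mean as the example function. The two help calls are commented out because they open a help window and are meant to be typed interactively; apropos() is an extra base function, not mentioned above, that lists objects whose names match a pattern.

```r
# Open the help page for a function you already know (same as help("mean")):
# ?mean

# Search every installed help page for a keyword (same as help.search("median")):
# ??median

# apropos() returns the names of objects matching a pattern
mean_like <- apropos("^mean")
mean_like
```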
R Basics
Assigning an Object
In simple terms, an object is a bit of text that represents a specific value. Variable names can only contain letters, numbers, the underscore character, and (unlike Python) the period character. Whereas an object name like myobject.thing would point to the subclass or method thing of myobject in Python, R treats myobject.thing as its own entity.
[1] "my_name" "x"
[1] "my_name"
[1] "Tiago Da Silva Ventura"
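A minimal sketch of assignment, assuming the object names x and my_name that appear in the output above; my.object.thing is a hypothetical name added to illustrate the period rule.

```r
# Assign values to names with the arrow operator <-
x <- 3
my_name <- "Tiago"

# Periods are allowed inside names: this is one object, not a method call
my.object.thing <- 10

# Print an object by typing its name
my_name

# ls() lists the objects that currently exist in the workspace
objects_now <- ls()
objects_now
```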
Remove objects
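A small sketch of removing objects with rm(). The names temp_a and temp_b are hypothetical, chosen so they do not clash with anything else in the session.

```r
temp_a <- 1
temp_b <- 2

rm(temp_a)         # remove a single object
exists("temp_a")   # FALSE: temp_a is gone
exists("temp_b")   # TRUE: temp_b is still there

# To clear the whole workspace at once (use with care):
# rm(list = ls())
```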
Object Coercion
When need be, an object can be coerced to be a different class.
[1] 3
[1] "3"
Here we transformed x, which was an object containing the value 3, into a character. x is now a string with the text “3”.
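The coercion described here can be sketched with the as.*() family of functions. The object y is a hypothetical extra, used to show the conversion back to a number.

```r
x <- 3
class(x)               # "numeric"

x <- as.character(x)   # coerce the number to a string
x                      # "3"
class(x)               # "character"

y <- as.numeric(x)     # and back to a number
y + 1                  # 4
```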
Data Structures
There are also many ways data can be organized in R.
The same object can be organized in different ways depending on the needs of the user. Some commonly used data structures include:
vector
matrix
data.frame
list
array
Data Structures: Vector
[1] 1.00 2.30 4.00 5.00 6.78 6.00 7.00 8.00 9.00 10.00
[1] "numeric"
[1] 10
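A sketch of how a vector like the one printed above can be built with c(). The name X is an assumption on my part.

```r
# c() combines values into a vector; 6:10 is shorthand for 6, 7, 8, 9, 10
X <- c(1, 2.3, 4, 5, 6.78, 6:10)
X

class(X)    # "numeric"
length(X)   # 10

# Square brackets pull out elements by position
X[2]        # 2.3
```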
Data Structures: Data Frame
The most useful data structure for data analysis. It is like a spreadsheet in your R environment.
X
1 1.00
2 2.30
3 4.00
4 5.00
5 6.78
6 6.00
7 7.00
8 8.00
9 9.00
10 10.00
# Create a data frame
data <- data.frame(name="Tiago", last_name="ventura", school="UMD", age=30)
data
name last_name school age
1 Tiago ventura UMD 30
Data Structures: Matrix
Like a data frame, but with the same data type in all the columns.
[,1]
[1,] 1.00
[2,] 2.30
[3,] 4.00
[4,] 5.00
[5,] 6.78
[6,] 6.00
[7,] 7.00
[8,] 8.00
[9,] 9.00
[10,] 10.00
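A minimal sketch of two ways to build a matrix, assuming the same ten numbers stored in a vector called X. M and M2 are hypothetical names.

```r
X <- c(1, 2.3, 4, 5, 6.78, 6:10)

M <- as.matrix(X)                      # one column, ten rows
dim(M)                                 # 10 1

M2 <- matrix(X, nrow = 5, ncol = 2)    # filled column by column
M2
M2[1, 2]                               # row 1, column 2: 6

# Mixing types forces everything into a single type
typeof(cbind(1, "a"))                  # "character"
```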
Data Structures: List
Lists are extremely useful for more advanced applications. A list works as a repository of multiple objects. It is like a big drawer where you can save your mess.
[[1]]
[1] 1
[[2]]
[1] 2.3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6.78
[[6]]
[1] 6
[[7]]
[1] 7
[[8]]
[1] 8
[[9]]
[1] 9
[[10]]
[1] 10
List of 2
$ : num [1:10] 1 2.3 4 5 6.78 6 7 8 9 10
$ :'data.frame': 1 obs. of 4 variables:
..$ name : Factor w/ 1 level "Tiago": 1
..$ last_name: Factor w/ 1 level "ventura": 1
..$ school : Factor w/ 1 level "UMD": 1
..$ age : num 30
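A small sketch of a list holding two different kinds of objects, a numeric vector and a data frame. The names my_list, numbers, and info are hypothetical.

```r
X <- c(1, 2.3, 4)
df <- data.frame(name = "Tiago", school = "UMD")

# A list can store objects of different types and sizes side by side
my_list <- list(numbers = X, info = df)
str(my_list)

my_list$numbers    # access an element by name
my_list[[2]]       # access an element by position (here, the data frame)
length(my_list)    # 2
```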
Data Structures: Accessing Data
One must understand the structure of an object in order to systematically access the material contained within it.
[1] "data.frame"
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
[15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
[29] 15.8 19.7 15.0 21.4
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
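A sketch of the main ways to reach inside a data frame, using the built-in mtcars dataset shown in the output above. The object names on the left are hypothetical.

```r
class(mtcars)   # "data.frame"
str(mtcars)     # structure: one line per column

# Three equivalent ways to pull out the mpg column
mpg_a <- mtcars$mpg        # by name with $
mpg_b <- mtcars[, "mpg"]   # by name inside [rows, columns]
mpg_c <- mtcars[, 1]       # by position
identical(mpg_a, mpg_b)

# Leave the column slot empty to get a whole row
first_car <- mtcars[1, ]
first_car
```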
More ways to get information about your data frame
[1] 32
[1] 11
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
[1] 32 11
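The outputs above can be reproduced with a handful of base functions, sketched here on mtcars.

```r
nrow(mtcars)      # number of rows: 32
ncol(mtcars)      # number of columns: 11
dim(mtcars)       # both at once: 32 11

head(mtcars)      # first six rows
tail(mtcars)      # last six rows
head(mtcars, 3)   # the second argument sets how many rows to show
```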
Data Management
But, first, where is my data exactly?
- In your working directory.
R doesn’t intuitively know where your data is. If the data is in a special folder entitled “super secret research”, we have to tell R how to get there.
We can do this two ways:
- Set the working directory to that folder
- Establish the path to that folder
Every time R boots up, it does so in the same place, unless we tell it to go somewhere else.
[1] "C:/Users/venturat/Dropbox/Workshops/UFPA_Intro_to_data_science_in_R/crash_course_GVPT"
Setting a new working directory
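A minimal sketch of both options. It uses tempdir() as a stand-in folder only because that folder exists on every computer; in practice you would put the path to your own folder there. The path in the last comment is hypothetical.

```r
old_wd <- getwd()    # where R is working right now
old_wd

setwd(tempdir())     # option 1: move R to another folder
getwd()
list.files()         # files R can now open by their bare file name

setwd(old_wd)        # go back to where we started

# Option 2: leave the working directory alone and give the full path instead
# data <- read.csv("C:/path/to/folder/results.csv")
```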
Importing data
For basic programming tasks, you will mostly import data from a .csv or Excel-like file. If you want to download data directly from Twitter using their API, the data comes in JSON format, and processing it is a little more tricky. Professor Calvo will do that for you, so it is more likely you will deal with .csv files.
Download this data here, and save it in any folder on your computer.
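library(tidyverse)
# Base R: read the file, keeping text columns as character vectors
data <- read.csv(file = "results.csv",
                 stringsAsFactors = FALSE)
# tidyverse alternative: read_csv() comes from the readr package
data <- read_csv("results.csv")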
These functions have specific arguments that we are referencing: stringsAsFactors means that we don’t want all character vectors in the data.frame to be converted to factors. header means the first row of the data contains the column names. sep means that entries are separated by commas. In read.csv(), header = TRUE and sep = "," are the defaults, which is why they are not written out above.
Exporting data
Exporting data is the same process in reverse. Assuming that we have the foreign, readstata13, and XLConnect packages loaded:
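A minimal sketch of exporting with base R only. It does not use foreign, readstata13, or XLConnect, whose functions cover Stata and Excel formats. The file name my_data.csv and the object names are hypothetical, and the file is written to a temporary folder so nothing on your computer is overwritten.

```r
# A tiny data frame to export
small <- data.frame(country = c("Brazil", "Argentina"),
                    wins = c(395, 354))

# Build the output path, then write the file
out_file <- file.path(tempdir(), "my_data.csv")
write.csv(small, file = out_file, row.names = FALSE)

# Read it back to check that the round trip worked
check <- read.csv(out_file, stringsAsFactors = FALSE)
check
```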
Descriptive Statistics
Now that we can get data into R, we want to explore and summarize what’s going on.
summary() allows one to quickly summarize the distributions across a set of variables.
date home_team away_team
Min. :1872-11-30 Length:39669 Length:39669
1st Qu.:1977-02-02 Class :character Class :character
Median :1996-10-06 Mode :character Mode :character
Mean :1989-10-17
3rd Qu.:2008-01-22
Max. :2018-07-10
home_score away_score tournament city
Min. : 0.000 Min. : 0.000 Length:39669 Length:39669
1st Qu.: 1.000 1st Qu.: 0.000 Class :character Class :character
Median : 1.000 Median : 1.000 Mode :character Mode :character
Mean : 1.748 Mean : 1.188
3rd Qu.: 2.000 3rd Qu.: 2.000
Max. :31.000 Max. :21.000
country neutral
Length:39669 Mode :logical
Class :character FALSE:29848
Mode :character TRUE :9821
There is a wealth of useful summary operators built into R.
…to name a few!
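A few of these operators, sketched on the built-in mtcars data rather than on the soccer file so the lines run without any download.

```r
x <- mtcars$mpg

mean(x)       # average
median(x)     # middle value
sd(x)         # standard deviation
var(x)        # variance
min(x)        # smallest value: 10.4
max(x)        # largest value: 33.9
range(x)      # smallest and largest together
quantile(x, probs = c(0.25, 0.75))   # quartiles

table(mtcars$cyl)   # counts for each distinct value
```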
Base Graphics
A rather flexible graphing language comes built into R. Though there are more powerful and easy to use graphical packages out there (e.g. ggplot2 and lattice), the base plotting functions offer a lot of functionality. The benefit of these functions is that they are easy to manipulate and use.
- histograms: hist()
- scatter plots: plot()
- barplot: barplot()
- pie chart: pie()
- density plot: plot(density())
Histogram
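A minimal histogram sketch using mtcars. The variable, title, and color are my choices for illustration, not necessarily those of the original slide.

```r
# hist() draws the plot and also returns the bins and counts it used
h <- hist(mtcars$mpg,
          breaks = 10,                 # suggested number of bins
          col = "grey",
          main = "Miles per gallon",
          xlab = "mpg")
h$counts   # how many cars fall in each bin
```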
Base Graphics: Scatter Plots
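A minimal scatter plot sketch using mtcars. The variables, labels, and the added regression line are assumptions chosen for illustration.

```r
# Scatter plot of weight against fuel efficiency
plot(x = mtcars$wt, y = mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Heavier cars use more fuel",
     pch = 19)                      # solid points

# Add a fitted regression line on top of the points
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")
```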
Base Graphics: Density Plots
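A minimal density plot sketch using mtcars. As listed earlier, the pattern is plot(density()); the variable and title are assumptions.

```r
# density() estimates a smooth distribution; plot() draws it
d <- density(mtcars$mpg)
plot(d, main = "Density of miles per gallon")
```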
Basic Data Manipulations
Here we are going to use the same dataset we opened before, with data about soccer matches over time.
'data.frame': 39669 obs. of 9 variables:
$ date : chr "1872-11-30" "1873-03-08" "1874-03-07" "1875-03-06" ...
$ home_team : chr "Scotland" "England" "Scotland" "England" ...
$ away_team : chr "England" "Scotland" "England" "Scotland" ...
$ home_score: int 0 4 2 2 3 4 1 0 7 9 ...
$ away_score: int 0 2 1 2 0 0 3 2 2 0 ...
$ tournament: chr "Friendly" "Friendly" "Friendly" "Friendly" ...
$ city : chr "Glasgow" "London" "Glasgow" "London" ...
$ country : chr "Scotland" "England" "Scotland" "England" ...
$ neutral : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
Creating Variables
Recall that we can access a variable’s contents using the call sign $. We can also use this same call logic to create a new variable.
date home_team away_team home_score away_score tournament city
1 1872-11-30 Scotland England 0 0 Friendly Glasgow
2 1873-03-08 England Scotland 4 2 Friendly London
3 1874-03-07 Scotland England 2 1 Friendly Glasgow
4 1875-03-06 England Scotland 2 2 Friendly London
5 1876-03-04 Scotland England 3 0 Friendly Glasgow
6 1876-03-25 Scotland Wales 4 0 Friendly Glasgow
country neutral sum_of_gols
1 Scotland FALSE 0
2 England FALSE 6
3 Scotland FALSE 3
4 England FALSE 4
5 Scotland FALSE 3
6 Scotland FALSE 4
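A sketch of how a column like sum_of_gols can be created with $. To keep it self-contained, it uses a small data frame called matches (a hypothetical name) holding the six matches printed above instead of the full file.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  away_team  = c("England", "Scotland", "England", "Scotland", "England", "Wales"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)

# Naming a column that does not exist yet creates it
matches$sum_of_gols <- matches$home_score + matches$away_score
matches
```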
We can also use other aspects of a data frame’s structure to the same end.
data[, "local"] <- paste(data$city, data$country) # New column: city and country pasted together
head(data) # Print the first six rows to check the result
date home_team away_team home_score away_score tournament city
1 1872-11-30 Scotland England 0 0 Friendly Glasgow
2 1873-03-08 England Scotland 4 2 Friendly London
3 1874-03-07 Scotland England 2 1 Friendly Glasgow
4 1875-03-06 England Scotland 2 2 Friendly London
5 1876-03-04 Scotland England 3 0 Friendly Glasgow
6 1876-03-25 Scotland Wales 4 0 Friendly Glasgow
country neutral sum_of_gols local
1 Scotland FALSE 0 Glasgow Scotland
2 England FALSE 6 London England
3 Scotland FALSE 3 Glasgow Scotland
4 England FALSE 4 London England
5 Scotland FALSE 3 Glasgow Scotland
6 Scotland FALSE 4 Glasgow Scotland
The creation of any variable follows this same logic as long as the vector being inserted is of the correct length.
[1] 39669
Ordinal Variables (ifelse() conditionals)
Often we need to chop up a distribution into an ordered variable. This is straightforward when using the ifelse() conditional statement. Essentially, we are saying: if the variable meets this criterion, code it as this; else do this.
For example, let’s compare the home_score and away_score variables to build a dichotomous indicator. Note that because the comparison uses >=, drawn matches are coded as home victories.
data$home_vic <- ifelse(data$home_score>=data$away_score,"home victory","away victory")
data[,c("home_score","away_score", "home_vic")]
home_score away_score home_vic
1 0 0 home victory
2 4 2 home victory
3 2 1 home victory
4 2 2 home victory
5 3 0 home victory
6 4 0 home victory
7 1 3 away victory
8 0 2 away victory
9 7 2 home victory
10 9 0 home victory
11 2 1 home victory
12 5 4 home victory
13 0 3 away victory
14 5 4 home victory
15 2 3 away victory
16 5 1 home victory
17 0 1 away victory
18 1 6 away victory
19 1 5 away victory
20 0 13 away victory
21 7 1 home victory
22 5 1 home victory
23 5 3 home victory
24 5 0 home victory
25 5 0 home victory
[ reached 'max' / getOption("max.print") -- omitted 39644 rows ]
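If draws should get their own category, one option is to nest a second ifelse() inside the first. This is a sketch of that variation, not the code that produced the output above; matches and result are hypothetical names, and the scores are the first six matches of the dataset.

```r
matches <- data.frame(
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0)
)

# First test: home win. If not, second test: away win. Otherwise: draw.
matches$result <- ifelse(matches$home_score > matches$away_score, "home victory",
                  ifelse(matches$home_score < matches$away_score, "away victory",
                         "draw"))
matches
```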
Dropping Variables
Use negative values in the brackets to specify variables you’d like to drop.
date home_team country neutral sum_of_gols local id
1 1872-11-30 Scotland Scotland FALSE 0 Glasgow Scotland 1
2 1873-03-08 England England FALSE 6 London England 2
3 1874-03-07 Scotland Scotland FALSE 3 Glasgow Scotland 3
4 1875-03-06 England England FALSE 4 London England 4
5 1876-03-04 Scotland Scotland FALSE 3 Glasgow Scotland 5
6 1876-03-25 Scotland Scotland FALSE 4 Glasgow Scotland 6
home_vic
1 home victory
2 home victory
3 home victory
4 home victory
5 home victory
6 home victory
We can also subset out a variable.
date home_team
1 1872-11-30 Scotland
2 1873-03-08 England
3 1874-03-07 Scotland
4 1875-03-06 England
5 1876-03-04 Scotland
6 1876-03-25 Scotland
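A sketch of both operations, dropping columns with negative positions and keeping columns by name, on a small hypothetical data frame called matches built from the first three matches.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland"),
  away_team  = c("England", "Scotland", "England"),
  home_score = c(0, 4, 2),
  away_score = c(0, 2, 1),
  stringsAsFactors = FALSE
)

# Negative positions drop columns: here columns 3 and 4
dropped <- matches[, -c(3, 4)]
colnames(dropped)

# Naming the columns you want keeps only those
kept <- matches[, c("home_team", "away_team")]
colnames(kept)
```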
Renaming Variables
Inevitably, you will need to rename variables. Doing so is straightforward with the colnames() function.
[1] "date" "home_team" "away_team" "home_score" "away_score"
[6] "tournament" "city" "country" "neutral" "sum_of_gols"
[11] "local" "id" "home_vic"
# colnames behaves like any vector, and as such, we can access the information
# as we would any vector
colnames(data)[4]
[1] "home_score"
[1] "home_score" "away_score"
# Renaming a variable is as easy as inserting a new value in the data structure.
colnames(data)[4] <- "home-score"
colnames(data)
[1] "date" "home_team" "away_team" "home-score" "away_score"
[6] "tournament" "city" "country" "neutral" "sum_of_gols"
[11] "local" "id" "home_vic"
[1] "var1" "var2" "var3" "var4" "var5"
[6] "tournament" "city" "country" "neutral" "sum_of_gols"
[11] "local" "id" "home_vic"
var1 var2 var3 var4 var5 tournament city country
1 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland
2 1873-03-08 England Scotland 4 2 Friendly London England
3 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland
4 1875-03-06 England Scotland 2 2 Friendly London England
5 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland
neutral sum_of_gols local id home_vic
1 FALSE 0 Glasgow Scotland 1 home victory
2 FALSE 6 London England 2 home victory
3 FALSE 3 Glasgow Scotland 3 home victory
4 FALSE 4 London England 4 home victory
5 FALSE 3 Glasgow Scotland 5 home victory
[ reached 'max' / getOption("max.print") -- omitted 1 rows ]
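Several columns can be renamed in one step by assigning a vector of names. A sketch on a small hypothetical data frame; the new names goals_home, var1, and var2 are arbitrary.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England"),
  away_team  = c("England", "Scotland"),
  home_score = c(0, 4),
  away_score = c(0, 2)
)
colnames(matches)

# Rename one column by position
colnames(matches)[3] <- "goals_home"

# Rename several at once; paste0() builds "var1", "var2"
colnames(matches)[1:2] <- paste0("var", 1:2)
colnames(matches)
```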
Subsetting Data
As noted above, it’s straightforward to subset data given what we know about an object’s structure. But there are also a few functions that make our life easier.
# Let's subset the data just to games Brazil was playing. There are many ways to do
# this, let's explore a few.
data = read.csv(file = "results.csv",
stringsAsFactors = F)
# (1) Use what we know about boolean operators from last week.
data[data$home_team=="Brazil",]
date home_team away_team home_score away_score tournament
424 1916-07-08 Brazil Chile 1 1 Copa América
427 1916-07-12 Brazil Uruguay 1 2 Copa América
460 1917-10-12 Brazil Chile 5 0 Copa América
486 1919-05-11 Brazil Chile 6 0 Copa América
491 1919-05-18 Brazil Argentina 3 1 Copa América
495 1919-05-26 Brazil Uruguay 2 2 Copa América
496 1919-05-29 Brazil Uruguay 1 0 Copa América
498 1919-06-01 Brazil Argentina 3 3 Friendly
city country neutral
424 Buenos Aires Argentina TRUE
427 Buenos Aires Argentina TRUE
460 Montevideo Uruguay TRUE
486 Rio de Janeiro Brazil FALSE
491 Rio de Janeiro Brazil FALSE
495 Rio de Janeiro Brazil FALSE
496 Rio de Janeiro Brazil FALSE
498 Rio de Janeiro Brazil FALSE
[ reached 'max' / getOption("max.print") -- omitted 544 rows ]
date home_team away_team home_score away_score tournament
491 1919-05-18 Brazil Argentina 3 1 Copa América
498 1919-06-01 Brazil Argentina 3 3 Friendly
655 1922-10-15 Brazil Argentina 2 0 Copa América
660 1922-10-22 Brazil Argentina 2 1 Copa Roca
2136 1939-01-15 Brazil Argentina 1 5 Copa Roca
2139 1939-01-22 Brazil Argentina 3 2 Copa Roca
2222 1940-02-18 Brazil Argentina 2 2 Copa Roca
2225 1940-02-25 Brazil Argentina 0 3 Copa Roca
city country neutral
491 Rio de Janeiro Brazil FALSE
498 Rio de Janeiro Brazil FALSE
655 Rio de Janeiro Brazil FALSE
660 São Paulo Brazil FALSE
2136 Rio de Janeiro Brazil FALSE
2139 Rio de Janeiro Brazil FALSE
2222 São Paulo Brazil FALSE
2225 São Paulo Brazil FALSE
[ reached 'max' / getOption("max.print") -- omitted 37 rows ]
# Subset and return only the two score columns
data[data$home_team=="Brazil" & data$away_team=="Argentina", c("home_score", "away_score")]
home_score away_score
491 3 1
498 3 3
655 2 0
660 2 1
2136 1 5
2139 3 2
2222 2 2
2225 0 3
2508 3 4
2509 6 2
2510 3 1
4129 1 2
4132 2 0
4684 5 1
5318 2 3
5319 5 2
5560 0 3
5814 0 0
6765 4 1
6768 3 2
7317 0 2
7319 2 1
9348 2 1
9634 2 0
10835 2 1
12660 0 0
12986 0 0
13611 2 1
15575 2 0
16035 0 1
16455 1 1
17736 1 1
18111 2 0
18926 2 2
21122 0 1
22039 2 1
22155 4 2
[ reached 'max' / getOption("max.print") -- omitted 8 rows ]
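One function that makes this easier is base R's subset(), which lets you write the condition without repeating the name of the data frame. This is a sketch on a small hypothetical data frame called matches; I am assuming subset() is among the functions the text has in mind.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  away_team  = c("England", "Scotland", "England", "Scotland", "England", "Wales"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)

# Bracket version
sub1 <- matches[matches$home_team == "Scotland" & matches$away_team == "England", ]

# subset() version: same rows, less typing
sub2 <- subset(matches, home_team == "Scotland" & away_team == "England")

# The select argument keeps only some columns
sub3 <- subset(matches, home_team == "Scotland" & away_team == "England",
               select = c(home_score, away_score))
sub3
```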
Merging Data
Merging data is a must in quantitative political analysis: by bringing various datasets together, we can enrich our analysis. But this isn’t always straightforward. Observations can be dropped or duplicated if one is not vigilant about the dimensions of each data frame being merged.
The Basics
# Let's create two example data frames. Note that rep() is a function to repeat
# a sequence a specific number of times.
countries <- rep(c("China", "Russia", "US", "Benin"), 2)
years <- c(rep(1999, 4), rep(2000, 4))
data1 <- data.frame(country = countries,
                    year = years,
                    repress = c(1, 2, 4, 3, 2, 3, 4, 1),
                    stringsAsFactors = FALSE)
data2 <- data.frame(country = countries,
                    year = years,
                    GDPpc = round(runif(8, 2e3, 20e3), 3), # random values, so they change on every run
                    stringsAsFactors = FALSE)
head(data1); head(data2)
country year repress
1 China 1999 1
2 Russia 1999 2
3 US 1999 4
4 Benin 1999 3
5 China 2000 2
6 Russia 2000 3
country year GDPpc
1 China 1999 8942.119
2 Russia 1999 17715.519
3 US 1999 4948.356
4 Benin 1999 18299.354
5 China 2000 15222.150
6 Russia 2000 10466.718
# Merging the datasets: here we'll merge the data utilizing a unique identifier
# that is common across the two datasets
merge(data1,data2,by="country") # Just countries
country year.x repress year.y GDPpc
1 Benin 1999 3 1999 18299.354
2 Benin 1999 3 2000 3674.782
3 Benin 2000 1 1999 18299.354
4 Benin 2000 1 2000 3674.782
5 China 1999 1 1999 8942.119
6 China 1999 1 2000 15222.150
7 China 2000 2 1999 8942.119
8 China 2000 2 2000 15222.150
9 Russia 1999 2 1999 17715.519
10 Russia 1999 2 2000 10466.718
11 Russia 2000 3 1999 17715.519
12 Russia 2000 3 2000 10466.718
13 US 1999 4 1999 4948.356
14 US 1999 4 2000 8071.605
15 US 2000 4 1999 4948.356
[ reached 'max' / getOption("max.print") -- omitted 1 rows ]
country year repress GDPpc
1 Benin 1999 3 18299.354
2 Benin 2000 1 3674.782
3 China 1999 1 8942.119
4 China 2000 2 15222.150
5 Russia 1999 2 17715.519
6 Russia 2000 3 10466.718
7 US 1999 4 4948.356
8 US 2000 4 8071.605
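A sketch of the difference between merging on one key and merging on two. It rebuilds the two example data frames so it runs on its own; set.seed(123) is my addition to make the random GDP values repeatable, so the numbers will not match the output above.

```r
countries <- rep(c("China", "Russia", "US", "Benin"), 2)
years <- c(rep(1999, 4), rep(2000, 4))

data1 <- data.frame(country = countries, year = years,
                    repress = c(1, 2, 4, 3, 2, 3, 4, 1),
                    stringsAsFactors = FALSE)

set.seed(123)
data2 <- data.frame(country = countries, year = years,
                    GDPpc = round(runif(8, 2e3, 20e3), 3),
                    stringsAsFactors = FALSE)

# Country alone is not unique, so every 1999 row pairs with both years
by_country <- merge(data1, data2, by = "country")
nrow(by_country)   # 16

# Country and year together identify each row exactly once
by_both <- merge(data1, data2, by = c("country", "year"))
nrow(by_both)      # 8
by_both
```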
Loops
As one quickly notes, doing any task in R can become repetitive. Loops and functions can dramatically speed up our workflow when a task is systematic and repeatable.
Let’s say we want to calculate how many games each country won when playing at home.
# Code a victory indicator: 1 if the home team won, 0 otherwise
data$victory <- ifelse(data$home_score > data$away_score, 1, 0)
# To do this, we'd need to subset by each group and then calculate the sum.
sub <- data[data$home_team=="Brazil",]
victory_brazil <- sum(sub$victory)
sub <- data[data$home_team=="Argentina",]
victory_argentina <- sum(sub$victory)
group_sums <- c(victory_brazil,victory_argentina) # combine
group_sums
[1] 395 354
This works for a few cases, but it would become quite the undertaking as the number of groups increased. Loops allow you to repeat the operation using some type of index on your data frame.
Here is where loops can make one’s life easier! By “looping through” all the respective groups, we can automate this process so that it goes a lot quicker.
A loop essentially works like this:
(1) Specify a length of something you want to loop through. In our case, it’s the number of groups.
(2) Set the code up so that every iteration only performs a manipulation on a single subset at a time.
(3) Save the contents of each iteration in a new object that won’t be overwritten. Here we want to think in terms of “stacking” results or concatenating them.
In practice…
[1] "Scotland" "England" "Wales"
[4] "Northern Ireland" "USA" "Uruguay"
[7] "Austria" "Hungary" "Argentina"
[10] "Belgium" "France" "Netherlands"
[13] "Czechoslovakia" "Switzerland" "Sweden"
[16] "Germany" "Italy" "Chile"
[19] "Norway" "Finland" "Luxembourg"
[22] "Russia" "Denmark" "Brazil"
[25] "Japan" "Paraguay" "Canada"
[28] "Estonia" "Costa Rica" "Guatemala"
[31] "Spain" "Poland" "Yugoslavia"
[34] "New Zealand" "Romania" "Latvia"
[37] "Portugal" "China" "Australia"
[40] "Lithuania" "Turkey" "Mexico"
[43] "Aruba" "Egypt" "Haiti"
[46] "Philippines" "Bulgaria" "Jamaica"
[49] "Kenya" "Bolivia" "Peru"
[52] "Honduras" "Guyana" "Uganda"
[55] "Belarus" "El Salvador" "Barbados"
[58] "Ireland" "Trinidad and Tobago" "Greece"
[61] "Curaçao" "Dominica" "Guadeloupe"
[64] "Israel" "Suriname" "French Guyana"
[67] "Cuba" "Colombia" "Ecuador"
[70] "St. Kitts and Nevis" "Panama" "Slovakia"
[73] "Manchukuo" "Croatia" "Nicaragua"
[ reached getOption("max.print") -- omitted 216 entries ]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[70] 70 71 72 73 74 75
[ reached getOption("max.print") -- omitted 216 entries ]
# (2) Make the code iterable: Check what you are repeating
sub = data[data$home_team==no.of.groups[1],]
# Here, just by changing where we are in the vector "no.of.groups", we can draw
# out a unique subset
Now combine all these elements in a special base function called for(){}. Note that all the code goes in between the curly braces. Here we need to establish an arbitrary iterator, which I’ll call i in the example below. i will take the value of each entry in the vector 1:length(no.of.groups), e.g. i=1, then i=2, and so on given how many groups we have.
container = c() # Empty Container
for ( i in 1:length(no.of.groups) ){
  sub = data[data$home_team==no.of.groups[i],] # keep only the home games of group i
  mu <- sum(sub$victory)                        # number of home victories
  container <- c(container,mu)                  # stack the result
}
container
[1] 208 298 117 102 208 188 206 250 354 208 270 231 136 171 278 312 271
[18] 204 158 100 22 177 211 395 182 128 70 73 176 102 238 194 108 79
[35] 186 70 180 188 161 52 118 291 16 229 101 43 146 135 184 81 124
[52] 136 49 159 39 118 56 141 170 127 72 28 58 95 87 24 57 124
[69] 97 47 85 59 0 88 14
[ reached getOption("max.print") -- omitted 216 entries ]
country container
1 Scotland 208
2 England 298
3 Wales 117
4 Northern Ireland 102
5 USA 208
6 Uruguay 188
7 Austria 206
8 Hungary 250
9 Argentina 354
10 Belgium 208
11 France 270
12 Netherlands 231
13 Czechoslovakia 136
14 Switzerland 171
15 Sweden 278
16 Germany 312
17 Italy 271
18 Chile 204
19 Norway 158
20 Finland 100
21 Luxembourg 22
22 Russia 177
23 Denmark 211
24 Brazil 395
25 Japan 182
26 Paraguay 128
27 Canada 70
28 Estonia 73
29 Costa Rica 176
30 Guatemala 102
31 Spain 238
32 Poland 194
33 Yugoslavia 108
34 New Zealand 79
35 Romania 186
36 Latvia 70
37 Portugal 180
[ reached 'max' / getOption("max.print") -- omitted 254 rows ]
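A self-contained sketch of the same loop on a small hypothetical data frame (the first six matches of the dataset), so it runs without the full file. Two small variations from the code above, both my choices: seq_along() instead of 1:length(), and a results vector created at full size up front instead of being grown one element at a time.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)
matches$victory <- ifelse(matches$home_score > matches$away_score, 1, 0)

teams <- unique(matches$home_team)   # the groups to loop through
wins <- numeric(length(teams))       # one empty slot per group

for (i in seq_along(teams)) {
  sub <- matches[matches$home_team == teams[i], ]   # one group at a time
  wins[i] <- sum(sub$victory)                       # store the result in slot i
}

result <- data.frame(country = teams, home_wins = wins)
result
```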
Functions
Very often, we have specific tasks that we have to implement all the time.
Building a function for these tasks can really make life easier, and often it makes one’s work more reproducible and transparent.
For example, consider the loop above. It is likely we will perform this by-group sum calculation many times, so it is worth converting it into a function.
Let’s go through the process of building our own functions in R. In basic terms, a function is a named block of code that takes a specific set of arguments and performs a specific task.
Let’s build a simple function that adds two values. Here the function will have two arguments, or put differently, two values that need to be entered for the function to perform. As you’ll note, this looks a lot like the set up for a loop!
add_me <- function( argument1, argument2 ){
value <- argument1 + argument2
return(value) # "return" means "send this back once the function is done"
}
add_me(2,3)
[1] 5
[1] 223
[1] 141
# We can set "default" values for an argument, so if there are no inputs, the
# function will still run.
add_me <- function( argument1=1, argument2=2 ){
value <- argument1 + argument2
return(value)
}
add_me()
[1] 3
[1] 9
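A short sketch of how defaults behave when some, all, or none of the arguments are supplied. The specific calls are my own illustrations.

```r
add_me <- function(argument1 = 1, argument2 = 2) {
  value <- argument1 + argument2
  return(value)
}

add_me()                # both defaults: 1 + 2 = 3
add_me(argument2 = 8)   # override one argument by name: 1 + 8 = 9
add_me(10, 5)           # override both by position: 10 + 5 = 15
```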
The basic structure is the following:
## name.of.the.function <- function(x, y, z) {
##   ## tells R that this is a function and defines the
##   ## arguments it will have, here (x, y, z)
##
##   out <- what the function does
##
##   return(out) ## defines the output of the function
## } ## closes the function
Now, let’s build a function for the sum loop that we constructed in the last section. The arguments we need are straightforward: the data, the name of the group column, and the name of the value column.
group_sum <- function(data, group.var, value.var) {
  no.of.groups = unique(data[, group.var])
  # Does anyone know why I am accessing the data this way?
  container = c() # Empty Container
  for (super_arbitrary_iterator in 1:length(no.of.groups)) {
    sub = data[data[, group.var] == no.of.groups[super_arbitrary_iterator], ]
    mu <- sum(sub[, value.var])
    container <- rbind(container, mu) # return as matrix
  }
  # Lastly, create a data frame
  data_frame = data.frame(no.of.groups, container)
  return(data_frame)
}
# Apply the function to the soccer data: home victories by home team
group_sum(data, group.var = "home_team", value.var = "victory") # beautiful!
no.of.groups container
mu Scotland 208
mu.1 England 298
mu.2 Wales 117
mu.3 Northern Ireland 102
mu.4 USA 208
mu.5 Uruguay 188
mu.6 Austria 206
mu.7 Hungary 250
mu.8 Argentina 354
mu.9 Belgium 208
mu.10 France 270
mu.11 Netherlands 231
mu.12 Czechoslovakia 136
mu.13 Switzerland 171
mu.14 Sweden 278
mu.15 Germany 312
mu.16 Italy 271
mu.17 Chile 204
mu.18 Norway 158
mu.19 Finland 100
mu.20 Luxembourg 22
mu.21 Russia 177
mu.22 Denmark 211
mu.23 Brazil 395
mu.24 Japan 182
mu.25 Paraguay 128
mu.26 Canada 70
mu.27 Estonia 73
mu.28 Costa Rica 176
mu.29 Guatemala 102
mu.30 Spain 238
mu.31 Poland 194
mu.32 Yugoslavia 108
mu.33 New Zealand 79
mu.34 Romania 186
mu.35 Latvia 70
mu.36 Portugal 180
[ reached 'max' / getOption("max.print") -- omitted 254 rows ]
# Change whatever you want here; the function is general. Note that victory still
# marks home wins, so grouping by away_team counts each team's away defeats.
group_sum(data, group.var = "away_team", value.var = "victory") # beautiful!
no.of.groups container
mu England 109
mu.1 Scotland 154
mu.2 Wales 184
mu.3 Northern Ireland 189
mu.4 Canada 97
mu.5 Argentina 141
mu.6 Hungary 188
mu.7 Czechoslovakia 121
mu.8 Uruguay 219
mu.9 France 136
mu.10 Austria 154
mu.11 Switzerland 202
mu.12 Netherlands 126
mu.13 Belgium 155
mu.14 Germany 115
mu.15 Norway 190
mu.16 Sweden 191
mu.17 Italy 103
mu.18 Chile 195
mu.19 Finland 238
mu.20 Russia 104
mu.21 Luxembourg 146
mu.22 Denmark 164
mu.23 Brazil 100
mu.24 USA 140
mu.25 Philippines 84
mu.26 Estonia 134
mu.27 El Salvador 123
mu.28 Costa Rica 137
mu.29 Paraguay 207
mu.30 Yugoslavia 113
mu.31 Poland 167
mu.32 Portugal 116
mu.33 Spain 79
mu.34 Romania 141
mu.35 Australia 74
mu.36 Mexico 128
[ reached 'max' / getOption("max.print") -- omitted 252 rows ]
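For comparison, base R already ships with functions that do this kind of by-group calculation in one line. The sketch below swaps in tapply() and aggregate() in place of group_sum(); it is an alternative technique, shown on a small hypothetical data frame built from the first six matches.

```r
matches <- data.frame(
  home_team  = c("Scotland", "England", "Scotland", "England", "Scotland", "Scotland"),
  home_score = c(0, 4, 2, 2, 3, 4),
  away_score = c(0, 2, 1, 2, 0, 0),
  stringsAsFactors = FALSE
)
matches$victory <- ifelse(matches$home_score > matches$away_score, 1, 0)

# tapply(values, groups, function) returns a named vector
wins <- tapply(matches$victory, matches$home_team, sum)
wins

# aggregate() does the same and returns a data frame
agg <- aggregate(victory ~ home_team, data = matches, FUN = sum)
agg
```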
Final Suggestions
This was a super rushed crash course in R. If you need help, you can always find me here.
The next step in R would be to introduce you to the tidyverse packages. The tidyverse is a family of R packages developed by Hadley Wickham and his colleagues that apply the same language and structure to different tasks in R. In summary, the tidyverse makes tasks such as data management, cleaning, and visualization much easier.
We don’t have time today, but here you can find a workshop I prepared for graduate students about using the tidyverse packages in R.