A Beginner’s Guide to R

Hello everyone! Welcome to your R journey. Today we are exploring some very simple operations using the R software.

Assigning values

There are two ways for assigning values, i.e., using “=” or “<-”.

x = 100
y <- x + 10
z <- 'Marketing'
print(c(x, y, z))

## [1] "100"       "110"       "Marketing"

Printing values

In the above code, we use the print function to print an output. You can either print a single value or multiple values.

print('HKU')

## [1] "HKU"

print(c('a','b','c'))

## [1] "a" "b" "c"

Simple Algebra

You can also try some simple algebraic operations on R.

x <- 100
y <- 50.5
z1 <- x + y
z2 <- x - y
z3 <- x*y
z4 <- x/y
z5 <- y^2
z6 <- sqrt(x)
print(c(z1,z2,z3,z4,z5,z6))

## [1]  150.500000   49.500000 5050.000000    1.980198 2550.250000   10.000000

Logic operators

A logic variable is either “TRUE” or “FALSE”.

x <- TRUE
y <- FALSE
a <- 100
b <- (a == 99)
c <- (a >= 99)
d <- (a != 99)
print(c(b, c, d))

## [1] FALSE  TRUE  TRUE

String operations

String concatenation allows you to merge two concatenate two strings. R provides you with two functions, namely “paste” and “paste0”.

x <- "Welcome"
y <- "to"
z <- "Marketing"
result <- paste(x, y, z)
result0 <- paste0(x, y, z)
print(c(result, result0))

## [1] "Welcome to Marketing" "WelcometoMarketing"

Sometimes we want to convert a number to a string. You can do this:

x <- 100
y <- 200
sx <- toString(x)
sy <- toString(y)
print(c(x, y, paste0(sx,sy)))

## [1] "100"    "200"    "100200"

Counting the length of a string:

str = "Marketing and Big Data"
print(nchar(str))

## [1] 22

Splitting the string based on SPACE:

str = "Marketing and Big Data"
y = strsplit(str,split=' ') 
print(y)

## [[1]]
## [1] "Marketing" "and"       "Big"       "Data"

Obtaining a substring.

str = "Marketing and Big Data"
z = substr(str, 2,5)
#substring from the 2nd to the 5th character
print(z)

## [1] "arke"

List

A list is an ordered sequence of objects:

vec <- c(1,3,5)
char_vec <- c("Spring", "Summer", "Autumn", "Winter")
logic_vec <- c(TRUE, FALSE, TRUE, FALSE)
print(vec)

## [1] 1 3 5

print(char_vec)

## [1] "Spring" "Summer" "Autumn" "Winter"

print(logic_vec)

## [1]  TRUE FALSE  TRUE FALSE

You may also put different types of variables in a list:

list <- c(100, 'HKU', TRUE, 10 + 15)
print(list)

## [1] "100"  "HKU"  "TRUE" "25"

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Simple Plots

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
plot(x, y)

plot(x, y, type = "b")

plot(x, y, pch = 17)

The “pch” instruction defines the appearance of your points, while the “lty” instruction defines the appearance of your line. For details, please refer to pch notes and lty notes

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
plot(x, y, pch = 2, lty = 2, type = "b")

You can find more details of plotting here.

Data Plots

R also provides us with some useful packages for data plotting. The most useful one is called “ggplot”. You need to install it first.

# install the package
 install.packages("ggplot2", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/Li Xi/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)

## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Li Xi\AppData\Local\Temp\RtmpesN4Kj\downloaded_packages

# use the package
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.1

# create data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
data <- data.frame(x, y)
# Plot
ggplot(data, aes(x, y)) + geom_point()

ggplot(data, aes(x, y)) + geom_area()

Next let us add colors to the plots.

# Here, for each individual, we have three values, its X, its Y, and its gender (e.g., 1 for male and 0 for female).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
gender <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
data <- data.frame(x, y, gender)
# We plot the points in red
ggplot(data, aes(x, y, color = 'red')) + geom_point()

# We plot the points based on gender (one color for one gender)
ggplot(data, aes(x, y, color = gender)) + geom_point()

Data Frame

A data frame is a list of variables of the same number of rows with unique row names, given class “data.frame”.

  employees <- data.frame(
    name = c('Alice', 'Bob', 'Carol', 'Denis'),
    salary = c(20000,19000,23000,22000), 
    job = c('IT', 'Sales', 'Finance', 'IT' ))

You may have missing values in your data frame. In this case you can enter “NA” to represent the missing value.

employees <- data.frame(
  name = c('Alice', 'Bob', 'Carol','Denis'),
  salary = c(20000,NA,23000,22000),
  job = c('IT', 'Sales', NA, 'IT'))

You can use the dollar sign“$” to select a specific variable:

print(employees)

##    name salary   job
## 1 Alice  20000    IT
## 2   Bob     NA Sales
## 3 Carol  23000  <NA>
## 4 Denis  22000    IT

print(summary(employees))

##      name               salary          job           
##  Length:4           Min.   :20000   Length:4          
##  Class :character   1st Qu.:21000   Class :character  
##  Mode  :character   Median :22000   Mode  :character  
##                     Mean   :21667                     
##                     3rd Qu.:22500                     
##                     Max.   :23000                     
##                     NA's   :1

print(employees$name)

## [1] "Alice" "Bob"   "Carol" "Denis"

Simple Satistics

vector <- c(0, 8, 4, 6, 7, 9, 5)
print(mean(vector))

## [1] 5.571429

print(median(vector))

## [1] 6

print(var(vector))     #variance

## [1] 8.952381

print(sd(vector))      #standard deviation

## [1] 2.992053

print(max(vector))     #maximum

## [1] 9

print(min(vector))     #mimimum

## [1] 0

print(sort(vector))    #sort the data in increasing order

## [1] 0 4 5 6 7 8 9

IF ELSE

x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else
  print("Zero")

## [1] "Zero"

WHILE LOOP

count = 0
while (count <= 5)
{
  count = count + 1
  print(count)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6

FOR LOOP

vector = c(1, 3, 5, 7)
print(vector)

## [1] 1 3 5 7

for (item in vector)
  print(item)

## [1] 1
## [1] 3
## [1] 5
## [1] 7

FOR LOOP with skip logic

for (year in 2000: 2020)
  print (year)

## [1] 2000
## [1] 2001
## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020

for (year in 2000: 2020)
{
  if(year == 2008)
    next        #skip to the next iteration
  print (year)
}

## [1] 2000
## [1] 2001
## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020

Functions

f1 <- function(a) {
  print(a + 1)
}

f1(0.5)

## [1] 1.5

f1(2)

## [1] 3

Another example:

f2 <- function(a) {
  return(a+2)
}

print(f2(0.5))

## [1] 2.5

print(f2(2))

## [1] 4

Yet another example:

f3 <- function(a) {
  a <- toString(a)
  a <- paste(a, "data")
  return (a)
}

print(f3(100))

## [1] "100 data"

print(f3("big"))

## [1] "big data"

More examples…

f4 <- function(a) {
  r1 <- a + 1
  r2 <- a + 2
  mylist <- list("r1" = r1, "r2" = r2)
  return(mylist)
}

mylist <- f4(15)
print(mylist$r1)

## [1] 16

print(mylist$r2)

## [1] 17

Generating random variables from uniform distribution:

a = runif(1)   #generate a random number between 0 and 1
print(a)

## [1] 0.8493704

vec = runif(5) #generate a list of 5 random numbers
print(vec)

## [1] 0.5981136 0.4764372 0.1625948 0.1456570 0.8633295

vec = runif(3, min=0, max=100) 
#generate 3 random numbers between 0 and 100
print(vec)

## [1] 65.45418 25.33918 34.43393

Generating random variables from normal distribution:

x = rnorm(1) 
#generate a random number using the standard normal distribution
print (x)

## [1] 0.7875174

y = rnorm(4, mean=50, sd=10)
#generate 4 random numbers following the specified normal distribition
print (y)

## [1] 45.00705 36.68992 73.94876 57.05236

z <- rnorm(1000, mean=50, sd=10)
hist(z)

#generate the histogram of z

Plotting histograms using ggplot2:

library(ggplot2)
z <- rnorm(1000, mean=50, sd=10)
data <- data.frame(z)

figure <- ggplot(data, aes(z))+
  geom_histogram(color="darkblue", fill="lightblue")
figure

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

figure + geom_vline(aes(xintercept=mean(z)),
                       color="blue", linetype="dashed", size=1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Getting and setting your working directories:

getwd()

## [1] "C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm"

#get working directory

setwd('C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm')
#set working directory
getwd()

## [1] "C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm"

Writing to text files:

file1<-file("output.txt")
writeLines(c("Big","Data"), file1)
close(file1)

file2<-file("C:/Users/Li Xi/Dropbox/Marketing Classes/output.txt")
writeLines(c("Big","Data"), file2)
close(file2)

In the above output, Here, “C:/Users/Li Xi/Dropbox/Marketing Classes/output.txt” is the path to your txt file. You can think of it as the address of your txt file.

You can also write: “C:\\Users\\Li Xi\\Dropbox\\Marketing Classes\\output.txt” to refer to your txt file.

However, you cannot write “C:\Users\Li Xi\Dropbox\Marketing Classes\output.txt”.

Another way to write the file:

sink("output.txt")
cat("Big")

## Big

cat("\n")      #set up a new line

cat("Data")

## Data

sink()

Write a dataframe:

employees <- data.frame(
  name = c('Alice', 'Bob', 'Carol','Denis'),
  salary = c(20000,NA,23000,22000),
  job = c('IT', 'Sales', NA, 'IT'))

setwd('C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm')
write.table(employees, file = "output.txt", sep = "\t", row.names = FALSE)

Reading a CSV (Excel) file

R allows you to read data from various files. If you want to read a spreadsheet, you are recommended to save the file as a csv file (Comma-Separated Values), and open it with the following codes:

mydata <- read.csv("C:/Users/Li Xi/Dropbox/r-exercise.csv",
                   fileEncoding = "UTF-8-BOM")

And you can also get the file from a URL address:

mydata <- read.csv("https://ximarketing.github.io/class/teachingfiles/r-exercise.csv", fileEncoding = "UTF-8-BOM")

You can print the first fives rows of the data to see if it works well:

head(mydata)

Summary Statistics

Next we show the summary statistics of the data:

summary(mydata)

##      Rating        Expertise         Votes           Purpose         
##  Min.   :1.000   Min.   :0.000   Min.   : 0.0000   Length:180635     
##  1st Qu.:4.000   1st Qu.:1.000   1st Qu.: 0.0000   Class :character  
##  Median :5.000   Median :3.000   Median : 0.0000   Mode  :character  
##  Mean   :4.286   Mean   :2.892   Mean   : 0.8217                     
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.: 1.0000                     
##  Max.   :5.000   Max.   :6.000   Max.   :75.0000

Number of Rows and Columns

nrow(mydata)

## [1] 180635

ncol(mydata)

## [1] 4

Plotting histograms, again

hist(mydata$Rating)

Choosing a subset of your data

Suppose that we only want to use reviews with rating <= 4.

subdata <- subset(mydata, Rating <= 4)
head(subdata)

Linear Regression (Important!!!)

Suppose that you want to do the following regression analysis on the dataset mydata:

Rating= a + b1 Experience

result <- lm(Rating ~ Expertise, data = mydata)
summary(result)

## 
## Call:
## lm(formula = Rating ~ Expertise, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3326 -0.3003  0.6674  0.7158  0.7642 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.332610   0.003872 1118.98   <2e-16 ***
## Expertise   -0.016138   0.001091  -14.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9531 on 180633 degrees of freedom
## Multiple R-squared:  0.001209,   Adjusted R-squared:  0.001203 
## F-statistic: 218.6 on 1 and 180633 DF,  p-value: < 2.2e-16

In the above regression formulation, “lm” stands for “linear models”.

This means you get the following result:

Rating=4.332−0.016 Experience

In addition, we get the significance value of experience (p-value) is smaller than 2×10^(−16)≪1%, meaning that the coefficient at significantly different from 0. This implies that experienced reviewers assign significant high ratings (to hotels).

We can make predictions based on the regression out. For example, suppose we have another review with expertise 4, then you can do the followings:

prediction <- predict(result, data.frame(Expertise = 4))
print(prediction)

##       1 
## 4.26806

Likewise, we can also run multiple regression:

result <- lm(Votes ~ Expertise + Rating, data = mydata)
summary(result)

## 
## Call:
## lm(formula = Votes ~ Expertise + Rating, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.421 -0.860 -0.686  0.301 74.301 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.573719   0.019761  79.637   <2e-16 ***
## Expertise    0.004350   0.001979   2.198    0.028 *  
## Rating      -0.178399   0.004264 -41.840   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.727 on 180632 degrees of freedom
## Multiple R-squared:  0.009671,   Adjusted R-squared:  0.00966 
## F-statistic:   882 on 2 and 180632 DF,  p-value: < 2.2e-16

Similarly, we can also make predictions based on the regression result:

prediction <- predict(result, data.frame(Expertise = 4, Rating = 2))
print(prediction)

##       1 
## 1.23432

Linear Regression with Fixed Effects

Moreover, we can also run linear regression with fixed effects: Here, we take purpose as a fixed effect which takes the following values: business, couple, family, friend, solo, and unknown.

result <- lm(Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
summary(result)

## 
## Call:
## lm(formula = Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.382 -0.761 -0.593  0.312 73.350 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.405531   0.020426  68.812  < 2e-16 ***
## Expertise               0.004507   0.001960   2.299   0.0215 *  
## Rating                 -0.180778   0.004238 -42.660  < 2e-16 ***
## factor(Purpose)couple   0.176869   0.011199  15.794  < 2e-16 ***
## factor(Purpose)family   0.091479   0.012095   7.564 3.94e-14 ***
## factor(Purpose)friend   0.065447   0.016380   3.996 6.46e-05 ***
## factor(Purpose)solo     0.081461   0.018771   4.340 1.43e-05 ***
## factor(Purpose)Unknown  1.130019   0.016878  66.954  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.704 on 180627 degrees of freedom
## Multiple R-squared:  0.03619,    Adjusted R-squared:  0.03615 
## F-statistic: 968.9 on 7 and 180627 DF,  p-value: < 2.2e-16

And making predictions accordingly…

prediction <- predict(result, data.frame(Expertise = 4, Rating = 2, Purpose = "family"))
print(prediction)

##        1 
## 1.153482

The Stargazer Package

The stargazer package is very useful for organizing the regression output in a table. You need to install it first.

# install the package
install.packages("stargazer", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/Li Xi/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)

## package 'stargazer' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Li Xi\AppData\Local\Temp\RtmpesN4Kj\downloaded_packages

# use the package
library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

stargazer(result, title = "regression output", align = TRUE, out = "regression.html", type = "html")

## 
## <table style="text-align:center"><caption><strong>regression output</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>Votes</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Expertise</td><td>0.005<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Rating</td><td>-0.181<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.004)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)couple</td><td>0.177<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.011)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)family</td><td>0.091<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.012)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)friend</td><td>0.065<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.016)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)solo</td><td>0.081<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.019)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)Unknown</td><td>1.130<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.017)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>1.406<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.020)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>180,635</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.036</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.036</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>1.704 (df = 180627)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>968.891<sup>***</sup> (df = 7; 180627)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

You can also generate an output based on several regressions:

result0 <- lm(Votes ~ Expertise, data = mydata)
result1 <- lm(Votes ~ Expertise + Rating, data = mydata)
result2 <- lm(Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
stargazer(result0, result1, result2, title = "regression output", align = TRUE, out = "regression.html", type = "html")

## 
## <table style="text-align:center"><caption><strong>regression output</strong></caption>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="3">Votes</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Expertise</td><td>0.007<sup>***</sup></td><td>0.004<sup>**</sup></td><td>0.005<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td><td>(0.002)</td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Rating</td><td></td><td>-0.178<sup>***</sup></td><td>-0.181<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.004)</td><td>(0.004)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)couple</td><td></td><td></td><td>0.177<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.011)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)family</td><td></td><td></td><td>0.091<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.012)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)friend</td><td></td><td></td><td>0.065<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.016)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)solo</td><td></td><td></td><td>0.081<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.019)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)Unknown</td><td></td><td></td><td>1.130<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.017)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.801<sup>***</sup></td><td>1.574<sup>***</sup></td><td>1.406<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.007)</td><td>(0.020)</td><td>(0.020)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>180,635</td><td>180,635</td><td>180,635</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.0001</td><td>0.010</td><td>0.036</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.0001</td><td>0.010</td><td>0.036</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>1.736 (df = 180633)</td><td>1.727 (df = 180632)</td><td>1.704 (df = 180627)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>13.230<sup>***</sup> (df = 1; 180633)</td><td>881.976<sup>***</sup> (df = 2; 180632)</td><td>968.891<sup>***</sup> (df = 7; 180627)</td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

We can easily generate the summary statistics of our dataset:

stargazer(mydata, title = "Summary Statistics", align = TRUE, out = "summary.html", type = "html")

## 
## <table style="text-align:center"><caption><strong>Summary Statistics</strong></caption>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Rating</td><td>180,635</td><td>4.286</td><td>0.954</td><td>1</td><td>4</td><td>5</td><td>5</td></tr>
## <tr><td style="text-align:left">Expertise</td><td>180,635</td><td>2.892</td><td>2.055</td><td>0</td><td>1</td><td>5</td><td>6</td></tr>
## <tr><td style="text-align:left">Votes</td><td>180,635</td><td>0.822</td><td>1.736</td><td>0</td><td>0</td><td>1</td><td>75</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>

Another Example of Linear Regression

In this regression, our dataset comes from Los Angeles Neighborhoods Data. The data source is here.

It covers some basic information of several neighborhoods in Los Angeles (e.g., income, age, ethnic group, …)

library(ggplot2)
file = "https://ximarketing.github.io/class/teachingfiles/r-exercise.txt"
mydata <- read.table(file, header = TRUE)
ggplot(mydata,aes(y=Income,x=Age))+geom_point()

Plotting the regression line:

result <- lm(Income ~ Age, data = mydata)
summary(result)

## 
## Call:
## lm(formula = Income ~ Age, data = mydata)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70208  -9605  -1762   9509  94968 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -74588      10950  -6.812 5.68e-10 ***
## Age             4097        332  12.341  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19880 on 108 degrees of freedom
## Multiple R-squared:  0.5851, Adjusted R-squared:  0.5813 
## F-statistic: 152.3 on 1 and 108 DF,  p-value: < 2.2e-16

ggplot(mydata,aes(y=Income,x=Age))+geom_point()+geom_smooth(method="lm")

## `geom_smooth()` using formula 'y ~ x'

T-Test

It is very convenient to run a t-test in R:

x = c(1, 3, 3, 5, 3, 2, 4, 3, 5, 7)
y = c(2, 6, 3, 4, 5, 2, 5, 8, 1, 6)
t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -0.68034, df = 16.975, p-value = 0.5055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.46089  1.26089
## sample estimates:
## mean of x mean of y 
##       3.6       4.2

Visualizing t-tests:

male <- c(18,22,21,17,20,17,23,20,22,21)
female <- c(16,20,14,21,20,18,13,15,17,21)
data        = c(mean(male), mean(female))
names(data) = c("male", "female")
se       = c(sd(male)/sqrt(length(male)), 
          sd(female)/sqrt(length(female)))
windows()
bp = barplot(data, ylim=c(16, 21), xpd=FALSE)
box()
arrows(x0=bp, y0=data-se, y1=data+se, code=3, angle=90)