Hello everyone! Welcome to your R journey. Today we are exploring some very simple operations using the R software.
There are two ways to assign values, i.e., using “=” or “<-”.
x = 100
y <- x + 10
z <- 'Marketing'
print(c(x, y, z))
## [1] "100" "110" "Marketing"
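In the above code, we use the print function to display output. Because c(x, y, z) mixes numbers with a string, R converts all three values to strings, which is why 100 and 110 appear in quotes. You can either print a single value or multiple values.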
print('HKU')
## [1] "HKU"
print(c('a','b','c'))
## [1] "a" "b" "c"
You can also try some simple algebraic operations in R.
x <- 100
y <- 50.5
z1 <- x + y
z2 <- x - y
z3 <- x*y
z4 <- x/y
z5 <- y^2
z6 <- sqrt(x)
print(c(z1,z2,z3,z4,z5,z6))
## [1] 150.500000 49.500000 5050.000000 1.980198 2550.250000 10.000000
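Two further arithmetic operators that often appear alongside the ones above are integer division and the remainder. The following is a minimal sketch that reuses the value 100 from the example; the divisor 7 is only an illustrative choice.

```r
x <- 100
q <- x %/% 7   # integer division: how many whole times 7 fits into 100
r <- x %% 7    # remainder left over after that division
print(c(q, r))
# the two pieces reconstruct the original value
print(q * 7 + r == x)
```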
A logical variable is either “TRUE” or “FALSE”.
x <- TRUE
y <- FALSE
a <- 100
b <- (a == 99)
c <- (a >= 99)
d <- (a != 99)
print(c(b, c, d))
## [1] FALSE TRUE TRUE
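Logical values can also be combined. The sketch below, which assumes the same value a = 100 as above, shows the AND, OR, and NOT operators; the variable names both, either, and negate are made up for illustration.

```r
a <- 100
both   <- (a >= 99) & (a != 100)   # AND: TRUE only if both sides are TRUE
either <- (a >= 99) | (a != 100)   # OR: TRUE if at least one side is TRUE
negate <- !(a == 99)               # NOT: flips TRUE and FALSE
print(c(both, either, negate))
```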
String concatenation allows you to merge two or more strings. R provides two functions for this, namely “paste” and “paste0”. “paste” inserts a space between the strings by default, while “paste0” inserts nothing.
x <- "Welcome"
y <- "to"
z <- "Marketing"
result <- paste(x, y, z)
result0 <- paste0(x, y, z)
print(c(result, result0))
## [1] "Welcome to Marketing" "WelcometoMarketing"
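Both functions accept a few optional arguments that are worth knowing. As a small sketch (the example strings are arbitrary), sep changes the separator placed between arguments, and collapse joins the elements of a single vector into one string.

```r
words <- c("Welcome", "to", "Marketing")
# sep controls what goes between the arguments
dashed <- paste("Big", "Data", sep = "-")
# collapse joins the elements of one vector into a single string
sentence <- paste(words, collapse = " ")
print(dashed)
print(sentence)
```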
Sometimes we want to convert a number to a string. You can do this:
x <- 100
y <- 200
sx <- toString(x)
sy <- toString(y)
print(c(x, y, paste0(sx,sy)))
## [1] "100" "200" "100200"
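The conversion also works in the other direction. The sketch below assumes a string that really contains a number; as.numeric turns it into a number, and as.character is an alternative to toString for a single value. The names s, n, and txt are illustrative only.

```r
s <- "100"
n <- as.numeric(s)        # string -> number
print(n + 1)
txt <- as.character(n)    # number -> string
print(paste0(txt, "%"))
```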
Counting the length of a string:
str = "Marketing and Big Data"
print(nchar(str))
## [1] 22
Splitting the string based on SPACE:
str = "Marketing and Big Data"
y = strsplit(str,split=' ')
print(y)
## [[1]]
## [1] "Marketing" "and" "Big" "Data"
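The `[[1]]` in the output shows that strsplit returns a list with one element per input string. A minimal sketch of how one might pull the individual words out of that list (the names parts and words are chosen for illustration):

```r
str <- "Marketing and Big Data"
parts <- strsplit(str, split = " ")
words <- parts[[1]]        # take the first (and only) element of the list
print(words[1])            # first word
print(length(words))       # number of words
```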
Obtaining a substring.
str = "Marketing and Big Data"
z = substr(str, 2,5)
#substring from the 2nd to the 5th character
print(z)
## [1] "arke"
A vector is an ordered sequence of values of the same type, created with c():
vec <- c(1,3,5)
char_vec <- c("Spring", "Summer", "Autumn", "Winter")
logic_vec <- c(TRUE, FALSE, TRUE, FALSE)
print(vec)
## [1] 1 3 5
print(char_vec)
## [1] "Spring" "Summer" "Autumn" "Winter"
print(logic_vec)
## [1] TRUE FALSE TRUE FALSE
You may also put values of different types in c(), but R then converts them all to a common type (here, character strings):
list <- c(100, 'HKU', TRUE, 10 + 15)
print(list)
## [1] "100" "HKU" "TRUE" "25"
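Every element in the output above is a string because c() converts its arguments to one common type. If each element should keep its own type, the list() function can be used instead. This is a small sketch; the name mixed is hypothetical.

```r
mixed <- list(100, "HKU", TRUE, 10 + 15)
print(class(mixed[[1]]))   # the number stays numeric
print(class(mixed[[2]]))   # the string stays character
print(class(mixed[[3]]))   # the logical value stays logical
# numeric elements can still be used in arithmetic
print(mixed[[1]] + mixed[[4]])
```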
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
plot(x, y)
plot(x, y, type = "b")
plot(x, y, pch = 17)
The “pch” argument defines the appearance of your points, while the “lty” argument defines the appearance of your line. For details, see the help pages ?points (for pch) and ?par (for lty).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
plot(x, y, pch = 2, lty = 2, type = "b")
You can find more details on plotting in the help page ?plot.
R also provides us with some useful packages for data plotting. One of the most widely used is called “ggplot2”. You need to install it first.
# install the package
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Li Xi/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Li Xi\AppData\Local\Temp\RtmpesN4Kj\downloaded_packages
# use the package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.1
# create data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
data <- data.frame(x, y)
# Plot
ggplot(data, aes(x, y)) + geom_point()
ggplot(data, aes(x, y)) + geom_area()
Next let us add colors to the plots.
# Here, for each individual, we have three values, its X, its Y, and its gender (e.g., 1 for male and 0 for female).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(9, 7, 7, 4.5, 6, 7.8, 7, 3, 6, 2, 4)
gender <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
data <- data.frame(x, y, gender)
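# We plot the points in red (a fixed color is set outside aes())
ggplot(data, aes(x, y)) + geom_point(color = 'red')
# We plot the points based on gender (one color for one gender)
# factor() makes gender a category, so each value gets its own color
ggplot(data, aes(x, y, color = factor(gender))) + geom_point()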
A data frame is a list of variables of the same number of rows with unique row names, given class “data.frame”.
employees <- data.frame(
name = c('Alice', 'Bob', 'Carol', 'Denis'),
salary = c(20000,19000,23000,22000),
job = c('IT', 'Sales', 'Finance', 'IT' ))
You may have missing values in your data frame. In this case you can enter “NA” to represent the missing value.
employees <- data.frame(
name = c('Alice', 'Bob', 'Carol','Denis'),
salary = c(20000,NA,23000,22000),
job = c('IT', 'Sales', NA, 'IT'))
You can use the dollar sign “$” to select a specific variable:
print(employees)
## name salary job
## 1 Alice 20000 IT
## 2 Bob NA Sales
## 3 Carol 23000 <NA>
## 4 Denis 22000 IT
print(summary(employees))
## name salary job
## Length:4 Min. :20000 Length:4
## Class :character 1st Qu.:21000 Class :character
## Mode :character Median :22000 Mode :character
## Mean :21667
## 3rd Qu.:22500
## Max. :23000
## NA's :1
print(employees$name)
## [1] "Alice" "Bob" "Carol" "Denis"
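Missing values need some care once you start computing with a column. The following sketch, which rebuilds the same employees data frame, shows one way to detect NA values, ignore them in a mean, and select rows by a condition; the names avg and it_staff are illustrative.

```r
employees <- data.frame(
  name = c('Alice', 'Bob', 'Carol', 'Denis'),
  salary = c(20000, NA, 23000, 22000),
  job = c('IT', 'Sales', NA, 'IT'))
print(is.na(employees$salary))               # which salaries are missing
avg <- mean(employees$salary, na.rm = TRUE)  # na.rm = TRUE ignores missing values
print(avg)
# which() drops the NA comparison, so only rows known to be IT are kept
it_staff <- employees[which(employees$job == 'IT'), ]
print(it_staff$name)
```

Without na.rm = TRUE, mean() would return NA because one salary is unknown.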
vector <- c(0, 8, 4, 6, 7, 9, 5)
print(mean(vector))
## [1] 5.571429
print(median(vector))
## [1] 6
print(var(vector)) #variance
## [1] 8.952381
print(sd(vector)) #standard deviation
## [1] 2.992053
print(max(vector)) #maximum
## [1] 9
print(min(vector)) #minimum
## [1] 0
print(sort(vector)) #sort the data in increasing order
## [1] 0 4 5 6 7 8 9
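It can help to see what var and sd compute. Both use the sample formulas, which divide by n - 1 rather than n. A minimal sketch that reproduces them by hand on the same vector (manual_var and manual_sd are illustrative names):

```r
vector <- c(0, 8, 4, 6, 7, 9, 5)
n <- length(vector)
manual_var <- sum((vector - mean(vector))^2) / (n - 1)  # sample variance
manual_sd <- sqrt(manual_var)                           # sample standard deviation
print(c(manual_var, var(vector)))
print(c(manual_sd, sd(vector)))
```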
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else
print("Zero")
## [1] "Zero"
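The if statement above checks one value at a time. When the same test should be applied to every element of a vector, the ifelse function is a common alternative. A small sketch with made-up input values:

```r
values <- c(-2, 0, 3)
# ifelse(test, value_if_true, value_if_false), applied element by element
signs <- ifelse(values < 0, "Negative number",
                ifelse(values > 0, "Positive number", "Zero"))
print(signs)
```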
count = 0
while (count <= 5)
{
count = count + 1
print(count)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
vector = c(1, 3, 5, 7)
print(vector)
## [1] 1 3 5 7
for (item in vector)
print(item)
## [1] 1
## [1] 3
## [1] 5
## [1] 7
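Sometimes the position of each item is needed as well as its value. One possible approach, sketched below, loops over the positions with seq_along and accumulates a running total; the variable names i and total are illustrative.

```r
vector <- c(1, 3, 5, 7)
total <- 0
for (i in seq_along(vector)) {
  total <- total + vector[i]     # vector[i] is the i-th element
  print(paste("item", i, "is", vector[i]))
}
print(total)
```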
for (year in 2000: 2020)
print (year)
## [1] 2000
## [1] 2001
## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020
for (year in 2000: 2020)
{
if(year == 2008)
next #skip to the next iteration
print (year)
}
## [1] 2000
## [1] 2001
## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
## [1] 2015
## [1] 2016
## [1] 2017
## [1] 2018
## [1] 2019
## [1] 2020
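While next skips a single iteration, break leaves the loop altogether. The sketch below uses the same range of years and collects the values that were reached before the loop stopped; the vector name printed is an illustrative choice.

```r
printed <- c()
for (year in 2000:2020) {
  if (year == 2008)
    break                       # leave the loop entirely
  printed <- c(printed, year)   # remember the years reached so far
}
print(printed)
```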
f1 <- function(a) {
print(a + 1)
}
f1(0.5)
## [1] 1.5
f1(2)
## [1] 3
Another example:
f2 <- function(a) {
return(a+2)
}
print(f2(0.5))
## [1] 2.5
print(f2(2))
## [1] 4
Yet another example:
f3 <- function(a) {
a <- toString(a)
a <- paste(a, "data")
return (a)
}
print(f3(100))
## [1] "100 data"
print(f3("big"))
## [1] "big data"
More examples…
f4 <- function(a) {
r1 <- a + 1
r2 <- a + 2
mylist <- list("r1" = r1, "r2" = r2)
return(mylist)
}
mylist <- f4(15)
print(mylist$r1)
## [1] 16
print(mylist$r2)
## [1] 17
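Functions can also take more than one argument, and an argument can be given a default value. A minimal sketch, where f5 and step are hypothetical names that continue the numbering above:

```r
f5 <- function(a, step = 1) {
  return(a + step)
}
print(f5(10))            # uses the default step of 1
print(f5(10, step = 5))  # overrides the default
```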
Generating random variables from a uniform distribution:
a = runif(1) #generate a random number between 0 and 1
print(a)
## [1] 0.8493704
vec = runif(5) #generate a vector of 5 random numbers
print(vec)
## [1] 0.5981136 0.4764372 0.1625948 0.1456570 0.8633295
vec = runif(3, min=0, max=100)
#generate 3 random numbers between 0 and 100
print(vec)
## [1] 65.45418 25.33918 34.43393
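Because the numbers are random, you will get different values each time you run the code above. If reproducible results are needed, the seed of the random number generator can be fixed with set.seed. In this sketch the seed 123 is an arbitrary choice.

```r
set.seed(123)            # 123 is an arbitrary seed
first <- runif(3)
set.seed(123)            # resetting the seed restarts the same sequence
second <- runif(3)
print(identical(first, second))
```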
Generating random variables from a normal distribution:
x = rnorm(1)
#generate a random number using the standard normal distribution
print (x)
## [1] 0.7875174
y = rnorm(4, mean=50, sd=10)
#generate 4 random numbers following the specified normal distribution
print (y)
## [1] 45.00705 36.68992 73.94876 57.05236
z <- rnorm(1000, mean=50, sd=10)
hist(z)
#generate the histogram of z
Plotting histograms using ggplot2:
library(ggplot2)
z <- rnorm(1000, mean=50, sd=10)
data <- data.frame(z)
figure <- ggplot(data, aes(z))+
geom_histogram(color="darkblue", fill="lightblue")
figure
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
figure + geom_vline(aes(xintercept=mean(z)),
color="blue", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
getwd()
## [1] "C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm"
#get working directory
setwd('C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm')
#set working directory
getwd()
## [1] "C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm"
file1<-file("output.txt")
writeLines(c("Big","Data"), file1)
close(file1)
file2<-file("C:/Users/Li Xi/Dropbox/Marketing Classes/output.txt")
writeLines(c("Big","Data"), file2)
close(file2)
In the above code, “C:/Users/Li Xi/Dropbox/Marketing Classes/output.txt” is the path to your txt file. You can think of it as the address of your txt file.
You can also write: “C:\\Users\\Li Xi\\Dropbox\\Marketing Classes\\output.txt” to refer to your txt file.
However, you cannot write “C:\Users\Li Xi\Dropbox\Marketing Classes\output.txt”, because R treats a single backslash inside a string as the start of an escape sequence.
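The short sketch below illustrates the point about slashes. It uses a shortened, hypothetical path rather than a real file, and only inspects the strings without touching the disk.

```r
p1 <- "C:/Users/output.txt"      # forward slashes
p2 <- "C:\\Users\\output.txt"    # each \\ in the code is one backslash in the string
print(nchar(p1))
print(nchar(p2))                 # same length: the doubled backslash counts once
cat(p2, "\n")                    # cat shows the string as it is actually stored
# file.path() builds a path from its parts using forward slashes
p3 <- file.path("C:", "Users", "output.txt")
print(p3 == p1)
```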
Another way to write the file:
sink("output.txt")
cat("Big")
## Big
cat("\n") #set up a new line
cat("Data")
## Data
sink()
Writing a data frame:
employees <- data.frame(
name = c('Alice', 'Bob', 'Carol','Denis'),
salary = c(20000,NA,23000,22000),
job = c('IT', 'Sales', NA, 'IT'))
setwd('C:/Users/Li Xi/Dropbox/Marketing Classes/Algorithm')
write.table(employees, file = "output.txt", sep = "\t", row.names = FALSE)
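To check that a file was written correctly, it can be read back with read.table. The sketch below writes to a temporary file created with tempfile, which is an assumption made so the example does not depend on a particular working directory.

```r
employees <- data.frame(
  name = c('Alice', 'Bob', 'Carol', 'Denis'),
  salary = c(20000, NA, 23000, 22000),
  job = c('IT', 'Sales', NA, 'IT'))
tmp <- tempfile(fileext = ".txt")   # temporary file, so nothing is overwritten
write.table(employees, file = tmp, sep = "\t", row.names = FALSE)
# header = TRUE tells R that the first line holds the column names
back <- read.table(tmp, header = TRUE, sep = "\t")
print(back)
print(is.na(back$salary))           # the missing salary is still NA
```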
R allows you to read data from various files. If you want to read a spreadsheet, we recommend saving it as a CSV (comma-separated values) file and opening it with the following code:
mydata <- read.csv("C:/Users/Li Xi/Dropbox/r-exercise.csv",
fileEncoding = "UTF-8-BOM")
And you can also get the file from a URL address:
mydata <- read.csv("https://ximarketing.github.io/class/teachingfiles/r-exercise.csv", fileEncoding = "UTF-8-BOM")
You can print the first six rows of the data (the default for head) to see if it works well:
head(mydata)
Next we show the summary statistics of the data:
summary(mydata)
## Rating Expertise Votes Purpose
## Min. :1.000 Min. :0.000 Min. : 0.0000 Length:180635
## 1st Qu.:4.000 1st Qu.:1.000 1st Qu.: 0.0000 Class :character
## Median :5.000 Median :3.000 Median : 0.0000 Mode :character
## Mean :4.286 Mean :2.892 Mean : 0.8217
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.: 1.0000
## Max. :5.000 Max. :6.000 Max. :75.0000
nrow(mydata)
## [1] 180635
ncol(mydata)
## [1] 4
hist(mydata$Rating)
Suppose that we only want to use reviews with rating <= 4.
subdata <- subset(mydata, Rating <= 4)
head(subdata)
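subset can combine several conditions and can also keep only some columns. The sketch below uses a small made-up data frame called reviews with the same column names, so that it runs without downloading the review file.

```r
# hypothetical mini data set with the same column names as mydata
reviews <- data.frame(
  Rating = c(5, 3, 4, 2, 5),
  Expertise = c(1, 4, 2, 6, 0),
  Votes = c(0, 2, 1, 5, 0))
# both conditions must hold for a row to be kept
low <- subset(reviews, Rating <= 4 & Expertise >= 2)
print(low)
# select keeps only the listed columns
few <- subset(reviews, Rating <= 4, select = c(Rating, Votes))
print(few)
```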
Suppose that you want to run the following regression on the dataset mydata:
Rating = a + b1 × Expertise
result <- lm(Rating ~ Expertise, data = mydata)
summary(result)
##
## Call:
## lm(formula = Rating ~ Expertise, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3326 -0.3003 0.6674 0.7158 0.7642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.332610 0.003872 1118.98 <2e-16 ***
## Expertise -0.016138 0.001091 -14.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9531 on 180633 degrees of freedom
## Multiple R-squared: 0.001209, Adjusted R-squared: 0.001203
## F-statistic: 218.6 on 1 and 180633 DF, p-value: < 2.2e-16
In the above regression code, “lm” stands for “linear model”.
This means you get the following result:
Rating = 4.333 − 0.016 × Expertise
In addition, the p-value of the Expertise coefficient is smaller than 2×10^(−16) ≪ 1%, meaning that the coefficient is significantly different from 0. Because the coefficient is negative, this implies that more experienced reviewers assign significantly lower ratings (to hotels), although the effect is small.
We can make predictions based on the regression output. For example, suppose we have another review with expertise 4; then you can do the following:
prediction <- predict(result, data.frame(Expertise = 4))
print(prediction)
## 1
## 4.26806
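The predicted value is simply the intercept plus the slope times the new Expertise value. The sketch below first redoes that arithmetic with the coefficients copied from the summary output, and then shows the same check on R's built-in cars data set, which is used here only as a stand-in so the code runs without the review file.

```r
# coefficients copied from the regression summary above
intercept <- 4.332610
slope <- -0.016138
by_hand <- intercept + slope * 4
print(by_hand)

# same idea on the built-in cars data (a stand-in for mydata)
fit <- lm(dist ~ speed, data = cars)
manual <- coef(fit)[1] + coef(fit)[2] * 10      # intercept + slope * 10
auto <- predict(fit, data.frame(speed = 10))    # what predict() returns
print(c(manual, auto))
```

The hand calculation agrees with the printed prediction up to rounding of the coefficients.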
Likewise, we can also run multiple regression:
result <- lm(Votes ~ Expertise + Rating, data = mydata)
summary(result)
##
## Call:
## lm(formula = Votes ~ Expertise + Rating, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.421 -0.860 -0.686 0.301 74.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.573719 0.019761 79.637 <2e-16 ***
## Expertise 0.004350 0.001979 2.198 0.028 *
## Rating -0.178399 0.004264 -41.840 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.727 on 180632 degrees of freedom
## Multiple R-squared: 0.009671, Adjusted R-squared: 0.00966
## F-statistic: 882 on 2 and 180632 DF, p-value: < 2.2e-16
Similarly, we can also make predictions based on the regression result:
prediction <- predict(result, data.frame(Expertise = 4, Rating = 2))
print(prediction)
## 1
## 1.23432
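Moreover, we can also run a linear regression with fixed effects. Here, we take Purpose as a fixed effect, which takes the following values: business, couple, family, friend, solo, and Unknown. R uses the first level (business) as the baseline, so it does not appear in the coefficient table, and the other Purpose coefficients are measured relative to it.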
result <- lm(Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
summary(result)
##
## Call:
## lm(formula = Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.382 -0.761 -0.593 0.312 73.350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.405531 0.020426 68.812 < 2e-16 ***
## Expertise 0.004507 0.001960 2.299 0.0215 *
## Rating -0.180778 0.004238 -42.660 < 2e-16 ***
## factor(Purpose)couple 0.176869 0.011199 15.794 < 2e-16 ***
## factor(Purpose)family 0.091479 0.012095 7.564 3.94e-14 ***
## factor(Purpose)friend 0.065447 0.016380 3.996 6.46e-05 ***
## factor(Purpose)solo 0.081461 0.018771 4.340 1.43e-05 ***
## factor(Purpose)Unknown 1.130019 0.016878 66.954 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.704 on 180627 degrees of freedom
## Multiple R-squared: 0.03619, Adjusted R-squared: 0.03615
## F-statistic: 968.9 on 7 and 180627 DF, p-value: < 2.2e-16
And making predictions accordingly…
prediction <- predict(result, data.frame(Expertise = 4, Rating = 2, Purpose = "family"))
print(prediction)
## 1
## 1.153482
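To see how R picks the baseline category of a factor, and how it could be changed, here is a sketch on the built-in iris data set. It is only a stand-in: Species plays the role that Purpose plays above, and the name iris2 is hypothetical.

```r
# iris is a built-in data set; Species plays the role of Purpose here
fit <- lm(Sepal.Length ~ Sepal.Width + factor(Species), data = iris)
print(levels(factor(iris$Species)))   # the first level is the baseline
print(names(coef(fit)))               # no coefficient for the baseline level

# relevel() chooses a different baseline category
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")
fit2 <- lm(Sepal.Length ~ Sepal.Width + factor(Species), data = iris2)
print(names(coef(fit2)))
```

Changing the baseline changes how the category coefficients are expressed, but not the fitted values of the model.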
The stargazer package is very useful for organizing the regression output in a table. You need to install it first.
# install the package
install.packages("stargazer", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Li Xi/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'stargazer' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Li Xi\AppData\Local\Temp\RtmpesN4Kj\downloaded_packages
# use the package
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(result, title = "regression output", align = TRUE, out = "regression.html", type = "html")
##
## <table style="text-align:center"><caption><strong>regression output</strong></caption>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="1" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>Votes</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Expertise</td><td>0.005<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Rating</td><td>-0.181<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.004)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)couple</td><td>0.177<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.011)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)family</td><td>0.091<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.012)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)friend</td><td>0.065<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.016)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)solo</td><td>0.081<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.019)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)Unknown</td><td>1.130<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.017)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>1.406<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.020)</td></tr>
## <tr><td style="text-align:left"></td><td></td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>180,635</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.036</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.036</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>1.704 (df = 180627)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>968.891<sup>***</sup> (df = 7; 180627)</td></tr>
## <tr><td colspan="2" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
You can also generate an output based on several regressions:
result0 <- lm(Votes ~ Expertise, data = mydata)
result1 <- lm(Votes ~ Expertise + Rating, data = mydata)
result2 <- lm(Votes ~ Expertise + Rating + factor(Purpose), data = mydata)
stargazer(result0, result1, result2, title = "regression output", align = TRUE, out = "regression.html", type = "html")
##
## <table style="text-align:center"><caption><strong>regression output</strong></caption>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="3"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="3" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="3">Votes</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td><td>(3)</td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Expertise</td><td>0.007<sup>***</sup></td><td>0.004<sup>**</sup></td><td>0.005<sup>**</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.002)</td><td>(0.002)</td><td>(0.002)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Rating</td><td></td><td>-0.178<sup>***</sup></td><td>-0.181<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.004)</td><td>(0.004)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)couple</td><td></td><td></td><td>0.177<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.011)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)family</td><td></td><td></td><td>0.091<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.012)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)friend</td><td></td><td></td><td>0.065<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.016)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)solo</td><td></td><td></td><td>0.081<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.019)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">factor(Purpose)Unknown</td><td></td><td></td><td>1.130<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td>(0.017)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.801<sup>***</sup></td><td>1.574<sup>***</sup></td><td>1.406<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.007)</td><td>(0.020)</td><td>(0.020)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td><td></td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>180,635</td><td>180,635</td><td>180,635</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.0001</td><td>0.010</td><td>0.036</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.0001</td><td>0.010</td><td>0.036</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>1.736 (df = 180633)</td><td>1.727 (df = 180632)</td><td>1.704 (df = 180627)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>13.230<sup>***</sup> (df = 1; 180633)</td><td>881.976<sup>***</sup> (df = 2; 180632)</td><td>968.891<sup>***</sup> (df = 7; 180627)</td></tr>
## <tr><td colspan="4" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="3" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
We can easily generate the summary statistics of our dataset:
stargazer(mydata, title = "Summary Statistics", align = TRUE, out = "summary.html", type = "html")
##
## <table style="text-align:center"><caption><strong>Summary Statistics</strong></caption>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Statistic</td><td>N</td><td>Mean</td><td>St. Dev.</td><td>Min</td><td>Pctl(25)</td><td>Pctl(75)</td><td>Max</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Rating</td><td>180,635</td><td>4.286</td><td>0.954</td><td>1</td><td>4</td><td>5</td><td>5</td></tr>
## <tr><td style="text-align:left">Expertise</td><td>180,635</td><td>2.892</td><td>2.055</td><td>0</td><td>1</td><td>5</td><td>6</td></tr>
## <tr><td style="text-align:left">Votes</td><td>180,635</td><td>0.822</td><td>1.736</td><td>0</td><td>0</td><td>1</td><td>75</td></tr>
## <tr><td colspan="8" style="border-bottom: 1px solid black"></td></tr></table>
In this example, our dataset comes from the Los Angeles Neighborhoods Data.
It covers some basic information about several neighborhoods in Los Angeles (e.g., income, age, ethnic group, …).
library(ggplot2)
file = "https://ximarketing.github.io/class/teachingfiles/r-exercise.txt"
mydata <- read.table(file, header = TRUE)
ggplot(mydata,aes(y=Income,x=Age))+geom_point()
Plotting the regression line:
result <- lm(Income ~ Age, data = mydata)
summary(result)
##
## Call:
## lm(formula = Income ~ Age, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70208 -9605 -1762 9509 94968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -74588 10950 -6.812 5.68e-10 ***
## Age 4097 332 12.341 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19880 on 108 degrees of freedom
## Multiple R-squared: 0.5851, Adjusted R-squared: 0.5813
## F-statistic: 152.3 on 1 and 108 DF, p-value: < 2.2e-16
ggplot(mydata,aes(y=Income,x=Age))+geom_point()+geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
It is very convenient to run a t-test in R:
x = c(1, 3, 3, 5, 3, 2, 4, 3, 5, 7)
y = c(2, 6, 3, 4, 5, 2, 5, 8, 1, 6)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -0.68034, df = 16.975, p-value = 0.5055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.46089 1.26089
## sample estimates:
## mean of x mean of y
## 3.6 4.2
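The result of t.test can also be stored in a variable, so that individual numbers can be used later in your code. A minimal sketch with the same two samples (res and significant are illustrative names, and the 5% threshold is a conventional assumption):

```r
x <- c(1, 3, 3, 5, 3, 2, 4, 3, 5, 7)
y <- c(2, 6, 3, 4, 5, 2, 5, 8, 1, 6)
res <- t.test(x, y)
print(res$statistic)     # the t value
print(res$p.value)       # the p-value
print(res$conf.int)      # the 95 percent confidence interval
significant <- res$p.value < 0.05   # compare with the 5% level
print(significant)
```

Here the p-value is well above 0.05, so the difference between the two means is not statistically significant.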
Visualizing t-tests:
male <- c(18,22,21,17,20,17,23,20,22,21)
female <- c(16,20,14,21,20,18,13,15,17,21)
data = c(mean(male), mean(female))
names(data) = c("male", "female")
se = c(sd(male)/sqrt(length(male)),
sd(female)/sqrt(length(female)))
dev.new() #open a new graphics window (windows() works only on Windows)
bp = barplot(data, ylim=c(16, 21), xpd=FALSE)
box()
arrows(x0=bp, y0=data-se, y1=data+se, code=3, angle=90)