Sample Analysis

First, you can directly load the dataset from the following URL:

mydata <- read.csv("https://ximarketing.github.io/class/TripAdvisor.csv", 
      fileEncoding = "UTF-8-BOM")
summary(mydata)

##      Name               Local        CountRestaurant   CountReview       
##  Length:149912      Min.   :0.0000   Min.   :   0.00   Length:149912     
##  Class :character   1st Qu.:0.0000   1st Qu.:   5.00   Class :character  
##  Mode  :character   Median :0.0000   Median :  16.00   Mode  :character  
##                     Mean   :0.2012   Mean   :  35.67                     
##                     3rd Qu.:0.0000   3rd Qu.:  42.00                     
##                     Max.   :1.0000   Max.   :2149.00                     
##   CountVotes            Rating         Helpful             Mobile      
##  Length:149912      Min.   :1.000   Min.   :  0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:4.000   1st Qu.:  0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :5.000   Median :  0.0000   Median :0.0000  
##                     Mean   :4.505   Mean   :  0.5684   Mean   :0.2052  
##                     3rd Qu.:5.000   3rd Qu.:  1.0000   3rd Qu.:0.0000  
##                     Max.   :5.000   Max.   :206.0000   Max.   :1.0000  
##   TitleLength        Length           Photo            Date          
##  Min.   :  0.0   Min.   :   0.0   Min.   :0.0000   Length:149912     
##  1st Qu.: 15.0   1st Qu.: 166.0   1st Qu.:0.0000   Class :character  
##  Median : 23.0   Median : 278.0   Median :0.0000   Mode  :character  
##  Mean   : 25.6   Mean   : 357.6   Mean   :0.1765                     
##  3rd Qu.: 33.0   3rd Qu.: 461.0   3rd Qu.:0.0000                     
##  Max.   :128.0   Max.   :5884.0   Max.   :8.0000                     
##     Positive          Negative         Subjectivity          Menu         
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.04412   1st Qu.:0.000000   1st Qu.:0.07014   1st Qu.:0.000000  
##  Median :0.06818   Median :0.003497   Median :0.17417   Median :0.000000  
##  Mean   :0.07593   Mean   :0.012575   Mean   :0.25061   Mean   :0.001047  
##  3rd Qu.:0.09804   3rd Qu.:0.020270   3rd Qu.:0.37500   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :0.500000   Max.   :1.00000   Max.   :1.000000  
##     Building            Meat          Vegetable           Person        
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.01006   Mean   :0.0313   Mean   :0.03086   Mean   :0.006237  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000

Let us convert the date into weekdays (i.e., Monday, Tuesday, …) using the strftime function of R:

mydata$Weekday <- strftime(mydata$Date, "%A")
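Since Date is read in as a character column, it can be safer to parse it explicitly before extracting the weekday. The sketch below uses a few made-up dates and assumes an ISO YYYY-MM-DD format, which may not match the actual file, so adjust the format string as needed. Note that "%A" returns weekday names in the current locale, whereas the wday field of as.POSIXlt is a locale-independent number (0 = Sunday, ..., 6 = Saturday).

```r
# Illustrative dates only; the real Date column may use another format.
dates <- c("2021-09-02", "2021-09-04", "2021-09-05")

# Parse with an explicit format so that unparseable strings become NA
# instead of being silently misread.
parsed <- as.Date(dates, format = "%Y-%m-%d")

# Weekday names (locale-dependent) and weekday numbers (locale-independent).
weekday_name <- format(parsed, "%A")
weekday_num  <- as.POSIXlt(parsed)$wday

# Count how many dates failed to parse before using the result.
sum(is.na(parsed))
```

If the count of NA values is not zero on the real data, the assumed format is probably wrong for at least some rows.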

Suppose that we want to know whether local reviewers are tougher or nicer, and whether individuals behave differently on weekends. We can run the following regression:

result <- lm(Rating ~ Local + factor(Weekday), data = mydata)
summary(result)

## 
## Call:
## lm(formula = Rating ~ Local + factor(Weekday), data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5396 -0.5124  0.4761  0.4876  0.5765 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.532581   0.006238 726.608  < 2e-16 ***
## Local                    -0.084126   0.005365 -15.679  < 2e-16 ***
## factor(Weekday)Monday    -0.015763   0.008183  -1.926  0.05406 .  
## factor(Weekday)Saturday  -0.024908   0.008638  -2.883  0.00393 ** 
## factor(Weekday)Sunday    -0.020167   0.008419  -2.396  0.01660 *  
## factor(Weekday)Thursday  -0.012348   0.008503  -1.452  0.14646    
## factor(Weekday)Tuesday    0.007067   0.008048   0.878  0.37986    
## factor(Weekday)Wednesday -0.008706   0.008306  -1.048  0.29453    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.832 on 149904 degrees of freedom
## Multiple R-squared:  0.001761,   Adjusted R-squared:  0.001714 
## F-statistic: 37.78 on 7 and 149904 DF,  p-value: < 2.2e-16
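Because factor() orders its levels alphabetically by default, Friday is the omitted reference day in the output above, and every weekday coefficient is a difference from Friday. The sketch below, on a small illustrative vector rather than the real data, shows how one might pick a different reference day with relevel(); the choice of Monday here is only an example.

```r
# One entry per day, just to illustrate the level ordering.
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Default: levels are sorted alphabetically, so "Friday" comes first
# and would be the omitted baseline in a regression.
f <- factor(weekday)
levels(f)[1]

# Move "Monday" to the front so that it becomes the baseline instead.
f_monday <- relevel(f, ref = "Monday")
levels(f_monday)[1]
```

With the real data, the same idea would apply by using relevel(factor(Weekday), ref = "Monday") in place of factor(Weekday) in the regression formula.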

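As you can see, local reviewers are tougher: their ratings are about 0.08 points lower on average. People also write tougher reviews on weekends: relative to Friday, the omitted baseline day, ratings are about 0.025 points lower on Saturday and 0.020 points lower on Sunday, and both differences are statistically significant at the 5% level. Note, however, that the model explains very little of the variation in ratings (R-squared of about 0.002).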
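If the question is specifically about weekends versus weekdays, one alternative is to replace the seven-level weekday factor with a single weekend indicator. The sketch below is only an illustration: it runs on simulated data (the data frame toy, the variable Weekend, and the coefficients used to simulate Rating are all made up), so its numbers say nothing about the TripAdvisor reviews.

```r
set.seed(1)  # make the simulated data reproducible
n <- 500
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# Simulated stand-in for mydata with the two predictors we need.
toy <- data.frame(
  Local   = rbinom(n, 1, 0.2),
  Weekday = sample(days, n, replace = TRUE)
)

# Weekend indicator: 1 for Saturday or Sunday, 0 otherwise.
toy$Weekend <- as.integer(toy$Weekday %in% c("Saturday", "Sunday"))

# Made-up data-generating process, for illustration only.
toy$Rating <- 4.5 - 0.1 * toy$Local - 0.05 * toy$Weekend + rnorm(n, sd = 0.8)

# The Weekend coefficient compares weekend days with all weekdays pooled.
fit <- lm(Rating ~ Local + Weekend, data = toy)
summary(fit)
```

On the real data, the same Weekend column could be created from mydata$Weekday and the regression run with data = mydata.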

Xi Li

9/2/2021