In this exercise, we scrape article titles and author names from an issue page of the journal Marketing Science.

As before, we import the rvest package for data scraping.

library(rvest)

Next, we load the webpage and get the content from its HTML.

url <- "https://pubsonline.informs.org/toc/mksc/40/1"
webpage <- read_html(url, encoding = "UTF-8")
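
Loading a page can fail, for example when the network is unavailable or the server refuses the request. The following is a minimal sketch of one way to handle that, not part of the original exercise. The helper name safe_read_html is hypothetical, and the file name is deliberately one that does not exist so the failure path can be seen without a network connection.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns the parsed page,
# or NULL if the page cannot be loaded.
safe_read_html <- function(path_or_url) {
  tryCatch(
    read_html(path_or_url, encoding = "UTF-8"),
    error = function(e) {
      message("Could not load page: ", conditionMessage(e))
      NULL
    }
  )
}

# A deliberately missing local file triggers the error branch.
missing_page <- safe_read_html("this-file-does-not-exist.html")
print(is.null(missing_page))
```

Checking for NULL before calling html_nodes() keeps the rest of the script from stopping on an unreadable page.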

Next, we interpret the HTML file. We look for every “a” node whose parent is an “h5” node with the class "issue-item__title". The code is as follows.

nodes <- html_nodes(webpage,xpath = '//h5[@class="issue-item__title"]/a')
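
To see what this XPath expression matches, here is a small offline sketch. The HTML fragment is made up for illustration and only mimics the structure described above; it is not the journal's actual markup, and the names toy_html, toy_page, toy_nodes, and toy_titles are assumptions.

```r
library(rvest)

# Hypothetical fragment that mimics the structure described above.
toy_html <- '<html><body>
  <h5 class="issue-item__title"><a href="/a1">First Title</a></h5>
  <h5 class="other"><a href="/a2">Not a title</a></h5>
  <p><a href="/a3">Stray link</a></p>
  <h5 class="issue-item__title"><a href="/a4">Second Title</a></h5>
</body></html>'

toy_page <- read_html(toy_html)

# Match only <a> nodes that are children of an <h5> whose class
# attribute is exactly "issue-item__title".
toy_nodes <- html_nodes(toy_page, xpath = '//h5[@class="issue-item__title"]/a')
toy_titles <- html_text(toy_nodes)
print(toy_titles)
```

The links under the other “h5” and under the “p” node are skipped, because their parents do not satisfy the condition in the square brackets.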

We can then print the content of these nodes.

for (node in nodes)
  print(html_text(node))
## [1] "Frontiers: Algorithmic Collusion: Supra-competitive Prices via Independent Algorithms"
## [1] "Frontiers: Moment Marketing: Measuring Dynamics in Cross-Channel Ad Effectiveness"
## [1] "The Effect of Home-Sharing on House Prices and Rents: Evidence from Airbnb"
## [1] "The Impact of Coupons on the Visit-to-Purchase Funnel"
## [1] "Preference Learning and Demand Forecast"
## [1] "The End of the Express Road for Hybrid Vehicles: Can Governments’ Green Product Incentives Backfire?"
## [1] "When Franchisee Service Affects Demand: An Application to the Car Radiator Market and Resale Price Maintenance"
## [1] "Price Fairness and Strategic Obfuscation"
## [1] "A Model of Brand Architecture Choice: A House of Brands vs. A Branded House"
## [1] "When Consumers Learn, Money Burns: Signaling Quality via Advertising with Observational Learning and Word of Mouth"
## [1] "Focus On Authors"
## [1] "Editorial Board"

Next, we print the names of the authors:

nodes <- html_nodes(webpage,xpath = '//a[@class="entryAuthor linkable hlFld-ContribAuthor"]')
for (node in nodes)
  print(html_text(node))
## [1] "Karsten T. Hansen"
## [1] "Kanishka Misra "
## [1] "Mallesh M. Pai"
## [1] "Jia Liu "
## [1] "Shawndra Hill "
## [1] "Kyle Barron"
## [1] "Edward Kung "
## [1] "Davide Proserpio "
## [1] "Arun Gopalakrishnan "
## [1] "Young-Hoon Park "
## [1] "Xinyu Cao "
## [1] "Juanjuan Zhang "
## [1] "Cheng He "
## [1] "O. Cem Ozturk "
## [1] "Chris Gu "
## [1] "Jorge Mario Silva-Risso"
## [1] "Tongil “TI” Kim "
## [1] "William J. Allender "
## [1] "Jura Liaukonyte "
## [1] "Sherif Nasser "
## [1] "Timothy J. Richards "
## [1] "Jungju Yu "
## [1] "Yogesh V. Joshi "
## [1] "Andres Musalem "
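
Several of the names above end with a trailing space. As a sketch of one way to clean them, html_text() takes a trim argument, and it also accepts a whole set of nodes at once, so no loop is needed. The fragment and the author names below are invented for illustration.

```r
library(rvest)

# Hypothetical fragment; the author names are invented.
toy_page <- read_html('<html><body>
  <a class="entryAuthor linkable hlFld-ContribAuthor">Ann Author </a>
  <a class="entryAuthor linkable hlFld-ContribAuthor">Bob Writer</a>
</body></html>')

author_nodes <- html_nodes(
  toy_page,
  xpath = '//a[@class="entryAuthor linkable hlFld-ContribAuthor"]'
)

# Without trimming, the trailing space inside the first link is kept.
raw_names <- html_text(author_nodes)

# trim = TRUE strips leading and trailing whitespace from each name.
clean_names <- html_text(author_nodes, trim = TRUE)
print(clean_names)
```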

These names are not grouped, though: the output lists every author in the issue without showing which article each one wrote. We want to print the author names by article instead.

First, we select the nodes for each article:

article_nodes <- html_nodes(webpage,xpath = '//div[@class="issue-item"]')
print(length(article_nodes))
## [1] 12

For each article, we print its title and authors:

for (article in article_nodes) {
  # The leading "." restricts the search to the current article node
  titles <- html_nodes(article, xpath = './/h5[@class="issue-item__title"]/a')
  print(html_text(titles[1]))
  authors <- html_nodes(article, xpath = './/a[@class="entryAuthor linkable hlFld-ContribAuthor"]')
  for (author in authors)
    print(html_text(author))
}
## [1] "Frontiers: Algorithmic Collusion: Supra-competitive Prices via Independent Algorithms"
## [1] "Karsten T. Hansen"
## [1] "Kanishka Misra "
## [1] "Mallesh M. Pai"
## [1] "Frontiers: Moment Marketing: Measuring Dynamics in Cross-Channel Ad Effectiveness"
## [1] "Jia Liu "
## [1] "Shawndra Hill "
## [1] "The Effect of Home-Sharing on House Prices and Rents: Evidence from Airbnb"
## [1] "Kyle Barron"
## [1] "Edward Kung "
## [1] "Davide Proserpio "
## [1] "The Impact of Coupons on the Visit-to-Purchase Funnel"
## [1] "Arun Gopalakrishnan "
## [1] "Young-Hoon Park "
## [1] "Preference Learning and Demand Forecast"
## [1] "Xinyu Cao "
## [1] "Juanjuan Zhang "
## [1] "The End of the Express Road for Hybrid Vehicles: Can Governments’ Green Product Incentives Backfire?"
## [1] "Cheng He "
## [1] "O. Cem Ozturk "
## [1] "Chris Gu "
## [1] "Jorge Mario Silva-Risso"
## [1] "When Franchisee Service Affects Demand: An Application to the Car Radiator Market and Resale Price Maintenance"
## [1] "Tongil “TI” Kim "
## [1] "Price Fairness and Strategic Obfuscation"
## [1] "William J. Allender "
## [1] "Jura Liaukonyte "
## [1] "Sherif Nasser "
## [1] "Timothy J. Richards "
## [1] "A Model of Brand Architecture Choice: A House of Brands vs. A Branded House"
## [1] "Jungju Yu "
## [1] "When Consumers Learn, Money Burns: Signaling Quality via Advertising with Observational Learning and Word of Mouth"
## [1] "Yogesh V. Joshi "
## [1] "Andres Musalem "
## [1] "Focus On Authors"
## [1] "Editorial Board"
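
Printing is enough for inspection, but for later analysis it is often handier to keep one row per article. The following is a minimal sketch under stated assumptions: the HTML fragment is invented and only mimics the structure used above, and the names rows and articles_df are hypothetical. It uses html_node() (singular), which returns a missing node instead of failing when an article has no title.

```r
library(rvest)

# Hypothetical fragment: one article with two authors, one entry with none.
toy_page <- read_html('<html><body>
  <div class="issue-item">
    <h5 class="issue-item__title"><a>First Title</a></h5>
    <a class="entryAuthor linkable hlFld-ContribAuthor">Ann Author </a>
    <a class="entryAuthor linkable hlFld-ContribAuthor">Bob Writer</a>
  </div>
  <div class="issue-item">
    <h5 class="issue-item__title"><a>Editorial Note</a></h5>
  </div>
</body></html>')

article_nodes <- html_nodes(toy_page, xpath = '//div[@class="issue-item"]')

# Build a one-row data frame per article, then stack the rows.
rows <- lapply(article_nodes, function(article) {
  title_node <- html_node(article, xpath = './/h5[@class="issue-item__title"]/a')
  author_nodes <- html_nodes(
    article,
    xpath = './/a[@class="entryAuthor linkable hlFld-ContribAuthor"]'
  )
  data.frame(
    title = html_text(title_node, trim = TRUE),
    # Join the authors of one article into a single string;
    # an entry without authors yields an empty string.
    authors = paste(html_text(author_nodes, trim = TRUE), collapse = "; "),
    stringsAsFactors = FALSE
  )
})
articles_df <- do.call(rbind, rows)
print(articles_df)
```

Entries without authors, such as "Focus On Authors" and "Editorial Board" in the output above, would appear with an empty authors field and can be filtered out afterwards if only research articles are wanted.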

Congratulations! We are done now.