Hello everyone! Today we are investigating how to use R to scrape online data.
First of all, we need to install the “rvest” package, which is useful for data scraping.
install.packages("rvest", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/Li Xi/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'rvest' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Li Xi\AppData\Local\Temp\RtmpCmC5gh\downloaded_packages
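A small tip (a sketch, not part of the original setup): to avoid reinstalling the package on every run of the script, you can first check whether it is already available.

# Install rvest only if it is not already installed (same CRAN mirror as above)
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest", repos = "http://cran.us.r-project.org")
}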
Next, we enter the URL of the webpage that we want to analyze. Here, we take the HKU marketing faculty webpage as an example.
library(rvest)
url <- "https://www.fbe.hku.hk/people/faculty?pg=1&staff_type=faculty&subject_area=marketing&track=all"
webpage <- read_html(url, encoding = "UTF-8")
print(webpage)
## {html_document}
## <html lang="en-US" prefix="og: https://ogp.me/ns#">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="page-template page-template-people-listing page-template-peo ...
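Note that read_html() throws an error if the request fails (for example, when the network is down or the page has moved). A minimal defensive sketch, reusing the same url as above:

# Wrap the download in tryCatch so a failed request does not stop the script
webpage <- tryCatch(
  read_html(url, encoding = "UTF-8"),
  error = function(e) {
    message("Failed to read the page: ", conditionMessage(e))
    NULL
  }
)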
Getting all the “h5” nodes in the webpage:
nodes <- html_nodes(webpage, xpath = '//h5')
How many h5 nodes do we have?
print(length(nodes))
## [1] 16
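Why 16 nodes when the page lists 15 faculty members? One extra h5 must sit somewhere else in the page. One way to check (a sketch using xml2, the package rvest builds on) is to look at the class of each node's parent element:

library(xml2)
# Print the class attribute of each <h5> node's parent element,
# to see which nodes sit inside the "people-info" blocks we want
parent_classes <- sapply(nodes, function(n) xml_attr(xml_parent(n), "class"))
print(parent_classes)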
Now, let us focus on the h5 nodes that satisfy the following criterion (those inside a div of class “people-info”), and print their text accordingly.
nodes <- html_nodes(webpage, xpath = '//div[@class="people-info"]/h5')
print(length(nodes))
## [1] 15
for (node in nodes) {
  print(html_text(node))
}
## [1] "Dr. Jingcun CAO"
## [1] "Mr. Baniel CHEUNG"
## [1] "Dr. Buston Yat Chiu CHU"
## [1] "Dr. Chu (Ivy) DANG"
## [1] "Dr. Jinzhao DU"
## [1] "Dr. Tak Zhongqiang HUANG"
## [1] "Dr. Jayson Shi JIA"
## [1] "Dr. Michael He JIA"
## [1] "Dr. Sara KIM"
## [1] "Dr. Yin Mei NG"
## [1] "Dr. Tuan Quang PHAN"
## [1] "Mr. Sean RACH"
## [1] "Prof. David K.C. TSE"
## [1] "Prof. Echo Wen WAN"
## [1] "Dr. Guiyang XIONG"
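As a follow-up, the loop above can be replaced by a single vectorized call: html_text() applied to the whole node set returns a character vector, which is convenient for storing the names in a data frame for later analysis (a sketch; the column name faculty_name is our own choice).

# Extract all names at once and store them in a data frame
faculty_names <- html_text(nodes)
faculty_df <- data.frame(faculty_name = faculty_names)
head(faculty_df)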