Scraping the Douban Movie Top 250

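The paper I was writing tonight didn't get saved, and I could cry at my own stupidity~ So I started scraping the Douban Movie Top 250 instead: 250 movies spread over 10 pages. On pages 5 and 10, though, the one-line introduction is missing for one of the scraped movies, so that variable ends up a different length from the others and they can't go into the same data.frame. Turning them into a list doesn't help either, because then rbind doesn't work. So I took a rather clumsy approach and wrote the functions for pages 5 and 10 separately~ I hope some expert can write better code and share it with me~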
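To make the length problem concrete, here is a minimal sketch in base R. The vectors are made-up stand-ins, not scraped data: three names but only two introductions, as if the middle movie had none.

```r
# Hypothetical stand-in data: three names but only two introductions
name <- c("Movie A", "Movie B", "Movie C")
introduction <- c("Intro for A", "Intro for C")

# data.frame() refuses columns whose lengths cannot be recycled to match
result <- tryCatch(
  data.frame(name, introduction),
  error = function(e) conditionMessage(e)
)
print(result)

# Inserting NA at the position of the missing entry restores equal lengths
introduction_fixed <- append(introduction, NA, after = 1)
movie_demo <- data.frame(name, introduction = introduction_fixed)
print(movie_demo)
```

This is the same idea as the page 5 and page 10 workaround: the missing slot has to be filled by hand, which only works once you know which movie lacks an introduction.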

library(magrittr)

library(rvest)

library(xml2)

library(stringr)

site1<-'https://movie.douban.com/top250?start='

site2<-'&filter='

movie<-data.frame()

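# Pages 1-4: every movie on these pages has an introduction
for(i in 1:4){
fun<-function(i){
# page i starts at offset 25*(i-1)
site<-paste(site1,25*(i-1),site2,sep='')
web<-read_html(site)
name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()
introduction<-web%>%html_nodes('.inq')%>%html_text()
remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()
number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()
# assumption: the vote count is the sibling span that contains digits, e.g. "123456人评价"
no<-grep('[0-9]',number1)
number<-as.numeric(str_extract(number1[no],'[0-9]+'))
movie<-data.frame(name,introduction,remark,number)
}
movie<-rbind(movie,fun(i))
}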

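###### On page 5 the introduction variable has only 24 entries. After checking, it turns out the 4th movie on page 5, 《摔跤吧爸爸》 (Dangal), has no introduction. With my limited skills, all I could do was scrape page 5 separately.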

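fun<-function(i){
site<-paste(site1,25*(i-1),site2,sep='')
web<-read_html(site)
name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()
introduction1<-web%>%html_nodes('.inq')%>%html_text()
# the 4th movie has no .inq node, so insert NA at position 4 to get 25 entries
introduction<-c(introduction1[1:3],NA,introduction1[4:24])
remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()
number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()
# assumption: the vote count is the sibling span that contains digits, e.g. "123456人评价"
no<-grep('[0-9]',number1)
number<-as.numeric(str_extract(number1[no],'[0-9]+'))
movie<-data.frame(name,introduction,remark,number)
}
movie<-rbind(movie,fun(5))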

######## Continue with pages 6-9

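for(i in 6:9){
fun<-function(i){
# page i starts at offset 25*(i-1)
site<-paste(site1,25*(i-1),site2,sep='')
web<-read_html(site)
name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()
introduction<-web%>%html_nodes('.inq')%>%html_text()
remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()
number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()
# assumption: the vote count is the sibling span that contains digits, e.g. "123456人评价"
no<-grep('[0-9]',number1)
number<-as.numeric(str_extract(number1[no],'[0-9]+'))
movie<-data.frame(name,introduction,remark,number)
}
movie<-rbind(movie,fun(i))
}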

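###### On page 10 the introduction variable again has only 24 entries. After checking, it turns out the 20th movie on page 10, 《你的名字》 (Your Name), has no introduction. Same approach as for page 5.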

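fun<-function(i){
site<-paste(site1,25*(i-1),site2,sep='')
web<-read_html(site)
name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()
introduction1<-web%>%html_nodes('.inq')%>%html_text()
# the 20th movie has no .inq node, so insert NA at position 20 to get 25 entries
introduction<-c(introduction1[1:19],NA,introduction1[20:24])
remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()
number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()
# assumption: the vote count is the sibling span that contains digits, e.g. "123456人评价"
no<-grep('[0-9]',number1)
number<-as.numeric(str_extract(number1[no],'[0-9]+'))
movie<-data.frame(name,introduction,remark,number)
}
movie<-rbind(movie,fun(10))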

write.csv(movie,'C:\\Users\\Administrator\\Desktop\\movie.csv')
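The separate handling of pages 5 and 10 could probably be avoided by selecting each movie's container first and then asking for one child node per container. The sketch below assumes every movie sits in its own element with class `item`. That is an assumption about the page markup, not something confirmed in this post, and the HTML here is made up rather than fetched from the live site. `html_node()` (singular) returns one result per container, so a movie without `.inq` yields NA instead of being silently skipped.

```r
library(rvest)

# Made-up HTML mimicking the assumed structure: one .item block per movie;
# the second movie deliberately has no .inq element
html <- '<html><body>
<div class="item"><span class="title">Movie A</span><span class="rating_num">9.0</span><span class="inq">Intro A</span></div>
<div class="item"><span class="title">Movie B</span><span class="rating_num">8.5</span></div>
<div class="item"><span class="title">Movie C</span><span class="rating_num">8.0</span><span class="inq">Intro C</span></div>
</body></html>'

web <- read_html(html)

# One node per movie (".item" is an assumed class name)
items <- web %>% html_nodes('.item')

# html_node() keeps one slot per item, so every column has the same length
movie_sketch <- data.frame(
  name = items %>% html_node('.title') %>% html_text() %>% trimws(),
  introduction = items %>% html_node('.inq') %>% html_text(),
  remark = items %>% html_node('.rating_num') %>% html_text() %>% as.numeric()
)
print(movie_sketch)
```

If the `.item` assumption holds for the real pages, swapping the inline HTML for `read_html(site)` inside one loop over pages 1 to 10 should give 25 rows per page with no special cases.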

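Once you fall into the deep sea of web scraping, copy and paste becomes a stranger~

Anyone interested can give it a try: you get the Douban Top 250 movies without any copying and pasting.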
