Requests is probably the first library most web-scraping beginners encounter: it is concise, and its features are enough for ordinary web pages. Even better, it has Chinese documentation, which is friendly to those of us who barely scraped through the CET-4/6 English exams. I took the chance to organize my notes on it.
```python
import requests
r = requests.get('https://www.jianshu.com/')
```
POST is used to submit a form to a page, mainly for login operations:
```python
r = requests.post('http://httpbin.org/post', data={'key': 'value'})
```
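To see how the form dict is encoded without actually sending anything, you can prepare the request locally (a small sketch; the URL is only a placeholder here, since no network traffic is involved):

```python
import requests

# Prepare the POST locally to inspect how the form dict is encoded
# into the request body; nothing is sent over the network.
req = requests.Request('POST', 'http://httpbin.org/post',
                       data={'key': 'value'}).prepare()
print(req.body)                      # key=value
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```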
```python
response = requests.get('http://httpbin.org/get?name=zhangsan&age=22')
print(response.text)
```
```python
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('http://httpbin.org/get', params=payload)
print(response.text)
```
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
response = requests.get('http://httpbin.org/get', headers=headers)
print(response.text)
```
```python
proxies = {
    'http': 'http://175.44.148.176:9000',
    'https': 'https://183.129.207.86:14002'
}
response = requests.get('https://www.baidu.com/', proxies=proxies)
```
| Attribute | Description |
| --- | --- |
| response.text | the response body as a str (decoded to Unicode) |
| response.content | the response body as bytes |
| response.status_code | the HTTP status code |
| response.headers | the response headers |
| response.request | the request that produced this response |
| response.cookies | the cookies set by the response |
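The difference between text and content can be seen without making any request, by filling in a Response object by hand (purely illustrative; in practice requests.get() returns one already populated, and _content is a private field used here only for the demonstration):

```python
import requests

# Build a Response by hand just to illustrate the attributes above;
# normally requests.get() returns one already filled in.
resp = requests.models.Response()
resp.status_code = 200
resp._content = '你好'.encode('utf-8')  # raw bytes, as received on the wire
resp.encoding = 'utf-8'

print(type(resp.content))  # bytes: the raw response body
print(resp.text)           # str: the body decoded with resp.encoding
print(resp.status_code)
```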
A session records login state: after the first POST login request is sent through the session, the request headers and the cookies from a successful login are all kept in the session object, so the second request only needs a plain get(). This also lets you verify whether the login succeeded.
```python
import requests

def login_renren():
    login_url = 'http://www.renren.com/SysHome.do'
    login_data = {'email': 'account', 'password': 'password'}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    }
    session = requests.session()
    response = session.post(login_url, data=login_data, headers=headers)
    # the session now carries the login cookies, so a plain GET works
    response = session.get('http://www.renren.com/971909762/newsfeed/photo')
    print(response.text)

login_renren()
```
```python
import sys

import requests

class BaiduTieBa:
    def __init__(self, name, pn):
        self.name = name
        self.url = 'http://tieba.baidu.com/f?kw={}&ie=utf-8&pn='.format(name)
        self.headers = {
            # 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
            # Use an old User-Agent instead: that browser does not support JS,
            # so the server returns simpler HTML.
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)'
        }
        self.url_list = [self.url + str(pn * 50) for pn in range(pn)]
        print(self.url_list)

    def run(self):
        for url in self.url_list:
            data = self.get_data(url)
            num = self.url_list.index(url)
            self.save_data(data, num)

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        return response.content

    def save_data(self, data, num):
        file_name = self.name + '_' + str(num) + '.html'
        with open(file_name, 'wb') as f:
            f.write(data)

if __name__ == '__main__':
    name = sys.argv[1]
    pn = int(sys.argv[2])
    baidu = BaiduTieBa(name, pn)
    baidu.run()
```
Note: the referenced article's version of this example seems slightly off. It used 'http://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}'.format(name, pn), but pn should not be filled in at that point; with the change above the code runs.
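The corrected pagination can be checked offline. Tieba lists 50 posts per page, so pn advances in steps of 50; this quick sketch reuses the same format string (with a placeholder keyword):

```python
# Same pagination logic as the class above: each page holds 50 posts,
# so pn advances in steps of 50 (0, 50, 100, ...).
name = 'test'
url = 'http://tieba.baidu.com/f?kw={}&ie=utf-8&pn='.format(name)
url_list = [url + str(pn * 50) for pn in range(3)]
print(url_list)
```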
After baidu.py is written, open CMD in the project directory and run: python baidu.py 两会 6 to get the results.
Use a session to record the login state and cookies, then visit the personal homepage and save the result to renren.html:
```python
import json

import requests

def login_renren():
    login_url = 'http://www.renren.com/PLogin.do'
    login_data = {'email': 'account', 'password': 'password'}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    }
    session = requests.session()
    session.post(login_url, data=login_data, headers=headers)
    response = session.get('http://www.renren.com/581249666/newsfeed/photo')
    with open('renren.html', 'wb') as fp:
        fp.write(response.content)
    # convert the cookie jar to a plain dict so it can be saved as JSON
    cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)
    with open('cookies.txt', 'w') as f:
        json.dump(cookies_dict, f)
    with open('cookies.txt', 'r') as f:
        cookies = json.load(f)
    print(cookies)

login_renren()
```
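The cookie round-trip at the end of the example can also be tried with a jar built locally (the cookie name and value here are hypothetical; no server is involved):

```python
import json

import requests

# Build a jar locally (hypothetical cookie) and round-trip it through
# a plain dict, exactly as the login example does with response.cookies.
jar = requests.cookies.cookiejar_from_dict({'sessionid': 'abc123'})
cookies_dict = requests.utils.dict_from_cookiejar(jar)
dumped = json.dumps(cookies_dict)
restored = json.loads(dumped)
print(restored)  # {'sessionid': 'abc123'}
```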