At a friend's request, I wrote a scraper for him that exports each page's data into its own Excel file.

Target site: http://www.hs-bianma.com/hs_chapter_01.htm

From what I can see, the page stores its data in a plain HTML table built from `<th>` and `<td>` cells, which is about as simple as it gets. Since Python has excellent modules for handling web pages, the whole thing took around 30 minutes to write.
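To see why a `<th>`/`<td>` table is so easy to handle, here is a minimal sketch of the extraction step using BeautifulSoup. The miniature HTML snippet below is made up for illustration; the real page has many more rows:

```python
from bs4 import BeautifulSoup

# A made-up miniature of the table structure described above
html = '''
<table>
  <tr><th>Code</th><th>Description</th></tr>
  <tr><td>0101</td><td>Live horses</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# Header cells and data cells come out as two flat lists
headers = [th.text.strip() for th in soup.find_all('th')]
cells = [td.text.strip() for td in soup.find_all('td')]
print(headers)  # ['Code', 'Description']
print(cells)    # ['0101', 'Live horses']
```

The full script below does exactly this, just against the live pages and with the results written out to Excel.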
Required modules: BeautifulSoup, urllib.request (standard library), xlwt

Without further ado, here's the code:
```python
from bs4 import BeautifulSoup
from urllib import request
import xlwt

value = 1
while value <= 98:
    # Chapter numbers are zero-padded in the URL (e.g. hs_chapter_01.htm)
    value0 = str(value).zfill(2)
    url = 'http://www.hs-bianma.com/hs_chapter_' + value0 + '.htm'
    # url = 'http://www.hs-bianma.com/hs_chapter_01.htm'
    # Swap in a single URL here if you only want one chapter

    # Fetch and parse the page
    response = request.urlopen(url)
    html = response.read().decode('utf-8')
    bs = BeautifulSoup(html, 'lxml')

    # Header row: the <th> cells
    data_list_title = [data.text.strip() for data in bs.find_all('th')]

    # Body: the <td> cells, regrouped into rows of 16 columns
    data_list_content = [data.text.strip() for data in bs.find_all('td')]
    new_list = [data_list_content[i:i + 16]
                for i in range(0, len(data_list_content), 16)]

    # Write everything into an Excel workbook
    book = xlwt.Workbook()
    sheet1 = book.add_sheet('sheet1', cell_overwrite_ok=True)

    # Write the header row
    for col, head in enumerate(data_list_title):
        sheet1.write(0, col, head)

    # Write the data rows
    for row, row_data in enumerate(new_list, start=1):
        for col, cell in enumerate(row_data):
            sheet1.write(row, col, cell)

    # Save one file per chapter
    book.save('sum' + value0 + '.xls')
    value += 1
    print(value0 + ' written!')
print('All done!')
```
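The one non-obvious trick above is turning the flat list of `<td>` texts back into table rows: since each row of the source table has 16 columns, slicing the list in strides of 16 restores the row structure. A standalone sketch (the 32-cell input is made up for illustration):

```python
# 32 flat cells, as find_all('td') would return them
cells = [str(n) for n in range(32)]

# Regroup into rows of 16 columns, matching the source table's width
rows = [cells[i:i + 16] for i in range(0, len(cells), 16)]
print(len(rows))      # 2
print(rows[1][0])     # '16' -- first cell of the second row
```

If the site ever changes its column count, the stride of 16 is the only thing that needs updating.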