[Open-Source .NET Cross-Platform Crawler / Data Collection Framework: DotnetSpider] [Part III] Configuration-Based Crawlers

[DotnetSpider Series Table of Contents]



The basic usage shown in the previous article offers a lot of freedom, but it also requires a fair amount of code. In my line of work, most crawlers are topic-specific: they only need to collect designated pages and turn them into structured data. To speed up development, I implemented a way to define crawlers through entity configuration.


Create a Console project


Add the package via NuGet


DotnetSpider2.Extension
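For example, from the NuGet Package Manager Console (a minimal sketch; the package ID DotnetSpider2.Extension is taken from above):

    Install-Package DotnetSpider2.Extension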


Define a configuration-style data entity



  • The data entity must implement ISpiderEntity

  • Schema defines the database name, table name, and table-name suffix

  • Indexes defines the table's primary key, unique indexes, and ordinary indexes

  • EntitySelector defines the rule for extracting data entities from the page


Define a bare data entity class



public class Product : ISpiderEntity { }


Open the JD product listing page in Chrome: http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main


  1. Press F12 to open the developer tools

  2. Select a product and inspect its HTML structure


Each product turns out to live inside a DIV with class gl-i-wrap j-sku-item (sketched below), so we add an EntitySelector attribute above the Product class. (The XPath given here is not the only correct one; if you are not familiar with XPath, W3Schools is a good place to learn it. The framework also supports CSS selectors and even regular expressions for picking out the right HTML fragment.)
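For reference, the matched markup looks roughly like this (simplified; the data-sku value is invented for illustration, and the real page carries more attributes and nesting):

    <li class="gl-item">
        <div class="gl-i-wrap j-sku-item" data-sku="1861098">
            ...
        </div>
    </li>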


    [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
    public class Product : ISpiderEntity
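Before wiring the expression into the crawler, you can sanity-check it in Chrome's developer-tools console, which provides an $x() helper for evaluating XPath against the current page (a quick aside, not part of the framework):

    // Run in the DevTools console on the JD listing page;
    // a non-zero count means the selector matches the product blocks.
    $x("//li[@class='gl-item']/div[contains(@class,'j-sku-item')]").length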




  1. Add the database and index information. With TableSuffix.Today, the table name is suffixed with the current date, so each day's data lands in its own table.



    [Schema("test", "sku", TableSuffix.Today)][EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")][Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]public class Product : ISpiderEntity




  2. Suppose you need to collect the SKU. Inspect the HTML structure and work out the relative XPath. Why a relative XPath? Because EntitySelector has already cut the HTML into fragments, so every query for an inner element is evaluated relative to the element that EntitySelector matched. Finally, add the database column information.




    [Schema("test", "sku", TableSuffix.Today)][EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")][Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]public class Product : ISpiderEntity{     [StoredAs("sku", DataType.String, 25)]     [PropertySelector(Expression = "./@data-sku")]     public string Sku { get; set; } }





  3. Internally, the crawler stores link information in Request objects. When a Request is constructed, extra property values can be attached to it, and a data entity is then allowed to read its data from those extra values:



    [StoredAs("category", DataType.String, 20)][PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]public string CategoryName { get; set; }




Configure the crawler (inherit from EntitySpiderBuilder)



    protected override EntitySpider GetEntitySpider()
    {
        EntitySpider context = new EntitySpider(new Site
        {
            //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("Kuaidaili API"))
        })
        {
            UserId = "DotnetSpider",
            TaskGroup = "JdSkuSampleSpider"
        };
        context.SetThreadNum(1);
        context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
        context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
        context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
        context.AddEntityType(typeof(Product), new TargetUrlExtractor
        {
            Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
            Patterns = new List<string> { @"&page=[0-9]+&" }
        });
        return context;
    }





  1. The second argument to AddStartUrl, a Dictionary<string, object>, supplies the data used by Enviroment queries.
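To make the pairing explicit, here are the two sides again, both taken from the code above; the key in the dictionary must match the selector's Expression:

    // The "name" entry attached to the start URL...
    context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main",
        new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });

    // ...is what this Enviroment-typed selector reads into CategoryName.
    [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
    public string CategoryName { get; set; }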




  2. Configure the Scheduler. By default an in-memory Queue handles URL scheduling; for distributed collection across several machines, configure a RedisScheduler instead so that all nodes share one URL queue:



    context.SetScheduler(new RedisScheduler
    {
        Host = "",
        Password = "",
        Port = 6379
    });




  3. When adding a data entity, you can configure validation rules for data links. This is used when one site is crawled for several kinds of links that map to different data entities. The same rules also extract matching URLs from the current page and feed them into the Scheduler so crawling continues:



    context.AddEntityType(typeof(Product), new TargetUrlExtractor
    {
        Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
        Patterns = new List<string> { @"&page=[0-9]+&" }
    });




  4. Add a MySQL data pipeline; all that needs configuring is the connection string:



    context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));




Complete code




public class JdSkuSampleSpider : EntitySpiderBuilder
{
    protected override EntitySpider GetEntitySpider()
    {
        EntitySpider context = new EntitySpider(new Site
        {
            //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("Kuaidaili API"))
        })
        {
            UserId = "DotnetSpider",
            TaskGroup = "JdSkuSampleSpider"
        };
        context.SetThreadNum(1);
        context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
        context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
        context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
        context.AddEntityType(typeof(Product), new TargetUrlExtractor
        {
            Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class=\"p-num\"]" },
            Patterns = new List<string> { @"&page=[0-9]+&" }
        });
        return context;
    }

    [Schema("test", "sku", TableSuffix.Today)]
    [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
    [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
    public class Product : ISpiderEntity
    {
        [StoredAs("sku", DataType.String, 25)]
        [PropertySelector(Expression = "./@data-sku")]
        public string Sku { get; set; }

        [StoredAs("category", DataType.String, 20)]
        [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
        public string CategoryName { get; set; }

        [StoredAs("cat3", DataType.String, 20)]
        [PropertySelector(Expression = "cat3", Type = SelectorType.Enviroment)]
        public int CategoryId { get; set; }

        [StoredAs("url", DataType.Text)]
        [PropertySelector(Expression = "./div[1]/a/@href")]
        public string Url { get; set; }

        [StoredAs("commentscount", DataType.String, 32)]
        [PropertySelector(Expression = "./div[5]/strong/a")]
        public long CommentsCount { get; set; }

        [StoredAs("shopname", DataType.String, 100)]
        [PropertySelector(Expression = ".//div[@class='p-shop']/@data-shop_name")]
        public string ShopName { get; set; }

        [StoredAs("name", DataType.String, 50)]
        [PropertySelector(Expression = ".//div[@class='p-name']/a/em")]
        public string Name { get; set; }

        [StoredAs("venderid", DataType.String, 25)]
        [PropertySelector(Expression = "./@venderid")]
        public string VenderId { get; set; }

        [StoredAs("jdzy_shop_id", DataType.String, 25)]
        [PropertySelector(Expression = "./@jdzy_shop_id")]
        public string JdzyShopId { get; set; }

        [StoredAs("run_id", DataType.Date)]
        [PropertySelector(Expression = "Monday", Type = SelectorType.Enviroment)]
        public DateTime RunId { get; set; }

        [PropertySelector(Expression = "Now", Type = SelectorType.Enviroment)]
        [StoredAs("cdate", DataType.Time)]
        public DateTime CDate { get; set; }
    }
}




Run the crawler



public class Program
{
    public static void Main(string[] args)
    {
        JdSkuSampleSpider spiderBuilder = new JdSkuSampleSpider();
        spiderBuilder.Run("rerun");
    }
}




A complete crawler in under 100 lines of code. Pretty simple, isn't it?
