WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // Dev, Web

I’ve been experimenting with data visualisation for a while now, most of which is for Masabi’s business plan, though I hope to share some offshoots soon.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really, really like it, but there are some quirks and setup issues that have cost me hours, so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works Maven is a lovely tool to hide dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>WebScraping</groupId>
        <artifactId>WebScraping</artifactId>
        <packaging>jar</packaging>
        <version>0.00.01</version>
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>1.6</source>
                        <target>1.6</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <repositories>
            <repository>
                <id>wso2</id>
                <url>http://dist.wso2.org/maven2/</url>
            </repository>
            <repository>
                <id>maven-repository-1</id>
                <url>http://repo1.maven.org/maven2/</url>
            </repository>
        </repositories>
        <dependencies>
            <dependency>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
                <version>1.1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.12</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.webharvest.wso2</groupId>
                <artifactId>webharvest-core</artifactId>
                <version>1.0.0.wso2v1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <!-- web harvest pom doesn't track dependencies well -->
            <dependency>
                <groupId>net.sf.saxon</groupId>
                <artifactId>saxon-xom</artifactId>
                <version>8.7</version>
            </dependency>
            <dependency>
                <groupId>org.htmlcleaner</groupId>
                <artifactId>htmlcleaner</artifactId>
                <version>1.55</version>
            </dependency>
            <dependency>
                <groupId>bsh</groupId>
                <artifactId>bsh</artifactId>
                <version>1.3.0</version>
            </dependency>
            <dependency>
                <groupId>commons-httpclient</groupId>
                <artifactId>commons-httpclient</artifactId>
                <version>3.1</version>
            </dependency>
        </dependencies>
    </project>

You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working pom listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.
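
To give a flavour of how little Java is involved, here is a minimal sketch of running a config and reading one scraped variable back out. It uses the same WebHarvest API as the harness class later in this post; the config path and the name variable are placeholders:

    import java.io.File;

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.Variable;

    public class MinimalScrape
    {
        public static void main(String[] args) throws Exception
        {
            // load the XML scraping script, giving the scraper a folder for temp files
            ScraperConfiguration config = new ScraperConfiguration(new File("config/example.xml"));
            Scraper scraper = new Scraper(config, "temp");
            scraper.execute();

            // every <var-def> from the script is now available in the scraper's context
            Variable name = (Variable) scraper.getContext().get("name");
            if (name != null)
                System.out.println("Scraped name: " + name.toString());
        }
    }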

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.

As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:

    <?xml version="1.0" encoding="UTF-8"?>
    <config>
        <!-- indicates we want a loop through the list defined in <list>, running <body> for each item, where the variables uid and i hold the value and index of the relevant item -->
        <loop item="uid" index="i">
            <!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags -->
            <list>
                <xpath expression="//option/@value">
                    <html-to-xml>
                        <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                    </html-to-xml>
                </xpath>
            </list>
            <!-- the body section lists instructions which are run for every iteration of the loop -->
            <body>
                <!-- we define a new variable for every iteration, using the iteration count as a suffix -->
                <var-def name="uri.${i}">
                    <!-- the template tag is important - without it the $ var syntax is ignored and no value substitution happens -->
                    <template>device/loadDevice.do?id=${uid}</template>
                </var-def>
            </body>
        </loop>
    </config>

The handset URIs will end up in a list of variables, from uri.1 to uri.N.
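
Pulling those out on the Java side is just a matter of counting up from 1 until a variable is missing – a sketch, continuing from the scraper object in the snippet above (the harness class later does exactly this):

    // after scraper.execute() on the URL-list config
    ScraperContext context = scraper.getContext();
    int i = 1;
    Variable uri;
    while ((uri = (Variable) context.get("uri." + i)) != null)
    {
        System.out.println("Handset page " + i + ": " + uri.toString());
        i++;
    }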

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which will take this URI, and pull out pieces of information from the page – in this example, the handset’s name and its screen resolution:

    <?xml version="1.0" encoding="UTF-8"?>
    <config>
        <!-- get the entire page -->
        <var-def name="wholepage">
            <html-to-xml>
                <!-- NEVER try and pass in the entire URL as a single variable here! -->
                <http url="http://developer.sonyericsson.com/${uri}"/>
            </html-to-xml>
        </var-def>
        <!-- rip out the block with the specifications -->
        <var-def name="specsheet">
            <xpath expression="//div[@class='phone-specs']">
                <var name="wholepage"/>
            </xpath>
        </var-def>
        <!-- find the handset's name -->
        <var-def name="name">
            <xpath expression="//h5[contains(text(),'Phone Model')]/following-sibling::p[1]/text()">
                <var name="specsheet"/>
            </xpath>
        </var-def>
        <!-- identify the screen resolution -->
        <regexp>
            <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
            <regexp-source>
                <xpath expression="//h5[contains(text(),'Screen Sizes')]/following-sibling::p[1]/text()">
                    <var name="specsheet"/>
                </xpath>
            </regexp-source>
            <regexp-result>
                <var-def name="screen.width"><template>${_1}</template></var-def>
                <var-def name="screen.height"><template>${_2}</template></var-def>
            </regexp-result>
        </regexp>
    </config>
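
The ${_1} and ${_2} templates correspond to the two capture groups in the regexp-pattern, exactly as java.util.regex numbers them. For comparison, the plain-Java equivalent of that last step looks like this (the input string here is invented for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ResolutionRegex
    {
        public static void main(String[] args)
        {
            // same pattern as the script: two groups of digits separated by 'x'
            Pattern p = Pattern.compile("([\\d]*)x([\\d]*)");
            Matcher m = p.matcher("Screen: 240x320 pixels");  // hypothetical spec text
            if (m.find())
            {
                String width = m.group(1);   // -> ${_1} in WebHarvest, "240"
                String height = m.group(2);  // -> ${_2} in WebHarvest, "320"
                System.out.println(width + " x " + height);
            }
        }
    }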

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair-tearing. In the script, this line defines the page to scrape: <http url="http://developer.sonyericsson.com/${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative: <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

    Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
        at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
        at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
        at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
        at scrape.ActualScraper.main(DhfScraper.java:37)
    Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  :
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.Interpreter.set(Unknown Source)
        ... 18 more

As far as I can tell from the stack trace, the templater pushes every context variable into the BeanShell interpreter by name, so a generated name like uri.1 gets parsed as a field access on a plain uri (or url) variable if one exists in the same context – hence the baffling "No such field: 1". You have been warned!

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

    package scrape;

    import java.io.*;
    import java.util.*;

    import org.apache.commons.logging.*;
    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.*;
    import org.webharvest.runtime.variables.Variable;

    /**
     * Quick hackable web scraping class.
     * @author Tom Godber
     */
    public abstract class QuickScraper
    {
        /** Logging object. */
        protected final Log LOG = LogFactory.getLog(getClass());
        /** Prefix for any variable scraped which defines a URL. It will be followed by a counter.
         *  Matches the uri.${i} variables defined by the URL-list config above. */
        public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
        /** A variable name which holds the initial URL to scrape.
         *  Matches the ${uri} variable in the page config above. */
        public static final String START_URL_VARIABLE = "uri";

        /** A temporary working folder. */
        private File working = new File("temp");

        /** Ensures temp folder exists. */
        public QuickScraper()
        {
            working.mkdirs();
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(Map setup, String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
        {
            return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         * @throws NullPointerException If the setup map is null.
         */
        protected int scrapeUrlList(Map setup, File urlConfigXml, File pageConfigXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                // initialise any config
                setupScraperContext(setup, scraper);
                // run the script
                scraper.execute();

                // rip the URL list out of the scraped content
                ScraperContext context = scraper.getContext();
                int i=1;
                Variable scrapedUrl;
                if (LOG.isDebugEnabled())   LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
                while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i)) != null)
                {
                    if (LOG.isTraceEnabled())   LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
                    // parse this URL
                    setup.put(START_URL_VARIABLE, scrapedUrl.toString());
                    scrapeUrl(setup, pageConfigXml);
                    // move on
                    i++;
                }
                if (LOG.isDebugEnabled())   LOG.debug("No more URLs found.");
                // i has moved one past the last URL found
                return i-1;
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
                return -1;
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * The script must contain a hardcoded URL.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(File configXml)
        {
            scrapeUrl((String)null, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param url The URL to scrape. If null, the URL must be set in the config itself.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(String url, File configXml)
        {
            Map setup = new HashMap();
            if (url!=null)  setup.put(START_URL_VARIABLE, url);
            scrapeUrl(setup, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param setup Optional configuration for the script.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(Map setup, File configXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(configXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                setupScraperContext(setup, scraper);
                scraper.execute();

                // handle contents in some way
                pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

                if (LOG.isDebugEnabled())   LOG.debug("Page scraping complete.");
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * @param setup Any variables to be set before the script runs.
         * @param scraper The object which does the scraping.
         */
        private void setupScraperContext(Map<String,Object> setup, Scraper scraper)
        {
            if (setup!=null)
                for (String key : setup.keySet())
                    scraper.getContext().setVar(key, setup.get(key));
        }

        /**
         * Process a page that was scraped.
         * @param url The URL that was scraped.
         * @param context The contents of the scraped page.
         */
        public abstract void pageScraped(String url, ScraperContext context);
    }

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

    package scrape;

    import org.webharvest.runtime.ScraperContext;
    import org.webharvest.runtime.variables.Variable;

    public class ActualScraper extends QuickScraper
    {
        public static void main(String[] args)
        {
            try
            {
                ActualScraper scraper = new ActualScraper();
                // do the scraping
                scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

        /**
         * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
         */
        public void pageScraped(String url, ScraperContext context)
        {
            Variable nameVar = context.getVar("name");
            if (nameVar==null)
            {
                if (LOG.isWarnEnabled())    LOG.warn("Scrape for "+url+" produced no data! Ignoring");
                return;
            }

            // store this handset's details
            if (LOG.isInfoEnabled())    LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
        }
    }

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!
