WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // Dev, Web

I’ve been experimenting with data visualisation for a while now, most of which is for Masabi’s business plan, though I hope to share some offshoots soon.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really, really like it, but there are some quirks and setup issues that have cost me hours, so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works Maven is a lovely tool to hide dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>WebScraping</groupId>
        <artifactId>WebScraping</artifactId>
        <packaging>jar</packaging>
        <version>0.00.01</version>
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <configuration>
                        <source>1.6</source>
                        <target>1.6</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>

        <repositories>
            <repository>
                <id>wso2</id>
                <url>http://dist.wso2.org/maven2/</url>
            </repository>
            <repository>
                <id>maven-repository-1</id>
                <url>http://repo1.maven.org/maven2/</url>
            </repository>
        </repositories>
        <dependencies>
            <dependency>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
                <version>1.1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>log4j</groupId>
                <artifactId>log4j</artifactId>
                <version>1.2.12</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.webharvest.wso2</groupId>
                <artifactId>webharvest-core</artifactId>
                <version>1.0.0.wso2v1</version>
                <type>jar</type>
                <scope>compile</scope>
            </dependency>
            <!-- web harvest pom doesn't track dependencies well -->
            <dependency>
                <groupId>net.sf.saxon</groupId>
                <artifactId>saxon-xom</artifactId>
                <version>8.7</version>
            </dependency>
            <dependency>
                <groupId>org.htmlcleaner</groupId>
                <artifactId>htmlcleaner</artifactId>
                <version>1.55</version>
            </dependency>
            <dependency>
                <groupId>bsh</groupId>
                <artifactId>bsh</artifactId>
                <version>1.3.0</version>
            </dependency>
            <dependency>
                <groupId>commons-httpclient</groupId>
                <artifactId>commons-httpclient</artifactId>
                <version>3.1</version>
            </dependency>
        </dependencies>
    </project>

You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working pom listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.
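
To give a flavour of how little Java is involved, here is a minimal sketch of running a config and reading one scraped variable back out. It uses the same WebHarvest API as the harness class later in this post; the config path and the name variable are placeholders:

    import java.io.File;

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;
    import org.webharvest.runtime.variables.Variable;

    public class MinimalScrape
    {
        public static void main(String[] args) throws Exception
        {
            // load the XML scraping script, giving the scraper a folder for temp files
            ScraperConfiguration config = new ScraperConfiguration(new File("config/example.xml"));
            Scraper scraper = new Scraper(config, "temp");
            scraper.execute();

            // every <var-def> from the script is now available in the scraper's context
            Variable name = (Variable) scraper.getContext().get("name");
            if (name != null)
                System.out.println("Scraped name: " + name.toString());
        }
    }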

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.

As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:

    <?xml version="1.0" encoding="UTF-8"?>
    <config>
        <!-- indicates we want a loop through the list defined in <list>, running <body> for each item, where the variables uid and i hold the value and index of the relevant item -->
        <loop item="uid" index="i">
            <!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags -->
            <list>
                <xpath expression="//option/@value">
                    <html-to-xml>
                        <http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
                    </html-to-xml>
                </xpath>
            </list>
            <!-- the body section lists instructions which are run for every iteration of the loop -->
            <body>
                <!-- we define a new variable for every iteration, using the iteration count as a suffix -->
                <var-def name="uri.${i}">
                    <!-- the template tag is important - without it the $ var syntax is ignored and no value substitution happens -->
                    <template>device/loadDevice.do?id=${uid}</template>
                </var-def>
            </body>
        </loop>
    </config>

The handset URIs will end up in a list of variables, from uri.1 to uri.N.
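
Pulling those out on the Java side is just a matter of counting up from 1 until a variable is missing – a sketch, continuing from the scraper object in the snippet above (the harness class later does exactly this):

    // after scraper.execute() on the URL-list config
    ScraperContext context = scraper.getContext();
    int i = 1;
    Variable uri;
    while ((uri = (Variable) context.get("uri." + i)) != null)
    {
        System.out.println("Handset page " + i + ": " + uri.toString());
        i++;
    }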

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which will take this URI, and pull out pieces of information from the page – in this example, the handset’s name and its screen resolution:

    <?xml version="1.0" encoding="UTF-8"?>
    <config>
        <!-- get the entire page -->
        <var-def name="wholepage">
            <html-to-xml>
                <!-- NEVER try and pass in the entire URL as a single variable here! -->
                <http url="http://developer.sonyericsson.com/${uri}"/>
            </html-to-xml>
        </var-def>
        <!-- rip out the block with the specifications -->
        <var-def name="specsheet">
            <xpath expression="//div[@class='phone-specs']">
                <var name="wholepage"/>
            </xpath>
        </var-def>
        <!-- find the handset's name -->
        <var-def name="name">
            <xpath expression="//h5[contains(text(),'Phone Model')]/following-sibling::p[1]/text()">
                <var name="specsheet"/>
            </xpath>
        </var-def>
        <!-- identify the screen resolution -->
        <regexp>
            <regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
            <regexp-source>
                <xpath expression="//h5[contains(text(),'Screen Sizes')]/following-sibling::p[1]/text()">
                    <var name="specsheet"/>
                </xpath>
            </regexp-source>
            <regexp-result>
                <var-def name="screen.width"><template>${_1}</template></var-def>
                <var-def name="screen.height"><template>${_2}</template></var-def>
            </regexp-result>
        </regexp>
    </config>
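
The ${_1} and ${_2} templates correspond to the two capture groups in the regexp-pattern, exactly as java.util.regex numbers them. For comparison, the plain-Java equivalent of that last step looks like this (the input string here is invented for illustration):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ResolutionRegex
    {
        public static void main(String[] args)
        {
            // same pattern as the script: two groups of digits separated by 'x'
            Pattern p = Pattern.compile("([\\d]*)x([\\d]*)");
            Matcher m = p.matcher("Screen: 240x320 pixels");  // hypothetical spec text
            if (m.find())
            {
                String width = m.group(1);   // -> ${_1} in WebHarvest, "240"
                String height = m.group(2);  // -> ${_2} in WebHarvest, "320"
                System.out.println(width + " x " + height);
            }
        }
    }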

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair-tearing. In the script, this line defines the page to scrape: <http url="http://developer.sonyericsson.com/${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative: <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

    Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
        at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
        at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
        at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
        at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
        at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at org.webharvest.runtime.Scraper.execute(Unknown Source)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
        at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
        at scrape.ActualScraper.main(DhfScraper.java:37)
    Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  :
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.UtilEvalError.toEvalError(Unknown Source)
        at bsh.Interpreter.set(Unknown Source)
        ... 18 more

As far as I can tell from the stack trace, the templater pushes every context variable into the BeanShell interpreter by name, so a generated name like uri.1 gets parsed as a field access on a plain uri (or url) variable if one exists in the same context – hence the baffling "No such field: 1". You have been warned!

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

    package scrape;

    import java.io.*;
    import java.util.*;

    import org.apache.commons.logging.*;
    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.*;
    import org.webharvest.runtime.variables.Variable;

    /**
     * Quick hackable web scraping class.
     * @author Tom Godber
     */
    public abstract class QuickScraper
    {
        /** Logging object. */
        protected final Log LOG = LogFactory.getLog(getClass());
        /** Prefix for any variable scraped which defines a URL. It will be followed by a counter.
         *  Matches the uri.${i} variables defined by the URL-list config above. */
        public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
        /** A variable name which holds the initial URL to scrape.
         *  Matches the ${uri} variable in the page config above. */
        public static final String START_URL_VARIABLE = "uri";

        /** A temporary working folder. */
        private File working = new File("temp");

        /** Ensures temp folder exists. */
        public QuickScraper()
        {
            working.mkdirs();
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml Path of an XML describing how to scrape the URL list.
         * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(Map setup, String urlConfigXml, String pageConfigXml)
        {
            return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * The initial URL must be set in the actual URL list config XML.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         */
        protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
        {
            return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
        }

        /**
         * Scrapes a list of URLs which are automatically derived from a page.
         * @param setup Optional configuration for the script.
         * @param urlConfigXml XML describing how to scrape the URL list.
         * @param pageConfigXml XML describing how to scrape the individual pages found.
         * @return The number of URLs processed, or -1 if the config could not be loaded.
         * @throws NullPointerException If the setup map is null.
         */
        protected int scrapeUrlList(Map setup, File urlConfigXml, File pageConfigXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                // initialise any config
                setupScraperContext(setup, scraper);
                // run the script
                scraper.execute();

                // rip the URL list out of the scraped content
                ScraperContext context = scraper.getContext();
                int i=1;
                Variable scrapedUrl;
                if (LOG.isDebugEnabled())   LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
                while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i)) != null)
                {
                    if (LOG.isTraceEnabled())   LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
                    // parse this URL
                    setup.put(START_URL_VARIABLE, scrapedUrl.toString());
                    scrapeUrl(setup, pageConfigXml);
                    // move on
                    i++;
                }
                if (LOG.isDebugEnabled())   LOG.debug("No more URLs found.");
                // i has moved one past the last URL found
                return i-1;
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
                return -1;
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * The script must contain a hardcoded URL.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(File configXml)
        {
            scrapeUrl((String)null, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param url The URL to scrape. If null, the URL must be set in the config itself.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(String url, File configXml)
        {
            Map setup = new HashMap();
            if (url!=null)  setup.put(START_URL_VARIABLE, url);
            scrapeUrl(setup, configXml);
        }

        /**
         * Scrapes an individual page, and passes the results on for processing.
         * @param setup Optional configuration for the script.
         * @param configXml XML describing how to scrape an individual page.
         */
        protected void scrapeUrl(Map setup, File configXml)
        {
            try
            {
                if (LOG.isDebugEnabled())   LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
                // generate a one-off scraper based on preloaded configuration
                ScraperConfiguration config = new ScraperConfiguration(configXml);
                Scraper scraper = new Scraper(config, working.getAbsolutePath());
                setupScraperContext(setup, scraper);
                scraper.execute();

                // handle contents in some way
                pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

                if (LOG.isDebugEnabled())   LOG.debug("Page scraping complete.");
            }
            catch (FileNotFoundException e)
            {
                if (LOG.isErrorEnabled())   LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
            }
            finally
            {
                working.delete();
            }
        }

        /**
         * @param setup Any variables to be set before the script runs.
         * @param scraper The object which does the scraping.
         */
        private void setupScraperContext(Map<String,Object> setup, Scraper scraper)
        {
            if (setup!=null)
                for (String key : setup.keySet())
                    scraper.getContext().setVar(key, setup.get(key));
        }

        /**
         * Process a page that was scraped.
         * @param url The URL that was scraped.
         * @param context The contents of the scraped page.
         */
        public abstract void pageScraped(String url, ScraperContext context);
    }

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

    package scrape;

    import org.webharvest.runtime.ScraperContext;
    import org.webharvest.runtime.variables.Variable;

    public class ActualScraper extends QuickScraper
    {
        public static void main(String[] args)
        {
            try
            {
                ActualScraper scraper = new ActualScraper();
                // do the scraping
                scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
        }

        /**
         * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
         */
        public void pageScraped(String url, ScraperContext context)
        {
            Variable nameVar = context.getVar("name");
            if (nameVar==null)
            {
                if (LOG.isWarnEnabled())    LOG.warn("Scrape for "+url+" produced no data! Ignoring");
                return;
            }

            // store this handset's details
            if (LOG.isInfoEnabled())    LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
        }
    }

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!
