HOWTO: Collect WebDriver HTTP Request and Response Headers
Posted by Kelvin on 22 Jun 2011 at 11:50 am | Tagged as: programming, Lucene / Solr / Elasticsearch / Nutch, crawling
WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real web browsers such as Firefox, Chrome and IE, and then gives you programmatic access to the resulting DOM.
The problem with WebDriver, though, as reported here, is that the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), so it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.
I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.
ProxyLight from Proxoid
ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.
The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.
I made some modifications to intercept and parse HTTP response headers.
Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip
Using ProxyLight from WebDriver
The modified ProxyLight allows you to process both request and response.
This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!
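As a rough sketch of the decision such a RequestFilter might make, the helper below flags image URLs and URLs on blocked domains. SkipRules, its extension list and the domain names are all made up for illustration and are not part of ProxyLight. The wiring into RequestFilter is left out on purpose, since the exact filter signature isn't shown in this post.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of ProxyLight): decides whether a request URL
// should be skipped because it is an image or belongs to a blocked domain.
final class SkipRules {
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<String>(Arrays.asList(".png", ".jpg", ".jpeg", ".gif", ".ico"));

    private final Set<String> blockedDomains = new HashSet<String>();

    SkipRules(String... domains) {
        for (String domain : domains) {
            blockedDomains.add(domain.toLowerCase(Locale.ROOT));
        }
    }

    boolean shouldSkip(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: let it through untouched
        }

        // Match the blocked domain itself and any of its subdomains.
        String host = uri.getHost();
        if (host != null) {
            host = host.toLowerCase(Locale.ROOT);
            for (String domain : blockedDomains) {
                if (host.equals(domain) || host.endsWith("." + domain)) {
                    return true;
                }
            }
        }

        // Match on the path's extension, ignoring any query string.
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        path = path.toLowerCase(Locale.ROOT);
        int dot = path.lastIndexOf('.');
        return dot >= 0 && IMAGE_EXTENSIONS.contains(path.substring(dot));
    }
}
```

A request filter could then call something like rules.shouldSkip(url) on each incoming request and drop the ones that match.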
What your WebDriver code has to do then, is:
- Ensure the ProxyLight server is started
- Add Request and Response Filters to the ProxyLight server
- Maintain a cache of intercepted requests and responses which you can then retrieve
- Ensure the native browser uses our ProxyLight server
Here's a sample class to get you started:
package org.supermind.webdriver;

import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

import java.util.LinkedHashMap;
import java.util.Map;

public class SampleWebDriver {
    protected int localProxyPort = 5368;
    protected ProxyLight proxy;

    // Bounded response table: drops the oldest entry once it holds more than 100.
    // Note: this is not thread-safe.
    // Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
    private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Response> eldest) {
            return size() > 100;
        }
    };

    public Response fetch(String url) {
        if (proxy == null) {
            initProxy();
        }

        // Get the native browser to use our proxy
        FirefoxProfile profile = new FirefoxProfile();
        profile.setPreference("network.proxy.type", 1);
        profile.setPreference("network.proxy.http", "localhost");
        profile.setPreference("network.proxy.http_port", localProxyPort);

        FirefoxDriver driver = new FirefoxDriver(profile);
        try {
            // Now fetch the URL
            driver.get(url);
            // May be null if the proxy recorded the response under a different URL
            return responseTable.remove(driver.getCurrentUrl());
        } finally {
            // Close the browser so repeated fetches don't leak Firefox instances
            driver.quit();
        }
    }

    private void initProxy() {
        proxy = new ProxyLight();
        this.proxy.setPort(localProxyPort);

        // this response filter adds the intercepted response to the cache
        this.proxy.getResponseFilters().add(new ResponseFilter() {
            public void filter(Response response) {
                responseTable.put(response.getRequest().getUrl(), response);
            }
        });

        // add request filters here if needed

        // now start the proxy
        try {
            this.proxy.start();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        SampleWebDriver driver = new SampleWebDriver();
        Response res = driver.fetch("http://www.lucenetutorial.com");
        if (res != null) {
            System.out.println(res.getHeaders());
        } else {
            System.out.println("No response was captured for that URL");
        }
    }
}
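The sample class notes that its response table is not thread-safe and points to ConcurrentLinkedHashMap. If you'd rather avoid the extra dependency, one possible alternative is a size-capped LinkedHashMap wrapped in Collections.synchronizedMap, sketched below. BoundedCache is a hypothetical name, the value type is left generic so the sketch compiles without ProxyLight, and I'm assuming the proxy may call response filters from a different thread than fetch().

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stdlib-only cache: bounded in size, safe to share between
// the proxy's filter callbacks and the code that retrieves the responses.
final class BoundedCache<V> {
    private final Map<String, V> entries;

    BoundedCache(final int maxEntries) {
        // removeEldestEntry runs inside put(), which the synchronized wrapper guards.
        this.entries = Collections.synchronizedMap(new LinkedHashMap<String, V>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > maxEntries;
            }
        });
    }

    // Called from the response filter.
    void put(String url, V value) {
        entries.put(url, value);
    }

    // Called after driver.get(): returns and removes the entry, or null if absent.
    V take(String url) {
        return entries.remove(url);
    }

    int size() {
        return entries.size();
    }
}
```

In the sample class this would stand in for responseTable, with V being Response and a cap of 100. Eviction is by insertion order, which is enough here because entries are removed as soon as they are read.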