When a website is liked by search engine crawlers, it has a better chance of being liked (or even known at all) by real users. However, some server configurations don't play well with crawler activity, and that's the case of Tomcat 7 and its sessions. One solution exists though: the Crawler Session Manager Valve.
In this article we'll talk about avoiding the session overhead caused by intense crawler activity. In the first part we'll introduce the concept of a valve, the mechanism used to remedy this hyperactivity. After that we'll explain this particular crawler valve. At the end we'll show how to enable the crawler valve in Tomcat.
Valves in Tomcat
Tomcat's valves are a kind of hook executed by the servlet container during the processing of every request. A valve can be registered at different levels: engine, host, or context.
All valves are placed in the org.apache.catalina.valves package and the majority of them extend the ValveBase class. This abstract class implements another valve-oriented type, the org.apache.catalina.Valve interface. If we look at it, we can observe that it stipulates the use of the chain-of-responsibility design pattern through its getNext() and setNext(Valve valve) methods. The valves themselves are executed through the invoke(Request request, Response response) method.
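To illustrate this contract, here is a minimal sketch of a custom valve (RequestLoggingValve is a hypothetical name, not a class shipped with Tomcat): it logs the requested URI and then hands the request over to the next valve of the pipeline.

import java.io.IOException;

import javax.servlet.ServletException;

import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import org.apache.catalina.valves.ValveBase;
import org.apache.juli.logging.Log;
import org.apache.juli.logging.LogFactory;

// Hypothetical valve: logs every request URI before letting
// the processing continue down the pipeline
public class RequestLoggingValve extends ValveBase {

  private static final Log log = LogFactory.getLog(RequestLoggingValve.class);

  @Override
  public void invoke(Request request, Response response)
      throws IOException, ServletException {
    log.info("Handling request for " + request.getRequestURI());
    // chain-of-responsibility: delegate to the next valve in the pipeline
    getNext().invoke(request, response);
  }
}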
Crawler Session Manager Valve
After this short introduction, let's come back to the goal of this article, the crawler session valve. Represented by the CrawlerSessionManagerValve class, its main purpose is to avoid creating a session for every crawler visit or, for example, for every ping from different proxies. By ensuring that every crawler is associated with a single session, this valve reduces the memory consumed by the sessions otherwise created for every crawler request. This can be catastrophic and produce out-of-memory errors when, for example, you have a big session timeout (10 hours or more): a crawler issuing thousands of requests would keep thousands of sessions alive simultaneously.
The configuration of this valve is based on 3 attributes:
className
It represents the class implementing this valve. The only correct value for this attribute is org.apache.catalina.valves.CrawlerSessionManagerValve.
crawlerUserAgents
It's a regular expression listing all the crawlers managed by the valve. CrawlerSessionManagerValve tries to match this regex against the HTTP request's User-Agent header. If they match, it means that the user accessing the page is a search engine crawler. This detection starts with the boolean isBot = false flag in the invoke method:
// From Tomcat 7 version
@Override
public void invoke(Request request, Response response)
    throws IOException, ServletException {
  boolean isBot = false;
  // ...
  // Is this a crawler - check the UA headers
  Enumeration<String> uaHeaders = request.getHeaders("user-agent");
  String uaHeader = null;
  if (uaHeaders.hasMoreElements()) {
    uaHeader = uaHeaders.nextElement();
  }
  // If more than one UA header - assume not a bot
  if (uaHeader != null && !uaHeaders.hasMoreElements()) {
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": UserAgent=" + uaHeader);
    }
    if (uaPattern.matcher(uaHeader).matches()) {
      isBot = true;
      if (log.isDebugEnabled()) {
        log.debug(request.hashCode() + ": Bot found. UserAgent=" + uaHeader);
      }
    }
  }
  // ...

As you can see further in the code, the valve checks whether the crawler already has an associated session id. If it doesn't, a new session id is created. Otherwise, the old one is retrieved from the private ConcurrentHashMap<String, String> clientIpSessionId. You can observe it here:

if (isBot) {
  clientIp = request.getRemoteAddr();
  sessionId = clientIpSessionId.get(clientIp);
  if (sessionId != null) {
    request.setRequestedSessionId(sessionId);
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": SessionID=" + sessionId);
    }
  }
}
// ...
if (isBot) {
  if (sessionId == null) {
    // Has bot just created a session, if so make a note of it
    HttpSession s = request.getSession(false);
    if (s != null) {
      clientIpSessionId.put(clientIp, s.getId());
      sessionIdClientIp.put(s.getId(), clientIp);
      // #valueUnbound() will be called on session expiration
      s.setAttribute(this.getClass().getName(), this);
      s.setMaxInactiveInterval(sessionInactiveInterval);
      if (log.isDebugEnabled()) {
        log.debug(request.hashCode() + ": New bot session. SessionID=" + s.getId());
      }
    }
  } else {
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": Bot session accessed. SessionID=" + sessionId);
    }
  }
}
The default value for this attribute is:
private String crawlerUserAgents = ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";
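To check which User-Agent strings this default expression catches, a quick standalone test can be run; a minimal sketch (the class name and the sample agents are ours, not part of Tomcat):

import java.util.regex.Pattern;

public class CrawlerUserAgentsTest {
  public static void main(String[] args) {
    // same default expression as CrawlerSessionManagerValve in Tomcat 7
    Pattern uaPattern =
        Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");
    String[] agents = {
      "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
      "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"
    };
    for (String agent : agents) {
      // matches() must consume the whole header, hence the .* around the names
      System.out.println(uaPattern.matcher(agent).matches() + " <- " + agent);
    }
  }
}

The two bots print true because their User-Agent contains "bot"; the Firefox header prints false.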
sessionInactiveInterval
This third attribute defines the timeout of the crawler's session. This timeout should usually be lower than the one applied to the sessions of normal users. The value is expressed in seconds. The default value is:
private int sessionInactiveInterval = 60;
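For comparison, the timeout of regular user sessions is usually declared in web.xml and expressed in minutes, not seconds; a 30-minute timeout, for example, looks like this:

<session-config>
  <session-timeout>30</session-timeout>
</session-config>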
Implement Crawler Session Manager Valve in Tomcat 7
We suppose that you have a Tomcat instance installed with, at least, the default webapp configured (localhost:8080). If it's not the case, you must set it up before continuing. If you're ready, there are a few more steps to configure the crawler session valve. To be able to test it, we'll make the valve treat our own browser as a search engine crawler. First, we retrieve the User-Agent value sent with our HTTP requests, as for example:
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36
Now, we open the configuration file of our webapp (server.xml or the webapp's context.xml) and add a Valve declaration under the <Context> element, as shown below.
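Any regular expression matching the Chromium User-Agent quoted above will do; the .*Chromium.* value below is our assumption, one possibility among others:

<!-- crawlerUserAgents is an assumption: any regex matching the Chromium UA works -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*Chromium.*" sessionInactiveInterval="60"/>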
To test it, we set the FINE logging level for all valve classes in /etc/tomcat7/logging.properties:
org.apache.catalina.valves.level=FINE

A simple Tomcat restart and we can track the output in the catalina.out file while accessing our webapp from two different browsers: Chromium (user-agent already quoted) and Firefox (with the following user-agent: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"). The logs should show you:
juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=k3zbqnc9dgan1qx5vlzm6qbu6
juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: Bot found. UserAgent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36
juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=null
juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: UserAgent=Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0

Note that every time a bot is found, a specific message is printed in the log files: "Bot found". This message appears only for the user accessing the page with Chromium ("Précis" is simply the French label of the FINE logging level).
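If you prefer to script this check instead of juggling two browsers, a request with a forged User-Agent can be sent programmatically; a minimal sketch (the class name and the URL are ours, assuming the default webapp and the .*Chromium.* regex configured above):

import java.net.HttpURLConnection;
import java.net.URL;

public class CrawlerRequestSimulator {
  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 2; i++) {
      URL url = new URL("http://localhost:8080/");
      HttpURLConnection connection = (HttpURLConnection) url.openConnection();
      // forge the Chromium User-Agent matched by crawlerUserAgents
      connection.setRequestProperty("User-Agent",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
          + "Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36");
      System.out.println("HTTP status: " + connection.getResponseCode());
      // if the page creates sessions, the second call should reuse the
      // bot session mapped to our IP instead of issuing a new JSESSIONID
      System.out.println("Set-Cookie: " + connection.getHeaderField("Set-Cookie"));
    }
  }
}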
As you could see in this article, Tomcat already provides protection against massive session creation. Thanks to it, we can save a lot of the memory otherwise occupied by the sessions created for each search engine crawler visit.