Session crawler valve in Tomcat

When a website is liked by search engine crawlers, it has more chances to be liked (or at least known) by real users. However, some server configurations don't cope well with crawler activity. This is the case for Tomcat 7 and its sessions. Fortunately, a solution exists: the crawler session manager valve.

In this article we'll talk about avoiding the session overhead caused by excessive crawler activity. In the first part we'll introduce the concept of a valve, which is the tool used to tame this hyperactivity. After that we'll explain the crawler valve itself. At the end we'll show how to set up this special crawler valve in Tomcat.

Valves in Tomcat

Tomcat's valves are a kind of hook that this servlet container executes for every request it processes. Valves can be declared at different levels: Engine, Host or Context. These hooks are always executed before the request itself is processed. This way we can, for example, filter which clients may access a given engine, host or context. To do so, we use org.apache.catalina.valves.RemoteHostValve.

The valves shipped with Tomcat are placed in the org.apache.catalina.valves package and the majority of them extend the ValveBase class. This abstract class implements another valve-oriented type, the org.apache.catalina.Valve interface. If we look at it, we can observe that it relies on the chain-of-responsibility design pattern through the getNext() and setNext(Valve valve) methods. The valves are executed through the invoke(Request request, Response response) method.
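
To make the chaining more concrete, here is a minimal sketch of the pattern in plain Java. It doesn't use Tomcat's classes: SimpleValve, SimpleValveBase, NamedValve and ValveChainDemo are hypothetical names invented for this illustration, and the request is reduced to a simple String instead of Tomcat's Request and Response objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for org.apache.catalina.Valve:
// the real interface works on Tomcat's Request and Response objects.
interface SimpleValve {
    SimpleValve getNext();
    void setNext(SimpleValve valve);
    void invoke(String request, List<String> trace);
}

// Plays the role of ValveBase: it only stores the reference to the next valve.
abstract class SimpleValveBase implements SimpleValve {
    private SimpleValve next;

    @Override
    public SimpleValve getNext() { return next; }

    @Override
    public void setNext(SimpleValve valve) { this.next = valve; }
}

// Records its own name, then hands the request over to the next valve, if any.
class NamedValve extends SimpleValveBase {
    private final String name;

    NamedValve(String name) { this.name = name; }

    @Override
    public void invoke(String request, List<String> trace) {
        trace.add(name + " handled " + request);
        if (getNext() != null) {
            getNext().invoke(request, trace);
        }
    }
}

class ValveChainDemo {
    // Builds a two-valve chain and returns the order in which the valves ran.
    static List<String> run(String request) {
        SimpleValve first = new NamedValve("access-filter");
        SimpleValve second = new NamedValve("crawler-sessions");
        first.setNext(second);

        List<String> trace = new ArrayList<>();
        first.invoke(request, trace);
        return trace;
    }

    public static void main(String[] args) {
        run("GET /index.html").forEach(System.out::println);
    }
}
```

Each valve decides by itself whether to pass the request to its successor, which is what allows a filtering valve to stop the processing early.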

Crawler Session Manager Valve

After this short introduction, let's come back to the subject of this article, the crawler session valve. Represented by the CrawlerSessionManagerValve class, its main purpose is to avoid creating a new session for every crawler visit or, for example, for every ping from different proxies. By ensuring that every crawler is associated with a single session, the valve reduces the memory consumed by sessions that would otherwise be created for each crawler request. Without it, the situation can become catastrophic and end in out-of-memory errors, for example when sessions have a long timeout (10 hours or more).
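
The idea can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not Tomcat's implementation: the class CrawlerSessionSketch, its methods and the pattern passed in the example are hypothetical, and the sketch assumes that a crawler is recognized by its User-Agent header and remembered by its IP address.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Simplified model of the idea, not Tomcat's code: all names are hypothetical.
class CrawlerSessionSketch {
    private final Pattern crawlerUserAgents;
    // One session ID remembered per crawler IP address.
    private final Map<String, String> sessionIdByClientIp = new ConcurrentHashMap<>();
    private final AtomicInteger createdSessions = new AtomicInteger();

    CrawlerSessionSketch(String crawlerUserAgentsRegex) {
        this.crawlerUserAgents = Pattern.compile(crawlerUserAgentsRegex);
    }

    // Returns the session ID to use for a request that arrived without one.
    String resolveSession(String clientIp, String userAgent) {
        boolean isBot = userAgent != null
                && crawlerUserAgents.matcher(userAgent).matches();
        if (!isBot) {
            return newSessionId(); // regular visitors get their own session
        }
        // Crawlers reuse the session already bound to their IP, if any.
        return sessionIdByClientIp.computeIfAbsent(clientIp, ip -> newSessionId());
    }

    int createdSessions() {
        return createdSessions.get();
    }

    private String newSessionId() {
        return "session-" + createdSessions.incrementAndGet();
    }

    public static void main(String[] args) {
        CrawlerSessionSketch sketch = new CrawlerSessionSketch(".*[bB]ot.*");
        // A cookie-less crawler hitting the site 1000 times from the same IP.
        for (int i = 0; i < 1000; i++) {
            sketch.resolveSession("10.0.0.1", "ExampleBot/1.0");
        }
        System.out.println(sketch.createdSessions()); // prints 1
    }
}
```

Without the lookup by IP, the loop in main would leave 1000 sessions in memory until their timeout expires.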

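The configuration of this valve is based on 3 attributes:

- className: the Java class of the valve, here org.apache.catalina.valves.CrawlerSessionManagerValve
- crawlerUserAgents: the regular expression used to recognize crawlers from the User-Agent header of the request
- sessionInactiveInterval: the time, in seconds, after which an inactive crawler session is invalidated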

Implement Crawler Session Manager Valve in Tomcat 7

We suppose that you have a Tomcat instance installed and, at least, the default webapp configured (localhost:8080). If that's not the case, you must set it up before continuing. Once you're ready, there are a few more steps to configure the crawler session valve. To be able to test it, we'll make our own browser pass for a search engine crawler. First, we retrieve the User-Agent value of our HTTP requests, for example:

User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

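Now, we open the configuration file of our webapp and declare the valve there, with a crawlerUserAgents pattern that matches the User-Agent above, so that our Chromium browser is treated as a crawler.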
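
Before restarting Tomcat, it can help to check which User-Agent values a candidate pattern catches. The sketch below is only an illustration: the pattern .*Chromium.* and the Valve declaration quoted in the comment are assumptions chosen for this test, the class name CrawlerUserAgentCheck is hypothetical, and the check assumes that the whole header must match the expression.

```java
import java.util.regex.Pattern;

class CrawlerUserAgentCheck {
    // Assumed pattern (chosen for this test): flags Chromium as a "crawler".
    // It would correspond to a declaration along these lines:
    // <Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
    //        crawlerUserAgents=".*Chromium.*" />
    static final Pattern CRAWLER_USER_AGENTS = Pattern.compile(".*Chromium.*");

    // Assumes the whole User-Agent value has to match the expression.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && CRAWLER_USER_AGENTS.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        String chromium = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 "
                + "Chrome/34.0.1847.116 Safari/537.36";
        String firefox = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) "
                + "Gecko/20100101 Firefox/28.0";

        System.out.println("Chromium treated as bot: " + isCrawler(chromium)); // true
        System.out.println("Firefox treated as bot: " + isCrawler(firefox));   // false
    }
}
```

A pattern without the leading and trailing .* would not match here, because the expression is compared against the complete header and not searched inside it.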




To test it, we'll set the FINE logging level for all valve classes in /etc/tomcat7/logging.properties:

org.apache.catalina.valves.level=FINE

After a simple Tomcat restart, we can follow the output in the catalina.out file while accessing our webapp from two different browsers: Chromium (User-Agent already quoted) and Firefox (with the following User-Agent: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"). The logs should show:

juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=k3zbqnc9dgan1qx5vlzm6qbu6

juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: Bot found. UserAgent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36

juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=null

juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: UserAgent=Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0

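Note that every time a bot is detected, a specific message is printed in the log file: "Bot found". Here this message appears only for the user accessing the page with Chromium. The "Précis" label is simply the FINE level as rendered by the French locale of the machine that produced these logs.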

As you could see in this article, Tomcat already provides protection against massive session creation. Thanks to it, we can save a lot of the memory that would otherwise be occupied by sessions created for each search engine crawler visit.

