When a website is liked by search engine crawlers, it has a better chance of being liked (or even known at all) by real users. However, some server configurations don't play well with crawler activity, and that's the case of Tomcat 7 and its sessions. One solution exists though: the Crawler Session Manager Valve.
In this article we'll talk about avoiding the session overhead caused by intense crawler activity. In the first part we'll introduce the concept of a valve, the mechanism used to remedy this hyperactivity. After that we'll explain this particular crawler valve. At the end we'll show how to enable the crawler valve in Tomcat.
Valves in Tomcat
Tomcat's valves are a kind of hook executed by the servlet container during the processing of every request. A valve can be registered at different levels: engine, host, or context.
All valves are placed in the org.apache.catalina.valves package and the majority of them extend the ValveBase class. This abstract class implements another valve-oriented type, the org.apache.catalina.Valve interface. If we look at it, we can observe that it stipulates the use of the chain-of-responsibility design pattern through its getNext() and setNext(Valve valve) methods. The valves themselves are executed through the invoke(Request request, Response response) method.
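To illustrate this contract, here is a minimal sketch of a custom valve (RequestLoggingValve is a hypothetical name, not a class shipped with Tomcat): it logs the requested URI and then hands the request over to the next valve of the pipeline.

import java.io.IOException;

import javax.servlet.ServletException;

import org.apache.catalina.connector.Request;
import org.apache.catalina.connector.Response;
import org.apache.catalina.valves.ValveBase;
import org.apache.juli.logging.Log;
import org.apache.juli.logging.LogFactory;

// Hypothetical valve: logs every request URI before letting
// the processing continue down the pipeline
public class RequestLoggingValve extends ValveBase {

  private static final Log log = LogFactory.getLog(RequestLoggingValve.class);

  @Override
  public void invoke(Request request, Response response)
      throws IOException, ServletException {
    log.info("Handling request for " + request.getRequestURI());
    // chain-of-responsibility: delegate to the next valve in the pipeline
    getNext().invoke(request, response);
  }
}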
Crawler Session Manager Valve
After this short introduction, let's come back to the goal of this article, the crawler session valve. Represented by the CrawlerSessionManagerValve class, its main purpose is to avoid creating a session for every crawler visit or, for example, for every ping from different proxies. By ensuring that every crawler is associated with a single session, this valve reduces the memory consumed by the sessions otherwise created for every crawler request. This can be catastrophic and produce out-of-memory errors when, for example, you have a big session timeout (10 hours or more): a crawler issuing thousands of requests would keep thousands of sessions alive simultaneously.
The configuration of this valve is based on 3 attributes:
className
It represents the class implementing this valve. The only correct value for this attribute is org.apache.catalina.valves.CrawlerSessionManagerValve.
crawlerUserAgents
It's a regular expression listing all the crawlers managed by the valve. CrawlerSessionManagerValve tries to match this regex against the HTTP request's User-Agent header. If they match, it means that the user accessing the page is a search engine crawler. This detection starts with the boolean isBot = false flag in the invoke method:
// From Tomcat 7 version
@Override
public void invoke(Request request, Response response)
    throws IOException, ServletException {
  boolean isBot = false;
  // ...
  // Is this a crawler - check the UA headers
  Enumeration<String> uaHeaders = request.getHeaders("user-agent");
  String uaHeader = null;
  if (uaHeaders.hasMoreElements()) {
    uaHeader = uaHeaders.nextElement();
  }
  // If more than one UA header - assume not a bot
  if (uaHeader != null && !uaHeaders.hasMoreElements()) {
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": UserAgent=" + uaHeader);
    }
    if (uaPattern.matcher(uaHeader).matches()) {
      isBot = true;
      if (log.isDebugEnabled()) {
        log.debug(request.hashCode() + ": Bot found. UserAgent=" + uaHeader);
      }
    }
  }
  // ...

As you can see further in the code, the valve checks whether the crawler already has an associated session id. If it doesn't, a new session id is created. Otherwise, the old one is retrieved from the private ConcurrentHashMap<String, String> clientIpSessionId. You can observe it here:

if (isBot) {
  clientIp = request.getRemoteAddr();
  sessionId = clientIpSessionId.get(clientIp);
  if (sessionId != null) {
    request.setRequestedSessionId(sessionId);
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": SessionID=" + sessionId);
    }
  }
}
// ...
if (isBot) {
  if (sessionId == null) {
    // Has bot just created a session, if so make a note of it
    HttpSession s = request.getSession(false);
    if (s != null) {
      clientIpSessionId.put(clientIp, s.getId());
      sessionIdClientIp.put(s.getId(), clientIp);
      // #valueUnbound() will be called on session expiration
      s.setAttribute(this.getClass().getName(), this);
      s.setMaxInactiveInterval(sessionInactiveInterval);
      if (log.isDebugEnabled()) {
        log.debug(request.hashCode() + ": New bot session. SessionID=" + s.getId());
      }
    }
  } else {
    if (log.isDebugEnabled()) {
      log.debug(request.hashCode() + ": Bot session accessed. SessionID=" + sessionId);
    }
  }
}
The default value for this attribute is:
private String crawlerUserAgents = ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";
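To check which User-Agent strings this default expression catches, a quick standalone test can be run; a minimal sketch (the class name and the sample agents are ours, not part of Tomcat):

import java.util.regex.Pattern;

public class CrawlerUserAgentsTest {
  public static void main(String[] args) {
    // same default expression as CrawlerSessionManagerValve in Tomcat 7
    Pattern uaPattern =
        Pattern.compile(".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*");
    String[] agents = {
      "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
      "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
      "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"
    };
    for (String agent : agents) {
      // matches() must consume the whole header, hence the .* around the names
      System.out.println(uaPattern.matcher(agent).matches() + " <- " + agent);
    }
  }
}

The two bots print true because their User-Agent contains "bot"; the Firefox header prints false.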
sessionInactiveInterval
This third attribute defines the timeout of the crawler's session. This timeout should usually be lower than the one applied to the sessions of normal users. The value is expressed in seconds. The default value is:
private int sessionInactiveInterval = 60;
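For comparison, the timeout of regular user sessions is usually declared in web.xml and expressed in minutes, not seconds; a 30-minute timeout, for example, looks like this:

<session-config>
  <session-timeout>30</session-timeout>
</session-config>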
Implement Crawler Session Manager Valve in Tomcat 7
We suppose that you have a Tomcat instance installed with, at least, the default webapp configured (localhost:8080). If it's not the case, you must set it up before continuing. If you're ready, there are a few more steps to configure the crawler session valve. To be able to test it, we'll make the valve treat our own browser as a search engine crawler. First, we retrieve the User-Agent value sent with our HTTP requests, as for example:
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36
Now, we open the configuration file of our webapp (server.xml or the webapp's context.xml) and add a Valve declaration under the <Context> element, as shown below.
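Any regular expression matching the Chromium User-Agent quoted above will do; the .*Chromium.* value below is our assumption, one possibility among others:

<!-- crawlerUserAgents is an assumption: any regex matching the Chromium UA works -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*Chromium.*" sessionInactiveInterval="60"/>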
To test it, we set the FINE logging level for all valve classes in /etc/tomcat7/logging.properties:
org.apache.catalina.valves.level=FINE

A simple Tomcat restart and we can track the output in the catalina.out file while accessing our webapp from two different browsers: Chromium (user-agent already quoted) and Firefox (with the following user-agent: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0"). The logs should show you:
juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=k3zbqnc9dgan1qx5vlzm6qbu6
juil. 14, 2014 12:53:47 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: Bot found. UserAgent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36
juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: ClientIp=127.0.0.1, RequestedSessionId=null
juil. 14, 2014 12:54:42 PM org.apache.catalina.valves.CrawlerSessionManagerValve invoke
Précis: 1440520568: UserAgent=Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0

Note that every time a bot is found, a specific message is printed in the log files: "Bot found". This message appears only for the user accessing the page with Chromium ("Précis" is simply the French label of the FINE logging level).
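If you prefer to script this check instead of juggling two browsers, a request with a forged User-Agent can be sent programmatically; a minimal sketch (the class name and the URL are ours, assuming the default webapp and the .*Chromium.* regex configured above):

import java.net.HttpURLConnection;
import java.net.URL;

public class CrawlerRequestSimulator {
  public static void main(String[] args) throws Exception {
    for (int i = 0; i < 2; i++) {
      URL url = new URL("http://localhost:8080/");
      HttpURLConnection connection = (HttpURLConnection) url.openConnection();
      // forge the Chromium User-Agent matched by crawlerUserAgents
      connection.setRequestProperty("User-Agent",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
          + "Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36");
      System.out.println("HTTP status: " + connection.getResponseCode());
      // if the page creates sessions, the second call should reuse the
      // bot session mapped to our IP instead of issuing a new JSESSIONID
      System.out.println("Set-Cookie: " + connection.getHeaderField("Set-Cookie"));
    }
  }
}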
As you could see in this article, Tomcat already provides protection against massive session creation. Thanks to it, we can save a lot of the memory otherwise occupied by the sessions created for each search engine crawler visit.