Sunday, December 10, 2006

Short note: Running Java CAPS on Java SE 6

Today Java SE 6 was released. It comes with many new features and cool tools, one of them being jmap, as described in a previous blog on permgen exceptions. The Integration Server is not officially supported on SE 6 yet. However, if you want to run the Java CAPS Integration Server on SE 6, this is what you can do:

  1. Install JDK 6 somewhere, e.g. c:\java
  2. Install the IS somewhere, e.g. c:\logicalhost
  3. Rename c:\logicalhost\jre to c:\logicalhost\jre.old
  4. Copy c:\java\jre1.6.0 to c:\logicalhost\jre
  5. Copy c:\java\jdk1.6.0\lib\tools.jar to c:\logicalhost\jre\lib
  6. Copy c:\logicalhost\jre\bin\javaw.exe to c:\logicalhost\jre\bin\is_domain1.exe
  7. Copy c:\logicalhost\jre\bin\javaw.exe to c:\logicalhost\jre\bin\isprocmgr_domain1.exe
  8. Edit c:\logicalhost\is\domains\domain1\config\domain.xml and comment out these lines:
<!--
<jvm-options>-Dcom.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager=com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager</jvm-options>
<jvm-options>-Dorg.xml.sax.driver=com.sun.org.apache.xerces.internal.parsers.SAXParser</jvm-options>
<jvm-options>-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl</jvm-options>
<jvm-options>-Dcom.sun.org.apache.xerces.internal.xni.parser.XMLParserConfiguration=com.sun.org.apache.xerces.internal.parsers.XIncludeParserConfiguration</jvm-options>
<jvm-options>-Djavax.xml.transform.TransformerFactory=com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl</jvm-options>
<jvm-options>-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl</jvm-options>
<jvm-options>-Djavax.xml.soap.MessageFactory=com.sun.xml.messaging.saaj.soap.ver1_1.SOAPMessageFactory1_1Impl</jvm-options>
<jvm-options>-Djavax.xml.soap.SOAPFactory=com.sun.xml.messaging.saaj.soap.ver1_1.SOAPFactory1_1Impl</jvm-options>
<jvm-options>-Djavax.xml.soap.SOAPConnectionFactory=com.sun.xml.messaging.saaj.client.p2p.HttpSOAPConnectionFactory</jvm-options>
-->

i.e. add <!-- before and add --> after these lines.

Also comment out this line:

<!--        <jvm-options>-server</jvm-options>-->

After these changes you can run the Integration Server with Java SE 6. These are not “official” recommendations (as mentioned, there’s no support for SE 6 just yet); also, the lines commented out are optimizations that need to be re-established for SE 6 yet, so don’t do any performance comparisons just yet.


 

Saturday, December 2, 2006

Moving out of people management

Last week marked four years since I started at SeeBeyond. At SeeBeyond, I managed products (the JMS server and J2EE application server in Java CAPS), technology, and people. The latter was part of the culture at SeeBeyond: the only way to have influence on any product was to have a team of people reporting to you.

Having responsibility for a team has been interesting. Over the past four years, I've seen people grow. I've seen people "turn around" on whom I was ready to give up. This was very satisfying. With them and through them, I've grown as well. However, in the past year I felt it was time for me to move to the next level. Also, with the increasing number of people in my team (two originally, eight at one point, and six lately), there was less and less time to stay involved with technology at a deep enough level.

When Sun acquired SeeBeyond last year, I was classified as a people manager because I had people reporting to me. As it turned out, Sun's culture is quite different from SeeBeyond's: there is a dual career ladder with appreciation and growth opportunities for both people managers and individual contributors. But unlike at SeeBeyond, people managers primarily manage people and are less involved with technology; eleven people per manager is seen as the norm. It is the technical individual contributors who manage and shape products.

Moving up to the next step on Sun's career ladder, I requested a "diagonal promotion": up one level, and from the people-management track to the technology track. This week I got my promotion: I'm now a Senior Staff Engineer. Does this mean I'm now a heads-down techie? No, of course not. Sure, there will be no more managing reports, but I'll now devote more time to influencing people in other teams, even outside the organization. And yes, hopefully there'll be more time to dive a little deeper into a piece of technology.

Wednesday, November 15, 2006

Using java.util.logging in BEA Weblogic

Some weeks ago I was making sure that the JMSJCA connector runs properly on BEA Weblogic 9.1. Of course I ran into some issues: never assume that your code is really portable until you try it out. I also found some issues in Weblogic. Now I don't want to make this into a bashing session of Weblogic: since I work for Sun Microsystems, and Weblogic can be considered a competitor of Glassfish, that would be in bad taste. However, there's one thing that bugged me a lot and for which I want to share a solution: logging.

Dude, where's my log output?

At SeeBeyond (and of course at Sun) we standardized on the use of java.util.logging. You could argue whether java.util.logging is technically the best solution, but since it's a standard it's a lot better if you use that package than it is to write your own or use other third party logging tools. Standardization gives portability! So, JMSJCA, STCMS, etc all use java.util.logging.
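As a minimal illustration of the standard package (class and logger names here are made up for the example), typical java.util.logging usage looks like this:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogDemo {
    // Convention: one logger per class, named after the class
    private static final Logger LOG = Logger.getLogger(LogDemo.class.getName());

    public static void main(String[] args) {
        LOG.info("Component started");                 // INFO is on by default
        LOG.log(Level.FINE, "Detail: {0}", "payload"); // FINE is off by default
        LOG.setLevel(Level.FINE);                      // levels can change at runtime
        LOG.fine("Now enabled (if a handler also accepts FINE)");
    }
}
```

Because the logger is obtained by name from the LogManager, any container that registers its own handlers on the root logger picks this output up without code changes; that is exactly the portability argument made above.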

When I started to run my test suite on Weblogic, I was surprised that the log output from the tests did not end up in one of the Weblogic log files. Instead, output appeared on the console. Surely there was a configuration option somewhere in Weblogic to fix this, right? I couldn't find one. The problem became even more annoying when I tried to enable debug logging. In Glassfish you go to the logging configuration screen, and you simply type the name of the logging category that you want to change, select the level and that's it. Not so in Weblogic. What was going on? It was time to look a little deeper into this.

How Weblogic uses java.util.logging.

I must admit that the Weblogic documentation is good and very easily accessible; I wish we had the same situation at SeeBeyond. Anyway, from the documentation it appeared that Weblogic fully supports java.util.logging. I quote from the documentation: "To distribute messages, WebLogic Server supports Java based logging by default. The LoggingHelper class provides access to the java.util.logging.Logger object used for server logging. [snip] If your application is configured for Java Logging or Log4j, in order to publish application events using WebLogic logging services, create a custom handler or appender that relays the application events to WebLogic logging services using the message catalogs or Commons API."

Eh, what do you mean... LoggingHelper class? Can't I just use java.util.logging without depending on any Weblogic classes? Do I have to write custom code?

And how do I change log levels dynamically? Isn't there an option in the admin console to do that? An MBean perhaps? I was really surprised to find this recommendation: "If you use the DEBUG severity level, BEA recommends that you create a 'debug mode' for your application. For example, your application can create an object that contains a Boolean value. To enable or disable the debug mode, you toggle the value of the Boolean. Then, for each DEBUG message, you can create a wrapper that outputs the message only if your application's debug mode is enabled."

For example, the following code can produce a debug message:


private static boolean debug = Boolean.getBoolean("my.debug.enabled");

if (debug) {
    mylogger.debug("Something debuggy happened");
}
[snip]
To enable your application to print this message, you include the following Java option when you start the application's JVM:

-Dmy.debug.enabled=true

That was a good recommendation... TEN YEARS AGO!

A fix

Fortunately, things are not as grim as they look. It's quite easy to write a little bit of code that is deployed globally independently of your application, so that you can continue to use java.util.logging without sprinkling Weblogic dependencies all over your application. First of all, what's all this with this BEA logger? As it turns out, during startup, Weblogic instantiates a java.util.logging.Logger object and hooks a number of Handler objects to it. So if you log to that particular logger, your log output will appear in one of the Weblogic log files. Can't you get access to this Logger using Logger.getLogger(name)? No you cannot: Weblogic is not using the LogManager at all. That particular logger is not registered in the LogManager. That is the reason why you need to use the weblogic.logging.LoggingHelper.getServerLogger() method.

Once we have a reference to this special Logger object, we can get the list of its Handler objects and assign that list to the root logger. As a result, whenever you use a Logger object obtained through Logger.getLogger(), the output will go to the Weblogic handlers. Problem solved!

Of course we want to centralize this code: after all, we don't want to introduce this dependency in our application code. There are two ways of doing this: we can write our own LogManager and set that as the JVM's LogManager singleton (we can do this because Weblogic doesn't set a LogManager), or we can write a Weblogic startup class. An instance of such a class is created at server startup and its startup() method is called. This is the approach I've taken:

package com.sun.bealog;

import java.util.Hashtable;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import weblogic.common.T3ServicesDef;
import weblogic.common.T3StartupDef;
import weblogic.logging.LoggingHelper;

/**
 * Allows logging by applications and components using the java.util.logging
 * package by sending output from java.util.logging.Logger-s to the
 * WebLogic trace file.
 *
 * @author fkieviet
 */
public class BeaLog implements T3StartupDef {

    public String startup(String string, Hashtable hashtable) throws Exception {
        String result = "";
        Logger wllogger = LoggingHelper.getServerLogger();
        Handler[] wlhandlers = wllogger.getHandlers();

        // Change log level on log file handler
        for (int i = 0; i < wlhandlers.length; i++) {
            if (wlhandlers[i].getLevel().intValue() >= 530) {
                result += "log level changed of " + wlhandlers[i] + "; ";
                wlhandlers[i].setLevel(Level.FINEST);
            }
        }

        // Copy handlers from wllogger to rootlogger
        Logger rootlogger = Logger.getLogger("");
        Handler[] toremove = rootlogger.getHandlers();
        for (int j = 0; j < toremove.length; j++) {
            rootlogger.removeHandler(toremove[j]);
        }
        for (int i = 0; i < wlhandlers.length; i++) {
            rootlogger.addHandler(wlhandlers[i]);
        }
        result += "root handler now has " + rootlogger.getHandlers().length + " handlers; ";

        // Register wllogger so that it can be manipulated through the mbean
        boolean alreadythere = !LogManager.getLogManager().addLogger(wllogger);

        // Register mbean
        InitialContext ctx = null;
        String[] names = new String[] {"java:comp/jmx/runtime", "java:comp/env/jmx/runtime"};
        for (int i = 0; i < names.length; i++) {
            try {
                ctx = new InitialContext();
                MBeanServer mbeanServer = (MBeanServer) ctx.lookup(names[i]);
                mbeanServer.registerMBean(LogManager.getLoggingMXBean(),
                    new ObjectName("java.util.logging:type=Logging"));
                result += "mbean registered; ";
            } catch (Exception e) {
                // ignore
            } finally {
                safeClose(ctx);
            }
        }

        // For Java CAPS: disable two commonly used loggers that are used to
        // relay call context information
        Logger.getLogger("com.stc.EnterContext").setLevel(Level.OFF);
        Logger.getLogger("com.stc.ExitContext").setLevel(Level.OFF);

        return this.getClass().getName() + result;
    }

    private void safeClose(Context ctx) {
        if (ctx != null) {
            try {
                ctx.close();
            } catch (NamingException ignore) {
                // ignore
            }
        }
    }

    public void setServices(T3ServicesDef t3ServicesDef) {
    }
}

As you may have noticed, there's another problem that's solved in the code snippet above: dynamic log level configuration. The JVM's logging package ships with an MBean that allows you to set log levels dynamically. The only thing that needs to be done is to register this MBean in Weblogic's MBeanServer (Weblogic has an unusual way of accessing its MBeanServer, as you can see in the code). From then on, you can use a tool like jmx-console to access the logging MBean and set log levels.
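For completeness, the same LoggingMXBean can be exercised directly in code. This sketch (the logger name is made up) does locally what a JMX console would do remotely:

```java
import java.util.logging.LogManager;
import java.util.logging.Logger;
import java.util.logging.LoggingMXBean;

public class LevelSwitch {
    public static void main(String[] args) {
        // Hold a strong reference so the logger isn't garbage collected
        Logger logger = Logger.getLogger("com.example.mycomponent");

        LoggingMXBean logging = LogManager.getLoggingMXBean();
        // Levels are passed by name, exactly as over JMX
        logging.setLoggerLevel("com.example.mycomponent", "FINEST");

        System.out.println(logging.getLoggerLevel("com.example.mycomponent"));
        // Note: getLoggerLevel returns "" for loggers that inherit their level
    }
}
```

Keeping a reference to the Logger matters: the LogManager only holds loggers weakly, so a level set on an otherwise unreferenced logger can silently disappear.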

Download a NetBeans project that contains the BeaLog code.

Tuesday, November 14, 2006

More on... How to fix the dreaded "java.lang.OutOfMemoryError: PermGen space" exception (classloader leaks)

I got quite a few comments on my last blog (How to fix the dreaded "java.lang.OutOfMemoryError: PermGen space" exception (classloader leaks)). Apparently more people have been struggling with this problem.


Why bring this up? What's the news? Edward Chou continued to explore options to diagnose classloader leaks. First of all, he explored how to generate a list of orphaned classloaders with jhat. An orphaned classloader is a classloader that is not referenced by any object directly but cannot be garbage collected. The thinking behind this is that programs that create classloaders (e.g. application servers) do maintain references to them. So if there's a classloader that is no longer directly referenced, this classloader is probably a leak. Read about it on his blog (Find Orphaned Classloaders).

Still we were not satisfied: when examining some memory dumps from code that we were not familiar with, we explored yet some other options to diagnose classloader leaks: duplicate classes and duplicate classloaders. Let me explain.

Let's say that your application has a com.xyz.Controller class. If you find many instances of this class object, you likely have a classloader leak. Note the phrase "instances of this class object". What I mean by this: the class com.xyz.Controller is loaded multiple times, i.e. multiple instances of the com.xyz.Controller.class are present.  You can use jhat to run this query: simply list all instances of java.lang.Class.
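This effect is easy to reproduce outside an application server: two classloaders that each define the same class yield two distinct class objects, which is exactly what such a jhat query surfaces. A self-contained sketch (the class and loader names are mine; the loader reads its own .class bytes via getResourceAsStream):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class DupClasses {
    // A loader that defines the class itself instead of delegating for it
    static class IsolatingLoader extends ClassLoader {
        private final byte[] bytes;
        IsolatingLoader(byte[] bytes) { super(null); this.bytes = bytes; }
        protected Class<?> findClass(String name) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    static class Victim {}

    public static void main(String[] args) throws Exception {
        // Read the bytecode of Victim from our own classpath
        String name = "DupClasses$Victim";
        InputStream in = DupClasses.class.getResourceAsStream("/" + name + ".class");
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = in.read(buf)) > 0; ) bos.write(buf, 0, n);
        byte[] bytes = bos.toByteArray();

        Class<?> c1 = new IsolatingLoader(bytes).loadClass(name);
        Class<?> c2 = new IsolatingLoader(bytes).loadClass(name);
        // Same name, two distinct class objects: the signature of a leak in jhat
        System.out.println(c1 == c2);                          // false
        System.out.println(c1.getName().equals(c2.getName())); // true
    }
}
```

In a healthy server, each deployed application version owns exactly one such class object; seeing several for an application class means old classloaders are still pinned in memory.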

Edward modified jhat to generate a list of all classloader instances that have an identical set of classes that it loaded. Typically there's no reason why someone would create two classloader instances and load exactly the same set of classes into them. If you find any in your memory dump, you should get suspicious and take a closer look. Monitor Edward's blog for more details on this.

One more thing: Edward found out that the method java.lang.String.intern() allocates memory in PermGen space. So if your application frequently uses this method with different strings, watch out. Fortunately these strings are subject to garbage collection. But if your application holds references to these strings, thereby making garbage collection impossible, your application may cause the dreaded "java.lang.OutOfMemoryError: PermGen space" exception. No classloaders involved this time.
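A small sketch of intern()'s canonicalizing behavior (where the canonical copies live is VM-specific; pre-Java-7 HotSpot kept them in PermGen):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("hello"); // a fresh object on the heap
        String b = a.intern();          // the canonical copy from the intern pool

        // The interned string is the very same object as the class-file constant,
        // which was interned when the class was loaded
        System.out.println(b == "hello"); // true
        System.out.println(a == b);       // false: 'a' is a distinct heap object
    }
}
```

Each distinct string passed to intern() adds an entry to the pool, which is why interning a stream of unique strings (request IDs, say) can pressure this space even with no classloaders involved.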

Wednesday, October 18, 2006

How to fix the dreaded "java.lang.OutOfMemoryError: PermGen space" exception (classloader leaks)

In the previous blog entry Classloader leaks: the dreaded "java.lang.OutOfMemoryError: PermGen space" exception  I explained how this type of problem can originate in the application code that you deploy to an application server.  In this post I'll explain how to track down the leak so that you can fix it.

Profilers

Memory leak? Use a profiler. Right? Well... generally speaking the answer is yes, but classloader leaks are a bit special...

To refresh your memory (pardon the pun), a memory leak is an object that the system unintentionally hangs on to, thereby making it impossible for the garbage collector to remove this object. The way that profilers find memory leaks is to trace references to a leaked object.

What do I mean by "tracing"? A leaked object can be referenced by another object which itself is a leak. In turn, this object may also be a leak, and so on. This repeats until an object is found that references a leaked object by mistake. This reference is where the problem is, and what you need to fix. Let me try to clarify this by illustrating this with a picture from my previous blog:

In this picture, AppClassloader, LeakServlet.class, STATICNAME, CUSTOMLEVEL, and LeakServlet$1.class are all leaked objects. Because static objects (e.g. STATICNAME) may in turn reference other objects, the number of leaked objects may be in the thousands. Going over each leaked object manually to check for incidental references to it (the red reference in the picture) until you find the troublesome object (CUSTOMLEVEL) is laborious. You would rather have a program find the violating reference for you.

A profiler doesn't tell you which leaked object is interesting to look at (CUSTOMLEVEL). Instead it gives you all leaked objects. Let's say that you would look at STATICNAME. The profiler should now find the route from STATICNAME to LeakServlet.class, to AppClassloader, to LeakServlet$1.class, to CUSTOMLEVEL, to Level.class. In this route, the red line in the picture is the reference that actually causes the leak. I said the profiler should find this route. However, all the profilers that we tried stop tracing as soon as they reach a class object or classloader. There's a good reason for that: the number of traces grows enormous if you follow references through classes, and in most cases these traces are not very useful.

So no luck with profilers! We need to try something else.

JDK 6.0 to the rescue

When Edward Chou and I worked on tracking down classloader leaks last year, we ran the JVM with HPROF to trigger a memory dump, and looked at using Hat to interpret the dump. Hat stands for Heap Analysis Tool and was developed to read dump files generated with HPROF. Unfortunately, the Hat tool blew up reading our dump files. Since we didn't think it was difficult to parse the dump file, we wrote our own utility to read the file and track down the memory leak.

That was last year. This year we have JDK 6.0; this new JDK comes with a few tools that make looking at the VM's memory a lot simpler. First of all, there's a tool called jmap. This command line tool allows you to trigger a dump file without HPROF. It is as simple as typing something like:
    jmap -dump:format=b,file=leak 3144
Here leak is the filename of the dump, and 3144 is the PID of the process. To find the PID, you can use jps.
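On HotSpot VMs the same kind of dump can also be triggered from inside the running process, via the HotSpotDiagnostic MBean (a sketch; this is HotSpot-specific, the file name is arbitrary, and dumpHeap fails if the file already exists):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Same effect as: jmap -dump:format=b,file=leak.hprof <pid>
        server.invoke(new ObjectName("com.sun.management:type=HotSpotDiagnostic"),
                "dumpHeap",
                new Object[] {"leak.hprof", Boolean.TRUE}, // true = live objects only
                new String[] {"java.lang.String", "boolean"});
        System.out.println("dump written to leak.hprof");
    }
}
```

Invoking the MBean by ObjectName, rather than importing com.sun.management classes, keeps the code compilable without non-standard APIs on the classpath.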

Secondly, Hat is now part of the JDK. It is now called jhat. You can run it using a command line like:
    jhat -J-Xmx512m leak
Here leak is the name of the dump file, and as you may have guessed, -J-Xmx512m is a parameter to specify how much memory jhat is allowed to allocate.

When you start jhat it reads the dump file and then listens on an HTTP port. You point your browser to that port (7000 by default) and through that you can browse the memory heap dump. It's a very convenient way of looking at what objects are in memory and how they are connected.

So, it seemed like a good idea to check out what can be done with these new tools to find classloader leaks.

... or not?

Unfortunately jhat, just like the profilers we tried, also stops tracing when it encounters a class. Now what? I decided to download the JDK source code and find out what the problem was. Building the whole JDK is a difficult task, from what I gather from the documentation. Fortunately, jhat is a nicely modularized program: I could just take the com.sun.tools.hat packages out of the source tree, load them in my favorite editor, and compile the code. The patched code was easily packaged and run: I just jar-ed it and added it to the lib/ext directory of the JDK:
    jar -cf C:\apps\Java\jdk1.6.0\jre\lib\ext\ahat.jar -C hat\bin .
    jhat leak

This was really as easy as pie. So after running the program in the debugger for some time, I figured out how it works and what changes I wanted to make. The change is that when you follow the references from a classloader, the modified jhat will follow through all traces from all the instances of the classes that it loaded. With that change, finding the cause of a classloader leak is simple.

An example

Let's look at the example from my previous blog as depicted in the picture above. Using NetBeans I created the following servlet and deployed it to Glassfish:

package com.stc.test;

import java.io.*;
import java.net.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.servlet.*;
import javax.servlet.http.*;

public class Leak extends HttpServlet {

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
        response.setContentType("text/html;charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<html><body><pre>");
        Level custom = new Level("LEAK", 950) {};
        Logger.getLogger(this.getClass().getName()).log(custom, "New level created");
        out.println("</pre></body></html>");
        out.close();
    }

    // Standard HttpServlet methods (doGet/doPost delegating to processRequest) omitted
}

I invoked the servlet to cause the leak. Next I undeployed the servlet. Then I triggered a heap dump:
    jmap -dump:format=b,file=leak 3144
and fired up the modified jhat:
    jhat -J-Xmx512m leak
and brought up the browser. The opening screen shows amongst other things, all classes that are found in the dump:

Finding objects that were leaked is easy since I know that I shouldn't see any objects of the classes that I deployed. Recall that I deployed a class com.stc.test.Leak; so I searched in the browser for the com.stc.test package, and found these classes (never mind the NoLeak class: I used it for testing).

Clicking on the link class com.stc.test.Leak brings up the following screen:


Clicking on the classloader link brings up the following screen:


Scrolling down, I see Reference Chains from Rootset / Exclude weak refs . Clicking on this link invokes the code that I modified; the following screen comes up:


And there's the link to java.util.logging.Level that we've been looking for!

Easy as pie!

Summarizing, the steps are:

  1. undeploy the application that is leaking
  2. trigger a memory dump
  3. run jhat (with modification)
  4. find a leaked class
  5. locate the classloader
  6. find the "Reference chains from root set"
  7. inspect the chains, locate the accidental reference, and fix the code

I'll contact the JDK team to see if they are willing to accept the changes I made to jhat. If you cannot wait, send me an email or leave a comment.

Update (April 2007): Java SE SDK 6.0 update 1 has the updated code.

Other Permgen space tidbits

After fixing the classloader leak, you of course want to test to see if the memory leak has disappeared. You could again trigger a memory dump and run jhat. What you also could try is to see if the amount of used permgen space memory goes up continuously after each deployment/undeployment of your application.

You can monitor permgen space usage using jconsole; you can see the memory usage go up when you repeatedly deploy and undeploy an application. However, this may not be a classloader / memory leak: as it turns out, it's difficult to predict when the garbage collector cleans up permgen space. Pressing the Run GC button in jconsole does not do the trick. Only when you encounter a java.lang.OutOfMemoryError: PermGen space exception can you be sure that there really was no memory left. This is a bit more involved than it should be!

How can we force the garbage collector to kick in? We can force a java.lang.OutOfMemoryError: PermGen space, then release the memory, and then force the garbage collector to kick in. I wrote the following servlet to do that:

package com.stc.test;

import java.io.*;
import java.util.ArrayList;
import javax.servlet.*;
import javax.servlet.http.*;

public class RunGC extends HttpServlet {

    private static class XClassloader extends ClassLoader {
        private byte[] data;
        private int len;

        public XClassloader(byte[] data, int len) {
            super(RunGC.class.getClassLoader());
            this.data = data;
            this.len = len;
        }

        public Class findClass(String name) {
            return defineClass(name, data, 0, len);
        }
    }

    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
        response.setContentType("text/html;charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<html><body><pre>");
        try {
            // Load class data
            byte[] buf = new byte[1000000];
            InputStream inp = this.getClass().getClassLoader()
                .getResourceAsStream("com/stc/test/BigFatClass.class");
            int n = inp.read(buf);
            inp.close();
            out.println(n + " bytes read of class data");

            // Exhaust permgen
            ArrayList keep = new ArrayList();
            int nLoadedAtError = 0;
            try {
                for (int i = 0; i < Integer.MAX_VALUE; i++) {
                    XClassloader loader = new XClassloader(buf, n);
                    Class c = loader.findClass("com.stc.test.BigFatClass");
                    keep.add(c);
                }
            } catch (Error e) {
                nLoadedAtError = keep.size();
            }

            // Release memory
            keep = null;
            out.println("Error at " + nLoadedAtError);

            // Load one more; this should trigger GC
            XClassloader loader = new XClassloader(buf, n);
            Class c = loader.findClass("com.stc.test.BigFatClass");
            out.println("Loaded one more");
        } catch (Exception e) {
            e.printStackTrace(out);
        }
        out.println("</pre></body></html>");
        out.close();
    }
}
In this servlet a custom classloader is instantiated, and a class is loaded in that classloader. That class is really present in the web classloader, but the custom classloader does not delegate to its parent; instead it defines the class itself, using the bytes of the class obtained through getResourceAsStream().

The servlet tries to allocate as many of these custom classes as possible, i.e. until the memory exception occurs. Next, the memory is made eligible for garbage collection, and one more classloader is allocated, thereby forcing garbage collection.

The number of custom classes that can be loaded until a memory exception occurs is a good measure of how much permgen space is available. As it turns out, this metric is much more reliable than the one you get from jconsole.

And more

Edward Chou is thinking of some other ideas to further automate the process of determining exactly where the cause of a classloader leak is. E.g., it should be possible to identify the erroneous reference (the red line in the picture) automatically, since this reference is from one classloader to another. Check his blog in the coming days.

Update (April 2007): You can find an interesting usage of jhat's Object Query Language on Sundarajan's blog to compute histograms of reference chains.

Sunday, October 15, 2006

Classloader leaks: the dreaded "java.lang.OutOfMemoryError: PermGen space" exception

Did you ever encounter a java.lang.OutOfMemoryError: PermGen space error when you redeployed your application to an application server? Did you curse the application server while restarting it so you could continue with your work, thinking that this is clearly a bug in the application server? Those application server developers should get their act together, shouldn't they? Well, perhaps. But perhaps it's really your fault!

Take a look at the following example of an innocent looking servlet.

package com.stc.test;

import java.io.*;
import java.util.logging.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class MyServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
        // Log at a custom level
        Level customLevel = new Level("OOPS", 555) {};
        Logger.getLogger("test").log(customLevel, "doGet() called");
    }
}
Try to redeploy this little sample a number of times. I bet this will eventually fail with the dreaded java.lang.OutOfMemoryError: PermGen space error. If you'd like to understand what's happening, read on.

The problem in a nutshell

Application servers such as Glassfish allow you to write an application (.ear, .war, etc) and deploy this application with other applications on this application server. Should you feel the need to make a change to your application, you can simply make the change in your source code, compile the source, and redeploy the application without affecting the other still running applications in the application server: you don't need to restart the application server. This mechanism works fine on Glassfish and other application servers (e.g. Java CAPS Integration Server).

The way that this works is that each application is loaded using its own classloader. Simply put, a classloader is a special class that loads .class files from jar files. When you undeploy the application, the classloader is discarded, and it and all the classes that it loaded should be garbage collected sooner or later.

However, something may hold on to the classloader and prevent it from being garbage collected. And that's what causes the java.lang.OutOfMemoryError: PermGen space exception.

PermGen space

What is PermGen space anyway? The memory in the Virtual Machine is divided into a number of regions. One of these regions is PermGen. It's an area of memory that is used to (among other things) load class files. The size of this memory region is fixed, i.e. it does not change while the VM is running. You can specify the size of this region with a command line switch: -XX:MaxPermSize. The default is 64 Mb on the Sun VMs.

If there's a problem with garbage collecting classes and if you keep loading new classes, the VM will run out of space in that memory region, even if there's plenty of memory available on the heap. Setting the -Xmx parameter will not help: this parameter only specifies the size of the total heap and does not affect the size of the PermGen region.
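If you'd rather read the region's size and usage programmatically than guess, the platform's memory pool MBeans expose it (a sketch; the pool name varies by VM, e.g. "Perm Gen" plus shared variants on HotSpot, and on Java 8+ the permanent generation was replaced by Metaspace):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class PermGenUsage {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Match both the old permanent generation and its modern replacement
            if (pool.getName().contains("Perm") || pool.getName().contains("Metaspace")) {
                System.out.println(pool.getName()
                        + ": used=" + pool.getUsage().getUsed()
                        + " max=" + pool.getUsage().getMax()); // max may be -1 (unbounded)
            }
        }
    }
}
```

This is the same data jconsole plots on its Memory tab, which makes it handy for scripting the deploy/undeploy monitoring described further down.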

Garbage collecting and classloaders

When you write something silly like

private void x1() {
    for (;;) {
        List c = new ArrayList();
    }
}

you're continuously allocating objects; yet the program doesn't run out of memory: the objects that you create are garbage collected, thereby freeing up space so that you can allocate another object. An object can only be garbage collected if it is "unreachable", meaning there is no way to access the object from anywhere in the program. If nobody can access the object, there's no point in keeping it, so it gets garbage collected. Let's take a look at the memory picture of the servlet example. First, let's simplify the example even further:

package com.stc.test;

import java.io.*;
import java.net.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class Servlet1 extends HttpServlet {
    private static final String STATICNAME = "Simple";

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
    }
}

After loading the above servlet, the following objects are in memory (of course limited to the relevant ones):

In this picture you see the objects loaded by the application classloader in yellow, and the rest in green. You see a simplified container object that holds references to the application classloader that was created just for this application, and to the servlet instance so that the container can invoke the doGet() method on it when a web request comes in. Note that the STATICNAME object is owned by the class object. Other important things to notice:

  1. Like each object, the Servlet1 instance holds a reference to its class (Servlet1.class).
  2. Each class object (e.g. Servlet1.class) holds a reference to the classloader that loaded it.
  3. Each classloader holds references to all the classes that it loaded.
The important consequence of this is that whenever an object outside of AppClassloader holds a reference to an object loaded by AppClassloader, none of the classes can be garbage collected.
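The three points above are easy to verify with a few lines of code; here's a minimal sketch (the class name is mine):

```java
import java.util.ArrayList;

public class RefChain {
    public static void main(String[] args) {
        Object o = new RefChain();
        // 1. Every object holds a reference to its class...
        Class c = o.getClass();
        // 2. ...and every class holds a reference to the classloader that loaded it.
        ClassLoader cl = c.getClassLoader();
        // RefChain was loaded by the application classloader, so cl is not null;
        // ArrayList was loaded by the bootstrap classloader, which shows up as null.
        System.out.println(cl != null);
        System.out.println(new ArrayList().getClass().getClassLoader() == null);
    }
}
```

Running it prints true twice: the demo class comes from an application classloader, while ArrayList comes from the bootstrap classloader (reported as null).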

To illustrate this, let's see what happens when the application gets undeployed: the Container object nullifies its references to the Servlet1 instance and to the AppClassloader object.

As you can see, none of the objects are reachable, so they all can be garbage collected. Now let's see what happens when we use the original example where we use the Level class:
package com.stc.test;

import java.io.*;
import java.net.*;
import java.util.logging.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class LeakServlet extends HttpServlet {
    private static final String STATICNAME = "This leaks!";
    private static final Level CUSTOMLEVEL = new Level("test", 550) {}; // anon class!

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        Logger.getLogger("test").log(CUSTOMLEVEL, "doGet called");
    }
}

Note that the CUSTOMLEVEL's class is an anonymous class. That is necessary because the constructor of Level is protected. Let's take a look at the memory picture of this scenario:


In this picture you see something you may not have expected: the Level class holds a static member to all Level objects that were created. Here's the constructor of the Level class in the JDK:

    protected Level(String name, int value) {
        this.name = name;
        this.value = value;
        synchronized (Level.class) {
            known.add(this);
        }
    }

Here known is a static ArrayList in the Level class. Now what happens if the application is undeployed?


Only the LeakServlet object can be garbage collected. Because of the reference to the CUSTOMLEVEL object from outside of AppClassloader, the CUSTOMLEVEL anonymous class object (LeakServlet$1.class) cannot be garbage collected, and through it neither can the AppClassloader, and hence none of the classes that the AppClassloader loaded can be garbage collected.

Conclusion: any reference from outside the application to an object in the application of which the class is loaded by the application's classloader will cause a classloader leak.
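You can see Level's static registry in action without any servlet container; a small sketch (class and field names are mine):

```java
import java.util.logging.Level;

public class LevelDemo {
    // Anonymous subclass, because Level's constructor is protected.
    private static final Level CUSTOM = new Level("DEMO", 550) { };

    public static void main(String[] args) {
        // parse() can find our instance only because the Level constructor
        // added it to the static 'known' list -- the same list that keeps
        // the anonymous class (and its classloader) reachable forever.
        Level found = Level.parse("DEMO");
        System.out.println(found.intValue());
    }
}
```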

More sneaky problems

I don't blame you if you didn't see the problem with the Level class: it's sneaky. Last year we had some undeployment problems in our application server. My team, in particular Edward Chou, spent some time tracking them all down. Next to the problem with Level, here are some other problems Edward and I encountered. For instance, if you happen to use some of the Apache Commons BeanHelper's code: there's a static cache in that code that refers to Method objects, and a Method object holds a reference to the class it points to. That's not a problem if the Apache Commons code is loaded in your application's classloader. However, you do have a problem if this code is also present in the classpath of the application server, because those classes take precedence. As a result, you now have references from the application server's classloader to classes in your application... a classloader leak!

I have not yet mentioned the simplest recipe for disaster: a thread that is started by the application and does not exit after the application is undeployed.
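A sketch of that recipe (hypothetical names; the anonymous Runnable's class is loaded by the application's classloader, so as long as the thread runs, none of the application's classes can be collected):

```java
public class LeakyThreadDemo {
    public static Thread startWorker() {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    for (;;) {
                        Thread.sleep(1000); // runs forever: nothing stops it on undeploy
                    }
                } catch (InterruptedException e) {
                    // an interrupt is the only way out
                }
            }
        }, "leaky-worker");
        t.start();
        return t;
    }

    public static void main(String[] args) throws Exception {
        Thread t = startWorker();
        System.out.println(t.isAlive());
        t.interrupt(); // what a well-behaved application should do on undeploy
        t.join();
        System.out.println(t.isAlive());
    }
}
```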

Detection and solution

Classloader leaks are difficult. Detecting such a leak without having to deploy/undeploy a large number of times is difficult. Finding the source of a classloader leak is even trickier. This is because all the profilers we tried at the time did not follow links through classloaders. Therefore we resorted to writing some custom code to find the leaks from memory dump files. Since that exercise, new tools have come to market in JDK 6. The next blog will outline the easiest approach today for tracking down a classloader leak.

Tuesday, October 10, 2006

Transactions, disks, and performance

So far I've written only a handful of blogs. My blogs are technical and dry, and consequently the hit count of my blog is low. However, it has generated some interest from people in faraway places and some interesting discussions over email. One of the things that came up is data integrity and disk performance, and that's the topic of this blog...

Maintaining data integrity

Consider a simple scenario: an MDB reads from a JMS queue, and for each message that it receives, it writes a record to a database. Let's say that you want to minimize the chance that you would lose or duplicate data in the case of failure. What kind of failure? Let's say that power failures or system crashes are your biggest worries.

How do we minimize the chance of losing or duplicating data? First of all, you want to make sure that receiving the message from the queue and the writing to the database happens in one single transaction. One of the easiest ways of doing that is to use an application server like Glassfish. How does Glassfish help? It starts a distributed (two phase) transaction. The transactional state is persisted by the JMS server and database, and by Glassfish through its transaction log. In the event of a system crash, the system will come back up and by using the persisted transactional state, Glassfish can make sure that either the transaction is fully committed or fully rolled back. Ergo, no data is lost or duplicated. The details of this are fascinating, but let's look at that some other time.

Thus, reliable persistence that is immune to system crashes (e.g. the Blue Screen of Death (BSOD)) or power failures is important. That means that when a program thinks the data has been persisted to the disk, it should in fact have been written to the magnetic media of the disk. Why do I stress this? Because it's less trivial than you might think: the operating system and drive may cache the data in the write cache. Therefore, on the one hand you will have to disable the write cache in the OS; on the other hand, your application will have to make sure that the data is indeed synced to disk.
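In Java, "making sure that the data is indeed synced to disk" comes down to calling force() on the file channel (or opening the file in "rwd" mode); a minimal sketch:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class SyncedWrite {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("txlog", ".bin");
        f.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        FileChannel ch = raf.getChannel();
        ch.write(ByteBuffer.wrap("commit-record".getBytes("UTF-8")));
        // force(true) flushes data and metadata to the device; note that it
        // cannot help if the drive's own write cache is still enabled --
        // that is the OS/hardware setting discussed above.
        ch.force(true);
        System.out.println(f.length());
        raf.close();
    }
}
```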

Turning off the write cache, syncing to disk... no big deal, right? Unfortunately there are many systems in which the performance drops like a stone when you do that. Let's look at why that is, and how good systems solve this performance problem.

The expense of writing to disk

As you know, data on a disk is organized in concentric circles. Each circle (track) is divided up into sectors. When the disk controller writes data to the disk, it has to position the write head over the right track and wait until the disk has rotated such that the beginning of the sector is under the write head. These are mechanical operations; they are not measured in nanoseconds or microseconds, but rather in milliseconds. As it turns out, a drive can do about one hundred of these operations per second. So, as a rule of thumb, a drive can perform one hundred writes per second. This hasn't changed much in the past few decades, and it won't change in the next decade.

Writes per second is one aspect. The number of bytes that can be written per second is another. This is measured in megabytes per second. Fortunately, over the past decades this rate has steadily increased, and it likely will continue to do so. That's why formatting a 300 Gb drive today doesn't take 10000 times longer than formatting a 30 Mb drive did 15 years ago.

The key is that the more you can send to the disk in one write operation, the more data you can write. To demonstrate this, I've done some measurements on my desktop machine.
[Chart: writes per second]

This chart shows that the number of write operations per second remains constant up to a write size of approximately 32 kb (note that the x-axis is logarithmic). That means that whether you write 32768 bytes or just a single byte per write operation, the maximum number of writes per second remains the same. Of course the amount of data you can write to the disk goes up if you put more data in a single write operation. Put another way, if you need to write 1 kb chunks of data, you can process 32 times faster by combining your chunks into one 32 kb chunk.
[Chart: bytes per second]

What you can see in this chart is that the amount of data you can write to the disk per second (the data rate) increases with the number of bytes per write. Of course that cannot increase infinitely: the data rate eventually becomes constant. (Note that the x-axis and y-axis are both logarithmic.) The data rate at a write size of 32 kb is approximately 3.4 Mb/sec and levels off at around 6 Mb/sec for write sizes of approximately 1 Mb and up.

These measurements were done on a $300 PC with a cheap hard drive running Windows. I've done the same measurements on more expensive hardware with more expensive drives and found similar results: most drives perform slightly better, but the overall picture is the same.

The lesson: if you have to write small amounts of data, try to combine these in one write operation.

Multi threading

Back to our example of JMS and a database. As I mentioned, for each message picked up from the queue and written to the database, there are several write operations. For now, let's assume that the data written for the processing of a single message cannot be combined into one write operation. However, if we process the messages concurrently, we can combine the data being written from multiple threads into one write operation!
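Here is a minimal sketch of that idea: a hypothetical group-commit buffer (my own names, not the actual STCMS implementation) in which worker threads append records and a single flusher writes the whole accumulated batch in one disk operation:

```java
import java.util.ArrayList;
import java.util.List;

public class GroupCommitLog {
    private final List records = new ArrayList();
    private int diskWrites;

    // Called by many worker threads.
    public synchronized void append(byte[] record) {
        records.add(record);
    }

    // Called by a single flusher thread: one write (and one sync) for
    // however many records accumulated since the last flush.
    public synchronized int flush() {
        int n = records.size();
        if (n > 0) {
            // ...write all records with a single channel.write()/force() here...
            records.clear();
            diskWrites++;
        }
        return n;
    }

    public static void main(String[] args) {
        GroupCommitLog log = new GroupCommitLog();
        log.append("tx1".getBytes());
        log.append("tx2".getBytes());
        log.append("tx3".getBytes());
        // Three transactions, one disk write:
        System.out.println(log.flush());
        System.out.println(log.diskWrites);
    }
}
```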

Good transaction loggers, good databases that are optimized for transaction processing, and good JMS servers all try to combine the data from multiple concurrent transactions into as few write operations as possible. As an example, here are some measurements I did against STCMS on the same hardware as above. STCMS is the JMS server that ships with Java CAPS.
[Chart: messages per second]

As you can see, for small messages the throughput of the system increases almost linearly with the number of threads. This happens because one of the design principles of STCMS is to consolidate as much data as possible into a single write operation. Of course, when you increase the size of the messages, or the number of threads, indefinitely, you will eventually hit the limit on the data rate. This is why the performance for large messages does not scale like it does for small messages. And when increasing the number of threads to large numbers, you will also hit other limits due to thread-switching overhead.

Note that the measurements in this chart were done on a queue-to-queue and topic-to-topic scenario about three years ago. The STCMS shipping today performs a lot better; e.g., there's no difference in performance anymore between queues and topics. So don't use these numbers as benchmark numbers for STCMS; I'm just using them to demonstrate the power of write consolidation.

Other optimizations in the transaction manager

Combining data that needs to be written to the disk from many threads into one write operation is definitely a good thing for both resource managers (the JMS server and database server in our example) and the transaction manager.  Are there other things that can be done specifically in the area of the transaction manager? Yes, there are: let's see if we can minimize the number of times data needs to be written for each message.

Last agent commit

The first thing that comes to mind is the last-agent-commit optimization: in our example, one of the resource managers will not do a two-phase commit but only a single-phase commit, thereby saving another write. Most transaction managers can do this; e.g., the transaction manager in Glassfish does.

Piggybacking on another persistence store

Instead of writing to its own persistence store, the transaction log could write its data to the persistence store of one of the resource managers participating in the transaction. By doing so, it can push its data into the same write operation that the resource manager needs to do anyway. For instance, the transaction manager can write its transaction state into the database; an extra table in the database is all it takes. Alternatively, the JMS server could provide some extensions so that the transaction log can write its data to the persistence store of the JMS server.

Logless transactions

If the resource manager (e.g. JMS) doesn't have an API that allows the transaction manager to use its persistence store, the transaction manager could sneak some data of its own into the XID. The XID is just a string of bytes that the resource manager treats as an opaque entity. The transaction manager could put data in the XID that identifies the last resource manager; in the case of a system crash, the transaction manager queries all known resource managers and obtains a list of in-doubt XIDs. The presence of the in-doubt XID of the last resource in the transaction signifies that all resources were successfully prepared and that commit should be called; its absence signifies that the prepare phase was incomplete and that rollback should be called. This mechanism was invented by SeeBeyond a few years ago (patent pending). There are some intricacies in this system; perhaps a topic for a future blog? Leave a comment to this blog if you're interested.

Beyond the disk

The scenario we've been looking at is one in which data needs to be persisted so that no data gets lost in case of a system crash or power failure. What about a disk failure? We could use a RAID system. Are there other ways? Sure! We could involve multiple computers: clustering! If each machine that needs to safeguard transaction data also wrote its data to another node in the cluster, that data would survive a crash of the primary node. Assuming that two nodes crashing at the same time is very unlikely, this forms a reliable solution. On both nodes, data can be written to the disk using a write-caching scheme so that the number of writes per second is no longer the limiting factor.

We can go even further for short-lived objects. Consider this scenario:
    insert into A values ('a', 'b')
    update A set value = 'c' where pk = 'a'
    delete A where pk = 'a'
why write the value ('a', 'b') to the disk at all? A smart write cache can avoid any disk writes for short-lived objects. Of course, typical data that the transaction manager generates consists of short-lived objects. JMS messages are another example of objects that may have a short life span.
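A sketch of such a write cache (hypothetical names, and assuming the row has not been flushed yet): a delete that arrives before the flush simply cancels the pending insert, so the short-lived row never touches the disk:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeferredWriteCache {
    private final Map pending = new LinkedHashMap();

    public synchronized void insertOrUpdate(String pk, String value) {
        pending.put(pk, value);
    }

    public synchronized void delete(String pk) {
        // The row never reached the disk, so removing the pending entry
        // is all that is needed -- no disk write at all.
        pending.remove(pk);
    }

    // Returns the number of rows that actually need a disk write.
    public synchronized int flush() {
        int n = pending.size();
        pending.clear();
        return n;
    }

    public static void main(String[] args) {
        DeferredWriteCache cache = new DeferredWriteCache();
        cache.insertOrUpdate("a", "b");    // insert into A values ('a', 'b')
        cache.insertOrUpdate("a", "c");    // update A set value = 'c' where pk = 'a'
        cache.delete("a");                 // delete A where pk = 'a'
        System.out.println(cache.flush()); // zero disk writes for this row
    }
}
```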

A pluggable transaction manager... and what about Glassfish?

It would be nice if transaction managers had a pluggable architecture so that depending on your specific scenario, you could choose the appropriate transaction manager log persistence strategy.

Isn't clustering something difficult and expensive? Difficult to develop yes, but not difficult to use. Sun Java System Application Server 8.x EE already has strong support for clustering. Glassfish will soon have support for clustering too. And with that clustering will become available to everybody!

Sunday, September 17, 2006

A free course in Java EE (link)

I found this note in my mail:


The 11th session of "Java EE Programming (with Passion!)" free online course will start from October 23rd, 2006. The course contents are revised with new "look and feel" and new hands-on labs. Please see the following course website for more information.
Course website: http://www.javapassion.com/j2ee
Course FAQ: http://www.javapassion.com/j2ee/coursefaq.html


If you know anybody who wants to learn Java EE programming hands-on using NetBeans, please let them know about this course.

Thursday, September 14, 2006

Funny videos from Sun

Who knew... Sun makes pretty funny commercials. See here: Sun videos. While you're there, also take a look at Project Looking Glass... impressive stuff!

Wednesday, September 13, 2006

Testing connection failures in resource adapters

In a previous blog (see J2EE JCA Resource Adapters: Poisonous pools) I talked about the difficulty of detecting faulty connections to the external system (EIS) that the resource adapter is interfacing with. A common cause of connection failures is simply connection loss. In my previous blog I detailed the problems of detecting connection loss in the RA and presented a number of ways to address this problem.

What I didn't discuss is how you can build unit tests so that you can easily test connection failures and connection failure recovery. That is the topic of this blog.

What is it that I will be looking at? Say that you have a resource adapter that connects to an external system (EIS) using one or more TCP/IP connections. The failure I want to test is that of a simple connection drop. This can happen if the EIS is rebooted, crashes, or simply because the network temporarily fails (e.g. someone stumbles over a network cable).

An automated way to induce failures

How would you manually test this type of failure? You could simply kill the EIS, or unplug the network connection. Simple and effective, but very manual. What I prefer is an automated way, so that I can include failure tests in an automated unit test suite. Here's a way to do that: use a port-forwarder proxy. A port-forwarder proxy is a process that listens on a particular port. Any time a connection comes in on that port, it creates a new connection to a specified server and port. Bytes coming in from the client are sent to the server unmodified; likewise, bytes coming in from the server are sent to the client unmodified. The client is the RA in this case; the server is the EIS. Hence, the port-forwarder proxy sits between the resource adapter and the EIS.

A failure can be induced by instructing the port-forwarder proxy to drop all the connections. To simulate a transient failure, the port-forwarder proxy should then again accept connections and forward them to the EIS.

Here is an example of how this port-forwarder proxy can be used in a JUnit test. Say I'm testing a JMS resource adapter in an application server: I have an MDB deployed that reads JMS messages from one queue (e.g. Queue1) and forwards them to a different queue (e.g. Queue2). The test would look something like this:

  1. start the port-forwarder proxy; specify the server name and port number that the EIS is listening on
  2.     get the port number that the proxy is listening on
  3. generate a connection URL that the RA will use to connect to the EIS; the new connection URL will have the proxy's server name and port number rather than the server name and port number of the EIS
  4. update the EAR file with the new connection URL
  5. deploy the EAR file
  6. send 1000 messages to Queue1
  7.     read these 1000 messages from Queue2
  8. verify that the port-forwarder has received connections; then tell the port-forwarder to kill all active connections
  9. send another batch of 1000 messages to Queue1
  10. read this batch of 1000 messages from Queue2
  11. undeploy the EAR file

As you can see, I'm assuming in this example that you are using an embedded resource adapter, i.e. one that is embedded in the EAR file. Of course you can also make this work using global resource adapters; you just need a way to automatically configure the URL in the global resource adapter.

A port forwarder in Java

The first Java program I ever wrote was a port-forwarder proxy (Interactive Spy). I'm not sure when that was, but I do remember that people were just starting to use JDK 1.3. In that version of the JDK, only blocking IO was available. Under that restriction, one way to write a port-forwarder proxy is to listen on a port and create two threads for each incoming connection: one thread exclusively reads from the client and sends the bytes to the server; the other thread reads from the server and sends the bytes to the client. This was not a very scalable or elegant solution. I did use it successfully for a number of tests in the JMS test suite at SeeBeyond for the JMS Intelligent Queue Manager in Java CAPS (better known to engineers as STCMS). However, when I was developing the connection-failure tests for the JMSJCA resource adapter for JMS, I ran into these scalability issues and saw test failures due to problems in the test setup rather than bugs in the application server, JMSJCA, or STCMS.

A better approach is to make use of the non-blocking NIO capabilities in JDK 1.4. For me this was the opportunity to explore the capabilities of NIO. The result is a relatively small class that fully stands on its own; I pasted the source code of the resulting class below. The usual restrictions apply: this code is provided "as-is" without warranty of any kind; use at your own risk, etc. I've omitted the unit test for this class.

This is how you use it: instantiate a TcpProxyNIO, passing it the server name and port number of the EIS. The proxy will find a port to listen on in the 50000 range. Use getPort() to find out what the port number is. The proxy then listens on that port and is ready to accept connections. Use killAllConnections() to kill all connections. Make sure to destroy the proxy after use by calling close().


/*
 * The contents of this file are subject to the terms
 * of the Common Development and Distribution License
 * (the "License"). You may not use this file except
 * in compliance with the License.
 *
 * You can obtain a copy of the license at
 * glassfish/bootstrap/legal/CDDLv1.0.txt or
 * https://glassfish.dev.java.net/public/CDDLv1.0.html.
 * See the License for the specific language governing
 * permissions and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL
 * HEADER in each file and include the License file at
 * glassfish/bootstrap/legal/CDDLv1.0.txt. If applicable,
 * add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your
 * own identifying information: Portions Copyright [yyyy]
 * [name of copyright owner]
 */

package com.stc.jmsjca.test.core;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.Semaphore;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * A proxy server that can be used in JUnit tests to induce connection
 * failures, to assure that connections are made, etc. The proxy server is
 * set up with a target server and port; it will listen on a port that it
 * chooses itself and delegates all data coming in to the server, and vice
 * versa.
 *
 * Implementation: each incoming connection (client connection) maps into
 * a Conduit; this holds both ends of the line, i.e. the client end
 * and the server end.
 *
 * Everything is based on non-blocking IO (NIO). The proxy creates one
 * extra thread to handle the NIO events.
 *
 * @author fkieviet
 */
public class TcpProxyNIO implements Runnable {
private static Logger sLog = Logger.getLogger(TcpProxyNIO.class.getName());
private String mRelayServer;
private int mRelayPort;
private int mNPassThroughsCreated;
private Receptor mReceptor;
private Map mChannelToPipes = new IdentityHashMap();
private Selector selector;
private int mCmd;
private Semaphore mAck = new Semaphore(0);
private Object mCmdSync = new Object();
private Exception mStartupFailure;
private Exception mUnexpectedThreadFailure;
private boolean mStopped;

private static final int NONE = 0;
private static final int STOP = 1;
private static final int KILLALL = 2;
private static final int KILLLAST = 3;

private static int BUFFER_SIZE = 16384;

/**
 * Constructor
 *
 * @param relayServer
 * @param port
 * @throws Exception
 */
public TcpProxyNIO(String relayServer, int port) throws Exception {
mRelayServer = relayServer;
mRelayPort = port;

Receptor r = selectPort();
mReceptor = r;

new Thread(this, "TCPProxy on " + mReceptor.port).start();

mAck.acquire();
if (mStartupFailure != null) {
throw mStartupFailure;
}
}

/**
 * Utility class to hold data that describes the proxy server
 * listening socket
 */
private class Receptor {
public int port;
public ServerSocket serverSocket;
public ServerSocketChannel serverSocketChannel;

public Receptor(int port) {
this.port = port;
}

public void bind() throws IOException {
serverSocketChannel = ServerSocketChannel.open();
serverSocketChannel.configureBlocking(false);
serverSocket = serverSocketChannel.socket();
InetSocketAddress inetSocketAddress = new InetSocketAddress(port);
serverSocket.bind(inetSocketAddress);
}

public void close() {
if (serverSocketChannel != null) {
try {
serverSocketChannel.close();
} catch (Exception ignore) {

}
serverSocket = null;
}
}
}

/**
 * The client or server connection
 */
private class PipeEnd {
public SocketChannel channel;
public ByteBuffer buf;
public Conduit conduit;
public PipeEnd other;
public SelectionKey key;
public String name;

public PipeEnd(String name) {
buf = ByteBuffer.allocateDirect(BUFFER_SIZE);
buf.clear();
buf.flip();
this.name = "{" + name + "}";
}

public String toString() {
StringBuffer ret = new StringBuffer();
ret.append(name);
if (key != null) {
ret.append("; key: ");
if ((key.interestOps() & SelectionKey.OP_READ) != 0) {
ret.append("-READ-");
}
if ((key.interestOps() & SelectionKey.OP_WRITE) != 0) {
ret.append("-WRITE-");
}
if ((key.interestOps() & SelectionKey.OP_CONNECT) != 0) {
ret.append("-CONNECT-");
}
}
return ret.toString();
}

public void setChannel(SocketChannel channel2) throws IOException {
this.channel = channel2;
mChannelToPipes.put(channel, this);
channel.configureBlocking(false);
}

public void close() throws IOException {
mChannelToPipes.remove(channel);
try {
channel.close();
} catch (IOException e) {
// ignore

}
channel = null;
if (key != null) {
key.cancel();
key = null;
}
}

public void listenForRead(boolean on) {
if (on) {
key.interestOps(key.interestOps() | SelectionKey.OP_READ);
} else {
key.interestOps(key.interestOps() &~ SelectionKey.OP_READ);
}
}

public void listenForWrite(boolean on) {
if (on) {
key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
} else {
key.interestOps(key.interestOps() &~ SelectionKey.OP_WRITE);
}
}
}

/**
 * Represents one link from the client to the server. It is an association
 * of the two ends of the link.
 */
private class Conduit {
public PipeEnd client;
public PipeEnd server;
public int id;

public Conduit() {
client = new PipeEnd("CLIENT");
client.conduit = this;

server = new PipeEnd("SERVER");
server.conduit = this;

client.other = server;
server.other = client;

id = mNPassThroughsCreated++;
}
}

/**
 * Finds a port to listen on
 *
 * @return a newly initialized receptor
 * @throws Exception on any failure
 */
private Receptor selectPort() throws Exception {
Receptor ret;

// Find a port to listen on; try up to 100 port numbers

Random random = new Random();
for (int i = 0; i < 100; i++) {
int port = 50000 + random.nextInt(1000);
try {
ret = new Receptor(port);
ret.bind();
return ret;
} catch (IOException ignore) {
// Ignore

}
}
throw new Exception("Could not bind port");
}

/**
 * The main event loop
 */
public void run() {
// ===== STARTUP ==========
// The main thread will wait until the server is actually listening and ready
// to process incoming connections. Failures during startup should be
// propagated back to the calling thread.

try {
selector = Selector.open();

// Acceptor
mReceptor.serverSocketChannel.configureBlocking(false);
mReceptor.serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
} catch (Exception e) {
synchronized (mCmdSync) {
mStartupFailure = e;
}
}

// ===== STARTUP COMPLETE ==========

// The main thread is waiting on the ack lock; notify the main thread.
// Startup errors are communicated through the mStartupFailure variable.
mAck.release();
if (mStartupFailure != null) {
return;
}

// ===== RUN: event loop ==========

// The proxy thread spends its life in this event handling loop in which
// it deals with requests from the main thread and from notifications from
// NIO.
try {
loop: for (;;) {
int nEvents = selector.select();

// ===== COMMANDS ==========

// Process requests from the main thread. The communication mechanism
// is simple: the command is communicated through a variable; the main
// thread waits until the mAck lock is set.
switch (getCmd()) {
case STOP: {
ack();
break loop;
}
case KILLALL: {
PipeEnd[] pipes = toPipeArray();
for (int i = 0; i < pipes.length; i++) {
pipes[i].close();
}
ack();
continue;
}
case KILLLAST: {
PipeEnd[] pipes = toPipeArray();
Conduit last = pipes.length > 0 ? pipes[0].conduit : null;
if (last != null) {
for (int i = 0; i < pipes.length; i++) {
if (pipes[i].conduit.id > last.id) {
last = pipes[i].conduit;
}
}
last.client.close();
last.server.close();
}
ack();
continue;
}
}

//===== NIO Event handling ==========

if (nEvents == 0) {
continue;
}
Set keySet = selector.selectedKeys();
for (Iterator iter = keySet.iterator(); iter.hasNext();) {
SelectionKey key = (SelectionKey) iter.next();
iter.remove();

//===== ACCEPT ==========

// A client connection has come in. Perform an async connect to
// the server. The remainder of the connect is going to be done in
// the CONNECT event handling.
if (key.isValid() && key.isAcceptable()) {
sLog.fine(">Incoming connection");
try {
Conduit pt = new Conduit();
ServerSocketChannel ss = (ServerSocketChannel) key.channel();

// Accept

pt.client.setChannel(ss.accept());

// Do asynchronous connect to relay server
pt.server.setChannel(SocketChannel.open());
pt.server.key = pt.server.channel.register(
selector, SelectionKey.OP_CONNECT);
pt.server.channel.connect(new InetSocketAddress(
mRelayServer, mRelayPort));
} catch (IOException e) {
System.err.println(">Unable to accept channel");
e.printStackTrace();
// selectionKey.cancel();

}
}

//===== CONNECT ==========
// Event that is generated when the connection to the server has
// completed. Here we need to initialize both pipe-ends. Both ends
// need to start reading. If the connection had not succeeded, the
// client needs to be closed immediately.
if (key != null && key.isValid() && key.isConnectable()) {
SocketChannel c = (SocketChannel) key.channel();
PipeEnd p = (PipeEnd) mChannelToPipes.get(c); // SERVER-SIDE

if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">CONNECT event on " + p + " -- other: " + p.other);
}

boolean success;
try {
success = c.finishConnect();
} catch (RuntimeException e) {
success = false;
if (sLog.isLoggable(Level.FINE)) {
sLog.log(Level.FINE, "Connect failed: " + e, e);
}
}
if (!success) {
// Connection failure

p.close();
p.other.close();

// Unregister the channel with this selector
key.cancel();
key = null;
} else {
// Connection was established successfully

// Both need to be in readmode; note that the key for
// "other" has not been created yet
p.other.key = p.other.channel.register(selector, SelectionKey.OP_READ);
p.key.interestOps(SelectionKey.OP_READ);
}

if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">END CONNECT event on " + p + " -- other: " + p.other);
}
}

//===== READ ==========

// Data was received. The data needs to be written to the other
// end. Note that data from client to server is processed one chunk
// at a time, i.e. a chunk of data is read from the client; then
// no new data is read from the client until the complete chunk
// is written to the server. This is why the interest fields
// in the key are toggled back and forth. Of course the same holds
// true for data from the server to the client.
if (key != null && key.isValid() && key.isReadable()) {
PipeEnd p = (PipeEnd) mChannelToPipes.get(key.channel());
if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">READ event on " + p + " -- other: " + p.other);
}

// Read data

p.buf.clear();
int n;
try {
n = p.channel.read(p.buf);
} catch (IOException e) {
n = -1;
}

if (n >= 0) {
// Write to other end

p.buf.flip();
int nw = p.other.channel.write(p.buf);

if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">Read " + n + " from " + p.name + "; wrote " + nw);
}

p.other.listenForWrite(true);
p.listenForRead(false);
} else {
// Disconnected

if (sLog.isLoggable(Level.FINE)) {
sLog.fine("Disconnected");
}

p.close();
key = null;
if (sLog.isLoggable(Level.FINE)) {
sLog.fine("Now present: " + mChannelToPipes.size());
}

p.other.close();

// Stop reading from other side

if (p.other.channel != null) {
p.other.listenForRead(false);
p.other.listenForWrite(true);
}
}

if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">END READ event on " + p + " -- other: " + p.other);
}
}

//===== WRITE ==========

// Data was sent. As for READ, data is processed in chunks, which
// is why the READ and WRITE interest bits are flipped.
// In case a connection failure is detected, there may still be
// data in the READ buffer that was not read yet. For example, the
// client sends a LOGOFF message to the server, and the server sends
// a BYE message back to the client; depending on when the
// write failure event comes in, the BYE message may still be in
// the buffer and must be read and sent to the client before the
// client connection is closed.
if (key != null && key.isValid() && key.isWritable()) {
PipeEnd p = (PipeEnd) mChannelToPipes.get(key.channel());

if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">WRITE event on " + p + " -- other: " + p.other);
}

// More to write?

if (p.other.buf.hasRemaining()) {
int n = p.channel.write(p.other.buf);
if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">Write some more to " + p.name + ": " + n);
}
} else {
if (p.other.channel != null) {
// Read from input again

p.other.buf.clear();
p.other.buf.flip();

p.other.listenForRead(true);
p.listenForWrite(false);

} else {
// Close

p.close();
key = null;
if (sLog.isLoggable(Level.FINE)) {
sLog.fine("Now present: " + mChannelToPipes.size());
}
}
}
if (sLog.isLoggable(Level.FINE)) {
sLog.fine(">END WRITE event on " + p + " -- other: " + p.other);
}
}
}
}
} catch (Exception e) {
sLog.log(Level.SEVERE, "Proxy main loop error: " + e, e);
e.printStackTrace();
synchronized (mCmdSync) {
mUnexpectedThreadFailure = e;
}
}

// ===== CLEANUP =====

// The main event loop has exited; close all connections
try {
selector.close();
PipeEnd[] pipes = toPipeArray();
for (int i = 0; i < pipes.length; i++) {
pipes[i].close();
}
mReceptor.close();
} catch (IOException e) {
sLog.log(Level.SEVERE, "Cleanup error: " + e, e);
e.printStackTrace();
}
}

private PipeEnd[] toPipeArray() {
return (PipeEnd[]) mChannelToPipes.values().toArray(
new PipeEnd[mChannelToPipes.size()]);
}

private int getCmd() {
synchronized (mCmdSync) {
return mCmd;
}
}

private void ack() {
setCmd(NONE);
mAck.release();
}

private void setCmd(int cmd) {
synchronized (mCmdSync) {
mCmd = cmd;
}
}

private void request(int cmd) {
setCmd(cmd);
selector.wakeup();

try {
mAck.acquire();
} catch (InterruptedException e) {
// ignore
}
}

/**
 * Closes the proxy
 */
public void close() {
if (mStopped) {
return;
}
mStopped = true;

synchronized (mCmdSync) {
if (mUnexpectedThreadFailure != null) {
throw new RuntimeException("Unexpected thread exit: " + mUnexpectedThreadFailure, mUnexpectedThreadFailure);
}
}

request(STOP);
}

/**
 * Restarts after close
 *
 * @throws Exception if startup fails
 */
public void restart() throws Exception {
close();

mChannelToPipes = new IdentityHashMap();
mAck = new Semaphore(0);
mStartupFailure = null;
mUnexpectedThreadFailure = null;

mReceptor.bind();
new Thread(this, "TCPProxy on " + mReceptor.port).start();
mStopped = false;

mAck.acquire();
if (mStartupFailure != null) {
throw mStartupFailure;
}
}

/**
 * Returns the port number this proxy listens on
 *
 * @return port number
 */
public int getPort() {
return mReceptor.port;
}

/**
 * Kills all connections; data may be lost
 */
public void killAllConnections() {
request(KILLALL);
}

/**
 * Kills the last created connection; data may be lost
 */
public void killLastConnection() {
request(KILLLAST);
}

/**
 * Closes the proxy
 *
 * @param proxy the proxy to close; may be null
 */
public void safeClose(TcpProxyNIO proxy) {
if (proxy != null) {
proxy.close();
}
}
}
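The chunk-at-a-time relaying in the READ and WRITE handlers hinges on flipping a SelectionKey's interest set between OP_READ and OP_WRITE. Here is a minimal, self-contained sketch of just that mechanism (not using the proxy class itself; class and variable names are my own):

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class InterestOpsSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();

        // A listening socket on an ephemeral port so the demo is self-contained
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress("localhost", 0));
        int port = server.socket().getLocalPort();

        // Blocking connect, then switch to non-blocking for the selector
        SocketChannel client = SocketChannel.open(new InetSocketAddress("localhost", port));
        client.configureBlocking(false);

        // Start in read mode, as the CONNECT handler does after finishConnect()
        SelectionKey key = client.register(selector, SelectionKey.OP_READ);
        System.out.println("read mode: " + (key.interestOps() == SelectionKey.OP_READ));

        // Once a chunk has been read, the pipe end stops listening for READ
        // and the peer starts listening for WRITE; here we just toggle the
        // one key to show the API call involved
        key.interestOps(SelectionKey.OP_WRITE);
        System.out.println("write mode: " + (key.interestOps() == SelectionKey.OP_WRITE));

        client.close();
        server.close();
        selector.close();
    }
}
```

Because the interest set is a plain bit mask, toggling it is cheap; the cost is that the selector must be woken up for the change to take effect on a blocked select() call, which is what selector.wakeup() is for in request() above.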


Other uses of the port-forwarding proxy

What else can you do with this port-forwarding proxy? First of all, its use is not limited to outbound connections: of course you can also use it to test connection failures on inbound connections. Next, you can use it to verify that connections are in fact being made the way you expected. For example, the CAPS JMS server can accept both SSL and non-SSL connections. I added a few tests to our internal test suite to test this capability from JMSJCA. Here, just to make sure that the test itself works as expected, I'm using the proxy to verify that the URL was indeed modified and that the connections are indeed going to the JMS server's SSL port. Similarly, you can use the proxy to count the number of connections being made. E.g., if you want to test that connection pooling indeed works, you can use the proxy and assert that the number of connections created does not exceed the size of the pool.
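To illustrate the connection-counting idea, here is a self-contained sketch. It stands in a plain counting listener for the proxy, and simulates a pooled client with three raw sockets; in a real test you would point your client at the proxy's port and assert on the count the same way:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectionCounter {
    public static void main(String[] args) throws Exception {
        AtomicInteger count = new AtomicInteger();
        ServerSocket server = new ServerSocket(0); // any free port

        // Accept loop: count every inbound connection, as a test proxy would
        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    Socket s = server.accept();
                    count.incrementAndGet();
                    s.close();
                }
            } catch (IOException e) {
                // server closed; exit the loop
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();

        // Simulate a client (e.g. a connection pool) opening three connections
        for (int i = 0; i < 3; i++) {
            new Socket("localhost", server.getLocalPort()).close();
        }

        // Wait until the acceptor has seen all three (with a timeout)
        long deadline = System.currentTimeMillis() + 2000;
        while (count.get() < 3 && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        System.out.println("connections seen: " + count.get());
        server.close();
    }
}
```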

Missing components

In the example I mentioned updating the EAR file and automatically deploying it to the application server. I've created tools for those as well, so you can do those things programmatically. Drop me a line if you're interested and I'll blog about those too.