I was called upon yesterday to review some WebSphere product code I wrote about six years ago. It was a bit of a shock to see from the change history the number of fixes that had been made in the intervening time! Many of these related to deadlock situations, often caused by taking a lock whilst iterating over a set in toString or some other seemingly innocuous method. It is very hard to get the lock ordering correct when you are just a component in the middle of the stack. The lesson there, I suspect, is to trace entries only at the point they are added to or removed from the set, and perhaps just to give the size at other times, as in the sketch below.
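By way of illustration, here is a minimal sketch of that approach (the class and method names are mine, not the original product code): tracing happens inside the methods that already hold the lock, and toString reports only the size, so it never has to take the lock or iterate.

    import java.util.HashSet;
    import java.util.Set;

    class ConnectionRegistry {
        private final Set<Object> entries = new HashSet<>();

        synchronized boolean add(Object entry) {
            boolean added = entries.add(entry);
            System.out.println("trace: added " + entry + ", size now " + entries.size());
            return added;
        }

        synchronized boolean remove(Object entry) {
            boolean removed = entries.remove(entry);
            System.out.println("trace: removed " + entry + ": " + removed);
            return removed;
        }

        // No lock taken and no iteration, so a caller invoking toString while
        // holding its own locks cannot trigger a lock-ordering deadlock; the
        // size is a best-effort unsynchronized read, fine for trace output.
        @Override
        public String toString() {
            return "ConnectionRegistry(size=" + entries.size() + ")";
        }
    }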
Anyway, the particular issue I was being asked to look at related to a memory leak. It’s a problem I’ve seen before: a HashSet grows in size because, by the time remove is called, the hash code of the entry has changed since it was added to the set, so remove searches the bucket for the new hash code while the entry still sits in the bucket for the original one. This is hard to spot unless you explicitly check the boolean return value. The behaviour was made even more obscure by the fact that, to avoid the aforementioned deadlocks, someone had changed the toString to take a clone of the unsynchronized set and then iterate over that. The trouble was that not only had the hashCode changed on most of the entries since they were added to the set, but nearly all of them had become equivalent due to some logic that ran after the failed remove. This meant that, when the set was cloned, most of the entries were discarded from the copy as duplicates, giving the impression that the set was not as big as it actually was.
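To make the failure mode concrete, here is a small self-contained demonstration (with invented names, not the original WebSphere code) of both symptoms: the remove that silently fails once the hash code has changed, and the copy that collapses now-equivalent entries.

    import java.util.HashSet;
    import java.util.Set;

    public class LeakDemo {
        // A deliberately bad key: equals and hashCode depend on mutable state.
        static final class MutableKey {
            int state;
            MutableKey(int state) { this.state = state; }
            @Override public boolean equals(Object o) {
                return o instanceof MutableKey && ((MutableKey) o).state == state;
            }
            @Override public int hashCode() { return state; }
        }

        public static void main(String[] args) {
            Set<MutableKey> set = new HashSet<>();
            MutableKey key = new MutableKey(1);
            set.add(key);             // stored in the bucket for hash code 1

            key.state = 2;            // hash code changes while in the set

            // remove searches the bucket for the new hash code, finds nothing
            // and returns false -- easy to miss unless the result is checked.
            System.out.println(set.remove(key));  // false
            System.out.println(set.size());       // 1: the entry has leaked

            // If later logic leaves the leaked entries equal to one another,
            // copying the set discards the "duplicates", hiding its true size.
            set.add(new MutableKey(2));
            System.out.println(set.size());                 // 2
            System.out.println(new HashSet<>(set).size());  // 1 in the copy
        }
    }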
The service team had knocked up some code that, rather than just calling remove, iterated over the set until it found the appropriate entry and then called iterator.remove(). Unfortunately, on a HashSet this method does little more than re-hash the entry the iterator is currently pointing at and remove it from the set, so it was still failing, and since iterator.remove() returns void you don’t even get any feedback to tell you that the removal failed.
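The following sketch (again with invented names) shows why the workaround was no better: the iterator’s remove re-hashes the current element to locate it, searches the wrong bucket, and silently removes nothing.

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    public class IteratorRemoveDemo {
        // Same illustrative mutable-key shape as in the previous sketch.
        static final class MutableKey {
            int state;
            MutableKey(int state) { this.state = state; }
            @Override public boolean equals(Object o) {
                return o instanceof MutableKey && ((MutableKey) o).state == state;
            }
            @Override public int hashCode() { return state; }
        }

        public static void main(String[] args) {
            Set<MutableKey> set = new HashSet<>();
            MutableKey key = new MutableKey(1);
            set.add(key);
            key.state = 2;  // hash code changes after insertion

            for (Iterator<MutableKey> it = set.iterator(); it.hasNext();) {
                if (it.next() == key) {
                    it.remove();  // void return, no exception -- and no removal
                }
            }
            System.out.println(set.size());  // still 1
        }
    }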
The solution? Well, fixing the hashCode is the obvious answer but it did cause me to stop and think. At the time I was perhaps a little overzealous and overrode the default equals and hashCode methods for almost every class I wrote. In this case, there was probably no good reason to do so and removing those methods altogether would fix the problem.
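For the cases where value semantics genuinely are needed, one sketch of a fix is to base equals and hashCode solely on state that cannot change after construction (the field names here are invented for illustration); deleting the overrides entirely, as suggested above, falls back to identity semantics and needs no code at all.

    final class Entry {
        private final String id;   // immutable identity, assigned once
        private int mutableState;  // free to change; deliberately excluded below

        Entry(String id) { this.id = id; }

        void setState(int state) { this.mutableState = state; }

        @Override public boolean equals(Object o) {
            return o instanceof Entry && ((Entry) o).id.equals(id);
        }

        @Override public int hashCode() {
            return id.hashCode();  // stable for the lifetime of the object
        }
    }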
Looking at this problem also reminded me how good a tool I’ve blogged about before is: the Trace Analyzer for WebSphere Application Server. Whilst the 300MB trace file I was sent killed every other editor on my machine, after a simple increase in the Java heap size (via -Xmx) this tool displayed the file without issue and then allowed me to filter out a single thread for further attention.