The Dangers of String.substring

A few days ago at work, I had to track down why my Java application was running out of memory. It processed a few CSV files, storing some of the data in them. The files were large, but the data my application was storing should have easily been able to fit into memory. After an hour or so of investigation, I found it was an issue I had heard about but never run into myself: the String.substring memory leak.

The bug in question is that String.substring, instead of returning a new string, returns a string based on the original string. This means that the larger string can’t be garbage-collected, so if you store a substring thben the entire string will be kept in memory. In my specific case, I was using String.split and storing a few of the fields - however, String.split uses String.substring behind the scenes, so I ended up keeping the entire files in memory, explaining my OutOfMemoryErrors.

This bug is due to the Java implementors wanting String.substring to be as fast as possible. If it did not have this behaviour, a new char[] would have to be allocated on the heap each time the function was called and the data would have had to be copied into it. Moreover, if you did need the larger string to stay around, part of the string would be duplicated, which is inefficient memory-wise. The workaround for this bug is to just wrap calls to String.substring or calls that use it in a new String(). For example:

     String sub = new String( oldString.substring(0, 4) );

and the old string will be garbage collected.

For the above-mentioned reasons of the bug’s existence and the length of time it’s been opened, it probably won’t be fixed. The workaround suggested currently is not *too* painful, although you do have to find which calls themselves use String.substring, in many cases even if you forget the memory inefficiency isn’t too terrible (if you strip off the last character of a string, for example). It would be nice if there was a mention in the javadoc for these methods that this behaviour occurred, so that looking at the description of a method would warn you about issues such as this.

Tags:

14 Responses to “The Dangers of String.substring”

  1. Kartik says:

    psssst …. You are missing s bracket and semi colon. String sub = new String( oldString.substring(0, 4) );

  2. Andrew says:

    Given String is a primitive you could have the VM perform the following

    (1) String should have an internal variable which records how inefficient it is (e.g. ratio of substring size to parent string size)
    (1) When the managed heap is full, and it might be time to throw an OutOfMemoryError, examine ‘inefficient’ String’s to see if any large chunks of memory can be saved by changing the inefficient substring’s representation to have its own character array rather than a pointer into a larger array

    All sorts of metrics and finesses could be applied - and in honesty I do think this memory leak is kind of a cornercase to begin with, but that is no reason not to be a little bit smarter in the run-time.

  3. Behrang says:

    This is interesting because Java (unlike C++) is geared towards allocating everything on the heap. Yet, apparently, the developers want to gain performance by bypassing that fundamental rule and it is now causing headaches.

  4. The workaround sounds good. Think like a functional programmer. Don’t connect the two strings. When programs get large and complicated, this probably hurts efficiency. I imagine there was some debate on the Sun team on whether it would be a good idea in the first place.

  5. mark says:

    Ruby:
    sub = oldString[0..4]

    I couldn’t resist ;)

    I think ultimately one huge problem with languages like C and similar is that they make simple stuff more complicated than necessary.

    Take aside memory manipulation or pointer handling - these can stay complicated, but why not try to find the easiest approach altogether - in general? The requirement for a semicolon is one example - it shouldnt be needed. It adds no meaningful information
    to the human coder (unless he wants to use multiple instructions on the same line, but this is not a common cause, because people like readable code)

    Java should be a lot shorter. And Variables should default to string type too, without a need to specify “String”.

  6. SJS says:

    The problem is indeed subtle, but the workaround seems perfectly sensible.

    Modifying the GC to look for the (rare) instances of large strings where only a small substring is kept seems likely to be a performance killer. People already whine about GC languages being slow, making one slower for a rare case doesn’t seem to be a very good tradeoff. Let a few CS grad students work on a fast and efficient mechanism for handling this case in the JVM, and *then* look into adding it in.

    @mark - Java isn’t Ruby, nor should it be. The ultimate problem with Ruby and languages like it is that too often the advocates want to turn every other language into a poor copy of their favorite language, so they can smugly point out that everyone else should be using their favorite language.

    Take, for example, statement separators. They’re useful, in that they *conveniently* let you wrap long statements across several lines, aiding readability. Newline-terminated statements require compiler magic (confusing to the reader) or continuation characters (ugly), which is fine in small, uncomplicated, or low-use programs.

    And as for the elimination of “String”, well, consistency is preferential to conciseness. One of Java’s biggest warts is the package-private scope being “default”, instead of making it a required keyword. Verboseness may not be a virtue, but it’s not always a sin, either.

    * * *

    I think the idea of making the javadoc more clear is the best approach. I don’t think I would have looked at “String x = reader.readLine().split(pattern)[1];” as a potential memory “leak” — it’s a non-obvious issue. Good catch.

  7. Chris says:

    @Mark

    > Variables should default to string type too
    WTF are you smoking? How about real type inference, rather than picking one type and raising it up above the others?

  8. Nick says:

    If you’re parsing CSV by calling String.split you’re doing it wrong anyway.

    Foo,”Baz,Baz”,Qux

  9. Brian says:

    Actually, you split with a regex that ignores commas inside of quotes. You don’t just split with “,”.

  10. ghettoimp says:

    @SJS: It seems unlikely that Andrew’s solution would have any impact on GC performance.

    Instead, his proposal is merely for the JVM to attempt a more aggressive recovery strategy before dying with an OutOfMemoryError. It’s true that in such a case, one might reasonably expect the JVM to run its garbage collector to try to find some memory that can be freed. But during any “ordinary” run of the garbage collector, there is still plenty of memory available, and so the “last ditch effort” that Andrew describes (looking for some strings to free) would simply not be invoked.

    Andrew’s solution seems fairly reasonable. Ask yourself: would you rather that your program immediately die with an OutOfMemoryError, or that the JVM spends some time in a search for additional memory before it gives up? To me, it seems better to win some of the time than to lose all of the time.

  11. Salman Ahmed says:

    So, String.split() also leaks memory? What workaround would you recommend for the case when String.split() is used?

  12. admin says:

    Salman:
    You have to wrap the elements of the array that you want to keep, so something like:

    String[] items = oldString.split(" ");
    String stringToKeep = new String(items[0]);

    Will allow the original string to be garbage collected.

  13. Terry says:

    I don’t understand why this bug is still lurking in Java. Java has been open sourced correct? So someone should be able to go in and change the behavior of the C++ code in the JVM that’s causing this issue. Right?

  14. [...] post is inspired by an entry on nflath.com about the dangers of String.substring() [...]

Leave a Reply