Archive for October, 2009

Guava: Google Libraries for Java

Wednesday, October 7th, 2009

Guava is a set of Java Libraries recently provided by Google. The API reference for it is located here, but I’ll provide a very, very brief overview of some of the classes provided. These libraries aren’t at a release version yet, but they still seem quite usable.

CharMatcher - An instance of CharMatcher represents a set of matching characters. You can use predefined constanc CharMatchers, such as CharMatcher.WHITESPACE or CharMatcher.WHITESPACE, or you can also use a factory method, such as CharMatcher.oneOf(”ABC”). Alternatively, you could subclass CharMatcher and implement matches(char c). Operations on a CharMatcher represent what you want done for characters that are matched by the CharMatcher for a string. Some of these are matchesAllOf, removeFrom, and retainFrom. For more examples, see the Guava API documentation.

Charsets - predefined constants for charsets supported on all JVMs (for example, UTF-8) which do not need UnsupportedEncodingExceptions.

Joiner - this class joins pieces of text with a separator. Works on arrays, Lists, and Maps. This

Splitter - Much better than String.split, constructing on and using the split( String s ) method will return an iterator over the pieces of the string. It can be configured for how you want to handle empty pieces, unlike the String.split method. You can also split on strings that are not regular expressions, and are not required to have an array.

Guava also has a set of classes for utilities on primitive objects. They give primitives hashCode, compare, and many other methods for primitives. They also provide an asList method that allows viewing an array as a List. Another feature is utilities for dealing with both signed and unsigned bytes. For more of the details, check out either the API documentation or the introductory PDF.

Guava also provides a set of classes for dealing with IO. There are utility methods for Streams, Files, and other IO classes. One of the notable standouts here is the Files.readLines method(). There are two methods with this name: one takes a File and a CharSet and returns a List containing the file. The other method also takes a LineProcessor class, which is an interface containing processLine(String) and getResult(). Each line will be called with processLine(String), and the once all the lines have been processed the result of getResult() will be returned.

Guava also provides a set of utilities for concurrency. I haven’t done very much with concurrency, so I can’t really talk about these much. I’d suggest looking them over yourself and seeing if there’s anything you can use. I’ll definitely come back and check out what they have when I have some program that needs to have concurrent threads.

Practical Virtual Method Call Resolution for Java

Monday, October 5th, 2009

This paper was on new methods of statically determining which methods can be called by virtual method calls in Java. This is used for method inlining and producing more accurate call graphs, which can improve other analysis. This paper suggest two algorithms that builds type propagation graphs, where vertexes are variables and edges represent the flow of types due to assignments,. anc compares them against CHA and RTA. These new algorithms were designed so that they would scale linearly with the size of the Java program being analyzed. The analysis was implemented with the Soot framework, which provides Jimple, which is a representation of Java bytecode. Thus, this analysis can be used for all JVM languages and not just Java.

The Soot framework is an APO for manipulating Java bytecode. The implementation of the algorithms worked by reading all class files required by an application, converting them into Soot representations of the class, with Jimple classes representing the bytecode. Once the analysis and transformation is complete, the optimized code is output again as bytecode.

Class Hierarchy Analysis is a method for determining the set of types for objects used in Java applications. Basically, for any method call c.m() where the declared type of c is a class, the possible types of c are the declared type of c and it’s subclasses. If the declared type of c is an interface I, the possible types of c are all classes that implement I or a subinterface of I, plus all classes that are subclasses of those classes.

A call graph consists of one node for each method that is reachable from the main method. Each node in the call graph contains a collection of call sites, which are method calls. Edges go from call sites to all nodes that could be called from that call site. This call graph may contain edges and nodes that are not necessary; we want to remove these as much as possible.

Rapid type analysis is a way to improve the call graph. It uses the fact that an object can only be called if it is created by a ‘new’. There are two types of RTA: optimistic and conservative. In conservative RTA, CHA is performed and then the reachable methods are inspected to determine which types can be instantiated. Optimistic RTA constructs the call graph iteratively,

The two new algorithms refine RTA using the property that for any call c.m(), c can only be a type T if there is an expression of the form v = new T() followed by a set of assignments x_1 = v, x_2 = x_1, …, x_n = x_{n - 1}, c = x_{n - 1}. The two analysis work by building a type propagation graph, where nodes represent variables and a directed edge from a to b represents an assignment of the form b = a. This graph is used to determine which variables that methods are called can correspond to which classes in a more precise way than RTA.

One of the new algorithms, Variable-Type Analysis, uses the name of a variable as its representative. There are three types of variables, with the following representative names: local variables/parameters have a name C.m.a, fields are named C.f, and array references are named a[x].

The nodes of the type propagation graph is constructed as follows:

For every class C in the program:
    For every field F in C
        create a node C.F

For every method C.m in the call graph of P:
    For every parameter p_i of C.m, create C.m.p_i
    For every local variable l of C.m, create C.m.l
    Create C.m.this
    Create C.m.return

Afterwards, assignments between variables in the form of assignment statements and method calls are used to construct the edges of the type propagation graph. Native methods have their information input by hand. Aliases do not affect this alias’, because locals and fields cannot be aliased - if there is are two locals a and b, they refer to different locations. This is not true for arrays, and references to arrays can be stored in Objects. To solve this issue, when adding edges to the type propagation graph, if the two types are of array types or java.lang.Object, a bidirectional edge is added to the graph. Declared-type analysis works the same way, except that a node is used for each declared type and not each variable. The size of the graph is smaller, since there are fewer nodes, but the final answer is less precise.

To keep the algorithm simple, some design tradeoffs were made:

  • Aliases - To avoid solving the aliasing problem except in the case of arrays, one node was used for all instances of a field instead of one per arrays.
  • No killing based on casts - For assignments of the form lhs = (C)rhs, we do not take into account the extra type information given by the task of C, as this would require either a iterative solver or a more complex constraint solver.
  • Pessimistic algorithm - The algorithm adds edges for all method calls in the conservative call graph, instead of using which types could currently reach the call at each step of the analysis. However, this would require iteration. We improve the pessimistic algorithm by generating a better conservative call graph.
  • A whole-program approach - Instead of only propagating type information within a procedure and using the call graph to determine the effect of method calls.
  • A 3-address representation of bytecode - This greatly simplified the analysis

Seven benchmarks were used to compare algorithms, including javac. One use of these algorithms is to reduce the size of the call graph. For this, both VTA and DTA perform better than even optimistic RTA, with VTA performing better. VTA removes 10% to 65% of the methods and 17 to 65% of the edges, whereas optimistic RTA varies from 7% to 56%. VTA has more notable improvements on more object-oriented benchmarks. VTA in general seems to be an improvement over RTA in all cases.

The other measurement is how many polymorphic calls. Given a call graph constructed by CHA, the percentage of polymorphic calls that were determined to be monomorphic by RTA, DTA, and VTA were measured. VTA showed a significant improvement, whereas RTA and DTA did not prune the call graph very much - VTA can resolve between 12% and 96% of the calls. The benchmarks were then instrumented to output profiling information to construct a complete call graph. On the large OO benchmarks, while VTA did significantly better than RTA, it still did not match the profiling information. In javac, VTA only determined 82% of the calls that the profile indicated were monomorphic. There are two ways to attempt to improve this percentage; perform more expensive static analysis or do dynamic optimization, such as branch target prediction.

Using this information to implement inlining showed small increases in speed on both the soot and javac benchmarks. For soot, using CHA to determine monomorhpic calls gives a 1% speedup in execution time, whereas VTA give 3% speedup. In javac, these percentages are 0% and 2%.

Both DTA and VTA behave linearly in practise, and thus scale well. VTA is approximately a constant factor of 10 slower than DTA, and so while is more precise is also more expensive. In conclusion, VTA gives significant improvements to DTA, which performs better than both CHA and RTA.

Bitlbee

Friday, October 2nd, 2009

I usually use pidgin for my instant messaging, but recently it has stopped alerting me when I get messages. I wasn’t entirely sure why, but it meant I kept missing messages from people and was incredibly annoying. A combination of this and the fact that I wished to be able to stick in Emacs for chatting sent me out on a search for a new messaging program.

Emacs itself has a few packages that claim to do instant messaging support, bust most are either in early stages, flat-out don’t work, or don’t support the protocols I need. TNT, for example, works quite well - but only for AIM. Since in most cases I use either MSN or Gmail, that strikes TNT out. Unfortunately, it seemed I was going to have to either switch to another external IM client or just keep using Pidgin. Then a friend of mine told me about Bitlbee.

Bitlbee is a program that will create an IRC server that you can use to communicate via MSN, Jabber, and a few other protocols. You can use it to MSN via a terminal running IRSSI; I decided to use ERC, Emacs’ built-in IRC client. The bitlbee we page is located here; go there if you need to report a bug, or want more information than is provided in this post.

Installing Bitlbee is easy, if you are on Ubuntu. A simple ’sudo apt-get install bitlbee’ will install it. I’m not entirely sure how to do it on other operating systems, but the source is provided so in the worst case scenario you can just build it yourself. Once this is done, the command ’sudo /etc/init.d/bitlbee start’ will start the IRC server. Connect to this on localhost:6667 and you are ready to go.

Once you join the IRC server, you will need to register your account. Do this by sending the message ‘register pass’, where pass is your desired password. After this, you can start adding accounts. Adding a MSN account is done by ‘account add msn username@userdomain pass’. Gmail accounts require two commands to add: The first is ‘account add jabber user@gmail.com pass’ and the second is ‘account set 1/server talk.google.com’ The 1 in the previous command may differ in your case: find which number you should be using by sending ‘account list’. Connecting to your accounts is done with ‘account on #’. Once you are logged in, you probably want to save your configuration with ’save’.

Once you have connected to your accounts, you can see all your buddies and their statuses with ‘blist’. Messaging them can be done in two ways, the first of which is to prefix the message to them with ‘username: ‘. In this case, replies they send to you will appear in this main channel. If you want a private message to be opened, you can message them using the standard ‘/msg user’ command. This is probably what most users will want; I haven’t used bitlbee with ERC enough to know which I prefer.

Of course, you will probably want to add new contacts eventually. This is done with the ‘add’ command. To use it, just ‘add # user’, where # is the number of the connection you want to add the user to and user is their username. The other command you’ll want to know aobut at the start is ‘rename’, which will allow you to alias users with ‘rename user alias’. Since many people have usernames that don’t make it obvious who they are, this command is pretty much required for a chat application.

That is everything I needed to do to get set up with Bitlbee and start chatting. To access its help system, use ‘help’ in the bitlbee server. There are a few other things I want to investigate that aren’t a huge priority for me: the ability to merge users and to export my aliases are two of them. It doesn’t look like there is a way to export data from bitlbee, although there is a bug report filed. The ability to merge users is probably not possible in bitlbee itself, but it’s not a big deal and I can probably just code something in elisp to handle that case.