Posts Tagged ‘computer science’

Scalable Propagation-Based Call Graph Construction Algorithms

Wednesday, September 30th, 2009

This paper goes over several algorithms for constructing call graphs for Java programs. They compare an old algorithm, RTA, and three new algorithms (MTA, FTA, and XTA). The problem is that existing algorithms either are scalable to large programs but compute an excessively large call graph or compute a much more precise call graph but do not scale to larger problems. This paper introduced algorithms that appear to scale to large problems while computing a much more precise call graph.

A call graph is just a directed graph of what methods each method can call. Each method is a vertex in a call graph, and there is an edge from V_1 to V_2 if V_1 can call method V_2 in it’s execution. These graphs are determined by analysis of call sites, which are just Java lines of the form e.M(), or just a method call. Since these functions are dynamically resolved, these call graphs may be more imprecise than strictly necessary, depending on the algorithm used to figure out which call site can call each method. Call graphs are used for whole-program optimization. Methods that are not called from the main method can be removed, if only one method can be called from a call site then it can be inlined instead of called dynamically, etc.

Old algorithms

  • RA - RA, Reachability Analysis, only takes into accound the name of a method. Essentially, the main method is added to the list of reachable methods. For each method that is added to the list of reachable methods, go through all the call sites e.M() in the method and add all methods with name M to the list of reachable methods. This will give a set of reachable methods, and you can use this list to determine which methods are unreachable and can be deleted.
  • CHA - Class Hierarchy Analysis is an extension of RA that takes information about the class
    hierarchy to perform its reachability analysis. It works the same as RA, except instead of adding
    all methods with name m from a call site e.m(), it only adds methods with name m in classes that are subclasses of e.

  • RTA - Rapid Type Analysis is an extension of CHA that takes into account which classes were actually instantiated. For a call e.m(), it will only add those methods with name m that occur in subclasses of e that are instantiated in a reachable method. RTA is still simple to implement, scales, and computes call graphs significantly more precise than CHA.

New algorithms

  • XTA - XTA is an algorithm that uses a distinct set for each method M(S_M) and field x(S_x), in order to give a more local view of what types are available. XTA is defined by five constraints:
    1. The main method is reachable.
    2. For any method C.M, add C to M_X. Add the
    3. If a method is reachable, for all statements of the form “new C()” occurring inside M, add C to S_M.
    4. If a method is reachable, for all fields x that are read, add S_X to S_M.
    5. If a method is reachable, for All Fields x that are written, add the intersection of S_M and all subtypes of the declared types of x to S_x.
  • CTA - CTA uses a distinct set variable S_C for each class C, which unifies the flow of information about both methods and fields. It is defined by adding two constraints to XTA: If a class C defines a method M, S_C = S_M and if C defines a field x, S_C = S_x. This is less precise than XTA, but has the potential to be more scalable.
  • MTA - MTA uses a set variable S_C for each class C, and S_x for every field x. MTA adds this constraint to XTA: If a class C defines a method M, S_C = S_M.
  • FTA - FTA uses a set variable S_C for each class aC, and S_M for each method M. MTA adds this constraint to XTA: If a class C defines a field x, S_C = S_x.

The paper next covered some implementation issues that occurred while implementing these algorithms. The implementations use JikesBT for parsing class files. The new algorithms were added to Jax, which is an application extractor for Java. Jax uses RTA for constructing call graphs, so many of the data structures from Jax could be reused.

The algorithms were constructed in an iterative style, where types are propagated outwards from the main method. XTA had three lists associated with each method and field; one with tyeps that have been propagated to other components, current types that will be propagated in the next iteration, and new types that are propogated to the component in the current iteration. FTA and MTA use a shared set for all the methods and fields in a class.

Arrays are abstracted as classes with a field representing all of it’s elements. Methods are assumed to read an element from array A if an object of type A is propagated to the method, and the mehtod contains an aaload bytecode instruction. Array writes are treated similarly. Exceptions can also cause complex type flow between methods, since several stack frames may be skipped. Instead of implementing some way to handle type propagation of exceptions, there is one set of types that represent the type of all expressions whose static type is of type Throwable, and use that set to resolve method calls on Exception objects. This is not very precise, but the number of types involved will likely be very small. Reflective calls are dealt with by manually supplying the necessary information.

Another issue that had to be solved was the analysis of incomplete applications. Most applications use external libraries which you may not have the source for, such as the java standard library. The solution to this is to have one set of objects for all external classes. This works by assuming that any library call can call any method on an object it is passed. However, this is oftentimes quite conservative, and for some heavily-called methods the analysis is coded to not add types to the global set.

12 benchmark programs were used to test the different algorithms. XTA greatly reduced the number of types available per average in methods from RTA - XTA only has 12.3% of the types being available on average. XTA averages 1.6% fewer possible method definitions than RTA, leading to more methods that can be removed. XTA also computes call graphs with an average of 7.2% fewer edges than RTA. The last comparison, the number of virtual method calls determined to be monomorphic, shows that RTA classifies 7.8% of all call sites as polymorphic and XTA classifies 7% of call sites as polymorphic. MTA and FTA both had better results than RTA, but worse than XTA.

An analysis of the running times of the various algorithms showed that RTA was the fastest, and XTA was slower than RTA by a factor of about 8.3 on average. However, there does not seem to be correlation between program size and how much slower XTA is than RTA, implying they have the same asymptotic running times. MTA and FTA were both surprisingly slightly slower than XTA, although this is probably due to a poor implementation.

XTA, FTA, and MTA all produce more accurate call graphs. FTA is usually not much worse than XTA, and so if space is at a premium it may be used over XTA. XTA is considered the best algorithm for most cases. There does not seem to be a compelling reason to use MTA over RTA.