Posts Tagged ‘paper’

Useless-Variable Elimination

Friday, December 18th, 2009

Useless-Variable Elimination
Olin Shivers

Useless-Variable Elimination(UVE) is a optimization to remove unnecessary computations from Scheme programs. A useless variable is one that contributes nothing to the final outcome of the computation. We can remove these, and the computations creating them, with no loss. This is easy to detect in simple cases; however, in more complex cases(possibly including recursion), this is harder to detect. An example is a simple factorial function which takes an extra ‘bogus’ argument that is continually square rooted; it is not used in computing the factorial, but it is harder to tell it is unused.

UVE is often used to clean up after other optimizations. For example, in continuation-passing style intermediate forms, the continuation is often passed around a loop needlessly.

Useless variables are found by backwards flow analysis. First, variables who are trivially useful - such as variables returned as the final result - are added to the set of useful variables. The control-flow structure of the code is then traced backwards to mark all variables that contributed to this variable. Once we have marked all useful variables, the unmarked variables are useless.

Once we have found useless variables, we do two things; remove all references to useless variables and remove useless variables from lambda lists. In the first step, we can simply replace all useless variables to #f instead of whatever computation was used to define them. The second step is slightly harder, as we cannot delete variables from external calls; a value is still passed in, even if it is unused. This leads to two recursive dependencies to determine when it is safe to remove variables from parameter lists; you can read these in Shiver’s paper. Using these techniques, we can ‘fix’ the factorial example, removing the extra parameter and useless square root calculation.

On building fast kd-Trees for Ray Tracing, and on doing that in O(N log N)

Monday, November 23rd, 2009

On building fast kd-Trees for Ray Tracing, and on doing that in O(N log N)
Ingo Wald, Valstimil Havran

Kd-trees seem to be the most widely used technique for spatial subdivisioning. They are believed to be the best known method for fast ray tracing. The Surface Area Heuristic is the construction technique usually used, and how to building efficient ones and traversing them quickly is well understood. However, constructing them is not as well understood, which is what this paper hopes to analyze.

All kd-tree construction schemes follow the same recursive scene:

function RecBuild( triangles T, voxel V ) returns node
    if Terminate(T,V) then:
        return new leaf node( T )
    p = FindPlane( T, V )
    (V_L, V_R ) = Split V with p
    T_L = all members of T on one side of  V_L
    T_R = all members of T on the other side of V_L
    return new node( p, RecBuild(T_L, V_L), RecBuild(T_R, V_R) )

The difference between a good and a naively build kd-tree is often a factor of two or more. The naive method often used, the “spatial median” method, is to just alternate axis and the plane being split at the spatial median of the voxel.

The Surface Area Heuristic(SAH) tends to perform quite well. It makes several assumptions:

  • Rays are uniformly distributed, infinite lines.
  • The cost for both a traversal and a triangle intersection are known.
  • The cost of intersecting N triangles is roughly NK.

The expected cost for a given plane p then is one traversal step, plus the cost of intersecting the two children. Usually, instead of minimizing the cost of traversing the entire tree, a greedy version of the heuristic is used which minimizes it at each level, due to efficiency reasons. We also need to define a termination criterion; when to stop the recursive build: SAH gives us an elegant one, which is if the cost of this node as a leaf is better than even the best split.

The SAH tends to work best, and only a few modifications are known to consistently yield better improvements. One of these is to favor splits that cut off empty space. Requiring the tree to be split a certain number of times can prevent the tree from being stuck in a local minnima, but this number is hard to determine for general scenes.

There are several ways to construct Kd-trees once you have decided upon the cost function. The mmost trivial way is to iterate all triangles, determine their possible split candidates, and compare the cost for each one. This will end up taking O(N^2) operations, where N is the number of triangles in the scene. This is often too slow.

There is also a widely-known algorithm for constructing SAH trees in O(N log^2 N). I’m not going to summarize the algorithm here: if you are interested, read the paper. It essentially ’sweeps’ a plane, calculating costs incrementally. This is a huge improvement, but as the theoretical lower bound is O(N log N) we wish to be better.

This paper introduces a O(N log N) method of building a kd-tree. What caused the previous algorith to be O(N log^2 N ) instead of O(N log N ) is having to sort once per partition. This algorithm improves on that by sorting the evet list only once. This involves having to events for all dimensions. Again, I’m not going to recap specifics of the algorithm here.

The two algorithms were then compared in actual time, and not just asymptotic behavior. Realistic models of varying size were used. The O(N log N) variant is concisely faster than the O(N log^2 N) variant by a factor of 2-3.5. Analysis shows that the O(N log^2 N) variant spends more time re-sorting than the other variant uses to build the entire tree.

Supporting Dynamic Languages on the JVM

Thursday, November 19th, 2009

Supporting Dynamic Languages on the Java Virtual Machine
Author: Olin Shivers

This paper is about enhancements to the JVM that would make it easier to port dynamic languages, such as Scheme, to the JVM. This is desirable due to the fact that there is now a JVM for practically everything; running on the JVM means that the language will run on every architecture it will run on. However, the JVM, while well-designed for speed of Java programs, does not support dynamic languages very well. One example of a language that suffers performance problems from this porting process is Scheme. In Scheme, there must be a uniform representation of data - all types can be in cons cells. Working this into the Java class models means that every type must extend Object, and boxing and unboxing is expensive for primitive types.

The proposal to support immediates is to give pointers to Java objects a low bit of one. Since these are allocated on word boundaries, this does not hurt. This allows a final ImmediateDescriptor class that has 31 bits of state that can quickly be converted to an even integer(or vice versa). This causes no penalty to programs not using them - if a method iis called on them, this would generate a memory alignment exception that the VM can catch, which would still be fast.

There are still problems with issues such as method lookup, for example. The JVM bytecode is well-optimized for Java code, but not necessarily other paradigms. There is a tension in the bytecode between verification and efficiency - we don’t want an unsafe RISC bytecode system, we do want a safe system, but this will make it less efficient in few cases. This tradeoff is made well for Java, but not other languages.

A proposal to fix this is to have some of the bytecode instructions to be linked to C routines at runtime. This allows language implementers to efficiently represent whatever they need. However, this brings up the problem of verification - these C routines may be unsafe. The solution to this is to have some central body, where the JVM will ‘checkout’ the required instructions as told by the language implementer, and warn the user if the requested code is not in this standard and must be checked out from an unverified location.

Your Botnet is My Botnet: Analysis of a Botnet Takeover

Monday, November 9th, 2009

This is a summary of a paper, titled Your Botnet is My Botnet: Analysis of a Botnet Takeover.

Botnets are becoming a large problem for the internet. They are formed by networks of compromised computers that are under the control of some other person. Botnets are becoming the primary means for criminals to launch DOS attacks, steal personal data, or other cyber crimes.

Most previous analysis of botnets have been analyzing them from the inside; intentionally infecting a computer to join the botnet, and analyzing the activity that then occurs. Since many botnets use P2P protocols, other infected computers can be discovered using this technique. However, this gives a very limited view of the activities of the botnet. A better way is to take control of the entire botnet, which can be done either with cooperation from domain registrars or law enforcement.

For this paper, researchers took control of the Torpig botnet. Torpig is primarily associated with bank account and credit card theft. This was done by exploiting how the bots try to locate their command server. Each bot generates a list of domains to contact, and the first host that sends a reply identifying itself is considered genuine until the next domain generating phase. This allowed researchers to register domains the infested host would contact.

Torpig is distributed to it’s victims using Mebroot, a rootkit that replaces a system’s Master Boot Record. Victims are infected through vulnerable web sites being modified so that the victim’s browser requests Javascript, which then attempts several exploits. If any are successful, an installer for Mebroot is downloaded and executed. Mebroot does not perform other malicious attacks itself; It acts as a platform to install malicious modules. Mebroot contacts the C&C server every two hours to receive updates.

The C&C server distributed three modules, which comprise Torpig. These inject these DLLs into the file manager, Internet Explorer, Firefox, and other popular utilities, allowing it to inspect all data handled by these programs. Every twenty minutes, Torpig uploads new data to the command server. In reply to this, the C&C server can either respond with ‘ok’ or a configuration file used for configuration and parameters to perform phishing attacks. These attacks can gain data that would not otherwise be possible by passive monitoring. When the user goes to a site in the configuration file, they will instead be redirected to a site given by an injection server.

Taking over the botnet was fairly simple; domains were registered for a three-week period. Logs were collected from all network data, until a new torpig binary that changed the domain generation algorithm was installed through Mebroot. 70GB of data were collected during the 10-day period that the Torpig botnet was under control.

All bots communicate with the Torpig command server through HTTP Post requests. This requests contains all the collected dat, as well as information about the bot. There are 8 different types of data that Torpig sends out: Mailbox account, email, form data, HTTP account, FTP account, POP account, and Windows passwords.

Attempting to analyze the size of the botnet is somewhat difficult. It can’t be done by merely checking how many IPs connect to the C&C server vecause of NAT and DHCP. However, Torpig contains information for hardware configurations and a mostly-unique ID for each bot. This led to an estimated 182,914 bots in the Torpig botnet. Further analysis was done to find the number of security researchers and search engine bots to get a more accurate number. Security researchers could be found by checking the default hardware configurations of VMWare and other virtualization tools. This gave a final estimate of 182,800 bots. In contrast, the number of IPs connecting to the C&C server was an order of magnitude larger. In the ten days the botnet was taken over, 49,924 new hosts were infected, though there were large spikes on two days.

Torpig is crafted to retrieve information that can easily be monetized. In the ten days, Torpig obtained 8310 accounts at financial institutions. 1660 credit and debit card numbers were also obtained. By pricing these accounts, the estimated value from these ten days is between $83K and $8.3M. In addition to information retrieval, Torpig opens proxies that can be used for spam or other activities, and represents a great deal of bandwidth that can be used in a DOS attack. It logs all other datas, which represents a huge breach of privacy and can be used to look at all chat, email, and other messages sent.

Analysis of the passwords retrieved showed that most were not very high strength, and roughly 28% of users reuse their passwords. This is evidence that the reason these botnets so large is a cultural problem, of people not understanding the consequences of irresponsible computer use.

Practical Virtual Method Call Resolution for Java

Monday, October 5th, 2009

This paper was on new methods of statically determining which methods can be called by virtual method calls in Java. This is used for method inlining and producing more accurate call graphs, which can improve other analysis. This paper suggest two algorithms that builds type propagation graphs, where vertexes are variables and edges represent the flow of types due to assignments,. anc compares them against CHA and RTA. These new algorithms were designed so that they would scale linearly with the size of the Java program being analyzed. The analysis was implemented with the Soot framework, which provides Jimple, which is a representation of Java bytecode. Thus, this analysis can be used for all JVM languages and not just Java.

The Soot framework is an APO for manipulating Java bytecode. The implementation of the algorithms worked by reading all class files required by an application, converting them into Soot representations of the class, with Jimple classes representing the bytecode. Once the analysis and transformation is complete, the optimized code is output again as bytecode.

Class Hierarchy Analysis is a method for determining the set of types for objects used in Java applications. Basically, for any method call c.m() where the declared type of c is a class, the possible types of c are the declared type of c and it’s subclasses. If the declared type of c is an interface I, the possible types of c are all classes that implement I or a subinterface of I, plus all classes that are subclasses of those classes.

A call graph consists of one node for each method that is reachable from the main method. Each node in the call graph contains a collection of call sites, which are method calls. Edges go from call sites to all nodes that could be called from that call site. This call graph may contain edges and nodes that are not necessary; we want to remove these as much as possible.

Rapid type analysis is a way to improve the call graph. It uses the fact that an object can only be called if it is created by a ‘new’. There are two types of RTA: optimistic and conservative. In conservative RTA, CHA is performed and then the reachable methods are inspected to determine which types can be instantiated. Optimistic RTA constructs the call graph iteratively,

The two new algorithms refine RTA using the property that for any call c.m(), c can only be a type T if there is an expression of the form v = new T() followed by a set of assignments x_1 = v, x_2 = x_1, …, x_n = x_{n - 1}, c = x_{n - 1}. The two analysis work by building a type propagation graph, where nodes represent variables and a directed edge from a to b represents an assignment of the form b = a. This graph is used to determine which variables that methods are called can correspond to which classes in a more precise way than RTA.

One of the new algorithms, Variable-Type Analysis, uses the name of a variable as its representative. There are three types of variables, with the following representative names: local variables/parameters have a name C.m.a, fields are named C.f, and array references are named a[x].

The nodes of the type propagation graph is constructed as follows:

For every class C in the program:
    For every field F in C
        create a node C.F

For every method C.m in the call graph of P:
    For every parameter p_i of C.m, create C.m.p_i
    For every local variable l of C.m, create C.m.l
    Create C.m.this
    Create C.m.return

Afterwards, assignments between variables in the form of assignment statements and method calls are used to construct the edges of the type propagation graph. Native methods have their information input by hand. Aliases do not affect this alias’, because locals and fields cannot be aliased - if there is are two locals a and b, they refer to different locations. This is not true for arrays, and references to arrays can be stored in Objects. To solve this issue, when adding edges to the type propagation graph, if the two types are of array types or java.lang.Object, a bidirectional edge is added to the graph. Declared-type analysis works the same way, except that a node is used for each declared type and not each variable. The size of the graph is smaller, since there are fewer nodes, but the final answer is less precise.

To keep the algorithm simple, some design tradeoffs were made:

  • Aliases - To avoid solving the aliasing problem except in the case of arrays, one node was used for all instances of a field instead of one per arrays.
  • No killing based on casts - For assignments of the form lhs = (C)rhs, we do not take into account the extra type information given by the task of C, as this would require either a iterative solver or a more complex constraint solver.
  • Pessimistic algorithm - The algorithm adds edges for all method calls in the conservative call graph, instead of using which types could currently reach the call at each step of the analysis. However, this would require iteration. We improve the pessimistic algorithm by generating a better conservative call graph.
  • A whole-program approach - Instead of only propagating type information within a procedure and using the call graph to determine the effect of method calls.
  • A 3-address representation of bytecode - This greatly simplified the analysis

Seven benchmarks were used to compare algorithms, including javac. One use of these algorithms is to reduce the size of the call graph. For this, both VTA and DTA perform better than even optimistic RTA, with VTA performing better. VTA removes 10% to 65% of the methods and 17 to 65% of the edges, whereas optimistic RTA varies from 7% to 56%. VTA has more notable improvements on more object-oriented benchmarks. VTA in general seems to be an improvement over RTA in all cases.

The other measurement is how many polymorphic calls. Given a call graph constructed by CHA, the percentage of polymorphic calls that were determined to be monomorphic by RTA, DTA, and VTA were measured. VTA showed a significant improvement, whereas RTA and DTA did not prune the call graph very much - VTA can resolve between 12% and 96% of the calls. The benchmarks were then instrumented to output profiling information to construct a complete call graph. On the large OO benchmarks, while VTA did significantly better than RTA, it still did not match the profiling information. In javac, VTA only determined 82% of the calls that the profile indicated were monomorphic. There are two ways to attempt to improve this percentage; perform more expensive static analysis or do dynamic optimization, such as branch target prediction.

Using this information to implement inlining showed small increases in speed on both the soot and javac benchmarks. For soot, using CHA to determine monomorhpic calls gives a 1% speedup in execution time, whereas VTA give 3% speedup. In javac, these percentages are 0% and 2%.

Both DTA and VTA behave linearly in practise, and thus scale well. VTA is approximately a constant factor of 10 slower than DTA, and so while is more precise is also more expensive. In conclusion, VTA gives significant improvements to DTA, which performs better than both CHA and RTA.

Scalable Propagation-Based Call Graph Construction Algorithms

Wednesday, September 30th, 2009

This paper goes over several algorithms for constructing call graphs for Java programs. They compare an old algorithm, RTA, and three new algorithms (MTA, FTA, and XTA). The problem is that existing algorithms either are scalable to large programs but compute an excessively large call graph or compute a much more precise call graph but do not scale to larger problems. This paper introduced algorithms that appear to scale to large problems while computing a much more precise call graph.

A call graph is just a directed graph of what methods each method can call. Each method is a vertex in a call graph, and there is an edge from V_1 to V_2 if V_1 can call method V_2 in it’s execution. These graphs are determined by analysis of call sites, which are just Java lines of the form e.M(), or just a method call. Since these functions are dynamically resolved, these call graphs may be more imprecise than strictly necessary, depending on the algorithm used to figure out which call site can call each method. Call graphs are used for whole-program optimization. Methods that are not called from the main method can be removed, if only one method can be called from a call site then it can be inlined instead of called dynamically, etc.

Old algorithms

  • RA - RA, Reachability Analysis, only takes into accound the name of a method. Essentially, the main method is added to the list of reachable methods. For each method that is added to the list of reachable methods, go through all the call sites e.M() in the method and add all methods with name M to the list of reachable methods. This will give a set of reachable methods, and you can use this list to determine which methods are unreachable and can be deleted.
  • CHA - Class Hierarchy Analysis is an extension of RA that takes information about the class
    hierarchy to perform its reachability analysis. It works the same as RA, except instead of adding
    all methods with name m from a call site e.m(), it only adds methods with name m in classes that are subclasses of e.

  • RTA - Rapid Type Analysis is an extension of CHA that takes into account which classes were actually instantiated. For a call e.m(), it will only add those methods with name m that occur in subclasses of e that are instantiated in a reachable method. RTA is still simple to implement, scales, and computes call graphs significantly more precise than CHA.

New algorithms

  • XTA - XTA is an algorithm that uses a distinct set for each method M(S_M) and field x(S_x), in order to give a more local view of what types are available. XTA is defined by five constraints:
    1. The main method is reachable.
    2. For any method C.M, add C to M_X. Add the
    3. If a method is reachable, for all statements of the form “new C()” occurring inside M, add C to S_M.
    4. If a method is reachable, for all fields x that are read, add S_X to S_M.
    5. If a method is reachable, for All Fields x that are written, add the intersection of S_M and all subtypes of the declared types of x to S_x.
  • CTA - CTA uses a distinct set variable S_C for each class C, which unifies the flow of information about both methods and fields. It is defined by adding two constraints to XTA: If a class C defines a method M, S_C = S_M and if C defines a field x, S_C = S_x. This is less precise than XTA, but has the potential to be more scalable.
  • MTA - MTA uses a set variable S_C for each class C, and S_x for every field x. MTA adds this constraint to XTA: If a class C defines a method M, S_C = S_M.
  • FTA - FTA uses a set variable S_C for each class aC, and S_M for each method M. MTA adds this constraint to XTA: If a class C defines a field x, S_C = S_x.

The paper next covered some implementation issues that occurred while implementing these algorithms. The implementations use JikesBT for parsing class files. The new algorithms were added to Jax, which is an application extractor for Java. Jax uses RTA for constructing call graphs, so many of the data structures from Jax could be reused.

The algorithms were constructed in an iterative style, where types are propagated outwards from the main method. XTA had three lists associated with each method and field; one with tyeps that have been propagated to other components, current types that will be propagated in the next iteration, and new types that are propogated to the component in the current iteration. FTA and MTA use a shared set for all the methods and fields in a class.

Arrays are abstracted as classes with a field representing all of it’s elements. Methods are assumed to read an element from array A if an object of type A is propagated to the method, and the mehtod contains an aaload bytecode instruction. Array writes are treated similarly. Exceptions can also cause complex type flow between methods, since several stack frames may be skipped. Instead of implementing some way to handle type propagation of exceptions, there is one set of types that represent the type of all expressions whose static type is of type Throwable, and use that set to resolve method calls on Exception objects. This is not very precise, but the number of types involved will likely be very small. Reflective calls are dealt with by manually supplying the necessary information.

Another issue that had to be solved was the analysis of incomplete applications. Most applications use external libraries which you may not have the source for, such as the java standard library. The solution to this is to have one set of objects for all external classes. This works by assuming that any library call can call any method on an object it is passed. However, this is oftentimes quite conservative, and for some heavily-called methods the analysis is coded to not add types to the global set.

12 benchmark programs were used to test the different algorithms. XTA greatly reduced the number of types available per average in methods from RTA - XTA only has 12.3% of the types being available on average. XTA averages 1.6% fewer possible method definitions than RTA, leading to more methods that can be removed. XTA also computes call graphs with an average of 7.2% fewer edges than RTA. The last comparison, the number of virtual method calls determined to be monomorphic, shows that RTA classifies 7.8% of all call sites as polymorphic and XTA classifies 7% of call sites as polymorphic. MTA and FTA both had better results than RTA, but worse than XTA.

An analysis of the running times of the various algorithms showed that RTA was the fastest, and XTA was slower than RTA by a factor of about 8.3 on average. However, there does not seem to be correlation between program size and how much slower XTA is than RTA, implying they have the same asymptotic running times. MTA and FTA were both surprisingly slightly slower than XTA, although this is probably due to a poor implementation.

XTA, FTA, and MTA all produce more accurate call graphs. FTA is usually not much worse than XTA, and so if space is at a premium it may be used over XTA. XTA is considered the best algorithm for most cases. There does not seem to be a compelling reason to use MTA over RTA.