May 24, 2022

    Beyond Taint Analysis: Detecting Data Flows in Java Applications with ProGuardCORE

    Analyzing software is a common routine in various development stages including, but not limited to, debugging, optimization, safety verification, and security assessment. But it also has many uses outside of development, such as reverse engineering or CTF contests.

    Very often, the developer has to answer the question, “Is data from a specific origin accessed from a sensitive location? If yes, how did it reach there?”

    Answering this question helps detect a bug, a data leak, or a security vulnerability. And using data flow analysis is a formal method of answering this question in an automated and reliable way.

    Data flow analysis is an essential part of static program analysis. One of its special cases is taint analysis, a problem of determining whether there is any data flow between two points, called the source and the sink, of the program. Taint analysis provides sufficient information for detecting information and control leaks.

    In this blog, we introduce the newly added taint analysis feature in ProGuardCORE and provide a brief overview of the data flow analysis framework behind it.

    Not getting lost among data flow analysis tools

    Before diving into ProGuardCORE, let’s take a step back and consider some important data-flow analysis (DFA) characteristics to compare a few DFA tools.

    When evaluating different data flow analyzers, it is important to distinguish features important for your specific area of application. Without going deeply into domain-specific details, we consider three defining DFA characteristics:

    • Whether the analysis is performed during runtime (static vs dynamic)
    • Whether the analysis distinguishes between different execution paths (path-dependence)
    • How exhaustive the analysis is (conservative, or considering any possible —even unlikely — course of events vs forgiving, or moderate about its approximations)

    Static and dynamic analyses

    The difference between static and dynamic analyses lies in when it is conducted. A static analysis does not require the application to run but it rather abstractly interprets the code to achieve the result. By analyzing the application statically, we can guarantee the result for all possible execution paths and run the analysis just once before the program is shipped.

    On the contrary, a dynamic analysis runs with the program and requires either patching or execution in a special environment. Dynamic analysis typically produces less false positives because it never encounters infeasible paths. However, it comes at the cost of having to sacrifice the performance by always running it with the application, or having an unsound result obtained from only analyzing a limited number of test cases.

    The table below summarizes the differences between static and dynamic analyses.

    Analysis Type Static Dynamic
    When it Takes Place Before distribution During runtime or testing
    Paths Considered All (including infeasible paths) Only occurred during execution
    False Negatives Potentially none Only if run during runtime
    False Positives Inevitable Potentially none

    Path-dependence

    Path-dependence is the ability of the analysis to discard infeasible paths and make the result more accurate. Consider the following example:

    if (x > 0)
    	y = 1;
    if (x <= 0)
    	y = 1;
    

    After executing the code above, the value of y is equal to 1 because exactly one if-branch will be executed, assuming that x is a regular numeric type (i.e., it is either positive or non-positive). The paths when either none or both branches will be triggered are infeasible (i.e., they will never be executed).

    However, if the analysis is path-insensitive, it will conclude that y is either equal to 1 or undefined, which would correspond to any value in the conservative interpretation. Thus, path sensitivity narrows down the set of possible values of y from all integers to 1.

    Path sensitivity is naturally achieved in dynamic analyses which are performed during runtime and thus cannot go through infeasible paths. For static analysis, exact path-sensitivity is undecidable (as equivalent to the Halting problem), although a partial result is achievable with SMT solvers.

    Conservatism

    Because an exact analysis cannot be performed in most nontrivial cases, we follow the hypothesis testing terminology and distinguish between two sorts of errors: false positive and false negative.

    In taint analysis, a false positive would mean that the analysis reports a data flow which does not correspond to any feasible path, and a false negative would be the absence of a feasible data flow in the analysis result.

    This definition extends to other types of analysis, as well, if it is read as “a faulty result is false positive if it is caused by assuming an infeasible path, and false positive if an actual path is missed.”

    Conservative analysis tolerates false positives while completely eliminating false negatives. Thus, it produces a result which is reliable for all execution paths, making it suitable for detecting possible automatic optimizations (i.e., it guarantees that the optimized code is semantically equivalent to the original). Worth noting: In the literature, conservatism and soundness are often used interchangeably.

    Forgiving analysis finds a trade-off between false negatives and false positives and tries to minimize the latter. For example, if the analysis result informs the user about necessary manual actions upon detecting a data flow, it is important to minimize false positives and thus reduce the user’s annoyance.

    Analysis Type Conservative Forgiving
    Application Area Compiler optimizations Reporting to the user
    False negatives None Possible
    False Positives Inevitable Minimized but still possible

    ProGuardCORE in context

    With a refresher in mind regarding key DFA characteristics and tools, and knowing that taint analysis is widely used for detecting vulnerabilities in applications, let’s transition to exploring the latest feature update in ProGuardCORE.

    In this section, we consider some well-known analysis frameworks and compare their approaches to ProGuardCORE.

    Analysis Features FlowDroid TaintDroid Amandroid Hopper ProGuardCORE
    Static/Dynamic static dynamic static static static
    Path-dependence no yes no yes no
    Conservative no no yes yes no

    Static analysis for stronger guarantees and better runtime

    ProGuardCORE analyzes the applications statically without affecting the runtime, unlike TaintDroid, which requires instrumentation of the Dalvik VM to track taints. Thus, the result covers all execution paths and does not have to be run every time the application is launched.

    Approximate path-dependence planned

    Path-dependence is naturally expected from dynamic analysers since they consider only feasible paths, whereas static analyzers have to apply an approximate technique. For instance, the tool Hopper derives the feasibility information in its goal-directed analysis. This is currently unsupported in ProGuardCORE, but adding a predicate analysis for excluding some obvious infeasible paths can be achieved naturally due to its extensible architecture.

    Forgiving analysis for better user experience

    ProGuardCORE exhibits a forgiving behavior by simply ignoring the calls it cannot resolve and using a default unsound approximation of library calls. This guarantees construction of an informative analysis result that illustrates the execution path in case of a data flow detection. The opposite conservative example is Amandroid, which does not analyze the native code but rather overapproximates it by assigning all data structures reachable from a native method and unknown value.

    Extensible approach instead of ad hoc workaround

    FlowDroid is the most similar tool to ProGuardCORE judging by the aforementioned analysis features. However, there is a significant difference in architecture: FlowDroid is a taint analysis tool and is not designed to be used for other purposes. On the contrary, ProGuardCORE benefits from the more general configurable program analysis (CPA) making it easy to extend with other types of analyses.

    This difference is visible in the interprocedural analysis performed by both tools. FlowDroid uses a graph representation for its method summaries, which fits well the binary content of the taint analysis. ProGuardCORE, however, follows the block-abstraction memoization (BAM) approach, storing the analysis results for each context-call pair. This technique is domain-independent and thus allows analyses with more informative abstractions, like predicate analysis.

    Detecting data flows with ProGuardCORE

    The preparation for data flow analysis in ProGuardCORE consists of three steps:

    1. Transforming the target Java bytecode into a control flow automaton
    2. Preparing domain-specific components of the analysis
    3. Composing a suitable CPA run from the result of the previous steps and suitable analysis parameters

    Since taint analysis is the only currently available data flow analysis method, it will be the focus of this tutorial.

    Modeling the control flow with a control flow automaton

    A control flow automaton arranges the bytecode into a graph structure easing traversal along execution paths. To create this model, we first need to come up with the program we will analyze. For simplicity, we use a string representation rather than reading from a jar file:

    val program =
                """
                    class A
                    {
    
                        public static void main()
                        {
                            sink(callee());
                        }
                        
                        public static String callee()
                        {
                            return source();
                        }
                        
                        public static void sink(String s)
                        {
                        }
                        
                        public static String source()
                        {
                            return null;
                        }
                    }
                """.trimIndent()

    In our example, the taint goes to the sinksink(String s)sensitive to its only argument through a call tocallee()from the taint sourcesource()returning a tainted value. The actual code of the source and the sink is not relevant for us because we define their semantics separately in the next section.

    Now we need to get the ProGuardCORE program class pool:

    val programClassPool = ClassPoolBuilder.fromSource(JavaSource(
                "A.java",
                program
        )).programClassPool

    Next, we create the intraprocedural CFA with CfaUtil:

    val cfa = CfaUtil.createInterproceduralCfaFromClassPool(programClassPool)

    Configuring taint sources and sinks

    Now we can see ProGuardCORE in action. As we mentioned before, ProGuardCORE supports taint analysis by tracking the data flows from taint sources to taint sinks. A taint source is a method which returns the taint or applies it to its arguments, calling instance, or static fields. For simplicity, we create a taint source returning the taint:

    val taintSource = TaintSource(
                "LA;source()Ljava/lang/String;", // the fully qualified name of a source method
                false,                           // whether the source taints the calling 
                                                 // instance 
                true,                            // whether the source taints its return
                setOf(),                         // a set of tainted arguments
                setOf()                          // a set of tainted global variables
        )

    Our source is the methodsource()from classAreturning a tainted string. The third booleanTaintSourceargument defines whether the return value should be tainted.

    A taint sink is a method call which is sensible to the taint. It can access the tainted value through its argument, calling instance, or static fields. Below, we create a taint sink which accepts the taint as its first argument:

    val taintSink = JvmTaintSink(
                "LA;sink(Ljava/lang/String;)V", // the fully qualified name of a sink method
                false,                          // whether the sink is sensitive to the calling
                                                // instance
                setOf(1),                       // a set of sensitive arguments
                setOf()                         // a set of sensitive global variables
        )

    The taint sink is the void methodsink(String)from classAwith its only argument sensible to the sink, as one can see from the thirdJvmTaintSinkargument. Note that the argument enumeration starts with one regardless of whether the method is static. This is different from Java byte code convention, which has the calling instance as the first argument for non-static methods.

    Setting up the analysis run

    The CPA algorithm is encapsulated in theJvmTaintMemoryLocationBamCpaRunclass. It requires a CFA, the taint sources, and sinks for its work. Now we have everything we need:

    val taintMemoryLocationCpaRun = JvmTaintMemoryLocationBamCpaRun(
            cfa,                                 // the CFA
            setOf(taintSource),                  // a set of taint sources
            MethodSignature("A", "main", "()V"), // the signature of the main method
            -1,                                  // the maximum depth of the call stack
                                                 // 0 means intraprocedural analysis.
                                                 // < 0 means unlimited call stack.
            TaintAbstractState.bottom,           // a cut-off threshold
            setOf(taintSink)                     // a collection of taint sinks
    )

    We provided the constructor with the components we already had, the entry point method signature, the maximum call stack depth (-1 stands for unlimited), and the cut-off threshold for the trace extraction telling it that we are not interested in empty taints.

    We can now run the analysis and obtain the traces:

    val traces = taintMemoryLocationCpaRun.extractLinearTraces()

    Interpreting the traces

    The resulting set contains only one trace, as expected. We can now take a look at its string representation:

    [JvmStackLocation(0)@LA;main()V:3, JvmStackLocation(0)@LA;callee()Ljava/lang/String;:3]

    The format of the trace is a sequence of memory locations going from the sink to the source. Each memory location consists of a description where it takes place (at the operand stack, the local variable array, the static memory, or the heap) and a program point (method and offset) it refers to.

    Our trace consists of two memory locations, both at the top of the stack but in different methods. The first memory locationJvmStackLocation(0)@LA;main()V:3corresponds to the first argument of the methodsink(String)which is also the return value ofcallee().

    The second memory locationJvmStackLocation(0)@LA;callee()Ljava/lang/String;:3is the return of the taint sourcesource(). Thus, the trace covers the whole path between the taint creation and consumption.

    This tutorial is also available in Java in the ProGuardCORE manual.

    Writing your own analysis in ProGuardCORE

    The taint analysis is not the only possible analysis achievable with ProGuardCORE. One can relatively easily extend the existing framework with their own analysis. The first step is to create your ownAbstractState. This requires finding a proper representation of the information you are going to track, defining a semi-lattice on it, and specifying the transfer relation for your analysis. The latter can be forked from the existing defaultJvmTransferRelationby overriding its methods encapsulating the instruction and call semantics.

    Applications of DFA

    The taint analysis in ProGuardCORE can be applied to detect various vulnerabilities related to data in code. They include, but are not limited to:

    • Data leaks where some sensitive data is read in one place and passed to an undesired channel
    • Data insertion, where data is obtained from an untrusted source and written to a sensitive storage
    • Control leaks, where potentially dangerous data is used for execution

    Thus, using taint analysis with well chosen sources and sinks can help find potential threats in already compiled code and reduce the security assessment effort significantly. Note that since the analysis is forgiving, an empty result does not necessarily mean the absence of a data flow; it only excludes some straightforward possibilities. And this also applies the other way: If there is a data flow detected, the trace may still be infeasible due to path insensitivity and may never actually occur during the program execution.

    Conclusion

    There is a broad range of data flow analysis tools, each with their own feature set and specific applications. In this post, we considered three fundamental characteristics to help distinguish them. Thus, we divided all analyses into:

    1. static/dynamic
    2. path-dependent/independent
    3. conservative/forgiving

    The ProGuardCORE data flow analysis feature includes a working interprocedural taint analysis and an extensible framework for other analyses. It is:

    • static, hence can be run once before the application deployment
    • path-independent, but can be complemented with analyses contributing to path-sensitivity due to its extensible architecture
    • forgiving, making it a good base for generating warnings for user programs

    The tutorial shows how to analyze an application starting from the definition of taints, modeling the control flow with a control flow automaton, and finally interpreting the witness traces.

    Check out the ProGuardCORE manual to review this tutorial in Java, as well.

    Dmitry Ivanov - Software Engineer

    Discover how Guardsquare provides industry-leading protection for mobile apps.

    Request Pricing

    Other posts you might be interested in