May 19, 2026

Debugging App Performance: Lessons from Bytecode Experts

Compiler-based protection is the most effective way to secure a mobile application. Because it weaves defenses directly into the app’s logic, it is more difficult for attackers to bypass or strip the protection away. But keeping an app secure is only half the battle. A crucial requirement for any security mechanism is maintaining optimal performance and ensuring a seamless user experience.

Achieving this requires a deep understanding of runtime behavior. While spotting a performance drop is usually straightforward, finding the actual root cause can be challenging. Building a polymorphic Android app protection engine like DexGuard has given us many valuable insights into analyzing runtime behavior in such environments. Profiling and debugging a prebuilt APK is a hurdle on its own, as standard debugging tools rely heavily on having the source code available. Additionally, factors such as invisible overhead across the JNI boundary and optimizations performed by the Android Runtime (ART) can make it even harder to pinpoint where time is actually being spent.

In this post, we will start with surface-level tools and progressively dive into the more difficult, unconventional techniques we’ve used to investigate performance bottlenecks deep within the compiled application. These lessons can be useful for anyone working in similar conditions, whether you are profiling third-party dependencies, reverse engineering, or manipulating bytecode.

Profiling Android release builds using Perfetto

Tracking app startup time is a common practice for monitoring app performance. We rely on this exact metric, among others, to ensure our protections do not degrade the user experience. When a slowdown does occur, discovering the root cause usually starts with a standard Java/Kotlin profiler (like the one in Android Studio).

However, analyzing a prebuilt app presents two unique hurdles:

  1. Standard tools require the original source code (which doesn't help when the bottleneck was introduced into the final compiled bytecode)
  2. You need to manually make the release build profileable or debuggable.

Tip: Making a Build Profileable

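To overcome the second hurdle, you can unpack the APK with apktool, add <profileable android:shell="true"/> inside the <application> tag in the AndroidManifest.xml, then repack and re-sign the APK (a repacked APK loses its original signature and will not install until it is signed again).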

This is preferred over making the app debuggable, which introduces significant performance penalties that can taint your test results. Note, however, that <profileable> is only available on Android 10 (API level 29) and higher; on older versions you can instead add android:debuggable="true" to the <application> tag in the manifest.
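The manifest edit can also be scripted once apktool has decoded the APK. The following is a minimal sketch, not part of the original workflow: the function names and the decoded_app path are hypothetical, and it assumes the decoded manifest is plain-text XML that uses the standard android namespace (other namespace prefixes may be renamed when the file is written back).

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
# Keep the conventional "android:" prefix when the tree is serialized again.
ET.register_namespace("android", ANDROID_NS)


def add_profileable(root):
    """Add <profileable android:shell="true"/> under <application> if missing.

    Returns True when the element was added, False when it already existed.
    """
    app = root.find("application")
    if app is None:
        raise ValueError("manifest has no <application> element")
    if app.find("profileable") is not None:
        return False
    node = ET.SubElement(app, "profileable")
    node.set(f"{{{ANDROID_NS}}}shell", "true")
    return True


def patch_manifest(path):
    """Patch an apktool-decoded AndroidManifest.xml in place (hypothetical helper)."""
    tree = ET.parse(path)
    if add_profileable(tree.getroot()):
        tree.write(path, encoding="utf-8", xml_declaration=True)
```

A call such as patch_manifest("decoded_app/AndroidManifest.xml") would run between unpacking and repacking the APK.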

Once the app is profileable, you can then capture the trace. While Android Studio does enable profiling a prebuilt APK, the process is quite cumbersome. It requires creating and configuring a new project for every APK, waiting for Smali indexing, and working with trace files that can become quite large (often several gigabytes).

Instead, we highly prefer profiling by running Perfetto, Android's default profiler, directly via ADB. This bypasses the IDE entirely and generates much smaller trace files (typically in the 10MB range) while retaining all necessary information for our use cases.

Method sampling vs. Method tracing

When starting the profiler via ADB, we recommend performing method sampling.

  • Method sampling acts like a fast trace. It captures the call stack periodically (e.g., every millisecond). It adds almost no overhead, retains the app's native performance, and more clearly points out the functions where the app is spending most of its time.
  • Method tracing, on the other hand, records the exact start and end of every single method call. This is useful for analyzing individual functions in isolation, but it introduces massive overhead that severely impacts the app's timing and distorts real-world performance. Furthermore, because it logs everything, the volume of data can quickly cause the trace buffer to overflow, resulting in dropped entries.

Always start with method sampling. You can increase the sampling frequency to get a more granular view of the app's execution. Full method tracing is rarely useful for profiling the app as a whole due to buffer overflow, so reserve it strictly for profiling specific, isolated functions (for example, by wrapping the code block in Trace.beginSection()).
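To make the trade-off concrete: a sampling profiler never times a method directly. It only records which methods were on the stack at each tick, so the time attributed to a method is an estimate (number of samples containing it, multiplied by the sampling interval). The toy sketch below illustrates that arithmetic with a hypothetical function name; it is not how Perfetto or ART store samples internally.

```python
from collections import Counter


def estimate_inclusive_time_us(samples, interval_us):
    """Estimate inclusive time per method from periodically sampled call stacks.

    samples: list of call stacks, each a list of method names (outermost first).
    interval_us: sampling interval in microseconds.
    """
    counts = Counter()
    for stack in samples:
        # Count a method once per sample, even if it recurses within the stack.
        for method in set(stack):
            counts[method] += 1
    return {method: n * interval_us for method, n in counts.items()}
```

A shorter interval yields more samples and a finer-grained estimate, which is why increasing the sampling frequency gives a more granular view. Methods that run for less than one interval may not appear at all.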

Capturing the trace

To capture a profiling trace via ADB:

1. Start capturing a trace. For method sampling with an interval of 1000 µs, use:

adb shell am start -n <your.package.name>/.MainActivity --start-profiler /data/local/tmp/trace.trace --sampling 1000

(Note: Remove --sampling 1000 for method tracing).

2. Stop the capture once the app has started up or you are done performing the slow action:

adb shell am profile stop <your.package.name>

3. Pull the trace file from the device to your machine:

adb pull /data/local/tmp/trace.trace

Note: If the generated trace file is empty (0 bytes), double-check that the app is actually profileable or debuggable. Perfetto will not warn you in that case and it will fail silently.
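The three steps above can be wrapped in a small script so that every capture uses the same settings. This is a sketch under stated assumptions: the helper names, the default .MainActivity activity, and the fixed wait time are all hypothetical, and passing the package name to am profile stop is an assumption about your device's am version.

```python
import os
import subprocess
import time

DEVICE_TRACE = "/data/local/tmp/trace.trace"


def start_profiler_cmd(package, activity=".MainActivity",
                       trace_path=DEVICE_TRACE, sampling_us=1000):
    """Build the launch command; sampling_us=None requests full method tracing."""
    cmd = ["adb", "shell", "am", "start", "-n", f"{package}/{activity}",
           "--start-profiler", trace_path]
    if sampling_us is not None:
        cmd += ["--sampling", str(sampling_us)]
    return cmd


def stop_profiler_cmd(package):
    """Build the command that stops the profiler for the given process."""
    return ["adb", "shell", "am", "profile", "stop", package]


def pull_cmd(trace_path=DEVICE_TRACE, dest="trace.trace"):
    """Build the command that copies the trace from the device."""
    return ["adb", "pull", trace_path, dest]


def capture(package, seconds=10.0, dest="trace.trace"):
    """Launch with profiling, wait, stop, pull, and reject an empty trace."""
    subprocess.run(start_profiler_cmd(package), check=True)
    time.sleep(seconds)  # reproduce the slow action during this window
    subprocess.run(stop_profiler_cmd(package), check=True)
    time.sleep(1.0)      # assumed grace period for the device to flush the file
    subprocess.run(pull_cmd(dest=dest), check=True)
    if os.path.getsize(dest) == 0:
        raise RuntimeError("empty trace: is the app profileable or debuggable?")
    return dest
```

The explicit size check turns the silent 0-byte failure into an immediate error.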

To view the exported trace, we use the modern web UI at https://ui.perfetto.dev. If you are working with an optimized or obfuscated app, you can also use the Python script from Appendix A along with an R8/ProGuard mapping file to deobfuscate the trace file and more easily make sense of the information.

By applying these steps, you can perform an initial triage to locate the slowdown, and often identify the exact problem immediately.

Example: Spotting slow API calls in a profiling trace

(Note: Some specific details in the following example have been omitted or simplified for illustrative purposes.)

We captured a trace (method sampling with a 500 µs interval) for an obfuscated APK that was unexpectedly slow at launch. When we opened it in Perfetto, the root cause stood out immediately:

[Figure 1: Perfetto trace showing the repeating pattern inside MainActivity.<clinit>]

The MainActivity.<clinit> was taking a long time, and it was filled with a repeating pattern. The original app logic here was a simple byte-by-byte checksum calculation. However, to prevent attackers from lifting the code, our protection engine had automatically added an environmental check tied to the checksum calculation, such as NetworkInterface.getNetworkInterfaces() inside this loop.

By using Perfetto's selection tool to create a pivot table, we could also easily quantify the impact: (Note: Due to method sampling, not all numbers are exact or will perfectly add up.)

[Figure 2: Perfetto pivot table quantifying the impact]

While this is usually a seamless defense, the trace revealed that on this specific device, the injected OS call took disproportionately longer to execute, leading us to adjust our engine's heuristics.

Tracing Android native code using Simpleperf

Often, a performance bottleneck doesn't live in Java or Kotlin code, but hides behind the Java Native Interface (JNI). While standard Java profilers easily reveal if you are crossing the JNI boundary too frequently (which is inherently expensive), they treat the native execution itself as a complete blind spot.

Example: Slow JNI call in a profiling trace

This is exactly what we encountered during a recent investigation. The Java profiling trace we captured simply showed a massive black box: a native method taking an unusually long time to execute.

[Figure 3: Java profiling trace showing the long-running native methods]

The trace above shows two obfuscated native methods, o.Ch._init_lambda5 and o.Bz.write, consuming significant time, but it provides no visibility into what the C code is actually doing.

If your slowdown is hidden within one of these native black boxes, you need to trace raw CPU cycles. For that, we use simpleperf, a powerful profiling tool included in the Android NDK. simpleperf provides flamegraphs and execution traces just like a Java profiler, but it specifically details the native C/C++ methods executed by the app, alongside lower-level system calls (syscalls).

Capturing the trace

The steps to capture a native trace and generate a report are:

1. Record the trace. Run the app using the command below and interact with it for a set duration (e.g., 10 seconds) to generate a perf.data file:

python3 $ANDROID_HOME/ndk/<VERSION>/simpleperf/app_profiler.py -p <pkg-name> -a .MainActivity -r "-g --duration 10"

2. Link the debug symbols. If the native library inside the APK had its symbols stripped (which is standard for release builds), update the perf.data file to pull symbol names from your local, non-stripped build:

python3 $ANDROID_HOME/ndk/<VERSION>/simpleperf/binary_cache_builder.py -i perf.data -lib debug/lib/arm64-v8a

3. Generate the report. This creates an interactive HTML report containing the flamegraphs and charts, which you can open directly in your browser. Pass in your ProGuard/R8 mapping file so the Java-to-native calls are more readable:

python3 $ANDROID_HOME/ndk/<VERSION>/simpleperf/report_html.py --show-art-frames --proguard-mapping-file mapping.txt

Once generated, the interactive HTML report provides a high-level overview of where CPU time is being spent across the entire process. It includes interactive pie charts that break down execution time by thread, library, and individual function. Additionally, the report generates native flamegraphs, which offer another intuitive view of the call stack (we will look closer at these in the next section).

Example: Finding native hotspots with Simpleperf

To see this in action, we can look at the generated HTML report from our investigation. By interacting with the chart, we drilled down into specific threads to see the cumulative execution duration of all native libraries. In our case, we were specifically interested in the obfuscated library libbede.so since that is where the slow method was.

Going further into the specific functions inside libbede.so and looking at the function breakdown, we discovered the root cause: a highly obfuscated function that was called frequently across the native code (the pink slice in the image below) was causing the slowdown.

[Figure 4: simpleperf function breakdown for libbede.so before tuning]

Identifying this allowed us to tune the protection configuration. Afterwards, we confirmed that execution time was spread much more evenly across the library:

[Figure 5: simpleperf function breakdown for libbede.so after tuning]

Debugging AOT compilation

Consider this scenario: you confirm via a Java profiling trace that a specific method is the root cause of a slowdown. However, the method is only slow on the very first app launch after installation, and mysteriously becomes fast on subsequent runs.

Sometimes, a performance drop isn't due to the code itself, but rather how the Android Runtime (ART) interprets and optimizes it. To understand why this happens, we first need to understand the different ways ART executes code throughout an app's lifecycle.

The ART compilation lifecycle

When a user launches an app, the runtime relies on three primary mechanisms to execute your code:

  1. The Interpreter: This is the slowest execution path. It runs “cold code” instruction-by-instruction, translating the bytecode one-by-one via the virtual machine.
  2. The JIT Compiler: As the interpreter runs, it detects frequently executed paths (or “hot code”) and feeds real-time data to the Just-In-Time (JIT) compiler. The JIT compiler converts segments of this bytecode into native machine code on the fly, allowing them to run much faster directly on the device's architecture rather than through the VM.
  3. The AOT Compiler (dex2oat): Over time, the execution data gathered by the JIT compiler is saved as Profiles. When the device is idle (or during app installation), the Ahead-Of-Time (AOT) compiler uses these profiles to pre-compile methods or entire classes into native machine code. This optimized native code is stored as .odex files in the private app directory under /data/app/<package.name>*/oat/. (You can read more about this in the official Android JIT architecture documentation).

This lifecycle is illustrated in the diagram below:

[Figure 6: the ART compilation lifecycle]

Essentially, when a user launches an app, the runtime first checks if an AOT binary is available. If it is, it executes the optimized native code directly. If not, it falls back to the raw .dex files and the slow interpreter.
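The decision described above can be summarized in a deliberately simplified toy model. Everything in it is an illustrative assumption (the class name, the threshold of 3, and the per-method granularity); real ART uses hotness counters and thresholds that differ between versions.

```python
from dataclasses import dataclass, field

JIT_THRESHOLD = 3  # hypothetical hotness threshold, for illustration only


@dataclass
class ToyRuntime:
    aot_compiled: set = field(default_factory=set)  # methods present in the .odex
    jit_cache: set = field(default_factory=set)     # methods compiled on the fly
    hotness: dict = field(default_factory=dict)     # interpreter invocation counts

    def execute(self, method):
        """Return which execution path runs `method` in this simplified model."""
        if method in self.aot_compiled:
            return "aot"
        if method in self.jit_cache:
            return "jit"
        # Cold code: interpret it and record one more invocation.
        self.hotness[method] = self.hotness.get(method, 0) + 1
        if self.hotness[method] >= JIT_THRESHOLD:
            # Hot enough: later calls run the JIT-compiled version.
            self.jit_cache.add(method)
        return "interpreter"
```

In this model, a method missing from aot_compiled pays the interpreter cost on its first few calls, which mirrors the "slow only on first launch" symptom.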

Now that you know how AOT compilation works, we can return to our initial scenario. If a specific method is unexpectedly slow on the first launch but fast on subsequent runs, it means the AOT compiler likely skipped pre-compiling that method during installation. This forces the app to rely on the slow interpreter on that initial launch, until the JIT compiler eventually catches up and optimizes it for future runs.

To prove this is exactly what is happening under the hood, and to see how the code performs when fully optimized, we can manually take control of the compiler.

Tip: Inspecting compiled .odex files with oatdump

You can see the optimized (native) code that dex2oat outputs by running oatdump: (Note: requires a rooted device)

adb shell "oatdump --oat-file=/data/app/com.example.../oat/arm64/base.odex --output=/data/local/tmp/oatdump.txt"

To illustrate what the output looks like, the following method was not compiled and falls back to the interpreter:

0: void android.arch.lifecycle.LiveData$4.<init>(android.arch.lifecycle.LiveData) (dex_method_idx=263)
  DEX CODE:
    0x0000: e801 0800 | iput-object-quick v1, v0, // offset@8
    0x0002: 7010 f90c 0000 | invoke-direct {v0}, void java.lang.Object.<init>() // method@3321
    0x0005: 7300 | return-void-no-barrier
  OatMethodOffsets (offset=0x00000000)
    code_offset: 0x00000000
  OatQuickMethodHeader (offset=0x00000000)
    vmap_table: (offset=0x00000000)
  QuickMethodFrameInfo
    frame_size_in_bytes: 0
    ...
  CODE: (code_offset=0x00000000 size_offset=0x00000000 size=0)
    NO CODE!

This method, by contrast, was compiled. Notice how code_offset points to a real memory address, and the CODE: block contains a hex dump of native machine instructions:

0: void android.arch.lifecycle.LiveData$4.<init>(android.arch.lifecycle.LiveData) (dex_method_idx=263)
  DEX CODE:
    0x0000: 5b01 1600 | iput-object v1, v0, Landroid/arch/lifecycle/LiveData; android.arch.lifecycle.LiveData$4.FullLifecycleObserverAdapter // field@22
    0x0002: 7010 f90c 0000 | invoke-direct {v0}, void java.lang.Object.<init>() // method@3321
    0x0005: 7300 | return-void-no-barrier
  OatMethodOffsets (offset=0x000061c0)
    code_offset: 0x003933c0
    ...
  CODE: (code_offset=0x003933c0 size_offset=0x003933bc size=24)...
    0x003933c0: b9000822 str w2, [x1, #8]
    0x003933c4: 34000082 cbz w2, #+0x10 (addr 0x3933d4)
    0x003933c8: f9404e70 ldr x16, [tr, #152] ; card_table
    0x003933cc: 530a7c31 lsr w17, w1, #10
    0x003933d0: 38316a10 strb w16, [x16, x17]
    0x003933d4: d65f03c0 ret
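Scanning a full oatdump by eye is tedious, so a short script can list which methods ended up without native code. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the regular expressions assume the multi-line oatdump layout (one method header line containing dex_method_idx, followed later by a CODE: line with a size field), which may differ between Android versions.

```python
import re

# Method header, e.g. "0: void a.B.run() (dex_method_idx=263)"
METHOD_RE = re.compile(r"^\s*\d+:\s+(?P<sig>.+?)\s+\(dex_method_idx=\d+\)")
# Native code summary, e.g. "CODE: (code_offset=0x... size_offset=0x... size=24)"
CODE_RE = re.compile(
    r"CODE:\s+\(code_offset=0x[0-9a-fA-F]+\s+"
    r"size_offset=0x[0-9a-fA-F]+\s+size=(?P<size>\d+)\)"
)


def compiled_sizes(oatdump_text):
    """Map each method signature to its native code size (0 means not compiled)."""
    sizes = {}
    current = None
    for line in oatdump_text.splitlines():
        header = METHOD_RE.match(line)
        if header:
            current = header.group("sig")
            continue
        code = CODE_RE.search(line)
        if code and current is not None:
            sizes[current] = int(code.group("size"))
            current = None
    return sizes


def uncompiled_methods(oatdump_text):
    """Return the signatures that would fall back to the interpreter or JIT."""
    return sorted(sig for sig, size in compiled_sizes(oatdump_text).items()
                  if size == 0)
```

Running uncompiled_methods() on the pulled oatdump.txt and searching the result for the slow method gives a quick yes/no answer on whether it was pre-compiled.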

Using dex2oat compiler filters

A great first triage step to confirm an AOT fallback is to force ART to pre-compile the app using different dex2oat compilation filters. We trigger this right after installation, but before running the app for the first time, using the following ADB command:

adb shell cmd package compile -m <compiler_filter> -f <your.package.name>

The most important compilation filters are:

  • verify: Performs zero Ahead-Of-Time (AOT) compilation. This acts as your baseline by forcing the app to rely entirely on the slow interpreter and JIT.
  • speed-profile: Compiles only the methods listed in the app's Baseline Profile (this is used by default when the app is installed).
  • speed: Compiles most of the app into native machine code while considering storage space.
  • everything: Compiles the entire app into native machine code, regardless of storage concerns.

You can verify that dex2oat successfully ran by checking logcat for the compilation filter:

I dex2oat : /apex/com.android.runtime/bin/dex2oat ... --compiler-filter=everything
I dex2oat : dex2oat took 784.372ms (2.780s cpu)

A useful experiment is to compile the app with the everything filter. If the slowdown completely disappears, you have found a massive clue: the code runs fine when natively compiled, but standard installations (which use speed-profile) are skipping it. Since compiling everything is not viable for real-world end users due to storage constraints, we need to ensure the specific classes we need are compiled. This is where Baseline Profiles come in.

Tip: Measuring App Startup Time via ADB

To measure the actual impact of these different filters, you can launch your app via ADB with the -W flag. This forces the console to wait for the app to finish launching and prints the exact startup time:

adb shell am start -W -n <your.package.name>/.MainActivity
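To compare filters systematically, the compile and launch commands can be combined in a loop. The sketch below is illustrative only: the helper names are hypothetical, the default activity is assumed to be .MainActivity, and it assumes the am start -W output contains a "TotalTime: <ms>" line. Because the app has already run by the second iteration, results for speed-profile in particular may be influenced by profiles collected on the device.

```python
import re
import subprocess

FILTERS = ("verify", "speed-profile", "speed", "everything")


def parse_total_time_ms(am_output):
    """Extract the TotalTime value (milliseconds) from `am start -W` output."""
    match = re.search(r"^TotalTime:\s*(\d+)\s*$", am_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def measure_startup(package, activity=".MainActivity"):
    """Force-stop the app, launch it with -W, and return the reported time."""
    subprocess.run(["adb", "shell", "am", "force-stop", package], check=True)
    result = subprocess.run(
        ["adb", "shell", "am", "start", "-W", "-n", f"{package}/{activity}"],
        check=True, capture_output=True, text=True,
    )
    return parse_total_time_ms(result.stdout)


def compare_filters(package, filters=FILTERS):
    """Recompile the app with each filter and record the startup time."""
    timings = {}
    for compiler_filter in filters:
        subprocess.run(
            ["adb", "shell", "cmd", "package", "compile",
             "-m", compiler_filter, "-f", package],
            check=True,
        )
        timings[compiler_filter] = measure_startup(package)
    return timings
```

A single launch is noisy, so repeating each measurement several times and comparing medians gives a more trustworthy picture.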

Patching Baseline Profiles

Baseline Profiles explicitly tell dex2oat which classes and methods are crucial to optimize during installation. If you unzip an APK, you will find an assets/dexopt/baseline.prof binary file which contains this information.

Tip: Decompiling Baseline Profiles

You can decompile and read the baseline.prof binary file using the Android SDK's profgen tool:

$ANDROID_HOME/cmdline-tools/latest/bin/profgen dumpProfile --profile baseline.prof --apk app.apk --output ./baseline-prof.txt

The output will be a list of class descriptors and methods:

HPLcom/example/MainActivity;->onCreate(Landroid/os/Bundle;)V

(Note: the flags at the beginning indicate execution states: H = Hot, S = Startup, P = Post-startup)
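Each line of the dumped profile follows a simple pattern: optional flags, a class descriptor, and (for method rules) an arrow followed by the method signature. The sketch below parses that pattern; the function name is hypothetical and the regular expression is a simplification that ignores array descriptors.

```python
import re

RULE_RE = re.compile(r"^(?P<flags>[HSP]*)(?P<cls>L[^;]+;)(?:->(?P<method>.+))?$")
FLAG_NAMES = {"H": "hot", "S": "startup", "P": "post-startup"}


def parse_rule(line):
    """Split one profile rule into flags, class descriptor, and method signature.

    Returns None for lines that do not look like a rule. Class-only rules
    (no "->") yield method=None.
    """
    match = RULE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "flags": [FLAG_NAMES[c] for c in match.group("flags")],
        "class": match.group("cls"),
        "method": match.group("method"),
    }
```

This makes it easy to check, for example, whether a given class appears in the profile at all before deciding what to add.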

If oatdump confirms your slow method is not being optimized, you can manually force AOT compilation by patching the Baseline Profile to test if it resolves the issue:

1. Decompile the app with Apktool: apktool d app.apk -o decompiled_app

2. Decompile the baseline.prof file using profgen (as shown in the tip above).

3. Modify the resulting text file by adding your slow methods.

4. Recompile it back into a binary:

$ANDROID_HOME/cmdline-tools/latest/bin/profgen bin ./baseline-prof.txt --apk app.apk --output ./baseline.prof --output-meta ./baseline.profm

5. Replace the original .prof and .profm files inside the decompiled_app/assets/dexopt/ directory.

6. Recompile the APK: apktool b decompiled_app -o app-patched.apk

You can now re-sign and reinstall the patched APK, then measure the app startup time again.
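Step 3 (adding the slow methods to the text profile) is easy to get subtly wrong by hand, since class names must be written as descriptors. The sketch below builds rule lines and appends only those that are missing; the helper names are hypothetical, and the default HSP flags are an assumption chosen for testing rather than a recommendation.

```python
def method_rule(class_name, method_signature, flags="HSP"):
    """Build one rule line from a dotted class name and a method signature.

    Example signature: "onCreate(Landroid/os/Bundle;)V" (name plus descriptor).
    """
    descriptor = "L" + class_name.replace(".", "/") + ";"
    return f"{flags}{descriptor}->{method_signature}"


def add_rules(profile_text, new_rules):
    """Append rules that are not already present, preserving the original order."""
    lines = profile_text.splitlines()
    existing = set(lines)
    for rule in new_rules:
        if rule not in existing:
            lines.append(rule)
            existing.add(rule)
    return "\n".join(lines) + "\n"
```

The returned text would be written back to the profile text file before running profgen bin in step 4.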

Tracing CPU cycles in compiled code

What if you patch the baseline profile, confirm the method is compiled via oatdump, but the problem still isn't solved?

At this point, we have to dig deeper. Because the code is now running as native machine code, we can use simpleperf to analyze the raw CPU cycles and see exactly where the compiled code is spending its time.

This time, we use a command to record CPU cycles system-wide for 10 seconds:

simpleperf record -a -g -f 1000 --duration 10 -o perf.data

We can then filter the output specifically for our package. We can generate an ordered list of executed functions and the percentage of time they consumed:

simpleperf report -i perf.data --comms <package-name> --sort symbol > trace_report.txt

Also, we can build a call graph to see the execution paths leading to those functions:

simpleperf report -i perf.data --comms <package-name> -g > graph_report.txt

This may show a lot of information, and although you might not be able to make sense of all of it, it can give clues as to what is happening. We'll show how this is used in the following example.
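One way to cut through the volume is to parse the sorted report and total the overhead per symbol family (for example, everything in the art:: namespace). The sketch below assumes the two-column "Overhead  Symbol" layout produced by --sort symbol; the function names and the prefix grouping are hypothetical choices for illustration.

```python
import re
from collections import defaultdict

# A data row looks like "20.17%  __kernel_clock_gettime"; header lines do not match.
ROW_RE = re.compile(r"^\s*(?P<pct>\d+(?:\.\d+)?)%\s+(?P<symbol>\S.*)$")


def parse_report(text):
    """Return (overhead_percent, symbol) tuples from a symbol-sorted report."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append((float(match.group("pct")), match.group("symbol").strip()))
    return rows


def overhead_by_prefix(rows, prefixes):
    """Sum overhead per symbol prefix; unmatched symbols are grouped as 'other'."""
    totals = defaultdict(float)
    for pct, symbol in rows:
        for prefix in prefixes:
            if symbol.startswith(prefix):
                totals[prefix] += pct
                break
        else:
            totals["other"] += pct
    return dict(totals)
```

Grouping by prefixes such as "art::" quickly shows whether runtime internals, rather than application code, dominate the profile.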

Example: The cost of JNI boundary crossings

We once reached this exact point in an investigation. We had patched the Baseline Profiles, but the app was still slow. Only after looking at the simpleperf symbol trace and call graph did we solve the puzzle.

The sorted trace already revealed a suspicious trend:

Overhead  Symbol
20.17%    __kernel_clock_gettime
15.17%    el0_sys
6.46%     art::ClassLinker::InitializeClass(art::Thread*, art::Handle, bool, bool)
4.89%     art::ClassLinker::ResolveField(unsigned int, art::ArtMethod*, bool)
4.07%     cntvct_read_handler
3.63%     art::ClassLinker::EnsureInitialized(art::Thread*, art::Handle, bool, bool)
2.78%     art::Monitor::MonitorEnter(art::Thread*, art::ObjPtr, bool)
2.68%     bool art::interpreter::MterpFieldAccessSlow(art::Instruction*, unsigned short, art::ShadowFrame*, art::Thread*)
2.55%     bool art::interpreter::MterpFieldAccessSlow(art::Instruction*, unsigned short, art::ShadowFrame*, art::Thread*)
2.15%     art::Monitor::MonitorExit(art::Thread*, art::ObjPtr)
1.70%     art::ObjectLock::ObjectLock(art::Thread*, art::Handle)
1.65%     mterp_op_sub_long
...

Namely, OS-level clock and ART lock functions were dominating the execution time, alongside MterpFieldAccessSlow. Our code didn't seem to heavily rely on these, making this very suspicious. We then checked the call graph to trace the path to these functions, which looked as follows:

|--98.71%-- o.setScrollbarFadingEnabled.
    |--42.12%-- mterp_op_sget
    |    |--96.01%-- art::interpreter::MterpFieldAccessSlow(...)
    |    |    |--84.80%-- art::ClassLinker::EnsureInitialized(...)
    |    |    |    |--91.54%-- art::ClassLinker::InitializeClass(...)
    |    |    |    |    |--84.69%-- art::ObjectLock(...)
    |    |    |    |    |    |--95.50%-- art::Monitor::MonitorEnter(...)
    |    |    |    |    |    |    |--91.32%-- art::NanoTime()
    |    |    |    |    |    |    |    |--96.96%-- clock_gettime
    |    |    |    |    |    |    |    |    |--98.15%-- __kernel_clock_gettime

The graph showed that our compiled code was constantly forcing slow field accesses (mterp_op_sget), which require expensive locks (MonitorEnter), and this solved our puzzle:

    • The applied obfuscation had structurally moved certain field accesses into a secondary, generated class.
    • While our primary class was compiled into native code (thanks to our patched profile), the secondary class was not.
    • Consequently, every time the native (compiled) method accessed those Java fields, it had to cross the boundary back to the interpreter, which required setting up expensive locks.
    • This explained why the everything filter resolved the issue during initial triage: it forced both classes to compile to native code, removing the interpreter boundary entirely!

Conclusion

Debugging performance issues when manipulating bytecode is a unique challenge, as even small changes can often lead to surprising runtime behaviors. As shown in the examples above, finding the root cause of a slowdown requires digging deep into the app, and relying on multiple, lesser-known tools. This requires a deliberate investment of time and engineering effort.

Because we know that applying advanced obfuscation and optimization requires careful consideration of the performance impact, we keep track of performance during automated testing and diligently investigate issues before they reach our customers. This allows us to build solutions that work for any app and on all devices.

Appendix A: Deobfuscating Trace Files

When capturing a profiling trace for an obfuscated APK, you can use your ProGuard/R8 mapping.txt file to translate the obfuscated names back into their original names using the following Python script:

  1. Save the code below as deobfuscate_trace.py.
  2. Run it via your terminal, passing the mapping file, the obfuscated trace, and the desired output path:
python3 deobfuscate_trace.py mapping.txt obfuscated.trace deobfuscated.trace

The script:

import sys
import os
import argparse


def parse_mapping(mapping_path):
    """Parses a ProGuard/R8 mapping.txt file into class and method dictionaries."""
    class_map = {}
    method_map = {}

    with open(mapping_path, 'r', encoding='utf-8') as f:
        current_obf_class = None
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue

            # Class mapping: com.example.RealClass -> a.b.c:
            if line.endswith(':'):
                parts = line[:-1].split(' -> ')
                if len(parts) == 2:
                    real_class, obf_class = parts
                    class_map[obf_class] = real_class
                    current_obf_class = obf_class

            # Method mapping: 12:23:void myMethod(int) -> a
            elif current_obf_class and ' -> ' in line:
                if '(' in line and ')' in line:
                    left_part, obf_method = line.split(' -> ', 1)
                    obf_method = obf_method.strip()

                    # Keep only the text before the argument list, then take the
                    # last whitespace-separated token. This drops a leading
                    # "12:23:" line range and is not confused by a trailing
                    # ":12:23" original line range after the signature.
                    method_signature = left_part.split('(')[0].strip()
                    real_method = method_signature.split()[-1]

                    method_key = f"{current_obf_class}::{obf_method}"
                    method_map[method_key] = real_method

    return class_map, method_map


def deobfuscate_trace(trace_in_path, trace_out_path, class_map, method_map, force_version=False):
    """Translates obfuscated names in the trace file using the parsed mapping."""
    with open(trace_in_path, 'rb') as fin, open(trace_out_path, 'wb') as fout:
        line1 = fin.readline()
        line2 = fin.readline()

        # Check if the first two lines match exactly what we expect for a V3 trace
        is_version_3 = (line1.strip() == b'*version' and line2.strip() == b'3')
        if not is_version_3 and not force_version:
            print("Error: The trace file does not appear to be a raw ART trace (version 3).")
            print("Expected '*version' and '3' at the beginning of the file.")
            print("To ignore this error and attempt translation anyway, use the --force flag.")
            fout.close()
            os.remove(trace_out_path)
            sys.exit(1)

        # Write the header lines we just consumed to the output file
        fout.write(line1)
        if line2:
            fout.write(line2)

        # We're interested in translating the string table between "*methods" and "*end"
        in_methods = False
        for line in fin:
            if line.strip() == b'*methods':
                in_methods = True
                fout.write(line)
                continue
            elif line.strip() == b'*end':
                in_methods = False
                fout.write(line)
                # Dump the remaining binary data
                fout.write(fin.read())
                break

            if in_methods:
                try:
                    decoded_line = line.decode('utf-8')
                    parts = decoded_line.rstrip('\r\n').split('\t')
                    if len(parts) >= 5:
                        obf_class = parts[1]
                        obf_method = parts[2]

                        real_class = class_map.get(obf_class, obf_class)
                        method_key = f"{obf_class}::{obf_method}"
                        real_method = method_map.get(method_key, obf_method)

                        parts[1] = real_class
                        parts[2] = real_method
                        new_line = '\t'.join(parts) + '\n'
                        fout.write(new_line.encode('utf-8'))
                    else:
                        fout.write(line)
                except UnicodeDecodeError:
                    fout.write(line)
            else:
                fout.write(line)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Deobfuscate method names in Android ART trace files (v3).")
    parser.add_argument("mapping", help="Path to the ProGuard/R8 mapping.txt file")
    parser.add_argument("input", help="Path to the input obfuscated .trace file")
    parser.add_argument("output", help="Path to save the deobfuscated .trace file")
    parser.add_argument("--force", action="store_true", help="Override the '*version 3' header check")
    args = parser.parse_args()

    if not os.path.exists(args.mapping):
        print(f"Error: Mapping file '{args.mapping}' not found.")
        sys.exit(1)
    if not os.path.exists(args.input):
        print(f"Error: Input trace '{args.input}' not found.")
        sys.exit(1)

    print("Parsing mapping file...")
    cmap, mmap = parse_mapping(args.mapping)
    print(f"Loaded {len(cmap)} classes and {len(mmap)} methods.")

    print("Processing trace file...")
    deobfuscate_trace(args.input, args.output, cmap, mmap, args.force)
    print(f"Done! Deobfuscated trace saved to '{args.output}'.")

 
