The adventure begins with a $50 present: an Anbernic RG35XX Pro, one of those pocket-sized retro gaming handhelds that have exploded in popularity. The device can run Knulli, a custom Batocera-based Linux distribution designed for emulation. Out of the box, it works. Games load, controls respond, nostalgia flows.
But something nagged at me. The OS felt… heavy. And as someone who’s spent years working with Linux systems, I knew that “works” and “optimized” are very different things.
What followed was a weekend of kernel compilation, AI-assisted learning, debugging rabbit holes, and ultimately a deeper understanding of how operating systems translate code into the electrical signals that make a CPU do useful work.
The Hypothesis: Performance was lost due to generic builds
The RG35XX Pro runs on an Allwinner H700 SoC—a quad-core ARM Cortex-A53 with a Mali-G31 GPU. When I started digging into Knulli’s source code, I discovered that my device shared a kernel configuration with several other handhelds using the same chipset family. The configuration was built to work across multiple devices, not to squeeze every cycle out of any particular one.
This is sensible from a maintainer’s perspective. Supporting dozens of device-specific configurations is a nightmare. But it also means trade-offs: compiler optimizations favoring binary size over speed, conservative timer frequencies, debug symbols left enabled, and CPU governors tuned for safety rather than performance.
I wanted to know… What would happen if I compiled a kernel specifically for my hardware?
Step One: Know Your Hardware
Before optimizing anything, I needed to understand exactly what I was working with. The specs were marketed in vague terms, so I SSH’d into the device and started interrogating the system directly.
Here’s the reconnaissance process:
# CPU information - cores, features, architecture
cat /proc/cpuinfo
# Device tree - the authoritative source for hardware configuration
cat /proc/device-tree/compatible
ls /proc/device-tree/
# Memory configuration
cat /proc/meminfo
cat /proc/iomem
# Current kernel configuration (if available)
zcat /proc/config.gz > current_config.txt
# GPU information
cat /sys/class/misc/mali0/device/gpuinfo 2>/dev/null
ls /sys/class/devfreq/
# CPU frequency scaling capabilities
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# Storage and partition layout
lsblk
cat /proc/partitions
# Loaded kernel modules
lsmod
# Kernel boot log - driver probes and early hardware detection
dmesg | head -200
What emerged was a detailed picture: four Cortex-A53 cores capable of scaling from 480MHz to 1.512GHz, a Mali-G31 MP2 GPU with frequencies up to 648MHz, 1GB of RAM, hardware AES and CRC32 acceleration, and 64MB of contiguous memory allocated for GPU operations.
I compiled all of this into a comprehensive hardware spec, which became the foundation for everything that followed.
AI: From Specs to Optimization Strategy
Here’s where the project took an interesting turn. I had the hardware specs. I had access to the Knulli source code. What I didn’t have was deep experience with kernel compilation or the intuition for which configuration options would actually matter for emulation workloads.
I created a docs folder containing my hardware specification and the existing kernel configuration, then turned to Claude Code for help. My approach was straightforward: Create a new branch off Knulli’s main repository and ask Claude to read the optimization guides and implement a new make configuration for the RG35XX Pro based on the existing H700 hardware configurations.
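For anyone curious about the mechanics, the setup step looks roughly like the sketch below; the defconfig name is a placeholder, since Knulli’s Buildroot wrapper names its board configs per device:
# Inside a local checkout of the Knulli distribution repository
git checkout -b rg35xx-pro-optimized
# Buildroot-style flow: select the board's defconfig, then build the whole OS from source
make batocera-h700_defconfig    # placeholder name for the H700 board config
make                            # several hours; Buildroot handles per-package parallelism itself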
What followed was genuinely educational. Once the config file was in place, Claude walked me through the reasoning behind each change.
The Timer Frequency Revelation
The default kernel was configured with CONFIG_HZ=100. This meant the kernel’s internal timer ticked 100 times per second, giving 10ms granularity for scheduling decisions.
For a server, this is fine. For emulation targeting 60fps games? It’s problematic. Each frame of a 60fps game takes 16.67ms. With 10ms timer resolution, you can’t cleanly divide frames, leading to inconsistent frame pacing.
The recommendation was CONFIG_HZ=250, providing 4ms granularity. That gives the scheduler roughly four ticks per 16.67ms frame at 60Hz (NTSC) and an even five ticks per 20ms frame at 50Hz (PAL)—the two refresh rates virtually all retro games target—so frame deadlines are far less likely to fall awkwardly between ticks.
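The same /proc/config.gz trick from the recon step shows which tick rate a kernel was built with:
# Check the timer frequency baked into the running kernel
zcat /proc/config.gz | grep -E '^CONFIG_HZ'
# Stock build:     CONFIG_HZ_100=y and CONFIG_HZ=100
# Optimized build: CONFIG_HZ_250=y and CONFIG_HZ=250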
Compiler Optimization: Size vs. Speed
The kernel was compiled with -Os (optimize for size). This makes sense when you’re trying to fit a kernel into limited flash storage or memory. But the RG35XX Pro has 1GB of RAM, and I had a 64GB microSD card with plenty of space for shenanigans.
Switching to -O2 (optimize for speed) would increase the kernel size by 10-15%, but generate faster code paths. For CPU-bound emulation, this matters.
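In kernel terms this is a single config switch rather than a hand-edited CFLAGS line, and it’s easy to verify on the device:
# Which optimization level was the running kernel built with?
zcat /proc/config.gz | grep -E 'CONFIG_CC_OPTIMIZE_FOR_(SIZE|PERFORMANCE)'
# CONFIG_CC_OPTIMIZE_FOR_SIZE=y        -> compiled with -Os (the stock choice)
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y -> compiled with -O2 (the optimized choice)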
The Kernel Stack Protector Trade-off
This is where I had to make conscious security decisions. The kernel stack protector (CONFIG_CC_STACKPROTECTOR_STRONG) adds runtime checks to detect stack buffer overflows. This is a critical defense against certain classes of exploits.
But it also adds overhead. Every function call includes additional instructions to set up and verify stack canaries.
For a device that runs offline, plays decades-old games, and never processes untrusted network input in kernel space, I made the judgment call to disable it. This isn’t a decision I’d make for a server or a phone, but for a dedicated gaming handheld, the performance benefit outweighed the theoretical security risk.
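For anyone making the same call, the switch lives in the kernel .config; here’s a minimal sketch using the kernel’s own scripts/config helper (option names follow the older CC_STACKPROTECTOR naming used above—newer kernels call it CONFIG_STACKPROTECTOR_STRONG):
# Run from the kernel source tree to flip the stack protector off
grep STACKPROTECTOR .config
scripts/config --disable CC_STACKPROTECTOR_STRONG
scripts/config --enable CC_STACKPROTECTOR_NONE
make olddefconfig    # let Kconfig resolve any dependent options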
Claude helped me understand what I was actually trading away, which made the decision informed rather than blind.
Understanding the Stack: From Code to Silicon
One of the most valuable outcomes wasn’t a configuration change. Through our back-and-forth, I developed a much clearer mental model of how high-level code becomes machine instructions becomes electrical signals.
The -O2 flag doesn’t just “make things faster.” It enables specific optimization passes in GCC: loop unrolling, function inlining, instruction scheduling tuned for the Cortex-A53’s pipeline. Each optimization is a trade-off between code size, compilation time, and runtime performance.
Understanding that the Cortex-A53 supports specific ARM extensions (+crypto+crc) meant we could enable hardware-accelerated AES and CRC32 operations—instructions that would otherwise be emulated in software.
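Confirming that the silicon actually advertises those extensions is a one-liner; the matching compiler tuning is a single flag, shown here as a comment since the exact flags live in Knulli’s build files:
# Confirm the cores expose the crypto and CRC extensions
grep -m1 Features /proc/cpuinfo
# Expect flags along the lines of: fp asimd aes pmull sha1 sha2 crc32
# The corresponding GCC tuning flag looks like: -mcpu=cortex-a53+crypto+crc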
This is the kind of knowledge that textbooks can teach but that somehow clicks differently when you’re applying it to a little device sitting on your desk.
The Build Environment Saga: Mac to Linux
I should mention that I didn’t start this project on Linux.
My initial plan was to compile on my MacBook. I started with Docker but ran into macOS’s default case-insensitive filesystem, which the Linux source tree doesn’t get along with. I then tried a Debian VM. That lasted about two hours before I ran out of disk space. Buildroot, the system Knulli uses to construct the entire OS from source, is not small. Between the toolchain, source trees, and intermediate build artifacts, you’re looking at 50+ GB easily.
I could have allocated more space. I could have attached external storage. Instead, I took this as a sign.
I’d been meaning to set up a Linux desktop for gaming anyway, another project I’d been putting off. I installed Pop!_OS and finally had a proper development environment: ample storage, native performance, and no virtualization overhead.
Sometimes the obstacle is the path. Now I have a Linux gaming rig and a kernel compilation environment, all because a VM ran out of space.
The Debugging Rabbit Hole: When the Bug Was Already There
With the build environment ready and optimizations configured, I compiled my custom kernel. The build took several hours and ran overnight. I flashed the resulting image to a microSD card, inserted it into the device, and powered on.
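For completeness, the flashing step is the usual dd-to-microSD routine; the device path and image name below are illustrative, so check lsblk carefully before writing anything:
# Write the built image to the microSD card (illustrative paths - verify with lsblk first!)
lsblk                                  # identify the card, e.g. /dev/sdX
sudo dd if=output/images/knulli.img of=/dev/sdX bs=4M status=progress conv=fsync
sync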
The kernel booted. The UI appeared. And then I noticed problems.
The WiFi menu showed the feature as unavailable. Bluetooth was missing. Several emulator options weren’t appearing in the menus. My first instinct was that I’d broken something, perhaps misconfigured a driver, disabled a necessary module, or made some fatal optimization error.
I spent an embarrassing amount of time reviewing my changes before I thought to check whether these features worked if I compiled the stock Knulli image. They didn’t.
The RG35XX Pro had been added to Knulli relatively recently, and the integration was incomplete. The symptoms I was seeing weren’t from my optimization work; they were pre-existing bugs in the upstream repository.
Finding the Real Bug
The breakthrough came accidentally when I checked my router’s admin panel. There, in the connected devices list, was my RG35XX Pro. The device had WiFi connectivity at the system level. The kernel was fine. The problem was higher up the stack.
SSH still worked, so I connected and started investigating.
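The useful first checks in that situation sit below the UI, at the driver and connection-manager level; I’m assuming the interface is wlan0 and that Knulli, being Batocera-based, manages connections with connman:
# Is the WiFi stack alive underneath the frontend?
ip link show                # does a wlan0 interface exist at all?
iw dev wlan0 link           # is it associated with the access point?
connmanctl technologies     # what does the connection manager think is available?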
The culprit was in S02generate-capability, a script that runs during boot to detect device capabilities and write them to a configuration file that EmulationStation reads. The RG35XX Pro wasn’t in the list of recognized devices.
Because the device wasn’t recognized, the capability file wasn’t generated correctly. EmulationStation saw no WiFi capability flag, so it hid the WiFi menu. Same for Bluetooth. Same for various other features.
The fix was straightforward: add the RG35XX Pro to the recognition logic in the capability generation script. After that change, a rebuild, and a reflash, everything appeared correctly.
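I won’t reproduce the real diff here, but the shape of the fix is a few lines of device-detection logic; everything in the sketch below (the matched board string, the capability file path, the flag names) is illustrative rather than the script’s actual contents:
# Sketch: teach the capability script to recognize the board
CAPABILITY_FILE="/var/run/es.capabilities"          # placeholder path
BOARD="$(tr '\0' ' ' < /proc/device-tree/compatible)"
case "${BOARD}" in
  *rg35xx-pro*)
    # flags EmulationStation reads to decide which menus to show
    echo "wifi" >> "${CAPABILITY_FILE}"
    echo "bluetooth" >> "${CAPABILITY_FILE}"
    ;;
esac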
The Results
So after the hardware reconnaissance, the AI-assisted optimization, the platform migration, the debugging detour… What did I actually gain?
Maybe 5-10% better performance 😄
Some N64 games that previously stuttered now held steady at 60 fps. Frame pacing improved noticeably in titles that had exhibited occasional stuttering. Battery life increased thanks to the CPU completing work in fewer cycles, spending more time in low-power states.
Boot time was roughly the same. But interactions felt snappier. Menus responded faster. The whole experience felt more polished.
Was a 5-10% improvement worth a weekend of work?
For pure performance ROI, probably not. I could have spent that time actually playing games.
But performance wasn’t really the point.
What I Actually Learned
This project was a crash course in several things I’d understood abstractly but never applied hands-on.
Kernel configuration is a series of trade-offs. Every option exists because someone needed it. Understanding why an option exists (what problem it solves, what cost it incurs) is more valuable than memorizing what it does.
AI is remarkably effective at accelerating learning. I could have learned everything Claude taught me from documentation and source code, but it would have taken weeks instead of hours. Having an assistant that could explain the reasoning behind its recommendations and let me build on my understanding in real time made for a remarkable learning tool.
Debugging is often about questioning assumptions. I assumed my changes broke WiFi. That assumption sent me down the wrong path for hours. The fix came when I questioned a more fundamental assumption: that the stock image worked correctly.
The build environment matters. Trying to compile a Linux distribution in a resource-constrained VM was fighting the problem instead of solving it. Sometimes the right move is to step back and set up proper infrastructure.
“Good enough” configurations run on many devices but excel on none. Generic configurations are designed not to fail. Device-specific configurations can be designed to excel. The gap between those two isn’t always large, but when it exists, it’s real.
Would I Do It Again?
Absolutely.
Not for the performance gains, though they’re nice. For the understanding.
There’s a difference between knowing that a kernel manages hardware resources and understanding how a specific configuration option changes the way your specific CPU schedules work. Between knowing that compilers optimize code and seeing how -O2 generates different assembly for your specific architecture.
If you have a device running Linux, especially one designed for a specific task like gaming or media playback, I’d encourage experimenting. Clone the source, interrogate the hardware, question the defaults.
You might find 5-10% more performance hiding in the generic configurations. You’ll definitely find a deeper appreciation for the layers of abstraction between your code and the silicon executing it.
And if your VM runs out of space, maybe it’s time to finally set up that Linux desktop you’ve been putting off.