Mainline Hero Part 1 - First Attempts At Porting

In the first post of the series, I showed what information I gathered and what tricks can be used to debug our mainline port of the herolte kernel. While I learned a lot just by preparing for the actual porting, I did not even get as far as booting the kernel. I would have liked to write about what I did to actually boot a 5.X.X kernel on the device, but instead I will tell you about the journey so far.

If you are curious about the progress I made, you can find the patches [here]({{ site.social.git_url}}/herolte-mainline). The first patches I produced are in the patches/ directory, while the ones I created with lower expectations are in the patches_v2/ directory. Both "patchsets" are based on the linux-next source.

Starting Out

My initial expectations about mainlining were simple: The kernel should at least boot and then perhaps crash in some way I can debug.

This, however, was my first mistake: Nothing is that easy! Ignoring this, I immediately began writing up a Device Tree based on the original downstream source. This was the first big challenge, as the sheer volume of downstream Device Tree files is overwhelming:

$ wc -l exynos* | awk -F\  '{print $1}' | awk '{sum += $1} END {print sum}'
54952
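As a side note: when more than one file matches, wc -l also prints a final "total" line, and the awk pipeline above adds that line to the sum as well, so the printed figure may count every line twice. A minimal Python sketch of the same count that sidesteps this could look as follows. The function name count_lines and the default exynos* pattern are only illustrative.

```python
from pathlib import Path


def count_lines(directory, pattern="exynos*"):
    """Sum the newline counts of all files in `directory` matching `pattern`."""
    total = 0
    for path in Path(directory).glob(pattern):
        if path.is_file():
            # Count newline bytes, which is what `wc -l` reports per file.
            total += path.read_bytes().count(b"\n")
    return total
```

Calling count_lines(".") inside the downstream Device Tree directory should then give the number of lines without the extra total.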

But I chewed through most of them by just looking for interesting nodes like cpu or memory, which I then transferred into a new, simple Device Tree. At this point I learned that the GitHub search does not work as well as I thought it did: it does find what I search for, but only sometimes. So how do we find what we are looking for? By grepping through the files. Using grep -i -r cpu . we can search a directory tree for the keyword cpu. But while grep does a wonderful job, it is kind of slow. So at that point I switched over to a tool called ripgrep, which does these searches a lot faster than plain old grep.

At some point, I found it very tiring to search for nodes, the reason being that I had to search for specific nodes without knowing their names or locations. This led to the creation of a script which parses a Device Tree while following includes of other Device Tree files, allowing me to search for nodes which have, for example, a certain attribute set. This script is also included in the "patch repository". However, it does not work perfectly: it finds most of the nodes, but not all of them. Still, it was sufficient for my searches.
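To give an idea of what such a search can look like, here is a minimal sketch. It is not the script from the repository. The names load_with_includes and find_nodes are made up for this example, and the parsing is deliberately naive: it assumes that braces and semicolons never appear inside string values, and it splices each included file in only once.

```python
import re
from pathlib import Path

# Matches `#include "file"` and `/include/ "file"`; <...> includes are ignored.
INCLUDE_RE = re.compile(r'^[ \t]*(?:#include|/include/)[ \t]+"([^"]+)"', re.MULTILINE)


def load_with_includes(path, seen=None):
    """Return the text of `path` with quoted includes spliced in recursively."""
    path = Path(path).resolve()
    seen = set() if seen is None else seen
    if path in seen:  # guard against include cycles
        return ""
    seen.add(path)

    def splice(match):
        target = path.parent / match.group(1)
        return load_with_includes(target, seen) if target.is_file() else ""

    return INCLUDE_RE.sub(splice, path.read_text())


def find_nodes(text, prop):
    """Return the paths of all nodes that set the property `prop`."""
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)  # block comments
    text = re.sub(r"//[^\n]*", "", text)                    # line comments
    stack, hits = [], []
    # Every statement ends in "{" (node opens), "}" (node closes) or ";".
    for match in re.finditer(r"([^{};]*)([{};])", text):
        # Keep only the part before "=", and drop labels such as "cpu0:".
        words = match.group(1).split("=")[0].replace(":", " ").split()
        end = match.group(2)
        if end == "{":
            stack.append(words[-1] if words else "?")
        elif end == "}":
            if stack:
                stack.pop()
        elif words and stack and words[-1] == prop:
            hits.append("/".join(stack).replace("//", "/"))
    return hits
```

With this, find_nodes(load_with_includes("board.dts"), "device_type") would list every node in a (hypothetical) board.dts and its includes that sets device_type.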

After finally having the basic nodes in my Device Tree, I started to port over all of the required nodes to enable the serial interface on the SoC. This was the next big mistake I made: I tried to do too much without verifying that the kernel even boots. This was also the point where I learned that the Device Tree by itself doesn't really do anything. It just tells the kernel what the SoC looks like so that the correct drivers can be loaded and initialized. So I knew that I had to port drivers from the downstream kernel into the mainline kernel. The kernel identifies the matching driver by looking at the data that the drivers expose.

[...]
static struct of_device_id ext_clk_match[] __initdata = {
        { .compatible = "samsung,exynos8890-oscclk", .data = (void *)0, },
};
[...]

This is an example from the clock driver of the downstream kernel. When the kernel is processing a node of the Device Tree it looks for a driver that exposes the same compatible attribute. In this case, it would be the Samsung clock driver.
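The lookup can be pictured with a small Python model. This is only a sketch of the idea, not how the kernel implements it; the driver names and the vendor,example-uart string are hypothetical, only samsung,exynos8890-oscclk is taken from the snippet above.

```python
# Hypothetical driver table: driver name -> compatible strings it exposes.
DRIVERS = {
    "exynos8890-clock": ["samsung,exynos8890-oscclk"],
    "example-uart": ["vendor,example-uart"],
}


def find_driver(node_compatibles, drivers=DRIVERS):
    """Return the first driver exposing one of the node's compatible strings."""
    # A node lists its compatible strings from most to least specific.
    for compatible in node_compatibles:
        for name, table in drivers.items():
            if compatible in table:
                return name
    return None  # no driver claims this node, so nothing gets initialized
```

If no driver exposes a matching string, the node is simply ignored, which is why a Device Tree without the corresponding drivers does nothing.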

So at this point I was wildly copying over driver code into the mainline kernel. As I forgot this during the porting attempt, I am mentioning my mistake again: I never thought about the possibility that the kernel would not boot at all.

After having "ported" the driver code for the clock and some other devices, I decided to try and boot the kernel. With my phone plugged into the serial adapter, my terminal showed nothing. So I went into the S-Boot console to poke around. There I tried some commands in the hope that the bootloader would initialize the hardware for me so that the kernel would magically boot and give me serial output. One was especially interesting at that time: The name made it look like it would test whether the processor can do SMP, that is Symmetric Multiprocessing, where several identical CPU cores run under one operating system (not to be confused with Intel's Hyper-Threading or AMD's SMT, which run several hardware threads on a single core). When I continued booting afterwards, I got some output via the serial interface! It was garbage data, but it was data. This gave me some hope. However, it was just some data that was pushed by something other than the kernel. I checked this hypothesis by installing the downstream kernel, issuing the same commands and booting the kernel.

Back To The Drawing Board

At this point I was kind of frustrated. I knew that this endeavour was going to be difficult, but I immensely underestimated it.

After taking a break, I went back to my computer with a new tactic: Port as few things as possible, confirm that it boots and then port the rest. This was inspired by the way the Galaxy Nexus was mainlined in this blog post.

What did I do this time? The first step was a minimal Device Tree. No clock nodes. No serial nodes. No GPIO nodes. Just the CPU, the memory and a chosen node. After setting the CONFIG_PANIC_TIMEOUT option to 5, waiting at least 15 seconds and seeing no reboot, I thought that the phone had booted the mainline kernel. But before getting too excited, keeping in mind how difficult this endeavour was, I asked in postmarketOS' mainline Matrix channel whether it could happen that the phone panics and still does not reboot. The answer I got was that it could, indeed, happen. It seems like the CPU does not know how to shut itself off. On the x86 platform, this is the task of ACPI, while on ARM, PSCI, the Power State Coordination Interface, is responsible for it. Since the mainline kernel knows about PSCI, I wondered why my phone did not reboot. After some thinking, I came up with three possibilities:

1. The kernel boots just fine and does not panic. Hence no reboot.
2. The kernel panics and wants to reboot but the PSCI implementation in the downstream kernel differs from the mainline code.
3. The kernel just does not boot.

The first possibility I threw out of the window immediately. It was just too easy. As such, I began investigating the PSCI code. Out of curiosity, I looked at the implementation of the emergency_restart function of the kernel and discovered that the function arm_pm_restart is used on arm64. Looking deeper, I found out that this function is only set when the Device Tree contains a PSCI node of a supported version. The downstream node is compatible with version 0.1, which does not support the SYSTEM_RESET functionality of PSCI. Since I could turn off or restart the phone just fine when using Android or postmarketOS, I knew that there had to be something that works around the old firmware.

The downstream PSCI node just specifies that it is compatible with arm,psci, so how do I know that it is only firmware version 0.1 and how do I know of this SYSTEM_RESET?

If we grep for the compatible attribute arm,psci, we find it as the value of a compatible field in the source file arch/arm64/kernel/psci.c. There, the exact string arm,psci results in a call to the function psci_0_1_init, which indicates version 0.1 of PSCI. If we take a look at ARM's PSCI documentation, we find a section called "Changes in PSCIv0.2 from first proposal", which states that the SYSTEM_RESET call was added in version 0.2. Hence we can guess that the Exynos8890 SoC comes with firmware which only supports version 0.1 of PSCI.
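This chain of reasoning can be written down as a small lookup. The sketch below is a model in Python, not kernel code. The entries for arm,psci-0.2 and arm,psci-1.0 and their init function names are what I remember from the mainline match table and are not taken from the text above, so check them against the kernel version you are working with. The helper names are made up.

```python
# compatible string -> init function, modelled after the mainline match table.
# Only the "arm,psci" entry is confirmed above; the others are assumptions.
PSCI_INIT = {
    "arm,psci": "psci_0_1_init",
    "arm,psci-0.2": "psci_0_2_init",
    "arm,psci-1.0": "psci_1_0_init",
}


def psci_init_for(compatibles):
    """Return the init function for the first known compatible string."""
    for compatible in compatibles:
        if compatible in PSCI_INIT:
            return PSCI_INIT[compatible]
    return None


def supports_system_reset(compatibles):
    """SYSTEM_RESET only exists from PSCI 0.2 onwards."""
    init = psci_init_for(compatibles)
    return init is not None and init != "psci_0_1_init"
```

For the downstream node, supports_system_reset(["arm,psci"]) is False, which matches the observation that the kernel has no PSCI way to restart this SoC.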

After a lot of searching, I found a node called reboot in the downstream source. The compatible driver for it is within the Samsung SoC driver code.

Effectively, this code reboots the SoC by mapping the address of the PMU, which I guess stands for Power Management Unit, into memory and writing some value to it. This value is probably the command which tells the PMU to reset the SoC. In my "patchset" patches_v2 I have ported this code. When I tested it with the downstream kernel, it made the device do something. Although it crashed the kernel, it was enough to debug.
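The pattern itself is simple and can be simulated in a few lines of Python. This is only a model of "map a register window, write a command value". The window size, register offset and value below are hypothetical placeholders, not the values from the downstream driver, and a bytearray stands in for the mapped registers.

```python
import struct

PMU_SIZE = 0x1000        # hypothetical size of the mapped PMU register window
SWRESET_OFFSET = 0x400   # hypothetical offset of the reset register
SWRESET_VALUE = 0x1      # hypothetical "reset the SoC" command


def request_reset(pmu):
    """Write the reset command into the (simulated) PMU register window."""
    # One 32-bit little-endian write, like a single register store.
    struct.pack_into("<I", pmu, SWRESET_OFFSET, SWRESET_VALUE)


pmu_window = bytearray(PMU_SIZE)  # stand-in for the mapped PMU registers
request_reset(pmu_window)
```

On the real hardware, that single write is what would make the PMU reset the SoC; here it only changes four bytes in the buffer.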

To test the mainline kernel, I added an emergency_restart at the beginning of the start_kernel function. The result was that the device did not do anything. The only option I had left was 3; the kernel does not even boot.

At this point I began investigating the arch/arm64/ code of the downstream kernel more closely. However, I noticed something unrelated during a kernel build: The downstream kernel logs something with FIPS at the end of the build. Grepping for it led me to some code at the end of the link-vmlinux.sh script. I thought that it was signing the kernel with a key in the repo, but it is probably doing something else. I tested whether the downstream kernel boots without these crypto scripts, and it did.

The only thing I did not test was whether the kernel boots without "double-checking [the] jopp magic". But while looking at this script, I noticed another interesting thing: CONFIG_RELOCATABLE_KERNEL. Having just a rough idea of what this config option enables, I removed it from the downstream kernel and tried to boot. The kernel did not boot, which meant that this option is required for booting the kernel. This was the only success I can report.

By grepping for this config option, I found the file arch/arm64/kernel/head.S. I did not know what it was for, so I searched the internet and found a thread on Stack Overflow that explained that the file is prepended onto the kernel and executed before start_kernel. I mainly investigated this file, but in hindsight I should have also looked more at the other occurrences of the CONFIG_RELOCATABLE_KERNEL option.

So what I did was try to port over code from the downstream head.S into the mainline head.S. This is the point where I am at now. I did not progress any further as I am not used to assembly code in general, let alone ARM assembly, but I still came up with some more hypotheses as to why the kernel does not boot:

1. For some reason the CPU never reaches the instruction to jump to start_kernel.
2. The CPU fails to initialize the MMU or some other low-level component and thus cannot jump into start_kernel.

At the moment, option 2 seems the most likely, as the code of the downstream kernel and the mainline kernel differs somewhat, and I expect that Samsung added some code because their MMU might have some quirks that the mainline kernel does not address. However, I have not had the chance to either confirm or refute any of these assumptions.

As a bottom line, I can say that the most useful, but in my case most ignored, thing I learned is patience. During the entire porting process I tried to do as much as I could in the shortest amount of time possible. However, I quickly realized that I got the best ideas when I was doing something completely different. As such, I also learned that it is incredibly useful to always have a piece of paper or a text editor handy to write down any ideas you might have. You never know what might be useful and what not.

I also want to mention that I used the Bootlin Elixir Cross Referencer a lot. It is a very useful tool for exploring the kernel source tree. However, I would still recommend keeping a local copy so that you can very easily grep through the code and find things that neither GitHub nor Elixir can find.