Yak Shaving Day: Rails Asset Precompilation to Kernel Debugging
Yak shaving is usually a huge pain in the ass. Sometimes, though, you can learn from the experience. Based on an actual bug I was recently asked to take a look at, I'm going to take you from an issue that exhibited as JavaScript not loading in the browser, down to Linux kernel debugging over virtualized serial ports.
This is of course, insane. If you ever have to really debug a broken Rails app
by running gdb
against the kernel, I'm highly interested in whatever awful
thing you did to deserve such a fate. Think about your life choices really hard.
Did you kick a kid back in kindergarten? He's probably homeless now, all due to
that one transformative event that set his life on the wrong path. Dick.
The Bug
A staging setup consisting of two machines with the exact same revision of an
app checked out were generating precompiled assets
that ended up with different hashes. Cries of “host your assets on a CDN”
notwithstanding, this was rather unexpected. The issue manifested itself as a
page occasionally not working, as the <script>
tag served up by one instance
of the application referenced a URL that 404d on another instance.
If you'd like to follow along, I've created a Rails template that sets up an application that might break in the same way. Due to the nature of the issue, I can't guarantee you'll actually be able to reproduce it. I was able to do it using two different Linodes I happened to have access to running different versions of Ubuntu with different kernels, but in the environment this actually happened in, both instances had the same kernel and some OS version. Your mileage, as usual, may vary. You can generate a new Rails app from the template like so:
$ rails new testapp -m https://gist.github.com/gabrielg/b68ea622f6ade2b6a99f/raw/4826ca1ab4725da89e079df0b137846b476d4aa6/recipe.rb
Here are two instances I prepared earlier. Each has the exact same copy of the
application, running on the same version of Ruby. It's immediately obvious that
application.js
is being precompiled and ending up with different checksums on
each machine:
# Instance A
ggironda@athena:~/testapp$ bundle exec rake assets:precompile
/home/ggironda/.rbenv/versions/1.9.3-p392/bin/ruby /home/ggironda/.rbenv/versions/1.9.3-p392/bin/rake assets:precompile:all RAILS_ENV=production RAILS_GROUPS=assets
ggironda@athena:~/testapp$ ls public/assets/application*.js
public/assets/application-d948590a01ac46e5fddbbc0e842b9d3e.js public/assets/application.js
# Instance B
ggironda@annealer:~/testapp$ bundle exec rake assets:precompile
/home/ggironda/.rbenv/versions/1.9.3-p392/bin/ruby /home/ggironda/.rbenv/versions/1.9.3-p392/bin/rake assets:precompile:all RAILS_ENV=production RAILS_GROUPS=assets
ggironda@annealer:~/testapp$ ls public/assets/application*.js
public/assets/application-df1f53d685f287055687bc16383bb856.js public/assets/application.js
The JavaScript has been minified, so running diff
against the two files won't
really help with figuring out what's different. Beautifying both files with a
tool such as jsbeautify
first allows for a useful diff to be produced, giving
the first bit of insight into what the problem is:
ggironda@moya /tmp/assets » diff application-d948590a01ac46e5fddbbc0e842b9d3e.js application-df1f53d685f287055687bc16383bb856.js
3642a3643,3645
> hr: {
> hello: "hello world from hr"
> },
3649,3654d3651
< "es-AR": {
< hello: "hello world from es-AR"
< },
< hr: {
< hello: "hello world from hr"
< },
3662a3660,3662
> },
> "es-AR": {
> hello: "hello world from es-AR"
Looks like localization data, right? The app is using a gem called i18n-js to expose translations in the Rails app to the client side. The portions of the two files that are different are the JS representations of these localizations—it's the exact same data though, just in different order. While this works just fine when the JavaScript is loaded in the browser, it's the cause of the different hashes.
Flying Backwards
I'm going to unwind different parts of this mess really fast. I'll leave the choice of the Benny Hill Theme or Sabre Dance as audio accompaniment up to you, dear reader.
The point at which the JavaScript related to all of this i18n stuff is first
pulled into the asset pipeline is via two Sprockets
require lines in app/assets/javascripts/application.js
. One pulls in the
i18n
JS library itself, the other pulls in the translation data:
//= require i18n
//= require i18n/translations
i18n/translations
references an ERB template
in the i18n-js
gem itself.
I18n.translations = <%=
SimplesIdeias::I18n.translation_segments.each_with_object({}) do |(name, segment),translations|
translations.merge!(segment)
end.to_json
%>;
SimplesIdeias::I18n.translation_segments
is a Hash
, and in this dummy
application I'm working against, it only has one key—“app/assets/javascripts/i18n/translations.js”.
The value corresponding to this key is another Hash
. Checking what the keys
are on my two instances of the app yields more clues:
# Instance A
irb(main):019:0> SimplesIdeias::I18n.translation_segments["app/assets/javascripts/i18n/translations.js"].keys
=> [:en, :"en-AU", :"en-US", :"es-AR", :hr, :ko, :pl, :ru]
# Instance B
irb(main):018:0> SimplesIdeias::I18n.translation_segments["app/assets/javascripts/i18n/translations.js"].keys
=> [:en, :hr, :"en-AU", :"en-US", :ko, :pl, :ru, :"es-AR"]
In Ruby 1.9, insertion order for a Hash
is preserved,
unlike Ruby 1.8. Anything that iterates over a Hash
will have the keys and
values yielded in the order they were inserted. It stands to reason that this
might have an effect on the order of keys in the JSON output generated from a
Hash
, too. This can be quickly confirmed as the behaviour of the JSON library
that ships with Ruby 1.9:
irb(main):020:0* require "json"
=> false
irb(main):021:0> hash_a = {}; hash_a[:a] = true; hash_a[:b] = true
=> true
irb(main):022:0> hash_b = {}; hash_b[:b] = true; hash_b[:a] = true
=> true
irb(main):023:0> puts JSON.dump(hash_a)
{"a":true,"b":true}
=> nil
irb(main):024:0> puts JSON.dump(hash_b)
{"b":true,"a":true}
=> nil
irb(main):025:0> hash_a == hash_b
=> true
My word. Two ostensibly equivalent hashes, but JSON.dump
is generating
different output. What a topsy-turvy world.
Time to figure out why the keys are in a different order in the first place. The
translations are loaded in i18n-js
in a kinda gross bit of instance_eval
hackery:
::I18n.backend.instance_eval do
init_translations unless initialized?
translations
end
The Simple
I18n
backend that's in use mixes in the Base
backend module,
which loads the translation files in order
from I18n.load_path
:
def load_translations(*filenames)
filenames = I18n.load_path if filenames.empty?
filenames.flatten.each { |filename| load_file(filename) }
end
In Rails 3.2.13, I18n.load_path
is set during the application initialization
process through i18n_railtie.rb
, a part of ActiveSupport:
when :load_path
I18n.load_path += value
else
According to the documentation in config/application.rb
of a fresh Rails app,
everything from config/locales/*.rb,yml
is auto loaded. This application,
however, has changed the default load path so that localizations in
subdirectories are loaded as well. As of writing, it's the recommended change to make according to the Rails Guides:
config.i18n.load_path += Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s]
Checking the return value of that Dir.[]
call on the two instances of the app
shows that the files are returned in different order—an order that looks
suspiciously like the order of the keys in the translation Hash
:
# Instance A. Full array returned by Dir.[] omitted for brevity
irb(main):024:0> pp Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s][0..1]
["/home/ggironda/testapp/config/locales/en.yml",
"/home/ggironda/testapp/config/locales/models/en-AU.yml"]
# Instance B. Full array returned by Dir.[] omitted for brevity
irb(main):033:0> pp Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s][0..1]
["/home/ggironda/testapp/config/locales/en.yml",
"/home/ggironda/testapp/config/locales/models/hr.yml"]
When sorted, the arrays are the same. Great, now what?
The Fix
Well, the quick fix is straightforward: make sure that the collection of
translations is sorted somehow. The easiest way is to just sort the paths
returned by the Dir.[]
call:
config.i18n.load_path += Dir[Rails.root.join('config', 'locales', '**', '*.{rb,yml}').to_s].sort
This results in the Hash
of translations kept internally by the i18n
backend
being sorted the same everywhere.
A longer term fix depends on your perspective:
JSON
could ignore order in aHash
, and output in a consistent order. There are a lot of JSON libraries for Ruby though, like YAJL, and they'd all need to change as well.- The
i18n
gem itself can keep translations sorted a consistent way internally. - The
i18n-js
gem can handle the sorting problem itself as well. - Sorting the paths returned from the
Dir.[]
call might be a perfectly acceptable long term fix. - Just serve static assets from some kind of CDN with a consistent filename.
If we were just interested in a fix, this post could end here. That's boring. Let's go spelunking.
Into The Gaping Maw
Dir.[]
isn't written in Ruby. It's implemented in Ruby's C source
as a function called dir_s_aref
.
This is a common quandary folks have with working on Ruby itself—a lot of
the time, hacking on parts of the language that aren't the VM still involves
knowing more C than it does Ruby. If you're truly, madly, deeply curious about
how it's done in other Ruby implementations:
- More details on JRuby's implementation can be found in RubyDir.java and Dir.java.
- Poking around Rubinius leads to dir.rb and dir_glob.rb. This is one of the goals of Rubinius—to implement Ruby in Ruby as much as possible.
We're dealing with MRI here though, so
onwards with that. dir_s_aref
eventually ends up calling a function called
glob_helper
. Doxygen made me a call
graph to help illustrate this. It's nice like that.
glob_helper
ends up calling whatever the READDIR
macro expands to in a loop,
as you may expect for code that is recursively searching directories. What the
READDIR
macro expands to is dependent on your system.
Regardless of that, we can use strace
on Linux (or dtruss
on systems
that support DTrace, such as OS X and FreeBSD) to figure out what syscalls end
up getting invoked.
Running code under strace
that performs the same glob without the rest of
Rails in the way reveals more of what's going on underneath the covers—calls
to getdents
followed by calls to lstat
for every file.
ggironda@annealer:~/testapp$ ruby -rpathname -e 'puts Process.pid; sleep(10); Dir[Pathname(".").join("config", "locales", "**", "*.{rb,yml}").to_s][0..1]'
5987
^Z
[1]+ Stopped ruby -rpathname -e 'puts Process.pid; sleep(10); Dir[Pathname(".").join("config", "locales", "**", "*.{rb,yml}").to_s][0..1]'
ggironda@annealer:~/testapp$ strace -q -f -e trace=desc,file -p 5987
[pid 6007] --- SIGTSTP (Stopped) @ 0 (0) ---
[pid 5987] --- SIGTSTP (Stopped) @ 0 (0) ---
[pid 6007] select(4, [3], NULL, NULL, NULL <unfinished ...>
[pid 5987] openat(AT_FDCWD, "config/locales", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
[pid 5987] getdents(5, /* 5 entries */, 32768) = 144
[pid 5987] lstat("config/locales/en.yml", {st_mode=S_IFREG|0644, st_size=214, ...}) = 0
[pid 5987] lstat("config/locales/models", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 5987] openat(AT_FDCWD, "config/locales/models", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 5987] getdents(6, /* 9 entries */, 32768) = 272
[pid 5987] lstat("config/locales/models/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/models/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/models/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] getdents(6, /* 0 entries */, 32768) = 0
[pid 5987] close(6) = 0
[pid 5987] lstat("config/locales/views", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 5987] openat(AT_FDCWD, "config/locales/views", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 5987] getdents(6, /* 9 entries */, 32768) = 272
[pid 5987] lstat("config/locales/views/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/views/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/views/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] getdents(6, /* 0 entries */, 32768) = 0
[pid 5987] close(6) = 0
[pid 5987] getdents(5, /* 0 entries */, 32768) = 0
[pid 5987] close(5) = 0
[pid 5987] openat(AT_FDCWD, "config/locales", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
[pid 5987] getdents(5, /* 5 entries */, 32768) = 144
[pid 5987] lstat("config/locales/en.yml", {st_mode=S_IFREG|0644, st_size=214, ...}) = 0
[pid 5987] lstat("config/locales/models", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 5987] openat(AT_FDCWD, "config/locales/models", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 5987] getdents(6, /* 9 entries */, 32768) = 272
[pid 5987] lstat("config/locales/models/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/models/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/models/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/models/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] getdents(6, /* 0 entries */, 32768) = 0
[pid 5987] close(6) = 0
[pid 5987] lstat("config/locales/views", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 5987] openat(AT_FDCWD, "config/locales/views", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 5987] getdents(6, /* 9 entries */, 32768) = 272
[pid 5987] lstat("config/locales/views/hr.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/en-AU.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/views/en-US.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] lstat("config/locales/views/ko.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/pl.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/ru.yml", {st_mode=S_IFREG|0644, st_size=33, ...}) = 0
[pid 5987] lstat("config/locales/views/es-AR.yml", {st_mode=S_IFREG|0644, st_size=39, ...}) = 0
[pid 5987] getdents(6, /* 0 entries */, 32768) = 0
[pid 5987] close(6) = 0
[pid 5987] getdents(5, /* 0 entries */, 32768) = 0
[pid 5987] close(5) = 0
[pid 5987] open("/proc/self/maps", O_RDONLY|O_CLOEXEC) = 5
[pid 5987] fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
[pid 5987] read(5, "00400000-0062c000 r-xp 00000000 "..., 1024) = 1024
[pid 5987] read(5, " /home/ggironda/.rbenv/ver"..., 1024) = 1024
[pid 5987] read(5, "/1.9.1/x86_64-linux/enc/encdb.so"..., 1024) = 1024
[pid 5987] read(5, "/libm-2.15.so\n7fb3f7d2f000-7fb3f"..., 1024) = 1024
[pid 5987] read(5, "6_64-linux-gnu/libdl-2.15.so\n7fb"..., 1024) = 1024
[pid 5987] read(5, "\n7fb3f878d000-7fb3f878e000 rw-p "..., 1024) = 960
[pid 5987] close(5) = 0
[pid 5987] write(4, "!", 1) = 1
[pid 6007] <... select resumed> ) = 1 (in [3])
[pid 6007] read(3, "!", 1024) = 1
ggironda@annealer:~/testapp$
Giving a brief thought to how one could implement a simple recursive search of directories, this makes sense:
- Start at a directory.
- Get all of its entries.
- Keep track of any paths that match a pattern, such as
*.rb
. - Descend into any directories in the list.
- Repeat.
Running this on the two different instances I have set up shows the calls
happening in different order. It turns out that the getdents
system call
(and by extension, readdir(3)
) makes
no guarantees as to the order directory entries are returned in. Assuming that
it will return them in any specific order at all will probably land you in hot
water. If you really want 'em sorted, you're just going to have to do it yourself.
We can go further. Pull up your socks and follow me.
Now This Is Just Getting Ridiculous
The getdents
system call in Linux 3.5.0 is defined in readdir.c
.
You may notice it calls out to a vfs_readdir
function defined in the same file,
which eventually calls a function pointed to by some readdir
pointer in a struct.
# Excerpt of the getdents syscall
buf.current_dir = dirent;
buf.previous = NULL;
buf.count = count;
buf.error = 0;
error = vfs_readdir(file, filldir, &buf);
if (error >= 0)
error = buf.error;
lastdirent = buf.previous;
# Excerpt of vfs_readdir()
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
res = file->f_op->readdir(file, buf, filler);
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
What function is it calling? Well, let's use gdb
on the kernel itself to find
out. Ubuntu ships with support for the kernel debugger built in,
but it needs to be enabled at boot time. I did this by adding a new GRUB
menu
entry in /etc/grub.d/40_custom
. Mind you, it's been a while since I had to
configure a bootloader, so this is mostly copied and pasted then tweaked as
necessary:
menuentry 'Ubuntu With KGDB' {
recordfail
gfxmode $linux_gfx_mode
insmod gzio
insmod part_msdos
insmod ext2
set root='hd0,msdos1'
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9
else
search --no-floppy --fs-uuid --set=root ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9
fi
linux /vmlinuz root=UUID=ea92d8ea-4d0a-4c67-8b69-ac8f0f9f7ee9 ro kgdboc=ttyS1,115200 kgdbwait
initrd /initrd.img
}
The kernel debugger is configured to use the serial port ttyS1
, at a baud rate
of 115200, and to drop to the debugger and wait on boot. “Wait, what serial
port? I'm running this in a virtual machine.”, you ask? Right.
VMWare Fusion
allows for emulation of serial ports. On my Mac, the VMs are kept in ~/Documents/Virtual Machines
.
Each VM in there is actually a directory, containing a .vmx
file that
specifies virtual machine configuration. My serial port configuration
for the machine that's going to be debugged looks like so:
ggironda@moya ~ » grep serial1 ~/Documents/Virtual\ Machines.localized/Ubuntu\ 64-bit.vmwarevm/Ubuntu\ 64-bit.vmx
serial1.present = "TRUE"
serial1.fileType = "pipe"
serial1.fileName = "/private/tmp/ubuntu_vmware_serial_port"
You'll also need another VM to run gdb
itself on. The configuration for that
one is ever so slightly different:
ggironda@moya ~ » grep serial1 ~/Documents/Virtual\ Machines.localized/Ubuntu\ 64-bit\ KGDB\ Client.vmwarevm/Ubuntu\ 64-bit.vmx
serial1.present = "TRUE"
serial1.fileType = "pipe"
serial1.pipe.endPoint = "client"
serial1.fileName = "/private/tmp/ubuntu_vmware_serial_port"
To be able to do anything useful, you'll need debug symbols for the version of
the kernel you're debugging. I built my own,
but you can also install a package via apt-get.
Once that's done, you're ready to fire up both VMs. Going into exact operation
of kgdb
here is a bit much, but you can find some good writing
on the subject.
Once the machines are booted and a remote debugging session has started, getting
more answers involves using gdb
much like normal.
Note: Executing ls itself may not have even triggered the breakpoint. Any number of other things, like the shell looking in $PATH may have resulted in getdents getting called. Either way, breakpoint triggered.
Setting a breakpoint in vfs_readdir
lets us poke around in the kernel's guts.
It turns out the brunt of the work of getdents
in our case is handed off to ext4_readdir
.
The filesystem code clearly needs to be able to get directory entries. Going
any further than this would really stretch your attention. At this point, I'm
satisfied knowing that the filesystem code is responsible for the order
directory entries are returned in, and that you shouldn't safely assume that
order will be consistent in any way.
Get Me Out Of Here
I went from “this web app doesn't work in a user's browser because of a 404 on
the JS” back down to “let's get crazy and debug the kernel over an emulated
serial port”, all as part of investigating the one issue. Clearly overkill for
fixing the problem—when working on it, I added the .sort
and went on
with my life. It just goes to show when a little more curiosity is involved, you
can learn a whole lot even from some of the most easy to fix bugs.
If you enjoyed this, you'll probably also enjoy my posts on reverse engineering Vine.