RubySource: Code Safari: Underscore Madness

Last week I came across an excellent presentation from the Scotland Ruby Conference: “Literary Criticism for the Idle Programmer” by Roland Swingler. It introduced me to a crazy little Ruby script that allows you to write programs entirely in underscores!

# hello.rb
require "_"
____ _ _____ ____ __ ____ ____ __ ___ ____ __ __ _ ______
_____ ___ _ _ ___ _____ ______ ____ _ _ ____ _ _ ____ _
____ __ __ ___ _ ______ ___ ____ __ ______ ____ _ ____ ____
__ _ ____ _ _ ___ _____ _____ _ ______ ____ _ ______ _____

That’s a hello world app, right there. I scarcely believed it either. Roland steps through briefly how this code works in his talk, but given our previous obfuscation adventures, I wanted to dive in a little deeper.

A bird’s eye view

We are going to work on a copy of the library from git, so that we can easily play around with changing it. Clone it, and try running the above script:

 cd /tmp
 git clone https://github.com/mame/_
 ruby -I _/lib hello.rb
Hello, world!

Hello world indeed! The -I option to Ruby adds a new directory to the load path, allowing the require "_" line to successfully find _.rb inside the git repository. With our test harness operational, let’s open up the code to see what madness lies within.

# _/lib/_.rb
def __script__(src)
  code = []
  src = src.unpack("C*").map {|c| c.ord.to_s(6).rjust(3, "0").chars.to_a }
  src.flatten(1).map {|n| n.to_i(6) + 1 }.each do |n|
    code.empty? || code.last.size + n + 1 >= 60 ? code << "" : code.last << " "
    code.last << "_" * n
  end
  ([%q(require "_")] + code).join("\n")
end

$code, $fragment = [], []
def method_missing(mhd, *x)
  if x.empty?
    $code.concat($fragment.reverse)
    $fragment.clear
  end
  $fragment << (mhd.to_s.size - 1).to_s
end
at_exit do
  $code.concat($fragment.reverse)
  eval($code.join.scan(/.../).map {|c| c.to_i(6) }.pack("C*"))
end

Ignoring the __script__ function, which is a helper for creating source files of underscores, there are only 12 lines that do the real work. To understand them, we need to be familiar with a few interesting Ruby concepts.

The first is method_missing, a built-in hook that Ruby calls on an object when it receives a call to a method that neither its class nor any of its ancestors defines. In the case of underscore, method_missing is defined at the top level, which in Ruby is still executed within the context of an object:

 ruby -e 'puts self.class'
Object

To illustrate, we can write a script that echoes back any method we call:

# echo.rb
def method_missing(method_name)
  puts method_name.to_s
end
hello
world

The second concept to understand is method naming in Ruby. Traditionally you define methods with letters, digits and underscores, but this is not a built-in limitation of Ruby. As underscore demonstrates, a method name made solely of underscores is valid, and you can go even further by defining methods with spaces in them! Of course, the parser won’t let you call them with the usual syntax, but we can work around that with send:

class Greeter
  define_method("say hello") do
    puts "hello"
  end
end
Greeter.new.send("say hello")

The last important concept is at_exit. From the Ruby documentation:

Converts block to a Proc object (and therefore binds it at the point of call) and registers it for execution when the program exits. If multiple handlers are registered, they are executed in reverse order of registration.

This is what testing frameworks such as Test::Unit and RSpec typically use to automatically run tests after executing a file.
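The reverse ordering is easy to check for yourself. The sketch below is only an illustration: it runs a made-up child script in a separate Ruby process (using RbConfig.ruby and IO.popen, so that the handlers fire when the child exits) and captures what it prints. The script text and its messages are assumptions for the demo, not anything from underscore.

```ruby
require "rbconfig"

# A throwaway child script: two handlers registered in order, then the main body.
child = <<~'RUBY'
  at_exit { puts "registered first, runs last" }
  at_exit { puts "registered second, runs first" }
  puts "main body"
RUBY

# Run the child with the same Ruby interpreter and collect its standard output.
output = IO.popen([RbConfig.ruby, "-e", child], &:read)
puts output
# main body
# registered second, runs first
# registered first, runs last
```

The main body finishes before any handler runs, which is exactly why underscore can collect every line of the program first and only eval the result at the very end.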

Piecing it together

We can now understand the general structure of underscore: use method_missing to handle calls to methods named with any number of underscores, accumulate digit codes in the $code and $fragment variables, then execute the decoded result when the program exits. Let’s get an idea of the codes it is accumulating by adding a puts statement just before the eval statement. We can do this easily since we cloned the code from git and added it to our load path; this kind of hacking at libraries is harder to do (though still possible!) when working with gems.

at_exit do
  $code.concat($fragment.reverse)
  puts $code.inspect
  eval($code.join.scan(/.../).map {|c| c.to_i(6) }.pack("C*"))
end

This shows a list of numbers, generated by the mhd.to_s.size - 1 call in method_missing:

 ruby -I _/lib hello.rb
["3", "0", "4", "3", "1", "3" # and so on...

How do we get from here to code? There are a few steps in the conversion process, which we will address in turn. irb allows us a quick and easy way to step through the process and visualize the transformations:

irb> input = ["3", "0", "4", "3", "1", "3"]
=> ["3", "0", "4", "3", "1", "3"]
irb> input = input.join
=> "304313"
irb> input = input.scan(/.../)
=> ["304", "313"]
irb> input = input.map {|c| c.to_i(6) }
=> [112, 117]
irb> input.pack("C*")
=> "pu" # first two characters of our source code

There are a few tricky calls in there. scan can be a bit confusing if you haven’t seen it before, especially since that regex doesn’t look familiar. Once again, referring to the documentation:

iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block.

We don’t have a block, so it just adds to the result. The regex /.../ matches three characters, so this is how it segregates the input into groups of three.

The to_i call is interesting: what is that 6 parameter?

Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36).
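To make the arithmetic concrete, here is a small sketch of the conversion in both directions. The helper names triple_to_byte and byte_to_triple are hypothetical, introduced only for this illustration; the library itself does the same work inline.

```ruby
# Hypothetical helpers, named here only for illustration.

# Read a three-digit string as a number in the given base.
def triple_to_byte(triple, base = 6)
  triple.to_i(base)
end

# Write a byte value as a zero-padded three-digit string in the given base.
def byte_to_triple(byte, base = 6)
  byte.to_s(base).rjust(3, "0")
end

p triple_to_byte("304")  # 3*36 + 0*6 + 4 = 112, the ASCII code for "p"
p triple_to_byte("313")  # 3*36 + 1*6 + 3 = 117, the ASCII code for "u"
p byte_to_triple(32)     # a space: 0*36 + 5*6 + 2, so "052"
```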

So it’s interpreting the number as base 6, but why is this necessary? Let’s see what happens if we change it to base 10 (note the base parameter added to the method signature).

# _/lib/_.rb
def __script__(src, base = 6)
  code = []
  src = src.unpack("C*").map {|c| c.ord.to_s(base).rjust(3, "0").chars.to_a }
  puts src.inspect
  src.flatten(1).map {|n| n.to_i(base) + 1 }.each do |n|
    code.empty? || code.last.size + n + 1 >= 60 ? code << "" : code.last << " "
    code.last << "_" * n
  end
  ([%q(require "_")] + code).join("\n")
end

 ruby -r ./_/lib/_ -e 'puts __script__("puts \"Hello, world!\"", 10)'
require "_"
__ __ ___ __ __ ________ __ __ _______ __ __ ______ _ ____
___ _ ____ _____ _ ________ ___ __ _ __ __ _ _________ __ _
_________ __ __ __ _ _____ _____ _ ____ ___ __ __
__________ __ __ __ __ __ _____ __ _ _________ __ _ _ _
____ ____ _ ____ _____

Decoding this works fine once the to_i(6) call in the at_exit block is changed to to_i(10) to match, so it would appear that base 6 was chosen just for aesthetic reasons.

The last stage in the process is the mysterious pack method, with its inscrutable argument 'C*'. pack translates an array into a binary string, using single-letter codes to describe the type of each element, such as different kinds of integers and strings. C maps to “8-bit unsigned integer (unsigned char)”, and * means convert all remaining elements using this mapping. Since our numbers are the ASCII codes of the original program’s characters, this turns the array back into source code that can be passed to eval. pack is hard to explain in words; your best bet is to play around with it in irb.
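Putting the stages together, the whole decoding pipeline can be sketched as one small function. The name decode is hypothetical (the library performs these steps inline in its at_exit block), and the last two lines simply show pack and its inverse, unpack, on ASCII codes.

```ruby
# decode is a hypothetical name; the library does this inline in at_exit.
def decode(digits, base = 6)
  digits.join                               # "304313"
        .scan(/.../)                        # ["304", "313"]
        .map { |triple| triple.to_i(base) } # [112, 117]
        .pack("C*")                         # "pu"
end

p decode(%w[3 0 4 3 1 3])           # => "pu"
p([112, 117, 116, 115].pack("C*"))  # => "puts"
p "puts".unpack("C*")               # => [112, 117, 116, 115]
```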

We made it to the end, but we jumped over one other important branch in method_missing. Here it is again for reference:

$code, $fragment = [], []
def method_missing(mhd, *x)
  if x.empty?
    $code.concat($fragment.reverse)
    $fragment.clear
  end
  $fragment << (mhd.to_s.size - 1).to_s
end

What is the x.empty? check doing? This has to do with the order in which methods are called in Ruby. Chaining multiple method calls together without explicit parentheses is generally bad form, but Ruby supports it. Each call is parsed as the argument of the call to its left, and arguments are evaluated before the call that receives them, so the following two lines are equivalent:

method1 method2 method3
method1(method2(method3))

What this means though is that method3 is executed first, which gives us a back-to-front ordering. This is where the $fragment.reverse call comes in handy, so that the underscored version can be in the same order as the code it represents. The x.empty? is simply a way of detecting the end of a line — underscore doesn’t use arguments for anything else, and method3 is the only one of the above methods that doesn’t have any.
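The ordering can be observed directly with a small recorder. This is only a sketch: Recorder is a hypothetical class for this demonstration, and it uses instance_eval so that the experiment does not redefine method_missing at the top level the way underscore does.

```ruby
# Recorder is a hypothetical helper that logs every unknown method call
# together with the number of arguments it received.
class Recorder
  attr_reader :calls

  def initialize
    @calls = []
  end

  def method_missing(name, *args)
    @calls << [name, args.size]
    name
  end

  def respond_to_missing?(name, include_private = false)
    true
  end
end

recorder = Recorder.new
recorder.instance_eval { method1 method2 method3 }
p recorder.calls
# => [[:method3, 0], [:method2, 1], [:method1, 1]]
```

Only the rightmost call arrives with zero arguments, which is the property underscore relies on to spot the start of a new line.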

One other thing you may have noticed is that there is no actual check for underscores here. It looks like we could use any characters we choose to write our obfuscated program! This excites the kid in me who wrote hidden letters in lemon ink.

require '_'
look a poem!
a monkey ark
a baboon yacht too _
a big aquian din!
a animal party

But what is the hidden message? That is for you to discover! Here are some other things you could try:

Base 6 encoding was shorter than base 10, but is it the shortest? Write a program to calculate the base that gives the shortest output for a given program.
Underscore was written on Ruby 1.8.7, and there’s a bug on 1.9.2 with the input “puts 1”. Can you find and fix it?
More poems!

Share how you go in the comments. Join me next week for more adventures in the code jungle.