Created with Sleep Cycle iPhone app.
Hey, I'm Red Davis, a programmer from Bath, England. I'm currently working for Green Thing and have an interest in machine learning.
Don’t worry about diplomas or degrees; just get so good that no one can ignore you.
— James Bach
I believe that everything happens for a reason. People change so that you can learn to let go, things go wrong so that you appreciate them when they’re right, you believe lies so you eventually learn to trust no one but yourself, and sometimes good things fall apart so better things can fall together.

If you use a Mac for development, I highly recommend Visor. Instead of tabbing back to Terminal or flicking between spaces, just press a shortcut and Terminal will magically appear.
This is a little script I built the other day on the train to detect plagiarism.
You can use the script like so:
It works by simply searching Google with the terms provided (for example, “shakespeare”), scraping the result pages, and stripping out any HTML.
It then takes [4,5,7,9,15]-Grams of each page. Then it takes the same N-Grams of the document in question and compares everything.
In the end you get a list of the sites with a score next to each one specifying how many N-Grams were matched. You are also given a list of N-Grams that were matched and how many times (though there is no real use for that).
Here’s the code.
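A minimal sketch of the N-gram comparison step described above. This is not the original script: the method names are made up for illustration, word N-grams are assumed, and the Google search and HTML scraping are left out, so `page` is assumed to be plain text already.

```ruby
# N-gram sizes listed in the post.
NGRAM_SIZES = [4, 5, 7, 9, 15].freeze

# Split text into lowercase words and return every run of n consecutive words.
def ngrams(text, n)
  words = text.downcase.scan(/\w+/)
  words.each_cons(n).map { |gram| gram.join(' ') }
end

# All N-grams of the text for every size in `sizes`.
def all_ngrams(text, sizes = NGRAM_SIZES)
  sizes.flat_map { |n| ngrams(text, n) }
end

# Score a page: the number of distinct N-grams it shares with the document.
def match_score(document, page, sizes = NGRAM_SIZES)
  (all_ngrams(document, sizes).uniq & all_ngrams(page, sizes).uniq).size
end

document = 'to be or not to be that is the question'
page     = 'he said to be or not to be loudly'
puts match_score(document, page)
```

A higher score means more shared word sequences between the document and that page.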
It’s always good to know what you don’t know

F1 Score is a way to measure a test’s accuracy.

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$
P and R represent Precision and Recall.
As an example, let us imagine that we have created a search engine. We are in the early stages of development so have only crawled 100 pages (we will call these documents).
We tested our search engine by querying it with “Machine Learning”. We received 50 results.
Precision is the exactness of our results. We calculate it by:

$$P = \frac{\text{relevant documents retrieved}}{\text{documents retrieved}}$$
A score of 1 would mean that all documents retrieved had something to do with machine learning.
However, what if 90 of our 100 documents are to do with machine learning? Then surely our search engine isn’t as accurate as we first thought.
This is where recall comes in. Recall is calculated by:

$$R = \frac{\text{relevant documents retrieved}}{\text{relevant documents in the collection}}$$
So going forward with our example, our recall score would be:
A recall score of one would indicate that all machine learning documents were retrieved.
So our F1 score is…(if we presume our precision is 1)
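The arithmetic for the example can be checked with a short sketch. The method names are my own, and it assumes, as the example does, that all 50 retrieved documents are relevant and that 90 of the 100 documents are about machine learning.

```ruby
# Precision: share of retrieved documents that are relevant.
def precision(relevant_retrieved, retrieved)
  relevant_retrieved.to_f / retrieved
end

# Recall: share of all relevant documents that were retrieved.
def recall(relevant_retrieved, relevant_total)
  relevant_retrieved.to_f / relevant_total
end

# F1: harmonic mean of precision and recall.
def f1_score(prec, rec)
  return 0.0 if (prec + rec).zero?
  2 * prec * rec / (prec + rec)
end

prec = precision(50, 50) # every result is relevant
rec  = recall(50, 90)    # 50 of the 90 relevant documents were found
f1   = f1_score(prec, rec)

puts format('precision=%.2f recall=%.2f f1=%.2f', prec, rec, f1)
```

With these numbers recall is 50/90, roughly 0.56, and F1 is 10/14, roughly 0.71, so a perfect precision score does not hide the documents that were missed.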
After my recent experiences with CouchDB (which is a great product) I was forced to look for something that could handle large amounts of data more efficiently. After doing some research, I settled on Hadoop.
If you are dealing with truly large amounts of data, in the multiple terabyte range or larger, there really are only a few options available to efficiently store and process that data. If you are a company with money to burn, you can talk to Oracle. If that doesn’t appeal to you, you can do what many companies are doing - using Hadoop to store and process their data.
I built a small prototype using Hadoop over the course of a few weeks and really liked what I saw. Hadoop is modeled on Google’s published MapReduce and Google File System designs, and is an Apache Software Foundation project. The Hadoop project has seen steady growth over the past few years, with contributions from engineers at Yahoo, Facebook, and others. Many companies now run Hadoop on large clusters of machines and use it to store and process many terabytes or even petabytes of data.
I quickly realized that Hadoop would be able to handle the requirements of the project I was working on, but I also realized it was complex and had a steep learning curve. Hadoop is written in Java, so you can download the source, compile it, and run it yourself, but doing so makes setting up even a small cluster challenging. Fortunately, there are companies like Cloudera that provide pre-configured images for EC2, VMware, etc. Using Cloudera, I was able to get a small Hadoop cluster running on EC2 pretty quickly.
Even with Cloudera’s help, I realized I would be spending too much of my time configuring and maintaining servers. Hadoop has a daunting list of configuration options, and at this point I would rather spend time learning MapReduce concepts and other data processing tools like Pig and Hive. I also wanted to take advantage of Hadoop Streaming, which allows you to write MapReduce programs in your language of choice (Ruby) and process data. I love Ruby, and have no desire to use Java.
Luckily, there is a solution out there that meets my goals. Using Amazon’s Elastic MapReduce service, it’s possible to spin up a Hadoop cluster with minimal effort. In fact, all that is needed is an account and a browser. Once the cluster is running, you can access the master node via ssh and get to work. By doing this, I was able to use Ruby scripts for Streaming and also the Pig interface. You need to have your data stored on S3, but since I was already using S3, processing the data was easy.
Assuming you have an Amazon Web Services account set up for Elastic MapReduce, here are the steps to get a Hadoop cluster up and running in ‘interactive mode’:
- Log in to the AWS Management Console.
- Click on the ‘Amazon Elastic MapReduce’ tab and choose ‘Create New Job Flow’.
- Give the Job Flow a name, and be sure to select the ‘Pig Program’ option. Click Continue.
- Rather than executing a Pig script, we want to start an interactive session. Click Continue.
- Choose how many instances you want in the Hadoop cluster and the type. ‘m1.small’ is fine for testing purposes. You’ll want larger instances for real work. Be sure to select a key pair to use. You will use this key pair to ssh into the master node. Click Continue.
- Review the settings you chose, then click ‘Create Job Flow’.
At this point, Amazon will create the Hadoop cluster. It usually takes a few minutes, so this would be a good time to check your Twitter client. Remember that you must manually shut down this cluster. In non-interactive mode, Elastic MapReduce will start, run the scripts you ask it to, then shut down the cluster. In interactive mode, you are responsible for terminating the cluster when you are done.
When the state of your job flow is ‘waiting’ you will be able to ssh into the master node. Copy the ‘Master Public DNS Name’ and ssh to the cluster using the following command:
‘ssh -i /path_to_your_key/key_name hadoop@your_master_public_dns_name’
Note the ‘hadoop’ username. If all goes well, you will see a waiting prompt after successfully connecting to the master node. Do a quick ‘ps ax’ and you will see that several Hadoop processes are running.
Next, we need to get some data into the Hadoop cluster to work with. Assuming you have some type of data stored in S3, you can create a ‘data’ directory in the Hadoop file system with this command:
‘hadoop fs -mkdir /data’
We also need a directory for output:
‘hadoop fs -mkdir /output’
Run ‘hadoop fs -ls /’ and you should see the two directories. Next, assuming you have data residing in an S3 bucket, you can copy the contents of that bucket to your newly created ‘data’ directory with this command:
‘hadoop fs -cp s3://your_bucket_name/* /data’
Remember that data transfer between S3 and EC2 instances in the same region is free, so do not be afraid to copy a big chunk of data to the Hadoop cluster. The only constraint you have is that the EC2 instances have a limited amount of local disk space. There are ways around this, but for this exercise you will probably want to work with a relatively small amount of data: several gigabytes at most.
Once you have the data copied to the Hadoop cluster, you can work with it using a variety of methods. You could type ‘pig’ and be dropped into the grunt shell. Or you could submit MapReduce jobs written in Java via the Hadoop command line interface. But you’re reading this because you want to use Ruby with Hadoop, so we’ll do that.
Hadoop has a method of processing data called Streaming, where data is literally streamed line-by-line to a script via STDIN / STDOUT. This is slower than compiled Java, but it’s also much more convenient. You can start working with Hadoop and learning MapReduce while using a language you are comfortable with. The basic process behind streaming is to write your map and reduce scripts, then submit a Hadoop job via the command line interface, telling it where your scripts are and where the data is.
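To make the streaming model concrete, here is a rough sketch of a map step and a reduce step in plain Ruby, without any helper library. The method names and the IP regular expression are assumptions for illustration. It relies on the fact that Hadoop Streaming hands the reducer tab-separated key/value lines already sorted by key.

```ruby
# Loose match for an IPv4-looking token in a log line (an assumption about
# the log format; adjust it for your own data).
IP_PATTERN = /\b(?:\d{1,3}\.){3}\d{1,3}\b/

# Map step: turn one log line into "ip<TAB>1", or nil if no IP is found.
def map_line(line)
  ip = line[IP_PATTERN]
  ip ? "#{ip}\t1" : nil
end

# Reduce step: lines arrive sorted by key, so a key's counts are adjacent.
# Returns [ip, total] pairs, one per distinct IP.
def reduce_sorted(lines)
  totals = []
  current_key = nil
  current_count = 0
  lines.each do |line|
    key, value = line.chomp.split("\t", 2)
    if key != current_key
      totals << [current_key, current_count] unless current_key.nil?
      current_key = key
      current_count = 0
    end
    current_count += value.to_i
  end
  totals << [current_key, current_count] unless current_key.nil?
  totals
end

# In a real job each step would be its own script reading STDIN, e.g.:
#   STDIN.each_line { |l| out = map_line(l); puts out if out }
```

The number of pairs the reduce step returns is the number of unique IPs, and each pair carries how many lines that IP appeared in.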
We could do that, but we are going to take one step back and use a great Ruby interface to Hadoop Streaming called Wukong.
In order to use wukong, we need to somehow install it on the hadoop master node. Because we don’t have root or even sudo access, we can’t just install a gem. What I ended up doing is downloading wukong to my local machine, then using scp to copy it to the master node.
‘scp -i /path_to_your_key/key_name ~/Downloads/wukong.zip hadoop@ec2-67-202-43-146.compute-1.amazonaws.com:wukong.zip’
Put the unzipped files in a directory called ‘wukong’ - the following Ruby scripts will look there.
We are now ready to write our Ruby MapReduce program. For demo purposes, I am going to show you a simple example that I adapted from the Wukong examples. The data I am working with happens to be log data from S3, and I am interested in counting unique IPs over the course of a few months. The following script does just that:
Save your script in the same directory as the ‘wukong’ directory. Before we can run the script, we must first tell wukong where Hadoop is, as well as make some of the wukong utilities available:
‘export HADOOP_HOME=/home/hadoop’
‘export PATH=~/scripts/wukong/bin:$PATH’
Finally, it’s time to run a MapReduce job! Be sure your script is executable, then run it using these options:
‘./wukong_demo.rb --run=hadoop /data /output’
Note the /data and /output options. The first tells wukong where the input data is located, the second tells it where you want MapReduce to place the results. You should see output similar to the following while your job runs:
Note that it will probably take several minutes for your MapReduce job to run. It all depends on how much data you have. Even a small dataset will take three or four minutes.
Once your job is complete, you can view the results in the /output directory on the Hadoop cluster.
‘hadoop fs -ls /output/’
‘hadoop fs -cat /output/output_file_name’
At this point, we’ve pretty much covered the basics of using Ruby with Hadoop. There are many issues and options that I have not covered, but I’ll leave those to you to explore and figure out.
The bottom line is that you can use Ruby with Hadoop, and Amazon makes it even easier with their Elastic MapReduce service. When you need a full-time Hadoop cluster, spend the time and money to learn and build one. For now, pay for what you use on Amazon, and focus on learning MapReduce concepts and Hadoop fundamentals.
It’s amazing to me that this type of processing power is available on a pay-for-use basis. Running a 100 node Hadoop cluster for a few hours would be cheap and very efficient. That type of compute power was only available to a select few companies and governments even three or four years ago.
Have fun and let me know if you have questions.
Everyone likes a bit of feedback when using a program, so I have given Feature Selection the ability to log its progress.
Feature selection is the process of selecting a group of terms from a training set and using them as features.
Doing this has two main benefits. First, it decreases the number of features, which makes training and classifying quicker. Second, it can increase classification accuracy by removing noise, which helps prevent over-fitting.
I have recently released a library containing 3 different feature selection algorithms, so I am going to focus on them.
Chi Squared
Chi Squared is used in statistics to test the independence of two events. In our case, the two events are the occurrence of a term and the occurrence of a class. What the equation spits back at us is a measure of how much the expected and observed counts differ from each other.
The higher the result, the more dependent the two events are on each other (i.e. the occurrence of a term makes the occurrence of the class more or less likely).
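The equation looks like:

$$X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

N is the observed frequency and E is the expected frequency.
e_t is the occurrence of a term, which can be true or false (1, 0).
e_c is the occurrence of a class, which can be true or false (1, 0).
Therefore N11 would be the number of documents in which the term AND the class both occur.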
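As a rough sketch of the calculation (not the library’s actual code; the method name and argument order are assumptions), chi squared can be computed from the four observed counts. The expected count for each cell is taken as the count we would see if term and class were independent: (term total × class total) / N.

```ruby
# n11: term present, class present    n10: term present, class absent
# n01: term absent,  class present    n00: term absent,  class absent
def chi_squared(n11, n10, n01, n00)
  total = (n11 + n10 + n01 + n00).to_f
  observed     = { [1, 1] => n11, [1, 0] => n10, [0, 1] => n01, [0, 0] => n00 }
  term_totals  = { 1 => n11 + n10, 0 => n01 + n00 }
  class_totals = { 1 => n11 + n01, 0 => n10 + n00 }

  observed.sum(0.0) do |(et, ec), n|
    # Expected count if the term and the class were independent.
    expected = term_totals[et] * class_totals[ec] / total
    next 0.0 if expected.zero?
    (n - expected)**2 / expected
  end
end

puts chi_squared(10, 10, 10, 10) # term and class look independent
puts chi_squared(20, 0, 0, 20)   # term and class always occur together
```

The first call gives 0.0 because observed and expected counts agree; the second gives a large value because the term perfectly predicts the class.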
Mutual Information
Mutual information measures how much the presence or absence of a term contributes to making the correct classification.
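A minimal sketch of mutual information computed from the same four counts used for chi squared, using the standard definition: the sum over the four cells of P(term, class) · log2(P(term, class) / (P(term) · P(class))). The method name and argument order are my own, not the library’s API.

```ruby
# n11: term present, class present    n10: term present, class absent
# n01: term absent,  class present    n00: term absent,  class absent
def mutual_information(n11, n10, n01, n00)
  total = (n11 + n10 + n01 + n00).to_f
  # Each entry: [joint count, term marginal count, class marginal count]
  cells = [
    [n11, n11 + n10, n11 + n01],
    [n10, n11 + n10, n10 + n00],
    [n01, n01 + n00, n11 + n01],
    [n00, n01 + n00, n10 + n00]
  ]

  cells.sum(0.0) do |joint, term_total, class_total|
    next 0.0 if joint.zero? # an empty cell contributes nothing
    (joint / total) * Math.log2(total * joint / (term_total * class_total))
  end
end

puts mutual_information(10, 10, 10, 10) # term tells us nothing about the class
puts mutual_information(20, 0, 0, 20)   # term fully determines the class
```

The result is measured in bits: 0.0 when the term carries no information about the class, and 1.0 in the second call, where knowing the term settles a 50/50 class decision.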

Frequency Based
Frequency Based is simply selecting the terms that occur most in a class
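For illustration, frequency-based selection can be sketched in a few lines. The method name is hypothetical, and it assumes each document in the class is a plain string and that a term is counted every time it appears.

```ruby
# Return the `limit` most frequent terms across one class's documents.
# Ties are broken alphabetically so the result is deterministic.
def top_terms_by_frequency(documents, limit)
  counts = Hash.new(0)
  documents.each do |doc|
    doc.downcase.scan(/\w+/).each { |term| counts[term] += 1 }
  end
  counts.sort_by { |term, count| [-count, term] }.first(limit).map(&:first)
end

docs = ['ruby ruby hadoop', 'ruby hadoop pig']
puts top_terms_by_frequency(docs, 2).inspect
```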
Here is the code (1)
(via surrealtime)
I’ve been meaning to set up a continuous integration server for the gems I maintain for quite a while. Luckily I came across Run Code Run.
Run Code Run is basically a hosted CI server.
It’s very easy to set up if you’re on GitHub. Just go to service hooks in your project settings and turn Run Code Run on.

One cool feature is that they will also test your code on Ruby 1.9. I’m happy to say that all of my gems are 1.9 compatible.
Taken from the README…
Feature Selection is a library of feature selection algorithms.
A quick how-to…
“How many times have you looked at some ruby code and found strange variable names (eg. $0, $:, etc) and wondered what they meant? Below is a list of cryptic global variable names in ruby and their meanings.”
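A few of these globals in action, as a small sketch. The selection here is my own rather than the quoted list; `$PROGRAM_NAME` and `$LOAD_PATH` are the built-in readable aliases for `$0` and `$:`.

```ruby
# $0  name of the script being run (alias: $PROGRAM_NAME)
# $:  array of directories searched by require (alias: $LOAD_PATH)
# $$  process ID of the running Ruby interpreter
# $!  the exception currently being handled (nil outside a rescue)
# $~  MatchData from the most recent successful regex match

begin
  raise ArgumentError, 'boom'
rescue ArgumentError
  last_error = $! # same object the rescue clause would bind with `=> e`
end

'hadoop-0.20' =~ /(\d+)\.(\d+)/
version_match = $~ # MatchData for the match above

puts "script:     #{$0}"
puts "load path:  #{$:.size} entries"
puts "pid:        #{$$}"
puts "last error: #{last_error.message}"
puts "minor:      #{version_match[2]}"
```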