Thursday, December 15, 2011

First try at a linear separator.

I've written the first draft of a linear separator implementation in R, and used it on the famous Old Faithful geyser data set. In the plots below, the x-axis corresponds to the length of an eruption event in seconds, and the y-axis represents the waiting time until the next eruption. Waiting times fall into two general groups: short waiting times after short eruptions, and long waiting times after long eruptions.


My implementation (in its current form) plots the data itself in the scatter plot "Initial State" and then adds a second plot with the data separated into two groups, blue and purple, and the separator shown in orange. (Click the image to enlarge.)

All I know about linear separators I learned in AI class. My code does not use the alpha correction in the update, nor does it find the optimum separator among all possible separators, as Dr. Thrun begins to describe at the 3:40.


My code takes any data set of x,y ordered pairs and subsets it into two by finding centered means of two subsets. It does not matter whether the data set is linearly separable in a formal sense, the algorithm will separate the data into two regardless. 


Without going in to too much detail, the separation process itself happens iteratively in a while loop where y-values of the data points are compared to y-values of the separator line. The initial separator line is the perpendicular bisector of the two points farthest apart from each other. Every point above the line is put in one group, and every point below is put into another group. The algorithm proceeds to draw a new separator as the bisector of the mean x,y position of the two points above and below the previous separator. When the means do not change within a tolerance, the program quits.


In order for this to work, the separator line must not become perfectly vertical or the y-distance between any point and the line would be infinite. 


R has a jitter function that adds a small random number to each and every element of a vector. Here is an example of jitter on a vector x containing 4,5,6:


> x=c(4,5,6)
> x=jitter(x)
> x
[1] 3.996404 5.159189 6.149125

The 4 became 3.996404, 5 became 5.159189, and 6 became 6.149125. Those may be small changes, but I wanted my adjustments to be even smaller.


c=A.1-B.1

if (0 %in% c) {A.1 = A.1 + runif( length(A.1), -.0001, .0001)
               B.1 = B.1 + runif( length(B.1), -.0001, .0001)
                 y = y   + runif(length(y), -.0001, .0001)}

I alter the y-values randomly by an amount greater than -0.0001 and less than 0.0001. When the data is plotted, you can not tell that it was jittered. (Note: This is fine if .0001 is an insignificant amount relative to the values of the data set. It seems obvious now that a second draft of the code should determine what limits of runif would yield an insignificant amount of jitter.)