Being a data analyst, it’s a bit embarrassing how little experience I have with the new hotness of machine learning. I recently had a conversation with an individual who mentioned that they often employ decision and regression trees as a data exploration method and this prompted me to start looking into them.
Decision and regression trees are an awesome tool because of how transparent the end result is. It’s easy to understand and to explain to others who might be weary of implementing something opaque. In the simplest trees, they ask a series of yes-no questions such as: is a certain variable greater than some number. With each question you progress through different paths until you reach a terminal node. This terminal node will give you a prediction, either a classification or a value, depending on the type of tree. This process is extremely easy to follow and is the biggest selling point of decision trees.
Another advantage of decision trees is the simplicity of the algorithm used to create the tree. There are 3 basic steps that go into creating a tree. The first is a calculation on some cost function that we want to minimize. In the case of regression trees the cost function is usually just the mean squared error of all observations at that particular node. Secondly, each variable is iterated over to find the optimal way to divide the observations into two groups. Optimal, in this case, refers to the smallest mean squared error. And finally, once the optimal division is found the process is repeated on the two subgroups. This continues until certain predefined conditions are reached like minimum number of observations at a node.
In fact, the algorithm is so simple I decided to implement a basic regression tree in Julia as a learning exercise. Julia is an awesome statistical computing language thats main advantage is speed. Code written in Julia is often several times faster than the equivalent R or Python code for non-trivial calculations. My implementation is rather limited compared to the ‘rpart’ package in R or even the ‘DecisionTrees.jl’ package available in Julia. The idea was to gain a better understanding of how decision trees actually work and not to replace any of the already great implementations available.
I tested my implementation on the 'cu.summary' dataset from 'rpart'. This dataset contains information on a small number of cars and regressing on mileage gives the following tree:
The labels show the decision that is made at each node. The lines that begin with a number show the number of observations that were placed in that bin along with the average mileage of those observations. The output isn’t pretty but it isn’t that difficult to follow since the tree is pretty shallow.
Building a clock with nixie tubes is a project that has been done to death already. A quick google search will yield thousands of variants put together by just as many different people. So even though it’s not the most original project, nixie tubes have such a cool look to them I had to make my own anyway. I think the warm glow always catches people’s eye with a look that is somehow both futuristic and retro at the same time.
There were a couple goals that I set for the project. Firstly, I wanted to design as much of the clock as I could. This was made pretty easy by the sheer amount of information available online. As I said, thousands of people have attempted this already and there is no shortage of documentation. The tubes themselves are salvaged IN-14 style tubes produced in the Soviet Union. They even have the CCCP logo printed on the backs of a couple that haven’t yet worn away.
I used the site EasyEDA to design the schematic and PCB board layout. I got the PCB fabricated from there as well and they turned out pretty nice. Although the minimum I could order was 6 so I got a few extras that I’m not sure what to do with. It was cool to design my own board and layout everything the way I wanted it. I had never done any CAD stuff before so it was a fun experience.
Another thing I really wanted for the clock was accurate timekeeping. I found that using the internal clock from the Arduino was pretty inaccurate as over time it would fluctuate for many different reasons. Another option was to use a real-time clock chip, like those in digital watches. But I decided against it because even watches can drift by a minute or two over several months. The final option, and the one I went with, was to use a GPS chip. Going with GPS means that the clock will always have exactly GMT time, as provided by the satellites. Which means I only have to adjust the hour instead of setting the whole time when resetting it. Of course, if you live somewhere strange, like Newfoundland with a half timezone, the minutes will still need to be adjusted too.
My board design is uploaded here if you'd like to take a look or get your own fabricated. And the Arduino code is, as always, on my Github.
If you've been on the internet over the past couple of weeks then you've likely heard of the so-called Dirty COW bug recently discovered in Linux. The story behind it is actually kind of funny. It was discovered and patched by Linus himself over 11 years ago! The fix was then undone in subsequent code commits due to issues with new code. So the vulnerability has been exploitable since for about 9 years, since 2007. The silver lining is that no examples of this exploit being used have been found in the wild and most popular Linux distros have pushed fixes. Patch your system if you haven't already!
If you're not familiar with what dirty COW does it allows a non-privileged user to write to files that are usually read-only except by root. This is pretty significant and since "everything in Unix is a file" this opens up lots of doors for exploitation. There are plenty of resources online to learn more about how exactly the exploit works, like here, so I won't go into much detail about the exploit itself.
Along with knowledge of the exploit, proof-of-concept code was also released. Basically a skeleton example with just enough code to show how writing to a root-protected file is possible. This is what I used to create a simple command line program that I call Madcow. It takes advantage of the dirty COW exploit to change the root password on an Ubuntu system to whatever the user provides.
Ubuntu stores passwords (sort of) in a file located at '/etc/passwd'. Here is the first couple lines of the one found on my machine:
This file is read-only to all users except root. There's lots of stuff here but the most important information is after the first colon of each line. Right at the top we see 'root' then a colon, a lowercase 'x', and another colon. There can be two types of information here: an encrypted password (storing plaintext passwords in files readable by all users would be silly), or a lowercase 'x'. The lowercase 'x' indicates that the encrypted version of the password is stored in a separate file '/etc/shadow'. That file is readable only by root so changing it is impossible.
So if we want to change the root password all we need to do is replace the single lowercase 'x' with an encrypted password of our choosing. And this is exactly what madcow does. Here is a quick example:
$ head -1 /etc/passwd
$ ./madcow new_password
$ head -1 /etc/passwd
Not very interesting on the surface, but this is a write-protected file that we, as a normal user, have just written to! Now it's simply a matter of starting a root shell using the new password and the entire system is ours.
That's basically it. I'm sure there are a lot more interesting things that could be done with this exploit but this is the most obvious and powerful. The code is available on my Github if you want to take a closer look. Obviously I take no responsibility for any damage caused to your or others' systems from running this code. And please don't run it outside of a VM or an installation you don't mind losing.
The more I play around with Julia the better things seem to get. With web-stuff on my mind lately from working on MiniHTTP I thought it would be cool to create a simple web API for adding charts to any webpage. Thanks to a couple packages (HttpServer.jl and Plots.jl) this was dead easy in Julia.
The basis for handling requests for charts is by creating a simple HTTP Handler. This checks the URL that was supplied and, if it matches, dispatches the correct chart.
http = HttpHandler() do req::Request, res::Response
Above there are two endpoints that we can match. The first one is if the URL starts with ‘/randomchart’, in this case the function ‘random_chart()’ is called which generates some random numbers and plots them to a chart. The chart at the top of this post in made this way and, since it is random, every time this page is refreshed the chart will be slightly different.
The second endpoint is for generating a custom line chart. If the URL starts with ‘/linechart’ then the function ‘line_chart(::AbstractString)’ is called. This function expects that a query string is attached to the end of the supplied URL that contains a list of x and y coordinates to plot on a line graph. For instance, the URL ‘/linechart?x=1,2,3,4,5,6,7,8,9,10&y=1,2,2.5,2,3,1,1.5,3,0.5,2’ was used to generate the plot below.
And that’s really all there is to it. The full code (which is less than 50 lines!) has been uploaded to my Github if you would like to take a further look. With a little extra work, this could be fleshed out to a very nice and easy-to-use web graphing API for people who just want to be able to quickly add some charts to a website.
Update(20-02-2017): I've shut down one of my Amazon EC2 instances so the charts shown above are now static images instead of dynamically generated from Julia.
Finally! After over a year of wondering what I should do with my Raspberry Pi, and a forgotten attempt at a weather station, I have found a permanent use for the Raspberry Pi.
Recently I took my website off blogger and wrote my own from scratch. This started a chain reaction and got me thinking: “If I’m already writing a website, why not write a webserver to go along with it?” And as a result, I’ve been spending the majority of my spare time learning about and writing a C webserver from scratch. A huge thanks to Beej's guide that got me rolling on this project.
It’s called MiniHTTP (not very original I know) and it’s finally in a semi-working state. Good enough for my purposes anyways. It’s more of a prototype and a learning project than a piece of code that should ever be used by anybody else for anything other than education. It’s rather inefficient due to forking processes to handle each request and it supports only a small subset of HTTP. But I wanted to keep it small a purpose-built for my application since it will be running on a Raspberry Pi I didn’t want the project to get too big.
Now the Raspberry Pi comes into play. My version 2 has been sitting in the closet for quite some time now. Soon after starting to work on MiniHTTP I made the decision to host everything on my Pi for a couple of reasons. First, while Amazon AWS is pretty awesome, it’s quite expensive for hosting my small site that barely gets any traffic. And second, because it’s awesome! Hosting my own site on my own hardware, that only cost 35 bucks! Actually twice that with the cost of the PiDrive. But still pretty dang cheap. And if it holds up for more than 7 months it will have been cheaper than continuing to pay for an EC2 instance on Amazon.
And of course it isn’t a true Raspberry Pi project without an unnecessary glowing blue LCD display. So along with the webserver a small python script is also running in the background that reads from the server log file to show hourly, daily, and weekly website hits.
This blog is intended as written documentation of my ongoing side-projects. Most everything here is likely to be programming or data related, although something else may creep in every now and then.
The blog itself is an example of a personal project of mine. It was written in C, using SQLite3, and is hosted on a Raspberry Pi