I’m learning something new about Linux every day, so this will probably be obvious to many of you out there. But today I needed to read the contents of a file into a variable in a bash script. The simple way to do this is to use the following command:
September 11th, 2009 | Posted in Linux | No Comments
Imagine you wanted to come up with a “classifier” to predict whether Georgia would win a given American football game. So you might get (training) data from all their games from the previous two years. Then you might come up with rules based on that data. For example, if the quarterback throws for 300+ yards, the running back runs for 100+ yards, and the defense gets 2+ interceptions, then you think they will win. Then you test those rules on the same (training) data, and it turns out that voila! it is correct 100% of the time. But you don’t really know whether that classifier (set of rules) is generalizable because you only tested it on data you had already looked at. So you could apply the rules to 2009 data (test set) and see how well it performs there. But it could be that the team is so different in 2009 from last year that these rules don’t hold. So while the classifier “fit” the training data well, in fact it “overfit” the training data because it didn’t also “fit” the test data. That’s why it’s important to hold out a test set so you can evaluate the classifier.
Your training data wouldn’t necessarily have to be from 2007-2008. You could pick five games from each of the last 10 seasons and have that be your training data. And then you could pick five other (non-overlapping) games from each of those seasons and make that your test set. The important thing is that they don’t overlap. Or you could use cross validation, in which 1 game from each of the last 10 seasons is used as the test set, and the remaining 9 games from those seasons are used as the training set. This is repeated 10 times, each time using a different game from each season for the test set and the remaining games for the training set.
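To make the cross-validation idea concrete, here is a minimal sketch in Python. It simplifies the setup to a single list of 10 games rather than 10 games from each of 10 seasons. The function name k_fold_splits and the game labels are made up for illustration, and no real game data is involved.

```python
def k_fold_splits(items, k):
    """Yield (training, test) pairs; each item lands in exactly one test fold."""
    for fold in range(k):
        # Items whose position falls in this fold are held out for testing.
        test = [x for i, x in enumerate(items) if i % k == fold]
        # Everything else is used for training, so the two never overlap.
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test

# Hypothetical game labels, not real data.
games = ["game%d" % n for n in range(1, 11)]
splits = list(k_fold_splits(games, 10))

for train, test in splits:
    print(test, len(train))
```

Each pass holds out one game and trains on the other nine, so every game gets used for testing exactly once.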
September 11th, 2009 | Posted in Statistics | No Comments
I had an occasion to learn a little about programming in C for the first time. One thing I needed to be able to do was save text to a file, but it was hard to find help for this on the Internet. This may be because it is called something slightly different in C than what I’m used to.
Anyway, below is some sample code.
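#include <stdio.h>
int main(void)
{
    FILE *fp = fopen("/home/Bill/path/test.txt", "w");  /* "w" creates the file or overwrites it */
    if (fp == NULL)
        return 1;  /* the file could not be opened */
    fprintf(fp, "%-15.15s\n", "Hello, world!");  /* left-justified, padded or cut to 15 characters */
    fclose(fp);
    return 0;
}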
That should get you started at least. There are other options, such as appending to a file (example below), writing numbers, etc.
fp = fopen("/home/Bill/path/test.txt", "a");
September 9th, 2009 | Posted in Tip | No Comments
I am running an application in one location on the file system, and I need to be able to invoke an application that is in a different directory on the file system. Due to personal preference, this other directory is not specified in the PATH variable. So I’m wondering how I can get Linux to know that it has to look for the second application in that other directory.
A simple way to approach this is to use the cd command in combination with the call to the second program. This will change the active directory to the second one and execute the command. This works for me because that call is being made in a different session than the one the first application is running in. Not sure if the terminology is right there. Please correct me if I’m wrong. Anyway, here’s the syntax:
cd /home/Bill/secondapp && ./scriptname
September 9th, 2009 | Posted in Linux, Tip | No Comments
In Linux, you sometimes want to run a command when the server first starts up. I ran into this recently: I had to manually go in and run a command each time the server got rebooted. This was a pain, so with a little help from my friend Martin, I learned how to have that script run automatically each time. There is a more advanced way to handle startup scripts, which is to use the /etc/init.d/ directory. I won’t go into that. You can Google it if you want more detail. To have a simple command run at startup, open the /etc/rc.local file and paste the command at the end.
Below is what my file looks like (or close to it). This runs a script called MyScript. Make sure the script has execute permissions and that you have the full path to the script.
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local
/path_to_script/MyScript &
September 4th, 2009 | Posted in Linux, Tip | No Comments
In this post, I explain the problem of sorting strings that contain numbers. If you just sort values using the default approach, it will not work properly. For example, it will sort “a10” before “a2” even though in your application you may want it to consider the alphabetic characters separately from the numeric ones in sorting. Or in other words, because the first character is the same, you may want it to treat the remaining portions as numbers. So “a2” would be sorted before “a10” because 2 is less than 10.
I converted the Python code from the other post to Java code. I won’t explain how it works, but let me know if you want an explanation.
import java.util.*;

public class NumericString implements Comparable<NumericString>
{
    private String _rawValue;

    public NumericString(String rawValue)
    {
        _rawValue = rawValue;
    }

    // Numbers compare numerically, text compares alphabetically,
    // and a number always sorts before text.
    private static int CompareTwoVals(String xVal, String yVal)
    {
        if (IsDouble(xVal) && IsDouble(yVal))
            return Double.compare(Double.parseDouble(xVal), Double.parseDouble(yVal));
        if (!IsDouble(xVal) && !IsDouble(yVal))
            return xVal.compareTo(yVal);
        if (IsDouble(xVal))
            return -1;
        return 1;
    }

    // Splits the raw value into runs of digits and single non-digit
    // characters, e.g. "a10" becomes ["a", "10"].
    private ArrayList<String> GetValueList()
    {
        ArrayList<String> valueList = new ArrayList<String>();
        String tempVal = "";
        for (int i=0; i<_rawValue.length(); i++)
        {
            String val = _rawValue.substring(i, i+1);
            if (IsDouble(val))
                tempVal += val;
            else
            {
                if (!tempVal.equals(""))
                {
                    valueList.add(tempVal);
                    tempVal = "";
                }
                valueList.add(val);
            }
        }
        if (!tempVal.equals(""))
            valueList.add(tempVal);
        return valueList;
    }

    public static boolean IsDouble(String value)
    {
        try
        {
            Double.parseDouble(value);
            return true;
        }
        catch (Exception ex)
        {
            return false;
        }
    }

    @Override
    public int compareTo(NumericString compareObj)
    {
        if (compareObj == null)
            return -1;
        if (this._rawValue == null) return -1;
        if (compareObj._rawValue == null) return 1;
        if (this._rawValue.equals(compareObj._rawValue)) return 0;
        ArrayList<String> xList = this.GetValueList();
        ArrayList<String> yList = compareObj.GetValueList();
        // Compare piece by piece until one pair differs.
        for (int i=0; i<xList.size() && i<yList.size(); i++)
        {
            int compareResult = CompareTwoVals(xList.get(i), yList.get(i));
            if (compareResult != 0)
                return compareResult;
        }
        // All shared pieces are equal, so the shorter value sorts first.
        return Integer.compare(xList.size(), yList.size());
    }

    @Override
    public String toString()
    {
        return _rawValue;
    }
}
Here’s how you might call it:
NumericString a = new NumericString("a10");
NumericString b = new NumericString("a2");
ArrayList<NumericString> list = new ArrayList<NumericString>();
list.add(a);
list.add(b);
Collections.sort(list);
for (NumericString x : list)
System.out.println(x); // a2, a10
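For comparison, here is a rough sketch of the same idea in Python 3. This is not the code from the earlier post, and it is not a line-by-line port of the Java class above: it splits on runs of digits with a regular expression and uses the result as a sort key. The name natural_key is my own.

```python
import re

def natural_key(text):
    """Split text into alternating text and digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, the odd positions are always the digit runs,
    # so text is only ever compared with text and numbers with numbers.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

values = ["a10", "a2", "b1", "a1"]
ordered = sorted(values, key=natural_key)
print(ordered)
```

Using a key function means each string is split once, rather than on every comparison.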
August 19th, 2009 | Posted in Java, Tip | No Comments
I’m writing this post because I just spent a couple of hours banging my head against the wall, trying to figure out how to run an R script from the command line. It worked if I simply ran it at the command line. But when I tried to run the same command from Java (I know, this sounds convoluted) using the following code, it behaved strangely and didn’t execute.
Runtime r = Runtime.getRuntime();
r.exec("R CMD BATCH RScriptFile.R");
So I finally found out that there is a utility in R that is designed to help you execute scripts at the command line more easily. It’s called Rscript. Now I’m doing it the following way, and it’s working beautifully.
Runtime r = Runtime.getRuntime();
r.exec("Rscript RScriptFile.R");
Seems obvious, but it wasn’t to me, and it took a while to find an answer.
August 12th, 2009 | Posted in Java, R | 6 Comments
In my last post, I explained a nifty way to invoke a function dynamically when you have the name of the function in a character object (in R). However, this didn’t explain how you could pass parameters to that function. I found a way to do this, which I will explain below, though it is possible that there is a simpler way to do it, of which I am not aware. But the way I describe should be very flexible for all types of dynamic invocation.
Before I get into the details, I will explain a possible use case for this type of functionality. It might seem far fetched, but it’s the way I’m approaching one of my research tasks right now, at least as a workaround until I can find a more elegant way to do it. So…most of my code for this project is written in Java. But I need to be able to do some statistical processing in R (which has far richer statistical capabilities than Java). The way I’m approaching this is to communicate between Java and R using command-line invocation (I can provide more details if anyone is interested). I need to tell R via the command line that it should load source code from a given file and invoke a specific function using specific parameters. Then in the R code, it needs to parse those values and then call the functions specified. Hopefully that made sense…let me know if not.
Anyway, here’s how you do it. Suppose you had a function like this:
x = function(param1, param2)
{
print(paste("value of param1:", param1))
print(paste("value of param2:", param2))
}
The normal way you would invoke this would be:
x("abc", "def")
To invoke it dynamically, you could do this:
eval(call("x", "abc", "def"))
August 12th, 2009 | Posted in Tip | No Comments
One way the R programming language has been described is that it is a functional programming language. Whether it would be called this by purists, I don’t know. But part of what this means is that all functions are treated as objects. So you can pass functions around very easily. This might sound strange, but this comes in very handy. One example is when you want to apply a certain function to each element in a given vector. You can use the sapply function, passing as parameters the vector and the function you want to apply to it.
Another nice thing is that you can dynamically invoke functions (reflection) very easily. Let’s say you created a function called “x” and that the purpose of this function was to print “hello, world!”
x = function()
{
print("hello, world!")
}
The standard way you would invoke this function would be the following:
x()
But let’s say (for reasons I won’t elaborate on here) that you only had the function name as a character object and wanted to invoke it. You could simply do the following:
Is that cool or what!!??!! (though I’m not sure how often it would be useful)
August 11th, 2009 | Posted in R, Reflection, Tip | No Comments
LaTeX is a typesetting system that you can use to produce high quality documents. I’m starting to use it now to write my dissertation. It has a bit of a learning curve, but I’m thinking it will probably save me a lot of pain and anguish in the long run in writing my dissertation. I’ll probably be writing more posts about it as I gain familiarity and understand best practices, but for now I want to write a simple tip that will help you.
If you want to write double quotes in LaTeX, you need to use two back quote characters for the left quotation mark and then two apostrophes (single quote characters) for the right quotation mark, as in ``quoted text''. I’m guessing this is because the regular double-quotation mark is used as a reserved character. Or maybe because LaTeX wants you to be very explicit about what you want to be displayed, rather than trying to guess for you (as MS Word does). This can be a pro or a con, depending on how you look at it.
August 7th, 2009 | Posted in LaTeX, Tip | No Comments
Let’s say you have a directory with files matching a certain pattern (and others that don’t), and you want to find all the files that match that pattern in R. It’s pretty easy to do this, but it’s not super straightforward to figure it out from the help. Below is the simple answer. But you may want to look at the help files for more options (such as searching sub-directories recursively, ignoring case, etc.).
files = list.files(".", glob2rx("filePattern*"))
July 27th, 2009 | Posted in R | No Comments
In Python 2.5, the sys module has a built-in constant that tells you the maximum allowable int value:
import sys
print sys.maxint
But there is no corresponding constant for float values. Part of the reason may be that it can vary from system to system. Version 2.6 added sys.float_info for this, so what follows is just a workaround for previous versions.
Using the following code, I tested this on both Windows XP and Red Hat Linux (64-bit).
x = 1.0
while x > 0.0 and str(x) != "inf":
    x *= 10.0
    print x
Note that the string representation of infinity (“inf” in this case) varies between Linux and Windows. This is the value for my Linux system.
Anyway, this comes up with 1e+308. That’s not the exact limit, because the loop only tries powers of 10, but it’s close: on systems that use IEEE 754 doubles, the true maximum is about 1.8e+308. The minimum number is the negative value of this.
If you encounter a situation where Python is giving you an error such as “Numerical result out of range,” you can guess that the system is trying to handle float values greater or less than these limits. A quick and dirty solution to getting around this is to find the max/min values for your environment and hard code them. But you can also find external libraries that help if you need it to be absolutely precise.
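If you are on Python 2.6 or later, you don’t need the loop to find these values. A minimal sketch, written here in Python 3 syntax, that reads the limit from sys.float_info (the variable names are my own):

```python
import sys

largest = sys.float_info.max   # largest finite float on this system
most_negative = -largest       # most negative finite float

print(largest)
print(most_negative)

# Multiplying past the limit overflows to infinity, which is the
# condition the loop above was watching for.
print(largest * 10.0 == float("inf"))
```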
Please let me know if you have any additional insights on this.
July 8th, 2009 | Posted in Math, Python | No Comments
I’m sure this one will be obvious to many readers, but it wasn’t to me, so I’m going to share it. A bash script is a file that you can run at the command line in Linux/Unix environments to automate something. In my case, I have a Java program that I need to run over and over again, so I created a bash script to make this easier rather than having to type the same long command repeatedly. These scripts are quite flexible and even let you use programming constructs such as functions, for loops, and if statements.
Let’s say you have a simple bash script. The file name is test. And the contents of the file are the following:
#!/bin/bash
echo 'Hello World!'
You would invoke this at the command line (from the directory where it resides) by entering:
./test
Now let’s say you want to be able to pass an argument to it. You could modify the script to this.
#!/bin/bash
echo 'Hello World!'
echo 'The argument is:'
echo $1
You would invoke this at the command line by entering:
./test whatup
This would print out:
Hello World!
The argument is:
whatup
As far as I know, the command-line arguments can only be referenced by their index (in this case, it was the number 1 because it was the first argument). But you could get more creative if you wanted to.
This article has more detail on how to create bash scripts:
http://www.ibm.com/developerworks/library/l-bash.html
June 30th, 2009 | Posted in Linux | No Comments
Let’s say you have a list of values in a vector. And you want to be able to convert that into a delimited string. In some languages (such as Python), you can do this easily with the join method. But how would you do this in R?
This can be done with the paste function. The following code example shows how you would convert a simple vector to a comma-separated string.
x = c(1,2,3)
paste(x, collapse=",")
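For reference, this is roughly what the Python join approach mentioned above looks like. It is a small sketch in Python 3; the one wrinkle is that join only accepts strings, so the numbers have to be converted first.

```python
x = [1, 2, 3]

# str.join only accepts strings, so convert each value first.
joined = ",".join(str(v) for v in x)
print(joined)  # 1,2,3
```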
June 29th, 2009 | Posted in R | 4 Comments
Suppose you have an object in Python that you retrieved from a third-party library, but you don’t have access to the source code or to very good documentation. Believe me, it happens (and did to me today). You can use a simple built-in function in Python to find out which attributes and methods are exposed by a given object: dir().
x = ... #get from somewhere
print dir(x)
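The raw output of dir() includes a lot of internal names that start with underscores. Here is a small sketch, in Python 3 syntax, of how you might narrow it down to the public names and then to the methods only. The Sample class is a made-up stand-in for whatever object the third-party library hands you.

```python
class Sample(object):
    """Hypothetical stand-in for an object from a third-party library."""

    def __init__(self):
        self.value = 42

    def run(self):
        return "running"

x = Sample()

# Drop the names that start with an underscore (internal attributes).
public_names = [name for name in dir(x) if not name.startswith("_")]

# Keep only the names that can be called, i.e. the methods.
public_methods = [name for name in public_names if callable(getattr(x, name))]

print(public_names)
print(public_methods)
```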
June 17th, 2009 | Posted in Python, Reflection | 1 Comment
I recently had a scenario where part of what I wanted to do was in Java, and the other part was in an application that was written in Python. Rather than rewrite my entire code base in one language or the other, I wanted to find a (quick and dirty) way to invoke the Python application from Java. One way to do this would be to use some type of service-oriented architecture (e.g. via an HTTP/XML Web service), but I didn’t want to get into that complexity for this simple task. Another way would be to use some type of interface like Jython.
Instead I just invoked the other program as if from the command line and retrieved the output that was stored in text files. This type of approach will probably only work in limited scenarios, but it can come in handy for integrating different programming languages.
This article gives a nice code sample (that has worked for me) that illustrates how to do this. If you’re like me, you’ll probably simplify the code, but at least it shows how to do it.
June 4th, 2009 | Posted in Java, Linux, Python, Tip | No Comments
In Java and C#, there is a method that enables you to find the last occurrence of a given sequence of string characters within a string. This method is sometimes called “lastIndexOf.” A method by this name does not exist in Python. But there is an easy way to do it if you know what you are looking for.
The method is rfind, which could be translated into “find from the right side.” See below.
x = "abc,def,ghi"
print x.rfind(",") # 7
April 16th, 2009 | Posted in Python | No Comments
There is no way that I know of to find the median of a list of numbers in the Java framework. The median is the middle value of the sorted list. If there is an even number of values, the median is the average of the two middle values. Below is a method, along with a supporting method and some tests, for computing the median in Java.
public static double Median(ArrayList<Double> values)
{
    // Note: this sorts the caller's list in place.
    Collections.sort(values);
    if (values.size() % 2 == 1)
        return values.get((values.size()+1)/2-1);
    else
    {
        double lower = values.get(values.size()/2-1);
        double upper = values.get(values.size()/2);
        return (lower + upper) / 2.0;
    }
}
public static ArrayList<Double> CreateDoubleList(double ... values)
{
    ArrayList<Double> results = new ArrayList<Double>();
    for (double d : values)
        results.add(d);
    return results;
}
System.out.println(2.5==MathUtility.Median(Lists.CreateDoubleList(0,1,2,3,4,5)));
System.out.println(2.0==MathUtility.Median(Lists.CreateDoubleList(0,1,2,3,4)));
System.out.println(2.0==MathUtility.Median(Lists.CreateDoubleList(3,1,2)));
System.out.println(3.0==MathUtility.Median(Lists.CreateDoubleList(3,2,3)));
System.out.println(1.234==MathUtility.Median(Lists.CreateDoubleList(1.234, 3.678, -2.467)));
System.out.println(1.345==MathUtility.Median(Lists.CreateDoubleList(1.234, 3.678, 1.456, -2.467)));
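The same logic can be sketched in Python 3, in case it helps to see it without the Java boilerplate. This is an illustration of the approach rather than a replacement for the method above. Unlike the Java version it sorts a copy instead of the caller’s list, and it raises an error for an empty list (my own choice).

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)  # sorted() leaves the caller's list untouched
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the single middle value.
        return float(ordered[mid])
    # Even count: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2.0

print(median([0, 1, 2, 3, 4, 5]))
print(median([3, 1, 2]))
```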
March 11th, 2009 | Posted in Java, Math | 2 Comments
Java has functionality built into it to transform a number using the natural logarithm. This can be done using the java.lang.Math.log() method. However, to my knowledge there is no built-in way to do this for base-2 logarithms.
Please don’t let me get started on how silly this is!!
To do a base-2 log transformation:
public static double Log2(double number)
{
    return Math.log(number)/Math.log(2);
}
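The trick here is the change-of-base rule: the log of a number in any base equals its natural log divided by the natural log of the base. A short Python 3 sketch of the general version follows. The function name log_base is my own, and Python has its own math.log2, which is used here only as a cross-check.

```python
import math

def log_base(number, base):
    """Logarithm of number in the given base, via the change-of-base rule."""
    return math.log(number) / math.log(base)

print(log_base(8.0, 2.0))
print(log_base(1024.0, 2.0), math.log2(1024.0))
```

Because of floating-point rounding, the result can differ from the exact answer in the last digit or so, which is worth remembering before comparing it with == in a test.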
March 5th, 2009 | Posted in Java, Statistics | No Comments
When you create a plot in R, you can easily modify the text size of the labels on the axes using the cex.lab property. This stands for “character expansion of labels.” For this value, you specify a relative size (compared to the default) that you want the text to be. The following code shows how you might do this:
barplot(c(48.3, 63.3, 66.7), ylim=c(0, 70), col=1:3, ylab="Classification Accuracy %", names=c("Majority Vote", "Clinical Only", "Clinical + Mutations"), cex.names=1.3, cex.lab=1.3)
February 26th, 2009 | Posted in R, Statistics | No Comments