How to Delete Files Using Python Code

It’s really easy to delete a single file in Python. Say the file you want to delete is called “MyFile.txt.” You could delete it using this syntax:

import os
 
os.remove("MyFile.txt")

Let’s say you had many files in a directory, and the ones you wanted to delete were called MyFile1.txt, MyFile2.txt, and MyFile3.txt. One trick that you can use to delete such files is to use a wildcard pattern, such as MyFile*.txt. The * indicates that any sequence of characters can fill that space. Using the following code, you could find all files in a directory that match this pattern and then delete them:

import os, glob
 
for filePath in glob.glob("MyDirectory/MyFile*.txt"):
    if os.path.isfile(filePath):
        os.remove(filePath)

If you wanted to do this for all subdirectories, you could do the following:

import os, glob
 
for root, dirs, files in os.walk('MyDirectory'):
    for filePath in glob.glob(os.path.join(root, "MyFile*.txt")):
        if os.path.isfile(filePath):
            os.remove(filePath)

Just be careful! Any time you are using wildcards to delete files, it is possible to delete files you didn’t intend to. So make sure you test thoroughly before implementing code like this.
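One way to test before deleting is to do a "dry run" that only lists what would be removed. Below is a minimal sketch of that idea, not a definitive implementation. The function name delete_matching and its dry_run parameter are my own inventions for illustration, and the demo runs against a temporary directory so nothing real is touched.

```python
import tempfile
from pathlib import Path


def delete_matching(directory, pattern, dry_run=True):
    """Return the files under directory (recursively) that match pattern.

    The files are only deleted when dry_run is False.
    """
    matches = sorted(p for p in Path(directory).rglob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


# Demonstrate on a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["MyFile1.txt", "MyFile2.txt", "Other.txt", "sub/MyFile3.txt"]:
        (root / name).write_text("demo")

    # First pass: look at what would be deleted, without deleting anything.
    preview = [p.name for p in delete_matching(root, "MyFile*.txt")]
    print(preview)

    # Second pass: actually delete, then check what is left.
    delete_matching(root, "MyFile*.txt", dry_run=False)
    remaining = sorted(p.name for p in root.rglob("*") if p.is_file())
    print(remaining)
```

Defaulting dry_run to True means you have to opt in to the destructive step.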

How to Peek at Top, Middle, or Bottom of Text File Without Opening It in Linux

Let’s say you have a really large text file. By large, I mean it has a lot of lines of text in the file. I frequently work with files that have millions or billions of lines of text in them. If I try to open the file in a text editor, it usually takes a really long time and can sometimes cause the text editor to crash. I’m sure there are text editors out there that handle this better, but sometimes you don’t have the luxury of using your preferred tools. In Unix/Linux, one way you can “peek” at the top of a file is to use the head command. For example, the following command will let you look at the top 100 lines in a file.

head -n 100 myfile.txt

Or if you want to see the bottom 100 lines, you can use tail:

tail -n 100 myfile.txt

But let’s say you have 1,000,000 lines in a text file, and you want to see what’s on lines 500,001-500,010 without opening the file. You can use a little trickery by combining head and tail.

tail -n +500001 myfile.txt | head -n 10

The way this works is that the tail command outputs the file starting from the value specified after “+” and then that output is redirected (via a pipe) to the head command, which then displays the top lines of that output. So the tail command above would output lines 500,001-1,000,000. And the head command would spit out the first ten of those lines. Fortunately, you won’t see the output of tail on the screen. It will only display the final result. (See also http://www.fastechws.com/tricks/unix/head_tail_mid_files.php.)
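If you happen to be in Python rather than at the shell, the same kind of peek can be sketched with itertools.islice, which reads the file one line at a time instead of loading it all into memory. This is only a sketch under my own assumptions: the helper name peek_lines and the small demo file are made up for illustration.

```python
import os
import tempfile
from itertools import islice


def peek_lines(path, start, count):
    """Return count lines beginning at 1-based line number start."""
    with open(path) as handle:
        # islice skips the first start-1 lines lazily, then yields count lines.
        wanted = islice(handle, start - 1, start - 1 + count)
        return [line.rstrip("\n") for line in wanted]


# Build a small demo file containing "line 1" through "line 1000".
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(1, 1001)))

middle = peek_lines(tmp.name, 501, 3)
print(middle)  # ['line 501', 'line 502', 'line 503']

os.remove(tmp.name)  # clean up the demo file
```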

Get Smallest Matches Possible with Regular Expressions

I occasionally use regular expressions, and when I do, I don’t regret it. But it can be difficult to get the syntax and concepts straight in my head because I only use them occasionally.

The general purpose behind regular expressions is to take a piece of text and find smaller pieces of text within it that match a certain pattern. There are various syntaxes for defining the patterns, and they overlap with each other quite a bit. You can match specific characters, numerical digits, whitespace characters, etc. Obviously I can’t provide a full tutorial here…other places on the web have tutorials on this.

One gotcha I often run into is when I am trying to match a pattern that starts and ends with specific characters. But regular expression engines often give you the largest possible match for your patterns rather than the smallest, and typically I want the smallest possible match. For example, let’s say the text you are working with is:

abcd abcd abed

For demo purposes, let’s say the first pattern I want to match is “bcd”.

Below is some Python code that illustrates how to use regular expressions to find matches for this pattern. (To the Python purists, I will admit that this code is not as short/terse as it could be. I decided to make it a little more verbose so readers not super familiar with Python would not get lost, for example, if I used list comprehensions.)

import re
 
text = "abcd abcd abed"
pattern = "bcd"
 
matches = []
 
for matchGroup in re.finditer(pattern, text):
    matchText = matchGroup.group(0)
    matches.append(matchText)
 
print(matches)

The output of this will be:

['bcd', 'bcd']

OK, that’s good. Now let’s say I wanted to match anything that started with “b” and ended with “d”. This is exactly the kind of task for which regular expressions are useful. I can use the same code but specify a “b” followed by a dot character followed by a plus followed by a “d”: “b.+d”. This means that I want to find anything that starts with a “b”, ends with a “d”, and has at least one character in between. When I run the Python code I get:

import re
 
text = "abcd abcd abed"
pattern = "b.+d"
 
matches = []
 
for matchGroup in re.finditer(pattern, text):
    matchText = matchGroup.group(0)
    matches.append(matchText)
 
print(matches)
 
Output:
['bcd abcd abed']

It found the largest single match that started with a “b” and ended with a “d”: almost the entire text. But that’s not exactly what I was looking for. I wanted it to identify the smaller, individual matches: “bcd”, “bcd”, and “bed”.

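Well, there’s an easy way around this. Simply add a question mark right after the plus, so the pattern becomes “b.+?d”. The question mark makes the plus “non-greedy” (or “lazy”), meaning it matches as few characters as possible. (Note that this syntax may vary depending on the programming language or regular expression implementation you are using.)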

import re
 
text = "abcd abcd abed"
pattern = "b.+?d"
 
matches = []
 
for matchGroup in re.finditer(pattern, text):
    matchText = matchGroup.group(0)
    matches.append(matchText)
 
print(matches)
 
Outputs:
['bcd', 'bcd', 'bed']
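For a more compact side-by-side comparison, here is a short sketch using re.findall on the same text. The third pattern is an alternative that I am adding for illustration: instead of a lazy quantifier, it uses a negated character class so the match cannot cross a space.

```python
import re

text = "abcd abcd abed"

# Greedy: .+ grabs as much as it can, so one long match comes back.
greedy = re.findall(r"b.+d", text)
print(greedy)  # ['bcd abcd abed']

# Lazy: .+? grabs as little as it can, so each short match comes back.
lazy = re.findall(r"b.+?d", text)
print(lazy)  # ['bcd', 'bcd', 'bed']

# Alternative: forbid spaces between the "b" and the "d".
no_space = re.findall(r"b[^ ]+d", text)
print(no_space)  # ['bcd', 'bcd', 'bed']
```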

Vertically Align x-axis Labels in an R plot

When you are creating a plot in R, sometimes you have axis labels on the x axis. You can create custom ones with the axis function. If your labels are long, you can break them apart using a newline character so they span multiple lines. But when you do this, it centers the text vertically at the bottom, and sometimes this doesn’t look right. In my case, I wanted these labels to be centered vertically at the top. I searched and couldn’t find any ideas on how to center these vertically. If you know of something, please let me know.

Below is some code that is a workaround to center labels vertically. It pads each label with empty lines so that every label has the same number of lines.

formatAxisNames = function(x, splitChar)
{
  maxSize = 0
 
  for (y in x)
  {
    z = strsplit(y, splitChar)[[1]]
    if (length(z) > maxSize)
      maxSize = length(z)
  }
 
  formatted = NULL
 
  for (y in x)
  {
    z = strsplit(y, splitChar)[[1]]
    while (length(z) < maxSize)
      z = c(z, "")
 
    formatted = c(formatted, paste(z, collapse="\n"))
  }
 
  formatted
}

The image below illustrates how this can be used:

How to echo a Tab or Other Special Characters in Linux

I’m working on a bash script that needs to insert some text into a file. So I’m using the echo command to do this, like so:

echo "hello, world" > myfile.txt

This works just great. Now I want to insert a special character into the file. So I need to do two things: 1) escape special characters and 2) append rather than write to the file. The special character I want to insert is a tab, which is represented as \t. In order to do this, you have to specify the -e parameter when you call echo. Like so:

echo -e "before\tafter" >> myfile.txt

The myfile.txt file should contain something like this now:

hello, world
before    after

Generate a Random String in Linux

It’s not too hard to generate a random number at the command line in Linux. You can use $RANDOM. But if you want to generate a random string, it is a bit more obscure. The best solution I found (after searching for longer than expected) was the following:

openssl rand -base64 30

This technique uses the openssl utility, which I believe is commonly installed on Linux systems. It generates 30 random bytes and encodes them as base-64, so the resulting string is 40 characters long. If you know of a better way to do this, please let me know.
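If Python is available, another option is the standard-library secrets module. The sketch below is just one possible approach; the function name random_string and the choice of letters plus digits as the alphabet are my own assumptions. Unlike the base-64 output above, the length argument here is the actual number of characters you get back.

```python
import secrets
import string


def random_string(length=30):
    """Build a random string of exactly length letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


token = random_string(30)
print(token)       # different on every run
print(len(token))  # 30
```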

How to Extract Substring in Bash

Let’s say you have a bash variable:

x=abc123def

And you want to get everything in the substring that comes before “def.” You can use awk to do this, but it’s a bit “awk”ward (though very flexible and powerful in general). An easy solution is to do the following:

y=${x%def*}
echo $y # Should print "abc123"

This approach should work on most/all UNIX-like systems, including Mac OS X.

Note: I derived this solution from http://hintsforums.macworld.com/showthread.php?t=101414.

Splitting a String into Array and Accessing Contents by Index in bash

Say you have a variable in a bash script that looks like this:

x="abc_123_def"

Now say you want to split this variable into an array. The character you will use to split it is the underscore (_). So you want to end up with an array of three objects: abc, 123, and def. Then you want to be able to access those objects within the array using index values (for example, index 0 would give you abc, index 1 would give you 123, etc.). Here’s how to do it.

delimiter="_" # this is the split character
declare -a arr # declare the array
arr=(`echo ${x//$delimiter/ }`) # do the split
echo ${arr[0]} # print the first element (abc)
echo ${arr[1]} # print the second element (123)
echo ${arr[2]} # print the third element (def)

The syntax feels awkward to me, but it’s the simplest approach I could find to accomplish this. (I derived my solution from what I read here: http://www.linuxforums.org/forum/red-hat-fedora-linux/157606-bash-split-string-into-array.html.)

Sorting String Values That Contain Numbers in Python

If you have a list of string objects in Python and want to sort them but treat any numeric values as numbers, the regular sort function in Python will not do what you want, because it compares the strings character by character. So if you have the following list:

x = ["3", "1", "10", "2"]

You will get the following if you sort and print it:

print(sorted(x))  # ['1', '10', '2', '3']

I found some code that provided a workaround for this, and I tweaked it slightly. Here it is:

import itertools

## From http://stackoverflow.com/questions/34518/natural-sorting-algorithm
def naturalSort(x):
    def natural_key(s):
        # Group consecutive digit / non-digit characters; digit runs become ints
        return tuple(
            int(''.join(chars)) if isdigit else ''.join(chars)
            for isdigit, chars in itertools.groupby(s, str.isdigit)
        )

    return sorted(x, key=natural_key)

print(naturalSort(x))  # ['1', '2', '3', '10']

Some people call this a “natural sort.”
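One thing to watch for in Python 3 is that integers and strings cannot be compared with each other, so a key that mixes them can fail on values like "file2.txt" versus "10". Below is a sketch of a variation that avoids this. It is my own variation rather than the code from the link above, and the sample file names are hypothetical. It relies on re.split with a capturing group, which always returns text at the even positions and digit runs at the odd positions, so the same positions always hold the same type.

```python
import re


def natural_key(s):
    """Split s into alternating text and number pieces for sorting."""
    pieces = re.split(r"(\d+)", s)
    # Even positions are text (possibly empty), odd positions are digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(pieces)]


names = ["file10.txt", "file2.txt", "file1.txt", "10", "9"]
ordered = sorted(names, key=natural_key)
print(ordered)  # ['9', '10', 'file1.txt', 'file2.txt', 'file10.txt']
```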

How to Scale a Vector of Numbers to One

Let’s say you have a vector of numbers: 1, 2, and 5. And let’s say you have another vector of numbers: 3, 4, 9. These vectors have different ranges, but you believe they actually should have the same range. This scenario probably sounds strange, but it can happen, for example, in cases where genetic data are acquired using two different technologies. Anyway, a simple way to do this is with the following equation (coded in the R statistical package language, but you should get the idea).

y = (x - min(x)) / max((x-min(x)))

This will put the numbers of a given vector, x, on a scale of zero to one. So if you apply this code to both, then they will have the same range. Use caution, though, because having the same range doesn’t necessarily mean the two vectors are comparable from a practical standpoint.
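Here is the same calculation worked through in Python for the two example vectors. This is only a sketch; the function name scale_to_unit is made up, and the check for a zero range is my own addition (the equation divides by zero when all the values are identical).

```python
def scale_to_unit(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lowest = min(values)
    span = max(values) - lowest
    if span == 0:
        raise ValueError("all values are identical, so there is no range to scale")
    return [(v - lowest) / span for v in values]


a = scale_to_unit([1, 2, 5])
b = scale_to_unit([3, 4, 9])
print(a)  # [0.0, 0.25, 1.0]
print(b)  # first value 0.0, last value 1.0, middle value 1/6
```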

When Warnings Don’t Print in R

When you’re running R from the command line (using Rscript), sometimes you might get a message saying that warnings have occurred but weren’t printed to the screen. Sometimes you need to worry about the warnings, and sometimes you don’t. That’s why they’re warnings and not errors. Anyway, when you get that message, you can try printing the warnings by invoking the following:

print(warnings())

However, sometimes this doesn’t work, and it just prints NULL. One way to get around this is to set the following option at the top of your R script.

options(warn=1)

Additionally, if you set warn=2, then all warnings will be converted to errors, and the script will stop executing when an error has occurred. Probably not the behavior you want all the time, but sometimes you might.

Fast Way to Filter Large Python Lists

Let’s say you have a list of Python objects.

x = [["1","a"], ["2", "a"], ["3","b"], ["4","c"]]

And you have a second list of Python objects.

y = ["a", "b"]

Now let’s say you want to filter x so that it only contains objects for which the second value corresponds to the values in y. So in this case you would want the resulting list to look like this.

z = [["1","a"], ["2", "a"], ["3","b"]]

Two well-known ways exist for performing this filtering.

z = [row for row in x if row[1] in y]
z = list(filter(lambda row: row[1] in y, x))

Both of those methods work just great when your lists are relatively small. However, when they are large, these can be very inefficient, because each “row[1] in y” test has to scan through y. One way to work around this problem is to use dictionary objects. Below is an example of how to do this.

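# Map each second value to the indices of ALL rows that contain it
valueIndexDict = {}
for i, row in enumerate(x):
    valueIndexDict.setdefault(row[1], []).append(i)

# Collect the indices for the values in y (skipping values not found in x)
indicesToKeep = sorted(i for value in y for i in valueIndexDict.get(value, []))
z = [x[i] for i in indicesToKeep]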

I haven’t timed this to get an official comparison on which is faster, but in my informal testing I found the latter solution to be MUCH MUCH faster. Give it a try and let me know what you think.
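Another option that is worth trying is to convert y to a set first. Membership tests on a set use hashing rather than scanning, so the list comprehension from earlier stays readable while avoiding the repeated scans of y. This is a minimal sketch using the same example data; the variable name wanted is my own.

```python
x = [["1", "a"], ["2", "a"], ["3", "b"], ["4", "c"]]
y = ["a", "b"]

# Build the set once; each "in wanted" test is then a hash lookup.
wanted = set(y)
z = [row for row in x if row[1] in wanted]
print(z)  # [['1', 'a'], ['2', 'a'], ['3', 'b']]
```

This keeps the rows in their original order and keeps every row whose second value appears in y, including duplicates.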

How to Suppress Error Message When Attempting to Recreate Directory in Linux

Let’s say you have a script that creates a directory as a preliminary step to some type of task that you are performing. If the directory doesn’t exist, you want the script to create the directory. You don’t want to have to modify the script after the directory has been created. If the directory already exists, you want the script to move on its merry way, without displaying any error messages.

One way to do this is to use the -p parameter of the mkdir command. The main purpose of this parameter is to create parent directories if they don’t exist. But you can also use it to suppress error messages that occur when you try to recreate a directory.

Let’s say you are trying to create a directory called “MyDirectory.” Your script might look something like this:

mkdir -p ~/MyDirectory

The first time around the command will create the directory. The second time you run the script (assuming the directory still exists), this command will do nothing and will display no error message.
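If your script is written in Python rather than bash, os.makedirs with exist_ok=True behaves in a similar way: it creates any missing parent directories and does not complain if the directory is already there. The sketch below uses a temporary directory and a hypothetical "parent" folder so it is safe to run anywhere.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "parent", "MyDirectory")

    # First call creates "parent" and "MyDirectory".
    os.makedirs(target, exist_ok=True)

    # Second call finds the directory already there and raises no error.
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)
    print(created)  # True
```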

For loops in R: how to do a “continue”

In R, you can use “for” loops just like in most other programming languages. Sometimes when you are looping through items and you encounter a certain condition, you want to skip over an item and move to the next item in the loop. In some languages, the keyword for doing this is “continue.” In R, you use the keyword “next.” Below is an example.

for (i in 1:10)
{
  if (i == 5)
    next
  print(i)
}

This would give you the following output:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

Troubleshooting Password-less SSH Login on UNIX Systems

There’s a nice way of setting up a UNIX-based server so you can log in without a password. This technique uses public/private key encryption. You set it up one time on your client machine and one time on each server you want to connect to. Then you can SSH into the server without using a password. (Disclaimer: It’s less secure because if someone got hold of your machine, they could log in to the server.) A while ago I was having trouble getting it to work, and I found a really helpful page on the Web that helped me solve the problem. Recently, I had a similar problem but couldn’t find that page. Fortunately, I remembered enough to figure out the problem. So I’m writing this page as a place where people can come to solve such problems.

I won’t go into the basic steps on how to set this up. You can find that in several places on the Web, such as here: http://linuxproblem.org/art_9.html.

If you are still prompted for your password, you can try this:

  1. Make sure the permissions on the server’s ~/.ssh folder are 700. You can set this with the command: chmod 700 ~/.ssh
  2. Make sure the permissions on the server’s ~/.ssh/authorized_keys file are 600. You can set this with the command: chmod 600 ~/.ssh/authorized_keys

Then log out and try to log in again. If you still are asked for your password, add the -vvv option to your ssh command, and that should give some hints on what is going wrong. Please post a comment on this if you find something useful or still need some help.

Calculate the Exponent of a Number in Python

Let’s say you want to square a numeric value in Python. It’s a really easy solution. You use the ** operator. It works for whatever exponent you want.

x = 2
y = x**2  # 4

x = 10
y = x**2  # 100

x = 10
y = x**3  # 1000
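A few related details can be sketched as follows: the built-in pow function does the same job as **, exponents can be fractional or negative, and ** binds more tightly than a leading minus sign. The variable names here are just for illustration.

```python
squared = 7 ** 2             # 49
cubed = pow(10, 3)           # 1000, same as 10 ** 3
root = 9 ** 0.5              # 3.0, a fractional exponent gives a root
inverse = 2 ** -1            # 0.5, a negative exponent gives a reciprocal
wrapped = pow(2, 10, 1000)   # 24, which is (2 ** 10) % 1000

# ** is applied before the unary minus, so this is -(2 ** 2).
negative = -2 ** 2           # -4
positive = (-2) ** 2         # 4

print(squared, cubed, root, inverse, wrapped, negative, positive)
```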

How to Combine Multiple PDF Files For Free

I needed to create a PDF file that is a compilation of multiple separate PDF files that I have. I tried a few ways of going about this. And the best approach that I’ve found is to use LaTeX. It has a package called pdfpages that you can use to build a PDF file (something LaTeX is excellent at doing in general) using other PDF files as input.

Before getting started, you’ll need to install LaTeX. It’s not hard to do, but it can be a pretty big installation. That doesn’t bother me because I use LaTeX anyway, but just wanted to let you know it’s big. Instructions for doing this on Windows are here. Instructions for doing this on a Mac are here.

The next step is to create a simple text file with contents such as the following example. Make sure the file extension of the text file is .tex

I had to make a few tweaks to get it working right, and they are shown in the example.

\documentclass[11pt]{article}
\pdfminorversion=6
\usepackage{pdfpages}
\begin{document}
  \includepdf{Document1.pdf}
  \includepdf[pages=1-last]{MultiPageDocument2.pdf}
  \includepdf{Document3.pdf}
\end{document}

Let’s say you named this text file MyDocument.tex. The next step would be to run the pdflatex command at the command line.

pdflatex MyDocument

If it worked, you should see a file in the same directory entitled MyDocument.pdf containing all the pages.

Please let me know if these instructions are not clear enough.

Stop Mac Mail from Truncating Messages w/ Attachments

I was seeing some weird behavior in Mac Mail. Sometimes when I would attach a file to an email message, it would truncate the text of my email message and attach the truncated text as a text file. This is problematic because the recipient may not realize that the text file contains the rest of the text; and if nothing else, it would be annoying for them to have to open another attachment rather than being able to read the full message in their mail reader. I was about ready to ditch Mac Mail but finally found a workaround. In the Edit menu, you select Attachments, and then select “Always Insert Attachments at End of Messages.” I guess the problem is that Mac Mail gets confused when the attachments are placed in the middle of the text. So if you always have the attachments at the end, it should work fine. Why this setting isn’t a default is beyond me. And why Apple doesn’t consider this a bug is even further beyond me.

Simple Multi-Threading Example in Java

This week I spent some time figuring out how to use multiple threads in Java. Well, really, I just learned the basics. But it was enough to get a functional example working, and it took me less time than I anticipated. I won’t go into a lot of detail on how to do this because others have covered the details. But I did have a bit of a hard time finding a simple, functional example. So I’m providing that here. In the code below, an ArrayList containing String objects is instantiated. Then Callable objects are used to package the logic that will be executed on each string—in this case, it is simply to print the object. Then the Callable objects are executed as the threads become available. Note that I am configuring it to use as many threads as there are processors on the system, but you can specify the number of threads by hand if you prefer.

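import java.util.*;
import java.util.concurrent.*;

public class Main
{
    // invokeAll can throw InterruptedException, so main declares it
    public static void main(String[] args) throws InterruptedException
    {
        ArrayList<String> x = new ArrayList<String>();
        x.add("A");
        x.add("B");
        x.add("C");

        Collection<Callable<Object>> callables = new LinkedList<Callable<Object>>();

        // "final" lets the anonymous class below refer to y
        for (final String y : x)
        {
            callables.add(new Callable<Object>()
            {
                public Object call() throws Exception
                {
                    System.out.println(y);
                    return null; // call() must return a value
                }
            });
        }

        // One thread per available processor
        ExecutorService service = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        service.invokeAll(callables); // blocks until every task has finished
        service.shutdown();
    }
}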

Another approach would be to use a handy tool called Conja by David Soergel. In this case, your code would be something simpler, like the code snippet below:

Parallel.forEach(x, new Function()
{
    public Void apply(String obj)
    {
        System.out.println(obj);
        return null;
    }
});

I haven’t used either method extensively, but I recommend starting with Conja, and if it works for you, go with that instead of the more raw approach.

How to Add a Footnote With No Number in LaTeX

In LaTeX, it is very easy to create footnotes. The simplest way is to use this syntax:

\footnote{put text here}

In some cases, you might want to suppress the number on a footnote. I looked around the web, and a few suggestions were out there, but the simplest one I could find was the following:

\let\thefootnote\relax\footnotetext{put text here}

Depending on what you’re trying to do, you might have to tweak numbers of later footnotes, so keep that in mind. This page has a few hints on how to do this.