How to Properly Sort Strings Containing Numbers in Python

Let’s say you have a list of strings and want to be able to sort them. Python makes this extremely easy to do, which is something I really love about it.

x = ["abc", "def", "aaa"]
x.sort() // would result in ['aaa', 'abc', 'def']

But what if you have numbers embedded in those strings? Most of us have seen on computers that this can cause problems. The following example illustrates what I mean.

x = ["x1", "x2", "x10"]
x.sort() // would result in ['x1', 'x10', 'x2']

In this case, x10 is sorted ahead of x2, even though you might think it would be the opposite because as a human you know that 2 comes before 10. But Python just looks at the first character rather than the full number (which in some cases is desirable).

So to do a “numerically aware string comparison,” you would want to group those numbers into a single number and do the comparison that way. I was able to find a library that was published back in 2002 for Python, but I couldn’t get it to run on my Windows machine. So I decided to implement something on my own and see how simple I could make it. I figured I would also run their tests on my code (plus some other tests) to make sure it was working properly.

Here is a Python class that asks for a string value as input:

class NumericString:
    def __init__(self, rawValue):
        self.rawValue = rawValue
 
    def __cmp__(self, other):
        if self.rawValue == other.rawValue: return 0
        if self.rawValue == None: return -1
        if other.rawValue == None: return 1
        if self.rawValue == "": return -1
        if other.rawValue == "": return 1
 
        xList = self.getValueList(self.rawValue)
        yList = self.getValueList(other.rawValue)
 
        for i in range(min(len(xList), len(yList))):
            compareResult = self.compareTwoVals(xList[i], yList[i])
 
            if compareResult != 0:
                return compareResult
 
        return cmp(len(xList), len(yList))
 
    def compareTwoVals(self, xVal, yVal):
        if xVal.isdigit() and yVal.isdigit():
            return cmp(int(xVal), int(yVal))
        if not xVal.isdigit() and not yVal.isdigit():
            return cmp(xVal, yVal)
        if xVal.isdigit():
            return -1
        return 1
 
    def getValueList(self, rawValue):
        values = []
        tempVal = ""
 
        for i in range(len(rawValue)):
            val = rawValue[i]
 
            if val.isdigit():
                tempVal += str(val)
            else:
                if tempVal != "":
                    values.append(tempVal)
                    tempVal = ""
 
                values.append(val)
 
        if tempVal != "":
            values.append(tempVal)
 
        return values

This only requires 51 lines of code.

Also, here’s a function that accepts a list of string objects, performs the numerically aware sort, and returns the sorted list.

def SortNumericStringList(original):
    newList = [NumericString(x) for x in original]
    newList.sort()
    return [x.rawValue for x in newList]

The best way to illustrate how you would invoke these is to use unit tests. These invoke the code and make sure the results are what you would expect. The ability to do unit tests is built into Python, so it’s really easy.

import unittest
from SortHelper import *
 
class SortHelperTests(unittest.TestCase):
    def testNumericString(self):
        self.assert_(NumericString(None) == NumericString(None))
        self.assert_(NumericString(None)  NumericString(None))
        self.assert_(NumericString("")  NumericString(""))
        self.assert_(NumericString("1") == NumericString("1"))
        self.assert_(NumericString("1") < NumericString("2"))
        self.assert_(NumericString("1") < NumericString("10"))
        self.assert_(NumericString("10") < NumericString("100"))
        self.assert_(NumericString("2") < NumericString("10"))
        self.assert_(NumericString("abc1") == NumericString("abc1"))
        self.assert_(NumericString("abc1") < NumericString("abc2"))
        self.assert_(NumericString("abc1") < NumericString("abc10"))
        self.assert_(NumericString("abc10") < NumericString("abc100"))
        self.assert_(NumericString("abc2") < NumericString("abc10"))
        self.assert_(NumericString("abc1abc1") < NumericString("abc1abc2"))
        self.assert_(NumericString("abc1abc1") < NumericString("abc1abc10"))
        self.assert_(NumericString("abc1abc10") < NumericString("abc1abc100"))
        self.assert_(NumericString("abc1abc2") < NumericString("abc1abc10"))
        self.assert_(NumericString("abc1abc2")  NumericString("abc1abc2"))
        self.assert_(NumericString("abc1")  NumericString("abc1a"))
        self.assert_(NumericString("abc456") < NumericString("abc456a"))
        self.assert_(NumericString("abc456a") < NumericString("abc456b"))
        self.assert_(NumericString("abc45b") < NumericString("abc456a"))
 
    def testNumericStringSort(self):
        self.assertEqual(SortNumericStringList(['abc','def']), ['abc','def'])
        self.assertEqual(SortNumericStringList(['abc','']), ['','abc'])
        self.assertEqual(SortNumericStringList(['1','1']), ['1','1'])
        self.assertEqual(SortNumericStringList(['1','2','3']), ['1','2','3'])
        self.assertEqual(SortNumericStringList(['2','3','1']), ['1','2','3'])
        self.assertEqual(SortNumericStringList(['abc2','abc3','abc1']), ['abc1','abc2','abc3'])
        self.assertEqual(SortNumericStringList(['abc2','abc1','abc10']), ['abc1','abc2','abc10'])
        self.assertEqual(SortNumericStringList(['abc2a','abc1zabcq','abc10qqq']), ['abc1zabcq','abc2a','abc10qqq'])
        self.assertEqual(SortNumericStringList(['abc2','abc1zabcq','abc10qqq']), ['abc1zabcq','abc2','abc10qqq'])
        self.assertEqual(SortNumericStringList(['abc1','abc','abc10']), ['abc','abc1','abc10'])
 
        numacompList = ['', '0', '00',
                        '1', '01',
                        '2', '12', '012',
                        'A', 'A0', 'A00',
                        'A1','A2','A3','A019','A20',
                        'A020','A20B', #changed these around for personal preference
                        'A020B','A020Bxfred', 
                        'A020Bxfred0','A020Bxfred00','A020Bxfred1',
                        'A020Bxfred01','A020Bxfred1N', #changed these around for personal preference
                        'A020Bxfred10','A0021','A22'
                        'Num544555666777888',
                        'Num5444555666777888',
                        'Num05444555666777888',
                        'Num5444555666777889',
                        'Num14445556667778890']
 
        self.assertEqual(SortNumericStringList(numacompList), numacompList)
 
unittest.main()

Hopefully this is helpful. Please let me know if I can clarify anything. One important point is that there is no “correct” way to do this. You may want to slightly tweak how you do it.

This site explains a bit more about the problem and references some implementations in a few programming languages.

This is the site for the original implementation in Python. Look at the documentation if you’re interested, because it contains a helpful explanation of the algorithm they used.

Note: For an example of how to do this in Java, see this post.

One Response to “How to Properly Sort Strings Containing Numbers in Python”

  1. Thanks very much for providing this code. I find it useful. I have discovered two issues with it:

    -There are four lines of unit tests that do not contain a comparison operator.
    -Uppercase strings seem to be sorted such that they come before lowercase strings. For example, compare ‘M List’ and ‘h80′, with the former coming before. Why is this? It might have been better to do case independent sorting.

    Please email me if the script is updated. My email is:
    ‘axmyiyty.ybxexlxaynxix@xhzoxtymxayixly.xcyozmx’[::2]

    Thanks

Leave a Reply