Archived Blog

What do a philologist and a lollipop have in common?

02.24.2010 - 12:00 PM


Question: What do a philologist and a lollipop have in common?
Answer: LOL (if you don't get it, you will LOL when you see it below)

The problem

Given a few strings:
  ewf3hardyharharoiew
  p90weuhardyharhar
  hardyharharoie78wjf
  ahardyharhar787834
Determine the longest substring that they all have in common:
 p90weuhardyharhar
       |||||||||||
      ahardyharhar787834
       |||||||||||
   ewf3hardyharharoiew
       |||||||||||
     d1hardyharharoie78wjf
Which, in the example above, is 'hardyharhar'. The question is, how do you code this?

The solution

Today, on Hug-An-Algorithm day, we're giving the hat-tip to an algorithm called longest common substring (LCS).

The solution can be implemented using a generalized suffix tree, or by dynamic programming (Wikipedia). Specifically, I'll walk you through a Python implementation, but we'll focus on the approach behind it so that you can implement this in any language.

I found a few LCS implementations on Wikibooks.org, which is where I got this simple Python implementation (adapted slightly for this demo).

Technical nitty-gritty details warning! If you're only interested in how this may be useful to you, skip to the Real-world applications section.

For best viewing, select full screen and high quality, please.

Code from the video for plugging into your Python interpreter:
 
def LCS(S, T):
    m = len(S); n = len(T)
    L = [[0] * (n+1) for i in xrange(m+1)]
    LCS_set = set()
    longest = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                v = L[i][j] + 1
                L[i+1][j+1] = v
                if v > longest:
                    longest = v
                    LCS_set = set()
                if v == longest:
                    LCS_set.add(S[i-v+1:i+1])
    print LCS_set


Real world application




As researchers, we often find ourselves swimming in a sea of seemingly random data, and we're always looking for ways to make sense of this wealth of data, how we can obtain insights and the act accordingly.

This is how I use the LCS algorithm to find patterns in malicious links. Obviously, it can be applied to any kind of data outside the boundaries of security research. I hope you'll find this tool handy in your problem-solving toolkit!

Security Researcher: Jay Liew

PS: This blog post is dedicated to the folks in #python on Freenode.

Bookmark This Post: