An anagram measure

Those like me with a love of the byways of the English language, or with a love of cryptic crosswords, will no doubt have collected over the years a private trove of single word anagrams. One of my favourites (because both words are common) is

ORCHESTRA, CARTHORSE

Clearly smaller words are likely to have more anagrams than longer ones:

OPST, POTS, POST, STOP, SPOT, TOPS

From the Anagrams FAQ comes this lovely example, of almost familiar words (at least, they are not scientific or technical terms):

INTERROGATIVES, REINVESTIGATOR, TERGIVERSATION

From an anagrams programming page:

ANGOR, ARGON, GORAN, GRANO, GROAN, NAGOR, ORANG, ORGAN, ROGAN

several of which are unfamiliar to me as English words. I’m not allowing proper names.

How can we measure the “anagrammability” of a set of letters? Some account should be given to the number of anagrams, and to the length of the word. A long word with only two anagrams should get a higher score than a small word with many. Let n be the number of letters, and k the number of (single word) English anagrams. My anagrammability measure (discussed with my eldest son) is:

(k-1)n^2.

This gives a score of zero if there are no anagrams other than the word itself, and is weighted for large words.

With the examples above:

ORCHESTRA: 81
OPST: 80
INTERROGATIVES: 392
ANGOR: 200

According to the Anagrams FAQ again, the set of letters with the most one word anagrams is AEINRST with:

ANESTRI, ASTERIN, ERANIST, NASTIER, RATINES, RESIANT, RESTAIN, RETAINS, RETINAS, RETSINA, SAINTER, STAINER, STARNIE, STEARIN

for a score of 637. (I’ve never seen some of these words before, but apparently they can all be found in Merriam-Webster’s unabridged dictionary).

It’s not hard to experiment with anagrams in Sage; all you need is a wordlist, which you can read in, turn into a list and strip the trailing non-printing characters. You can download a list of 109582 words here, then save it as, say,

words.txt

.
Then:

f = open("words.txt","r")
nwords = f.readlines()
words = []
for i in nwords:
   words.append(i.strip('\r\n'))

Then we can write a simple program which given any string, checks if each permutation is in the list we’ve just created:

def find_anagrams(myword):
   anagrams=[]
   for i in permutations(myword):
       anagrams.append(join(i,""))
   for i in anagrams:
       if i in words:
           print i

For example:

sage: find_anagrams('opst')
opts
post
pots
spot
stop
tops
sage: find_anagrams('aeirnst')
retains
retinas
retsina
nastier
stainer
stearin
sage: find_anagrams('organ')
orang
organ
groan
argon

Clearly this wordlist doesn’t include all the words listed earlier. Clearly also this method is hideously inefficient – it creates all the possible anagrams first, and then checks if each one is in the wordlist. A much better method, for words of say, 8 letters or longer, would be to first set up the wordlist into sublists of words with the same number of letters. Then for your entered word, you simply check if each word in your list is an anagram of your entered word.

And here’s how to do that. First we discover that the longest word in our list has 28 letters:

words = [[] for i in range(28)]
for i in nwords:
   li=len(i)
   words[li-3].append(i.strip('\r\n'))

and a little procedure to test anagrams:

def isanagram(w1,w2):
    l1=list(w1)
    l2=list(w2)
    l1.sort()
    l2.sort()
    return l1==l2

and finally the program to find anagrams in our New Improved List:

def find2_anagrams(myword):
    lm=len(myword);
    for i in words[lm-1]:
        if isanagram(i,myword):
            print i

A quick test of it:

sage: find2_anagrams('retains')
nastier
retains
retinas
retsina
stainer
stearin

This is much faster than the first method.

Addendum

Running this over the entire word list, the highest scoring group of letters I found was AEIGLNRT, with

ALERTING, ALTERING, INTEGRAL, RELATING, TANGLIER, TRIANGLE

for a score of 320.

A much longer list of words is that created as part of the Moby Project; “Moby Words” can be downloaded from this page, and the file single.txt contains a splendid 354,984 words.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 44 other followers

%d bloggers like this: