Sunday, December 26, 2010

Parsing the contest results with Python

Reading and generating data files is one of the most common tasks I do with Python. Recently, I used python to see what my record was against individual players in the recent AI contest.

First, I grabbed the my results starting here. Then, I grabbed the top 300 players. Using some tricks learned from this book, I then parsed out the parts of the data that I cared about.

Before going any further, what I am about to show is a bunch of hacks that wouldn't work in a "serious" application. There is no error checking of any kind because I intend to only run this code myself from the command line a few times, and then never see it again.

This isn't parsing like what is needed in a compiler. This is messy, quick parsing. The data is messy, so the method seems appropriate. Coming up with formal grammar is probably overkill for this use.

Here is what the game result data looks like, after removing some junk at the top of the file, and then formatting it to fit the page.

< -- stuff from top of file.... >
<thead><tr><th>Time</th>
<th>Opponent</th>
<th>Outcome</th>
<th>&nbsp;</th>
</tr></thead><tbody>  
<tr class="odd">
<td>Dec 1st 17:00:51</td>    
<td><a href="profile.php?user_id=3938">dabino</a></td>    
<td class="game_draw">Draw</td>    
<td><a href="visualizer.php?game_id=9559407">View Game &gt;&gt;</a>
</td>  </tr>  <tr class="even">    
<td>Dec 1st 16:52:57</td>    
<td><a href="profile.php?user_id=8737">Hazard</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9558701">View Game &gt;&gt;</a>
</td>  </tr>  <tr class="odd">    
<td>Dec 1st 16:48:30</td>    
<td><a href="profile.php?user_id=7196">FlameN</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9558244">View Game &gt;&gt;</a>
  </td>  </tr>  <tr class="even">    
<td>Dec 1st 16:42:13</td>    
<td><a href="profile.php?user_id=10663">smloh1</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9557675">View Game &gt;&gt;</a></td>  </tr>
< -- stuff from bottom of file.... >

All of that, and more, was in one big line in the file. If you look, you can see that all of the interesting parts on are separated by tr tags. So, we can use the string.split method to break the file up into chunks that contain individual results.

def read_result_file(filename):
    text = open(filename).read()
    return text.split("<tr")

I'm going to mention this one more time… there is no error checking in here. If the file doesn't exist, I'll just get the error message at the prompt.

From each line, I want to collect the user id of my opponent, and the result. The user id is contained in a string like "user_id=12345", and the result is in a string like 'class="game_win"'. Both pieces of data occur in one line of text, so I will use a regular expression to try to pull it out, and test that regular expression against each "line" of text from the read_result_file function.

Here is a regular expression to get what I want out of the string. The regex is compiled with the VERBOSE flag, so that I can insert white space and comments into the string to make it more readable.

import re

data_re_string = r'''#
# find the user_id
href="profile\.php\?user_id=
(\d+)"   # the id is here
.*   # ignore all of this
<td\ class=
"game_([a-z]+)"
'''

data_re = re.compile(data_re_string, re.VERBOSE)

Now, we can search all the lines to find matches. Not all of the lines will have a result. For example, the first line has everything before the first <tr> flag. So, the loop will have to handle that.

def pull_games(rows):
    # run the regular expression on every line
    tmp = (data_re.search(line) for line in rows)

    # only keep the results where there is a match.
    # The search function returns None if there isn't a match.
    result = [m.groups() for m in tmp if m]
    return result

Now, we can see if that works.

>>> games = pull_games(read_result_file("games_1.html"))
>>> import pprint
>>> pprint.pprint(games[:10])
[('3938', 'draw'),
 ('8737', 'win'),
 ('7196', 'win'),
 ('10663', 'win'),
 ('7526', 'draw'),
 ('11610', 'loss'),
 ('11985', 'win'),
 ('10464', 'win'),
 ('9325', 'loss'),
 ('6441', 'win')]
>>> 

If we give it a bad file, the exception is printed at the prompt.

>>> lines = read_result_file("bad_file")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/python-2290mSr.py", line 4, in read_result_file
IOError: [Errno 2] No such file or directory: 'bad_file'

My game results were spread over three files, so it would be nice to have a helper function to parse all three at once. Here we go…

def get_all_results(files):
    result = []
    for f in files:
        result.extend(pull_games(read_result_file(f)))
    return result

And I can call it easily like this:

>>> # easily create the file names ...
>>> print ["games_%d.html" % x for x in [1,2,3]]
['games_1.html', 'games_2.html', 'games_3.html']

>>> games = get_all_results(["games_%d.html" % x for x in [1,2,3]])
>>> len(games)
522

>>> pprint.pprint(games[-1])
('14085', 'win')

Let's look at the rankings files. After a bunch of junk, the html looks this (again edited for the page…)

<tbody>  <tr class="odd">
    <td>1</td>
    <td><a href="profile.php?user_id=8565">bocsimacko</a></td>
    <td><a href="country_profile.php?country_id=106">
   <img src="flags/hu.png" />
</a></td>    <td>
<a href="organization_profile.php?org_id=0">Other</a>
</td>    <td><a href="language_profile.php?lang=Lisp">Lisp</a>
</td>    <td>3765</td>  </tr>
  <tr class="even">
    <td>2</td>
    <td><a href="profile.php?user_id=7026">_iouri_</a></td>
    <td><a href="country_profile.php?country_id=2">
<img src="flags/ca.png" />
</a></td>    <td><a href="organization_profile.php?org_id=0">Other</a>
</td>    <td><a href="language_profile.php?lang=C%2B%2B">C++</a>
</td>    <td>3565</td>  </tr>

In this case, the data chunks span multiple lines, so it is not as easy to write a regular expression to just grab everything at once. This is where I turn to what I learned in the text processing book, and write a mini state machine to handle the processing.

(ok, a multiline regex would probably work here too…, but I took this approach instead)

First, I need two regular expressions. One to match against a "ranking" line, and another to match a user_id line. Then, the main function is basically a loop. At each line, it either looks for a ranking, or looks for a user_id, based on what the current state of the machine is. There is a third major state where the code looks for the start of a user record.

Here is the code…

num_match = re.compile('<td>\\s*(\\d+)\\s*</td>')
player_match = re.compile('profile\\.php\\?user_id=(\\d+)">([^<]+)<')

def pull_rankings(text):
    state = "WAIT"
    results = []

    # This is the ranking of the player we are looking at.
    # The default value is -1 to make it easier to spot an error.
    # (if any player has a ranking of -1, the code is wrong..)
    current_rank = -1

    # The states loop from RANK -> PLAYER -> START -> RANK ...
    # except at the start.  At the start the state is WAIT
    # until the code sees the start of the table.
    for line in text:
        if line.startswith("<tbody>"):
            # This only happens once, but it moves the
            # state machine from WAIT to running
            state = "RANK"
        elif state == "START":
            if line.strip().startswith("<tr"):
                state = "RANK"
        elif state == "RANK":
            num = num_match.search(line)
            if num:
                current_rank = num.groups()[0]
                state = "PLAYER"
        elif state == "PLAYER":
            m = player_match.search(line)
            if m:
                pid, name = m.groups()
                # Store the current player, then reset the
                # current rank value to the default.
                results.append((name, pid, current_rank))
                current_rank = -1
                state = "START"
    return results

def pull_all_rankings(files):
    """Another helper to pull multiple files of data"""
    all_values = []
    for f in files:
        all_values.extend(pull_rankings(open(f).readlines()))
    return all_values

The result is a list of tuples of the name, user_id, and ranking of the players.

Next, I want a dictionary to lookup a player's information keyed on the user_id.

def make_lookup(all_rankings):
    lookup = {}
    for value in all_rankings:
        name, id, ranking = value
        lookup[id] = value
    return lookup

Then, I need to find all of the games against the same player.

def group_by_player(results):
    output = {}
    for result in results:
        pid, r = result
        if pid not in output:
            output[pid] = []
        output[pid].append(r)
    return output

Finally, I want to create a table of the results against each player, with the game results converted to win/loss/draw records.

def tabulate_games(games):
    result = [0, 0, 0]
    v = dict([("win", 0),
              ("loss", 1),
              ("draw", 2)])

    for g in games:
        result[v[g]] += 1
    return tuple(result)

def create_results(grouped, lookup):
    result = []
    for pid, games in grouped.iteritems():
        if pid in lookup:
            name, ignore, ranking = lookup[pid]
            result.append((int(ranking), name, tabulate_games(games), pid))
    return result

And then I realize that I was lucky to place where I did..

>>> games = get_all_results(["games_%d.html" % x for x in [1,2,3]])
>>> grouped = group_by_player(games)
>>> lookup = make_lookup(pull_all_rankings(
           ["rankings_%d.html" % x for x in [1,2,3]]))
>>> table = create_results(grouped, lookup)
>>> table.sort()  # the first value in each row is the rating

>>> pprint.pprint(table[:10])
[(1, 'bocsimacko', (0, 9, 0), '8565'),
 (2, '_iouri_', (4, 4, 1), '7026'),
 (3, 'Slin-.-', (4, 5, 0), '11248'),
 (4, '_Astek_', (4, 4, 1), '12009'),
 (5, 'jimrogerz', (3, 6, 1), '9325'),
 (6, 'Accoun', (3, 6, 0), '7423'),
 (7, 'george', (5, 4, 0), '11173'),
 (8, 'GreenTea', (2, 6, 1), '5955'),
 (9, 'asavis', (7, 3, 0), '8348'),
 (10, 'bix0r4ever', (3, 6, 0), '7950')]

No comments: