Sunday, December 26, 2010

Parsing the contest results with Python

Reading and generating data files is one of the most common tasks I do with Python. Recently, I used python to see what my record was against individual players in the recent AI contest.

First, I grabbed the my results starting here. Then, I grabbed the top 300 players. Using some tricks learned from this book, I then parsed out the parts of the data that I cared about.

Before going any further, what I am about to show is a bunch of hacks that wouldn't work in a "serious" application. There is no error checking of any kind because I intend to only run this code myself from the command line a few times, and then never see it again.

This isn't parsing like what is needed in a compiler. This is messy, quick parsing. The data is messy, so the method seems appropriate. Coming up with formal grammar is probably overkill for this use.

Here is what the game result data looks like, after removing some junk at the top of the file, and then formatting it to fit the page.

< -- stuff from top of file.... >
<thead><tr><th>Time</th>
<th>Opponent</th>
<th>Outcome</th>
<th>&nbsp;</th>
</tr></thead><tbody>  
<tr class="odd">
<td>Dec 1st 17:00:51</td>    
<td><a href="profile.php?user_id=3938">dabino</a></td>    
<td class="game_draw">Draw</td>    
<td><a href="visualizer.php?game_id=9559407">View Game &gt;&gt;</a>
</td>  </tr>  <tr class="even">    
<td>Dec 1st 16:52:57</td>    
<td><a href="profile.php?user_id=8737">Hazard</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9558701">View Game &gt;&gt;</a>
</td>  </tr>  <tr class="odd">    
<td>Dec 1st 16:48:30</td>    
<td><a href="profile.php?user_id=7196">FlameN</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9558244">View Game &gt;&gt;</a>
  </td>  </tr>  <tr class="even">    
<td>Dec 1st 16:42:13</td>    
<td><a href="profile.php?user_id=10663">smloh1</a></td>    
<td class="game_win">Win</td>    
<td><a href="visualizer.php?game_id=9557675">View Game &gt;&gt;</a></td>  </tr>
< -- stuff from bottom of file.... >

All of that, and more, was in one big line in the file. If you look, you can see that all of the interesting parts on are separated by tr tags. So, we can use the string.split method to break the file up into chunks that contain individual results.

def read_result_file(filename):
    text = open(filename).read()
    return text.split("<tr")

I'm going to mention this one more time… there is no error checking in here. If the file doesn't exist, I'll just get the error message at the prompt.

From each line, I want to collect the user id of my opponent, and the result. The user id is contained in a string like "user_id=12345", and the result is in a string like 'class="game_win"'. Both pieces of data occur in one line of text, so I will use a regular expression to try to pull it out, and test that regular expression against each "line" of text from the read_result_file function.

Here is a regular expression to get what I want out of the string. The regex is compiled with the VERBOSE flag, so that I can insert white space and comments into the string to make it more readable.

import re

data_re_string = r'''#
# find the user_id
href="profile\.php\?user_id=
(\d+)"   # the id is here
.*   # ignore all of this
<td\ class=
"game_([a-z]+)"
'''

data_re = re.compile(data_re_string, re.VERBOSE)

Now, we can search all the lines to find matches. Not all of the lines will have a result. For example, the first line has everything before the first <tr> flag. So, the loop will have to handle that.

def pull_games(rows):
    # run the regular expression on every line
    tmp = (data_re.search(line) for line in rows)

    # only keep the results where there is a match.
    # The search function returns None if there isn't a match.
    result = [m.groups() for m in tmp if m]
    return result

Now, we can see if that works.

>>> games = pull_games(read_result_file("games_1.html"))
>>> import pprint
>>> pprint.pprint(games[:10])
[('3938', 'draw'),
 ('8737', 'win'),
 ('7196', 'win'),
 ('10663', 'win'),
 ('7526', 'draw'),
 ('11610', 'loss'),
 ('11985', 'win'),
 ('10464', 'win'),
 ('9325', 'loss'),
 ('6441', 'win')]
>>> 

If we give it a bad file, the exception is printed at the prompt.

>>> lines = read_result_file("bad_file")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/python-2290mSr.py", line 4, in read_result_file
IOError: [Errno 2] No such file or directory: 'bad_file'

My game results were spread over three files, so it would be nice to have a helper function to parse all three at once. Here we go…

def get_all_results(files):
    result = []
    for f in files:
        result.extend(pull_games(read_result_file(f)))
    return result

And I can call it easily like this:

>>> # easily create the file names ...
>>> print ["games_%d.html" % x for x in [1,2,3]]
['games_1.html', 'games_2.html', 'games_3.html']

>>> games = get_all_results(["games_%d.html" % x for x in [1,2,3]])
>>> len(games)
522

>>> pprint.pprint(games[-1])
('14085', 'win')

Let's look at the rankings files. After a bunch of junk, the html looks this (again edited for the page…)

<tbody>  <tr class="odd">
    <td>1</td>
    <td><a href="profile.php?user_id=8565">bocsimacko</a></td>
    <td><a href="country_profile.php?country_id=106">
   <img src="flags/hu.png" />
</a></td>    <td>
<a href="organization_profile.php?org_id=0">Other</a>
</td>    <td><a href="language_profile.php?lang=Lisp">Lisp</a>
</td>    <td>3765</td>  </tr>
  <tr class="even">
    <td>2</td>
    <td><a href="profile.php?user_id=7026">_iouri_</a></td>
    <td><a href="country_profile.php?country_id=2">
<img src="flags/ca.png" />
</a></td>    <td><a href="organization_profile.php?org_id=0">Other</a>
</td>    <td><a href="language_profile.php?lang=C%2B%2B">C++</a>
</td>    <td>3565</td>  </tr>

In this case, the data chunks span multiple lines, so it is not as easy to write a regular expression to just grab everything at once. This is where I turn to what I learned in the text processing book, and write a mini state machine to handle the processing.

(ok, a multiline regex would probably work here too…, but I took this approach instead)

First, I need two regular expressions. One to match against a "ranking" line, and another to match a user_id line. Then, the main function is basically a loop. At each line, it either looks for a ranking, or looks for a user_id, based on what the current state of the machine is. There is a third major state where the code looks for the start of a user record.

Here is the code…

num_match = re.compile('<td>\\s*(\\d+)\\s*</td>')
player_match = re.compile('profile\\.php\\?user_id=(\\d+)">([^<]+)<')

def pull_rankings(text):
    state = "WAIT"
    results = []

    # This is the ranking of the player we are looking at.
    # The default value is -1 to make it easier to spot an error.
    # (if any player has a ranking of -1, the code is wrong..)
    current_rank = -1

    # The states loop from RANK -> PLAYER -> START -> RANK ...
    # except at the start.  At the start the state is WAIT
    # until the code sees the start of the table.
    for line in text:
        if line.startswith("<tbody>"):
            # This only happens once, but it moves the
            # state machine from WAIT to running
            state = "RANK"
        elif state == "START":
            if line.strip().startswith("<tr"):
                state = "RANK"
        elif state == "RANK":
            num = num_match.search(line)
            if num:
                current_rank = num.groups()[0]
                state = "PLAYER"
        elif state == "PLAYER":
            m = player_match.search(line)
            if m:
                pid, name = m.groups()
                # Store the current player, then reset the
                # current rank value to the default.
                results.append((name, pid, current_rank))
                current_rank = -1
                state = "START"
    return results

def pull_all_rankings(files):
    """Another helper to pull multiple files of data"""
    all_values = []
    for f in files:
        all_values.extend(pull_rankings(open(f).readlines()))
    return all_values

The result is a list of tuples of the name, user_id, and ranking of the players.

Next, I want a dictionary to lookup a player's information keyed on the user_id.

def make_lookup(all_rankings):
    lookup = {}
    for value in all_rankings:
        name, id, ranking = value
        lookup[id] = value
    return lookup

Then, I need to find all of the games against the same player.

def group_by_player(results):
    output = {}
    for result in results:
        pid, r = result
        if pid not in output:
            output[pid] = []
        output[pid].append(r)
    return output

Finally, I want to create a table of the results against each player, with the game results converted to win/loss/draw records.

def tabulate_games(games):
    result = [0, 0, 0]
    v = dict([("win", 0),
              ("loss", 1),
              ("draw", 2)])

    for g in games:
        result[v[g]] += 1
    return tuple(result)

def create_results(grouped, lookup):
    result = []
    for pid, games in grouped.iteritems():
        if pid in lookup:
            name, ignore, ranking = lookup[pid]
            result.append((int(ranking), name, tabulate_games(games), pid))
    return result

And then I realize that I was lucky to place where I did..

>>> games = get_all_results(["games_%d.html" % x for x in [1,2,3]])
>>> grouped = group_by_player(games)
>>> lookup = make_lookup(pull_all_rankings(
           ["rankings_%d.html" % x for x in [1,2,3]]))
>>> table = create_results(grouped, lookup)
>>> table.sort()  # the first value in each row is the rating

>>> pprint.pprint(table[:10])
[(1, 'bocsimacko', (0, 9, 0), '8565'),
 (2, '_iouri_', (4, 4, 1), '7026'),
 (3, 'Slin-.-', (4, 5, 0), '11248'),
 (4, '_Astek_', (4, 4, 1), '12009'),
 (5, 'jimrogerz', (3, 6, 1), '9325'),
 (6, 'Accoun', (3, 6, 0), '7423'),
 (7, 'george', (5, 4, 0), '11173'),
 (8, 'GreenTea', (2, 6, 1), '5955'),
 (9, 'asavis', (7, 3, 0), '8348'),
 (10, 'bix0r4ever', (3, 6, 0), '7950')]

Thursday, December 9, 2010

Rock, Paper, Scissors over email

This is just to get this off my chest. During the AI contest, I wanted to create a P2P way for two players to play a match, and keep it fair. The Planetwars game had simultaneous moves, so I wanted a way that the players could each make a move at the same time without requiring one player to send a move in the clear to the other first.

This was mostly an exercise for fun, but I thought it may be useful for people to play each other online for testing. Since then, several players, McLeopold in particular, have worked on making it easier for people to host a game server of their own. Hosting a server to play against a friend would be far easier than fully implementing my idea.

The method I wanted to use was based on commitment schemes, where each player could commit to their choice before sending it in the clear. I learned about this idea in Bruce Schneier's Applied Cryptography.

The idea would be that the players could negotiate a map choice, then play out the game. On each turn, the players would

  • pick their move
  • create and send a commitment string based on the move
  • wait to receive the opponents string
  • send the cleartext move
  • verify the commitment string
  • repeat

Implementing this might be fun, but I have other projects I want to do first. But, I have talked about this too much to just drop it. So, here is some code to demonstrate.

There are plenty of security problems with this code. Someone could cheat this with enough effort. There are some simple things that would make this more cheat resistant... but lets get to the details.

Rock, Paper, Scissors

I won't explain the rules here. For a given move, the player will send "rock", "paper", or "scissors".

Creating the commitment string

First, the player picks their move. Then, they pick some sort of random key to use to create a hash of their move using HMAC and SHA-1. (Again, this was for fun. Don't assume anything about the suitability of this idea for any purpose.) The player must also pick a new key every turn, and the key must be something that is unique and unpredictable.

Here is some Python code to create the commitment string.

import hashlib
import hmac

def run_hash(key, message):
    return hmac.new(key, message, hashlib.sha1).hexdigest()

Then, to create the string, we could do something like this:

>>> move = "rock"
>>> key = "different random key every move"
>>> hash = run_hash(key, move)
>>> print hash
84c90d7bec92c41502e1b47b981cb62f74b043ac

Transmit strings

Nex the players trade commitment strings. After each is received, they then trade the cleartext moves and secret keys. Then, each player can verify that the commitment string matches the secret key and move.

Even more

The same idea could be used to pick the maps to use. Given a set of possible maps that both players agree on, one of them could create a commitment string for each map using a secret key. Then, the player could shuffle this list of strings, and send it to the other player. The other player would pick one of the strings, and by doing so randomly chooses a map. Given the secret key, the second player can verify that nothing funny was going on.

Scheier's book has a short section on "telephone poker" that goes even further. The players can deal out cards randomly without knowing what the other got.

So, I am not pursuing this any more at the moment. I wanted to write a client to play one on one matches, but I think the easy to install tcp server is a more practical idea.