# Deep reinforcement learning for checkers -- pretraining a policy

This post discusses an approach to approximating a checkers-playing policy using a neural network trained on a database of human play. It is the first post in a series covering the engine behind checkers.ai, which uses a combination of supervised learning, reinforcement learning, and game tree search to play checkers. The overall approach is based on AlphaGo (see this paper). Unlike Go, checkers has a known optimal policy, found through exhaustive tree search and endgame databases (see this paper). Although Chinook has already solved checkers, the game is still very complex, with over 5 x 10^20 states, making it an interesting application of deep reinforcement learning. To give you an idea, it took Jonathan Schaeffer and his team at the University of Alberta 18 years (1989-2007) to search the game tree end-to-end. Here I will discuss how we can use a database of expert human moves to pretrain a policy, which will be the building block of the engine. I’ll show that a pretrained policy can beat intermediate/advanced human players as well as some of the popular online checkers engines. In later posts we’ll take this policy and improve it through self-play.

## Problem statement

In reinforcement learning, the goal is to map states, or best estimates of states, to actions that achieve some set of objectives optimally or near-optimally. States represent an environment, and actions are taken in that environment so as to maximize some reward function. States can be continuous or discrete. Continuous state spaces are encountered in robotics settings, whereas for computer board games we use discrete state spaces to represent the environment. In both cases the policy is defined as the mapping from states to actions, and is often represented by the conditional probability distribution $$p(action | state)$$. Unlike many real-world reinforcement learning problems, checkers is a game of perfect information, so we don’t need to infer or estimate the state. This greatly reduces the complexity of the problem, but the state space is still much too large to store on any reasonably sized/priced memory or disk hardware. The goal is therefore to use a parametric model $$p(action | state, \theta)$$, where $$\theta$$ represents the model’s trainable parameters, to learn good strategy. To initialize such a policy with good checkers-playing behavior, we can train it on human expert moves. This is a supervised learning problem: we use examples in a database drawn from the joint probability distribution $$p(expert-human-action, board)$$ to train our neural network policy $$p(action | state, \theta)$$.

## Training database

The first task is to identify a database of moves on which to train the policy. One option would be to mine data from online checkers websites. While this could be done, it would require a lot of data preprocessing, and although online game sites collect a lot of data, much of it is high-variance amateur play. We instead prefer to train our engine only on expert play. Luckily, Martin Fierz maintains an extensive database of ~22k expert-level games, which can be found here. This database consists of professional checkers tournament play dating back to 19th-century Great Britain. It was meticulously transcribed (by hand) with very few errors, so we’re going to use this dataset. The database is a text file consisting of a cascade of games encoded as follows:

```
[Event "Manchester 1841"]
[Date "1841-??-??"]
[White "Wyllie, J."]
[Site "Manchester"]
[Result "0-1"]
1. 11-15 24-20 2. 8-11 28-24 3. 9-13 22-18 4. 15x22 25x18 5. 4-8 26-22 6. 10-14
18x9 7. 5x14 22-18 8. 1-5 18x9 9. 5x14 29-25 10. 11-15 24-19 11. 15x24 25-22 12.
24-28 22-18 13. 6-9 27-24 14. 8-11 24-19 15. 7-10 20-16 16. 11x20 18-15 17. 2-6
15-11 18. 12-16 19x12 19. 10-15 11-8 20. 15-18 21-17 21. 13x22 30-26 22. 18x27
26x17x10x1 0-1
```


The parser.py script in my GitHub repository walks through this file and saves each of its 440k+ board-move pairs into a dict. The dict is then serialized (pickled) for later use.
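To make the move encoding concrete, here is a minimal sketch (not the actual parser.py, whose details live in the repository) of pulling the move tokens out of one PDN game record with a regular expression:

```python
import re

# One abbreviated PDN record in the format shown above (illustrative input)
pdn = "1. 11-15 24-20 2. 8-11 28-24 3. 15x22 0-1"

# Moves are 'a-b' slides or 'a x b' jumps; the pattern skips the bare move
# numbers ('1.', '2.', ...) and we filter out the result token afterwards
tokens = re.findall(r'\b\d+(?:[-x]\d+)+\b', pdn)
moves = [t for t in tokens if t not in ('1-0', '0-1')]
```

Each token can then be split on `-` or `x` to recover the board squares involved in the move.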

## Model architecture

To encode the board state, I use the map (-3: opp king, -1: opp chkr, 0: empty, 1: own chkr, 3: own king) to represent pieces on the board. Other entries might work just as well; as a general rule you should make the entries quasi-equidistant, and stay away from machine precision. The input layer of the neural network, $$x$$, can consist of either an 8x4 representation (for a CNN) or 32x1 (for a feedforward NN) depending on our choice of network. In order to capture both highly localized features (i.e., an available jump) and high-level patterns, like multiple jumps, proximity to kings row, and piece differential, I feed the input into three separate layers: a convolutional layer with a 2x2 kernel, a convolutional layer with a 3x3 kernel, and a fully connected feedforward layer. These networks are first trained independently on the dataset, and then their last hidden layers are concatenated and fed into two dense layers that in turn map to a softmax output. The full network is trained end to end on the same database following concatenation of the individual nets. The figure below illustrates the network architecture.
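As a concrete illustration of this encoding, here is how a hypothetical opening position might be packed into the two input shapes (a sketch; the real preprocessing lives in parser.py and board.py):

```python
import numpy as np

# Piece encoding used throughout: -3 opp king, -1 opp checker, 0 empty,
# 1 own checker, 3 own king. From the mover's perspective at the start of
# a game, own checkers sit on squares 1-12 and the opponent's on 21-32.
board = np.zeros(32, dtype=np.float32)
board[:12] = 1      # own checkers
board[20:] = -1     # opponent checkers

x_cnn = board.reshape(8, 4)    # 8x4 input for the convolutional net
x_ff = board.reshape(32, 1)    # 32x1 input for a feedforward net
```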

The output layer ($$h(x)$$) consists of a location and direction represented by a 32x4 matrix with binary entries, which we then unravel into a 128x1 one-hot encoded output. You’ll note that this produces a total of 30 invalid moves at edge and corner positions; our model will need to learn this. The Board class definition is given in board.py. The board object stores board states for both players during the game by switching the input positions for the ‘black’ pieces (by rotating the board 180 deg). If you run parser.py (after downloading the raw text file, OCA_2.0.pdn), it will save a .pickle file into the local directory, which you will need for the training step.
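The 180-degree rotation trick can be sketched in a couple of lines; the sign flip makes ‘own’ pieces positive from whichever side is about to move (illustrative code, not the actual Board implementation):

```python
import numpy as np

def to_mover_perspective(board_8x4):
    """Rotate the 8x4 board 180 degrees and flip piece signs, so the same
    network always sees the position from the side to move."""
    return -board_8x4[::-1, ::-1]
```

Applying the transform twice recovers the original board, which is a handy sanity check.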

The convolutional neural network that I chose for this task is below. It accepts an 8x4 input ($$x$$), and returns a 128 dimensional output ($$h(x)$$).

The algebraic form of the convolution operation, which maps layer $$l=0$$ to layer $$l=1$$ and layer $$l=1$$ to layer $$l=2$$, is given by:

$$a_{mij}^{l} = \left( \textbf{w}_{m}^{l} * \textbf{z}_{m}^{* \ l-1} + \textbf{b}_{m}^{l} \right)_{ij} \leftarrow b_{mij}^{l} + \sum_{n}\sum_{q} z_{m, \ i-n, \ j-q}^{* \ l-1}w_{mnq}^{l}$$

where $$a_{mij}^{l}$$ represents the element $$\left(i,j\right)$$ in the $$m^{th}$$ feature map in layer $$l$$, $$\textbf{w}_{m}^{l}$$ is the kernel (of size 2x2) corresponding to the $$m^{th}$$ feature map in layer $$l$$, $$\textbf{b}_{m}^{l}$$ is the bias term corresponding to the $$m^{th}$$ feature map in layer $$l$$, and $$\textbf{z}_{m}^{l-1}$$ is equal to $$\textbf{x}$$ for $$l=0$$, and to $$max(0, a_{mij}^{l-1})$$ (i.e., the rectified linear unit) for $$l=1$$. The operator * denotes the convolution between the kernel and its input, while $$\textbf{z}^{*}$$ denotes a matrix representation of $$\textbf{z}$$ that has been padded with a box of zeros around its perimeter. Padding is a technique that is often used to preserve information at the corners and edges of the input space. Here we implement ‘same’ padding along with a kernel stride of (1,1).
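For concreteness, here is a naive NumPy sketch of one ‘same’-padded, stride-(1,1) feature map (like TensorFlow’s conv2d, this computes a cross-correlation, i.e., the kernel is not flipped):

```python
import numpy as np

def conv2d_same(z, w, b):
    """'Same' zero-padded 2-D sliding-window map with stride (1, 1).

    z : (H, W) input feature map; w : (kh, kw) kernel; b : scalar bias.
    Returns an (H, W) pre-activation map a, as in the equation above.
    """
    kh, kw = w.shape
    ph, pw = kh - 1, kw - 1
    # Pad so the output keeps the input's spatial size
    zp = np.pad(z, ((ph // 2, ph - ph // 2), (pw // 2, pw - pw // 2)))
    H, W = z.shape
    a = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            a[i, j] = b + np.sum(zp[i:i + kh, j:j + kw] * w)
    return a
```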

The next operation in the network vectorizes $$\textbf{z}_{m}^{\ l=2}$$, unravelling it by row, column, and then feature map (i.e., C-order). Two fully connected mappings are then applied in succession to yield the output at layer $$l=4$$. The algebraic form of this mapping function is given by:

$$\textbf{z}^{l} = \sigma^{l} \left( \textbf{w}^{l} \textbf{z}^{\ l-1} + \textbf{b}^{\ l} \right)$$

where $$\sigma^{l=3}$$ is a rectified linear unit, and $$\sigma^{l=4}$$ is the softmax function given by:

$$\sigma^{l=4}(z^{l=4})_{i} = \frac{e^{z_{i}^{l=4}}}{\sum_{j=0}^{n} e^{z_{j}^{l=4}}}$$

The softmax function is a normalized exponential of the outputs from layer 4, and thus can be interpreted as a probability distribution over the output space. The model prediction is given by:

$$\mathsf{predicted \ move} = argmax\left( \sigma^{l=4}(\textbf{z}^{l=4}) \right)$$

where $$n = 128$$, the number of output labels.
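In NumPy, the softmax-then-argmax prediction step looks like this (a three-logit toy example stands in for the 128-dimensional output):

```python
import numpy as np

def softmax(z):
    """Normalized exponential over the output logits."""
    e = np.exp(z - z.max())    # shift by the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)            # probability distribution over moves
move = int(np.argmax(p))       # predicted-move index
```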

## Training the model

In order to train efficiently, it is important to choose an appropriate loss function ($$L$$). Here I use categorical cross-entropy with l2 regularization, computed as follows:

$$L(h(x)) = -\frac{1}{n}\sum_{x} \sum_{i} y_{i} \ln\left(h(x)_{i}\right) + \lambda\left( \sum_{l=1}^{2} \sum_{m=0}^{M} \lVert \textbf{w}_{m}^{l} \rVert + \sum_{l=3}^{4} \lVert \textbf{w}^{l} \rVert \right)$$

where $$y$$ denotes the one-hot ground truth label, and $$\lambda$$ is the regularization constant. While classification error is a binary measure, cross-entropy is a measure of ‘closeness’, which produces better gradients and thus allows the model to converge more efficiently. During training we skip the explicit softmax of the prediction step and feed the layer-4 logits directly into a fused softmax cross-entropy op (only when computing the loss), which is numerically more stable.
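A NumPy sketch of that fused computation, using the standard log-sum-exp shift for numerical stability:

```python
import numpy as np

def softmax_xent(logits, y):
    """Cross-entropy computed directly from logits via log-sum-exp,
    the same trick TensorFlow's fused softmax cross-entropy op uses."""
    z = logits - logits.max()                 # shift; doesn't change softmax
    log_probs = z - np.log(np.exp(z).sum())   # log softmax in one pass
    return -np.sum(y * log_probs)             # categorical cross-entropy
```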

I used TensorFlow to implement the forward version of this network, along with the backpropagation and stochastic gradient descent steps that optimize its parameters. I plan to derive the backprop and update steps in the near future; for now I’ll just show how they are implemented in TensorFlow.

```python
import numpy as np
import pandas as pd
from six.moves import cPickle as pickle
import tensorflow as tf
import time


def accuracy(preds, labs):
    # preds holds ranks 5..1 at each sample's top-5 indices; labs is one-hot,
    # so the product recovers the rank of the true move (0 if missed)
    n = preds.shape[0]
    acc_vec = np.ndarray([n, 1])
    for i in range(n):
        acc_vec[i, 0] = sum(preds[i, :].astype(int) * labs[i, :].astype(int))
        assert acc_vec[i, 0] in range(0, 6)
    acc_score = list()
    for j in range(1, 6):
        percentage = len(np.argwhere(acc_vec == j)) / float(n)
        acc_score.append(round(percentage, 4))
    return acc_score


def deepnet(num_steps, lambda_loss, dropout_L1, dropout_L2, ckpt_dir):

    # Computational graph
    graph = tf.Graph()
    with graph.as_default():

        # Inputs
        tf_xTr = tf.placeholder(tf.float32, shape=[batch_size, board_height, board_width, num_channels])
        tf_yTr = tf.placeholder(tf.float32, shape=[batch_size, label_height * label_width])
        tf_xTe = tf.constant(xTe)
        tf_xTr_full = tf.constant(xTr)

        # Variables
        w1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, num_channels, depth], stddev=0.1), name='w1')
        b1 = tf.Variable(tf.zeros([depth]), name='b1')
        w2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, depth, depth], stddev=0.1), name='w2')
        b2 = tf.Variable(tf.zeros([depth]), name='b2')
        w3 = tf.Variable(tf.truncated_normal([board_height * board_width * depth, num_nodes_layer3], stddev=0.1), name='w3')
        b3 = tf.Variable(tf.zeros([num_nodes_layer3]), name='b3')
        w4 = tf.Variable(tf.truncated_normal([num_nodes_layer3, num_nodes_output], stddev=0.1), name='w4')
        b4 = tf.Variable(tf.zeros([num_nodes_output]), name='b4')

        def model(xtrain, dropout_switch):

            # First convolutional layer
            c1 = tf.nn.conv2d(xtrain, w1, strides=[1, 1, 1, 1], padding='SAME')
            h1 = tf.nn.relu(c1 + b1)
            h1_out = tf.nn.dropout(h1, 1 - dropout_L1 * dropout_switch)

            # Second convolutional layer
            c2 = tf.nn.conv2d(h1_out, w2, strides=[1, 1, 1, 1], padding='SAME')
            h2 = tf.nn.relu(c2 + b2)
            h2_out = tf.nn.dropout(h2, 1 - dropout_L1 * dropout_switch)

            # Reshape for the fully connected layer ('SAME' padding and unit
            # strides preserve the input's spatial dimensions)
            h2_shape = h2_out.get_shape().as_list()
            h2_out_vec = tf.reshape(h2_out, shape=[h2_shape[0], board_height * board_width * depth])

            # First fully connected layer
            y3 = tf.matmul(h2_out_vec, w3) + b3
            h3 = tf.nn.relu(y3)
            h3_out = tf.nn.dropout(h3, 1 - dropout_L2 * dropout_switch)

            # Model output (logits)
            return tf.matmul(h3_out, w4) + b4

        logits = model(tf_xTr, 1)
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, tf_yTr))
        loss += lambda_loss * (tf.nn.l2_loss(w1) + tf.nn.l2_loss(w2) + tf.nn.l2_loss(w3) + tf.nn.l2_loss(w4))

        # Optimizer (momentum SGD with exponentially decaying learning rate)
        batch = tf.Variable(0)
        learning_rate = tf.train.exponential_decay(0.01, batch * batch_size, nTr, 0.95, staircase=True)
        optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=batch)

        # Predictions for the training and test data
        preds_Tr = tf.nn.softmax(model(tf_xTr_full, 0))
        preds_Te = tf.nn.softmax(model(tf_xTe, 0))

    # Feed data into the graph, run the model
    with tf.Session(graph=graph) as session:

        var_dict = {'w1': w1, 'b1': b1,
                    'w2': w2, 'b2': b2,
                    'w3': w3, 'b3': b3,
                    'w4': w4, 'b4': b4}
        saver = tf.train.Saver(var_dict)

        # Run model
        tf.initialize_all_variables().run()
        print('Graph initialized ...')
        t = time.time()
        for step in range(num_steps):
            offset = (step * batch_size) % (nTr - batch_size)
            batch_data = xTr[offset:(offset + batch_size), :]
            batch_labels = yTr[offset:(offset + batch_size), :]
            feed_dict = {tf_xTr: batch_data, tf_yTr: batch_labels}
            _ = session.run([optimizer], feed_dict=feed_dict)

            if step % 5000 == 0:
                l, preds_Train, preds_Test = session.run([loss, preds_Tr, preds_Te], feed_dict=feed_dict)
                # Rank the top-5 softmax activations 5..1; zero out the rest
                for i in range(nTr):
                    ind_Tr = np.argsort(preds_Train[i, :])[::-1][:5]
                    preds_Train[i, :] = 0
                    for j in range(1, 6):
                        preds_Train[i, ind_Tr[j - 1]] = j
                for i in range(nTe):
                    ind_Te = np.argsort(preds_Test[i, :])[::-1][:5]
                    preds_Test[i, :] = 0
                    for j in range(1, 6):
                        preds_Test[i, ind_Te[j - 1]] = j
                acc_Tr = accuracy(preds_Train, yTr)
                acc_Te = accuracy(preds_Test, yTe)

                print('Minibatch loss at step %d: %f' % (step, l))
                print('Training accuracy of top 5 probabilities: %s' % acc_Tr)
                print('Testing accuracy of top 5 probabilities: %s' % acc_Te)
                print('Time consumed: %d minutes' % ((time.time() - t) / 60.))
                saver.save(session, ckpt_dir + 'model.ckpt', global_step=step + 1)

            elif step % 500 == 0:
                print('Step %d complete ...' % step)

        print('Training complete.')


if __name__ == '__main__':

    # Define batch size for SGD, and network architecture
    batch_size = 128
    num_channels = 1
    patch_size = 2
    depth = 32
    num_nodes_layer3 = 1024
    num_nodes_output = 128
    board_height = 8
    board_width = 4
    label_height = 32
    label_width = 4

    # Load training data into win_dict, loss_dict, and draw_dict
    # (the pickle produced by parser.py is assumed to hold these three dicts)
    trainset_file = 'checkers_library_full_v2.pickle'
    with open(trainset_file, 'rb') as f:
        win_dict, loss_dict, draw_dict = pickle.load(f)

    # Create numpy arrays xTr(nx8x4) and yTr(nx32x4), where n = number of training examples
    data_list = list()
    labels_list = list()
    for dictionary in [win_dict, loss_dict, draw_dict]:
        for key in dictionary:
            data_list.append(dictionary[key][0].as_matrix())
            labels_list.append(dictionary[key][1].as_matrix())
    data = np.reshape(np.array(data_list, dtype=int), (-1, board_height, board_width, num_channels))
    labels = np.array(labels_list, dtype=int)

    # Randomize order since incoming data is structured into win, loss, draw
    n = len(data_list)
    assert n == len(labels_list)
    ind = np.arange(n)
    np.random.shuffle(ind)
    data, labels = data[ind, :, :], labels[ind, :, :]

    # Vectorize the inputs and labels
    data = data.reshape((-1, board_height, board_width)).astype(np.float32)
    labels = labels.reshape((-1, label_height * label_width)).astype(np.float32)

    # Split x, y into training and test sets
    test_split = 0.35
    nTe = int(test_split * n)
    nTr = n - nTe
    xTe, yTe = data[:nTe, :, :], labels[:nTe, :]
    xTr, yTr = data[nTe:, :, :], labels[nTe:, :]
    assert n == nTr + nTe
    del data, labels

    # Reshape data
    xTr = np.reshape(xTr, (-1, board_height, board_width, num_channels))
    xTe = np.reshape(xTe, (-1, board_height, board_width, num_channels))

    param_dir = 'parameters_v2/convnet_100k_full_no_reg/'

    deepnet(num_steps=150001,
            lambda_loss=0,
            dropout_L1=0,
            dropout_L2=0,
            ckpt_dir=param_dir)
```


This code passes 150,000 mini-batches of 128 samples each to the CNN for training. The figure below shows the training accuracy plotted versus training step. We see that the model captures about 66% of moves with its top prediction after ~40 epochs through the training set (note: I didn’t use cross-validation).

There are several interesting questions that we could ask from here, for example, what is the variance in the dataset? How does the variance change as a function of board density/sparsity (i.e. is it heteroscedastic)? Intuitively we know that there is variance–players play with different strategies. I would also posit that the variance among early boards is lower than mid-late boards. Knowing the mean variance would be very useful as it would give us a lower bound on test error. However, obtaining the distribution of variance over the entire input space is complicated by a lack of duplicate board examples in the mid-late game scenarios. As we’ll see, the CNN’s performance in the early game is not under question; its late game performance on the other hand is sporadic, and this is a limitation imposed by the size of our dataset. I’m jumping ahead here, but just want to point out that proper evaluation of this model is complicated by several issues.

The good news is that the CNN does capture collective strategy from the data; it has an error of ~3.45% when we consider its top-5 outputs (i.e., softmax activations). In the future I plan to add a search algorithm (e.g., minimax) and determine the required search breadth/depth for different levels of play. For now, though, it is interesting to examine how it performs against humans and other checkers engines without any notion of ‘how’ to play checkers, or of its rules.

## The Engine

The program shown below facilitates gameplay between a human and the predictive model. The engine makes calls to the model and to the human player (via a shell-based interface) and then uses the aforementioned Board class to keep track of board states. When the computer makes a move, the engine displays the top-5 predictions from the model. Because the model has no explicit instruction of ‘how’ to play checkers (the rules can be found here), the engine also ensures that each move is valid (by querying the ranked predictions until it finds a valid one); I have observed invalid predictions on a few occasions. At every computer move, the engine also checks for hops (and subsequent hops). I’ve found that the model outputs hops correctly in the vast majority of cases. The code for this engine is given below.
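Condensed to its essence, the fallback logic amounts to the following sketch (illustrative names, not the engine’s actual API):

```python
# Walk the model's ranked predictions and play the first legal one.
def first_valid(ranked_moves, is_legal):
    for move in ranked_moves:
        if is_legal(move):
            return move
    return None  # no legal prediction; the engine aborts the game
```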

```python
import numpy as np
import pandas as pd
import predict_move


class Board(object):

    # `jumps` (a 32x32 lookup of the square eliminated by each jump) is
    # expected to be defined at module level, like `empty` and `odd_list`
    global jumps, empty, odd_list

    def __init__(self):
        self.invalid_attempts = 0
        # Opening position: black checkers on squares 1-12, white on 21-32
        start = np.zeros(32, dtype=int)
        start[:12] = black_chkr
        start[20:] = white_chkr
        self.state = pd.DataFrame(np.reshape(start, (8, 4)))

    def board_state(self, player_type):
        if player_type == 'white':
            return -self.state.iloc[::-1, ::-1]
        elif player_type == 'black':
            return self.state

    def print_board(self):

        print '  32  31  30  29'
        print '28  27  26  25'
        print '  24  23  22  21'
        print '20  19  18  17'
        print '  16  15  14  13'
        print '12  11  10  09'
        print '  08  07  06  05'
        print '04  03  02  01'
        print '\n'
        for j in range(8):
            for i in range(4):
                if j % 2 == 0:
                    print ' ',
                if self.state[3 - i][7 - j] == 1:
                    print 'x',
                elif self.state[3 - i][7 - j] == 3:
                    print 'X',
                elif self.state[3 - i][7 - j] == 0:
                    print '-',
                elif self.state[3 - i][7 - j] == -1:
                    print 'o',
                else:
                    print 'O',
                if j % 2 != 0:
                    print ' ',
            print ''

    def find_jumps(self, player_type):

        valid_jumps = list()

        if player_type == 'black':
            king_value = black_king
            chkr_value = black_chkr
            chkr_directions = [1, 2]
        else:
            king_value = white_king
            chkr_value = white_chkr
            chkr_directions = [0, 3]

        board_state = self.state.as_matrix()
        board_state = np.reshape(board_state, (32,))

        for position in range(32):
            piece = board_state[position]
            neighbors_list = neighbors[position]
            next_neighbors_list = next_neighbors[position]

            if piece == chkr_value:
                for direction in chkr_directions:
                    neighbor = neighbors_list[direction]
                    next_neighbor = next_neighbors_list[direction]
                    if neighbor == iv or next_neighbor == iv:
                        pass
                    elif board_state[next_neighbor] == empty and (board_state[neighbor] == -chkr_value or board_state[neighbor] == -king_value):
                        valid_jumps.append([position, next_neighbor])

            elif piece == king_value:
                for direction in range(4):
                    neighbor = neighbors_list[direction]
                    next_neighbor = next_neighbors_list[direction]
                    if neighbor == iv or next_neighbor == iv:
                        pass
                    elif board_state[next_neighbor] == empty and (board_state[neighbor] == -chkr_value or board_state[neighbor] == -king_value):
                        valid_jumps.append([position, next_neighbor])

        return valid_jumps

    def get_positions(self, move, player_type):

        # Extract starting position, and direction to move
        ind = np.argwhere(move == 1)[0]
        position = ind[0]
        direction = ind[1]

        jumps_available = self.find_jumps(player_type=player_type)

        neighbor = neighbors[position][direction]
        next_neighbor = next_neighbors[position][direction]

        if [position, next_neighbor] in jumps_available:
            return position, next_neighbor, 'jump'
        else:
            return position, neighbor, 'standard'

    def generate_move(self, player_type, output_type):

        board_state = self.state
        moves, probs = predict_move.predict_cnn(board_state, output=output_type, params_dir=params_dir)
        moves_list = list()

        for i in range(1, 6):
            ind = np.argwhere(moves == i)[0]
            move = np.zeros([32, 4])
            move[ind[0], ind[1]] = 1
            pos_init, pos_final, move_type = self.get_positions(move, player_type=player_type)
            moves_list.append([pos_init, pos_final])

        return moves_list, probs

    def update(self, positions, player_type, move_type):

        # Extract the initial and final positions into ints
        pos_init, pos_final = positions[0], positions[1]

        if player_type == 'black':
            king_pos = black_king_pos
            king_value = black_king
            chkr_value = black_chkr
            pos_init, pos_final = int(pos_init), int(pos_final)
        else:
            king_pos = white_king_pos
            king_value = white_king
            chkr_value = white_chkr
            pos_init, pos_final = int(pos_init) - 1, int(pos_final) - 1

        board_vec = self.state.copy()
        board_vec = np.reshape(board_vec.as_matrix(), (32,))

        if (board_vec[pos_init] == chkr_value or board_vec[pos_init] == king_value) and board_vec[pos_final] == empty:
            board_vec[pos_final] = board_vec[pos_init]
            board_vec[pos_init] = empty

            # Assign kings
            if pos_final in king_pos:
                board_vec[pos_final] = king_value

            # Remove eliminated pieces
            if move_type == 'jump':
                eliminated = int(jumps.iloc[pos_init, pos_final])
                print('Position eliminated: %d' % (eliminated + 1))
                assert board_vec[eliminated] in (-chkr_value, -king_value)
                board_vec[eliminated] = empty

            # Update the board
            board_vec = pd.DataFrame(np.reshape(board_vec, (8, 4)))
            self.state = board_vec
            return False

        else:
            return True


def play():

    # Numeric encoding of player turn: white = 1, black = -1
    turn = -1

    # Count number of invalid move attempts
    invalid_move_attempts = 0
    jumps_not_predicted = 0
    move_count = 0
    game_aborted = False

    # Initialize board object
    board = Board()

    # Start game
    raw_input("To begin, press Enter:")
    end_game = False
    winner = ''
    while True:

        # White turn
        if turn == 1:

            print('\n' * 2)
            print('=======================================================')
            print("White's turn")
            board.print_board()
            move_illegal = True
            while move_illegal:

                # Prompt player for input
                move = raw_input("Enter move as 'pos_init, pos_final':")

                if move.endswith('wins'):
                    winner = move[:-5]
                    end_game = True
                    break

                elif move == 'draw':
                    winner = 'draw'
                    end_game = True
                    break

                else:
                    move = move.split(',')

                    for i in range(len(move) - 1):
                        pos_init = int(move[i])
                        pos_final = int(move[i + 1])
                        if abs(pos_final - pos_init) > 5:
                            move_type = 'jump'
                        else:
                            move_type = 'standard'

                        move_illegal = board.update(positions=[pos_init, pos_final], player_type='white', move_type=move_type)
                        if move_illegal:
                            print('That move is invalid, please try again.')
                        else:
                            print("White move: %s" % [pos_init, pos_final])

        # Black turn
        elif turn == -1:
            print('\n' * 2)
            print('=======================================================')
            print("Black's turn")
            board.print_board()
            player_type = 'black'

            # Call model to generate move
            moves_list, probs = board.generate_move(player_type=player_type, output_type='top-5')
            print(np.array(moves_list) + 1)
            print(probs)

            # Check for available jumps, cross check with moves
            available_jumps = board.find_jumps(player_type=player_type)

            first_move = True

            # Handles situation where there is a jump available to black
            if len(available_jumps) > 0:

                move_type = 'jump'
                jump_available = True

                while jump_available:

                    # For one jump available
                    if len(available_jumps) == 1:
                        count = 1
                        move_predicted = False

                        for move in moves_list:
                            if move == available_jumps[0]:
                                move_predicted = True
                                break
                            else:
                                count += 1

                        if not move_predicted:
                            # Model did not output the available jump; forced move
                            jumps_not_predicted += 1

                        initial_position = available_jumps[0][0]
                        if not (first_move or final_position == initial_position):
                            break
                        final_position = available_jumps[0][1]
                        initial_piece = np.reshape(board.state.as_matrix(), (32,))[initial_position]
                        move_illegal = board.update(available_jumps[0], player_type=player_type, move_type=move_type)

                        if move_illegal:
                            game_aborted = True
                        else:
                            print("Black move: %s" % (np.array(available_jumps[0]) + 1))
                            available_jumps = board.find_jumps(player_type=player_type)
                            final_piece = np.reshape(board.state.as_matrix(), (32,))[final_position]
                            if len(available_jumps) == 0 or final_piece != initial_piece:
                                jump_available = False

                    # When multiple different jumps are available
                    else:
                        move_predicted = False
                        for move in moves_list:
                            if move in available_jumps:

                                initial_position = move[0]
                                if not (first_move or final_position == initial_position):
                                    break
                                final_position = move[1]
                                initial_piece = np.reshape(board.state.as_matrix(), (32,))[initial_position]
                                move_illegal = board.update(move, player_type=player_type, move_type=move_type)

                                if move_illegal:
                                    pass
                                else:
                                    print("Black move: %s" % (np.array(move) + 1))
                                    move_predicted = True
                                    available_jumps = board.find_jumps(player_type=player_type)
                                    final_piece = np.reshape(board.state.as_matrix(), (32,))[final_position]
                                    if len(available_jumps) == 0 or final_piece != initial_piece:
                                        jump_available = False
                                    break

                        if not move_predicted:
                            # None of the predictions is an available jump; pick one at random
                            jumps_not_predicted += 1
                            ind = np.random.randint(0, len(available_jumps))

                            initial_position = available_jumps[ind][0]
                            if not (first_move or final_position == initial_position):
                                break
                            final_position = available_jumps[ind][1]
                            initial_piece = np.reshape(board.state.as_matrix(), (32,))[initial_position]
                            move_illegal = board.update(available_jumps[ind], player_type=player_type, move_type=move_type)

                            if move_illegal:
                                game_aborted = True
                            else:
                                available_jumps = board.find_jumps(player_type=player_type)
                                final_piece = np.reshape(board.state.as_matrix(), (32,))[final_position]
                                if len(available_jumps) == 0 or final_piece != initial_piece:
                                    jump_available = False

                    first_move = False

            # For standard moves
            else:
                move_type = 'standard'
                move_illegal = True
                while move_illegal:

                    count = 1
                    for move in moves_list:

                        move_illegal = board.update(move, player_type=player_type, move_type=move_type)

                        if move_illegal:
                            print(probs[count - 1])
                            invalid_move_attempts += 1
                            count += 1
                        else:
                            print('Black move: %s' % (np.array(move) + 1))
                            break

                    if move_illegal:
                        game_aborted = True
                        print("The model failed to provide a valid move. Game aborted.")
                        print(np.array(moves_list) + 1)
                        print(probs)
                        break

        if game_aborted:
            print('Game aborted.')
            break

        if end_game:
            print('The game has ended')
            break
        move_count += 1
        turn *= -1

    # Print out game stats
    end_board = board.state.as_matrix()
    num_black_chkr = len(np.argwhere(end_board == black_chkr))
    num_black_king = len(np.argwhere(end_board == black_king))
    num_white_chkr = len(np.argwhere(end_board == white_chkr))
    num_white_king = len(np.argwhere(end_board == white_king))
    if winner == 'draw':
        print('The game ended in a draw.')
    else:
        print('%s wins' % winner)
    print('Total number of moves: %d' % move_count)
    print('Remaining white pieces: (checkers: %d, kings: %d)' % (num_white_chkr, num_white_king))
    print('Remaining black pieces: (checkers: %d, kings: %d)' % (num_black_chkr, num_black_king))


if __name__ == '__main__':

    # Define board entries and valid positions
    empty = 0
    black_chkr = 1
    black_king = 3
    black_king_pos = [28, 29, 30, 31]
    white_chkr = -black_chkr
    white_king = -black_king
    white_king_pos = [0, 1, 2, 3]
    valid_positions = range(32)
    odd_list = [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]
    even_list = [4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31]
    params_dir = 'parameters/convnet_150k_full/model.ckpt-150001'

    # Entries for neighbors are lists, with indices corresponding to direction
    # as defined in parser_v7.py; iv marks an off-board square
    iv = ''
    neighbors = {0: [iv, 5, 4, iv],
                 1: [iv, 6, 5, iv],
                 2: [iv, 7, 6, iv],
                 3: [iv, iv, 7, iv],
                 4: [0, 8, iv, iv],
                 5: [1, 9, 8, 0],
                 6: [2, 10, 9, 1],
                 7: [3, 11, 10, 2],
                 8: [5, 13, 12, 4],
                 9: [6, 14, 13, 5],
                 10: [7, 15, 14, 6],
                 11: [iv, iv, 15, 7],
                 12: [8, 16, iv, iv],
                 13: [9, 17, 16, 8],
                 14: [10, 18, 17, 9],
                 15: [11, 19, 18, 10],
                 16: [13, 21, 20, 12],
                 17: [14, 22, 21, 13],
                 18: [15, 23, 22, 14],
                 19: [iv, iv, 23, 15],
                 20: [16, 24, iv, iv],
                 21: [17, 25, 24, 16],
                 22: [18, 26, 25, 17],
                 23: [19, 27, 26, 18],
                 24: [21, 29, 28, 20],
                 25: [22, 30, 29, 21],
                 26: [23, 31, 30, 22],
                 27: [iv, iv, 31, 23],
                 28: [24, iv, iv, iv],
                 29: [25, iv, iv, 24],
                 30: [26, iv, iv, 25],
                 31: [27, iv, iv, 26]
                 }

    next_neighbors = {0: [iv, 9, iv, iv],
                      1: [iv, 10, 8, iv],
                      2: [iv, 11, 9, iv],
                      3: [iv, iv, 10, iv],
                      4: [iv, 13, iv, iv],
                      5: [iv, 14, 12, iv],
                      6: [iv, 15, 13, iv],
                      7: [iv, iv, 14, iv],
                      8: [1, 17, iv, iv],
                      9: [2, 18, 16, 0],
                      10: [3, 19, 17, 1],
                      11: [iv, iv, 18, 2],
                      12: [5, 21, iv, iv],
                      13: [6, 22, 20, 4],
                      14: [7, 23, 21, 5],
                      15: [iv, iv, 22, 6],
                      16: [9, 25, iv, iv],
                      17: [10, 26, 24, 8],
                      18: [11, 27, 25, 9],
                      19: [iv, iv, 26, 10],
                      20: [13, 29, iv, iv],
                      21: [14, 30, 28, 12],
                      22: [15, 31, 29, 13],
                      23: [iv, iv, 30, 14],
                      24: [17, iv, iv, iv],
                      25: [18, iv, iv, 16],
                      26: [19, iv, iv, 17],
                      27: [iv, iv, iv, 18],
                      28: [21, iv, iv, iv],
                      29: [22, iv, iv, 20],
                      30: [23, iv, iv, 21],
                      31: [iv, iv, iv, 22]
                      }

    play()
```


## Engine performance versus humans

I showed above that this model can learn individual moves, but the more interesting question is whether or not it exhibits the high-level strategy that is required to win games. Below are tabulated results from four games against human opponents. One of the players was relatively inexperienced at checkers and the CNN engine won the game easily, maintaining a clear advantage throughout the game. The second individual plays at an intermediate level, and was also defeated by the CNN, but after a more lengthy game of 100 moves. The other two human players play somewhere between intermediate and advanced, and were able to overcome the CNN, but not easily.

One of the interesting findings was that the CNN engine does have the ability to generalize and play sound checkers on board states it has never seen before (sometimes, that is). An example of this was the endgame between the CNN and the intermediate-level human. The game-ending sequence is shown below. On move 95, the CNN had a clear opportunity to defend its own checker by moving from position 23 to 27, but instead moved in the opposite direction, sacrificing its own checker to allow the opposing king to move into a very unfavorable position (from the human’s perspective). The CNN engine then played the perfect line given the human’s subsequent move into kings row, and was able to secure the win one move later.

The two games against the more advanced human players were recorded and analyzed with the Cake checkers engine’s analysis tool in order to track the CNN’s performance throughout the game. The CNN lost both games but displayed a similar pattern of performance. In each, it held clear winning positions through the first 25 moves (either by a clear advantage in kings and pieces, or as determined by the Cake analysis in situations where piece counts were closer). However, as boards became more sparse, the engine became prone to game-changing blunders, which ultimately led to its defeat.

## Engine performance versus Checkers Deluxe and Cake

As a more concrete measure of performance, I show the tabulated results of our model versus Checkers Deluxe, a popular checkers-playing app. The app has four skill levels; as can be seen, the CNN defeated all levels except Expert. Even though it couldn’t defeat the Expert level, it performed remarkably well for 90+ moves before losing on the 97th move.

The CNN also lost twice to the world-championship, search-based checkers engine Cake. This engine uses advanced search algorithms, along with endgame databases, to generate moves. Despite the clear disadvantage of having no search, the CNN maintained reasonable positions for 20+ moves in each game. According to the Cake analysis tool, the CNN’s first 5-7 outputs were book moves, which makes sense given that those would be the most common moves in the training database. The figure below shows a snapshot of the first game after 20 moves, shortly before Cake (white) gained a winning position.

## Game play

The source code for training and gameplay can be found here. Some day I hope to wrap the game engine in a GUI; for now it just runs in the shell or your favorite Python IDE. If you are interested in seeing how this engine is implemented, please read on; otherwise, feel free to download the source code and see if you can beat it. Tip: run a browser-based checkers game in 2-player mode alongside this game for a better experience. The shell interface looks like this …

## Final thoughts

The above results indicate that intermediate-level checkers play can be achieved without the use of search, which is pretty cool in its own right. While the engine typically maintains a clear advantage through 20+ moves, it tends to suffer from the occasional blunder in the mid-late game against stronger opponents as the board becomes sparse and less familiar. This is expected, and motivates further learning through self-play. In the next post I’ll describe how we can factor checkers games into a set of conditional probability distributions that provide an elegant means to improve our pretrained policy: playing games, observing outcomes, assigning rewards, and propagating reward signals back through the network to make moves that positively correlate with winning more probable under the policy.
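As a preview of that idea, here is a toy sketch of a REINFORCE-style update on a one-parameter ‘policy’ over two moves (purely illustrative; the real update acts on the network’s weights $$\theta$$ via backpropagation):

```python
import numpy as np

def policy(theta):
    """Softmax policy over two moves, parameterized by a single logit."""
    z = np.array([theta, 0.0])
    e = np.exp(z - z.max())
    return e / e.sum()

theta, lr = 0.0, 0.1
action, reward = 0, 1.0                 # pretend move 0 was played and the game was won
p = policy(theta)
grad_log = (1.0 if action == 0 else 0.0) - p[0]   # d/dtheta of log p(action)
theta += lr * reward * grad_log         # nudge the played, winning move up
```

After the update, the played move that led to a win has become more probable, which is exactly the behavior we want to propagate through the full network.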