Q-Learning for Tic-Tac-Toe#
Read Temporal Difference Learning (TD Learning) before you start.
The aim of this project is to create an AI player for Tic-tac-toe using Q-learning. The AI player will not only learn how to win the game but will also have to learn the rules of the game.
We'll have a board object holding the state information of the game and two player objects interacting with the board. Interaction is not direct: all information flow is controlled by code living outside the board and player objects, so boards and players do not have to know how to control each other.
import numpy as np
rng = np.random.default_rng(0)
The Board#
The board is 3-by-3 indexed rowwise:
0 1 2
3 4 5
6 7 8
The state of each field is represented by a one-character string. The symbols used are flexible, but ' ' always indicates an empty field.
class Board:

    # some constants to increase readability of code
    OKAY = 0
    INVALID = 1
    WIN = 2
    DRAW = 3

    def __init__(self, symbol1='X', symbol2='O'):
        ''' Create empty board. Player 1 has first move. '''
        self.symbol1 = symbol1
        self.symbol2 = symbol2
        self.board = [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
        self.game_over = False
        self.next_player = 1  # player 1 always has first move

    def _win(self, symbol):
        ''' Check whether player with symbol won the game. '''
        b = self.board  # shorthand
        s3 = 3 * (symbol, )
        if (b[0], b[1], b[2]) == s3 or \
           (b[3], b[4], b[5]) == s3 or \
           (b[6], b[7], b[8]) == s3 or \
           (b[0], b[3], b[6]) == s3 or \
           (b[1], b[4], b[7]) == s3 or \
           (b[2], b[5], b[8]) == s3 or \
           (b[0], b[4], b[8]) == s3 or \
           (b[2], b[4], b[6]) == s3:
            return True
        else:
            return False

    def take(self, field):
        ''' Take a field (0...8) for next player and return status. '''
        # no moves allowed if game is over
        if self.game_over:
            return Board.INVALID
        # valid move?
        try:
            field = int(field)
        except (TypeError, ValueError):
            return Board.INVALID
        if field < 0 or field > 8:
            return Board.INVALID
        if self.board[field] != ' ':
            return Board.INVALID
        self.game_over = True  # set to False below if appropriate
        # take field and check for win
        symbol = self.symbol1 if self.next_player == 1 else self.symbol2
        self.board[field] = symbol
        if self._win(symbol):
            return Board.WIN
        else:
            self.next_player = 2 if self.next_player == 1 else 1
            if ' ' not in self.board:
                return Board.DRAW
        # if we arrive here, game is not over
        self.game_over = False
        return Board.OKAY

    def render(self):
        ''' Print current state of the board. '''
        b = self.board  # shorthand
        print('+---+---+---+')
        print('| ' + b[0] + ' | ' + b[1] + ' | ' + b[2] + ' |')
        print('+---+---+---+')
        print('| ' + b[3] + ' | ' + b[4] + ' | ' + b[5] + ' |')
        print('+---+---+---+')
        print('| ' + b[6] + ' | ' + b[7] + ' | ' + b[8] + ' |')
        print('+---+---+---+')
Task: For testing create a board, take some fields and render the board.
Solution:
# your solution
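One possible solution is sketched below. A condensed copy of the Board class from above is included only so that the snippet runs on its own; in the notebook the class is already defined and the snippet would start at the `test_board` line.

```python
class Board:
    OKAY, INVALID, WIN, DRAW = 0, 1, 2, 3
    # all eight winning lines as index triples
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def __init__(self, symbol1='X', symbol2='O'):
        self.symbol1, self.symbol2 = symbol1, symbol2
        self.board = [' '] * 9
        self.game_over = False
        self.next_player = 1

    def _win(self, symbol):
        return any(all(self.board[i] == symbol for i in line)
                   for line in Board.LINES)

    def take(self, field):
        if self.game_over:
            return Board.INVALID
        try:
            field = int(field)
        except (TypeError, ValueError):
            return Board.INVALID
        if not 0 <= field <= 8 or self.board[field] != ' ':
            return Board.INVALID
        self.game_over = True
        symbol = self.symbol1 if self.next_player == 1 else self.symbol2
        self.board[field] = symbol
        if self._win(symbol):
            return Board.WIN
        self.next_player = 2 if self.next_player == 1 else 1
        if ' ' not in self.board:
            return Board.DRAW
        self.game_over = False
        return Board.OKAY

    def render(self):
        for row in range(3):
            print('+---+---+---+')
            print('| ' + ' | '.join(self.board[3 * row:3 * row + 3]) + ' |')
        print('+---+---+---+')

# take some fields and render: X takes the top row, O plays 3 and 4
test_board = Board()
statuses = [test_board.take(f) for f in [0, 3, 1, 4, 2]]
test_board.render()
print(statuses)  # -> [0, 0, 0, 0, 2], the last move wins for X
```

Note that after the winning move any further `take` call returns `Board.INVALID`, since the game is over.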
Players#
We create an abstract player class from which we may derive different types of players (human player, AI player with random behavior, AI player controlled by Q-learning, …).
Abstract Player#
class Player:

    def __init__(self, symbol):
        self.symbol = symbol

    def field(self, board):
        ''' Choose a field (0...8) to take. '''
        raise NotImplementedError

    def reward(self, r, board):
        ''' Numerical feedback for player and new board state.
        Called by controller after each call of field method. '''
        pass
Random Player#
For testing purposes we implement an AI player that chooses one of the empty fields uniformly at random.
class RandomPlayer(Player):

    def field(self, board):
        ''' Choose a field (0...8) to take. '''
        free = np.array([board[i] == ' ' for i in range(9)])
        return rng.choice(np.arange(9)[free])
Human Player#
In the end we want to play a game against our Q-learning AI player. Thus, we need a player object that asks us for a field to choose whenever its field method is called.
class HumanPlayer(Player):

    def field(self, board):
        print('| ' + board[0] + ' ' + board[1] + ' ' + board[2] + ' | 0 1 2')
        print('| ' + board[3] + ' ' + board[4] + ' ' + board[5] + ' | 3 4 5')
        print('| ' + board[6] + ' ' + board[7] + ' ' + board[8] + ' | 6 7 8')
        return input('Which field? ')
Episodes#
An episode of playing tic-tac-toe is realized by the following function, which returns the final game status and the symbol of the player who made (or attempted) the final move. Note that an episode also ends as soon as a player requests an invalid move.
rewards = {Board.WIN: 2, Board.DRAW: 1, Board.OKAY: 0, Board.INVALID: -10}

def episode(p1, p2):
    ''' Play an episode between two Player objects and return final status and symbol. '''
    game = Board(p1.symbol, p2.symbol)
    status = Board.OKAY
    p = p2  # implies that p1 starts the game
    while status == Board.OKAY:
        p = p2 if p == p1 else p1
        status = game.take(p.field(game.board))
        p.reward(rewards[status], game.board)
    return status, p.symbol
Task: Create a random AI player and a human player. Play some test episodes.
Solution:
# your solution
Task: Let two random players play 1000 games against each other. How many wins does each player have? How many draws?
Solution:
# your solution
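One possible solution is sketched below. Condensed copies of the Board class, the random player, and the episode function from above are repeated so the snippet runs on its own; in the notebook only the counting loop at the end is needed. With purely random play the first player should win noticeably more often than the second, and the exact counts depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

class Board:
    OKAY, INVALID, WIN, DRAW = 0, 1, 2, 3
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

    def __init__(self, symbol1='X', symbol2='O'):
        self.symbol1, self.symbol2 = symbol1, symbol2
        self.board = [' '] * 9
        self.game_over = False
        self.next_player = 1

    def _win(self, symbol):
        return any(all(self.board[i] == symbol for i in line)
                   for line in Board.LINES)

    def take(self, field):
        if self.game_over:
            return Board.INVALID
        try:
            field = int(field)
        except (TypeError, ValueError):
            return Board.INVALID
        if not 0 <= field <= 8 or self.board[field] != ' ':
            return Board.INVALID
        self.game_over = True
        symbol = self.symbol1 if self.next_player == 1 else self.symbol2
        self.board[field] = symbol
        if self._win(symbol):
            return Board.WIN
        self.next_player = 2 if self.next_player == 1 else 1
        if ' ' not in self.board:
            return Board.DRAW
        self.game_over = False
        return Board.OKAY

class RandomPlayer:  # in the notebook this derives from Player
    def __init__(self, symbol):
        self.symbol = symbol

    def field(self, board):
        free = [i for i in range(9) if board[i] == ' ']
        return int(rng.choice(free))

    def reward(self, r, board):
        pass

rewards = {Board.WIN: 2, Board.DRAW: 1, Board.OKAY: 0, Board.INVALID: -10}

def episode(p1, p2):
    game = Board(p1.symbol, p2.symbol)
    status = Board.OKAY
    p = p2
    while status == Board.OKAY:
        p = p2 if p == p1 else p1
        status = game.take(p.field(game.board))
        p.reward(rewards[status], game.board)
    return status, p.symbol

# play 1000 games and count wins per symbol as well as draws
counts = {'X': 0, 'O': 0, 'draw': 0}
for _ in range(1000):
    status, symbol = episode(RandomPlayer('X'), RandomPlayer('O'))
    if status == Board.WIN:
        counts[symbol] += 1
    else:
        counts['draw'] += 1
print(counts)
```

Random players only ever pick empty fields, so no episode ends with an invalid move here.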
Q-Learning AI Player#
Task: Think about memory requirements and suitable data structures for implementing a Q-learning based AI player.
Solution:
# your notes
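Some notes as a sketch, one possible layout among several: a board has 9 fields with 3 possible symbols each, so there are at most \(3^9 = 19683\) states (far fewer are actually reachable), which makes a tabular Q-function easily feasible. A dictionary mapping state tuples to arrays of 9 action values only allocates rows for states actually visited; the `defaultdict` name and layout below are assumptions, not the only choice.

```python
from collections import defaultdict

import numpy as np

# Q-table: board state (tuple of 9 one-character strings) -> 9 action values.
# defaultdict creates a zero row lazily on first access, so only visited
# states occupy memory.
q_table = defaultdict(lambda: np.zeros(9))

print(3 ** 9)  # -> 19683, upper bound on the number of states

state = tuple([' '] * 9)  # empty board
print(q_table[state])     # row created on first access, all zeros
```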
Task: Create an AI player class QLearningPlayer, which allows training an AI player via Q-learning.
Solution:
# your solution
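One possible implementation sketch, assuming a tabular Q-function with an ε-greedy policy. The player may choose any of the 9 fields, including occupied ones; the negative reward for invalid moves is exactly how it learns the rules. Since the controller calls `reward` after every `field` call and the reward dictionary above assigns 0 only to `Board.OKAY`, a nonzero reward signals the end of the episode. The hyperparameter names and default values (`alpha`, `gamma`, `epsilon`) are assumptions.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)

class QLearningPlayer:  # in the notebook this derives from Player
    ''' Tabular Q-learning player with epsilon-greedy exploration. '''

    def __init__(self, symbol, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.symbol = symbol
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration rate (set to 0 when playing)
        self.q = defaultdict(lambda: np.zeros(9))  # state -> 9 action values
        self.prev_state = None
        self.prev_action = None

    def field(self, board):
        ''' Choose a field (0...8) to take. '''
        state = tuple(board)
        if rng.random() < self.epsilon:
            action = int(rng.integers(9))  # explore: any field, even occupied
        else:
            action = int(np.argmax(self.q[state]))  # exploit best known move
        self.prev_state, self.prev_action = state, action
        return action

    def reward(self, r, board):
        ''' Q-update for the previous state-action pair. With the rewards
        dictionary above, r != 0 means the episode has ended. '''
        if self.prev_state is None:
            return
        target = r
        if r == 0:  # non-terminal: bootstrap from the successor state
            target += self.gamma * np.max(self.q[tuple(board)])
        q_row = self.q[self.prev_state]
        q_row[self.prev_action] += self.alpha * (target - q_row[self.prev_action])
```

For evaluation or for playing against a human, `epsilon` can be set to 0 so the player acts purely greedily with respect to its learned Q-values.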
Task: Train the Q-learning AI player by letting it play against a random AI player. Count and compare the numbers of wins, draws, and invalid moves. Print and reset the counters every \(n\) episodes for suitable \(n\). Also switch the roles of player 1 and player 2 at these points, because the AI player should learn to play in both roles (remember that player 1 always has the first move).
Solution:
# your solution
Task: Train two AI players by letting them play against each other. Count and compare wins and draws.
Solution:
# your solution
Task: Play a game against the trained AI player.
Solution:
# your solution