Q-Learning for Tic-Tac-Toe#

Read Temporal Difference Learning (TD Learning) before you start.

The aim of this project is to create an AI player for tic-tac-toe using Q-learning. The AI player not only has to learn how to win the game, it also has to learn the rules of the game.

We’ll have a board object holding the game’s state information and two player objects interacting with the board. Interaction is not direct; instead, all information flow is controlled by code living outside the board and player objects. Thus, the board and the players do not have to know how to control each other.

import numpy as np

rng = np.random.default_rng(0)

The Board#

The board is 3-by-3, with fields indexed row-wise:

0 1 2
3 4 5
6 7 8

The state of each field is represented by a one-character string. The symbols used for the players are configurable, but ' ' always indicates an empty field.

class Board:

    # some constants to increase readability of code
    OKAY = 0
    INVALID = 1
    WIN = 2
    DRAW = 3

    def __init__(self, symbol1='X', symbol2='O'):
        ''' Create empty board. Player 1 has first move. '''
    
        self.symbol1 = symbol1
        self.symbol2 = symbol2
        self.board = [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
        self.game_over = False
        self.next_player = 1  # player 1 always has first move

    def _win(self, symbol):
        ''' Check whether player with symbol won the game. '''

        b = self.board  # shorthand
        s3 = 3 * (symbol, )
        
        if (b[0], b[1], b[2]) == s3 or \
           (b[3], b[4], b[5]) == s3 or \
           (b[6], b[7], b[8]) == s3 or \
           (b[0], b[3], b[6]) == s3 or \
           (b[1], b[4], b[7]) == s3 or \
           (b[2], b[5], b[8]) == s3 or \
           (b[0], b[4], b[8]) == s3 or \
           (b[2], b[4], b[6]) == s3:
            return True
        else:
            return False
    
    def take(self, field):
        ''' Take a field (0...8) for next player and return status. '''

        # no moves allowed if game is over
        if self.game_over:
            return Board.INVALID
        
        # valid move?
        try:
            field = int(field)
        except (TypeError, ValueError):
            return Board.INVALID
        if field < 0 or field > 8:
            return Board.INVALID
        if self.board[field] != ' ':
            return Board.INVALID

        self.game_over = True  # set to False below if appropriate
        
        # take field and check for win
        symbol = self.symbol1 if self.next_player == 1 else self.symbol2
        self.board[field] = symbol
        if self._win(symbol):
            return Board.WIN
        else:
            self.next_player = 2 if self.next_player == 1 else 1
        if ' ' not in self.board:
            return Board.DRAW
        
        # if we arrive here, game is not over
        self.game_over = False
        return Board.OKAY
    
    def render(self):
        ''' Print current state of the board. '''
        
        b = self.board  # shorthand
        print('+---+---+---+')
        print('| ' + b[0] + ' | ' + b[1] + ' | ' +  b[2] + ' |')
        print('+---+---+---+')
        print('| ' + b[3] + ' | ' + b[4] + ' | ' +  b[5] + ' |')
        print('+---+---+---+')
        print('| ' + b[6] + ' | ' + b[7] + ' | ' +  b[8] + ' |')
        print('+---+---+---+')

Task: For testing, create a board, take some fields, and render the board.

Solution:

# your solution
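
A minimal sketch of such a test might look as follows (the chosen fields are arbitrary; the comments state the expected return values):

b = Board()
print(b.take(4))    # player 1 takes the center -> Board.OKAY (0)
print(b.take(4))    # field already taken -> Board.INVALID (1)
print(b.take(0))    # player 2 takes the upper left corner -> Board.OKAY (0)
b.render()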

Players#

We create an abstract player class from which we may derive different types of players (human player, AI player with random behavior, AI player controlled by Q-learning, …).

Abstract Player#

class Player:
      
    def __init__(self, symbol):
        
        self.symbol = symbol
        
    def field(self, board):
        ''' Choose a field (0...8) to take. '''
        
        raise NotImplementedError
        
    def reward(self, r, board):
        ''' Numerical feedback for player and new board state.
        Called by controller after each call of field method. '''
        
        pass

Random Player#

For testing purposes we implement an AI player that chooses one of the empty fields uniformly at random.

class RandomPlayer(Player):
    
    def field(self, board):
        ''' Choose a field (0...8) to take. '''
        
        free = np.array([board[i] == ' ' for i in range(9)])
        return rng.choice(np.arange(9)[free])

Human Player#

In the end we want to play a game against our Q-learning AI player ourselves. Thus, we need a player object that asks the user which field to take whenever its field method is called.

class HumanPlayer(Player):
    
    def field(self, board):
        
        print('| ' + board[0] + ' ' + board[1] + ' ' + board[2] + ' |   0 1 2')
        print('| ' + board[3] + ' ' + board[4] + ' ' + board[5] + ' |   3 4 5')
        print('| ' + board[6] + ' ' + board[7] + ' ' + board[8] + ' |   6 7 8')
        return input('Which field? ')

Episodes#

An episode of tic-tac-toe is realized by the following function, which returns the final game status and the symbol of the player who made (or attempted) the last move. Note that an episode ends as soon as a player requests an invalid move.

rewards = {Board.WIN: 2, Board.DRAW: 1, Board.OKAY: 0, Board.INVALID: -10}

def episode(p1, p2):
    ''' Play an episode between two Player objects and return final status and symbol. '''
    
    game = Board(p1.symbol, p2.symbol)
    status = Board.OKAY
    p = p2  # implies that p1 starts the game

    while status == Board.OKAY:
        p = p2 if p == p1 else p1
        status = game.take(p.field(game.board))
        p.reward(rewards[status], game.board)
    
    return status, p.symbol

Task: Create a board, a random AI player, and a human player. Play some test episodes.

Solution:

# your solution
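
One possible test, assuming the classes above: a human player (as player 1) plays one episode against a random player. Enter a field number when prompted; entering an occupied field or a non-numeric value ends the episode with an invalid move.

p1 = HumanPlayer('X')
p2 = RandomPlayer('O')

status, symbol = episode(p1, p2)
print('final status:', status, '  last mover:', symbol)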

Task: Let two random players play 1000 games against each other. How many wins does each player have? How many draws?

Solution:

# your solution
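
A possible sketch: since random players only choose empty fields, every episode ends with a win or a draw.

p1 = RandomPlayer('X')
p2 = RandomPlayer('O')

wins = {p1.symbol: 0, p2.symbol: 0}
draws = 0

for _ in range(1000):
    status, symbol = episode(p1, p2)
    if status == Board.WIN:
        wins[symbol] += 1
    elif status == Board.DRAW:
        draws += 1

print('wins player 1:', wins[p1.symbol])
print('wins player 2:', wins[p2.symbol])
print('draws:', draws)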

Q-Learning AI Player#

Task: Think about memory requirements and suitable data structures for implementing a Q-learning based AI player.

Solution:

# your notes
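
One possible line of reasoning: each of the 9 fields can be in one of 3 states, so there are at most 3^9 = 19683 board states (most of them unreachable), with at most 9 actions per state. A full Q-table therefore has fewer than 200000 entries and easily fits in memory. A convenient data structure is a dictionary mapping board states (as tuples, since lists are not hashable) to a NumPy array of 9 Q-values, created lazily on first access. A small illustrative helper (names are just for illustration):

q = {}

def q_values(board):
    ''' Return the Q-value array for a board state, creating it if necessary. '''
    state = tuple(board)
    if state not in q:
        q[state] = np.zeros(9)
    return q[state]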

Task: Create an AI player class QLearningPlayer that can be trained via Q-learning.

Solution:

# your solution
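
One possible sketch of such a class (not the only reasonable design). It stores the Q-table as a dictionary, uses the rewards dictionary defined above, and chooses moves epsilon-greedily over all 9 fields, so the player has to learn by itself which moves are invalid. The Q-update for a non-terminal move is deferred to the player's next field call, so the bootstrap uses the state in which the player actually acts next. The training flag, the default hyperparameters, and the episode-boundary detection via the number of marks are assumptions of this sketch. Also note that with the episode function above a player receives no reward when the opponent ends the game, so the pending update of that player's final move is simply dropped.

class QLearningPlayer(Player):

    def __init__(self, symbol, alpha=0.5, gamma=0.9, epsilon=0.1):
        super().__init__(symbol)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration rate for epsilon-greedy behavior
        self.q = {}               # maps board state (tuple of 9 strings) to array of 9 Q-values
        self.training = True      # set to False for greedy play without learning
        self.last_state = None    # state in which the last own move was made
        self.last_action = None   # last own move
        self.last_reward = 0      # reward received for the last own move

    def _q_values(self, state):
        ''' Return Q-values for a state, creating them if necessary. '''
        if state not in self.q:
            self.q[state] = np.zeros(9)
        return self.q[state]

    def field(self, board):
        ''' Choose a field (0...8) to take. '''

        state = tuple(board)

        # a board with fewer marks than the last remembered state signals a new
        # episode; drop the pending update from the previous episode (it ended
        # without a terminal reward for this player, e.g. by a loss)
        if self.last_state is not None and \
           sum(s != ' ' for s in state) < sum(s != ' ' for s in self.last_state):
            self.last_state = None

        # deferred Q-update for the previous non-terminal move, bootstrapping
        # on the state in which the player acts now
        if self.training and self.last_state is not None:
            qs = self._q_values(self.last_state)
            target = self.last_reward + self.gamma * self._q_values(state).max()
            qs[self.last_action] += self.alpha * (target - qs[self.last_action])

        # epsilon-greedy choice over all 9 fields
        if self.training and rng.random() < self.epsilon:
            action = rng.integers(9)
        else:
            qs = self._q_values(state)
            action = rng.choice(np.flatnonzero(qs == qs.max()))

        self.last_state = state
        self.last_action = action
        return action

    def reward(self, r, board):
        ''' Q-update for terminal moves; remember the reward otherwise. '''

        if not self.training or self.last_state is None:
            return
        if r != rewards[Board.OKAY]:
            # win, draw or invalid move: the episode is over, update immediately
            qs = self._q_values(self.last_state)
            qs[self.last_action] += self.alpha * (r - qs[self.last_action])
            self.last_state = None
        else:
            # regular move: remember the reward, update at the next field call
            self.last_reward = r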

Task: Train the Q-learning AI player by letting it play against a random AI player. Count and compare the numbers of wins, draws, and invalid moves. Print and reset the counters every \(n\) episodes for a suitable \(n\). Also switch the roles of player 1 and 2 at these points, because the AI player should learn to play in both roles (remember that player 1 always has the first move).

Solution:

# your solution
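
A possible training loop under the assumptions of the QLearningPlayer sketch above (the block size n = 10000 and the number of blocks are arbitrary choices):

ai = QLearningPlayer('X')
rand = RandomPlayer('O')

n = 10000
players = (ai, rand)  # ai starts as player 1

for block in range(10):
    counts = {Board.WIN: 0, Board.DRAW: 0, Board.INVALID: 0}
    ai_wins = 0
    for _ in range(n):
        status, symbol = episode(*players)
        counts[status] += 1
        if status == Board.WIN and symbol == ai.symbol:
            ai_wins += 1
    print('block', block,
          '| ai wins:', ai_wins,
          '| random wins:', counts[Board.WIN] - ai_wins,
          '| draws:', counts[Board.DRAW],
          '| invalid moves:', counts[Board.INVALID])
    players = (players[1], players[0])  # switch roles of player 1 and 2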

Task: Train two AI players by letting them play against each other. Count and compare wins and draws.

Solution:

# your solution
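
A sketch along the same lines, again assuming the QLearningPlayer from above:

ai1 = QLearningPlayer('X')
ai2 = QLearningPlayer('O')

for block in range(10):
    wins = {ai1.symbol: 0, ai2.symbol: 0}
    draws = 0
    invalid = 0
    for _ in range(10000):
        status, symbol = episode(ai1, ai2)
        if status == Board.WIN:
            wins[symbol] += 1
        elif status == Board.DRAW:
            draws += 1
        else:
            invalid += 1
    print('block', block,
          '| X wins:', wins[ai1.symbol],
          '| O wins:', wins[ai2.symbol],
          '| draws:', draws,
          '| invalid moves:', invalid)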

Task: Play a game against the trained AI player.

Solution:

# your solution
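
A sketch, assuming the training flag from the QLearningPlayer sketch above (setting it to False makes the trained player act greedily and stops updating its Q-table):

ai.training = False  # greedy play, no exploration, no further learning
human = HumanPlayer('O')

status, symbol = episode(ai, human)
if status == Board.WIN:
    print('Winner:', symbol)
elif status == Board.DRAW:
    print('Draw.')
else:
    print('Invalid move by', symbol)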