# Train a Neural Network to Play Black Jack with Q Learning

Q Learning is a standard algorithm that’s used in Reinforcement Learning. The idea of the function is a utility function, which when given a situation and an action, outputs the expected reward that will happen when you perform the action. Today, we will develop and extend this idea in order to train a Machine Learning model to approximate the function. The full code can be found on GitHub.

Say we would like to learn the optimal policy for a game. This means, at any state ( the collection of all possible states of the game), we want to know the optimal action out of a set of actions . The game is typically called the “environment”, whereas the “agent” is our learner. The agent must be able to accept states from the environment and select an action which is then input back into the environment. If we would like our agent to learn, the environment must also be able to provide a reward for each action. Any Reinforcement Learning algorithm follows the same basic steps:

- Agent is given a state
- Agent chooses such that .
- Input into environment and receive reward and move to a new state .
- Update agent via .
- Repeat

One of the standard Reinforcement Learning algorithms is the Q Learner which aimed to find the function iteratively. This algorithm is not tractible for large state spaces, and due to this limitation, the idea to approximate with a Neural Network arose. This technique has been successful to learn many types of games, including how to play Atari games after defining a reward function (see this paper).

In the sequel, we will discuss the differences between the standard Q Learning algorithm and how to use a Neural Network for reinforcement learning. We will then build a basic Black Jack implementation on which to apply a Neural Network and the standard Q-learning algorithm to compare the outcomes.

# The Basic Algorithm

The learning algorithm is basically identical whether you use the standard Q Learner or a neural network. We are going to outline an -greedy strategy for learning. Let represent our (approximate) function, which is a function from .

- Given a state , find the optimal action .
- With probability , perform , otherwise perform a chosen at random. Call the performed action .
- Receive a reward . Update with the triplet of information .

Step 3 is where we update our learner either by an online update to a Neural Network or an iterative step in the Q Learning algorithm. The choice of allows us to explore the environment rather than only choosing the optimal policy that we have learned up to that moment. When one wants to see how well our learner plays, one should set to 0.

## A Basic Q Learner

Our basic Q-Learner implements the typical Q-Learning algorithm as defined here. At the start, we initialize an empty dictionary where we will store the states and the current iterations’ reward value for that state.

1

2

3

4

5

6

# Q Learner will inherit a Player class from our Black Jack game

# We will get back to that later

class Learner(Player):

def __init__(self):

super().__init__()

self._Q = {}

Assume we are at state , our agent must decide what action to perform. We can do this via

1

2

3

4

5

6

7

8

9

10

11

12

13

14

def get_action(self, state):

# Notice here epsilon is the same as 1-epsilon from our previous discussion

if state in self._Q and np.random.uniform(0,1) < self._epsilon:

action = max(self._Q[state], key = self._Q[state].get)

else:

action = np.random.choice([Constants.hit, Constants.stay])

if state not in self._Q:

self._Q[state] = {}

self._Q[state][action] = 0

self._last_state = state

self._last_action = action

return action

Now, we perform action receiving reward . Then, we may update the approximate function at time , denoted with a value update at via

Here, is called the learning rate and is the discounting factor applied to future rewards. The code is

1

2

3

4

5

6

7

8

9

10

11

12

def update(self,new_state,reward):

if self._learning:

old = self._Q[self._last_state][self._last_action]

if new_state in self._Q:

new = self._discount * \

self._Q[new_state][max(self._Q[new_state], key=self._Q[new_state].get)]

else:

new = 0

self._Q[self._last_state][self._last_action] = \

(1-self._learning_rate)*old + self._learning_rate*(reward+new)

## Neural Network Algorithm

As is typical in Machine Learning problems, we want to train a model to “predict” the value of some function given data. This situation is no different– approximating the function is the same as training a model to predict the value of the reward function. So to approximate our function, we must train the network on the rewards at each observed state.

### Redefining the Q function

Usually, the function is defined as a function , and the original Q Learner algorithm is written with this definition. When using a Neural Network, we can apply a trick to reduce training time: It is easier to model the function as where is the number of elements in , i.e. the output is a vector of rewards where each value corresponds to an action in . This way, we will be able to train a single Neural Network to approximate .

To determine an optimal policy, we input our state into the network, and choose the action corresponding to the entry in the reward vector with the largest reward.

### Implementation

First, we initialize a basic Neural Network (with Keras 0.3.3)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

class DQNLearner(Learner):

def __init__(self):

super().__init__()

self._learning = True

self._learning_rate = .1

self._discount = .1

self._epsilon = .9

# Create Model

model = Sequential()

model.add(Dense(2, init='lecun_uniform', input_shape=(2,)))

model.add(Activation('relu'))

model.add(Dense(10, init='lecun_uniform'))

model.add(Activation('relu'))

model.add(Dense(4, init='lecun_uniform'))

model.add(Activation('linear'))

rms = RMSprop()

model.compile(loss='mse', optimizer=rms)

self._model = model

Next, when we obtain a state we have to return our action. This is implemented via

1

2

3

4

5

6

7

8

9

10

11

12

def get_action(self, state):

rewards = self._model.predict([np.array([state])], batch_size=1)

if np.random.uniform(0,1) < self._epsilon:

if rewards[0][0] > rewards[0][1]:

action = Constants.hit

else:

action = Constants.stay

else:

action = np.random.choice([Constants.hit, Constants.stay])

return action

After receiving a reward and the new state , we may update the Neural Network. This is done by training the model in an online fashion on a single new data point. The data point is the current state and the target is the expected reward vector with the entry for the action performed updated to be , which is defined as the reward we received plus a discounted future reward. The code for such an update is given by

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

def update(self,new_state,reward):

if self._learning:

rewards = self._model.predict([np.array([new_state])], batch_size=1)

maxQ = rewards[0][0] if rewards[0][0] > rewards[0][1] else rewards[0][1]

new = self._discount * maxQ

if self._last_action == Constants.hit:

self._last_target[0][0] = reward+new

else:

self._last_target[0][1] = reward+new

# Update model

self._model.fit(np.array([self._last_state]),

self._last_target,

batch_size=1,

nb_epoch=1,

verbose=0)

# Black Jack implementation

To test our learners, we must build a (simple) implementation of Black Jack where it will output a reward. The goal is to have the highest hand value, but your hand cannot sum to more than 21 or you lose automatically. This logic is stored in two functions.

1

2

3

4

5

6

7

8

9

10

11

12

13

def determine_winner(self,player1,player2):

if player1.get_hand_value() == 21 or \

(player1.get_hand_value() > player2.get_hand_value() \

and player1.get_hand_value() <= 21):

return Constants.player1

else:

return Constants.player2

def determine_if_bust(self,player):

if player.get_hand_value() > 21:

return True

else:

return False

This implementation contains “cards” with values of 1 to 10– there are no suits. The game uses 3 “decks” at a time for its pool of cards. Each deck contains 4 cards of each value (1 up to 10).

1

2

3

4

5

6

7

8

9

class Deck:

def __init__(self):

self.shuffle()

def shuffle(self):

cards = (np.arange(0,10) + 1)

cards = np.repeat(cards,4*3) #4 suits x 3 decks

np.random.shuffle(cards)

self._cards = cards.tolist()

The player plays against the dealer. The dealer will hit unless the dealer’s hand is greater than or equal to 15. Otherwise, the dealer will stay. This is seen in the get_action() function.

1

2

3

4

5

6

7

8

9

10

11

# The dealer is an initialization of the player class

class Player:

def __init__(self):

self._hand = []

self._original_showing_value = 0

def get_action(self, state = None):

if self.get_hand_value() < 15:

return Constants.hit

else:

return Constants.stay

To save time, the main game loop plays both the player and the dealer simultaneously, and every iteraton of the while loop corresponds to a single “hit/stay” for each player. However, the player does not have access to any information about the dealer besides the dealer’s first showing card. In fact, we have chosen to train our learner on the state space consisting of tuples of the player’s hands’ value and the dealer’s showing value.

This simultaneous play slightly modifies the rules of the game. The loop breaks when both players stay or either busts. The only new outcome is that the player could still be hitting when the dealer busts, ending the game in a new fashion. This is not necessarily very common considering the dealer stays at 15 and above, but can happen. Since we’re only trying to demonstrate the concept and not build a model to beat a casino, we’ll let this slide.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

state = self.get_starting_state(p,p2) #p is the player, p2 is the dealer

while True:

# Determine hit/stay

p1_action = p.get_action(state)

p2_action = p2.get_action(state)

# Apply the action if hit

if p1_action == Constants.hit:

p.hit(d)

if p2_action == Constants.hit:

p2.hit(d)

# Check if anyone has busted

if self.determine_if_bust(p):

winner = Constants.player2

break

elif self.determine_if_bust(p2):

winner = Constants.player1

break

# If both players stayed the round is over

if p1_action == p2_action and p1_action == Constants.stay:

break

state = self.get_state(p, p1_action, p2)

p.update(state,0) # Update the learner with a reward of 0 (No change)

Once we are out of this loop, it’s time to determine the winner and apply the correct reward to update our learner.

1

2

3

4

5

6

7

8

9

if winner is None:

winner = self.determine_winner(p,p2)

if winner == Constants.player1:

self.win += 1

p.update(self.get_ending_state(p,p1_action,p2),1) # Win +1

else:

self.loss += 1

p.update(self.get_ending_state(p,p1_action,p2),-1) # Loss -1

At this point, we rinse and repeat to train either learner in an online fashion from its interactions with the game.

# Outcomes

During training, the code outputs a running tally of the win percentage of the learner. Over time you will see this value improving as the learner becomes better. Once training has completed, the learner will continue to play to calculate a win percentage. How long each of these period are can be set in main().

1

2

3

4

5

6

7

8

9

10

11

12

13

14

def main():

# Set number of rounds

num_learning_rounds = 20000

number_of_test_rounds = 1000

# Choose your learner

# DQNLearner is the "Deep Q Network" neural network

learner = DQNLearner() #or Learner() for Q Learner

game = Game(num_learning_rounds, learner)

# Training and Testing

for k in range(0,num_learning_rounds + number_of_test_rounds):

game.run()

At the end of a run, a CSV is output in the directory containing the reward for hit/stay and the optimal strategy for that state. After 20000 training rounds, the Neural Net would hit anytime the hand value was or when its hand value is 16 and the dealer is showing 8,9 or 10. Otherwise, it would stay. This strategy resulted in a win percentage of ~47. After the same number of training rounds, the Q Learner had a win percentage of ~33. By looking at the optimal policy CSV, it seems from the neurotic strategy that it would have benefitted from additional training rounds. In this case, the Neural Network version is able to learn a winning strategy far quicker than the Q Learning algorithm alone.