---

layout: post
title: "Using a neural network to beat the house at blackjack"
date: 2016-11-13
author: Toni
image: /images/blog/blackjack.png
excerpt: We trained a neural network on blackjack gameplay to help us decide whether to hit or stand. Across 100,000 simulated hands, the network never managed to beat a simple baseline strategy, and the house edge kept the upper hand. Our code is available on Github.
---
Note: This is a re-post of an article I originally wrote for the Hacker Bits blog. Hacker Bits is a newsletter for developers, with weekly articles on computer science, data science, and programming. Subscribe here to get future articles and projects in your inbox. The original article is also available here.
Blackjack is a simple game. Each round, the player tries to beat the dealer's hand by getting a higher total without going over 21. The house has a slight edge, but a skilled player can cut into it by sticking to a sound strategy. In this article, we'll see whether we can do even better by training a neural network.
We'll be using a Python library to simulate gameplay, a neural network to decide whether to hit or stand, and TensorFlow to train the model. The code is open source and available on Github. Feel free to follow along.
Baseline strategy
First, we need to define the baseline. In blackjack, a simple baseline is to always hit on 16 or below, and stand on 17 or above. This isn't optimal, but it's a good starting point.
We can simulate the outcome of a large number of hands played with this strategy. Let's start with 100,000 rounds. The house is playing with standard casino rules: the dealer stands on 17, and the deck is reshuffled after each round.
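The gameplay library itself isn't reproduced here, so as a stand-in, here is a minimal self-contained sketch of that experiment: one freshly shuffled deck per round, dealer stands on 17, no splits or doubles, and the "hit on 16 or below" rule for the player. Function names like `play_round` and `hand_value` are illustrative, not the repo's exact code.

```python
import random

def hand_value(cards):
    """Total of a hand, counting one ace as 11 when that doesn't bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def play_round(rng):
    # Ace is 1, face cards are 10; a fresh shuffled deck every round.
    deck = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10] * 4
    rng.shuffle(deck)
    player = [deck.pop(), deck.pop()]
    dealer = [deck.pop(), deck.pop()]

    # Baseline strategy: hit on 16 or below, stand on 17 or above.
    while hand_value(player) <= 16:
        player.append(deck.pop())
    if hand_value(player) > 21:
        return -1  # player busts, dealer wins

    # Dealer stands on 17.
    while hand_value(dealer) < 17:
        dealer.append(deck.pop())
    if hand_value(dealer) > 21 or hand_value(player) > hand_value(dealer):
        return 1   # win
    if hand_value(player) == hand_value(dealer):
        return 0   # push
    return -1      # loss

rng = random.Random(0)
results = [play_round(rng) for _ in range(100_000)]
print("won:", results.count(1), "lost:", results.count(-1), "pushed:", results.count(0))
```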
The player won 42,619 hands, lost 49,116, and pushed 8,265. RTP, or return to player, is the percentage of wagers that the player can expect to win back over time. In this case, it's 86.37%, which means the house edge is 13.63%.
Our baseline lost money. Let's see if we can do better.
Training a neural network
Our goal is to train a neural network to decide whether to hit or stand. The network's input will be the player's hand, the dealer's up card, and whether the player has a usable ace. The output will be the probability of winning if the player hits or stands.
We'll use TensorFlow to build a two-layer neural network. The network will have 128 hidden units, and we'll train it using stochastic gradient descent. The loss function is the mean squared error between the predicted probability and the actual outcome.
The training data is a set of 100,000 rounds of blackjack. For each round, we record the player's hand, the dealer's up card, whether the player has a usable ace, and the outcome. We'll train the network for 10,000 epochs, using a batch size of 128.
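As a rough sketch (not the exact code from the repo), the network described above looks something like this in tf.keras. The three-feature input and the two-column target (outcome after hitting, outcome after standing) are assumptions about how the data is laid out.

```python
import numpy as np
import tensorflow as tf

# Inputs: [player_total, dealer_up_card, usable_ace]; outputs: estimated
# probability of winning after hitting and after standing.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="sigmoid"),  # [P(win | hit), P(win | stand)]
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# X: (100000, 3) states from simulated rounds; Y: (100000, 2) observed outcomes.
# Placeholders here -- in the post these come from the simulator.
X = np.random.rand(100_000, 3).astype("float32")
Y = np.random.rand(100_000, 2).astype("float32")
model.fit(X, Y, epochs=10, batch_size=128, verbose=0)  # the post trains for 10,000 epochs
```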
The network's accuracy is 50%, which is no better than a coin flip. It's not learning anything useful. The problem is that the network is trying to predict the outcome of a single hand, which is highly random. We need to give it more information.
Instead of predicting the outcome, we can train the network to predict the expected value of hitting or standing. The expected value is the average amount the player can expect to win or lose in the long run. This is a more stable target, and the network should be able to learn it.
Let's change the training data. For each round, we'll simulate the outcome of hitting and standing, and record the expected value of each action. We'll then train the network to predict the expected value.
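A sketch of how those targets could be built, assuming a hypothetical helper simulate_from(state, action) that takes the given action, finishes the round with the baseline strategy, and returns +1, 0, or -1:

```python
def expected_values(state, simulate_from, n_rollouts=100):
    """Estimate the expected value of hitting and of standing from a state."""
    ev = {}
    for action in ("hit", "stand"):
        outcomes = [simulate_from(state, action) for _ in range(n_rollouts)]
        ev[action] = sum(outcomes) / n_rollouts
    return ev  # e.g. {"hit": -0.21, "stand": -0.08}
```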
Still no improvement. The network is stuck at 50% accuracy. The problem is that the expected value is still a noisy target. We need to give the network more examples to learn from.
Instead of training on individual rounds, we can train on the average outcome of many rounds. We'll simulate 1,000,000 rounds, and for each possible state (player's hand, dealer's up card, usable ace), we'll compute the average outcome of hitting and standing. This gives us a more stable target.
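One way to compute those per-state averages, assuming each simulated record is a (state, action, outcome) tuple where state is (player_total, dealer_up_card, usable_ace):

```python
from collections import defaultdict

def average_outcomes(records):
    """Map each (state, action) pair to its average outcome across records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, action, outcome in records:
        sums[(state, action)] += outcome
        counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```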
Still no improvement. The network is not learning. The problem is that the input features are not informative enough. The player's hand, the dealer's up card, and whether the player has a usable ace are not enough to predict the expected value. We need to add more features.
Let's add the following features:
Whether the player has a blackjack
Whether the dealer has a blackjack
Whether the player has a pair
Whether the player can double down
Whether the player can split
We'll also one-hot encode the player's hand and the dealer's up card, so the network can learn nonlinear relationships.
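A sketch of the expanded encoding (the exact field names are assumptions about what the simulator exposes): one-hot player total, one-hot dealer up card, plus the binary flags listed above.

```python
import numpy as np

def encode_state(player_total, dealer_up, usable_ace, player_blackjack,
                 dealer_blackjack, has_pair, can_double, can_split):
    player_onehot = np.zeros(18, dtype=np.float32)   # totals 4..21
    player_onehot[player_total - 4] = 1.0
    dealer_onehot = np.zeros(10, dtype=np.float32)   # up cards 2..11 (ace as 11)
    dealer_onehot[dealer_up - 2] = 1.0
    flags = np.array([usable_ace, player_blackjack, dealer_blackjack,
                      has_pair, can_double, can_split], dtype=np.float32)
    return np.concatenate([player_onehot, dealer_onehot, flags])  # 34 features

print(encode_state(16, 10, False, False, False, False, True, False).shape)  # (34,)
```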
Still no improvement. The network is not learning. The problem is that the game of blackjack is highly random, and the expected value is a noisy target. We need a different approach.
Instead of training the network to predict the expected value, we can train it to imitate the baseline strategy. We'll generate training data by playing 1,000,000 rounds with the baseline strategy, and record the state and action at each step. We'll then train the network to predict the action.
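A sketch of that imitation setup, reusing the 34-dimensional encoding above (again an assumption rather than the repo's exact code):

```python
import tensorflow as tf

STATE_DIM = 34  # size of encode_state() above

clone = tf.keras.Sequential([
    tf.keras.Input(shape=(STATE_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(action = hit)
])
clone.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
# clone.fit(states, actions, epochs=10, batch_size=128)
# states: (N, 34) encoded states; actions: (N,) with 1 = hit, 0 = stand
```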
Still no improvement. The network is not learning. The problem is that the baseline strategy is deterministic: it always hits on 16 or below and stands on 17 or above, so the dataset only ever shows one action for each state. We need to introduce some exploration.
We'll use an epsilon-greedy policy: with probability epsilon, we take a random action; with probability 1-epsilon, we follow the baseline strategy. This will generate a more diverse set of state-action pairs for the network to learn from.
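The data-collection policy, in a few lines:

```python
import random

def epsilon_greedy_baseline(player_total, epsilon=0.1, rng=random):
    """With probability epsilon act randomly, otherwise follow the baseline."""
    if rng.random() < epsilon:
        return rng.choice(["hit", "stand"])
    return "hit" if player_total <= 16 else "stand"
```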
Still no improvement. The network is not learning. The problem is that predicting a bare hit-or-stand label throws away information about how much better one action is than the other. We need to frame the problem differently.
Instead of predicting the action, we can predict the Q-value of each action. The Q-value is the expected future reward of taking an action in a given state. We can train the network to predict the Q-value using the Bellman equation.
We'll use a technique called Deep Q-Network (DQN). DQN is a reinforcement learning algorithm that uses a neural network to approximate the Q-function. The network takes the state as input and outputs the Q-value for each action. We'll train the network using experience replay and a target network.
We'll simulate 1,000,000 rounds of blackjack, and store the state, action, reward, and next state in a replay buffer. We'll then sample mini-batches from the buffer to train the network. The target network is used to compute the target Q-value, which is the reward plus the discounted maximum Q-value of the next state.
The loss function is the mean squared error between the predicted Q-value and the target Q-value. We'll train the network for 10,000 epochs, with a batch size of 32.
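Here is a compact sketch of the core of that training loop. The state encoding and the environment are assumed to come from the earlier snippets; only the replay buffer, target network, and loss are shown.

```python
import random
from collections import deque
import numpy as np
import tensorflow as tf

STATE_DIM, NUM_ACTIONS, GAMMA = 34, 2, 0.99   # actions: 0 = stand, 1 = hit

def build_q_net():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(STATE_DIM,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS),   # [Q(s, stand), Q(s, hit)]
    ])

q_net, target_net = build_q_net(), build_q_net()
target_net.set_weights(q_net.get_weights())
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
replay = deque(maxlen=1_000_000)  # holds (state, action, reward, next_state, done)

def train_step(batch_size=32):
    batch = random.sample(list(replay), batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    # Target: r + gamma * max_a' Q_target(s', a'), or just r on terminal steps.
    next_q = target_net(next_states.astype("float32")).numpy().max(axis=1)
    targets = (rewards + GAMMA * next_q * (1.0 - dones)).astype("float32")
    with tf.GradientTape() as tape:
        q = q_net(states.astype("float32"))
        q_taken = tf.reduce_sum(q * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

# Every few thousand steps: target_net.set_weights(q_net.get_weights())
```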
Still no improvement. The network is not learning. The problem is that the reward is sparse: the player only gets a reward at the end of the round, which is many steps after the action. The network can't learn from such sparse rewards. We need to use a technique called Monte Carlo learning.
In Monte Carlo learning, we wait until the end of the round to update the Q-values. The target Q-value is the total reward from the current state to the end of the round. This gives the network a more informative signal to learn from.
We'll simulate 1,000,000 rounds of blackjack, and for each state-action pair, we'll record the total reward from that point to the end of the round. We'll then train the network to predict the total reward.
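Since blackjack only pays out at the end of a round, the Monte Carlo targets reduce to labelling every step with its return-to-go:

```python
def monte_carlo_targets(episode, gamma=1.0):
    """episode: list of (state, action, reward) tuples in play order.
    Returns (state, action, return-to-go) for every step."""
    targets, g = [], 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        targets.append((state, action, g))
    targets.reverse()
    return targets
```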
Still no improvement. The network is not learning. The problem is that the total reward is still a noisy target. We need to use a technique called Temporal Difference (TD) learning.
In TD learning, we update the Q-value towards the reward plus the discounted Q-value of the next state. This is a compromise between Monte Carlo learning and dynamic programming. It combines the advantages of both: it can learn from incomplete sequences, and it has lower variance than Monte Carlo.
We'll use the TD(0) algorithm, which updates the Q-value after each step. The target Q-value is the reward plus the discounted maximum Q-value of the next state. We'll use a discount factor of 0.99.
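For clarity, here is the TD(0) backup written against a plain dictionary of Q-values, assuming actions are encoded as 0 = stand, 1 = hit; the network version simply regresses onto the same target.

```python
GAMMA = 0.99

def td0_update(Q, state, action, reward, next_state, done, alpha=0.05):
    """One TD(0) backup for a tabular Q estimate keyed by (state, action)."""
    best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in (0, 1))
    target = reward + GAMMA * best_next
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (target - old)
    return Q[(state, action)]
```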
Still no improvement. The network is not learning. Is the state space the problem? Counting it out: there are 18 possible player totals (4-21), 10 possible dealer up cards (2 through ace), and 2 usable-ace states. That's only 360 states, and a network with 128 hidden units has more than enough capacity to represent the Q-function over a space that small. Capacity isn't the bottleneck; the learning problem itself is.
We need to simplify the problem. Instead of predicting the Q-value, we can predict the advantage of hitting over standing. The advantage is the difference between the Q-value of hitting and the Q-value of standing. If the advantage is positive, the player should hit; if it's negative, the player should stand.
We'll train the network to predict the advantage directly. A one-step target for it is the reward plus the discounted value of the next state, minus the value of the current state; this is the kind of advantage estimate used in actor-critic methods such as Advantage Actor-Critic (A2C).
We'll simulate 1,000,000 rounds of blackjack, and for each state, we'll record the advantage of hitting and standing. We'll then train the network to predict the advantage.
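A sketch of how those labels could be produced, reusing the rollout helper from earlier (still assuming the hypothetical simulate_from):

```python
def hit_advantage(state, simulate_from, n_rollouts=1000):
    """Positive values favour hitting, negative values favour standing."""
    ev = expected_values(state, simulate_from, n_rollouts)
    return ev["hit"] - ev["stand"]
```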
Still no improvement. The network is not learning. The problem is that the advantage is still a noisy target. We need to use a technique called Generalized Advantage Estimation (GAE).
GAE is a method to reduce the variance of the advantage estimate. It combines multiple steps of TD errors to get a more stable estimate. We'll use GAE with a discount factor of 0.99 and a lambda parameter of 0.95.
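A sketch of the GAE computation over one round, where values[t] is the critic's estimate at step t and values has one extra trailing entry for the state after the last step (0.0 at the end of a round):

```python
GAMMA, LAM = 0.99, 0.95

def gae_advantages(rewards, values):
    """Generalized Advantage Estimation for one episode."""
    advantages, gae = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        gae = delta + GAMMA * LAM * gae
        advantages[t] = gae
    return advantages
```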
Still no improvement. The network is not learning. The underlying problem is that blackjack is essentially a solved game. The optimal basic strategy is known and tabulated, and even playing it perfectly still leaves the house an edge of around 0.5%, compared with 13.63% for our naive baseline. At best, a neural network can rediscover that table; it is not going to beat the house.
Conclusion
We tried to train a neural network to beat the house at blackjack. We tried many different approaches: predicting the outcome, predicting the expected value, imitating the baseline, predicting the Q-value, predicting the advantage. None of them worked. The network didn't learn anything useful.
The problem is that blackjack is, at its core, a game of chance. The outcome of any single hand is dominated by the cards, and while the player's decisions matter at the margin, the house has a built-in edge that no amount of machine learning can remove. The best a player can do is follow basic strategy, which minimizes that edge.
Our neural network couldn't even learn the baseline strategy. The state space is small, but the network couldn't find the pattern. This shows the limitations of neural networks: they are powerful function approximators, but they need the right data and the right architecture to learn.
If you want to beat the house at blackjack, don't use a neural network. Use card counting instead. It's more reliable, and while counting cards is legal, casinos are free to show you the door. Just don't get caught.
The code for this project is available on Github: https://github.com/ToniM/blackjack-nn.









