Machine learning is exciting. But Reinforcement Learning is just amazing!

Reinforcement Learning much more closely resembles the way humans interact with and understand their environment.

One can attempt so many practical applications of Reinforcement Learning, from experimenting with OpenAI Gym environments to teaching your agents to play games.

Or it can let you get rid of all the “if-then-else” clauses you were planning to write while programming the AI of the bad guys in your next game.

The applications are really exciting and endless!

To solve a problem with Reinforcement Learning, you have to define the environment, the learning method, what is good and what is bad …

There are a lot of problems along the way, the first one being: which algorithm is best?

There are plenty of Reinforcement Learning algorithms out there: Q-learning, SARSA, Monte Carlo methods, policy gradients, and so on. There is no single best method.

But the one that has really stood out over the past few years is the Deep Q Network (DQN) from DeepMind.

As I am completing my introduction to Reinforcement Learning, I have implemented a Deep Q Network using Python and Keras. What is particular about this implementation is that it allows for a lot of configuration options.

Here is the link to the GitHub project.

The project solves the CartPole OpenAI Gym problem, but it can very easily be changed to work with other problems as well.

The inputs to the Neural Network are the state space dimensions, and the outputs are the Q values for each possible action.

So, if the state space has 4 dimensions, there are 4 inputs to the Neural Network. If there are 3 possible actions for your agent, there will be 3 outputs.
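To make that concrete, here is a minimal sketch of what such a network could look like in Keras, using the default options listed further down (tanh activation, [128, 128] hidden layers, bias terms, the adam optimizer with a 0.001 learning rate). It is an illustration rather than the actual code from the project.

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

state_dim = 4   # the example above: a state space with 4 dimensions
n_actions = 3   # and an agent with 3 possible actions

model = Sequential()
# Two hidden layers of 128 nodes each, matching the default [128, 128] configuration.
model.add(Dense(128, input_dim=state_dim, activation='tanh', use_bias=True))
model.add(Dense(128, activation='tanh', use_bias=True))
# One linear output per action: each output is the estimated Q value of that action.
model.add(Dense(n_actions, activation='linear'))
model.compile(loss='mse', optimizer=Adam(lr=0.001))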

If you would like to use it with another OpenAI Gym environment, simply change line 78:

env = gym.make('CartPole-v0')

You should also adjust the condition under which the problem is considered solved. This is on line 132:

if average>=195:
        print "Solved after "+str(i)+" episodes, with last 100 episodes average total reward "+\
            str(average)
        solved = True
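If you switch environments, this threshold has to change as well. For instance, trying MountainCar-v0 instead (as far as I recall, the Gym docs consider it solved at an average total reward of -110 over the last 100 episodes) would mean something like

env = gym.make('MountainCar-v0')

and

if average >= -110: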

Nothing else is needed.

If you want to use your own environment and still use this algorithm, you should define your environment according to the rules of the OpenAI Gym projects.

Actually, this is not such a bad idea, because it presents a nice and neat way of organizing your code, with the added bonus that it can be used by others as well.
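As a rough sketch of what that means (the environment name, dynamics and reward below are completely made up), a custom environment only has to subclass gym.Env, declare its observation and action spaces, and implement reset() and step():

import gym
from gym import spaces
import numpy as np

class MyCustomEnv(gym.Env):
    # Hypothetical environment: a 4-dimensional state and 3 discrete actions.
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))
        self.action_space = spaces.Discrete(3)
        self.state = np.zeros(4)

    def reset(self):
        # Start a new episode and return the initial observation.
        self.state = np.zeros(4)
        return self.state

    def step(self, action):
        # Apply the action and return (observation, reward, done, info).
        self.state = np.clip(self.state + np.random.uniform(-0.1, 0.1, 4), -1.0, 1.0)
        reward = 1.0
        done = False
        return self.state, reward, done, {}

With an environment shaped like this, you would construct it in place of the gym.make call and adjust the “solved” condition, as described above.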

There are quite a few configuration options. Here they are:

  • Activation function: The activation function used in the Neural Network. Options are tanh, relu and softmax (default: tanh)
    Non-linearity is an issue when it comes to training Reinforcement Learning Neural Networks, so this is important.
     
  • Neural network layers: A list of the number of nodes in each layer of the Neural Network (default: [128,128])
    Just play with it until you have a network with good learning capacity and a “reasonable” learning speed. Experimentation is the only way.
     
  • Gamma value: The discount factor gamma used in the Bellman equation update of the Q values (default: 0.9)
     
  • Experience buffer length: The length of the experience buffer (default: 200)
    Make sure the buffer is long enough to store values from many different episodes. Otherwise it does not serve the purpose it was introduced for.
     
  • Experience buffer batch size: The size of each training batch, picked randomly from the experience buffer (default: 48)
    A proper batch size allows for a fair number of samples from each episode in each training batch and at the same time a “good” learning speed
     
  • With bias term: Whether a bias term will be included in the Neural Network layers (default: true)
    My attempts to train networks without the bias term failed miserably, but the option to leave it out is there if you need it.
     
  • Optimizer: The optimizer used in the training of the Neural Network. Options are adam and rmsprop (default: adam)
    In my limited experience, I have not seen that making a difference in the Reinforcement Learning problems I tried. But it does not hurt to have it as an option
     
  • Optimizer learning rate: The learning rate of the optimizer (default: 0.001)
    The usual trade-off: a higher learning rate can speed things up, but it can also reduce the learning quality.
     
  • Copy period: After how many episodes the target Neural Network weights are copied over to the Q value approximation Neural Network (default: 40)
    I prefer to use a lower value, in order to get quicker feedback on the quality of the trained target network and then fine-tune it. The first sketch after this list shows where this copy fits in a training step.
     
  • Training epochs: The number of training epochs. Each training epoch uses a different random batch from the experience buffer (default: 1)
    An attempt at “k-fold”-style training.
     
  • Minimum epsilon: The minimum value of epsilon. Epsilon decays over time. (default: 0.1)
    How low epsilon can get. The lower epsilon is, the more you should be able to trust your learned Q values and the less exploration your algorithm will do. Too low an epsilon value results in very little exploration, so you will be playing the game but not exploring new actions.
     
  • Scaler: The algorithm used for scaling the observation values. Options are “play” and “random”. The first plays games and samples the observation values; the second creates random observation samples (default: play)
    Basically, the scaling method you choose for preprocessing your input should be applicable to the entire input space. If an untrained Deep Q Network is initially stuck in the same very small subspace of your input space, and you train your scaler over this subspace, then it will not work well once your agent moves out of it.
    On the other hand, random observation values may not be as realistic as the input values you get by actually playing the game. The second sketch after this list shows the difference between the two approaches.
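To make the interplay between the gamma value, the experience buffer, the batch size and the copy period more concrete, here is a simplified sketch of a single DQN training step, rather than the actual code from the project. It copies the weights of the network being trained into a frozen copy that is used to compute the Q targets, which is the standard DQN convention; the function and variable names are made up.

import random
import numpy as np

GAMMA = 0.9        # discount factor used in the Bellman update
BATCH_SIZE = 48    # random sample drawn from the experience buffer
COPY_PERIOD = 40   # episodes between weight copies into the frozen target model

def train_on_batch(model, target_model, experience_buffer):
    # Hypothetical sketch of one DQN training step.
    if len(experience_buffer) < BATCH_SIZE:
        return
    batch = random.sample(experience_buffer, BATCH_SIZE)
    states = np.array([e[0] for e in batch])
    actions = np.array([e[1] for e in batch])
    rewards = np.array([e[2] for e in batch])
    next_states = np.array([e[3] for e in batch])
    dones = np.array([e[4] for e in batch])

    # Bellman targets: r + gamma * max_a' Q(s', a'), with no bootstrapping on terminal states.
    next_q = target_model.predict(next_states)
    targets = model.predict(states)
    targets[np.arange(BATCH_SIZE), actions] = rewards + (1 - dones) * GAMMA * np.max(next_q, axis=1)
    model.fit(states, targets, epochs=1, verbose=0)

# Every COPY_PERIOD episodes, refresh the frozen copy:
#     if episode % COPY_PERIOD == 0:
#         target_model.set_weights(model.get_weights())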
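Similarly, here is a rough sketch of what the two scaler options (“play” and “random”) boil down to. Scikit-learn’s StandardScaler stands in for the actual scaling step, and the helper names are made up; the real code differs, but the idea is the same.

import numpy as np
from sklearn.preprocessing import StandardScaler

def gather_play_observations(env, n_episodes=20):
    # "play": collect observations by actually playing episodes (here with random actions).
    observations = []
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            observations.append(obs)
            obs, reward, done, info = env.step(env.action_space.sample())
    return np.array(observations)

def gather_random_observations(env, n_samples=10000):
    # "random": draw observation vectors directly from the observation space.
    return np.array([env.observation_space.sample() for _ in range(n_samples)])

scaler = StandardScaler()
scaler.fit(gather_play_observations(env))   # or gather_random_observations(env)
# Every observation is then scaled before being fed to the network:
# scaled_obs = scaler.transform([obs])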

In addition to the configuration options, the code saves binary files with valuable information when the experiment ends.

One file contains the loss function values, one for each episode, averaged over the entire episode.

The next one contains the variance of the loss function values, again one for each episode and averaged over the entire episode.

The final one contains the average of a rolling window containing the last 100 episode total rewards.

The naming of the files looks quite complicated at first, because it encodes the options used in the execution that produced them.

Try it and you will see what I mean. However, it is a good way of distinguishing between the output files of different executions.

Especially if you set up a large number of different executions with different options to run over the coming week while you are away.

I hope my little piece of code helps you experiment with Deep Q Networks on the environments of your choice, and I really hope you enjoy Reinforcement Learning as much as I do!