The Theano tutorial has an example of using momentum in gradient-based optimization. Details are here (see [24]). The gist is:
# initialize momentum
param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
# take a gradient step using momentum
updates.append((param, param - learning_rate*param_update))
# update momentum
updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))
However, it's not quite right. Notice that shared variables in Theano are updated in parallel, not one after the other. For example, if you have the following update rule for shared variables a and b
updates=[(a, f(a,b)), (b, g(a,b))]
then at iteration t+1, a gets f(a_t, b_t) and b gets g(a_t, b_t), not g(a_{t+1}, b_t)!
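To see the parallel semantics concretely, here is a minimal sketch (the variable names and starting values are just for illustration, assuming a standard Theano install). It swaps two shared variables in a single function call, which only works because both right-hand sides are evaluated from the old values:
import numpy as np
import theano

# two shared variables with known starting values
a = theano.shared(np.float64(1.0))
b = theano.shared(np.float64(2.0))

# both right-hand sides are computed from the old values of a and b,
# so this swaps them instead of setting both to the old b
swap = theano.function([], [], updates=[(a, b), (b, a)])

swap()
print(a.get_value(), b.get_value())  # prints 2.0 1.0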
In the cited example, "param" takes its step using the momentum from the previous iteration, not the freshly updated one! The correct way is to change the second line of the above code so that the gradient step uses the new momentum directly:
# take a gradient step using the updated momentum
updates.append((param, param - learning_rate*(
    momentum*param_update + (1. - momentum)*T.grad(cost, param)
)))
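Putting it together, the corrected update list might be built as in the sketch below. The helper name momentum_updates and the loop over a list of parameters are my own framing, not from the tutorial; only the two update pairs per parameter are what the fix above requires.
import theano
import theano.tensor as T

def momentum_updates(cost, params, learning_rate, momentum):
    # hypothetical helper: builds the corrected update list for all params
    updates = []
    for param in params:
        # velocity, initialized to zeros of the same shape as param
        param_update = theano.shared(param.get_value()*0.,
                                     broadcastable=param.broadcastable)
        new_velocity = momentum*param_update + (1. - momentum)*T.grad(cost, param)
        # step with the *new* velocity, since all updates are applied in parallel
        updates.append((param, param - learning_rate*new_velocity))
        updates.append((param_update, new_velocity))
    return updates
Because Theano builds a symbolic graph, reusing new_velocity in both update pairs costs nothing extra; the expression is compiled once and shared between the two updates.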