Monday, January 18, 2016

very important issue with momentum in theano

The Theano tutorial has an example of using momentum in gradient-based optimization; details are in [24]. The gist is:

 # initialize momentum  
 param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)  
   
 # take a gradient step using momentum 
 updates.append((param, param - learning_rate*param_update))   
   
 # update momentum  
 updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))   
   

However, it's not quite right. Notice that shared variables in Theano are updated in parallel, not one after the other. For example, suppose you have the following updates for shared variables a and b:

 updates = [(a, f(a, b)), (b, g(a, b))]

Then at iteration t+1, a gets f(a_t, b_t) and b gets g(a_t, b_t), not g(a_{t+1}, b_t)!!!
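
To see this concretely, here is a minimal sketch (the shared variables a and b and the toy update expressions are just for illustration) confirming that both right-hand sides are evaluated with the old values:

 import numpy as np
 import theano

 # two shared variables with known starting values
 a = theano.shared(np.asarray(1.0))
 b = theano.shared(np.asarray(10.0))

 # both update expressions are evaluated with the OLD values of a and b
 step = theano.function(
     inputs=[],
     updates=[(a, a + b),      # new a = a_t + b_t
              (b, 2.0 * a)])   # new b = 2*a_t, NOT 2*(a_t + b_t)

 step()
 print(a.get_value())  # 11.0
 print(b.get_value())  # 2.0  <- used the old a, confirming parallel updates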

In the cited example, "param" is going to use the momentum from the previous step, not the updated one! The correct way is to change the parameter update (the second statement in the code above):

 # take the gradient step using the freshly computed momentum
 updates.append((param, param - learning_rate*(momentum*param_update + (1. - momentum)*T.grad(cost, param))))
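
The momentum update itself stays as in the tutorial. Equivalently, and perhaps more readably, you can build the new momentum expression once and reuse it in both updates, so the parallel evaluation of the update list no longer matters. Here is a self-contained toy sketch (the names w, learning_rate, and momentum, and the quadratic cost, are just for illustration):

 import numpy as np
 import theano
 import theano.tensor as T

 learning_rate = 0.1
 momentum = 0.9

 # toy problem: minimize cost = sum(w**2) with momentum SGD
 w = theano.shared(np.ones(3), name='w')
 cost = T.sum(w ** 2)

 # initialize momentum
 w_update = theano.shared(w.get_value() * 0., broadcastable=w.broadcastable)

 # build the new momentum expression once...
 new_momentum = momentum * w_update + (1. - momentum) * T.grad(cost, w)

 # ...and reuse it in both updates; since both right-hand sides are built
 # from the same expression, the parallel update semantics cause no harm
 updates = [(w_update, new_momentum),
            (w, w - learning_rate * new_momentum)]

 train = theano.function([], cost, updates=updates)
 for i in range(100):
     train()
 print(w.get_value())  # approximately [0, 0, 0]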
