Sunday, January 24, 2016

caffe stepsize rule

Caffe's stepsize rules (learning rate decay policies) are documented in
 
/caffe-master/src/caffe/proto/caffe.proto 
 
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
//    - fixed: always return base_lr.
//    - step: return base_lr * gamma ^ (floor(iter / step))
//    - exp: return base_lr * gamma ^ iter
//    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
//    - multistep: similar to step but it allows non uniform steps defined by
//      stepvalue
//    - poly: the effective learning rate follows a polynomial decay, to be
//      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
//    - sigmoid: the effective learning rate follows a sigmoid decay
//      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
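
To make the policies concrete, here is a plain-Python transcription of the formulas quoted above. The argument names mirror the solver parameters (base_lr, gamma, stepsize, power, max_iter, stepvalue); this is only a sketch, not Caffe's actual C++ implementation, and the multistep branch is my reading of "non uniform steps defined by stepvalue":

 import math
 
 def caffe_lr(policy, iteration, base_lr, gamma=None, stepsize=None,
              power=None, max_iter=None, stepvalue=None):
     if policy == 'fixed':
         return base_lr
     if policy == 'step':
         return base_lr * gamma ** (iteration // stepsize)
     if policy == 'exp':
         return base_lr * gamma ** iteration
     if policy == 'inv':
         return base_lr * (1 + gamma * iteration) ** (-power)
     if policy == 'multistep':
         # stepvalue is a list of iterations at which the rate drops by gamma
         return base_lr * gamma ** sum(1 for s in stepvalue if iteration >= s)
     if policy == 'poly':
         return base_lr * (1 - float(iteration) / max_iter) ** power
     if policy == 'sigmoid':
         return base_lr * (1. / (1. + math.exp(-gamma * (iteration - stepsize))))
     raise ValueError('unknown lr_policy: %s' % policy)
 
 # e.g. a "step" schedule that drops the rate 10x every 100000 iterations:
 print(caffe_lr('step', 150000, base_lr=0.01, gamma=0.1, stepsize=100000))  # 0.001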

Monday, January 18, 2016

very important issue with momentum in theano

The Theano tutorial has an example of using momentum in gradient-based optimization; details are here (see [24]). The gist is:

 # initialize momentum  
 param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)  
   
 # take a gradient step using momentum 
 updates.append((param, param - learning_rate*param_update))   
   
 # update momentum  
 updates.append((param_update, momentum*param_update + (1. - momentum)*T.grad(cost, param)))   
   

However, it's not quite right: shared variable updates in Theano are applied in parallel, not one after the other. For example, if you have the following updates for shared variables a and b

updates = [(a, f(a, b)), (b, g(a, b))]

then at iteration t+1, a gets f(a_t, b_t) and b gets g(a_t, b_t), not g(a_{t+1}, b_t)!
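
A tiny Theano example makes the parallel semantics visible (the initial values 1.0 and 2.0 and the update expressions a + b and a * b are just for illustration):

 import numpy as np
 import theano
 
 a = theano.shared(np.float64(1.0), name='a')
 b = theano.shared(np.float64(2.0), name='b')
 
 # both right-hand sides are evaluated with the old values of a and b,
 # and only then are the new values written back
 step = theano.function([], [], updates=[(a, a + b), (b, a * b)])
 step()
 
 print(a.get_value())  # 3.0 = f(a_t, b_t) = 1 + 2
 print(b.get_value())  # 2.0 = g(a_t, b_t) = 1 * 2, not g(a_{t+1}, b_t) = 3 * 2 = 6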

In the cited example, "param" takes its step using the momentum from the previous iteration, not the freshly updated one! The fix is to change the parameter step (the 2nd line of code above) so that the new momentum is computed inline:

 # take the gradient step using the freshly computed momentum  
 updates.append((param, param - learning_rate*(momentum*param_update + (1. - momentum)*T.grad(cost, param))))
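
The momentum update line itself, updates.append((param_update, ...)), stays as it was, so the velocity is still stored for the next iteration. For completeness, here is a self-contained sketch of the corrected pair on a toy one-parameter quadratic cost (the initial value 5.0, learning_rate = 0.1 and momentum = 0.9 are just illustrative choices):

 import numpy as np
 import theano
 import theano.tensor as T
 
 learning_rate = 0.1
 momentum = 0.9
 
 param = theano.shared(np.float64(5.0), name='param')
 cost = param ** 2  # toy cost with minimum at 0
 
 # velocity accumulator, initialized to zero
 param_update = theano.shared(param.get_value()*0., broadcastable=param.broadcastable)
 
 # the freshly computed velocity, used in both updates below
 new_velocity = momentum*param_update + (1. - momentum)*T.grad(cost, param)
 
 updates = []
 # take the gradient step with the new velocity, computed inline
 updates.append((param, param - learning_rate*new_velocity))
 # store the same velocity for the next iteration
 updates.append((param_update, new_velocity))
 
 train = theano.function([], cost, updates=updates)
 for _ in range(20):
     train()
 print(param.get_value())  # param has moved toward the minimum at 0

Because both entries of updates reference the same new_velocity expression, the parameter really does move along the freshly computed velocity, while param_update stores that same value for the next call, which is exactly what the sequential reading of the tutorial code intends.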