Thanks vart and matsp!
vart, your suggestion cut the run time in half! The time cost is now distributed as follows:
Code:
... // _xi is precomputed
size = width * height;
for each content1
    for (k = 0; k < size; k++) {            // 7.64% time
        if (content1[k] == 0) {             // 8.73% time
            for each content2 {             // 8.75% time
                tmp = _x[content2[k]];      // 18.90% time
                l += tmp;                   // 28.53% time
            }
        }
    }
Reorganizing the nested loops could help to eliminate unnecessary repeated operations. Thanks!
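
For anyone reading along, here is a rough sketch of one way the loops could be reorganized. It rests on assumptions that are not in my real code above: that l is a single running total over all content1/content2 pairs, that every image is stored back to back in one flat array of `size` pixels each, and that the lookup table holds doubles. The function names, parameter names, and types are made up for illustration. The idea is that the sum of _x[content2[k]] over all content2 depends only on k, so it can be computed once per pixel instead of once per content1.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout (an assumption, not from the original code):
 * content1 holds n1 images and content2 holds n2 images, each stored
 * back to back as `size` pixel labels; x is the precomputed table. */

/* Same loop order as the profiled pseudocode. */
double accumulate_naive(const unsigned char *content1, size_t n1,
                        const unsigned char *content2, size_t n2,
                        const double *x, size_t size)
{
    double l = 0.0;
    for (size_t i = 0; i < n1; i++)
        for (size_t k = 0; k < size; k++)
            if (content1[i * size + k] == 0)
                for (size_t j = 0; j < n2; j++)
                    l += x[content2[j * size + k]];
    return l;
}

/* Reorganized: the per-pixel sum over content2 is computed once,
 * then reused for every content1 image. */
double accumulate_hoisted(const unsigned char *content1, size_t n1,
                          const unsigned char *content2, size_t n2,
                          const double *x, size_t size)
{
    double *per_pixel = calloc(size, sizeof *per_pixel);
    if (per_pixel == NULL)  /* fall back to the simple version */
        return accumulate_naive(content1, n1, content2, n2, x, size);

    /* Pass 1: per_pixel[k] = sum over all content2 of x[content2[k]]. */
    for (size_t j = 0; j < n2; j++)
        for (size_t k = 0; k < size; k++)
            per_pixel[k] += x[content2[j * size + k]];

    /* Pass 2: one addition per zero pixel of each content1 image. */
    double l = 0.0;
    for (size_t i = 0; i < n1; i++)
        for (size_t k = 0; k < size; k++)
            if (content1[i * size + k] == 0)
                l += per_pixel[k];

    free(per_pixel);
    return l;
}
```

Under those assumptions the table lookups drop from roughly n1 * n2 * size to n2 * size, at the cost of one temporary array of `size` doubles. The additions happen in a different order, so with floating-point values the two totals can differ in the last bits. If l has to be kept separately for each content1/content2 pair, this shortcut does not apply as written.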