I just saw that this technique is said to be used by the IBM POWER7 :)
We get CPI from the equation below (assume we have only two cache levels, and the L2 is an inclusive, unified cache):
CPI = EXE + 0.67 * (hit_li1 + li1_missrate * (l2_hit + l2_missrate * l2_penalty)) + 0.33 * (hit_l1 + l1_missrate * (l2_hit + l2_missrate * l2_penalty))
EXE: average base execution cycles per instruction (ignoring cache effects).
hit_li1: cycles for an L1 instruction-cache hit (1 cycle when the pipeline is full).
hit_l1: cycles for an L1 data-cache hit (1 cycle when the pipeline is full).
li1_missrate: miss ratio of the L1 instruction cache.
l1_missrate: miss ratio of the L1 data cache.
l2_hit: cycles for an L2 cache hit.
l2_missrate: miss ratio of the L2 cache.
l2_penalty: miss-penalty cycles when an L2 cache miss occurs.
The L1/L2 hit cycles and l2_penalty are determined by the cache units and the system bus, respectively.
Total execution time is CPI * clock cycle time * instruction count.
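The model above can be sketched in a few lines. All numeric values here are hypothetical placeholders chosen just to exercise the formula, not measurements from any real machine:

```python
def cpi(exe, hit_i, i_miss, hit_d, d_miss, l2_hit, l2_miss, l2_penalty):
    """CPI for a 2-level hierarchy with an inclusive, unified L2."""
    l2_cost = l2_hit + l2_miss * l2_penalty   # average cycles once an L1 miss occurs
    i_fetch = hit_i + i_miss * l2_cost        # instruction-fetch component
    d_access = hit_d + d_miss * l2_cost       # data-access component
    return exe + 0.67 * i_fetch + 0.33 * d_access

def exec_time_ns(cpi_value, cycle_time_ns, instr_count):
    """Total execution time = CPI * clock cycle time * instruction count."""
    return cpi_value * cycle_time_ns * instr_count

# Made-up rates: 2% I-cache miss, 5% D-cache miss, 20% L2 miss.
c = cpi(exe=1.0, hit_i=1, i_miss=0.02, hit_d=1, d_miss=0.05,
        l2_hit=6, l2_miss=0.2, l2_penalty=100)
t = exec_time_ns(c, cycle_time_ns=0.5, instr_count=1_000_000)
```

With zero miss rates this collapses to EXE plus the weighted L1 hit cycles, which is a quick sanity check on the weights.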
If we find that the program's speed depends mostly on cache misses, we could slow our CPU frequency and maybe save power without hurting performance. For example, because of the slower frequency we need more time to handle the current data, so with an independently clocked dcache unit, fetching the next line becomes more efficient, and the cache miss penalty (measured in CPU cycles) also becomes smaller.
Because the CPI from the equation above gets smaller, the total execution time at the slower CPU frequency could stay about the same as at the higher frequency.
Normally cache access latency is 4/6 cycles; if the CPU frequency becomes half the original, our cache access latency only needs 2/3 cycles, so CPI becomes smaller.
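Here is rough arithmetic for that claim, assuming the caches and memory run on clocks independent of the core, so their latency in nanoseconds is fixed and halving the core frequency halves their cost in core cycles. All numbers (2 GHz core, 50 ns DRAM, the miss rates) are illustrative, not from a real design:

```python
def cpi(exe, hit_i, i_miss, hit_d, d_miss, l2_hit, l2_miss, l2_penalty):
    # Same CPI model as above: weighted I-fetch and D-access paths.
    l2_cost = l2_hit + l2_miss * l2_penalty
    return exe + 0.67 * (hit_i + i_miss * l2_cost) + 0.33 * (hit_d + d_miss * l2_cost)

I_MISS, D_MISS, L2_MISS = 0.10, 0.30, 0.5   # a deliberately memory-bound workload

# Full speed: 2 GHz core -> 0.5 ns cycle; L2 hit 6 cycles; 50 ns DRAM -> 100 cycles.
cpi_full = cpi(1.0, 1, I_MISS, 1, D_MISS, l2_hit=6, l2_miss=L2_MISS, l2_penalty=100)
time_full = cpi_full * 0.5                  # ns per instruction

# Half speed: 1 GHz core -> 1.0 ns cycle; the fixed nanosecond latencies of the
# L2 and DRAM now cost half as many core cycles: L2 hit 3 cycles, penalty 50.
cpi_half = cpi(1.0, 1, I_MISS, 1, D_MISS, l2_hit=3, l2_miss=L2_MISS, l2_penalty=50)
time_half = cpi_half * 1.0                  # ns per instruction

slowdown = time_half / time_full            # well below 2x despite half the frequency
```

The memory-stall portion of the runtime is unchanged in nanoseconds; only the core-bound portion (EXE and the L1 hit cycles) stretches. The more miss-bound the program, the closer the slowdown gets to 1.0, which is where the power saving comes essentially for free.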