“What Every Programmer Should Know About Memory”
is the title of an interesting paper by Ulrich Drepper of Red Hat, Inc. The paper is, as you might have guessed, about the memory architecture of modern CPUs and what you, dear programmer, can do to improve the performance of your code based on this knowledge. The paper has a section on tools for performance tuning, which also discusses OProfile. OProfile is an interface to the CPU’s event counters (which vary across platforms). These counters can be queried to investigate the behaviour of a given piece of code on a given CPU. Intel’s Intel 64 and IA-32 Architectures Optimization Reference Manual suggests a list of event counters to look at. For instance, the Intel engineers write:
“Clocks Per Instruction Retired Ratio (CPI): CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. The Intel Core microarchitecture is capable of reaching CPI as low as 0.25 in ideal situations. But most of the code has higher CPI. The greater value of CPI for a given workload indicate it has more opportunity for code tuning to improve performance. The CPI is an overall metric, it does not provide specificity of what microarchitectural sub-system may be contributing to a high CPI value.”
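As a rough illustration of the ratio quoted above, here is a minimal Python sketch. It assumes a sampling profiler that records one sample every fixed number of events, so the estimated event total is the sample count times that interval. The intervals, the symbol names and the sample counts are all made up for the example; they are not measurements and not OProfile output.

```python
def cpi(clk_samples, inst_samples, clk_interval=100_000, inst_interval=100_000):
    """Estimate CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY from samples.

    Each sample is assumed to stand for `interval` occurrences of its
    event, so estimated events = samples * interval.
    """
    if inst_samples <= 0:
        raise ValueError("need at least one INST_RETIRED.ANY sample")
    cycles = clk_samples * clk_interval
    instructions = inst_samples * inst_interval
    return cycles / instructions


# Hypothetical per-symbol sample counts: (clock samples, instruction samples).
samples = {
    "hot_loop": (9400, 10000),
    "copy_rows": (3700, 2000),
}

for symbol, (clk, inst) in samples.items():
    print(f"{symbol:12s} {cpi(clk, inst):.2f}")
```

If both events are sampled at the same interval, the intervals cancel and the CPI is simply the ratio of the two sample counts.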
To take it for a spin I chose the obvious target: the libM4RI library. Below is the CPI for bench_elimination 20000 20000 pluq.
== UNIFORMLY RANDOM ==
symbol                       CPI   % of overall time
------------------------------------------------------
_mzd_mul_m4rm                0.94  (49.68)
mzd_process_rows             1.76  (13.68)
.plt                         0.62  (10.47)
mzd_process_rows2_pluq       1.88  (10.07)
mzd_make_table               0.76  ( 8.31)
mzd_copy                     1.85  ( 1.80)
_mzd_mul_naive               1.10  ( 1.38)
mzd_apply_p_right_trans      0.99  ( 1.14)
_mzd_trsm_lower_left_even    0.75  ( 0.94)
mzd_apply_p_right            0.81  ( 0.93)
mzd_combine                  3.17  ( 0.61)
Note that mzd_process_rows() and mzd_process_rows2_pluq() (which are both called from the PLUQ MMPF base case) have a rather high CPI compared to the rest of the code, indicating that I should sit down and implement a variant that uses more than two tables (multiplication uses eight tables). I don’t understand the high CPI for mzd_copy() though. Half rank matrices behave quite differently, but surprisingly the CPI for mzd_apply_p_right_trans() is rather low.
== HALF RANK ==
symbol                       CPI   % of overall time
------------------------------------------------------
_mzd_mul_m4rm                0.93  (40.01)
mzd_apply_p_right_trans      0.79  (10.05)
mzd_apply_p_right            0.78  ( 7.44)
mzd_make_table               0.75  ( 7.26)
mzd_find_pivot               1.44  ( 6.00)
.plt                         0.68  ( 5.64)
_mzd_pluq_mmpf               1.58  ( 4.58)
_mzd_pluq_submatrix          0.90  ( 3.88)
mzd_col_swap                 1.20  ( 3.58)
mzd_process_rows             1.76  ( 2.85)
mzd_row_add_offset           1.61  ( 2.24)
mzd_process_rows2_pluq       1.88  ( 2.10)
_mzd_mul_naive               1.19  ( 1.49)
mzd_copy                     2.17  ( 1.32)
mzd_combine                  3.31  ( 0.89)
_mzd_trsm_lower_left_even    0.75  ( 0.25)
_mzd_trsm_upper_left_even    0.67  ( 0.13)
mzd_init_window              1.24  ( 0.07)
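My preliminary understanding of the large share of time spent in .plt (the procedure linkage table, i.e. the stubs through which calls into shared libraries are routed) is that compiling the benchmarketing software with -static would improve the timings (or that I should inline more?).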
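The way I read the uniformly random table (look for symbols that combine a high CPI with a sizeable share of the run time) can be sketched as a small filter. The rows are copied from that table; the two thresholds are arbitrary assumptions picked for illustration, not values suggested by Intel or OProfile.

```python
# (symbol, CPI, % of overall time), copied from the uniformly random run.
rows = [
    ("_mzd_mul_m4rm", 0.94, 49.68),
    ("mzd_process_rows", 1.76, 13.68),
    (".plt", 0.62, 10.47),
    ("mzd_process_rows2_pluq", 1.88, 10.07),
    ("mzd_make_table", 0.76, 8.31),
    ("mzd_copy", 1.85, 1.80),
    ("_mzd_mul_naive", 1.10, 1.38),
    ("mzd_apply_p_right_trans", 0.99, 1.14),
    ("_mzd_trsm_lower_left_even", 0.75, 0.94),
    ("mzd_apply_p_right", 0.81, 0.93),
    ("mzd_combine", 3.17, 0.61),
]


def tuning_candidates(rows, min_cpi=1.5, min_share=5.0):
    """Return symbols whose CPI and time share both reach the thresholds.

    min_cpi and min_share are assumed cut-offs for this sketch only.
    """
    return [name for name, cpi, share in rows
            if cpi >= min_cpi and share >= min_share]


candidates = tuning_candidates(rows)
print(candidates)  # ['mzd_process_rows', 'mzd_process_rows2_pluq']
```

With these cut-offs mzd_copy and mzd_combine drop out despite their high CPI, because they account for less than 2% of the run time each.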

