Profiling
The goal of profiling is to analyse the behaviour of a program during execution. This is accomplished by collecting a wide variety of data, including hardware data from the CPU, such as CPU cycles, cache misses and branch counts, in addition to application-level data, such as the number of function calls or the call graph. These data can be used to build a detailed picture of a single application as well as of the whole system.
Software instrumentation
In order to collect data during a measurement period we can use different approaches, one of them being software instrumentation. The idea behind this solution is to add code snippets that collect the required data. Such pieces of code can be added either directly to the source code or to a binary. The first type of instrumentation can be done manually or with compiler assistance (gcc -pg, gprof). Binary instrumentation comes in two flavours: binary translation (performed before the program is executed) and dynamic instrumentation (where code snippets are added while the program runs). Both types of binary instrumentation suffer from very large overheads. For instance, when running a sample benchmark from a HEP library on a Xeon processor, the overhead of binary instrumentation is around 800% with PIN and exceeds 6000% with ATOM. On the positive side, software instrumentation is easily portable, at least across a family of processors. We consider using software instrumentation, for instance, to obtain information about the number of function calls.
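As an illustration of manual source instrumentation, the sketch below counts the calls to a routine of interest and dumps the counter at the end of the run. The routine name process_event and the counter are hypothetical examples, not taken from any profiled application.

/* Minimal sketch of manual source instrumentation: a counter is
 * incremented at the entry of the routine we want to measure.
 * process_event() and the counter are hypothetical examples. */
#include <stdio.h>

static unsigned long process_event_calls = 0;   /* instrumentation counter */

static void process_event(int event_id)
{
    process_event_calls++;                      /* added instrumentation */
    /* ... original work of the routine ... */
    (void)event_id;
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        process_event(i);

    /* dump the collected data at the end of the run */
    printf("process_event was called %lu times\n", process_event_calls);
    return 0;
}

The compiler-assisted variant would instead be built with gcc -pg and analysed after the run with gprof (for example, gprof ./prog gmon.out).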
Hardware approach
This approach takes advantage of the performance monitors available in modern processors. In comparison with instrumentation, we get much more information, not only about problems in the application but possibly also about the source of these problems. For instance, on the Itanium processor CPU stalls can be related to their fundamental cause, such as a cache miss. Usually this approach is implemented by sampling the hardware counters at regular intervals, so-called statistical profiling. This solution has a lower overhead than instrumentation, but it is certainly not portable between processors. However, since the perfmon2 interface and the corresponding library cover more and more processors, a solution which takes advantage of hardware support becomes much more attractive. In openlab we work on profiling everything from small applications up to big frameworks. For small programs software instrumentation may sound reasonable, but for big application suites with all the associated libraries it can become quite complicated. Keep in mind that we do not always have access to the source code of the profiled applications. In openlab we generally use tools such as PerfSuite, oprofile, q-tools and caliper.
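To give an idea of what programmatic access to a hardware counter looks like, the sketch below counts the CPU cycles spent in a small loop. It uses the generic Linux perf_event_open system call as an illustration of hardware counter access; it is not the perfmon2/pfmon interface discussed above, nor our actual setup.

/* Sketch: counting CPU cycles around a code region with the Linux
 * perf_event_open syscall. Illustration only; NOT the perfmon2 API. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* count CPU cycles */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* open a counter for this process, on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                  /* the work to be measured */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long cycles = 0;
    read(fd, &cycles, sizeof(cycles));        /* read the counter value */
    printf("CPU cycles: %lld\n", cycles);
    close(fd);
    return 0;
}

Statistical profiling as described above goes one step further: instead of reading the counter once, the kernel delivers a sample (typically an instruction address) every time the counter overflows a chosen period.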
Collaboration
We collaborate with the developers of the interface to the hardware resources by testing perfmon2 on machines with different processors. We have also started to contribute to pfmon, and we are currently working on improving the resolution of function names when profiling applications that are built with shared libraries. pfmon is going to be not only a simple counting tool, but will also become more robust in the area of profiling. This means that it looks promising as a universal tool, available on the various hardware platforms of relevance.
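To show the general idea behind mapping a sampled address inside a shared library back to a function name, the sketch below uses the dynamic linker's dladdr() call. This is only an illustration of the concept, not the actual pfmon or PerfSuite code.

/* Sketch: resolving an address inside a shared library back to a
 * symbol name with dladdr(). Illustrates the idea behind symbol
 * resolution for shared libraries; it is not the pfmon code.
 * Build with: gcc example.c -lm -ldl (symbols must not be stripped). */
#define _GNU_SOURCE
#include <stdio.h>
#include <math.h>
#include <dlfcn.h>

int main(void)
{
    void *addr = (void *)&cos;        /* an address inside libm */
    Dl_info info;

    if (dladdr(addr, &info) && info.dli_sname) {
        printf("address %p -> %s in %s (base %p)\n",
               addr, info.dli_sname, info.dli_fname, info.dli_fbase);
    } else {
        printf("no symbol found for %p\n", addr);
    }
    return 0;
}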
In the area of profiling we have so far been collaborating with two LHC experiments, Atlas and LHCb. We work together in order to better understand how their huge applications behave during long runs. So far, we have been working on simulation jobs as well as on reconstruction jobs. For the simulations, we focused on the Geant4 libraries, because they turn out to be a major consumer of CPU time. Most of our work has been done in the 32-bit environment, but we are currently preparing for the move to 64-bit mode. The main tool which we use is PerfSuite. We came across a few challenges with this tool, such as unpredictable behaviour with the AFS file system and incorrect resolution of function names from shared libraries. After some serious effort we obtained a tool which is portable across processors, but which is not easy to use without prior knowledge of the structure of the profiled application.
Resources
Our presentations from meetings
11th Geant4 Collaboration Workshop and User Conference, 9-14
Oct 2006, Lisbon
Meeting with the Atlas, LHCb and Geant4 teams, 18 May 2006, CERN
Results
Atlas simulation
Full event: 3 events, 10 events, 30 events
Minimum Bias: 3 events, 10 events, 30 events
LHCb simulation
10 events, 100 events, 1000 events
Atlas Reconstruction (inDetExample)
iPatRec: J5_Pt280_560, top500, Zmumu
New Tracking: J5_Pt280_560, top500, Zmumu, ZeeJimmy
Geant4
calorimeter, exampleN04