[PATCH] oprofile: Add support for Goldmont events

Post by Andi Kleen
Add support for the Intel Goldmont events.

Hi Andi,

Sorry just now getting to reviewing this patch. Thanks for the patch for the Intel Goldmont microarchitecture.

I found the appropriate "Table 19-24. Non-Architectural Performance Events for the Goldmont Microarchitecture" in the newly released Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide to verify the events. The patch looks pretty reasonable. There are a few minor issues with the descriptions of the unit masks inlined in the patch text below. Once those are addressed the patch can be merged into the git repository.

Is there a matching patch for the testsuite to identify the Intel Goldmont processors, so the testsuite check that things are working?

-Will

Post by Andi Kleen
OFFCORE_RESPONSE.* is not supported.
---
events/Makefile.am | 1 +
events/i386/goldmont/events | 34 +++++++++
events/i386/goldmont/unit_masks | 154 ++++++++++++++++++++++++++++++++++++++++
libop/op_cpu_type.c | 2 +
libop/op_cpu_type.h | 1 +
libop/op_events.c | 1 +
libop/op_hw_specific.h | 3 +
utils/ophelp.c | 1 +
8 files changed, 197 insertions(+)
create mode 100644 events/i386/goldmont/events
create mode 100644 events/i386/goldmont/unit_masks
diff --git a/events/Makefile.am b/events/Makefile.am
index 56f9020b1f1a..677b05f93d61 100644
--- a/events/Makefile.am
+++ b/events/Makefile.am
@@ -20,6 +20,7 @@ event_files = \
i386/broadwell/events i386/broadwell/unit_masks \
i386/skylake/events i386/skylake/unit_masks \
i386/silvermont/events i386/silvermont/unit_masks \
+ i386/goldmont/events i386/goldmont/unit_masks \
ppc64/architected_events_v1/events ppc64/architected_events_v1/unit_masks \
ppc64/power4/events ppc64/power4/event_mappings ppc64/power4/unit_masks \
ppc64/power5/events ppc64/power5/event_mappings ppc64/power5/unit_masks \
diff --git a/events/i386/goldmont/events b/events/i386/goldmont/events
new file mode 100644
index 000000000000..111438eb289e
--- /dev/null
+++ b/events/i386/goldmont/events
@@ -0,0 +1,34 @@
+#
+# Intel "Goldmont" microarchitecture core events.
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+# Note the minimum counts are not discovered experimentally and could be likely
+# lowered in many cases without ill effect.
+#

Table 17-24 includes 0x3c for CPU_CLK_UNHALTED. Is there a reason that this event is not included in this list?

Post by Andi Kleen
diff --git a/events/i386/goldmont/unit_masks b/events/i386/goldmont/unit_masks
new file mode 100644
index 000000000000..bbf261742514
--- /dev/null
+++ b/events/i386/goldmont/unit_masks
@@ -0,0 +1,154 @@
+#
+# Unit masks for the Intel "Goldmont" micro architecture
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+name:core_reject_l2q type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and L1 prefetcher requests rejected by the L2Q due to a full or nearly full condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a core's dirty eviction when the address conflicts with incoming external snoops.
+name:decode_restriction type:mandatory default:0x1
+ 0x1 extra: predecode_wrong Counts the number of times the prediction (from the predecode cache) for instruction length is incorrect.
+name:dl1 type:mandatory default:0x1
+ 0x1 extra: dirty_eviction Counts when a modified (dirty) cache line is evicted from the data L1 cache and needs to be written back to memory. No count will occur if the evicted line is clean, and hence does not require a writeback.
+name:fetch_stall type:mandatory default:0x2
+ 0x2 extra: icache_fill_pending_cycles Counts the number of cycles fetch stalls because of an icache miss. This is a cummulative count of cycles stalled for all icache misses.
+name:itlb type:mandatory default:0x4
+ 0x4 extra: miss Counts the number of times the machine was unable to find a translation in the Instruction Translation Lookaside Buffer (ITLB) for a linear address of an instruction fetch. It counts when new translation are filled into the ITLB. The event is speculative in nature, but will not count translations (page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.
+name:l2_reject_xq type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and prefetch transactions that the L2 XQ rejects due to a full or near full condition which likely indicates back pressure from the intra-die interconnect (IDI) fabric. The XQ may reject transactions from the L2Q (non-cacheable requests), L2 misses and L2 write-back victims

Missing '.' at the end of the sentence in the above um for l2_reject_xq

Post by Andi Kleen
+name:ms_decoded type:mandatory default:0x1
+ 0x1 extra: ms_entry Counts the number of times the Microcde Sequencer (MS) starts a flow of uops from the MSROM. It does not count every time a uop is read from the MSROM. The most common case that this counts is when a micro-coded instruction is encountered by the front end of the machine. Other cases include when an instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops. The event will count MS startups for uops that are speculative, and subsequently cleared by branch mispredict or a machine clear.
+name:uops_issued type:mandatory default:0x0
+ 0x0 extra: any Counts uops issued by the front end and allocated into the back end of the machine. This event counts uops that retire as well as uops that were speculatively executed but didn't retire. The sort of speculative uops that might be counted includes, but is not limited to those uops issued in the shadow of a miss-predicted branch, those uops that are inserted during an assist (such as for a denormal floating point result), and (previously allocated) uops that might be canceled during a machine clear.
+name:uops_not_delivered type:mandatory default:0x0
+ 0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to identify if the machine is truly front-end bound. When this event occurs, it is an indication that the front-end of the machine is operating at less than its theoretical peak performance.
+name:cpu_clk_unhalted type:exclusive default:core
+ 0x2 extra: core Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time to time. For this reason this event may have a changing ratio with regards to time. This event uses fixed counter 1. You cannot collect a PEBs record for this event.
+ 0x1 extra: ref_tsc Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time. This event is not affected by core frequency changes but counts as if the core is running at the maximum frequency all the time. This event uses fixed counter 2. You cannot collect a PEBs record for this event
+ 0x0 extra: core_p Core cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+ 0x1 extra: ref Reference cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+name:ld_blocks type:exclusive default:all_block
+ 0x10 extra: all_block Counts anytime a load that retires is blocked for any reason.
+ 0x10 extra:pebs all_block_pebs Counts anytime a load that retires is blocked for any reason.
+ 0x8 extra: utlb_miss Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x8 extra:pebs utlb_miss_pebs Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x1 extra: data_unknown Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x1 extra:pebs data_unknown_pebs Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x4 extra: u4k_alias Counts loads that block because their address modulo 4K matches a pending store.
+ 0x4 extra:pebs u4k_alias_pebs Counts loads that block because their address modulo 4K matches a pending store.
+name:page_walks type:exclusive default:0x1
+ 0x1 extra: d_side_cycles Counts every core cycle when a Data-side walks (due to data operation) page walk is in progress.
+ 0x2 extra: i_side_cycles ounts every core cycle when a Instruction-side (walks due to an instruction fetch) page walk is in progress.

Missing 'C' at the beginning of "ounts every core cycle .." for i_side_cycles

Post by Andi Kleen
+ 0x3 extra: cycles Counts every core cycle a page-walk is in progress due to either a data memory operation or an instruction fetch.
+name:misalign_mem_ref type:exclusive default:load_page_split
+ 0x2 extra: load_page_split Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x2 extra:pebs load_page_split_pebs Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x4 extra: store_page_split Counts when a memory store of a uop spans a page boundary (a split) is retired.
+ 0x4 extra:pebs store_page_split_pebs Counts when a memory store of a uop spans a page boundary (a split) is retired.
+name:longest_lat_cache type:exclusive default:0x4f
+ 0x4f extra: reference Counts memory requests originating from the core that reference a cache line in the L2 cache.
+ 0x41 extra: miss Counts memory requests originating from the core that miss in the L2 cache.
+name:icache type:exclusive default:0x1
+ 0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
+ 0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
+ 0x3 extra: accesses Counts each cache line access to the Icache
+name:inst_retired type:exclusive default:any
+ 0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses fixed counter 0. You cannot collect a PEBs record for this event
+ 0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter. *This event is Precise Event capable: The EventingRIP field in the PEBS record is precise to the address of the instruction which caused the event. Note: Because PEBS records can be collected only on IA32_PMC0, only one event can use the PEBS facility at a time.

Doesn't the any_p version need to be split into two unit masks? One for normal programmable counter use (any_p) and another for pebs use (any_pebs)?

Post by Andi Kleen
+name:uops_retired type:exclusive default:any
+ 0x0 extra: any Counts uops which retired
+ 0x0 extra:pebs any_pebs Counts uops which retired
+ 0x1 extra: ms Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x1 extra:pebs ms_pebs Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x8 extra: fpdiv Counts the number of floating point divide uops retired.
+ 0x8 extra:pebs fpdiv_pebs Counts the number of floating point divide uops retired.
+ 0x10 extra: idiv Counts the number of integer divide uops retired.
+ 0x10 extra:pebs idiv_pebs Counts the number of integer divide uops retired.

Don't see fdiv (0x8) or idvi (0x10) listed in the table 19-24 for uops retired. Is something missing in the documentation? Or are those disabled because of some specfication update?

Post by Andi Kleen
+name:machine_clears type:exclusive default:0x0
+ 0x0 extra: all Counts machine clears for any reason
+ 0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code section and has to perform a machine clear because of that modification. Self-modifying code (SMC) causes a severe penalty in all Intel? architecture processors.

The '?' after "Intel" in the above line should be removed.

Post by Andi Kleen
+ 0x2 extra: memory_ordering Counts machine clears due to memory ordering issues. This occurs when a snoop request happens and the machine is uncertain if memory ordering will be preserved - as another core is in the process of modifying the data.

The '-' in "will be preserved - as another core" should be removed.

Post by Andi Kleen
+ 0x4 extra: fp_assist Counts machine clears due to floating point (FP) operations needing assists. For instance, if the result was a floating point denormal, the hardware clears the pipeline and reissues uops to produce the correct IEEE compliant denormal result.
+ 0x8 extra: disambiguation Counts machine clears due to memory disambiguation. Memory disambiguation happens when a load which has been issued conflicts with a previous unretired store in the pipeline whose address was not known at issue time, but is later resolved to be the same as the load address.
+name:br_inst_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x0 extra:pebs all_branches_pebs Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xf9 extra: call Counts near CALL branch instructions retired.
+ 0xf9 extra:pebs call_pebs Counts near CALL branch instructions retired.
+ 0xfd extra: rel_call Counts near relative CALL branch instructions retired.
+ 0xfd extra:pebs rel_call_pebs Counts near relative CALL branch instructions retired.
+ 0xfb extra: ind_call Counts near indirect CALL branch instructions retired.
+ 0xfb extra:pebs ind_call_pebs Counts near indirect CALL branch instructions retired.
+ 0xf7 extra: return Counts near return branch instructions retired.
+ 0xf7 extra:pebs return_pebs Counts near return branch instructions retired.
+ 0xeb extra: non_return_ind Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xeb extra:pebs non_return_ind_pebs Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xbf extra: far_branch Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+ 0xbf extra:pebs far_branch_pebs Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+name:br_misp_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts mispredicted branch instructions retired including all branch types.
+ 0x0 extra:pebs all_branches_pebs Counts mispredicted branch instructions retired including all branch types.
+ 0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0xfe extra: taken_jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xeb extra: non_return_ind Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+ 0xeb extra:pebs non_return_ind_pebs Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+name:issue_slots_not_consumed type:exclusive default:0x0
+ 0x0 extra: any Counts the number of issue slots per core cycle that were not consumed by the backend due to either a full resource in the backend (RESOURCE_FULL) or due to the processor recovering from some event (RECOVERY)
+ 0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because of a full resource in the backend. Including but not limited to tesources include the Re-order Buffer (ROB), reservation stations (RS), load/store buffers, physical registers, or any other needed machine resource that is currently unavailable. Note that uops must be available for consumption in order for this event to fire. If a uop is not available (Instruction Queue is empty), this event will not count.

The above description for "resource_full" looks garbled around "tesources".

Post by Andi Kleen
+ 0x2 extra: recovery Counts the number of issue slots per core cycle that were not consumed by the backend because allocation is stalled waiting for a mispredicted jump to retire or other branch-like conditions (e.g. the event is relevant during certain microcode flows). Counts all issue slots blocked while within this window including slots where uops were not available in the Instruction Queue.
+name:hw_interrupts type:exclusive default:0x1
+ 0x1 extra: received Counts hardware interrupts received by the processor.
+ 0x4 extra: pending_and_masked Counts core cycles during which there are pending interrupts, but interrupts are masked (EFLAGS.IF = 0).
+name:cycles_div_busy type:exclusive default:0x0
+ 0x0 extra: all Counts core cycles if either divide unit is busy.
+ 0x1 extra: idiv Counts core cycles the integer divide unit is busy.
+ 0x2 extra: fpdiv Counts core cycles the floating point divide unit is busy.
+name:mem_uops_retired type:exclusive default:all
+ 0x83 extra: all Counts the number of memory uops retired that is either a loads or a store or both.
+ 0x81 extra: all_loads Counts the number of load uops retired
+ 0x81 extra:pebs all_loads_pebs Counts the number of load uops retired
+ 0x82 extra: all_stores Counts the number of store uops retired
+ 0x82 extra:pebs all_stores_pebs Counts the number of store uops retired
+ 0x83 extra:pebs all_pebs Counts the number of memory uops retired that is either a loads or a store or both.

Any particular reason all_pebs and the all unit mask to group similar events together in the list?

Post by Andi Kleen
+ 0x11 extra: dtlb_miss_loads Counts load uops retired that caused a DTLB miss.
+ 0x11 extra:pebs dtlb_miss_loads_pebs Counts load uops retired that caused a DTLB miss.
+ 0x12 extra: dtlb_miss_stores Counts store uops retired that caused a DTLB miss.
+ 0x12 extra:pebs dtlb_miss_stores_pebs Counts store uops retired that caused a DTLB miss.
+ 0x13 extra: dtlb_miss Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x13 extra:pebs dtlb_miss_pebs Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x21 extra: lock_loads Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x21 extra:pebs lock_loads_pebs Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+name:mem_load_uops_retired type:exclusive default:l1_hit
+ 0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache
+ 0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache
+ 0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache
+ 0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache
+ 0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache
+ 0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache
+ 0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache
+ 0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache
+ 0x20 extra: hitm Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x20 extra:pebs hitm_pebs Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+ 0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+name:baclears type:exclusive default:0x1
+ 0x1 extra: all Counts the number of times a BACLEAR is signaled for any reason, including, but not limited to indirect branch/call, Jcc (Jump on Conditional Code/Jump if Condition is Met) branch, unconditional branch/call, and returns.
+ 0x8 extra: return Counts BACLEARS on return instructions.
+ 0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Conditon is Met) branches.
diff --git a/libop/op_cpu_type.c b/libop/op_cpu_type.c
index b1d5ecfb3090..7bdde53969eb 100644
--- a/libop/op_cpu_type.c
+++ b/libop/op_cpu_type.c
@@ -122,6 +122,7 @@ static struct cpu_descr const cpu_descrs[MAX_CPU_TYPE] = {
{ "ARM Cortex-A57", "arm/armv8-ca57", CPU_ARM_V8_CA57, 6},
{ "ARM Cortex-A53", "arm/armv8-ca53", CPU_ARM_V8_CA53, 6},
{ "Intel Skylake microarchitecture", "i386/skylake", CPU_SKYLAKE, 4 },
+ { "Intel Goldmont microarchitecture", "i386/goldmont", CPU_GOLDMONT, 4 },
};
static size_t const nr_cpu_descrs = sizeof(cpu_descrs) / sizeof(struct cpu_descr);
@@ -739,6 +740,7 @@ op_cpu op_cpu_base_type(op_cpu cpu_type)
diff --git a/libop/op_cpu_type.h b/libop/op_cpu_type.h
index 9983f8752954..98289c5b96ea 100644
--- a/libop/op_cpu_type.h
+++ b/libop/op_cpu_type.h
@@ -102,6 +102,7 @@ typedef enum {
CPU_ARM_V8_CA57, /* ARM Cortex-A57 */
CPU_ARM_V8_CA53, /* ARM Cortex-A53 */
CPU_SKYLAKE, /** < Intel Skylake microarchitecture */
+ CPU_GOLDMONT, /** < Intel Goldmont microarchitecture */
MAX_CPU_TYPE
} op_cpu;
diff --git a/libop/op_events.c b/libop/op_events.c
index 25f010ef1aba..cdd04093d068 100644
--- a/libop/op_events.c
+++ b/libop/op_events.c
@@ -1212,6 +1212,7 @@ void op_default_event(op_cpu cpu_type, struct op_default_event_descr * descr)
descr->name = "CPU_CLK_UNHALTED";
break;
descr->name = "cpu_clk_unhalted";
break;
diff --git a/libop/op_hw_specific.h b/libop/op_hw_specific.h
index 994fec473ebc..4cc6bc0c69dc 100644
--- a/libop/op_hw_specific.h
+++ b/libop/op_hw_specific.h
@@ -161,6 +161,9 @@ static inline op_cpu op_cpu_specific_type(op_cpu cpu_type)
return CPU_SILVERMONT;
+ return CPU_GOLDMONT;
}
}
return cpu_type;
diff --git a/utils/ophelp.c b/utils/ophelp.c
index fdddddcc3aed..58215935fb01 100644
--- a/utils/ophelp.c
+++ b/utils/ophelp.c
@@ -544,6 +544,7 @@ int main(int argc, char const * argv[])

Andi Kleen

2016-04-25 21:45:16 UTC

I've sent a new patch addressing the typos.

Post by William Cohen
Is there a matching patch for the testsuite to identify the Intel Goldmont processors, so the testsuite check that things are working?

There is not.

Post by Andi Kleen
+#
+# Note the minimum counts are not discovered experimentally and could be likely
+# lowered in many cases without ill effect.
+#

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Post by William Cohen
Table 17-24 includes 0x3c for CPU_CLK_UNHALTED. Is there a reason that this event is not included in this list?

00 is equivalent for perf, points to the fixed counter or the generic
counter.

Post by Andi Kleen
+ 0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
+ 0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
+ 0x3 extra: accesses Counts each cache line access to the Icache
+name:inst_retired type:exclusive default:any
+ 0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses fixed counter 0. You cannot collect a PEBs record for this event
+ 0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter. *This event is Precise Event capable: The EventingRIP field in the PEBS record is precise to the address of the instruction which caused the event. Note: Because PEBS records can be collected only on IA32_PMC0, only one event can use the PEBS facility at a time.

Doesn't the any_p version need to be split into two unit masks? One for normal programmable counter use (any_p) and another for pebs use (any_pebs)?

Right one for PEBS, another without.

Don't see fdiv (0x8) or idvi (0x10) listed in the table 19-24 for uops retired. Is something missing in the documentation? Or are those disabled because of some specfication update?

Not sure, but the my source files are generally newer.

Any particular reason all_pebs and the all unit mask to group similar events together in the list?

A lot of events can be both PEBS and non PEBS sharing the same codes.
In fact on Goldmont all events support that (although this is not
expressed so far in oprofile)

-Andi

William Cohen

2016-04-26 03:24:49 UTC

Post by Andi Kleen
I've sent a new patch addressing the typos.

Post by William Cohen
Is there a matching patch for the testsuite to identify the Intel Goldmont processors, so the testsuite check that things are working?

There is not.

Thanks for the revised patch.

Post by Andi Kleen
+#
+# Note the minimum counts are not discovered experimentally and could be likely
+# lowered in many cases without ill effect.
+#

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Post by William Cohen
Table 17-24 includes 0x3c for CPU_CLK_UNHALTED. Is there a reason that this event is not included in this list?

00 is equivalent for perf, points to the fixed counter or the generic
counter.

okay.

Doesn't the any_p version need to be split into two unit masks? One for normal programmable counter use (any_p) and another for pebs use (any_pebs)?

Right one for PEBS, another without.

The updated patch does not have the any_pebs version for inst_retired. Shouldn't there be a line for inst_retired umask like the following in there?

0x0 extra:pebs any_pebs Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter. *This event is Precise Event capable: The EventingRIP field in the PEBS record is precise to the address of the instruction which caused the event. Note: Because PEBS records can be collected only on IA32_PMC0, only one event can use the PEBS facility at a time.

Don't see fdiv (0x8) or idvi (0x10) listed in the table 19-24 for uops retired. Is something missing in the documentation? Or are those disabled because of some specfication update?

Not sure, but the my source files are generally newer.

Is the following the correct specification update for the goldmont processors? I didn't see anything on the div/mul perf counters in there.

http://www.intel.com/content/www/us/en/processors/xeon/xeon-d-1500-specification-update.html

The lack of the div/mul events can be addressed later if needed. Won't worry about it on this patch.

Any particular reason all_pebs and the all unit mask to group similar events together in the list?

A lot of events can be both PEBS and non PEBS sharing the same codes.
In fact on Goldmont all events support that (although this is not
expressed so far in oprofile)

I don't know how mechanically generated the umasks file is. Most of the non-pebs and pebs versions are right next to each other in the file. 0x83 for mem_uops was one exception. If the file is mechanically generated and the input file has them in that order, it might be simpler just to leave that particular one alone. However, if it is a simple edit, swapping them would be easy and would provide more consistent output for ophelp.

-Will

Andi Kleen

2016-04-26 14:50:05 UTC

Post by Andi Kleen
+ 0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
+ 0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
+ 0x3 extra: accesses Counts each cache line access to the Icache
+name:inst_retired type:exclusive default:any
+ 0x0 extra: any Counts the number of instructions that retire
execution. For instructions that consist of multiple uops, this
event counts the retirement of the last uop of the
instruction. The counter continues counting during hardware
interrupts, traps, and inside interrupt handlers. This event uses
fixed counter 0. You cannot collect a PEBs record for this event
+ 0x0 extra: any_p Counts the number of instructions that retire
execution. For instructions that consist of multiple uops, this
event counts the retirement of the last uop of the
instruction. The event continues counting during hardware
interrupts, traps, and inside interrupt handlers. This is an
architectural performance event. This event uses a
(_P)rogrammable general purpose performance counter. *This event
is Precise Event capable: The EventingRIP field in the PEBS record
is precise to the address of the instruction which caused the
event. Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.

Doesn't the any_p version need to be split into two unit masks? One
for normal programmable counter use (any_p) and another for pebs
use (any_pebs)?

Right one for PEBS, another without.

The updated patch does not have the any_pebs version for inst_retired.
Shouldn't there be a line for inst_retired umask like the following in
there?

Yes. I'll add one.

-Andi

--
***@linux.intel.com -- Speaking for myself only

Andi Kleen

2016-04-25 21:47:21 UTC

From: Andi Kleen <***@linux.intel.com>

Add support for the Intel Goldmont events.

OFFCORE_RESPONSE.* is not supported.

v2: Fix typos in descriptions.
Signed-off-by: Andi Kleen <***@linux.intel.com>
---
events/Makefile.am | 1 +
events/i386/goldmont/events | 34 +++++++++
events/i386/goldmont/unit_masks | 154 ++++++++++++++++++++++++++++++++++++++++
libop/op_cpu_type.c | 2 +
libop/op_cpu_type.h | 1 +
libop/op_events.c | 1 +
libop/op_hw_specific.h | 3 +
utils/ophelp.c | 1 +
8 files changed, 197 insertions(+)
create mode 100644 events/i386/goldmont/events
create mode 100644 events/i386/goldmont/unit_masks

diff --git a/events/Makefile.am b/events/Makefile.am
index 56f9020b1f1a..677b05f93d61 100644
--- a/events/Makefile.am
+++ b/events/Makefile.am
@@ -20,6 +20,7 @@ event_files = \
i386/broadwell/events i386/broadwell/unit_masks \
i386/skylake/events i386/skylake/unit_masks \
i386/silvermont/events i386/silvermont/unit_masks \
+ i386/goldmont/events i386/goldmont/unit_masks \
ppc64/architected_events_v1/events ppc64/architected_events_v1/unit_masks \
ppc64/power4/events ppc64/power4/event_mappings ppc64/power4/unit_masks \
ppc64/power5/events ppc64/power5/event_mappings ppc64/power5/unit_masks \
diff --git a/events/i386/goldmont/events b/events/i386/goldmont/events
new file mode 100644
index 000000000000..111438eb289e
--- /dev/null
+++ b/events/i386/goldmont/events
@@ -0,0 +1,34 @@
+#
+# Intel "Goldmont" microarchitecture core events.
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+# Note the minimum counts are not discovered experimentally and could be likely
+# lowered in many cases without ill effect.
+#
+event:0x00 counters:cpuid um:cpu_clk_unhalted minimum:2000003 name:cpu_clk_unhalted :
+event:0x03 counters:cpuid um:ld_blocks minimum:200003 name:ld_blocks :
+event:0x05 counters:cpuid um:page_walks minimum:200003 name:page_walks :
+event:0x0e counters:cpuid um:uops_issued minimum:200003 name:uops_issued_any :
+event:0x13 counters:cpuid um:misalign_mem_ref minimum:200003 name:misalign_mem_ref :
+event:0x2e counters:cpuid um:longest_lat_cache minimum:200003 name:longest_lat_cache :
+event:0x30 counters:cpuid um:l2_reject_xq minimum:200003 name:l2_reject_xq_all :
+event:0x31 counters:cpuid um:core_reject_l2q minimum:200003 name:core_reject_l2q_all :
+event:0x51 counters:cpuid um:dl1 minimum:200003 name:dl1_dirty_eviction :
+event:0x80 counters:cpuid um:icache minimum:200003 name:icache :
+event:0x81 counters:cpuid um:itlb minimum:200003 name:itlb_miss :
+event:0x86 counters:cpuid um:fetch_stall minimum:200003 name:fetch_stall_icache_fill_pending_cycles :
+event:0x9c counters:cpuid um:uops_not_delivered minimum:200003 name:uops_not_delivered_any :
+event:0xc0 counters:cpuid um:inst_retired minimum:2000003 name:inst_retired :
+event:0xc2 counters:cpuid um:uops_retired minimum:2000003 name:uops_retired :
+event:0xc3 counters:cpuid um:machine_clears minimum:200003 name:machine_clears :
+event:0xc4 counters:cpuid um:br_inst_retired minimum:200003 name:br_inst_retired :
+event:0xc5 counters:cpuid um:br_misp_retired minimum:200003 name:br_misp_retired :
+event:0xca counters:cpuid um:issue_slots_not_consumed minimum:200003 name:issue_slots_not_consumed :
+event:0xcb counters:cpuid um:hw_interrupts minimum:200003 name:hw_interrupts :
+event:0xcd counters:cpuid um:cycles_div_busy minimum:2000003 name:cycles_div_busy :
+event:0xd0 counters:cpuid um:mem_uops_retired minimum:200003 name:mem_uops_retired :
+event:0xd1 counters:cpuid um:mem_load_uops_retired minimum:200003 name:mem_load_uops_retired :
+event:0xe6 counters:cpuid um:baclears minimum:200003 name:baclears :
+event:0xe7 counters:cpuid um:ms_decoded minimum:200003 name:ms_decoded_ms_entry :
+event:0xe9 counters:cpuid um:decode_restriction minimum:200003 name:decode_restriction_predecode_wrong :
diff --git a/events/i386/goldmont/unit_masks b/events/i386/goldmont/unit_masks
new file mode 100644
index 000000000000..54556d14c318
--- /dev/null
+++ b/events/i386/goldmont/unit_masks
@@ -0,0 +1,154 @@
+#
+# Unit masks for the Intel "Goldmont" micro architecture
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+name:core_reject_l2q type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and L1 prefetcher requests rejected by the L2Q due to a full or nearly full condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a core's dirty eviction when the address conflicts with incoming external snoops.
+name:decode_restriction type:mandatory default:0x1
+ 0x1 extra: predecode_wrong Counts the number of times the prediction (from the predecode cache) for instruction length is incorrect.
+name:dl1 type:mandatory default:0x1
+ 0x1 extra: dirty_eviction Counts when a modified (dirty) cache line is evicted from the data L1 cache and needs to be written back to memory. No count will occur if the evicted line is clean, and hence does not require a writeback.
+name:fetch_stall type:mandatory default:0x2
+ 0x2 extra: icache_fill_pending_cycles Counts the number of cycles fetch stalls because of an icache miss. This is a cummulative count of cycles stalled for all icache misses.
+name:itlb type:mandatory default:0x4
+ 0x4 extra: miss Counts the number of times the machine was unable to find a translation in the Instruction Translation Lookaside Buffer (ITLB) for a linear address of an instruction fetch. It counts when new translation are filled into the ITLB. The event is speculative in nature, but will not count translations (page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.
+name:l2_reject_xq type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and prefetch transactions that the L2 XQ rejects due to a full or near full condition which likely indicates back pressure from the intra-die interconnect (IDI) fabric. The XQ may reject transactions from the L2Q (non-cacheable requests), L2 misses and L2 write-back victims.
+name:ms_decoded type:mandatory default:0x1
+ 0x1 extra: ms_entry Counts the number of times the Microcde Sequencer (MS) starts a flow of uops from the MSROM. It does not count every time a uop is read from the MSROM. The most common case that this counts is when a micro-coded instruction is encountered by the front end of the machine. Other cases include when an instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops. The event will count MS startups for uops that are speculative, and subsequently cleared by branch mispredict or a machine clear.
+name:uops_issued type:mandatory default:0x0
+ 0x0 extra: any Counts uops issued by the front end and allocated into the back end of the machine. This event counts uops that retire as well as uops that were speculatively executed but didn't retire. The sort of speculative uops that might be counted includes, but is not limited to those uops issued in the shadow of a miss-predicted branch, those uops that are inserted during an assist (such as for a denormal floating point result), and (previously allocated) uops that might be canceled during a machine clear.
+name:uops_not_delivered type:mandatory default:0x0
+ 0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to identify if the machine is truly front-end bound. When this event occurs, it is an indication that the front-end of the machine is operating at less than its theoretical peak performance.
+name:cpu_clk_unhalted type:exclusive default:core
+ 0x2 extra: core Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time to time. For this reason this event may have a changing ratio with regards to time. This event uses fixed counter 1. You cannot collect a PEBs record for this event.
+ 0x1 extra: ref_tsc Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time. This event is not affected by core frequency changes but counts as if the core is running at the maximum frequency all the time. This event uses fixed counter 2. You cannot collect a PEBs record for this event
+ 0x0 extra: core_p Core cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+ 0x1 extra: ref Reference cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+name:ld_blocks type:exclusive default:all_block
+ 0x10 extra: all_block Counts anytime a load that retires is blocked for any reason.
+ 0x10 extra:pebs all_block_pebs Counts anytime a load that retires is blocked for any reason.
+ 0x8 extra: utlb_miss Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x8 extra:pebs utlb_miss_pebs Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x1 extra: data_unknown Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x1 extra:pebs data_unknown_pebs Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x4 extra: u4k_alias Counts loads that block because their address modulo 4K matches a pending store.
+ 0x4 extra:pebs u4k_alias_pebs Counts loads that block because their address modulo 4K matches a pending store.
+name:page_walks type:exclusive default:0x1
+ 0x1 extra: d_side_cycles Counts every core cycle when a Data-side walks (due to data operation) page walk is in progress.
+ 0x2 extra: i_side_cycles Counts every core cycle when a Instruction-side (walks due to an instruction fetch) page walk is in progress.
+ 0x3 extra: cycles Counts every core cycle a page-walk is in progress due to either a data memory operation or an instruction fetch.
+name:misalign_mem_ref type:exclusive default:load_page_split
+ 0x2 extra: load_page_split Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x2 extra:pebs load_page_split_pebs Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x4 extra: store_page_split Counts when a memory store of a uop spans a page boundary (a split) is retired.
+ 0x4 extra:pebs store_page_split_pebs Counts when a memory store of a uop spans a page boundary (a split) is retired.
+name:longest_lat_cache type:exclusive default:0x4f
+ 0x4f extra: reference Counts memory requests originating from the core that reference a cache line in the L2 cache.
+ 0x41 extra: miss Counts memory requests originating from the core that miss in the L2 cache.
+name:icache type:exclusive default:0x1
+ 0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
+ 0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
+ 0x3 extra: accesses Counts each cache line access to the Icache
+name:inst_retired type:exclusive default:any
+ 0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses fixed counter 0. You cannot collect a PEBs record for this event
+ 0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter. *This event is Precise Event capable: The EventingRIP field in the PEBS record is precise to the address of the instruction which caused the event. Note: Because PEBS records can be collected only on IA32_PMC0, only one event can use the PEBS facility at a time.
+name:uops_retired type:exclusive default:any
+ 0x0 extra: any Counts uops which retired
+ 0x0 extra:pebs any_pebs Counts uops which retired
+ 0x1 extra: ms Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x1 extra:pebs ms_pebs Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x8 extra: fpdiv Counts the number of floating point divide uops retired.
+ 0x8 extra:pebs fpdiv_pebs Counts the number of floating point divide uops retired.
+ 0x10 extra: idiv Counts the number of integer divide uops retired.
+ 0x10 extra:pebs idiv_pebs Counts the number of integer divide uops retired.
+name:machine_clears type:exclusive default:0x0
+ 0x0 extra: all Counts machine clears for any reason
+ 0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code section and has to perform a machine clear because of that modification. Self-modifying code (SMC) causes a severe penalty in all Intel architecture processors.
+ 0x2 extra: memory_ordering Counts machine clears due to memory ordering issues. This occurs when a snoop request happens and the machine is uncertain if memory ordering will be preserved, as another core is in the process of modifying the data.
+ 0x4 extra: fp_assist Counts machine clears due to floating point (FP) operations needing assists. For instance, if the result was a floating point denormal, the hardware clears the pipeline and reissues uops to produce the correct IEEE compliant denormal result.
+ 0x8 extra: disambiguation Counts machine clears due to memory disambiguation. Memory disambiguation happens when a load which has been issued conflicts with a previous unretired store in the pipeline whose address was not known at issue time, but is later resolved to be the same as the load address.
+name:br_inst_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x0 extra:pebs all_branches_pebs Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xf9 extra: call Counts near CALL branch instructions retired.
+ 0xf9 extra:pebs call_pebs Counts near CALL branch instructions retired.
+ 0xfd extra: rel_call Counts near relative CALL branch instructions retired.
+ 0xfd extra:pebs rel_call_pebs Counts near relative CALL branch instructions retired.
+ 0xfb extra: ind_call Counts near indirect CALL branch instructions retired.
+ 0xfb extra:pebs ind_call_pebs Counts near indirect CALL branch instructions retired.
+ 0xf7 extra: return Counts near return branch instructions retired.
+ 0xf7 extra:pebs return_pebs Counts near return branch instructions retired.
+ 0xeb extra: non_return_ind Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xeb extra:pebs non_return_ind_pebs Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xbf extra: far_branch Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+ 0xbf extra:pebs far_branch_pebs Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+name:br_misp_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts mispredicted branch instructions retired including all branch types.
+ 0x0 extra:pebs all_branches_pebs Counts mispredicted branch instructions retired including all branch types.
+ 0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0xfe extra: taken_jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xeb extra: non_return_ind Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+ 0xeb extra:pebs non_return_ind_pebs Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+name:issue_slots_not_consumed type:exclusive default:0x0
+ 0x0 extra: any Counts the number of issue slots per core cycle that were not consumed by the backend due to either a full resource in the backend (RESOURCE_FULL) or due to the processor recovering from some event (RECOVERY)
+ 0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because of a full resource in the backend. Including but not limited the Re-order Buffer (ROB), reservation stations (RS), load/store buffers, physical registers, or any other needed machine resource that is currently unavailable. Note that uops must be available for consumption in order for this event to fire. If a uop is not available (Instruction Queue is empty), this event will not count.
+ 0x2 extra: recovery Counts the number of issue slots per core cycle that were not consumed by the backend because allocation is stalled waiting for a mispredicted jump to retire or other branch-like conditions (e.g. the event is relevant during certain microcode flows). Counts all issue slots blocked while within this window including slots where uops were not available in the Instruction Queue.
+name:hw_interrupts type:exclusive default:0x1
+ 0x1 extra: received Counts hardware interrupts received by the processor.
+ 0x4 extra: pending_and_masked Counts core cycles during which there are pending interrupts, but interrupts are masked (EFLAGS.IF = 0).
+name:cycles_div_busy type:exclusive default:0x0
+ 0x0 extra: all Counts core cycles if either divide unit is busy.
+ 0x1 extra: idiv Counts core cycles the integer divide unit is busy.
+ 0x2 extra: fpdiv Counts core cycles the floating point divide unit is busy.
+name:mem_uops_retired type:exclusive default:all
+ 0x83 extra: all Counts the number of memory uops retired that is either a loads or a store or both.
+ 0x81 extra: all_loads Counts the number of load uops retired
+ 0x81 extra:pebs all_loads_pebs Counts the number of load uops retired
+ 0x82 extra: all_stores Counts the number of store uops retired
+ 0x82 extra:pebs all_stores_pebs Counts the number of store uops retired
+ 0x83 extra:pebs all_pebs Counts the number of memory uops retired that is either a loads or a store or both.
+ 0x11 extra: dtlb_miss_loads Counts load uops retired that caused a DTLB miss.
+ 0x11 extra:pebs dtlb_miss_loads_pebs Counts load uops retired that caused a DTLB miss.
+ 0x12 extra: dtlb_miss_stores Counts store uops retired that caused a DTLB miss.
+ 0x12 extra:pebs dtlb_miss_stores_pebs Counts store uops retired that caused a DTLB miss.
+ 0x13 extra: dtlb_miss Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x13 extra:pebs dtlb_miss_pebs Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x21 extra: lock_loads Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x21 extra:pebs lock_loads_pebs Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+name:mem_load_uops_retired type:exclusive default:l1_hit
+ 0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache
+ 0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache
+ 0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache
+ 0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache
+ 0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache
+ 0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache
+ 0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache
+ 0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache
+ 0x20 extra: hitm Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x20 extra:pebs hitm_pebs Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+ 0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+name:baclears type:exclusive default:0x1
+ 0x1 extra: all Counts the number of times a BACLEAR is signaled for any reason, including, but not limited to indirect branch/call, Jcc (Jump on Conditional Code/Jump if Condition is Met) branch, unconditional branch/call, and returns.
+ 0x8 extra: return Counts BACLEARS on return instructions.
+ 0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Conditon is Met) branches.
diff --git a/libop/op_cpu_type.c b/libop/op_cpu_type.c
index b1d5ecfb3090..7bdde53969eb 100644
--- a/libop/op_cpu_type.c
+++ b/libop/op_cpu_type.c
@@ -122,6 +122,7 @@ static struct cpu_descr const cpu_descrs[MAX_CPU_TYPE] = {
{ "ARM Cortex-A57", "arm/armv8-ca57", CPU_ARM_V8_CA57, 6},
{ "ARM Cortex-A53", "arm/armv8-ca53", CPU_ARM_V8_CA53, 6},
{ "Intel Skylake microarchitecture", "i386/skylake", CPU_SKYLAKE, 4 },
+ { "Intel Goldmont microarchitecture", "i386/goldmont", CPU_GOLDMONT, 4 },
};

static size_t const nr_cpu_descrs = sizeof(cpu_descrs) / sizeof(struct cpu_descr);
@@ -739,6 +740,7 @@ op_cpu op_cpu_base_type(op_cpu cpu_type)
case CPU_HASWELL:
case CPU_BROADWELL:
case CPU_SKYLAKE:
+ case CPU_GOLDMONT:
case CPU_SILVERMONT:
case CPU_WESTMERE:
case CPU_SANDYBRIDGE:
diff --git a/libop/op_cpu_type.h b/libop/op_cpu_type.h
index 9983f8752954..98289c5b96ea 100644
--- a/libop/op_cpu_type.h
+++ b/libop/op_cpu_type.h
@@ -102,6 +102,7 @@ typedef enum {
CPU_ARM_V8_CA57, /* ARM Cortex-A57 */
CPU_ARM_V8_CA53, /* ARM Cortex-A53 */
CPU_SKYLAKE, /** < Intel Skylake microarchitecture */
+ CPU_GOLDMONT, /** < Intel Goldmont microarchitecture */
MAX_CPU_TYPE
} op_cpu;

diff --git a/libop/op_events.c b/libop/op_events.c
index 25f010ef1aba..cdd04093d068 100644
--- a/libop/op_events.c
+++ b/libop/op_events.c
@@ -1212,6 +1212,7 @@ void op_default_event(op_cpu cpu_type, struct op_default_event_descr * descr)
descr->name = "CPU_CLK_UNHALTED";
break;

+ case CPU_GOLDMONT:
case CPU_SKYLAKE:
descr->name = "cpu_clk_unhalted";
break;
diff --git a/libop/op_hw_specific.h b/libop/op_hw_specific.h
index 994fec473ebc..4cc6bc0c69dc 100644
--- a/libop/op_hw_specific.h
+++ b/libop/op_hw_specific.h
@@ -161,6 +161,9 @@ static inline op_cpu op_cpu_specific_type(op_cpu cpu_type)
case 0x4d:
case 0x4c:
return CPU_SILVERMONT;
+ case 0x5c:
+ case 0x5f:
+ return CPU_GOLDMONT;
}
}
return cpu_type;
diff --git a/utils/ophelp.c b/utils/ophelp.c
index fdddddcc3aed..58215935fb01 100644
--- a/utils/ophelp.c
+++ b/utils/ophelp.c
@@ -544,6 +544,7 @@ int main(int argc, char const * argv[])
case CPU_BROADWELL:
case CPU_SKYLAKE:
case CPU_SILVERMONT:
+ case CPU_GOLDMONT:
case CPU_WESTMERE:
case CPU_SANDYBRIDGE:
case CPU_IVYBRIDGE:

--
2.5.5

Andi Kleen

2016-04-26 14:52:51 UTC

From: Andi Kleen <***@linux.intel.com>

Add support for the Intel Goldmont events.

OFFCORE_RESPONSE.* is not supported.

v2: Fix typos in descriptions.
v3: Add inst_retired.any_pebs
Signed-off-by: Andi Kleen <***@linux.intel.com>
---
events/Makefile.am | 1 +
events/i386/goldmont/events | 34 +++++++++
events/i386/goldmont/unit_masks | 155 ++++++++++++++++++++++++++++++++++++++++
libop/op_cpu_type.c | 2 +
libop/op_cpu_type.h | 1 +
libop/op_events.c | 1 +
libop/op_hw_specific.h | 3 +
utils/ophelp.c | 1 +
8 files changed, 198 insertions(+)
create mode 100644 events/i386/goldmont/events
create mode 100644 events/i386/goldmont/unit_masks

diff --git a/events/Makefile.am b/events/Makefile.am
index 56f9020b1f1a..677b05f93d61 100644
--- a/events/Makefile.am
+++ b/events/Makefile.am
@@ -20,6 +20,7 @@ event_files = \
i386/broadwell/events i386/broadwell/unit_masks \
i386/skylake/events i386/skylake/unit_masks \
i386/silvermont/events i386/silvermont/unit_masks \
+ i386/goldmont/events i386/goldmont/unit_masks \
ppc64/architected_events_v1/events ppc64/architected_events_v1/unit_masks \
ppc64/power4/events ppc64/power4/event_mappings ppc64/power4/unit_masks \
ppc64/power5/events ppc64/power5/event_mappings ppc64/power5/unit_masks \
diff --git a/events/i386/goldmont/events b/events/i386/goldmont/events
new file mode 100644
index 000000000000..111438eb289e
--- /dev/null
+++ b/events/i386/goldmont/events
@@ -0,0 +1,34 @@
+#
+# Intel "Goldmont" microarchitecture core events.
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+# Note the minimum counts are not discovered experimentally and could be likely
+# lowered in many cases without ill effect.
+#
+event:0x00 counters:cpuid um:cpu_clk_unhalted minimum:2000003 name:cpu_clk_unhalted :
+event:0x03 counters:cpuid um:ld_blocks minimum:200003 name:ld_blocks :
+event:0x05 counters:cpuid um:page_walks minimum:200003 name:page_walks :
+event:0x0e counters:cpuid um:uops_issued minimum:200003 name:uops_issued_any :
+event:0x13 counters:cpuid um:misalign_mem_ref minimum:200003 name:misalign_mem_ref :
+event:0x2e counters:cpuid um:longest_lat_cache minimum:200003 name:longest_lat_cache :
+event:0x30 counters:cpuid um:l2_reject_xq minimum:200003 name:l2_reject_xq_all :
+event:0x31 counters:cpuid um:core_reject_l2q minimum:200003 name:core_reject_l2q_all :
+event:0x51 counters:cpuid um:dl1 minimum:200003 name:dl1_dirty_eviction :
+event:0x80 counters:cpuid um:icache minimum:200003 name:icache :
+event:0x81 counters:cpuid um:itlb minimum:200003 name:itlb_miss :
+event:0x86 counters:cpuid um:fetch_stall minimum:200003 name:fetch_stall_icache_fill_pending_cycles :
+event:0x9c counters:cpuid um:uops_not_delivered minimum:200003 name:uops_not_delivered_any :
+event:0xc0 counters:cpuid um:inst_retired minimum:2000003 name:inst_retired :
+event:0xc2 counters:cpuid um:uops_retired minimum:2000003 name:uops_retired :
+event:0xc3 counters:cpuid um:machine_clears minimum:200003 name:machine_clears :
+event:0xc4 counters:cpuid um:br_inst_retired minimum:200003 name:br_inst_retired :
+event:0xc5 counters:cpuid um:br_misp_retired minimum:200003 name:br_misp_retired :
+event:0xca counters:cpuid um:issue_slots_not_consumed minimum:200003 name:issue_slots_not_consumed :
+event:0xcb counters:cpuid um:hw_interrupts minimum:200003 name:hw_interrupts :
+event:0xcd counters:cpuid um:cycles_div_busy minimum:2000003 name:cycles_div_busy :
+event:0xd0 counters:cpuid um:mem_uops_retired minimum:200003 name:mem_uops_retired :
+event:0xd1 counters:cpuid um:mem_load_uops_retired minimum:200003 name:mem_load_uops_retired :
+event:0xe6 counters:cpuid um:baclears minimum:200003 name:baclears :
+event:0xe7 counters:cpuid um:ms_decoded minimum:200003 name:ms_decoded_ms_entry :
+event:0xe9 counters:cpuid um:decode_restriction minimum:200003 name:decode_restriction_predecode_wrong :
diff --git a/events/i386/goldmont/unit_masks b/events/i386/goldmont/unit_masks
new file mode 100644
index 000000000000..2f265b3da3f4
--- /dev/null
+++ b/events/i386/goldmont/unit_masks
@@ -0,0 +1,155 @@
+#
+# Unit masks for the Intel "Goldmont" micro architecture
+#
+# See http://ark.intel.com/ for help in identifying Goldmont based CPUs
+#
+name:core_reject_l2q type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and L1 prefetcher requests rejected by the L2Q due to a full or nearly full condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a core's dirty eviction when the address conflicts with incoming external snoops.
+name:decode_restriction type:mandatory default:0x1
+ 0x1 extra: predecode_wrong Counts the number of times the prediction (from the predecode cache) for instruction length is incorrect.
+name:dl1 type:mandatory default:0x1
+ 0x1 extra: dirty_eviction Counts when a modified (dirty) cache line is evicted from the data L1 cache and needs to be written back to memory. No count will occur if the evicted line is clean, and hence does not require a writeback.
+name:fetch_stall type:mandatory default:0x2
+ 0x2 extra: icache_fill_pending_cycles Counts the number of cycles fetch stalls because of an icache miss. This is a cummulative count of cycles stalled for all icache misses.
+name:itlb type:mandatory default:0x4
+ 0x4 extra: miss Counts the number of times the machine was unable to find a translation in the Instruction Translation Lookaside Buffer (ITLB) for a linear address of an instruction fetch. It counts when new translation are filled into the ITLB. The event is speculative in nature, but will not count translations (page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.
+name:l2_reject_xq type:mandatory default:0x0
+ 0x0 extra: all Counts the number of demand and prefetch transactions that the L2 XQ rejects due to a full or near full condition which likely indicates back pressure from the intra-die interconnect (IDI) fabric. The XQ may reject transactions from the L2Q (non-cacheable requests), L2 misses and L2 write-back victims.
+name:ms_decoded type:mandatory default:0x1
+ 0x1 extra: ms_entry Counts the number of times the Microcde Sequencer (MS) starts a flow of uops from the MSROM. It does not count every time a uop is read from the MSROM. The most common case that this counts is when a micro-coded instruction is encountered by the front end of the machine. Other cases include when an instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops. The event will count MS startups for uops that are speculative, and subsequently cleared by branch mispredict or a machine clear.
+name:uops_issued type:mandatory default:0x0
+ 0x0 extra: any Counts uops issued by the front end and allocated into the back end of the machine. This event counts uops that retire as well as uops that were speculatively executed but didn't retire. The sort of speculative uops that might be counted includes, but is not limited to those uops issued in the shadow of a miss-predicted branch, those uops that are inserted during an assist (such as for a denormal floating point result), and (previously allocated) uops that might be canceled during a machine clear.
+name:uops_not_delivered type:mandatory default:0x0
+ 0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to identify if the machine is truly front-end bound. When this event occurs, it is an indication that the front-end of the machine is operating at less than its theoretical peak performance.
+name:cpu_clk_unhalted type:exclusive default:core
+ 0x2 extra: core Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time to time. For this reason this event may have a changing ratio with regards to time. This event uses fixed counter 1. You cannot collect a PEBs record for this event.
+ 0x1 extra: ref_tsc Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. In mobile systems the core frequency may change from time. This event is not affected by core frequency changes but counts as if the core is running at the maximum frequency all the time. This event uses fixed counter 2. You cannot collect a PEBs record for this event
+ 0x0 extra: core_p Core cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+ 0x1 extra: ref Reference cycles when core is not halted. This event uses a (_P)rogrammable general purpose performance counter.
+name:ld_blocks type:exclusive default:all_block
+ 0x10 extra: all_block Counts anytime a load that retires is blocked for any reason.
+ 0x10 extra:pebs all_block_pebs Counts anytime a load that retires is blocked for any reason.
+ 0x8 extra: utlb_miss Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x8 extra:pebs utlb_miss_pebs Counts loads blocked because they are unable to find their physical address in the micro TLB (UTLB).
+ 0x1 extra: data_unknown Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x1 extra:pebs data_unknown_pebs Counts a load blocked from using a store forward, but did not occur because the store data was not available at the right time. The forward might occur subsequently when the data is available.
+ 0x4 extra: u4k_alias Counts loads that block because their address modulo 4K matches a pending store.
+ 0x4 extra:pebs u4k_alias_pebs Counts loads that block because their address modulo 4K matches a pending store.
+name:page_walks type:exclusive default:0x1
+ 0x1 extra: d_side_cycles Counts every core cycle when a Data-side walks (due to data operation) page walk is in progress.
+ 0x2 extra: i_side_cycles Counts every core cycle when a Instruction-side (walks due to an instruction fetch) page walk is in progress.
+ 0x3 extra: cycles Counts every core cycle a page-walk is in progress due to either a data memory operation or an instruction fetch.
+name:misalign_mem_ref type:exclusive default:load_page_split
+ 0x2 extra: load_page_split Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x2 extra:pebs load_page_split_pebs Counts when a memory load of a uop spans a page boundary (a split) is retired.
+ 0x4 extra: store_page_split Counts when a memory store of a uop spans a page boundary (a split) is retired.
+ 0x4 extra:pebs store_page_split_pebs Counts when a memory store of a uop spans a page boundary (a split) is retired.
+name:longest_lat_cache type:exclusive default:0x4f
+ 0x4f extra: reference Counts memory requests originating from the core that reference a cache line in the L2 cache.
+ 0x41 extra: miss Counts memory requests originating from the core that miss in the L2 cache.
+name:icache type:exclusive default:0x1
+ 0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
+ 0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
+ 0x3 extra: accesses Counts each cache line access to the Icache
+name:inst_retired type:exclusive default:any
+ 0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The counter continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses fixed counter 0. You cannot collect a PEBs record for this event
+ 0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter.
+ 0x0 extra:pebs any_pebs Counts the number of instructions that retire execution. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event continues counting during hardware interrupts, traps, and inside interrupt handlers. This is an architectural performance event. This event uses a (_P)rogrammable general purpose performance counter. *This event is Precise Event capable: The EventingRIP field in the PEBS record is precise to the address of the instruction which caused the event. Note: Because PEBS records can be collected only on IA32_PMC0, only one event can use the PEBS facility at a time.
+name:uops_retired type:exclusive default:any
+ 0x0 extra: any Counts uops which retired
+ 0x0 extra:pebs any_pebs Counts uops which retired
+ 0x1 extra: ms Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x1 extra:pebs ms_pebs Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). Counts both the uops from a micro-coded instruction, and the uops that might be generated from a micro-coded assist.
+ 0x8 extra: fpdiv Counts the number of floating point divide uops retired.
+ 0x8 extra:pebs fpdiv_pebs Counts the number of floating point divide uops retired.
+ 0x10 extra: idiv Counts the number of integer divide uops retired.
+ 0x10 extra:pebs idiv_pebs Counts the number of integer divide uops retired.
+name:machine_clears type:exclusive default:0x0
+ 0x0 extra: all Counts machine clears for any reason
+ 0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code section and has to perform a machine clear because of that modification. Self-modifying code (SMC) causes a severe penalty in all Intel architecture processors.
+ 0x2 extra: memory_ordering Counts machine clears due to memory ordering issues. This occurs when a snoop request happens and the machine is uncertain if memory ordering will be preserved, as another core is in the process of modifying the data.
+ 0x4 extra: fp_assist Counts machine clears due to floating point (FP) operations needing assists. For instance, if the result was a floating point denormal, the hardware clears the pipeline and reissues uops to produce the correct IEEE compliant denormal result.
+ 0x8 extra: disambiguation Counts machine clears due to memory disambiguation. Memory disambiguation happens when a load which has been issued conflicts with a previous unretired store in the pipeline whose address was not known at issue time, but is later resolved to be the same as the load address.
+name:br_inst_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x0 extra:pebs all_branches_pebs Counts branch instructions retired for all branch types. This is an architectural performance event.
+ 0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was taken and when it was not taken.
+ 0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+ 0xf9 extra: call Counts near CALL branch instructions retired.
+ 0xf9 extra:pebs call_pebs Counts near CALL branch instructions retired.
+ 0xfd extra: rel_call Counts near relative CALL branch instructions retired.
+ 0xfd extra:pebs rel_call_pebs Counts near relative CALL branch instructions retired.
+ 0xfb extra: ind_call Counts near indirect CALL branch instructions retired.
+ 0xfb extra:pebs ind_call_pebs Counts near indirect CALL branch instructions retired.
+ 0xf7 extra: return Counts near return branch instructions retired.
+ 0xf7 extra:pebs return_pebs Counts near return branch instructions retired.
+ 0xeb extra: non_return_ind Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xeb extra:pebs non_return_ind_pebs Counts near indirect call or near indirect jmp branch instructions retired.
+ 0xbf extra: far_branch Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+ 0xbf extra:pebs far_branch_pebs Counts far branch instructions retired. This includes far jump, far call and return, and Interrupt call and return. Intel Architecture uses far branches to transition to a different privilege level (ex: kernel/user).
+name:br_misp_retired type:exclusive default:all_branches
+ 0x0 extra: all_branches Counts mispredicted branch instructions retired including all branch types.
+ 0x0 extra:pebs all_branches_pebs Counts mispredicted branch instructions retired including all branch types.
+ 0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch instructions retired, including both when the branch was supposed to be taken and when it was not supposed to be taken (but the processor predicted the opposite condition).
+ 0xfe extra: taken_jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfe extra:pebs taken_jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted that it would not be taken.
+ 0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired, where the target address taken was not what the processor predicted.
+ 0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the return address taken was not what the processor predicted.
+ 0xeb extra: non_return_ind Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+ 0xeb extra:pebs non_return_ind_pebs Counts mispredicted branch instructions retired that were near indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
+name:issue_slots_not_consumed type:exclusive default:0x0
+ 0x0 extra: any Counts the number of issue slots per core cycle that were not consumed by the backend due to either a full resource in the backend (RESOURCE_FULL) or due to the processor recovering from some event (RECOVERY)
+ 0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because of a full resource in the backend. Including but not limited the Re-order Buffer (ROB), reservation stations (RS), load/store buffers, physical registers, or any other needed machine resource that is currently unavailable. Note that uops must be available for consumption in order for this event to fire. If a uop is not available (Instruction Queue is empty), this event will not count.
+ 0x2 extra: recovery Counts the number of issue slots per core cycle that were not consumed by the backend because allocation is stalled waiting for a mispredicted jump to retire or other branch-like conditions (e.g. the event is relevant during certain microcode flows). Counts all issue slots blocked while within this window including slots where uops were not available in the Instruction Queue.
+name:hw_interrupts type:exclusive default:0x1
+ 0x1 extra: received Counts hardware interrupts received by the processor.
+ 0x4 extra: pending_and_masked Counts core cycles during which there are pending interrupts, but interrupts are masked (EFLAGS.IF = 0).
+name:cycles_div_busy type:exclusive default:0x0
+ 0x0 extra: all Counts core cycles if either divide unit is busy.
+ 0x1 extra: idiv Counts core cycles the integer divide unit is busy.
+ 0x2 extra: fpdiv Counts core cycles the floating point divide unit is busy.
+name:mem_uops_retired type:exclusive default:all
+ 0x83 extra: all Counts the number of memory uops retired that is either a loads or a store or both.
+ 0x81 extra: all_loads Counts the number of load uops retired
+ 0x81 extra:pebs all_loads_pebs Counts the number of load uops retired
+ 0x82 extra: all_stores Counts the number of store uops retired
+ 0x82 extra:pebs all_stores_pebs Counts the number of store uops retired
+ 0x83 extra:pebs all_pebs Counts the number of memory uops retired that is either a loads or a store or both.
+ 0x11 extra: dtlb_miss_loads Counts load uops retired that caused a DTLB miss.
+ 0x11 extra:pebs dtlb_miss_loads_pebs Counts load uops retired that caused a DTLB miss.
+ 0x12 extra: dtlb_miss_stores Counts store uops retired that caused a DTLB miss.
+ 0x12 extra:pebs dtlb_miss_stores_pebs Counts store uops retired that caused a DTLB miss.
+ 0x13 extra: dtlb_miss Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x13 extra:pebs dtlb_miss_pebs Counts uops retired that had a DTLB miss on load, store or either. Note that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as a DTLB miss.
+ 0x21 extra: lock_loads Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x21 extra:pebs lock_loads_pebs Counts locked memory uops retired. This includes "regular" locks and bus locks. (To specifically count bus locks only, see the Offcore response event.) A locked access is one with a lock prefix, or an exchange to memory. See the SDM for a complete description of which memory load accesses are locks.
+ 0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+ 0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
+name:mem_load_uops_retired type:exclusive default:l1_hit
+ 0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache
+ 0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache
+ 0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache
+ 0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache
+ 0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache
+ 0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache
+ 0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache
+ 0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache
+ 0x20 extra: hitm Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x20 extra:pebs hitm_pebs Counts load uops retired where the cache line containing the data was in the modified state of another core or modules cache (HITM). More specifically, this means that when the load address was checked by other caching agents (typically another processor) in the system, one of those caching agents indicated that they had a dirty copy of the data. Loads that obtain a HITM response incur greater latency than most is typical for a load. In addition, since HITM indicates that some other processor had this data in its cache, it implies that the data was shared between processors, or potentially was a lock or semaphore value. This event is useful for locating sharing, false sharing, and contended locks.
+ 0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB (or fill buffer), indicating that the load found its data while that data was in the process of being brought into the L1 cache. Typically a load will receive this indication when some other load or prefetch missed the L1 cache and was in the process of retrieving the cache line containing the data , but that process had not yet finished (and written the data back to the cache). For example, consider load X and Y, both referencing the same cache line that is not in the L1 cache. If load X misses cache first, it obtains and WCB (or fill buffer) and begins the process of requesting the data. When load Y requests the data, it will either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+ 0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+ 0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. Event is counted at retirment, so the speculative loads are ignored. A memory load can hit (or miss) the L1 cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+name:baclears type:exclusive default:0x1
+ 0x1 extra: all Counts the number of times a BACLEAR is signaled for any reason, including, but not limited to indirect branch/call, Jcc (Jump on Conditional Code/Jump if Condition is Met) branch, unconditional branch/call, and returns.
+ 0x8 extra: return Counts BACLEARS on return instructions.
+ 0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Conditon is Met) branches.
diff --git a/libop/op_cpu_type.c b/libop/op_cpu_type.c
index b1d5ecfb3090..7bdde53969eb 100644
--- a/libop/op_cpu_type.c
+++ b/libop/op_cpu_type.c
@@ -122,6 +122,7 @@ static struct cpu_descr const cpu_descrs[MAX_CPU_TYPE] = {
{ "ARM Cortex-A57", "arm/armv8-ca57", CPU_ARM_V8_CA57, 6},
{ "ARM Cortex-A53", "arm/armv8-ca53", CPU_ARM_V8_CA53, 6},
{ "Intel Skylake microarchitecture", "i386/skylake", CPU_SKYLAKE, 4 },
+ { "Intel Goldmont microarchitecture", "i386/goldmont", CPU_GOLDMONT, 4 },
};

static size_t const nr_cpu_descrs = sizeof(cpu_descrs) / sizeof(struct cpu_descr);
@@ -739,6 +740,7 @@ op_cpu op_cpu_base_type(op_cpu cpu_type)
case CPU_HASWELL:
case CPU_BROADWELL:
case CPU_SKYLAKE:
+ case CPU_GOLDMONT:
case CPU_SILVERMONT:
case CPU_WESTMERE:
case CPU_SANDYBRIDGE:
diff --git a/libop/op_cpu_type.h b/libop/op_cpu_type.h
index 9983f8752954..98289c5b96ea 100644
--- a/libop/op_cpu_type.h
+++ b/libop/op_cpu_type.h
@@ -102,6 +102,7 @@ typedef enum {
CPU_ARM_V8_CA57, /* ARM Cortex-A57 */
CPU_ARM_V8_CA53, /* ARM Cortex-A53 */
CPU_SKYLAKE, /** < Intel Skylake microarchitecture */
+ CPU_GOLDMONT, /** < Intel Goldmont microarchitecture */
MAX_CPU_TYPE
} op_cpu;

diff --git a/libop/op_events.c b/libop/op_events.c
index 25f010ef1aba..cdd04093d068 100644
--- a/libop/op_events.c
+++ b/libop/op_events.c
@@ -1212,6 +1212,7 @@ void op_default_event(op_cpu cpu_type, struct op_default_event_descr * descr)
descr->name = "CPU_CLK_UNHALTED";
break;

+ case CPU_GOLDMONT:
case CPU_SKYLAKE:
descr->name = "cpu_clk_unhalted";
break;
diff --git a/libop/op_hw_specific.h b/libop/op_hw_specific.h
index 994fec473ebc..4cc6bc0c69dc 100644
--- a/libop/op_hw_specific.h
+++ b/libop/op_hw_specific.h
@@ -161,6 +161,9 @@ static inline op_cpu op_cpu_specific_type(op_cpu cpu_type)
case 0x4d:
case 0x4c:
return CPU_SILVERMONT;
+ case 0x5c:
+ case 0x5f:
+ return CPU_GOLDMONT;
}
}
return cpu_type;
diff --git a/utils/ophelp.c b/utils/ophelp.c
index fdddddcc3aed..58215935fb01 100644
--- a/utils/ophelp.c
+++ b/utils/ophelp.c
@@ -544,6 +544,7 @@ int main(int argc, char const * argv[])
case CPU_BROADWELL:
case CPU_SKYLAKE:
case CPU_SILVERMONT:
+ case CPU_GOLDMONT:
case CPU_WESTMERE:
case CPU_SANDYBRIDGE:
case CPU_IVYBRIDGE:

--
2.5.5

William Cohen

2016-04-26 15:14:46 UTC

Post by Andi Kleen
Add support for the Intel Goldmont events.
OFFCORE_RESPONSE.* is not supported.
v2: Fix typos in descriptions.
v3: Add inst_retired.any_pebs

Hi Andi,

Thanks for the patch to support Intel Goldmont. It has been applied to the git repo.

-Will

Andi Kleen

2016-04-26 17:03:06 UTC

Post by William Cohen
Thanks for the patch to support Intel Goldmont. It has been applied to the git repo.

Thanks. Did you push it? I don't see it on sf master.
-Andi

Andi Kleen

2016-04-26 17:20:13 UTC