{"id":15285,"date":"2022-11-10T04:08:05","date_gmt":"2022-11-10T04:08:05","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2022\/11\/10\/seeing-through-hardware-counters-a-journey-to-threefold-performance-increase-by-netflix-technology-blog-nov-2022\/"},"modified":"2022-11-10T04:08:05","modified_gmt":"2022-11-10T04:08:05","slug":"seeing-via-hardware-counters-a-journey-to-threefold-efficiency-improve-by-netflix-technology-blog-nov-2022","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2022\/11\/10\/seeing-via-hardware-counters-a-journey-to-threefold-efficiency-improve-by-netflix-technology-blog-nov-2022\/","title":{"rendered":"Seeing via {hardware} counters: a journey to threefold efficiency improve | by Netflix Technology Blog | Nov, 2022"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div>\n<p id=\"2804\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\"><em class=\"lb\">By <\/em><a class=\"au lc\" href=\"https:\/\/www.linkedin.com\/in\/vfilanovsky\/\" rel=\"noopener ugc nofollow\" target=\"_blank\"><em class=\"lb\">Vadim Filanovsky<\/em><\/a><em class=\"lb\"> and <\/em><a class=\"au lc\" href=\"https:\/\/www.linkedin.com\/in\/harshad-sane-56711a11\/\" rel=\"noopener ugc nofollow\" target=\"_blank\"><em class=\"lb\">Harshad Sane<\/em><\/a><\/p>\n<p id=\"e9d4\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">In one in all our earlier blogposts, <a class=\"au lc\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/a-microscope-on-microservices-923b906103f4\">A Microscope on Microservices<\/a> we outlined three broad domains of observability (or \u201clevels of magnification,\u201d as we referred to them) \u2014 Fleet-wide, Microservice and Instance. We described the instruments and strategies we use to realize perception inside every area. 
There is, however, a class of problems that requires an even stronger level of magnification, going deeper down the stack to introspect CPU microarchitecture. In this blog post we describe one such problem and the tools we used to solve it.<\/p>\n<p id=\"3d37\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">It started off as a routine migration. At Netflix, we periodically reevaluate our workloads to optimize utilization of available capacity. We decided to move one of our Java microservices \u2014 let\u2019s call it GS2 \u2014 to a larger AWS instance size, from m5.4xl (16 vCPUs) to m5.12xl (48 vCPUs). The workload of GS2 is computationally heavy, with CPU being the limiting resource. While we understand it\u2019s virtually impossible to achieve a linear increase in throughput as the number of vCPUs grows, a near-linear increase is attainable. Consolidating on the larger instances reduces the amortized cost of background tasks, freeing up additional resources for serving requests and potentially offsetting the sub-linear scaling. Thus, we expected to roughly triple throughput per instance from this migration, as 12xl instances have three times the number of vCPUs compared to 4xl instances. A quick <a class=\"au lc\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69\">canary test<\/a> was free of errors and showed lower latency, which is expected given that our standard canary setup routes an equal amount of traffic to both the baseline running on 4xl and the canary on 12xl.
As GS2 relies on <a class=\"au lc\" href=\"https:\/\/docs.aws.amazon.com\/autoscaling\/ec2\/userguide\/what-is-amazon-ec2-auto-scaling.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">AWS EC2 Auto Scaling<\/a> to target-track CPU utilization, we thought we just had to redeploy the service on the larger instance type and wait for the <a class=\"au lc\" href=\"https:\/\/docs.aws.amazon.com\/autoscaling\/ec2\/userguide\/auto-scaling-groups.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">ASG (Auto Scaling Group)<\/a> to settle on the CPU target. Unfortunately, the initial results were far from our expectations:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*YN0AGLHPUq8OxzI_ATijTw.png\" alt=\"Average per-node throughput overlaid with CPU utilization, and average request latency\" class=\"ce mm mn c\" width=\"692\" height=\"736\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"8135\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">The first graph above represents average per-node throughput overlaid with average CPU utilization, while the second graph shows average request latency. We can see that as we reached roughly the same CPU target of 55%, throughput increased only by ~25% on average, falling far short of our desired goal. What\u2019s worse, average latency degraded by more than 50%, with both CPU and latency patterns becoming more \u201cchoppy.\u201d GS2 is a stateless service that receives traffic through a flavor of round-robin load balancer, so all nodes should receive nearly equal amounts of traffic. Indeed, the RPS (Requests Per Second) data shows very little variation in throughput between nodes:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*Gfj06j1x8m284vbLjibeTg.png\" alt=\"Per-node RPS showing little variation between nodes\" class=\"ce mm mn c\" width=\"692\" height=\"295\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"b6d7\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">But as we started looking at the breakdown of CPU and latency by node, a strange pattern emerged:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*t8GZ405S_-mSZa31JJHlZA.png\" alt=\"Bimodal distribution of CPU and latency across nodes\" class=\"ce mm mn c\" width=\"692\" height=\"610\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"12be\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Although we confirmed fairly equal traffic distribution between nodes, CPU and latency metrics surprisingly demonstrated a very different, bimodal distribution pattern.
There is a \u201clower band\u201d of nodes exhibiting much lower CPU and latency with hardly any variation, and an \u201cupper band\u201d of nodes with significantly higher CPU\/latency and wide variation. We noticed that only ~12% of the nodes fall into the lower band, a figure that was suspiciously consistent over time. In both bands, performance characteristics remain consistent for the entire uptime of the JVM on the node, i.e. nodes never jumped between bands. This was our starting point for troubleshooting.<\/p>\n<p id=\"5187\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">Our first (and rather obvious) step at solving the problem was to compare <a class=\"au lc\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/java-in-flames-e763b3d32166\">flame graphs<\/a> for the \u201cslow\u201d and \u201cfast\u201d nodes. While flame graphs clearly reflected the difference in CPU utilization as the number of collected samples, the distribution across the stacks remained the same, leaving us with no additional insight. We turned to JVM-specific profiling, starting with the basic hotspot stats, and then switching to more detailed <a class=\"au lc\" href=\"https:\/\/docs.oracle.com\/en\/java\/java-components\/jdk-mission-control\/8\/user-guide\/using-jdk-flight-recorder.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">JFR (Java Flight Recorder)<\/a> captures to compare the distribution of the events. Again, we came away empty-handed as there was no noticeable difference in the amount or the distribution of the events between the \u201cslow\u201d and \u201cfast\u201d nodes.
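For readers who want to reproduce this kind of capture, JFR recordings can also be driven programmatically through the `jdk.jfr` API. This is a minimal, generic sketch; the event names and output path are illustrative, not from our setup:

```java
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;

public class JfrCapture {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Sample CPU load once per second and record monitor waits.
            recording.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            recording.enable("jdk.JavaMonitorWait");
            recording.start();
            Thread.sleep(2_000); // let the workload run while events accumulate
            recording.stop();
            recording.dump(Path.of("capture.jfr")); // inspect with JDK Mission Control
        }
    }
}
```

The resulting `.jfr` files can then be compared side by side for \u201cslow\u201d and \u201cfast\u201d nodes in JDK Mission Control.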
Still suspecting something might be off with JIT behavior, we ran some basic stats against symbol maps obtained by <a class=\"au lc\" href=\"https:\/\/github.com\/jvm-profiling-tools\/perf-map-agent\" rel=\"noopener ugc nofollow\" target=\"_blank\">perf-map-agent<\/a>, only to hit another dead end.<\/p>\n<p id=\"c613\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">Convinced we weren\u2019t missing anything at the app-, OS- and JVM-levels, we felt the answer might be hidden at a lower level. Luckily, the m5.12xl instance type exposes a set of core <a class=\"au lc\" href=\"https:\/\/www.brendangregg.com\/blog\/2017-05-04\/the-pmcs-of-ec2.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">PMCs<\/a> (Performance Monitoring Counters, a.k.a. PMU counters), so we started by collecting a baseline set of counters using <a class=\"au lc\" href=\"https:\/\/github.com\/intel\/PerfSpect\" rel=\"noopener ugc nofollow\" target=\"_blank\">PerfSpect<\/a>:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*2y_jlUWQdw8M0GP6VH8LAQ.png\" alt=\"Baseline PMC comparison between a fast node and a slow node\" class=\"ce mm mn c\" width=\"692\" height=\"390\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"fe68\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">In the table above, the nodes displaying low CPU and low latency represent a \u201cfast node\u201d, while the nodes with higher CPU\/latency represent a \u201cslow node\u201d. Aside from the obvious CPU differences, we can see that the slow node has almost 3x the CPI (Cycles Per Instruction) of the fast node. We also see much higher L1 cache activity combined with a 4x higher count of <a class=\"au lc\" href=\"http:\/\/portal.nacad.ufrj.br\/online\/intel\/vtune2017\/help\/GUID-F0FD7660-58B5-4B5D-AA9A-E1AF21DDCA0E.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">MACHINE_CLEARS<\/a>. One common cause of these symptoms is so-called \u201cfalse sharing\u201d \u2014 a usage pattern that occurs when two cores read from \/ write to unrelated variables that happen to share the same L1 cache line. A cache line is a concept similar to a memory page \u2014 a contiguous chunk of data (typically 64 bytes on x86 systems) transferred to and from the cache.
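The penalty is easy to demonstrate outside the JVM internals. Below is a minimal, hardware-dependent sketch of our own (not code from GS2): two threads hammering adjacent longs in an `AtomicLongArray` contend for one cache line, while the same work on well-separated indices does not:

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    // 32 contiguous longs: indices 0 and 1 share a 64-byte cache line,
    // while indices 16 slots apart (128 bytes) land on different lines.
    static final AtomicLongArray counters = new AtomicLongArray(32);

    static long timed(int idxA, int idxB, long iters) throws InterruptedException {
        Thread a = new Thread(() -> { for (long i = 0; i < iters; i++) counters.incrementAndGet(idxA); });
        Thread b = new Thread(() -> { for (long i = 0; i < iters; i++) counters.incrementAndGet(idxB); });
        long start = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        long iters = 20_000_000L;
        long adjacent = timed(0, 1, iters);  // same cache line: coherency ping-pong
        long padded   = timed(2, 18, iters); // separate cache lines
        System.out.printf("adjacent: %d ms, padded: %d ms%n",
                adjacent / 1_000_000, padded / 1_000_000);
    }
}
```

On typical x86 hardware the adjacent case runs noticeably slower; the exact ratio varies by CPU, which is why we omit expected numbers.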
This diagram illustrates it:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*-q8ZA_At1AAtXMp2WURdtg.png\" alt=\"False sharing: two cores accessing unrelated variables that share a cache line\" class=\"ce mm mn c\" width=\"692\" height=\"469\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"5792\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Each core in this diagram has its own private cache. Since both cores are accessing the same memory region, the caches have to be kept consistent.
This consistency is ensured by the so-called \u201c<a class=\"au lc\" href=\"https:\/\/www.cs.auckland.ac.nz\/~goodman\/TechnicalReports\/MESIF-2009.pdf\" rel=\"noopener ugc nofollow\" target=\"_blank\">cache coherency protocol<\/a>.\u201d As Thread 0 writes to the \u201cred\u201d variable, the coherency protocol marks the whole cache line as \u201cmodified\u201d in Thread 0\u2019s cache and as \u201cinvalidated\u201d in Thread 1\u2019s cache. Later, when Thread 1 reads the \u201cblue\u201d variable, even though the \u201cblue\u201d variable was not modified, the coherency protocol forces the entire cache line to be reloaded from the cache that had the last modification \u2014 Thread 0\u2019s cache in this example. Resolving coherency across private caches takes time and causes CPU stalls. Additionally, ping-ponging coherency traffic has to be monitored through the <a class=\"au lc\" href=\"https:\/\/cvw.cac.cornell.edu\/ClusterArch\/LastLevelCache\" rel=\"noopener ugc nofollow\" target=\"_blank\">last level shared cache<\/a>\u2019s controller, which leads to even more stalls. We take CPU cache consistency for granted, but this \u201cfalse sharing\u201d pattern illustrates that there is a huge performance penalty for simply reading a variable that neighbors some other unrelated data.<\/p>\n<p id=\"d93c\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Armed with this knowledge, we used <a class=\"au lc\" href=\"https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/tools\/oneapi\/vtune-profiler.html\" rel=\"noopener ugc nofollow\" target=\"_blank\">Intel vTune<\/a> to run microarchitecture profiling. Drilling down into \u201chot\u201d methods and further into the assembly code showed us blocks of code with some instructions exceeding 100 CPI, which is extremely slow.
This is the summary of our findings:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mo\"><img src=\"https:\/\/miro.medium.com\/max\/1400\/1*pTN-nH9l4KXoW9XiStEavA.png\" alt=\"vTune assembly view with numbered markers linking source code and hot instructions\" class=\"ce mm mn c\" width=\"700\" height=\"270\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"e89b\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Numbered markers from 1 to 6 denote the same code\/variables across the sources and the vTune assembly view. The red arrow indicates that the CPI value likely belongs to the previous instruction \u2014 this is due to profiling skid in the absence of PEBS (Processor Event-Based Sampling), and usually it\u2019s off by a single instruction.
Based on the fact that (5) <em class=\"lb\">\u201crepne scan\u201d<\/em> is a rather rare operation in the JVM codebase, we were able to link this snippet to <a class=\"au lc\" href=\"https:\/\/github.com\/openjdk\/jdk8u\/blob\/jdk8u352-b07\/hotspot\/src\/cpu\/x86\/vm\/macroAssembler_x86.cpp#L5142-L5174\" rel=\"noopener ugc nofollow\" target=\"_blank\">the routine for subclass checking<\/a> (the same code exists in the JDK mainline as of the writing of this blog post). Going into the details of subtype checking in HotSpot is far beyond the scope of this blog post, but curious readers can learn more about it from the 2002 publication <a class=\"au lc\" href=\"https:\/\/www.researchgate.net\/publication\/221552851_Fast_subtype_checking_in_the_HotSpot_JVM\" rel=\"noopener ugc nofollow\" target=\"_blank\">Fast Subtype Checking in the HotSpot JVM<\/a>. Due to the nature of the class hierarchy used in this particular workload, we keep hitting the code path that keeps updating (6) the <em class=\"lb\">\u201c_secondary_super_cache\u201d<\/em> field, which is a single-element cache for the last-found secondary superclass. Note how this field is adjacent to the <em class=\"lb\">\u201c_secondary_supers\u201d<\/em> field, which is a list of all superclasses and is read (1) at the beginning of the scan. Multiple threads do these read-write operations, and if fields (1) and (6) fall into the same cache line, then we hit a false sharing case. We highlighted these fields with red and blue colors to connect them to the false sharing diagram above.<\/p>\n<p id=\"de07\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Note that since the cache line size is 64 bytes and the pointer size is 8 bytes, we have a 1 in 8 chance of these fields falling on separate cache lines, and a 7 in 8 chance of them sharing a cache line.
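The arithmetic behind the 1-in-8 figure can be checked by enumerating the eight possible 8-byte-aligned placements of the first field within a 64-byte line; this is a quick sanity check of ours, not from the original analysis:

```java
public class CacheLineOdds {
    public static void main(String[] args) {
        final int LINE = 64, PTR = 8;
        int split = 0, total = 0;
        // Enumerate every 8-byte-aligned offset the first field can take within a line.
        for (int off = 0; off < LINE; off += PTR) {
            total++;
            // The adjacent field starts at off + PTR; a different line index means a split.
            if ((off + PTR) / LINE != off / LINE) split++;
        }
        System.out.println(split + "/" + total); // prints "1/8"
    }
}
```

Only the placement in the last 8-byte slot of a line (offset 56) pushes the neighbor onto the next line, hence 1 in 8, or 12.5%.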
This 1-in-8 chance is 12.5%, matching our earlier observation on the proportion of the \u201cfast\u201d nodes. Fascinating!<\/p>\n<p id=\"ab68\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Although the fix involved patching the JDK, it was a simple change. We inserted padding between the <em class=\"lb\">\u201c_secondary_super_cache\u201d<\/em> and <em class=\"lb\">\u201c_secondary_supers\u201d<\/em> fields to ensure they never fall into the same cache line. Note that we did not change the functional aspect of JDK behavior, but rather the data layout:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mt\"><img src=\"https:\/\/miro.medium.com\/max\/1400\/1*390eHadYNHFdhHD3sbWr2A.png\" alt=\"JDK patch: padding inserted between _secondary_super_cache and _secondary_supers\" class=\"ce mm mn c\" width=\"700\" height=\"227\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"e334\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">The results of deploying the patch were immediately noticeable. The graph below is a breakdown of CPU by node. Here we can see a <a class=\"au lc\" href=\"https:\/\/spinnaker.io\/docs\/guides\/user\/kubernetes-v2\/rollout-strategies\/#redblack-rollouts\" rel=\"noopener ugc nofollow\" target=\"_blank\">red-black deployment<\/a> happening at noon, and the new ASG with the patched JDK taking over by 12:15:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*8WfR5AF9_cjHIkiR0QFHwg.png\" alt=\"CPU by node before and after the patched JDK deployment\" class=\"ce mm mn c\" width=\"692\" height=\"295\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"71ad\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku 
kv kw kx ky kz la iz ga\">Both CPU and latency (graph omitted for brevity) confirmed the same image \u2014 the \u201cslow\u201d band of nodes was gone!<\/p>\n<p id=\"f795\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">We didn\u2019t have a lot time to marvel at these outcomes, nonetheless. As the autoscaling reached our CPU goal, we seen that we nonetheless couldn\u2019t push greater than ~150 RPS per node \u2014 properly wanting our aim of ~250 RPS. Another spherical of vTune profiling on the patched JDK model confirmed the identical bottleneck round secondary superclass cache lookup. It was puzzling at first to see seemingly the identical downside coming again proper after we put in a repair, however upon nearer inspection we realized we\u2019re coping with \u201ctrue sharing\u201d now. Unlike \u201cfalse sharing,\u201d the place 2 impartial variables share a cache line, \u201ctrue sharing\u201d refers back to the similar variable being learn and written by a number of threads\/cores. In this case, <a class=\"au lc\" href=\"http:\/\/www.rdrop.com\/users\/paulmck\/scalability\/paper\/ordering.2007.09.19a.pdf\" rel=\"noopener ugc nofollow\" target=\"_blank\">CPU-enforced reminiscence ordering<\/a> is the reason for slowdown. We reasoned that eradicating the impediment of false sharing and growing the general throughput resulted in elevated execution of the identical JVM superclass caching code path. Essentially, we&#8217;ve increased execution concurrency, inflicting extreme stress on the superclass cache on account of CPU-enforced reminiscence ordering protocols. The widespread method to resolve that is to keep away from writing to the shared variable altogether, successfully bypassing the JVM\u2019s secondary superclass cache. Since this alteration altered the conduct of the JDK, we gated it behind a command line flag. 
This is the entirety of our patch:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mu\"><img src=\"https:\/\/miro.medium.com\/max\/1400\/1*A8MKwVMA0AcOPHjxOPZgLQ.png\" alt=\"The complete JDK patch, gated behind a command-line flag\" class=\"ce mm mn c\" width=\"700\" height=\"702\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"3b0f\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">And here are the results of running with superclass cache writes disabled:<\/p>\n<figure class=\"mh mi mj mk gx ml gl gm paragraph-image\">\n<div class=\"gl gm mg\"><img src=\"https:\/\/miro.medium.com\/max\/1100\/1*ZDIqyco2WvSib87nzGr_SA.png\" alt=\"Throughput and latency after disabling secondary superclass cache writes\" class=\"ce mm mn c\" width=\"692\" height=\"736\" loading=\"lazy\"\/><\/div>\n<\/figure>\n<p id=\"9377\" class=\"pw-post-body-paragraph kd ke jg kf b kg kh ki kj kk kl km kn ko kp kq kr ks kt ku kv kw kx ky kz la iz ga\">Our fix pushed the throughput to ~350 RPS at the same CPU autoscaling target of 55%. To put this in perspective, <strong class=\"kf jh\">that\u2019s a 3.5x improvement<\/strong> over the throughput we initially reached on m5.12xl, along with a reduction in both average and tail latency.<\/p>\n<p id=\"35a5\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">Disabling writes to the secondary superclass cache worked well in our case, and even though this might not be a desirable solution in all cases, we wanted to share our methodology, toolset and the fix in the hope that it might help others encountering similar symptoms.
While working through this problem, we came across <a class=\"au lc\" href=\"https:\/\/bugs.openjdk.org\/browse\/JDK-8180450\" rel=\"noopener ugc nofollow\" target=\"_blank\">JDK-8180450<\/a> \u2014 a bug that had been dormant for more than five years and that describes exactly the problem we were facing. It seems ironic that we couldn\u2019t find this bug until we actually figured out the answer. We believe our findings complement the great work that has been done in diagnosing and remediating it.<\/p>\n<p id=\"31e5\" class=\"pw-post-body-paragraph kd ke jg kf b kg mb ki kj kk mc km kn ko md kq kr ks me ku kv kw mf ky kz la iz ga\">We tend to think of modern JVMs as highly optimized runtime environments, in many cases rivaling more \u201cperformance-oriented\u201d languages like C++. While this holds true for the majority of workloads, we were reminded that the performance of certain workloads running within JVMs can be affected not only by the design and implementation of the application code, but also by the implementation of the JVM itself. In this blog post we described how we were able to leverage PMCs in order to find a bottleneck in the JVM\u2019s native code, patch it, and subsequently realize better than a threefold increase in throughput for the workload in question. When it comes to this class of performance issues, the ability to introspect execution at the level of CPU microarchitecture proved to be the only solution. Intel vTune provides valuable insight even with the core set of PMCs, such as those exposed by the m5.12xl instance type.
Exposing a more comprehensive set of PMCs, along with PEBS, across all instance types and sizes in the cloud environment would pave the way for deeper performance analysis and potentially even larger performance gains.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>By Vadim Filanovsky and Harshad Sane In one of our earlier blog posts, A Microscope on Microservices, we outlined three broad domains of observability (or \u201clevels of magnification,\u201d as we referred to them) \u2014 Fleet-wide, Microservice and Instance. We described the tools and techniques we use to gain insight within each domain. There is, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":15287,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37],"tags":[],"class_list":{"0":"post-15285","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-netflix"},"_links":{"self":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/15285","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/comments?post=15285"}],"version-history":[{"count":0,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/15285\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media\/15287"}],"wp:attachment":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media?parent=15285"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http
s:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/categories?post=15285"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/tags?post=15285"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}