{"id":98162,"date":"2023-04-27T21:15:26","date_gmt":"2023-04-27T21:15:26","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2023\/04\/27\/improved-alerting-with-atlas-streaming-eval-by-netflix-technology-blog-apr-2023\/"},"modified":"2023-04-27T21:15:27","modified_gmt":"2023-04-27T21:15:27","slug":"improved-alerting-with-atlas-streaming-eval-by-netflix-technology-blog-apr-2023","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2023\/04\/27\/improved-alerting-with-atlas-streaming-eval-by-netflix-technology-blog-apr-2023\/","title":{"rendered":"Improved Alerting with Atlas Streaming Eval | by Netflix Technology Blog | Apr, 2023"},"content":{"rendered":"<div>\n<p id=\"df80\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\"><a class=\"af jv\" href=\"https:\/\/www.linkedin.com\/in\/ruchir-jha-9a861616\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">Ruchir Jha<\/a>, <a class=\"af jv\" href=\"https:\/\/www.linkedin.com\/in\/brharrington\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">Brian Harrington<\/a>, <a class=\"af jv\" href=\"https:\/\/www.linkedin.com\/in\/yingwu-zhao-62037418\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">Yingwu Zhao<\/a><\/p>\n<p id=\"cd38\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">TL;DR<\/p>\n<ul class=\"\">\n<li id=\"e583\" class=\"jw jx io iz b ja jb je jf ji jy jm jz jq ka ju kb kc kd ke bj\">Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.<\/li>\n<li id=\"26a2\" class=\"jw jx io iz b ja kf je kg ji it jm iu jq iv ju kb kc kd ke bj\">It allows us to overcome high dimensionality\/cardinality limitations of the time-series database.<\/li>\n<li id=\"d38f\" class=\"jw jx io iz b ja kf je kg ji it jm iu jq iv ju kb kc kd ke bj\">It opens doors to support more exciting 
use-cases.<\/li>\n<\/ul>\n<figure class=\"ki kj kk kl gu km gi gj paragraph-image\">\n<div role=\"button\" tabindex=\"0\" class=\"kn ko dj kp bg kq\">\n<div class=\"gi gj kh\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*OEqSEUO7E-XcFzjkvgKtxA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" type=\"image\/webp\"\/><source data-testid=\"og\" srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*OEqSEUO7E-XcFzjkvgKtxA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*OEqSEUO7E-XcFzjkvgKtxA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*OEqSEUO7E-XcFzjkvgKtxA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*OEqSEUO7E-XcFzjkvgKtxA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*OEqSEUO7E-XcFzjkvgKtxA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*OEqSEUO7E-XcFzjkvgKtxA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*OEqSEUO7E-XcFzjkvgKtxA.png 
1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"\/><img alt=\"\" class=\"bg kr ks c\" width=\"700\" height=\"328\" loading=\"lazy\" role=\"presentation\"\/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"5990\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Engineers want their alerting system to be realtime, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad but false negatives are the absolute worst!<\/p>\n<p id=\"ee02\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">A few years ago, we were paged by our SRE team due to our Metrics Alerting System falling behind \u2014 critical application health alerts reached engineers 45 minutes late! As we investigated the alerting delay, we found that the number of configured alerts had recently increased dramatically, by 5 times! The alerting system queried <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\" rel=\"noopener ugc nofollow\" target=\"_blank\">Atlas<\/a>, our time-series database, on a cron for each configured alert query, and was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. 
On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.<\/p>\n<p id=\"b72f\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns around the scalability of our alerting system. We also heard from other platform teams at Netflix who wanted to build similar automation for their users who, given our state at the time, wouldn\u2019t have been able to do so without impacting Mean Time To Detect (MTTD) for all others. Rather, we were looking at an order of magnitude increase in the number of alert queries just over the next 6 months!<\/p>\n<p id=\"8dbd\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Since querying Atlas was the bottleneck, our first instinct was to scale it up to meet the increased alert query demand; however, we soon realized that would increase Atlas cost prohibitively. Atlas is an in-memory time-series database that ingests multiple billions of time-series per day and retains the last two weeks of data. It is already one of the largest services at Netflix both in size and cost. 
While Atlas <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/atlas-docs\/overview\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">is architected<\/a> around compute &amp; storage separation, and we could theoretically just scale the query layer to meet the increased query demand, every query, regardless of its type, has a data component that needs to be pushed down to the storage layer. To serve the increasing number of push-down queries, the in-memory storage layer would need to scale up as well, and it became clear that this would push the already expensive storage costs far higher. Moreover, common database optimizations like caching recently queried data don\u2019t really work for alerting queries because, generally speaking, the last received datapoint is required for correctness. Take, for example, this alert query that checks if errors as a % of total RPS exceeds a threshold of 50% for 4 out of the last 5 minutes:<\/p>\n<pre class=\"ki kj kk kl gu kt ku kv bo kw kx ky\"><span id=\"e972\" class=\"kz la io ku b bf lb lc l ld le\">name,errors,:eq,:sum,<br\/>name,rps,:eq,:sum,<br\/>:div,<br\/>100,:mul,<br\/>50,:gt,<br\/>5,:rolling-count,4,:gt,<\/span><\/pre>\n<p id=\"a552\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">If the datapoint received for the last time interval leads to a positive evaluation for this query, relying on stale\/cached data would either increase MTTD or result in the perception of a false negative, at least until the missing data is fetched and evaluated. It became clear to us that we needed to solve the scalability problem with a fundamentally different approach. 
Hence, we started down the path of alert evaluation via real-time <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/tree\/main\/atlas-eval\" rel=\"noopener ugc nofollow\" target=\"_blank\">streaming metrics<\/a>.<\/p>\n<p id=\"302d\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\"><strong class=\"iz ip\">High Level Architecture<\/strong><\/p>\n<p id=\"9685\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">The idea, at a high level, was to avoid the need to query the Atlas database almost entirely and transition most alert queries to streaming evaluation.<\/p>\n<figure class=\"ki kj kk kl gu km gi gj paragraph-image\">\n<div role=\"button\" tabindex=\"0\" class=\"kn ko dj kp bg kq\">\n<div class=\"gi gj lf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*xXDdfWrwCR5giLHGAkV1-g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 
700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" type=\"image\/webp\"\/><source data-testid=\"og\" srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*xXDdfWrwCR5giLHGAkV1-g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*xXDdfWrwCR5giLHGAkV1-g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*xXDdfWrwCR5giLHGAkV1-g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*xXDdfWrwCR5giLHGAkV1-g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*xXDdfWrwCR5giLHGAkV1-g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*xXDdfWrwCR5giLHGAkV1-g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*xXDdfWrwCR5giLHGAkV1-g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"\/><img alt=\"\" class=\"bg kr ks c\" width=\"700\" height=\"153\" loading=\"lazy\" role=\"presentation\"\/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"425b\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Alert queries are submitted either via our Alerting UI or by API clients, and are then saved to a custom config database that supports streaming config updates (full snapshot + update notifications). 
The Alerting Service receives these config updates and hashes every new or updated alert query for evaluation to one of its nodes by leveraging <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/edda\/rest-api\/#apiv2group\" rel=\"noopener ugc nofollow\" target=\"_blank\">Edda Slots<\/a>. The node responsible for evaluating a query starts by breaking it down into a set of \u201cdata expressions\u201d and subscribes with them to an upstream \u201cbroker\u201d service. Data expressions define what data needs to be sourced in order to evaluate a query. For the example query listed above, the data expressions are name,errors,:eq,:sum and name,rps,:eq,:sum. The broker service acts as a subscription manager that maps a data expression to a set of subscriptions. In addition, it maintains a Query Index of all active data expressions, which is consulted to discern if an incoming datapoint is of interest to an active subscriber. The internals here are outside the scope of this blog post.<\/p>\n<p id=\"572a\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Next, the Alerting service (via the <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/tree\/main\/atlas-eval\" rel=\"noopener ugc nofollow\" target=\"_blank\">atlas-eval<\/a> library) maps the received data points for a data expression to the alert query that needs them. For alert queries that resolve to more than one data expression, we align the incoming data points for each one of those data expressions on the same time boundary before emitting the accumulated values to the final eval step. 
For the example above, the final eval step would be responsible for computing the ratio and maintaining the <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/atlas-docs\/asl\/ref\/rolling-count\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">rolling-count<\/a>, which keeps track of the number of intervals in which the ratio crossed the threshold, as shown below:<\/p>\n<figure class=\"ki kj kk kl gu km gi gj paragraph-image\">\n<div role=\"button\" tabindex=\"0\" class=\"kn ko dj kp bg kq\">\n<div class=\"gi gj lg\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*xa7hplm2emUmJIEl8al32A.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" type=\"image\/webp\"\/><source data-testid=\"og\" srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*xa7hplm2emUmJIEl8al32A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*xa7hplm2emUmJIEl8al32A.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*xa7hplm2emUmJIEl8al32A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*xa7hplm2emUmJIEl8al32A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*xa7hplm2emUmJIEl8al32A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*xa7hplm2emUmJIEl8al32A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*xa7hplm2emUmJIEl8al32A.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"\/><img alt=\"\" class=\"bg kr ks c\" width=\"700\" height=\"252\" loading=\"lazy\" role=\"presentation\"\/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"9e47\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">The atlas-eval library supports streaming evaluation for most if not all <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/wiki\/Reference-query\" rel=\"noopener ugc nofollow\" target=\"_blank\">Query<\/a>, <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/wiki\/Reference-data\" rel=\"noopener ugc nofollow\" target=\"_blank\">Data<\/a>, <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/wiki\/Reference-math\" rel=\"noopener ugc nofollow\" target=\"_blank\">Math<\/a> and <a class=\"af jv\" href=\"https:\/\/github.com\/Netflix\/atlas\/wiki\/Reference-stateful\" rel=\"noopener ugc nofollow\" target=\"_blank\">Stateful<\/a> operators supported by Atlas today. 
Certain operators such as <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/atlas-docs\/asl\/ref\/offset\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">offset<\/a>, <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/atlas-docs\/asl\/ref\/integral\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">integral<\/a>, <a class=\"af jv\" href=\"https:\/\/netflix.github.io\/atlas-docs\/asl\/ref\/des\/\" rel=\"noopener ugc nofollow\" target=\"_blank\">des<\/a> are not supported on the streaming path.<\/p>\n<p id=\"c712\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\"><strong class=\"iz ip\">OK, Results?<\/strong><\/p>\n<p id=\"45a2\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">First and foremost, we have successfully alleviated our initial scalability problem with the polling-based architecture. Today, we run 20X the number of queries we used to run a few years ago, with ease and at a fraction of what it would have cost to scale up the Atlas storage layer to serve the same volume. Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without having to worry about impacting other users of the system. We are able to maintain strong SLAs around Mean Time To Detect (MTTD) regardless of the number of alerts being evaluated by the system.<\/p>\n<p id=\"b313\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Additionally, streaming evaluation allowed us to relax restrictions around high cardinality that our users were previously running into \u2014 alert queries that were rejected by Atlas Backend before due to cardinality constraints are now getting checked correctly on the streaming path. 
In addition, we&#8217;re able to use Atlas Streaming to monitor and alert on some very high cardinality use-cases, such as metrics derived from free-form log data.<\/p>\n<p id=\"e09e\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Finally, we switched <a class=\"af jv\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/telltale-netflix-application-monitoring-simplified-5c08bfa780ba\">Telltale<\/a>, our holistic application health monitoring system, from polling a metrics cache to using realtime Atlas Streaming. The fundamental idea behind Telltale is to detect anomalies on SLI metrics (for example, latency, error rates, etc). When such anomalies are detected, Telltale is able to compute correlations with similar metrics emitted from either upstream or downstream services. In addition, it computes correlations between SLI metrics and custom metrics like the log-derived metrics mentioned above. This has proven to be valuable towards reducing Mean Time to Recover (MTTR). 
For example, we&#8217;re now able to correlate increased error rates with an increased rate of specific exceptions occurring in logs and even point to an exemplar stacktrace, as shown below:<\/p>\n<figure class=\"ki kj kk kl gu km gi gj paragraph-image\">\n<div role=\"button\" tabindex=\"0\" class=\"kn ko dj kp bg kq\">\n<div class=\"gi gj lh\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*To1maD6fTk8o_Aeuyj2o9g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" type=\"image\/webp\"\/><source data-testid=\"og\" srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*To1maD6fTk8o_Aeuyj2o9g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*To1maD6fTk8o_Aeuyj2o9g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*To1maD6fTk8o_Aeuyj2o9g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*To1maD6fTk8o_Aeuyj2o9g.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*To1maD6fTk8o_Aeuyj2o9g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*To1maD6fTk8o_Aeuyj2o9g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*To1maD6fTk8o_Aeuyj2o9g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"\/><img alt=\"\" class=\"bg kr ks c\" width=\"700\" height=\"525\" loading=\"lazy\" role=\"presentation\"\/><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"249d\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Our logs pipeline fingerprints every log message and attaches a (very high cardinality) fingerprint tag to a log events counter that&#8217;s then emitted to Atlas Streaming. Telltale consumes this metric in a streaming fashion to identify fingerprints that correlate with anomalies seen in SLI metrics. Once an anomaly is found, we query the logs backend with the fingerprint hash to obtain the exemplar stacktrace. What\u2019s more, we are now able to identify correlated anomalies (and exceptions) occurring in services that may be N hops away from the affected service. A system like Telltale becomes more effective as more services are onboarded (and for that matter the full service graph), because otherwise it becomes difficult to root cause the problem, especially in a microservices-based architecture. 
A few years ago, as noted in this <a class=\"af jv\" rel=\"noopener ugc nofollow\" target=\"_blank\" href=\"https:\/\/netflixtechblog.com\/telltale-netflix-application-monitoring-simplified-5c08bfa780ba\">blog<\/a>, only a few hundred services were using Telltale; thanks to Atlas Streaming we have now managed to onboard thousands of other services at Netflix.<\/p>\n<p id=\"110e\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Finally, we realized that once you remove limits on the number of monitored queries, and start supporting much higher metric dimensionality\/cardinality without impacting the cost\/performance profile of the system, it opens doors to many exciting new possibilities. For example, to make alerts more actionable, we may now be able to compute correlations between SLI anomalies and custom metrics with high cardinality dimensions. An alert on elevated HTTP error rates, for instance, may be able to point to impacted customer cohorts by linking to precisely correlated exemplars. This would help developers with reproducibility.<\/p>\n<p id=\"9d68\" class=\"pw-post-body-paragraph ix iy io iz b ja jb jc jd je jf jg jh ji jj jk jl jm jn jo jp jq jr js jt ju ih bj\">Transitioning to the streaming path has been a long journey for us. One of the challenges was the difficulty of debugging scenarios where the streaming path didn\u2019t agree with what&#8217;s returned by querying the Atlas database. This is especially true when either the data isn&#8217;t available in Atlas or the query isn&#8217;t supported because of (say) cardinality constraints. This is one of the reasons it has taken us years to get here. 
That said, early signs indicate that the streaming paradigm may help with tackling a cardinal problem in observability \u2014 effective correlation between the metrics &amp; events verticals (logs, and potentially traces in the future), and we&#8217;re excited to explore the opportunities that this presents for Observability in general.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Ruchir Jha, Brian Harrington, Yingwu Zhao TL;DR Streaming alert evaluation scales much better than the traditional approach of polling time-series databases. It allows us to overcome high dimensionality\/cardinality limitations of the time-series database. It opens doors to support more exciting use-cases. Engineers want their alerting system to be realtime, reliable, and actionable. While [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":98164,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37],"tags":[],"class_list":{"0":"post-98162","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-netflix"},"_links":{"self":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/98162","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/comments?post=98162"}],"version-history":[{"count":0,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/98162\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media\/98164"}],"wp:attachment":[{"
href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media?parent=98162"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/categories?post=98162"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/tags?post=98162"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}