{"id":113687,"date":"2023-11-16T01:49:17","date_gmt":"2023-11-16T01:49:17","guid":{"rendered":"https:\/\/showbizztoday.com\/index.php\/2023\/11\/16\/diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-by-netflix-technology-blog-nov-2023\/"},"modified":"2023-11-16T01:49:17","modified_gmt":"2023-11-16T01:49:17","slug":"diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-by-netflix-technology-blog-nov-2023","status":"publish","type":"post","link":"https:\/\/showbizztoday.com\/index.php\/2023\/11\/16\/diving-deeper-into-psyberg-stateless-vs-stateful-data-processing-by-netflix-technology-blog-nov-2023\/","title":{"rendered":"Diving Deeper into Psyberg: Stateless vs Stateful Data Processing | by Netflix Technology Blog | Nov, 2023"},"content":{"rendered":"<div>\n<p id=\"8fe2\" class=\"pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj\">Let\u2019s use the signup fact table as an example here. 
This table\u2019s workflow runs hourly, with the main input source being an Iceberg table storing all raw signup events partitioned by landing date, hour, and batch id.<\/p>\n<p id=\"87aa\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Here\u2019s a YAML snippet outlining the configuration for this during the Psyberg initialization step:<\/p>\n<pre class=\"pr ps pt pu pv qk ql qm bo qn ba bj\"><span id=\"b10d\" class=\"qo ob gr ql b bf qp qq l qr qs\">- job:<br\/>    id: psyberg_session_init<br\/>    kind: Spark<br\/>    spark:<br\/>      app_args:<br\/>        - --process_name=signup_fact_load<br\/>        - --src_tables=raw_signups<br\/>        - --psyberg_session_id=20230914061001<br\/>        - --psyberg_hwm_table=high_water_mark_table<br\/>        - --psyberg_session_table=psyberg_session_metadata<br\/>        - --etl_pattern_id=1<\/span><\/pre>\n<p id=\"a6b0\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Behind the scenes, Psyberg identifies that this pipeline is configured for a stateless pattern since <strong class=\"nc gs\">etl_pattern_id=1<\/strong>.<\/p>\n<p id=\"2420\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Psyberg also uses the provided inputs to detect the Iceberg snapshots that persisted after the latest high watermark available in the watermark table. 
Using the <strong class=\"nc gs\">summary column in snapshot metadata<\/strong> [see the <a class=\"af ny\" href=\"https:\/\/netflixtechblog.medium.com\/f68830617dd1\" rel=\"noopener\" target=\"_blank\">Iceberg Metadata section in post 1<\/a> for more details], we parse out the partition information for each Iceberg snapshot of the source table.<\/p>\n<p id=\"c24b\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Psyberg then retains these processing URIs (an array of JSON strings containing combinations of landing date, hour, and batch IDs) as determined by the snapshot changes. This information and other calculated metadata are stored in the <strong class=\"nc gs\">psyberg_session_f<\/strong> table. This stored data is then available for the subsequent <strong class=\"nc gs\">LOAD.FACT_TABLE<\/strong> job in the workflow to use, and for analysis and debugging purposes.<\/p>\n<p id=\"4bf1\" class=\"pw-post-body-paragraph na nb gr nc b nd oy nf ng nh oz nj nk nl pa nn no np pb nr ns nt pc nv nw nx gk bj\">Stateful Data Processing is used when the output depends on a sequence of events across multiple input streams.<\/p>\n<p id=\"c06e\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Let\u2019s consider the example of creating a cancel fact table, which takes the following as input:<\/p>\n<ol class=\"\">\n<li id=\"2910\" class=\"na nb gr nc b nd ne nf ng nh ni nj nk nl pd nn no np pe nr ns nt pf nv nw nx pg ph pi bj\"><strong class=\"nc gs\">Raw cancellation events<\/strong> indicating when the customer account was canceled<\/li>\n<li id=\"7c9a\" class=\"na nb gr nc b nd pj nf ng nh pk nj nk nl pl nn no np pm nr ns nt pn nv nw nx pg ph pi bj\">A fact table that stores incoming <strong class=\"nc gs\">customer requests<\/strong> to cancel their subscription at the end of the billing 
period<\/li>\n<\/ol>\n<p id=\"0ab9\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">These inputs help derive additional stateful analytical attributes like the type of churn (voluntary, involuntary, etc.).<\/p>\n<p id=\"9e44\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">The initialization step for Stateful Data Processing differs slightly from Stateless. Psyberg offers additional configurations according to the pipeline needs. Here\u2019s a YAML snippet outlining the configuration for the cancel fact table during the Psyberg initialization step:<\/p>\n<pre class=\"pr ps pt pu pv qk ql qm bo qn ba bj\"><span id=\"fbdf\" class=\"qo ob gr ql b bf qp qq l qr qs\">- job:<br\/>    id: psyberg_session_init<br\/>    kind: Spark<br\/>    spark:<br\/>      app_args:<br\/>        - --process_name=cancel_fact_load<br\/>        - --src_tables=raw_cancels|processing_ts,cancel_request_fact<br\/>        - --psyberg_session_id=20230914061501<br\/>        - --psyberg_hwm_table=high_water_mark_table<br\/>        - --psyberg_session_table=psyberg_session_metadata<br\/>        - --etl_pattern_id=2<\/span><\/pre>\n<p id=\"c150\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Behind the scenes, Psyberg identifies that this pipeline is configured for a stateful pattern since <strong class=\"nc gs\">etl_pattern_id<\/strong> is 2.<\/p>\n<p id=\"34b4\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Notice the additional detail in the src_tables list corresponding to raw_cancels above. The <strong class=\"nc gs\">processing_ts<\/strong> here represents the event processing timestamp, which is different from the regular Iceberg snapshot commit timestamp, i.e. 
<strong class=\"nc gs\">event_landing_ts<\/strong> as described in <a class=\"af ny\" href=\"https:\/\/netflixtechblog.medium.com\/f68830617dd1\" rel=\"noopener\" target=\"_blank\">part 1<\/a> of this series.<\/p>\n<p id=\"f22c\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">It is important to capture the range of a consolidated batch of events from all the sources, i.e. both raw_cancels and cancel_request_fact, while factoring in late-arriving events. Changes to the source table snapshots can be tracked using different timestamp fields. Knowing which timestamp field to use, i.e. <strong class=\"nc gs\">event_landing_ts<\/strong> or something like <strong class=\"nc gs\">processing_ts<\/strong>, helps avoid missing events.<\/p>\n<p id=\"e35d\" class=\"pw-post-body-paragraph na nb gr nc b nd ne nf ng nh ni nj nk nl nm nn no np nq nr ns nt nu nv nw nx gk bj\">Similar to the approach in stateless data processing, Psyberg uses the provided inputs to parse out the partition information for each Iceberg snapshot of the source table.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Let\u2019s use the signup fact table as an example here. This table\u2019s workflow runs hourly, with the main input source being an Iceberg table storing all raw signup events partitioned by landing date, hour, and batch id. 
Here\u2019s a YAML snippet outlining the configuration for this during the Psyberg initialization step: - job: id: psyberg_session_init kind: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":113689,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37],"tags":[],"class_list":{"0":"post-113687","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-netflix"},"_links":{"self":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/113687","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/comments?post=113687"}],"version-history":[{"count":0,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/posts\/113687\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media\/113689"}],"wp:attachment":[{"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/media?parent=113687"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/categories?post=113687"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showbizztoday.com\/index.php\/wp-json\/wp\/v2\/tags?post=113687"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}