A Deep Dive into Psyberg: Stateless vs. Stateful Data Processing | Netflix Technology Blog
Let’s use the signup fact table as an example here. The workflow for this table runs hourly, and its main input source is an Iceberg table that stores all raw signup events, partitioned by landing date, hour, and batch ID.
Here is a YAML snippet outlining the settings in Psyberg’s initialization step:
- job:
    id: psyberg_session_init
    type: Spark
    spark:
      app_args:
        - --process_name=signup_fact_load
        - --src_tables=raw_signups
        - --psyberg_session_id=20230914061001
        - --psyberg_hwm_table=high_water_mark_table
        - --psyberg_session_table=psyberg_session_metadata
        - --etl_pattern_id=1
Behind the scenes, Psyberg identifies that the pipeline is configured for stateless mode because etl_pattern_id=1.
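As a rough illustration of how an initialization step could read these arguments and pick a mode, here is a minimal Python sketch. The argument names and the meaning of etl_pattern_id (1 for stateless, 2 for stateful) come from this post; the function name parse_init_args and the use of argparse are assumptions made for illustration, not Psyberg's actual code.

```python
import argparse

# Mapping described in this post: 1 selects stateless mode, 2 selects stateful mode.
ETL_PATTERNS = {1: "stateless", 2: "stateful"}


def parse_init_args(app_args):
    """Parse the app_args list from the YAML snippet (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--process_name", required=True)
    parser.add_argument("--src_tables", required=True)
    parser.add_argument("--psyberg_session_id", required=True)
    parser.add_argument("--psyberg_hwm_table", required=True)
    parser.add_argument("--psyberg_session_table", required=True)
    parser.add_argument("--etl_pattern_id", type=int, required=True)
    args = parser.parse_args(app_args)

    # Reject pattern ids this sketch does not know about.
    if args.etl_pattern_id not in ETL_PATTERNS:
        raise ValueError(f"unknown etl_pattern_id: {args.etl_pattern_id}")
    return args, ETL_PATTERNS[args.etl_pattern_id]


if __name__ == "__main__":
    args, mode = parse_init_args([
        "--process_name=signup_fact_load",
        "--src_tables=raw_signups",
        "--psyberg_session_id=20230914061001",
        "--psyberg_hwm_table=high_water_mark_table",
        "--psyberg_session_table=psyberg_session_metadata",
        "--etl_pattern_id=1",
    ])
    print(args.process_name, mode)
```

Keeping the mode lookup in one small table means an unsupported pattern id fails at initialization rather than partway through a load.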
Psyberg also uses the provided inputs to detect the Iceberg snapshots committed after the latest high watermark available in the watermark table. Using the summary column in the snapshot metadata [see the Iceberg Metadata section in post 1 for more details], we parse out the partition information for each Iceberg snapshot of the source table.
Psyberg then retains these processing URIs (an array of JSON strings containing combinations of landing date, hour, and batch ID) as determined by the snapshot changes. This information and other calculated metadata are stored in the psyberg_session_f table. The stored data is then available to the subsequent load fact table job in the workflow, and for analysis and debugging purposes.
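The snapshot detection described above can be sketched in plain Python. This is only an illustration under stated assumptions: the snapshot records are simplified dictionaries with hypothetical keys (committed_at_ms, partitions) standing in for Iceberg snapshot metadata, and the function name partitions_after_hwm is invented for this example. It does not reproduce how Psyberg reads the summary column.

```python
import json


def partitions_after_hwm(snapshots, high_watermark_ms):
    """Collect partition URIs from snapshots committed after the high watermark.

    `snapshots` is a list of dicts with hypothetical keys:
      - "committed_at_ms": snapshot commit time in epoch milliseconds
      - "partitions": list of dicts with landing_date, hour, and batch_id
    Returns (sorted list of JSON strings, new high watermark).
    """
    uris = set()
    new_hwm = high_watermark_ms
    for snap in snapshots:
        # Skip snapshots already covered by the previous run.
        if snap["committed_at_ms"] <= high_watermark_ms:
            continue
        new_hwm = max(new_hwm, snap["committed_at_ms"])
        for part in snap["partitions"]:
            # Serialize with sorted keys so identical partitions deduplicate.
            uris.add(json.dumps(part, sort_keys=True))
    return sorted(uris), new_hwm


if __name__ == "__main__":
    sample = [
        {"committed_at_ms": 100,
         "partitions": [{"landing_date": "20230914", "hour": 4, "batch_id": 1}]},
        {"committed_at_ms": 200,
         "partitions": [{"landing_date": "20230914", "hour": 5, "batch_id": 1}]},
    ]
    uris, hwm = partitions_after_hwm(sample, high_watermark_ms=100)
    print(uris, hwm)
```

Deduplicating the serialized partitions means a partition touched by several snapshots is only queued for processing once.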
Stateful data processing is used when the output depends on a sequence of events across one or more input streams.
Let’s consider the example of building a cancel fact table, which takes the following as input:
- Raw cancellation events indicating when the customer account was canceled
- A fact table that stores incoming customer requests to cancel their subscription at the end of the billing period
These inputs help derive additional stateful analytical attributes, such as the type of churn (voluntary or involuntary).
The initialization step for stateful data processing differs slightly from the stateless one. Psyberg offers additional configurations based on pipeline needs. Here is a YAML snippet outlining the configuration for the cancel fact table in Psyberg’s initialization step:
- job:
    id: psyberg_session_init
    type: Spark
    spark:
      app_args:
        - --process_name=cancel_fact_load
        - --src_tables=raw_cancels|processing_ts,cancel_request_fact
        - --psyberg_session_id=20230914061501
        - --psyberg_hwm_table=high_water_mark_table
        - --psyberg_session_table=psyberg_session_metadata
        - --etl_pattern_id=2
Behind the scenes, Psyberg identifies that the pipeline is configured for stateful mode because etl_pattern_id is 2.
Note the additional detail in the src_tables list corresponding to raw_cancels above. The processing_ts here represents the event processing timestamp, which is different from the regular Iceberg snapshot commit timestamp, i.e. event_landing_ts, as described in part 1 of this series.
It is important to capture the range of a consolidated batch of events from all sources (i.e. both raw_cancels and cancel_request_fact) while factoring in late-arriving events. Changes to the source table snapshots can be tracked using different timestamp fields. Knowing which timestamp field to use, i.e. event_landing_ts or something like processing_ts, helps avoid missing events.
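To make the src_tables format concrete, here is a minimal Python sketch that splits the value into per-table tracking fields. The `table|timestamp_field` syntax is taken from the YAML above. The function name parse_src_tables is hypothetical, and falling back to event_landing_ts when no field is given is an assumption based on the description of it as the regular snapshot commit timestamp.

```python
# Assumed default: the regular Iceberg snapshot commit timestamp named in the text.
DEFAULT_TS_FIELD = "event_landing_ts"


def parse_src_tables(spec, default_ts_field=DEFAULT_TS_FIELD):
    """Map each source table to the timestamp field used to track its changes.

    `spec` is a comma-separated list where each entry is either `table`
    or `table|timestamp_field` (hypothetical parser for illustration).
    """
    tables = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, ts_field = entry.partition("|")
        ts_field = ts_field.strip()
        # Use the explicit field when one is given, otherwise the default.
        tables[name.strip()] = ts_field if sep and ts_field else default_ts_field
    return tables


if __name__ == "__main__":
    print(parse_src_tables("raw_cancels|processing_ts,cancel_request_fact"))
```

With the stateful configuration above, raw_cancels would be tracked by processing_ts while cancel_request_fact would use the default field.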
Similar to the approach in stateless data processing, Psyberg uses the provided input to parse the partition information for each Iceberg snapshot of the source table.