
DevOps Data - practical data cases


Patterns for quick, good realizations

Data building, data transport, coding.

Building data: the data explosion. What has changed is the amount of data we are collecting by measuring processes as new information (at the edge).

📚 Information requests.
⚙ Measurements, monitoring.
🎭 Agility for changes?
⚖ Solution & performance acceptable?

🔰 Too fast .. previous.

Contents

Reference Topic Squad
Intro Data building, data transport, coding. 01.01
perftun concept Performance & Tuning - Software, Hardware (I). 02.01
perftun levels Performance & Tuning - Software, Hardware (II). 03.01
transform flow BI transformations - data lineage. 04.01
DI control I Data Integration - Control (I). 05.01
DI control II Data Integration - Control (II). 06.01
landzones Collect delivery (landing zones) - EDWH 3.0 07.01
trnsprt-1 Data transport - messaging (I). 08.01
trnsprt-2 Data transport - messaging (II). 09.01
smpl-SQL SQL code samples. 10.01
smpl-SAS SAS code samples. 11.01
What next Change data - Transformations. 12.00
Following steps 12.02

Progress


Performance & Tuning - Software, Hardware (I).

Solving performance problems requires an understanding of the operating system and the hardware. That architecture was set by von Neumann (see design-math).
vonNeumann_perftun01.jpg
A single CPU, limited internal memory, and external storage.
The access-time differences between those resources span orders of magnitude (a factor of 100-1000).

Optimizing is a balance between choosing the best algorithm and the effort needed to implement that algorithm.

vonNeumann_perftun02.jpg
That concept has not changed. The advances in hardware made it affordable to ignore tuning knowledge.

The Free Lunch Is Over.
A Fundamental Turn Toward Concurrency in Software, by Herb Sutter.
If you haven’t done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon, and identify how those places could benefit from concurrency. Now is also the time for you and your team to grok concurrent programming’s requirements, pitfalls, styles, and idioms.

An additional component: the connection from the machine (multiple CPUs, several banks of internal memory) to multiple external storage boxes over a network.

Performance & Tuning - Software, Hardware (II).

Perftun_EtL01.jpg
Tuning CPU - internal memory.
Minimize resource usage: ❗ The "balance line" algorithm (matching sorted inputs in a single pass) is the best. A DBMS will do that when possible.
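A minimal SQL sketch of that idea, with hypothetical names (stg_customer as the new extract, tgt_customer as the current target, both keyed on cust_id, row_hash as a comparison column): one full outer join on the key classifies every record, and a DBMS will usually execute it as a merge join when both inputs are sorted or indexed on that key.

    -- Classify each key in a single pass over the two inputs.
    SELECT COALESCE(s.cust_id, t.cust_id) AS cust_id,
           CASE
             WHEN t.cust_id IS NULL        THEN 'INSERT'    -- only in the new extract
             WHEN s.cust_id IS NULL        THEN 'DELETE'    -- only in the target
             WHEN s.row_hash <> t.row_hash THEN 'UPDATE'    -- present in both, content differs
             ELSE 'UNCHANGED'
           END AS action
    FROM   stg_customer s
    FULL   OUTER JOIN tgt_customer t
           ON s.cust_id = t.cust_id
    ORDER  BY cust_id;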

Perftun_EtL02.jpg
Network throughput.
Minimize delays, use parallelization:
⚠ The transport buffer size is a cooperation between the remote server and the local driver. The locally optimal buffer size can differ from the remote one. Resizing data in buffers is a cause of performance problems.

Perftun_EtL03.jpg
Minimize delays in the storage system.
⚠ With analytics usage, tuning IO is quite different from transactional DBMS usage.
💣 This different, non-standard approach must be in scope with service management. Sizing for capacity is better understood than striping for IO performance.

BI transformations - data lineage.

When doing data transformations, following the flow is important metadata. It is needed to understand the impact when changing something.
This is called "data lineage". sasdi_datalineage01.png
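As a minimal sketch of using lineage metadata for impact analysis, assume a hypothetical table lineage(source_table, target_table), one row per direct dependency as recorded by the ETL tool. A recursive query then lists everything downstream of a table about to change (WITH RECURSIVE is the SQL-standard / PostgreSQL form; some DBMSs drop the RECURSIVE keyword).

    -- Everything downstream of STG_CUSTOMER, assuming the lineage graph has no cycles.
    WITH RECURSIVE impacted (tbl, distance) AS (
        SELECT target_table, 1
        FROM   lineage
        WHERE  source_table = 'STG_CUSTOMER'       -- the table about to change
        UNION ALL
        SELECT l.target_table, i.distance + 1
        FROM   lineage l
        JOIN   impacted i ON l.source_table = i.tbl
    )
    SELECT tbl, MIN(distance) AS min_distance      -- nearest dependency first
    FROM   impacted
    GROUP  BY tbl
    ORDER  BY min_distance, tbl;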

Classic Business Intelligence reshapes all data into its own dedicated data models, fine-tuned for business reports.
The concepts of a transactional operational data design with normalization are followed; the result is a lot of transformations between tables.
BI_transforms01.jpg

Data Integration - Control (I).

For Extract / Load processing there are many tools, thanks to CWM (the Common Warehouse Metamodel specification). However, doing that in real life something is missing: control & monitoring.
DIcontrol_EtL01.jpg


DIcontrol_EtL01a.jpg Only when the question is asked is this gap, a pitfall, noticed. What is needed is control over which data is to be synchronized.
✅ "no data" - With a job restart.
✅ "all data" - No data at destination exist (initial clean).
✅ "all updates since last run" - All mutations since last run.

A control data set as input, holding the last-run timestamps and the timestamps of the data retrieved from identified records, turns this into known information that dynamic code can use.
The combination of timestamps and identifiers (records - elements) is specific domain knowledge.
Impossible to solve by CWM standards alone.
A user-generated initialisation simplifies the whole process.
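As a minimal sketch, here is what such a control data set and the selection it drives could look like in SQL. The names (etl_control, last_run_ts, src.customer, mut_ts) are assumptions for illustration, not a fixed standard: the initialisation reads the stored timestamp, and the extract selects only the mutations after it.

    -- Hypothetical control data set: one row per synchronized table.
    CREATE TABLE etl_control (
        table_name   VARCHAR(64) NOT NULL PRIMARY KEY,
        last_run_ts  TIMESTAMP   NOT NULL,           -- end of the previous successful run
        rows_moved   INTEGER     NOT NULL DEFAULT 0  -- result of that run
    );

    -- "all updates since last run": the initialisation reads last_run_ts
    -- and restricts the extract to records mutated after it.
    SELECT c.*
    FROM   src.customer c
    WHERE  c.mut_ts > (SELECT last_run_ts
                       FROM   etl_control
                       WHERE  table_name = 'CUSTOMER');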

Data Integration - Control (II).

There is a split between extracting data from an external resource and loading it onto a local machine. The goal of the first is optimizing extract performance; of the second, optimizing local load performance.
DIcontrol_EtL01b.jpg
Reading does not usually have performance issues at execution (elapsed time), but writing and updating surely do.

What is needed is control over which action should run.
✅ "all data" - Full optimized sorted load; combined with it, preparation and tuning for the later mutations (see the sketch after this list).
✅ "all updates since last run" - Mutations sorted for updates and additions; deletions to take note of.

The user-generated initialisation is a solution simplifying the whole process, using dynamic code with selection criteria.

Results are kept in the control dataset by a user-generated termination. This is the second part of simplifying the process.


The user-generated initialisation and termination use and maintain a control dataset. That control dataset holds specific domain knowledge.
Impossible to solve by CWM standards alone.
DIcontrol_EtL01c.jpg
The control dataset is dedicated to the tooling used.
With a DBMS for the EtL data transport the best choice is a local DBMS table. As it supports transaction-based updates, there is no problem using a single one for all data.
When the platform does not support transaction-based updates (SAS, Teradata, Hadoop), the best approach is using a control table for every table in an EtL process.

⚠ Doing regular full loads for analytics applications eliminates the need for deletes and for verifying out-of-sync records (CDC mismatch).
⚖ Having that control dataset, verifying growth and shrink expectations can add additional safety checks in operations.
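A minimal sketch of the mutation path and the user-generated termination, again with hypothetical names (stg_customer_delta holding the staged mutations): a MERGE applies updates and additions, and the termination writes the run result back into the control data set. MERGE is SQL:2003; the exact syntax differs per DBMS.

    -- Apply "all updates since last run": update existing keys, insert new ones.
    MERGE INTO tgt_customer t
    USING stg_customer_delta s
          ON (t.cust_id = s.cust_id)
    WHEN MATCHED THEN
         UPDATE SET t.name     = s.name,
                    t.city     = s.city,
                    t.row_hash = s.row_hash,
                    t.mut_ts   = s.mut_ts
    WHEN NOT MATCHED THEN
         INSERT (cust_id, name, city, row_hash, mut_ts)
         VALUES (s.cust_id, s.name, s.city, s.row_hash, s.mut_ts);

    -- Termination: record the result of this run in the control data set.
    UPDATE etl_control
    SET    last_run_ts = CURRENT_TIMESTAMP,
           rows_moved  = (SELECT COUNT(*) FROM stg_customer_delta)
    WHERE  table_name  = 'CUSTOMER';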


Collect delivery (landing zones) - EDWH 3.0

Transporting information is best done in a structured way.
The picture shows a design copied from physical logistics.
df_csd01.jpg The transport of data is covered in more detail in the messaging topics; the data flow goes through the zones shown in the picture.

Intra references.
Detailed descriptive information (opens a new page) can be found at:
👓 Logical Security segregation
👓 Structured Programming
👓 Performance & tuning


Data transport - messaging (I).

Sending, collecting and receiving data is like sending, collecting and receiving physical letters or parcels. The process steps with their checks and expectations are similar.

What the customer sends is received and collected by the processor. It is not always clear what the initial trigger is.
Transporting data has the same kind of agreements; there is no difference.
The area where data comes into the data processor's area must be defined well (security design, meta), avoiding collisions between customers.
This is the red vertical confidentiality line (left) at the data-collect side in EDWH 3.0.

post-drp1.jpg post-get1.jpg
Sending
There must be a defined point where you can drop the object to send.

Optional is track & trace: knowing where the object is at a given moment.

post-drp2.jpg post-get2b.jpg
Collecting
What is there to transport gets collected.


 
post-drv1.jpg
Transport
There are many ways to transport objects; your own truck is just one of them.

Imagine all managed file transfer software and API interfaces as options.

Data transport - messaging (II).


Transformations, processing & transport.
⚠ Transformations on objects are commonly not part of the deal. Only some limited repackaging or unbundling is optional.

Fast delivery service.
⚠ Fast delivery (express) is optional; it still follows agreements. The JIT lambda funnel is the EDWH 3.0 similarity.

post-drv2.jpg post-get1.jpg
Receiving, standard.
Either the object is dropped off without attention from the receiver,
or the receiver explicitly picks up the object(s) after being notified there is something waiting.

post-get2a.jpg post-get2b.jpg
Receiving with additional control.
When picking up objects, additional control ensures the correct ones are picked and no other objects.

Receiving objects has a lot of options: unexpectedly dropped off, fetched from a delivery point after a notice, a verified drop-off point, special handling.

What the EDWH 3.0 processor sends is received and collected by the customer. It is not always clear what the initial trigger is.
Transporting data has the same kind of agreements; there is no difference.
The area where data goes out of the data processor's area must be defined well (security design, meta), avoiding collisions between customers.
This is the red vertical confidentiality line (right) at the data-provide side in EDWH 3.0.

Capacity Considerations.
It is assumed every involved party is capable of handling all incoming requests.
When the collecting and sending areas of the EDWH 3.0 are the most limited ones, traffic planning is best done by managing this service.

SQL code samples.

Code source Description
--- XML processing XMLquery.
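As a hedged illustration of this XMLquery style of processing: the SQL/XML functions XMLQUERY and XMLTABLE (DB2, Oracle; details differ per DBMS) shred an XML document column into relational rows. The table customer_xml and its doc column are hypothetical.

    -- Hypothetical table customer_xml(doc XML), one XML document per row.
    SELECT x.name,
           x.city
    FROM   customer_xml c,
           XMLTABLE('$d/customers/customer'
                    PASSING c.doc AS "d"
                    COLUMNS
                        name VARCHAR(100) PATH 'name',
                        city VARCHAR(100) PATH 'address/city') AS x;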

SAS code samples.

Code source Description
--- XML processing XMLmap.

Old time, old mistakes

Change data - Transformations.


Forgotten history.
Performance & Tuning once upon a time was an important skill. The free lunch from hardware improvements led to ignoring those basics at implementation.
🚧 A system being overloaded by the same missed concept of data channels from machine to storage: experienced in the 80's (3380 mainframe) and again in the 10's (mounted rack, on-premise service).
🚧 Systems going into thrashing behaviour, not recognized in the 10's. Not experienced as problematic, because at some point the desired processing does finish. Avoiding thrashing by using queues per load type and implementing load balancing were basic skills on mainframes in the 80's.

Data - normalized systems.
..



Following steps

Missing link

These are high-level considerations.

Describing the data is data about data. 👓 MetaData.

What is not here: 👓 data & modelling concepts.




© 2012,2019 J.A.Karman