Devops Data - practical data cases
Patterns for quick good realizations
Data building, data transport, coding.
The data explosion. The change is the amount of data we are collecting: measurements of processes become new information (edge).
📚 Information requests.
⚙ measurements monitoring.
🎭 Agility for changes?
⚖ solution & performance acceptable?
| Reference || Topic || Squad |
| Intro ||Data building, data transport, coding. ||01.01 |
| perftun hardware ||Performance & Tuning - Software, Hardware. ||02.01 |
| 👓 Performance OS ||Performance OS level ||02.02 |
| DI control ||Data Integration - Control & performance. ||03.01 |
| 👓 EL constructs ||ETL constructs & performance ||03.01 |
| transform flow ||Transformations - data lineage. ||04.01 |
| 👓 Data flow ||Data lineage & xml, json ||04.01 |
| 👓 NLS ||National language support ||04.02 |
| schedule flow ||Scheduling, planning operations. ||05.01 |
| 👓 Scheduling ||Scheduling ||05.01 |
| What next ||Change data - Transformations. ||06.00 |
| ||Following steps ||06.02 |
- 2020 week 05
- Splitting page into more logical paragraphs.
- 2019 week 20
- Page getting filled.
- New content, gathering old samples.
Duality service requests
From the organisation (bpm) there are two goals behind the questions for solutions and improvements:
- the core business process (sdlc)
- reviewing and governing the business process (bianl)
Solutions for those goals are different, although some tools could be the same.
Performance & Tuning - Software, Hardware.
Solving performance problems requires understanding of the operating system and hardware.
That architecture was set by von Neumann (see design-math).
basic resource cooperation & dependencies
A single CPU, limited Internal Memory and the external storage.
The time differences between those resources are orders of magnitude (factor 100-1000).
Optimizing is balancing between choosing the best algorithm and the effort to achieve that algorithm.
new resources & dependencies
That concept didn't change that much.
Neglecting performance questions used to be justified by advances in hardware, so the knowledge of tuning processes was ignored. Those days are gone.
The Free Lunch Is Over
A Fundamental Turn Toward Concurrency in Software,
By Herb Sutter. (2009)
If you haven't done so already, now is the time to take a hard look at the design of your application, determine what operations are CPU-sensitive now or are likely to become so soon,
and identify how those places could benefit from concurrency. Now is also the time for you and your team to grok concurrent programming's requirements, pitfalls, styles, and idioms.
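A minimal sketch of what "benefit from concurrency" can look like: a CPU-sensitive operation is split into independent chunks that are handed to a pool of workers. The function names, the chunk size and the number of workers are assumptions for illustration only. In CPython a thread pool does not speed up pure CPU work; ProcessPoolExecutor offers the same interface for that case.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The CPU-sensitive operation, applied to one independent chunk."""
    return sum(x * x for x in chunk)


def concurrent_sum_of_squares(items, workers=4, chunk_size=1000):
    """Run the operation on all chunks concurrently and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunked(items, chunk_size))
        return sum(partials)


data = list(range(10_000))
total = concurrent_sum_of_squares(data)
```

The design point is that the chunks share no state, so the only coordination needed is combining the partial results at the end.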
Additional components are the connections from a machine, with multiple CPUs and several banks of internal memory, to multiple external storage boxes by a network.
Storage in a network can be a SAN (Storage Area Network) or a NAS (Network Attached Storage). They are different in behaviour and performance.
There is a belief that this is not a business issue and is purely technical. That is a wrong assumption.
When asking for BCM (ENISA: business continuity management), a part of risk management:
Business Continuity is the term applied to the series of management processes and integrated plans that maintain the continuity of the critical processes of an organisation,
should a disruptive event take place which impacts the ability of the organisation to continue to provide its key services.
ICT systems and electronic data are crucial components of the processes and their protection and timely return is of paramount importance.
Business applications and their performance are the other reason to do this, with good metrics and led by the business organisation.
Data Integration - Control & performance.
Extract Transform Load (ETL) is the old classic way for a dedicated data warehouse whose only goal is delivering reports (dashboards).
A more practical approach is Extract Load, used just for a segregation of hardware resources.
Performance Data processing
Performance is impacted by:
- The use of keys and indexes; better not to have those.
- The order of sorting. For bulk processing, presorted data works best. Transactional applications are better off with a random spread.
- Limited physical sizing. Saving all history in a single space will have a negative impact. There are more reasons to split spaces by historical values.
- Settings at the OS level for I/O file system access and application cache usage.
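The first two points can be sketched with the sqlite3 module from the Python standard library. This is one reading of the list, not a benchmark: the rows are presorted, loaded in a single transaction into a table without indexes, and an index is only built afterwards if queries need it. The table and column names are made up for the example.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load presorted rows into a table without indexes, then index once."""
    conn.execute("CREATE TABLE sales (sale_id INTEGER, customer TEXT, amount REAL)")
    # Presort on the key so the load is written sequentially.
    ordered = sorted(rows, key=lambda r: r[0])
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", ordered)
    # Build the index after the load, and only when queries need it.
    conn.execute("CREATE INDEX ix_sales_customer ON sales (customer)")
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [(3, "c", 30.0), (1, "a", 10.0), (2, "b", 20.0)]
loaded = bulk_load(conn, rows)
```

Keeping the index out of the load means the index is built once instead of being updated for every inserted row.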
For managing tables like a DBA there are dependencies on the hardware level.
👓 Use concurrent I/O to improve DB2 database performance (ibm 2012)
Concurrent I/O and cached I/O are features that are typically associated with file systems.
Because of this, most DB2 DBAs think that the use of these two technologies lies within the purview of storage and system administrators.
However, leveraging this technology in a DB2 database environment is the responsibility of the DBA.
In this article DB2 is classified as the "application", which is confusing when that word is also used for business logic.
ELT processing pre & post steps
For Extract / Load processing there are many tools due to 👓 CWM
(Common Warehouse Metamodel specification).
However, when doing that in real life something is missing: control & monitoring.
For monitoring and control:
- What data, how many records are processed
- When did the process start and when was it ready
- Performance: processing optimized for bulk or for a small number of records.
- A restart option for error recovery.
This kind of logic is only possible by having an adjusted pre and post process in place.
That logic is difficult to solve with an external generic provision; it is relatively easy with local customisation using local naming conventions.
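A minimal sketch of such a pre and post process, assuming the control data is kept in a plain dictionary and that each step returns the number of records it processed. The function run_controlled and the step names are hypothetical; a real implementation would store the log in a table and follow local naming conventions.

```python
import time


def run_controlled(step_name, process, control_log):
    """Wrap a load step with a pre and post step that record control data."""
    previous = control_log.get(step_name)
    if previous and previous["status"] == "ok":
        return previous  # restart: skip steps that already completed
    entry = {"status": "running", "started": time.time(), "ended": None, "records": 0}
    control_log[step_name] = entry          # pre step: register the start
    try:
        entry["records"] = process()        # the actual extract / load work
        entry["status"] = "ok"
    except Exception as exc:
        entry["status"] = "error"
        entry["error"] = str(exc)
    finally:
        entry["ended"] = time.time()        # post step: register the end
    return entry


def failing_load():
    raise RuntimeError("source not available")


control_log = {}
run_controlled("load_customers", lambda: 250, control_log)
run_controlled("load_orders", failing_load, control_log)

# A restart re-runs only the step that ended in error.
run_controlled("load_customers", lambda: 999, control_log)
run_controlled("load_orders", lambda: 1200, control_log)
```

After the restart the log still shows 250 records for load_customers, because that step was skipped, and 1200 records for the repaired load_orders step.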
Details are found in the paragraph linked 👓 with the figure:
Transformations - data lineage.
Knowing what information from what source is processed into new information at a new location is lineage (derivation).
Data lineage states where data is coming from, where it is going, and what transformations are applied to it as it flows through multiple processes.
It helps understand the data life cycle. It is one of the most critical pieces of information from a metadata management point of view.
It is called 👓 "data lineage".
(science direct articles)
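Lineage can be kept as simple records of source, target and transformation. The sketch below assumes such records exist and walks them backwards to find everything a target is derived from. The class LineageStep, the function upstream and the dataset names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    source: str
    target: str
    transformation: str


def upstream(target, steps):
    """Return every source that directly or indirectly feeds `target`."""
    found = []
    todo = [target]
    while todo:
        current = todo.pop()
        for step in steps:
            if step.target == current and step.source not in found:
                found.append(step.source)
                todo.append(step.source)
    return found


steps = [
    LineageStep("crm.customers", "staging.customers", "copy"),
    LineageStep("staging.customers", "dwh.dim_customer", "deduplicate"),
    LineageStep("erp.orders", "dwh.fact_orders", "aggregate per day"),
]
sources = upstream("dwh.dim_customer", steps)
```

Here sources holds staging.customers and crm.customers, which is the derivation path of the customer dimension.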
Details are found in the paragraph linked 👓 with the figure:
Normalisation - Denormalisation
In transactional systems it is important to avoid any duplication of an artefact or element, because it is too complex to keep duplications synchronized.
👓 Database Normalization
(mariadb) describes this:
The concept of database normalization is generally traced back to E.F. Codd, an IBM researcher who, in 1970, published a paper describing the relational database model.
Third Normal Form (3NF)
Denormalization is the process of reversing the transformations made during normalization for performance reasons. It's a topic that stirs controversy among database experts;
there are those who claim the cost is too high and never denormalize, and there are those that tout its benefits and routinely denormalize.
- In 1NF each column is unique.
- In 2NF all attributes within the entity should depend solely on the unique identifier of the entity.
- In 3NF no column entry should depend on any entry (value) other than the key for the table. When 3NF is achieved, the database is considered normalized.
The classic Business Intelligence is reshaping all data into new dedicated data models. The facts and dimensions used in the operational process are not suited for reporting and analyses.
When the concepts of a transactional operational data design with normalization are followed, the result is a lot of transformations for tables.
What is delivered as OLAP or reports is denormalised using summaries.
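A small illustration with sqlite3, using made-up tables: the operational side stores each customer once and refers to it by key, while the reporting side is a denormalised summary that can be read without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised: each customer name is stored once.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id),
                         amount REAL);
    INSERT INTO customer VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);

    -- Denormalised for reporting: one summary row per customer, no join needed.
    CREATE TABLE report_sales AS
        SELECT c.name AS customer_name,
               COUNT(*) AS order_count,
               SUM(o.amount) AS total_amount
        FROM orders o JOIN customer c ON c.customer_id = o.customer_id
        GROUP BY c.name;
""")
report = conn.execute(
    "SELECT customer_name, order_count, total_amount "
    "FROM report_sales ORDER BY customer_name"
).fetchall()
```

The summary table has to be rebuilt or refreshed when the operational data changes, which is the synchronisation cost that normalisation avoids.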
National language Support (NLS)
(example eclipse.org) is about:
- string manipulation
- character classifications
- character comparison rules
- code sets
- date and time formatting
- user interfaces
- message-text languages
- numeric and monetary formatting
- sort orders
This all has an impact on the realisation of the data processing. (👓 link details figure:)
National Language Support (NLS) and localized versions are frequently confused. NLS ensures that systems can handle local language data.
A localized version is a software product in which the entire user interface appears in a particular language.
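Three of the listed topics (sort orders, code sets, character comparison rules) can be shown with the Python standard library only. The accent stripping below is a crude stand-in for a real collation and is an assumption of this sketch, not a complete NLS solution.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop combining marks (a crude sort key)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zebra", "Äpfel", "apple"]

# Sort orders: plain code point order puts the accented capital last.
by_code_point = sorted(words)
by_base_letter = sorted(words, key=lambda w: strip_accents(w).casefold())

# Code sets: the same character needs a different number of bytes.
utf8_length = len("é".encode("utf-8"))
latin1_length = len("é".encode("latin-1"))

# Comparison rules: case folding treats both spellings as the same word.
same_word = "STRASSE".casefold() == "Straße".casefold()
```

The byte length difference matters when sizing fixed-width columns: a field sized in bytes for a single-byte code set can truncate the same text stored as UTF-8.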
Scheduling, planning operations.
Scheduling is the other part of running processes. Instead of defining blocks of code in a program it is about defining blocks of programs for a process.
For building a program the word job is used. For building a process flow, having a start and end, the word job is also used at the operational department. These can get mixed up, but they are really different.
Building a process flow
Building a process flow (job) is defining the order in which to run code units (jobs).
- Defining the first and last code units. Useful for initialisation and for a message that the flow finished successfully.
- Dependencies: when a next code unit may run, and which ones it has to wait for.
- Allowing multiple code units to run at once when there are no dependencies (blue ready, green running, yellow waiting, red in error).
- Allowing a single process flow to be active at one moment, or having multiple instances of the same process flow running at the same moment. These parallel flows will need unique application datasets.
See the figure 👓, details are in the link.
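The dependency rules above can be sketched with graphlib from the Python standard library (Python 3.9 or later). The code unit names are hypothetical. Each wave contains the units whose dependencies are all finished, so the units within one wave may run in parallel.

```python
from graphlib import TopologicalSorter

# Each code unit maps to the units it has to wait for (hypothetical names).
flow = {
    "init": set(),
    "load_customers": {"init"},
    "load_orders": {"init"},
    "build_report": {"load_customers", "load_orders"},
    "notify_success": {"build_report"},
}


def run_in_waves(flow):
    """Return the code units grouped into waves that may run in parallel."""
    sorter = TopologicalSorter(flow)
    sorter.prepare()  # raises CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # units whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = run_in_waves(flow)
```

A real scheduler would start the units of a wave concurrently and call done for each unit as it finishes, instead of waiting for the whole wave.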
Running planned process flows
Having process flows defined, the planning is:
- when they should run and when they should be ready.
- what impact they are having on the system resources.
- Are there dependencies between flows when they are running?
In the example in the figure, processes are run in the early morning before office hours to do a full load of several warehouses.
The full load in this case was faster than trying to catch all changes. An additional advantage is that missed changes will not have a big impact, as the longest delay for data is one day.
At the end of this project three loads for development purposes and 3 (support lines) * 4 (DTAP versions) = 12 loads were run within the available hours.
During office hours regular updates run every 15 minutes to achieve a near real time updated version.
Developing a system like this would be easier and more understandable when the scheduling and code units are designed and built as one system.
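Whether the loads fit in the available hours can be estimated with a small calculation. The number of loads (3 + 12 = 15) comes from the text; the duration per load, the number of parallel slots and the length of the window are assumptions chosen only to make the example concrete.

```python
import heapq


def finish_time(durations, slots):
    """Greedy plan: each load starts on the slot that becomes free first."""
    free_at = [0] * slots
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


loads = 3 + 3 * 4                 # development loads + support lines * DTAP versions
durations = [20] * loads          # assumed: every load takes 20 minutes
window_minutes = 120              # assumed: available time before office hours
needed = finish_time(durations, slots=4)  # assumed: 4 loads can run in parallel
fits = needed <= window_minutes
```

With these assumed numbers the 15 loads need four rounds of 20 minutes, so they fit in the window. With unequal durations the same function shows how one long load can dominate the plan.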
Change data - Transformations.
A standardised location for information data in the normal processes brings normal capacity questions.
When the collecting and sending areas of the EDW 3.0 are the ones most limited, the planning is best done for traffic by managing this service.
Modelling data with very detailed relationships should not be the function of a data warehouse.
More detail on the transport of data. The data flow goes:
- from a customer to the warehouse collecting point(s).
- transported internally for best service according to agreements.
- to a customer from the warehouse provision point(s).
Data warehouse 3.0
All this requires a well supported way of governing data centrally with a business mindset.
The figure shows a design conforming to physical logistics. 👓
These are practical data experiences.
© 2012,2020 J.A.Karman