Statistics - Analytics
Incorrect use of Statistics
All references on abuse and (mis)interpretation of statistics are collected at the dedicated page: Abuse of statistics
I am working mostly on other pages at this moment (Aug 2012).
I found this was a subject that deserved a dedicated page. Links and paragraphs will be moved here.
For the time being this page will be a mash-up. As soon as I see the hit ratio grow, I will do a clean-up.
Positioning & relations
Context word Statistics
Statistics is commonly misunderstood, as it can refer to:
- Descriptive statistics. As a measurement it is very visible and, in its context, trustworthy.
- Mathematical statistical theory. Being theoretical it is not visible, but it is trustworthy since it must be proved.
- Statistical proof of research assumptions. This uses the statistical theory but introduces uncertainty.
- The machine-learning approach, where models use all kinds of assumptions and statistical procedures, working towards results with a level of uncertainty.
This technology area is as challenging as statistics & analytics.
Analytics - Mathematics
As it is based on a branch of mathematics, this exact-science ("beta") way of thinking must be understood.
Not only that, there are many mandatory regulations in many areas to take notice of.
Some areas have strict regulations.
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),
in that descriptive statistics aim to summarize a data set, rather than use the data to learn about the population that the data are thought to represent.
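As a minimal sketch (with hypothetical sample values), such descriptive summaries can be computed with Python's standard library:

```python
import statistics

# A small hypothetical sample
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Descriptive statistics summarize the sample itself,
# without inferring anything about a larger population.
mean = statistics.mean(data)      # arithmetic mean
median = statistics.median(data)  # middle value of the sorted sample
stdev = statistics.pstdev(data)   # population standard deviation

print(mean, median, stdev)  # 5.0 4.5 2.0
```

These figures describe only the eight values given; saying anything about a wider population would already be inferential statistics.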
References Descriptive Statistics
All references to this kind of data are collected at a dedicated page:
Descriptive - Data
Probability theory is the branch of mathematics concerned with probability, the analysis of random phenomena.
The central objects of probability theory are random variables, stochastic processes,
and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion.
This is more a collection of all kinds of goals and techniques.
Calculations, measures, units
Calculation is the basic step in mathematics.
Many methods with measures have been used; they were made common with the decimal system.
Numbers like 12 (2, 3, 4, 6) and 60 (2, 3, 4, 5, 6) are better divisible than 10 (2, 5), yet base-10 calculations have become commonly accepted.
The only exception is the technical computer approach, which is binary based; notations are mostly in hexadecimal.
The Greeks and Romans did not use a decimal system with a placeholder (zero); the decimal system has another origin.
With a decimal system, measures and calculations can be strongly simplified. Geometry, however,
has not been touched that much by the decimal approach.
Radians (based on the number pi) have more influence there.
Angles are still measured in degrees (360 to a full circle) or in radians (2·pi to a full circle). A French approach divides the circle into 400 gradians.
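The relation between these angle units can be sketched in a few lines (the function names are my own):

```python
import math

def deg_to_rad(deg):
    # A full circle is 360 degrees or 2*pi radians.
    return deg * math.pi / 180.0

def deg_to_grad(deg):
    # The French gradian divides the full circle into 400 parts.
    return deg * 400.0 / 360.0

print(deg_to_rad(180.0))  # approximately pi
print(deg_to_grad(90.0))  # 100.0
```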
The measurement of the earth has become very accurate with GPS.
The time and calendar did not change to a metric system, although it has been tried.
We still use hours of 60 minutes and minutes of 60 seconds. This fits very easily with locations (GIS) and positions/time on earth.
The fundamentals of the Western world come from the ancient Greeks. The most famous is:
Pythagoras of Samos
Pythagoras ho Samios "Pythagoras the Samian", b. about 570 – d. about 495 BC was an Ionian Greek philosopher, mathematician, and founder of the religious movement called Pythagoreanism.
Most of the information about Pythagoras was written down centuries after he lived, so very little reliable information is known about him.
Aristotle & Plato
These ancient Greek philosophers already stated the problem with analytics.
Although modelling data looks mathematically proved, there is uncertainty.
The way research on data is done can be more art (human interpretation) than real evidence.
Before the mid-1900s, statistics meant observed data and descriptive summary figures, such as means, variances, indices, etc., computed from data.
Ronald Fisher did the initial work in developing the theory of modern inferential statistics.
The other important person is Thomas Bayes.
There are ongoing debates about using a Bayesian approach versus Fisher's.
Statistics references surrounding Fisher and Bayes
Essentials of Paleomagnetism: Web Edition 1.0 (March 18, 2009)
Paleomagnetists have depended since the 1950s on the special statistical framework developed by Fisher (1953) for the analysis of unit vector data.
A lot of these descriptions can be found on Wikipedia.
In statistics, mean has two related meanings:
the arithmetic mean (as distinguished from the geometric mean or harmonic mean);
the expected value of a random variable, which is also called the population mean.
In statistics and probability theory, median is described as the numerical value separating the higher half of a sample,
a population, or a probability distribution, from the lower half.
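As a small sketch of these definitions (with a hypothetical sample), Python's standard library covers the different kinds of mean as well as the median:

```python
import statistics

data = [1.0, 2.0, 4.0, 8.0]

# Three related notions of "mean" for the same sample:
am = statistics.mean(data)            # arithmetic mean
gm = statistics.geometric_mean(data)  # geometric mean
hm = statistics.harmonic_mean(data)   # harmonic mean

# The median separates the lower half of the sorted sample
# from the upper half.
med = statistics.median(data)

print(am, gm, hm, med)  # 3.75 2.828... 2.133... 3.0
```

Note that the familiar inequality arithmetic mean >= geometric mean >= harmonic mean holds for this sample.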
In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function,
known as the Gaussian function or, informally, the bell curve.
In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability distribution that expresses the probability of a given number of events occurring
in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a
finite number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n.
In probability theory and statistics, Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable.
In probability theory and statistics, kurtosis is any measure of the "peakedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.
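A minimal sketch of one common way to estimate these shape measures from a sample (the standardized third and fourth central moments; the sample values are hypothetical):

```python
def skewness(xs):
    # Sample skewness: third standardized central moment.
    # Zero for perfectly symmetric data.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def kurtosis(xs):
    # "Peakedness": fourth standardized central moment.
    # A normal distribution has kurtosis 3 (excess kurtosis 0).
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))  # 0.0, since the data is symmetric
print(kurtosis(symmetric))
```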
In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with
k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.
It is most often used when comparing statistical models that have been fit to a data set, in order to identify the model that best fits the population from which the data were sampled.
A confounding variable (also confounding factor, hidden variable, lurking variable , a confound, or confounder)
is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable.
Collinear, hidden variable
In geometry, collinearity is a property of a set of points, specifically the property of lying on a single line. A set of points with this property is said to be collinear (often misspelled as co-linear or colinear).
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data.
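A minimal sketch of how such high correlation between predictors can be detected, using a hand-written Pearson correlation and hypothetical predictor values:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predictors: x2 is nearly a linear copy of x1, so a
# multiple regression using both would suffer from multicollinearity.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]

r = pearson(x1, x2)
print(r)  # close to 1.0: a warning sign
```

In practice a correlation this close to 1 among predictors means the individual coefficient estimates cannot be trusted, exactly the erratic behavior described above.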
Anova correlation clustering
In statistics, Mahalanobis distance
is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analyzed
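As a sketch for the two-dimensional case (inverting the 2x2 covariance matrix by hand so no external libraries are needed; the function name is my own):

```python
import math

def mahalanobis_2d(x, mu, cov):
    # Mahalanobis distance of point x from mean mu, given a 2x2
    # covariance matrix cov (a pair of row tuples).
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form dx^T * inv(cov) * dx
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With the identity covariance the Mahalanobis distance reduces
# to the ordinary Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity))  # 5.0
```

With a non-identity covariance the distance is scaled by the correlations between the variables, which is what makes it useful for pattern detection.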
Operations research, or Operational Research in British usage, is a discipline that deals with the application of advanced analytical methods to help make better decisions. It is often considered to be a sub-field of Mathematics. The terms management science and decision science are sometimes used as more modern-sounding synonyms.
MPS , LP programming , AHP
The OODA loop
The OODA loop (for observe, orient, decide, and act) is a concept originally applied to the combat operations process,
often at the strategic level in military operations. It is now also often applied to understanding commercial operations and learning processes.
Monte Carlo, Las Vegas
A randomized algorithm is an algorithm which employs a degree of randomness as part of its logic.
The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random bits.
Formally, the algorithm's performance will be a random variable determined by the random bits; thus either the running time, or the output (or both) are random variables
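A classic Monte Carlo example of such a randomized algorithm is estimating pi from uniformly random points; this is a sketch, with the seed fixed only to make the run repeatable:

```python
import random

def estimate_pi(n, seed=0):
    # The fraction of random points in the unit square that fall
    # inside the quarter circle approaches pi/4. The output is a
    # random variable, as described above: a different seed gives
    # a different (but close) answer.
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(100_000))  # close to 3.14159
```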
CHAID is a type of decision tree technique, based upon adjusted significance testing (Bonferroni testing).
Analytics analyse (inferential)
The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four
levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance
At the top level, the data mining process is organized into a number of phases; each phase consists of several second-level generic tasks. This second level is called generic because it is intended to be general enough to cover all possible data mining situations.
SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute Inc., one of the largest producers of business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered a general data mining methodology,
SAS claims that it is "rather a logical organisation of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining".
Oracle 10g: These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction.
MS Excel is also mentioned.
Statistical_classification: In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
Data_Mining_Extensions (DMX) is a query language for Data Mining Models supported by Microsoft's SQL Server Analysis Services product.
The Predictive Model Markup Language (PMML) is an XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to
define models related to predictive analytics and data mining and to share those models between PMML-compliant applications.
The Data Mining Group (DMG) is an independent, vendor-led consortium that develops data mining standards, such as the Predictive Model Markup Language.
Disappointingly, the years mentioned (2010) are old.
Predicting the future - PMML
PMML is a standard to help deploy (score) data mining models
Part 1 offered a general overview of predictive analytics. Part 2 focused on predictive modeling techniques, the mathematical algorithms that make up the core of predictive analytics. Part 3 put those techniques to use and described the making of a predictive solution.
Big data sources
Creating financial models involving human behavior is like forcing "the ugly stepsister's foot into Cinderella's pretty glass slipper."
( predictive-models 2012/07)
( predictive-models 2012/08)
Treatment of missing data
- In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values.
- Missings (David C.Howell York University 2002-retired R SAS Courses)
- Multiple Imputation for Missing Data (support.sas.com)
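The simplest form of item imputation, replacing missing values with the mean of the observed ones, can be sketched as follows (the function name and data are my own; real work would use the multiple-imputation methods referenced above):

```python
import statistics

def impute_mean(values):
    # Item imputation: replace each missing value (None) with the
    # mean of the observed values, avoiding listwise deletion.
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

data = [4.0, None, 6.0, 8.0, None]
print(impute_mean(data))  # [4.0, 6.0, 6.0, 8.0, 6.0]
```

Mean imputation keeps all cases but understates the variance, which is one of the pitfalls multiple imputation is designed to address.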
- In statistics and machine learning, ensemble methods
use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives.
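The simplest ensemble combination rule, majority voting over the class predictions of several models, can be sketched as (model outputs are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    # Combine the class predictions of several models: the
    # ensemble's answer is the label predicted most often.
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers disagree on one observation:
models = ["spam", "spam", "ham"]
print(majority_vote(models))  # spam
```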
Gamification is the use of game thinking and game mechanics in a non-game context in order to engage users and solve problems. Gamification is used in applications and processes to improve user engagement, return on investment, data quality, timeliness, and learning.
It goes into the social aspects of human relations.
Choosing and playing random, or not being random.
Advanced usage of IT
The game industry has always been one of the first advanced users of IT resources.
Game Studios at the Forefront of Big Data, Cloud
For Riot Games, Big Data Is Serious Business
Fast clustering algorithms for massive datasets
Clustering with text
- Link_analysis In network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes.
- PageRank (google search engine )
- TrustRank (yahoo search engine )
Likelihood, false positive
Time shifting, lag
Choosing and playing random, or not being random.
Generating good random numbers is an everlasting question.
Benford distribution of numbers
Under the conditions of real measurements, the numbers themselves are not random.
How a Simple Misconception can Trip up a Fraudster and How a Savvy CFE Can Spot It
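Benford's law gives the expected first-digit distribution of such naturally occurring numbers; the fraud-detection idea referenced above is to compare reported figures against it. A minimal sketch:

```python
import math

def benford_probability(d):
    # Benford's law: the probability that the leading digit of a
    # "naturally occurring" number equals d (d = 1..9).
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
# Digit 1 appears about 30.1% of the time, digit 9 only about 4.6%.
```

Fabricated figures tend to have leading digits spread far more evenly, which is exactly the misconception that trips up a fraudster.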
Product standards allow a number of defects: in fact 4.5 sigma (Six Sigma with the conventional 1.5-sigma shift).
Control charts , also known as Shewhart charts (after Walter A. Shewhart) or process-behavior charts, in statistical process control are tools used to determine if a manufacturing or business process is in a state of statistical control.
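The core of a Shewhart chart is a pair of control limits around the process mean; a minimal sketch with hypothetical measurements (the conventional choice is k = 3 standard deviations):

```python
import statistics

def control_limits(samples, k=3.0):
    # Shewhart-style control limits: the process is considered
    # "in control" while measurements stay within k standard
    # deviations of the process mean.
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    return mean - k * sd, mean + k * sd

measurements = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0]
lcl, ucl = control_limits(measurements)
print(lcl, ucl)  # lower and upper control limit around 10.0
```

A point falling outside these limits signals that the process may no longer be in a state of statistical control.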
Scoring and modeling, whether internally or externally developed, are used extensively in credit card lending.
| Persons || History of statistics theory |
| Persons || History of operational research |
| Abraham_Wald || (1940) founded the field of statistical sequential analysis, operational research |
© 2012 J.A.Karman (21 apr 2012)