# Statistics - Analytics

##### Incorrect use of Statistics
All references to abuse and (mis)interpretation of statistics are collected at the dedicated page: Abuse of statistics. I'm working mostly on other pages at this moment (Aug 2012).
Found this was a subject deserving a dedicated page. Links and paragraphs will be moved here.

For the time being this page will be a mash-up. As soon as I see the hit ratio grow I will do a clean-up.

### Positioning & relations

##### Context of the word Statistics
Statistics is commonly misunderstood, as it can be:
• Descriptive statistics. As a measurement it is very visible and, in context, trustworthy.
• Mathematical statistical theory. As it is theoretical it is not visible, but trustworthy since it must be proved.
• Statistical proof of research assumptions. It uses the statistical theory but introduces uncertainty.
• The machine-learning approach, where models use all kinds of assumptions and statistical procedures, working toward results with a level of uncertainty.

#### References

This technology area is as challenging as statistics & analytics.

#### Information Technology

#### Analytics - Mathematics

As it is based on a branch of mathematics, this beta (exact-sciences) way of thinking must be understood.
Not only that, there are many mandatory regulations in many areas to take notice of.

#### Content awareness

Some areas have strict regulations.

##### Descriptive Statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a data set, rather than use the data to learn about the population that the data are thought to represent.

##### References Descriptive Statistics
All references to this kind of data collected at a dedicated page:
Descriptive - Data

##### Probability theory
Probability theory is the branch of mathematics concerned with probability, the analysis of random phenomena. The central objects of probability theory are random variables, stochastic processes, and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion.

#### Data Mining

This is more a collection of all kinds of goals and techniques.

### Historical references

#### Calculation, measures, units

Calculation is the basic step in mathematics. Many methods and measures have been used; the decimal approach made them common. Numbers like 12 (2, 3, 4) and 60 (2, 3, 4, 5, 6) are more easily divisible than 10 (2, 5). Still, 10-based calculations have become commonly accepted. The only exception is the technical computer approach, which is binary based; notations are mostly hexadecimal.
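The divisibility claim can be checked with a few lines of stdlib Python (a minimal sketch; the helper name `divisors` is mine):

```python
# List the proper divisors of a number, to compare how "divisible" bases are.
def divisors(n):
    return [d for d in range(2, n) if n % d == 0]

print(divisors(12))  # [2, 3, 4, 6]
print(divisors(60))  # [2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
print(divisors(10))  # [2, 5]
```

So 12 and 60 offer far more ways to split a quantity evenly than 10, which is why they survived in hours, minutes, and degrees.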

0_(number): the Greeks and Romans did not use a decimal system with a placeholder. The decimal system has another origin.

With a decimal system, measures and calculations can be strongly simplified. This is the Metric_system. Numerical_analysis

Trigonometry has not been touched that much by the decimal approach.
Radians (based on pi) have more influence.
Angles are still in degrees, 360 for a full circle, or 2*pi based. A French approach divides angles into 400 gradians.
The measurement of the earth has become very accurate with GPS.

Time and the calendar did not change to a metric system, although it has been tried: French_Republican_Calendar
We still use hours of 60 minutes and minutes of 60 seconds. This works easily with locations (GIS) and positions/time on earth.

#### Greek fundamentals

##### Pythagoras
The fundamentals of the western world come from the ancient Greeks. The most famous is:
Pythagoras of Samos Pythagoras ho Samios "Pythagoras the Samian", b. about 570 – d. about 495 BC was an Ionian Greek philosopher, mathematician, and founder of the religious movement called Pythagoreanism. Most of the information about Pythagoras was written down centuries after he lived, so very little reliable information is known about him.

##### Aristotle & Plato
These old Greek philosophers already stated the problem with analytics: theory_of_universals, Aristotle versus Platonic realism.

Although modelling data looks mathematically proven, there is uncertainty.
The way research on data is done can even be more art (human interpretation) than real evidence.

### Bayes & Fisher

#### Industrial revolution

##### Statistics history
Before the middle of the 1900s, statistics meant observed data and descriptive summary figures, such as means, variances, indices, etc., computed from data.

Thomas_Bayes did the initial work for developing the theory behind Bayesian_statistics and Bayesian_probability.

The other important person is Ronald_Fisher, with Fisher's_exact_test.

There are some debates about using a Bayesian approach or Fisher's.
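A minimal sketch of the Bayesian side of that debate: Bayes' rule updates a prior belief with evidence. The screening numbers below are hypothetical, chosen only to show the mechanics:

```python
# Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
# Hypothetical screening test: 1% base rate, 95% sensitivity, 90% specificity.
prior = 0.01
sensitivity = 0.95       # P(positive | condition)
false_positive = 0.10    # 1 - specificity
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # about 0.088
```

Even with a good test, a positive result leaves the probability of the condition below 10%, because the prior is so low; this prior-driven reasoning is exactly what the Fisherian significance-testing tradition avoids.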

##### Statistics references: Fisher and Bayes
Essentials of Paleomagnetism: Web Edition 1.0 (March 18, 2009) (magician.ucsd.edu) Paleomagnetists have depended since the 1950’s on the special statistical framework developed by Fisher (1953) for the analysis of unit vector data.

Encyclopedia_of_computational_neuroscience (scholarpedia.org)

### Statistical Basics

#### Probability

On Wikipedia a lot of descriptions can be found.

##### Mean
In statistics, mean has two related meanings:
• the arithmetic mean (as distinguished from the geometric mean or harmonic mean).
• the expected value of a random variable, which is also called the population mean.

##### Median
In statistics and probability theory, the median is the numerical value separating the higher half of a sample, a population, or a probability distribution from the lower half.
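A quick stdlib Python illustration of these location measures (the sample values are made up; note how the outlier moves the mean but not the median):

```python
import statistics

data = [1, 2, 2, 3, 14]  # small made-up sample with one outlier

print(statistics.mean(data))                      # 4.4, pulled up by the outlier
print(statistics.median(data))                    # 2, robust to the outlier
print(round(statistics.geometric_mean(data), 2))  # about 2.79
print(round(statistics.harmonic_mean(data), 2))   # about 2.08
```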

##### Normal
In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or informally the bell curve.

##### Poisson
In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

##### Uniform
In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a finite number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n.
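The three distributions above can be evaluated with a few stdlib formulas (a sketch; the function names are mine):

```python
import math

# Normal pdf at x for mean mu, standard deviation sigma (the bell curve).
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Poisson pmf: probability of exactly k events given average rate lam.
def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Discrete uniform over n equally likely values.
def uniform_pmf(n):
    return 1 / n

print(round(normal_pdf(0), 4))      # 0.3989, the peak of the standard bell curve
print(round(poisson_pmf(2, 3), 4))  # 0.224, two events at average rate 3
print(round(uniform_pmf(6), 4))     # 0.1667, one face of a fair die
```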

##### Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable.

##### Kurtosis
In probability theory and statistics, kurtosis is any measure of the "peakedness" of the probability distribution of a real-valued random variable. In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population.

##### Chi-squared
In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
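That definition can be checked by simulation with stdlib Python: summing the squares of k standard normals gives a variable whose mean is k:

```python
import random

# A chi-squared variate with k degrees of freedom is a sum of squares
# of k independent standard normals; its expected value is k.
random.seed(1)
k = 5
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(20000)]
print(sum(samples) / len(samples))  # close to 5
```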

##### F-test
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled.

##### Hidden variable
A confounding variable (also confounding factor, hidden variable, lurking variable, a confound, or confounder) is an extraneous variable in a statistical model that correlates (positively or negatively) with both the dependent variable and the independent variable.

##### Collinear, hidden variable
In geometry, collinearity is a property of a set of points, specifically the property of lying on a single line. A set of points with this property is said to be collinear (often misspelled as co-linear or colinear).
Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated. In this situation the coefficient estimates may change erratically in response to small changes in the model or the data.
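A small stdlib sketch of near-collinearity: a predictor that is just another predictor plus a little noise is almost perfectly correlated with it, which is what destabilizes the regression coefficients:

```python
import random

# Pearson correlation, to show two "different" predictors can be nearly collinear.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
x1 = [random.uniform(0, 10) for _ in range(200)]
x2 = [x + random.gauss(0, 0.1) for x in x1]  # x2 is x1 plus small noise
print(pearson(x1, x2))  # very close to 1: a regression cannot separate them
```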

#### ANOVA, correlation, clustering

##### Mahalanobis distance
In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables, by which different patterns can be identified and analyzed.
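For two dimensions the distance can be written out by hand, inverting the 2x2 covariance matrix directly (a sketch; with the identity covariance it reduces to ordinary Euclidean distance):

```python
# Mahalanobis distance sqrt((x-m)^T S^-1 (x-m)) in two dimensions.
def mahalanobis_2d(point, mean, cov):
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of the 2x2 covariance
    dx = [point[0] - mean[0], point[1] - mean[1]]
    left = [dx[0] * inv[0][0] + dx[1] * inv[1][0],
            dx[0] * inv[0][1] + dx[1] * inv[1][1]]
    return (left[0] * dx[0] + left[1] * dx[1]) ** 0.5

# With the identity covariance this is just Euclidean distance:
print(mahalanobis_2d((3, 4), (0, 0), [[1, 0], [0, 1]]))  # 5.0
# With strongly correlated variables the same point is "farther":
print(mahalanobis_2d((3, 4), (0, 0), [[1, 0.9], [0.9, 1]]))
```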

### State of statistical knowledge: probability, ANOVA, correlation, clustering, regression

#### Decisions Operations

##### Operational research
Operations research, or operational research in British usage, is a discipline that deals with the application of advanced analytical methods to help make better decisions. It is often considered to be a sub-field of mathematics. The terms management science and decision science are sometimes used as more modern-sounding synonyms.

##### OODA loop
The OODA loop (for observe, orient, decide, and act) is a concept originally applied to the combat operations process, often at the strategic level in military operations. It is now also often applied to understand commercial operations and learning processes.

##### Monte Carlo, Las Vegas
A randomized algorithm is an algorithm which employs a degree of randomness as part of its logic. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random bits. Formally, the algorithm's performance will be a random variable determined by the random bits; thus either the running time, or the output (or both) are random variables.
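The classic Monte Carlo illustration estimates pi from random points; the answer is only probably close, which is the Monte Carlo trade-off (a Las Vegas algorithm would instead always be correct but have random running time):

```python
import random

# Monte Carlo estimate of pi: throw random points into the unit square
# and count the fraction that lands inside the quarter circle.
random.seed(42)
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1)
print(4 * inside / n)  # approximately 3.14
```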

### Forecasting

### Data Mining
Some dedicated techniques are used:
SVM: machine_learning (wiki) Explained (tristanfletcher) Shogun_(toolbox)
Neural_network: software, Artificial
Vector_autoregression: (1 wiki)
##### CHAID
CHAID is a type of decision tree technique, based upon adjusted significance testing (Bonferroni testing).

#### Analytics - analysis (inferential)

##### CRISP-DM
The CRISP-DM methodology is described in terms of a hierarchical process model, consisting of sets of tasks described at four levels of abstraction (from general to specific): phase, generic task, specialized task, and process instance.

At the top level, the data mining process is organized into a number of phases; each phase consists of several second-level generic tasks. This second level is called generic because it is intended to be general enough to cover all possible data mining situations.

##### SEMMA
SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute Inc., one of the largest producers of business intelligence software. It guides the implementation of data mining applications. Although SEMMA is often considered a general data mining methodology, SAS claims that it is "rather a logical organisation of the functional tool set of" one of their products, SAS Enterprise Miner, "for carrying out the core tasks of data mining".

##### SQL scoring
Oracle 10gr: "These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction." MS Excel is also mentioned.

Statistical_classification: in machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Data_Mining_Extensions (DMX) is a query language for Data Mining Models supported by Microsoft's SQL Server Analysis Services product.
##### PMML
The Predictive Model Markup Language (PMML) is an XML-based markup language developed by the Data Mining Group (DMG) to provide a way for applications to define models related to predictive analytics and data mining and to share those models between PMML-compliant applications.

The Data Mining Group (DMG) is an independent, vendor-led consortium that develops data mining standards, such as the Predictive Model Markup Language.
Disappointingly, the references mentioned date back to 2010.

#### Predicting the future - PMML

PMML is a standard to help deploy (score) data mining models.
Part 1 offered a general overview of predictive analytics. Part 2 focused on predictive modeling techniques, the mathematical algorithms that make up the core of predictive analytics. Part 3 put those techniques to use and described the making of a predictive solution.

##### PMML sources
top-10-pmml-resources (predictive-analytics.info)

##### Big data sources
big_data_press_release (whithouse)
big-data-rd-initiative (2012/03/29 cccblog)
(wallstreet journal) Creating financial models involving human behavior is like forcing "the ugly stepsister's foot into Cinderella's pretty glass slipper."
analytics-india-jobs-study (analyticsindiamag 2012)
choosing_a_good (graphs 2006/09)
real-time-analytics-basics-bayesian (predictive-models 2012/07)
real-time-analytics-bayesian-part-2 (predictive-models 2012/08)

Correlation_does_not_imply_causation
Data_virtualization (big data)
##### Treatment of missing data
• In statistics Imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values
• Missing data (David C. Howell, York University, 2002, retired; R and SAS courses)
• Multiple Imputation for Missing Data (support.sas.com)
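Mean (unit) imputation is the simplest of these methods; a stdlib sketch (`impute_mean` is my name for it), with the caveat that it shrinks the variance of the imputed variable:

```python
import statistics

# Mean imputation: replace each missing value (None) with the mean
# of the observed values. Simple, but it understates the variance.
def impute_mean(values):
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0, 5.0, None]))  # [1.0, 3.0, 3.0, 5.0, 3.0]
```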

##### Ensemble Models
• In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives.
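A minimal sketch of the idea with majority voting over three hypothetical classifiers: the ensemble is right wherever at least two of the three models are:

```python
from collections import Counter

# Majority-vote ensemble: combine the label predicted by several models.
def majority_vote(model_predictions):
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Three imperfect (made-up) classifiers that each err on a different case:
model_a = ["spam", "ham", "spam", "ham"]
model_b = ["spam", "spam", "spam", "ham"]
model_c = ["ham", "ham", "spam", "ham"]
print(majority_vote([model_a, model_b, model_c]))  # ['spam', 'ham', 'spam', 'ham']
```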

### Games Theory

#### Classic theory

Game_theory
Decision making
Probability

#### Gamification

Gamification is the use of game thinking and game mechanics in a non-game context in order to engage users and solve problems. Gamification is used in applications and processes to improve user engagement, return on investment, data quality, timeliness, and learning.

It touches the social aspects of human relations.

### Games - Simple

#### Simple games

Choosing and playing random, or not being random.
Three_door_problem (wiki)
Good old Monty Hall! Or, All Probability Is Conditional (wmbriggs)
wheel-of-mythfortune (mythbusters)
Rock-paper-scissors
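The three-door (Monty Hall) problem can be settled by simulation; the key fact is that switching wins exactly when the first pick was wrong:

```python
import random

random.seed(7)

# The host always opens a goat door, so switching wins exactly
# when the first pick was wrong (probability 2/3).
def play(switch, trials=10_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        wins += (pick != car) if switch else (pick == car)
    return wins / trials

print(play(switch=True))   # about 0.667
print(play(switch=False))  # about 0.333
```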

### Games Industry

#### Advanced usage of IT

The game industry has always been one of the first advanced users of IT resources. Game Studios at the Forefront of Big Data, Cloud (slashdot). For Riot Games, Big Data Is Serious Business (slashdot 2012).

### Algorithm

##### Defining Algorithm
Algorithm, Algorithm_characterizations

Divide_and_conquer_algorithm

##### Fast clustering algorithms for massive datasets
Clustering with text (bigdatganews).

#### Page Ranking
• Link_analysis: in network theory, link analysis is a data-analysis technique used to evaluate relationships (connections) between nodes.
• PageRank (google search engine)
• TrustRank (yahoo search engine)

#### Corrections

Bonferroni.

##### Leverage

##### Collection
Quicksort (wiki)
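A short, not-in-place version of quicksort in Python, partitioning around a middle pivot (a sketch for illustration, not a tuned implementation):

```python
# Quicksort: recursively partition around a pivot and concatenate.
def quicksort(items):
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```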

### Miscellaneous: Numbers

#### Random numbers

Generating good random numbers is an everlasting question.
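Python's own `random` module answers this with the Mersenne Twister (MT19937); seeding makes the pseudo-random sequence reproducible:

```python
import random

# Python's random module is a Mersenne Twister (MT19937) generator.
# Two generators with the same seed produce the same sequence.
gen1 = random.Random(123)
gen2 = random.Random(123)
print(gen1.random() == gen2.random())  # True: same seed, same sequence
```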
Mersenne_twister, Wichmann-Hill

#### Benford distribution of numbers

With the conditions of real measures, the numbers themselves are not random.
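Benford's law gives the leading digit d the probability log10(1 + 1/d); a quick stdlib check, including the classic example that powers of 2 follow the law:

```python
import math

# Benford's law: in many real-world data sets the leading digit d
# appears with probability log10(1 + 1/d), so 1 leads about 30% of the time.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
print(round(benford[1], 3))  # 0.301
print(round(benford[9], 3))  # 0.046

# First digits of the powers of 2 follow the law closely:
digits = [int(str(2 ** n)[0]) for n in range(1, 1001)]
print(digits.count(1) / 1000)  # close to 0.301
```

This skew toward small leading digits is what makes the law useful for spotting fabricated figures in fraud examination.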
Benford's_law (wiki)
How a Simple Misconception can Trip up a Fraudster and How a Savvy CFE Can Spot It (acfe)

##### Six_Sigma
Product standard for allowed defects; in fact 4.5 sigma. Six_Sigma (wiki)
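The famous 3.4 defects-per-million figure follows from the normal tail beyond 4.5 sigma (six sigma minus the customary 1.5-sigma shift); a stdlib check using `math.erfc`:

```python
import math

# Defects per million opportunities (DPMO) for a one-sided k-sigma limit,
# using the standard normal tail probability P(Z > k).
def defects_per_million(k_sigma):
    tail = 0.5 * math.erfc(k_sigma / math.sqrt(2))
    return tail * 1_000_000

print(round(defects_per_million(4.5), 1))  # about 3.4 DPMO, the Six Sigma figure
print(round(defects_per_million(3.0), 1))  # about 1350 DPMO at only 3 sigma
```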

Control charts, also known as Shewhart charts (after Walter A. Shewhart) or process-behavior charts, are tools used in statistical process control to determine whether a manufacturing or business process is in a state of statistical control.

##### credit fico
Scoring and modeling, whether internally or externally developed, are used extensively in credit card lending. credit_card ch8 (fdic.gov)

### Historic references

Persons in the history of statistics theory.
Persons in the history of operational research.
Abraham_Wald (1940) founded the field of statistical sequential analysis (operational research).

© 2012 J.A.Karman (21 apr 2012)