<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3143564862155210283</id><updated>2012-01-30T21:28:59.806-08:00</updated><category term='Dimensional Modeling'/><category term='Tukey&apos;s test'/><category term='forecasting'/><category term='Statistics'/><category term='chi-square goodness of fit'/><category term='business intelligence'/><category term='OBIEE'/><category term='Runs test'/><category term='variance'/><category term='Spearman&apos;s rank correlation coefficient.'/><category term='inferntial statistics'/><category term='Exponential Distribution'/><category term='standard deviation'/><category term='ANOVA'/><category term='Data warehouse'/><category term='box and whisker plots'/><category term='Wilcoxon matched-pairs signed ranks test'/><category term='chi-square test of independence'/><category term='two way ANOVA'/><category term='business analytics'/><category term='Normal Distribution'/><category term='ChebyChev&apos;s theorem'/><category term='Simple Regression'/><category term='coefficient of variation'/><category term='p-value'/><category term='chi-square distribution.'/><category term='Design of experiment'/><category term='t-tests'/><category term='dimensions'/><category term='OLAP'/><category term='predictive analysis'/><category term='Error of Estimate'/><category term='Descriptive statistics'/><category term='tukey-kramer procedure'/><category term='null hypothesis'/><category term='kurtosis'/><category term='KPI'/><category term='central limit theorem'/><category term='Balanced Scorecard'/><category term='Friedman test'/><category term='pearsonian coefficient of skewness'/><category term='Multiple 
Regression'/><category term='confidence interval'/><category term='Datasource'/><category term='Mann-Whitney U test'/><category term='decision support systems'/><category term='facts'/><category term='Kruskal-Wallis test'/><category term='F-test'/><category term='R'/><category term='sampling'/><title type='text'>Business Intelligence</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>44</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1420106977523133705</id><published>2011-10-12T01:22:00.000-07:00</published><updated>2011-10-12T01:23:04.539-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R - Turorial II</title><content type='html'>&lt;html&gt; &lt;head&gt;  &lt;title&gt;&lt;/title&gt; &lt;/head&gt; &lt;body&gt;  &lt;h2&gt;   &lt;span style="color: rgb(255, 255, 224);"&gt;&lt;span style="background-color: rgb(128, 0, 0);"&gt;Lists&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;  &lt;ul&gt;   &lt;li&gt;    Collection of Objects of same or mixed type.List may also contain vector or other 
lists.&lt;/li&gt;   &lt;li&gt;    Create a list using the function list().&lt;/li&gt;   &lt;li&gt;    The following example shows operations that can be performed on a list.&lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 128, 0);"&gt;&lt;span style="font-size: 12px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; alphabets=list(a=&amp;quot;apple&amp;quot;,b=&amp;quot;ball&amp;quot;,c=&amp;quot;cat&amp;quot;)&lt;br /&gt;   &amp;gt; alphabets&lt;br /&gt;   $a&lt;br /&gt;   [1] &amp;quot;apple&amp;quot;&lt;br /&gt;   $b&lt;br /&gt;   [1] &amp;quot;ball&amp;quot;&lt;br /&gt;   $c&lt;br /&gt;   [1] &amp;quot;cat&amp;quot;&lt;br /&gt;   &lt;br /&gt;   &amp;gt; alphabets[[1]]&lt;br /&gt;   [1] &amp;quot;apple&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;   &amp;nbsp;&lt;/p&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 128, 0);"&gt;&lt;span style="font-size: 12px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; alphabets$a&lt;br /&gt;   [1] &amp;quot;apple&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;   &amp;nbsp;&lt;/p&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 128, 0);"&gt;&lt;span style="font-size: 12px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; length(alphabets)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;   &lt;span style="color: rgb(0, 128, 0);"&gt;&lt;span style="font-size: 12px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;[1] 3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;   &amp;nbsp;&lt;/p&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 128, 0);"&gt;&lt;span style="font-size: 12px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; alphabets[[&amp;quot;a&amp;quot;]]&lt;br /&gt;   [1] 
&amp;quot;apple&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;    Double brackets [[.]] extract a single element; its name is not returned.&amp;nbsp; Single brackets [.]&amp;nbsp; return a sub-list that keeps the names.&amp;nbsp;&lt;/li&gt;   &lt;li&gt;    When lists are formed from objects, the elements used to form the list are copied.&lt;/li&gt;   &lt;li&gt;    The function c() can be used to combine lists into a single object of type list.&lt;/li&gt;  &lt;/ul&gt;  &lt;h2&gt;   &lt;span style="color: rgb(255, 255, 224);"&gt;&lt;span style="background-color: rgb(128, 0, 0);"&gt;Data Frame&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;  &lt;ul&gt;   &lt;li&gt;    Lists having class &amp;#39;data.frame&amp;#39; are called data frames.&lt;/li&gt;   &lt;li&gt;    The components of the list must be vectors, factors, numeric matrices, lists or other data frames. Vector structures appearing as variables in the data frame must all have the same length. Matrix structures appearing as variables in the data frame must all have the same number of rows.&lt;/li&gt;   &lt;li&gt;    Data frames can be created using the function data.frame().&lt;/li&gt;  &lt;/ul&gt;&lt;/body&gt;&lt;/html&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1420106977523133705?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1420106977523133705/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1420106977523133705' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1420106977523133705'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1420106977523133705'/><link 
rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2011/10/r-turorial-ii.html' title='R - Turorial II'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2472704242300874706</id><published>2011-10-06T23:57:00.000-07:00</published><updated>2011-10-12T00:14:56.133-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R - Tutorial I</title><content type='html'>&lt;html&gt;  &lt;body&gt;  &lt;h2&gt;   &lt;span style="color: rgb(255, 255, 224);"&gt;&lt;span style="background-color: rgb(128, 0, 0);"&gt;Basics&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;  &lt;ul&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;Start R in Windows using the program menu.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;To quit :&amp;nbsp; &lt;em&gt;&lt;u&gt;q()&lt;/u&gt;&lt;/em&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;To get help for a function, use&lt;em&gt;&lt;u&gt; help([function])&lt;/u&gt;&lt;/em&gt; or &lt;em&gt;&lt;u&gt;?[function]&lt;/u&gt;&lt;/em&gt;. Use double quotes to escape special characters and tokens, e.g. 
?&amp;quot;for&amp;quot;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;Use objects() or ls() to list the stored objects, and rm([object]) to remove an object.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;x = 1:4 creates the sequence of numbers 1 2 3 4. The colon operator has higher priority than the arithmetic operators +, -, * and /.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;The &lt;em&gt;&lt;u&gt;seq() &lt;/u&gt;&lt;/em&gt;function can also be used to generate sequences. It has five arguments: from, to, by (increment), length.out (length of the sequence) and along.with (take the length from the length of this argument).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;R objects have a mode, i.e. the basic type of &amp;quot;atomic&amp;quot; objects ( numeric, complex, logical, character and raw ). The mode of list objects is list. Other modes include function and expression. Objects also have a property called length.&amp;nbsp; &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-size: 14px;"&gt;&lt;span style="font-family: times new roman,times,serif;"&gt;&lt;u&gt;as.integer(x)&lt;/u&gt; coerces x to integer. There are numerous other functions of the form as.* for different coercions. Type &lt;/span&gt;apropos(&amp;quot;as.*&amp;quot;) in the R console to see them.&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     E&lt;span style="font-size: 14px;"&gt;ach object has a class. For vectors it is generally the same as the mode. 
Class can be used for object-oriented programming. Method dispatch is based on the class of the first argument passed to the generic function. An example of a class is the value returned from fitting a linear model using &amp;#39;lm&amp;#39;: the output is an object of class &amp;#39;lm&amp;#39;.&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-family: times new roman,times,serif;"&gt;&lt;span style="font-size: 14px;"&gt;&lt;strong&gt;&lt;span style="color: rgb(0, 128, 0);"&gt;Factors &lt;/span&gt;&lt;/strong&gt;can be used to prepare categorical values for statistical analysis. The function factor(a) assigns integers to the unique values (levels) in the vector a, stores the result as a vector of integers, and also stores the mapping between the integer values and the actual values in a. factor(a) can then be used in statistical analysis (summary function etc). The levels are stored in the sorted order of the values in a. To explicitly provide another order, use the levels argument or the ordered function. Read more at: &lt;a href="http://www.ats.ucla.edu/stat/R/modules/factor_variables.htm" target="_blank"&gt;link1&lt;/a&gt; , &lt;a href="http://www.statmethods.net/input/datatypes.html" target="_blank"&gt;link2&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;  &lt;/ul&gt;  &lt;h2&gt;   &lt;span style="color: rgb(255, 240, 245);"&gt;&lt;span style="background-color: rgb(128, 0, 0);"&gt;Vectors&lt;/span&gt;&lt;/span&gt;&lt;/h2&gt;  &lt;ul&gt;   &lt;li&gt;    &lt;span style="font-size: 14px;"&gt;Create a vector named a: a = c(1,2,3). c is the generic function to combine arguments. 
The output type is the highest of NULL &amp;lt; raw &amp;lt; logical &amp;lt; integer &amp;lt; double &amp;lt; complex &amp;lt; character &amp;lt; list &amp;lt; expression&lt;/span&gt;&lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 100, 0);"&gt;&lt;span style="font-size: 11px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;With a = c(1,2,3), b=c(&amp;quot;!&amp;quot;,2,a) gives &amp;quot;!&amp;quot; &amp;quot;2&amp;quot; &amp;quot;1&amp;quot; &amp;quot;2&amp;quot; &amp;quot;3&amp;quot; and b=c(2,a) gives 2 1 2 3. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;    &lt;span style="font-size: 14px;"&gt;Vector elements can have names. names(a) = c(&amp;quot;first&amp;quot;,&amp;quot;second&amp;quot;,&amp;quot;third&amp;quot;)&amp;nbsp;&lt;/span&gt;&lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 100, 0);"&gt;&lt;span style="font-size: 11px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; a&lt;br /&gt;   &amp;nbsp;first second&amp;nbsp; third&lt;br /&gt;   &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;    &lt;span style="font-size: 14px;"&gt;+,-,*,/ : These operators are applied element by element. 
Shorter vectors are recycled.&lt;/span&gt;&lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 100, 0);"&gt;&lt;span style="font-size: 11px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; a=c(1,2,3)&lt;br /&gt;   &amp;gt; b=c(4,5)&lt;br /&gt;   &amp;gt; a+b&lt;br /&gt;   [1] 5 7 7&lt;br /&gt;   Warning message:&lt;br /&gt;   In a + b : longer object length is not a multiple of shorter object length&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-family: times new roman,times,serif;"&gt;&lt;span style="font-size: 14px;"&gt;Functions - max (maximum value), min (minimum value), range = c(min(x), max(x)), length (length of vector), sum (total of all elements), prod (product of all elements), mean = sum/length, var (sample variance), sort (sorting in ascending order), pmax (parallel maximum: a vector of the highest elements at each position across the argument vectors)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-family: times new roman,times,serif;"&gt;&lt;span style="font-size: 14px;"&gt;Logical vectors contain TRUE, FALSE and NA.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-family: times new roman,times,serif;"&gt;&lt;span style="font-size: 14px;"&gt;Missing values are given as NA. is.na(a) returns a logical vector of the same length as a, with TRUE where the element of a is &amp;#39;NA&amp;#39; or &amp;#39;NaN&amp;#39; , and FALSE otherwise.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;   &lt;li&gt;    &lt;div&gt;     &lt;span style="font-family: times new roman,times,serif;"&gt;&lt;span style="font-size: 14px;"&gt;The function &lt;em&gt;&lt;u&gt;paste()&lt;/u&gt;&lt;/em&gt; can be used to combine the strings of two or more vectors element by element; 
vectors are recycled if required&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;   &lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 100, 0);"&gt;&lt;span style="font-size: 11px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt;&amp;nbsp; paste(c(&amp;quot;X&amp;quot;,&amp;quot;Y&amp;quot;), 1:10,2:5)&lt;br /&gt;   &amp;nbsp;[1] &amp;quot;X 1 2&amp;quot;&amp;nbsp; &amp;quot;Y 2 3&amp;quot;&amp;nbsp; &amp;quot;X 3 4&amp;quot;&amp;nbsp; &amp;quot;Y 4 5&amp;quot;&amp;nbsp; &amp;quot;X 5 2&amp;quot;&amp;nbsp; &amp;quot;Y 6 3&amp;quot;&amp;nbsp; &amp;quot;X 7 4&amp;quot;&amp;nbsp; &amp;quot;Y 8 5&amp;quot;&amp;nbsp; &amp;quot;X 9 2&amp;quot;&amp;nbsp; &amp;quot;Y 10 3&amp;quot;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;    Index Vector: in the example below, b is the index vector in various forms.&lt;/li&gt;  &lt;/ul&gt;  &lt;p style="margin-left: 40px;"&gt;   &lt;span style="color: rgb(0, 100, 0);"&gt;&lt;span style="font-size: 11px;"&gt;&lt;span style="font-family: courier new,courier,monospace;"&gt;&amp;gt; a=c(1,2,3,4,5,6,7,8,9,10)&lt;br /&gt;   &amp;gt; b=c(TRUE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)&lt;br /&gt;   &amp;gt; a[b]&lt;br /&gt;   [1]&amp;nbsp; 1&amp;nbsp; 4&amp;nbsp; 5&amp;nbsp; 6&amp;nbsp; 8 10&lt;br /&gt;   &amp;gt; b=c(1:5)&lt;br /&gt;   &amp;gt; a[b]&lt;br /&gt;   [1] 1 2 3 4 5&lt;br /&gt;   &amp;gt; b=-(1:5)&lt;br /&gt;   &amp;gt; a[b]&lt;br /&gt;   [1]&amp;nbsp; 6&amp;nbsp; 7&amp;nbsp; 8&amp;nbsp; 9 10&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt; &lt;/body&gt;&lt;/html&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2472704242300874706?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2472704242300874706/comments/default' title='Post Comments'/><link 
rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2472704242300874706' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2472704242300874706'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2472704242300874706'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2011/10/r-basics-notes-i.html' title='R - Tutorial I'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-5878535177839004116</id><published>2011-09-30T02:08:00.000-07:00</published><updated>2011-09-30T02:09:32.257-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R and Java - JRI Using Netbeans</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Setting up R-Java in Netbeans is pretty straightforward. For those who need a walkthrough, here are the steps.&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;Download and install R (for this example, version 2.12.2). Install the rJava package.&amp;nbsp;&lt;/li&gt;&lt;li&gt;&amp;nbsp;Download and install Netbeans.&amp;nbsp;&lt;/li&gt;&lt;li&gt;In Netbeans, create a new Java project. We will call it RJava.&lt;/li&gt;&lt;li&gt;&amp;nbsp;From the [R_HOME]/library/rJava/jri/examples folder, copy the file rtest.java into the project.&lt;/li&gt;&lt;li&gt;Create a library in NetBeans and add the following jars to the library. 
- JRI.jar, JRIEngine.jar, REngine.jar. These files are present in [R_HOME]/library/rJava/jri/. Here's what the project looks like:&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-9aOFecDd730/ToWEQvjLZZI/AAAAAAAAEOc/zdSiONVQupY/s1600/rJavaFolder.jpg" imageanchor="1"&gt;&lt;img border="0" height="400" src="http://1.bp.blogspot.com/-9aOFecDd730/ToWEQvjLZZI/AAAAAAAAEOc/zdSiONVQupY/s400/rJavaFolder.jpg" width="314" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/li&gt;&lt;li&gt;Minimize Netbeans. Add the R_HOME environment variable (right click My Computer -&gt; Properties -&gt; Advanced -&gt; Environment Variables). R_HOME should point to the location where R is installed.&lt;/li&gt;&lt;li&gt;Edit the PATH variable and append the following: (i) [R_HOME]\library\rJava\jri and (ii) [R_HOME]\bin&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-hLcJ0694wRY/ToWG_0fE48I/AAAAAAAAEOk/FzaEv18-pgU/s1600/R_HOME.jpg" imageanchor="1" style=""&gt;&lt;img border="0" height="400" width="360" src="http://3.bp.blogspot.com/-hLcJ0694wRY/ToWG_0fE48I/AAAAAAAAEOk/FzaEv18-pgU/s400/R_HOME.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;/li&gt;&lt;/ol&gt;That's it. Right click on rtest.java and click on 'Run File'. 
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-5878535177839004116?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/5878535177839004116/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=5878535177839004116' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5878535177839004116'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5878535177839004116'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2011/09/r-and-java-jri-using-netbeans.html' title='R and Java - JRI Using Netbeans'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-9aOFecDd730/ToWEQvjLZZI/AAAAAAAAEOc/zdSiONVQupY/s72-c/rJavaFolder.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-4798843520314519407</id><published>2011-04-11T01:17:00.000-07:00</published><updated>2011-04-11T01:32:53.215-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R and Java - JRI using eclipse on 64 bit machines</title><content type='html'>The steps to install rJava on a 64 bit machine are not very different from those for a 32 bit machine. 
However, here are the exact steps.&lt;br /&gt;&lt;br /&gt;1. Install 64 bit Java.&lt;br /&gt;2. Install R 2.12.X&lt;br /&gt;3. Start R and install the rJava package using the package installer.&lt;br /&gt;4. Start Eclipse. Create a project called RTest. Copy the rtest.java and rtest2.java files from the examples folder of rJava (R-2.12.2\library\rJava\jri\examples)&lt;br /&gt;5. Create a lib folder in the RTest project and copy the jri.jar file from R-2.12.2\library\rJava\jri into it.&lt;br /&gt;6. Add jri.jar to the classpath of the project. Ensure that the project compiles without error.&lt;br /&gt;Here's the folder structure&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-pwwSbmVtemM/TaK6mo5bOaI/AAAAAAAAD8o/K02to-Eicu0/s1600/rjava64.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 230px; height: 197px;" src="http://2.bp.blogspot.com/-pwwSbmVtemM/TaK6mo5bOaI/AAAAAAAAD8o/K02to-Eicu0/s400/rjava64.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5594238860123650466" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;7. Before running the rtest.java class, the run configuration needs to be edited. Open the run configuration by clicking on Run-&gt;Run Configuration. Select RTest as the project and rtest as the main class. Click on Environment and add the variable 'PATH'. It should contain the paths to the following:&lt;br /&gt;a. The bin directory of R (64 bit) : \R-2.12.2\bin\x64;&lt;br /&gt;&lt;br /&gt;b. 
The jri directory for rjava (64bit) : \R-2.12.2\library\rJava\jri\x64.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-lNbs3bcDkbk/TaK72yW9G_I/AAAAAAAAD8w/hQNJyH4_WKA/s1600/rjava64b.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 344px;" src="http://1.bp.blogspot.com/-lNbs3bcDkbk/TaK72yW9G_I/AAAAAAAAD8w/hQNJyH4_WKA/s400/rjava64b.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5594240237052959730" /&gt;&lt;/a&gt;&lt;br /&gt;8. click on apply and then run.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-4798843520314519407?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/4798843520314519407/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=4798843520314519407' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4798843520314519407'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4798843520314519407'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2011/04/r-and-java-installing-rjava-on-64-bit.html' title='R and Java - JRI using eclipse on 64 bit machines'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/-pwwSbmVtemM/TaK6mo5bOaI/AAAAAAAAD8o/K02to-Eicu0/s72-c/rjava64.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-3484926886643458326</id><published>2010-11-16T01:46:00.000-08:00</published><updated>2010-11-16T02:17:37.042-08:00</updated><title type='text'>Oracle BI applications - Installation</title><content type='html'>In this post we look at installing and configuring Oracle BI Applications. I managed to install the application on an Intel Core 2 Duo laptop with 2 GB RAM running Windows Vista.&lt;br /&gt;We need source data to view the dashboards for BI Apps; the source data can be obtained from the Oracle E-Business Suite Vision database. This post explains how to install Oracle E-Business Suite in order to get the Vision database up and running.&lt;br /&gt;&lt;br /&gt;Step 1: Installing the Oracle E-Business Suite (EBS).&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;EBS can be downloaded from the Oracle site. It is a huge download with multiple files (42 GB). &lt;a href="http://www.databasejournal.com/features/oracle/article.php/3768101/Installing-Oracle-E-Business-Suite-R12-on-Windows-2003.htm"&gt;Here's&lt;/a&gt; an article that describes which files to download.&lt;/li&gt;&lt;li&gt;Once the files are downloaded we need to extract them to a staging directory. 
the extract can be done manually using an unzip utility.&lt;/li&gt;&lt;/ul&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJVjGTHXVI/AAAAAAAADvA/K0FYLW5IOg4/s1600/ebs1.jpg"&gt;&lt;img style="cursor: pointer; width: 400px; height: 197px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJVjGTHXVI/AAAAAAAADvA/K0FYLW5IOg4/s400/ebs1.jpg" alt="" id="BLOGGER_PHOTO_ID_5540084553093700946" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Extract the files to the folder structure as shown above.&lt;/li&gt;&lt;li&gt;Start the rapidwiz wizard for ebs installation&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJV6yWoX3I/AAAAAAAADvI/rnMnf38UM_w/s1600/ebs2.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 201px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJV6yWoX3I/AAAAAAAADvI/rnMnf38UM_w/s400/ebs2.png" alt="" id="BLOGGER_PHOTO_ID_5540084960056598386" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWjNB1IWI/AAAAAAAADvQ/vyfkQ1FtETc/s1600/ebs3.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWjNB1IWI/AAAAAAAADvQ/vyfkQ1FtETc/s400/ebs3.png" alt="" id="BLOGGER_PHOTO_ID_5540085654411878754" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJWjKhopdI/AAAAAAAADvY/K29aBsqy_wg/s1600/ebs4.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJWjKhopdI/AAAAAAAADvY/K29aBsqy_wg/s400/ebs4.png" alt="" id="BLOGGER_PHOTO_ID_5540085653739972050" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJWjTxGXKI/AAAAAAAADvg/9gyDu1ufvKU/s1600/ebs5.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJWjTxGXKI/AAAAAAAADvg/9gyDu1ufvKU/s400/ebs5.png" alt="" id="BLOGGER_PHOTO_ID_5540085656220753058" border="0" 
/&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If the system already runs another Oracle instance, it is a good idea to shift the ports by 3. In this example, since we will be using other Oracle instances for Informatica and DAC, I have shifted the ports by 3.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWjgr_RiI/AAAAAAAADvo/AWi5lJrOgbE/s1600/ebs6.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWjgr_RiI/AAAAAAAADvo/AWi5lJrOgbE/s400/ebs6.png" alt="" id="BLOGGER_PHOTO_ID_5540085659688977954" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;For this example I have installed the EBS instance on an external USB hard drive, since the install requires a lot of space.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWj1LeFcI/AAAAAAAADvw/jYcIohhJa64/s1600/ebs7.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 320px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TOJWj1LeFcI/AAAAAAAADvw/jYcIohhJa64/s400/ebs7.png" alt="" id="BLOGGER_PHOTO_ID_5540085665189729730" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Select Vision Demo Database as the database type; this will install the sample data that we need to load into OBIA.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJWtfcKAjI/AAAAAAAADv4/AYbcwfDS1QI/s1600/ebs8.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJWtfcKAjI/AAAAAAAADv4/AYbcwfDS1QI/s400/ebs8.png" alt="" id="BLOGGER_PHOTO_ID_5540085831152828978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;EBS installation requires Cygwin and VB. 
provide the appropriate paths.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJWtvMULbI/AAAAAAAADwA/dp0hTknJGh4/s1600/ebs9.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJWtvMULbI/AAAAAAAADwA/dp0hTknJGh4/s400/ebs9.png" alt="" id="BLOGGER_PHOTO_ID_5540085835381353906" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJWvEjY2kI/AAAAAAAADwI/-c7G0S-W0vE/s1600/ebs10.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJWvEjY2kI/AAAAAAAADwI/-c7G0S-W0vE/s400/ebs10.png" alt="" id="BLOGGER_PHOTO_ID_5540085858295142978" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJWvAZ2S2I/AAAAAAAADwQ/Sm1tmW6hg-g/s1600/ebs11.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 300px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJWvAZ2S2I/AAAAAAAADwQ/Sm1tmW6hg-g/s400/ebs11.png" alt="" id="BLOGGER_PHOTO_ID_5540085857181387618" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The OS User group check fails for Vista for some reason.&lt;br /&gt;lets ignore them for now.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJW4G2Pl6I/AAAAAAAADwg/XOkMuXwxpEE/s1600/ebs12.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJW4G2Pl6I/AAAAAAAADwg/XOkMuXwxpEE/s400/ebs12.png" alt="" id="BLOGGER_PHOTO_ID_5540086013529921442" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJW4V2lX5I/AAAAAAAADwo/g3wxIoKIfBE/s1600/ebs13.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 267px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TOJW4V2lX5I/AAAAAAAADwo/g3wxIoKIfBE/s400/ebs13.png" alt="" id="BLOGGER_PHOTO_ID_5540086017557880722" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a 
href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJW42xlflI/AAAAAAAADww/rbNBneK8d3o/s1600/ebs14.png"&gt;&lt;img style="cursor: pointer; width: 400px; height: 156px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TOJW42xlflI/AAAAAAAADww/rbNBneK8d3o/s400/ebs14.png" alt="" id="BLOGGER_PHOTO_ID_5540086026395287122" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The installation process may take a long time. My installation failed after step 3 above. However, the Vision database is installed before that point. Since we need only the database, we can ignore the rest of the install.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Step 2: The next step is to prepare the Oracle instances that will be used by Informatica and DAC. It is a good idea to use a different user for each area.&lt;br /&gt;&lt;br /&gt; &lt;div style="direction: ltr;"&gt;  &lt;table valign="top" style="direction: ltr; border-collapse: collapse; border: 1pt solid rgb(163, 163, 163);" border="1" cellpadding="0" cellspacing="0"&gt;  &lt;tbody&gt;&lt;tr&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 1.6048in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;Database type&lt;/p&gt;   &lt;/td&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 0.6673in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;Name&lt;/p&gt;   &lt;/td&gt;     &lt;/tr&gt;  &lt;tr&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 1.6048in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;Informatica repository&lt;/p&gt;   &lt;/td&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 0.6673in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;dwa&lt;/p&gt;   &lt;/td&gt;     &lt;/tr&gt;  &lt;tr&gt;   &lt;td style="border: 1pt 
solid rgb(163, 163, 163); vertical-align: top; width: 1.6048in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;DAC repository&lt;/p&gt;   &lt;/td&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 0.6673in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;dacrep&lt;/p&gt;   &lt;/td&gt;     &lt;/tr&gt;  &lt;tr&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 1.6048in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;OBIA data warehouse&lt;/p&gt;   &lt;/td&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 0.6673in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;obdw&lt;/p&gt;   &lt;/td&gt;     &lt;/tr&gt;  &lt;tr&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 1.6048in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;Source&lt;/p&gt;   &lt;/td&gt;   &lt;td style="border: 1pt solid rgb(163, 163, 163); vertical-align: top; width: 0.6673in; padding: 4pt;"&gt;   &lt;p style="margin: 0in; font-family: Calibri; font-size: 11pt;"&gt;vis&lt;/p&gt;   &lt;/td&gt;     &lt;/tr&gt; &lt;/tbody&gt;&lt;/table&gt;  &lt;/div&gt; &lt;br /&gt;The Name column refers to the Oracle user. vis refers to the Vision database instance. 
Note that the database can be accessed with apps/apps.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Create an SSE_ROLE as described in section 4.4.1.1 of the Installation Guide for Informatica PowerCenter Users Version 7.9.6.1 (install guide).&lt;/li&gt;&lt;li&gt;Set the NLS_LANG variable as described in section 4.4.2.1 of the install guide.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Step 3: Installing Oracle BI Apps&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-3484926886643458326?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/3484926886643458326/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=3484926886643458326' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3484926886643458326'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3484926886643458326'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/11/oracle-bi-applications-installation.html' title='Oracle BI applications - Installation'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/TOJVjGTHXVI/AAAAAAAADvA/K0FYLW5IOg4/s72-c/ebs1.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2271836994698286601</id><published>2010-07-25T21:41:00.000-07:00</published><updated>2010-07-26T02:06:56.268-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Configuring the publisher/scheduler for MySql</title><content type='html'>&lt;div style="text-align: justify;"&gt;Publisher can be used to share and distribute reports. Scheduler is a Quartz-based scheduler for reporting jobs. In this post we will look at sending a report to a user by e-mail using the scheduler and publisher.&lt;br /&gt;Steps&lt;br /&gt;1. Log in to BI Publisher. If you face any problems during login, see the following troubleshooting links:&lt;br /&gt;http://forums.oracle.com/forums/thread.jspa?threadID=582633&amp;amp;start=0&amp;amp;tstart=60&lt;br /&gt;http://oraclebizint.wordpress.com/2007/11/06/oracle-bi-publisher-and-bi-ee-invisible-admin-tab/&lt;br /&gt;http://onlineappsdba.com/index.php/2009/01/15/oracle-bi-publisher-admin-console-xmlpserver-login-issue-administratoradministrator/&lt;br /&gt;2. Note that the scheduler tab will be inactive.&lt;br /&gt;3. To configure the scheduler, open the Administration Tool and connect to the repository. Click on Manage-&gt;Jobs to launch the Job Manager.&lt;br /&gt;4. In the Job Manager, click on File -&gt; Configuration Options.&lt;br /&gt;5. Click on the Scheduler tab.&lt;br /&gt;6. Click on the Database tab on the second row. Fill in the relevant information as shown in the screenshot. 
The DSN can be created as shown in earlier posts.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE0fxUASLZI/AAAAAAAADew/bq1AwWdXw3o/s1600/image32.JPG"&gt;&lt;img style="cursor: pointer; width: 386px; height: 400px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE0fxUASLZI/AAAAAAAADew/bq1AwWdXw3o/s400/image32.JPG" alt="" id="BLOGGER_PHOTO_ID_5498085652133195154" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;7. Click on the General tab on the second row and fill in the username and password. Check the scheduler script path and default script path (the default values should be fine).&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE0ig6boTfI/AAAAAAAADe4/aGFBOMI9wqY/s1600/image33.JPG"&gt;&lt;img style="cursor: pointer; width: 386px; height: 400px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE0ig6boTfI/AAAAAAAADe4/aGFBOMI9wqY/s400/image33.JPG" alt="" id="BLOGGER_PHOTO_ID_5498088668925545970" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;8. Click on OK to close the Job Manager dialog.&lt;br /&gt;9. Open BI Publisher. Click on the Admin tab (if the Admin tab is not visible, look at the links posted in step 1).&lt;br /&gt;10. In the data sources section of the Admin tab, click on the link that says 'JDBC connection'.&lt;br /&gt;11. Click on the tab that says JDBC.&lt;br /&gt;12. The default JDBC connections will be visible. We need to add the MySQL connection to this list. This link explains how to do that:&lt;br /&gt;http://www.iwarelogic.com/blog/how-to-configure-mysql-database-connectivity-in-bi-publisher-485&lt;br /&gt;13. 
Once that is done, go to the JDBC tab and create a new data source for MySQL.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TE0wzTm7AzI/AAAAAAAADfI/MilGQYvF-TE/s1600/image35.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 225px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TE0wzTm7AzI/AAAAAAAADfI/MilGQYvF-TE/s400/image35.JPG" alt="" id="BLOGGER_PHOTO_ID_5498104378084229938" border="0" /&gt;&lt;/a&gt;. The radio button that says 'user proxy authentication' should not be selected.&lt;br /&gt;14. Next, click on the Admin tab in BI Publisher and select the link that says 'Scheduler configuration' in the System Maintenance section. Click on the tab that says Scheduler Configuration. Fill in the MySQL JDBC connection settings as shown.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TE0uXWQlBfI/AAAAAAAADfA/ZLuZZhrgu3s/s1600/image34.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 214px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TE0uXWQlBfI/AAAAAAAADfA/ZLuZZhrgu3s/s400/image34.JPG" alt="" id="BLOGGER_PHOTO_ID_5498101698736227826" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;15. Click on 'test connection'. Once the connection is successful, click on 'install schema'. This installs the scheduler-related schema in the foodmart database of MySQL.&lt;br /&gt;16. We need to create some more tables so that the scheduler service works. Go to the [oraclebi]/server/Schema location. There is a file called SAJOBS.Oracle.sql. Modify this file to suit MySQL and use it to create the tables in the foodmart schema of MySQL.&lt;br /&gt;&lt;oraclebi&gt;17. We will now attempt to start the scheduler service. Stop the BI Server, then start the scheduler service followed by the BI Server (make sure MySQL is running). 
It is important to start the scheduler service before the BI Server; otherwise the schedule tab in BI Publisher will not be enabled.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;18. We will now create the report. Click on the Reports tab in BI Publisher.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;19. Click on 'create a new report'. Once the report is created, open it in edit mode.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;20. Add a data model to the report. &lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;21. In the new data model page, select SQL Query as the Type. Use Oracle BI EE as the data source. Click on Query Builder to build the report query.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE1GRpCRC2I/AAAAAAAADfQ/JuOXSz_nFQg/s1600/image36.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 237px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TE1GRpCRC2I/AAAAAAAADfQ/JuOXSz_nFQg/s400/image36.JPG" alt="" id="BLOGGER_PHOTO_ID_5498127988976323426" border="0" /&gt;&lt;/a&gt;&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;22. Click on Layout and create a new layout based on the default data model.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;23. Save the report and click on View. The report should be visible.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;24. To schedule the report, click on Schedule when the report is in view or edit mode.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;25. Fill in the relevant details. Use E-mail as the notification channel. Enter the destination e-mail address under e-mail delivery.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;26. Before scheduling the report, however, the e-mail options need to be set in BI Publisher. To do so, click on the Admin tab and click the link that says Email in the Delivery section. 
Click on Add server and add the connection info for the smtp server that will be used for sending the mail.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;27. once scheduling is complete, look at the schedules tab for a list of jobs that have been scheduled.&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;&lt;/oraclebi&gt;&lt;br /&gt;&lt;oraclebi&gt;&lt;/oraclebi&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2271836994698286601?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2271836994698286601/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2271836994698286601' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2271836994698286601'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2271836994698286601'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/07/obiee-configuring-publisherscheduler.html' title='OBIEE - Configuring the publisher/scheduler for MySql'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/TE0fxUASLZI/AAAAAAAADew/bq1AwWdXw3o/s72-c/image32.JPG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-3375282693184229290</id><published>2010-07-21T23:51:00.001-07:00</published><updated>2010-07-22T01:57:22.315-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Session variables and row level security</title><content type='html'>In this post we will look at creating and using a session variable to implement row-level security in Answers. Situations where this is useful include:&lt;br /&gt;1. Allowing a user to see only the data that she has access to.&lt;br /&gt;2. Showing data based on the current date.&lt;br /&gt;3. A sales manager can be shown data for his region only, while a CEO can be shown data for all regions.&lt;br /&gt;&lt;br /&gt;In this post we look at showing units ordered in the current month. We use a security filter to restrict the data to the current month.&lt;br /&gt;Steps:&lt;br /&gt;1. The first step is to create the session variable for the current month. To do so:&lt;br /&gt; a. In the Administration window, click on Action -&gt; New -&gt; Session -&gt; Variable. Give CURRENT_MONTH as the name of the variable. Click on 'New' near the initialization block.&lt;br /&gt; b. Give CURRENT_MONTH_INIT as the name of the initialization block. Click on Edit Data Source.&lt;br /&gt; c. A new window opens. Select the connection pool by using the browse button. &lt;br /&gt; d. Use database as the data source type. &lt;br /&gt; e. Type the following query into the default initialization string: " select month(curdate()); " 
&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TEgD86K20_I/AAAAAAAADdw/wyxTyxaC_ck/s1600/image27.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 274px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TEgD86K20_I/AAAAAAAADdw/wyxTyxaC_ck/s400/image27.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5496647690147517426" /&gt;&lt;/a&gt;&lt;br /&gt; f. Click OK to close the dialog.&lt;br /&gt; g. In the Session Variable Initialization Block dialog, click on Edit Data Target. &lt;br /&gt; h. Select the CURRENT_MONTH variable. Click on OK.&lt;br /&gt; i. Click on OK to create the session variable.&lt;br /&gt;2. The next step is to use this session variable to filter the results for the current month.&lt;br /&gt; a. In the Administration Tool, click on Manage -&gt; Security.&lt;br /&gt; b. Create a new user called MonthlyUser.&lt;br /&gt; c. Create a new group called MonthlyUserGroup. Assign MonthlyUser to this group.&lt;br /&gt; d. Open the MonthlyUserGroup dialog and click on Permissions.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEgFgOtWIMI/AAAAAAAADd4/Y4zihtsMyPs/s1600/image28.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 350px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEgFgOtWIMI/AAAAAAAADd4/Y4zihtsMyPs/s400/image28.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5496649396467933378" /&gt;&lt;/a&gt;&lt;br /&gt; e. Click the tab that says Filters. Click on 'Add'.&lt;br /&gt; f. For the name of the filter, select the table that you want to apply the filter to. In this case we select Foodmart.store. &lt;br /&gt; g. Click on the ellipsis in the business model filter column. 
Apply the following filter "FoodMart"."time_by_day"."month_of_year" = VALUEOF(NQ_SESSION.CURRENT_MONTH)&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEgGQ13LDBI/AAAAAAAADeA/MNbEDhcIgXo/s1600/image29.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 302px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEgGQ13LDBI/AAAAAAAADeA/MNbEDhcIgXo/s400/image29.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5496650231611853842" /&gt;&lt;/a&gt;&lt;br /&gt; h. the group is now created.&lt;br /&gt;3. Login to BI answers using the MonthlyUser user. Select the columns from the store database. view results. You will notice that the results show data for the current month only.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TEgG5jHN6rI/AAAAAAAADeI/_FteqQRlmyg/s1600/image30.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 237px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TEgG5jHN6rI/AAAAAAAADeI/_FteqQRlmyg/s400/image30.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5496650930953513650" /&gt;&lt;/a&gt;&lt;br /&gt;If you login by a user from the administrators group, data for all months will be visible.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TEgHUGBnFbI/AAAAAAAADeQ/sw8BvwimuF8/s1600/image31.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 284px; height: 400px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TEgHUGBnFbI/AAAAAAAADeQ/sw8BvwimuF8/s400/image31.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5496651387001836978" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-3375282693184229290?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' 
type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/3375282693184229290/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=3375282693184229290' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3375282693184229290'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3375282693184229290'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/07/obiee-session-variables-and-row-level.html' title='OBIEE - Session variables and row level security'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/TEgD86K20_I/AAAAAAAADdw/wyxTyxaC_ck/s72-c/image27.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7188549687679787995</id><published>2010-07-21T02:42:00.000-07:00</published><updated>2010-07-21T03:19:27.680-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Creating and using dynamic repository variable</title><content type='html'>&lt;div style="text-align: justify;"&gt;We will create and use a dynamic repository variable in the dashboard. The aim is to show current time in the dashboard.&lt;br /&gt;We use the variable in two places. 
The first is in the title of a report:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbB9NoKtWI/AAAAAAAADdQ/mYlOJ594kFc/s1600/image23.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 236px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbB9NoKtWI/AAAAAAAADdQ/mYlOJ594kFc/s400/image23.JPG" alt="" id="BLOGGER_PHOTO_ID_5496293652626584930" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The second is in the narrative text:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEbEvZXzhoI/AAAAAAAADdY/-S8A_eNTIe8/s1600/image24.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 45px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TEbEvZXzhoI/AAAAAAAADdY/-S8A_eNTIe8/s400/image24.JPG" alt="" id="BLOGGER_PHOTO_ID_5496296713795896962" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Steps:&lt;br /&gt;1. Create a dynamic repository variable that obtains the latest time. We will need to write an initialization block to initialize and update this variable. To create the variable:&lt;br /&gt;a. In the Administration Tool, click on Manage-&gt; Variables.&lt;br /&gt;b. Click on Action-&gt; New -&gt; Repository -&gt; Variable.&lt;br /&gt;c. Give 'Date' as the name of the variable. Select the radio button that says 'Dynamic'. The Initialization Block drop-down will be enabled.&lt;br /&gt;d. Next to the Initialization Block drop-down, click on New. A window for creating a new initialization block opens.&lt;br /&gt;e. Enter 'CurrentTime' as the name of the block. Set the refresh interval to one minute and leave 'start on' at its default.&lt;br /&gt;f. Click on Edit Data Source and enter the following query (MySQL):&lt;br /&gt;SELECT DATE_FORMAT(now(),'%a %D %b %Y %H:%i %p');&lt;br /&gt;g. 
Click on the browse button near 'Connection pool' and select the foodmart connection pool.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbG4MDaICI/AAAAAAAADdg/qn5nV0ZyQd8/s1600/image25.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 355px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbG4MDaICI/AAAAAAAADdg/qn5nV0ZyQd8/s400/image25.JPG" alt="" id="BLOGGER_PHOTO_ID_5496299063862763554" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;h. Click on 'Test'. The results should show the current time.&lt;br /&gt;i. Click on OK.&lt;br /&gt;j. In the initialization dialog, click on Edit Data Target. Select the Date variable and click on OK.&lt;br /&gt;k. Close the initialization dialog by clicking OK.&lt;br /&gt;l. Close the dynamic repository variable dialog by clicking OK.&lt;br /&gt;&lt;br /&gt;We have now created a repository variable that obtains the current time from the database every minute.&lt;br /&gt;Check in the changes.&lt;br /&gt;&lt;br /&gt;2. Open the Dashboard page.&lt;br /&gt;3. To add this variable to a title (e.g. "Units Ordered"), click on the rename button on the title and change the title to&lt;br /&gt;Units Ordered -@{biServer.variables['Date']}&lt;br /&gt;4. To add the variable to narrative text, add the Text dashboard object to the dashboard. 
click on properties of the text object and write&lt;br /&gt;The current time is [b]@{biServer.variables['Date']}[/b]&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbIg5OL1JI/AAAAAAAADdo/mc769q7EIwA/s1600/image26.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 298px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbIg5OL1JI/AAAAAAAADdo/mc769q7EIwA/s400/image26.JPG" alt="" id="BLOGGER_PHOTO_ID_5496300862693954706" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This completes the tutorial&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7188549687679787995?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7188549687679787995/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7188549687679787995' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7188549687679787995'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7188549687679787995'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/07/obiee-using-dynamic-repository-variable.html' title='OBIEE - Creating and using dynamic repository variable'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://3.bp.blogspot.com/_GvNS-b8AbU4/TEbB9NoKtWI/AAAAAAAADdQ/mYlOJ594kFc/s72-c/image23.JPG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-5172869375796191216</id><published>2010-07-09T01:55:00.000-07:00</published><updated>2010-07-11T21:34:57.971-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Managing Cache</title><content type='html'>&lt;div style="text-align: justify;"&gt;Caching can be used for queries that are run repeatedly. Look at &lt;a href="http://obiee101.blogspot.com/2008/07/obiee-cache-management.html"&gt;this blog&lt;/a&gt; to understand how to enable and configure caching. In this post we will look at pre-caching a query. Let's look at the foodmart catalog that we created in the earlier post:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TDbmoM-O02I/AAAAAAAADcY/Yc43sgyWYpE/s1600/image19.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 310px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TDbmoM-O02I/AAAAAAAADcY/Yc43sgyWYpE/s400/image19.JPG" alt="" id="BLOGGER_PHOTO_ID_5491830373976298338" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;We will pre-cache the result of this report.&lt;br /&gt;Steps&lt;br /&gt;1. Create a text file that purges the entire cache and then creates a new cache entry from the Sales presentation table. Let's call the file SalesCache.txt. The contents of the file are:&lt;br /&gt;Call SAPurgeAllCache();&lt;br /&gt;select SALES.* from foodmart;&lt;br /&gt;&lt;br /&gt;The first line purges the entire cache and the second line creates a cache entry for all columns of the Sales table.&lt;br /&gt;&lt;br /&gt;2. Run this command from the command line:&lt;br /&gt;nqcmd -d FoodMartBI -u Administrator -p [password] -s [pathToCacheFile]/SalesCache.txt&lt;br /&gt;&lt;br /&gt;3. Open the Administration Tool. Click on Manage-&gt; Cache. 
The cache entry that was created is visible:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TDbzeOCMgTI/AAAAAAAADcg/28q2QB-awwo/s1600/image20.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 163px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TDbzeOCMgTI/AAAAAAAADcg/28q2QB-awwo/s400/image20.JPG" alt="" id="BLOGGER_PHOTO_ID_5491844496113828146" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;4. Create a new report using Answers:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TDb0Cj7Kl6I/AAAAAAAADco/SDY_0h4fUlA/s1600/image21.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 256px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TDb0Cj7Kl6I/AAAAAAAADco/SDY_0h4fUlA/s400/image21.JPG" alt="" id="BLOGGER_PHOTO_ID_5491845120465213346" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;View the results of the report. Since the report uses columns that are part of the cache, the results should have been served from the cache. To verify, open the Cache Manager again and look at the last-used timestamp of the cache entry that we created. The timestamp should now show the time when the new report was viewed.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TDb0t5j_Q4I/AAAAAAAADcw/p81mWJhQ0ZY/s1600/image22.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 163px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TDb0t5j_Q4I/AAAAAAAADcw/p81mWJhQ0ZY/s400/image22.JPG" alt="" id="BLOGGER_PHOTO_ID_5491845865007956866" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The 'Use count' number will also be incremented by one.&lt;br /&gt;&lt;br /&gt;Next, let us see how to disable caching for a physical table. Let's disable caching for the sales_fact_1997 table. Right-click on the table in the physical layer and go to the General tab. 
Clear the checkbox that says 'cacheable'. Now run the SalesCache.txt file again. Open the cache manager. Notice that no cache is created.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-5172869375796191216?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/5172869375796191216/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=5172869375796191216' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5172869375796191216'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5172869375796191216'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/07/obiee-managing-cache.html' title='OBIEE - Managing Cache'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/TDbmoM-O02I/AAAAAAAADcY/Yc43sgyWYpE/s72-c/image19.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7899108182638939803</id><published>2010-06-28T21:53:00.000-07:00</published><updated>2010-06-28T22:13:32.163-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Creating a Rank Measure</title><content type='html'>&lt;div 
style="text-align: justify;"&gt;We look at creating a rank measure in this post. We will use the sales_fact_1997 table for this example. Join the sales_fact_1997 table with the product, store, time_by_day and promotion tables in the physical layer. Drag the sales_fact_1997 and product tables to the logical layer and create the appropriate joins.&lt;br /&gt;We will create a SalesRank logical column in the sales_fact_1997 logical table. Follow these steps to create the column.&lt;br /&gt;1. Right click on the logical table and select new logical column. Enter SalesRank as the name of the column and select the check box that says 'use existing logical column as source'. Use the expression builder to build the rank expression as shown.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCl-pN2b5pI/AAAAAAAADbc/9obtCZ1gsnk/s1600/image14.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 375px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCl-pN2b5pI/AAAAAAAADbc/9obtCZ1gsnk/s400/image14.JPG" alt="" id="BLOGGER_PHOTO_ID_5488056867485378194" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Create a sales table in the presentation catalog and add the SalesRank, unit_sales and product name columns.&lt;br /&gt;&lt;br /&gt;Open Answers and create a new request. Drag the columns from the sales table to the request.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCl_alzGbCI/AAAAAAAADbk/x9J_46l02V4/s1600/image15.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 277px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCl_alzGbCI/AAAAAAAADbk/x9J_46l02V4/s400/image15.JPG" alt="" id="BLOGGER_PHOTO_ID_5488057715727428642" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This creates a report of products with ranked sales. 
To select the top 10 use a filter.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCmAKOsVhoI/AAAAAAAADbs/99nsade8rPI/s1600/image16.JPG"&gt;&lt;img style="cursor: pointer; width: 389px; height: 400px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCmAKOsVhoI/AAAAAAAADbs/99nsade8rPI/s400/image16.JPG" alt="" id="BLOGGER_PHOTO_ID_5488058534158763650" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCmAKR6u-CI/AAAAAAAADb0/o7ny3bKGBZI/s1600/image17.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 358px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCmAKR6u-CI/AAAAAAAADb0/o7ny3bKGBZI/s400/image17.JPG" alt="" id="BLOGGER_PHOTO_ID_5488058535024457762" border="0" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7899108182638939803?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7899108182638939803/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7899108182638939803' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7899108182638939803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7899108182638939803'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/obiee-creating-rank-measure.html' title='OBIEE - Creating a Rank Measure'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/TCl-pN2b5pI/AAAAAAAADbc/9obtCZ1gsnk/s72-c/image14.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-6607300084653225995</id><published>2010-06-27T22:37:00.000-07:00</published><updated>2010-06-27T22:52:53.648-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Using a Column Selector for additional Drill down</title><content type='html'>&lt;div style="text-align: justify;"&gt;This post continues the idea of drill down developed in the previous post. Let's say that in addition to the country and city we also need to drill down on year-&gt;month. We will add a column selector to the existing report. Here are the steps to do it.&lt;br /&gt;1. Add the time_by_day table to the business model. Join the inventory_fact_1997 and time_by_day tables using the time_id column.&lt;br /&gt;2. Drag the the_year and the_month columns from the business layer to the presentation layer.&lt;br /&gt;3. In the BI Answers screen, go to the criteria view and refresh the metadata to view the new columns. Add the the_year column to the report. 
Drag it to be the first column.&lt;br /&gt;4. Go to the results view and select 'column selector' from the drop down.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCg2E7QljoI/AAAAAAAADa8/9vOarQapN_o/s1600/image9.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 283px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCg2E7QljoI/AAAAAAAADa8/9vOarQapN_o/s400/image9.JPG" alt="" id="BLOGGER_PHOTO_ID_5487695604205588098" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;5. Select the 'include selector' checkbox in the first column, which says the_year. Click on the_month to add it to the the_year selector column.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCg25pk8ImI/AAAAAAAADbE/Q1X6FdheiW8/s1600/image10.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 209px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCg25pk8ImI/AAAAAAAADbE/Q1X6FdheiW8/s400/image10.JPG" alt="" id="BLOGGER_PHOTO_ID_5487696509992182370" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;6. In the results view select 'Compound layout' and add the 'column selector' to the report. Drag it to the top.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCg3XrnkctI/AAAAAAAADbM/CUZnBhlgnwM/s1600/image11.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 289px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCg3XrnkctI/AAAAAAAADbM/CUZnBhlgnwM/s400/image11.JPG" alt="" id="BLOGGER_PHOTO_ID_5487697025936159442" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The report can now be viewed by year or by month. 
Also the drill down on country is still enabled&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCg4CsjZmbI/AAAAAAAADbU/Zwg3Fskg4jk/s1600/image13.JPG"&gt;&lt;img style="cursor: pointer; width: 284px; height: 400px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCg4CsjZmbI/AAAAAAAADbU/Zwg3Fskg4jk/s400/image13.JPG" alt="" id="BLOGGER_PHOTO_ID_5487697764921481650" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-6607300084653225995?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/6607300084653225995/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=6607300084653225995' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6607300084653225995'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6607300084653225995'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/obiee-using-column-selector-for.html' title='OBIEE - Using a Column Selector for additional Drill down'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/TCg2E7QljoI/AAAAAAAADa8/9vOarQapN_o/s72-c/image9.JPG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7158991686309573903</id><published>2010-06-24T22:37:00.000-07:00</published><updated>2010-06-24T23:22:23.366-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Creating Hierarchy and Drill Down Table</title><content type='html'>In this post we look at creating a level-based hierarchy and generating a drill-down report from it. We will use the FoodMart database (described in previous posts).&lt;br /&gt;&lt;div style="text-align: justify;"&gt;Take a look at the inventory_fact_1997 table and the store table. Our aim is to find the units_ordered for the following hierarchy in the store table: country -&gt; city -&gt; store id.&lt;br /&gt;Here are the steps to create the hierarchy.&lt;br /&gt;1) Right click on the FoodMart Business Model in the Business Modeling Layer. Select New Object-&gt; Dimension.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCRDJX-Vu-I/AAAAAAAADZw/dnNheU1VWTs/s1600/image1.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 320px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCRDJX-Vu-I/AAAAAAAADZw/dnNheU1VWTs/s400/image1.JPG" alt="" id="BLOGGER_PHOTO_ID_5486584074377477090" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;2) Enter 'storeDim' as the name and click OK.&lt;br /&gt;3) Right click on storeDim -&gt; New Logical Level. Enter 'storeTotal' as the name. Select the 'Grand Total Level' check box. 
This level is the top of the hierarchy.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCREAPCQQcI/AAAAAAAADZ4/4Rhowj-4uPI/s1600/image2.JPG"&gt;&lt;img style="cursor: pointer; width: 300px; height: 400px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCREAPCQQcI/AAAAAAAADZ4/4Rhowj-4uPI/s400/image2.JPG" alt="" id="BLOGGER_PHOTO_ID_5486585016870781378" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;4) Right click on storeTotal and select New Object -&gt; Child Level. Enter 'Store_Country' as the name. Select 'Supports roll up...'. Enter 2 as the number of elements at this level. Click on OK.&lt;br /&gt;5) Right click on the Store_Country level. Select New Object-&gt; Logical Key. Enter 'countryKey' as the name. To add a column, click on Add. Select the store_country column from the store table. Select the checkbox that says 'Use For DrillDown'. Make sure the store_country column is selected. Click on OK.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCRFTZctFAI/AAAAAAAADaA/C7VsDLaeND4/s1600/image3.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 333px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TCRFTZctFAI/AAAAAAAADaA/C7VsDLaeND4/s400/image3.JPG" alt="" id="BLOGGER_PHOTO_ID_5486586445595218946" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;6) Open the Store table in the business modeling view. Double click on the store_country column. 
select the levels tab and select store_country as the level for the storeDim dimension.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCRFzLMxuWI/AAAAAAAADaI/J8V3cfj27S0/s1600/image4.JPG"&gt;&lt;img style="cursor: pointer; width: 340px; height: 400px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCRFzLMxuWI/AAAAAAAADaI/J8V3cfj27S0/s400/image4.JPG" alt="" id="BLOGGER_PHOTO_ID_5486586991526132066" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;7) This creates the store country level. similarly create the store city level as the child of store country and store id as the child of store city.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCRHQzWsYSI/AAAAAAAADaQ/R3Dc62JF594/s1600/image5.JPG"&gt;&lt;img style="cursor: pointer; width: 311px; height: 146px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCRHQzWsYSI/AAAAAAAADaQ/R3Dc62JF594/s400/image5.JPG" alt="" id="BLOGGER_PHOTO_ID_5486588600032977186" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;8) Next we will prepare the units_ordered column to be used for aggregation. Double click on the units_ordered column in the inventory_fact_1997 table and select the Aggregation tab. select 'sum' as the default aggregation rule.&lt;br /&gt;9) Next create a new presentation catalog called FoodMart. Create a new presentation table called store. 
Add the store_country, store_city, store_id and units_ordered column to the presentation table.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCRIWH1xqLI/AAAAAAAADaY/dzoFqbu3ZMQ/s1600/image6.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 180px; height: 185px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TCRIWH1xqLI/AAAAAAAADaY/dzoFqbu3ZMQ/s400/image6.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5486589790943029426" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This completes our configuration in the administration window.&lt;br /&gt;Open the Answers tool and add store_country and units_ordered columns to the criteria&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCRI0cNHiUI/AAAAAAAADag/aRkAjx1u_XM/s1600/image7.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 270px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TCRI0cNHiUI/AAAAAAAADag/aRkAjx1u_XM/s400/image7.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5486590311805716802" /&gt;&lt;/a&gt;&lt;br /&gt;Click on results. a table is created. Links are present to allow user to drill down. 
The units_ordered column is summed according to the level to which the user has drilled down.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCRK-1uUaHI/AAAAAAAADao/ahx3geoUJu8/s1600/image8.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 340px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TCRK-1uUaHI/AAAAAAAADao/ahx3geoUJu8/s400/image8.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5486592689477806194" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7158991686309573903?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7158991686309573903/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7158991686309573903' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7158991686309573903'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7158991686309573903'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/obiee-creating-hierarchy-and-drill-down.html' title='OBIEE - Creating Hierarchy and Drill Down Table'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/TCRDJX-Vu-I/AAAAAAAADZw/dnNheU1VWTs/s72-c/image1.JPG' 
height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7405472176689445493</id><published>2010-06-16T23:15:00.000-07:00</published><updated>2010-06-17T04:25:23.756-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - creating logical model and presentation catalog</title><content type='html'>&lt;div align="justify"&gt;In the previous post we created the physical model by importing the table schema from the database. Next we will create the logical model for the inventory_fact_1997 table and its connected dimensions.&lt;br /&gt;1. Select the following tables from the physical layer and drag them to the logical model layer.&lt;br /&gt;inventory_fact_1997&lt;br /&gt;product&lt;br /&gt;store&lt;br /&gt;time_by_day&lt;br /&gt;warehouse&lt;br /&gt;Make sure that you drag all the tables at the same time; this preserves the keys and the joins as well.&lt;br /&gt;2. Create the presentation catalog: Right click on the presentation layer area and click on 'New Presentation Catalog'. Name the new catalog FoodMartInventory. Drag all tables from the logical layer to the presentation layer. You can also right click on the logical layer business model and select 'duplicate with presentation catalog'.&lt;br /&gt;The presentation catalog is now ready.&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBoCOMEEnDI/AAAAAAAADZU/I0Tp7k0gj50/s1600/images9.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5483697939056729138" style="WIDTH: 374px; CURSOR: hand; HEIGHT: 400px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBoCOMEEnDI/AAAAAAAADZU/I0Tp7k0gj50/s400/images9.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We will also cover certain setup steps required before we start creating the dashboard.&lt;br /&gt;&lt;br /&gt;We need to set the default repository. 
Edit the C:\OracleBI\server\Config\NQSConfig.ini file and modify the repository entry to&lt;br /&gt;[REPOSITORY]&lt;br /&gt;FoodMart = FoodMart.rpd, DEFAULT;&lt;br /&gt;Restart the BI server and oc4j server.&lt;br /&gt;&lt;br /&gt;The next step is to create a datasource that the presentation server can use to connect to the BI server. A default datasource called AnalyticsWeb is created during installation. We will create a new datasource for our use. The steps to create the datasource are:&lt;br /&gt;1. Click on Start-&gt; Control Panel-&gt; Administrative Tools-&gt; Data Sources (ODBC).&lt;br /&gt;2. Click on the second tab, called System DSN.&lt;br /&gt;3. Click on Add.&lt;br /&gt;4. Select Oracle BI server from the drop down.&lt;br /&gt;5. Click on Finish. The Oracle BI server DSN configuration page comes up.&lt;br /&gt;6. Give a suitable name (OracleBIFoodMart) and select local as the server.&lt;br /&gt;7. In the next step provide the username and password and select the checkbox that says 'connect to BI server to obtain default settings..'.&lt;br /&gt;8. In the next screen change the password, if required. The database for foodmart should also be visible. Click on Finish.&lt;br /&gt;&lt;br /&gt;The datasource is now available for use. 
We need to configure the presentation service to use the new datasource.&lt;br /&gt;Open the C:\OracleBIData\web\config\instanceconfig.xml file and change the DSN to&lt;br /&gt;&lt;DSN&gt;OracleBIFoodMart&lt;/DSN&gt;&lt;br /&gt;Restart the presentation service.&lt;br /&gt;&lt;br /&gt;We can now start creating reports and dashboards.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7405472176689445493?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7405472176689445493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7405472176689445493' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7405472176689445493'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7405472176689445493'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/obiee-creating-logical-model-and.html' title='OBIEE - creating logical model and presentation catalog'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/TBoCOMEEnDI/AAAAAAAADZU/I0Tp7k0gj50/s72-c/images9.JPG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-8999405873970904487</id><published>2010-06-15T01:52:00.000-07:00</published><updated>2010-06-28T21:52:50.661-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Creating an OBIEE repository and importing a physical schema</title><content type='html'>&lt;div align="justify"&gt;OBIEE stores its physical and logical schema in a repository. The repository is a file with an extension of .rpd. In this post we look at creating a new repository and importing the physical schema metadata from the foodmart datasource created in the previous post.&lt;br /&gt;1. Open the Oracle BI server Administration page. Go to Start-&gt; Programs -&gt; Oracle Business Intelligence -&gt; Administration.&lt;br /&gt;2. Create a new repository. In the Oracle BI Administration tool, click on File-&gt; New. Type the name of the repository. Here we use the name 'FoodMart'. Save as type .rpd.&lt;br /&gt;3. The administration tool will show three views. The rightmost view is the physical layer view, the middle view is the logical layer view, and the leftmost view is the presentation layer view.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TBdCfRBF0DI/AAAAAAAADYs/r5ttfJxspLQ/s1600/images4.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5482924176258224178" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 200px" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TBdCfRBF0DI/AAAAAAAADYs/r5ttfJxspLQ/s400/images4.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;4. To import the schema from the foodmart database, click on File-&gt;Import-&gt;from Database. 
Select the 'FoodMart' datasource&lt;br /&gt;On clicking OK, you will be presented with the following screen&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBdDpprxtII/AAAAAAAADY0/IB0IV6wNNMY/s1600/images5.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5482925454190032002" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 257px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBdDpprxtII/AAAAAAAADY0/IB0IV6wNNMY/s400/images5.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The required table can be selected to import the metadata for that table. Here we will import schema from all the tables. The physical view will show the imported table schemas.&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TBhnwCGYTNI/AAAAAAAADZE/Kb-RkuhTj38/s1600/images7.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5483246621218589906" style="WIDTH: 198px; CURSOR: hand; HEIGHT: 400px" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TBhnwCGYTNI/AAAAAAAADZE/Kb-RkuhTj38/s400/images7.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;select any table and right click and select 'View Data' to view data for that table.&lt;br /&gt;&lt;br /&gt;Importing from a database is not the only way to create a physical schema. The schema can also be created manually, but in most cases that is not required. It is also possible to add more columns or tables to an imported schema.&lt;br /&gt;&lt;br /&gt;By default when the schema is imported, the count of rows for each table is not available. Right click on the table and select 'Update Row Count' to find the number of rows in that table at that time.&lt;br /&gt;&lt;br /&gt;We will create a star schema for inventory_fact_1997. 
Create the following foreign keys&lt;br /&gt;product.product_id = inventory_fact_1997.product_id&lt;br /&gt;store.store_id = inventory_fact_1997.store_id&lt;br /&gt;time_by_day.time_id = inventory_fact_1997.time_id&lt;br /&gt;warehouse.warehouse_id = inventory_fact_1997.warehouse_id&lt;br /&gt;&lt;br /&gt;Right click on the inventory_fact_1997 table and select Physical Diagram -&gt; Object(s) and All join(s).&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/TBhwESOaiTI/AAAAAAAADZM/tQi2h5Mm4yU/s1600/images8.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5483255765237664050" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 275px" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/TBhwESOaiTI/AAAAAAAADZM/tQi2h5Mm4yU/s400/images8.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The physical schema is now ready for use.&lt;br /&gt;&lt;br /&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-8999405873970904487?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/8999405873970904487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=8999405873970904487' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8999405873970904487'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8999405873970904487'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/creating-obiee-repository-and-physical.html' title='OBIEE - Creating an OBIEE repository and importing a physical 
schema'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/TBdCfRBF0DI/AAAAAAAADYs/r5ttfJxspLQ/s72-c/images4.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1457286652682354755</id><published>2010-06-14T23:55:00.000-07:00</published><updated>2010-06-28T21:52:33.598-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><category scheme='http://www.blogger.com/atom/ns#' term='Datasource'/><title type='text'>OBIEE - Creating MySql Datasource</title><content type='html'>We will be using the sample food mart database for all tutorials. Please look into the introduction blog to obtain the link for this database.&lt;br /&gt;In this post we look at creating a datasource to the foodmart database in MySql. The screenshots are from a windows XP machine.&lt;br /&gt;Steps to create the data source.&lt;br /&gt;1. Goto Start -&gt; Control Panel -&gt; Administrative Tools &lt;br /&gt;2. Click on Data Sources (ODBC). The datasources window will open.&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBc6FOSnUrI/AAAAAAAADYU/py99PJHMSpM/s1600/image1.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 327px;" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/TBc6FOSnUrI/AAAAAAAADYU/py99PJHMSpM/s400/image1.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5482914932756796082" /&gt;&lt;/a&gt;&lt;br /&gt;3. Click on system DSN tab (second tab).&lt;br /&gt;4. Click on Add.&lt;br /&gt;5. 
Select MySql from the list.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TBc9WOIy8-I/AAAAAAAADYc/zK8gRDhRx8I/s1600/image2.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 295px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TBc9WOIy8-I/AAAAAAAADYc/zK8gRDhRx8I/s400/image2.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5482918523308274658" /&gt;&lt;/a&gt;&lt;br /&gt;If the MySql ODBC driver is not available, download and install it from &lt;br /&gt;http://dev.mysql.com/downloads/connector/odbc/5.1.html&lt;br /&gt;6. Provide the connection details.&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/TBc-qGnpE6I/AAAAAAAADYk/-eqUdFlCHno/s1600/images3.JPG"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 379px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/TBc-qGnpE6I/AAAAAAAADYk/-eqUdFlCHno/s400/images3.JPG" border="0" alt=""id="BLOGGER_PHOTO_ID_5482919964399178658" /&gt;&lt;/a&gt;&lt;br /&gt;7. Click on OK. The datasource is created and is available for use.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1457286652682354755?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1457286652682354755/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1457286652682354755' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1457286652682354755'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1457286652682354755'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/creating-mysql-datasource.html' title='OBIEE - Creating MySql 
Datasource'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/TBc6FOSnUrI/AAAAAAAADYU/py99PJHMSpM/s72-c/image1.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2081213426250778685</id><published>2010-06-14T23:24:00.000-07:00</published><updated>2010-06-28T21:53:05.362-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OBIEE'/><title type='text'>OBIEE - Introduction</title><content type='html'>In this series of posts we will look at Oracle Business Intelligence Enterprise Edition.&lt;br /&gt;The source database for the examples is the foodmart database, which can be downloaded from &lt;a href="http://sourceforge.net/projects/ds-professional/files/"&gt; here &lt;/a&gt;.  
Look at &lt;a href="http://opensourceanalytics.com/2006/04/28/sales-data-mart-dimensional-model-for-retail/"&gt; this blog &lt;/a&gt; for more details on this database.&lt;br /&gt;The posts will cover the following:&lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Creating a repository&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Using Answers&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Creating Dashboards&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Using BI Publisher&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Caching&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2081213426250778685?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2081213426250778685/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2081213426250778685' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2081213426250778685'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2081213426250778685'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/introduction.html' title='OBIEE - Introduction'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7296852164368509211</id><published>2010-06-07T21:48:00.000-07:00</published><updated>2010-06-07T23:38:55.254-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Balanced Scorecard'/><category scheme='http://www.blogger.com/atom/ns#' term='KPI'/><title type='text'>Key Performance Indicators and Balanced Scorecard</title><content type='html'>&lt;div style="text-align: justify;"&gt;Performance management is critical to an organization's success. Top management spends a lot of time making sure that the company's day-to-day activity is aligned with its long-term goals. The facets of performance management are:&lt;br /&gt;1. Alignment - Top management needs to align the company's business processes with its long-term goals.&lt;br /&gt;2. The business processes need to be bound together so that a meaningful business process management (BPM) system spanning the organization can be set up.&lt;br /&gt;3. The business processes need to be monitored in real time. This is the role of the Business Activity Monitoring (BAM) system.&lt;br /&gt;Balanced scorecards and Key Performance Indicators (KPI) are tools that management uses to monitor the performance of its business processes.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Balanced Scorecard:&lt;/span&gt; The balanced scorecard was made popular by Kaplan and Norton. It is a management and reporting tool that presents a holistic view of a company's financial and non-financial metrics. It can be used for real-time monitoring of those metrics. The balanced scorecard is a single report built mainly around four perspectives. 
The idea is to monitor not only the financial but also the non-financial parameters that are critical to a company's success.&lt;br /&gt;The four perspectives are shown in the diagram below.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_GvNS-b8AbU4/TA3WmH0wQsI/AAAAAAAADYA/ZJvAl_Qs0bY/s1600/Balanced-Scorecard.JPG"&gt;&lt;img style="cursor: pointer; width: 400px; height: 330px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/TA3WmH0wQsI/AAAAAAAADYA/ZJvAl_Qs0bY/s400/Balanced-Scorecard.JPG" alt="" id="BLOGGER_PHOTO_ID_5480272272003449538" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Key Performance Indicators (KPI)&lt;/span&gt; - Key performance indicators are the metrics that may be part of the balanced scorecard. KPIs are used to present actionable results across the organization. Selecting KPIs is tricky and can sometimes be an art. However, any KPI selected should be actionable and relevant. Here's one way to select KPIs:&lt;br /&gt;1. List the organization's vision and goals.&lt;br /&gt;2. Prepare a strategy map that is in line with the company goals.&lt;br /&gt;3. Divide the strategy map into different components (financial, non-financial, etc.).&lt;br /&gt;4. List the business processes for each strategy.&lt;br /&gt;5. List the critical success factors (CSF) for each business process.&lt;br /&gt;6. Design metrics that monitor these critical success factors on an on-demand basis.&lt;br /&gt;&lt;br /&gt;At the end of this exercise we get a list of metrics that need to be monitored. Designing a dashboard using these metrics is the next step. The dashboard is user specific. Top management, including the CEO, would be interested in only four or five metrics that show the overall health of the strategy. 
Numbers such as return on capital employed are what top management is interested in.&lt;br /&gt;&lt;br /&gt;KPIs have the following characteristics:&lt;br /&gt;&lt;br /&gt;-&gt;  &lt;span style="font-style:italic;"&gt;The KPI should be actionable&lt;/span&gt;. Management should be able to use the KPI dashboard for decision making. Employees should use the dashboard to align and modify their activities so that they are in line with the company goals.&lt;br /&gt;-&gt; &lt;span style="font-style:italic;"&gt;The KPIs should be mutually exclusive and collectively exhaustive&lt;/span&gt; - Two KPIs that give the same kind of information are redundant. Each KPI should drive a unique action. Each KPI should try to encompass multiple critical success factors.&lt;br /&gt;&lt;br /&gt;In designing the reports and dashboard for the KPIs, use the following rules:&lt;br /&gt;&lt;br /&gt;1. &lt;span style="font-style:italic;"&gt;Make efficient use of graphs and tables&lt;/span&gt;. Do not use graphs that show only a trend but no numbers. Remember that the KPI needs to be actionable, and a graph without numbers is not very useful.&lt;br /&gt;2. &lt;span style="font-style:italic;"&gt;The metrics should be on-demand&lt;/span&gt;. They should give the latest snapshot. Management needs today's information and not last week's summary. The decision to be made is 'How do we move forward?' and not 'What did last week look like?'.&lt;br /&gt;3. &lt;span style="font-style:italic;"&gt;Weekly and monthly reports&lt;/span&gt; can be used to analyze past performance and make corrections in the strategy.&lt;br /&gt;4. &lt;span style="font-weight:bold;"&gt;Use ALERTS&lt;/span&gt; : Each employee should be presented with the critical alerts that she subscribes to. The alerts should relate directly to her job role. Let the employee see the number immediately instead of waiting for the end-of-week report. 
For example, a sales representative should be able to see the current sales pipeline without waiting for the end-of-month report, by which time it might be too late to get more customers.&lt;br /&gt;5. &lt;span style="font-weight:bold;"&gt;Use Messages and inferences&lt;/span&gt; - The graphs and reports should be accompanied by comments and inferences from management. These can also be conveyed through appropriate graph or table headers. Each graph should have a comment that explains what action needs to be taken.&lt;br /&gt;&lt;br /&gt;Designing an effective dashboard can help the company a lot. Before embarking on a project to design KPIs and a dashboard for your company, make sure you understand the following.&lt;br /&gt;1. Top management sponsorship - It is very important that you find at least one person in the executive team who is committed to and passionate about building the KPIs. A lot of departments would be involved and many people may not like the idea of a 'scorecard'. Choose a different name to make the dashboard more employee friendly.&lt;br /&gt;2. Employee education - All employees should be trained and educated in using the KPIs. The success of the project depends on whether the employees take action based on the KPIs.&lt;br /&gt;3. Look and feel - Make the dashboard and the reports user friendly by incorporating the company theme into the reports. They should not look like 'just another application' but like an extension of the employees' existing tools.&lt;br /&gt;4. Start with in-house tools or Microsoft Excel until the dashboard gains popularity.&lt;br /&gt;5. Start with a very small team, preferably 2-3 people. Select a liaison from each department who can spend some time with this team.&lt;br /&gt;&lt;br /&gt; KPIs can be of immense help in monitoring and driving the company towards its long-term goals. 
Used effectively, they can do wonders, but be warned that any halfhearted implementation would only waste company time and resources.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7296852164368509211?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7296852164368509211/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7296852164368509211' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7296852164368509211'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7296852164368509211'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2010/06/key-performance-indicators-and-balanced.html' title='Key Performance Indicators and Balanced Scorecard'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/TA3WmH0wQsI/AAAAAAAADYA/ZJvAl_Qs0bY/s72-c/Balanced-Scorecard.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-8304664405569835229</id><published>2009-11-25T07:34:00.000-08:00</published><updated>2009-11-25T08:26:08.163-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>R and Java - JRI using eclipse.</title><content type='html'>R is a powerful language for statistical computing and graphics. R has strong community support and is finding new uses in the corporate world. If R can be integrated with Java, it could gain much wider acceptance in new products. A library called rJava exists that helps integrate R with Java. However, setting up the library may be tricky.&lt;br /&gt;&lt;br /&gt;This post explores how R can be called from within Java using JRI. We will use the example provided in the rJava package. The example class is called rtest2.java.&lt;br /&gt;&lt;br /&gt;Here are the steps to run R from Java using Eclipse.&lt;br /&gt;&lt;br /&gt;1) Create a new project and copy the rtest.java and rtest2.java files from the rJava/jri package. The rJava folder can be found where R stores the packages downloaded during installation. It should be in a folder called 'win-library'.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1TUVBLGQI/AAAAAAAACPc/K_2dmZKuOu4/s1600/new_project.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 166px; height: 140px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1TUVBLGQI/AAAAAAAACPc/K_2dmZKuOu4/s400/new_project.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5408070336246388994" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;2) The rJava/jri folder in 'win-library' should also have the JRI.jar library and the jri.dll file. 
Copy R.dll from the bin directory of R into the rJava/jri folder.&lt;br /&gt;Here's the folder hierarchy for JRI:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1WO-YBmCI/AAAAAAAACPk/bnpsHF7iJOg/s1600/rjava_folder.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 163px; height: 400px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1WO-YBmCI/AAAAAAAACPk/bnpsHF7iJOg/s400/rjava_folder.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5408073542803757090" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/Sw1WPEijSRI/AAAAAAAACPs/sgOrTlJOl8U/s1600/jri_folder.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 87px;" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/Sw1WPEijSRI/AAAAAAAACPs/sgOrTlJOl8U/s400/jri_folder.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5408073544458520850" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;3) Add JRI.jar to the project classpath in Eclipse.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1XSxR5U3I/AAAAAAAACP0/YjAnBBwNvPg/s1600/jri_build_path.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 302px;" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1XSxR5U3I/AAAAAAAACP0/YjAnBBwNvPg/s400/jri_build_path.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5408074707519492978" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;4) Add the following entries to the run configuration of the project.&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1XzI_K9wI/AAAAAAAACP8/t3lRKzjxzqc/s1600/run_configuration.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 285px;" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1XzI_K9wI/AAAAAAAACP8/t3lRKzjxzqc/s400/run_configuration.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5408075263639222018" /&gt;&lt;/a&gt;&lt;br /&gt;The PATH variable contains &lt;br /&gt;C:/Users/user/Documents/R/win-library/2.9/rJava/jri/;C:\Program Files\R\R-2.9.1\bin\&lt;br /&gt;&lt;br /&gt;That's it. This should work. 
Run the rtest2 file and you should see R working.&lt;br /&gt;&lt;br /&gt;Some of the steps above may not be required. However, I found this setup to work after some research and have not experimented further. Please add your comments if it works and also if it does not.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-8304664405569835229?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/8304664405569835229/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=8304664405569835229' title='22 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8304664405569835229'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8304664405569835229'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/11/r-and-java-jri-via-eclipse.html' title='R and Java - JRI using eclipse.'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/Sw1TUVBLGQI/AAAAAAAACPc/K_2dmZKuOu4/s72-c/new_project.jpg' height='72' width='72'/><thr:total>22</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-5688929361099092428</id><published>2009-10-03T04:16:00.000-07:00</published><updated>2009-10-04T09:50:12.092-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Examples using R - Randomized Block Design</title><content type='html'>&lt;div align="justify"&gt;Problem: We wish to determine whether four different tips produce different readings on a hardness testing machine. An experiment such as this might be part of a gauge capability study. The machine operates by pressing the tip into a metal test coupon, and from the depth of the resulting depression, the hardness of the coupon can be determined. The experimenter has decided to obtain four observations for each tip. There is only one factor - tip type - and a completely randomized single-factor design would consist of randomly assigning each one of the 4x4=16 runs to an experimental unit, that is, a metal coupon, and observing the hardness reading that results. However, if the metal coupons differ slightly in their hardness, as might happen if they are taken from ingots that are produced in different heats, the experimental units will contribute to the variability observed in the hardness data. As a result, the experimental error will reflect both random error and variability between coupons. We would like to remove the variability between coupons from the experimental error. A design that accomplishes this requires the experimenter to test each tip once on each of four coupons. This design is called a randomized complete block design. Each block contains all the treatments. 
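&lt;br /&gt;&lt;br /&gt;A minimal sketch of such a layout in R, assuming hypothetical labels 1 to 4 for the tips and the coupons (the variable names and the seed are illustrative only and are not taken from the original analysis):&lt;br /&gt;
tips = 1:4  # four tip types (treatments), hypothetical labels&lt;br /&gt;
coupons = 1:4  # four metal coupons (blocks), hypothetical labels&lt;br /&gt;
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible&lt;br /&gt;
# one row per run: each coupon gets all four tips in a freshly shuffled order&lt;br /&gt;
runOrder = do.call(rbind, lapply(coupons, function(b) data.frame(testCoupon = b, typeOfTip = sample(tips))))&lt;br /&gt;
table(runOrder$testCoupon, runOrder$typeOfTip)  # every cell is 1, so each block is complete&lt;br /&gt;
&lt;br /&gt;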
Within a block the order in which the four tips are tested is randomly determined.&lt;br /&gt;The test data is&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/SsdbHU7bUGI/AAAAAAAACMw/ooYbbsR5m_c/s1600-h/rbcdData.jpg"&gt;&lt;img style="WIDTH: 309px; HEIGHT: 276px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388375660607262818" border="0" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/SsdbHU7bUGI/AAAAAAAACMw/ooYbbsR5m_c/s400/rbcdData.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Let us look at the interaction plot&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsdvbWUMdaI/AAAAAAAACNA/5pqTJWwl5pc/s1600-h/interactionPlot.jpg"&gt;&lt;img style="WIDTH: 385px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388397994809521570" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsdvbWUMdaI/AAAAAAAACNA/5pqTJWwl5pc/s400/interactionPlot.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;and the box plot&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsdvMw7CBCI/AAAAAAAACM4/l5xYCDPSmBI/s1600-h/rbcdboxplot.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 399px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388397744253699106" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsdvMw7CBCI/AAAAAAAACM4/l5xYCDPSmBI/s400/rbcdboxplot.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;Let us now run the analysis of variance on the data, we will include the blocking variable in the analysis.&lt;br /&gt;the formula to be used in R is hardness~typeOfTip+testCoupon.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&gt; &lt;span style="color:#cc0000;"&gt;anova(aov(hardness~factor(typeOfTip)+factor(testCoupon)))&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#000066;"&gt;Analysis of Variance Table&lt;br /&gt;&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/Ssg26kXXJrI/AAAAAAAACNI/Fz1xUcR6dZI/s1600-h/aov.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 99px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388617333970773682" border="0" alt="" 
src="http://4.bp.blogspot.com/_GvNS-b8AbU4/Ssg26kXXJrI/AAAAAAAACNI/Fz1xUcR6dZI/s400/aov.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A plot of the residuals should not show any pattern, since analysis of variance assumes that the residuals are independently distributed. Here's the plot of the residuals.&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/Ssg3xFNXYBI/AAAAAAAACNQ/XI5lTVWbz54/s1600-h/residual_plot_rbcd.jpg"&gt;&lt;img style="WIDTH: 399px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388618270500151314" border="0" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/Ssg3xFNXYBI/AAAAAAAACNQ/XI5lTVWbz54/s400/residual_plot_rbcd.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;A Quantile-Quantile plot can be used to check the distribution as well. The plot also shows the presence of outliers, if any. The plot is shown below.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/Ssg4W8gs0HI/AAAAAAAACNY/xoRhwYvq7HU/s1600-h/rbcdQQPlot.jpg"&gt;&lt;img style="WIDTH: 397px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388618921000358002" border="0" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/Ssg4W8gs0HI/AAAAAAAACNY/xoRhwYvq7HU/s400/rbcdQQPlot.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Tukey's test can be used for pairwise comparison. 
Here's the result of Tukey's test:&lt;br /&gt;&lt;span style="color:#660000;"&gt;fit=aov(hardness~factor(typeOfTip)+factor(testCoupon))&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#660000;"&gt;TukeyHSD(fit,which='factor(typeOfTip)',ordered=TRUE)&lt;br /&gt;&lt;/span&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssi9RYbTAAI/AAAAAAAACNo/BUytXdcSwnQ/s1600-h/tukeysrbcd.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 181px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388765060461166594" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssi9RYbTAAI/AAAAAAAACNo/BUytXdcSwnQ/s400/tukeysrbcd.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;A plot of the residuals should not have a pattern. Here's a plot of the residuals.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssi85sJelBI/AAAAAAAACNg/0S-KnPo5wB0/s1600-h/rbcdresidual.jpg"&gt;&lt;img style="WIDTH: 380px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388764653438276626" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssi85sJelBI/AAAAAAAACNg/0S-KnPo5wB0/s400/rbcdresidual.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We can also check whether the variance is the same across tips. Here's one way to check for equality of variance: regress the absolute residuals on the tip type.&lt;br /&gt;&gt; summary(lm(abs(fit$res)~typeOfTip))&lt;br /&gt;&lt;br /&gt;Call:&lt;br /&gt;lm(formula = abs(fit$res) ~ typeOfTip)&lt;br /&gt;&lt;br /&gt;Residuals:&lt;br /&gt;Min 1Q Median 3Q Max&lt;br /&gt;-0.050000 -0.032812 -0.003125 0.026562 0.093750&lt;br /&gt;&lt;br /&gt;Coefficients:&lt;br /&gt;Estimate Std. Error t value Pr(&gt;|t|)&lt;br /&gt;(Intercept) 0.075000 0.024719 3.034 0.00893 **&lt;br /&gt;typeOfTip -0.006250 0.009026 -0.692 0.50000&lt;br /&gt;---&lt;br /&gt;Signif. 
codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1&lt;br /&gt;&lt;br /&gt;Residual standard error: 0.04037 on 14 degrees of freedom&lt;br /&gt;Multiple R-squared: 0.03311, Adjusted R-squared: -0.03595&lt;br /&gt;F-statistic: 0.4795 on 1 and 14 DF, p-value: 0.5&lt;br /&gt;&lt;br /&gt;Since the p-value is large, we cannot conclude that the variances differ.&lt;br /&gt;&lt;br /&gt;The Latin Square Design:&lt;br /&gt;If there are two blocking variables, then the Latin square design can be used.&lt;br /&gt;Problem: Suppose that an experimenter is studying the effects of five different formulations of a rocket propellant used in aircrew escape systems on the observed burning rate. Each formulation is mixed from a batch of raw material that is only large enough for five formulations to be tested. Furthermore, the formulations are prepared by several operators, and there may be substantial differences in the skills and experience of the operators.&lt;br /&gt;Here's the data:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsjF7WEYKwI/AAAAAAAACNw/JsaJMMK1rz0/s1600-h/latindata.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 399px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388774577475693314" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsjF7WEYKwI/AAAAAAAACNw/JsaJMMK1rz0/s400/latindata.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;The Latin square structure is&lt;br /&gt;&gt; matrix(treatments,5,5)&lt;br /&gt;[,1] [,2] [,3] [,4] [,5]&lt;br /&gt;[1,] "A" "B" "C" "D" "E"&lt;br /&gt;[2,] "B" "C" "D" "E" "A"&lt;br /&gt;[3,] "C" "D" "E" "A" "B"&lt;br /&gt;[4,] "D" "E" "A" "B" "C"&lt;br /&gt;[5,] "E" "A" "B" "C" "D"&lt;br /&gt;&lt;br /&gt;An analysis of variance gives&lt;br /&gt;&gt; anova(lm(formulations~factor(rawBatches)+factor(treatments)+factor(operators)))&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsjK9o9Ge2I/AAAAAAAACN4/eOS-HaHh3ik/s1600-h/latinaov.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 96px; CURSOR: hand" 
id="BLOGGER_PHOTO_ID_5388780114463325026" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsjK9o9Ge2I/AAAAAAAACN4/eOS-HaHh3ik/s400/latinaov.jpg" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The data shows that the different formulations do produce different burning rate. Also there is a difference between the operators. However, no difference between the raw batches.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-5688929361099092428?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/5688929361099092428/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=5688929361099092428' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5688929361099092428'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5688929361099092428'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/10/examples-using-r-randomized-block.html' title='Examples using R - Randomized Block Design'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_GvNS-b8AbU4/SsdbHU7bUGI/AAAAAAAACMw/ooYbbsR5m_c/s72-c/rbcdData.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-197149545822119951</id><published>2009-09-30T02:05:00.000-07:00</published><updated>2009-10-01T03:02:49.405-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='ANOVA'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>Examples using R - Analysis of Variance</title><content type='html'>&lt;div align="justify"&gt;&lt;br /&gt;Problem: A product development engineer is interested in investigating the tensile strength of a new synthetic fiber that will be used to make cloth for men's shirts. The engineer knows from previous experience that the strength is affected by the weight percent of cotton used in the blend of materials for the fiber. Furthermore, she suspects that increasing the cotton content will increase the strength, at least initially. She also knows that cotton content should range between about 10 and 40 percent if the final product is to have other quality characteristics that are desired. The engineer decides to test specimens at five levels of cotton weight percent: 15, 20, 25, 30 and 35 percent. 
She also decides to test 5 specimens at each level of cotton content.&lt;br /&gt;Here we have a single factor with 5 levels and 5 replicates. The data is&lt;br /&gt;&lt;br /&gt;&gt; cotton&lt;br /&gt;$p15&lt;br /&gt;[1] 7 7 15 11 9&lt;br /&gt;$p20&lt;br /&gt;[1] 12 17 12 18 18&lt;br /&gt;$p25&lt;br /&gt;[1] 14 18 18 19 23&lt;br /&gt;$p30&lt;br /&gt;[1] 19 25 22 19 23&lt;br /&gt;$p35&lt;br /&gt;[1] 7 10 11 15 11&lt;br /&gt;A box plot of the data is&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsMiTe1BgMI/AAAAAAAACLg/-BlA3orSNBE/s1600-h/box_plot_cotton.JPG"&gt;&lt;img style="WIDTH: 382px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387187297353564354" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsMiTe1BgMI/AAAAAAAACLg/-BlA3orSNBE/s400/box_plot_cotton.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;A histogram of the data looks like&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsRmHGsMmuI/AAAAAAAACLw/Dqtwbp3OJ9g/s1600-h/barplot.JPG"&gt;&lt;img style="WIDTH: 361px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387543326483061474" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsRmHGsMmuI/AAAAAAAACLw/Dqtwbp3OJ9g/s400/barplot.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;A multiple scatter plot can sometimes be used if corresponding values of the observations need comparison. 
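&lt;br /&gt;&lt;br /&gt;As a rough sketch, the list printed above and plots of this kind could be produced as follows. The plotting calls and axis labels are assumptions, since the post shows only the resulting images; cotton_matrix is built here in one plausible way that matches the output printed later in the post.&lt;br /&gt;
cotton = list(p15 = c(7, 7, 15, 11, 9), p20 = c(12, 17, 12, 18, 18), p25 = c(14, 18, 18, 19, 23), p30 = c(19, 25, 22, 19, 23), p35 = c(7, 10, 11, 15, 11))&lt;br /&gt;
boxplot(cotton, xlab = "Cotton weight percent", ylab = "Tensile strength")  # one box per level&lt;br /&gt;
cotton_matrix = do.call(cbind, cotton)  # 5 x 5 matrix: rows are replicates, columns are levels&lt;br /&gt;
matplot(cotton_matrix, type = "p", pch = 1:5, xlab = "Replicate", ylab = "Tensile strength")  # one symbol per level&lt;br /&gt;
&lt;br /&gt;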
The scatter plot for this data is shown below.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsRnO5Ea9PI/AAAAAAAACL4/BIY0mWrIdBk/s1600-h/scatterplot.JPG"&gt;&lt;img style="WIDTH: 381px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387544559777150194" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsRnO5Ea9PI/AAAAAAAACL4/BIY0mWrIdBk/s400/scatterplot.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Analysis of Variance:&lt;br /&gt;Let's use analysis of variance in the above example to find out whether all the means are equal or whether any mean differs.&lt;br /&gt;The data needs to be reshaped into a single vector for aov:&lt;br /&gt;&lt;br /&gt;&lt;span style="color:#993399;"&gt;&gt; c(cotton_matrix[1,],cotton_matrix[2,],cotton_matrix[3,],cotton_matrix[4,],cotton_matrix[5,])-&gt;cotton_data&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#000066;"&gt;&gt; cotton_data&lt;br /&gt;p15 p20 p25 p30 p35 p15 p20 p25 p30 p35 p15 p20 p25 p30 p35 p15 p20 p25 p30 p35 p15 p20 p25 p30 p35&lt;br /&gt;7 12 14 19 7 7 17 18 25 10 15 12 18 22 11 11 18 19 19 15 9 18 23 23 11&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#000000;"&gt;Analysis of variance yields&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&gt; &lt;span style="color:#663366;"&gt;summary(aov(cotton_data~names(cotton_data)))&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsRv0gG22uI/AAAAAAAACMA/3FcxxCd-crg/s1600-h/aov.JPG"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 92px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387554002004531938" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsRv0gG22uI/AAAAAAAACMA/3FcxxCd-crg/s400/aov.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;From the F value we reject the null hypothesis of equal means and conclude that at least one mean differs.&lt;br /&gt;&lt;br /&gt;Analysis of variance relies on certain assumptions, and it is important to check the validity of these assumptions. The first method is to analyse the residuals for each observation. There should be no pattern in the residuals. 
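&lt;br /&gt;&lt;br /&gt;A minimal sketch of residual checks of this kind, using the cotton data from this post. The names strength, level and res are illustrative, and the plotting calls are an assumption about how such plots can be drawn, not necessarily how the images in the post were made.&lt;br /&gt;
strength = c(7, 7, 15, 11, 9, 12, 17, 12, 18, 18, 14, 18, 18, 19, 23, 19, 25, 22, 19, 23, 7, 10, 11, 15, 11)&lt;br /&gt;
level = factor(rep(c("p15", "p20", "p25", "p30", "p35"), each = 5))  # five replicates per cotton level&lt;br /&gt;
fit = aov(strength ~ level)&lt;br /&gt;
res = residuals(fit)&lt;br /&gt;
# residuals in the order listed here (level by level); the actual run order is not given in the post&lt;br /&gt;
plot(res, xlab = "Observation", ylab = "Residual")&lt;br /&gt;
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")  # residuals against fitted values&lt;br /&gt;
qqnorm(res)  # normal quantile-quantile plot of the residuals&lt;br /&gt;
qqline(res)  # reference line through the quartiles&lt;br /&gt;
&lt;br /&gt;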
If the residuals either spread out or narrow down as time progresses, the variance is not constant over the course of the experiment, which could point to a problem with how the experiment was run.&lt;br /&gt;Here's a plot of residuals against time (observation)&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsR4JTl5KCI/AAAAAAAACMI/Zxg_j2sn7fc/s1600-h/residual_plot.JPG"&gt;&lt;img style="WIDTH: 312px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387563155515320354" border="0" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsR4JTl5KCI/AAAAAAAACMI/Zxg_j2sn7fc/s400/residual_plot.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Another validation is to check the nature of the residuals themselves. One way to do this is to plot the residuals against the fitted values. Here again, no pattern should be present.&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/SsR5KcK2ajI/AAAAAAAACMQ/5sWpauhHxeg/s1600-h/residual_fitted_plot.JPG"&gt;&lt;img style="WIDTH: 302px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387564274509310514" border="0" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/SsR5KcK2ajI/AAAAAAAACMQ/5sWpauhHxeg/s400/residual_fitted_plot.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;A normal Q-Q plot of the residuals can be used to check the normality assumption:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsR5h8_2xNI/AAAAAAAACMY/gXr6QYWnjD8/s1600-h/qqplot.JPG"&gt;&lt;img style="WIDTH: 344px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387564678458557650" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsR5h8_2xNI/AAAAAAAACMY/gXr6QYWnjD8/s400/qqplot.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The variances of the five sets can be compared using Bartlett's test&lt;br /&gt;&gt; &lt;span style="color:#330033;"&gt;bartlett.test(cotton_data~factor(names(cotton_data)))&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#000099;"&gt;Bartlett test of homogeneity of variances&lt;br /&gt;data: cotton_data by factor(names(cotton_data))&lt;br /&gt;Bartlett's K-squared = 0.2801, df = 4, p-value = 0.991&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The results show that the null hypothesis cannot be rejected and 
hence there is no evidence that the variances of the five sets differ.&lt;br /&gt;&lt;br /&gt;We now need to do a pairwise comparison to find out which pairs differ in mean. We use Tukey's test to do so.&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsR8JP_VvJI/AAAAAAAACMg/3na-hG2LYuE/s1600-h/Tukeys.JPG"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 243px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387567552594820242" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsR8JP_VvJI/AAAAAAAACMg/3na-hG2LYuE/s400/Tukeys.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If the assumption of normality is not met, the Kruskal-Wallis test may be used instead.&lt;br /&gt;&lt;br /&gt;&lt;span style="color:#663366;"&gt;&gt; kruskal.test(cotton_data~factor(names(cotton_data)))&lt;/span&gt;&lt;br /&gt;&lt;span style="color:#000099;"&gt;Kruskal-Wallis rank sum test&lt;br /&gt;data: cotton_data by factor(names(cotton_data))&lt;br /&gt;Kruskal-Wallis chi-squared = 18.5513, df = 4, p-value = 0.0009626&lt;/span&gt;&lt;br /&gt;The small p-value leads to the same conclusion as the analysis of variance: at least one group differs from the others.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-197149545822119951?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/197149545822119951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=197149545822119951' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/197149545822119951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/197149545822119951'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/examples-using-r-analysis-of-variance.html' title='Examples 
using R - Analysis of Variance'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/SsMiTe1BgMI/AAAAAAAACLg/-BlA3orSNBE/s72-c/box_plot_cotton.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1833450233412334915</id><published>2009-09-29T22:50:00.000-07:00</published><updated>2009-09-30T01:56:43.846-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Examples - Comparing two conditions using R.</title><content type='html'>&lt;div align="justify"&gt;&lt;strong&gt;Problem :&lt;/strong&gt; The tension bond strength of Portland cement mortar is an important characteristic of the product. An engineer is interested in comparing the strength of a modified formulation, in which polymer latex emulsions have been added during mixing, to the strength of the unmodified mortar. The experimenter has collected 10 observations on strength for the modified formulation and another 10 observations on strength for the unmodified formulation. 
The data is&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsL1PyxD5xI/AAAAAAAACKY/_GBGRQGl1zo/s1600-h/data2.JPG"&gt;&lt;img style="WIDTH: 597px; HEIGHT: 167px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387137755962926866" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SsL1PyxD5xI/AAAAAAAACKY/_GBGRQGl1zo/s400/data2.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;A &lt;strong&gt;box plot &lt;/strong&gt;of the data:&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsL60kzOtiI/AAAAAAAACKg/T1CW1ZPGeYo/s1600-h/box_plot_data2.JPG"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 391px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387143885427226146" border="0" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsL60kzOtiI/AAAAAAAACKg/T1CW1ZPGeYo/s400/box_plot_data2.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Two-sample t-test, equal variances, two-sided:&lt;/strong&gt;&lt;br /&gt;&gt;&lt;span style="color:#993399;"&gt;t.test(data2$modified,data2$unmodified,var.equal=TRUE,alternative="two.sided")&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#3333ff;"&gt;Two Sample t-test&lt;br /&gt;data: data2$modified and data2$unmodified&lt;br /&gt;t = -9.1094, df = 18, p-value = 3.678e-08 &lt;/span&gt;&lt;/div&gt;&lt;span style="color:#3333ff;"&gt;&lt;div align="justify"&gt;&lt;br /&gt;alternative hypothesis: true difference in means is not equal to 0 &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;95 percent confidence interval:&lt;br /&gt;-1.4250734 -0.8909266&lt;br /&gt;&lt;/div&gt;&lt;div align="justify"&gt;sample estimates:&lt;br /&gt;mean of x mean of y&lt;br /&gt;16.764 17.922&lt;/div&gt;&lt;div align="justify"&gt;&lt;/span&gt;&lt;/div&gt;&lt;p&gt;&lt;br /&gt;The very small p-value leads us to reject the hypothesis that the two mean strengths are equal. Of course, this test assumes that the data in both samples are normally distributed. We can check this assumption using the &lt;strong&gt;quantile-quantile (Q-Q) plot&lt;/strong&gt;. 
See &lt;a href="http://wiener.math.csi.cuny.edu/st/stRmanual/qqnorm.html"&gt;here&lt;/a&gt; for details on qqnorm plots. A plot for the modified sample is&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsMVIKj7iXI/AAAAAAAACLQ/EWOvgbrv5mM/s1600-h/qqnorm.JPG"&gt;&lt;img style="WIDTH: 369px; HEIGHT: 386px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387172809283438962" border="0" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsMVIKj7iXI/AAAAAAAACLQ/EWOvgbrv5mM/s320/qqnorm.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;A plot for the unmodified sample is&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsMWkMCnz8I/AAAAAAAACLY/1aeukgnVjPM/s1600-h/qqnorm2.JPG"&gt;&lt;img style="WIDTH: 346px; HEIGHT: 400px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387174390228570050" border="0" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SsMWkMCnz8I/AAAAAAAACLY/1aeukgnVjPM/s400/qqnorm2.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Paired t-test:&lt;/strong&gt;&lt;br /&gt;Problem: Consider a hardness testing machine that presses a rod with a pointed tip into a metal specimen with a known force. By measuring the depth of the depression caused by the tip, the hardness of the specimen is determined. Two different tips are available for this machine, and although the precision of the measurements made by the two tips seems to be the same, it is suspected that one tip produces different hardness readings than the other. An experiment is conducted in which both tips are used on the same specimen, and 10 runs of the experiment are made. 
The data is as shown below&lt;br /&gt;&lt;br /&gt;&gt; hard&lt;br /&gt;$tip1&lt;br /&gt;[1] 7 3 3 4 8 3 2 9 5 4&lt;br /&gt;&lt;br /&gt;$tip2&lt;br /&gt;[1] 6 3 5 3 8 2 4 9 4 5&lt;br /&gt;&lt;br /&gt;The paired test is&lt;br /&gt;&lt;span style="color:#993399;"&gt;&gt; t.test(hard$tip1,hard$tip2,paired=TRUE,var.equal=TRUE,alternative="two.sided")&lt;/span&gt;&lt;br /&gt;Paired t-test&lt;br /&gt;&lt;span style="color:#3333ff;"&gt;data: hard$tip1 and hard$tip2&lt;br /&gt;t = -0.2641, df = 9, p-value = 0.7976 &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;alternative hypothesis: true difference in means is not equal to 0 &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;95 percent confidence interval:&lt;br /&gt;-0.9564389 0.7564389 &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;sample estimates:&lt;br /&gt;mean of the differences&lt;br /&gt;-0.1&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;/span&gt;With a p-value of 0.7976 we cannot reject the null hypothesis, so there is no evidence that the two tips produce different hardness readings.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Comparing variances:&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;It is sometimes useful to compare the variances of two samples. Consider the mortar example. 
We can compare the variances using R as follows&lt;br /&gt;&lt;span style="color:#993399;"&gt;&gt; var.test(data2$unmodified,data2$modified)&lt;/span&gt;&lt;br /&gt;F test to compare two variances&lt;br /&gt;&lt;span style="color:#3333ff;"&gt;data: data2$unmodified and data2$modified &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;F = 0.6138, num df = 9, denom df = 9, p-value = 0.4785&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#3333ff;"&gt;alternative hypothesis: true ratio of variances is not equal to 1&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;95 percent confidence interval:&lt;br /&gt;0.1524534 2.4710609&lt;br /&gt;&lt;/span&gt;&lt;span style="color:#3333ff;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="color:#3333ff;"&gt;sample estimates:&lt;br /&gt;ratio of variances&lt;br /&gt;0.6137766&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;The p-value of 0.4785 gives no evidence that the two variances differ, which supports the equal-variance assumption used in the two-sample t-test.&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1833450233412334915?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1833450233412334915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1833450233412334915' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1833450233412334915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1833450233412334915'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/examples-comparing-two-conditions-using.html' title='Examples - Comparing two conditions using 
R.'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/SsL1PyxD5xI/AAAAAAAACKY/_GBGRQGl1zo/s72-c/data2.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-7868748010846641878</id><published>2009-09-29T21:50:00.000-07:00</published><updated>2009-09-29T22:21:59.458-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Design of experiment'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Design of Experiments - Introduction</title><content type='html'>&lt;div align="justify"&gt;Experiments are performed in various fields to understand the behaviour of a system. In this post we will look at the common terms used in experimental design.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Experiment :&lt;/strong&gt; &lt;br /&gt;An experiment can be defined as a test or a group of tests in which changes are made to the input variables and the effect of those changes is observed in the output, or response, variable. The aim of the experiment is to find out what effects the input variables have on the output variable.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Strategy of Experimentation&lt;/strong&gt; : &lt;br /&gt;There are various ways of performing the tests in an experiment.&lt;br /&gt;&lt;em&gt;One-at-a-time :&lt;/em&gt; In this strategy a baseline set of levels is selected for each factor. Subsequently, one of the factors is varied across its range while the other factors are held constant at their baseline levels. The tests are then repeated for the other factors. 
This strategy has the advantage of simplicity; however, it fails to consider any interaction effects between factors. Interaction is defined as the failure of one factor to produce the same effect on the response variable at different levels of another factor.&lt;br /&gt;&lt;em&gt;Factorial:&lt;/em&gt; A useful method to deal with multiple input variables (also called treatments or factors) is the factorial method of experimentation. In this method the factors are varied together instead of one at a time. However, if there are more than four factors, the number of combinations of tests may be too high. In such cases it is often unnecessary to run all possible combinations of factor levels, and a fractional factorial experiment is used instead. It is a variation of the factorial experiment in which only a subset of the runs is made.&lt;br /&gt;&lt;br /&gt;It is useful to understand three basic principles of experimental design before starting with an experimental design. These principles are replication, randomization and blocking.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Replication :&lt;/strong&gt; &lt;br /&gt;Replication is the term given to repetition of the basic experiment. Replication allows the experimenter to estimate the experimental error. For example, if a sample mean is used to identify the effect of a factor, replication provides a more precise estimate of that mean. However, a distinction has to be made between replication and repeated measurement. Replication entails performing the test again to find out the variability both between the runs and possibly within the runs. A repeated measurement merely takes the value of the output variable multiple times or performs the experiment without changing the factors. 
Thus it does not capture the variability in the experiment due to the input factor.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Randomization: &lt;/strong&gt;&lt;br /&gt;The statistical analysis of the experiment is based on the assumption that both the allocation of the experimental material and the order of the runs are random. The experiments themselves are performed in random order. The observations made should be independently distributed random variables. Proper randomization helps average out the effects of external factors.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Blocking:&lt;/strong&gt;&lt;br /&gt; Blocking is used to minimize the variability introduced in the experiment by nuisance factors. Blocking is generally used when comparisons between factors are made. Nuisance factors are factors that influence the response variable but are not of interest to the experimenter. That is, they are not part of the study, but they nevertheless influence it. Each level of the nuisance factor becomes a block and the experiment is carried out within blocks.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Experimental design - guidelines:&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;1. Understand and define the problem - Sometimes a single large experiment may not be able to provide the answers or may be difficult to perform. In such cases a series of smaller experiments may be performed.&lt;br /&gt;2. Choice of input variables or factors, their levels and their range : The experimenter considers the design factors and the nuisance factors during this stage. The design factors are those that are changed during the experiment so that their effects can be studied. They may be classified as factors selected for the experiment, held-constant factors (variables kept at constant values during the experiment) and allowed-to-vary factors. The &lt;em&gt;nuisance factor&lt;/em&gt; may be &lt;strong&gt;controllable, uncontrollable or noise factors&lt;/strong&gt;. The experimenter can set the levels of controllable factors. 
The blocking principle can be used to deal with controllable factors. If a nuisance factor cannot be controlled but can be measured, then a strategy called analysis of covariance can be used to accommodate its effects. For noise factors the strategy is to minimize the variability they cause. This is sometimes referred to as a robust design problem.&lt;br /&gt;3. Selection of response variables.&lt;br /&gt;4. Choice of experimental design.&lt;br /&gt;5. The actual experiment.&lt;br /&gt;6. Analysing the data.&lt;br /&gt;7. Conclusion of the experiment. &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-7868748010846641878?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/7868748010846641878/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=7868748010846641878' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7868748010846641878'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/7868748010846641878'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/design-of-experiments-introduction.html' title='Design of Experiments - Introduction'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' 
src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-9108394594120498776</id><published>2009-09-24T21:49:00.000-07:00</published><updated>2010-06-15T02:28:15.247-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='facts'/><category scheme='http://www.blogger.com/atom/ns#' term='Dimensional Modeling'/><category scheme='http://www.blogger.com/atom/ns#' term='dimensions'/><title type='text'>Dimensional Modeling</title><content type='html'>&lt;div align="justify"&gt;A company that has a huge volume of data builds a data warehouse so that it can generate reports, perform analytics and make informed decisions. The process of building a data warehouse from a transactional or source system is important, and the process that a company selects depends on the long-term vision and goals of its business intelligence systems. There are two schools of thought on how a data warehouse system should be built. In this article we discuss one of those. The method that we discuss here owes its existence to &lt;strong&gt;Ralph Kimball&lt;/strong&gt; and is called &lt;strong&gt;Dimensional modeling&lt;/strong&gt;.&lt;br /&gt;Before discussing what dimensional modeling is, let us understand what is expected of a data warehouse. 
The requirements of a data warehouse are:&lt;br /&gt;1) It should allow everyone in the organization to effectively access information.&lt;br /&gt;2) It must present a 'single version of truth' of data.&lt;br /&gt;3) It must be able to incorporate changes quickly.&lt;br /&gt;4) It must protect data and manage data access.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Components of a Data Warehouse:&lt;/strong&gt;&lt;br /&gt;Let's discuss the components of the data warehouse topology.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Operational source system&lt;/strong&gt; : The source systems are the transactional systems that the company uses to capture its operational data. The source systems may range from a relational database to a flat file. The source systems are characterized by high availability and high performance. However, they may not store historical information.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Data Staging Area&lt;/strong&gt;: The data from the source systems needs to be fed to the data warehouse. However, the data may undergo various transformations before it is loaded into the data warehouse. A process called ETL (Extract, Transform and Load) is carried out in the data staging area. Ralph Kimball advocates that the data in the data warehouse should be stored in dimensional form. The data in the staging area may be in normalized form, but it should not be available to the end user for querying.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Data Presentation:&lt;/strong&gt; This is the area where the end users submit their queries. It consists of a series of data marts. A data mart contains data in a dimensional form. The data in a data mart corresponds to one business process. Note that a data mart does not contain the data of a business department but that of a business process. There may be multiple processes in an organization and hence multiple data marts. However, the processes, and hence the data marts, may contain information that is shared across the organization. 
For example, the customer information may be present in all data marts. Therefore all data marts may contain a dimension called customer. This dimension may be shared (physically or logically) across all the data marts and is called a conformed dimension.&lt;br /&gt;The presentation area is used in the following ways:&lt;br /&gt;1) Creating predefined reports.&lt;br /&gt;2) Creating ad hoc reports.&lt;br /&gt;3) As an input to forecasting.&lt;br /&gt;4) As an input to other analytics tools.&lt;br /&gt;The dimensional schema for a relational database is called a star schema. For a multidimensional system the data may be stored in OLAP cubes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Besides the above, the organization may also maintain a &lt;strong&gt;metadata repository&lt;/strong&gt;. Metadata is information about the tables, their indexes, their structure, etc. It is beneficial for the organization to let the users query through the metadata rather than the tables themselves. This makes the system flexible, since changes in the underlying physical tables do not affect the queries.&lt;br /&gt;The metadata layers can be pictured as&lt;br /&gt;physical table -&gt; physical model -&gt; business model -&gt; business table &lt;- user query&lt;br /&gt; &lt;span style="font-weight:bold;"&gt;Fact Table&lt;/span&gt;: The dimensional schema consists of fact tables and one or more dimensional tables. The fact table is the heart of the schema and contains the transactional information. It stores the numeric information that can be aggregated. The performance metrics and other KPIs are generally stored in the fact table. One characteristic of the fact table is that it contains data at a single grain, i.e. all rows in the fact table contain data at the same level of granularity. Ideally, this level of granularity should be as low as possible. This gives the user the ability to drill down to any level. 
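&lt;br /&gt;&lt;br /&gt;To make the idea of grain concrete, here is a small sketch in R, the language used elsewhere on this blog. The table fact_sales, its columns and its values are all hypothetical and only serve as an illustration: each row is one transaction-level record, and coarser views are derived from it by aggregation, whereas the reverse is not possible.&lt;br /&gt;
&lt;span style="color:#993399;"&gt;&gt; # hypothetical fact table: one row per date and product&lt;br /&gt;
&gt; fact_sales &lt;- data.frame(date_key = c(1, 1, 2, 2), product_key = c(10, 11, 10, 11), sales_amount = c(20, 35, 15, 40))&lt;br /&gt;
&gt; # roll the numeric measure up to the date level&lt;br /&gt;
&gt; aggregate(sales_amount ~ date_key, data = fact_sales, FUN = sum)&lt;br /&gt;
&gt; # roll the same measure up to the product level&lt;br /&gt;
&gt; aggregate(sales_amount ~ product_key, data = fact_sales, FUN = sum)&lt;br /&gt;
&lt;/span&gt;&lt;br /&gt;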
The textual values may not be suitable for the fact table and are moved to the dimension tables. The fact table has foreign keys (FK) that correspond to the primary keys of the dimensional tables.&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Dimensional Tables&lt;/span&gt;: These contain the dimensions on which the user can perform drill through and which they use in their queries. A dimensional table mostly contains textual values. A very important dimension table is the Date table. The fact table together with the dimensional tables forms a star join schema.&lt;br /&gt;Example: Let's take the example of a retail garment store. The store has its sales information and wants to set up a data warehouse that allows it to generate ad hoc reports. The information that it can put in the fact table is the sales amount, the discount offered, the price after discount, etc. The information in its dimensional tables could be the size of the shirt or trousers sold and the colour of the garment. It could have a date dimension that stores the date of the transaction and also stores other information such as holidays and other special days. The schema looks like this:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsQ_Fb2IbXI/AAAAAAAACLo/p22AHwbgDgQ/s1600-h/schema_retail.JPG"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 256px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5387500416848653682" border="0" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SsQ_Fb2IbXI/AAAAAAAACLo/p22AHwbgDgQ/s400/schema_retail.JPG" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The process of dimension design is as follows:&lt;/strong&gt;&lt;br /&gt;1) The first step is selecting the business process. As discussed earlier, a business process is distinct from a business department or organization. Typical business processes are orders, invoicing, inventory management, etc. The business departments may be procurement, store management, etc.&lt;br /&gt;2) The second step is identifying the granularity of the data that is to be stored in the fact table. 
Care should be taken at this step since it is difficult to change the grain at a later stage. It is advisable to select as low a grain as possible to allow users to perform ad hoc queries. Separate fact tables may be used to store accumulated data at a higher grain.&lt;br /&gt;3) The third step is choosing the dimensions. The dimensions may be date, product, etc.&lt;br /&gt;The dimension table should contain descriptive values rather than code values. This allows the user to quickly use these columns in their queries without performing any lookups. Also, the dimension table should contain all derived information as separate columns. For example, in a date dimension it is beneficial to store separate columns for day, month, day of year, etc. Although this information can be obtained from a 'date' object, the user should not be burdened with performing the transformation for use in the query or report. Creating a report from the dimensional schema should be as simple as dragging and dropping the columns into the report.&lt;br /&gt;4) The fourth step is to identify the numeric information that goes into the fact table. The numeric information forms part of the KPIs and hence of the organization's dashboard. Once again, derived information should be stored as separate columns. For example, if the data contains weight and the user may need the weight in either pounds or kg, then it is beneficial to have two columns in the fact table, one containing the weight in pounds and the other the weight in kg.&lt;br /&gt;&lt;br /&gt;Some important concepts that can be helpful while designing a dimensional model are:&lt;br /&gt;&lt;strong&gt;Degenerate Dimensions :&lt;/strong&gt; &lt;/div&gt;&lt;div align="justify"&gt;Certain values such as the invoice number or the order number can be used in the fact table without a corresponding dimension table. 
These values are known as degenerate dimensions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Dimensional normalization- snowflaking:&lt;/strong&gt;&lt;br /&gt;Although normalization is in general not recommended, in certain cases the dimensional tables may be normalized. This is known as snowflaking. This is typically done when the number of attributes is high. The redundant attributes may be put in another table.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Surrogate Key :&lt;/strong&gt; &lt;/div&gt;&lt;div align="justify"&gt;The fact table is joined to the dimensional table using a key. This key should not be a natural key but rather a key generated by the system specifically for this purpose. For example, a product dimension could be joined to the fact table using the product number; however, there are problems with this approach (for instance, a natural key can change or be reused in the source system), and it is recommended to generate a unique number for each row in the dimensional table and join the fact table to the dimensional table using this unique number.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Data Warehouse Bus Architecture:&lt;/strong&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssbt0vtUDBI/AAAAAAAACMo/Qd6gJOIqp-k/s1600-h/purpletable.gif"&gt;&lt;img style="WIDTH: 325px; HEIGHT: 295px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5388255494610881554" border="0" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/Ssbt0vtUDBI/AAAAAAAACMo/Qd6gJOIqp-k/s400/purpletable.gif" /&gt;&lt;/a&gt;&lt;br /&gt;Ref:http://intelligent-enterprise.informationweek.com/db_area/archives/1999/990712/webhouse.jhtml&lt;br /&gt;A data warehouse bus architecture can be used to implement an integrated data warehouse in an enterprise. The different data marts can be plugged together using the bus architecture. As seen in the picture, the individual data marts are the rows of the matrix, whereas the columns of the matrix are the conformed dimensions. 
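&lt;br /&gt;&lt;br /&gt;As an illustration only, such a bus matrix can be sketched in R as a logical matrix. The data mart names, the dimension names and the TRUE/FALSE assignments below are hypothetical and are not taken from the picture above.&lt;br /&gt;
&lt;span style="color:#993399;"&gt;&gt; # hypothetical data marts (rows) and conformed dimensions (columns)&lt;br /&gt;
&gt; marts &lt;- c("Orders", "Invoicing", "Inventory")&lt;br /&gt;
&gt; dims &lt;- c("Date", "Product", "Customer")&lt;br /&gt;
&gt; bus &lt;- matrix(FALSE, nrow = length(marts), ncol = length(dims), dimnames = list(marts, dims))&lt;br /&gt;
&gt; # mark which conformed dimensions each data mart uses (assumed values)&lt;br /&gt;
&gt; bus["Orders", c("Date", "Product", "Customer")] &lt;- TRUE&lt;br /&gt;
&gt; bus["Invoicing", c("Date", "Customer")] &lt;- TRUE&lt;br /&gt;
&gt; bus["Inventory", c("Date", "Product")] &lt;- TRUE&lt;br /&gt;
&gt; bus&lt;br /&gt;
&gt; # number of data marts that share each dimension&lt;br /&gt;
&gt; colSums(bus)&lt;br /&gt;
&lt;/span&gt;&lt;br /&gt;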
When a data mart is designed, a row is added to the matrix and the conformed dimensions it uses can be looked up in the columns. It is beneficial to enumerate all conformed dimensions at the start of the data warehouse building process. This helps the designer to quickly identify dimensions that can be reused.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Conformed dimension:&lt;/strong&gt; &lt;/div&gt;&lt;div align="justify"&gt;Conformed dimensions have been described earlier. Some of the characteristics of conformed dimensions are :&lt;br /&gt;1) They have consistent column names, definitions and values.&lt;br /&gt;2) A subset of the entire dimensional table may be used.&lt;br /&gt;3) Separate physical tables may be used, but the physical tables must be kept well synchronized.&lt;br /&gt;4) Dimension tables may be handled by a separate authority, called a dimensional authority.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Slowly Changing Dimensions:&lt;/strong&gt;&lt;br /&gt;The attributes of the dimensional table may change over time. A suitable technique is needed to incorporate the changed attributes in the dimensional table without breaking the reports, while also making the changed attributes available to be joined to the fact table for reporting.&lt;br /&gt;There are three strategies to handle slowly changing dimensions. The change is described as slow since a dimension attribute does not change very often. If an attribute changes frequently, then the design needs to be revisited.&lt;br /&gt;Type 1: In the first strategy the new value overwrites the old value. So, for example, if product A moves from category A to category B, the category column value for product A is simply changed from category A to category B. This is a simple, fast and easy way to handle the change. The fact table does not change. However, note that the historical information is lost. There is no way to find the performance of product A under category A items. 
Also historical data of category A may start differing since Product A is not shown under it any more.&lt;br /&gt;Type 2: The second strategy is very widely used. In this strategy an additional row is added to the dimensional table. For example, for product A an additional row is added where the category is now category B. All other values remain the same. A new key is generated for the row. This method also highlights the advantage of using a system-generated primary key (surrogate key) instead of a natural key as the primary key. The second advantage is that historical reports may still contain the older definition of product A. A date value may be added to show the effective/expiration date.&lt;br /&gt;Type 3: A new column can be added. For product A, two columns can be maintained. One column shows the old category and the other shows the new category. This strategy can be useful when changes are minor. The previous value is also preserved. However, it may not always be possible to add new columns.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Dimension Role Playing :&lt;/strong&gt; &lt;/div&gt;&lt;div align="justify"&gt;Sometimes a single dimension table may be linked to the fact table multiple times (ways). In this case the dimension table may be a single physical table but may be represented in different views. The dimension table therefore plays multiple roles.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Junk Dimension :&lt;/strong&gt;&lt;/div&gt;&lt;div align="justify"&gt; A junk dimension is sometimes used to group low cardinality columns such as flags.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Fact table types:&lt;/strong&gt;&lt;br /&gt;The fact table may be broadly of three types:&lt;br /&gt;&lt;em&gt;Transaction fact table:&lt;/em&gt; These store data at the transaction grain. Each row may correspond to a single transaction.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Periodic snapshot :&lt;/em&gt; These show cumulative data for a particular time frame. 
Each row may contain aggregated data for the day, week or month. These tables can be used to generate reports for performance metrics (KPIs).&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Accumulating snapshot:&lt;/em&gt; The accumulating snapshot stores data for a transaction or entity that may take an indeterminate time. For example, an accumulating snapshot of customer purchases may show all purchases to date. Note that in this kind of table the values may need to be updated regularly. &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-9108394594120498776?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/9108394594120498776/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=9108394594120498776' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/9108394594120498776'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/9108394594120498776'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/dimensional-modeling.html' title='Dimensional Modeling'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_GvNS-b8AbU4/SsQ_Fb2IbXI/AAAAAAAACLo/p22AHwbgDgQ/s72-c/schema_retail.JPG' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2563022972161796367</id><published>2009-09-15T07:25:00.000-07:00</published><updated>2009-09-15T09:18:44.186-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data warehouse'/><title type='text'>Introduction to data Warehouse</title><content type='html'>&lt;div align="justify"&gt;&lt;br /&gt;Many companies spend a fortune on organizing and storing their data. The amount of data that is generated daily may vary for different companies but it is not uncommon to hear of scores of GB of data for medium sized companies and terabytes of data for large enterprises. It is indeed challenging to store this much data in a meaningful way and then to use this data to make decisions. In this article we look at why a company needs a data warehouse and how it should go about building one.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;What is a data warehouse anyway?&lt;/strong&gt;&lt;br /&gt;Let's say that you are a car manufacturing company. You have been making cars for more than 50 years and you have managed to implement IT in all your departments. You have a sales management system that records sales across your dealer network. This sales system not only records the sales history but also records information about potential customers. Then you have the accounting system that generates invoices for the sales made and also keeps track of the company income and expenditure. You may also have a financial system that helps you make your budgetary decisions and provides you with a financial &lt;span id="SPELLING_ERROR_0" class="blsp-spelling-corrected"&gt;road map&lt;/span&gt;. Your company would also have a customer relationship management system and a campaign management system. Of course, you would have a state-of-the-art operations system with an advanced inventory management system. 
There may be various other systems but let's concentrate on what we need out of these systems. Different people in the organization may need different kinds of information out of them. The dealer would need the daily or monthly sales figures. She may also need a report of potential customers. At a regional level the manager might need information at aggregate levels, i.e. sales by city by month or a list of potential customers that have a certain fixed budget and fall under a particular age group. As the company matures, it may need deeper analytics such as optimized inventory distribution for its network etc. A data warehouse helps the company in storing data in such a way that such information retrieval becomes faster and simpler. It allows the company to store a large amount of historical data without in any way affecting its transaction systems. A well designed data warehouse system with an advanced Business Intelligence tool can do wonders for a company.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;How is the database different from a data warehouse?&lt;/strong&gt;&lt;br /&gt;A database is designed to store the transactional or operational data whereas a data warehouse is designed to store historical information. For example, a database would be used by a transaction system that generates invoices for the car sales. But imagine how fast the data grows in this system. If we add up transactions for a month for all stores &lt;span id="SPELLING_ERROR_1" class="blsp-spelling-corrected"&gt;across&lt;/span&gt; the country and we &lt;span id="SPELLING_ERROR_2" class="blsp-spelling-corrected"&gt;don't&lt;/span&gt; remove this data then the database would become &lt;span id="SPELLING_ERROR_3" class="blsp-spelling-corrected"&gt;extremely&lt;/span&gt; slow. The database is designed to give quick response time for transactions. A data warehouse on the other hand is designed to store hundreds of gigabytes of data and contains query optimizers that return quick results. 
Data warehouses may not be good at supporting transactions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;So how do the database and data warehouse coexist? How does the data flow?&lt;/strong&gt;&lt;br /&gt;A company generally has multiple transaction systems. The systems may be from different vendors and may have different underlying structures. For example, the accounting system may be completely different from a procurement system. They may use different vendors and completely different technology. However, we eventually want to generate reports that obtain data from both the systems. We also want to make sure that a 'single version of truth' is maintained in the organization. There have been many cases where two departments cannot come to a conclusion since they have different numbers for the same entity. A data warehouse therefore contains data from multiple source systems. A process known as &lt;span id="SPELLING_ERROR_4" class="blsp-spelling-error"&gt;ETL&lt;/span&gt; (Extract, Transform and Load) is used to extract data from the various systems in an organization and feed the data to a single data warehouse system. 
The data may undergo cleaning, transformation, &lt;span id="SPELLING_ERROR_5" class="blsp-spelling-error"&gt;deduplication&lt;/span&gt; etc before it is loaded into the data warehouse.&lt;br /&gt;The diagram below gives a simplified view of the process.&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/Sq-83QSWYjI/AAAAAAAACKI/fjC3_2E-0eM/s1600-h/dataWarehouse.jpg"&gt;&lt;img style="WIDTH: 400px; HEIGHT: 302px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5381727737181790770" border="0" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/Sq-83QSWYjI/AAAAAAAACKI/fjC3_2E-0eM/s400/dataWarehouse.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2563022972161796367?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2563022972161796367/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2563022972161796367' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2563022972161796367'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2563022972161796367'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/introduction-to-data-warehouse.html' title='Introduction to data Warehouse'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_GvNS-b8AbU4/Sq-83QSWYjI/AAAAAAAACKI/fjC3_2E-0eM/s72-c/dataWarehouse.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2248416434968025744</id><published>2009-09-07T01:17:00.000-07:00</published><updated>2009-09-07T23:31:30.919-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='t-tests'/><title type='text'>Statistics - Examples</title><content type='html'>&lt;div align="justify"&gt;&lt;strong&gt;T-Test:&lt;/strong&gt;&lt;br /&gt;Problem : A swimming instructor wants to prove that the swimming speed of an athlete increases if the athlete performs some specific exercises before the swim. He undertakes an experiment with 16 participants and randomly assigns 8 participants to each team. For team A he recommends some common exercises and for team B he recommends some specific exercises. The results of the experiment are below.&lt;br /&gt;Team A (Speed) - 10 12 11 16 13 9 15 6&lt;br /&gt;Team B (Speed) - 11 12 13 11 8 14 7 7&lt;br /&gt;Use the t-test to find out if the average speed of the two teams is significantly different. 
Use alpha = 0.05.&lt;br /&gt;&lt;br /&gt;Null Hypothesis =&gt; A(speed) = B(speed)&lt;br /&gt;Alternate Hypothesis =&gt; A(speed) ≠ B(speed)&lt;br /&gt;&lt;br /&gt;We assume unequal variances and therefore use the t-test for unequal variances.&lt;br /&gt;The calculation in R is given below.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SqTGO8f9eDI/AAAAAAAACJU/8Saqt1BGI-Y/s1600-h/t-test.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5378641815047862322" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 320px" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SqTGO8f9eDI/AAAAAAAACJU/8Saqt1BGI-Y/s400/t-test.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We see that the p-value is greater than 0.05 and hence the null hypothesis is not rejected.&lt;br /&gt;What this means is that this experiment gives no evidence that the special exercises are so special after all!&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="justify"&gt;We have assumed unequal variances. The other forms of the t-test are 1) assume equal variances 2) assume a paired test (the data corresponds to one team measured before and after the experiment).&lt;/div&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/SqTKJ6Ic9VI/AAAAAAAACJc/Pra_K-29TnQ/s1600-h/t-test-other.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5378646126559556946" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 329px" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/SqTKJ6Ic9VI/AAAAAAAACJc/Pra_K-29TnQ/s400/t-test-other.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Observe the difference in the degrees of freedom and the t-value in all three cases.&lt;br /&gt;Look at http://janda.org/c10/Lectures/topic07/pairedt-test.htm for an example of the significance of using the paired t-test.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;ANOVA - one way&lt;/strong&gt;:&lt;br /&gt;Problem : An independent magazine wants to find out if the tyre life differs for four different tyre manufacturers (for the same kind of tyres). 
It randomly selects 10 participants for each tyre make and compares the mean tyre life.&lt;br /&gt;Make A(tyre life in months) - 12 17 23 20 18 10 30 12 23 20&lt;br /&gt;Make B(tyre life in months) - 10 23 19 30 13 15 18 17 20 22&lt;br /&gt;Make C(tyre life in months) - 11 34 12 20 33 20 18 19 12 17&lt;br /&gt;Make D(tyre life in months) - 10 12 34 23 27 18 15 17 10 15&lt;br /&gt;&lt;br /&gt;Null Hypothesis - All means are equal.&lt;br /&gt;Alternate Hypothesis - At least one of the means is different.&lt;br /&gt;The results of the analysis are as given below.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SqTT0nlC4EI/AAAAAAAACJ0/QqYcB_VrmMk/s1600-h/one-way-anova.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5378656755918233666" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 277px" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SqTT0nlC4EI/AAAAAAAACJ0/QqYcB_VrmMk/s400/one-way-anova.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SqTTclmoLYI/AAAAAAAACJs/ueXBFWMmqUo/s1600-h/one-way-anova.JPG"&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The p-value is greater than 0.05 and hence the null hypothesis is not rejected.&lt;br /&gt;What this means is that the data give no evidence that the life of the tyre depends on the make. 
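&lt;br /&gt;&lt;br /&gt;The screenshots above show only the R output, so here is a minimal sketch of how the two examples in this post could be run. The variable names (team_a, team_b, welch, tyres, fit) are made up for illustration and are not taken from the original session; t.test, aov and summary are standard R functions.&lt;br /&gt;
&lt;pre&gt;
# T-test example: speeds for the two teams, as listed in the problem
team_a = c(10, 12, 11, 16, 13, 9, 15, 6)
team_b = c(11, 12, 13, 11, 8, 14, 7, 7)

# Welch two-sample t-test (unequal variances, which is also the t.test default)
welch = t.test(team_a, team_b, var.equal = FALSE)
print(welch)

# One-way ANOVA example: tyre life in months for the four makes
life = c(12, 17, 23, 20, 18, 10, 30, 12, 23, 20,   # Make A
         10, 23, 19, 30, 13, 15, 18, 17, 20, 22,   # Make B
         11, 34, 12, 20, 33, 20, 18, 19, 12, 17,   # Make C
         10, 12, 34, 23, 27, 18, 15, 17, 10, 15)   # Make D
make = factor(rep(c("A", "B", "C", "D"), each = 10))
tyres = data.frame(life = life, make = make)

# Fit the model and print the ANOVA table (F statistic and p-value)
fit = aov(life ~ make, data = tyres)
print(summary(fit))
&lt;/pre&gt;
Passing var.equal = TRUE or paired = TRUE to t.test should give the other two forms of the t-test mentioned earlier.&lt;br /&gt;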
&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;More Detailed examples of ANOVA can be found here:&lt;br /&gt;http://www.personality-project.org/R/r.anova.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2248416434968025744?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2248416434968025744/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2248416434968025744' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2248416434968025744'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2248416434968025744'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/statistics-examples.html' title='Statistics - Examples'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_GvNS-b8AbU4/SqTGO8f9eDI/AAAAAAAACJU/8Saqt1BGI-Y/s72-c/t-test.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1840074105120025794</id><published>2009-09-02T22:23:00.000-07:00</published><updated>2009-09-03T03:38:21.845-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><title type='text'>Use of Statistics - Practical Considerations</title><content type='html'>&lt;div 
align="justify"&gt;It is often confusing to decide which statistic to use at what point. Researchers also need to be careful that the statistics they present truly apply in the context of the problem. Statistics can be misleading, and possibly incorrect, if used outside the boundaries set by their assumptions. In this post we analyse different statistics and what care should be taken while using them.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Mean, median and mode&lt;/strong&gt; : The mean is probably the most widely used number. A lot of claims are made using the mean but this could be quite misleading. The mean does not include any measure of variability. Also, outliers may distort the mean to a large extent. The median can be reported with the mean to give an idea of how much the extreme values pull the mean. For a more detailed analysis the box and whisker plot gives a fair idea of what the data looks like. It is therefore better to provide a measure of variability along with the measure of central tendency.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Discrete distribution&lt;/strong&gt; : While using the binomial distribution the independence and size assumptions should be met when sampling is done without replacement. In the Poisson distribution the size and lambda assumptions should be met. For binomial distributions with large n the probability at a particular x value should be reported with care. For example, in a coin toss with n = 100 and p = 0.5, P(x=50) is only about 0.08. This could be counter-intuitive and hence a better way to report this is to use P(x&gt;50). In a Poisson study the value of lambda may change and hence the researcher should make sure that lambda is valid for their test conditions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Continuous distribution&lt;/strong&gt; : Here again, the value of lambda should be chosen with care. 
The value of lambda used in one study may not be useful in another similar study since the populations may be different or the time interval of the lambda may be different. For the normal distribution, care should be taken to verify the distribution since most of the tests are highly sensitive to the type of distribution of the population.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Sampling&lt;/strong&gt; - One of the most widely misused methods is sampling. Many surveys use non-random sampling instead of random sampling and the statistic thus obtained may be quite off the mark. The sample data is quite often used to make inferences about the population and if questionable means have been used to sample data then the population inferences may be highly incorrect.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Hypothesis testing for single population&lt;/strong&gt;: In hypothesis testing it is imperative that the researcher formulate the hypothesis in such a way that what is known is the null hypothesis and what he strives to prove is the alternate hypothesis. Researchers may use the null hypothesis as the statement of what they want to prove and this is incorrect since then the theory is assumed to be true and the alternate hypothesis only strives to disprove it. While using the t-test, the population needs to be normally distributed to some extent. However, the chi square test is extremely sensitive to the assumption that the population is normally distributed and hence the researcher should make sure that the population is indeed normally distributed. Also, the business implication of a statistical significance test needs to be worded carefully. The context needs to be understood and the assumptions made while arriving at the 'significance' level should be clarified.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Hypothesis testing for two populations&lt;/strong&gt; : The assumptions behind the statistic that compares two populations should be met. 
1) For small sample sizes, the z test is valid only if the population is normally distributed and population variances are known. 2) The t-test can be used if the population is normally distributed and population variances are assumed to be equal. 3) For the F test the two populations should be normally distributed.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;ANOVA&lt;/strong&gt;: While reporting ANOVA results the researcher needs to consider all variables that may affect the outcome of the experiment. She should at least mention the concomitant variables that have not been considered in the experiment but that have been shown to have some influence on the dependent variable, however small that influence may be. The treatment levels selected for the study should possibly be random. Certain tests such as two way factorial design or completely randomized design with Tukey's HSD may require equal sample sizes. Sometimes researchers arbitrarily make up or delete values to make sample sizes equal and this is incorrect.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Regression Analysis&lt;/strong&gt;: Regression analysis requires equal error variance and independence of error terms. Residual analysis and other statistical techniques can be used to verify that. Remember that the regression line is valid only if the assumptions are met. Another problem arises when the regression model is used outside the values used to formulate the regression model. The model is valid only in the domain used to create the model. Data may behave linearly in a certain range but may tend to behave non-linearly outside this range.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Multiple Regression:&lt;/strong&gt; For a small number of degrees of freedom the value of R squared obtained may be inflated. A cause and effect relationship may be assumed to occur between the dependent variable and the predictors. It is however possible that factors not considered in the study may be causing the behaviour. 
Also for multiple variables with different units the R squared values should not be compared. The coefficients of regression should also not be used to compare the effect of various predictors since the predictors may have different units. Also while building the multiple regression model it is not necessarily true that the variable that enters the equation first is the most influential. &lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;Reference:&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;Use the site &lt;a href="http://www.whichtest.info/"&gt;http://www.whichtest.info/&lt;/a&gt; to figure out which test to use when.&lt;/div&gt;&lt;br /&gt;&lt;div align="justify"&gt;Also a table is available at &lt;a href="http://www.graphpad.com/www/Book/Choose.htm"&gt;http://www.graphpad.com/www/Book/Choose.htm&lt;/a&gt; that assists in understanding which test to use when. The table is reproduced below.&lt;/div&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/Sp9dVMsDQaI/AAAAAAAACJM/17T1baeOTms/s1600-h/type-of-test.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5377119098868285858" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 250px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/Sp9dVMsDQaI/AAAAAAAACJM/17T1baeOTms/s400/type-of-test.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div align="justify"&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1840074105120025794?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1840074105120025794/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1840074105120025794' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1840074105120025794'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1840074105120025794'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/use-of-statistics-practical.html' title='Use of Statistics - Practical Considerations'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/Sp9dVMsDQaI/AAAAAAAACJM/17T1baeOTms/s72-c/type-of-test.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-5931215301410431039</id><published>2009-09-01T21:42:00.000-07:00</published><updated>2009-09-03T03:32:58.190-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wilcoxon matched-pairs signed ranks test'/><category scheme='http://www.blogger.com/atom/ns#' term='Friedman test'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Mann-Whitney U test'/><category scheme='http://www.blogger.com/atom/ns#' term='Spearman&apos;s rank correlation coefficient.'/><category scheme='http://www.blogger.com/atom/ns#' term='Kruskal-Wallis test'/><category scheme='http://www.blogger.com/atom/ns#' term='Runs test'/><title type='text'>Non Parametric Statistics</title><content type='html'>&lt;div align="justify"&gt;The earlier posts on inferential statistics show methods that work on data whose population parameters and distribution are known. 
However, there are cases where the population parameters are not known. In such cases, no assumption can be made about the population parameters and hence parametric methods cannot be used. Instead, certain &lt;em&gt;non-parametric &lt;/em&gt;methods can be used. Some non-parametric statistics can deal with nominal data (where we have unordered categories such as happy, not happy) or ordinal data (where we have ordered categories such as 1-agree, 2-disagree, 3-strongly disagree etc.).&lt;br /&gt;We will look at the following tests: Runs test, Mann-Whitney U test, Wilcoxon matched-pairs signed ranks test, Kruskal-Wallis test, Friedman test and Spearman's rank correlation coefficient.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Runs Test&lt;/strong&gt;:&lt;br /&gt;Consider a railway station where the station master keeps a record of whether each train arrives on time or late at his station, given that it arrived on time at the previous station. He needs to find out whether some problem on the track between the previous station and his station is causing the trains to be delayed, or whether the delays are random occurrences not attributable to any specific problem.&lt;br /&gt;15 trains arrive at the station in the day. Let the code for arrival on time be 0 and arrival late be 1. Also assume that the trains arrive at constant intervals. The arrivals can be of the form 000000011111111, i.e. the first 7 trains arrive on time and the last 8 arrive late. This kind of data is suspicious: why do the trains arriving in the first half of the day come on time and the others don't? This behaviour could be non-random. Perhaps visibility reduces as the day progresses and the train driver needs to slow the trains down. Another possible sequence is 001011011011000. This kind of sequence could be random. The Runs test can be used to establish whether this is indeed random. A run can be defined as an occurrence of a sequence of similar events. 
For example, the 00 at the start of the sequence is one run. The total number of runs is 9: 00-1-0-11-0-11-0-11-000.&lt;br /&gt;&lt;br /&gt;The &lt;em&gt;Null hypothesis&lt;/em&gt; for the runs test is that the observations in the sample are randomly generated. The &lt;em&gt;alternate hypothesis&lt;/em&gt; is that they are not randomly generated. The test differs for small samples and large samples.&lt;br /&gt;&lt;strong&gt;Small sample Runs test&lt;/strong&gt; - For small samples let n1 be the number of observations of event 1 and n2 be the number of observations of event 2. If n1 and n2 are less than or equal to 20 this test can be used. This is how the test is carried out.&lt;br /&gt;1) Establish that the small sample test can be used by examining the n1 and n2 values.&lt;br /&gt;2) Assume a value of alpha (0.05).&lt;br /&gt;3) Use the Runs table for small samples with n1, n2 and alpha to find the critical values. Note that the Runs table gives two values: an upper tailed table gives the higher critical value (C2) and a lower tailed table gives the lower critical value (C1). If the number of runs (N) in the observations falls within C1 and C2 then the decision is to not reject the null hypothesis, i.e. there is no evidence that the sample is non-random.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Large Sample Runs Test: &lt;/strong&gt;&lt;br /&gt;For n1 and n2 greater than 20 the sampling distribution of the number of runs is approximately normal. The z statistic can be used to check the null hypothesis. The z statistic is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2590/3881054458_0454b73cda.jpg"&gt;&lt;img style="WIDTH: 318px; CURSOR: hand; HEIGHT: 113px" alt="" src="http://farm3.static.flickr.com/2590/3881054458_0454b73cda.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The z table can be used to find out the critical values for the level of alpha. The z values can then be compared. 
If the z value from the experiment falls outside the critical values from the table, the null hypothesis is rejected. Equivalently, if the p-value is less than alpha the null hypothesis is rejected.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Mann-Whitney U Test &lt;/strong&gt;:&lt;br /&gt;It is a counterpart of the t-test that can be used when the &lt;em&gt;assumption of normal distribution is not valid&lt;/em&gt;, provided the data are at least ordinal. The two tailed hypothesis is : &lt;em&gt;Null hypothesis&lt;/em&gt; - the populations are identical.&lt;br /&gt;Let us look at this test using an example. Suppose that a researcher wants to find out if the average age of people entering a movie theater is different in two cities.&lt;br /&gt;The example below shows how the Mann-Whitney U test solves the problem.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2541/3880258067_e459f7af31.jpg"&gt;&lt;img style="WIDTH: 210px; CURSOR: hand; HEIGHT: 500px" alt="" src="http://farm3.static.flickr.com/2541/3880258067_e459f7af31.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The steps are:&lt;br /&gt;Combine the data and arrange them in ascending order. Assign ranks to the observations and also preserve the information about which group each observation comes from. Calculate W1 as the sum of ranks for group 1 elements and W2 as the sum of ranks of group 2 elements. Calculate U1 and U2 using&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2613/3881054530_a74ab40a34.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 59px" alt="" src="http://farm3.static.flickr.com/2613/3881054530_a74ab40a34.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The test statistic is the smaller of the two U values. Also note that U1 = n1n2 - U2. Use the table to find the p-value using U (smaller), n1 (number of observations for the sample having the smaller U) and n2. For a two tailed test double the p-value.&lt;br /&gt;&lt;br /&gt;For large samples U is approximately normally distributed. 
A z statistic can be used to check the null hypothesis. The z statistic is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2498/3880258147_f25a657e5e.jpg"&gt;&lt;img style="WIDTH: 476px; CURSOR: hand; HEIGHT: 66px" alt="" src="http://farm3.static.flickr.com/2498/3880258147_f25a657e5e.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Wilcoxon matched-pairs signed rank test&lt;/strong&gt;:&lt;br /&gt;The Mann-Whitney test is the counterpart of the t-test for independent samples. However, if the two samples are related, the Wilcoxon matched-pairs signed rank test can be used. This test suits situations where the researcher measures data before and after a study, or takes data from the same person under two different conditions.&lt;br /&gt;The method used for calculation depends on the size of the sample. The small-sample case (fewer than 15 pairs) is shown in the example below.&lt;a href="http://farm4.static.flickr.com/3498/3881054734_0446a2e1b5_m.jpg"&gt;&lt;img style="WIDTH: 215px; CURSOR: hand; HEIGHT: 240px" alt="" src="http://farm4.static.flickr.com/3498/3881054734_0446a2e1b5_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here we are comparing the household income for a group of families before and after a change in government policy. The steps are as follows: first calculate the difference in income for each pair. Arrange the absolute values of the differences in ascending order and assign ranks to them. For pairs that have a negative difference, change the sign of the rank to negative. Calculate the sum of the ranks for the negative values and for the positive values. The smaller of the two sums, in absolute value, is the test statistic T. Compare this value with the value obtained from the T table for given n and alpha. 
The null hypothesis is rejected if the calculated T is less than the T from the table.&lt;br /&gt;Large-Sample Case:&lt;br /&gt;If the sample size is greater than 15 then the T statistic is approximately normally distributed and a z score can be used to check the null hypothesis.&lt;br /&gt;The z formula is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2536/3881054806_c0433c4b27_m.jpg"&gt;&lt;img style="WIDTH: 240px; CURSOR: hand; HEIGHT: 67px" alt="" src="http://farm3.static.flickr.com/2536/3881054806_c0433c4b27_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;T is calculated in the same way as in the small sample analysis.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Kruskal-Wallis Test:&lt;/strong&gt;&lt;br /&gt;This test is the nonparametric counterpart of one way ANOVA. The parametric ANOVA assumes normally distributed populations, independent groups, at least interval level data, and equal population variances. The Kruskal-Wallis test is used to analyse ordinal data and does not assume any particular population distribution. However, this test deals with independent populations only, and the sample data need to be selected randomly from the population.&lt;br /&gt;Let the test compare c groups. The hypotheses are&lt;br /&gt;Null Hypothesis: the c populations are identical. Alternate hypothesis: at least one of the c populations is different.&lt;br /&gt;The process is as follows:&lt;br /&gt;1) Combine the groups together and arrange the data in ascending order. 
Assign ranks to the data and keep track of the group to which each value belongs.&lt;br /&gt;2) In case of a tie between multiple data elements, each tied element is assigned the average rank of the tied elements.&lt;br /&gt;3) The K value is calculated as below.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2532/3880258361_a6fcb17817.jpg"&gt;&lt;img style="WIDTH: 460px; CURSOR: hand; HEIGHT: 66px" alt="" src="http://farm3.static.flickr.com/2532/3880258361_a6fcb17817.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Where c = number of groups&lt;br /&gt;n = total number of items&lt;br /&gt;Tj = total of ranks in a group.&lt;br /&gt;nj = number of items in a group.&lt;br /&gt;4) The K value is compared to the chi square value for the given alpha and df = c-1. If the calculated K value is greater than the chi square value, the null hypothesis is rejected.&lt;br /&gt;5) This is a one tailed test only.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Friedman Test:&lt;/strong&gt;&lt;br /&gt;This test is the nonparametric counterpart of the randomized block design. The Friedman test can work with populations whose distribution is not known and with data that are ranked. The blocks need to be independent and the researcher should be able to rank the observations within each block.&lt;br /&gt;The null hypothesis is: the populations under treatment are equal.&lt;br /&gt;The alternate hypothesis is: at least one population under treatment yields larger values than at least one other population under treatment.&lt;br /&gt;Steps for analysis:&lt;br /&gt;1) Rank the data. The ranks are assigned within each block (row); data from different blocks are not combined. See the earlier tests for how data can be ranked. The smallest rank in a block is 1 and the largest is c. 
c is also the number of treatment levels.&lt;br /&gt;The statistic is calculated using:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2565/3880258419_8b272e6d86_m.jpg"&gt;&lt;img style="WIDTH: 240px; CURSOR: hand; HEIGHT: 55px" alt="" src="http://farm3.static.flickr.com/2565/3880258419_8b272e6d86_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where c = number of treatment levels (columns)&lt;br /&gt;b = number of blocks (rows)&lt;br /&gt;Rj = total of ranks for a particular treatment level (column)&lt;br /&gt;j = particular treatment level (column)&lt;br /&gt;The value obtained from the formula is compared with the chi square value for the given alpha and df=c-1. If it is greater than the chi square value, the null hypothesis is rejected.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Spearman's Rank Correlation&lt;/strong&gt;:&lt;br /&gt;The degree of association between two variables when only ordinal level data are available can be calculated from Spearman's rank correlation, given by:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2663/3881054876_a72c66f1ff_m.jpg"&gt;&lt;img style="WIDTH: 185px; CURSOR: hand; HEIGHT: 77px" alt="" src="http://farm3.static.flickr.com/2663/3881054876_a72c66f1ff_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;n = number of pairs being correlated.&lt;br /&gt;d = difference in the ranks of each pair. 
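&lt;br /&gt;&lt;br /&gt;As a rough sketch of the Spearman calculation, the short Python snippet below ranks two variables and applies the usual textbook form of the formula, r = 1 - 6*sum(d^2) / (n*(n^2 - 1)). The helper names ranks and spearman_rs and the sample data are hypothetical, made up for illustration, and the sketch assumes there are no tied values.&lt;br /&gt;
&lt;pre&gt;
```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rs(x, y):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    # d is the difference between the two ranks of each pair
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data, made up for illustration only.
hours = [2, 4, 5, 7, 9]
score = [50, 65, 60, 80, 90]
print(spearman_rs(hours, score))
```
&lt;/pre&gt;
With real data that contain ties, the tied values would need average ranks, in the same way as described for the Kruskal-Wallis test. 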
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-5931215301410431039?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/5931215301410431039/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=5931215301410431039' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5931215301410431039'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5931215301410431039'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/09/non-parametric-statistics.html' title='Non Parametric Statistics'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2590/3881054458_0454b73cda_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-8709435162567449651</id><published>2009-08-31T23:53:00.000-07:00</published><updated>2009-09-03T03:27:33.417-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Multiple Regression'/><title type='text'>Multiple Regression model building</title><content type='html'>&lt;div align="justify"&gt;&lt;strong&gt;Polynomial Regression&lt;/strong&gt;:&lt;/div&gt;&lt;div align="justify"&gt; First order regression models contain predictors that 
are raised only to the first power. Polynomial models have one or more predictors raised to a power greater than one. A quadratic model has a predictor in both first and second order form.&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3525/3877724056_76bf4bba56.jpg"&gt;&lt;img style="WIDTH: 235px; CURSOR: hand; HEIGHT: 40px" alt="" src="http://farm4.static.flickr.com/3525/3877724056_76bf4bba56.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Since the model is linear in its coefficients, x1 squared can be recoded as a new variable x2, and the equation is then that of an ordinary linear regression.&lt;br /&gt;Higher power models can be used but are generally not used for business purposes.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Tukey's ladder of transformation :&lt;/strong&gt;&lt;br /&gt;A plot of x and y which seems non linear may need a transformation before fitting. The quadratic polynomial is one such transformation, where the data are recoded for a better fit. Tukey proposed a series of transformations that can be used to improve the fit of the model to the data. The transformation to use depends on the shape of the data (the shape of the scatter plot of the data).&lt;br /&gt;A variable x can be transformed as follows.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2589/3877724084_7ff400ae66.jpg"&gt;&lt;img style="WIDTH: 402px; CURSOR: hand; HEIGHT: 59px" alt="" src="http://farm3.static.flickr.com/2589/3877724084_7ff400ae66.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;When moving towards higher powers of x, we move up the ladder.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Regression model with interaction :&lt;/strong&gt;&lt;br /&gt;Two variables in a model may not be independent and may interact in such a way that the effect of one depends on the value of the other. This interaction effect can be handled by adding an interaction variable, which is the product of the two variables. 
The linear regression equation for such a model is&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2596/3877724142_d98ed09a93.jpg"&gt;&lt;img style="WIDTH: 279px; CURSOR: hand; HEIGHT: 39px" alt="" src="http://farm3.static.flickr.com/2596/3877724142_d98ed09a93.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Transforming the y variable : In certain cases the y variable needs to be transformed.&lt;br /&gt;For example&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2673/3876930767_801de286b1.jpg"&gt;&lt;img style="WIDTH: 261px; CURSOR: hand; HEIGHT: 113px" alt="" src="http://farm3.static.flickr.com/2673/3876930767_801de286b1.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The variable y is transformed but x is not.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Indicator or Dummy variables&lt;/strong&gt;: Some variables are qualitative: they categorize items rather than measure a quantity. These may be ordinal or nominal variables. They are included in a regression model as indicator (dummy) variables. For example, a customer satisfaction survey may have values such as excellent, good and poor. Indicator variables are coded using 0 and 1. For example, a gender question can be coded as 1 for female and 0 for male. If there are more than two values then multiple indicator variables are used. In the survey example we can have two indicator variables, namely good and poor. Note that a third variable is not required, since if the grading is neither good nor poor then it is bound to be excellent. 
A value of good is coded as good = 1 and poor = 0, a value of poor as good = 0 and poor = 1, and a value of excellent as good = 0 and poor = 0.&lt;br /&gt;Also, as a rule of thumb it is generally necessary to have at least three observations per variable to get a meaningful model.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Model-building procedures :&lt;/strong&gt;&lt;br /&gt;Different predictor variables can be used to form a model. For example, the price of a commodity may depend on the cost of raw materials, the demand for the item, the price of a competitor's product etc. The aim of model building is to find the variables that matter most, that is, the variables that best explain the variance of the dependent variable. The model also needs to be simple: the fewer the predictor variables, the better the model. If a predictor variable explains only a marginal share of the variance of the dependent variable, it is best to leave that predictor out (the more variables there are, the harder it is to collect data and to explain the model to management).&lt;br /&gt;So how does one select the appropriate model? How does one go about analysing which variables are important?&lt;br /&gt;There are various procedures available to answer these questions. We look at those procedures below.&lt;br /&gt;&lt;strong&gt;All possible regression : &lt;/strong&gt;&lt;/div&gt;&lt;div align="justify"&gt;As the name suggests, in this procedure we analyse all possible models. For example, if there are 5 variables, we examine models by taking one variable at a time, two variables at a time and so on. For k variables this yields 2^k-1 models. For 5 variables this yields 31 models. The advantage of this procedure is that the researcher can examine all relationships. The disadvantage is that it is tedious and may not be feasible.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Stepwise regression :&lt;br /&gt;&lt;/strong&gt;This is the most popular method. 
It begins with a single variable and adds or deletes a variable in each step. In the first step of the procedure all variables are analysed one at a time. The variable yielding the highest absolute value of t is chosen.&lt;br /&gt;In the second step the selected variable is combined with each of the remaining variables in turn (k variables yield k-1 models in this step). The t value of the added variable is calculated and the variable having the highest t value is selected. However, if adding another variable makes the t value of the original variable insignificant, the original variable is discarded.&lt;br /&gt;The procedure continues until all significant independent variables have been identified.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Forward Selection&lt;/strong&gt; :&lt;/div&gt;&lt;div align="justify"&gt; This method is the same as stepwise regression, but a variable once added is never removed.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Backward elimination&lt;/strong&gt; :&lt;/div&gt;&lt;div align="justify"&gt; This procedure begins with all predictors. The significance of all predictors is calculated and the one with the least significance is discarded. This process is continued until all remaining variables are significant.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;multicollinearity&lt;/strong&gt; : &lt;/div&gt;&lt;div align="justify"&gt;If two or more independent variables are correlated with each other, we have multicollinearity. The t value obtained for a variable may then be misleading, since the correlated variables influence it. One way to deal with collinearity is to create a matrix that shows the correlation of each variable with every other, choose the variable that has the maximum correlation with the others, and drop the others. Stepwise regression is another method that can eliminate correlated variables. 
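&lt;br /&gt;&lt;br /&gt;The correlation-matrix check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names pearson_r and correlated_pairs, the cut-off of 0.8 and the predictor data are all hypothetical and are not taken from any particular library or data set.&lt;br /&gt;
&lt;pre&gt;
```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(predictors, threshold=0.8):
    """List (name_a, name_b, r) for predictor pairs with abs(r) >= threshold.

    The threshold of 0.8 is an assumed cut-off, not a fixed rule.
    """
    flagged = []
    for a, b in combinations(sorted(predictors), 2):
        r = pearson_r(predictors[a], predictors[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical predictors, made up for illustration only.
predictors = {
    "raw_material_cost": [10, 12, 14, 16, 18],
    "competitor_price": [21, 24, 29, 32, 37],
    "demand": [5, 3, 8, 2, 7],
}
print(correlated_pairs(predictors))
```
&lt;/pre&gt;
Any pair that is flagged is a candidate for keeping one variable and dropping the other before the model is fitted. 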
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-8709435162567449651?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/8709435162567449651/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=8709435162567449651' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8709435162567449651'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/8709435162567449651'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/multiple-regression-model-building.html' title='Multiple Regression model building'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3525/3877724056_76bf4bba56_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-4022963265057008519</id><published>2009-08-31T23:34:00.000-07:00</published><updated>2009-09-03T03:34:17.088-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='t-tests'/><category scheme='http://www.blogger.com/atom/ns#' term='p-value'/><category scheme='http://www.blogger.com/atom/ns#' term='F-test'/><title type='text'>T-Test, F-Test and P-value</title><content type='html'>&lt;p align="justify"&gt;Two very 
important tests in statistical analysis are the t-test and the F-test. However, a new user may be confused about the difference between the two tests. In this post I will try to present the difference between the two tests and when each should be used.&lt;/p&gt;&lt;p align="justify"&gt;&lt;br /&gt;But before we understand the tests, let’s understand what a &lt;strong&gt;p-value&lt;/strong&gt; is. The p-value is the probability, under the null hypothesis, of getting results at least as extreme as those observed. For example, after performing a t-test you find that the p-value is 0.06. Does this mean that the null hypothesis can be rejected? Suppose you decide that, due to random error, even if the null hypothesis is true, 5 out of 100 experiments would inevitably reject it, and you can live with that. In our experiment, however, the p-value is 0.06, which says that ‘results like yours would turn up in 6 out of 100 experiments even if the null hypothesis is true’. You are not comfortable with that: you were willing to accept five false rejections in 100, but rejecting on this evidence would mean accepting six. So the null hypothesis is not rejected.&lt;br /&gt;For practical purposes, reject a null hypothesis if the p-value is less than alpha (generally 5% or 0.05).&lt;/p&gt;&lt;p align="justify"&gt;&lt;br /&gt;&lt;strong&gt;T-test&lt;/strong&gt;: The t-test is used to find out if the means of two populations are significantly different.&lt;br /&gt;Characteristics of the test are:&lt;br /&gt;1) The test statistic follows a t distribution under the null hypothesis.&lt;br /&gt;2) The test can be used to find out if the mean of a population is different from a known mean.&lt;br /&gt;3) The test can be used to find out if the means of two samples are significantly different. Note that the two populations need to follow the normal distribution. 
Also, the variances of the two populations are assumed to be equal when the sample sizes are small (less than 30).&lt;br /&gt;4) The test can be used to find out if the difference between values of a single variable measured at different times is zero.&lt;br /&gt;5) The test can be used to find out if the regression line has a slope different from zero.&lt;br /&gt;6) Paired vs un-paired: A test of type 3 is an unpaired test: the samples are independent. A test of type 4 is a paired test. In many cases of paired data, it is the same variable undergoing repeated observations, for example measurements taken before and after an experiment.&lt;br /&gt;7) The questions that need to be answered before using the t-test are: is it a single population or multiple populations, are the sample sizes equal, are the variances equal, and is it a paired or un-paired test.&lt;/p&gt;&lt;p align="justify"&gt;&lt;br /&gt;&lt;strong&gt;F-test&lt;/strong&gt;: The F-test is used to find out if the variances of two populations are significantly different.&lt;br /&gt;Characteristics of an F-test are:&lt;br /&gt;1) The test statistic has an F distribution under the null hypothesis, i.e. the ratio of variances follows an F distribution.&lt;br /&gt;2) The F-test can be used to find out if the means of multiple populations having the same standard deviation differ significantly from each other. (ANOVA)&lt;br /&gt;3) The F-test can be used to find out if the data fit a regression model obtained using least squares analysis. Here we check whether the mean square due to regression is significantly larger than the mean square due to error.&lt;br /&gt;4) The test can be a two tailed test or a one tailed test.&lt;br /&gt;5) The F-test for ANOVA with two groups is equivalent to performing the t-test. 
Also the relation is given by F=t squared.&lt;br /&gt;6) For ANOVA the F statistic is the ratio of the variance between groups to the variance within groups.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-4022963265057008519?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/4022963265057008519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=4022963265057008519' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4022963265057008519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4022963265057008519'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/t-test-f-test-and-p-value.html' title='T-Test, F-Test and P-value'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-6208275088534838560</id><published>2009-08-25T23:42:00.000-07:00</published><updated>2009-09-03T03:23:00.649-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Multiple Regression'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>Statistics for Business Intelligence - Multiple Regression</title><content 
type='html'>&lt;div align="justify"&gt;Simple regression deals with predicting a dependent variable from a single independent variable using a linear regression equation. In multiple regression analysis there is more than one independent variable, or at least one nonlinear term in an independent variable.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Multiple Regression Model &lt;/strong&gt;:&lt;br /&gt;The multiple regression model is of the form&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2468/3858746506_dfbfea4b90.jpg"&gt;&lt;img style="WIDTH: 393px; CURSOR: hand; HEIGHT: 45px" alt="" src="http://farm3.static.flickr.com/2468/3858746506_dfbfea4b90.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where y = the value of the dependent variable.&lt;br /&gt;beta0 = the regression constant.&lt;br /&gt;beta1 = partial regression coefficient for independent variable 1.&lt;br /&gt;beta2 = partial regression coefficient for independent variable 2.&lt;br /&gt;k = the number of independent variables.&lt;br /&gt;&lt;br /&gt;The partial regression coefficient of an independent variable is the change in the dependent variable for a unit change in that independent variable, with the other independent variables held constant.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Multiple regression model with two independent variables :&lt;/strong&gt;&lt;br /&gt;In this case we obtain a regression plane that fits the data. 
The multiple regression equation is given by:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2577/3858746556_66de16212a.jpg"&gt;&lt;img style="WIDTH: 400px; CURSOR: hand; HEIGHT: 121px" alt="" src="http://farm3.static.flickr.com/2577/3858746556_66de16212a.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;These equations are obtained by minimizing the sum of squares of error.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Testing the Multiple Regression Model:&lt;/strong&gt;&lt;br /&gt;The overall model can be tested using hypotheses of the form&lt;br /&gt;&lt;em&gt;Null Hypothesis&lt;/em&gt;: beta1 = beta2 = ... = betak = 0.&lt;br /&gt;&lt;em&gt;Alternate Hypothesis&lt;/em&gt;: at least one of the betas is not equal to 0.&lt;br /&gt;This test can be used to establish that the data do indeed show a relationship between the independent variables and the dependent variable.&lt;br /&gt;A rejection of the null hypothesis indicates that at least one of the independent variables predicts the dependent variable.&lt;br /&gt;The F test can be used to check the hypothesis.&lt;br /&gt;The F value is given by:&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3419/3857956215_257b20416a.jpg"&gt;&lt;img style="WIDTH: 487px; CURSOR: hand; HEIGHT: 72px" alt="" src="http://farm4.static.flickr.com/3419/3857956215_257b20416a.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where&lt;br /&gt;MS=mean square;&lt;br /&gt;SS=sum of squares;&lt;br /&gt;df=degrees of freedom (k for the regression, N-k-1 for the error);&lt;br /&gt;k=number of independent variables.&lt;br /&gt;N=number of observations.&lt;br /&gt;&lt;br /&gt;A significance test can also be carried out for each regression coefficient to check whether it is individually significant. A t-test can be used for this purpose.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Residuals:&lt;/strong&gt;&lt;br /&gt;The residual can be calculated by solving the multiple regression equation and obtaining the predicted value of the dependent variable. 
The difference between the value obtained from observation and the value obtained from the calculation is the residual. A plot of residuals can be used to assess the fit of the model and to identify outliers.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Standard Error Of Estimation:&lt;/strong&gt;&lt;br /&gt;The sum of residuals is equal to zero if rounding errors are not considered, so it cannot measure the size of the error. The sum of squares of error is used instead.&lt;br /&gt;The sum of squares of error or SSE is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2475/3857965449_afd734116d.jpg"&gt;&lt;img style="WIDTH: 143px; CURSOR: hand; HEIGHT: 39px" alt="" src="http://farm3.static.flickr.com/2475/3857965449_afd734116d.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The standard error of estimate (standard deviation of error for the regression model) can be estimated using:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2472/3858746622_3d351a940c.jpg"&gt;&lt;img style="WIDTH: 149px; CURSOR: hand; HEIGHT: 58px" alt="" src="http://farm3.static.flickr.com/2472/3858746622_3d351a940c.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The assumption in regression is that the errors follow a normal distribution. The standard error of estimate can be used to check this assumption. (We know that for a normal distribution about 68% of values fall within one standard deviation of the mean and about 95% fall within two standard deviations.)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Coefficient of multiple Determination (R squared) : &lt;/strong&gt;&lt;br /&gt;The coefficient of multiple determination is the proportion of the variation of the dependent variable y that is accounted for by the independent variables. A value of 0 indicates no relationship and a value of 1 indicates a perfect relationship. 
The coefficient is given by:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2640/3857956317_76dfcc5b6b.jpg"&gt;&lt;img style="WIDTH: 230px; CURSOR: hand; HEIGHT: 66px" alt="" src="http://farm3.static.flickr.com/2640/3857956317_76dfcc5b6b.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;SSR = sum of squares of regression, SSE = sum of squares of error and SSyy is the total sum of squares. These values can be read from the ANOVA table.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Adjusted R square : &lt;/strong&gt;&lt;br /&gt;The R squared value never decreases as variables are added to the regression model. It tends to increase even when the new variable does not have a significant effect on the y variable. To take care of this effect an adjusted R squared value is used, which takes the number of independent variables into account. The value is given by:&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3498/3857956359_02f538d19d.jpg"&gt;&lt;img style="WIDTH: 264px; CURSOR: hand; HEIGHT: 66px" alt="" src="http://farm4.static.flickr.com/3498/3857956359_02f538d19d.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;To summarize, to check the results of multiple regression, the following need to be checked.&lt;br /&gt;1. the regression model equation.&lt;br /&gt;2. the ANOVA table and the F value of the overall model.&lt;br /&gt;3. SSE values.&lt;br /&gt;4. standard error of estimate.&lt;br /&gt;5. coefficient of multiple determination.&lt;br /&gt;6. 
adjusted coefficient of multiple determination.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-6208275088534838560?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/6208275088534838560/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=6208275088534838560' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6208275088534838560'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6208275088534838560'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_4701.html' title='Statistics for Business Intelligence - Multiple Regression'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2468/3858746506_dfbfea4b90_t.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-6657234939037525132</id><published>2009-08-25T03:12:00.000-07:00</published><updated>2009-09-03T03:15:58.634-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Simple Regression'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Error of Estimate'/><category 
scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>Statistics for Business Intelligence - Simple Regression</title><content type='html'>&lt;div align="justify"&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;p align="justify"&gt;&lt;strong&gt;Introduction:&lt;/strong&gt;&lt;/p&gt;&lt;br /&gt;&lt;p align="justify"&gt;In many situations, a relationship between two variables needs to be analysed. For example, a car manufacturer would like to know whether the number of cars bought in a city is related to the average household income in the city, or a sales manager would like to know if sales revenue depends on the discount percentage offered by the company. Such an analysis can be done by simple regression. Simple regression involves building a model that predicts one variable from another. The known variable is called the independent variable and the variable to be determined is called the dependent variable. To do the analysis, the researcher fits a straight line to the data, i.e. a line of the form y=mx+c, where y is the dependent variable, x is the independent variable, m is the slope of the line and c is the y intercept. This is a deterministic model, since it gives an exact value of y for a given value of x. 
Statisticians also use probabilistic models, in which y is determined only to within an error term:&lt;br /&gt;y=mx+c+e&lt;br /&gt;where e is the error in determination.&lt;br /&gt;For a sample the regression line is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3519/3855756016_60c81727b2.jpg"&gt;&lt;img style="WIDTH: 129px; CURSOR: hand; HEIGHT: 32px" alt="" src="http://farm4.static.flickr.com/3519/3855756016_60c81727b2.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;b0 and b1 can be obtained for the sample using least squares analysis.&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3544/3854966489_3676981521.jpg"&gt;&lt;img style="WIDTH: 434px; CURSOR: hand; HEIGHT: 192px" alt="" src="http://farm4.static.flickr.com/3544/3854966489_3676981521.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2614/3854966529_5a40caea3d.jpg"&gt;&lt;img style="WIDTH: 276px; CURSOR: hand; HEIGHT: 60px" alt="" src="http://farm3.static.flickr.com/2614/3854966529_5a40caea3d.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Residual Analysis&lt;/strong&gt; :&lt;br /&gt;Once a regression line is determined, the researcher needs to validate whether the line is a good fit for the data.&lt;br /&gt;&lt;br /&gt;To do so, the researcher can use historical data and compare it with the regression line.&lt;br /&gt;For each historical point a residual value is obtained. This is the difference between the actual historical value and the value predicted by the regression line. The least squares line is the line that minimizes the sum of the squares of these residuals. The sum of the residuals is zero for the sample data if there are no rounding errors. A point with a large residual may be an outlier. The residual plot can be used to gauge how effective the regression model is. The residual plot is a plot of the residual values against the independent variable. 
It checks the following assumptions of simple regression analysis:&lt;br /&gt;1) The model is linear&lt;br /&gt;2) The error terms have constant variance, are independent and are normally distributed.&lt;br /&gt;The residual plot can be visually analysed to verify the above assumptions.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Standard Error of estimate:&lt;/strong&gt;&lt;br /&gt;The standard error of estimate measures the error that arises out of simple regression. It is the standard deviation of the error terms. The standard error of estimate is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3479/3854966591_d07661b72b.jpg"&gt;&lt;img style="WIDTH: 118px; CURSOR: hand; HEIGHT: 56px" alt="" src="http://farm4.static.flickr.com/3479/3854966591_d07661b72b.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where SSE can be computed by either of these methods.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2499/3855756114_9a488500b5.jpg"&gt;&lt;img style="WIDTH: 350px; CURSOR: hand; HEIGHT: 98px" alt="" src="http://farm3.static.flickr.com/2499/3855756114_9a488500b5.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The standard error of estimate, being a standard deviation, can be used to check whether the residuals are approximately normally distributed. For normally distributed values, about 68% of the residuals fall within one standard error of zero and about 95% fall within two.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Coefficient of determination &lt;/strong&gt;:&lt;br /&gt;The coefficient of determination is written r2 (r squared). It is the proportion of the variation of the dependent variable (y) explained by the independent variable (x). The value can range from 0 to 1. A value of 0 implies that x explains none of the variation in y, and a value of 1 implies that all of the variation in y can be explained by x. From a business point of view, a researcher may judge a given value of the coefficient to be good or bad depending on the context. 
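&lt;/p&gt;&lt;p align="justify"&gt;The least squares fit, the residuals, the standard error of estimate and r2 described so far can be sketched in a few lines of Python. This is only an illustrative sketch: the function name simple_regression and the five data points are hypothetical, and the standard error is computed with the usual textbook form sqrt(SSE/(n-2)), which is assumed to match the formula pictured above.&lt;/p&gt;&lt;pre&gt;
```python
import math


def simple_regression(x, y):
    """Least squares fit of y = b0 + b1*x with basic fit diagnostics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross products around the means.
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = ss_xy / ss_xx            # slope
    b0 = y_bar - b1 * x_bar       # intercept
    # Residual = actual value minus value predicted by the line.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(r ** 2 for r in residuals)
    se = math.sqrt(sse / (n - 2))  # standard error of estimate (assumed form)
    r2 = 1 - sse / ss_yy           # proportion of variation explained
    return {"b0": b0, "b1": b1, "residuals": residuals,
            "sse": sse, "se": se, "r2": r2}


# Hypothetical sample, used only for illustration.
fit = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit["b0"], fit["b1"], fit["sse"], fit["se"], fit["r2"])
```
&lt;/pre&gt;&lt;p align="justify"&gt;For these made-up points the fitted line is approximately y = 2.2 + 0.6x, the residuals sum to zero (up to rounding), SSE is about 2.4 and r2 is about 0.6, i.e. the line explains roughly 60% of the variation in y. 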
A high value is sought by those seeking exact prediction.&lt;br /&gt;&lt;br /&gt;The coefficient can be calculated as follows:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2462/3854966639_6a639b97c5.jpg"&gt;&lt;img style="WIDTH: 367px; CURSOR: hand; HEIGHT: 333px" alt="" src="http://farm3.static.flickr.com/2462/3854966639_6a639b97c5.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The total sum of squares of y can be broken into two parts, i.e. the variation explained by the regression, the sum of squares of regression (SSR), and the unexplained variation, the sum of squares of error (SSE). The coefficient of determination is the proportion of the total variation explained by the regression.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Hypothesis testing for the slope of the regression model:&lt;br /&gt;&lt;/strong&gt;One way to determine whether the regression model is applicable (significant) is to test whether the slope of the regression line is significant, i.e. whether the population slope is different from 0. (If it is different from 0, the variables are linearly related and hence the regression model can be applied.) A t-test on the slope can be used to test hypotheses of the form:&lt;br /&gt;&lt;em&gt;Null hypothesis&lt;/em&gt; - the population slope is zero.&lt;br /&gt;&lt;em&gt;Alternate hypothesis&lt;/em&gt; - the population slope is not zero (it is greater than or less than zero).&lt;br /&gt;Note that this is a two tailed test.&lt;br /&gt;The t statistic is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3527/3855756220_8b5d298765.jpg"&gt;&lt;img style="WIDTH: 474px; CURSOR: hand; HEIGHT: 181px" alt="" src="http://farm4.static.flickr.com/3527/3855756220_8b5d298765.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where beta1 = the hypothesized slope; df = n-2&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Confidence interval for the determination of y&lt;/strong&gt; : y can be determined from x using the regression model. 
However, this gives only a point estimate. A confidence interval gives the range within which the mean of y for a given x falls at a chosen confidence level, and a prediction interval gives the range within which a single y value falls at that confidence level.&lt;br /&gt;The prediction interval for y can be given by:&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2332/3854966411_11349bcdd6.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 86px" alt="" src="http://farm3.static.flickr.com/2332/3854966411_11349bcdd6.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Note that the regression line is obtained from the sample data, so it may be valid only for the range of the sample data. Although the regression line is sometimes extrapolated beyond that range, the results may be incorrect. Hence care should be taken while using the regression model to predict values.&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-6657234939037525132?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/6657234939037525132/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=6657234939037525132' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6657234939037525132'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/6657234939037525132'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_25.html' title='Statistics for Business Intelligence - Simple Regression'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3519/3855756016_60c81727b2_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2403100673556845683</id><published>2009-08-23T05:15:00.000-07:00</published><updated>2009-08-24T04:44:17.467-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='chi-square goodness of fit'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='chi-square distribution.'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='chi-square test of independence'/><title type='text'>Statistics for Business Intelligence - Chi Square Tests</title><content type='html'>&lt;div align="justify"&gt;Chi square tests, namely the chi-square goodness of fit test and the chi-&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;square&lt;/span&gt; test of independence, are used to analyse frequency distributions of categorical (discrete) variables. Consider, for example, a wine cellar that has four different categories of wine: the frequency distribution of the four categories can be analysed by such tests.&lt;br /&gt;&lt;strong&gt;Chi-Square Goodness of Fit test&lt;/strong&gt; :&lt;br /&gt;This test is used on experiments that extend the &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;binomial&lt;/span&gt; distribution, i.e. on the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;multinomial&lt;/span&gt; distribution, in which each trial has more than two possible outcomes. 
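&lt;br /&gt;&lt;br /&gt;As a rough sketch of a goodness of fit calculation for the wine cellar example, the usual chi-square statistic (the sum over categories of (observed - expected) squared, divided by expected, with k-1 degrees of freedom for k categories) can be computed as follows. The function name, the bottle counts and the equal expected shares are hypothetical and chosen only for illustration.&lt;br /&gt;&lt;pre&gt;
```python
def chi_square_gof(observed, expected_props):
    """Chi-square goodness of fit statistic and its degrees of freedom."""
    total = sum(observed)
    # Expected frequency of each category under the claimed distribution.
    expected = [p * total for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df


# Hypothetical counts of bottles in four wine categories, tested against
# the (assumed) claim that all four categories are equally common.
observed = [30, 20, 25, 25]
stat, df = chi_square_gof(observed, [0.25, 0.25, 0.25, 0.25])

# Table value of chi-square for alpha = 0.05 and 3 degrees of freedom.
critical = 7.815
print(stat, df, stat > critical)
```
&lt;/pre&gt;Here the calculated value is 2.0 with 3 degrees of freedom, which is below the table value of 7.815, so the claimed distribution would not be rejected at the 5% level.&lt;br /&gt;&lt;br /&gt;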
The chi-square test compares the observed distribution of outcomes to the expected distribution of outcomes. In other words, it shows how well the observed distribution fits the expected distribution. For example, a retail store chain may claim that its customers follow a certain satisfaction distribution, as shown below&lt;br /&gt;Satisfied - 80%&lt;br /&gt;Somewhat satisfied - 10%&lt;br /&gt;Not Satisfied - 10%&lt;br /&gt;&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;The&lt;/span&gt; results of a random survey undertaken by a particular store manager can be used to test whether the distribution applies to her store as well.&lt;br /&gt;&lt;br /&gt;The formula used for the chi-square test is&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3447/3847757319_51ea9e8060.jpg"&gt;&lt;img style="WIDTH: 452px; CURSOR: hand; HEIGHT: 152px" alt="" src="http://farm4.static.flickr.com/3447/3847757319_51ea9e8060.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The chi-square distribution is a family of curves, one for each value of the degrees of freedom.&lt;br /&gt;This is how the test works - Use the chi-square table with the degrees of freedom and a suitable alpha value to find the critical value of chi-square. Use the above formula to calculate the chi-square value for the experiment. If the calculated value is greater than the table value then the null hypothesis (that the observed distribution follows the expected distribution) can be rejected. Note that this is a single tailed test, since only large values of chi-square indicate a poor fit.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Chi-square test of independence&lt;/strong&gt; : This test can be used to analyse frequencies when there are two variables, each having several categories. For example, a tyre company may want to find out whether the size of tyre used is independent of the make of the tyre. 
The responses can be recorded in a two way table; for example, the results can be captured with tyre size on the horizontal and tyre make on the vertical. If we have two sizes of tyres and two makes of tyres, we have a 2X2 matrix of results, each cell containing the frequency for the make-size combination. The chi-square test of independence tells whether the two variables are independent.&lt;br /&gt;&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_4"&gt;The&lt;/span&gt; null hypothesis of the test is that the variables are independent.&lt;br /&gt;If the variables are independent, the expected frequency of each cell can be calculated from the observed totals using the formula&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3457/3847808005_8f1f315913.jpg"&gt;&lt;img style="WIDTH: 452px; CURSOR: hand; HEIGHT: 108px" alt="" src="http://farm4.static.flickr.com/3457/3847808005_8f1f315913.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Using these expected values and the observed values, the chi-square value can be calculated as&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2472/3848612388_29bfa7eeb0.jpg"&gt;&lt;img style="WIDTH: 452px; CURSOR: hand; HEIGHT: 111px" alt="" src="http://farm3.static.flickr.com/2472/3848612388_29bfa7eeb0.jpg" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2403100673556845683?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2403100673556845683/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2403100673556845683' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2403100673556845683'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2403100673556845683'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-buiness-intelligence-chi.html' title='Statistics for Business Intelligence - Chi Square Tests'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3447/3847757319_51ea9e8060_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1932169562622895342</id><published>2009-08-12T09:31:00.000-07:00</published><updated>2009-09-07T23:57:29.808-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='tukey-kramer procedure'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='ANOVA'/><category scheme='http://www.blogger.com/atom/ns#' term='two way ANOVA'/><category scheme='http://www.blogger.com/atom/ns#' term='Tukey&apos;s test'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>statistics for business intelligence - ANOVA</title><content type='html'>&lt;div align="justify"&gt;Design of experiments : An experimental design is used to test a hypothesis by modifying one or more variables. The variables may be dependent or independent. 
An independent variable may be a treatment variable (one that the experimenter can modify) or a classification variable (a characteristic of the experimental subjects that is present prior to the experiment and is not modified during it).&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;ANOVA&lt;/span&gt; - &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;Analysis&lt;/span&gt; of variance - This is a methodology in which the researcher studies the variance in a dependent variable due to one or more independent variables. This &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;technique&lt;/span&gt; attempts to calculate the contribution of each independent variable to the variance. Various types of experimental designs are available; we discuss some of them.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;One way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Anova&lt;/span&gt; - Completely randomized design : &lt;/strong&gt;&lt;br /&gt;In this experiment, there is only one independent variable. The variable contains more than one classification level. If only two classification levels are present, the design is the same as comparing the means of two populations. An example of using a one way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;Anova&lt;/span&gt; is a &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_5"&gt;comparison&lt;/span&gt; of car music systems where the sound quality is compared. The kinds of music systems are the classification levels of the independent variable and the dependent variable is the sound quality (which can be quantified using a suitable measurement). In general, if k samples are analysed, the null hypothesis for one way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;ANOVA&lt;/span&gt; states that the means of all k populations are the same. 
If at least one of the means is different from the others, the null hypothesis is rejected.&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;ANOVA&lt;/span&gt; basically compares the relative size of the variation due to the treatment (independent) variable and the error variation (i.e. the variation &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_8"&gt;across&lt;/span&gt; treatment groups and within treatment groups). The one way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;ANOVA&lt;/span&gt; can be calculated by:&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3018/3814496774_0fca34aef4.jpg"&gt;&lt;img style="WIDTH: 401px; CURSOR: hand; HEIGHT: 330px" alt="" src="http://farm4.static.flickr.com/3018/3814496774_0fca34aef4.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where SS is the sum of squares and MS is the mean square. &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;SSC&lt;/span&gt; is the sum of squares of columns, which gives the sum of squares between treatments. SSE is the sum of squares of error. The F value is the ratio of the treatment variance to the error variance.&lt;br /&gt;&lt;br /&gt;&lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_11"&gt;The&lt;/span&gt; t test for two independent samples can be considered a special case of one way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_12"&gt;Anova&lt;/span&gt; where there are only two treatment levels.&lt;br /&gt;&lt;br /&gt;Once the one way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_13"&gt;ANOVA&lt;/span&gt; using the F test establishes that a significant difference exists between treatments, multiple &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_14"&gt;comparisons&lt;/span&gt; can be done to find out which pairs are different. t-tests can be used, but the Type I error accumulates across the repeated tests in that case. 
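&lt;br /&gt;&lt;br /&gt;The one way ANOVA quantities described above (SSC, SSE, the mean squares and the F value) can be sketched in Python as follows. This is a minimal illustration, not a full analysis: the function name and the sound quality scores for three music systems are hypothetical.&lt;br /&gt;&lt;pre&gt;
```python
def one_way_anova(groups):
    """Sums of squares, mean squares and F value for a one way ANOVA."""
    k = len(groups)                          # number of treatment levels
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment (column) sum of squares.
    ssc = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-treatment (error) sum of squares.
    sse = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_c = k - 1
    df_e = n_total - k
    msc = ssc / df_c
    mse = sse / df_e
    return {"SSC": ssc, "SSE": sse, "dfC": df_c, "dfE": df_e,
            "MSC": msc, "MSE": mse, "F": msc / mse}


# Hypothetical sound quality scores for three car music systems.
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
result = one_way_anova(scores)

# Table value of F for alpha = 0.05 with 2 and 6 degrees of freedom.
critical_f = 5.14
print(result["F"], result["F"] > critical_f)
```
&lt;/pre&gt;For these made-up scores SSC is 24, SSE is 6 and F is 12, which exceeds the table value of 5.14, so the hypothesis of equal means would be rejected at the 5% level.&lt;br /&gt;&lt;br /&gt;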
Other techniques have been developed for &lt;em&gt;multiple &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_15"&gt;comparisons&lt;/span&gt;.&lt;br /&gt;&lt;strong&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_16"&gt;Tukey's&lt;/span&gt; Honestly Significant Difference (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_17"&gt;HSD&lt;/span&gt;) Test&lt;/strong&gt;&lt;/em&gt; - This method is a multiple comparison test for samples of equal size. The &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_18"&gt;HSD&lt;/span&gt; value is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3046/3813686211_676f3c9f89.jpg"&gt;&lt;img style="WIDTH: 258px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3046/3813686211_676f3c9f89.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The difference in means of each pair is compared to this &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_19"&gt;HSD&lt;/span&gt; value, and a pair whose mean difference is greater than the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_20"&gt;HSD&lt;/span&gt; value is said to be different at that alpha level.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_21"&gt;Tukey&lt;/span&gt;-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_22"&gt;Kramer&lt;/span&gt; procedure&lt;/strong&gt;&lt;/em&gt; - This method is used for multiple comparisons when the samples are of unequal sizes.&lt;br /&gt;The formula is&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3046/3813686211_676f3c9f89.jpg"&gt;&lt;img style="WIDTH: 258px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3046/3813686211_676f3c9f89.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Randomized block design&lt;/strong&gt; : This is the second method of analysis, which considers another variable in addition to the treatment variable. 
This other variable is a confounding variable, i.e. a variable that is not the treatment of interest but has an effect on the outcome of the experiment.&lt;br /&gt;The randomized block design handles such a variable by adding it to the analysis as a blocking variable, so that its effect can be separated from the error. In this design the SSE of the completely randomized design is broken into a smaller SSE and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_23"&gt;SSR&lt;/span&gt; (sum of squares for the block variable).&lt;br /&gt;The formula is&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2648/3813686361_c05f6dca6d.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 410px" alt="" src="http://farm3.static.flickr.com/2648/3813686361_c05f6dca6d.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;TWO WAY &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_24"&gt;ANOVA&lt;/span&gt;&lt;/strong&gt; : Factorial design - In this procedure two or more variables are explored simultaneously. Every level of each treatment is studied under the conditions of every level of all other treatments. For a factorial design with two variables, a two way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_25"&gt;ANOVA&lt;/span&gt; can be used. Note that the randomized block design differs from this in that it cannot measure the interaction between the two variables. &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_26"&gt;The&lt;/span&gt; null hypotheses for two way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_27"&gt;anova&lt;/span&gt; are&lt;br /&gt;i. Row effects - the row means are all equal.&lt;br /&gt;ii. Column effects - the column means are all equal.&lt;br /&gt;iii. 
Interaction effects - the interaction effects are zero.&lt;br /&gt;The two way &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_28"&gt;anova&lt;/span&gt; can be calculated by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3432/3813686427_bba713d3b1.jpg"&gt;&lt;img style="WIDTH: 374px; CURSOR: hand; HEIGHT: 500px" alt="" src="http://farm4.static.flickr.com/3432/3813686427_bba713d3b1.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Interaction effects show that the variation in the column values depends on which row is selected. Whenever the interaction effects are significant, the row and column effects should not be interpreted on their own.&lt;br /&gt;&lt;br /&gt;Note: The method of analysis for &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_29"&gt;ANOVA&lt;/span&gt; is as follows - compare the F value calculated from the formula with the critical F value obtained from the table using the degrees of freedom specified in the formula and a suitable alpha value. If the F value calculated from the experiment is greater than the F value from the table, the corresponding null hypothesis is rejected.&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Links:&lt;br /&gt;1)&lt;a href="http://www.southampton.ac.uk/~cpd/anovas/datasets/index.htm" target="_blank"&gt; http://www.southampton.ac.uk/~cpd/anovas/datasets/index.htm&lt;/a&gt;&lt;br /&gt;2)&lt;a href="http://faculty.chass.ncsu.edu/garson/PA765/anova.htm" target="_blank"&gt; http://faculty.chass.ncsu.edu/garson/PA765/anova.htm&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1932169562622895342?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1932169562622895342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1932169562622895342' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1932169562622895342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1932169562622895342'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_12.html' title='statistics for business intelligence - ANOVA'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3018/3814496774_0fca34aef4_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-3286148175662621377</id><published>2009-08-11T01:16:00.000-07:00</published><updated>2009-09-03T03:04:48.075-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='confidence interval'/><category scheme='http://www.blogger.com/atom/ns#' term='t-tests'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>statistics for Business intelligence - Inference for 2 populations</title><content type='html'>&lt;div align="justify"&gt;Here we consider comparing statistics from two samples. We compare means, population proportions and variances. The tests used are the z test and the t test. 
Some of the experiments use &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;independent&lt;/span&gt; samples (the members of the two samples are independent of each other).&lt;br /&gt;&lt;strong&gt;Difference in two means using z-statistic&lt;/strong&gt; : According to the central limit theorem, the difference in two sample means is normally distributed for large sample sizes. The z formula for the difference in two sample means for large and independent samples is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3503/3810999052_9ec00cc2ab_m.jpg"&gt;&lt;img style="WIDTH: 219px; CURSOR: hand; HEIGHT: 90px" alt="" src="http://farm4.static.flickr.com/3503/3810999052_9ec00cc2ab_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;If the populations are normally distributed and the population variances are not known, the sample variances can be used in their place provided the sample sizes are large.&lt;br /&gt;&lt;br /&gt;Hypothesis testing can be used in practical scenarios to find out whether the mean of one population differs from the mean of another. This would be a two tailed test.&lt;br /&gt;&lt;br /&gt;The confidence interval for the difference in the means is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2580/3810182307_c1ac1a9206.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 52px" alt="" src="http://farm3.static.flickr.com/2580/3810182307_c1ac1a9206.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This confidence interval gives a 100(1-&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;alpha&lt;/span&gt;)% confidence level.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Difference in two means - t formula&lt;/strong&gt;&lt;br /&gt;This methodology can be used if the sample sizes are small, the samples are independent and the population variances are not known but are assumed to be equal. 
However, the measurement being studied should be normally distributed.&lt;br /&gt;The t value is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2547/3810999156_5e3225054c.jpg"&gt;&lt;img style="WIDTH: 462px; CURSOR: hand; HEIGHT: 78px" alt="" src="http://farm3.static.flickr.com/2547/3810999156_5e3225054c.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;For cases where the variances of the two populations are not equal, the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;unpooled&lt;/span&gt; formula can be used.&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3426/3810181857_4f5ec90f78.jpg"&gt;&lt;img style="WIDTH: 409px; CURSOR: hand; HEIGHT: 150px" alt="" src="http://farm4.static.flickr.com/3426/3810181857_4f5ec90f78.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The confidence interval for the difference between the means of two &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;populations&lt;/span&gt;, for small independent samples when the population variances are unknown, is&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3516/3810999260_105a80a6f0.jpg"&gt;&lt;img style="WIDTH: 455px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3516/3810999260_105a80a6f0.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Inferences for related populations&lt;/strong&gt; : Sometimes samples are taken from two populations that are related. For example, samples may be taken to measure the illiteracy level before and after a literacy program is implemented. Here the population remains essentially the same, but the measurement is taken twice, before and after the change.&lt;br /&gt;This test is called a matched pairs test, a t-test for related measures or a correlated t test. 
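&lt;br /&gt;&lt;br /&gt;A minimal sketch of the matched pairs calculation in Python is given here. It assumes the usual form of the statistic, t = (mean difference - hypothesized difference) / (standard deviation of the differences / sqrt(n)) with n-1 degrees of freedom; the function name and the before and after illiteracy rates for five districts are hypothetical.&lt;br /&gt;&lt;pre&gt;
```python
import math


def paired_t(before, after, hypothesized_diff=0.0):
    """t statistic and degrees of freedom for a matched pairs test."""
    # Work on the difference within each pair, not on the raw values.
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    d_bar = sum(d) / n
    # Sample standard deviation of the differences (n - 1 in the divisor).
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    t = (d_bar - hypothesized_diff) / (s_d / math.sqrt(n))
    return t, n - 1


# Hypothetical illiteracy rates (%) in five districts, measured before
# and after a literacy program.
before = [30, 28, 35, 40, 32]
after = [26, 27, 30, 36, 30]
t_value, df = paired_t(before, after)

# Two tailed table value of t for alpha = 0.05 and 4 degrees of freedom.
critical_t = 2.776
print(round(t_value, 3), df, abs(t_value) > critical_t)
```
&lt;/pre&gt;For these made-up rates the mean difference is 3.2 and the t value is roughly 4.35 with 4 degrees of freedom, which exceeds the table value of 2.776, so the hypothesis of no change would be rejected at the 5% level.&lt;br /&gt;&lt;br /&gt;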
The t-formula is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2506/3810999304_68ce5d96cf_m.jpg"&gt;&lt;img style="WIDTH: 95px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm3.static.flickr.com/2506/3810999304_68ce5d96cf_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;where &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;df&lt;/span&gt; = n-1, n = number of pairs, d = sample difference in pairs, D = mean population difference, &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Sd&lt;/span&gt; = standard deviation of the sample differences and d(bar) = mean sample difference.&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3559/3810182073_81311461b2.jpg"&gt;&lt;img style="WIDTH: 434px; CURSOR: hand; HEIGHT: 73px" alt="" src="http://farm4.static.flickr.com/3559/3810182073_81311461b2.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The confidence interval is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3493/3810182115_482f355fc4_m.jpg"&gt;&lt;img style="WIDTH: 199px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3493/3810182115_482f355fc4_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Comparison of proportions for two populations:&lt;br /&gt;&lt;/strong&gt;An example is comparing the market share of a product in two different markets. 
The formula is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3513/3810182187_7ba69c291b.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 53px" alt="" src="http://farm4.static.flickr.com/3513/3810182187_7ba69c291b.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The confidence interval for the difference is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2580/3810182307_c1ac1a9206.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 52px" alt="" src="http://farm3.static.flickr.com/2580/3810182307_c1ac1a9206.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Two population variances:&lt;/strong&gt;&lt;br /&gt;The ratio of two sample variances is called the F value; it is the ratio of the variance of sample 1 (the square of its standard deviation s1) to the variance of sample 2 (the square of its standard deviation s2). The distribution of this ratio is called an F distribution. This distribution has degrees of freedom for both the numerator and the denominator. Note that the two populations should be normally distributed.&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3583/3810182359_658f4e270b.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 64px" alt="" src="http://farm4.static.flickr.com/3583/3810182359_658f4e270b.jpg" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-3286148175662621377?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/3286148175662621377/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=3286148175662621377' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3286148175662621377'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3286148175662621377'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_11.html' title='statistics for Business intelligence - Inference for 2 populations'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3503/3810999052_9ec00cc2ab_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-1047192686600555346</id><published>2009-08-10T21:33:00.000-07:00</published><updated>2009-09-03T03:01:23.961-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='null hypothesis'/><category scheme='http://www.blogger.com/atom/ns#' term='t-tests'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='chi-square distribution.'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>Statistics for Business Intelligence - Hypothesis testing</title><content type='html'>&lt;div align="justify"&gt;Hypothesis is defined in dictionary.com as 'a proposition assumed as a premise in an argument'. This post explores the various kinds of hypothesis in statistics and methods to test them. 
&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;&lt;em&gt;Research hypothesis&lt;/em&gt; : A statement of the expected outcome of an experiment or test, made before the experiment is undertaken.&lt;br /&gt;&lt;em&gt;Statistical hypothesis&lt;/em&gt; : This is used to test the research hypothesis by providing a more measurable, concrete hypothesis statement. For example, a research hypothesis could be that the stock market index reflects the state of the monsoon in the country. A statistical hypothesis might compare the values of the index with the percentage increase or decrease in rainfall during the year relative to previous years.&lt;br /&gt;The statistical hypothesis has two parts. The &lt;em&gt;&lt;strong&gt;null hypothesis&lt;/strong&gt;&lt;/em&gt; states that the old standard is correct and the current situation is in control. The &lt;em&gt;&lt;strong&gt;alternate hypothesis&lt;/strong&gt;&lt;/em&gt; states that the new theory is true, new standards are needed or the system is out of control. The null hypothesis is generally something that the experiment would reject in order to support the alternative hypothesis. The aim of the experiment is to find cause to reject or not reject the null hypothesis. The null hypothesis is generally represented as H0 (H subscript 0) and the alternative hypothesis as Ha (H subscript a). For example, suppose the proportion of literate people in the country is 40% and the government wants to show that, because of its literacy schemes, the proportion has increased beyond 40%. The null hypothesis is&lt;br /&gt;H0: p = 40%&lt;br /&gt;The alternate hypothesis is Ha: p &gt; 40%.&lt;br /&gt;Note that in this case p is less than weight =" 40." 
&lt;a href="http://farm4.static.flickr.com/3528/3810765206_e35e821bb7_m.jpg"&gt;&lt;img style="WIDTH: 240px; CURSOR: hand; HEIGHT: 125px" alt="" src="http://farm4.static.flickr.com/3528/3810765206_e35e821bb7_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Type 1 error&lt;/strong&gt; : A type 1 error is committed by rejecting a null hypothesis when it is true. In other words, the null hypothesis is true but the experiment prompts the researcher to reject it. The probability of committing a type 1 error is called alpha or the level of significance. Alpha is the area under the curve in the rejection region, which lies outside the critical values.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Type 2 error&lt;/strong&gt;: A type 2 error is committed when a business researcher fails to reject a false null hypothesis, i.e. the null hypothesis is actually false but the experiment prompts the researcher not to reject it. The probability of committing a type 2 error is beta. The value of beta varies with the alternative value being considered, so each alternative may have its own beta. Note that alpha and beta are inversely related: for a fixed sample size, reducing one increases the other. Power = 1 - beta is the probability of rejecting a null hypothesis when it is false. It represents a correct decision.&lt;br /&gt;&lt;br /&gt;Using the z statistic to test a hypothesis about a population mean :&lt;br /&gt;If the sample size is large, i.e. n &gt;= 30 for any population, or if x is normally distributed for small samples, the z score is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2568/3810817954_20ae416894_m.jpg"&gt;&lt;img style="WIDTH: 95px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm3.static.flickr.com/2568/3810817954_20ae416894_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The procedure is : use the mean from the null hypothesis and the sigma value to come up with the z score. 
Assume a value of alpha (generally 0.05) and find the z value at the critical point: since alpha is known, the area under the non-rejection region is known, and the table of areas under the normal distribution gives the corresponding z value. If the z from the experiment falls within the z values at the critical points, the null hypothesis is not rejected.&lt;br /&gt;The sample standard deviation can be used if the population SD is not known and n &gt;= 30. If the population is finite, i.e. the sample is a substantial fraction of the population, then use the finite correction factor. If N is the population size and n is the sample size, use this formula.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2515/3810058261_c68bf66f45_m.jpg"&gt;&lt;img style="WIDTH: 161px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm3.static.flickr.com/2515/3810058261_c68bf66f45_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Another method is to convert the z scores at the critical points to the actual critical values of the sample mean and use those critical values to decide whether to reject the null hypothesis.&lt;br /&gt;&lt;br /&gt;The t-statistic (as described in an earlier post) can also be used instead of the z statistic if the sample size is small and the population is normally distributed.&lt;br /&gt;&lt;br /&gt;A hypothesis about the proportion can be tested using the formula&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3521/3806807911_43299a33b8_m.jpg"&gt;&lt;img style="WIDTH: 95px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3521/3806807911_43299a33b8_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;A hypothesis about the variance can be tested using the chi-square method. Note again that the chi-square method is not robust to departures from normality, i.e. 
if the distribution is not normal, the chi square method should not be used.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-1047192686600555346?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/1047192686600555346/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=1047192686600555346' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1047192686600555346'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/1047192686600555346'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_10.html' title='Statistics for Business Intelligence - Hypothesis testing'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm4.static.flickr.com/3528/3810765206_e35e821bb7_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-866993266151031126</id><published>2009-08-09T23:46:00.000-07:00</published><updated>2009-09-03T02:58:09.042-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='confidence interval'/><category scheme='http://www.blogger.com/atom/ns#' term='t-tests'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='chi-square distribution.'/><category scheme='http://www.blogger.com/atom/ns#' term='inferntial statistics'/><title type='text'>Statistics for Business Intelligence - Inferential Statistics 1</title><content type='html'>&lt;div align="justify"&gt;Inferential statistics is the term given to the branch of statistics that uses the information from the sample to infer the information about the population. For example, given a sample mean , the population mean (also called a parameter) can be determined using inferential statistics.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Estimating population mean&lt;/strong&gt; - Let us first look at estimating the population mean using the z score.&lt;br /&gt;&lt;strong&gt;point estimate&lt;/strong&gt; - If the population mean is assigned the value of the sample mean, then the estimate is called a point estimate. The point estimate may not be accurate, and different samples may have different point estimates. Also the effectiveness may be dependent on how representative is the sample of the population.&lt;br /&gt;&lt;strong&gt;Interval estimate&lt;/strong&gt; - This gives a confidence interval within which the population parameter is expected to lie. Consider the distribution of z scores below.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2654/3807406232_845f7e9ca0.jpg"&gt;&lt;img style="WIDTH: 235px; CURSOR: hand; HEIGHT: 134px" alt="" src="http://farm3.static.flickr.com/2654/3807406232_845f7e9ca0.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;alpha is the area under the normal distribution curve and is the area outside the confidence interval. 
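&lt;br /&gt;&lt;br /&gt;As a rough sketch of how the z score that bounds a confidence interval follows from alpha, the snippet below uses Python's standard-library NormalDist in place of a printed z table. The function name z_critical and the alpha values are illustrative assumptions, not part of the original post.&lt;br /&gt;
&lt;pre&gt;
```python
from statistics import NormalDist


def z_critical(alpha):
    """Return the z score that leaves alpha/2 in each tail of the standard normal."""
    # inv_cdf returns the z value whose cumulative area equals the argument
    return NormalDist().inv_cdf(1 - alpha / 2)


# Hypothetical significance levels, chosen only for illustration
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  z={z_critical(alpha):.3f}")
```
&lt;/pre&gt;
For alpha = 0.05 this gives a z of roughly 1.96, the value usually read from the z table for a 95% interval.&lt;br /&gt;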
The 100(1-alpha)% confidence interval is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3427/3806620383_8f25fcabf9.jpg"&gt;&lt;img style="WIDTH: 425px; CURSOR: hand; HEIGHT: 63px" alt="" src="http://farm4.static.flickr.com/3427/3806620383_8f25fcabf9.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;x (bar) is the mean of the sample.&lt;/div&gt;&lt;div align="justify"&gt;mu is the mean of the population.&lt;/div&gt;&lt;div align="justify"&gt;Z is the z score at alpha/2.&lt;/div&gt;&lt;div align="justify"&gt;sigma is the population standard deviation.&lt;/div&gt;&lt;div align="justify"&gt;To calculate the confidence interval for a particular value of alpha, use the z table to find the z score at alpha/2.&lt;br /&gt;Note that sigma is the standard deviation of the population; therefore, to calculate the confidence interval using this formula, the SD of the population is required. This could be available from previous studies or some other means.&lt;br /&gt;&lt;br /&gt;Confidence interval if population SD is not known : In most cases, the population SD is not known. In these cases the rule of thumb is that the sample SD is a good estimate of the population SD if n &gt;= 30. To calculate the confidence interval, use s (SD of sample) instead of sigma (SD of population) in the equation above.&lt;br /&gt;&lt;br /&gt;So far we have seen methods to estimate the population mean and the confidence intervals using the z statistic. Let us now look at methods to estimate the population mean using t-statistics.&lt;br /&gt;The &lt;strong&gt;t-tests&lt;/strong&gt; can be used to estimate the population mean from the sample mean if the sample size is small. They are also referred to as Student's t tests. 
The t distribution is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2643/3800494306_2987944a53.jpg"&gt;&lt;img style="WIDTH: 104px; CURSOR: hand; HEIGHT: 44px" alt="" src="http://farm3.static.flickr.com/2643/3800494306_2987944a53.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The difference between the t-test and the z formula is that the t-test uses t tables instead of z tables. The t distribution approaches the standard normal curve for large values of n. To find the t value, the degrees of freedom are required. The degrees of freedom (df) are given by n-1, i.e. the df is one less than the number of members in the sample. Using the t value and df, the probability value can be obtained from the t table.&lt;br /&gt;&lt;br /&gt;Similar to calculating the confidence interval using the z table, the t table can also be used to calculate the confidence interval. It is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3519/3806695803_1e95d63495.jpg"&gt;&lt;img style="WIDTH: 451px; CURSOR: hand; HEIGHT: 64px" alt="" src="http://farm4.static.flickr.com/3519/3806695803_1e95d63495.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Population proportion&lt;/strong&gt; : The proportion in a population can be estimated from the proportion in a sample using the formula&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3458/3807571346_2f820b7a60.jpg"&gt;&lt;img style="WIDTH: 500px; CURSOR: hand; HEIGHT: 127px" alt="" src="http://farm4.static.flickr.com/3458/3807571346_2f820b7a60.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Estimating Population variance&lt;/strong&gt; :&lt;br /&gt;The population variance can be estimated from the sample variance using the chi-square distribution.&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2625/3807586726_693aba9307_m.jpg"&gt;&lt;img style="WIDTH: 124px; CURSOR: hand; HEIGHT: 54px" alt="" src="http://farm3.static.flickr.com/2625/3807586726_693aba9307_m.jpg" border="0" 
/&gt;&lt;/a&gt;&lt;br /&gt;s squared is the sample variance and sigma squared is the population variance. The degrees of freedom are given by n-1. The chi-square distribution is not symmetrical and its shape varies with the degrees of freedom.&lt;br /&gt;The confidence interval is given by&lt;br /&gt;&lt;a href="http://farm3.static.flickr.com/2638/3806786409_c05d1138b1_m.jpg"&gt;&lt;img style="WIDTH: 223px; CURSOR: hand; HEIGHT: 70px" alt="" src="http://farm3.static.flickr.com/2638/3806786409_c05d1138b1_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Sample Size&lt;/strong&gt; : The size of the sample to be used for a survey can be calculated if the error of estimation E = (sample mean - population mean) is known. The size is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3547/3806797859_8d37984a8c_m.jpg"&gt;&lt;img style="WIDTH: 95px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3547/3806797859_8d37984a8c_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;The sample size to estimate the proportion p is given by&lt;br /&gt;&lt;a href="http://farm4.static.flickr.com/3521/3806807911_43299a33b8_m.jpg"&gt;&lt;img style="WIDTH: 95px; CURSOR: hand; HEIGHT: 71px" alt="" src="http://farm4.static.flickr.com/3521/3806807911_43299a33b8_m.jpg" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-866993266151031126?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/866993266151031126/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=866993266151031126' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/3143564862155210283/posts/default/866993266151031126'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/866993266151031126'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_09.html' title='Statistics for Business Intelligence - Inferential Statistics 1'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://farm3.static.flickr.com/2654/3807406232_845f7e9ca0_t.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-3564530301829589204</id><published>2009-08-05T22:15:00.000-07:00</published><updated>2009-09-03T02:51:46.909-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='central limit theorem'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='sampling'/><title type='text'>Statistics for Business Intelligence - Sampling</title><content type='html'>&lt;div align="justify"&gt;It is necessary to understand sampling techniques before data for a sample is gathered for analysis. Some of the terms that are important are&lt;br /&gt;&lt;em&gt;Population&lt;/em&gt; - This is the complete set under consideration. For example a survey of food choices for a country might consider all citizens of the country.&lt;br /&gt;&lt;em&gt;Frame&lt;/em&gt; - frame is the population where the survey is targeted to. For example, a survey of sports interests among school children considers all schools. 
A frame is a list of the population, such as a school list.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Random Sampling or probabilistic sampling&lt;/em&gt; - In this kind of sampling each unit of the population has an equal chance of being selected in the sample.&lt;br /&gt;&lt;em&gt;Non Random Sampling or non probabilistic sampling&lt;/em&gt; - In this sampling method units have different probabilities of being selected for sampling, i.e. the sampling is biased in the selection of units.&lt;br /&gt;&lt;br /&gt;Types of Random Sampling&lt;br /&gt;&lt;em&gt;&lt;strong&gt;Simple Random Sampling&lt;/strong&gt;&lt;/em&gt; - Each unit is assigned a number, and a table of random numbers is used to select the unit. For example, if the population has 30 members, each member is assigned a number. A random number table is used to generate a number between 1 and 30 and the member corresponding to the random number generated is selected.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;Stratified Random Sampling&lt;/strong&gt; &lt;/em&gt;- The population is divided into different strata. These strata are non-overlapping. Random methods are used to select members from each stratum. This method helps in selecting a sample that is representative of the population and prevents the researcher from collecting units from only one subgroup of the population. The strata can be formed logically; for example, in the survey of sports choices, the population can be divided into girls and boys.&lt;br /&gt;In proportionate stratified random sampling the number of units selected from each stratum is proportional to the total number of members in the stratum. So if a school has 70 boys and 30 girls, and a sample of 10 is required, then 7 boys and 3 girls would be selected.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;Systematic Sampling&lt;/strong&gt; &lt;/em&gt;- In this method every kth element of the population is selected. 
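&lt;br /&gt;&lt;br /&gt;A minimal sketch of selecting every kth element, assuming the population is held in an ordered Python list. The function name systematic_sample, the population of 30 numbered members and the choices k = 5 and start = 2 are hypothetical, used only to illustrate the idea.&lt;br /&gt;
&lt;pre&gt;
```python
def systematic_sample(population, k, start=0):
    """Select every k-th element of population, beginning at index start."""
    # The slice start::k walks the sequence in steps of k
    return list(population[start::k])


# Hypothetical population of 30 numbered members
members = list(range(1, 31))
print(systematic_sample(members, k=5, start=2))  # prints [3, 8, 13, 18, 23, 28]
```
&lt;/pre&gt;
In practice the starting position is usually drawn at random from the first k positions so that the sample is not fixed in advance.&lt;br /&gt;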
This method is easy to implement but fails if the periodicity of the population coincides with k.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;Cluster or Area sampling&lt;/strong&gt;&lt;/em&gt; - In this method the population is broken down into logical clusters or areas. For example, a state population can be broken into cities for the purpose of sampling. This technique is mostly used for its convenience. In contrast to stratified sampling, here the population within each cluster is heterogeneous. In stratified sampling each stratum is homogeneous in terms of the property influencing the survey.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Nonrandom sampling&lt;/strong&gt; - Nonrandom sampling is generally not advised when inferential techniques need to be applied. Also, the sampling error calculated may be incorrect for nonrandom sampling.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;strong&gt;Convenience Sampling&lt;/strong&gt;&lt;/em&gt; - Elements for the sample are selected as per the convenience of the surveyor. The sample may contain less variation than the population. However, the cost of sampling may be reduced since the samples are taken from a convenient location.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Judgement sampling&lt;/em&gt; - The elements for sampling are chosen by the judgement of the researcher. Studies show that random sampling gives a better estimate of the population mean than judgement sampling. This kind of sampling also introduces the biases of the researcher.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Quota sampling &lt;/em&gt;- Quota sampling divides the population into subgroups or strata as in stratified sampling; however, members are selected from the strata using nonrandom techniques. The number of members to be selected from each stratum is proportional to the population of the subgroup and is called a quota.&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Snowball sampling&lt;/em&gt; - In this kind of sampling members are selected based on the referral from other members. 
The advantage is that members for the survey can be identified easily. However, the technique is nonrandom.&lt;br /&gt;&lt;br /&gt;Sample mean distribution - A population with a known distribution is chosen. Samples are taken from this population and the mean is calculated for each sample. The probability distribution of this mean is governed by what is called the &lt;strong&gt;Central limit theorem&lt;/strong&gt;. It is a powerful theorem and states that &lt;em&gt;if samples of size n are taken randomly from a population having a mean of mu and a standard deviation of sigma then the sample means x are approximately normally distributed for large sample sizes (typically n&gt;=30)&lt;/em&gt; regardless of the shape of the distribution of the population. However, if the population is normally distributed, the sample means are also normally distributed for all values of n. Mathematically, the mean of the sample means is equal to the population mean, and the SD of the sample means is the SD of the population divided by the square root of the sample size n.&lt;br /&gt;&lt;br /&gt;The power of the theorem lies in the fact that even if the population is not normally distributed, the probability for a particular sample mean can be calculated from a sample of large size, since the distribution of the sample mean is then approximately normal.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-3564530301829589204?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/3564530301829589204/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=3564530301829589204' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3564530301829589204'/><link rel='self' 
type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3564530301829589204'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence_05.html' title='Statistics for Business Intelligence - Sampling'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2518267093279725534</id><published>2009-08-04T01:32:00.000-07:00</published><updated>2009-09-03T02:46:30.470-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Descriptive statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Normal Distribution'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='Exponential Distribution'/><title type='text'>Statistics for Business Intelligence - Distribution</title><content type='html'>&lt;div align="justify"&gt;&lt;strong&gt;Discrete variables&lt;/strong&gt; - Discrete variable take a set of values. for example, type of card drawn from a pack of cards can take any of the four values: hearts, spades, clubs or diamonds.&lt;br /&gt;&lt;strong&gt;Continuous values&lt;/strong&gt; - These can take any values within a specified range. 
For example, the height of students in a class can take any value from, say, 4 feet to 6 feet.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Distribution of discrete variables&lt;/strong&gt; - Let us first consider the distribution of discrete variables.&lt;br /&gt;&lt;strong&gt;Binomial distribution&lt;/strong&gt;&lt;br /&gt;This is the distribution of results in an experiment where each result can be either a success or a failure. For example, a coin toss can be either heads or tails. Let p be the probability of success, q be the probability of failure (q=1-p) and n be the number of trials. The probability of x successes is given by&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/Snf5iF5ThoI/AAAAAAAACHs/2iU4gUHBVaM/s1600-h/binomial_Distribution.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5366031845128636034" style="WIDTH: 248px; CURSOR: hand; HEIGHT: 50px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/Snf5iF5ThoI/AAAAAAAACHs/2iU4gUHBVaM/s400/binomial_Distribution.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The mean of a binomial distribution is given by (np).&lt;br /&gt;The standard deviation is given by SD=sqrt(npq).&lt;br /&gt;A graph can be drawn by plotting P(x) against x.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Poisson Distribution&lt;/strong&gt;&lt;br /&gt;This distribution describes the occurrences of rare events. It gives the probability of x occurrences in a specified time interval given that there are lambda expected occurrences in the same time period. For example, if it is known that a machine produces 10 defective items in 30 minutes on average, what is the probability that it will produce exactly 4 defective items in a given 30-minute period? 
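&lt;br /&gt;&lt;br /&gt;The machine example can be worked numerically. The sketch below assumes the standard Poisson probability mass function, P(x) = lambda^x e^(-lambda) / x!, and assumes that the 4 defective items are counted over the same 30-minute interval, so lambda = 10. The function name poisson_pmf is illustrative.&lt;br /&gt;
&lt;pre&gt;
```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when lam occurrences are expected."""
    return lam ** x * exp(-lam) / factorial(x)


lam = 10  # expected defective items per 30 minutes (from the example)
p4 = poisson_pmf(4, lam)
print(f"P(x=4) = {p4:.4f}")  # prints P(x=4) = 0.0189
```
&lt;/pre&gt;
So under these assumptions the machine produces exactly 4 defective items in about 1.9% of 30-minute intervals.&lt;br /&gt;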
The distribution is given by&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/Snf-PwvOKVI/AAAAAAAACH0/3LGmsgfFMJs/s1600-h/poisson+distribution.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5366037027769690450" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 48px" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/Snf-PwvOKVI/AAAAAAAACH0/3LGmsgfFMJs/s400/poisson+distribution.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Binomial problems with large sample sizes and small values of p can be approximated by the Poisson distribution. Heuristics suggest that if n &gt; 20 and n.p &lt;= 7, then the Poisson distribution can be used to approximate the binomial distribution. &lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt; &lt;/div&gt;&lt;div align="justify"&gt;&lt;strong&gt;Distribution for continuous variables&lt;/strong&gt;&lt;br /&gt;Continuous variables take all values within an interval. To calculate the probability between any two points, find the area under the curve between them. The total area under the curve for this kind of distribution is 1. For a continuous variable the probability at a single exact point is zero, so the rule of thumb for a particular value is to add and subtract a small quantity and take the area between the two values obtained, i.e. the probability at x is approximated by the area under the curve between the points x-dx and x+dx, where dx is around half a unit.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Uniform Distribution&lt;/strong&gt; - This distribution has a constant value throughout its range. 
This is also referred to as the rectangular distribution.&lt;br /&gt;&lt;img src="http://mathworld.wolfram.com/images/eps-gif/UniformDistribution_651.gif" /&gt;&lt;br /&gt;The distribution is given by&lt;br /&gt;&lt;img src="http://mathworld.wolfram.com/images/equations/UniformDistribution/Inline4.gif" /&gt;&lt;br /&gt;Image and equation source - mathworld.wolfram.com.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Normal Distribution&lt;/strong&gt; - This is probably the most frequently encountered distribution and also the most widely used. An example of a normal distribution is the error of a piece of machining equipment. Physicists refer to it as the Gaussian distribution and it is also popularly referred to as a 'bell curve'.&lt;br /&gt;&lt;img src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/360px-Normal_Distribution_PDF.svg.png" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;Image source - Wikipedia&lt;/em&gt;&lt;br /&gt;The probability density function is given by&lt;br /&gt;&lt;img src="http://mathworld.wolfram.com/images/equations/NormalDistribution/NumberedEquation1.gif" /&gt;&lt;br /&gt;&lt;em&gt;Equation source - mathworld.wolfram.com.&lt;/em&gt;&lt;br /&gt;The normal curve depends on the mean and standard deviation. There is a different curve for each combination of mean and standard deviation, so a standardized normal distribution curve is used. This curve is obtained by converting each value of x to its corresponding z score. The z score is calculated as (x-mean)/SD.&lt;br /&gt;The z distribution has a mean of 0 and a standard deviation of 1. To find the probability between two values in the normal curve, find the area under the curve between the two values. To do so, convert the two values to their corresponding z scores and use the standard z score table to arrive at the associated probability. 
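&lt;br /&gt;&lt;br /&gt;The z score procedure can be sketched in code, with Python's standard-library NormalDist standing in for the printed z table. The mean of 100, SD of 15 and the interval from 85 to 115 are hypothetical values chosen only for illustration, and the function name is an assumption.&lt;br /&gt;
&lt;pre&gt;
```python
from statistics import NormalDist


def normal_interval_probability(low, high, mean, sd):
    """Probability that a normal variable falls between low and high."""
    # Convert both end points to z scores: (x - mean) / SD
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    # cdf plays the role of the z table: cumulative area up to a z score
    standard_normal = NormalDist()
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


# Hypothetical example: mean 100, SD 15, interval from 85 to 115
p = normal_interval_probability(85, 115, mean=100, sd=15)
print(f"{p:.4f}")  # prints 0.6827
```
&lt;/pre&gt;
The interval here spans one SD on either side of the mean, so the result matches the familiar 68% figure for a normal curve.&lt;br /&gt;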
The difference between the cumulative probability of the higher value and that of the lower value gives the probability for the interval.&lt;br /&gt;&lt;br /&gt;The normal distribution can be used as an approximation to the binomial distribution. To do so, calculate the mean and SD from n and p using mean = n.p and SD = sqrt(n.p.q). A rule of thumb is that the approximation can be applied if both mean - 3SD and mean + 3SD lie between 0 and n.&lt;br /&gt;&lt;br /&gt;Exponential Distribution : This is closely related to the Poisson distribution but describes a continuous variable. It gives the probability distribution for times between random occurrences. The x values range from 0 to infinity and the curve steadily decreases as x increases.&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/Snkk0quUlfI/AAAAAAAACH8/MrpcM7s541k/s1600-h/exponentialDistribution.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5366360918229030386" style="WIDTH: 192px; CURSOR: hand; HEIGHT: 43px" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/Snkk0quUlfI/AAAAAAAACH8/MrpcM7s541k/s400/exponentialDistribution.JPG" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2518267093279725534?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2518267093279725534/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2518267093279725534' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2518267093279725534'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2518267093279725534'/><link rel='alternate' type='text/html' 
href='http://mithil-tech.blogspot.com/2009/08/statistics-for-business-intelligence.html' title='Statistics for Business Intelligence - Distribution'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/Snf5iF5ThoI/AAAAAAAACHs/2iU4gUHBVaM/s72-c/binomial_Distribution.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-5579572741669513499</id><published>2009-07-30T23:51:00.000-07:00</published><updated>2009-09-03T02:41:21.577-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='box and whisker plots'/><category scheme='http://www.blogger.com/atom/ns#' term='kurtosis'/><category scheme='http://www.blogger.com/atom/ns#' term='pearsonian coefficient of skewness'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Descriptive statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Statistics for Business Intelligence - Shape</title><content type='html'>&lt;div align="justify"&gt;In this post I will discuss the measures of shape used for statistical analysis, specifically skewness and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;kurtosis&lt;/span&gt;. I will also discuss the box and whisker plot.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Skewness&lt;/strong&gt; - A normal &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;distribution&lt;/span&gt; is a bell curve that is perfectly symmetric. 
Perfect symmetry implies that the values are distributed equally around the center. A graph is said to be skewed if it is not symmetrical. The graph may be skewed either towards the right (positively skewed, with the longer tail on the right) or towards the left (negatively skewed, with the longer tail on the left).&lt;br /&gt;&lt;img src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/b3/Skewness_Statistics.svg/446px-" /&gt; &lt;/div&gt;&lt;div align="justify"&gt;Image Source - &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;Wikipedia&lt;/span&gt;&lt;br /&gt;A normal distribution, which has the mean, median and mode at the center of the distribution, has no skewness.&lt;br /&gt;Skewness can be quantified by a measure known as the &lt;em&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;Pearsonian&lt;/span&gt; coefficient of skewness&lt;/em&gt;.&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnaoP0hdu0I/AAAAAAAACHc/NFeuGd7WdlU/s1600-h/CoefficentOfSkewness.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5365660995809033026" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 74px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnaoP0hdu0I/AAAAAAAACHc/NFeuGd7WdlU/s400/CoefficentOfSkewness.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If the coefficient is positive the plot is positively skewed; if it is negative the plot is negatively skewed. The larger the absolute value, the greater the skewness.&lt;br /&gt;&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;&lt;strong&gt;Kurtosis :&lt;/strong&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_5"&gt;Kurtosis&lt;/span&gt; describes how peaked (tall and thin) the plot is. If the plot is tall and thin then it is referred to as &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_6"&gt;leptokurtic&lt;/span&gt;. If it is flat and spread out it is called &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_7"&gt;platykurtic&lt;/span&gt;. 
Plots in between are called &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_8"&gt;mesokurtic&lt;/span&gt;.&lt;br /&gt;&lt;img src="http://www.uwsp.edu/psych/stat/6/kurtosis.gif" /&gt;&lt;br /&gt;Image Source : http://www.uwsp.edu/&lt;br /&gt;&lt;br /&gt;Box and Whisker plots : These plots are widely used to understand the data distribution. To understand these plots we need to understand quartiles. Quartiles break the data into four equal parts, i.e. there are three quartiles in the complete data range. Note that the data needs to be in ascending order to calculate the quartiles. The first quartile (Q1) is the value below which the lowest 25% of the data lie. Here's an example of the box and whisker plot&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_GvNS-b8AbU4/SnaneoSYOJI/AAAAAAAACHU/WaHhHfs4H0U/s1600-h/MAndWPlot.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5365660150710941842" style="WIDTH: 344px; CURSOR: hand; HEIGHT: 162px" alt="" src="http://3.bp.blogspot.com/_GvNS-b8AbU4/SnaneoSYOJI/AAAAAAAACHU/WaHhHfs4H0U/s400/MAndWPlot.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The characteristics of the plot are&lt;br /&gt;1) The median or Q2 is marked inside the box.&lt;br /&gt;2) The left end of the box is Q1 and the right end is Q3, i.e. the middle 50% of values are inside the box.&lt;br /&gt;3) The line segment outside the box is called a whisker. A whisker extends at most 1.5 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_9"&gt;IQR&lt;/span&gt; (&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_10"&gt;Interquartile&lt;/span&gt; range = Q3-Q1) beyond the box. This limit is also called the inner fence. 
If data is present outside this inner fence then an outer fence at 3 &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_11"&gt;IQR&lt;/span&gt; from the box can be drawn.&lt;br /&gt;4) Values outside the inner fence but inside the outer fence are called mild outliers whereas values outside the outer fence are called extreme outliers.&lt;br /&gt;5) If the median is towards the right side of the box then the middle 50% of data is skewed to the left.&lt;br /&gt;6) If the longest whisker is to the right of the box then the outer data are skewed to the right.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-5579572741669513499?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/5579572741669513499/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=5579572741669513499' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5579572741669513499'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/5579572741669513499'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/statistics-for-business-intelligence_30.html' title='Statistics for Business Intelligence - Shape'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://1.bp.blogspot.com/_GvNS-b8AbU4/SnaoP0hdu0I/AAAAAAAACHc/NFeuGd7WdlU/s72-c/CoefficentOfSkewness.JPG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-4192682861445270192</id><published>2009-07-29T01:27:00.001-07:00</published><updated>2009-09-03T02:37:31.202-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='variance'/><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='ChebyChev&apos;s theorem'/><category scheme='http://www.blogger.com/atom/ns#' term='coefficient of variation'/><category scheme='http://www.blogger.com/atom/ns#' term='standard deviation'/><title type='text'>Statistics for Business Intelligence - Descriptive Statistics</title><content type='html'>&lt;div align="justify"&gt;In this post we look at descriptive statistics as a means for data exploration. Descriptive analysis refers to a group of methods that give summary information about the data.  For example, consider the sales figures for a retail clothes outlet. An important figure would be the average sales for a particular day of the week over a year.&lt;br /&gt;&lt;br /&gt;Analysis of a single variable or Univariate analysis -&lt;br /&gt;In most cases we need summary figures for a single variable, say the height of students in a class or the best-selling product during a sale. The methods in this analysis take as input the values of a single variable and provide summary statistics for it.&lt;br /&gt;&lt;br /&gt;Types of summary statistics - There are mainly three kinds of summary statistics involved in univariate descriptive analysis.&lt;br /&gt;&lt;br /&gt;1) &lt;em&gt;&lt;strong&gt;Mean&lt;/strong&gt; &lt;/em&gt;- This simply gives the average of all the values of the variable under consideration. 
For example, if the marks scored by five students in a quiz are 6,8,9,5,8 then the mean is given by (sum of values)/(num of values)&lt;br /&gt;or mean = (6+8+9+5+8)/5 = 7.2&lt;br /&gt;The disadvantage of the mean is that an outlier can distort it to a very large extent.&lt;br /&gt;&lt;br /&gt;2) &lt;em&gt;&lt;strong&gt;Median&lt;/strong&gt; &lt;/em&gt;- The median gives the central value in a group of values. In other words, around half of the values are greater than the median and the other half are less than the median. Consider the same number sequence as above.&lt;br /&gt;6,8,9,5,8&lt;br /&gt;Arrange the sequence in ascending order&lt;br /&gt;5,6,8,8,9&lt;br /&gt;The central value is 8 and hence the median is 8. The median gives a number around which the values are distributed. If the series has an even number of observations then add the middle two numbers and divide by two to arrive at the median.&lt;br /&gt;&lt;br /&gt;3) &lt;em&gt;&lt;strong&gt;Mode&lt;/strong&gt; &lt;/em&gt;- The mode is the value that is repeated the most number of times. In our series the mode is 8 since it is repeated twice.&lt;br /&gt;&lt;br /&gt;The three summary values described above are labeled as &lt;strong&gt;measures of central tendency&lt;/strong&gt;. In a normal distribution the three values would be equal.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Another topic in Descriptive Statistics is &lt;strong&gt;Distribution&lt;/strong&gt;. Consider a school that gives a grade to each student. 
A single variable distribution gives the number of students that have obtained Grade A for each subject.&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnAlyx_DktI/AAAAAAAACF8/mg2JHFcfoJo/s1600-h/distribution-table.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5363828710539760338" style="WIDTH: 165px; CURSOR: hand; HEIGHT: 124px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnAlyx_DktI/AAAAAAAACF8/mg2JHFcfoJo/s320/distribution-table.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The same statistic can also be represented graphically&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/_GvNS-b8AbU4/SnAtM02UhEI/AAAAAAAACGE/BdU6HKafOk0/s1600-h/graph.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5363836854566421570" style="FLOAT: right; MARGIN: 0px 0px 10px 10px; WIDTH: 320px; CURSOR: hand; HEIGHT: 193px" alt="" src="http://2.bp.blogspot.com/_GvNS-b8AbU4/SnAtM02UhEI/AAAAAAAACGE/BdU6HKafOk0/s320/graph.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There are cases where the values of the variable are not discrete. Consider a distribution of height for the students in the class. A distribution of each value of height vs number of students would probably give only one or two students for each height value. A better approach here would be to use ranges of values instead of individual values. In the case of height, use a distribution of this type:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SnAvlHy9wsI/AAAAAAAACGM/AfuKgD09tXI/s1600-h/graph2.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5363839470992736962" style="WIDTH: 320px; CURSOR: hand; HEIGHT: 193px" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SnAvlHy9wsI/AAAAAAAACGM/AfuKgD09tXI/s320/graph2.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;strong&gt;&lt;br /&gt;Dispersion&lt;/strong&gt; - Dispersion gives an idea of how the values are distributed around the central value. 
The measures of dispersion are &lt;em&gt;range, mean absolute deviation, standard deviation and variance&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Range&lt;/strong&gt; is the difference between the maximum and minimum value in a distribution. For example, in the series 5,6,8,8,9 the range is 9-5 = 4.&lt;br /&gt;&lt;br /&gt;Mean absolute deviation is the average of the absolute deviations of the numbers from the mean. The formula for mean absolute deviation is&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_GvNS-b8AbU4/SnKJh6rAu3I/AAAAAAAACGs/ZJTyXMrDZG4/s1600-h/MeanAD.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5364501321930029938" style="WIDTH: 400px; CURSOR: hand; HEIGHT: 45px" alt="" src="http://4.bp.blogspot.com/_GvNS-b8AbU4/SnKJh6rAu3I/AAAAAAAACGs/ZJTyXMrDZG4/s400/MeanAD.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Variance&lt;/strong&gt; - Variance is the average of the squared deviations from the mean.&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnKNNlv-ulI/AAAAAAAACG0/ocNkAvkF9fA/s1600-h/variance.JPG"&gt;&lt;img id="BLOGGER_PHOTO_ID_5364505370762852946" style="WIDTH: 320px; CURSOR: hand; HEIGHT: 36px" alt="" src="http://1.bp.blogspot.com/_GvNS-b8AbU4/SnKNNlv-ulI/AAAAAAAACG0/ocNkAvkF9fA/s320/variance.JPG" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Standard Deviation&lt;/strong&gt; - It is the square root of the variance. It has the same units as the data used in the analysis. Two methods are used to understand the significance of the standard deviation. The first method is widely used and is applicable to all &lt;strong&gt;normal distributions&lt;/strong&gt; (distributions that are symmetrical about the center and have a bell shaped curve). The method states that for normal distributions mean +- one standard deviation (sigma) covers about 68% of the data, i.e. about 68% of the data are between mean-sigma and mean+sigma. 
About 95% of the data are between mean +- 2 sigma and about 99.7% of the data are between mean +- 3 sigma.&lt;br /&gt;The second method is called &lt;strong&gt;Chebyshev's theorem&lt;/strong&gt;. It states that at least&lt;br /&gt;1-1/square(k) of the values fall within +-k standard deviations from the mean, for any k &gt; 1. For example, with k = 2 at least 1-1/4 = 75% of values fall within +-2 standard deviations from the mean. The advantage of this theorem is that it can be applied to all distributions and not only normal distributions.&lt;br /&gt;&lt;br /&gt;Note that the formulas for variance and standard deviation described above are used for the population. To estimate values from a sample use a divisor of n-1 instead of n.&lt;br /&gt;&lt;br /&gt;The last term for this section is &lt;strong&gt;coefficient of variation &lt;/strong&gt;- it is the ratio of the standard deviation to the mean expressed as a percentage.&lt;br /&gt;COV = standard deviation * 100/ mean&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-4192682861445270192?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/4192682861445270192/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=4192682861445270192' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4192682861445270192'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/4192682861445270192'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/statistics-for-business-intelligence_29.html' title='Statistics for Business Intelligence - Descriptive 
Statistics'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_GvNS-b8AbU4/SnAlyx_DktI/AAAAAAAACF8/mg2JHFcfoJo/s72-c/distribution-table.JPG' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-230363855923096867</id><published>2009-07-28T23:46:00.000-07:00</published><updated>2009-09-03T02:30:45.114-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Descriptive statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><title type='text'>Statistics for Business Intelligence - Introduction</title><content type='html'>&lt;div align="justify"&gt;An understanding of Statistics is imperative before &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;delving&lt;/span&gt; into the tools and methods of business intelligence and analysis. We will spend some time on understanding the basics of Statistics for data analysis and in subsequent posts try to give a more detailed &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;explanation&lt;/span&gt; of the various methods involved.&lt;br /&gt;Statistics for data analysis can be broadly divided into descriptive statistics and inferential statistics.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Descriptive statistics&lt;/strong&gt; is a group of methods or tools that statisticians use to understand the meaning of data. These methods provide a summary or gist of data and enable the user to comprehend large volumes of data. 
For example, the average salary of a class of MBA &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;graduates&lt;/span&gt; or the median weight in a group of people are examples of descriptive statistics.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Inferential statistics&lt;/strong&gt;, on the other hand, provides tools that statisticians use to infer information about the population given information about a sample of the population. In other words it tries to use the available data to understand what the population would look like. Consider for example a survey that measures the popularity of a TV show. A group of people is selected at random and in a way that represents the population that the survey wants to cover (say a city). Descriptive statistics would give the summary of popularity among the people chosen for the survey, but to extrapolate these findings to the entire population of the city, we need inferential &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_3"&gt;statistics&lt;/span&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-230363855923096867?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/230363855923096867/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=230363855923096867' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/230363855923096867'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/230363855923096867'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/statistics-for-business-intelligence.html' title='Statistics for 
Business Intelligence - Introduction'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2266633580285280473</id><published>2009-07-21T21:46:00.000-07:00</published><updated>2009-09-03T02:28:36.943-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='forecasting'/><category scheme='http://www.blogger.com/atom/ns#' term='business analytics'/><category scheme='http://www.blogger.com/atom/ns#' term='predictive analysis'/><category scheme='http://www.blogger.com/atom/ns#' term='business intelligence'/><category scheme='http://www.blogger.com/atom/ns#' term='OLAP'/><title type='text'>Business Analytics Road map</title><content type='html'>&lt;div align="justify"&gt;Any company wishing to reap the benefits of business analytics needs to understand the path that it has to take to reach the goal of optimization. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Stage 1 : The first stage is reporting and data analysis. This essentially means that you gather the data from your source systems and run reporting tools on it. Various kinds of reports can be generated depending on the requirement. Data analysis implies using statistical methods to find meaning in numerical data. Statistical tools can be very powerful in presenting highly summarized data and are used extensively when a huge amount of data is present. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Stage 2: A company that has the basic reporting tools can then move on to OLAP based tools. OLAP stands for Online Analytical Processing. OLAP tools allow you to build multidimensional views of data. 
They also provide drill-down facilities that allow the user to drill down on one of the dimensions. For example, consider an FMCG company that has a location hierarchy (country-&gt; state-&gt;city etc) as well as an item hierarchy (Electronics-&gt;television etc). People may want various kinds of numbers such as total electronics sales in a country or total television sales in a city etc. All such kinds of data can be made available from a single &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;OLAP&lt;/span&gt; system. &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Stage 3: Reporting and &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;OLAP&lt;/span&gt; systems give the user a view of the data, but the user may also want systems that help him make sense of the information. We need data mining tools that help the user in making decisions. These include the data analysis covered in Stage 1, but here we make use of advanced statistical concepts (inferential statistics in addition to descriptive statistics). &lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Stage 4: The three stages covered so far are part of what is generally referred to as business intelligence. As a company matures it looks for more capabilities in its decision support systems. It is at this point that the company starts investing in business analytics. Business intelligence answers questions such as: What do my sales records look like? What has been my best-selling product? The questions answered by business analytics are: What would my demand look like within the next three months? How do I optimize my inventory so that my cost is minimum but the service level is achieved? Should my supply chain be responsive or efficient?&lt;br /&gt;Answering these questions needs tools such as forecasting and predictive analysis. 
&lt;/div&gt;&lt;div align="justify"&gt;&lt;br /&gt;Stage 5: The last stage is where the company strives for optimization &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;across&lt;/span&gt; its functions. It uses a holistic approach where all its functions such as finance, marketing, operations are optimized together to arrive at a decision.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2266633580285280473?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2266633580285280473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2266633580285280473' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2266633580285280473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2266633580285280473'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/business-analytics-roadmap_21.html' title='Business Analytics Road map'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-2543448883693935291</id><published>2009-07-09T08:33:00.000-07:00</published><updated>2009-08-24T04:33:44.734-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='decision support systems'/><title type='text'>Business Analytics - 
The need 2</title><content type='html'>&lt;div align="justify"&gt;Information is required at all levels in the organization. Right from the CEO, CFO or &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;CIO&lt;/span&gt; to a technician, everyone in the organization can benefit from access to information. The kind of decisions that they make may differ and the granularity of information required may vary, but the same system should be able to serve the needs of everyone in the organization. Let us understand how each person in the organization would use this system and what kind of information they would expect out of it.&lt;br /&gt;Let's start with someone in operations. We take the example of a network management person in a &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;telecom&lt;/span&gt; company. When he sees a problem, what would help him is knowing how a similar problem was solved in the past. If the system can understand the symptoms of this problem and suggest possible solutions based on past data then the problem could be solved much faster. At a minimum, the system should be able to give him a list of problems faced by a similar component and the solutions provided for those problems. Such a system would decrease the system downtime and also increase customer satisfaction.&lt;br /&gt;&lt;br /&gt;The next person we can consider is someone from middle management. We take the example of a regional sales head of an automobile company. His job role could be achieving the sales target for his region. He would need real time information about the sales of each office in his region. He would also need to set targets for each office based on the demand. 
He would need the &lt;span class="blsp-spelling-error" id="SPELLING_ERROR_2"&gt;forecasted&lt;/span&gt; demand for each of the locations and based on this forecast he would launch an effective sales campaign.&lt;br /&gt;&lt;br /&gt;However, the person who probably needs information the most is the CEO/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_3"&gt;CIO&lt;/span&gt;/CFO/&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_4"&gt;CTO&lt;/span&gt; of the company. He does not care about the numbers at the lowest level but he does care about a summary into which he can drill down. It is very important that he gets a holistic view of the organization but also has the ability to 'look into' the system at a very fine level if the need arises. The system should assist him in arriving at decisions. For example, questions such as starting a new product line or diversifying into a new area require not only an understanding of the new product but also an understanding of the capability of one's own company.&lt;br /&gt;&lt;br /&gt;To summarize, access to information is necessary for everybody in the organization and an investment in a decision support or reporting system can be justified if implemented smartly. 
Next, we will look into the details of the implementation.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-2543448883693935291?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/2543448883693935291/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=2543448883693935291' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2543448883693935291'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/2543448883693935291'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/business-analytics-need-2.html' title='Business Analytics - The need 2'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3143564862155210283.post-3093468539889964196</id><published>2009-07-08T08:36:00.000-07:00</published><updated>2009-09-03T02:20:42.879-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='business analytics'/><title type='text'>Business Analytics - The need</title><content type='html'>&lt;div align="justify"&gt;In a series of posts I will analyse a &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_0"&gt;road map&lt;/span&gt; that a company can follow to implement business analytics. 
The final aim of business analytics is to allow a company to make informed decisions.&lt;br /&gt;&lt;br /&gt;In this first article we will try to understand why a company should invest money in business analytics.&lt;br /&gt;&lt;br /&gt;In this modern era, to make profits a company has to differentiate itself from its competitors. A company that bases its decisions on solid information can make better decisions than a company that does not have access to information. Consider a company that makes and sells &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_1"&gt;ready made&lt;/span&gt; garments. Such a company would set up a network of franchisees throughout the country. Its main marketing strategy to attract customers would be to announce a series of discounts on its products. But how does that company differentiate itself from five other companies that have the same &lt;span class="blsp-spelling-corrected" id="SPELLING_ERROR_2"&gt;strategy&lt;/span&gt;? How would access to information help it?&lt;br /&gt;&lt;br /&gt;Let's take the first question. We need to find out what strategy would make the company and its products more attractive than others. How should the product mix be distributed? Which location or class of cities buys more jeans than trousers? Which part of the city sells more casual shirts than business shirts? If a company could make these decisions, then it could reduce the quantity of dead stock and bring in more customers. Access to information would help the company make these decisions. Data from all its offices can be collated and a monthly report generated. 
This report would give the decision makers an idea of what each location needs.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3143564862155210283-3093468539889964196?l=mithil-tech.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mithil-tech.blogspot.com/feeds/3093468539889964196/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3143564862155210283&amp;postID=3093468539889964196' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3093468539889964196'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3143564862155210283/posts/default/3093468539889964196'/><link rel='alternate' type='text/html' href='http://mithil-tech.blogspot.com/2009/07/business-analytics-roadmap.html' title='Business Analytics - The need'/><author><name>mithil</name><uri>http://www.blogger.com/profile/03610764188052095722</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='27' height='32' src='http://2.bp.blogspot.com/_GvNS-b8AbU4/Sw1ON0F_MRI/AAAAAAAACO8/1mAyeEM7-rA/S220/Mithil_Shah_Photograph.jpg'/></author><thr:total>0</thr:total></entry></feed>
