# Descriptive statistics in R

This type of graph is more complex than the ones presented above, so it is detailed in a separate article.

Welcome to the blog Stats and R. As the name suggests, this blog is about statistics and its applications in R (an open source statistical software program). From time to time, I also present some work related to data science and data visualization using R, news about my research and, to a smaller extent, my journey in the blogging world.

Descriptive statistics is a set of brief descriptive coefficients that summarize a given data set, which may represent an entire population or a sample of it. The following notes by Brian Warner cover the use of R to create measures of central tendency (mean(), median() and the mode) as well as the spread of data through the range, the IQR (interquartile range) and the standard deviation. Note that R’s built-in mode() function returns the storage mode of an object, not its most frequent value. Now that you have an understanding of what a descriptive statistics report shows, I can begin to explain how you can obtain one in R. Regarding plots, we present the default graphs and the graphs from the well-known {ggplot2} package.

The descr() function produces descriptive (univariate) statistics with common central tendency statistics and measures of dispersion. describe(mydata) from the {psych} package reports, for each item, the item name, item number and number of valid cases (nvalid), among other statistics. The aggregate() function allows you to split the data into subsets and then to compute summary statistics for each. When facing a non-normal distribution, the first step is usually to apply the logarithm transformation to the data and recheck whether the log-transformed data are normally distributed.

There are only 2 categorical variables in our dataset, so let’s use the tobacco dataset, which has 4 categorical variables (i.e., gender, age group, smoker, diseased). For this example, we would like to create a contingency table of the variables smoker and diseased, and this for each gender.
Descriptive statistics is the foundation block of summarizing data. Outputs that follow display much better in R Markdown reports, but in this article I limit myself to the raw outputs as the goal is to show how the functions work, not how to make them render well. In this tutorial, I’ll be using a built-in dataset of R called “warpbreaks”.

Length and width of the sepal and petal are numeric variables and the species is a factor with 3 levels (indicated by num and Factor w/ 3 levels after the name of the variables). In our context, rejecting the null hypothesis of independence indicates that species and size are dependent and that there is a significant relationship between the two variables. Boxplots are really useful in descriptive statistics and are often underused (mostly because they are not well understood by the public).

Depending on the arguments used, the output can be limited to a selection of descriptive statistics of your choice, to the minimum, first quartile, median, third quartile and maximum, or to the most common descriptive statistics (mean, standard deviation, minimum, median, maximum, number and percentage of valid observations). Missing values need specific handling if there is at least one in your dataset. Note that the output of the range() function is actually an object containing the minimum and maximum (in that order).
Note that the variable Species is not numeric, so descriptive statistics cannot be computed for this variable and NA are displayed. Descriptive statistics in R are not concerned with the impact of the data. For example, the apply() function can be used to compute the number of observations in the data …

One method of obtaining descriptive statistics is to use the sapply() function with a specified summary statistic. The stat.desc() function reports, among others, nbr.val, nbr.null, nbr.na, min, max, range and sum. The coefficient of variation can be found with stat.desc() (see the line coef.var in its output) or computed manually (remember that the coefficient of variation is the standard deviation divided by the mean). To my knowledge there is no function to find the mode of a variable. (See the difference between a measure of central tendency and dispersion if you need a reminder.) Now, let’s quickly jump to R complex cumulative commands in this R descriptive statistics tutorial.

For some statistical tests, the normality assumption is required in all groups. One solution is to draw a QQ-plot for each group by manually splitting the dataset into different groups and then drawing a QQ-plot for each subset of the data (with the methods shown above). To differentiate groups by only shape or only color, remove one of the arguments col or shape in the qplot() function above. To display results of the Chi-square test of independence, add the chisq = TRUE argument.
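As a minimal sketch of the two manual computations described above, the snippet below computes the coefficient of variation by hand and finds the mode through a frequency table. It assumes the built-in iris dataset and the variable Sepal.Length; the object names (cv, counts, mode_value) are only illustrative.

```r
# Coefficient of variation: standard deviation divided by the mean
x <- iris$Sepal.Length
cv <- sd(x) / mean(x)

# Mode: tabulate the values and keep the most frequent one(s)
counts <- table(x)
mode_value <- as.numeric(names(counts)[counts == max(counts)])

cv
mode_value
```

If several values are tied for the highest frequency, mode_value contains all of them, which is usually what you want to know before reporting a single mode.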
Like boxplots, scatterplots are even more informative when differentiating the points according to a factor, in this case the species. For instance, when drawing a scatterplot of the length of the sepal and the length of the petal, there seems to be a positive association between the two variables. Line plots, particularly useful in time series or finance, can be created by adding the type = "l" argument in the plot() function. See how you can easily draw graphs from the {ggplot2} package without having to code them yourself.

In order to check the normality assumption of a variable (normality means that the data follow a normal distribution, also known as a Gaussian distribution), we usually use histograms and/or QQ-plots. See the article discussing the normal distribution and how to evaluate the normality assumption in R if you need a refresher on that subject.

Summary statistics tables or an exploratory data analysis are the most common ways to familiarize oneself with a data set. Descriptive statistics are used to summarize data in a way that provides insight into the information contained in the data. If well presented, descriptive statistics are already a good starting point for further analyses. Most statistical software is paid software. R provides a wide range of functions for obtaining summary statistics.

We use the dataset iris throughout the article. In this article, we focus only on the implementation in R of the most common descriptive statistics and their visualizations (when deemed appropriate). In this part of the R descriptive statistics tutorial, we will focus on the measures of central tendency. As you may have guessed, any quantile can also be computed with the quantile() function. Let’s look at some ways that you can summarize your data using R. The sleep data set, provided by the datasets package, shows the effects of two different drugs on ten patients.
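The normality check described above can be sketched numerically as follows. This is only an illustration on iris$Sepal.Length: it extracts the QQ-plot coordinates without drawing them and, as an addition that is my own choice here, runs a Shapiro-Wilk test alongside. The names qq, qq_cor and sw are hypothetical.

```r
x <- iris$Sepal.Length

# QQ-plot coordinates (theoretical vs. sample quantiles) without drawing;
# drop plot.it = FALSE to display the plot, then add qqline(x)
qq <- qqnorm(x, plot.it = FALSE)

# Correlation between the two sets of quantiles: values close to 1
# mean the points lie close to a straight line
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test: the null hypothesis is that the data are normally distributed
sw <- shapiro.test(x)

qq_cor
sw$p.value
```

A small p-value suggests a departure from normality, but visual inspection remains useful since formal tests become very sensitive with large samples.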
There are also numerous R functions designed to provide a range of descriptive statistics at once. Descriptive statistics are divided into two types: location measures give an understanding of the central tendency of the data, whereas dispersion measures give an understanding of the spread of the data. See online or in the above-mentioned article for more information about the purpose and usage of each measure.

This dataset is available by default in R; you only need to load it by running iris. The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers.

The mean can be computed with the mean() function. The median can be computed thanks to the median() function, or with the quantile() function, since the quantile of order 0.5 (\(q_{0.5}\)) corresponds to the median. If your results differ slightly from those computed by hand, it is normal: there are many methods to compute quantiles (the quantile() function in R actually implements 9 of them, selected with the type argument). A custom function can also return several statistics at once, for example c(m = mean(x), s = sd(x)).

To draw a histogram in R, use hist(). Add the argument breaks = inside the hist() function if you want to change the number of bins. The IQR criterion means that all observations above \(q_{0.75} + 1.5 \cdot IQR\) or below \(q_{0.25} - 1.5 \cdot IQR\) (where \(q_{0.25}\) and \(q_{0.75}\) correspond to the first and third quartile respectively) are considered as potential outliers by R. The minimum and maximum in the boxplot are represented without these suspected outliers.

In this example, I’ll show how to use the basic installation of the R programming language to return descriptive summary statistics by group. We want to group the data by Species and then compute the number of elements in each group. Tip: if you have a large number of variables, add the transpose = TRUE argument for a better display.
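The IQR criterion described above can be reproduced by hand. The sketch below applies it to iris$Sepal.Width (my choice of variable, since it contains a few suspected outliers); the names lower, upper and outliers are illustrative.

```r
x <- iris$Sepal.Width

# First and third quartiles and the interquartile range
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- IQR(x)

# Fences of the IQR criterion
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers
outliers <- x[x < lower | x > upper]
outliers
```

Note that boxplot() computes its box from hinges rather than from quantile(), so the flagged points can occasionally differ slightly from this manual version.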
We draw a barplot of the qualitative variable size. You can also draw a barplot of the relative frequencies instead of the frequencies by adding prop.table() as we did earlier. Barplots can only be done on qualitative variables (see the difference with a quantitative variable). For your information, a mosaic plot can also be done via the mosaic() function from the {vcd} package.

A histogram gives an idea about the distribution of a quantitative variable. A rule of thumb (the square-root rule) is that the number of bins should be the rounded value of the square root of the number of observations. Sturges’ rule, which hist() uses by default, instead sets the number of bins to \(\lceil \log_2 n \rceil + 1\), where \(n\) is the number of observations.

Any quantile can be computed, for instance the \(4^{th}\) decile or the \(98^{th}\) percentile. The interquartile range (i.e., the difference between the first and third quartile) can be computed with the IQR() function or alternatively with the quantile() function again. As mentioned earlier, when possible it is usually recommended to use the shortest piece of code to arrive at the result. For this reason, the IQR() function is preferred to compute the interquartile range. Minimum and maximum can be found thanks to the min() and max() functions; alternatively, the range() function gives you the minimum and maximum directly. The mode of the variable Sepal.Length is thus 5.

Descriptive statistics describe the data and give more detailed knowledge about them. On the other hand, statistics is all about drawing conclusions from data, which is a necessary initial step for machine learning. Task 6: calculate descriptive statistics on all columns. There are functions in R that can be applied to each column to perform certain calculations on them. stat.desc(mydata) returns, among other statistics, the mean and the standard deviation. One package for descriptive statistics I often use for my projects in R is the {summarytools} package. In addition, summary statistics tables are very easy and fast to create, which is why they are so common.

I hope this article helped you to do descriptive statistics in R.
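To make the choice of the number of bins concrete, here is a small sketch comparing the square-root rule with Sturges’ formula (the default used by hist()) on the 150 observations of iris. The histogram is computed with plot = FALSE so nothing is drawn; object names are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

# Square-root rule: rounded square root of the number of observations
bins_sqrt <- round(sqrt(n))

# Sturges' formula, as implemented in grDevices::nclass.Sturges()
bins_sturges <- nclass.Sturges(x)

# breaks = is only a suggestion: hist() rounds it to "pretty" break points
h <- hist(x, breaks = bins_sqrt, plot = FALSE)

bins_sqrt
bins_sturges
length(h$counts)
```

Because hist() treats a single number passed to breaks = as a suggestion, the number of bins actually drawn may differ from the value requested.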
If you would like to do the same by hand or understand what these statistics represent, I invite you to read the article “Descriptive statistics by hand”. Descriptive statistics is divided into the measures of central tendency and the measures of dispersion. The tools of descriptive statistics are based on mathematical and statistical functions which are evaluated using software. In the course of learning a bit about how to generate data summaries in R, one will inevitably learn some useful R syntax and commands.

If you are comfortable writing functions in R, you can create your own function to compute the range, which is equivalent to the \(max - min\) presented above. Like the median, the first and third quartiles can be computed thanks to the quantile() function by setting the second argument to 0.25 or 0.75. You may have seen that the results are slightly different from the results you would have found by computing the first and third quartiles by hand. Furthermore, results do not dramatically change between the two methods.

Here is a simple example: a basic summary reports the mean, median, 25th and 75th percentiles (first and third quartiles), minimum and maximum, and statistics can also be computed for each combination of the levels of cyl and vs. The information shown depends on the type of the variables (character, factor, numeric, date) and also varies according to the number of distinct values. If you need more descriptive statistics, use stat.desc() from the package {pastecs}. You can have even more statistics (i.e., skewness, kurtosis and normality test) by adding the argument norm = TRUE in the previous function.

To run R in SDI mode on Windows, edit the Target field on the Shortcut tab to read "C:\Program Files\R\R-2.5.1\bin\Rgui.exe" --sdi (including the quotes exactly as shown, and assuming that you've installed R to the default location).
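The two points above, writing your own range function and the effect of the quantile method, can be sketched as follows. The function name range_width and the small vector v are hypothetical and only serve the illustration.

```r
# A user-defined function returning the range as a single number (max - min)
range_width <- function(x) {
  r <- range(x, na.rm = TRUE)  # r[1] is the minimum, r[2] the maximum
  r[2] - r[1]
}
width <- range_width(iris$Sepal.Length)

# The same quantile computed with two of the methods offered by quantile()
v <- c(1, 2, 3, 4, 10)
q_type7 <- unname(quantile(v, 0.25, type = 7))  # R's default method
q_type6 <- unname(quantile(v, 0.25, type = 6))  # another common definition

width
q_type7
q_type6
```

On small samples the methods can disagree noticeably; on larger datasets such as iris the differences are usually small.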
Normality tests such as the Shapiro-Wilk or Kolmogorov-Smirnov tests can also be used to test whether the data follow a normal distribution or not. Another (easier) solution is to draw a QQ-plot for each group automatically with the argument groups = in the function qqPlot() from the {car} package. It is also possible to differentiate groups by only shape or color.

You need to learn the shape, size, type and general layout of the data that you have. See a recap of the different data types in R if needed. This article explains how to compute the main descriptive statistics in R and how to present them graphically. When it comes to descriptive statistics examples, problems and solutions, we can give many of them to explain and support the general definition and types. The median is the value separating the higher half from the lower half of a set of numbers. Cumulative commands should be used with other commands to produce additional useful results; for example, the running mean.

Since range() returns the minimum and the maximum, you can access the minimum by taking the first element of its output. This reminds us that, in R, there are often several ways to arrive at the same result.

R has a lot of built-in functions for descriptive statistics; however, if you want to compute statistics by, say, gender, some more complex manipulations are needed. To compute summary statistics by groups, the functions group_by() and summarise() [in the dplyr package] can be used. With the {psych} package, describeBy(mydata, group, ...) produces summary statistics by group (older versions of the package named this function describe.by()). Summary statistics can also be computed excluding missing values. Change the order if you want to switch the two variables.

Moreover, the package has been built with R Markdown in mind, meaning that outputs render well in HTML reports. See the vignette of the package for more information on this matter as these ratios are beyond the scope of this article. There are, however, many more functions and packages to perform more advanced descriptive statistics in R. R in Action (2nd ed) significantly expands upon this material.
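Summary statistics by group, as discussed above, can also be obtained with base R alone. The sketch below uses aggregate() and tapply() on iris grouped by Species rather than the dplyr functions, simply to stay free of extra packages; the object names are illustrative.

```r
# Mean of two variables for each species, using a model formula
group_means <- aggregate(cbind(Sepal.Length, Sepal.Width) ~ Species,
                         data = iris, FUN = mean)

# Number of observations and standard deviation per species
group_counts <- tapply(iris$Sepal.Length, iris$Species, length)
group_sd <- tapply(iris$Sepal.Length, iris$Species, sd)

group_means
group_counts
group_sd
```

aggregate() returns a data frame with one row per group, whereas tapply() returns a named vector, which is convenient for a single statistic.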
In this section, I present some of them with applications to our dataset.

In R, the standard deviation and the variance are computed as if the data represent a sample (so the denominator is \(n - 1\), where \(n\) is the number of observations). To my knowledge, there is no function by default in R that computes the standard deviation or variance for a population. Tukey’s five-number summary, available through fivenum(), gives the minimum, lower hinge, median, upper hinge and maximum. The method that uses the shortest piece of code is usually preferred, as a shorter piece of code is less prone to coding errors and more readable.

Descriptive statistics summarize and organize characteristics of a data set. This might include examining the mean or median of numeric data or the frequency of observations for nominal data. In the sleep data set, extra is the increase in hours of sleep; group is the drug given, 1 or 2; and ID is the patient ID, 1 to 10. I’ll be using this data set to show how to perform descriptive statistics of groups within a data set, when the data set is long (as opposed to wide). More precisely, I’m using the tapply() function. A simple way of generating summary statistics by grouping variable is also available in the psych package. This package makes it fairly straightforward to produce such a table using R.

For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset. Instead of having the frequencies (i.e., the number of cases), you can also have the relative frequencies (i.e., proportions) in each subgroup by adding the table() function inside the prop.table() function. Note that you can also compute the percentages by row or by column by adding a second argument to the prop.table() function: 1 for row, or 2 for column. See the section on advanced descriptive statistics for more advanced contingency tables.

In a QQ-plot, the bigger the deviation between the points and the reference line and the more they lie outside the confidence bands, the less likely it is that the normality condition is met.
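The contingency tables described above can be sketched as follows. The definition of the size variable is an assumption on my part (a flower is "small" when its sepal length is below the median and "big" otherwise), chosen because it reproduces the counts quoted in the text; the object names are illustrative.

```r
dat <- iris

# Assumed definition of size: below the median sepal length is "small"
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

# Frequencies: number of cases in each Species x size subgroup
tab <- table(dat$Species, dat$size)

# Relative frequencies: overall, by row (1) and by column (2)
prop_all <- prop.table(tab)
prop_row <- prop.table(tab, 1)
prop_col <- prop.table(tab, 2)

tab
round(prop_row, 2)
```

With the second argument set to 1, each row sums to 1, so the table reads as the share of small and big flowers within each species.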
The central tendency is something we calculate because we often want to know about the “average” or “middle” of our data. The two most commonly used measures of central tendency, the mean and the median, can easily be obtained using R. Possible functions used in sapply() include mean, sd, var, min, max, median, range, and quantile.

The standard deviation and the variance are computed with the sd() and var() functions. Remember from the article “Descriptive statistics by hand” that the standard deviation and the variance differ depending on whether we compute them for a sample or a population (see the difference between sample and population).

If you do not need information about missing values, add the report.nas = FALSE argument. A minimalist output with only counts and proportions is also possible. The ctable() function produces cross-tabulations (also known as contingency tables) for pairs of categorical variables. In order to compute these descriptive statistics by group (e.g., Species in our dataset), use the descr() function in combination with the stby() function. The dfSummary() function generates a summary table with statistics, frequencies and graphs for all variables in a dataset. And for non-English speakers, built-in translations exist for French, Portuguese, Spanish, Russian and Turkish.

A density plot is a smoothed version of the histogram and is used for the same purpose, that is, to represent the distribution of a numeric variable. Scatterplots are often used to visualize a potential correlation between two variables. Before drawing a boxplot of our data, it is worth recalling how to interpret the information present on a boxplot. Graphs from the {ggplot2} package usually look better, but they require more advanced coding skills (see the article “Graphics in R with ggplot2” to learn more). For instance, it is possible to edit the title, x and y-axis labels, color, etc.
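As an illustration of the sapply() approach and of the sample versus population distinction mentioned above, here is a short sketch on the four numeric columns of iris. The rescaling by \((n - 1) / n\) is the usual way to obtain the population variance from var(); the object names are illustrative.

```r
# Apply one summary statistic to every numeric column
num <- iris[, 1:4]
means <- sapply(num, mean, na.rm = TRUE)
sds <- sapply(num, sd, na.rm = TRUE)

# sd() and var() use the sample formulas (denominator n - 1);
# rescale to obtain the population versions (denominator n)
x <- iris$Sepal.Length
n <- length(x)
pop_var <- var(x) * (n - 1) / n
pop_sd <- sqrt(pop_var)

means
sds
c(sample_sd = sd(x), population_sd = pop_sd)
```

With 150 observations the two standard deviations are very close, but the difference matters for small samples.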
summary(mydata) reports the mean, median, quartiles, minimum and maximum of each numeric variable. A major advantage of this function is that it accepts single vectors as well as data frames. Summaries by group are also available through the {doBy} package (loaded with library(doBy)). The basic arithmetic mean is the sum divided by the number of observations.

A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a collection of information, while descriptive statistics (in the mass noun sense) is the process of using and analysing those statistics. Thus, this first tutorial on descriptive statistics serves a dual role as a brief introduction to R.

The {summarytools} package is centered around 4 functions: freq(), ctable(), descr() and dfSummary(). A combination of these 4 functions is usually more than enough for most descriptive analyses. I illustrate each of the 4 functions in the following sections. The freq() function produces frequency tables with frequencies, proportions, as well as missing data information. Marginals are the totals in a cross-tabulation by row or column.

Histograms are a bit similar to barplots, but histograms are used for quantitative variables whereas barplots are used for qualitative variables. For instance, boxplots can be used to compare the length of the sepal across the different species. A dotplot is more or less similar to a boxplot, except that observations are represented as points and there are no summary statistics presented on the plot. Scatterplots allow you to check whether there is a potential link between two quantitative variables. Computing correlation in R requires a detailed explanation, so I wrote an article covering correlation and correlation test.

In particular, the virginica species is the biggest, and the setosa species is the smallest of the three species (in terms of sepal length since the variable size is based on the variable Sepal.Length). The variable Sepal.Length does not seem to follow a normal distribution because several points lie outside the confidence bands. Interested readers will find numerous resources online.
When this tutorial is used online, the indented lines in non-proportional font …

Tip: I recently discovered the ggplot2 builder from the {esquisse} addins. The first and best place to start is to calculate basic summary descriptive statistics on your data.

It is often the case that the normality condition is verified based on a combination of visual inspections (with histograms and QQ-plots) and a formal test (the Shapiro-Wilk test for instance). Note that the plain.ascii and style arguments are needed for this package. At least this was true in the past.

We covered the main functions to compute the most common and basic descriptive statistics.