Exploratory Data Analysis (EDA)

This analysis summarizes the main characteristics of the variables in the 2022 country-level World Population Prospects (WPP 2022) data, often with graphs, without fitting a statistical model.

Overview of the data

This section covers the dimensions of the dataset, the variable names, an overall summary of missing values, and the data type of each variable.

Summary of the data

The initial overview provides a summary of the dataset’s dimensions and the types of variables it contains. This includes the number of rows and columns, the types of variables (numeric, factor, text, etc.), and the percentage of complete cases.

Structure of the data

This section delves deeper into the structure of the data, detailing each variable’s type, the number of samples, missing values, and the percentage of missing data.

This information is crucial for understanding the overall data quality and identifying potential issues, which in turn makes it possible to evaluate the reliability of subsequent analyses. Confirming that each variable’s data type (e.g., numeric, character) is correctly identified is essential for appropriate statistical testing and modeling.
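The quantities described above (variable type, missing count, missing percentage, share of complete cases) can be sketched in base R. This is only an illustration on a hypothetical toy data frame, `toy`, with made-up values; it is not the output or the implementation of `ExpData()`.

```r
# Hypothetical toy data standing in for the country data (values are illustrative only)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  pop     = c(1.2, NA, 3.4, 5.6),
  income  = factor(c("High", "Low", NA, "Low")),
  stringsAsFactors = FALSE
)

# Per-variable type, missing count, and missing percentage
structure_summary <- data.frame(
  variable    = names(toy),
  type        = vapply(toy, function(x) class(x)[1], character(1)),
  n_missing   = colSums(is.na(toy)),
  pct_missing = round(100 * colMeans(is.na(toy)), 1),
  row.names   = NULL
)

# Share of rows with no missing value in any column
pct_complete <- 100 * mean(complete.cases(toy))

print(structure_summary)
print(pct_complete)
```

A variable with a high `pct_missing` or an unexpected `type` (for example, a numeric column read as character) is worth checking before any summary statistic is interpreted.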

Summary of numerical variables

This summary provides key statistical measures for all numeric variables, including measures of central tendency (mean, median) and dispersion (standard deviation, interquartile range), as well as skewness and kurtosis to understand the distribution shape.

The summary statistics highlight the central tendency and dispersion of each numeric variable. High skewness or kurtosis values may indicate outliers or a non-normal distribution, which may call for a transformation or special handling in modeling.
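To make the shape measures concrete, the sketch below computes moment-based skewness and excess kurtosis in base R. The helper names `skewness()` and `excess_kurtosis()` are hypothetical, and the packages used in this document may apply different (for example, sample-adjusted) estimators, so the numbers need not match their tables exactly.

```r
# Moment-based skewness: third central moment scaled by variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment scaled by variance^2, minus 3
# (so that a normal distribution scores about 0)
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)   # illustrative values
right_skewed <- c(1, 1, 1, 2, 10)  # one large value pulls the right tail

print(skewness(symmetric))
print(skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness points to a long right tail, which is common for population counts; a log transformation is one option to consider in that case.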

Distributions of numerical variables

Graphical representation of all numeric features

Quantile-quantile plot (Univariate)

The quantile-quantile (Q-Q) plot compares the quantiles of each numeric variable with those of a reference distribution, here the normal distribution, so that normality can be assessed visually. Deviations from the reference line suggest that the data may not be normally distributed, which could affect parametric statistical analyses.
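As a minimal base R sketch of the same idea, `qqnorm()` pairs sample quantiles with theoretical normal quantiles and `qqline()` adds the reference line. The two vectors below are constructed from exact quantiles purely so that the example is deterministic; they are not taken from the WPP data.

```r
# Illustrative inputs built from exact quantiles (deterministic, not real data)
x_normal <- qnorm(ppoints(200))  # behaves like a perfectly normal sample
x_skewed <- qexp(ppoints(200))   # right-skewed, exponential shape

# Theoretical (x) versus sample (y) quantiles, without drawing
qq_normal <- qqnorm(x_normal, plot.it = FALSE)
qq_skewed <- qqnorm(x_skewed, plot.it = FALSE)

# Correlation of the Q-Q points: near 1 when the points follow a straight line
r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)
print(c(normal = r_normal, skewed = r_skewed))

# Draw the Q-Q plot for the skewed vector with its reference line
qqnorm(x_skewed, main = "Normal Q-Q plot (skewed example)")
qqline(x_skewed)
```

For the skewed vector the points bend away from the line in the upper tail, which is the pattern to look for in the plots of this section.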


Density plot (Univariate)

Density plots provide a smoothed estimate of the data distribution, allowing for easy visualization of the distribution shape and comparison between different variables.

Peaks in a density plot indicate modes, and the spread of the curve indicates variability. These plots help in identifying potential skewness or multimodality.
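A kernel density estimate can be sketched with base R's `density()`. The bimodal vector below is constructed for illustration only (two normal shapes centred at 0 and 6), and the default bandwidth is an assumption; a different bandwidth would smooth the curve more or less.

```r
# Illustrative bimodal input: two normal shapes centred at 0 and 6
x <- c(qnorm(ppoints(100), mean = 0), qnorm(ppoints(100), mean = 6))

# Kernel density estimate with the default Gaussian kernel and bandwidth
d <- density(x)

# The estimated curve should integrate to roughly 1 over its grid
area <- sum(d$y) * diff(d$x)[1]

# Density at the two centres and at the valley between them
at_points <- approx(d$x, d$y, xout = c(0, 3, 6))$y
print(area)
print(at_points)

plot(d, main = "Kernel density estimate (bimodal example)")
```

Two peaks separated by a clear valley, as in this example, suggest that the variable mixes two groups and may be worth splitting by a categorical variable.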


Scatter plot (Bivariate)

Scatter plots explore the relationship between pairs of numeric variables. An upward or downward trend in the point cloud suggests a positive or negative correlation, while outliers and other patterns provide insights into the data structure and potential interactions between variables.
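A scatter plot is usually read together with a correlation coefficient. The sketch below uses the built-in `mtcars` data set as a stand-in for the country data (an assumption made only so that the example is self-contained) and reports both Pearson and Spearman correlations.

```r
# mtcars ships with R and stands in for the country data here
pearson  <- cor(mtcars$wt, mtcars$mpg, method = "pearson")   # linear association
spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")  # monotonic association
print(c(pearson = pearson, spearman = spearman))

# Scatter plot with a fitted straight line to make the trend visible
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot with linear fit")
abline(lm(mpg ~ wt, data = mtcars))
```

When the two coefficients differ noticeably, the relationship may be monotonic but not linear, or a few outliers may be driving the Pearson value.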


Summary of categorical variables

Summary statistics for categorical variables provide information on the frequency and proportion of each category within the variable.

Frequency for all categorical independent variables

This section provides a frequency table for all categorical variables, showing the count and percentage of each category within the variable. The frequency tables indicate how often each category occurs in the dataset. This is crucial for understanding the distribution of categorical variables and identifying any categories that might dominate or be underrepresented.
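The count and percentage columns of such a frequency table can be sketched in base R. The `income` vector is hypothetical, and `base::table()` is called explicitly because this document redefines `table()` as a `datatable()` helper.

```r
# Hypothetical categorical variable (values are illustrative only)
income <- c("High", "Low", "Low", "Middle", "Low", "High")

# Counts per category; categories are ordered alphabetically
counts <- base::table(income)

# Frequency table with counts and percentages
freq <- data.frame(
  category = names(counts),
  n        = as.integer(counts),
  pct      = round(100 * as.numeric(prop.table(counts)), 2)
)
print(freq)
```

A category with a very small `pct` may be too sparse to support group comparisons and is a candidate for merging with a neighbouring category.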

Distributions of categorical variables

Bar plots for all categorical variables visualize the distribution of categories within each variable, making it easy to see which categories are most or least common.

Bar plots with vertical or horizontal bars for all categorical variables

The following section identifies and visualizes categorical variables with fewer than 11 unique values using bar plots. This is particularly useful for variables with a manageable number of categories.

Bar plots provide a visual representation of the categorical variables’ distributions. This helps in quickly identifying the most and least common categories. Such visualizations are useful for understanding the data’s structure and for subsequent analyses like feature selection.
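The selection rule described above (categorical variables with fewer than 11 unique values) can be sketched in base R as follows. The data frame `toy` and its columns are hypothetical, and this is not how `plot_bar()` selects variables internally.

```r
# Hypothetical toy data: `id` has 12 unique values and should be skipped
toy <- data.frame(
  id     = as.character(1:12),
  region = rep(c("North", "South", "East"), times = 4),
  income = rep(c("High", "Low"), times = 6),
  value  = seq(0.5, 6, by = 0.5),
  stringsAsFactors = FALSE
)

# Keep character or factor columns only
is_cat <- vapply(toy, function(x) is.character(x) || is.factor(x), logical(1))

# Count unique values per categorical column and apply the threshold
n_unique  <- vapply(toy[is_cat], function(x) length(unique(x)), integer(1))
plot_vars <- names(n_unique)[n_unique < 11]
print(plot_vars)

# One horizontal bar plot per selected variable
for (v in plot_vars) {
  barplot(base::table(toy[[v]]), main = v, horiz = TRUE, las = 1)
}
```

Identifier-like columns such as `id` have as many categories as rows, so excluding them keeps the bar plots readable.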

---
title: "Exploratory Data Analysis (EDA)"
toc: true
format:
  html:
    grid:
      body-width: 1200px
    code-tools: true
    fig-width: 8
    fig-height: 6
    fig-align: center
execute:
  echo: false
  warning: false
editor_options:
  chunk_output_type: console
---

{{< include ../_setup.qmd >}}

```{r setup}
#| include: false
pacman::p_load(
  DataExplorer,
  SmartEDA,
  skimr,
  dlookr,
  ggstatsplot,
  DT
)
```

```{r utils}
#| include: false
#| cache: true

# Country Data
data <- wpp$countries |>
  select(-loc_type_id) |>
  filter(income != "")

# EDA Utils
sn <- 4             # number of variables sampled for the SmartEDA plots
theme <- "Default"  # SmartEDA plot theme (theme() is still found as a function)

# Render a data frame as an interactive table, rounding numeric columns to 2 decimals.
# Note: this masks base::table() inside this document.
table <- function(df) {
  df |>
    mutate(across(where(is.numeric), \(x) round(x, 2))) |>
    datatable(
      rownames = FALSE,
      extensions = c('Buttons', 'ColReorder', 'FixedColumns', 'Select'),
      selection = 'none',
      options = list(
        pageLength = 10,
        lengthMenu = c(10, 20, 30, 50),
        autoWidth = TRUE,
        searchHighlight = TRUE,
        colReorder = TRUE,
        dom = 'Bfrtip',
        # buttons = list(I('colvis'), 'copy', 'csv', 'excel'),
        buttons = c('copy', 'csv', 'excel'),
        fixedColumns = list(leftColumns = 2)
      )
    )
}

# DataExplorer Themes
cat_theme <- theme(
  legend.position = "top",
  legend.title = element_text(size = 0),
  plot.title = element_text(hjust = 0.5)
)
```

This analysis summarizes the main characteristics of the variables in the 2022 country-level [World Population Prospects](https://population.un.org/wpp/) (WPP 2022) data, often with graphs, without fitting a statistical model.

## Overview of the data

This section covers the dimensions of the dataset, the variable names, an overall summary of missing values, and the data type of each variable.

### Summary of the data

The initial overview provides a summary of the dataset's dimensions and the types of variables it contains. This includes the number of rows and columns, the types of variables (numeric, factor, text, etc.), and the **percentage of complete cases**.

```{r}
ExpData(data = data, type = 1) |> table()
```

### Structure of the data

This section delves deeper into the structure of the data, detailing each variable's type, the number of samples, missing values, and the **percentage of missing data**.

This information is crucial for understanding the overall data quality and identifying potential issues, which in turn makes it possible to evaluate the reliability of subsequent analyses. Confirming that each variable's data type (e.g., numeric, character) is correctly identified is essential for appropriate statistical testing and modeling.

```{r}
ExpData(data = data, type = 2) |> table()
```

## Summary of numerical variables

This summary provides key statistical measures for all numeric variables, including measures of central tendency (mean, median) and dispersion (standard deviation, interquartile range), as well as skewness and kurtosis to understand the distribution shape.

The summary statistics highlight the central tendency and dispersion of each numeric variable. High skewness or kurtosis values may indicate outliers or a non-normal distribution, which may call for a transformation or special handling in modeling.

::: {.panel-tabset}

### `dlookr` table summary

```{r}
dlookr::describe(data) |> table()
```

### `dlookr` five number summaries

```{r}
data |> diagnose_numeric() |> table()
```

### `SmartEDA` table summary

```{r}
ExpNumStat(data, by = "A", gp = NULL, Qnt = seq(0, 1, 0.1), MesofShape = 2, Outlier = TRUE, round = 2) |> table()
```

:::

### Numerical summaries by geography

::: {.panel-tabset}

### By region

```{r}
byRegion <- data |> group_by(region) |> univar_numeric()
byRegion$statistics |> table()
```

### By subregion

```{r}
bySubregion <- data |> group_by(region_int) |> univar_numeric()
bySubregion$statistics |> table()
```

:::

## Distributions of numerical variables

Graphical representation of all numeric features

### Quantile-quantile plot (Univariate)

The quantile-quantile (Q-Q) plot compares the quantiles of each numeric variable with those of a reference distribution, here the normal distribution, so that normality can be assessed visually. Deviations from the reference line suggest that the data may not be normally distributed, which could affect parametric statistical analyses.

```{r}
ExpOutQQ(data, nlim = 4, fname = NULL, Page = c(2, 2), sample = sn)
```

### Density plot (Univariate)

Density plots provide a smoothed estimate of the data distribution, allowing for easy visualization of the distribution shape and comparison between different variables.

Peaks in a density plot indicate modes, and the spread of the curve indicates variability. These plots help in identifying potential skewness or multimodality.

```{r}
ExpNumViz(data, target = NULL, type = 1, nlim = 10, fname = NULL, col = NULL, Page = c(2, 2), theme = theme, sample = sn)
```

### Scatter plot (Bivariate)

Scatter plots explore the relationship between pairs of numeric variables. An upward or downward trend in the point cloud suggests a positive or negative correlation, while outliers and other patterns provide insights into the data structure and potential interactions between variables.

```{r}
ExpNumViz(data, Page = c(2, 1), sample = sn, theme = theme, scatter = TRUE)
```

## Summary of categorical variables

Summary statistics for categorical variables provide information on the frequency and proportion of each category within the variable.

### Frequency for all categorical independent variables

This section provides a frequency table for all categorical variables, showing the count and percentage of each category within the variable. The frequency tables indicate how often each category occurs in the dataset. This is crucial for understanding the distribution of categorical variables and identifying any categories that might dominate or be underrepresented.

```{r}
ExpCTable(data, Target = NULL, margin = 1, clim = 10, nlim = 5, round = 2, bin = NULL, per = TRUE) |> table()
```

## Distributions of categorical variables

Bar plots for all categorical variables visualize the distribution of categories within each variable, making it easy to see which categories are most or least common.

### Bar plots with vertical or horizontal bars for all categorical variables

The following section identifies and visualizes categorical variables with fewer than 11 unique values using bar plots. This is particularly useful for variables with a manageable number of categories.

Bar plots provide a visual representation of the categorical variables' distributions. This helps in quickly identifying the most and least common categories. Such visualizations are useful for understanding the data's structure and for subsequent analyses like feature selection.

::: {.panel-tabset}

### by Region

```{r}
# Factors: (region, region_int, ldc, lldc, sids, income, lending)
plot_bar(
  data |> select(-region_int),
  by = "region",
  title = "Categorical variables distribution by region",
  ggtheme = theme_minimal(),
  theme_config = cat_theme
)
```

### by Income Level

```{r}
# Factors: (region, region_int, ldc, lldc, sids, income, lending)
plot_bar(
  data |> select(-region_int),
  by = "income",
  title = "Categorical variables distribution by income level",
  ggtheme = theme_minimal(),
  theme_config = cat_theme
)
```

### by LDC status

```{r}
# Factors: (region, region_int, ldc, lldc, sids, income, lending)
plot_bar(
  data,
  by = "ldc",
  title = "Categorical variables distribution by LDC status",
  ggtheme = theme_minimal(),
  theme_config = cat_theme
)
```

### Income by region stats

```{r}
ggbarstats(data, x = income, y = region, label = "both")
```

### LDC status by region stats

```{r}
ggbarstats(data, x = ldc, y = region, label = "both")
```

:::