Data cleaning is a crucial step in the data analysis process that involves identifying and rectifying errors, inconsistencies, and inaccuracies in a dataset. As a data analyst, your goal during data cleaning is to prepare a high-quality, reliable, and well-structured dataset for further analysis. Clean data is essential for obtaining accurate insights and making informed decisions. Here are definitions and code snippets to help you clean data in R.
Cleaning Data in R:
- Data Cleaning: The process of identifying and correcting errors or inconsistencies in a dataset, including handling missing values, removing duplicates, and dealing with outliers.
- Removing Duplicates: Removing duplicate rows from a dataset to ensure each observation is unique.
  library(dplyr)
  clean_data <- data %>% distinct()
- Handling Outliers: Managing data points that significantly deviate from the rest of the data, often by removing, capping, or transforming them.
  clean_data <- data %>% filter(variable < upper_threshold)
- Data Transformation: Modifying data to make it suitable for analysis, e.g., converting data types or scaling numeric variables.
  clean_data <- data %>% mutate(new_variable = log(old_variable))
- Missing Data Imputation: Filling in missing values with estimated or imputed values to maintain data integrity.
  clean_data <- data %>% mutate(variable = ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable))
- Data Validation: Checking data for accuracy and consistency according to predefined rules or constraints (a minimal validation sketch appears after this list).
- Standardization: Scaling variables to have a mean of 0 and a standard deviation of 1.
  library(caret)
  preprocess_params <- preProcess(data, method = c("center", "scale"))
  clean_data <- predict(preprocess_params, data)
- Data Profiling: Summarizing and exploring data to identify potential issues or patterns (a quick profiling sketch appears after this list).
- String Cleaning: Removing extra spaces, special characters, or formatting issues from string variables.
  clean_data <- data %>% mutate(name = gsub("\\s+", " ", name))
- Data Type Conversion: Changing the data type of a variable to match the intended analysis.
  clean_data <- data %>% mutate(age = as.numeric(age))
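As a rough illustration of rule-based validation, the sketch below flags rows that break two hypothetical rules (ages outside 0–120 and negative prices); the column names and thresholds are assumptions, not part of the examples above.
  library(dplyr)
  # Collect rows that violate the validation rules (hypothetical columns: age, price)
  invalid_rows <- data %>% filter(age < 0 | age > 120 | price < 0)
  # Stop early if any rule is violated
  stopifnot(nrow(invalid_rows) == 0)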
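For data profiling, a minimal sketch using base R is shown below; str() and summary() are standard functions, and dedicated profiling packages can go further.
  # Quick structural and statistical overview of the dataset
  str(data)      # column types and example values
  summary(data)  # per-column summaries, including NA counts for numeric columns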
Selecting Groups of Observations and Creating New Calculated Fields:
- Grouping and Aggregating: Grouping observations based on one or more variables and calculating summary statistics for each group.
  grouped_data <- data %>% group_by(category) %>% summarise(mean_value = mean(variable))
- Creating Calculated Fields: Adding new variables to a dataset based on calculations from existing variables.
  data <- data %>% mutate(total_sales = quantity * price)
- Conditional Selection: Selecting observations that meet specific conditions.
  selected_data <- data %>% filter(variable > threshold)
- Ranking and Ordering: Assigning ranks or sorting observations based on a variable’s values.
  ranked_data <- data %>% arrange(desc(revenue)) %>% mutate(rank = row_number())
- Case When Statements: Creating conditional expressions to derive values in new fields.
  data <- data %>% mutate(category_group = case_when(category %in% c("A", "B") ~ "AB Group", TRUE ~ "Other"))
- Top N Selection: Selecting the top N observations based on a specific variable.
  top_sales <- data %>% arrange(desc(sales)) %>% slice_head(n = 10)
- Distinct Values: Extracting unique values from a variable for further analysis.
  unique_categories <- data %>% distinct(category)
- Sampling Data: Randomly selecting a subset of observations for analysis.
  sample_data <- data %>% sample_n(size = 100)
- Window Functions: Calculating values over specific windows of data, like moving averages or cumulative sums.
  library(dplyr)
  data <- data %>% arrange(date) %>% mutate(rolling_avg = zoo::rollmean(variable, k = 5, fill = NA))
- Conditional Mutation: Modifying variable values based on conditions.
  data <- data %>% mutate(adjusted_price = ifelse(quantity > 10, price * 0.9, price))
Pivoting Data in R: Wide and Long Format:
Pivoting Data to Wide Format:
- Spread Function: Reshaping data from long to wide format using the spread function from the tidyr package.
  library(tidyr)
  wide_data <- data %>% spread(key = variable, value = value)
- Reshaping with Pivot Wider: Using the pivot_wider function to convert long data to wide format.
  library(tidyr)
  wide_data <- data %>% pivot_wider(names_from = variable, values_from = value)
- Aggregation during Pivoting: Aggregating data during the pivot process using functions like mean, sum, etc.
  wide_summary <- data %>% pivot_wider(names_from = variable, values_from = value, values_fn = list(value = mean))
- Dealing with Missing Values: Specifying how to handle missing values during pivoting using the values_fill argument.
  wide_filled <- data %>% pivot_wider(names_from = variable, values_from = value, values_fill = 0)
- Multi-level Column Headers: Building wide column names from more than one variable for better organization in wide data.
  wide_data <- data %>% pivot_wider(names_from = c(category, year), values_from = value)
Pivoting Data to Long Format:
- Gather Function: Converting wide data to long format using the gather function from the tidyr package.
  library(tidyr)
  long_data <- wide_data %>% gather(key = variable, value = value, -id)
- Reshaping with Pivot Longer: Using the pivot_longer function to transform wide data into long format.
  library(tidyr)
  long_data <- wide_data %>% pivot_longer(cols = starts_with("variable"), names_to = "variable", values_to = "value")
- Handling Multiple Variables: Pivoting multiple columns to long format using pivot_longer.
  long_data <- wide_data %>% pivot_longer(cols = c(variable1, variable2), names_to = "variable", values_to = "value")
- Extracting Information from Column Headers: Creating new variables by extracting information from column headers during pivoting.
  long_data <- wide_data %>% pivot_longer(cols = starts_with("Q"), names_to = "quarter", values_to = "value")
- Reshaping Time Series Data: Converting wide time series data into a long format suitable for time-based analysis.
  long_time_series <- wide_data %>% pivot_longer(cols = starts_with("month"), names_to = "month", values_to = "value")
Handling Missing Values:
- Identifying Missing Values: Checking for missing values in a dataset using functions like is.na() or complete.cases().
  missing_rows <- data[!complete.cases(data), ]
- Removing Missing Values: Removing rows with missing values using the na.omit() function.
  clean_data <- na.omit(data)
- Imputing Missing Values: Filling in missing values with meaningful estimates, such as mean or median.
  data$variable <- ifelse(is.na(data$variable), mean(data$variable, na.rm = TRUE), data$variable)
- Using Imputation Packages: Employing specialized imputation packages like mice for more advanced missing data handling.
  library(mice)
  imputed_data <- mice(data)
  clean_data <- complete(imputed_data)
- Time Series Interpolation: Interpolating missing values in time series data using methods like linear interpolation.
  library(zoo)
  data$variable <- na.approx(data$variable, na.rm = FALSE)
- Missing Data Patterns: Identifying patterns in missing data to understand why data might be missing in certain cases (a small md.pattern() sketch appears after this list).
- Multiple Imputation: Creating multiple imputed datasets to account for uncertainty in imputed values.
  library(mice)
  imputed_data <- mice(data, m = 5)
- Conditional Imputation: Imputing missing values based on other variables’ values or using a predictive model.
  library(mice)
  imputed_data <- mice(data, method = c("pmm", "pmm", "rf"))  # one imputation method per column
- Imputing Categorical Data: Handling missing values in categorical variables using mode imputation.
  mode_level <- names(which.max(table(data$category)))
  data$category[is.na(data$category)] <- mode_level
- Missing Data Visualization: Creating visualizations to better understand the distribution and patterns of missing data.
  library(ggplot2)
  missing_counts <- data.frame(column = names(data), n_missing = colSums(is.na(data)))
  ggplot(missing_counts, aes(x = column, y = n_missing)) + geom_col()
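To make the missing-data-patterns idea concrete, the sketch below uses md.pattern() from the mice package (already used in the imputation examples); each row of its output is a distinct combination of observed and missing columns, together with a count of how often it occurs.
  library(mice)
  # Tabulate (and by default plot) the missingness patterns in the dataset
  md.pattern(data)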
Splitting and Combining Cells and Columns:
Splitting Cells and Columns:
- Splitting Text: Separating a column with concatenated values into multiple columns using functions like strsplit() or separate() from the tidyr package.
  split_data <- data %>% separate(column, into = c("part1", "part2"), sep = "_")
- Extracting Substrings: Extracting specific parts of a string using regular expressions or functions like substr() and str_sub() from the stringr package.
  library(stringr)
  data$substrings <- str_sub(data$column, start = 2, end = 5)
- Splitting Dates: Breaking down date columns into year, month, and day components using the lubridate package.
  library(lubridate)
  data <- data %>% mutate(year = year(date_column), month = month(date_column), day = day(date_column))
- Separating Factors: Splitting factors into separate columns using separate() or tidyr::extract() functions.
  split_data <- data %>% separate(factor_column, into = c("level1", "level2"), sep = "_")
- Extracting Numeric Values: Extracting numeric information from text using regular expressions and stringr functions.
  data$numbers <- as.numeric(str_extract(data$text_column, "\\d+"))
Combining Cells and Columns:
- Concatenating Columns: Combining columns into a single column using functions like paste() or unite() from the tidyr package.
  combined_data <- data %>% unite(combined_column, col1, col2, sep = "_")
- Combining Strings: Concatenating strings with separators using the paste() function.
  data$combined_text <- paste(data$first_name, data$last_name, sep = " ")
- Combining Factors: Merging factor levels or columns using factor() and paste() functions.
  data$combined_factor <- factor(paste(data$level1, data$level2, sep = "_"))
- Joining Text and Variables: Combining text and variable values to create informative labels.
  data$label <- paste("ID:", data$id)
- Combining Dates and Times: Creating a single datetime column by combining separate date and time columns.
  data$datetime <- as.POSIXct(paste(data$date_column, data$time_column))
Joining Data from Different Tables:
Joining Data Using SQL-like Joins:
- Inner Join: Combining two tables based on a common key, retaining only matching rows.
  merged_data <- inner_join(table1, table2, by = "common_key")
- Left Join: Keeping all rows from the left table and matching rows from the right table.
  merged_data <- left_join(table1, table2, by = "common_key")
- Right Join: Keeping all rows from the right table and matching rows from the left table.
  merged_data <- right_join(table1, table2, by = "common_key")
- Full Outer Join: Keeping all rows from both tables and filling in missing values with NAs.
  merged_data <- full_join(table1, table2, by = "common_key")
Joining Data Using Data Frame Manipulation:
- Merging Data Frames: Using the merge() function to join data frames based on common columns.
  merged_data <- merge(data_frame1, data_frame2, by = "common_column")
- Combining Columns: Appending a column from one data frame to another when the rows are already aligned in the same order.
  combined_data <- cbind(data_frame1, data_frame2$additional_column)
- Using dplyr’s Join Functions: Employing left_join(), right_join(), inner_join(), and full_join() from the dplyr package for enhanced control.
  library(dplyr)
  merged_data <- left_join(data_frame1, data_frame2, by = "common_column")
- Combining Data Using bind_rows(): Stacking data frames on top of each other.
  combined_data <- bind_rows(data_frame1, data_frame2)
Joining Data with Different Key Names:
- Specifying Different Key Names: Using the by.x and by.y arguments of merge() to specify different column names for joining (a dplyr equivalent is sketched after this list).
  merged_data <- merge(data_frame1, data_frame2, by.x = "key1", by.y = "key2")
- Renaming Columns Before Joining: Renaming columns in one data frame to match column names in another before joining.
  colnames(data_frame2)[colnames(data_frame2) == "new_key"] <- "common_key"
  merged_data <- merge(data_frame1, data_frame2, by = "common_key")
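If you prefer dplyr's join functions, the same join on differently named keys can be written with a named vector in the by argument; this is a minimal sketch reusing the key1/key2 columns from the merge() example above.
  library(dplyr)
  merged_data <- left_join(data_frame1, data_frame2, by = c("key1" = "key2"))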