Data cleaning is a crucial step in the data analysis process that involves identifying and rectifying errors, inconsistencies, and inaccuracies in a dataset. As a data analyst, your goal during data cleaning is to prepare a high-quality, reliable, and well-structured dataset for further analysis. Clean data is essential for obtaining accurate insights and making informed decisions. Here are definitions and code snippets to help you clean data in R.
Cleaning Data in R:
- Data Cleaning: The process of identifying and correcting errors or inconsistencies in a dataset, including handling missing values, removing duplicates, and dealing with outliers.
- Removing Duplicates: Removing duplicate rows from a dataset to ensure each observation is unique.
  library(dplyr)
  clean_data <- data %>% distinct()
- Handling Outliers: Managing data points that significantly deviate from the rest of the data, often by removing, capping, or transforming them.
  clean_data <- data %>% filter(variable < upper_threshold)
- Data Transformation: Modifying data to make it suitable for analysis, e.g., converting data types or scaling numeric variables.
  clean_data <- data %>% mutate(new_variable = log(old_variable))
- Missing Data Imputation: Filling in missing values with estimated or imputed values to maintain data integrity.
  clean_data <- data %>% mutate(variable = ifelse(is.na(variable), mean(variable, na.rm = TRUE), variable))
- Data Validation: Checking data for accuracy and consistency according to predefined rules or constraints (a minimal validation sketch appears after this list).
- Standardization: Scaling variables to have a mean of 0 and a standard deviation of 1.
  library(caret)
  preprocess_params <- preProcess(data, method = c("center", "scale"))
  clean_data <- predict(preprocess_params, data)
- Data Profiling: Summarizing and exploring data to identify potential issues or patterns (a quick profiling sketch appears after this list).
- String Cleaning: Removing extra spaces, special characters, or formatting issues from string variables.
  clean_data <- data %>% mutate(name = gsub("\\s+", " ", name))
- Data Type Conversion: Changing the data type of a variable to match the intended analysis.
  clean_data <- data %>% mutate(age = as.numeric(age))
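As a rough illustration of rule-based validation, the sketch below flags rows that break two hypothetical rules (ages outside 0–120 and negative prices); the column names and thresholds are assumptions, not part of the examples above.
  library(dplyr)
  # Collect rows that violate the validation rules (hypothetical columns: age, price)
  invalid_rows <- data %>% filter(age < 0 | age > 120 | price < 0)
  # Stop early if any rule is violated
  stopifnot(nrow(invalid_rows) == 0)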
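For data profiling, a minimal sketch using base R is shown below; str() and summary() are standard functions, and dedicated profiling packages can go further.
  # Quick structural and statistical overview of the dataset
  str(data)      # column types and example values
  summary(data)  # per-column summaries, including NA counts for numeric columns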
Selecting Groups of Observations and Creating New Calculated Fields:
- Grouping and Aggregating: Grouping observations based on one or more variables and calculating summary statistics for each group.
  grouped_data <- data %>% group_by(category) %>% summarise(mean_value = mean(variable))
- Creating Calculated Fields: Adding new variables to a dataset based on calculations from existing variables.
  data <- data %>% mutate(total_sales = quantity * price)
- Conditional Selection: Selecting observations that meet specific conditions.
  selected_data <- data %>% filter(variable > threshold)
- Ranking and Ordering: Assigning ranks or sorting observations based on a variable’s values.
  ranked_data <- data %>% arrange(desc(revenue)) %>% mutate(rank = row_number())
- Case When Statements: Creating conditional expressions to derive values in new fields.
  data <- data %>% mutate(category_group = case_when(category %in% c("A", "B") ~ "AB Group", TRUE ~ "Other"))
- Top N Selection: Selecting the top N observations based on a specific variable.
  top_sales <- data %>% arrange(desc(sales)) %>% slice_head(n = 10)
- Distinct Values: Extracting unique values from a variable for further analysis.
  unique_categories <- data %>% distinct(category)
- Sampling Data: Randomly selecting a subset of observations for analysis.
  sample_data <- data %>% sample_n(size = 100)
- Window Functions: Calculating values over specific windows of data, like moving averages or cumulative sums.
  library(dplyr)
  data <- data %>% arrange(date) %>% mutate(rolling_avg = zoo::rollmean(variable, k = 5, fill = NA))
- Conditional Mutation: Modifying variable values based on conditions.
  data <- data %>% mutate(adjusted_price = ifelse(quantity > 10, price * 0.9, price))
Pivoting Data in R: Wide and Long Format:
Pivoting Data to Wide Format:
- Spread Function: Reshaping data from long to wide format using the spread function from the tidyr package.
  library(tidyr)
  wide_data <- data %>% spread(key = variable, value = value)
- Reshaping with Pivot Wider: Using the pivot_wider function to convert long data to wide format.
  library(tidyr)
  wide_data <- data %>% pivot_wider(names_from = variable, values_from = value)
- Aggregation during Pivoting: Aggregating data during the pivot process using functions like mean, sum, etc.
  wide_summary <- data %>% pivot_wider(names_from = variable, values_from = value, values_fn = list(value = mean))
- Dealing with Missing Values: Specifying how to handle missing values during pivoting using the values_fill argument.
  wide_filled <- data %>% pivot_wider(names_from = variable, values_from = value, values_fill = 0)
- Multi-level Column Headers: Building wide column names from more than one variable for better organization in wide data.
  wide_data <- data %>% pivot_wider(names_from = c(category, year), values_from = value)
Pivoting Data to Long Format:
- Gather Function: Converting wide data to long format using the gather function from the tidyr package.
  library(tidyr)
  long_data <- wide_data %>% gather(key = variable, value = value, -id)
- Reshaping with Pivot Longer: Using the pivot_longer function to transform wide data into long format.
  library(tidyr)
  long_data <- wide_data %>% pivot_longer(cols = starts_with("variable"), names_to = "variable", values_to = "value")
- Handling Multiple Variables: Pivoting multiple columns to long format using pivot_longer.
  long_data <- wide_data %>% pivot_longer(cols = c(variable1, variable2), names_to = "variable", values_to = "value")
- Extracting Information from Column Headers: Creating new variables by extracting information from column headers during pivoting.
  long_data <- wide_data %>% pivot_longer(cols = starts_with("Q"), names_to = "quarter", values_to = "value")
- Reshaping Time Series Data: Converting wide time series data into a long format suitable for time-based analysis.
  long_time_series <- wide_data %>% pivot_longer(cols = starts_with("month"), names_to = "month", values_to = "value")
Handling Missing Values:
- Identifying Missing Values: Checking for missing values in a dataset using functions like is.na() or complete.cases().
  missing_rows <- data[!complete.cases(data), ]
- Removing Missing Values: Removing rows with missing values using the na.omit() function.
  clean_data <- na.omit(data)
- Imputing Missing Values: Filling in missing values with meaningful estimates, such as mean or median.
  data$variable <- ifelse(is.na(data$variable), mean(data$variable, na.rm = TRUE), data$variable)
- Using Imputation Packages: Employing specialized imputation packages like mice for more advanced missing data handling.
  library(mice)
  imputed_data <- mice(data)
  clean_data <- complete(imputed_data)
- Time Series Interpolation: Interpolating missing values in time series data using methods like linear interpolation.
  library(zoo)
  data$variable <- na.approx(data$variable, na.rm = FALSE)
- Missing Data Patterns: Identifying patterns in missing data to understand why data might be missing in certain cases (a small md.pattern() sketch appears after this list).
- Multiple Imputation: Creating multiple imputed datasets to account for uncertainty in imputed values.
  library(mice)
  imputed_data <- mice(data, m = 5)
- Conditional Imputation: Imputing missing values based on other variables’ values or using a predictive model.
  library(mice)
  imputed_data <- mice(data, method = c("pmm", "pmm", "rf"))  # one imputation method per column
- Imputing Categorical Data: Handling missing values in categorical variables using mode imputation.
  mode_level <- names(which.max(table(data$category)))
  data$category[is.na(data$category)] <- mode_level
- Missing Data Visualization: Creating visualizations to better understand the distribution and patterns of missing data.
  library(ggplot2)
  missing_counts <- data.frame(column = names(data), n_missing = colSums(is.na(data)))
  ggplot(missing_counts, aes(x = column, y = n_missing)) + geom_col()
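To make the missing-data-patterns idea concrete, the sketch below uses md.pattern() from the mice package (already used in the imputation examples); each row of its output is a distinct combination of observed and missing columns, together with a count of how often it occurs.
  library(mice)
  # Tabulate (and by default plot) the missingness patterns in the dataset
  md.pattern(data)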
Splitting and Combining Cells and Columns:
Splitting Cells and Columns:
- Splitting Text: Separating a column with concatenated values into multiple columns using functions like strsplit() or separate() from the tidyr package.
  split_data <- data %>% separate(column, into = c("part1", "part2"), sep = "_")
- Extracting Substrings: Extracting specific parts of a string using regular expressions or functions like substr() and str_sub() from the stringr package.
  library(stringr)
  data$substrings <- str_sub(data$column, start = 2, end = 5)
- Splitting Dates: Breaking down date columns into year, month, and day components using the lubridate package.
  library(lubridate)
  data <- data %>% mutate(year = year(date_column), month = month(date_column), day = day(date_column))
- Separating Factors: Splitting factors into separate columns using separate() or tidyr::extract() functions.
  split_data <- data %>% separate(factor_column, into = c("level1", "level2"), sep = "_")
- Extracting Numeric Values: Extracting numeric information from text using regular expressions and stringr functions.
  data$numbers <- as.numeric(str_extract(data$text_column, "\\d+"))
Combining Cells and Columns:
- Concatenating Columns: Combining columns into a single column using functions like paste() or unite() from the tidyr package.
  combined_data <- data %>% unite(combined_column, col1, col2, sep = "_")
- Combining Strings: Concatenating strings with separators using the paste() function.
  data$combined_text <- paste(data$first_name, data$last_name, sep = " ")
- Combining Factors: Merging factor levels or columns using factor() and paste() functions.
  data$combined_factor <- factor(paste(data$level1, data$level2, sep = "_"))
- Joining Text and Variables: Combining text and variable values to create informative labels.
  data$label <- paste("ID:", data$id)
- Combining Dates and Times: Creating a single datetime column by combining separate date and time columns.
  data$datetime <- as.POSIXct(paste(data$date_column, data$time_column))
Joining Data from Different Tables:
Joining Data Using SQL-like Joins:
- Inner Join: Combining two tables based on a common key, retaining only matching rows.
  merged_data <- inner_join(table1, table2, by = "common_key")
- Left Join: Keeping all rows from the left table and matching rows from the right table.
  merged_data <- left_join(table1, table2, by = "common_key")
- Right Join: Keeping all rows from the right table and matching rows from the left table.
  merged_data <- right_join(table1, table2, by = "common_key")
- Full Outer Join: Keeping all rows from both tables and filling in missing values with NAs.
  merged_data <- full_join(table1, table2, by = "common_key")
Joining Data Using Data Frame Manipulation:
- Merging Data Frames: Using the merge() function to join data frames based on common columns.
  merged_data <- merge(data_frame1, data_frame2, by = "common_column")
- Combining Columns: Appending a column from one data frame to another when the rows are already aligned in the same order.
  combined_data <- cbind(data_frame1, data_frame2$additional_column)
- Using dplyr’s Join Functions: Employing left_join(), right_join(), inner_join(), and full_join() from the dplyr package for enhanced control.
  library(dplyr)
  merged_data <- left_join(data_frame1, data_frame2, by = "common_column")
- Combining Data Using bind_rows(): Stacking data frames on top of each other.
  combined_data <- bind_rows(data_frame1, data_frame2)
Joining Data with Different Key Names:
- Specifying Different Key Names: Using the by.x and by.y arguments of merge() to specify different column names for joining (a dplyr equivalent is sketched after this list).
  merged_data <- merge(data_frame1, data_frame2, by.x = "key1", by.y = "key2")
- Renaming Columns Before Joining: Renaming columns in one data frame to match column names in another before joining.
  colnames(data_frame2)[colnames(data_frame2) == "new_key"] <- "common_key"
  merged_data <- merge(data_frame1, data_frame2, by = "common_key")
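If you prefer dplyr's join functions, the same join on differently named keys can be written with a named vector in the by argument; this is a minimal sketch reusing the key1/key2 columns from the merge() example above.
  library(dplyr)
  merged_data <- left_join(data_frame1, data_frame2, by = c("key1" = "key2"))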