What is data cleaning?
Data cleaning is defined as the process of manipulating data from its originally published format into a different format. Examples of data cleaning include changing the format of data values, expanding or abbreviating existing values, populating missing information, and creating calculations based on existing values.
Why is data cleaning important?
Data cleaning is the foundation of data analysis. It ensures information is accurate, consistent, and usable. Data cleaning enhances our ability to understand large amounts of information, provides clear descriptions of data components, and increases the accessibility of information.
Why was data cleaning important for this analysis in particular?
Data in its original form is often referred to as “raw” data. The raw survey data OPO received contained invalid values and missing information, and it was not formatted in a manner that allowed for a clean analysis.
What was the data cleaning process for this analysis?
OPO’s data cleaning process consisted of five major components:
- Consolidating information;
- Creating consistency within data values;
- Populating missing values;
- Creating new variables from existing information; and
- Removing invalid values.
OPO obtained both digital and hard copy survey data. To conduct a complete analysis, OPO combined the information from both sources into a single file. Once the information was in a single file, OPO reviewed each variable to ensure the values were formatted in a consistent manner. This included checking capitalization, punctuation, cell alignment, and syntax.
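The consolidation and consistency steps can be illustrated with a short sketch. This is not OPO's actual tooling, which the report does not describe; the function names, the `source` field, and the choice to standardize capitalization with `str.capitalize` are assumptions made for illustration only.

```python
def normalize_value(value):
    """Trim whitespace, collapse repeated spaces, and standardize capitalization.

    Blank entries are returned as None so they can be handled later
    as missing values.
    """
    if value is None:
        return None
    cleaned = " ".join(str(value).split())
    if cleaned == "":
        return None
    return cleaned.capitalize()


def consolidate(digital_rows, paper_rows):
    """Combine both survey sources into one list of consistently formatted records.

    Each row is a dict mapping a (hypothetical) question name to a response.
    The added "source" field records where the row came from.
    """
    combined = []
    for source, rows in (("digital", digital_rows), ("paper", paper_rows)):
        for row in rows:
            record = {key: normalize_value(val) for key, val in row.items()}
            record["source"] = source
            combined.append(record)
    return combined


if __name__ == "__main__":
    digital = [{"q1": " YES "}]
    paper = [{"q1": "prefer  not to answer"}, {"q1": "   "}]
    for record in consolidate(digital, paper):
        print(record)
```

Keeping a record of the source is a design choice in this sketch; it makes it possible to review paper and digital responses separately after they have been combined.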
Next, OPO populated missing data with appropriate values. This portion of the process varied based on the variable. In most cases, if information was missing, the variable value was populated with “Prefer not to answer.” In other cases where our survey incorporated skip logic, the most appropriate response for the follow-up question depended on the response to the initial question. For example, if the respondent selected “No” as their response to the initial question, the most appropriate value for missing information in the follow-up question would be “Not applicable.” If the respondent selected either “Yes” or “Prefer not to answer” as their response to the initial question, the most appropriate value for missing information in the follow-up question would be “Prefer not to answer.”
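The skip-logic rules described above can be expressed as a small function. The sketch below is illustrative only: the function names are hypothetical, and the handling of a missing initial response (filling it with “Prefer not to answer” first) is an assumption based on the general rule stated for most cases.

```python
PREFER_NOT = "Prefer not to answer"
NOT_APPLICABLE = "Not applicable"


def fill_initial(response):
    """Fill a missing initial-question response with the default value."""
    return response if response else PREFER_NOT


def fill_follow_up(initial, follow_up):
    """Fill a missing follow-up response based on the initial response.

    "No" on the initial question means the follow-up did not apply.
    "Yes" or "Prefer not to answer" means the respondent skipped it.
    An existing follow-up response is always kept as-is.
    """
    if follow_up:
        return follow_up
    if fill_initial(initial) == "No":
        return NOT_APPLICABLE
    return PREFER_NOT


if __name__ == "__main__":
    for initial in ("No", "Yes", PREFER_NOT, None):
        print(repr(initial), "->", fill_follow_up(initial, None))
```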
Populating missing data is important because it prevents calculation errors produced by null values and equalizes the totals. Equal totals across all areas of the analysis are necessary for making accurate comparisons across values within the dataset. For example, if we were to analyze responses by a demographic variable and some responses were more populated than others, this would skew our calculations.
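A small worked example may help show why the totals matter. The responses below are invented for illustration and are not taken from the survey; the function name and its `fill_value` parameter are likewise hypothetical.

```python
from collections import Counter


def response_shares(responses, fill_value=None):
    """Return (share of each answer, total number of responses counted).

    If fill_value is given, missing responses (None) are populated with it.
    Otherwise they are dropped, which shrinks the total.
    """
    if fill_value is not None:
        counted = [r if r is not None else fill_value for r in responses]
    else:
        counted = [r for r in responses if r is not None]
    counts = Counter(counted)
    total = sum(counts.values())
    if total == 0:
        return {}, 0
    return {answer: count / total for answer, count in counts.items()}, total


if __name__ == "__main__":
    answers = ["Yes", "Yes", "No", None, None]  # invented data
    print(response_shares(answers))
    print(response_shares(answers, fill_value="Prefer not to answer"))
```

With the missing responses dropped, the total is 3 and “Yes” appears to be two thirds of the answers. With them populated, the total is 5, matching the number of surveys, and “Yes” is two fifths.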
After populating the missing data, OPO created new variables based on existing data. For example, OPO used parameters defined in U.S. Census data to create a variable that grouped ages within a certain range into categories.
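Grouping ages into categories might look like the sketch below. The brackets shown are placeholders chosen for illustration; the report does not list the specific U.S. Census parameters OPO used, so these ranges and labels should not be read as OPO's actual categories.

```python
# Hypothetical brackets for illustration; not the ranges OPO actually used.
AGE_BRACKETS = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 64)]


def age_group(age):
    """Map a numeric age to a category label.

    A missing age (None) is labeled "Prefer not to answer", following the
    general rule for missing values described earlier.
    """
    if age is None:
        return "Prefer not to answer"
    if age < 18:
        return "Under 18"
    for low, high in AGE_BRACKETS:
        if low <= age <= high:
            return f"{low} to {high}"
    return "65 and over"


if __name__ == "__main__":
    for age in (17, 18, 30, 64, 65, None):
        print(age, "->", age_group(age))
```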
Finally, each variable was assessed to determine the validity of its values. Invalid values were removed. For example, one survey respondent indicated an age of “0.” OPO determined that this value was invalid. As a result, the value was removed and treated as a missing value.
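The validity check for age could be sketched as follows. The report states only that an age of “0” was treated as invalid; the accepted range of 1 to 120 and the function name are assumptions added for illustration.

```python
def clean_age(raw, minimum=1, maximum=120):
    """Return the age as an integer, or None if the value is invalid.

    Returning None treats the invalid entry as a missing value. The
    default bounds are hypothetical, not OPO's documented rule.
    """
    try:
        age = int(str(raw).strip())
    except ValueError:
        return None  # not a whole number, e.g. "", "abc", or None
    if minimum <= age <= maximum:
        return age
    return None


if __name__ == "__main__":
    for raw in ("0", " 42 ", "abc", None):
        print(repr(raw), "->", clean_age(raw))
```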
Conducting the data cleaning process made the survey data easier to understand and analyze, and it allowed OPO to extract valuable insights.
What are the challenges and limitations of the data?
Utilizing survey data can be challenging and, like all data, it has certain limitations. The most prominent challenge and limitation was missing information. Respondents were not required to complete each question, so many surveys (both paper and digital) were incomplete.[1]
Other data challenges
Other data challenges included multiple responses and open-ended responses.
- Certain questions within the survey (e.g., age, ZIP code, and social class) required a single response. However, some respondents provided multiple responses to these questions.
- An additional challenge unique to the paper surveys was open-ended responses. Some responses were unreadable, nonresponsive, or not relevant to the question.
Other data limitations included the inability to identify duplicates and the inability to determine more precise residential and employment locations.
- The raw data did not provide a true unique identifier, meaning there was no unique value that was connected to a specific respondent. The absence of a unique identifier eliminated the ability to identify duplicate survey submissions. Therefore, if a respondent completed both a digital survey and a paper survey, or completed multiple paper surveys or multiple digital surveys, those cases cannot be identified from the information available.
- Although OPO did request ZIP code information, this did not provide the level of detail needed to make meaningful correlations between location and other variables within the dataset. The ZIP code data did not indicate where a respondent worked or resided within a ZIP code area.
Despite these challenges and limitations, not all variables were affected. The core questions related to APD’s body-worn camera and dashboard camera policies were the most complete component of the dataset and provided valuable information.