PHD Discussions

Ask, Learn and Accelerate in your PhD Research

What specific criteria will you use for the initial "cleaning" of your raw data? How will you distinguish between a true data entry error and a valid but unusual response?

In my mixed-methods research, the first analytical hurdle is the raw dataset. I'm grappling with establishing a transparent, defensible protocol. I need to move beyond textbook definitions to a scholar's practical checklist: what specific, sequential filters do you apply to "clean" data, and what philosophical or statistical rule separates a mistake from a meaningful anomaly?

All Answers (1 Answer in All)

By Farah, answered 4 months ago

From my experience managing large-scale research datasets, I always start with a pre-registered, rule-based protocol. I would recommend first running automated range and logic checks: values outside possible parameters are flagged as errors. For potentially valid but unusual responses, the criteria shift to context and source. I have seen "anomalies" become key findings, so I cross-reference them with other data points (e.g., interview notes) and apply domain knowledge. If a value is plausible and consistent with the participant's other data, it is likely valid, not an error. This dual-track approach maintains rigor without prematurely silencing the data.
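The dual-track screening described above can be sketched in code. This is a minimal illustration, not Farah's actual pipeline: the field names and thresholds are hypothetical, and the key design point is that "hard" limits (physically impossible values) are flagged as entry errors, while "soft" limits only mark a value for cross-referencing rather than deleting it.

```python
# Illustrative sketch of a pre-registered, rule-based screening pass.
# Field names and all thresholds below are hypothetical assumptions.

ERROR = "error"              # outside any possible parameter -> data entry error
REVIEW = "flag_for_review"   # possible but atypical -> cross-check, do not delete
VALID = "valid"

# Hard limits: values outside these cannot occur, so they are errors.
HARD_RANGES = {"age": (18, 100), "hours_sleep": (0, 24)}
# Soft limits: values outside these are unusual but possible.
SOFT_RANGES = {"age": (22, 65), "hours_sleep": (4, 12)}

def screen(record):
    """Return a {field: status} dict for one participant record."""
    result = {}
    for field, value in record.items():
        hard_lo, hard_hi = HARD_RANGES[field]
        soft_lo, soft_hi = SOFT_RANGES[field]
        if not (hard_lo <= value <= hard_hi):
            result[field] = ERROR      # impossible -> entry error
        elif not (soft_lo <= value <= soft_hi):
            result[field] = REVIEW     # keep, annotate, cross-reference
        else:
            result[field] = VALID
    return result

print(screen({"age": 130, "hours_sleep": 5}))
# age exceeds the hard ceiling (error); hours_sleep is in range (valid)
```

Flagged-for-review values would then go through the manual step Farah describes: checking them against the participant's other data and interview notes before any decision to exclude.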
