# Introduction to Data Cleaning

## Exploring Data through Descriptive Statistics

One of the key things you will always do before conducting any type of statistical test is to explore your data through descriptive statistics. You will remember this term from the Introduction to Statistics module, where you were shown how to produce frequency and MCT tables and why these tables are so useful.

Descriptive statistics is the term given to analysis that helps to describe, show or summarise data in a meaningful way, so that patterns can emerge from the data and be easily spotted. Raw data on its own is hard to interpret, but descriptive statistics allow for a simple interpretation.

The Introduction to Statistics module demonstrated how to produce these statistics. This sprint will show you how to write up your findings and compare them to the population, judging whether they are representative or not.

When reporting results from frequency tables, it’s important to be specific about the information included: with too little information the summary is meaningless, and with too much it becomes boring. Descriptive statistics summaries must be short and simple whilst highlighting the necessary, interesting information.

When writing descriptive summaries:

• Always report the valid percent
• Include the ‘n’ that the percentage represents, i.e. 63% (n=2030)
• Report the research by writing in the third person
• Highlight any areas that are interesting or data that is surprising

Looking at this frequency table and bar chart, what quickly comes to mind? Are there equal numbers of males and females? What does this mean for our data? These are the types of questions you must ask yourself when producing descriptive statistics. The aim is never to just produce a frequency table and call it a day, but to use these statistics to further inform our understanding of our data and how this will affect future analysis.

When describing this screenshot, one might write: ‘Table 1 shows respondents’ genders. 52% (n=545) were female, while 48% (n=508) were male. There were more females than males, which could potentially affect the data; however, this is unlikely as the difference is only small.’
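The percentages in the Table 1 summary can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the course materials (which use R Studio). The function names and the `missing=47` count are hypothetical, chosen only to show why the valid percent, which excludes missing responses, differs from the overall percent. The counts 545 and 508 are the ones quoted in the summary.

```python
def frequency_table(counts, missing=0):
    """Build frequency-table rows from category counts.

    counts  -- dict mapping category name to number of valid responses
    missing -- number of missing responses (hypothetical in this example)
    """
    valid_total = sum(counts.values())
    if valid_total == 0:
        raise ValueError("no valid responses to summarise")
    overall_total = valid_total + missing
    rows = []
    for category, n in counts.items():
        rows.append({
            "category": category,
            "n": n,
            # Percent of all cases, including missing responses
            "percent": 100 * n / overall_total,
            # Valid percent: missing responses are excluded from the base
            "valid_percent": 100 * n / valid_total,
        })
    return rows


def format_result(row):
    """Format a row as 'valid percent (n=count)', e.g. '52% (n=545)'."""
    return f"{row['valid_percent']:.0f}% (n={row['n']})"


# Counts taken from the Table 1 summary; the missing count is an assumption.
gender_counts = {"female": 545, "male": 508}
for row in frequency_table(gender_counts, missing=47):
    print(row["category"], format_result(row))
```

With no missing responses the percent and valid percent columns are identical. Once some responses are missing, only the valid percent still sums to 100 across the categories, which is one reason to report it.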

Another example:

“Table 2 highlights that within our sample 86% (n=909) weren’t married or cohabiting and only 14% (n=144) were married or cohabiting. As the Office for National Statistics (2017) shows, there has been a gradual decline in the number of marriages of opposite-sex couples since 1972, largely thought to be due to the decline in religious marriages, and same-sex couple marriages don’t appear to be on the rise either. Although this might give insight into our data, the figures from our sample still appear very unrepresentative of the population of Great Britain. ONS found the average age of marriage for opposite-sex couples was 38 years for men and 36.7 years for women in 2017, whereas the average age of our sample was 20 years old. This is a possible explanation for such unrepresentative data, which could skew further analysis.”

Again, this descriptive summary reports the percentages and n values from the frequency table and then takes the analysis a step further by citing supporting references to judge whether the data is representative of the wider UK population.

Using the frequency tables below, produce your own descriptive summaries.

Note: you can use public statistics such as the 2011 UK Census data to inform your analysis

OPTIONAL

## Data Cleaning

Data cleaning involves a series of steps that allow us to ‘clean’ our data of any errors. Please refer back to the Introduction to Statistics intensive for the rationale behind data cleaning.

The next sprint will briefly recap what you have learned about data cleaning and introduce you to a possibly new concept: data recoding.

For the duration of this intensive, we will be using a cut-down version of the Opinion and Lifestyle Survey: Well-Being Module, 2015 dataset. This is the same dataset that was used during the Introduction to Statistics module, so it should already be saved on your computer and in RStudio.