Extracting and Cleaning Data Using Python

Overview

Extracting and cleaning data are two components of the data wrangling process, which consists of gathering, extracting, cleaning, and storing data. Extracting data is the process of drawing out only the data relevant to the fundamental question or questions that an analysis is trying to answer. Data cleaning involves removing data that may distort the true behavior of the data. Such problems include missing or deleted data, unexpected characters (commas, semicolons, numbers, etc.), outliers, unexpected values, and inconsistent formats (for example, US versus European conventions). In this lab, we will work with the kddcup.data.corrected dataset and prepare it for analysis. First, we will use Python to separate the data based on its classification (normal or abnormal). Then we will use Python to clean the data by removing the flow labels and punctuation marks that may cause problems with our model. Last, we will import the data into Pandas and explore how to structure, shape, and clean the data using a statistical Python library.
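The first two steps can be sketched in plain Python. This is a minimal illustration, not the lab's actual solution: it assumes each record is a comma-separated line whose last field is the label followed by a period (for example, "normal."). The sample records are shortened, hypothetical stand-ins for real rows, and the name split_and_clean is made up for this example.

```python
# Hypothetical, shortened sample records in the assumed format:
# comma-separated features followed by a label that ends with a period.
SAMPLE_LINES = [
    "0,tcp,http,SF,181,5450,normal.",
    "0,icmp,ecr_i,SF,1032,0,smurf.",
    "0,tcp,private,S0,0,0,neptune.",
    "",  # blank line, skipped below
]


def split_and_clean(lines):
    """Separate records into normal and abnormal, dropping the label."""
    normal, abnormal = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off the last field, which is assumed to be the label.
        features, _, label = line.rpartition(",")
        label = label.rstrip(".")  # remove the trailing period
        if label == "normal":
            normal.append(features)
        else:
            abnormal.append(features)
    return normal, abnormal


normal_rows, abnormal_rows = split_and_clean(SAMPLE_LINES)
print(len(normal_rows), "normal,", len(abnormal_rows), "abnormal")
```

For the full dataset, the same function could be given an open file object instead of a list, since iterating over a file yields one line at a time.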

Outcomes

In this lab, you will:

  1. Understand how Python can be used to extract data.
  2. Understand how Python can be used to clean data.
  3. Understand how you can use a Python library (Pandas) to shape and clean data.

Key terms and descriptions

data wrangling
The data wrangling process consists of gathering, extracting, cleaning, and storing data.
data set
A data set is a collection of data.
Pandas
Pandas is a Python library that allows us to perform vectorized operations on data.
data cleaning
Data cleaning is the process of removing data that may have a negative impact on the true data's behavior.
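To make the last two terms concrete, here is a small sketch of cleaning with Pandas followed by one vectorized operation. The toy table and its column names (duration, protocol, src_bytes, large_transfer) are hypothetical and chosen only for illustration, and the threshold of 200 is arbitrary. Dropping rows is just one possible way to handle missing or duplicated data.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "duration": [0, 0, 2, None, 0],
    "protocol": ["tcp", "tcp", "udp", "tcp", "icmp"],
    "src_bytes": [181, 181, 105, 239, 1032],
})

cleaned = (
    df.dropna()            # drop rows that contain missing values
      .drop_duplicates()   # drop rows that exactly repeat an earlier row
      .reset_index(drop=True)
)

# Vectorized operation: the comparison is applied to the whole column
# at once, with no explicit Python loop over the rows.
cleaned["large_transfer"] = cleaned["src_bytes"] > 200
print(cleaned)
```

The original df is left unchanged, so the raw and cleaned versions can be compared side by side.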