Data analysis using R
Last Updated: 09 Dec, 2022
Data analysis is a subset of data analytics. It is a process in which you make the objective clear, collect the relevant data, preprocess it, analyze it (understand the data and explore insights), and then visualize the results. The final visualization step matters because it helps people understand what is happening in the firm.
Steps involved in data analysis:
The process of data analysis includes all of these steps for a given problem statement. Example: analyze which products sell out quickly and identify the frequent customers of a retail shop.
- Defining the problem statement: understand the goal and what needs to be done. In this case, the problem statement is “Find the products that sell out most often and list the customers who visit the store frequently.”
- Collection of data: not all of the company’s data is necessary; identify the data relevant to the problem. Here the required columns are product ID, customer ID, and date visited.
- Preprocessing: the data must be cleaned and put into a structured format before analysis. This includes:
  - Removing outliers (noisy data).
  - Removing null or irrelevant values in the columns.
  - Handling missing data: either ignore the tuple or fill the gap with the mean value of the column.
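The two options for missing data mentioned above (ignore the tuple, or fill with the column mean) can be sketched in base R. The data frame `visits` and its values are hypothetical, made up only for illustration; they do not come from any dataset used in this article.

```r
# Hypothetical retail data (illustrative values only, not real data)
visits <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  amount      = c(20, NA, 35, NA, 50)
)

# Option 1: ignore (drop) the rows that have a missing amount
visits_dropped <- visits[!is.na(visits$amount), ]

# Option 2: fill missing amounts with the mean of the observed values
amount_mean <- mean(visits$amount, na.rm = TRUE)
visits_filled <- visits
visits_filled$amount[is.na(visits_filled$amount)] <- amount_mean

print(visits_dropped)
print(visits_filled)
```

Dropping rows loses information, while mean filling keeps every row but shrinks the column's variance, so the better choice depends on how many values are missing.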
Data Analysis using the Titanic dataset
You can download the Titanic dataset (it contains data from real passengers of the Titanic) from here. Save the dataset in the current working directory; now we can start the analysis by getting to know our data.
R
# Load the CSV from the current working directory and preview the first rows
titanic <- read.csv("train.csv")
head(titanic)
|
Output:
PassengerId Survived Pclass Name Sex
1 892 0 3 Kelly, Mr. James male
2 893 1 3 Wilkes, Mrs. James (Ellen Needs) female
3 894 0 2 Myles, Mr. Thomas Francis male
4 895 0 3 Wirz, Mr. Albert male
5 896 1 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
6 897 0 3 Svensson, Mr. Johan Cervin male
Age SibSp Parch Ticket Fare Cabin Embarked
1 34.5 0 0 330911 7.8292 Q
2 47.0 1 0 363272 7.0000 S
3 62.0 0 0 240276 9.6875 Q
4 27.0 0 0 315154 8.6625 S
5 22.0 1 1 3101298 12.2875 S
6 14.0 0 0 7538 9.2250 S
The dataset contains columns such as the passenger's name, age, and sex, the class they traveled in, and whether they survived. To see the class (data type) of each column, use sapply(), for example sapply(titanic, class).
Output:
PassengerId Survived Pclass Name Sex Age
"integer" "integer" "integer" "character" "character" "numeric"
SibSp Parch Ticket Fare Cabin Embarked
"integer" "integer" "character" "numeric" "character" "character"
The Survived column stores 0 for dead and 1 for alive. Because these are categories rather than quantities, we convert Survived (and Sex) to factors using the as.factor() function.
R
train <- titanic  # the snippets below refer to the data frame as train
train$Survived <- as.factor(train$Survived)
train$Sex <- as.factor(train$Sex)
sapply(train, class)
|
Output:
PassengerId Survived Pclass Name Sex Age
"integer" "factor" "integer" "character" "factor" "numeric"
SibSp Parch Ticket Fare Cabin Embarked
"integer" "integer" "character" "numeric" "character" "character"
To analyze the data, we look at a summary of all the columns, their values, and their data types. summary() can be used for this purpose, for example summary(train).
Output:
PassengerId Survived Pclass Name Sex
Min. : 892.0 0:266 Min. :1.000 Length:418 female:152
1st Qu.: 996.2 1:152 1st Qu.:1.000 Class :character male :266
Median :1100.5 Median :3.000 Mode :character
Mean :1100.5 Mean :2.266
3rd Qu.:1204.8 3rd Qu.:3.000
Max. :1309.0 Max. :3.000
Age SibSp Parch Ticket
Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
Median :27.00 Median :0.0000 Median :0.0000 Mode :character
Mean :30.27 Mean :0.4474 Mean :0.3923
3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :76.00 Max. :8.0000 Max. :9.0000
NA's :86
Fare Cabin Embarked
Min. : 0.000 Length:418 Length:418
1st Qu.: 7.896 Class :character Class :character
Median : 14.454 Mode :character Mode :character
Mean : 35.627
3rd Qu.: 31.500
Max. :512.329
NA's :1
From the above summary we can extract the following observations:
- Total passengers in the dataset: 418
- Number of people who survived: 152
- Number of people who died: 266
- Number of males: 266
- Number of females: 152
- Maximum age among all passengers: 76
- Median age: 27
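Figures like these can also be computed directly instead of being read off the summary() printout. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the dataset) and only base R functions.

```r
# Made-up stand-in for the passenger data (illustrative values only)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 0),
  Sex      = c("male", "female", "female", "male", "male"),
  Age      = c(34.5, 47, NA, 27, 22)
)

total_passengers <- nrow(toy)                 # number of rows
survival_counts  <- table(toy$Survived)       # counts per outcome (0/1)
sex_counts       <- table(toy$Sex)            # counts per sex
max_age          <- max(toy$Age, na.rm = TRUE)     # ignore missing ages
median_age       <- median(toy$Age, na.rm = TRUE)  # ignore missing ages

print(total_passengers)
print(survival_counts)
print(sex_counts)
print(c(max_age = max_age, median_age = median_age))
```

Note the na.rm = TRUE arguments: without them, max() and median() return NA as soon as the column contains a missing value.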
Preprocessing the data is important before analysis, so null values have to be checked for and removed. The total number of missing values can be counted, for example with sum(is.na(train)).
Output:
177
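A single total does not show which columns the missing values come from, or how many rows they affect. A minimal sketch of both checks, using a made-up data frame (`df` and its values are assumptions for illustration):

```r
# Made-up data with missing values (illustrative only)
df <- data.frame(
  Age  = c(22, NA, 35, NA),
  Fare = c(7.25, NA, NA, 8.05),
  Sex  = c("male", "female", "female", "male")
)

total_na      <- sum(is.na(df))            # missing cells overall
na_per_column <- colSums(is.na(df))        # missing cells per column
rows_with_na  <- sum(!complete.cases(df))  # rows with at least one NA

print(total_na)
print(na_per_column)
print(rows_with_na)
```

In this toy data the number of missing cells (4) differs from the number of affected rows (3), because one row is missing two values. The two counts only coincide when no row has more than one missing value.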
R
# Keep only the rows that have no missing value in any column
dropnull_train <- train[rowSums(is.na(train)) == 0, ]
|
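- dropnull_train contains only the rows without null values. In the run reported here that left 631 rows (808 total rows minus 177 rows with null values); check the count for your copy of the data with nrow(dropnull_train).
- Now we will split the remaining rows into two separate data frames: passengers who survived and passengers who died.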
R
# Split the complete rows by outcome: 1 = survived, 0 = died
survivedlist <- dropnull_train[dropnull_train$Survived == 1, ]
notsurvivedlist <- dropnull_train[dropnull_train$Survived == 0, ]
|
Now we can visualize how many passengers, male and female, died and survived using pie charts, histograms, and bar plots.
R
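# Count passengers per Survived value (0 = died, 1 = survived)
mytable <- table(titanic$Survived)
# Label each slice with its value and its count
lbls <- paste(names(mytable), "\n", mytable, sep = "")
pie(mytable,
    labels = lbls,
    main = "Pie Chart of Survived column data\n (with sample sizes)")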
|
Output:
From the above pie chart, we can see that the target (Survived) column is imbalanced: more passengers died than survived.
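The imbalance can also be expressed as proportions rather than read off the chart. A sketch with prop.table() on a made-up vector of outcomes (the values are illustrative, not taken from the dataset):

```r
# Made-up outcomes: 0 = died, 1 = survived (illustrative only)
outcomes <- c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0)

counts <- table(outcomes)       # absolute counts per class
shares <- prop.table(counts)    # each class as a fraction of the total

print(counts)
print(round(shares, 2))
```

Applied to the real column, the same two calls would show how far the split is from 50/50.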
R
# Age distribution of the passengers who survived
hist(survivedlist$Age,
     xlab = "age",
     ylab = "frequency")
|
Output:
Now let’s draw a bar plot to visualize the number of males and females who did not survive.
R
barplot(table(notsurvivedlist$Sex),
        xlab = "gender",
        ylab = "frequency")
|
Output:
From the bar plot above, we can see that nearly 350 males and 50 females did not survive the Titanic.
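Exact counts are easier to read from a contingency table than from bar heights. The sketch below cross-tabulates sex against outcome on a made-up data frame (`passengers` and its values are assumptions for illustration):

```r
# Made-up passengers (illustrative only): 0 = died, 1 = survived
passengers <- data.frame(
  Sex      = c("male", "female", "male", "male", "female", "male"),
  Survived = c(0, 1, 0, 1, 1, 0)
)

# Rows are sexes, columns are outcomes
counts <- table(passengers$Sex, passengers$Survived,
                dnn = c("Sex", "Survived"))
print(counts)

# The column for outcome 0 holds the non-survivors per sex,
# which is what the bar plot of non-survivors displays
not_survived_by_sex <- counts[, "0"]
print(not_survived_by_sex)
```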
R
# Estimate the density of the fares themselves (not of their frequency table)
temp <- density(titanic$Fare, na.rm = TRUE)
plot(temp, type = "n",
     main = "Fare charged from Passengers")
polygon(temp, col = "lightgray",
        border = "gray")
|
Output:
Here we can observe that some passengers were charged extremely high fares. These values are outliers and can distort our analysis. Let’s confirm their presence using a boxplot.
R
boxplot(titanic$Fare,
        main = "Fare charged from passengers")
|
Output:
The boxplot confirms that the Fare column contains some extreme outliers.
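The points that a boxplot draws individually can also be listed with boxplot.stats(), which by default flags values more than 1.5 times the box length beyond the box. A minimal sketch on made-up fares (the vector `fares` is an assumption for illustration, not data from the file):

```r
# Made-up fares with one extreme value (illustrative only)
fares <- c(7.25, 8.05, 7.90, 13.00, 26.55, 10.50, 512.33)

# Values that fall outside the whiskers under the default rule
outliers <- boxplot.stats(fares)$out

# One possible treatment: drop the flagged values before further analysis
fares_trimmed <- fares[!fares %in% outliers]

print(outliers)
print(fares_trimmed)
```

Whether to drop, cap, or keep such values is a judgment call; an extreme fare may be a genuine observation rather than noise.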