Oluyemisi Oludare
Jun 29, 2024
Titanic

This review concerns the Titanic dataset on Kaggle, which is widely used for data analysis and machine learning. The dataset provides information about the passengers of the ill-fated British ocean liner Titanic and shows how survivors are distributed across different variables.

The objective is to familiarize myself with the dataset and identify its key variables and data types, without conducting a deep dive. I will note the initial insights the dataset offers at first glance.

The dataset comprises three files:

  • train dataset
  • test dataset
  • gender submission dataset
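As a minimal sketch of a first step, each file can be read with Python's standard csv module. The helper name load_rows is hypothetical and introduced here only for illustration, and the file locations are an assumption about how the data was downloaded.

```python
import csv

def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Assuming the train file is saved as train.csv in the working directory, `train = load_rows("train.csv")` followed by `len(train)` and `list(train[0].keys())` would give the row count and the column names. All values arrive as strings, so data types still need to be checked separately.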

At a glance, the dataset describes the passengers along the following variables:

  • gender,
  • ticket class,
  • port of embarkation,
  • number of parents or children aboard,
  • number of siblings or spouses aboard,
  • ticket number,
  • fare paid,
  • cabin number,
  • age

The datasets have missing values in the Age, Cabin, Embarked, and Fare columns, which shows at a glance that the data needs cleaning and transformation.
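One way to confirm which columns have gaps is to count the blank cells per column. The sketch below assumes that missing values appear as empty strings in the CSV; the sample rows are made up for illustration and are not real passenger records, and count_missing is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, NOT taken from the real files; blanks mark missing values.
SAMPLE_CSV = """PassengerId,Age,Cabin,Embarked
1,30,,S
2,,B12,C
3,41,,
"""

def count_missing(rows):
    """Count empty or whitespace-only cells per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
    return dict(missing)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing_counts = count_missing(rows)
print(missing_counts)
```

The same function could be pointed at rows read from the real files to get the per-column null counts.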

Observations

1. The train dataset has 891 rows, with 177 null values in the Age column, 687 in the Cabin column, and 2 in the Embarked column.

2. The dataset comprises 891 passengers: 577 males (64.76%) and 314 females (35.24%).

3. Of the 577 male passengers, 109 survived and 468 did not. Of the 314 female passengers, 233 survived and 81 did not.

4. Overall distribution: of the 891 passengers, 342 survived and 549 did not.
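The counts in the observations above can be turned into survival rates, which makes the gender gap easier to compare. This is a small arithmetic sketch using only the figures already reported; the helper name pct is hypothetical.

```python
# Counts as reported in the observations above.
males, male_survivors = 577, 109
females, female_survivors = 314, 233
total = males + females

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

male_rate = pct(male_survivors, males)
female_rate = pct(female_survivors, females)
overall_rate = pct(male_survivors + female_survivors, total)
print(male_rate, female_rate, overall_rate)
```

On these counts, roughly 19% of male passengers survived against roughly 74% of female passengers, with an overall survival rate of about 38%.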

Distribution of survivors by ticket class

5. Survivors by ticket class: The survival rate is higher among first-class passengers than among the other classes. Third-class passengers recorded the highest number of casualties.
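A comparison like this can be computed by grouping rows on a column and averaging the survival flag. The sketch below assumes the Kaggle column names Pclass and Survived, with Survived stored as "0" or "1". The function name survival_rate_by is hypothetical, and the sample rows are invented for illustration only.

```python
from collections import defaultdict

def survival_rate_by(rows, key):
    """Map each value of `key` to the mean of the 0/1 Survived flag."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        survived[row[key]] += int(row["Survived"])
    return {group: survived[group] / totals[group] for group in totals}

# Hypothetical rows for illustration only, NOT real passenger records.
sample = [
    {"Pclass": "1", "Survived": "1"},
    {"Pclass": "1", "Survived": "0"},
    {"Pclass": "3", "Survived": "0"},
    {"Pclass": "3", "Survived": "0"},
]
rates = survival_rate_by(sample, "Pclass")
print(rates)
```

Passing a different key, such as the gender column, would reuse the same grouping logic for the other breakdowns in this review.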

Visualizations

Distribution of survivors by gender
Distribution of survivors by parents with or without children

Conclusion

The Titanic dataset offers a rich source of information for analysis, including survival rates, class distribution, gender differences, and age demographics. Initial observations suggest potential areas for further analysis, such as the impact of passenger class and gender on survival, the effect of travelling with parents or children (Parch), and the handling of missing data.

Potential Areas for Further Analysis:

  • Survival analysis based on passenger class, gender, and age.
  • Investigation into the impact of age on survival patterns.
  • Detailed analysis of missing values, especially in the Age and Cabin columns, and strategies for handling them.
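As one possible strategy for the missing Age values, blank entries could be replaced with the median of the known ages. This is only a sketch of a common option, not a method this review has applied or evaluated; the function name fill_missing_age is hypothetical and the sample values are made up.

```python
from statistics import median

def fill_missing_age(rows):
    """Return copies of rows with blank Age replaced by the median known age."""
    known = [float(r["Age"]) for r in rows if r["Age"].strip()]
    fallback = median(known)
    return [
        {**r, "Age": r["Age"] if r["Age"].strip() else str(fallback)}
        for r in rows
    ]

# Hypothetical ages for illustration only; the blank string marks a missing value.
sample = [{"Age": "20"}, {"Age": ""}, {"Age": "40"}, {"Age": "30"}]
filled = fill_missing_age(sample)
print(filled)
```

A column such as Cabin, where most values are missing, may call for a different treatment, for example dropping the column or reducing it to a present/absent flag.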

This review is part of my initial submission for the data analysis internship with HNG. For more information about the HNG Internship program and how it can benefit aspiring data analysts, visit https://hng.tech/internship and https://hng.tech/hire.
