Comprehensive Data Exploration Process with One-Click

EDA overview (image by author)

Exploratory Data Analysis, also known as EDA, has become an increasingly hot topic in data science. As the name suggests, it is a process of trial and error in an uncertain space, with the goal of finding insights. It usually happens at an early stage of the data science lifecycle. Although there is no clear-cut boundary between data exploration, data cleaning, and feature engineering, EDA generally sits right after the data cleaning phase and before feature engineering or model building. EDA assists in setting the overall direction of model selection and it helps…

Line chart, bar chart, pie chart … they tell different stories

Chart Type Summary Mindmap (image by author)

In this information-rich age, data visualizations are designed to make knowledge transfer between presenters and their audience easier. It is therefore crucial for dashboard creators to know which chart aligns with their key delivery objectives. Likewise, a basic understanding of what each chart means helps the audience interpret dashboards effectively. In this article, I introduce a way to better understand some common charts and graphs (e.g. scatter plot, map, pie chart, and stacked bar chart) by categorising them into four main types: distribution, comparison, composition…

A Step by Step Guide to K-Means Clustering

clustering analysis infographic (image by author from website)

What Is a Clustering Algorithm?

In a business context: A clustering algorithm is a technique that supports customer segmentation, the process of grouping similar customers into the same segment. It helps a business better understand its customers in terms of both static demographics and dynamic behaviors. Customers with comparable characteristics often interact with the business in similar ways, so the business can benefit from this technique by creating a tailored marketing strategy for each segment.

In a data science context: A clustering algorithm is an unsupervised machine learning algorithm that discovers groups of closely related data points. The fundamental difference between supervised and unsupervised algorithms is that:

  • supervised…
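To make the idea of "discovering groups of closely related data points" concrete, here is a minimal K-means sketch in plain Python. It is an illustration under assumptions, not the article's own implementation: the customer data, the kmeans helper, and the choice of starting centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers: (monthly visits, average basket size)
customers = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
# Assumption: start from one point in each visually obvious group.
labels, centroids = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
print(centroids)
```

No labels are supplied anywhere: the segments emerge purely from distances between points, which is what makes the method unsupervised. In practice the result depends on the starting centroids and on the chosen number of clusters.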

How to Use Data Visualization to Guide Feature Selection

Feature Selection and EDA Cheatsheet (image by author, from website)

In the machine learning lifecycle, feature selection is a critical process that selects the subset of input features relevant to the prediction. Including irrelevant variables, especially those with poor data quality, can often contaminate the model output.

Additionally, feature selection has the following advantages:

1) avoid the curse of dimensionality, as some algorithms (e.g. general linear models and decision trees) perform badly on high-dimensional data

2) reduce computational cost and the complexity that comes along with a large amount of data

3) reduce overfitting, so the model is more likely to generalize to new data

4) increase the…
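As a rough illustration of selecting features that are relevant to the prediction, the sketch below ranks features by their Pearson correlation with the target and keeps those above a threshold. This is only one simple filter, not necessarily the method the article uses; the feature names, the data, and the 0.5 threshold are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(features, target, min_abs_corr=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= min_abs_corr]
    return selected, scores

# Hypothetical housing data: only size_sqm should relate to price.
features = {
    "size_sqm": [50, 60, 80, 100, 120],
    "door_color_code": [3, 1, 2, 1, 3],
    "constant_flag": [1, 1, 1, 1, 1],
}
price = [150, 180, 240, 310, 350]

selected, scores = select_features(features, price)
print(selected)
```

Pearson correlation only captures linear relationships, so a filter like this can miss features with a nonlinear effect; plotting each feature against the target remains a useful check.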

What if Learning Data Science is a Game

Seven data science skills (image by author)

We are all familiar with modern game design, in which champions or heroes are equipped with certain attributes and specialties. For example, Dota heroes are scored on agility, intelligence, and strength. To excel on the battlefield, a hero needs above-average scores across all attributes while also specializing in at least one.

So what if we think of learning data science as playing a game in which all of us possess multi-dimensional abilities? Playing video games demands constantly sharpening our skills with weapons, training, or magic potions. …

Step-by-Step Guide from Data Preprocessing to Model Evaluation

logistic regression python cheatsheet (image by author)

What is Logistic Regression?

Don’t let the name logistic regression trick you: it usually falls under the category of classification algorithms rather than regression algorithms.

Then, what is a classification model? Simply put, the prediction generated by a classification model is a categorical value, e.g. cat or dog, yes or no, true or false … In contrast, a regression model predicts a continuous numeric value.

Logistic regression makes predictions based on the sigmoid function, which is an S-shaped curve as shown below. …
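For readers who prefer code to a picture, the sigmoid function is sigmoid(z) = 1 / (1 + e^(-z)). The short sketch below evaluates it and turns the result into a class label; the predict_label helper and the 0.5 threshold are illustrative assumptions, not part of the original article.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    # Two algebraically equal branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(z, threshold=0.5):
    """Turn the sigmoid output into a binary class (hypothetical helper)."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3), predict_label(z))
```

Here z stands for the linear combination of the input features and the model coefficients. Large negative values are squashed toward 0 and large positive values toward 1, which is why the output can be read as a probability and thresholded into a category.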

Machine Learning and Predictive Modelling in BigQuery

How to Build ML Model using BigQuery — image by author

When taking the first step into the field of machine learning, it is easy to get overwhelmed by all kinds of complex algorithms and ugly symbols. Hopefully, this article can lower the entry barrier by providing a beginner-friendly guide that lets you get a sense of achievement by building your own ML model using BigQuery and SQL. That’s right, we can use SQL to implement machine learning. If you are looking for a few lines of code to get your hands dirty in the ML field, please continue reading :)

1. Set Up the Basic Structure 📁

Sites and blogs that inspire learning.

Photo by Kelly Sikkema on Unsplash

Learning data science is a long journey, and following a rigid course curriculum inevitably makes learning a mundane task. Therefore, I have compiled a list of data science blogs that can bring you a daily dose of inspiration across various domains: AI and Machine Learning, Data Engineering, Data Visualization, and Business Acumen.

I have created an infographic as a summary; feel free to steal it at the end of this article. Additionally, if you are looking for data science podcasts or YouTubers to follow, have a read of the lists I collected :).

AI & Machine Learning

1. Towards Data Science

Towards Data Science gathers a large community…

Learn left join, inner join, self join using examples

Photo by Aida L on Unsplash

For advanced analytical processing and data discovery, one table is often not enough to bring valuable insights, so combining multiple tables is unavoidable. SQL, as a tool for communicating with relational databases, provides the functionality to build relationships among tables. This article introduces how to use SQL to link tables together. If you want to learn more about the basics of SQL, I suggest reading my first article about learning SQL in everyday language. It gives a comprehensive SQL introduction for absolute beginners.
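As a small taste of linking tables, the sketch below runs an inner join and a left join against an in-memory SQLite database using Python's built-in sqlite3 module. The customers and orders tables and their contents are hypothetical examples, not taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN keeps every customer; missing orders show up as NULL (None in Python).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

The only difference between the two results is the customer without orders: the inner join drops that row, while the left join keeps it with an empty amount.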

Why We Need to Learn SQL JOIN

You may not have realized it, but we frequently come across joining in Excel…

Your Daily Dose of Inspiration When Unmotivated to Learn Data Science

Photo by Juja Han on Unsplash

If we only learn data science through a rigid curriculum created by universities or online courses from Coursera or Udemy, we may find the learning process too boring. If you ever find yourself losing motivation in this long journey of studying data science, you may just need some podcasts to break the routine and get some inspiration. The major difference between these two approaches of learning is that the former focuses on theory and concepts, whereas the latter introduces more practical experience and projects that add flesh to the bones.

Listening to podcasts is a great way to absorb knowledge…

Destin Gong

on my way to become a data storyteller
