
The data science process

meenati biswal

Overview of the data science process

Following a structured approach to data science helps you maximize your chances of success in a data science project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best. Take care, however: this approach may not be suitable for every type of project or be the only way to do good data science.
The typical data science process consists of six steps.

The six steps of the data science process


The figure summarizes the data science process and shows the main steps and actions you’ll take during a project. The following list is a short introduction; each of the steps will be discussed in greater depth throughout the data science course.

1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project, this will result in a project charter.


2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.


3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.


4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You’ll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.


5. Finally, we get to the sexiest part: model building (often referred to as “data modeling” throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember: research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you’ve done this phase right, you’re almost done.


6. The last step of the data science process is presenting your results and automating the analysis if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.
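
As a rough illustration of steps 3 and 4, the sketch below cleans a small made-up data set, combines it with a second made-up source, and computes two descriptive statistics. The records, the field names (`age`, `amount`, `total_spent`), the validity range for ages, and the helper functions are all assumptions invented for this example; a real project would likely use a data library and cleaning rules agreed with the data owner.

```python
from statistics import mean

# Hypothetical raw data from two sources (the output of step 2).
# All names and values are made up for illustration.
raw_customers = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": "n/a"},  # unusable value
    {"id": 3, "age": "-5"},   # impossible value
    {"id": 4, "age": "51"},
    {"id": 5, "age": "28"},
]
raw_orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 4, "amount": 300.0},
    {"customer_id": 5, "amount": 60.0},
    {"customer_id": 9, "amount": 45.0},  # no matching customer
]


def clean_customers(rows):
    """Step 3: detect errors and transform age from text to an integer."""
    cleaned = []
    for row in rows:
        try:
            age = int(row["age"])
        except ValueError:
            continue  # drop rows whose age cannot be parsed
        if 0 < age < 120:  # assumed plausibility range
            cleaned.append({"id": row["id"], "age": age})
    return cleaned


def combine(customers, orders):
    """Step 3: combine both sources into one row per customer."""
    totals = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [
        {"id": c["id"], "age": c["age"], "total_spent": totals[c["id"]]}
        for c in customers
        if c["id"] in totals
    ]


def pearson(xs, ys):
    """Step 4: Pearson correlation, one descriptive technique among many."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


prepared = combine(clean_customers(raw_customers), raw_orders)
ages = [row["age"] for row in prepared]
spend = [row["total_spent"] for row in prepared]

print(prepared)
print("mean spend:", round(mean(spend), 2))
print("correlation between age and spend:", round(pearson(ages, spend), 3))
```

With only three usable rows, the correlation here says little; the point is the order of work: clean, combine, then describe.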

In reality, you won’t progress in a linear way from step 1 to step 6. Often you’ll regress and iterate between the different phases.
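
The remark in step 5 about combining simple models can be sketched with a majority vote. The rows, the cutoff of 5, and the three threshold rules below are hypothetical, and the data is deliberately constructed so that each rule errs on different rows; real data offers no such guarantee, which is why the text says "often (but not always)".

```python
# Made-up rows: (feature_1, feature_2, feature_3, label). Hypothetical data,
# constructed so that each simple rule makes its mistakes on different rows.
rows = [
    (8, 7, 2, 1),
    (9, 3, 8, 1),
    (2, 6, 9, 1),
    (1, 2, 7, 0),
    (6, 1, 3, 0),
    (4, 8, 2, 0),
]


def threshold_rule(index, cutoff=5):
    """Return a simple model: predict 1 when one feature exceeds the cutoff."""
    return lambda features: 1 if features[index] > cutoff else 0


simple_models = [threshold_rule(0), threshold_rule(1), threshold_rule(2)]


def majority_vote(features):
    """Combine the simple models: predict 1 when most of them predict 1."""
    votes = sum(model(features) for model in simple_models)
    return 1 if votes * 2 > len(simple_models) else 0


def accuracy(model):
    """Share of rows where the model's prediction matches the label."""
    hits = sum(1 for *features, label in rows if model(features) == label)
    return hits / len(rows)


for name, model in [
    ("rule 1", simple_models[0]),
    ("rule 2", simple_models[1]),
    ("rule 3", simple_models[2]),
    ("majority vote", majority_vote),
]:
    print(f"{name}: accuracy {accuracy(model):.2f}")
```

On this toy data each rule alone gets four of six rows right, while the vote gets all six right, because the individual mistakes do not overlap.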


Following these six steps pays off in terms of a higher project success ratio and increased impact of research results. This process ensures you have a well-defined research plan, a good understanding of the business question, and clear deliverables before you even start looking at data. The first steps of your process focus on getting high-quality data as input for your models. This way your models will perform better later on. In data science, there’s a well-known saying: garbage in equals garbage out.
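
A tiny example of "garbage in equals garbage out", assuming a hypothetical list of temperature readings in which `9999.0` is a made-up error code and the valid range of -50 to 60 is an assumption chosen for this illustration:

```python
from statistics import mean

# Hypothetical readings; 9999.0 stands in for a made-up error code.
readings = [21.0, 22.0, 20.0, 9999.0, 23.0]


def is_valid(value, low=-50.0, high=60.0):
    """Assumed plausibility check; the bounds are illustrative only."""
    return low <= value <= high


naive = mean(readings)  # the error code is treated as a real measurement
careful = mean(r for r in readings if is_valid(r))

print("average with garbage:", naive)       # 2017.0
print("average after cleaning:", careful)   # 21.5
```

Even the simplest possible "model", an average, is ruined by a single bad input, which is why the early steps of the process concentrate on data quality.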


Another benefit of following a structured approach is that you work more in prototype mode while you search for the best model. When building a prototype, you’ll probably try multiple models and won’t focus heavily on issues such as program speed or writing code against standards. This allows you to focus on bringing business value instead.


Not every project is initiated by the business itself. Insights learned during analysis or the arrival of new data can spawn new projects. When the data science team generates an idea, work has already been done to make a proposition and find a business sponsor.

 

Dividing a project into smaller stages also allows employees to work together as a team. It’s impossible to be a specialist in everything. You’d need to know how to upload all the data to all the different databases, find an optimal data scheme that works not only for your application but also for other projects inside your company, and then keep track of all the statistical and data-mining techniques, while also being an expert in presentation tools and business politics. That’s a hard task, and it’s why more and more companies rely on a team of specialists rather than trying to find one person who can do it all.


The process we described in this section is best suited for a data science project that contains only a few models. It’s not suited for every type of project. For instance, a project that contains millions of real-time models would need a different approach than the flow we describe here. A beginning data scientist should get a long way following this manner of working, though.

 

 
