Pepe’s learning path – Where to start as a data scientist.


Today, the main task of people working in data mining and machine learning is collecting and analyzing data. A broad background in probability theory and statistics is needed by all professionals in data science. While in the past statistics research was conducted mainly in statistics departments, and data mining and machine learning research in computer science departments, both types of professionals now recognize the general applicability of statistical theory and the advantages of modern data mining techniques. Combining all this knowledge and these skills and techniques clears the path to success for every data scientist.

This article gives a brief overview of the tools and techniques that every newcomer to the data science field should start with, assuming he or she has some previous knowledge of statistics and probability theory. There are quite a lot of online (and other) learning platforms with courses through which anyone can expand their knowledge and skills in this area. Two books I recommend are All of Statistics by Larry Wasserman and Operations Research by Hamdy A. Taha.

There are three main career paths in the data science field – data mining, business analysis and ML engineering. Data mining is a discipline focused mainly on collecting, storing and querying data, together with the required infrastructure. Business analysts use the data to make better (optimal) business decisions. The key skill here is analysis – how to choose the best approach (statistical, ML, etc.) for dealing with the data. Visualization, both of the data and of the results, is also a significant part of this process. Once accurate and credible models have been built, another problem arises – how to streamline them? ML engineering is a comprehensive discipline in which all the magical models go into the real world – production. How do these models deal with all their dependencies, both hardware and software?

By way of introduction, let us go over the most popular tools for collaborative data science: Jupyter Notebook, Apache Zeppelin, RStudio, Seahorse and OpenRefine.

Jupyter Notebook is an open-source web application that allows users to create and share documents containing text (including LaTeX), live code and visualizations. It is widely used for tasks such as data preparation, statistical modeling, machine learning and visualization. Some of the most popular programming languages that can be installed as kernels are Python, R, Julia, Scala, Go, Ruby and JavaScript. The full list of available kernels can be found here:


Zeppelin Notebook is another web-based application that enables collaborative, interactive data analytics with SQL, Apache Spark, Scala and others. One of its great advantages is the built-in Apache Spark integration, so no separate module is needed. It has all the basic visualization tools out of the box – basic and pivot charts, and dynamic forms. More information can be found at the following link:


RStudio is an IDE (Integrated Development Environment) for R. R is a software environment for statistical computing popular among data scientists. It provides a variety of tools for linear and nonlinear modeling, time-series analysis, clustering and classification, etc. The official website on which you can find more information about the R Project is:


Seahorse is not as popular among data scientists as Jupyter Notebook or RStudio, but it is a useful interactive framework that allows us to create Spark applications in a fast, simple way. With just drag-and-drop operations, a Spark application can be created and connected to any Spark cluster. More info here:


Last but not least, we have OpenRefine, formerly known as Google Refine. This is a multi-language tool for cleaning and transforming data. It can also be used for data enrichment with web services and external data.


An interesting free online course introducing all the above-mentioned tools (how to set up and start working with the desired environment) can be found here:


Every data scientist works with data – this is the nature of the profession. The data needs a database – this is the place where the data is stored and can be processed quickly. An introduction to SQL is mandatory for every data scientist. SQL is a language used to query data in databases. A good course (also free) on SQL, relational databases, relational model concepts, relational model constraints, Data Definition Language and working with tables can be found here:

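To make these topics a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns and rows (customers, orders, Ana, Boris) are made up for illustration and do not come from any course. It shows Data Definition Language statements with a few relational constraints, followed by a query that joins and aggregates the two tables.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Data Definition Language: two related tables with constraints.
# Table and column names are hypothetical.
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL CHECK (amount > 0)
    );
""")

# Insert a few illustrative rows using parameterized statements.
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Boris")],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 35.5), (2, 10.0)],
)

# Query: total order amount per customer, largest first.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

print(totals)
```

The CHECK and NOT NULL clauses are examples of constraints: an attempt to insert an order with a negative amount would be rejected by the database rather than silently stored.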

Today we often hear fundamental questions like “What is big data and how does it add value to businesses?” Usually the term “big data” is used to refer to data sets too big and/or complex for traditional data processing software. The challenges in working with big data include data storage and analysis, data sharing, data visualization and many more. An explanation of the big V’s in big data (volume, variety, velocity) and of big data’s connections with data science is provided in the following course:


Since Hadoop offers great big data solutions and is a free open-source platform, it is worth learning about Hadoop’s architecture and its MapReduce and HDFS (Hadoop Distributed File System) components. The following course gives a brief introduction and covers Hadoop’s architecture, administration and components:

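For readers who have not met MapReduce before, the idea can be sketched in plain Python with the classic word-count example. This is only a single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the function names and the sample documents are my own assumptions.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or block) would be mapped on a
    # different node; here everything runs sequentially in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input data.
documents = ["big data is big", "data science uses data"]
counts = word_count(documents)
print(counts)
```

The point of the pattern is that the map and reduce functions work on independent pieces of data, which is what lets a framework such as Hadoop spread the work over many machines.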

Once all of the more popular tools are in one’s hands, one should become acquainted with some definitions of data science and some real data science projects. This is what the following website offers:


Given the impact data science can have on business, learning mathematical optimization (mathematical programming) gives you the ability to model and solve optimization problems. These problems, whose goal is to find optimal business solutions, can be quite complex. But beware: learning the models and learning how to model are not the same thing. The discipline that deals with the application of advanced analytical methods to decision-making is called Operations Research. Here is a great comprehensive course that includes linear programming and network models:

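To give a feeling for what a linear programming model looks like, here is a small sketch in plain Python. The problem itself (maximize 3x + 5y under three resource limits) is a made-up textbook-style example, not taken from the course. The code relies on the fact that when the feasible region is bounded and non-empty, as it is here, an optimum lies at one of its corner points, so it simply enumerates the corners. Real solvers use far more efficient methods such as the simplex algorithm; this brute-force approach is only reasonable for tiny two-variable problems.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint (a, b, c) means a*x + b*y <= c.
# The numbers are illustrative assumptions.
constraints = [
    (1, 0, 4),    # x <= 4
    (0, 2, 12),   # 2y <= 12
    (3, 2, 18),   # 3x + 2y <= 18
    (-1, 0, 0),   # x >= 0, written as -x <= 0
    (0, -1, 0),   # y >= 0, written as -y <= 0
]
objective = (3, 5)  # maximize 3x + 5y


def intersection(c1, c2):
    """Point where two constraint boundaries cross, or None if parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    # Cramer's rule, with exact fractions to avoid rounding issues.
    x = Fraction(r1 * b2 - r2 * b1, det)
    y = Fraction(a1 * r2 - a2 * r1, det)
    return x, y


def is_feasible(point):
    """True if the point satisfies every constraint."""
    x, y = point
    return all(a * x + b * y <= c for a, b, c in constraints)


def value(point):
    """Objective value at a point."""
    return objective[0] * point[0] + objective[1] * point[1]


# Corner points: feasible intersections of pairs of boundaries.
vertices = set()
for c1, c2 in combinations(constraints, 2):
    point = intersection(c1, c2)
    if point is not None and is_feasible(point):
        vertices.add(point)

best = max(vertices, key=value)
best_value = value(best)
print(f"x = {best[0]}, y = {best[1]}, objective = {best_value}")
```

Writing down the constraints and the objective is the modeling step; finding the best corner is the solving step. In practice the first is usually the harder skill to learn.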

Finally, you might want to look at this course on Deep Learning, which needs no summary:

