Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.
– Clive Humby
What is Data Science?
Data Science is multidisciplinary in nature: it draws on Data Mining, Business Intelligence, Computer Science, Data Analysis, Operations Research, Statistics, Predictive Modelling, Artificial Intelligence, and Machine Learning, to mention just a few fields.
It is a discipline used to confront Big Data, and it includes data cleansing, preparation, and analysis. A data scientist like you and me will be in a position to gather data from multiple sources and apply Machine Learning, Predictive Analytics, and Sentiment Analysis to extract critical information from the collected data sets.
Data Science is often pictured as the intersection of three circles:
- Math
- Statistics
- Subject expertise
Data Analysis focuses on correlative analysis: examining relationships between data sets or known variables to anticipate how a particular event might unfold in the future.
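To make that concrete, here is a minimal sketch of correlative analysis using pandas; the column names and numbers are purely illustrative:

```python
# Toy dataset: advertising spend vs. units sold (made-up numbers).
import pandas as pd

df = pd.DataFrame({
    "ad_spend":   [100, 150, 200, 250, 300, 350],
    "units_sold": [20, 24, 33, 37, 45, 51],
})

# Pearson correlation between the two variables; a value near 1.0
# indicates a strong positive linear relationship.
print(df["ad_spend"].corr(df["units_sold"]))
```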
Data Mining is a subset of Data Science that refers to the process of collecting data and searching it for patterns.
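As a toy illustration, the core idea of pattern mining can be reduced to counting which items co-occur in transactions; the basket data below is invented for the example:

```python
# Count frequently co-occurring item pairs across shopping baskets.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

# Emit every pair of items that appears together in a basket, then count.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
print(pair_counts.most_common(3))  # ('bread', 'milk') appears twice, etc.
```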
Machine Learning develops predictive models that are generic and can be applied to any domain-related data problem.
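For example, a minimal predictive model built with scikit-learn and trained on synthetic data; nothing here is tied to a specific domain:

```python
# Fit a linear model to synthetic data where y is roughly 3*x plus noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=50)

model = LinearRegression().fit(X, y)
print(model.predict([[5.0]]))  # prediction for an unseen input, roughly 15
```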
Operations Research deals with decision making and the optimization of business processes such as pricing, inventory management, and supply chains.
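A tiny production-planning example gives the flavour; the sketch below uses SciPy's linear programming solver, and every coefficient is made up:

```python
# Maximize profit 40*x1 + 30*x2 subject to shared resource limits.
# linprog minimizes, so we negate the objective.
from scipy.optimize import linprog

c = [-40, -30]                      # negated profit per unit of each product
A_ub = [[2, 1], [1, 1]]             # machine hours and material used per unit
b_ub = [100, 80]                    # available machine hours and material

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)        # optimal production plan and max profit
```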
Artificial Intelligence spans various knowledge domains like Robotics, Cognitive Science, Natural Language Processing, Human-Computer Interaction, Pattern Recognition, etc.
Business Intelligence is the process of collecting, integrating, analyzing, and presenting data so that executives and managers can make better-informed decisions.
Computer Science encompasses algorithmic and complex computational implementations: distributed architectures like Hadoop MapReduce for fast and scalable data processing, data plumbing for optimizing various data flows, and in-memory analytics.
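The MapReduce programming model mentioned above can be sketched in plain Python; a real Hadoop job distributes the same map, shuffle, and reduce phases across a cluster, whereas this single-process version only mirrors the model:

```python
# Word count in the MapReduce style.
from itertools import groupby

docs = ["big data is big", "data science is fun"]

# Map phase: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: bring equal keys together by sorting on the word.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each word.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, ...}
```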
The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in future decades.
– Hal Varian, chief economist at Google and UC Berkeley professor of information sciences, business, and economics
[Image: Data Science Life Cycle]
Important Tools
[Data source: Classification Of Data Science Software]
Anaconda
Anaconda is an open-source Python distribution that ships with over 1,500 packages drawn from PyPI and the conda repositories, together with the conda package and environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command-line interface (CLI).
Here’s the complete guide – How to Install Anaconda.
Useful commands:
- conda list – verify that Anaconda is installed properly and list the packages installed with it
- conda search – list all packages available for Anaconda
- conda search -f <package_name> – search for a specific package; for example, to search for jupyter:

conda search -f jupyter
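conda also manages isolated virtual environments, and creating one per project is a common workflow. The environment name myenv below is just a placeholder:

conda create -n myenv python=3.8
conda activate myenv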
Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more.
Once everything is in order, let’s get started.
NOTE: I will be using Jupyter Notebook on Linux for demos.
- Create a folder where you want to practice. For this article, I will be using Documents > icode-data.
- Type the command below after installing Anaconda:
jupyter notebook
Jupyter Notebook will open in your default browser.
You can see the folder I created: icode-data.
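As a quick smoke test, create a new Python 3 notebook inside icode-data and run a cell like the one below; the data is invented:

```python
# First notebook cell: build a tiny DataFrame and plot it inline.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"day": [1, 2, 3, 4, 5], "visits": [10, 14, 9, 20, 17]})
df.plot(x="day", y="visits", marker="o")
plt.show()
```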
Data science incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modelling, data warehousing, and high-performance computing with the goal of extracting meaning from data and creating data products.
– Wikipedia, March 31st, 2020