Data Science, AI, Machine Learning

Data Science comes in different flavors and sometimes denotes different things. Some clarification of the terms used in this book and of the subjects covered is in order.

Science With and Of Data

With the advent of cheap storage devices in the last decade of the 20th century, companies, governments, other organizations, and also private individuals started collecting data at large scale (big data). In a world full of data, somebody has to think about how to make accessible the information hidden in that data. Computer scientists and mathematicians developed a range of methods for extracting information, more and more applications appeared, methods became more complex, and a new field of research was born. This new field matured, got the name ‘data science’, and is now accepted as a serious field of research and teaching.

Data Science as a science field covers all technical aspects of data processing. There is large overlap with computer science and mathematics, but also with many other fields, depending on where the data comes from. Mathematics provides advanced methods for extracting information from data. Computer science allows for their realization.

Data Science also touches law, ethics and sociology. May I use this data set for my project? Is it okay to collect and dig through personal data? What impact will extensive data collection and processing have on society?

Almost every data science project has four phases:

1. Collect Data

Data has to be recorded and stored somehow. Planning and realizing data collection processes is referred to as data engineering. Typical tasks in this phase are, for instance, installing and configuring sensors, setting up database storage, and implementing techniques for supervising data flows.

2. Clean and Restructure Data

Raw data sets often contain errors, missing items, or false items. They have to be cleaned. Almost always, several data sets have to be combined to allow for successful extraction of information. These preprocessing steps require lots of manual work and domain knowledge. Careful preprocessing simplifies subsequent processing steps and is at least as important as the modeling phase.

3. Create a Model

From recorded and preprocessed data a mathematical or algorithmic model is built. Depending on the concrete problem to be solved, such a model may describe the data set (descriptive model) or it may be used to answer some question based on the data set (predictive model).

4. Communicate Results

Findings from the data have to be communicated to the client. Visualizations are the most important tool for delivering results.
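Phases 2 and 3 can be sketched in a few lines of Python. The data set, column names, and the trivial ‘model’ below are invented purely for illustration:

```python
import pandas as pd

# phase 1 (collect): here we simply type in raw data with typical defects
raw = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "c"],
    "amount": [10.0, None, 25.0, 5.0, -3.0],   # a missing and a false item
})

# phase 2 (clean and restructure): drop missing and implausible values,
# then aggregate to per-customer totals
clean = raw.dropna()
clean = clean[clean["amount"] > 0]
per_customer = clean.groupby("customer")["amount"].sum()

# phase 3 (model): a very simple descriptive model, the mean spending
model = per_customer.mean()
```

Real projects differ mainly in scale: the cleaning rules come from domain knowledge, and the model is usually far more elaborate than a single mean.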

In this book we focus on preprocessing and modeling. Data engineering and communication of results will only be touched on occasionally. The visualization aspect of communication also plays an important role in preprocessing when exploring a new data set (exploratory data analysis, EDA for short). So we will cover the full range of visualization tools and techniques there.

Example: Customer Segmentation

Brick-and-mortar stores as well as online shops collect as much customer data as they can to understand customer behavior. Knowing how many people buy which products at which time in which quantities is essential for efficient warehousing. But customer data is also used for targeted ad campaigns.

For targeted advertising one tries to identify groups of customers with similar behavior. For each group, tailor-made ads are created. Customer segmentation is an example of descriptive modeling: the aim is to understand the collected data and to find structures not obvious at first glance.

Typical tasks in the four phases described above are:

1. Collect Data

  • implement a network infrastructure to collect sales data from all stores in a central database

  • issue customer cards to know who comes to your shop (age, gender, location,…)

  • think about buying external data about your customers (Schufa,…)

  • check the legal situation to know whether you are allowed to collect the data you want

2. Clean and Restructure Data

  • throw away all the data not relevant for segmentation (for instance, data of customers not living in the targeted region)

  • transform data (for instance, convert absolute quantities to relative quantities: milk made up 5% of the shopping cart)

  • restructure data to get per-customer data instead of per-shop or per-product data

3. Create a Model

  • apply some standard segmentation method

  • try to understand the identified customer groups, find unique characteristics

4. Communicate Results

  • present groups and their unique characteristics to the advertising department
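A ‘standard segmentation method’ could, for instance, be k-means clustering. Here is a minimal NumPy sketch on invented per-customer data; the two feature columns and all numbers are made up, and the crude initialization is chosen for simplicity, not robustness:

```python
import numpy as np

rng = np.random.default_rng(0)

# invented per-customer features: [store visits per month, average basket value]
customers = np.vstack([
    rng.normal([2, 80], [0.5, 10], size=(20, 2)),   # rare but big purchases
    rng.normal([12, 15], [2, 5], size=(20, 2)),     # frequent small purchases
])

def kmeans(data, k, steps=20):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    # crude but deterministic init: k points spread evenly through the data set
    centroids = data[np.linspace(0, len(data) - 1, k, dtype=int)]
    for _ in range(steps):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

labels, centroids = kmeans(customers, k=2)
```

The resulting `labels` assign each customer to a group; the ‘understand the groups’ step then means inspecting the centroids (here: a few big purchases versus many small ones).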

Example: Weather Forecast

Weather forecasting is a typical example of predictive modeling: from past data we want to create a model which yields information on future weather parameters. In the past, experts analyzed recorded weather data and made predictions mainly from experience and classical mathematical and physical modeling. Data science makes it possible to automate the forecasting process: instead of handcrafted models and expert knowledge, one creates a predictive model based on all (or a sufficiently large part of the) recorded weather data.

1. Collect Data

  • decide which weather parameters to record (temperature, humidity,…)

  • implement a network infrastructure to collect weather data from across the world

  • build and launch satellites

  • build terrestrial weather stations

2. Clean and Restructure Data

  • decide on a subset of the data to use for forecasting (for instance, only use data from the past 30 days)

  • transform data (for instance, harmonize temperature units: Fahrenheit, Celsius)

  • restructure data (for instance, downsample data from 5-minute periods to hourly values)

3. Create a Model

  • apply some standard method for predictive modeling

  • verify the quality of your model’s predictions

4. Communicate Results

  • turn the model’s numerical outputs into a human-readable forecast (for instance, round temperatures to at most one decimal place)
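Phases 2 and 3 of this example can be sketched as follows. The temperature data is synthetic, and the ‘model’ is a persistence forecast (tomorrow at hour h looks like today at hour h), a common baseline standing in for a real forecasting method:

```python
import numpy as np

rng = np.random.default_rng(1)

# invented raw data: 30 days of hourly temperatures in Fahrenheit,
# a daily cycle plus measurement noise
hours = np.arange(24 * 30)
temps_f = 50 + 18 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)

# clean and restructure: harmonize units (Fahrenheit -> Celsius)
temps_c = (temps_f - 32) * 5 / 9

# model: persistence forecast, shift the series by one day (24 hours)
predicted = temps_c[:-24]
actual = temps_c[24:]

# verify prediction quality via the mean absolute error (in degrees Celsius)
mae = np.abs(predicted - actual).mean()
```

Any serious candidate model has to beat such a baseline on the verification step; otherwise it adds complexity without predictive value.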

Artificial Intelligence

Artificial intelligence is, to some extent, a buzzword. It is used for computer programs that do things we consider intelligent. Examples are image classification (what is shown in the image?), language processing (translate a text), and autonomous driving (orient and move in a complex environment). Under the hood there is still a classical computer program, no intelligence.

Most, if not all, methods related to artificial intelligence are based on processing large data sets. Image and language processing systems are trained on large data sets of sample images and sample texts. Autonomous driving uses reinforcement learning, which can be understood as collecting large amounts of data while exploiting information extracted from previously collected data (data collection on demand). In this sense, artificial intelligence is a subfield of data science. In this book we also cover this vague field of artificial intelligence, including reinforcement learning.

There is also a more formal criterion for artificial intelligence: a computer system is called intelligent if it passes the Turing test. In the Turing test, a human chats with another human and with the computer system in parallel. If the human cannot decide which of the two chat partners is human, the computer has passed the test. Up to now, no computer has passed the Turing test. If interested, have a look at Wikipedia’s article on the Turing test.

Machine Learning

By machine learning we denote the process of writing computer programs that ‘learn’ to do something from data. In other words, we set up a model with lots of unknowns and then fit the model to our data. So machine learning refers to a style of software development: we do not write a program line by line. Instead, we use a general-purpose program and fill in the details automatically based on some data set.
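As a minimal illustration of this view (with made-up numbers), fitting a straight line is already machine learning in this sense: the ‘program’ y = a·x + b has two unknowns, and the data fills them in:

```python
import numpy as np

# a tiny "program" with two unknowns a and b: y = a * x + b
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0            # data; in practice this comes from measurements

# 'learning': fill in the unknowns by fitting the model to the data
# via a least-squares fit of a degree-1 polynomial
a, b = np.polyfit(x, y, deg=1)
```

The models in later chapters have thousands or millions of unknowns instead of two, but the principle is the same.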

Machine learning may be regarded as the hard core of data science and artificial intelligence, the part in which most of the mathematics is concentrated.

[Image: line drawing showing a pile of linear algebra and two persons putting data through the pile]

Fig. 13 The pile gets soaked with data and starts to get mushy over time, so it’s technically recurrent. Source: Randall Munroe, xkcd.com/1838