A good way to answer this question is to think of its older, well-established "relatives."
📈 Statistics - A mathematical discipline concerned with the collection, description, and interpretation of data.
🖥️ Computer Science - The study of algorithms, data structures and programming methodologies.
ⓘ Information Science - A field primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
Key components of data science include:
Data Collection: Gathering data from various sources, which can be structured (e.g., databases) or unstructured (e.g., text, images, sensor data).
Data Cleaning and Preprocessing: Preparing the data for analysis by handling missing values, outliers, and ensuring data quality.
Data Analysis: Applying statistical and computational techniques to explore, analyze, and model the data, often using programming languages like Python or R.
Data Visualization: Creating meaningful visual representations of data through charts, graphs, and dashboards to facilitate better understanding and communication of insights.
Machine Learning: Developing and applying machine learning algorithms to build predictive models, classification systems, or recommendation systems.
Data Interpretation: Drawing actionable conclusions from data analysis, often to inform business decisions, optimize processes, or address specific problems.
Domain Expertise: Combining data expertise with domain-specific knowledge to generate valuable insights tailored to specific industries or applications.
Data science is an interdisciplinary field that combines various techniques, processes, algorithms, and systems to extract valuable insights and knowledge from data. It involves collecting, cleaning, analyzing, visualizing, and interpreting data to make informed decisions and solve complex problems.
Data scientists work with large and complex datasets to uncover patterns, trends, and hidden insights that can drive business growth, scientific discoveries, and decision-making across various domains such as finance, healthcare, marketing, and more. The field of data science continues to evolve rapidly, incorporating advancements in technology, data management, and machine learning to harness the power of data for practical purposes.
Our course provides an introduction to modeling and machine learning methods.
Module 1: Managing, Manipulating and Generating Data
Module 2: Describing Data
Module 3: Introduction to Modeling - Regression
Module 4: Introduction to Modeling - Classification
Module 5: Unsupervised Learning - Clustering
For coding, we use the Python programming language and specialized libraries. The following is a tiny example that does not require any library imports.
A = ['red', 'orange', 'blue', 'green', 'yellow']
B = ['apple', 'carrot', 'peach', 'mango']

def f(A, B):
    # Pair every element of A with every element of B into a single string.
    l = []
    for x in A:
        for y in B:
            l.append(x + ' ' + y)
    return l
The function definition above pairs every color in A with every fruit in B.
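A quick check of the result, assuming the lists and the function above have been defined:

combos = f(A, B)
print(len(combos))   # 5 colors x 4 fruits = 20 pairings
print(combos[:3])    # ['red apple', 'red carrot', 'red peach']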
We can imagine that variables are measurements or attributes.
The idea is that we want to find what is "typical" for a data set.
What are the differences between the two following data sets?
1, 2, 3, 4, 4, 4, 5
What is the difference between the two samples?
Which measures of central tendency describe the data well?
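As a minimal sketch, Python's built-in statistics module computes the usual measures of central tendency for the first sample:

import statistics

sample = [1, 2, 3, 4, 4, 4, 5]
print(statistics.mean(sample))     # arithmetic mean: 23/7, about 3.29
print(statistics.median(sample))   # middle value of the sorted sample: 4
print(statistics.mode(sample))     # most frequent value: 4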
Histograms show the overall distribution, that is, how frequently the data occur in certain intervals.
The shape of the distribution matters for estimating probability values.
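As an illustration, assuming Matplotlib is installed, a histogram simply counts how many observations fall in each interval:

import matplotlib.pyplot as plt

sample = [1, 2, 3, 4, 4, 4, 5]
plt.hist(sample, bins=5)   # count observations in 5 equal-width intervals
plt.xlabel('value')
plt.ylabel('frequency')
plt.show()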
We want to know how spread the data is from its center or typical value.
If the data is a continuous variable and the center is defined to be the mean, then variance and standard deviation are measures of variability:
$$\operatorname{Var}(X) := \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
Here $\mu$ is the mean of the random variable, and $N$ is the size of the population. For a sample, the variance is:
$$\operatorname{Var}(x) := \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Here $\bar{x}$ is the mean of the sample, and $n$ is the sample size.
The standard deviation can help us create a useful metric for the variable studied, and also scale its values.
The sample standard deviation is:
$$s_x := \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
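A minimal sketch of these formulas, using Python's statistics module (pvariance divides by N, while variance and stdev divide by n − 1):

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
print(statistics.pvariance(x))   # population variance: divides by N
print(statistics.variance(x))    # sample variance: divides by n - 1
print(statistics.stdev(x))       # sample standard deviation s_x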
Therefore, values can be standardized (scaled) via the z-score:
$$z_{x_i} = \frac{x_i - \bar{x}}{s_x}$$
This works in many situations, provided the standard deviation is not zero.
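A minimal sketch of the computation for the first sample:

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
xbar = statistics.mean(x)
s = statistics.stdev(x)
z = [(xi - xbar) / s for xi in x]   # each value expressed in standard deviations from the mean
print([round(zi, 2) for zi in z])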
Critical Thinking: What would the data look like if the standard deviation were exactly zero?
For the data below, we want a measure that is less biased when highlighting differences between the samples.
1, 2, 3, 4, 4, 4, 5
The coefficient of variation may shed some light:
$$\mathrm{CV} := \frac{s_x}{\bar{x}}$$
For the first sample this is 0.388, and for the second sample this is 2.439.
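A minimal sketch of this computation for the first sample; note that using the population versus the sample standard deviation changes the value slightly:

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
cv = statistics.pstdev(x) / statistics.mean(x)   # coefficient of variation: standard deviation over mean
print(round(cv, 3))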
Main point: we also want to work with measures that are resistant to outliers.
Percentiles: The p-th percentile is a value x, on the scale of a variable measured at least at the interval level, such that p% of the values of the variable fall below x.
Median: The median is the 50th percentile.
Quartiles: Q1 is the 25th percentile, and Q3 is the 75th percentile.
IQR (Inter-quartile Range): Q3 - Q1.
Outliers: Any value more than 1.5×IQR below Q1 or more than 1.5×IQR above Q3.
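A minimal sketch of these quantities, assuming NumPy is available (NumPy's default percentile interpolation is only one of several quartile conventions):

import numpy as np

x = [1, 2, 3, 4, 4, 4, 5]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers
print(q1, q3, iqr, lower_fence, upper_fence)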
Can we provide a summary of a variable based on these ideas? Yes, we can create boxplots.
Consider the data sample:
1.5, 1.75, 2, 2.13, 2.42, 2.61, 3.1, 3.1, 3.4, 3.98, 3.99
The boxplot displays the median, the quartiles, the whiskers, and any outliers for this sample. For full details on Matplotlib's boxplot function, see the Matplotlib documentation.
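A minimal sketch, assuming Matplotlib is installed:

import matplotlib.pyplot as plt

data = [1.5, 1.75, 2, 2.13, 2.42, 2.61, 3.1, 3.1, 3.4, 3.98, 3.99]
plt.boxplot(data)   # box from Q1 to Q3, line at the median, whiskers out to 1.5 * IQR
plt.ylabel('value')
plt.show()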
The justification for the "minimum" and "maximum" whisker values in a boxplot is drawn from the normal distribution.
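As a rough justification (assuming the data are approximately normal): for a normal distribution the quartiles sit at roughly $\mu \pm 0.674\,\sigma$, so
$$\mathrm{IQR} \approx 1.35\,\sigma, \qquad Q_1 - 1.5\times\mathrm{IQR} \approx \mu - 2.7\,\sigma, \qquad Q_3 + 1.5\times\mathrm{IQR} \approx \mu + 2.7\,\sigma,$$
and only about 0.7% of normally distributed values fall beyond these whisker limits, which is why points outside them are treated as unusual.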