A good way to answer this question is to think of its older, well-established "relatives."
📈 Statistics - A mathematical discipline concerned with the collection, description, and interpretation of data.
🖥️ Computer Science - The study of algorithms, data structures and programming methodologies.
ⓘ Information Science - A field primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
Key components of data science include:
Data Collection: Gathering data from various sources, which can be structured (e.g., databases) or unstructured (e.g., text, images, sensor data).
Data Cleaning and Preprocessing: Preparing the data for analysis by handling missing values, outliers, and ensuring data quality.
Data Analysis: Applying statistical and computational techniques to explore, analyze, and model the data, often using programming languages like Python or R.
Data Visualization: Creating meaningful visual representations of data through charts, graphs, and dashboards to facilitate better understanding and communication of insights.
Machine Learning: Developing and applying machine learning algorithms to build predictive models, classification systems, or recommendation systems.
Data Interpretation: Drawing actionable conclusions from data analysis, often to inform business decisions, optimize processes, or address specific problems.
Domain Expertise: Combining data expertise with domain-specific knowledge to generate valuable insights tailored to specific industries or applications.
Data science is an interdisciplinary field that combines various techniques, processes, algorithms, and systems to extract valuable insights and knowledge from data. It involves collecting, cleaning, analyzing, visualizing, and interpreting data to make informed decisions and solve complex problems.
Data scientists work with large and complex datasets to uncover patterns, trends, and hidden insights that can drive business growth, scientific discoveries, and decision-making across various domains such as finance, healthcare, marketing, and more. The field of data science continues to evolve rapidly, incorporating advancements in technology, data management, and machine learning to harness the power of data for practical purposes.
Our course provides an introduction to modeling and machine learning methods.
Module 1: Managing, Manipulating and Generating Data
Module 2: Describing Data
Module 3: Introduction to Modeling - Regression
Module 4: Introduction to Modeling - Classification
Module 5: Unsupervised Learning - Clustering
For coding, we use the Python programming language and specialized libraries. The following is a tiny example that does not require any library imports.
A = ['red', 'orange', 'blue', 'green', 'yellow']
B = ['apple', 'carrot', 'peach', 'mango']

def f(A, B):
    # Pair every element of A with every element of B into a single string.
    l = []
    for x in A:
        for y in B:
            l.append(x + ' ' + y)
    return l
The function definition above pairs every color in A with every fruit in B.
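A quick check of the result, assuming the lists and the function above have been defined:

combos = f(A, B)
print(len(combos))   # 5 colors x 4 fruits = 20 pairings
print(combos[:3])    # ['red apple', 'red carrot', 'red peach']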
We can imagine that variables are measurements or attributes.
The idea is that we want to find what is "typical" for a data set.
What are the differences between the two following data sets?
1, 2, 3, 4, 4, 4, 5
What is the difference between the two samples?
Which measures of central tendency describe the data well?
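As a minimal sketch, Python's built-in statistics module computes the usual measures of central tendency for the first sample:

import statistics

sample = [1, 2, 3, 4, 4, 4, 5]
print(statistics.mean(sample))     # arithmetic mean: 23/7, about 3.29
print(statistics.median(sample))   # middle value of the sorted sample: 4
print(statistics.mode(sample))     # most frequent value: 4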
Histograms show the overall distribution, that is, how frequently the data occur in certain intervals.
The shape of the distribution matters for estimating probability values.
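As an illustration, assuming Matplotlib is installed, a histogram simply counts how many observations fall in each interval:

import matplotlib.pyplot as plt

sample = [1, 2, 3, 4, 4, 4, 5]
plt.hist(sample, bins=5)   # count observations in 5 equal-width intervals
plt.xlabel('value')
plt.ylabel('frequency')
plt.show()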
We want to know how spread the data is from its center or typical value.
If the data is a continuous variable and the center is defined to be the mean, then variance and standard deviation are measures of variability:
$$\operatorname{Var}(X) := \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$$
Here $\mu$ is the mean of the random variable, and $N$ is the size of the population. For a sample, the variance is:
$$\operatorname{Var}(x) := \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
Here $\bar{x}$ is the mean of the sample, and $n$ is the sample size.
The standard deviation can help us create a useful metric for the variable studied, and also scale its values.
The sample standard deviation is:
$$s_x := \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
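A minimal sketch of these formulas, using Python's statistics module (pvariance divides by N, while variance and stdev divide by n − 1):

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
print(statistics.pvariance(x))   # population variance: divides by N
print(statistics.variance(x))    # sample variance: divides by n - 1
print(statistics.stdev(x))       # sample standard deviation s_x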
Therefore, values can be standardized (scaled) via the z-score:
$$z_{x_i} = \frac{x_i - \bar{x}}{s_x}$$
This works in many situations, provided the standard deviation is not zero.
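A minimal sketch of the computation for the first sample:

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
xbar = statistics.mean(x)
s = statistics.stdev(x)
z = [(xi - xbar) / s for xi in x]   # each value expressed in standard deviations from the mean
print([round(zi, 2) for zi in z])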
Critical Thinking: What would the data look like if the standard deviation were exactly zero?
For the data below, we want a measure that is less biased when highlighting differences between the samples.
1, 2, 3, 4, 4, 4, 5
The coefficient of variation may shed some light:
$$\mathrm{CV} := \frac{s_x}{\bar{x}}$$
For the first sample this is 0.388, and for the second sample this is 2.439.
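A minimal sketch of this computation for the first sample; note that using the population versus the sample standard deviation changes the value slightly:

import statistics

x = [1, 2, 3, 4, 4, 4, 5]
cv = statistics.pstdev(x) / statistics.mean(x)   # coefficient of variation: standard deviation over mean
print(round(cv, 3))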
Main point: we also want to work with measures that are resistant to outliers.
Percentiles: The p-th percentile is a value x, on the scale of a variable measured at least at the interval level, such that p% of the values of the variable fall below x.
Median: The median is the 50th percentile.
Quartiles: Q1 is the 25th percentile, and Q3 is the 75th percentile.
IQR (Inter-quartile Range): Q3 - Q1.
Outliers: Any value more than 1.5×IQR below Q1 or more than 1.5×IQR above Q3.
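A minimal sketch of these quantities, assuming NumPy is available (NumPy's default percentile interpolation is only one of several quartile conventions):

import numpy as np

x = [1, 2, 3, 4, 4, 4, 5]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers
print(q1, q3, iqr, lower_fence, upper_fence)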
Can we provide a summary of a variable based on these ideas? Yes, we can create boxplots.
Consider the data sample:
1.5, 1.75, 2, 2.13, 2.42, 2.61, 3.1, 3.1, 3.4, 3.98, 3.99
The boxplot displays the median, the quartiles, the whiskers, and any outliers for this sample. For full details on Matplotlib's boxplot function, see the Matplotlib documentation.
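A minimal sketch, assuming Matplotlib is installed:

import matplotlib.pyplot as plt

data = [1.5, 1.75, 2, 2.13, 2.42, 2.61, 3.1, 3.1, 3.4, 3.98, 3.99]
plt.boxplot(data)   # box from Q1 to Q3, line at the median, whiskers out to 1.5 * IQR
plt.ylabel('value')
plt.show()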
The justification for the "minimum" and "maximum" whisker values in a boxplot is drawn from the normal distribution.
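As a rough justification (assuming the data are approximately normal): for a normal distribution the quartiles sit at roughly $\mu \pm 0.674\,\sigma$, so
$$\mathrm{IQR} \approx 1.35\,\sigma, \qquad Q_1 - 1.5\times\mathrm{IQR} \approx \mu - 2.7\,\sigma, \qquad Q_3 + 1.5\times\mathrm{IQR} \approx \mu + 2.7\,\sigma,$$
and only about 0.7% of normally distributed values fall beyond these whisker limits, which is why points outside them are treated as unusual.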