Data Science Fundamentals
Expert-defined terms from the Professional Certificate in AI and Data Science in Pharma course at HealthCareStudies (An LSPM brand). Free to read, free to share, paired with a globally recognised certification pathway.
Data Science Fundamentals encompass the foundational concepts, techniques, and tools used to collect, process, analyze, and interpret data.
This glossary provides a comprehensive list of terms related to Data Science Fundamentals that are essential for professionals pursuing the Professional Certificate in AI and Data Science in Pharma.
1. Algorithm
An algorithm is a step-by-step procedure or formula for solving a problem or accomplishing a task. In data science, algorithms are used to process data, extract insights, and make predictions based on patterns in the data.
Example: The K-means clustering algorithm is commonly used in data science to group similar data points together.
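To make the K-means idea concrete, here is a minimal pure-Python sketch of the algorithm on one-dimensional data (the function name and the toy data are illustrative, not from a specific library): points are assigned to the nearest centroid, then each centroid moves to the mean of its cluster, and the two steps repeat.

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal 1-D K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups: values near 1 and values near 10.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.9, 10.1, 10.0])
print(centers)
```

Real implementations (e.g. in scikit-learn) add multiple random restarts and convergence checks; this sketch only shows the two alternating steps that define the algorithm.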
2. Big Data
Big Data refers to large and complex datasets that cannot be easily managed or processed with traditional tools. Big Data often includes vast amounts of unstructured data from various sources.
Example: Pharmaceutical companies analyze Big Data to identify trends in clinical trials and drug development.
3. Clustering
Clustering is a machine learning technique used to group similar data points together based on shared characteristics. Clustering helps identify patterns and relationships in data without the need for labeled datasets.
Example: Customer segmentation is a common application of clustering in marketing to target specific groups with personalized campaigns.
4. Data Cleaning
Data Cleaning involves identifying and correcting errors, inconsistencies, and missing values in a dataset. Data Cleaning is a critical step in the data preprocessing pipeline.
Example: Removing duplicate records and correcting formatting issues are common tasks in the Data Cleaning process.
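The two tasks named in the example can be sketched in a few lines of plain Python (the records here are hypothetical): exact duplicates are dropped, and inconsistent whitespace, casing, and types are normalized.

```python
# Hypothetical raw records with one duplicate and messy formatting.
records = [
    {"id": 1, "name": " Alice ", "dose_mg": "50"},
    {"id": 2, "name": "BOB", "dose_mg": "25"},
    {"id": 1, "name": " Alice ", "dose_mg": "50"},  # exact duplicate
]

seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        continue                                # skip exact duplicates
    seen.add(key)
    cleaned.append({
        "id": r["id"],
        "name": r["name"].strip().title(),      # fix whitespace and case
        "dose_mg": int(r["dose_mg"]),           # fix the stored type
    })

print(cleaned)
```

In practice this is usually done with a library such as pandas (`drop_duplicates`, string accessors), but the logic is the same.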
5. Data Mining
Data Mining is the process of discovering patterns, trends, and insights from large datasets. Data Mining helps extract valuable knowledge from data for decision-making.
Example: Retailers use Data Mining to analyze customer purchase history and recommend products based on past behavior.
6. Data Preprocessing
Data Preprocessing involves transforming raw data into a clean, organized format suitable for analysis. Data Preprocessing includes steps such as Data Cleaning, Data Transformation, and Feature Engineering.
Example: Scaling numerical features and one-hot encoding categorical variables are common Data Preprocessing techniques.
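The scaling step mentioned in the example can be sketched as min-max scaling, which maps a numeric feature onto the [0, 1] range (the function name and values are illustrative):

```python
def min_max_scale(values):
    """Rescale numeric values linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 40])
print(scaled)
```

Other common choices are standardization (subtract the mean, divide by the standard deviation) or robust scaling; which one fits depends on the model and the data distribution.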
7. Data Visualization
Data Visualization is the graphical representation of data to communicate insights effectively. Data Visualization tools help analysts present complex information in a visual format that is easy to understand.
Example: Using a bar chart to display sales performance by region allows stakeholders to quickly identify trends and outliers.
8. Decision Tree
A Decision Tree is a predictive modeling technique that maps out possible outcomes of a series of decisions in a tree-like structure. Decision Trees are commonly used in classification and regression tasks in data science.
Example: A Decision Tree can be used to predict whether a customer will churn based on factors such as purchase history and customer satisfaction.
9. Feature Engineering
Feature Engineering involves creating new features or transforming existing features to improve model performance. Feature Engineering plays a crucial role in extracting relevant information from data.
Example: Combining multiple features to create an interaction term can enhance the predictive power of a model.
10. Hypothesis Testing
Hypothesis Testing is a statistical method used to make inferences about a population based on sample data. Hypothesis Testing involves setting up null and alternative hypotheses and using statistical tests to evaluate the evidence.
Example: A pharmaceutical company conducts Hypothesis Testing to determine if a new drug treatment is more effective than the current standard of care.
11. Linear Regression
Linear Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation. Linear Regression aims to find the best-fit line that minimizes the sum of squared errors.
Example: Predicting house prices based on features such as square footage, number of bedrooms, and location using Linear Regression.
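For the single-feature case, the best-fit line has a closed-form solution. A minimal sketch (function name and data are illustrative; the toy data lie exactly on y = 1 + 2x):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns the intercept
    and slope that minimize the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

With several features, the same idea generalizes to the normal equations or to gradient descent, as implemented in libraries such as scikit-learn.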
12. Machine Learning
Machine Learning is a subset of artificial intelligence that focuses on developing systems that learn from data without being explicitly programmed. Machine Learning algorithms can be classified as supervised, unsupervised, or reinforcement learning.
Example: Training a machine learning model to classify emails as spam or non-spam based on text content and sender information.
13. Neural Network
A Neural Network is a computational model inspired by the structure and function of the human brain. Neural Networks consist of interconnected nodes (neurons) organized in layers that process input data and generate output predictions.
Example: Image recognition tasks often use Convolutional Neural Networks (CNNs) to classify objects in photos.
14. Overfitting
Overfitting occurs when a machine learning model performs well on training data but fails to generalize to new, unseen data. Overfitting usually occurs when a model is too complex or when noise in the training data is mistaken for patterns.
Example: A decision tree with too many branches may overfit the training data and perform poorly on test data.
15. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform correlated variables into a smaller set of uncorrelated components. PCA identifies the principal components that explain the most variance in the data.
Example: Reducing the dimensionality of features in a dataset using PCA can help visualize data and improve model performance.
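A minimal sketch of the core of PCA in two dimensions, assuming nothing beyond the standard library: center the data, build the covariance matrix, and use power iteration to find the direction of maximum variance (the first principal component). The function name and toy points are illustrative.

```python
def first_principal_component(data, iters=100):
    """Power iteration on the 2x2 covariance matrix to find the
    direction of maximum variance (the first principal component)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Covariance matrix entries (population covariance).
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread roughly along y = x, so the first component
# should point close to (1/sqrt(2), 1/sqrt(2)).
pc = first_principal_component([(1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)])
print(pc)
```

Library implementations (numpy's `eigh`, scikit-learn's `PCA`) compute all components at once via eigendecomposition or SVD; power iteration is shown here only because it makes the "direction of maximum variance" idea explicit.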
16. Regression Analysis
Regression Analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. Regression Analysis helps understand the impact of independent variables on the dependent variable.
Example: Predicting the sales revenue of a pharmaceutical company based on advertising expenditure and market conditions using Regression Analysis.
17. Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. SVM aims to find the hyperplane that best separates data points into different classes while maximizing the margin between classes.
Example: Using an SVM to classify patient data into different disease categories based on clinical features.
18. Time Series Analysis
Time Series Analysis is a statistical method used to analyze and forecast time-dependent data points. Time Series Analysis helps identify patterns, trends, and seasonality in sequential data.
Example: Forecasting stock prices based on historical data using Time Series Analysis techniques like ARIMA.
19. Unsupervised Learning
Unsupervised Learning is a machine learning paradigm where algorithms learn patterns from unlabeled data. Unsupervised Learning includes techniques such as clustering and dimensionality reduction.
Example: Identifying customer segments based on purchasing behavior using Unsupervised Learning techniques.
20. Validation
Validation is the process of assessing the performance and generalization capability of a machine learning model on data not used during training. Validation techniques help evaluate model accuracy and prevent overfitting.
Example: Splitting a dataset into training and testing sets to validate a regression model's performance on new data.
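The train/test split in the example can be sketched in plain Python (the function name and the 75/25 ratio are illustrative choices): shuffle the data, then hold out a fraction that the model never sees during training.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the data and hold out a fraction for testing, so the
    model is evaluated on examples it never saw during training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible; scikit-learn's `train_test_split` offers the same idea plus stratification by class label.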
21. Feature Selection
Feature Selection is the process of identifying and selecting the most relevant features for model training. Feature Selection helps improve model performance, reduce complexity, and avoid overfitting.
Example: Using feature importance scores from a random forest model to select the most influential features for prediction.
22. Ensemble Learning
Ensemble Learning is a machine learning technique that combines multiple models to improve predictive performance. Ensemble methods such as bagging, boosting, and stacking leverage the wisdom of crowds to make more accurate predictions.
Example: Training multiple decision trees and combining their predictions using a voting mechanism in a Random Forest ensemble.
23. Anomaly Detection
Anomaly Detection is the process of identifying unusual patterns or outliers in data that deviate from expected behavior. Anomaly Detection helps detect fraud, errors, or abnormalities in datasets.
Example: Monitoring network traffic to detect unusual activity or security breaches using Anomaly Detection algorithms.
24. Deep Learning
Deep Learning is a subset of machine learning that uses artificial neural networks with many layers to learn representations directly from data. Deep Learning has revolutionized areas such as image recognition, speech recognition, and natural language processing.
Example: Training a deep neural network to classify images of handwritten digits in the MNIST dataset.
25. Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP applications include sentiment analysis, chatbots, and machine translation.
Example: Analyzing customer reviews to extract sentiment and identify key topics using Natural Language Processing techniques.
26. Reinforcement Learning
Reinforcement Learning is a machine learning paradigm where agents learn to make decisions by interacting with an environment and receiving rewards or penalties. Reinforcement Learning is used in gaming, robotics, and autonomous driving.
Example: Training an agent to play a game by rewarding successful moves and penalizing mistakes in a reinforcement learning framework.
27. Cross-Validation
Cross-Validation is a technique used to evaluate the performance and generalization ability of a machine learning model by splitting the data into multiple subsets for training and testing. Cross-Validation helps assess model stability and prevent overfitting.
Example: Performing 5-fold cross-validation on a regression model to estimate its performance on unseen data.
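The fold-splitting at the heart of k-fold cross-validation can be sketched as follows (the function name is illustrative; for simplicity the sketch assumes n is divisible by k and does not shuffle): each fold serves as the test set once while the remaining folds form the training set.

```python
def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for each of k folds.
    Assumes n is divisible by k; real splitters handle remainders."""
    indices = list(range(n))
    fold_size = n // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

folds = list(k_fold_indices(10, k=5))
print(folds[0])
```

In practice one would train the model on each `train` slice, score it on the matching `test` slice, and report the mean score across folds; scikit-learn's `KFold` and `cross_val_score` automate this.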
28. Hyperparameter Tuning
Hyperparameter Tuning is the process of optimizing the hyperparameters of a machine learning model to improve its performance. Hyperparameters control the learning process and model complexity.
Example: Tuning the learning rate and batch size of a neural network to maximize training efficiency and accuracy.
29. Regularization
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the cost function. Regularization methods like L1 (Lasso) and L2 (Ridge) regularization help control model complexity.
Example: Adding a regularization term to the cost function of a linear regression model to avoid overfitting.
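The example can be made concrete with a sketch of an L2-regularized (ridge) cost: mean squared error plus a penalty proportional to the squared weights. The function name, the toy data, and the penalty strength `lam` are all illustrative.

```python
def ridge_cost(weights, xs, ys, lam):
    """Mean squared error plus an L2 penalty on the weights.
    The penalty discourages large weights, shrinking model complexity."""
    n = len(xs)
    mse = sum((sum(w * x for w, x in zip(weights, row)) - y) ** 2
              for row, y in zip(xs, ys)) / n
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty

xs = [[1.0, 2.0], [1.0, 3.0]]
ys = [5.0, 7.0]
# Weights (1, 2) fit these points exactly, so the unregularized cost
# is 0 and the regularized cost is purely the penalty term.
print(ridge_cost([1.0, 2.0], xs, ys, 0.0))
print(ridge_cost([1.0, 2.0], xs, ys, 0.1))
```

L1 (Lasso) regularization replaces `w ** 2` with `abs(w)`, which additionally pushes some weights exactly to zero, performing implicit feature selection.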
30. Bias-Variance Tradeoff
The Bias-Variance Tradeoff is a key concept in machine learning that describes the balance between bias (underfitting) and variance (overfitting) in model performance. Minimizing both bias and variance leads to optimal model performance.
Example: Increasing the complexity of a model may reduce bias but increase variance, leading to a tradeoff between the two.
31. One-Hot Encoding
One-Hot Encoding is a technique used to convert categorical variables into numerical format by creating binary dummy variables for each category. One-Hot Encoding is commonly used in machine learning models that require numerical input.
Example: Encoding vehicle types (car, truck, motorcycle) as binary features (1, 0, 0), (0, 1, 0), (0, 0, 1) using One-Hot Encoding.
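The vehicle-type example can be sketched in a few lines (the function name is illustrative; note the category order here comes from sorting, so it may differ from the order shown in the example above):

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1.
    Categories are ordered alphabetically for reproducibility."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return encoded, categories

encoded, cats = one_hot_encode(["car", "truck", "motorcycle", "car"])
print(cats)
print(encoded)
```

Library encoders (scikit-learn's `OneHotEncoder`, pandas' `get_dummies`) additionally handle unseen categories and sparse output.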
32. Hyperplane
A Hyperplane is a multidimensional surface that separates data points belonging to different classes within a feature space. Support Vector Machines (SVM) aim to find the hyperplane that maximizes the margin between classes for optimal separation.
Example: In a two-class classification task, a hyperplane is a line that divides the feature space into two regions representing each class.
33. Eigenvalues and Eigenvectors
Eigenvalues and Eigenvectors are concepts from linear algebra used in Principal Component Analysis (PCA). Eigenvalues represent the variance of data along principal components, while eigenvectors indicate the directions of maximum variance.
Example: Finding the principal components of a dataset by calculating the eigenvalues and eigenvectors of the covariance matrix.
34. Precision and Recall
Precision and Recall are evaluation metrics used to assess the performance of classification models. Precision measures the proportion of true positive predictions among all positive predictions, while Recall measures the proportion of actual positives that the model detects.
Example: In a medical diagnosis task, high precision indicates that most predicted positive cases are true positives, while high recall means that most actual positive cases are detected.
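The two definitions translate directly into code (the function name and the toy labels are illustrative):

```python
def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted)
             if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted)
             if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted)
             if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
p, r = precision_recall(actual, predicted)
print(p, r)
```

Here the model makes 3 positive predictions of which 2 are correct (precision 2/3), and finds 2 of the 3 actual positives (recall 2/3). The F1 score, their harmonic mean, summarizes both in one number.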
35. A/B Testing
A/B Testing is a statistical method used to compare two versions of a product or feature to determine which performs better. A/B Testing involves dividing users into two groups and measuring the impact of changes on user behavior or outcomes.
Example: Testing two different website layouts to see which one generates more clicks and conversions using A/B Testing.
36. Bagging
Bagging (Bootstrap Aggregating) is an ensemble learning technique that combines the predictions of multiple models trained on bootstrapped samples of the data. Bagging is commonly used in Random Forest algorithms.
Example: Training multiple decision trees on bootstrapped samples of the data and aggregating their predictions in a bagging ensemble.
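To keep the sketch short, the "model" below is deliberately trivial (the sample mean) rather than a decision tree; the point is the bagging mechanics, which are the same either way: draw many bootstrap samples, fit a model on each, and average the predictions. Function names and data are illustrative.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bagged_mean_prediction(data, n_models=50, seed=0):
    """Fit a trivially simple 'model' (the sample mean) on many
    bootstrap samples and average the models' predictions."""
    rng = random.Random(seed)
    predictions = [sum(s) / len(s)
                   for s in (bootstrap_sample(data, rng)
                             for _ in range(n_models))]
    return sum(predictions) / n_models

estimate = bagged_mean_prediction([2.0, 4.0, 6.0, 8.0])
print(estimate)
```

With decision trees in place of the mean, and per-tree random feature subsets added, this becomes the Random Forest recipe.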
37. Batch Gradient Descent
Batch Gradient Descent is an optimization algorithm used to minimize the cost function of a model by iteratively updating its parameters. Batch Gradient Descent calculates the gradient of the cost function using the entire training dataset.
Example: Updating the weights of a neural network by computing the gradient of the loss function over the entire training set at each step in Batch Gradient Descent.
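A minimal sketch of batch gradient descent fitting a line y = w*x + b (function name, learning rate, and toy data are illustrative; the data lie exactly on y = 1 + 2x): at every step the gradient of the mean squared error is computed over the whole dataset, which is what distinguishes "batch" from stochastic or mini-batch variants.

```python
def batch_gradient_descent(xs, ys, lr=0.1, epochs=500):
    """Fit y = w*x + b by following the gradient of the mean squared
    error computed over the ENTIRE dataset at every step."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = batch_gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

Stochastic gradient descent instead updates after each single example, and mini-batch descent after small subsets, trading gradient accuracy for cheaper, more frequent updates.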
38. Bias
Bias is the error introduced by approximating a real-world problem with a simplified model that does not capture all relevant features. Bias leads to underfitting, where the model is too simple to capture the underlying patterns in the data.
Example: A linear regression model that assumes a linear relationship between variables may introduce bias if the true relationship is nonlinear.
39. Variance
Variance is the error introduced by a model being too sensitive to fluctuations in the training data. Variance measures how much the model's predictions vary across different training sets.
Example: A decision tree with many branches and leaf nodes may exhibit high variance if it memorizes noise in the training data.
40. Confusion Matrix
A Confusion Matrix is a table that visualizes the performance of a classification model by comparing predicted labels against actual labels. The Confusion Matrix contains true positive, true negative, false positive, and false negative counts.
Example: Evaluating a binary classification model by constructing a Confusion Matrix to analyze the model's true positive and false positive rates.
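The four cells of a binary confusion matrix are just four counters (the function name and toy labels are illustrative; 1 is taken as the positive class):

```python
def confusion_matrix(actual, predicted):
    """Count TP, FP, FN, TN for a binary classifier (positive = 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1          # correctly flagged positive
        elif a == 0 and p == 1:
            counts["fp"] += 1          # false alarm
        elif a == 1 and p == 0:
            counts["fn"] += 1          # missed positive
        else:
            counts["tn"] += 1          # correctly left alone
    return counts

cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(cm)
```

Precision, recall, accuracy, and the false positive rate mentioned in the example are all simple ratios of these four counts.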
41. Cost Function
A Cost Function is a mathematical function used to measure the error or loss between a model's predictions and the actual values. The goal of training a model is to minimize the cost function to improve predictive accuracy.
Example: Defining a cost function that penalizes large prediction errors to train a regression model for house price prediction.
42. Cross-Entropy
Cross-Entropy is a loss function used in classification tasks to measure the difference between predicted probabilities and true class labels. Cross-Entropy loss penalizes models more for incorrect predictions made with high confidence.
Example: Calculating the Cross-Entropy loss between predicted probabilities of class labels and true one-hot encoded labels in a multi-class classification task.
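With a one-hot true label, the cross-entropy sum collapses to the negative log of the probability the model assigned to the true class. A minimal sketch (function name and probability vectors are illustrative):

```python
import math

def cross_entropy(true_index, predicted_probs):
    """Negative log of the probability assigned to the true class.
    Low probability on the true class gives a large loss."""
    return -math.log(predicted_probs[true_index])

# True class is index 0 in both cases.
confident_right = cross_entropy(0, [0.9, 0.05, 0.05])
confident_wrong = cross_entropy(0, [0.1, 0.8, 0.1])
print(confident_right, confident_wrong)
```

The confidently wrong prediction is penalized far more heavily than the confidently right one, which is exactly the behavior the definition describes; production code also clips probabilities away from 0 to avoid `log(0)`.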
43. Decision Boundary
A Decision Boundary is a surface or line that separates different classes in a classification task. Decision Boundaries are determined by machine learning algorithms based on the features and parameters of the model.
Example: In a binary classification task, a decision boundary is the line that divides the feature space into regions corresponding to each class.
44. Deep Reinforcement Learning
Deep Reinforcement Learning combines deep learning techniques with reinforcement learning, enabling agents to learn directly from high-dimensional inputs. Deep Reinforcement Learning has achieved significant breakthroughs in game playing and robotics.
Example: Training a deep reinforcement learning agent to play video games by rewarding successful actions and penalizing mistakes.
45. Dropout
Dropout is a regularization technique used in neural networks to prevent overfitting by randomly deactivating a fraction of neurons during training. Dropout helps improve model generalization and robustness.
Example: Applying dropout to hidden layers in a neural network during training.
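A minimal sketch of the common "inverted dropout" formulation on a list of activations (function name, rate, and data are illustrative): during training, each activation is zeroed with probability `rate` and the survivors are scaled up so the expected activation is unchanged; at inference the layer passes values through untouched.

```python
import random

def dropout(activations, rate=0.5, seed=0, training=True):
    """Inverted dropout: zero each activation with probability `rate`
    and scale survivors by 1/(1-rate) so the expected value is
    unchanged. At inference time, pass activations through as-is."""
    if not training:
        return activations[:]
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

out = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5)
print(out)
```

Deep learning frameworks (e.g. `torch.nn.Dropout`) implement the same scheme on tensors and switch the behavior automatically between training and evaluation modes.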