Data Science Fundamentals
Expert-defined terms from the Professional Certificate in AI and Data Science in Pharma course at HealthCareStudies (An LSPM brand). Free to read, free to share, paired with a globally recognised certification pathway.
Data Science Fundamentals encompass the foundational concepts, techniques, and tools used to collect, process, analyze, and interpret data.
This glossary provides a comprehensive list of terms related to Data Science Fundamentals that are essential for professionals pursuing the Professional Certificate in AI and Data Science in Pharma.
1. Algorithm
An algorithm is a step-by-step procedure or formula for solving a problem or accomplishing a task. In data science, algorithms are used to process data, extract insights, and make predictions based on patterns in the data.
Example: The K-means clustering algorithm is commonly used in data science to group similar data points together.
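To make the K-means idea concrete, here is a minimal pure-Python sketch of the algorithm on one-dimensional data (the function name and the toy data are illustrative, not from a specific library): points are assigned to the nearest centroid, then each centroid moves to the mean of its cluster, and the two steps repeat.

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal 1-D K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups: values near 1 and values near 10.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.9, 10.1, 10.0])
print(centers)
```

Real implementations (e.g. in scikit-learn) add multiple random restarts and convergence checks; this sketch only shows the two alternating steps that define the algorithm.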
2. Big Data
Big Data refers to large and complex datasets that cannot be easily managed or processed with traditional tools. Big Data often includes vast amounts of unstructured data from various sources.
Example: Pharmaceutical companies analyze Big Data to identify trends in clinical trials and drug development.
3. Clustering
Clustering is a machine learning technique used to group similar data points together based on shared characteristics. Clustering helps identify patterns and relationships in data without the need for labeled datasets.
Example: Customer segmentation is a common application of clustering in marketing to target specific groups with personalized campaigns.
4. Data Cleaning
Data Cleaning involves identifying and correcting errors, inconsistencies, and missing values in a dataset. Data Cleaning is a critical step in the data preprocessing pipeline.
Example: Removing duplicate records and correcting formatting issues are common tasks in the Data Cleaning process.
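The two tasks named in the example can be sketched in a few lines of plain Python (the records here are hypothetical): exact duplicates are dropped, and inconsistent whitespace, casing, and types are normalized.

```python
# Hypothetical raw records with one duplicate and messy formatting.
records = [
    {"id": 1, "name": " Alice ", "dose_mg": "50"},
    {"id": 2, "name": "BOB", "dose_mg": "25"},
    {"id": 1, "name": " Alice ", "dose_mg": "50"},  # exact duplicate
]

seen, cleaned = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        continue                                # skip exact duplicates
    seen.add(key)
    cleaned.append({
        "id": r["id"],
        "name": r["name"].strip().title(),      # fix whitespace and case
        "dose_mg": int(r["dose_mg"]),           # fix the stored type
    })

print(cleaned)
```

In practice this is usually done with a library such as pandas (`drop_duplicates`, string accessors), but the logic is the same.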
5. Data Mining
Data Mining is the process of discovering patterns, trends, and insights from large datasets. Data Mining helps extract valuable knowledge from data for decision-making.
Example: Retailers use Data Mining to analyze customer purchase history and recommend products based on past behavior.
6. Data Preprocessing
Data Preprocessing involves transforming raw data into a clean, organized format suitable for analysis. Data Preprocessing includes steps such as Data Cleaning, Data Transformation, and Feature Engineering.
Example: Scaling numerical features and one-hot encoding categorical variables are common Data Preprocessing techniques.
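The scaling step mentioned in the example can be sketched as min-max scaling, which maps a numeric feature onto the [0, 1] range (the function name and values are illustrative):

```python
def min_max_scale(values):
    """Rescale numeric values linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30, 40])
print(scaled)
```

Other common choices are standardization (subtract the mean, divide by the standard deviation) or robust scaling; which one fits depends on the model and the data distribution.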
7. Data Visualization
Data Visualization is the graphical representation of data to communicate insights effectively. Data Visualization tools help analysts present complex information in a visual format that is easy to understand.
Example: Using a bar chart to display sales performance by region allows stakeholders to quickly identify trends and outliers.
8. Decision Tree
A Decision Tree is a predictive modeling technique that maps out possible outcomes of a series of decisions in a tree-like structure. Decision Trees are commonly used in classification and regression tasks in data science.
Example: A Decision Tree can be used to predict whether a customer will churn based on factors such as purchase history and customer satisfaction.
9. Feature Engineering
Feature Engineering involves creating new features or transforming existing features to improve model performance. Feature Engineering plays a crucial role in extracting relevant information from data.
Example: Combining multiple features to create an interaction term can enhance the predictive power of a model.
10. Hypothesis Testing
Hypothesis Testing is a statistical method used to make inferences about a population based on sample data. Hypothesis Testing involves setting up null and alternative hypotheses and using statistical tests to evaluate the evidence.
Example: A pharmaceutical company conducts Hypothesis Testing to determine if a new drug treatment is more effective than the current standard of care.
11. Linear Regression
Linear Regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation. Linear Regression aims to find the best-fit line that minimizes the sum of squared errors.
Example: Predicting house prices based on features such as square footage, number of bedrooms, and location using Linear Regression.
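For the single-feature case, the best-fit line has a closed-form solution. A minimal sketch (function name and data are illustrative; the toy data lie exactly on y = 1 + 2x):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns the intercept
    and slope that minimize the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

With several features, the same idea generalizes to the normal equations or to gradient descent, as implemented in libraries such as scikit-learn.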
12. Machine Learning
Machine Learning is a subset of artificial intelligence that focuses on developing systems that learn from data without being explicitly programmed. Machine Learning algorithms can be classified as supervised, unsupervised, or reinforcement learning.
Example: Training a machine learning model to classify emails as spam or non-spam based on text content and sender information.
13. Neural Network
A Neural Network is a computational model inspired by the structure and function of the human brain. Neural Networks consist of interconnected nodes (neurons) organized in layers that process input data and generate output predictions.
Example: Image recognition tasks often use Convolutional Neural Networks (CNNs) to classify objects in photos.
14. Overfitting
Overfitting occurs when a machine learning model performs well on training data but fails to generalize to new, unseen data. Overfitting usually occurs when a model is too complex or when noise in the training data is mistaken for patterns.
Example: A decision tree with too many branches may overfit the training data and perform poorly on test data.
15. Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform correlated variables into a smaller set of uncorrelated components. PCA identifies the principal components that explain the most variance in the data.
Example: Reducing the dimensionality of features in a dataset using PCA can help visualize data and improve model performance.
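A minimal sketch of the core of PCA in two dimensions, assuming nothing beyond the standard library: center the data, build the covariance matrix, and use power iteration to find the direction of maximum variance (the first principal component). The function name and toy points are illustrative.

```python
def first_principal_component(data, iters=100):
    """Power iteration on the 2x2 covariance matrix to find the
    direction of maximum variance (the first principal component)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Covariance matrix entries (population covariance).
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread roughly along y = x, so the first component
# should point close to (1/sqrt(2), 1/sqrt(2)).
pc = first_principal_component([(1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)])
print(pc)
```

Library implementations (numpy's `eigh`, scikit-learn's `PCA`) compute all components at once via eigendecomposition or SVD; power iteration is shown here only because it makes the "direction of maximum variance" idea explicit.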
16. Regression Analysis
Regression Analysis is a statistical method used to model the relationship between a dependent variable and one or more independent variables. Regression Analysis helps understand the impact of independent variables on the dependent variable.
Example: Predicting the sales revenue of a pharmaceutical company based on advertising expenditure and market conditions using Regression Analysis.
17. Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. SVM aims to find the hyperplane that best separates data points into different classes while maximizing the margin between classes.
Example: Using an SVM to classify patient data into different disease categories based on clinical features.
18. Time Series Analysis
Time Series Analysis is a statistical method used to analyze and forecast time-dependent data points. Time Series Analysis helps identify patterns, trends, and seasonality in sequential data.
Example: Forecasting stock prices based on historical data using Time Series Analysis techniques like ARIMA.
19. Unsupervised Learning
Unsupervised Learning is a machine learning paradigm where algorithms learn patterns from unlabeled data. Unsupervised Learning includes techniques such as clustering and dimensionality reduction.
Example: Identifying customer segments based on purchasing behavior using Unsupervised Learning techniques.
20. Validation
Validation is the process of assessing the performance and generalization capability of a machine learning model on data not used during training. Validation techniques help evaluate model accuracy and prevent overfitting.
Example: Splitting a dataset into training and testing sets to validate a regression model's performance on new data.
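The train/test split in the example can be sketched in plain Python (the function name and the 75/25 ratio are illustrative choices): shuffle the data, then hold out a fraction that the model never sees during training.

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the data and hold out a fraction for testing, so the
    model is evaluated on examples it never saw during training."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```

Fixing the seed makes the split reproducible; scikit-learn's `train_test_split` offers the same idea plus stratification by class label.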
21. Feature Selection
Feature Selection is the process of identifying and selecting the most relevant features for model training. Feature Selection helps improve model performance, reduce complexity, and avoid overfitting.
Example: Using feature importance scores from a random forest model to select the most influential features for prediction.
22. Ensemble Learning
Ensemble Learning is a machine learning technique that combines multiple models to improve predictive performance. Ensemble methods such as bagging, boosting, and stacking leverage the wisdom of crowds to make more accurate predictions.
Example: Training multiple decision trees and combining their predictions using a voting mechanism in a Random Forest ensemble.
23. Anomaly Detection
Anomaly Detection is the process of identifying unusual patterns or outliers in data that deviate from expected behavior. Anomaly Detection helps detect fraud, errors, or abnormalities in datasets.
Example: Monitoring network traffic to detect unusual activity or security breaches using Anomaly Detection algorithms.
24. Deep Learning
Deep Learning is a subset of machine learning that uses artificial neural networks with many layers to learn representations directly from data. Deep Learning has revolutionized areas such as image recognition, speech recognition, and natural language processing.
Example: Training a deep neural network to classify images of handwritten digits in the MNIST dataset.
25. Natural Language Processing (NLP)
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. NLP applications include sentiment analysis, chatbots, and machine translation.
Example: Analyzing customer reviews to extract sentiment and identify key topics using Natural Language Processing techniques.
26. Reinforcement Learning
Reinforcement Learning is a machine learning paradigm where agents learn to make decisions by interacting with an environment and receiving rewards or penalties. Reinforcement Learning is used in gaming, robotics, and autonomous driving.
Example: Training an agent to play a game by rewarding successful moves and penalizing mistakes in a reinforcement learning framework.
27. Cross-Validation
Cross-Validation is a technique used to evaluate the performance and generalization ability of a machine learning model by splitting the data into multiple subsets for training and testing. Cross-Validation helps assess model stability and prevent overfitting.
Example: Performing 5-fold cross-validation on a regression model to estimate its performance on unseen data.
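The fold-splitting at the heart of k-fold cross-validation can be sketched as follows (the function name is illustrative; for simplicity the sketch assumes n is divisible by k and does not shuffle): each fold serves as the test set once while the remaining folds form the training set.

```python
def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for each of k folds.
    Assumes n is divisible by k; real splitters handle remainders."""
    indices = list(range(n))
    fold_size = n // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

folds = list(k_fold_indices(10, k=5))
print(folds[0])
```

In practice one would train the model on each `train` slice, score it on the matching `test` slice, and report the mean score across folds; scikit-learn's `KFold` and `cross_val_score` automate this.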
28. Hyperparameter Tuning
Hyperparameter Tuning is the process of optimizing the hyperparameters of a machine learning model to improve its performance. Hyperparameters control the learning process and model complexity.
Example: Tuning the learning rate and batch size of a neural network to maximize training efficiency and accuracy.
29. Regularization
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the cost function. Regularization methods like L1 (Lasso) and L2 (Ridge) regularization help control model complexity.
Example: Adding a regularization term to the cost function of a linear regression model to avoid overfitting.
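The example can be made concrete with a sketch of an L2-regularized (ridge) cost: mean squared error plus a penalty proportional to the squared weights. The function name, the toy data, and the penalty strength `lam` are all illustrative.

```python
def ridge_cost(weights, xs, ys, lam):
    """Mean squared error plus an L2 penalty on the weights.
    The penalty discourages large weights, shrinking model complexity."""
    n = len(xs)
    mse = sum((sum(w * x for w, x in zip(weights, row)) - y) ** 2
              for row, y in zip(xs, ys)) / n
    penalty = lam * sum(w ** 2 for w in weights)
    return mse + penalty

xs = [[1.0, 2.0], [1.0, 3.0]]
ys = [5.0, 7.0]
# Weights (1, 2) fit these points exactly, so the unregularized cost
# is 0 and the regularized cost is purely the penalty term.
print(ridge_cost([1.0, 2.0], xs, ys, 0.0))
print(ridge_cost([1.0, 2.0], xs, ys, 0.1))
```

L1 (Lasso) regularization replaces `w ** 2` with `abs(w)`, which additionally pushes some weights exactly to zero, performing implicit feature selection.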
30. Bias-Variance Tradeoff
The Bias-Variance Tradeoff is a key concept in machine learning that describes the balance between bias (underfitting) and variance (overfitting) in model performance. Minimizing both bias and variance leads to optimal model performance.
Example: Increasing the complexity of a model may reduce bias but increase variance, leading to a tradeoff between the two.
31. One-Hot Encoding
One-Hot Encoding is a technique used to convert categorical variables into numerical format by creating binary dummy variables for each category. One-Hot Encoding is commonly used in machine learning models that require numerical input.
Example: Encoding vehicle types (car, truck, motorcycle) as binary features (1, 0, 0), (0, 1, 0), (0, 0, 1) using One-Hot Encoding.
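The vehicle-type example can be sketched in a few lines (the function name is illustrative; note the category order here comes from sorting, so it may differ from the order shown in the example above):

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1.
    Categories are ordered alphabetically for reproducibility."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[v] == i else 0 for i in range(len(categories))]
               for v in values]
    return encoded, categories

encoded, cats = one_hot_encode(["car", "truck", "motorcycle", "car"])
print(cats)
print(encoded)
```

Library encoders (scikit-learn's `OneHotEncoder`, pandas' `get_dummies`) additionally handle unseen categories and sparse output.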
32. Hyperplane
A Hyperplane is a multidimensional surface that separates data points belonging to different classes within a feature space. Support Vector Machines (SVM) aim to find the hyperplane that maximizes the margin between classes for optimal separation.
Example: In a two-class classification task, a hyperplane is a line that divides the feature space into two regions representing each class.
33. Eigenvalues and Eigenvectors
Eigenvalues and Eigenvectors are concepts from linear algebra used in Principal Component Analysis (PCA). Eigenvalues represent the variance of data along principal components, while eigenvectors indicate the directions of maximum variance.
Example: Finding the principal components of a dataset by calculating the eigenvalues and eigenvectors of the covariance matrix.
34. Precision and Recall
Precision and Recall are evaluation metrics used to assess the performance of classification models. Precision measures the proportion of true positive predictions among all positive predictions, while Recall measures the proportion of actual positives that the model detects.
Example: In a medical diagnosis task, high precision indicates that most predicted positive cases are true positives, while high recall means that most actual positive cases are detected.
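The two definitions translate directly into code (the function name and the toy labels are illustrative):

```python
def precision_recall(actual, predicted, positive=1):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted)
             if a == positive and p == positive)
    fp = sum(1 for a, p in zip(actual, predicted)
             if a != positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted)
             if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

actual    = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
p, r = precision_recall(actual, predicted)
print(p, r)
```

Here the model makes 3 positive predictions of which 2 are correct (precision 2/3), and finds 2 of the 3 actual positives (recall 2/3). The F1 score, their harmonic mean, summarizes both in one number.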
35. A/B Testing
A/B Testing is a statistical method used to compare two versions of a product or feature to determine which performs better. A/B Testing involves dividing users into two groups and measuring the impact of changes on user behavior or outcomes.
Example: Testing two different website layouts to see which one generates more clicks and conversions using A/B Testing.
36. Bagging
Bagging (Bootstrap Aggregating) is an ensemble learning technique that combines the predictions of multiple models trained on bootstrapped samples of the data. Bagging is commonly used in Random Forest algorithms.
Example: Training multiple decision trees on bootstrapped samples of the data and aggregating their predictions in a bagging ensemble.
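To keep the sketch short, the "model" below is deliberately trivial (the sample mean) rather than a decision tree; the point is the bagging mechanics, which are the same either way: draw many bootstrap samples, fit a model on each, and average the predictions. Function names and data are illustrative.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def bagged_mean_prediction(data, n_models=50, seed=0):
    """Fit a trivially simple 'model' (the sample mean) on many
    bootstrap samples and average the models' predictions."""
    rng = random.Random(seed)
    predictions = [sum(s) / len(s)
                   for s in (bootstrap_sample(data, rng)
                             for _ in range(n_models))]
    return sum(predictions) / n_models

estimate = bagged_mean_prediction([2.0, 4.0, 6.0, 8.0])
print(estimate)
```

With decision trees in place of the mean, and per-tree random feature subsets added, this becomes the Random Forest recipe.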
37. Batch Gradient Descent
Batch Gradient Descent is an optimization algorithm used to minimize the cost function of a model by iteratively updating its parameters. Batch Gradient Descent calculates the gradient of the cost function using the entire training dataset.
Example: Updating the weights of a neural network by computing the gradient of the loss function over the entire training set at each step in Batch Gradient Descent.
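A minimal sketch of batch gradient descent fitting a line y = w*x + b (function name, learning rate, and toy data are illustrative; the data lie exactly on y = 1 + 2x): at every step the gradient of the mean squared error is computed over the whole dataset, which is what distinguishes "batch" from stochastic or mini-batch variants.

```python
def batch_gradient_descent(xs, ys, lr=0.1, epochs=500):
    """Fit y = w*x + b by following the gradient of the mean squared
    error computed over the ENTIRE dataset at every step."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = batch_gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)
```

Stochastic gradient descent instead updates after each single example, and mini-batch descent after small subsets, trading gradient accuracy for cheaper, more frequent updates.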
38. Bias
Bias is the error introduced by approximating a real-world problem with a simplified model that does not capture all relevant features. Bias leads to underfitting, where the model is too simple to capture the underlying patterns in the data.
Example: A linear regression model that assumes a linear relationship between variables may introduce bias if the true relationship is nonlinear.
39. Variance
Variance is the error introduced by a model being too sensitive to fluctuations in the training data. Variance measures how much the model's predictions vary across different training sets.
Example: A decision tree with many branches and leaf nodes may exhibit high variance if it memorizes noise in the training data.
40. Confusion Matrix
A Confusion Matrix is a table that visualizes the performance of a classification model by comparing predicted labels against actual labels. The Confusion Matrix contains true positive, true negative, false positive, and false negative counts.
Example: Evaluating a binary classification model by constructing a Confusion Matrix to analyze the model's true positive and false positive rates.
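The four cells of a binary confusion matrix are just four counters (the function name and toy labels are illustrative; 1 is taken as the positive class):

```python
def confusion_matrix(actual, predicted):
    """Count TP, FP, FN, TN for a binary classifier (positive = 1)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1          # correctly flagged positive
        elif a == 0 and p == 1:
            counts["fp"] += 1          # false alarm
        elif a == 1 and p == 0:
            counts["fn"] += 1          # missed positive
        else:
            counts["tn"] += 1          # correctly left alone
    return counts

cm = confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(cm)
```

Precision, recall, accuracy, and the false positive rate mentioned in the example are all simple ratios of these four counts.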
41. Cost Function
A Cost Function is a mathematical function used to measure the error or loss between a model's predictions and the actual values. The goal of training a model is to minimize the cost function to improve predictive accuracy.
Example: Defining a cost function that penalizes large prediction errors to train a regression model for house price prediction.
42. Cross-Entropy
Cross-Entropy is a loss function used in classification tasks to measure the difference between predicted probabilities and true class labels. Cross-Entropy loss penalizes models more for incorrect predictions made with high confidence.
Example: Calculating the Cross-Entropy loss between predicted probabilities of class labels and true one-hot encoded labels in a multi-class classification task.
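With a one-hot true label, the cross-entropy sum collapses to the negative log of the probability the model assigned to the true class. A minimal sketch (function name and probability vectors are illustrative):

```python
import math

def cross_entropy(true_index, predicted_probs):
    """Negative log of the probability assigned to the true class.
    Low probability on the true class gives a large loss."""
    return -math.log(predicted_probs[true_index])

# True class is index 0 in both cases.
confident_right = cross_entropy(0, [0.9, 0.05, 0.05])
confident_wrong = cross_entropy(0, [0.1, 0.8, 0.1])
print(confident_right, confident_wrong)
```

The confidently wrong prediction is penalized far more heavily than the confidently right one, which is exactly the behavior the definition describes; production code also clips probabilities away from 0 to avoid `log(0)`.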
43. Decision Boundary
A Decision Boundary is a surface or line that separates different classes in a classification task. Decision Boundaries are determined by machine learning algorithms based on the features and parameters of the model.
Example: In a binary classification task, a decision boundary is the line that divides the feature space into regions corresponding to each class.
44. Deep Reinforcement Learning
Deep Reinforcement Learning combines deep learning techniques with reinforcement learning, enabling agents to learn directly from high-dimensional inputs. Deep Reinforcement Learning has achieved significant breakthroughs in game playing and robotics.
Example: Training a deep reinforcement learning agent to play video games by rewarding successful actions and penalizing mistakes.
45. Dropout
Dropout is a regularization technique used in neural networks to prevent overfitting by randomly deactivating a fraction of neurons during training. Dropout helps improve model generalization and robustness.
Example: Applying dropout to hidden layers in a neural network during training.
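A minimal sketch of the common "inverted dropout" formulation on a list of activations (function name, rate, and data are illustrative): during training, each activation is zeroed with probability `rate` and the survivors are scaled up so the expected activation is unchanged; at inference the layer passes values through untouched.

```python
import random

def dropout(activations, rate=0.5, seed=0, training=True):
    """Inverted dropout: zero each activation with probability `rate`
    and scale survivors by 1/(1-rate) so the expected value is
    unchanged. At inference time, pass activations through as-is."""
    if not training:
        return activations[:]
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

out = dropout([1.0, 2.0, 3.0, 4.0], rate=0.5)
print(out)
```

Deep learning frameworks (e.g. `torch.nn.Dropout`) implement the same scheme on tensors and switch the behavior automatically between training and evaluation modes.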