Data Science Glossary
A
- Algorithm – A set of rules or instructions given to a computer to perform a task.
- Artificial Intelligence (AI) – A branch of computer science that enables machines to simulate human intelligence.
- Association Rule Mining – A technique used to discover interesting relationships between variables in large databases (e.g., Market Basket Analysis).
- Anomaly Detection – The process of identifying rare events or observations that differ significantly from the majority of the data.
- A/B Testing – A statistical method to compare two versions of a variable (e.g., a webpage) to determine which performs better.
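A minimal A/B-testing sketch, assuming SciPy is available: a hand-rolled two-proportion z-test on invented conversion counts for two page variants.

```python
# Two-proportion z-test on hypothetical conversion counts (not real data).
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 120, 2400   # conversions and visitors for variant A (assumed)
conv_b, n_b = 150, 2400   # conversions and visitors for variant B (assumed)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                            # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```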
B
- Big Data – Large volumes of data that cannot be processed effectively using traditional methods.
- Bias-Variance Tradeoff – The balance between underfitting (high bias) and overfitting (high variance) in model training.
- Bayesian Statistics – A statistical method that incorporates prior knowledge when estimating probabilities.
- Bagging (Bootstrap Aggregating) – An ensemble method that trains multiple models on bootstrap samples of the data and combines their predictions to improve stability and accuracy.
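A small bagging sketch, assuming scikit-learn is available; the dataset is synthetic and the default base estimator (a decision tree) is used.

```python
# Bagging: many trees fit on bootstrap samples, predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = BaggingClassifier(n_estimators=50, random_state=0)  # default base model is a decision tree
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```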
C
- Clustering – Grouping similar data points together without predefined labels (e.g., K-Means, DBSCAN).
- Classification – Assigning predefined labels to data points (e.g., Spam vs. Not Spam).
- Cross-Validation – A technique for evaluating machine learning models by repeatedly splitting the data into training and validation folds and averaging the results; see the sketch after this list.
- Confusion Matrix – A table used to evaluate the performance of a classification algorithm.
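The cross-validation sketch referenced above, assuming scikit-learn: 5-fold cross-validation of a logistic regression on synthetic data.

```python
# 5-fold cross-validation: five train/validation splits, one accuracy per fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```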
D
- Data Cleaning – The process of fixing or removing incorrect, corrupted, or inconsistent data.
- Data Engineering – The practice of designing and building systems to collect, store, and analyze data.
- Dimensionality Reduction – Techniques to reduce the number of input variables (e.g., PCA, t-SNE).
- Decision Tree – A model that makes predictions by following a tree of if-then conditions on feature values.
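A decision-tree sketch, assuming scikit-learn: a shallow tree fitted to the iris dataset, with its learned if-then rules printed.

```python
# Fit a depth-2 decision tree and print its splitting rules as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```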
E
- Exploratory Data Analysis (EDA) – The process of analyzing and visualizing data to understand its characteristics before modeling; a short sketch follows this list.
- Ensemble Learning – Combining multiple models to improve prediction performance (e.g., Random Forest, Gradient Boosting).
- ETL (Extract, Transform, Load) – The process of gathering, transforming, and loading data into a system for analysis.
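The EDA sketch referenced above, assuming pandas is available; the tiny DataFrame is invented purely for illustration.

```python
# Quick EDA: summary statistics, missing-value counts, and correlations.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 28, None, 45, 52],
    "income": [48000, 52000, 61000, None, 75000],
    "churned": [0, 1, 0, 0, 1],
})
print(df.describe())    # count, mean, std, quartiles per numeric column
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations (missing values ignored pairwise)
```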
F
- Feature Engineering – The process of creating new variables (features) from raw data to improve model performance.
- Feature Selection – Identifying and selecting the most relevant features for a model.
- F1 Score – The harmonic mean of precision and recall, balancing the two in classification models; both are computed in the sketch after this list.
- False Positive / False Negative – Incorrect classifications in binary classification models.
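The metrics sketch referenced above, assuming scikit-learn; the true and predicted label vectors are invented.

```python
# Precision, recall, and their harmonic mean (F1) for a binary classifier.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # assumed ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # assumed model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```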
G
- Gradient Descent – An optimization algorithm that iteratively adjusts a model's parameters in the direction that reduces its error; see the sketch after this list.
- Generative Adversarial Networks (GANs) – A pair of neural networks (a generator and a discriminator) trained against each other to produce new data resembling the training data.
- Gaussian Distribution – Also called a normal distribution, a probability distribution that is symmetric around its mean.
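The gradient-descent sketch referenced above: plain NumPy minimizing mean squared error for a one-variable linear model. The data, learning rate, and iteration count are illustrative assumptions.

```python
# Gradient descent for y ≈ w*x + b, minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 2, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # dMSE/dw
    grad_b = 2 * np.mean(y_hat - y)         # dMSE/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"estimated slope {w:.2f}, intercept {b:.2f}")
```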
H
- Hyperparameter Tuning – The process of searching for good values of the settings fixed before training (e.g., learning rate, tree depth), as opposed to parameters learned from data; see the sketch after this list.
- Hypothesis Testing – A statistical method used to test assumptions about data.
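The hyperparameter-tuning sketch referenced above, assuming scikit-learn; the parameter grid is an arbitrary illustrative choice.

```python
# Grid search with 5-fold cross-validation over two decision-tree hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "best CV accuracy:", round(grid.best_score_, 3))
```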
I
- Imbalanced Data – A dataset in which one class greatly outnumbers the others, which can bias predictions toward the majority class.
- Imputation – Replacing missing data with estimated values.
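A small imputation sketch, assuming scikit-learn: missing entries are replaced by each column's median.

```python
# Median imputation of missing values in a toy feature matrix.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
print(SimpleImputer(strategy="median").fit_transform(X))
```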
J
- Jaccard Similarity – A metric used to measure similarity between two sets.
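A Jaccard-similarity sketch in plain Python: the size of the intersection divided by the size of the union of two sets.

```python
# Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0   # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

print(jaccard({"data", "science", "ml"}, {"data", "ml", "ai"}))  # 2 shared / 4 total = 0.5
```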
K
- K-Means Clustering – A popular unsupervised learning algorithm that partitions data into k clusters around learned centroids; see the sketch after this list.
- K-Nearest Neighbors (KNN) – A classification algorithm that assigns a label based on the majority label of the k nearest data points.
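The K-Means sketch referenced above, assuming scikit-learn; the 2-D points are synthetic blobs.

```python
# Partition synthetic 2-D points into 3 clusters with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```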
L
- Linear Regression – A statistical method for modeling the relationship between dependent and independent variables.
- Logistic Regression – A classification algorithm used for binary outcomes.
- Loss Function – A function that quantifies how far a model's predictions are from the true values; training seeks to minimize it.
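A small sketch tying the linear-regression and loss-function entries together, assuming scikit-learn: a linear regression fitted to synthetic data, with the MSE loss reported.

```python
# Fit y ≈ w*x + b and report the mean squared error loss on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (200, 1))
y = 1.5 * x[:, 0] - 4.0 + rng.normal(0, 2, 200)   # assumed true relationship plus noise

model = LinearRegression().fit(x, y)
print("slope:", round(float(model.coef_[0]), 2), "intercept:", round(float(model.intercept_), 2))
print("MSE loss:", round(mean_squared_error(y, model.predict(x)), 2))
```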
M
- Machine Learning (ML) – The study of algorithms that allow computers to learn from data.
- Mean Absolute Error (MAE) – A metric that measures the average absolute difference between actual and predicted values.
- Mean Squared Error (MSE) – A metric that averages the squared differences between actual and predicted values, penalizing large errors more heavily; both error metrics are computed in the sketch after this list.
- Model Overfitting – When a model learns noise instead of the actual pattern in data, performing well on training but poorly on unseen data.
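The error-metric sketch referenced above: MAE and MSE computed directly from their definitions on invented numbers.

```python
# MAE averages absolute errors; MSE averages squared errors, penalizing large ones more.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # assumed actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # assumed predictions

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
print("MAE:", mae, "MSE:", mse)   # 0.75 and 0.875
```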
N
- Natural Language Processing (NLP) – A field of AI focused on the interaction between computers and human language.
- Neural Network – A model built from layers of interconnected nodes, loosely inspired by the brain, and widely used for pattern recognition.
- Normalization – The process of scaling features to have a standard range (e.g., between 0 and 1).
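A normalization sketch, assuming scikit-learn: min-max scaling of a toy feature matrix into the range [0, 1].

```python
# Scale each column so its minimum maps to 0 and its maximum maps to 1.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])
print(MinMaxScaler().fit_transform(X))
```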
O
- Outlier Detection – Identifying data points that differ significantly from the majority; see the sketch after this list.
- Overfitting – When a model learns noise in training data and fails to generalize to new data.
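The outlier-detection sketch referenced above: Tukey's 1.5 × IQR rule applied with NumPy to invented numbers.

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12])   # 95 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("outliers:", data[mask])
```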
P
- Principal Component Analysis (PCA) – A technique that reduces the dimensionality of data by projecting it onto the directions of greatest variance; see the sketch after this list.
- Precision – A metric that measures the proportion of predicted positives that are actually positive.
- Predictive Modeling – Using historical data to predict future outcomes.
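The PCA sketch referenced above, assuming scikit-learn: the 4-D iris features projected onto their first two principal components.

```python
# Reduce 4 features to 2 principal components and report the variance they explain.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```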
Q
- Quantile Regression – A type of regression that predicts specific percentiles instead of mean outcomes.
- Query – A request to retrieve data from a database.
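A query sketch using Python's built-in sqlite3 module; the table and rows are invented.

```python
# Create an in-memory table, insert rows, and retrieve an aggregate with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 340.5), ("north", 98.2)])
rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
print(rows)   # one summed amount per region
conn.close()
```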
R
- Random Forest – An ensemble learning method using multiple decision trees.
- Reinforcement Learning – A type of machine learning where an agent learns by interacting with an environment.
- ROC Curve (Receiver Operating Characteristic Curve) – A plot of the true positive rate against the false positive rate across classification thresholds, summarizing a classifier's performance.
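A sketch of the ROC idea, assuming scikit-learn: the curve's points and the area under it (AUC), computed from a random forest's predicted probabilities on synthetic data.

```python
# Trace true/false positive rates across thresholds and summarize with the AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
probs = forest.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)       # one (FPR, TPR) point per threshold
print("ROC points:", len(fpr), "AUC:", round(roc_auc_score(y_test, probs), 3))
```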
S
- Standard Deviation – A measure of the amount of variation in a dataset.
- Supervised Learning – A type of machine learning where the model is trained on labeled data.
- Support Vector Machine (SVM) – A classification algorithm that finds the separating boundary (hyperplane) with the largest margin between classes.
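An SVM sketch, assuming scikit-learn: a linear support vector classifier on standardized synthetic data.

```python
# Standardize features, then fit a linear-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)
print("training accuracy:", round(model.score(X, y), 3))
```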
T
- Time Series Analysis – The study of data points collected or recorded at specific time intervals.
- Tokenization – The process of splitting text into words, sub-words, or phrases for NLP applications; see the sketch after this list.
- True Positive / True Negative – Correctly classified outcomes in a classification model.
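The tokenization sketch referenced above: a simple regex tokenizer. Real NLP pipelines usually rely on library tokenizers; this only illustrates the idea.

```python
# Lowercase the text and pull out runs of letters, digits, and apostrophes.
import re

def tokenize(text: str):
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Tokenization splits text into words, phrases, or sub-words."))
```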
U
- Unsupervised Learning – A type of machine learning where the model learns patterns without labeled outcomes.
- Underfitting – When a model is too simple and fails to learn patterns in data.
V
- Validation Set – A subset of the dataset used to evaluate model performance during training.
- Variance – A measure of spread in data; in modeling, how much a model's predictions fluctuate when it is trained on different data samples.
W
- Word Embeddings – Representing words as numerical vectors in NLP tasks.
- Weighted Average – An average where some values contribute more to the final calculation than others.
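A weighted-average sketch with NumPy; the weights are assumed importances that sum to 1.

```python
# Weighted average vs. plain mean of three scores.
import numpy as np

scores = np.array([80.0, 90.0, 70.0])
weights = np.array([0.5, 0.3, 0.2])
print("weighted:", np.average(scores, weights=weights))   # 0.5*80 + 0.3*90 + 0.2*70 = 81.0
print("unweighted:", scores.mean())                       # 80.0
```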
X
- XGBoost – A widely used library implementing gradient-boosted decision trees.
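An XGBoost sketch via its scikit-learn-style wrapper, assuming the xgboost and scikit-learn packages are installed; the data are synthetic and the settings are illustrative.

```python
# Gradient-boosted trees on synthetic data with the XGBClassifier wrapper.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```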
Y
- Y-Intercept – In linear regression, the value at which the regression line crosses the y-axis, i.e., the prediction when all inputs are zero.
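A y-intercept sketch: a least-squares line fitted with NumPy to points generated from a known line.

```python
# polyfit returns the coefficients highest degree first: (slope, intercept).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0                      # exact line: slope 2, intercept 5
slope, intercept = np.polyfit(x, y, deg=1)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```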
Z
- Z-Score – A statistical measure of how many standard deviations a value lies from the mean; see the sketch after this list.
- Zero-Shot Learning – A machine learning technique in which a model makes predictions for classes it has never seen during training.
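The z-score sketch referenced above: each value's distance from the mean expressed in standard deviations, using NumPy.

```python
# z = (x - mean) / standard deviation, computed element-wise.
import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z = (data - data.mean()) / data.std()
print(z.round(2))   # roughly [-1.41, -0.71, 0., 0.71, 1.41]
```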