Data Science Glossary
A
- Algorithm – A set of rules or instructions given to a computer to perform a task.
- Artificial Intelligence (AI) – A branch of computer science that enables machines to simulate human intelligence.
- Association Rule Mining – A technique used to discover interesting relationships between variables in large databases (e.g., Market Basket Analysis).
- Anomaly Detection – The process of identifying rare events or observations that differ significantly from the majority of the data.
- A/B Testing – A statistical method to compare two versions of a variable (e.g., a webpage) to determine which performs better.
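A minimal A/B-testing sketch, assuming SciPy is available: a hand-rolled two-proportion z-test on invented conversion counts for two page variants.

```python
# Two-proportion z-test on hypothetical conversion counts (not real data).
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 120, 2400   # conversions and visitors for variant A (assumed)
conv_b, n_b = 150, 2400   # conversions and visitors for variant B (assumed)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
z = (p_b - p_a) / se
p_value = 2 * norm.sf(abs(z))                            # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```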
B
- Big Data – Large volumes of data that cannot be processed effectively using traditional methods.
- Bias-Variance Tradeoff – The balance between underfitting (high bias) and overfitting (high variance) in model training.
- Bayesian Statistics – A statistical method that incorporates prior knowledge when estimating probabilities.
- Bagging (Bootstrap Aggregating) – An ensemble method that trains multiple models on bootstrap samples of the data and combines their predictions to improve stability and accuracy.
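A small bagging sketch, assuming scikit-learn is available; the dataset is synthetic and the default base estimator (a decision tree) is used.

```python
# Bagging: many trees fit on bootstrap samples, predictions combined by voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = BaggingClassifier(n_estimators=50, random_state=0)  # default base model is a decision tree
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```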
C
- Clustering – Grouping similar data points together without predefined labels (e.g., K-Means, DBSCAN).
- Classification – Assigning predefined labels to data points (e.g., Spam vs. Not Spam).
- Cross-Validation – A technique for evaluating machine learning models by repeatedly splitting the data into training and validation folds and averaging the results; see the sketch after this list.
- Confusion Matrix – A table used to evaluate the performance of a classification algorithm.
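The cross-validation sketch referenced above, assuming scikit-learn: 5-fold cross-validation of a logistic regression on synthetic data.

```python
# 5-fold cross-validation: five train/validation splits, one accuracy per fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```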
D
- Data Cleaning – The process of fixing or removing incorrect, corrupted, or inconsistent data.
- Data Engineering – The practice of designing and building systems to collect, store, and analyze data.
- Dimensionality Reduction – Techniques to reduce the number of input variables (e.g., PCA, t-SNE).
- Decision Tree – A model that makes predictions by following a tree of if-then conditions on feature values.
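A decision-tree sketch, assuming scikit-learn: a shallow tree fitted to the iris dataset, with its learned if-then rules printed.

```python
# Fit a depth-2 decision tree and print its splitting rules as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```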
E
- Exploratory Data Analysis (EDA) – The process of analyzing and visualizing data to understand its characteristics before modeling; a short sketch follows this list.
- Ensemble Learning – Combining multiple models to improve prediction performance (e.g., Random Forest, Gradient Boosting).
- ETL (Extract, Transform, Load) – The process of gathering, transforming, and loading data into a system for analysis.
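The EDA sketch referenced above, assuming pandas is available; the tiny DataFrame is invented purely for illustration.

```python
# Quick EDA: summary statistics, missing-value counts, and correlations.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 28, None, 45, 52],
    "income": [48000, 52000, 61000, None, 75000],
    "churned": [0, 1, 0, 0, 1],
})
print(df.describe())    # count, mean, std, quartiles per numeric column
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations (missing values ignored pairwise)
```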
F
- Feature Engineering – The process of creating new variables (features) from raw data to improve model performance.
- Feature Selection – Identifying and selecting the most relevant features for a model.
- F1 Score – The harmonic mean of precision and recall, balancing the two in classification models; both are computed in the sketch after this list.
- False Positive / False Negative – Incorrect classifications in binary classification models.
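The metrics sketch referenced above, assuming scikit-learn; the true and predicted label vectors are invented.

```python
# Precision, recall, and their harmonic mean (F1) for a binary classifier.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # assumed ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # assumed model predictions

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```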
G
- Gradient Descent – An optimization algorithm that iteratively adjusts a model's parameters in the direction that reduces its error; see the sketch after this list.
- Generative Adversarial Networks (GANs) – A pair of neural networks (a generator and a discriminator) trained against each other to produce new data resembling the training data.
- Gaussian Distribution – Also called a normal distribution, a probability distribution that is symmetric around its mean.
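The gradient-descent sketch referenced above: plain NumPy minimizing mean squared error for a one-variable linear model. The data, learning rate, and iteration count are illustrative assumptions.

```python
# Gradient descent for y ≈ w*x + b, minimizing mean squared error.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true slope 3, intercept 2, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)   # dMSE/dw
    grad_b = 2 * np.mean(y_hat - y)         # dMSE/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"estimated slope {w:.2f}, intercept {b:.2f}")
```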
H
- Hyperparameter Tuning – The process of searching for good values of the settings fixed before training (e.g., learning rate, tree depth), as opposed to parameters learned from data; see the sketch after this list.
- Hypothesis Testing – A statistical method used to test assumptions about data.
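The hyperparameter-tuning sketch referenced above, assuming scikit-learn; the parameter grid is an arbitrary illustrative choice.

```python
# Grid search with 5-fold cross-validation over two decision-tree hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_, "best CV accuracy:", round(grid.best_score_, 3))
```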
I
- Imbalanced Data – A dataset in which one class greatly outnumbers the others, which can bias predictions toward the majority class.
- Imputation – Replacing missing data with estimated values.
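A small imputation sketch, assuming scikit-learn: missing entries are replaced by each column's median.

```python
# Median imputation of missing values in a toy feature matrix.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])
print(SimpleImputer(strategy="median").fit_transform(X))
```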
J
- Jaccard Similarity – A metric used to measure similarity between two sets.
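A Jaccard-similarity sketch in plain Python: the size of the intersection divided by the size of the union of two sets.

```python
# Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0   # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)

print(jaccard({"data", "science", "ml"}, {"data", "ml", "ai"}))  # 2 shared / 4 total = 0.5
```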
K
- K-Means Clustering – A popular unsupervised learning algorithm that partitions data into k clusters around learned centroids; see the sketch after this list.
- K-Nearest Neighbors (KNN) – A classification algorithm that assigns a label based on the majority label of the k nearest data points.
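The K-Means sketch referenced above, assuming scikit-learn; the 2-D points are synthetic blobs.

```python
# Partition synthetic 2-D points into 3 clusters with K-Means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```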
L
- Linear Regression – A statistical method for modeling the relationship between dependent and independent variables.
- Logistic Regression – A classification algorithm used for binary outcomes.
- Loss Function – A function that quantifies how far a model's predictions are from the true values; training seeks to minimize it.
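A small sketch tying the linear-regression and loss-function entries together, assuming scikit-learn: a linear regression fitted to synthetic data, with the MSE loss reported.

```python
# Fit y ≈ w*x + b and report the mean squared error loss on the same data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (200, 1))
y = 1.5 * x[:, 0] - 4.0 + rng.normal(0, 2, 200)   # assumed true relationship plus noise

model = LinearRegression().fit(x, y)
print("slope:", round(float(model.coef_[0]), 2), "intercept:", round(float(model.intercept_), 2))
print("MSE loss:", round(mean_squared_error(y, model.predict(x)), 2))
```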
M
- Machine Learning (ML) – The study of algorithms that allow computers to learn from data.
- Mean Absolute Error (MAE) – A metric that measures the average absolute difference between actual and predicted values.
- Mean Squared Error (MSE) – A metric that averages the squared differences between actual and predicted values, penalizing large errors more heavily; both error metrics are computed in the sketch after this list.
- Model Overfitting – When a model learns noise instead of the actual pattern in data, performing well on training but poorly on unseen data.
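The error-metric sketch referenced above: MAE and MSE computed directly from their definitions on invented numbers.

```python
# MAE averages absolute errors; MSE averages squared errors, penalizing large ones more.
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # assumed actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # assumed predictions

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)
print("MAE:", mae, "MSE:", mse)   # 0.75 and 0.875
```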
N
- Natural Language Processing (NLP) – A field of AI focused on the interaction between computers and human language.
- Neural Network – A model built from layers of interconnected nodes, loosely inspired by the brain, and widely used for pattern recognition.
- Normalization – The process of scaling features to have a standard range (e.g., between 0 and 1).
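A normalization sketch, assuming scikit-learn: min-max scaling of a toy feature matrix into the range [0, 1].

```python
# Scale each column so its minimum maps to 0 and its maximum maps to 1.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]])
print(MinMaxScaler().fit_transform(X))
```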
O
- Outlier Detection – Identifying data points that differ significantly from the majority; see the sketch after this list.
- Overfitting – When a model learns noise in training data and fails to generalize to new data.
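The outlier-detection sketch referenced above: Tukey's 1.5 × IQR rule applied with NumPy to invented numbers.

```python
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12])   # 95 is an obvious outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("outliers:", data[mask])
```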
P
- Principal Component Analysis (PCA) – A technique that reduces the dimensionality of data by projecting it onto the directions of greatest variance; see the sketch after this list.
- Precision – A metric that measures the proportion of predicted positives that are actually positive.
- Predictive Modeling – Using historical data to predict future outcomes.
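The PCA sketch referenced above, assuming scikit-learn: the 4-D iris features projected onto their first two principal components.

```python
# Reduce 4 features to 2 principal components and report the variance they explain.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```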
Q
- Quantile Regression – A type of regression that predicts specific percentiles instead of mean outcomes.
- Query – A request to retrieve data from a database.
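A query sketch using Python's built-in sqlite3 module; the table and rows are invented.

```python
# Create an in-memory table, insert rows, and retrieve an aggregate with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 340.5), ("north", 98.2)])
rows = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
print(rows)   # one summed amount per region
conn.close()
```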
R
- Random Forest – An ensemble learning method using multiple decision trees.
- Reinforcement Learning – A type of machine learning where an agent learns by interacting with an environment.
- ROC Curve (Receiver Operating Characteristic Curve) – A plot of the true positive rate against the false positive rate across classification thresholds, summarizing a classifier's performance.
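A sketch of the ROC idea, assuming scikit-learn: the curve's points and the area under it (AUC), computed from a random forest's predicted probabilities on synthetic data.

```python
# Trace true/false positive rates across thresholds and summarize with the AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
probs = forest.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)       # one (FPR, TPR) point per threshold
print("ROC points:", len(fpr), "AUC:", round(roc_auc_score(y_test, probs), 3))
```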
S
- Standard Deviation – A measure of the amount of variation in a dataset.
- Supervised Learning – A type of machine learning where the model is trained on labeled data.
- Support Vector Machine (SVM) – A classification algorithm that finds the separating boundary (hyperplane) with the largest margin between classes.
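An SVM sketch, assuming scikit-learn: a linear support vector classifier on standardized synthetic data.

```python
# Standardize features, then fit a linear-kernel SVM.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X, y)
print("training accuracy:", round(model.score(X, y), 3))
```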
T
- Time Series Analysis – The study of data points collected or recorded at specific time intervals.
- Tokenization – The process of splitting text into words, sub-words, or phrases for NLP applications; see the sketch after this list.
- True Positive / True Negative – Correctly classified outcomes in a classification model.
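The tokenization sketch referenced above: a simple regex tokenizer. Real NLP pipelines usually rely on library tokenizers; this only illustrates the idea.

```python
# Lowercase the text and pull out runs of letters, digits, and apostrophes.
import re

def tokenize(text: str):
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("Tokenization splits text into words, phrases, or sub-words."))
```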
U
- Unsupervised Learning – A type of machine learning where the model learns patterns without labeled outcomes.
- Underfitting – When a model is too simple and fails to learn patterns in data.
V
- Validation Set – A subset of the dataset used to evaluate model performance during training.
- Variance – A measure of spread in data; in modeling, how much a model's predictions fluctuate when it is trained on different data samples.
W
- Word Embeddings – Representing words as numerical vectors in NLP tasks.
- Weighted Average – An average where some values contribute more to the final calculation than others.
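A weighted-average sketch with NumPy; the weights are assumed importances that sum to 1.

```python
# Weighted average vs. plain mean of three scores.
import numpy as np

scores = np.array([80.0, 90.0, 70.0])
weights = np.array([0.5, 0.3, 0.2])
print("weighted:", np.average(scores, weights=weights))   # 0.5*80 + 0.3*90 + 0.2*70 = 81.0
print("unweighted:", scores.mean())                       # 80.0
```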
X
- XGBoost – A widely used library implementing gradient-boosted decision trees.
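An XGBoost sketch via its scikit-learn-style wrapper, assuming the xgboost and scikit-learn packages are installed; the data are synthetic and the settings are illustrative.

```python
# Gradient-boosted trees on synthetic data with the XGBClassifier wrapper.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("test accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```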
Y
- Y-Intercept – In linear regression, the value at which the regression line crosses the y-axis, i.e., the prediction when all inputs are zero.
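A y-intercept sketch: a least-squares line fitted with NumPy to points generated from a known line.

```python
# polyfit returns the coefficients highest degree first: (slope, intercept).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0                      # exact line: slope 2, intercept 5
slope, intercept = np.polyfit(x, y, deg=1)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
```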
Z
- Z-Score – A statistical measure of how many standard deviations a value lies from the mean; see the sketch after this list.
- Zero-Shot Learning – A machine learning technique in which a model makes predictions for classes it has never seen during training.
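The z-score sketch referenced above: each value's distance from the mean expressed in standard deviations, using NumPy.

```python
# z = (x - mean) / standard deviation, computed element-wise.
import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
z = (data - data.mean()) / data.std()
print(z.round(2))   # roughly [-1.41, -0.71, 0., 0.71, 1.41]
```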