Table of Contents
Introduction
Training Dataset
Testing Dataset
Why do we split datasets?
Splitting training and testing data
train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X and y are the feature and target variables, respectively
# test_size is the proportion of the dataset allocated for testing
# random_state ensures the split is reproducible
- X and y represent the dataset's features and target variable, respectively.
- train_test_split is a function from scikit-learn's model_selection module that splits the dataset into training and testing sets.
- The test_size parameter specifies the proportion of the dataset that should be allocated for testing. In this case, 20% of the data is reserved for testing.
- The random_state parameter specifies the seed used by the random number generator. This ensures that the data is split in a reproducible way (see the small check after this list).
- X_train and y_train represent the training set's features and target variable, respectively.
- X_test and y_test represent the testing set's features and target variable, respectively.
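To make the behaviour of test_size and random_state concrete, here is a minimal, self-contained sketch using a small synthetic NumPy array (the array and its values are illustrative and not part of the original example):

import numpy as np
from sklearn.model_selection import train_test_split

# Small synthetic dataset: 10 samples, 2 features, and a target vector
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A second split with the same random_state produces the identical partition
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)      # (8, 2) (2, 2)
print(np.array_equal(X_test, X_test2))  # True: same random_state, same split

Running this confirms that 20% of the samples end up in the test set and that fixing random_state makes the split repeatable.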
Example
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset into a pandas DataFrame
df = pd.read_csv('housing_data.csv')

# Split the dataset into features and target variable
X = df.drop('price', axis=1)
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Print the shapes of the training and testing sets
print('Training set:', X_train.shape, y_train.shape)
print('Testing set:', X_test.shape, y_test.shape)
In this example, we first load the housing dataset into a pandas DataFrame. We then split it into features (X) and the target variable (y). We use the train_test_split function from the sklearn.model_selection module to split the dataset into training and testing sets. The test_size parameter specifies the proportion of the data to be used for testing, which is set to 0.2 (20%) in this example. The random_state parameter sets the seed for the random number generator to make the split reproducible.
Finally, we print the shape of the training and testing sets to confirm that the data has been split correctly. We can use the X_train, X_test, y_train, and y_test variables in our machine learning model to fit the model on the training data and evaluate its performance on the testing data.
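As a sketch of that last step, the snippet below fits a model on the training set and scores it on the testing set. The choice of LinearRegression, and the assumption that all features in housing_data.csv are numeric, are illustrative only and not fixed by this article:

from sklearn.linear_model import LinearRegression

# Fit a model on the training data (LinearRegression is just an illustrative choice)
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out testing data; score() returns R^2 for regressors
print('Test R^2:', model.score(X_test, y_test))

Any other scikit-learn estimator could be swapped in here; the key point is that the model only ever sees X_train and y_train during fitting, and is judged on the unseen X_test and y_test.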
Conclusion
Splitting a dataset into training and testing sets is a crucial step in machine learning. The training set is used to train a model, while the testing set is used to evaluate the model's performance. This helps to ensure that the model generalizes well to new, unseen data.
The scikit-learn library provides a convenient function called train_test_split that can be used to split a dataset into training and testing sets in Python. By splitting the dataset into training and testing sets, we can evaluate the performance of our model and make any necessary adjustments before deploying the model in the real world.
Thank you so much for reading this article on how to split a dataset into training and testing sets for machine learning.