Ex No: 01
Date: 19/02/2025
CREATE A SAMPLE DATASET AND EXPLORE STATISTICAL OPERATIONS USING PANDAS AND VISUALIZE THE RESULTS THROUGH PLOTS
Aim:
To perform exploratory data analysis (EDA) on a user-defined sample dataset, using statistical operations and visualizations to identify patterns, trends, and insights.
Procedure:
Step 1: Open a new notebook in Google Colab.
Step 2: Import all the necessary modules.
Step 3: Create a dataset using the NumPy random number generator and download it using the files module.
Step 4: Load the dataset and convert it to a DataFrame using pandas.
Step 5: Compute the listed metrics using the DataFrame and the imported packages.
Step 6: Additionally, import seaborn and matplotlib to plot graphs for the dataset.
Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files

# Upload the dataset manually in Google Colab
uploaded = files.upload()
file_path = "sales_data (1).csv"  # Adjusted path for Google Colab
df = pd.read_csv(file_path)

# Display the top 5 rows of the DataFrame
print(df.head())
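The procedure promises NumPy-generated data, statistical operations, and plots beyond df.head(). A minimal sketch continuing from the cells above is given below; the column names 'Region' and 'Sales' are illustrative assumptions about the sample data, not part of the original record.

# (Optional) Recreate a sample dataset with NumPy, as in Step 3 of the procedure,
# if no CSV is at hand; the columns here are assumed for illustration
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Region': rng.choice(['North', 'South', 'East', 'West'], size=100),
    'Sales': rng.normal(loc=500, scale=120, size=100).round(2)
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(df.describe())

# Per-column statistics (assumes a numeric 'Sales' column)
print("Mean:", df['Sales'].mean())
print("Median:", df['Sales'].median())
print("Std dev:", df['Sales'].std())

# Histogram of the 'Sales' column
sns.histplot(df['Sales'], kde=True)
plt.title('Distribution of Sales')
plt.show()

# Bar plot of mean sales per region (assumes a categorical 'Region' column)
df.groupby('Region')['Sales'].mean().plot(kind='bar')
plt.ylabel('Mean Sales')
plt.title('Mean Sales by Region')
plt.show()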
Ex No: 02
Date: 06/03/2025
IMPLEMENT UNINFORMED SEARCH STRATEGIES FOR ANY REAL-WORLD PROBLEM
Aim:
To develop a User vs AI Tic-Tac-Toe game where the AI uses Breadth-First Search (BFS) to determine optimal moves.
Algorithm:
Step 1: Initialize the Board
1. Create a 3x3 board with all cells initially set to empty ('-').
2. Print the initial empty board for the player.
Step 2: Display the Board
1. Print the current state of the board after each move.
Step 3: Main Game Loop (alternating turns between the human player and the AI)
1. Set the current player to 'X' (human player) initially.
2. Repeat until there is a winner or the board is full:
   1. If it is the human player's turn (current player == 'X'):
      1. Prompt the human player to enter a row and column (0, 1, 2) for their move.
      2. Check if the selected cell is empty:
         1. If the cell is not empty, prompt the user again.
         2. If the cell is empty, make the move by placing 'X' in the selected spot.
   2. If it is the AI's turn (current player == 'O'):
      1. The AI chooses a move based on the game state using the BFS algorithm:
         1. Explore all valid moves by simulating each one.
         2. For each move, evaluate whether it leads to a win, loss, or draw.
         3. Select the move that maximizes the AI's chances of winning and minimizes the human player's chances (using the minimax strategy).
      2. Make the move by placing 'O' in the selected spot.
Step 4: Check for a Winner
1. After every move, check if the current player has won the game:
   1. Check each row, column, and both diagonals to see if all cells contain the same player's symbol ('X' or 'O').
   2. If a player has won, declare the winner and end the game.
Step 5: Check for a Draw
1. After each move, check if the game has ended in a draw:
   1. If there are no valid moves left and no winner, declare the game a draw.
Step 6: Switch Turns
1. If there is no winner or draw, switch the current player:
   1. If current player == 'X', switch to 'O' (AI's turn).
   2. If current player == 'O', switch to 'X' (human player's turn).
Step 7: End the Game
1. The game ends when:
   1. A player wins the game.
   2. The game ends in a draw (no valid moves left and no winner).
2. Print the result ("Player X wins!", "Player O wins!", or "It's a draw!").
3. Exit the game.
Implementation:
from collections import deque

def print_board(board):
    print("-------------")
    for row in board:
        print("|", " | ".join(row), "|")
    print("-------------")

def check_winner(board, player):
    # Check rows
    for row in board:
        if all(cell == player for cell in row):
            return True
    # Check columns
    for col in range(3):
        if all(board[row][col] == player for row in range(3)):
            return True
    # Check diagonals
    if all(board[i][i] == player for i in range(3)) or all(board[i][2 - i] == player for i in range(3)):
        return True
    return False

def get_valid_moves(board):
    moves = []
    for i in range(3):
        for j in range(3):
            if board[i][j] == '-':
                moves.append((i, j))
    return moves

def make_move(board, move, player):
    board[move[0]][move[1]] = player

def bfs(board, player):
    # Breadth-first exploration of game states from `board`, with `player` to move
    queue = deque([(board, player)])
    while queue:
        current_board, current_player = queue.popleft()
        if check_winner(current_board, 'X'):
            return -1
        elif check_winner(current_board, 'O'):
            return 1
        elif len(get_valid_moves(current_board)) == 0:
            return 0
        for move in get_valid_moves(current_board):
            next_board = [row[:] for row in current_board]
            make_move(next_board, move, 'O' if current_player == 'X' else 'X')
            queue.append((next_board, 'O' if current_player == 'X' else 'X'))

def main():
    board = [['-' for _ in range(3)] for _ in range(3)]
    print("Welcome to Tic Tac Toe!\nEnter row and column (0-2) to make your move.")
    print_board(board)
    current_player = 'X'
    while True:
        if current_player == 'X':
            row = int(input(f"Player {current_player}, enter row: "))
            col = int(input(f"Player {current_player}, enter column: "))
            if board[row][col] != '-':
                print("Invalid move! Try again.")
                continue
            make_move(board, (row, col), current_player)
        else:
            print(f"Player {current_player}'s turn (AI). Thinking...")
            best_move, best_score = None, -float('inf') if current_player == 'O' else float('inf')
            for move in get_valid_moves(board):
                next_board = [row[:] for row in board]
                make_move(next_board, move, 'O' if current_player == 'O' else 'X')
                score = bfs(next_board, 'O' if current_player == 'X' else 'X')
                if (current_player == 'O' and score > best_score) or (current_player == 'X' and score < best_score):
                    best_score, best_move = score, move
            make_move(board, best_move, current_player)
            print(f"AI (Player {current_player}) chooses {best_move}")
        print_board(board)
        if check_winner(board, current_player):
            print(f"Player {current_player} wins!")
            break
        elif len(get_valid_moves(board)) == 0:
            print("It's a draw!")
            break
        else:
            current_player = 'X' if current_player == 'O' else 'O'

if __name__ == "__main__":
    main()
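A quick, non-interactive way to exercise the evaluator is to score a hypothetical midgame position directly. Note that bfs() as written returns the outcome of the first finished game it dequeues in breadth-first order rather than a full minimax value, so this is a smoke test, not a proof of optimal play:

# Hypothetical midgame position (an assumption for illustration): X to move next
sample = [['X', 'O', 'X'],
          ['-', 'O', '-'],
          ['-', '-', '-']]
# 1 favours 'O', -1 favours 'X', 0 is a draw
print(bfs([row[:] for row in sample], 'X'))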
Exp. No: 3(a)
Date:
FIND OPTIMAL SOLUTION FOR A GIVEN PROBLEM USING ANY LOCAL SEARCH ALGORITHM
Aim:
To use a genetic algorithm to find a conflict-free arrangement of queens on an 8x8 chessboard through selection, crossover, and mutation.
Algorithm:
Step 1: Input Initial Configuration: Prompt the user to enter the column position (0-7) for each row (optional).
Step 2: Initialize Parameters: Set population_size = 100, mutation_rate = 0.1, generations = 1000.
Step 3: Generate Initial Population: Create 100 individuals, each with 8 integers (column positions of queens).
Step 4: Fitness Function: Evaluate conflicts (same column or diagonal) and calculate fitness as fitness = 28 - conflicts.
Step 5: Repeat for Each Generation:
a. Selection: Use roulette-wheel selection to choose parents.
b. Crossover: Choose a crossover point and create two children.
c. Mutation: Apply mutation with some probability to change queen positions.
d. Form New Population: Add children to the new population until the desired size is reached.
e. Check for Solution: If fitness = 28, return the solution and stop.
Step 6: Display Best Solution: Print the board with queens ("Q") and empty spaces ("."). Output the generation number, or the best solution if the maximum number of generations is reached.
Implementation:
import random

def generate_population(size):
    population = []
    for _ in range(size):
        individual = [random.randint(0, 7) for _ in range(8)]
        population.append(individual)
    return population

def calculate_fitness(individual):
    conflicts = 0
    for i in range(8):
        for j in range(i + 1, 8):
            if individual[i] == individual[j] or abs(individual[i] - individual[j]) == abs(i - j):
                conflicts += 1
    return 28 - conflicts  # 28 is the maximum fitness score achievable

def select_parents(population):
    # Roulette-wheel selection: probability proportional to fitness
    total_fitness = sum(calculate_fitness(individual) for individual in population)
    probabilities = [calculate_fitness(individual) / total_fitness for individual in population]
    parent1 = random.choices(population, weights=probabilities)[0]
    parent2 = random.choices(population, weights=probabilities)[0]
    return parent1, parent2

def crossover(parent1, parent2):
    crossover_point = random.randint(0, 7)
    child1 = parent1[:crossover_point] + parent2[crossover_point:]
    child2 = parent2[:crossover_point] + parent1[crossover_point:]
    return child1, child2

def mutate(individual, mutation_rate):
    for i in range(8):
        if random.random() < mutation_rate:
            individual[i] = random.randint(0, 7)
    return individual

def genetic_algorithm(population_size, mutation_rate, generations):
    population = generate_population(population_size)
    for gen in range(generations):
        new_population = []
        for _ in range(population_size // 2):
            parent1, parent2 = select_parents(population)
            child1, child2 = crossover(parent1, parent2)
            child1 = mutate(child1, mutation_rate)
            child2 = mutate(child2, mutation_rate)
            new_population.extend([child1, child2])
        population = new_population
        best_individual = max(population, key=calculate_fitness)
        if calculate_fitness(best_individual) == 28:
            return best_individual, gen
    best_individual = max(population, key=calculate_fitness)
    return best_individual, generations

def print_board(board):
    for i in range(8):
        for j in range(8):
            if board[i] == j:
                print("Q", end=" ")
            else:
                print(".", end=" ")
        print()
    print()

def main():
    print("Enter the initial positions of queens (0-7) for each row:")
    initial_board = []  # Note: read for reference only; the GA starts from a random population
    for i in range(8):
        position = int(input(f"Row {i}: "))
        initial_board.append(position)
    population_size = 100
    mutation_rate = 0.1
    generations = 1000
    solution, gen_count = genetic_algorithm(population_size, mutation_rate, generations)
    print("Solution Found:")
    print_board(solution)
    print(f"Solution found in generation: {gen_count}")

if __name__ == "__main__":
    main()
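As noted in the comment above, main() reads an initial configuration but genetic_algorithm() starts from a purely random population. If the user's board should participate in the search, one minimal (assumed, not from the original) extension is to seed it into the initial population:

def generate_population_seeded(size, seed_individual=None):
    # Like generate_population, but optionally includes the user's board once
    population = [seed_individual[:]] if seed_individual else []
    while len(population) < size:
        population.append([random.randint(0, 7) for _ in range(8)])
    return population

# Hypothetical usage: pass initial_board through genetic_algorithm and call
# generate_population_seeded(population_size, initial_board) in place of generate_population.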
Exp. No: 3(b)
Date:
FIND OPTIMAL SOLUTION FOR A GIVEN PROBLEM USING ANY LOCAL SEARCH ALGORITHM
Aim:
To solve the 8 Queens Problem using the Hill Climbing Algorithm, which places eight queens on a chessboard such that no two queens attack each other, by iteratively moving towards better (lower-cost) configurations.
Algorithm:
Step 1: Generate a random initial board with 8 queens (one queen per row, represented by its column position).
Step 2: Calculate the cost (number of attacking queen pairs).
Step 3: Generate all possible next boards by moving each queen within its own row.
Step 4: Select the next board with the lowest cost.
Step 5:
• If the new board has a lower cost, move to it.
• Else, restart with a new random board.
Step 6: Repeat Steps 2-5 until a board with cost = 0 is found.
Step 7: Display the final solution board and the number of steps or restarts.
Implementation:
import random

def generate_board():
    return [random.randint(0, 7) for _ in range(8)]

def calculate_cost(board):
    cost = 0
    for i in range(len(board)):
        for j in range(i + 1, len(board)):
            if board[i] == board[j] or abs(board[i] - board[j]) == abs(i - j):
                cost += 1
    return cost

def get_next_board(board):
    # Generate every board reachable by moving one queen, then pick the cheapest
    next_boards = []
    for i in range(8):
        for j in range(8):
            if j != board[i]:
                next_board = list(board)
                next_board[i] = j
                next_boards.append(next_board)
    next_boards.sort(key=lambda x: calculate_cost(x))
    return next_boards[0], calculate_cost(next_boards[0])

def print_board(board):
    for i in range(8):
        for j in range(8):
            if board[i] == j:
                print("Q", end=" ")
            else:
                print(".", end=" ")
        print()
    print()

def hill_climbing():
    current_board = generate_board()
    current_cost = calculate_cost(current_board)
    while True:
        print("Current Board:")
        print_board(current_board)
        print("Cost:", current_cost)
        if current_cost == 0:
            print("Solution Found!")
            break
        next_board, next_cost = get_next_board(current_board)
        if next_cost >= current_cost:
            print("Local optimum reached. Restarting...")
            current_board = generate_board()
            current_cost = calculate_cost(current_board)
        else:
            current_board = next_board
            current_cost = next_cost

def main():
    print("Solving 8 Queens Problem using Hill Climbing Algorithm:")
    hill_climbing()

if __name__ == "__main__":
    main()
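Step 7 of the algorithm asks for the number of steps or restarts, which the loop above does not track. A minimal counted variant (the names steps and restarts are additions) could look like this:

def hill_climbing_counted():
    # Same search as hill_climbing(), but counts downhill moves and random restarts
    steps, restarts = 0, 0
    current_board = generate_board()
    current_cost = calculate_cost(current_board)
    while current_cost != 0:
        next_board, next_cost = get_next_board(current_board)
        if next_cost >= current_cost:
            restarts += 1
            current_board = generate_board()
            current_cost = calculate_cost(current_board)
        else:
            steps += 1
            current_board, current_cost = next_board, next_cost
    print_board(current_board)
    print(f"Solved after {steps} steps and {restarts} restarts")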
Ex No: 04
Date:
PROPOSE AN AI SOLUTION FOR A GIVEN CONSTRAINT SATISFACTION PROBLEM
Aim:
To solve the Water Jug Problem using Constraint Satisfaction Problem (CSP) techniques by modeling the jugs and their constraints to find an exact measurement of water, demonstrating how CSP methods yield efficient solutions through search and pruning.
Procedure:
Step 1: Define variables for each jug (e.g., X1, X2) representing water levels.
Step 2: Define domains for each variable (possible water levels from 0 to the jug's capacity).
Step 3: Establish constraints: capacity limits, valid actions (fill, transfer, empty), and the goal state (target water level).
Step 4: Implement a backtracking search to explore all possible states and actions.
Step 5: Apply forward checking to prune invalid states during the search.
Step 6: Track visited states to prevent cycles and redundant searches.
Step 7: Backtrack, and stop when the goal state (target water level) is reached.
Implementation:
from collections import defaultdict

jug1, jug2, aim = 4, 3, 1
visited = defaultdict(lambda: False)

def water_jug_solver(x, y):
    # (x, y) is the current amount of water in jug1 and jug2
    if x == aim or y == aim:
        print(f"Reached goal state: ({x}, {y})")
        return True
    if visited[(x, y)]:
        return False
    visited[(x, y)] = True
    print(f"Exploring state: ({x}, {y})")
    return (
        water_jug_solver(0, y) or                                        # empty jug1
        water_jug_solver(x, 0) or                                        # empty jug2
        water_jug_solver(jug1, y) or                                     # fill jug1
        water_jug_solver(x, jug2) or                                     # fill jug2
        water_jug_solver(x - min(x, jug2 - y), y + min(x, jug2 - y)) or  # pour jug1 -> jug2
        water_jug_solver(x + min(y, jug1 - x), y - min(y, jug1 - x))     # pour jug2 -> jug1
    )

water_jug_solver(0, 0)
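The recursive solver above prints states as it explores but does not recover the sequence of actions. An alternative sketch using BFS (same capacities and target assumed) returns the shortest sequence of (jug1, jug2) states leading to the goal:

from collections import deque

def water_jug_bfs(cap1, cap2, target):
    # BFS over (x, y) states; parent links let us rebuild the shortest path
    start, parent = (0, 0), {(0, 0): None}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if x == target or y == target:
            path, state = [], (x, y)
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        pour12 = min(x, cap2 - y)  # amount poured jug1 -> jug2
        pour21 = min(y, cap1 - x)  # amount poured jug2 -> jug1
        for nxt in [(0, y), (x, 0), (cap1, y), (x, cap2),
                    (x - pour12, y + pour12), (x + pour21, y - pour21)]:
            if nxt not in parent:
                parent[nxt] = (x, y)
                queue.append(nxt)
    return None

print(water_jug_bfs(4, 3, 1))  # [(0, 0), (4, 0), (1, 3)]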
Ex No: 05
Date:
TAKE A SAMPLE DATASET AND APPLY SUITABLE PRE-PROCESSING TECHNIQUES
Aim:
To take a sample dataset and perform preprocessing steps such as handling missing values, encoding features, scaling data, and splitting into training and testing sets.
Algorithm:
Step 1: Input Dataset
• Read the CSV file into a DataFrame.
Step 2: Explore and Inspect Data
• View first few rows, summary statistics, and basic info.
Step 3: Clean Data
• Drop non-informative columns ('id', 'Unnamed: 32').
• Check for missing values and duplicates.
Step 4: Analyze Data
• Group by 'diagnosis' and compute mean features.
Step 5: Encode Labels
• Apply label encoding: Malignant = 1, Benign = 0.
Step 6: Scale Features
• Standardize feature values using StandardScaler.
Implementation:
import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset
data = pd.read_csv("/content/data.csv")
data.head(5)

# Generate summary statistics for numerical columns in the dataset
data.describe()

# Display basic information about the dataset including column types and non-null counts
data.info()

# Drop non-informative columns: 'id' and 'Unnamed: 32' (which may be empty or irrelevant)
data = data.drop([col for col in ['id', 'Unnamed: 32'] if col in data.columns], axis=1)

# Get the count of each category in the 'diagnosis' column (Malignant/M and Benign/B)
data["diagnosis"].value_counts()

# Check for missing values in each column
data.isnull().sum()

# Check for duplicate rows in the dataset
data.duplicated().sum()

# Get the shape of the dataset (rows, columns)
data.shape

# Group the data by 'diagnosis' and calculate the mean of each feature
data.groupby("diagnosis").mean()

# Apply Label Encoding: Convert 'Malignant' (M) to 1 and 'Benign' (B) to 0
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis'])
data['diagnosis']

# Standardize features: Scale the data using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data.drop('diagnosis', axis=1))
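The aim also calls for splitting into training and testing sets, which the cells above stop short of. A minimal continuation from X_scaled and the encoded labels:

# Split into training (80%) and testing (20%) sets
from sklearn.model_selection import train_test_split
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)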
Ex No: 06
Date:
PERFORM DIMENSIONALITY REDUCTION USING PRINCIPAL COMPONENT ANALYSIS ON A LARGE DATASET
Aim:
To apply Principal Component Analysis (PCA) on the scaled breast cancer dataset features and reduce the dimensionality to 2 components for visualization and analysis.
Algorithm:
Step 1: Import PCA Module
• Import PCA from sklearn.decomposition.
Step 2: Initialize PCA
• Set the number of principal components (e.g., n_components=2).
Step 3: Apply PCA
• Fit PCA on the standardized dataset and transform it into a lower-dimensional space.
Step 4: Output Results
• Display the transformed feature set and verify the new shape (rows × 2 columns).
Implementation:
# Import PCA from sklearn
from sklearn.decomposition import PCA

# Apply PCA: specify the number of components to retain.
# Here we reduce to 2 components (for visualization).
pca = PCA(n_components=2)

# Fit PCA on the scaled data and transform it to the new lower-dimensional space
X_pca = pca.fit_transform(X_scaled)
print(X_pca)

# Output the shape of the transformed data (it should now have 2 features)
print(X_pca.shape)
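For analysis, it also helps to check how much variance the two components retain and to plot the projection. A short sketch, assuming X_pca from above and the encoded 'diagnosis' column from Ex No: 05 are still in scope:

# Proportion of the original variance captured by each principal component
print(pca.explained_variance_ratio_)

# 2-D scatter of the projection, coloured by diagnosis (0 = benign, 1 = malignant)
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=data['diagnosis'], cmap='coolwarm', alpha=0.6)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Data Projected onto 2 Principal Components')
plt.show()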
Ex No: 07
Date:
IMPLEMENT AND DEMONSTRATE THE WORKING OF NAIVE BAYES CLASSIFIER IN A REAL-LIFE APPLICATION
Aim:
To implement and demonstrate the working of the Gaussian Naive Bayes classifier for predicting breast cancer diagnosis using a real-life medical dataset.
Algorithm:
Step 1: Import Required Libraries
• Import libraries for data preprocessing, model training, and evaluation (train_test_split, GaussianNB, StandardScaler, metrics).
Step 2: Prepare Data
• Split the pre-processed dataset into features (X) and target (y).
• Standardize the feature values using StandardScaler.
Step 3: Split Data
• Split the dataset into training and testing sets (80% training, 20% testing).
Step 4: Train the Model
• Initialize the Gaussian Naive Bayes model.
• Fit the model on the training data.
Step 5: Make Predictions
• Predict the target values for the testing dataset.
Step 6: Evaluate the Model
• Calculate performance metrics: Accuracy, Precision, Recall, and F1 Score.
• Display the results.
Implementation:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)
from sklearn.preprocessing import StandardScaler

# Assume 'data' is already pre-processed and ready to use
# Split the data into features (X) and target (y)
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Initialize the Gaussian Naive Bayes classifier
nb = GaussianNB()

# Train the model
nb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = nb.predict(X_test)

# Model performance: Accuracy, Precision, Recall, F1 Score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the results
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision : {precision:.4f}")
print(f"Recall : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")
Ex No: 08
Date:
DEVELOP A PREDICTION SYSTEM USING LINEAR AND LOGISTIC REGRESSION
Aim:
To develop and compare prediction systems using Linear Regression and Logistic Regression for classifying or predicting outcomes on a real-world dataset.
Algorithm:
Linear Regression (for demonstration or numeric prediction)
Step 1: Import libraries and load preprocessed data.
Step 2: Split data into features (X) and target (y).
Step 3: Standardize feature values.
Step 4: Train-test split the dataset.
Step 5: Fit a Linear Regression model to the training data.
Step 6: Predict on the test data and evaluate using metrics like MSE/R².
Logistic Regression (for classification)
Step 1: Import LogisticRegression from sklearn.linear_model.
Step 2: Split standardized data into training and testing sets.
Step 3: Train the logistic regression model on the training set.
Step 4: Make predictions and evaluate performance using accuracy, precision, recall, F1 score, and confusion matrix.
Implementation:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Assume 'data' is already loaded and cleaned
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# ---------- LINEAR REGRESSION ----------
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred_lin = linreg.predict(X_test)

# Convert continuous predictions to binary labels using a 0.5 threshold
y_pred_lin_class = [1 if val >= 0.5 else 0 for val in y_pred_lin]

# Metrics for Linear Regression (used as a classifier)
print("----- Linear Regression (as classifier) -----")
print(f"Accuracy : {accuracy_score(y_test, y_pred_lin_class):.4f}")
print(f"Precision : {precision_score(y_test, y_pred_lin_class):.4f}")
print(f"Recall : {recall_score(y_test, y_pred_lin_class):.4f}")
print(f"F1 Score : {f1_score(y_test, y_pred_lin_class):.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_lin_class))
# ---------- LOGISTIC REGRESSION ----------
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_log = logreg.predict(X_test)

# Metrics for Logistic Regression
print("\n----- Logistic Regression -----")
print(f"Accuracy : {accuracy_score(y_test, y_pred_log):.4f}")
print(f"Precision : {precision_score(y_test, y_pred_log):.4f}")
print(f"Recall : {recall_score(y_test, y_pred_log):.4f}")
print(f"F1 Score : {f1_score(y_test, y_pred_log):.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))

# Plot confusion matrix for logistic regression
cm = confusion_matrix(y_test, y_pred_log)
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='YlGnBu',
    xticklabels=['Benign', 'Malignant'],
    yticklabels=['Benign', 'Malignant']
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Logistic Regression')
plt.show()
Ex No: 9
Date:
DEVELOP A CLASSIFIER USING AN ARTIFICIAL NEURAL NETWORK FOR ANY ONLINE EXPERT SYSTEM
Aim:
To build a classifier using an Artificial Neural Network (ANN) to predict customer churn as part of an online expert system.
Algorithm:
Step 1: Import necessary libraries (pandas, numpy, tensorflow, etc.).
Step 2: Load the dataset (Churn_Modelling.csv) using pandas.read_csv().
Step 3: Preprocess the data: drop unnecessary columns (RowNumber, CustomerId, Surname) and encode categorical variables (Gender and Geography).
Step 4: Split the dataset into training and test sets (e.g., 80/20 split).
Step 5: Scale features using StandardScaler for normalization.
Step 6: Initialize the ANN using Sequential().
Step 7: Add hidden layers with the 'relu' activation function.
Step 8: Add an output layer with 'sigmoid' activation (binary classification).
Step 9: Compile the model with the adam optimizer and binary_crossentropy loss.
Step 10: Train the model using fit() with the defined epochs and batch size.
Step 11: Evaluate the model and make predictions.
Step 12: Save the trained ANN model to a file (e.g., .h5 format).
Implementation:
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Import the dataset
dataset = pd.read_csv('/content/Churn_Modelling.csv')
print(dataset.head())

# Grouping customers based on Geography and counting their numbers
country_counts = dataset.groupby('Geography').size().reset_index(name='Count')
print(country_counts)

dataset.info()

# Number of rows
shape_no_row = dataset.shape
shape_no_row[0]

# Task 1: Generating matrix of features (X) — all independent variables
X = dataset.iloc[:, 3:-1].values
print(X)

# Generating dependent variable vector (Y)
Y = dataset.iloc[:, -1].values
print(Y)

# Task 3: Feature Engineering
# 1. Encoding categorical variable: Gender
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])
print(X[:, 2])

# 2. Encoding categorical variable: Country (Geography)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough'
)
X = ct.fit_transform(X)
print(X[:5])

# Task 4: Creating Training and Testing Data
# 1. Splitting dataset into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

# 2. Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print("X_train after scaling:\n", X_train[:5])
print("X_test after scaling:\n", X_test[:5])

# Task 5: Building an Artificial Neural Network (ANN)
# 1. Initializing the ANN
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
ann = Sequential()
# (summary() is deferred until layers are added: an empty Sequential model cannot be summarized)

# 2. Creating hidden layers
from tensorflow.keras.layers import Dense
# First hidden layer with 6 neurons (adjustable)
ann.add(Dense(units=6, activation='relu', input_dim=X_train.shape[1]))
# Second hidden layer with 6 neurons
ann.add(Dense(units=6, activation='relu'))
ann.summary()

# 3. Creating the output layer with 1 neuron and sigmoid activation
ann.add(Dense(units=1, activation='sigmoid'))
ann.summary()

# 4. Compiling the ANN
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# 5. Fitting the ANN to the training data
ann.fit(X_train, y_train, batch_size=32, epochs=100)
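Steps 11 and 12 of the algorithm (evaluating and saving the ANN) are not shown in the remaining cells, which switch to separate regression examples. A minimal sketch continuing from the fitted ann (the file name is an assumption):

# Step 11: Evaluate the ANN on the held-out test data
loss, accuracy = ann.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")

# Threshold the churn probabilities at 0.5 to get class labels
y_pred_ann = (ann.predict(X_test) > 0.5).astype(int)

# Step 12: Save the trained model in HDF5 format
ann.save('churn_ann_model.h5')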
# Task 6: Making Predictions with the Trained ANN
# 1. Predicting output for a single data point (illustrated here with Linear Regression)
from sklearn.linear_model import LinearRegression
import numpy as np

# Sample training data
X_train_lr = np.array([[1], [2], [3], [4], [5]])  # Example features (independent variable)
y_train_lr = np.array([1, 2, 3, 4, 5])            # Example labels (dependent variable)

# Create and train the model
model = LinearRegression()
model.fit(X_train_lr, y_train_lr)

# Single data point for prediction
X_new = np.array([[6]])  # New input for which we want to predict the output

# Predict the output for the new data point
prediction = model.predict(X_new)

# Print the prediction
print(f"Predicted output for the data point {X_new[0][0]}: {prediction[0]}")
# 2. Predicting output for multiple data points (illustrated here with Logistic Regression)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# Load dataset
df = pd.read_csv('Churn_Modelling.csv')

# Drop irrelevant columns
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

# Encode categorical features
le_geo = LabelEncoder()
le_gender = LabelEncoder()
df['Geography'] = le_geo.fit_transform(df['Geography'])
df['Gender'] = le_gender.fit_transform(df['Gender'])

# Define features and label
X = df.drop('Exited', axis=1)
y = df['Exited']

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Define new multiple customer inputs (10 features each)
new_data = pd.DataFrame([
    [600, le_geo.transform(['France'])[0], le_gender.transform(['Male'])[0], 40, 3, 60000, 1, 1, 1, 50000],
    [750, le_geo.transform(['Germany'])[0], le_gender.transform(['Female'])[0], 50, 5, 100000, 2, 0, 0, 90000],
    [580, le_geo.transform(['Spain'])[0], le_gender.transform(['Male'])[0], 37, 2, 20000, 1, 1, 0, 40000],
    [820, le_geo.transform(['France'])[0], le_gender.transform(['Female'])[0], 30, 4, 85000, 2, 1, 1, 110000]
], columns=X.columns)

# Scale new inputs
new_data_scaled = scaler.transform(new_data)

# Predict churn
predictions = model.predict(new_data_scaled)

# Convert predictions to True/False strings
predicted_labels = ['True' if p == 1 else 'False' for p in predictions]
print(predicted_labels)
Ex No: 10
Date:
IMPLEMENT THE K-MEANS CLUSTERING ALGORITHM FOR SEGMENTING INPUTS OF A BUSINESS MODEL
Aim:
To implement the K-Means clustering algorithm for segmenting customer inputs in a business model based on features like age, income, and spending score.
Algorithm:
Step 1: Import necessary libraries (pandas, matplotlib, seaborn, sklearn).
Step 2: Load the dataset (Customers_data.csv) using pandas.read_csv().
Step 3: Preprocess the data: rename columns for consistency, then check for and handle missing values.
Step 4: Perform data analysis and visualization: use correlation heatmaps and scatter plots; visualize data distributions and pairwise relationships.
Step 5: Select features for clustering (e.g., Age, Annual Income, Spending Score).
Step 6: Determine the optimal number of clusters using the Elbow Method (plot WCSS vs K).
Step 7: Apply K-Means with the selected K (e.g., K = 5).
Step 8: Assign cluster labels to the data using fit_predict().
Step 9: Visualize the clustered data using scatter plots with different colors for each cluster.
Step 10: Analyze and interpret clusters for business segmentation insights.
Implementation:
# 1. Import the Dataset
import pandas as pd
data = pd.read_csv('/content/Customers_data.csv')
data.head()

# 2. Find Metadata
data.info()
data.describe()
data.shape
data.columns

# 3. Data Preprocessing
# Rename columns for easier access in code
data.rename(columns={
    'Annual Income (k$)': 'Annual_Income',
    'Spending Score (1-100)': 'Spending_Score'
}, inplace=True)

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Fill missing numeric values with column mean (if any)
data.fillna(data.mean(numeric_only=True), inplace=True)
# Task 1 – Data Analysis & Visualization
# 1. Find the correlation
correlation_matrix = data.corr(numeric_only=True)
print("Correlation Matrix:")
print(correlation_matrix)

# Visualize the correlation matrix with a heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# 2. Draw the pair plot
sns.pairplot(data)
plt.suptitle("Pair Plot of Features", y=1.02)
plt.show()

# 3. Pearson, Spearman, Kendall correlations
data_numeric = data.select_dtypes(include=['number'])
print("Pearson Correlation:\n\n", data_numeric.corr(method='pearson'))
print("\n\n")
print("Spearman Correlation:\n\n", data_numeric.corr(method='spearman'))
print("\n\n")
print("Kendall Correlation:\n\n", data_numeric.corr(method='kendall'))
print("\n\n")

# 4. Draw "Age vs Annual Income" and "Age vs Spending Score" graphs
# Age vs Annual Income
plt.figure(figsize=(6, 4))
sns.scatterplot(x='Age', y='Annual_Income', data=data)
plt.title('Age vs Annual Income')
plt.xlabel('Age')
plt.ylabel('Annual Income (k$)')
plt.grid(True)
plt.show()

# Age vs Spending Score
plt.figure(figsize=(6, 4))
sns.scatterplot(x='Age', y='Spending_Score', data=data)
plt.title('Age vs Spending Score')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.grid(True)
plt.show()
# Task 2
# 1. Key difference between df.loc and df.iloc:
#    df.loc selects by label (row/column names), df.iloc selects by integer position.

# 2. Use df.loc to get the Annual Income and Spending Score
X_loc = data.loc[:, ['Annual_Income', 'Spending_Score']]
print("X_loc =")
print(X_loc)

# 3. Use df.iloc to get the Annual Income and Spending Score
# (the integer positions must match the column order of the CSV)
X_iloc = data.iloc[:, [1, 2]]
print("X_iloc =")
print(X_iloc)
# Task 3
# 1. Distribution of Annual Income
plt.figure(figsize=(8, 6))
sns.histplot(data['Annual_Income'], kde=True, color='blue', bins=20)
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 2. Distribution of Age
plt.figure(figsize=(8, 6))
sns.histplot(data['Age'], kde=True, color='green', bins=20)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 3. Distribution of Spending Score
plt.figure(figsize=(8, 6))
sns.histplot(data['Spending_Score'], kde=True, color='red', bins=20)
plt.title('Distribution of Spending Score')
plt.xlabel('Spending Score (1-100)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 4. Number of Female and Male customers (and plot)
gender_count = data['Gender'].value_counts()
print("Number of Female and Male:")
print(gender_count)

gender_count.plot(kind='bar', color=['lightblue', 'lightcoral'])
plt.title('Number of Female and Male')
plt.ylabel('Count')
plt.xlabel('Gender')
plt.xticks(rotation=0)
plt.show()
# Task 4
# 1. Annual Income vs Spending Score (clustering visualization)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Annual_Income', y='Spending_Score', data=data, hue='Gender', palette='coolwarm', s=100, alpha=0.7)
plt.title('Annual Income vs Spending Score')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.grid(True)
plt.show()

# 2. Annual Income vs Age (clustering visualization)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Annual_Income', y='Age', data=data, hue='Gender', palette='coolwarm', s=100, alpha=0.7)
plt.title('Annual Income vs Age')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Age')
plt.grid(True)
plt.show()

# 3. Age vs Spending Score (clustering visualization)
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Age', y='Spending_Score', data=data, hue='Gender', palette='coolwarm', s=100, alpha=0.7)
plt.title('Age vs Spending Score')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.grid(True)
plt.show()
# Task 5
# WCSS (Within-Cluster Sum of Squares) in K-Means
from sklearn.cluster import KMeans
X = data[['Annual_Income', 'Spending_Score']]  # Features for clustering
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('WCSS vs Number of Clusters (Elbow Method)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.grid(True)
plt.xticks(range(1, 11))
plt.show()
# Task 6
# Apply K-Means clustering with the selected K = 5
X = data[['Age', 'Annual_Income', 'Spending_Score']]  # Features for clustering
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
data['Cluster'] = kmeans.fit_predict(X)
data.head()
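Step 9 calls for visualizing the clustered data with a different color per cluster, which the cells above stop short of. A minimal continuation from the fitted model:

# Visualize the 5 segments on the Annual Income vs Spending Score plane
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Annual_Income', y='Spending_Score', data=data,
                hue='Cluster', palette='tab10', s=80)
plt.title('Customer Segments (K-Means, K = 5)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.grid(True)
plt.show()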