Untitled

Coda_D

Jun 11th, 2025

1,329

Never

Add comment

Not a member of Pastebin yet? Sign Up, it unlocks many cool features!

SQL 15.18 KB | Help | 0 0

raw download clone embed print report

/*
Remember to replace placeholder names like `your_schema_name`, `source_table`, `target_table` with actual names. Also, specific syntax for date functions or system catalog queries can vary slightly between SQL databases (e.g., `GETDATE()` in SQL Server vs. `NOW()` in PostgreSQL/MySQL for current timestamp).
*/
**Hypothetical TABLE Structures:**
--`employees` table:**
* `employee_id` (INT, PRIMARY KEY)
* `first_name` (VARCHAR)
* `last_name` (VARCHAR)
* `email` (VARCHAR)
* `phone_number` (VARCHAR)
* `hire_date` (DATE)
* `job_id` (VARCHAR)
* `salary` (DECIMAL)
* `commission_pct` (DECIMAL, NULLABLE)
* `manager_id` (INT, NULLABLE, FOREIGN KEY TO `employee_id`)
* `department_id` (INT, NULLABLE, FOREIGN KEY TO `departments.department_id`)
* `last_updated` (TIMESTAMP)
--`departments` table:**
* `department_id` (INT, PRIMARY KEY)
* `department_name` (VARCHAR)
* `location_id` (INT)
Let's get started:
---
### I. Data Cleaning Scripts
Data cleaning involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in data.
**1. Finding and Handling NULL Values**
--Count NULLs in a specific column:**
--sql
SELECT COUNT(*)
FROM employees
WHERE commission_pct IS NULL;
--
--*Explanation: Counts how many employees have no commission percentage recorded.*
--Select rows where a column is NULL:**
--sql
SELECT *
FROM employees
WHERE department_id IS NULL;
--
--*Explanation: Retrieves all employees not assigned to a department.*
--Update NULL values to a default (e.g., 0 for commission_pct):**
--sql
UPDATE employees
SET commission_pct = 0
WHERE commission_pct IS NULL;
--
--*Explanation: Sets the commission percentage to 0 for all employees where it was previously NULL.*
--Using COALESCE to provide a default value in a query (doesn't CHANGE the underlying DATA):**
--sql
SELECT
employee_id,
first_name,
last_name,
COALESCE(commission_pct, 0) AS commission_rate
FROM employees;
--
--*Explanation: If `commission_pct` is NULL, it displays 0 in the `commission_rate` column for that row.*
**2. Removing Duplicate ROWS**
--Identify duplicate rows based on certain columns (e.g., email):**
--sql
SELECT
email,
COUNT(*)
FROM employees
GROUP BY email
HAVING COUNT(*) > 1;
--
--*Explanation: Shows emails that appear more than once in the `employees` table.*
--Delete duplicate rows (keeping one instance - method varies by SQL dialect):**
--Using a Common Table Expression (CTE) and ROW_NUMBER() (supported by many modern SQL databases):**
--sql
WITH RowNumCTE AS (
SELECT
employee_id,
email,
ROW_NUMBER() OVER(PARTITION BY email ORDER BY hire_date DESC) AS rn
FROM employees
)
DELETE FROM RowNumCTE
WHERE rn > 1;
--
--*Explanation: Assigns a row number to each employee within groups of identical emails, ordered by hire date (keeping the most recent). Then deletes rows where the row number is greater than 1 (i.e., the duplicates).*
**3. Standardizing DATA Formats**
--Convert text to uppercase or lowercase:**
--sql
UPDATE employees
SET email = LOWER(email);
--
--*Explanation: Converts all email addresses to lowercase to ensure consistency.*
--Trim leading/trailing whitespace:**
--sql
UPDATE employees
SET first_name = TRIM(first_name),
last_name = TRIM(last_name);
--
--*Explanation: Removes any accidental spaces before or after first and last names.*
--Format dates (example: displaying hire_date in 'YYYY-MM-DD' format - specific function may vary):**
--sql
SELECT
employee_id,
TO_CHAR(hire_date, 'YYYY-MM-DD') AS formatted_hire_date -- PostgreSQL/Oracle
-- FORMAT(hire_date, 'yyyy-MM-dd') -- SQL Server
-- DATE_FORMAT(hire_date, '%Y-%m-%d') -- MySQL
FROM employees;
--
--*Explanation: Shows the hire date in a specific string format.*
**4. Correcting Inconsistent VALUES**
--Using CASE statements to standardize values:**
--sql
UPDATE employees
SET job_id = CASE
WHEN job_id = 'SALES_REP' THEN 'SA_REP'
WHEN job_id = 'SALES_MAN' THEN 'SA_MAN'
ELSE job_id
END
WHERE job_id IN ('SALES_REP', 'SALES_MAN');
--
--*Explanation: Standardizes different codes for similar job roles.*
---
### II. DATA Manipulation Scripts
DATA manipulation involves adding, deleting, OR modifying DATA IN the DATABASE.
**1. Inserting DATA**
--Insert a new row:**
--sql
INSERT INTO departments (department_id, department_name, location_id)
VALUES (280, 'Public Relations', 1700);
--
--*Explanation: Adds a new department to the `departments` table.*
--Insert data from another table:**
--sql
INSERT INTO employees_archive (employee_id, first_name, last_name, hire_date)
SELECT employee_id, first_name, last_name, hire_date
FROM employees
WHERE hire_date < '2010-01-01';
--
--*Explanation: Copies employees hired before 2010 into an `employees_archive` table.*
**2. Updating DATA**
--Update specific records:**
--sql
UPDATE employees
SET salary = salary * 1.05, last_updated = CURRENT_TIMESTAMP
WHERE department_id = 90; -- Assuming department_id 90 is 'Executive'
--
--*Explanation: Gives a 5% salary raise to all employees in department 90 and updates their `last_updated` timestamp.*
--Conditional update:**
--sql
UPDATE employees
SET commission_pct = 0.10
WHERE department_id IN (SELECT department_id FROM departments WHERE department_name = 'Sales')
AND commission_pct IS NULL;
--
--*Explanation: Sets a default commission of 10% for salespeople who don't have one.*
**3. Deleting DATA**
--Delete specific records:**
--sql
DELETE FROM employees
WHERE employee_id = 207;
--
--*Explanation: Removes the employee with `employee_id` 207.*
--Delete records based on a condition:**
--sql
DELETE FROM employees
WHERE department_id IS NULL AND hire_date < '2024-01-01';
--
--*Explanation: Deletes employees who are not assigned to a department and were hired before 2024.*
---
### III. DATA Analysis Scripts
These scripts are used TO query AND retrieve meaningful insights FROM the DATA.
**1. Basic SELECT AND Filtering**
--Select all columns from a table:**
--sql
SELECT * FROM employees;
--
--Select specific columns:**
--sql
SELECT first_name, last_name, email, salary
FROM employees;
--
--Filtering with WHERE:**
--sql
SELECT first_name, last_name, salary
FROM employees
WHERE salary > 10000
AND department_id = 80; -- Sales department
--
--*Explanation: Finds salespeople earning more than $10,000.*
--Using LIKE for pattern matching:**
--sql
SELECT first_name, last_name
FROM employees
WHERE last_name LIKE 'S%';
--
--*Explanation: Finds employees whose last name starts with 'S'.*
--Using IN to specify multiple possible values:**
--sql
SELECT *
FROM employees
WHERE department_id IN (30, 50, 80);
--
--*Explanation: Retrieves employees from departments 30, 50, or 80.*
**2. Aggregation AND GROUPING**
--Count total employees:**
--sql
SELECT COUNT(*) AS total_employees FROM employees;
--
--Calculate average salary:**
--sql
SELECT AVG(salary) AS average_salary FROM employees;
--
--Find min/max salary:**
--sql
SELECT MIN(salary) AS min_salary, MAX(salary) AS max_salary FROM employees;
--
--Count employees and average salary per department:**
--sql
SELECT
d.department_name,
COUNT(e.employee_id) AS number_of_employees,
AVG(e.salary) AS average_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name
ORDER BY average_salary DESC;
--
--*Explanation: Shows the number of employees and average salary for each department, ordered by the highest average salary.*
--Filtering groups with HAVING:**
--sql
SELECT
department_id,
COUNT(*) AS num_employees,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING COUNT(*) > 10
ORDER BY avg_salary DESC;
--
--*Explanation: Shows departments with more than 10 employees, ordered by average salary.*
**3. Joins**
--INNER JOIN (get employees and their department names):**
--sql
SELECT
e.first_name,
e.last_name,
d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;
--
--*Explanation: Retrieves employees who have a matching department in the `departments` table.*
--LEFT JOIN (get all employees, and their department names if they have one):**
--sql
SELECT
e.first_name,
e.last_name,
COALESCE(d.department_name, 'No Department') AS department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;
--
--*Explanation: Retrieves all employees; if an employee has no department, 'No Department' is shown.*
--RIGHT JOIN (get all departments, and their employees if any):**
--sql
SELECT
d.department_name,
e.first_name,
e.last_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;
--
--*Explanation: Retrieves all departments; if a department has no employees, employee fields will be NULL.*
--FULL OUTER JOIN (get all employees and all departments):**
--sql
SELECT
e.first_name,
e.last_name,
d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;
--
--*Explanation: Retrieves all rows from both tables, matching where possible, and filling with NULLs otherwise.*
**4. Subqueries**
--Subquery in the WHERE clause (employees earning more than average):**
--sql
SELECT first_name, last_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
--
--*Explanation: Finds employees whose salary is above the company average.*
--Subquery in the FROM clause (derived table):**
--sql
SELECT
dept_summary.department_name,
dept_summary.avg_dept_salary
FROM (
SELECT
d.department_name,
AVG(e.salary) AS avg_dept_salary
FROM employees e
JOIN departments d ON e.department_id = d.department_id
GROUP BY d.department_name
) AS dept_summary
WHERE dept_summary.avg_dept_salary > 7000;
--
--*Explanation: First calculates average salaries per department, then filters those departments where the average is above $7000.*
**5. Window Functions (FOR more advanced analysis)**
--Rank employees by salary within each department:**
--sql
SELECT
first_name,
last_name,
salary,
department_id,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank_in_dept
FROM employees;
--
--*Explanation: Assigns a rank to each employee based on their salary within their department.*
--Calculate running total of salaries:**
--sql
SELECT
employee_id,
hire_date,
salary,
SUM(salary) OVER (ORDER BY hire_date, employee_id) AS running_total_salary
FROM employees;
--
--*Explanation: Shows a cumulative sum of salaries as you go through employees ordered by their hire date.*
--Compare employee salary to average salary of their department:**
--sql
SELECT
e.first_name,
e.last_name,
e.salary,
d.department_name,
AVG(e.salary) OVER (PARTITION BY e.department_id) AS avg_dept_salary,
e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS diff_from_avg
FROM employees e
JOIN departments d ON e.department_id = d.department_id;
--
--*Explanation: For each employee, shows their salary, their department's average salary, and the difference.*
---
### IV. Smoke Testing Scripts
Smoke tests are quick checks TO ensure the DATA AND DATABASE STRUCTURE are fundamentally sound.
**1. TABLE Existence AND STRUCTURE**
--Check if a table exists (method varies by SQL dialect):**
--SQL Server:**
--sql
IF OBJECT_ID('employees', 'U') IS NOT NULL
SELECT 'Table employees exists.'
ELSE
SELECT 'Table employees does NOT exist.';
--
--PostgreSQL/MySQL (Information Schema):**
--sql
SELECT TABLE_NAME
FROM information_schema.TABLES
WHERE table_schema = 'your_schema_name' -- or current_database() for PG, database() for MySQL
AND TABLE_NAME = 'employees';
--
--*Explanation: Queries the system catalog to verify table presence.*
**2. ROW Counts**
--Get total row count for a table:**
--sql
SELECT COUNT(*) FROM employees;
--
--*Explanation: Confirms the table is not empty and gives a sense of data volume.*
--Compare row counts between related tables (e.g., after an ETL process):**
--sql
SELECT
(SELECT COUNT(*) FROM source_table) AS source_count,
(SELECT COUNT(*) FROM target_table) AS target_count;
--
--*Explanation: Basic check for data transfer completeness.*
**3. Basic DATA Integrity Checks**
--Check for unexpected NULLs in critical columns:**
--sql
SELECT COUNT(*)
FROM employees
WHERE employee_id IS NULL OR first_name IS NULL OR hire_date IS NULL;
--
--*Explanation: Primary keys or essential attributes should generally not be NULL.*
--Check for foreign key integrity (e.g., all `department_id` in `employees` exist in `departments`):**
--sql
SELECT e.employee_id, e.department_id
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id
WHERE d.department_id IS NULL AND e.department_id IS NOT NULL;
--
--*Explanation: Identifies "orphan" records in the `employees` table that reference non-existent departments.*
--Check for duplicate primary keys:**
--sql
SELECT employee_id, COUNT(*)
FROM employees
GROUP BY employee_id
HAVING COUNT(*) > 1;
--
--*Explanation: Primary keys must be unique; this query finds any violations.*
--Check for valid date ranges:**
--sql
SELECT MIN(hire_date), MAX(hire_date)
FROM employees;
-- Also check for future dates if not expected:
SELECT COUNT(*) FROM employees WHERE hire_date > CURRENT_DATE;
--
--*Explanation: Ensures dates are within expected boundaries.*
--Check for valid values in categorical columns:**
--sql
SELECT DISTINCT job_id
FROM employees
ORDER BY job_id;
-- Then manually review if the distinct values are all expected.
-- Or check for values NOT IN a predefined list:
SELECT job_id, COUNT(*)
FROM employees
WHERE job_id NOT IN ('SA_REP', 'IT_PROG', 'MK_MAN', 'HR_REP', 'AD_PRES') -- List of known valid job_ids
GROUP BY job_id;
--
--*Explanation: Helps identify typos or unexpected categories.*
---

Add Comment

Please, Sign In to add comment