Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- /*
- Remember to replace placeholder names like `your_schema_name`, `source_table`, `target_table` with actual names. Also, specific syntax for date functions or system catalog queries can vary slightly between SQL databases (e.g., `GETDATE()` in SQL Server vs. `NOW()` in PostgreSQL/MySQL for current timestamp).
- */
- **Hypothetical TABLE Structures:**
- --`employees` table:**
- * `employee_id` (INT, PRIMARY KEY)
- * `first_name` (VARCHAR)
- * `last_name` (VARCHAR)
- * `email` (VARCHAR)
- * `phone_number` (VARCHAR)
- * `hire_date` (DATE)
- * `job_id` (VARCHAR)
- * `salary` (DECIMAL)
- * `commission_pct` (DECIMAL, NULLABLE)
- * `manager_id` (INT, NULLABLE, FOREIGN KEY TO `employee_id`)
- * `department_id` (INT, NULLABLE, FOREIGN KEY TO `departments.department_id`)
- * `last_updated` (TIMESTAMP)
- --`departments` table:**
- * `department_id` (INT, PRIMARY KEY)
- * `department_name` (VARCHAR)
- * `location_id` (INT)
- Let's get started:
- ---
- ### I. Data Cleaning Scripts
- Data cleaning involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in data.
- **1. Finding and Handling NULL Values**
- --Count NULLs in a specific column:**
- --sql
- SELECT COUNT(*)
- FROM employees
- WHERE commission_pct IS NULL;
- --
- --*Explanation: Counts how many employees have no commission percentage recorded.*
- --Select rows where a column is NULL:**
- --sql
- SELECT *
- FROM employees
- WHERE department_id IS NULL;
- --
- --*Explanation: Retrieves all employees not assigned to a department.*
- --Update NULL values to a default (e.g., 0 for commission_pct):**
- --sql
- UPDATE employees
- SET commission_pct = 0
- WHERE commission_pct IS NULL;
- --
- --*Explanation: Sets the commission percentage to 0 for all employees where it was previously NULL.*
- --Using COALESCE to provide a default value in a query (doesn't CHANGE the underlying DATA):**
- --sql
- SELECT
- employee_id,
- first_name,
- last_name,
- COALESCE(commission_pct, 0) AS commission_rate
- FROM employees;
- --
- --*Explanation: If `commission_pct` is NULL, it displays 0 in the `commission_rate` column for that row.*
- **2. Removing Duplicate ROWS**
- --Identify duplicate rows based on certain columns (e.g., email):**
- --sql
- SELECT
- email,
- COUNT(*)
- FROM employees
- GROUP BY email
- HAVING COUNT(*) > 1;
- --
- --*Explanation: Shows emails that appear more than once in the `employees` table.*
- --Delete duplicate rows (keeping one instance - method varies by SQL dialect):**
- --Using a Common Table Expression (CTE) and ROW_NUMBER() (supported by many modern SQL databases):**
- --sql
- WITH RowNumCTE AS (
- SELECT
- employee_id,
- email,
- ROW_NUMBER() OVER(PARTITION BY email ORDER BY hire_date DESC) AS rn
- FROM employees
- )
- DELETE FROM RowNumCTE
- WHERE rn > 1;
- --
- --*Explanation: Assigns a row number to each employee within groups of identical emails, ordered by hire date (keeping the most recent). Then deletes rows where the row number is greater than 1 (i.e., the duplicates).*
- **3. Standardizing DATA Formats**
- --Convert text to uppercase or lowercase:**
- --sql
- UPDATE employees
- SET email = LOWER(email);
- --
- --*Explanation: Converts all email addresses to lowercase to ensure consistency.*
- --Trim leading/trailing whitespace:**
- --sql
- UPDATE employees
- SET first_name = TRIM(first_name),
- last_name = TRIM(last_name);
- --
- --*Explanation: Removes any accidental spaces before or after first and last names.*
- --Format dates (example: displaying hire_date in 'YYYY-MM-DD' format - specific function may vary):**
- --sql
- SELECT
- employee_id,
- TO_CHAR(hire_date, 'YYYY-MM-DD') AS formatted_hire_date -- PostgreSQL/Oracle
- -- FORMAT(hire_date, 'yyyy-MM-dd') -- SQL Server
- -- DATE_FORMAT(hire_date, '%Y-%m-%d') -- MySQL
- FROM employees;
- --
- --*Explanation: Shows the hire date in a specific string format.*
- **4. Correcting Inconsistent VALUES**
- --Using CASE statements to standardize values:**
- --sql
- UPDATE employees
- SET job_id = CASE
- WHEN job_id = 'SALES_REP' THEN 'SA_REP'
- WHEN job_id = 'SALES_MAN' THEN 'SA_MAN'
- ELSE job_id
- END
- WHERE job_id IN ('SALES_REP', 'SALES_MAN');
- --
- --*Explanation: Standardizes different codes for similar job roles.*
- ---
- ### II. DATA Manipulation Scripts
- DATA manipulation involves adding, deleting, OR modifying DATA IN the DATABASE.
- **1. Inserting DATA**
- --Insert a new row:**
- --sql
- INSERT INTO departments (department_id, department_name, location_id)
- VALUES (280, 'Public Relations', 1700);
- --
- --*Explanation: Adds a new department to the `departments` table.*
- --Insert data from another table:**
- --sql
- INSERT INTO employees_archive (employee_id, first_name, last_name, hire_date)
- SELECT employee_id, first_name, last_name, hire_date
- FROM employees
- WHERE hire_date < '2010-01-01';
- --
- --*Explanation: Copies employees hired before 2010 into an `employees_archive` table.*
- **2. Updating DATA**
- --Update specific records:**
- --sql
- UPDATE employees
- SET salary = salary * 1.05, last_updated = CURRENT_TIMESTAMP
- WHERE department_id = 90; -- Assuming department_id 90 is 'Executive'
- --
- --*Explanation: Gives a 5% salary raise to all employees in department 90 and updates their `last_updated` timestamp.*
- --Conditional update:**
- --sql
- UPDATE employees
- SET commission_pct = 0.10
- WHERE department_id IN (SELECT department_id FROM departments WHERE department_name = 'Sales')
- AND commission_pct IS NULL;
- --
- --*Explanation: Sets a default commission of 10% for salespeople who don't have one.*
- **3. Deleting DATA**
- --Delete specific records:**
- --sql
- DELETE FROM employees
- WHERE employee_id = 207;
- --
- --*Explanation: Removes the employee with `employee_id` 207.*
- --Delete records based on a condition:**
- --sql
- DELETE FROM employees
- WHERE department_id IS NULL AND hire_date < '2024-01-01';
- --
- --*Explanation: Deletes employees who are not assigned to a department and were hired before 2024.*
- ---
- ### III. DATA Analysis Scripts
- These scripts are used TO query AND retrieve meaningful insights FROM the DATA.
- **1. Basic SELECT AND Filtering**
- --Select all columns from a table:**
- --sql
- SELECT * FROM employees;
- --
- --Select specific columns:**
- --sql
- SELECT first_name, last_name, email, salary
- FROM employees;
- --
- --Filtering with WHERE:**
- --sql
- SELECT first_name, last_name, salary
- FROM employees
- WHERE salary > 10000
- AND department_id = 80; -- Sales department
- --
- --*Explanation: Finds salespeople earning more than $10,000.*
- --Using LIKE for pattern matching:**
- --sql
- SELECT first_name, last_name
- FROM employees
- WHERE last_name LIKE 'S%';
- --
- --*Explanation: Finds employees whose last name starts with 'S'.*
- --Using IN to specify multiple possible values:**
- --sql
- SELECT *
- FROM employees
- WHERE department_id IN (30, 50, 80);
- --
- --*Explanation: Retrieves employees from departments 30, 50, or 80.*
- **2. Aggregation AND GROUPING**
- --Count total employees:**
- --sql
- SELECT COUNT(*) AS total_employees FROM employees;
- --
- --Calculate average salary:**
- --sql
- SELECT AVG(salary) AS average_salary FROM employees;
- --
- --Find min/max salary:**
- --sql
- SELECT MIN(salary) AS min_salary, MAX(salary) AS max_salary FROM employees;
- --
- --Count employees and average salary per department:**
- --sql
- SELECT
- d.department_name,
- COUNT(e.employee_id) AS number_of_employees,
- AVG(e.salary) AS average_salary
- FROM employees e
- JOIN departments d ON e.department_id = d.department_id
- GROUP BY d.department_name
- ORDER BY average_salary DESC;
- --
- --*Explanation: Shows the number of employees and average salary for each department, ordered by the highest average salary.*
- --Filtering groups with HAVING:**
- --sql
- SELECT
- department_id,
- COUNT(*) AS num_employees,
- AVG(salary) AS avg_salary
- FROM employees
- GROUP BY department_id
- HAVING COUNT(*) > 10
- ORDER BY avg_salary DESC;
- --
- --*Explanation: Shows departments with more than 10 employees, ordered by average salary.*
- **3. Joins**
- --INNER JOIN (get employees and their department names):**
- --sql
- SELECT
- e.first_name,
- e.last_name,
- d.department_name
- FROM employees e
- INNER JOIN departments d ON e.department_id = d.department_id;
- --
- --*Explanation: Retrieves employees who have a matching department in the `departments` table.*
- --LEFT JOIN (get all employees, and their department names if they have one):**
- --sql
- SELECT
- e.first_name,
- e.last_name,
- COALESCE(d.department_name, 'No Department') AS department_name
- FROM employees e
- LEFT JOIN departments d ON e.department_id = d.department_id;
- --
- --*Explanation: Retrieves all employees; if an employee has no department, 'No Department' is shown.*
- --RIGHT JOIN (get all departments, and their employees if any):**
- --sql
- SELECT
- d.department_name,
- e.first_name,
- e.last_name
- FROM employees e
- RIGHT JOIN departments d ON e.department_id = d.department_id;
- --
- --*Explanation: Retrieves all departments; if a department has no employees, employee fields will be NULL.*
- --FULL OUTER JOIN (get all employees and all departments):**
- --sql
- SELECT
- e.first_name,
- e.last_name,
- d.department_name
- FROM employees e
- FULL OUTER JOIN departments d ON e.department_id = d.department_id;
- --
- --*Explanation: Retrieves all rows from both tables, matching where possible, and filling with NULLs otherwise.*
- **4. Subqueries**
- --Subquery in the WHERE clause (employees earning more than average):**
- --sql
- SELECT first_name, last_name, salary
- FROM employees
- WHERE salary > (SELECT AVG(salary) FROM employees);
- --
- --*Explanation: Finds employees whose salary is above the company average.*
- --Subquery in the FROM clause (derived table):**
- --sql
- SELECT
- dept_summary.department_name,
- dept_summary.avg_dept_salary
- FROM (
- SELECT
- d.department_name,
- AVG(e.salary) AS avg_dept_salary
- FROM employees e
- JOIN departments d ON e.department_id = d.department_id
- GROUP BY d.department_name
- ) AS dept_summary
- WHERE dept_summary.avg_dept_salary > 7000;
- --
- --*Explanation: First calculates average salaries per department, then filters those departments where the average is above $7000.*
- **5. Window Functions (FOR more advanced analysis)**
- --Rank employees by salary within each department:**
- --sql
- SELECT
- first_name,
- last_name,
- salary,
- department_id,
- RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank_in_dept
- FROM employees;
- --
- --*Explanation: Assigns a rank to each employee based on their salary within their department.*
- --Calculate running total of salaries:**
- --sql
- SELECT
- employee_id,
- hire_date,
- salary,
- SUM(salary) OVER (ORDER BY hire_date, employee_id) AS running_total_salary
- FROM employees;
- --
- --*Explanation: Shows a cumulative sum of salaries as you go through employees ordered by their hire date.*
- --Compare employee salary to average salary of their department:**
- --sql
- SELECT
- e.first_name,
- e.last_name,
- e.salary,
- d.department_name,
- AVG(e.salary) OVER (PARTITION BY e.department_id) AS avg_dept_salary,
- e.salary - AVG(e.salary) OVER (PARTITION BY e.department_id) AS diff_from_avg
- FROM employees e
- JOIN departments d ON e.department_id = d.department_id;
- --
- --*Explanation: For each employee, shows their salary, their department's average salary, and the difference.*
- ---
- ### IV. Smoke Testing Scripts
- Smoke tests are quick checks TO ensure the DATA AND DATABASE STRUCTURE are fundamentally sound.
- **1. TABLE Existence AND STRUCTURE**
- --Check if a table exists (method varies by SQL dialect):**
- --SQL Server:**
- --sql
- IF OBJECT_ID('employees', 'U') IS NOT NULL
- SELECT 'Table employees exists.'
- ELSE
- SELECT 'Table employees does NOT exist.';
- --
- --PostgreSQL/MySQL (Information Schema):**
- --sql
- SELECT TABLE_NAME
- FROM information_schema.TABLES
- WHERE table_schema = 'your_schema_name' -- or current_database() for PG, database() for MySQL
- AND TABLE_NAME = 'employees';
- --
- --*Explanation: Queries the system catalog to verify table presence.*
- **2. ROW Counts**
- --Get total row count for a table:**
- --sql
- SELECT COUNT(*) FROM employees;
- --
- --*Explanation: Confirms the table is not empty and gives a sense of data volume.*
- --Compare row counts between related tables (e.g., after an ETL process):**
- --sql
- SELECT
- (SELECT COUNT(*) FROM source_table) AS source_count,
- (SELECT COUNT(*) FROM target_table) AS target_count;
- --
- --*Explanation: Basic check for data transfer completeness.*
- **3. Basic DATA Integrity Checks**
- --Check for unexpected NULLs in critical columns:**
- --sql
- SELECT COUNT(*)
- FROM employees
- WHERE employee_id IS NULL OR first_name IS NULL OR hire_date IS NULL;
- --
- --*Explanation: Primary keys or essential attributes should generally not be NULL.*
- --Check for foreign key integrity (e.g., all `department_id` in `employees` exist in `departments`):**
- --sql
- SELECT e.employee_id, e.department_id
- FROM employees e
- LEFT JOIN departments d ON e.department_id = d.department_id
- WHERE d.department_id IS NULL AND e.department_id IS NOT NULL;
- --
- --*Explanation: Identifies "orphan" records in the `employees` table that reference non-existent departments.*
- --Check for duplicate primary keys:**
- --sql
- SELECT employee_id, COUNT(*)
- FROM employees
- GROUP BY employee_id
- HAVING COUNT(*) > 1;
- --
- --*Explanation: Primary keys must be unique; this query finds any violations.*
- --Check for valid date ranges:**
- --sql
- SELECT MIN(hire_date), MAX(hire_date)
- FROM employees;
- -- Also check for future dates if not expected:
- SELECT COUNT(*) FROM employees WHERE hire_date > CURRENT_DATE;
- --
- --*Explanation: Ensures dates are within expected boundaries.*
- --Check for valid values in categorical columns:**
- --sql
- SELECT DISTINCT job_id
- FROM employees
- ORDER BY job_id;
- -- Then manually review if the distinct values are all expected.
- -- Or check for values NOT IN a predefined list:
- SELECT job_id, COUNT(*)
- FROM employees
- WHERE job_id NOT IN ('SA_REP', 'IT_PROG', 'MK_MAN', 'HR_REP', 'AD_PRES') -- List of known valid job_ids
- GROUP BY job_id;
- --
- --*Explanation: Helps identify typos or unexpected categories.*
- ---
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement