Finding duplicate data in a table can be quite useful as it can help us identify potential issues or matches. A duplicate row is one that refers to the same thing or person as a whole other row. However, it is important to note that not all duplicate rows will have completely identical information as it will depend on what columns of data we want to search on. For example, we may have a large employee table that stores the employee’s social security number that uniquely identifies the employee in the US. We may want to use this to ensure that the employee is only listed once. If we had customers, we could search for duplicate accounts with an email address as traditionally, an email address should only belong to a single customer for most eCommerce sites. If we had multiple records for a single customer, it would be difficult to get a full order history for the customer as they would have several rows that we would have to compare against.
The fundamental of the structure of the query to find duplicates would look like the following:
SELECT <columnlist> FROM <tablename> GROUP BY <columnlist> HAVING COUNT(*) > 1;
Note that the column list in the SELECT clause should match the column list in the GROUP BY clause. In addition, we could add a COUNT(*) in the SELECT clause so we can identify how many duplicates there were of the same criteria that we grouped by.
For example, we may want to verify that all of our customers have a unique phone number, we could do so by doing:
SELECT phone, COUNT(*) FROM customer GROUP BY phone HAVING COUNT(*) > 1;
We can verify that there are no customers that meet this criterion:
Perhaps our organization has a criterion that only allows a customer to have a single order. This query could be used to check which customers have more than one order. If they have, how many are listed:
SELECT customer_id, COUNT(*) FROM invoice GROUP BY customer_id HAVING COUNT(*) > 1;
At a quick glimpse, we can be able to quickly identify those types of scenarios.
Note too that the finding of duplicates can also be useful to identify rows of data that may occur more than once with the appropriate count. For example, we may want to identify the support_rep_id if they have served more than one customer:
SELECT support_rep_id, COUNT(*) FROM customer GROUP BY support_rep_id HAVING COUNT(*) > 1;
Or perhaps we may have a list of state/country of those that have more than one customer to gear our marketing to:
SELECT state, country, COUNT(*) FROM customer GROUP BY state, country HAVING COUNT(*) > 1;
Source: Authored by Vincent Tran