Thursday, December 23, 2010

How to Delete Duplicate Rows

First you need to set up some data:
  
SQL> CREATE TABLE DUPLICATE_ROWS
  2   (FIRST_NAME VARCHAR2(10))
  3  /

Table created.

SQL> INSERT INTO DUPLICATE_ROWS VALUES ('ANDREW')
  2  /

1 row created.

SQL> INSERT INTO DUPLICATE_ROWS VALUES ('COLIN')
  2  /

1 row created.

SQL> INSERT INTO DUPLICATE_ROWS VALUES ('GRAHAM')
  2  /

1 row created.

SQL> INSERT INTO DUPLICATE_ROWS VALUES ('JOHN')
  2  /

1 row created.

SQL> INSERT INTO DUPLICATE_ROWS VALUES ('SAMUEL')
  2  /

1 row created.

SQL> BEGIN
  2   FOR A IN 1..16
  3   LOOP
  4    INSERT INTO DUPLICATE_ROWS
  5     SELECT * FROM DUPLICATE_ROWS;
  6   END LOOP;
  7  END;
  8  /

PL/SQL procedure successfully completed.

SQL> COMMIT
  2  /

Commit complete.

SQL>

Now use a group by to show the duplicate rows:
 
SQL> SELECT FIRST_NAME, COUNT(*)
  2   FROM DUPLICATE_ROWS
  3    GROUP BY FIRST_NAME
  4  /

FIRST_NAME   COUNT(*)
---------- ----------
ANDREW          65536
COLIN           65536
JOHN            65536
GRAHAM          65536
SAMUEL          65536

SQL>

Delete the duplicate rows. This method is shown on several other web sites:

SQL> SET TIMING ON
SQL> DELETE FROM DUPLICATE_ROWS A
  2   WHERE ROWID >
  3   (SELECT MIN(ROWID) FROM DUPLICATE_ROWS
  4    WHERE FIRST_NAME = A.FIRST_NAME)
  5  /

327675 rows deleted.

Elapsed: 00:00:12.91
SQL> SET TIMING OFF
SQL>

Show that the duplicates have gone:

SQL> SELECT FIRST_NAME, COUNT(*)
  2   FROM DUPLICATE_ROWS
  3    GROUP BY FIRST_NAME
  4  /

FIRST_NAME   COUNT(*)
---------- ----------
ANDREW              1
COLIN               1
JOHN                1
GRAHAM              1
SAMUEL              1

SQL>

Rollback the deletion:

SQL> ROLLBACK
  2  /

Rollback complete.

SQL>

Check that the duplicate rows have returned:

SQL> SELECT FIRST_NAME, COUNT(*)
  2   FROM DUPLICATE_ROWS
  3    GROUP BY FIRST_NAME
  4  /

FIRST_NAME   COUNT(*)
---------- ----------
ANDREW          65536
COLIN           65536
JOHN            65536
GRAHAM          65536
SAMUEL          65536

SQL>

Now try a different method to delete the duplicate rows. This is similar to how we used to do it with COBOL:
 
SQL> SET TIMING ON
SQL> DECLARE
  2   CURSOR C1 IS
  3    SELECT ROWID, FIRST_NAME
  4    FROM DUPLICATE_ROWS
  5    ORDER BY FIRST_NAME;
  6   PREVIOUS_FIRST_NAME
  7    DUPLICATE_ROWS.FIRST_NAME%TYPE;
  8  BEGIN
  9   PREVIOUS_FIRST_NAME :=' ';
 10   FOR REC IN C1
 11    LOOP
 12     IF REC.FIRST_NAME = PREVIOUS_FIRST_NAME THEN
 13      DELETE FROM DUPLICATE_ROWS
 14      WHERE ROWID = REC.ROWID;
 15     END IF;
 16     PREVIOUS_FIRST_NAME := REC.FIRST_NAME;
 17    END LOOP;
 18  END;
 19  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:18.09
SQL> SET TIMING OFF
SQL>

Check that the duplicate rows have gone:

SQL> SELECT FIRST_NAME, COUNT(*)
  2   FROM DUPLICATE_ROWS
  3   GROUP BY FIRST_NAME
  4  /

FIRST_NAME   COUNT(*)
---------- ----------
ANDREW              1
COLIN               1
GRAHAM              1
JOHN                1
SAMUEL              1

SQL>

No comments: