Reveals duplicate records, or similar records suspected as being duplicate
by Abraham Meidan, Ph.D.
Who are WizSame users?
Revealing duplicate records is relevant to two kinds of users:
(1) Auditors search for duplicate records in order to reveal errors and fraud. Usually they search for duplicate records such as:
- Duplicate entries for the same customer / employee / stock item
- Duplicate invoices for the same purchase as well as duplicate payments
Obviously such duplicate records are cases to be audited.
(2) Marketing and CRM managers search for duplicate records when merging a new list of potential customers into the existing customer database. They want to find the existing customers in the new list and remove them, so that only genuinely new customers are merged.
WizSame meets both of these requirements.
Similar records suspected as duplicate
Obviously identical records are duplicates, but in many cases the duplicate records are not identical but just similar. WizSame reveals similar records suspected as being duplicates.
There are several ways in which one record may be similar to another:
(1) Two records may be similar because they contain similar names. For example, in one record the customer name is Gurbatchev while in the other record it is Gorbatchev. WizSame reveals all the cases where two names differ by one character. Such a difference can arise in three ways:
- One character in one record is replaced by another character in the second record, in the same place. For example, the second character, u, in Gurbatchev replaces the second character, o, in Gorbatchev.
- One character is included in one name and is absent in the other name. For example, Gurbatchev and Gaurbatchev: the second character, a, is included in the second name only.
- Two adjacent characters in one name appear in the opposite order in the other name. For example, Gurbatchev and Grubatchev: the second and the third characters appear in opposite order in these two names.
(2) Several records may be similar when they contain synonymous names. For example, in one record the state is N.Y., in the second record the state is NY, and in the third record the state is New-York. WizSame contains a user-updateable synonym dictionary where each user can enter as many synonyms as he or she needs (in addition to the synonyms that are already included in the WizSame dictionary).
(3) Records may also be similar when they contain the same values in another order (in the same field). For example, in one record the customer name is George Ernst, while in the other record it is Ernst George – same names, different order. WizSame reviews all the possible comparisons between the strings in each field.
(4) Finally, several records may be similar if they have identical or similar values in one field or in another. For example, one may define records as similar if either the customer email address or the phone number is identical. WizSame lets the user define several conditions, connected by the AND or OR operators, to determine matching.
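The first three kinds of similarity described above can be sketched in Python as follows. This is only an illustrative sketch and not WizSame's actual implementation: the function names and the small synonym table are hypothetical, and the word-order check compares sorted words as a simplification of comparing every pair of strings in a field.

```python
def differs_by_one_character(a: str, b: str) -> bool:
    """True if a and b differ by one replaced character, one extra
    character, or one swap of two adjacent characters."""
    if a == b:
        return False
    # Make `a` the shorter (or equal-length) string.
    if len(a) > len(b):
        a, b = b, a
    if len(b) - len(a) > 1:
        return False
    # Find the first position where the two strings disagree.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Replacement: everything after position i is the same.
        if a[i + 1:] == b[i + 1:]:
            return True
        # Swap: characters at i and i+1 are exchanged, the rest is the same.
        return (i + 1 < len(a)
                and a[i] == b[i + 1] and a[i + 1] == b[i]
                and a[i + 2:] == b[i + 2:])
    # Extra character: skipping one character of the longer string aligns the rest.
    return a[i:] == b[i + 1:]


# Hypothetical synonym table: each variant maps to one canonical form.
SYNONYMS = {
    "n.y.": "new york",
    "ny": "new york",
    "new-york": "new york",
}


def are_synonyms(a: str, b: str) -> bool:
    """True if both values map to the same canonical form."""
    def canonical(value: str) -> str:
        key = value.strip().lower()
        return SYNONYMS.get(key, key)
    return canonical(a) == canonical(b)


def same_words_any_order(a: str, b: str) -> bool:
    """True if both values contain the same words, in any order."""
    return sorted(a.lower().split()) == sorted(b.lower().split())
```

With these definitions, `differs_by_one_character("Gurbatchev", "Grubatchev")`, `are_synonyms("N.Y.", "New-York")` and `same_words_any_order("George Ernst", "Ernst George")` all return `True`.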
How does WizSame deal with numbers?
When searching for similarity WizSame reads numbers as if they are names. That is, two numbers are considered similar when they differ by one character according to the above-mentioned criteria. For example the following numbers are similar to the first one:
123.45
123.46 (replacement of one digit)
7,123.45 (one extra digit)
123.54 (same digits, different order of the last two)
What are the parameters for defining similarity?
The user determines for each field whether the field should be –
- Identical
- Similar
- Ignored
The user may also define conditions, connected by the AND or OR operators, such as: records are treated as duplicates if the values in Field A are similar OR the values in Field B are identical.
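One way to picture such per-field settings and AND/OR conditions is as small composable predicates. The sketch below is an assumption-laden illustration, not WizSame's interface: the names `identical`, `similar`, `any_of` and `all_of` are hypothetical, and `similar` uses a deliberately simple stand-in test (ignoring case and surrounding spaces) in place of a full similarity check. Ignored fields are simply left out of the rule.

```python
from typing import Callable, Dict

Record = Dict[str, str]
Condition = Callable[[Record, Record], bool]


def identical(field: str) -> Condition:
    """Condition: both records hold exactly the same value in `field`."""
    return lambda r1, r2: r1[field] == r2[field]


def similar(field: str) -> Condition:
    """Condition: values match under a stand-in similarity test
    (case and surrounding whitespace are ignored). Assumption: a real
    test would also cover typos, synonyms and word order."""
    return lambda r1, r2: r1[field].strip().lower() == r2[field].strip().lower()


def any_of(*conditions: Condition) -> Condition:
    """OR: at least one condition holds."""
    return lambda r1, r2: any(c(r1, r2) for c in conditions)


def all_of(*conditions: Condition) -> Condition:
    """AND: every condition holds."""
    return lambda r1, r2: all(c(r1, r2) for c in conditions)


# Field A similar OR Field B identical; the "notes" field is ignored.
rule = any_of(similar("field_a"), identical("field_b"))

r1 = {"field_a": "Ernst George", "field_b": "212-555-0100", "notes": "x"}
r2 = {"field_a": " ernst george", "field_b": "212-555-0199", "notes": "y"}
r3 = {"field_a": "Someone Else", "field_b": "212-555-0199", "notes": "z"}

print(rule(r1, r2))  # Field A is similar
print(rule(r2, r3))  # Field B is identical
print(rule(r1, r3))  # neither condition holds
```

Replacing `any_of` with `all_of` gives the stricter AND form, where a pair matches only if every listed condition holds.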
What do I do with the discovered duplicate records?
As mentioned, WizSame addresses two kinds of users:
- Auditors looking for duplicate customers, invoices, payments, etc.;
- Marketing and CRM managers looking for existing customers in a new list of potential customers that is to be merged into the existing database.
When auditors use WizSame they run the program on the table under investigation. WizSame displays a report containing all the discovered matching sets. When reviewing this report the user can –
- Review the matching sets on the screen.
- Print a report of the matching sets.
- Export an MDB table containing all the matching sets. This table can then be used in order to issue queries on the table under investigation.
- Search for a certain record to see in what matching sets, if any, the record is included. This option is useful when one checks whether a certain record has duplicates.
- Select a group of matching sets to be exported as an MDB table. This option is useful when the user wishes to reduce the number of matching sets by applying a condition such as: select all the matching sets where the Field A values start with 212.
When marketing and CRM managers use WizSame they run the program on two tables: one table containing the existing database, and a second table containing the new list. WizSame reveals records in the new list that are identical or similar to records in the existing database. When the new list is an ASCII file, then in addition to the above-mentioned options, WizSame lets the user delete the duplicate records from the new list.
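The two-table workflow can be sketched roughly as follows. This is a hypothetical illustration rather than WizSame's own procedure: `split_new_list` and the email-based `same_email` rule are assumed names, and any matching rule could be passed in instead.

```python
def split_new_list(existing, new_list, matches):
    """Partition new_list into (duplicates, genuinely_new) by testing
    each new record against every record in the existing table."""
    duplicates, fresh = [], []
    for candidate in new_list:
        if any(matches(known, candidate) for known in existing):
            duplicates.append(candidate)
        else:
            fresh.append(candidate)
    return duplicates, fresh


def same_email(r1, r2):
    """Assumed matching rule: email addresses equal, ignoring case."""
    return r1["email"].lower() == r2["email"].lower()


existing = [{"name": "George Ernst", "email": "george@example.com"}]
new_list = [
    {"name": "Ernst George", "email": "George@Example.com"},
    {"name": "Ann Lee", "email": "ann@example.com"},
]

duplicates, fresh = split_new_list(existing, new_list, same_email)
print([r["name"] for r in duplicates])  # records already in the database
print([r["name"] for r in fresh])       # records safe to merge
```

Keeping the duplicates in a separate list, rather than discarding them immediately, leaves room to review them before they are deleted from the new list.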