Other#
Advanced Algorithms#
Click the links below for more information on the respective public advanced algorithms not developed by Melissa Data.
N-gram - http://en.wikipedia.org/wiki/N-gram
Jaro-Winkler Distance - http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Longest Common Substring Problem - http://en.wikipedia.org/wiki/Longest_common_substring_problem
Jaccard Index - http://en.wikipedia.org/wiki/Jaccard_index
Dice’s Coefficient - http://en.wikipedia.org/wiki/Dice%27s_coefficient
Overlap Coefficient - http://en.wikipedia.org/wiki/Overlap_coefficient
Needleman-Wunsch Algorithm - http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm
Smith & Waterman Algorithm - http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
Double Metaphone - http://en.wikipedia.org/wiki/Double_Metaphone#Double_Metaphone
Matchcodes#
Matchcode Components#
Domestic Components#
The following table lists the matchcode component data types available in MatchUp Object. Components marked *Note are discussed further under Notes.
Component | Description |
---|---|
Prefix | Prefix of a personal name (Mr, Mrs, Ms, Dr). |
First Name | A first name. |
Middle Name | A middle name. |
Last Name | A last name. |
Suffix | A suffix from a personal name. |
Gender | Male/Female/Neutral. |
First/Nickname | A representative nickname for a first name. |
Middle/Nickname | A representative nickname for a middle name. |
Department/Title | A title and/or department name. *Note |
Company | A company name. |
Company Acronym | A company’s acronym. *Note |
Street Number | The street number from an address line. |
Street Pre-Directional | “South” in “3 South Main St”. |
Street Name | The street name from an address line. |
Street Suffix | An address suffix (St, Ave, Blvd). |
Street Post-Directional | “North” in “3 Main St North”. |
PO Box | PO Boxes also include Farm Routes, Rural Routes, etc. |
Street Secondary | Apartments, floors, rooms, etc. |
Address | A single unparsed address line. *Note |
City | A city name. ZIP or Postal code is usually more accurate. |
State/Province | A state or province name. |
Zip9 | A full ZIP + 4® code (9 digits). *Note |
Zip5 | The ZIP Code (5 digits). |
Zip4 | The +4 extension of a ZIP + 4 code (4 digits). |
Postal Code (Canada) | A Canadian Postal Code. |
City (UK) | A city in the United Kingdom. |
County (UK) | A county in the United Kingdom. |
Postcode (UK) | A United Kingdom Postcode. |
Country | A country. |
Phone/Fax | A phone number. *Note |
E-Mail Address | An e-mail address. *Note |
Credit Card Number | A credit card number. |
Date | A date. This may result in slower throughput! *Note |
Numeric | A numeric field. This may result in slower throughput! *Note |
Proximity | Allows you to specify a maximum distance in miles between records in which a match will be possible. This may result in slower throughput! *Note |
General | Any general information. ID, birthday, SSN, etc. |
Global Components#
Component | Description |
---|---|
Postal Code | (Zip &/ plus 4) Complete postal code for a particular delivery point. |
Premises Number | (Street Number) Alphanumeric indicator within premises field. |
Double Dependent Locality | Smallest population center data element. |
Dependent Locality | (Urbanization) Smaller population center data element. Dependent on Locality. |
Sub Administrative Area | (County) Smallest geographic data element. |
Sub National Area | Arbitrary administrative region below that of the sovereign state. |
Locality | (City) Most common population center data element. |
Administrative Area | (State) Most common geographic data element. |
Thoroughfare Leading Type | Leading thoroughfare type indicator within the Thoroughfare field. |
Thoroughfare Pre-Directional | (Street Pre Direction) Prefix directional contained within the Thoroughfare field. |
Thoroughfare Name | (Street Name) Name indicator within the Thoroughfare field. |
Thoroughfare Trailing Type | (Street Suffix) Trailing thoroughfare type indicator within the Thoroughfare field. |
Thoroughfare Post-Directional | (Street Post Direction) Postfix directional contained within the Thoroughfare field. |
Dependent Thoroughfare Pre-Directional | Prefix directional contained within the Dependent Thoroughfare field. |
Dependent Thoroughfare Leading Type | Leading thoroughfare type indicator within the Dependent Thoroughfare field. |
Dependent Thoroughfare Name | Name indicator within the Dependent Thoroughfare field. |
Dependent Thoroughfare Trailing Type | Trailing thoroughfare type indicator within the Dependent Thoroughfare field. |
Dependent Thoroughfare Post-Directional | Postfix directional contained within the Dependent Thoroughfare field. |
Notes#
Company, Company Acronym, Department/Title
Frequently these components don’t match exactly because of ‘noise words’ such as “the,” “and,” “agency,” and so on. MatchUp strips these words from these components.
Company Acronym
MatchUp Object converts any multi-word company name into an acronym (for example, “International Business Machines” becomes “IBM”). Single-word company names are left as they are. This conversion is done after noise words are removed.
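The order of operations described above (strip noise words first, then build the acronym) can be sketched as follows. This is only an illustration: the noise-word list, the uppercasing step, and the function name `companyAcronym` are assumptions, not MatchUp's actual implementation.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Illustrative noise-word list; MatchUp's real list is not shown in this document.
const std::set<std::string> kNoiseWords = {"THE", "AND", "AGENCY"};

// Hypothetical helper: removes noise words, then squeezes multi-word names
// into an acronym. Single-word names are returned unchanged (uppercased).
std::string companyAcronym(const std::string& name) {
    std::istringstream in(name);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        if (kNoiseWords.count(word) == 0) words.push_back(word);  // noise removed first
    }
    if (words.size() == 1) return words.front();
    std::string acronym;
    for (const std::string& w : words) acronym += w.front();
    return acronym;
}
```

With this sketch, “International Business Machines” yields “IBM”, while “The Melissa Agency” is reduced to the single word “MELISSA” and is therefore not abbreviated.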
Street Address Components
The seven street address components (Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary) are obtained by splitting up to three address lines. Note that PO Box and/or Street Secondary do not have to appear on their own line, or in a particular field. MatchUp’s proprietary “street smart” splitter does all of the work.
Full Address
When using the Full Address component, you are at the mercy of every little deviation in data entry. Because MatchUp Object’s street splitter is so powerful, it is preferable to use street address components instead of the Full Address in nearly all cases. The only exception may be when processing foreign addresses that don’t conform very well to US, Canadian or UK addressing formats.
Zip9, Zip5, Zip4, Canadian Postal Code
MatchUp Object removes dashes and spaces from ZIP codes. When processing a mix of Canadian Postal Codes and US ZIP codes, use the Zip9 component.
Phone Number
MatchUp Object removes non-numeric characters from phone numbers. Leading ‘1-’ and trailing extensions are stripped if present. Numbers lacking an area code are right justified so that the local dialing code and number are aligned with numbers having area codes. If a data table often has missing or inaccurate area codes (e.g., after a recent area code split), start at the 4th position of the phone number component. Do not use the rightmost 7 positions, as badly formatted extensions can sometimes cause the phone number to get coded improperly.
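A rough sketch of this normalization follows, to show why starting at the 4th position lines up numbers with and without area codes. The function name and the rule used to detect an extension (stop at the first letter) are assumptions made for illustration only.

```cpp
#include <cctype>
#include <string>

// Hypothetical normalizer: keeps digits, drops a trailing extension and a
// leading 1, and right-justifies short numbers to 10 positions.
std::string normalizePhone(const std::string& raw) {
    std::string digits;
    for (char c : raw) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (std::isalpha(u)) break;          // assumed extension marker such as "x" or "ext"
        if (std::isdigit(u)) digits += c;
    }
    if (digits.size() == 11 && digits.front() == '1') digits.erase(0, 1);  // leading 1-
    if (digits.size() < 10) {
        digits = std::string(10 - digits.size(), ' ') + digits;            // right-justify
    }
    return digits;
}
```

Under these assumptions, “1-949-589-5200 x123” and “589-5200” normalize to values whose characters from position 4 onward are both “5895200”.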
E-Mail Address
MatchUp Object removes illegal characters from e-mail addresses. Incomplete, changed, and commonly misspelled domain names are corrected using the Email Address data table.
Date
MatchUp Object allows you to specify a number of days. Two records can match on this component if their dates are no more than that many days apart.
Numeric
This allows you to specify an integer. Two records can match on this component if the difference between their numeric values is no greater than that integer.
Proximity
The Proximity component requires you to map in latitude and longitude coordinates. MatchUp does not determine these coordinates; they can be obtained from a product such as GeoCoder or Contact Verify. Addresses then match when they lie within the maximum distance set for this component.
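As a point of reference, a distance in miles between two coordinate pairs is commonly computed with the haversine formula. The sketch below uses that formula with a mean Earth radius of about 3958.8 miles; whether MatchUp uses this exact calculation is not stated here, and the function names are illustrative.

```cpp
#include <cmath>

// Great-circle distance in miles between two latitude/longitude points
// given in decimal degrees (haversine formula).
double milesBetween(double lat1, double lon1, double lat2, double lon2) {
    const double kPi = 3.14159265358979323846;
    const double kEarthRadiusMiles = 3958.8;
    const double toRad = kPi / 180.0;
    const double dLat = (lat2 - lat1) * toRad;
    const double dLon = (lon2 - lon1) * toRad;
    const double h = std::sin(dLat / 2) * std::sin(dLat / 2) +
                     std::cos(lat1 * toRad) * std::cos(lat2 * toRad) *
                     std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * kEarthRadiusMiles * std::asin(std::sqrt(h));
}

// Hypothetical check corresponding to a Proximity distance setting.
bool withinProximity(double lat1, double lon1, double lat2, double lon2,
                     double maxMiles) {
    return milesBetween(lat1, lon1, lat2, lon2) <= maxMiles;
}
```

One degree of latitude is roughly 69 miles, so two points one degree apart would match with a setting of 70 but not with a setting of 50.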
Component Properties#
The matchcode components tell MatchUp Object which data types to use for creating the match key while the component properties tell MatchUp Object how much of the data to use and what parts.
Often, especially for potentially long fields like personal names and city or street names, MatchUp Object doesn’t need the full contents of the field to determine if the field is a duplicate of another. Only ten characters or so will often be enough.
Data Type#
See Matchcode Components.
Label#
This is a line of text that describes the component. Not all fields allow the label to be edited. This is most useful for clarifying the contents of General fields that don’t fit any of the other component types.
Size#
This is the maximum number of characters from the field that MatchUp Object will use to build the match key. Sizing is done after all other properties are applied.
Start#
This property determines where MatchUp Object begins counting when applying the Size property.
Value | Description |
---|---|
Left | Starts from the first character of the field. This is the most commonly used option. |
Right | Starts from the last character of the field. For example, if the data included a phone number of “949-589-5200” and the size was 7, MatchUp Object would use “5895200” for the match key. |
Position | Starts from a specific position within the field. |
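The interaction between Start and Size can be pictured as a simple substring operation on the already-normalized field. The enum and function below are hypothetical names used only for this sketch; they are not part of the MatchUp Object API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the Start property.
enum class Start { Left, Right, Position };

// Returns the slice of `field` selected by Start and Size.
// `position` is 1-based and only used with Start::Position.
std::string applyStartAndSize(const std::string& field, Start start,
                              std::size_t size, std::size_t position = 1) {
    if (start == Start::Right) {
        // Keep the last `size` characters.
        return field.size() <= size ? field : field.substr(field.size() - size);
    }
    const std::size_t begin =
        (start == Start::Position && position > 0) ? position - 1 : 0;
    if (begin >= field.size()) return "";
    return field.substr(begin, size);
}
```

For the phone number example above (stored as “9495895200” once punctuation is removed), Right with a size of 7 and Position 4 with a size of 7 both select “5895200”.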
Fuzzy#
Fuzzy settings allow for matching of non-exact components. These options are mutually exclusive, so you can only select one at a time.
Value | Description |
---|---|
Phonetex | (pronounced “Fo-NEH-tex”) An auditory matching algorithm. It works best in matching words that sound alike but are spelled differently. It is an improvement over the Soundex algorithm described below. |
Soundex | An auditory matching algorithm originally developed by the Department of Immigration in 1917 and later adopted by the USPS. Although the Phonetex algorithm is measurably superior, the Soundex algorithm is presented for users who need to create a matchcode that emulates one from another application. |
Containment | Matches when one record’s component is contained in another record. For example, “Smith” is contained in “Smithfield.” |
Frequency | Matches the characters in one record’s component to the characters in another without any regard to the sequence. For example, “abcdef” would match “badcfe.” |
Fast Near | A typographical matching algorithm. It works best in matching words that don’t match because of a few typographical errors. Exactly how many errors is specified on a scale from 1 (Tight) to 4 (Loose). The Fast Near algorithm is a faster approximation of the Accurate Near algorithm described below. The tradeoff for speed is accuracy; sometimes Fast Near will find false matches or miss true matches. |
Accurate Near | An implementation of the Levenshtein algorithm. It is a typographical matching algorithm. The Accurate Near algorithm produces better results than the Fast Near algorithm, but is slower. |
Frequency Near | Similar to Frequency matching except that you specify how many characters may be different between components. |
UTF-8 Near | Similar to Levenshtein (Accurate Near). It counts the number of typos: a character substitution counts as one typo, and a pair of transposed characters counts as two typos. This algorithm differs from others in that it accounts for character storage sizes that vary between encodings. |
Vowels Only | Only vowels will be compared. Consonants will be removed. |
Consonants Only | Only consonants will be compared. Vowels will be removed. |
Alphas Only | Only alphabetic characters will be compared. |
Numerics Only | Only numeric characters will be compared. Decimals and signs are considered numeric. |
MD Keyboard | An algorithm developed by Melissa Data which counts keyboarding mis-hits with a weighted penalty based on the distance of the mis-hit and assigns a percentage of similarity between the compared strings. |
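Because Accurate Near is described as an implementation of the Levenshtein algorithm, the textbook edit distance gives a feel for what it measures. The sketch below is the standard dynamic-programming version, not MatchUp's code, and the `nearMatch` threshold is an assumed reading of the Tight-to-Loose scale.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic Levenshtein distance: inserts, deletes and substitutions cost 1 each.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: a match when the distance does not exceed `maxTypos`.
bool nearMatch(const std::string& a, const std::string& b, std::size_t maxTypos) {
    return levenshtein(a, b) <= maxTypos;
}
```

Note that a transposition such as “SMITH” versus “SMTIH” costs two edits under this definition, which is consistent with the typo counting described for UTF-8 Near.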
Fuzzy Advanced#
Please research the definitions of the following advanced algorithms before implementing them in a matchcode.
Value | Description |
---|---|
Jaro | Gathers common characters (in order) between the two strings, then counts transpositions between the two common strings. |
Jaro-Winkler | Just like Jaro, but gives added weight for matching characters at the start of the string (up to 4 characters). |
n-Gram | Counts the number of common sub-strings (grams) between the two strings. The substring size N currently defaults to 2 in MatchUp. |
Needleman-Wunsch | Similar to Accurate Near, except that inserts/deletes aren’t weighted as heavily and, as compensation for keyboarding mis-hits, not all character substitutions are weighted equally. |
Smith-Waterman-Gotoh | Builds on Needleman-Wunsch, but gives a non-linear penalty for deletions. This effectively adds the ‘understanding’ that the keyboarder may have tried to abbreviate one of the words. |
Dice’s Coefficient | Like Jaro, Dice counts matching n-Grams (discarding duplicate n-Grams). |
Jaccard Similarity Coefficient | Very similar to Dice’s Coefficient with a slightly different calculation. |
Overlap Coefficient | Again, very similar to Dice’s Coefficient with a slightly different calculation. String similarity algorithm based on a substring calculation. |
Longest Common Substring | Finds the longest common substring between the two strings. |
Double MetaPhone | Performs two different Phonetex-style transformations. Returns a value dependent on how many of the transformations match (i.e., 1 versus 1, 1 versus 2, 2 versus 1, 2 versus 2). |
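The “slightly different calculation” that separates Dice, Jaccard and Overlap is easiest to see side by side. The sketch below applies the standard set formulas to distinct bigrams (N = 2); the type and function names are illustrative, and MatchUp's own scoring may differ in detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// Distinct bigrams of a string; the set discards duplicate grams.
std::set<std::string> bigrams(const std::string& s) {
    std::set<std::string> grams;
    for (std::size_t i = 0; i + 1 < s.size(); ++i) grams.insert(s.substr(i, 2));
    return grams;
}

struct SetSimilarity { double dice, jaccard, overlap; };

// Standard set-based coefficients over the two bigram sets X and Y:
//   Dice    = 2|X∩Y| / (|X| + |Y|)
//   Jaccard = |X∩Y| / |X∪Y|
//   Overlap = |X∩Y| / min(|X|, |Y|)
SetSimilarity bigramSimilarity(const std::string& a, const std::string& b) {
    const std::set<std::string> x = bigrams(a), y = bigrams(b);
    if (x.empty() || y.empty()) return {0.0, 0.0, 0.0};
    std::size_t common = 0;
    for (const std::string& g : x) common += y.count(g);
    const double c = static_cast<double>(common);
    return {2.0 * c / (x.size() + y.size()),
            c / (x.size() + y.size() - common),
            c / std::min(x.size(), y.size())};
}
```

For “night” and “nacht”, which share only the bigram “ht”, this gives a Dice score of 0.25, a Jaccard score of 1/7, and an Overlap score of 0.25. Multiplying by 100 gives a percentage comparable to the Distance setting.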
Distance#
This is the property where you set a range for which two records will still match. This field is context sensitive, depending on the Data Type and Fuzzy algorithm.
Data Type | Description |
---|---|
Proximity | Distance in miles. Range: 0-4000 |
Numeric | Integer number. |
Date | Number of days. |
For example, if the Distance is set to 60:
Two records with dates 20161225 and 20161031 will match (they are 55 days apart).
Two records with dates 20161225 and 20160430 will not match (they are further than 60 days apart).
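The date comparison can be sketched by converting each YYYYMMDD value to a day count and comparing the difference with the Distance setting. The conversion below is the widely used “days from civil” arithmetic for the Gregorian calendar; the function names are illustrative and not part of the MatchUp API.

```cpp
#include <cstdlib>

// Days since 1970-01-01 for a Gregorian calendar date.
long daysFromCivil(long y, long m, long d) {
    y -= m <= 2;                                                      // Jan/Feb belong to the prior "year"
    const long era = (y >= 0 ? y : y - 399) / 400;
    const long yoe = y - era * 400;                                   // year of era, 0..399
    const long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // day of year, 0..365
    const long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era
    return era * 146097 + doe - 719468;
}

// `yyyymmdd` is a date packed as an integer, for example 20161225.
long toDays(long yyyymmdd) {
    return daysFromCivil(yyyymmdd / 10000, (yyyymmdd / 100) % 100, yyyymmdd % 100);
}

// True when the two dates are no more than `distanceDays` apart.
bool datesWithin(long a, long b, long distanceDays) {
    return std::labs(toDays(a) - toDays(b)) <= distanceDays;
}
```

With a Distance of 60, `datesWithin(20161225, 20161031, 60)` holds (55 days apart) and `datesWithin(20161225, 20160430, 60)` does not (239 days apart).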
Algorithm | Description |
---|---|
Fast Near | Number of typographical errors. Range: Tight(1) - Loose(4) |
Accurate Near | Number of typographical errors. Range: Tight(1) - Loose(4) |
Note
Since these algorithms are not published and the range was originally developed to represent a general sliding scale (a narrow choice of precision), we recommend using Near:1 and testing carefully before you consider using the higher settings in production, as they can quickly return false duplicates.
The following use a percentage range of 0-100%, indicating the minimum percentage of similarity which will return a match between two strings.
N-Gram
Jaro
Jaro-Winkler
Longest Common Substring (LCS)
Needleman-Wunsch
MD Keyboard
Smith-Waterman-Gotoh
Dice’s Coefficient
Jaccard Similarity Coefficient
Overlap Coefficient
Double MetaPhone
More information on the publicly published algorithms can be found here: Advanced Algorithms.
Short/Empty Settings#
These settings control matching between incomplete or empty fields. They are not mutually exclusive, meaning that any combination of these settings may be selected.
Value | Description |
---|---|
Initial Only | Will match a full word to an initial (for example, “J” and “John”). |
One Blank Field | Will match a full word to no data (for example, “John” and “”). |
Both Blank Fields | Match this component if both records contain no data. This is a very important concept in creating matchcodes. For more information, see Blank Field Matching. |
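The three settings can be thought of as independent switches consulted before the normal comparison. The following is a minimal sketch under that assumption, using exact comparison for the normal case; the struct and function names are hypothetical.

```cpp
#include <string>

// Hypothetical flags mirroring the three Short/Empty settings.
struct ShortEmpty {
    bool initialOnly = false;
    bool oneBlankField = false;
    bool bothBlankFields = false;
};

// Compares two already-normalized component values.
bool componentMatches(const std::string& a, const std::string& b,
                      const ShortEmpty& opt) {
    if (a.empty() && b.empty()) return opt.bothBlankFields;
    if (a.empty() || b.empty()) return opt.oneBlankField;
    if (a == b) return true;
    if (opt.initialOnly) {
        // An initial matches a full word that starts with the same letter.
        if (a.size() == 1 && b.front() == a.front()) return true;
        if (b.size() == 1 && a.front() == b.front()) return true;
    }
    return false;
}
```

With all three switches off, “J” does not match “JOHN” and two empty values do not match each other; turning on the corresponding switch allows each case.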
Swap#
Swap matching is the ability to compare one component to another component. For example, if you were to swap match a First Name component and a Last Name component, you could match “John Smith” to “Smith John.” Swap matching is always defined for a pair of components. MatchUp allows you to specify up to 8 swap pairs (named “Pair A” through “Pair H”). It is strongly recommended that the other properties of both member components are identical.
For more information see Swap Matching Uses.
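In code terms, a swap pair adds a second, crossed comparison to the straight one. A minimal sketch for a First Name/Last Name pair follows; the `Person` type and `swapMatch` function are illustrative names only.

```cpp
#include <string>

struct Person { std::string first, last; };

// Matches either straight (first to first, last to last) or with the
// two swap-paired components exchanged.
bool swapMatch(const Person& a, const Person& b) {
    const bool straight = a.first == b.first && a.last == b.last;
    const bool swapped  = a.first == b.last  && a.last == b.first;
    return straight || swapped;
}
```

Because each value may be compared against either component, the comparison is only meaningful when both components are built the same way, which is why identical properties are recommended for the two members of a pair.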
Component Combinations#
Every matchcode is composed of one or more combinations of components. These columns represent different combinations of components which may detect a match between two records. A match found using any one of the combinations in a matchcode is considered a match. Programmers may think in terms of a series of OR conditions. Satisfying any one of them is considered a positive result.
MatchUp allows up to 16 different combinations of components per matchcode.
A good example of combinations would be a matchcode designed to catch last names as well as either street addresses or Post Office Box addresses.
Condition #1: ZIP/PC, Last Name, Street Number, Street Name
Condition #2: ZIP/PC, Last Name, PO Box
Such a matchcode might look like this:
Component | Size | 1 | 2 |
---|---|---|---|
ZIP/PC | 5 | X | X |
Last Name | 5 | X | X |
Street # | 4 | X | |
Street Name | 4 | X | |
PO Box | 10 | | X |
Columns 3 through 16 have been omitted for the sake of clarity. The trick to understanding this table is to look at the vertical columns of X’s. For example, looking at column 1, there are X’s in ZIP/PC, Last Name, Street #, and Street Name, indicating the goal of condition #1 exactly. In column 2 are X’s in ZIP/PC, Last Name, and PO Box, matching condition #2.
For a more advanced example:
Component | Size | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
ZIP/PC | 5 | X | X | X | X |
Last Name | 5 | X | X | | |
Company | 10 | | | X | X |
Street # | 4 | X | | X | |
Street Name | 4 | X | | X | |
PO Box | 10 | | X | | X |
This matchcode may produce matches if any one of the following four conditions returns true:
Condition #1: ZIP/PC, Last Name, Street Number, Street Name
Condition #2: ZIP/PC, Last Name, PO Box
Condition #3: ZIP/PC, Company, Street Number, Street Name
Condition #4: ZIP/PC, Company, PO Box
This matchcode could be used on a list containing a mixture of both personal and company names and either street or PO Box addresses.
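The OR-of-ANDs logic behind combinations can be sketched as follows. The types and the `recordsMatch` function are hypothetical, the component names are arbitrary labels, and only exact comparison is modeled (no fuzzy or blank-field settings).

```cpp
#include <map>
#include <string>
#include <vector>

using Key = std::map<std::string, std::string>;   // component name -> key value
using Combination = std::vector<std::string>;     // components that must all agree

// A pair of records matches when every component of at least one
// combination is equal. Each key must contain every component used.
bool recordsMatch(const Key& a, const Key& b,
                  const std::vector<Combination>& combos) {
    for (const Combination& combo : combos) {
        bool all = true;
        for (const std::string& comp : combo) {
            if (a.at(comp) != b.at(comp)) { all = false; break; }   // AND within a column
        }
        if (all) return true;                                       // OR across columns
    }
    return false;
}
```

Each `Combination` corresponds to one column of X's in the tables above: the inner loop checks one column, and the outer loop moves on to the next column when a component disagrees.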
First Component Restrictions#
MatchUp now has two deduping engines. The object determines which engine is necessary or more efficient and selects it. In some cases, processing will be faster with the traditional ReadWrite engine because the first component of the matchcode has the following properties:
It must appear in every combination.
It cannot use the following types of Fuzzy matching: Containment; Frequency; Fast Near; Frequency Near; Accurate Near. All others are allowed.
It cannot use Initial Only matching.
It cannot use One Blank Field matching.
It cannot use Swap Matching.
In other situations, you may have combinations where there are no common components. An example would be:
Condition #1: ZIP/PC, Street Number, Street Name
Condition #2: ZIP/PC, PO Box
Condition #3: Proximity
In this case, MatchUp would determine that it needs to use its Intersecting Logic. This engine is required because there are no common components. Speed benchmarks may be surprisingly similar, but this engine may in fact return more duplicates.
Blank Field Matching#
This needs a special discussion, as its importance is often overlooked. As discussed above, if this property is on, then the absence of data in both records would indicate a match. If this property is off, then two records with missing data, but matching in every other way, will not match.
Blank set “ON”#
The following example demonstrates when Blank set to ON allows a match on a non-critical component. Setting Blank to OFF is recommended for a critical component.
Component | Size | Blank | 1 | 2 |
---|---|---|---|---|
ZIP/PC | 5 | Yes | X | X |
Last Name | 5 | Yes | X | X |
Street # | 5 | Yes | X | |
Street Name | 4 | Yes | X | |
PO Box | 10 | Yes | | X |
As described above, this produces the following combinations:
Condition #1: ZIP/PC, Last Name, Street Number, Street Name
Condition #2: ZIP/PC, Last Name, PO Box
For this example, take the following records:
Name | Address | City/State/PC |
---|---|---|
Joe Smith | 326 Main Street | Pembroke, MA 02066 |
Suzi Smith | 405 Main Street | Pembroke, MA 02066 |
The following matchcode keys would be generated:
Record # | Zip/PC | Last Name | Street # | Street Name | PO Box |
---|---|---|---|---|---|
1 | 02066 | SMITH | 326 | MAIN | |
2 | 02066 | SMITH | 405 | MAIN | |
According to these matchcode keys, it is clear that these two records do not satisfy condition #1. But because blank field matching is selected, they do satisfy condition #2. The Zip/PC, Last Name, and PO Box are exactly the same. Therefore, the two records do match.
Blank set “OFF”#
Obviously, this is not the correct result. The fix is one change to the matchcode: turn off Both Blank Fields for the PO Box component.
Component | Size | Blank | 1 | 2 |
---|---|---|---|---|
ZIP/PC | 5 | Yes | X | X |
Last Name | 5 | Yes | X | X |
Street # | 5 | Yes | X | |
Street Name | 4 | Yes | X | |
PO Box | 10 | No | | X |
The same comparison is done for combination #2, but the match is disallowed this time because the matchcode now indicates that missing (blank) information is not allowed to figure in the matching condition.
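The effect of the Blank setting on a single combination can be sketched like this. The `Compared` struct and `combinationMatches` function are illustrative names; each entry holds the two key values for one component together with that component's Both Blank Fields setting.

```cpp
#include <string>
#include <vector>

// One component of a combination: both records' key values and whether
// Both Blank Fields is turned on for that component.
struct Compared {
    std::string a, b;
    bool bothBlankAllowed;
};

// A combination matches only if every component matches. Two empty values
// count as a match only when Both Blank Fields is on for that component.
bool combinationMatches(const std::vector<Compared>& parts) {
    for (const Compared& p : parts) {
        if (p.a.empty() && p.b.empty()) {
            if (!p.bothBlankAllowed) return false;
        } else if (p.a != p.b) {
            return false;
        }
    }
    return true;
}
```

For the two Smith records, combination #2 compares ZIP/PC “02066”, Last Name “SMITH”, and an empty PO Box on both sides. It matches when Blank is Yes for PO Box and fails when Blank is No.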
Looking at another example (using the same matchcode):
Name | Address | City/State/PC |
---|---|---|
Joe Smith | PO Box 123 | Pembroke, MA 02066 |
Suzi Smith | PO Box 456 | Pembroke, MA 02066 |
The following matchcode keys would be generated:
Record # | Zip/PC | Last Name | Street # | Street Name | PO Box |
---|---|---|---|---|---|
1 | 02066 | SMITH | | | 123 |
2 | 02066 | SMITH | | | 456 |
These records have the same problem as before, but this time combination #1 is the cause: Street # and Street Name are blank in both records, and both components still allow blank matching. An even better matchcode would be:
Component | Size | Blank | 1 | 2 |
---|---|---|---|---|
ZIP/PC | 5 | Yes | X | X |
Last Name | 5 | Yes | X | X |
Street # | 4 | No | X | |
Street Name | 4 | No | X | |
PO Box | 10 | No | | X |
This is one matchcode that works well. There is one more possible tweak, however: turn on Both Blank Fields for the Street # component. Occasionally, MatchUp Object may encounter records such as:
Name | Address | City/State/PC |
---|---|---|
Joe Notarangello | Oceanfront Estates | Pembroke, MA 02066 |
Suzi Notarangello | Oceanfront Est. | Pembroke, MA 02066 |
This reflects a trend in up-scale neighborhoods, where neither street address has a Street # component, though it is very likely these records should match.
So this new, improved matchcode will account for these situations:
Component | Size | Blank | 1 | 2 |
---|---|---|---|---|
ZIP/PC | 5 | Yes | X | X |
Last Name | 5 | Yes | X | X |
Street # | 4 | Yes | X | |
Street Name | 4 | No | X | |
PO Box | 10 | No | | X |
Matchcode Mapping#
Matchcodes deal with the abstract. The components in a matchcode represent specific types of data, but they aren’t directly linked to the fields in databases. Mapping creates the link between the data and the matchcode.
For example, take the following matchcode:
Component | Size | Fuzzy | 1 |
---|---|---|---|
Zip5 | 5 | No | X |
Last Name | 5 | No | X |
First Name | 5 | No | X |
Company | 10 | No | X |
Add a database which contains the following fields:
NAME: Contains full names (“Mr. John Smith”).
COMPANY: Contains company names (“Melissa Data”).
ADD1: Contains first (primary) address line (“22382 Avenida Empresa”).
ADD2: Contains second (secondary) address line (“Suite 34”).
CSZ: Contains City/State/Zip (“Rancho Santa Margarita, CA 92688”).
An application must create a link between a database’s fields (Name, Company, Add1, Add2 and CSZ) and the matchcode components (Zip5, Last Name, First Name, Company). With the example above, it may appear that the application will have to contain extensive splitting routines. This is not the case. All that is necessary is to tell MatchUp what type of data is in a specific field and the format of that data.
In the example above, an application would use the following matchcode mapping:
Matchcode Component | Database Field | Matchcode Mapping |
---|---|---|
Zip5 | CSZ | CityStZip |
Last Name | NAME | FullName |
First Name | NAME | FullName |
Company | COMPANY | Company |
This mapping tells MatchUp that the 5-digit ZIP Code information is in a field named “CSZ” which is described as a field containing city, state, and ZIP Code information. The Last Name can be found in a field called “NAME” and is described as a full name field (which is a full name sequenced: Pre, FN, MN, LN, Suf).
Matchcode Mapping Rules#
Matchcode mappings follow five rules:
1. For every Matchcode Component, the application must specify a mapping. The only exception is described in rule 2.
2. Actual address component names (Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary, and the Global Components) are not listed for mapping purposes. Instead, the names Address Line 1 through Address Line 8 are used. The example below uses four address components in the matchcode (Street Number, Street Name, Street Secondary, PO Box) but only two address lines.
3. If a matchcode uses any address components, Address Lines 1-8 are listed after all other components, regardless of where the address components appear in the matchcode. In the following example, the address components are listed before Company in the matchcode, but Address Lines 1-8 are listed at the end (after Company).
4. If a matchcode uses address components, at least one of Address Lines 1-8 must be mapped, but not all of them. If a database has only one address field, an application only needs to map Address Line 1 to that field. All other components must be mapped.
5. Address Lines should be mapped from the top down (Address Line 1, then 2 through 8).
Enhancing the matchcode in the previous example:
Component | Size | Fuzzy | 1 | 2 |
---|---|---|---|---|
Zip5 | 5 | No | X | X |
Last Name | 5 | No | X | X |
First Name | 5 | No | X | X |
Street Number | 5 | No | X | |
Street Name | 5 | No | X | |
Street Secondary | 12 | No | X | |
PO Box | 10 | No | | X |
Company | 10 | No | X | X |
Again, MatchUp doesn’t list the individual address components for mapping. They are replaced with Address Line 1, Address Line 2, and Address Line 3, which appear after Company. So, the application would use the following Matchcode Mapping:
Matchcode Component | Database Field | Matchcode Mapping |
---|---|---|
Zip5 | CSZ | CityStZip |
Last Name | NAME | FullName |
First Name | NAME | FullName |
Company | COMPANY | Company |
Address Line 1 | ADD1 | Address |
Address Line 2 | ADD2 | Address |
Address Line 3 | (none) | |
Note on Rule #1#
If a database does not contain a field for information called for by a component in a matchcode, such as company field in the above example, then that matchcode should not be used to dedupe that database.
Use a different matchcode or modify an existing matchcode, as outlined later in this chapter.
However, if a matchcode calls for last name, for example, and the database only has full name, then simply map the full name field to the last name and MatchUp Object will handle parsing the field.
Matchcode Mapping Using the API#
All three of the MatchUp Object deduping interfaces (Incremental, Read/Write and Hybrid) have an AddMapping function. This is used to create mappings for the current instance of whatever deduper an application is using. For the last example above, call the function in the following way:
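mu->ClearMappings();            // remove any existing mappings
mu->AddMapping(mu->CityStZip);  // Zip5, taken from CSZ
mu->AddMapping(mu->FullName);   // Last Name, taken from NAME
mu->AddMapping(mu->FullName);   // First Name, taken from NAME
mu->AddMapping(mu->Company);    // Company
mu->AddMapping(mu->Address);    // Address Line 1 (ADD1)
mu->AddMapping(mu->Address);    // Address Line 2 (ADD2)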
The value being passed to the function is an enumerated value of the type MatchcodeMapping.
Note that this code does not tell MatchUp Object anything about the database containing the data to be deduped. The application handles the data access separately and then passes the necessary fields to the deduper using the AddField function.
Changing Mappings#
It is possible to change mappings in the middle of a session if, for example, an application has to handle two databases with different data structures. Continuing with the example from above, assume that the second database has the following structure:
Matchcode Component | Database Field | Matchcode Mapping |
---|---|---|
Zip5 | CSZ | CityStZip |
Last Name | NAME | FullName |
First Name | NAME | FullName |
Address Line 1 | ADD1 | Address |
Address Line 2 | ADD2 | Address |
Address Line 3 | (none) | |
To use this mapping, the application would first have to call the ClearMappings function to remove the existing mappings and call the AddMapping function again to configure the new mapping.
mu->ClearMappings();            // remove the mappings for the first database
mu->AddMapping(mu->CityStZip);
mu->AddMapping(mu->LastName);
mu->AddMapping(mu->FirstName);
mu->AddMapping(mu->Company);
mu->AddMapping(mu->Address);
mu->AddMapping(mu->Address);
Optimizing Matchcodes#
Some matchcodes process much faster than others even though they detect the same matches. This section will assist in creating the most efficient matchcodes. The discussion is included so that you can better understand why certain things are done while optimizing.
Optimizing can make a significant difference in processing speed. 58-hour runs have been reduced to four hours simply by optimizing the matchcode.
It is important you verify that a matchcode works in the intended way before attempting any optimizations. If a matchcode is not functioning properly these optimizations will not help and could quite possibly make the situation worse.
Component Sequence#
As discussed in the previous section, data may process faster if the first component of a matchcode has certain properties:
It must be used in every combination.
It cannot use certain types of Fuzzy Matching: Containment; Frequency; Fast Near; Frequency Near; or Accurate Near (other types are okay, though).
It cannot use Initial Only matching.
It cannot use One Blank Field matching.
It cannot use Swap matching.
If the matchcode’s second component also follows these conditions, MatchUp Object will incorporate it into its Clustering scheme. Additional components, if they follow in sequence (third, fourth, and so on), will be used if they satisfy these conditions. Incorporating a component into a cluster greatly reduces the number of comparisons MatchUp Object has to perform which, in turn, speeds up your processing.
This is a simple example of optimization:
Component | Size | Fuzzy | Blank | 1 | 2 |
---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X |
Street # | 5 | No | Yes | X | |
Street Name | 5 | No | No | X | |
PO Box | 10 | No | No | | X |
Last Name | 5 | No | Yes | X | X |
As shown here, MatchUp Object will only cluster by ZIP/PC. But note that the last component, Last Name, satisfies all the conditions listed earlier. Moving it to the second position lets MatchUp Object cluster by both ZIP/PC and Last Name:
Component | Size | Fuzzy | Blank | 1 | 2 |
---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X |
Last Name | 5 | No | Yes | X | X |
Street # | 5 | No | Yes | X | |
Street Name | 5 | No | No | X | |
PO Box | 10 | No | No | | X |
This simple optimization will produce significant improvements in speed. In general, if your matchcode requires multiple components to be used in all set combinations, place them before other components.
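The benefit of reordering can be estimated by counting how many leading components satisfy the conditions listed at the start of this section. The sketch below models only those five conditions; the struct fields and function names are hypothetical and do not reflect MatchUp's internal clustering code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative summary of the properties that matter for clustering.
struct ComponentProps {
    std::string name;
    bool inEveryCombination;
    bool usesRestrictedFuzzy;   // Containment, Frequency, Fast Near, Frequency Near, Accurate Near
    bool initialOnly;
    bool oneBlankField;
    bool swap;
};

bool qualifies(const ComponentProps& c) {
    return c.inEveryCombination && !c.usesRestrictedFuzzy && !c.initialOnly &&
           !c.oneBlankField && !c.swap;
}

// Number of leading components, in sequence, that could join the cluster key.
std::size_t clusterPrefixLength(const std::vector<ComponentProps>& matchcode) {
    std::size_t n = 0;
    while (n < matchcode.size() && qualifies(matchcode[n])) ++n;
    return n;
}
```

For the first matchcode above the count is 1 (ZIP/PC only), because Street # is not used in every combination. With Last Name moved to second place the count becomes 2.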
Fuzzy Algorithms#
Fuzzy algorithms fall into two categories: early matching and late matching.
Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUp, these transformations are performed during key generation (the BuildKey function in each interface), which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.
Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.
Obviously, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed. Often, a faster algorithm will give nearly the same results, but it is a good idea to test any such change before processing live data.
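Soundex is a convenient illustration of an early matching algorithm: each value is transformed once into a short code, and the codes are then compared exactly. The sketch below follows the commonly documented American Soundex rules; it is not MatchUp's implementation, and the function names are illustrative.

```cpp
#include <cctype>
#include <string>

// Soundex digit for an uppercase letter; '0' means "not coded" (vowels, H, W, Y).
char soundexDigit(char c) {
    switch (c) {
        case 'B': case 'F': case 'P': case 'V': return '1';
        case 'C': case 'G': case 'J': case 'K':
        case 'Q': case 'S': case 'X': case 'Z': return '2';
        case 'D': case 'T': return '3';
        case 'L': return '4';
        case 'M': case 'N': return '5';
        case 'R': return '6';
        default: return '0';
    }
}

// American Soundex: first letter plus three digits, zero padded.
std::string soundex(const std::string& word) {
    std::string key;
    char last = '0';
    for (char raw : word) {
        if (!std::isalpha(static_cast<unsigned char>(raw))) continue;
        const char c = static_cast<char>(std::toupper(static_cast<unsigned char>(raw)));
        const char d = soundexDigit(c);
        if (key.empty()) {
            key += c;
            last = d;
        } else {
            if (d != '0' && d != last) key += d;
            if (c != 'H' && c != 'W') last = d;   // H and W do not separate equal codes
        }
        if (key.size() == 4) break;
    }
    if (key.empty()) return key;
    key.append(4 - key.size(), '0');
    return key;
}
```

“Robert” and “Rupert” both produce “R163”, so once the keys are built the comparison is a plain equality test. A late matching algorithm such as Accurate Near would instead have to run its full comparison for every pair of records.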
Fuzzy Algorithm Ranking#
Algorithm | Late or Early | Speed (10=fastest) |
---|---|---|
Jaro | Late | 1 |
Jaro-Winkler | Late | 1 |
n-Gram | Late | 1 |
Needleman-Wunsch | Late | 1 |
Smith-Waterman-Gotoh | Late | 1 |
Dice’s Coefficient | Late | 1 |
Jaccard Similarity Coefficient | Late | 1 |
Overlap Coefficient | Late | 1 |
Longest Common Substring | Late | 1 |
Double Metaphone | Late | 1 |
Accurate Near | Late | 1 |
Fast Near | Late | 3 |
Containment | Late | 4 |
Frequency Near | Late | 4 |
Frequency | Late | 6 |
Phonetex | Early | 7 |
Soundex | Early | 8 |
Vowels Only | Early | 9 |
Numerics Only | Early | 9 |
Consonants Only | Early | 9 |
Alphas Only | Early | 9 |
Exact | N/A | 10 |
The speed values are only rough estimates.
Another benefit of using a faster fuzzy algorithm is that an application may be able to exploit the component sequence optimization shown earlier. All of the early matching algorithms satisfy the restrictions for first components.
Unnecessary Components#
Components that are not used in any combinations (in other words, they have no X’s in columns 1 through 16) are a sign of poor matchcode design.
Take the following matchcode:
Component |
Size |
Fuzzy |
Blank |
1 |
2 |
---|---|---|---|---|---|
ZIP/PC |
5 |
No |
Yes |
X |
X |
Last Name |
5 |
No |
Yes |
X |
X |
First Name |
5 |
No |
Yes |
||
Street # |
5 |
No |
Yes |
X |
|
Street Name |
5 |
No |
No |
X |
|
PO Box |
10 |
No |
No |
X |
First name is not being used in any combination. Perhaps it was used in a combination that has since been removed from this matchcode, but it is no longer necessary.
Unnecessary Combinations#
Take the following matchcode:
Component | Size | Fuzzy | Blank | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|---|---|
ZIP/PC | 5 | No | Yes | X | X | X | X |
Last Name | 5 | No | Yes | X | X | X | X |
First Name | 5 | No | Yes | X | X | | |
Street # | 5 | No | Yes | X | | X | |
Street Name | 5 | No | No | X | | X | |
PO Box | 10 | No | No | | X | | X |
Here are the four conditions for matching:
Condition # 1 | ZIP/PC | Last Name | First Name | Street # | Street Name |
Condition # 2 | ZIP/PC | Last Name | First Name | PO Box | |
Condition # 3 | ZIP/PC | Last Name | Street # | Street Name | |
Condition # 4 | ZIP/PC | Last Name | PO Box | | |
Any match detected by condition #1 will also be detected by condition #3. Similarly, matches found by condition #2 will always be found by condition #4. In other words, the components of condition 3 are a subset of those of condition 1, and the components of condition 4 are a subset of those of condition 2. Such subsets are rarely desirable.
So either conditions 1 and 2 aren’t needed or conditions 3 and 4 were a mistake. If conditions 1 and 2 are eliminated, the First Name component should also be removed, as it will not be needed.
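A check like this can be automated. The following Python sketch treats each condition as a set of component names and flags any condition whose components strictly contain those of another condition. It assumes every component is compared the same way in each condition; the function name `redundant_conditions` is made up for this example.

```python
conditions = {
    1: {"ZIP/PC", "Last Name", "First Name", "Street #", "Street Name"},
    2: {"ZIP/PC", "Last Name", "First Name", "PO Box"},
    3: {"ZIP/PC", "Last Name", "Street #", "Street Name"},
    4: {"ZIP/PC", "Last Name", "PO Box"},
}

def redundant_conditions(conds):
    """Map each redundant condition to the looser conditions that cover it.

    A condition is redundant when another condition requires a proper
    subset of its components: any pair of records it matches is also
    matched by the looser condition.
    """
    covered = {}
    for strict_id, strict in conds.items():
        looser = [cid for cid, comps in conds.items() if comps < strict]
        if looser:
            covered[strict_id] = sorted(looser)
    return covered

result = redundant_conditions(conditions)
```

Here `result` reports that condition 1 is covered by condition 3 and condition 2 is covered by condition 4.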
Swap Matching Uses#
Swap matching is used to catch matches when two fields are flipped around. The most common occasion is catching the “John Smith” and “Smith John” records. But there are other uses:
Comparing Household Records#
When there are two or three first or full names per record, a list provider may claim that every record is always “husband, wife, then children,” but records can then read “wife, child, then husband.”
Component |
Size |
Fuzzy |
Swap |
1 |
2 |
3 |
4 |
5 |
6 |
---|---|---|---|---|---|---|---|---|---|
Zip5 |
5 |
Exact |
None |
X |
X |
X |
X |
X |
X |
Last Name |
5 |
Exact |
None |
X |
X |
X |
X |
X |
X |
First Name |
5 |
Exact |
A/B |
X |
X |
||||
First Name |
5 |
Exact |
A/C |
X |
X |
||||
First Name |
5 |
Exact |
B/C |
X |
X |
||||
PO Box |
10 |
Exact |
None |
X |
X |
X |
|||
Street Number |
5 |
Exact |
None |
X |
X |
X |
|||
Street Name |
4 |
Exact |
None |
X |
X |
X |
In the above example, select Either component can match for Swap Pairs A, B, and C.
Comparing up to Three Address Lines#
Although the address splitter works well in the US and Canada, some European countries can cause problems. A typical Euro-Matchcode will not use street split components and look at three address lines instead. The swap matching ensures that every address line is compared with every other address line.
Component |
Size |
Fuzzy |
Swap |
1 |
2 |
3 |
---|---|---|---|---|---|---|
Zip9 |
10 |
Exact |
None |
X |
X |
X |
Last Name |
5 |
Exact |
None |
X |
X |
X |
First Name |
5 |
Exact |
None |
X |
X |
X |
Address |
10 |
Exact |
A/B |
X |
||
Address |
10 |
Exact |
A/C |
X |
||
Address |
10 |
Exact |
B/C |
X |
Again, select Either component can match for Swap Pairs A, B, and C.
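The effect of comparing every address line with every other address line can be sketched in Python as follows. This is only an illustration of the idea, not MatchUp’s implementation: the `normalize` helper, the exact-match rule, and the sample addresses are assumptions, and the size of 10 mirrors the Address components in the table above.

```python
def normalize(line, size=10):
    """Uppercase, collapse whitespace, and truncate to the component size."""
    return " ".join(line.upper().split())[:size]

def any_line_matches(lines_a, lines_b):
    """True if any non-blank line of one record equals any line of the other."""
    keys_a = {normalize(line) for line in lines_a if line.strip()}
    keys_b = {normalize(line) for line in lines_b if line.strip()}
    return bool(keys_a & keys_b)

record_a = ["Flat 2", "14 High Street", ""]
record_b = ["14 High  Street", "Flat 2B", "Leeds"]
record_c = ["2 Park Lane", "", ""]

same_address = any_line_matches(record_a, record_b)
different_address = any_line_matches(record_a, record_c)
```

The street line matches even though it sits in the second line of one record and the first line of the other.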
Don’t always discard the street split component matchcodes because you are working with a foreign database. Sometimes the street splitter will yield usable results. Therefore, a combination of approaches will often work.
Component |
Size |
Fuzzy |
Swap |
1 |
2 |
3 |
4 |
5 |
---|---|---|---|---|---|---|---|---|
Zip9 |
10 |
Exact |
None |
X |
X |
X |
X |
X |
Last Name |
5 |
Exact |
None |
X |
X |
X |
X |
X |
First Name |
5 |
Exact |
None |
X |
X |
X |
X |
X |
PO Box |
10 |
Exact |
None |
X |
||||
Street Number |
5 |
Exact |
None |
X |
||||
Street Name |
4 |
Exact |
None |
X |
||||
Address |
10 |
Exact |
A/B |
X |
||||
Address |
10 |
Exact |
A/C |
X |
||||
General |
10 |
Exact |
B/C |
X |
Best Practices#
MatchUp Object Best Practices contains support recommendations for cases where performance is not optimal.
Intersecting Deduper#
When a matchcode circumvents the first component restrictions (the first component is not used in all combinations, or has a fuzzy algorithm applied), expect throughput to be significantly slower. This can also cause stability issues. We do not recommend this type of matchcode when processing a large number of records. Test thoroughly with a small number of records before scaling up to larger data sets and a production environment. Note that using the Hybrid Deduper or small amounts of data will not show the problem.
For more info: Component Combinations.
Matchcodes with Fuzzy Algorithms#
Since fuzzy algorithms can slow a process dramatically or raise stability issues for enterprise-level processes, we recommend that you establish acceptable throughput benchmarks with a standard exact matchcode. Then make small incremental changes that progress toward the desired matching strategy. Depending on data quality and the number of records, certain matchcode properties may make the desired speeds impractical.
Optimizing Speed: General#
Network data traffic: We recommend that the source data to be processed be local with respect to the installed Melissa Data program. Network permissions, throughput, and in some cases MatchUp’s need to access record ‘x’ to complete consolidation with record ‘y’ are all potential sources of a slower process.
Source datatype: Some database or file types can be read by the calling language or IDE more efficiently than others. Matching your environment to the most efficient file type requires trial-and-error testing by the developer.
Hardware: The more hardware you dedicate to a process, the faster it will generally run. However, many processes cannot take advantage of additional hardware, or show diminishing returns. For example, a database with varied ZIP Codes may be able to use multiple processors to process individual clusters of records, but a database where every record shares one ZIP Code may not. Additionally, hardware may not be the overriding factor in a fast process; a good matchcode may be the most important factor.
Optimizing Speed: Matchcodes#
Matchcode: Components: Before MatchUp dedupes, it clusters records into groups of possible matches. If your matchcode does not have any component that is used in every combination, MatchUp cannot place records into those sub-group clusters. In general, the more components that are used in every combination, the faster the process will be.
Matchcode: Fuzzy: MatchUp has an extensive list of fuzzy options. Some are performed during the key building process (e.g., Soundex) and do not slow the process down. Others are performed on the constructed matchkeys (e.g., Near, Jaro) and therefore slow the process down. If the latter types are required by your process, place them in the component order below an exact component that is also used in every combination, if possible.
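As a rough way to reason about component order, the Python sketch below finds the leading run of components that could be used for clustering. It models only the two restrictions named in this section (used in every combination, no late fuzzy algorithm); MatchUp may apply further rules. The `Component` class and `clustering_components` function are hypothetical names for this example.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    late_fuzzy: bool     # True if compared with a late (comparison-time) algorithm
    columns: frozenset   # combination numbers in which the component is used

def clustering_components(components, all_columns):
    """Return the leading components that are used in every combination
    and do not rely on a late fuzzy algorithm."""
    leading = []
    for comp in components:
        if comp.late_fuzzy or comp.columns != all_columns:
            break
        leading.append(comp.name)
    return leading

ALL = frozenset({1, 2})
zip_pc = Component("ZIP/PC", False, ALL)
last_name = Component("Last Name", False, ALL)
street_no = Component("Street #", False, frozenset({1}))
street_name = Component("Street Name", False, frozenset({1}))
po_box = Component("PO Box", False, frozenset({2}))

# Last Name placed at the end: only ZIP/PC can drive clustering.
slow_order = clustering_components(
    [zip_pc, street_no, street_name, po_box, last_name], ALL)

# Last Name moved up next to ZIP/PC: both can drive clustering.
fast_order = clustering_components(
    [zip_pc, last_name, street_no, street_name, po_box], ALL)
```

Moving the component that appears in every combination ahead of the others lengthens the clustering prefix from one component to two.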
Order of Components in Matchcode#
Although the Matchcode Editor interface lets you place the components in any order, the Object does have a few restrictions when calling the AddMapping methods. Namely, Address Line AddMappings must be called last, even if you have added another component after the Address matchcode components. Calling AddMappings in the wrong order will throw an error, so when using the Matchcode Editor, place your address components last. The exception would be rare cases where address components are used in every specified column, but a different component is not used in all combinations (specified columns).
Back up your Matchcode database#
If you create your own matching strategies, you should occasionally back up this file - in the event that someone changes a matchcode or it becomes corrupted.
For the Object and SSIS, and Contact Zone, this file is named mdMatchUp.mc
For the MatchUp Software version, the file is named DTake.mc
An example of good backup practice would be mdMatchUp_20140123.mc, allowing you to see the original matchcode used in processes before Jan 23, 2014
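A dated copy like this is easy to script. The Python sketch below is one possible approach; the helper name `backup_matchcode_file` is an assumption, and the demonstration uses a throwaway placeholder file rather than a real matchcode database.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

def backup_matchcode_file(source, backup_dir, today=None):
    """Copy source to backup_dir as <stem>_<YYYYMMDD><suffix>; return the new path."""
    source = Path(source)
    stamp = (today or date.today()).strftime("%Y%m%d")
    target = Path(backup_dir) / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    return target

# Demonstration with a placeholder file standing in for the matchcode database.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "mdMatchUp.mc"
    original.write_bytes(b"placeholder")
    backup = backup_matchcode_file(original, tmp, today=date(2014, 1, 23))
    backup_name = backup.name
    backup_exists = backup.exists()
```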
Using Efficient SetUserInfo#
By default, the SetUserInfo value (the unique identifier attached to each built match key) is 1024 bytes, allowing the developer to pass an advanced custom identifier, or even source data, to the key file. While this can have data handling advantages, it causes the key file and temporary sort files to grow much larger than needed for most jobs, and slows down the process. A new reserved function has been added, allowing the user to override the default UserInfo size. For example:
ReadWrite->SetReserved("UserInfoSize","12");
Our tests have shown this to reduce key and temporary disk storage usage by a factor of 10 and processing time by as much as 60%.
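A back-of-the-envelope calculation shows why the reduction can be this large. The Python sketch below assumes a hypothetical 100-byte match key and 10 million records, and ignores any file overhead. These figures are illustrative assumptions, not measured values.

```python
def key_file_bytes(records, key_bytes, user_info_bytes):
    """Approximate payload size: one key plus one UserInfo block per record."""
    return records * (key_bytes + user_info_bytes)

RECORDS = 10_000_000   # hypothetical list size
KEY_BYTES = 100        # hypothetical match key length

default_size = key_file_bytes(RECORDS, KEY_BYTES, 1024)  # default UserInfo size
reduced_size = key_file_bytes(RECORDS, KEY_BYTES, 12)    # UserInfoSize set to 12
ratio = default_size / reduced_size
```

Under these assumptions the payload drops from roughly 11.2 GB to roughly 1.1 GB, a ratio of about 10.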
To determine programmatically whether you have the necessary update (Build 2072 or newer):
printf(" BUILD NUMBER: %s\n",mdMUReadWriteGetBuildNumber(ReadWrite));
Keep Work File Location Local#
MatchUp uses this location to store the process key file as well as temporary sorting files.
By default, Windows stores these files in the temp directory of the logged-in user. On *nix platforms, they are stored in the directory where the executable is run.
Although users can override this location, we do not recommend it unless you are pointing this location to a fast local drive with plenty of writable disk space and full read/write permissions.
Deduping Methods#
MatchUp offers three methods of operation:
Incremental Deduping#
Incremental deduping is usually used for real-time data entry validation. For example, a call center data-entry system where an operator would like to determine whether or not the caller is an existing customer. At any time, a calling program can pass the incremental deduping engine the contents of a record; the engine will then report as to whether or not this record is a dupe, and if so, which record or records it matches.
Incremental deduping consists of the following steps:
The program processes a record and sends the specific information (ZIP/PC, Name, Address, etc) to MatchUp Object.
Based on the records previously sent to the API, MatchUp Object reports whether or not the record from the first step matches any of them.
Optionally, the application can tell MatchUp Object to add this record to its database for consideration in future comparisons.
The Historical Database#
The incremental deduping engine relies heavily on a historical database that it maintains. The lifetime of this database is as long as necessary (seconds, days, even years). This database is constructed and maintained by MatchUp Object, so it can determine whether or not an incoming record matches other records fairly quickly.
Multi-User/Multi-Thread Considerations#
Incremental deduping is unique in that multiple users or multiple processes can access the same historical database simultaneously. The API maintains a locking system to ensure that competing processes don’t collide. In order for two processes to work in this fashion, the initialization function for each process must specify the same historical database (a.k.a. “key file”).
Transaction-Based Processing#
The Incremental deduper interface of MatchUp Object features the option of using transaction-based operations on the historical database. This enables an application to process multiple calls to the AddRecord function as one, speeding up processing of large lists.
Incremental Order of Operations#
Using the Incremental deduper is pretty straightforward. This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Incremental deduper.
Initialize the Incremental deduper.
After creating an instance of the Incremental deduper, point the object toward its supporting data file, select a matchcode and key file to use, and initialize these files.
Create field mappings.
In order to build a key to compare to the key file, the Incremental deduper needs to know which types of data the program will be passing to the deduper and in what order.
Read the record from the data source.
This can be a new address passed from a website, or a single record from a newly acquired list or data source, to be compared against the master list.
Build a match key for the incoming record.
This consists of passing the actual data to the deduper in the same order used when creating a field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the Incremental deduper uses this information to generate a match key.
Compare the match key to the key file.
The MatchRecord function searches the key file for any keys that match the new record. If it finds a match, it provides information on the duplicate records in the key file.
Write new records to the key file.
The new key, whether or not it is unique, can then be written to the key file, so it can be used for future deduping operations. The program code must also write the new address record to the database separately.
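The steps above can be sketched with a small in-memory stand-in. `ToyIncrementalDeduper` is not the MatchUp API: its methods only mirror the roles of the AddField, BuildKey, MatchRecord and AddRecord calls described in this section, and the key layout (fixed-width, uppercased fields) and sample records are assumptions for illustration.

```python
class ToyIncrementalDeduper:
    """In-memory stand-in for an incremental deduper (not the MatchUp API)."""

    def __init__(self, mapping):
        self.mapping = mapping   # list of (field name, size), in passing order
        self.key_file = {}       # match key -> record ids already stored
        self._fields = []

    def clear_fields(self):
        self._fields = []

    def add_field(self, value):
        self._fields.append(value)

    def build_key(self):
        if len(self._fields) != len(self.mapping):
            raise ValueError("fields must be added in the mapped order")
        parts = [value.strip().upper()[:size].ljust(size)
                 for value, (_, size) in zip(self._fields, self.mapping)]
        return "".join(parts)

    def match_record(self, key):
        return list(self.key_file.get(key, []))

    def add_record(self, key, record_id):
        self.key_file.setdefault(key, []).append(record_id)

# Steps 1-2: initialize and declare the field mapping.
deduper = ToyIncrementalDeduper([("zip", 5), ("last", 5), ("address", 10)])

incoming = [
    (1, "92688", "Smith", "12 Oak Street"),
    (2, "92688", "Jones", "5 Elm Street"),
    (3, "92688", "SMITH ", "12 Oak Street"),
]

results = {}
for record_id, zip_code, last, address in incoming:   # step 3: read a record
    deduper.clear_fields()
    for value in (zip_code, last, address):            # step 4: build the key
        deduper.add_field(value)
    key = deduper.build_key()
    results[record_id] = deduper.match_record(key)     # step 5: compare
    deduper.add_record(key, record_id)                 # step 6: store the key
```

Record 3 is reported as a duplicate of record 1 because both produce the same key.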
Read/Write Deduping#
Read/Write deduping is usually used for processing entire lists. It works in a manner similar to the MatchUp software product. A calling program passes an entire list to the Read/Write deduping engine one record at a time. When the entire list has been passed, the calling program tells the API to process the records. Then, the calling program retrieves each record, along with additional deduplication information, from the Read/Write deduper.
Read/Write deduping consists of the following steps:
One by one, the program sends a series of record data (ZIP/PC, Name, Address, etc.) to the MatchUp API.
When the first step is complete, the program sends a “process” command to the API.
The program retrieves the results for each record with deduplication information.
Order of Output Records#
The program will send records in a particular sequence, either in record (raw) order, or maybe in a more sophisticated manner (by ZIP/PC, record type, and so on). MatchUp Object will not return the records in the same order. By default, records are output in cluster order. This order will be loosely based on the matchcode. For example, if the matchcode has Zip5 as its first component, output records will be more or less sorted by ZIP Code (but the developer should not count on this). If the application called the SetGroupSorting function, records in the same dupe group will be adjacent. Otherwise, duplicate records may or may not be adjacent (though they usually are near each other).
If a certain sequence is important (for example, records ordered in the same sequence they were input), sort the results after MatchUp Object has processed the data.
Data Lifetime#
A Read/Write deduping session is relatively short-lived. Although the actual action of reading and writing records may take time (hours or days), the process is strictly defined into three distinct steps. The key file does not persist beyond this point. Because of this, Read/Write deduping is not usually the choice for ongoing or online processes.
Record Identity#
Because MatchUp Object does not read or write directly to the database, some mechanism must be provided so that the application can match each record back to the original data source. The SetUserInfo function allows the application to pass a unique identifier for each record.
Read/Write Order of Operations#
Using the Read/Write deduper is pretty straightforward. This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Read/Write deduper.
Initialize the Read/Write deduper.
After creating an instance of the Read/Write deduper, point the object toward its supporting data file, select a matchcode and key file to use, and initialize these files.
Create field mappings.
In order to build a key to be written to the key file, the Read/Write deduper needs to know which types of data the application will be passing to the deduper and in what order.
Read the records from the database.
Loop through the master database and get the data fields needed to build a key, according to the mappings defined in step 2.
Build a match key for each record.
This consists of passing the actual data to the deduper in the same order used when creating the field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the deduper uses this information to generate a match key.
Write each match key to the key file.
The WriteRecord function stores each match key in a temporary key file.
Process the keys.
After building the keys, calling the Process function loops through the keys and compares them to each other.
Loop through the records and read the deduping data for each one.
The ReadRecord function loops through the entire set of deduped records and allows the application to read information on the record’s duplicate/unique status, the number of duplicates for each record and the record dupe group.
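The batch flow can be sketched in plain Python. This is not the MatchUp API: the `build_key` helper, the exact-match grouping, and the status labels are assumptions. The input position is carried along as the user info so that results can be mapped back to the source and re-sorted into input order afterwards.

```python
from collections import defaultdict

def build_key(zip_code, last_name, street):
    """Toy fixed-width key: 5-char ZIP, 5-char last name, 10-char street."""
    return (zip_code.strip()[:5].ljust(5)
            + last_name.strip().upper()[:5].ljust(5)
            + street.strip().upper()[:10].ljust(10))

records = [
    ("92688", "Smith", "12 Oak St"),
    ("10001", "Jones", "5 Elm St"),
    ("92688", "SMITH", "12 Oak St"),
    ("60601", "Brown", "9 Pine Rd"),
]

# Write phase: one key per record, with the input position as user info.
key_file = [(build_key(*fields), index) for index, fields in enumerate(records)]

# Process phase: sort into key order and collect records sharing a key.
key_file.sort()
groups = defaultdict(list)
for key, user_info in key_file:
    groups[key].append(user_info)

# Read phase: results come back in cluster (key) order, not input order.
output = []
for group_number, (key, members) in enumerate(sorted(groups.items()), start=1):
    for position, user_info in enumerate(members):
        if len(members) == 1:
            status = "unique"
        else:
            status = "has duplicates" if position == 0 else "duplicate"
        output.append({"user_info": user_info, "group": group_number,
                       "status": status, "count": len(members)})

# Restore the original input order after processing.
restored = sorted(output, key=lambda row: row["user_info"])
```

The output order follows the keys, so the final sort on the user info is what returns the rows to their input sequence.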
Hybrid Deduping#
The Hybrid deduper differs from the Incremental and Read/Write dedupers in that it does not maintain a key file of its own. It is up to the developer to maintain a list of match keys to use for deduping operations. This increases the flexibility of the Hybrid deduper but at the expense of programming complexity.
The main advantage of Hybrid deduping is that it allows the developer to build smaller lists of match keys on the fly and quickly compare records to a small subset of the database.
Clustering#
The concept of Clustering, outlined in the first chapter, is essential to the Hybrid deduper. Unlike the other dedupers, where the clustering is taking place behind the scenes, the Hybrid deduper allows the developer to use clustering to compare a record against only a small portion of a list.
The Hybrid deduper uses the concept of a cluster size, which is the maximum number of characters at the beginning of a key that can be used to group a number of keys into smaller groups that can be compared against each other. For example, a cluster size of 5 means that the first five characters of a match key are used to create the clusters.
In other words, only the records where the first five characters of the match key for one record are identical to the first five characters of the match key for another record are considered when performing a Hybrid deduping operation.
Key Maintenance#
Unlike the other interfaces, the Hybrid deduper does not automatically handle the read/write operations to a key file. While this forces the developer to do more work, it allows a great deal of flexibility in how match keys are stored and handled.
In the previous example, with a cluster size of 5, if the match keys are stored in a field within a SQL database, a cluster could be built quickly by performing a SELECT query where the first five characters of the match key field matches the first five characters of the match key for the new record.
While this gives the developer far more flexibility, it also requires a great deal more coding and a greater understanding of certain MatchUp concepts.
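The SELECT approach can be sketched with Python’s built-in `sqlite3` module. The table name `master`, its columns, and the sample keys are hypothetical, and plain string equality stands in for the key comparison that the Hybrid deduper would perform.

```python
import sqlite3

CLUSTER_SIZE = 5

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (record_id INTEGER PRIMARY KEY, match_key TEXT)")
conn.executemany("INSERT INTO master VALUES (?, ?)", [
    (1, "92688SMITH12 OAK"),
    (2, "92688JONES5 ELM "),
    (3, "10001SMITH12 OAK"),
])

def build_cluster(connection, new_key, cluster_size=CLUSTER_SIZE):
    """Fetch only the stored keys that share the new key's leading characters."""
    prefix = new_key[:cluster_size]
    rows = connection.execute(
        "SELECT record_id, match_key FROM master "
        "WHERE substr(match_key, 1, ?) = ? ORDER BY record_id",
        (cluster_size, prefix),
    )
    return rows.fetchall()

new_key = "92688SMITH12 OAK"
cluster = build_cluster(conn, new_key)

# Stand-in for the real key comparison: exact equality on the full key.
matches = [record_id for record_id, key in cluster if key == new_key]
```

Only two of the three stored keys are pulled into the cluster, and only one of those matches the new record.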
Hybrid Order of Operations#
Using the Hybrid deduper is not as straightforward as the other interfaces, as it puts greater burden on the developer to handle storage and management of match keys.
This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Hybrid deduper.
Initialize the Hybrid deduper.
After creating an instance of the Hybrid deduper, point the object toward its supporting data file, select a matchcode to use, and initialize these files.
Create field mappings.
In order to build keys to compare, the Hybrid deduper needs to know which types of data the program will be passing to the deduper and in what order.
Build a master list of keys.
Each record must have a match key so the Hybrid deduper can select a cluster of records or check for duplicates. This consists of passing the data used in record comparison from each record to the deduper in the same order used when creating a field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the Hybrid deduper uses this information to generate a match key.
Build a match key for the new address record.
Repeat the step above to create a match key for the record to be compared against the cluster.
Build the cluster list.
Cycle through the master key list and extract only those records where the first part of the match key equals the first part of the match key for the new record.
Compare the match key to the cluster list.
Loop through the cluster list and compare each key to the new record’s key. The CompareKey function indicates whether the two keys match.
Global Processing#
Legacy/Global Usage#
Foreign Character Translation#
Foreign characters are translated into English equivalents. For example, “Ç” is converted to “C.” All translations are based on the assumption that your data was entered with the 1252 (Windows Latin 1) code page.
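The general idea can be sketched in Python with the standard `unicodedata` module. This shows one common way to strip accents, not MatchUp’s actual translation table. The `strip_accents` name and the sample bytes are assumptions, and characters without a Unicode decomposition are left unchanged.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Bytes as they would be stored under code page 1252 (0xC7 is "Ç").
raw = b"FRAN\xc7OIS"
name = raw.decode("cp1252")
folded = strip_accents(name)
```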
Canadian Users#
MatchUp recognizes Canadian provinces and postal codes. In fact, it will abbreviate province names to their two letter abbreviation automatically.
MatchUp does handle the “QC” province abbreviation for Quebec, and “PQ” entries are automatically changed to “QC.”
In Canada, “5-20 Main Street” means “20 Main Street, Apt 5,” but in the US, it means “5 Main Street, Apt 20.” When deduping, MatchUp uses the contents of the ZIP/Postal code as a basis to determine a record’s country of origin, and splits this type of address accordingly.
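The country-dependent split can be sketched in Python. This is a simplified illustration, not MatchUp’s address splitter: the regular expressions and the `split_dashed_address` function are assumptions, and any postal code that does not look Canadian is treated as US.

```python
import re

CANADIAN_POSTAL = re.compile(r"^[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$")
DASHED_ADDRESS = re.compile(r"^(\d+)-(\d+)\s+(.+)$")

def split_dashed_address(address, postal_code):
    """Split 'N-M Street' into street number and unit based on the postal code."""
    match = DASHED_ADDRESS.match(address.strip())
    if match is None:
        return None
    first, second, street = match.groups()
    if CANADIAN_POSTAL.match(postal_code.strip()):
        unit, number = first, second   # Canada: unit comes first
    else:
        number, unit = first, second   # US: street number comes first
    return {"street_number": number, "unit": unit, "street": street}

canadian = split_dashed_address("5-20 Main Street", "K1A 0B1")
american = split_dashed_address("5-20 Main Street", "90210")
```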
When creating matchcodes for use with Canadian Postal Codes, use the Postal Code component. However, if a database is a mix of US and Canadian records, use Zip9 as the component type. Zip9 will not adversely affect processing of Canadian records. The goal is to prevent the deduper from trying to extract a ZIP + 4 from a Canadian Postal Code.
United Kingdom Users#
MatchUp can recognize United Kingdom Cities, Counties, and Postal codes. When creating matchcodes for use with United Kingdom addresses, use the Postal code (UK) component. Depending on requirements, consider using the City (UK) and County (UK) components. The Postal code component is structured in the following format: AADDIII, where AA is the Postal code Area (left justified), DD is the Postal code district (right justified), and III is the Inward Code (left justified). Extra spaces and dashes are removed as this structuring is done, so the size of this component is always 7.
Like any other matchcode component, a portion of the Postal code can always be compared by reducing its size and/or starting at a specific position. For example, starting at position 5 for a size of 3 will compare just the Inward code.
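The AADDIII layout can be sketched in Python as follows. This illustrates the structure described above and is not MatchUp’s parser: the `uk_postcode_key` function and its simple validation are assumptions.

```python
import re

def uk_postcode_key(postcode):
    """Lay out a UK postcode as AADDIII: area left-justified, district
    right-justified, inward code last; spaces and dashes are removed."""
    compact = re.sub(r"[\s-]", "", postcode).upper()
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {postcode!r}")
    outward, inward = compact[:-3], compact[-3:]
    match = re.match(r"^([A-Z]{1,2})(\d[A-Z\d]?)$", outward)
    if match is None:
        raise ValueError(f"unexpected outward code: {postcode!r}")
    area, district = match.groups()
    return area.ljust(2) + district.rjust(2) + inward

key = uk_postcode_key("B33 8TH")

# Starting at position 5 (1-based) for a size of 3 selects the inward code.
inward_code = key[4:7]
```

For “B33 8TH” the key is `B 338TH`: the area `B` is padded to two characters, followed by the district `33` and the inward code `8TH`.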
MatchUp’s street splitter will not split United Kingdom street addresses as well as Canadian and US addresses. Usually, a matchcode containing a mix of split address components and full address components is a good way to get the benefit of the street splitter (which often does perform well), along with a full-address match for backup. MatchUp Object includes the United Kingdom Address matchcode to be used as a starting point to build on.
International Users#
MatchUp was designed to work with US and Canadian addresses, and performs well with addresses from other English speaking countries.
The main obstacle with international records is with the Street Splitter. Try doing a test run with one of the default matchcodes. If the street splits are not working well, use the full address when creating a matchcode instead of using the components (such as street number, street name, etc.).
Often, users have had success when combining the full address and street splitter. For example, here’s an international version of one of the default matchcodes:
Component |
Size |
Start |
Fuzzy |
Short/Empty |
1 |
2 |
3 |
---|---|---|---|---|---|---|---|
General |
10 |
Left |
No |
Both Empty |
X |
X |
X |
Last Name |
5 |
Left |
No |
Both Empty |
X |
X |
X |
First name |
3 |
Left |
No |
Both Empty |
X |
X |
X |
PO Box |
10 |
Left |
No |
No |
X |
||
Street # |
4 |
Left |
No |
Both Empty |
X |
||
Street Name |
4 |
Left |
No |
No |
X |
||
Full Address |
20 |
Left |
No |
No |
X |
Logging Advisory#
Important Notices#
Build 5083#
MatchUp2020Q2 Update
If running one of the Linux ‘.sh’ installer scripts returns bad interpreter: No such file or directory, either:
Run the dos2unix <filename>.sh command on the respective .sh file, or
Manually copy the latest libmdmatchup.so and /data files to the current location.
This installer issue will be fixed in the next update.
Build 5009#
MatchUp2017Q3 Update
The MatchUpSQL.dll was corrupted during the build process.
Download the patch here.
If using the SQL-CLR functions, stop the SQL service, unzip the patch, and rename the .d_l file to .dll. Replace MatchUpSQL.dll in the SQL Server location where you installed the CLR functions, then restart the service.
Build 2929#
If using the SQL-CLR functions, we recommend you rerun the ‘SQL-CLR install.sql’ script after stopping the service and replacing the mdMatchUp.dll and MatchUpSQL.dll libraries as instructed in the sample readme_SQL_CLR.txt.
Since Domestic and Global functionality have been separated into distinct downloads, the Global mdMatchUp.mc contains matchcodes used in its samples, which are not present in an existing previously installed MatchUp Object. Therefore, some of the Global examples will not run unless you swap in the new matchcode file, or add them to your existing .mc file (which is not overwritten upon a program update).
Build 2918#
This version separates MatchUp Object into two distinct products: Global and Domestic (US and Canada). All efforts have been made to ensure backwards compatibility with the previous version, so there are features (such as matchcode components and fuzzy algorithms) which appear in both versions.
If performing an update, we recommend backing up your mdMatchUp.mc matchcode file.
IMPORTANT: because of the nature of International Data (different characters, encoding, etc) some of the domestic fuzzy algorithms may not return accurate results, and therefore should not be used in Global MatchUp. If your legacy application uses said algorithms, MatchUp will do logic checking and actually perform a more accurate Fuzzy:UTF-8 check.
Expired Database without Updates#
Users who do not update to MatchUp build 2628 will experience an expired database (actual expiration date is 2015-12-29).
Updating to build 2628 (or a newer build) enables Global Matching for supported countries but may require changes to your application (see build history). Updated installations are not affected by the expiring database date.
Users who do not update and need an mdMatchUp.dat with an extended expiration date, please contact our tech support team at Tech@Melissa.com or 800-MELISSA ext. 4 (800-635-4772 ext. 4).
Build 2628#
This version marks a change in the way MatchUp handles International records for deduping. Melissa offers Global address verification and parsing for over 240 countries. MatchUp will now leverage this knowledge to enable us to recognize and build matchkeys for these countries, therefore providing accurate record matching across the globe.
Installation
We will no longer distribute 32 bit unmanaged libraries, nor 32 - 64 bit COM Objects. Migrating from these libraries to the 64 bit library or interface wrapper will require interface signature changes and a recompile of your project.
Domestic US and Canada
Same as it ever was.
Previous versions provided unparalleled address parsing for keybuilding and deduping for domestic US and Canadian data. If your data is limited to these countries, you should be able to continue using the same matching strategies without changes.
Important for SSIS Users
The previous version of the MatchUp Object and the current SSIS component use the same data files. Since the SSIS component does not have Global Matching capability yet, its data files have a different structure. For both products to exist on the same machine, they must not point to the same location for the data files. By default, MatchUp Object installs the data files to C:\ProgramData\Melissa DATA\MatchUP. Verify the location of your SSIS MatchUp data files by selecting File > Advanced Configuration from the component. If you did not override the default location, use Windows Explorer to navigate to that location and make a copy of that directory. Go into the component and point your SSIS data directory to the new location. Remove the contents of the original default directory. You can then install the MatchUp Object with Global capabilities.
We plan on adding Global matching to SSIS in the near future, which will then use the same data files.
Important UK Changes
If you currently process UK data, or use a matchcode with UK components, DO NOT UPDATE to this version. Global UK processing will be added next release.
Next release: Global MatchUp will make the UK components obsolete, and therefore they will be deprecated. We have taken great effort to include logic to prevent Initialization of a process with these deprecated components. We recommend that you remove UK matchcodes and create new matchcodes which use the respective new global components.
General International Usage
Supported Countries – Build 2618: Germany
As suggested for the UK, we have provided a new set of Global matchcode components which allow you to match international addresses with custom matchcodes based on each country’s individual address format. This requires you to know the address components needed to uniquely distinguish one address from another. Seem hard? The differences are actually not as random as you might think, and Melissa Data will always offer sound advice as to which strategy and components to use for your particular need.
For MatchUp’s international address parser / key builder to kick in, the first matchcode component will need to be of the datatype Country. This tells MatchUp to look for specific keywords and address patterns for that country.
In addition to varying address formats, your international records will most likely have an extended character set. Therefore, you will need to call the new SetEncoding and SetCharacterSize methods. This lets MatchUp know how to recognize different characters and correctly build the matchkeys. Previously, all extended characters were converted for keybuilding with the default 1252 code page. This allowed MatchUp to identify matches whether or not records were stripped of their accents by legacy data systems. With Global data, this is not desired, as differently accented characters do represent different data values.
International Datafiles
To accommodate global processing, the structure and contents of the underlying MatchUp database (mdMatchup.dat) have changed.
There are also a significant number of new and larger data files, which will be installed and are required to process international records.
Backwards Compatibility
MatchUp is backwards compatible. Updating to the newest library should not break an existing process. Though we have taken great effort to include logic to handle initialization of a process with the deprecated components, there may be subtle differences in the way that matchkeys are constructed. This is to accommodate extended characters, plural words, and an overall general improvement to the existing data files. We therefore recommend that you generate new matchkeys for any existing Incremental or Hybrid process, or run thorough regression tests before updating the libraries in your production environment.
Data Coverage by Country#
Release Schedule#
Release Date Schedule |
---|
2024-10-21 |
2025-01-27 |
2025-04-21 |
2025-07-21 |
2025-10-27 |