Other#

Advanced Algorithms#

Click the links below for more information on the respective public advanced algorithms not developed by Melissa Data.


Swap Matching Uses#

Swap matching is used to catch matches when two fields are flipped around. The most common occasion is catching the “John Smith” and “Smith John” records. But there are other uses:

Comparing Household Records#

When there are two or three first or full names per record, a list provider may claim that every record is always “husband, wife, then children,” but records can then read “wife, child, then husband.”

Component

Size

Fuzzy

Swap

1

2

3

4

5

6

Zip5

5

Exact

None

X

X

X

X

X

X

Last Name

5

Exact

None

X

X

X

X

X

X

First Name

5

Exact

A/B

X

X

First Name

5

Exact

A/C

X

X

First Name

5

Exact

B/C

X

X

PO Box

10

Exact

None

X

X

X

Street Number

5

Exact

None

X

X

X

Street Name

4

Exact

None

X

X

X

In the above example, select Either component can match for Swap Pairs A, B, and C.
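The comparison behind a swap pair can be sketched in a few lines. This is only an illustration of the idea, not MatchUp's implementation: the `SwapMode` type and `swapPairMatches` function are hypothetical names, the values are assumed to be already normalized, and blank handling is left out.

```cpp
#include <cassert>
#include <string>

// How a swap pair decides whether two records agree on a pair of fields.
enum class SwapMode {
    BothMustMatch,   // both fields must agree, in either order
    EitherCanMatch   // a single agreement in any position is enough
};

// a1/b1 are the two swapped fields of record 1 (for example, two first names);
// a2/b2 are the same two fields of record 2.
bool swapPairMatches(const std::string& a1, const std::string& b1,
                     const std::string& a2, const std::string& b2,
                     SwapMode mode) {
    if (mode == SwapMode::BothMustMatch) {
        const bool straight = (a1 == a2) && (b1 == b2);
        const bool swapped  = (a1 == b2) && (b1 == a2);
        return straight || swapped;
    }
    // EitherCanMatch: any field of record 1 equals any field of record 2.
    return a1 == a2 || a1 == b2 || b1 == a2 || b1 == b2;
}
```

With three names per record, the three swap pairs (A/B, A/C, B/C) apply the same test to each pair of positions, which is why a "wife, child, husband" record can still meet a "husband, wife, child" record.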

Comparing up to Three Address Lines#

Although the address splitter works well in the US and Canada, addresses from some European countries can cause problems. A typical Euro-Matchcode will not use street split components and will instead look at three address lines. Swap matching ensures that every address line is compared with every other address line.

Component

Size

Fuzzy

Swap

1

2

3

Zip9

10

Exact

None

X

X

X

Last Name

5

Exact

None

X

X

X

First Name

5

Exact

None

X

X

X

Address

10

Exact

A/B

X

Address

10

Exact

A/C

X

Address

10

Exact

B/C

X

Again, select Either component can match for Swap Pairs A, B, and C.

Don’t always discard the street split component matchcodes because you are working with a foreign database. Sometimes the street splitter will yield usable results. Therefore, a combination of approaches will often work.

Component

Size

Fuzzy

Swap

1

2

3

4

5

Zip9

10

Exact

None

X

X

X

X

X

Last Name

5

Exact

None

X

X

X

X

X

First Name

5

Exact

None

X

X

X

X

X

PO Box

10

Exact

None

X

Street Number

5

Exact

None

X

Street Name

4

Exact

None

X

Address

10

Exact

A/B

X

Address

10

Exact

A/C

X

General

10

Exact

B/C

X


Blank Field Matching#

This needs a special discussion, as its importance is often overlooked. As discussed above, if this property is on, then the absence of data in both records would indicate a match. If this property is off, then two records with missing data, but matching in every other way, will not match.

Blank set “ON”#

The following example demonstrates when Blank set to ON allows a match on a non-critical component. Setting Blank to OFF is recommended for a critical component.

Component    Size  Blank  1  2
ZIP/PC       5     Yes    X  X
Last Name    5     Yes    X  X
Street #     5     Yes    X
Street Name  4     Yes    X
PO Box       10    Yes       X

As described above, this produces the following combinations:

  • Condition #1: ZIP/PC, Last Name, Street Number, Street Name

  • Condition #2: ZIP/PC, Last Name, PO Box

For this example, take the following records:

Name        Address          City/State/PC
Joe Smith   326 Main Street  Pembroke, MA 02066
Suzi Smith  405 Main Street  Pembroke, MA 02066

The following matchcode keys would be generated:

Cond#  Zip/PC  Last Name  Street #  Street Name  PO Box
1      02066   SMITH      326       MAIN
2      02066   SMITH      405       MAIN

According to these matchcode keys, it is clear that these two records do not satisfy condition #1. But because blank field matching is selected, they do satisfy condition #2. The Zip/PC, Last Name, and PO Box are exactly the same. Therefore, the two records do match.

Blank set “OFF”#

This is clearly not the correct result. To fix it, make one change to the matchcode and set Blank to No for the PO Box component:

Component    Size  Blank  1  2
ZIP/PC       5     Yes    X  X
Last Name    5     Yes    X  X
Street #     5     Yes    X
Street Name  4     Yes    X
PO Box       10    No        X

The same comparison is done for combination #2, but the match is disallowed this time because the matchcode now indicates that missing (blank) information is not allowed to figure in the matching condition.
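The effect of the Blank setting on a single component can be sketched as follows. This is a simplified model rather than MatchUp's code: `componentMatches`, `Record`, and `condition2Matches` are hypothetical names, and every component is compared exactly.

```cpp
#include <cassert>
#include <string>

// Compare one matchcode component across two records.
// allowBothBlank models the Blank column: when true, two empty values count
// as a match; when false, a component that is empty in both records fails.
bool componentMatches(const std::string& a, const std::string& b,
                      bool allowBothBlank) {
    if (a.empty() && b.empty()) return allowBothBlank;
    return a == b;
}

// Only the components used by condition #2 (ZIP/PC, Last Name, PO Box).
struct Record {
    std::string zip;
    std::string lastName;
    std::string poBox;
};

// Condition #2 with Blank = Yes for ZIP/PC and Last Name, and a
// configurable Blank setting for PO Box.
bool condition2Matches(const Record& r1, const Record& r2,
                       bool poBoxBlankAllowed) {
    return componentMatches(r1.zip, r2.zip, true)
        && componentMatches(r1.lastName, r2.lastName, true)
        && componentMatches(r1.poBox, r2.poBox, poBoxBlankAllowed);
}
```

For the two Smith records, whose PO Box is empty, `condition2Matches` returns true while Blank is on for PO Box and false once it is turned off.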

Looking at another example (using the same matchcode):

Name        Address     City/State/PC
Joe Smith   PO Box 123  Pembroke, MA 02066
Suzi Smith  PO Box 456  Pembroke, MA 02066

The following matchcode keys would be generated:

Cond#  Zip/PC  Last Name  Street #  Street Name  PO Box
1      02066   SMITH                             123
2      02066   SMITH                             456

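These records have the same problem as before, but this time combination #1 is the cause: Street # and Street Name are blank in both records, and Blank is still set to Yes for those components. An even better matchcode would be: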

Component    Size  Blank  1  2
ZIP/PC       5     Yes    X  X
Last Name    5     Yes    X  X
Street #     4     No     X
Street Name  4     No     X
PO Box       10    No        X

This is one matchcode that works well. There is one more possible tweak, however: turn on Both Blank Fields for the Street # component. Occasionally, MatchUp Object may encounter records such as:

Name               Address             City/State/PC
Joe Notarangello   Oceanfront Estates  Pembroke, MA 02066
Suzi Notarangello  Oceanfront Est.     Pembroke, MA 02066

This reflects a trend in upscale neighborhoods: neither address has a Street # component, yet it is very likely that these records should match.

So this new, improved matchcode will account for these situations:

Component    Size  Blank  1  2
ZIP/PC       5     Yes    X  X
Last Name    5     Yes    X  X
Street #     4     Yes    X
Street Name  4     No     X
PO Box       10    No        X


Matchcode Mapping#

Matchcodes deal with the abstract. The components in a matchcode represent specific types of data, but they aren’t directly linked to the fields in databases. Mapping creates the link between the data and the matchcode.

For example, take the following matchcode:

Component   Size  Fuzzy  1
Zip5        5     No     X
Last Name   5     No     X
First Name  5     No     X
Company     10    No     X

Add a database which contains the following fields:

NAME    Contains full names (“Mr. John Smith”).
COMPANY Contains company names (“Melissa Data”).
ADD1    Contains first (primary) address line (“22382 Avenida Empresa”).
ADD2    Contains second (secondary) address line (“Suite 34”).
CSZ     Contains City/State/Zip (“Rancho Santa Margarita, CA 92688”).

An application must create a link between a database’s fields (Name, Company, Add1, Add2 and CSZ) and the matchcode components (Zip5, Last Name, First Name, Company). With the example above, it may appear that the application will have to contain extensive splitting routines. This is not the case. All that is necessary is to tell MatchUp what type of data is in a specific field and the format of that data.

In the example above, an application would use the following matchcode mapping:

Matchcode Component  Database Field  Matchcode Mapping
Zip5                 CSZ             CityStZip
Last Name            NAME            FullName
First Name           NAME            FullName
Company              COMPANY         Company

This mapping tells MatchUp that the 5-digit ZIP Code information is in a field named “CSZ” which is described as a field containing city, state, and ZIP Code information. The Last Name can be found in a field called “NAME” and is described as a full name field (a full name in the sequence prefix, first name, middle name, last name, suffix).

Matchcode Mapping Rules#

Matchcode mappings follow five rules:

  1. For every Matchcode Component, the application must specify a mapping. The only exception is described in rule 2.

  2. Actual address component names (such as Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary, and the Global Components) are not listed for mapping purposes. Instead, the names Address Line 1 and Address Line 2 through Address Line 8 are used. The example below uses four address components in the matchcode (Street #, Street Name, Street Secondary, PO Box). However, it only uses two address lines.

  3. If a matchcode uses any address components, Address Lines 1-8 will be listed after all other components regardless of where the address component appears in the matchcode. In the following example, the address components are listed before company in the matchcode, but Address Lines 1-8 are listed at the end (after company).

  4. If a matchcode uses address components, Address Lines 1-8 will require at least one line to be mapped, but not all. If a database only has one address field, an application will only need to map Address 1 to that field. All other components must be mapped.

  5. Address Lines should be mapped from the top down (Address Line 1, then 2 through 8).
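Rule 3 can be checked mechanically before any mappings are sent to the deduper. The sketch below is a standalone illustration: `DemoMapping` is a local stand-in that borrows four mapping names from this section, not the library's own enumeration, and `addressLinesAreLast` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Local stand-in for a few mapping types used in this section.
enum class DemoMapping { CityStZip, FullName, Company, Address };

// Returns true when every Address mapping comes after all other mappings,
// as rule 3 requires.
bool addressLinesAreLast(const std::vector<DemoMapping>& order) {
    bool seenAddress = false;
    for (DemoMapping m : order) {
        if (m == DemoMapping::Address) {
            seenAddress = true;
        } else if (seenAddress) {
            return false;  // a non-address mapping follows an address line
        }
    }
    return true;
}
```

A mapping list of CityStZip, FullName, FullName, Company, Address, Address passes this check, while the same list with Company moved to the end does not.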

Enhancing the matchcode in the previous example:

Component

Size

Fuzzy

1

2

Zip5

5

No

X

X

Last Name

5

No

X

X

First Name

5

No

X

X

Street Number

5

No

X

Street Name

5

No

X

Street Secondary

12

No

X

PO Box

10

No

X

Company

10

No

X

X

Again, MatchUp doesn’t use the individual address components for mapping. They are replaced with Address Line 1, Address Line 2, and Address Line 3. So, the application would use the following Matchcode Mapping:

Matchcode Component  Database Field  Matchcode Mapping
Zip5                 CSZ             CityStZip
Last Name            NAME            FullName
First Name           NAME            FullName
Company              COMPANY         Company
Address Line 1       ADD1            Address
Address Line 2       ADD2            Address
Address Line 3       (none)

Note on Rule #1#

If a database does not contain a field for information called for by a component in a matchcode, such as the Company field in the above example, then that matchcode should not be used to dedupe that database.

Use a different matchcode or modify an existing matchcode, as outlined later in this chapter.

However, if a matchcode calls for last name, for example, and the database only has full name, then simply map the full name field to the last name and MatchUp Object will handle parsing the field.

Matchcode Mapping Using the API#

All three of the MatchUp Object deduping interfaces (Incremental, Read/Write and Hybrid) have an AddMapping function. This is used to create mappings for the current instance of whatever deduper an application is using. For the last example above, call the function in the following way:

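mu->ClearMappings();              // remove any previous mappings
mu->AddMapping(mu->CityStZip);    // Zip5 (from CSZ)
mu->AddMapping(mu->FullName);     // Last Name (from NAME)
mu->AddMapping(mu->FullName);     // First Name (from NAME)
mu->AddMapping(mu->Company);      // Company (from COMPANY)
mu->AddMapping(mu->Address);      // Address Line 1 (from ADD1)
mu->AddMapping(mu->Address);      // Address Line 2 (from ADD2)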

The value being passed to the function is an enumerated value of the type MatchcodeMapping.

Note that this code does not tell MatchUp Object anything about the database containing the data to be deduped. The application handles the data access separately and then passes the necessary fields to the deduper using the AddField function.

Changing Mappings#

It is possible to change mappings in the middle of a session if, for example, an application has to handle two databases with different data structures. Continuing with the example from above, assume that the second database has the following structure:

Matchcode Component

Database Field

Matchcode Mapping

Zip5

CSZ

CityStZip

Last Name

NAME

FullName

First Name

NAME

FullName

Address Line 1

ADD1

Address

Address Line 2

ADD2

Address

Address Line 3

(none)

To use this mapping, the application would first have to call the ClearMappings function to remove the existing mappings and call the AddMapping function again to configure the new mapping.

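mu->ClearMappings();              // remove the mappings for the first database
mu->AddMapping(mu->CityStZip);    // Zip5
mu->AddMapping(mu->LastName);     // Last Name
mu->AddMapping(mu->FirstName);    // First Name
mu->AddMapping(mu->Company);      // Company
mu->AddMapping(mu->Address);      // Address Line 1
mu->AddMapping(mu->Address);      // Address Line 2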

Matchcode Components#

The following table lists all of the available matchcode components (Data Types) in MatchUp Object:

Component                Description
Prefix                   Prefix of a personal name (Mr, Mrs, Ms, Dr).
First Name               A first name.
Middle Name              A middle name.
Last Name                A last name.
Suffix                   A suffix from a personal name.
Gender                   Male/Female/Neutral.
First/Nickname           A representative nickname for a first name.
Middle/Nickname          A representative nickname for a middle name.
Department/Title         A title and/or department name. *Note
Company                  A company name.
Company Acronym          A company’s acronym. *Note
Street Number            The street number from an address line.
Street Pre-Directional   “South” in “3 South Main St”.
Street Name              The street name from an address line.
Street Suffix            An address suffix (St, Ave, Blvd).
Street Post-Directional  “North” in “3 Main St North”.
PO Box                   PO Boxes also include Farm Routes, Rural Routes, etc.
Street Secondary         Apartments, floors, rooms, etc.
Address                  A single unparsed address line. *Note
City                     A city name. ZIP or Postal code is usually more accurate.
State/Province           A state or province name.
Zip9                     A full ZIP + 4® code (9 digits). *Note
Zip5                     The ZIP Code (5 digits).
Zip4                     The +4 extension of a ZIP + 4 code (4 digits).
Postal Code (Canada)     A Canadian Postal Code.
City (UK)                A city in the United Kingdom.
County (UK)              A county in the United Kingdom.
Postcode (UK)            A United Kingdom Postcode.
Country                  A country.
Phone/Fax                A phone number. *Note
E-Mail Address           An e-mail address. *Note
Credit Card Number       A credit card number.
Date                     A date. This may result in slower throughput! *Note
Numeric                  A numeric field. This may result in slower throughput! *Note
Proximity                Allows you to specify a maximum distance in miles between records in which a match will be possible. This may result in slower throughput! *Note
General                  Any general information. ID, birthday, SSN, etc.

Global Components#

Component                                Description
Postal Code                              (ZIP and/or +4) Complete postal code for a particular delivery point.
Premises Number                          (Street Number) Alphanumeric indicator within premises field.
Double Dependent Locality                Smallest population center data element.
Dependent Locality                       (Urbanization) Smaller population center data element. Dependent on Locality.
Sub Administrative Area                  (County) Smallest geographic data element.
Sub National Area                        Arbitrary administrative region below that of the sovereign state.
Locality                                 (City) Most common population center data element.
Administrative Area                      (State) Most common geographic data element.
Thoroughfare Leading Type                Leading thoroughfare type indicator within the Thoroughfare field.
Thoroughfare Pre-Directional             (Street Pre Direction) Prefix directional contained within the Thoroughfare field.
Thoroughfare Name                        (Street Name) Name indicator within the Thoroughfare field.
Thoroughfare Trailing Type               (Street Suffix) Trailing thoroughfare type indicator within the Thoroughfare field.
Thoroughfare Post-Directional            (Street Post Direction) Postfix directional contained within the Thoroughfare field.
Dependent Thoroughfare Pre-Directional   Prefix directional contained within the Dependent Thoroughfare field.
Dependent Thoroughfare Leading Type      Leading thoroughfare type indicator within the Dependent Thoroughfare field.
Dependent Thoroughfare Name              Name indicator within the Dependent Thoroughfare field.
Dependent Thoroughfare Trailing Type     Trailing thoroughfare type indicator within the Dependent Thoroughfare field.
Dependent Thoroughfare Post-Directional  Postfix directional contained within the Dependent Thoroughfare field.

Notes#

Company, Company Acronym, Department/Title

Frequently these components don’t match exactly because of ‘noise words’ such as “the,” “and,” “agency,” and so on. MatchUp strips these words from these components.

Company Acronym

MatchUp Object converts any multi-word company name into an acronym (for example, “International Business Machines” is squeezed into “IBM”). Single-word company names are left as they are. This conversion is done after noise words are removed.
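A rough sketch of this conversion is shown here. It is not MatchUp's implementation: `companyAcronym` is a hypothetical function, the noise word list contains only the three examples named above, upper-casing is an assumption, and punctuation is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Remove noise words, then squeeze a multi-word name into its initials.
// A name that is a single word after noise removal is returned unchanged
// (apart from upper-casing).
std::string companyAcronym(const std::string& name) {
    // Illustrative noise words only; a real list would be much longer.
    static const std::set<std::string> noise = {"THE", "AND", "AGENCY"};

    std::istringstream in(name);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        for (char& c : w) {
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        }
        if (noise.count(w) == 0) words.push_back(w);
    }

    if (words.empty()) return std::string();
    if (words.size() == 1) return words.front();

    std::string acronym;
    for (const std::string& word : words) acronym += word[0];
    return acronym;
}
```

Because noise words are removed first, a name such as “The Acme Agency” is treated as the single word “ACME” and is not shortened further.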

Street Address Components

The seven street address components (Street Number, Street Pre-Directional, Street Name, Street Suffix, Street Post-Directional, PO Box, Street Secondary) are obtained by splitting up to three address lines. Note that PO Box and/or Street Secondary do not have to appear on their own line, or in a particular field. MatchUp’s proprietary “street smart” splitter does all of the work.

Full Address

When using the Full Address component, you are at the mercy of every little deviation in data entry. Because MatchUp Object’s street splitter is so powerful, it is preferable to use street address components instead of the Full Address in nearly all cases. The only exception may be when processing foreign addresses that don’t conform very well to US, Canadian or UK addressing formats.

Zip9, Zip5, Zip4, Canadian Postal Code

MatchUp Object removes dashes and spaces from ZIP codes. When processing a mix of Canadian Postal Codes and US ZIP codes, use the Zip9 component.

Phone Number

MatchUp Object removes non-numeric characters from phone numbers. Leading ‘1-’ and trailing extensions are stripped if present. Numbers lacking an area code are right justified so that the local dialing code and number are aligned with numbers having area codes. If a data table often has missing or inaccurate area codes (e.g., after a recent area code split), start at the 4th position of the phone number component. Do not use the rightmost 7 positions, as badly formatted extensions can sometimes cause the phone number to get coded improperly.
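The alignment described above can be illustrated with a small normalizer. This is a simplified sketch with hypothetical names (`normalizePhone`, `localPart`); it assumes 10-digit North American numbers and does not attempt to detect or strip extensions.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Keep digits only, drop a leading long-distance '1', and right-justify the
// result in a 10-character field so that numbers without an area code line
// up with numbers that have one.
std::string normalizePhone(const std::string& raw) {
    std::string digits;
    for (char c : raw) {
        if (std::isdigit(static_cast<unsigned char>(c))) digits += c;
    }
    if (digits.size() == 11 && digits[0] == '1') digits.erase(0, 1);

    const std::size_t width = 10;
    if (digits.size() < width) digits.insert(0, width - digits.size(), ' ');
    return digits;
}

// Start at the 4th position (skipping the area code) and take the
// 7-character local number.
std::string localPart(const std::string& normalized) {
    return normalized.substr(3, 7);
}
```

Counting from the 4th position rather than from the right means that stray digits after the number do not shift the local part.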

E-Mail Address

MatchUp Object removes illegal characters from e-mail addresses. Incomplete, changed, and commonly misspelled domain names are corrected using the Email Address data table.

Date

MatchUp Object lets you specify a number of days. Two records can match on this component only if their dates fall within that many days of each other.

Numeric

This lets you specify an integer. Two records can match on this component only if the difference between their values falls within that number.

Proximity

The Proximity component requires you to map in latitude and longitude coordinates. MatchUp does not determine these coordinates; they can be obtained from a product such as GeoCoder or Contact Verify. Records can then match when their locations fall within the maximum distance set for this component.
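One common way to turn two latitude/longitude pairs into a distance in miles is the haversine formula. The sketch below uses it purely as an illustration; the source does not say which distance formula MatchUp applies, and `milesBetween` and `withinProximity` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Great-circle distance in miles between two points given in decimal
// degrees, using the haversine formula and a mean Earth radius.
double milesBetween(double lat1, double lon1, double lat2, double lon2) {
    const double kPi = 3.14159265358979323846;
    const double kEarthRadiusMiles = 3958.8;
    const double toRad = kPi / 180.0;

    const double dLat = (lat2 - lat1) * toRad;
    const double dLon = (lon2 - lon1) * toRad;
    const double a = std::sin(dLat / 2) * std::sin(dLat / 2) +
                     std::cos(lat1 * toRad) * std::cos(lat2 * toRad) *
                     std::sin(dLon / 2) * std::sin(dLon / 2);
    // Clamp to guard against rounding slightly above 1.
    return 2.0 * kEarthRadiusMiles * std::asin(std::min(1.0, std::sqrt(a)));
}

// True when the two points are no farther apart than maxMiles.
bool withinProximity(double lat1, double lon1, double lat2, double lon2,
                     double maxMiles) {
    return milesBetween(lat1, lon1, lat2, lon2) <= maxMiles;
}
```

One degree of latitude works out to roughly 69 miles, so a maximum distance of a mile or two corresponds to very small coordinate differences.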


Optimizing Matchcodes#

Some matchcodes process much faster than others in spite of the fact that they detect the same matches. This section will assist in creating the most efficient matchcodes. This discussion is included so that you can better understand why certain things are done while optimizing.

Optimizing can make a significant difference in processing speed. 58-hour runs have been reduced to four hours simply by optimizing the matchcode.

It is important to verify that a matchcode works in the intended way before attempting any optimizations. If a matchcode is not functioning properly, these optimizations will not help and could quite possibly make the situation worse.

Component Sequence#

As discussed in the previous section, data may process faster if the first component of a matchcode has certain properties:

  • It must be used in every combination.

  • It cannot use certain types of Fuzzy Matching: Containment; Frequency; Fast Near; Frequency Near; or Accurate Near (other types are okay, though).

  • It cannot use Initial Only matching.

  • It cannot use One Blank Field matching.

  • It cannot use Swap matching.

If the matchcode’s second component also follows these conditions, MatchUp Object will incorporate it into its Clustering scheme. Additional components, if they follow in sequence (third, fourth, and so on), will be used if they satisfy these conditions. Incorporating a component into a cluster greatly reduces the number of comparisons MatchUp Object has to perform which, in turn, speeds up your processing.

This is a simple example of optimization:

Component    Size  Fuzzy  Blank  1  2
ZIP/PC       5     No     Yes    X  X
Street #     5     No     Yes    X
Street Name  5     No     No     X
PO Box       10    No     No        X
Last Name    5     No     Yes    X  X

As shown here, MatchUp Object will only cluster by ZIP/PC. But note that the last component satisfies all the conditions listed earlier.

Component    Size  Fuzzy  Blank  1  2
ZIP/PC       5     No     Yes    X  X
Last Name    5     No     Yes    X  X
Street #     5     No     Yes    X
Street Name  5     No     No     X
PO Box       10    No     No        X

This simple optimization will produce significant improvements in speed. In general, if your matchcode requires multiple components to be used in all set combinations, place them before other components.
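The component sequence rule can be expressed as a count of leading components that qualify for clustering. The following is a sketch of that rule only; the `Component` struct, its flag names, and the two functions are hypothetical and say nothing about how MatchUp Object builds its clusters internally.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One matchcode component with the properties that affect clustering.
struct Component {
    std::string name;
    bool inEveryCombination;  // has an X in every used column
    bool restrictedFuzzy;     // Containment, Frequency, Fast Near,
                              // Frequency Near, or Accurate Near
    bool initialOnly;
    bool oneBlankField;
    bool swap;
};

// A component can be clustered on only if it meets every condition.
bool clusterEligible(const Component& c) {
    return c.inEveryCombination && !c.restrictedFuzzy && !c.initialOnly &&
           !c.oneBlankField && !c.swap;
}

// Number of leading components that can be used for clustering. Counting
// stops at the first component that fails a condition.
std::size_t clusterDepth(const std::vector<Component>& matchcode) {
    std::size_t depth = 0;
    while (depth < matchcode.size() && clusterEligible(matchcode[depth])) {
        ++depth;
    }
    return depth;
}
```

With Last Name in the last position the depth is 1 (ZIP/PC only); moving Last Name directly after ZIP/PC raises it to 2.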

Fuzzy Algorithms#

Fuzzy algorithms fall into two categories: early matching and late matching.

Early matching algorithms are algorithms where a string is transformed into a (usually shorter) representation and comparisons are performed on this result. In MatchUp, these transformations are performed during key generation (the BuildKey function in each interface), which means that the early matching algorithms pay a speed penalty once per record: as each record’s key is built.

Late matching algorithms are actual comparison algorithms. Usually one string is shifted in one direction or another, and often a matrix of some sort is used to derive a result. These transformations are performed during key comparison. As a result, late matching algorithms pay a speed penalty every time a record is compared to another record. This may happen several hundred times per record.

Obviously, late matching is much slower than early matching. If a particular matchcode is very slow, changing to a faster fuzzy matching algorithm may improve the speed. Often, a faster algorithm will give nearly the same results, but it is a good idea to test any such change before processing live data.
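The cost difference can be seen by putting one algorithm of each kind side by side. Both functions below are simplified stand-ins written for this comparison and are not MatchUp's implementations: `consonantsOnly` models an early transformation that runs once per record, and `longestCommonSubstring` models a late comparison that runs once per pair of records.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Early matching: transform the value once, when the key is built.
// Keeps upper-cased letters other than A, E, I, O, U.
std::string consonantsOnly(const std::string& s) {
    const std::string vowels = "AEIOU";
    std::string key;
    for (char ch : s) {
        const char c =
            static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
        if (c >= 'A' && c <= 'Z' && vowels.find(c) == std::string::npos) {
            key += c;
        }
    }
    return key;
}

// Late matching: length of the longest common substring, computed with a
// rolling two-row table. The work is proportional to |a| * |b| and is
// repeated for every pair of keys that is compared.
std::size_t longestCommonSubstring(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0);
    std::vector<std::size_t> cur(b.size() + 1, 0);
    std::size_t best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            best = std::max(best, cur[j]);
        }
        std::swap(prev, cur);
    }
    return best;
}
```

After the early transformation, two keys can be compared with a plain equality test, whereas the late algorithm has to fill its table again for every comparison.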

Fuzzy Algorithm Ranking#

Algorithm                       Late or Early  Speed (10=fastest)
Jaro                            Late           1
Jaro-Winkler                    Late           1
n-Gram                          Late           1
Needleman-Wunsch                Late           1
Smith-Waterman-Gotoh            Late           1
Dice’s Coefficient              Late           1
Jaccard Similarity Coefficient  Late           1
Overlap Coefficient             Late           1
Longest Common Substring        Late           1
Double Metaphone                Late           1
Accurate Near                   Late           1
Fast Near                       Late           3
Containment                     Late           4
Frequency Near                  Late           4
Frequency                       Late           6
Phonetex                        Early          7
Soundex                         Early          8
Vowels Only                     Early          9
Numerics Only                   Early          9
Consonants Only                 Early          9
Alphas Only                     Early          9
Exact                           N/A            10

The speed values are only rough estimates.

Another benefit of using a faster fuzzy algorithm is that an application may be able to exploit the component sequence optimization shown earlier. All of the early matching algorithms satisfy the restrictions for first components.

Unnecessary Components#

Components that are not used in any combinations (in other words, they have no X’s in columns 1 through 16) are a sign of poor matchcode design.

Take the following matchcode:

Component    Size  Fuzzy  Blank  1  2
ZIP/PC       5     No     Yes    X  X
Last Name    5     No     Yes    X  X
First Name   5     No     Yes
Street #     5     No     Yes    X
Street Name  5     No     No     X
PO Box       10    No     No        X

First name is not being used in any combination. Perhaps it was used in a combination that has since been removed from this matchcode, but it is no longer necessary.

Unnecessary Combinations#

Take the following matchcode:

Component    Size  Fuzzy  Blank  1  2  3  4
ZIP/PC       5     No     Yes    X  X  X  X
Last Name    5     No     Yes    X  X  X  X
First Name   5     No     Yes    X  X
Street #     5     No     Yes    X     X
Street Name  5     No     No     X     X
PO Box       10    No     No        X     X

Here are the four conditions for matching:

  • Condition #1: ZIP/PC, Last Name, First Name, Street #, Street Name

  • Condition #2: ZIP/PC, Last Name, First Name, PO Box

  • Condition #3: ZIP/PC, Last Name, Street #, Street Name

  • Condition #4: ZIP/PC, Last Name, PO Box

There is no match that will be detected by condition #1 that would not be detected by condition #3. Similarly, matches found by condition #2 will always be found by condition #4. In other words, the components of condition 3 are a subset of those of condition 1, and the components of condition 4 are a subset of those of condition 2. Subsets are rarely desirable.

So either conditions 1 and 2 aren’t needed or conditions 3 and 4 were a mistake. If conditions 1 and 2 are eliminated, the First Name component should also be removed, as it will not be needed.
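This kind of redundancy can be found mechanically by treating each condition as a set of component names. The sketch below is an illustration with hypothetical names (`Combination`, `findRedundant`); it assumes that all components are compared the same way in every condition, so that a condition with fewer components always matches at least as much.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Combination = std::set<std::string>;

// Returns (redundant, coveredBy) index pairs. Combination `redundant`
// requires every component of `coveredBy` plus at least one more, so any
// match it finds is also found by `coveredBy`.
std::vector<std::pair<std::size_t, std::size_t>>
findRedundant(const std::vector<Combination>& combos) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t i = 0; i < combos.size(); ++i) {
        for (std::size_t j = 0; j < combos.size(); ++j) {
            if (i == j || combos[j].size() >= combos[i].size()) continue;
            // std::set iterates in sorted order, as std::includes requires.
            if (std::includes(combos[i].begin(), combos[i].end(),
                              combos[j].begin(), combos[j].end())) {
                out.emplace_back(i, j);
            }
        }
    }
    return out;
}
```

For the four conditions above (using zero-based indexes), the function reports that condition 0 is covered by condition 2 and condition 1 is covered by condition 3.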


Best Practices#

MatchUp Object Best Practices contains recommendations from support for cases where performance is not optimal.

Intersecting Deduper#

When a matchcode circumvents the first component restrictions (the first component is not used in all combinations, or a fuzzy algorithm is applied to it), expect throughput to be significantly slower. This can also cause stability issues. We do not recommend using this type of matchcode when processing large numbers of records. Test thoroughly with small numbers of records before scaling up to larger data sets and a production environment. Using the Hybrid Deduper or small amounts of data will not show the problem.

For more info: Matchcode Combinations.

Matchcodes with Fuzzy Algorithms#

Since fuzzy algorithms can drastically slow down a process or raise stability issues for enterprise-level processes, we recommend that you first establish acceptable throughput benchmarks with a standard exact matchcode. Then make small incremental changes that progress toward the desired matching strategy. Depending on the quality of the data and the number of records, certain matchcode properties may be impractical at the desired speeds.

Optimizing Speed: General#

  1. Network data traffic: We recommend that the source data to be processed be local with respect to the installed Melissa Data program. Network permissions, throughput, and in some cases, MatchUp’s need to access record ‘x’ to complete consolidation with record ‘y’, are all potential sources of a slower process.

  2. Source datatype: Some database or file types can be read by the calling language or IDE more efficiently than others. Matching your environment to the most efficient file type requires trial and error testing by the developer.

  3. Hardware: The more hardware you dedicate to a process, the faster it will generally run. However, many processes cannot take advantage of additional hardware, or show diminishing returns. For example, data with varied ZIP Codes may be able to use multiple processors to process individual clusters of records, but a database where every record has the same ZIP Code may not. Additionally, given the factors above, hardware may not be the overriding factor governing a fast process; a good matchcode may be the most important factor.

Optimizing Speed: Matchcodes#

  1. Matchcode components: Before MatchUp dedupes, it clusters records into groups of possible matches. If your matchcode does not have any component that is used in every combination, it cannot place records into those subgroup clusters. In general, the greater the number of components used in every combination, the faster the process will be.

  2. Matchcode fuzzy options: MatchUp has an extensive list of fuzzy options. Some are performed during the key building process (e.g., Soundex) and do not slow the process down. Others are performed on the constructed match keys (e.g., Near, Jaro) and therefore slow down the process. If your process requires the latter types, place them, if possible, below an exact component that is also used in every combination.

Order of Components in Matchcode#

Although the Matchcode Editor interface lets you place the components in any order, the Object does have a few restrictions when calling the AddMapping methods. Namely, Address Line AddMappings must be called last, even if you have added another component after the Address matchcode components. Calling AddMappings in the wrong order will throw an error, so when using the Matchcode Editor, place your address components last. The exception would be rare cases where address components are used in every specified column, but a different component is not used in all combinations (specified columns).

Back up your Matchcode database#

If you create your own matching strategies, you should occasionally back up this file in case someone changes a matchcode or the file becomes corrupted.

For the Object, SSIS, and Contact Zone, this file is named mdMatchUp.mc.

For the MatchUp Software version, the file is named DTake.mc.

An example of good backup practice would be mdMatchUp_20140123.mc, allowing you to see the original matchcode used in processes before Jan 23, 2014.

Using Efficient SetUserInfo#

By default, the UserInfo value set with SetUserInfo (the unique identifier attached to each built match key) is 1024 bytes, allowing the developer to pass an advanced custom identifier, or even source data, to the key file. While this can have data handling advantages, it causes the key file and temporary sort files to grow much larger than needed for most jobs and slows down the process. A new reserved function has been added, allowing the user to override the default UserInfo size. For example:

ReadWrite->SetReserved("UserInfoSize","12");

Our tests have shown this to reduce key and temporary disk storage usage by a factor of 10 and processing time by as much as 60%.

To determine programmatically whether you have the necessary update (build 2072 or newer):

printf(" BUILD NUMBER: %s\n",mdMUReadWriteGetBuildNumber(ReadWrite));

Keep Work File Location Local#

MatchUp uses this location to store the process key file as well as temporary sorting files.

By default, Windows will store these files in the temp directory of the logged-in user. On *nix platforms, the default is the directory where the executable is run.

Although users can override this location, we do not recommend it unless you are pointing this location to a fast local drive with plenty of writable disk space and full read/write permissions.


Deduping Methods#

MatchUp offers three methods of operation:

Incremental Deduping#

Incremental deduping is usually used for real-time data entry validation, for example in a call center data-entry system where an operator would like to determine whether or not the caller is an existing customer. At any time, a calling program can pass the incremental deduping engine the contents of a record; the engine will then report whether or not this record is a dupe, and if so, which record or records it matches.

Incremental deduping consists of the following steps:

  1. The program processes a record and sends the specific information (ZIP/PC, Name, Address, etc) to MatchUp Object.

  2. Based on previous records sent to the API, it reports whether or not the record from the first step matches any of these previous records.

  3. Optionally, the application can tell MatchUp Object to add this record to its database for consideration in future comparisons.

The Historical Database#

The incremental deduping engine relies heavily on a historical database that it maintains. The lifetime of this database is as long as necessary (seconds, days, even years). This database is constructed and maintained by MatchUp Object, so it can determine whether or not an incoming record matches other records fairly quickly.

Multi-User/Multi-Thread Considerations#

Incremental deduping is unique in that multiple users or multiple processes can access the same historical database simultaneously. The API maintains a locking system to ensure that competing processes don’t collide. In order for two processes to work in this fashion, the initialization function for each process must specify the same historical database (a.k.a. “key file”).

Transaction-Based Processing#

The Incremental deduper interface of MatchUp Object features the option of using transaction-based operations on the historical database. This enables an application to process multiple calls to the AddRecord function as one, speeding up processing of large lists.

Incremental Order of Operations#

Using the Incremental deduper is pretty straightforward. This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Incremental deduper.

  1. Initialize the Incremental deduper.

    After creating an instance of the Incremental deduper, point the object toward its supporting data file, select a matchcode and key file to use, and initialize these files.

  2. Create field mappings.

    In order to build a key to compare to the key file, the Incremental deduper needs to know which types of data the program will be passing to the deduper and in what order.

  3. Read the record from the data source.

    This can be a new address passed from a website or a single record from a newly acquired list or data source. It will be compared against the master list.

  4. Build a match key for the incoming record.

    This consists of passing the actual data to the deduper in the same order used when creating a field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the Incremental deduper uses this information to generate a match key.

  5. Compare the match key to the key file.

    The MatchRecord function searches the key file for any keys that match the new record. If it finds a match, it provides information on the duplicate records in the key file.

  6. Write new records to the key file.

    The new key, whether or not it is unique, can then be written to the key file, so it can be used for future deduping operations. The program code must also write the new address record to the database separately.
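The six steps above can be sketched as a small, self-contained Python simulation. This is not the MatchUp Object API: the `ToyIncrementalDeduper` class, its method names, and the key-building rule (five characters each of ZIP and last name, exact match only) are assumptions made for illustration. Only the overall flow follows the steps in this section.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIncrementalDeduper:
    """Hypothetical stand-in for an incremental deduper (not the real API)."""
    mappings: list = field(default_factory=list)   # field types, in order
    key_file: dict = field(default_factory=dict)   # stored key -> record ids
    _fields: list = field(default_factory=list)    # fields of the record in progress

    def add_mapping(self, field_type: str) -> None:
        self.mappings.append(field_type)

    def add_field(self, value: str) -> None:
        # Fields must arrive in the same order as the mappings.
        self._fields.append(value)

    def build_key(self) -> str:
        if len(self._fields) != len(self.mappings):
            raise ValueError("fields passed do not match the field mappings")
        # Assumed matchcode: upper-case each component, trim and pad to 5 chars.
        key = "".join(v.strip().upper()[:5].ljust(5) for v in self._fields)
        self._fields = []
        return key

    def match_record(self, key: str) -> list:
        # Look the key up among previously stored keys.
        return list(self.key_file.get(key, []))

    def add_record(self, key: str, record_id: int) -> None:
        # Store the key so later records can match against it.
        self.key_file.setdefault(key, []).append(record_id)


def run_incremental(records):
    deduper = ToyIncrementalDeduper()                    # 1. initialize
    deduper.add_mapping("Zip5")                          # 2. field mappings
    deduper.add_mapping("LastName")
    results = {}
    for record_id, (zip_code, last_name) in enumerate(records):  # 3. read
        deduper.add_field(zip_code)                      # 4. build the key
        deduper.add_field(last_name)
        key = deduper.build_key()
        results[record_id] = deduper.match_record(key)   # 5. compare
        deduper.add_record(key, record_id)               # 6. write the key
    return results


matches = run_incremental([("92688", "Smith"), ("92688", "Jones"), ("92688", "SMITH ")])
```

In this sketch the third record matches the first, because both reduce to the same key once case and padding are normalized.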

Read/Write Deduping#

Read/Write deduping is usually used for processing entire lists. It works in a manner similar to the MatchUp software products. A calling program passes an entire list to the Read/Write deduping engine one record at a time. When the entire list has been passed, the calling program tells the API to process the records. Then, the calling program retrieves each record, along with additional deduplication information, from the Read/Write deduper.

Read/Write deduping consists of the following steps:

  1. One by one, the program sends a series of record data (ZIP/PC, Name, Address, etc.) to the MatchUp API.

  2. When step 1 is complete, the program sends a “process” command to the API.

  3. The program retrieves the results for each record with deduplication information.

Order of Output Records#

The program will send records in a particular sequence, either in record (raw) order, or maybe in a more sophisticated manner (by ZIP/PC, record type, and so on). MatchUp Object will not return the records in the same order. By default, records are output in cluster order. This order will be loosely based on the matchcode. For example, if the matchcode has Zip5 as its first component, output records will be more or less sorted by ZIP Code (but the developer should not count on this). If the application called the SetGroupSorting function, records in the same dupe group will be adjacent. Otherwise, duplicate records may or may not be adjacent (though they usually are near each other).

If a certain sequence is important (for example, records ordered in the same sequence they were input), sort the results after MatchUp Object has processed the data.
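As a minimal sketch of restoring input order, the calling program can record each record's input position before sending it and sort on that value after processing. The tuples below are invented sample data, not actual MatchUp Object output.

```python
# Hypothetical results as returned in cluster order:
# (input_sequence, match_key, dupe_group)
results_in_cluster_order = [
    (2, "02101JONES", 1),
    (0, "92688SMITH", 2),
    (3, "92688SMITH", 2),
    (1, "98052BROWN", 3),
]

# Sort on the sequence number assigned when the records were sent.
results_in_input_order = sorted(results_in_cluster_order, key=lambda row: row[0])
input_sequence = [row[0] for row in results_in_input_order]
```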

Data Lifetime#

A Read/Write deduping session is relatively short-lived. Although the actual action of reading and writing records may take time (hours or days), the process is strictly divided into three distinct steps. The key file does not persist beyond this point. Because of this, Read/Write deduping is not usually the choice for ongoing or online processes.

Record Identity#

Because MatchUp Object does not read or write directly to the database, some mechanism must be provided so that the application can match each record back to the original data source. The SetUserInfo function allows the application to pass a unique identifier for each record.

Read/Write Order of Operations#

Using the Read/Write deduper is pretty straightforward. This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Read/Write deduper.

  1. Initialize the Read/Write deduper.

    After creating an instance of the Read/Write deduper, point the object toward its supporting data file, select a matchcode and key file to use, and initialize these files.

  2. Create field mappings.

    In order to build a key to be written to the key file, the Read/Write deduper needs to know which types of data the application will be passing to the deduper and in what order.

  3. Read the records from the database.

    Loop through the master database and get the data fields needed to build a key, according to the mappings defined in step 2.

  4. Build a match key for each record.

    This consists of passing the actual data to the deduper in the same order used when creating the field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the deduper uses this information to generate a match key.

  5. Write each match key to the key file.

    The WriteRecord function stores each match key in a temporary key file.

  6. Process the keys.

    After building the keys, calling the Process function loops through the keys and compares them to each other.

  7. Loop through the records and read the deduping data for each one.

    The ReadRecord function loops through the entire set of deduped records and allows the application to read information on the record’s duplicate/unique status, the number of duplicates for each record and the record dupe group.
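The write, process, and read phases above can be illustrated with a toy batch deduper. Everything here is hypothetical: the `ToyReadWriteDeduper` class, its method names, the result fields, and the exact-match key rule are assumptions for this sketch and do not reproduce the MatchUp Object interface.

```python
from collections import defaultdict


class ToyReadWriteDeduper:
    """Hypothetical batch deduper that mimics the write / process / read flow."""

    def __init__(self, mappings):
        self.mappings = list(mappings)    # field types, in order
        self._pending = []                # temporary key file: (key, user_info)
        self._results = []

    def build_key(self, fields):
        # Assumed matchcode: 5 characters per component, exact match.
        return "".join(str(f).strip().upper()[:5].ljust(5) for f in fields)

    def write_record(self, key, user_info):
        # Store the key with an identifier that maps back to the source record.
        self._pending.append((key, user_info))

    def process(self):
        # Compare all keys with each other and assign dupe groups.
        groups = defaultdict(list)
        for key, user_info in self._pending:
            groups[key].append(user_info)
        self._results = []
        for group_no, key in enumerate(sorted(groups), start=1):
            members = groups[key]
            for user_info in members:
                self._results.append({
                    "user_info": user_info,
                    "dupe_group": group_no,
                    "group_size": len(members),
                    "is_duplicate": user_info != members[0],
                })

    def read_records(self):
        # Hand results back one at a time, in key order rather than input order.
        yield from self._results


source = {101: ("92688", "Smith"), 102: ("02101", "Jones"), 103: ("92688", "SMITH")}
deduper = ToyReadWriteDeduper(["Zip5", "LastName"])
for record_id, fields in source.items():
    deduper.write_record(deduper.build_key(fields), record_id)
deduper.process()
report = {r["user_info"]: (r["dupe_group"], r["is_duplicate"])
          for r in deduper.read_records()}
```

Note that the results come back grouped by key, not in the order the records were written, which is why each record carries its own identifier.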

Hybrid Deduping#

The Hybrid deduper differs from the Incremental and Read/Write dedupers in that it does not maintain a key file of its own. It is up to the developer to maintain a list of match keys to use for deduping operations. This increases the flexibility of the Hybrid deduper but at the expense of programming complexity.

The main advantage of Hybrid deduping is that it allows the developer to build smaller lists of match keys on the fly and quickly compare records to a small subset of the database.

Clustering#

The concept of Clustering, outlined in the first chapter, is essential to the Hybrid deduper. Unlike the other dedupers, where the clustering is taking place behind the scenes, the Hybrid deduper allows the developer to use clustering to compare a record against only a small portion of a list.

The Hybrid deduper uses the concept of a cluster size, which is the maximum number of characters at the beginning of a key that can be used to group a number of keys into smaller groups that can be compared against each other. For example, a cluster size of 5 means that the first five characters of a match key are used to create the clusters.

In other words, only the records where the first five characters of the match key for one record are identical to the first five characters of the match key for another record are considered when performing a Hybrid deduping operation.
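A minimal sketch of this grouping, assuming match keys are plain strings and using made-up sample keys:

```python
from collections import defaultdict

CLUSTER_SIZE = 5  # number of leading match-key characters that define a cluster


def build_clusters(match_keys, cluster_size=CLUSTER_SIZE):
    """Group match keys by their first `cluster_size` characters."""
    clusters = defaultdict(list)
    for key in match_keys:
        clusters[key[:cluster_size]].append(key)
    return dict(clusters)


keys = ["92688SMITH", "92688JONES", "02101SMITH", "92688SMYTH"]
clusters = build_clusters(keys)

# Only keys in the same cluster as the new key are candidates for comparison.
new_key = "92688SMITT"
candidates = clusters.get(new_key[:CLUSTER_SIZE], [])
```

Here the key beginning with `02101` is never compared against the new record, because it falls in a different cluster.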

Key Maintenance#

Unlike the other interfaces, the Hybrid deduper does not automatically handle the read/write operations to a key file. While this forces the developer to do more work, it allows a great deal of flexibility in how match keys are stored and handled.

In the previous example, with a cluster size of 5, if the match keys are stored in a field within a SQL database, a cluster could be built quickly by performing a SELECT query where the first five characters of the match key field match the first five characters of the match key for the new record.
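One way such a query might look is sketched below with Python's built-in `sqlite3` module and an in-memory database. The `customers` table, its column names, and the sample keys are assumptions for illustration only.

```python
import sqlite3

CLUSTER_SIZE = 5

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, match_key TEXT)")
conn.executemany(
    "INSERT INTO customers (id, match_key) VALUES (?, ?)",
    [(1, "92688SMITH"), (2, "92688JONES"), (3, "02101SMITH")],
)


def fetch_cluster(connection, new_key, cluster_size=CLUSTER_SIZE):
    """Return (id, match_key) rows whose key shares the new key's leading characters."""
    rows = connection.execute(
        "SELECT id, match_key FROM customers "
        "WHERE substr(match_key, 1, ?) = ? ORDER BY id",
        (cluster_size, new_key[:cluster_size]),
    )
    return rows.fetchall()


cluster = fetch_cluster(conn, "92688SMYTH")
```

On a large table, storing the cluster prefix in its own indexed column would likely be faster than applying `substr` to every row, but that is a design choice for the application.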

While this gives the developer far more flexibility, it also requires a great deal more coding and a greater understanding of certain MatchUp concepts.

Hybrid Order of Operations#

Using the Hybrid deduper is not as straightforward as the other interfaces, as it puts greater burden on the developer to handle storage and management of match keys.

This section will outline the basic steps and then show an example of the programming logic for a typical implementation of the Hybrid deduper.

  1. Initialize the Hybrid deduper.

    After creating an instance of the Hybrid deduper, point the object toward its supporting data file, select a matchcode to use, and initialize these files.

  2. Create field mappings.

    In order to build keys to compare, the Hybrid deduper needs to know which types of data the program will be passing to the deduper and in what order.

  3. Build a master list of keys.

    Each record must have a match key so the Hybrid deduper can select a cluster of records or check for duplicates. This consists of passing the data used in record comparison from each record to the deduper in the same order used when creating a field mapping. After passing the necessary fields (usually a small subset of the fields from each record) via the AddField function, the Hybrid deduper uses this information to generate a match key.

  4. Build a match key for the new address record.

    Repeat the step above to create a match key for the record to be compared against the cluster.

  5. Build the cluster list.

    Cycle through the master key list and extract only those records where the first part of the match key equals the first part of the match key for the new record.

  6. Compare the match key to the cluster list.

    Loop through the cluster list, comparing each key to the key for the new record. The CompareKey function indicates whether the two keys match.
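The last four steps can be sketched in plain Python as follows. The helper names, the Zip5 plus Last Name key layout, and the use of exact equality are all assumptions; in particular, `compare_keys` merely stands in for the comparison that this section attributes to the CompareKey function.

```python
def build_key(zip_code, last_name):
    """Assumed matchcode: Zip5 (5 chars) followed by Last Name (5 chars)."""
    return zip_code.strip()[:5].ljust(5) + last_name.strip().upper()[:5].ljust(5)


def compare_keys(key_a, key_b):
    """Stand-in for the key comparison; here, exact equality only."""
    return key_a == key_b


def find_duplicates(master, new_record, cluster_size=5):
    # Build a master list of keys, held by the application itself.
    master_keys = {rec_id: build_key(*fields) for rec_id, fields in master.items()}
    # Build a match key for the new record.
    new_key = build_key(*new_record)
    # Build the cluster list: keys sharing the new key's leading characters.
    cluster = {rec_id: key for rec_id, key in master_keys.items()
               if key[:cluster_size] == new_key[:cluster_size]}
    # Compare the new key against each key in the cluster.
    return [rec_id for rec_id, key in cluster.items() if compare_keys(new_key, key)]


master = {1: ("92688", "Smith"), 2: ("92688", "Jones"), 3: ("02101", "Smith")}
duplicates = find_duplicates(master, ("92688", "SMITH"))
```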


Global Processing#

Legacy/Global Usage#

Foreign Character Translation#

Foreign characters are translated into English equivalents. For example, “Ç” is converted to “C.” All translations are based on the assumption that your data was entered with the 1252 (Windows Latin 1) code page.
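The effect can be approximated in Python by decoding the bytes as Windows-1252 and stripping diacritics through Unicode decomposition. This is an approximation only and not MatchUp's own translation table: letters without a Unicode decomposition (for example “ß” or “Ø”) pass through this sketch unchanged.

```python
import unicodedata


def to_ascii_equivalent(raw: bytes) -> str:
    """Decode Windows-1252 bytes and drop diacritics (an approximation)."""
    text = raw.decode("cp1252")
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep base characters, discard the combining accent marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 0xC7 is "Ç" in code page 1252.
translated = to_ascii_equivalent(b"FRAN\xc7OIS")
```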

Canadian Users#

MatchUp recognizes Canadian provinces and postal codes. In fact, it will abbreviate province names to their two-letter abbreviations automatically.

MatchUp handles the “QC” province abbreviation for Quebec, and “PQ” entries are automatically changed to “QC.”

In Canada, “5-20 Main Street” means “20 Main Street, Apt 5,” but in the US, it means “5 Main Street, Apt 20.” When deduping, MatchUp uses the contents of the ZIP/Postal code as a basis to determine a record’s country of origin, and splits this type of address accordingly.
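The rule can be illustrated with a short sketch that infers the country from the shape of the postal code. The regular expressions and the output field names are assumptions for this example and do not describe MatchUp's actual street splitter.

```python
import re

# Canadian postal codes follow the letter-digit-letter digit-letter-digit pattern.
CANADIAN_POSTAL = re.compile(r"^[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d$")
HYPHENATED = re.compile(r"^(\d+)-(\d+)\s+(.+)$")


def split_hyphenated(address: str, postal_code: str) -> dict:
    """Split 'N-M Street' using the postal code to infer the country."""
    match = HYPHENATED.match(address.strip())
    if not match:
        raise ValueError("address is not in 'N-M Street' form")
    first, second, street = match.groups()
    if CANADIAN_POSTAL.match(postal_code.strip()):
        # Canada: unit first, then street number.
        return {"number": second, "street": street, "unit": first}
    # Otherwise assume US: street number first, then unit.
    return {"number": first, "street": street, "unit": second}


canadian = split_hyphenated("5-20 Main Street", "K1A 0B1")
american = split_hyphenated("5-20 Main Street", "92688")
```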

When creating matchcodes for use with Canadian Postal Codes, use the Postal Code component. However, if a database is a mix of US and Canadian records, use Zip9 as the component type. Zip9 will not adversely affect processing of Canadian records. The goal is to prevent the deduper from trying to extract a ZIP + 4 from a Canadian Postal Code.

United Kingdom Users#

MatchUp can recognize United Kingdom Cities, Counties, and Postal codes. When creating matchcodes for use with United Kingdom addresses, use the Postal code (UK) component. Depending on requirements, consider using the City (UK) and County (UK) components. The Postal code component is structured in the following format: AADDIII, where AA is the Postal code Area (left justified), DD is the Postal code district (right justified), and III is the Inward Code (left justified). Extra spaces and dashes are removed as this structuring is done, so the size of this component is always 7.

Like any other matchcode component, a portion of the Postal code can always be compared by reducing its size and/or starting at a specific position. For example, starting at position 5 for a size of 3 will compare just the Inward code.
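The AADDIII layout can be sketched as below. The regular expression is a simplified view of UK postcode shapes, and padding with spaces is an assumption; the source states only the justification of each part and the fixed size of 7.

```python
import re

# Area = 1-2 letters, district = digit plus optional digit or letter,
# inward code = digit plus two letters.
UK_POSTCODE = re.compile(r"^([A-Z]{1,2})(\d[A-Z\d]?)(\d[A-Z]{2})$")


def structure_uk_postcode(raw: str) -> str:
    """Return the 7-character AADDIII form, padding with spaces."""
    compact = re.sub(r"[\s-]", "", raw.upper())   # drop spaces and dashes
    match = UK_POSTCODE.match(compact)
    if not match:
        raise ValueError(f"unrecognized postcode: {raw!r}")
    area, district, inward = match.groups()
    # Area is left justified, district is right justified, inward code is 3 chars.
    return area.ljust(2) + district.rjust(2) + inward


def inward_code(structured: str) -> str:
    """Start at position 5 (1-based) for a size of 3."""
    return structured[4:7]


structured = structure_uk_postcode("m1 1ae")
```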

MatchUp’s street splitter will not split United Kingdom street addresses as well as Canadian and US addresses. Usually, a matchcode containing a mix of split address components and full address components is a good way to get the benefit of the street splitter (which often does perform well), along with a full-address match for backup. MatchUp Object includes the United Kingdom Address matchcode to be used as a starting point to build on.

International Users#

MatchUp was designed to work with US and Canadian addresses, and performs well with addresses from other English-speaking countries.

The main obstacle with international records is with the Street Splitter. Try doing a test run with one of the default matchcodes. If the street splits are not working well, use the full address when creating a matchcode instead of using the components (such as street number, street name, etc.).

Often, users have had success when combining the full address and street splitter. For example, here’s an international version of one of the default matchcodes:

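| Component | Size | Start | Fuzzy | Short/Empty | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|
| General | 10 | Left | No | Both Empty | X | X | X |
| Last Name | 5 | Left | No | Both Empty | X | X | X |
| First name | 3 | Left | No | Both Empty | X | X | X |
| PO Box | 10 | Left | No | No | X | | |
| Street # | 4 | Left | No | Both Empty | | X | |
| Street Name | 4 | Left | No | No | | X | |
| Full Address | 20 | Left | No | No | | | X |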


Matchcode Combinations#

Every matchcode is composed of one or more combinations of components. In a matchcode table, each numbered column represents one combination of components that may detect a match between two records. A match found using any one of the combinations in a matchcode is considered a match. Programmers may think of this as a series of OR conditions: satisfying any one of them is a positive result.

MatchUp allows up to 16 different combinations of components per matchcode.

A good example of combinations would be a matchcode designed to catch last names as well as either street addresses or Post Office Box addresses.

  • Condition #1: ZIP/PC, Last Name, Street Number, Street Name

  • Condition #2: ZIP/PC, Last Name, PO Box

Such a matchcode might look like this:

| Component | Size | 1 | 2 |
|---|---|---|---|
| ZIP/PC | 5 | X | X |
| Last Name | 5 | X | X |
| Street # | 4 | X | |
| Street Name | 4 | X | |
| PO Box | 10 | | X |

Columns 3 through 16 have been omitted for the sake of clarity. The trick to understanding this table is to look at the vertical columns of X’s. For example, column 1 has X’s in ZIP/PC, Last Name, Street #, and Street Name, which corresponds exactly to condition #1. Column 2 has X’s in ZIP/PC, Last Name, and PO Box, matching condition #2.

For a more advanced example:

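| Component | Size | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ZIP/PC | 5 | X | X | X | X |
| Last Name | 5 | X | X | | |
| Company | 10 | | | X | X |
| Street # | 4 | X | | X | |
| Street Name | 4 | X | | X | |
| PO Box | 10 | | X | | X |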

This matchcode may produce matches if any one of the following four conditions returns true:

  • Condition #1: ZIP/PC, Last Name, Street Number, Street Name

  • Condition #2: ZIP/PC, Last Name, PO Box

  • Condition #3: ZIP/PC, Company, Street Number, Street Name

  • Condition #4: ZIP/PC, Company, PO Box

This matchcode could be used on a list containing a mixture of both personal and company names and either street or PO Box addresses.
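The OR-of-ANDs logic of the four conditions can be sketched as follows. The record dictionaries, the component names, and the rule that a component matches only when both values are present and equal are simplifying assumptions; real components also have sizes, fuzzy settings, and short/empty handling that this sketch ignores.

```python
# Each combination lists the components that must all match (AND);
# a match on any one combination is a match overall (OR).
COMBINATIONS = [
    ("zip", "last_name", "street_number", "street_name"),  # condition 1
    ("zip", "last_name", "po_box"),                        # condition 2
    ("zip", "company", "street_number", "street_name"),    # condition 3
    ("zip", "company", "po_box"),                          # condition 4
]


def components_match(a, b, component):
    # Simplification: both values must be present and equal.
    return bool(a.get(component)) and a.get(component) == b.get(component)


def matching_combinations(a, b, combinations=COMBINATIONS):
    """Return the 1-based numbers of the combinations satisfied by both records."""
    return [
        number
        for number, combo in enumerate(combinations, start=1)
        if all(components_match(a, b, c) for c in combo)
    ]


rec_a = {"zip": "92688", "last_name": "SMITH", "company": "ACME", "po_box": "12"}
rec_b = {"zip": "92688", "last_name": "SMYTH", "company": "ACME", "po_box": "12"}
hits = matching_combinations(rec_a, rec_b)
is_match = bool(hits)
```

In this sample the last names differ, so conditions 1 and 2 fail, but the records still match through condition 4 (ZIP/PC, Company, PO Box).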

First Component Restrictions#

MatchUp now has two deduping engines. The object determines which engine is necessary or more efficient and selects it. In some cases, processing is faster when the traditional ReadWrite engine is selected, which depends on the properties of the first component of the matchcode. The first component must meet these restrictions:

  1. It must appear in every combination.

  2. It cannot use the following types of Fuzzy matching: Containment; Frequency; Fast Near; Frequency Near; Accurate Near. All others are allowed.

  3. It cannot use Initial Only matching.

  4. It cannot use One Blank Field matching.

  5. It cannot use Swap Matching.

In other situations, a matchcode may have combinations with no component common to all of them. An example would be:

  • Condition #1: ZIP/PC, Street Number, Street Name

  • Condition #2: ZIP/PC, PO Box

  • Condition #3: Proximity

In this case, MatchUp determines that it needs to use its Intersecting Logic engine, which is required because no component is common to every combination. Speed benchmarks may be surprisingly similar between the two engines, but the Intersecting Logic engine may in fact return more duplicates.
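The "appears in every combination" test can be illustrated with a set intersection. This only checks for a shared component in a hypothetical list-of-tuples representation of a matchcode; it does not model how MatchUp actually chooses between its engines.

```python
def common_components(combinations):
    """Return the components that appear in every combination."""
    sets = [set(combo) for combo in combinations]
    return set.intersection(*sets) if sets else set()


with_common = [
    ("zip", "last_name", "street_number", "street_name"),
    ("zip", "last_name", "po_box"),
]
without_common = [
    ("zip", "street_number", "street_name"),
    ("zip", "po_box"),
    ("proximity",),
]

shared = common_components(with_common)          # components usable as a first component
none_shared = common_components(without_common)  # empty: no shared component
```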

Data Coverage by Country#

Others#

Release Schedule#

Release Date Schedule

2024-04-22

2024-07-22

2024-10-21

2025-01-27

2025-04-21