Load the required packages.
Read the RLdata500 data (taken from the RecordLinkage package).
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id |
|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 7 | 22 | 1 | 34 |
| GERD | | BAUER | | 1968 | 7 | 27 | 2 | 51 |
| ROBERT | | HARTMANN | | 1930 | 4 | 30 | 3 | 115 |
| STEFAN | | WOLFF | | 1957 | 9 | 2 | 4 | 189 |
| RALF | | KRUEGER | | 1966 | 1 | 13 | 5 | 72 |
| JUERGEN | | FRANKE | | 1929 | 7 | 4 | 6 | 142 |
This dataset contains 500 rows with 450 entities.
Now we create a new column that concatenates the information in each row.
RLdata500[, id_count := .N, by = ent_id] ## how many times a given entity occurs
RLdata500[, bm := sprintf("%02d", bm)] ## add leading zeros to month
RLdata500[, bd := sprintf("%02d", bd)] ## add leading zeros to day
RLdata500[, txt := tolower(paste0(fname_c1, fname_c2, lname_c1, lname_c2, by, bm, bd))]
head(RLdata500)
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt |
|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 |
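The txt column is the only input passed to the blocking step. The printed object later reports the representation as "shingles", and the column names ("86", "ap", "av", ...) suggest overlapping two-character tokens. As a rough illustration of that idea, the base R sketch below splits a string into 2-character shingles and compares two strings by the Jaccard similarity of their shingle sets. The helper name char_shingles and the second string are made up for illustration, and the package's own tokeniser and distance may well differ.

```r
# Hypothetical helper: split one string into overlapping n-character shingles.
# Illustration only; this is not the tokeniser used inside the blocking package.
char_shingles <- function(x, n = 2) {
  stopifnot(is.character(x), length(x) == 1, nchar(x) >= n)
  starts <- seq_len(nchar(x) - n + 1)
  unique(substring(x, starts, starts + n - 1))
}

s1 <- char_shingles("gerdbauer19680727")  # txt value of rec_id 2
s2 <- char_shingles("gerdbauer19680722")  # made-up near-duplicate (last digit changed)

# Jaccard similarity of the two shingle sets: shared shingles / all shingles
jaccard <- length(intersect(s1, s2)) / length(union(s1, s2))
jaccard
#> [1] 0.875
```

Strings that differ by a single character still share most of their shingles, which is why near-duplicate records tend to end up close to each other in the nearest-neighbour search.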
In the next step we use the newly created column in the
blocking function. If we specify verbose, we get
information about the progress.
df_blocks <- blocking(x = RLdata500$txt, ann = "nnd", verbose = 1, graph = TRUE, seed = 2024)
#> ===== creating tokens =====
#> ===== starting search (nnd, x, y: 500, 500, t: 429) =====
#> ===== creating graph =====
Results are as follows: using rnndescent (the nnd method), we have created 133 blocks.
df_blocks
#> ========================================================
#> Blocking based on the nnd method.
#> Number of blocks: 133.
#> Number of columns used for blocking: 429.
#> Reduction ratio: 0.9917.
#> ========================================================
#> Distribution of the size of the blocks:
#>  2  3  4  5  6  7  8  9 10 11 17 
#> 47 34 18 12  8  5  3  3  1  1  1
Structure of the object is as follows:
- result – a data.table with identifiers and block IDs,
- method – the method used,
- deduplication – whether deduplication was applied,
- representation – whether shingles or vectors were used,
- metrics – standard metrics and metrics based on the igraph::compare methods for comparing graphs (here NULL),
- confusion – confusion matrix (here NULL),
- colnames – column names used for the comparison,
- graph – an igraph object, mainly for visualisation.

str(df_blocks,1)
#> List of 8
#>  $ result        :Classes 'data.table' and 'data.frame': 367 obs. of  4 variables:
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ method        : chr "nnd"
#>  $ deduplication : logi TRUE
#>  $ representation: chr "shingles"
#>  $ metrics       : NULL
#>  $ confusion     : NULL
#>  $ colnames      : chr [1:429] "86" "ap" "av" "bf" ...
#>  $ graph         :Class 'igraph'  hidden list of 10
#>  - attr(*, "class")= chr "blocking"
Plot connections.
The resulting data.table has four columns:
- x – reference dataset (i.e. RLdata500) – this may not contain all units of RLdata500,
- y – query (each row of RLdata500) – this may not contain all units of RLdata500,
- block – the block ID,
- dist – distance between objects.

| x | y | block | dist |
|---|---|---|---|
| 1 | 64 | 33 | 0.4737987 | 
| 2 | 43 | 1 | 0.0807453 | 
| 2 | 486 | 1 | 0.4102322 | 
| 3 | 450 | 88 | 0.4326335 | 
| 4 | 234 | 12 | 0.5256584 | 
| 5 | 128 | 2 | 0.5133357 | 
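As a side check on the printed summary, the reduction ratio of 0.9917 can be reproduced by hand from the block-size distribution. The sketch below assumes the usual definition, one minus the share of record pairs that remain to be compared within blocks out of all possible pairs; the package may compute it differently internally, but the figure agrees to four decimals.

```r
# Block-size distribution copied from the printed summary of df_blocks
block_size <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 17)
n_blocks   <- c(47, 34, 18, 12, 8, 5, 3, 3, 1, 1, 1)

n_units      <- sum(block_size * n_blocks)             # 500 records in total
pairs_within <- sum(n_blocks * choose(block_size, 2))  # pairs left to compare after blocking
pairs_all    <- choose(n_units, 2)                     # pairs without any blocking

# Assumed definition: 1 - (pairs within blocks) / (all pairs)
reduction_ratio <- 1 - pairs_within / pairs_all
round(reduction_ratio, 4)
#> [1] 0.9917
```

In other words, blocking leaves 1,030 candidate pairs out of the 124,750 pairs that a full comparison of 500 records would require.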
Create a long data.table with information on blocks and
units from the original dataset.
df_block_melted <- melt(df_blocks$result, id.vars = c("block", "dist"))
df_block_melted_rec_block <- unique(df_block_melted[, .(rec_id=value, block)])
head(df_block_melted_rec_block)
| rec_id | block |
|---|---|
| 1 | 33 | 
| 2 | 1 | 
| 3 | 88 | 
| 4 | 12 | 
| 5 | 2 | 
| 6 | 33 | 
We add block information to the final dataset.
| fname_c1 | fname_c2 | lname_c1 | lname_c2 | by | bm | bd | rec_id | ent_id | id_count | txt | block_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CARSTEN | | MEIER | | 1949 | 07 | 22 | 1 | 34 | 1 | carstenmeier19490722 | 33 |
| GERD | | BAUER | | 1968 | 07 | 27 | 2 | 51 | 2 | gerdbauer19680727 | 1 |
| ROBERT | | HARTMANN | | 1930 | 04 | 30 | 3 | 115 | 1 | roberthartmann19300430 | 88 |
| STEFAN | | WOLFF | | 1957 | 09 | 02 | 4 | 189 | 1 | stefanwolff19570902 | 12 |
| RALF | | KRUEGER | | 1966 | 01 | 13 | 5 | 72 | 1 | ralfkrueger19660113 | 2 |
| JUERGEN | | FRANKE | | 1929 | 07 | 04 | 6 | 142 | 1 | juergenfranke19290704 | 33 |
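The code for this step is not shown here. One way to attach the block IDs, assuming data.table's update-join syntax, is sketched below on toy data. The tables records and rec_block are made-up stand-ins for RLdata500 and df_block_melted_rec_block, and the values are invented.

```r
library(data.table)

# Toy stand-ins (values made up): one row per record, and one row per record/block pair
records   <- data.table(rec_id = 1:4, txt = c("aa", "ab", "zz", "zy"))
rec_block <- data.table(rec_id = 1:4, block = c(1L, 1L, 2L, 2L))

# Update join: look up each rec_id in rec_block and add block_id by reference.
# The i. prefix refers to the column coming from rec_block.
records[rec_block, on = "rec_id", block_id := i.block]
records
```

With the real objects, the same pattern would presumably read RLdata500[df_block_melted_rec_block, on = "rec_id", block_id := i.block]. Records that were never assigned to a block would be left with NA in block_id.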
We can check how many distinct blocks each entity
(ent_id) appears in. In our example, every entity falls within a single
block, so all records of the same entity share a block.
| uniq_blocks | N | 
|---|---|
| 1 | 450 | 
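A check of this kind can be sketched with data.table as follows. The toy table and its values are made up for illustration; the column names follow the ones used in this document.

```r
library(data.table)

# Toy data (made up): entity 10 has two records, both placed in block 1
toy <- data.table(
  rec_id   = 1:5,
  ent_id   = c(10, 10, 20, 30, 40),
  block_id = c(1, 1, 1, 2, 2)
)

# Step 1: number of distinct blocks per entity.
# Step 2: tabulate how many entities have each count.
check <- toy[, .(uniq_blocks = uniqueN(block_id)), by = ent_id][, .N, by = uniq_blocks]
check
```

Here all four toy entities sit in exactly one block, so the result has a single row with uniq_blocks equal to 1 and N equal to 4. A row with uniq_blocks greater than 1 would indicate entities whose records were split across blocks.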
We can visualise the distances between units stored in the
df_blocks$result data set. Clearly we have a mixture of two
groups: matches (close to 0) and non-matches (close to 1).
hist(df_blocks$result$dist, xlab = "Distances", ylab = "Frequency", breaks = "fd",
     main = "Distances calculated between units")
Finally, we can visualise the results according to whether a block contains matches or not.
df_for_density <- copy(df_block_melted[block %in% RLdata500$block_id])
df_for_density[, match:= block %in% RLdata500[id_count == 2]$block_id]
plot(density(df_for_density[match==FALSE]$dist), col = "blue", xlim = c(0, 0.8), 
     main = "Distribution of distances by\ncluster type (match=red, non-match=blue)")
lines(density(df_for_density[match==TRUE]$dist), col = "red")