Matchbox

The duplicate image detection tool

What is Matchbox?

Help! I have a million images and I’m sure there are duplicates, but which are they?

Identifying duplicates manually is a very time-consuming and error-prone process. You need a tool to help you: Matchbox.

Matchbox is an open source tool which:

  • provides decision-making support for duplicate image detection in or across collections
  • identifies duplicate content, even where the files differ (in format, size, rotation, cropping, colour enhancement etc.) or have been scanned from different original copies of the same publication
  • applies state-of-the-art image processing
  • works where OCR will not, for example images of handwriting or music scores
  • is useful in assembling collections from multiple sources, and identifying missing files
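
To illustrate why duplicates must be found by content rather than by comparing files, here is a minimal sketch. It is not Matchbox's algorithm: it uses a simple "average hash" as a stand-in technique, on tiny hand-made greyscale grids instead of real scans, and every name in it (average_hash, hamming_distance, checksum, the sample images) is hypothetical. It only tolerates a uniform brightness change; it does not handle rotation, cropping or the other variations listed above.

```python
import hashlib


def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def checksum(pixels):
    """File-style comparison: a SHA-256 digest over the raw pixel bytes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


# Hypothetical 4x4 greyscale "page" (values 0-255).
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]

# The same page, scanned brighter (every pixel +30).
brighter = [[min(255, v + 30) for v in row] for row in original]

# A different page with a different layout.
different = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]

# Byte-level comparison treats the brighter scan as a different file.
same_bytes = checksum(original) == checksum(brighter)

# Content-level comparison: a small distance suggests a duplicate.
dup_distance = hamming_distance(average_hash(original), average_hash(brighter))
other_distance = hamming_distance(average_hash(original), average_hash(different))

print(same_bytes, dup_distance, other_distance)
```

In this toy case the checksums differ although the content is the same, while the content-based distance is zero for the brighter scan and large for the unrelated page. A real tool has to cope with far more variation than this sketch does.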

What Can Matchbox Do For Me?

Matchbox brings the following benefits:

  • Automated quality assurance
  • Reduced manual effort and error
  • Saved time
  • Lower costs, e.g. for storage and effort
  • Open source, standalone tool, also available as a Taverna component for easy invocation
  • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
  • May be applied to a wide range of image collections, not just print images

Examples

There are numerous situations in which you may need to identify duplicate images in collections, for example:

  • to ensure that a page or book has not been digitized twice
  • to discover whether a master and service set of digitized images represent the same set of originals
  • to confirm that all scans have gone through post-scan image processing

Credits