The duplicate image detection tool

What is Matchbox?

Help! I have a million images and I’m sure there are duplicates, but which are they?

Identifying duplicates manually is time-consuming and error-prone. You need a tool to help you: Matchbox.

Matchbox is an open source tool which:

  • provides decision-making support for duplicate image detection in or across collections
  • identifies duplicate content even where the files differ (in format, size, rotation, cropping, colour enhancement, etc.) or have been scanned from different original copies of the same publication
  • applies state-of-the-art image processing
  • works where OCR will not, for example on images of handwriting or music scores
  • is useful for assembling collections from multiple sources and for identifying missing files

What can Matchbox do for me?

Matchbox brings the following benefits:

  • Automated quality assurance
  • Reduced manual effort and error
  • Saved time
  • Lower costs, e.g. storage, effort
  • Open source, standalone tool; also available as a Taverna component for easy invocation
  • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
  • May be applied to a wide range of image collections, not just print images


There are numerous situations in which you may need to identify duplicate images in collections, for example:

  • to ensure that a page or book has not been digitized twice
  • to discover whether a master and service set of digitized images represent the same set of originals
  • to confirm that all scans have gone through post-scan image processing