KSquare Utilities
KKMLL::DuplicateImages Class Reference

Detects duplicate images in a given FeatureVectorList object.

#include <DuplicateImages.h>

Public Member Functions

 DuplicateImages (FeatureVectorListPtr _examples, RunLog &_log)
 Use this instance to search for duplicates in the list of 'examples'.
 
 DuplicateImages (FileDescPtr _fileDesc, RunLog &_log)
 
 ~DuplicateImages ()
 
bool AddExamples (FeatureVectorListPtr examples)
 Adds all the examples in the list; be careful of ownership.
 
DuplicateImagePtr AddSingleExample (FeatureVectorPtr example)
 Add one more FeatureVector to the list.
 
DuplicateImageListPtr DupExamples () const
 
kkint32 DuplicateCount () const
 
kkint32 DuplicateDataCount () const
 
kkint32 DuplicateNameCount () const
 
bool DuplicatesFound () const
 
bool ExampleInDetector (FeatureVectorPtr fv)
 
FeatureVectorListPtr ListOfExamplesToDelete ()
 
void PurgeDuplicates (FeatureVectorListPtr examples, bool allowDupsInSameClass, std::ostream *report)
 Deletes duplicate examples from the FeatureVectorList passed as 'examples'.
 
void ReportDuplicates (std::ostream &o)
 

Detailed Description

Detects duplicate images in a given FeatureVectorList object.

Author
Kurt Kramer

Derives a list of duplicate FeatureVector objects in a given list, using both the image file name and the feature data. Two or more entries are treated as duplicates if they have the same ExampleFileName or the same FeatureData.
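As a rough illustration of the two detection criteria, the following self-contained sketch flags any entry whose file name or feature data repeats an earlier entry. The Example struct and FindDuplicateIndices function are hypothetical stand-ins and not part of KKMLL. std::map replaces the library's own index trees, and full file names are compared here, whereas the AddSingleExample listing looks names up by root name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a FeatureVector: a source file name plus feature data.
struct Example {
  std::string        fileName;
  std::vector<float> features;
};

// Returns the indices of entries that repeat an earlier entry's file name
// or an earlier entry's feature data.
std::vector<std::size_t> FindDuplicateIndices(const std::vector<Example>& examples) {
  std::map<std::string, std::size_t>        byName;  // first index seen for each file name
  std::map<std::vector<float>, std::size_t> byData;  // first index seen for each feature vector
  std::vector<std::size_t> duplicates;

  for (std::size_t i = 0; i < examples.size(); ++i) {
    const Example& e = examples[i];
    // emplace() inserts only when the key is new; '.second' is false when it already existed.
    // Entries without a file name are skipped for the name check.
    bool nameSeen = !e.fileName.empty() && !byName.emplace(e.fileName, i).second;
    bool dataSeen = !byData.emplace(e.features, i).second;
    if (nameSeen || dataSeen)
      duplicates.push_back(i);
  }
  return duplicates;
}
```

Keeping one index per criterion means each new entry costs two logarithmic lookups instead of a comparison against every earlier entry.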

The simplest way to use this class is to create an instance with the FeatureVectorList object that you are concerned with, then call the method DupExamples (), which returns the duplicates found in a structure called DuplicateImageList.

Definition at line 53 of file DuplicateImages.h.

Constructor & Destructor Documentation

DuplicateImages::DuplicateImages ( FeatureVectorListPtr  _examples,
RunLog &  _log 
)

Use this instance to search for duplicates in the list of 'examples'.

You can still call 'AddExamples' and 'AddSingleExample' afterwards.

Definition at line 25 of file DuplicateImages.cpp.

References KKMLL::DuplicateImageList::DuplicateImageList(), KKMLL::FeatureVectorList::FileDesc(), KKMLL::ImageFeaturesDataIndexed::ImageFeaturesDataIndexed(), and KKMLL::ImageFeaturesNameIndexed::ImageFeaturesNameIndexed().

Referenced by KKMLL::FeatureVectorList::RemoveDuplicateEntries().

27  :
28  duplicateCount (0),
29  duplicateDataCount (0),
30  duplicateNameCount (0),
31  dupExamples (new DuplicateImageList (true)),
32  featureDataTree (new ImageFeaturesDataIndexed ()),
33  fileDesc (NULL),
34  log (_log),
35  nameTree (new ImageFeaturesNameIndexed ())
36 
37 {
38  if (!_examples)
39  {
40  log.Level (-1) << endl << endl << "DuplicateImages::DuplicateImages ***ERROR*** '_examples == NULL'" << endl << endl;
41  return;
42  }
43  fileDesc = _examples->FileDesc ();
44  FindDuplicates (_examples);
45 }
DuplicateImages::DuplicateImages ( FileDescPtr  _fileDesc,
RunLog &  _log 
)

Definition at line 50 of file DuplicateImages.cpp.

References KKMLL::DuplicateImageList::DuplicateImageList(), KKMLL::ImageFeaturesDataIndexed::ImageFeaturesDataIndexed(), and KKMLL::ImageFeaturesNameIndexed::ImageFeaturesNameIndexed().

Referenced by KKMLL::TrainingProcess2::CreateTrainingProcessForLevel().

52  :
53  duplicateCount (0),
54  duplicateDataCount (0),
55  duplicateNameCount (0),
56  dupExamples (new DuplicateImageList (true)),
57  featureDataTree (new ImageFeaturesDataIndexed ()),
58  fileDesc (_fileDesc),
59  log (_log),
60  nameTree (new ImageFeaturesNameIndexed ())
61 
62 {
63 }
DuplicateImages::~DuplicateImages ( void  )

Definition at line 68 of file DuplicateImages.cpp.

69 {
70  delete nameTree; nameTree = NULL;
71  delete featureDataTree; featureDataTree = NULL;
72  delete dupExamples; dupExamples = NULL;
73 }

Member Function Documentation

bool DuplicateImages::AddExamples ( FeatureVectorListPtr  examples)

Will add all the examples; be careful of ownership.

Definition at line 90 of file DuplicateImages.cpp.

91 {
92  bool dupsDetected = false;
93  FeatureVectorList::iterator idx;
94  for (idx = examples->begin (); idx != examples->end (); idx++)
95  {
96  DuplicateImagePtr dupExample = AddSingleExample (*idx);
97  if (dupExample)
98  dupsDetected = true;
99  }
100 
101  return dupsDetected;
102 } /* AddExamples */
DuplicateImagePtr DuplicateImages::AddSingleExample ( FeatureVectorPtr  example)

Add one more FeatureVector to the list.

Adds one more example to the list. If it turns out to be a duplicate, returns a pointer to a "DuplicateImage" structure that contains a list of all the images it duplicates. If no duplicate is found, returns a NULL pointer.

Parameters
[in] example  FeatureVector that you want to add to the list.
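The return contract described above (a group pointer for a duplicate, NULL otherwise) can be sketched in isolation as follows. This is a simplified model under stated assumptions: DupGroup and NameDupTracker are hypothetical names, only file names are tracked, and std::map stands in for the library's name tree.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for DuplicateImage: the first example added plus each duplicate.
struct DupGroup {
  std::vector<std::string> members;
};

// Hypothetical tracker that detects duplicates by file name only.
class NameDupTracker {
 public:
  // Returns the group the name belongs to if it was seen before, otherwise nullptr.
  const DupGroup* Add(const std::string& name) {
    auto result = seen_.emplace(name, nullptr);
    if (result.second)
      return nullptr;                      // first occurrence: not a duplicate

    std::unique_ptr<DupGroup>& group = result.first->second;
    if (!group) {
      // Second occurrence: create the group and record the original entry.
      group.reset(new DupGroup());
      group->members.push_back(name);
    }
    group->members.push_back(name);        // record this duplicate
    return group.get();
  }

 private:
  std::map<std::string, std::unique_ptr<DupGroup>> seen_;
};
```

A caller can treat a non-null result as "duplicate found", which is how the AddExamples listing uses the value returned by AddSingleExample.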

Definition at line 111 of file DuplicateImages.cpp.

References KKMLL::DuplicateImage::AddADuplicate(), KKB::KKStr::Concat(), KKMLL::DuplicateImage::DuplicateImage(), KKB::KKStr::Empty(), KKMLL::FeatureVector::ExampleFileName(), KKMLL::ImageFeaturesDataIndexed::GetEqual(), KKMLL::DuplicateImageList::LocateByImage(), and KKMLL::ImageFeaturesDataIndexed::RBInsert().

112 {
113  DuplicateImagePtr dupExample = NULL;
114 
115  FeatureVectorPtr existingNameExample = NULL;
116 
117  const KKStr& imageFileName = example->ExampleFileName ();
118  if (!imageFileName.Empty ())
119  {
120  existingNameExample = nameTree->GetEqual (osGetRootName (example->ExampleFileName ()));
121  if (!existingNameExample)
122  nameTree->RBInsert (example);
123  }
124 
125  FeatureVectorPtr existingDataExample = featureDataTree->GetEqual (example);
126  if (!existingDataExample)
127  featureDataTree->RBInsert (example);
128 
129  if ((existingNameExample) || (existingDataExample))
130  {
131  duplicateCount++;
132  if (existingNameExample)
133  {
134  duplicateNameCount++;
135  dupExample = dupExamples->LocateByImage (existingNameExample);
136  if (!dupExample)
137  {
138  dupExample = new DuplicateImage (fileDesc, existingNameExample, example, log);
139  dupExamples->PushOnBack (dupExample);
140  }
141  else
142  {
143  dupExample->AddADuplicate (example);
144  }
145  }
146 
147  if (existingDataExample)
148  {
149  duplicateDataCount++;
150  if (existingDataExample != existingNameExample)
151  {
152  dupExample = dupExamples->LocateByImage (existingDataExample);
153  if (!dupExample)
154  {
155  dupExample = new DuplicateImage (fileDesc, existingDataExample, example, log);
156  dupExamples->PushOnBack (dupExample);
157  }
158  else
159  {
160  dupExample->AddADuplicate (example);
161  }
162  }
163  }
164  }
165 
166  return dupExample;
167 } /* AddSingleExample */
DuplicateImageListPtr KKMLL::DuplicateImages::DupExamples ( ) const
inline

Definition at line 85 of file DuplicateImages.h.

Referenced by ListOfExamplesToDelete(), and PurgeDuplicates().

85 {return dupExamples;}
kkint32 KKMLL::DuplicateImages::DuplicateCount ( ) const
inline

Definition at line 87 of file DuplicateImages.h.

87 {return duplicateCount;}
kkint32 KKMLL::DuplicateImages::DuplicateDataCount ( ) const
inline

Definition at line 88 of file DuplicateImages.h.

88 {return duplicateDataCount;}
kkint32 KKMLL::DuplicateImages::DuplicateNameCount ( ) const
inline

Definition at line 89 of file DuplicateImages.h.

89 {return duplicateNameCount;}
bool DuplicateImages::DuplicatesFound ( ) const

Definition at line 362 of file DuplicateImages.cpp.

363 {
364  return (dupExamples->QueueSize () > 0);
365 }
bool DuplicateImages::ExampleInDetector ( FeatureVectorPtr  fv)

Definition at line 77 of file DuplicateImages.cpp.

References KKMLL::ImageFeaturesDataIndexed::GetEqual().

78 {
79  if (nameTree->GetEqual (fv->ExampleFileName ()) != NULL)
80  return true;
81 
82  if (featureDataTree->GetEqual (fv) != NULL)
83  return true;
84 
85  return false;
86 } /* ExampleInDetector */
FeatureVectorListPtr DuplicateImages::ListOfExamplesToDelete ( )

Definition at line 276 of file DuplicateImages.cpp.

References KKMLL::DuplicateImage::AllTheSameClass(), DupExamples(), KKMLL::DuplicateImage::DuplicatedImages(), KKMLL::DuplicateImage::ExampleWithSmallestScanLine(), KKMLL::FeatureVectorList::FeatureVectorList(), and KKMLL::FeatureVectorList::PushOnBack().

277 {
278  FeatureVectorListPtr examplesToDelete = new FeatureVectorList (fileDesc, false);
279 
280  log.Level (10) << "DuplicateImages::ListOfExamplesToDelete" << endl;
281 
282  DuplicateImageListPtr dupExamples = DupExamples ();
283 
284  DuplicateImageList::iterator dIDX = dupExamples->begin ();
285 
286  for (dIDX = dupExamples->begin (); dIDX != dupExamples->end (); ++dIDX)
287  {
288  DuplicateImagePtr dupSet = *dIDX;
289 
290  log.Level (20) << "ListOfExamplesToDelete Duplicate Set[" << dupSet->FirstExampleAdded ()->ExampleFileName () << "]" << endl;
291 
292  FeatureVectorListPtr examplesInSet = dupSet->DuplicatedImages ();
293  FeatureVectorPtr exampleToKeep = NULL;
294 
295  if (dupSet->AllTheSameClass ())
296  {
297  exampleToKeep = dupSet->ExampleWithSmallestScanLine ();
298  }
299 
300  FeatureVectorList::iterator iIDX = examplesInSet->begin ();
301 
302  for (iIDX = examplesInSet->begin (); iIDX != examplesInSet->end (); ++iIDX)
303  {
304  FeatureVectorPtr example = *iIDX;
305  if (!example)
306  continue;
307 
308  if (example == exampleToKeep)
309  {
310  log.Level (30) << "ListOfExamplesToDelete Keeping [" << exampleToKeep->ExampleFileName () << "]." << endl;
311  }
312  else
313  {
314  log.Level (30) << "ListOfExamplesToDelete Deleting [" << example->ExampleFileName () << "]." << endl;
315  examplesToDelete->PushOnBack (example);
316  }
317  }
318  }
319 
320  return examplesToDelete;
321 } /* ListOfExamplesToDelete */
void DuplicateImages::PurgeDuplicates ( FeatureVectorListPtr  examples,
bool  allowDupsInSameClass,
std::ostream *  report 
)

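Deletes duplicate examples from the FeatureVectorList passed as 'examples', typically the same list that was provided in the constructor.

If 'report' is not NULL, the examples being purged are listed to it.

If the duplicates in a set are in more than one class then all of them will be deleted. If they are all in a single class then the one with the smallest scan line will be kept while all others will be deleted, unless 'allowDupsInSameClass' is true, in which case the set is left untouched.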
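The keep-or-delete rule for a single duplicate set can be sketched as a small standalone function. Member and MembersToDelete are hypothetical names introduced for illustration; the sketch returns indices rather than deleting anything, and assumes each member carries a class name and a scan line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one example in a duplicate set.
struct Member {
  std::string className;
  int         scanLine;
};

// Returns the indices of the members that the purge rule would delete.
std::vector<std::size_t> MembersToDelete(const std::vector<Member>& dupSet,
                                         bool allowDupsInSameClass) {
  std::vector<std::size_t> toDelete;
  if (dupSet.empty())
    return toDelete;

  bool        allSameClass = true;
  std::size_t keep = 0;  // index of the member with the smallest scan line
  for (std::size_t i = 1; i < dupSet.size(); ++i) {
    if (dupSet[i].className != dupSet[0].className)
      allSameClass = false;
    if (dupSet[i].scanLine < dupSet[keep].scanLine)
      keep = i;
  }

  // Duplicates within one class are tolerated: leave the whole set alone.
  if (allSameClass && allowDupsInSameClass)
    return toDelete;

  for (std::size_t i = 0; i < dupSet.size(); ++i) {
    // In a single-class set the member with the smallest scan line survives;
    // in a mixed-class set nothing survives.
    if (allSameClass && i == keep)
      continue;
    toDelete.push_back(i);
  }
  return toDelete;
}
```

Deleting every member of a mixed-class set reflects the rule stated above: when duplicates carry conflicting labels, none of them is kept.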

Definition at line 194 of file DuplicateImages.cpp.

References KKMLL::DuplicateImage::AllTheSameClass(), DupExamples(), KKMLL::DuplicateImage::DuplicatedImages(), and KKMLL::DuplicateImage::ExampleWithSmallestScanLine().

Referenced by KKMLL::TrainingProcess2::CreateTrainingProcessForLevel(), and KKMLL::FeatureVectorList::RemoveDuplicateEntries().

198 {
199  log.Level (10) << "DuplicateImageList::PurgeDuplicates" << endl;
200 
201 
202  // To make sure that we do not delete the same example twice I added 'deletedDictionary' below.
203  // It will track all examples by address that have been deleted. I did this because a bug in
204  // the duplicate detector routine had the same example added to two different groups of duplicates.
205  map<FeatureVectorPtr,KKStr> deletedDictionary; // List of examples already deleted.
206  map<FeatureVectorPtr,KKStr>::iterator deletedDictionaryIdx;
207 
208  DuplicateImageListPtr dupExamples = DupExamples ();
209 
210  kkint32 dupSetCount = 0;
211  DuplicateImageList::iterator dIDX = dupExamples->begin ();
212 
213  for (dIDX = dupExamples->begin (); dIDX != dupExamples->end (); ++dIDX, ++dupSetCount)
214  {
215  DuplicateImagePtr dupSet = *dIDX;
216 
217  log.Level (20) << "PurgeDuplicates Duplicate Set[" << dupSet->FirstExampleAdded ()->ExampleFileName () << "]" << endl;
218 
219  FeatureVectorListPtr examplesInSet = dupSet->DuplicatedImages ();
220  FeatureVectorPtr exampleToKeep = NULL;
221 
222  if (dupSet->AllTheSameClass ())
223  {
224  if (allowDupsInSameClass)
225  continue;
226  else
227  exampleToKeep = dupSet->ExampleWithSmallestScanLine ();
228  }
229 
230  FeatureVectorList::iterator iIDX = examplesInSet->begin ();
231 
232  for (iIDX = examplesInSet->begin (); iIDX != examplesInSet->end (); ++iIDX)
233  {
234  FeatureVectorPtr example = *iIDX;
235  if (!example)
236  continue;
237 
238  if (example == exampleToKeep)
239  {
240  log.Level (30) << "PurgeDuplicates Keeping [" << exampleToKeep->ExampleFileName () << "]." << endl;
241  if (report)
242  *report << example->ExampleFileName () << "\t" << "Class" << "\t" << example->MLClassName () << "\t" << "Duplicate retained." << endl;
243  }
244  else
245  {
246  bool alreadyDeleted = false;
247  deletedDictionaryIdx = deletedDictionary.find (example);
248  if (deletedDictionaryIdx != deletedDictionary.end ())
249  {
250  // AHA We are getting ready to delete an entry we have already deleted ????
251  KKStr errMsg (1024);
252  errMsg << "Example: " << deletedDictionaryIdx->second << " Already Been Deleted.";
253  log.Level (-1) << endl << "DuplicateImages::PurgeDuplicates ***ERROR*** " << errMsg << endl <<endl;
254  alreadyDeleted = true;
255  }
256 
257  if (!alreadyDeleted)
258  {
259  deletedDictionary.insert (pair<FeatureVectorPtr,KKStr> (example, example->ExampleFileName ()));
260 
261  log.Level (30) << "PurgeDuplicates Deleting [" << example->ExampleFileName () << "]." << endl;
262  if (report)
263  *report << example->ExampleFileName () << "\t" << "Class" << "\t" << example->MLClassName () << "\t" << "Duplicate deleted." << endl;
264  examples->DeleteEntry (example);
265  if (examples->Owner ())
266  delete example;
267  }
268  }
269  }
270  }
271 } /* PurgeDuplicates */
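The listing above guards against deleting the same example twice by recording deleted addresses in 'deletedDictionary'. A minimal sketch of the same idea follows, with a hypothetical Item type and DeleteEachOnce function. Unlike the listing, it first collects the distinct pointers and deletes them afterwards, so that no pointer is compared after its object has been freed.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical heap-allocated entry that may appear in more than one group.
struct Item {
  int id;
};

// Deletes every distinct non-null pointer exactly once.
// Returns how many repeated occurrences were skipped.
std::size_t DeleteEachOnce(const std::vector<std::vector<Item*>>& groups) {
  std::set<Item*> distinct;   // addresses scheduled for deletion
  std::size_t     repeats = 0;

  for (const std::vector<Item*>& group : groups) {
    for (Item* item : group) {
      if (!item)
        continue;
      // insert() reports false in '.second' when the address is already present.
      if (!distinct.insert(item).second)
        ++repeats;
    }
  }

  for (Item* item : distinct)
    delete item;

  return repeats;
}
```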
void DuplicateImages::ReportDuplicates ( std::ostream &  o)

Definition at line 326 of file DuplicateImages.cpp.

327 {
328  o << "Number of Duplicate Groups [" << dupExamples->QueueSize () << "]" << endl;
329  kkint32 groupNum = 0;
330 
331  //for (DuplicateImageList::iterator idx = dupExamples->begin (); idx != dupExamples->end (); idx++)
332  for (auto dupExampleSet: *dupExamples)
333  {
334  const FeatureVectorListPtr dupList = dupExampleSet->DuplicatedImages ();
335 
336  o << "Group[" << groupNum << "] Contains [" << dupList->QueueSize () << "] Duplicates." << endl;
337 
338  kkint32 numOnLine = 0;
339  //FeatureVectorList::const_iterator fvIDX;
340  for (auto fvIDX: *dupList) // = dupList->begin (); fvIDX != dupList->end (); ++fvIDX)
341  {
342  if (numOnLine > 8)
343  {
344  o << endl;
345  numOnLine = 0;
346  }
347 
348  if (numOnLine > 0)
349  o << "\t";
350  o << fvIDX->ExampleFileName () << "[" << fvIDX->MLClassName () << "]";
351 
352  numOnLine++;
353  }
354  o << endl << endl;
355 
356  groupNum++;
357  }
358 } /* ReportDuplicates */

The documentation for this class was generated from the following files: