Address the poor performance of the existing unique-name generation (#17944)

* Address the poor performance of the existing unique-name generation

As described in Issue 16849, the existing Tools::getUniqueName method requires calling code to form a vector of existing names to be avoided. This leads to poor performance both in the O(n) cost of building such a vector and in getUniqueName's O(n) algorithm for actually generating the unique name (where 'n' is the number of pre-existing names).

This cost is particularly noticeable in documents with large numbers of DocumentObjects, because generating both a Name and a Label for each new object incurs it. During an operation such as importing, this results in O(n^2) time spent generating names.

The other major cost is in saving the temporary backup file, which uses name generation for the "files" embedded in the Zip file. Documents can easily need several such "files" for each object in the document.

This update includes the following changes:

Create UniqueNameManager to keep a list of existing names organized in a manner that eases unique-name generation. This class essentially acts as a set of names, with the ability to add and remove names and check whether a name is already there, plus the ability to take a prototype name and generate a unique form of it which is not already in the set.

Eliminate Tools::getUniqueName.

Make DocumentObject naming use the new UniqueNameManager class.

Make DocumentObject Label naming use the new UniqueNameManager class. Labels are not always unique; unique labels are generated only if the settings at the time request it (and other conditions are met). Because of this, Label management additionally requires keeping a map of counts for labels which already exist more than once. These collections are maintained via notifications of value changes on the Label properties of the objects in the document.

Add Document::containsObject(DocumentObject*) for a definitive test of an object being in a Document. This is needed because DocumentObjects can be in a sort of limbo (e.g. when they are in the Undo/Redo lists) where they have a parent linkage to the Document but should not participate in Label collision checks.

Rename Document.getStandardObjectName to getStandardObjectLabel to better represent what it does.

Use the new UniqueNameManager for Writer internal filenames within the zip file.

Eliminate the unneeded Reader::FileNames collection. The file names already exist in the FileList collection elements. The only existing use for the FileNames collection was to determine whether there were any files at all, and since FileList and FileNames were parallel vectors of the same length, FileList can be used for this test.

Use UniqueNameManager for document names and labels. This uses ad hoc UniqueNameManager objects created on the spot, on the assumption that document creation is relatively rare and there are few documents, so although the cost is O(n), n itself is small.

Use an ad hoc UniqueNameManager to name new DynamicProperty entries. This is only done if a property of the proposed name already exists, since such a check is more-or-less O(log(n)), almost never finds a collision, and avoids the O(n) building of the UniqueNameManager. If there is a collision, an ad hoc UniqueNameManager is built and discarded after use.

The property management classes have a bit of a mess of methods, including several to populate various collection types with all existing properties. Rather than introducing yet another such collection-specific method to fill a UniqueNameManager, a visitProperties method was added which calls a passed function for each property. The existing code would be simpler if the existing fill-container methods all used this.

Ideally the PropertyContainer class would keep a central directory of all properties ("static", Dynamic, and those exposed by ExtensionContainer and other derivations) and a permanent UniqueNameManager. However, the Property management is a bit of a mess, making such a change a project unto itself.

The unit tests for Tools::getUniqueName have been changed to test UniqueNameManager.makeUniqueName instead. This revealed a small regression: passing a prototype name like "xyz1234" to the old code would yield "xyz1235" whether or not "xyz1234" already existed, while the new code will return the next name above the currently-highest name on the "xyz" model, which could be "xyz" or "xyz1".

* Correct wrong case on include path

* Implement suggested code changes

Also change the semantics of visitProperties to not have any short-circuit return

* Remove reference through undefined iterator

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix up some comments for Doxygen

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
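The behaviour change described in the commit message (a prototype such as "xyz1234" yielding "xyz" or "xyz1" rather than "xyz1235") can be illustrated with a much-reduced model of a name registry. The sketch below is not the committed UniqueNameManager: the class name SimpleNameRegistry and its methods are hypothetical, and it assumes names are split only into a base plus trailing digits, ignoring the zero-padding and name-suffix handling that the real class performs.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for the UniqueNameManager described above.
// Assumption: "xyz" and "xyz0" are treated as the same entry, which the real
// class avoids by also tracking the number of digits.
class SimpleNameRegistry
{
public:
    void add(const std::string& name)
    {
        const auto parts = split(name);
        used[parts.first].insert(parts.second);
    }

    bool contains(const std::string& name) const
    {
        const auto parts = split(name);
        const auto entry = used.find(parts.first);
        return entry != used.end() && entry->second.count(parts.second) != 0;
    }

    // Returns the bare base name if that base was never registered; otherwise
    // the base followed by (highest registered number + 1).
    std::string makeUnique(const std::string& model) const
    {
        const std::string base = split(model).first;
        const auto entry = used.find(base);
        if (entry == used.end()) {
            return base;
        }
        assert(!entry->second.empty());  // add() never leaves an empty set behind
        return base + std::to_string(*entry->second.rbegin() + 1);
    }

private:
    // Splits "xyz1234" into {"xyz", 1234}; a name with no trailing digits gets 0.
    static std::pair<std::string, unsigned long> split(const std::string& name)
    {
        std::size_t end = name.size();
        while (end > 0 && std::isdigit(static_cast<unsigned char>(name[end - 1])) != 0) {
            --end;
        }
        const std::string digits = name.substr(end);
        return {name.substr(0, end), digits.empty() ? 0UL : std::stoul(digits)};
    }

    // Base name -> numbers already in use; each lookup is a tree search rather
    // than a scan over every existing name.
    std::map<std::string, std::set<unsigned long>> used;
};
```

With an empty registry, makeUnique("xyz1234") returns "xyz"; once "xyz" has been added, the same call returns "xyz1". The digits of the prototype itself play no part in the result, which is the regression noted in the message.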
2024-12-13 11:54:46 -05:00
parent b1f93bc51e
commit 83202d8ad6
28 changed files with 721 additions and 435 deletions
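The diff that follows stores the numeric parts of registered names in a PiecewiseSparseIntegerSet, which keeps runs of consecutive integers as single spans. As a reading aid, here is a minimal sketch of that idea under different assumptions from the committed code: it uses a std::map from span start to span length instead of a std::set of pairs with a custom comparer, the class name SpanSet and its method names are hypothetical, and next() is only a guess at what "next free value" could mean (one past the highest stored value).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical sketch: a set of unsigned integers stored as disjoint,
// non-adjacent runs. Key = first value of a run, mapped value = run length.
// Assumption: values near the maximum of unsigned are not used, so
// value + 1 does not overflow.
class SpanSet
{
public:
    bool contains(unsigned value) const
    {
        const auto after = spans.upper_bound(value);  // first run starting above value
        if (after == spans.begin()) {
            return false;  // no run starts at or below value
        }
        const auto run = std::prev(after);
        return value - run->first < run->second;
    }

    void add(unsigned value)
    {
        if (contains(value)) {
            return;
        }
        unsigned start = value;
        unsigned length = 1;
        const auto above = spans.upper_bound(value);
        // Absorb the run that ends immediately below value, if there is one.
        if (above != spans.begin()) {
            const auto below = std::prev(above);
            if (below->first + below->second == value) {
                start = below->first;
                length += below->second;
                spans.erase(below);  // map erase leaves 'above' valid
            }
        }
        // Absorb the run that starts immediately above value, if there is one.
        if (above != spans.end() && above->first == value + 1) {
            length += above->second;
            spans.erase(above);
        }
        spans[start] = length;
    }

    // One past the highest stored value, or 0 for an empty set.
    unsigned next() const
    {
        if (spans.empty()) {
            return 0;
        }
        const auto last = std::prev(spans.end());
        return last->first + last->second;
    }

    std::size_t runCount() const
    {
        return spans.size();
    }

private:
    std::map<unsigned, unsigned> spans;
};
```

Adding 1 and 3 leaves two runs; adding 2 then fills the one-value gap and merges everything into a single run covering 1 through 3. Storage therefore grows with the number of gaps rather than the number of names, which suits the common case of densely numbered objects.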
--- a/src/Base/Tools.cpp
+++ b/src/Base/Tools.cpp
@@ -33,130 +33,214 @@
 #include "Interpreter.h"
 #include "Tools.h"

-namespace Base
+void Base::UniqueNameManager::PiecewiseSparseIntegerSet::Add(uint value)
 {
-struct string_comp
+    etype newSpan(value, 1);
+    iterator above = Spans.lower_bound(newSpan);
+    if (above != Spans.end() && above->first <= value) {
+        // The found span includes value so there is nothing to do as it is already in the set.
+        return;
+    }
+
+    // Set below to the next span down, if any
+    iterator below;
+    if (above == Spans.begin()) {
+        below = Spans.end();
+    }
+    else {
+        below = above;
+        --below;
+    }
+
+    if (above != Spans.end() && below != Spans.end()
+        && above->first - below->first - 1 == below->second) {
+        // below and above have a gap of exactly one between them, and this must be value
+        // so we coalesce the two spans (and the gap) into one.
+        newSpan = etype(below->first, below->second + above->second + 1);
+        Spans.erase(above);
+        above = Spans.erase(below);
+    }
+    else if (below != Spans.end() && value - below->first == below->second) {
+        // value is adjacent to the end of below, so just expand below by one
+        newSpan = etype(below->first, below->second + 1);
+        above = Spans.erase(below);
+    }
+    else if (above != Spans.end() && above->first - value == 1) {
+        // value is adjacent to the start of above, so just expand above down by one
+        newSpan = etype(above->first - 1, above->second + 1);
+        above = Spans.erase(above);
+    }
+    // else value is not adjacent to any existing span, so just make a new span for it
+    Spans.insert(above, newSpan);
+}
+void Base::UniqueNameManager::PiecewiseSparseIntegerSet::Remove(uint value)
 {
-    // s1 and s2 must be numbers represented as string
-    bool operator()(const std::string& s1, const std::string& s2)
-    {
-        if (s1.size() < s2.size()) {
-            return true;
-        }
-        if (s1.size() > s2.size()) {
-            return false;
-        }
-
-        return s1 < s2;
+    etype newSpan(value, 1);
+    iterator at = Spans.lower_bound(newSpan);
+    if (at == Spans.end() || at->first > value) {
+        // The found span does not include value so there is nothing to do, as it is already not in
+        // the set.
+        return;
    }
-    static std::string increment(const std::string& s)
-    {
-        std::string n = s;
-        int addcarry = 1;
-        for (std::string::reverse_iterator it = n.rbegin(); it != n.rend(); ++it) {
-            if (addcarry == 0) {
-                break;
-            }
-            int d = *it - 48;
-            d = d + addcarry;
-            *it = ((d % 10) + 48);
-            addcarry = d / 10;
-        }
-        if (addcarry > 0) {
-            std::string b;
-            b.resize(1);
-            b[0] = addcarry + 48;
-            n = b + n;
-        }
-
-        return n;
+    if (at->second == 1) {
+        // value is the only element in this span, just remove the span
+        Spans.erase(at);
    }
-};
-
-class unique_name
+    else if (at->first == value) {
+        // value is the first in this span, trim the lower end
+        etype replacement(at->first + 1, at->second - 1);
+        Spans.insert(Spans.erase(at), replacement);
+    }
+    else if (value - at->first == at->second - 1) {
+        // value is the last in this span, trim the upper end
+        etype replacement(at->first, at->second - 1);
+        Spans.insert(Spans.erase(at), replacement);
+    }
+    else {
+        // value is in the middle of the span, so we must split it.
+        etype firstReplacement(at->first, value - at->first);
+        etype secondReplacement(value + 1, at->second - ((value + 1) - at->first));
+        // Because erase returns the iterator after the erased element, and insert returns the
+        // iterator for the inserted item, we want to insert secondReplacement first.
+        Spans.insert(Spans.insert(Spans.erase(at), secondReplacement), firstReplacement);
+    }
+}
+bool Base::UniqueNameManager::PiecewiseSparseIntegerSet::Contains(uint value) const
 {
-public:
-    unique_name(std::string name, const std::vector<std::string>& names, int padding)
-        : base_name {std::move(name)}
-        , padding {padding}
-    {
-        removeDigitsFromEnd();
-        findHighestSuffix(names);
-    }
-
-    std::string get() const
-    {
-        return appendSuffix();
-    }
-
-private:
-    void removeDigitsFromEnd()
-    {
-        std::string::size_type pos = base_name.find_last_not_of("0123456789");
-        if (pos != std::string::npos && (pos + 1) < base_name.size()) {
-            num_suffix = base_name.substr(pos + 1);
-            base_name.erase(pos + 1);
-        }
-    }
-
-    void findHighestSuffix(const std::vector<std::string>& names)
-    {
-        for (const auto& name : names) {
-            if (name.substr(0, base_name.length()) == base_name) {  // same prefix
-                std::string suffix(name.substr(base_name.length()));
-                if (!suffix.empty()) {
-                    std::string::size_type pos = suffix.find_first_not_of("0123456789");
-                    if (pos == std::string::npos) {
-                        num_suffix = std::max<std::string>(num_suffix, suffix, Base::string_comp());
-                    }
-                }
-            }
-        }
-    }
-
-    std::string appendSuffix() const
-    {
-        std::stringstream str;
-        str << base_name;
-        if (padding > 0) {
-            str.fill('0');
-            str.width(padding);
-        }
-        str << Base::string_comp::increment(num_suffix);
-        return str.str();
-    }
-
-private:
-    std::string num_suffix;
-    std::string base_name;
-    int padding;
-};
-
-}  // namespace Base
-
-std::string
-Base::Tools::getUniqueName(const std::string& name, const std::vector<std::string>& names, int pad)
-{
-    if (names.empty()) {
-        return name;
-    }
-
-    Base::unique_name unique(name, names, pad);
-    return unique.get();
+    iterator at = Spans.lower_bound(etype(value, 1));
+    return at != Spans.end() && at->first <= value;
 }

-std::string Base::Tools::addNumber(const std::string& name, unsigned int num, int d)
+std::tuple<uint, uint> Base::UniqueNameManager::decomposeName(const std::string& name,
+                                                              std::string& baseNameOut,
+                                                              std::string& nameSuffixOut) const
 {
-    std::stringstream str;
-    str << name;
-    if (d > 0) {
-        str.fill('0');
-        str.width(d);
+    auto suffixStart = std::make_reverse_iterator(GetNameSuffixStartPosition(name));
+    nameSuffixOut = name.substr(name.crend() - suffixStart);
+    auto digitsStart = std::find_if_not(suffixStart, name.crend(), [](char c) {
+        return std::isdigit(static_cast<unsigned char>(c)) != 0;
+    });
+    baseNameOut = name.substr(0, name.crend() - digitsStart);
+    uint digitCount = digitsStart - suffixStart;
+    if (digitCount == 0) {
+        // No digits in name
+        return std::tuple<uint, uint> {0, 0};
    }
-    str << num;
-    return str.str();
+    else {
+        return std::tuple<uint, uint> {
+            digitCount,
+            std::stoul(name.substr(name.crend() - digitsStart, digitCount))};
+    }
+}
+void Base::UniqueNameManager::addExactName(const std::string& name)
+{
+    std::string baseName;
+    std::string nameSuffix;
+    uint digitCount;
+    uint digitsValue;
+    std::tie(digitCount, digitsValue) = decomposeName(name, baseName, nameSuffix);
+    baseName += nameSuffix;
+    auto baseNameEntry = UniqueSeeds.find(baseName);
+    if (baseNameEntry == UniqueSeeds.end()) {
+        // First use of baseName
+        baseNameEntry =
+            UniqueSeeds.emplace(baseName, std::vector<PiecewiseSparseIntegerSet>()).first;
+    }
+    if (digitCount >= baseNameEntry->second.size()) {
+        // First use of this digitCount
+        baseNameEntry->second.resize(digitCount + 1);
+    }
+    PiecewiseSparseIntegerSet& baseNameAndDigitCountEntry = baseNameEntry->second[digitCount];
+    // Name should not already be there
+    assert(!baseNameAndDigitCountEntry.Contains(digitsValue));
+    baseNameAndDigitCountEntry.Add(digitsValue);
+}
+std::string Base::UniqueNameManager::makeUniqueName(const std::string& modelName,
+                                                    int minDigits) const
+{
+    std::string namePrefix;
+    std::string nameSuffix;
+    decomposeName(modelName, namePrefix, nameSuffix);
+    std::string baseName = namePrefix + nameSuffix;
+    auto baseNameEntry = UniqueSeeds.find(baseName);
+    if (baseNameEntry == UniqueSeeds.end()) {
+        // First use of baseName, just return it with no unique digits
+        return baseName;
+    }
+    // We don't care about the digit count of the suggested name, we always use at least the most
+    // digits ever used before.
+    int digitCount = baseNameEntry->second.size() - 1;
+    uint digitsValue;
+    if (digitCount < minDigits) {
+        // Caller is asking for more digits than we have in any registered name.
+        // We start the longer digit string at 000...0001 even though we might have shorter strings
+        // with larger numeric values.
+        digitCount = minDigits;
+        digitsValue = 1;
+    }
+    else {
+        digitsValue = baseNameEntry->second[digitCount].Next();
+    }
+    std::string digits = std::to_string(digitsValue);
+    if (digitCount > digits.size()) {
+        namePrefix += std::string(digitCount - digits.size(), '0');
+    }
+    return namePrefix + digits + nameSuffix;
 }

+void Base::UniqueNameManager::removeExactName(const std::string& name)
+{
+    std::string baseName;
+    std::string nameSuffix;
+    uint digitCount;
+    uint digitsValue;
+    std::tie(digitCount, digitsValue) = decomposeName(name, baseName, nameSuffix);
+    baseName += nameSuffix;
+    auto baseNameEntry = UniqueSeeds.find(baseName);
+    if (baseNameEntry == UniqueSeeds.end()) {
+        // name must not be registered, so nothing to do.
+        return;
+    }
+    auto& digitValueSets = baseNameEntry->second;
+    if (digitCount >= digitValueSets.size()) {
+        // First use of this digitCount, name must not be registered, so nothing to do.
+        return;
+    }
+    digitValueSets[digitCount].Remove(digitsValue);
+    // an element of digitValueSets may now be newly empty and so may other elements below it
+    // Prune off all such trailing empty entries.
+    auto lastNonemptyEntry =
+        std::find_if(digitValueSets.crbegin(), digitValueSets.crend(), [](auto& it) {
+            return it.Any();
+        });
+    if (lastNonemptyEntry == digitValueSets.crend()) {
+        // All entries are empty, so the entire baseName can be forgotten.
+        UniqueSeeds.erase(baseName);
+    }
+    else {
+        digitValueSets.resize(digitValueSets.crend() - lastNonemptyEntry);
+    }
+}
+
+bool Base::UniqueNameManager::containsName(const std::string& name) const
+{
+    std::string baseName;
+    std::string nameSuffix;
+    uint digitCount;
+    uint digitsValue;
+    std::tie(digitCount, digitsValue) = decomposeName(name, baseName, nameSuffix);
+    baseName += nameSuffix;
+    auto baseNameEntry = UniqueSeeds.find(baseName);
+    if (baseNameEntry == UniqueSeeds.end()) {
+        // base name is not registered
+        return false;
+    }
+    if (digitCount >= baseNameEntry->second.size()) {
+        // First use of this digitCount, name must not be registered, so not in collection
+        return false;
+    }
+    return baseNameEntry->second[digitCount].Contains(digitsValue);
+}
 std::string Base::Tools::getIdentifier(const std::string& name)
 {
    if (name.empty()) {