

Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order agnostic. For dicts, the cost of the lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, so the set implementation needs to optimize for both the found and the not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.

While both data structures are hash based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use randomized probing, whereas sets combine linear probing with randomized probing to improve cache locality. The initial linear probe (default 9 steps in CPython) checks a series of adjacent key/hash pairs, which reduces the cost of hash collision handling: consecutive memory access is cheaper than scattered probes.

issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.

It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.

Sets use a different algorithm that isn't as amenable to retaining insertion order. Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.

A detailed discussion about whether to compactify sets for 3.7, and why it was decided against, can be found in the python-dev mailing lists.

In summary, the main points are: different usage patterns (insertion ordering is useful for dicts, e.g. **kwargs, but less so for sets), space savings for compacting sets are less significant (because there are only key + hash arrays to densify, as opposed to key + hash + value arrays), and the aforementioned linear probing optimization, which sets currently use, is incompatible with a compact implementation.

I will reproduce Raymond's post below, which covers the most important points.

On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:

Unless I've misunderstood, Raymond was opposed to making a similar change to sets.

Here are a few thoughts on the subject.

For the compact dict, the space savings was a net win, with the additional space consumed by the indices and the overallocation for the key/value/hash arrays being more than offset by the improved density of key/value/hash arrays. For sets, the result is less favorable because we still need the indices and overallocation but can only offset the space cost by densifying two of the three arrays. In other words, compacting makes more sense when you have wasted space for keys, values, and hashes.

The use pattern for sets is different from dicts. The latter tends to have fewer missing key lookups. Also, some of the optimizations for the set-to-set operations make it difficult to retain set ordering without impacting performance.

I pursued an alternative path to improve set performance. Instead of compacting (which wasn't much of a space win and incurred the cost of an additional indirection), I added linear probing to reduce the cost of collisions and improve cache performance. Linear probing is incompatible with the compacting approach I advocated for dictionaries.

For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well. The docs already link to a recipe for creating an OrderedSet. Also, now that Eric Snow has given us a fast OrderedDict, it is easier than ever to build an OrderedSet from MutableSet and OrderedDict, but again I haven't observed any real interest because typical set-to-set data analytics don't really need ordering. Likewise, the primary use of fast membership testing is order agnostic.

That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting special cases for orderable data where set-to-set operations can be sped up by comparing entire ranges of keys. IIRC, PyPI already has code for set-like bloom filters.
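To make the linear probing idea above more concrete, here is a minimal toy sketch in Python. It is not CPython's actual code: the class name `ToyProbingSet`, the fixed table size, the `PERTURB_SHIFT` value, and the wrap-around at the end of the table are all assumptions made for illustration. It only shows the general shape: check a short run of adjacent slots first, then jump to a scrambled position.

```python
from itertools import islice

LINEAR_PROBES = 9    # run length quoted in the text for CPython
PERTURB_SHIFT = 5    # assumption for this toy, not taken from the text


class ToyProbingSet:
    """Fixed-size toy hash set (hypothetical; no resizing, no deletion)."""

    def __init__(self, size=64):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self._mask = size - 1
        self._slots = [None] * size      # each slot is None or (hash, key)
        self._used = 0

    def _probe(self, h):
        """Yield slot indices: a short adjacent run, then a scrambled jump."""
        perturb = h & 0xFFFFFFFFFFFFFFFF  # treat the hash as unsigned
        i = perturb & self._mask
        while True:
            # Linear phase: the starting slot plus LINEAR_PROBES neighbours.
            # (This toy wraps around the table end for simplicity.)
            for step in range(LINEAR_PROBES + 1):
                yield (i + step) & self._mask
            # Randomized phase: mix in higher hash bits and jump elsewhere.
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + 1 + perturb) & self._mask

    def first_probes(self, h, n):
        """Return the first n slot indices visited for hash h."""
        return list(islice(self._probe(h), n))

    def _find(self, key):
        h = hash(key)
        for idx in self._probe(h):
            slot = self._slots[idx]
            if slot is None:
                return idx, False        # empty slot: key is absent
            if slot[0] == h and slot[1] == key:
                return idx, True         # same hash and equal key: found

    def add(self, key):
        idx, found = self._find(key)
        if found:
            return
        if self._used >= self._mask:     # always keep one slot empty
            raise RuntimeError("toy table is full; resizing not implemented")
        self._slots[idx] = (hash(key), key)
        self._used += 1

    def __contains__(self, key):
        return self._find(key)[1]

    def __len__(self):
        return self._used
```

Keys whose hashes collide on the same starting slot end up in neighbouring slots first, which is the cache-friendly behaviour the text describes; only after the adjacent run is exhausted does the search jump to a distant slot.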

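The post mentions that an OrderedSet can be built from MutableSet and OrderedDict. A minimal sketch of that idea, assuming nothing beyond the standard library, might look like the following. This is not the recipe the docs link to; the class layout here is just one hypothetical way to do it.

```python
from collections import OrderedDict
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Set that remembers insertion order (illustrative sketch only)."""

    def __init__(self, iterable=()):
        # Keys of the OrderedDict are the set elements; values are unused.
        self._items = OrderedDict()
        for item in iterable:
            self.add(item)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        # Re-adding an existing element keeps its original position.
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __repr__(self):
        return f"{type(self).__name__}({list(self._items)!r})"
```

MutableSet supplies the operators (`|`, `&`, `-`, comparisons) on top of the five methods defined here. Equality inherited from the abstract base class ignores order, so an `OrderedSet` still compares equal to a plain `set` with the same elements.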