isin_int64, isin_int32, isin_float64 and isin_float32 are implemented. Using a hash set instead of an array in the isin function has the advantage that the look-up data structure doesn't have to be rebuilt for every call, which reduces the running time from O(n+m) to O(n), where n is the number of queries and m is the number of elements in the look-up array. Thus cykhash's isin can be an order of magnitude faster than the numpy or pandas versions.

all_XXX, any_XXX, none_XXX and count_if_XXX are faster than using isin_XXX and applying numpy's versions of these functions to the resulting array. all_XXX / all_XXX_from_iterator return True if all elements of the query array can be found in the set. any_XXX / any_XXX_from_iterator return True if at least one element of the query array can be found in the set. none_XXX / none_XXX_from_iterator return True if none of the elements of the query array can be found in the set. count_if_XXX / count_if_XXX_from_iterator return the number of elements of the query array that can be found in the set. The from_iterator versions work with any iterable, but the versions for buffers are more efficient.

unique_int64, unique_int32, unique_float64 and unique_float32 are implemented. They return an object which implements the buffer protocol, so np.ctypeslib.as_array (recommended) or np.frombuffer (less safe, as memory can get reinterpreted) can be used to create numpy arrays. Unlike pandas, the returned uniques aren't in the order of appearance. If the order of appearance is important, use the unique_stable_xxx versions, which need somewhat more memory. The signature is unique_xxx(buffer, size_hint=0.0); the initial memory consumption of the hash set will be len(buffer)*size_hint unless size_hint ...

... hash(x)=hash(y) is necessary for the set to work properly.
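The semantics of the query helpers described above can be pictured with ordinary Python sets. The following is a minimal pure-Python sketch, not cykhash's implementation: the names isin_sketch, all_sketch, any_sketch, none_sketch, count_if_sketch and unique_stable_sketch are hypothetical and only mirror the behaviour described in the text. It assumes the look-up set is built once and then reused for several queries, which is where the O(n) cost per call comes from.

```python
from array import array


def isin_sketch(query, lookup):
    """Return one bool per query element: is it contained in `lookup`?"""
    return [x in lookup for x in query]


def all_sketch(query, lookup):
    """True if every query element is in the set."""
    return all(x in lookup for x in query)


def any_sketch(query, lookup):
    """True if at least one query element is in the set."""
    return any(x in lookup for x in query)


def none_sketch(query, lookup):
    """True if no query element is in the set."""
    return not any_sketch(query, lookup)


def count_if_sketch(query, lookup):
    """Number of query elements that are in the set."""
    return sum(1 for x in query if x in lookup)


def unique_stable_sketch(values):
    """Unique values in order of first appearance (needs an extra list)."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


# Build the look-up structure once (m elements) ...
lookup = set(array("q", range(42)))

# ... and reuse it for several queries, each costing O(n).
query_a = array("q", [1, 2, 100])
query_b = array("q", [5, 6, 7])

mask_a = isin_sketch(query_a, lookup)
count_a = count_if_sketch(query_a, lookup)
```

With the real library, the same pattern would use a set created once (for example via Int64Set_from_buffer) and passed to the isin_XXX, all_XXX, any_XXX, none_XXX or count_if_XXX functions.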
Int64Set, Int32Set, Float64Set, Float32Set (and PyObjectSet) are implemented. They are more or less drop-in replacements for Python's set. Furthermore, given the Cython interface, efficient extensions of functionality are easily done.

The biggest advantage of these sets is that they need about 4-8 times less memory than the usual Python sets and are somewhat faster for integers or floats.

As PyObjectSet is somewhat slower than the usual set and needs about the same amount of memory, it should be used only if all nans should be treated as equivalent.

The most efficient way to create such sets is to use XXXXSet_from_buffer(...), e.g. Int64Set_from_buffer, if the data container at hand supports the buffer protocol (e.g. numpy arrays, array.array or ctypes arrays).

Int64toInt64Map, Int32toInt32Map, Float64toInt64Map, Float32toInt32Map (and PyObjectMap) are implemented. They are more or less drop-in replacements for Python's dict. However, not every piece of dict's functionality makes sense: for example, setdefault(x, default) cannot be called without the default argument, because None cannot be inserted. Also, the khash maps don't preserve the insertion order, so there is no reversed either. Furthermore, given the Cython interface, efficient extensions of functionality are easily done.

The biggest advantage of these maps is that they need about 4-8 times less memory than the usual Python dictionaries and are somewhat faster for integers or floats.

As PyObjectMap is somewhat slower than the usual dict and needs about the same amount of memory, it should be used only if all nans should be treated as equivalent.

For example, a map from 64bit floats to 64bit integers is created with:

    my_map = Float64toInt64Map()  # values are 64bit integers
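Why treating all nans as equivalent matters can be seen with the built-in dict. The sketch below uses only the standard library. NanAwareDict is a hypothetical helper written for illustration and is not part of cykhash; it merely imitates, under that assumption, the "all nans are equivalent" behaviour that the text attributes to the cykhash containers.

```python
import math


class NanAwareDict(dict):
    """Hypothetical dict that maps every float nan to one canonical key."""

    _NAN = float("nan")  # single shared nan object used as the key

    @classmethod
    def _canon(cls, key):
        # Replace any float nan by the shared nan object.
        if isinstance(key, float) and math.isnan(key):
            return cls._NAN
        return key

    def __setitem__(self, key, value):
        super().__setitem__(self._canon(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._canon(key))

    def __contains__(self, key):
        return super().__contains__(self._canon(key))


nan_a = float("nan")
nan_b = float("nan")  # a different object holding a nan value

plain = {nan_a: 1}
plain_hit_same_object = nan_a in plain   # found via object identity
plain_hit_other_object = nan_b in plain  # not found, because nan != nan

aware = NanAwareDict()
aware[nan_a] = 1
aware_hit_other_object = nan_b in aware  # found: all nans share one key
aware_value = aware[nan_b]
```

In a plain dict, a nan key can only be retrieved with the very same object that was inserted, which is rarely what numerical code wants.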