Hashing. I. What is hashing? II. Common hashing methods: 1. Division-remainder method; 2. Folding method; 3. Radix transformation method; 4. Digit rearrangement method. III. Hashing algorithms in A/B test sampling: 1. Collision problem

I. What is hashing?
Hashing transforms a string of characters into a shorter value or key so as to speed up searching.

II. Common hashing methods
1. Division-remainder method:
Estimate the number of items the table will hold, then use that estimate as a divisor for each original value or key, extracting a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce collisions, any search mechanism must be able to recognize a collision and offer an alternate search mechanism.)
2. Folding method:
Divide the original value (digits in this case) into several parts, add the parts together, and then use the last four digits (or some other workable number of digits) as the hashed value or key.

3. Radix transformation method:
Where the value or key is numeric, the number base (or radix) can be changed, resulting in a different sequence of digits. (For example, a decimal key could be transformed into a hexadecimal key.) High-order digits can be discarded to fit a hash value of uniform length.
4. Digit rearrangement method:
Take part of the original value or key, such as the digits in positions 3 through 6, reverse their order, and use that sequence of digits as the hash value or key.
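The four methods above can be sketched in Python; the function names, table size, and digit positions are illustrative choices, not part of any standard:

```python
# Toy implementations of the four simple hash methods described above.
# Table size and chunk/digit choices are illustrative assumptions.

def division_remainder(key: int, table_size: int = 997) -> int:
    """Use the remainder of dividing by the (estimated) table size."""
    return key % table_size

def folding(key: int, digits: int = 4) -> int:
    """Split the number into 4-digit chunks from the left, add them,
    and keep the last `digits` digits of the sum."""
    s, total = str(key), 0
    for i in range(0, len(s), 4):
        total += int(s[i:i + 4])
    return total % (10 ** digits)

def radix_transform(key: int, base: int = 16, digits: int = 4) -> int:
    """Re-read the decimal digits of the key as a base-`base` number,
    then keep only the low-order part for a uniform length."""
    value = 0
    for d in str(key):
        value = value * base + int(d)
    return value % (10 ** digits)

def digit_rearrangement(key: int) -> int:
    """Take the digits in positions 3 through 6 and reverse them."""
    s = str(key)
    return int(s[2:6][::-1])

print(folding(123456789))            # 1234 + 5678 + 9 = 6921
print(digit_rearrangement(123456789))  # "3456" reversed -> 6543
print(division_remainder(123456789))
print(radix_transform(123456789))
```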

III. Hashing algorithms in A/B test sampling
1. Collision problem:
For example, if you hash 500 million item IDs into 1000 buckets, the bucket counts may be balanced, but the related metrics may not be.
Possible solutions:
– Identify the root cause.
– Compute Fisher's exact p-value: based on the hypergeometric distribution, it directly computes the probability of rejecting the null hypothesis. When the sample size is relatively small, it serves as a complement to the chi-square test.
Chi-square test: if n mutually independent random variables ξ₁, ξ₂, ..., ξₙ each follow the standard normal distribution (i.i.d. standard normal), then the sum of their squares, Q = ξ₁² + ξ₂² + ... + ξₙ², is a new random variable whose distribution is called the chi-square distribution, written Q ~ χ²(n); the parameter n is called the degrees of freedom. Just as a normal distribution with a different mean or variance is a different normal distribution, a chi-square distribution with a different number of degrees of freedom is a different distribution. The chi-square distribution is constructed from the normal distribution; when the degrees of freedom n is large, χ²(n) is approximately normal. For any positive integer k, the chi-square distribution with k degrees of freedom is the probability distribution of such a random variable.
When observations disagree with expectations, the chi-square test is used to decide whether the deviation is normal fluctuation or a modeling error.
– More detailed balance checks require additional item covariates.
– Handle long-tailed data and outliers.
– Confirm by using analysis methods that correct for the root cause.
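The chi-square and Fisher's exact checks above can be sketched with SciPy; the bucket counts and the 2x2 table below are made-up illustration data:

```python
# Sketch of a bucket-balance check for an A/B split; counts are invented.
from scipy.stats import chisquare, fisher_exact

# Observed item counts in 4 hash buckets; a balanced hash implies a uniform split.
observed = [24810, 25190, 25020, 24980]
expected = [sum(observed) / len(observed)] * len(observed)

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square p-value: {p:.4f}")  # a small p suggests imbalance

# For small samples, Fisher's exact test on a 2x2 contingency table
# (e.g. converted vs. not converted, treatment vs. control) complements it.
table = [[12, 5],   # treatment: converted, not converted
         [7, 11]]   # control:   converted, not converted
odds_ratio, p_fisher = fisher_exact(table)
print(f"Fisher exact p-value: {p_fisher:.4f}")
```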


Hashing defines a way to convert a string of characters into a fixed-length (usually shorter) numeric or index value, known as hashing. Because searching a database by a short hash value is faster than searching by the original value, hashing is commonly used to build and search database indexes; it is also used in various encryption and decryption algorithms. A simple example illustrates this: suppose some names are stored in a database, arranged like this:
Abernathy, Sara
Epperdingle, Roscoe
Moore, Wilfred
Smith, David
All names are in alphabetical order...
The names themselves could be used as the database index. The search algorithm would then compare names character by character until it finds a match. But if every name is hashed, each name in the database could instead be given, say, a four-digit index value, where the number of digits depends on how many names the database holds, like this:
7864 Abernathy, Sara
9802 Epperdingle, Roscoe
1990 Moore, Wilfred
8822 Smith, David
and so on...
Then, on the next lookup, the hash value is searched first and matched against each value in the database. In general, finding a four-digit number is much faster than finding a string of unknown length: each digit has only 10 possibilities, while a name's length is unknown and each character has 26 possibilities.
A hash algorithm is also called a hash function. The English word "hash" means a jumble, so the term probably arose because a finished hash table looks like a mix of seemingly meaningless values. Besides fast data lookup, hashing is used in the encryption and decryption of signatures, which can authenticate a user when sending and receiving messages. The sender first transforms the data signature with a hash function, then sends the digital signature itself and the transformed message digest to the receiver separately. Using the same hash function as the sender, the receiver derives a message digest from the digital signature and compares it with the transmitted digest; if the two values are equal, the digital signature is valid.
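The digest-comparison step described above can be sketched with Python's standard hashlib; the message is illustrative, and the private-key encryption of the digest in a real signature scheme is omitted:

```python
# Minimal sketch of digest comparison with hashlib (no real signature scheme).
import hashlib

message = b"transfer 100 to alice"

# Sender computes a message digest and transmits it alongside the message.
sent_digest = hashlib.sha256(message).hexdigest()

# Receiver recomputes the digest with the same hash function and compares.
received_digest = hashlib.sha256(message).hexdigest()
print(received_digest == sent_digest)  # True: the message was not altered

# Any tampering with the message changes the digest.
print(hashlib.sha256(b"transfer 900 to alice").hexdigest() == sent_digest)  # False
```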
After a hash function is used to index the original values in a database, the hash function must be applied again each time data is retrieved. A hash function is therefore always a one-way operation. There is no need to try to reverse-engineer the hash function by analyzing hash values; in fact, a typical hash function cannot be inverted. A good hash function should also avoid producing the same hash value for different inputs; when that does happen, it is called a collision, and an acceptable hash function should make collisions very unlikely.
Here are some relatively simple hash functions:
1) Remainder method: first estimate the number of entries in the hash table, then use that estimate as a divisor for each original value, obtaining a quotient and a remainder; the remainder is the hash value. Because this method is quite likely to produce collisions, any search algorithm should be able to detect that a collision has occurred and offer a fallback.
2) Folding method: used when the original value is a number. Split the value into several parts, add the parts together, and use the last four digits (or any other number of digits) of the sum as the hash value.
3) Radix transformation method: when the original value is a number, its base (radix) can be converted to a different one; for example, a decimal original value can be converted to a hexadecimal hash value. To keep hash values the same length, high-order digits can be dropped.
4) Digit rearrangement method: simply shuffles the digits of the original value; for example, reverse digits 3 through 6 and use the rearranged digits as the hash value.
Hash functions are not universal: a hash function that works well in a database may not be suitable for cryptography or error checking. The field of cryptography has several well-known hash functions, including MD2, MD4, and MD5; the hash value produced from a digital signature is called a message digest. There is also the Secure Hash Algorithm (SHA), a standard algorithm that produces a larger (160-bit) message digest, somewhat similar to MD4.
• Consensus hashing
• Consistent Hashing
• Reversed Spectral Hashing
• Isotropic Hashing, a hashing algorithm improved from ITQ, with better results than ITQ.
• leetcode Hashing-2 Problem 1 () Problem 2 () Problem 3 ()
• leetcode 2 Hashing-5 Problem 1: Alien Dictionary Problem 2: Verifying the Alien Dictionary
• Semantic hashing seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best...
Rendezvous Hashing
Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve distributed agreement on a set of k options out of a possible set of n options. A typical application is when clients need to agree on which sites (or proxies) objects are assigned to. When k is 1, it subsumes the goals of consistent hashing, using an entirely different method.
Rendezvous hashing solves the distributed hash table problem: How can a set of clients, given an object O, agree on where in a set of n sites (servers, say) to place O? Each client is to select a site independently, but all clients must end up picking the same site. This is non-trivial if we add a minimal disruption constraint, and require that only objects mapping to a removed site may be reassigned to other sites.
The basic idea is to give each site Sj a score (a weight) for each object Oi, and assign the object to the highest-scoring site. All clients first agree on a hash function h(). For object Oi, the site Sj is defined to have weight wi,j = h(Oi, Sj). HRW assigns Oi to the site Sm whose weight wi,m is the largest. Since h() is agreed upon, each client can independently compute the weights wi,1, wi,2, ..., wi,n and pick the largest. If the goal is distributed k-agreement, the clients can independently pick the sites with the k largest hash values.
If a site S is added or removed, only the objects mapping to S are remapped to different sites, satisfying the minimal disruption constraint above. The HRW assignment can be computed independently by any client, since it depends only on the identifiers for the set of sites S1, S2, ..., Sn and the object being assigned.
HRW easily accommodates different capacities among sites. If site Sk has twice the capacity of the other sites, we simply represent Sk twice in the list, say, as Sk,1 and Sk,2. Clearly, twice as many objects will now map to Sk as to the other sites.
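A minimal HRW sketch, assuming SHA-256 as the agreed-upon hash function h() and invented site names (any stable hash shared by all clients would do):

```python
# Rendezvous (highest-random-weight) hashing sketch.
import hashlib

def weight(obj: str, site: str) -> int:
    """h(O, S): a deterministic score every client can compute on its own."""
    digest = hashlib.sha256(f"{obj}|{site}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def hrw_pick(obj: str, sites: list, k: int = 1) -> list:
    """Return the k highest-weight sites for obj (k = 1 subsumes consistent hashing)."""
    return sorted(sites, key=lambda s: weight(obj, s), reverse=True)[:k]

sites = ["server-a", "server-b", "server-c", "server-d"]
chosen = hrw_pick("object-42", sites)[0]      # every client computes the same site
replicas = hrw_pick("object-42", sites, k=2)  # distributed k-agreement

# Minimal disruption: removing a site that an object does NOT map to never
# changes that object's assignment, since the remaining weights keep their order.
other = next(s for s in sites if s != chosen)
assert hrw_pick("object-42", [s for s in sites if s != other])[0] == chosen
```

Capacity weighting as described above amounts to listing a site multiple times (e.g. both "server-a,1" and "server-a,2") before picking the winner.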
Properties
Under rendezvous hashing, clients handle site failures by picking the site that yields the next-largest weight. Remapping is required only for objects currently mapped to the failed site, and, as proved in [1][2], disruption is minimal. Rendezvous hashing has the following properties.
Low overhead: The hash function used is efficient, so overhead at the clients is very low.
Load balancing: Since the hash function is randomizing, each of the n sites is equally likely to receive the object O. Loads are uniform across the sites.
Site capacity: Sites with different capacities can be represented in the site list with multiplicity in proportion to capacity. A site with twice the capacity of the other sites will be represented twice in the list, while every other site is represented once.

High hit rate: Since all clients agree on placing an object O into the same site SO, each fetch or placement of O into SO yields the maximum utility in terms of hit rate. The object O will always be found unless it is evicted by some replacement algorithm at SO.
Minimal disruption: When a site fails, only the objects mapped to that site need to be remapped. Disruption is at the minimal possible level.
Distributed k-agreement: Clients can reach distributed agreement on k sites simply by selecting the top k sites in the ordering.
Comparison with Consistent Hashing
Rendezvous hashing is much simpler to understand and code.
Rendezvous hashing provides a very even distribution of keys on each node, even while nodes are being added or removed. Consistent hashing can fail to provide an even distribution for small clusters (though this can be fixed to a large extent by using many virtual replicas per node). This is the biggest advantage of rendezvous hashing over consistent hashing.
Consistent hashing is typically done in O(log N) time using a binary search. Rendezvous hashing is typically done in O(N) time, though it can also be done in O(log N). Also, N in consistent hashing is larger, often by a factor of ~100, since consistent hashing needs to create virtual nodes as well.
Consistent hashing requires just one hash computation per key, whereas rendezvous hashing requires O(N) hash computations per key. This can make a difference if you're using a slow hash function and have a large ring size.
Consistent hashing requires some fixed memory to work well (mapping nodes to virtual nodes and hashes for all the virtual nodes), whereas rendezvous hashing doesn't require storing any additional data.
Rendezvous hashing can naturally provide k different servers for any key. This makes it very useful for supporting replication. While consistent hashing can also be modified to do this, it's not a standard part of the consistent hashing algorithm/implementation.
So in a nutshell, use rendezvous hashing if:
Your clusters are very large (say thousands of nodes) and you need to keep your memory footprint low.
You want to support replication, but don't want to implement a slightly modified consistent hashing algorithm yourself.
In practice, Apache Ignite uses rendezvous hashing to distribute cache data uniformly across the computing grid, while Cassandra uses consistent hashing for replication and high availability.

Reposted from: https://www.cnblogs.com/codingforum/p/10316442.html
• Consistent hashing: an essential introductory text on the algorithm; clear version, not a scan.
• libconhash is a consistent hashing library that compiles on both Windows and Linux. High performance, easy to use, and easy to scale according to each node's processing capacity.
• Hashing-3 Problem 1: Repeated DNA Sequences () Problem 2: Favorite Genres. Given a map Map userSongs with a user name as key and the list of all songs the user has listened to as value, and a map Map songGenres with a song genre as key and all songs in that genre as...
• Spherical_Hashing
• Hashing_program
Today, while working on the tech-for-the-blind project, I ran into the error "Unicode-objects must be encoded before hashing". While debugging it, I learned a bit about password hashing, which I am recording here.

Servers generally do not store user passwords in the database, only the hashed value of each password, so that user information cannot be stolen even if the site is hacked.

Common hash methods include SHA-1, SHA-256, and MD5. At login, the submitted value is hashed again and compared against the stored hash.

Other attacks can still crack user passwords, mainly brute force; the difficulty of brute-forcing increases with password complexity.
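The error and the store-then-rehash flow described above can be sketched with Python's hashlib; the password is illustrative, and in practice a salted, deliberately slow hash (e.g. bcrypt) is preferred over a bare SHA-256:

```python
# hashlib works on bytes: passing a str raises the very error mentioned above.
import hashlib

password = "correct horse battery staple"

try:
    hashlib.sha256(password)  # str, not bytes: raises TypeError
except TypeError as e:
    print(e)  # "... must be encoded before hashing"

# Store only the hash, never the password itself.
stored = hashlib.sha256(password.encode("utf-8")).hexdigest()

# At login, hash the submitted value again and compare.
def check(submitted: str) -> bool:
    return hashlib.sha256(submitted.encode("utf-8")).hexdigest() == stored

print(check("correct horse battery staple"))  # True
print(check("wrong password"))                # False
```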

• 1078 Hashing: solution code, test results, and notes on problems. Solution code: #include<cstdio> #include<cmath> const int nmax = 20000; bool prime[nmax]; bool isPrime(int x); int NextPrime(int x); int main() { int n, m, ...
• Problem link: http://acm.hdu.edu.cn/showproblem.php?pid=1672 Cuckoo Hashing Time Limit: 3000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) Total Submission(s): 311 Accepted Submis...
• Source code for Kaiming He's 2013 CVPR paper, K-means Hashing: an Affinity-Preserving Quantization Method for Learning Binary Compact Codes.
• A 12-page English document from 1998 that introduces universal hashing in great detail. Universal hashing (in randomized algorithms or data structures) refers to selecting a hash function at random from a family of hash functions with certain mathematical properties. Even if the data is chosen by an adversary, this guarantees the expected...
• Ketama Hashing Algorithm. The Java code runs as-is; a Node class and some comments have been added.
• Data Structure and Algorithm Analysis in C++ ... 5.4.3 Double Hashing: At the beginning of chapter 5.4, the formula hi(x) = (hash(x) + f(i)) is mentioned and f(i) is defined as f(i) = i; Here f...
• The Joys of Hashing: Hash Table Programming with C, a new 2019 book on the fun of hashing, implementing hash tables in C.

...