Good morning folks, I'm trying to understand how hash joins work in query execution.
My understanding of a hash table is you have key/value pairs and hashing means that it takes more or less the same amount of time to retrieve any value given its key. In this case the keys and values are both known and hashing is to do with the ordering of the values with respect to the keys.
My understanding of a hash join is that you create a hash table of the smaller set by using a hashing algorithm to generate the key from the value. Then the larger set's values are hashed to see whether the resulting key exists in the table and this determines that there is a match. The performance gain is due to the speed of hashing a string on the larger set and looking up the resulting key in the table for smaller set; rather than comparing two strings to determine a match. In this case, the values are known but the keys are not. Hashing refers to the process of generating a key from the known value.
If my understanding is correct, the way hashing is used in database table joins is something quite different to using something like a dictionary object in procedural programming.
I'd love it if someone smarter or more knowledgeable than myself could comment!