Thursday, 24 March 2016

Coherence stores keys and values in serialized form

Courtesy : Paul Bentley

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.tangosol.net.CacheFactory;

public class Mystery {

    public static void main(final String[] args) throws Exception {
        new Mystery().f();
    }

    private void f() throws Exception {

        // key1 holds two references to the same "N" literal
        List<String> key1 = new ArrayList<>();
        key1.add("Y");
        key1.add("N"); // There is an N here
        key1.add("A");
        key1.add("N"); // There is another N here

        // key2 is equal to key1, but its last element is a separate String object
        List<String> key2 = new ArrayList<>();
        key2.add("Y");
        key2.add("N"); // There is an N here
        key2.add("A");
        key2.add(new String("N")); // Note the use of new
        Object value = new String("Hello");

        System.out.println("key1 hashCode " + key1.hashCode() + " serialized " + serialize(key1).length());
        System.out.println("key2 hashCode " + key2.hashCode() + " serialized " + serialize(key2).length());
        System.out.println("key1.equals(key2) " + key1.equals(key2));

        // A plain HashMap finds the value with either key
        HashMap<List<String>, Object> hashMap = new HashMap<>();
        hashMap.put(key2, value);
        System.out.println(hashMap.get(key1));
        System.out.println(hashMap.get(key2));

        // The Coherence cache only finds the value with the key used for the put
        final Map coherenceMap = CacheFactory.getCache("Hello");
        coherenceMap.put(key2, value);
        System.out.println(coherenceMap.get(key1));
        System.out.println(coherenceMap.get(key2));

    }

    // Serializes the object with standard Java serialization; the length of the
    // returned string is used above only as an indication of the serialized size.
    private String serialize(final Object myObject) throws IOException {
        ByteArrayOutputStream bo = new ByteArrayOutputStream();
        ObjectOutputStream so = new ObjectOutputStream(bo);
        so.writeObject(myObject);
        so.flush();
        return bo.toString();
    }

}

So the output from this program is

Output
key1 hashCode 3651971 serialized 75
key2 hashCode 3651971 serialized 74
key1.equals(key2) true
Hello
Hello
null
Hello
We need to try and explain the mystery of why one line in the output is null when retrieving data from Coherence.

The hashCodes are identical and the two lists are equal.

To understand this we have to look at the two lists in the debugger, which shows the object ids of the elements of key1 and key2.

The difference is that key1 has id values 33, 37, 38 and 37, whereas key2 has values 33, 37, 38 and 47.
So key1 contains two references to the same "N" object, whereas key2 contains 4 unique object references.
Experimenting with this scenario, we discover that we can only get the value from Coherence with the key with which the value was put there.
So Coherence is sensitive to object references within the key.

Note that the serialized forms are different: their sizes differ by 1.

So what is going on here?
It appears that Coherence stores a serialized version of the key (and of the value too), so it effectively applies hashCode() and equals() to the serialized version of the key.
So the surprise is explained by the different serialization. A key holding references to several distinct objects serializes differently from a key holding multiple references to the same object, even when the two keys have the same hash code and are equal.
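The effect can be reproduced without Coherence. The following is a minimal sketch using only standard Java serialization: it compares the raw bytes of two equal keys and then uses a HashMap keyed on those bytes to imitate a cache that looks keys up by their serialized form. The class name SerializedKeyDemo and the byte-keyed map are illustrative assumptions; this is not how Coherence is implemented internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SerializedKeyDemo {

    // Serialize with standard Java serialization and return the raw bytes.
    static byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Key with two references to the same "N" object.
    static List<String> sharedKey() {
        String n = "N";
        return new ArrayList<>(Arrays.asList("Y", n, "A", n));
    }

    // Equal key whose last element is a separate String object.
    static List<String> distinctKey() {
        return new ArrayList<>(Arrays.asList("Y", "N", "A", new String("N")));
    }

    public static void main(String[] args) throws IOException {
        byte[] shared = toBytes(sharedKey());
        byte[] distinct = toBytes(distinctKey());

        System.out.println("equals():     " + sharedKey().equals(distinctKey()));
        System.out.println("same bytes:   " + Arrays.equals(shared, distinct));
        System.out.println("byte lengths: " + shared.length + " vs " + distinct.length);

        // Stand-in for a cache keyed on serialized bytes
        // (ByteBuffer compares and hashes by content).
        Map<ByteBuffer, String> binaryMap = new HashMap<>();
        binaryMap.put(ByteBuffer.wrap(distinct), "Hello");

        // Lookup with the equal-but-differently-serialized key misses.
        System.out.println(binaryMap.get(ByteBuffer.wrap(shared)));
        // Lookup with a key built the same way as the stored one hits.
        System.out.println(binaryMap.get(ByteBuffer.wrap(toBytes(distinctKey()))));
    }
}
```

The repeated reference is written as a short back-reference to the first "N" rather than as a second copy of the string, which is why the two byte arrays differ.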

How does the Configuration API Generate keys and Values in Coherence?
The above investigation was triggered by disbelief that the configuration API works and would continue to work consistently over time.
It actually has two key generation strategies depending on the strategy for cache population.

Demand Loaded Caches and Entire Load Caches are terms used here to describe two caching strategies used by the Configuration API. They are not Coherence terms.

Demand Loaded Caches
Demand loaded caches are loaded one row at a time. If the Configuration API is queried for a row and it cannot be found in the Coherence cache, then the SQL is executed to populate the cache.
These caches have some interesting characteristics. Specifically, if the SQL execution results in no row being returned, then the Configuration API stores an empty list as the value in the cache.
This means that subsequent requests for the key will result in the empty list being retrieved, which the Configuration API converts to null before returning it to the caller. This prevents repeated calls to the database to discover the row is not there.
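The demand-loading behaviour described above can be sketched as follows. This is an illustration only: the class DemandLoadedCache, its loader function and the call counter are hypothetical names, and a plain HashMap stands in for the Coherence cache (so, unlike Coherence, it compares keys with equals()).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class DemandLoadedCache {
    // Stand-in for the Coherence cache: bind variables -> result set row.
    private final Map<List<Object>, List<Object>> cache = new HashMap<>();
    // Hypothetical loader that runs the SQL; returns null when no row is found.
    private final Function<List<Object>, List<Object>> loader;
    private int loaderCalls = 0;

    DemandLoadedCache(Function<List<Object>, List<Object>> loader) {
        this.loader = loader;
    }

    List<Object> get(List<Object> key) {
        List<Object> row = cache.get(key);
        if (row == null) {
            // Cache miss: execute the SQL once and remember the outcome.
            loaderCalls++;
            row = loader.apply(key);
            if (row == null) {
                // No row in the database: store an empty list as a marker.
                row = Collections.emptyList();
            }
            cache.put(key, row);
        }
        // The empty-list marker is converted to null for the caller.
        return row.isEmpty() ? null : row;
    }

    int loaderCalls() {
        return loaderCalls;
    }

    public static void main(String[] args) {
        DemandLoadedCache cache = new DemandLoadedCache(key -> null);
        List<Object> key = Arrays.<Object>asList("sender", "service");
        System.out.println(cache.get(key));       // no row found
        System.out.println(cache.get(key));       // answered from the cache
        System.out.println(cache.loaderCalls());  // the loader ran only once
    }
}
```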

In a demand loaded cache the bind variables of the query form the Coherence key and the result set values form the Coherence value.
In the case of Url Connectivity there are 4 bind variables and 4 result set columns.

Url Connectivity SQL
SELECT cu.connection_id, cu.url, uc.conn_timeout, uc.read_timeout
FROM url_connectivity uc, connection_url cu, sender_type so, service_type se, function_type fu, system_type sy
WHERE uc.connection_id = cu.connection_id AND uc.sender_id = so.sender_id AND uc.service_id = se.service_id AND uc.function_id = fu.function_id AND uc.system_id = sy.system_id
AND so.name = ? AND se.name = ? AND fu.name = ? AND sy.name = ?
Since access to these caches is behind the Configuration API, key and value generation is not a concern of the calling client.
In fact, for this type of cache, if the key is not formed properly the worst outcome is that the key is not found, the SQL is executed once more and the cache is loaded once more with this key.
This might result in duplicate values for 'almost identical keys', but it would not stop data caching and retrieval.

Now in the case of Url Connectivity (and Identity Mapper) they go through a short 'hunting' algorithm escalating SERVICE and SENDER (or SENDER and PRINCIPAL_USER_ID) to ANY.
So we can be sure (by code inspection) that we have rows in the cache where the key contains two ANY elements, and that these are two object references to the same ANY, not two copies of ANY.
This ANY, ANY is in fact a common case.

Note that the hunting mechanism above is made efficient by storing empty lists when the SQL does not return a row.
The hunting mechanism can then learn from the cache that the SQL will not return a row, and so avoid repeated useless SQL calls during the hunt (the whole point of a cache).
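A rough sketch of such a hunt is shown below. The escalation order (exact key first, then SERVICE as ANY, then both SERVICE and SENDER as ANY), the class name HuntingLookup and the use of a plain map returning a URL are assumptions made for illustration; the actual Configuration API code is not shown in this post. The point to notice is that the last candidate key contains the same ANY object twice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HuntingLookup {
    static final String ANY = "ANY";

    // Candidate keys from most to least specific (the order is an assumption).
    static List<List<String>> candidateKeys(String sender, String service,
                                            String function, String system) {
        return Arrays.asList(
                Arrays.asList(sender, service, function, system),
                Arrays.asList(sender, ANY, function, system),
                // Both elements below are references to the same ANY object.
                Arrays.asList(ANY, ANY, function, system));
    }

    // Returns the first value found while hunting, or null if none matches.
    static String hunt(Map<List<String>, String> rows, String sender, String service,
                       String function, String system) {
        for (List<String> key : candidateKeys(sender, service, function, system)) {
            String url = rows.get(key);
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<List<String>, String> rows = new HashMap<>();
        rows.put(Arrays.asList(ANY, ANY, "fn", "sys"), "http://example.invalid/fallback");

        System.out.println(hunt(rows, "senderA", "serviceB", "fn", "sys"));

        List<String> last = candidateKeys("senderA", "serviceB", "fn", "sys").get(2);
        System.out.println(last.get(0) == last.get(1)); // same ANY reference twice
    }
}
```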

Entire Load Caches (Service Specific Caches)
Service specific caches are always Entire Load Caches.

Entire load caches are either empty or fully populated. If the cache is queried when it is empty it will be fully populated from the database in a single query.
If the cache is not empty, the key is used to look up the value, which is returned, or null is returned if the key is not present.
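As a sketch of this strategy, assuming a hypothetical EntireLoadCache class whose loadAll supplier stands in for the single SQL query, and a plain HashMap standing in for the Coherence cache:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class EntireLoadCache<K, V> {
    // Stand-in for the Coherence cache.
    private final Map<K, V> cache = new HashMap<>();
    // Hypothetical loader returning every row of the table in one call.
    private final Supplier<Map<K, V>> loadAll;
    private int loads = 0;

    EntireLoadCache(Supplier<Map<K, V>> loadAll) {
        this.loadAll = loadAll;
    }

    V get(K key) {
        if (cache.isEmpty()) {
            // Empty cache: populate it completely in a single load.
            loads++;
            cache.putAll(loadAll.get());
        }
        // Populated cache: a missing key simply yields null, with no further SQL.
        return cache.get(key);
    }

    int loads() {
        return loads;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("a", 1);
        table.put("b", 2);

        EntireLoadCache<String, Integer> cache = new EntireLoadCache<>(() -> table);
        System.out.println(cache.get("a"));
        System.out.println(cache.get("missing"));
        System.out.println(cache.loads());
    }
}
```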

Now this type of cache is always populated with an SQL statement with no bind variables.

Example SQL for Entire Load Cache
SELECT d.site_id, d.billing_day, d.draw_day from dd_draw_day d
The Configuration API specifies that columns 1 and 2 of the result set will form the Coherence key and column 3 the value.
When JDBC loads the data from the database, it generates unique object references for the three columns.
This means that if the row in the database contains 1, 1, 1 or 2, 2, 2 then in each case three different BigDecimal objects are generated, NOT a single BigDecimal with three references to it.
So 1, 1 and 2, 2 will form the keys, and 1 and 2 the respective values.

Therefore using this as the lookup key in Coherence will fail.

Bad Coherence Lookup Key
final List key = new ArrayList();
final BigDecimal one = new BigDecimal(1);
key.add(one);
key.add(one);
Whereas this will be successful.

Good Coherence Key
final List key = new ArrayList();
key.add(new BigDecimal(1)); // a separate BigDecimal object for each element
key.add(new BigDecimal(1));
This pattern should be applied to all types, not just BigDecimal; for example, in the case of Strings, new String should also be called for each element.
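One way to catch the bad pattern early is to check a key for repeated object references before using it for a lookup. The helper below is a hypothetical sketch (the name SharedReferenceCheck is not part of the Configuration API or of Coherence); it compares elements by identity rather than by equals().

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class SharedReferenceCheck {

    // Returns true if the same object instance appears more than once in the key.
    static boolean hasSharedReferences(List<?> key) {
        // Identity-based set: elements are compared with ==, not equals().
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        for (Object element : key) {
            if (element != null && !seen.add(element)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BigDecimal one = new BigDecimal(1);

        // Same instance added twice: the pattern to avoid.
        System.out.println(hasSharedReferences(Arrays.asList(one, one)));

        // Separate instances for each element: the recommended pattern.
        System.out.println(hasSharedReferences(
                Arrays.asList(new BigDecimal(1), new BigDecimal(1))));

        // String literals with the same content are the same object.
        System.out.println(hasSharedReferences(Arrays.asList("N", "N")));
    }
}
```

Note that repeated string literals and small autoboxed Integer values are shared instances in Java, so such a check would flag them too.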
