I needed to move a large data set out of a relational database, and did a very thorough analysis of distributed hash tables with Cassandra after a quick primer on Google's BigTable and Amazon's Dynamo.
At that time there was a query function in the Cassandra API, but I quickly discovered it was not implemented, so the only way to query the system was by the primary key used to store your items originally.
I pulled apart how the queries got distributed, how the keys got hashed to a node, and how fail-over was handled, and how node membership was handled.
Cassandra appeared to be using an early implementation of thrift, at least the design of the protocol was very similar. I was surprised that it was deserializing entire messages at the routing stage instead of just looking at the headers. I would expect that has been optimized by now.
I extended the protocol to support a specialized query syntax because I wanted to see if I could get a list of keys that each contained a common data attribute.
Bottom line was that it wasn't really possible to support doing any kind of query without forcing a data storage representation. Everything I was storing was basically a map of attributes anyways, but it was more of an investigative exercise to see what it would take if I needed to implement that capability.
Recently looking at cloudbase, I see they took that step by storing everything in JSON, which would allow for distributed queries.
I foresee using a persistent distributed hash table in the near future, not for distributed queries, but more for performance reasons.