Saturday, 23 March 2013

Adventures in Querying with RavenDB

For the last couple of months, we've been working on a project that uses RavenDB as the main data store. We went with Raven for several reasons - it looked like a pretty good fit for what we were doing, we were using NServiceBus which now includes Raven as its default persistence store, and we were keen to see what it could do. Generally, it's been really, really positive. However, RavenDB is the first document database I've ever used, and so once in a while the learning curve just leaves me scratching my head in frustration...

So, last week, a request comes in for a straightforward one-off report. We're using Raven to store details about media clips - video and audio files - that are linked to our customers' online CVs, and the product owner wanted a list of customers along with the total duration of the media clips linked to those customers' CVs.

In SQL, this would be

SELECT Customer, SUM(Duration) as TotalDuration FROM MediaClips GROUP BY Customer

The two developers who actually built the system weren't in the office last week, so this request ended up on my desk. "Ah, no problem", thinks me. After all - it's a database; how hard can it be? I'd spent a bit of time building custom indexes to drive an SLA report we built last week, and with quite a lot of Googling, managed to get some really impressive results out of Raven.

Thing is - I'm not building a feature here, I just need to copy & paste a bunch of numbers into a spreadsheet and get on with my life. And I have no idea how you do that in Raven. I spend an hour or so trying to get Ronnie Overby's RavenDB provider for LinqPAD to install. This is a third-party library that isn't part of RavenDB, but I like LinqPAD, I'm pretty comfortable with Linq query syntax, and this looked like a good place to start. Half an hour later, after downloading the source, compiling it, creating a certificate, signing it and getting LinqPAD to recognise and install the driver, I gave up because I just couldn't work out how to get any data out of the damn thing.

Instead, I throw together a .NET console app and start playing around with indexes, working from "Using Linq to query RavenDB indexes". I get as far as creating a named map/reduce index that's pulling out the data I need - this takes a good hour or two of trial and error. The eureka moment is when I work out that you can define Raven indexes as raw LINQ code - they're not even embedded strings, they're actually strongly-typed, compiled Linq expressions that are passed in to the query definition. I get as far as this:

class Total {
  public string CustomerId { get; set; }
  public int Duration { get; set; }
}

public void BuildIndex() {
  var store = new DocumentStore() { ConnectionStringName = "raven" };
  store.Initialize();
  store.DatabaseCommands.PutIndex("ClipInfos/TotalDurations", new IndexDefinitionBuilder<ClipInfo, Total> {

    Map = clips => from clip in clips select new {
    Customer = clip.CustomerId,
      Duration = clip.DurationInSeconds
    },

    Reduce = results => from result in results group result by result.Customer into g select new {
      Customer = g.Key,
      Duration = g.Sum(x => x.Duration)
    }
  });
}

Now, Raven encourages "safe" query patterns. Which means it'll only give you 128 records per query by default, and if you run more than 30 queries in a single session it'll fail with an exception telling you you're doing something dangerous. The only way I can find to actually retrieve all the data I need is to circumvent these defaults to increase the batch size from 128 to 1024, and then repeated use .Skip(1024*batch).Take(1024) to pull records back 1,024 at a time and add them to my result set.

using (var session = raven.OpenSession()) {
  session.Advanced.MaxNumberOfRequestsPerSession = 512;
  while (true) {
    var totals = session.Query<Total>("ClipInfos/TotalDurations").Skip(1024 * batches).Take(1024).ToList();
    foreach (var total in totals)  {
      // Do something useful.
    }
    if (totals.Count() < 1024) break;
    batches++;
  }
}

At this point, I'm thinking one of two things:

  1. This is fine. It's an ad-hoc requirement, and if we end up doing this a *lot* I should read up on the index replication bundle - a plug-in that, I gather, will export indexed data into a relational DB for easier querying.
  2. This is completely wrong. It should not take three hours and a dedicated console application to run a simple ad-hoc query against a set of Raven data. I am missing something obvious...

As somebody who's been writing SQL for nearly 20 years, it's really frustrating to hit a brick wall like this... and it's been a long day, and I'm getting fed up, and I wander over to Twitter to have a bit of a rant and @ayende pops up:

image

Ah. Dynamic Queries. That looks interesting. So I head over to Raven Studio and sure enough, there's a "Dynamic Queries" facility. So I paste in a simple query and hit "Execute":

image

No results? Huh? But there's tens of thousands of documents in there! At this point I give up and go to the pub.

This morning, I go back to it and notice the little "i" icon next to the Query header, and see this...

image

Ah. I have no idea what Lucene syntax is... but that probably explains why none of my dynamic queries worked.

Raven is really impressive. We've managed to get it to do some very cool things; it's fast, and most of the time it Just Works. And there is an inevitable learning curve associated with adopting a new technology - particularly one that represents a paradigm shift in approach. It's kinda like learning Powershell and treating it as another CLR/.NET language... which is really, really frustrating, until you realize it's not "scriptable .NET", it's more like command.com on steroids. The shift from relational to document databases is much the same. That said - it would be really, really useful if the dynamic query tool in Raven Studio would distinguish between "I understand that query but could find no matching records" and "I have no idea what you are trying to do" - I have no idea how strict the Lucene query language or parser is, but knowing whether your query is syntactically valid or not would be a big help when you're trying to work out why it didn't return anything.

image

1 comment:

Neil Barnwell said...

++1. I've had EXACTLY the same experience with Raven. I love it, but trying to work out queries for it as a n00b has been quite frustrating. The thing where the dynamic query doesn't allow LINQ syntax is such a major gotcha. I found that if you actually create an index, *that* will allow the LINQ syntax. Of course then you have an index you may not want long-term, but deleting them is no real hassle and it does mean you can re-run it. Perhaps don't leave it there on a production system as it'll spend time and IO keeping it up to date.