Extending OrigoDB with some kind of scripting feature is something we have been thinking about for a long time. An obvious application would be the OrigoDB server console, which currently only understands ‘exit’. Imagine the power of an interactive shell within the server process itself. So I decided to play around with scriptcs to see what I could come up with.
It didn’t go well at first. After fighting with PowerShell for hours (for this reason) trying to install Chocolatey, I gave up and decided to try building and installing from source, which went amazingly well!
Download and unzip the source
Double click the build.cmd file
Copy the output from ./Artefacts/Release/bin somewhere
Add the deployment location to the PATH environment variable
From a git bash console on Windows, I fired up vi, typed the following script and saved it as start.csx in an empty folder:
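The script itself isn’t reproduced here, but a minimal start.csx along these lines matches the behavior described below. The model name MyModel comes from the data directory mentioned later; everything else (the list field, the command, the exact API calls) is an assumption, not the original script:

```csharp
// start.csx - minimal OrigoDB scriptcs script (illustrative sketch, not the original)
using System;
using System.Collections.Generic;
using OrigoDB.Core;

[Serializable]
public class MyModel : Model
{
    public List<string> Items = new List<string>();
}

[Serializable]
public class AddItemCommand : Command<MyModel>
{
    public string Item;
    public override void Execute(MyModel model) { model.Items.Add(Item); }
}

// Creates the MyModel data directory on first run, restores from it afterwards
var engine = Engine.For<MyModel>();
engine.Execute(new AddItemCommand { Item = "hello origodb" });
```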
In order to execute, we need to grab the origodb.core nuget package. From the folder with the script:
$ scriptcs -install origodb.core
This will create a packages.config file and a packages sub folder in the current directory. Next step is to run the script:
$ scriptcs start.csx
A directory called MyModel is created containing a snapshot and a journal file. Running the script again should restore the
previous state, but instead a SerializationException is thrown. So what's happening? Looking at the stack trace,
the Roslyn-generated assembly has a random name, which the BinaryFormatter includes in the snapshot file.
Running the script again produces a new random name that doesn't match the one from the previous execution, so the load fails.
If we used a model from a precompiled assembly this would not be an issue. For now I just deleted the data directory:
$ rm -fr MyModel
Next I ran the script again but instead of terminating entered REPL mode and typed some interactive commands.
Here's a transcript:
Alright! This is fun and looks really promising. The next step is writing some useful examples, like journal/snapshot maintenance, backups, ad-hoc querying, connecting to a remote OrigoDB server and, finally, hosting a server with scriptcs.
I’ve been doing some presentations for user groups on the topic of in-memory databases. During the talk we examine the traditional RDBMS architecture and some of its shortcomings, move on to defining IMDB (in-memory database) and look at some of the IMDB products available. Finally, we take a closer look at OrigoDB and finish off with some live code and examples. Here are the slides from the presentation:
Whoa! Our first runtime performance issue with OrigoDB.
To populate geekstream we run a background process which periodically checks a number of RSS feeds. The feeds are added manually, either individually or from OPML files. Initially, I was eager to
pull in a lot of content quickly. I achieved this by gobbling up OPML files found with Google, the bigger the better. This resulted in some quality issues:
Duplicate feeds due to multiple urls for the same source
Duplicate items due to ‘popular’, ‘latest’ or syndicated streams
Non-tech-related feeds
Today I started to do some manual cleaning by identifying feeds to remove using random searches like “fish” and “texas”, then removing them with a command line tool (included in the source).
$ geekstream.admin -r 2345
To my surprise, each invocation was taking several seconds to return. Were the commands taking time to execute, or was something else going on? I could have tinkered with the logging to see command execution timings, but knowledge of the data model, plus the fact that queries running at the same time were blocked, confirms that the commands are slow. So let's have a look at the code to see what’s going on and what we can do about it. The application sends a RemoveFeedCommand to the server. The engine executes the command, which calls this method:
The size of _mostRecentItems is 10, so it is surely insignificant. Following the call to RemoveFeedFromSearchIndex(feedToRemove) leads us to:
Yeah, this looks interesting. The main for loop walks the entire dictionary of keywords, which at the moment of writing holds slightly more than 300,000 entries. _statistics.TotalKeywords is a misleading name; it’s actually the total number of IndexEntry objects linking keywords to articles. entrySet.RemoveWhere() performs a single iteration over the items in the HashSet, so the total number of calls to the inner lambda on line 11 equals the total number of index entries, more than 11 million. The inner lambda is a hash lookup, so it should be O(1), which is somewhat confirmed by the MSDN documentation for the HashSet<T>.Contains(T) method. Here’s the index data structure for reference:
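The original listing isn’t included here; based on the description, it is essentially a dictionary mapping each keyword to a HashSet of index entries. A sketch, with member names that are assumptions rather than the actual geekstream code:

```csharp
// Sketch of the geekstream search index (member names are assumptions)
[Serializable]
public class SearchIndex
{
    // ~300,000 keys at the time of writing
    private readonly Dictionary<string, HashSet<IndexEntry>> _index
        = new Dictionary<string, HashSet<IndexEntry>>();
}

// Links a keyword to a single article; more than 11 million instances in total
[Serializable]
public class IndexEntry
{
    public FeedItem Item;
}
```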
Before we start optimizing let’s do some math to figure out how this scales. First some basic assumptions and measures and how they scale:
N -> number of articles, grows linearly with time
A -> average number of unique words per article, constant over time
K -> number of keywords, has a constant upper bound
L -> number of links = N * A, so proportional to N
Looking back at our RemoveFeed algorithm, we have a loop inside a loop, which suggests O(N^2), but it’s actually O(N) because the total number of calls to the basic operation is L, which is proportional to N. We could write a projection over the command journal, calculate these numbers over time and plot them. But let’s save that for a separate blog article.
So what do we do about it? There are several things we can improve.
Add a reverse index, so that FeedItems have references back to each associated HashSet<IndexEntry> object. Then we could iterate over the FeedItems instead of the entire set of keywords. And given that most sources don’t grow indefinitely, we’re back to an O(1) algorithm at the expense of some extra memory, a so-called space-time trade-off. Perhaps I’ll elaborate on the details in a future blog article.
Run the outer loop in parallel utilizing cores/processors. Sure, we could divide the execution time by the number of cores (almost) but we’ll be back at square one when the number of articles has increased by the same factor. Also, we’ll be hogging CPU which could cause other problems. So this is just buying time.
Redesign the algorithm to use a read phase and a write phase. The read phase iterates and builds a list of the hashsets to be modified during the write phase. The read phase happens during Command<M>.Prepare(M db), which immediately precedes Command<M>.Execute() but allows concurrent queries.
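A sketch of what alternative 3 could look like, assuming the Prepare/Execute hooks described above; the model type, helper method and field names are hypothetical, not the actual geekstream code:

```csharp
[Serializable]
public class RemoveFeedCommand : Command<GeekStreamModel>
{
    public int FeedId;

    // Collected during the read phase; not part of the journaled command state
    [NonSerialized]
    private List<HashSet<IndexEntry>> _affectedSets;

    // Read phase: concurrent queries are still allowed while this runs
    public override void Prepare(GeekStreamModel db)
    {
        // FindIndexSetsReferencing is a hypothetical helper that walks the index
        _affectedSets = db.FindIndexSetsReferencing(FeedId);
    }

    // Write phase: runs under the exclusive lock, touching only the collected sets
    public override void Execute(GeekStreamModel db)
    {
        foreach (var entrySet in _affectedSets)
            entrySet.RemoveWhere(entry => entry.BelongsToFeed(FeedId));
        db.RemoveFeed(FeedId);
    }
}
```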
So which path do we choose? Or can we do a combination? It all depends on what we want to optimize. We probably want to optimize for query throughput, which means fast command execution. In that sense alternatives 1 and 3 have the same effect: the latter calculates on the fly what the former stores in memory. We can combine either alternative 1 or 3 with multi-core processing (alternative 2) during Command.Execute() to minimize the duration of the exclusive lock. Hogging CPU during Command.Prepare() will shorten the total execution time but will have an impact on concurrent queries.
As always you should do some profiling or other performance measurements and not base optimizations on assumptions.
Geekstream, our experimental tech blog search engine built with OrigoDB is back online!
A screenshot showing search results for “hanselman”
The updated UI now uses a responsive design based on Twitter’s Bootstrap. The search results and infinite scrolling are Ajax-based. Besides that, we updated to OrigoDB.Core 0.11 and OrigoDB.Server 0.2 and simplified the data model a bit by removing click tracking. We were getting a lot of traffic from bots not respecting our robots.txt, and each “click” request resulted in a RegisterClickCommand, quickly growing the size of the command journal.
Number of RSS feeds: 2941
Number of articles: 55843
Number of unique search terms: 228577
Number of links (term -> article): 5505799
Total size of journal on disk: 132 MB
Node process memory: 897MB
Load time: 1m 22s
And here’s a screenshot of the OrigoDB Server web UI, powered by NancyFx
OrigoDB Server 0.2 Web UI license page. Click for larger image
Do you remember the days when file formats had to be hand crafted, not to mention all the code needed to read and write them? Or even more cumbersome, implementing a network communication protocol? In some cases you probably still need to, but fortunately there is a simpler way. Most platforms support reading and writing objects to a stream, file or string, taking care of all the gory details for you.
I first discovered object serialization when learning Java. I was immediately impressed by the power and simplicity. Days of work could now be done in seconds using a few lines of code.
Here’s some code in C# to show how simple it is:
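The original snippet isn’t included here; a minimal version, consistent with the WriteToStream signature discussed below, might look like this (class and method names are illustrative):

```csharp
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

public static class SerializationExample
{
    // Writes the entire object graph reachable from 'graph' to the stream
    public static void WriteToStream(object graph, Stream stream)
    {
        new BinaryFormatter().Serialize(stream, graph);
    }

    // Reads it back, reconstructing an identical graph
    public static object ReadFromStream(Stream stream)
    {
        return new BinaryFormatter().Deserialize(stream);
    }
}
```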
The stream abstraction means you can target files, memory or sockets. It also supports the decorator pattern, letting you transparently add things like encryption and compression. Powerful!
Notice that the first argument to WriteToStream is named graph. The reason is to be explicit about what gets written to the Stream. The state of an object is defined by the current value of each of its fields. But what if a field is a reference to some other object? There you have it. A graph. And what if there’s a reference to a collection like a list or dictionary? Same thing, the collection is an object containing references to its items. It’s all a graph and it can get very big and complex.
The BinaryFormatter will walk the entire graph, writing each object to the stream while keeping track of multiple references to the same object. When reading it back an identical graph is constructed.
Keep in mind that references are one-way. If we have this graph structure: A -> B -> C and we pass B to WriteToStream, only B and C are written to the stream.
So why am I writing this? Most developers understand serialization, what’s the big deal? I’m trying to paint a bigger picture here and this is a detail that plays in. OrigoDB makes heavy use of serialization. Snapshots, command journalling, master/slave communication and cloning result graphs to name a few. But those are just technical details and not really significant. The key point here is that an OrigoDB in-memory database is just a plain old object graph. We’ll dig deeper into this fact in a coming post. Stay tuned!
I honestly do. The frustration up until the point when I’m satisfied with a piece of writing is nearly intolerable. I can work on a single paragraph for hours until somewhat satisfied, and when I read it again the next day it doesn’t resonate with me at all. Despite this, I have decided to start blogging regularly: perhaps two, but at least one post a week for the duration of 2014. Less than one and I have failed.
To avoid failure I’m going to have to lower the bar and settle for good enough writing. A clear goal with each post, less editing, shorter posts and a variety of topics are strategies I hope will help.
So why torment myself by writing?
I enjoy creating. I create software for a living. Before that, I used to build buildings. Now I’m taking a shot at building a new business. Over the past few years I have been working on an open source project and have become more and more convinced of the business value it can provide. The idea is simple and powerful but requires that developers adopt an entirely new approach to building software. Bringing such cultural change about is much more difficult than developing the technology enabling it. So most posts will be related to OrigoDB and the attempt to build a business based on it.