*Note*: _this was taken from the esme-dev mailing list. It is a long post (Nov. 6) from David Pollak that says a lot about our design philosophy._

Over the last 6 or so months, we've had a bunch of discussions on the list about statefulness, REST, and ESME's overall design. I want to walk through the design choices I've made for ESME and why "stateless" and other such designs fail and are dead wrong for a social networking system.

There is no such thing as stateless. Every web site has state. The state may change frequently or may change infrequently. A web site made up of static files has its state based on those static pages. When those pages are changed, the state changes. State is kept somewhere for all web sites.

Some web sites will present a different state depending on who is accessing the site. This can be as simple as serving different pages depending on the IP address or language preference expressed in the HTTP headers. This is sessionful. The content is calculated based on the request. This may be more sophisticated in terms of authenticating the HTTP request and presenting content based on the authentication.

A session for sessionful content may be short-lived (the length of the request) or it may be longer lived (typically this is done with an initial authentication phase resulting in a shared secret \[JSESSIONID\] that is presented as an authentication proxy in subsequent requests). But no matter the authentication mechanism or the session lifespan, there must exist a mechanism for translating the HTTP request into the content presented for the session.

Far and away the most common way of persisting and calculating state is in a relational database (RDBMS). RDBMSs are awesome creatures. They sit on top of some excellent and well understood mathematics: set theory. They have well known and well understood concurrency mechanisms: transactions. They have been designed, built, tested, and optimized over the last generation. RDBMSs offer a simple set of commands (SELECT, DELETE, INSERT, UPDATE) as well as a generally human-understandable set of semantics: people understand that RDBMSs are sets of things and there are simple ways to ask about these sets. RDBMSs have evolved along with ERP systems and have evolved to meet the needs of those systems.

However, there are well known things that RDBMSs don't do well, including tree structures (yeah, Oracle and others have extensions for tree walks, but nothing is part of the SQL spec, and the performance of these extensions is not always the same as other models: a tree walk in an RDBMS costs O(log n) for each node where a tree walk in an OO system costs O(1)). Social networks/social graphs are another place where RDBMSs do not excel. Let's dive down into this.

A naive implementation of a social messaging site runs something like these tables:

{code}
Users(id, name, password)
Friends(owner, friend)
Messages(id, poster, content, date)
{code}

So, if we wanted to calculate the timeline for a given user at a given instant, the query would look like:

{code}
SELECT messages.*
  FROM messages, friends
 WHERE friends.owner = current_user
   AND messages.poster = friends.friend
 ORDER BY messages.date DESC
 LIMIT 20
{code}

Assuming we've got indexes on friends.owner, messages.poster, and messages.date, the query still results in O(n log n) where n is the aggregate number of messages posted. This is non-trivial, and if you follow someone who has posted 20,000 messages, the n log n cost becomes painful.
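To make the read path concrete, here is a minimal sketch of that naive query as application code. This is my illustration, not ESME code: it assumes a plain JDBC connection and the hypothetical tables above.

{code}
import java.sql.Connection

// Hypothetical helper: every timeline request re-runs the join and the sort,
// so the RDBMS pays the O(n log n) cost on each and every call.
object NaiveTimeline {
  def latest20(conn: Connection, currentUser: Long): List[String] = {
    val ps = conn.prepareStatement(
      "SELECT messages.content FROM messages, friends " +
      "WHERE friends.owner = ? AND messages.poster = friends.friend " +
      "ORDER BY messages.date DESC LIMIT 20")
    ps.setLong(1, currentUser)
    val rs = ps.executeQuery()
    var rows: List[String] = Nil
    while (rs.next()) rows ::= rs.getString("content")
    rows.reverse // restore newest-first order after the prepends
  }
}
{code}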
Basically, each time a client asks for the latest timeline, you've got an O(n log n) operation to determine state. This doesn't scale.

The first obvious response to the issue is caching (capturing the state beyond the duration of a short-lived session). I'm going to skip caching for a moment and do a more sophisticated implementation of timelines so we can get better performance.

Let's create a mailbox table. Each time someone publishes a message, a reference to that message will be put in a Mailbox(owner, message, date) table, and we'll create an index on the table: (owner, date DESC). This changes the query to:

{code}
SELECT messages.*
  FROM messages, mailbox
 WHERE mailbox.owner = current_user
   AND messages.id = mailbox.message
 ORDER BY mailbox.date DESC
 LIMIT 20
{code}

Depending on your RDBMS, you will wind up with an O(log n) operation. You find the newest mailbox entry by user (O(log n)) and do an index walk until you've found 20 entries. (I'm putting aside the fact that looking up the 20 messages themselves costs another O(log n) probe apiece, because 20 is a small number and the messages will likely be in the database's cache... this operation is going to be fast.)

I'm going to sidetrack for a moment. I had the pleasure of talking over a few beers at a baseball game with one of the senior engineers at Facebook. We were talking about Facebook's scaling success. His comment was that it was successful but very expensive. If there were more than 3% cache misses from MySQL queries, the system would back up. If they got more than 2% cache misses from the memcached stuff in front of their MySQL servers, the system would back up. So, basically, Facebook has 195% of their data in RAM.

The net is that O(log n) is only going to work if you've got your entire index in the cache of your RDBMS. Even a dozen disk reads is going to turn a 10ms query into a 250ms query, and if you've got 1,000 users asking for a status update, you'll wind up with disk thrashing and ultimately you will not be able to satisfy all of those requests.

Let's make our discussion more concrete. I'm assuming that an ESME instance will support 25,000 users. On average, a user will follow 100 people (100x fan-out of messages). Users will post one message every 30 minutes (48 messages a day). The day lasts 10 hours (this is a reasonable approximation for peakiness... basically, you're compressing 48 message sends into a 10-hour period). There are 300 days in a year. These numbers are averages, and there will be some folks who are above average in terms of fan-out (the CEO will have a 25,000x fan-out) and some folks who are above average in number of messages per day.

So, that means that each year, there will be 36,000M (36B) mailbox entries. If each entry costs us 16 bytes of RAM for index purposes, that means we're at 576GB of index (the arithmetic is checked in the sketch below). There's no way that amount of index will fit in RAM. And if the average drops to one message per day, you're still looking at roughly 10GB of index. Alternatively, you could purge messages after 3 weeks or limit timelines to a certain number of messages. That's not unreasonable, but it's also adding a constraint to the system to deal with limitations of the RDBMS. There are other alternatives.

Let me talk memcached for a minute. In my opinion, memcached means that you have a failed design. Memcached means abandoning all the awesome things that you get with an RDBMS: a mathematical model, a concurrency/transactional model, durability guarantees, etc.
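For the skeptical, here is the back-of-envelope arithmetic behind the 36B-entry and 576GB figures as a tiny runnable check. The inputs are just the assumed averages from the text:

{code}
// Back-of-envelope check of the mailbox index estimate (assumed averages).
object MailboxMath extends App {
  val users       = 25000L
  val msgsPerDay  = 48L   // one message every 30 minutes
  val fanOut      = 100L  // average number of followers receiving each message
  val daysPerYear = 300L
  val entryBytes  = 16L   // assumed per-entry index cost

  val entriesPerYear = users * msgsPerDay * fanOut * daysPerYear
  println(entriesPerYear)              // 36000000000  -> 36B mailbox entries
  println(entriesPerYear * entryBytes) // 576000000000 -> ~576GB of index
}
{code}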
But we could move our state from the calculate-on-demand model of the RDBMS to a calculate-once-and-cache model using memcached. This means that you only take the nasty hits if the cache is not valid. Putting aside the cost of cache invalidation (I haven't covered the costs of updates in this discussion because there's no need to go there... the implementation failures can be demonstrated with just reads), if you have a simple cache invalidation scheme, most of the cache entries will not survive for 15 minutes (I can go through the math, but I'm going to leave this one to the reader). You risk cache stampedes (more than one process rebuilding the same cache entry). Basically, the naive memcached implementation buys you a little bit of headroom over the naive (non-mailbox) approach. In order to get more than a 5x or so improvement (something that will serve a few thousand rather than a few hundred users), you need to manipulate the cache entries, inserting/deleting individual messages.

The above paragraph in fact leads us in the direction of a better answer. But first, let me state that I have proven that an RDBMS cannot be the sole locus of state for a social messaging site that services more than a few hundred users. Period. We must move state somewhere else and manage the cached state manually rather than with queries and indexes. Second, I have not discussed short-lived vs. long-lived sessions yet. I will get to that, but first, let's walk through a design that gives us a concurrency model as well as the performance we want.

Imagine a model where you interact with a User with a limited set of (asynchronous) messages:

*   add/remove friend
*   add message to timeline
*   post message (the user has created a message and it needs to be processed)
*   get current timeline (with offsets and number of entries)

These are the basic messages needed to implement a social messaging site. If we guarantee that a User will only process one message at a time, we have a concurrency model. It's simple, and simple is good. We have not defined how/where Users store their state (it could be on a filesystem, in an RDBMS, in a NoSQL store, who knows). But we can say that adding a message is an O(1) operation (prepending to the head of a singly linked list). Each User can have a caching policy (and that caching policy could be dynamic based on the access characteristics for the User). The sender of the message doesn't block on the processing of the message (although the get-current-timeline message will have an asynchronous response that the sender will likely block on).

We have changed our abstraction from one where all data (tables and indexes) are created equal to one where certain data structures are more prominent (User and Message) than others (mailbox, friends).

We have lost something: transactions. In this model, if I add Dick as a friend, I am not guaranteed that I will receive Dick's next update... it may take time for the messages to propagate to Dick's User, and his Message may be sent before the "add friend" message gets to him. In the case of a financial transaction, this would be fatal. In the case of social networking, this is a perfectly reasonable trade-off.

So far, we have not talked about long-lived sessions and how they are valuable in such a model... and in particular in ESME. If we add one more message to our User, some of the reasons for long-lived sessions should become obvious: update me on timeline change. A sketch of such a User follows.
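Here is a minimal Scala sketch of such a User, written as an actor so that "one message at a time" falls out of the mailbox semantics. To be clear, this is my illustration under the assumptions above, not ESME's actual classes; all names are hypothetical.

{code}
import scala.actors.Actor
import scala.actors.Actor._

case class Message(poster: String, content: String, date: Long)

case class AddFriend(friend: UserActor)   // start following `friend`
case class RemoveFriend(friend: UserActor)
case class Follower(who: UserActor)       // internal: `who` now follows us
case class AddToTimeline(msg: Message)    // fan-out delivery
case class PostMessage(content: String)   // the user created a message
case class GetTimeline(count: Int)        // request with asynchronous reply
case class AddListener(who: Actor)        // "update me on timeline change"

class UserActor(val name: String) extends Actor {
  private var friends:   List[UserActor] = Nil
  private var followers: List[UserActor] = Nil
  private var timeline:  List[Message]   = Nil // newest first
  private var listeners: List[Actor]     = Nil

  def act() {
    loop {
      react { // one message processed at a time: the whole concurrency model
        case AddFriend(f)    => friends ::= f; f ! Follower(this)
        case RemoveFriend(f) => friends = friends.filterNot(_ == f)
        case Follower(w)     => followers ::= w
        case AddToTimeline(m) =>
          timeline ::= m           // O(1) prepend to a singly linked list
          listeners.foreach(_ ! m) // push the change; no polling required
        case PostMessage(text) =>
          val m = Message(name, text, System.currentTimeMillis)
          this ! AddToTimeline(m)  // our own timeline gets it, too
          followers.foreach(_ ! AddToTimeline(m)) // asynchronous fan-out
        case GetTimeline(n)  => reply(timeline.take(n))
        case AddListener(a)  => listeners ::= a
      }
    }
  }
}
{code}

Note how the lost-transaction trade-off shows up directly in the sketch: AddFriend merely sends Follower along, so a PostMessage processed before it arrives simply won't reach the new friend.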
If you can register with the User for changes to the timeline, it means that we don't have to keep asking "are we there yet?" When a state change happens, it's instantly propagated out to the listeners. The alternative is for the listeners to ask "are we there yet?" over and over. The cost of asking "are we there yet?" is non-trivial, as anyone who has traveled with 5-year-olds can attest. Additionally, sometimes, when one is having a conversation, it's nice to get an immediate response rather than waiting out some polling period. And with a listener model, the User does not need to store the date of each message (give me new messages since xxx), and that cuts down cache storage costs by 50% (a big number across 25,000 users).

So, having a long-lived session has some performance benefits over a short-lived session and polling, but this is only part of the story.

One of the ways that RDBMSs get performance (and the way products like Oracle distinguish themselves from the likes of MySQL) is the ability to cache optimized query plans, cache the right data, and invalidate the right caches at the right time. The same requirements are going to come up in ESME.

When I designed ESME, I changed the model from a Skittr model (1M users on a single box) to a more enterprise-friendly model. The key difference is that I added the "actions" feature, where each User gets to see each message processed in the system, analyze that message for content/context, and perform certain actions based on that analysis. Things like "add all messages containing 'catfood' to my timeline" or "forward all messages containing 'ESME' to my followers" or "make an HTTP post of all messages from my boss to a paging service" or "block 50% of the messages from Joe Blabbermouth".

Actions are cool, but they are costly. Every message must be compared to every action definition in the system, and this is expensive. If each user has an average of 10 actions, each message sent will have to be compared against 250,000 actions, and if we have a peak of 5 messages per hour per person, that's 31B comparisons per hour at peak time, or roughly 9M action comparisons per second. That's load.

During peak load, we will need to prioritize which Users are processing messages/actions such that the system retains responsiveness and can drain the load. Put another way, knowing which Users have associated long-lived sessions allows us to prioritize the message processing for those Users. We allow more threads to drain the message queues for those Users while providing fewer threads for session-less Users. Yeah, we could prioritize on other heuristics, but long-lived session is dead simple and will cost us 5K bytes per logged-in user. Not a huge cost, and lots of benefit.

So, both the fact that long-lived sessions with long polling are more efficient than short-lived sessions with repeated polling and the upcoming need for message prioritization indicate that long-lived sessions are the right design choice.

Also, I hope that the above discussion makes it clear why I am insistent on message-oriented APIs rather than document/REST-oriented APIs. ESME's design is not traditional, and there are fewer tools helping us get the implementation right. On the other hand, implementing ESME on top of a relational/REST model cannot be done. Let's keep our design consistent from the APIs back.
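To close the loop on the push model, here is a hedged usage sketch built on the hypothetical UserActor above: a long-lived session registers a listener once and is pushed each new message, instead of re-running "are we there yet?" polls.

{code}
object ListenerDemo {
  import scala.actors.Actor._

  def main(args: Array[String]) {
    val alice = new UserActor("alice"); alice.start()
    val bob   = new UserActor("bob");   bob.start()

    bob ! AddFriend(alice) // bob follows alice (Follower propagates async)

    // Stand-in for a comet/long-poll session: register once, then get pushed.
    val session = actor {
      loop {
        react { case m: Message => println("pushed to bob: " + m.content) }
      }
    }
    bob ! AddListener(session)

    alice ! PostMessage("state changes are pushed, not polled")
  }
}
{code}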