term created to poke at SQL. In reality, the term means Not Only SQL. The idea
is that both technologies can coexist and each has its place. The NoSQL movement
has been in the news in the past few years as many of the Web 2.0 leaders have
adopted a NoSQL technology. Companies like Facebook, Twitter, Digg, Amazon,
LinkedIn and Google all use NoSQL in one way or another. Let's break down NoSQL
so you can explain it to your CIO or even your co-workers.
NoSQL Emerged From a Need
Data Storage: The world's stored digital data is measured in exabytes. An
exabyte is equal to one billion gigabytes (GB) of data. According to
Internet.com, the amount of stored data added in 2006 was 161 exabytes. Just 4
years later, in 2010, the amount of data stored will be almost 1,000 exabytes,
which is an increase of over 500%. In other words, there is a lot of data being
stored in the world and it's just going to continue growing.
Interconnected Data: Data continues to become more connected. The
creation of the web ushered in hyperlinks, blogs have pingbacks, and every major
social network system has tags that tie things together. Major systems are built
to be interconnected.
Complex Data Structure: NoSQL can handle hierarchical nested data
structures easily. To accomplish the same thing in SQL, you would need multiple
relational tables with all kinds of keys. In addition, there is a relationship
between performance and data complexity. Performance can degrade in a
traditional RDBMS as we store the massive amounts of data required in social
networking applications and the semantic web.
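To make the contrast concrete, here is a minimal Python sketch, not tied to any particular database. It stores a hypothetical blog post once as a single nested document and once as flat, relational-style tables linked by keys. All names and values (post_document, assemble_post, the sample post) are invented for illustration.

```python
import json

# A hypothetical blog post stored as one nested document.
post_document = {
    "id": 1,
    "title": "Hello NoSQL",
    "author": {"name": "Jon Foobar", "weblog": "http://example.com/jon"},
    "tags": ["nosql", "databases"],
    "comments": [
        {"user": "alice", "text": "Nice overview."},
        {"user": "bob", "text": "What about joins?"},
    ],
}

# The same data spread across relational-style tables linked by keys.
posts = [{"id": 1, "title": "Hello NoSQL", "author_id": 10}]
authors = [{"id": 10, "name": "Jon Foobar", "weblog": "http://example.com/jon"}]
tags = [{"post_id": 1, "tag": "nosql"}, {"post_id": 1, "tag": "databases"}]
comments = [
    {"post_id": 1, "user": "alice", "text": "Nice overview."},
    {"post_id": 1, "user": "bob", "text": "What about joins?"},
]


def assemble_post(post_id):
    """Rebuild the nested view by following keys across tables (a manual join)."""
    post = next(p for p in posts if p["id"] == post_id)
    author = next(a for a in authors if a["id"] == post["author_id"])
    return {
        "id": post["id"],
        "title": post["title"],
        "author": {"name": author["name"], "weblog": author["weblog"]},
        "tags": [t["tag"] for t in tags if t["post_id"] == post_id],
        "comments": [
            {"user": c["user"], "text": c["text"]}
            for c in comments
            if c["post_id"] == post_id
        ],
    }


if __name__ == "__main__":
    # The document form is read in one step; the table form needs four lookups.
    print(json.dumps(assemble_post(1), indent=2))
```

The nested form keeps everything about the post in one place, while the table form needs one lookup per table to rebuild the same view.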
What is NoSQL?
One way to define NoSQL is to consider what it's not. It's not SQL and it's not
relational. As the name suggests, it's not a replacement for an RDBMS but
complements it. NoSQL is designed for distributed data stores for very large
scale data needs. Think about Facebook with its 500,000,000 users or Twitter,
which accumulates terabytes of data every single day.
In a NoSQL database, there is no fixed schema and there are no joins. An RDBMS
"scales up" by getting faster and faster hardware and adding memory. NoSQL, on
the other
hand, can take advantage of "scaling out". Scaling out refers to spreading the
load over many commodity systems. This is the component of NoSQL that makes it
an inexpensive solution for large datasets.
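As a rough illustration of scaling out, the sketch below spreads keys over several in-memory "nodes" by hashing each key. It is a toy model under stated assumptions: the ShardedStore class and the node names are hypothetical, each node is just a Python dict, and the simple modulo placement is a stand-in for the more robust schemes (such as the consistent hashing described in the Dynamo paper) that real systems use.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads keys over several nodes by hash."""

    def __init__(self, node_names):
        # Each "node" is a plain dict standing in for a commodity machine.
        self.node_names = sorted(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        # Hash the key and map it onto one of the nodes (simple modulo placement).
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self.node_for(key)].get(key, default)


if __name__ == "__main__":
    store = ShardedStore(["node-a", "node-b", "node-c"])
    for i in range(1000):
        store.put(f"user:{i}", {"id": i})
    # Show how many keys landed on each node.
    for name, data in store.nodes.items():
        print(name, len(data))
```

Adding capacity in this model means adding another node name. With modulo placement that would move most existing keys, which is one reason production systems prefer consistent hashing.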
NoSQL Categories
The current NoSQL world fits into 4 basic categories.
- Key-Value Stores are based primarily on
Amazon's Dynamo Paper which was written in
2007. The main idea is the existence of a hash table where there is a unique key
and a pointer to a particular item of data. These mappings are usually
accompanied by cache mechanisms to maximize performance.
- Column Family Stores were created to store and process very large amounts
of data distributed over many machines. There are still keys but they point to
multiple columns. In the case of
BigTable (Google's Column Family NoSQL model), rows are
identified by a row key with the data sorted and stored by this key. The columns
are arranged by column family.
- Document Databases were inspired by
Lotus Notes and are similar to key-value stores. The model is basically
versioned documents that are collections of other key-value collections. The
semi-structured documents are stored in formats like
JSON.
- Graph Databases are built with nodes, relationships between nodes and the
properties of nodes. Instead of tables of rows and columns and the rigid
structure of SQL, a flexible graph model is used which can scale across many
machines.
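The four categories above can be pictured with plain Python structures. This is only a sketch of the data shapes, not of how any real product stores data. Every key, value, and helper name here (scan_rows, related, the sample users and posts) is made up for illustration.

```python
# Key-value store: a unique key maps to one opaque value.
key_value_store = {"cart:1001": '{"items": ["book", "pen"]}'}

# Column family store: row key -> column family -> column -> value.
column_family_store = {
    "user:002": {"profile": {"name": "Bob"}, "activity": {"last_login": "2010-05-30"}},
    "user:001": {"profile": {"name": "Alice"}, "activity": {"last_login": "2010-06-01"}},
}

# Document database: a key maps to a semi-structured, JSON-like document.
document_store = {
    "post:1": {"title": "Hello NoSQL", "tags": ["nosql"], "author": {"name": "Alice"}},
}

# Graph database: nodes with properties, plus relationships between nodes.
graph_nodes = {
    "alice": {"kind": "person"},
    "bob": {"kind": "person"},
    "post:1": {"kind": "post"},
}
graph_edges = [("alice", "follows", "bob"), ("alice", "wrote", "post:1")]


def scan_rows(store, family):
    """Yield (row_key, columns) for one column family, in row-key order."""
    for row_key in sorted(store):
        yield row_key, store[row_key].get(family, {})


def related(node, relation):
    """Return the nodes reachable from `node` over edges labeled `relation`."""
    return [target for source, rel, target in graph_edges
            if source == node and rel == relation]


if __name__ == "__main__":
    print(list(scan_rows(column_family_store, "profile")))
    print(related("alice", "follows"))
```

Each shape suits a different access pattern: a direct lookup by key, an ordered scan by row key, retrieval of a whole nested document, or traversal along relationships.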
Major NoSQL Players
The major players in NoSQL have emerged primarily because of the organizations
that have adopted them. Some of the largest NoSQL technologies include:
- Dynamo:
Dynamo was created by Amazon.com and is the most prominent Key-Value NoSQL
database. Amazon was in need of a highly scalable distributed platform for their
e-commerce businesses so they developed Dynamo.
Amazon S3 uses Dynamo as the storage mechanism.
- Cassandra: Cassandra was open sourced by Facebook and is a column
oriented NoSQL database.
- BigTable: BigTable is Google's proprietary column
oriented database. Google allows the use of BigTable but only for the Google App
Engine.
- SimpleDB:
SimpleDB is another Amazon database. Used for Amazon EC2 and S3, it is part
of Amazon Web Services that charges fees depending on usage.
- CouchDB:
CouchDB and MongoDB are open source document-oriented NoSQL
databases.
- Neo4j: Neo4j
is an open source graph database.
Querying NoSQL
The question of how to query a NoSQL database is what most developers are
interested in. After all, data stored in a huge database doesn't do anyone any
good if you can't retrieve and show it to end users or web services. NoSQL
databases do not provide a high level declarative query language like SQL.
Instead, querying these databases is data-model specific.
Many of the NoSQL platforms allow for RESTful interfaces to the data. Others
offer query APIs. There are a couple of query tools that have been developed
that attempt to query multiple NoSQL databases. These tools typically work
across a single NoSQL category. One example is SPARQL.
SPARQL is a declarative query specification designed for graph databases. Here
is an example of a SPARQL query that retrieves the URL of a particular blogger
(courtesy of IBM):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?url
FROM <bloggers.rdf>
WHERE {
  ?contributor foaf:name "Jon Foobar" .
  ?contributor foaf:weblog ?url .
}
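To show what the query above is asking for, here is a small Python sketch that matches the same two patterns against an in-memory list of (subject, predicate, object) triples. It is not a SPARQL engine. The sample triples, the person identifiers, and the weblogs_for helper are assumptions invented for this illustration, since the contents of bloggers.rdf are not given here.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical contents of a tiny graph, as (subject, predicate, object) triples.
triples = [
    ("person:1", FOAF + "name", "Jon Foobar"),
    ("person:1", FOAF + "weblog", "http://example.org/jon/blog"),
    ("person:2", FOAF + "name", "Jane Doe"),
    ("person:2", FOAF + "weblog", "http://example.org/jane/blog"),
]


def weblogs_for(name, data):
    """Return weblog URLs of every contributor whose foaf:name equals `name`."""
    # Pattern 1: ?contributor foaf:name "<name>"
    contributors = {s for s, p, o in data if p == FOAF + "name" and o == name}
    # Pattern 2: ?contributor foaf:weblog ?url (same ?contributor binding)
    return [o for s, p, o in data if p == FOAF + "weblog" and s in contributors]


if __name__ == "__main__":
    print(weblogs_for("Jon Foobar", triples))
```

The shared ?contributor variable is what ties the two patterns together: only subjects that satisfy the name pattern are considered when collecting weblog URLs.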
Future of NoSQL
Organizations that have massive data storage needs are looking seriously at
NoSQL. Apparently, the concept isn't getting as much traction in smaller
organizations. In a survey conducted by
Information Week, 44% of business IT
professionals haven't heard of NoSQL. Further, only 1% of the respondents
reported that NoSQL is a part of their strategic direction. Clearly, NoSQL has
its place in our connected world but will need to continue to evolve to get the
mass appeal that many think it could have.