ElasticSearch 0.10.0 Released

By Shay Banon | 27 Aug 2010

ElasticSearch version 0.10.0 has just been released. You can download it here. This is a major release for elasticsearch, both in terms of feature set as well as stability.

A partial list of the major features:

Geo Support

Geo location support has been added, enabling geo based queries (distance, bounding box, polygon) as well as distance based facets. More info can be found here.

Update Number of Replicas Dynamically

The number of replicas an index has can now be changed using a simple API.

More Facets

Range, filter, and more term facet options.

More Mapping Options

Ability to compress the _source field with extensive optimization at decompression only when needed (for example, decompressing directly down into the REST stream).

New Gateway Structure

A new gateway structure that reduces the chances of gateway corruption and lays the groundwork for future options such as saving versions of the gateway and recovering from them. Here is the upgrade script.

Transport Compression

The ability to configure the communication between nodes to work in a compressed mode, as well as different components using it by default (for example, peer recovery fetches the index in compressed mode).

Minor Enhancements, Bug Squashing

A lot of work has gone into improving the stability of elasticsearch, better memory management, and squashing major bugs. Several companies are already using snapshot versions of 0.10 to successfully index very large amounts of data on large clusters.

-shay.banon

Geo Location and Search

By Shay Banon | 16 Aug 2010

One of the coolest combinations in search technology is the ability to combine geo and search. Queries such as “give me all the restaurants that serve meat ([insert your query here]) within 20 miles of me”, or “create a distance heat map of them”, are slowly becoming a must have for any content website. This is becoming even more relevant as new browsers support the Geolocation API.

Already in master (and in the upcoming 0.9.1 release), elasticsearch comes with rich support for geo location. Let's take a drive down the geo support path:

Indexing Location Aware Documents

In general, documents do not need a predefined mapping in order to use geo location features, but if no mapping is defined they should conform to a convention. For example, let's take a “pin” for which we want to index its location and maybe some tags it is associated with:

{
    "pin" : {
        "location" : {
            "lat" : 40.12,
            "lon" : -71.34
        },
        "tag" : ["food", "family"],
        "text" : "my favorite family restaurant"
    }
}

The location element is a “geo enabled” location since it has lat and lon properties. Once one follows the above conventions, all geo location features are enabled for pin.location.

If an explicit setting is still required, it's easy to define a mapping that declares a certain property as a geo_point. Here is an example:

{
    "pin" : {
        "properties" : {
            "location" : {
                "type" : "geo_point"
            }
        }
    }
}

Defining the location property as geo_point means that we can now index location data in many different formats, from the lat/lon example above up to geohash. For information on all the available formats, check out 278.
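
To give a feel for the geohash format mentioned above, here is a minimal Python sketch of the publicly documented geohash encoding, which packs a lat/lon pair into a short base-32 string by repeatedly halving the longitude and latitude ranges. It is not elasticsearch code, and the function name geohash_encode is made up for this illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=12):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(42.605, -5.603, precision=5))  # ezs42
print(geohash_encode(40.12, -71.34))  # the pin location from the example above
```

Nearby points tend to share a common prefix, which is what makes the format convenient as a compact location value.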

Find By Location

The first thing to do after indexing location aware documents is to query them. There are several ways to query such information; the simplest is by distance. Here is an example:

{
    "filtered" : {
        "query" : {
            "field" : { "text" : "restaurant" }
        },
        "filter" : {
            "geo_distance" : {
                "distance" : "12km",
                "pin.location" : {
                    "lat" : 40,
                    "lon" : -70
                }
            }
        }
    }
}

The above will search for all documents whose text matches restaurant and that lie within 12km of the provided location. The location point can be given in several different formats as well, detailed at 279.
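
Conceptually, a distance filter keeps only the documents whose location is close enough to the given point. The Python sketch below illustrates that idea with the haversine great-circle formula. How elasticsearch actually computes distances is not described here, so the formula, the Earth radius constant, and the function names are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(location, center, max_km):
    """True if `location` lies within `max_km` of `center` (dicts with lat/lon)."""
    return haversine_km(location["lat"], location["lon"],
                        center["lat"], center["lon"]) <= max_km


center = {"lat": 40, "lon": -70}
pin = {"lat": 40.12, "lon": -71.34}      # the pin indexed earlier
nearby = {"lat": 40.05, "lon": -70.05}   # a made-up second location

print(within_distance(pin, center, 12))     # False, the pin is over 100km away
print(within_distance(nearby, center, 12))  # True
```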

The next supported query is a bounding box query, which restricts the results to a geo box defined by its top left and bottom right coordinates. Here is an example:

{
    "filtered" : {
        "query" : {
            "field" : { "text" : "restaurant" }
        },
        "filter" : {
            "geo_bounding_box" : {
                "pin.location" : {
                    "top_left" : {
                        "lat" : 40.73,
                        "lon" : -74.1
                    },
                    "bottom_right" : {
                        "lat" : 40.717,
                        "lon" : -73.99
                    }
                }
            }
        }
    }
}
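
The check behind a bounding box filter can be sketched in a few lines of Python. This is only an illustration of the idea; the function name is hypothetical, and the sketch assumes the box does not cross the 180 degree meridian.

```python
def in_bounding_box(point, top_left, bottom_right):
    """True if `point` lies inside the box (all arguments are lat/lon dicts)."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok


# The box from the query above.
top_left = {"lat": 40.73, "lon": -74.1}
bottom_right = {"lat": 40.717, "lon": -73.99}

print(in_bounding_box({"lat": 40.72, "lon": -74.0}, top_left, bottom_right))   # True
print(in_bounding_box({"lat": 40.12, "lon": -71.34}, top_left, bottom_right))  # False
```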

The last, and most advanced, form of geo query is a polygon based search. Here is an example:

{
    "filtered" : {
        "query" : {
            "field" : { "text" : "restaurant" }
        },
        "filter" : {
            "geo_polygon" : {
                "pin.location" : {
                    "points" : [
                        {"lat" : 40, "lon" : -70},
                        {"lat" : 30, "lon" : -80},
                        {"lat" : 20, "lon" : -90}
                    ]
                }
            }
        }
    }
}
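
A common way to test whether a point falls inside a polygon is ray casting: count how many polygon edges a horizontal ray from the point crosses, and treat an odd count as “inside”. The Python sketch below shows that technique on flat lat/lon coordinates. Whether elasticsearch uses this exact method is an assumption, and the function name is made up. Note that the three sample points in the query above happen to lie on a single line, so the sketch uses its own triangle.

```python
def in_polygon(point, polygon):
    """Ray casting test; `point` and the polygon vertices are lat/lon dicts."""
    x, y = point["lon"], point["lat"]
    inside = False
    n = len(polygon)
    for i in range(n):
        a = polygon[i]
        b = polygon[(i + 1) % n]  # wrap around to close the polygon
        ax, ay = a["lon"], a["lat"]
        bx, by = b["lon"], b["lat"]
        # Does the edge a-b straddle the horizontal line through the point?
        if (ay > y) != (by > y):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = ax + (y - ay) * (bx - ax) / (by - ay)
            if x < x_cross:
                inside = not inside
    return inside


triangle = [
    {"lat": 40, "lon": -70},
    {"lat": 30, "lon": -80},
    {"lat": 20, "lon": -60},
]

print(in_polygon({"lat": 31, "lon": -70}, triangle))  # True
print(in_polygon({"lat": 10, "lon": -70}, triangle))  # False
```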

Sorting

The ability to sort results not just by ranking (how relevant the document is to the query), but also by distance, allows for much greater geo usability. There is now a new _geo_distance sort type that sorts by distance from a specific location:

{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [-40, 70],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "field" : { "text" : "restaurant" }
    }
}

On top of that, elasticsearch will now return, for each hit, the values of the fields sorted on, making it easy to display this important information.

Faceting

Faceting, the ability to show aggregated views on top of the search results, goes hand in hand with geo. For example, one would like to get the number of hits matching the search query within 10 miles, 20 miles, and beyond from a given location. The geo distance facet provides just that:

{
    "query" : {
        "field" : { "text" : "restaurant" }
    },
    "facets" : {
        "geo1" : {
            "geo_distance" : {
                "pin.location" : {
                    "lat" : 40,
                    "lon" : -70
                },
                "ranges" : [
                    { "to" : 10 },
                    { "from" : 10, "to" : 20 },
                    { "from" : 20, "to" : 100 },
                    { "from" : 100 }
                ]
            }
        }
    }
}
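
What the facet computes can be sketched as simple bucket counting over the distance of each hit. In the Python sketch below, the function name is hypothetical and the distances are made-up sample values. Treating from as inclusive and to as exclusive is an assumption, not something stated above.

```python
def geo_distance_facet(distances, ranges):
    """Count how many distances fall into each from/to range."""
    buckets = []
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        # Assumption: "from" is inclusive and "to" is exclusive.
        count = sum(1 for d in distances if lo <= d < hi)
        buckets.append({**r, "count": count})
    return buckets


# The ranges from the facet request above.
ranges = [
    {"to": 10},
    {"from": 10, "to": 20},
    {"from": 20, "to": 100},
    {"from": 100},
]
distances = [3, 10, 15, 20, 50, 250]  # made-up distance per hit

for bucket in geo_distance_facet(distances, ranges):
    print(bucket)
```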

Summary

The combination of search with geo is a natural one, and is slowly becoming critical to any (web) application, especially with HTML 5 and mobile devices becoming more and more widespread. The upcoming geo support in elasticsearch brings this integration to a whole new level, and enables applications to provide rich geo and search functionality easily (ohh, and scale ;) ).

-shay.banon

ElasticSearch 0.9.0 Released

By Shay Banon | 26 Jul 2010

ElasticSearch version 0.9.0 has just been released. You can download it here. This is a major release for elasticsearch, both in terms of feature set as well as stability.

A partial list of the major features:

Facets Support

Facets provide an aggregated view of data correlated with the executed search query. ElasticSearch now comes with several facet implementations, including the typical “terms” facets (returning the most popular terms and how often they occur) and statistical facets, which provide statistical information on numeric fields including count, total, mean, min, max, variance, sum of squares, and standard deviation. And, the coolest facet type, histogram facets, which break a field into buckets and provide data on the relevant buckets derived from the same field, another field, or a script.
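
To make the statistical and histogram facets more concrete, here is a rough Python sketch of the numbers they describe. The function names are hypothetical, the use of population variance is an assumption, and so is the choice of keying each histogram bucket by the lower bound of its interval.

```python
import math
from collections import Counter


def statistical_facet(values):
    """Summary statistics over a non-empty list of numbers."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    mean = total / count
    # Assumption: population variance, computed as E[x^2] - (E[x])^2.
    variance = sum_of_squares / count - mean * mean
    return {
        "count": count,
        "total": total,
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(max(variance, 0.0)),
    }


def histogram_facet(values, interval):
    """Count values per bucket; each key is the lower bound of its bucket."""
    buckets = Counter(math.floor(v / interval) * interval for v in values)
    return dict(sorted(buckets.items()))


print(statistical_facet([2, 4, 4, 4, 5, 5, 7, 9]))
print(histogram_facet([1, 5, 12, 18, 25], interval=10))  # {0: 2, 10: 2, 20: 1}
```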

Scripting Support

Added as a general feature within elasticsearch, scripting allows defining scripts that are evaluated at runtime and can be used in different elasticsearch features, such as facets, script search fields, script filter, and so on.

More Queries and Filters

Additional queries and filters have been added. Thanks to the Query DSL of elasticsearch, adding queries is a snap. Queries include fuzzy query, custom score query (based on scripts), script filter, and/or/not filters, and more.

Improved Gateway Recovery

A major feature in elasticsearch: existing index files can now be reused when recovering from the gateway after a full cluster restart, significantly reducing the time recovery takes. This includes additions to the gateway behavior, such as the ability to control when the initial recovery happens as a function of the number of nodes in the cluster and of time. Also, the shutdown API has been enhanced to better handle full cluster shutdown.

Script Search Fields

The ability to load custom data (based on non stored fields) as part of the search request.

Improved Fluent Java / Groovy API

The Java / Groovy API has been greatly enhanced to provide more fluent API execution.

AWS Cloud Specific Plugin

The cloud API has been rewritten to use the Amazon AWS API directly, providing better stability and features when using AWS. The cloud plugin now only works with Amazon AWS.

Stability, Bug Squashing, and Memory Usage Improvements

A lot of work has gone into improving the stability of elasticsearch, better memory management, and squashing major bugs. Several companies are already using snapshot versions of 0.9 to successfully index very large amounts of data on large clusters.

-shay.banon

ElasticSearch 0.8.0 Released

By Shay Banon | 27 May 2010

ElasticSearch version 0.8.0 has just been released. You can download it here. This release includes several bug fixes and memory footprint improvements, and one major feature, Hadoop integration. This allows using Hadoop HDFS as elasticsearch gateway storage, and enabling it is as simple as:

Installing the hadoop plugin using bin/plugin -install hadoop.

Changing the configuration to include:

gateway:
    type: hdfs
    hdfs:
        uri: hdfs://host:port
        path: path/to/folder

-shay.banon

ElasticSearch 0.7.1 Released

By Shay Banon | 17 May 2010

ElasticSearch version 0.7.1 has just been released. You can download it here. This release fixed a major bug when indexing large documents resulting in storing additional null bytes (and returning them).

Version 0.7.1 also brings a major feature to elasticsearch: recovery throttling. In elasticsearch, there are two types of recovery. The first is recovery from the gateway. This happens only when the first shard is allocated in the cluster. The second happens when nodes move or allocate shards around. In both cases, the recovery process includes recovering both the index files of each shard and the transaction log.

Up until version 0.7.1, elasticsearch would basically go full force in performing the recovery. If a new node joined the cluster, all the possible shards would be allocated to it, and all would perform recovery in parallel. Moreover, the recovery of each single shard index file would happen in parallel as well.

This can lead to a heavy load on the nodes, making them less responsive to ongoing operations. From version 0.7.1, recovery throttling is enabled, allowing only a controlled number of concurrent recovery operations and of concurrent stream (single shard index file) recovery operations. Both counts are maintained at the node level, regardless of the number of indices or shards.

The indices.recovery.throttler.concurrent_recoveries setting controls the number of concurrent recoveries allowed (shard recoveries). It defaults to the number of cores. The indices.recovery.throttler.concurrent_streams setting controls the number of concurrent shard index file recoveries, and defaults to the number of cores as well.
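
The two limits described above behave like a pair of node-wide counting semaphores: one bounding how many shard recoveries run at once, the other bounding how many file streams run at once across all of them. The Python sketch below models that idea only. It is not how elasticsearch implements throttling, and the class name, method names, and the copy_file callback are all hypothetical.

```python
import threading


class RecoveryThrottler:
    """Node-level limits on concurrent shard recoveries and file streams."""

    def __init__(self, concurrent_recoveries, concurrent_streams):
        self._recoveries = threading.Semaphore(concurrent_recoveries)
        self._streams = threading.Semaphore(concurrent_streams)

    def recover_shard(self, files, copy_file):
        """Recover one shard; blocks while too many recoveries are running."""
        with self._recoveries:
            workers = [
                threading.Thread(target=self._recover_file, args=(name, copy_file))
                for name in files
            ]
            for worker in workers:
                worker.start()
            for worker in workers:
                worker.join()

    def _recover_file(self, name, copy_file):
        """Copy one index file; blocks while too many streams are open."""
        with self._streams:
            copy_file(name)


# One recovery at a time, and at most two file streams at any moment.
throttler = RecoveryThrottler(concurrent_recoveries=1, concurrent_streams=2)
throttler.recover_shard(["segments_1", "_0.cfs"], copy_file=print)
```

Because both semaphores belong to the throttler rather than to a shard or an index, the limits hold for the node as a whole, which matches the node-level counting described above.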

-shay.banon

ElasticSearch 0.7.0 Released

By Shay Banon | 14 May 2010

ElasticSearch version 0.7.0 has just been released. You can download it here. This release brings much improved stability, and several features:

Zen Discovery

A discovery module called zen, built from the ground up to work well and fast with elasticsearch. This is now the default discovery module, with the jgroups discovery module moving to be provided as a plugin.

Groovy Client

A native Groovy client providing a Groovyfied API built on top of the native Java API. More details provided in the ElasticSearch Just Got Groovy blog post. As a side note, anybody up for building a Scala/JRuby client?

Cloud

First and foremost, native cloud support, providing zero conf cloud discovery ( No Special Node™ ) and the ability to persist long term index storage on different cloud providers blob stores. More information can be found in the Here Comes the Cloud blog post.

Memcached Transport

For that extra oomph when HTTP is not fast enough (mainly from other languages), elasticsearch supports a subset of the memcached protocol. Basically, the implementation implements REST on top of memcached (as much as possible). More info can be found here.

Simpler Plugin Management

Many things in elasticsearch are implemented as a plugin. For example, the cloud support or memcached support are implemented as plugins. Now, installing a plugin is as simple as issuing the following command:

bin/plugin -install cloud
bin/plugin -install transport-memcached

Analysis ICU

Better support when working with unicode through the ICU analysis plugin. More info here.

More APIs

More information on nodes using the new Node stats API, as well as the ability to restart a node.

JVM Clients

Simpler dependency management, requiring only lucene as a dependency.

XContent

Though currently mainly for internal use, an abstraction inspired by JSON, called XContent, has been created on top of it. There is a JSON implementation for it, as well as support for XSON, a binary JSON format for faster and smaller (message footprint) messages. The Java API already uses it automatically (not for indexed documents), and both the REST API and the indexed documents can be either in JSON or XSON format. XSON format will be documented in the near future to allow for non JVM based clients to use it.

-shay.banon

Here Comes The Cloud

By Shay Banon | 11 May 2010

From the get go, elasticsearch has been designed and built for the cloud. From its internal architecture, to how it works in its distributed nature. In the upcoming 0.7 version, the cloud vision has been fully realized.

The Cloud integration revolves around two major components in ElasticSearch: Discovery and Gateway.

Cloud Discovery

One of the main problems with running distributed systems on the cloud is discovery. Products that can do “zero conf” discovery use multicast for it (elasticsearch among them), and in most cloud providers (Amazon AWS or Rackspace) multicast is disabled. The typical way to work around it is to use unicast discovery, which requires setting up a specific list of IPs/Hosts (routers or gossip servers).

Unicast discovery is problematic when used on the cloud. Machines can come and go, and their IP is not static. Cloud providers work around that by providing the ability to have a set of “elastic IPs”. But, at the end, the management of the cloud installation becomes a pain. At least two servers must be associated with an elastic IP and become a special exception case which needs to be managed. This goes completely against “zero conf” discovery and heavily complicates the cloud installation.

ElasticSearch has a new discovery module called “Zen” which was built from the ground up to work well in cloud environments (and integrate well with other elasticsearch modules). The cloud extension to it provides “zero conf” discovery in cloud environments.

In a nutshell, when running on the cloud, the list of machines that are already running on the cloud is available through cloud APIs. This information can be used to perform “zero conf” discovery. This follows the motto that should be embraced by any system running on the cloud: All Machines are Created Equal.

So, how do you enable cloud discovery? With a few lines of configuration:

cloud:
    account: <Your Amazon AWS Account Here>
    key: <Your Amazon AWS Secret Key Here>
    compute:
        type: amazon
discovery:
    type: cloud

The above configuration enables auto discovery in Amazon AWS. Simply replace amazon with rackspace to work on the Rackspace cloud. There is a long list of compute cloud providers supported, including GoGrid, and Terremark.

Gateway

ElasticSearch has been designed to do reliable asynchronous long term persistency. This enables several features including the ability for fast local “runtime” storage (including in-memory) while having a long term storage that can be slower. The Gateway concept is described in the Search Engine Time Machine post.

But first, a step back. When designing a system that would be deployed on the cloud, let's take a search engine for example ;), things come and go. Among the things that come and go are disks. So, local storage, in cloud environments, is considered transient. In Amazon AWS for example, EBS (Elastic Block Store) was introduced to provide a mountable disk that survives restarts. So, we could configure our search engine to store the index on EBS. But, EBS requires periodic snapshotting to S3 (Amazon's blob store) for “safe” persistency, since EBS can certainly suffer from failures as well. Of course, this means more money spent on your cloud deployment since now one pays for both EBS and S3.

One way to work around this is to persist directly from the local store to S3 by writing some sort of synchronization script / code. But, if the machine fails, we will lose all the data written since the script last ran. The next step is to add replication (and sharding for performance) and so on. All of this is provided by elasticsearch out of the box.

Here is how elasticsearch can be configured to store both its cluster metadata (to survive full cluster failure) and indices in the cloud:

cloud:
    account: <Your Amazon AWS Account Here>
    key: <Your Amazon AWS Secret Key Here>
    blobstore:
        type: amazon
gateway:
    type: cloud
    cloud:
        container: mycontainerhere

The above simple configuration will store things in Amazon S3. Simply change amazon to rackspace to use Rackspace CloudFiles. There is a long list of blobstore providers supported, including Azureblob.

Final Words

As you can see, elasticsearch is now a first class citizen when running on the cloud. I believe that it has actually created a new level of intimate integration of products with the cloud. Together, Discovery and Gateway mean that managing an elasticsearch deployment on the cloud is a breeze.

As a side note, cross cloud support is done using jclouds. Highly recommended.

-shay.banon

ElasticSearch Just Got Groovy

By Shay Banon | 19 Apr 2010

Just pushed into master (upcoming 0.7 release) a Groovy client wrapper on-top of the Java API elasticsearch provides.

Using elasticsearch with dynamic languages makes a lot of sense, especially thanks to its domain driven approach, and thanks to the fact that Groovy runs on the JVM, it can make use of the native elasticsearch Java API. Here are some examples:

Creating a node (that acts as a client) within the cluster is simple using the GNodeBuilder:

def nodeBuilder = new org.elasticsearch.groovy.node.GNodeBuilder()
nodeBuilder.settings {
    node {
        client = true
    }
}
def gNode = nodeBuilder.node()
def client = gNode.client

Note how, right from the start, settings are applied in a domain driven way. Settings in elasticsearch can be defined using JSON, and, by utilizing Grails JsonBuilder, they can be expressed as a Groovy Closure.

Next, let's index some data:

def future = client.index {
    index "twitter"
    type "tweet"
    id "1"
    source {
        user = "kimchy"
        message = "elasticsearch is groovy"
    }
}

// a listener can be added to the future
future.success = {IndexResponse response ->
    println "Indexed $response.index/$response.type/$response.id"
}

// or, we can wait for the response
println "Indexed $future.response.index/$future.response.type/$future.response.id"

Here, we indexed a tweet into an index called twitter, with type tweet and id 1. Note that the indexed JSON is expressed using the same JsonBuilder.

Also, all operations in elasticsearch are asynchronous, making it possible to either register a listener (on success/failure) or work with an ActionFuture. The future in the Groovy case is an enhanced Groovy future called GActionFuture. It allows waiting for the response, or registering a Closure that will be called on a successful index, failed index, or both.

All APIs are used exactly the same as the above index one. Let me finish with an example of the Search API, which shows the power of the search query DSL:

def search = client.search {
    indices "twitter"
    types "tweet"
    source {
        query {
            term(user: "kimchy")
        }
    }
}

println "Search returned $search.response.hits.totalHits total hits"
println "First hit tweet message is $search.response.hits[0].source.message"

As you can see, using elasticsearch from Groovy is groovy ;). Someone up for building a grails plugin utilizing this?

-shay.banon

ElasticSearch 0.6.0 Released

By Shay Banon | 09 Apr 2010

ElasticSearch version 0.6.0 has just been released. You can download it here. This release brings much improved stability, and several features:

First, a big rename has occurred. All the JSON API now uses “underscore casing” instead of “CamelCase casing”. This makes elasticsearch more streamlined with other JSON based REST APIs out there.

The JSON API is much more flexible now, supporting numbers provided as strings, and boolean values provided as either numbers or strings. This makes using elasticsearch from dynamic languages easier.

_all field support has been added, automatically creating a field that includes all the different fields in the JSON document for simpler searching (no need to explicitly specify the field name to search on). One of the nice things about the _all field is that it takes boost level setting of different fields into account. More information on the _all field can be found here.

Highlighting is now supported as part of the search request.

Simpler Query DSL including support for fuzzy_like_this queries and gt/lt/gte/lte on range queries.

The Index Aliases API allows creating aliases associated with one or more indices, and executing other APIs using an alias instead of the actual index names.

A new plugin system has been developed, making it easy to extend elasticsearch. The first plugin is the attachments plugin, which allows indexing “attachments” such as documents, images, mails, and so on.

Internal changes to how communication is handled between nodes resulting in much smaller messages passing around over the low level transport layer and a lower latency/overhead for each API.

Many bug fixes and performance enhancements slowly making elasticsearch as rock solid as it should be!

Last but not least, elasticsearch is now available from a Maven repository, with a releases repo and a snapshots repo.
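
The more flexible JSON handling mentioned above (numbers given as strings, booleans given as numbers or strings) can be pictured with a small Python sketch. It only illustrates the idea of lenient value parsing. The function names are made up, and the exact set of strings accepted as true or false is an assumption.

```python
TRUE_STRINGS = {"true", "1", "on", "yes"}     # assumed spellings
FALSE_STRINGS = {"false", "0", "off", "no"}   # assumed spellings


def lenient_number(value):
    """Accept a number, or a string that contains one."""
    if isinstance(value, bool):
        raise TypeError("expected a number, got a boolean")
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        text = value.strip()
        try:
            return int(text)
        except ValueError:
            return float(text)  # raises ValueError if the text is not numeric
    raise TypeError(f"cannot read a number from {type(value).__name__}")


def lenient_bool(value):
    """Accept a boolean, a number (zero is false), or a recognised string."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    raise TypeError(f"cannot read a boolean from {type(value).__name__}")


print(lenient_number("42"), lenient_number("3.5"))  # 42 3.5
print(lenient_bool("true"), lenient_bool(0))        # True False
```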

-shay.banon

ElasticSearch 0.5.0 Released

By Shay Banon | 05 Mar 2010

ElasticSearch version 0.5.0 has just been released. You can download it here. This release brings much improved stability, better handling of mapping definitions, and several features:

Several new queries have been added, including moreLikeThis, moreLikeThisField, fieldQuery, and queryString with multiple fields.

A terms API for getting the terms (from one or more indices) of one or more fields and their respective document frequencies (how often they exist in documents). This can be very handy for implementing things like tag clouds or simple auto suggest.

cluster_health API for simple indication on the health of the cluster, as well as the ability to wait for the cluster to reach a health status.

moreLikeThis API to search for documents that are like a certain document.

Java API exposing all of elasticsearch operations/actions using simple, transport based, async API to use with any JVM based language.

There are many more minor features and bug fixes, all listed here under the v0.5.0 tag.
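
As an illustration of what the terms API described above returns, the Python sketch below computes document frequencies for one field over a few in-memory documents, which is the raw material for a tag cloud. It does not call elasticsearch; the function name and the sample documents are made up.

```python
from collections import Counter


def term_doc_freqs(docs, field):
    """Map each term of `field` to the number of documents containing it."""
    freqs = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        # A set ensures a term counts once per document, however often it repeats.
        freqs.update(set(values))
    return freqs


docs = [
    {"tag": ["food", "family"]},
    {"tag": ["food", "food", "cheap"]},
    {"tag": "family"},
    {"text": "no tags here"},
]

freqs = term_doc_freqs(docs, "tag")
print(freqs.most_common(2))  # the two most frequent tags
```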

-shay.banon