This Week in Elasticsearch and Apache Lucene - 2016-08-22
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
“Less Code, More Nodes, More Features”
Application Scaling with Elasticsearch @ StockTwits | Elastic - https://t.co/MZCrn4OMtF
— Kraut Klíck (@QIMP3G) August 12, 2016
Elasticsearch Core
Changes in 2.x:
- It should be possible to update the include_in_all setting on existing object fields.
- The geohash options on geo-point fields are deprecated, as is the optimize_bbox parameter to the geo-point distance query.
- Jackson has been upgraded to v2.8.1.
- Failing to allocate a primary shard 5 times should prevent further automated allocation attempts.
- The default min and max heap sizes are now set to 2GB, which means we can remove this from the bootstrap checks.
- The minimum_master_nodes setting has also been removed from bootstrap checks as it only checked that it had been set, not that it had been set correctly.
- Bootstrap check exceptions no longer print stack traces, which were just obscuring the message of the exception.
- Index names may no longer start with + or +- as these special characters are used in index wildcard matching.
- Index creation requests must use PUT not POST, and a type-exists request has changed from HEAD index/type to HEAD index/_mapping/type.
- Reindex should work with the transport client.
- The snapshot-status API now supports ignore_unavailable.
- String fields with index_options or position_increment_gap were not being upgraded to text fields.
- Plugins should be able to upgrade custom cluster state metadata on startup.
- The routing changes API makes it easier for a node to determine which shard allocation changes have taken place.
- LockObtainFailedException has been renamed to ShardLockObtainFailedException because it is an in-memory lock that has nothing to do with IO.
- Painless will be the new default script language in 5.0.
- A big codebase cleanup is under way to reduce the number of packages that we have, and to remove the dependency on Guice.
- SearchContext should use ref counting to prevent accessing an already closed index.
- Response filtering will support exclusions like foo.*,-foo.bar
- Shards should only be marked as stale when there is a non-replicated write, not when the node shuts down.
- The ingest node should be able to handle dots in field names.
- A post-search hook will allow logging search requests once per request instead of once per shard.
- Should only text and keyword fields be included in the _all field by default?
- Setting stored_fields to _none_ would skip the stored-fields phase entirely, meaning meta-fields like _id, _type, _source etc. would not be returned.
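To make the response-filtering item above more concrete, here is a minimal Python sketch of how a filter expression such as foo.*,-foo.bar could be read: patterns prefixed with - are exclusions, everything else is an inclusion. This is only an illustration of the idea, not Elasticsearch's implementation. The function names split_filters and keep_path are hypothetical, and the glob matching (via Python's fnmatch, where * also matches across dots) is a simplifying assumption.

```python
from fnmatch import fnmatchcase


def split_filters(spec):
    """Split a comma-separated filter spec into (includes, excludes).

    Patterns starting with '-' are treated as exclusions (assumption
    based on the foo.*,-foo.bar example; not Elasticsearch's own code).
    """
    includes, excludes = [], []
    for raw in spec.split(","):
        pattern = raw.strip()
        if not pattern:
            continue
        if pattern.startswith("-"):
            excludes.append(pattern[1:])
        else:
            includes.append(pattern)
    return includes, excludes


def keep_path(path, spec):
    """Return True if a dotted field path survives the filter spec."""
    includes, excludes = split_filters(spec)
    # With no inclusion patterns, everything is included by default.
    included = not includes or any(fnmatchcase(path, p) for p in includes)
    excluded = any(fnmatchcase(path, p) for p in excludes)
    return included and not excluded


if __name__ == "__main__":
    for field in ("foo.baz", "foo.bar", "other"):
        print(field, keep_path(field, "foo.*,-foo.bar"))
```

With the example expression, foo.baz is kept, foo.bar is dropped by the exclusion, and other is dropped because it matches no inclusion pattern.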
Apache Lucene
- The release process for Lucene 6.2.0 will begin shortly
- The surprisingly massive indexing performance drop (annotation AU), unexpectedly caused by an otherwise great change, was due to a pre-existing performance bug in Lucene only uncovered after much hunting
- Lucene's legacy (postings-based) numeric implementation has moved to the backwards-codecs module and will soon be removed entirely for 7.0
- A new Lucene test case tests that you can simultaneously close SearcherManager while it's also refreshing, and open a new SearcherManager while IndexWriter is closing, while also searching, hopefully without risking SIGSEGV
- Lucene now tries harder in its best-effort check to detect when MMapDirectory is being used after being closed since that can cause a SIGSEGV which terminates the JVM, but its stressful test case will still provoke SIGSEGV, so it has been disabled
- IntRangeField, FloatRangeField and LongRangeField let you index a range and search by ranges overlapping the indexed ranges
- Lucene tests had gotten too slow recently, especially TestBoolean2
- We don't need an exemption in Lucene's tests security policy for loading the Wikipedia test documents
- The flaky MoreLikeThisTest that keeps failing has finally been muzzled
- Another tricky corner case geo3d test failure emerges
- Stemming is tricky and it's hard to make changes without a formal analysis of the impact
- If MultiPhraseQuery has only one clause, the classic highlighter will hit an IllegalArgumentException
- BooleanQuery can optimize rewrite in a few cases
- The APIs to track external data structures along with Lucene's LeafReaders are trappy
- Making delete-by-query work with doc-values queries is horribly complex and it may make more sense to remove doc-values queries instead, though some people disagree
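For readers curious what "search by ranges overlapping the indexed ranges" means for the new range fields listed above, here is a small Python sketch of the overlap test for one-dimensional ranges. It assumes closed (inclusive) bounds and uses hypothetical names (ranges_overlap, search); it illustrates the semantics only and is not Lucene's API or implementation.

```python
def ranges_overlap(indexed, query):
    """Return True if two closed ranges (min, max) share at least one value.

    Inclusive bounds are an assumption made for this illustration.
    """
    i_min, i_max = indexed
    q_min, q_max = query
    return i_min <= q_max and q_min <= i_max


def search(indexed_ranges, query):
    """Return the ids of all indexed ranges that overlap the query range."""
    return [
        doc_id
        for doc_id, rng in indexed_ranges.items()
        if ranges_overlap(rng, query)
    ]


if __name__ == "__main__":
    # Hypothetical documents, each indexed with a single integer range.
    docs = {"a": (1, 5), "b": (10, 20), "c": (5, 10)}
    print(search(docs, (4, 6)))
    print(search(docs, (6, 9)))
```

Two ranges overlap exactly when each one starts no later than the other ends, which is why a single pair of comparisons is enough.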
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!