Full text search is required in many human-facing applications, such as those where users need to interact with a datastore to retrieve and insert data based on partial or wildcard information, with support for spell correction and autocompletion. An additional benefit of full text search is the ability to retrieve multiple results sorted by their relevance.
Lucene, the common parent to Solr and Elasticsearch
The most popular textual search engine in the market is Lucene. It is used by Solr, Elasticsearch, Lucidworks and other text search tools. Lucene is a great search engine: it is extremely fast and stable, and it is hard to find a better foundation for text search.
Lucene, developed by Doug Cutting, was first released in 1999, and became an Apache Foundation open source project in 2001. Solr was developed on top of Lucene in 2004 by CNET, which donated the code to the Apache Foundation in 2006. By 2010, the projects were nearly synonymous, with Solr becoming a subproject under Lucene. Lucene remains a Java library, though it has been ported to a number of other languages, including C++, Python, and PHP.
Meanwhile, a separate project based on Lucene, known as Compass, also surfaced in 2004. This project took on a life of its own and, in 2010, its successor was released. Designed to be distributed (and thus scalable), and standardizing on JSON over HTTP as its interface, this was the birth of Elasticsearch. While it remains rooted in open source, the company behind it went commercial in 2012, rebranded as Elastic in 2015, and in 2018 it went public to great fanfare. Elasticsearch, like Lucene, is implemented in Java, and has clients written in a broad number of languages, including .NET (C#), Python, PHP, and more.
Limits of Lucene/Solr
Lucene and its child projects lack the geo-replication, high throughput, resiliency and availability needed to be considered a comprehensive database solution. The master-slave architecture used by Lucene (leader-replica, in Lucene’s terms) can limit usage, and can result in lost connectivity or even data loss.
While those are general limits of Lucene, there are also differences in implementation which create a compelling argument for Elasticsearch over its cousin, especially Elasticsearch’s support for distribution and multi-tenancy.
Some Major Differences Between Elasticsearch and Solr
- Node Discovery: Elasticsearch uses its own discovery implementation, called Zen, which for full fault tolerance (i.e. not being affected by network splits) requires at least three dedicated master nodes, so that a majority can still be formed if one of them is lost. Solr uses Apache ZooKeeper for discovery and leader election. This requires an external ZooKeeper ensemble, which for a fault-tolerant and fully available SolrCloud cluster needs at least three ZooKeeper instances.
- Shard Placement: Elasticsearch is very dynamic in how it places indices and the shards they are built from. It can move shards around the cluster when certain events occur, for example when a new node joins or a node is removed from the cluster. We can control where shards should and shouldn’t be placed by using awareness tags, and we can tell Elasticsearch to move shards on demand using an API call. Solr, on the other hand, is a bit more static.
- Caches: When you index data, Lucene produces segments. It can also merge multiple smaller, existing segments into larger ones in a process called segment merging. The caches in Solr are global: a shard has a single cache instance of a given type for all of its segments. When a single segment changes, the whole cache needs to be invalidated and refreshed, which takes time and consumes hardware resources. In Elasticsearch, caches are per-segment, which means that if only a single segment changes, only a small portion of the cached data needs to be invalidated and refreshed.
- Beyond Text Search: Solr has focused primarily on text search. It does a great job at this, and is a de facto standard for search applications. Elasticsearch, however, has moved in a different direction, going beyond text search to tackle log analytics and visualization with the Elastic stack (Logstash and Kibana). See an example of using the Elastic stack in one of our previous blogs.
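To make the shard placement point more concrete, here is a minimal sketch of the request bodies involved. It only builds the JSON payloads and does not contact a cluster. The `POST /_cluster/reroute` endpoint with its `move` command and the `cluster.routing.allocation.awareness.attributes` setting are part of Elasticsearch's REST API; the index name, node names, and the `rack_id` attribute are hypothetical placeholders, and the helper functions are ours, not part of any client library.

```python
import json


def build_move_shard_request(index, shard, from_node, to_node):
    """Build the body for POST /_cluster/reroute that moves one shard."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


def build_awareness_settings(attribute):
    """Build the body for PUT /_cluster/settings that enables
    allocation awareness on a single node attribute."""
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": attribute
        }
    }


# Hypothetical index, node, and attribute names, for illustration only.
move_body = build_move_shard_request("products", 0, "node-1", "node-2")
awareness_body = build_awareness_settings("rack_id")

print(json.dumps(move_body, indent=2))
print(json.dumps(awareness_body, indent=2))
```

Sending these bodies with any HTTP client is left out on purpose, since the host, authentication, and Elasticsearch version will differ per deployment.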
Advantages of Elastic+ScyllaDB over Cassandra+Solr
You can further empower Elasticsearch by putting ScyllaDB as a persistent, distributed database behind it. Elasticsearch offers an advanced API to integrate and query data from the search engine, while ScyllaDB provides the high throughput, highly available datastore.
Elastic and ScyllaDB are best of breed, each in its field. Both were designed from the ground up for horizontal scalability, meaning you should be able to keep growing your enterprise data seamlessly, terabyte after terabyte.
Why not handle everything in one system or the other? It’s a fair question, but experience shows that your search workload may need different scaling than your operational database. Having both Elastic and ScyllaDB, separately, gives you that flexibility. Combined, they give you a broad range of options for finding the data you need: either employ full-text search with Elasticsearch or, with ScyllaDB, use materialized views, secondary indexes, or efficient full table scans.
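One way to picture the combination is a "search, then fetch" flow: the search engine returns matching keys ranked by relevance, and the full rows come from the database. The sketch below uses small in-memory stand-ins instead of real Elasticsearch and ScyllaDB clients, so every class and function name in it is hypothetical, and the ranking (a count of matching query terms) is far simpler than what Lucene actually does.

```python
from collections import defaultdict


class InMemorySearchIndex:
    """Stand-in for Elasticsearch: maps lowercase terms to document ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def index(self, doc_id, text):
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids ranked by the number of query terms they match."""
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in self._postings.get(term, ()):
                scores[doc_id] += 1
        # Highest score first; ties are broken by id for stable output.
        return sorted(scores, key=lambda d: (-scores[d], d))


class InMemoryStore:
    """Stand-in for a ScyllaDB table keyed by primary key."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)


def save(store, index, key, row, text_field):
    """Write the full row to the store and its text field to the index."""
    store.insert(key, row)
    index.index(key, row[text_field])


def search_rows(store, index, query):
    """Ask the index for matching keys, then fetch full rows from the store."""
    return [store.get(key) for key in index.search(query)]


store = InMemoryStore()
index = InMemorySearchIndex()
save(store, index, 1, {"id": 1, "name": "red running shoes"}, "name")
save(store, index, 2, {"id": 2, "name": "blue shoes"}, "name")
save(store, index, 3, {"id": 3, "name": "red hat"}, "name")

results = search_rows(store, index, "red shoes")
print([row["name"] for row in results])
```

Because the two stand-ins share nothing but the key, each side could be scaled or replaced independently, which is the flexibility argued for above.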
We’ve gone into extensive detail elsewhere as to why ScyllaDB performs better than Cassandra. The reasons include ScyllaDB’s C++ code base, which avoids the limitations of the Java Virtual Machine, such as the latency spikes of Garbage Collection (GC); Cassandra’s inability to take full advantage of multi-core/multi-CPU hardware; and other advantages unique to ScyllaDB, such as autotuning upon installation and at runtime by our I/O Scheduler and Compaction Controller.
With Cassandra and its Java-based code, it is challenging to configure both the JVM and the caches correctly. Adding Solr to the mix practically guarantees you won’t have an ideal configuration, because the same JVM process needs to serve both Solr and Cassandra. The two different workloads together put far more stress on the GC.
While Elasticsearch is based on Java, the fact that ScyllaDB is written in C++ means that it will not be triggering or exacerbating any JVM or GC issues.
Although Cassandra and Solr run in one process, each has its own write path, which means that when failures happen, data may not be consistent between the SSTable store and the Solr files. With Elastic+ScyllaDB running separately, if a failure occurs you are at least aware that it happened, and you can potentially isolate data errors.
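When the database and the search engine are separate systems, one simple way to isolate such errors is to compare the keys each side holds. The following is a minimal sketch under stated assumptions: the function name and the sample keys are hypothetical, and in practice the two key lists would come from a table scan on the database side and a full listing on the search side, neither of which is shown here.

```python
def compare_keys(store_keys, index_keys):
    """Report keys present on one side but not on the other."""
    store_keys = set(store_keys)
    index_keys = set(index_keys)
    return {
        # Rows that were written to the database but never indexed.
        "missing_from_index": sorted(store_keys - index_keys),
        # Index entries whose row no longer exists in the database.
        "orphaned_in_index": sorted(index_keys - store_keys),
    }


# Hypothetical sample keys, for illustration only.
report = compare_keys(store_keys=[1, 2, 3, 4], index_keys=[1, 2, 4, 5])
print(report)
```

Keys reported as missing can be re-indexed from the database, which remains the source of truth in this arrangement.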
Next Steps
While we make the case here for using ScyllaDB and Elasticsearch together, in our next blog we’ll explore some more practical use cases, as well as examples of configuring both to work together. If, in the meantime, you wish to do some experimentation on your own, remember that both ScyllaDB and Elasticsearch are open source. Feel free to check out our blog from last year and follow along to make your own working demo.
Finally, if you have your own experience using ScyllaDB and Elasticsearch together in your environment, we’d love to hear your feedback!