Managing big data can be taxing, especially when speed, reliability, scalability, and high availability are requirements for your organization. Traditional databases cannot deliver the blazing speeds your analytical reports require, especially when running large data aggregations. Even with your best setup and configuration efforts, existing databases often underperform.
Large organizations with voluminous data have a hard time with ad hoc analytics, which often demand real-time results when retrieving or searching data. Many attempt to build something from scratch, combining various existing technologies to provide big data storage, analytics, and related services that power their applications to their standards. The problem is that this is not simple: it takes a great deal of research and development, money, and time to meet demands for delivery time, speed, and flexibility.
Handling big data that you intend to use for search or analytics (machine learning, artificial intelligence, IoT, geospatial processing, telecommunications, military and weapons systems, health systems) requires speed, real-time processing, scalability, and performance.
Indeed, there are applications you have already heard of for use in big data, such as Apache Hadoop and Apache Spark — and then there’s Elasticsearch. Hadoop and Spark are perfect for large transactions, especially bulk inserts or pipelining. In contrast, Elasticsearch provides true search engine functionality with the best performance for real-time and time-series data retrieval.
In this post, we will cover an overview of the basics of Elasticsearch and when and why you should use it.
Brief History of Elasticsearch
Elasticsearch was created by Shay Banon, a software engineer who set out to build a scalable search solution for his wife’s growing list of recipes. He built “a solution built from the ground up to be distributed” and used a common interface, JSON over HTTP, suitable for programming languages other than Java.
His first iteration was called Compass. The second was Elasticsearch (with Apache Lucene under the hood).
Shay Banon released the first version of Elasticsearch in February 2010. As of this writing, Elasticsearch has produced six major releases in the following order:
- 1.0.0 – February 12, 2014
- 2.0.0 – October 28, 2015
- 5.0.0 – October 26, 2016
- 6.0.0 – November 14, 2017
- 7.0.0 – April 10, 2019
- 8.0.0 – February 10, 2022
The response was impressive, and users took to it naturally. As adoption grew, a community began to form, and Banon, together with Steven Schuurman, Uri Boness, and Simon Willnauer, founded a search company.
What is Elasticsearch?
Elasticsearch is best known as the search technology in the ELK stack, an acronym for its original components: Elasticsearch, Logstash, and Kibana. After Beats was introduced, the collection was rebranded as the Elastic Stack.
The following illustration shows how the ELK stack works and demonstrates Elasticsearch’s purpose and how it works as part of the stack.
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch, built on Apache Lucene, was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Elasticsearch is primarily known for its simple REST APIs, distributed nature, speed, and scalability, and as seen in the illustration above, it is the central component of the Elastic Stack. The ELK stack is a set of free and open tools for data ingestion, enrichment, storage, analysis, and visualization.
With the addition of Beats (lightweight shipping agents for sending data to Elasticsearch), the stack today consists of Elasticsearch, Kibana, Beats, and Logstash. So what exactly is Elasticsearch itself?
Elasticsearch is scalable, offers many aggregations, and pairs with a great visualization tool, Kibana. It provides features to help you store, manage, and search time-series data such as logs and metrics. Once your data is in Elasticsearch, you can analyze and visualize it using Kibana and other Elastic Stack features.
Elasticsearch is typically used as the underlying search engine powering applications with simple/complex search features and requirements. Features include:
- Indexing, storing, searching, and analyzing large volumes of data quickly and in near real-time.
- Real-time search and analytics for structured, unstructured, numerical, or geospatial data.
- Efficient storage and indexing of data in a way that supports fast searches.
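As a sketch of what those simple REST APIs look like, the snippet below builds the JSON bodies for indexing a document and running a full-text search. The index name `products` and the field names are hypothetical; in practice you would send these bodies over HTTP, e.g. `PUT /products/_doc/1` and `GET /products/_search`.

```python
import json

# A document to index -- in practice this body is sent with
# PUT /products/_doc/1 against the cluster's REST API.
doc = {
    "name": "Wireless keyboard",
    "price": 49.99,
    "tags": ["electronics", "accessories"],
    "added": "2022-03-01T12:00:00Z",
}

# A full-text query -- sent as the body of GET /products/_search.
query = {
    "query": {"match": {"name": "keyboard"}},
    "size": 10,
}

# Both are plain JSON over HTTP, which is all a client needs.
print(json.dumps(query))
```

Because everything is JSON over HTTP, any language with an HTTP client can talk to Elasticsearch without a special driver.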
Elasticsearch ships with a bundled version of OpenJDK, so a separate Java installation is not required.
Elasticsearch has been around for a while and is used by large organizations such as CERN, Facebook, Walmart, Adobe, US Air Force, Shopify, Uber, Pfizer, Vimeo, eBay, GoDaddy, and more.
Elasticsearch Licensing
You are probably asking, “Is Elasticsearch free?” Elasticsearch was originally released as open-source software under the Apache License 2.0. In January 2021, however, Elastic changed the licensing to a dual Elastic License 2.0 / SSPL 1.0 model. The SSPL in particular follows the path taken by mainstream database technologies such as MongoDB, CockroachDB, Redis, TimescaleDB, and Graylog. This means Elasticsearch moved away from pure open source, but it remains freely available, with usage limitations intended to prevent abuse. Elastic maintains a helpful FAQ for any questions or concerns regarding licensing.
Elasticsearch is a distributed, document-oriented database. It stores complex data structures as serialized JSON documents. You can compare it to other NoSQL databases that store data as documents (or documents inside collections) rather than in the traditional schema of tables, columns, and rows used by relational databases.
When using Elasticsearch, it is recommended to design your data mappings optimally before storing data, and that optimization should target search and retrieval. Elasticsearch does not work like an RDBMS: it does not support constraints such as foreign keys, and it is not designed for heavy joins across tables. Data stored in Elasticsearch should therefore be denormalized. Denormalization increases retrieval performance, since no query-time joins are needed. The downside is that it uses more space, as the same values are stored several times, and it makes keeping things up to date harder, since any change must be applied to every copy. This approach is, however, excellent for write-once-read-many workloads, which is exactly what Elasticsearch is best suited for: mappings and documents optimized for search and retrieval.
Designing your data means defining a template (a mapping) that fits your requirements. You must decide whether to rely on dynamic mapping, where Elasticsearch infers field types as documents arrive, or to map your fields explicitly before adding data.
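For example, an explicit mapping for a hypothetical `orders` index might look like the following sketch (the index and field names are invented; the body would be sent with `PUT /orders`). Note the denormalized `customer_city` field embedded directly in each order rather than joined from a customer table.

```python
import json

# A sketch of an explicit mapping for a hypothetical "orders" index.
mapping = {
    "mappings": {
        "properties": {
            "order_id":   {"type": "keyword"},  # exact-match field
            "customer":   {"type": "text"},     # analyzed, full-text field
            "total":      {"type": "double"},
            "created_at": {"type": "date"},
            # Denormalized: the customer's city is stored in every order
            # document instead of being joined from a separate table.
            "customer_city": {"type": "keyword"},
        }
    }
}
print(json.dumps(mapping, indent=2))
```

The `keyword` vs `text` distinction matters: `keyword` fields support exact filtering and aggregations, while `text` fields are analyzed for full-text search.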
When a document is stored, it is indexed and fully searchable in near real-time — within one second. Elasticsearch uses a data structure called an inverted index that supports speedy, full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
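A toy sketch of the idea (not Lucene's actual implementation): each unique word maps to the set of document ids that contain it, so a term lookup is a single dictionary access rather than a scan of every document.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unique word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {
    1: "quick brown fox",
    2: "quick red fox",
    3: "lazy brown dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # documents 1 and 2
print(index["brown"])  # documents 1 and 3
```

Real inverted indexes also store positions, frequencies, and relevance statistics per term, which is what makes scoring and phrase queries possible.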
When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.
Elasticsearch is Not Like Your Typical Database
Most relational databases also let you specify constraints to define what is and isn’t consistent. For example, you can enforce referential integrity and uniqueness, require that the sum of account movements be positive, and so on. Document-oriented databases tend not to do this, and Elasticsearch is no different.
Elasticsearch does not work like a traditional RDBMS, or even like NoSQL databases that provide ACID transactions backed by undo/redo logs; it has no such concepts. It does not enforce constraints such as foreign or unique keys, because it has no ACID compliance mechanisms. Although locks can be used to avoid contention, this is not automatically managed by Elasticsearch as you might expect. Elasticsearch does, however, provide a form of optimistic concurrency control, which ensures that an older version of a document never overwrites a newer one: every operation performed on a document is assigned a sequence number by the primary shard that coordinates the change.
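As a minimal sketch of how that optimistic concurrency control is used: `if_seq_no` and `if_primary_term` are real Elasticsearch query parameters, while the host, index name, and metadata values below are hypothetical. If another write has already bumped the document's sequence number, Elasticsearch rejects the conditional write with a version-conflict error instead of silently overwriting.

```python
# Metadata from a previous read of the document (hypothetical values);
# every write is stamped with a sequence number by the primary shard.
seq_no, primary_term = 42, 3

# A conditional update URL: the write only succeeds if the document
# still carries this exact sequence number and primary term.
url = (
    "http://localhost:9200/products/_doc/1"
    f"?if_seq_no={seq_no}&if_primary_term={primary_term}"
)
print(url)
```

On a conflict the client re-reads the document, picks up the new sequence number, and retries, rather than relying on the server to serialize writes.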
This design is part of what lets Elasticsearch deliver results at such speed. There is no complicated document structure to process at query time, and no need to restructure unorganized data or build a tree of relationships between documents before answering a search.
Use Cases for Elasticsearch
Elasticsearch can provide near real-time capabilities for big data workloads with high demands, such as live video feeds, line-of-sight data, and instant chat tools. In some cases, data literally needs to be routed around the world, which can degrade quality (for example, causing video to pixelate). Elasticsearch helps address this by allowing applications to rely more on local assets.
Elasticsearch is perfect for storing unstructured data and then retrieving it with blazing speed via its search engine capabilities built on Apache Lucene. This makes Elasticsearch a great fit for systems such as:
- Business Data Analytics
- Security and Fraud Detection
- Geospatial Applications
- Military Operations
- Public Safety and Emergency Response
- Scientific Data Analysis
- Machine Learning/Artificial Intelligence
Many systems and applications could benefit from Elasticsearch and the ELK stack.
Elasticsearch is best used for:
- Logging and Log Analysis
- Scraping and Combining Public Data
- Full-Text Search (good for fraud detection/security, e-commerce search, enterprise search, etc.)
- Event Data and Metrics
- Visualizing Data
- System Observability
- Security (threat hunting and prevention)
Elasticsearch as Your Primary Data Store
Is it a good idea to use Elasticsearch as your primary database, like an RDBMS or another NoSQL DB? Generally, it is not recommended. Some operations, such as indexing (inserting values), are more expensive than in other databases.
Elasticsearch does not provide ACID transactions, and unlike a traditional RDBMS, it is not built with locking mechanisms for referential integrity.
With that said, Elasticsearch works best as the search engine layer serving data drawn from your persistent primary data store, whether that is an RDBMS or a NoSQL database. If you are processing large volumes of data through bulk inserts or pipelines that require real-time processing, Elasticsearch is not the best tool, though you can configure and fine-tune it to cope. For bulk inserts or pipelines that ingest huge volumes of data, use Apache Hadoop or Apache Spark, then feed the results into Elasticsearch for the retrieval or analytical work that requires optimal speed. Of course, this means additional cost, as you have to roll out extra hardware/VMs or maintain a data lake to meet your needs.
Elasticsearch is Scalable and Highly Available
For production use with large amounts of data, it is best to run Elasticsearch as a cluster, which requires at least three nodes. Elasticsearch uses quorum-based decision-making, which explains the three-node minimum: a quorum is a majority of the voting nodes (half the total, plus one), so a three-node cluster can still reach quorum after losing a single node.
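The quorum arithmetic is worth seeing spelled out; this small sketch shows why three nodes is the practical minimum and why two-node clusters are fragile.

```python
def quorum(voting_nodes: int) -> int:
    """Majority quorum: more than half of the voting nodes."""
    return voting_nodes // 2 + 1

# With 3 nodes, quorum is 2: the cluster survives losing one node.
# With 2 nodes, quorum is also 2: losing either node halts elections.
for n in (2, 3, 5):
    print(n, "nodes -> quorum of", quorum(n))
```

This is also why even-sized clusters add little resilience: going from three nodes to four raises the quorum from two to three without increasing the number of node failures the cluster can tolerate.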
By default, Elasticsearch stores your data on at least a primary shard and a replica shard. If a node goes down, your data is still available on another node for retrieval and is not lost.
Scalability-wise, adding a node to an existing cluster is very easy. Once a new node is set up, you can have it join the cluster, and Elasticsearch will automatically rebalance shards across the nodes, expanding your capacity.
Backups for Elasticsearch
Backups in Elasticsearch are not like the backups of a traditional database. Snapshot and restore is the only reliable backup method Elasticsearch supports.
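As a sketch, a snapshot workflow starts by registering a repository; the body below is for a shared-filesystem (`fs`) repository. The repository name `my_backup` and the path are hypothetical, and the path must also be listed under `path.repo` in `elasticsearch.yml`.

```python
import json

# Body for PUT /_snapshot/my_backup -- registers a filesystem repository.
repo_body = {
    "type": "fs",
    "settings": {"location": "/mnt/es_backups"},
}

# Taking a snapshot:  PUT  /_snapshot/my_backup/snapshot_1
# Restoring one:      POST /_snapshot/my_backup/snapshot_1/_restore
print(json.dumps(repo_body))
```

Snapshots are incremental: each new snapshot stores only the segments that changed since the previous one, so frequent snapshots are cheap.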
Tips for Installation and Setup Configuration for Elasticsearch
If you are a first-time user, setup and installation can be tricky. Since the recent 8.x releases (8.1.x as of this writing), TLS/SSL is enabled by default, which you may find challenging to configure, especially if you follow the manual setup steps.
In our next blog, we will guide you through the installation and configuration setup for Elasticsearch and build a cluster that is ready for your environment. So please stay tuned and subscribe to our newsletter for updates.
Elasticsearch is a very powerful part of the ELK stack (Elasticsearch, Logstash, Beats, Kibana). It serves as a search engine platform and is great for managing and storing large volumes of data that need to be processed for retrieval or analytical purposes in near real-time. It can bring search and analytics to any data type, and sending and retrieving data from Elasticsearch is managed within seconds.
Elasticsearch is also highly scalable, provides high availability, and supports backups through snapshot and restore. It offers a very rich API that allows you to fine-tune your data and indices to best suit your needs. Elasticsearch is used by large organizations and is proven to deliver business-critical data.
It’s a perfect companion to business success, improving insights gathered from analytics, forecasting trends, improving security stability, aggregating large amounts of data for logistics, mining data for machine learning and AI systems, and more. With external plugins and tools, Elasticsearch becomes even more flexible and adaptable as part of the data lake managing your organization’s voluminous data.
ClusterControl 1.9.3 added support for Elasticsearch, giving users an opportunity for full-lifecycle automation without using Elastic Cloud or moving to OpenSearch. If you’re currently using Elasticsearch and are curious to know why ClusterControl provides a better way to manage your database ops, check out this article, A true alternative to Elastic Cloud for Elasticsearch ops automation. We will be coming out with more content around Elasticsearch best practices, so stay tuned by following us on LinkedIn and Twitter, and subscribing to our newsletter.