
How I learned Elasticsearch

Published Nov 12, 2018

About me

Hello! My name is Murali. I am an experienced professional in SDLC, Agile, and Java/J2EE development.

Why I wanted to learn Elasticsearch

I love learning new and emerging technologies.
Learning a new technology builds confidence and improves your skills and your ability to take on challenging work.

Keep reading, and keep moving toward new technologies.

How I approached learning Elasticsearch

Start by reading Elasticsearch: The Definitive Guide.
https://www.elastic.co/learn

While reading the guide, build a small, one-node Elasticsearch cluster.

Installation is pretty easy. You can do it on Windows as well.

Understand search, analysis, and relevance.
Understand aggregations.

Challenges I faced

Shard allocation:
As a high-level strategy, if you are creating an index that you plan to update frequently, make sure you designate enough primary shards so that you can spread the indexing load evenly across all of your nodes. The general recommendation is to allocate one primary shard per node in your cluster, and possibly two or more primary shards per node, but only if you have a lot of CPU and disk bandwidth on those nodes. However, keep in mind that shard overallocation adds overhead and may negatively impact search, since search requests need to hit every shard in the index. On the other hand, if you assign fewer primary shards than the number of nodes, you may create hotspots, as the nodes that contain those shards will need to handle more indexing requests than nodes that don’t contain any of the index’s shards.
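
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the helper names primary_shard_count and create_index_body are hypothetical, and the resulting dictionary is meant to be sent as the body of a create-index request (PUT /<index>) with whatever HTTP client you prefer.

```python
import json


def primary_shard_count(node_count: int, shards_per_node: int = 1) -> int:
    """One primary shard per node by default; raise shards_per_node only
    when the nodes have CPU and disk bandwidth to spare."""
    if node_count < 1 or shards_per_node < 1:
        raise ValueError("node_count and shards_per_node must be positive")
    return node_count * shards_per_node


def create_index_body(node_count: int, shards_per_node: int = 1) -> dict:
    """Body for a create-index request (hypothetical helper)."""
    return {
        "settings": {
            "number_of_shards": primary_shard_count(node_count, shards_per_node),
        }
    }


# Example: a three-node cluster with one primary shard per node.
print(json.dumps(create_index_body(node_count=3), indent=2))
```

Keeping shards_per_node at 1 unless you have measured spare capacity is the conservative choice, since every extra shard is one more shard each search has to visit.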

Disable merge throttling:
Merge throttling is Elasticsearch’s automatic tendency to throttle indexing requests when it detects that merging is falling behind indexing. If you want to optimize for indexing performance rather than search, it makes sense to update your cluster settings to disable merge throttling by setting indices.store.throttle.type to “none”. This setting belongs to the 1.x releases; from version 2.0 onward, Elasticsearch throttles merges automatically and the setting is deprecated. You can make the change persistent (it survives a full cluster restart) or transient (it resets to the default after a full cluster restart), based on your use case.
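
As a rough sketch, the body for the cluster settings request (PUT /_cluster/settings) could be built like this. The setting name comes from the paragraph above; the helper cluster_settings_body is hypothetical, and I am assuming a release that still accepts this setting, so check the documentation for your version first.

```python
import json

# Setting name taken from the text above; only some releases accept it.
THROTTLE_SETTING = "indices.store.throttle.type"


def cluster_settings_body(persistent: bool) -> dict:
    """Build a cluster-settings body that disables merge throttling.

    persistent=True  -> the change survives a full cluster restart
    persistent=False -> the change is dropped on a full cluster restart
    """
    scope = "persistent" if persistent else "transient"
    return {scope: {THROTTLE_SETTING: "none"}}


# Example: a transient change, useful for a one-off bulk load.
print(json.dumps(cluster_settings_body(persistent=False), indent=2))
```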

Increase the size of the indexing buffer:
This setting (indices.memory.index_buffer_size) determines how full the buffer can get before its documents are written to a segment on disk. The default setting limits this value to 10 percent of the total heap in order to reserve more of the heap for serving search requests, which doesn’t help you if you’re using Elasticsearch primarily for indexing.
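
To get a feel for the numbers, here is a small sketch of the arithmetic. The 4 GiB heap and the 20 percent value are assumptions chosen only for illustration, and both helper names are hypothetical. The printed line is what you would put in elasticsearch.yml on each node.

```python
def index_buffer_bytes(heap_bytes: int, percent: int = 10) -> int:
    """Size of the indexing buffer for a given heap and percentage."""
    return heap_bytes * percent // 100


def yml_line(percent: int) -> str:
    """The matching elasticsearch.yml entry (node-level setting)."""
    return f"indices.memory.index_buffer_size: {percent}%"


heap = 4 * 1024 ** 3  # assumed 4 GiB heap, for illustration only
for pct in (10, 20):
    mib = index_buffer_bytes(heap, pct) // 1024 ** 2
    print(f"{yml_line(pct)}  (about {mib} MiB)")
```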

Index first, replicate later:
When you initialize an index, specify zero replica shards in the index settings, and add replicas after you’re done indexing. This will boost indexing performance, but it can be a bit risky if the node holding the only copy of the data crashes before you have a chance to replicate it.
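
The two steps could look roughly like this. The helper names are hypothetical: the first body goes with the create-index request (PUT /<index>), and the second goes to the index settings API (PUT /<index>/_settings) once the initial load has finished.

```python
import json


def initial_load_body(primary_shards: int) -> dict:
    """Create-index body with no replicas, for the initial bulk load."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": 0,
        }
    }


def add_replicas_body(replicas: int = 1) -> dict:
    """Index-settings body that adds replicas after indexing is done."""
    return {"index": {"number_of_replicas": replicas}}


print(json.dumps(initial_load_body(primary_shards=3), indent=2))
print(json.dumps(add_replicas_body(replicas=1), indent=2))
```

Until the second request has been applied and the replicas have been allocated, each shard exists only once, so keep the source data around in case you need to reindex.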

Refresh less frequently:
Increase the refresh interval (index.refresh_interval) through the index settings API. By default, an index refreshes every second; during heavy indexing periods, refreshing less often can relieve some of the workload.
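
A minimal sketch of relaxing the refresh interval before a heavy load and restoring it afterwards. The 30s value is an assumption chosen for illustration, and refresh_body is a hypothetical helper. Both bodies go to PUT /<index>/_settings.

```python
import json

DEFAULT_REFRESH = "1s"  # the default mentioned in the text


def refresh_body(interval: str) -> dict:
    """Index-settings body that sets the refresh interval.

    Use a time value such as "30s"; "-1" turns automatic refresh off.
    """
    return {"index": {"refresh_interval": interval}}


before_bulk = refresh_body("30s")           # refresh less often while loading
after_bulk = refresh_body(DEFAULT_REFRESH)  # go back to the default afterwards

print(json.dumps(before_bulk))
print(json.dumps(after_bulk))
```

Documents indexed between refreshes are not searchable yet, so pick an interval your search users can live with.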

Tweak your translog settings:
As of version 2.0, Elasticsearch will flush translog data to disk after every request, reducing the risk of data loss in the event of hardware failure. If you want to prioritize indexing performance over potential data loss, you can change index.translog.durability to async in the index settings. With this in place, the index will only commit writes to disk upon every sync_interval, rather than after each request, leaving more of its resources free to serve indexing requests.
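
As a sketch, the settings body for switching translog durability might be built like this (translog_durability_body is a hypothetical helper, and the body goes to PUT /<index>/_settings). The value "request" is the default behaviour described above, where the translog is flushed to disk after every request.

```python
import json


def translog_durability_body(async_mode: bool) -> dict:
    """Index-settings body for index.translog.durability.

    async_mode=True  -> "async": sync on an interval, faster but riskier
    async_mode=False -> "request": sync after every request (the default)
    """
    durability = "async" if async_mode else "request"
    return {"index.translog.durability": durability}


print(json.dumps(translog_durability_body(async_mode=True)))
print(json.dumps(translog_durability_body(async_mode=False)))
```

With "async", writes acknowledged since the last sync can be lost if the machine fails, so switch back to "request" once the heavy indexing is over.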

Key takeaways

While learning Elasticsearch, you can also become familiar with the following related technologies:

  • Solr
  • Kibana
  • Logstash
  • Beats

Use different data sources. Index data from Logstash and from Hadoop tools such as Hive, Pig, and Spark. Elastic has an excellent Hadoop connector: Elasticsearch for Hadoop.

Play with the configuration, one change at a time. Note how each change affects indexing and search performance.

Tips and advice

Start indexing data into Elasticsearch. Use dynamic mapping at first and see how the fields are mapped. Then use custom mappings.
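
Once you have seen what dynamic mapping produces, a custom mapping could look something like the sketch below. The field names (title, author, published, views) are made up for illustration, and I am assuming the typeless mapping layout of Elasticsearch 7.x and later; older releases nest "properties" under a type name.

```python
import json


def explicit_mapping_body() -> dict:
    """Create-index body with a hand-written mapping (hypothetical fields)."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},        # analyzed, full-text search
                "author": {"type": "keyword"},    # exact match, aggregations
                "published": {"type": "date"},
                "views": {"type": "integer"},
            }
        }
    }


print(json.dumps(explicit_mapping_body(), indent=2))
```

Comparing this with the output of GET /<index>/_mapping on a dynamically mapped index is a quick way to see which fields Elasticsearch guessed differently from what you wanted.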

Try using some analyzers on strings, and check to see how they are indexed.
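
One way to experiment is the analyze API (POST /_analyze), which returns the tokens an analyzer produces for a piece of text. The sketch below only builds the request bodies; the sample sentence is made up, analyze_body is a hypothetical helper, and the four names are built-in analyzers.

```python
import json

SAMPLE_TEXT = "The QUICK brown-fox jumped over 2 lazy dogs"  # made-up sample


def analyze_body(analyzer: str, text: str) -> dict:
    """Body for the analyze API: which analyzer to run on which text."""
    return {"analyzer": analyzer, "text": text}


# Send each body to POST /_analyze and compare the returned tokens.
for name in ("standard", "whitespace", "simple", "keyword"):
    print(json.dumps(analyze_body(name, SAMPLE_TEXT)))
```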

Query your data using Query DSL.
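
Here is a small sketch of a Query DSL request that combines a full-text match, a filter, and an aggregation. The field names (title, published, author) and the helper search_body are hypothetical; the body would be sent to the search API (POST /<index>/_search).

```python
import json


def search_body(text: str, since: str) -> dict:
    """Search body: full-text match, date filter, and a terms aggregation."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"title": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"range": {"published": {"gte": since}}}],
            }
        },
        # Count matching documents per author.
        "aggs": {"posts_per_author": {"terms": {"field": "author"}}},
        "size": 10,
    }


print(json.dumps(search_body("elasticsearch", "2018-01-01"), indent=2))
```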

Final thoughts and next steps

Thank you for reading this post. I hope you found it helpful. You can find me on GitHub, LinkedIn, and Codementor. If you have any questions, feel free to reach out to me!
