Removing duplicate documents from Elasticsearch saves disk space and speeds up searches, since queries have fewer documents to scan. That saves you time and makes you more productive.

I set up and manage ELK (Elasticsearch, Logstash, and Kibana) clusters that process hundreds of millions of log lines per day. Once in a while it is necessary to reprocess a log. If I simply restarted the log export from the web server without a mechanism to ensure unique document IDs, much of the data would be duplicated, giving an inaccurate view of the system's traffic. Using Logstash, I'll show you how to simplify the deduplication process.

This example assumes a cluster of servers all running Nginx as the web server. The Logstash configuration will have an input, a filter, and an output.

TIP
I use, and recommend you use, Elasticsearch index names that make aliasing efficient. In this case, I use logstash-nginx-#-YYYY.MM.DD, where # is a unique integer value for each index. I created an Elasticsearch index template to automatically associate an alias, logstash-nginx-YYYY.MM.DD, that wraps both the original index and the new index and is used for searches. This makes it possible to delete the original index after the deduplicated index is created and checked for data integrity.
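For a single day, the shared alias can also be attached by hand with the aliases API. This sketch assumes the wildcard pattern matches both the original and the deduplicated index:

```shell
# Illustrative only: attach the shared alias to every index for one day.
# An index template automates this for newly created indexes.
curl -H 'Content-Type: application/json' -XPOST "esnode:9200/_aliases" -d '
{
  "actions": [
    {
      "add": {
        "index": "logstash-nginx-*-2020.03.01",
        "alias": "logstash-nginx-2020.03.01"
      }
    }
  ]
}'
```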

A sample of the Nginx log event that gets shipped to Logstash:

{
    "host": "web-server-26",
    "source": "/var/log/nginx/access.log",
    "offset": 324075948
}

To read the log from the web server, I’ll use the Beats input plugin.

input {
  beats {
    port => 5044
  }
}

To deduplicate an existing Elasticsearch index, I’ll use the Elasticsearch input plugin in Logstash.

input { 
  elasticsearch { 
    hosts => "esnode"
    index => "logstash-nginx-1-2020.03.01"
    query => '{ "sort": [ "_doc" ] }'
  }
} 

I’ll use the fingerprint filter in Logstash to generate a unique document_id from three fields — host, source, and offset — that, taken together, uniquely identify each document.

filter {
    fingerprint {
        key => "myrandomkey"
        method => "SHA256"
        concatenate_sources => true
        source => ["host", "source", "offset"]
        target => "[@metadata][temporary_id]"
    }
}
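Conceptually, the filter computes a keyed SHA-256 hash over the concatenated source fields. The Python sketch below illustrates the idea; note that Logstash's internal concatenation format differs, so these hashes are illustrative, not byte-compatible with the filter's output.

```python
import hashlib
import hmac

def fingerprint(event, sources, key="myrandomkey"):
    """Keyed SHA-256 over concatenated source fields, roughly what the
    fingerprint filter does with concatenate_sources => true.
    (Illustrative: Logstash's exact concatenation format differs.)"""
    concatenated = "|".join(f"{field}|{event[field]}" for field in sources)
    return hmac.new(key.encode(), concatenated.encode(), hashlib.sha256).hexdigest()

event = {"host": "web-server-26", "source": "/var/log/nginx/access.log", "offset": 324075948}
duplicate = dict(event)                    # the same log line shipped twice
different = dict(event, offset=324075999)  # a different line from the same file

# Identical events always hash to the same ID, so re-indexing them
# overwrites instead of duplicating; distinct events get distinct IDs.
assert fingerprint(event, ["host", "source", "offset"]) == fingerprint(duplicate, ["host", "source", "offset"])
assert fingerprint(event, ["host", "source", "offset"]) != fingerprint(different, ["host", "source", "offset"])
```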

Lastly, I’ll use the Elasticsearch output plugin to create a new index containing only unique documents. Because the fingerprint is used as the document ID, any duplicate event overwrites its earlier copy instead of creating a new document.

output {
    elasticsearch {
        hosts => "esnode"
        index => "logstash-nginx-2-2020.03.01"
        document_id => "%{[@metadata][temporary_id]}"
    }
}

Cleanup

After the new index is created and verified, delete the old index. Since both indexes are automatically associated with the Elasticsearch alias logstash-nginx-2020.03.01, your searches will continue to work.
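A quick way to verify before deleting is to compare document counts between the two indexes with the count API:

```shell
# Count documents in the original and deduplicated indexes. The new
# index should contain the same number of documents or fewer (fewer
# means duplicates were collapsed).
curl "esnode:9200/logstash-nginx-1-2020.03.01/_count"
curl "esnode:9200/logstash-nginx-2-2020.03.01/_count"
```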

curl -H 'Content-Type: application/json' -XDELETE "esnode:9200/logstash-nginx-1-2020.03.01"

TIP
Until the old index is deleted, duplicate data will be visible in searches. To temporarily exclude the old index in Kibana, use a filter like the one below.

_index is not logstash-nginx-1-2020.03.01

Or use the less performant query in the Kibana search bar.

NOT _index : logstash-nginx-1-2020.03.01

If you query Elasticsearch programmatically, use a query like the one below.

{
  "query": {
    "bool": {
      "must_not": {
        "match_phrase": {
          "_index": "logstash-nginx-1-2020.03.01"
        }
      }
    }
  }
}
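When queries are built in code, the exclusion clause can be wrapped around whatever query you are already running. A small Python sketch — the helper name is illustrative, not part of any Elasticsearch client library:

```python
def exclude_index(query, index_name):
    """Wrap an existing query body so results from index_name are
    excluded. Illustrative helper, not a client-library function."""
    return {
        "query": {
            "bool": {
                "must": [query],
                "must_not": {"match_phrase": {"_index": index_name}},
            }
        }
    }

# Build a body that excludes the old index; POST it to the alias, e.g.
# esnode:9200/logstash-nginx-2020.03.01/_search
body = exclude_index({"match_all": {}}, "logstash-nginx-1-2020.03.01")
```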