Troubleshooting: Elasticsearch data cannot be written

Problem symptoms

  1. Kibana showed no data for the last 15 minutes
  2. Logging into the logstash-in and logstash-out nodes showed both processes running normally
  3. The ES cluster status stayed yellow
    GET http://192.168.122.18:9200/_cat/health?v 
    epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
    1499225413 11:30:13 wecash-model yellow 3 3 20621 10311 0 0 1 199 - 100.0%
  4. Logging into Redis showed the queue had piled up to over 400,000 entries and was still growing
    192.168.122.122:6379> keys *
    1) "test"
    2) "logstash"
    192.168.122.122:6379> llen logstash
    (integer) 210157
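The growth can be confirmed by sampling LLEN twice and comparing. A minimal sketch; the live redis-cli command is shown in the comments, and both samples are hard-coded here (the second value is hypothetical) so the script runs standalone:

```shell
#!/bin/sh
# Live check would be: redis-cli -h 192.168.122.122 llen logstash
# Samples are hard-coded so the sketch runs without Redis: the first is
# the value observed above, the second is an assumed later reading.
first=210157     # sample 1: llen logstash
# sleep 60       # wait between samples in a real check
second=210857    # sample 2 (hypothetical), taken ~60s later
echo "queue growth: $((second - first)) entries"
```

A positive difference on repeated runs confirms the consumers are not keeping up with the producers.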

Locating the problem

With the ES cluster yellow and the Redis queue growing continuously, Logstash was clearly still collecting client data normally, so the problem had to lie between logstash-out and the ES cluster

Investigation steps

  1. Check the logstash-out logs
[2017-07-04T20:08:05,534][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"process_cluster_event_timeout_exception", "reason"=>"failed to process cluster event (put-mapping) within 30s"})
[2017-07-04T20:08:05,534][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-07-04T20:08:05,534][ERROR][logstash.outputs.elasticsearch] Action
  2. Check the Elasticsearch logs
Log excerpt 1: 
[2017-07-04T15:48:30,610][WARN ][o.e.c.a.s.ShardStateAction] [es-node3] [logstash-wecash_operator_strategy_ol-2017.07.03][2] received shard failed for shard id [[logstash-wecash_operator_strategy_ol-2017.07.03][2]], allocation id [QKqZDIAwQ2mHcowle3L8Cg], primary term [0], message [failed recovery], failure [RecoveryFailedException[[logstash-wecash_operator_strategy_ol-2017.07.03][2]: Recovery failed from {es-node1}{zlUsTgdcRY6vg7xE82GZ-g}{pCe9AG12REWfB8pFwQzZLA}{192.168.122.18}{192.168.122.18:9300} into {es-node4}{5bu5Oi3cQOG8gl0fwBg_DQ}{G3ehwt-OQbehaUTWzl_eFg}{192.168.122.130}{192.168.122.130:9300}]; nested: RemoteTransportException[[es-node1][192.168.122.18:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [logstash-wecash_operator_strategy_ol-2017.07.03][2] from primary shard with sync id but number of docs differ: 81514 (es-node1, primary) vs 81516(es-node4)]; ]
  • The logstash-out log and the ES node log both report failed to process cluster event (put-mapping) within 30s; since the error messages match, the problem clearly originates in ES. Logstash does pull data from the queue and attempts to write it, but ES rejects the writes with that error, which per the official documentation means a cluster-state update for the index mapping (put-mapping) timed out.
  • The next question was what was preventing the writes: was something blocking the cluster and causing the timeouts? The cluster state was checked again
GET http://192.168.122.18:9200/_cat/health?v 
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1499225413 11:30:13 wecash-model yellow 3 3 20621 10311 0 0 1 199 - 100.0%

pending_tasks was 199, which means a large backlog of cluster-state tasks was waiting to execute.
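Because the _cat/health output is positional, it is easy to read the wrong column. A small awk sketch that builds a header-name-to-column map and prints the fields of interest, fed with the health line captured above:

```shell
# Parse _cat/health?v output: row 1 is the header, later rows are data.
summary=$(awk '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map name -> column
  { print "status=" $col["status"], "unassigned=" $col["unassign"], "pending_tasks=" $col["pending_tasks"] }
' <<'EOF'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1499225413 11:30:13 wecash-model yellow 3 3 20621 10311 0 0 1 199 - 100.0%
EOF
)
echo "$summary"
```

Looking up columns by header name keeps the check correct even if the column order changes between ES versions.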

  • Check the ES cluster's pending task list
curl -X GET http://192.168.122.21:9200/_cat/pending_tasks

The list contained a large number of failed-shard tasks. Combined with the ES log error primary shard with sync id but number of docs differ: 81514 (es-node1, primary) vs 81516 (es-node4) and the related GitHub issues, the conclusion was that when the primary and replica document counts differ, ES stops accepting updates to the affected index.
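The same divergence is visible per shard in the docs column of _cat/shards; against the live cluster that would be curl -s 'http://192.168.122.18:9200/_cat/shards?h=index,shard,prirep,docs,node'. A sketch over sample output: the doc counts for shard 2 come from the log line above, while the shard 1 rows and the node assignments are made up for illustration:

```shell
# Flag shards whose primary (p) and replica (r) doc counts differ.
# Input columns: index shard prirep docs node
mismatch=$(awk '
  { docs[$1 " " $2 " " $3] = $4; seen[$1 " " $2] = 1 }
  END { for (k in seen)
          if (docs[k " p"] != docs[k " r"])
            print k, "primary=" docs[k " p"], "replica=" docs[k " r"] }
' <<'EOF'
logstash-wecash_operator_strategy_ol-2017.07.03 2 p 81514 es-node1
logstash-wecash_operator_strategy_ol-2017.07.03 2 r 81516 es-node4
logstash-wecash_operator_strategy_ol-2017.07.03 1 p 81000 es-node2
logstash-wecash_operator_strategy_ol-2017.07.03 1 r 81000 es-node3
EOF
)
echo "$mismatch"
```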

  • Run the following command to force-flush the in-memory data to disk (see the official documentation)
curl -X POST \
'http://192.168.122.18:9200/logstash-model_data_receiver-2017.06.29/_flush?force'
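When several indices are affected, the same flush can be scripted over a list. A sketch that only prints the requests; to actually send them, replace echo with curl -s -X POST. The index list is illustrative:

```shell
ES=http://192.168.122.18:9200
# Illustrative index list; in practice derive it from the failed shards.
requests=$(for idx in logstash-model_data_receiver-2017.06.29; do
  # printing the request instead of sending it; to execute, use:
  #   curl -s -X POST "${ES}/${idx}/_flush?force"
  echo "POST ${ES}/${idx}/_flush?force"
done)
echo "$requests"
```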

Original post: https://gongxiude.gitbooks.io/operation_notes/eslasticsearch-shu-ju-bu-neng-xie-ru-wen-ti-pai-cha-chu-li.html
