Abstract
This thesis draws on the practice of building an enterprise-level monitoring platform for real-time computing tasks and monitors the real-time tasks deployed on a cluster. By acquiring the logs of the real-time computing tasks, the data is collected, transmitted, processed, and stored; different alerting strategies are then applied to detect abnormal data produced while the tasks run, to locate system faults automatically, and to notify the engineers responsible for each task in time.
Real-time monitoring of the tasks is achieved in three steps. 1) Each real-time computing task writes logs in its own custom format on the machine where it is deployed; the agents of a Flume cluster collect the logs from every machine, the data from the agents is aggregated, and it is then loaded into different sinks (such as Kafka and HDFS). 2) Although each task defines its own log format, the data is serialized in a uniform way into byte streams and written to the corresponding topics of the distributed messaging system Kafka; on the consumer side, the data of each topic is deserialized into different POJO objects and fed into Flink as data sources. After aggregation in Flink, the data is forwarded through Kafka to the time-series database Prometheus. 3) Prometheus applies different alerting algorithms to the data arriving from Kafka and generates the corresponding alerting rules in alerts.rules files. Abnormal data promptly produces alert messages that are sent to Prometheus's alerting component Alertmanager, which notifies the engineers in charge by e-mail and SMS and pinpoints the fault through automated tests, so that production failures and machine outages caused by the system, the network, or the servers are discovered in time and the losses to the enterprise are reduced. Finally, the resulting data is stored in the big-table store HBase as a reference for later alerting.
Key words: stream computing; log collection; data transmission; Flink; time-series data; big table
Stream-based real-time monitoring platform
71114329 Wu Wanjin
Instructor: Associate Professor Zhou Weiping
Lyme, Director of the Big Data Department, Pinduoduo
Abstract
This paper draws on the practice of building an enterprise-level monitoring platform for real-time computing tasks. It collects, transmits, processes, and stores data by acquiring the logs of the real-time computing tasks, then applies different alerting strategies to detect abnormal data produced while the tasks run, locates system faults automatically, and promptly informs the person in charge of each task.
Real-time monitoring of the tasks is achieved through the following steps. 1) Each real-time computing task writes logs in its own custom format on the machine where it is deployed; the agents of the Flume cluster collect the logs from every machine, aggregate the data from the agents, and load it into different sinks (e.g. Kafka, HDFS). 2) Although each task defines its own log format, the data is serialized in a uniform way into byte streams and written to the corresponding topics of the distributed messaging system Kafka; on the consumer side, the data of each topic is deserialized into different POJO objects and used as data sources for Flink. After aggregation in Flink, the data is sent through Kafka to the time-series database Prometheus (a minimal code sketch of this step is given below). 3) Prometheus applies different alerting algorithms to the data arriving from Kafka and generates the corresponding alerting rules in alerts.rules files. Abnormal data promptly produces alert messages that are sent to Prometheus's alerting component Alertmanager, which notifies the task owners by e-mail and SMS and pinpoints the fault through automated tests, so that production failures and machine outages caused by the system, the network, or the servers are discovered in time and business losses are reduced. Finally, the resulting data is stored in the big-table store HBase as a reference for later alerting.
Key words: stream computing; log collection; data transmission; Flink; time-series database; big table
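To make step 2) above concrete, the following is a minimal sketch, not the platform's actual implementation, of a Flink job that consumes log lines from a Kafka topic, parses them into (task name, count) pairs, and aggregates them per task over a one-minute window. The broker address, the topic name "task-logs", and the "taskName|..." log layout are assumptions made purely for illustration; the real formats, serialization, and sinks are those defined by the platform as described above.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class TaskLogAggregationSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka connection settings; broker address and group id are assumed values.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka-broker:9092");
        props.setProperty("group.id", "task-monitor");

        // Read raw log lines from the (assumed) topic "task-logs".
        DataStream<String> rawLogs = env.addSource(
                new FlinkKafkaConsumer<>("task-logs", new SimpleStringSchema(), props));

        rawLogs
            // Assume each line starts with the task name, e.g. "orderTask|2019-05-01 12:00:00|ERROR|...".
            .map(line -> Tuple2.of(line.split("\\|")[0], 1))
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)                            // group the stream by task name
            .timeWindow(Time.minutes(1))         // one-minute tumbling window
            .sum(1)                              // number of log records per task per minute
            .print();                            // stand-in for the Kafka sink towards Prometheus

        env.execute("task-log-aggregation-sketch");
    }
}

In the real platform the print() call would be replaced by a Kafka producer sink, so that the aggregated counts flow on to Prometheus as described in step 3).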
Contents
Abstract
Abstract (in English)
Chapter 1 Introduction
1.1 Background and Significance of the System
1.2 Main Research Content
1.3 Technical Approach
Chapter 2 System Requirements and Design
2.1 Requirements Analysis
2.2 System Design
2.2.1 Design of the Data Processing Part
2.2.2 Design of the Monitoring and Alerting Part
Chapter 3 Feasibility of Data Processing
3.1 Data Collection
3.1.1 Availability
3.1.2 Reliability
3.1.3 Scalability
3.2 Data Transmission
3.2.1 Availability
3.2.2 Reliability
3.2.3 Scalability
3.3 Data Computation
3.3.1 Availability
3.3.2 Reliability
3.3.3 Scalability
Chapter 4 Detecting Abnormal Data and Locating Faults
4.1 Data Detection Strategies
4.2 Fault Localization
Chapter 5 Summary and Outlook
5.1 Summary
5.2 Outlook
Acknowledgements
References
Chapter 1 Introduction
1.1 Background and Significance of the System
In the daily operation of an enterprise, a large amount of business data and machine log data is inevitably produced. For the more important logs (for example, order volume or the number of inbound and outbound customer calls), we want to record them and apply different monitoring and alerting algorithms to the current data to identify data that may be abnormal. Developers and operators can then locate the abnormal time window, run automated anomaly tests to determine the cause, and notify the people concerned, so that the vast majority of the anomalies encountered in daily production are resolved quickly and the losses caused by network or machine resource problems are reduced.
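To illustrate what such a monitoring and alerting algorithm can look like, the sketch below implements a simple sliding-window mean ± k·σ (three-sigma) check, one common way to flag a metric value that deviates sharply from its recent history. It is only an illustrative example under assumed class, metric, and parameter names; the detection strategies actually used by the platform are discussed in Chapter 4.

import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window three-sigma check: flags a value that deviates from the
 *  mean of the recent window by more than k standard deviations. */
public class ThresholdDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double k;

    public ThresholdDetector(int windowSize, double k) {
        this.windowSize = windowSize;
        this.k = k;
    }

    /** Returns true if the new value looks abnormal relative to the recent window. */
    public boolean isAnomaly(double value) {
        if (window.size() >= windowSize) {
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double variance = window.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0.0);
            double stddev = Math.sqrt(variance);
            if (stddev > 0 && Math.abs(value - mean) > k * stddev) {
                return true;  // outside the mean ± k·σ band: raise an alert, keep it out of the window
            }
        }
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst();  // slide the window forward
        }
        return false;
    }

    public static void main(String[] args) {
        ThresholdDetector detector = new ThresholdDetector(10, 3.0);
        // Invented example metric: orders per minute, with a sharp drop at the end.
        double[] ordersPerMinute = {100, 98, 103, 101, 99, 102, 100, 97, 104, 100, 5};
        for (double v : ordersPerMinute) {
            if (detector.isAnomaly(v)) {
                System.out.println("abnormal value detected: " + v);
            }
        }
    }
}

A flagged value is deliberately not added to the window, so a single outlier does not distort the statistics used for the checks that follow.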