A Hive-Based Incremental Data Collection Method and Process

2022-06-22 22:22:48 Source: Chinese Patents

Technical features:
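1. A Hive-based incremental data collection method, characterized in that it uses the following parameters:
${partdate}: yesterday's date in year-month-day format, filled in according to the parameter that the collection tool itself defines for Hive partition collection;
${ods_table_name}: the name of the ODS table in Hive that the data is collected into;
${only_id}: the combination of columns that forms the table's unique primary key;
${rec_create_time}: time field, creation time;
${rec_revise_time}: time field, update time;
${check_days}: the period within which the source database may delete data, kept consistent with the period extracted from the source database in step S2; for example, if the source database may delete data from the last 30 days, this value is set to 30.
The implementation steps are as follows:
S1: initialize by collecting the full data set; this applies to the initial collection and to any rerun of the full data;
S2: on a schedule, extract the source-database data that falls within the deletion period into the ODS table;
S3: govern the ODS table with SQL scripts once per day, with the scheduled run time set after step S2 has finished;
S4: on a schedule, store the data of the 30210100 partition permanently.

2. The Hive-based incremental data collection method according to claim 1, characterized in that initializing the full collection uses two tasks. The first is an ordinary task that extracts the full data set into the ${partdate} partition. The second is an SQL script task that drops all partitions before ${partdate} and puts the latest full data into the _f table. The script is as follows.
First, after the full extraction, drop the earlier data:
    alter table ${ods_table_name} drop partition(pt<${partdate});
    alter table ${ods_table_name} drop partition(pt='30210100');
Next, because the data has been re-extracted, the deletion table must also be recomputed:
    truncate table ${ods_table_name}_d;
Finally, put the full data into the _f table:
    set hive.support.quoted.identifiers=none;
    insert overwrite table ${ods_table_name}_f select * from ${ods_table_name};

3. The Hive-based incremental data collection method according to claim 1, characterized in that when the source-database data within the deletion period is extracted into the ODS table, the partition name is pt='30210100', the schedule is daily, and the run time is set according to business requirements. The SQL that extracts from the source table is determined by the table's time fields. If there is only one time field, it can be written as:
    select * from table
    where creation_date >= to_char(sysdate-30,'yyyymmdd')
      and creation_date < to_char(sysdate,'yyyymmdd');

4. The Hive-based incremental data collection method according to claim 3, characterized in that when the source table has two time fields, the script can be written as:
    select * from table
    where (creation_date >= to_date(to_char(sysdate-30,'yyyy-mm-dd'),'yyyy-mm-dd')
           and creation_date < to_date(to_char(sysdate,'yyyy-mm-dd'),'yyyy-mm-dd'))
       or (last_update_date >= to_date(to_char(sysdate-30,'yyyy-mm-dd'),'yyyy-mm-dd')
           and last_update_date < to_date(to_char(sysdate,'yyyy-mm-dd'),'yyyy-mm-dd'));

5. The Hive-based incremental data collection method according to claim 1, characterized in that when the ODS table is governed with SQL scripts, a temp table carrying a merged primary key is created to simplify primary-key comparison; apart from the merged key column, the temp table has the same structure and content as the ODS table:
    create table if not exists ${ods_table_name}_temp as select *, '' as only_id from ${ods_table_name} limit 0;
    insert overwrite table ${ods_table_name}_temp select *, concat(${only_id}) as only_id from ${ods_table_name};

6. The Hive-based incremental data collection method according to claim 5, characterized in that the latest partition of the temp table, pt=30210100, holds the most recent 30 days of data. The 30-day data in the temp table is left-joined to that latest 30-day data; rows for which the pt=30210100 side is null are the rows deleted from the source database, and they are inserted into the ods_d deletion table. When the source database needs one time field for incremental collection:
    create table if not exists ${ods_table_name}_d as select * from ${ods_table_name}_temp limit 0;
    insert into ${ods_table_name}_d
    select a.* from ${ods_table_name}_temp a
    left join (select * from ${ods_table_name}_temp where pt='30210100') b
      on concat(a.only_id)=concat(b.only_id)
    where a.${rec_create_time} >= regexp_replace(substr(date_sub(from_unixtime(unix_timestamp()),${check_days}),1,10),'-','')
      and a.${rec_create_time} < from_unixtime(unix_timestamp(),'yyyyMMdd')
      and b.only_id is null;

7. The Hive-based incremental data collection method according to claim 6, characterized in that when the source database needs two time fields for collection:
    create table if not exists ${ods_table_name}_d as select * from ${ods_table_name}_temp limit 0;
    insert into ${ods_table_name}_d
    select a.* from ${ods_table_name}_temp a
    left join (select * from ${ods_table_name}_temp where pt='30210100') b
      on concat(a.only_id)=concat(b.only_id)
    where (a.${rec_create_time} >= regexp_replace(substr(date_sub(from_unixtime(unix_timestamp()),${check_days}),1,10),'-','')
           and a.${rec_create_time} < from_unixtime(unix_timestamp(),'yyyyMMdd')
           and b.only_id is null)
       or (a.${rec_revise_time} >= regexp_replace(substr(date_sub(from_unixtime(unix_timestamp()),${check_days}),1,10),'-','')
           and a.${rec_revise_time} < from_unixtime(unix_timestamp(),'yyyyMMdd')
           and b.only_id is null);

8. The Hive-based incremental data collection method according to claim 5, characterized in that a temporary tempf table with the same structure as the temp table is created, and the data that remains after removing the deleted rows is inserted into it:
    create table if not exists ${ods_table_name}_tempf as select * from ${ods_table_name}_temp limit 0;
    insert overwrite table ${ods_table_name}_tempf
    select a.* from ${ods_table_name}_temp a
    where not exists (select * from (select distinct only_id from ${ods_table_name}_d) b where a.only_id=b.only_id);

9. The Hive-based incremental data collection method according to claim 1, characterized in that an _f table with the same structure as the ODS table is created; the tempf table is deduplicated to keep the most recently modified row for each key, and the merged primary key is dropped, giving an ods_f table consistent with the source database:
    create table if not exists ${ods_table_name}_f as select * from ${ods_table_name} limit 0;
    set hive.support.quoted.identifiers=none;
    insert overwrite table ${ods_table_name}_f
    select `(only_id)?+.+` from (
      select distinct b.* from (
        select only_id, pt, row_number() over(partition by only_id order by pt desc) as row_num
        from ${ods_table_name}_tempf) a
      left join ${ods_table_name}_tempf b
        on a.only_id=b.only_id and a.pt=b.pt
      where a.row_num=1) c;

10. The Hive-based incremental data collection method according to claim 1, characterized in that the data collected into the 30210100 partition by step S2 that day is stored into the previous day's partition of the ODS table for permanent storage. The task frequency is set by the number of days within which the source database deletes data; for example, if it deletes data from the last 30 days, the task runs once every 30 days, scheduled after that day's step S2 collection has finished:
    set hive.support.quoted.identifiers=none;
    insert overwrite table ${ods_table_name} partition(pt=${partdate})
    select `(pt)?+.+` from ${ods_table_name} where pt='30210100';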
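The governance logic of claims 5 to 9 (find rows deleted at the source, drop them, then keep the newest row per key) can be tried on a small in-memory model. The following is a minimal sketch under stated assumptions, not the patented scripts: Python's sqlite3 stands in for Hive, the table `ods_temp` and its columns `create_day` and `payload` are hypothetical, the check window is passed in as two 'yyyyMMdd' strings instead of being derived from `${check_days}`, and the tempf and _f steps are folded into one query. It requires an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

LATEST_PT = "30210100"  # fixed partition that holds the rolling window in the claims


def govern(rows, window_start, window_end):
    """Return (deleted_ids, final_rows) for rows shaped like the _temp table.

    rows: iterable of (only_id, create_day, payload, pt) tuples (hypothetical schema).
    window_start / window_end: 'yyyyMMdd' strings bounding the check window.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "create table ods_temp (only_id text, create_day text, payload text, pt text)"
    )
    con.executemany("insert into ods_temp values (?, ?, ?, ?)", rows)

    # Deletion table: rows created inside the window whose key is missing
    # from the latest partition (left join, keep only the unmatched rows).
    con.execute("create table ods_d as select * from ods_temp limit 0")
    con.execute(
        """
        insert into ods_d
        select a.* from ods_temp a
        left join (select only_id from ods_temp where pt = ?) b
          on a.only_id = b.only_id
        where a.create_day >= ? and a.create_day < ? and b.only_id is null
        """,
        (LATEST_PT, window_start, window_end),
    )
    deleted_ids = [
        r[0]
        for r in con.execute("select distinct only_id from ods_d order by only_id")
    ]

    # Final result: exclude deleted keys, then keep the row from the
    # newest partition for each key and leave out the pt column.
    final_rows = con.execute(
        """
        select only_id, create_day, payload from (
            select t.*, row_number() over (
                partition by only_id order by pt desc) as rn
            from ods_temp t
            where not exists (
                select 1 from ods_d d where d.only_id = t.only_id)
        )
        where rn = 1
        order by only_id
        """
    ).fetchall()
    con.close()
    return deleted_ids, final_rows


SAMPLE_ROWS = [
    ("1", "20220601", "old", "20220601"),      # still present at source, older copy
    ("1", "20220601", "new", LATEST_PT),       # same key, re-extracted today
    ("2", "20220605", "gone", "20220605"),     # in window, absent from latest partition
    ("3", "20220101", "ancient", "20220101"),  # outside window, never re-extracted
]

if __name__ == "__main__":
    deleted, final = govern(SAMPLE_ROWS, "20220522", "20220621")
    print("deleted keys:", deleted)
    print("final rows:", final)
```

In this model, ordering by `pt` as a string puts '30210100' after any real date, so the re-extracted copy of a row is the one that survives deduplication. That appears to be why the claims use a far-future partition name, though the source does not state it.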

Technical summary
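The invention discloses a Hive-based incremental data collection method, relating to the technical field of big-data collection and data cleansing and governance. To use the method, one only needs to configure the relevant collection tasks and SQL script tasks and place the SQL scripts in an SQL execution tool that can run parameterized SQL scripts. Only the table name, creation time, update time, primary key, and the period n within which the source table may delete data are exposed; the user fills in these few key values when configuring the SQL execution task, which resolves the above problems in a uniform way and keeps the data consistent. The SQL script workflow is highly reusable, flexible in form, and simple to operate: once the scripts are placed in the execution tool, filling in the relevant parameters is enough to run each kind of governance. The method applies to all types of tables in a Hive database and to time fields in different formats within those tables.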
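The source does not describe how the "SQL execution tool" substitutes parameters. As a minimal sketch, assuming a tool that does plain textual substitution of the `${...}` placeholders, Python's `string.Template` happens to use the same placeholder syntax. The names `INIT_SCRIPT`, `TEMP_SCRIPT`, `render`, and the table `ods_orders` are hypothetical.

```python
from string import Template

# Initialization statements from claim 2, kept as a template that uses
# the claim's own parameter names.
INIT_SCRIPT = Template(
    "alter table ${ods_table_name} drop partition(pt<${partdate});\n"
    "alter table ${ods_table_name} drop partition(pt='30210100');\n"
    "truncate table ${ods_table_name}_d;\n"
)

# Merged-key statement from claim 5; ${only_id} is a comma-separated column list.
TEMP_SCRIPT = Template(
    "insert overwrite table ${ods_table_name}_temp "
    "select *, concat(${only_id}) as only_id from ${ods_table_name};\n"
)


def render(template, params):
    """Fill a script template; raises KeyError if a parameter is missing."""
    return template.substitute(params)


if __name__ == "__main__":
    params = {
        "ods_table_name": "ods_orders",   # hypothetical table name
        "partdate": "20220621",
        "only_id": "order_id, line_no",   # hypothetical key columns
    }
    print(render(INIT_SCRIPT, params))
    print(render(TEMP_SCRIPT, params))
```

Using `substitute` rather than `safe_substitute` makes a forgotten parameter fail loudly instead of sending a literal `${...}` to Hive.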

Inventors: 郑士良刘威宪黎荣华安宝刘东东林楠陈文豪张夏楠
Protected technology user: 河钢数字技术股份有限公司
Technology development date: 2022-04-08
Publication date: 2022-06-21
