WO2019037093A1 - Spark distributed computing data processing method and system - Google Patents
Spark distributed computing data processing method and system
- Publication number
- WO2019037093A1 (PCT/CN2017/099083)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- storage area
- memory storage
- eviction
- data
- space
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
Definitions
- The present invention relates to the field of computers, and in particular to a Spark distributed computing data processing method and system.
- Spark has become a popular computing framework for big data applications, especially in iterative computing fields such as graph computing and machine learning.
- As datasets grow, a shortage of space means that some partition data cannot be cached in memory, or that data already cached in memory has to be migrated to disk, which degrades Spark performance.
- To address this, Spark proposed and designed a unified memory management model: when a caching task for partition data cannot obtain enough storage-area space, the model actively migrates data already cached in the storage area to disk or drops it outright.
- The unified memory management model has some flexibility: by migrating or dropping cached data, it relieves the tension between Spark's need to cache large data and the shortage of storage-area space.
- However, the Spark unified memory management model causes some Spark tasks to be recomputed or to read from disk, which badly hurts Spark performance.
- The main purpose of the present invention is to provide a Spark distributed computing data processing method and system that solve the prior-art technical problem of repeated computation or disk reads by some Spark tasks under the Spark unified memory management model.
- A first aspect of the present invention provides a data processing method for a Spark distributed computing system, the method comprising:
- when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, if the request for space in Spark's memory storage area fails, sending the eviction logic unit a command to evict the evictable cached data in the memory storage area;
- calculating the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, setting a migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data;
- reading and releasing the evictable cached data in the memory storage area, migrating that data to the migration address, modifying its persistence level, and feeding back an eviction success signal and eviction information.
- A second aspect of the present invention further provides a Spark distributed computing data processing system, the system comprising:
- a storage request module configured to, when performing a storage task on RDD partition data that the user has marked for caching, send the eviction logic unit a command to evict cached data from the memory storage area if the request for space in Spark's memory storage area fails;
- a calculation and addressing module configured to calculate the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, to set a migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data;
- a data migration module configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify its persistence level, and feed back an eviction success signal and eviction information.
- By introducing an SSD and an HDD to build a hybrid storage system, and by designing an eviction logic unit and a cache data migration unit, partition data is flexibly migrated to the SSD or the HDD according to its heat, instead of cached intermediate data being migrated straight to disk or cached data being dropped.
- This relieves the pressure between the large storage-area demand of Spark partition data caching and the shortage of memory space.
- When partition data is called, the high-speed read/write performance of the hybrid storage system and the heat-based separate storage of partition data allow partition data of different access heat to be read quickly from the hybrid storage system, improving Spark performance.
- FIG. 1 is a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention.
- FIG. 2 is a schematic flowchart of the detailed sub-steps of step S101 of the method.
- FIG. 3 is a schematic flowchart of the detailed sub-steps of step S102 of the method.
- FIG. 4 is a schematic flowchart of the detailed sub-steps of step S304 of the method.
- FIG. 5 is a schematic flowchart of the detailed sub-steps of the data migration part of step S103 of the method.
- FIG. 6 is a schematic flowchart of the detailed sub-steps of the persistence-level modification part of step S103 of the method.
- FIG. 7 is a schematic diagram of the functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention.
- FIG. 8 is a schematic diagram of the sub-modules of the storage request module 601 of the system.
- FIG. 9 is a schematic diagram of the sub-modules of the calculation and addressing module 602 of the system.
- FIG. 10 is a schematic diagram of the sub-modules of the data migration module 603 of the system.
- Referring to FIG. 1, a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention, the processing method includes:
- setting the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area.
- FIG. 2 is a schematic flowchart of the detailed sub-steps of step S101 of the Spark distributed computing data processing method according to an embodiment of the present invention. The sub-steps include:
- The Spark execution engine schedules subtasks through the task scheduler and, in the subtask's runtime space, performs a storage task on the RDD partition data that the user has marked for caching. It then tries to request space from Spark's memory storage area; if the request succeeds, the RDD partition data is stored directly.
- FIG. 3 is a schematic flowchart of the detailed sub-steps of step S102 of the Spark distributed computing data processing method according to an embodiment of the present invention. The sub-steps include:
- The eviction logic unit receives the eviction command and sends the memory storage area a request to evict space, because the storage space needed to perform the storage task on the RDD partition data is insufficient.
- The memory storage area determines whether it has evictable space and reports the result to the eviction logic unit.
- Under the least recently used (LRU) policy, data is evicted according to the historical access heat records of the data in the memory storage area.
- The core idea is that data accessed recently is more likely to be accessed again in the future; the size of the evictable space in the memory storage area is determined from this access probability.
- The evictable space is compared with the space that the storage task on the RDD partition data needs to occupy.
- If the evictable space is too small, the migration task for the evictable cached data is terminated, and a signal that eviction of the evictable cached data failed is fed back.
- FIG. 4 is a schematic flowchart of the detailed sub-steps of step S304 of the Spark distributed computing data processing method according to an embodiment of the present invention. The sub-steps include:
- The first preset heat value range corresponds to evictable cached data in the memory storage area with high access heat; the specific range can be set freely by the user.
- The first preset heat value is greater than the second preset heat value.
- The second preset heat value range corresponds to evictable cached data in the memory storage area with low access heat; the specific range can be set freely by the user.
- FIG. 5 is a schematic flowchart of the detailed sub-steps of the data migration part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention.
- The sub-steps include:
- After receiving the migration information and the migration command for the evictable cached data in the memory storage area, the cache data migration unit stores the evictable data on the SSD or the HDD according to the migration information.
- More specifically, on receiving the migration information and the migration command, the unit first reads the cached data in the designated memory storage area and releases the corresponding memory space, and then stores that data on the SSD or the HDD at the migration address.
- The migration information for the evictable data includes the address of the evictable cached data, the size of the space it occupies, and the migration address.
- A migration-complete signal for the evictable cached data is then sent to the eviction logic unit.
- FIG. 6 is a schematic flowchart of the detailed sub-steps of the persistence-level modification part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention.
- The sub-steps include:
- If the migration address of the evictable cached data in the memory storage area is the SSD, the persistence level of that data is changed to SSD_ONLY.
- When the modification is complete, an eviction success signal for the evictable cached data and the migration information for the evictable data are fed back, so that the RDD partition data enters the memory storage area and the storage task completes.
- FIG. 7 is a schematic diagram of the functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention.
- The functional modules include:
- The storage request module 601 is configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict cached data from the memory storage area if the request for space in Spark's memory storage area fails.
- The calculation and addressing module 602 is configured to calculate the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, to set the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data.
- The data migration module 603 is configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify its persistence level, and feed back an eviction success signal and eviction information.
- FIG. 8 is a schematic diagram of the sub-modules of the storage request module 601 of the Spark distributed computing data processing system according to an embodiment of the present invention. The sub-modules include:
- The first request module 6011 is configured to calculate the size of the memory storage area space occupied by performing the storage task on the RDD partition data, request space from Spark's memory storage area, and compare that size with the unoccupied space of the memory storage area.
- The first feedback module 6012 is configured so that, if the space the storage task occupies is larger than the unoccupied space of the memory storage area, the request for space from Spark's memory storage area fails, and the module sends the eviction logic unit both a command to evict the evictable cached data in the memory storage area and the size of the memory storage area space that the storage task needs.
- FIG. 9 is a schematic diagram of the sub-modules of the calculation and addressing module 602 of the Spark distributed computing data processing system according to an embodiment of the present invention. The sub-modules include:
- The second request module 6021 is configured so that the eviction logic unit receives the eviction command and sends the memory storage area a request to evict space, because the storage space needed to perform the storage task on the RDD partition data is insufficient; if the request succeeds, the size of the evictable space in the memory storage area is calculated under the least recently used (LRU) policy.
- The migration address setting module 6022 is configured to, if the unoccupied space of the memory storage area after eviction is greater than or equal to the space the storage task on the RDD partition data needs, set the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data, and send the migration information and the migration command for the evictable cached data to the cache data migration unit.
- The second feedback module 6023 is configured to, if the unoccupied space of the memory storage area after eviction is smaller than the space the storage task on the RDD partition data needs, terminate the migration task for the evictable cached data and feed back a signal that eviction of the evictable cached data failed.
- The SSD migration address module 6024 is configured to, if the access heat of the evictable cached data is within the first preset heat value range, read the SSD address and set it as the migration address.
- The HDD migration address module 6025 is configured to, if the access heat of the evictable cached data is within the second preset heat value range, read the HDD address and set it as the migration address.
- FIG. 10 is a schematic diagram of the sub-modules of the data migration module 603 of the Spark distributed computing data processing system according to an embodiment of the present invention.
- The sub-modules include:
- The third feedback module 6031 is configured to send the eviction logic unit a migration-complete signal for the evictable cached data in the memory storage area.
- The SSD persistence level module 6032 is configured to, if the migration address of the evictable cached data is the SSD, change the persistence level of that data to SSD_ONLY.
- The HDD persistence level module 6033 is configured to, if the migration address of the evictable cached data is the HDD, change the persistence level of that data to HDD_ONLY.
- The fourth feedback module 6034 is configured to feed back the eviction success signal for the evictable cached data and the migration information for the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.
- The disclosed methods and systems may be implemented in other ways.
- The system embodiments described above are merely illustrative.
- The division into modules is only a division by logical function.
- Multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed.
- The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or module, and may be electrical, mechanical, or of another form.
- Modules described as separate components may or may not be physically separate.
- Components displayed as modules may or may not be physical modules; they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment's solution.
- The functional modules in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module.
- The integrated modules may be implemented in hardware or as software functional modules.
- An integrated module that is implemented as a software functional module and sold or used as a standalone product can be stored in a computer-readable storage medium.
- The part of the technical solution of the present invention that is essential or that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product stored in a storage medium.
- The software product includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the embodiments of the present invention.
- The storage medium includes a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
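A Spark distributed computing data processing method, relating to the field of computers. The method comprises: scheduling subtasks through a task scheduler, performing a storage task on RDD partition data, and requesting storage-area space; calculating the evictable space in the storage area and its size, and setting a migration address in the hybrid storage system according to the access heat of the partition data (S102); and reading the cached data in the designated storage area and releasing the corresponding memory space, migrating the partition data to the designated address, modifying the persistence level of the migrated data, and feeding back an eviction success signal and evicted-space information (S103). A Spark distributed computing system is also provided. By introducing a hybrid storage system and designing an eviction logic unit and a cache data migration unit, data is migrated to an SSD or an HDD according to the heat of the partition data, rather than being migrated directly to disk or dropped from the cache, which relieves the pressure of insufficient memory space and improves Spark performance.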
Description
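The present invention relates to the field of computers, and in particular to a Spark distributed computing data processing method and system.
As science and technology advance, the demands placed on large-scale data processing keep rising. Big data applications depend heavily on memory, and ample memory is the precondition and guarantee for fast big data computation.
As a general-purpose, fast, large-scale data processing engine, Spark has become a popular computing framework for big data applications and performs especially well in iterative computing fields such as graph computing and machine learning. As datasets keep growing, a shortage of space means that some partition data cannot be cached in memory, or that data already cached in memory has to be migrated to disk, which degrades Spark performance. To address this, Spark proposed and designed a unified memory management model: when a caching task for partition data cannot obtain enough storage-area space, the model actively migrates data already cached in the storage area to disk or drops it outright. The unified memory management model has some flexibility: by migrating or dropping cached data, it relieves the tension between Spark's need to cache large data and the shortage of storage-area space.
However, because cached intermediate data is dropped or migrated to disk, calling that data again requires either re-running the corresponding computing task to regenerate it or reading the cached copy from disk. The Spark unified memory management model therefore causes some Spark tasks to be recomputed or to read from disk, which badly hurts Spark performance.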
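The recomputation problem described above can be illustrated with a toy model. The sketch below is not Spark code: `DroppingCache`, its `capacity` counted in partitions, and the `compute` callback are hypothetical names, and dropping the least recently used partition is an assumed stand-in for "dropping cached data". It only shows how an iterative job that cycles over more partitions than fit in memory ends up recomputing them on every pass.

```python
from collections import OrderedDict


class DroppingCache:
    """Toy cache that drops the least recently used partition when full."""

    def __init__(self, capacity, compute):
        self.capacity = capacity      # maximum number of cached partitions (assumed unit)
        self.compute = compute        # callback that builds a partition from its id
        self.blocks = OrderedDict()   # partition id -> data, least recently used first
        self.computations = 0         # how many times compute() had to run

    def get(self, pid):
        if pid in self.blocks:
            self.blocks.move_to_end(pid)      # mark as most recently used
            return self.blocks[pid]
        # Miss: the partition was never cached or has been dropped, so rebuild it.
        self.computations += 1
        data = self.compute(pid)
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)   # drop the least recently used partition
        self.blocks[pid] = data
        return data


cache = DroppingCache(capacity=2, compute=lambda pid: [pid] * 3)
for pid in [0, 1, 2, 0, 1, 2]:   # two passes of an iterative job over three partitions
    cache.get(pid)
print(cache.computations)
```

With room for two partitions and a loop over three, all six accesses miss, so each partition is computed twice.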
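The main purpose of the present invention is to provide a Spark distributed computing data processing method and system that solve the prior-art technical problem of repeated computation or disk reads by some Spark tasks under the Spark unified memory management model.
To achieve the above purpose, a first aspect of the present invention provides a data processing method for a Spark distributed computing system, the method comprising:
when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, if the request for space in Spark's memory storage area fails, sending the eviction logic unit a command to evict the evictable cached data in the memory storage area;
calculating the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, setting a migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area;
reading and releasing the evictable cached data in the memory storage area, migrating that data to the migration address, modifying its persistence level, and feeding back an eviction success signal and eviction information.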
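To achieve the above purpose, a second aspect of the present invention further provides a Spark distributed computing data processing system, the system comprising:
a storage request module, configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict cached data from the memory storage area if the request for space in Spark's memory storage area fails;
a calculation and addressing module, configured to calculate the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, to set a migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area;
a data migration module, configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify its persistence level, and feed back an eviction success signal and eviction information.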
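By introducing an SSD and an HDD to build a hybrid storage system, and by designing an eviction logic unit and a cache data migration unit, partition data is flexibly migrated to the SSD or the HDD according to its heat, instead of cached intermediate data being migrated straight to disk or cached data being dropped. This relieves the pressure between the large storage-area demand of Spark partition data caching and the shortage of memory space. In addition, when partition data is called, the high-speed read/write performance of the hybrid storage system and the heat-based separate storage of partition data allow partition data of different access heat to be read quickly from the hybrid storage system, improving Spark performance.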
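To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of the detailed sub-steps of step S101 of the Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of the detailed sub-steps of step S102 of the Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the detailed sub-steps of step S304 of the Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of the detailed sub-steps of the data migration part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of the detailed sub-steps of the persistence-level modification part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the sub-modules of the storage request module 601 of the Spark distributed computing data processing system according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the sub-modules of the calculation and addressing module 602 of the Spark distributed computing data processing system according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of the sub-modules of the data migration module 603 of the Spark distributed computing data processing system according to an embodiment of the present invention.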
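To make the purpose, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to FIG. 1, a schematic flowchart of a Spark distributed computing data processing method according to an embodiment of the present invention, the processing method includes:
S101: when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, if the request for space in Spark's memory storage area fails, send the eviction logic unit a command to evict cached data from the memory storage area.
S102: calculate the size of the evictable space in the memory storage area; if the space after eviction satisfies the storage task's requirement on the memory storage area, set the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area.
S103: read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify its persistence level, and feed back an eviction success signal and eviction information.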
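Steps S101 to S103 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `store_partition` is a hypothetical function, plain dictionaries stand in for the memory storage area and for the SSD and HDD tiers, insertion order stands in for LRU order, and an access count compared against an assumed `hot_threshold` stands in for "access heat". The level name `MEMORY_ONLY` for in-memory data is likewise an assumption.

```python
def store_partition(memory, capacity, tiers, levels, heat, pid, data, hot_threshold=3):
    """Try to cache `data` under `pid`; return True if it ends up in memory.

    memory: dict of partition id -> bytes, oldest entry first (assumed LRU order)
    tiers:  {"SSD": {}, "HDD": {}} standing in for the hybrid storage system
    levels: dict of partition id -> persistence level name
    heat:   dict of partition id -> access count (assumed heat metric)
    """
    used = sum(len(blob) for blob in memory.values())
    need = len(data)
    if used + need > capacity:                   # S101: the space request fails
        victims, freed = [], 0
        for vid, blob in memory.items():         # walk candidates, oldest first
            if used - freed + need <= capacity:
                break
            victims.append(vid)
            freed += len(blob)
        if used - freed + need > capacity:       # S102: eviction cannot free enough
            return False                         # eviction failure signal
        for vid in victims:
            tier = "SSD" if heat.get(vid, 0) >= hot_threshold else "HDD"  # S102
            tiers[tier][vid] = memory.pop(vid)   # S103: read, release, migrate
            levels[vid] = tier + "_ONLY"         # S103: new persistence level
    memory[pid] = data                           # the new partition enters memory
    levels[pid] = "MEMORY_ONLY"
    return True


memory, tiers, levels = {}, {"SSD": {}, "HDD": {}}, {}
heat = {"a": 5, "b": 1}
store_partition(memory, 10, tiers, levels, heat, "a", b"x" * 4)
store_partition(memory, 10, tiers, levels, heat, "b", b"x" * 4)
ok = store_partition(memory, 10, tiers, levels, heat, "c", b"x" * 6)
print(ok, list(memory), levels["a"])
```

In this run the third request does not fit, so the oldest partition `a` is evicted; because its assumed heat is above the threshold it lands on the SSD tier and its level becomes `SSD_ONLY`.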
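In this embodiment of the present invention, a hybrid storage system is built by introducing an SSD and an HDD, and an eviction logic unit and a cache data migration unit are designed, so that partition data is flexibly migrated to the SSD or the HDD according to its heat, instead of cached intermediate data being migrated straight to disk or cached data being dropped. This relieves the pressure between the large storage-area demand of Spark partition data caching and the shortage of memory space. In addition, when partition data is called, the high-speed read/write performance of the hybrid storage system and the heat-based separate storage of partition data allow partition data of different access heat to be read quickly from the hybrid storage system, improving Spark performance.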
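Referring to FIG. 2, a schematic flowchart of the detailed sub-steps of step S101 of the Spark distributed computing data processing method according to an embodiment of the present invention, the sub-steps include:
S201: calculate the size of the memory storage area space occupied by performing the storage task on the RDD partition data, request space from Spark's memory storage area, and compare the space the storage task occupies with the unoccupied space of the memory storage area;
Specifically, the Spark execution engine schedules subtasks through the task scheduler and, in the subtask's runtime space, performs the storage task on the RDD partition data that the user has marked for caching. It then tries to request space from Spark's memory storage area; if the request succeeds, the RDD partition data is stored directly.
S202: if the space the storage task occupies is larger than the unoccupied space of the memory storage area, the request for space from Spark's memory storage area fails; a command to evict the evictable cached data in the memory storage area is then sent to the eviction logic unit, together with the size of the memory storage area space that the storage task needs.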
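The comparison in S201 and the message sent in S202 can be sketched as below. `EvictCommand` and `request_storage` are hypothetical names introduced only for illustration; the source says that an eviction command and the required space size are sent to the eviction logic unit but does not define a message format.

```python
from dataclasses import dataclass


@dataclass
class EvictCommand:
    """Hypothetical message sent to the eviction logic unit in S202."""
    required_bytes: int   # memory storage area space the storage task needs


def request_storage(required_bytes, capacity_bytes, used_bytes):
    """Return None if the space request succeeds, else an EvictCommand."""
    free_bytes = capacity_bytes - used_bytes     # unoccupied space (S201)
    if required_bytes > free_bytes:              # S202: the request fails
        return EvictCommand(required_bytes=required_bytes)
    return None                                  # success: store the partition directly


print(request_storage(40, capacity_bytes=100, used_bytes=50))
print(request_storage(60, capacity_bytes=100, used_bytes=50))
```

Following the wording of S202, the request fails only when the required space is strictly larger than the unoccupied space, so a request that exactly fills the remaining space succeeds.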
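Referring to FIG. 3, a schematic flowchart of the detailed sub-steps of step S102 of the Spark distributed computing data processing method according to an embodiment of the present invention, the sub-steps include:
S301: the eviction logic unit receives the eviction command and sends the memory storage area a request to evict memory storage area space, because the storage space needed to perform the storage task on the RDD partition data is insufficient;
Further, after receiving the request from the eviction logic unit, the memory storage area determines whether it has evictable space and reports the result to the eviction logic unit.
S302: if the request succeeds, calculate the size of the evictable space in the memory storage area under the least recently used (LRU) policy;
Here, the LRU policy evicts data according to the historical access heat records of the data in the memory storage area. Its core idea is that data accessed recently is more likely to be accessed again in the future; the size of the evictable space in the memory storage area is determined from this access probability.
S303: if the evictable space in the memory storage area is greater than or equal to the space the storage task on the RDD partition data needs, proceed to S304.
S304: set the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area, and send the migration information and the migration command for the evictable cached data to the cache data migration unit.
S305: if the evictable space in the memory storage area is smaller than the space the storage task on the RDD partition data needs, proceed to S306.
S306: terminate the migration task for the evictable cached data in the memory storage area, and feed back a signal that eviction of the evictable cached data failed.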
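The LRU bookkeeping and the size check in S302 to S306 can be sketched as follows. This is one possible reading, not the source's algorithm: `record_access` and `plan_eviction` are hypothetical names, every cached block is assumed to be evictable, and the success test follows the module description (unoccupied space after eviction at least as large as the required space) rather than comparing the evictable space alone.

```python
from collections import OrderedDict


def record_access(lru_blocks, block_id):
    """Move a block to the most recently used end of the LRU order."""
    lru_blocks.move_to_end(block_id)


def plan_eviction(lru_blocks, required_bytes, free_bytes):
    """Return the block ids to evict, or None as the eviction failure signal.

    lru_blocks: OrderedDict of block id -> size in bytes, least recently used first
    """
    evictable_bytes = sum(lru_blocks.values())             # S302: evictable space
    if free_bytes + evictable_bytes < required_bytes:      # S305: not enough space
        return None                                        # S306: failure signal
    victims, reclaimed = [], 0
    for block_id, size in lru_blocks.items():              # least recently used first
        if free_bytes + reclaimed >= required_bytes:
            break
        victims.append(block_id)
        reclaimed += size
    return victims                                         # S303: continue to S304


blocks = OrderedDict([("p1", 30), ("p2", 20), ("p3", 50)])
record_access(blocks, "p1")                 # p1 becomes the most recently used block
print(plan_eviction(blocks, required_bytes=60, free_bytes=10))
print(plan_eviction(blocks, required_bytes=200, free_bytes=10))
```

The plan stops as soon as enough space has been reclaimed, so the recently accessed `p1` stays in memory in the first call, while the second call cannot be satisfied even by evicting everything.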
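Referring to FIG. 4, a schematic flowchart of the detailed sub-steps of step S304 of the Spark distributed computing data processing method according to an embodiment of the present invention, the sub-steps include:
S3041: determine the access heat of the evictable cached data in the memory storage area.
S3042: if the access heat of the evictable cached data is within the first preset heat value range, read the SSD address and set it as the migration address;
Here, the first preset heat value range corresponds to evictable cached data with high access heat; the specific range can be set freely by the user;
In particular, the first preset heat value is greater than the second preset heat value.
S3043: if the access heat of the evictable cached data is within the second preset heat value range, read the HDD address and set it as the migration address;
Here, the second preset heat value range corresponds to evictable cached data with low access heat; the specific range can be set freely by the user.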
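The heat-based choice in S3041 to S3043 amounts to a range lookup, sketched below. The function name, the half-open `(low, high)` range representation, the example thresholds, and the directory paths are all assumptions made for illustration; the source only states that both ranges are user-configurable and that the first (hot) range lies above the second (cold) one.

```python
def choose_migration_address(heat, hot_range, cold_range, ssd_address, hdd_address):
    """Pick the migration address from the block's access heat.

    hot_range / cold_range: (low, high) tuples, low inclusive and high exclusive
    """
    if hot_range[0] <= heat < hot_range[1]:      # S3042: high heat goes to the SSD
        return ssd_address
    if cold_range[0] <= heat < cold_range[1]:    # S3043: low heat goes to the HDD
        return hdd_address
    raise ValueError(f"heat {heat} is outside both preset ranges")


HOT_RANGE = (5, 1000)     # hypothetical first preset heat value range
COLD_RANGE = (0, 5)       # hypothetical second preset heat value range

print(choose_migration_address(7, HOT_RANGE, COLD_RANGE, "/mnt/ssd/evicted", "/mnt/hdd/evicted"))
print(choose_migration_address(2, HOT_RANGE, COLD_RANGE, "/mnt/ssd/evicted", "/mnt/hdd/evicted"))
```

Raising an error for a heat value outside both ranges is a design choice of this sketch; the source does not say what happens in that case.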
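Referring to FIG. 5, a schematic flowchart of the detailed sub-steps of the data migration part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention, the sub-steps include:
S401: after receiving the migration information and the migration command for the evictable cached data in the memory storage area, the cache data migration unit stores the evictable data on the SSD or the HDD according to the migration information;
Further, on receiving the migration information and the migration command, the cache data migration unit first reads the cached data in the designated memory storage area and releases the corresponding memory space, and then stores that data on the SSD or the HDD at the migration address;
Here, the migration information for the evictable data specifically includes the address of the evictable cached data in the memory storage area, the size of the space it occupies, and the migration address.
S402: send the eviction logic unit a migration-complete signal for the evictable cached data in the memory storage area.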
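The migration in S401 and S402 can be sketched with ordinary files. `MigrationInfo`, `migrate`, and the `"MIGRATION_COMPLETE"` string are hypothetical; a dictionary key stands in for the address of the cached data in memory, and a temporary directory stands in for the SSD or HDD mount point named by the migration address.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class MigrationInfo:
    """The three fields the source lists for the migration information."""
    block_id: str        # stands in for the address of the evictable cached data
    size_bytes: int      # space the evictable cached data occupies
    migration_dir: str   # migration address on the SSD or the HDD


def migrate(memory_store, info):
    """Read the block, release its memory, then write it to the migration address."""
    data = memory_store.pop(info.block_id)       # S401: read and release memory
    path = os.path.join(info.migration_dir, info.block_id)
    with open(path, "wb") as f:                  # S401: store at the migration address
        f.write(data)
    return "MIGRATION_COMPLETE", path            # S402: signal for the eviction logic unit


with tempfile.TemporaryDirectory() as ssd_dir:   # temporary stand-in for an SSD mount
    store = {"rdd_3_0": b"abc"}
    signal, path = migrate(store, MigrationInfo("rdd_3_0", 3, ssd_dir))
    with open(path, "rb") as f:
        written = f.read()
print(signal, store, written)
```

Popping the block before writing mirrors the order given in the source (read and release first, then store); a production design would likely need to handle a failed write, which the source does not discuss.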
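Referring to FIG. 6, a schematic flowchart of the detailed sub-steps of the persistence-level modification part of step S103 of the Spark distributed computing data processing method according to an embodiment of the present invention, the sub-steps include:
S501: determine the type of the migration address of the evictable cached data in the memory storage area.
S502: if the migration address of the evictable cached data is the SSD, change the persistence level of that data to SSD_ONLY.
S503: if the migration address of the evictable cached data is the HDD, change the persistence level of that data to HDD_ONLY.
S504: when the modification is complete, feed back the eviction success signal for the evictable cached data and the migration information for the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.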
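The level update in S501 to S504 can be sketched as a small mapping. `SSD_ONLY` and `HDD_ONLY` are the level names given in the source; the `PersistenceLevel` enum, the `update_level` function, the `"EVICTION_SUCCESS"` string, and the use of `MEMORY_ONLY` as the starting level are assumptions of this sketch.

```python
from enum import Enum


class PersistenceLevel(Enum):
    MEMORY_ONLY = "MEMORY_ONLY"   # assumed level of a block before eviction
    SSD_ONLY = "SSD_ONLY"         # level named in the source (S502)
    HDD_ONLY = "HDD_ONLY"         # level named in the source (S503)


def update_level(levels, block_id, migration_tier):
    """Set the block's persistence level from the tier it was migrated to."""
    if migration_tier == "SSD":                      # S501/S502
        levels[block_id] = PersistenceLevel.SSD_ONLY
    elif migration_tier == "HDD":                    # S501/S503
        levels[block_id] = PersistenceLevel.HDD_ONLY
    else:
        raise ValueError(f"unknown migration tier: {migration_tier}")
    return "EVICTION_SUCCESS"                        # S504: feedback signal


levels = {"rdd_3_0": PersistenceLevel.MEMORY_ONLY, "rdd_3_1": PersistenceLevel.MEMORY_ONLY}
print(update_level(levels, "rdd_3_0", "SSD"), levels["rdd_3_0"].value)
print(update_level(levels, "rdd_3_1", "HDD"), levels["rdd_3_1"].value)
```

Recording the tier in the level is what would let a later read go straight to the right device, which is the benefit the source attributes to storing partition data separately by heat.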
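Referring to FIG. 7, a schematic diagram of the functional modules of a Spark distributed computing data processing system according to an embodiment of the present invention, the functional modules include:
the storage request module 601, configured to, when performing a storage task on resilient distributed dataset (RDD) partition data that the user has marked for caching, send the eviction logic unit a command to evict cached data from the memory storage area if the request for space in Spark's memory storage area fails;
the calculation and addressing module 602, configured to calculate the size of the evictable space in the memory storage area and, if the space after eviction satisfies the storage task's requirement on the memory storage area, to set the migration address in the SSD- and HDD-based hybrid storage system according to the access heat of the evictable cached data in the memory storage area;
the data migration module 603, configured to read and release the evictable cached data in the memory storage area, migrate that data to the migration address, modify its persistence level, and feed back an eviction success signal and eviction information.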
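Referring to FIG. 8, FIG. 8 is a schematic diagram of the detailed functional modules of storage request module 601 of the Spark distributed computing data processing system in an embodiment of the present invention. The detailed functional modules include:
First request module 6011, configured to calculate the size of the memory storage area space that the storage task on the RDD partition data will occupy, request space from the Spark memory storage area, and compare that size with the unoccupied space in the memory storage area;
First feedback module 6012, configured to, if the size of the memory storage area space the storage task will occupy is larger than the unoccupied space in the memory storage area, treat the request for space in the Spark memory storage area as failed, and at the same time send to the eviction logic unit the command to evict evictable cached data from the memory storage area together with the size of the memory storage area space the storage task needs.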
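The request-and-compare behaviour of modules 6011 and 6012 can be sketched in a few lines. The class `MemoryStorageArea`, the `EvictionCommand` record, and the `send_to_eviction_unit` callback are hypothetical stand-ins; the sketch does not use Spark's actual memory manager API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvictionCommand:
    required_bytes: int  # space the storage task needs in the memory storage area


class MemoryStorageArea:
    def __init__(self, capacity: int, used: int = 0):
        self.capacity = capacity
        self.used = used

    @property
    def free(self) -> int:
        return self.capacity - self.used

    def try_reserve(self, nbytes: int) -> bool:
        """Reserve space if it fits into the unoccupied part of the area."""
        if nbytes > self.free:
            return False
        self.used += nbytes
        return True


def request_storage(area: MemoryStorageArea, required_bytes: int,
                    send_to_eviction_unit: Callable[[EvictionCommand], None]) -> bool:
    """Module 6011: request space. Module 6012: on failure, ask for eviction."""
    if area.try_reserve(required_bytes):
        return True
    # The request failed: forward the eviction command and the required size.
    send_to_eviction_unit(EvictionCommand(required_bytes))
    return False
```

Passing the required size along with the command lets the eviction logic unit decide whether evicting is worthwhile before any data is moved.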
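Referring to FIG. 9, FIG. 9 is a schematic diagram of the detailed functional modules of calculation and addressing module 602 of the Spark distributed computing data processing system in an embodiment of the present invention. The detailed functional modules include:
Second request module 6021, configured so that the eviction logic unit receives the eviction command and issues to the memory storage area a request to evict space because the storage space required by the storage task on the RDD partition data is insufficient; if the request succeeds, the size of the evictable space in the memory storage area is calculated according to the least recently used (LRU) policy;
Migration address setting module 6022, configured to, if the unoccupied space in the memory storage area after eviction is greater than or equal to the space the storage task on the RDD partition data needs, set a migration address in the hybrid storage system based on the SSD and the HDD according to the access hotness of the evictable cached data in the memory storage area, and send the migration information and the migration command for that data to the cached data migration unit;
Second feedback module 6023, configured to, if the unoccupied space in the memory storage area after eviction is smaller than the space the storage task on the RDD partition data needs, terminate the migration task for the evictable cached data in the memory storage area and feed back an eviction-failure signal for that data;
SSD migration address module 6024, configured to, if the access hotness of the evictable cached data in the memory storage area falls within the first preset hotness value range, read the SSD address and set the SSD address that was read as the migration address;
HDD migration address module 6025, configured to, if the access hotness of the evictable cached data in the memory storage area falls within the second preset hotness value range, read the HDD address and set the HDD address that was read as the migration address.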
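The LRU-based space calculation and the succeed-or-abort decision of modules 6021 to 6023 can be sketched as below. This is one possible reading, not the patent's algorithm as implemented: evictable blocks are assumed to be held in an `OrderedDict` ordered from least to most recently used, and the hypothetical `plan_eviction` function returns either the blocks to evict or `None` for the failure case.

```python
from collections import OrderedDict
from typing import List, Optional


def plan_eviction(lru_blocks: "OrderedDict[str, int]",
                  free_bytes: int,
                  required_bytes: int) -> Optional[List[str]]:
    """Pick evictable blocks in LRU order until the storage task fits.

    lru_blocks maps block id -> size in bytes, least recently used first.
    Returns the ids to evict, or None when even evicting every evictable
    block would leave less free space than the task requires.
    """
    victims: List[str] = []
    available = free_bytes
    for block_id, size in lru_blocks.items():
        if available >= required_bytes:
            break                      # enough space already, stop evicting
        victims.append(block_id)
        available += size
    if available >= required_bytes:
        return victims                 # module 6022: go on to set migration addresses
    return None                        # module 6023: abort and report eviction failure
```

For example, with blocks of 10, 20, and 30 bytes in LRU order, 5 bytes free, and 30 bytes required, the plan evicts the two least recently used blocks. With 100 bytes required it returns `None`, because evicting everything would free only 65 bytes in total.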
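Referring to FIG. 10, FIG. 10 is a schematic diagram of the detailed functional modules of data migration module 603 of the Spark distributed computing data processing system in an embodiment of the present invention. The detailed functional modules include:
Third feedback module 6031, configured to send a migration-complete signal for the evictable cached data in the memory storage area to the eviction logic unit;
SSD persistence level module 6032, configured to, if the migration address of the evictable cached data in the memory storage area is the SSD, change the persistence level of that data to SSD_ONLY;
HDD persistence level module 6033, configured to, if the migration address of the evictable cached data in the memory storage area is the HDD, change the persistence level of that data to HDD_ONLY;
Fourth feedback module 6034, configured to feed back an eviction-success signal for the evictable cached data in the memory storage area together with the migration information of the evictable data, so that the RDD partition data can enter the memory storage area and the storage task completes.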
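In the several embodiments provided in this application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, apparatuses, or modules, and may be electrical, mechanical, or of another form.
Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
If the integrated module is implemented as a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
It should be noted that, for brevity, the foregoing method embodiments are each described as a series of actions. However, a person skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by the present invention.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
The foregoing describes the Spark distributed computing data processing method and system provided by the present invention. A person skilled in the art may, following the ideas of the embodiments of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.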
Claims (10)
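- A Spark distributed computing data processing method, characterized in that the method comprises: when a storage task is performed on partition data of a resilient distributed dataset (RDD, Resilient Distributed Datasets) that the user has marked for caching, if a request for space in Spark's memory storage area fails, sending to an eviction logic unit a command to evict evictable cached data from the memory storage area; calculating the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for memory storage area space, setting a migration address in a hybrid storage system based on a solid-state drive (SSD, Solid State Drives) and a hard disk drive (HDD, Hard Disk Drive) according to the access hotness of the evictable cached data in the memory storage area; and reading and releasing the evictable cached data in the memory storage area, migrating the evictable cached data to the migration address, modifying the persistence level of the evictable cached data, and feeding back an eviction-success signal and eviction information.
- The method according to claim 1, characterized in that sending to the eviction logic unit the command to evict the evictable cached data from the memory storage area if the request for space in the Spark memory storage area fails specifically comprises: calculating the size of the memory storage area space that the storage task on the RDD partition data will occupy, requesting space from Spark's memory storage area, and comparing the size of the space the storage task will occupy with the unoccupied space in the memory storage area; if the size of the space the storage task will occupy is larger than the unoccupied space in the memory storage area, the request for space in Spark's memory storage area fails, and the command to evict the evictable cached data from the memory storage area is sent to the eviction logic unit together with the size of the memory storage area space the storage task needs.
- The method according to claim 1, characterized in that calculating the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for memory storage area space, setting the migration address in the hybrid storage system based on the SSD and the HDD according to the access hotness of the evictable cached data specifically comprises: the eviction logic unit receiving the eviction command and issuing to the memory storage area a request to evict space because the storage space required by the storage task on the RDD partition data is insufficient; if the request succeeds, calculating the size of the evictable space in the memory storage area according to the least recently used (LRU) policy; if the size of the evictable space in the memory storage area is greater than or equal to the space the storage task on the RDD partition data needs, setting the migration address in the hybrid storage system based on the SSD and the HDD according to the access hotness of the evictable cached data, and sending the migration information and the migration command for the evictable cached data to a cached data migration unit; and if the size of the evictable space in the memory storage area is smaller than the space the storage task on the RDD partition data needs, terminating the migration task for the evictable cached data and feeding back an eviction-failure signal for the evictable cached data in the memory storage area.
- The method according to claim 3, characterized in that setting the migration address in the hybrid storage system based on the SSD and the HDD according to the access hotness of the evictable cached data in the memory storage area specifically comprises: if the access hotness of the evictable cached data falls within a first preset hotness value range, reading the SSD address and setting the SSD address that was read as the migration address; if the access hotness of the evictable cached data falls within a second preset hotness value range, reading the HDD address and setting the HDD address that was read as the migration address; wherein the first preset hotness value is greater than the second preset hotness value.
- The method according to claim 1, characterized in that reading and releasing the evictable cached data in the memory storage area and migrating the evictable cached data to the migration address specifically comprises: after receiving the migration information and the migration command for the evictable cached data in the memory storage area, the cached data migration unit storing the evictable data in the memory storage area to the SSD or the HDD according to the migration information, and sending a migration-complete signal for the evictable cached data to the eviction logic unit; wherein the migration information specifically includes: the address of the evictable cached data in the memory storage area, the size of the space that data occupies, and the migration address.
- The method according to claim 1, characterized in that modifying the persistence level of the evictable cached data in the memory storage area and feeding back the eviction-success signal and the eviction information specifically comprises: if the migration address of the evictable cached data is the SSD, changing the persistence level of the evictable cached data to SSD_ONLY; if the migration address of the evictable cached data is the HDD, changing the persistence level of the evictable cached data to HDD_ONLY; and, once the modification is complete, feeding back the eviction-success signal for the evictable cached data together with the migration information of the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.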
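- A Spark distributed computing data processing system, characterized in that the system comprises: a storage request module, configured to, when a storage task is performed on resilient distributed dataset (RDD) partition data that the user has marked for caching, send to an eviction logic unit a command to evict evictable cached data from the memory storage area if a request for space in Spark's memory storage area fails; a calculation and addressing module, configured to calculate the size of the evictable space in the memory storage area and, if the space available after eviction satisfies the storage task's requirement for memory storage area space, set a migration address in a hybrid storage system based on an SSD and an HDD according to the access hotness of the evictable cached data in the memory storage area; and a data migration module, configured to read and release the evictable cached data in the memory storage area, migrate the evictable cached data to the migration address, modify the persistence level of the evictable cached data, and feed back an eviction-success signal and eviction information.
- The system according to claim 7, characterized in that the storage request module comprises: a first request module, configured to calculate the size of the memory storage area space that the storage task on the RDD partition data will occupy, request space from the Spark memory storage area, and compare that size with the unoccupied space in the memory storage area; and a first feedback module, configured to, if the size of the space the storage task will occupy is larger than the unoccupied space in the memory storage area, treat the request for space in the Spark memory storage area as failed, and at the same time send to the eviction logic unit the command to evict the evictable cached data from the memory storage area together with the size of the memory storage area space the storage task needs.
- The system according to claim 7, characterized in that the calculation and addressing module comprises: a second request module, configured so that the eviction logic unit receives the eviction command and issues to the memory storage area a request to evict space because the storage space required by the storage task on the RDD partition data is insufficient, and, if the request succeeds, the size of the evictable space in the memory storage area is calculated according to the least recently used (LRU) policy; a migration address setting module, configured to, if the unoccupied space in the memory storage area after the eviction is greater than or equal to the space the storage task on the RDD partition data needs, set the migration address in the hybrid storage system based on the SSD and the HDD according to the access hotness of the evictable cached data in the memory storage area, and send the migration information and the migration command for the evictable cached data to a cached data migration unit; a second feedback module, configured to, if the unoccupied space in the memory storage area after the eviction is smaller than the space the storage task on the RDD partition data needs, terminate the migration task for the evictable cached data and feed back an eviction-failure signal for the evictable cached data in the memory storage area; an SSD migration address module, configured to, if the access hotness of the evictable cached data falls within a first preset hotness value range, read the SSD address and set the SSD address that was read as the migration address; and an HDD migration address module, configured to, if the access hotness of the evictable cached data falls within a second preset hotness value range, read the HDD address and set the HDD address that was read as the migration address.
- The system according to claim 7, characterized in that the data migration module comprises: a data migration module, configured so that the cached data migration unit, after receiving the migration information and the migration command for the evictable cached data in the memory storage area, stores the evictable data in the memory storage area to the SSD or the HDD according to the migration information; a third feedback module, configured to send a migration-complete signal for the evictable cached data to the eviction logic unit; an SSD persistence level module, configured to, if the migration address of the evictable cached data is the SSD, change the persistence level of the evictable cached data to SSD_ONLY; an HDD persistence level module, configured to, if the migration address of the evictable cached data is the HDD, change the persistence level of the evictable cached data to HDD_ONLY; and a fourth feedback module, configured to feed back an eviction-success signal for the evictable cached data together with the migration information of the evictable data, so that the RDD partition data enters the memory storage area and the storage task completes.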
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/099083 WO2019037093A1 (zh) | 2017-08-25 | 2017-08-25 | Spark distributed computing data processing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/099083 WO2019037093A1 (zh) | 2017-08-25 | 2017-08-25 | Spark distributed computing data processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019037093A1 (zh) | 2019-02-28 |
Family
ID=65438348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/099083 WO2019037093A1 (zh) | Spark distributed computing data processing method and system | 2017-08-25 | 2017-08-25 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019037093A1 (zh) |
- 2017-08-25 WO PCT/CN2017/099083 patent/WO2019037093A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110191556A1 (en) * | 2010-02-01 | 2011-08-04 | International Business Machines Corporation | Optimization of data migration between storage mediums |
CN101907978A (zh) * | 2010-07-27 | 2010-12-08 | 浙江大学 | Hybrid storage system based on solid-state drives and magnetic hard disks, and storage method |
CN103186350A (zh) * | 2011-12-31 | 2013-07-03 | 北京快网科技有限公司 | Hybrid storage system and method for migrating hot data blocks |
CN102831088A (zh) * | 2012-07-27 | 2012-12-19 | 国家超级计算深圳中心(深圳云计算中心) | Data migration method and apparatus based on hybrid memory |
CN103631730A (zh) * | 2013-11-01 | 2014-03-12 | 深圳清华大学研究院 | Cache optimization method for in-memory computing |
Non-Patent Citations (1)
Title |
---|
LU, KEZHONG ET AL.: "Design of RDD Persistence Method in Spark for SSDs", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, vol. 54, no. 6, 30 June 2017 (2017-06-30), pages 1382, XP055578521 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947778A (zh) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | Spark storage method and system |
CN115145841A (zh) * | 2022-07-18 | 2022-10-04 | 河南大学 | Method for reducing memory contention in the Spark computing platform |
CN115145841B (zh) * | 2022-07-18 | 2023-05-12 | 河南大学 | Method for reducing memory contention in the Spark computing platform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 17922440; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 24.09.2020) |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 17922440; Country of ref document: EP; Kind code of ref document: A1 |