CN117171471A

CN117171471A - Visual big data machine learning system and method based on Ray and Spark

Info

Publication number: CN117171471A
Application number: CN202310959145.3A
Authority: CN
Inventors: 吴志雄; 徐春梅; 刘云星
Original assignee: Linewell Software Co Ltd
Current assignee: Linewell Software Co Ltd
Priority date: 2023-08-01
Filing date: 2023-08-01
Publication date: 2023-12-05

Abstract

The invention belongs to the technical field of visual big data and discloses a visual big data machine learning system and method based on Ray and Spark. The system consists of a front-end UI, a presentation layer, a service layer, a data layer, a storage layer and an operation environment. The method comprises: accessing various data sources through Web visualization pages; managing and accessing machine learning models and deep learning frameworks (TensorFlow, PyTorch, Keras, etc.) through Web visualization pages; a script generation and verification module; Spark and Ray distributed processing and computing modules and their communication; and displaying and storing the results. The invention can efficiently process and analyze massive data so as to extract valuable information and insight from it and support decisions and innovations in various application fields.

Description

Visual big data machine learning system and method based on Ray and Spark

Technical Field

The invention belongs to the technical field of visual big data, and particularly relates to a visual big data machine learning system and method based on Ray and Spark.

Background

While the prior art provides tools and frameworks for large-scale data processing and machine learning, these tools are typically independent of one another, and integrating them is a complex process. Users need knowledge not only of machine learning and deep learning but also of data preprocessing, feature engineering, model selection, training, tuning, etc. This demands considerable expertise, sets a high threshold for non-professionals, and costs a great deal of time in managing the different components and techniques. Existing third-party models cannot be quickly connected to massive data for training and inference. Therefore, there is a need for a system and method that can quickly obtain high-quality large-scale data, lower the threshold for using these models, enable non-professionals to select models through simple form drop-down boxes, and efficiently process and analyze massive data by combining big data techniques with machine learning and deep learning models, so as to extract valuable information and insight and support decisions and innovations in various application fields, such as business intelligence, predictive analysis, OCR and image recognition.

Through the above analysis, the problems and defects existing in the prior art are as follows:

1. Data acquisition and quality problems:

Defect: obtaining high-quality and diverse large-scale data is a challenge. In certain fields, particularly those involving sensitive information or proprietary data, data acquisition may be limited, resulting in insufficient data size and quality.

The solution is as follows: more data acquisition technologies, including data crawling, data sharing platforms, etc., need to be developed to address data acquisition issues. Meanwhile, data quality control measures are required to be enhanced, and accuracy and integrity of data are ensured.

2. Technology integration and complexity issues:

Defect: existing big data processing and machine learning tools are usually independent, and integrating them is a complex process. Non-professionals face a high threshold because they must master multiple technical components and areas of knowledge.

The solution is as follows: there is a need to develop a system and method that can effectively integrate big data processing, machine learning, deep learning, etc. techniques, and provide a one-stop visual solution, so that non-professionals can easily perform data processing and analysis.

3. Talent shortage problem:

Defect: the fusion of big data with artificial intelligence techniques requires people with the relevant expertise and skills, who are currently relatively scarce.

The solution is as follows: to address the talent shortage problem, there is a need to enhance education training in the relevant fields, provide more talent reserves, and encourage interdisciplinary research and collaboration to foster more professionals who understand big data and artificial intelligence.

4. Processing data without persisting it, and privacy protection:

Defect: most current data processing and analysis requires the data to be persisted to storage first, which may raise privacy and security issues. The storage cost, security risk and risk of privacy disclosure that persistence brings are problems to be solved in the prior art.

The solution is as follows: more techniques for processing data without persisting it need to be developed, together with encryption, secure computation and similar methods, so that the privacy and security of data are protected while storage cost and risk are reduced.

5. A one-stop visualization solution:

Defect: a one-stop visualization solution is currently lacking, and each stage of the process from big data preprocessing to machine learning and deep learning has to be configured and operated independently.

The solution is as follows: there is a need to develop a one-stop visualization solution that enables users to quickly obtain high-quality large-scale data through simple operations, select a suitable model, and perform analysis processing without paying much attention to the underlying technical details.

In view of the above analysis, the prior art still has a number of drawbacks and challenges in the area of large-scale data processing and machine learning. In order to solve these problems, technical development and talent cultivation are required to be enhanced, standardization and normalization of data acquisition and processing are promoted, and meanwhile, a more intelligent and simplified system and method are developed, so that a threshold is reduced, and non-professional persons can easily use big data and artificial intelligence technology to obtain valuable information and insight therefrom, and decision making and innovation of various application fields are supported.

Disclosure of Invention

Aiming at the problems existing in the prior art, the invention provides a visual big data machine learning system and method based on Ray and Spark.

The invention is realized in such a way, a visual big data machine learning system based on Ray and Spark, which consists of a front end UI, a display layer, a business layer, a data layer, a storage layer and an operation environment;

the front end UI and the presentation layer are used for providing a user-friendly interface for managing data sources and models, and for generating Spark SQL and Python scripts by rendering the configured SQL script templates and machine learning model templates with a template engine;

the business layer is used for managing and monitoring tasks, mapping and managing functions and variables in the tasks, and managing a machine learning model;

the data layer is used for processing offline batch big data and real-time online data, and uses Arrow to transmit columnar data efficiently; the Arrow memory format supports zero-copy reads, so that data can be accessed quickly without serialization overhead;

the storage layer is used for storing various structured and unstructured data;

the running environment is based on the Spark engine, a Ray cluster and the YARN resource management and scheduling platform, and provides efficient interaction between big data and machine learning models as well as analysis and processing capability.

Another object of the present invention is to provide a visual big data machine learning method based on Ray and Spark, the method comprising:

step one, accessing various data sources through Web visualization pages;

step two, managing and accessing machine learning models and deep learning frameworks (TensorFlow, PyTorch, Keras, etc.) through Web visualization pages;

step three, a script generation and verification module;

step four, Spark and Ray distributed processing and computing modules and their communication;

step five, displaying and storing the results.

Further, in step one, the user can select various configured data sources through the interface, including JDBC, HDFS, Kafka and Hive sources, and can also define custom data sources.

Further, in step one, each data source includes basic information such as IP, port, data source instance, user and password, which is used to access the corresponding data source.
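The basic information listed for a data source can be pictured as a small record. The following is a minimal sketch, not the patent's actual schema: the class name `DataSourceConfig`, its field names and the `jdbc_url` and `masked` helpers are all hypothetical, and MySQL is assumed as the default JDBC driver purely for illustration.

```python
from dataclasses import asdict, dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class DataSourceConfig:
    """Connection details a user might enter in the data-source form (hypothetical schema)."""
    kind: str       # e.g. "jdbc", "hive", "hdfs", "kafka"
    host: str       # IP address or host name
    port: int
    instance: str   # data source instance, e.g. a database name
    user: str
    password: str

    def jdbc_url(self, driver: str = "mysql") -> str:
        """Build a JDBC URL; only defined for JDBC sources."""
        if self.kind != "jdbc":
            raise ValueError(f"jdbc_url() is not defined for kind={self.kind!r}")
        return f"jdbc:{driver}://{self.host}:{self.port}/{quote(self.instance)}"

    def masked(self) -> dict:
        """Return all fields with the password hidden, e.g. for display in a UI."""
        return {**asdict(self), "password": "***"}


source = DataSourceConfig("jdbc", "10.0.0.5", 3306, "sales", "etl", "secret")
print(source.jdbc_url())
print(source.masked())
```

Keeping the record immutable (`frozen=True`) is one possible design choice: a configured source can then be shared between the script generator and the task monitor without either side changing it.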

Further, in step two, the Web page provides a platform for managing and using machine learning models and deep learning frameworks. Through the Web interface, the user selects the pre-trained model or framework to use, configures its parameters, and then runs the model on the user's data. The system also supports uploading custom models, which provides flexibility for different tasks.

Further, in step three, the corresponding Spark SQL and Python scripts are automatically generated from the form configuration information by the template engine, in order to run big data processing and machine learning or deep learning tasks.

Further, in step three, code verification is performed before the computing task is executed; the verification ensures that the generated code has no syntax errors.
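Script generation and verification can be sketched as follows. This is only an illustration under stated assumptions: the patent does not disclose its templates or template engine, so Python's standard `string.Template` stands in for the engine, both templates and the form keys (`columns`, `table`, `condition`, `model_path`) are hypothetical, and the check covers only the Python script (via `ast.parse`, which parses without executing). Verifying the Spark SQL text would require Spark's own parser and is not shown.

```python
import ast
from string import Template

# Hypothetical templates; the real ones are not given in the source.
SQL_TEMPLATE = Template("SELECT $columns FROM $table WHERE $condition")
PY_TEMPLATE = Template(
    "def predict(batch):\n"
    "    model = load_model('$model_path')\n"
    "    return model.predict(batch)\n"
)


def render_scripts(form: dict) -> tuple[str, str]:
    """Fill both templates from the form configuration."""
    sql = SQL_TEMPLATE.substitute(
        columns=", ".join(form["columns"]),
        table=form["table"],
        condition=form["condition"],
    )
    py = PY_TEMPLATE.substitute(model_path=form["model_path"])
    return sql, py


def python_syntax_ok(source: str) -> bool:
    """Return True if the source parses as Python; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


form = {
    "columns": ["id", "path"],
    "table": "photos",
    "condition": "size > 0",
    "model_path": "/models/quality",
}
sql_script, py_script = render_scripts(form)
print(sql_script)
print(python_syntax_ok(py_script))
```

Running the syntax check before submission means a malformed script is rejected at the Web layer instead of failing later on the cluster.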

Further, the specific method for Spark and Ray distributed processing and calculating module and communication in the fourth step comprises:

1) Loading and processing massive data: according to the generated Spark SQL script, Spark's data processing functions are used to load and process massive data; the data can be processed with Spark's DataFrame or RDD API.

2) Calculation and conversion using Spark: according to the functions in the generated Spark SQL script, the data is computed and transformed with Spark's operations and transformation functions; both built-in and custom functions can be applied to the data.

3) Multi-task computation and communication using Ray: the generated Python script connects to a Ray cluster and uses Ray tasks and actors to execute the machine learning model in a distributed manner and return the prediction results.

4) Obtaining results and communication: Spark interacts with Ray to obtain the calculation results, which can be further processed and analyzed with Spark's operations and functions.

5) Controlling computing resources by setting partitions in the script: the script sets the number of partitions for the data; the optimal resource parallelism is calculated from the partition count and the available resources of the Ray cluster, and the task runs at that parallelism.
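The source does not give the formula used to derive the parallelism in item 5). One plausible reading, shown here purely as an assumption, is to cap the number of concurrent tasks at the smaller of the partition count and the number of task slots the cluster can offer. The function names and the `cpus_per_task` parameter are hypothetical; in a real deployment the available CPU count could be read from Ray (for example via `ray.available_resources()`), which is left out so the sketch runs without a cluster.

```python
import math


def plan_parallelism(num_partitions: int, available_cpus: float,
                     cpus_per_task: float = 1.0) -> int:
    """Assumed rule: run as many tasks at once as there are partitions or slots, whichever is smaller."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if cpus_per_task <= 0:
        raise ValueError("cpus_per_task must be positive")
    slots = math.floor(available_cpus / cpus_per_task)
    # Always allow at least one task, even if the cluster reports less than one free slot.
    return max(1, min(num_partitions, slots))


def scheduling_waves(num_partitions: int, parallelism: int) -> int:
    """Number of rounds needed to process every partition at the given parallelism."""
    return math.ceil(num_partitions / parallelism)


parallelism = plan_parallelism(num_partitions=200, available_cpus=64)
print(parallelism, scheduling_waves(200, parallelism))
```

Under this rule, raising the partition count beyond the number of slots does not increase concurrency; it only shortens each task and adds scheduling rounds.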

Further, in step five, after the machine learning and deep learning tasks have run, the results can be displayed to the user in tabular form or saved to a designated data source for further analysis, for example by generating various charts or tables to view the results, helping the user better understand and evaluate the performance and accuracy of the model.

It is a further object of the present invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of a Ray and Spark based visual big data machine learning method.

In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:

firstly, to address the problems that high-quality, diverse large-scale data is difficult to acquire in the prior art and that the threshold of big data and artificial intelligence technology is high, the invention provides a visual big data machine learning system and method based on Ray and Spark. Large-scale data is preprocessed with Spark SQL to optimize and improve data quality; the preprocessed result is sent to a Ray cluster and passed as a parameter into a machine learning model for training or prediction; finally the calculation result is returned for further analysis and decision-making. These technologies are integrated into a one-stop pipeline in which data is not persisted to disk and processing is distributed across the cluster, meeting the requirements of rapid large-scale data processing and complex computing tasks.

Secondly, the invention can rapidly acquire high-quality large-scale data and lower the threshold for using big data and artificial intelligence technology together. Non-professionals can select a model and a data source through simple form drop-down boxes, add preprocessing components as required, and submit the job to the distributed engine. By combining big data technology with machine learning and deep learning models, massive data can be processed and analyzed efficiently to extract valuable information and insight.

Thirdly, as inventive supplementary evidence of the claims of the present invention, the following important aspects are also presented:

(1) The expected benefits and commercial values after the technical scheme of the invention is converted are as follows:

the invention supports decisions and innovations in various application fields, such as large-scale data preprocessing, business intelligence, predictive analysis, image recognition, text comparison and other predictive analysis based on artificial intelligence models.

(2) The technical scheme of the invention fills the technical blank in the domestic and foreign industries:

the method mainly integrates big data with artificial intelligence technology.

(3) Whether the technical scheme of the invention solves a technical problem that people have long wanted to solve but have never succeeded in solving:

(4) The technical scheme of the invention overcomes the technical bias:

the unfairness, discrimination or other adverse effects caused by technical prejudice can be overcome by means of measures such as improving data quality and diversity, algorithm design, model parameter adjustment and the like.

Fourth, the visual big data machine learning system based on Ray and Spark has the following remarkable technical progress:

1) Front end UI and presentation layer:

providing a user-friendly interface: the front end UI and the presentation layer of the system provide an intuitive and easy-to-use interface for users, so that the management of data sources and models becomes simple and efficient.

Dynamically generating Spark SQL and Python scripts: by setting the SQL script template and the machine learning model template engine, the system can dynamically generate Spark SQL and Python scripts according to the requirements of users, so that the flexibility of task configuration and execution is greatly improved.

2) Service layer:

task management and monitoring: the business layer comprehensively manages and monitors the tasks, ensures the accurate execution and reliability of the tasks, and provides a real-time monitoring function for the state and progress of the tasks.

Function and variable mapping and management: the mapping and management of functions and variables are realized in the business layer, so that the parameter configuration and function call of the machine learning task become highly flexible and customizable.

3) Data layer:

offline batch and real-time online data processing: the data layer can process off-line batch big data and real-time on-line data simultaneously, and high-efficiency processing and analysis of different types of data are realized.

Arrow memory format support: Arrow is used to transmit columnar data efficiently, which gives fast data access, avoids serialization overhead, and improves the efficiency and performance of data processing.

4) Storage layer:

a variety of structured and unstructured data support: the storage layer can store various structured and unstructured data, meets the storage requirement of the system on diversified data, and improves the flexibility and expandability of the data.

5) Operating environment:

based on Spark engine and Ray cluster: the system utilizes Spark engine and Ray cluster to complete high-efficiency interaction and analysis processing capacity of big data and machine learning model, and greatly improves calculation performance and parallel processing capacity of the system.

An integrated YARN resource management scheduling platform: by integrating the YARN resource management scheduling platform, the system can better perform resource allocation and task scheduling, and high efficiency and expandability of task execution are realized.

By combining the technical progress, the system fully utilizes the advantages of Ray and Spark, provides a powerful function of visual big data machine learning, and simultaneously enables task configuration and data processing to be more flexible and intelligent through front-end UI and dynamic script generation, thereby meeting the actual demands in the big data analysis and machine learning fields.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a frame diagram of a visual big data machine learning system based on Ray and Spark provided by an embodiment of the invention.

Fig. 2 is a flowchart of a visual big data machine learning method based on Ray and Spark provided by an embodiment of the invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In view of the problems of the prior art, the present invention provides a visual big data machine learning system and method based on Ray (which is an open-source distributed computing framework for building large-scale, high-performance machine learning applications) and Spark (which is an open-source big data processing framework that provides a fast, versatile and easy-to-use data processing engine), and the detailed description of the present invention is provided below with reference to the accompanying drawings.

As shown in fig. 1, the visual big data machine learning system based on Ray and Spark provided by the embodiment of the invention comprises a front end UI, a presentation layer, a service layer, a data layer, a storage layer and an operation environment;

the data layer is used for processing offline batch big data and real-time online data, and uses Arrow (which defines a standardized memory format so that data can be shared efficiently among different systems and languages) to transmit columnar data efficiently; the Arrow memory format supports zero-copy reads, so that data can be accessed quickly without serialization overhead;

the storage layer is used for storing various structured and unstructured data;

the running environment is based on the Spark engine, a Ray cluster and the YARN resource management and scheduling platform, and provides efficient interaction between big data and machine learning models as well as analysis and processing capability. YARN (Yet Another Resource Negotiator) is a sub-project of Apache Hadoop for cluster resource management and job scheduling; it allows different data processing engines (e.g., MapReduce, Spark and Flink) to run on Hadoop.
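The zero-copy property mentioned for the data layer can be illustrated without Arrow itself. The following is a minimal sketch using only Python's standard library (`array` and `memoryview`) as an analogy: a column of values lives in one contiguous buffer, and a reader obtains a window onto that buffer instead of a serialized copy. It is not Arrow's API, and the column name `scores` and its values are made up.

```python
from array import array

# One column of 64-bit floats stored contiguously, as in a columnar layout.
scores = array("d", [0.91, 0.47, 0.88, 0.15, 0.73])

# A memoryview exposes the same buffer without copying it.
view = memoryview(scores)

# Slicing the view selects rows 1..3 of the column; still no copy is made.
window = view[1:4]

# A write through the original column is visible through the window,
# which shows that both names refer to the same memory.
scores[2] = 0.99
print(window.tolist())
```

Because the reader shares the producer's memory, the cost of handing over a column does not grow with the column's size, which is the property the data layer relies on when passing data between Spark and Ray.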

As shown in fig. 2, the visual big data machine learning method based on Ray and Spark provided by the embodiment of the invention comprises the following specific steps:

step 1, accessing various data sources through Web visualization pages;

step 2, managing and accessing machine learning models and deep learning frameworks (TensorFlow, PyTorch, Keras, etc.) through Web visualization pages;

step 3, a script generation and verification module;

step 4, Spark and Ray distributed processing and computing modules and their communication;

step 5, displaying and storing the results.

In step 1 provided by the embodiment of the invention, a user can select various configured data sources through the interface, including structured sources such as JDBC and Hive and unstructured sources such as HDFS and Kafka, and can also create custom data sources.

In step 1 provided by the embodiment of the invention, each data source includes basic information such as IP, port, data source instance, user and password, which is used to access the corresponding data source.

In step 2 provided by the embodiment of the invention, the Web page provides a platform for managing and using machine learning models and deep learning frameworks. Through the Web interface, a user selects the pre-trained model or framework to use, configures its parameters, and then runs the model on the data. The system also supports uploading custom models, which provides flexibility for different tasks.

In step 3 provided by the embodiment of the invention, the corresponding Spark SQL and Python scripts are automatically generated from the form configuration information by the template engine, in order to run big data processing and machine learning or deep learning tasks.

In step 3 provided by the embodiment of the invention, code verification is performed before the computing task is executed; the verification ensures that the generated code has no syntax errors.

The specific method for Spark and Ray distributed processing and calculating module and communication in step 4 provided by the embodiment of the invention comprises the following steps:

1) Loading and processing massive data: according to the generated Spark SQL script, Spark's data processing functions are used to load and process massive data; the data can be processed with Spark's DataFrame or RDD API.

3) Multi-task computation and communication using Ray: the generated Python script connects to a Ray cluster and uses Ray tasks and actors to execute the machine learning model in a distributed manner and return the prediction results.

In step 5 provided by the embodiment of the invention, after the machine learning and deep learning tasks have run, the results can be displayed to the user in tabular form or saved to a designated data source for further analysis, for example by generating various charts or tables to view the results, helping the user better understand and evaluate the performance and accuracy of the model.

An application embodiment of the present invention provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the computer program when executed by the processor causes the processor to perform the steps of a Ray and Spark based visual big data machine learning method.

For example:

1. quality inspection of tens of millions of pictures can be processed and analyzed in batches in a short time;

2. large-scale picture sets can be deduplicated;

3. text in pictures can be read at scale for further editing;

4. large-scale text translation can be performed;

5. images and the like can be recognized at scale.

The embodiment of the invention can process structured and unstructured data, and the AI component can support various machine learning and deep learning models. Because large-scale data is processed, big data storage media such as Hive and HDFS are recommended as data sources. The following is a brief description taking photo album detection as an example:

First, the flow is configured through visual forms. The Hive source is the medium that stores the data and gives access to a large volume of it; for example, the pictures are stored under a specified path, and the path is read directly. The conversion component preprocesses the data source, for example handling missing values, null values and type conversion. The AI component is a model for assessing picture quality; its core algorithm is a convolutional neural network, and the model is likewise stored in a specified directory and obtained through the component. Then the preprocessed data is distributed to the model in batches, and the model performs distributed prediction to complete the picture quality detection.
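The batch-wise distribution of preprocessed data to the model can be sketched as below. To keep the example runnable without a cluster, the standard library's `ThreadPoolExecutor` is used as a stand-in for Ray tasks; it is not how the patent executes the model. The function names are hypothetical, and `toy_quality_model` is a placeholder threshold rule on a made-up sharpness score, not the convolutional neural network described in the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(items: Sequence, model: Callable[[Sequence], list],
                       batch_size: int, workers: int) -> list:
    """Send each batch to a worker and concatenate the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order the batches were submitted.
        results = pool.map(model, batches(items, batch_size))
        return [label for batch_result in results for label in batch_result]


def toy_quality_model(batch: Sequence[float]) -> list:
    """Placeholder for the picture-quality model: thresholds a made-up sharpness score."""
    return ["ok" if score >= 0.5 else "blurred" for score in batch]


sharpness = [0.9, 0.2, 0.7, 0.4, 0.6]
print(predict_in_batches(sharpness, toy_quality_model, batch_size=2, workers=2))
```

Preserving input order when gathering results makes it straightforward to join the predictions back onto the original rows for display or storage.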

The following are four specific embodiments and corresponding implementations:

1) Example: soil pollution remediation system

Front end UI and presentation layer: a user-friendly interface is designed for managing soil pollution data sources and selecting a remediation model.

Data layer: a soil pollution database is established, containing the pollution degree, pollution source and other information.

Service layer: a machine learning model is used to predict the soil pollution degree and optimize the remediation scheme.

Storage layer: pollution data and remediation result data are saved.

Operating environment: large-scale soil pollution data is processed and the remediation model is run in parallel on Spark and Ray.

2) Example: traffic flow prediction system

Front end UI and presentation layer: an interactive interface is designed for selecting traffic flow data sources and configuring prediction model parameters.

Data layer: real-time traffic flow data and historical traffic flow data are collected and stored in a database.

Service layer: a machine learning algorithm is used to predict traffic flow, with model training and prediction based on the historical and real-time data.

Storage layer: the prediction results and the traffic flow data are stored.

Operating environment: parallel processing and efficient analysis of traffic flow data are realized through Spark and Ray.

3) Example: customer behavior analysis system

Front end UI and presentation layer: a user-friendly interface is created for selecting data sources and specifying analysis requirements.

Data layer: customer behavior data, including browsing records, purchasing behavior, etc., are collected and stored in a database.

Service layer: customer behavior analysis, such as predicting customer purchase intent and recommending goods, is performed using machine learning algorithms.

Storage layer: the analysis results and the customer behavior data are stored.

Operating environment: large-scale data processing and parallel computation of the machine learning model are realized on Spark and Ray.

4) Example: environmental pollution monitoring system

Front end UI and presentation layer: a user-friendly interface is designed for selecting a monitoring data source and viewing monitoring results.

Data layer: environmental pollution data such as air quality, water quality, etc. are collected and stored in a database.

Service layer: a machine learning algorithm is used to analyze the environmental pollution trend and predict the pollution degree.

Storage layer: the monitoring results and the environmental pollution data are stored.

Operating environment: large-scale environmental monitoring data is processed and the analysis model is run in parallel through Spark and Ray.

The embodiments illustrate specific scenarios and implementations of a visual big data machine learning system based on Ray and Spark applied in different fields. The system can improve the efficiency of data processing and analysis, realize an intelligent decision and optimization scheme and provide powerful support for solving the problems in various fields.

It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.

The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto. Any modifications, equivalents, improvements and alternatives that a person skilled in the art makes within the technical scope disclosed herein and within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. The visual big data machine learning system based on the Ray and Spark is characterized by comprising a front-end UI, a display layer, a service layer, a data layer, a storage layer and an operation environment;

the storage layer is used for storing various structured and unstructured data;

2. A visual big data machine learning method based on Ray and Spark is characterized by comprising the following steps:

step one, various data sources based on Web visual pages are accessed;

step two, management and access of a machine learning model and a deep learning framework based on Web visual pages;

step three, a script generation and verification module;

step five, displaying and storing the results.

3. The visual big data machine learning method based on Ray and Spark as claimed in claim 2, wherein in step one, the user can select various configured data sources through the interface, including JDBC, HDFS, Kafka and Hive sources, and can also create custom data sources.

4. The method for machine learning visual big data based on Ray and Spark as claimed in claim 2, wherein in the first step, the data source contains some basic information such as IP, port, data source instance, user, password, etc. for accessing the corresponding data source.

5. The Ray and Spark based visual big data machine learning method of claim 2, wherein in step two, the Web page provides a platform for managing and using machine learning models and deep learning frameworks, through the Web interface, the user selects the pre-training models or frameworks and configuration parameters that the user wants to use, and then runs these models on their data, while the system supports the user to upload custom models, providing flexibility for different tasks.

6. The visual big data machine learning method based on Ray and Spark as claimed in claim 2, wherein in step three, the corresponding Spark SQL and Python scripts are automatically generated from the form configuration information by a template engine, in order to run big data processing and machine learning or deep learning tasks.

7. The Ray and Spark based visual big data machine learning method of claim 2, wherein in step three, code verification is performed before performing the calculation task, and the verification portion ensures that the generated code has no grammar errors.

8. The visual big data machine learning method based on Ray and Spark as claimed in claim 2, wherein the specific method of Spark and Ray distributed processing and calculating module and communication in the fourth step comprises:

9. The Ray and Spark based visual big data machine learning method of claim 2, wherein in step five, after the machine learning and deep learning tasks have run, the results can be displayed to the user in tabular form or saved to a designated data source for further analysis, such as generating various charts or tables to view the results, to help the user better understand and evaluate the performance and accuracy of the model.

10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the Ray and Spark based visual big data machine learning method of claim 2.