CN106855872A - Method for fast retrieval of massive pictures based on the Hadoop platform - Google Patents
Method for fast retrieval of massive pictures based on the Hadoop platform
- Publication number
- CN106855872A
- Authority
- CN
- China
- Prior art keywords
- namenode
- picture
- file
- datanode
- hadoop
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/144—Query formulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of computer big data processing, and specifically to a method for fast retrieval of massive pictures based on the Hadoop platform. Step 1: build the Hadoop cluster platform. Step 2: set the security policy. Step 3: store single pictures. Step 4: preprocess and merge files. Step 5: build the picture index. Step 6: the client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time segment in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The invention addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, effectively reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.
Description
Technical field
The present invention relates to the field of computer big data processing, and specifically to a method for fast retrieval of massive pictures based on the Hadoop platform.
Background
With the popularization and wide use of the Internet, e-commerce platforms and social networks continue to develop, and the number of pictures used for product display or social sharing is growing explosively. On these e-commerce and social networking sites, pictures convey far more information than text descriptions, so these sites place great emphasis on picture quality. According to an analysis of Taobao, picture access accounts for more than 91.5% of the traffic of the entire business platform. Users of Tencent's photo album upload as many as 1.1 billion pictures each week; the current total is nearly 70 billion pictures, with a total capacity of up to 15 PB. Because massive pictures consume massive storage space, performance bottlenecks arise in both picture storage and retrieval. Faced with massive picture resources, how to retrieve efficiently, and how to build an efficient, low-cost retrieval system under highly concurrent access, has become a problem in urgent need of a solution.
Hadoop is a software framework for distributed processing of massive data, and it is reliable, efficient, and scalable. It is reliable because it assumes that computing elements and storage can fail, so it maintains multiple copies of working data and can redistribute processing away from failed nodes. It is efficient because it works in parallel, which speeds up processing. It is scalable in that it can handle petabyte-scale data.
Because Hadoop was originally designed for processing large-scale text data, its built-in data types are limited and it cannot process picture data directly. In HDFS, files, directories, and the like are held in NameNode memory as objects, and each object occupies about 150 bytes. As the number of pictures grows, the memory consumed grows rapidly, and this heavy consumption of NameNode memory seriously limits the application of Hadoop. Moreover, retrieving a large number of small pictures is far slower than accessing a large file holding the same amount of data.
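By way of illustration only, the memory pressure described above can be estimated with a short calculation. The sketch assumes roughly 150 bytes per namespace object and assumes that each small picture costs one file object plus one block object; the function names and the packing factor of 10,000 pictures per merged file are hypothetical and serve only the arithmetic.

```python
def namenode_memory_bytes(num_files, blocks_per_file=1, bytes_per_object=150):
    """Estimate NameNode memory used by file objects and their block objects."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


def merged_memory_bytes(num_pictures, pictures_per_merged_file, bytes_per_object=150):
    """Same estimate after packing many pictures into each merged file.

    Assumes, for simplicity, that every merged file occupies a single block.
    """
    merged_files = -(-num_pictures // pictures_per_merged_file)  # ceiling division
    return namenode_memory_bytes(merged_files, 1, bytes_per_object)
```

Under these assumptions, 100 million unmerged pictures cost about 30 GB of NameNode memory, while the same pictures packed 10,000 to a merged file cost about 3 MB.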
Summary of the invention
To address the performance bottleneck in retrieving massive pictures, the present invention proposes a Hadoop-based method for retrieving massive pictures. Small pictures are merged by means of Sequence Files, and the offset within each Sequence File is set during merging. By parsing the index, the DataNode storing the picture's Block and the FileId are located quickly, which solves the problems of capacity expansion and fast retrieval for massive picture data.
To solve the above technical problem, the present invention is realized through the following technical solution:
Step 1: build the Hadoop cluster platform. An operating system and the Hadoop software are installed on every computer; one computer is configured as the NameNode and the other computers are configured as DataNodes. The machines communicate directly over SSH. The NameNode is responsible for managing the whole storage layer, and the DataNodes serve mainly as storage nodes. Connectivity between a DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own storage block information to the NameNode. When a client accesses the cluster, it first contacts the NameNode, which allocates the corresponding space; operations begin once that space has been obtained.
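The heartbeat detection and periodic block reporting of this step might be sketched as follows. This is a minimal in-memory illustration, not Hadoop's implementation: the class name, the 30-second timeout, and the use of caller-supplied timestamps are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Tracks, on the NameNode side, when each DataNode was last heard from."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}      # DataNode id -> time of last heartbeat
        self.block_reports = {}  # DataNode id -> set of block ids it reported

    def heartbeat(self, node_id, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen[node_id] = now

    def block_report(self, node_id, block_ids, now):
        """Record the blocks a DataNode holds; a report also proves it is alive."""
        self.block_reports[node_id] = set(block_ids)
        self.heartbeat(node_id, now)

    def live_nodes(self, now):
        """Return the DataNodes whose last heartbeat is within the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )
```

Passing the time in explicitly keeps the sketch deterministic; a real deployment would read a clock instead.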
Step 2: set the security policy. A new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine (referred to below as NameNode2), and the data in the original NameNode is copied to it. While the NameNode is running, NameNode2 monitors its running state in real time and applies the operations performed on the NameNode to its local copy in real time. When the NameNode fails, NameNode2 takes over from it so that service continues normally.
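As a hedged illustration of this backup strategy, the sketch below keeps a backup copy of a toy namespace in step with the primary and promotes the backup when the primary is reported down. The classes, the operation tuples, and the boolean liveness flag are hypothetical simplifications; they do not reproduce Hadoop's own high-availability mechanism.

```python
class NameNodeState:
    """Toy namespace: picture name -> block id."""

    def __init__(self):
        self.namespace = {}

    def apply(self, op):
        """Apply one operation of the form (kind, name, block_id)."""
        kind, name, block_id = op
        if kind == "add":
            self.namespace[name] = block_id
        elif kind == "delete":
            self.namespace.pop(name, None)


class BackupNameNode:
    """Backup machine that mirrors the primary and takes over on failure."""

    def __init__(self, primary_snapshot):
        self.state = NameNodeState()
        self.state.namespace = dict(primary_snapshot)  # initial full copy
        self.active = False

    def replicate(self, op):
        """Apply an operation that was just performed on the primary."""
        self.state.apply(op)

    def check_primary(self, primary_alive):
        """Promote this node to active once the primary is seen to be down."""
        if not primary_alive:
            self.active = True
        return self.active
```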
Step 3: store single pictures. A picture first passes through the load balancing module and then waits in the application server queue to enter the HDFS storage system. The NameNode assigns a DataNode to store it. During the write, the target Block is determined first and then the Sequence File; the combination of the two IDs becomes the picture's name within the system. Picture metadata is stored in HBase and is also kept in a cache system built on Redis. The picture write operation is then complete.
Step 4: preprocess and merge files. The picture files under a specified directory are read into a picture array and a byte array is initialized; the pictures are read into the byte array and written to the merged file under the specified path through the corresponding file output stream.
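The merging of this step can be illustrated, under simplifying assumptions, by concatenating picture bytes in memory and recording where each picture starts. This is only a stand-in: a real Hadoop SequenceFile stores key/value records with its own headers and sync markers, which the sketch does not attempt to reproduce, and the function names are hypothetical.

```python
def merge_pictures(pictures):
    """Concatenate pictures and return (merged_bytes, index).

    `pictures` maps picture name -> raw bytes. The returned index maps
    picture name -> (offset, length) within merged_bytes.
    """
    merged = bytearray()
    index = {}
    for name in sorted(pictures):  # fixed order so offsets are reproducible
        data = pictures[name]
        index[name] = (len(merged), len(data))
        merged.extend(data)
    return bytes(merged), index


def read_picture(merged, index, name):
    """Recover one picture from the merged bytes using its offset and length."""
    offset, length = index[name]
    return merged[offset:offset + length]
```

Recording the offset at merge time is what later allows a single picture to be read without scanning the whole merged file.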
Step 5: build the picture index. Picture names use a combined encoding that mainly comprises two parts, BlockId and FileId. BlockId identifies a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the Id of the SequenceFile into which the small picture was spliced; offset is the offset of the corresponding key value. After receiving a client request, the HDFS front end first parses the file name and uses this information to locate the corresponding Block file, FileId, and offset, after which the client reads the picture directly. Once the file name has been parsed, the DataNode data can be read directly and the start position of the picture can be located by the offset.
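One possible form of such a combined name is sketched below. The text does not specify the concrete encoding, so the `B..._F..._O...` layout, the `PictureLocation` type, and both function names are assumptions chosen for illustration.

```python
from typing import NamedTuple


class PictureLocation(NamedTuple):
    block_id: int  # storage unit holding the merged file
    file_id: int   # Id of the SequenceFile the picture was spliced into
    offset: int    # start position of the picture's record


def encode_name(loc):
    """Build a picture name that carries its own location."""
    return f"B{loc.block_id}_F{loc.file_id}_O{loc.offset}"


def decode_name(name):
    """Parse a picture name back into block id, file id, and offset."""
    parts = name.split("_")
    if len(parts) != 3 or [p[:1] for p in parts] != ["B", "F", "O"]:
        raise ValueError(f"not a picture name: {name!r}")
    return PictureLocation(*(int(p[1:]) for p in parts))
```

Because the location is recoverable from the name alone, the front end can resolve a request by parsing rather than by a per-picture metadata lookup on the NameNode.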
Step 6: the client initiates an access request with the picture name and creation time as parameters. The NameNode computes the minute-level time segment in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client. The client sends a picture read request to the nearest DataNode, and the DataNode computes the picture's exact address.
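The lookup of this step might be sketched as follows, assuming (hypothetically) that one merged file is produced per minute and that the NameNode keeps a table keyed by that minute. The key format, class name, and returned dictionary are illustrative choices, not details given in the text.

```python
from datetime import datetime


def minute_segment(created):
    """Map a creation time to its minute-level time segment key."""
    return created.strftime("%Y%m%d%H%M")


class MergedFileIndex:
    """NameNode-side table: minute segment -> merged file and its Blocks."""

    def __init__(self):
        self._by_minute = {}

    def register(self, created, merged_file, block_ids):
        """Record which merged file (and Blocks) covers a given minute."""
        self._by_minute[minute_segment(created)] = (merged_file, list(block_ids))

    def lookup(self, picture_name, created):
        """Return the information the client needs to contact a DataNode."""
        merged_file, block_ids = self._by_minute[minute_segment(created)]
        return {
            "picture": picture_name,  # passed on so the DataNode can resolve the exact address
            "merged_file": merged_file,
            "blocks": block_ids,
        }
```

The NameNode thus answers with one small dictionary lookup per request, leaving the per-picture address computation to the DataNode.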
Compared with the prior art, the beneficial effects of the present invention are as follows: it addresses the excessive NameNode memory consumption and low retrieval efficiency that arise when Hadoop retrieves massive numbers of pictures, effectively reduces the NameNode load during retrieval, and improves NameNode performance, thereby promoting wider application of the Hadoop platform.
Brief description of the drawings
Fig. 1 is a flow chart of picture storage.
Fig. 2 is a flow chart of picture retrieval.
Detailed description of the embodiments
Specific embodiments of the invention are given below with reference to Fig. 1 and Fig. 2 to further describe the invention.
Embodiment 1:
First: deploy the Hadoop cluster. After the operating system has been deployed, check the network to make sure all machines in the cluster can communicate with one another. Install SSH and configure passwordless SSH login. Append the IP-to-hostname mappings to the end of the /etc/hosts file, and install the Java environment. At the end of conf/hadoop-env.sh, add export JAVA_HOME=/usr/jdk1.6.0; add testA to the masters file; add test1, test2, and test3 to the slaves file; and modify the conf/core-site.xml file.
Second: install Redis. Download Redis, copy it to the appropriate directory, compile and install it, and start the service.
Third: install HAProxy. Download HAProxy, copy it to the appropriate directory, and compile and install it.
Fourth: the client first initiates a write request to the NameNode. The request passes through the load balancing module and waits in the application server queue to enter the HDFS storage system. When the request reaches the NameNode, the NameNode selects a DataNode with a writable Block, together with that Block, according to a weighted average of each DataNode's writable blocks, capacity, and load, and returns this information to the client.
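The weighted selection described here could take a form such as the following. The embodiment names the three factors but not the weights or units, so the weights (0.4, 0.3, 0.3), the normalization of each factor to the range 0 to 1, and the dictionary field names are all assumptions for the purpose of illustration.

```python
def score(node, weights=(0.4, 0.3, 0.3)):
    """Weighted average of writable blocks, free capacity, and idleness.

    Every input is assumed to be normalized to [0, 1]; a higher score
    means the DataNode is a better target for the write.
    """
    w_blocks, w_capacity, w_load = weights
    return (
        w_blocks * node["writable_blocks_ratio"]
        + w_capacity * node["free_capacity_ratio"]
        + w_load * (1.0 - node["load"])  # lower load scores higher
    )


def select_datanode(nodes):
    """Pick the best-scoring DataNode among those with a writable Block."""
    candidates = [n for n in nodes if n["writable_blocks_ratio"] > 0]
    if not candidates:
        raise RuntimeError("no DataNode has a writable Block")
    return max(candidates, key=score)
```

Nodes without any writable Block are excluded before scoring, so a lightly loaded but full node is never returned.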
Fifth: the client selects one DataNode as the Master from the set returned by the NameNode. The choice is determined by each DataNode's load and by the number of times it currently serves as Master, so that every DataNode has an equal chance of becoming Master. Once a Master is selected, it is not changed unless it goes down. If the Master goes down, a new Master must be selected from the remaining DataNodes.
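A minimal sketch of this Master selection is given below. It assumes that fairness is achieved by preferring the node that has served as Master least often and breaking ties by load; this ordering, the `alive` flag, and the function name are hypothetical, since the embodiment states only which two factors are considered.

```python
def choose_master(nodes, master_counts):
    """Choose a Master among live DataNodes and record the choice.

    `nodes` is a list of dicts with keys "id", "load", and "alive".
    `master_counts` maps DataNode id -> times it has served as Master
    and is updated in place.
    """
    alive = [n for n in nodes if n["alive"]]
    if not alive:
        raise RuntimeError("no live DataNode available")
    chosen = min(
        alive,
        key=lambda n: (master_counts.get(n["id"], 0), n["load"], n["id"]),
    )
    master_counts[chosen["id"]] = master_counts.get(chosen["id"], 0) + 1
    return chosen["id"]
```

Calling the function again after marking the current Master as not alive yields a replacement from the remaining DataNodes, which corresponds to the re-selection on failure.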
Sixth: the client writes the data to the Master, and the Master then writes it to Slave A and Slave B following the HDFS concurrent write procedure. When all data writes have finished, the Master reports the Block information to the NameNode. The NameNode receives the Block information and returns a write-complete message.
Seventh: a read request reaches the picture server through the load balancer. The request first checks, through the Redis cache module, whether the cache contains the picture information; if not, the picture information is retrieved from HBase and the result is written into the cache.
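The cache-then-store read path of this step can be illustrated with plain dictionaries standing in for Redis and HBase. No Redis or HBase client calls are used; the class name and the read counter are hypothetical and exist only to make the behavior observable.

```python
class PictureInfoService:
    """Read-through cache: check the cache first, fall back to the store."""

    def __init__(self, store):
        self.store = store    # stand-in for HBase: picture name -> info dict
        self.cache = {}       # stand-in for the Redis cache
        self.store_reads = 0  # counts how often the store had to be consulted

    def get_info(self, name):
        """Return picture information, filling the cache on a miss."""
        if name in self.cache:
            return self.cache[name]
        self.store_reads += 1
        info = self.store.get(name)
        if info is not None:
            self.cache[name] = info  # write the retrieval result into the cache
        return info
```

A second request for the same picture is served from the cache without touching the store, which is the effect the Redis layer is meant to provide.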
Eighth: the request reaches HDFS to read the picture content. The picture name is designed as the BlockId plus the FileId within the Block and the offset; HBase is queried by picture file name for related information such as the picture's name and description.
Ninth: the NameNode maintains the mapping between Blocks and DataNodes, and uses the Block parsed from the request to determine which DataNode holds that Block.
Tenth: after the client obtains the Block from the DataNode address given by the NameNode, it retrieves the picture information according to the FileId.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential attributes. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be included in the invention.
Moreover, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may be appropriately combined to form other embodiments that those skilled in the art can understand.
Claims (6)
1. Build the Hadoop cluster platform: an operating system and the Hadoop software are installed on every computer; one computer is configured as the NameNode and the other computers are configured as DataNodes; the machines communicate directly over SSH; the NameNode is responsible for managing the whole storage layer, and the DataNodes serve mainly as storage nodes; connectivity between a DataNode and the NameNode is verified by heartbeat detection, and each DataNode also periodically sends its own storage block information to the NameNode; when a client accesses the cluster, it first contacts the NameNode, which allocates the corresponding space, and operations begin once that space has been obtained.
2. Set the security policy: a new node, DataNode2, is added to the Hadoop cluster platform as the NameNode backup machine (referred to below as NameNode2), and the data in the original NameNode is copied to it; while the NameNode is running, NameNode2 monitors its running state in real time and applies the operations performed on the NameNode to its local copy in real time; when the NameNode fails, NameNode2 takes over from it so that service continues normally.
3. Store single pictures: a picture first passes through the load balancing module and then waits in the application server queue to enter the HDFS storage system; the NameNode assigns a DataNode to store it; during the write, the target Block is determined first and then the Sequence File, and the combination of the two IDs becomes the picture's name within the system; picture metadata is stored in HBase and is also kept in a cache system built on Redis; the picture write operation is then complete.
4. Preprocess and merge files: the picture files under a specified directory are read into a picture array and a byte array is initialized; the pictures are read into the byte array and written to the merged file under the specified path through the corresponding file output stream.
5. Build the picture index: picture names use a combined encoding that mainly comprises two parts, BlockId and FileId; BlockId identifies a storage unit, from which the NameNode can determine the nearest DataNode address; FileId is the Id of the SequenceFile into which the small picture was spliced; offset is the offset of the corresponding key value; after receiving a client request, the HDFS front end first parses the file name and uses this information to locate the corresponding Block file, FileId, and offset, after which the client reads the picture directly; once the file name has been parsed, the DataNode data can be read directly and the start position of the picture can be located by the offset.
6. The client initiates an access request with the picture name and creation time as parameters; the NameNode computes the minute-level time segment in which the picture falls and the Blocks information of the corresponding merged file, and returns them to the client; the client sends a picture read request to the nearest DataNode; the DataNode computes the picture's exact address.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510908363.XA CN106855872A (en) | 2015-12-08 | 2015-12-08 | Method for fast retrieval of massive pictures based on the Hadoop platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510908363.XA CN106855872A (en) | 2015-12-08 | 2015-12-08 | Method for fast retrieval of massive pictures based on the Hadoop platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106855872A (en) | 2017-06-16 |
Family
ID=59133083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510908363.XA Pending CN106855872A (en) | Method for fast retrieval of massive pictures based on the Hadoop platform | 2015-12-08 | 2015-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106855872A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107800808A (en) * | 2017-11-15 | 2018-03-13 | 广东奥飞数据科技股份有限公司 | A kind of data-storage system based on Hadoop framework |
CN108647290A (en) * | 2018-05-06 | 2018-10-12 | 深圳市保千里电子有限公司 | Internet cell phone cloud photograph album backup querying method based on HBase and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103116643A (en) * | 2013-02-25 | 2013-05-22 | 江苏物联网研究发展中心 | Hadoop-based intelligent medical data management method |
CN103500089A (en) * | 2013-09-18 | 2014-01-08 | 北京航空航天大学 | Small file storage system suitable for Mapreduce calculation model |
CN103559229A (en) * | 2013-10-22 | 2014-02-05 | 西安电子科技大学 | Small file management service (SFMS) system based on MapFile and use method thereof |
US20140215258A1 (en) * | 2013-01-31 | 2014-07-31 | International Business Machines Corporation | Cluster management in a shared nothing cluster |
Non-Patent Citations (3)
Title |
---|
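Zuo Dapeng: "Research and Implementation of Small File Storage Management in Hadoop", China Master's Theses Full-text Database, Information Science and Technology * |
Zhang Weidong: "Research and Design of a Hadoop-Based Cloud Storage System for Massive Pictures", China Master's Theses Full-text Database, Information Science and Technology * |
Li Lin: "Analysis and Design of a Hadoop-Based Storage Model for Massive Pictures", China Master's Theses Full-text Database, Information Science and Technology * |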
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170616 |