WO2021237156A1 - Method and system for generating training data to be used in machine learning models - Google Patents
Method and system for generating training data to be used in machine learning models
- Publication number
- WO2021237156A1 (PCT/US2021/033756)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- identifiers
- computer
- images
- identifier
- multiple users
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
- G06F18/2178—Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
Definitions
- the invention relates to concepts for generating training data to be used in machine learning models and applications thereof, and in particular to a computer-assisted method and a computer-assisted system for generating such training data. More particularly, examples of the invention provided herein relate to methods and systems for integrating expert knowledge into machine learning models and other artificial intelligence system applications.
- Geoscience observations and data are well known to be difficult to organize and objectify. Many attempts have been made to create “data standards” for geoscience data, especially for meta-data and especially for geophysical data. These attempts tend to reduce the loading costs for computer programs that process and analyze the data. However, the applicant-inventor finds that these very data standards obscure the inferences hidden within the elements of the recorded data, including the collected data itself and also including the history and timing of the data collection. A well-trained human expert will observe some of these inferences and use their experience when interpreting the data to reach geologic, extraction, and commercial conclusions.
- a further problem is that the standards-driven data format renders the data in an aggregate form that is associated more with the character of the physical acquisition or geography of the collected data, rather than granular and related to the geo-characteristics of interest (geophysical, petrophysical, geological, or other observed measurement).
- the computer-assisted method comprises retrieving, by multiple users, images concerning a same subject to be investigated.
- the computer-assisted method comprises generating, by the multiple users, identifiers per each of the images.
- the identifiers per each of the images are respectively associated with a different feature being used to classify the subject.
- the computer-assisted method comprises registering the identifiers in accordance with respective expert knowledge approbations of the multiple users.
- the computer-assisted method comprises collecting, by a common computing infrastructure, the registered identifiers from the multiple users.
- the computer-assisted method comprises generating, by the common computing infrastructure, training data to be used in machine learning models based on the collected identifiers.
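- The method steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Identifier` record, the `APPROBATIONS` registry and all user and image names are hypothetical, and "registering in accordance with expert knowledge approbations" is read here, as one possible interpretation, as accepting an identifier only from a user approved for that feature.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    user: str      # who generated the identifier
    image_id: str  # which image it refers to
    feature: str   # feature used to classify the subject


# Hypothetical registry: features each user is approved (registered) for.
APPROBATIONS = {
    "alice": {"fault", "flat spot"},
    "bob": {"fault"},
}


def register(identifier, approbations=APPROBATIONS):
    """Accept an identifier only if its user is registered for its feature."""
    return identifier.feature in approbations.get(identifier.user, set())


def collect(submissions):
    """Common-infrastructure step: keep only the registered identifiers."""
    return [ident for ident in submissions if register(ident)]


def generate_training_data(collected):
    """Group collected identifiers by feature, one training set per feature."""
    training = defaultdict(list)
    for ident in collected:
        training[ident.feature].append((ident.image_id, ident.user))
    return dict(training)


submissions = [
    Identifier("alice", "img-1", "fault"),
    Identifier("bob", "img-1", "fault"),
    Identifier("bob", "img-2", "flat spot"),  # bob is not registered for this
]
training_data = generate_training_data(collect(submissions))
```

- In this sketch the unregistered submission is simply dropped; a real system might instead keep it with a lower weight.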
- the term user can also be referred to as user equipment such as terminals, laptops etc., which the user possesses and which may be dedicated to the user.
- identifier may be understood as identifying information on the respective image.
- feature may be understood as characteristic/attribute of the subject or something which belongs inherently to the subject, which itself may be an object to be investigated, such as an entity of which a plurality of images exists.
- the expert knowledge approbations may be understood as an expertise or level of expertise which is known beforehand and could be assigned by a central authority. This makes the machine learning models more reliable.
- the identifiers per each of the images can be different from one another.
- the images can be retrieved from the common computing infrastructure.
- the common computing infrastructure may be a central computing system, in particular a server, to all of the users.
- the multiple users may be situated at different locations apart from a common computing infrastructure, for example all around the world.
- a decentralized system of information / sharing of information may thus be reliably provided.
- the images retrieved by the users may vary at least partially from one user to another. Consequently, each user may provide a further piece of information not being provided yet.
- the machine learning model can thus become more flexible.
- the different features may represent different objects, different characteristics, different modes and/or different attributes of the subject.
- the features to be investigated may vary from one another and lead to a more effective and reliable outcome.
- the step of collecting may be performed at different timings for different users. Further, the step of registering may be performed before retrieving the images from the common computing infrastructure or after generating the identifiers.
- the method may further include periodically updating the training data based on newly added identifiers and already collected identifiers.
- the correctness of the derived training model may be increased from time to time.
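- A periodic update of this kind could look roughly like the following sketch. The function name `update_training_data` and the feature-to-image mapping are assumptions made for illustration; the description does not say how duplicates are handled, so dropping exact duplicates is a choice made here.

```python
def update_training_data(collected, newly_added):
    """Merge newly added identifiers into the collected ones.

    Both arguments map a feature name to a list of items. Neither input
    is modified; a new merged mapping is returned.
    """
    merged = {feature: list(items) for feature, items in collected.items()}
    for feature, items in newly_added.items():
        bucket = merged.setdefault(feature, [])
        for item in items:
            if item not in bucket:  # skip exact duplicates
                bucket.append(item)
    return merged


collected = {"fault": ["img-1"]}
newly_added = {"fault": ["img-1", "img-3"], "flat spot": ["img-2"]}
updated = update_training_data(collected, newly_added)
```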
- the identifiers per each of the images may interdepend from one identifier to another.
- a feature in one identifier may be a context in another identifier.
- the training data may be organized as a tree map from an abstract object level defining the subject to be investigated to a specific feature level defining details belonging to the subject to be investigated.
- the tree map may be organized by the multiple users in a crowd-based manner. Thus, the crowd of users may define the structure of the tree map. Thereby, the interdependence of feature and context from one identifier to another may be constructed.
- the models of training data may be crowd-based and less prone to errors.
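- One way to picture such a tree map is a nested mapping that runs from the abstract subject down to specific features. The sketch below is illustrative only: the node names are borrowed from the coin example elsewhere in this description, and the nested-dict layout and the `path_to` helper are assumptions.

```python
# Hypothetical tree: the subject at the root, more specific features below.
TREE = {
    "coin": {
        "one-cent coin": {
            "grade": {"wheat stalks": {}, "Lincoln's hair": {}},
        },
    },
}


def path_to(tree, target, trail=()):
    """Depth-first search: path from the root to target, or None if absent."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == target:
            return here
        found = path_to(children, target, here)
        if found is not None:
            return found
    return None
```

- For example, `path_to(TREE, "wheat stalks")` walks from the abstract object level ("coin") to the specific feature level.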
- the step of generating the identifiers may be performed by using an identifier tool.
- the identifier tool may be used by each of the multiple users to indicate areas of each of the images as underlying information of a respective identifier.
- the identifier tool may be used to indicate explicitly areas of interest and areas not of interest.
- the identifier tool may be used to indicate explicitly areas of interest and areas of context.
- binary information may be provided. This may further reduce the amount of data to be transmitted to the common computing infrastructure and/or stored in the common computing infrastructure.
- Each one of the identifiers may include first and second indicators.
- the first indicator may indicate a feature area of the subject to be investigated.
- the second indicator may indicate a context area of the subject to be investigated. These two indicators may be regarded as binary information of the identifier.
- a reduction in data may therefore be achieved.
- Each of the identifiers may include only a subset of information of the corresponding image. Thus, not the whole information of the image is sent per each of the identifiers from the user to the common computing infrastructure, but merely a pointer to the information or part of the image information to which the first and/or second indicators point or are associated with. This may reduce an information load drastically.
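- The data reduction described above can be illustrated with a small sketch. Everything concrete here is an assumption: the `IdentifierPayload` name, the use of rectangular boxes to encode the two indicators, and JSON as the wire format. The point is only that the message carries a pointer to the image plus the two indicated areas, not the pixels.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

# (left, top, right, bottom) in pixels; an assumed encoding of an area
Box = Tuple[int, int, int, int]


@dataclass
class IdentifierPayload:
    image_id: str      # pointer to an image already held by the infrastructure
    name: str          # label of the identifier
    feature_area: Box  # first indicator
    context_area: Box  # second indicator


def encode(payload):
    """Serialize the identifier; no pixel data is included."""
    return json.dumps(asdict(payload)).encode("utf-8")


payload = IdentifierPayload(
    image_id="image-0001",
    name="example identifier",
    feature_area=(10, 10, 90, 90),
    context_area=(0, 0, 100, 100),
)
message = encode(payload)
image_bytes = 100 * 100 * 3  # an uncompressed 100x100 RGB image, for comparison
```

- Freehand rubbed areas would need a richer encoding than a box (for example a mask), which is left out of this sketch.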
- the above-mentioned demand is also solved by a computer program.
- the computer program product comprises instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the computer-assisted method as described above.
- the above-mentioned demand is also solved by a computer-readable data carrier.
- the computer-readable data carrier has stored thereon the computer program as described above.
- the above-mentioned demand is also solved by a computer-assisted system for generating training data to be used in machine learning models.
- the system is configured to provide, to multiple users, images concerning a same subject to be investigated.
- the system is configured to collect, from the multiple users, identifiers.
- the identifiers are registered in accordance with respective expert knowledge approbations of the multiple users.
- the identifiers are generated by the multiple users per each of the images.
- the identifiers are associated with a different feature being used to classify the subject.
- the system is configured to generate training data to be used in machine learning models based on the collected identifiers.
- the software means can be related to programmed microprocessors or a general computer, an ASIC (Application Specific Integrated Circuit) and/or DSPs (Digital Signal Processors).
- the common computing infrastructure, user (equipment), computer-assisted method and computer-assisted system may be implemented partially as a computer, a logical circuit, an FPGA (Field Programmable Gate Array), a processor (for example, a microprocessor, a microcontroller (µC) or an array processor), a core, a CPU (Central Processing Unit), an FPU (Floating Point Unit), an NPU (Numeric Processing Unit), an ALU (Arithmetic Logical Unit), a coprocessor (a further microprocessor supporting a main processor (CPU)), a GPGPU (General Purpose Computation on Graphics Processing Unit), a multi-core processor (for parallel computing, such as simultaneously performing arithmetic operations on multiple main processors and/or graphics processors) or a DSP.
- Fig. 1 illustrates a schematic diagram of the general flow and relationship from identified features in the data under investigation to a trained model for machine interpretation.
- Fig. 2 illustrates a photograph of a United States one-cent coin 100.
- Fig. 3 illustrates the United States one-cent coin with areas identified, representing identifier 200, the existence of a coin.
- Fig. 4 illustrates the United States one-cent coin with areas identified, representing identifier 300, the type of coin.
- Fig. 5 illustrates the United States one-cent coin with areas identified, representing identifier 400, the grade of coin.
- Fig. 6 illustrates a collection of sets-of-data 1000, gathered by the identifier tools.
- Fig. 7 illustrates feeding the identifier data sets from the collection 1000 into training sets of the training set collection 1200.
- a user has a user computing machine, for example a user equipment.
- examples of a user computing machine include a laptop, a desktop, a surface PC, a touch screen or touch pad, etc.
- the user is able to bring up or otherwise see an image representing the data on the user computing machine.
- the identifier is for a characteristic that is associated with or can be associated with the data.
- the identifier is a characteristic that is associated with geoscience data, in a particular example, with a characteristic that is associated with seismic data.
- an identifier tool is for indicating an attribute (such as a top of a horizon, flat spot, seismic multiple, geologic time, paleo bug, fault, velocity, fold, source, processing, data shot, data processed, reservoir, source, seal, time, depth, oil, gas, offset, unconformity, pre-stack depth migration, etc.).
- the identifier tool is for indicating several modes as (part of the) identifiers: the identified object, the surroundings (context about the object), NOT the object.
- the identifier has two other modes: the “cause” of the object, and the “causation” of the object.
- seismic pre-stack reflections are the “cause” of a stack-depth-migrated (SDM) reflection and the stack-depth-migrated (SDM) reflection is the causation of the pre-stack reflections.
- the stack-depth-migrated (SDM) reflection was caused by the pre-stack reflections (PSR).
- a well-log example of this would be a prograding upward sequence, for instance.
- Other examples include source signature, deconvolution, etc.
- an identifier tool is a touch screen used in collaboration with the finger of the user.
- the user selects an identifier by touching a location on the screen to select an identifier for consequent or subsequent use by the identifier tool.
- a glyph or icon is shown on a screen to represent an identifier that is available for selection.
- the touch screen associates an image (in one example, a glyph of a finger or paintbrush or marking pencil or glyph to represent which identifier will be marked using the identifier tool) representing the identifier tool that the user moves and operates using gestures with one or more fingers.
- this interactive system enables the user to rub an area using, in one example, finger(s) or, in another example, a stylus, mouse or pencil-like device.
- the user selects the identifier to use by typing a name for the identifier through a physical or displayed keyboard.
- the identifier tool is now associated with an identifier and ready for use to mark the area of the image that is to be associated with the identifier.
- the user also selects the mode by the identifier tool.
- the data gathered by the identifier tools are uploaded so that the identifiers are actually crowd sourced.
- the upload is to a common computing infrastructure, such as a server, for example, a macOS server.
- the user is “registered” for certain data sets, such as the training data, and/or identifiers. Their registration can also be an attribute. Their registration may depend on their expertise, such as an approbation. For example, one user may be a flat-spot specialist, a velocity cube specialist, or a particular play specialist. They will contribute certain identifiers, if not all identifiers. In one example, identifiers of others may or may not be shared with other users. In one example, two or more users contributing to the same identifier may or may not see the work of the other. In one example, a user may agree or disagree with the work of the other.
- a model is built (and periodically rebuilt) using the collection of identifiers (the identifier work) of the users.
- the identifiers are organized using a “crowd organized mapping” technique.
- a user can choose a crowd organized map (COM), their own organization, or an edited crowd organized map (COM), or alternate crowd organized maps (COM), such as in the case of binomial trends in the crowd organized maps (COM).
- an authorized user can have an image processed for one or more identifiers. In one example, this is done at the user device level, such as the user equipment, given that the models can be downloaded (and organized) and applied to any images / data that the user is provided with.
- a “fault plane” model uses all fault planes regardless of other identifiers. This would be a “general” set.
- a “fault plane” model is specific to a time range, bin size, fold, or some other characteristic. This would be a “specific” set.
- the user can “flip through” the models (the sets) so a “general” model may show several fault areas on the data set, but swapping to a more specific model will change what is identified on the image / data.
- the flipping can be similar to an animation or a blink comparator, as historically used in astronomy applications.
- the image / data displayed on the user equipment with machine learning identified identifiers become an “AI Interpretation”.
- the user can take the AI Interpretation and go into an edit mode — refining or correcting the interpretation.
- the edited / refined / corrected interpretation (of the identifiers) is uploaded to the server and used to update or otherwise improve the models.
- the mere presence of a well at a location is a piece of intelligence. Someone, at some point in time, felt that that particular well location was worthy of investment, regardless of what outcome actually happened.
- one “general” model takes all wells and high-grades, as an identifier, areas that are similar in appearance. If the wells were further differentiated, then the more specific models would provide more differentiated information to the user (examples include producing wells, gas versus oil, horizon tops, etc.).
- a color coding is used to mark the identifier tool rubbed areas based on the mode of the identifier. For example, rubbing a red transparent color as the “not” mode, a green color as the “is” mode, a yellow color as “associated with” mode, an orange color as “associated with not” mode.
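- The mode-to-color assignment in this example can be written down as a simple lookup. The `Mode` enum and `mode_for_color` helper are illustrative names, and the color table is only the assignment given in the example above; the description treats the specific colors as a matter of choice.

```python
from enum import Enum


class Mode(Enum):
    IS = "is"
    NOT = "not"
    ASSOCIATED_WITH = "associated with"
    ASSOCIATED_WITH_NOT = "associated with not"


# Colors taken from the example in the text; any other assignment would do.
MODE_COLORS = {
    Mode.IS: "green",
    Mode.NOT: "red",
    Mode.ASSOCIATED_WITH: "yellow",
    Mode.ASSOCIATED_WITH_NOT: "orange",
}


def mode_for_color(color):
    """Reverse lookup: which mode does a rubbed color stand for?"""
    for mode, mode_color in MODE_COLORS.items():
        if mode_color == color:
            return mode
    raise ValueError(f"no mode uses color {color!r}")
```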
- identifiers include inferred rock properties, resultant information (e.g., size of field in barrels), “closure vs. non-closure”. Beyond basic, well known, types of identifiers, users (the crowd) will create new identifiers, especially as industry and technology evolve. In further examples, the specific color assigned to a mode is a matter of choice.
- Identifier data set 10 is one or more interpretations of a feature of the data under investigation (for example, a well log, a seismic line, a photo of a collector coin). Identifier 10 has one or more modes associated with the feature or context. A second identifier data set 11 is also shown, representing another feature that has been interpreted. In one example, an identifier data set is gathered from many sources around the world. The mode information of identifier data set 10 is fed into training set 20. In one example, the interdependence of the identifier data sets is carried to the training sets and resulting in trained models. Training set 20 produces a trained model 30. Trained model 30 is used to machine interpret new data.
- a seismic section is used from the perspective of the geoscience expert. If the user has a seismic section on their touch device, such as a touch pad, there are all kinds of things or indications that the seismic section is revealing. So, there's a cascade of different training sets.
- the seismic section that the user looks at may be of a particular fold. It may be 2-D versus 3-D. It may have missing shot points or missing receiver locations. These attributes or indications will show up as a peculiar look to portions of the seismic section.
- the seismic section may have some near surface static corrections that are needed.
- the seismic section may show a particular geologic horizon. It may show a particular geologic age. It may show an unconformity. There is a plethora of attributes or indications that it could be used for, and each one of those is a potential training set. Then, for example, one needs to have a constellation or collection of the different training sets that are then used to make an even more definitive identification of what one is looking for in the data.
- the seismic data on the user’s screen shows a portion of the seismic section where it is trying to “jump across” a well platform. So, that disruption, that loss of fold in the data in the near traces is visually apparent to the geoscience interpreter, the expert.
- there are two pieces of information that would be valuable for the training set used to create the machine learning (ML) model. One of them is the actual portions that are obviously disrupted by the loss of the near trace data. In one example, the user could approximately rub their finger across those areas on the displayed seismic section and give them a semitransparent tinting to tell the training set that this is where the problem is.
- the other is the portion of the data on either side of the area disrupted by the loss of near traces and fold.
- That nearby portion of the data puts into context the portion that is disrupted.
- the user can picture that larger area to be like a light transparent green tinting (for example) and the area that is the actual disruption would be like a reddish tinting (for example).
- the area that is the actual disruption (the feature, the red tinted area) would be a subset of the overall green area, so the green area will encapsulate the red area. So, those two pieces of highlighting tell the machine learning that it does not have to look at the entire image that is displayed on the screen; it just needs the green context area and the red feature area, the feature in this example being the disruption area.
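- The relationship between the two highlighted areas can be sketched with plain 0/1 masks. This is a toy illustration under stated assumptions: the masks, the helper names (`bounding_box`, `is_subset`, `crop`) and the idea of cropping to the bounding box of the context area are not taken from the description, which only says that the learning need not look at the entire image.

```python
def bounding_box(mask):
    """Smallest (top, left, bottom, right) box covering all set cells, or None."""
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def is_subset(feature, context):
    """True if every feature cell is also a context cell."""
    return all(
        c or not f
        for f_row, c_row in zip(feature, context)
        for f, c in zip(f_row, c_row)
    )


def crop(image, box):
    """Cut the boxed region out of a row-major image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# 4x5 toy image; the context (green) mask encloses the feature (red) mask.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
context = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
feature = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
assert is_subset(feature, context)
patch = crop(image, bounding_box(context))
```

- Only `patch` and the two masks would then be needed downstream, rather than the full displayed image.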
- these steps can be done with other features, data attributes.
- the seismic is six-fold land 2-D data
- the user can just basically send the entire data set without even needing to highlight it with their finger.
- the user can essentially send the entire data set over to each of those three training sets, for them to learn on that seismic line.
- the indications or attributes of the seismic can be recognized by the chevron “V’s” near surface where one goes from one-fold to six-fold with depth.
- the user can identify visually that it is a low-fold section because of the chevron “V’s”.
- Seismic Example - General Sets. Specific Sets
- the data gathered for an identifier (comprising information on the “objects”, features, attributes, or indications) are ranked or otherwise given an indication of quality.
- the indication of quality or ranking of the objects is context-sensitive. This means that a particular object has a ranking or quality indication for each set of which it is a member.
- a general set of all fault planes will have fault plane objects.
- a given fault plane object will have a ranking or quality indication within that general set.
- That same fault plane object for example, is further identified as a fault plane associated with a producing growth fault.
- That fault plane object is also a member of a specific set — that of fault plane objects associated with producing growth faults.
- the same fault plane object in the general set has a quality indication and also has a quality indication in the specific set.
- the object can have a different quality or ranking in each of the sets, and may even have a dissonant ranking.
- the fault plane object may be highly ranked in quality in the general set of all fault planes (an “exemplary” object within that set) but have a lower quality ranking in the specific set of fault planes associated with producing growth faults. This seeming dissonance or incongruence can be important in recognizing patterns in scientific data and images.
- the set-specific ranking of objects opens up the ability to improve neural processing that uses the context-sensitive ranking to recognize object features.
- a context of “growth fault” or “producing” may use the context sensitive ranking to use high ranking objects as exemplary for purposes of self-training.
- low ranking objects can be used as de-classifiers in the self-training process.
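- The set-specific ranking can be sketched as a score per object per set. All values, thresholds and names below (`rankings`, `split_by_rank`, `dissonant`) are made up for illustration; the description does not specify how quality is scored or how exemplars and de-classifiers are selected.

```python
# rankings[set_name][object_id] -> quality score in [0, 1]; values are made up.
rankings = {
    "all fault planes": {"fp-1": 0.95, "fp-2": 0.20, "fp-3": 0.60},
    "producing growth faults": {"fp-1": 0.30, "fp-3": 0.90},
}


def split_by_rank(set_name, high=0.8, low=0.3):
    """Return (exemplars, declassifiers) for one set, using assumed thresholds."""
    scores = rankings[set_name]
    exemplars = sorted(o for o, s in scores.items() if s >= high)
    declassifiers = sorted(o for o, s in scores.items() if s <= low)
    return exemplars, declassifiers


def dissonant(object_id, gap=0.5):
    """True if the object's rankings differ by at least `gap` across its sets."""
    scores = [s[object_id] for s in rankings.values() if object_id in s]
    return len(scores) > 1 and max(scores) - min(scores) >= gap
```

- Here the same object ("fp-1") is an exemplar in the general set and a de-classifier in the specific set, which is the kind of dissonance the text describes.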
- the user then also has a bit of an interesting aspect.
- the user also possibly could identify what coin it is. Of course, that would be more than color, because it could identify a liberty dime from a Roosevelt dime, for instance.
- another aspect is whether or not the picture is out of focus for that particular coin, that would render its identification impossible or very difficult. And so, the user would want a training set that would just simply be able to flag whether or not the picture is in good enough focus to take or not.
- the outcome of initial trained model(s) determines the next model(s) to apply in the course of identifying and grading a coin.
- the penny is normally graded based on distinguishing features, such as how well the wheat stalks on the back are visible. In other words, there are these little lines that cartoon the stalks of wheat. If those lines are nice and crisp and sharp, then that coin is in a very fine or better condition. If those lines are non-existent, worn smooth, then the coin is in fair or good condition.
- the detail of the features of Lincoln's face on the obverse (the front side of the coin) is also an indicator of quality: the more detail that is visible in his hair, cheek bones and such, the better the condition.
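- The idea that the outcome of an initial model decides which model runs next can be sketched as a small cascade. The "models" below are stand-in functions over a dictionary describing the photo, and every name and threshold (`run_cascade`, `sharpness`, `wheat_lines`, 0.5) is hypothetical; real trained models would take their place.

```python
def run_cascade(image, stages):
    """Run (name, model) stages in order; stop at the first stage returning None."""
    results = {}
    for name, model in stages:
        outcome = model(image, results)
        if outcome is None:
            break
        results[name] = outcome
    return results


# Stand-in "models": plain functions over a dict describing the photo.
def in_focus(image, _results):
    return True if image.get("sharpness", 0) >= 0.5 else None


def coin_type(image, _results):
    return image.get("type")


def coin_grade(image, results):
    # Grading criteria depend on the type found by the previous stage.
    if results["type"] == "wheat penny":
        return "very fine" if image.get("wheat_lines") == "crisp" else "good"
    return None


STAGES = [("focus", in_focus), ("type", coin_type), ("grade", coin_grade)]
```

- An out-of-focus photo stops the cascade at the first stage, so no type or grade model is applied to it.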
- Fig. 2 illustrates a photograph of a United States one-cent coin 100.
- Fig. 3 illustrates the United States one-cent coin with areas identified, representing identifier 200, the existence of a coin.
- the rim around the coin is identified as feature 1 and the area inside the rim is identified as context 2.
- Feature 1 and context 2 represent modes of identifier 200.
- In order to identify a penny, the user would highlight the area that is inside the circle. In general, except for a 1943 steel penny, most pennies would be identified by their color. In one example, the user may also want a separate or bifurcated training set that identifies the penny by its different faces, for example, Lincoln’s face, the Lincoln Memorial, the wheat stalks, the Indian on the Indian Head penny, the big letters “ONE CENT” on many of the one-cent pieces, and so on.
- Fig. 4 illustrates the United States one-cent coin with areas identified, representing identifier 300, the type of coin.
- Feature area 3 is the surface of the coin, which contains type-identifying information.
- feature 3 mode is the same as the context 2 mode that was discerned in the making of identifier 200.
- Context area 2 of identifier 200 is now assigned to also be the feature area 3 of identifier 300. This indicates that there is a mapping between the two identifiers. In this example, proceeding with identifying the type of coin for identifier 300 is dependent on the successful identification of the existence of a coin from identifier 200.
- red-tint highlight areas especially for wheatie pennies, wheat stalks.
- Lincoln’s face and hair would be a couple of different areas, also, that would be identified as grading features for the front side of the coin.
- Fig. 5 illustrates the United States one-cent coin with areas identified, representing identifier 400, the grade of coin.
- the context area 4 is the feature area 3 from identifier 300, which is also the context area 2 from identifier 200.
- Context area 4 is the surface of the coin.
- a mode of one identifier becomes a different mode in another identifier.
- the mode happens to switch from context to feature to context as applied to the three different identifiers, since the information contained on the surface of the coin changes roles, depending on the training purpose of the identifier.
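- The role switching across the three coin identifiers can be tabulated in a few lines. The dictionary layout and the `roles_of` helper are assumptions for illustration; the area names and identifier numbers follow the coin example (rim, surface, wheat stalks; identifiers 200, 300 and 400).

```python
# Each identifier maps a mode ("feature" or "context") to a named image area.
identifiers = {
    200: {"purpose": "existence of a coin", "feature": "rim", "context": "surface"},
    300: {"purpose": "type of coin", "feature": "surface"},
    400: {"purpose": "grade of coin", "feature": "wheat stalks", "context": "surface"},
}


def roles_of(area):
    """List (identifier number, mode) pairs in which the area appears."""
    return [
        (num, mode)
        for num, ident in sorted(identifiers.items())
        for mode in ("feature", "context")
        if ident.get(mode) == area
    ]
```

- Calling `roles_of("surface")` shows the same area acting as context, then feature, then context.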
- the grade of the coin is dependent on the clarity of the lines of the two stalks of wheat that are set in relief on the surface of the coin.
- the feature area 5 is selected on the two stalks of wheat.
- the user can have an example picture of a “very fine” coin, where the wheat stalks are very sharp, etc. When the user taps on that picture, the user is able to indicate which training set it needs to go into or, for example, the set that it is going to be tested against. Then, the user does their green and red highlighting, and that highlighting is what is pertinent.
- That information is sent to wherever the training will be done, on a computing device of sufficient computing capacity (e.g., a desktop, a server, or a Mac server of some sort).
- the touch interface is used, or some sort of a touch surface, to identify the aspects that should go into the training set — and also identify the constellation of all these different training sets that the user is working on. That is then collected up, in one example, e.g., by an app, and gets transmitted to a base station, whether that is cloud computing or a server farm or cluster or just a desktop.
- where the images are already in the cluster or desktop, what needs to be transmitted from the touch screen surface are the areas that have been highlighted in red or green and the training sets to which they pertain.
- Example: Crowd Organized Mapping
- In one example of creating a crowd organized map, a person (the user) determines the hierarchy and the ordering of the different items.
- the crowd of users collectively organize the map.
- That category includes cars, trains, boats, bicycles.
- the user places the specific vehicles into a “vehicles” folder or hierarchy or tree. While the user has organized the “vehicles” a certain way, other users may organize in a similar or different way.
- many users in the crowd of users will organize the vehicles in the same way or similar. The crowd itself is coming up with the way in which the objects will be organized when a new user enters the crowd.
- the map could bifurcate: some users prefer to organize the elements one way and some people prefer or need to organize the elements a different way. This could create a “minority report” situation, meaning that new users have a choice of default or initial map to use.
- the overall concept is that, as more and more items are added for placing into hierarchical order and common things are paired together, the power of the crowd does the organizing, rather than one person trying to do it by hand or some computer algorithm trying to do it.
- the self-organized mapping process is replaced by a hybrid human-computer crowd-sourced mapping process.
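- A crowd organized map could, for instance, be derived by letting each user's placement count as a vote. The sketch below is one possible reading, not the described technique itself: majority voting, the `minority_share` threshold, the function name `crowd_map` and the "watercraft" folder are all assumptions. Alternates that clear the threshold stand in for the bifurcated ("minority report") map.

```python
from collections import Counter


def crowd_map(user_maps, minority_share=0.3):
    """For each item, pick the most common parent and report notable alternates.

    user_maps: list of {item: parent} dicts, one per user.
    Returns (default_map, alternates), where alternates[item] lists other
    parents chosen by at least `minority_share` of the users who placed it.
    """
    votes = {}
    for mapping in user_maps:
        for item, parent in mapping.items():
            votes.setdefault(item, Counter())[parent] += 1
    default_map, alternates = {}, {}
    for item, counter in votes.items():
        total = sum(counter.values())
        ranked = counter.most_common()
        default_map[item] = ranked[0][0]
        others = [p for p, n in ranked[1:] if n / total >= minority_share]
        if others:
            alternates[item] = others
    return default_map, alternates


user_maps = [
    {"car": "vehicles", "boat": "vehicles", "bicycle": "vehicles"},
    {"car": "vehicles", "boat": "watercraft", "bicycle": "vehicles"},
    {"car": "vehicles", "boat": "vehicles", "bicycle": "vehicles"},
]
default_map, alternates = crowd_map(user_maps)
```

- A new user would start from `default_map` and could be offered the entries in `alternates` as a choice of initial map.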
- Fig. 6 illustrates a collection of sets-of-data 1000, gathered by the identifier tools.
- An identifier data set 1001 is one or more interpretations of a feature of the data under investigation (for example, a well log, a seismic line, a photo of a collector coin). In one example, these one or more interpretations are collected from one or more diverse sources (for example, from interpreters who are placed around the world).
- identifier 1003 is associated with or otherwise sequentially dependent on identifier 1001. This is indicated by dashed arrow, mapping 1099. These mappings are also collected and used for co-processing and sequencing of training sets and/or execution of trained models.
- For example, at least some of the interpretation (e.g., modes) of identifier 1001 is carried over to identifier 1003.
- a context mode in identifier 1001 is a feature mode in identifier 1003.
- the dashed arrows 1099 indicate the mapping of the identifier data sets to form an organized map.
- Identifier 1003 is associated with identifiers 1004 and 1007, as indicated by the dashed arrows.
- a crowd organized mapping is used.
- Fig. 7 illustrates feeding the identifier data sets from the collection 1000 into training sets of the training set collection 1200.
- Identifier data is selected to train for a desired outcome of a particular training set or sets.
- identifier 1001 feeds its data into training set 1201, and so on with the other corresponding numbers.
- the mapping is constructed between identifiers and training sets.
- identifier 1009 is additionally fed into training set 1203, as shown by the solid arrow 1199.
- the mapping is constructed between training sets.
- results from training set 1202 are additionally fed into training set 1207, as shown by solid arrow 1299.
- a fraction of the identifiers is used for a training set to produce a trained model, with the remaining fraction used to test the trained model.
- the training sets 1200 produce corresponding trained models.
- An organized mapping is applied to manage the co-execution and sequencing of execution of the models when they are applied to a new piece of data that is to be machine interpreted.
- a virtual workstation is constructed to operate on a computing device that includes a display that also receives hand input from the user.
- the user looks at the data being interpreted, creates and / or selects identifiers, inputs to the computing device(s) the areas of the data associated with the identifiers. If associated areas of one identifier are associated with another identifier, sequentially or concurrently, that connection is also captured for use in chaining training sets and resulting trained models (for parallel and cascading execution). The identified areas are used in the construction of training sets and resulting trained models.
- Multiple virtual workstations contribute to pool the identifiers, training sets, and/or resulting models through one or more collecting computing devices, such as servers and neural processing centers.
- identifiers from varied sources are used in the construction of training sets and resulting trained models.
- the associations between identifiers, training sets, and/or resulting models are available for parallel-cascading training and/or execution of resulting trained models.
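The sequencing described above (mappings between training sets used for parallel and cascading execution) can be sketched as a dependency graph. This is a minimal illustration, not the applicant's implementation: the `depends_on` dictionary and the `execution_stages` helper are hypothetical, and the single dependency shown mirrors solid arrow 1299 of Fig. 7 (training set 1202 feeding training set 1207).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key is a training set (or its trained
# model) and each value is the set of training sets whose results it consumes.
# The one edge below mirrors solid arrow 1299 in Fig. 7 (1202 -> 1207).
depends_on = {
    "1201": set(),
    "1202": set(),
    "1203": set(),
    "1207": {"1202"},
}


def execution_stages(graph):
    """Group nodes into stages; nodes in the same stage can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = execution_stages(depends_on)
```

Each stage can be trained or executed concurrently, and a later stage starts only once the stages it depends on have finished, which is one way to read "parallel-cascading" execution.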
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
A computer-assisted method for generating training data to be used in machine learning models by: retrieving, by multiple users, images concerning a same subject to be investigated; generating, by the multiple users, identifiers per each of the images, and the identifiers per each of the images are respectively associated with a different feature being used to classify the subject; registering the identifiers in accordance with respective expert knowledge approbations of the multiple users; collecting, by a common computing infrastructure, the registered identifiers from the multiple users; generating, by the common computing infrastructure, training data to be used in machine learning models based on the collected identifiers.
Description
METHOD AND SYSTEM FOR GENERATING TRAINING DATA TO BE USED IN MACHINE LEARNING MODELS
TECHNICAL FIELD
The invention relates to concepts for generating training data to be used in machine learning models and applications thereof and in particular to a computer-assisted method for generating training data to be used in machine learning models and a computer-assisted system for generating training data to be used in machine learning models. More particularly, examples of the invention provided herein relate to a method and systems for integrating expert knowledge into machine learning models and other artificial intelligence system applications.
BACKGROUND OF THE INVENTION
Geoscience observations and data are well known to be difficult to organize and objectify. Many attempts have been made to create “data standards” for geoscience data, especially for meta-data and especially for geophysical data. These attempts tend to reduce the loading costs for computer programs that process and analyze the data. However, the applicant-inventor finds that these very data standards obscure the inferences hidden within the elements of the recorded data, including the collected data itself and also including the history and timing of the data collection. A well-trained human expert will observe some of these inferences and use their experience when interpreting the data to reach geologic, extraction, and commercial conclusions. A further problem is that the standards-driven data format renders the data in an aggregate form that is associated more with the character of the physical acquisition or geography of the collected data, rather than granular and related to the geo-characteristics of interest (geophysical, petrophysical, geological, or other observed measurement). This results in key geoscience analogies being scattered and buried in a plethora of disparate datasets, scattered around the world and in the hands of many, many different entities. There is a long felt need to gather and make these datasets more available for human and/or machine training. There is a long felt need to have a better way to focus on key observed characteristics and better quantify the relevance and counter-relevance of these observed characteristics. There is a long felt need to have new (or a greater number of) key observed characteristics that are relevant and contribute to increasing true-positive and true-negative indications.
SUMMARY OF THE INVENTION
There may be a demand to provide concepts for more reliable machine learning models.
Such a demand may be satisfied by the subject-matter of the independent claims.
Specifically, such a demand may be satisfied by a computer-assisted method for generating training data to be used in machine learning models. The computer-assisted method comprises retrieving, by multiple users, images concerning a same subject to be investigated. The computer-assisted method comprises generating, by the multiple users, identifiers per each of the images. The identifiers per each of the images are respectively associated with a different feature being used to classify the subject. The computer-assisted method comprises registering the identifiers in accordance with respective expert knowledge approbations of the multiple users. The computer-assisted method comprises collecting, by a common computing infrastructure, the registered identifiers from the multiple users. The computer-assisted method comprises generating, by the common computing infrastructure, training data to be used in machine learning models based on the collected identifiers.
Thus, machine learning models can be more reliable.
The term user can also be referred to as user equipment such as terminals, laptops etc., which the user possesses and which may be dedicated to the user.
The term identifier may be understood as identifying information on the respective image.
The term feature may be understood as characteristic/attribute of the subject or something which belongs inherently to the subject, which itself may be an object to be investigated, such as an entity of which a plurality of images exists.
The expert knowledge approbations may be understood as an expertise or level of expertise which is known beforehand and could be assigned by a central authority. This makes the machine learning models more reliable.
Particularly advantageous configurations can be found in the dependent claims.
The identifiers per each of the images can be different from one another.
Thus, multiple aspects can be gathered and used for the training data.
The images can be retrieved from the common computing infrastructure. The common computing infrastructure may be a central computing system, in particular a server, to all of the users. The multiple users may be situated at different locations apart from a common computing infrastructure, for example all around the world.
A decentralized system of information / sharing of information may thus be reliably provided.
The images retrieved by the users may vary at least partially from one user to another.
Consequently, each user may provide a further piece of information not being provided yet. The machine learning model can thus become more flexible.
The different features may represent different objects, different characteristics, different modes and/or different attributes of the subject.
Thus, the features to be investigated may vary from one another and lead to a more effective and reliable outcome.
The step of collecting may be performed at different timings for different users. Further, the step of registering may be performed before retrieving the images from the common computing infrastructure or after generating the identifiers.
In consequence, flexibility and usability may be increased.
The method may further include periodically updating the training data based on newly added identifiers and already collected identifiers.
Thus, the correctness of the derived training model may be increased from time to time.
The identifiers per each of the images may interdepend from one identifier to another. For example, a feature in one identifier may be a context in another identifier.
This can make the training data more concise and context-oriented.
The training data may be organized as a tree map from an abstract object level defining the subject to be investigated to a specific feature level defining details belonging to the subject to be investigated. The tree map may be organized by the multiple users in a crowd-based manner. Thus, the crowd of users may define the structure of the tree map. Thereby, the interdependence of feature and context from one identifier to another may be constructed.
Thus, the models of training data may be crowd-based and less prone to errors.
The step of generating the identifiers may be performed by using an identifier tool. The identifier tool may be used by each of the multiple users to indicate areas of each of the images as underlying information of a respective identifier. The identifier tool may be used to indicate explicitly areas of interest and areas not of interest. The identifier tool may be used to indicate explicitly areas of interest and areas of context.
Thus, binary information may be provided. This may further reduce the amount of data to be transmitted to the common computing infrastructure and/or stored in the common computing infrastructure.
Each one of the identifiers may include first and second indicators. The first indicator may indicate a feature area of the subject to be investigated. The second indicator may indicate a context area of the subject to be investigated. These two indicators may be regarded as binary information of the identifier.
A reduction in data may therefore be achieved.
Each of the identifiers may include only a subset of information of the corresponding image. Thus, not the whole information of the image is sent per each of the identifiers from the user to the common computing infrastructure, but merely a pointer to the information or part of the image information to which the first and/or second indicators point or are associated with.
This may reduce an information load drastically.
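One way to picture an identifier that carries only a pointer to the image plus its two indicators is the sketch below. The `Identifier` class, its field names, and the pixel-coordinate representation are assumptions made for illustration; the source does not prescribe a data layout.

```python
from dataclasses import dataclass

Pixel = tuple[int, int]  # (row, column); an assumed coordinate convention


@dataclass(frozen=True)
class Identifier:
    """Hypothetical record sent from a user to the common infrastructure."""
    image_id: str  # pointer to the image, not the image itself
    name: str      # e.g. "existence of a coin"
    user_id: str
    feature_area: frozenset[Pixel] = frozenset()  # first indicator
    context_area: frozenset[Pixel] = frozenset()  # second indicator

    def payload_size(self) -> int:
        """Number of distinct pixels referenced by the two indicators."""
        return len(self.feature_area | self.context_area)


ident = Identifier(
    image_id="coin-100",
    name="existence of a coin",
    user_id="user-1",
    feature_area=frozenset({(1, 1), (1, 2)}),
    context_area=frozenset({(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)}),
)
image_pixels = 100 * 100  # assumed size of the full image, for comparison
```

Here the identifier references six pixels of an assumed 10,000-pixel image, which illustrates why sending indicators instead of whole images can reduce the information load.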
The above-mentioned demand is also solved by a computer program. The computer program comprises instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the computer-assisted method as described above.
The above-mentioned demand is also solved by a computer-readable data carrier. The computer-readable data carrier has stored thereon the computer program as described above.
The above-mentioned demand is also solved by a computer-assisted system for generating training data to be used in machine learning models. The system is configured to provide, to multiple users, images concerning a same subject to be investigated. The system is configured to collect, from the multiple users, identifiers. The identifiers are registered in accordance with respective expert knowledge approbations of the multiple users. The identifiers are generated by the multiple users per each of the images. The identifiers are associated with a different feature being used to classify the subject. The system is configured to generate training data to be used in machine learning models based on the collected identifiers.
Even if some of the aspects described above have been described in reference to the computer-assisted method, these aspects may also apply to the computer-assisted system. Likewise, the aspects described above in relation to the computer-assisted system may be applicable in a corresponding manner to the computer-assisted method.
It is clear to a person skilled in the art that the statements set forth herein may be implemented using hardware circuits, software means or a combination thereof. The software means can be related to programmed microprocessors or a general computer, an ASIC (Application Specific Integrated Circuit) and/or DSPs (Digital Signal Processors). For example, the common computing infrastructure, user (equipment), computer-assisted method and computer-assisted system may be implemented partially as a computer, a logical circuit, an FPGA (Field Programmable Gate Array), a processor (for example, a microprocessor, microcontroller (µC) or an array processor)/a core/a CPU (Central Processing Unit), an FPU (Floating Point Unit), NPU (Numeric Processing Unit), an ALU (Arithmetic Logical Unit), a Coprocessor (further microprocessor for supporting a main processor (CPU)), a GPGPU (General Purpose Computation on Graphics Processing Unit), a multi-core processor (for parallel computing, such as simultaneously performing arithmetic operations on multiple main processor(s) and/or graphical processor(s)) or a DSP.
It is further clear to the person skilled in the art that even if the herein-described details will be described in terms of a method, these details may also be implemented or realized in a suitable device, a computer processor or a memory connected to a processor, wherein the memory can be provided with one or more programs that perform the method, when executed by the processor. Therefore, methods like swapping and paging can be deployed.
BRIEF DESCRIPTION OF THE DRAWING
In the following, the invention shall be explained in more detail by means of the embodiment(s) with reference to the attached schematic figure(s). The figure(s) shows (show):
Fig. 1 illustrates a schematic diagram of the general flow and relationship from identified features in the data under investigation to a trained model for machine interpretation.
Fig. 2 illustrates a photograph of a United States one-cent coin 100.
Fig. 3 illustrates the United States one-cent coin with areas identified, representing identifier 200, the existence of a coin.
Fig. 4 illustrates the United States one-cent coin with areas identified, representing identifier 300, the type of coin.
Fig. 5 illustrates the United States one-cent coin with areas identified, representing identifier 400, the grade of coin.
Fig. 6 illustrates a collection of sets-of-data 1000, gathered by the identifier tools.
Fig. 7 illustrates feeding the identifier data sets from the collection 1000 into training sets of the training set collection 1200.
DETAILED DESCRIPTION OF THE INVENTION
Disclosed herein are descriptions of various examples of the invention. Specifically, these examples are introduced in order to ease the understanding of the invention. Hereinafter, different wording/terminology is used in order to ease an understanding of the invention. Nevertheless, it is clear that like features of the summary section above and the detailed description section below, although having different terminology, can be used interchangeably in this disclosure. Thus, like features in the summary section as well as in the detailed description section below may incorporate or be defined by way of the corresponding attributes and/or characteristics as described above and below.
In one example, a user has a user computing machine, for example a user equipment. Examples of a user computing machine include a laptop, a desktop, a surface PC, a touch screen or touch pad, etc.
The user is able to bring up or otherwise see an image representing the data on the user computing machine.
The user rubs the image with an identifier tool, such as a pen (for example, one based on capacitive or inductive techniques), to produce an identifier. In one example, the identifier is for a characteristic that is associated with or can be associated with the data. In one example, used herein for illustration, the identifier is a characteristic that is associated with geoscience data, in a particular example, with a characteristic that is associated with seismic data. In one example, an identifier tool is for indicating an attribute (such as a top of a horizon, flat spot, seismic multiple, geologic time, paleo bug, fault, velocity, fold, source, processing, data shot, data processed, reservoir, source, seal, time, depth, oil, gas, offset, unconformity, pre-stack depth migration, etc.). In one example, the identifier tool is for indicating several modes as (part of the) identifiers: the identified object, the surroundings (context about the object), NOT the object. In one example, the identifier has two other modes: the “cause” of the object, and the “causation” of the object. For example, seismic pre-stack reflections (PSR) are the “cause” of a stack-depth-migrated (SDM) reflection and the stack-depth-migrated (SDM) reflection is the causation of the pre-stack reflections. In other words, the stack-depth-migrated (SDM) reflection was caused from the pre-stack reflections (PSR). A well-log example of this would be a prograding upward sequence, for instance. Other examples include source signature, deconvolution, etc.
In one example, an identifier tool is a touch screen used in collaboration with the finger of the user. In one example, the user selects an identifier by touching a location on the screen to select an identifier for consequent or subsequent use by the identifier tool. In one example, a glyph or icon is shown on a screen to represent an identifier that is available for selection. The touch screen associates an image (in one example, a glyph of a finger or paintbrush or marking pencil or glyph to represent which identifier will be marked using the identifier tool) representing the identifier tool that the user moves and operates using gestures with one or more fingers. In one example, this interactive system enables the user to rub an area, in one example, using finger(s), in one example, using a stylus or mouse or pencil-like device. In one example, the user selects the identifier to use by typing a name for the identifier through a physical or displayed keyboard. The identifier tool is now associated with an identifier and ready for use to mark the area of the image that is to be associated with the identifier. In one example, the user also selects the mode by the identifier tool.
In one example, the data gathered by the identifier tools are uploaded so that the identifiers are actually crowd sourced. In one example, the upload is to a common computing infrastructure, such as a server, for example, a macOS server. The user is “registered” for certain data sets, such as the training data, and/or identifiers. Their registration can also be an attribute. Their registration may depend on their expertise, such as an approbation. For example, one user may be a flat-spot specialist, or a velocity cube specialist, etc., or a particular play specialist. They will contribute certain identifiers, if not all identifiers. In one example, identifiers of others may or may not be shared to other users. In one example, two or more users contributing to the same identifier may or may not see the work of the other. In one example, a user may agree or disagree with the work of the other.
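The registration step can be sketched as a small collector on the common computing infrastructure. This is an assumption-laden illustration: the `Collector` class, its method names, and the policy of rejecting identifiers a user is not registered for are hypothetical choices, since the source only states that registration may depend on expertise and can itself be an attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Collector:
    """Hypothetical server-side store of registrations and contributions."""
    registrations: dict = field(default_factory=dict)
    collected: list = field(default_factory=list)

    def register(self, user_id, approbations):
        # Approbations are the identifier names a user is registered for.
        self.registrations[user_id] = frozenset(approbations)

    def submit(self, user_id, identifier_name, areas):
        approbations = self.registrations.get(user_id, frozenset())
        if identifier_name not in approbations:
            return False  # assumed policy: unregistered work is not collected
        self.collected.append({
            "user": user_id,
            "identifier": identifier_name,
            "areas": areas,
            # The registration travels with the contribution as an attribute.
            "approbations": sorted(approbations),
        })
        return True


server = Collector()
server.register("user-1", {"flat spot"})
accepted = server.submit("user-1", "flat spot", areas=[(10, 12)])
rejected = server.submit("user-1", "velocity cube", areas=[(3, 4)])
```

A softer policy, such as accepting every contribution and only recording the approbation for later weighting, would fit the text equally well.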
A model is built (and periodically rebuilt) using the collection of identifiers (the identifier work) of the users.
In one example, the identifiers are organized using a “crowd organized mapping” technique. In one example, a user can choose a crowd organized map (COM), their own organization, or an edited crowd organized map (COM), or alternate crowd organized maps (COM), such as in the case of binomial trends in the crowd organized maps (COM).
It now can be appreciated that a plethora of training sets or models can be devised, depending on the selection from the identifiers that have been contributed. In one example, the work of a particular user may be purposefully included (or excluded) from making a model. Further, particular attributes or identifiers related to the quality or type of data may be used as selection criteria to purposefully include (or exclude) particular contributed identifiers from making a model.
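The selection described above can be sketched as a simple filter over contributed identifiers. The function name `select_identifiers`, the dictionary keys, and the sample contributions are hypothetical and chosen only to illustrate including or excluding work by user and by data attribute.

```python
def select_identifiers(contributions, include_users=None, exclude_users=(),
                       criterion=None):
    """Return the contributions that should feed a particular training set."""
    selected = []
    for item in contributions:
        if include_users is not None and item["user"] not in include_users:
            continue  # only purposefully included users are kept
        if item["user"] in exclude_users:
            continue  # purposefully excluded users are dropped
        if criterion is not None and not criterion(item):
            continue  # attribute-based selection, e.g. data type or quality
        selected.append(item)
    return selected


contributions = [
    {"user": "user-1", "identifier": "fault plane", "data_type": "3-D"},
    {"user": "user-2", "identifier": "fault plane", "data_type": "2-D"},
    {"user": "user-3", "identifier": "fault plane", "data_type": "3-D"},
]

training_pool = select_identifiers(
    contributions,
    exclude_users={"user-3"},
    criterion=lambda item: item["data_type"] == "3-D",
)
```

Different combinations of filters yield different training sets from the same pool, which is how a plethora of models can arise from one collection of identifiers.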
Depending on the permissions (rights management) an authorized user can have an image processed for one or more identifiers. In one example, this is done at the user device level, such as the user equipment, given that the models can be downloaded (and organized) and applied to any images / data that the user is provided with.
Another aspect is that combinations of identifiers can organize the training data (models) and a combination that is applied. Likewise, in one example, training data (models) can be generated that are “general” or “specialty” sets. In one example, a “fault plane” model uses all fault planes regardless of other identifiers. This would be a “general” set. Or, a “fault plane” model is specific to a time range, bin size, fold, or some other characteristic. This would be a “specific” set.
In one example, the user can “flip through” the models (the sets) so a “general” model may show several fault areas on the data set, but swapping to a more specific model will change what is identified on the image / data. In one example, the flipping can be similar to an animation or a blink comparator, as historically used in astronomy applications.
The image / data displayed on the user equipment with machine learning identified identifiers become an “AI Interpretation”.
In one example, the user can take the AI Interpretation and go into an edit mode — refining or correcting the interpretation. In one example, the edited / refined / corrected interpretation (of the identifiers) is uploaded to the server and used to update or otherwise improve the models.
For an overall illustration, in one example, the mere presence of a well at a location is a piece of intelligence. Someone, at some point in time, felt that that particular well location was worthy of investment, regardless of what outcome actually happened. As one example of a “general” model, one general model takes all wells and high grades - as an identifier - areas that are similar in appearance. If the wells were further differentiated, then the more specific models will provide more differentiated information to the user (examples include, e.g., producing wells, gas versus oil, horizon tops, etc.).
In further examples, a color coding is used to mark the identifier tool rubbed areas based on the mode of the identifier. For example, rubbing a red transparent color as the “not” mode, a green color as the “is” mode, a yellow color as “associated with” mode, an orange color as “associated with not” mode. In further examples, identifiers include inferred rock properties, resultant information (e.g., size of field in barrels), “closure vs. non-closure”. Beyond basic, well known, types of identifiers, users (the crowd) will create new identifiers, especially as industry and technology evolve. In further examples, the specific color assigned to a mode is a matter of choice. For an example, “red” is chosen to represent the target feature mode, while “green” represents the associated mode, such as a context mode.
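Because the color assigned to a mode is a matter of choice, the color coding can be treated as a configurable mapping. The sketch below is illustrative only: the `Mode` enum, the two scheme dictionaries, and the `modes_by_color` helper are hypothetical names, while the color assignments themselves follow the two examples given in the text.

```python
from enum import Enum


class Mode(Enum):
    IS = "is"
    NOT = "not"
    ASSOCIATED_WITH = "associated with"
    ASSOCIATED_WITH_NOT = "associated with not"


# First example in the text: red "not", green "is", yellow and orange for
# the two "associated" modes.
FIRST_SCHEME = {
    Mode.NOT: "red",
    Mode.IS: "green",
    Mode.ASSOCIATED_WITH: "yellow",
    Mode.ASSOCIATED_WITH_NOT: "orange",
}

# Second example in the text: red for the target feature, green for context.
SECOND_SCHEME = {
    Mode.IS: "red",
    Mode.ASSOCIATED_WITH: "green",
}


def modes_by_color(scheme):
    """Invert a scheme so a rubbed color can be decoded back to its mode."""
    inverted = {}
    for mode, color in scheme.items():
        if color in inverted:
            raise ValueError(f"color {color!r} is assigned to two modes")
        inverted[color] = mode
    return inverted
```

Storing the scheme alongside the uploaded identifier would let the collecting infrastructure decode rubbed areas regardless of which colors a given user chose.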
Fig. 1 illustrates a schematic diagram of the general flow and relationship from identified features in the data under investigation to a trained model for machine interpretation. Identifier data set 10 is one or more interpretations of a feature of the data under investigation (for example, a well log, a seismic line, a photo of a collector coin). Identifier 10 has one or more modes associated with the feature or context. A second identifier data set 11 is also shown, representing another feature that has been interpreted. In one example, an identifier data set is gathered from many sources around the world. The mode information of identifier data set 10 is fed into training set 20. In one example, the interdependence of the identifier data sets is carried to the training sets and resulting in trained models. Training set 20 produces a trained model 30. Trained model 30 is used to machine interpret new data.
Example: Seismic Analysis
Here is an example process for doing crowd sourced machine learning creation of training sets, training for machine learning and creating resultant models.
As an example a seismic section is used from the perspective of the geoscience expert. If the user has a seismic section on their touch device, such as a touch pad, there are all kinds of things or indications that the seismic section is revealing. So, there's a cascade of different training sets.
For example, the seismic section that the user looks at may be of a particular fold. It may be a 2-D versus a 3-D. It may have missing shot points or missing receiver locations. These attributes or indications will show up as a peculiar look to the portions of the seismic section. The seismic section may have some near surface static corrections that are needed. The seismic section may show a particular geologic horizon. It may show a particular geologic age. It may show an unconformity. There’s just a plethora of attributes or indications that it could be used for, and each one of those is a potential training set.
Then, for example, one needs to have basically a constellation or a collection of the different training sets that are then used to make an even more definitive identification of what one is looking for in the data.
Consider just one item: the case where there are some missing shot point locations. For example, one can't shoot in the vicinity of an existing well platform or some geographic restriction, and so one has to miss some of the shot point locations. That, of course, will, especially in the near surface, result in some of the near traces being missing. One has a loss of fold in, say, the first second of data. In case of a 2-D line, it will show up as a kind of a wedge or a “V” pointed downward in the seismic section from zero seconds to say one or 1.4 seconds (somewhere in there); and eventually, the problem created by that missing near trace data will disappear and be blended in, deeper in the section.
One gets disruption due to the missed locations. It would be favourable to be able to recognize that. Of course, that is going to be recognizable based on whether it is a 2-D versus a 3-D binned set of data. It will depend on the fold of the data. It will depend on the minimum near offset, and so on and so forth. In general, there's a recognition of that loss of fold.
Example: Seismic Analysis - Feature Area. Context Area.
So, the seismic data on the user’s screen shows a portion of the seismic section where it is trying to “jump across” a well platform. So, that disruption, that loss of fold in the data in the near traces is visually apparent to the geoscience interpreter, the expert. Now, there are two pieces of information that would be valuable for the training set, for creating the machine learning model, the ML model. One of them is the actual portions that are obviously disrupted by the loss of the near trace data.
In one example, the user could approximately rub their finger across those areas on the displayed seismic section and, in one example, give them a semitransparent tinting to tell the training set that this is where the problem is. But there is also the portion of the data on either side of the disrupted area (that is, the area disrupted by the loss of near traces and fold) and beneath it that is a more or less normal seismic representation. That nearby portion of the data puts the disrupted portion into context. So, in one example, the user can picture that larger area as having a light transparent green tinting (for example) and the area that is the actual disruption as having a reddish tinting (for example). In one example, the area that is the actual disruption (the feature, the red tinted area) would be a subset of the overall green area, so the green area will encapsulate the red area. So, those two pieces of highlighting tell the machine learning that it does not have to look at the entire image that is displayed on the screen; it just needs the green context area and the red feature area, the feature in this example being the disruption area.
In further examples, these steps can be done with other features, data attributes. For instance, if the seismic is six-fold land 2-D data, the user can just basically send the entire data set without even needing to highlight it with their finger. The user can essentially send the entire data set over to each of those three training sets, for them to learn on that seismic line. The indications or attributes of the seismic (especially for land six-fold data) can be recognized by the chevron “V’s” near surface where one goes from one-fold to six-fold with depth. Alternatively, the user can identify visually that it is a low-fold section because of the chevron “V’s”. So, one way would be to take the first half-second to one-second of data and highlight it as a feature in a reddish color and maybe take the next half second down and highlight it in green. In this example, it does not matter much what the deeper section looks like. The deeper section is not that pertinent to identifying that attribute or object of the seismic data (e.g., the amount of fold).
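The idea that only the green context area and the red feature area need to be examined can be sketched as a cropping step. The helper names `bounding_box` and `extract_training_patch`, the nested-list image, and the label strings are assumptions for illustration; the sketch also assumes the example in which the feature area lies entirely inside the context area.

```python
def bounding_box(pixels):
    """Smallest rectangle (top, left, bottom, right) covering the pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def extract_training_patch(image, feature, context):
    """Crop the image to the context area and label each cropped pixel."""
    if not feature <= context:
        raise ValueError("feature area must lie inside the context area")
    top, left, bottom, right = bounding_box(context)
    patch = [row[left:right + 1] for row in image[top:bottom + 1]]
    labels = [
        [
            "feature" if (r, c) in feature
            else "context" if (r, c) in context
            else "ignore"
            for c in range(left, right + 1)
        ]
        for r in range(top, bottom + 1)
    ]
    return patch, labels


# Toy 4 x 5 "seismic section"; each value encodes its own row and column.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
context = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)}  # green area
feature = {(1, 2)}                                           # red area

patch, labels = extract_training_patch(image, feature, context)
```

Only the 2 x 3 patch and its labels would be passed on to a training set; the rest of the displayed image is never examined.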
Seismic Example - General Sets. Specific Sets
In one example, the data gathered for an identifier (comprising information on the “objects”, features, attributes, or indications) are ranked or otherwise given an indication of quality. In one example, the indication of quality or ranking of the objects are context- sensitive. This means that a particular object has a ranking or quality indication for each set that for which it is a member. For example, a general set of all fault planes will have fault plane objects. A given fault plane object will have a ranking or quality indication within that general set. That same fault plane object, for example, is further identified as a fault plane associated with a producing growth fault. That fault plane object is also a member of a specific set — that of fault plane objects associated with producing growth faults. The same fault plane object in the general set has a quality indication and also has a quality indication in the specific set. The object can have a different quality or ranking in each of the sets, and may even have a dissonant ranking. For example, the fault plane object may be highly ranked in quality in the general set of all fault planes (an “exemplary” object within that set) but have a lower quality ranking in the specific set of fault planes associated with producing growth faults. This seeming dissonance or incongruence can be important in recognizing patterns in scientific data and images.
In one example, the set-specific ranking of objects opens up the ability to improve neural processing that uses the context-sensitive ranking to recognize object features. For example, a context of “growth fault” or “producing” may use the context-sensitive ranking to treat high-ranking objects as exemplary for purposes of self-training. Likewise, low-ranking objects can be used as de-classifiers in the self-training process.
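The set-specific ranking described above can be pictured with a small Python sketch. It is illustrative only: the class name `RankedObject`, the function `split_by_rank`, the numeric scores, and the thresholds are assumptions made for this example and do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class RankedObject:
    name: str
    # Hypothetical quality score per set the object belongs to (higher is better).
    rank_by_set: dict = field(default_factory=dict)


def split_by_rank(objects, set_name, high=0.8, low=0.3):
    """Return (exemplars, declassifiers) using only the rankings of one set."""
    members = [o for o in objects if set_name in o.rank_by_set]
    exemplars = [o.name for o in members if o.rank_by_set[set_name] >= high]
    declassifiers = [o.name for o in members if o.rank_by_set[set_name] <= low]
    return exemplars, declassifiers


# fault_A is exemplary in the general set but ranks low in the specific set,
# which is the "dissonant ranking" case described in the text.
faults = [
    RankedObject("fault_A", {"all_fault_planes": 0.95, "producing_growth_faults": 0.20}),
    RankedObject("fault_B", {"all_fault_planes": 0.60, "producing_growth_faults": 0.90}),
    RankedObject("fault_C", {"all_fault_planes": 0.10}),
]

general = split_by_rank(faults, "all_fault_planes")
specific = split_by_rank(faults, "producing_growth_faults")
```

Under these assumed scores, the same object serves as an exemplar in one context and as a de-classifier in the other, because each set is consulted independently.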
Example: Grading Collector Coins.
While seismic geoscience interpretation is highly specialized, it is appropriate to illustrate the approach more clearly through a more commonly understood scenario. This is done by way of a very different example: artificial intelligence grading of collector’s coins. As can be appreciated, coin grading is an example used for clarity. In the geosciences, the examples would include analysis of geoscientific data, such as well logs, seismic data, cores, field maps, etc. Take photographing pennies as an example. One training set would be just the photograph, identifying that there is a circular object. That could serve simply to determine whether the existence of the coin can be identified or not, and also to identify the sizing of the coin, the boundaries of the coin, and the rim (basically, the circle). If we are talking about collector’s coins in general, of course the identification of a penny would rely heavily on the color aspect, the copper color of the coin versus nickels and dimes and quarters and so on. So, the user will have these features to deal with.
Example: Grading Collector Coins - Parallel and Cascading Sets and Models
The user then also has a further interesting aspect to consider. Once the user has identified the circular aspect, the user could also possibly identify what coin it is. Of course, that would take more than color, because the model would need to distinguish a Liberty dime from a Roosevelt dime, for instance. Then, another aspect is whether or not the picture of that particular coin is out of focus, which would render its identification impossible or very difficult. And so, the user would want a training set that would simply flag whether or not the picture is in good enough focus to use. As can be appreciated, in one example, the outcome of initial trained model(s) determines the next model(s) to apply in the course of identifying and grading a coin.
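A minimal sketch of such cascading follows, in which the outcome of each model decides whether and which model runs next. The function `grade_photo`, the model names, and the stand-in lambda "models" are hypothetical placeholders for trained models; they are not part of the disclosure.

```python
def grade_photo(photo, models):
    """Apply models in sequence; each outcome decides which model runs next."""
    if not models["in_focus"](photo):
        return {"status": "rejected", "reason": "out of focus"}
    if not models["coin_present"](photo):
        return {"status": "rejected", "reason": "no coin found"}
    coin_type = models["coin_type"](photo)
    grader = models["graders"].get(coin_type)  # type-specific grading model
    if grader is None:
        return {"status": "identified", "type": coin_type}
    return {"status": "graded", "type": coin_type, "grade": grader(photo)}


# Stand-in models: simple functions over a dict of precomputed measurements.
models = {
    "in_focus": lambda p: p["sharpness"] >= 0.5,
    "coin_present": lambda p: p["has_circle"],
    "coin_type": lambda p: p["type"],
    "graders": {
        "wheat_penny": lambda p: "very fine" if p["wheat_line_clarity"] > 0.7 else "good",
    },
}

sharp_penny = {"sharpness": 0.9, "has_circle": True,
               "type": "wheat_penny", "wheat_line_clarity": 0.8}
blurry = {"sharpness": 0.1}
dime = {"sharpness": 0.9, "has_circle": True, "type": "roosevelt_dime"}
```

In this sketch the focus check acts as a gate, so a blurry photograph never reaches the type or grading models.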
So, there are a few things. One is identifying that a coin is present. Two is identifying its boundaries, like the circle. Then, there is a possibly simultaneous or related identification of what kind of coin it is. That may be broken down into other training sets, such as “do we know what denomination it is” or “is it an American coin or a foreign coin”, or, for example, sets that handle the many different quarters that look different but are the same size and basically the same color.
The user might have a differentiator whether it suspects or detects that the coin is possibly silver, such as the situation where a 1964 quarter looks very different from a 1965 quarter in terms of appearance, the glistening appearance of it. The dull silver color is quite unique.
For the penny, the penny is normally graded based on distinguishing features, such as how well the wheat stalks on the back are visible. In other words, there are these little lines that cartoon the stalks of wheat. If those lines are nice and crisp and sharp, then that coin is in a very fine or better condition. If those lines are non-existent, worn smooth, then the coin is in fair or good condition. The details of the features of Lincoln's face on the obverse, the front side of the coin — the more detail of his hair and his cheek bones and such is an indicator of the quality.
Using Modes to Chain Training Sets and Application of Trained Models
Fig. 2 illustrates a photograph of a United States one-cent coin 100.
When the user takes something like a penny, or a coin in general, and wants to identify first that the coin exists as a round object in the picture, they would take their finger and smear, for example, a red transparent shading around the rim of the coin. The user would do that to many different pictures of coins. The training set would then be able to acknowledge that there is a round coin there, or at least a round circle. With a little modification, in one example, the user would probably also want to use the rim as a registration, such that when the model identifies that round boundary, it knows that what is inside of that boundary is pertinent. For example, the user smears with their finger what is inside of that boundary in green. What is outside of that round boundary is not pertinent. This allows the user to be less concerned about the scaling of the size of the coin in the picture. So, the user can use the rim to define the analysis phase.
Fig. 3 illustrates the United States one-cent coin with areas identified, representing identifier 200, the existence of a coin. In this example, the rim around the coin is identified as feature 1 and the area inside the rim is identified as context 2. Feature 1 and context 2 represent modes of identifier 200.
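As a rough illustration of an identifier with a feature mode and a context mode, the sketch below builds the two areas for identifier 200 from an idealized circle. The `Identifier` class, the pixel-set representation, and the geometric construction are assumptions for this example; in the described workflow the areas come from the user's finger highlighting rather than from a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    number: int
    purpose: str
    feature: frozenset  # pixels highlighted as the feature mode (e.g., red)
    context: frozenset  # pixels highlighted as the context mode (e.g., green)


def coin_existence_identifier(cx, cy, radius, rim_width, size):
    """Rim band -> feature 1; area inside the rim -> context 2."""
    feature, context = set(), set()
    for y in range(size):
        for x in range(size):
            d = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if d <= radius - rim_width:
                context.add((x, y))
            elif d <= radius:
                feature.add((x, y))
            # Pixels outside the rim are not pertinent and are left out.
    return Identifier(200, "existence of a coin",
                      frozenset(feature), frozenset(context))


ident_200 = coin_existence_identifier(cx=10, cy=10, radius=8, rim_width=2, size=21)
```

Because only the rim and the area it encloses are recorded, the same structure applies whatever the size of the coin in the photograph.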
In order to identify a penny, the user would be highlighting the area that's inside the circle. In general, except for a 1943 steel penny, most pennies would be identified by the color. In one example, the user may also want to have a separate training set or a bifurcated training set that also tends to identify the penny by the different faces, for example, Lincoln’s face, the Lincoln Memorial, the Wheat Stalks, the Indian on the Indian Head penny, the big letters “ONE CENT” on many of the one-cent pieces, and so on and so forth.
Fig. 4 illustrates the United States one-cent coin with areas identified, representing identifier 300, the type of coin. Feature area 3 is the surface of the coin, which contains type-identifying information. In this example, feature 3 mode is the same as the context 2 mode that was discerned in the making of identifier 200. Context area 2 of identifier 200 is now assigned to also be the feature area 3 of identifier 300. This indicates that there is a mapping between the two identifiers. In this example, proceeding with identifying the type of coin for identifier 300 is dependent on the successful identification of the existence of a coin from identifier 200.
For grading the coin, the user would then highlight the grading areas in a red tint, in particular the wheat stalks on wheat pennies (“wheaties”). Of course, Lincoln’s face and hair would be a couple of additional areas that would be identified as grading features for the front side of the coin.
Fig. 5 illustrates the United States one-cent coin with areas identified, representing identifier 400, the grade of coin. In this example, the context area 4 is the feature area 3 from identifier 300, which is also the context area 2 from identifier 200. Context area 4 is the surface of the coin. In this example, a mode of one identifier becomes a different mode in another identifier. In this particular example, the mode happens to switch from context to feature to context as applied to the three different identifiers, since the information contained on the surface of the coin changes roles, depending on the training purpose of the identifier. In this example, the grade of the coin is dependent on the clarity of the lines of the two stalks of wheat that are set in relief on the surface of the coin. Thus, the feature area 5 is selected on the two stalks of wheat.
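The mode hand-off across identifiers 200, 300, and 400 can be sketched as a list of mappings that copy an area from one identifier's mode into another identifier's mode. The dictionary layout, the tiny pixel sets, and the function `apply_mappings` are hypothetical and serve only to illustrate the idea.

```python
# Tiny stand-in pixel sets; real areas would come from the user's highlighting.
surface = frozenset({(1, 1), (1, 2), (2, 1), (2, 2)})  # area inside the rim
rim = frozenset({(0, 0), (0, 3)})
wheat = frozenset({(1, 1), (2, 1)})                     # the two wheat stalks

identifiers = {
    200: {"purpose": "existence of a coin", "feature": rim, "context": surface},
    300: {"purpose": "type of coin", "feature": None, "context": None},
    400: {"purpose": "grade of coin", "feature": wheat, "context": None},
}

# (source identifier, source mode) -> (target identifier, target mode)
mappings = [
    ((200, "context"), (300, "feature")),
    ((300, "feature"), (400, "context")),
]


def apply_mappings(identifiers, mappings):
    """Carry areas across identifiers, in the listed order."""
    for (src_id, src_mode), (dst_id, dst_mode) in mappings:
        identifiers[dst_id][dst_mode] = identifiers[src_id][src_mode]
    return identifiers


apply_mappings(identifiers, mappings)
```

The same surface area thus plays the context role in identifier 200, the feature role in identifier 300, and the context role again in identifier 400, without the user highlighting it three times.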
Example: Grading Collector Coins - Concurrently Developing Multiple Sets and Models
There are different basic grades, say, “poor”, “fair”, “good”, “very good”, “fine”, “extra fine”, “uncirculated”. Instead of using a word such as “very fine” as the handle to the identifier, in this example, the user can have an example picture of a “very fine” coin, where the wheat stalks are very sharp, etc. When the user taps on that picture, the user indicates which training set the data needs to go into, or, for example, the set that it is going to be tested against. Then, the user goes ahead and does their green and red highlighting. That green and red highlighting is what is pertinent. That information gets sent for training on a computing device of sufficient computing capacity (e.g., a desktop, a server, or a Mac server of some sort). So, essentially, the touch interface, or some sort of touch surface, is used to identify the aspects that should go into the training set, and also to identify the constellation of all the different training sets that the user is working on. That is then collected, in one example, by an app, and transmitted to a base station, whether that is cloud computing, a server farm, a cluster, or just a desktop. In one example, the images are already in the cluster or desktop; what needs to be transmitted from the touch screen surface are the areas that have been highlighted in red or green and the training sets to which they pertain.
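One way to picture what the touch device transmits is a small payload that carries the highlighted areas and the target training sets, but not the image itself. The field names, the JSON encoding, and the function `build_payload` are assumptions for this sketch; the disclosure does not specify a wire format.

```python
import json


def build_payload(image_id, training_sets, red_strokes, green_strokes):
    """Package highlighted areas and their target training sets for upload.

    The image is assumed to be on the base station already, so only its
    identifier is sent along with the highlighted point lists.
    """
    return json.dumps(
        {
            "image_id": image_id,
            "training_sets": sorted(training_sets),
            "feature_areas": red_strokes,    # red highlighting
            "context_areas": green_strokes,  # green highlighting
        },
        sort_keys=True,
    )


payload = build_payload(
    image_id="penny_0042",
    training_sets={"grade:very_fine", "type:wheat_penny"},
    red_strokes=[[[12, 30], [13, 31], [14, 33]]],  # one stroke of [x, y] points
    green_strokes=[[[5, 5], [25, 5], [25, 25]]],
)
decoded = json.loads(payload)
```

Keeping the payload to image identifier, areas, and set handles reflects the point in the text that only the highlighting and its destination need to leave the touch surface.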
Example: Grading Collector Coins - Synthetic Imaging of Trained Model
Now, in one example, for a “very fine” coin, while the user could supply an image of a representative coin in very fine condition, another possibility is for the training set (or resulting trained model) to generate the exemplary or representative image that it has perceived as meaning “very fine”. That is probably, ultimately, going to be an important aspect of all of this because, if the neural network is able to produce a representation of what it has learned (in other words, this is what a representative “very fine” coin looks like), then that output can be compared against a new coin coming in.
Example: General and Specific Training Sets and Models
Also, in real human analysis, it is not just a single image that represents “very fine”. The brain probably thinks of several images that might represent “very fine” condition, all varying slightly from one another. Likewise, in terms of missing fold in a seismic section, the user knows what that looks like, but that does not mean that every single seismic section is going to look exactly the same. Depending on the type of seismic section, the conditions, who shot it, in what year, and so on, the missing fold will appear somewhat differently. So, the user may have in their head many different representative images that are all similar, but not the same. In real life, the user would be looking at a new seismic section and comparing it against several different images in their head of what they would expect missing fold to look like. An example is having to lose some traces because it was not possible to shoot around an existing production platform.
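Comparing a new sample against several representatives per class, rather than a single one, can be sketched as a nearest-representative lookup. The two-element feature vectors, the grade labels, and the function `closest_grade` are hypothetical; the disclosure does not prescribe a particular distance measure, and Euclidean distance is used here only as a simple stand-in.

```python
import math


def closest_grade(sample, representatives):
    """Return (grade, distance) for the nearest representative of any grade."""
    best = None
    for grade, examples in representatives.items():
        # Each grade is represented by several slightly different examples.
        d = min(math.dist(sample, ex) for ex in examples)
        if best is None or d < best[1]:
            best = (grade, d)
    return best


# Hypothetical features: (wheat line sharpness, facial detail), both in 0..1.
representatives = {
    "very fine": [(0.90, 0.80), (0.85, 0.90), (0.95, 0.75)],
    "good": [(0.30, 0.40), (0.25, 0.35)],
}

grade, distance = closest_grade((0.88, 0.86), representatives)
```

A sample only needs to be close to one of the representatives of a grade, which mirrors the idea of holding several similar but non-identical images in mind.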
Example: Crowd Organized Mapping
In one example of creating a crowd organized map, a person (the user) determines the hierarchy and the ordering of the different items. In technical terms, the crowd of users collectively organizes the map. To illustrate, an example can be organizing “vehicles”. That category includes cars, trains, boats, and bicycles. The user places the specific vehicles into a “vehicles” folder or hierarchy or tree. While the user has organized the “vehicles” a certain way, other users may organize them in a similar or different way. In practice, many users in the crowd will organize the vehicles in the same or a similar way. The crowd itself comes up with the way in which the objects will be organized when a new user enters the crowd. In one example, the map could bifurcate: some users prefer to organize the elements one way and others prefer or need to organize them a different way. This could create a “minority report” situation, meaning that new users have a choice of default or initial map to use. In essence, the overall concept is that as more and more items are added for placing into hierarchical order and common things are paired together, the power of the crowd does that work, rather than one person trying to do it by hand or some computer algorithm trying to do it. Thus, the self-organized mapping process is replaced by a hybrid human-computer crowd-sourced mapping process.
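The crowd-organized map, including the “minority report” case, can be approximated by counting where each user places each item. The function `crowd_map`, the majority rule, and the 25% minority threshold are assumptions for this sketch, not a mechanism stated in the disclosure.

```python
from collections import Counter


def crowd_map(placements, minority_share=0.25):
    """Derive a consensus parent per item, plus sizeable minority alternatives."""
    votes = {}
    for user_map in placements:
        for item, parent in user_map.items():
            votes.setdefault(item, Counter())[parent] += 1
    consensus, alternatives = {}, {}
    for item, counter in votes.items():
        total = sum(counter.values())
        ranked = counter.most_common()
        consensus[item] = ranked[0][0]  # most frequent placement
        alternatives[item] = [
            parent for parent, n in ranked[1:] if n / total >= minority_share
        ]
    return consensus, alternatives


# Four users each place items under a parent folder of their choosing.
placements = [
    {"car": "vehicles", "bicycle": "vehicles"},
    {"car": "vehicles", "bicycle": "vehicles"},
    {"car": "vehicles", "bicycle": "sports equipment"},
    {"car": "vehicles", "bicycle": "vehicles"},
]

consensus, alternatives = crowd_map(placements)
```

A new user could be offered the consensus map by default, with any recorded alternative available as a second starting point.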
Example: Organized Mapping of Identifiers, Training Sets and Resulting Trained Models
Fig. 6 illustrates a collection of sets-of-data 1000, gathered by the identifier tools. An identifier data set 1001, for example, is one or more interpretations of a feature of the data under investigation (for example, a well log, a seismic line, a photo of a collector coin). In one example, these one or more interpretations are collected from one or more diverse sources (for example, from interpreters who are placed around the world). As illustrated, identifier 1003 is associated with or otherwise sequentially dependent on identifier 1001. This is indicated by dashed arrow, mapping 1099. These mappings are also collected and used for co-processing and sequencing of training sets and/or execution of trained models. For example, at least some of the interpretation (e.g., modes) of identifier 1001 are carried over to identifier 1003. For example, a context mode in identifier 1001 is a feature mode in identifier 1003. The dashed arrows 1099 indicate the mapping of the identifier data sets to form an organized map. Identifier 1003 is associated with identifiers 1004 and 1007, as indicated by the dashed arrows. Identifier 1008, in turn, associates with identifier 1007. In one example, a crowd organized mapping is used.
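The use of mappings for sequencing can be sketched as a dependency graph that is sorted into an execution order. The direction of each arrow below is an assumption made for this example (the text states only that 1003 depends on 1001; the direction of the other associations is not specified), and the use of Python's standard `graphlib` module is an illustrative choice.

```python
from graphlib import TopologicalSorter

# identifier -> identifiers it is assumed to depend on.
# Only 1003 -> 1001 is stated in the text; the rest are assumed directions.
depends_on = {
    1003: {1001},
    1004: {1003},
    1007: {1003},
    1008: {1007},
}


def execution_order(depends_on):
    """Return an order in which every identifier follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(depends_on)
position = {identifier: i for i, identifier in enumerate(order)}
```

Identifiers with no dependency between them, such as 1004 and 1007 under these assumptions, could be processed concurrently, while dependent ones run in sequence.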
Fig. 7 illustrates feeding the identifier data sets from the collection 1000 into training sets of the training set collection 1200. Identifier data is selected to train for a desired outcome of a particular training set or sets. In one example, there is a one-to-one correspondence mapping between the identifiers and the training sets. For example, identifier 1001 feeds its data into training set 1201, and so on with the other corresponding numbers. (For clarity, the one-to-one feeding from identifiers to training sets is not shown, but can be imagined as stippled arrows going from each identifier to its corresponding training set.) In one example, the mapping is constructed between identifiers and training sets. As illustrated, in this example, identifier 1009 is additionally fed into training set 1203, as shown by the solid arrow 1199. In one example, the mapping is constructed between training sets. As illustrated, in this example, results from training set 1202 are additionally fed into training set 1207, as shown by solid arrow 1299. In one example, it can be appreciated that a fraction of the identifiers is used for a training set to produce a trained model, with the remaining fraction used to test the trained model. The training sets 1200 produce corresponding trained models. An organized mapping is applied to manage the co-execution and sequencing of execution of the models when they are applied to a new piece of data that is to be machine interpreted.
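The feeding of identifiers into training sets, and the holding back of a fraction for testing, can be sketched as follows. The numbering offset of 200, the 80/20 split, the fixed seed, and the function names are assumptions for this example; the text says only that a fraction is used for training and the remainder for testing.

```python
import random


def build_feeds(identifier_ids, extra_feeds):
    """Map each training set to the identifiers that feed it."""
    # One-to-one feeding: identifier 1001 -> training set 1201, and so on.
    feeds = {identifier + 200: [identifier] for identifier in identifier_ids}
    # Additional feeds, such as identifier 1009 into training set 1203.
    for identifier, training_set in extra_feeds:
        feeds.setdefault(training_set, []).append(identifier)
    return feeds


def split_identifiers(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then split into training and test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


feeds = build_feeds(range(1001, 1010), extra_feeds=[(1009, 1203)])

# Ten hypothetical interpretations collected for one identifier.
interpretations = [f"interp_{i}" for i in range(10)]
train, test = split_identifiers(interpretations)
```

Each interpretation ends up in exactly one of the two portions, so the trained model is tested only on data it was not trained on.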
As can be appreciated, the workflow steps herein described are reduced to practice through computer code that is executed on one or more computing devices. In one example, a virtual workstation is constructed to operate on a computing device that includes a display that also receives hand input from the user. The user looks at the data being interpreted, creates and/or selects identifiers, and inputs to the computing device(s) the areas of the data associated with the identifiers. If associated areas of one identifier are associated with another identifier, sequentially or concurrently, that connection is also captured for use in chaining training sets and resulting trained models (for parallel and cascading execution). The identified areas are used in the construction of training sets and resulting trained models. Multiple virtual workstations contribute to pool the identifiers, training sets, and/or resulting models through one or more collecting computing devices, such as servers and neural processing centers. In a crowd-sourced application, identifiers from varied sources are used in the construction of training sets and resulting trained models. In a crowd-sourced application, the associations between identifiers, training sets, and/or resulting models are available for parallel-cascading training and/or execution of resulting trained models.
Although the present invention is described herein with reference to a specific preferred embodiment(s), many modifications and variations therein will readily occur to those with ordinary skill in the art. Accordingly, all such variations and modifications are included within the intended scope of the present invention as defined by the reference numerals used.
From the description contained herein, the features of any of the examples, especially as set forth in the claims, can be combined with each other in any meaningful manner to form further examples and/or embodiments.
The foregoing description is presented for purposes of illustration and description, and is not intended to limit the invention to the forms disclosed herein. Consequently, variations and modifications commensurate with the above teachings and the teaching of the relevant art are within the spirit of the invention. Such variations will readily suggest themselves to those skilled in the relevant structural or mechanical art. Further, the embodiments described are also intended to enable others skilled in the art to utilize the invention and such or other embodiments and with various modifications required by the particular applications or uses of the invention.
Claims
1. A computer-assisted method for generating training data to be used in machine learning models, comprising: retrieving, by multiple users, images concerning a same subject to be investigated; generating, by the multiple users, identifiers per each of the images, and the identifiers per each of the images are respectively associated with a different feature being used to classify the subject; registering the identifiers in accordance with respective expert knowledge approbations of the multiple users; collecting, by a common computing infrastructure, the registered identifiers from the multiple users; generating, by the common computing infrastructure, training data to be used in machine learning models based on the collected identifiers.
2. The computer-assisted method according to claim 1, characterized in that the identifiers per each of the images are different from one another.
3. The computer-assisted method according to claim 1 or 2, characterized in that the images are retrieved from the common computing infrastructure, which is a central computing system, in particular a server, to all of the users, and that the multiple users are situated at different locations apart from a common computing infrastructure.
4. The computer-assisted method according to any one of the foregoing claims, characterized in that the images retrieved by the users vary at least partially from one user to another.
5. The computer-assisted method according to any one of the foregoing claims, characterized in that the different features represent different objects, different characteristics, different modes and/or different attributes of the subject.
6. The computer-assisted method according to any one of the foregoing claims, characterized in that the step of collecting is performed at different timings for different users, and that the step of registering is performed before retrieving the images from the common computing infrastructure or after generating the identifiers.
7. The computer-assisted method according to any one of the foregoing claims, characterized in that the method further includes periodically updating the training data based on newly added identifiers and already collected identifiers.
8. The computer-assisted method according to any one of the foregoing claims, characterized in that the identifiers per each of the images interdepend from one identifier to another.
9. The computer-assisted method according to any one of the foregoing claims, characterized in that the training data is organized as a tree map from an abstract object level defining the subject to be investigated to a specific feature level defining details belonging to the subject to be investigated, and that the tree map is organized depending on the multiple users in a crowdly manner.
10. The computer-assisted method according to any one of the foregoing claims, characterized in that the step of generating the identifiers is performed by using an identifier tool, which is used by each of the multiple users to indicate areas of each of the images as underlying information of a respective identifier.
11. The computer-assisted method according to any one of the foregoing claims, characterized in that each one of the identifiers includes first and second indicators, and that the first indicator indicates a feature area of the subject to be investigated and the second indicator indicates a context area of the subject to be investigated.
12. The computer-assisted method according to any one of the foregoing claims, characterized in that each of the identifiers includes only a subset of information of the corresponding image.
13. A computer program, characterized in that the computer program product comprises instructions which, when the program is executed by a computer, cause the computer to carry out the steps of the computer-assisted method of any one of claims 1 to 12.
14. A computer-readable data carrier, characterized in that the computer-readable data carrier has stored thereon the computer program of claim 13.
15. A computer-assisted system for generating training data to be used in machine learning models, and the system is configured to: provide, to multiple users, images concerning a same subject to be investigated; collect, from the multiple users, identifiers registered in accordance with respective expert knowledge approbations of the multiple users, the identifiers being generated by the multiple users per each of the images, and associated with a different feature being used to classify the subject; and generate training data to be used in machine learning models based on the collected identifiers.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202063028546P | 2020-05-21 | 2020-05-21 | |
US63/028,546 | 2020-05-21 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2021237156A1 (en) | 2021-11-25 |
WO2021237156A4 (en) | 2022-01-13 |
Family
ID=76601705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2021/033756 WO2021237156A1 (en) | 2020-05-21 | 2021-05-21 | Method and system for generating training data to be used in machine learning models |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021237156A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115689793A (en) * | 2023-01-05 | 2023-02-03 | 翌飞锐特电子商务(北京)有限公司 | Interactive account checking method based on calculation model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10025950B1 (en) * | 2017-09-17 | 2018-07-17 | Everalbum, Inc | Systems and methods for image recognition |
US20190383965A1 (en) * | 2017-02-09 | 2019-12-19 | Schlumberger Technology Corporation | Geophysical Deep Learning |
-
2021
- 2021-05-21 WO PCT/US2021/033756 patent/WO2021237156A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2021237156A4 (en) | 2022-01-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21734585 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21734585 Country of ref document: EP Kind code of ref document: A1 |