NZ624212B - System and Method for Incrementally Replicating Investigative Analysis Data - Google Patents
System and Method for Incrementally Replicating Investigative Analysis DataInfo
- Publication number
- NZ624212B NZ624212B NZ624212A NZ62421214A NZ624212B NZ 624212 B NZ624212 B NZ 624212B NZ 624212 A NZ624212 A NZ 624212A NZ 62421214 A NZ62421214 A NZ 62421214A NZ 624212 B NZ624212 B NZ 624212B
- Authority
- NZ
- New Zealand
- Prior art keywords
- replication
- plan
- data
- chunk
- incremental
- Prior art date
Links
- 238000000034 method Methods 0.000 title claims abstract description 53
- 238000004458 analytical method Methods 0.000 title description 72
- 230000003362 replicative effect Effects 0.000 title description 10
- 230000010076 replication Effects 0.000 claims abstract description 196
- 230000008859 change Effects 0.000 claims abstract description 95
- 238000004891 communication Methods 0.000 description 19
- 230000008569 process Effects 0.000 description 18
- 239000013598 vector Substances 0.000 description 12
- 238000010586 diagram Methods 0.000 description 7
- 230000005540 biological transmission Effects 0.000 description 6
- 238000007726 management method Methods 0.000 description 6
- 238000013499 data model Methods 0.000 description 5
- 230000003287 optical effect Effects 0.000 description 5
- 238000013500 data storage Methods 0.000 description 4
- 238000012217 deletion Methods 0.000 description 4
- 230000037430 deletion Effects 0.000 description 4
- 230000008676 import Effects 0.000 description 4
- 230000008520 organization Effects 0.000 description 4
- 230000001419 dependent effect Effects 0.000 description 3
- 230000000737 periodic effect Effects 0.000 description 3
- 230000004044 response Effects 0.000 description 3
- 230000006870 function Effects 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 230000006855 networking Effects 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- 240000005020 Acaciella glauca Species 0.000 description 1
- RYGMFSIKBFXOCR-UHFFFAOYSA-N Copper Chemical compound [Cu] RYGMFSIKBFXOCR-UHFFFAOYSA-N 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 230000001149 cognitive effect Effects 0.000 description 1
- 238000012937 correction Methods 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 230000007123 defense Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 239000000835 fiber Substances 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 235000003499 redwood Nutrition 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2358—Change logging, detection, and notification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/273—Asynchronous replication or reconciliation
Abstract
method performed by one or more computing devices of an exporting system with a database and a logical clock used to order changes to the database is disclosed. The method comprises responsive to receiving a command to execute an incremental replication job, determining a current value of the logical clock, tracking changes to data objects by creating change records reflecting an order of the changes in a revisioning database, determining a number of incremental replication chunks to divide the incremental replication job into, and creating and persistently storing an incremental replication plan for the incremental replication job and based on the incremental replication plan, executing the incremental replication job. An exporting system to perform this method is also disclosed. cal clock, tracking changes to data objects by creating change records reflecting an order of the changes in a revisioning database, determining a number of incremental replication chunks to divide the incremental replication job into, and creating and persistently storing an incremental replication plan for the incremental replication job and based on the incremental replication plan, executing the incremental replication job. An exporting system to perform this method is also disclosed.
Description
SYSTEM AND METHOD FOR INCREMENTALLY REPLICATING INVESTIGATIVE
ANALYSIS DATA
FIELD
A system and method are disclosed which generally relate to replication of database
data, and more particularly to a system and method of incrementally replicating investigative
analysis data.
GENERAL BACKGROUND
Mere reference to background art herein should not be construed as an admission that
such art constitutes common general knowledge in relation to the invention.
Making investigative decisions, especially those that have the potentially to impact
lives and communities, requires access to up-to-date and accurate investigative information.
Unfortunately, investigative information is often spread across multiple databases, computers,
geographies, and clearance levels. For investigative organizations such as intelligence, defense,
and law enforcement organizations to be successful, they need ways to share and find
information quickly so that critical decisions can be made in time for them to have impact.
One complication to sharing investigative data between investigative teams is that
some of teams may be located in geographic locations where network connectivity is unreliable
or impractical. For example, a forward deployed military unit may have only periodic access to a
satellite-based network. Thus, solutions for sharing data that presume highly-available network
connectivity may be inadequate or inefficient.
Currently, there exist commercial software products for replicating database data
between distributed database instances. These software products, for example, allow an
administrator to export database data from a first database instance, copy the exported database
data to a second database instance, and once copied, import the exported database data into the
second database instance. This process of replicating database data can be tedious, time-
consuming, or unreliable, especially when the data network connecting the first and second
instances is unreliable and the amount of exported database data is large.
SUMMARY
According to one aspect of the present invention, there is provided a method
performed by one or more computing devices of an exporting system, the exporting system
comprising a revisioning database and a logical clock used to order changes to the revisioning
database, the method comprising:
tracking changes to data objects by creating, in the revisioning database, change records
comprising logical clock values obtained from the logical clock reflecting an order of the
changes;
responsive to receiving a command to execute an incremental replication job,
determining a current value of the logical clock, determining a number of incremental replication
chunks to divide the incremental replication job into, and creating and persistently storing an
incremental replication plan for the incremental replication job;
wherein the plan comprises the current value of the logical clock and the determined
number of replication chucks the incremental replication job is divided into,
wherein each of the number of replication chunks indicates one or more data objects in
the revisioning database that are included in the replication chunk;
based on the incremental replication plan, executing the incremental replication job
including, for each replication chunk of the plan:
collecting, from the revisioning database, change data for the data objects included in the
chunk based at least in part on the current value of the logical clock in the plan,
exporting the change data collected for the data objects included in the chunk to an
importing system, and
marking the chunk in the plan as complete responsive to successfully exporting the
change data collected for the data objects included in the chunk to the importing system.
According to another aspect of the present invention, there is provided an exporting
system comprising:
a revisioning database;
a logical clock used to order changes to the revisioning database;
means for tracking changes to data objects by creating, in the revisioning database,
change records comprising logical clock values obtained from the logical clock reflecting an
order of the changes;
means, responsive to receiving a command to execute an incremental replication job, for
determining a current value of the logical clock, for determining a number of incremental
replication chunks to divide the incremental replication job into, and for creating and persistently
storing an incremental replication plan for the incremental replication job;
wherein the plan comprises the current value of the logical clock and the determined
number of replication chucks the incremental replication job is divided into;
wherein each replication chunk in the plan indicates one or more data objects in the
revisioning database that are included in the replication chunk;
means for executing the incremental replication job based on the incremental replication
plan including, for each replication chunk of the plan, means for collecting, from the revisioning
database, change data for the data objects included in the chunk based at least in part on the
current value of the logical clock in the plan, for exporting the change data collected for the data
objects included in the chunk to an importing system, and for marking the chunk in the plan as
complete responsive to successfully exporting the change data collected for the data objects
included in the chunk to the importing system.
BRIEF DESCRIPTION OF THE DRAWINGS
By way of example, reference will now be made to the accompanying drawings.
is a block diagram illustrating a distributed computer system for incrementally
replicating investigative analysis data.
[0009a] is a block diagram illustrating an investigative analysis computer system.
illustrates an object-centric data model.
illustrates a table of a revisioning database.
is a flowchart illustrating a computer-implemented process for incrementally
replicating investigative analysis data.
is a block diagram illustrating an incremental replication plan.
is a flowchart illustrating a computer-implemented process for executing an
incremental replication job.
is a flowchart illustrating a computer-implemented process for importing a
replication chunk of an incremental replication job.
is a block diagram of a computer system on which embodiments of the present
invention may be implemented.
DETAILED DESCRIPTION
Introduction
A system and method of incrementally replicating investigative analysis data is
disclosed herein. In one embodiment, the term “investigative analysis data” generally refers to
any database data meaningful to an investigative organization. Investigative analysis data
includes, but is not limited to, database data that represents people, places, things, events,
documents, media, notes, properties, taken alone and in any combination thereof.
In one embodiment, a method and system are disclosed for incrementally replicating
investigative analysis data from an exporting investigative analysis system (“exporting system”)
to an importing investigative analysis system (“importing system”). The exporting system and
the importing system may be operatively coupled by an unreliable data network such as a data
network with high latency, low bandwidth, and/or intermittent or periodic availability.
In some embodiments, the exporting system receives a user command to establish an
incremental data replication relationship with the importing system and responds by creating an
incremental data replication plan. The plan represents an incremental data replication job that is
to be performed by the exporting system to replicate, to the importing system, changes to
investigative data objects that the exporting system knows about that the exporting system
determines the importing system does not yet know about. The number of changes to replicate
may be large (e.g., on the order of hundreds of millions). For example, it may have been some
time since a last replication exchange between exporting system and the importing system during
which a large number of data objects changes were made by the exporting system. Among other
information, the plan specifies the number of replication chunks that the incremental replication
job is broken down into. After the plan is created, the exporting system separately exports each
replication chunk to the importing system according to a user-configurable exportation schedule.
For example, the exportation schedule can be configured to accommodate periodic data network
availability or to avoid exporting replication chunks during peak usage times of the exporting or
importing systems. Depending on the exportation schedule and the number of replication chunks,
performance of the incremental replication job can span minutes, hours, days, or longer. During
performance of the incremental replication job, the exporting system can continue to make
changes to data objects. This is facilitated by an always increasing logical clock maintained at
the exporting system that provides a total ordering for all data object changes made by the
exporting system. When creating the incremental replication plan, the exporting system records a
current logic clock value in the plan. The recorded value represents the most recent data object
change that will be included in a replication chunk of the replication job (“maximum export
logical clock value”). When exporting replication chunks of the replication job, the exporting
system includes only data object changes associated with logic clock values that are less than or
equal to the maximum export logical clock value. By doing so, the exporting system provides a
consistent “snapshot” view of data object changes to the importing system in the replication
chunks. At the same time, the exporting system can make additional data object changes without
affecting this consistent view. Such additional data object changes can be replicated to the
importing system in a subsequent replication job.
Distributed Investigative Analysis Computer System
Figure 1 illustrates a system for incrementally replicating investigative analysis data.
A distributed investigative analysis computer system 10 includes an exporting investigative
analysis computer system (“exporting system”) 12 and an importing investigative analysis
computer system (“importing system”) 14. In replication examples described below,
investigative analysis data is exported by investigative analysis computer system 12 and
imported by investigative analysis computer system 14. Hence, investigative analysis computer
system 12 is labeled the “exporting system” and investigative analysis computer system 14 is
labeled the “importing system”. However, the examples could just have easily involved
investigative analysis data exported by investigative analysis computer system 14 and imported
by investigative analysis computer system 12. Further, investigative analysis computer system 12
can also function as an importing system and investigative analysis computer system 14 can also
function as an exporting system. Thus, investigative analysis computer system 12 can import
investigative analysis data as well as export investigative analysis data. Similarly, investigative
analysis computer system 14 can export investigative analysis data as well import investigative
analysis data.
In distributed investigative analysis system 10, exporting system 12 and importing
system 14 may be operatively coupled to each other by unreliable data network 16. Data network
16 may be unreliable in the sense that it is only periodically or intermittently available (i.e., not
highly-available), has high network communication latency, and/or has low network
communication bandwidth. For example, data network 16 may be unreliable in that a user would
find it frustrating or impractical to use for purposes of surfing the Internet.
Investigative Analysis Computer System
illustrates an investigative analysis computer system 100 which may be used
as exporting system 12 and importing system 14 in distributed investigative analysis computer
system 10. That is, an instance of investigative analysis computer system 100 may be used as
exporting system 12 and another separate instance of investigative analysis computer system 100
may be used as importing system 14. Both instances may be operatively coupled to each other by
unreliable data network 16.
Investigative analysis computer system 100 includes one or more analyst clients 102,
one or more analysis servers 104, and a revisioning database 106. Clients 102 connect to analysis
servers 104 to conduct various investigative analysis and management operations on
investigative analysis data stored in revisioning database 106. Investigative analysis operations
include commanding analysis servers 104 to create, read, update, and delete investigative
analysis data stored in revisioning database 106. Management operations include configuring
analysis servers 104 for incremental data replication as described in hereinafter.
In some embodiments, investigative analysis and management operations are
conducted by users of clients 102 through a graphical user interface (GUI) or web browser-based
user interface presented at clients 102. Such presentation may be driven by analysis servers 104,
for example, through delivery of user interface and investigative analysis data according to
standardized networking protocols and presentation formats such as the HyperText Transfer
Protocol (HTTP), the Secure HyperText Transfer Protocol (HTTPS), the HyperText Markup
Language (HTML), Cascading Style Sheets (CSS), JavaScript, etc. In other embodiments,
operations are conducted by users through a command line interface (CLI) available at clients
102 or on servers 104.
In one embodiment, the system 100 is embodied in a single computing device such as
a laptop computer. In another embodiment, the system 100 is embodied in multiple computing
devices such as one or more personal or workstation computing devices for the analysts’ clients
102, one or more server computing devices for the analysis servers 104, and one or more server
computing devices for the revisioning database 106. In some embodiments, one of the exporting
system 12 or the importing system 14 is embodied in a single computing device such as a laptop
computer and the other is embodied in multiple computing devices. This embodiment may
represent a situation in which, for example, investigative analysis data is being shared between a
team of analysts at a hub location such as a central office within the organization and an analyst,
or team of analysts, in the field such as at a forward operating location.
Object-Centric Data Model
Investigative analysis data stored in revisioning database 106 may be conceptually
stored and organized according to an object-centric data model. illustrates an object-
centric conceptual data model 108. Model 108 is centered on the notion of a data object 110. At
the highest level of abstraction, a data object 110 is a container for information representing
things in the world. For example, a data object 110 can represent an entity such as a person, a
place, an organization, or other noun. A data object 110 can represent an event that happens at a
point in time or for a period of time. A data object 110 can represent a document or other
unstructured data source such as an e-mail message, a news report, or a written paper or article.
These are just some example of what a data object 110 can represent. A data object 110 may be
associated with a unique identifier that uniquely identifies the data object to the investigative
analysis computer system 100 among other data objects 110. A data object 110 may also have a
type (e.g., Person, Event, or Document) and a display name which may be the value of a
particular property of the data object 110.
A data object 110 may have one or more properties 112. A property 112 is an
attribute of a data object 110 that represents an individual data item. A property 112 may have a
type and a value. Different types of data objects 110 may have different types of properties 112.
For example, a Person data object 110 might have an Eye Color property and an Event data
object 110 might have a Date property. In one embodiment, the set of data object types and the
set of property types for each type of data object supported by the investigative analysis system
100 are defined according to a pre-defined, user-defined, or dynamically-defined ontology or
other hierarchical structuring of knowledge through sub-categorization of object types and
property types according to their relevant and/or cognitive qualities. In addition, data model 108
may support property multiplicity. In particular, a data object 110 may be allowed to have more
than one property 112 of the same type. For example, a Person data object might have multiple
Address properties or multiple Name properties.
A link 114 represents a connection between two data objects 110. In one
embodiment, the connection is either through a relationship, an event, or through matching
properties. A relationship connection may be asymmetrical or symmetrical. For example, Person
data object A may be connected to Person data object B by a Child Of relationship (where
Person data object B has an asymmetric Parent Of relationship to Person data object A), a Kin Of
symmetric relationship to Person data object C, and an asymmetric Member Of relationship to
Organization data object X. The type of relationship between two data objects may vary
depending on the types of the data objects. For example, Person data object A may have an
Appear In relationship with Document data object Y or have a Participate In relationship with
Event data object E. As an example of an event connection, two Person data objects may be
connected by an Airline Flight data object representing a particular airline flight if they traveled
together on that flight, or by a Meeting data object representing a particular meeting if they both
attended that meeting. In one embodiment, when two data objects are connected by an event,
they are also connected by relationships, in which each object has a specific relationship to the
event, such as, for example, an Appears In relationship. As an example of a matching properties
connection, two Person data objects representing a brother and a sister, may both have an
Address property that indicates where they live. If the brother and the sister live in the same
home, then their Address properties likely contain similar, if not identical information. In one
embodiment, a link 114 between two data objects may be established based on similar or
matching properties of the data objects. The above are just some examples of the types of
connections that may be represented by a link 114 and other types of connections may be
represented. Thus, it should be understood that embodiments of the invention are not limited to
any particular types of connections between data objects 110. For example, a document might
contain two different tagged entities. A link 114 between two data objects 110 may represent a
connection between these two entities through their co-occurrence within the same document.
A data object 110 can have multiple links 114 with another data object 110 to form a
link set 116. For example, two Person data objects representing a husband and a wife could be
linked through a Spouse Of relationship, a matching property (Address), and an event
(Wedding).
Revisioning Database System
Investigative analysis computer system 100 employs a revisioning database system
for tracking changes made to investigative analysis data stored in revisioning database 106. In
some embodiments, the revisioning database system is implemented by analysis servers 104 as
an application on top of a conventional database management system (not shown). For example,
the database management system may be a relational database management system such as those
commercially available from the Oracle Corporation of Redwood Shores, California and the
Microsoft Corporation of Redmond, Washington.
In one aspect, the revisioning database system differs from other types of database
systems in that the revisioning database system is capable of answering a query about the state of
investigative analysis data stored in revisioning database 106 at a point in time in the past as
opposed to only being able to answer a query about the current state of the investigative analysis
data. With the revisioning database system, investigative analysts can determine when a
particular piece of data was added or edited in revisioning database 106. Thus, the revisioning
database system, as a result of its capability to track changes to investigative analysis data stored
in the revisioning database 106, enables investigative analysts to determine what was known
when.
In one embodiment, revisioning database system is capable of tracking all changes
made to investigative analysis data over a period of time. To do so, the revisioning database
system creates a new database change record in revisioning database 106 for every creation, edit,
or deletion of a data object 110, property 112, or link 114, thereby creating a historical record of
all changes. To track the ordering of the changes, the revisioning database system employs an
always increasing logical clock that models all of the changes as a linear sequence of database
events. The logical clock provides a total ordering for all changes. In addition, the logical clock
provides atomicity for changes as multiple changes can occur at the same point in the linear
sequence of database events represented by the logical clock (and hence be associated with the
same logical clock value).
illustrates tracking changes made to revisioning database 106 by an example
and according to an embodiment of the invention. As shown in each record 118, 120,
122, 124, and 126 in table 128 represents a creation, edit, or deletion of a data object 110 or a
creation, edit, or deletion of a property 122 of a data object 110. The fields of each change record
include a 'obj_comp_id' field identifying the data object 110 or property 112 that was created,
edited, or deleted by the change, an 'obj_id' field identifying the data object 110 that was created,
edited, or deleted by the change, a 'logical_clk' field that identifies the order of the change in a
total ordering of all changes made to revisioning database 106, a 'deleted' field indicating
whether the change was a deletion of a data object 110 or a property 112, and a '<values'> field
indicating, for changes that create or edit a value, the value that resulted from the change or, for
changes that delete a value, the value that was deleted.
For example, referring to at logical clock event 1, a data object 110 of type
"Person" was created. Also at logical clock event 1, a "Name" property of the data object was
created and given the value "John Smith". Later, at logical clock event 2, a "Phone #" property
of the object was created and given the value "4151234". At logical clock event 3, the
"Name" property of the object that was created at logical clock event 2 was edited with the value
"Jonathan Smith". At logical clock event 4, the "Phone #" property that was created at logical
clock event 3 was deleted. As a result of the changes at logical clock events 1, 2, and 3, the state
of the object at logical clock event 4 is an object 110 of type "Person" with the property "Name"
having a value "Jonathan Smith".
By preserving all changes made to an object 110 in the form of change records, the
revisioning database system is able to provide the state of an object 110 at a point in time in the
past. For example, referring again to it can be seen from change records 118, 120, and
122 that the state of the object with obj_id = 10 at logical clock event 2 was an object 110 of type
"Person" with a property "Name" having a value "John Smith" and a property "Phone#" having a
value "4151234".
Note that while table 128 contains change records for only one data object with an
identifier of 10, table 128 could contain change records for multiple data objects.
illustrates but one example scheme that the revisioning database system could
employ to track changes to revisioning database 106. However, the invention should not be
construed as being limited to only the one example scheme or be construed as requiring all
details of the one example scheme. For example, instead of storing change records for all data
objects in a single table as depicted in the change records might be stored across multiple
tables. Further, the change records may contain other fields that are not depicted in For
example, each change record may have an additional version field that serves as a single primary
key for the change record as opposed to using a combination of the 'obj_comp_id' and the
'logical_clk' fields as the primary key.
Process for Incremental Data Replication
Fig. 5 is a process for incrementally replication investigative data from an exporting
system to an importing system. The process may be performed by one or more computing
devices. For example, the process may be performed by one or more analysis servers 104 of
exporting system 20.
At step 502, the exporting system receives a command from a user to execute an
incremental replication job. The command may be provided by the user through a graphical user
interface such as a graphical user interface presented at an analyst client 102, for example.
Alternatively, the command may be provided by the user through a command line interface at an
analyst client 102 or at an analysis server 104, as some examples.
In some embodiments, the command includes a specification of an identifier of the
importing system that the exporting system is to export investigative analysis data to. The
specification can be any identifier that the exporting system can use to identity the importing
system. For example, the identifier can be a network address, domain name, or assigned
identifier of the importing system.
In some embodiments, the command includes a specification of a replication chunk
size. The specification can be a number that represents the maximum number of replication
chunks to divide the incremental data replication job into. Alternatively, the specification can be
a number that represents the maximum number of data objects to include change data for in a
replication chunk of the incremental data replication job. As yet another possible alternative, the
specification can be a maximum number of replication chunks to divide the incremental
replication job into. As used herein, the term “change data” refers broadly to data representing a
change to a data object. Change data can include the data of the change itself (e.g., the values
that were created, edited, or deleted) and any associated metadata. Such metadata may include
information representing the version of the change and may include, for example, logical clock
values and vector clock information for determining causality of the change with respect to other
changes made to the data object at the importing system.
The command may also include a specification of an exportation schedule. The
specification may include a start time when the exporting system is to begin execution of the
incremental data replication job. For example, the user may specify a start time that is in the
middle of the night or other time when the exporting system or the importing system is not being
heavily used. As another example, the start time may correspond to when network connectivity
between the exporting system and the importing system is expected to be available. For example,
if the network connectivity is satellite-based, then the start time may correspond to when the
satellite is in range of the exporting system or the importing system.
At step 504, the exporting system creates an incremental data replication plan for the
incremental data replication job. The plan may be stored persistently such as in revisioning
database 106 or other non-volatile data storage medium so that it is not lost in the event of power
failure or other failure of the exporting system. By persistently storing the plan, the exporting
system can resume the incremental data replication job from the stored plan after a failure. For
example, if some but not all of the chunks were successfully exported or all chunks were
successfully exported but not all chunks were successfully received by the importing system, the
missing or failed chunks can be exported individually. Accordingly, in some embodiments, a
received command to execute an incremental replication job specifies one or more particular
chunks to export. The exporting system then exports the specified chunks based on the
previously stored plan.
Example Incremental Data Replication Plan
Fig. 6 is a block diagram illustrating an incremental data replication plan according to
an embodiment of the present invention. As mentioned, data representing an incremental
replication plan may be persistently stored in non-volatile data storage media, for example, as
one or more rows in one or more database tables. During job execution, data representing an
incremental replication plan may also be stored in volatile computer memory, for example, as
one or more data structures or software objects.
As shown, plan 130, representing an incremental data replication job, includes a
unique plan identifier 132, a snapshot time 134, the number 136 of replication chunks the job is
divided into, an identifier 136 of the importing system, and one or more specifications 138A-N
of the replication chunk, one for each of the number 136 of replication chunks.
Plan identifier 132 may be any identifier that the exporting system and importing
system can use to refer to or identify the corresponding replication job represented by the plan
130.
Snapshot time 134 is a current logical clock value from exporting system’s logical
clock used by the exporting system to provide a total ordering of changes to data objects made
by the exporting system. Snapshot time 134 may be obtained from the logical clock in response
to receiving the command to execute the incremental data replication job for which plan 130 is
created. By recording snapshot time 134 in plan 130, ongoing changes can be made to
investigative analysis data by the exporting system without affecting which changes will be
included in the incremental data replication job.
As mentioned, peer system identifier 138 is an identifier of the importing system that
exporting system will be exporting changes to in the incremental replication job represented by
plan 130.
As mentioned, the incremental data replication job is divided into the number 136 of
replication chunks based on the replication chunk size information specified in the command to
execute the job. Plan 130 also includes a replication chunk specification 140 for each of the
number 136 of replication chunks. Each replication chunk specification 140 includes a chunk
identifier 142, a minimum data object identifier 144, a maximum data object identifier 146, and a
complete flag 148.
In some embodiments, the replication chunks of the job represented by plan 130 are
ordered. The chunk identifier 142 indicates the order of the corresponding replication chunk. For
example, the chunk identifier 142 can be an ordinal number such as 1, 2, 3, etc.
Minimum data object identifier 144 specifies the lowest valued identifier of all data
objects for which change data will be included in the corresponding replication chunk. Maximum
data object identifier 146 specifies the highest value identifier of all data objects for which
change data will be included in the corresponding replication chunk.
Complete flag 148 is used by the exporting system to track if the corresponding
replication chunk has been exported. Complete flag 148 is initially set to zero, false, or other
similar value. After the corresponding replication chunk has been successfully exported, which
may or may not be after the importing system has imported or even received the replication
chunk, the exporting system sets the complete flag 148 to one, true, or other similar value.
Return to the process of Fig. 5, at step 506, the exporting system executes the
incremental replication job according to the plan created at step 504. The exporting system may
start execution at the start time specified in the command received at step 502. If no start time is
specified in the command, then the exporting system may start execution of the job immediately
or in response to a future triggering of the job. For example, after the plan is created at step 504,
a user may interact with a graphical user interface or a command line interface to command the
exporting system to start execution of the job. As an alternative to a user interacting with a user
interface, a script or cron job may be used to trigger execution of the job at a point in the future
after the plan is created at step 504.
Example Incremental Replication Job Execution
Fig. 7 is a flow diagram illustrating a process performed by the exporting system for
executing an incremental replication job from an incremental replication plan according to an
embodiment of the invention. The process illustrated in Fig. 6 may be performed as part of step
506 of the process illustrated in Fig. 5 for performing an incremental data replication. The
process illustrated in Fig. 6 may be performed by one or more computing devices. For example,
the process illustrated in Fig. 6 may be performed by one or more analysis servers 104 of the
exporting system.
As shown steps 702, 704, and 706 are performed for each replication chunk
specification in the incremental data replication plan. For example, step 702, 704, and 706 may
be performed by the exporting system for each replication chunk specification 140 in a plan 130.
Further, steps 702, 704, and 706 are performed for each replication chunk specification in the
order of their assigned replication chunk identifiers. For example, steps 702, 704, and 706 may
be performed by the exporting system for each replication chunk specification 140 in a plan 130
in order of the respective chunk identifiers 142.
At step 702, change data for the data objects included in the current replication chunk
are collected from the revisioning database at the exporting system. Generally, this involves the
exporting system reading change records from the revisioning database corresponding to changes
to data objects included in the replication chunk that are associated with logical clock values that
are less than or equal to the snapshot time recorded in the incremental data replication plan. For
example, for the current replication chunk specification 140, exporting system may read all
records from table 128 where obj_id is greater than or equal to minimum data object identifier
144 and obj_id is less than or equal to maximum data object identifier 146 and where logical_clk
is less than or equal to snapshot time 134. This filter may be further refined to exclude change
records that exporting system “knows” the importing system has already received. Such
knowledge by exporting system can be based on previous replication exchanges between the
exporting system and the importing system. For example, the importing system may provide
version vector information and/or acknowledgement vector information to the exporting system
in such replication exchanges that indicate the version of investigative analysis data the
importing system has in its revisioning database at the time of the exchanges. Note the
knowledge the exporting system has about the importing system may be out-of-date at the time
the exporting system executes the incremental replication job. This may be because the
importing system has received and imported change data from other investigative analysis
systems since the last replication exchange between the exporting system and the importing
system. In this case, the exporting system may send change data for data object changes in the
incremental data replication that the importing system has previously received and considered. In
some embodiments, the importing system simply discards the duplicate change data.
In some embodiments, change data collected for inclusion in a replication chunk
includes change data for revisioning database entities that depend on multiple data objects
depend on that multiple data objects depend on. For example, a link 114 that is changed may
depend on two data objects 110. As another example, multiple data objects 110 may depend on a
data source. In this context, a data source represents a file, data base, a document, or other source
of information that backs one or more data objects 110 and provides a lineage for the source of
information that forms one or more data objects 110. In some embodiments, where change data
such as for a link depends on change data for one or more data objects to be exported as part of
the replication job, the dependent change data is included in same chunk as the chunk that
includes the last of the one or more data objects that are depended on. In other embodiments, the
dependent change data is included in the last chunk of the replication job. In some embodiments,
where change data such as for a data source is depended on by one or more data objects to be
exported as part of the replication job, the depended upon change data is included in the same
chunk as the chunk that includes the first of the one or more dependent data objects. In other
embodiments, the depended upon change data is included in the first chunk of the replication job.
At step 704, the change data collected at step 702 is written to a local file. For
example, the file may be one stored on analysis servers 104. As well as the change data, the file
may include other incremental replication data such as replication metadata. The replication
metadata may include version vector information and access control information. The contents of
the file may also be encrypted for security.
At step 706, the exporting system sends the file written to in step 704 to the importing
system. The exporting system may send the file to the importing system using any suitable
reliable network transport protocol such as the Transmission Control Protocol/Internet Protocol
(TCP/IP).
As an alternative to writing the collected change data to a local file (step 704) and
then sending the file (step 706) to the importing system, the exporting system can stream the
change data and associated replication metadata to the importing system over the network as it is
collected without first writing the change data and associated replication metadata to a local file.
In some embodiments, the exporting system includes a “local ack” vector clock as
part of the replication metadata included in the last (highest ordered) replication chunk exported
as part of the job. In the parlance of causality in distributed systems, the local ack vector clock
“happens after” all data object changes in all replication chunks of the job. The lock ack vector
clock indicates to the importing system that once the importing system has imported all
replication chunks in their specified order the importing system has seen all data object changes
that the exporting system knows about up to a given vector clock indicating by the lock ack
vector clock.
Example Process for Importing a Replication Chunk
Fig. 8 is a flow diagram illustrating a process performed by the importing system for
importing a replication chunk according to an embodiment of the invention. The process
illustrated in Fig. 8 may be performed by one or more computing devices. For example, the
process illustrated in Fig. 8 may be performed by one or more analysis servers 104 of the
importing system.
At step 802, the importing system receives a replication chunk of an incremental
replication job from the exporting system. For example, the importing system may receive the
chunk from the exporting system over a data network or from portable physical media (e.g., a
USB drive or a flash drive) physically transported from the exporting system to the importing
system. Among other information including change data, the chunk contains an identifier of the
incremental replication job that the chunk is a part of. In addition, the chunk contains a chunk
identifier. For example, the received chunk may contain plan identifier 132 from the
corresponding replication plan 130 created by the exporting system and chunk identifier 142
from the corresponding replication chunk specification 140. The identifier of the job may be
used by the importing system to determine the replication job to which the received chunk
belongs to. The identifier of the chunk may be used by the importing system to determine the
order of the received chunk in the sequence of chunks the exporting system has broken the job
into.
In some embodiments, the exporting system assigns the first chunk in the sequence of
chunks of a replication job an initial ordinal such as the number 1. Each chunk thereafter is
assigned the next ordinal relative to the ordinal assigned to previous chunk. For example, the
second chunk in the sequence can be assigned the number 2, the third chunk in the sequence
assigned the number 3, and so on. An alternative ordinal sequence could be ‘A’, ‘B’, ‘C’, ‘D’…,
for example.
In some embodiments, the exporting system includes “final chunk” information in the
last chunk in the sequence of chunks to indicate to the importing system that the chunk is the last
chunk of the job. For example, the final chunk information can be a flag or other data that
indicates to the importing system that the chunk having the final chunk information is the last
chunk of the job.
At step 804, the importing system imports the replication chunk received at step 802
into the revisioning database at the importing system. This importing includes incorporating the
change data contained in the replication chunk into the revisioning database. Importing the
change data may include performing causality detection between the change data in the chunk
and change data already stored in the revisioning database. In particular, change data in the
chunk and existing change data in the revisioning database at the importing system may be
versioned with version vectors (also known as “vector clocks”) suitable for detecting causality
relationships between the change data in the replication chunk and corresponding change data in
the revisioning database. In particular, the version vectors may be used to determine whether
change data in the chunk “happened after”, “happened before”, or neither “happened after” nor
“happened before” corresponding change data in the revisioning database. Change data in the
chunk is incorporated into the revisioning database at the importing system if the change data
“happened after” the corresponding change data in the revisioning database. Change data in the
chunk is not incorporated into the revisioning database at the importing system if the change data
“happened before” the corresponding change data in the revisioning database. If change data in
the chunk neither “happened after” nor “happened before” the corresponding change data in the
revisioning database, then a conflict exists between the change data in the chunk and the
corresponding change in the revisioning database. In some scenarios, the conflict is
automatically resolved by the importing system. In other scenarios, a user must manually resolve
the conflict. In both cases, the resolution of the conflict is incorporated into the revisioning
database after the conflict is resolved. More information on the “happened before” and the
“happened after” relations between events in a distributed computing system can be found in a
paper by Leslie Lamport entitled “Time, Clocks and the Ordering of Events in a Distributed
System”, Communications of the ACM, 21(7), pp. 558-565 (1978), the entire contents of which
is hereby incorporated by reference as if fully set forth herein.
Example Implementing Mechanism
According to one embodiment, the techniques described herein are implemented by
one or more special-purpose computing devices. The special-purpose computing devices may be
hard-wired to perform the techniques, or may include digital electronic devices such as one or
more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs)
that are persistently programmed to perform the techniques, or may include one or more general
purpose hardware processors programmed to perform the techniques pursuant to program
instructions in firmware, memory, other storage, or a combination. Such special-purpose
computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom
programming to accomplish the techniques. The special-purpose computing devices may be
desktop computer systems, portable computer systems, handheld devices, networking devices or
any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example, illustrates a computer system upon which one or more
embodiments may be implemented. Computer system 900 includes a bus 902 or other
communication mechanism for communicating information, and a hardware processor 904
coupled with bus 902 for processing information. Hardware processor 904 may be, for example,
a general purpose microprocessor.
Computer system 900 also includes a main memory 906, such as a random access
memory (RAM) or other dynamic storage device, coupled to bus 902 for storing information and
instructions to be executed by processor 904. Main memory 906 also may be used for storing
temporary variables or other intermediate information during execution of instructions to be
executed by processor 904. Such instructions, when stored in non-transitory storage media
accessible to processor 904, render computer system 900 into a special-purpose machine that is
customized to perform the operations specified in the instructions.
Computer system 900 further includes a read only memory (ROM) 908 or other static
storage device coupled to bus 902 for storing static information and instructions for processor
904. A storage device 910, such as a magnetic disk, optical disk, or solid-state drive is provided
and coupled to bus 902 for storing information and instructions.
Computer system 900 may be coupled via bus 902 to a display 912, such as a cathode
ray tube (CRT), for displaying information to a computer user. An input device 914, including
alphanumeric and other keys, is coupled to bus 902 for communicating information and
command selections to processor 904. Another type of user input device is cursor control 916,
such as a mouse, a trackball, or cursor direction keys for communicating direction information
and command selections to processor 904 and for controlling cursor movement on display 912.
This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a
second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 900 may implement the techniques described herein using
customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic
which in combination with the computer system causes or programs computer system 900 to be a
special-purpose machine. According to one embodiment, the techniques herein are performed by
computer system 900 in response to processor 904 executing one or more sequences of one or
more instructions contained in main memory 906. Such instructions may be read into main
memory 906 from another storage medium, such as storage device 910. Execution of the
sequences of instructions contained in main memory 906 causes processor 904 to perform the
process steps described herein. In alternative embodiments, hard-wired circuitry may be used in
place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store
data and/or instructions that cause a machine to operate in a specific fashion. Such storage media
may comprise non-volatile media and/or volatile media. Non-volatile media includes, for
example, optical disks, magnetic disks, or solid-state drives, such as storage device 910. Volatile
media includes dynamic memory, such as main memory 906. Common forms of storage media
include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or
any other magnetic data storage medium, a CD-ROM, any other optical data storage medium,
any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM,
NVRAM, any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission
media. Transmission media participates in transferring information between storage media. For
example, transmission media includes coaxial cables, copper wire and fiber optics, including the
wires that comprise bus 902. Transmission media can also take the form of acoustic or light
waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or
more instructions to processor 904 for execution. For example, the instructions may initially be
carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can
load the instructions into its dynamic memory and send the instructions over a telephone line
using a modem. A modem local to computer system 900 can receive the data on the telephone
line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red
detector can receive the data carried in the infra-red signal and appropriate circuitry can place the
data on bus 902. Bus 902 carries the data to main memory 906, from which processor 904
retrieves and executes the instructions. The instructions received by main memory 906 may
optionally be stored on storage device 910 either before or after execution by processor 904.
Computer system 900 also includes a communication interface 918 coupled to bus
902. Communication interface 918 provides a two-way data communication coupling to a
network link 920 that is connected to a local network 922. For example, communication interface
918 may be an integrated services digital network (ISDN) card, cable modem, satellite modem,
or a modem to provide a data communication connection to a corresponding type of telephone
line. As another example, communication interface 918 may be a local area network (LAN) card
to provide a data communication connection to a compatible LAN. Wireless links may also be
implemented. In any such implementation, communication interface 918 sends and receives
electrical, electromagnetic or optical signals that carry digital data streams representing various
types of information.
Network link 920 typically provides data communication through one or more
networks to other data devices. For example, network link 920 may provide a connection through
local network 922 to a host computer 924 or to data equipment operated by an Internet Service
Provider (ISP) 926. ISP 926 in turn provides data communication services through the world
wide packet data communication network now commonly referred to as the “Internet” 928. Local
network 922 and Internet 928 both use electrical, electromagnetic or optical signals that carry
digital data streams. The signals through the various networks and the signals on network link
920 and through communication interface 918, which carry the digital data to and from computer
system 900, are example forms of transmission media.
Computer system 900 can send messages and receive data, including program code,
through the network(s), network link 920 and communication interface 918. In the Internet
example, a server 930 might transmit a requested code for an application program through
Internet 928, ISP 926, local network 922 and communication interface 918.
The received code may be executed by processor 904 as it is received, and/or stored
in storage device 910, or other non-volatile storage for later execution.
Extensions and Alternatives
In the foregoing specification, embodiments of the invention have been described
with reference to numerous specific details that may vary from implementation to
implementation. The specification and drawings are, accordingly, to be regarded in an illustrative
rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and
what is intended by the applicants to be the scope of the invention, is the literal and equivalent
scope of the set of claims that issue from this application, in the specific form in which such
claims issue, including any subsequent correction.
Throughout this specification, including the claims, where the context permits, the
term “comprise” and variants thereof such as “comprises” or “comprising” are to be interpreted
as including the stated integer or integers without necessarily excluding any other integers.
Claims (15)
1. A method performed by one or more computing devices of an exporting system, the exporting system comprising a revisioning database and a logical clock used to order changes to the revisioning database, the method comprising: tracking changes to data objects by creating, in the revisioning database, change records comprising logical clock values obtained from the logical clock reflecting an order of the changes; responsive to receiving a command to execute an incremental replication job, determining a current value of the logical clock, determining a number of incremental replication chunks to divide the incremental replication job into, and creating and persistently storing an incremental replication plan for the incremental replication job; wherein the plan comprises the current value of the logical clock and the determined number of replication chucks the incremental replication job is divided into, wherein each of the number of replication chunks indicates one or more data objects in the revisioning database that are included in the replication chunk; based on the incremental replication plan, executing the incremental replication job including, for each replication chunk of the plan: collecting, from the revisioning database, change data for the data objects included in the chunk based at least in part on the current value of the logical clock in the plan, exporting the change data collected for the data objects included in the chunk to an importing system, and marking the chunk in the plan as complete responsive to successfully exporting the change data collected for the data objects included in the chunk to the importing system.
2. The method of Claim 1, wherein the command specifies a number of replication chunks for the incremental replication job, the method further comprising: determining the number of incremental replication chunks to divide the incremental replication job into based on the number of replication chunks specified in the command.
3. The method of Claim 1, wherein the command specifies a number of data objects per replication chunk for the incremental replication job, the method further comprising: determining the number of incremental replication chunks to divide the incremental replication job into based on the number of data objects per replication chunk specified in the command.
4. The method of Claim 1, wherein the command specifies a time at which to start execution of the incremental replication job, the method further comprising. beginning execution of the incremental replication job at the specified time.
5. The method of Claim 1, wherein the command is received via a command line interface.
6. The method of Claim 1, wherein the replication chunks of the plan are ordered in the plan and wherein the change data collected for the data objects included in the replication chunks of the plan is exported to the importing system according to the order of the replication chunks in the plan.
7. The method of Claim 1, wherein exporting the change data collected for the data objects included in the replication chunks of the plan comprises storing the change data in one or more files and sending the one or more files to the importing system.
8. The method of Claim 1, wherein exporting the change data collected for the data objects included in the replication chunks of the plan comprises streaming the change data to the importing system.
9. The method of Claim 1, wherein each of the replication chunks in the plan indicates a range of data objects included in the chunk with a minimum data object identifier value and a maximum data object identifier value.
10. One or more computer-readable media storing instructions which, when executed by one or more computing devices of an exporting system, the exporting system comprising a revisioning database and a logical clock used to order changes to the revisioning database, causes performance of a method as recited in any one of Claims 1-9.
11. An exporting system comprising: a revisioning database; a logical clock used to order changes to the revisioning database; means for tracking changes to data objects by creating, in the revisioning database, change records comprising logical clock values obtained from the logical clock reflecting an order of the changes; means, responsive to receiving a command to execute an incremental replication job, for determining a current value of the logical clock, for determining a number of incremental replication chunks to divide the incremental replication job into, and for creating and persistently storing an incremental replication plan for the incremental replication job; wherein the plan comprises the current value of the logical clock and the determined number of replication chucks the incremental replication job is divided into; wherein each replication chunk in the plan indicates one or more data objects in the revisioning database that are included in the replication chunk; means for executing the incremental replication job based on the incremental replication plan including, for each replication chunk of the plan, means for collecting, from the revisioning database, change data for the data objects included in the chunk based at least in part on the current value of the logical clock in the plan, for exporting the change data collected for the data objects included in the chunk to an importing system, and for marking the chunk in the plan as complete responsive to successfully exporting the change data collected for the data objects included in the chunk to the importing system.
12. The system of Claim 11, wherein the replication chunks of the plan are ordered in the plan and wherein the system further comprises means for exporting the change data collected for the data objects included in the replication chunks of the plan according to the order of the replication chunks in the plan.
13. The system of Claim 12, wherein the means for exporting the change data collected for the data objects included in the replication chunks of the plan comprises means for storing the change data in one or more files and means for sending the one or more files to the importing system.
14. The system of Claim 12 or Claim 13, wherein the means for exporting the change data collected for the data objects included in the replication chunks of the plan comprises means for streaming the change data to the importing system.
15. The system of Claim 11, wherein each of the replication chunks in the plan indicates a range of data objects included in the chunk with a minimum data object identifier value and a maximum data object identifier value.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/922,437 | 2013-06-20 | ||
US13/922,437 US8886601B1 (en) | 2013-06-20 | 2013-06-20 | System and method for incrementally replicating investigative analysis data |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ624212A NZ624212A (en) | 2014-08-29 |
NZ624212B true NZ624212B (en) | 2014-12-02 |
Family
ID=
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10762102B2 (en) | System and method for incremental replication | |
US12137137B2 (en) | Temporal optimization of data operations using distributed search and server management | |
US12235842B2 (en) | Temporal optimization of data operations using distributed search and server management | |
US20210406258A1 (en) | Reducing index file size based on event attributes | |
USRE47546E1 (en) | Techniques for replicating changes to access control lists on investigative analysis data | |
US20150081701A1 (en) | Systems and methods for data flow exploration | |
US20200125660A1 (en) | Quick identification and retrieval of changed data rows in a data table of a database | |
US11514006B2 (en) | Modifying event generation rules responsive to detected schema changes | |
US11269808B1 (en) | Event collector with stateless data ingestion | |
US11347695B2 (en) | Clustered search head configuration synchronization | |
US11003687B2 (en) | Executing data searches using generation identifiers | |
US11841827B2 (en) | Facilitating generation of data model summaries | |
US9501489B2 (en) | Resource reconciliation based on external factors | |
US11481399B1 (en) | Nested discovery and deletion of resources | |
NZ624212B (en) | System and Method for Incrementally Replicating Investigative Analysis Data | |
US11914558B2 (en) | System and methods for object version tracking and read-time/write-time data federation | |
Ma et al. | Intelligent Web Data Management: Software Architectures and Emerging Technologies | |
Noori | Adding value to object storage: integrating analytics with cloud storage back ends |