WO2018175335A1

WO2018175335A1 - Method and system for discovering and positioning content into augmented reality space

Info

Publication number: WO2018175335A1
Application number: PCT/US2018/023166
Authority: WO
Inventors: Seppo T. VALLI; Pekka K. SILTANEN
Original assignee: Pcms Holdings, Inc.
Priority date: 2017-03-24
Filing date: 2018-03-19
Publication date: 2018-09-27

Abstract

A system and method includes receiving, at a first augmented reality head mounted display (AR HMD) at a first site, a video feed of a second site that is remote from the first site; rendering, with the first AR HMD, the video feed; determining, with the first AR HMD, gaze data; obtaining, at the first AR HMD, selected AR object data based on first AR HMD user input; outputting the gaze data of the first AR HMD user and the selected AR object data from the first AR HMD; and displaying, with the first AR HMD, a received video feed of the second site that includes an AR object based on the output selected AR object data, the AR object being positioned based on the output gaze data. In one embodiment, a server interacts with the AR HMD in a client-server relationship.

Description

METHOD AND SYSTEM FOR DISCOVERING AND POSITIONING CONTENT INTO AUGMENTED

REALITY SPACE

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001 ] The present application is a non-provisional filing of, and claims benefit under 35 U.S.C. §119(e) from, U.S. Provisional Patent Application Serial No. 62/476,318, filed March 24, 2017, entitled "METHOD AND SYSTEM FOR DISCOVERING AND POSITIONING CONTENT INTO AUGMENTED REALITY SPACE", which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] The present disclosure relates to augmented reality (AR) technologies and merging real and virtual elements to produce new visualizations, typically a video, in which physical and digital objects co-exist and interact in real time.

[0003] Three-dimensional (3D) models and animations are examples of virtual elements that may be visualized in AR. However, AR objects may basically be any digital information for which spatiality (3D position and orientation in space) gives added value, for example pictures, videos, graphics, text, and audio.

[0004] Augmented reality visualizations present augmented virtual elements as a part of the physical view. Augmented reality terminals equipped with a camera and a display, e.g. using augmented-reality glasses, either video-see-through or optical-see-through, either monocular or stereoscopic, capture video from a user's environment and show physical elements together with virtual elements on a display.

[0005] AR visualizations may be created in such a way that they can be seen correctly from different viewpoints. For example, when the user changes his/her viewpoint, virtual elements stay or act as if they were part of the physical scene. Tracking technologies may be employed for deriving 3D properties of the environment for AR content production, and when viewing the content, for tracking the viewer's (camera) position with respect to the environment. The viewer's (camera) position can be tracked e.g. by tracking known objects in the viewer's video stream and/or using one or more depth cameras. Inertial measurement sensors may also be used to assist with tracking.

SUMMARY

[0006] Systems and methods are presented for discovering and positioning content into augmented reality space. [0007] One embodiment is directed to a method including receiving, at a first augmented reality (AR) head mounted display (HMD) at a first site, a video feed of at least part of a second site that is remote from the first site. The method includes rendering, with the first AR HMD, the video feed. The method includes determining, with the first AR HMD, gaze data of a first AR HMD user with respect to the rendered video feed. The method also includes obtaining, at the first AR HMD, selected AR object data based on first AR HMD user input. The method also includes outputting the gaze data of the first AR HMD user and the selected AR object data from the first AR HMD and displaying, with the first AR HMD, a received video feed of at least part of the second site that includes an AR object based on the output selected AR object data, the AR object being positioned based on the output gaze data.

[0008] In one embodiment of the method, the selected AR object data includes at least one of an AR object selected by the first AR HMD user, orientation data determined from first AR HMD user input data, and scale data determined from first AR HMD user input data. In one embodiment of the method, the rendering, with the first AR HMD, the video feed is according to a shared coordinate system between the first site and the second site. In another embodiment, outputting the gaze data of the first AR HMD user and the selected AR object data from the first AR HMD comprises transmitting the gaze data of the first AR HMD user and the selected AR object data to a second AR HMD at or near to the second site.

[0009] In one embodiment of the method, the displaying, with the first AR HMD, the received video feed of at least part of the second site that includes the AR object positioned based on the output gaze data includes determining a position of the AR object based on three-dimensional ray-casting.

[0010] Another embodiment is directed to a method including constructing a three-dimensional model of a real-world environment corresponding to a first participant of the teleconference; transmitting a video feed of at least part of the real-world environment to at least a second participant of the teleconference; receiving gaze data from the second participant with respect to the transmitted video; receiving selected AR object data from the second participant; determining a position for augmenting an AR object based on the gaze data and the three-dimensional model of the real-world environment; rendering the AR object based on the determined position; and transmitting an augmented video feed including the AR object to the second participant.

[0011] One embodiment of the method includes determining a shared coordinate system between the first and second participants to the teleconference. For example, in an embodiment, the shared coordinate system is based on an adjacent arrangement, an overlapping arrangement, or a combination of adjacent and overlapping arrangements between a first site corresponding to the first participant and at least a second site corresponding to the second participant.

[0012] In one embodiment the method includes determining an intersection point between at least some of the gaze data and a surface of the three-dimensional model. [0013] Another embodiment is directed to a system including a processor disposed in an augmented reality (AR) display located at a first site, and a transceiver coupled to the processor to receive a video feed of at least part of a second site that is remote from the first site. The system includes a non-transitory computer storage medium storing instructions operative, when executed on the processor, to perform functions including: rendering the video feed; determining gaze data of a userof the augmented reality display with respect to the rendered video feed; obtaining selected AR object data based on input by a user; outputting via the transceiver the gaze data of the user and the selected AR object data from the AR display; and displaying via the AR display, a received video feed of at least part of the second site that includes an AR object based on the output selected AR object data, the AR object being positioned based on the output gaze data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 A depicts an example local user's view of a video feed from a remote site according to an embodiment.

[0015] FIG. 1 B depicts a first user's view and a second user's view of a video feed from a remote site, according to an embodiment.

[0016] FIG. 2A depicts an example system including a shared geometry, a first site system, a second site system, and an n site system according to an embodiment.

[0017] FIG. 2B depicts a system environment in accordance with an embodiment.

[0018] FIG. 3 depicts an example method in accordance with an embodiment.

[0019] FIG. 4 depicts an example video feed being transmitted from a remote site to a local site in accordance with an embodiment.

[0020] FIG. 5 depicts an example rendering of remote sites positioned in a triangular adjacent three- site geometry in accordance with an embodiment.

[0021 ] FIG. 6 depicts an overview of overlapping and adjacent geometries of sites in accordance with an embodiment.

[0022] FIG. 7 depicts an example user, the user's gaze direction and field of view, and pixels of a viewed image in accordance with an embodiment.

[0023] FIG. 8 depicts a ray being shot from a selected pixel through a camera and to a point of intersection of a closest object blocking the path of the ray in accordance with an embodiment.

[0024] FIG. 9 depicts a local user viewing an example rendering of a remote site video feed augmented with an object in accordance with an embodiment. [0025] FIG. 10 depicts an example communication and/or processing sequence in accordance with an embodiment.

[0026] FIG. 11 depicts another example communication and/or processing sequence in accordance with an embodiment.

[0027] FIG. 12 is a schematic block diagram illustrating the components of an exemplary wireless transmit/receive unit, in accordance with at least one embodiment in accordance with an embodiment.

[0028] FIG. 13 is a schematic block diagram illustrating the components of an example network entity in accordance with an embodiment.

DETAILED DESCRIPTION

[0029] Before virtual elements such as augmented reality (AR) content can be augmented into physical reality, their positions are typically defined with respect to the physical environment. In most markerless AR applications, such a task is made by an expert user, e.g. programmer. Most methods developed for positioning a virtual element are quite difficult to use and are mostly used in a local space, in which the user and the augmented object are in the same physical location. Methods and systems disclosed herein enable a user to augment virtual objects to a remote participant's environment during a video conference by just looking at a desired position at a video feed of the remote site.

Placing AR Objects into 3D World Model

[0030] Currently, user interaction with AR systems may often be limited to pure two-dimensional (2D) pointing and/or clicking via a device's touch screen or by a mouse. In these techniques, interactions in three dimensions, such as placing an AR object into a preferred position in an environment, may be done by 2D operations on the device's screen, without complete 3D information.

[0031 ] There are many solutions for positioning content into an augmented reality scene. They are mostly based on using mouse or touch interaction to define an object's position and they are mostly for use in a local environment.

Marker based placement

[0032] Traditionally, printed graphical markers are used in the environment and are detected from a video as a reference for both augmenting virtual information in a correct orientation and scale and for tracking the viewer's (camera) position. In marker based placement, the pose of a virtual object may be defined with respect to the markers, whose pose is known in advance by the system. A presentation device may recognize the marker and position the AR object relative to the marker. In AR systems employing marker based placement techniques, the user who is placing the marker is physically present at the site where the content is augmented. Planar image based placement

[0033] Markerless AR avoids distracting markers as in marker based placement by relying on detecting distinctive features of the environment and using them for augmenting virtual information and tracking a user's position. In planar image based placement, a planar element of a scene may be used as a marker. Typically, for feature-based techniques, such as planar image based placement, there is more in-advance preparations than for marker based methods. Possible reasons for this are that feature-based techniques may include more complex data capture, more complex processing, and more complex tools for AR content production. In addition, they typically do not give as explicit scale reference for the augmentations as when using markers.

[0034] In planar image based placement, an application offers a user interface for selecting a known feature set (e.g. a poster on the wall or a logo of a machine) from the local environment. The set of features that is used for tracking is, in practice, a planar image that can be used similarly as markers to define 3D location and 3D orientation.

Touch or mouse interaction based placement

[0035] In button interaction based placement techniques, a user may move an AR object in the world model using buttons that are mapped with different degrees of freedom (e.g. X, Y, Z directions in the world model). Touch or mouse interaction based placement techniques may be used in combination with marker or planar image based placement when adjusting the AR object's place defined by the fiducial marker.

Gesture based placement

[0036] Gesture based placement may utilize a user's hand gesture, such as for example, the user's mid-air finger movement, in front of the user terminal's camera. A user application may track the finger movements and may position the AR object by interpreting the gestures.

Terminal movement based placement

[0037] Terminal movement based placement uses the movability and small size of a handheld AR terminal. Terminal movement placement techniques show an AR object on the terminal screen and then the object's position is fixed relative to the terminal's position. The AR object can then be moved by moving the AR terminal. Often, terminal movement based placement is combined with touch screen interaction, e.g. for rotating the object.

3D reconstruction

3D reconstruction can be used to support setting pose of the augmented virtual objects. In 3D reconstruction, shapes of objects in an environment are captured, resulting in a set of 3D points of the shapes. Virtual objects may be positioned with respect to the generated 3D information using e.g. mouse or touch interaction. Interaction may be enabled e.g. by showing the 3D information from different perspectives for positioning the virtual object from different directions.

3D ray casting

[0038] In the 3D ray casting technique, the AR terminal casts an imaginary 3D ray from the selected position in the video stream shown on the terminal. The system uses a predefined 3D model of the environment (e.g. produced by 3D reconstruction), which is used to calculate an intersection of the ray and the 3D model. The object is then positioned into the intersection.

3D constraint alignment

[0039] In 3D constraint alignment, an application provides a user with visual hints of constraints found in a real environment, such as edges and planar surfaces, and allows the user to attach AR objects to the constraints. The technique is based on a predefined 3D model of the environment (e.g. produced by 3D reconstruction).

Drawing to video

[0040] Simple annotations (e.g. text and drawings) can be created so that a user draws the annotations on the touchscreen of the AR terminal, and the drawings are then attached to the 3D model of the real-world environment, with the help of e.g. 3D constraints described above. The technique is based on having a predefined 3D model of the environment (e.g. produced by 3D reconstruction).

Exemplary Systems and Methods

[0041] Disclosures herein describe, in at least one embodiment, environments in which there is no VR scene shared between sites, and, in some cases, only the information of the user's gaze direction is shared between the sites. In at least some scenarios, limited information saves network bandwidth and processing, which is advantageous because 3D reconstructed models can be very large in size.

[0042] 3D capture technologies allow 3D models of objects and people to be captured and transmitted to remote locations in real time. In the remote location, the 3D model may be rendered to a remote user's view using e.g. augmented reality glasses so that the transmitted objects seem to appear in the remote user's space. Rendering at a remote user's view allows people to be represented by realistic virtual avatars that replicate people's movements in real time.

[0043] In a telepresence system, 3D models of several geographically distant sites are often merged together in order to give the user an impression that the people in different sites interact in a same virtual environment. Two techniques for merging the 3D models are: overlapping remote spaces and adjacent remote spaces. Telepresence systems may also use combinations of the two methods. [0044] Overlapping remote spaces may be created by bringing the 3D geometries of the remote sites, as well as avatars of the users, into one common virtual reality environment. In the common virtual reality environment, there are no boundaries between the site models.

[0045] Collaboration in shared mixed reality spaces presents a serious scalability problem. When combining several physical spaces, layouts, furniture and other fixtures are overlaid and therefore are to be conformant across meeting spaces. The more meeting spaces and participants are brought together, the more challenging it becomes to support an unrestricted mutual experience in navigation and viewing by participants.

[0046] In many existing telepresence solutions, a window paradigm is used. In a windows paradigm, the remote users are seen through a window-like display. Behaving like a natural window, the display allows users to experience e.g. motion parallax and stereoscopic 3D perception. In these methods, the geometries of the remote spaces may not be overlaid so there may not be conformance problems.

[0047] To enable correct perception of gaze and gestures, physical parameters can be standardized across meeting sites (e.g., geometry, meeting table, display assembly, etc.). Standardization can include constraining the number and position of collaborating partners, which may present onerous restrictions in some circumstances.

[0048] The solutions using adjacent remote spaces may be implemented using traditional telepresence techniques, such as for example, sending video streams between sites to communicate. To support correct gaze directions and common understanding of spatial relationships between users at different sites, the physical geometries of the sites are fixed or the user's positions are tracked and the video positions of the cameras capturing the users are selected according to the user's positions.

[0049] When using 3D reconstruction, the 3D models of the remote sites may be positioned adjacently so that the models do not overlap. Then, the user avatars may be positioned into the respective 3D models and the others may see the remote users in a merged VR space or in synthesized videos captured from virtual camera positions, without the models or avatars colliding.

[0050] One drawback to using augmented reality lies in the difficulty of creating the AR system and the content shown to the user. Creating an augmented reality application is a complex programming task that is often carried out by programming professionals. However, producing the content shown in AR is a task that is simplified if proper tools are provided. AR content may be linked to a real-world 3D space, and providing tools that allow easy placement of AR content into desired positions in the real-world environment may be advantageous.

[0051] Embodiments described above for placing an AR object with respect to the real world include manipulating 3D positions using a 2D user interface, a task that is not familiar to most users. Also, most of the above disclosed systems do not address the problem of positioning content to the environment of a remote user during a teleconference.

[0052] Embodiments herein describe systems and methods that solve a problem of AR content creation in a case in which a non-expert user wants to place AR content (e.g. a 3D model of a new piece of furniture) to a remote environment (e.g. a remote user's living room) just by looking at the position where the content should be augmented.

[0053] Embodiments herein describe systems and methods that provide the ability to select a position for AR content (e.g. a 3D model) and augment the content to a remote environment (e.g. a remote user's living room).

[0054] One embodiment of a content positioning process may include starting a video conference. Users in several geographically distant sites may start a video conference in which the users are wearing AR head mounted displays (HMDs) or the like to see video feeds of remote sites.

[0055] Referring now to the figures, FIG. 1A depicts a local user's view (from Site 1) of a video feed from a remote site (Site 2). Block 102 of FIG. 1A depicts a local user's view of a video feed from Site 2 rendered with the local user's AR goggles, or head mounted device (HMD) and including an augmentation- position selection indicator. The user could be searching for a position for AR content. The right half of FIG. 1A, block 104, depicts the local user's view at Site 1 of the video feed from Site 2 with AR content positioned at (or based on) the selected spot. FIG. 1 B, block 106 depicts a local user searching for a position for AR content while wearing an HMD. Block 108 illustrates a user at site 1 viewing the positioned content. Block 110 illustrates a user at Site 3 viewing the positioned content from a different viewpoint.

[0056] In one embodiment, the content positioning process includes defining positions of each remote site in a common geometry. For example, in a system may define positions of each remote site in a common geometry to define how the remote users' video feeds are rendered to local user's view.

[0057] One embodiment of a process may include tracking a local user's orientation and/or gaze direction. A local user's system (e.g., an AR HMD) may track the local user's orientation and gaze direction.

[0058] Another embodiment of a process may include sending, receiving, and/or processing user input relating to a selected virtual object. A local user may select a virtual object to be augmented. The local user may rotate the object to a correct orientation, select a scale, look at a desired position in the video feed from the remote site (feed is rendered to the view provided by the AR HMD) to augment the object and/or indicate to the system to augment the object to that position.

[0059] One embodiment of a process may include calculating a direction at which the user is looking with respect to the rendered view of the remote site. In an embodiment, there may be multiple users at each site, each of them wearing, for example, an AR HMD. In one embodiment, the process calculates a direction where a user is looking with respect to the rendered view of the remote site and sends this direction information together with scale information, orientation information and the virtual object to be augmented to the remote site.

[0060] Another embodiment of a process may include calculating an intersection of the gaze direction and the 3D model of the remote site. The process can include calculating an intersection of the gaze direction and the 3D model of the remote site. The process can further include indicating where the user is looking at by rendering e.g. a cursor and/or line of sight to the users' views.

[0061] Another embodiment of a process may include adding a virtual object to the intersection coordinates and/or rendering the view with the new augmented object to one or more users. The process can include adding a virtual object to the intersection coordinates and rendering the view with the new augmented object to all the users. The process can include providing an individual video feed to each of the users, from individual camera (virtual or real) positions.

[0062] An example implementation of an embodiment directed a method and system for content creation and viewing is described below.

[0063] As used herein, the site of the user wanting to add one or more virtual objects to one or more other sites environment may be referred to as the "augmenting site," and the site of the real-world environment that the virtual object is added may be referred to as the "augmented site."

[0064] Depicted in FIG. 2A is shared geometry, site 1 200, site 2 201, and site n 203. The shared geometry 202 may be a controlling system that positions the conference participants in a common coordinate system, optionally, to allow each site's viewing system to position the remote video feeds so that the participants have a common understanding of the meeting setup. In this depicted example, site 1 200 may be referred to as an augmenting site and site 2 201 may be referred to as an augmented site. As depicted in FIG. 2A, the augmenting site includes an AR viewing system such as AR viewing for User 1 210, an AR content selection system such as 238, and an AR content position system 230.

[0065] The AR viewing system may be configured to allow a user to view AR objects and videos from both local and other sites augmented to the user's view, positioned according to the shared geometry 202. Shared geometry 202 enables relative positions 204, 206 and 208 and the like to be shared among user 1 210, user 2 212, and user 3 214.

[0066] In one embodiment, user 2 at site 2 201 may have a virtual camera or several virtual reality cameras 216, 222 that provide video feeds 218, 220 and 224 to user 2, user 1 210, and user 3 214, respectively. After receiving video feed 220, at site 1 200, user 1 210 transmits a gaze direction 226 and determines AR content selection 238 to enable AR content positioning 230. AR content positioning 230 provides the gaze direction 228 and virtual object positioning 232 to site 2 201. More particularly, gaze direction 228 and virtual object 232 can be provided to a 3D model management module 234 at site 2 201.

[0067] In one embodiment, the AR content selection system 238 may be configured to allow a user to select virtual objects to be augmented. In another embodiment, the AR content positioning 230 may be configured to receive eye gaze direction 226 with respect to local geometry and use the shared geometry 202 to calculate gaze direction with respect to remote geometry. The direction information and the selected virtual content can then be sent to the remote site, such as site n 203.

[0068] As depicted in FIG. 2A, the augmented site can include a 3D reconstruction module 236, a 3D model management module 234 and one or more virtual cameras 216. 3D reconstruction module 236 can create a 3D model of the site using a set of sensors, e.g. depth sensors with RGB cameras, and updates the 3D model in real time. 3D model management module 234 can add AR objects to the reconstructed geometry.

[0069] A virtual camera 216 can create a synthesized video feed from the 3D model management module 234 and transmits the feed 224 to a local site, such as user 3 AR viewing system 214.

[0070] In some embodiments, some sites may not be an augmenting site 201 or an augmented site 203. Such sites may have only a viewing system, for example, if the users at the site do not augment objects to other sites. Correspondingly, no objects can be augmented, if a site has only a viewing system.

[0071] The AR viewing system in accordance with some embodiments shows video feeds, such as video feed 224, from the remote sites augmented to a local user's view, such as user 3 214. The feeds from remote sites may be augmented around the user so that when the user turns around, the remote site he/she is facing changes. To show all the remote feeds in correct positions, in one embodiment, the system positions the feeds relative to the user. Since each of the participant's respective viewing systems define the remote feed positions locally, some embodiments include a common geometry management system that defines a common geometry system so that each of the users have a same understanding of spatiality and common scale. Specifically, shared geometry 202 shares relative positions 204, 206 and 208 among the users to enable the positioning of the respective users within the video feeds 218, 220 and 224.

Referring to FIG. 2B, in one embodiment, the AR viewing system 240 may include one or more of the following components: a user application 242, a presentation system 244, a world model system 246, a tracking system 248, and/or a context system 250. As will be appreciated by one of skill in the art, each of user application 242, presentation system 244, world model system 246, tracking system 248 and context system 250 can be implemented via a processor disposed in an augmented reality (AR) display. In other embodiments, rather than the processor being disposed in an AR display or HMD, a client-server type relationship can exist between a display and a server. For example, a display can function as a client device with user interface functionality, but rely on a connected server to perform other functionalities. [0072] In one embodiment, a user application may run on a user terminal such as a mobile device and/or may implement e.g. user interface functionalities and/or control client-side logic of the AR system 240. The presentation system 244 may control all outputs to the user, such as video streams, and/or may render 2D and/or 3D AR objects into the videos, audio and/or tactile outputs.

[0073] The world model system 246 may store and provide access to digital representations of the world, e.g. points of interest (Pol) and AR objects linked to the Pols.

[0074] The tracking system 248 may capture changes in the user's location and orientation so that the AR objects are rendered in such a way that the user experiences them to be to be part of the environment. The tracking system 248, for example, may contain a module for tracking the user's gaze direction. The tracking system 248 may be integrated with a head mounted display (HMD) that can track the user's eye movement, for example, from close distances.

[0075] The context system 250 may store and provide access to information about the user and real time status information, e.g. where the user is using the system.

[0076] Using the terminology above, the AR viewing system operation may be described as follows. A user has a user application running in an AR terminal, such as a head mounted display (HMD), which can be a mobile device such as a cell phone or the like attached to a head mounting apparatus to create an HMD. The presentation system 244 brings AR objects, e.g. video feeds from remote sites, into the user's view, augmenting them into the live video feed of the local environment captured by the user's AR terminal. The viewing system 240 may also have an audio system 252 producing spatial audio so that the sound attached to the augmented object, e.g. sound of the person speaking in the remote site video feed, seems to be coming from the right direction. The presentation system 244 positions the AR objects into correct positions in the video stream by using 3D coordinates of the local environment and the AR object, provided by the world model system 246. The tracking system 248 tracks the user's position with respect to the world model 246, allowing the presentation system 244 to keep the AR objects' positions unchanged in the user's view with respect to the environment even when the user's position changes.

[0077] Referring back to FIG. 2A, AR content selection 238 can be used to discover and control AR content that is shown by the AR viewing system. The selection system 238 may be a mobile application that a user is already familiar with for consuming content in the user's own mobile device, for example an application for browsing the user's images. The disclosed system may also implement service discovery routines so that an AR content control system can discover AR viewing system interfaces and connect to the AR viewing system over a wireless network.

[0078] To view AR content, a user indicates the 3D position where the AR viewing system renders the content. The AR scene creation defines for each AR object a position, an orientation, and a scale (collectively the pose). The position is defined so that a user looks to the desired position in the remote site video feed and the AR content positioning system may compute eye direction with respect to the common coordinate system maintained by, for example, a common geometry system or shared geometry 202.

[0079] The 3D reconstruction system 236 captures shape and appearance of real objects in the site where the virtual objects are augmented. The system may be based on obtaining depth data from several sensors such as sensors associated with virtual camera 216 capturing the site from different angles. Based on the depth data, the system may calculate surface vertices and normals and interpolate the result to determine a mathematical representation of the surfaces corresponding to real-world objects. The resulting surfaces may be combined with normal RGB video camera images to create a natural-looking virtual representation of the site's environment.

[0080] The 3D model management module 234 stores the 3D model reconstruction results and communicates with the common geometry management 202 to select correct views to the 3D model so that the virtual camera views 216 can be provided to the users from different angles. 3D model management module 234 may also calculate the intersection point of the gaze direction and the 3D model and adds augmented virtual objects to the intersection.

[0081 ] I n one embodiment, a virtual camera 216 provides views to the virtual world from different angles. In one embodiment, virtual cameras 216 in combination with AR viewing system 240 create a high-quality synthesized video feed from 3D reconstruction results online.

[0082] An advantage of at least some embodiments of systems and methods herein is that virtual content in a 3D environment may be positioned without manipulating 3D objects via a 2D user interface. Manipulating 3D objects via a 2D interface is a task that is not natural to most people.

[0083] Referring now to FIG. 3, an embodiment is directed to a method for a viewing system for video conferencing.

[0084] In one embodiment, a system and method enables video conferencing between numerous sites, where each site may have one or more users. In certain scenarios, the number of sites may be limited by the method used for defining the common geometry that defines how the sites are shown to the user on the other sites. The user may preferably use HMD-type AR goggles. Some remote site display methods may allow using large displays where the remote sites are rendered. Block 302 provides for starting a video conference between one or more sites.

[0085] Block 310 provides for generating a reconstruction of at least part of the sites constantly during the video conference. Each site may have a 3D reconstruction set-up that captures the shape and appearance of real objects in the site. The 3D reconstruction system may e.g. be based on obtaining depth data from several sensors capturing the site from different angles. The result of the 3D reconstruction may be a surface or faceted 3D model of the site environment.

[0086] Block 312 may provide for combining the site geometries into one, shared geometry. To help pointing an object remotely, the telepresence system may create a shared geometry combining all the site geometries. The geometries may be combined so that they are, for example, adjacent or overlapping.

[0087] Combining the geometries does not necessarily mean that the reconstructed 3D models are combined into one combined 3D model. For the purpose of defining the positions where the videos received from remotes sites are rendered into the local user's view, it is enough to define how the site geometry extents relate to each other without necessarily combining model data.

[0088] Geometries of two or more sites can be combined trivially (either adjacent or overlapping), but when the number of sites increases, there are several options to create the common adjacent geometry.

[0089] Overlapping geometries mean that virtual models from different sites may contain overlapping objects, making it hard to realize what object the user is looking at when trying to select a position for augmenting an object. As a result, erroneous positions may be selected.

[0090] Block 320 provides for placing virtual cameras to each site and sending a synthesized video feed of the 3D reconstructed model to the other sites. Each site has a capture set-up that includes one or more cameras that capture the site environment in real time and transmit the captured videos to the other sites. The captured video feeds are generated by virtual cameras that create synthetized video of the 3D reconstructed model from selected viewpoints. The virtual cameras may be positioned based on the selected shared geometry method. Since the users at the other sites see a 2D projection of the 3D environment, there is no need to transmit 3D information between the sites. FIG. 4 depicts an example of a video feed being transmitted from a remote site (site 2) 404 to a local site (site 1) 402. As depicted in FIG. 4, the remote video feed of at least part of the remote site captured by the camera(s) 406 is transmitted from the remote site to the HMD 408 of the local site.

[0091] Block 322 provides for rendering videos received from sites to the AR HMD or goggles worn by users at each site, based on the shared geometry. Video feeds from the remote sites are rendered to the local user's view into respective positions that may be defined by the common geometry. FIG. 5 depicts an example rendering of remote sites positioned in a triangular adjacent three site geometry. As depicted in FIG. 5, the two remote sites 502 and 504 are rendered to the local user's view 506. FIG. 6 depicts an overview of overlapping and adjacent geometries of sites. Organization 602 of FIG. 6 depicts an example rendering of an adjacent geometry for two sites. Organization 604 of FIG. 6 depicts an example triangular adjacent three- site geometry. Organization 606 of FIG. 6 depicts an example honeycomb seven-site geometry. Organization 608 illustrates an overlapping geometry. [0092] Block 330 provides for tracking a user's gaze direction and head position with respect to the rendered video and calculating the intersection of the gaze direction and the rendered video image. The user's AR HMD or goggles, in one embodiment, may contain a gaze tracking module that is used to determine a direction vector with respect to a local geometry. In one embodiment, the gaze tracking module may be implemented by two 3D cameras (integrated into an HMD or AR goggles) that are able to view and/or determine the user's pupil position and gaze direction in three dimensions. Since the gaze direction can be calculated with respect to the remote video the user is looking at, the gaze direction can be used to calculate the pixel in the remote video that the user is looking at, Specifically, referring to FIG. 7, a user 702 is shown wearing an HMD with a given field of view 706A in a gaze direction 704A representing a scene field of view 706B from gaze direction 704B. The pixel coordinates 708 are also illustrated to demonstrate a selected pixel position.

[0093] Block 332 provides for getting user input for selecting a position for augmentation. Specifically, to select a position for augmentation, a user looks at a spot for selection in the remote video and indicates to the system when the spot is selected. The method of indication may be via a voice command and/or a hand gesture in front of the AR goggle or an HMD camera. To assist the user with position selection, the system may show a cursor augmented to the spot the user is looking at. The cursor can be shown only in the local user's video, or a 3D cursor (or a 3D line showing the gaze direction) may be augmented to each user's video enabling all the users to see the augmentation positioning process.

[0094] Block 340 provides for transferring image intersection coordinates to the site that sent the video using, for example, ray casting to get the intersection point of a 3D reconstructed model. Since the intersection of gaze direction and the video shown in the user's display can be calculated, the point at which the gaze intersects the 3D reconstructed model can also be calculated. To calculate such a point, for example, a well-known ray casting method may be used.

[0095] One known method of ray casting includes Roth, Scott D. (February 1982), "Ray Casting for Modeling Solids", Computer Graphics and Image Processing, 18 (2): 109-144, doi: 10.1016/0146- 664X(82)90169-1 , which is incorporated by reference. In the method described by Roth, a ray is shot from the selected pixel in the generated image (through a virtual camera), and the point of intersection of the closest object blocking the path of that ray is selected as an intersection point. A visual example of ray casting is shown in FIG. 8. The ray casting calculation may be done at the remote site, so there is no need to transfer any information except the pixel coordinates between the sites.

[0096] Block 342 provides for getting user input for uploading augmented content and transferring the content to the site where the content will be augmented. When a position has been selected, the user may upload a virtual object to be augmented. The user may also rotate and scale the object to desirable orientation and size. A user interface for this may be implemented on a mobile device, or using gestures to select one object of a group of objects rendered to user's view. The uploaded object and the orientation information, such as transformation matrices, may then be transmitted to the remote site.

[0097] Block 350 provides for augmenting the content into each of the synthesized videos to the position of intersection, and sending to each of the sites. The object to be augmented is added to the 3D reconstructed 3D model of the remote site, and the synthesized video feeds from selected virtual camera positions may be sent to each site.

[0098] Block 352 provides for rendering the video feeds to the AR goggles or HMD worn by users at each site. The video feeds containing augmented objects are rendered to each user's view. Referring to FIG. 9, a user 902 wearing an HMD views a scene 906 with virtual object 904.

[0099] Methods according an embodiment are shown through communication sequences described in FIG. 10 and 11.

[0100] Referring to FIG. 10, interactions are shown between site 1 1002, site 2 1004 and site N 1006. Beginning at site 2 1004, the system and method includes creating a 3D reconstructed model of a site environment 1008, and positioning virtual reality cameras to individual positions for each site 1010, creating synthesized video feeds of the reconstructed 3D model from each virtual camera position 1012.

[0101] As shown by lines 1014 and 1016, video feeds are provided to site 1 1002 and site N 1006. Site N 1006 receives the video feed 1016 and renders the video feed to AR goggles/HMD or other display.

[0102] Site 1 1002 receives video feed 1014 with a viewpoint determined for site 1 and the video feed is rendered to AR goggles/HMD or other display 1020. At site 1 , the users gaze direction is tracked 1024 and site 1 user provides input for position selection 1026. The gaze direction information 1028 is provided to Site 2 which enables Site 2 1004 to calculate an intersection of gaze and a 3D reconstructed model 1030.

[0103] User at site 1 1002 provides input for selecting a virtual object 1032 and provides the data about the virtual object 1034 to Site 2 1004. At Site 2 1004, the virtual object is positioned within the reconstructed 3D model at the intersection point 1036. Next, at Site 2 1004, synthesized video feeds of the reconstructed 3D model are created from each virtual reality camera position 1038.

[0104] The synthesized video feeds are provided to Site 1 1040 and to Site N 1042. The synthesized video feed to Site 1 is a video feed with a viewpoint from the user at Site 1 and the synthesized video feed to Site N is a video feed with a viewpoint from the user at Site N.

[0105] The synthesized video feed is then rendered at Site 1 via AR goggles/HMD/display 1044, rendered at Site 2 1046, and rendered at Site N 1048. [0106] Referring now to FIG. 11 , another sequence illustrates an embodiment from an alternate perspective. More particularly, FIG. 11 illustrates interactions between Site 1 1102, Site 2 1104 and Site 3 1106.

[0107] Video feed from site 2 1104 is provided to site 1 1 102 at step 1108. At Site 2 1104, cameras can perform a local 3D scan at step 1110 and build a local 3D model 1116. At Site 1 1102, after a video is received from Site 2, gaze tracking is performed 112 and a gaze vector is computed at step 1114. A user at Site 1 1102, an input from a user, for example, can select an AR object. The system can determine scale and orientation at step 1118. At step 1120, a position and gaze vector is provided to Site 2.

[0108] At Site 2 1104, using the position and gaze vector, an AR location is computed at step 1122. Site 1 1102 can also provide AR object, scale and orientation data at step 1124. Then, at step 1126, Site 2 1104 renders a local augmented reality scene for the user at Site 2 from his/her viewpoint. Transmitted videos can also be augmented at step 1130. Thereafter, different video feeds can be provided to both Site 1 and Site 3. Specifically, Site 1 receives an augmented video as a view from user 1 , 1132 and Site 3 receives an augmented video as a view from user 3 1140.

[0109] In both FIGs. 10 and 11 , a user at site 1 augments a virtual object to an environment of site 2, and the user at site 1, site 2 and site n (site 3) see the augmented objects rendered to users respective AR goggles/HMD or 3D display.

Using real cameras

[0110] Instead of using virtual cameras as described above the system may use real video cameras. The cameras may be positioned as described in International Application No. PCT/US 16/46848, filed Aug. 12, 2016, entitled "System and Method for Augmented Reality Multi-View Telepresence," and PCT Patent Application No. PCT/US17/38820 filed June 22, 2017, entitled "System and Method for Spatial Interaction Using Automatically Positioned Cameras" which are incorporated herein by reference. Each video camera may capture an individual video feed that is transmitted to one of the remote users.

[0111] Embodiments described herein for calculating an intersection between a gaze direction and a 3D reconstructed remote model can be used when using real video cameras. In one embodiment, apriori knowledge of optical properties, position and capture direction of the camera, can enable placing a virtual camera with the same or similar properties and pose with respect to a reconstructed model, and enable calculating the intersection similar to using virtual cameras only.

[0112] Once the 3D coordinates where the virtual object is to be augmented are known, the object can be augmented to outgoing video streams, using a 3D reconstructed model as a positioning reference. Advantageously, there is no need to transmit an entire reconstructed video to other sites. Transmitting the 3D reconstructed model

[0113] In exemplary embodiments, the 3D reconstructed model is not shared with other sites, only the captured video streams. Limiting the sharing of the 3D reconstructed model is preferable due to the possibly extreme bandwidth requirements of sharing real-time generated 3D model. However, in some use cases it may be preferable to share an entire 3D reconstructed geometry. A shared geometrid enables creating a single combined geometry of all the site 3D reconstructions, and the position selection can be implemented as described in U.S. Patent Publication No. 2016/0026242, which is incorporated herein by reference. Sharing the 3D reconstruction allows each site to implement augmentation individually, without notifying any other parties of identified objects each user has selected to be rendered into his/her own view.

Positioning away from the intersection point

[0114] Some embodiments described above allow positioning augmented objects in the intersections of gaze direction and the virtual model, meaning the augmented objects are touching the 3D reconstructed virtual model. However, some exemplary embodiments use user interaction to move the object along a "gaze ray", which enables moving the object away from the intersection point, but maintaining the object along the gaze ray. In some embodiments, this action is collaborative. For example, one user selects the gaze direction and another user who sees the remote environment from different angle, positions the object in the correct position along the ray.

Changing augmented object's orientation

[0115] The orientation of the virtual object augmented to remote video may be selected after it has been augmented into the remote video. The system may offer a user interface for rotating the object (e.g. by recognizing user's gestures in front of AR goggles/HMD). The gestures can be interpreted by a local system and transmitted to a remote site that updates the virtual object orientation accordingly. Any of the users viewing the remote augmented video may change the orientation.

Example Use Case

[0116] In an exemplary use case, Pekka, Seppo and Sanni are having an enhanced video conference using the system described above. They all have set the system in their apartments. Seppo has a 3D model of a piece of furniture he thinks looks good in Pekka's environment. Seppo selects a position where he wants to add the furniture model, by looking at a position at a wall in the video view coming from Pekka's apartment. Seppo selects a furniture model using his mobile phone user interface (Ul) and informs the system to augment the object using voice command.

[0117] Since Sanni has a view from a different angle, she can move the model into a bit better position, along the viewpoint line visualized by the system. Since Pekka is also wearing AR goggles/HMD, he can walk around his apartment and see the furniture from different angles. [0118] In the present disclosure, various elements of one or more of the described embodiments are referred to as "modules" that carry out (i.e., perform, execute, and the like) various functions described herein. As the term "module" is used herein, each described module includes hardware (e.g., one or more processors, microprocessors, microcontrollers, microchips, application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), memory devices, and/or one or more of any other type or types of devices and/or components deemed suitable by those of skill in the relevant art in a given context and/or for a given implementation. Each described module shown and described can also include instructions executable for carrying out the one or more functions described as being carried out by the particular module, and those instructions could take the form of or at least include hardware (i.e., hardwired) instructions, firmware instructions, software instructions, and/or the like, stored in any non-transitory computer-readable medium deemed suitable by those of skill in the relevant art.

[0119] Exemplary embodiments disclosed herein are implemented using one or more wired and/or wireless network nodes, such as a wireless transmit/receive unit (WTRU) or other network entity.

[0120] FIG. 12 is a system diagram of an exemplary WTRU 1202, which may be employed as a mobile device, a remote device, a camera, a monitoring-and-communication system, and/or a transmitter, in embodiments described herein. As shown in FIG. 12, the WTRU 1202 may include a processor 1218, a communication interface 1219 including a transceiver 1220, a transmit/receive element 1222, a speaker/microphone 1224, a keypad 1226, a display/touchpad 1228, a non-removable memory 1230, a removable memory 1232, a power source 1234, a global positioning system (GPS) chipset 1236, and sensors 1238. It will be appreciated that the WTRU 1202 may include any sub-combination of the foregoing elements while remaining consistent with an embodiment.

[0121] The processor 1218 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGAs) circuits, any other type of integrated circuit (IC), a state machine, and the like. The processor 1218 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 1202 to operate in a wireless environment. The processor 1218 may be coupled to the transceiver 1220, which may be coupled to the transmit/receive element 1222. While FIG. 4 depicts the processor 1218 and the transceiver 1220 as separate components, it will be appreciated that the processor 1218 and the transceiver 1220 may be integrated together in an electronic package or chip.

[0122] The transmit/receive element 1222 may be configured to transmit signals to, or receive signals from, a base station over the air interface 1216. For example, in one embodiment, the transmit/receive element 1222 may be an antenna configured to transmit and/or receive RF signals. In another embodiment, the transmit/receive element 1222 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, as examples. In yet another embodiment, the transmit/receive element 1222 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 1222 may be configured to transmit and/or receive any combination of wireless signals.

[0123] In addition, although the transmit/receive element 1222 is depicted in FIG. 12 as a single element, the WTRU 1202 may include any number of transmit/receive elements 1222. More specifically, the WTRU 1202 may employ MIMO technology. Thus, in one embodiment, the WTRU 1202 may include two or more transmit/receive elements 1222 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 1216.

[0124] The transceiver 1220 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 1222 and to demodulate the signals that are received by the transmit/receive element 1222. As noted above, the WTRU 1202 may have multi-mode capabilities. Thus, the transceiver 1220 may include multiple transceivers for enabling the WTRU 1202 to communicate via multiple RATs, such as UTRA and IEEE 802.11, as examples.

[0125] The processor 1218 of the WTRU 1202 may be coupled to, and may receive user input data from, the speaker/microphone 1224, the keypad 1226, and/or the display/touchpad 1228 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit). The processor 1218 may also output user data to the speaker/microphone 1224, the keypad 1226, and/or the display/touchpad 1228. In addition, the processor 1218 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 1230 and/or the removable memory 1232. The non-removable memory 1230 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device. The removable memory 1232 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like. In other embodiments, the processor 1218 may access information from, and store data in, memory that is not physically located on the WTRU 1202, such as on a server or a home computer (not shown).

[0126] The processor 1218 may receive power from the power source 1234, and may be configured to distribute and/or control the power to the other components in the WTRU 1202. The power source 1234 may be any suitable device for powering the WTRU 1202. As examples, the power source 1234 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel-zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li-ion), and the like), solar cells, fuel cells, and the like.

[0127] The processor 1218 may also be coupled to the GPS chipset 1236, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 1202. In addition to, or in lieu of, the information from the GPS chipset 1236, the WTRU 1202 may receive location information over the air interface 1216 from a base station and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 1202 may acquire location information by way of any suitable location-determination method while remaining consistent with an embodiment.

[0128] The processor 1218 may further be coupled to other peripherals 1238, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity. For example, the peripherals 1238 may include sensors such as an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.

[0129] FIG. 13 depicts an exemplary network entity 1390 that may be used in embodiments of the present disclosure, for example as part of a monitoring-and-communication system, as described herein. As depicted in FIG. 13, network entity 1390 includes a communication interface 1392, a processor 1394, and non-transitory data storage 1396, all of which are communicatively linked by a bus, network, or other communication path 1398.

[0130] Communication interface 1392 may include one or more wired communication interfaces and/or one or more wireless-communication interfaces. With respect to wired communication, communication interface 1392 may include one or more interfaces such as Ethernet interfaces, as an example. With respect to wireless communication, communication interface 1392 may include components such as one or more antennae, one or more transceivers/chipsets designed and configured for one or more types of wireless (e.g., LTE) communication, and/or any other components deemed suitable by those of skill in the relevant art. And further with respect to wireless communication, communication interface 1392 may be equipped at a scale and with a configuration appropriate for acting on the network side— as opposed to the client side— of wireless communications (e.g., LTE communications, Wi Fi communications, and the like). Thus, communication interface 1392 may include the appropriate equipment and circuitry (perhaps including multiple transceivers) for serving multiple mobile stations, UEs, or other access terminals in a coverage area.

[0131] Processor 1394 may include one or more processors of any type deemed suitable by those of skill in the relevant art, some examples including a general-purpose microprocessor and a dedicated DSP.

[0132] Data storage 1396 may take the form of any non-transitory computer-readable medium or combination of such media, some examples including flash memory, read-only memory (ROM), and random- access memory (RAM) to name but a few, as any one or more types of non-transitory data storage deemed suitable by those of skill in the relevant art could be used. As depicted in FIG. 13, data storage 1396 contains program instructions 1397 executable by processor 1394 for carrying out various combinations of the various network-entity functions described herein. [0133] Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the methods described herein may be implemented in a computer program, software, or firmware incorporated in a computer-readable medium for execution by a computer or processor. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.

Claims

CLAIMS What is claimed is:

1. A method comprising:

receiving, at a first augmented reality (AR) head mounted display (HMD) at a first site, a video feed of at least part of a second site that is remote from the first site;

rendering, with the first AR HMD, the video feed;

determining, with the first AR HMD, gaze data of a first AR HMD user with respect to the rendered video feed;

obtaining, at the first AR HMD, selected AR object data based on first AR HMD user input;

outputting the gaze data of the first AR HMD user and the selected AR object data from the first AR HMD; and

displaying, with the first AR HMD, a received video feed of at least part of the second site that includes an AR object based on the output selected AR object data, the AR object being positioned based on the output gaze data.

2. The method of claim 1, wherein the selected AR object data includes at least one of an AR object selected by the first AR HMD user, orientation data determined from first AR HMD user input data, and scale data determined from first AR HMD user input data.

3. The method of claim 1 wherein the rendering, with the first AR HMD, the video feed is according to a shared coordinate system between at least the first site and the second site.

4. The method of claim 1 wherein the outputting the gaze data of the first AR HMD user includes outputting a position and gaze vector related to an AR object.

5. The method of claim 4 wherein the outputting the gaze data of the first AR HMD user includes outputting a scale and orientation of the AR object.

6. The method of claim 1, wherein outputting the gaze data of the first AR HMD user and the selected AR object data from the first AR HMD comprises transmitting the gaze data of the first AR HMD user and the selected AR object data to a second AR HMD at or near to the second site.

7. The method of claim 1 wherein the displaying, with the first AR HMD, the received video feed of at least part of the second site that includes the AR object positioned based on the output gaze data includes determining a position of the AR object based on three-dimensional ray-casting.

8. A method comprising: constructing a three-dimensional model of a real-world environment corresponding to a first participant of a teleconference;

transmitting a video feed of at least part of the real-world environment to at least a second participant of the teleconference;

receiving gaze data from the second participant with respect to the transmitted video;

receiving selected AR object data from the second participant;

determining a position for augmenting an AR object based on the gaze data and the three- dimensional model of the real-world environment;

rendering the AR object based on the determined position; and

transmitting an augmented video feed including the AR object to at least the second participant.

9. The method of claim 8 further comprising determining a shared coordinate system between the first and second participants to the teleconference.

10. The method of claim 9, wherein the shared coordinate system is based on an adjacent arrangement, an overlapping arrangement, or a combination of adjacent and overlapping arrangements between a first site corresponding to the first participant and at least a second site corresponding to the second participant.

11. The method of claim 8 further comprising determining an intersection point between at least some of the gaze data and a surface of the three-dimensional model.

12. A system comprising:

a processor; and

a transceiver coupled to the processor to receive a video feed of at least part of a second site that is remote from a first site; and

a non-transitory computer storage medium storing instructions operative, when executed on the processor, to perform functions comprising:

rendering the video feed;

determining gaze data of a user of the augmented reality display with respect to the rendered video feed;

obtaining selected AR object data based on input by a user;

outputting via the transceiver the gaze data of the user and the selected AR object data from the AR display; and

displaying via the AR display, a received video feed of at least part of the second site that includes an AR object based on the output selected AR object data, the AR object being positioned based on the output gaze data.

13. The system of claim 12, wherein the processor is disposed in an augmented reality (AR) display located at the first site.

14. The system of claim 12 wherein the processor is disposed in a server in a client-server relationship with an augmented reality (AR) display located at the first site.

15. The system of claim 12 wherein the transceiver coupled to the processor receives the video feed according to a shared coordinate system between the first site and the second site, the shared coordinate system based on an adjacent arrangement, an overlapping arrangement, or a combination of adjacent and overlapping arrangements between the first site the second site.