US20230401486A1 - Machine-learning based gesture recognition - Google Patents
- Publication number
- US20230401486A1 (U.S. application Ser. No. 18/203,635)
- Authority
- US
- United States
- Prior art keywords
- gesture
- swipe
- machine learning
- learning model
- sensor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0487—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
- G06F3/0488—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0487—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser
- G06F3/0488—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures
- G06F3/04883—Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser using a touch-screen or digitiser, e.g. input of commands through traced gestures for inputting data by handwriting, e.g. gesture or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/167—Audio in a user interface, e.g. using voice commands for navigating, audio feedback
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/04—Circuits for transducers, loudspeakers or microphones for correcting frequency response
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/01—Aspects of volume control, not necessarily automatic, in sound systems
Definitions
- the present description relates generally to gesture recognition, including machine-learning based gesture recognition.
- the present disclosure relates generally to electronic devices and in particular to detecting gestures made by a user wearing or otherwise operating an electronic device.
- Wearable technology is attracting considerable interest, including audio accessory devices (e.g., earbuds), where a user can potentially enjoy the benefits of mobile technology with increased convenience.
- FIG. 1 illustrates an example network environment for providing machine-learning based gesture recognition in accordance with one or more implementations.
- FIG. 2 illustrates an example network environment including an example electronic device and an example wireless audio output device in accordance with one or more implementations.
- FIG. 3 illustrates an example architecture that may be implemented by the wireless audio output device for detecting a gesture, without using a touch sensor, based on other sensor output data in accordance with one or more implementations.
- FIGS. 4 A- 4 B illustrate example timing diagrams of sensor outputs from the wireless audio output device that may indicate respective gestures in accordance with one or more implementations.
- FIG. 5 illustrates a flow diagram of an example process for machine-learning based gesture recognition in accordance with one or more implementations.
- FIG. 6 illustrates an example electronic system with which aspects of the subject technology may be implemented in accordance with one or more implementations.
- Wearable devices, such as earphones/headphones/headsets and/or one or more earbuds, can be configured to include various sensors.
- an earbud can be equipped with various sensors, such as an optical sensor (e.g., photoplethysmogram (PPG) sensor), a motion sensor, a proximity sensor, and/or a temperature sensor, that can work independently and/or in concert to perform one or more tasks, such as detecting when an earbud is placed inside a user's ear, detecting when an earbud is placed inside a case, etc.
- An earbud and/or earphones that includes one or more of the aforementioned sensors may also include one or more additional sensors, such as a microphone or array of microphones.
- an earbud and/or earphones may not include a touch sensor for detecting touch inputs and/or touch gestures, in view of size/space constraints, power constraints, and/or manufacturing costs.
- the subject technology enables a device that does not include a touch sensor to detect touch input and/or touch gestures from users by utilizing inputs received via one or more non-touch sensors included in the device. For example, inputs received from one or more non-touch sensors may be applied to a machine learning model to detect whether the inputs correspond to a touch input and/or gesture. In this manner, the subject technology can enable detection of touch inputs and/or touch gestures on a surface of a device, such as taps, swipes, and the like, without the use of a touch sensor.
- FIG. 1 illustrates an example network environment for providing machine-learning based gesture recognition in accordance with one or more implementations. Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided.
- the network environment 100 includes an electronic device 102 , a wireless audio output device 104 , a network 106 , and a server 108 .
- the network 106 may communicatively (directly or indirectly) couple, for example, the electronic device 102 and/or the server 108 .
- the wireless audio output device 104 is illustrated as not being directly coupled to the network 106 ; however, in one or more implementations, the wireless audio output device 104 may be directly coupled to the network 106 .
- the network 106 may be an interconnected network of devices that may include, or may be communicatively coupled to, the Internet.
- connections over the network 106 may be referred to as wide area network connections, while connections between the electronic device 102 and the wireless audio output device 104 may be referred to as peer-to-peer connections.
- the network environment 100 is illustrated in FIG. 1 as including the electronic device 102 , the wireless audio output device 104 , and a single server 108 ; however, the network environment 100 may include any number of electronic devices and any number of servers.
- the server 108 may be, and/or may include all or part of the electronic system discussed below with respect to FIG. 6 .
- the server 108 may include one or more servers, such as a cloud of servers.
- a single server 108 is shown and discussed with respect to various operations. However, these and other operations discussed herein may be performed by one or more servers, and each different operation may be performed by the same or different servers.
- the electronic device may be, for example, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., a digital camera, headphones), a tablet device, a smart speaker, a set-top box, a content streaming device, a wearable device such as a watch, a band, and the like, or any other appropriate device that includes one or more wireless interfaces, such as one or more near-field communication (NFC) radios, WLAN radios, Bluetooth radios, Zigbee radios, cellular radios, and/or other wireless radios.
- the electronic device 102 is depicted as a smartphone.
- the electronic device 102 may be, and/or may include all or part of, the electronic device discussed below with respect to FIG. 2 , and/or the electronic system discussed below with respect to FIG. 6 .
- the wireless audio output device 104 may be, for example, a wireless headset device, wireless headphones, one or more wireless earbuds, a smart speaker, or generally any device that includes audio output circuitry and one or more wireless interfaces, such as near-field communication (NFC) radios, WLAN radios, Bluetooth radios, Zigbee radios, and/or other wireless radios.
- the wireless audio output device 104 is depicted as a set of wireless earbuds.
- the wireless audio output device 104 may include one or more sensors that can be used and/or repurposed to detect input received from a user; however, in one or more implementations, the wireless audio output device 104 may not include a touch sensor.
- a touch sensor can refer to a sensor that measures information arising directly from a physical interaction corresponding to a physical touch (e.g., when the touch sensor receives touch input from a user).
- Some examples of a touch sensor include a surface capacitive sensor, project capacitive sensor, wire resistive sensor, surface acoustic wave sensor, and the like.
- the wireless audio output device 104 may be, and/or may include all or part of, the wireless audio output device discussed below with respect to FIG. 2 , and/or the electronic system discussed below with respect to FIG. 6 .
- the wireless audio output device 104 may be paired, such as via Bluetooth, with the electronic device 102 . After the two devices 102 , 104 are paired together, the devices 102 , 104 may automatically form a secure peer-to-peer connection when located proximate to one another, such as within Bluetooth communication range of one another.
- the electronic device 102 may stream audio, such as music, phone calls, and the like, to the wireless audio output device 104 .
- the subject technology is described herein with respect to a wireless audio output device 104 .
- the subject technology can also be applied to wired devices that do not include touch sensors, such as wired audio output devices. Further for explanatory purposes, the subject technology is discussed with respect to devices that do not include touch sensors.
- the subject technology may be used in conjunction with a touch sensor, such as to enhance and/or improve the detection of touch input by the touch sensor.
- a device may include a low-cost and/or low-power touch sensor that may coarsely detect touch inputs and/or touch gestures, and the coarsely detected touch inputs/gestures can be refined using the subject technology.
- FIG. 2 illustrates an example network environment 200 including an example electronic device 102 and an example wireless audio output device 104 in accordance with one or more implementations.
- the electronic device 102 is depicted in FIG. 2 for explanatory purposes; however, one or more of the components of the electronic device 102 may also be implemented by other electronic device(s).
- the wireless audio output device 104 is depicted in FIG. 2 for explanatory purposes; however, one or more of the components of the wireless audio output device 104 may also be implemented by other device(s). Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided.
- the electronic device 102 may include a host processor 202 A, a memory 204 A, and radio frequency (RF) circuitry 206 A.
- the wireless audio output device 104 may include a host processor 202 B, a memory 204 B, RF circuitry 206 B, a digital signal processor (DSP) 208 , one or more sensor(s) 210 , a specialized processor 212 , and a speaker 214 .
- the sensor(s) 210 may include one or more of a motion sensor such as an accelerometer, an optical sensor, and a sound sensor such as a microphone. It is appreciated that the aforementioned sensors do not include capacitive or resistive touch sensor hardware (or any of the aforementioned examples of touch sensors).
- the RF circuitries 206 A-B may include one or more antennas and one or more transceivers for transmitting/receiving RF communications, such as WiFi, Bluetooth, cellular, and the like.
- the RF circuitry 206 A of the electronic device 102 may include circuitry for forming wide area network connections and peer-to-peer connections, such as WiFi, Bluetooth, and/or cellular circuitry, while the RF circuitry 206 B of the wireless audio output device 104 may include Bluetooth, WiFi, and/or other circuitry for forming peer-to-peer connections.
- the RF circuitry 206 B can be used to receive audio content that can be processed by the host processor 202 B and sent on to the speaker 214 and/or also receive signals from the RF circuitry 206 A of the electronic device 102 for accomplishing tasks such as adjusting a volume output of the speaker 214 among other types of tasks.
- the host processors 202 A-B may include suitable logic, circuitry, and/or code that enable processing data and/or controlling operations of the electronic device 102 and the wireless audio output device 104 , respectively.
- the host processors 202 A-B may be enabled to provide control signals to various other components of the electronic device 102 and the wireless audio output device 104 , respectively.
- the host processors 202 A-B may enable implementation of an operating system or may otherwise execute code to manage operations of the electronic device 102 and the wireless audio output device 104 , respectively.
- the memories 204 A-B may include suitable logic, circuitry, and/or code that enable storage of various types of information such as received data, generated data, code, and/or configuration information.
- the memories 204 A-B may include, for example, random access memory (RAM), read-only memory (ROM), flash, and/or magnetic storage.
- the DSP 208 of the wireless audio output device 104 may include suitable logic, circuitry, and/or code that enable particular processing.
- a given electronic device, such as the wireless audio output device 104 , may include a specialized processor in addition to a host/application processor (e.g., the host processor 202 B). Such a specialized processor may be a low computing power processor that is engineered to utilize less energy than the CPU or GPU and that is designed, in an example, to run continuously on the electronic device in order to collect audio and/or sensor data.
- such a specialized processor can be an Always On Processor (AOP), which is a small and low power auxiliary processor that is implemented as an embedded motion coprocessor.
- the DSP 208 may be, and/or may include all or part of, the specialized processor 212 .
- the specialized processor 212 may be implemented as specialized, custom, and/or dedicated hardware, such as a low-power processor that may be always powered on (e.g., to detect audio triggers, collect and process sensor data from sensors such as accelerometers, optical sensors, and the like) and continuously runs on the wireless audio output device 104 .
- the specialized processor 212 may be utilized to perform certain operations in a more computationally and/or power efficient manner than a main processor (e.g., the host processor 202 B).
- modifications to the neural network model are performed during a compiling process for the neural network model in order to make it compatible with the architecture of the specialized processor 212 .
- the specialized processor 212 may be utilized to execute operations from such a compiled neural network model, as discussed further in FIG. 3 below.
- the wireless audio output device 104 may only include the specialized processor 212 (e.g., exclusive of the host processor 202 B and/or the DSP 208 ).
- the electronic device 102 may pair with the wireless audio output device 104 in order to generate pairing information that can be used to form a connection, such as a peer-to-peer connection between the devices 102 , 104 .
- the pairing may include, for example, exchanging communication addresses, such as Bluetooth addresses.
- the devices 102 , 104 may store the generated and/or exchanged pairing information (e.g., communication addresses) in the respective memories 204 A-B. In this manner, the devices 102 , 104 may automatically, and without user input, connect to each other when in range of communication of the respective RF circuitries 206 A-B using the respective pairing information.
- the sensor(s) 210 may include one or more sensors for detecting device motion, user biometric information (e.g., heartrate), sound, light, wind, and/or generally any environmental input.
- the sensor(s) 210 may include one or more of an accelerometer for detecting device acceleration, one or more microphones for detecting sound and/or an optical sensor for detecting light.
- the wireless audio output device 104 may be configured to output a predicted gesture based on output provided by one or more of the sensor(s) 210 , e.g., corresponding to an input detected by the one or more of the sensor(s) 210 .
- one or more of the host processors 202 A-B, the memories 204 A-B, the RF circuitries 206 A-B, the DSP 208 , and/or the specialized processor 212 , and/or one or more portions thereof may be implemented in software (e.g., subroutines and code), may be implemented in hardware (e.g., an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable devices) and/or a combination of both.
- FIG. 3 illustrates an example architecture 300 that may be implemented by the wireless audio output device 104 for detecting a gesture, without using a touch sensor, based on other sensor output data in accordance with one or more implementations. Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided.
- the architecture 300 may provide for detecting gestures, without using information from a touch sensor, based on sensor outputs that are fed into a gesture prediction engine 302 , which may be executed by the specialized processor 212 .
- Compiling code for a machine learning model for execution on a specialized processor is discussed in more detail in U.S. Provisional Patent Application No. 62/855,840, entitled "Compiling Code For A Machine Learning Model For Execution On A Specialized Processor," filed on May 31, 2019, which is hereby incorporated by reference in its entirety for all purposes.
- the gesture prediction engine 302 includes a machine learning model 304 .
- the machine learning model 304 , in an example, is implemented as a convolutional neural network model that is configured to detect a gesture using such sensor inputs over time. Further, the machine learning model 304 may have been pre-trained on a different device (e.g., the electronic device 102 and/or the server 108 ) based on sensor output data prior to being deployed on the wireless audio output device 104 .
- a neural network is a computing model that uses a collection of connected nodes to process input data based on machine learning techniques.
- Neural networks are referred to as networks because they may be represented by connecting together different operations.
- a model of a NN may be, for example, a feedforward neural network.
- a convolutional neural network as mentioned above is one type of neural network.
- a CNN refers to a particular type of neural network, but uses different types of layers made up of nodes existing in three dimensions where the dimensions may change between layers.
- a node in a layer may only be connected to a subset of the nodes in a previous layer.
- the final output layer may be fully connected and be sized according to the number of classifiers.
- a CNN may include various combinations, and in some instances, multiples of each, and orders of the following types of layers: the input layer, convolutional layers, pooling layers, rectified linear unit layers (ReLU), and fully connected layers.
- Part of the operations performed by a convolutional neural network includes taking a set of filters (or kernels) that are iterated over input data based on one or more parameters.
- convolutional layers read input data (e.g., a 3 D input volume corresponding to sensor output data, a 2 D representation of sensor output data, or a 1 D representation of sensor output data), using a kernel that reads in small segments at a time and steps across the entire input field. Each read can result in an input that is projected onto a filter map and represents an internal interpretation of the input.
- a CNN such as the machine learning model 304 , as discussed herein, can be applied to human activity recognition data (e.g., sensor data corresponding to motion or movement) where a CNN model learns to map a given window of signal data to an activity (e.g., gesture and/or portion of a gesture) where the model reads across each window of data and prepares an internal representation of the window.
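- As an illustrative, non-limiting sketch of the kind of model described above (not taken from the patent itself), the following Python/PyTorch snippet shows a small 1D CNN that maps a window of multi-channel sensor data to gesture-stage scores; the channel count, layer sizes, and the use of PyTorch are assumptions chosen for illustration.

```python
# Illustrative sketch only: a small 1D CNN that maps a window of multi-channel
# sensor data to per-stage gesture scores. Channel count, layer sizes, and the
# use of PyTorch are assumptions for illustration, not details from the patent.
import torch
import torch.nn as nn

NUM_CHANNELS = 5     # assumed: 3-axis accelerometer + optical + microphone envelope
WINDOW_FRAMES = 125  # frames per prediction window (see the description of FIGS. 4A-4B)
NUM_CLASSES = 7      # None, Pre-Up, Swipe Up, Post-Up, Pre-Down, Swipe Down, Post-Down

class GestureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(NUM_CHANNELS, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),      # pool the internal representation across time
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):                 # x: (batch, NUM_CHANNELS, WINDOW_FRAMES)
        h = self.features(x).squeeze(-1)  # (batch, 32)
        return self.classifier(h)         # unnormalized scores per gesture stage

# Example: one window of sensor data -> per-stage scores.
scores = GestureCNN()(torch.randn(1, NUM_CHANNELS, WINDOW_FRAMES))
```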
- a set of sensor outputs from a first sensor output 306 to an Nth sensor output 308 from the aforementioned sensors described in FIG. 2 may be provided as inputs to the machine learning model 304 .
- each sensor output may correspond to a window of time, e.g., 0.5 second, in which the sensor data was collected by the respective sensor.
- the sensor outputs 306 - 308 may be filtered and/or pre-processed, e.g., normalized, before being provided as inputs to the machine learning model 304 .
- the first sensor output 306 to the Nth sensor output 308 includes data from the accelerometer, data from the optical sensor, and audio data from a microphone of the wireless audio output device 104 .
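- As a hedged illustration of assembling such inputs (the sampling layout, window length, and normalization scheme below are assumptions, not details from the patent), one possible pre-processing step might look like the following.

```python
# Illustrative sketch only: stacking per-sensor buffers into one normalized
# input window. Sampling layout, window length, and normalization are assumed.
import numpy as np

def build_input_window(accel_xyz, optical, mic, frames=125):
    """Stack sensor buffers into a (channels, frames) array and normalize.

    accel_xyz: (3, frames) accelerometer samples
    optical:   (frames,)   optical-sensor samples (e.g., reflected-light level)
    mic:       (frames,)   microphone samples (e.g., an energy envelope)
    """
    window = np.vstack([accel_xyz, optical[None, :], mic[None, :]])  # (5, frames)
    # Per-channel zero-mean / unit-variance normalization (one common choice).
    mean = window.mean(axis=1, keepdims=True)
    std = window.std(axis=1, keepdims=True) + 1e-6
    return (window - mean) / std
```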
- in one example, the machine learning model (e.g., a CNN) is trained using data from the accelerometer and data from the optical sensor.
- in another example, the model is trained using data from the accelerometer, data from the optical sensor, and audio data from a microphone.
- the inputs detected by the optical sensor, accelerometer, and microphone may individually and/or collectively be indicative of a touch input.
- a touch input/gesture along and/or on a body/housing of the wireless audio output device 104 such as swiping up or swiping down, tapping, etc., may cause a particular change to the light detected by an optical sensor that is disposed on and/or along the body/housing.
- a touch input/gesture may result in a particular sound that can be detected by a microphone, and/or may cause a particular vibration that can be detected by an accelerometer.
- the wireless audio output device 104 includes multiple microphones disposed at different locations, the sound detected at each microphone may vary based on the location along the body/housing where the touch input/gesture was received.
- After training, the machine learning model 304 generates a set of output predictions corresponding to predicted gesture(s) 310 . After the predictions are generated, a policy may be applied to the predictions to determine whether to indicate an action for the wireless audio output device 104 to perform, which is discussed in more detail in the description of FIGS. 4 A- 4 B below.
- FIGS. 4 A- 4 B illustrate example timing diagrams of sensor outputs from the wireless audio output device 104 that may be indicative of respective touch gestures in accordance with one or more implementations.
- graph 402 and graph 450 include respective timing diagrams of sensor output data from respective motion sensors of the wireless audio output device 104 .
- the x-axis of graph 402 and graph 450 corresponds to values of motion from the accelerometers or other motion sensors, and the y-axis of graph 402 and graph 450 corresponds to time.
- graph 404 and graph 452 include respective timing diagrams of optical sensors of the wireless audio output device 104 .
- the x-axis of graph 404 and graph 452 corresponds to values of light luminosity from the optical sensors (e.g., an amount of light reflected from the user's skin), and the y-axis of graph 404 and graph 452 corresponds to time.
- Segment 410 and segment 412 correspond to a period of time where particular gestures are occurring based on the sensor output data shown in graph 402 and graph 404 .
- segment 460 and segment 462 correspond to a period of time where particular gestures are occurring based on the sensor output data shown in graph 450 and graph 452 .
- Respective predictions of the machine learning model 304 are shown in graph 406 and graph 454 .
- the machine learning model 304 provides ten predictions per second (or some other number of predictions) based on the aforementioned sensor output data, as visually shown in graph 406 and graph 454 .
- the machine learning model 304 provides a prediction output that falls into one of seven different categories: no gesture ("None"), the beginning of a swipe up ("Pre-Up"), the middle of a swipe up ("Swipe Up"), the end of a swipe up ("Post-Up"), the beginning of a swipe down ("Pre-Down"), the middle of a swipe down ("Swipe Down"), or the end of a swipe down ("Post-Down").
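- For illustration only, the seven categories above could be represented as a simple enumeration for downstream policy code; the names below mirror the labels in the description but are otherwise arbitrary.

```python
# Illustrative sketch only: the seven prediction categories as an enumeration.
from enum import Enum

class GestureStage(Enum):
    NONE = 0        # no gesture
    PRE_UP = 1      # beginning of a swipe up
    SWIPE_UP = 2    # middle of a swipe up
    POST_UP = 3     # end of a swipe up
    PRE_DOWN = 4    # beginning of a swipe down
    SWIPE_DOWN = 5  # middle of a swipe down
    POST_DOWN = 6   # end of a swipe down
```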
- the different categories of predicted gestures enable more robustness to boundary conditions, so that the machine learning model 304 does not need to provide a "hard" classification between none and up; instead, there is a transitionary period where sensor output data can move through gesture stages corresponding to none (e.g., no gesture), pre-up, swipe up, and then post-up before falling back down to none.
- the x-axis indicates a number of frames, where each frame may correspond to an individual data point. In an example, a number of frames (e.g., 125 frames) can be utilized for each prediction provided by the machine learning model 304 .
- the machine learning model 304 further utilizes a policy to determine a prediction output.
- a policy can correspond to a function that determines a mapping of a particular input (e.g., sensor output data) to a corresponding action (e.g., providing a respective prediction).
- the machine learning model 304 only utilizes sensor output data corresponding to either a swipe up or a swipe down to make a classification, and the policy can determine an average of a number of previous predictions (e.g., 5 previous predictions).
- the machine learning model 304 takes the previous predictions over a certain window of time, and when the average of these predictions exceeds a particular threshold, the machine learning model 304 can indicate a particular action (e.g., adjusting the volume of the speaker 214 ) for the wireless audio output device 104 to initiate.
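- A minimal sketch of such an averaging policy is shown below; the history length, threshold, and action names are assumptions for illustration, not values specified by the patent.

```python
# Illustrative sketch only: average the last few swipe-up/swipe-down
# probabilities and trigger an action when the average crosses a threshold.
# History length, threshold, and action names are assumptions.
from collections import deque

class SwipePolicy:
    def __init__(self, history=5, threshold=0.6):
        self.up = deque(maxlen=history)
        self.down = deque(maxlen=history)
        self.threshold = threshold

    def update(self, p_swipe_up, p_swipe_down):
        """Feed per-window probabilities; return an action name or None."""
        self.up.append(p_swipe_up)
        self.down.append(p_swipe_down)
        if len(self.up) == self.up.maxlen:
            if sum(self.up) / len(self.up) > self.threshold:
                return "volume_up"       # e.g., increase the speaker volume
            if sum(self.down) / len(self.down) > self.threshold:
                return "volume_down"     # e.g., decrease the speaker volume
        return None
```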
- a state machine may be utilized to further refine the predictions, e.g., based on previous predictions over a window of time.
- a respective machine learning model can run independently on each of the pair of wireless audio output devices.
- the pair of wireless audio output devices can communicate with each other to determine whether both of the wireless audio output devices detected a swipe gesture around the same time. If both wireless audio output devices detect a swipe gesture around the same time, whether the swipe gestures are in the same direction or opposite directions, the policy can suppress the detection of the swipe gestures. This helps prevent detecting different-direction swipes on both wireless audio output devices, or a false trigger that is not due to touch input on one of the wireless audio output devices but instead due to a general motion (e.g., one not corresponding to a particular action for the wireless audio output device to take).
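- As a non-authoritative sketch of this suppression policy, assuming a simple coincidence window (the actual criterion is not specified in the text), the check might be expressed as follows.

```python
# Illustrative sketch only: suppress swipes reported by both earbuds at
# roughly the same time, which likely reflects a general motion rather than
# touch input on one earbud. The coincidence window is an assumption.
def should_suppress(left_swipe_time, right_swipe_time, coincidence_window_s=0.3):
    """Return True if both earbuds reported a swipe within the window."""
    if left_swipe_time is None or right_swipe_time is None:
        return False
    return abs(left_swipe_time - right_swipe_time) <= coincidence_window_s
```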
- FIG. 5 illustrates a flow diagram of an example process 500 for machine-learning based gesture recognition in accordance with one or more implementations.
- the process 500 is primarily described herein with reference to the wireless audio output device 104 of FIG. 1 .
- the process 500 is not limited to the wireless audio output device 104 of FIG. 1 , and one or more blocks (or operations) of the process 500 may be performed by one or more other components and other suitable devices.
- the blocks of the process 500 are described herein as occurring in serial, or linearly. However, multiple blocks of the process 500 may occur in parallel.
- the blocks of the process 500 need not be performed in the order shown and/or one or more blocks of the process 500 need not be performed and/or can be replaced by other operations.
- the wireless audio output device 104 receives, from a first sensor (e.g., one of the sensor(s) 210 ), first sensor output of a first type ( 502 ).
- the wireless audio output device 104 receives, from a second sensor (e.g., another one of the sensor(s) 210 ), second sensor output of a second type ( 504 ).
- the wireless audio output device 104 provides the first sensor output and the second sensor output as inputs to a machine learning model, the machine learning model having been trained to output a predicted gesture based on sensor output of the first type and sensor output of the second type ( 506 ).
- the wireless audio output device 104 provides a predicted gesture based on output from the machine learning model ( 508 ).
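- The four blocks of the process 500 can be summarized, purely for illustration, by the following sketch; the sensor-reading helpers and the `model` callable are hypothetical stand-ins rather than actual APIs.

```python
# Illustrative sketch only: blocks 502-508 of the process expressed as one
# function. `first_sensor`, `second_sensor`, and `model` are hypothetical
# stand-ins, not actual APIs.
def run_gesture_recognition(first_sensor, second_sensor, model):
    first_output = first_sensor.read()    # block 502: receive first sensor output
    second_output = second_sensor.read()  # block 504: receive second sensor output
    prediction = model((first_output, second_output))  # block 506: apply the ML model
    return prediction                     # block 508: provide the predicted gesture
```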
- the first sensor and the second sensor may be non-touch sensors in an implementation.
- the first sensor and/or the second sensor may include an accelerometer, a microphone or an optical sensor.
- the first sensor output and the second sensor output may correspond to sensor input detected from a touch gesture provided by a user with respect to wireless audio output device 104 , such as a tap, a swipe up, a swipe down, and the like.
- the predicted gesture includes at least one of: a start swipe up, a middle swipe up, an end swipe up, a start swipe down, a middle swipe down, an end swipe down or a non-swipe.
- Respective sensor inputs for each of the aforementioned gestures may be different inputs.
- sensor inputs for a start swipe down or a start swipe up can correspond to accelerometer inputs indicating acceleration or movement in a first direction or a second direction, respectively, and/or whether sensor inputs are received at a particular optical sensor located on the wireless audio output device 104 (e.g., where the wireless audio output device 104 includes at least two optical sensors) and/or sensor inputs at a particular microphone (e.g., where the wireless audio output device 104 may include one or more microphones).
- sensor inputs for an end swipe up or an end swipe down correspond to accelerometer inputs indicating acceleration or movement has ended and/or sensor inputs to a second particular microphone and/or whether sensor inputs are received at a particular optical sensor located on the wireless audio output device 104 .
- An output level, e.g., an audio output level or volume, of wireless audio output device 104 can be adjusted at least based on the predicted gesture.
- sensor inputs corresponding to a swipe down gesture can be determined based on a combination of predicted gestures including a start swipe down, a middle swipe down, and an end swipe down.
- the swipe down gesture can be predicted based on the particular order of the aforementioned predicted gestures, such as where the start swipe down is predicted first, the middle swipe down is predicted second, and the end swipe down is predicted third.
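- A minimal sketch of such an ordered-stage check is shown below; the string stage labels are an assumed representation, not one prescribed by the patent.

```python
# Illustrative sketch only: recognize a full swipe-down from ordered stage
# predictions (start, then middle, then end); a swipe-up would be symmetric.
def detect_swipe_down(stage_sequence):
    """stage_sequence: iterable of stage labels, e.g. ["none", "pre_down", ...]."""
    expected = ["pre_down", "swipe_down", "post_down"]
    idx = 0
    for stage in stage_sequence:
        if stage == expected[idx]:
            idx += 1
            if idx == len(expected):
                return True   # start, middle, and end were observed in order
    return False
```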
- the system determines a particular action for the wireless audio output device 104 based on the predicted gesture such as adjusting the audio output level of the wireless audio output device 104 .
- sensor input corresponding to a middle swipe up or a middle swipe down results in a respective predicted gesture for performing an action that increases or decreases an audio output level of the wireless audio output device 104 .
- the predicted gesture is based at least in part on the middle swipe up or the middle swipe down, and adjusting the audio output level includes increasing or decreasing the audio output level of the audio output device by a particular increment.
- a combination of various gestures and/or a particular order of the combination of gestures can correspond to a particular action that is based on a continuous gesture (i.e., control that is proportional to the distance traveled by the finger).
- adjusting the audio output level is based at least in part on 1) a first distance based on the start swipe up, the middle swipe up, and the end swipe up, or 2) a second distance based on the start swipe down, the middle swipe down, and the end swipe down.
- adjusting the audio output level includes increasing or decreasing the audio output level of the audio output device 104 in proportion to the first distance or the second distance.
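- As an illustrative sketch of proportional adjustment (the distance estimate, scaling factor, and clamping range below are assumptions), the volume update might be computed as follows.

```python
# Illustrative sketch only: scale the volume change by an estimate of swipe
# distance, here the number of "middle" stage predictions observed between
# the start and end stages. Step size and clamping range are assumptions.
def adjust_volume(current_volume, middle_prediction_count, direction,
                  step_per_prediction=0.02):
    delta = middle_prediction_count * step_per_prediction
    if direction == "up":
        new_volume = current_volume + delta
    else:
        new_volume = current_volume - delta
    return min(1.0, max(0.0, new_volume))  # clamp to the [0, 1] volume range
```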
- this gathered data may include personal information data that uniquely identifies or can be used to identify a specific person.
- personal information data can include demographic data, location-based data, online identifiers, telephone numbers, email addresses, home addresses, data or records relating to a user's health or level of fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
- the present disclosure recognizes that the use of such personal information data, in the present technology, can be used to the benefit of users.
- the personal information data can be used for providing information corresponding to a user in association with messaging. Accordingly, use of such personal information data may facilitate transactions (e.g., on-line transactions).
- other uses for personal information data that benefit the user are also contemplated by the present disclosure. For instance, health and fitness data may be used, in accordance with the user's preferences, to provide insights into their general wellness, or may be used as positive feedback to individuals using technology to pursue wellness goals.
- the present disclosure contemplates that those entities responsible for the collection, analysis, disclosure, transfer, storage, or other use of such personal information data will comply with well-established privacy policies and/or privacy practices.
- such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users.
- Such information regarding the use of personal data should be prominently and easily accessible by users, and should be updated as the collection and/or use of data changes.
- Personal information from users should be collected for legitimate uses only. Further, such collection/sharing should occur only after receiving the consent of the users or other legitimate basis specified in applicable law. Additionally, such entities should consider taking any needed steps for safeguarding and securing access to such personal information data and ensuring that others with access to the personal information data adhere to their privacy policies and procedures.
- policies and practices should be adapted for the particular types of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations which may serve to impose a higher standard. For instance, in the US, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
- the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data.
- the present technology can be configured to allow users to select to “opt in” or “opt out” of participation in the collection of personal information data during registration for services or anytime thereafter.
- the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an app that their personal information data will be accessed and then reminded again just before personal information data is accessed by the app.
- personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed.
- data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
- the present disclosure broadly covers use of personal information data to implement one or more various disclosed embodiments, the present disclosure also contemplates that the various embodiments can also be implemented without the need for accessing such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data.
- FIG. 6 illustrates an electronic system 600 with which one or more implementations of the subject technology may be implemented.
- the electronic system 600 can be, and/or can be a part of, the electronic device 102 , the wireless audio output device 104 , and/or the server 108 shown in FIG. 1 .
- the electronic system 600 may include various types of computer readable media and interfaces for various other types of computer readable media.
- the electronic system 600 includes a bus 608 , one or more processing unit(s) 612 , a system memory 604 (and/or buffer), a ROM 610 , a permanent storage device 602 , an input device interface 614 , an output device interface 606 , and one or more network interfaces 616 , or subsets and variations thereof.
- the bus 608 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 600 .
- the bus 608 communicatively connects the one or more processing unit(s) 612 with the ROM 610 , the system memory 604 , and the permanent storage device 602 . From these various memory units, the one or more processing unit(s) 612 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure.
- the one or more processing unit(s) 612 can be a single processor or a multi-core processor in different implementations.
- the ROM 610 stores static data and instructions that are needed by the one or more processing unit(s) 612 and other modules of the electronic system 600 .
- the permanent storage device 602 may be a read-and-write memory device.
- the permanent storage device 602 may be a non-volatile memory unit that stores instructions and data even when the electronic system 600 is off.
- a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the permanent storage device 602 .
- in one or more implementations, a removable storage device (such as a floppy disk or a flash drive, and its corresponding disk drive) may be used as the permanent storage device 602 .
- the system memory 604 may be a read-and-write memory device. However, unlike the permanent storage device 602 , the system memory 604 may be a volatile read-and-write memory, such as random access memory.
- the system memory 604 may store any of the instructions and data that one or more processing unit(s) 612 may need at runtime.
- the processes of the subject disclosure are stored in the system memory 604 , the permanent storage device 602 , and/or the ROM 610 . From these various memory units, the one or more processing unit(s) 612 retrieves instructions to execute and data to process in order to execute the processes of one or more implementations.
- the bus 608 also connects to the input and output device interfaces 614 and 606 .
- the input device interface 614 enables a user to communicate information and select commands to the electronic system 600 .
- Input devices that may be used with the input device interface 614 may include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices”).
- the output device interface 606 may enable, for example, the display of images generated by electronic system 600 .
- Output devices that may be used with the output device interface 606 may include, for example, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid state display, a projector, or any other device for outputting information.
- One or more implementations may include devices that function as both input and output devices, such as a touchscreen.
- feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- the bus 608 also couples the electronic system 600 to one or more networks and/or to one or more network nodes, such as the server 108 shown in FIG. 1 , through the one or more network interface(s) 616 .
- the electronic system 600 can be a part of a network of computers (such as a LAN, a wide area network ("WAN"), or an Intranet), or a network of networks, such as the Internet. Any or all components of the electronic system 600 can be used in conjunction with the subject disclosure.
- Implementations within the scope of the present disclosure can be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions.
- the tangible computer-readable storage medium also can be non-transitory in nature.
- the computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions.
- the computer-readable medium can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM.
- the computer-readable medium also can include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
- the computer-readable storage medium can include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions.
- the tangible computer-readable storage medium can be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium can be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
- Instructions can be directly executable or can be used to develop executable instructions.
- instructions can be realized as executable or non-executable machine code or as instructions in a high-level language that can be compiled to produce executable or non-executable machine code.
- instructions also can be realized as or can include data.
- Computer-executable instructions also can be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, etc. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions can vary significantly without varying the underlying logic, function, processing, and output.
- any specific order or hierarchy of blocks in the processes disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
- As used in this specification and any claims of this application, the terms "base station", "receiver", "computer", "server", "processor", and "memory" all refer to electronic or other technological devices. These terms exclude people or groups of people.
- display or “displaying” means displaying on an electronic device.
- the phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item).
- the phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items.
- phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
- a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation.
- a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
- phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and alike are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology.
- a disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations.
- a disclosure relating to such phrase(s) may provide one or more examples.
- a phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Medical Informatics (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The subject technology receives, from a first sensor of a device, first sensor output of a first type. The subject technology receives, from a second sensor of the device, second sensor output of a second type, the first and second sensors being non-touch sensors. The subject technology provides the first sensor output and the second sensor output as inputs to a machine learning model, the machine learning model having been trained to output a predicted touch-based gesture based on sensor output of the first type and sensor output of the second type. The subject technology provides a predicted touch-based gesture based on output from the machine learning model. Further, the subject technology adjusts an audio output level of the device based on the predicted gesture, where the device is an audio output device.
Description
- The present application is a continuation of U.S. patent application Ser. No. 16/937,479, entitled “Machine-Learning Based Gesture Recognition,” filed on Jul. 23, 2020, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/878,695, entitled “Machine-Learning Based Gesture Recognition,” filed on Jul. 25, 2019, each of which is hereby incorporated by reference in its entirety for all purposes.
- The present description relates generally to gesture recognition, including machine-learning based gesture recognition.
- The present disclosure relates generally to electronic devices and in particular to detecting gestures made by a user wearing or otherwise operating an electronic device. Wearable technology is attracting considerable interest, including audio accessory devices (e.g., earbuds), where a user can potentially enjoy the benefits of mobile technology with increased convenience.
- Certain features of the subject technology are set forth in the appended claims. However, for purpose of explanation, several embodiments of the subject technology are set forth in the following figures.
-
FIG. 1 illustrates an example network environment for providing machine-learning based gesture recognition in accordance with one or more implementations. -
FIG. 2 illustrates an example network environment including an example electronic device and an example wireless audio output device in accordance with one or more implementations. -
FIG. 3 illustrates an example architecture, that may be implemented by the wireless audio output device, for detecting a gesture, without using a touch sensor, based on other sensor output data in accordance with one or more implementations. -
FIGS. 4A-4B illustrate example timing diagrams of sensor outputs from the wireless audio output device that may indicate respective gestures in accordance with one or more implementations. -
FIG. 5 illustrates a flow diagram of an example process for machine-learning based gesture recognition in accordance with one or more implementations. -
FIG. 6 illustrates an example electronic system with which aspects of the subject technology may be implemented in accordance with one or more implementations. - The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.
- Wearable devices, such as earphones/headphones/headset and/or one or more earbuds, can be configured to include various sensors. For example, an earbud can be equipped with various sensors, such as an optical sensor (e.g., photoplethysmogram (PPG) sensor), a motion sensor, a proximity sensor, and/or a temperature sensor, that can work independently and/or in concert to perform one or more tasks, such as detecting when an earbud is placed inside a user's ear, detecting when an earbud is placed inside a case, etc. An earbud and/or earphones that includes one or more of the aforementioned sensors may also include one or more additional sensors, such as a microphone or array of microphones. However, an earbud and/or earphones may not include a touch sensor for detecting touch inputs and/or touch gestures, in view of size/space constraints, power constraints, and/or manufacturing costs.
- Nonetheless, it may be desirable to allow devices that do not include touch sensors, such as earbuds, to detect touch input and/or touch gestures from users. The subject technology enables a device that does not include a touch sensor to detect touch input and/or touch gestures from users by utilizing inputs received via one or more non-touch sensors included in the device. For example, inputs received from one or more non-touch sensors may be applied to a machine learning model to detect whether the inputs correspond to a touch input and/or gesture. In this manner, the subject technology can enable detection of touch inputs and/or touch gestures on a surface of a device, such as taps, swipes, and the like, without the use of a touch sensor.
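- To make that data flow concrete, the following is a minimal, illustrative sketch in Python; the function name, model interface, and label set are hypothetical and are not taken from the implementation described below. It simply shows windows of non-touch sensor data being combined and handed to a trained classifier that emits a touch-gesture label.

```python
# Hypothetical sketch: non-touch sensor windows -> trained model -> gesture label.
import numpy as np

GESTURE_LABELS = ["none", "tap", "swipe_up", "swipe_down"]  # assumed label set

def predict_touch_gesture(model, accel_window, optical_window, audio_window):
    """Combine per-sensor windows and return the most likely gesture label.

    Each *_window argument is a NumPy array covering the same span of time
    (e.g., a fraction of a second of samples from one non-touch sensor).
    """
    # Stack the sensor channels into a single multi-channel feature window.
    features = np.concatenate(
        [np.atleast_2d(accel_window),
         np.atleast_2d(optical_window),
         np.atleast_2d(audio_window)], axis=0)
    scores = model(features[np.newaxis, ...])  # model returns per-label scores
    return GESTURE_LABELS[int(np.argmax(scores))]
```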
-
FIG. 1 illustrates an example network environment for providing machine-learning based gesture recognition in accordance with one or more implementations. Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided. - The
network environment 100 includes anelectronic device 102, a wirelessaudio output device 104, anetwork 106, and aserver 108. Thenetwork 106 may communicatively (directly or indirectly) couple, for example, theelectronic device 102 and/or theserver 108. InFIG. 1 , the wirelessaudio output device 104 is illustrated as not being directly coupled to thenetwork 106; however, in one or more implementations, the wirelessaudio output device 104 may be directly coupled to thenetwork 106. - The
network 106 may be an interconnected network of devices that may include, or may be communicatively coupled to, the Internet. In one or more implementations, connections over the network 106 may be referred to as wide area network connections, while connections between the electronic device 102 and the wireless audio output device 104 may be referred to as peer-to-peer connections. For explanatory purposes, the network environment 100 is illustrated in FIG. 1 as including the electronic device 102, the wireless audio output device 104, and a single server 108; however, the network environment 100 may include any number of electronic devices and any number of servers. - The
server 108 may be, and/or may include all or part of the electronic system discussed below with respect toFIG. 6 . Theserver 108 may include one or more servers, such as a cloud of servers. For explanatory purposes, asingle server 108 is shown and discussed with respect to various operations. However, these and other operations discussed herein may be performed by one or more servers, and each different operation may be performed by the same or different servers. - The electronic device may be, for example, a portable computing device such as a laptop computer, a smartphone, a peripheral device (e.g., a digital camera, headphones), a tablet device, a smart speaker, a set-top box, a content streaming device, a wearable device such as a watch, a band, and the like, or any other appropriate device that includes one or more wireless interfaces, such as one or more near-field communication (NFC) radios, WLAN radios, Bluetooth radios, Zigbee radios, cellular radios, and/or other wireless radios. In
FIG. 1 , by way of example, theelectronic device 102 is depicted as a smartphone. Theelectronic device 102 may be, and/or may include all or part of, the electronic device discussed below with respect toFIG. 2 , and/or the electronic system discussed below with respect toFIG. 6 . - The wireless
audio output device 104 may be, for example, a wireless headset device, wireless headphones, one or more wireless earbuds, a smart speaker, or generally any device that includes audio output circuitry and one or more wireless interfaces, such as near-field communication (NFC) radios, WLAN radios, Bluetooth radios, Zigbee radios, and/or other wireless radios. InFIG. 1 , by way of example, the wirelessaudio output device 104 is depicted as a set of wireless earbuds. As is discussed further below, the wirelessaudio output device 104 may include one or more sensors that can be used and/or repurposed to detect input received from a user; however, in one or more implementations, the wirelessaudio output device 104 may not include a touch sensor. As mentioned herein, a touch sensor can refer to a sensor that measures information arising directly from a physical interaction corresponding to a physical touch (e.g., when the touch sensor receives touch input from a user). Some examples of a touch sensor include a surface capacitive sensor, project capacitive sensor, wire resistive sensor, surface acoustic wave sensor, and the like. The wirelessaudio output device 104 may be, and/or may include all or part of, the wireless audio output device discussed below with respect toFIG. 2 , and/or the electronic system discussed below with respect toFIG. 6 . - The wireless
audio output device 104 may be paired, such as via Bluetooth, with the electronic device 102. After the two devices are paired together, the electronic device 102 may stream audio, such as music, phone calls, and the like, to the wireless audio output device 104. - For explanatory purposes, the subject technology is described herein with respect to a wireless
audio output device 104. However, the subject technology can also be applied to wired devices that do not include touch sensors, such as wired audio output devices. Further for explanatory purposes, the subject technology is discussed with respect to devices that do not include touch sensors. However, in one or more implementations, the subject technology may be used in conjunction with a touch sensor, such as to enhance and/or improve the detection of touch input by the touch sensor. For example, a device may include a low-cost and/or low-power touch sensor that may coarsely detect touch inputs and/or touch gestures, and the coarsely detected touch inputs/gestures can be refined using the subject technology. -
FIG. 2 illustrates an example network environment 200 including an exampleelectronic device 102 and an example wirelessaudio output device 104 in accordance with one or more implementations. Theelectronic device 102 is depicted inFIG. 2 for explanatory purposes; however, one or more of the components of theelectronic device 102 may also be implemented by other electronic device(s). Similarly, the wirelessaudio output device 104 is depicted inFIG. 2 for explanatory purposes; however, one or more of the components of the wirelessaudio output device 104 may also be implemented by other device(s). Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided. - The
electronic device 102 may include a host processor 202A, a memory 204A, and radio frequency (RF) circuitry 206A. The wireless audio output device 104 may include a host processor 202B, a memory 204B, RF circuitry 206B, a digital signal processor (DSP) 208, one or more sensor(s) 210, a specialized processor 212, and a speaker 214. In an implementation, the sensor(s) 210 may include one or more of a motion sensor such as an accelerometer, an optical sensor, and a sound sensor such as a microphone. It is appreciated that the aforementioned sensors do not include capacitive or resistive touch sensor hardware (or any of the aforementioned examples of touch sensors). - The RF circuitries 206A-B may include one or more antennas and one or more transceivers for transmitting/receiving RF communications, such as WiFi, Bluetooth, cellular, and the like. In one or more implementations, the
RF circuitry 206A of theelectronic device 102 may include circuitry for forming wide area network connections and peer-to-peer connections, such as WiFi, Bluetooth, and/or cellular circuitry, while theRF circuitry 206B of the wirelessaudio output device 104 may include Bluetooth, WiFi, and/or other circuitry for forming peer-to-peer connections. - In an implementation, the
RF circuitry 206B can be used to receive audio content that can be processed by thehost processor 202B and sent on to thespeaker 214 and/or also receive signals from theRF circuitry 206A of theelectronic device 102 for accomplishing tasks such as adjusting a volume output of thespeaker 214 among other types of tasks. - The
host processors 202A-B may include suitable logic, circuitry, and/or code that enable processing data and/or controlling operations of theelectronic device 102 and the wirelessaudio output device 104, respectively. In this regard, thehost processors 202A-B may be enabled to provide control signals to various other components of theelectronic device 102 and the wirelessaudio output device 104, respectively. Additionally, thehost processors 202A-B may enable implementation of an operating system or may otherwise execute code to manage operations of theelectronic device 102 and the wirelessaudio output device 104, respectively. Thememories 204A-B may include suitable logic, circuitry, and/or code that enable storage of various types of information such as received data, generated data, code, and/or configuration information. Thememories 204A-B may include, for example, random access memory (RAM), read-only memory (ROM), flash, and/or magnetic storage. TheDSP 208 of the wirelessaudio output device 104 may include suitable logic, circuitry, and/or code that enable particular processing. - As discussed herein, a given electronic device, such as the wireless
audio output device 104, may include a specialized processor (e.g., the specialized processor 212) that may be always powered on and/or in an active mode, e.g., even when a host/application processor (e.g., thehost processor 202B) of the device is in a low power mode or in an instance where such an electronic device does not include a host/application processor (e.g., a CPU and/or GPU). Such a specialized processor may be a low computing power processor that is engineered to also utilize less energy than the CPU or GPU, and also is designed, in an example, to be running continuously on the electronic device in order to collect audio and/or sensor data. In an example, such a specialized processor can be an Always On Processor (AOP), which is a small and low power auxiliary processor that is implemented as an embedded motion coprocessor. In one or more implementations, theDSP 208 may be, and/or may include all or part of, thespecialized processor 212. - The
specialized processor 212 may be implemented as specialized, custom, and/or dedicated hardware, such as a low-power processor that may be always powered on (e.g., to detect audio triggers, collect and process sensor data from sensors such as accelerometers, optical sensors, and the like) and continuously runs on the wirelessaudio output device 104. Thespecialized processor 212 may be utilized to perform certain operations in a more computationally and/or power efficient manner. In an example, to enable deployment of a neural network model on thespecialized processor 212, which has less computing power than a main processor (e.g., thehost processor 202B), modifications to the neural network model are performed during a compiling process for the neural network model in order to make it compatible with the architecture of thespecialized processor 212. In an example, thespecialized processor 212 may be utilized to execute operations from such a compiled neural network model, as discussed further inFIG. 3 below. In one or more implementations, the wirelessaudio output device 104 may only include the specialized processor 212 (e.g., exclusive of thehost processor 202B and/or the DSP 208). - In one or more implementations, the
electronic device 102 may pair with the wireless audio output device 104 in order to generate pairing information that can be used to form a connection, such as a peer-to-peer connection, between the devices. The pairing information may be stored in the respective memories 204A-B. In this manner, the devices may establish a connection with each other via the respective RF circuitries 206A-B using the respective pairing information. - The sensor(s) 210 may include one or more sensors for detecting device motion, user biometric information (e.g., heartrate), sound, light, wind, and/or generally any environmental input. For example, the sensor(s) 210 may include one or more of an accelerometer for detecting device acceleration, one or more microphones for detecting sound and/or an optical sensor for detecting light. As discussed further below with respect to
FIGS. 3-5 , the wirelessaudio output device 104 may be configured to output a predicted gesture based on output provided by one or more of the sensor(s) 210, e.g., corresponding to an input detected by the one or more of the sensor(s) 210. - In one or more implementations, one or more of the
host processors 202A-B, thememories 204A-B, theRF circuitries 206A-B, theDSP 208, and/or thespecialized processor 212, and/or one or more portions thereof, may be implemented in software (e.g., subroutines and code), may be implemented in hardware (e.g., an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable devices) and/or a combination of both. -
FIG. 3 illustrates anexample architecture 300, that may be implemented by the wirelessaudio output device 104, for detecting a gesture, without using a touch sensor, based on other sensor output data in accordance with one or more implementations. Not all of the depicted components may be used in all implementations, however, and one or more implementations may include additional or different components than those shown in the figure. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional components, different components, or fewer components may be provided. - In one or more implementations, the
architecture 300 may provide for detecting gestures, without using information from a touch sensor, based on sensor outputs that are fed into agesture prediction engine 302, which may be executed by thespecialized processor 212. Examples of a specialized processor are discussed in more detail in U.S. Provisional Patent Application No. 62/855,840 entitled “Compiling Code For A Machine Learning Model For Execution On A Specialized Processor,” filed on May 31, 2019, which is hereby incorporated by reference in its entirety for all purposes. As illustrated, thegesture prediction engine 302 includes amachine learning model 304. Themachine learning model 304, in an example, is implemented as a convolutional neural network model that is configured to detect a gesture using such sensor inputs over time. Further, themachine learning model 304 may be been pre-trained on a different device (e.g., theelectronic device 102 and/or the server 108) based on sensor output data prior to being deployed on the wirelessaudio output device 104. - As discussed herein, a neural network (NN) is a computing model that uses a collection of connected nodes to process input data based on machine learning techniques. Neural networks are referred to as networks because they may be represented by connecting together different operations. A model of a NN (e.g., feedforward neural network) may be represented as a graph representing how the operations are connected together from an input layer, through one or more hidden layers, and finally to an output layer, with each layer including one or more nodes, and where different layers perform different types of operations on respective input.
- A convolutional neural network (CNN) as mentioned above is one type of neural network. As discussed herein, a CNN refers to a particular type of neural network, but uses different types of layers made up of nodes existing in three dimensions where the dimensions may change between layers. In a CNN, a node in a layer may only be connected to a subset of the nodes in a previous layer. The final output layer may be fully connected and be sized according to the number of classifiers. A CNN may include various combinations, and in some instances, multiples of each, and orders of the following types of layers: the input layer, convolutional layers, pooling layers, rectified linear unit layers (ReLU), and fully connected layers. Part of the operations performed by a convolutional neural network includes taking a set of filters (or kernels) that are iterated over input data based on one or more parameters.
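- As an illustration of the kind of convolutional model discussed in this section, the following is a minimal sketch of a small one-dimensional CNN that maps a multi-channel sensor window to per-category gesture scores. The use of PyTorch, the channel count, the window length, and the number of output categories are assumptions for illustration only; the disclosure does not specify a framework or these sizes.

```python
import torch
from torch import nn

# Placeholder sizes: 5 sensor channels (e.g., 3-axis accelerometer, optical,
# audio envelope), 125-sample windows, 7 output categories.
NUM_CHANNELS, WINDOW_LEN, NUM_CLASSES = 5, 125, 7

class GestureCNN(nn.Module):
    """Small 1D CNN: convolutional feature extraction plus a fully connected head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(NUM_CHANNELS, 16, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time dimension
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):              # x: (batch, channels, time)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)      # per-category scores (logits)

# Example: one window of sensor data -> scores for each gesture category.
scores = GestureCNN()(torch.randn(1, NUM_CHANNELS, WINDOW_LEN))
```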
- In an example, convolutional layers read input data (e.g., a 3D input volume corresponding to sensor output data, a 2D representation of sensor output data, or a 1D representation of sensor output data), using a kernel that reads in small segments at a time and steps across the entire input field. Each read can result in an input that is projected onto a filter map and represents an internal interpretation of the input. A CNN such as the
machine learning model 304, as discussed herein, can be applied to human activity recognition data (e.g., sensor data corresponding to motion or movement) where a CNN model learns to map a given window of signal data to an activity (e.g., gesture and/or portion of a gesture) where the model reads across each window of data and prepares an internal representation of the window. - As illustrated in
FIG. 3 , a set of sensor outputs from a first sensor output 306 to an Nth sensor output 308 from the aforementioned sensors described in FIG. 2 (e.g., the sensor(s) 210) may be provided as inputs to the machine learning model 304. In an implementation, each sensor output may correspond to a window of time (e.g., 0.5 seconds) in which the sensor data was collected by the respective sensor. In one or more implementations, the sensor outputs 306-308 may be filtered and/or pre-processed, e.g., normalized, before being provided as inputs to the machine learning model 304. - In an implementation, the
first sensor output 306 to theNth sensor output 308 includes data from the accelerometer, data from the optical sensor, and audio data from a microphone of the wirelessaudio output device 104. From the collected sensor data, the machine learning model (e.g., a CNN) can be trained. In a first example, the model is trained using data from the accelerometer and data from the optical sensor. In a second example, the model is trained using data from the accelerometer, data from the optical sensor, and audio data from a microphone. - In an example, the inputs detected by the optical sensor, accelerometer, and microphone may individually and/or collectively be indicative of a touch input. For example, a touch input/gesture along and/or on a body/housing of the wireless
audio output device 104, such as swiping up or swiping down, tapping, etc., may cause a particular change to the light detected by an optical sensor that is disposed on and/or along the body/housing. Similarly, such a touch input/gesture may result in a particular sound that can be detected by a microphone, and/or may cause a particular vibration that can be detected by an accelerometer. Furthermore, if the wirelessaudio output device 104 includes multiple microphones disposed at different locations, the sound detected at each microphone may vary based on the location along the body/housing where the touch input/gesture was received. - After training, the
machine learning model 304 generates a set of output predictions corresponding to predicted gesture(s) 310. After the predictions are generated, a policy may be applied to the predictions to determine whether to indicate an action for the wirelessaudio output device 104 to perform, which is discussed in more detail in the description ofFIGS. 4A-4B below. -
FIGS. 4A-4B illustrate example timing diagrams of sensor outputs from the wirelessaudio output device 104 that may be indicative of respective touch gestures in accordance with one or more implementations. - As shown in
FIGS. 4A-4B , graph 402 and graph 450 include respective timing diagrams of sensor output data from respective motion sensors of the wireless audio output device 104. The x-axis of graph 402 and graph 450 corresponds to values of motion from the accelerometers or other motion sensors, and the y-axis of graph 402 and graph 450 corresponds to time. - As also shown, graph 404 and
graph 452 include respective timing diagrams of optical sensors of the wireless audio output device 104. The x-axis of graph 404 and graph 452 corresponds to values of light luminosity from the optical sensors (e.g., an amount of light reflected from the user's skin), and the y-axis of graph 404 and graph 452 corresponds to time. Segment 410 and segment 412 correspond to a period of time where particular gestures are occurring based on the sensor output data shown in graph 402 and graph 404. Similarly, segment 460 and segment 462 correspond to a period of time where particular gestures are occurring based on the sensor output data shown in graph 450 and graph 452. - Respective predictions of the
machine learning model 304 are shown ingraph 406 and graph 454. In an implementation, themachine learning model 304 provides ten predictions per second (or some other amount of predictions) based on the aforementioned sensor output data which is visually shown ingraph 406 and graph 454. Themachine learning model 304 provides a prediction output that falls into one of seven different categories: no data (“none” or no gesture), the beginning of a swipe up (“Pre-Up”), the middle of the swipe up (“Swipe Up”), or the end of the swipe up (“Post-Up”), beginning of a swipe down (“Pre-Down”), the middle of the swipe down (“Swipe Down”), or the end of the swipe down (“Post-Down”). - The different categories of predicted gestures enables more robustness to boundary conditions so that the
machine learning model 304 does not need to provide a “hard” classification between none and up, and consequently there is a transitionary period where sensor output data can go between gesture stages corresponding to none (e.g., no gesture), pre-up, swipe up and then post-up before falling back down to none. Ingraph 406 and graph 454, the x-axis indicates a number of frames where each frame may correspond to individual data points. In an example, a number of frames (e.g., 125 frames) can be utilized for each prediction provided by themachine learning model 304. - In an implementation, as mentioned before, the
machine learning model 304 also further utilizes a policy to determine a prediction output. As referred to herein, a policy can correspond to a function that determines a mapping of a particular input (e.g., sensor output data) to a corresponding action (e.g., providing a respective prediction). In an example, themachine learning model 304 only utilizes sensor output data corresponding to either a swipe up or a swipe down to make a classification, and the policy can determine an average of a number of previous predictions (e.g., 5 previous predictions). Themachine learning model 304 takes the previous predictions over a certain window of time, and when the average of these predictions exceeds a particular threshold, themachine learning model 304 can indicate a particular action (e.g., adjusting the volume of the speaker 214) for the wirelessaudio output device 104 to initiate. In one or more implementations, a state machine may be utilized to further refine the predictions, e.g. based on previous predictions over a window of time. - In an implementation, for a pair of wireless audio output devices, such as a pair of earbuds, a respective machine learning model can run independently on each of the pair of wireless audio output devices. The pair of wireless audio output devices can communicate between each other to determine whether both of the wireless audio output devices detected a swipe gesture around the same time. If both wireless audio output devices detect a swipe gesture around the same time, whether the swipe gesture is in the same direction or opposite directions, the policy can determine to suppress the detection of the swipe gestures to help prevent detecting different direction swipes on both wireless audio output devices, or having a false trigger that is not due to touch input on one of the wireless audio output devices but instead a general motion (e.g., one not corresponding to a particular action for the wireless audio output device to take).
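- The following is a minimal sketch of such a smoothing policy. It assumes, purely for illustration, that each per-frame model output has been reduced to a single signed swipe score (positive for swipe-up-like predictions, negative for swipe-down-like ones); the window length and threshold are placeholders, and a small state machine of the kind mentioned above could be layered on top.

```python
from collections import deque

class SwipePolicy:
    """Average the last N per-frame predictions and fire when a threshold is crossed.

    Hypothetical sketch: `score` is assumed positive for swipe-up-like
    predictions, negative for swipe-down-like predictions, near zero otherwise.
    """
    def __init__(self, window=5, threshold=0.6):
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, score):
        """Add one prediction; return 'swipe_up', 'swipe_down', or None."""
        self.recent.append(score)
        avg = sum(self.recent) / len(self.recent)
        if avg >= self.threshold:
            self.recent.clear()      # reset so one gesture triggers one action
            return "swipe_up"        # e.g., raise the speaker volume
        if avg <= -self.threshold:
            self.recent.clear()
            return "swipe_down"      # e.g., lower the speaker volume
        return None
```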
-
FIG. 5 illustrates a flow diagram of example process for machine-learning based gesture recognition in accordance with one or more implementations. For explanatory purposes, theprocess 500 is primarily described herein with reference to the wirelessaudio output device 104 ofFIG. 1 . However, theprocess 500 is not limited to the wirelessaudio output device 104 ofFIG. 1 , and one or more blocks (or operations) of theprocess 500 may be performed by one or more other components and other suitable devices. Further for explanatory purposes, the blocks of theprocess 500 are described herein as occurring in serial, or linearly. However, multiple blocks of theprocess 500 may occur in parallel. In addition, the blocks of theprocess 500 need not be performed in the order shown and/or one or more blocks of theprocess 500 need not be performed and/or can be replaced by other operations. - The wireless
audio output device 104 receives, from a first sensor (e.g., one of the sensor(s) 210), first sensor output of a first type (502). The wirelessaudio output device 104 receives, from a second sensor (e.g., another one of the sensor(s) 210), second sensor output of a second type (504). - The wireless
audio output device 104 provides the first sensor output and the second sensor output as inputs to a machine learning model, the machine learning model having been trained to output a predicted gesture based on sensor output of the first type and sensor output of the second type (506). - The wireless
audio output device 104 provides a predicted gesture based on output from the machine learning model (508). The first sensor and the second sensor may be non-touch sensors in an implementation. In one or more implementations, the first sensor and/or the second sensor may include an accelerometer, a microphone or an optical sensor. The first sensor output and the second sensor output may correspond to sensor input detected from a touch gesture provided by a user with respect to wirelessaudio output device 104, such as a tap, a swipe up, a swipe down, and the like. In an implementation, the predicted gesture includes at least one of: a start swipe up, a middle swipe up, an end swipe up, a start swipe down, a middle swipe down, an end swipe down or a non-swipe. Respective sensor inputs for each of the aforementioned gestures may be different inputs. For example, sensor inputs for a start swipe down or a start swipe up can correspond to accelerometer inputs indicating acceleration or movement in first direction or a second direction, respectively, and/or whether sensor inputs are received at a particular optical sensor located on the wireless audio output device 104 (e.g., where the wirelessaudio output device 104 includes at least two optical sensors) and/or sensor inputs at a particular microphone (e.g., where the wirelessaudio output device 104 may include one or more microphones). In another example, sensor inputs for an end swipe up or an end swipe down correspond to accelerometer inputs indicating acceleration or movement has ended and/or sensor inputs to a second particular microphone and/or whether sensor inputs are received at a particular optical sensor located on the wirelessaudio output device 104. - An output level, e.g., an audio output level or volume, of wireless
audio output device 104 can be adjusted at least based on the predicted gesture. In an example, sensor inputs corresponding to a swipe down gesture can be determined based a combination of predicted gestures including a start swipe down, a middle swipe down, and an end swipe down. Further, the swipe down gesture can be predicted based on the particular order of aforementioned predicted gestures such as where the start swipe down is predicted first, the middle swipe down is predicted second, and finally the end swipe down is predicted third. The system then determines a particular action for the wirelessaudio output device 104 based on the predicted gesture such as adjusting the audio output level of the wirelessaudio output device 104. - In an example, sensor input corresponding to a middle swipe up or a middle swipe results in a respective predicted gesture for performing an action that increases or decreases an audio output level of the wireless
audio output device 104. In an example, the predicted gesture is based on at least in part on the middle swipe up or the middle swipe down, and adjusting the audio output level includes increasing or decreasing the audio output level of the audio output device by a particular increment. - In another example, a combination of various gestures and/or a particular order of the combination of gestures can correspond to a particular action that is based on a continuous gesture (i.e., control that is proportional to the distance traveled by the finger). In an example, adjusting the audio output level is based at least in part on 1) a first distance based on the start swipe up, the middle swipe up, and the end swipe up, or 2) a second distance based on the start swipe down, the middle swipe down, and the end swipe down. In this example, adjusting the audio output level includes increasing or decreasing the audio output level of the
audio output device 104 in proportion to the first distance or the second distance. - As described above, one aspect of the present technology is the gathering and use of data available from specific and legitimate sources for providing user information in association with messaging. The present disclosure contemplates that in some instances, this gathered data may include personal information data that uniquely identifies or can be used to identify a specific person. Such personal information data can include demographic data, location-based data, online identifiers, telephone numbers, email addresses, home addresses, data or records relating to a user's health or level of fitness (e.g., vital signs measurements, medication information, exercise information), date of birth, or any other personal information.
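- A sketch of such a proportional adjustment is shown below; the gain constant and the way the swipe distance is estimated from the start, middle, and end predictions are assumptions for illustration only.

```python
def adjust_volume(current_level, swipe_direction, swipe_distance, gain=0.5):
    """Scale the volume change by how far the finger travelled.

    swipe_direction: +1 for a swipe up, -1 for a swipe down.
    swipe_distance: normalized 0..1 estimate derived from the start, middle,
                    and end predictions of the swipe (assumed available).
    """
    new_level = current_level + swipe_direction * gain * swipe_distance
    return min(1.0, max(0.0, new_level))   # clamp to the valid output range

# Example: a long swipe up starting from 40% volume.
print(adjust_volume(0.4, +1, 0.8))   # -> 0.8
```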
- The present disclosure recognizes that the use of such personal information data, in the present technology, can be used to the benefit of users. For example, the personal information data can be used for providing information corresponding to a user in association with messaging. Accordingly, use of such personal information data may facilitate transactions (e.g., on-line transactions). Further, other uses for personal information data that benefit the user are also contemplated by the present disclosure. For instance, health and fitness data may be used, in accordance with the user's preferences to provide insights into their general wellness, or may be used as positive feedback to individuals using technology to pursue wellness goals.
- The present disclosure contemplates that those entities responsible for the collection, analysis, disclosure, transfer, storage, or other use of such personal information data will comply with well-established privacy policies and/or privacy practices. In particular, such entities would be expected to implement and consistently apply privacy practices that are generally recognized as meeting or exceeding industry or governmental requirements for maintaining the privacy of users. Such information regarding the use of personal data should be prominently and easily accessible by users, and should be updated as the collection and/or use of data changes. Personal information from users should be collected for legitimate uses only. Further, such collection/sharing should occur only after receiving the consent of the users or other legitimate basis specified in applicable law. Additionally, such entities should consider taking any needed steps for safeguarding and securing access to such personal information data and ensuring that others with access to the personal information data adhere to their privacy policies and procedures. Further, such entities can subject themselves to evaluation by third parties to certify their adherence to widely accepted privacy policies and practices. In addition, policies and practices should be adapted for the particular types of personal information data being collected and/or accessed and adapted to applicable laws and standards, including jurisdiction-specific considerations which may serve to impose a higher standard. For instance, in the US, collection of or access to certain health data may be governed by federal and/or state laws, such as the Health Insurance Portability and Accountability Act (HIPAA); whereas health data in other countries may be subject to other regulations and policies and should be handled accordingly.
- Despite the foregoing, the present disclosure also contemplates embodiments in which users selectively block the use of, or access to, personal information data. That is, the present disclosure contemplates that hardware and/or software elements can be provided to prevent or block access to such personal information data. For example, in the case of providing information corresponding to a user in association with messaging, the present technology can be configured to allow users to select to “opt in” or “opt out” of participation in the collection of personal information data during registration for services or anytime thereafter. In addition to providing “opt in” and “opt out” options, the present disclosure contemplates providing notifications relating to the access or use of personal information. For instance, a user may be notified upon downloading an app that their personal information data will be accessed and then reminded again just before personal information data is accessed by the app.
- Moreover, it is the intent of the present disclosure that personal information data should be managed and handled in a way to minimize risks of unintentional or unauthorized access or use. Risk can be minimized by limiting the collection of data and deleting data once it is no longer needed. In addition, and when applicable, including in certain health related applications, data de-identification can be used to protect a user's privacy. De-identification may be facilitated, when appropriate, by removing identifiers, controlling the amount or specificity of data stored (e.g., collecting location data at city level rather than at an address level), controlling how data is stored (e.g., aggregating data across users), and/or other methods such as differential privacy.
- Therefore, although the present disclosure broadly covers use of personal information data to implement one or more various disclosed embodiments, the present disclosure also contemplates that the various embodiments can also be implemented without the need for accessing such personal information data. That is, the various embodiments of the present technology are not rendered inoperable due to the lack of all or a portion of such personal information data.
-
FIG. 6 illustrates anelectronic system 600 with which one or more implementations of the subject technology may be implemented. Theelectronic system 600 can be, and/or can be a part of, one or more of the electronic devices 102-106, and/or one or theserver 108 shown inFIG. 1 . Theelectronic system 600 may include various types of computer readable media and interfaces for various other types of computer readable media. Theelectronic system 600 includes abus 608, one or more processing unit(s) 612, a system memory 604 (and/or buffer), aROM 610, apermanent storage device 602, aninput device interface 614, anoutput device interface 606, and one ormore network interfaces 616, or subsets and variations thereof. - The
bus 608 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of theelectronic system 600. In one or more implementations, thebus 608 communicatively connects the one or more processing unit(s) 612 with theROM 610, thesystem memory 604, and thepermanent storage device 602. From these various memory units, the one or more processing unit(s) 612 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure. The one or more processing unit(s) 612 can be a single processor or a multi-core processor in different implementations. - The
ROM 610 stores static data and instructions that are needed by the one or more processing unit(s) 612 and other modules of theelectronic system 600. Thepermanent storage device 602, on the other hand, may be a read-and-write memory device. Thepermanent storage device 602 may be a non-volatile memory unit that stores instructions and data even when theelectronic system 600 is off. In one or more implementations, a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as thepermanent storage device 602. - In one or more implementations, a removable storage device (such as a floppy disk, flash drive, and its corresponding disk drive) may be used as the
permanent storage device 602. Like thepermanent storage device 602, thesystem memory 604 may be a read-and-write memory device. However, unlike thepermanent storage device 602, thesystem memory 604 may be a volatile read-and-write memory, such as random access memory. Thesystem memory 604 may store any of the instructions and data that one or more processing unit(s) 612 may need at runtime. In one or more implementations, the processes of the subject disclosure are stored in thesystem memory 604, thepermanent storage device 602, and/or theROM 610. From these various memory units, the one or more processing unit(s) 612 retrieves instructions to execute and data to process in order to execute the processes of one or more implementations. - The
bus 608 also connects to the input and output device interfaces 614 and 606. Theinput device interface 614 enables a user to communicate information and select commands to theelectronic system 600. Input devices that may be used with theinput device interface 614 may include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices”). Theoutput device interface 606 may enable, for example, the display of images generated byelectronic system 600. Output devices that may be used with theoutput device interface 606 may include, for example, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid state display, a projector, or any other device for outputting information. One or more implementations may include devices that function as both input and output devices, such as a touchscreen. In these implementations, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. - Finally, as shown in
FIG. 6 , thebus 608 also couples theelectronic system 600 to one or more networks and/or to one or more network nodes, such as theserver 108 shown inFIG. 1 , through the one or more network interface(s) 616. In this manner, theelectronic system 600 can be a part of a network of computers (such as a LAN, a wide area network (“WAN”), or an Intranet, or a network of networks, such as the Internet. Any or all components of theelectronic system 600 can be used in conjunction with the subject disclosure. - Implementations within the scope of the present disclosure can be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions. The tangible computer-readable storage medium also can be non-transitory in nature.
- The computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer-readable medium also can include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
- Further, the computer-readable storage medium can include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium can be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium can be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
- Instructions can be directly executable or can be used to develop executable instructions. For example, instructions can be realized as executable or non-executable machine code or as instructions in a high-level language that can be compiled to produce executable or non-executable machine code. Further, instructions also can be realized as or can include data. Computer-executable instructions also can be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, etc. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions can vary significantly without varying the underlying logic, function, processing, and output.
- While the above discussion primarily refers to microprocessor or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.
- Those of skill in the art would appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. Various components and blocks may be arranged differently (e.g., arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
- It is understood that any specific order or hierarchy of blocks in the processes disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
- As used in this specification and any claims of this application, the terms “base station”, “receiver”, “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” means displaying on an electronic device.
- As used herein, the phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
- The predicate words “configured to”, “operable to”, and “programmed to” do not imply any particular tangible or intangible modification of a subject, but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation. Likewise, a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
- Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some implementations, one or more implementations, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and alike are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
- The word “exemplary” is used herein to mean “serving as an example, instance, or illustration”. Any embodiment described herein as “exemplary” or as an “example” is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the term “include”, “have”, or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
- All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for”.
- The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more”. Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.
Claims (21)
1-20. (canceled)
21. A method, comprising:
providing sensor data as input to a machine learning model, the machine learning model having been trained to predict, based on the sensor data, multiple features of a gesture, and
obtaining a predicted gesture based on an output from the machine learning model, the output generated in response to the providing the sensor data as input to the machine learning model, wherein the output comprises at least one of the multiple features of the gesture; and
adjusting a function of a device based on the predicted gesture.
22. The method of claim 21, wherein the gesture comprises a touch-based gesture on a surface of the device, and wherein the sensor data comprises sensor data from a non-touch sensor of the device.
23. The method of claim 21, wherein the multiple features of the gesture comprise at least a start portion of the gesture, a middle portion of the gesture, and an end portion of the gesture.
24. The method of claim 23, wherein the multiple features of the gesture further comprise a non-gesture.
25. The method of claim 24, wherein the gesture comprises a swipe up gesture, wherein the non-gesture comprises a non-swipe, and wherein the multiple features of the gesture comprise at least: a start swipe up, a middle swipe up, an end swipe up, and the non-swipe.
26. The method of claim 25, wherein the machine learning model is further configured to predict multiple features of another gesture.
27. The method of claim 26, wherein the other gesture comprises a swipe down gesture, and wherein the multiple features of the other gesture comprise a start swipe down, a middle swipe down, an end swipe down, and the non-swipe.
28. The method of claim 27, wherein the predicted gesture is based on a combination of multiple predictions from the machine learning model.
29. The method of claim 28, wherein the multiple predictions include at least the middle swipe up or the middle swipe down, and wherein adjusting the function of the device comprises adjusting an audio output level of the device by a particular increment.
30. The method of claim 28, wherein the multiple predictions include at least the start swipe up, the middle swipe up, and the end swipe up, and wherein adjusting the function of the device comprises adjusting an audio output level based on a distance determined using the start swipe up, the middle swipe up, and the end swipe up.
31. The method of claim 30, wherein adjusting the audio output level based on the distance determined using the start swipe up, the middle swipe up, and the end swipe up comprises increasing or decreasing the audio output level in proportion to the distance.
32. The method of claim 21, wherein the function of the device comprises an audio streaming function of the device.
33. The method of claim 32, wherein the audio streaming function comprises streaming of audio corresponding to a phone call.
34. A non-transitory computer readable medium storing instructions which, when executed by one or more processors, cause the one or more processors to:
provide sensor data as input to a machine learning model, the machine learning model having been trained to predict, based on the sensor data, multiple features of a gesture, and
obtain a predicted gesture based on an output from the machine learning model, the output generated in response to receiving the sensor data as input to the machine learning model, wherein the output comprises at least one of the multiple features of the gesture; and
adjust a function of a device based on the predicted gesture.
35. The non-transitory computer readable medium of claim 34, wherein the gesture comprises a touch-based gesture on a surface of the device, and wherein the sensor data comprises sensor data from a non-touch sensor of the device.
36. The non-transitory computer readable medium of claim 34, wherein the multiple features of the gesture comprise at least a start portion of the gesture, a middle portion of the gesture, and an end portion of the gesture.
37. The non-transitory computer readable medium of claim 36, wherein the machine learning model has been further trained to predict a non-gesture and wherein the gesture comprises a swipe up gesture, wherein the non-gesture comprises a non-swipe, and wherein the multiple features of the gesture comprise at least: a start swipe up, a middle swipe up, and an end swipe up.
38. The non-transitory computer readable medium of claim 37, wherein the machine learning model has been further trained to predict multiple features of another gesture different from the gesture.
39. The non-transitory computer readable medium of claim 38, wherein the other gesture comprises a swipe down gesture, and wherein the multiple features of the other gesture comprise a start swipe down, a middle swipe down, and an end swipe down.
40. An electronic device, comprising:
a memory; and
one or more processors configured to:
provide sensor data as input to a machine learning model, the machine learning model having been trained to predict, based on the sensor data, multiple features of a gesture, and
obtain a predicted gesture based on an output from the machine learning model, the output generated in response to receiving the sensor data as input to the machine learning model, wherein the output comprises at least one of the multiple features of the gesture; and
adjust a function of the electronic device based on the predicted gesture.
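As an implementation-oriented aid to reading claims 21-33, the following is a minimal Python sketch of the claimed flow: per-window sensor data is classified into start/middle/end swipe-up, swipe-down, or non-swipe features, multiple predictions are combined into a single predicted gesture, and an audio output level is adjusted either by a fixed increment or in proportion to the swipe distance. It is an illustration only, not the specification's implementation; the names (`GESTURE_CLASSES`, `Prediction`, `classify_window`, `combine_predictions`, `adjust_volume`), the thresholds, and the stand-in classifier are all assumptions.

```python
# Minimal sketch, assuming a plain Python runtime; all names and thresholds
# below are illustrative and are not taken from the patent claims.
from dataclasses import dataclass
from typing import List, Sequence

# Per-window labels mirroring the claimed "multiple features of a gesture"
# (claims 25 and 27): start/middle/end of a swipe up or down, plus a non-swipe.
GESTURE_CLASSES = (
    "non_swipe",
    "start_swipe_up", "middle_swipe_up", "end_swipe_up",
    "start_swipe_down", "middle_swipe_down", "end_swipe_down",
)


@dataclass
class Prediction:
    label: str       # one of GESTURE_CLASSES
    position: float  # estimated finger position along the device surface (arbitrary units)


def classify_window(window: Sequence[float], position: float) -> Prediction:
    """Stand-in for the trained machine learning model; in the claims the model
    consumes non-touch sensor data (claim 22) and predicts one gesture feature."""
    mean = sum(window) / len(window)
    if abs(mean) < 0.1:
        return Prediction("non_swipe", position)
    direction = "up" if mean > 0 else "down"
    phase = "start" if position < 0.33 else ("end" if position > 0.66 else "middle")
    return Prediction(f"{phase}_swipe_{direction}", position)


def combine_predictions(predictions: List[Prediction]) -> str:
    """Combine multiple model outputs into one predicted gesture (claim 28)."""
    labels = {p.label for p in predictions}
    if "middle_swipe_up" in labels:
        return "swipe_up"
    if "middle_swipe_down" in labels:
        return "swipe_down"
    return "non_swipe"


def adjust_volume(level: float, predictions: List[Prediction]) -> float:
    """Adjust the audio output level by a fixed increment (claim 29) or in
    proportion to the distance between start and end features (claims 30-31);
    the result is clamped to [0, 1]."""
    gesture = combine_predictions(predictions)
    if gesture == "non_swipe":
        return level
    starts = [p.position for p in predictions if p.label.startswith("start")]
    ends = [p.position for p in predictions if p.label.startswith("end")]
    if starts and ends:
        delta = 0.5 * abs(ends[-1] - starts[0])  # proportional to distance (assumed scale)
    else:
        delta = 0.05                             # fixed per-gesture increment
    if gesture == "swipe_down":
        delta = -delta
    return max(0.0, min(1.0, level + delta))


if __name__ == "__main__":
    # Simulated upward swipe: three sensor windows spanning the device surface.
    windows = [([0.4] * 8, 0.1), ([0.5] * 8, 0.5), ([0.4] * 8, 0.9)]
    preds = [classify_window(w, pos) for w, pos in windows]
    print(combine_predictions(preds), adjust_volume(0.5, preds))
```

Running the example prints a combined "swipe_up" prediction together with the proportionally raised audio level; in the claimed system, `classify_window` would instead be the trained machine learning model operating on the non-touch sensor data.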
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/203,635 US20230401486A1 (en) | 2019-07-25 | 2023-05-30 | Machine-learning based gesture recognition |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962878695P | 2019-07-25 | 2019-07-25 | |
US16/937,479 US11704592B2 (en) | 2019-07-25 | 2020-07-23 | Machine-learning based gesture recognition |
US18/203,635 US20230401486A1 (en) | 2019-07-25 | 2023-05-30 | Machine-learning based gesture recognition |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/937,479 Continuation US11704592B2 (en) | 2019-07-25 | 2020-07-23 | Machine-learning based gesture recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230401486A1 (en) | 2023-12-14 |
Family
ID=72752980
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/937,479 Active 2041-01-15 US11704592B2 (en) | 2019-07-25 | 2020-07-23 | Machine-learning based gesture recognition |
US18/203,635 Pending US20230401486A1 (en) | 2019-07-25 | 2023-05-30 | Machine-learning based gesture recognition |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/937,479 Active 2041-01-15 US11704592B2 (en) | 2019-07-25 | 2020-07-23 | Machine-learning based gesture recognition |
Country Status (5)
Country | Link |
---|---|
US (2) | US11704592B2 (en) |
EP (1) | EP4004693A1 (en) |
KR (1) | KR20220029747A (en) |
CN (1) | CN112868029A (en) |
WO (1) | WO2021016484A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11863927B2 (en) * | 2019-11-20 | 2024-01-02 | Bose Corporation | Audio device proximity detection |
US11257471B2 (en) * | 2020-05-11 | 2022-02-22 | Samsung Electronics Company, Ltd. | Learning progression for intelligence based music generation and creation |
US20230384866A1 (en) * | 2022-05-24 | 2023-11-30 | Microsoft Technology Licensing, Llc | Gesture recognition, adaptation, and management in a head-wearable audio device |
Family Cites Families (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090265671A1 (en) | 2008-04-21 | 2009-10-22 | Invensense | Mobile devices with motion gesture recognition |
US20110010626A1 (en) * | 2009-07-09 | 2011-01-13 | Jorge Fino | Device and Method for Adjusting a Playback Control with a Finger Gesture |
US20130077820A1 (en) | 2011-09-26 | 2013-03-28 | Microsoft Corporation | Machine learning gesture detection |
US9671874B2 (en) * | 2012-11-08 | 2017-06-06 | Cuesta Technology Holdings, Llc | Systems and methods for extensions to alternative control of touch-based devices |
US9405377B2 (en) * | 2014-03-15 | 2016-08-02 | Microsoft Technology Licensing, Llc | Trainable sensor-based gesture recognition |
US9811352B1 (en) * | 2014-07-11 | 2017-11-07 | Google Inc. | Replaying user input actions using screen capture images |
US9235278B1 (en) * | 2014-07-24 | 2016-01-12 | Amazon Technologies, Inc. | Machine-learning based tap detection |
US9860077B2 (en) * | 2014-09-17 | 2018-01-02 | Brain Corporation | Home animation apparatus and methods |
US9821470B2 (en) * | 2014-09-17 | 2017-11-21 | Brain Corporation | Apparatus and methods for context determination using real time sensor data |
US20160091308A1 (en) * | 2014-09-30 | 2016-03-31 | Invensense, Inc. | Microelectromechanical systems (mems) acoustic sensor-based gesture recognition |
US10466883B2 (en) * | 2015-03-02 | 2019-11-05 | Apple Inc. | Screenreader user interface |
US10078803B2 (en) * | 2015-06-15 | 2018-09-18 | Google Llc | Screen-analysis based device security |
CN105549408B (en) * | 2015-12-31 | 2018-12-18 | 歌尔股份有限公司 | Wearable device, smart home server and its control method and system |
US9998606B2 (en) | 2016-06-10 | 2018-06-12 | Glen A. Norris | Methods and apparatus to assist listeners in distinguishing between electronically generated binaural sound and physical environment sound |
CN206149479U (en) * | 2016-09-23 | 2017-05-03 | 温州懿境文化传播有限公司 | Audio playback device with multiple regulation method |
US10535005B1 (en) * | 2016-10-26 | 2020-01-14 | Google Llc | Providing contextual actions for mobile onscreen content |
US20180307405A1 (en) * | 2017-04-21 | 2018-10-25 | Ford Global Technologies, Llc | Contextual vehicle user interface |
WO2019126625A1 (en) * | 2017-12-22 | 2019-06-27 | Butterfly Network, Inc. | Methods and apparatuses for identifying gestures based on ultrasound data |
KR102449905B1 (en) * | 2018-05-11 | 2022-10-04 | 삼성전자주식회사 | Electronic device and method for controlling the electronic device thereof |
2020
- 2020-07-23 KR KR1020227004018A patent/KR20220029747A/en active IP Right Grant
- 2020-07-23 EP EP20786631.0A patent/EP4004693A1/en active Pending
- 2020-07-23 WO PCT/US2020/043331 patent/WO2021016484A1/en active Application Filing
- 2020-07-23 US US16/937,479 patent/US11704592B2/en active Active
- 2020-07-23 CN CN202080001961.8A patent/CN112868029A/en active Pending
2023
- 2023-05-30 US US18/203,635 patent/US20230401486A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN112868029A (en) | 2021-05-28 |
US11704592B2 (en) | 2023-07-18 |
WO2021016484A1 (en) | 2021-01-28 |
EP4004693A1 (en) | 2022-06-01 |
KR20220029747A (en) | 2022-03-08 |
US20210027199A1 (en) | 2021-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230401486A1 (en) | Machine-learning based gesture recognition | |
US11449802B2 (en) | Machine-learning based gesture recognition using multiple sensors | |
US11726324B2 (en) | Display system | |
US12118443B2 (en) | Machine-learning based gesture recognition using multiple sensors | |
US11514928B2 (en) | Spatially informed audio signal processing for user speech | |
CN106133646B (en) | Response of the user to notice is determined based on physiological parameter | |
US9547418B2 (en) | Electronic device and method of adjusting user interface thereof | |
US20180123629A1 (en) | Smart-ring methods and systems | |
CN107533427A (en) | System and method for providing context sensitivity tactile notification framework | |
US20230076716A1 (en) | Multi-device gesture control | |
US11227617B2 (en) | Noise-dependent audio signal selection system | |
US11508388B1 (en) | Microphone array based deep learning for time-domain speech signal extraction | |
US20150148923A1 (en) | Wearable device that infers actionable events | |
US20230413472A1 (en) | Dynamic noise control for electronic devices | |
US11182057B2 (en) | User simulation for model initialization | |
US20230094658A1 (en) | Protected access to rendering information for electronic devices | |
US20230095816A1 (en) | Adaptive user enrollment for electronic devices | |
US11262850B2 (en) | No-handed smartwatch interaction techniques | |
US20240103632A1 (en) | Probabilistic gesture control with feedback for electronic devices | |
US20240069639A1 (en) | Systems and Methods for Customizing a Haptic Output of a Haptic Actuator of a User Device | |
WO2024064170A1 (en) | Probabilistic gesture control with feedback for electronic devices | |
US12082276B1 (en) | Automatic pairing of personal devices with peripheral devices | |
US11775673B1 (en) | Using physiological cues to measure data sensitivity and implement security on a user device | |
US20230097790A1 (en) | System and method for capturing cardiopulmonary signals | |
CN117133260A (en) | Dynamic noise control for electronic devices |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |