
CN118119971A - Electronic device and method for determining height of person using neural network - Google Patents


Info

Publication number
CN118119971A
CN118119971A
Authority
CN
China
Prior art keywords
neural network
information
electronic device
image
height
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180103294.9A
Other languages
Chinese (zh)
Inventor
A·C·福萨特
R·苏里亚迪塔玛
M·帕拉尼阿帕恩
R·Z·海克斯特
L·A·德雷普斯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nutricia NV
Original Assignee
Nutricia NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nutricia NV filed Critical Nutricia NV
Publication of CN118119971A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

An electronic device for estimating a height of a person, the electronic device comprising: a processor configured to: obtaining an image comprising at least a portion of the representation of the person; inputting the image to a first neural network and obtaining first information from the first neural network as output, the first information relating to a plurality of keypoints in a human body; inputting the image to a second neural network and obtaining second information as output from the second neural network, the second information being related to the reference information; and estimating a height of the person based on the first information and the second information; and an output unit configured to output the estimated height.

Description

Electronic device and method for determining height of person using neural network
Technical Field
The present invention relates to an electronic device for estimating a height of a person using a neural network, and to a method performed by said electronic device for estimating a height of a person.
Background
Traditionally, a person's height has been measured in a number of ways. The most specialized of these rely on dedicated equipment available only at a healthcare provider or pharmacy. More basic techniques include using a measuring tape, or having the person stand against a wall and marking the height to be measured. Professional mechanisms are still in use today and provide accurate measurements; however, they are only available at specific locations and are not readily accessible to everyone. The more basic techniques may be used anywhere, but may not be accurate. In the past few years, advances in image processing have allowed the rise of technologies that analyze images to obtain a person's height.
However, these techniques require capturing multiple images or videos, and they require the person to stand. This becomes particularly complicated when the height of an infant is to be measured, because an infant who cannot yet walk cannot stand, and it is difficult to keep infants in a specific position.
Thus, there is a need for a mechanism to obtain a person's height that does not require the person to stand or be in a particular location and is easy to use anywhere.
Disclosure of Invention
The present invention aims to overcome at least some of these drawbacks, as it allows the height of a person to be obtained in a simple and accurate way.
According to the present invention, an electronic device for estimating a height of a person is provided. The electronic device includes a processor configured to: obtaining an image comprising at least a portion of the representation of the person; inputting the image to a first neural network and obtaining first information from the first neural network as output, the first information relating to a plurality of keypoints in a human body; inputting the image to a second neural network and obtaining second information as output from the second neural network, the second information being related to the reference information; and estimating a height of the person based on the first information and the second information; and the electronic device comprises an output unit configured to output the estimated height.
The electronic device of the invention has the advantage of allowing the height of a person to be estimated in a simple manner, since only one image needs to be captured. This is achieved by inputting the same image to two neural networks, wherein the first neural network provides as output information related to height (more specifically, information related to a plurality of keypoints in the person's body) and the second neural network provides as output information related to reference information. Thus, the two neural networks each focus on different aspects or features of the same image. By analyzing the same image separately, an estimate of the person's height may be obtained by combining the height-related information from the first neural network with the reference information from the second neural network. The reference information may also be referred to as calibration information, or physical size calibration information.
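The two-network flow described above can be sketched as follows. This is a hypothetical orchestration, not the patent's actual code: the function names, the stubbed networks, and the dictionary keys are all illustrative assumptions.

```python
# Hypothetical sketch of the overall estimation flow: the same image is given
# to both networks; the pose network yields a pixel length (first information),
# the reference network yields a pixel-to-metric ratio (second information),
# and their product is the estimated physical height.
def estimate_height(image, pose_net, reference_net):
    keypoint_info = pose_net(image)    # first information: keypoint-derived data
    mm_per_px = reference_net(image)   # second information: calibration ratio
    return keypoint_info["pixel_length"] * mm_per_px
```

In practice the two network calls would be real model inferences; here they can be exercised with stubs.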
According to an embodiment of the present invention, the second information is information that links the reference information with the physical distance information.
According to an embodiment of the invention, the first neural network is configured to segment at least a portion of the representation of the person into a plurality of body parts, and predict a plurality of keypoints in the body of the person based on the plurality of body parts.
According to an embodiment of the present invention, the information related to the plurality of keypoints includes coordinate information about at least a part of the plurality of keypoints.
The first neural network is advantageously configured to recognize or detect, in the image, a representation of a person or an object that at least partially represents the person, and to segment the representation of the person's body into a plurality of body parts. The first neural network is further configured to predict keypoints based on the body parts, and the information related to at least part of the keypoints is the output first information. By dividing the body into a plurality of parts and identifying keypoints corresponding to specific points in the body, a skeleton of the body can be drawn, by which different parts of the body can be identified.
According to an embodiment of the present invention, each keypoint corresponds to one of a list comprising the face, shoulder, hip, knee, ankle and heel. Thus, the first information output from the first neural network includes information related to the coordinates, in the image, of specific critical parts of the body required to determine height.
According to an embodiment of the invention, the first neural network is configured to identify a predetermined number of keypoints, and if at least one keypoint is not identified by the first neural network with a detection confidence of at least 50% and a visibility of at least 50%, the processor is configured to generate a notification indicating that height cannot be estimated, and the output unit is configured to output the notification. For example, according to one embodiment, if the predetermined keypoints to be predicted are the right heel, left heel, right ankle, left ankle, right hip, left hip, right knee, left knee, right shoulder and left shoulder, then all of these keypoints need to have at least 50% visibility. The detection confidence may refer to a confidence score in (0, 1) above which the detection is deemed successful, and it may be a parameter passed by the electronic device to the first neural network. A visibility value, indicating the likelihood that the keypoint is visible, may be included in the first output information together with the coordinate information of each keypoint.
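The threshold check in this embodiment can be sketched as below. This is a minimal illustration, not the patent's implementation: the keypoint names, the tuple layout, and the function name are assumptions; only the keypoint list and the 0.5 thresholds come from the text above.

```python
# Sketch: verify that every required keypoint was detected with sufficient
# confidence and visibility before attempting a height estimate.
REQUIRED_KEYPOINTS = [
    "right_heel", "left_heel", "right_ankle", "left_ankle",
    "right_hip", "left_hip", "right_knee", "left_knee",
    "right_shoulder", "left_shoulder",
]

def can_estimate_height(detections, min_confidence=0.5, min_visibility=0.5):
    """detections: dict mapping keypoint name -> (x, y, confidence, visibility).

    Returns (ok, missing): ok is True when all required keypoints pass both
    thresholds; missing lists the keypoints that failed, so the device can
    output a notification that height cannot be estimated.
    """
    missing = [
        name for name in REQUIRED_KEYPOINTS
        if name not in detections
        or detections[name][2] < min_confidence
        or detections[name][3] < min_visibility
    ]
    return (len(missing) == 0, missing)
```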
According to an embodiment of the invention, the first neural network is a convolutional neural network for human body pose estimation implemented with the BlazePose neural network, for which the predictions of keypoints have been parameterized using the MediaPipe pose solution application programming interface (API), and wherein the output of the BlazePose/MediaPipe pose solution API is passed through a Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm. BlazePose is a lightweight convolutional neural network architecture that performs well for real-time inference on mobile devices.
According to an embodiment of the invention, the processor is further configured to calculate, using the first information, Euclidean distances between the coordinates of at least a portion of the plurality of keypoints on the image, in order to calculate the pixel length of the representation of the person in the image. Using the coordinate information of the plurality of keypoints output from the first neural network, and provided that the visibility corresponding to each coordinate is at least 50%, the processor may be configured to obtain the Euclidean distances between consecutive keypoints (for example, between the heel and the ankle, between the ankle and the knee, between the knee and the hip, between the hip and the shoulder, and between the shoulder and the top of the head) and to add the obtained distances together so as to obtain the height of the person. For example, according to one embodiment, the processor may be configured to calculate the distance in pixels between the coordinates of the left ankle and the left knee, between the coordinates of the left knee and the left hip, between the coordinates of the left hip and the left shoulder, and between the coordinates of the left shoulder and the top of the head.
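The distance summation in this embodiment can be sketched as follows. The chain of keypoint names and the dictionary layout are illustrative assumptions; the ankle-knee-hip-shoulder-head chain follows the example in the text above.

```python
# Sketch: sum Euclidean distances between consecutive keypoints along one side
# of the body to obtain the person's length in pixels.
import math

def pixel_height(keypoints):
    """keypoints: dict mapping keypoint name -> (x, y) pixel coordinates.

    Walks the chain ankle -> knee -> hip -> shoulder -> top of head, summing
    the Euclidean distance of each consecutive segment.
    """
    chain = ["left_ankle", "left_knee", "left_hip", "left_shoulder", "head_top"]
    return sum(
        math.dist(keypoints[a], keypoints[b])
        for a, b in zip(chain, chain[1:])
    )
```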
According to an embodiment of the invention, the reference information comprises an object of a known predetermined size, such as an object having the size of a credit card. By including an object of known size (width and height) in the image, the second neural network can recognize the object and associate it with the known size. Using the known dimensions, the height information obtained from processing the first information output by the first neural network may be transformed into a final height estimate.
According to an embodiment of the invention, the second neural network is configured to find a contour of the object, identify the object, and obtain the predetermined size of the object, wherein the second information comprises information related to the physical size of the object. In other words, the second neural network may output pixel-to-metric ratio information based on the known predetermined size of the object, and the processor may be configured to transform the height information obtained using the first information output from the first neural network into physical height information. For example, the processor may be configured to transform the pixel height information into physical height information.
According to an embodiment of the invention, the second neural network is configured to output a notification if the object cannot be recognized, and the output unit is configured to output the notification. This notification may indicate that the height cannot be estimated and/or the notification may request that the user provide a predetermined object so that it is correctly visible, i.e. so that it can be recognized by the second neural network.
According to an embodiment of the invention, the second neural network is formed by a convolutional neural network U-Net having an EfficientNet-b0 backbone network. U-Net is a convolutional neural network architecture for rapidly and accurately segmenting images. The EfficientNet backbone provides high accuracy and high efficiency in object recognition.
According to an embodiment of the invention, the electronic device further comprises an image capturing unit configured to capture an image. Thus, images may be obtained by the processor by different means. It may be captured directly by an image capturing unit of the electronic device, such as a camera, or it may receive the image by other means, such as downloading from the internet or receiving from an external device.
According to an embodiment of the invention, the processor is configured to perform operations of at least one of the first neural network and the second neural network. Thus, at least one of the first neural network and the second neural network may be implemented by a processor of the electronic device. This has the advantage that a connection to an external server can be avoided and that the estimation of the height can be performed locally by the electronic device.
According to the present invention, there is provided a method of obtaining a height of a person using the electronic device described above. The method comprises the following steps: obtaining an image comprising at least a portion of the representation of the person; inputting the image to a first neural network, and obtaining first information from the first neural network as output, the first information relating to a plurality of keypoints in a human body; inputting the image to a second neural network and obtaining second information as output from the second neural network, the second information being related to the reference information; estimating a height of the person based on the first information and the second information, and outputting the estimated height.
According to an embodiment of the invention, the operations of the first neural network and the second neural network are performed by a processor of the electronic device.
According to an embodiment of the invention, the operations of the first and second neural networks are performed by a server in communication with the electronic device, and wherein the method further comprises the electronic device transmitting the image to the server and receiving the first information and the second information from the server.
Drawings
The invention will be discussed in more detail below with reference to the accompanying drawings, in which:
FIG. 1 depicts an overview of a system according to an embodiment of the invention;
FIG. 2 illustrates a flow chart depicting operation of a neural network, according to an embodiment of the present invention;
FIG. 3 shows a flow chart depicting a method according to an embodiment of the invention;
Figs. 4a to 4f show screenshots of an electronic device for different poses of the person, together with the reference information, according to embodiments of the invention.
Detailed Description
Fig. 1 depicts an overview of a system according to an embodiment of the invention. As seen in fig. 1, the electronic device 100 comprises an output unit 101, in this case corresponding to a display such as a touch display. The output unit may additionally or alternatively comprise a speaker. The electronic device may also comprise an input unit, which may also correspond to a touch display or to a separate input unit, such as a keyboard. The electronic device also includes a processor or processing unit configured to control the overall operation of the electronic device. The electronic device 100 of fig. 1 further comprises an image capturing unit (not shown in the figure), e.g. a camera unit, configured to capture images to be used for estimating height. However, the image may also be obtained by the processor by means other than the image capturing unit, since the electronic device may also comprise a communication unit and the image may be downloaded or received from an external device, and the electronic device may also comprise a memory or storage unit, which may store the image.
The electronic device 100 may comprise a software application installed therein, which when executed by a processor allows the steps of the method of the invention to be performed. For example, FIG. 1 shows a screen shot of a display of an electronic device while the application is executing, and the electronic device 100 has captured or will capture an image of a person that will be used to determine a height estimate.
In the embodiment of fig. 1 and throughout the following description, the person will be considered to be an infant; however, it will be appreciated that the invention may be applied to the estimation of the height of a person of any height. An infant is taken as an example to show that the invention is able to estimate height even when the person is not standing and is not in a straight position, which is the normal case for infants. Furthermore, throughout the description, the height and length of a person or infant are used interchangeably.
Fig. 1 shows that the image that has been or is to be captured comprises reference information, in this case in the form of an object 102 having the size of a credit card. The invention requires that reference information is present in the image in order to be able to convert the pixel height, which can be deduced from the analysis of the body parts with the first neural network, into actual (physical) height information using only one image. The use of an object of known size, which the user may have at hand at any time, allows the reference information to be provided in a simple and accurate manner. The use of credit-card-sized objects allows the use of any type of card having that size, including identification cards, insurance cards, store loyalty cards, drivers' licenses, public transportation cards, and many other cards that a user may carry around. The reference information may also be referred to as calibration information, or physical size calibration information. Other types of suitable reference information may be used as well, as long as they can serve as a reference relating pixel distance to physical distance.
Fig. 2 illustrates a flowchart depicting operation of a neural network, in accordance with an embodiment of the present invention.
As seen in fig. 1 above, the processor is configured to obtain an image. The image may contain at least a portion of the representation of the person and the reference information. The processor is then configured to input the image to two neural networks that will perform different analyses on the image.
The first neural network may be configured to determine the pose of the person. It may take as input an image 201, which may be a color (RGB) image, and may be configured to recognize 202 an area in the image where a representation of a person is present, divide 203 the body of the representation of the person into a plurality of body parts, and predict 204 coordinates of keypoints of the body based on the divided parts. Each keypoint may correspond to one of a list comprising the face, shoulder, hip, knee, ankle, and heel. In order to be able to predict the keypoints, the first neural network has to know what to look for in the image; that is to say, it has to be trained. The first neural network according to an embodiment of the present invention is a convolutional neural network (CNN). A CNN is a neural network trained on a large number of images from an image database, such as the ImageNet database. CNNs are composed of a number of layers and learn feature representations of a wide variety of images. CNNs may be implemented using several libraries (such as TensorFlow and Keras) in conjunction with an image processing library (such as OpenCV), may be implemented in a programming language such as Python, C or C++, and may run on a single processor, on multiple processors or processor cores, or on a parallel computing platform such as CUDA.
Training is performed by inputting training images to the CNN. The training images may be stock images, test images, or even simulated images. To obtain classification accuracy in excess of 90%, training is preferably performed using a large number of images, preferably from 5,000 to 10,000 images, more preferably from 8,000 to 9,000 images. The training images may include images created by data augmentation, obtained by performing transformations such as rotation, cropping, scaling, shading-based methods, and the like. For example, to train the neural network, the types of data augmentation implemented were horizontal flipping, perspective transformation, brightness/contrast/color manipulation, image blurring and sharpening, and Gaussian noise. These operations increase the robustness of the CNN. The convolutional layers of the CNN extract image features that are used by the last learnable layer and the final classification layer to classify the input image. These two layers contain information about how to combine the features extracted by the CNN into class probabilities and predicted labels.
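Two of the augmentations named above, horizontal flipping and Gaussian noise, can be sketched in a few lines. This is an assumed illustration, not the patent's training code; the remaining augmentations (perspective, brightness/contrast, blur/sharpen) would be added in the same style, e.g. via OpenCV or an augmentation library.

```python
# Sketch: simple numpy implementations of two augmentations applied to an
# H x W x 3 uint8 image array.
import numpy as np

def horizontal_flip(image):
    # Reverse the column axis: a mirror image along the vertical midline.
    return image[:, ::-1, :]

def add_gaussian_noise(image, sigma=10.0, rng=None):
    # Add zero-mean Gaussian noise, then clip back to the valid uint8 range.
    rng = rng or np.random.default_rng(0)
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```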
In most CNNs, the last layer with learnable weights is a fully connected layer, which multiplies the input by the learned weights. During training, this layer is replaced by a new fully connected layer whose number of outputs is equal to the number of categories in the new dataset. By increasing the learning rate of this layer, the new layer can learn faster than the transferred layers.
Once trained, the CNN can analyze the input image. The CNN takes an image as input and may require the input image to have a specific size, for example 224 x 224 pixels. If the input image differs from the allowed input size, a preprocessing step may be performed so that the image is resized (by scaling up or down) or cropped to fit the required input size. Other preprocessing that may be performed includes color calibration and/or image normalization.
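The resize step can be sketched as below. This is an assumed illustration using nearest-neighbour sampling in plain numpy; a real pipeline would more likely use cv2.resize or tf.image.resize with bilinear interpolation.

```python
# Sketch: nearest-neighbour resize of an arbitrary H x W x 3 image to the
# 224 x 224 input size mentioned above.
import numpy as np

def resize_nearest(image, size=(224, 224)):
    h, w = image.shape[:2]
    # Map each output row/column to the nearest source row/column.
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return image[rows][:, cols]
```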
In the case of the first neural network of embodiments of the present invention, the first neural network may also be a neural network that has been trained, and may not be subjected to additional training. The first neural network is configured to provide information related to height, preferably information related to at least a portion of the plurality of keypoints. The output provided by the first neural network may include a set of keypoint coordinates along with their respective visibility metrics or percentages, for example in the form of vectors.
The first neural network may be a convolutional neural network for human body pose estimation implemented with the BlazePose neural network, for which predictions of keypoints have been parameterized with the MediaPipe pose solution application programming interface (API). BlazePose is a lightweight convolutional neural network architecture with good performance for real-time inference on mobile devices. The BlazePose architecture is described in Valentin Bazarevsky et al., "BlazePose: On-device Real-time Body Pose Tracking". The BlazePose architecture uses heat maps and regression to obtain the keypoint coordinates. The heat-map network helps to determine the more prominent portions of the subject in the frame (i.e., the exposed areas of the infant's skeletal joints), while the regression network attempts to predict the average coordinate values by learning a regression function. The architecture also utilizes skip connections between all stages of the network to achieve a balance between high-level and low-level features. Although developed for applications such as fitness tracking and sign language recognition, the inventors realized that it could be used as part of a mechanism to predict a person's height. The output of the BlazePose/MediaPipe pose solution API according to embodiments of the present invention may be passed through a Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization (minimization) algorithm. Thus, the output of the BlazePose/MediaPipe pose estimation API may be passed to the BFGS minimizer so that the results may be optimized and accuracy improved; in other words, so that results may be produced that are as close as possible to the parent-reported length. The advantage is that this reduces errors. The BFGS minimizer is one of the popular parameter estimation algorithms in machine learning, and can be considered an algorithm that identifies scalar multiples of the various lengths, angles, etc. of the keypoints so that the resulting actual length is closer to the parent-reported length.
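One possible reading of this BFGS step is a least-squares fit of per-segment scale factors, which can be sketched with SciPy's BFGS minimizer. The segment data, the objective, and the function name are illustrative assumptions, not taken from the patent.

```python
# Sketch: fit a scalar multiplier per body segment so that scaled pixel lengths
# best match reference ("parent-reported") lengths, using BFGS minimization.
import numpy as np
from scipy.optimize import minimize

def fit_segment_scales(pixel_lengths, reference_lengths):
    """Both arguments: 1-D arrays of per-segment lengths (pixels vs. reported).

    Returns the per-segment scale factors minimizing the squared error between
    scaled pixel lengths and the reported lengths.
    """
    def objective(scales):
        return np.sum((scales * pixel_lengths - reference_lengths) ** 2)

    x0 = np.ones_like(pixel_lengths, dtype=float)  # start from unit scales
    result = minimize(objective, x0, method="BFGS")
    return result.x
```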
In an embodiment of the invention, the training phase of the first neural network comprising the BFGS algorithm is implemented using 200 to 400 images (e.g., 249 images).
The first neural network may be configured to recognize a representation of a person's body in the image, separate it from the rest of the image, divide the body into a plurality of parts, and identify a predetermined number of keypoints; and if at least one of the keypoints has a detection confidence of less than 50% or a visibility of less than 50%, the processor may be configured to generate a notification indicating that height cannot be estimated, and the output unit is configured to output the notification. The detection confidence may refer to a confidence score in (0, 1) above which the detection is deemed successful, and it may be a parameter that is communicated to the first neural network by the electronic device. A visibility value, indicating the likelihood that the keypoint is visible, may be included in the first output information together with the coordinate information of each keypoint. As long as there is a clean background, other objects may be present in the image and the first neural network is able to separate the person from the rest of the image. Through guidance, an electronic device according to an embodiment of the present invention may also indicate that no other person should be present in the image. For example, according to one embodiment, the following elements may be required to be visible to the first neural network with at least 50% detection confidence and 50% visibility: right heel, left heel, right ankle, left ankle, right hip, left hip, right knee, left knee, right shoulder, left shoulder, upper forehead, middle of the eyes, and nose.
If the first neural network is the BlazePose network, the predetermined number of keypoints is 33, as seen in 204 of FIG. 3, taken from Valentin Bazarevsky et al., "BlazePose: On-device Real-time Body Pose Tracking". The first neural network predicts the keypoints accurately when the face is frontal and no object, body part or clothing blocks the line of sight to keypoints such as the face, core body, knees and ankles. For example, the accuracy of the BlazePose model (the percentage of correct keypoints detected) on a dataset of 1,000 images, each containing 1 to 2 persons in a wide variety of poses, may be 79.6%. If, for example, a piece of clothing covers the knee, the keypoint identifying the knee will not be recognized by the first neural network. One example is a very large full-sleeve pajama, which is too loose to allow detection of a keypoint.
The first neural network may output information related to the identified keypoints as the first information. If the first neural network is unable to identify sufficient keypoints with a detection confidence of at least 50% and a visibility of at least 50%, this will be reflected in the output of the first neural network, which will include a notification. The content of this notification may vary and may be part of the normal output of the first neural network, i.e. the visibility percentage of each coordinate. If the percentage for at least one coordinate is less than 50%, this can itself be treated by the processor as the notification. The notification may be given in another way, as long as the processor is able to recognize that the pose, and thus the height, cannot be estimated. The processor will use this information to output, via the output unit, information indicating that height cannot be estimated, and/or to prompt the user to capture a new image in which enough keypoints are visible. An example of an output notification may be "pose estimation unsuccessful".
According to an embodiment of the invention, when the first neural network identifies all predetermined keypoints with a detection confidence of at least 50% and a visibility of at least 50%, the processor is further configured, using the keypoint-related information, to obtain Euclidean distances between the coordinates of the plurality of keypoints on the image in order to calculate the pixel length of the representation of the person in the image. In other words, the processor may be configured to obtain the Euclidean distances between consecutive keypoints (i.e. the processor knows which keypoints belong to consecutive body parts), and to add the obtained distances together in order to obtain the height of the person in pixels. For example, according to one embodiment, the processor may be configured to calculate the distance in pixels between the coordinates of the left ankle and the left knee, between the coordinates of the left knee and the left hip, between the coordinates of the left hip and the left shoulder, and between the coordinates of the left shoulder and the top of the head.
To calculate the Euclidean distance, the processor may average the output of the first neural network such that, for example, the distance between the coordinates of the left ankle and the left knee and the distance between the coordinates of the right ankle and the right knee are averaged to produce a uniform length. This improves accuracy. In another embodiment, instead of averaging the lengths, the maximum of the two lengths or another suitable method may be used.
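The left/right combination step can be sketched as follows; the function name and the mode parameter are illustrative assumptions covering both the averaging embodiment and the maximum-length alternative mentioned above.

```python
# Sketch: combine the left-side and right-side pixel length of a body segment
# into one uniform length, by mean (default) or by maximum.
def combined_segment_length(left_length, right_length, mode="mean"):
    if mode == "mean":
        return (left_length + right_length) / 2.0
    if mode == "max":  # alternative embodiment mentioned in the text
        return max(left_length, right_length)
    raise ValueError(f"unknown mode: {mode}")
```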
However, the height in pixels does not provide complete information, as it is only related to the image and does not have information about the actual physical height.
To solve this, the image 201 is also input to the second neural network. The second neural network according to an embodiment of the present invention is also a convolutional neural network configured (trained) to recognize reference information. In this embodiment, the reference information comprises an object of known, predetermined dimensions, such as an object with the dimensions of a credit card. By including an object of known size (width times height) in the image, the second neural network can recognize the object and associate it with its known dimensions. For example, the standard dimensions of a credit card are 85.6 mm (3.37 inches) in width and 53.98 mm (2.125 inches) in height.
The second neural network may be configured to find 205 the outline of the object, identify 206 the object, and associate the identified object with a known object of known size. The second information (the output of the second neural network) may include information related to the physical size of the object, such as a pixel-to-metric ratio. Based on the known dimensions of the object and this pixel-to-metric information, the processor may be configured to transform the height in pixels, obtained after processing the output of the first neural network, into a physical height.
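A minimal sketch of the pixel-to-metric conversion, assuming the standard card width given above and hypothetical pixel measurements:

```python
# Width of an ID-1 (credit-card-sized) card in millimetres.
CARD_WIDTH_MM = 85.6

# Hypothetical values: the card width in pixels as found by the second
# neural network, and the person's height in pixels from the first.
card_width_px = 214.0
person_height_px = 1870.0

mm_per_pixel = CARD_WIDTH_MM / card_width_px  # pixel-to-metric ratio
height_mm = person_height_px * mm_per_pixel   # physical height in mm
height_cm = height_mm / 10.0
```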
The second neural network may be formed by the convolutional neural network U-Net with an EfficientNet-B0 backbone. U-Net is a convolutional neural network architecture for fast and accurate image segmentation. The EfficientNet backbone provides high accuracy and efficiency in object recognition. The U-Net of embodiments of the present invention may have been trained to recognize certain reference information, such as credit-card-sized objects. Certain design choices, such as strong augmentation of the card in the card segmentation training data, lead to higher accuracy and thus a more accurate prediction of the length of the card.
To obtain a second neural network according to embodiments of the present invention, the final layer of the U-Net may have been modified, and all layers within the U-Net architecture may have been retrained with custom data.
In the training phase, 3000 to 4000 images, for example 3698, were used in one embodiment. Techniques such as data augmentation were used to increase the amount of data and prevent the model from overfitting. The data augmentations applied include horizontal flipping, perspective transformation, brightness/contrast/color manipulation, image blurring and sharpening, and Gaussian noise.
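A minimal NumPy sketch of a few of the listed augmentations (horizontal flip, brightness manipulation, Gaussian noise); the actual training pipeline and its parameters are not disclosed, so all values here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply a horizontal flip, a random brightness scaling, and
    additive Gaussian noise to a uint8 grayscale image."""
    out = image[:, ::-1].astype(np.float64)           # horizontal flip
    out = out * rng.uniform(0.8, 1.2)                 # brightness scaling
    out = out + rng.normal(0.0, 5.0, size=out.shape)  # Gaussian noise
    return np.clip(out, 0, 255).astype(np.uint8)
```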
In addition, a synthetic card dataset was created to augment the existing data for the card-segmentation second neural network. Approximately 100 cards were manually segmented from the original dataset. From about 40 images, the infant image was cropped so that the card was no longer visible. All manually segmented cards were then pasted onto the approximately 40 infant backgrounds. A total of 3698 new images were created to train the card segmentation model.
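The synthetic-card compositing step can be sketched as a masked paste; the function name and array layout are illustrative, not the actual tooling used:

```python
import numpy as np

def paste_card(background: np.ndarray, card: np.ndarray,
               mask: np.ndarray, top: int, left: int) -> np.ndarray:
    """Paste a manually segmented card (with its binary mask) onto an
    infant background image, returning a new composited image."""
    out = background.copy()
    h, w = card.shape[:2]
    region = out[top:top + h, left:left + w]  # view into the copy
    region[mask > 0] = card[mask > 0]         # copy only card pixels
    return out
```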
The model was fine-tuned to achieve an overall high mean Intersection over Union (mIoU), showing an accuracy greater than 0.96.
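For reference, the IoU underlying the mIoU metric can be computed per binary mask as follows (mIoU is then the mean over classes or images):

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union between two binary segmentation masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(inter) / float(union) if union else 1.0
```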
When more than one credit-card-sized object is present in the image, the second neural network may be configured to take the card with the highest similarity as the reference object and ignore the others. As guidance, the electronic device may instruct the parent or user not to place more than one card in the image.
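Selecting the most card-like candidate among several detections reduces to a maximum over similarity scores; the detection format here is an assumption for illustration:

```python
# Hypothetical (similarity, bounding-box) pairs for candidate card
# regions scored by the second neural network; values are illustrative.
detections = [
    (0.62, (10, 10, 50, 35)),
    (0.97, (200, 40, 60, 38)),
    (0.41, (5, 180, 48, 33)),
]

# Keep the candidate with the highest similarity as the reference
# object; the others are ignored.
best_similarity, best_box = max(detections, key=lambda d: d[0])
```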
The estimation of height requires the reference information. If the reference information is not properly visible or identifiable, the second neural network will not be able to recognize it and provide the ratio between pixel distance and metric distance. The second neural network is configured to output a notification if the object is not recognized correctly, and the output unit is configured to output the notification. This notification may indicate that the height cannot be estimated and/or may request that the user make the predetermined object properly visible. Examples of such notifications are "card segmentation unsuccessful" or "card identification unsuccessful".
Fig. 3 shows a flow chart depicting a method according to an embodiment of the invention. The method of obtaining the height of a person, such as an infant, of fig. 3 starts at step 301 with obtaining, by a processor of an electronic device, an image comprising reference information and at least a part of a representation of the person. The image may be obtained by means of an image capturing unit comprised in or connected to the electronic device, or received by the electronic device in another way. It may be obtained from a memory of the electronic device or another storage device, from an external device or the cloud via a data communication connection, from a camera module of the electronic device that has captured the image, or by other suitable means that will be apparent to the skilled person.
Step 302 includes inputting, by the processor, the image to a first neural network and obtaining, as output, first information from the first neural network, the first information relating to the height of the person and, more particularly, to a plurality of keypoints in the person's body. Step 303 includes inputting, by the processor, the image to a second neural network and obtaining, as output, second information from the second neural network, the second information relating to the reference information.
Step 304 includes estimating a height of the person based on the first information and the second information, and step 305 includes outputting the estimated height. The outputting may be performed by an output unit of the electronic device.
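Steps 301 to 305 can be sketched end-to-end as follows; `pose_net` and `card_net` are stand-ins for the first and second neural networks, and the keypoint chain is an illustrative assumption:

```python
import math

def pixel_height(kps: dict) -> float:
    """Height in pixels as the sum of distances along a keypoint chain."""
    chain = ["ankle", "knee", "hip", "shoulder", "head_top"]
    return sum(math.dist(kps[a], kps[b]) for a, b in zip(chain, chain[1:]))

def estimate_height_mm(image, pose_net, card_net):
    kps = pose_net(image)        # step 302: first information (keypoints)
    mm_per_px = card_net(image)  # step 303: second information (ratio)
    if kps is None or mm_per_px is None:
        return None              # pose or reference object not recognised
    return pixel_height(kps) * mm_per_px  # step 304: estimated height
```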
The operation of the first neural network and the second neural network has been explained above with reference to fig. 3. The processor of the electronic device may be configured to perform the operations of at least one of the first neural network and the second neural network. This means that at least one of the first neural network or the second neural network will be executed by the processor of the electronic device, which allows the height estimation to be performed locally at the electronic device in an offline mode that does not require communication with any external server. It can thus be done without a connection to the internet. The first neural network and the second neural network may have lightweight architectures that further facilitate their execution by electronic devices, e.g. portable electronic devices such as smartphones and tablet computers. Other portable devices, such as a computer or notebook computer, may also be used. In particular, the BlazePose architecture used for the first neural network according to some embodiments of the present invention is lightweight and efficient for use in a portable terminal. It also runs faster thanks to conversion to the TensorFlow Lite (tflite) format, which involves quantization of the weights of the neural network layers. Furthermore, using focal loss rather than binary cross-entropy as the loss function of the second neural network also increases accuracy by several percentage points. Since the object to be detected in the image is small, the loss function used in the second neural network of the present invention may be a focal loss or a weighted cross-entropy instead of a binary cross-entropy. In an embodiment of the invention, the majority of the image consists of the background and the infant, while the card occupies only a few pixels of the image.
If a conventional binary cross-entropy were used as the loss function, the second neural network would tend to segment only objects it is 80%-100% confident are indeed a card, which makes it difficult to segment the card out of the picture and leads to more erroneous or no-result cases.
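A per-pixel sketch of binary focal loss versus binary cross-entropy; the gamma and alpha values follow the commonly used defaults from Lin et al. and are assumptions here:

```python
import math

def binary_cross_entropy(p: float, y: int) -> float:
    """Standard per-pixel binary cross-entropy for prediction p in (0, 1)."""
    pt = p if y == 1 else 1.0 - p
    return -math.log(pt)

def binary_focal_loss(p: float, y: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss down-weights easy examples by (1 - pt)**gamma, so the
    few card pixels are not drowned out by the many background pixels."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)
```

For an easy background pixel (p = 0.05, y = 0) the focal loss is several orders of magnitude smaller than the cross-entropy, which is exactly the down-weighting that helps when the card occupies only a few pixels of the image.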
Similarly, the U-Net architecture of embodiments of the present invention is lightweight and can be implemented in portable devices.
In embodiments of the present invention, and in particular for electronic devices with little processing power, the operation of at least one of the first and second neural networks may be performed by an external server in communication with the electronic device, for example through cloud computing or the internet. In this case, the electronic device, or a communication unit of the electronic device, may be configured to transmit the image to the server and to receive from the server the first information and the second information output by the first and second neural networks, respectively.
Figs. 4a to 4f show screenshots of an electronic device for different poses of a person according to embodiments of the invention, together with reference information. Figs. 4a to 4f represent embodiments showing examples of poses of a person and reference information that will, or will not, result in the height being correctly estimated.
FIG. 4a shows a screenshot of an electronic device when executing an application for height estimation. The processor may input the image to the first neural network after performing some preprocessing, as explained above with reference to fig. 2. As seen in fig. 4a, the person (an infant) is almost completely visible in the image 401a; only one hand is not visible. In addition, keypoints of the heel, ankle, knee, hip and shoulder are visible on both the left and right sides. Thus, the first neural network will be able to identify all predetermined keypoints. For example, if the first neural network is a BlazePose neural network, 33 predetermined keypoints will be identified. For the embodiments of the present invention, only 13 of the 33 keypoints are relevant: the right heel, left heel, right ankle, left ankle, right knee, left knee, right hip, left hip, right shoulder, left shoulder, upper forehead, mid-eye and nose.
As seen in fig. 4a, the reference information (in this case a credit card sized object) is present in the image 401a and is fully visible. Thus, the second neural network will be able to recognize the object and output information related to the reference information, in this case the pixel/metric ratio.
In fig. 4b, the infant is fully visible in image 401b. Its legs are bent, so it is not in a straight position. However, since the relevant keypoints are visible, the first neural network will be able to identify the keypoints, and the processor will be able to estimate the pixel height of the infant. The reference information is fully visible, so the second neural network will be able to recognize the object and output information related to the reference information, in this case the pixel/metric ratio.
In fig. 4c, the infant is fully visible in image 401c, however, the infant lies on its side with one of the knees bent. In this case, since not all keypoints are visible, the first neural network will not be able to identify the keypoints and the processor will not be able to estimate the pixel height of the infant. The reference information is fully visible so the second neural network will be able to recognize the object and output information related to the reference information, in this case the pixel/metric ratio. The first neural network will output a notification as the output first information, and the processor will use this information to output a notification to the user via the output unit. With this notification, the user will know that height cannot be estimated because the infant is not properly visible and the user will be able to capture another image.
In fig. 4d, the infant is fully visible in image 401d, but lying face down. In this case, since not all keypoints are visible, the first neural network will not be able to identify the keypoints and the processor will not be able to estimate the pixel height of the infant. The reference information is fully visible so the second neural network will be able to recognize the object and output information related to the reference information, in this case the pixel/metric ratio. The first neural network outputs a notification as output first information, and the processor outputs a notification to the user via the output unit using the information. With this notification, the user will know that height cannot be estimated because the infant is not properly visible and the user will be able to capture another image.
In fig. 4e, the reference information is fully visible in image 401e. However, the infant is not fully visible in the image, because its ankles are not visible and its left side is not fully visible. In this case, the first neural network will not be able to identify enough keypoints to estimate the height. The first neural network will output a notification as the first information, and the processor will use this information to output a notification to the user via the output unit. With this notification, the user will know that the height cannot be estimated because the infant is not properly visible, and the user will be able to capture another image.
In fig. 4f, the infant is fully visible in image 401 f. However, there is no reference information. The second neural network will not be able to recognize any reference information and will output a notification as the second information output and the processor will use this information to output a notification to the user via the output unit. With this notification, the user will know that height cannot be estimated because the reference information is not in the image, or is not properly visible, and the user will be able to capture another image that includes the reference information (such as a reference object having a credit card size).
Although not shown in the drawings, it should be understood that other situations may occur in which at least one of the first or second neural networks is unable to obtain the correct output information. For example, if a credit-card-sized object is only partially visible, so that its size cannot be determined, the second neural network will output a notification.
In the foregoing description of the figures, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader scope of the invention as set forth in the appended claims.
In particular, specific features of various aspects of the invention may be combined. An aspect of the present invention may be further advantageously enhanced by adding features described in relation to another aspect of the present invention.
It is intended that the invention be limited only by the following claims and the equivalents thereof. In this document and in its claims, the verb "to comprise" and its conjugations is used in its non-limiting sense to mean that items following the word are included, without excluding items not specifically mentioned. Furthermore, references to an element by the indefinite article "a" or "an" do not exclude the possibility that more than one of the element is present, unless the context clearly requires that there be one and only one of the elements. Thus, the indefinite article "a" or "an" generally means "at least one".

Claims (17)

1. An electronic device for estimating a height of a person, the electronic device comprising:
A processor configured to:
-obtaining an image comprising at least a part of the representation of the person;
-inputting the image to a first neural network and obtaining as output first information from the first neural network, the first information being related to a plurality of keypoints in a human body;
-inputting the image to a second neural network and obtaining as output second information from the second neural network, the second information being related to the reference information; and
-Estimating the height of the person based on the first information and the second information; and
An output unit configured to output the estimated height.
2. The electronic device of claim 1, wherein the second information is information that relates the reference information to physical distance information.
3. The electronic device of any of claims 1 or 2, wherein the first neural network is configured to segment at least a portion of the representation of the person into a plurality of body parts, and predict a plurality of keypoints in the body of the person based on the plurality of body parts.
4. The electronic device of claim 3, wherein the information related to the plurality of keypoints comprises coordinate information about at least a portion of the plurality of keypoints.
5. The electronic device of any of claims 3 or 4, wherein the key point corresponds to one of a list comprising a face, a shoulder, a hip, a knee, an ankle, and a heel.
6. The electronic device of any of claims 3-5, wherein the first neural network is configured to identify a predetermined number of keypoints, and if at least one keypoint is not identified by the first neural network with a detection confidence of at least 50% and a visibility of at least 50%, the processor is configured to generate a notification that height cannot be estimated, and the output unit is configured to output the notification.
7. The electronic device of any of the preceding claims, wherein the first neural network is a convolutional neural network for human body pose estimation implemented with a BlazePose neural network for which predictions of the keypoints have been parameterized using a MediaPipe pose estimation application program interface, and wherein outputs of the BlazePose/MediaPipe pose estimation application program interface are passed through a Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm.
8. The electronic device of any of claims 3-7, wherein the processor is further configured to calculate euclidean distances between coordinates of at least a portion of a plurality of keypoints on the image using the first information to calculate a pixel length of the representation of the person in the image.
9. The electronic device of any of the preceding claims, wherein the reference information comprises an object of known predetermined size, such as an object having the size of a credit card.
10. The electronic device of claim 9, wherein the second neural network is configured to find a contour of the object, recognize the object, and obtain a predetermined size of the object, and wherein the second information comprises information related to a physical size of the object.
11. The electronic device of any of claims 9-10, wherein the second neural network is configured to generate a notification if the object cannot be recognized, and wherein the output unit is configured to output the notification.
12. The electronic device of any of the preceding claims, wherein the second neural network is formed by a convolutional neural network U-Net having an EfficientNet-B0 backbone network.
13. The electronic device of any of the preceding claims, further comprising an image capturing unit configured to capture the image.
14. The electronic device of any of the preceding claims, wherein the processor is configured to perform operations of at least one of the first and second neural networks.
15. A method of obtaining a height of a person using an electronic device according to any of the preceding claims, the method comprising:
-obtaining an image comprising at least a part of the representation of the person;
-inputting the image to a first neural network and obtaining as output first information from the first neural network, the first information being related to a plurality of keypoints in a human body;
-inputting the image to a second neural network and obtaining as output second information from the second neural network, the second information being related to the reference information;
-estimating the height of the person based on the first information and the second information, and
-Outputting the estimated height.
16. The method of claim 15, wherein operations of the first neural network and the second neural network are performed by a processor of the electronic device.
17. The method of claim 15, wherein the operations of the first and second neural networks are performed by a server in communication with the electronic device, and wherein the method further comprises the electronic device transmitting the image to the server and receiving the first and second information from the server.
CN202180103294.9A 2021-09-20 2021-09-20 Electronic device and method for determining height of person using neural network Pending CN118119971A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/075785 WO2023041181A1 (en) 2021-09-20 2021-09-20 Electronic device and method for determining human height using neural networks

Publications (1)

Publication Number Publication Date
CN118119971A true CN118119971A (en) 2024-05-31

Family

ID=77989798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180103294.9A Pending CN118119971A (en) 2021-09-20 2021-09-20 Electronic device and method for determining height of person using neural network

Country Status (5)

Country Link
US (1) US20240303848A1 (en)
EP (1) EP4405896A1 (en)
CN (1) CN118119971A (en)
AU (1) AU2021464323A1 (en)
WO (1) WO2023041181A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117179744B (en) * 2023-08-30 2024-09-10 武汉星巡智能科技有限公司 Non-contact infant height measurement method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2019125602A (en) * 2019-08-13 2021-02-15 Общество С Ограниченной Ответственностью "Тексел" COMPLEX SYSTEM AND METHOD FOR REMOTE SELECTION OF CLOTHES

Also Published As

Publication number Publication date
AU2021464323A1 (en) 2024-04-04
WO2023041181A1 (en) 2023-03-23
US20240303848A1 (en) 2024-09-12
EP4405896A1 (en) 2024-07-31


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination