CN113657185A - Intelligent auxiliary method, device and medium for piano practice - Google Patents
- Publication number
- CN113657185A (application CN202110843452.6A)
- Authority
- CN
- China
- Prior art keywords: key, finger, video, information, playing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
- G06Q50/20: ICT specially adapted for specific business sectors; services; education
Abstract
The embodiment of the invention discloses an intelligent auxiliary method, device and medium for piano practice, which can accurately evaluate a performance and provide visual guidance on fingering, notes and other information. The method comprises the following steps: decomposing a playing video frame by frame into key images and preprocessing the images, where preprocessing includes marking key region ranges; determining the position information of a pressed key from the time information of the key image, inputting the key image into a preset gesture recognition model, predicting the gesture recognition joint points, and outputting finger coordinates; if a finger coordinate lies within the position range of a key, associating the finger with the key to obtain a correspondence; inputting the different correspondences obtained from different playing videos into a deep learning model and outputting a playing score; and projecting laser light onto the keys to indicate playing positions and fingering information.
Description
Technical Field
The invention relates to the field of deep learning, and in particular to an intelligent auxiliary method, device and medium for piano practice.
Background
With the improvement of people's living standards and artistic cultivation, more and more users are learning the piano. Because beginners' musical foundations are weak, finding the corresponding keys from a music score is a major learning obstacle: their grasp of the staff and of the positions of the piano keys is unskilled, and key-pressing errors or note errors are inevitable during playing. At present, in piano teaching, a teacher can generally only demonstrate a piece a few times to several students at once; the students must memorize the teacher's playing technique and rhythm during the lesson and then practice by themselves at home. They therefore lose their reference when practicing alone at home and, in particular, cannot be guided and corrected immediately, which greatly reduces piano learners' interest in learning.
During piano practice, in order to guide a user to play a tune accurately, including correct fingering and notes, and to judge whether the user's fingering and notes are accurate, an intelligent auxiliary method for piano practice is required.
Disclosure of Invention
The invention provides an intelligent auxiliary method, device and medium for piano practice, and aims to solve at least one of the technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides an intelligent auxiliary method for piano practice, including: decomposing a first playing video frame by frame into first key images and preprocessing the first key images, where preprocessing includes marking key region ranges; determining the position information of the pressed key from the time information of the first key image, inputting the first key image into a preset gesture recognition model, predicting the gesture recognition joint points, outputting finger coordinates, and, if a finger coordinate lies within the position range of a key, associating the finger with the key to obtain a first correspondence; decomposing a second playing video frame by frame into second key images and preprocessing the images in the same way; determining the position information of the pressed key from the time information of the second key image, inputting the second key image into the preset gesture recognition model, predicting the gesture recognition joint points, outputting finger coordinates, and, if a finger coordinate lies within the position range of a key, associating the finger with the key to obtain a second correspondence; inputting the first correspondence and the second correspondence into a deep learning model and outputting a playing score; and projecting laser light onto the keys to indicate playing positions and fingering information.
Optionally, the image preprocessing further includes normalizing the size of the keys in the image.
Optionally, marking the key region range includes transforming the key-image coordinate system into a pixel coordinate system and determining the region range of each key from the pixel coordinates.
Optionally, determining the position information of the pressed key includes time-associating the information generated when a key is triggered with the moment in the video to which each image corresponds, thereby obtaining the coordinate information of the pressed key.
Optionally, the preset gesture recognition model is obtained by fine-tuning a trained gesture recognition model on a real data set.
optionally, the intelligent assisting method for piano practice includes reading an area range of the pressed key and the finger coordinate in the position range of the key, comparing the area range of the pressed key and the finger coordinate in real time, and determining that the key is pressed by the finger if the position of the finger is in the area range of the pressed key.
Optionally, predicting the key points of the hand includes passing the recognized gesture prediction-box image into a high-resolution network serving as the backbone neural network, generating high-resolution heat maps at multiple resolutions using convolution and deconvolution modules, predicting the gesture recognition joint points, and outputting finger coordinates.
Optionally, the playing video may be a segment, and during video decomposition the time points corresponding to the segment are selected.
In a second aspect, an embodiment of the present invention further provides an intelligent auxiliary device for piano practice, including: a video decomposition module for decomposing a video into images; a key marking module for selecting the played keyboard-area image and marking the area range of each key; an information acquisition module for identifying the coordinates of pressed keys; a gesture detection module for predicting a confidence map of the hand mask; a gesture recognition module for predicting a confidence map of the hand joint points; a gesture scoring module for evaluating the accuracy of the fingering and notes currently played; a playing guide module for guiding the player; a video recording device for recording the player's performance on site; and a laser guiding device for indicating the player's fingering and note information by emitting laser rays.
The invention has the following beneficial effects:
1. the features of gesture images are extensively learned through a deep convolutional neural network, and finger coordinates are accurately identified;
2. by judging whether a finger coordinate lies within the area range of a pressed key, the correspondence between fingers and pressed keys is accurately matched;
3. by comparing the finger-key correspondences of different videos, the player's fingering and note-playing level is accurately evaluated;
4. relevant information, including fingering and note information, is intuitively indicated to the player through a laser device;
5. from the player's historical score data, it can be judged whether the playing is progressing, and a training plan can be made in a targeted manner.
Drawings
Fig. 1 is a general flowchart of an intelligent assisting method for piano practice according to the present invention.
Fig. 2 is a detailed flowchart of an intelligent assisting method for piano practice according to the present invention.
Detailed Description
The conception, specific structure and technical effects of the present invention will be described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that the objects, solutions and effects of the present invention can be fully understood.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided herein, is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
In a first aspect, an embodiment of the invention provides an intelligent auxiliary method for piano practice, which can make assisted piano practice more targeted and more effective.
As shown in fig. 1, the intelligent auxiliary method for piano practice comprises the following steps:
S1, decomposing the first playing video frame by frame into first key images and preprocessing the images, where preprocessing includes marking key region ranges;
S2, determining the position information of the pressed key from the time information of the first key image, inputting the first key image into a preset gesture recognition model, predicting the gesture recognition joint points, outputting finger coordinates, and, if a finger coordinate lies within the position range of a key, associating the finger with the key to obtain a first correspondence;
S3, decomposing the second playing video frame by frame into second key images and preprocessing the images, where preprocessing includes marking key region ranges;
S4, determining the position information of the pressed key from the time information of the second key image, inputting the second key image into the preset gesture recognition model, predicting the gesture recognition joint points, outputting finger coordinates, and, if a finger coordinate lies within the position range of a key, associating the finger with the key to obtain a second correspondence;
S5, inputting the first correspondence and the second correspondence into a deep learning model and outputting a playing score;
S6, projecting laser light onto the keys to indicate playing positions and fingering information.
In steps S1 and S3, the first and second playing videos may be videos of keyboard instruments such as a piano or a harmonica being played. The first playing video and the second playing video should show the same keyboard instrument playing the same initial tune, and the time-frame lengths used for video decomposition should be the same, so that the images in the first key images and the second key images are synchronized at each moment. The first and/or second playing video may be recorded by a video capture device or selected from a video repository. The first and/or second key images may be one or more key images, each covering the whole keyboard. For example, the first playing video may be a student practice video decomposed into first key images, and the second playing video may be a teacher teaching video decomposed into second key images.
Details of the above steps are described in various embodiments below in conjunction with the flow chart shown in fig. 2.
In an embodiment, the steps S1 and S3 specifically include:
s11, decomposing the video frame: converting a video frame into one or more images by adopting OpenCV (open source computer vision) and outputting the images, wherein the OpenCV is an open source function library used for image processing, analysis and machine vision;
s12, note key region range: and detecting and selecting the played sound area image through a target detection algorithm, and marking the area range of each key. Due to the shooting angle, the shooting distance and the like, the sizes of keys on the image are different, optionally, the sizes of the keys in the image are normalized, after the sizes of the keys are normalized, a key image coordinate system is converted into a pixel coordinate system, and the area range of each key is determined through pixel coordinates. Marking the key range by adopting a target detection algorithm, optionally, detecting the region range of the key by adopting a target detection algorithm based on yolov4, wherein the characteristics are extracted through a sliding window, the window obtains corresponding scores after being classified and identified by a classifier, when the multi-window and other windows are included or crossed, selecting the window with the highest score in the neighborhood and the low inhibition score by adopting a Non-maximum suppression algorithm (NMS), and the region range of the optional key frame can be determined by a rectangular frame consisting of at least two coordinate points.
In an embodiment, the steps S2 and S4 specifically include:
s21, associating the coordinate information of the depressed key: when the action of pressing a key is triggered, the information acquisition module can uniquely determine the position information of the pressed key, for example, when a plurality of keys are pressed simultaneously, because the information generated after each key is triggered is different, each information is associated with a unique key coordinate. In step S2, each image in step S4 corresponds to a moment in the video, and associates the coordinate information of the depressed key by time-associating the generated information after the key is triggered, wherein the key coordinate specifically refers to the area range of the key;
s22, the method for predicting the gesture recognition joints mainly comprises two steps of obtaining a feature map of a hand and predicting key points of the hand, wherein the first step is used for predicting a confidence map of a hand mask, and the second step is used for predicting a confidence map of joint points of the hand, the two steps adopt an iterative cascade structure, and the accuracy of gesture recognition is effectively improved by utilizing the back propagation of end-to-end training:
s22-1, acquiring a characteristic diagram of the hand:
A data set is selected: optionally, the MSCOCO data set, constructed by Microsoft and covering tasks such as detection, segmentation and key points, with more than 200,000 images in more than 80 categories, is used as the training set. Image material collected from more than twenty piano students playing the piano is used as a fine-tuning data set; fine-tuning the trained model further improves the accuracy of target detection. 5,000 images from the RHD data set, a commonly used gesture recognition test data set, are selected as the test set.
An image containing human hand information is taken as input to obtain a feature map whose target is the hand. For example, the target detection model is based on the Yolov3 neural network structure. Specifically, the convolutional (Conv) layers process the input image with several different convolution kernels to obtain different response feature maps; the BN layers normalize each batch of data; and down-sampling is performed by convolution with stride 2. Through feature fusion, the detection network can use the extracted shallow features and deep features simultaneously, output the feature map of the hand, and obtain an effective gesture recognition area. The Yolov3-based target detection model fuses high-level and low-level features and predicts results from multi-scale feature maps; it fully exploits the parallelism of multi-core processors and GPUs, obtaining the feature map of the hand at high speed so that video frames can be detected in real time.
In one embodiment, the input image is first preprocessed, and the spatial layout of the hand in the color image is then encoded. Optionally, the convolution stages of the VGG-19 network up to conv4 generate a 512-channel feature F; increasing the number of channels allows more information to be extracted. Feature F is then convolved to obtain a two-channel hand-mask part. VGG-19 has 19 layers in total, comprising 16 convolutional layers and 3 final fully connected layers, with pooling layers in between.
In one embodiment:
1. input layer: a 64×64×3 three-channel color image is input, and the RGB mean is subtracted from each pixel of the input image;
2. convolutional layer: the input dimension is 64×64×3; the preprocessed image is convolved with 64 convolution kernels of 5×5 + ReLU, stride 1, giving a post-convolution size of 60×60×64;
3. pooling layer: the input dimension is 60×60×64; 2×2 max pooling halves the image size, giving 30×30×64;
4. convolutional layer: the input dimension is 30×30×64; convolution with 96 kernels of 5×5 + ReLU, stride 1, gives 26×26×96;
5. pooling layer: the input dimension is 26×26×96; 3×3 max pooling gives 13×13×96;
6. convolutional layer: the input dimension is 13×13×96; convolution with 128 kernels of 5×5 + ReLU, stride 1, gives 9×9×128;
7. pooling layer: the input dimension is 9×9×128; 3×3 max pooling gives 5×5×128;
8. locally connected layer: the input is 5×5×128; convolution with 3×3 kernels, stride 1, gives 3×3×160;
9. fully connected layers: the input is 3×3×160; full connection + ReLU is applied through three fully connected layers. For example, in hand contour point estimation, 19 hand contour points are estimated, the structure of the connected layers is set accordingly, and finally a 1×1×38-dimensional vector is obtained.
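The layer sizes above follow the usual output-size formulas for convolution and pooling; the sketch below reproduces the first few steps. The pooling stride and ceil/floor rounding mode are not stated in the text and are assumptions here:

```python
import math

def conv_out(size, k, stride=1, pad=0):
    """Output spatial size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k, stride, pad=0, ceil_mode=False):
    """Output spatial size of pooling; ceil_mode keeps partial windows."""
    span = size + 2 * pad - k
    n = math.ceil(span / stride) if ceil_mode else span // stride
    return n + 1

print(conv_out(64, 5))                     # 5x5 conv, stride 1: 64 -> 60
print(pool_out(60, 2, 2))                  # 2x2 max pool, stride 2: 60 -> 30
print(conv_out(30, 5))                     # 30 -> 26
print(pool_out(26, 3, 2, ceil_mode=True))  # 3x3 pool, ceil mode: 26 -> 13
```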
In one embodiment, in the testing phase the 3 fully connected layers are replaced with 3 convolutional layers, so that the resulting fully convolutional network, no longer constrained by full connection, can accept input of any width or height.
In one embodiment, the model is trained in two stages: the first stage trains on a synthetic data set, and the second stage fine-tunes the first-stage model on a real data set, making the model more robust and better performing in real scenes.
S22-2, predicting key points of the hand, and outputting finger coordinates:
A data set is selected: optionally, the InterHand2.6M data set, the largest 3D two-hand interaction pose estimation data set, consisting of 2.6 million video frames, is used as the training set. Image material collected from more than twenty piano students playing the piano serves as a fine-tuning data set; fine-tuning the trained model can further improve the accuracy of pose estimation;
The key points of the hand are predicted: the gesture prediction-box image recognized in step S21 is passed into HRNet, which serves as the backbone neural network; convolution and deconvolution modules generate high-resolution heat maps at multiple resolutions; the gesture recognition joint points are predicted, and the finger coordinates are output.
In one embodiment, 42 hand key points are estimated from the hand bounding box given in step S22-1, with 21 key points estimated for each of the left and right hands.
In one embodiment, the original image and the output of S22-1 are used as inputs for hand key point prediction; the model structure used is the same as in S22-1, and the final fully connected layer outputs an 84-dimensional vector.
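The 84-dimensional output can be read as 42 (x, y) pairs, 21 per hand, matching step S22-2. A minimal decoding sketch; the left-hand-first ordering is an assumption:

```python
def decode_keypoints(vec):
    """Split an 84-dimensional keypoint vector into 21 left-hand and
    21 right-hand (x, y) coordinates (42 points in total)."""
    assert len(vec) == 84, "expected 42 (x, y) pairs"
    pts = [(vec[i], vec[i + 1]) for i in range(0, 84, 2)]
    return pts[:21], pts[21:]

left, right = decode_keypoints(list(range(84)))
print(len(left), len(right))  # → 21 21
print(left[0], right[0])      # → (0, 1) (42, 43)
```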
S23, determining the correspondence between fingers and keys: the area range of the pressed key and the finger coordinates are read and compared in real time; if the finger position in the image lies within the area range of the pressed key, it is judged that the key is pressed by that finger.
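The real-time comparison in S23 is a point-in-rectangle test between each finger coordinate and the pressed key's area range. A sketch with assumed (x1, y1, x2, y2) rectangles and hypothetical ids:

```python
def finger_on_key(finger_xy, key_rect):
    """True if the finger coordinate lies inside the key's area range,
    given as an axis-aligned rectangle (x1, y1, x2, y2)."""
    x, y = finger_xy
    x1, y1, x2, y2 = key_rect
    return x1 <= x <= x2 and y1 <= y <= y2

def match_fingers_to_keys(fingers, pressed_keys):
    """fingers: {finger_id: (x, y)}; pressed_keys: {key_id: rect}.
    Returns the finger-key correspondences found in this frame."""
    return {key: fid
            for key, rect in pressed_keys.items()
            for fid, xy in fingers.items()
            if finger_on_key(xy, rect)}

fingers = {"thumb": (12, 40), "index": (55, 38)}
pressed = {"C4": (0, 0, 20, 100), "E4": (44, 0, 64, 100)}
print(match_fingers_to_keys(fingers, pressed))  # → {'C4': 'thumb', 'E4': 'index'}
```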
In one embodiment, the first correspondence and the second correspondence at the same moment are used as the input of the deep learning model in step S5, and the playing score is its output, where the playing score includes error types such as fingering errors and note errors, the location of each error, and a total score for the performance.
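For illustration only, the comparison of the two correspondences can be sketched as a rule-based stand-in for the deep learning model of step S5: align the student's and teacher's (key, finger) sequences, classify each mismatch as a note or fingering error, and derive a total score. All names and the scoring rule are assumptions, not the patent's model:

```python
def score_performance(student, teacher):
    """student/teacher: time-aligned lists of (key, finger) pairs,
    e.g. the first and second correspondences at the same moments.
    Returns (errors, total): each error is (time_index, error_type)."""
    errors = []
    for i, ((s_key, s_fin), (t_key, t_fin)) in enumerate(zip(student, teacher)):
        if s_key != t_key:
            errors.append((i, "note_error"))       # wrong key pressed
        elif s_fin != t_fin:
            errors.append((i, "fingering_error"))  # right key, wrong finger
    total = round(100 * (1 - len(errors) / len(teacher)))
    return errors, total

teacher = [("C4", 1), ("E4", 2), ("G4", 5)]
student = [("C4", 1), ("E4", 3), ("A4", 5)]
errors, total = score_performance(student, teacher)
print(errors)  # → [(1, 'fingering_error'), (2, 'note_error')]
print(total)   # → 33
```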
In one embodiment, according to the first correspondence or the second correspondence, the playing guidance information is projected onto the keys by the laser device as laser rays, where the guidance information includes the playing position and fingering information.
In one embodiment, the first playing video may be a segment; during video decomposition, the time points of the corresponding segment are selected to ensure that the decomposed first and second playing videos correspond in time.
In one embodiment, a practice plan is made by analyzing the player's historical score data to determine the player's progress or playing weaknesses.
In one embodiment, the first playing video may be an entire tune; during video decomposition, it should be ensured that the time frames of the decomposed first and second playing videos are consistent.
In one embodiment, according to the finger-key correspondence obtained from the teacher teaching video, the laser light projects guidance on fingering and note information.
In a second aspect, an embodiment of the invention further provides an intelligent auxiliary device for piano practice, which can make assisted piano practice more targeted and more effective.
The embodiment of the invention provides an intelligent auxiliary device for piano practice, which comprises:
a video decomposition module for decomposing a video into images, selecting the time length to be decomposed and setting the decomposition time frame; a key marking module for selecting the played keyboard-area image and marking the area range of each key; an information acquisition module for identifying the coordinates of pressed keys; a gesture detection module for predicting a confidence map of the hand mask; a gesture recognition module for predicting a confidence map of the hand joint points; a gesture scoring module for evaluating the accuracy of the fingering and notes currently played; and a playing guide module for guiding the player.
In one embodiment, the intelligent auxiliary device for piano practice may comprise a video storage module for storing playing videos, where a video may be recorded during playing and then uploaded to the video storage module, or downloaded from a website and stored in the video storage module;
in one embodiment, a piano practice intelligent aid may comprise: the video recording device is used for recording the performance video and then uploading the performance video to the video storage module or uploading the performance video to the video storage module, and the laser device is used for projecting guide fingering and tone information.
It should be recognized that the method steps in embodiments of the present invention may be embodied or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, a separate or integrated computer platform, or a platform in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, an optically readable and/or writable storage medium, RAM, ROM, or the like, such that the code may be read by a programmable computer and, when the storage medium or device is read by the computer, configures and operates the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other non-transitory computer-readable storage media when such media contain instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps described above. The invention may also include the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention, so long as they achieve the technical effects of the present invention by the same means. The technical solution and/or implementation of the invention may be modified and varied in other ways within the protection scope of the invention.
Claims (10)
1. An intelligent auxiliary method for piano practice is characterized by comprising the following steps:
s1, decomposing the first playing video frame by frame into first key images, and preprocessing the first key images, wherein preprocessing comprises marking key area ranges;
s2, determining the position information of the pressed key according to the time information of the first key image, inputting the first key image into a preset gesture recognition model, predicting a gesture recognition joint point, outputting a finger coordinate, and associating the finger with the key if the finger coordinate is within the position range of the key to obtain a first corresponding relation;
s3, decomposing the second playing video into second key images frame by frame, and preprocessing the second key images, wherein the preprocessing comprises marking key area ranges;
s4, determining the position information of the pressed key according to the time information of the second key image, inputting the second key image into a preset gesture recognition model, predicting a gesture recognition joint point, outputting a finger coordinate, and associating the finger with the key if the finger coordinate is within the position range of the key to obtain a second corresponding relation;
s5, inputting the first corresponding relation and the second corresponding relation into a deep learning model, and outputting a playing score;
s6, projecting laser light on the keys to prompt playing positions and fingering information.
2. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S1 and S3 comprises: normalizing the size of the keys in the image.
3. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S1 and S3 comprises: transforming the key image coordinate system into the pixel coordinate system, and determining the area range of each key based on the pixel coordinates.
4. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S2 and S4 comprises: each image corresponds to a time point in the video, and the information generated when a key is triggered is associated in time with the coordinate information of the depressed key, thereby determining the position information of the depressed key.
5. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S2 and S4 further comprises fine-tuning the trained gesture recognition model on a real data set.
6. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S2 and S4 comprises: reading the area range of the pressed key and the finger coordinates, comparing them in real time, and judging that the key is pressed by the finger if the finger position lies within the area range of the pressed key.
7. The intelligent assisting method for piano practice according to claim 1, wherein each of the steps S2 and S4 comprises: feeding the recognized gesture prediction frame image into a high-resolution network serving as the backbone neural network, generating high-resolution heat maps at multiple resolutions using convolution and deconvolution modules, predicting the gesture recognition joint points, and outputting the finger coordinates.
8. The intelligent auxiliary method for piano practice as claimed in claim 1, wherein the performance video is a segment, and during the decomposition of the video, the time point of the corresponding segment in the performance video is selected.
9. An intelligent auxiliary device for piano practice, characterized by comprising:
the video decomposition module is used for decomposing the video into images;
the key marking module is used for selecting the played sound area image and marking the area range of each key;
the information acquisition module is used for identifying the coordinates of the pressed keys;
a gesture detection module for predicting a confidence map of the hand mask;
the gesture recognition module is used for predicting a confidence map of the joint points of the hand;
the gesture scoring module is used for evaluating the accuracy level of fingering and musical notes played currently;
the playing guide module is used for guiding a player to play;
the video recording device is used for recording the video played by the player on site;
the laser guiding device is used for guiding the player's fingering and note information, wherein the laser device provides the fingering and note guidance by emitting laser rays.
10. A computer-readable storage medium having stored thereon computer instructions, characterized in that the instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 8.
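The finger-key association recited in steps S2/S4 and claim 6 reduces to a point-in-region test: a finger coordinate is matched to a depressed key when it lies within that key's marked pixel region. A minimal sketch under that reading (key names and rectangle layout are hypothetical):

```python
from typing import Dict, Optional, Tuple

# A key's marked area range as (x_min, y_min, x_max, y_max) in pixel coordinates.
Rect = Tuple[int, int, int, int]

def associate_finger_with_key(finger_xy: Tuple[int, int],
                              pressed_keys: Dict[str, Rect]) -> Optional[str]:
    """Return the pressed key whose area range contains the finger, if any."""
    x, y = finger_xy
    for key, (x0, y0, x1, y1) in pressed_keys.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return key
    return None
```

Collecting these (finger, key) pairs per frame yields the first and second corresponding relations that step S5 feeds into the deep learning model for scoring.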
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110843452.6A CN113657185A (en) | 2021-07-26 | 2021-07-26 | Intelligent auxiliary method, device and medium for piano practice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110843452.6A CN113657185A (en) | 2021-07-26 | 2021-07-26 | Intelligent auxiliary method, device and medium for piano practice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113657185A true CN113657185A (en) | 2021-11-16 |
Family
ID=78490247
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110843452.6A Pending CN113657185A (en) | 2021-07-26 | 2021-07-26 | Intelligent auxiliary method, device and medium for piano practice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113657185A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109215441A (en) * | 2018-10-19 | 2019-01-15 | 深圳市微蓝智能科技有限公司 | A kind of Piano Teaching method, apparatus and computer storage medium |
CN109887375A (en) * | 2019-04-17 | 2019-06-14 | 西安邮电大学 | Piano practice error correction method based on image recognition processing |
CN112488047A (en) * | 2020-12-16 | 2021-03-12 | 上海悠络客电子科技股份有限公司 | Piano fingering intelligent identification method |
WO2021057810A1 (en) * | 2019-09-29 | 2021-04-01 | 深圳数字生命研究院 | Data processing method, data training method, data identifying method and device, and storage medium |
CN112818981A (en) * | 2021-01-15 | 2021-05-18 | 小叶子(北京)科技有限公司 | Musical instrument playing key position prompting method and device, electronic equipment and storage medium |
CN112836597A (en) * | 2021-01-15 | 2021-05-25 | 西北大学 | Multi-hand posture key point estimation method based on cascade parallel convolution neural network |
CN112883804A (en) * | 2021-01-21 | 2021-06-01 | 小叶子(北京)科技有限公司 | Error correction method and device for hand motion during musical instrument playing and electronic equipment |
CN113158748A (en) * | 2021-02-03 | 2021-07-23 | 杭州小伴熊科技有限公司 | Hand detection tracking and musical instrument detection combined interaction method and system |
Non-Patent Citations (1)
Title |
---|
LI, Qiang; LI, Chenxi; GUAN, Xin: "Automatic piano fingering annotation method based on decision HMM and improved Viterbi", Journal of Tianjin University (Natural Science and Engineering Technology Edition), no. 08, pages 48 - 58 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11783615B2 (en) | Systems and methods for language driven gesture understanding | |
US8793118B2 (en) | Adaptive multimodal communication assist system | |
CN111488773B (en) | Action recognition method, device, equipment and storage medium | |
CN113657184B (en) | Piano playing fingering evaluation method and device | |
CN109376612B (en) | Method and system for assisting positioning learning based on gestures | |
US12014645B2 (en) | Virtual tutorials for musical instruments with finger tracking in augmented reality | |
CN105068662B (en) | A kind of electronic equipment for man-machine interaction | |
CN116630106A (en) | Intelligent training interactive teaching management method and system | |
US20240013754A1 (en) | Performance analysis method, performance analysis system and non-transitory computer-readable medium | |
CN111814733A (en) | Concentration degree detection method and device based on head posture | |
CN116386424A (en) | Method, device and computer readable storage medium for music teaching | |
CN113657185A (en) | Intelligent auxiliary method, device and medium for piano practice | |
Fauzi et al. | Recognition of Real-Time Angklung Kodály Hand Gesture using Mediapipe and Machine Learning Method | |
WO2022202266A1 (en) | Image processing method, image processing system, and program | |
Niu et al. | Improved YOLOv5 for skeleton-based classroom behavior recognition | |
Kerdvibulvech et al. | Guitarist fingertip tracking by integrating a Bayesian classifier into particle filters | |
Kerdvibulvech et al. | Markerless guitarist fingertip detection using a bayesian classifier and a template matching for supporting guitarists | |
CN114428879A (en) | Multimode English teaching system based on multi-scene interaction | |
Van Wyk et al. | A multimodal gesture-based virtual interactive piano system using computer vision and a motion controller | |
WO2024212940A1 (en) | Method and device for music teaching, and computer-readable storage medium | |
CN113257055A (en) | Intelligent dance pace learning device and method | |
KR20220066535A (en) | Method, Server and System for Recognizing Motion in Video | |
TW200811767A (en) | Learning assessment method and device using a virtual tutor | |
WO2022202265A1 (en) | Image processing method, image processing system, and program | |
CN118413923B (en) | Intelligent desk lamp learning auxiliary system and method based on AI voice interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||