WO2018194456A1 - Optical music recognition omr : converting sheet music to a digital format - Google Patents
- Publication number
- WO2018194456A1 (PCT/NL2018/050250)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- representation
- music
- neural network
- sequence
- rnn
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0033—Recording/reproducing or transmission of music for electrophonic musical instruments
- G10H1/0041—Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
- G10H1/0058—Transmission between separate instruments or between individual components of a musical system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10G—REPRESENTATION OF MUSIC; RECORDING MUSIC IN NOTATION FORM; ACCESSORIES FOR MUSIC OR MUSICAL INSTRUMENTS NOT OTHERWISE PROVIDED FOR, e.g. SUPPORTS
- G10G1/00—Means for the representation of music
- G10G1/04—Transposing; Transcribing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/066—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for pitch analysis as part of wider processing for musical purposes, e.g. transcription, musical performance evaluation; Pitch recognition, e.g. in polyphonic sounds; Estimation or use of missing fundamental
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/076—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction of timing, tempo; Beat detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/081—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for automatic key or tonality recognition, e.g. using musical rules or a knowledge base
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/086—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for transcription of raw audio or music data to a displayed or printed staff representation or to displayable MIDI-like note-oriented data, e.g. in pianoroll format
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2220/00—Input/output interfacing specifically adapted for electrophonic musical tools or instruments
- G10H2220/005—Non-interactive screen display of musical or status data
- G10H2220/015—Musical staff, tablature or score displays, e.g. for score reading during a performance
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2220/00—Input/output interfacing specifically adapted for electrophonic musical tools or instruments
- G10H2220/155—User input interfaces for electrophonic musical instruments
- G10H2220/441—Image sensing, i.e. capturing images or optical patterns for musical purposes or musical control purposes
- G10H2220/451—Scanner input, e.g. scanning a paper document such as a musical score for automated conversion into a musical file format
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2240/00—Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
- G10H2240/171—Transmission of musical instrument data, control or status information; Transmission, remote access or control of music data for electrophonic musical instruments
- G10H2240/281—Protocol or standard connector for transmission of analog or digital data to or from an electrophonic musical instrument
- G10H2240/311—MIDI transmission
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/311—Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation
Definitions
- The invention relates to an assembly and method for converting data, in particular to an optical music recognition (OMR) assembly for converting sheet music.
- OMR optical music recognition
- This conversion which may comprise a transcription or translation, for instance relates to the fields of automatic music transcription, music information retrieval, optical music recognition, and the like.
- US6297439 for instance according to its abstract describes "A system and method are disclosed for automatically generating music on the basis of an initial sequence of input notes, and in particular to such a system and method utilizing a recursive artificial neural network (RANN) architecture.
- The aforementioned system includes a score interpreter interpreting an initial input sequence, a rhythm production RANN for generating a subsequent note duration, a note generation RANN for generating a subsequent note, and feedback means for feeding the pitch and duration of the subsequent note back to the rhythm generation and note generation RANNs, the subsequent note thereby becoming the current note for a following iteration."
- US8494257 according to its abstract describes "Data set generation and data set presentation for image processing are described. The processing determines a location for each of one or more musical artifacts [..] in the image and identifies a corresponding label for each of the musical artifacts, generating a training file that associates the identified labels and determined locations of the musical artifacts with the image, and presenting the training file to a neural network for training."
- US9123315 according to its abstract describes "A method for transcoding music, according to various aspects of the present invention, includes in any practical order: (a) reading indicia of a plurality of notes, each note having pitch and duration; (b) selecting a reference pitch; (c) determining indicia of tone from the reference pitch and the pitch of each note; and (d) outputting for use by an engraving engine, indicia of an apposite staff and indicia of tones and durations corresponding to the plurality of notes."
- Rewind's approach is similar but boasts a new method using an encoder-decoder network where the encoder and decoder both consist of a gated recurrent unit and a linear layer.
- the encoder layer of Rewind is a single layer autoencoder that captures the temporal dependencies of a song and produces a temporal encoding.
- Rewind is a web app that utilizes a deep learning method to allow users to transcribe, listen to, and see their music.”
- JP2871204B2 in its abstract states: "PURPOSE:To prevent variation in pitch from being decomposed into fine variation in the fine interval of a short note by absorbing the variation in pitch by the hysteresis that a recurrent network has when a musical sound which varies in pitch like singing and vibrato performance is put on a score.
- This device has a band-pass filter bank part 14 which converts an external audio signal into power envelopes by frequency channels, a conflict recollection neural network part 13 which finds pitch categories from the power envelopes by the channels obtained from the band-pass filter bank part 14, an interval buffer part 12 which holds the pitch categories outputted by the conflict recollection neural network part 13, a readout timing generation part 15 which generates transcription intervals of musical intervals required for transcription and a musical interval storage part 11 which inputs and records pitch data from the interval buffer part 12 according to the timing outputted by the readout timing generation part 15.
- WO2008101126A1 in its abstract states: "Methods, systems, and devices are described for collaborative handling of music contributions over a network.
- Embodiments of the invention provide a portal, the portal being accessible over the network by a plurality of workstations and configured to provide a set of editing capabilities for editing music elements.
- Music contributions may be received at the portal. At least a portion of the music contributions include music elements. In certain embodiments, the music elements have been deconstructed from an audio signal or a score image.
- A number of collaboration requests may be received at the portal over the network. Some collaboration requests may originate from a first workstation, while other collaboration requests may originate from a second workstation. In response to at least one of the collaboration requests, at least a portion of the music elements may be edited using the editing capabilities of the portal."
- US2016099010A1 in its abstract states: "Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance.
- One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance."
- CNN convolutional neural network
- LSTM long short-term memory network
- the complex image and text sequence identification method includes the steps: utilizing a sliding sampling box to perform sliding sampling on an image and text sequence to be identified; extracting the characteristics from the sub images obtained through sampling by means of a CNN and outputting the characteristics to an RNN, wherein the RNN successively identifies the front part of each character, the back part of each character, numbers, letters, punctuation, or blank according to the input signal; and successively recording and integrating the identification results for the RNN at each moment and acquiring the complete identification result, wherein the input signal for each moment for the RNN also includes the output signal of a recursion neural network for the last moment.
- The complex image and text sequence identification method can overcome the cutting problem of a complex image and text sequence, and can significantly improve the identification efficiency and accuracy for images and text."
- the current invention provides an optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system:
- sequence-to-sequence system comprising:
- CNN convolutional neural network
- RNN recurrent neural network
- RNN decoder recurrent neural network
- the assembly provides an end-to-end trainable sequential model.
- This model can be trained as one pipeline by offering input at an input end and retrieving output at an output end.
- Examples of a first temporal representation are, for instance, a graphical music representation or a sound recording.
- the first temporal representation relates to sheet music.
- Sheet music in its most general form relates to a graphical representation of music information. In a specific embodiment, it relates to a form of music notation, primarily used to notate western music. It is a way of writing down sequential musical information in a compact way, readable for human performers.
- In western music, musical symbols are notated on a staff, a group of five evenly spaced horizontal lines, used to differentiate between the pitches in written music. The higher a note is on a staff, the higher the pitch.
- a page of music typically consists of multiple staff lines, much like a piece of written text consists of multiple lines. The horizontal position of a note determines the order of the written musical sequence: sheet music is read from left to right.
- Figures 1-6 are examples of such a music notation.
- one staffline with fifteen ascending notes from the scale of C major.
- the final symbol is called a rest. It is similar to a musical note in every way, but instead of indicating a pitch it indicates a period of silence.
- the first temporal representation relates to a sound recording.
- Sound can be recorded, and for instance transformed into a digital format. Sound comprises music, spoken text in a language, and the like. Sound can be recorded and subsequently compressed using a lossless or lossy compression. Examples of such digital formats are MP3, MP4, but in fact any type of lossy compression or lossless compression of sound recordings can be used.
- FLAC (Free Lossless Audio Codec)
- ALAC (Apple Lossless)
- MPEG-4 ALS
- WMA Lossless (Microsoft's Windows Media Audio 9 Lossless)
- Monkey's Audio
- TTA
- WavPack
- Some audio formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file.
- Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream.
- Examples of a second temporal representation are, for instance, sound recordings, music files, or a graphical music representation like sheet music.
- the sound may be music, spoken text (the same text for instance as in the first representation) in a language or accent different from spoken text in the first temporal representation.
- the second temporal representation relates to a digital music format.
- Music can be stored in various digital formats.
- An example of such a digital format is the MIDI protocol, see MIDI Association. The Official Midi Specifications. 1996. URL: https://www.midi.org/specifications, which is widely used by musical sequencers and notation software. It is the de facto standard for exchanging digital musical information.
- A notation format much closer to actual sheet music is ABC notation, MIDI.
- Digital music notation formats have numerous advantages over their optical counterparts. Images consist of pixel data, and do not give any direct information about the represented musical content. As a result, computational musical analysis cannot be performed on these image formats directly. While an increasing amount of digitized sheet music is available, a large portion of available sheet music is still only accessible as images.
- The Musical Instrument Digital Interface (MIDI) standard defines a system of communication between digital musical devices. It is a very compact method to define musical information, as all events are represented on byte level.
- The MIDI file is a way of storing MIDI information, and is typically only a few kilobytes in size. Since the 1980s, MIDI has been the standard way of playing digital music, favoured because of its ease of use and expandability.
- The MIDI standard provides a way of expressing a variety of musical events. Timing in MIDI is handled differently than in sheet music.
- A MIDI event has a tick property, which defines the number of ticks between the previous and current event.
- The duration of one tick is defined in the header of the MIDI file by the Pulses Per Quarter Note (PPQ) value. This value informs the musical sequencer how many ticks one quarter note contains. Typically, as the duration of most musical notes can be divided by either three or four, the PPQ of MIDI is a multiple of twelve. Most MIDI sequencers take a standard PPQ of 480, but for simpler music lower values are possible too. Timing is the only structural component in a MIDI file. Events like barlines are not present in MIDI.
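The tick arithmetic described above can be sketched as follows. This is a minimal illustration, not part of the described assembly: the function name `duration_in_ticks` is hypothetical, and note lengths are assumed to be given as fractions of a quarter note.

```python
from fractions import Fraction


def duration_in_ticks(quarter_notes: Fraction, ppq: int = 480) -> int:
    """Convert a note length, measured in quarter notes, into MIDI ticks.

    `ppq` is the Pulses Per Quarter Note value from the MIDI file header.
    Raises ValueError when the length cannot be expressed as a whole number
    of ticks at this resolution.
    """
    ticks = quarter_notes * ppq
    if ticks.denominator != 1:
        raise ValueError(f"{quarter_notes} quarter notes is not a whole number of ticks at PPQ={ppq}")
    return int(ticks)


# An eighth note is half a quarter note; a triplet eighth is one third.
eighth = duration_in_ticks(Fraction(1, 2))            # PPQ 480
triplet_eighth = duration_in_ticks(Fraction(1, 3), 12)  # PPQ 12
```

A PPQ that is a multiple of twelve keeps both binary subdivisions (halves, quarters) and triplets on whole ticks, which is why a value such as 10 would fail for the triplet case.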
- a NoteOnEvent defines a note with a starting tick, pitch and velocity. If the defined velocity is zero, it can be considered the same as a NoteOffEvent, which signals the end of a musical note.
- A musical note can be defined by two NoteOnEvents: the first event defines the starting tick and velocity of the note, the second event defines the end tick of the note.
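As a rough sketch of how pairs of note-on events map to notes with a pitch and a duration: the tuple layout `(delta_ticks, pitch, velocity)` and the function name `events_to_notes` are assumptions made for this illustration and are not taken from any particular MIDI library.

```python
def events_to_notes(events):
    """Pair note-on events into (pitch, start_tick, duration_ticks) tuples.

    `events` is a list of (delta_ticks, pitch, velocity) tuples, where
    delta_ticks counts ticks since the previous event and a velocity of
    zero marks the end of the note with that pitch.
    """
    notes = []
    active = {}  # pitch -> absolute tick at which the note started
    now = 0      # absolute tick position, accumulated from the deltas
    for delta, pitch, velocity in events:
        now += delta
        if velocity > 0:
            active[pitch] = now
        elif pitch in active:
            start = active.pop(pitch)
            notes.append((pitch, start, now - start))
    return notes


# A quarter note on middle C (60) followed by an eighth note on D (62), PPQ 480.
example_events = [(0, 60, 64), (480, 60, 0), (0, 62, 64), (240, 62, 0)]
example_notes = events_to_notes(example_events)
```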
- Other events are the KeySignatureEvent, which defines the key signature of a piece, and EndOfTrack, which defines the end of a MIDI track. Contrary to sheet music, where the key signature changes the notation of accidentals in a piece, in MIDI the key signature does not influence the notes in a file in any way. It is only used as meta information.
- CNN Convolutional Neural Network
- ANN artificial neural network
- A Convolutional Neural Network (CNN) is a type of artificial neural network (ANN) which in an embodiment is designed for the processing of spatial data.
- convolutional layers are a functionality of CNNs. These convolutional layers replace the weights of a traditional feed forward neural network with trainable convolution filters. Filters can learn simple operations like edge, blob and corner detection. When stacked in multiple layers, the recognition of complicated spatial structures can be learned.
- A single convolutional layer consists of multiple filters, where each filter operates on small areas of the layer input, producing a feature map. Feature maps represent the convolution between a filter and the input of a layer. After the convolutional layer, in an embodiment a non-linearity and possibly a pooling operation are applied.
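To make the notion of a feature map concrete, the sketch below slides one filter over a small 2D input. It is only an illustration under assumptions: the filter values are hand-picked rather than learned, no padding is used, and the operation is the unflipped cross-correlation that deep learning libraries commonly call convolution. The name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) and return the feature map.

    Each output value is the sum of element-wise products between the kernel
    and the image patch it currently covers ("valid" mode, no padding).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A dark-to-bright vertical edge, and a 1x2 filter that responds to such edges.
edge_image = [[0, 0, 1, 1]] * 3
edge_filter = [[-1, 1]]
edge_map = conv2d_valid(edge_image, edge_filter)
```

The feature map is non-zero only at the column where the intensity changes, which is the kind of simple edge response a trained filter may learn.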
- Pooling operations perform down-sampling on a feature map.
- a popular type of pooling is max pooling, where each pooling region outputs its maximum value.
- A pooling operation has a region size and stride associated with it. The region size defines the width and height of the pooling region; the stride defines the number of steps between pooling regions.
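Max pooling with a region size and stride can be sketched as below. The function name `max_pool2d` and its argument order are assumptions for illustration; the same stride is used horizontally and vertically, and regions that would extend past the border are skipped.

```python
def max_pool2d(feature_map, region_h, region_w, stride):
    """Down-sample `feature_map` (a list of rows) by taking the maximum of each region."""
    pooled = []
    for r in range(0, len(feature_map) - region_h + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - region_w + 1, stride):
            row.append(
                max(
                    feature_map[r + i][c + j]
                    for i in range(region_h)
                    for j in range(region_w)
                )
            )
        pooled.append(row)
    return pooled


fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool2d(fmap, region_h=2, region_w=2, stride=2)
```

With a 2x2 region and stride 2 the regions do not overlap, so the 4x4 map is reduced to 2x2.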
- the resulting feature maps are in an embodiment reshaped to a column vector and one or more fully connected layers are applied. These layers are the same as the layers in regular feed-forward neural networks. As in the rest of the network, sigmoid activations can for instance be used.
- ANNs with recurrent connections are an addition to the standard ANN model described in the previous section. These models belong to the family of Recurrent Neural Networks (RNNs). In addition to predicting from current input data, they are able to predict from past inputs as well. This property adds a new functionality to a network: the ability to work with sequential data.
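The dependence on past inputs can be illustrated with a single recurrent unit. This is a toy sketch with hand-picked weights (`w_in`, `w_rec` are hypothetical values, not trained parameters), using a plain tanh recurrence rather than any specific architecture from the text.

```python
import math


def rnn_forward(inputs, w_in=0.5, w_rec=0.9, h0=0.0):
    """Run a one-unit recurrent network over `inputs` and return every hidden state.

    Each state depends on the current input and, through `w_rec`, on the
    previous state, so earlier inputs keep influencing later outputs.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


# Only the first input is non-zero, yet the second state is still positive.
states = rnn_forward([1.0, 0.0])
```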
- LSTM Long Short Term Memory
- One way to extend the range of tasks RNNs can perform is to have them map sequences to sequences, possibly with different lengths and different orders.
- The architecture is called a sequence-to-sequence network, and is based on an encoder-decoder structure.
- This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation.
- both the encoder (RNN) and decoder (RNN) are LSTM networks.
- the hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM.
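The hand-off from encoder to decoder can be sketched with a scalar LSTM cell. Everything here is an illustrative assumption: the weights in `PARAMS` are arbitrary and untrained, the hidden size is one, the cell state is handed over together with the hidden state, and the decoder feeds each output back as its next input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, params):
    """One step of a scalar LSTM cell; `params` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h + b)

    i = gate("input", sigmoid)
    f = gate("forget", sigmoid)
    o = gate("output", sigmoid)
    g = gate("candidate", math.tanh)
    c_new = f * c + i * g            # update the cell memory
    h_new = o * math.tanh(c_new)     # expose part of it as the hidden state
    return h_new, c_new


def encode(inputs, params):
    """Consume the whole input sequence and return the final (hidden, cell) state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h, c


def decode(state, params, steps):
    """Start from the encoder's final state and emit `steps` outputs."""
    h, c = state
    y = 0.0  # assumed start-of-sequence input
    outputs = []
    for _ in range(steps):
        h, c = lstm_step(y, h, c, params)
        y = h
        outputs.append(y)
    return outputs


PARAMS = {name: (1.0, 0.5, 0.0) for name in ("input", "forget", "output", "candidate")}
outputs = decode(encode([1.0, 2.0, 3.0], PARAMS), PARAMS, steps=5)
```

The input has three steps and the output five, showing that the two sequence lengths are decoupled by the hidden representation.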
- the convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are functionally coupled, forming said sequence-to-sequence system, and said sequence-to-sequence system is trained using a training dataset of first temporal representations and known, resulting second temporal representations.
- the output of said second, decoder recurrent neural network (RNN) is compared to said second temporal representation, and parameters of said convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are modified.
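The compare-and-modify loop described above can be illustrated on a model with a single parameter. This sketch assumes gradient descent on a mean squared error, which the text here does not specify; the function name `train_scale` and the toy data are hypothetical, and a real system would update all CNN and RNN parameters jointly.

```python
def train_scale(inputs, targets, learning_rate=0.05, epochs=200):
    """Fit y = w * x by repeatedly comparing outputs with targets and adjusting w."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error between model output and target.
        grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, targets)) / len(inputs)
        w -= learning_rate * grad  # modify the parameter to reduce the error
    return w


learned_w = train_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```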
- The first RNN comprises a long short-term memory (LSTM) architecture.
- The second RNN comprises a long short-term memory (LSTM) architecture.
- The first temporal representation is a graphical representation, in particular a digital image, more in particular said graphical representation is a representation of music, in particular sheet music.
- The second digital representation is a digital file comprising temporal instructions for actuating a device, in particular a music file for actuating or controlling a musical instrument, more in particular selected from MIDI, MusicXML.
- the series of time slices comprise digital images obtained by sliding a window over a graphical representation, in particular over a digital image.
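Sliding a window over an image to obtain time slices can be sketched as follows. The image is assumed to be a list of pixel rows, and the names `sliding_windows`, `width` and `stride` are hypothetical; the text does not state particular window sizes.

```python
def sliding_windows(image, width, stride):
    """Cut `image` (a list of pixel rows) into full-height slices, left to right.

    Each slice is `width` pixels wide and starts `stride` pixels after the
    previous one, so a stride smaller than the width gives overlapping slices.
    """
    n_cols = len(image[0])
    return [
        [row[start:start + width] for row in image]
        for start in range(0, n_cols - width + 1, stride)
    ]


tiny_image = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10, 11],
]
slices = sliding_windows(tiny_image, width=4, stride=2)
```

Because sheet music is read from left to right, the order of the slices follows the order of the notes on a staff line.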
- the invention further relates to an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising an assembly described, and wherein - a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
- the sheet music comprises a graphical representation comprising a series of staff lines and notes, in particular said sheet music comprises a visual representation on a carrier, in particular a written or printed representation on paper.
- the convolutional neural network, said first, encoder recurrent neural network, and said second, decoder recurrent neural network have been trained as said sequence-to-sequence system using a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation.
- the invention further relates to an automatic music transcription (AMT) assembly comprising the assembly, wherein said first temporal representation of a signal is a sound recording.
- AMT automatic music transcription
- the sound recording is a digital sound recording.
- the AMT assembly further comprises a spectral converter, said spectral converter allowing conversion of said sound recording into a series of spectral representations of said sound recording.
- the software when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
- the time windows have a window overlap.
- the invention further pertains to a method for converting sheet music into a digital representation using the OMR assembly, comprising:
- CNN convolutional neural network
- RNN recurrent neural network
- the invention further relates to an assembly for producing an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming a music part, said OMR assembly comprising a data processor and software which, when running on said data processor: -provide a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation;
- OMR optical music recognition
- - provide a neural network assembly comprising :
- CNN convolutional neural network
- a first, encoder recurrent neural network as an encoder adapted for receiving said sequence of numerical representations for providing a hidden output data set
- a second, decoder recurrent neural network as a decoder adapted for receiving said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part;
- the invention further relates to a method for producing the assembly, wherein:
- a training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled; - said time slices are for converting said time slices into a sequence of third representations of said first temporal representation;
- RNN recurrent neural network
- RNN recurrent neural network
- the neural networks are trained using back propagation of said training dataset.
- the parameters in said neural networks are adjusted based upon gradient descent optimization.
- the invention further pertains to an assembly for converting a first temporal representation of a signal into a second temporal representation of said signal, said assembly comprising a data processor system and software which, when running on said data processor system:
- sequence-to-sequence system comprising:
- CNN convolutional neural network
- RNN recurrent neural network
- RNN decoder recurrent neural network
- This assembly in an embodiment can comprise all the features of the dependent claims.
- that assembly provides an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising an assembly according to any one of the preceding claims, wherein
- OMR optical music recognition
- a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
- the assembly is an automatic music transcription (AMT) assembly, wherein said first temporal representation of a signal is a sound recording.
- AMT automatic music transcription
- the AMT assembly further comprises a spectral converter, said spectral converter allowing conversion of said sound recording into a series of spectral representations of said sound recording.
- said software when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
- the time windows have a window overlap, in particular said window overlap is less than said time window width, more in particular less than 50% of said time window width.
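The spectral converter described above can be sketched as follows. This is a minimal illustration, assuming NumPy, a Hann taper and a window size given in samples; the function name `magnitude_spectrograms` and all parameter values are hypothetical and are not taken from the source.

```python
import numpy as np

def magnitude_spectrograms(signal, window_size, overlap):
    """Shift a time window over a 1-D signal and return one magnitude spectrum per window.

    window_size and overlap are in samples; overlap must be smaller than window_size.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    hop = window_size - overlap                    # shift between successive windows
    taper = np.hanning(window_size)                # assumed taper, not specified in the source
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * taper
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude of the one-sided spectrum
    return np.array(frames)                        # shape: (n_windows, window_size // 2 + 1)

# Example: one second of a 440 Hz tone sampled at 8 kHz, window overlap of 25%.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
spec = magnitude_spectrograms(tone, window_size=512, overlap=128)
```

Each row of `spec` then plays the role of one time slice. A smaller overlap gives fewer, less redundant slices for the same recording.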
- the invention further pertains to a method for producing the assembly, wherein: - a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal; - said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- time slices are for converting said time slices into a sequence of third representations of said first temporal representation
- RNN recurrent neural network
- RNN recurrent neural network
- the neural networks are trained using back propagation of said training dataset.
- the parameters in said neural networks are adjusted based upon gradient descent optimization.
- the neural networks are defined through software running on computer systems that comprise one or more so called graphics cards that are commonly used for driving display devices.
- the logical structure of these graphics cards makes them suited for implementing neural networks. This is well known to a skilled person.
- the neural networks may also be implemented on specially designed computer devices or computer systems. Once a neural network is trained, and the parameters and structure are known, this structure may also be extracted and implemented on a general purpose computer system.
- the training and/or implementation of one or more of the neural networks may be through software, or partially or completely based upon a hardware implementation.
- upstream and downstream relate to an arrangement of items or features relative to the propagation of the light from a light generating means (here especially the first light source), wherein relative to a first position within a beam of light from the light generating means, a second position in the beam of light closer to the light generating means is “upstream”, and a third position within the beam of light further away from the light generating means is “downstream”.
- data is also passed through the network from upstream to downstream.
- substantially herein, such as in “substantially consists”, will be understood by the person skilled in the art.
- the term “substantially” may also include embodiments with “entirely”, “completely”, “all”, etc. Hence, in embodiments the adjective substantially may also be removed.
- the term “substantially” may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%.
- the term “comprise” includes also embodiments wherein the term “comprises” means “consists of”.
- the term “functionally” is intended to cover variations in the feature to which it refers, and which variations are such that in the functional use of the feature, possibly in combination with other features it relates to in the invention, that combination of features is able to operate or function. For instance, if an antenna is functionally coupled or functionally connected to a communication device, electromagnetic signals that are received by the antenna can be used by the communication device.
- the word “functionally” as for instance used in “functionally parallel” is used to cover exactly parallel, but also the embodiments that are covered by the word “substantially” explained above.
- “functionally parallel” relates to embodiments that in operation function as if the parts are for instance parallel. This covers embodiments for which it is clear to a skilled person that it operates within its intended field of use as if it were parallel.
- the invention further applies to an apparatus or device comprising one or more of the characterising features described in the description and/or shown in the attached drawings.
- the invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings.
- Figures 1-6 provide some illustration of sheet music and its challenges, in which Figure 1 depicts an example of a staff line with the scale of C in ascending order;
- Figure 2 depicts an example of a fragment of sheet music with mixed durations
- Figure 3 depicts an example of a fragment with two sharps as key signature and multiple accidentals
- Figure 4 depicts an example of a fragment of polyphonic music
- Figure 5 depicts an example of a fragment containing two quarter triplets
- Figure 6 depicts an example of a double staff line, or 'Grand staff', with ambiguous notation
- Figure 7 illustrates in an abstract manner a Recurrent Neural Network (RNN);
- Figure 8 illustrates an RNN in unfolded manner
- Figure 9 illustrates a Long Short term memory (LSTM) for use in a RNN
- Figure 10 illustrates a sliding window over (part) of a staff line
- Figure 11 illustrates processing of the data provided using the approach of figure
- Figure 12 illustrates a complete, schematic overview of the algorithm, here with first 4 time steps and 4 decoder outputs, and
- Figure 13 illustrates an approach for an AMT application.
- the first example relates to optical music recognition or OMR.
- a graphical representation of music is converted into a digital music representation, in particular a digital representation that comprises machine instructions for playing music.
- sheet music, in particular classical sheet music, is converted into one or more MIDI or MusicXML files.
- the second example relates to automatic music transcription or AMT.
- music is converted into a symbolic representation of the music.
- a digital recording, for instance in the "mp3" file format, is converted into classical sheet music.
- Figure 1 is an example of one staff line, with fifteen ascending notes from the scale of C major. The final symbol is called a rest. It is similar to a musical note in every way, but instead of indicating a pitch it indicates a period of silence.
- Sheet music is typically divided in time by bars. Each bar in music is of a fixed duration, indicated by the time signature. In figure 1, the time signature can be read at the start of the staff line: 4/4. Each following bar will be of that particular time signature, until it is changed. The start and end of a bar are indicated by barlines. These are the vertical lines between each four notes in figure 1.
- Figure 2 is slightly more complex and combines a mix of durations and pitches. Additionally, as shown in this example, two separate notes can be grouped together with a tie, like the two centre notes in the figure. Tied notes are played as one single note, where the duration is the sum of the durations of the tied notes. Finally, there are the concepts of key signature and accidentals. These indicate long and short term dependencies in sheet music to raise or lower the notated pitch of a note. For this, we utilize three different symbols: the sharp (♯), flat (♭) and natural (♮) symbols. The key signature of a piece indicates a global key and raises or lowers all indicated notes. Accidentals are more local, and only change the pitch of a note for the duration of the bar.
- Figure 3 shows an example of these concepts.
- the key signature, indicated at the start of the staff line, is two sharps. A sharp is added to the pitches on the horizontal lines on which the sharps of the key signature are located. Within each bar, multiple accidentals can be observed, changing the pitches of the notes pertaining to the accidental for the duration of the bar.
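The interplay between a key signature (long term) and accidentals (valid until the barline) can be made concrete with a small sketch. It is a simplified model, assuming notes are encoded as strings such as "F4" and that the key signature is a mapping from note letter to semitone offset; the names `OFFSETS` and `resolve_bar` are hypothetical and not part of the source.

```python
# Semitone offsets for the three accidental symbols.
OFFSETS = {"sharp": 1, "flat": -1, "natural": 0}

def resolve_bar(notes, key_signature):
    """Return the semitone alteration of every note in one bar.

    notes: list of (staff_position, accidental_or_None) in reading order.
    key_signature: dict mapping a note letter to its offset, e.g. {"F": 1, "C": 1}.
    """
    active = {}                                     # accidentals seen so far in this bar
    resolved = []
    for position, accidental in notes:
        if accidental is not None:
            active[position] = OFFSETS[accidental]  # holds until the barline
        letter = position[0]                        # "F4" -> "F"
        offset = active.get(position, key_signature.get(letter, 0))
        resolved.append((position, offset))
    return resolved

two_sharps = {"F": 1, "C": 1}                       # key signature of two sharps
bar = [("F4", None), ("F4", "natural"), ("F4", None), ("C5", None), ("B4", "flat")]
result = resolve_bar(bar, two_sharps)
```

In this bar the first F is raised by the key signature, the natural cancels the sharp for the second and third F, and the `active` dictionary would be reset at the next barline.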
- a key difference between written text and sheet music is the possibility of polyphony.
- a polyphonic piece of music will contain multiple concurrent sequences, possibly notated in one staff line.
- Figure 4 displays an example excerpt of a polyphonic score.
- Polyphony adds a layer of complexity to sheet music, and has to be taken into consideration when designing systems that have to be able to deal with these problems.
- key signatures and time signatures are examples of long dependencies in sheet music.
- the time signature is only indicated at the start of a piece or in case of a time signature change.
- long term dependencies over multiple pages of sheet music.
- key signature can be changed in the middle of a piece of sheet music, changing the long term dependencies of upcoming bars.
- there can be bidirectional dependencies. Structures like tuplets can depend on both previous and next notes. This is referred to as a bidirectional dependency.
- An example of such a tuplet is the quarter triplet shown in figure 5.
- the numbers '3' in the picture indicate that the notes before, under and after the number are part of a triplet structure, which has an altered duration compared to regular quarter notes.
- Sheet music has the possibility of ambiguous notation.
- ANNs Artificial neural networks
- RNNs Recurrent Neural Networks
- a cyclical component 5 is added to the hidden layers 3, as shown in figure 7.
- the hidden layers 3 output information to the output layer 4 and to themselves.
- the information from the cyclical connection 5 is used the next time the network does a forward pass.
- the recurrent model 1 in general as depicted in figure 7 can be unrolled into multiple time steps 2 together forming input 2 as shown in figure 8, to make it visually more understandable. It shows the network at different steps in time, connected by the recurrent connections 5.
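The unrolled view can be sketched as a plain loop over time steps. This is a minimal illustration, assuming NumPy, a single hidden layer with a tanh activation and randomly initialised weights; the names `rnn_forward`, `W_xh`, `W_hh` and `W_hy` are hypothetical.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, W_hy, h0):
    """Run a simple recurrent layer over a sequence, one time step at a time."""
    h = h0
    outputs = []
    for x in inputs:                        # one iteration per unrolled time step
        h = np.tanh(W_xh @ x + W_hh @ h)    # the hidden state feeds back into itself
        outputs.append(W_hy @ h)            # the hidden layer also feeds the output layer
    return np.array(outputs), h

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))            # 4 time steps, 3 features each
W_xh = rng.normal(size=(5, 3))              # input to hidden
W_hh = rng.normal(size=(5, 5))              # recurrent (cyclical) connection
W_hy = rng.normal(size=(2, 5))              # hidden to output
outputs, last_hidden = rnn_forward(inputs, W_xh, W_hh, W_hy, np.zeros(5))
```

The same weight matrices are reused at every step; only the hidden state `h` carries information from one time step to the next.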
- the inventor further selected in an embodiment a special kind of recurrent architecture called the Long Short Term Memory (LSTM).
- LSTM Long Short Term Memory
- a neuron 6 in an LSTM network has multiple gates that control the remembering and forgetting of data from past time steps, improving the ability to model long dependencies.
- the output h_t is different from the cell state C_t, allowing for better pass-through of information. This is a key difference compared to the traditional recurrent connections.
- a schematic of the LSTM cell/neuron is shown in figure 9, with input x_t and output h_t.
- the output and next hidden state h are split for clarity, but their values are the same.
- the σ and tanh are the sigmoid and tanh activation functions, each with their own weight matrix.
- the × and + symbols are element-wise multiplication and addition. σ1 acts as a forget gate. It decides, based on the previous hidden state h_{t-1} and the current input x_t, what information is kept in the cell state C_t.
- the forget gate is denoted with f, and shown in formula 1 below.
- σ2 is the input gate, or i (formula 2). It decides how much of the computed state is passed through.
- the tanh1 creates new candidate memories for C_t. These two are added to the previous cell state C_{t-1} modulated by the forget gate, to calculate the new cell state C_t (formula 5).
- σ3 is the output gate of the LSTM cell, formulated in 3. If the cell state contains relevant information, the information is passed to the next hidden state h_t (formula 6) and thus to the output of the LSTM cell.
- W denotes the weight matrix for a specific gate and a specific input.
- Wf is the weight matrix for the forget gate and the input
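The formulas numbered 1 to 6 are referred to above but are not reproduced in this text. As a reading aid, the widely used LSTM formulation is given below. It matches the gate descriptions, but the bias terms b, the subscripted weight names (W_{xf}, W_{hf} and so on) and the assignment of number 4 to the candidate memory are assumptions rather than quotations from the source.

```latex
\begin{align}
f_t &= \sigma\left(W_{xf}\, x_t + W_{hf}\, h_{t-1} + b_f\right) \tag{1}\\
i_t &= \sigma\left(W_{xi}\, x_t + W_{hi}\, h_{t-1} + b_i\right) \tag{2}\\
o_t &= \sigma\left(W_{xo}\, x_t + W_{ho}\, h_{t-1} + b_o\right) \tag{3}\\
\tilde{C}_t &= \tanh\left(W_{xc}\, x_t + W_{hc}\, h_{t-1} + b_c\right) \tag{4}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{5}\\
h_t &= o_t \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

Here \odot denotes element-wise multiplication, so formula 5 is the sum of the previous cell state modulated by the forget gate and the candidate memories modulated by the input gate.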
- the current architecture is referred to as a sequence-to-sequence network, and is based on an encoder-decoder structure.
- This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation.
- both the encoder and decoder are LSTM networks.
- the hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM.
- An example sequence-to-sequence network is shown in figure 11. In this figure, the input {A, B, C, D} is encoded to a hidden representation, which is mapped to the output {X, Y, Z}.
- a fixed length hidden representation can be a bottleneck in the network. Adding a 'search function' to the decoder, can reduce this bottleneck.
- an attention vector a'_t is calculated from all encoder states (h_1..h_n). The attention is calculated as follows:
- the sheet music 7 is transformed into time slices 8 (indicated A, B, C, D, ...) and input into the system.
- a sliding window 9 over the input sheet music 7 is used, as illustrated in figure 10.
- This method effectively transforms the input to a sequence of image patches or time slices 8, and makes the translation between sheet music and MIDI a sequence-to-sequence problem.
- the translation exhibits similar traits to neural machine translation:
- the input 8 and output 50 sequences are possibly of different lengths, with alignment problems.
- each patch is fed into a Convolutional Neural Network (CNN) 40.
- CNN Convolutional Neural Network
- a max-pooling operation of 3x3 is applied on the input patch for dimensionality reduction.
- a convolutional layer of 32 5x5 kernels is applied, followed by a ReLU activation and 2x2 max-pooling operation. This layer is repeated, and a fully-connected layer of 256 units with ReLU activation is applied, so each input for the encoder will be a vector of size 256.
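To see how the layer stack reduces a patch to a vector of size 256, the shapes can be traced with a few lines of arithmetic. The sketch below assumes a single-channel patch, stride-1 convolutions without padding and non-overlapping pooling; none of these details, nor the example patch size of 288 by 96 pixels, are stated in the source.

```python
def pooled(size, k):
    """Output length of non-overlapping k-wide max-pooling (floor division)."""
    return size // k

def convolved(size, k):
    """Output length of a stride-1 convolution without padding ('valid')."""
    return size - k + 1

def cnn_output_shape(height, width):
    """Trace a patch through the layer stack described above."""
    shape = (pooled(height, 3), pooled(width, 3), 1)            # 3x3 max-pooling on the input
    for _ in range(2):                                          # the conv block is applied twice
        h, w = convolved(shape[0], 5), convolved(shape[1], 5)   # 32 kernels of 5x5
        shape = (pooled(h, 2), pooled(w, 2), 32)                # ReLU keeps the shape; 2x2 pooling
    flattened = shape[0] * shape[1] * shape[2]                  # input size of the dense layer
    return shape, flattened, 256                                # dense layer of 256 units

shape, flattened, vector_size = cnn_output_shape(height=288, width=96)
```

Under these assumptions the dense layer maps the flattened feature maps to the 256-dimensional vector that is handed to the encoder.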
- the sequence of vectors is fed into a sequence-to-sequence network.
- This architecture consists of two RNNs 10, 20.
- the first RNN 10, the encoder 10, encodes the full input sequence 8 to a fixed size hidden representation.
- the second RNN 20, the decoder 20, produces a sequence of outputs 51 from the encoded hidden representation, together forming output 50.
- this sequence of outputs 51 is the sequence of pitches and durations generated from the MusicXML files.
- a single layer Long-Short Term Memory RNN is used with 256 units. To predict both the pitch and duration, the output of the decoder is split into two separate output layers with a softmax activation and categorical cross-entropy loss.
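The split of the decoder output into a pitch layer and a duration layer can be sketched as two independent softmax classifiers on the same 256-unit vector. This is a minimal illustration, assuming NumPy, 128 pitch classes, 16 duration classes and a total loss that is simply the sum of the two cross-entropies; these numbers, the summation and the name `two_head_loss` are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def two_head_loss(decoder_state, W_pitch, W_duration, pitch_target, duration_target):
    """Sum of the categorical cross-entropies of the pitch and duration heads."""
    pitch_probs = softmax(W_pitch @ decoder_state)
    duration_probs = softmax(W_duration @ decoder_state)
    loss = -np.log(pitch_probs[pitch_target]) - np.log(duration_probs[duration_target])
    return pitch_probs, duration_probs, loss

rng = np.random.default_rng(0)
state = rng.normal(size=256)                    # one decoder output, 256 units
W_pitch = rng.normal(size=(128, 256)) * 0.01    # 128 pitch classes: an assumption
W_duration = rng.normal(size=(16, 256)) * 0.01  # 16 duration classes: an assumption
pitch_probs, duration_probs, loss = two_head_loss(state, W_pitch, W_duration, 60, 3)
```

Keeping the two heads separate lets the model predict a pitch and a duration per output step without enumerating every pitch-duration combination as its own class.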
- the dataset used in this research is compiled from monophonic MusicXML scores from the MuseScore sheet music archive.
- the archive is made up of user-generated scores, and is very diverse in both content and purpose. As a result, the dataset is varied in types of music, key signatures, time signatures, clefs, and notation style. This diversity will aid in the training of our model.
- each score is checked for monophonicity, and dynamics, expressions, chord symbols, and textual elements are removed. This process produces a dataset of about 17,000 MusicXML scores. For training and evaluation, these scores are split into three different subsets: 60% is used for training, 15% for testing and 25% for the evaluation of the models.
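The 60/15/25 split can be sketched as follows. The shuffling, the fixed seed and the file names are assumptions for illustration; the source only states the proportions.

```python
import random

def split_scores(scores, seed=0):
    """Shuffle and split into 60% training, 15% testing and 25% evaluation."""
    shuffled = list(scores)
    random.Random(seed).shuffle(shuffled)        # reproducible shuffle (assumed)
    n_train = len(shuffled) * 60 // 100          # integer arithmetic avoids rounding surprises
    n_test = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]     # the remainder, about 25%
    return train, test, evaluation

scores = [f"score_{i:05d}.xml" for i in range(17000)]   # hypothetical file names
train, test, evaluation = split_scores(scores)
```

With 17,000 scores this yields 10,200 training, 2,550 testing and 4,250 evaluation scores, and every score lands in exactly one subset.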
- AWGN additive white Gaussian noise
- APN additive Perlin noise
- ET-s small scale elastic transformations
- ET-l large scale elastic transformations
- the image input of the algorithm is defined as a sequence of image patches 8, generated by applying a sliding window 9 over the original input score 7.
- the implementation has two separate parameters: the window width w and window stride s.
- with w the amount of information per window can be increased or decreased; s defines how much redundancy exists between adjacent windows.
- Increasing the value of w or decreasing the value of s provides the model with more information about the score, but will raise the computational complexity of the algorithm.
- a balance has to be struck between complexity and input coverage.
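The role of the two parameters can be illustrated with a short sketch. It assumes NumPy and a score image stored as a 2-D array of shape (height, width); the function name `sliding_window` and the example sizes are hypothetical.

```python
import numpy as np

def sliding_window(image, w, s):
    """Cut an image of shape (height, width) into patches of width w, taken every s columns."""
    height, width = image.shape
    starts = range(0, width - w + 1, s)
    return np.stack([image[:, x:x + w] for x in starts])   # (n_patches, height, w)

staff = np.zeros((100, 1000))             # hypothetical staff-line image
patches = sliding_window(staff, w=200, s=50)
```

Here adjacent patches share w - s = 150 columns. Halving s roughly doubles the number of patches, and with it the length of the encoder input sequence, which is the trade-off between input coverage and computational complexity mentioned above.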
- the network is trained using backpropagation with gradient descent optimization.
- An example of this is the ADAM optimizer.
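A single Adam update step can be sketched as below, to show how it refines plain gradient descent with running averages of the gradient and its square. The default hyperparameters are those of the published algorithm; the toy objective f(x) = (x - 3)^2, the learning rate of 0.01 and the name `adam_step` are illustrative assumptions and say nothing about the settings actually used for the networks described here.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
```

In a real network the gradient would come from backpropagation through the decoder, encoder and CNN, and the same update would be applied to every weight.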
- AMT Automatic music transcription
- the task at hand is to find a mapping f : x → y that translates an audio sequence x to a symbolic representation of that sequence y.
- the difficulty is no surprise because in the most general case, polyphonic AMT, separating the sources of sound alone, e.g. one key stroke on a piano from another, is already a highly underdetermined problem.
- any sufficient model needs to learn strong priors over the audio sequences it receives as input in order to perform well. Even if a model does learn these priors sufficiently, though, it cannot be guaranteed that the task at hand is well defined. For example, the harmonics of two distinct notes of possibly different instruments can have complex interactions.
- MIR music information retrieval
- the other popular models are recurrent neural networks for sequence modelling. These models can be understood as a generalized version of hidden Markov models. They are used for language modelling such as text generation or language translation. For the latter example, sequence-to-sequence models, a subclass of recurrent neural networks, are well known. Here a sequence of, for example, English is fed into a neural network to output a hidden state that contains all the information of the sequence, i.e., a sufficient statistic. This hidden state is then fed into another model that generates the sequence with the same meaning but in a different language. This model is superior to others because it theoretically can deal with differences in grammatical structure such as word order.
- the input T is a series of spectrogram excerpts of N frames.
- Each frame 9 is passed through the convolutional network.
- the representation is then passed on to the first RNN 10, which computes a hidden state that can be interpreted as sufficient statistics.
- Based on this hidden state a second RNN 20 generates an output sequence 50.
- This sequence 50 is twofold: it specifies at which time which pitch is turned on or off. This function is entirely deterministic. It can, furthermore, be trained end-to-end.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Auxiliary Devices For Music (AREA)
Abstract
The invention provides an optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system: - retrieves a machine-processable representation of said sheet music; - generates a series of time slices of said sheet music, by applying a sliding window over said machine-processable representation of at least part of said sheet music; - defines a sequence-to-sequence system, said sequence-to-sequence system comprising: a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said sheet music, said CNN comprising an input layer and an output layer; a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said sheet music, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer; a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said machine-processable representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said machine-processable representation.
Description
OPTICAL MUSIC RECOGNITION OMR : CONVERTING SHEET MUSIC TO A DIGITAL FORMAT
Field of the invention
The invention relates to an assembly and method for converting data, in particular to an optical music recognition (OMR) assembly for converting sheet music.
Background of the invention
There are many applications that require conversion of data into another representation, in particular conversion of data that has a temporal relation. This conversion, which may comprise a transcription or translation, for instance relates to the fields of automatic music transcription, music information retrieval, optical music recognition, and the like.
Since the 1960's, attempts have been made to manufacture optical music recognition systems. However, there has yet to be a system that deals with the complexity and ambiguity of sheet music in a satisfactory way. Accuracy scores of these systems are generally too low to use them without human supervision. Classical optical music recognition systems are segmented and generally consist of the following parts: staff isolation, staff line removal, symbol segmentation and symbol classification. It was found that each individual part of such a system has proven to be a difficult challenge, resulting in a low reliability.
In recent years, there have been substantial advances in sequence recognition methods using deep learning neural networks. Examples of these are machine translation and image captioning. A difference between these systems and the segmented optical music recognition systems, is that they try to capture the full recognition process into a single learning algorithm. Similar methods have been applied in optical character recognition and small optical music recognition tasks, showing promising results.
US6297439 for instance according to its abstract describes "A system and method are disclosed for automatically generating music on the basis of an initial sequence of input notes, and in particular to such a system and method utilizing a recursive artificial neural network (RANN) architecture. The aforementioned system includes a score interpreter interpreting an initial input sequence, a rhythm production RANN for generating a subsequent note duration, a note generation RANN for generating a subsequent note, and feedback means for feeding the pitch and duration of the subsequent note back to the rhythm generation and note generation RANNs, the subsequent note thereby becoming the current note for a following iteration."
US8494257 according to its abstract describes "Data set generation and data set presentation for image processing are described. The processing determines a location for each of one or more musical artifacts [..] in the image and identifies a corresponding label for each of the musical artifacts, generating a training file that associates the identified labels and determined locations of the musical artifacts with the image, and presenting the training file to a neural network for training."
US9123315 according to its abstract describes "A method for transcoding music, according to various aspects of the present invention, includes in any practical order: (a) reading indicia of a plurality of notes, each note having pitch and duration; (b) selecting a reference pitch; (c) determining indicia of tone from the reference pitch and the pitch of each note; and (d) outputting for use by an engraving engine, indicia of an apposite staff and indicia of tones and durations corresponding to the plurality of notes."
Chase Dwayne Carthen: "Rewind: A Music Transcription Method", 1 May 2016 (2016-05-01), pages 1-54, Reno, USA (Master thesis) in its abstract states:
"Music is commonly recorded, played, and shared through digital audio formats such as wav, mp3, and various others. These formats are easy to use, but they lack the symbolic information that musicians, bands, and other artists need to retrieve important information out of a given piece. There have been recent advances in the Music Information Retrieval (MIR) field for converting from a digital audio format to a symbolic format. This problem is called Music Transcription and the systems built to solve this problem are called Automatic Music Transcription (AMT) systems. The recent advances in the MIR field have yielded more accurate algorithms using different types of neural networks from deep learning and iterative approaches. Rewind's approach is similar but boasts a new method using an encoder-decoder network where the encoder and decoder both consist of a gated recurrent unit and a linear layer. The encoder layer of Rewind is a single layer autoencoder that captures the temporal dependencies of a song and produces a temporal encoding. In other words, Rewind is a web app that utilizes a deep learning method to allow users to transcribe, listen to, and see their music."
JP2871204B2 in its abstract states: "PURPOSE: To prevent variation in pitch from being decomposed into fine variation in the fine interval of a short note by absorbing the variation in pitch by the hysteresis that a recurrent network has when a musical sound which varies in pitch like singing and vibrato performance is put on a score. CONSTITUTION: This device has a band-pass filter bank part 14 which converts an external audio signal into power envelopes by frequency channels, a conflict recollection neural network part 13 which finds pitch categories from the power envelopes by the channels obtained from the band-pass filter bank part 14, an interval buffer part 12 which holds the pitch categories outputted by the conflict recollection neural network part 13, a readout timing generation part 15 which generates transcription intervals of musical intervals required for transcription and a musical interval storage part 11 which inputs and records pitch data from the interval buffer part 12 according to the timing outputted by the readout timing generation part 15."
WO2008101126A1 in its abstract states: "Methods, systems, and devices are described for collaborative handling of music contributions over a network. Embodiments of the invention provide a portal, the portal being accessible over the network by a plurality of workstations and configured to provide a set of editing capabilities for editing music elements. Music contributions may be received at the portal. At least a portion of the music contributions include music elements. In certain embodiments, the music elements have been deconstructed from an audio signal or a score image. A number of collaboration requests may be received at the portal over the network. Some collaboration requests may originate from a first workstation, while other collaboration requests may originate from a second workstation. In response to at least one of the collaboration requests, at least a portion of the music elements may be edited using the editing capabilities of the portal."
US2016099010A1 in its abstract states: "Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance."
CN105678300A in its abstract states: "The invention relates to the image and text identification field, and specifically relates to a complex image and text sequence identification method. The complex image and text sequence identification method includes the steps: utilizing a sliding sampling box to perform sliding sampling on an image and text sequence to be identified; extracting the characteristics from the sub images obtained through sampling by means of a CNN and outputting the characteristics to an RNN, wherein the RNN successively identifies the front part of each character, the back part of each character, numbers, letters, punctuation, or blank according to the input signal; and successively recording and integrating the identification results for the RNN at each moment and acquiring the complete identification result, wherein the input signal for each moment for the RNN also includes the output signal of a recursion neural network for the last moment. The complex image and text sequence identification method can overcome the cutting problem of a complex image and text sequence, and can significantly improve the identification efficiency and accuracy for images and text."
Currently, converting sheet music to a digital format, for instance, is cumbersome, if possible at all. Often, human intervention or interpretation is required.
Summary of the invention
It is an aspect of the invention to provide an alternative method or assembly for converting a temporal representation into another temporal representation.
The current invention provides an optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system:
- retrieves a machine-processable representation of said sheet music;
- generates a series of time slices of said sheet music, by applying a sliding window over said machine-processable representation of at least part of said sheet music;
- defines a sequence-to-sequence system, said sequence-to-sequence system comprising:
* a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said sheet music, said CNN comprising an input layer and an output layer;
* a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said sheet music, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer;
* a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said machine-processable representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said machine-processable representation.
The assembly provides an end-to-end trainable sequential model. This model can be trained as one pipeline by offering input at an input end and retrieving output at an output end.
Examples of a first temporal representation are, for instance, a graphical music representation and a sound recording.
In an embodiment, the first temporal representation relates to sheet music. Sheet music in its most general form relates to a graphical representation of music information. In a specific embodiment, it relates to a form of music notation, primarily used to notate western music. It is a way of writing down sequential musical information in a compact way, readable for human performers. In western music, musical symbols are notated on a staff, a group of five evenly spaced horizontal lines, used to differentiate between the pitches in written music. The higher a note is on a staff, the higher the pitch. A page of music typically consists of multiple staff lines, much like a piece of written text consists of multiple lines. The horizontal position of a note determines the order of the written musical sequence: sheet music is read from left to right. Figures 1-6 are examples of such a music notation.
In another embodiment, the first temporal representation relates to a sound recording. Sound can be recorded, and for instance transformed into a digital format. Sound comprises music, spoken text in a language, and the like. Sound can be recorded and subsequently compressed using a lossless or lossy compression. Examples of such digital formats are MP3, MP4, but in fact any type of lossy compression or lossless compression of sound recordings can be used.
A number of lossless audio compression formats exist. Shorten was an early lossless format. Newer ones include Free Lossless Audio Codec (FLAC), Apple's Apple Lossless (ALAC), MPEG-4 ALS, Microsoft's Windows Media Audio 9 Lossless (WMA Lossless), Monkey's Audio, TTA, and WavPack.
Some audio formats feature a combination of a lossy format and a lossless correction; this allows stripping the correction to easily obtain a lossy file. Such formats include MPEG-4 SLS (Scalable to Lossless), WavPack, and OptimFROG DualStream.
Examples of a second temporal representation are, for instance, sound recordings, music files, and graphical music representations such as sheet music. The sound may be music or spoken text (for instance the same text as in the first representation) in a language or accent different from that of the spoken text in the first temporal representation.
In an embodiment, the second temporal representation relates to a digital music format. Music can be stored in various digital formats. An example of such a digital format is the MIDI protocol, see MIDI Association. The Official Midi Specifications. 1996. URL: https://www.midi.org/specifications, which is widely used by musical sequencers and notation software. It is the de facto standard for exchanging digital musical information. A notation format much closer to actual sheet music is ABC notation. Digital music notation formats have numerous advantages over their optical counterparts. Images consist of pixel data, and do not give any direct information about the represented musical content. As a result, computational musical analysis cannot be performed on these image formats directly. While an increasing amount of digitized sheet music is available, a large portion of available sheet music is still only accessible as images.
The Musical Instrument Digital Interface (MIDI) standard defines a system of communication between digital musical devices. It is a very compact method to define musical information, as all events are represented on byte level. The MIDI file is a way of storing MIDI information, and is typically only a few kilobytes in size. Since the 1980s, MIDI has been the standard way of playing digital music, favoured because of its ease of use and expandability. The MIDI standard provides a way of expressing a variety of musical events. Timing in MIDI is handled differently than in sheet music. A MIDI event has a tick property, which defines the number of ticks between the previous and current event. The duration of one tick is defined in the header of the MIDI file by the Pulses Per Quarternote (PPQ). This value informs the musical sequencer how many ticks one quarter note contains. Typically, as the duration of most musical notes can be divided by either three or four, the PPQ of MIDI is a multiple of twelve. Most MIDI sequencers take a standard PPQ of 480, but for simpler music lower values are possible too. Timing is the only structural component in a MIDI file. Events like barlines are not present in MIDI. A NoteOnEvent defines a note with a starting tick, pitch and velocity. If the defined velocity is zero, it can be considered the same as a NoteOffEvent, which signals the end of a musical note. As such, a musical note can be defined by two NoteOnEvents: the first event defines the starting tick and velocity of the note, the second event defines the end tick of the note. Other events are the KeySignatureEvent, which defines the key signature of a piece, and EndOfTrack, which defines the end of a MIDI track. Contrary to sheet music, where the key signature changes the notation of accidentals in a piece, in MIDI the key signature does not influence the notes in a file in any way. It is only used for meta information.
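By way of illustration only, the tick-based timing described above can be sketched in Python as follows. The function names duration_to_ticks and note_events, the tuple layout (delta ticks, pitch, velocity) and the default velocity of 64 are assumptions made for this sketch; they are not part of the MIDI standard or of any particular MIDI library.

```python
from fractions import Fraction

PPQ = 480  # ticks per quarter note, as stored in the MIDI file header


def duration_to_ticks(quarter_notes, ppq=PPQ):
    """Convert a duration expressed in quarter notes into MIDI ticks."""
    ticks = Fraction(quarter_notes) * ppq
    if ticks.denominator != 1:
        raise ValueError("duration is not a whole number of ticks at this PPQ")
    return int(ticks)


def note_events(notes, ppq=PPQ, velocity=64):
    """Encode consecutive (pitch, quarter_notes) pairs as NoteOn events.

    Each note yields two events of the form (delta_ticks, pitch, velocity):
    a NoteOn with non-zero velocity, followed by a NoteOn with velocity
    zero that marks the end of the note. delta_ticks counts the ticks
    since the previous event.
    """
    events = []
    for pitch, quarter_notes in notes:
        # The note starts immediately after the previous note has ended.
        events.append((0, pitch, velocity))
        events.append((duration_to_ticks(quarter_notes, ppq), pitch, 0))
    return events


# A quarter note, an eighth note, and one third of a quarter note.
melody = [(60, 1), (62, Fraction(1, 2)), (64, Fraction(1, 3))]
events = note_events(melody)
```

Because 480 is a multiple of twelve, both the duple and the triple subdivision in this example map onto whole ticks; with a PPQ of 4, the last note would raise a ValueError.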
In general, a Convolutional Neural Network (CNN) is a type of artificial neural network (ANN) designed for the processing of spatial data. In embodiments of the current invention, known techniques and implementations can be used. In an embodiment, for instance, max pooling is used.
In an embodiment, convolutional layers are a building block of CNNs. These convolutional layers replace the weights of a traditional feed-forward neural network with trainable convolution filters. Filters can learn simple operations like edge, blob and corner detection. When stacked in multiple layers, the recognition of complicated spatial structures can be learned. In an embodiment, a single convolutional layer consists of multiple filters, where each filter operates on small areas of the layer input, producing a feature map. Feature maps represent the convolution between a filter and the input of a layer. After the convolutional layer, in an embodiment a non-linearity and possibly a pooling operation are applied.
Pooling operations perform down-sampling on a feature map. A popular type of pooling is max pooling, where each pooling region outputs its maximum value. A pooling operation has a region size and stride associated with it. The region size defines the width and height of the pooling region, the stride defines the number of steps between pooling regions.
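As a minimal sketch of the max pooling operation described above, assuming a feature map stored as a list of equal-length rows, the following Python function applies a pooling region of a given size at a given stride. The function name max_pool_2d and the example values are chosen for illustration only.

```python
def max_pool_2d(feature_map, region=(2, 2), stride=(2, 2)):
    """Down-sample a 2-D feature map by taking the maximum of each region.

    region is (height, width) of the pooling region; stride is the
    (vertical, horizontal) number of steps between pooling regions.
    Regions that would extend past the border are skipped.
    """
    region_h, region_w = region
    stride_h, stride_w = stride
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - region_h + 1, stride_h):
        row = []
        for left in range(0, width - region_w + 1, stride_w):
            values = [feature_map[r][c]
                      for r in range(top, top + region_h)
                      for c in range(left, left + region_w)]
            row.append(max(values))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [7, 1, 3, 8],
]
pooled = max_pool_2d(feature_map)  # [[6, 2], [7, 9]]
```

With a stride smaller than the region size, neighbouring pooling regions overlap and the output is down-sampled less strongly.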
After the convolutional and pooling operations, the resulting feature maps are in an embodiment reshaped to a column vector and one or more fully connected layers are applied. These layers are the same as the layers in regular feed-forward neural networks. As in the rest of the network, sigmoid activations can for instance be used.
ANNs with recurrent connections are an addition to the standard ANN model described in the previous section. These models belong to the family of Recurrent Neural Networks (RNNs). In addition to predicting from current input data, they are able to predict from past inputs as well. This property adds a new functionality to a network: the ability to work with sequential data.
Traditional RNNs can have trouble learning long-term dependencies. When the distance between two time steps gets too long, the information passed through the recurrent connections can be 'forgotten'. In an embodiment, a recurrent architecture called the Long Short Term Memory (LSTM) is used. A neuron in an LSTM network has multiple gates that control the remembering and forgetting of data from past time steps, improving the ability to model long dependencies.
A method to extend the tasks that RNNs can perform is to have them map sequences to sequences, possibly with different lengths and different orders. The architecture is called a sequence-to-sequence network, and is based on an encoder-decoder structure. This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation. In an embodiment, both the encoder (RNN) and decoder (RNN) are LSTM networks. The hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM.
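As a hedged illustration of the encoder-decoder hand-off described above, the following Python sketch replaces the LSTM networks by a plain element-wise tanh recurrence with fixed, untrained weights. The names rnn_step, encode and decode, the weight values, and the choice to feed each decoder output back as the next decoder input are assumptions made for illustration only and are not taken from this description.

```python
import math


def rnn_step(hidden, inputs, w_hidden, w_input):
    """One step of an element-wise tanh recurrence (a stand-in for an LSTM cell)."""
    return [math.tanh(w_hidden * h + w_input * x)
            for h, x in zip(hidden, inputs)]


def encode(sequence, hidden_size, w_hidden=0.5, w_input=1.0):
    """Fold a sequence of feature vectors into the last hidden state."""
    hidden = [0.0] * hidden_size
    for features in sequence:
        hidden = rnn_step(hidden, features, w_hidden, w_input)
    return hidden


def decode(initial_hidden, steps, w_hidden=0.5, w_input=1.0):
    """Unroll the decoder for a number of steps, starting from the encoder state.

    Assumption of this sketch: each output is fed back as the next input.
    """
    hidden = list(initial_hidden)
    previous_output = [0.0] * len(hidden)
    outputs = []
    for _ in range(steps):
        hidden = rnn_step(hidden, previous_output, w_hidden, w_input)
        previous_output = hidden
        outputs.append(hidden)
    return outputs


# Four encoder time steps are mapped to three decoder outputs.
inputs = [[1.0, -1.0]] * 4
hidden_representation = encode(inputs, hidden_size=2)
outputs = decode(hidden_representation, steps=3)
```

The input and output sequences have different lengths because the only link between encoder and decoder is the single hidden state passed from one to the other.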
In an embodiment, the convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are functionally coupled, forming said sequence-to-sequence system, and said sequence-to-sequence system is trained using a training dataset of first temporal representations and known, resulting second temporal representations.
In an embodiment, in said training, said first temporal representations are provided as input to said convolutional neural network (CNN), the output of said second, decoder recurrent neural network (RNN) is compared to said second temporal representation, and parameters of said convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are modified.
In an embodiment, the first RNN comprises a Long Short term memory (LSTM) architecture.
In an embodiment, the second RNN comprises a Long Short term memory (LSTM) architecture.
In an embodiment, the first temporal representation is a graphical representation, in particular a digital image, more in particular said graphical representation is a representation of music, in particular sheet music.
In an embodiment, the second digital representation is a digital file comprising temporal instructions for actuating a device, in particular a music file for actuating or controlling a music instrument, more in particular selected from MIDI and MusicXML.
In an embodiment, the series of time slices comprise digital images obtained by sliding a window over a graphical representation, in particular over a digital image.
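The sliding window of this embodiment can be sketched in Python as follows, under the assumption that the digital image is stored as a list of pixel rows and that each time slice spans the full image height. The function name sliding_window and the parameter names window_width and stride are illustrative only.

```python
def sliding_window(image, window_width, stride):
    """Cut an image (a list of pixel rows) into full-height vertical slices.

    Slices are taken from left to right, so that their order follows the
    reading order of a staff line. With stride < window_width, consecutive
    slices partially overlap.
    """
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    width = len(image[0])
    slices = []
    for left in range(0, width - window_width + 1, stride):
        slices.append([row[left:left + window_width] for row in image])
    return slices


# A toy "image" of 2 rows by 8 columns.
image = [list(range(8)), list(range(10, 18))]
slices = sliding_window(image, window_width=4, stride=2)
```

In this toy example three time slices are produced, each sharing two pixel columns with its neighbour.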
The invention further relates to an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising the assembly described above, and wherein:
- a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
In an embodiment of the OMR assembly, the sheet music comprises a graphical representation comprising a series of staff lines and notes, in particular said sheet music comprises a visual representation on a carrier, in particular a written or printed representation on paper.
In an embodiment of the OMR assembly, the convolutional neural network, said first, encoder recurrent neural network, and said second, decoder recurrent neural network have been trained as said sequence-to-sequence system using a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation.
The invention further relates to an automatic music transcription (AMT) assembly comprising the assembly, wherein said first temporal representation of a signal is a sound recording. In an embodiment, the sound recording is a digital sound recording.
In an embodiment, the AMT assembly further comprises a spectral converter, said spectral converter allowing converting said sound recording into a series of spectral representations of said sound recording.
In an embodiment of the AMT assembly, the software, when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
In an embodiment of the AMT assembly, the time windows have a window overlap.
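A minimal sketch of such a spectral converter is given below in Python, assuming a mono signal stored as a list of samples and a direct (slow) discrete Fourier transform without a window function. The function names, the parameter names window_size and overlap, and the test tone are assumptions made for illustration; a practical implementation would normally use an FFT routine.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one time window."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(sample * cmath.exp(-2j * math.pi * k * i / n)
                    for i, sample in enumerate(frame))
        spectrum.append(abs(total))
    return spectrum


def magnitude_spectrogram(signal, window_size, overlap):
    """Shift a time window over the signal and compute one spectrum per window.

    overlap is the number of samples shared by consecutive windows.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    hop = window_size - overlap
    return [magnitude_spectrum(signal[start:start + window_size])
            for start in range(0, len(signal) - window_size + 1, hop)]


# An 8 Hz sine tone sampled at 64 Hz for two seconds.
sample_rate = 64
signal = [math.sin(2 * math.pi * 8 * t / sample_rate) for t in range(128)]
spectrogram = magnitude_spectrogram(signal, window_size=32, overlap=8)
```

Each entry of spectrogram is one time slice. With a window of 32 samples at 64 Hz the bins are 2 Hz apart, so the test tone is expected to peak in bin 4 of every slice.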
The invention further pertains to a method for converting sheet music into a digital representation using the OMR assembly, comprising:
- retrieving a digital image of at least part of said sheet music;
- applying a sliding window over said digital image of said sheet music, said sliding window having a time width and a stride and providing said time slices comprising a time sequence of partially overlapping input images of at least part of said sheet music;
- applying said convolutional neural network (CNN) to said input images for converting said input images into a sequence of numerical representations of said input images;
- applying said first, recurrent neural network (RNN) as said encoder on said sequence of numerical representations for providing a hidden output data set;
- applying said second, decoder recurrent neural network (RNN) as said decoder to said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
The invention further relates to an assembly for producing an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming a music part, said OMR assembly comprising a data processor and software which, when running on said data processor:
- provide a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation;
- provide a neural network assembly comprising :
(a) a convolutional neural network (CNN) adapted for receiving training input images and converting said input images into a sequence of numerical representations of said input images;
(b) a first, encoder recurrent neural network (RNN) as an encoder adapted for receiving said sequence of numerical representations for providing a hidden output data set;
(c) a second, decoder recurrent neural network (RNN) as a decoder adapted for receiving said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part;
- train said neural network assembly using said training dataset.
The invention further relates to a method for producing the assembly, wherein:
- a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- said time slices are provided to said convolutional neural network for converting said time slices into a sequence of third representations of said first temporal representation;
- applies a first, trained, recurrent neural network (RNN) as an encoder on said sequence of third representations, input as one data entry, for providing a hidden representation of said first temporal representation;
- applies a second, trained, decoder recurrent neural network (RNN) as a decoder to said hidden representation, input as one data entry, for converting said hidden representation into a calculated second temporal representation of said signal;
- adjust parameters in at least one selected from said convolutional neural network, said first, trained, recurrent neural network (RNN) and said second, trained, decoder recurrent neural network (RNN) based upon a difference between said resulting second temporal representation of said signal and said calculated second temporal representation of each signal.
In an embodiment of this method, the neural networks are trained using back propagation of said training dataset.
In an embodiment of this method, the parameters in said neural networks are adjusted based upon gradient descent optimization.
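The principle of adjusting parameters based upon a difference between a calculated output and a known, resulting output can be illustrated with the following toy Python sketch. It fits a single weight by gradient descent on a squared difference and is not the training procedure of the neural networks themselves; the function name train, the learning rate, the number of epochs and the sample values are assumptions chosen for illustration.

```python
def train(samples, learning_rate=0.1, epochs=50):
    """Fit calculated = weight * x to known targets by gradient descent."""
    weight = 0.0
    for _ in range(epochs):
        for x, target in samples:
            calculated = weight * x
            difference = calculated - target
            # Gradient of the squared difference with respect to the weight.
            gradient = 2.0 * difference * x
            weight -= learning_rate * gradient
    return weight


# Each sample pairs an input with its known, resulting output (here 2 * x).
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train(samples)
```

In a neural network the same update is applied to every parameter, with back propagation supplying the gradient for each of them.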
The invention further pertains to an assembly for converting a first temporal representation of a signal into a second temporal representation of said signal, said assembly comprising a data processor system and software which, when running on said data processor system:
- retrieves a series of time slices of said first temporal representation;
- defines a sequence-to-sequence system, said sequence-to-sequence system comprising:
* a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said first temporal representation, said CNN comprising an input layer and an output layer;
* a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said first temporal representation, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer;
* a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said second temporal representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said second temporal representation.
This assembly in an embodiment can comprise all the features of the dependent claims.
In an embodiment, that assembly provides an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming at least part of a music part, said OMR assembly comprising an assembly according to any one of the preceding claims, wherein
- a sliding window is applied over a digital image of at least part of said sheet music, providing said time slices;
- said second, decoder recurrent neural network providing said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
In an embodiment, the assembly is an automatic music transcription (AMT) assembly, wherein said first temporal representation of a signal is a sound recording.
In an embodiment, the AMT assembly further comprises a spectral converter, said spectral converter allowing converting said sound recording into a series of spectral representations of said sound recording.
In an embodiment of the AMT assembly, said software, when running on said data processing system, defines a spectral converter for converting said sound recording into a series of magnitude spectrograms providing said time slices, wherein a time window having a time window size is shifted over said sound recording and a magnitude spectrogram is calculated for each time window.
In an embodiment of the AMT assembly, the time windows have a window overlap, in particular said window overlap is less than said time window width, more in particular less than 50% of said time window width.
The invention further pertains to a method for producing the assembly, wherein:
- a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- said time slices are provided to said convolutional neural network for converting said time slices into a sequence of third representations of said first temporal representation;
- applies a first, trained, recurrent neural network (RNN) as an encoder on said sequence of third representations, input as one data entry, for providing a hidden representation of said first temporal representation;
- applies a second, trained, decoder recurrent neural network (RNN) as a decoder to said hidden representation, input as one data entry, for converting said hidden representation into a calculated second temporal representation of said signal;
- adjust parameters in at least one selected from said convolutional neural network, said first, trained, recurrent neural network (RNN) and said second, trained, decoder recurrent neural network (RNN) based upon a difference between said resulting second temporal representation of said signal and said calculated second temporal representation of each signal.
In an embodiment of that method the neural networks are trained using back propagation of said training dataset.
In an embodiment of that method the parameters in said neural networks are adjusted based upon gradient descent optimization.
In implementations of the neural networks of the current invention, the neural networks are defined through software running on computer systems that comprise one or more so-called graphics cards that are commonly used for driving display devices. The logical structure of these graphics cards makes them suited for implementing neural networks. This is well known to a skilled person. The neural networks may also be implemented on specially designed computer devices or computer systems. Once a neural network is trained, and the parameters and structure are known, this structure may also be extracted and implemented on a general purpose computer system. The training and/or implementation of one or more of the neural networks may be through software, or partially or completely based upon a hardware implementation.
The terms "upstream" and "downstream" relate to an arrangement of items or features relative to the propagation of the light from a light generating means (here especially the first light source), wherein relative to a first position within a beam of light from the light generating means, a second position in the beam of light closer to the light generating means is "upstream", and a third position within the beam of light further away from the light generating means is "downstream". In a neural network, data is also passed through the network from upstream to downstream.
The term "substantially" herein, such as in "substantially consists", will be understood by the person skilled in the art. The term "substantially" may also include embodiments with "entirely", "completely", "all", etc. Hence, in embodiments the adjective substantially may also be removed. Where applicable, the term "substantially" may also relate to 90% or higher, such as 95% or higher, especially 99% or higher, even more especially 99.5% or higher, including 100%. The term "comprise" includes also embodiments wherein the term "comprises" means "consists of".
The term "functionally" will be understood by, and be clear to, a person skilled in the art. The term "substantially" as well as "functionally" may also include embodiments with "entirely", "completely", "all", etc. Hence, in embodiments the adjective functionally may also be removed. When used, for instance in "functionally parallel", a skilled person will understand that the adjective "functionally" includes the term substantially as explained above. Functionally in particular is to be understood to include a configuration of features that allows these features to function as if the adjective "functionally" was not present. The term "functionally" is intended to cover variations in the feature to which it refers, and which variations are such that in the functional use of the feature, possibly in combination with other features it relates to in the invention, that combination of features is able to operate or function. For instance, if an antenna is functionally coupled or functionally connected to a communication device, electromagnetic signals that are received by the antenna can be used by the communication device. The word "functionally" as for instance used in "functionally parallel" is used to cover exactly parallel, but also the embodiments that are covered by the word "substantially" explained above. For instance, "functionally parallel" relates to embodiments that in operation function as if the parts are for instance parallel. This covers embodiments for which it is clear to a skilled person that it operates within its intended field of use as if it were parallel.
Furthermore, the terms first, second, third and the like when used in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.
The devices or apparatus herein are amongst others described during operation. As will be clear to the person skilled in the art, the invention is not limited to methods of operation or devices in operation.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb "to comprise" and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device or apparatus claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The invention further applies to an apparatus or device comprising one or more of the characterising features described in the description and/or shown in the attached drawings. The invention further pertains to a method or process comprising one or more of the characterising features described in the description and/or shown in the attached drawings.
The various aspects discussed in this patent can be combined in order to provide additional advantages. Furthermore, some of the features can form the basis for one or more divisional applications.
Brief description of the drawings
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying schematic drawings in which corresponding reference symbols indicate corresponding parts, and in which:
Figures 1-6 provide some illustration of sheet music and its challenges, in which
Figure 1 depicts an example of a staff line with the scale of C in ascending order;
Figure 2 depicts an example of a fragment of sheet music with mixed durations;
Figure 3 depicts an example of a fragment with two sharps as key signature and multiple accidentals;
Figure 4 depicts an example of a fragment of polyphonic music;
Figure 5 depicts an example of a fragment containing two quarter triplets;
Figure 6 depicts an example of a double staff line, or 'Grand staff', with ambiguous notation;
Figure 7 illustrates in an abstract manner a Recurrent Neural Network (RNN);
Figure 8 illustrates an RNN in unfolded manner;
Figure 9 illustrates a Long Short term memory (LSTM) for use in an RNN;
Figure 10 illustrates a sliding window over (part of) a staff line;
Figure 11 illustrates processing of the data provided using the approach of figure 10;
Figurel2 illustrates a complete, schematic overview of the algorithm, here with first 4 time steps and 4 decoder outputs, and
Figure 13 illustrates an approach for an AMT application.
The drawings are not necessarily to scale.
Description of preferred embodiments
In the description of embodiments below, two different applications will be discussed as example of applying the current invention.
The first example relates to optical music recognition or OMR. In such an application, as explained above, a graphical representation of music is converted into a digital music representation, in particular a digital representation that comprises machine instructions for playing music. Often, sheet music, in particular classical sheet music, is converted into one or more MIDI or MusicXML files.
The second example relates to automatic music transcription or AMT. In such an application, music is converted into a symbolic representation of the music. Often, a digital recording, for instance in the "mp3" file format, is converted into classical sheet music.
First, some examples and challenges of the most common sheet music notation will be discussed with reference to figures 1-6. Next, a current solution for converting sheet music to a digital format will be discussed.
Figure 1 is an example of one staff line, with fifteen ascending notes from the scale of C major. The final symbol is called a rest. It is similar to a musical note in every way, but instead of indicating a pitch it indicates a period of silence. Sheet music is typically divided in time by bars. Each bar in music is of a fixed duration, indicated by the time signature. In figure 1, the time signature can be read at the start of the staff line: 4/4. Each following bar will be of that particular time signature, until it is changed. The start and end of a bar are indicated by barlines. These are the vertical lines between each group of four notes in figure 1.
While horizontal position dictates the order of the sequence, it does not say anything about the duration of a note or rest. Durations are indicated by the shape of the musical symbol. Figure 1 consists entirely of notes of the same shape, and thus of the same duration.
Figure 2 is slightly more complex and combines a mix of durations and pitches. Additionally, as shown in this example, two separate notes can be grouped together with a tie, like the two centre notes in the figure. Tied notes are played as one single note, where the duration is the sum of the durations of the tied notes. Finally, there are the concepts of key signature and accidentals. These indicate long and short term dependencies in sheet music to raise or lower the notated pitch of a note. For this, we utilize three different symbols: the sharp (♯), flat (♭) and natural (♮) symbols. The key signature of a piece indicates a global key and raises or lowers all indicated notes. Accidentals are more local, and only change the pitch of a note for the duration of the bar.
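The tie rule described above (tied notes sound as one note whose duration is the sum of the tied durations) can be sketched as follows. This is only an illustration: the tuple layout `(pitch, duration, tied_to_next)` and the function name `merge_ties` are assumptions made for this example and are not part of the described system.

```python
from fractions import Fraction

def merge_ties(notes):
    """Merge tied notes into single notes.

    notes: list of (pitch, duration, tied_to_next) tuples in reading order
           (hypothetical layout, chosen for this sketch only).
    Returns a list of (pitch, duration) tuples.
    """
    merged = []
    pending = False  # True when the previous note is tied to the current one
    for pitch, duration, tied_to_next in notes:
        if pending and merged and merged[-1][0] == pitch:
            # Extend the previous note instead of starting a new one.
            merged[-1] = (pitch, merged[-1][1] + duration)
        else:
            merged.append((pitch, duration))
        pending = tied_to_next
    return merged

# A quarter note tied to a half note sounds as one dotted half note.
example = [
    ("C4", Fraction(1, 4), False),
    ("E4", Fraction(1, 4), True),
    ("E4", Fraction(1, 2), False),
    ("G4", Fraction(1, 4), False),
]
print(merge_ties(example))
```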
Figure 3 shows an example of these concepts. The key signature, indicated at the start of the staff line, is two sharps. A sharp is added to the pitches on the horizontal lines on which the sharps of the key signature are located. Within each bar, multiple accidentals can be observed, changing the pitches of the notes pertaining to the accidental for the duration of the bar.
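The interplay between the global key signature and bar-local accidentals can be illustrated with a small sketch. It is a simplification under stated assumptions: notes are identified by letter name only (octaves are ignored), and the names `SHIFT` and `resolve_bar` are hypothetical helpers introduced for this example.

```python
# Semitone offset applied by each symbol.
SHIFT = {"sharp": 1, "flat": -1, "natural": 0}

def resolve_bar(notes, key_signature):
    """Return the semitone shift of every note in one bar.

    notes: list of (letter, accidental) pairs in reading order, where
           accidental is "sharp", "flat", "natural" or None.
    key_signature: dict mapping a note letter to "sharp" or "flat".
    """
    # Start each bar from the key signature; accidentals do not carry
    # over the barline.
    active = dict(key_signature)
    shifts = []
    for letter, accidental in notes:
        if accidental is not None:
            active[letter] = accidental  # holds for the rest of this bar
        shifts.append(SHIFT.get(active.get(letter), 0))
    return shifts

# Two sharps in the key signature (F and C), as in figure 3.
key = {"F": "sharp", "C": "sharp"}
bar_1 = [("F", None), ("F", "natural"), ("F", None), ("C", None)]
bar_2 = [("F", None)]
print(resolve_bar(bar_1, key))  # the natural sign affects the later F in the same bar
print(resolve_bar(bar_2, key))  # the key signature applies again in the next bar
```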
A key difference between written text and sheet music is the possibility of polyphony. A polyphonic piece of music will contain multiple concurrent sequences, possibly notated in one staff line. Figure 4 displays an example excerpt of a polyphonic score. Polyphony adds a layer of complexity to sheet music, and has to be taken into consideration when designing systems that have to be able to deal with these problems.
There are a few considerations and difficulties of sheet music in a machine learning context.
First, as already indicated above, key signatures and time signatures are examples of long dependencies in sheet music. In some notation styles, the time signature is only indicated at the start of a piece or in case of a time signature change. As a result, there is a possibility of long term dependencies over multiple pages of sheet music. The same is the case for the key signature. Both key and time signature can be changed in the middle of a piece of sheet music, changing the long term dependencies of upcoming bars.
Furthermore, there can be bidirectional dependencies. Structures like tuplets can depend on both previous and next notes. This is referred to as a bidirectional dependency. An example of such a tuplet is the quarter triplet shown in figure 5. The numbers '3' in the picture indicate that the notes before, under and after the number are part of a triplet structure, which has an altered duration compared to regular quarter notes.
Yet another challenge is ambiguity. Sheet music has the possibility of ambiguous notation. The most common example of this happens when music is represented on multiple staff lines, a common notation for piano music, as in figure 6. Neither the top nor the bottom staff has complete measures, as two beats are missing. Only by interpreting the relative positions of the notes in the top and bottom staff can the musical content be extracted.
Yet another challenge is contextual musical symbols. Aside from notes and rests, there are some more categories of musical symbols. Chords, lyrics, dynamics, articulation symbols, and textual instructions can all be part of a musical score. This can create difficulty for OMR systems, as a lot of these symbols are not fully standardized. For example, for textual instructions or lyrics an additional OCR system is needed.
Considerations for a computer program product
Artificial neural networks (ANNs) with recurrent connections are an addition to a standard ANN model. These ANNs with recurrent connections belong to the family of Recurrent Neural Networks (RNNs). In addition to predicting from current input data 2, they are able to predict from past inputs 2 as well. This property adds a new functionality to a network: the ability to work with sequential data. To make this RNN a recurrent architecture, a cyclical component 5 is added to the hidden layers 3, as shown in figure 7. In this architecture, the hidden layers 3 output information to the output layer 4 and to themselves. The information from the cyclical connection 5 is used the next time the network does a forward pass.
The recurrent model 1 in general as depicted in figure 7 can be unrolled into multiple time steps 2 together forming input 2 as shown in figure 8, to make it visually more understandable. It shows the network at different steps in time, connected by the recurrent connections 5.
The inventor further selected in an embodiment a special kind of recurrent architecture called the Long Short Term Memory (LSTM). A neuron 6 in an LSTM network has multiple gates that control the remembering and forgetting of data from past time steps, improving the ability to model long dependencies. In an LSTM cell/neuron, the output h_t is different from the cell state C_t, allowing for better pass-through of information. This is a key difference compared to the traditional recurrent connections. A schematic of the LSTM cell/neuron is shown in figure 9, with input x_t and output h_t.
In the schematic representation of figure 9, the output and next hidden state h_t are split for clarity, but their values are the same. The σ and tanh are the sigmoid and tanh activation functions, each with their own weight matrix. The × and + symbols are element-wise multiplication and addition. σ1 acts as a forget gate. It decides, based on the previous hidden state h_{t-1} and the current input x_t, what information is kept in the cell state C_t. The forget gate is denoted with f_t, and shown in formula 1 below.
Next, σ2 is the input gate, or i_t (formula 2). It decides how much of the computed state is passed through. The tanh1 creates new candidate memories for C_t. These two are combined and added to the previous cell state C_{t-1} modulated by the forget gate, to calculate the new cell state C_t (formula 5).
σ3 is the output gate of the LSTM cell, given in formula 3. If the cell state contains relevant information, the information is passed to the next hidden state h_t (formula 6) and thus to the output of the LSTM cell.
$$f_t = \sigma(h_{t-1} W_{f,h} + x_t W_{f,x}) \qquad (1)$$
$$i_t = \sigma(h_{t-1} W_{i,h} + x_t W_{i,x}) \qquad (2)$$
$$o_t = \sigma(h_{t-1} W_{o,h} + x_t W_{o,x}) \qquad (3)$$
In the above formulas, W denotes the weight matrix for a specific gate and a specific input. For example: W_{f,x} is the weight matrix for the forget gate and the input x_t.
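The gate formulas 1-3 can be turned into a small single-step sketch. Formulas 4-6 are referenced in the description but not reproduced in this text, so the candidate memory, cell state and hidden state updates below follow the commonly used LSTM formulation; that part is an assumption, as are the dictionary keys used to name the weight matrices. Bias terms are left out, in line with formulas 1-3.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def vecmat(v, W):
    """Row vector v (length n) times matrix W (n rows, m columns)."""
    return [sum(v[k] * W[k][j] for k in range(len(v))) for j in range(len(W[0]))]

def gate(h_prev, x, W_h, W_x, activation):
    """activation(h_prev W_h + x W_x), applied element-wise."""
    a = vecmat(h_prev, W_h)
    b = vecmat(x, W_x)
    return [activation(p + q) for p, q in zip(a, b)]

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step; W maps hypothetical names to weight matrices."""
    f = gate(h_prev, x, W["f_h"], W["f_x"], sigmoid)    # forget gate, formula 1
    i = gate(h_prev, x, W["i_h"], W["i_x"], sigmoid)    # input gate, formula 2
    o = gate(h_prev, x, W["o_h"], W["o_x"], sigmoid)    # output gate, formula 3
    g = gate(h_prev, x, W["c_h"], W["c_x"], math.tanh)  # candidate memories (assumed form)
    # New cell state: forget part of the old state, add gated candidates (assumed form).
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # New hidden state / output: gated, squashed cell state (assumed form).
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Tiny usage example with one input feature and one hidden unit.
zero = [[0.0]]
weights = {name: zero for name in
           ("f_h", "f_x", "i_h", "i_x", "o_h", "o_x", "c_h", "c_x")}
h, c = lstm_step(x=[1.0], h_prev=[0.0], c_prev=[2.0], W=weights)
print(h, c)
```

With all weights at zero every gate evaluates to 0.5, so the example simply halves the previous cell state, which makes the role of the forget gate easy to see.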
The current architecture is referred to as a sequence-to-sequence network, and is based on an encoder-decoder structure. This structure first encodes an input to a hidden representation, and subsequently decodes the output from this hidden representation. In the case of the sequence-to-sequence architecture, both the encoder and decoder are LSTM networks. The hidden representation from the encoder is passed as the last hidden state of the encoder LSTM to the starting hidden state of the decoder LSTM. An example sequence-to-sequence network is shown in figure 11. In this figure, the input {A, B, C, D} is encoded to a hidden representation, which is mapped to the output {X, Y, Z}.
In addition, an attention mechanism is added that allows the decoder to look back into previous time steps of the encoder. A fixed length hidden representation can be a bottleneck in the network. Adding such a 'search function' to the decoder can reduce this bottleneck.
For each decoder hidden state d_t, an attention vector a^t is calculated from all encoder states (h_1 ... h_n). The attention is calculated as follows:
$$a^t_i = \mathrm{softmax}\left(v^T \tanh(W_1 h_i + W_2 d_t)\right)$$
where v, W_1 and W_2 are the trained weights.
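The attention formula can be sketched in a few lines. The sketch scores every encoder state against the current decoder state and normalizes the scores with a softmax over the encoder positions; the function names and the toy weights are assumptions made for illustration, and how the resulting weights are used further in the decoder is not shown.

```python
import math

def matvec(W, x):
    """Matrix W (list of rows) times column vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(encoder_states, d_t, v, W1, W2):
    """Return one weight per encoder state h_i for decoder state d_t."""
    w2d = matvec(W2, d_t)  # identical for every encoder position
    scores = []
    for h_i in encoder_states:
        w1h = matvec(W1, h_i)
        # v^T tanh(W1 h_i + W2 d_t)
        scores.append(sum(vk * math.tanh(a + b) for vk, a, b in zip(v, w1h, w2d)))
    return softmax(scores)

# Toy example: three encoder states of size 2, identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(states, d_t=[0.0, 0.0], v=[1.0, 1.0],
                            W1=identity, W2=identity)
print(weights)  # the weights sum to one; the third state scores highest here
```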
In the current embodiments, the sheet music 7 is transformed into time slices 8 (indicated A, B, C, D, ....) and input into the system. Taking example from recurrent convolutional networks, a sliding window 9 over the input sheet music 7 is used, as illustrated in figure 10. This method effectively transforms the input to a sequence of image patches or time slices 8, and makes the translation between sheet music and MIDI a sequence-to-sequence problem. Additionally, the translation exhibits similar traits to neural machine translation: the input 8 and output 50 sequences are possibly of different lengths, with alignment problems.
In addition to the input being sequential, it is also image based. Feeding the raw pixel vectors into a sequence-to-sequence architecture will result in both loss of spatial context and very high input dimensionalities. For this reason, a convolutional network is used before each time step in the encoder. The combined architecture is a convolutional sequence-to-sequence model, as depicted in figure 11. In this model, {A, B, C, D} are image patches of sheet music, and {X, Y, Z} are MIDI events. In the figure the attention mechanism is omitted for clarity, but it is present in the model. Of course, in the real model the lengths of the input and output are many times longer than shown in the figure. In figure 12, in fact the same model as in figure 11 is depicted in a somewhat different way. Here, the overlap of the sliding window 9 is smaller, only about 5-10%.
Experimental example OMR
Convolutional neural network.
To extract relevant features from the image patches 8, each patch is fed into a Convolutional Neural Network (CNN) 40. In this research, we keep the architecture of the CNN 40 the same between different experiments, to ensure a fair comparison. First a max-pooling operation of 3x3 is applied on the input patch for dimensionality reduction. Then, a convolutional layer of 32 5x5 kernels is applied, followed by a ReLU activation and a 2x2 max-pooling operation. This layer is repeated, and a fully-connected layer of 256 units with ReLU activation is applied, so each input for the encoder will be a vector of size 256.
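To get a feeling for how the layers above shrink an image patch, the following sketch tracks the feature map size through the network. Several details are assumptions that the text does not specify: convolutions without padding and with stride 1, pooling with a stride equal to the pool size, the reading that the convolution, activation and pooling block is applied twice, and the example patch size of 120 x 60 pixels.

```python
def pool(height, width, k):
    """Max-pooling with a k x k window and stride k (assumed)."""
    return height // k, width // k

def conv_valid(height, width, k):
    """k x k convolution without padding and with stride 1 (assumed)."""
    return height - k + 1, width - k + 1

def cnn_shapes(height, width):
    """Return (list of (layer, height, width, channels), flattened size)."""
    shapes = []
    h, w = pool(height, width, 3)
    shapes.append(("maxpool_3x3", h, w, 1))
    for block in (1, 2):  # the convolutional block is applied twice
        h, w = conv_valid(h, w, 5)
        shapes.append((f"conv{block}_5x5_relu", h, w, 32))
        h, w = pool(h, w, 2)
        shapes.append((f"maxpool{block}_2x2", h, w, 32))
    flattened = h * w * 32  # input size of the fully-connected layer
    shapes.append(("dense_relu", 1, 1, 256))
    return shapes, flattened

shapes, flattened = cnn_shapes(120, 60)  # hypothetical patch size
for layer in shapes:
    print(layer)
print("flattened features:", flattened)
```

Whatever the patch size, the final fully-connected layer maps the flattened features to the 256-dimensional vector that is passed to the encoder.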
Sequence-to-sequence network.
After extracting a vector description of each image patch, the sequence of vectors is fed into a sequence-to-sequence network. This architecture consists of two RNNs 10, 20. The first RNN 10, the encoder 10, encodes the full input sequence 8 to a fixed size hidden representation. The second RNN 20, the decoder 20, produces a sequence of outputs 51 from the encoded hidden representation, together forming output 50. In the case of an OMR task, this sequence of outputs 51 is the sequence of pitches and durations generated from the MusicXML files. For both encoder and decoder, a single layer Long Short-Term Memory RNN is used with 256 units. To predict both the pitch and duration, the output of the decoder is split into two separate output layers with a softmax activation and categorical cross-entropy loss.
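The two output layers can be illustrated with a small loss computation for a single decoder time step. The sketch applies a softmax to each set of raw scores and evaluates the categorical cross-entropy against the target class. Adding the two losses with equal weight is an assumption made here, as the text does not state how they are combined, and the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Categorical cross-entropy for a one-hot target."""
    return -math.log(probabilities[target_index])

def step_loss(pitch_logits, duration_logits, pitch_target, duration_target):
    """Loss of one decoder step with separate pitch and duration heads."""
    pitch_loss = cross_entropy(softmax(pitch_logits), pitch_target)
    duration_loss = cross_entropy(softmax(duration_logits), duration_target)
    return pitch_loss + duration_loss  # equal weighting is an assumption

# Toy example: 4 pitch classes and 2 duration classes with uninformative scores.
print(step_loss([0.0, 0.0, 0.0, 0.0], [0.0, 0.0], pitch_target=2, duration_target=0))
```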
The dataset used in this research is compiled from monophonic MusicXML scores from the MuseScore sheet music archive. The archive is made up of user-generated scores, and is very diverse in both content and purpose. As a result, the dataset is varied in types of music, key signatures, time signatures, clefs, and notation style. This diversity will aid in the training of our model. To generate the dataset, each score is checked for monophonicity, and dynamics, expressions, chord symbols, and textual elements are removed. This process produces a dataset of about 17,000 MusicXML scores. For training and evaluation, these scores are split into three different subsets: 60% is used for training, 15% for testing and 25% for the evaluation of the models.
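The 60/15/25 split can be sketched as below. The shuffling with a fixed seed and the function name `split_scores` are assumptions for illustration; the text does not describe how the scores were assigned to the subsets.

```python
import random

def split_scores(scores, seed=0):
    """Split scores into training (60%), testing (15%) and evaluation (25%)."""
    shuffled = list(scores)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle (assumed)
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids rounding surprises
    n_test = n * 15 // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # remaining scores
    return train, test, evaluation

# With about 17,000 scores the subsets contain 10200, 2550 and 4250 scores.
train, test, evaluation = split_scores(range(17000))
print(len(train), len(test), len(evaluation))
```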
To the relatively clean image data of sheet music, several distortions were added, for instance additive white Gaussian noise (AWGN), additive Perlin noise (APN), small scale elastic transformations (ET-s), large scale elastic transformations (ET-l), and combinations thereof.
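As an illustration of the simplest of these distortions, the sketch below adds white Gaussian noise to a grayscale image. The pixel range of 0 to 1, the clipping, and the noise level `sigma` are assumptions; the text does not give the noise parameters that were used.

```python
import random

def add_awgn(image, sigma, seed=None):
    """Add zero-mean Gaussian noise to every pixel and clip to [0, 1].

    image: list of rows with pixel values between 0.0 and 1.0 (assumed range).
    sigma: standard deviation of the noise (hypothetical parameter).
    """
    rng = random.Random(seed)
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]

clean = [[0.0, 0.5, 1.0], [1.0, 1.0, 0.0]]
noisy = add_awgn(clean, sigma=0.1, seed=1)
print(noisy)
```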
Sliding window input
The image input of the algorithm is defined as a sequence of image patches 8, generated by applying a sliding window 9 over the original input score 7. The implementation has two separate parameters: the window width w and window stride s. By varying w, the amount of information per window can be increased or decreased; s defines how much redundancy exists between adjacent windows. Increasing the value of w or decreasing the value of s provides the model with more information about the score, but will raise the computational complexity of the algorithm. Thus when determining the optimal parameters, a balance has to be struck between complexity and input coverage. As a rule of thumb, we use a w that is approximately twice the width of a notehead, and an s of half the value of w. This will ensure that each musical object is shown fully at least once in an input window.
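The sliding window with width w and stride s can be sketched as follows. The image is represented as a list of pixel rows, and the notehead width of 10 pixels in the usage example is a hypothetical value; the function names are likewise chosen for this example only.

```python
def window_ranges(score_width, w, s):
    """Return the (start, end) column range of every window."""
    if w <= 0 or s <= 0:
        raise ValueError("window width and stride must be positive")
    last_start = max(score_width - w, 0)
    return [(start, start + w) for start in range(0, last_start + 1, s)]

def slice_score(image, w, s):
    """Cut an image (list of pixel rows) into a sequence of patches."""
    width = len(image[0])
    return [[row[a:b] for row in image] for a, b in window_ranges(width, w, s)]

# Rule of thumb from the text: w is about twice the notehead width, s = w / 2.
notehead_width = 10          # hypothetical value in pixels
w = 2 * notehead_width
s = w // 2
ranges = window_ranges(100, w, s)
print(len(ranges), ranges[0], ranges[-1])
```

In this sketch adjacent windows share w - s columns. Columns at the right edge that do not fill a complete stride are not covered, so a real implementation would probably pad the score first.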
The network is trained using backpropagation with gradient descent optimization. An example of this is the ADAM optimizer.
Six separate models are trained, one on each of the proposed augmented data sets: no augmentations, AWGN, APN, small ET, large ET and all augmentations. All models are trained with a batch size of 64 using the ADAM optimizer, with an initial learning rate of 8 × 10^-4 and a constant learning rate decay which halves the rate every ten epochs. Each model is trained to convergence, taking about 25 epochs on the non-augmented dataset. A single Nvidia Titan X Maxwell is used for training, which trains a single model in approximately 30 hours.
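The learning rate schedule can be written as a one-line function. The sketch assumes a smooth exponential decay that is evaluated per epoch; the text only states that the rate is halved every ten epochs, so a stepwise schedule would be equally consistent with it.

```python
def learning_rate(epoch, initial=8e-4, halve_every=10):
    """Learning rate after `epoch` epochs, halving every `halve_every` epochs."""
    return initial * 0.5 ** (epoch / halve_every)

for epoch in (0, 10, 20, 25):
    print(epoch, learning_rate(epoch))
```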
By using an end-to-end trainable sequential model, we perform the full OMR pipeline in a single step. By incorporating sequence-to-sequence models into OMR, there are many new possibilities for obtaining development data. We view this aspect as the largest advantage the proposed method has over segmented models, as the acquisition of quality training data is considered one of the major roadblocks in OMR. The proposed model shows that it is robust to noisy input, an important quality for any OMR model. Additionally, the experiments show that it can deal with the large scale elastic transformations that essentially change the musical font.
Application of the current method or assembly for automatic music transcription (AMT)
Automatic music transcription (AMT) is a challenging problem for humans and machines. The task at hand is to find a mapping f : x → y that translates an audio sequence x to a symbolic representation of that sequence y. The difficulty is no surprise because in the most general case, polyphonic AMT, separating the sources of sound alone, e.g. one key stroke on a piano from another, is already a highly underdetermined problem. Thus, any sufficient model needs to learn strong priors over the audio sequences it receives as input in order to perform well. Even if a model does learn these priors sufficiently, though, it cannot be guaranteed that the task at hand is well defined. For example, the harmonics of two distinct notes of possibly different instruments can have complex interactions. Furthermore, noise or recording technique may limit the prior assumptions that can be made. However, the fact that machine performance lags behind human performance is a strong indicator of the room for improvement for these models. Furthermore, we already know that music follows (probabilistic) rules with respect to tempo, harmony or timbre. Hence, a lot of prior assumptions can be made to simplify the problem. It has been the subject of several studies to work in this prior knowledge without restricting the flexibility of a model too much. Notice that AMT falls in the regime of perceptual tasks. Within this field deep learning has contributed remarkable improvements on several tasks, initially mainly in computer vision (CV), later also in several other domains such as natural language processing. There is reason to believe that music information retrieval (MIR) tasks are more challenging than CV tasks, for example due to the ambiguity of annotation even to human perceivers. However, several pioneering studies in deep learning have shown significant improvement in various MIR tasks such as onset and structural boundary detection, piano transcription, genre classification or sound generation, to name just a few. This gives reason to believe in the power of such techniques.
Within the deep learning domain there are two popular models: the convolutional neural network and the recurrent neural network. Convolutional neural networks have had enormous success in classification tasks such as image recognition. They seem to break the curse of dimensionality by learning locally low dimensional representations of their input. By stacking many of these representations in a hierarchical manner, a global understanding of the input as a whole can be achieved. The other popular models are recurrent neural networks for sequence modelling. These models can be understood as a generalized version of hidden Markov models. They are used for language modelling such as text generation or language translation. For the latter example, sequence-to-sequence models, a subclass of recurrent neural networks, are well known. Here a sequence of, for example, English words is fed into a neural network to output a hidden state that contains all the information of the sequence, i.e., a sufficient statistic. This hidden state is then fed into another model that generates a sequence with the same meaning but in a different language. This model is superior to others because it can theoretically deal with differences in grammatical structure such as word order.
In music translation tasks such as optical music recognition or music transcription, we are often faced with the same problems: dependencies need to be "kept in mind" and later be placed at a different place in the sequence. For example, when translating sheet music to a pitch representation, the model needs to have the capacity to remember the key signature. Additionally, these kinds of models can be trained with less effort, since one only needs the entire song and its sheet music to train rather than a temporally accurate annotation. However, sequences of music, usually represented in a spectral representation, are too high dimensional to work with a recurrent model directly. This is why we propose to learn a low dimensional feature. This feature is learned by a convolutional neural network. The models can be trained jointly and can thus benefit from each other.
The following steps were taken in this example.
For each audio sequence, we compute a magnitude spectrogram 7' with a window 9 size of 46.4 ms (2048 samples at 44.1 kHz) and 50% overlap. We apply an equivalent rectangular filterbank of 200 triangular filters from 27.5 Hz to 16 kHz. The entire preprocessing pipeline was realized with Essentia. As input to the model, we split the sequence X = {x_t}, t = 1 ... T, into excerpts of N frames, with 50% overlap. We coupled a convolutional neural network (CNN) 40 with a sequence-to-sequence model as explained earlier. The CNN 40 represents the automated feature extractor: for each excerpt x_t it extracts meaningful information from the spectral representation and compresses it. This low dimensional representation x'_t is then the input to the recurrent model 10 that encodes the sequence X' = {x'_t}, t = 1 ... T, to a hidden state H that ideally contains all information of the sequence, much like a sufficient statistic. Consequently, the information is then "translated" to the symbolic space with another recurrent neural network, the decoder 20, to the output sequence Y =
The model is illustrated in Figure 13.
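The framing arithmetic of the preprocessing described above can be checked with a short sketch. It computes the window duration from the sample count, the number of spectrogram frames for a given signal length, and the start positions of the excerpts of N frames with 50% overlap. The sketch assumes that no padding is applied at the signal edges, and the excerpt length of 10 frames in the example is hypothetical, as the text leaves N open.

```python
SAMPLE_RATE = 44_100   # samples per second
WINDOW = 2048          # samples per analysis window
HOP = WINDOW // 2      # 50% overlap between consecutive windows

def window_duration_ms():
    return 1000.0 * WINDOW / SAMPLE_RATE

def num_frames(num_samples):
    """Number of full windows that fit in the signal (no padding assumed)."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // HOP

def excerpt_starts(total_frames, n):
    """Start frame of every excerpt of n frames with 50% overlap."""
    step = max(n // 2, 1)
    return list(range(0, total_frames - n + 1, step))

frames = num_frames(SAMPLE_RATE)        # one second of audio
print(round(window_duration_ms(), 1))   # window length in milliseconds
print(frames, excerpt_starts(frames, 10))
```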
The input 7' is a series of spectrogram excerpts of N frames. Each frame 9 is passed through the convolutional network. The representation is then passed on to the first RNN 10, which computes a hidden state that can be interpreted as a sufficient statistic. Based on this hidden state a second RNN 20 generates an output sequence 50. This sequence 50 is twofold: it computes at which time which pitch is turned on or off. This function is entirely deterministic. It can, furthermore, be trained end-to-end. We use the Adam optimizer (for back propagation with gradient descent optimization) with standard hyperparameter settings. We apply 15% dropout to the inputs and 50% in the convolutional network. We train for several epochs. The model is implemented in Keras, and a single model is trained on an Nvidia GTX Titan X graphics card.
It will also be clear that the above description and drawings are included to illustrate some embodiments of the invention, and not to limit the scope of protection. Starting from this disclosure, many more embodiments will be evident to a skilled person. These embodiments are within the scope of protection and the essence of this invention and are obvious combinations of prior art techniques and the disclosure of this patent.
Claims
1. An optical music recognition (OMR) assembly for converting sheet music, representing a music part as a first temporal representation, into a machine-processable representation of said piece of music that represents at least a pitch and duration of notes that are graphically represented in said sheet music and form said music part as a second temporal representation, said assembly comprising a data processor system and software which, when running on said data processor system:
- retrieves a machine-processable representation of said sheet music;
- generates a series of time slices of said sheet music, by applying a sliding window over said machine-processable representation of at least part of said sheet music;
- defines a sequence-to-sequence system, said sequence-to-sequence system comprising:
* provide a convolutional neural network (CNN) for converting said time slices into a sequence of third representations of said sheet music, said CNN comprising an input layer and an output layer;
* provide a first, encoder recurrent neural network (RNN) as an encoder on said sequence of third representations, for providing a hidden representation of said sheet music, said first RNN having an input layer that is functionally coupled to said output layer of said CNN, and an output layer;
* provide a second, decoder recurrent neural network (RNN) as a decoder to said hidden representation, for converting said hidden representation into said machine-processable representation, said second RNN having an input layer that is functionally coupled to said output layer of said first RNN, and an output for providing said machine-processable representation.
2. The assembly of claim 1, wherein said convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are functionally coupled, forming said sequence-to-sequence system, and said sequence-to-sequence system is trained using a training dataset of sheet music and known, resulting second temporal representations.
3. The assembly of claim 2, wherein in said training said first temporal representations are provided as input of said convolutional neural network (CNN), the output of said second, decoder recurrent neural network (RNN) is compared to said second temporal representation, and parameters of said convolutional neural network (CNN), said first, encoder recurrent neural network (RNN) and said second, decoder recurrent neural network (RNN) are modified.
4. The assembly of any one of the preceding claims, wherein said first RNN comprises a Long Short term memory (LSTM) architecture.
5. The assembly of any one of the preceding claims, wherein said second RNN comprises a Long Short term memory (LSTM) architecture.
6. The assembly of any one of the preceding claims, wherein said first temporal representation is a graphical representation, in particular a digital image, more in particular said graphical representation is a representation of music, in particular sheet music.
7. The assembly of any one of the preceding claims, wherein said second digital representation is a digital file comprising temporal instructions for actuating a device, in particular a music file for actuating or controlling a music instrument, more in particular selected from MIDI, MusicXML.
8. The assembly of any one of the preceding claims, wherein said series of time slices comprise digital images obtained by sliding a window over a graphical representation, in particular over a digital image.
9. The assembly of any one of the preceding claims, wherein said sheet music comprises a graphical representation comprising a series of staff lines and notes, in particular said sheet music comprises a visual representation on a carrier, in particular a written or printed representation on paper.
10. The assembly according to any one of the preceding claims, wherein said convolutional neural network, said first, encoder recurrent neural network, and said second, decoder recurrent neural network have been trained as said sequence-to-sequence system using a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation.
11. A method for converting sheet music into a digital representation using the OMR assembly of any one of claims 1-10, comprising:
- retrieving a digital image of at least part of said sheet music;
- applying a sliding window over said digital image of said sheet music, said sliding window having a time width and a stride and providing said time slices comprising a time sequence of partially overlapping input images of at least part of said sheet music;
- applying said convolutional neural network (CNN) to said input images for converting said input images into a sequence of numerical representations of said input images;
- applying said first, recurrent neural network (RNN) as said encoder on said sequence of numerical representations for providing a hidden output data set;
- applying said second, decoder recurrent neural network (RNN) as said decoder to said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part.
12. An assembly for producing an optical music recognition (OMR) assembly for converting sheet music into a digital representation, said sheet music comprising a graphical, time sequential representation of a series of notes forming a music part, said OMR assembly comprising a data processor and software which, when running on said data processor:
- provide a training dataset comprising a series of sheet music samples and for each sheet music sample a resulting digital representation;
- provide a neural network assembly comprising :
(a) a convolutional neural network (CNN) adapted for receiving training input images and converting said input images into a sequence of numerical representations of said input images;
(b) a first, encoder recurrent neural network (RNN) as an encoder adapted for receiving said sequence of numerical representations for providing a hidden output data set;
(c) a second, decoder recurrent neural network (RNN) as a decoder adapted for receiving said hidden output data set for converting said hidden output data into said digital representation that represents at least a pitch and duration of said notes that are graphically represented in said sheet music and form said music part;
- train said neural network assembly using said training dataset.
13. A method for producing the OMR assembly of any one of the preceding claims, wherein:
- a training dataset is provided, said training dataset comprising a series of first temporal representations of signals, each having a resulting second temporal representation of each signal;
- said convolutional neural network, said first recurrent neural network and said second recurrent neural network are provided, where said neural networks are coupled;
- said time slices are for converting said time slices into a sequence of third representations of said first temporal representation;
- applies a first, trained, recurrent neural network (RNN) as an encoder on said sequence of third representations, input as one data entry, for providing a hidden representation of said first temporal representation;
- applies a second, trained, decoder recurrent neural network (RNN) as a decoder to said hidden representation, input as one data entry, for converting said hidden representation into a calculated second temporal representation of said signal;
- adjust parameters in at least one selected from said convolutional neural network, said first, trained, recurrent neural network (RNN) and said second, trained, decoder recurrent neural network (RNN) based upon a difference between said resulting second temporal representation of said signal and said calculated second temporal representation of each signal.
14. The method of claim 13, wherein said neural networks are trained using back propagation of said training dataset.
15. The method of claim 14, wherein said parameters in said neural networks are adjusted based upon gradient descent optimization.
-o-o-o-o-o-
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
NL2018758 | 2017-04-20 | ||
NL2018758A NL2018758B1 (en) | 2017-04-20 | 2017-04-20 | Optical music recognition (OMR) assembly for converting sheet music |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018194456A1 true WO2018194456A1 (en) | 2018-10-25 |
Family
ID=59521604
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/NL2018/050250 WO2018194456A1 (en) | 2017-04-20 | 2018-04-20 | Optical music recognition omr : converting sheet music to a digital format |
Country Status (2)
Country | Link |
---|---|
NL (1) | NL2018758B1 (en) |
WO (1) | WO2018194456A1 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448683A (en) * | 2018-11-12 | 2019-03-08 | 平安科技(深圳)有限公司 | Music generating method and device neural network based |
CN109801645A (en) * | 2019-01-21 | 2019-05-24 | 深圳蜜蜂云科技有限公司 | A kind of musical sound recognition methods |
CN110189800A (en) * | 2019-05-06 | 2019-08-30 | 浙江大学 | Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network |
CN110442706A (en) * | 2019-07-17 | 2019-11-12 | 华南师范大学 | A kind of method, system, equipment and storage medium that text snippet generates |
CN110580458A (en) * | 2019-08-25 | 2019-12-17 | 天津大学 | music score image recognition method combining multi-scale residual error type CNN and SRU |
CN110852181A (en) * | 2019-10-18 | 2020-02-28 | 天津大学 | Piano music score difficulty identification method based on attention mechanism convolutional neural network |
CN111063327A (en) * | 2019-12-30 | 2020-04-24 | 咪咕文化科技有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN111554255A (en) * | 2020-04-21 | 2020-08-18 | 华南理工大学 | MIDI playing style automatic conversion system based on recurrent neural network |
CN112669796A (en) * | 2020-12-29 | 2021-04-16 | 西交利物浦大学 | Method and device for converting music into music book based on artificial intelligence |
CN112906872A (en) * | 2021-03-26 | 2021-06-04 | 平安科技(深圳)有限公司 | Generation method, device and equipment for converting music score into sound spectrum and storage medium |
CN112951239A (en) * | 2021-03-24 | 2021-06-11 | 平安科技(深圳)有限公司 | Fole generation method, device, equipment and storage medium based on attention model |
CN113065432A (en) * | 2021-03-23 | 2021-07-02 | 内蒙古工业大学 | Handwritten Mongolian recognition method based on data enhancement and ECA-Net |
CN113066456A (en) * | 2021-03-17 | 2021-07-02 | 平安科技(深圳)有限公司 | Berlin noise-based melody generation method, device, equipment and storage medium |
CN113299318A (en) * | 2021-05-24 | 2021-08-24 | 百果园技术(新加坡)有限公司 | Audio beat detection method and device, computer equipment and storage medium |
CN113539294A (en) * | 2021-05-31 | 2021-10-22 | 河北工业大学 | Method for collecting and identifying sound of abnormal state of live pig |
CN114171053A (en) * | 2021-12-20 | 2022-03-11 | Oppo广东移动通信有限公司 | Neural network training method, audio separation method, device and equipment |
CN114550675A (en) * | 2022-03-01 | 2022-05-27 | 哈尔滨理工大学 | Piano transcription method based on CNN-Bi-LSTM network |
CN114580279A (en) * | 2022-03-02 | 2022-06-03 | 广西大学 | Low-orbit satellite communication self-adaptive coding method based on LSTM |
CN115146649A (en) * | 2022-06-24 | 2022-10-04 | 厦门大学 | Method and device for identifying music book on drum |
US11620475B2 (en) | 2020-03-25 | 2023-04-04 | Ford Global Technologies, Llc | Domain translation network for performing image translation |
US11923899B2 (en) | 2021-12-01 | 2024-03-05 | Hewlett Packard Enterprise Development Lp | Proactive wavelength synchronization |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2871204B2 (en) | 1991-08-21 | 1999-03-17 | 日本電気株式会社 | Music transcription device |
US6297439B1 (en) | 1998-08-26 | 2001-10-02 | Canon Kabushiki Kaisha | System and method for automatic music generation using a neural network architecture |
WO2008101126A1 (en) | 2007-02-14 | 2008-08-21 | Museami, Inc. | Web portal for distributed audio file editing |
US8494257B2 (en) | 2008-02-13 | 2013-07-23 | Museami, Inc. | Music score deconstruction |
US9123315B1 (en) | 2014-06-30 | 2015-09-01 | William R Bachand | Systems and methods for transcoding music notation |
US20160099010A1 (en) | 2014-10-03 | 2016-04-07 | Google Inc. | Convolutional, long short-term memory, fully connected deep neural networks |
CN105678300A (en) | 2015-12-30 | 2016-06-15 | 成都数联铭品科技有限公司 | Complex image and text sequence identification method |
- 2017-04-20: NL application NL2018758A, patent NL2018758B1 (en), status: not active (IP Right Cessation)
- 2018-04-20: WO application PCT/NL2018/050250, publication WO2018194456A1 (en), status: active (Application Filing)
Non-Patent Citations (2)
Title |
---|
CHASE DWAYNE CARTHEN: "Rewind: A Music Transcription Method", 1 May 2016 (2016-05-01), Reno, USA, pages 1 - 54, XP055433730, Retrieved from the Internet <URL:https://media.proquest.com/media/pq/classic/doc/4136250351/fmt/ai/rep/NPDF?cit:auth=Carthen,+Chase+Dwayne&cit:title=Rewind:+A+Music+Transcription+Method&cit:pub=ProQuest+Dissertations+and+Theses&cit:vol=&cit:iss=&cit:pg=&cit:date=2016&ic=true&cit:prod=ProQuest+Dissertations+&+Theses+Global:+The+Scie> [retrieved on 20171211] * |
CHASE DWAYNE CARTHEN: "Rewind: A Music Transcription Method", MASTER THESIS, 1 May 2016 (2016-05-01), pages 1 - 54, XP055433730 |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109448683A (en) * | 2018-11-12 | 2019-03-08 | 平安科技(深圳)有限公司 | Music generating method and device neural network based |
CN109801645A (en) * | 2019-01-21 | 2019-05-24 | 深圳蜜蜂云科技有限公司 | A kind of musical sound recognition methods |
CN110189800A (en) * | 2019-05-06 | 2019-08-30 | 浙江大学 | Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network |
CN110189800B (en) * | 2019-05-06 | 2021-03-30 | 浙江大学 | Furnace oxygen content soft measurement modeling method based on multi-granularity cascade cyclic neural network |
CN110442706B (en) * | 2019-07-17 | 2023-02-03 | 华南师范大学 | Text abstract generation method, system, equipment and storage medium |
CN110442706A (en) * | 2019-07-17 | 2019-11-12 | 华南师范大学 | A kind of method, system, equipment and storage medium that text snippet generates |
CN110580458A (en) * | 2019-08-25 | 2019-12-17 | 天津大学 | music score image recognition method combining multi-scale residual error type CNN and SRU |
CN110852181A (en) * | 2019-10-18 | 2020-02-28 | 天津大学 | Piano music score difficulty identification method based on attention mechanism convolutional neural network |
CN111063327A (en) * | 2019-12-30 | 2020-04-24 | 咪咕文化科技有限公司 | Audio processing method and device, electronic equipment and storage medium |
US11620475B2 (en) | 2020-03-25 | 2023-04-04 | Ford Global Technologies, Llc | Domain translation network for performing image translation |
CN111554255B (en) * | 2020-04-21 | 2023-02-14 | 华南理工大学 | MIDI playing style automatic conversion system based on recurrent neural network |
CN111554255A (en) * | 2020-04-21 | 2020-08-18 | 华南理工大学 | MIDI playing style automatic conversion system based on recurrent neural network |
CN112669796A (en) * | 2020-12-29 | 2021-04-16 | 西交利物浦大学 | Method and device for converting music into music book based on artificial intelligence |
CN113066456A (en) * | 2021-03-17 | 2021-07-02 | 平安科技(深圳)有限公司 | Berlin noise-based melody generation method, device, equipment and storage medium |
CN113066456B (en) * | 2021-03-17 | 2023-09-29 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for generating melody based on Berlin noise |
CN113065432A (en) * | 2021-03-23 | 2021-07-02 | 内蒙古工业大学 | Handwritten Mongolian recognition method based on data enhancement and ECA-Net |
CN112951239A (en) * | 2021-03-24 | 2021-06-11 | 平安科技(深圳)有限公司 | Fole generation method, device, equipment and storage medium based on attention model |
CN112951239B (en) * | 2021-03-24 | 2023-07-28 | 平安科技(深圳)有限公司 | Buddha music generation method, device, equipment and storage medium based on attention model |
CN112906872B (en) * | 2021-03-26 | 2023-08-15 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for generating conversion of music score into sound spectrum |
CN112906872A (en) * | 2021-03-26 | 2021-06-04 | 平安科技(深圳)有限公司 | Generation method, device and equipment for converting music score into sound spectrum and storage medium |
CN113299318A (en) * | 2021-05-24 | 2021-08-24 | 百果园技术(新加坡)有限公司 | Audio beat detection method and device, computer equipment and storage medium |
CN113299318B (en) * | 2021-05-24 | 2024-02-23 | 百果园技术(新加坡)有限公司 | Audio beat detection method and device, computer equipment and storage medium |
CN113539294A (en) * | 2021-05-31 | 2021-10-22 | 河北工业大学 | Method for collecting and identifying sound of abnormal state of live pig |
US11923899B2 (en) | 2021-12-01 | 2024-03-05 | Hewlett Packard Enterprise Development Lp | Proactive wavelength synchronization |
CN114171053A (en) * | 2021-12-20 | 2022-03-11 | Oppo广东移动通信有限公司 | Neural network training method, audio separation method, device and equipment |
CN114171053B (en) * | 2021-12-20 | 2024-04-05 | Oppo广东移动通信有限公司 | Training method of neural network, audio separation method, device and equipment |
CN114550675A (en) * | 2022-03-01 | 2022-05-27 | 哈尔滨理工大学 | Piano transcription method based on CNN-Bi-LSTM network |
CN114580279A (en) * | 2022-03-02 | 2022-06-03 | 广西大学 | Low-orbit satellite communication self-adaptive coding method based on LSTM |
CN114580279B (en) * | 2022-03-02 | 2024-05-31 | 广西大学 | Low-orbit satellite communication self-adaptive coding method based on LSTM |
CN115146649A (en) * | 2022-06-24 | 2022-10-04 | 厦门大学 | Method and device for identifying music book on drum |
Also Published As
Publication number | Publication date |
---|---|
NL2018758B1 (en) | 2018-11-05 |
Similar Documents
Publication | Title |
---|---|
NL2018758B1 (en) | Optical music recognition (OMR) assembly for converting sheet music |
Kong et al. | High-resolution piano transcription with pedals by regressing onset and offset times | |
Benetos et al. | Automatic music transcription: An overview | |
Tzanetakis et al. | Marsyas: A framework for audio analysis | |
Humphrey et al. | Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. | |
López et al. | AugmentedNet: A Roman Numeral Analysis Network with Synthetic Training Examples and Additional Tonal Tasks. | |
Dongmei | Design of English text-to-speech conversion algorithm based on machine learning | |
Kim | Singing voice analysis/synthesis | |
Sarkar et al. | Raga identification from Hindustani classical music signal using compositional properties | |
Van Balen | Audio description and corpus analysis of popular music | |
CN117216008A (en) | Knowledge graph-based archive multi-mode intelligent compiling method and system | |
Jadhav et al. | Transfer Learning for Audio Waveform to Guitar Chord Spectrograms Using the Convolution Neural Network | |
Ranjan et al. | Using a bi-directional lstm model with attention mechanism trained on midi data for generating unique music | |
Abudukelimu et al. | SymforNet: application of cross-modal information correspondences based on self-supervision in symbolic music generation | |
Kher | Music Composer Recognition from MIDI Representation using Deep Learning and N-gram Based Methods | |
Wang et al. | Visual signatures for music mood and timbre | |
Osman et al. | A Deep Learning Approach for Recognizing the Noon Rule for Reciting Holy Quran | |
Le et al. | Real-time Sound Visualization via Multidimensional Clustering and Projections | |
Szelogowski | Deep learning for musical form: recognition and analysis | |
Kim | Automatic Music Transcription in the Deep Learning Era: Perspectives on Generative Neural Networks | |
El Achkar | Music encoding and deep learning for music transcription and classification based on visually represented audio features | |
Pauwels | Exploiting prior knowledge during automatic key and chord estimation from musical audio | |
Simonetta | Music interpretation analysis. A multimodal approach to score-informed resynthesis of piano recordings | |
Shi | Computational Analysis and Modeling of Expressive Timing in Music Performance | |
Ebert | Transcribing Solo Piano Performances from Audio to MIDI Using a Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18722241; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18722241; Country of ref document: EP; Kind code of ref document: A1 |