US20210326102A1 - Method and device for determining mixing parameters based on decomposed audio data
- Publication number
- US20210326102A1 (application Ser. No. 17/343,386)
- Authority
- US
- United States
- Prior art keywords
- audio
- track
- data
- decomposed
- audio track
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000002156 mixing Methods 0.000 title claims abstract description 121
- 238000000034 method Methods 0.000 title claims abstract description 101
- 230000005236 sound signal Effects 0.000 claims abstract description 52
- 238000012545 processing Methods 0.000 claims abstract description 42
- 230000007704 transition Effects 0.000 claims description 226
- 230000001755 vocal effect Effects 0.000 claims description 65
- 238000000354 decomposition reaction Methods 0.000 claims description 25
- 238000013473 artificial intelligence Methods 0.000 claims description 23
- 238000013528 artificial neural network Methods 0.000 claims description 22
- 238000012952 Resampling Methods 0.000 claims description 5
- 230000006870 function Effects 0.000 description 86
- 230000000875 corresponding effect Effects 0.000 description 15
- 230000000694 effects Effects 0.000 description 12
- 239000000203 mixture Substances 0.000 description 10
- 230000008569 process Effects 0.000 description 10
- 230000001020 rhythmical effect Effects 0.000 description 9
- 230000008901 benefit Effects 0.000 description 7
- 238000012549 training Methods 0.000 description 6
- 230000008859 change Effects 0.000 description 5
- 238000001514 detection method Methods 0.000 description 4
- 238000003860 storage Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 230000000295 complement effect Effects 0.000 description 3
- 238000004590 computer program Methods 0.000 description 3
- 238000004519 manufacturing process Methods 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000010924 continuous production Methods 0.000 description 2
- 230000001276 controlling effect Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000012886 linear function Methods 0.000 description 2
- 238000009527 percussion Methods 0.000 description 2
- 238000003825 pressing Methods 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 238000005070 sampling Methods 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 230000036962 time dependent Effects 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000004040 coloring Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 230000003247 decreasing effect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000002349 favourable effect Effects 0.000 description 1
- 238000002955 isolation Methods 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Classifications
- H04R29/008—Visual indication of individual signal levels (monitoring/testing arrangements)
- G06F3/0482—Interaction with lists of selectable items, e.g. menus
- G06F3/04847—Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
- G06F3/04883—Inputting data by handwriting on a touch-screen or digitiser, e.g. gesture or text
- G06F3/04886—Partitioning the display area of the touch-screen into independently controllable areas, e.g. virtual keyboards or menus
- G06F3/165—Management of the audio stream, e.g. setting of volume, audio stream path
- G06N3/02—Neural networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G10H1/0008—Associated control or indicating means for electrophonic musical instruments
- G10H1/0091—Means for obtaining special acoustic effects
- G10H1/08—Circuits for establishing the harmonic content of tones by combining tones
- G10H1/40—Rhythm (accompaniment arrangements)
- G10H1/46—Volume control
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
- G10L21/0316—Speech enhancement by changing the amplitude
- G10L21/034—Automatic adjustment of amplitude
- G10L21/043—Time compression or expansion by changing speed
- G10L25/30—Speech or voice analysis techniques using neural networks
- G10L25/51—Speech or voice analysis specially adapted for comparison or discrimination
- G11B20/10527—Audio or video recording; data buffering arrangements
- G11B27/105—Programmed access in sequence to addressed parts of tracks of operating discs
- H04B1/1646—Circuits adapted for the reception of stereophonic signals
- H04H60/04—Studio equipment; interconnection of studios
- H04H60/05—Mobile studios
- H04N21/439—Processing of audio elementary streams
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/12—Circuits for distributing signals to two or more loudspeakers
- H04R5/04—Circuit arrangements for stereophonic arrangements
- H04R27/00—Public address systems
- H04S1/007—Two-channel systems in which the audio signals are in digital form
- G10H2210/056—Extraction or identification of individual instrumental parts, e.g. melody, chords, bass
- G10H2210/076—Extraction of timing, tempo; beat detection
- G10H2210/081—Automatic key or tonality recognition
- G10H2210/125—Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
- G10H2210/155—Musical effects
- G10H2210/241—Scratch effects, i.e. emulating playback velocity or pitch manipulation effects of manually rotating an LP record
- G10H2210/325—Musical pitch modification
- G10H2210/391—Automatic tempo adjustment, correction or control
- G10H2220/101—Graphical user interface [GUI] for graphical creation, edition or control of musical data or parameters
- G10H2220/106—GUI using icons, on-screen symbols, screen regions or segments representing musical elements or parameters
- G10H2230/015—PDA or palmtop computing devices used for musical purposes, e.g. portable music players, tablet computers or smartphones
- G10H2240/325—Synchronizing two or more audio tracks or files according to musical features or musical timings
- G10H2250/031—Spectrum envelope processing
- G10H2250/035—Crossfade, i.e. time-domain amplitude envelope control of the transition between musical sounds or melodies
- G10H2250/311—Neural networks for electrophonic musical instruments or musical processing, e.g. musical recognition or automatic composition
- G10H2250/641—Waveform sampler, i.e. music samplers; sampled music loop processing
- H04R2227/003—Digital PA systems using, e.g. LAN or Internet
- H04R2227/005—Audio distribution systems for home, i.e. multi-room use
- H04R2420/01—Input selection or mixing for amplifiers or loudspeakers
- H04R2430/01—Aspects of volume control, not necessarily automatic, in sound systems
- H04R2430/03—Synergistic effects of band splitting and sub-band processing
- H04S2400/13—Aspects of volume control, not necessarily automatic, in stereophonic sound systems
- H04S2400/15—Aspects of sound capture and related signal processing for recording or reproduction
- H04S2420/07—Synergistic effects of band splitting and sub-band processing
Definitions
- the present invention relates to a method for processing audio data based on one or more audio tracks of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres.
- for example, two different audio tracks representing two different pieces of music are mixed when a DJ crossfades from one of the pieces of music to the other, so as to avoid any audible interruption in the music performance.
- in a digital audio workstation (DAW), a mixing engineer mixes different audio tracks representing different instruments, vocals, etc.
- as another example, a sound engineer records different audio sources, such as different instruments or voices, by means of a plurality of microphones, pickups, etc., so as to produce mixed audio data for transmission through radio/TV broadcasting services or via the Internet.
- the main parameters for successfully mixing audio tracks comprise the volumes of the audio tracks, the timing or phase of the audio tracks relative to one another, and audio effects that may be applied to the individual audio tracks before mixing.
- the audio engineer may obtain information about the musical content of the individual audio tracks, including for example a key of the music, a tempo, a beat grid (time signature, beat emphases or accents etc.) or a particular instrument or a group of instruments contained in the audio tracks.
- a DJ intending to change the song currently played usually tries to find a suitable transition point between the two songs, i.e. a point in time within the first song at which the first song is faded out, and a point in time within the second song at which the second song is faded in.
- the DJ needs to determine the song parts of both songs in order to find a suitable transition point including a suitable timing for starting the second song.
- a transition between two songs can sound particularly smooth if both songs have the same or matching chords at the transition points and/or if both songs have mutually matching timbres, i.e. timbres which mix well with one another, for example a drum timbre and a piano timbre, while avoiding clashing of certain timbres, for example two vocal timbres at the same point in time at the transition point.
- this object is achieved by a method for processing audio data, comprising the steps of providing a first audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; providing a second audio track; analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter; generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- At least the mixed input data of the first audio track are decomposed such as to extract therefrom decomposed data representing only some of the timbres of the mixed input data, and the decomposed data are analyzed to determine at least one mixing parameter. Mixing of first and second audio tracks is then performed based on the at least one mixing parameter.
- in this way, the content of the audio information contained in the mixed input data is accessible at a significantly higher level of detail, or is even made available for analysis at all.
- detection of the beats of a song can be achieved with higher accuracy when separating a drum timbre, and detecting a key or a chord progression of a piece of music can be achieved with higher certainty by analyzing decomposed data representing a bass timbre.
- the output track may then be generated by matching the beats or matching the keys of the two audio tracks before mixing the audio tracks.
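For illustration, a minimal sketch of such beat and tempo detection on a separated drum stem (the file name "drums.wav" is a hypothetical output of a decomposition step; librosa is an assumed third-party audio library, not part of the patent):

```python
import librosa

# Load the separated drum stem produced by a decomposition step.
y, sr = librosa.load("drums.wav", sr=None, mono=True)

# Estimate tempo (BPM) and beat positions from the isolated drums,
# which avoids interference from vocals or harmonic instruments.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print("Estimated tempo (BPM):", tempo)
print("First beat positions (s):", beat_times[:4])
```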
- audio tracks may include digital audio data such as contained in audio files or digital audio streams.
- the files or streams may have a specific length or playback duration or, alternatively, may have an undefined or indefinite length or playback duration, such as in the case of a live stream or a continuous data stream received from a content provider via the Internet.
- digital audio tracks are usually stored in an audio file in association with consecutive time frames, the length of each time frame depending on the sampling rate of the audio data, as conventionally known. For example, in an audio file sampled at 44.1 kHz, one time frame has a length of approximately 0.023 ms.
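As a worked example of this frame-length arithmetic (Python, for illustration only):

```python
# One time frame corresponds to one sample period, i.e. 1 / sample_rate seconds.
sample_rate_hz = 44_100
frame_length_ms = 1_000.0 / sample_rate_hz
print(f"{frame_length_ms:.4f} ms")  # prints 0.0227 ms, i.e. roughly 0.023 ms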
- audio tracks may be embodied by analog audio signals, for example signals played by an analog playback device such as a vinyl player, a tape player etc.
- audio tracks may be songs or other pieces of music provided in digital or analog format.
- audio signal refers to an audio track or any part or portion of an audio track at a certain position or time within the audio track.
- the audio signal may be a digital signal processed, stored or transmitted through an electronic control system, in particular computer hardware, or may be an analog signal processed, stored or transmitted by analog audio hardware such as an analog mixer, a PA system or the like.
- the output track may comprise a first portion containing predominantly the first output data, and a second portion arranged after said first portion and containing predominantly the second output data.
- This method may be used in a DJ environment, in particular when mixing two songs using DJ equipment. In the first portion of the output track, only the first song is played as the first output data, while in a second portion only the second song is played as the second output data. The output track therefore switches from playback of the first song to playback of the second song.
- the step of analyzing audio data may include analyzing the decomposed data to determine a transition point as the mixing parameter, and the output track may be generated using the transition point such that the first portion is arranged before the transition point, and the second portion is arranged after the transition point.
- the method of the present invention may be used to find a suitable transition point at which playback is swapped from the first song to the second song.
- a transition point on the timeline of the output track may be defined by a first transition point on the timeline of the first audio track (e.g. corresponding to the first song) and a second transition point on the timeline of the second audio track (e.g. corresponding to the second song), wherein the output track then comprises the first portion containing predominantly the first output data obtained from the first audio track in a portion before the first transition point, and comprises the second portion containing predominantly the second output data obtained from the second audio track in a portion after the second transition point.
- the method of the invention may in particular include decomposing the first audio track to obtain first decomposed data, decomposing the second audio track to obtain second decomposed data, analyzing the first decomposed data to determine the first transition point as a first mixing parameter, analyzing the second decomposed data to determine the second transition point as a second mixing parameter, and generating the output track based on the first and second mixing parameters, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- the transition point(s) may be found more appropriately to allow a smooth transition between the songs, for example at a point where the decomposed drum track has a break or pause such that abrupt rhythmic changes can be avoided.
- the end of a chorus, a verse or any other song part can be determined automatically and a transition point can be determined at a junction between adjacent song parts.
- the output track may further include a transition portion, which is a time interval larger than zero, arranged between the first portion and the second portion and associated with (i.e. including) the transition point on the timeline of the output track, wherein in the transition portion a volume level of the first output data is reduced and/or a volume level of the second output data is increased. Therefore, within some sections of the transition portion, or even during the entire transition portion, first output data and second output data overlap, i.e. are mixed to be played at the same time, wherein the volume levels of the first output data and the second output data may be adjusted to allow for a smooth transition from the first output data to the second output data without sudden breaks, sound artefacts or dissonant mixes. For example, the volume of the first output data may be continuously decreased over a part or the entirety of the transition portion, while the volume level of the second output data may be continuously increased over a part or the entirety of the transition portion. Transitions of the above-described type are called crossfades.
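A minimal crossfade sketch along these lines, assuming two mono tracks given as NumPy float arrays at the same sample rate that have already been tempo- and beat-aligned (linear gain ramps are used here; equal-power ramps are a common alternative):

```python
import numpy as np

def crossfade(first: np.ndarray, second: np.ndarray, fade_len: int) -> np.ndarray:
    """Fade out the end of `first` while fading in the start of `second`."""
    fade_out = np.linspace(1.0, 0.0, fade_len)   # volume of the first output data
    fade_in = 1.0 - fade_out                     # volume of the second output data
    overlap = first[-fade_len:] * fade_out + second[:fade_len] * fade_in
    return np.concatenate([first[:-fade_len], overlap, second[fade_len:]])
```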
- audio data, which include at least the decomposed data, are analyzed to determine one or more mixing parameters.
- Mixing parameters therefore include, but are not limited to, the following examples:
- the mixing parameter may be a tempo of the first and/or second audio track, in particular a BPM (beats per minute) of the first and/or second audio track.
- Generation of the output track, i.e. mixing, may then include a tempo matching process in which the tempo or BPM of at least one of the first and second audio tracks, or of at least one of the first and second output data, is changed such that the audio tracks or output data have the same or matching tempi or BPM.
- by analyzing the decomposed data, the tempo or BPM can be determined with higher accuracy and/or higher reliability.
- the at least one mixing parameter may refer to a beat grid of the first and/or second audio track.
- the beat grid refers to the rhythmic framework of a piece of music.
- the individual beats of each bar, optionally including information about the time signature (for example a three-four time, a four-four time, a six-eight time, etc.), beat emphases or accents, etc., may form the beat grid of a piece of music.
- the beat grid may be determined as a mixing parameter based on analyzing decomposed data, for example decomposed drum data or decomposed bass data.
- the beat grid can be determined with higher accuracy and higher reliability according to the present invention.
- the step of generating an output track may take into account the determined beat grid or the determined beat grids of the first and/or second audio track by synchronizing the beat grids of the two audio tracks. Synchronizing beat grids may comprise resampling of audio data of the first and/or second audio track such as to stretch or compress the tempo of at least one of the audio tracks and thereby match the beat grids of the audio data.
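A hedged sketch of such tempo matching, assuming the BPM values of both tracks have already been determined; librosa's time_stretch changes tempo without changing pitch, whereas plain resampling would change both, as on a vinyl deck played at the wrong speed:

```python
import librosa

def match_tempo(y_second, bpm_second: float, bpm_first: float):
    # rate > 1.0 speeds the track up; stretching by bpm_first / bpm_second
    # makes the second track's tempo equal to the first track's tempo.
    return librosa.effects.time_stretch(y_second, rate=bpm_first / bpm_second)
```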
- the at least one mixing parameter may refer to a beat phase of the first and/or second audio track.
- the beat phase relates to a position (i.e. a timing) on the timeline of a piece of music comprising multiple bars, each bar having multiple beats according to the time signature of the music, wherein the beat phase is defined relative to a beginning of the current bar, i.e. relative to the previous downbeat position (first beat of a bar).
- Synchronizing beat phase may comprise time-shifting the audio tracks relative to one another such as to achieve matching beats.
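For illustration, beat-phase alignment by time-shifting one track, with the offset expressed in samples (a simplified sketch; a real implementation would derive the offset from the detected beat positions):

```python
import numpy as np

def shift_track(y: np.ndarray, offset_samples: int) -> np.ndarray:
    # Positive offset: delay the track by prepending silence.
    if offset_samples >= 0:
        return np.concatenate([np.zeros(offset_samples, dtype=y.dtype), y])
    # Negative offset: advance the track by trimming its start.
    return y[-offset_samples:]
```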
- the at least one mixing parameter may refer to a downbeat position within a first and/or a second audio track.
- a downbeat position refers to the position of the first beat of each bar.
- the at least one mixing parameter may refer to a beat shift between the first audio track and the second audio track.
- This embodiment achieves advantages similar to those described above for the mixing parameters beat grid, beat phase and downbeat position.
- smooth mixing may be achieved by introducing a time shift between the first output data and the second output data in such a manner as to achieve zero beat shift or a beat shift equal to one or more beats.
- the at least one mixing parameter may refer to a key or a chord progression of the first and/or second audio track.
- a chord progression of a piece of music is a time-dependent parameter which denotes certain chords or root tones at certain points in time on the timeline of the music, such as for example C Major, C Major 7, A Minor etc.
- a key of the music is basically constant over the whole piece of music and relates to the root or key note of the tonic (home key) of the piece of music. Mixing of a first audio track and a second audio track, in particular mixing of different pieces of music or different portions or components of a piece of music, achieves more favorable results if the two audio tracks have equal or mutually matching keys.
- mutually matching keys may refer to keys which have a total interval of a fourth, a fifth or an octave or multiples thereof in between.
- other intervals may be regarded as matching in the sense of the present invention.
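An illustrative encoding of this matching rule, with keys reduced to root pitch classes 0-11 (C = 0); the admissible intervals here are only those named above:

```python
# Unison/octave, perfect fourth and perfect fifth, expressed in semitones.
MATCHING_INTERVALS = {0, 5, 7}

def keys_match(root_a: int, root_b: int) -> bool:
    # The set {0, 5, 7} covers both directions: a fourth one way is a fifth the other.
    return (root_a - root_b) % 12 in MATCHING_INTERVALS

print(keys_match(0, 7))  # C and G (a fifth apart) -> True
```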
- the key of the first and/or second audio track is determined by decomposing the input audio data and analyzing the decomposed data obtained in the step of decomposing. This will achieve more accurate and more reliable results. For example, it may be advantageous to analyze decomposed bass data or decomposed guitar data or decomposed piano data, etc., as these instruments usually play an important role in defining the harmony of a piece of music and thereby the relevant key of the music.
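A minimal key-estimation sketch on decomposed data, correlating an averaged chroma profile of a separated bass stem ("bass.wav" is hypothetical) against the twelve rotations of the Krumhansl-Schmuckler major-key template; this is a standard heuristic, not the patent's prescribed algorithm:

```python
import librosa
import numpy as np

# Krumhansl-Schmuckler major-key profile; index 0 is the tonic.
MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

y, sr = librosa.load("bass.wav", sr=None, mono=True)   # decomposed bass stem
chroma = librosa.feature.chroma_cqt(y=y, sr=sr).mean(axis=1)

# Correlate the observed chroma with the profile rotated to each candidate tonic.
scores = [np.corrcoef(np.roll(MAJOR_PROFILE, k), chroma)[0, 1] for k in range(12)]
key_root = int(np.argmax(scores))  # 0 = C, 1 = C#, ...
```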
- from the chord progression, valuable information may be obtained regarding the structure of a piece of music, such as the sequence of particular song parts, for example verses, choruses, bridges, intros, outros, etc.
- the same chord progressions are usually used for each verse or for each chorus. Analyzing a chord progression may therefore be useful to find particular positions within the first audio track, which are suitable for mixing with a particular position in the second audio track such that these positions qualify as first and second transition points for generating a crossfade from the first audio track to the second audio track as described above, for example.
- the invention may generate an output track in which the first output data and the second output data are mixed together with similar volumes during a portion corresponding to the first and second portions, to create a mashup of two songs, while predominantly only the first output data or the second output data may be contained in the mix in other portions of the output track.
- the at least one mixing parameter may refer to a timbre or a group of timbres of the first and/or second audio track.
- This embodiment is based on the idea that some timbres mix better than other timbres.
- a vocal timbre mixes well with instrumental timbres such as a guitar timbre or a piano timbre, while mixing of two vocal timbres is usually unfavorable due to the clashing of the two voices.
- timbres transporting strong harmonic information may be more difficult to mix with other harmonic timbres, but may more easily be combined with non-harmonic timbres such as drums.
- determining that the first and/or the second audio track contains a particular timbre may provide useful information to assist the user in mixing, or may even allow semi-automatic or automatic mixing of the audio tracks.
- the at least one mixing parameter may refer to a song part junction of the first and/or second audio track.
- song part junctions may be suitable positions within a song at which various mixing effects, including crossfades or transitions to another song, remixing with another song, audio effects (reverb, loop effects, equalizer etc.), may be applied in a natural manner.
- the determination of song part junctions can therefore be used to assist the mixing process or to allow for semi-automatic or even automatic mixing of two audio tracks.
- the mixing parameter, in this example a song part junction, may be determined by analyzing decomposed data.
- a component of the audio mix that most clearly represents the structure of the song, for example a bass component, may be used to determine the song part junctions more accurately and more reliably.
- any of the above-mentioned mixing parameters is suitable to achieve the effects of the present invention, in particular to assist the mixing process.
- the results will become even better if a plurality of different mixing parameters are determined by analyzing the same or different decomposed data.
- a structure of a piece of music can be determined with particularly high accuracy and reliability, if for example a first mixing parameter referring to a beat grid is determined by analyzing decomposed drum data, and a second mixing parameter relating to a chord progression is determined by analyzing decomposed bass data, while a third mixing parameter relating to a song part junction may then be determined based on the determined structure of the piece of music, i.e. based on the first mixing parameter and the second mixing parameter.
- the step of analyzing audio data may include detecting silence data within the decomposed data, said silence data preferably representing an audio signal having a volume level smaller than −30 dB.
- a value of −30 dB refers to −30 dB FS (peak), i.e. to a volume level which is 30 dB smaller than the volume level of the loudest sound of the track.
- said silence data preferably represents an audio signal having a volume level smaller than −60 dB FS (RMS), i.e. referring to the absolute mean value.
- Silence within a particular timbre of the audio data, i.e. silence of a particular musical instrument or a voice component, may provide valuable information regarding the structure of the piece of music.
- a bridge part is often characterized by a certain interval, such as four, eight or sixteen bars of silence in the bass component of the music.
- the onset of the vocals may be an indication for the beginning of the first verse. Therefore, the step of analyzing audio data may preferably include detecting silence data continuously extending over a predetermined time span, for example over a time span of one, two, four, eight, twelve or sixteen bars, thus indicating a certain song part.
- an onset of a signal or a first signal peak within decomposed data after the predetermined time span of silence may indicate a downbeat position of a next song part, i.e. a song part junction.
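A sketch of such silence detection using the −60 dB FS (RMS) threshold mentioned above; frames whose RMS falls below the threshold are flagged, and a contiguous run of flagged frames spanning, say, four or eight bars at the estimated tempo would then indicate a candidate song part junction ("bass.wav" again denotes a hypothetical decomposed stem):

```python
import librosa
import numpy as np

y, sr = librosa.load("bass.wav", sr=None, mono=True)  # decomposed stem
rms = librosa.feature.rms(y=y)[0]                     # per-frame RMS, full scale = 1.0
rms_db = 20.0 * np.log10(np.maximum(rms, 1e-10))      # approximate dB FS (RMS)
silent_frames = rms_db < -60.0                        # boolean mask of silent frames
```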
- the step of analyzing audio data may include determining at least a first mixing parameter based on the decomposed data, and at least a second mixing parameter based on the first mixing parameter.
- the first mixing parameter may be a key of the first or second audio track
- the second mixing parameter may be a pitch shift value referring to a pitch shift to be applied to either one of the first and second audio tracks such as to match the keys of the first and second audio tracks.
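A sketch of deriving this second mixing parameter from the first: given the two detected key roots as pitch classes, the smallest semitone shift that aligns them is computed and applied to the second track (librosa assumed as above):

```python
import librosa

def match_key(y_second, sr: int, root_first: int, root_second: int):
    # Smallest semitone shift, in the range [-6, +5], that moves
    # root_second onto root_first.
    n_steps = (root_first - root_second + 6) % 12 - 6
    return librosa.effects.pitch_shift(y_second, sr=sr, n_steps=n_steps)
```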
- the second mixing parameter may be the transition point at which the output track includes a transition from the first output data to the second output data, for example by means of a crossfade.
- the first mixing parameter may for example be a song part junction, a beat phase or any other mixing parameter referring to a particular position or range within a piece of music relative to the musical content (song parts, bars, musical breaks, etc.).
- Such embodiments are particularly suitable to allow a DJ to find suitable transition points for changing from a first song to a second song.
- if the transition point is one of the mixing parameters, semi-automatic or automatic transitions can be realized in which a user, for example a DJ, just inputs his/her intention to change from playback of the first song to playback of the second song, or just specifies which songs should be mixed, and a suitable transition point is then automatically determined by a computer program according to a method of the present invention.
- One or more suitable transition points may then be proposed to the DJ for manual selection (semi-automatic mixing) or, alternatively, mixing is automatically initiated and carried out at a suitable transition point without any further user interaction (automatic mixing).
- Methods according to the first aspect of the invention use a step of decomposing mixed input data to obtain decomposed data.
- decomposing algorithms and services are known in the art, which allow decomposing audio signals to separate therefrom one or more signal components of different timbres, such as vocal components, drum components or instrumental components.
- Such decomposed signals and decomposed tracks have been used in the past to create certain artificial effects such as removing vocals from a song to create a karaoke version of a song.
- Such AI systems usually implement a convolutional neural network (CNN) which has been trained on a plurality of data sets, for example each including a vocal track, a harmonic/instrumental track and a mix of the vocal track and the harmonic/instrumental track.
- Examples of such conventional AI systems capable of separating source tracks, such as a singing voice track, from a mixed audio signal include: Prétet, "Singing Voice Separation: A study on training data", Acoustics, Speech and Signal Processing (ICASSP), 2019, pages 506-510; "spleeter", an open-source tool provided by the music streaming company Deezer based on the teaching of Prétet above; "PhonicMind" (https://phonicmind.com), a voice and source separator based on deep neural networks; "Open-Unmix", a music source separator based on deep neural networks in the frequency domain; and "Demucs" by Facebook AI Research, a music source separator based on deep neural networks in the waveform domain.
- These tools accept music files in standard formats (for example MP3, WAV, AIFF) and decompose the song to provide decomposed/separated tracks of the song, for example a vocal track, a bass track, a drum track, an accompaniment track or any mixture thereof.
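- As one concrete illustration, the open-source spleeter tool mentioned above can be driven from Python roughly as follows; exact API details may vary between versions.

```python
from spleeter.separator import Separator

# 4-stem model: vocals, drums, bass and "other" (remaining accompaniment)
separator = Separator('spleeter:4stems')

# Writes vocals.wav, drums.wav, bass.wav and other.wav under output/song/
separator.separate_to_file('song.mp3', 'output/')
```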
- the step of decomposing the mixed input data includes processing the mixed input data, in particular the first audio track and/or the second audio track, within an AI system comprising a trained neural network.
- AI systems achieve a high level of quality and in particular allow decomposing different timbres of a mixed audio signal, which in particular may correspond or resemble certain source tracks that were originally mixed when producing or generating the input audio track, such as certain instrumental tracks, vocal tracks, drum tracks etc.
- the step of decomposing may include decomposing the first/second audio tracks with regard to predetermined timbres such as to obtain decomposed signals of different timbres, preferably being selected from the group consisting of a vocal timbre, a non-vocal timbre, a drum timbre, a non-drum timbre, a harmonic timbre, a non-harmonic timbre, and any combination thereof.
- the non-vocal timbre, the non-drum timbre and the non-harmonic timbre may in particular be respective complement signals to that of the vocal timbre, the drum timbre and the harmonic timbre.
- Complement signals may be obtained by removing (e.g. subtracting) from the input signal a decomposed signal of a specific timbre.
- an input signal may be decomposed or separated into two decomposed signals, a decomposed vocal signal of a vocal timbre, and its complement, a decomposed non-vocal signal of a non-vocal timbre, which means that a mixture of the decomposed vocal signal and the decomposed non-vocal signal results in a signal substantially equal to the input signal.
- decomposition can be carried out to obtain a decomposed vocal track and a plurality of decomposed non-vocal tracks such as a decomposed drum track and a decomposed harmonic track (including harmonic instruments such as guitars, piano, synthesizer).
- At least one of the steps of analyzing the audio data and generating the output track may include processing of audio data within an AI system comprising a trained neural network.
- a neural network capable of analyzing audio data to determine at least one mixing parameter as described above may be obtained by training using training data containing a plurality of pieces of music together with data relating to the respective musical structure, such as beat grid, downbeat position, key, chord progression, song parts or song part junctions. After the training process, the neural network may then be capable of detecting such mixing parameters based on decomposed data of new pieces of music.
- a neural network suitable for generating the output track may be trained using training data in which each set of training data contains two audio tracks and one or more associated mixing parameters suitable for mixing the two audio tracks without dissonances or sound artefacts.
- the trained neural network will then be capable of mixing new audio tracks based on at least one mixing parameter determined by analyzing decomposed data and additional mixing parameters determined through artificial intelligence (AI).
- the method of the present invention may generally be used in all situations of audio processing, in which two audio tracks are to be mixed.
- the present invention may be implemented as a plugin or in the form of any other suitable software algorithm in order to help a user to mix different audio tracks referring to different instruments, song parts, songs or other audio signals in general.
- the method may be used in a DJ environment, for example in a DJ software application, in order to assist a DJ when mixing a piece of music with any other audio signal such as a second piece of music, and even to allow automatic, autonomous mixes without needing any human supervision.
- the method of the present invention may further include a step of playing the output track, including a playback through a PA system, loudspeakers, headphones or any other sound-reproducing equipment.
- the method of the present invention can be applied to any type of input audio track.
- the input audio track may be stored on a local device such as a storing means of a computer, and may be present as a digital audio file.
- the first audio track or the second audio track may be received as a continuous stream, for example a data stream received via Internet, a real-time audio stream received from a live audio source or from a playback device in playback mode.
- the range of applications is basically not limited to a specific medium.
- playback of the output track may be started while continuing to receive the continuous stream.
- decomposing first and/or second audio tracks is carried out segment-wise, wherein decomposing is carried out based on a first segment of the input signal such as to obtain a first segment of the decomposed signal, and wherein decomposing of a second segment of the input signal is carried out while playing the first segment of the decomposed signal.
- Partitioning the first and/or second input signals into segments (preferably segments of equal lengths) and operating the method of the invention based on these segments allows using the decomposition result for generating the output track at an earlier point in time, i.e. after finishing decomposition of just one segment, without having to wait until the decomposition result of an entire audio file for example is available.
- decomposition of the second audio track can start at an arbitrary point within the second audio track. For example, when a transition is to be made from the first audio track towards the second audio track such as to start playback of the second audio track at e.g. 01:20 (one minute, twenty seconds), decomposition of the second audio track can start at the segment closest to 01:20, and the unused beginning part of the second audio track does not have to be decomposed. This saves processing resources and ensures that decomposition results are available much faster.
- one segment has a playback duration which is smaller than 20 seconds.
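- The segment-wise scheme can be sketched as a simple pipeline in which segment k plays while segment k+1 is being decomposed; `decompose` and `play` below are hypothetical stand-ins for the AI decomposition step and the audio output, not interfaces defined by this disclosure.

```python
import threading
from queue import Queue

def stream_decomposed(segments, decompose, play):
    """Play segment k while segment k+1 is being decomposed.

    `segments` is an iterable of raw audio chunks (e.g. 10 s each, below
    the 20-second bound mentioned above); `decompose` and `play` are
    hypothetical stand-ins for the AI decomposition and audio output.
    """
    done = Queue(maxsize=1)
    it = iter(segments)

    def work(seg):
        done.put(decompose(seg))

    try:
        threading.Thread(target=work, args=(next(it),)).start()
    except StopIteration:
        return  # no segments at all
    for nxt in it:
        current = done.get()                                # previous segment ready
        threading.Thread(target=work, args=(nxt,)).start()  # decompose next...
        play(current)                                       # ...while this one plays
    play(done.get())                                        # final segment
```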
- the method steps, in particular the steps of providing the first and second audio tracks, decomposing the mixed input data, analyzing the decomposed data and generating the output track, may be carried out in a continuous process, wherein a time shift between receiving the first audio track or a first portion of a continuous stream of the first audio track and obtaining the output track or the first segments of the output track is preferably less than 10 seconds, more preferably less than 2 seconds.
- At least one, preferably all of the mixed input data, the first and second audio tracks, the decomposed data, the output track, and the first and second output data represent stereo signals, each comprising a left channel signal portion and a right channel signal portion, respectively.
- the method is thus suitable for playing music at high quality.
- a device for processing audio data, preferably a device adapted to carry out a method according to at least one of the embodiments described above, said device comprising a first input unit for receiving a first audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; a second input unit for receiving a second audio track; a decomposition unit for decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; an analyzing unit for analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter; and an output generation unit for generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- the second aspect of the invention provides a device having similar or corresponding features as the method of the first aspect of the present invention described above. Therefore, similar or corresponding effects and advantages may be achieved by a device of the second aspect of the present invention as described above for the first aspect of the present invention.
- a device of the second aspect of the invention may be adapted to carry out a method of the first aspect of the present invention.
- embodiments of the device of the second aspect of the present invention may be particularly adapted to carry out one or more of the steps described above for embodiments of the first aspect of the present invention in order to achieve the same effects and advantages.
- the device of the second aspect of the present invention is preferably embodied as a computer, in particular a tablet, a smartphone, a smartwatch or another wearable device, and may include, as conventionally known, a RAM, a ROM, a microprocessor and suitable input/output means. Included in the computer, or connected to it, may be an audio interface which may be connected, for example wirelessly (e.g. via Bluetooth or similar technology), to speakers, headphones or a PA system in order to output sound when playing the first and second output signals, respectively. As a further alternative, the device may be embodied as a standalone DJ device including suitable electronic hardware or computing means.
- the device preferably runs a suitable software application in order to control its hardware components, usually standard hardware components of general-purpose computers, tablets, smartphones, smartwatches or other wearable devices, such that they function as the units of the device of the second aspect and/or implement the steps of the method of the first aspect of the invention.
- the device preferably has a decomposition unit which includes the AI system comprising a trained neural network.
- the complete AI system including the trained neural network may be integrated within the device, for example as a software application or software plugin running locally in a memory integrated within the device.
- the device preferably includes a user interface embodied by either a display such as a touch display or a display to be operated by a pointer device, or as one or more hardware control elements such as a hardware fader or rotatable hardware knobs, or by a voice command or by any other user input/output technology.
- the above-mentioned object is achieved by a computer program which is adapted, when run on a computer, such as a tablet, a smartphone, a smartwatch or another wearable device, to carry out a method according to the first aspect of the present invention, or to control the computer as a device according to the second aspect of the present invention.
- a computer program according to the third aspect of the present invention therefore achieves the same or corresponding effects and advantages as described above for the first and second aspects of the present invention.
- the above-mentioned object is achieved by a method for processing audio data, comprising the steps of providing an audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; and analyzing the decomposed data to determine a transition point or a song part junction between a first song part and a second song part within the audio track, or to determine any other track parameter.
- a method of the fourth aspect of the present invention allows determination of one or more song part junctions within a piece of music based on analyzing decomposed data.
- in particular, a song structure of an audio track containing mixed input data, i.e. of a song containing a plurality of different timbres, can be determined based on the decomposed data.
- Song parts may therefore be determined more accurately and more reliably.
- the junctions between the song parts provide valuable information to the user, in particular to a DJ or an audio engineer during music production.
- one or more junctions within a piece of music may be indicated graphically on a screen, and the method may allow a user to control a mixing process based on the one or more junctions, for example to jump to a junction, to cut out a song part between two junctions, to time-shift songs such as to synchronize junctions, etc.
- the method of the fourth aspect allows determination of any other track parameter, such as at least one of a tempo, a beat, a BPM value, a beat grid, a beat phase, a key and a chord progression of the respective audio track.
- a method for processing audio data comprising the steps of providing a set of audio tracks, each including mixed input data, said mixed input data representing audio signals containing a plurality of different timbres; decomposing each audio track of the set of audio tracks, such as to obtain a decomposed track associated with the respective audio track, wherein the decomposed track represents an audio signal containing at least one, but not all, of the plurality of different timbres of the respective audio track, thereby obtaining a set of decomposed tracks; analyzing each decomposed track of the set of decomposed tracks to determine at least one track parameter of the respective audio track which the decomposed track is associated with; selecting or allowing a user to select at least one selected audio track out of the set of audio tracks, based on at least one of the track parameters; and generating an output track based on the at least one selected audio track.
- a method of the fifth aspect of the present invention basically assists a user in selecting one of a plurality of audio tracks for further processing, in particular mixing, editing and playback.
- in a typical situation, a user is to select one of a plurality of pieces of music for which only conventional metadata are available, i.e. metadata provided through conventional music distribution services such as Internet streaming providers.
- the method according to the fifth aspect of the invention allows adding additional information related to the musical content of the particular audio tracks in the form of the at least one track parameter, wherein the track parameter, according to the fifth aspect of the invention, is determined through analyzing at least one decomposed track obtained from the particular audio track.
- the selection of songs is greatly assisted, in particular in cases where the candidate pieces of music are partially or fully unknown to the user. Selection and processing of music is thus improved, in particular for inexperienced users or when less common pieces of music are to be selected.
- automatic selection of audio tracks by an algorithm, based on the track parameter and without user interaction, can be implemented.
- playlists could automatically be generated based on timbres or proportions of individual timbres included in the audio tracks. For example, a non-vocal playlist or instrumental playlist could be generated by automatic selection of songs that do not contain vocal timbres.
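- A minimal sketch of such automatic playlist generation is given below, assuming each track carries a precomputed 'vocal_proportion' parameter in [0, 1] derived from its decomposed vocal stem; the data layout is illustrative only.

```python
def instrumental_playlist(tracks, max_vocal=0.05):
    """Select tracks whose decomposed vocal stem carries (almost) no energy.

    `tracks` is assumed to be a list of dicts with a precomputed
    'vocal_proportion' track parameter in [0, 1] -- an illustrative
    data layout, not one prescribed by this disclosure.
    """
    return [t for t in tracks if t['vocal_proportion'] <= max_vocal]
```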
- the track parameter may refer to at least one timbre of the respective audio track.
- the user may therefore be informed about timbres contained in the plurality of audio tracks.
- the method may indicate to a user which of a plurality of audio tracks contains vocal music or which tracks contain a predominant piano timbre. Audio tracks may be suitably marked or highlighted such as to inform the user about the timbres included therein, or the method may allow for sorting or filtering a list of audio tracks based on timbres.
- a DJ currently playing a song that includes vocals may look for a second song predominantly containing a guitar or a piano timbre, wherein the method of the fifth aspect of the invention may assist and accelerate such selection and/or even allow selection of guitar/piano songs from a list of audio tracks unknown to the user as such.
- the method of the fifth aspect of the invention may be useful to accelerate the process of selecting a suitable audio track.
- the track parameter may refer to at least one of a tempo, a beat, a BPM value, a beat grid, a beat phase, a key and a chord progression of the respective audio track.
- the at least one track parameter may likewise be indicated to the user by virtue of a suitable graphical representation, highlighting, coloring or numeral representation.
- sorting or filtering of lists of audio tracks may be based on the at least one track parameter.
- the method according to the fifth aspect of the invention may be used to search for a second song among a set of audio tracks, which contains the same or at least partially the same chord progression as the first song, such that mixing of the two songs or crossfading between the songs will result in a particularly continuous sound of the output track without audible breaks or dissonances.
- the selected audio track is just played back, in particular without mixing, editing or otherwise changing its content.
- the method of the fifth aspect of the invention may in particular be applied in a music player and may assist a user in finding and selecting a desired song for playback.
- the at least one track parameter relates to a beat grid of the respective audio tracks (for example a time signature)
- a user may be enabled to easily find songs of certain beat grids, for example three-four time songs from among a plurality of audio tracks.
- a second audio track may contain mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres, wherein the mixed input data are decomposed to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres, wherein analyzing may be carried out taking into account the decomposed data obtained from the second audio track. Accordingly, in the step of analyzing and determining the at least one mixing parameter, both the first audio track and the second audio track may be analyzed on the basis of their respective decomposed data.
- mixing parameters determined in this manner may include tempo, beat, BPM value, beat grid (the beats contained within a song, optionally including information about at least one of time signature, emphases and downbeat positions), beat phase, key, chord progression, song parts and song part junctions, etc.
- the mixed input data of the first and/or second audio track are decomposed to obtain at least decomposed data of a vocal timbre, decomposed data of a harmonic timbre and decomposed data of a drum timbre or to obtain exactly three decomposed tracks which are a decomposed track of a vocal timbre, a decomposed track of a harmonic timbre and a decomposed track of a drum timbre, wherein the three tracks preferably sum up to an audio track substantially equal to the first and/or second audio track, respectively.
- a vocal timbre may include a simple vocal component or a mixture of different vocal components of the piece of music.
- a drum timbre may include the sound of a single drum instrument, a drum ensemble, a percussion instrument, etc.
- the drum timbre usually does not contain harmonic information.
- a harmonic timbre may include timbres of harmonic instruments such as a piano, a guitar, synthesizers, brass, etc.
- Decomposition into vocal, drum and harmonic timbres produces the most important components defining the musical content and structure of most music, in particular most pieces of western music. Such decomposition therefore provides a good yet efficient basis for analyzing the audio data and determining at least one mixing parameter and/or at least one track parameter.
- decomposition into vocal, drum and harmonic timbres greatly assists the mixing process, i.e. generation of an output track based on mixing two or more of the decomposed tracks.
- FIG. 1 a shows a device according to an embodiment of the present invention
- FIG. 1 b shows a song select window that may be displayed by a device of the embodiment of the invention
- FIG. 2 shows a schematic functional diagram of components of the device of the embodiment shown in FIG. 1 a
- FIG. 3 shows a schematic illustration of an example mode of operation of the device shown in FIG. 1 a , 1 b and 2 , and a method for processing audio data according to an embodiment of the invention.
- a device 10 may be formed by a computer such as a tablet computer, a smartphone, a smartwatch or another wearable device, which comprises standard hardware components such as input/output ports, wireless connectivity, a housing, a touchscreen, an internal storage as well as a plurality of microprocessors, RAM and ROM.
- Essential features of the present invention are implemented in device 10 by means of a suitable software application or a software plugin running on device 10 .
- the display of device 10 preferably has a first section 12 a associated to a first song A and a second section 12 b associated to a second song B.
- First section 12 a includes a first waveform display region 14 a which displays at least one graphical representation of song A, in particular one or more waveform signals associated to song A.
- the first waveform display region 14 a may display a waveform of song A and/or one or more waveforms of decomposed signals obtained from decomposing song A.
- decomposition of song A may be carried out to obtain a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which may be displayed within the first waveform display region 14 a .
- a second waveform display region 14 b may be included in the second section 12 b such as to display a graphical representation related to song B in the same or corresponding manner as described above for song A.
- the second waveform display region 14 b may display one or more waveforms of song B and/or at least one waveform of a decomposed signal obtained from song B.
- first and second waveform display regions 14 a , 14 b may each display a play-head 16 a , 16 b , respectively, which show a current playback position within song A and song B, respectively.
- the first waveform display region 14 a may have a song select button A, which may be pressed by a user to select song A from among a plurality of audio tracks offered by an Internet provider or stored on a local storage device.
- a second waveform display region 14 b includes a song select button B, which may be activated by a user to select song B from a plurality of audio tracks.
- FIG. 1 b shows an example of a song select window, which may pop up when song select button A is activated by a user.
- the song select window offers a list of audio tracks and invites the user to select one of the audio tracks as song A.
- the list of audio tracks as shown in FIG. 1 b shows metadata of each audio track which include, for each audio track, a title, an artist name, a track length, a BPM value, a main timbre and timbre component data referring to proportions of individual timbres within the audio track.
- the title, the artist and the track length may be directly read from metadata of the audio file as usually provided through commercial music providers, or may be stored as metadata together with the audio data of the audio track on a storage device.
- the BPM value, the main timbre and the timbre component data are examples for track parameters in the sense of the present invention, which are usually not provided by the distributors with the original audio tracks but which are obtained by device 10 according to the embodiment of the invention through decomposing the particular audio track and then analyzing the decomposed data.
- for example, a BPM value can be obtained for a given audio track by analyzing its decomposed drum track. From a plurality of decomposed tracks associated to particular timbres, such as a vocal timbre, a harmonic/instrumental timbre or a drum timbre, information regarding the presence and/or distribution (i.e. relative proportions) of certain timbres, i.e. certain instruments, can be obtained.
- a predominant timbre of an audio track can be determined, which represents a main character of the music contained in the audio track and is denoted as a main timbre for each audio track in the example of FIG. 1 b .
- a proportion of a drum timbre within the audio track is indicated by a drum proportion indicator
- a proportion of a harmonic/instrumental timbre within the audio track is indicated by a harmonic/instrumental indicator
- a proportion of a vocal timbre within the audio track is indicated by a vocal indicator.
- the indicators may be formed by level indicators showing the proportion of the respective timbre from a minimum value (not present, for example 0) to a maximum value (maximum proportion, for example 5).
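- One plausible way to compute such indicator values is to take each stem's share of the summed RMS energy and quantize it to the 0-5 scale, as sketched below; the proportion definition is an assumption, not mandated by this disclosure.

```python
import numpy as np

def timbre_levels(stems, levels=5):
    """Map decomposed-stem energies to 0..5 indicator values.

    `stems` maps timbre names ('drums', 'harmonic', 'vocals') to the
    decomposed signals; each proportion is taken as that stem's share
    of the summed RMS energy -- one plausible definition among several.
    """
    rms = {name: np.sqrt(np.mean(np.asarray(x, float) ** 2))
           for name, x in stems.items()}
    total = sum(rms.values()) or 1.0
    return {name: round(levels * r / total) for name, r in rms.items()}

# The timbre with the highest level would be shown as the "main timbre".
```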
- device 10 may analyze decomposed harmonic tracks (instrumental, vocals etc.) of the audio tracks in order to determine a key or a chord progression as track parameters of the audio tracks.
- each of the first and second sections 12 a and 12 b may further include a number of control elements for controlling playback, effects and other features related to song A and song B, respectively.
- the first section 12 a may include a play button 18 a which can be pushed by a user to alternatively start and stop playback of song A (more precisely audio signals obtained from Song A, such as decomposed signals).
- the second section 12 b may include a play button 18 b which may be pushed by a user to alternatively start and stop playback of song B (more precisely audio signals obtained from Song B, such as decomposed signals).
- An output signal generated by device 10 in accordance with the settings of device 10 and with a control input received from a user may be output at an output port 20 in digital or analog format, such as to be transmitted to a further audio processing unit or directly to a PA system, speakers or head phones. Alternatively, the output signal may be output through internal speakers of device 10 .
- device 10 can perform a smooth transition from playback of song A to playback of song B by virtue of a transition unit, which will be explained in more detail below.
- device 10 may comprise a transition button 22 displayed on the display of device 10 , which may be pushed by a user to initiate a transition from playback of song A towards playback of song B.
- transition button 22 By a single operation of transition button 22 (pushing the button 22 ), device 10 starts changing individual volumes of individual decomposed signals of songs A and B according to respective transition functions (volume level as a function of time) such as to smoothly cross-fade from song A to song B within a predetermined transition time interval.
- Pressing the transition button 22 can directly or immediately start the transition from song A to song B or may control a transition unit, which is to be described in more detail later, such as to analyze decomposed signals of song A and/or song B in order to determine at least one mixing parameter and to play an automatic transition based on the at least one mixing parameter.
- a suitable transition point i.e. a suitable first transition point on the timeline of song A and/or a suitable second transition point on the timeline of song B, and/or a length of a transition portion (duration of the transition) may be determined by the transition unit in response to an activation of transition button 22 .
- device 10 may include a transition controller 24 which can be moved by a user between one controller end point referring to playback of only song A and a second controller end point referring to playback of only song B. This allows controlling the volumes of individual decomposed signals of songs A and B using transition functions which are based not on time but on the controller position of the transition controller 24 . In this manner, in particular the speed and progress of the transition can be controlled manually through the transition controller 24 .
- FIG. 2 shows a schematic illustration of internal components of device 10 and a signal flow within device 10 .
- Audio processing is based on a first input track and a second input track, which may be stored within the device 10 , for example in an internal memory of the device, a hard drive or any other storage medium.
- First and second input tracks are preferably digital audio files of a standard compressed or uncompressed audio file format such as mp3, WAV, AIFF or the like.
- first and second input tracks may be received as continuous streams, for example via an Internet connection of device 10 or from an external playback device via an input audio interface or via a microphone.
- First and second input tracks are preferably processed within first and second input units 26 a and 26 b , respectively, which may be configured to decrypt or decompress the audio data, if necessary, and/or may be configured to extract a segment of the first input track and a segment of the second input track in order to continue processing based on the segments.
- This has an advantage that time-consuming processing algorithms, such as the decomposition based on a neural network, will not have to analyze the entire first or second input track upfront, but will perform processing based on shorter segments, which allows continuing processing and eventually start playback at an earlier point in time.
- the outputs of the first and second input units 26 a , 26 b form first and second input signals, which are input into first and second AI systems 28 a , 28 b of a decomposition unit 40 .
- Each AI system 28 a , 28 b includes a neural network trained to decompose the first and second input signals, respectively, with respect to sound components of different timbres.
- Decomposition unit 40 thus decomposes the first input signal to obtain a first group of decomposed signals and decomposes the second input signal to obtain a second group of decomposed signals.
- each group of decomposed signals includes a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which each form a complete set of decomposed signals or a complete decomposition, which means that a sum of all decomposed signals of the first group will resemble the first input signal, and the sum of all decomposed signals of the second group will resemble the second input signal.
- decomposition unit 40 may also include only one AI system and only one neural network, which is trained and configured to determine all decomposed signals of the first input signal as well as all decomposed signals of the second input signal.
- more than two AI systems may be used, for example a separate AI system and a separate neural network may be used to generate each of the decomposed signals.
- Playback unit 42 comprises a transition unit 44 , which is basically adapted to recombine the decomposed signals of both groups taking into account specific volume levels associated to each of the decomposed signals.
- Transition unit 44 is configured to recombine the decomposed signals in such a manner as to either play only a first output signal obtained from a sum of all decomposed signals of the first input signal, or a second output signal obtained from a sum of all decomposed signals of the second input signal, or any transition in between the first and the second output signals where decomposed signals of both first and second input signals are played.
- transition unit 44 may store individual transition functions DA, VA, HA, DB, VB, HB for each of the decomposed signals, which each define a specific volume level for each time frame within a transition interval, i.e. a time interval in which one of the songs A and B is crossfaded into the other song (first and second output signals are crossfaded in one or the other direction), or for each controller position of the transition controller within a controller range. Taking into account the respective volume levels according to the respective transition functions DA, VA, HA, DB, VB, HB, all decomposed signals will then be recombined to obtain the output signal.
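- The recombination according to per-stem transition functions can be sketched as follows; linear fades are used as placeholder shapes for DA, VA, HA, DB, VB, HB, and song B's stems are assumed to be already time-aligned so that its transition point coincides with the start of the transition interval.

```python
import numpy as np

def crossfade(stems_a, stems_b, sr, start, end, funcs_a=None, funcs_b=None):
    """Recombine decomposed stems of songs A and B over a transition interval.

    `stems_a`/`stems_b` are lists of mono numpy signals (e.g. drums,
    vocals, harmonic); each transition function maps a progress value p
    in [0, 1] to a volume level. Linear fades stand in for the per-stem
    curves DA, VA, HA, DB, VB, HB described above.
    """
    n = min(min(len(s) for s in stems_a), min(len(s) for s in stems_b))
    i0, i1 = int(start * sr), int(end * sr)
    # progress through the transition interval: 0 before it, 1 after it
    p = np.clip((np.arange(n) - i0) / max(i1 - i0, 1), 0.0, 1.0)
    fade_out = funcs_a or [lambda p: 1.0 - p] * len(stems_a)
    fade_in = funcs_b or [lambda p: p] * len(stems_b)
    out = np.zeros(n)
    for stem, f in zip(stems_a, fade_out):
        out += f(p) * stem[:n]
    for stem, f in zip(stems_b, fade_in):
        out += f(p) * stem[:n]
    return out
```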
- Playback unit 42 may further include a control unit 45 , which is adapted to control at least one of the transition functions DA, VA, HA, DB, VB, HB based on a user input.
- the output signal generated by playback unit 42 may then be routed to an output audio interface 46 for a sound output.
- one or more sound effects may be inserted into the audio signal by means of one or more effect chains 48 .
- effect chain 48 is located between playback unit 42 and output audio interface 46 .
- FIG. 3 illustrates an operation of transition unit 44 according to an embodiment of the present invention and a method for processing audio data according to an embodiment of the present invention.
- Decomposed data as received from the first input track (first audio track) representing song A comprises, in the particular embodiment, a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in FIG. 3 ).
- Decomposed data received from the second input track (second audio track) relating to song B comprises a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in FIG. 3 ).
- the decomposed signals are each shown by respective waveforms, wherein the horizontal axis represents the timeline of song A and the timeline of song B, respectively, and the vertical axis represents the time-dependent amplitude of the corresponding audio signal.
- the decomposed signals are analyzed to determine at least one mixing parameter.
- the decomposed drum signal of song A is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song A
- a decomposed drum signal of song B is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song B.
- the algorithm can then determine a rhythmic pattern of song A, including a first beat at the beginning of song A at a time t0 and a sequence of beats following one another at substantially equal time intervals, wherein four beats form a bar, resulting in a beat grid of a four-four time type.
- the bars are denoted by vertical lines, wherein each bar includes four beats that are not illustrated.
- transition unit 44 analyzes the decomposed drum signal of song B in order to determine beats, bars, a tempo, a BPM value, a beat grid etc., as mixing parameters of song B.
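- A deliberately simple stand-in for this analysis step (not the algorithm of this disclosure) is to build an onset-strength envelope from the decomposed drum signal and pick the strongest autocorrelation lag in a plausible tempo range:

```python
import numpy as np

def estimate_bpm(drums, sr, hop=512, lo=70.0, hi=180.0):
    """Rough BPM estimate from a mono decomposed drum signal.

    Builds an onset-strength envelope (positive energy increments per
    hop) and returns the tempo whose autocorrelation lag is strongest
    within [lo, hi] BPM -- a sketch, not the disclosure's algorithm.
    """
    n = len(drums) // hop
    energy = np.array([np.sum(drums[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    onset = np.maximum(np.diff(energy), 0.0)
    ac = np.correlate(onset, onset, mode='full')[len(onset) - 1:]
    lags = np.arange(1, len(ac))
    bpm = 60.0 * sr / (hop * lags)
    valid = (bpm >= lo) & (bpm <= hi)
    if not np.any(valid):
        return None
    best = lags[valid][np.argmax(ac[1:][valid])]
    return 60.0 * sr / (hop * best)
```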
- a structure of song A and/or song B, i.e. a sequence of song parts such as intro, verse, bridge, chorus, interlude and outro, may be detected as mixing parameters by analyzing the decomposed data.
- the decomposed drum signal of song A shows a first pattern within the first four bars of the song, whereas in the following eight bars (bars 5 to 12), the drum timbre shows a second pattern different from the first pattern.
- in the following eight bars (bars 13 to 20), silence is detected in the decomposed drum signal, which means that the drums have a break for eight bars. Then, throughout the rest of song A, the decomposed drum data again show the first pattern.
- analyzing the decomposed vocal signal reveals that the first four bars as well as the last four bars of song A do not contain vocals (decomposed vocal signal is silent), whereas the rest of song A contains vocals.
- the decomposed harmonic signal is analyzed by a chord/harmony detection algorithm known as such in the prior art, in order to detect a chord progression of the harmonic components of song A. Since the decomposed harmonic signal does not contain the vocal components and the drum components of the original audio track, the chord/harmony detection algorithm can operate with much higher accuracy and reliability. Accordingly, a sequence of chords is detected, which usually changes for each bar.
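- As a sketch of such a chord detection step, chroma features of the decomposed harmonic signal can be matched against binary triad templates, one chord per bar; this uses librosa's chroma computation and is an illustrative stand-in for the prior-art algorithm referred to above.

```python
import numpy as np
import librosa

# Binary triad templates: one major and one minor chord per root note.
NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
TEMPLATES, LABELS = [], []
for root in range(12):
    for quality, intervals in (('major', (0, 4, 7)), ('minor', (0, 3, 7))):
        t = np.zeros(12)
        t[[(root + i) % 12 for i in intervals]] = 1.0
        TEMPLATES.append(t / np.linalg.norm(t))
        LABELS.append(f'{NOTES[root]} {quality}')
TEMPLATES = np.array(TEMPLATES)

def chords_per_bar(harmonic, sr, bar_seconds):
    """One chord label per bar, estimated from the decomposed harmonic signal."""
    chords, step = [], int(bar_seconds * sr)
    for i in range(0, len(harmonic) - step + 1, step):
        chroma = librosa.feature.chroma_stft(y=harmonic[i:i + step], sr=sr).mean(axis=1)
        chroma /= np.linalg.norm(chroma) + 1e-12
        chords.append(LABELS[int(np.argmax(TEMPLATES @ chroma))])
    return chords
```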
- the chord progression shows a four-bar pattern which repeats three times within the first 12 bars, i.e. a pattern G major, D major, E minor, C major.
- in the following eight bars (bars 13 to 20), the chord progression deviates from the before-mentioned four-bar pattern and now shows a new four-bar pattern D major, E minor, C major, C major, which is repeated once to obtain eight bars in total.
- the first four-bar pattern that was played at the beginning of song A is then repeated until the end of song A.
- the method according to the embodiment of the invention can deduce, from analyzing the three decomposed signals of song A, particular song parts, namely: a first song part that may be called "intro", forming the first four bars of song A; a second song part that may be called "verse 1", forming the following eight bars after the intro; a third song part that may be called "bridge", forming the following eight bars after verse 1; a fourth song part that may be called "chorus 1", forming the following eight bars after the bridge; a fifth song part that may be called "interlude", forming the following four bars after chorus 1; a sixth song part that may be called "chorus 2", forming the following eight bars after the interlude; and a seventh song part that may be called "outro", forming the following four bars after chorus 2.
- the method thus recognizes different song parts and corresponding song part junctions, i.e. the junction between the last bar of a previous song part and the first bar of a following song part.
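- Once per-bar cues (drum activity, vocal activity, chord) are available, candidate junctions can be read off wherever the cues change from one bar to the next, as in the following sketch; the feature layout is hypothetical.

```python
def song_part_junctions(per_bar_features):
    """Candidate song part junctions from per-bar feature tuples.

    `per_bar_features` holds one tuple per bar, e.g. (drums_active,
    vocals_active, chord); a junction is proposed wherever the tuple
    changes from one bar to the next.
    """
    return [bar for bar in range(1, len(per_bar_features))
            if per_bar_features[bar] != per_bar_features[bar - 1]]

# e.g. a four-bar intro without vocals followed by verse 1 with vocals:
bars = [(True, False, 'G major')] * 4 + [(True, True, 'G major')] * 8
assert song_part_junctions(bars) == [4]   # junction after bar 4
```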
- the method may determine a song structure of song B by analyzing the decomposed drum signal, the decomposed vocal signal and the decomposed harmonic signal of song B.
- the method may determine that song B has a song structure comprising four bars of intro, eight bars of verse 1, eight bars of chorus 1, eight bars of verse 2, eight bars of chorus 2 and four bars of outro.
- the mixing parameters determined based on an analysis of the decomposed data of song A and song B as described above may be used by device 10 and in a method according to the embodiment of the present invention for assisting a DJ in mixing songs A and B or for achieving semi-automatic or even automatic mixing of songs A and B.
- the mixing parameters described above may simply be displayed on a screen of device 10 such as to inform a user of the device 10 , in particular show the detected song parts and thereby assist mixing.
- a DJ may recognize certain song parts or song part junctions as suitable transition points at which a crossfade from song A to song B or vice versa can suitably be initiated, for example by pressing transition button 22 or operating transition controller 24 at a suitable point in time.
- the device 10 and the method according to the embodiment of the invention may automatically generate an output track by automatically mixing songs A and B, for example by playing a transition from song A to song B at a suitable point in time as determined from the song structure.
- transition points may be determined as the mixing parameters based on the detected song parts. For example, a first transition point on the timeline of song A may be the end of the interlude of song A, whereas a second transition point on the timeline of song B may be the beginning of chorus 1 of song B.
- the device 10 may then generate an output track that plays song A from its beginning to shortly before the end of the interlude, then plays a cross fade to song B starting song B at the beginning of its chorus 1, and then plays the rest of song B from the beginning of chorus 1 till the outro of song B.
- Other examples for suitable transition points would be the end of chorus 2 of song A on the one hand, and the beginning of verse 1 of song B (or the beginning of chorus 1 of song B) on the other hand.
- song B could be played almost from the beginning after song A has reached almost its end. This could be used as an automatic crossfade between subsequent songs of a playlist, for example.
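- A sketch of such an automatic choice of transition points from two detected song structures follows; the (name, start_bar, end_bar) layout and the preference order (interlude end, falling back to chorus 2 end, into chorus 1) merely mirror the examples above.

```python
def pick_transition_points(parts_a, parts_b):
    """Choose transition points from two detected song structures.

    `parts_a`/`parts_b` are lists of (name, start_bar, end_bar) tuples,
    a hypothetical layout. Song A fades out at the end of its interlude
    (falling back to the end of chorus 2); song B fades in at the start
    of its chorus 1 (falling back to its beginning).
    """
    def end_of(parts, name):
        return next((end for n, _, end in parts if n == name), None)

    def start_of(parts, name):
        return next((start for n, start, _ in parts if n == name), None)

    out_bar = end_of(parts_a, 'interlude') or end_of(parts_a, 'chorus 2')
    in_bar = start_of(parts_b, 'chorus 1') or 0
    return out_bar, in_bar
```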
- T1 transition start time
- T3 transition end time
- step of decomposing includes processing the first audio signal and/or the second audio signal within an AI system comprising a trained neural network.
- step of decomposing includes decomposing the first audio signal and/or the second audio signal with regard to predetermined timbres, such as to obtain decomposed signals of different timbres, said timbres preferably being selected from the group consisting of:
- Method of item 9 wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and/or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.
- Method of item 9 or item 10 wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-drum timbre, and/or wherein at least at a transition reference time and/or at a controller reference position a sum of the first transition function and the third transition function is larger than a sum of the second transition function and the fourth transition function.
- Method of at least one of the preceding items further including a step of analyzing an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction.
- song parts of a song are usually distinguishable by an analyzing algorithm since they differ in several characteristics such as instrumental density, medium pitch or rhythmic pattern.
- Song parts may in particular be a verse, a chorus, a bridge, an intro or an outro as conventionally known.
- Certain instrumental or rhythmic patterns will remain constant within a song part and will change in the next song part.
- Recognition of song parts may be supported by analyzing not only the entire input signal but instead or in addition thereto at least one of the decomposed signals, as described in item 14. For example, by analyzing a decomposed bass signal in isolation from the remaining sound components, it will be easy to derive therefrom a chord progression of the song which is one of the key criteria to differentiate song parts.
- an analysis of the decomposed drum signals allows a more accurate recognition of a rhythmic pattern and thus a more accurate detection of certain song parts.
- a song part junction then refers to a junction between one song part and the next song part.
- transition time intervals may include song part junctions, which allows carrying out the transition between two songs at the end of a song part, further improving the smoothness and likeability of the transition.
- Song parts may be detected by analyzing at least one of the decomposed signals within an AI system comprising a trained neural network.
- such analyzing includes detecting silence within the decomposed signal, said silence preferably representing an audio signal having a volume level smaller than −30 dB.
- the step of analyzing decomposed signals may include detecting silence continuously extending over a predetermined time span within the decomposed signal, said silence preferably representing an audio signal having a volume level smaller than −30 dB.
- start and/or end points of silence may be taken as song part junctions.
- first input audio track and/or the second input audio track are received as a continuous stream, for example a data stream received via the Internet, a real-time audio stream received from a live audio source or from a playback device in playback mode, and wherein playback of the first output signal and/or second output signal is started while continuing to receive the continuous stream.
- decomposing first and/or second input signal is carried out segment-wise, wherein decomposing is carried out based on a first segment of the input signal such as to obtain a first segment of the decomposed signal, and wherein decomposing of a second segment of the input signal is carried out while playing the first segment of the decomposed signal.
- Method of at least one of the preceding items wherein the method steps, in particular the steps of providing the first and second input signals, decomposing the first input signal, starting playback of the first output signal and starting playback of the second output signal, are carried out in a continuous process, wherein a time shift between receiving the first input audio track or a first portion of a continuous stream of the first input audio track and starting playback of the first output signal is preferably less than 10 seconds, more preferably less than 2 seconds, and/or wherein a time shift between receiving the second input audio track or a first portion of a continuous stream of the second input audio track and starting playback of the second output signal is preferably less than 10 seconds, more preferably less than 2 seconds.
- Device for processing audio signals comprising:
- the decomposition unit is configured to decompose the second input signal to obtain a plurality of decomposed signals comprising at least a third decomposed signal and a fourth decomposed signal different from the third decomposed signal
- Device of item 30 wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and/or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.
- the first decomposed signal and the third decomposed signal are different signals of a non-drum timbre, a vocal timbre or a harmonic timbre, and/or wherein a sum of the first transition function and the third transition function has a minimum, preferably substantially zero volume level, between the transition start time (T1) and the transition end time (T3) and/or between the controller first end position and the controller second end position.
- Device of at least one of items 22 to 34 further including an analyzing unit configured to analyze an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction.
- Device of at least one of items 22 to 35 further including a user interface configured to accept a user input referring to a transition command, including at least one transition parameter, wherein the transition unit is configured to set at least one of the transition functions according to the transition parameter, wherein the transition parameter is preferably selected from the group consisting of:
- Device of item 36 wherein the device includes a display unit configured to display a graphical representation of the first input audio track and/or the second input audio track, wherein the user interface is configured to receive at least one transition parameter through a selection or marker applied by the user in relation to the graphical representation of the first input audio track and/or the second input audio track.
- Device of item 36 or item 37 wherein the device includes a display unit configured to display a graphical representation of at least one of the decomposed signals, wherein the user interface is configured to allow a user to assign or deassign a preset transition function to or from a selected one of the plurality of decomposed tracks.
- Device of at least one of items 22 to 38 further comprising a tempo matching unit configured to determine a tempo of the first and/or second input track, and to carry out a tempo matching processing based on the determined tempo, including a time stretching or resampling of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching tempos.
- Device of at least one of items 22 to 39 further comprising a key matching unit configured to determine a key of the first and/or second input track, and to carry out a key matching processing based on the determined key, including a pitch shifting of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching keys.
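- The tempo matching computation (complementing the key matching sketch given earlier) can be sketched as follows; note that the naive resampling shown shifts pitch along with tempo, which is why a pitch-preserving time-stretching algorithm, or a subsequent key matching step, would be used in practice.

```python
import numpy as np

def tempo_match_ratio(bpm_a, bpm_b):
    """Stretch factor for track B so that its tempo matches track A.

    A ratio above 1 means track B is faster than track A and must be
    slowed down, i.e. its audio stretched to a longer duration.
    """
    return bpm_b / bpm_a

def resample_naive(x, ratio):
    """Naive resampling by linear interpolation (changes pitch with tempo).

    A production implementation would use a pitch-preserving time
    stretcher (e.g. a phase vocoder), as the "time stretching or
    resampling" wording above implies.
    """
    n = int(len(x) * ratio)
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)
```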
- a transition point as mentioned in the first to fifth aspects of the invention and in the claims may correspond to any of the transition start time, the transition end time and the transition reference time as described in the above items.
Description
- The present invention relates to a method for processing audio data based on one or more audio tracks of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres.
- Processing and reproducing audio data frequently involves mixing of different audio files. For example, in a DJ environment, two different audio tracks representing two different pieces of music are mixed when a DJ crossfades from one of the pieces of music to the other such as to avoid any audible interruption in the music performance. In other applications, such as during music production in a digital audio workstation (DAW), a mixing engineer mixes different audio tracks representing different instruments, vocals, etc. In a yet further example, during live broadcasting or live recording of a concert, a sound engineer is recording different audio sources such as different instruments or voices, by means of a plurality of microphones, pickups, etc., so as to produce mixed audio data for transmission through radio/TV broadcasting services or via the Internet.
- In all cases, mixing of audio tracks requires a significant amount of work by an experienced audio engineer or DJ to provide a satisfactory mixing result. The main parameters for successfully mixing audio tracks comprise the volumes of the audio tracks, the timing or phase of the audio tracks relative to one another, and audio effects that may be applied to the individual audio tracks before mixing. In order to correctly set those parameters such as to avoid any audio artefacts, dissonances or timing inaccuracies, the audio engineer may obtain information about the musical content of the individual audio tracks, including for example a key of the music, a tempo, a beat grid (time signature, beat emphases or accents etc.) or a particular instrument or a group of instruments contained in the audio tracks. Other relevant information relates to certain song parts such as a verse, a chorus, a bridge, an intro or an outro of a song. The audio engineer usually takes into account all of these parameters as mixing parameters when deciding about a suitable process for mixing particular audio tracks during production, processing or reproducing of audio.
- As a particular example, a DJ intending to change the song currently played usually tries to find a suitable transition point between the two songs, i.e. a point in time within the first song at which the first song is faded out, and a point in time within the second song at which the second song is faded in. For example, it may be advantageous to fade out the first song at the end of a chorus of the first song and, at the same time, to fade in the second song with the beginning of a verse of the second song. Accordingly, the DJ needs to determine the song parts of both songs in order to find a suitable transition point including a suitable timing for starting the second song. Furthermore, a transition between two songs can sound particularly smooth if both songs have the same or matching chords at the transition points and/or if both songs have mutually matching timbres, i.e. timbres which mix well with one another, for example a drum timbre and a piano timbre, while avoiding clashing of certain timbres, for example two vocal timbres at the same point in time at the transition point.
- As a result, the mixing of audio tracks requires a large amount of experience and attention of an audio engineer such that mixing of audio tracks is limited to professional applications.
- It was therefore an object of the present invention to provide a method and a device for processing audio data which assist mixing of audio tracks, in particular to obtain one or more mixing parameters that can be used to determine suitable mixing conditions or even allow semi-automatic or automatic mixing of audio tracks.
- According to a first aspect of the present invention, this object is achieved by a method for processing audio data, comprising the steps of providing a first audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; providing a second audio track; analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter; generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- Therefore, according to an important feature of the present invention, at least the mixed input data of the first audio track are decomposed such as to extract therefrom decomposed data representing only some of the timbres of the mixed input data, and the decomposed data are analyzed to determine at least one mixing parameter. Mixing of first and second audio tracks is then performed based on the at least one mixing parameter.
- By decomposing the mixed input data according to the different timbres contained therein, the content of the audio information contained in the mixed input data becomes accessible at a significantly higher level of detail or is even made available for analysis in the first place.
- For example, detection of the beats of a song can be achieved with higher accuracy when separating a drum timbre, and detecting a key or a chord progression of a piece of music can be achieved with higher certainty by analyzing decomposed data representing a bass timbre. The output track may then be generated by matching the beats or matching the keys of the two audio tracks before mixing the audio tracks.
- In the present disclosure, audio tracks, in particular the first audio track and the second audio track, may include digital audio data such as contained in audio files or digital audio streams. The files or streams may have a specific length or playback duration or alternatively may have an undefined or infinite length or playback duration, such as for example in case of a live stream or a continuous data stream received from a content provider via the Internet. Note that digital audio tracks are usually stored in an audio file in association with consecutive time frames, the length of each time frame being dependent on the sampling rate of the audio data as conventionally known. For example, in an audio file sampled at a sampling rate of 44.1 kHz, one time frame will have a length of 0.023 ms. Furthermore, audio tracks may be embodied by analog audio signals, for example signals played by an analog playback device such as a vinyl player, a tape player etc. In specific embodiments, audio tracks may be songs or other pieces of music provided in digital or analog format.
- Furthermore, the term “audio signal” refers to an audio track or any part or portion of an audio track at a certain position or time within the audio track. The audio signal may be a digital signal processed, stored or transmitted through an electronic control system, in particular computer hardware, or may be an analog signal processed, stored or transmitted by analog audio hardware such as an analog mixer, a PA system or the like.
- In a preferred embodiment of the present invention, the output track may comprise a first portion containing predominantly the first output data, and a second portion arranged after said first portion and containing predominantly the second output data. This method may be used in a DJ environment, in particular when mixing two songs using DJ equipment. In the first portion of the output track, only the first song is played as the first output data, while in a second portion only the second song is played as the second output data. The output track therefore switches from playback of the first song to playback of the second song.
- In the above embodiment, the step of analyzing audio data may include analyzing the decomposed data to determine a transition point as the mixing parameter, and the output track may be generated using the transition point such that the first portion is arranged before the transition point, and the second portion is arranged after the transition point. Thus, in a DJ application in which playback is switched from a first song to a second song, the method of the present invention may be used to find a suitable transition point at which playback of the songs is swapped.
- In particular, a transition point on the timeline of the output track may be defined by a first transition point on the timeline of the first audio track (e.g. corresponding to the first song) and a second transition point on the timeline of the second audio track (e.g. corresponding to the second song), wherein the output track then comprises the first portion containing predominantly the first output data obtained from the first audio track in a portion before the first transition point, and comprises the second portion containing predominantly the second output data obtained from the second audio track in a portion after the second transition point. Thus, the method of the invention may in particular include decomposing the first audio track to obtain first decomposed data, decomposing the second audio track to obtain second decomposed data, analyzing the first decomposed data to determine the first transition point as a first mixing parameter, analyzing the second decomposed data to determine the second transition point as a second mixing parameter, and generating the output track based on the first and second mixing parameters, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- Since the decomposed data are analyzed, the transition point(s) may be found more appropriately to allow a smooth transition between the songs, for example at a point where the decomposed drum track has a break or pause such that abrupt rhythmic changes can be avoided. In another example, by analyzing a decomposed bass track, in particular a chord progression defined by the bass track, the end of a chorus, a verse or any other song part can be determined automatically and a transition point can be determined at a junction between adjacent song parts.
- Moreover, in the embodiments described above, the output track may further include a transition portion, which is a time interval larger than zero, arranged between the first portion and the second portion and associated to (including) the transition point on the timeline of the output track, wherein in the transition portion a volume level of the first output data is reduced and/or a volume level of the second output data is increased. Therefore, within some sections of the transition portion or even during the entire transition portion, first output data and second output data overlap, i.e. are mixed to be played at the same time, wherein the volume levels of the first output data and the second output data may be adjusted to allow for a smooth transition from the first output data to the second output data without sudden breaks, sound artefacts or dissonant mixes. For example, the volume of the first output data may be continuously decreased over a part or the entire transition portion, while the volume level of the second output data may be continuously increased over a part or the entire transition portion. Transitions of the above-described type are called crossfades.
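- By way of illustration only, a crossfade of the above-described type may be sketched as follows, assuming mono sample arrays at a common sample rate; the equal-power curve is one common choice for the continuous volume ramps, and all names are illustrative rather than part of the disclosure:

```python
import numpy as np

def crossfade(first: np.ndarray, second: np.ndarray, sr: int,
              fade_seconds: float = 8.0) -> np.ndarray:
    """Overlap the tail of `first` with the head of `second` and crossfade."""
    n = int(fade_seconds * sr)
    t = np.linspace(0.0, 1.0, n)          # progress through the transition portion
    gain_out = np.cos(t * np.pi / 2)      # volume of first output data falls to 0
    gain_in = np.sin(t * np.pi / 2)       # volume of second output data rises to 1
    overlap = first[-n:] * gain_out + second[:n] * gain_in
    # First portion (song A only), transition portion, second portion (song B only).
    return np.concatenate([first[:-n], overlap, second[n:]])
```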
- As stated above, according to an important feature of the present invention, audio data, which include at least the decomposed data, are analyzed to determine one or more mixing parameters. Basically all parameters having an influence on the mixing process qualify as mixing parameters in the sense of the present invention. Mixing parameters therefore include, but are not limited to, the following examples:
- The mixing parameter may be a tempo of the first and/or second audio track, in particular a BPM (beats per minute) of the first and/or second audio track. Generation of the output track, i.e. mixing, may then include a tempo matching process in which the tempo or BPM of at least one of the first and second audio tracks or at least one of the first and second output data may be changed, such that the audio tracks or output data have the same or matching tempi or BPM. By analyzing decomposed data, for example a drum timbre, the tempo or BPM can be determined with higher accuracy and/or higher reliability.
- In a further embodiment of the invention, the at least one mixing parameter may refer to a beat grid of the first and/or second audio track. The beat grid refers to the rhythmic framework of a piece of music. In particular, the individual beats of each bar, including optionally information about time signature (for example a three-four time, a four-four time, a six-eight time, etc.), beat emphases or accents etc., may form the beat grid of a piece of music. Although conventional algorithms are known to recognize a beat grid of a piece of music, according to the present invention, the beat grid may be determined as a mixing parameter based on analyzing decomposed data, for example decomposed drum data or decomposed bass data. Since the beat grid is frequently determined by a drum pattern or a bass pattern, the beat grid can be determined with higher accuracy and higher reliability according to the present invention. Based on a determined beat grid, the step of generating an output track may take into account the determined beat grid or the determined beat grids of the first and/or second audio track by synchronizing the beat grids of the two audio tracks. Synchronizing beat grids may comprise resampling of audio data of the first and/or second audio track such as to stretch or compress the tempo of at least one of the audio tracks and thereby match the beat grids of the audio data.
- In another embodiment of the invention, the at least one mixing parameter may refer to a beat phase of the first and/or second audio track. The beat phase relates to a position (i.e. a timing) on the timeline of a piece of music comprising multiple bars, each bar having multiple beats according to the time signature of the music, wherein the beat phase is defined relative to a beginning of the current bar, i.e. relative to the previous downbeat position (first beat of a bar). For example, by matching beat phases of two pieces of music defined by the first and second audio tracks, a timing of the two pieces of music relative to their respective downbeat positions can be synchronized to achieve smooth mixing of the audio data without rhythmic artefacts. Synchronizing beat phase may comprise time-shifting the audio tracks relative to one another such as to achieve matching beats.
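- The tempo matching and beat-phase matching described above may be sketched, under the assumption that BPM values and one reference beat position per track are already known, as follows; naive linear-interpolation resampling is used for brevity (it alters pitch along with tempo, whereas a production system would typically use pitch-preserving time-stretching), and all names are illustrative:

```python
import numpy as np

def tempo_and_phase_match(b: np.ndarray, sr: int, bpm_a: float, bpm_b: float,
                          beat_a: float, beat_b: float) -> np.ndarray:
    """Stretch track B to track A's tempo, then shift it so that a known beat
    of B (beat_b, in seconds) coincides with a known beat of A (beat_a)."""
    ratio = bpm_a / bpm_b                     # >1 speeds B up, <1 slows it down
    positions = np.arange(0, len(b) - 1, ratio)
    left = positions.astype(int)
    frac = positions - left
    stretched = (1.0 - frac) * b[left] + frac * b[left + 1]
    # After stretching, B's reference beat has moved to beat_b / ratio seconds.
    shift = int(round((beat_a - beat_b / ratio) * sr))
    if shift >= 0:
        return np.concatenate([np.zeros(shift), stretched])  # delay B
    return stretched[-shift:]                                 # advance B
```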
- In a further embodiment of the present invention, the at least one mixing parameter may refer to a downbeat position within a first and/or a second audio track. In audio data containing music comprising a plurality of bars, a downbeat position refers to the position of the first beat of each bar. By analyzing decomposed data referring to an instrument of a rhythm section of the piece of music, for example drums, percussion, bass, rhythm guitar, etc., determination of the downbeat position can be achieved with higher accuracy and higher reliability as compared to results achieved by analyzing the mixed input data. In the step of generating an output track, first and second output data may be mixed in such a manner that their respective downbeat positions are synchronized in order to avoid any rhythmic clashes in the mix.
- In a further embodiment of the present invention, the at least one mixing parameter may refer to a beat shift between the first audio track and the second audio track. This embodiment achieves advantages similar to those described above for the mixing parameters beat grid, beat phase or downbeat position. In particular, if the beat shift between the first and second audio tracks is determined as the mixing parameter, smooth mixing may be achieved by introducing a time shift between the first output data and the second output data in such a manner as to achieve zero beat shift or a beat shift equal to one or more beats.
- According to a further embodiment of the present invention, the at least one mixing parameter may refer to a key or a chord progression of the first and/or second audio track. As used herein, a chord progression of a piece of music is a time-dependent parameter which denotes certain chords or root tones at certain points in time on the timeline of the music, such as for example C Major, C Major 7, A Minor etc. A key of the music is basically constant over the whole piece of music and relates to the root or key note of the tonic (home key) of the piece of music. Mixing of a first audio track and second audio track, in particular mixing of different pieces of music or different portions or components of a piece of music, achieves more favorable results, if the two audio tracks have equal or mutually matching keys. This will avoid any harmonic dissonances or other sound artefacts. Therein, mutually matching keys may refer to keys which have a total interval of a fourth, a fifth or an octave or multiples thereof in between. However, in order to achieve certain artistic effects, other intervals may be regarded as matching in the sense of the present invention. Although it is in general known to determine the key of an audio track, according to the present invention, the key of the first and/or second audio track is determined by decomposing the input audio data and analyzing the decomposed data obtained in the step of decomposing. This will achieve more accurate and more reliable results. For example, it may be advantageous to analyze decomposed bass data or decomposed guitar data or decomposed piano data, etc., as these instruments usually play an important role in defining the harmony of a piece of music and thereby the relevant key of the music.
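- A minimal sketch of testing for mutually matching keys and of deriving a pitch-shift value from two detected root notes, using the intervals named above (unison/octave, fourth, fifth), may look as follows; all names are illustrative:

```python
NOTE_TO_SEMITONE = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
                    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def keys_match(root_a: str, root_b: str) -> bool:
    """Keys match when the root interval is a unison/octave, fourth or fifth."""
    interval = (NOTE_TO_SEMITONE[root_b] - NOTE_TO_SEMITONE[root_a]) % 12
    return interval in (0, 5, 7)

def pitch_shift_semitones(root_a: str, root_b: str) -> int:
    """Smallest pitch shift to apply to track B so that both keys coincide."""
    up = (NOTE_TO_SEMITONE[root_a] - NOTE_TO_SEMITONE[root_b]) % 12
    return up if up <= 6 else up - 12        # prefer the shorter direction

# keys_match("C", "G") -> True (fifth); pitch_shift_semitones("C", "D") -> -2
```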
- Furthermore, by analyzing a chord progression, valuable information may be obtained regarding the structure of a piece of music, such as the sequence of particular song parts, for example verses, choruses, bridges, intros, outros, etc. In particular, in songs of western music, the same chord progressions are usually used for each verse or for each chorus. Analyzing a chord progression may therefore be useful to find particular positions within the first audio track, which are suitable for mixing with a particular position in the second audio track such that these positions qualify as first and second transition points for generating a crossfade from the first audio track to the second audio track as described above, for example. In another example, by identifying equal or similar chord progressions within a first portion within the first audio track and a second portion within the second audio track, the invention may generate an output track in which the first output data and the second output data are mixed together with similar volumes during a portion corresponding to the first and second portions, to create a mashup of two songs, while predominantly only the first output data or the second output data may be contained in the mix in other portions of the output track.
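- A naive sketch of locating equal chord progressions within two tracks, assuming each progression is given as one chord symbol per bar, could look as follows; the four-bar minimum and all names are illustrative:

```python
def shared_progressions(prog_a: list[str], prog_b: list[str],
                        min_bars: int = 4) -> list[tuple[int, int]]:
    """Bar positions (in A, in B) where both tracks play the same chord
    sequence for at least `min_bars` bars; candidates for mashup portions."""
    hits = []
    for i in range(len(prog_a) - min_bars + 1):
        for j in range(len(prog_b) - min_bars + 1):
            if prog_a[i:i + min_bars] == prog_b[j:j + min_bars]:
                hits.append((i, j))
    return hits

# shared_progressions(["Am", "F", "C", "G"] * 2, ["Am", "F", "C", "G"])
# -> [(0, 0), (4, 0)]
```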
- In a further embodiment of the present invention, the at least one mixing parameter may refer to a timbre or a group of timbres of the first and/or second audio track. This embodiment is based on the idea that some timbres mix better than other timbres. For example, a vocal timbre mixes well with instrumental timbres such as a guitar timbre or a piano timbre, while mixing of two vocal timbres is usually unfavorable due to the clashing of the two voices. Furthermore, timbres transporting strong harmonic information may be more difficult to mix with other harmonic timbres, but may more easily be combined with non-harmonic timbres such as drums. In essence, determining that the first and/or the second audio track contains a particular timbre, for example within a predetermined time interval of the respective track, may be useful information for the user to assist mixing or may even allow a semi-automatic or automatic mixing of the audio tracks.
- In a further embodiment of the present invention, the at least one mixing parameter may refer to a song part junction of the first and/or second audio track. As mentioned already above, song part junctions may be suitable positions within a song at which various mixing effects, including crossfades or transitions to another song, remixing with another song, audio effects (reverb, loop effects, equalizer etc.), may be applied in a natural manner. The determination of song part junctions can therefore be used to assist the mixing process or to allow for semi-automatic or even automatic mixing of two audio tracks. According to the present invention, the mixing parameter, in this example a song part junction, may be determined by analyzing decomposed data. Thus, a component of the audio mix that most clearly represents the structure of the song, for example a bass component, may be used to more accurately and more reliably determine the song part junctions.
- It should be noted that any of the above-mentioned mixing parameters is suitable to achieve the effects of the present invention, in particular to assist the mixing process. However, the results will become even better if a plurality of different mixing parameters are determined by analyzing the same or different decomposed data. For example, a structure of a piece of music can be determined with particularly high accuracy and reliability, if for example a first mixing parameter referring to a beat grid is determined by analyzing decomposed drum data, and a second mixing parameter relating to a chord progression is determined by analyzing decomposed bass data, while a third mixing parameter relating to a song part junction may then be determined based on the determined structure of the piece of music, i.e. based on the first mixing parameter and the second mixing parameter.
- The step of analyzing audio data may include detecting silence data within the decomposed data, said silence data preferably representing an audio signal having a volume level smaller than −30 dB. Herein a value of −30 dB refers to −30 dB FS (peak), i.e. to a volume level which is 30 dB smaller than the volume level of the loudest sound of the track. Alternatively, said silence data preferably represents an audio signal having a volume level smaller than −60 dB FS (RMS), i.e. referring to the absolute mean value. Silence within a particular timbre of the audio data, i.e. silence of a particular musical instrument or a voice component, may provide valuable information regarding the structure of the piece of music. For example, a bridge part is often characterized by a certain interval, such as four, eight or sixteen bars of silence in the bass component of the music. Further, while an intro part of a song is usually without any vocal timbres, the onset of the vocals may be an indication of the beginning of the first verse. Therefore, the step of analyzing audio data may preferably include detecting silence data continuously extending over a predetermined time span, for example over a time span of one, two, four, eight, twelve or sixteen bars, thus indicating a certain song part. Furthermore, an onset of a signal or a first signal peak within decomposed data after the predetermined time span of silence may indicate a downbeat position of a next song part, i.e. a song part junction.
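- Such silence detection on decomposed data may be sketched as follows, applying the −30 dB FS peak criterion per beat-long window and requiring a minimum run length expressed in 4/4 bars; the −60 dB FS RMS criterion could be implemented analogously, and all names are illustrative:

```python
import numpy as np

def silent_spans(stem: np.ndarray, sr: int, bpm: float, bars: int = 4,
                 peak_db: float = -30.0) -> list[tuple[float, float]]:
    """Spans (start, end) in seconds where a decomposed stem stays below the
    peak threshold for at least `bars` bars of 4/4, hinting at song parts."""
    full_scale = max(float(np.max(np.abs(stem))), 1e-12)  # 0 dB FS reference
    beat = int(sr * 60.0 / bpm)                           # one beat per window
    peaks = [float(np.max(np.abs(stem[i:i + beat])))
             for i in range(0, len(stem), beat)]
    silent = [20 * np.log10(p / full_scale + 1e-12) < peak_db for p in peaks]
    spans, start = [], None
    for i, s in enumerate(silent + [False]):   # sentinel closes a trailing run
        if s and start is None:
            start = i
        elif not s and start is not None:
            if i - start >= bars * 4:          # number of beats in `bars` bars
                spans.append((start * beat / sr, i * beat / sr))
            start = None
    return spans
```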
- According to a preferred embodiment of the present invention, the step of analyzing audio data may include determining at least a first mixing parameter based on the decomposed data, and at least a second mixing parameter based on the first mixing parameter. For example, the first mixing parameter may be a key of the first or second audio track, while the second mixing parameter may be a pitch shift value referring to a pitch shift to be applied to either one of the first and second audio tracks such as to match the keys of the first and second audio tracks. In another example, the second mixing parameter may be the transition point at which the output track includes a transition from the first output data to the second output data, for example by means of a crossfade. If the second mixing parameter is the transition point, the first mixing parameter may for example be a song part junction, a beat phase or any other mixing parameter referring to a particular position or range within a piece of music relative to the musical content (song parts, bars, musical breaks, etc.). Such embodiments are particularly suitable to allow a DJ to find suitable transition points for changing from a first song to a second song. In particular, if the transition point is one of the mixing parameters, semi-automatic or automatic transitions can be realized in which a user, for example a DJ, just inputs his/her intention to change from playback of the first song towards playback of the second song or just specifies which songs should be mixed, wherein a suitable transition point is then automatically determined by a computer program according to a method of the present invention. One or more suitable transition points may then be proposed to the DJ for manual selection (semi-automatic mixing) or, alternatively, mixing is automatically initiated and carried out at a suitable transition point without any further user interaction (automatic mixing).
- Methods according to the first aspect of the invention use a step of decomposing mixed input data to obtain decomposed data. Several decomposing algorithms and services are known in the art, which allow decomposing audio signals to separate therefrom one or more signal components of different timbres, such as vocal components, drum components or instrumental components. Such decomposed signals and decomposed tracks have been used in the past to create certain artificial effects such as removing vocals from a song to create a karaoke version of a song.
- More specifically, with regard to decomposing audio data there have also been several approaches based on artificial intelligence (AI) and deep neural networks in order to decompose mixed audio signals and separate therefrom signals of certain timbres. Such AI systems usually implement a convolutional neural network (CNN), which has been trained by a plurality of data sets, for example each including a vocal track, a harmonic/instrumental track and a mix of the vocal track and the harmonic/instrumental track. Examples of such conventional AI systems capable of separating source tracks such as a singing voice track from a mixed audio signal include: Prétet, “Singing Voice Separation: A study on training data”, Acoustics, Speech and Signal Processing (ICASSP), 2019, pages 506-510; “spleeter”, an open-source tool provided by the music streaming company Deezer based on the teaching of Prétet above; “PhonicMind” (https://phonicmind.com), a voice and source separator based on deep neural networks; “Open-Unmix”, a music source separator based on deep neural networks in the frequency domain; and “Demucs” by Facebook AI Research, a music source separator based on deep neural networks in the waveform domain. These tools accept music files in standard formats (for example MP3, WAV, AIFF) and decompose the song to provide decomposed/separated tracks of the song, for example a vocal track, a bass track, a drum track, an accompaniment track or any mixture thereof.
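- As an illustration of how one of the cited tools may be invoked, the following sketch uses the Python API of “spleeter” as published by Deezer; the exact module, class and model names depend on the installed version:

```python
# pip install spleeter
from spleeter.separator import Separator

# 'spleeter:4stems' separates vocals, drums, bass and other (accompaniment),
# which maps onto the vocal/drum/harmonic timbres discussed in this disclosure.
separator = Separator('spleeter:4stems')

# Writes vocals.wav, drums.wav, bass.wav and other.wav below output/song/.
separator.separate_to_file('song.mp3', 'output/')
```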
- In general, all types of decomposing algorithms can be used for decomposing the mixed input data. Different algorithms, for example algorithms as known in the art and mentioned above, achieve different results with respect to quality of the decomposition and speed of processing. Preferably, in embodiments of the present invention, the step of decomposing the mixed input data includes processing the mixed input data, in particular the first audio track and/or the second audio track, within an AI system comprising a trained neural network. AI systems achieve a high level of quality and in particular allow decomposing different timbres of a mixed audio signal, which in particular may correspond to or resemble certain source tracks that were originally mixed when producing or generating the input audio track, such as certain instrumental tracks, vocal tracks, drum tracks etc. More particularly, the step of decomposing may include decomposing the first/second audio tracks with regard to predetermined timbres such as to obtain decomposed signals of different timbres, preferably being selected from the group consisting of a vocal timbre, a non-vocal timbre, a drum timbre, a non-drum timbre, a harmonic timbre, a non-harmonic timbre, and any combination thereof. The non-vocal timbre, the non-drum timbre and the non-harmonic timbre may in particular be respective complement signals to that of the vocal timbre, the drum timbre and the harmonic timbre.
- Complement signals may be obtained by excising from the input signal one decomposed signal of a specific timbre. For example, an input signal may be decomposed or separated into two decomposed signals, a decomposed vocal signal of a vocal timbre, and its complement, a decomposed non-vocal signal of a non-vocal timbre, which means that a mixture of the decomposed vocal signal and the decomposed non-vocal signal results in a signal substantially equal to the input signal. Alternatively, decomposition can be carried out to obtain a decomposed vocal track and a plurality of decomposed non-vocal tracks such as a decomposed drum track and a decomposed harmonic track (including harmonic instruments such as guitars, piano, synthesizer).
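- Assuming sample-aligned signals at equal scale, a complement signal as described above is simply the difference between the input signal and the decomposed signal, as in the following sketch (names are illustrative):

```python
import numpy as np

def complement(mixed: np.ndarray, decomposed: np.ndarray) -> np.ndarray:
    """Excise one decomposed signal from the input signal; the decomposed
    signal plus its complement then reproduces the input signal."""
    n = min(len(mixed), len(decomposed))
    return mixed[:n] - decomposed[:n]
```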
- Furthermore, at least one of the steps of analyzing the audio data and generating the output track may include processing of audio data within an AI system comprising a trained neural network. For example, a neural network capable of analyzing audio data to determine at least one mixing parameter as described above may be obtained by training using training data containing a plurality of pieces of music together with data relating to the respective musical structure, such as beat grid, downbeat position, key, chord progression, song parts or song part junctions. After the training process, the neural network may then be capable of detecting such mixing parameters based on decomposed data of new pieces of music. On the other hand, a neural network suitable for generating the output track may be trained using training data in which each set of training data contains two audio tracks and one or more associated mixing parameters suitable for mixing the two audio tracks without dissonances or sound artefacts. The trained neural network will then be capable of mixing new audio tracks based on at least one mixing parameter determined by analyzing decomposed data and additional mixing parameters determined through artificial intelligence (AI).
- The method of the present invention may generally be used in all situations of audio processing, in which two audio tracks are to be mixed. For example, in a DAW, the present invention may be implemented as a plugin or in the form of any other suitable software algorithm in order to help a user to mix different audio tracks referring to different instruments, song parts, songs or other audio signals in general. In a further preferred application, the method may be used in a DJ environment, for example in a DJ software application, in order to assist a DJ when mixing a piece of music with any other audio signal such as a second piece of music, and even to allow automatic, autonomous mixes without needing any human supervision. In view of this background, the method of the present invention may further include a step of playing the output track, including a playback through a PA system, loudspeakers, headphones or any other sound-reproducing equipment.
- In general, the method of the present invention can be applied to any type of input audio track. For example, the input audio track may be stored on a local device such as a storing means of a computer, and may be present as a digital audio file. Furthermore, the first audio track or the second audio track may be received as a continuous stream, for example a data stream received via the Internet, a real-time audio stream received from a live audio source or from a playback device in playback mode. Thus, the range of applications is basically not limited to a specific medium. When receiving the first/second audio track as a continuous stream, playback of the output track may be started while continuing to receive the continuous stream. This has particular advantages in many situations where the audio tracks do not have a certain length or playback duration as the length is either unlimited or undefined, for example in case of processing signals from a live concert or live broadcasting. Furthermore, it is not necessary to wait until a certain audio file is completely downloaded or received or until a certain audio track has completely been played by the playback device, but instead playback of the output signals based on the received input signals can be started earlier.
- In another preferred embodiment of the present invention, decomposing first and/or second audio tracks is carried out segment-wise, wherein decomposing is carried out based on a first segment of the input signal such as to obtain a first segment of the decomposed signal, and wherein decomposing of a second segment of the input signal is carried out while playing the first segment of the decomposed signal. Partitioning the first and/or second input signals into segments (preferably segments of equal lengths) and operating the method of the invention based on these segments allows using the decomposition result for generating the output track at an earlier point in time, i.e. after finishing decomposition of just one segment, without having to wait until the decomposition result of an entire audio file, for example, is available. Another advantage of the segmentation is that decomposition of the second audio track, if applicable, can start at an arbitrary point within the second audio track. For example, when a transition is to be made from the first audio track towards the second audio track such as to start playback of the second audio track at e.g. 01:20 (one minute, twenty seconds), decomposition of the second audio track can start at the segment closest to 01:20, and the beginning part of the second audio track which is not used does not have to be decomposed. This saves processing resources and ensures that decomposition results are available much faster. Preferably, one segment has a playback duration which is smaller than 20 seconds.
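- The segment bookkeeping described above may be sketched as follows, assuming a fixed segment length below the 20-second bound; the chosen length and all names are illustrative:

```python
SEGMENT_SECONDS = 10.0   # below the 20-second bound mentioned above

def first_segment_for(start_seconds: float) -> int:
    """Segment at which decomposition begins when playback of a track is
    requested from an arbitrary position, e.g. 01:20 = 80 seconds."""
    return int(start_seconds // SEGMENT_SECONDS)

def segment_bounds(index: int, sr: int) -> tuple[int, int]:
    """Sample range covered by one segment."""
    length = int(SEGMENT_SECONDS * sr)
    return index * length, (index + 1) * length

# Playback from 01:20: segment 8 is decomposed first; segment 9 can be
# decomposed while segment 8 is already playing. Earlier segments are skipped.
assert first_segment_for(80.0) == 8
```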
- The method steps, in particular the steps of providing the first and second audio tracks, decomposing the mixed input data, analyzing the decomposed data and generating the output track, may be carried out in a continuous process, wherein a time shift between receiving the first audio track or a first portion of a continuous stream of the first audio track and obtaining the output track or the first segments of the output track is preferably less than 10 seconds, more preferably less than 2 seconds.
- In a further embodiment of the present invention, at least one, preferably all of the mixed input data, the first and second audio tracks, the decomposed data, the output track, and the first and second output data, represent stereo signals, each comprising a left channel signal portion and a right channel signal portion, respectively. The method is thus suitable for playing music at high quality.
- According to a second aspect of the present invention, the above-mentioned object is achieved by a device for processing audio data, preferably a device adapted to carry out a method according to at least one of the embodiments described above, said device comprising a first input unit for receiving a first audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; a second input unit for receiving a second audio track; a decomposition unit for decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; an analyzing unit for analyzing audio data, including at least the decomposed data, to determine at least one mixing parameter; and an output generation unit for generating an output track based on the at least one mixing parameter, said output track comprising first output data obtained from the first audio track and second output data obtained from the second audio track.
- Thus, the second aspect of the invention provides a device having similar or corresponding features as the method of the first aspect of the present invention described above. Therefore, similar or corresponding effects and advantages may be achieved by a device of the second aspect of the present invention as described above for the first aspect of the present invention. In addition, a device of the second aspect of the invention may be adapted to carry out a method of the first aspect of the present invention. Furthermore, embodiments of the device of the second aspect of the present invention may be particularly adapted to carry out one or more of the steps described above for embodiments of the first aspect of the present invention in order to achieve the same effects and advantages.
- The device of the second aspect of the present invention is preferably embodied as a computer, in particular a tablet, a smartphone, a smartwatch or another wearable device, and may include in the manner as conventionally known a RAM, a ROM, a microprocessor and suitable input/output means. Included in the computer or connected to the computer may be an audio interface which may be connected, for example wirelessly (e.g. via Bluetooth or similar technology), to speakers, headphones or a PA system in order to output sound when playing the first and second output signals, respectively. As a further alternative, the device may be embodied as a standalone DJ device including suitable electronic hardware or computing means. Preferably, the device is running a suitable software application in order to control its hardware components, usually standard hardware components of general purpose computers, tablets, smartphones, smartwatches or other wearable devices, such as to function as units of the device of the second aspect and/or such as to implement the steps of the method of the first aspect of the invention.
- If the device uses an AI system for decomposing audio data, the device preferably has a decomposition unit which includes the AI system comprising a trained neural network. This means that the complete AI system including the trained neural network may be integrated within the device, for example as a software application or software plugin running locally in a memory integrated within the device. Furthermore, the device preferably includes a user interface embodied by a display such as a touch display or a display to be operated by a pointer device, by one or more hardware control elements such as a hardware fader or rotatable hardware knobs, by voice command, or by any other user input/output technology.
- According to a third aspect of the present invention, the above-mentioned object is achieved by a computer program which is adapted, when run on a computer, such as a tablet, a smartphone, a smartwatch or another wearable device, to carry out a method according to the first aspect of the present invention, or to control the computer as a device according to the second aspect of the present invention. A computer program according to the third aspect of the present invention therefore achieves the same or corresponding effects and advantages as described above for the first and second aspects of the present invention.
- According to a fourth aspect of the present invention, the above-mentioned object is achieved by a method for processing audio data, comprising the steps of providing an audio track of mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres; decomposing the mixed input data to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres; and analyzing the decomposed data to determine a transition point or a song part junction between a first song part and a second song part within the audio track, or to determine any other track parameter. A method of the fourth aspect of the present invention allows determination of one or more song part junctions within a piece of music based on analyzing decomposed data. It therefore becomes possible to analyze a song structure of an audio track containing mixed input data, i.e. a song containing a plurality of different timbres, for example by analyzing decomposed audio data representing characteristic timbres such as a bass timbre. Song parts may therefore be determined more accurately and more reliably. The junctions between the song parts provide valuable information to the user, in particular to a DJ or an audio engineer during music production. For example, one or more junctions within a piece of music may be indicated graphically on a screen, and the method may allow a user to control a mixing process based on the one or more junctions, for example to jump to a junction, to cut out a song part between two junctions, to time-shift songs such as to synchronize junctions, etc. Furthermore, the method of the fourth aspect allows determination of any other track parameter, such as at least one of a tempo, a beat, a BPM value, a beat grid, a beat phase, a key and a chord progression of the respective audio track.
- According to a fifth aspect of the present invention, the above object is achieved by a method for processing audio data, comprising the steps of providing a set of audio tracks, each including mixed input data, said mixed input data representing audio signals containing a plurality of different timbres; decomposing each audio track of the set of audio tracks, such as to obtain a decomposed track associated with the respective audio track, wherein the decomposed track represents an audio signal containing at least one, but not all, of the plurality of different timbres of the respective audio track, thereby obtaining a set of decomposed tracks; analyzing each decomposed track of the set of decomposed tracks to determine at least one track parameter of the respective audio track which the decomposed track is associated with; selecting or allowing a user to select at least one selected audio track out of the set of audio tracks, based on at least one of the track parameters; and generating an output track based on the at least one selected audio track.
- A method of the fifth aspect of the present invention basically assists a user in selecting one of a plurality of audio tracks for further processing, in particular mixing, editing and playback. For example, in a situation where a user is to select one of a plurality of pieces of music, while conventional metadata available for music provided through conventional music distribution services, such as through Internet streaming providers, are limited to certain standard information such as the title of a song, the length of a song, an artist name, a musical genre, etc., the method according to the fifth aspect of the invention allows adding additional information related to the musical content of the particular audio tracks in the form of the at least one track parameter, wherein the track parameter, according to the fifth aspect of the invention, is determined through analyzing at least one decomposed track obtained from the particular audio track. Accordingly, the selection of songs is greatly assisted, in particular in cases where the candidate pieces of music are partially or fully unknown to the user. Selection and processing of music is thus improved in particular for inexperienced users or when less common pieces of music are to be selected. Furthermore, automatic selection of audio tracks by an algorithm based on the track parameter, without user interaction, can be implemented. In this way, playlists could automatically be generated based on timbres or proportions of individual timbres included in the audio tracks. For example, a non-vocal playlist or instrumental playlist could be generated by automatic selection of songs that do not contain vocal timbres.
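- As an illustrative sketch, such an instrumental playlist could be derived from previously computed timbre proportions as follows; the dictionary layout and the vocal threshold are assumptions, not part of the disclosure:

```python
def instrumental_playlist(tracks: list[dict], max_vocal: float = 0.05) -> list[dict]:
    """Select tracks whose decomposed vocal stem carries (almost) no energy.

    Each entry is assumed to hold track parameters previously obtained by
    decomposing and analyzing the track, e.g.
    {"title": "...", "vocal": 0.02, "drums": 0.55, "harmonic": 0.43},
    with the three timbre proportions summing to roughly 1.
    """
    return [t for t in tracks if t["vocal"] <= max_vocal]
```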
- For example, the track parameter may refer to at least one timbre of the respective audio track. The user may therefore be informed about timbres contained in the plurality of audio tracks. For example, the method may indicate to a user which of a plurality of audio tracks contains vocal music or which tracks contain a predominant piano timbre. Audio tracks may be suitably marked or highlighted such as to inform the user about the timbres included therein, or the method may allow for sorting or filtering a list of audio tracks based on timbres. As a mere example, a DJ currently playing a song that includes vocals may look for a second song predominantly containing a guitar or a piano timbre, wherein the method of the fifth aspect of the invention may assist and accelerate such selection and/or even allow selection of guitar/piano songs from a list of audio tracks unknown to the user as such. However, even for experienced DJs who are familiar with all songs of the set of audio tracks, the method of the fifth aspect of the invention may be useful to accelerate the process of selecting a suitable audio track.
- In further embodiments of the invention according to the fifth aspect, the track parameter may refer to at least one of a tempo, a beat, a BPM value, a beat grid, a beat phase, a key and a chord progression of the respective audio track. The at least one track parameter may likewise be indicated to the user by virtue of a suitable graphical representation, highlighting, coloring or numeral representation. Moreover, sorting or filtering of lists of audio tracks may be based on the at least one track parameter. For example, if a DJ plays a particular first song having a particular first chord progression, the method according to the fifth aspect of the invention may be used to search for a second song among a set of audio tracks, which contains the same or at least partially the same chord progression as the first song, such that mixing of the two songs or crossfading between the songs will result in a particularly continuous sound of the output track without audible breaks or dissonances.
- In a particularly simple embodiment of the invention, the selected audio track is just played back, in particular without mixing, editing or otherwise changing its content. In this embodiment, the method of the fifth aspect of the invention may in particular be applied in a music player and may assist a user in finding and selecting a desired song for playback. For example, if the at least one track parameter relates to a beat grid of the respective audio tracks (for example a time signature), a user may be enabled to easily find songs of certain beat grids, for example three-four time songs from among a plurality of audio tracks.
- In the methods and embodiments mentioned above, a second audio track may contain mixed input data, said mixed input data representing an audio signal containing a plurality of different timbres, wherein the mixed input data are decomposed to obtain decomposed data representing an audio signal containing at least one, but not all, of the plurality of different timbres, wherein analyzing may be carried out taking into account the decomposed data obtained from the second audio track. Accordingly, in the step of analyzing and determining the at least one mixing parameter, both the first audio track and the second audio track may be analyzed on the basis of their respective decomposed data. This will in particular allow comparing the first audio track and the second audio track with regard to parameters such as tempo, beat, BPM value, beat grid (the beats contained within a song, optionally including information about at least one of time signature, emphases and downbeat positions), beat phase, key, chord progression, song parts and song part junctions, etc.
- In a further embodiment of the present invention, the mixed input data of the first and/or second audio track are decomposed to obtain at least decomposed data of a vocal timbre, decomposed data of a harmonic timbre and decomposed data of a drum timbre, or to obtain exactly three decomposed tracks which are a decomposed track of a vocal timbre, a decomposed track of a harmonic timbre and a decomposed track of a drum timbre, wherein the three tracks preferably sum up to an audio track substantially equal to the first and/or second audio track, respectively. A vocal timbre may include a single vocal component or a mixture of different vocal components of the piece of music. A drum timbre may include the sound of a single drum instrument, a drum ensemble, a percussion instrument, etc. The drum timbre usually does not contain harmonic information. A harmonic timbre may include timbres of harmonic instruments such as a piano, a guitar, synthesizers, brass, etc. Decomposition into vocal, drum and harmonic timbres produces the most important components defining the musical content and structure of most music, in particular most pieces of western music. Such decomposition therefore provides a good yet efficient basis for analyzing the audio data and determining at least one mixing parameter and/or at least one track parameter. In addition, decomposition into vocal, drum and harmonic timbres greatly assists the mixing process, i.e. generation of an output track based on mixing two or more of the decomposed tracks.
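- Whether three decomposed tracks sum up to an audio track substantially equal to the input track may be verified, for example, via the RMS level of the residual relative to the input, as in the following sketch; the −40 dB tolerance is an illustrative choice:

```python
import numpy as np

def is_complete_decomposition(mix: np.ndarray, vocal: np.ndarray,
                              drums: np.ndarray, harmonic: np.ndarray,
                              tolerance_db: float = -40.0) -> bool:
    """True if the vocal, drum and harmonic tracks sum up to an audio track
    substantially equal to the input track (negligible residual)."""
    residual = mix - (vocal + drums + harmonic)
    rms = float(np.sqrt(np.mean(residual ** 2)))
    ref = float(np.sqrt(np.mean(mix ** 2))) + 1e-12
    return 20 * np.log10(rms / ref + 1e-12) < tolerance_db
```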
- Preferred embodiments of the present invention will be described in the following on the basis of the attached drawings, wherein
- FIG. 1a shows a device according to an embodiment of the present invention,
- FIG. 1b shows a song select window that may be displayed by a device of the embodiment of the invention,
- FIG. 2 shows a schematic functional diagram of components of the device of the embodiment shown in FIG. 1a, and
- FIG. 3 shows a schematic illustration of an example mode of operation of the device shown in FIGS. 1a, 1b and 2, and a method for processing audio data according to an embodiment of the invention.
- A device 10 according to an embodiment of the present invention may be formed by a computer such as a tablet computer, a smartphone, a smartwatch or another wearable device, which comprises standard hardware components such as input/output ports, wireless connectivity, a housing, a touchscreen, an internal storage as well as a plurality of microprocessors, RAM and ROM. Essential features of the present invention are implemented in device 10 by means of a suitable software application or a software plugin running on device 10.
- The display of device 10 preferably has a first section 12a associated to a first song A and a second section 12b associated to a second song B. First section 12a includes a first waveform display region 14a which displays at least one graphical representation of song A, in particular one or more waveform signals associated to song A. For example, the first waveform display region 14a may display a waveform of song A and/or one or more waveforms of decomposed signals obtained from decomposing song A. For example, decomposition of song A may be carried out to obtain a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which may be displayed within the first waveform display region 14a. Likewise, a second waveform display region 14b may be included in the second section 12b such as to display a graphical representation related to song B in the same or corresponding manner as described above for song A. Thus, the second waveform display region 14b may display one or more waveforms of song B and/or at least one waveform of a decomposed signal obtained from song B.
- The first
waveform display region 14a may have a song select button A, which may be pressed by a user to select song A from among a plurality of audio tracks offered by an Internet provider or stored on a local storage device. In a corresponding manner, the second waveform display region 14b includes a song select button B, which may be activated by a user to select song B from a plurality of audio tracks. FIG. 1b shows an example of a song select window, which may pop up when song select button A is activated by a user. The song select window offers a list of audio tracks and invites the user to select one of the audio tracks as song A.
- According to an embodiment of the present invention, the list of audio tracks as shown in
FIG. 1b shows metadata of each audio track which include, for each audio track, a title, an artist name, a track length, a BPM value, a main timbre and timbre component data referring to proportions of individual timbres within the audio track. While the title, the artist and the track length may be directly read from metadata of the audio file as usually provided through commercial music providers, or may be stored as metadata together with the audio data of the audio track on a storage device, the BPM value, the main timbre and the timbre component data are examples of track parameters in the sense of the present invention, which are usually not provided by the distributors with the original audio tracks but which are obtained by device 10 according to the embodiment of the invention through decomposing the particular audio track and then analyzing the decomposed data.
- For example, by analyzing a decomposed drum track, a BPM value can be obtained for a given audio track. Likewise, by analyzing a plurality of decomposed tracks associated to particular timbres such as a vocal timbre, a harmonic/instrumental timbre or a drum timbre, information regarding the presence and/or distribution (i.e. relative proportions) of certain timbres, i.e. certain instruments, can be obtained. In particular, a predominant timbre of an audio track can be determined, which represents a main character of the music contained in the audio track and is denoted as a main timbre for each audio track in the example of
FIG. 1b. Furthermore, in the example of FIG. 1b, a proportion of a drum timbre within the audio track is indicated by a drum proportion indicator, a proportion of a harmonic/instrumental timbre within the audio track is indicated by a harmonic/instrumental indicator, and a proportion of a vocal timbre within the audio track is indicated by a vocal indicator. The indicators may be formed by level indicators showing the proportion of the respective timbre from a minimum value (not present, for example 0) to a maximum value (maximum proportion, for example 5).
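- Such level indicators may, for example, be derived from the decomposed stems by mapping each stem's share of the total RMS energy onto the 0-to-5 scale, as in the following sketch; names and the energy measure are illustrative:

```python
import numpy as np

def timbre_levels(stems: dict[str, np.ndarray], steps: int = 5) -> dict[str, int]:
    """Map each decomposed stem's share of the total RMS energy onto a level
    indicator from 0 (not present) to `steps` (maximum proportion)."""
    rms = {name: float(np.sqrt(np.mean(s ** 2))) for name, s in stems.items()}
    total = sum(rms.values()) or 1.0
    return {name: round(steps * value / total) for name, value in rms.items()}

# timbre_levels({"drums": d, "harmonic": h, "vocal": v})
# -> e.g. {"drums": 2, "harmonic": 2, "vocal": 1}
```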
- Therefore, the user may easily create desired mixes, for example a mix of a vocal song and an instrumental song. In addition or alternatively, device 10 may analyze decomposed harmonic tracks (instrumental, vocals etc.) of the audio tracks in order to determine a key or a chord progression as track parameters of the audio tracks.
- With reference again to
FIG. 1a, each of the first and second sections 12a and 12b may include further control elements. For example, the first section 12a may include a play button 18a which can be pushed by a user to alternately start and stop playback of song A (more precisely, audio signals obtained from song A, such as decomposed signals). Likewise, the second section 12b may include a play button 18b which may be pushed by a user to alternately start and stop playback of song B (more precisely, audio signals obtained from song B, such as decomposed signals).
- An
device 10 in accordance with the settings ofdevice 10 and with a control input received from a user may be output at anoutput port 20 in digital or analog format, such as to be transmitted to a further audio processing unit or directly to a PA system, speakers or head phones. Alternatively, the output signal may be output through internal speakers ofdevice 10. - According to the embodiment of the present invention,
device 10 can perform a smooth transition from playback of song A to playback of song B by virtue of a transition unit, which will be explained in more detail below. In the present embodiment, device 10 may comprise a transition button 22 displayed on the display of device 10, which may be pushed by a user to initiate a transition from playback of song A towards playback of song B. By a single operation of transition button 22 (pushing the button 22), device 10 starts changing individual volumes of individual decomposed signals of songs A and B according to respective transition functions (volume level as a function of time) such as to smoothly crossfade from song A to song B within a predetermined transition time interval.
- Pressing the
transition button 22 can directly or immediately start the transition from song A to song B or may control a transition unit, which is to be described in more detail later, such as to analyze decomposed signals of song A and/or song B in order to determine at least one mixing parameter and to play an automatic transition based on the at least one mixing parameter. For example, as will be described later as well, a suitable transition point, i.e. a suitable first transition point on the timeline of song A and/or a suitable second transition point on the timeline of song B, and/or a length of a transition portion (duration of the transition), may be determined by the transition unit in response to an activation of transition button 22.
- In addition or alternatively,
device 10 may include a transition controller 24 which can be moved by a user between a first controller end point referring to playback of only song A and a second controller end point referring to playback of only song B. This allows controlling the volumes of individual decomposed signals of songs A and B using transition functions, which are based not on time but on the controller position of the transition controller 24. In this manner, in particular the speed and progress of the transition can manually be controlled through the transition controller 24.
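- The following sketch illustrates per-stem transition functions of the kind described above, using the names DA, VA, HA, DB, VB, HB that appear in the description of FIG. 2 below; the particular curves (vocals of song A faded out early, vocals of song B faded in late, drums swapped near the middle of the transition) are illustrative choices only:

```python
import numpy as np

# One transition function per decomposed signal, mapping transition progress
# x in [0, 1] (elapsed transition time, or the position of transition
# controller 24) to a volume level in [0, 1].
TRANSITION_FUNCTIONS = {
    "DA": lambda x: 1.0 if x < 0.5 else 2.0 * (1.0 - x),  # drums A out late
    "VA": lambda x: max(0.0, 1.0 - 2.0 * x),              # vocals A out early
    "HA": lambda x: 1.0 - x,                              # harmonic A, linear
    "DB": lambda x: 0.0 if x < 0.5 else 2.0 * (x - 0.5),  # drums B in late
    "VB": lambda x: max(0.0, 2.0 * x - 1.0),              # vocals B in late
    "HB": lambda x: x,                                    # harmonic B, linear
}

def recombine(stems_a: dict, stems_b: dict, x: float) -> np.ndarray:
    """Volume-weighted sum of all six decomposed signals at progress x; the
    stem dictionaries use keys 'D', 'V', 'H' for drum, vocal and harmonic."""
    out = sum(TRANSITION_FUNCTIONS[k + "A"](x) * stems_a[k] for k in "DVH")
    return out + sum(TRANSITION_FUNCTIONS[k + "B"](x) * stems_b[k] for k in "DVH")
```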
- FIG. 2 shows a schematic illustration of internal components of device 10 and a signal flow within device 10.
- Audio processing is based on a first input track and a second input track, which may be stored within the
device 10, for example in an internal memory of the device, a hard drive or any other storage medium. First and second input tracks are preferably digital audio files of a standard compressed or uncompressed audio file format such as mp3, WAV, AIFF or the like. Alternatively, first and second input tracks may be received as continuous streams, for example via an Internet connection of device 10 or from an external playback device via an input audio interface or via a microphone.
- First and second input tracks are preferably processed within first and
second input units 26a and 26b, respectively, which may be configured to decrypt or decompress the audio data, if necessary, and/or to extract a segment of the first input track and a segment of the second input track in order to continue processing based on those segments. This has the advantage that time-consuming processing algorithms, such as the decomposition based on a neural network, do not have to analyze the entire first or second input track upfront but can operate on shorter segments, which allows processing to continue and playback to start at an earlier point in time. In addition, when the first and second input tracks are received as continuous streams, it would in many cases not be feasible to wait until the complete input tracks have been received before starting to process the data.
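- The segment-wise pipelining described here can be sketched as follows; `decompose` and `play` are hypothetical callables standing in for the neural-network decomposition and the playback path.

```python
from concurrent.futures import ThreadPoolExecutor

def play_segmentwise(segments, decompose, play):
    """Decompose the next segment in a worker thread while the current,
    already decomposed segment is being played, so playback can begin
    before the whole input track has been received or analyzed."""
    segments = iter(segments)
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(decompose, next(segments))  # first segment upfront
        for seg in segments:
            current = future.result()             # decomposed segment, ready
            future = pool.submit(decompose, seg)  # start on the next segment
            play(current)                         # plays while decomposing
        play(future.result())                     # last segment
```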
- The output of the first and second input units 26a, 26b, for example the segments of the first and second input tracks, forms the first and second input signals, which are input into first and second AI systems of a decomposition unit 40. Each AI system comprises a trained neural network. Decomposition unit 40 thus decomposes the first input signal to obtain a first group of decomposed signals and decomposes the second input signal to obtain a second group of decomposed signals. In the present example, each group of decomposed signals includes a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal, which together form a complete set of decomposed signals or a complete decomposition, meaning that the sum of all decomposed signals of the first group will resemble the first input signal, and the sum of all decomposed signals of the second group will resemble the second input signal.
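- The "complete decomposition" property can be checked numerically: the decomposed signals should sum back to the input signal within a small tolerance. A minimal sketch, assuming the stems are NumPy arrays of equal length:

```python
import numpy as np

def is_complete_decomposition(mix, stems, tol=1e-3):
    """True if the decomposed drum, vocal and harmonic signals in `stems`
    (a dict of arrays) sum back to the input signal `mix`."""
    recombined = sum(stems.values())             # element-wise sum of arrays
    return np.max(np.abs(recombined - mix)) < tol
```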
- It should be noted that although in the present embodiment two AI systems are used, decomposition unit 40 may also include only one AI system and only one neural network, trained and configured to determine all decomposed signals of the first input signal as well as all decomposed signals of the second input signal. As a further alternative, more than two AI systems may be used; for example, a separate AI system with a separate neural network may be used to generate each of the decomposed signals. - All decomposed signals, in particular both groups of decomposed signals, are then input into a
playback unit 42 in order to generate an output signal for playback. Playback unit 42 comprises a transition unit 44, which is basically adapted to recombine the decomposed signals of both groups, taking into account specific volume levels associated with each of the decomposed signals. Transition unit 44 is configured to recombine the decomposed signals in such a manner as to play either only a first output signal obtained from a sum of all decomposed signals of the first input signal, or only a second output signal obtained from a sum of all decomposed signals of the second input signal, or any transition in between the first and second output signals in which decomposed signals of both the first and second input signals are played. - In particular,
transition unit 44 may store individual transition functions DA, VA, HA, DB, VB, HB for each of the decomposed signals, each of which defines a specific volume level for each time frame within a transition interval, i.e. a time interval in which one of the songs A and B is crossfaded into the other (the first and second output signals are crossfaded in one or the other direction), or for each controller position of transition controller 24 within a controller range. Taking into account the respective volume levels according to the respective transition functions DA, VA, HA, DB, VB, HB, all decomposed signals are then recombined to obtain the output signal. Playback unit 42 may further include a control unit 45, which is adapted to control at least one of the transition functions DA, VA, HA, DB, VB, HB based on a user input.
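- A recombination of this kind might be sketched as follows, assuming mono stems and assuming each transition function is available as a vectorized Python callable from time to volume level; the dictionaries funcs_a and funcs_b stand in loosely for DA, VA, HA and DB, VB, HB.

```python
import numpy as np

def render_transition(stems_a, stems_b, funcs_a, funcs_b, sr, t_start):
    """Recombine all decomposed signals of songs A and B into one output
    signal, weighting every sample of each stem with the value of that
    stem's transition function at the corresponding time."""
    n = len(next(iter(stems_a.values())))
    t = t_start + np.arange(n) / sr          # timeline of the interval
    out = np.zeros(n)
    for name, stem in stems_a.items():
        out += funcs_a[name](t) * stem       # fading-out stems of song A
    for name, stem in stems_b.items():
        out += funcs_b[name](t) * stem       # fading-in stems of song B
    return out
```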
- The output signal generated by playback unit 42 may then be routed to an output audio interface 46 for a sound output. At any location within the signal flow, one or more sound effects may be inserted into the audio signal by means of one or more effect chains 48. In the present example, effect chain 48 is located between playback unit 42 and output audio interface 46.
- FIG. 3 illustrates an operation of transition unit 44 according to an embodiment of the present invention and a method for processing audio data according to an embodiment of the present invention. - Decomposed data as received from the first input track (first audio track) representing song A comprises, in the particular embodiment, a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in
FIG. 3). Decomposed data received from the second input track (second audio track) relating to song B comprises a decomposed drum signal, a decomposed vocal signal and a decomposed harmonic signal (denoted by drum, vocal and harmonic in FIG. 3). The decomposed signals are each shown by respective waveforms, wherein the horizontal axis represents the timeline of song A and the timeline of song B, respectively, and the vertical axis represents the time-dependent amplitude of the corresponding audio signal. - According to the present invention, the decomposed signals are analyzed to determine at least one mixing parameter. In the example shown in
FIG. 3, for example, the decomposed drum signal of song A is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song A, and the decomposed drum signal of song B is analyzed to determine, inter alia, a tempo value, a BPM value and a beat grid of song B. From the rhythmic pattern of the separated drum timbre of song A, the algorithm can then determine a rhythmic pattern of song A including a first beat at the beginning of song A at a time t0 and a sequence of beats following one another at substantially equal time intervals, wherein four beats form a bar, giving a beat grid of a four-four time type. In FIG. 3, the bars are denoted by vertical lines, wherein each bar includes four beats that are not illustrated. In a similar manner, transition unit 44 analyzes the decomposed drum signal of song B in order to determine beats, bars, a tempo, a BPM value, a beat grid etc. as mixing parameters of song B.
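- A rough BPM estimate from a decomposed drum signal can be obtained with a generic onset-autocorrelation heuristic such as the sketch below; this is an illustrative stand-in, not necessarily the algorithm of the embodiment.

```python
import numpy as np

def estimate_bpm(drums, sr, bpm_lo=60, bpm_hi=180, hop=512):
    """Estimate BPM from a decomposed drum signal: build an onset envelope
    from frame-to-frame energy increases, then pick the inter-beat lag
    with the strongest autocorrelation."""
    frames = drums[: len(drums) // hop * hop].reshape(-1, hop)
    energy = (frames ** 2).sum(axis=1)
    onset = np.maximum(np.diff(energy), 0.0)        # rising energy only
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    fps = sr / hop                                  # envelope frames per second
    lags = np.arange(int(fps * 60 / bpm_hi), int(fps * 60 / bpm_lo) + 1)
    best_lag = lags[np.argmax(ac[lags])]
    return 60.0 * fps / best_lag
```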
- Furthermore, according to this embodiment, a structure of song A and/or song B, i.e. a sequence of song parts such as intro, verse, bridge, chorus, interlude and outro, may be detected as a mixing parameter by analyzing the decomposed data. In the particular example shown in FIG. 3, the decomposed drum signal of song A shows a first pattern within the first four bars of the song, whereas in the following eight bars (bars 5 to 12) the drum timbre shows a second pattern different from the first pattern. In the following eight bars (bars 13 to 20), silence is detected in the decomposed drum signal, which means that the drums have a break for eight bars. Then, throughout the rest of song A, the decomposed drum data again show the first pattern. In a similar manner, analyzing the decomposed vocal signal reveals that the first four bars as well as the last four bars of song A do not contain vocals (the decomposed vocal signal is silent), whereas the rest of song A contains vocals. In addition, the decomposed harmonic signal is analyzed by a chord/harmony detection algorithm known as such from the prior art, so as to detect a chord progression of the harmonic components of song A. Since the decomposed harmonic signal contains neither the vocal components nor the drum components of the original audio track, the chord/harmony detection algorithm can be operated with much higher accuracy and reliability. Accordingly, a sequence of chords is detected, which usually changes for each bar. In the present example, it turns out that the chord progression shows a four-bar pattern, G major, D major, E minor, C major, which repeats three times within the first 12 bars. In the following eight bars (bars 13 to 20), the chord progression deviates from the aforementioned four-bar pattern and now shows a new four-bar pattern, D major, E minor, C major, C major, which is repeated once to obtain eight bars in total. After that, the first four-bar pattern played at the beginning of song A is repeated again until the end of song A.
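- The silence cues used above, such as the eight-bar drum break, can be located with a simple per-bar energy test once a beat grid is known; the following sketch assumes the BPM and a 4/4 meter have already been determined.

```python
import numpy as np

def silent_bars(stem, sr, bpm, beats_per_bar=4, thresh_db=-30.0):
    """Flag the bars of a decomposed signal that are effectively silent
    (e.g. an eight-bar drum break, or a vocal-free intro). Boundaries of
    such runs are candidates for song part junctions."""
    bar_len = int(sr * 60.0 / bpm * beats_per_bar)  # samples per bar
    flags = []
    for start in range(0, len(stem) - bar_len + 1, bar_len):
        bar = stem[start : start + bar_len]
        rms = np.sqrt(np.mean(bar ** 2)) + 1e-12
        flags.append(20.0 * np.log10(rms) < thresh_db)
    return flags                                    # one bool per bar
```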
- In this way, the method according to the embodiment of the invention can deduce, from analyzing the three decomposed signals of song A, particular song parts, namely a first song part that may be called “intro”, forming the first four bars of song A, a second song part which may be called “verse 1”, forming the following eight bars after the intro, a third song part which may be called “bridge”, forming the following eight bars after verse 1, a fourth song part which may be called “chorus 1”, forming the following eight bars after the bridge, a fifth song part which may be called “interlude”, forming the following four bars after chorus 1, a sixth song part which may be called “chorus 2”, forming the following eight bars after the interlude, and a seventh song part which may be called “outro”, forming the following four bars after chorus 2. The method thus recognizes different song parts and corresponding song part junctions, i.e. the junction between the last bar of a previous song part and the first bar of a following song part. - In the same or corresponding way, the method may determine a song structure of song B by analyzing the decomposed drum signal, the decomposed vocal signal and the decomposed harmonic signal of song B. Thus, by detecting different drum patterns within
chorus 1 and chorus 2, detecting silence of the decomposed vocal signal in an outro, detecting silence of the decomposed harmonic signal in an intro, and detecting different chord progression patterns within verse 1 and verse 2 on the one hand and chorus 1 and chorus 2 on the other hand, the method may determine that song B has a song structure comprising four bars of intro, eight bars of verse 1, eight bars of chorus 1, eight bars of verse 2, eight bars of chorus 2 and four bars of outro. These specifications defining the song parts of song B form mixing parameters according to the present invention. - The mixing parameters determined based on an analysis of the decomposed data of song A and song B as described above may be used by
device 10 and by a method according to the embodiment of the present invention for assisting a DJ in mixing songs A and B, or for achieving semi-automatic or even fully automatic mixing of songs A and B. In particular, the mixing parameters described above may simply be displayed on a screen of device 10 so as to inform a user of device 10, in particular to show the detected song parts and thereby assist mixing. A DJ may recognize certain song parts or song part junctions as suitable transition points at which a crossfade from song A to song B, or vice versa, can suitably be initiated, for example by pressing transition button 22 or operating transition controller 24 at a suitable point in time. In another example, device 10 and the method according to the embodiment of the invention may automatically generate an output track by automatically mixing songs A and B, for example by playing a transition from song A to song B at a suitable point in time as determined from the song structure. In particular, transition points may be determined as the mixing parameters based on the detected song parts. For example, a first transition point on the timeline of song A may be the end of the interlude of song A, whereas a second transition point on the timeline of song B may be the beginning of chorus 1 of song B. Device 10 may then generate an output track that plays song A from its beginning to shortly before the end of the interlude, then plays a crossfade to song B starting at the beginning of its chorus 1, and then plays the rest of song B from the beginning of chorus 1 until the outro of song B. Other examples of suitable transition points would be the end of chorus 2 of song A on the one hand, and the beginning of verse 1 of song B (or the beginning of chorus 1 of song B) on the other hand. In the latter example, song B could be played almost from the beginning after song A has reached almost its end. This could be used as an automatic crossfade between subsequent songs of a playlist, for example. - It should be noted that the mixing results are improved if songs A and B have similar keys and/or similar BPM values. Conventional methods known as such from DJ equipment, including DJ software, may be used to pitch-shift, time-stretch or time-compress one or both of songs A and B so as to ensure that songs A and B have matching keys and/or BPM values.
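- As a toy illustration of the transition-point selection described two paragraphs above, assuming detected song structures are available as lists of (part_name, start_bar, end_bar) tuples (a hypothetical data layout), the heuristic from that example might read:

```python
def pick_transition_points(parts_a, parts_b):
    """Derive a first transition point (in song A) and a second transition
    point (in song B) from detected song structures: leave song A at the
    end of its interlude, enter song B at the start of its first chorus."""
    first_point = next(end for name, _, end in parts_a if name == "interlude")
    second_point = next(start for name, start, _ in parts_b if name == "chorus 1")
    return first_point, second_point
```

- For the example above this returns the last bar of song A's interlude and the first bar of song B's chorus 1, which device 10 could then convert into sample positions on the two timelines.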
- Further aspects of the present invention are described by the following items:
- 1. Method for processing audio signals, comprising the steps of
-
- providing a first input signal of a first input audio track and a second input signal of a second input audio track,
- decomposing the first input signal to obtain a plurality of decomposed signals, comprising at least a first decomposed signal and a second decomposed signal different from the first decomposed signal,
- assigning a first volume level to the first decomposed signal and a second volume level to the second decomposed signal,
- starting playback of a first output signal obtained from recombining at least the first decomposed signal at the first volume level with the second decomposed signal at the second volume level, such that the first output signal substantially equals the first input signal,
- while playing the first output signal, reducing the first volume level according to a first transition function and reducing the second volume level according to a second transition function different from said first transition function,
- starting playback of a second output signal obtained from the second input signal after starting playback of the first output signal but before volume levels of all decomposed signals of the first input signal have reached substantially zero.
- 2. Method of
item 1, further comprising the steps of -
- decomposing the second input signal to obtain a plurality of decomposed signals comprising at least a third decomposed signal and a fourth decomposed signal different from the third decomposed signal,
- assigning a third volume level to the third decomposed signal and a fourth volume level to the fourth decomposed signal,
- starting playback of the second output signal obtained from recombining at least the third decomposed signal and the fourth decomposed signal,
- while playing the second output signal, increasing the third volume level according to a third transition function and increasing the fourth volume level according to a fourth transition function different from said third transition function, until the second output signal substantially equals the second input signal.
- 3. Method of
item 1 or item 2, wherein each of the transition functions assigns a predetermined volume level or a predetermined change in volume level - to each of a plurality of time frames within a transition time interval defined between a transition start time (T1) and a transition end time (T3), and/or
- to each of a plurality of controller positions within a controller range of a user operated controller defined between a controller first end position and a controller second end position.
- 4. Method of
item 3, -
- wherein the first transition function and the second transition function are defined such that the volume level is at a maximum at the transition start time (T1) and/or at the controller first end position, and at a minimum, in particular corresponding to substantially silence at the transition end time (T3) and/or at the controller second end position, and/or
- wherein the third transition function and the fourth transition function are defined such that the volume level is at a minimum, in particular corresponding to substantially silence at the transition start time (T1) and/or at the controller first end position, and at a maximum at the transition end time (T3) and/or at the controller second end position.
- 5. Method of at least one of the preceding items, wherein at least one of the transition functions is a linear function or contains a linear portion.
- 6. Method of at least one of the preceding items, wherein at least one of the transition functions is a continuous function and/or a monotonic function.
- 7. Method of at least one of the preceding items, wherein the first transition function and the second transition function differ from each other with regard to slope and/or wherein the third transition function and the fourth transition function differ from each other with regard to slope.
- 8. Method of at least one of the preceding items, wherein the step of decomposing includes processing the first audio signal and/or the second audio signal within an AI system comprising a trained neural network.
- 9. Method of at least one of the preceding items, wherein the step of decomposing includes decomposing the first audio signal and/or the second audio signal with regard to predetermined timbres, such as to obtain decomposed signals of different timbres, said timbres preferably being selected from the group consisting of:
-
- a vocal timbre,
- a non-vocal timbre,
- a drum timbre,
- a non-drum timbre,
- a harmonic timbre,
- a non-harmonic timbre,
- any combination thereof.
- 10. Method of item 9, wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and/or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.
- 11. Method of item 9 or
item 10, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-drum timbre, and/or wherein at least at a transition reference time and/or at a controller reference position a sum of the first transition function and the third transition function is larger than a sum of the second transition function and the fourth transition function. - 12. Method of item 4 and at least one of items 9 to 11, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, and/or wherein a sum of the first transition function and the third transition function is substantially constant, preferably a maximum volume level, throughout the entire transition time interval and/or the entire controller range.
- 13. Method of item 4 and at least one of items 9 to 12, wherein the first decomposed signal and the third decomposed signal are different signals of a non-drum timbre, a vocal timbre or a harmonic timbre, and/or wherein a sum of the first transition function and the third transition function has a minimum, preferably substantially zero volume level, between the transition start time (T1) and the transition end time (T3) and/or between the controller first end position and the controller second end position.
- 14. Method of at least one of the preceding items, further including a step of analyzing an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction.
- Referring to item 14, song parts of a song are usually distinguishable by an analyzing algorithm since they differ in several characteristics such as instrumental density, medium pitch or rhythmic pattern. Song parts may in particular be a verse, a chorus, a bridge, an intro or an outro as conventionally known. Certain instrumental or rhythmic patterns will remain constant within a song part and will change in the next song part. Recognition of song parts may be supported by analyzing not only the entire input signal but instead or in addition thereto at least one of the decomposed signals, as described in item 14. For example, by analyzing a decomposed bass signal in isolation from the remaining sound components, it will be easy to derive therefrom a chord progression of the song which is one of the key criteria to differentiate song parts. Furthermore, an analysis of the decomposed drum signals allows a more accurate recognition of a rhythmic pattern and thus a more accurate detection of certain song parts. A song part junction then refers to a junction between one song part and the next song part.
- According to item 14, transition time intervals may include song part junctions, which allows the transition between two songs to be carried out at the end of a song part, further improving the smoothness and likeability of the transition.
- Song parts may be detected by analyzing at least one of the decomposed signals within an AI system comprising a trained neural network. Preferably, such analyzing includes detecting silence within the decomposed signal, said silence preferably representing an audio signal having a volume level smaller than −30 dB. In particular, the step of analyzing decomposed signals may include detecting silence extending continuously over a predetermined time span within the decomposed signal. Thus, in embodiments of the invention, start and/or end points of such silence may be taken as song part junctions.
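- A windowed-RMS silence detector along these lines might look as follows; the window size and minimum duration are illustrative choices around the −30 dB threshold mentioned above.

```python
import numpy as np

def find_silence(signal, sr, thresh_db=-30.0, min_sec=1.0, win=1024):
    """Return (start, end) times in seconds of spans in which the signal
    stays below thresh_db for at least min_sec, computed over windowed
    RMS levels."""
    n_win = len(signal) // win
    rms = np.sqrt(np.mean(signal[: n_win * win].reshape(-1, win) ** 2, axis=1))
    quiet = 20.0 * np.log10(rms + 1e-12) < thresh_db
    spans, start = [], None
    for i, q in enumerate(np.append(quiet, False)):  # False flushes a last run
        if q and start is None:
            start = i
        elif not q and start is not None:
            if (i - start) * win / sr >= min_sec:
                spans.append((start * win / sr, i * win / sr))
            start = None
    return spans
```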
- 15. Method of at least one of the preceding items, further including the steps of
-
- receiving a user input referring to a transition command, including at least one transition parameter,
- setting at least one of the transition functions according to the transition parameter,
wherein the transition parameter is preferably selected from the group consisting of: - a transition start time (T1) of a transition time interval of at least one of the transition functions,
- a transition end time (T3) of a transition time interval of at least one of the transition functions,
- a length (T3-T1) of a transition time interval of at least one of the transition functions,
- a transition reference time (T2) within the transition time interval of at least one of the transition functions,
- a slope, shape or offset of at least one of the transition functions,
- an assignment or deassignment of a preset transition function to or from a selected one of the plurality of decomposed signals.
- 16. Method of at least one of the preceding items, further comprising the steps of
-
- determining at least one tempo parameter of the first and/or second input track, in particular a BPM (beats per minute) and/or a beat grid and/or a beat phase of the first and/or second input track and
- a tempo matching processing based on the determined tempo parameter, including a time stretching and/or time shifting and/or resampling of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching BPM and/or mutually matching beat phases.
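- Referring to item 16, the simplest tempo-matching operation is plain resampling, sketched below for a mono track; note that resampling changes pitch together with tempo, whereas true time stretching (tempo change at constant pitch) requires more elaborate techniques.

```python
import numpy as np

def tempo_match_by_resampling(audio, bpm_src, bpm_dst):
    """Resample a track so that it plays back at the target BPM. Plain
    resampling shifts pitch along with tempo; keeping the pitch constant
    would require true time stretching instead."""
    n_out = int(len(audio) * bpm_src / bpm_dst)   # fewer samples -> faster
    idx = np.linspace(0.0, len(audio) - 1.0, n_out)
    return np.interp(idx, np.arange(len(audio)), audio)
```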
- 17. Method of at least one of the preceding items, further comprising the steps of
-
- determining a key of the first and/or second input track and
- a key matching processing based on the determined key, including a pitch shifting of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching keys.
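- Referring to item 17, once the keys of both input tracks are known, the required pitch shift can be computed as a semitone distance; the sketch below ignores major/minor modes.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def semitone_shift(key_src, key_dst):
    """Smallest pitch shift in semitones that moves key_src onto key_dst,
    e.g. semitone_shift("G", "A") == 2. A shift of s semitones corresponds
    to a frequency ratio of 2 ** (s / 12)."""
    diff = (NOTES.index(key_dst) - NOTES.index(key_src)) % 12
    return diff - 12 if diff > 6 else diff   # prefer the shorter direction
```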
- 18. Method of at least one of the preceding items, wherein the first input audio track and/or the second input audio track are received as a continuous stream, for example a data stream received via the Internet, or a real-time audio stream received from a live audio source or from a playback device in playback mode, and wherein playback of the first output signal and/or the second output signal is started while continuing to receive the continuous stream.
- 19. Method of at least one of the preceding items, wherein decomposing first and/or second input signal is carried out segment-wise, wherein decomposing is carried out based on a first segment of the input signal such as to obtain a first segment of the decomposed signal, and wherein decomposing of a second segment of the input signal is carried out while playing the first segment of the decomposed signal.
- 20. Method of at least one of the preceding items, wherein the method steps, in particular the steps of providing the first and second input signals, decomposing the first input signal, starting playback of the first output signal and starting playback of the second output signal, are carried out in a continuous process, wherein a time shift between receiving the first input audio track or a first portion of a continuous stream of the first input audio track and starting playback of the first output signal is preferably less than 10 seconds, more preferably less than 2 seconds, and/or wherein a time shift between receiving the second input audio track or a first portion of a continuous stream of the second input audio track and starting playback of the second output signal is preferably less than 10 seconds, more preferably less than 2 seconds.
- 21. Method of at least one of the preceding items, wherein at least one, preferably all of the first and second input signals, the decomposed signals and the first and second output signals represent stereo signals, each comprising a left-channel signal portion and a right-channel signal portion, respectively.
- 22. Device for processing audio signals, comprising:
-
- a first input unit providing a first input signal of a first input audio track and a second input unit providing a second input signal of a second input audio track,
- a decomposition unit configured to decompose the first input audio signal to obtain a plurality of decomposed signals, comprising at least a first decomposed signal and a second decomposed signal different from the first decomposed signal,
- a playback unit configured to start playback of a first output signal obtained from recombining at least the first decomposed signal at a first volume level with the second decomposed signal at a second volume level, such that the first output signal substantially equals the first input signal,
- a transition unit for performing a transition between playback of the first output signal and playback of a second output signal obtained from the second input signal, wherein the transition unit has a volume control section adapted for reducing the first volume level according to a first transition function and reducing the second volume level according to a second transition function different from said first transition function.
- 23. Device of
item 22, wherein the decomposition unit is configured to decompose the second input signal to obtain a plurality of decomposed signals comprising at least a third decomposed signal and a fourth decomposed signal different from the third decomposed signal, -
- wherein the second output signal is obtained from recombining at least the third decomposed signal at a third volume level and the fourth decomposed signal at a fourth volume level,
- wherein the volume control section is adapted for increasing the third volume level according to a third transition function and increasing the fourth volume level according to a fourth transition function different from said third transition function, until the second output signal substantially equals the second input signal.
- 24. Device of
item 22 or item 23, wherein each of the transition functions assigns a predetermined volume level or a predetermined change in volume level -
- to each of a plurality of time frames within a transition time interval defined between a transition start time (T1) and a transition end time (T3), and/or
- to each of a plurality of controller positions within a controller range of a user operated controller defined between a controller first end position and a controller second end position.
- 25. Device of
item 24, -
- wherein the first transition function and the second transition function are defined such that the volume level is at a maximum at the transition start time (T1) and/or at the controller first end position, and at a minimum, in particular corresponding to substantially silence at the transition end time (T3) and/or at the controller second end position, and/or
- wherein the third transition function and the fourth transition function are defined such that the volume level is at a minimum, in particular corresponding to substantially silence at the transition start time (T1) and/or at the controller first end position, and at a maximum at the transition end time (T3) and/or at the controller second end position.
- 26. Device of at least one of
items 22 to 25, wherein at least one of the transition functions is a linear function or contains a linear portion. - 27. Device of at least one of
items 22 to 26, wherein at least one of the transition functions is a continuous function and/or a monotonic function. - 28. Device of at least one of
items 22 to 27, wherein the first transition function and the second transition function differ from each other with regard to slope and/or wherein the third transition function and the fourth transition function differ from each other with regard to slope. - 29. Device of at least one of
items 22 to 28, wherein the decomposition unit includes an AI system comprising a trained neural network. - 30. Device of at least one of
items 22 to 29, wherein the decomposition unit is configured to decompose the first audio signal and/or the second audio signal with regard to predetermined timbres, such as to obtain decomposed signals of different timbres, said timbres preferably being selected from the group consisting of: -
- a vocal timbre,
- a non-vocal timbre,
- a drum timbre,
- a non-drum timbre,
- a harmonic timbre,
- a non-harmonic timbre,
- any combination thereof.
- 31. Device of item 30, wherein the first decomposed signal and the third decomposed signal are different signals of a vocal timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-vocal timbre, and/or wherein at least at a transition reference time and/or a controller reference position a sum of the first transition function and the third transition function is smaller than a sum of the second transition function and the fourth transition function.
- 32. Device of item 30 or item 31, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, wherein the second decomposed signal and the fourth decomposed signal are different signals of a non-drum timbre, and/or wherein at least at a transition reference time and/or at a controller reference position a sum of the first transition function and the third transition function is larger than a sum of the second transition function and the fourth transition function.
- 33. Device of item 25 and at least one of items 30 to 32, wherein the first decomposed signal and the third decomposed signal are different signals of a drum timbre, and/or wherein a sum of the first transition function and the third transition function is substantially constant, preferably a maximum volume level, throughout the entire transition time interval and/or the entire controller range.
- 34. Device of item 25 and at least one of items 30 to 33, wherein the first decomposed signal and the third decomposed signal are different signals of a non-drum timbre, a vocal timbre or a harmonic timbre, and/or wherein a sum of the first transition function and the third transition function has a minimum, preferably substantially zero volume level, between the transition start time (T1) and the transition end time (T3) and/or between the controller first end position and the controller second end position.
- 35. Device of at least one of
items 22 to 34, further including an analyzing unit configured to analyze an audio signal, preferably at least one of the decomposed signals, to determine a song part junction between two song parts within the first input audio track or within the second input audio track, wherein a transition time interval of at least one of the transition functions is set such as to include the song part junction. - 36. Device of at least one of
items 22 to 35, further including a user interface configured to accept a user input referring to a transition command, including at least one transition parameter, wherein the transition unit is configured to set at least one of the transition functions according to the transition parameter, wherein the transition parameter is preferably selected from the group consisting of: -
- a transition start time (T1) of a transition time interval of at least one of the transition functions,
- a transition end time (T3) of a transition time interval of at least one of the transition functions,
- a length of a transition time interval of at least one of the transition functions,
- a transition reference time (T2) within the transition time interval of at least one of the transition functions,
- a slope, shape or offset of at least one of the transition functions,
- an assignment or deassignment of a preset transition function to or from a selected one of the plurality of decomposed signals.
- 37. Device of item 36, wherein the device includes a display unit configured to display a graphical representation of the first input audio track and/or the second input audio track, wherein the user interface is configured to receive at least one transition parameter through a selection or marker applied by the user in relation to the graphical representation of the first input audio track and/or the second input audio track.
- 38. Device of item 36 or item 37, wherein the device includes a display unit configured to display a graphical representation of at least one of the decomposed signals, wherein the user interface is configured to allow a user to assign or deassign a preset transition function to or from a selected one of the plurality of decomposed signals.
- 39. Device of at least one of
items 22 to 38, further comprising a tempo matching unit configured to determine a tempo of the first and/or second input track, and to carry out a tempo matching processing based on the determined tempo, including a time stretching or resampling of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching tempos. - 40. Device of at least one of
items 22 to 39, further comprising a key matching unit configured to determine a key of the first and/or second input track, and to carry out a key matching processing based on the determined key, including a pitch shifting of audio data obtained from the first input track and/or the second input track, such that the first output signal and the second output signal have mutually matching keys. - It should be noted that methods and devices as described above as first to fifth aspects of the invention and in the claims may be understood as embodiments of methods and devices as described above in
items 1 to 40. In particular, a transition point as mentioned in the first to fifth aspects of the invention and in the claims may correspond to any of the transition start time, the transition end time and the transition reference time as described in the above items.
Claims (28)
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2020/056124 WO2021175455A1 (en) | 2020-03-06 | 2020-03-06 | Method and device for decomposing and recombining of audio data and/or visualizing audio data |
PCT/EP2020/057330 WO2021175456A1 (en) | 2020-03-06 | 2020-03-17 | Method and device for decomposing, recombining and playing audio data |
PCT/EP2020/062151 WO2021175457A1 (en) | 2020-03-06 | 2020-04-30 | Live decomposition of mixed audio data |
PCT/EP2020/065995 WO2021175458A1 (en) | 2020-03-06 | 2020-06-09 | Playback transition from first to second audio track with transition functions of decomposed signals |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2020/065995 Continuation WO2021175458A1 (en) | 2020-03-06 | 2020-06-09 | Playback transition from first to second audio track with transition functions of decomposed signals |
Publications (1)
Publication Number | Publication Date |
---|---|
US20210326102A1 true US20210326102A1 (en) | 2021-10-21 |
Family
ID=69846409
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/905,555 Pending US20230089356A1 (en) | Ai-based dj system and method for decomposing, mixing and playing of audio data |
US16/892,063 Active US11216244B2 (en) | 2020-03-06 | 2020-06-03 | Method and device for processing, playing and/or visualizing audio data, preferably based on AI, in particular decomposing and recombining of audio data in real-time |
US17/343,386 Pending US20210326102A1 (en) | 2020-03-06 | 2021-06-09 | Method and device for determining mixing parameters based on decomposed audio data |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/905,555 Pending US20230089356A1 (en) | Ai-based dj system and method for decomposing, mixing and playing of audio data |
US16/892,063 Active US11216244B2 (en) | 2020-03-06 | 2020-06-03 | Method and device for processing, playing and/or visualizing audio data, preferably based on AI, in particular decomposing and recombining of audio data in real-time |
Country Status (7)
Country | Link |
---|---|
US (3) | US20230089356A1 (en) |
EP (2) | EP4005243B1 (en) |
CA (1) | CA3170462A1 (en) |
DE (1) | DE202020005830U1 (en) |
ES (1) | ES2960983T3 (en) |
MX (1) | MX2022011059A (en) |
WO (5) | WO2021175455A1 (en) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018129388A1 (en) * | 2017-01-09 | 2018-07-12 | Inmusic Brands, Inc. | Systems and methods for generating a visual color display of audio-file data |
CN110688082B (en) * | 2019-10-10 | 2021-08-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device, equipment and storage medium for determining adjustment proportion information of volume |
US11475867B2 (en) * | 2019-12-27 | 2022-10-18 | Spotify Ab | Method, system, and computer-readable medium for creating song mashups |
EP4115630A1 (en) * | 2020-03-06 | 2023-01-11 | algoriddim GmbH | Method, device and software for controlling timing of audio data |
EP4115628A1 (en) | 2020-03-06 | 2023-01-11 | algoriddim GmbH | Playback transition from first to second audio track with transition functions of decomposed signals |
US12242532B2 (en) * | 2020-03-31 | 2025-03-04 | Aries Adaptive Media, LLC | Processes and systems for mixing audio tracks according to a template |
CA3214519A1 (en) | 2021-04-20 | 2022-10-27 | Jesse Dorogusker | Live playback streams |
CN114302309B (en) * | 2021-12-16 | 2024-06-25 | 合肥联宝信息技术有限公司 | Method and device for detecting audio collector |
CN114299976A (en) * | 2022-03-06 | 2022-04-08 | 荣耀终端有限公司 | Audio data processing method and electronic equipment |
US12236926B2 (en) * | 2022-04-26 | 2025-02-25 | Algoriddim Gmbh | System for selection and playback of song versions from vinyl type control interfaces |
US20230360618A1 (en) * | 2022-05-05 | 2023-11-09 | Lemon Inc. | Automatic and interactive mashup system |
WO2023217352A1 (en) | 2022-05-09 | 2023-11-16 | Algoriddim Gmbh | Reactive dj system for the playback and manipulation of music based on energy levels and musical features |
JP2024048970A (en) * | 2022-09-28 | 2024-04-09 | パナソニックオートモーティブシステムズ株式会社 | Signal processing device, signal processing method and program |
CN116185167A (en) * | 2022-10-20 | 2023-05-30 | 瑞声开泰声学科技(上海)有限公司 | Haptic feedback method, system and related equipment for music track-dividing matching vibration |
US20240233694A9 (en) * | 2022-10-20 | 2024-07-11 | Tuttii Inc. | System and method for enhanced audio data transmission and digital audio mashup automation |
EP4375984A1 (en) | 2022-11-22 | 2024-05-29 | algoriddim GmbH | Method and system for accelerated decomposing of audio data using intermediate data |
US12165622B2 (en) | 2023-02-03 | 2024-12-10 | Applied Insights, Llc | Audio infusion system and method |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6184898B1 (en) | 1998-03-26 | 2001-02-06 | Comparisonics Corporation | Waveform display utilizing frequency-based coloring and navigation |
US8311656B2 (en) * | 2006-07-13 | 2012-11-13 | Inmusic Brands, Inc. | Music and audio playback system |
US7525037B2 (en) * | 2007-06-25 | 2009-04-28 | Sony Ericsson Mobile Communications Ab | System and method for automatically beat mixing a plurality of songs using an electronic equipment |
US20120109348A1 (en) * | 2009-05-25 | 2012-05-03 | Pioneer Corporation | Cross fader unit, mixer and program |
US8910046B2 (en) * | 2010-07-15 | 2014-12-09 | Apple Inc. | Media-editing application with anchored timeline |
US20130094131A1 (en) * | 2011-10-18 | 2013-04-18 | Inmusic Brands, Inc. | Case for a tablet computer |
US20130290818A1 (en) * | 2012-04-27 | 2013-10-31 | Nokia Corporation | Method and apparatus for switching between presentations of two media items |
US8812144B2 (en) * | 2012-08-17 | 2014-08-19 | Be Labs, Llc | Music generator |
US9398390B2 (en) * | 2013-03-13 | 2016-07-19 | Beatport, LLC | DJ stem systems and methods |
EP2808870B1 (en) * | 2013-05-30 | 2016-03-16 | Spotify AB | Crowd-sourcing of remix rules for streamed music. |
US20150268924A1 (en) * | 2014-03-19 | 2015-09-24 | Hipolito Torrales, JR. | Method and system for selecting tracks on a digital file |
US10014002B2 (en) * | 2016-02-16 | 2018-07-03 | Red Pill VR, Inc. | Real-time audio source separation using deep neural networks |
US10002596B2 (en) * | 2016-06-30 | 2018-06-19 | Nokia Technologies Oy | Intelligent crossfade with separated instrument tracks |
US11024276B1 (en) * | 2017-09-27 | 2021-06-01 | Diana Dabby | Method of creating musical compositions and other symbolic sequences by artificial intelligence |
WO2019229199A1 (en) * | 2018-06-01 | 2019-12-05 | Sony Corporation | Adaptive remixing of audio content |
US10991385B2 (en) * | 2018-08-06 | 2021-04-27 | Spotify Ab | Singing voice separation with deep U-Net convolutional networks |
WO2021090495A1 (en) * | 2019-11-08 | 2021-05-14 | AlphaTheta株式会社 | Acoustic device, display control method, and display control program |
-
2020
- 2020-03-06 EP EP20712463.7A patent/EP4005243B1/en active Active
- 2020-03-06 MX MX2022011059A patent/MX2022011059A/en unknown
- 2020-03-06 CA CA3170462A patent/CA3170462A1/en active Pending
- 2020-03-06 WO PCT/EP2020/056124 patent/WO2021175455A1/en unknown
- 2020-03-06 US US17/905,555 patent/US20230089356A1/en active Pending
- 2020-03-06 ES ES20712463T patent/ES2960983T3/en active Active
- 2020-03-06 EP EP23192603.1A patent/EP4311268A3/en active Pending
- 2020-03-17 DE DE202020005830.0U patent/DE202020005830U1/en active Active
- 2020-03-17 WO PCT/EP2020/057330 patent/WO2021175456A1/en unknown
- 2020-04-30 WO PCT/EP2020/062151 patent/WO2021175457A1/en active Application Filing
- 2020-06-03 US US16/892,063 patent/US11216244B2/en active Active
- 2020-06-09 WO PCT/EP2020/065995 patent/WO2021175458A1/en unknown
- 2020-11-09 WO PCT/EP2020/081540 patent/WO2021175464A1/en unknown
-
2021
- 2021-06-09 US US17/343,386 patent/US20210326102A1/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170301372A1 (en) * | 2016-03-25 | 2017-10-19 | Spotify Ab | Transitions between media content items |
US20180308460A1 (en) * | 2017-04-21 | 2018-10-25 | Yamaha Corporation | Musical performance support device and program |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11232773B2 (en) * | 2019-05-07 | 2022-01-25 | Bellevue Investments Gmbh & Co. Kgaa | Method and system for AI controlled loop based song construction |
US20220284875A1 (en) * | 2020-03-06 | 2022-09-08 | Algoriddim Gmbh | Method, device and software for applying an audio effect |
US11462197B2 (en) * | 2020-03-06 | 2022-10-04 | Algoriddim Gmbh | Method, device and software for applying an audio effect |
US20230335091A1 (en) * | 2020-03-06 | 2023-10-19 | Algoriddim Gmbh | Method and device for decomposing, recombining and playing audio data |
US11604622B1 (en) * | 2020-06-01 | 2023-03-14 | Meta Platforms, Inc. | Selecting audio clips for inclusion in content items |
US20240249706A1 (en) * | 2021-05-27 | 2024-07-25 | Alphatheta Corporation | Sound device, program, and control method |
US20230260531A1 (en) * | 2022-02-16 | 2023-08-17 | Sony Group Corporation | Intelligent audio procesing |
US11740862B1 (en) * | 2022-11-22 | 2023-08-29 | Algoriddim Gmbh | Method and system for accelerated decomposing of audio data using intermediate data |
WO2025033121A1 (en) * | 2023-08-07 | 2025-02-13 | ヤマハ株式会社 | Signal generation method, display control method, and program |
Also Published As
Publication number | Publication date |
---|---|
ES2960983T3 (en) | 2024-03-07 |
US11216244B2 (en) | 2022-01-04 |
EP4311268A3 (en) | 2024-04-10 |
WO2021175458A1 (en) | 2021-09-10 |
EP4311268A2 (en) | 2024-01-24 |
WO2021175455A1 (en) | 2021-09-10 |
WO2021175456A1 (en) | 2021-09-10 |
EP4005243A1 (en) | 2022-06-01 |
US20230089356A1 (en) | 2023-03-23 |
DE202020005830U1 (en) | 2022-09-26 |
MX2022011059A (en) | 2022-09-19 |
EP4005243B1 (en) | 2023-08-23 |
US20210279030A1 (en) | 2021-09-09 |
WO2021175464A1 (en) | 2021-09-10 |
CA3170462A1 (en) | 2021-09-10 |
WO2021175457A1 (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210326102A1 (en) | Method and device for determining mixing parameters based on decomposed audio data | |
US11347475B2 (en) | Transition functions of decomposed signals | |
AU2022218554B2 (en) | Method and device for decomposing, recombining and playing audio data | |
US20230120140A1 (en) | Ai based remixing of music: timbre transformation and matching of mixed audio data | |
US11462197B2 (en) | Method, device and software for applying an audio effect | |
US8710343B2 (en) | Music composition automation including song structure | |
JP6056437B2 (en) | Sound data processing apparatus and program | |
JP6926354B1 (en) | AI-based DJ systems and methods for audio data decomposition, mixing, and playback | |
JP5879996B2 (en) | Sound signal generating apparatus and program | |
JP7136979B2 (en) | Methods, apparatus and software for applying audio effects | |
WO2021175461A1 (en) | Method, device and software for applying an audio effect to an audio signal separated from a mixed audio signal | |
US20230343314A1 (en) | System for selection and playback of song versions from vinyl type control interfaces | |
WO2023217352A1 (en) | Reactive dj system for the playback and manipulation of music based on energy levels and musical features | |
NZ791507A (en) | Method and device for decomposing, recombining and playing audio data | |
NZ791398B2 (en) | Method and device for decomposing, recombining and playing audio data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: ALGORIDDIM GMBH, GERMANY; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MORSY, KARIEM;TESSMANN, FEDERICO;TESCHNER, CHRISTOPH;REEL/FRAME:057201/0597; Effective date: 20210720 |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |