
EP2891036A1 - Browsing history language model for input method editor - Google Patents

Browsing history language model for input method editor

Info

Publication number
EP2891036A1
EP2891036A1 EP12883601.2A EP12883601A EP2891036A1 EP 2891036 A1 EP2891036 A1 EP 2891036A1 EP 12883601 A EP12883601 A EP 12883601A EP 2891036 A1 EP2891036 A1 EP 2891036A1
Authority
EP
European Patent Office
Prior art keywords
character string
language model
latin character
browsing history
recited
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP12883601.2A
Other languages
German (de)
French (fr)
Other versions
EP2891036A4 (en)
Inventor
Mu Li
Xi Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Publication of EP2891036A1
Publication of EP2891036A4
Withdrawn (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G06F40/129 Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/274 Converting codes to words; Guess-ahead of partial word inputs

Definitions

  • This disclosure relates to the technical field of computer input.
  • An input method editor is a computer functionality that assists a user to input text into a host application of a computing device.
  • An IME may provide several suggested words and phrases based on received inputs from the user as candidates for insertion into the host application. For example, the user may input one or more initial characters of a word or phrase and an IME, based on the initial characters, may provide one or more suggested words or phrases for the user to select a desired one.
  • an IME may also assist the user to input non-Latin characters such as Chinese.
  • the user may input Latin characters through a keyboard.
  • the IME returns one or more Chinese characters as candidates for insertion. The user may then select the proper character and insert it.
  • the IME is useful for the user to input non-Latin characters using a Latin-character keyboard.
  • Some implementations provide techniques and arrangements for predicting a non-Latin character string based at least in part on a browsing history language model.
  • the browsing history language model may be generated based on browsing history information.
  • the browsing history information may include at least cached browsing content and may also include real-time browsing content.
  • the predicted non-Latin character string may be provided in response to receiving a Latin character string via an input method editor interface.
  • some examples may predict a Chinese character string based at least in part on the browsing history language model in response to receiving a Pinyin character string.
  • FIG. 1 illustrates an example system according to some implementations.
  • FIG. 2 illustrates an example input method editor interface according to some implementations.
  • FIG. 3 illustrates an example input method editor interface according to some implementations.
  • FIG. 4 illustrates an example process flow according to some implementations.
  • FIG. 5 illustrates an example process flow according to some implementations.
  • FIG. 6 illustrates an example system in which some implementations may operate.
  • Some examples include techniques and arrangements for implementing a browsing history language model with an input method editor (IME). For instance, it may be difficult for a user to input characters into a computer for a language that is based on non-Latin characters (e.g., the Chinese language). For example, there are thousands of Chinese characters, and a typical Western keyboard is limited to 26 letters.
  • the present disclosure relates to an IME that predicts a non-Latin character string in response to receiving a Latin character string from a user. The predicted non-Latin character string is based at least in part on a browsing history language model.
  • the IME may be used to translate Pinyin text (i.e., Chinese characters represented phonetically by Latin characters) into Chinese characters. It will be appreciated that the present disclosure is not limited to Chinese characters. For example, other illustrative non-Latin characters may include Japanese characters or Korean characters, among other alternatives.
  • a statistical language model may be used to compute a conversion probability of each possible conversion and may select the one with the highest probability for presentation to a user.
  • SLM statistical language model
  • a particular type of SLM referred to as an N-gram SLM, may decompose the probability of a string of consecutive words into the products of the conditional probabilities between two, three, or more consecutive words in the string.
  • An IME may be released with a language model for generic usage (i.e., a "general" language model), which is trained for most common typing scenarios.
  • a general language model may be inadequate for a particular user (e.g., a user with a particular browsing history). That is, different users may have different preferences, and an IME that utilizes a general language model may suggest a word or phrase that may be inappropriate for a particular user.
  • an IME that utilizes a general language model may suggest a first word or phrase (i.e., a first set of non-Latin characters).
  • the first word or phrase may have the same pronunciation as a second word or phrase (i.e., a second set of non-Latin characters).
  • the first word or phrase may be appropriate for a standard user but may be less appropriate for another user. Instead, the second word or phrase may be more appropriate for such a user.
  • Web browsing history is an important source of information about a user. For example, a user may browse content related to recent news events or may browse special topics that the user may be interested in. For example, a computer programmer may browse one or more portal sites for various news items and may also browse one or more software development sites. As such, the browsing history of the user may contain the latest general hot topics and texts related to programming skills, among other information.
  • the present disclosure describes an IME that utilizes a browsing history language model to predict a non-Latin character string that may be more appropriate for a user with a particular browsing history than a non-Latin character string that is predicted based on the general language model.
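
The N-gram decomposition described above can be illustrated with a minimal bigram sketch. This is not the patent's implementation: the function names, the `<s>` start marker, and the add-alpha smoothing are assumptions introduced only for illustration, and the input is assumed to be already segmented into words.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over pre-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens)  # "<s>" marks the sentence start (assumed convention)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sequence_probability(tokens, unigrams, bigrams, alpha=1.0):
    """Approximate P(w1..wn) as a product of smoothed P(w_i | w_{i-1}) terms."""
    vocab_size = len(unigrams)
    prob = 1.0
    prev = "<s>"
    for word in tokens:
        # Add-alpha smoothing (an assumption) keeps unseen bigrams from zeroing the product.
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        prev = word
    return prob


# Toy corpus: the "10 P.M." reading occurs more often than the "have a try" reading.
corpus = [["晚上", "十时"], ["晚上", "十时"], ["晚上", "试试"]]
unigrams, bigrams = train_bigram(corpus)
p_ten_pm = sequence_probability(["晚上", "十时"], unigrams, bigrams)
p_try = sequence_probability(["晚上", "试试"], unigrams, bigrams)
```

With this toy corpus the string seen more often receives the higher probability, which is the property an IME would rely on when ranking homophone candidates.
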
  • FIG. 1 illustrates an example framework of a system 100 according to some implementations.
  • the system 100 includes an input method editor (IME) application 102 that is communicatively coupled to a browsing history language model 104 and a general language model 106.
  • the system 100 further includes an adaptive language model builder 108 that is adapted to receive browsing history information 110.
  • the browsing history information 110 may include at least cached browsing content 112 stored at a browser cache 114.
  • An IME interface 116 may be provided to a user 118 via a computing device 120. While the computing device 120 is shown in FIG. 1 as separate from the above described components of the system 100, it will be appreciated that this is for illustrative purposes only. For instance, in some examples, all of the components of the system 100 may be included on the computing device 120, while in other examples, the components may be distributed across any number of computing devices able to communicate with one another, such as over one or more networks or other communication connections.
  • IME input method editor
  • the IME application 102 is configured to generate the IME interface 116 for display to the user 118 via the computing device 120.
  • the adaptive language model builder 108 is configured to generate the browsing history language model 104 based on the browsing history information 110.
  • the IME application 102 is further configured to receive a Latin character string 122 via the IME interface 116. In response to receiving the Latin character string 122, the IME application 102 is configured to predict a non-Latin character string 124 based at least in part on the browsing history language model 104.
  • the adaptive language model builder 108 may generate the browsing history language model 104 based on an analysis of the browsing history information 110.
  • the browsing history language model 104 may include an N-gram statistical language model.
  • Such an N-gram statistical language model may decompose the probability of a string of consecutive words into the products of the conditional probabilities between multiple (e.g., two, three, four, five, etc.) consecutive words in the string. Such analysis may be performed for each of the one or more files 112.
  • Some implementations provide a system service that may periodically monitor the browser cache 114 to determine whether new browsing content has been saved to the browser cache 114. In response to determining that new browsing content has been saved, the adaptive language model builder 108 may process the new browsing content to update the browsing history language model 104.
  • the browsing history information 110 may also include real-time browsing content 126, as shown in phantom.
  • a plug-in of a browser application 128 (e.g., a web browser application) may provide the real-time browsing content 126.
  • the adaptive language model builder 108 may process the real-time browsing content 126 to update the browsing history language model 104.
  • the plug-in of the browser application 128 may not provide real-time browsing information when a browsing mode is set to private browsing. That is, the browsing history information 110 may optionally only include the cached browsing content 112 that is stored at the browser cache.
  • the IME application 102 receives the Latin character string 122 via the IME interface 116.
  • the Latin character string 122 may include Pinyin text
  • the predicted non-Latin character string 124 may include one or more Chinese characters.
  • a plurality of non-Latin character strings may be associated with the Latin character string 122 received via the IME interface 116.
  • a conversion probability may be associated with each non-Latin character string of the plurality of non-Latin character strings.
  • the IME application 102 may predict the non-Latin character string 124 for display to the user 118 based at least in part on the browsing history language model 104. In a particular embodiment, the IME application 102 predicts the non-Latin character string 124 by identifying the non-Latin character string with a highest conversion probability.
  • the IME application 102 may order the plurality of non-Latin character strings based on the conversion probability and may display an ordered list of non-Latin character strings via the IME interface 116.
  • one or more predicted non-Latin character strings may be determined based on the browsing history language model 104 and the general language model 106.
  • C may represent the Chinese string to be predicted
  • P_m(C) may represent a probability determined based on the general language model 106
  • P_b(C) may represent a probability determined based on the browsing history language model 104.
  • the weighting factor may include a default weighting factor. That is, the weighting factor can be "pre-tuned" to a weighting factor that has been previously verified as accurate in most cases.
  • the weighting factor may include a user-defined weighting factor. For example, the user-defined weighting factor may be received from the user 118, and the weighting factor may be modified from the default weighting factor to the user-defined weighting factor. This may allow the user 118 to "tune" the weighting factor according to personal preference.
  • the general language model 106 may identify a first non-Latin character string as the non-Latin character string with the highest conversion probability.
  • the browsing history language model 104 may identify a second non-Latin character string as the non-Latin character string with the highest conversion probability.
  • the first non-Latin character string identified by the general language model 106 may be different than the second non-Latin character string identified by the browsing history language model 104.
  • the Latin character string 122 received from the user 118 may be the Pinyin text "wan'shang'shi'shi."
  • the browsing history language model 104 may predict that the Chinese character string 晚上十时 (meaning "10 P.M.") is more appropriate for display than the Chinese character string 晚上试试 (meaning "have a try in the evening") predicted by the general language model 106.
  • the Latin character string 122 received from the user 118 may be the Pinyin text "you'xiang'tu."
  • the browsing history language model 104 may predict that the Chinese character string 有向图 (meaning "directed graph") may be more appropriate for display than the Chinese character string predicted by the general language model 106.
  • FIG. 1 illustrates that the non-Latin character string 124 displayed via the IME interface 116 may vary depending on whether the browsing history language model 104 identifies the non-Latin character string 124 as more appropriate for display based on the browsing history information 110.
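
The way a weighting factor might control the contribution of the browsing history language model can be sketched as follows. The combining formula is not recoverable from the text above, so this sketch assumes a simple linear interpolation between the two probabilities; the function names, the `weight` parameter, and the probability values are all hypothetical.

```python
def combined_score(candidate, p_general, p_browsing, weight=0.5):
    """Blend two model probabilities for one candidate string.

    ASSUMPTION: linear interpolation. `weight` stands in for the weighting
    factor: 0.0 uses only the general model, 1.0 only the browsing history model.
    """
    return ((1.0 - weight) * p_general.get(candidate, 0.0)
            + weight * p_browsing.get(candidate, 0.0))


def rank_candidates(candidates, p_general, p_browsing, weight=0.5):
    """Return candidates ordered from highest to lowest combined score."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c, p_general, p_browsing, weight),
        reverse=True,
    )


# Made-up probabilities for two readings of the same Pinyin input.
p_general = {"晚上试试": 0.6, "晚上十时": 0.4}
p_browsing = {"晚上试试": 0.1, "晚上十时": 0.9}
candidates = ["晚上试试", "晚上十时"]

ranked_default = rank_candidates(candidates, p_general, p_browsing)
ranked_general_only = rank_candidates(candidates, p_general, p_browsing, weight=0.0)
```

With the default weight the browsing history model's preferred string is listed first, while a weight of 0.0 reproduces the general model's ordering. A user-defined weighting factor would simply replace the default value passed as `weight`.
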
  • FIG. 2 illustrates an example of an input method editor (IME) interface
  • FIG. 2 may correspond to the IME interface 116 of FIG. 1.
  • the IME interface 116 includes a Latin character string input window 202 and a non-Latin character string candidates window 204.
  • the Latin character string input window 202 is configured to receive a Latin character string (e.g., the Latin character string 122 of FIG. 1).
  • the non-Latin character string candidates window 204 is configured to display one or more non-Latin character string candidates.
  • FIG. 2 illustrates that a plurality of non-Latin (e.g., Chinese) character strings may be associated with the Latin character string received via the IME interface 116.
  • a conversion probability may be associated with each of the non-Latin character strings.
  • An IME application (e.g., the IME application 102 of FIG. 1) may order the non-Latin character strings based on conversion probability and may display an ordered list of non-Latin character strings via the IME interface 116.
  • the Latin character string received via the Latin character string input window 202 may be the Pinyin text "wan'shang'shi'shi."
  • the non-Latin character string candidates window 204 displays a first Chinese character string candidate 206 (i.e., 晚上十时) and a second Chinese character string candidate 208 (i.e., 晚上试试).
  • the browsing history language model 104 may identify the first Chinese character string candidate 206 (i.e., 晚上十时) as the Chinese character string with a highest conversion probability.
  • the general language model 106 may identify the second Chinese character string candidate 208 (i.e., 晚上试试) as the Chinese character string with a highest conversion probability.
  • the Chinese character string 晚上十时 (meaning "10 P.M.") may be more appropriate for display than the Chinese character string 晚上试试 (meaning "have a try in the evening") predicted by the general language model 106.
  • the first Chinese character string candidate 206 (i.e., 晚上十时)
  • the second Chinese character string candidate 208 (i.e., 晚上试试) predicted by the general language model 106.
  • the Chinese character string 晚上十时 may be presented as the first Chinese character string candidate 206 in the non-Latin character string candidates window 204.
  • the Chinese character string 晚上试试 predicted by the general language model 106 is provided as the second Chinese character string candidate 208 in the non-Latin character string candidates window 204.
  • alternative non-Latin character string candidates may be presented.
  • alternative Chinese character strings predicted by the browsing history language model 104 may be presented.
  • alternative numbers of candidates may be displayed.
  • FIG. 3 illustrates the exemplary input method editor interface 116 after receiving a Latin character string input that is different than the Latin character string input of FIG. 2.
  • the Latin character string received via the Latin character string input window 202 may be the Pinyin text "you'xiang'tu."
  • the non-Latin character string candidates window 204 displays a first Chinese character string candidate 302 (i.e., 有向图) and a second Chinese character string candidate 304.
  • the Chinese character string 有向图 (meaning "directed graph") may be more appropriate for display than the Chinese character string predicted by the general language model 106.
  • the Chinese character string 有向图 may be presented as the first Chinese character string candidate 302 in the non-Latin character string candidates window 204.
  • the Chinese character string predicted by the general language model 106 is provided as the second Chinese character string candidate 304 in the non-Latin character string candidates window 204.
  • alternative non-Latin character string candidates may be presented. Further, while only two candidates are illustrated in the non-Latin character string candidates window 204, alternative numbers of candidates may be displayed.
  • FIGS. 4 and 5 illustrate example process flows according to some implementations.
  • each block represents one or more operations that can be implemented in hardware, software, or a combination thereof.
  • the blocks represent computer-executable instructions that, when executed by one or more processors, cause the processors to perform the recited operations.
  • computer-executable instructions include routines, programs, objects, modules, components, data structures, and the like that perform particular functions or implement particular abstract data types.
  • the order in which the blocks are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes. Numerous other variations will be apparent to those of skill in the art in light of the disclosure herein.
  • the process flows 400 and 500 are described with reference to the system 100, described above, although other models, frameworks, systems and environments may implement the illustrated process.
  • the process flow 400 includes generating a browsing history language model based on browsing history information.
  • the IME application 102 of FIG. 1 may generate the browsing history language model 104 based on the browsing history information 110.
  • an N-gram statistical language model may be employed to analyze the browsing history information 110.
  • the general language model 106 may identify a first non-Latin character string as the non-Latin character string with the highest conversion probability.
  • the browsing history language model 104 may identify a second non-Latin character string as the non-Latin character string with the highest conversion probability.
  • the second non-Latin character string predicted by the browsing history language model 104 may be different from the first non-Latin character string predicted by the general language model 106.
  • the content of the browsing history information 110 may affect a prediction of a non-Latin character string.
  • the predicted non-Latin character string may more accurately reflect the interests of the user 118.
  • a web browser plug-in may filter one or more web pages as the user is browsing in substantially real-time.
  • the plug-in may analyze the data, combine the data with the previous browsing history, and integrate the data into the browsing history language model 104.
  • An advantage of this approach is its real-time processing capability, although it may require fast processing to avoid introducing noticeable latency for users.
  • a system service may periodically check one or more cache folders of one or more browsers and may examine the contents of the cache folders to build the browsing history language model 104. This method may be able to examine the browsing history of multiple browsers but may not update the browsing history language model 104 in substantially real-time.
  • a web browser plug-in may be responsible for detecting the content update, while a system service may be responsible for building the browsing history language model 104.
  • the process flow 400 includes predicting a non-Latin character string based at least in part on the browsing history language model, in response to receiving a Latin character string via an IME interface.
  • the IME application 102 of FIG. 1 may predict the non-Latin character string 124 based at least in part on the browsing history language model 104, in response to receiving the Latin character string 122 via the IME interface 116.
  • a plurality of non-Latin character strings may be associated with the Latin character string 122 received via the IME interface 116. Multiple non-Latin character strings may be displayed as candidates for user selection. A conversion probability may be associated with each of the non-Latin character string candidates. The conversion probability may be used to determine the order in which the non-Latin character string candidates are displayed.
  • FIG. 2 illustrates an ordered list of non-Latin character strings displayed in response to the user 118 providing the Pinyin text "wan'shang'shi'shi" via the Latin character string input window 202.
  • the non-Latin character string candidates window 204 displays a first Chinese character string candidate 晚上十时 and a second Chinese character string candidate 晚上试试.
  • the conversion probability associated with the first Chinese character string candidate 晚上十时 was determined to be higher than the conversion probability associated with the second Chinese character string candidate 晚上试试.
  • the non-Latin character string candidates window 204 displays a first Chinese character string candidate 有向图 and a second Chinese character string candidate.
  • the conversion probability associated with the first Chinese character string candidate 有向图 was determined to be higher than the conversion probability associated with the second Chinese character string candidate.
  • the predicted non-Latin character string 124 is determined based on the browsing history language model 104 and the general language model 106.
  • the first Chinese character string candidate (e.g., 晚上十时 in FIG. 2 or 有向图 in FIG. 3) may represent the non-
  • the second Chinese character string candidate (e.g., 晚上试试 in FIG. 2)
  • a contribution of the browsing history language model 104 may be determined based on a weighting factor.
  • the weighting factor may include a default weighting factor or a user-defined weighting factor.
  • the user 118 may adjust the weighting factor accordingly.
  • FIG. 5 illustrates another example process flow according to some implementations.
  • FIG. 5 illustrates that the browsing history language model may be updated based on new browsing content.
  • the process flow 500 includes generating a browsing history language model based on browsing history information.
  • the IME application 102 of FIG. 1 may generate the browsing history language model 104 based on the browsing history information 110.
  • the process flow 500 includes predicting a non-Latin character string based at least in part on the browsing history language model, in response to receiving a Latin character string via an input method editor interface.
  • the IME application 102 of FIG. 1 may predict the non-Latin character string 124 based at least in part on the browsing history language model 104, in response to receiving the Latin character string 122 via the IME interface 116.
  • the process flow 500 includes determining whether the browsing history information includes new browsing content. When it is determined that there is new browsing content, the process flow 500 may proceed to block 508. When new browsing content has not been detected, the process flow 500 returns to block 504.
  • the process flow 500 may include processing the new browsing content to update the browsing history language model.
  • a plug-in may detect new browsing content in substantially real-time. For example, referring to FIG. 1, a plug-in associated with the browser application 128 may provide the real-time browsing content 126, and the real-time browsing content 126 may be processed in substantially real-time to update the browsing history language model 104.
  • a system service may periodically monitor one or more browser cache locations to determine whether new browsing content has been saved. The new browsing content may then be processed to update the browsing history language model 104. For example, referring to FIG. 1, a system service may periodically monitor the browser cache 114 for new browsing content and then process the new browsing content to update the browsing history language model 104.
  • predicting a non-Latin character string may be based at least in part on the updated browsing history language model.
  • a Latin character string may be received via the IME interface (e.g., the IME interface 116).
  • a non-Latin character string is predicted based at least in part on the updated browsing history language model.
  • the Latin character string received at block 510 may be the same as the Latin character string received at block 504.
  • the predicted non-Latin character string may or may not be the same. That is, the update to the browsing history language model may or may not affect the prediction of the non-Latin character string.
  • the browsing history language model prior to the update (i.e., the browsing history language model generated at 502) may have predicted a particular non-Latin character string.
  • the updated browsing history language model (i.e., after the update at block 508) may predict the same non-Latin character string or may predict a different non-Latin character string.
  • updating the browsing history language model may affect a prediction associated with one or more Latin character strings but may not affect a prediction associated with other Latin character strings.
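
The update step of process flow 500 (detect new browsing content, then process it to update the model) can be sketched as a polling routine over a cache folder. This is a minimal illustration only: the class name, the method names, the flat-folder cache layout, and the whitespace tokenizer are assumptions, and real browser caches and Chinese word segmentation are considerably more involved.

```python
import os
from collections import Counter


class BrowsingHistoryModel:
    """Toy stand-in for a browsing history language model (bigram counts only)."""

    def __init__(self):
        self.bigram_counts = Counter()
        self._processed_paths = set()  # cache files already folded into the counts

    def add_tokens(self, tokens):
        """Fold one document's adjacent token pairs into the bigram counts."""
        self.bigram_counts.update(zip(tokens, tokens[1:]))

    def refresh_from_cache(self, cache_dir, tokenize=str.split):
        """Process cache files not seen before; return how many were new."""
        new_files = 0
        for name in sorted(os.listdir(cache_dir)):
            path = os.path.join(cache_dir, name)
            if path in self._processed_paths or not os.path.isfile(path):
                continue
            # Hypothetical layout: each cache entry is a readable text file.
            with open(path, encoding="utf-8", errors="ignore") as handle:
                self.add_tokens(tokenize(handle.read()))
            self._processed_paths.add(path)
            new_files += 1
        return new_files
```

A system service could call `refresh_from_cache` on a timer; a return value of zero corresponds to the "no new browsing content" branch, in which the existing model continues to be used for prediction.
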
  • FIG. 6 illustrates an example configuration of a computing device 600 and an environment that can be used to implement the modules and functions described herein.
  • the computing device 600 corresponds to the computing device 120 of FIG. 1, but it should be understood that the computing device 120 may be configured in a similar manner to that illustrated.
  • the computing device 600 may include at least one processor 602, a memory 604, communication interfaces 606, a display device 608 (e.g. a touchscreen display), other input/output (I/O) devices 610 (e.g. a touchscreen display or a mouse and keyboard), and one or more mass storage devices 612, able to communicate with each other, such as via a system bus 614 or other suitable connection.
  • I/O input/output
  • the processor 602 may be a single processing unit or a number of processing units, all of which may include single or multiple computing units or multiple cores.
  • the processor 602 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor 602 can be configured to fetch and execute computer-readable instructions stored in the memory 604, mass storage devices 612, or other computer-readable media.
  • Memory 604 and mass storage devices 612 are examples of computer storage media for storing instructions which are executed by the processor 602 to perform the various functions described above.
  • memory 604 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, or the like).
  • mass storage devices 612 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), a storage array, a network attached storage, a storage area network, or the like.
  • Both memory 604 and mass storage devices 612 may be collectively referred to as memory or computer storage media herein, and may be computer-readable media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 602 as a particular machine configured for carrying out the operations and functions described in the implementations herein.
  • the computing device 600 may also include one or more communication interfaces 606 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed above.
  • the communication interfaces 606 can facilitate communications within a wide variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet and the like.
  • Communication interfaces 606 can also provide communication with external storage (not shown), such as in a storage array, network attached storage, storage area network, or the like.
  • a display device 608, such as a touchscreen display or other display device, may be included in some implementations.
  • the display device 608 may be configured to display the IME interface 116 as described above.
  • Other I/O devices 610 may be devices that receive various inputs from a user and provide various outputs to the user, and may include a touchscreen, such as a touchscreen display, a keyboard, a remote controller, a mouse, a printer, audio input/output devices, and so forth.
  • Memory 604 may include modules and components for execution by the computing device 600 according to the implementations discussed herein.
  • memory 604 includes the IME application 102 and the adaptive language model builder 108 as described above with regard to FIG. 1.
  • Memory 604 may further include one or more other modules 616, such as an operating system, drivers, application software, communication software, or the like.
  • Memory 604 may also include other data 618, such as data stored while performing the functions described above and data used by the other modules 616.
  • Memory 604 may also include other data and data structures described or alluded to herein.
  • Memory 604 may include information that is used in the course of deriving and generating the browsing history language model 104 as described above.
  • A module can represent program code (and/or declarative-type instructions) that performs specified tasks or operations when executed on a processing device or devices (e.g., CPUs or processors).
  • The program code can be stored in one or more computer-readable memory devices or other computer storage devices.
  • The IME application 102 and the adaptive language model builder 108 may be implemented using any form of computer-readable media that is accessible by computing device 600.
  • "Computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.
  • Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data.
  • Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
  • Communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism.
  • Computer storage media does not include communication media.
  • This disclosure provides various example implementations, as described and as illustrated in the drawings. However, this disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art. Reference in the specification to "one implementation," "this implementation," "these implementations" or "some implementations" means that a particular feature, structure, or characteristic described is included in at least one implementation, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same implementation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

Some examples may include generating a browsing history language model based on browsing history information. Further, some implementations may include predicting and presenting a non-Latin character string based at least in part on the browsing history language model, such as in response to receiving a Latin character string via an input method editor interface.

Description

BROWSING HISTORY LANGUAGE MODEL FOR INPUT METHOD EDITOR
TECHNICAL FIELD
[0001] This disclosure relates to the technical field of computer input.
BACKGROUND
[0002] An input method editor (IME) is a computer functionality that assists a user to input text into a host application of a computing device. An IME may provide several suggested words and phrases based on received inputs from the user as candidates for insertion into the host application. For example, the user may input one or more initial characters of a word or phrase and an IME, based on the initial characters, may provide one or more suggested words or phrases for the user to select a desired one.
[0003] For another example, an IME may also assist the user to input non-Latin characters such as Chinese. The user may input Latin characters through a keyboard. The IME returns one or more Chinese characters as candidates for insertion. The user may then select the proper character and insert it. As many typical keyboards support inputting Latin characters, the IME is useful for the user to input non-Latin characters using a Latin-character keyboard.
SUMMARY
[0004] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0005] Some implementations provide techniques and arrangements for predicting a non-Latin character string based at least in part on a browsing history language model. The browsing history language model may be generated based on browsing history information. For example, the browsing history information may include at least cached browsing content and may also include real-time browsing content. The predicted non-Latin character string may be provided in response to receiving a Latin character string via an input method editor interface. Additionally, some examples may predict a Chinese character string based at least in part on the browsing history language model in response to receiving a Pinyin character string.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The Detailed Description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
[0007] FIG. 1 illustrates an example system according to some implementations.
[0008] FIG. 2 illustrates an example input method editor interface according to some implementations.
[0009] FIG. 3 illustrates an example input method editor interface according to some implementations.
[0010] FIG. 4 illustrates an example process flow according to some implementations.
[0011] FIG. 5 illustrates an example process flow according to some implementations.
[0012] FIG. 6 illustrates an example system in which some implementations may operate.
DETAILED DESCRIPTION
OVERVIEW
[0013] Some examples include techniques and arrangements for implementing a browsing history language model with an input method editor (IME). For instance, it may be difficult for a user to input characters into a computer for a language that is based on non-Latin characters (e.g., the Chinese language). For example, there are thousands of Chinese characters, and a typical Western keyboard is limited to 26 letters. The present disclosure relates to an IME that predicts a non-Latin character string in response to receiving a Latin character string from a user. The predicted non-Latin character string is based at least in part on a browsing history language model. As an illustrative, non-limiting example, the IME may be used to translate Pinyin text (i.e., Chinese characters represented phonetically by Latin characters) into Chinese characters. It will be appreciated that the present disclosure is not limited to Chinese characters. For example, other illustrative non-Latin characters may include Japanese characters or Korean characters, among other alternatives.
[0014] Among Chinese input method editors, those based on Pinyin text are the most common. Chinese Pinyin is a set of rules that utilize the Latin alphabet to annotate the pronunciations of Chinese characters. In a typical Pinyin IME, users input the Pinyin text of the Chinese they want to input into the computer, and the IME is responsible for displaying all the matched characters. However, many Chinese characters have the same pronunciation. That is, there is a one-to-many relationship between the Pinyin text and the corresponding Chinese characters. To predict a non-Latin character string, an IME may rely on a language model. For example, a statistical language model (SLM) may be used to compute a conversion probability of each possible conversion and may select the one with the highest probability for presentation to a user. A particular type of SLM, referred to as an N-gram SLM, may decompose the probability of a string of consecutive words into the products of the conditional probabilities between two, three, or more consecutive words in the string.
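As a worked illustration of the N-gram decomposition described above, the probability of a word string can be written out as follows. The notation (words w_1 through w_n and order N) is an assumption introduced here for illustration and does not appear in the original text. The first equality is the exact chain rule; the approximation is the N-gram assumption that each word depends only on the N - 1 words before it.

$$
P(w_1, \dots, w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
$$

For a bigram model (N = 2), each factor reduces to P(w_i | w_{i-1}), the conditional probability between two consecutive words.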
[0015] An IME may be released with a language model for generic usage (i.e., a "general" language model), which is trained for most common typing scenarios. However, such a general language model may be inadequate for a particular user (e.g., a user with a particular browsing history). That is, different users may have different preferences, and an IME that utilizes a general language model may suggest a word or phrase that may be inappropriate for a particular user. To illustrate, an IME that utilizes a general language model may suggest a first word or phrase (i.e., a first set of non-Latin characters). The first word or phrase may have the same pronunciation as a second word or phrase (i.e., a second set of non-Latin characters). The first word or phrase may be appropriate for a standard user but may be less appropriate for another user. Instead, the second word or phrase may be more appropriate for such a user.
[0016] Web browsing history is an important source of information about a user. For example, a user may browse content related to recent news events or may browse special topics that the user may be interested in. For example, a computer programmer may browse one or more portal sites for various news items and may also browse one or more software development sites. As such, the browsing history of the user may contain the latest general hot topics and texts related to programming skills, among other information.
[0017] The present disclosure describes an IME that utilizes a browsing history language model to predict a non-Latin character string that may be more appropriate for a user with a particular browsing history than a non-Latin character string that is predicted based on the general language model.
EXAMPLE IMPLEMENTATIONS
[0018] FIG. 1 illustrates an example framework of a system 100 according to some implementations. The system 100 includes an input method editor (IME) application 102 that is communicatively coupled to a browsing history language model 104 and a general language model 106. The system 100 further includes an adaptive language model builder 108 that is adapted to receive browsing history information 110. The browsing history information 110 may include at least cached browsing content 112 stored at a browser cache 114. An IME interface 116 may be provided to a user 118 via a computing device 120. While the computing device 120 is shown in FIG. 1 as separate from the above described components of the system 100, it will be appreciated that this is for illustrative purposes only. For instance, in some examples, all of the components of the system 100 may be included on the computing device 120, while in other examples, the components may be distributed across any number of computing devices able to communicate with one another, such as over one or more networks or other communication connections.
[0019] The IME application 102 is configured to generate the IME interface 116 for display to the user 118 via the computing device 120. The adaptive language model builder 108 is configured to generate the browsing history language model 104 based on the browsing history information 110. The IME application 102 is further configured to receive a Latin character string 122 via the IME interface 116. In response to receiving the Latin character string 122, the IME application 102 is configured to predict a non-Latin character string 124 based at least in part on the browsing history language model 104.
[0020] The adaptive language model builder 108 may generate the browsing history language model 104 based on an analysis of the browsing history information 110. For example, the browsing history language model 104 may include an N-gram statistical language model. Such an N-gram statistical language model may decompose the probability of a string of consecutive words into the products of the conditional probabilities between multiple (e.g., two, three, four, five, etc.) consecutive words in the string. Such analysis may be performed for each of the one or more files 112.
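The following is a minimal sketch of how an N-gram model of this kind might be built from browsing text, shown for the bigram case (N = 2). It is not the disclosed implementation: the function names, the "<s>" start marker, the add-alpha smoothing, and the assumption that the text has already been segmented into words are all hypothetical choices made for illustration.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left contexts over pre-segmented sentences.

    `sentences` is an iterable of word lists; word segmentation of the
    browsing content is assumed to have happened elsewhere.
    """
    context_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words)  # "<s>" marks the sentence start
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_probability(words, context_counts, bigram_counts, vocab_size, alpha=1.0):
    """Product of add-alpha smoothed conditional probabilities P(cur | prev)."""
    prob = 1.0
    padded = ["<s>"] + list(words)
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        prob *= numerator / denominator
    return prob

# Hypothetical browsing text in which "directed graph" appears more often
# than "gas tank diagram".
history = [["有向图", "算法"], ["有向图", "算法"], ["油箱图"]]
contexts, bigrams = train_bigram_model(history)
p_graph = bigram_probability(["有向图"], contexts, bigrams, vocab_size=3)
p_tank = bigram_probability(["油箱图"], contexts, bigrams, vocab_size=3)
```

With this toy history the string seen more often in the browsing text receives the higher probability, which is the effect the browsing history language model is meant to capture.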
[0021] Some implementations provide a system service that may periodically monitor the browser cache 114 to determine whether new browsing content has been saved to the browser cache 114. In response to determining that new browsing content has been saved, the adaptive language model builder 108 may process the new browsing content to update the browsing history language model 104. In some implementations, the browsing history information 110 may also include real-time browsing content 126, as shown in phantom. For example, a plug-in of a browser application 128 (e.g., a web browser application) may detect new browsing content in substantially real-time and provide the real-time browsing content 126 to the adaptive language model builder 108. The adaptive language model builder 108 may process the real-time browsing content 126 to update the browsing history language model 104. In some implementations, the plug-in of the browser application 128 may not provide real-time browsing information when a browsing mode is set to private browsing. That is, the browsing history information 110 may optionally only include the cached browsing content 112 that is stored at the browser cache.
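The periodic cache check described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the disclosed service: the function name, the use of file modification times to detect new content, and the `last_seen` dictionary are hypothetical, and real browser caches often store content in formats that need further decoding.

```python
import os

def find_new_cache_files(cache_dir, last_seen):
    """Return cache files that are new or changed since the previous scan.

    `last_seen` maps file path -> modification time recorded on the last
    scan and is updated in place (a hypothetical bookkeeping structure).
    """
    new_files = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                continue  # file vanished between listing and stat
            if last_seen.get(path) != mtime:
                last_seen[path] = mtime
                new_files.append(path)
    return sorted(new_files)
```

A service could call this function on a timer and hand any returned files to the language model builder; the polling interval is a design choice that trades update latency against overhead.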
[0022] The IME application 102 receives the Latin character string 122 via the IME interface 116. As an illustrative example, the Latin character string 122 may include Pinyin text, and the predicted non-Latin character string 124 may include one or more Chinese characters.
[0023] A plurality of non-Latin character strings may be associated with the Latin character string 122 received via the IME interface 116. A conversion probability may be associated with each non-Latin character string of the plurality of non-Latin character strings. The IME application 102 may predict the non-Latin character string 124 for display to the user 118 based at least in part on the browsing history language model 104. In a particular embodiment, the IME application 102 predicts the non-Latin character string 124 by identifying the non-Latin character string with a highest conversion probability. The IME application 102 may order the plurality of non-Latin character strings based on the conversion probability and may display an ordered list of non-Latin character strings via the IME interface 116.
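The ordering step can be sketched in a few lines. The function name and the probability values below are hypothetical; the two Chinese strings are used only as examples of candidates that share the Pinyin reading "wan'shang'shi'shi".

```python
def rank_candidates(conversion_probabilities):
    """Return candidate strings sorted by descending conversion probability."""
    return sorted(
        conversion_probabilities,
        key=conversion_probabilities.get,
        reverse=True,
    )

# Hypothetical conversion probabilities for two candidate strings.
candidates = {"晚上试试": 0.35, "晚上十时": 0.55}
ordered = rank_candidates(candidates)
top_prediction = ordered[0]  # the string with the highest conversion probability
```

The first element of the ordered list plays the role of the predicted string, and the full list corresponds to what would be shown in the candidates window.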
[0024] In some implementations, one or more predicted non-Latin character strings may be determined based on the browsing history language model 104 and the general language model 106. As an illustrative example, C may represent the Chinese string to be predicted, $P_m(C)$ may represent a probability determined based on the general language model 106, and $P_b(C)$ may represent a probability determined based on the browsing history language model 104. A contribution of the browsing history language model 104 may be determined based on a weighting factor (e.g., a value between 0 and 1, referred to herein as λ). That is, the probability of C may be determined based on the formula: $P(C) = \lambda P_m(C) + (1 - \lambda) P_b(C)$.
[0025] In some implementations, the weighting factor λ may include a default weighting factor. That is, the weighting factor can be "pre-tuned" to a weighting factor that has been previously verified as accurate in most cases. In another embodiment, the weighting factor may include a user-defined weighting factor. For example, the user-defined weighting factor may be received from the user 118, and the weighting factor may be modified from the default weighting factor to the user-defined weighting factor. This may allow the user 118 to "tune" the weighting factor according to personal preference.
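The weighted combination of the two models can be sketched as a linear interpolation. In this sketch the weighting factor multiplies the general-model probability and its complement multiplies the browsing-history probability, following the formula given above. The function names, the default value of 0.5, and the example probabilities are assumptions made for illustration only.

```python
def interpolate(p_general, p_browsing, lam=0.5):
    """Combine two probabilities as lam * p_general + (1 - lam) * p_browsing."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("weighting factor must lie between 0 and 1")
    return lam * p_general + (1.0 - lam) * p_browsing

def best_candidate(general, browsing, lam=0.5):
    """Score every candidate under both models and return the best one.

    `general` and `browsing` map candidate strings to probabilities; a
    candidate missing from one model is treated as having probability 0.
    """
    all_candidates = set(general) | set(browsing)
    scores = {
        c: interpolate(general.get(c, 0.0), browsing.get(c, 0.0), lam)
        for c in all_candidates
    }
    return max(scores, key=scores.get), scores

# Hypothetical probabilities: the general model prefers A, the
# browsing history model prefers B.
general_model = {"A": 0.6, "B": 0.4}
browsing_model = {"A": 0.1, "B": 0.9}
balanced_choice, _ = best_candidate(general_model, browsing_model, lam=0.5)
general_only_choice, _ = best_candidate(general_model, browsing_model, lam=1.0)
```

Lowering the weighting factor gives the browsing history more influence, so a user who finds the candidate order unsuitable could shift the result by adjusting that single value.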
[0026] The general language model 106 may identify a first non-Latin character string as the non-Latin character string with the highest conversion probability. The browsing history language model 104 may identify a second non-Latin character string as the non-Latin character string with the highest conversion probability. The first non-Latin character string identified by the general language model 106 may be different than the second non-Latin character string identified by the browsing history language model 104.
[0027] As an illustrative example, the Latin character string 122 received from the user 118 may be the Pinyin text "wan'shang'shi'shi." Based on the browsing history information 110, the browsing history language model 104 may predict that the Chinese character string 晚上十时 (meaning "10 P.M.") is more appropriate for display than the Chinese character string 晚上试试 (meaning "have a try in the evening") predicted by the general language model 106.
[0028] As another illustrative example, the Latin character string 122 received from the user 118 may be the Pinyin text "you'xiang'tu." Based on the browsing history information 110, the browsing history language model 104 may predict that the Chinese character string 有向图 (meaning "directed graph") may be more appropriate for display than the Chinese character string 油箱图 (meaning "gas tank diagram") predicted by the general language model 106.
[0029] Thus, FIG. 1 illustrates that the non-Latin character string 124 displayed via the IME interface 116 may vary depending on whether the browsing history language model 104 identifies the non-Latin character string 124 as more appropriate for display based on the browsing history information 110.
[0030] FIG. 2 illustrates an example of an input method editor (IME) interface 116 according to some implementations. To illustrate, the IME interface 116 of FIG. 2 may correspond to the IME interface 116 of FIG. 1.
[0031] The IME interface 116 includes a Latin character string input window 202 and a non-Latin character string candidates window 204. The Latin character string input window 202 is configured to receive a Latin character string (e.g., the Latin character string 122 of FIG. 1). The non-Latin character string candidates window 204 is configured to display one or more non-Latin character string candidates.
[0032] FIG. 2 illustrates that a plurality of non-Latin (e.g., Chinese) character strings may be associated with the Latin character string received via the IME interface 116. A conversion probability may be associated with each of the non-Latin character strings. An IME application (e.g., the IME application 102 of FIG. 1) may order the non-Latin character strings based on conversion probability and may display an ordered list of non-Latin character strings via the IME interface 116.
[0033] In the example illustrated in FIG. 2, the Latin character string received via the Latin character string input window 202 may be the Pinyin text "wan'shang'shi'shi." The non-Latin character string candidates window 204 displays a first Chinese character string candidate 206 (i.e., 晚上十时) and a second Chinese character string candidate 208 (i.e., 晚上试试). For example, the browsing history language model 104 may identify the first Chinese character string candidate 206 (i.e., 晚上十时) as the Chinese character string with a highest conversion probability. The general language model 106 may identify the second Chinese character string candidate 208 (i.e., 晚上试试) as the Chinese character string with a highest conversion probability.
[0034] As explained above, based on the browsing history information 110, the Chinese character string 晚上十时 (meaning "10 P.M.") may be more appropriate for display than the Chinese character string 晚上试试 (meaning "have a try in the evening") predicted by the general language model 106. As such, the first Chinese character string candidate 206 (i.e., 晚上十时) predicted by the browsing history language model 104 may be identified as having a higher conversion probability than the second Chinese character string candidate 208 (i.e., 晚上试试) predicted by the general language model 106. Accordingly, the Chinese character string 晚上十时 may be presented as the first Chinese character string candidate 206 in the non-Latin character string candidates window 204.
[0035] In the example illustrated in FIG. 2, the Chinese character string 晚上试试 predicted by the general language model 106 is provided as the second Chinese character string candidate 208 in the non-Latin character string candidates window 204. However, it will be appreciated that alternative non-Latin character string candidates may be presented. For example, alternative Chinese character strings predicted by the browsing history language model 104 may be presented. Further, while only two candidates are illustrated in the non-Latin character string candidates window 204, alternative numbers of candidates may be displayed.
[0036] FIG. 3 illustrates the exemplary input method editor interface 116 after receiving a Latin character string input that is different than the Latin character string input of FIG. 2.
[0037] In the example illustrated in FIG. 3, the Latin character string received via the Latin character string input window 202 may be the Pinyin text "you'xiang'tu." The non-Latin character string candidates window 204 displays a first Chinese character string candidate 302 (i.e., 有向图) and a second Chinese character string candidate 304 (i.e., 油箱图). As explained above, based on the browsing history information 110, the Chinese character string 有向图 (meaning "directed graph") may be more appropriate for display than the Chinese character string 油箱图 (meaning "gas tank diagram"). As such, the Chinese character string 有向图 may be presented as the first Chinese character string candidate 302 in the non-Latin character string candidates window 204.
[0038] In the example illustrated in FIG. 3, the Chinese character string 油箱图 is provided as the second Chinese character string candidate 304 in the non-Latin character string candidates window 204. However, it will be appreciated that alternative non-Latin character string candidates may be presented. Further, while only two candidates are illustrated in the non-Latin character string candidates window 204, alternative numbers of candidates may be displayed.
[0039] FIGS. 4 and 5 illustrate example process flows according to some implementations. In the flow diagrams of FIGS. 4 and 5, each block represents one or more operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the blocks represent computer-executable instructions that, when executed by one or more processors, cause the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, modules, components, data structures, and the like that perform particular functions or implement particular abstract data types. The order in which the blocks are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes. Numerous other variations will be apparent to those of skill in the art in light of the disclosure herein. For discussion purposes, the process flows 400 and 500 are described with reference to the system 100, described above, although other models, frameworks, systems and environments may implement the illustrated process.
[0040] Referring to FIG. 4, at block 402, the process flow 400 includes generating a browsing history language model based on browsing history information. For example, the IME application 102 of FIG. 1 may generate the browsing history language model 104 based on the browsing history information 110.
[0041] As an illustrative, non-limiting example, an N-gram statistical language model may be employed to analyze the browsing history information 110. Employing such an N-gram SLM, the general language model 106 may identify a first non-Latin character string as the non-Latin character string with the highest conversion probability. Employing the N-gram SLM to analyze the browsing history information 110, the browsing history language model 104 may identify a second non-Latin character string as the non-Latin character string with the highest conversion probability. Depending on the linguistic characteristics of the browsing history information 110, the second non-Latin character string predicted by the browsing history language model 104 may be different from the first non-Latin character string predicted by the general language model 106. Thus, the content of the browsing history information 110 may affect a prediction of a non-Latin character string. Depending on the content of the browsing history information 110, the predicted non-Latin character string may more accurately reflect the interests of the user 118.
[0042] In a particular embodiment, a web browser plug-in may filter one or more web pages as the user is browsing in substantially real-time. The plug-in may analyze the data, combine the data with the previous browsing history, and integrate the data into the browsing history language model 104. An advantage of this approach is real-time processing capability, although it may require fast processing to avoid introducing noticeable latency for users. In another embodiment, a system service may periodically check one or more cache folders of one or more browsers and may examine the contents of the cache folders to build the browsing history language model 104. This method may be able to examine the browsing history of multiple browsers but may not update the browsing history language model 104 in substantially real-time. Alternatively, a web browser plug-in may be responsible for detecting the content update, while a system service may be responsible for building the browsing history language model 104.
[0043] At block 404, the process flow 400 includes predicting a non-Latin character string based at least in part on the browsing history language model, in response to receiving a Latin character string via an IME interface. For example, the IME application 102 of FIG. 1 may predict the non-Latin character string 124 based at least in part on the browsing history language model 104, in response to receiving the Latin character string 122 via the IME interface 116.
[0044] A plurality of non-Latin character strings may be associated with the Latin character string 122 received via the IME interface 116. Multiple non-Latin character strings may be displayed as candidates for user selection. A conversion probability may be associated with each of the non-Latin character string candidates. The conversion probability may be used to determine the order in which the non-Latin character string candidates are displayed.
[0045] As an illustrative example, FIG. 2 illustrates an ordered list of non-Latin character strings displayed in response to the user 118 providing the Pinyin text "wan'shang'shi'shi" via the Latin character string input window 202. The non-Latin character string candidates window 204 displays a first Chinese character string candidate 晚上十时 and a second Chinese character string candidate 晚上试试. In this case, the conversion probability associated with the first Chinese character string candidate 晚上十时 was determined to be higher than the conversion probability associated with the second Chinese character string candidate 晚上试试.
[0046] As another illustrative example, referring to FIG. 3, the non-Latin character string candidates window 204 displays a first Chinese character string candidate 有向图 and a second Chinese character string candidate 油箱图 in response to the user 118 providing the Pinyin text "you'xiang'tu" via the Latin character string input window 202. In this case, the conversion probability associated with the first Chinese character string candidate 有向图 was determined to be higher than the conversion probability associated with the second Chinese character string candidate 油箱图.
[0047] In a particular embodiment, the predicted non-Latin character string 124 is determined based on the browsing history language model 104 and the general language model 106. In one embodiment, the first Chinese character string candidate (e.g., 晚上十时 in FIG. 2 or 有向图 in FIG. 3) may represent the non-Latin character string with the highest conversion probability according to the browsing history language model 104. The second Chinese character string candidate (e.g., 晚上试试 in FIG. 2 or 油箱图 in FIG. 3) may represent the non-Latin character string with the highest conversion probability according to the general language model 106.
[0048] A contribution of the browsing history language model 104 may be determined based on a weighting factor. For example, the weighting factor may include a default weighting factor or a user-defined weighting factor. In the event that the user 118 determines that the order of the Chinese character string candidates is inappropriate, the user 118 may adjust the weighting factor accordingly.
[0049] FIG. 5 illustrates another example process flow according to some implementations. FIG. 5 illustrates that the browsing history language model may be updated based on new browsing content.
[0050] At block 502, the process flow 500 includes generating a browsing history language model based on browsing history information. For example, the IME application 102 of FIG. 1 may generate the browsing history language model 104 based on the browsing history information 110.
[0051] At block 504, the process flow 500 includes predicting a non-Latin character string based at least in part on the browsing history language model, in response to receiving a Latin character string via an input method editor interface. For example, the IME application 102 of FIG. 1 may predict the non-Latin character string 124 based at least in part on the browsing history language model 104, in response to receiving the Latin character string 122 via the IME interface 116.
[0052] At block 506, the process flow 500 includes determining whether the browsing history information includes new browsing content. When it is determined that there is new browsing content, the process flow 500 may proceed to block 508. When new browsing content has not been detected, the process flow 500 returns to block 504. At block 508, the process flow 500 may include processing the new browsing content to update the browsing history language model.
[0053] In some implementations, at block 506, a plug-in may detect new browsing content in substantially real-time. For example, referring to FIG. 1, a plug-in associated with the browser application 128 may provide the real-time browsing content 126, and the real-time browsing content 126 may be processed in substantially real-time to update the browsing history language model 104. In an alternative embodiment, at block 506, a system service may periodically monitor one or more browser cache locations to determine whether new browsing content has been saved. The new browsing content may then be processed to update the browsing history language model 104. For example, referring to FIG. 1, a system service may periodically monitor the browser cache 114 for new browsing content and then process the new browsing content to update the browsing history language model 104.
[0054] Thereafter, predicting a non-Latin character string may be based at least in part on the updated browsing history language model. For example, at block 510, a Latin character string may be received via the IME interface (e.g., the IME interface 116). In response to receiving this Latin character string, a non-Latin character string is predicted based at least in part on the updated browsing history language model.
[0055] In a particular illustrative embodiment, the Latin character string received at block 510 (i.e., after the browsing history language model has been updated) may be the same as the Latin character string received at block 504. Depending on the update to the browsing history language model resulting from the new browsing content being saved, the predicted non-Latin character string may or may not be the same. That is, the update to the browsing history language model may or may not affect the prediction of the non-Latin character string. To illustrate, the browsing history language model prior to the update (i.e., the browsing history language model generated at block 502) may have predicted a particular non-Latin character string. The updated browsing history language model (i.e., after the update at block 508) may predict the same non-Latin character string or may predict a different non-Latin character string.
[0056] Thus, updating the browsing history language model may affect a prediction associated with one or more Latin character strings but may not affect a prediction associated with other Latin character strings.
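The effect described in the preceding paragraphs, where an update changes the prediction for some Latin character strings and leaves others unchanged, can be shown with a toy example. The sketch below is an assumption-laden stand-in, not the disclosed model: `ToyHistoryModel`, its candidate lexicon, and the simple occurrence counting are hypothetical and far simpler than an N-gram statistical language model.

```python
from collections import Counter


class ToyHistoryModel:
    """Count-based stand-in for a browsing history language model (illustrative only)."""

    def __init__(self, lexicon):
        # lexicon: Latin (e.g., Pinyin) string -> list of candidate non-Latin strings
        self.lexicon = lexicon
        self.counts = Counter()

    def update(self, text):
        # Count how often each known candidate occurs in the browsed text.
        for candidates in self.lexicon.values():
            for cand in candidates:
                self.counts[cand] += text.count(cand)

    def predict(self, latin):
        # The candidate with the highest count wins; ties keep lexicon order.
        return max(self.lexicon[latin], key=lambda c: self.counts[c])


lexicon = {"yiqi": ["一起", "仪器"], "beijing": ["北京", "背景"]}
model = ToyHistoryModel(lexicon)

model.update("我们一起去北京")          # initial browsing history
before = (model.predict("yiqi"), model.predict("beijing"))

model.update("实验仪器 仪器校准 仪器")   # new browsing content
after = (model.predict("yiqi"), model.predict("beijing"))

print(before)  # ('一起', '北京')
print(after)   # ('仪器', '北京')
```

Here the new content changes the prediction for "yiqi" but not for "beijing", because only the counts for candidates of "yiqi" were affected by the update.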
EXAMPLE COMPUTING DEVICE AND ENVIRONMENT
[0057] FIG. 6 illustrates an example configuration of a computing device 600 and an environment that can be used to implement the modules and functions described herein. As shown in FIG. 6, the computing device 600 corresponds to the computing device 120 of FIG. 1, but it should be understood that the computing device 120 may be configured in a similar manner to that illustrated.
[0058] The computing device 600 may include at least one processor 602, a memory 604, communication interfaces 606, a display device 608 (e.g., a touchscreen display), other input/output (I/O) devices 610 (e.g., a touchscreen display or a mouse and keyboard), and one or more mass storage devices 612, able to communicate with each other, such as via a system bus 614 or other suitable connection.
[0059] The processor 602 may be a single processing unit or a number of processing units, all of which may include single or multiple computing units or multiple cores. The processor 602 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 602 can be configured to fetch and execute computer-readable instructions stored in the memory 604, mass storage devices 612, or other computer-readable media.
[0060] Memory 604 and mass storage devices 612 are examples of computer storage media for storing instructions which are executed by the processor 602 to perform the various functions described above. For example, memory 604 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, or the like). Further, mass storage devices 612 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), a storage array, a network attached storage, a storage area network, or the like. Both memory 604 and mass storage devices 612 may be collectively referred to as memory or computer storage media herein, and may be computer-readable media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 602 as a particular machine configured for carrying out the operations and functions described in the implementations herein.
[0061] The computing device 600 may also include one or more communication interfaces 606 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed above. The communication interfaces 606 can facilitate communications within a wide variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet and the like. Communication interfaces 606 can also provide communication with external storage (not shown), such as in a storage array, network attached storage, storage area network, or the like.
[0062] The discussion herein refers to data being sent and received by particular components or modules. This should not be taken as a limitation, as such communication need not be direct and the particular components or modules need not necessarily be a single functional unit. This is not to be taken as limiting implementations to only those in which the components directly send and receive data from one another. The signals could instead be relayed by a separate component upon receipt of the data. Further, the components may be combined or the functionality may be separated amongst components in various manners not limited to those discussed above. Other variations in the logical and practical structure and framework of various implementations would be apparent to one of ordinary skill in the art in view of the disclosure provided herein.
[0063] A display device 608, such as a touchscreen display or other display device, may be included in some implementations. The display device 608 may be configured to display the IME interface 116 as described above. Other I/O devices 610 may be devices that receive various inputs from a user and provide various outputs to the user, and may include a touchscreen, such as a touchscreen display, a keyboard, a remote controller, a mouse, a printer, audio input/output devices, and so forth.
[0064] Memory 604 may include modules and components for execution by the computing device 600 according to the implementations discussed herein. In the illustrated example, memory 604 includes the IME application 102 and the adaptive language model builder 108 as described above with regard to FIG. 1. Memory 604 may further include one or more other modules 616, such as an operating system, drivers, application software, communication software, or the like. Memory 604 may also include other data 618, such as data stored while performing the functions described above and data used by the other modules 616. Memory 604 may also include other data and data structures described or alluded to herein. For example, memory 604 may include information that is used in the course of deriving and generating the browsing history language model 104 as described above.
[0065] The example systems and computing devices described herein are merely examples suitable for some implementations and are not intended to suggest any limitation as to the scope of use or functionality of the environments, architectures and frameworks that can implement the processes, components and features described herein. Thus, implementations herein are operational with numerous environments or architectures, and may be implemented in general purpose and special-purpose computing systems, or other devices having processing capability. Generally, any of the functions described with reference to the figures can be implemented using software, hardware (e.g., fixed logic circuitry) or a combination of these implementations. The term "module," "mechanism" or "component" as used herein generally represents software, hardware, or a combination of software and hardware that can be configured to implement prescribed functions. For instance, in the case of a software implementation, the term "module," "mechanism" or "component" can represent program code (and/or declarative-type instructions) that performs specified tasks or operations when executed on a processing device or devices (e.g., CPUs or processors). The program code can be stored in one or more computer-readable memory devices or other computer storage devices. Thus, the processes, components and modules described herein may be implemented by a computer program product.
[0066] Although illustrated in FIG. 6 as being stored in memory 604 of computing device 600, the IME application 102 and the adaptive language model builder 108, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computing device 600. As used herein, "computer-readable media" includes, at least, two types of computer-readable media, namely computer storage media and communications media.
[0067] Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
[0068] In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media.
[0069] Furthermore, this disclosure provides various example implementations, as described and as illustrated in the drawings. However, this disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art. Reference in the specification to "one implementation," "this implementation," "these implementations" or "some implementations" means that a particular feature, structure, or characteristic described is included in at least one implementation, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same implementation.
CONCLUSION
[0070] Although the subject matter has been described in language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. This disclosure is intended to cover any and all adaptations or variations of the disclosed implementations, and the following claims should not be construed to be limited to the specific implementations disclosed in the specification. Instead, the scope of this document is to be determined entirely by the following claims, along with the full range of equivalents to which such claims are entitled.

Claims

1. A method comprising:
generating a browsing history language model based on browsing history information; and
in response to receiving a Latin character string via an input method editor interface, predicting a non-Latin character string based at least in part on the browsing history language model.
2. The method as recited in claim 1, wherein the browsing history information includes at least cached browsing content.
3. The method as recited in claim 2, wherein the browsing history information further includes real-time browsing content.
4. The method as recited in claim 1, wherein the predicted non-Latin character string is determined based on the browsing history language model and a general language model.
5. The method as recited in claim 4, wherein a contribution of the browsing history language model is determined based on a weighting factor.
6. The method as recited in claim 5, wherein the weighting factor includes a default weighting factor or a user-defined weighting factor.
7. The method as recited in claim 1, further comprising presenting the predicted non-Latin character string via the input method editor interface.
8. The method as recited in claim 1, wherein:
the Latin character string includes a Pinyin character string; and
the predicted non-Latin character string includes a Chinese character string.
9. The method as recited in claim 1, wherein:
a plurality of non-Latin character strings are associated with the Latin character string received via the input method editor interface; and
a conversion probability is associated with each non-Latin character string of the plurality of non-Latin character strings.
10. The method as recited in claim 9, wherein predicting the non-Latin character string includes identifying the non-Latin character string of the plurality of non-Latin character strings with a highest conversion probability.
11. The method as recited in claim 10, wherein a general language model identifies a first non-Latin character string of the plurality of non-Latin character strings as the non-Latin character string with the highest conversion probability.
12. The method as recited in claim 11, wherein the browsing history language model identifies a second non-Latin character string of the plurality of non-Latin character strings as the non-Latin character string with the highest conversion probability.
13. The method as recited in claim 12, wherein the first non-Latin character string identified by the general language model is different than the second non-Latin character string identified by the browsing history language model.
14. The method as recited in claim 1, wherein the browsing history language model includes an N-gram statistical language model.
15. A computing system comprising:
one or more processors;
one or more computer readable media maintaining instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising:
generating a browsing history language model based on browsing history information; and
in response to receiving a Latin character string via an input method editor interface, predicting a non-Latin character string based at least in part on the browsing history language model.
16. The computing system as recited in claim 15, the acts further comprising: detecting new browsing content; and
in response to detecting the new browsing content, processing the new browsing content to update the browsing history language model.
17. The computing system as recited in claim 15, the acts further comprising: periodically monitoring one or more browser cache locations to determine whether new browsing content has been saved to the one or more browser cache locations; and
processing the new browsing content to update the browsing history language model.
18. One or more computer readable media maintaining instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising:
generating a browsing history language model based on browsing history information; and
in response to receiving a Latin character string via an input method editor interface:
determining an overall conversion probability of each of a plurality of non-Latin character strings based on a first conversion probability determined based on a general language model and a second conversion probability determined based on the browsing history language model, wherein a contribution of the second conversion probability to the overall conversion probability is weighted based on a weighting factor;
ordering the plurality of non-Latin character strings based on the overall conversion probability; and
displaying an ordered list of non-Latin character strings via the input method editor interface.
19. One or more computer readable media as recited in claim 18, the acts further comprising:
receiving a user-defined weighting factor; and
modifying the weighting factor from a default weighting factor to the user-defined weighting factor.
20. One or more computer readable media as recited in claim 18, wherein the browsing history information includes information stored at a plurality of browser cache locations, each browser cache location associated with a different browser.
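Claims 4 to 6, 9, 10, 18 and 19 describe combining a conversion probability from a general language model with one from the browsing history language model, weighting the latter by a weighting factor, and ordering the candidates by the result. The claims do not fix a formula, so the sketch below assumes linear interpolation, overall = (1 - w) * general + w * history. The function name `rank_candidates`, the default weight of 0.3, and the example probabilities are hypothetical.

```python
def rank_candidates(general_probs, history_probs, weight=0.3):
    """Order candidate strings by a weighted mix of two conversion probabilities.

    general_probs: candidate -> probability from a general language model.
    history_probs: candidate -> probability from a browsing history model.
    weight:        contribution of the history model, between 0 and 1
                   (0.3 is an arbitrary illustrative default).

    Assumption: the two probabilities are combined by linear interpolation.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    candidates = set(general_probs) | set(history_probs)
    scored = {
        c: (1.0 - weight) * general_probs.get(c, 0.0)
        + weight * history_probs.get(c, 0.0)
        for c in candidates
    }
    # Highest overall probability first; ties are broken by candidate string.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


general = {"一起": 0.7, "仪器": 0.3}   # made-up general model probabilities
history = {"一起": 0.1, "仪器": 0.9}   # made-up browsing history probabilities

default_order = [c for c, _ in rank_candidates(general, history)]
user_order = [c for c, _ in rank_candidates(general, history, weight=0.5)]
print(default_order)  # ['一起', '仪器']
print(user_order)     # ['仪器', '一起']
```

With these made-up numbers, raising the weighting factor from the default to a user-defined 0.5 lets the browsing history model's preferred candidate move to the top of the ordered list, which mirrors the situation in claims 11 to 13 where the two models favor different strings.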
EP12883601.2A 2012-08-31 2012-08-31 Browsing history language model for input method editor Withdrawn EP2891036A4 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/080815 WO2014032265A1 (en) 2012-08-31 2012-08-31 Browsing history language model for input method editor

Publications (2)

Publication Number Publication Date
EP2891036A1 (en) 2015-07-08
EP2891036A4 EP2891036A4 (en) 2015-10-07

Family

ID=50182376

Family Applications (1)

Application Number Title Priority Date Filing Date
EP12883601.2A Withdrawn EP2891036A4 (en) 2012-08-31 2012-08-31 Browsing history language model for input method editor

Country Status (3)

Country Link
EP (1) EP2891036A4 (en)
CN (1) CN104813257A (en)
WO (1) WO2014032265A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8918408B2 (en) * 2012-08-24 2014-12-23 Microsoft Corporation Candidate generation for predictive input using input history
CN105404401A (en) 2015-11-23 2016-03-16 小米科技有限责任公司 Input processing method, apparatus and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters
US20080294982A1 (en) * 2007-05-21 2008-11-27 Microsoft Corporation Providing relevant text auto-completions
CN101876853B (en) * 2009-04-29 2012-11-14 北京搜狗科技发展有限公司 Pinyin input method and device
CN101995963B (en) * 2010-11-19 2012-07-04 哈尔滨工业大学 Vocabulary self-adaption Chinese input method

Also Published As

Publication number Publication date
WO2014032265A1 (en) 2014-03-06
EP2891036A4 (en) 2015-10-07
CN104813257A (en) 2015-07-29


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20150225

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RA4 Supplementary search report drawn up and despatched (corrected)

Effective date: 20150909

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 17/27 20060101ALI20150903BHEP

Ipc: G06F 17/22 20060101AFI20150903BHEP

17Q First examination report despatched

Effective date: 20150918

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20160129