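The current state of Internet-oriented text information processing and of speech and music search technology 【notes from what I learned while gathering material; not carefully organized】... - 白红宇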


Published: 2019-06-12


Speech recognition:

Key Words:

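Distributed Speech Recognition (DSR): the recognition function of an embedded speech recognition system is hosted on a server. [This does not mean distributed servers; it means that the terminal and the server stand in a distributed relationship^[8].]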

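Network Speech Recognition (NSR): the emphasis is on the network. The terminal transmits the speech signal efficiently and in real time, and the server processes it^[9]. At present, speech signals from the terminal are generally processed by a server or the cloud.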

Emotion Speech Recognition (ESR), Spoken Information Retrieval, Speech Recognition, Spoken Term Detection, Speaker Recognition, Voice Control, Language Modeling, Speech Signal Processing / Speech Processing, Speech Enhancement, Robust Speech Recognition, Feature Compensation, Model Compensation, Automatic Speech Recognition (ASR), Speech Separation, Signal Analysis, Acoustic Speech Recognition Systems, Voice Activity Detection (VAD: detects silent periods during communication to save bandwidth), Acoustic Feature Extraction (AFE)
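To make the VAD entry concrete, here is a minimal sketch of an energy-threshold detector. The frame length, the threshold, and the function names are assumptions chosen for illustration; practical detectors typically add noise tracking and smoothing.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_voice(samples, frame_len=160, threshold=0.01):
    """One boolean per frame: True where the frame energy exceeds the threshold."""
    return [e > threshold for e in frame_energies(samples, frame_len)]


# Synthetic signal at 8 kHz: two silent frames, two frames of a 200 Hz tone,
# then one more silent frame (each frame is 160 samples, i.e. 20 ms).
rate = 8000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(320)]
signal = silence * 2 + tone + silence

flags = detect_voice(signal)
print(flags)  # [False, False, True, True, False]
```

Only the frames flagged True would need to be transmitted, which is where the bandwidth saving comes from.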

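A Survey of Speech Recognition Technology^[1]:

Speech recognition system: acoustic model of speech (training/learning), pattern matching (recognition algorithm) | language model and language processing

Acoustic models: Dynamic Time Warping (DTW), Hidden Markov Model (HMM), Artificial Neural Network (ANN)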
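Of the three, DTW is the easiest to show in a few lines. The sketch below is the textbook dynamic-programming formulation on one-dimensional sequences with absolute difference as the local cost; real systems compare sequences of feature vectors, and the sample data here is made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1, 2, 3, 2, 1]
slow = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]   # same shape, spoken twice as slowly
other = [3, 3, 3, 3, 3]                 # a different pattern

print(dtw_distance(template, slow))   # 0.0
print(dtw_distance(template, other))  # 6.0
```

The slowed-down copy matches the template at zero cost, which is the point of warping the time axis.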

Language models: rule-based models, statistical models

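The main research difficulties at present are: (1) Speech recognition systems adapt poorly, chiefly because they depend strongly on the environment. (2) Progress in high-noise environments is difficult, because in noise a speaker's pronunciation changes considerably (the voice gets louder, speech slows down, and pitch and formants shift), so new signal analysis and processing methods are needed. (3) How to quantify and model knowledge from linguistics, physiology, and psychology and apply it effectively to speech recognition also remains difficult. (4) Our understanding of human auditory comprehension, knowledge accumulation, learning mechanisms, and the control mechanisms of the brain's nervous system is still very limited, which is bound to hinder further progress in speech recognition.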

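Current research hot topics in speech recognition include: robust speech recognition (recognition robustness), speech input devices, refinement of acoustic HMM models, speaker adaptation, large-vocabulary keyword recognition, efficient recognition (search) algorithms, confidence measure algorithms, applications of ANNs, language models, and deep natural language understanding.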

Speaker Adaptation (SA); Speaker Independent (SI); Speaker Dependent (SD) 『SA+SI』

Adaptation: batch, online, instantaneous | supervised or unsupervised

An Overview of Noise-Robust Automatic Speech Recognition^[2]:

Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics.

More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on winPhone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on xBox), and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs.

Music Search:

Key Words:

Speech Transcription, Multimedia Information Retrieval, Music Search, Search Engine, Mobile Internet, Music Retrieval, Audio Information Retrieval, Audio Mining, Adaptive Music Retrieval, Music Information Retrieval, Content-based Retrieval, Music Cognition, Music Creation, Music Database Retrieval, Query By Example (QBE), Query By Humming (QBH), Query By Voice (QBV), Audio-visual Speech Recognition, Speech-reading, Multimodal Database, Optical Music Recognition, Instrument Identification, Context-aware Music Retrieval (Content Based Music Retrieval), Music Recommendation, Commercial music recommenders, Contextual music recommendation and retrieval

Research methods: Fuzzy system, Neural network, Expert system, Genetic algorithm

Multi-version music identification techniques: feature extraction, key invariance, tempo invariance, structure invariance, similarity computing
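A small illustration of key invariance on symbolic (MIDI) data: comparing the intervals between successive notes instead of the absolute pitches makes a transposed version look identical to the original. This is only one simple way to obtain key invariance, chosen here for brevity; the note numbers are made-up examples and `intervals` is a hypothetical helper name.

```python
def intervals(notes):
    """Semitone steps between successive MIDI note numbers."""
    return [b - a for a, b in zip(notes, notes[1:])]


original = [60, 62, 64, 65, 67]          # C D E F G
transposed = [n + 5 for n in original]   # the same melody, a fourth higher
different = [60, 64, 67, 72, 76]         # another melody

print(intervals(original))                           # [2, 2, 1, 2]
print(intervals(original) == intervals(transposed))  # True
print(intervals(original) == intervals(different))   # False
```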

MIDI (Musical Instrument Digital Interface) format, WAVE (Waveform Audio File Format) format 『research generally works with the MIDI format』

Feature Extraction:

Time Domain 『ACF (Autocorrelation function), AMDF (Average magnitude difference function), SIFT (Simplified inverse filter tracking)』

Frequency Domain『Harmonic product spectrum, Cepstrum』
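As an example of the time-domain family, the sketch below estimates a fundamental frequency from the peak of the autocorrelation function. The search range of 50 to 500 Hz, the function name, and the synthetic test tone are assumptions for illustration; real pitch trackers work frame by frame and add voicing decisions.

```python
import math


def autocorrelation_pitch(samples, rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    lag_min = int(rate / f_max)   # shortest period considered
    lag_max = int(rate / f_min)   # longest period considered
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the signal with itself shifted by `lag`.
        value = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return rate / best_lag


rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]  # 200 Hz, 0.1 s

print(autocorrelation_pitch(tone, rate))  # 200.0
```

AMDF follows the same search loop but minimizes the mean absolute difference between the signal and its shifted copy instead of maximizing the product.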

Big Data for Musicology^[4]:

Automatic Music Transcription (AMT, the process of converting an acoustic musical signal into some form of musical notation)

The most popular approach is parallelisation with Map-Reduce, using the Hadoop framework.
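The Map-Reduce pattern itself can be sketched in plain Python without Hadoop. The example below counts pitch classes across a tiny, made-up collection of MIDI note lists; the function names and data are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(track_id, notes):
    """Emit one (pitch_class, 1) pair per MIDI note in a track."""
    return [(note % 12, 1) for note in notes]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


tracks = {"a": [60, 64, 67, 72], "b": [62, 64, 60]}

mapped = [pair for tid, notes in tracks.items() for pair in map_phase(tid, notes)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {0: 3, 4: 2, 7: 1, 2: 1}
```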

Modeling Concept Dynamics for Large Scale Music Search^[5]:

DMCM (Dynamic Musical Concept Mixture)

SMCH (Stochastic Music Concept Histogram)

The music preprocessing layer extracts multiple acoustic features and maps them into an audio word from a precomputed codebook.

The concept dynamics modeling layer derives from the underlying audio words a Stochastic Music Concept Histogram (SMCH), essentially a probability distribution over the high-level concepts.
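The preprocessing step (mapping features to audio words from a codebook) can be sketched as nearest-neighbour vector quantization followed by a normalized histogram. This is not the paper's DMCM model: the histogram below is over audio words, not over high-level concepts, and the two-dimensional codebook and frame features are made up for illustration.

```python
def nearest_word(feature, codebook):
    """Index of the codebook entry closest to the feature (squared Euclidean)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def word_histogram(features, codebook):
    """Relative frequency of each audio word over a sequence of frame features."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_word(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]


codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]               # precomputed audio words
frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.1, 0.9)]     # per-frame features

hist = word_histogram(frames, codebook)
print(hist)  # [0.25, 0.5, 0.25]
```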

Other techniques:

Wang J C, Shih Y C, Wu M S, et al. Colorizing tags in tag cloud: a novel query-by-tag music search system[C]// Proceedings of the 19th ACM international conference on Multimedia. ACM, 2011. 【Not closely tied to cloud computing; the focus is on clustering and classification. It suits my own taste, and it is a lot of fun!】

Acoustic model training

Text Processing:

Key Words:

text classification, maximum entropy model

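Maximum entropy: it reflects a simple principle by which people make sense of the world: when nothing is known about an event, choose a model whose distribution is as uniform as possible^[16]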
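A quick numerical check of the principle: among distributions over the same four outcomes, the uniform one has the largest Shannon entropy. The two distributions are arbitrary examples, and this only illustrates the idea; it is not a maximum entropy classifier.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))                    # 2.0
print(entropy(skewed) < entropy(uniform))  # True
```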

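Examples of text processing on cloud platforms:

【Graph data processing】

GraphLab: an open-source distributed computing system proposed by CMU; a new parallel framework aimed at machine learning

Pregel: a distributed graph computation framework proposed by Google, suited to complex machine learning

Horton: developed by Microsoft for graph data matching

At present, existing open-source frameworks such as Hadoop and Sector/Sphere are generally used to handle speech recognition.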

Multimedia Mining:

Image Mining, Video Mining, Audio Mining, Text Mining

Audio Mining:

To mine audio data, one could convert it into text using speech transcription techniques. Audio data could also be mined directly by using audio information processing techniques and then mining selected audio data.^[10]

In other words: either convert the audio to text and then do text mining, or process the audio signal directly and then mine the useful audio data.

The text-based approach, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find.^[11]

NLP: Natural Language Processing

Part-of-speech tagging (POS)

Named entity tagging (NE)

^[12]

word accuracy, hit and miss rates, response time, efficiency, precision and system compatibility

WIKI:

Document retrieval / text retrieval: form based 『suffix tree』, content based 『inverted index』^[13]

Full-text search / free-text search: a search engine examines all of the words in every stored document

"Text retrieval is a critical area of study today, since it is the fundamental basis of all Internet search engines."^[14]

String Searching : string matching

Indexing^[14]

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive."
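The indexing and search stages described above can be sketched as a small inverted index. The stop-word set and the word-to-stem table are tiny hand-made stand-ins (assumptions for illustration); a real indexer would use a proper stemmer or lemmatizer for the language.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "of"}   # hypothetical, deliberately tiny
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}  # stand-in for stemming


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Indexing stage: map each term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[STEMS.get(word, word)].append((doc_id, position))
    return index


def search(index, word):
    """Search stage: consult only the index, never the original text."""
    term = STEMS.get(word.lower(), word.lower())
    return sorted({doc_id for doc_id, _ in index.get(term, [])})


docs = {
    1: "She drives the car",
    2: "He drove and she has driven",
    3: "The car of the year",
}
index = build_index(docs)

print(index["drive"])           # [(1, 1), (2, 1), (2, 5)]
print(search(index, "driven"))  # [1, 2]
print(search(index, "car"))     # [1, 3]
print(search(index, "the"))     # []
```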

MIR^[15]

Music Information Retrieval:

MIR uses audio signal analysis to extract meaningful features of music.

Recommender systems : few are based upon MIR techniques, instead making use of similarity between users or laborious data compilation.

Track separation and instrument recognition

Automatic music transcription: converting an audio recording into symbolic notation; involves multi-pitch detection, onset detection, duration estimation, instrument identification, and the extraction of rhythmic information

Automatic categorization

Music generation

Contextual music information retrieval and recommendation: State of the art and challenges^[3]:

Level	Useful for	Not useful for
event-scale information (i.e., transcribing individual notes or chords)	instrument detection, QBE, QBH	describing music
phrase-level information (i.e., analyzing note sequences for periodicities)	analyzing longer temporal excerpts; tempo detection, playlist sequencing, music summarization	
piece-level information (i.e., analyzing longer excerpts of audio tracks)	a more abstract representation of a music track; user's perception of music; genre detection, content-based music recommenders	

Four levels of retrieval tasks: 『research focuses mainly on the genre, work, and instance levels』

genre level: searching for rock songs is a task at a genre level

artist level: looking for artists similar to Björk is clearly a task at an artist level

work level: finding cover versions of the song “Let it Be” by The Beatles is a task at a work level

instance level: identifying a particular recording of Mahler’s fifth symphony is a task at an instance level

Content-based music information retrieval: QBE, QBH, Genre Classification

Music recommendation: Collaborative filtering (CF), Content-based approach 『rarely used for music recommendation; the two approaches can be combined』
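A minimal sketch of user-based collaborative filtering: predict a user's rating for a song as the similarity-weighted average of other users' ratings. The users, songs, and ratings are invented, and cosine similarity over co-rated items is just one common choice, not something the survey prescribes.

```python
import math


def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings; None if nobody rated it."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[item]
        den += abs(sim)
    return num / den if den else None


ratings = {
    "ann": {"song_a": 5, "song_b": 4},
    "bob": {"song_a": 5, "song_b": 4, "song_c": 5},   # same taste as ann
    "cat": {"song_a": 1, "song_b": 5, "song_c": 1},   # different taste
}

score = predict(ratings, "ann", "song_c")
print(round(score, 2))
```

The prediction falls between bob's 5 and cat's 1 but closer to bob, because bob's ratings agree with ann's on the songs they share.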

Contextual and social music retrieval and recommendation:

Environment-related context (season, temperature, time, weather conditions), User-related context (activity, demographics, emotional state), Multimedia context (text, images)

Emotion recognition in music: ML

Music and the social web: Tag acquisition 『can be used for MIR and music recommendation』

Other Key Words: Activity recognition, computational data mining, raw audio, Clustering, Classification, regression, vector machines, KDD(Knowledge Discovery in Database), Acoustic Vector Sensors (AVS), Direction of arrival (DOA), Analog-to-digital Converter(ADC)

References:

[1] 邢铭生, 朱浩, 王宏斌. 语音识别技术综述[J]. 科协论坛, 2010, (3):62-63. DOI:10.3969/j.issn.1007-3973.2010.03.033.

[2] Li J, Deng L, Gong Y, et al. An Overview of Noise-Robust Automatic Speech Recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(4):745-777.

[3] Kaminskas M, Ricci F. Contextual music information retrieval and recommendation: State of the art and challenges[J]. Computer Science Review, 2012, 6:89–119.

[4] Weyde, Tillman, et al. "Big Data for Musicology." Proceedings of the 1st International Workshop on Digital Libraries for Musicology. ACM, 2014.

[5] Shen J, Pang H H, Wang M, et al. Modeling concept dynamics for large scale music search[J]. Research Collection School of Information Systems, 2012:455-464.

[6] Low Y, Gonzalez J, Kyrola A, et al. GraphLab: A Distributed Framework for Machine Learning in the Cloud[J]. Eprint Arxiv, 2011.

[7] Bhatt C A, Kankanhalli M S. Multimedia data mining: state of the art and challenges[J]. Multimedia Tools & Applications, 2011, 51(1):35-76.

[8] 姜干新. 基于HMM的分布式语音识别系统的研究与应用[D]. 浙江大学计算机科学与技术学院, 2010.

[9] Shahzad Hussain. Web Based Network Speech Recognition[D]. Tampereen teknillinen yliopisto - Tampere University of Technology, 2013.

[10] Kamde P M, Algur S P. A Survey on Web Multimedia Mining[J]. International Journal of Multimedia & Its Applications, 2011, 3(3).

[11] Bhatt C A, Kankanhalli M S. Multimedia data mining: state of the art and challenges[J]. Multimedia Tools & Applications, 2011, 51(1):35-76.

[12] 李维. 立委随笔：机器学习和自然语言处理. 2010-2-13.

[13] Wikipedia. , 22 June 2015.

[14] Wikipedia