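How To Make A Speech Synthesis Editor

Published: 2022-03-10

Quick summary ↬ Voice assistants are on their way into people's homes, wrists, and pockets. That means that some of our content will be spoken out loud with the help of digital speech synthesis. In this tutorial, you will learn how to make a What You Get Is What You Hear (WYGIWYH) editor for speech synthesis using Sanity.io's editor for Portable Text.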

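When Steve Jobs unveiled the Macintosh in 1984, it said "Hello" to us from the stage. Even at that point, speech synthesis wasn't really a new technology: Bell Labs had developed a speech synthesizer as early as the late 1930s, and Stanley Kubrick made a speech synthesizer the voice of HAL 9000 in 2001: A Space Odyssey (1968).

It wasn't until Apple's Siri, Amazon Echo, and Google Assistant were introduced in the 2010s that voice interfaces actually found their way into a broader public's homes, wrists, and pockets. We're still in an adoption phase, but it seems that these voice assistants are here to stay.

In other words, the web isn't just passive text on a screen anymore. Web editors and UX designers have to get accustomed to making content and services that should be spoken out loud.

We're already moving fast towards content management systems that let us work with our content headlessly and through APIs. The final piece is to make editorial interfaces that make it easier to tailor content for voice. So let's do just that!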


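What Is SSML

While web browsers use W3C's specification for HyperText Markup Language (HTML) to visually render documents, most voice assistants use Speech Synthesis Markup Language (SSML) when generating speech.

A minimal example using the root element <speak>, and the paragraph (<p>) and sentence (<s>) tags: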

 <speak>
   <p>
     <s>This is the first sentence of the paragraph.</s>
     <s>Here's another sentence.</s>
   </p>
 </speak>
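
As a rough illustration of how such a document could be assembled in code, here is a minimal sketch. The helper names escapeXml and toMinimalSSML are made up for this example and are not part of any library. The sketch assumes the input is plain text that needs XML escaping before it is wrapped in tags.

```javascript
// Escape the characters that have a special meaning in XML.
function escapeXml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;')
}

// paragraphs: an array of paragraphs, each one an array of sentence strings.
// (Hypothetical helper, only meant to show the <speak>/<p>/<s> nesting.)
function toMinimalSSML(paragraphs) {
  const body = paragraphs
    .map(sentences => {
      const inner = sentences.map(s => `<s>${escapeXml(s)}</s>`).join('')
      return `<p>${inner}</p>`
    })
    .join('')
  return `<speak>${body}</speak>`
}

const minimalSSML = toMinimalSSML([
  ['This is the first sentence of the paragraph.', "Here's another sentence."]
])
console.log(minimalSSML)
```

The output is the same structure as the hand-written example, only without the whitespace between tags and with the apostrophe escaped.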


SSML gets more interesting when we introduce tags for <emphasis> and <prosody> (pitch):

 <speak>
   <p>
     <s>Put some <emphasis level="strong">extra weight on these words</emphasis></s>
     <s>And say <prosody pitch="high" rate="fast">this a bit higher and faster</prosody>!</s>
   </p>
 </speak>


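SSML has more features, but this is enough to get a feel for the basics. Now, let's take a closer look at the editor that we will use to make the speech synthesis editing interface.

The Editor For Portable Text

To make this editor, we'll use the editor for Portable Text that comes with Sanity.io. Portable Text is a JSON specification for rich text editing that can be serialized into any markup language, such as SSML. This means you can easily use the same text snippet in multiple places using different markup languages.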

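Installing Sanity

Sanity.io is a platform for structured content that comes with an open-source editing environment built with React.js. It takes two minutes to get it all up and running.

Type npm i -g @sanity/cli && sanity init into your terminal, and follow the instructions. When you're prompted for a project template, choose the "empty" one.

If you don't want to follow this tutorial and make this editor from scratch, you can also clone this tutorial's code and follow the instructions in README.md.

Once the editor is downloaded, run sanity start in the project folder to start it up. It will start a development server that uses Hot Module Reloading to update changes as you edit its files.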

How To Configure Schemas In Sanity Studio

Creating The Editor Files

We'll start by making a folder called ssml-editor in the /schemas folder. In that folder, we'll put some empty files:

 /ssml-tutorial/schemas/ssml-editor
 ├── alias.js
 ├── emphasis.js
 ├── annotations.js
 ├── preview.js
 ├── prosody.js
 ├── sayAs.js
 ├── blocksToSSML.js
 ├── speech.js
 ├── SSMLeditor.css
 └── SSMLeditor.js

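Now we can add content schemas in these files. Content schemas define the data structure for the rich text, and are what Sanity Studio uses to generate the editorial interface. They are simple JavaScript objects that mostly require just a name and a type.

We can also add a title and a description to make it a bit nicer for editors. For example, this is a schema for a simple text field for a title: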

 export default {
   name: 'title',
   type: 'string',
   title: 'Title',
   description: 'Titles should be short and descriptive'
 }

Figure: The studio with our title field and the default editor.

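Portable Text is built on the idea of rich text as data. This is powerful because it lets you query your rich text, and convert it into pretty much any markup you want.

It is an array of objects called "blocks", which you can think of as "paragraphs". In a block, there is an array of child spans. Each block can have a style and a set of mark definitions, which describe data structures distributed over the child spans.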
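
To make that structure a little more concrete, here is a small hand-written Portable Text value together with two helpers that read it. This is only a sketch: the _key value 'a1', the sample text, and the helper names toPlainText and annotationTypes are assumptions made for illustration, not something the editor or Sanity provides under these names.

```javascript
// A hand-written Portable Text value with one block ("paragraph").
const sampleBlocks = [
  {
    _type: 'block',
    style: 'normal',
    // Mark definitions hold the data for the annotations used in this block.
    markDefs: [{ _key: 'a1', _type: 'emphasis', level: 'strong' }],
    children: [
      { _type: 'span', text: 'Say ', marks: [] },
      // This span points at the mark definition above through its _key.
      { _type: 'span', text: 'this', marks: ['a1'] },
      { _type: 'span', text: ' out loud.', marks: [] }
    ]
  }
]

// Join the text of all spans in each block, one line per block.
function toPlainText(blocks) {
  return blocks
    .filter(block => block._type === 'block')
    .map(block => block.children.map(span => span.text).join(''))
    .join('\n')
}

// List the annotation types that are actually referenced by a block's spans.
function annotationTypes(block) {
  const used = new Set(block.children.flatMap(span => span.marks || []))
  return (block.markDefs || [])
    .filter(def => used.has(def._key))
    .map(def => def._type)
}

console.log(toPlainText(sampleBlocks))
console.log(annotationTypes(sampleBlocks[0]))
```

Because the rich text is plain data, reading it like this needs nothing more than ordinary array methods.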

Sanity.io comes with an editor that can read and write Portable Text. It is activated by placing the block type inside an array field, like this:

 // speech.js
 export default {
   name: 'speech',
   type: 'array',
   title: 'SSML Editor',
   of: [
     { type: 'block' }
   ]
 }

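An array can be of multiple types. For an SSML editor, those could be blocks for audio files, but that falls outside the scope of this tutorial.

The last thing we want to do is to add a content type where this editor can be used. Most assistants use a simple content model of "intents" and "fulfillments":

Intents
Usually a list of strings used by the AI model to delineate what the user wants to get done.
Fulfillments
This happens when an "intent" is identified. A fulfillment often is, or at least comes with, some sort of response.

So let's make a simple content type called fulfillment that uses the speech synthesis editor. Make a new file called fulfillment.js and save it in the /schemas folder: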

 // fulfillment.js
 export default {
   name: 'fulfillment',
   type: 'document',
   title: 'Fulfillment',
   // A document lists its fields under `fields` (`of` is for array types)
   fields: [
     {
       name: 'title',
       type: 'string',
       title: 'Title',
       description: 'Titles should be short and descriptive'
     },
     {
       name: 'response',
       type: 'speech'
     }
   ]
 }

Save the file, and open schema.js. Add it to your studio like this:

 // schema.js
 import createSchema from 'part:@sanity/base/schema-creator'
 import schemaTypes from 'all:part:@sanity/base/schema-type'
 // fulfillment.js lives directly in /schemas
 import fulfillment from './fulfillment'
 // speech.js lives in the ssml-editor folder
 import speech from './ssml-editor/speech'

 export default createSchema({
   name: 'default',
   types: schemaTypes.concat([
     fulfillment,
     speech
   ])
 })

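If you now run sanity start in your command line interface within the project's root folder, the studio will start up locally, and you'll be able to add entries for fulfillments. You can keep the studio running while we go on, as it will auto-reload with new changes when you save the files.

Adding SSML To The Editor

By default, the block type will give you a standard editor for visually oriented rich text with heading styles, decorator styles for emphasis and strong, annotations for links, and lists. Now we want to override those with the audial concepts found in SSML.

We begin by defining the different content structures, with helpful descriptions for the editors, that we will add to the block in speech.js as configuration for annotations. Those are "emphasis", "alias", "prosody", and "say as".

Emphasis

We begin with "emphasis", which controls how much weight is put on the marked text. We define it as an object with a single string field, level, that has a list of predefined values the user can choose from: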

 // emphasis.js
 export default {
   name: 'emphasis',
   type: 'object',
   title: 'Emphasis',
   description: 'The strength of the emphasis put on the contained text',
   fields: [
     {
       name: 'level',
       type: 'string',
       options: {
         list: [
           { value: 'strong', title: 'Strong' },
           { value: 'moderate', title: 'Moderate' },
           { value: 'none', title: 'None' },
           { value: 'reduced', title: 'Reduced' }
         ]
       }
     }
   ]
 }

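Alias

Sometimes the written and the spoken term differ. For instance, you may want to use the abbreviation of a phrase in written text, but have the whole phrase read aloud. For example: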

 <s>This is a <sub alias="Speech Synthesis Markup Language">SSML</sub> tutorial</s>


The input field for the alias is a simple string:

 // alias.js
 export default {
   name: 'alias',
   type: 'object',
   title: 'Alias (sub)',
   description: 'Replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.',
   fields: [
     {
       name: 'text',
       type: 'string',
       title: 'Replacement text'
     }
   ]
 }

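Prosody

With the prosody property we can control different aspects of how text should be spoken, like pitch, rate, and volume. The markup for this can look like this: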

 <s>Say this with an <prosody pitch="x-low">extra low pitch</prosody>, and this <prosody rate="fast" volume="loud">loudly with a fast rate</prosody></s>


This input will have three fields with predefined string options:

 // prosody.js
 export default {
   name: 'prosody',
   type: 'object',
   title: 'Prosody',
   description: 'Control of the pitch, speaking rate, and volume',
   fields: [
     {
       name: 'pitch',
       type: 'string',
       title: 'Pitch',
       description: 'The baseline pitch for the contained text',
       options: {
         list: [
           { value: 'x-low', title: 'Extra low' },
           { value: 'low', title: 'Low' },
           { value: 'medium', title: 'Medium' },
           { value: 'high', title: 'High' },
           { value: 'x-high', title: 'Extra high' },
           { value: 'default', title: 'Default' }
         ]
       }
     },
     {
       name: 'rate',
       type: 'string',
       title: 'Rate',
       description: 'A change in the speaking rate for the contained text',
       options: {
         list: [
           { value: 'x-slow', title: 'Extra slow' },
           { value: 'slow', title: 'Slow' },
           { value: 'medium', title: 'Medium' },
           { value: 'fast', title: 'Fast' },
           { value: 'x-fast', title: 'Extra fast' },
           { value: 'default', title: 'Default' }
         ]
       }
     },
     {
       name: 'volume',
       type: 'string',
       title: 'Volume',
       description: 'The volume for the contained text.',
       options: {
         list: [
           { value: 'silent', title: 'Silent' },
           { value: 'x-soft', title: 'Extra soft' },
           // SSML also defines "soft" between "x-soft" and "medium"
           { value: 'soft', title: 'Soft' },
           { value: 'medium', title: 'Medium' },
           { value: 'loud', title: 'Loud' },
           { value: 'x-loud', title: 'Extra loud' },
           { value: 'default', title: 'Default' }
         ]
       }
     }
   ]
 }
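
To show how values chosen in these three fields could end up as attributes on a <prosody> tag, here is a minimal sketch. The function name prosody and the PROSODY_VALUES table are assumptions made for this example; the listed values are the named labels SSML defines for each attribute. The text passed in is assumed to be XML-escaped already.

```javascript
// The named values SSML defines for each prosody attribute.
const PROSODY_VALUES = {
  pitch: ['x-low', 'low', 'medium', 'high', 'x-high', 'default'],
  rate: ['x-slow', 'slow', 'medium', 'fast', 'x-fast', 'default'],
  volume: ['silent', 'x-soft', 'soft', 'medium', 'loud', 'x-loud', 'default']
}

// Wrap text in a <prosody> tag, leaving out attributes that were not set.
// (Hypothetical helper; `text` is assumed to be XML-escaped already.)
function prosody(text, settings) {
  const attrs = Object.keys(PROSODY_VALUES)
    .filter(name => settings[name] !== undefined)
    .map(name => {
      if (!PROSODY_VALUES[name].includes(settings[name])) {
        throw new Error(`Unsupported ${name}: ${settings[name]}`)
      }
      return ` ${name}="${settings[name]}"`
    })
    .join('')
  return `<prosody${attrs}>${text}</prosody>`
}

console.log(prosody('loudly with a fast rate', { rate: 'fast', volume: 'loud' }))
```

Skipping unset attributes matters because an editor will often fill in only one of the three fields.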

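Say As

The last one we want to include is <say-as>. This tag lets us exercise a bit more control over how certain information is pronounced. We can even use it to bleep out words if you need something redacted in voice interfaces. That's @!%& useful!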

 <s>Do I have to <say-as interpret-as="expletive">frakking</say-as> <say-as interpret-as="verbatim">spell</say-as> it out for you!?</s>


 // sayAs.js
 export default {
   name: 'sayAs',
   type: 'object',
   title: 'Say as...',
   description: 'Lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.',
   fields: [
     {
       name: 'interpretAs',
       type: 'string',
       title: 'Interpret as...',
       options: {
         list: [
           { value: 'cardinal', title: 'Cardinal numbers' },
           { value: 'ordinal', title: 'Ordinal numbers (1st, 2nd, 3rd...)' },
           { value: 'characters', title: 'Spell out characters' },
           { value: 'fraction', title: 'Say numbers as fractions' },
           { value: 'expletive', title: 'Bleep out this word' },
           { value: 'unit', title: 'Adapt unit to singular or plural' },
           { value: 'verbatim', title: 'Spell out letter by letter (verbatim)' },
           { value: 'date', title: 'Say as a date' },
           { value: 'telephone', title: 'Say as a telephone number' }
         ]
       }
     },
     {
       name: 'date',
       type: 'object',
       title: 'Date',
       fields: [
         {
           name: 'format',
           type: 'string',
           description: 'The format attribute is a sequence of date field character codes. Supported field character codes in format are {y, m, d} for year, month, and day (of the month) respectively. If the field code appears once for year, month, or day then the number of digits expected are 4, 2, and 2 respectively. If the field code is repeated then the number of expected digits is the number of times the code is repeated. Fields in the date text may be separated by punctuation and/or spaces.'
         },
         {
           name: 'detail',
           type: 'number',
           validation: Rule => Rule.required().min(0).max(2),
           // Double quotes here, because the text itself contains single quotes
           description: "The detail attribute controls the spoken form of the date. For detail='1' only the day fields and one of month or year fields are required, although both may be supplied"
         }
       ]
     }
   ]
 }
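
The schema above stores a date object with format and detail next to interpretAs. As a sketch of how those extra values could be turned into attributes on a <say-as> tag, consider the following. The function name sayAs and the decision to emit format and detail only for dates are assumptions of this example, and whether a given speech engine honors these attributes is something to check against its documentation.

```javascript
// mark mirrors the shape produced by the sayAs schema, for example:
// { interpretAs: 'date', date: { format: 'yyyymmdd', detail: 1 } }
// (Hypothetical helper; `text` is assumed to be XML-escaped already.)
function sayAs(text, mark) {
  const attrs = { 'interpret-as': mark.interpretAs }
  // In this sketch, only dates carry the extra attributes.
  if (mark.interpretAs === 'date' && mark.date) {
    if (mark.date.format) attrs.format = mark.date.format
    if (mark.date.detail !== undefined) attrs.detail = String(mark.date.detail)
  }
  const attrString = Object.entries(attrs)
    .map(([name, value]) => ` ${name}="${value}"`)
    .join('')
  return `<say-as${attrString}>${text}</say-as>`
}

console.log(
  sayAs('2022-03-10', { interpretAs: 'date', date: { format: 'yyyymmdd', detail: 1 } })
)
console.log(sayAs('spell', { interpretAs: 'verbatim' }))
```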

Now we can import these in an annotations.js file, which makes things a bit tidier.

 // annotations.js
 export {default as alias} from './alias'
 export {default as emphasis} from './emphasis'
 export {default as prosody} from './prosody'
 export {default as sayAs} from './sayAs'

Now we can import these annotation types into our main schema:

 // schema.js
 import createSchema from 'part:@sanity/base/schema-creator'
 import schemaTypes from 'all:part:@sanity/base/schema-type'
 import fulfillment from './fulfillment'
 import speech from './ssml-editor/speech'
 // annotations.js lives in the ssml-editor folder, next to speech.js
 import { alias, emphasis, prosody, sayAs } from './ssml-editor/annotations'

 export default createSchema({
   name: 'default',
   types: schemaTypes.concat([
     fulfillment,
     speech,
     alias,
     emphasis,
     prosody,
     sayAs
   ])
 })

Finally, we can now add these to the editor like this:

 // speech.js
 export default {
   name: 'speech',
   type: 'array',
   title: 'SSML Editor',
   of: [
     {
       type: 'block',
       styles: [],
       lists: [],
       marks: {
         decorators: [],
         annotations: [
           {type: 'alias'},
           {type: 'emphasis'},
           {type: 'prosody'},
           {type: 'sayAs'}
         ]
       }
     }
   ]
 }

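Notice that we also added empty arrays to styles, lists, and decorators. This disables the default styles, lists, and decorators (like bold and emphasis), since they don't make much sense in this specific case.

Customizing The Look And Feel

Now we have the functionality in place, but since we haven't specified any icons, each annotation will use the default icon, which makes the editor hard for authors to actually use. So let's fix that!

With the editor for Portable Text, it's possible to inject React components both for the icons and for how the marked text should be rendered. Here, we'll just let some emoji do the work for us, but you could obviously go far with this, making them dynamic and so on. For prosody, we'll even make the icon change depending on the volume selected. Note that I omitted the fields in these snippets for brevity; you shouldn't remove them in your local files.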

 // alias.js
 import React from 'react'

 export default {
   name: 'alias',
   type: 'object',
   title: 'Alias (sub)',
   description: 'Replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form.',
   fields: [
     /* all the fields */
   ],
   blockEditor: {
     // Any emoji works here; pick one that makes sense to your editors
     icon: () => '🔤',
     render: ({ children }) => <span>{children} 🔤</span>,
   },
 };

 // emphasis.js
 import React from 'react'

 export default {
   name: 'emphasis',
   type: 'object',
   title: 'Emphasis',
   description: 'The strength of the emphasis put on the contained text',
   fields: [
     /* all the fields */
   ],
   blockEditor: {
     icon: () => '🗯',
     render: ({ children }) => <span>{children} 🗯</span>,
   },
 };

 // prosody.js
 import React from 'react'

 export default {
   name: 'prosody',
   type: 'object',
   title: 'Prosody',
   description: 'Control of the pitch, speaking rate, and volume',
   fields: [
     /* all the fields */
   ],
   blockEditor: {
     icon: () => '🔊',
     // Show a louder speaker icon for the loud volume settings
     render: ({ children, volume }) => (
       <span>
         {children} {['x-loud', 'loud'].includes(volume) ? '🔊' : '🔈'}
       </span>
     ),
   },
 };

 // sayAs.js
 import React from 'react'

 export default {
   name: 'sayAs',
   type: 'object',
   title: 'Say as...',
   description: 'Lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.',
   fields: [
     /* all the fields */
   ],
   blockEditor: {
     icon: () => '🗣',
     render: props => <span>{props.children} 🗣</span>,
   },
 };

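Now you have an editor for editing text that can be used by voice assistants. But wouldn't it be useful if editors could also preview how the text will actually sound?

Adding A Preview Button Using Google's Text-to-Speech

Native speech synthesis support is actually on its way for browsers. But in this tutorial, we'll use Google's Text-to-Speech API, which supports SSML. Building this preview functionality will also demonstrate how you serialize Portable Text into SSML in whatever service you want to use this for.

Wrapping The Editor In A React Component

We begin by opening the SSMLeditor.js file and adding the following code: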

 // SSMLeditor.js
 import React, { Fragment } from 'react';
 import { BlockEditor } from 'part:@sanity/form-builder';

 export default function SSMLeditor(props) {
   return (
     <Fragment>
       <BlockEditor {...props} />
     </Fragment>
   );
 }

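We have now wrapped the editor in our own React component. All the props it needs, including the data it contains, are passed down in real time. To actually use this component, you have to import it into your speech.js file: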

 // speech.js
 import React from 'react'
 import SSMLeditor from './SSMLeditor.js'

 export default {
   name: 'speech',
   type: 'array',
   title: 'SSML Editor',
   inputComponent: SSMLeditor,
   of: [
     {
       type: 'block',
       styles: [],
       lists: [],
       marks: {
         decorators: [],
         annotations: [
           { type: 'alias' },
           { type: 'emphasis' },
           { type: 'prosody' },
           { type: 'sayAs' },
         ],
       },
     },
   ],
 }

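When you save this and the studio reloads, it should look pretty much exactly the same, but that's because we haven't started tweaking the editor yet.

Convert Portable Text To SSML

The editor will save the content as Portable Text, an array of objects in JSON that makes it easy to convert rich text into whatever format you need it to be. When you convert Portable Text into another syntax or format, we call that "serialization". Hence, "serializers" are the recipes for how the rich text should be converted. In this section, we will add serializers for speech synthesis.

You have already made the blocksToSSML.js file. Now we need to add our first dependency. Begin by running the terminal command npm init -y inside the ssml-editor folder. This will add a package.json where the editor's dependencies will be listed.

Once that's done, you can run npm install @sanity/block-content-to-html to get a library that makes it easier to serialize Portable Text. We're using the HTML library because SSML has the same XML syntax with tags and attributes.

This is a bunch of code, so feel free to copy-paste it. I'll explain the pattern right below the snippet: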

 // blocksToSSML.js
 import blocksToHTML, { h } from '@sanity/block-content-to-html'

 // One serializer per annotation type, keyed by the annotation's name
 const serializers = {
   marks: {
     prosody: ({ children, mark: { rate, pitch, volume } }) =>
       h('prosody', { attrs: { rate, pitch, volume } }, children),
     alias: ({ children, mark: { text } }) =>
       h('sub', { attrs: { alias: text } }, children),
     sayAs: ({ children, mark: { interpretAs } }) =>
       h('say-as', { attrs: { 'interpret-as': interpretAs } }, children),
     // Only used if you add a `break` annotation to the schema.
     // The template literal needs backticks to interpolate `time`.
     break: ({ children, mark: { time, strength } }) =>
       h('break', { attrs: { time: `${time}ms`, strength } }, children),
     emphasis: ({ children, mark: { level } }) =>
       h('emphasis', { attrs: { level } }, children)
   }
 }

 export const blocksToSSML = blocks => blocksToHTML({ blocks, serializers })

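This code will export a function that takes the array of blocks and loops through them. Whenever a block contains a mark, it will look for a serializer for that type. If you have marked some text to have emphasis, it will use this function from the serializers object: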

 emphasis: ({ children, mark: { level } }) => h('emphasis', { attrs: { level } }, children)

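Maybe you recognize the level parameter from where we defined the schema? The h() function lets us define an HTML element; that is, here we "cheat" and make it return an SSML element called <emphasis>. We also give it the attribute level if that is defined, and place the children elements within it, which in most cases will be the text you have marked up with emphasis.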

 {
   "_type": "block",
   "_key": "f2c4cf1ab4e0",
   "style": "normal",
   "markDefs": [
     {
       "_type": "emphasis",
       "_key": "99b28ed3fa58",
       "level": "strong"
     }
   ],
   "children": [
     {
       "_type": "span",
       "_key": "f2c4cf1ab4e01",
       "text": "Say this strongly!",
       "marks": [
         "99b28ed3fa58"
       ]
     }
   ]
 }

That is how the above structure in Portable Text gets serialized to this SSML:

 <emphasis level="strong">Say this strongly!</emphasis>
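
To make the lookup between spans, mark definitions, and serializers easier to follow, here is a dependency-free sketch of the same idea. It is not how @sanity/block-content-to-html is implemented. The names markSerializers, serializeSpan, and blocksToSSMLSketch are made up for this example, and wrapping each block in <p> and the whole result in <speak> is a choice of this sketch.

```javascript
// Escape text and attribute values before placing them in the markup.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// One function per annotation type, each wrapping already-serialized text.
const markSerializers = {
  emphasis: (text, mark) =>
    `<emphasis level="${escapeXml(mark.level)}">${text}</emphasis>`,
  alias: (text, mark) => `<sub alias="${escapeXml(mark.text)}">${text}</sub>`
}

function serializeSpan(span, markDefs) {
  // Start from the escaped text and wrap it once per mark on the span.
  return (span.marks || []).reduce((text, key) => {
    const mark = markDefs.find(def => def._key === key)
    const serialize = mark && markSerializers[mark._type]
    // Marks without a serializer are skipped, leaving the text untouched.
    return serialize ? serialize(text, mark) : text
  }, escapeXml(span.text))
}

function blocksToSSMLSketch(blocks) {
  const paragraphs = blocks
    .filter(block => block._type === 'block')
    .map(block => {
      const inner = block.children
        .map(span => serializeSpan(span, block.markDefs || []))
        .join('')
      return `<p>${inner}</p>`
    })
  return `<speak>${paragraphs.join('')}</speak>`
}

const emphasisBlock = {
  _type: 'block',
  markDefs: [{ _type: 'emphasis', _key: '99b28ed3fa58', level: 'strong' }],
  children: [
    { _type: 'span', text: 'Say this strongly!', marks: ['99b28ed3fa58'] }
  ]
}
console.log(blocksToSSMLSketch([emphasisBlock]))
```

The key step is the lookup in serializeSpan: a span only stores the _key of its marks, and the data for the annotation (here, level) lives in the block's markDefs.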

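If you want support for more SSML tags, you can add more annotations in the schema, and add the annotation types to the marks section in the serializers.

Now we have a function that returns SSML markup from our marked-up rich text. The last part is to make a button that lets us send this markup to a text-to-speech service.

Adding A Preview Button That Speaks Back To You

Ideally, we would have used the browser's speech synthesis capabilities in the Web API. That way, we would have gotten away with less code and fewer dependencies.

As of early 2019, however, native browser support for speech synthesis is still in its early stages. It looks like support for SSML is on the way, and there are proofs of concept of client-side JavaScript implementations for it.

Chances are that you are going to use this content with a voice assistant anyway. Both Google Assistant and Amazon Echo (Alexa) support SSML as responses in a fulfillment. In this tutorial, we will use Google's text-to-speech API, which also sounds good and supports several languages.

Start by obtaining an API key by signing up for Google Cloud Platform (the first 1 million characters you process will be free). Once you're signed up, you can create a new API key.

Now create a PreviewButton.js file in the ssml-editor folder, and add this code to it: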

 // PreviewButton.js
 import React from 'react'
 import Button from 'part:@sanity/components/buttons/default'
 import { blocksToSSML } from './blocksToSSML'

 // You should be careful with sharing this key
 // I put it here to keep the code simple
 const API_KEY = '<yourAPIkey>'
 const GOOGLE_TEXT_TO_SPEECH_URL = 'https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=' + API_KEY

 const speak = async blocks => {
   // Serialize blocks to SSML
   const ssml = blocksToSSML(blocks)
   // Prepare the Google Text-to-Speech configuration
   const body = JSON.stringify({
     input: { ssml },
     // Select the language code and voice name (A-F)
     voice: { languageCode: 'en-US', name: 'en-US-Wavenet-A' },
     // Use MP3 in order to play in browser
     audioConfig: { audioEncoding: 'MP3' }
   })
   // Send the SSML string to the API
   const res = await fetch(GOOGLE_TEXT_TO_SPEECH_URL, {
     method: 'POST',
     body
   }).then(res => res.json())
   // Play the returned audio with the browser's Audio API.
   // The MIME type matches the MP3 encoding requested above.
   const audio = new Audio('data:audio/mpeg;base64,' + res.audioContent)
   audio.play()
 }

 export default function PreviewButton (props) {
   return <Button style={{ marginTop: '1em' }} onClick={() => speak(props.blocks)}>Speak text</Button>
 }

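I've kept this preview button code to a minimum to make it easier to follow this tutorial. Of course, you could build it out by adding state to show if the preview is processing, or make it possible to preview with the different voices that Google's API supports.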
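
As one possible starting point for those extensions, here is a sketch that separates the request body and the status handling from the React component. The names buildSynthesizeBody, toAudioDataUri, and previewWithStatus are hypothetical, and synthesize is an injected function, assumed to send the body to the API and resolve with the parsed JSON response. Injecting it means the logic can be tried without a network call.

```javascript
// Build the JSON body for the request, with an optional voice override.
function buildSynthesizeBody(ssml, voice = {}) {
  return JSON.stringify({
    input: { ssml },
    voice: {
      languageCode: voice.languageCode || 'en-US',
      name: voice.name || 'en-US-Wavenet-A'
    },
    audioConfig: { audioEncoding: 'MP3' }
  })
}

// Turn the base64 audioContent of a response into a playable data URI.
function toAudioDataUri(audioContent) {
  return 'data:audio/mpeg;base64,' + audioContent
}

// Report 'loading', then 'ready' or 'error', through the onStatus callback.
// `synthesize` is assumed to take the request body and return a promise
// for the parsed response, e.g. a thin wrapper around fetch.
async function previewWithStatus(ssml, synthesize, onStatus, voice) {
  onStatus('loading')
  try {
    const res = await synthesize(buildSynthesizeBody(ssml, voice))
    if (!res || !res.audioContent) throw new Error('No audio in response')
    onStatus('ready')
    return toAudioDataUri(res.audioContent)
  } catch (err) {
    onStatus('error')
    throw err
  }
}

// Usage with a fake synthesize function standing in for the real request.
const fakeSynthesize = async () => ({ audioContent: 'QUJD' })
previewWithStatus('<speak>Hi</speak>', fakeSynthesize, status => console.log(status))
  .then(uri => console.log(uri))
```

In a component, the onStatus callback could feed a piece of state that disables the button or changes its label while the preview is loading.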

Add the button to SSMLeditor.js:

 // SSMLeditor.js
 import React, { Fragment } from 'react';
 import { BlockEditor } from 'part:@sanity/form-builder';
 import PreviewButton from './PreviewButton';

 export default function SSMLeditor(props) {
   return (
     <Fragment>
       <BlockEditor {...props} />
       <PreviewButton blocks={props.value} />
     </Fragment>
   );
 }

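Now you should be able to mark up your text with the different annotations, and hear the result when pushing "Speak text". Cool, isn't it?

You've Created A Speech Synthesis Editor, And Now What?

If you have followed this tutorial, you have seen how you can use the editor for Portable Text in Sanity Studio to make custom annotations and customize the editor. You can use these skills for all sorts of things, not only to make a speech synthesis editor. You have also seen how to serialize Portable Text into the syntax you need. Obviously, this is also handy if you're building frontends in React or Vue. You can even use these skills to generate Markdown from Portable Text.

We haven't covered how you actually use this together with a voice assistant. If you want to try, you can use much of the same logic as with the preview button in a serverless function, and set it as the API endpoint for a fulfillment using webhooks, e.g. with Dialogflow.

If you'd like me to write a tutorial on how to use the speech synthesis editor with a voice assistant, feel free to give me a hint on Twitter or share in the comments section below.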

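Further reading on SmashingMag:

Experimenting With speechSynthesis
Enhancing User Experience With The Web Speech API
Accessibility APIs: A Key To Web Accessibility
Building A Simple AI Chatbot With Web Speech API And Node.js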