JohnSnowLabs / spark-nlp, Coveralls build 13883000244

16 Mar 2025 11:44AM UTC. Coverage: 59.034% (-1.0%) from 60.072%

Pull Request #14444: Sparknlp 1060 implement phi 3.5 vision (merge 6d717703b into 05000ab4a)

0 of 292 new or added lines in 5 files covered (0.0%). 20 existing lines in 14 files now uncovered. 9413 of 15945 relevant lines covered (59.03%), 0.59 hits per line.

Source file (74.63% of lines covered):
/src/main/scala/com/johnsnowlabs/nlp/annotators/ws/WordSegmenterModel.scala
/*
 * Copyright 2017-2022 John Snow Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.johnsnowlabs.nlp.annotators.ws

import com.johnsnowlabs.nlp.AnnotatorType.{DOCUMENT, TOKEN}
import com.johnsnowlabs.nlp.annotators.common._
import com.johnsnowlabs.nlp.annotators.pos.perceptron.{
  AveragedPerceptron,
  PerceptronPredictionUtils
}
import com.johnsnowlabs.nlp.annotators.ws.TagsType.{LEFT_BOUNDARY, MIDDLE, RIGHT_BOUNDARY}
import com.johnsnowlabs.nlp.serialization.StructFeature
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.param.{BooleanParam, Param}
import org.apache.spark.ml.util.Identifiable

/** WordSegmenter which tokenizes non-English or non-whitespace separated texts.
  *
  * Many languages are not whitespace separated and their sentences are a concatenation of many
  * symbols, like Korean, Japanese or Chinese. Without understanding the language, splitting the
  * words into their corresponding tokens is impossible. The WordSegmenter is trained to
  * understand these languages and split them into semantically correct parts.
  *
  * This annotator is based on the paper
  * [[https://aclanthology.org/O03-4002.pdf Chinese Word Segmentation as Character Tagging]]. Word
  * segmentation is treated as a tagging problem. Each character is tagged as one of four
  * different labels: LL (left boundary), RR (right boundary), MM (middle) and LR (word by
  * itself). The label depends on the position of the character within the word. Characters
  * tagged LL combine with the character to their right. Likewise, RR tagged characters combine
  * with characters to their left. MM tagged characters are treated as the middle of a word and
  * combine with either side. LR tagged characters are words by themselves.
  *
  * Example (from [1], Example 3(a) (raw), 3(b) (tagged), 3(c) (translation)):
  *   - 上海 计划 到 本 世纪 末 实现 人均 国内 生产 总值 五千 美元
  *   - 上/LL 海/RR 计/LL 划/RR 到/LR 本/LR 世/LL 纪/RR 末/LR 实/LL 现/RR 人/LL 均/RR 国/LL 内/RR 生/LL 产/RR 总/LL
  *     值/RR 五/LL 千/RR 美/LL 元/RR
  *   - Shanghai plans to reach the goal of 5,000 dollars in per capita GDP by the end of the
  *     century.
  *
  * This is the instantiated model of the [[WordSegmenterApproach]]. For training your own model,
  * please see the documentation of that class.
  *
  * Pretrained models can be loaded with `pretrained` of the companion object:
  * {{{
  * val wordSegmenter = WordSegmenterModel.pretrained()
  *   .setInputCols("document")
  *   .setOutputCol("words_segmented")
  * }}}
  * If no values are provided, the default model is `"wordseg_pku"` and the default language is
  * `"zh"`. For available pretrained models please see the
  * [[https://sparknlp.org/models?task=Word+Segmentation Models Hub]].
  *
  * For extended examples of usage, see the
  * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb Examples]]
  * and the
  * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/WordSegmenterTest.scala WordSegmenterTest]].
  *
  * '''References:'''
  *   - [[https://aclanthology.org/O03-4002.pdf [1]]] Xue, Nianwen. “Chinese Word Segmentation as
  *     Character Tagging.” International Journal of Computational Linguistics & Chinese Language
  *     Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese
  *     Language Processing, 2003, pp. 29-48. ACLWeb, https://aclanthology.org/O03-4002.
  *
  * ==Example==
  * {{{
  * import spark.implicits._
  * import com.johnsnowlabs.nlp.base.DocumentAssembler
  * import com.johnsnowlabs.nlp.annotator.WordSegmenterModel
  * import org.apache.spark.ml.Pipeline
  *
  * val documentAssembler = new DocumentAssembler()
  *   .setInputCol("text")
  *   .setOutputCol("document")
  *
  * val wordSegmenter = WordSegmenterModel.pretrained()
  *   .setInputCols("document")
  *   .setOutputCol("token")
  *
  * val pipeline = new Pipeline().setStages(Array(
  *   documentAssembler,
  *   wordSegmenter
  * ))
  *
  * val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text")
  * val result = pipeline.fit(data).transform(data)
  *
  * result.select("token.result").show(false)
  * +--------------------------------------------------------+
  * |result                                                  |
  * +--------------------------------------------------------+
  * |[然而, ,, 這樣, 的, 處理, 也, 衍生, 了, 一些, 問題, 。    ]|
  * +--------------------------------------------------------+
  * }}}
  *
  * @param uid
  *   required uid for storing annotator to disk
  * @groupname anno Annotator types
  * @groupdesc anno
  *   Required input and expected output annotator types
  * @groupname Ungrouped Members
  * @groupname param Parameters
  * @groupname setParam Parameter setters
  * @groupname getParam Parameter getters
  * @groupprio param 1
  * @groupprio anno 2
  * @groupprio Ungrouped 3
  * @groupprio setParam 4
  * @groupprio getParam 5
  * @groupdesc param
  *   A list of (hyper-)parameter keys this annotator can take. Users can set and get the
  *   parameter values through setters and getters, respectively.
  */
class WordSegmenterModel(override val uid: String)
    extends AnnotatorModel[WordSegmenterModel]
    with HasSimpleAnnotate[WordSegmenterModel]
    with PerceptronPredictionUtils {

  /** Annotator reference id. Used to identify elements in metadata or to refer to this annotator
    * type
    */
  def this() = this(Identifiable.randomUID("WORD_SEGMENTER"))

  /** POS model
    *
    * @group param
    */
  val model: StructFeature[AveragedPerceptron] =
    new StructFeature[AveragedPerceptron](this, "POS Model")

  /** Whether to use a RegexTokenizer before segmentation, useful for multilingual text
    * (Default: `false`).
    *
    * @group param
    */
  val enableRegexTokenizer: BooleanParam = new BooleanParam(
    this,
    "enableRegexTokenizer",
    "Whether to use RegexTokenizer before segmentation. Useful for multilingual text")

  /** Indicates whether to convert all characters to lowercase before tokenizing (Default:
    * `false`).
    *
    * @group param
    */
  val toLowercase: BooleanParam = new BooleanParam(
    this,
    "toLowercase",
    "Indicates whether to convert all characters to lowercase before tokenizing.")

  /** Regex pattern used to match delimiters (Default: `"\\s+"`)
    *
    * @group param
    */
  val pattern: Param[String] = new Param(this, "pattern", "regex pattern used for tokenizing")

  /** @group getParam */
  def getModel: AveragedPerceptron = $$(model)

  /** @group setParam */
  def setModel(targetModel: AveragedPerceptron): this.type = set(model, targetModel)

  /** @group setParam */
  def setEnableRegexTokenizer(value: Boolean): this.type = set(enableRegexTokenizer, value)

  /** @group setParam */
  def setToLowercase(value: Boolean): this.type = set(toLowercase, value)

  /** @group setParam */
  def setPattern(value: String): this.type = set(pattern, value)

  setDefault(enableRegexTokenizer -> false, toLowercase -> false, pattern -> "\\s+")

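  // Illustrative configuration sketch (parameter values assumed, not prescribed): enabling
  // the regex path for mixed-script text, using only the setters defined above.
  //
  //   val segmenter = WordSegmenterModel.pretrained()
  //     .setInputCols("document")
  //     .setOutputCol("token")
  //     .setEnableRegexTokenizer(true)
  //     .setToLowercase(true)
  //     .setPattern("\\s+")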
  /** Takes a document and annotations and produces new annotations of this annotator's
    * annotation type
    *
    * @param annotations
    *   Annotations that correspond to inputAnnotationCols generated by previous annotators if any
    * @return
    *   any number of annotations processed for every input annotation. Not necessarily a
    *   one-to-one relationship
    */
  override def annotate(annotations: Seq[Annotation]): Seq[Annotation] = {
    if ($(enableRegexTokenizer)) {
      segmentWithRegexAnnotator(annotations)
    } else {
      val sentences = SentenceSplit.unpack(annotations)
      val tokens = getTokenAnnotations(sentences)
      val tokenizedSentences = TokenizedWithSentence.unpack(annotations ++ tokens)
      val tagged = tag($$(model), tokenizedSentences.toArray)
      buildWordSegments(tagged)
    }
  }

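  // Worked trace of the default path above (illustrative, mirroring the example in the
  // class documentation): for the input "上海计划", getTokenAnnotations emits one TOKEN
  // annotation per character (上, 海, 计, 划), tag(...) labels them with the perceptron
  // model, e.g. LL RR LL RR, and buildWordSegments merges the LL..RR runs back into the
  // words 上海 and 计划.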
  private def segmentWithRegexAnnotator(annotatedSentences: Seq[Annotation]): Seq[Annotation] = {

    val outputCol = Identifiable.randomUID("regex_token")

    val regexTokenizer = new RegexTokenizer()
      .setInputCols(getInputCols)
      .setOutputCol(outputCol)
      .setToLowercase($(toLowercase))
      .setPattern($(pattern))

    val annotatedTokens = regexTokenizer.annotate(annotatedSentences)

    val segmentedResult = annotatedTokens.flatMap { annotatedToken =>
      val codePoint = annotatedToken.result.codePointAt(0)
      val unicodeScript = Character.UnicodeScript.of(codePoint)
      if (unicodeScript == Character.UnicodeScript.LATIN) {
        Seq(annotatedToken)
      } else {
        val sentenceIndex = annotatedToken.metadata("sentence")

        val annotatedSentence = Annotation(
          DOCUMENT,
          annotatedToken.begin,
          annotatedToken.end,
          annotatedToken.result,
          Map("sentence" -> sentenceIndex))
        val sentence = Sentence(
          annotatedToken.result,
          annotatedToken.begin,
          annotatedToken.end,
          sentenceIndex.toInt)
        val characterTokens = getTokenAnnotations(Seq(sentence))

        val tokenizedSentences =
          TokenizedWithSentence.unpack(characterTokens ++ Seq(annotatedSentence))
        val tagged = tag($$(model), tokenizedSentences.toArray)
        buildWordSegments(tagged)
      }
    }

    segmentedResult
  }

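  // The regex path above first splits the text on `pattern`, passes tokens whose first
  // code point belongs to the LATIN script through unchanged, and re-segments every other
  // token by wrapping it as a one-token sentence and running the same character-tagging
  // pipeline on it.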
  private def getTokenAnnotations(annotation: Seq[Sentence]): Seq[Annotation] = {
    val tokens = annotation.flatMap { sentence =>
      val chars = sentence.content.split("")
      chars.zipWithIndex
        .map { case (char, index) =>
          val tokenIndex = index + sentence.start
          Annotation(
            TOKEN,
            tokenIndex,
            tokenIndex,
            char,
            Map("sentence" -> sentence.index.toString))
        }
        .filter(annotation => annotation.result != " ")
    }
    tokens
  }

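  // split("") above yields one string per character, so each character becomes its own
  // TOKEN annotation with begin == end, offset by the sentence start; plain spaces are
  // dropped so they never reach the tagger.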
  def buildWordSegments(taggedSentences: Array[TaggedSentence]): Seq[Annotation] = {
    taggedSentences.zipWithIndex.flatMap { case (taggedSentence, index) =>
      val tagsSentence = taggedSentence.tags.mkString("")
      val wordIndexesByMatchedGroups = getWordIndexesByMatchedGroups(tagsSentence)
      if (wordIndexesByMatchedGroups.isEmpty) {
        taggedSentence.indexedTaggedWords.map(indexedTaggedWord =>
          Annotation(
            TOKEN,
            indexedTaggedWord.begin,
            indexedTaggedWord.end,
            indexedTaggedWord.word,
            Map("sentence" -> index.toString)))
      } else {
        annotateSegmentWords(wordIndexesByMatchedGroups, taggedSentence, index)
      }
    }
  }

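  // The branch above handles the degenerate case: if the concatenated tag string contains
  // no left-boundary..right-boundary group at all, every tagged character is emitted as a
  // token of its own; otherwise the matched groups drive the merging below.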
  private def getWordIndexesByMatchedGroups(tagsSentence: String): List[List[RegexTagsInfo]] = {
    val regexPattern = s"($LEFT_BOUNDARY)($MIDDLE*)*($RIGHT_BOUNDARY)".r
    regexPattern
      .findAllMatchIn(tagsSentence)
      .map(matchedResult => {
        val groups = (1 to matchedResult.groupCount).toList
        groups
          .map(g =>
            RegexTagsInfo(
              matchedResult.group(g),
              matchedResult.start(g),
              matchedResult.end(g),
              (matchedResult.end(g) / 2) - 1))
          .filter(regexTagsInfo => regexTagsInfo.estimatedIndex != -1)
      })
      .toList
  }

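  // Index math, assuming the two-character tags (LL, MM, RR, LR) described in the class
  // documentation: each tag occupies two characters of tagsSentence, so a group ending at
  // string offset end(g) closes the word at position end(g) / 2 - 1. For example, in the
  // tag string "LLMMRR" the right-boundary group ends at offset 6, i.e. word index 2. An
  // unmatched optional group reports end(g) == -1, giving an estimatedIndex of -1, which
  // the filter discards.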
  private def annotateSegmentWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence,
      sentenceIndex: Int): Seq[Annotation] = {

    val singleTaggedWords =
      getSingleIndexedTaggedWords(wordIndexesByMatchedGroups, taggedSentence)
    val multipleTaggedWords = getMultipleTaggedWords(wordIndexesByMatchedGroups, taggedSentence)
    val segmentedTaggedWords = (singleTaggedWords ++ multipleTaggedWords)
      .sortWith(
        _.metadata.getOrElse("index", "-1").toInt < _.metadata.getOrElse("index", "-1").toInt)
    segmentedTaggedWords.map(segmentedTaggedWord =>
      Annotation(
        TOKEN,
        segmentedTaggedWord.begin,
        segmentedTaggedWord.end,
        segmentedTaggedWord.word,
        Map("sentence" -> sentenceIndex.toString)))
  }

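  // The sort above restores reading order: stand-alone characters and merged segments are
  // produced independently, so they are interleaved by the character position each
  // IndexedTaggedWord carries in its "index" metadata before being emitted as tokens.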
  private def getSingleIndexedTaggedWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence): List[IndexedTaggedWord] = {
    val flattenWordIndexes = wordIndexesByMatchedGroups.flatMap(wordIndexGroup =>
      wordIndexGroup.map(wi => wi.estimatedIndex))
    val unmatchedTaggedWordsCandidates = taggedSentence.indexedTaggedWords.zipWithIndex
      .filter { case (_, index) =>
        !flattenWordIndexes.contains(index)
      }
      .map(_._1)
    val unmatchedTaggedWords =
      unmatchedTaggedWordsCandidates.filter(unmatchedTaggedWordCandidate =>
        !isMatchedWord(unmatchedTaggedWordCandidate, wordIndexesByMatchedGroups))
    unmatchedTaggedWords.toList
  }

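  // In other words: a character stands alone only if its position is not an estimated
  // boundary index of any matched group and isMatchedWord does not place it inside a
  // matched group as a middle (MM) character.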
  private def isMatchedWord(
      indexedTaggedWord: IndexedTaggedWord,
      regexTagsInfoList: List[List[RegexTagsInfo]]): Boolean = {
    val index = indexedTaggedWord.metadata.getOrElse("index", "-1").toInt

    val result = regexTagsInfoList.flatMap(regexTagsInfo => {
      val leftBoundaryIndex = regexTagsInfo.head.estimatedIndex
      val rightBoundaryIndex = regexTagsInfo.last.estimatedIndex
      val isInRange = index > leftBoundaryIndex && index < rightBoundaryIndex
      val verifyMatches = regexTagsInfo.map(rti => {
        if (indexedTaggedWord.tag != MIDDLE || !isInRange) "unmatched"
        else {
          if (rti.tagsMatch.contains(MIDDLE) && rti.tagsMatch.length > 2) "matched"
          else "unmatched"
        }
      })
      verifyMatches
    })
    result.contains("matched")
  }

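  // A character counts as matched when it is tagged MIDDLE, sits strictly between the
  // estimated left and right boundary indexes of a group, and that group actually
  // captured a run of middle tags longer than a single two-character tag.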
  private def getMultipleTaggedWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence): List[IndexedTaggedWord] = {
    wordIndexesByMatchedGroups.flatMap { wordIndexesGroup =>
      val wordIndexes = wordIndexesGroup.map(wi => wi.estimatedIndex)
      val taggedWords = taggedSentence.indexedTaggedWords.zipWithIndex
        .filter { case (indexedTaggedWord, index) =>
          wordIndexes.contains(index) || isMatchedWord(indexedTaggedWord, List(wordIndexesGroup))
        }
        .map(_._1)
      if (taggedWords.nonEmpty) Some(taggedWords.reduceLeft(processTags)) else None
    }
  }

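  // Each matched group collects its boundary characters plus any in-range middle
  // characters and folds them left-to-right with processTags, producing one merged
  // IndexedTaggedWord per group.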
  private val processTags = (current: IndexedTaggedWord, next: IndexedTaggedWord) => {
    val wordSegment = current.word + next.word
    val tagSegment = current.tag + next.tag
    val begin = if (current.begin <= next.begin) current.begin else next.begin
    val end = begin + wordSegment.length - 1
    val currentIndexValue = current.metadata.getOrElse("index", "-1")
    val nextIndexValue = next.metadata.getOrElse("index", "-1")
    val index =
      if (currentIndexValue.toInt <= nextIndexValue.toInt) currentIndexValue else nextIndexValue
    IndexedTaggedWord(wordSegment, tagSegment, begin, end, None, Map("index" -> index))
  }

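  // Illustrative fold (values assumed): merging ("上", "LL", index 0) with ("海", "RR",
  // index 1) yields IndexedTaggedWord("上海", "LLRR", begin = 0, end = 1,
  // metadata = Map("index" -> "0")), keeping the smaller index so the merged segment
  // sorts into its original reading position.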
  /** Output Annotator Types: TOKEN
    *
    * @group anno
    */
  override val outputAnnotatorType: AnnotatorType = TOKEN

  /** Input Annotator Types: DOCUMENT
    *
    * @group anno
    */
  override val inputAnnotatorTypes: Array[String] = Array(DOCUMENT)
}

private case class RegexTagsInfo(tagsMatch: String, start: Int, end: Int, estimatedIndex: Int)

trait ReadablePretrainedWordSegmenter
    extends ParamsAndFeaturesReadable[WordSegmenterModel]
    with HasPretrained[WordSegmenterModel] {
  override val defaultModelName: Some[String] = Some("wordseg_pku")
  override val defaultLang: String = "zh"

  /** Java-compliant overrides */
  override def pretrained(): WordSegmenterModel = super.pretrained()

  override def pretrained(name: String): WordSegmenterModel = super.pretrained(name)

  override def pretrained(name: String, lang: String): WordSegmenterModel =
    super.pretrained(name, lang)

  override def pretrained(name: String, lang: String, remoteLoc: String): WordSegmenterModel =
    super.pretrained(name, lang, remoteLoc)
}

/** This is the companion object of [[WordSegmenterModel]]. Please refer to that class for the
  * documentation.
  */
object WordSegmenterModel extends ReadablePretrainedWordSegmenter