JohnSnowLabs / spark-nlp, Coveralls build 13883000244

16 Mar 2025 11:44AM UTC. Coverage: 59.034% (-1.0%) from 60.072%

Pull Request #14444: Sparknlp 1060 implement phi 3.5 vision (merge 6d717703b into 05000ab4a)

0 of 292 new or added lines in 5 files covered (0.0%). 20 existing lines in 14 files now uncovered. 9413 of 15945 relevant lines covered (59.03%), 0.59 hits per line.

Source file (74.63% of lines covered):
/src/main/scala/com/johnsnowlabs/nlp/annotators/ws/WordSegmenterModel.scala
/*
 * Copyright 2017-2022 John Snow Labs
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.johnsnowlabs.nlp.annotators.ws

import com.johnsnowlabs.nlp.AnnotatorType.{DOCUMENT, TOKEN}
import com.johnsnowlabs.nlp.annotators.common._
import com.johnsnowlabs.nlp.annotators.pos.perceptron.{
  AveragedPerceptron,
  PerceptronPredictionUtils
}
import com.johnsnowlabs.nlp.annotators.ws.TagsType.{LEFT_BOUNDARY, MIDDLE, RIGHT_BOUNDARY}
import com.johnsnowlabs.nlp.serialization.StructFeature
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.param.{BooleanParam, Param}
import org.apache.spark.ml.util.Identifiable

/** WordSegmenter which tokenizes non-English or non-whitespace separated texts.
  *
  * Many languages are not whitespace separated and their sentences are a concatenation of many
  * symbols, like Korean, Japanese or Chinese. Without understanding the language, splitting the
  * words into their corresponding tokens is impossible. The WordSegmenter is trained to
  * understand these languages and split them into semantically correct parts.
  *
  * This annotator is based on the paper
  * [[https://aclanthology.org/O03-4002.pdf Chinese Word Segmentation as Character Tagging]]. Word
  * segmentation is treated as a tagging problem. Each character is tagged as one of four
  * different labels: LL (left boundary), RR (right boundary), MM (middle) and LR (word by
  * itself). The label depends on the position of the character within the word. Characters
  * tagged LL combine with the character to their right. Likewise, RR tagged characters combine
  * with characters to their left. MM tagged characters are treated as the middle of a word and
  * combine with either side. LR tagged characters are words by themselves.
  *
  * Example (from [1], Example 3(a) (raw), 3(b) (tagged), 3(c) (translation)):
  *   - 上海 计划 到 本 世纪 末 实现 人均 国内 生产 总值 五千 美元
  *   - 上/LL 海/RR 计/LL 划/RR 到/LR 本/LR 世/LL 纪/RR 末/LR 实/LL 现/RR 人/LL 均/RR 国/LL 内/RR 生/LL 产/RR 总/LL
  *     值/RR 五/LL 千/RR 美/LL 元/RR
  *   - Shanghai plans to reach the goal of 5,000 dollars in per capita GDP by the end of the
  *     century.
  *
  * This is the instantiated model of the [[WordSegmenterApproach]]. For training your own model,
  * please see the documentation of that class.
  *
  * Pretrained models can be loaded with `pretrained` of the companion object:
  * {{{
  * val wordSegmenter = WordSegmenterModel.pretrained()
  *   .setInputCols("document")
  *   .setOutputCol("words_segmented")
  * }}}
  * If no values are provided, the default model is `"wordseg_pku"` and the default language is
  * `"zh"`. For available pretrained models please see the
  * [[https://sparknlp.org/models?task=Word+Segmentation Models Hub]].
  *
  * For extended examples of usage, see the
  * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/jupyter/annotation/chinese/word_segmentation/words_segmenter_demo.ipynb Examples]]
  * and the
  * [[https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/annotators/WordSegmenterTest.scala WordSegmenterTest]].
  *
  * '''References:'''
  *   - [[https://aclanthology.org/O03-4002.pdf [1]]] Xue, Nianwen. “Chinese Word Segmentation as
  *     Character Tagging.” International Journal of Computational Linguistics & Chinese Language
  *     Processing, Volume 8, Number 1, February 2003: Special Issue on Word Formation and Chinese
  *     Language Processing, 2003, pp. 29-48. ACLWeb, https://aclanthology.org/O03-4002.
  *
  * ==Example==
  * {{{
  * import spark.implicits._
  * import com.johnsnowlabs.nlp.base.DocumentAssembler
  * import com.johnsnowlabs.nlp.annotator.WordSegmenterModel
  * import org.apache.spark.ml.Pipeline
  *
  * val documentAssembler = new DocumentAssembler()
  *   .setInputCol("text")
  *   .setOutputCol("document")
  *
  * val wordSegmenter = WordSegmenterModel.pretrained()
  *   .setInputCols("document")
  *   .setOutputCol("token")
  *
  * val pipeline = new Pipeline().setStages(Array(
  *   documentAssembler,
  *   wordSegmenter
  * ))
  *
  * val data = Seq("然而,這樣的處理也衍生了一些問題。").toDF("text")
  * val result = pipeline.fit(data).transform(data)
  *
  * result.select("token.result").show(false)
  * +--------------------------------------------------------+
  * |result                                                  |
  * +--------------------------------------------------------+
  * |[然而, ,, 這樣, 的, 處理, 也, 衍生, 了, 一些, 問題, 。    ]|
  * +--------------------------------------------------------+
  * }}}
  *
  * @param uid
  *   required uid for storing annotator to disk
  * @groupname anno Annotator types
  * @groupdesc anno
  *   Required input and expected output annotator types
  * @groupname Ungrouped Members
  * @groupname param Parameters
  * @groupname setParam Parameter setters
  * @groupname getParam Parameter getters
  * @groupprio param 1
  * @groupprio anno 2
  * @groupprio Ungrouped 3
  * @groupprio setParam 4
  * @groupprio getParam 5
  * @groupdesc param
  *   A list of (hyper-)parameter keys this annotator can take. Users can set and get the
  *   parameter values through setters and getters, respectively.
  */
class WordSegmenterModel(override val uid: String)
    extends AnnotatorModel[WordSegmenterModel]
    with HasSimpleAnnotate[WordSegmenterModel]
    with PerceptronPredictionUtils {

  /** Annotator reference id. Used to identify elements in metadata or to refer to this annotator
    * type
    */
  def this() = this(Identifiable.randomUID("WORD_SEGMENTER"))

  /** POS model
    *
    * @group param
    */
  val model: StructFeature[AveragedPerceptron] =
    new StructFeature[AveragedPerceptron](this, "POS Model")

  /** Whether to use a RegexTokenizer before segmentation, useful for multilingual text
    * (Default: `false`).
    *
    * @group param
    */
  val enableRegexTokenizer: BooleanParam = new BooleanParam(
    this,
    "enableRegexTokenizer",
    "Whether to use RegexTokenizer before segmentation. Useful for multilingual text")

  /** Indicates whether to convert all characters to lowercase before tokenizing (Default:
    * `false`).
    *
    * @group param
    */
  val toLowercase: BooleanParam = new BooleanParam(
    this,
    "toLowercase",
    "Indicates whether to convert all characters to lowercase before tokenizing.")

  /** Regex pattern used to match delimiters (Default: `"\\s+"`)
    *
    * @group param
    */
  val pattern: Param[String] = new Param(this, "pattern", "regex pattern used for tokenizing")

  /** @group getParam */
  def getModel: AveragedPerceptron = $$(model)

  /** @group setParam */
  def setModel(targetModel: AveragedPerceptron): this.type = set(model, targetModel)

  /** @group setParam */
  def setEnableRegexTokenizer(value: Boolean): this.type = set(enableRegexTokenizer, value)

  /** @group setParam */
  def setToLowercase(value: Boolean): this.type = set(toLowercase, value)

  /** @group setParam */
  def setPattern(value: String): this.type = set(pattern, value)

  setDefault(enableRegexTokenizer -> false, toLowercase -> false, pattern -> "\\s+")

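  // Illustrative configuration sketch (parameter values assumed, not prescribed): enabling
  // the regex path for mixed-script text, using only the setters defined above.
  //
  //   val segmenter = WordSegmenterModel.pretrained()
  //     .setInputCols("document")
  //     .setOutputCol("token")
  //     .setEnableRegexTokenizer(true)
  //     .setToLowercase(true)
  //     .setPattern("\\s+")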
  /** Takes a document and annotations and produces new annotations of this annotator's
    * annotation type
    *
    * @param annotations
    *   Annotations that correspond to inputAnnotationCols generated by previous annotators if any
    * @return
    *   any number of annotations processed for every input annotation. Not necessarily a
    *   one-to-one relationship
    */
  override def annotate(annotations: Seq[Annotation]): Seq[Annotation] = {
    if ($(enableRegexTokenizer)) {
      segmentWithRegexAnnotator(annotations)
    } else {
      val sentences = SentenceSplit.unpack(annotations)
      val tokens = getTokenAnnotations(sentences)
      val tokenizedSentences = TokenizedWithSentence.unpack(annotations ++ tokens)
      val tagged = tag($$(model), tokenizedSentences.toArray)
      buildWordSegments(tagged)
    }
  }

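  // Worked trace of the default path above (illustrative, mirroring the example in the
  // class documentation): for the input "上海计划", getTokenAnnotations emits one TOKEN
  // annotation per character (上, 海, 计, 划), tag(...) labels them with the perceptron
  // model, e.g. LL RR LL RR, and buildWordSegments merges the LL..RR runs back into the
  // words 上海 and 计划.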
  private def segmentWithRegexAnnotator(annotatedSentences: Seq[Annotation]): Seq[Annotation] = {

    val outputCol = Identifiable.randomUID("regex_token")

    val regexTokenizer = new RegexTokenizer()
      .setInputCols(getInputCols)
      .setOutputCol(outputCol)
      .setToLowercase($(toLowercase))
      .setPattern($(pattern))

    val annotatedTokens = regexTokenizer.annotate(annotatedSentences)

    val segmentedResult = annotatedTokens.flatMap { annotatedToken =>
      val codePoint = annotatedToken.result.codePointAt(0)
      val unicodeScript = Character.UnicodeScript.of(codePoint)
      if (unicodeScript == Character.UnicodeScript.LATIN) {
        Seq(annotatedToken)
      } else {
        val sentenceIndex = annotatedToken.metadata("sentence")

        val annotatedSentence = Annotation(
          DOCUMENT,
          annotatedToken.begin,
          annotatedToken.end,
          annotatedToken.result,
          Map("sentence" -> sentenceIndex))
        val sentence = Sentence(
          annotatedToken.result,
          annotatedToken.begin,
          annotatedToken.end,
          sentenceIndex.toInt)
        val characterTokens = getTokenAnnotations(Seq(sentence))

        val tokenizedSentences =
          TokenizedWithSentence.unpack(characterTokens ++ Seq(annotatedSentence))
        val tagged = tag($$(model), tokenizedSentences.toArray)
        buildWordSegments(tagged)
      }
    }

    segmentedResult
  }

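  // The regex path above first splits the text on `pattern`, passes tokens whose first
  // code point belongs to the LATIN script through unchanged, and re-segments every other
  // token by wrapping it as a one-token sentence and running the same character-tagging
  // pipeline on it.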
  private def getTokenAnnotations(annotation: Seq[Sentence]): Seq[Annotation] = {
    val tokens = annotation.flatMap { sentence =>
      val chars = sentence.content.split("")
      chars.zipWithIndex
        .map { case (char, index) =>
          val tokenIndex = index + sentence.start
          Annotation(
            TOKEN,
            tokenIndex,
            tokenIndex,
            char,
            Map("sentence" -> sentence.index.toString))
        }
        .filter(annotation => annotation.result != " ")
    }
    tokens
  }

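  // split("") above yields one string per character, so each character becomes its own
  // TOKEN annotation with begin == end, offset by the sentence start; plain spaces are
  // dropped so they never reach the tagger.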
  def buildWordSegments(taggedSentences: Array[TaggedSentence]): Seq[Annotation] = {
    taggedSentences.zipWithIndex.flatMap { case (taggedSentence, index) =>
      val tagsSentence = taggedSentence.tags.mkString("")
      val wordIndexesByMatchedGroups = getWordIndexesByMatchedGroups(tagsSentence)
      if (wordIndexesByMatchedGroups.isEmpty) {
        taggedSentence.indexedTaggedWords.map(indexedTaggedWord =>
          Annotation(
            TOKEN,
            indexedTaggedWord.begin,
            indexedTaggedWord.end,
            indexedTaggedWord.word,
            Map("sentence" -> index.toString)))
      } else {
        annotateSegmentWords(wordIndexesByMatchedGroups, taggedSentence, index)
      }
    }
  }

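  // The branch above handles the degenerate case: if the concatenated tag string contains
  // no left-boundary..right-boundary group at all, every tagged character is emitted as a
  // token of its own; otherwise the matched groups drive the merging below.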
  private def getWordIndexesByMatchedGroups(tagsSentence: String): List[List[RegexTagsInfo]] = {
    val regexPattern = s"($LEFT_BOUNDARY)($MIDDLE*)*($RIGHT_BOUNDARY)".r
    regexPattern
      .findAllMatchIn(tagsSentence)
      .map(matchedResult => {
        val groups = (1 to matchedResult.groupCount).toList
        groups
          .map(g =>
            RegexTagsInfo(
              matchedResult.group(g),
              matchedResult.start(g),
              matchedResult.end(g),
              (matchedResult.end(g) / 2) - 1))
          .filter(regexTagsInfo => regexTagsInfo.estimatedIndex != -1)
      })
      .toList
  }

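  // Index math, assuming the two-character tags (LL, MM, RR, LR) described in the class
  // documentation: each tag occupies two characters of tagsSentence, so a group ending at
  // string offset end(g) closes the word at position end(g) / 2 - 1. For example, in the
  // tag string "LLMMRR" the right-boundary group ends at offset 6, i.e. word index 2. An
  // unmatched optional group reports end(g) == -1, giving an estimatedIndex of -1, which
  // the filter discards.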
  private def annotateSegmentWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence,
      sentenceIndex: Int): Seq[Annotation] = {

    val singleTaggedWords =
      getSingleIndexedTaggedWords(wordIndexesByMatchedGroups, taggedSentence)
    val multipleTaggedWords = getMultipleTaggedWords(wordIndexesByMatchedGroups, taggedSentence)
    val segmentedTaggedWords = (singleTaggedWords ++ multipleTaggedWords)
      .sortWith(
        _.metadata.getOrElse("index", "-1").toInt < _.metadata.getOrElse("index", "-1").toInt)
    segmentedTaggedWords.map(segmentedTaggedWord =>
      Annotation(
        TOKEN,
        segmentedTaggedWord.begin,
        segmentedTaggedWord.end,
        segmentedTaggedWord.word,
        Map("sentence" -> sentenceIndex.toString)))
  }

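  // The sort above restores reading order: stand-alone characters and merged segments are
  // produced independently, so they are interleaved by the character position each
  // IndexedTaggedWord carries in its "index" metadata before being emitted as tokens.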
  private def getSingleIndexedTaggedWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence): List[IndexedTaggedWord] = {
    val flattenWordIndexes = wordIndexesByMatchedGroups.flatMap(wordIndexGroup =>
      wordIndexGroup.map(wi => wi.estimatedIndex))
    val unmatchedTaggedWordsCandidates = taggedSentence.indexedTaggedWords.zipWithIndex
      .filter { case (_, index) =>
        !flattenWordIndexes.contains(index)
      }
      .map(_._1)
    val unmatchedTaggedWords =
      unmatchedTaggedWordsCandidates.filter(unmatchedTaggedWordCandidate =>
        !isMatchedWord(unmatchedTaggedWordCandidate, wordIndexesByMatchedGroups))
    unmatchedTaggedWords.toList
  }

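  // In other words: a character stands alone only if its position is not an estimated
  // boundary index of any matched group and isMatchedWord does not place it inside a
  // matched group as a middle (MM) character.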
  private def isMatchedWord(
      indexedTaggedWord: IndexedTaggedWord,
      regexTagsInfoList: List[List[RegexTagsInfo]]): Boolean = {
    val index = indexedTaggedWord.metadata.getOrElse("index", "-1").toInt

    val result = regexTagsInfoList.flatMap(regexTagsInfo => {
      val leftBoundaryIndex = regexTagsInfo.head.estimatedIndex
      val rightBoundaryIndex = regexTagsInfo.last.estimatedIndex
      val isInRange = index > leftBoundaryIndex && index < rightBoundaryIndex
      val verifyMatches = regexTagsInfo.map(rti => {
        if (indexedTaggedWord.tag != MIDDLE || !isInRange) "unmatched"
        else {
          if (rti.tagsMatch.contains(MIDDLE) && rti.tagsMatch.length > 2) "matched"
          else "unmatched"
        }
      })
      verifyMatches
    })
    result.contains("matched")
  }

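  // A character counts as matched when it is tagged MIDDLE, sits strictly between the
  // estimated left and right boundary indexes of a group, and that group actually
  // captured a run of middle tags longer than a single two-character tag.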
  private def getMultipleTaggedWords(
      wordIndexesByMatchedGroups: List[List[RegexTagsInfo]],
      taggedSentence: TaggedSentence): List[IndexedTaggedWord] = {
    wordIndexesByMatchedGroups.flatMap { wordIndexesGroup =>
      val wordIndexes = wordIndexesGroup.map(wi => wi.estimatedIndex)
      val taggedWords = taggedSentence.indexedTaggedWords.zipWithIndex
        .filter { case (indexedTaggedWord, index) =>
          wordIndexes.contains(index) || isMatchedWord(indexedTaggedWord, List(wordIndexesGroup))
        }
        .map(_._1)
      if (taggedWords.nonEmpty) Some(taggedWords.reduceLeft(processTags)) else None
    }
  }

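  // Each matched group collects its boundary characters plus any in-range middle
  // characters and folds them left-to-right with processTags, producing one merged
  // IndexedTaggedWord per group.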
  private val processTags = (current: IndexedTaggedWord, next: IndexedTaggedWord) => {
    val wordSegment = current.word + next.word
    val tagSegment = current.tag + next.tag
    val begin = if (current.begin <= next.begin) current.begin else next.begin
    val end = begin + wordSegment.length - 1
    val currentIndexValue = current.metadata.getOrElse("index", "-1")
    val nextIndexValue = next.metadata.getOrElse("index", "-1")
    val index =
      if (currentIndexValue.toInt <= nextIndexValue.toInt) currentIndexValue else nextIndexValue
    IndexedTaggedWord(wordSegment, tagSegment, begin, end, None, Map("index" -> index))
  }

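  // Illustrative fold (values assumed): merging ("上", "LL", index 0) with ("海", "RR",
  // index 1) yields IndexedTaggedWord("上海", "LLRR", begin = 0, end = 1,
  // metadata = Map("index" -> "0")), keeping the smaller index so the merged segment
  // sorts into its original reading position.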
  /** Output Annotator Types: TOKEN
    *
    * @group anno
    */
  override val outputAnnotatorType: AnnotatorType = TOKEN

  /** Input Annotator Types: DOCUMENT
    *
    * @group anno
    */
  override val inputAnnotatorTypes: Array[String] = Array(DOCUMENT)
}

private case class RegexTagsInfo(tagsMatch: String, start: Int, end: Int, estimatedIndex: Int)

trait ReadablePretrainedWordSegmenter
    extends ParamsAndFeaturesReadable[WordSegmenterModel]
    with HasPretrained[WordSegmenterModel] {
  override val defaultModelName: Some[String] = Some("wordseg_pku")
  override val defaultLang: String = "zh"

  /** Java-compliant overrides */
  override def pretrained(): WordSegmenterModel = super.pretrained()

  override def pretrained(name: String): WordSegmenterModel = super.pretrained(name)

  override def pretrained(name: String, lang: String): WordSegmenterModel =
    super.pretrained(name, lang)

  override def pretrained(name: String, lang: String, remoteLoc: String): WordSegmenterModel =
    super.pretrained(name, lang, remoteLoc)
}

/** This is the companion object of [[WordSegmenterModel]]. Please refer to that class for the
  * documentation.
  */
object WordSegmenterModel extends ReadablePretrainedWordSegmenter