Several essential text mining algorithms were implemented as Web services. Two Web services perform text filtering (StopWordsRemover and CharacterFilter), two deal with linguistic morphology (a lemmatizer named LemmaGen and a stemmer named PorterStemmer), and one is a text format converter named GenerateBows. An auxiliary Web service named getValues was developed to provide the list of possible values of a Web service parameter; it is used to build user interfaces for Web services whose parameters accept several values.
Description: This operation takes as input plain text and a dictionary of stop words. It removes the stop words from the input text.
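The filtering step above can be sketched as follows; this is a minimal illustration of the idea, not the service's actual implementation, and the stop-word list is invented for the example.

```python
# Minimal sketch of stop-word removal (hypothetical stand-in for the
# StopWordsRemover operation; the stop-word list here is illustrative).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the input text with stop words removed, preserving word order."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stop_words("The quick brown fox is in the garden"))
# -> "quick brown fox garden"
```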
Description: This operation lemmatizes the input text according to the language parameter. Currently, 12 languages are supported: en, sl, ge, bg, cs, et, fr, hu, ro, sr, it, sp. It returns the (language-dependent) lemmatized text as output. All the words in the resulting text are in the same order as in the original text, but they are transformed to their dictionary form.
Description: This operation does text stemming. Stemming removes the inflected endings of words. It is often used as text preprocessing for text mining, since stemmed words can be easily matched and counted. The input to this operation is the text to be stemmed; the output is the stemmed text.
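To illustrate the effect of suffix stripping, here is a toy stemmer. It is not the Porter algorithm, which applies several rule phases with measure conditions; this sketch only strips a few common English suffixes.

```python
# Toy suffix-stripping stemmer for illustration only; the PorterStemmer
# service implements the full Porter algorithm, which is considerably
# more involved than this sketch.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")

def naive_stem(word):
    """Strip the longest matching suffix, keeping at least 3 characters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([naive_stem(w) for w in ["matched", "counting", "cats", "runs"]])
# -> ['match', 'count', 'cat', 'run']
```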
Description: BOW construction is a document corpus processing task: it transforms a corpus of documents into the Bag-Of-Words format. In this format, each document is represented as an unordered collection of words, disregarding grammar and even word order. Several preprocessing options and parameters can be set for this service.
- Stemmer: Lemmatizer_Bulgarian, Lemmatizer_Czech,
Lemmatizer_English, Lemmatizer_Estonian, Lemmatizer_French,
Lemmatizer_German, Lemmatizer_Hungarian, Lemmatizer_Italian,
Lemmatizer_Romanian, Lemmatizer_Serbian, Lemmatizer_Slovene,
Lemmatizer_Spanish, PorterStemmer, None
- StopWordSets: English, EnglishGoogle, English523, English425,
English319, English8, EnglishInet, French, German, Spanish
- Tokenizer: UnicodeTokenizer, VocabularyTokenizer
- WordWeightType: TermFreq, TfIdf, LogDfTfId
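Assuming the standard definitions of term frequency and TF-IDF (the service may normalise or smooth differently), the TermFreq and TfIdf weighting options can be sketched as:

```python
import math
from collections import Counter

# Sketch of Bag-Of-Words construction with TermFreq and TfIdf weights.
# Standard definitions are assumed; the GenerateBows service's exact
# weighting formulas are not reproduced here.
def build_bows(corpus, weight="TermFreq"):
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    bows = []
    for doc in tokenized:
        tf = Counter(doc)
        if weight == "TermFreq":
            bows.append(dict(tf))
        elif weight == "TfIdf":
            bows.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return bows

corpus = ["the cat sat", "the dog sat", "the cat ran"]
print(build_bows(corpus, "TfIdf"))
```

A term appearing in every document (here, "the") gets a TF-IDF weight of zero, since log(N/df) vanishes when df equals N.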
Description: This operation parses the Web service WSDL description and returns a list of possible parameter values for the given parameter name.
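One plausible way such an operation could work is to read the enumerated values out of the XML Schema embedded in the WSDL. The schema snippet and function below are invented for illustration, not the service's actual logic.

```python
import xml.etree.ElementTree as ET

# Hypothetical getValues-style lookup: pull the enumeration values of a
# named parameter type out of a WSDL's embedded XML Schema. The schema
# snippet is illustrative only.
XSD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="WordWeightType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="TermFreq"/>
      <xsd:enumeration value="TfIdf"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""

def get_values(schema_xml, param_name):
    """Return the enumeration values declared for the named simple type."""
    ns = {"xsd": "http://www.w3.org/2001/XMLSchema"}
    root = ET.fromstring(schema_xml)
    for st in root.findall("xsd:simpleType", ns):
        if st.get("name") == param_name:
            return [e.get("value")
                    for e in st.findall("xsd:restriction/xsd:enumeration", ns)]
    return []

print(get_values(XSD, "WordWeightType"))
# -> ['TermFreq', 'TfIdf']
```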
In addition to the services listed above, the e-LICO text mining Web services can provide:
- Text cleaning
- PDF to text conversion
- Text classification
- Sentence splitting
- Biologically relevant entity recognition
- Biologically relevant relationship detection
The majority of e-LICO services are listed on BioCatalogue. Here is a short summary of the available Web service operations; for more information, please follow the BioCatalogue links.
Text cleaner (BioCatalogue:2173)
This operation will remove all XML-invalid characters from the supplied text. Valid XML characters are specified at http://www.w3.org/TR/REC-xml/#charsets
This operation will remove all XML-invalid and non-ASCII characters from the supplied text. It can be used to clean text so that it is suitable as input for the NaCTeM service TerMine (http://www.biocatalogue.org/services/32-termine_35834), which only accepts ASCII text. XML-invalid characters are specified here (http://www.w3.org/TR/REC-xml/#charsets). ASCII characters are defined as having a Unicode code point between U+0000 and U+007F.
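The two cleaning steps described above can be sketched as follows, using the Char production from the XML 1.0 specification for validity and the U+0000 to U+007F range for ASCII; this is an illustration, not the service's code.

```python
# Sketch of the two cleaning operations: dropping XML-invalid characters
# (per the XML 1.0 Char production) and, optionally, everything outside
# the ASCII range U+0000-U+007F.
def is_xml_valid(ch):
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def clean_text(text, ascii_only=False):
    kept = (ch for ch in text if is_xml_valid(ch))
    if ascii_only:
        kept = (ch for ch in kept if ord(ch) <= 0x7F)
    return "".join(kept)

print(clean_text("caf\u00e9\x00!", ascii_only=True))  # -> "caf!"
```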
PDF to text (BioCatalogue:2172)
This operation accepts a byte array representation of a PDF file and returns a byte array representation of the extracted text.
This operation accepts a Base64-encoded string representation of a PDF file and returns a Base64-encoded string representation of the extracted text.
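Preparing input for the Base64 variant and decoding its response might look like the following; the actual service call is omitted and the payload here is a dummy byte string.

```python
import base64

# Sketch of encoding a PDF for the Base64 variant of the operation and
# decoding the returned text; the SOAP call itself is omitted.
def encode_pdf(pdf_bytes):
    """Base64-encode raw PDF bytes into an ASCII string for the request."""
    return base64.b64encode(pdf_bytes).decode("ascii")

def decode_text(b64_text):
    """Decode the Base64 response back into the extracted text."""
    return base64.b64decode(b64_text).decode("utf-8")

payload = encode_pdf(b"%PDF-1.4 ...")  # dummy bytes, not a real PDF
print(payload)
```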
Article section text classifier (BioCatalogue:2171)
This operation will classify a piece of text as being most likely to come from one of the four common scientific article sections (Introduction, Methods, Results, Discussion). This is a document-type web service, and this operation accepts a single string as input (the text to be classified). If you want to use this operation in Taverna, then you should use an XML input and output splitter.
This operation will classify a piece of text as being most likely to come from one of the four common scientific article sections (Introduction, Methods, Results, Discussion). This is a document-type web service, and this operation accepts a single string as input (the text to be classified). If you want to use this operation in Taverna, then you should use an input XML splitter and a chain of two output XML splitters.
Sentence splitter service (BioCatalogue:2161)
This service has a single operation: it accepts a single string and returns an array of strings. Both the input and output are wrapped in an XML document. To access the input and output data in Taverna, please add an "XML Input Splitter" and an "XML Output Splitter" after adding the operation to your workflow.
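For intuition, a naive sentence splitter can be written as a single regular expression; the actual service will handle abbreviations and other edge cases that this sketch does not.

```python
import re

# Naive sentence splitter for illustration only: splits on '.', '!' or '?'
# followed by whitespace and a capital letter. Abbreviations such as "Dr."
# will break this simple rule.
def split_sentences(text):
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

print(split_sentences("Mr X arrived. He sat down! Why?"))
# -> ['Mr X arrived.', 'He sat down!', 'Why?']
```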
Finding things service (BioCatalogue:3334)
This operation accepts plain text and returns a list of cell types found in the text. Character offsets into the originally submitted text string are provided for each cell type found.
This operation searches the provided text string for mentions of tissue types. The tissue types are obtained from the Mouse adult gross anatomy ontology (http://purl.org/obo/owl/MA).
This operation accepts two inputs: a list of ids, each with literal strings to be found, and a text string to be searched for the literal strings.
This operation accepts two inputs: a list of ids, each with a set of associated literal strings, and a list of text strings to be searched for all of these literal strings.
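The dictionary-based lookup with character offsets described above can be sketched as follows; the id-to-strings mapping is invented for the example, and real matching would also need tokenization and case handling.

```python
# Sketch of dictionary-based entity finding with character offsets.
# The id-to-literals mapping below is illustrative, not a real ontology.
def find_literals(id_to_strings, text):
    """Return (id, literal, start_offset) for every literal occurrence."""
    hits = []
    for ident, literals in id_to_strings.items():
        for lit in literals:
            start = text.find(lit)
            while start != -1:
                hits.append((ident, lit, start))
                start = text.find(lit, start + 1)
    return sorted(hits, key=lambda h: h[2])

terms = {"tissue:lung": ["lung"], "cell:fibroblast": ["fibroblast"]}
print(find_literals(terms, "lung fibroblast cultures"))
# -> [('tissue:lung', 'lung', 0), ('cell:fibroblast', 'fibroblast', 5)]
```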
Finding relationships service (BioCatalogue:3335)
This operation accepts a list of chemical, cell type and tissue type annotations. It then returns a list of relationships between these entities.
This operation accepts a list of protein entity annotations. It then returns a list of relationships between these entities.
This operation accepts a list of protein, cell type and tissue type annotations. It then returns a list of relationships between these entities.
This operation accepts a list of term annotations. It then returns a list of relationships between these entities.
All these Web services are available as Taverna workflows through the myExperiment portal, where example workflows are provided: