Automatic English Sentence Segmentation
More and more NLP tasks involve processing Web documents. All of
these tasks require a reliable and effective HTML/TEXT sentence segmenter.
Many sentence segmentation tools have been written in Perl using regular
expressions and pattern matching. They work only with simple plain text,
not with more complicated text documents or web pages in HTML format. A
number of problems related to Web-page parsing need to be addressed, including:
- A sentence may end with '?' or '!' as well as a dot '.'. A blank or
carriage return can also be a sentence boundary in special cases.
- A phrase or even a single word should be counted as a sentence if it
is not related to its context.
- A dot '.' is not always a sentence boundary, for example a dot in a
URL or an email address.
- Content that is not part of the running text of a web page should be
excluded. This includes JavaScript code, images, HTML comments and other HTML tags.
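
The problems listed above can be illustrated with a minimal Python sketch. It is not the implementation of this segmenter; the names TextExtractor, split_sentences and segment_html are hypothetical, and the rules are deliberately simplified. It assumes that block-level tags mark boundaries for stand-alone phrases, and that a dot only ends a sentence when it is followed by whitespace, which keeps URLs and email addresses intact but would still split wrongly after abbreviations such as "Dr.".

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; drop tags, comments, and script/style bodies."""

    SKIP = {"script", "style"}  # element bodies that are never running text
    # Assumption: these tags separate stand-alone pieces of text.
    BLOCK = {"title", "p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # a block tag acts as a boundary

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # HTML comments are ignored because handle_comment is not overridden.


# Split after '.', '?' or '!' only when whitespace follows, so the dots
# inside "www.example.com" or "bob@example.com" are not boundaries.
BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text):
    """Split plain text into sentences; each line is segmented separately."""
    sentences = []
    for block in text.split("\n"):
        block = " ".join(block.split())  # normalize runs of whitespace
        if block:
            sentences.extend(s for s in BOUNDARY.split(block) if s)
    return sentences


def segment_html(html):
    """Strip non-text material from HTML, then segment what remains."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return split_sentences("".join(parser.parts))


if __name__ == "__main__":
    page = ("<html><head><title>Contact</title>"
            "<script>var a = 'x. y';</script></head><body>"
            "<!-- hidden. comment --><h1>News</h1>"
            "<p>Mail bob@example.com or visit "
            "http://www.example.com/a.html today. Is it open? Yes!</p>"
            "<img src='a.b.png'></body></html>")
    for sentence in segment_html(page):
        print(sentence)
```

In this sketch the title and heading come out as one-word sentences because their tags isolate them from the surrounding text, while the script body, the comment and the image tag contribute nothing.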
This sentence segmenter was originally designed for the AnswerBus Question Answering System. It
is now also used in the Seven Tones Search Engine and several other online NLP applications. Local
command-line versions are also available for different operating systems.