textreuse: Detect Text Reuse and Document Similarity
dc.contributor.author | Mullen, Lincoln | |
dc.date.accessioned | 2016-03-01T16:42:04Z | |
dc.date.available | 2016-03-01T16:42:04Z | |
dc.date.issued | 2015-11-05 | |
dc.description.abstract | This R package provides a set of functions for measuring similarity among documents and detecting passages that have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful for tasks such as detecting duplicate documents in a corpus prior to text analysis or identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the NLP and tm packages. (However, this package has no dependency on Java, which should make it easier to install.) | |
dc.description.sponsorship | rOpenSci | |
dc.identifier.citation | Lincoln Mullen (2015). textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.2. https://github.com/ropensci/textreuse | |
dc.identifier.doi | http://dx.doi.org/10.13021/G80W2B | |
dc.identifier.uri | https://hdl.handle.net/1920/10077 | |
dc.publisher | rOpenSci | |
dc.relation.isversionof | https://github.com/ropensci/textreuse | |
dc.relation.isversionof | https://cran.r-project.org/package=textreuse | |
dc.subject | Textreuse | |
dc.subject | Text reuse | |
dc.subject | Document similarity | |
dc.subject | R | |
dc.subject | Jaccard similarity | |
dc.subject | Minhash | |
dc.subject | Locality sensitive hashing | |
dc.subject | Smith-Waterman local sequence alignment | |
dc.subject | Natural language processing | |
dc.title | textreuse: Detect Text Reuse and Document Similarity | |
dc.type | Software |
Files
Original bundle
- Name: textreuse_0.1.2.tar.gz
- Size: 1.14 MB
- Format: Unknown data format
- Description: textreuse R package v0.1.2
License bundle
- Name: license.txt
- Size: 1.63 KB
- Format: Item-specific license agreed upon to submission