textreuse: Detect Text Reuse and Document Similarity

dc.contributor.author: Mullen, Lincoln
dc.date.accessioned: 2016-03-01T16:42:04Z
dc.date.available: 2016-03-01T16:42:04Z
dc.date.issued: 2015-11-05
dc.description.abstract: This R package provides a set of functions for measuring similarity among documents and detecting passages that have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is broadly useful for tasks such as detecting duplicate documents in a corpus prior to text analysis or identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the NLP and tm packages. (However, this package has no dependency on Java, which should make it easier to install.)
dc.description.sponsorship: rOpenSci
dc.identifier.citation: Lincoln Mullen (2015). textreuse: Detect Text Reuse and Document Similarity. R package version 0.1.2. https://github.com/ropensci/textreuse
dc.identifier.doi: http://dx.doi.org/10.13021/G80W2B
dc.identifier.uri: https://hdl.handle.net/1920/10077
dc.publisher: rOpenSci
dc.relation.isversionof: https://github.com/ropensci/textreuse
dc.relation.isversionof: https://cran.r-project.org/package=textreuse
dc.subject: Textreuse
dc.subject: Text reuse
dc.subject: Document similarity
dc.subject: R
dc.subject: Jaccard similarity
dc.subject: Minhash
dc.subject: Locality sensitive hashing
dc.subject: Smith-Waterman local sequence alignment
dc.subject: Natural language processing
dc.title: textreuse: Detect Text Reuse and Document Similarity
dc.type: Software
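
The abstract names shingled n-gram tokenization and Jaccard similarity as two of the package's building blocks. As a rough illustration of the idea only, the following Python sketch tokenizes two short texts into overlapping word trigrams and compares the resulting sets. It is not the package's R API: the function names `shingle_ngrams` and `jaccard_similarity`, the regular-expression word splitting, and the lowercasing step are all assumptions made for this example.

```python
import re


def shingle_ngrams(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in `text`.

    Hypothetical helper: words are lowercased and split on non-word
    characters, which is an assumption, not the package's tokenizer.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "The quick brown fox leaps over the lazy dog"

# Each nine-word text yields seven trigrams; four of them are shared
# and ten are distinct overall, so the score is 4 / 10.
score = jaccard_similarity(shingle_ngrams(doc_a), shingle_ngrams(doc_b))
print(round(score, 3))  # 0.4
```

Changing a single word alters every trigram that contains it, which is why a one-word edit in a nine-word text lowers the score to 0.4. Techniques such as minhash and locality sensitive hashing, also listed in the abstract, are generally used to approximate this comparison when a corpus is too large for exhaustive pairwise scoring.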

Files

Original bundle
Name:
textreuse_0.1.2.tar.gz
Size:
1.14 MB
Format:
Unknown data format
Description:
textreuse R package v0.1.2
License bundle
Name:
license.txt
Size:
1.63 KB
Format:
Item-specific license agreed upon to submission
Description: